Streaming#

StreamingDataset makes training on large datasets from cloud storage as fast, cheap, and scalable as possible. It’s specially designed for multi-node, distributed training of large models—maximizing correctness guarantees, performance, flexibility, and ease of use. Now, you can efficiently train anywhere, independent of where your dataset lives. Just train on the data you need, right when you need it.

StreamingDataset is compatible with any data type, including images, text, video, and multimodal data. It supports major cloud storage providers (AWS, OCI, GCS, Azure, Databricks UC Volumes, and any S3-compatible object store such as Cloudflare R2, CoreWeave, Backblaze B2, etc.) and is designed as a drop-in replacement for your PyTorch IterableDataset class, so it integrates seamlessly into your existing training workflows.

from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Stream samples straight from cloud storage with a drop-in IterableDataset replacement.
dataloader = DataLoader(dataset=StreamingDataset(remote='s3://...', batch_size=1))

💾 Installation#

  1. Set up your Python development environment.

  2. Install Streaming with pip:

pip install mosaicml-streaming

  3. Verify the installation with:

python -c "import streaming; print(streaming.__version__)"

  4. Jump to our Quick Start and Main Concepts guides.

🔑 Key Features#

  • Elastic Determinism: Samples are in the same order regardless of the number of GPUs, nodes, or CPU workers. This makes it simple to reproduce and debug training runs and loss spikes. You can load a checkpoint trained on 64 GPUs and debug on 8 GPUs with complete reproducibility. Read more here.

  • Instant Mid-Epoch Resumption: Resume training in seconds, not hours, in the middle of a long training run. Minimizing resumption latency saves thousands of dollars in egress fees and idle GPU compute time compared to existing solutions. Read more here.

  • High throughput: Our MDS format cuts extraneous work to the bone, resulting in ultra-low sample retrieval latency and higher throughput compared to alternatives (see the conversion sketch after this list).

  • Effective Shuffling: Model convergence using StreamingDataset is just as good as using local disk, thanks to our specialized shuffling algorithms. StreamingDataset’s shuffling reduces egress costs, preserves shuffle quality, and runs efficiently, whereas alternative solutions force tradeoffs between these factors.

  • Random access: Access samples right when you need them – simply call dataset[i] to get sample i. You can also fetch data on the fly by providing NumPy-style indexing to StreamingDataset.

  • Flexible data mixing: During streaming, different data sources are shuffled and mixed seamlessly just-in-time. Control how datasets are combined using our batching and sampling methods (see the mixing sketch below this list).

  • Disk usage limits: Dynamically delete the least recently used shards to keep disk usage under a specified limit. Read more here.

  • Parallelism-aware: Easily train with data parallelism, sequence parallelism, or tensor parallelism – the right samples end up on the right GPUs, using sample replication.
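
To use the MDS format, convert your raw samples into MDS shards and upload the resulting directory to cloud storage. The snippet below is a minimal sketch using MDSWriter; the column names, sample values, and output path are placeholders for illustration.

from streaming import MDSWriter

# Each sample is a dict; declare the column names and encodings up front.
columns = {'text': 'str', 'label': 'int'}

# Write shards to a local directory (or a remote path) with optional compression.
with MDSWriter(out='/tmp/my_mds_dataset', columns=columns, compression='zstd') as out:
    for i in range(1000):
        out.write({'text': f'sample {i}', 'label': i % 2})

Point the remote argument of StreamingDataset at the uploaded shard directory to stream these samples during training.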

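For data mixing, each source is wrapped in a Stream and the streams are passed to StreamingDataset. The snippet below is a rough sketch that also enables shuffling and a disk cache limit; the bucket paths and proportions are hypothetical, and the exact keyword values are worth double-checking against the API reference.

from streaming import Stream, StreamingDataset

# Each Stream points at one source; `proportion` controls its share of the mix.
streams = [
    Stream(remote='s3://my-bucket/web-text', proportion=0.7),
    Stream(remote='s3://my-bucket/code', proportion=0.3),
]

dataset = StreamingDataset(
    streams=streams,
    batch_size=1,
    shuffle=True,          # shard-aware shuffling that limits shard downloads
    cache_limit='50gb',    # evict least recently used shards past this size
)
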
Community#

Streaming is part of the broader ML/AI community, and we welcome any contributions, pull requests, and issues.

If you have any questions, please feel free to reach out to us on Twitter, Email, or Slack!