Streaming#
Welcome to MosaicML’s Streaming documentation page! Streaming is a PyTorch-compatible dataset that lets you stream training data from
cloud-based object stores or read it from local disk. As a drop-in replacement for your Dataset
class, it’s easy to get started:
dataloader = torch.utils.data.DataLoader(dataset=ImageStreamingDataset(remote='s3://...'))
For additional details, please see our Quick Start and User Guide.
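To make the streaming idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `ToyStreamingDataset`, `write_demo_shards`, the `shard.*.jsonl` naming, and the JSON-lines shard layout are all hypothetical, and a local directory stands in for the remote object store. The point it illustrates is that shards are copied into a local cache only when iteration first reaches them.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator


class ToyStreamingDataset:
    """Hypothetical illustration only; not part of the Streaming library."""

    def __init__(self, remote: str, local: str) -> None:
        self.remote = Path(remote)  # stand-in for an object store prefix
        self.local = Path(local)    # local cache directory
        self.local.mkdir(parents=True, exist_ok=True)
        # Shard names are listed up front; shard contents are fetched lazily.
        self.shards = sorted(p.name for p in self.remote.glob("shard.*.jsonl"))

    def _fetch(self, name: str) -> Path:
        """Copy a shard into the cache unless it is already there."""
        cached = self.local / name
        if not cached.exists():
            shutil.copyfile(self.remote / name, cached)
        return cached

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Walk the shards in order, downloading each one on first use.
        for name in self.shards:
            with self._fetch(name).open() as f:
                for line in f:
                    yield json.loads(line)


def write_demo_shards(root: Path, num_shards: int = 2, per_shard: int = 3) -> None:
    """Create a few small JSON-lines shards to act as the 'remote' data."""
    for s in range(num_shards):
        with (root / f"shard.{s:05d}.jsonl").open("w") as f:
            for i in range(per_shard):
                f.write(json.dumps({"id": s * per_shard + i}) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as remote, tempfile.TemporaryDirectory() as local:
        write_demo_shards(Path(remote))
        dataset = ToyStreamingDataset(remote=remote, local=local)
        print([sample["id"] for sample in dataset])
```

Because the class only defines `__iter__`, it follows the same iteration protocol an iterable-style dataset uses, which is what makes a drop-in replacement possible.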
Streaming was originally developed as a part of MosaicML’s Composer training library and is a critical component of our efficient machine learning infrastructure.
Installation#
pip install mosaicml-streaming
Key Benefits#
High-performance, accurate streaming of training data from cloud storage
Efficiently train anywhere, independent of training data location
Cloud-native, no persistent storage required
Enhanced data security: data exists only ephemerally on the training cluster
Features#
Drop-in replacement for the torch.utils.data.IterableDataset class.
Built-in support for popular open source datasets (e.g., ADE20K, C4, COCO, Enwiki, ImageNet).
Support for various image, structured and unstructured text formats.
Helper utilities to convert proprietary datasets to streaming format.
Streaming dataset compression (e.g., gzip, snappy, zstd, bz2).
Streaming dataset integrity checks via hashing (e.g., SHA2, SHA3, MD5, xxHash).
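The compression and integrity features listed above can be illustrated with Python's standard library alone. The sketch below is an illustration under stated assumptions, not Streaming's own shard format or API: `pack_shard` and `unpack_shard` are hypothetical helpers, gzip and bz2 stand in for the full codec list (snappy, zstd, and xxHash need third-party packages), and the digest is assumed to be computed over the uncompressed bytes.

```python
import bz2
import gzip
import hashlib
from typing import Callable, Dict, Tuple

# Codecs available in the standard library: name -> (compress, decompress).
COMPRESSORS: Dict[str, Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]] = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


def pack_shard(raw: bytes, codec: str = "gzip", algo: str = "sha256") -> Tuple[bytes, str]:
    """Hypothetical helper: compress a shard and record a digest of its raw bytes."""
    compress, _ = COMPRESSORS[codec]
    digest = hashlib.new(algo, raw).hexdigest()
    return compress(raw), digest


def unpack_shard(blob: bytes, expected: str, codec: str = "gzip", algo: str = "sha256") -> bytes:
    """Hypothetical helper: decompress a shard and verify it against the digest."""
    _, decompress = COMPRESSORS[codec]
    raw = decompress(blob)
    actual = hashlib.new(algo, raw).hexdigest()
    if actual != expected:
        raise ValueError(f"{algo} mismatch: expected {expected}, got {actual}")
    return raw


if __name__ == "__main__":
    raw = b"sample-0\nsample-1\nsample-2\n" * 100
    blob, digest = pack_shard(raw, codec="bz2", algo="sha3_256")
    restored = unpack_shard(blob, digest, codec="bz2", algo="sha3_256")
    print(f"{len(raw)} bytes -> {len(blob)} bytes, round trip ok: {restored == raw}")
```

Hashing the uncompressed bytes means the same digest stays valid if a shard is later recompressed with a different codec; hashing the compressed blob instead would catch corruption before spending time on decompression. Either choice is reasonable, and this sketch simply picks the first.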
Community#
Streaming is part of the broader machine learning community, and we welcome contributions, pull requests, and issues.
If you have any questions, please feel free to reach out to us on Twitter, by email, or on Slack!