Streaming

Welcome to MosaicML’s Streaming documentation page! Streaming is a PyTorch-compatible dataset that lets users stream training data from cloud-based object stores or read it from local disk. As a drop-in replacement for your Dataset class, it’s easy to get streaming:

dataloader = torch.utils.data.DataLoader(dataset=ImageStreamingDataset(remote='s3://...'))

For additional details, please see our Quick Start and User Guide.

Streaming was originally developed as a part of MosaicML’s Composer training library and is a critical component of our efficient machine learning infrastructure.

Installation

pip install mosaicml-streaming

Key Benefits

  • High-performance, accurate streaming of training data from cloud storage

  • Efficiently train anywhere, independent of training data location

  • Cloud-native, no persistent storage required

  • Enhanced data security: data exists only ephemerally on the training cluster

Features

  • Drop-in replacement for torch.utils.data.IterableDataset class.

  • Built-in support for popular open source datasets (e.g., ADE20K, C4, COCO, Enwiki, ImageNet, etc.).

  • Support for various image, structured and unstructured text formats.

  • Helper utilities to convert proprietary datasets to streaming format.

  • Streaming dataset compression (e.g., gzip, snappy, zstd, bz2, etc.).

  • Streaming dataset integrity (e.g., SHA2, SHA3, MD5, xxHash, etc.).
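
To make the compression and integrity features above more concrete, here is a minimal sketch of the general idea: samples are packed into a shard, the shard is compressed, and a digest is recorded so a reader can verify the bytes before using them. This is not Streaming's actual implementation or on-disk format. The helper names `pack_shard` and `unpack_shard` and the length-prefixed layout are assumptions made for illustration, and only Python's standard `gzip` and `hashlib` modules are used.

```python
import gzip
import hashlib


def pack_shard(samples: list[bytes]) -> tuple[bytes, str]:
    """Hypothetical helper: compress samples into one shard, return (blob, digest)."""
    # Length-prefix each sample so the shard can be split again after decompression.
    raw = b"".join(len(s).to_bytes(4, "little") + s for s in samples)
    blob = gzip.compress(raw)
    # Hash the compressed bytes, since those are what would be downloaded.
    return blob, hashlib.sha3_256(blob).hexdigest()


def unpack_shard(blob: bytes, expected_digest: str) -> list[bytes]:
    """Hypothetical helper: verify the digest, then decompress and split the shard."""
    actual = hashlib.sha3_256(blob).hexdigest()
    if actual != expected_digest:
        raise ValueError("shard failed integrity check")
    raw = gzip.decompress(blob)
    samples, offset = [], 0
    while offset < len(raw):
        size = int.from_bytes(raw[offset:offset + 4], "little")
        offset += 4
        samples.append(raw[offset:offset + size])
        offset += size
    return samples


if __name__ == "__main__":
    blob, digest = pack_shard([b"first sample", b"second sample"])
    print(unpack_shard(blob, digest))
```

Checking the digest before decompressing means a corrupted or truncated download is rejected early, rather than surfacing later as a confusing decode error during training.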

Community

Streaming is part of the broader Machine Learning community, and we welcome any contributions, pull requests, and issues.

If you have any questions, please feel free to reach out to us on Twitter, Email, or Slack!