LocalDataset

class streaming.LocalDataset(local, split=None)

A streaming dataset whose shards already reside on local disk, exposed as a PyTorch Dataset.

Parameters:
  • local (str) – Local dataset directory where shards are cached by split.

  • split (str, optional) – Which dataset split to use, if any. Defaults to None.

get_item(sample_id)

Get sample by global sample ID.

Parameters:

sample_id (int) – Sample ID.

Returns:

Dict[str, Any] – Mapping from column name to that column's data for the sample.
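
As a minimal sketch of how the returned mapping might be consumed, the helper below checks a sample ID against size and then keeps only selected columns. SampleSource, InMemorySource, fetch_columns, and the column names are hypothetical and not part of the streaming library; the range check assumes sample IDs run from 0 to size - 1, which this page does not state explicitly.

```python
from typing import Any, Dict, List, Protocol


class SampleSource(Protocol):
    """Hypothetical protocol covering only the two members documented here."""

    @property
    def size(self) -> int: ...

    def get_item(self, sample_id: int) -> Dict[str, Any]: ...


class InMemorySource:
    """Stand-in for illustration only; not part of the streaming library."""

    def __init__(self, samples: List[Dict[str, Any]]) -> None:
        self._samples = samples

    @property
    def size(self) -> int:
        return len(self._samples)

    def get_item(self, sample_id: int) -> Dict[str, Any]:
        return self._samples[sample_id]


def fetch_columns(
    source: SampleSource, sample_id: int, columns: List[str]
) -> Dict[str, Any]:
    """Return only the requested columns of one sample, checking the ID first."""
    # Assumption: valid IDs are 0 .. size - 1.
    if not 0 <= sample_id < source.size:
        raise IndexError(f"sample_id {sample_id} out of range [0, {source.size})")
    sample = source.get_item(sample_id)
    return {name: sample[name] for name in columns}


# Demo data with made-up column names.
demo = InMemorySource([
    {"image": b"\x00\x01", "label": 3},
    {"image": b"\x02\x03", "label": 7},
])
picked = fetch_columns(demo, 1, ["label"])
```

Any object with the same size property and get_item method, presumably including a LocalDataset instance, could be passed as source in place of the stand-in.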

property size

Get the size of the dataset in samples.

Returns:

int – Number of samples.
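
One possible use of size is to plan fixed-size groups of sample IDs before reading them with get_item. The sketch below is an illustration only: id_batches is a hypothetical helper, not a library function, and it assumes sample IDs are the integers 0 to size - 1.

```python
from typing import Iterator, List


def id_batches(num_samples: int, batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of sample IDs covering [0, num_samples)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_samples, batch_size):
        # The last batch may be shorter than batch_size.
        yield list(range(start, min(start + batch_size, num_samples)))


# With a real dataset one might write, under the assumptions above:
#   for batch in id_batches(dataset.size, 32):
#       samples = [dataset.get_item(i) for i in batch]
batches = list(id_batches(num_samples=5, batch_size=2))
```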