get_shuffle_py1e

streaming.base.shuffle.get_shuffle_py1e(shard_sizes, num_canonical_nodes, seed, epoch, block_size=262144)

Get the shuffled global ordering of samples for an epoch.

The assignment of shards to nodes is fixed across epochs, but each grouping of shards is processed concurrently in a different order by each node’s workers each epoch.
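The idea of a fixed shard-to-node assignment combined with a per-epoch reordering can be illustrated with a small plain-Python sketch. This is not the library's implementation: the helpers `assign_shards` and `epoch_shard_order`, the contiguous split, and the seed value are all assumptions made for illustration.

```python
import random


def assign_shards(num_shards, num_canonical_nodes):
    """Hypothetical helper: split shard indices into contiguous groups.

    The result depends only on the shard count and the node count,
    so it is identical in every epoch.
    """
    bounds = [i * num_shards // num_canonical_nodes
              for i in range(num_canonical_nodes + 1)]
    return [list(range(bounds[i], bounds[i + 1]))
            for i in range(num_canonical_nodes)]


def epoch_shard_order(groups, seed, epoch):
    """Hypothetical helper: reorder the shards inside each group.

    The generator is seeded with seed + epoch, so the order is
    reproducible for a given epoch.
    """
    rng = random.Random(seed + epoch)
    ordered = []
    for group in groups:
        shuffled = list(group)  # copy, so the fixed assignment is untouched
        rng.shuffle(shuffled)
        ordered.append(shuffled)
    return ordered


groups = assign_shards(num_shards=8, num_canonical_nodes=2)
epoch0 = epoch_shard_order(groups, seed=9176, epoch=0)
epoch1 = epoch_shard_order(groups, seed=9176, epoch=1)
```

In this sketch each node always owns the same shards (`groups` never changes), while `epoch0` and `epoch1` only differ in the order in which those shards are visited.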

Parameters
  • shard_sizes (NDArray[np.int64]) – Number of samples contained in each shard, in order.

  • num_canonical_nodes (int) – Number of canonical nodes.

  • seed (int) – Base random seed, which is held constant over an entire training run.

  • epoch (int) – Current epoch, which is added to the seed to get a different deterministic shuffle each epoch.

  • block_size (int) – Unit of shuffle, used to set the standard deviation and clip length of the Gaussian noise added to each shard. Defaults to 1 << 18 (262144).
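To make the roles of seed, epoch, and block_size more concrete, here is a rough sketch of a noise-based shuffle in plain Python. It is an illustration under stated assumptions, not the function's actual algorithm: the helper `noisy_order` is hypothetical, it works on a flat range of positions rather than on shards, and the choices of `block_size / 2` for the standard deviation and `block_size` for the clip length are made up for the example.

```python
import random


def noisy_order(num_samples, block_size, seed, epoch):
    """Hypothetical helper: shuffle positions by sorting on position + noise."""
    rng = random.Random(seed + epoch)  # different but deterministic per epoch
    std = block_size / 2               # assumed relation, for illustration only
    clip = block_size                  # assumed relation, for illustration only
    keys = []
    for position in range(num_samples):
        noise = rng.gauss(0.0, std)
        noise = max(-clip, min(clip, noise))  # clip the noise to [-clip, clip]
        keys.append(position + noise)
    # Sorting positions by their noisy keys yields the shuffled order.
    return sorted(range(num_samples), key=keys.__getitem__)


order = noisy_order(num_samples=100, block_size=4, seed=9176, epoch=0)
```

Because the noise is clipped, a sample in this sketch can move at most `2 * block_size` positions from where it started, so a smaller block_size gives a more local shuffle and a larger one mixes samples over a wider range.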

Returns

NDArray[np.int64] – 1:1 mapping of sample ID to shuffled sample ID.
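
A 1:1 mapping can be sanity-checked by confirming that every sample ID appears exactly once. The sketch below assumes the returned array has one entry per sample, so its length equals the sum of shard_sizes. The helper `is_permutation` is hypothetical and the `mapping` list is hand-written sample data, not output from the library.

```python
def is_permutation(mapping):
    """Hypothetical helper: True if mapping contains each of 0..n-1 exactly once."""
    n = len(mapping)
    seen = [False] * n
    for sample_id in mapping:
        if not 0 <= sample_id < n or seen[sample_id]:
            return False  # out of range, or a duplicate ID
        seen[sample_id] = True
    return True


shard_sizes = [3, 2]       # two shards holding 3 and 2 samples
mapping = [4, 0, 3, 1, 2]  # hand-written stand-in for a returned mapping
valid = is_permutation(mapping) and len(mapping) == sum(shard_sizes)
```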