Reader

class streaming.base.format.Reader(dirname, split, compression, hashes, samples, size_limit)

Provides random access to the samples of a shard.

Parameters
  • dirname (str) – Local dataset directory.

  • split (str, optional) – Which dataset split to use, if any.

  • compression (str, optional) – Optional compression or compression:level.

  • hashes (List[str]) – Optional list of hash algorithms to apply to shard files.

  • samples (int) – Number of samples in this shard.

  • size_limit (Union[int, str], optional) – Optional shard size limit, after which point to start a new shard. If None, puts everything in one shard. Can also be given as a human-readable string, for example "100kb" for 100 kilobytes (100*1024 bytes).
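
As an illustration of the human-readable form of size_limit, the sketch below converts values such as "100kb" to a byte count using 1024-based units, matching the 100*1024 example above. The helper name parse_size_limit and the set of accepted suffixes are assumptions made for this example; they are not taken from the library.

```python
import re
from typing import Optional, Union

# Hypothetical unit table for this sketch: 1024-based multipliers.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size_limit(size_limit: Optional[Union[int, str]]) -> Optional[int]:
    """Convert a size limit such as 100 or "100kb" to a number of bytes."""
    if size_limit is None:
        return None  # No limit: everything goes in one shard.
    if isinstance(size_limit, int):
        return size_limit  # Already a byte count.
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]+)?\s*", size_limit.lower())
    if match is None:
        raise ValueError(f"Unrecognized size: {size_limit!r}")
    number, unit = match.groups()
    unit = unit or "b"  # A bare number string is treated as bytes.
    if unit not in _UNITS:
        raise ValueError(f"Unrecognized unit in size: {size_limit!r}")
    return int(float(number) * _UNITS[unit])


print(parse_size_limit("100kb"))  # 102400
```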

abstract decode_sample(data)

Decode a sample dict from bytes.

Parameters

data (bytes) – The sample encoded as bytes.

Returns

Dict[str, Any] – Sample dict.

evict()

Remove all files belonging to this shard.

Returns

int – Bytes evicted from cache.

get_item(idx)

Get the sample at the index.

Parameters

idx (int) – Sample index.

Returns

Dict[str, Any] – Sample dict.

get_max_size()

Get the full size of this shard.

“Max” in this case means both the raw (decompressed) and zip (compressed) versions are resident (assuming the shard has a zip form). This is the maximum disk usage the shard can reach. When compression was used, even if keep_zip is False, the zip form must still be resident at the same time as the raw form during shard decompression.

Returns

int – Size in bytes.

get_persistent_size(keep_zip)

Get the persistent size of this shard.

“Persistent” in this case means that whether both the raw and zip forms are counted depends on keep_zip. If we are not keeping zip files after decompression, they don’t count toward the shard’s persistent size on disk.

Parameters

keep_zip (bool) – Whether to keep zip files after decompressing.

Returns

int – Size in bytes.
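
The difference between the maximum and persistent sizes can be sketched as plain arithmetic over the raw and zip sizes. The two functions below are standalone illustrations written for this page, not the library's implementation; they assume the zip size is None when the shard has no compressed form.

```python
from typing import Optional


def max_size(raw_size: int, zip_size: Optional[int]) -> int:
    """Peak disk usage: raw and zip forms resident at the same time."""
    return raw_size + (zip_size or 0)


def persistent_size(raw_size: int, zip_size: Optional[int], keep_zip: bool) -> int:
    """Steady-state disk usage once decompression has finished."""
    if zip_size is not None and keep_zip:
        return raw_size + zip_size  # Zip file is kept alongside the raw file.
    return raw_size  # Zip file is absent or dropped after decompression.


# A shard with 1000 raw bytes and a 300-byte compressed form.
print(max_size(1000, 300))                 # 1300
print(persistent_size(1000, 300, False))   # 1000
print(persistent_size(1000, 300, True))    # 1300
```

In this sketch the maximum size never depends on keep_zip, which mirrors the point above that the zip form must be resident during decompression either way.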

get_raw_size()

Get the raw (uncompressed) size of this shard.

Returns

int – Size in bytes.

abstract get_sample_data(idx)

Get the raw sample data at the index.

Parameters

idx (int) – Sample index.

Returns

bytes – Sample data.
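
The two abstract methods, get_sample_data and decode_sample, are the pieces a concrete shard format has to supply. The sketch below shows one plausible way they fit together with get_item. It uses a stand-in base class (SketchReader) rather than the real Reader, and the composition inside get_item is an assumption based on the method descriptions on this page. JSONLinesReader is a hypothetical format invented for the example.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchReader(ABC):
    """Stand-in for the documented interface; not the real Reader class."""

    @abstractmethod
    def get_sample_data(self, idx: int) -> bytes:
        """Return the raw bytes of the sample at the index."""

    @abstractmethod
    def decode_sample(self, data: bytes) -> Dict[str, Any]:
        """Decode a sample dict from bytes."""

    def get_item(self, idx: int) -> Dict[str, Any]:
        # Assumed composition: fetch the raw bytes, then decode them.
        return self.decode_sample(self.get_sample_data(idx))


class JSONLinesReader(SketchReader):
    """Hypothetical format: one JSON object per sample, held in memory."""

    def __init__(self, lines: List[bytes]) -> None:
        self.lines = lines

    def get_sample_data(self, idx: int) -> bytes:
        return self.lines[idx]

    def decode_sample(self, data: bytes) -> Dict[str, Any]:
        return json.loads(data)


reader = JSONLinesReader([b'{"x": 1}', b'{"x": 2}'])
print(reader.get_item(1))  # {'x': 2}
```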

get_zip_size()

Get the zip (compressed) size of this shard, if compression was used.

Returns

Optional[int] – Size in bytes, or None if the shard has no zip form.

set_up_local(listing, safe_keep_zip)

Bring whichever shard files are present to a consistent state, returning whether the shard is present.

Parameters
  • listing (Set[str]) – The listing of all files under dirname/[split/]. This is listed once and then saved because there could potentially be very many shard files.

  • safe_keep_zip (bool) – Whether to keep zip files when decompressing. Possible when compression was used. Necessary when local is the remote or there is no remote.

Returns

bool – Whether the shard is present.

property size

Get the number of samples in this shard.

Returns

int – Sample count.

validate(allow_unsafe_types)

Check whether this shard is acceptable to be part of some Stream.

Parameters

allow_unsafe_types (bool) – Whether to accept a shard that contains Pickle, which allows arbitrary code execution during deserialization. If True, validation keeps going; if False, an error is raised.
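
A check of this kind might look like the sketch below. It assumes the shard exposes a list of per-column encoding names and that Pickle columns are marked "pkl"; the function name, the encoding names, and the exception type are all assumptions made for illustration, not the library's documented behavior.

```python
from typing import List

# Assumed marker for Pickle-encoded columns in this sketch.
UNSAFE_ENCODINGS = {"pkl"}


def validate_encodings(column_encodings: List[str], allow_unsafe_types: bool) -> None:
    """Raise if an unsafe encoding is present and unsafe types are not allowed."""
    if allow_unsafe_types:
        return  # Caller has explicitly accepted the risk.
    for column_id, encoding in enumerate(column_encodings):
        if encoding in UNSAFE_ENCODINGS:
            raise ValueError(
                f"Column {column_id} uses unsafe encoding {encoding!r}; "
                "pass allow_unsafe_types=True to accept it."
            )


validate_encodings(["str", "int"], allow_unsafe_types=False)  # Passes silently.
validate_encodings(["str", "pkl"], allow_unsafe_types=True)   # Passes silently.
```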