MDSWriter
- class streaming.MDSWriter(*, columns, out, keep_local=False, compression=None, hashes=None, size_limit=67108864, **kwargs)
Writes a streaming MDS dataset.
- Parameters
columns (Dict[str, str]) – Sample columns, given as a mapping from column name to encoding.
out (str | Tuple[str, str]) – Output dataset directory to save shard files.
1. If out is a local directory, shard files are saved locally.
2. If out is a remote directory, a local temporary directory is created to cache the shard files, which are then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded.
3. If out is a tuple of (local_dir, remote_dir), shard files are saved in local_dir and also uploaded to the remote location.
keep_local (bool) – If the dataset is uploaded, whether to keep the local dataset directory or remove it after uploading. Defaults to False.
compression (str, optional) – Optional compression or compression:level. Defaults to None.
hashes (List[str], optional) – Optional list of hash algorithms to apply to shard files. Defaults to None.
size_limit (int, optional) – Optional shard size limit, after which point to start a new shard. If None, puts everything in one shard. Defaults to 1 << 26.
**kwargs (Any) – Additional settings for the Writer.
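The parameters above can be put together as in the following minimal sketch. It is an illustration, not a definitive recipe: the column encodings 'int' and 'str', the 'zstd' and 'sha1' values, the context-manager usage, and the write() method are assumptions that are not documented in this section, and write_dataset, columns, and samples are hypothetical names.

```python
# Column name -> encoding. The encoding names 'int' and 'str' are assumed here.
columns = {'id': 'int', 'text': 'str'}

# Illustrative samples; each dict has exactly the keys declared in `columns`.
samples = [{'id': i, 'text': f'sample {i}'} for i in range(100)]


def write_dataset(out_dir: str) -> None:
    """Write `samples` to `out_dir` as MDS shards (hypothetical helper)."""
    # Imported inside the function so the data above can be inspected
    # even when the streaming package is not installed.
    from streaming import MDSWriter

    # Assumption: the writer works as a context manager and exposes write().
    with MDSWriter(
        columns=columns,
        out=out_dir,           # local directory, so shards are saved locally
        compression='zstd',    # assumed algorithm name; 'zstd:7' would add a level
        hashes=['sha1'],       # assumed hash algorithm name
        size_limit=1 << 26,    # the documented default, 67108864
    ) as writer:
        for sample in samples:
            writer.write(sample)
```

Calling write_dataset with a local path exercises case 1 of out. Passing a remote path or a (local_dir, remote_dir) tuple instead would select the upload behaviors described for out and keep_local.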
- encode_joint_shard()
Encode a joint shard out of the cached samples (single file).
- Returns
bytes – File data.
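To make the idea of a joint shard concrete, the sketch below packs several already-serialized samples into one bytes object and reads one back. The layout (a sample count, a table of absolute offsets, then the sample data) is an assumption chosen for illustration, not the library's documented on-disk format, and encode_joint_shard_sketch and read_sample are hypothetical names.

```python
import struct
from typing import List


def encode_joint_shard_sketch(samples: List[bytes]) -> bytes:
    """Pack serialized samples into a single bytes object.

    Hypothetical layout: a uint32 sample count, then count + 1 uint32
    offsets measured from the start of the file, then the sample data.
    """
    count = len(samples)
    header_size = 4 + 4 * (count + 1)

    # offsets[i] is where sample i begins; offsets[count] is the end of data.
    offsets = [header_size]
    for sample in samples:
        offsets.append(offsets[-1] + len(sample))

    header = struct.pack(f'<I{count + 1}I', count, *offsets)
    return header + b''.join(samples)


def read_sample(shard: bytes, index: int) -> bytes:
    """Return sample `index` from a shard built by encode_joint_shard_sketch."""
    (count,) = struct.unpack_from('<I', shard, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    # Two adjacent offsets bound the sample's bytes.
    begin, end = struct.unpack_from('<2I', shard, 4 + 4 * index)
    return shard[begin:end]
```

Storing offsets up front lets a reader fetch any single sample with one slice, without scanning the samples before it.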