XSVWriter

class streaming.XSVWriter(*, columns, separator, newline='\n', out, keep_local=False, compression=None, hashes=None, size_limit=67108864, **kwargs)

Writes a streaming XSV dataset.

Parameters
  • columns (Dict[str, str]) – Sample columns.

  • separator (str) – String used to separate columns.

  • newline (str) – Newline character inserted between samples. Defaults to '\n'.

  • out (str | Tuple[str, str]) –

    Output dataset directory to save shard files.

    1. If out is a local directory, shard files are saved locally.

    2. If out is a remote directory, shard files are first cached in a local temporary directory and then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded.

    3. If out is a tuple of (local_dir, remote_dir), shard files are saved in local_dir and also uploaded to remote_dir.

  • keep_local (bool) – If the dataset is uploaded, whether to keep the local dataset directory or remove it after uploading. Defaults to False.

  • compression (str, optional) – Optional compression or compression:level. Defaults to None.

  • hashes (List[str], optional) – Optional list of hash algorithms to apply to shard files. Defaults to None.

  • size_limit (Union[int, str], optional) – Optional shard size limit, after which a new shard is started. If None, everything is written to one shard. Can also be given as a human-readable string, for example "100kb" for 100 kilobytes (100 * 1024 bytes). Defaults to 1 << 26 (67108864 bytes).

  • **kwargs (Any) –

    Additional settings for the Writer.

    progress_bar (bool): Display TQDM progress bars for uploading output dataset files to a remote location. Defaults to False.

    max_workers (int): Maximum number of threads used to upload output dataset files in parallel to a remote location. One thread is responsible for uploading one shard file to a remote location. Defaults to min(32, (os.cpu_count() or 1) + 4).

    exist_ok (bool): If the local directory exists and is not empty, whether to overwrite the content or raise an error. False raises an error. True deletes the content and starts fresh. Defaults to False.
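The size_limit and max_workers defaults described above can be illustrated with a small standard-library sketch. The helpers parse_size_limit and default_max_workers are hypothetical names introduced here for illustration; they are not part of the streaming API, and the library's own parsing may accept different units or spellings.

```python
import os
import re

# Hypothetical unit table (binary multiples), assumed for illustration only.
_UNITS = {'b': 1, 'kb': 1024, 'mb': 1024**2, 'gb': 1024**3}


def parse_size_limit(size_limit):
    """Convert an int or a human-readable string such as "100kb" to bytes."""
    if size_limit is None:
        return None  # No limit: everything goes into one shard.
    if isinstance(size_limit, int):
        return size_limit  # Already a byte count.
    match = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*', size_limit.lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f'Unrecognized size limit: {size_limit!r}')
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def default_max_workers():
    """Evaluate the documented default for max_workers on this machine."""
    return min(32, (os.cpu_count() or 1) + 4)


print(parse_size_limit('100kb'))  # 100 * 1024 bytes
print(parse_size_limit(1 << 26))  # the documented default, in bytes
print(default_max_workers())
```

Under these assumptions, "64mb" and 1 << 26 describe the same limit, so either form would start a new shard at the same point.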

encode_sample(sample)

Encode a sample dict to bytes.

Parameters

sample (Dict[str, Any]) – Sample dict.

Returns

bytes – Sample encoded as bytes.
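As a rough illustration of what encoding a sample dict to bytes can look like for a separator-delimited format, the sketch below joins the column values with the separator and appends the newline. The function encode_xsv_sample and its column_names argument are hypothetical; the actual method also applies the per-column encodings given in columns, which this sketch replaces with a plain str() conversion.

```python
from typing import Any, Dict, List


def encode_xsv_sample(sample: Dict[str, Any], column_names: List[str],
                      separator: str, newline: str = '\n') -> bytes:
    """Join one sample's column values into a single encoded line."""
    values = []
    for name in column_names:
        text = str(sample[name])  # Assumption: plain str() stands in for the column encoding.
        # A value containing the separator or newline would corrupt the row layout.
        if separator in text or newline in text:
            raise ValueError(f'Column {name!r} contains the separator or newline')
        values.append(text)
    return (separator.join(values) + newline).encode('utf-8')


line = encode_xsv_sample({'id': 7, 'label': 'cat'}, ['id', 'label'], ',')
print(line)  # b'7,cat\n'
```

Rejecting values that contain the separator or newline keeps every sample on exactly one line with a fixed number of fields, which is what lets a reader split rows without a quoting scheme.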

encode_split_shard()

Encode a split shard out of the cached samples (data, meta files).

Returns

Tuple[bytes, bytes] – Data file, meta file.
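One way to picture a split shard is a data file holding the concatenated sample bytes and a meta file holding the byte offsets needed to locate each sample. The sketch below assumes that layout purely for illustration: build_split_shard is a hypothetical helper, and the little-endian uint32 packing is an assumption, not the library's documented on-disk format.

```python
import struct
from typing import List, Tuple


def build_split_shard(encoded_samples: List[bytes],
                      header: bytes = b'') -> Tuple[bytes, bytes]:
    """Return (data, meta) for a list of already-encoded samples."""
    data = header + b''.join(encoded_samples)

    # offsets[i] is where sample i starts; the final entry marks the end of the data.
    offsets = [len(header)]
    for encoded in encoded_samples:
        offsets.append(offsets[-1] + len(encoded))

    # Assumed meta layout: sample count followed by the offsets, all uint32.
    meta = struct.pack(f'<I{len(offsets)}I', len(encoded_samples), *offsets)
    return data, meta


data, meta = build_split_shard([b'7,cat\n', b'8,dog\n'])
count, *offsets = struct.unpack('<4I', meta)
print(data[offsets[1]:offsets[2]])  # bytes of the second sample
```

Keeping offsets in a separate meta file means a reader can fetch a single sample by slicing the data file, without scanning for newlines.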

get_config()

Get object describing shard-writing configuration.

Returns

Dict[str, Any] – JSON object.