C4

class streaming.text.C4(tokenizer_name, max_seq_len, group_method, local, remote=None, split=None, shuffle=True, prefetch=100000, keep_zip=None, retry=2, timeout=60, hash=None, batch_size=None)

Implementation of the C4 (Colossal Clean Crawled Corpus) dataset using the streaming Dataset class.

Parameters
  • tokenizer_name (str) – The name of the HuggingFace tokenizer to use to tokenize samples.

  • max_seq_len (int) – The max sequence length of each token sample.

  • group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported.

  • local (str) – Local filesystem directory where dataset is cached during operation.

  • remote (str, optional) – Remote directory (S3 or local filesystem) where dataset is stored. Defaults to None.

  • split (str, optional) – The dataset split to use, either 'train' or 'val'. Defaults to None.

  • shuffle (bool) – Whether to iterate over the samples in randomized order. Defaults to True.

  • prefetch (int, optional) – Target number of samples remaining to prefetch while iterating. Defaults to 100_000.

  • keep_zip (bool, optional) – Whether to keep or delete the compressed file when decompressing downloaded shards. If None, the compressed file is kept only if the remote is a local path. Defaults to None.

  • retry (int) – Number of download re-attempts before giving up. Defaults to 2.

  • timeout (float) – Number of seconds to wait for a shard to download before raising an exception. Defaults to 60.

  • hash (str, optional) – Hash or checksum algorithm to use to validate shards. Defaults to None.

  • batch_size (int, optional) – Hint of the batch size that will be used on each device’s DataLoader. Defaults to None.
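
Example

The following is a minimal usage sketch, not taken from the library docs: the tokenizer name, remote bucket, and local cache path are placeholders, and it assumes the C4 data has already been converted into streaming shards at the remote location.

    from torch.utils.data import DataLoader

    from streaming.text import C4

    # Placeholder paths: point `remote` at your own copy of the C4 shards
    # (S3 or local filesystem) and `local` at a scratch cache directory.
    dataset = C4(
        tokenizer_name='bert-base-uncased',
        max_seq_len=512,
        group_method='truncate',  # the only grouping method currently supported
        local='/tmp/c4-cache',
        remote='s3://my-bucket/c4/',
        split='train',
        shuffle=True,
        batch_size=8,  # hint: match the per-device DataLoader batch size
    )

    # The dataset streams shards into `local` as needed and yields tokenized
    # samples, so it plugs into a standard PyTorch DataLoader.
    loader = DataLoader(dataset, batch_size=8)
    for batch in loader:
        # `batch` holds token tensors of length `max_seq_len`.
        break

The batch_size passed to the dataset is only a hint; it should match the per-device batch size given to the DataLoader.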