merge_index

streaming.base.util.merge_index(*args, **kwargs)

Merge the index.json files of individual partitions into a single global index.json.
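To make the merge step concrete, here is a minimal standard-library sketch of what combining partition indexes could look like. It is not the library's implementation. It assumes each index.json is a JSON object with a "version" field and a "shards" list, and that each shard stores its file name under a "basename" key inside "raw_data" or "zip_data"; these layout details, and the helper names merge_partition_indexes and demo, are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def merge_partition_indexes(index_paths: List[str], out_dir: str) -> Dict[str, Any]:
    """Combine per-partition index.json files into one index.json in out_dir."""
    shards: List[Dict[str, Any]] = []
    for path in index_paths:
        # The partition name is the directory that holds this index.json.
        partition = os.path.basename(os.path.dirname(path))
        with open(path, "r", encoding="utf-8") as f:
            index = json.load(f)
        for shard in index["shards"]:
            # Assumed layout: shard file names are relative to the partition,
            # so prefix them with the partition directory in the global index.
            for key in ("raw_data", "zip_data"):
                if shard.get(key):
                    shard[key]["basename"] = os.path.join(
                        partition, shard[key]["basename"])
            shards.append(shard)

    merged = {"version": 2, "shards": shards}
    with open(os.path.join(out_dir, "index.json"), "w", encoding="utf-8") as f:
        json.dump(merged, f)
    return merged


def demo() -> Dict[str, Any]:
    """Build two toy partitions in a temporary directory and merge them."""
    with tempfile.TemporaryDirectory() as root:
        index_paths = []
        for name, samples in (("part-0", 3), ("part-1", 5)):
            part_dir = os.path.join(root, name)
            os.makedirs(part_dir)
            index_path = os.path.join(part_dir, "index.json")
            toy_index = {
                "version": 2,
                "shards": [{
                    "raw_data": {"basename": "shard.00000.mds"},
                    "zip_data": None,
                    "samples": samples,
                }],
            }
            with open(index_path, "w", encoding="utf-8") as f:
                json.dump(toy_index, f)
            index_paths.append(index_path)
        return merge_partition_indexes(index_paths, root)
```

Calling demo() returns a dictionary whose "shards" list holds one entry per toy partition, each with its basename prefixed by the partition directory.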

merge_index can be called in either of two forms:

merge_index(index_file_urls, out, keep_local, download_timeout)

merge_index(out, keep_local, download_timeout)

The first signature takes a list of index file URLs of MDS partitions. The second takes the root of an MDS dataset and parses the partition folders from there.
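The two call forms might be used as in the sketch below. The wrapper functions and the partition directory names are hypothetical. The first argument is passed positionally because the documented signature is (*args, **kwargs); passing keep_local and download_timeout by keyword assumes the implementation accepts the documented parameter names as keywords. The import is done lazily so that the path helper can be used without the library installed.

```python
import os
from typing import List


def partition_index_paths(root: str, partitions: List[str]) -> List[str]:
    """Build the index.json path of each partition directory under a local root."""
    return [os.path.join(root, name, "index.json") for name in partitions]


def merge_from_list(root: str, partitions: List[str]) -> None:
    """First form: pass an explicit list of partition index files, then `out`."""
    from streaming.base.util import merge_index  # requires mosaicml-streaming

    index_files = partition_index_paths(root, partitions)
    merge_index(index_files, root, keep_local=True, download_timeout=60)


def merge_from_root(root: str) -> None:
    """Second form: pass only the dataset root; partitions are found from there."""
    from streaming.base.util import merge_index  # requires mosaicml-streaming

    merge_index(root, keep_local=True, download_timeout=60)
```

For example, merge_from_list("/data/mds", ["part-0", "part-1"]) would merge /data/mds/part-0/index.json and /data/mds/part-1/index.json, while merge_from_root("/data/mds") leaves the discovery of partitions to merge_index.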

Parameters
  • index_file_urls (List[Union[str, Tuple[str,str]]]) –

    The index.json files of all the partitions. Each element is either a single path string or a (local, remote) tuple of strings.

    1. If index_file_urls is a list of local paths, the files are merged locally without any download.

    2. If index_file_urls is a list of (local, remote) tuples, any index.json that is missing locally is downloaded from its remote URL before merging.

    3. If index_file_urls is a list of remote URLs, all files are downloaded and then merged.

  • out (Union[str, Tuple[str,str]]) –

    The folder that contains the MDS partitions and in which the merged index file is written.

    1. A local directory: the merge happens locally.

    2. A remote directory: the index.json of every sub-directory is downloaded, merged locally, and the result is uploaded.

    3. A tuple (local_dir, remote_dir): each index.json is looked up locally first and downloaded only if it is missing.

  • keep_local (bool) – Whether to keep a local copy of the merged index file. Defaults to True.

  • download_timeout (int) – The maximum time allowed for downloading each JSON file. Defaults to 60.
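The three cases listed for index_file_urls and out can be summarized with a small classification sketch. It illustrates the documented behavior and is not the library's own logic: the helper names are hypothetical, and treating any string with a URL scheme other than file as remote is an assumption of this example.

```python
from typing import Tuple, Union
from urllib.parse import urlparse

PathOrPair = Union[str, Tuple[str, str]]


def is_remote(path: str) -> bool:
    """Assumption: a URL scheme other than file:// marks a remote location.

    Windows drive-letter paths would need extra handling and are ignored here.
    """
    return urlparse(path).scheme not in ("", "file")


def describe_index_entry(entry: PathOrPair) -> str:
    """Name the documented case that an index_file_urls element falls under."""
    if isinstance(entry, tuple):
        return "download from remote only if the local index.json is missing"
    if is_remote(entry):
        return "download, then merge"
    return "merge locally without download"


def describe_out(out: PathOrPair) -> str:
    """Name the documented case that the out argument falls under."""
    if isinstance(out, tuple):
        return "use local index.json files, download the missing ones"
    if is_remote(out):
        return "download all index.json files, merge locally, upload"
    return "merge locally"
```

For instance, describe_index_entry("s3://bucket/part-0/index.json") falls under the remote case, while describe_out(("/tmp/mds", "s3://bucket/mds")) falls under the tuple case.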