Configure Cloud Storage Credentials#

Streaming dataset supports the following cloud storage providers, which let you stream your data directly to your instance.

Amazon S3#

For an S3 bucket with public access, no additional setup is required; simply specify the S3 URI of the resource.
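For example, a public dataset can be loaded by passing its S3 URI as the remote. A minimal sketch, assuming a hypothetical bucket path and local cache directory:

from streaming import StreamingDataset

# Hypothetical public bucket path and local cache directory.
dataset = StreamingDataset(
    remote='s3://my-public-bucket/path/to/dataset',
    local='/tmp/dataset-cache',
)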

MosaicML platform#

For MosaicML platform users, follow the steps in the AWS S3 MCLI documentation page to configure the cloud provider credentials.

Others#

First, make sure awscli is installed, and then run aws configure to create the config and credentials files:

python -m pip install awscli
aws configure

Note

The requested credentials can be retrieved through your AWS console, typically under “Command line or programmatic access”.

Your config and credentials files should follow the standard structure output by aws configure:

~/.aws/config

[default]
region=<your region, e.g. us-west-2>
output=json

~/.aws/credentials

[default]
aws_access_key_id=<key ID>
aws_secret_access_key=<application key>

More details about authentication can be found here.

Alternatively, credentials can be set through environment variables.

import os
os.environ["AWS_ACCESS_KEY_ID"] = '<key ID>'
os.environ["AWS_SECRET_ACCESS_KEY"] = '<application key>'
os.environ["AWS_DEFAULT_REGION"] = '<your region, e.g. us-west-2>'

Requester Pays Bucket#

If the bucket you are accessing is a Requester Pays bucket, set the environment variable below to the bucket name. If there is more than one Requester Pays bucket, provide each name separated by a comma.

import os
os.environ['MOSAICML_STREAMING_AWS_REQUESTER_PAYS'] = 'streaming-bucket'

# For more than one requester pays bucket
os.environ['MOSAICML_STREAMING_AWS_REQUESTER_PAYS'] = 'streaming-bucket,another-bucket'

Canned ACL#

Canned ACLs (Access Control Lists) are predefined sets of permissions in AWS S3 that you can apply to your objects, simplifying access management for your S3 resources. Examples of canned ACLs include private, public-read, public-read-write, and authenticated-read. You can set a canned ACL for your S3 objects with the S3_CANNED_ACL environment variable.

import os
os.environ['S3_CANNED_ACL'] = 'authenticated-read'

Any S3 compatible object store#

For any S3-compatible object store such as Cloudflare R2, CoreWeave, or Backblaze B2, set up your credentials as described in the Amazon S3 section above. You may use the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variable names to specify your credentials, even though you are not using AWS. The only difference is that you must set your object store's endpoint URL, which is done through the S3_ENDPOINT_URL environment variable.

Below are examples of setting an R2 or Backblaze endpoint URL in your run environment.

Note

R2: Your endpoint URL is https://<accountid>.r2.cloudflarestorage.com. The account ID can be retrieved through your Cloudflare console.

Backblaze: Your endpoint URL is https://s3.<your region>.backblazeb2.com. The region can be retrieved through your Backblaze console.

import os
# If using R2
os.environ['S3_ENDPOINT_URL'] = 'https://<accountid>.r2.cloudflarestorage.com'
# If using Backblaze
os.environ['S3_ENDPOINT_URL'] = 'https://s3.<your region>.backblazeb2.com'

Note that even with S3-compatible object stores, URLs should be of the form s3://<bucket name>/<path within the bucket> and use the s3:// path prefix, instead of <endpoint url>/<bucket name>/<path within the bucket>.
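For example, when streaming from an R2 bucket, the endpoint URL is set through the environment while the remote keeps the s3:// form. A minimal sketch, assuming a hypothetical account ID, bucket, and local cache directory:

import os
from streaming import StreamingDataset

# Hypothetical R2 endpoint; the remote still uses the s3:// prefix.
os.environ['S3_ENDPOINT_URL'] = 'https://<accountid>.r2.cloudflarestorage.com'
dataset = StreamingDataset(
    remote='s3://my-r2-bucket/path/to/dataset',
    local='/tmp/dataset-cache',
)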

Google Cloud Storage#

MosaicML platform#

For MosaicML platform users, follow the steps in the Google Cloud Storage MCLI documentation page to configure the cloud provider credentials.

GCP User Auth Credentials Mounted as Environment Variables#

Streaming dataset supports GCP user credentials or HMAC keys for a user account. Users must set their GCP user access key and GCP user access secret in the run environment.

From the Google Cloud console, navigate to Google Storage > Settings (Left vertical pane) > Interoperability > Service account HMAC > User account HMAC > Access keys for your user account > Create a key.

import os
os.environ['GCS_KEY'] = 'EXAMPLEFODNN7EXAMPLE'
os.environ['GCS_SECRET'] = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
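With the HMAC key pair set, a GCS dataset is referenced by its gs:// URI. A minimal sketch, assuming a hypothetical bucket path and local cache directory:

from streaming import StreamingDataset

# Hypothetical GCS bucket path; GCS_KEY and GCS_SECRET must already be set.
dataset = StreamingDataset(
    remote='gs://my-gcs-bucket/path/to/dataset',
    local='/tmp/dataset-cache',
)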

GCP Application Default Credentials#

Streaming dataset supports the use of Application Default Credentials (ADC) to authenticate you with Google Cloud. When no HMAC keys are given (see above), it attempts to authenticate using ADC, which checks, in order:

  1. a key-file whose path is given in the GOOGLE_APPLICATION_CREDENTIALS environment variable.

  2. a key-file in the Google cloud configuration directory.

  3. the Google App Engine credentials.

  4. the GCE Metadata Service credentials.

See the Google Cloud Docs for more details.
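To check which credentials ADC resolves in your environment, here is a minimal sketch using the google-auth package (assuming it is installed, as it is alongside the Google Cloud client libraries):

import google.auth

# Resolve ADC and report the credential type and project it found.
credentials, project = google.auth.default()
print(type(credentials).__name__, project)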

To explicitly use GOOGLE_APPLICATION_CREDENTIALS (point 1 above), set the environment variable to point to your credentials file in the run environment.

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'KEY_FILE'

Oracle Cloud Storage#

MosaicML platform#

For MosaicML platform users, follow the steps in the Oracle Cloud Storage MCLI documentation page to configure the cloud provider credentials.

Others#

To set up OCI SSH keys and SDK, please read the Oracle Cloud Infrastructure documentation here.

Specifically:

  1. To generate the required keys and OCIDs, follow the instructions here.

  2. To get the SDK/CLI configuration files, follow the link here.

A sample config file (~/.oci/config) would look like this:

[DEFAULT]
user=ocid1.user.oc1..<unique_ID>
fingerprint=<your_fingerprint>
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..<unique_ID>
region=us-ashburn-1

The key file (~/.oci/oci_api_key.pem) is a PEM file that looks like a typical RSA private key file. The streaming dataset authenticates by reading ~/.oci/config and ~/.oci/oci_api_key.pem.
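To sanity-check that the config and key file are readable, here is a minimal sketch using the OCI Python SDK (assuming the oci package is installed):

import oci

# Load the DEFAULT profile from ~/.oci/config and validate its fields.
config = oci.config.from_file()
oci.config.validate_config(config)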

Azure Blob Storage and Azure DataLake#

If you wish to create a new storage account, you can use the Azure Portal, Azure PowerShell, or Azure CLI:

# Create a new resource group to hold the storage account -
# if using an existing resource group, skip this step
az group create --name my-resource-group --location westus2

# Create the storage account
az storage account create -n my-storage-account-name -g my-resource-group

Users must set their Azure account name and Azure account access key in the run environment.

The account access key can be found in the Azure Portal under the "Access Keys" section or by running the following Azure CLI command:

az storage account keys list -g MyResourceGroup -n MyStorageAccount

import os
os.environ['AZURE_ACCOUNT_NAME'] = 'test'
os.environ['AZURE_ACCOUNT_ACCESS_KEY'] = 'NN1KHxKKkj20ZO92EMiDQjx3wp2kZG4UUvfAGlgGWRn6sPRmGY/TEST/Dri+ExAmPlEExAmPlExA+ExAmPlExA=='
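To verify that the account name and access key grant access, here is a minimal sketch using the azure-storage-blob package (assuming it is installed):

import os
from azure.storage.blob import BlobServiceClient

# Connect with the account key and list the containers it can see.
service = BlobServiceClient(
    account_url=f"https://{os.environ['AZURE_ACCOUNT_NAME']}.blob.core.windows.net",
    credential=os.environ['AZURE_ACCOUNT_ACCESS_KEY'],
)
print([container.name for container in service.list_containers()])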

Databricks#

To authenticate Databricks access for both Unity Catalog and Databricks File System (DBFS), users must set their Databricks host (DATABRICKS_HOST) and access token (DATABRICKS_TOKEN) in the run environment.

See the Databricks documentation for instructions on how to create a personal access token.

MosaicML platform#

For MosaicML platform users, follow the steps in the Databricks MCLI documentation page to configure the credentials.

Others#

import os
os.environ['DATABRICKS_HOST'] = 'hostname'
os.environ['DATABRICKS_TOKEN'] = 'token key'
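With the host and token set, a Unity Catalog volume or DBFS path can be used as the remote. A minimal sketch, assuming a hypothetical volume path of the dbfs:/Volumes/... form and a local cache directory:

from streaming import StreamingDataset

# Hypothetical Unity Catalog volume path; DATABRICKS_HOST and DATABRICKS_TOKEN
# must already be set in the environment.
dataset = StreamingDataset(
    remote='dbfs:/Volumes/<catalog>/<schema>/<volume>/path/to/dataset',
    local='/tmp/dataset-cache',
)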