Data Transfers

Background

Each Balsam Job may require data to be staged in prior to execution or staged out after execution. A core feature of Balsam is its interface to services such as Globus Transfer: the Site automatically submits and monitors batched transfer tasks between endpoints.

This enables distributed workflows in which large numbers of Jobs with relatively small datasets are submitted in real time: the Site manages the details of efficient batch transfers and marks individual Jobs as STAGED_IN as the requisite data arrives.

To use this functionality, the first step is to define the Transfer Slots for a given Balsam App. We can then submit Jobs with transfer items that fill the required transfer slots.

Be sure to read the Transfer Slots and Transfer Items sections of the user guide for more information. The only other requirement is to configure the transfer plugin at the Balsam Site and authenticate with Globus, as explained below.
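The two pieces fit together roughly as follows. This is a minimal, self-contained sketch of the shapes involved; the names transfer_slots, input_data, and the my_laptop alias are illustrative stand-ins, not Balsam's verbatim API, so consult the user guide sections above for the exact App and Job syntax.

```python
# Sketch of a transfer-slot declaration for an App (illustrative names):
# each slot maps a name to a direction and a local path inside the Job workdir.
transfer_slots = {
    "input_data": {"required": True, "direction": "in", "local_path": "input.dat"},
    "results": {"required": False, "direction": "out", "local_path": "out.json"},
}

# A Job then fills each required slot with a "location_alias:/absolute/path"
# string, where the alias must be a trusted location configured at the Site.
job_transfers = {"input_data": "my_laptop:/home/user/data/input.dat"}

# The Site resolves the alias and remote path when batching transfer tasks.
alias, remote_path = job_transfers["input_data"].split(":", 1)
```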

Configuring Transfers

When using the Globus transfer interface, Balsam needs an access token to communicate with the Globus Transfer API. You may already have an access token stored from a Globus CLI installation on your machine: run globus whoami to check whether this is the case. Otherwise, Balsam ships with the necessary tooling, and you can follow the same Globus authentication flow by running:

$ balsam site globus-login

Next, we configure the transfers section of settings.yml:

  • transfer_locations should be set to a dictionary of trusted location aliases. If you need to add Globus endpoints, they can be inserted here.
  • globus_endpoint_id should refer to the endpoint ID of the local Site. On public subscriber endpoints at large HPC facilities, this value will be set correctly in the default Balsam site configuration.
  • max_concurrent_transfers determines the maximum number of in-flight transfer tasks, where each task manages a batch of files for many Jobs.
  • transfer_batch_size determines the maximum number of transfer items per transfer task. This should be tuned to your workload: a larger batch helps utilize available bandwidth when transferring many small files.
  • num_items_query_limit determines the maximum number of transfer items considered in any single transfer task submission.
  • service_period determines the interval (in seconds) between transfer task submissions.
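Putting the options above together, a transfers section of settings.yml might look like the following sketch. The endpoint IDs and the laptop alias are placeholders, and the numeric values are assumptions to be tuned for your workload:

```yaml
transfers:
  # Trusted location aliases that Jobs may reference in their transfer items:
  transfer_locations:
    laptop: globus://aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee  # placeholder ID
  # Globus endpoint ID of this Site (preset on many HPC facility defaults):
  globus_endpoint_id: ffffffff-1111-2222-3333-444444444444  # placeholder ID
  max_concurrent_transfers: 4    # in-flight transfer tasks
  transfer_batch_size: 100       # transfer items per task
  num_items_query_limit: 2000    # items considered per submission
  service_period: 5              # seconds between submissions
```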

Once settings.yml has been configured appropriately, be sure to restart the Balsam Site:

$ balsam site sync

The Site will immediately start issuing stage-in and stage-out tasks and advancing Jobs as needed. We can track the state of transfers using the Python API:

from balsam.api import TransferItem

# List all inbound transfer items that are currently in flight:
for item in TransferItem.objects.filter(direction="in", state="active"):
    print(f"File {item.remote_path} is currently staging in via task ID: {item.task_id}")