Frequently Asked Questions

Any tips for running VASP workflows on Theta?

See the vasp-users repository on GitHub for guidelines and code snippets for managing large ensembles of VASP jobs on Theta.

Why isn't the launcher running my jobs?

Check the launcher log to see how many workers the launcher is assigning jobs to. A long-running job may be occupying more nodes than you expect, leaving too few idle nodes to start any other jobs. If the launcher has a workflow filter set, make sure the filter matches the workflow of the jobs you expect to run.

Where does the output of my jobs go?

Look in the data/ subdirectory of your Balsam database directory. The jobs are organized into subfolders according to the name of their workflow, and each job's working directory is given a unique name built from the job's name and UUID.

All stdout/stderr from a job is directed into the file {jobname}.out, along with job timing information. Any files created by the job will be placed in its working directory, unless another location is specified explicitly.
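
To collect these output files programmatically, a small helper along the following lines may be enough. This is only a sketch: the function name find_job_outputs is hypothetical, and the directory layout (data/, then a workflow folder, then the job working directory, then {jobname}.out) is assumed from the description above rather than checked against Balsam's source.

```python
from pathlib import Path


def find_job_outputs(db_dir, workflow="*", jobname="*"):
    """Return sorted paths of {jobname}.out files under <db_dir>/data.

    Assumed layout (taken from the FAQ text, not from Balsam itself):
        <db_dir>/data/<workflow>/<job working directory>/<jobname>.out
    """
    data_dir = Path(db_dir) / "data"
    # One glob level per directory level: workflow / working dir / output file.
    return sorted(data_dir.glob(f"{workflow}/*/{jobname}.out"))
```

Calling find_job_outputs(db_dir, workflow="my_workflow") would then list the .out files for a single workflow, where "my_workflow" stands in for one of your own workflow names.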

How can I move the output of my jobs to an external location?

This is easy to do with the "stage out" feature of BalsamJobs. You need to specify two fields, stage_out_url and stage_out_files, either from the balsam job command line interface or as arguments to dag.create_job() or dag.spawn_child().

stage_out_url: Set this field to the location where you want the files to go. Balsam supports a number of protocols for remote and local transfers (scp, GridFTP, etc.). If you just want to move the files to another directory on the same file system, use the local protocol like this:

    stage_out_url="local:/path/to/my/destination"

stage_out_files: This is a whitespace-separated list of shell file-patterns, for example:

    stage_out_files='result.out *.log simulation*.dat'

Any file matching any of the patterns in this field will get copied to the stage_out_url.
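
To preview which files a given pattern list would select, you can mimic the matching with Python's standard fnmatch module. This is an illustration of whitespace-separated shell patterns only; the helper files_to_stage_out and the example filenames are hypothetical, and Balsam's own matching may differ in detail.

```python
from fnmatch import fnmatchcase


def files_to_stage_out(filenames, stage_out_files):
    """Return the filenames matching any whitespace-separated shell pattern."""
    patterns = stage_out_files.split()
    return [
        name for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical contents of a job working directory.
workdir_files = ["result.out", "run.log", "simulation_01.dat", "input.json"]
selected = files_to_stage_out(workdir_files, "result.out *.log simulation*.dat")
print(selected)  # ['result.out', 'run.log', 'simulation_01.dat']
```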

I want my program to wait on the completion of a job it created

If you need to wait for a job to finish, you can set up a polling function like the following:

from balsam.launcher import dag
import time

def poll_until_state(job, state, timeout_sec=60.0, delay=5.0):
    start = time.time()
    while time.time() - start < timeout_sec:
        time.sleep(delay)
        job.refresh_from_db()
        if job.state == state:
            return True
    return False

Then you can check for any state with a specified maximum waiting time and delay. For finished jobs, you can do:

newjob = dag.add_job( ... )
success = poll_until_state(newjob, 'JOB_FINISHED')

There is a convenience function for reading files in a job's working directory:

if success:
    output = newjob.read_file_in_workdir('output.dat')  # contents of the file as a string
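
One caveat with polling for a single state is that a job which fails never reaches 'JOB_FINISHED', so the caller waits for the full timeout. A possible variant, sketched below, accepts a set of states and returns whichever one was reached. The names poll_until_any_state and FakeJob are hypothetical, FakeJob only stands in for a real job so the sketch runs without a database, and the 'FAILED' state name is an assumption you should check against the states your Balsam version reports.

```python
import time


def poll_until_any_state(job, states, timeout_sec=60.0, delay=5.0):
    """Poll until job.state is one of `states`.

    Returns the matching state, or None if the timeout expires first.
    """
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        job.refresh_from_db()
        if job.state in states:
            return job.state
        time.sleep(delay)
    return None


class FakeJob:
    """Hypothetical stand-in for a job object; steps through scripted states."""

    def __init__(self, states):
        self._states = list(states)
        self.state = self._states[0]

    def refresh_from_db(self):
        # Advance to the next scripted state on each refresh, then stay put.
        if len(self._states) > 1:
            self._states.pop(0)
        self.state = self._states[0]


job = FakeJob(["CREATED", "RUNNING", "FAILED"])
final = poll_until_any_state(
    job, {"JOB_FINISHED", "FAILED"}, timeout_sec=2.0, delay=0.01
)
print(final)  # FAILED
```

With a real job you would pass the object returned by dag.add_job( ... ) in place of FakeJob and branch on the returned state.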

Querying the Job database

You can perform complex queries on the BalsamJob database thanks to Django. If you ever need to filter the jobs according to some criteria, the entire database is available via dag.BalsamJob.

See the official Django documentation for lots of examples, which directly apply wherever you can replace Entry with BalsamJob. For example, say you want to filter for all jobs containing "simulation" in their name, but exclude jobs that are already finished:

from balsam.launcher import dag
BalsamJob = dag.BalsamJob
pending_simulations = BalsamJob.objects.filter(name__contains="simulation").exclude(state="JOB_FINISHED")

You could count this query:

num_pending = pending_simulations.count()

Or iterate over the pending jobs and kill them:

for sim in pending_simulations:
    dag.kill(sim)

You can use the balsam.launcher.dag API to automate a lot of tasks that might be tedious from the command line. For example, say you want to delete all jobs that contain "master" in the name, but reset all jobs that start with "task" to the "RESTART_READY" state, so they may run again:

from balsam.launcher.dag import BalsamJob

BalsamJob.objects.filter(name__contains="master").delete()

for job in BalsamJob.objects.filter(name__startswith="task"):
    job.update_state("RESTART_READY")

Useful command lines

Create a dependency between two jobs:

balsam dep <parent> <child> # where <parent>, <child> are the first few characters of job ID

balsam ls --tree # see a tree view showing the dependencies between jobs

Reset a failed job state after some changes were made:

balsam modify jobs b0e --attr state --value CREATED # where b0e is the first few characters of the job id

See the state history of your jobs and any error messages that were recorded while the job ran:

balsam ls --hist | less

Remove all jobs with substring "task" in their name:

balsam rm jobs --name task