Frequently Asked Questions
Any tips for running VASP workflows on Theta?
See the vasp-users repository on GitHub for guidelines and code snippets for managing large ensembles of VASP jobs on Theta.
Why isn't the launcher running my jobs?
Check the log for how many workers the launcher is assigning jobs to. It may be that a long-running job is hogging more nodes than you think, and there aren't enough idle nodes to run any jobs. If the launcher has a workflow filter set, be sure that the workflow matches the jobs you expect to run.
Where does the output of my jobs go?
Look in the data/ subdirectory of your Balsam database directory. The jobs are organized into subfolders according to the name of their workflow, and each job's working directory is in turn given a unique name built from the job's name and UUID.
All stdout/stderr from a job is directed into a single output file,
along with job timing information. Any files created by the job will be
placed in its working directory, unless another location is specified.
How can I move the output of my jobs to an external location?
This is easy to do with the "stage out" feature of BalsamJobs. You
need to specify two fields, stage_out_url and stage_out_files,
either from the
balsam job command line interface or as arguments to dag.add_job().
stage_out_url: Set this field to the location where you want files to go. Balsam supports a number of protocols for remote and local transfers (scp, GridFTP, etc.). If you just want the files to move to another directory in the same file system, use the
local protocol.
stage_out_files: This is a whitespace-separated list of shell file-patterns, for example:
    stage_out_files='result.out *.log simulation*.dat'
Any file matching any of the patterns in this field will get copied
to the location given by stage_out_url.
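To get a feel for which files a given stage_out_files value would select, the pattern semantics can be sketched in plain Python. This is only an illustration: it assumes the patterns follow ordinary shell globbing rules as implemented by the standard fnmatch module, it is not Balsam's own matching code, and select_stage_out is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def select_stage_out(filenames, stage_out_files):
    """Return the names matching any whitespace-separated shell pattern.

    Hypothetical helper for illustration; assumes fnmatch-style globbing.
    """
    patterns = stage_out_files.split()
    return [
        name for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Made-up working directory contents, for illustration only.
workdir_files = ["result.out", "run.log", "simulation_01.dat", "input.cfg"]
selected = select_stage_out(workdir_files, "result.out *.log simulation*.dat")
print(selected)
```

A file is selected as soon as it matches one pattern, so listing an exact name next to wildcards, as in the example above, is harmless.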
I want my program to wait on the completion of a job it created
If you need to wait for a job to finish, you can set up a polling function like the following:
    from balsam.launcher import dag
    import time

    def poll_until_state(job, state, timeout_sec=60.0, delay=5.0):
        start = time.time()
        while time.time() - start < timeout_sec:
            time.sleep(delay)
            job.refresh_from_db()
            if job.state == state:
                return True
        return False
Then you can check for any state with a specified maximum waiting time and delay. For finished jobs, you can do:
    newjob = dag.add_job( ... )
    success = poll_until_state(newjob, 'JOB_FINISHED')
There is a convenience function for reading files in a job's working directory:
    if success:
        output = newjob.read_file_in_workdir('output.dat')  # contents of file in a string
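The polling logic can be tried out without a Balsam database by substituting a stand-in object for the job. The sketch below is a test harness built on assumptions: FakeJob is a hypothetical class that mimics only the two things the poller touches (a state attribute and a refresh_from_db() method), and the polling function is repeated from above so the example runs on its own, with much shorter delays.

```python
import time


def poll_until_state(job, state, timeout_sec=60.0, delay=5.0):
    """Same polling loop as above: sleep, refresh, compare."""
    start = time.time()
    while time.time() - start < timeout_sec:
        time.sleep(delay)
        job.refresh_from_db()
        if job.state == state:
            return True
    return False


class FakeJob:
    """Hypothetical stand-in for a BalsamJob; finishes after N refreshes."""

    def __init__(self, refreshes_until_done):
        self.state = "RUNNING"  # placeholder starting state
        self._remaining = refreshes_until_done

    def refresh_from_db(self):
        # Pretend the database row changed while the poller was sleeping.
        self._remaining -= 1
        if self._remaining <= 0:
            self.state = "JOB_FINISHED"


# Reaches the target state on the third refresh, well inside the timeout.
finished = poll_until_state(FakeJob(3), "JOB_FINISHED", timeout_sec=2.0, delay=0.01)
# Never gets there before the (deliberately tiny) timeout expires.
timed_out = poll_until_state(FakeJob(10**6), "JOB_FINISHED", timeout_sec=0.05, delay=0.01)
print(finished, timed_out)
```

Because the loop sleeps before its first check, even a job that is already in the target state costs one full delay, which is worth keeping in mind when choosing the delay value.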
Querying the Job database
You can perform complex queries on the BalsamJob database thanks to
Django. If you ever need to filter the jobs according to some criteria,
the entire database is available via dag.BalsamJob.objects.
See the official Django documentation
for lots of examples, which apply directly once you substitute
BalsamJob for the model class used there. For example, say you want to filter for all
jobs containing "simulation" in their name, but exclude jobs that are
already finished:
    from balsam.launcher import dag
    BalsamJob = dag.BalsamJob
    pending_simulations = BalsamJob.objects.filter(name__contains="simulation").exclude(state="JOB_FINISHED")
You could count this query:
num_pending = pending_simulations.count()
Or iterate over the pending jobs and kill them:
    for sim in pending_simulations:
        dag.kill(sim)
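For readers less familiar with Django lookups, the pending_simulations query above selects the same jobs as the plain-Python filter below. This is only a sketch of the semantics on in-memory stand-in records (the Job tuple and the sample names are made up for illustration); the real query runs in the database and returns a lazy QuerySet rather than a list.

```python
from collections import namedtuple

# Hypothetical stand-in for rows of the job table; not the BalsamJob model.
Job = namedtuple("Job", ["name", "state"])

jobs = [
    Job("simulation_a", "JOB_FINISHED"),
    Job("simulation_b", "RUNNING"),
    Job("pre_simulation", "CREATED"),
    Job("analysis", "CREATED"),
]

# filter(name__contains="simulation"): substring match on the name
#   (case-sensitive on most database backends)
# exclude(state="JOB_FINISHED"): drop rows whose state equals that value
pending_simulations = [
    job for job in jobs
    if "simulation" in job.name and job.state != "JOB_FINISHED"
]

num_pending = len(pending_simulations)  # counterpart of .count()
print(num_pending, [job.name for job in pending_simulations])
```

Note that name__contains matches anywhere in the name, so "pre_simulation" is selected too; name__startswith would be the lookup to use if only a prefix should match.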
You can use the
balsam.launcher.dag API to automate a lot of tasks
that might be tedious from the command line. For example, say you want
to delete all jobs that contain "master" in the name, but reset
all jobs whose names start with "task" to the "RESTART_READY" state, so they may
run again:
    from balsam.launcher.dag import BalsamJob
    BalsamJob.objects.filter(name__contains="master").delete()
    for job in BalsamJob.objects.filter(name__startswith="task"):
        job.update_state("RESTART_READY")
Useful command lines
Create a dependency between two jobs:
    balsam dep <parent> <child>   # where <parent>, <child> are the first few characters of job ID
    balsam ls --tree              # see a tree view showing the dependencies between jobs
Reset a failed job state after some changes were made:
balsam modify jobs b0e --attr state --value CREATED # where b0e is the first few characters of the job id
See the state history of your jobs and any error messages that were recorded while the job ran:
balsam ls --hist | less
Remove all jobs with the substring "task" in their name:
balsam rm jobs --name task