Frequently Asked Questions
Any tips for running VASP workflows on Theta?
Navigate to the vasp-users repository on Github for some guidelines and code snippets for managing large ensembles of VASP jobs on Theta.
Why isn't the launcher running my jobs?
Check the log for how many workers the launcher is assigning jobs to. It may be that a long-running job is hogging more nodes than you think, and there aren't enough idle nodes to run any jobs. If the launcher has a workflow filter set, be sure that the workflow matches the jobs you expect to run.
Where does the output of my jobs go?
Look in the data/ subdirectory of your Balsam database directory . The jobs will be organized into subfolders according to the name of their workflow, and each job working directory is in turn given a unique name from its name and UUID.
All stdout/stderr from a job is directed into the file {jobname}.out
,
along with job timing information. Any files created by the job will be
placed in its working directory, unless another location is specified
explicitly.
How can I move the output of my jobs to an external location?
This is easy to do with the "stage out" feature of BalsamJobs. You
need to specify two fields, stage_out_url
and stage_out_files
,
either from the balsam job
command line interface or as arguments to
dag.create_job()
or dag.spawn_child()
.
stage_out_url:
Set this field to the location where you want files to go. Balsam supports a
number of protocols for remote and local transfers (scp, GridFTP, etc...). If
you just want the files to move to another directory in the same file system,
use the local
protocol like this:
stage_out_url="local:/path/to/my/destination"
stage_out_files: This is a whitespace-separated list of shell file-patterns, for example:
stage_out_url='result.out *.log simulation*.dat'
Any file matching any of the patterns in this field will get copied
to the stage_out_url
.
How can I control the way an application runs in my workflow?
There are several optional fields that can be set for each BalsamJob. These fields can be set at run-time, during the dynamic creation of jobs, which gives a lot of flexibility in the way an application is run.
- args: Command-line arguments passed to the application
- environ_vars: Environment variables to be set for the duration of the application execution
- input_files: Which files are "staged-in" from the working directories of parent
jobs. This follows the same shell file-pattern format as the
stage_out_files
field mentioned above. It is intended to facilitate data-flow from parent to child jobs in a DAG, without resorting to stage-out functionality. - preprocess and postprocess: You can override the default pre- and post-processing scripts which run before and after the application is executed. (The default processing scripts are defined alongside the application).
I want my program to wait on the completion of a job it created.
If you need to wait for a job to finish, you can set up a polling function like the following:
from balsam.launcher import dag
import time
def poll_until_state(job, state, timeout_sec=60.0, delay=5.0):
start = time.time()
while time.time() - start < timeout_sec:
time.sleep(delay)
job.refresh_from_db()
if job.state == state:
return True
return False
Then you can check for any state with a specified maximum waiting time and delay. For finished jobs, you can do:
newjob = dag.add_job( ... )
success = poll_until_state(newjob, 'JOB_FINISHED')
There is a convenience function for reading files in a job's working directory:
if success:
output = newjob.read_file_in_workdir(‘output.dat’) # contents of file in a string
Querying the Job database
You can perform complex queries on the BalsamJob database thanks to
Django. If you ever need to filter the jobs according to some criteria,
the entire database is available via dag.BalsamJob
See the official Django documentation
for lots of examples, which directly apply wherever you can replace
Entry
with BalsamJob
. For example, say you want to filter for all
jobs containing "simulation" in their name, but exclude jobs that are
already finished:
from balsam.launcher import dag
BalsamJob = dag.BalsamJob
pending_simulations = BalsamJob.objects.filter(name__contains=“simulation").exclude(state=“JOB_FINISHED”)
You could count this query:
num_pending = pending_simulations.count()
Or iterate over the pending jobs and kill them:
for sim in pending_simulations:
dag.kill(sim)
Useful command lines
Create a dependency between two jobs:
$ balsam dep <parent> <child> # where <parent>, <child> are the first few characters of job ID
$ balsam ls --tree # see a tree view showing the dependencies between jobs
Reset a failed job state after some changes were made:
$ balsam modify jobs b0e --attr state --value CREATED # where b0e is the first few characters of the job id
See the state history of your jobs and any error messages that were recorded while the job ran:
$ balsam ls --hist | less
Remove all jobs with substring "task"
$ balsam rm jobs --name task
Useful Python scripts
You can use the balsam.launcher.dag
API to automate a lot of tasks
that might be tedious from the command line. For example, say you want
to delete all jobs that contain "master" in their name, but reset
all jobs that start with "task" to the "CREATED" state, so they may
run again:
import balsam.launcher.dag as dag
dag.BalsamJob.objects.filter(name__contains="master").delete()
for job in dag.BalsamJob.objects.filter(name__startswith="task"):
job.update_state("CREATED")