The BioHPC param_runner (Parameter Runner CLI) is a command line tool that runs a command multiple times, exploring a defined parameter space and summarizing the results.
Note that a web-based interface is also available here, which is easier to use if you are not comfortable editing parameter files and working at the command line.
This tool uses a simple YAML configuration file to define a parameter space to exhaustively search, and runs tasks in parallel by distributing them over a set of allocated nodes on the BioHPC Nucleus cluster. Supported parameter spaces are integer ranges, real number ranges, and lists of string options.
The output of a command run with a certain parameter combination can be captured and summarized in tabular format by defining regular expressions that match against the command's output.
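For example, suppose each run prints a line such as "Accuracy: 0.95" (a hypothetical output line, not something produced by any real program here). A summary entry along these lines would capture the numeric value into an Accuracy column of the results table:
summary:
  - id: Accuracy
    regex: 'Accuracy: ([-+]?[0-9]*\.?[0-9]+)'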
module add param_runner
param_runner check myexperiment.yml
param_runner submit myexperiment.yml
A download containing an example parameter file and test program is available here.
When param_runner submit is run, it will submit a job to the cluster to perform the parameter optimization. The job number will be reported on the command line:
...
INFO Submitted batch job 290006
...
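Once the job number is reported, you can check the job's place in the queue with SLURM's standard squeue command, for example listing only your own jobs (your username is read from the USER environment variable):
squeue -u $USER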
You can continue to monitor the status of the job with squeue, or through the BioHPC web portal. When the job starts to run, it will create a log file in the directory it was launched from, named according to the job ID reported above:
param_runner_290006.out
You can examine this file with commands such as cat or less. Look for a line stating the output directory for the optimization:
...
- Trace file and output will be at: test/test_data/param_runner_20170111-12131484158381
...
Inside this directory you will find the master trace.txt file that summarizes the parameter exploration. There are also many xxx.out and xxx.err files, which contain the full standard output and standard error from each individual command run during the parameter exploration, e.g.:
...
100.err
100.out
101.err
101.out
102.err
102.out
103.err
103.out
104.err
104.out
105.err
105.out
...
The number in the filename corresponds to the index column in the master summary trace.txt.
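For example, to page through the full output of the run recorded at index 103, using the example output directory reported above:
less test/test_data/param_runner_20170111-12131484158381/103.out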
The parameter file that defines the parameter exploration to be run is written in a simple text-based format called YAML. A brief introduction to YAML can be found at the link below, but it should be easy to work from the example parameter file given later.
https://learnxinyminutes.com/docs/yaml/
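Only a small subset of YAML is needed for param_runner files: comments, key/value pairs, and lists. A minimal sketch of these constructs, using keys taken from the full example below:
# Lines starting with '#' are comments
command: train_ann --input train.set   # a key with a string value
nodes: 4                               # a key with a numeric value
modules:                               # a key whose value is a list of items
  - matlab/2013a
  - python/2.7.x-anaconda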
Whenever you write a parameter file, be sure to run param_runner check on it before you attempt to submit it to the cluster. The check will list any problems with the parameter file that you should correct before submitting.
The example below includes documentation of each parameter, and can be used as a starting point for your own parameter files.
# The command to run, including any arguments that always stay the same
# and will not be explored by the runner.
# e.g. in this example we run an imaginary program called `train_ann`,
# which could be a program to train an artificial neural network. We
# want to run it with various different training parameters, but the
# input and crossk arguments are always the same. This example is
# typical of a machine learning experiment.
command: train_ann --input train.set --crossk 10
# An optional working directory. param_runner will cd to this directory
# before beginning to run commands.
# work_dir: /tmp/param_work_test
# If summary entries are defined here we will look at the standard
# output each time the command is run, and try to extract result values
# using regular expressions that are provided. Each summary entry has
# an id, which will become a column in our trace.txt summary file.
# The regex supplied must contain a single matching group. The value
# captured by the group will be written into the trace.txt file in the
# relevant column.
#
# Here we will pull out the TPF and FPF values that our train_ann program
# reports when it has finished training our neural network.
summary:
  - id: True_Pos_Fraction
    regex: 'TPF: ([-+]?[0-9]*\.?[0-9]+)'
  - id: False_Pos_Fraction
    regex: 'FPF: ([-+]?[0-9]*\.?[0-9]+)'
# Now we define cluster options, describing how to run our commands on
# 1 or more nodes on the BioHPC nucleus cluster.
# Cluster partition to use
partition: 256GB
# Generic Resource (e.g. GPU request)
# OPTIONAL: Request a SLURM generic resource like a GPU
# gres: 'gpu:1'
# Total number of nodes to use
nodes: 4
# Number of CPUs required by each task. Here we request 4 cpus for each
# task. On the 256GB nodes there are 48 logical cores, so the runner
# can start 12 tasks in parallel per node, for a total of 48 concurrent
# tasks across the 4 nodes we have requested.
cpus_per_task: 4
# Time limit - you must specify a time limit for the job here, in the
# format <DAYS>-<HOURS>:<MINUTES>
# If the job reaches the limit it will be terminated, so please allow
# a margin of safety. Here we ask for 3 days.
time_limit: 3-00:00
# Modules to load - an optional list of environment modules that should
# be loaded before the command you wish to execute is run
modules:
  - matlab/2013a
  - python/2.7.x-anaconda
# Now we configure the list of parameters we want to explore.
#
# For each parameter the following properties are required:
#
# type: 'int_range' A range of integers -or-
# 'real_range' A range of real numbers -or-
# 'choice' A list of string options
#
#
# The following properties are optional:
#
# flag: '--example' A flag which should precede the value of the parameter
# optional: true If true, we consider combinations excluding this parameter
#
# substitution: '%1'  If substitution is specified, the parameter value
#                     will replace the placeholder supplied in the
#                     command, instead of being appended to the command
#
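# e.g. (hypothetical) with the command:
#   train_ann --input train.set --model model_%1.dat
# a parameter defined with substitution: '%1' has each of its values
# inserted in place of %1, rather than appended to the end of the command.
#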
# int_range and real_range types take parameters:
#
# min: 10 Minimum value for the parameter
# max: 1e4 Maximum value for the parameter
# step: 20 Amount to add at each step through the range
# scale: 10 Amount to multiply value at each step through the range
#
# e.g. for an arithmetic progression of 10 - 100 in steps of 5:
# min: 10
# max: 100
# step: 5
#
# e.g. for a geometric progression of 1, 10, 100, 1000, 10000 (scale factor 10)
# min: 1
# max: 10000
# scale: 10
#
#
# In the example below we explore a total of 160 parameter combinations,
# consisting of every combination of:
#   - 16 different numbers of hidden nodes
#   - 4 different regularization beta values, plus runs that omit the
#     optional beta flag entirely
#   - 2 different activation functions
# i.e. 16 x (4 + 1) x 2 = 160 combinations.
parameters:
  # Even numbers from 2 up to 32 for the value of --hidden
  - id: 'hidden'
    type: "int_range"
    flag: '--hidden'
    min: 2
    max: 32
    step: 2
    description: "Number of hidden nodes"
  # 0.1, 1.0, 10, 100 for the value of the -b flag
  - id: 'beta'
    type: real_range
    flag: '-b'
    optional: true
    min: 0.1
    max: 100
    scale: 10
    description: Regularization beta parameter
  # Either step or sigmoid for the value of --activation
  - id: 'activation'
    type: choice
    flag: '--activation'
    values:
      - step
      - sigmoid
    description: Activation function
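To make the expansion concrete, here is a sketch of one command line that would result from this file (hidden = 4, beta = 1.0, activation = sigmoid), assuming each flag and value is appended to the base command as described above:
train_ann --input train.set --crossk 10 --hidden 4 -b 1.0 --activation sigmoid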