CellProfiler on BioHPC

Updated 2017-09-28 David Trudgian

CellProfiler from the Broad Institute is a powerful application for processing biological images in a pipeline. It offers an easy-to-use GUI environment to create and run pipelines against any number of images.

CellProfiler Module & GUI

The latest version of CellProfiler installed on BioHPC can be run by using the following commands in an interactive GUI session, or on a workstation/thin-client.

module add cellprofiler
cellprofiler

CellProfiler Analyst Module

In addition to the base CellProfiler application, CellProfiler Analyst is available. This tool supports interactive exploration and analysis of data. To start CellProfiler Analyst use the following commands in an interactive GUI session, or on a workstation/thin-client:

module add cellprofiler-analyst
CellProfiler-Analyst.py

Running Analyses on a Single Machine

If you are performing analysis with CellProfiler, the easiest way to run your project is to use the GUI within a webGUI session started from the BioHPC User Portal.

The CellProfiler GUI will automatically use all CPU cores available on the machine when an analysis is run.

A webGUI session is limited to 20 hours. If you sometimes need longer than 20 hours to complete your analysis please email biohpc-help@utsouthwestern.edu to request an extension for your webGUI session. We are happy to grant extensions for valid reasons, as long as sufficient notice is given.

If you find that your analysis takes a long time to complete, you may want to read on, to learn about running batched analyses through a cluster job.

Running Analyses on Multiple Cluster Nodes

CellProfiler has a batch processing feature that can be used to run jobs on a cluster independent of the graphical interface. It is flexible enough to support running large analysis projects across multiple nodes in the Nucleus cluster, to speed up time to completion. However, it is not straightforward to run a batch analysis.

There are 5 basic steps to performing a batch analysis:

Convert your project to a batch analysis in the CellProfiler GUI
Generate the batch file from the GUI
Write a SLURM sbatch script to run the batches
Submit the script, and wait for the job to finish
Collate the results from the batches

Converting a Project to a Batch Analysis

CellProfiler's batch mode depends on a batch file that contains information about what processing is needed for each image set. This file must be generated through the CellProfiler GUI.

You can convert a project to a batch analysis as follows:

File->Open Project
Find and open your project
Right click in a blank space in the 'Analysis Module' panel
Select the tool called 'All->CreateBatchFiles'
Use the down arrow in the 'Adjust Modules' area to move 'CreateBatchFiles' to the very bottom of the pipeline. It must be the final step.

The CreateBatchFiles step has some default settings that need to be modified:

Choose 'Remove this path mapping'. BioHPC filesystems are always mounted in the same place on all machines, so we don't need this option.
Choose 'No' for store in default output folder.
Set a default output folder on the /project or /work filesystems. Do not output to the /home2 filesystem, as this can impact speed for other users.
Choose 'No' for cluster running Windows. BioHPC machines run Linux.
Choose 'No' for 'Launch BatchProfiler'. BatchProfiler is a tool for submitting batch jobs automatically to a cluster, but we need to write our own batch submission script.

Create the Batch Data File

Once the CreateBatchFiles module is configured at the end of your pipeline, you need to create the batch data file.

Make sure that the input file list for your pipeline is correct. Add or remove files as necessary if you are reusing an old pipeline project.
In File->Preferences make sure that default input and output directories are set to a sensible location for your project.
Save the project - you might want to save this batch version under a different name to the original.
Click the 'Analyze Images' button.

CellProfiler will now process the list of files and the pipeline to compute the information needed to set up and run batch processing. When complete, a file named Batch_data.h5 will be created in the pipeline output directory.

Understanding Batch CellProfiler

Before writing a batch script that will process all of your image sets across multiple nodes it's important to understand a little about how CellProfiler is used to analyze a batch of images, in a non-interactive manner.

In a terminal session, cd to the location of your pipeline file and Batch_data.h5 that you created above.

We can then run a test batch analysis, on a batch consisting of only the first 10 input image sets (for speed):

module add cellprofiler/2.2.0-20160712
# -p batch file location
# -c no GUI
# -r run the pipeline on startup
# -b do not build extensions
# -f first image set in batch
# -l last image set in batch
# -o output directory for the batch
cellprofiler -p Batch_data.h5 -c -r -b -f 1 -l 10 -o batch1.out

Note here that the -f, -l, -o options are the most important, and need to be changed for each batch when we run our full analysis.

-f specifies the index (starting at 1) for the first image set we will process in this batch.
-l specifies the index for the last image set we will process in this batch.
-o specifies the output directory for any files created from this batch.

*Note* it's strongly advised to use the -o option to set a different output directory for each batch you run in parallel. When you are running programs across multiple nodes it is easy to have delays or freezes caused by locking when multiple nodes try to create, modify, and delete files in the same directory. Writing output to batch-specific directories prevents any problems with file locking from multiple nodes.

Note that when run in batch mode, CellProfiler uses only one CPU core for processing, rather than all available cores as it does when running a complete analysis from the GUI. To use all CPU cores on multiple nodes we therefore need to run as many batches in parallel as we have CPU cores.

Writing a Batch Script

To process our complete project across multiple compute nodes we will need to write a batch script, that can be submitted to the SLURM job scheduler. This script will need to:

Reserve multiple compute nodes
Load the CellProfiler module
Start as many batch analysis processes as CPUs are available
Wait for all analysis processes to complete

Decide in advance on the number of nodes you wish to use. You can use up to 16 nodes concurrently, but we recommend using a lower number that's sufficient to keep total run times in the range of 4-12 hours. Requesting a lot of nodes to complete your analysis very quickly may result in long waits on the cluster queue, and the higher number of batched outputs may be more difficult to combine.

The following script performs analysis of a project with 1000 image sets across 4 nodes, in 100 batches of 10 image sets each.

We use 4 nodes from the 256GBv1 cluster. Each node has 56 logical cores, or 28 physical cores. CellProfiler is best run using a physical core for each process, so we have 4 * 28 = 112 cores available across 4 nodes.

It's easiest to use equal size batches for processing, so we will run 100 batches of 10 image sets each, to complete our 1000 image set project. This wastes 12 of the 112 cores available, but that is a small percentage and dealing with uneven batch sizes to use all cores is difficult.
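
The arithmetic above can be checked for other project sizes with a short calculation. The following is a minimal sketch only: the function name `plan_batches` and its output format are hypothetical, and the node and core counts are supplied by you rather than read from the cluster.

```shell
#!/bin/bash
# Hypothetical helper: report how a project divides into equal batches.
# Arguments: total image sets, nodes, physical cores per node, batch size.
plan_batches() {
    local total=$1 nodes=$2 cores_per_node=$3 size=$4
    local cores=$((nodes * cores_per_node))
    # Ceiling division: a partial final batch still counts as a batch
    local batches=$(( (total + size - 1) / size ))
    # If there are more batches than cores, they need more than one round
    local rounds=$(( (batches + cores - 1) / cores ))
    local idle=$(( rounds * cores - batches ))
    echo "cores=${cores} batches=${batches} rounds=${rounds} idle_cores=${idle}"
}

# The example from this page: 1000 image sets, 4 nodes, 28 cores, batches of 10
plan_batches 1000 4 28 10
# prints: cores=112 batches=100 rounds=1 idle_cores=12
```

A batch size that gives one round with few idle cores, as here, is usually the simplest choice.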

Note: in future we will update our BioHPC Parameter Runner tool to make submitting CellProfiler batch processing easier. At present it doesn't support computing start and end indexes for a range, so it cannot be used for this task.

Use the script below as a template to set up your batch processing experiment. Create the script in the same location as the Batch_data.h5 file that was generated previously. Read the comments in the script, as you will need to change some settings related to batch sizes etc.

#!/bin/bash

# Use the 256GBv1 partition
# (56 logical, 28 physical cores per node)
# You can use other partitions, see the portal for details about cores per node.
# Don't use super as it gives mixed nodes with different configurations.
#SBATCH -p 256GBv1

# Use 4 nodes total
#SBATCH -N 4

# Time Limit of 1 day
#SBATCH -t 1-00:00:00

# CellProfiler creates lots of threads, raise our process limit
# or batches may fail
ulimit -u 4096

# Load the cellprofiler module
module add cellprofiler/2.2.0-20160712

# We have 4 nodes * 56 cores
# BUT we want to use only physical cores here for best speed 
# So we have 4 nodes * 28 cores = 112 cores available
# As a test we'll process 1000 image sets
# Closest factor of 1000 to 112 threads is 100
# So we will run 100 batches, so each will deal with 10 image sets

# srun is used to distribute our cellprofiler batches across the allocated nodes
# we run each batch as a background task (& at end of command)
# otherwise the script will wait for batches to complete 1 by 1 

# Loop over i, which will be the first image set in the batch
# With 1000 image sets, and 100 batches this is a sequence from
# 1 to 991, in increments of 10
# You need to change the values in the seq command below for different
# project and batch sizes
for i in $(seq 1 10 991); do

    # First image of the batch is our $i
    BATCH_START=$i
    # Last image is $i + 9 (10 images inclusive)
    # e.g. if first in batch is 21, last is 30 (10 images)
    # Change the 9 to match your real batch size
    BATCH_END=$((i+9))

    # Run CellProfiler for the batch via srun
    # Allocate 2 cpus per task because we want a physical core per process 
    # and there are 2 logical cores per physical core
    # Specify JVM heap size, as otherwise batches fail due to running out of RAM
    # with large numbers of batches per node
    # -p specifies our batch file
    # -c disables the GUI
    # -r runs the batch immediately
    # -b do not build extensions
    # -f first image set for this batch
    # -l last image set for this batch
    # -o output folder for this batch
    srun --exclusive -N1 -n1 --cpus-per-task=2 cellprofiler --jvm-heap-size=4G -p Batch_data.h5 -c -r -b -f ${BATCH_START} -l ${BATCH_END} -o batch_${i}_out &

done

# When we get here we've started all the batches running
# Now wait for all the batches we started to finish
wait
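
If your number of image sets is not an exact multiple of the batch size, the start and end indexes used in the loop above need a shorter final batch. The following is a minimal sketch of one way to generate them. The function name `batch_ranges` is hypothetical and not part of CellProfiler or SLURM. It only prints the ranges, so it can be tried safely before changing the job script.

```shell
#!/bin/bash
# Hypothetical helper: print "first last" for each batch.
# Arguments: total image sets, batch size.
# The last batch is clamped to the total, so it may be shorter than the rest.
batch_ranges() {
    local total=$1 size=$2 first last
    for first in $(seq 1 "$size" "$total"); do
        last=$((first + size - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "$first $last"
    done
}

# Dry run: show the -f/-l/-o arguments each batch would receive
# for a small project of 25 image sets in batches of 10
batch_ranges 25 10 | while read -r first last; do
    echo "-f ${first} -l ${last} -o batch_${first}_out"
done
# prints:
# -f 1 -l 10 -o batch_1_out
# -f 11 -l 20 -o batch_11_out
# -f 21 -l 25 -o batch_21_out
```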

Submit the script

Once the batch job script is written and ready, submit it with the `sbatch` command, e.g.:

sbatch cellprofiler_4_node.sh

The output of the sbatch command is a job id, if your script is valid, or an error message if there is a mistake in your script.

You can check the status of your job with the `squeue -u $USER` command, which lists jobs on the cluster belonging to you.

As the job runs it will write log messages into a file called `slurm-<jobid>.out`, where `<jobid>` is the job id reported by `sbatch` or `squeue`.

Collating Output

If you follow the script above, the output of each batch will be in a separate folder named e.g. batch_41_out, where 41 is the index of the first image set in the batch.
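
Before collating, it can be worth checking that every batch produced an output folder. The following is a minimal sketch, assuming the `batch_<first>_out` naming used in the script above. The function name `missing_batches` is hypothetical. A folder that exists and contains files does not prove the batch succeeded, so also check the SLURM log for errors.

```shell
#!/bin/bash
# Hypothetical helper: list the start index of every batch whose output
# folder is missing or empty. Run it in the folder holding the batch outputs.
# Arguments: total image sets, batch size.
missing_batches() {
    local total=$1 size=$2 first dir
    for first in $(seq 1 "$size" "$total"); do
        dir="batch_${first}_out"
        # Report the batch if its folder does not exist or has no entries
        if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir")" ]; then
            echo "$first"
        fi
    done
}

# Example for the 1000 image set project above:
# missing_batches 1000 10
```

Any index printed can be re-run on its own with the same -f, -l and -o options that the job script used for that batch.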

You may wish to collect all the output into a single directory. If your output is uniquely named files per input (i.e. for each input you create a derivative image named similarly to the input file) you can use a command such as:

mkdir combined_output
cp -r batch_*_out/* combined_output/

However, if you generate e.g. summary CSV files then they will have the same name in each batch-specific output directory, and you will need to merge the data inside the CSV files across batches. If you just copy the files as above, each copy overwrites the previous one and only the CSV file from the last batch copied will remain in your combined_output directory.
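
To find out which files are affected before copying anything, you can list the file names that occur in more than one batch folder. The following is a minimal sketch, assuming the `batch_<first>_out` naming from the script above and looking only at the top level of each folder. The function name `colliding_names` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print file names that appear in more than one
# batch_*_out folder. Copying these into one folder would overwrite data.
colliding_names() {
    local path
    for path in batch_*_out/*; do
        # Skip the unexpanded pattern when nothing matches
        [ -e "$path" ] || continue
        basename "$path"
    done | sort | uniq -d
}

# Example, run in the folder holding the batch outputs:
# colliding_names
```

Names printed by this check need merging rather than copying. Names not printed are unique and safe to copy.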

It's possible to merge CSV files that have identical structure using Linux commands. If you generate a file called MyExpt_Nuclei.csv from each batch run then you can create a merged file by:

# Take the header (first line) from the first batch
head -n1 batch_1_out/MyExpt_Nuclei.csv > Combined_Nuclei.csv
# Append all other lines except header from all files
tail -q -n+2 batch_*_out/MyExpt_Nuclei.csv >> Combined_Nuclei.csv
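
Note that the shell expands batch_*_out in name order, not numeric order, so the rows in the merged file will not follow the image set order. If the order matters, the following minimal sketch merges the files in numeric batch order and stops if a header differs. The function name `merge_batch_csv` is hypothetical, and the sketch assumes GNU sort (for the -V option) and that every CSV ends with a newline.

```shell
#!/bin/bash
# Hypothetical helper: merge one CSV from every batch folder in numeric
# batch order, keeping a single header line.
# Arguments: CSV file name inside each batch folder, merged output file.
merge_batch_csv() {
    local name=$1 out=$2
    # sort -V (GNU sort) orders batch_11_out before batch_101_out
    printf '%s\n' batch_*_out/"$name" | sort -V | {
        first_header=""
        while IFS= read -r file; do
            # Skip batches that did not produce this file
            [ -f "$file" ] || continue
            header=$(head -n 1 "$file")
            if [ -z "$first_header" ]; then
                # Write the header once, from the first file found
                first_header=$header
                printf '%s\n' "$header"
            elif [ "$header" != "$first_header" ]; then
                echo "Header mismatch in $file" >&2
                exit 1
            fi
            # Append the data rows, without the header line
            tail -n +2 "$file"
        done
    } > "$out"
}

# Example, run in the folder holding the batch outputs:
# merge_batch_csv MyExpt_Nuclei.csv Combined_Nuclei.csv
```

The function returns a non-zero status on a header mismatch, in which case the merged file is incomplete and should be discarded.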

Cautions & Limitations

Check batch processing results vs a GUI run, until you are comfortable running batched processing, and identifying any mistakes in the output.
Submit your job to a partition with a single node configuration, so that you know in advance how many CPUs will be available.
Be very careful with the output collation step - it is easy to mix or overwrite output when combining results from the different batches.