Introduction to Bioconductor

The Bioconductor project is a software ecosystem for the statistical analysis of large biological datasets in the R programming language. The project was initiated in 2001 and has since grown tremendously in size and usage. Bioconductor provides software for accessing databases, data structures for analyzing high-throughput data, and novel statistical methods for analyzing these data.

Bioconductor on BioHPC

Accessing Bioconductor software on Nucleus is straightforward. All that is needed is a working R installation. On Nucleus, there are three main ways that users can run R: using the OnDemand RStudio app, through the module system, or via containers.

RStudio OnDemand

The RStudio OnDemand tool allows users to reserve a dedicated node on Nucleus, and serves the RStudio integrated development environment as a web app. This is the most straightforward way to access R/Bioconductor on Nucleus.

To access this, navigate to the BioHPC portal, and select "BioHPC OnDemand > OnDemand RStudio". The next page will allow you to select a partition to run on and launch a job right away, which you can connect to in your browser.

This approach is fast and easy, and provides a graphical environment to work from. However, jobs are capped at 20 hours, and users' ability to modify their environment in this setup is limited. If you need to have access to a shell or run a particularly long job, read on.

Module system

Users can also access R/Bioconductor by loading an R module, and then installing and loading software from Bioconductor. This can be done from a terminal in a WebVis session launched from the portal, or via an SSH connection. Ensure that you're not on a login node (Nucleus004 or Nucleus005) before launching a resource-intensive job!

In a terminal, load the R or RStudio module corresponding to the version of R you wish to use, then launch R. For example, to launch version 4.4 of R, you could use the following commands in a terminal on a compute node on Nucleus:

module load R/4.4-img
R

This method gives users greater control over their environment, and is well suited to long-running and/or batch jobs. However, users can only access the versions of R which are available as modules. Additionally, packages that require compilation may fail to install if their system dependencies cannot be met. This can be a challenge with some Bioconductor packages in particular.
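Once R is running, a quick check of the host name can confirm that the session is not on a login node. The following is a minimal sketch rather than an official BioHPC tool: the helper is_login_node() is hypothetical, and the exact form of the host name reported by the system (letter case, domain suffix) is an assumption, so both are normalized before comparison.

```r
# Login node names are taken from this guide. How the system reports them
# (case, domain suffix) is an assumption, so the name is normalized first.
login_nodes <- c("nucleus004", "nucleus005")

# Hypothetical helper: TRUE if `host` looks like one of the login nodes
is_login_node <- function(host = Sys.info()[["nodename"]]) {
  # Drop any domain suffix and compare case-insensitively
  short_name <- tolower(sub("\\..*$", "", host))
  short_name %in% login_nodes
}

if (is_login_node()) {
  warning("This R session is on a login node; move to a compute node before running heavy jobs.")
}
```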

Containers

Another way to access R/Bioconductor software on BioHPC is through containers. In this method, the user downloads and runs a pre-built container image that includes R and, potentially, some packages and their dependencies.

To do this, we can load the Singularity module to run containers, then obtain and run a container that includes the software that we want. Given that this guide is focused on Bioconductor, we can pull a container specifically designed to work with Bioconductor:

module load singularity/3.9.9
singularity run docker://bioconductor/bioconductor:RELEASE_3_21-R-4.5.0

This will launch an RStudio Server instance which you can connect to using your browser.

Installing Bioconductor

Once you've successfully launched R, you can access Bioconductor by installing the BiocManager package:

install.packages("BiocManager")

Now, to install any package from Bioconductor, we can use BiocManager::install() instead of the typical install.packages() command. For example, to install the package DESeq2 from Bioconductor, we could use the following command:

BiocManager::install("DESeq2")
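As a slightly fuller sketch of the same workflow, the snippet below installs BiocManager only when it is missing, installs several packages in one call, and checks that the installed packages match the current Bioconductor release. The particular package names are only illustrative; the calls need network access and a writable library directory.

```r
# Install BiocManager only if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Report the Bioconductor release that BiocManager will install from
BiocManager::version()

# Install several packages in one call (package names are case-sensitive)
BiocManager::install(c("DESeq2", "edgeR", "SummarizedExperiment"))

# Check that installed packages are consistent with this Bioconductor release
BiocManager::valid()
```

BiocManager::install() can also install CRAN packages, so a single call can cover a mixed list of dependencies.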

Software examples

Here we provide some examples of software available on Bioconductor for analysis of particular types of data. Any of these can be installed by running BiocManager::install() at an R prompt as described above.

General sequencing analysis

  • GenomicRanges: A container for collections of genomic ranges with associated annotation.
  • SummarizedExperiment: A container linking matrix-like objects with metadata for both rows (features or ranges) and columns (samples). Additional slots are available to include experiment-wide metadata.

Schematic of a SummarizedExperiment object. From SummarizedExperiment documentation.

  • Rsubread: Alignment and quantification of DNA/RNA sequencing data.
  • Rsamtools: Read and manipulate SAM, BAM, FASTA, and other seq-related files in R.
  • DESeq2: Differential expression analysis from counts-based data, such as RNAseq.
  • edgeR: Differential expression analysis from counts-based data, such as RNAseq.
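To make the two container classes at the top of this list more concrete, here is a small sketch that builds a GRanges object and wraps it, together with a toy count matrix, in a SummarizedExperiment. All gene names, coordinates, counts, and sample labels are invented for illustration, and the packages are assumed to be installed already.

```r
library(GenomicRanges)
library(SummarizedExperiment)

# Three made-up genomic features with one annotation column
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500, 200), width = 50),
  strand   = c("+", "-", "+"),
  biotype  = c("protein_coding", "lncRNA", "protein_coding")
)
names(gr) <- c("geneA", "geneB", "geneC")

# Toy count matrix: rows are features, columns are samples
counts <- matrix(
  c(10L, 0L, 5L, 12L, 3L, 7L),
  nrow = 3,
  dimnames = list(names(gr), c("sample1", "sample2"))
)

# Sample-level metadata, one row per column of the matrix
col_data <- DataFrame(
  condition = c("control", "treated"),
  row.names = c("sample1", "sample2")
)

se <- SummarizedExperiment(
  assays    = list(counts = counts),
  rowRanges = gr,
  colData   = col_data
)

# Subsetting keeps counts, ranges, and sample metadata in sync
treated <- se[, se$condition == "treated"]

# Keep only features overlapping a region of interest
roi  <- GRanges("chr1", IRanges(start = 90, end = 120))
hits <- subsetByOverlaps(se, roi)
```

Because rows and columns are subset together with their metadata, downstream packages can rely on the counts and annotations staying aligned.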

Variant calling

  • VariantTools: Provides a framework for developing and testing variant calling pipelines.
  • VariantAnnotation: An R package for reading and filtering VCF files. Tools are also included for variant effect prediction using ENSEMBL's VEP.
  • VariantFiltering: Allows variant filtering using a number of different qualifiers.

RNAseq

  • tximeta: Automatic import of quantification data and metadata into R as a SummarizedExperiment object. Full compatibility with quantifications generated by Salmon, Sailfish, or piscem-infer; partial compatibility with quantifications produced by other upstream tools.

RNAseq analysis pipeline with tximeta. From tximeta documentation.

  • tximport: Automatic import of quantification data into R as a collection of matrices. Compatible with a number of upstream quantification tools.
  • clusterProfiler: Over-representation and gene set enrichment analysis, with visualization options and ontologies/pathways built in.
  • fgsea: Fast gene set enrichment analysis.
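Once quantifications have been imported, a differential expression analysis with DESeq2 (listed under general sequencing analysis) typically takes only a few calls. The sketch below uses simulated counts from DESeq2's makeExampleDESeqDataSet() so that it runs without any input files; the simulation settings are arbitrary, and in a real analysis the dataset would instead be built from your own data.

```r
library(DESeq2)

set.seed(1)

# Simulated dataset: 1000 genes, 6 samples, and a two-level factor
# `condition`; betaSD > 0 gives some genes a true effect
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1)

# With real data, the object would usually come from a SummarizedExperiment
# (for example one produced by tximeta), along the lines of:
# dds <- DESeqDataSet(se, design = ~ condition)

# Estimate size factors and dispersions, then fit the model
dds <- DESeq(dds)

# Extract the results table and show the genes with smallest adjusted p-values
res <- results(dds)
head(res[order(res$padj), ])
summary(res)
```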

Single cell genomics

Summary of scRNA analysis in Bioconductor. From Orchestrating Single Cell Analysis in Bioconductor.

  • SingleCellExperiment: Extends the SummarizedExperiment class for single cell transcriptomic studies.
  • scater: A collection of tools for quality control and visualization of single cell transcriptomics data.
  • scuttle: A collection of utility functions for the analysis of single cell transcriptomics data.
  • scran: A collection of tools for the interpretation of single cell transcriptomics data.
  • scDblFinder: Doublet detection of single cell data.
  • UCell: Calculates scores for gene sets in single cell datasets in a manner robust to dataset size and heterogeneity.
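As a brief illustration of how these packages fit together, the sketch below creates a mock SingleCellExperiment with scuttle, computes per-cell quality control metrics, and adds log-normalized counts. The data are randomly generated by mockSCE(), and the cell and gene numbers are arbitrary; this is a starting point, not a recommended analysis.

```r
library(SingleCellExperiment)
library(scuttle)

set.seed(1)

# Randomly generated example object: 500 genes by 100 cells
sce <- mockSCE(ncells = 100, ngenes = 500)

# Per-cell QC metrics such as library size and number of detected genes
qc <- perCellQCMetrics(sce)
head(qc)

# Add a "logcounts" assay with log-transformed normalized expression
sce <- logNormCounts(sce)
assayNames(sce)
```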

Chromatin accessibility

  • ChIPQC: Generate a quality control report for ChIPseq data.
  • ChIPseeker: Tools for ChIP peak annotation, comparison, and visualization.
  • ChIPpeakAnno: Functions for peak annotation, visualization, and comparison.
  • DiffBind: Differential binding analysis for ChIPseq or ChIP-like data.
  • chromVAR: Calculate chromatin accessibility deviations for single cell or sparse ATACseq data.
  • motifmatchr: Identify known motifs present in a set of genomic sequences, such as peaks generated by ATACseq or ChIPseq.
  • TFBSTools: Tools for transcription factor binding sites analysis.

Proteomics/metabolomics

  • mzR: Parse and import mass spectrometry data.
  • Spectra: Provides a user-friendly infrastructure for storing and handling mass spectrometry data.
  • PSMatch: Tools to handle and manage peptide spectrum matches.
  • QFeatures: Management and processing of quantitative features for high-throughput mass spectrometry, with aggregation of low-level features into higher level data.

Example of aggregative relationship between assays in a QFeatures object. From R for Mass Spectrometry book.

Interaction with other ecosystems

Interaction between Bioconductor and other ecosystems is generally done by saving a file to disk in a format that both systems can read. Many domain- or task-specific data types have widely agreed-upon file formats. For example, FASTQ, BAM, and VCF files are de facto standards for short-read sequencing data, and can be used to move data from one system to another.

Here are a few specific examples and tips for interaction:

  • Single cell data: A SingleCellExperiment object can be converted to/from a Seurat object using Seurat's built-in as.Seurat() and as.SingleCellExperiment() functions, respectively. Data can be saved as an h5ad file for use with Scanpy by using the zellkonverter, sceasy, or anndataR packages.
  • Shell commands: The system() function within R allows users to run a shell command. Check the documentation for this function, as some operations may have unexpected results.
  • Python: Python code can be run from R by using the reticulate package. If you're using modules on BioHPC to load Conda and/or Python, you'll likely need to provide reticulate with an explicit path to the Python executable. Similarly, the Python package rpy2 can be used to run R code from within Python.
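The shell and Python tips above can be sketched as follows. This example uses system2(), a base R relative of system() that takes the command and its arguments separately, and it locates Python with Sys.which("python3") purely for illustration; on BioHPC you would more likely pass the full path of the interpreter provided by the module or Conda environment you loaded. The reticulate step is skipped if the package or interpreter is not found.

```r
# Run a shell command and capture its standard output as a character vector
out <- system2("echo", args = "hello from the shell", stdout = TRUE)
print(out)

# Without capturing output, the return value is the exit status (0 = success)
status <- system2("true")

# Point reticulate at a specific interpreter before Python is initialized.
# Sys.which() is only a stand-in here for an explicit interpreter path.
py_path <- Sys.which("python3")
if (nzchar(py_path) && requireNamespace("reticulate", quietly = TRUE)) {
  reticulate::use_python(py_path, required = TRUE)
  # Evaluate a Python expression and return the result to R
  print(reticulate::py_eval("1 + 1"))
}
```

Calling use_python() before any other reticulate function matters, because the interpreter cannot be changed once Python has been initialized in the session.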

Summary

The Bioconductor project provides numerous high-quality bioinformatics software packages across several domains. Many of these packages build on common data structures, which makes them easier to use together. Using most Bioconductor packages on Nucleus is straightforward, but if you encounter any issues, please submit a ticket to biohpc-help@utsouthwestern.edu.