SLUPipe has been developed to be compatible with High Throughput Computing (HTC) environments that use SLURM job scheduling.

SLUPipe Execution:

Step 1: Construct a base JSON configuration file with the same arguments as before, plus two new keys:

  1. nodes: Number of nodes used during the HPC workflow
  2. node_samples: Samples processed per node during the HPC workflow

Please Note: SLUPipe in HTC mode will process ALL samples found within the input directory.

HPC Base Configuration File Example

[
  {
    "Pipeline_Mode":"-T",
    "Variant_Callers":["Pindel","Platypus"],
    "Input_Directory":"/student/foo/SLUPipe/src/input",
    "Output_Directory":"student/foo/SLUPipe/src/output",
    "Chromosome_Range": "chr1:16,000,000-215,000,000",
    "vep_ScriptPath": "/student/foo/.conda/envs/SLUPipe/share/ensembl-vep-95.3-0",
    "vep_CachePath": "/student/foo/.vep",
    "reference_directory": "/student/foo/referenceFiles",
    "nodes": "2",
    "node_samples": [] <- Must always be empty list
  }
]

Step 2: Execute the following script to split the workload into SLURM-compatible jobs:

$ python3 gen_batches.py <base_configuration_file>

This script divides all the samples found in the input directory into smaller jobs by generating new JSON configuration files, each representing a portion of the total workload (a conceptual sketch of this splitting logic follows the generated examples below):

Example:

Input Directory:
    -> Demo1_T.bam
    -> Demo1_N.bam
    -> Demo2_T.bam
    -> Demo2_N.bam


2 Samples / 2 Nodes = 1 Sample Per Job: 

Auto Generated JSON 1:
[
  {
    "Pipeline_Mode":"-T",
    "Variant_Callers":["Pindel","Platypus"],
    "Input_Directory":"/student/foo/SLUPipe/src/input",
    "Output_Directory":"student/foo/SLUPipe/src/output",
    "Chromosome_Range": "chr1:16,000,000-215,000,000",
    "vep_ScriptPath": "/student/foo/.conda/envs/SLUPipe/share/ensembl-vep-95.3-0",
    "vep_CachePath": "/student/foo/.vep",
    "reference_directory": "/student/foo/referenceFiles",
    "nodes": "2",
    "node_samples:["Demo1_T.bam","Demo1_N.bam"]
  }
]

Auto Generated JSON 2:
[
  {
    "Pipeline_Mode":"-T",
    "Variant_Callers":["Pindel","Platypus"],
    "Input_Directory":"/student/foo/SLUPipe/src/input",
    "Output_Directory":"student/foo/SLUPipe/src/output",
    "Chromosome_Range": "chr1:16,000,000-215,000,000",
    "vep_ScriptPath": "/student/foo/.conda/envs/SLUPipe/share/ensembl-vep-95.3-0",
    "vep_CachePath": "/student/foo/.vep",
    "reference_directory": "/student/foo/referenceFiles",
    "nodes": "2",
    "node_samples:["Demo2_T.bam","Demo2_N.bam"]
  }
]
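
For reference, the splitting performed by gen_batches.py can be illustrated with a short Python sketch. This is a hypothetical reconstruction, not the actual gen_batches.py source: the file name batch_sketch.py, the rule that pairs *_T.bam/*_N.bam files by their shared sample prefix, and the batch_<n>.json output names are assumptions made for illustration only.

# batch_sketch.py -- hypothetical sketch of the batching logic (not the real gen_batches.py)
import copy
import json
import os
import sys

def main(base_config_path):
    # The base configuration is a JSON list containing a single object.
    with open(base_config_path) as fh:
        base = json.load(fh)[0]

    nodes = int(base["nodes"])
    input_dir = base["Input_Directory"]

    # Group BAM files by sample ID, assuming the Demo1_T.bam / Demo1_N.bam
    # naming convention shown above (sample ID = name up to the last underscore).
    samples = {}
    for name in sorted(os.listdir(input_dir)):
        if name.endswith(".bam"):
            samples.setdefault(name.rsplit("_", 1)[0], []).append(name)

    sample_ids = sorted(samples)
    per_node = max(1, -(-len(sample_ids) // nodes))  # ceiling division

    # Emit one JSON configuration per node-sized chunk of samples.
    for i in range(0, len(sample_ids), per_node):
        cfg = copy.deepcopy(base)
        cfg["node_samples"] = [bam for sid in sample_ids[i:i + per_node]
                               for bam in samples[sid]]
        with open("batch_{}.json".format(i // per_node + 1), "w") as out:
            json.dump([cfg], out, indent=2)

if __name__ == "__main__":
    main(sys.argv[1])

With the two-sample example above, this would produce batch_1.json and batch_2.json, matching the auto-generated configurations shown.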

Step 3: Create a SLURM-compatible Bash script to submit the jobs to the SLURM job scheduler:

#!/bin/bash

source activate SLUPipe

# Submit one SLURM job per auto-generated JSON configuration file
for FILE in *.json; do
    echo "${FILE}"
    sbatch -n 2 -t 1-00:00 --job-name=SLUPipe --cpus-per-task=10 --partition=medmem --wrap="python3 slupipe_apex.py ${FILE}"
    sleep 1
done
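
Save the script as run_slupipe_hpc.sh (the name used in Step 4) and make it executable before submitting:

$ chmod +x run_slupipe_hpc.sh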

Step 4: Run the Bash script:

$ ./run_slupipe_hpc.sh

Each job’s results will be placed in the output directory specified in the base configuration JSON file.
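
Queued and running jobs can be monitored with standard SLURM commands, for example:

$ squeue -u $USER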