Week 2: Indexing the genome
2024-01-22
Our files are gzipped (compressed), so they have to be unzipped with gzip -d (or gunzip) before we can use them.
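A minimal sketch of the round trip, using a tiny made-up FASTA file so it is safe to run anywhere. demo.fa is a hypothetical stand-in, not one of the workshop files.

```bash
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for a reference genome: a two-line FASTA file.
printf '>chr_demo\nACGTACGT\n' > demo.fa

# Compress it; gzip replaces demo.fa with demo.fa.gz.
gzip demo.fa

# Peek at the contents without unpacking anything to disk.
gzip -dc demo.fa.gz | head -n 1

# Decompress in place; demo.fa.gz is replaced by demo.fa again.
gzip -d demo.fa.gz
ls
```

gzip -dc writes the decompressed data to stdout, which is handy for a quick look at a large file before committing disk space to it.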
Tip
--help or -h to see a help menu
--version or -v to see if the program is even loaded
When we ssh into graham, we land on a login node.
That gzip thing just now? It's okay to run on a login node.

slurm
slurm is a program running on the HPC that manages our requests.

slurm job script header
These lines outline the resources we are requesting:
#!/bin/bash: bash script (mandatory)
--output and --error: stdout and stderr, respectively. Custom names for these files.

Let's set our $SLURM_ACCOUNT environment variable so that you can copy & paste code.
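One way to set this up is sketched below. The account name def-someuser is a placeholder, not a real allocation, so substitute your own group. Since sbatch does not, to our knowledge, expand shell variables written inside #SBATCH lines, the sketch also exports SBATCH_ACCOUNT, which sbatch reads as the equivalent of --account.

```bash
# Placeholder account name (hypothetical); replace with your own allocation.
export SLURM_ACCOUNT=def-someuser

# sbatch treats SBATCH_ACCOUNT as equivalent to passing --account.
export SBATCH_ACCOUNT=$SLURM_ACCOUNT

echo "Jobs will be charged to: $SLURM_ACCOUNT"

# Alternatively, pass the account explicitly at submission time
# (star_index.sh is a hypothetical file name for the job script):
#   sbatch --account="$SLURM_ACCOUNT" star_index.sh
```

Adding the two export lines to ~/.bashrc keeps them across logins.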

STAR
STAR is a fast, splice-aware read aligner designed specifically for mapping RNA-seq reads. Cited a cool 38k times since 2013.
Alternatives: HISAT2, kallisto, Salmon
runThreadN: running 12 cores in parallel
runMode: creating an index
genomeDir: where the output (index) will go
genomeFastaFiles: path to our reference genome
sjdbGTFfile: path to our gene annotations
sjdbOverhang: read length - 1

#!/bin/bash
#SBATCH --account=$SLURM_ACCOUNT
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=16G
#SBATCH --job-name=STAR_index
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
STAR --runThreadN 12 --runMode genomeGenerate \
--genomeDir \
/home/$USER/project/$USER/workshop/stargenome \
--genomeFastaFiles \
/home/$USER/project/$USER/workshop/mm10.fa \
--sjdbGTFfile \
/home/$USER/project/$USER/workshop/gencode.vM1.annotation.gtf \
--sjdbOverhang 74

nano
nano is a built-in text editor. Use the arrow keys to navigate.
Paste in the full job script, exit with Ctrl + X (Ctrl on a Mac too, not Cmd), hit y at “Save modified buffer”, and hit Enter to confirm “File Name to Write”.
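Two numbers in the job script are worth sanity-checking before submitting: the total memory request and the --sjdbOverhang value. The sketch below redoes the arithmetic in bash. The 75 bp read is a made-up example (demo.fastq is hypothetical), chosen because it is consistent with the 74 used in the script.

```bash
# Total memory = cpus-per-task x mem-per-cpu (values from the job script header).
cpus_per_task=12
mem_per_cpu_gb=16
total_mem_gb=$((cpus_per_task * mem_per_cpu_gb))
echo "Total memory requested: ${total_mem_gb}G"

# Hypothetical one-read FASTQ with a 75 bp read, written to a temp directory.
workdir=$(mktemp -d)
read_seq=$(printf 'ACGTA%.0s' $(seq 1 15))   # 5 bases x 15 repeats = 75 bp
qualities=$(printf 'IIIII%.0s' $(seq 1 15))  # matching 75 quality characters
printf '@read1\n%s\n+\n%s\n' "$read_seq" "$qualities" > "$workdir/demo.fastq"

# Line 2 of each FASTQ record is the sequence; its length is the read length.
read_length=$(awk 'NR == 2 { print length($0) }' "$workdir/demo.fastq")

# sjdbOverhang = read length - 1
sjdb_overhang=$((read_length - 1))
echo "Read length: ${read_length}, sjdbOverhang: ${sjdb_overhang}"
```

For real data, the same awk line can be fed the first record of a compressed file, for example with gzip -dc on your own FASTQ piped through head -n 2.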

STAR with module
One of the benefits of these HPCs is that popular programs are pre-installed. They just have to be loaded before we submit our job script.
Tip
module spider PROGRAM will tell you how to load certain programs
module list shows all active modules

Warning
Always check if your program (as a module) is loaded by running PROGRAM --version / -v / -V. If it’s not, your job will fail.
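The check can be wrapped in a small helper so it is hard to forget. This is only a sketch: require_program is a hypothetical function name, and it relies on command -v rather than on any particular module system, so it also runs on machines without module.

```bash
# Hypothetical helper: succeed only if the program is on the PATH.
require_program() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found; try: module spider $1" >&2
        return 1
    fi
}

# gzip ships with practically every Linux system, so this should pass.
require_program gzip

# STAR only passes once its module has been loaded.
require_program STAR || echo "Load STAR before running sbatch."
```

Running the helper right before sbatch catches a missing module in seconds instead of after the job has waited in the queue.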
Check the slurm job schedule for your account with sq.
These HPCs are free as in free beer, but you often have to wait for your stuff to run.
Wait time depends on usage (CPUs, time, memory). slurm keeps track of these.
You’d want to avoid wasteful usage. You can check efficiency by running seff JOBID.
Frequently used commands: sbatch, sq, sacct -b, seff
On the other hand, cloud services (AWS, Azure, GCP) have no wait times, but are functionally not free.
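The frequently used slurm commands listed above chain together through the job ID. A small sketch, assuming sbatch prints its usual "Submitted batch job ID" line: the ID 12345 and the script name star_index.sh are made up, and the slurm commands are only echoed so the snippet runs off-cluster too.

```bash
# On the cluster this line would come from: sbatch star_index.sh
submit_output="Submitted batch job 12345"   # illustrative, not a real job

# The job ID is the last whitespace-separated word of that line.
job_id=${submit_output##* }
echo "Job ID: $job_id"

# Commands to run next on the cluster (echoed here, not executed):
echo "sq                    # is the job pending or running?"
echo "sacct -b -j $job_id   # brief state and exit code"
echo "seff $job_id          # CPU and memory efficiency once finished"
```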