A primer on RNA-seq analysis

Week 1: Intro and HPC

Joon Kim

2024-01-15

Plan

  1. Intro to RNA-seq analysis (today)
  2. RNA-seq pt 1
    • indexing the genome
    • read alignment
    • counting
  3. RNA-seq pt 2
    • differential gene expression with edgeR
  4. RNA-seq pt 3
    • visualization

Not within the scope

  • when/why you should or shouldn’t do RNA-seq
  • how to purify RNA and send to sequencing (Toronto/McGill)
  • budget and design (might cover later)
  • tasks handled by the core facility
    • RNA-seq library generation
      • non-core labs don't typically make their own libraries
    • Illumina sequencing technology (e.g. bridge amplification)

Overview

Raw reads from each sample are turned into a gene-by-sample count table:

Sample 1 reads: ..ATGCTAGATCG… …GGTACTACT…
Sample 2 reads: ..TGGTGTGCTG… …GCTAGCTGCA…

Gene   Sample 1   Sample 2
XYZ    1234       4312
…      12         44

Within the scope

  • You have (or will have) some RNA-seq data to analyze
    • i.e. you have (or will have) some samples.fq.gz files
  • How do we go from there to PCA, volcano plots, heatmaps, etc.? 🤔🤔🤔

Two phases to RNA-seq analysis

Step                   Environment   Rationale
reads ➡️ count table    HPC           large file sizes, high compute, non-interactive
count table ➡️ plots    local         small file sizes, low compute, interactive
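Once the count table exists on the HPC, it is small enough to copy to your own computer for the plotting phase. A minimal sketch using scp, run from your local terminal (the remote path and file name are placeholders, not actual workshop files):

# copy a hypothetical count table from Graham to the current local directory (".")
scp skim823@graham.computecanada.ca:workshop/counts.tsv .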

Working on a high performance computing cluster (HPC)

  • A big challenge in learning bioinformatics is the lack of a graphical user interface (GUI).
    • no clicking and dragging
  • Instead, we have to use the command-line interface (CLI).
    • we type out what we want the computer to do
  • To process reads into a count table, we need to use an HPC at a remote location.

An HPC @ Waterloo: Graham


  • it’s a big freaking computer
  • we can’t drive over to Waterloo every time
  • We can log in to graham from our computer using ssh

Logging into graham through Terminal/Command Prompt

Warning

Replace skim823 with your username!


ssh skim823@graham.computecanada.ca


  • Prompts you with “are you sure blah blah?”; type yes
  • Prompts you for your password; what you type is invisible for security
  • Prompts you for MFA (annoying)
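Optional convenience: you can save the hostname and username in your local ~/.ssh/config so logging in takes fewer keystrokes (a sketch; create the file if it doesn't exist):

# contents of ~/.ssh/config on your own computer
Host graham
    HostName graham.computecanada.ca
    # replace with your username
    User skim823

With this in place, typing ssh graham is enough to start the login.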

Linux terminal basics

Basic commands (used in the example below)
- pwd: print the current (working) directory
- ls: list the files in a directory
- cd: change directory
- cp: copy a file
- mv: move or rename a file
- mkdir: make a new directory
- touch: create an empty file
- rm: remove a file (there is no recycle bin!)
- cat: print a file's contents
- head: print the first lines of a file
- tail: print the last lines of a file
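A quick tour of these commands; the directory and file names here are made up for illustration:

pwd                      # where am I right now?
mkdir demo               # make a new directory called demo
cd demo                  # move into it
touch notes.txt          # create an empty file
ls                       # list what is here: notes.txt
cp notes.txt backup.txt  # make a copy
mv backup.txt old.txt    # rename (move) the copy
cat old.txt              # print its contents (empty for now)
head -n 5 old.txt        # show the first 5 lines
tail -n 5 old.txt        # show the last 5 lines
rm old.txt               # delete it; there is no recycle bin
cd ..                    # go back up one directory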

Tips
- tab to auto-complete file and command names
- ctrl + c to abort the current command
- ctrl + a to move the cursor to the beginning of the line
- ctrl + e to move the cursor to the end of the line
- ctrl + l to clear the screen

Workflow on the HPC side

  • we need exactly three kinds of files:
    • a genome.fa file
    • a genes.gtf/.gff file
    • reads.fq.gz files
  • the reads.fq.gz files should come from your sequencing core
  • the genome is the reference build of your organism
    • e.g. mm10, mm39, hg38
  • the genes file contains the coordinates (among other things) of every annotated gene in that genome
    • protein-coding and non-coding; how complete the annotation is depends on the source
  • make sure your genome and genes files come from the same build (e.g. both mm10)

Downloading these files to our HPC

First, let’s create a new directory somewhere to save our files


# go to your PROJECTS directory
mkdir workshop
cd workshop


Question: what is your current directory?

wget is a command we can use to download files over the internet

wget https://hgdownload.cse.ucsc.edu/goldenpath/mm10/bigZips/mm10.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M1/gencode.vM1.annotation.gtf.gz
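Before moving on, it is worth a quick sanity check that the downloads worked; zcat prints gzipped files without decompressing them, and the chromosome names in the two files should agree. A sketch:

ls -lh                                          # both .gz files should be listed
zcat mm10.fa.gz | head -n 1                     # FASTA header of the first chromosome, e.g. ">chr1"
zcat gencode.vM1.annotation.gtf.gz | head -n 8  # "##" description lines, then tab-separated gene records
# chromosome names in column 1 of the GTF should match the FASTA headers
zcat gencode.vM1.annotation.gtf.gz | grep -v "^#" | cut -f 1 | head -n 1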

Downloading these files to our HPC

These are our dummy reads (*.fq.gz):

  • subsampled to 1% of all reads to reduce computational cost
  • DKO = double knockout cells
  • D1 = DMSO-treated replicate 1
  • L00? = sequencing lane
    • each sample is sequenced across several “lanes” on the sequencing machine
      • this spreads out lane-specific technical effects
    • depending on the core/machine setup, you may receive one file per lane or a single pooled file
wget https://github.com/kimsjune/ccir-bioinformatics2024/raw/main/week1/DKO_D1_L001/DKO_D1_L001.gz
wget https://github.com/kimsjune/ccir-bioinformatics2024/raw/main/week1/DKO_D1_L002/DKO_D1_L002.gz
wget https://github.com/kimsjune/ccir-bioinformatics2024/raw/main/week1/DKO_D1_L003/DKO_D1_L003.gz
wget https://github.com/kimsjune/ccir-bioinformatics2024/raw/main/week1/DKO_D1_L004/DKO_D1_L004.gz
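Once downloaded, you can peek at the reads. Each FASTQ record is four lines: read name, sequence, a "+" separator, and base qualities. A sketch (the exact reads you see will differ):

# first two FASTQ records (4 lines each) of one lane
zcat DKO_D1_L001.gz | head -n 8
# number of reads in the file (one read = 4 lines)
echo $(( $(zcat DKO_D1_L001.gz | wc -l) / 4 ))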

Progress so far

  • we have our reads.fq.gz, genome.fa, and genes.gtf files
  • next week we will build a STAR index of the genome (rough preview below)
    • this is necessary before we can align our reads to it
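For the curious, this is roughly what building the index will look like. This is only a sketch: the exact parameters (threads, memory, splice-junction settings, etc.) are covered next week, and on Graham STAR first has to be loaded through the module system. STAR wants uncompressed inputs:

gunzip mm10.fa.gz gencode.vM1.annotation.gtf.gz

STAR --runMode genomeGenerate \
     --genomeDir star_index \
     --genomeFastaFiles mm10.fa \
     --sjdbGTFfile gencode.vM1.annotation.gtf \
     --runThreadN 4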

Filesystem and policies

Folder                      Default quota            Backed up?  Purged?       Use case
/home/$USER                 50GB, 500k files         Y           N             setup files
/scratch/$USER              20TB, 1M files           N           Y, after 60d  day-to-day (temp)
/project/SLURM_GROUP/$USER  1TB, 500k files          Y           N             static data
/home/$USER/nearline        2TB, 5k files per group  Y           N             long-term storage
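To see where you are and roughly how much space you are using, the standard tools work (Graham also has its own quota-reporting utilities, not shown here):

# how much space a directory takes up (run from its parent directory)
du -sh workshop
# free space on the filesystem you are currently on (not your personal quota)
df -h .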

Documentation

Resources