Nextflow: using splitCsv() operator

Introduction

When I get *.fastq.gz files back for my Visium spatial libraries, spaceranger count command is used to generate various output files for QC metrics and downstream analysis. The command in my slurm job script looks like this:

spaceranger count --id=18_57617_A1 --transcriptome=/home/skim823/projects/def-fdick/skim823/genomes/spacerange_hg38/refdata-gex-GRCh38-2020-A --probe-set=/home/skim823/projects/def-fdick/skim823/programs/spaceranger-2.1.1/probe_sets/Visium_Human_Transcriptome_Probe_Set_v2.0_GRCh38-2020-A.csv --fastqs=/scratch/skim823/visium/20240117_LH00244_0047_A22GM27LT3_Mura_Kim --sample=18_57617_A1_D1 --cytaimage=/scratch/skim823/visium/20240117_LH00244_0047_A22GM27LT3_Mura_Kim/etc/assay_CAVG10505_2023-12-06_10-13-34_V43L25-333_1701876913_CytAssist/CAVG10505_2023-12-06_10-35-13_2023-12-06_10-13-34_V43L25-333_D1_18-57617-A1.tif --image=/scratch/skim823/visium/20240117_LH00244_0047_A22GM27LT3_Mura_Kim/etc/tiff/18-57617-A1.tif --slide=V43L25-333 --area=D1 --loupe-alignment=/scratch/skim823/visium/20240117_LH00244_0047_A22GM27LT3_Mura_Kim/etc/json/18_57617_A1.json

With future samples, I want to use Nextflow to automate job submission.

Strategy

My initial thought was to parse params.fastq, but --cytaimage, --image, --area, and --loupe-alignment arguments are no where to be found in these fastq files (unless I submit an ungodly sample name to the genomics core). Instead, I can provide a metadata.csv and use splitCsv() to store and consume all the required arguments.

id	sample	cytaimage	image	slide	area	json
18_57617_A1	18_57617_A1_D1	etc/assay_CAVG10505_2023-12-06_10-13-34_V43L25-333_1701876913_CytAssist/CAVG10505_2023-12-06_10-35-13_2023-12-06_10-13-34_V43L25-333_D1_18-57617-A1.tif	etc/tiff/18-57617-A1.tif	V43L25-333	D1	etc/json/18_57617_A1.json
20_24241_B2	20_24241_B2_A1	etc/assay_CAVG10505_2023-12-06_10-13-34_V43L25-333_1701876913_CytAssist/CAVG10505_2023-12-06_10-35-13_2023-12-06_10-13-34_V43L25-333_A1_20-24241-B2.tif	etc/tiff/20-24241-B2.tif	V43L25-333	A1	etc/json/20_24241_B2.json

In the working directory, I have ${sample}_{S7,S8}_{L001,L002}_{R1,R2}_001.fastq.gz files. id and sample arguments in the .csv file must follow such format above. I think spaceranger is expecting some pre-determined fastq.gz read pairs across a couple of sequencing lanes.

etc/ is a subdirectory with CytAssist images, hi-res images, and alignment json files.

Nextflow

The full main.nf looks like this:

nextflow.enable.dsl=2
params.csv = "$projectDir/metadata.csv"
params.transcriptome = "/home/skim823/projects/def-fdick/skim823/genomes/spacerange_hg38/refdata-gex-GRCh38-2020-A"
params.probeSet = "/home/skim823/projects/def-fdick/skim823/programs/spaceranger-2.1.1/probe_sets/Visium_Human_Transcriptome_Probe_Set_v2.0_GRCh38-2020-A.csv"

csv_ch = Channel
            .fromPath(params.csv)
            .splitCsv(header: true)
            .map(
                row -> 
                tuple(row.id,
                row.sample,
                file (row.cytaimage),
                file (row.image),
                row.slide,
                row.area,
                file(row.json))
            )

transcriptome_ch = Channel.fromPath(params.transcriptome)
probeSet_ch = Channel.fromPath(params.probeSet)

process SPACECOUNT {
    publishDir "$projectDir/output", mode: "copy"
    cpus 32
    memory 128.GB
    time 2.h
    clusterOptions '--account=def-muram'

    input:
    tuple val(id), val(sample), file (cytaimage), file (image), val(slide), val(area), file (json)
    // setting directories as path() doesn't seem to work. It can't resolve relative paths. If I just use val(), I just have to express parameters as absolute paths in the script. 
    // path doesn't work but file does!
    path transcriptome
    path probeSet

    output:
    path "$id/"

    script:
    """
    spaceranger count --id $id  --fastqs $baseDir --sample $sample --cytaimage $cytaimage --image $image --slide $slide --area $area --loupe-alignment $json --transcriptome $transcriptome --probe-set $probeSet
    """
}

workflow {
    SPACECOUNT(csv_ch, transcriptome_ch.collect(), probeSet_ch.collect())
}

Important

within .map() (lines 9-18), must use file() instead of path() (error otherwise)
line 34: must use file() for file paths instead of… path() (no error, but the relative path does not resolve). I thought file() was DSL=1 lingo, but maybe not?
reference

\ (•◡•) /