Stranded Mapping from Oriented Long Reads
David A Eccles
Abstract
This protocol demonstrates how to map strand-oriented long reads to a genome, and visualise them in a genome browser.
The general idea is to use minimap2 to create stranded BAM files, which are split for forward/reverse orientation then converted into BigWig format for display in a genome browser.
Input(s) :
-
stranded fastq files (see protocol Preparing Reads for Stranded Mapping)
-
a FASTA file containing the genome / sequence of interest. Output(s):
-
Genome-mapped stranded BAM files
-
Genome-mapped stranded BigWig files
Before start
Steps
Orient Reads
Orient reads as per protocol Preparing Reads for Stranded Mapping.
If this has been done, then the following command should produce output without errors:
for bc in $(awk '{print $2}' barcode_counts.txt);
do ls oriented/${bc}_reads_dirAdjusted.fq.gz;
done
```Example output:
oriented/BC03_reads_dirAdjusted.fq.gz oriented/BC04_reads_dirAdjusted.fq.gz oriented/BC05_reads_dirAdjusted.fq.gz oriented/BC06_reads_dirAdjusted.fq.gz oriented/BC07_reads_dirAdjusted.fq.gz oriented/BC08_reads_dirAdjusted.fq.gz
Index Preparation
Prepare genome index for spliced alignment:
Software
Value | Label |
---|---|
minimap2 | NAME |
Linux | OS_NAME |
https://github.com/lh3/minimap2 | REPOSITORY |
Heng Li | DEVELOPER |
https://github.com/lh3/minimap2/releases/tag/v2.24 | LINK |
2.24 | VERSION |
minimap2 -d GRCm39.primary_assembly.genome-splice.idx -Q -t 10 -x splice GRCm39.primary_assembly.genome.fa
Prepare a chromosome size index file:
samtools faidx GRCm39.primary_assembly.genome.fa
cat GRCm39.primary_assembly.genome.fa.fai | awk '{print $1"\t"$2}' > GRCm39.primary_assembly.genome.chrInfo.txt
Read Mapping
Map the long reads to the genome using minimap2, using samtools to covert to a sorted BAM format. This is where the reverse complementing done during demultiplexing gives a big saving of effort. As this BAM file is one of the main outputs, the run name is added to the file name.
Software
Value | Label |
---|---|
SAMtools | NAME |
Linux | OS_NAME |
https://github.com/samtools/samtools | REPOSITORY |
Wellcome Trust Sanger Institute | DEVELOPER |
http://www.htslib.org/download/ | LINK |
1.8 | VERSION |
runName="CHANGE_THIS";
indexFile="/index/location/GRCm39.primary_assembly.genome-splice.idx";
mkdir -p mapped;
for bc in $(awk '{print $2}' barcode_counts.txt);
do echo ${bc};
minimap2 -t 10 -a -x splice ${indexFile} oriented/${bc}_reads_dirAdjusted.fq.gz | \
samtools view -b | samtools sort > mapped/mm2_${runName}_called_${bc}_vs_genome.bam;
done
Creating BigWig Coverage Files (not needed for JBrowse 2)
A bedGraph of coverage is created using BEDTools. The default options for BEDTools treat sequence deletions (which happen frequently in nanopore reads) as a drop in coverage, which can make exon hunting and coverage calculation more difficult. I submitted a pull request to add an additional ignoreD parameter to the command line to allow cDNA reads with split coverage across introns to ignore deletions when considering coverage; this request has now been incorporated into the main BEDtools repository (as of v2.30.0):
Software
Value | Label |
---|---|
BEDtools | NAME |
Linux | OS_NAME |
https://github.com/arq5x/bedtools2 | REPOSITORY |
Aaron Quinlan | DEVELOPER |
https://github.com/arq5x/bedtools2/releases/tag/v2.30.0 | LINK |
2.30.0 | VERSION |
runName="CHANGE_THIS";
for bc in $(awk '{print $2}' barcode_counts.txt);
do echo ${bc};
basename="mapped/mm2_${runName}_called_${bc}_vs_genome";
samtools view -b ${basename}.bam | \
bedtools genomecov -bga -strand '+' -split -ignoreD -ibam - > ${basename}.bg.plus
samtools view -b ${basename}.bam | \
bedtools genomecov -bga -strand '-' -split -ignoreD -ibam - > ${basename}.bg.minus
perl -i -pe 's/([0-9]+)$/-$1/' ${basename}.bg.minus
done;
Stranded bedgraph files are converted to bigwig. This requires BEDTools and a genome information file containing chromosome lengths, as generated in the first step:
runName="CHANGE_THIS";
infoFile="/index/location/GRCm39.primary_assembly.genome-splice.chrInfo.txt";
for bc in $(awk '{print $2}' barcode_counts.txt);
do echo ${bc};
basename="mapped/mm2_${runName}_called_${bc}_vs_genome"
bedGraphToBigWig ${basename}.bg.plus ${infoFile} ${basename}.bw.plus
bedGraphToBigWig ${basename}.bg.minus ${infoFile} ${basename}.bw.minus
done
JBrowse 2 Configuration
Software
Value | Label |
---|---|
JBrowse 2 | NAME |
Linux | OS_NAME |
https://github.com/GMOD/jbrowse-components | REPOSITORY |
GMOD | DEVELOPER |
https://github.com/GMOD/jbrowse-components/releases/tag/v1.6.7 | LINK |
1.6.7 | VERSION |
Each BAM file should have its own JBrowse rack. JBrowse 2 can create both alignment views and coverage plots from BAM files:
runName="CHANGE_THIS";
jbrowseLoc="/path/to/jbrowse2/folder";
for bc in $(awk '{print $2}' barcode_counts.txt);
do echo ${bc};
basename="mapped/mm2_${runName}_called_${bc}_vs_genome";
# create index
samtools index ${basename}.bam;
# copy and create track in Jbrowse2 config file
jbrowse add-track ${basename}.bam --load copy --out ${jbrowseLoc} --subDir bam
done;
If you have a "smart" file system mount that doesn't let you directly copy files via the jbrowse add-track command, you can use the /tmp file system and symlinks to do the same thing:
runName="CHANGE_THIS";
jbrowseLoc="/path/to/jbrowse2/folder";
mkdir -p /tmp/jbrowse/bam
rm -v /tmp/jbrowse/bam/*
mkdir -p ${jbrowseLoc}/bam
for bc in $(awk '{print $2}' barcode_counts.txt);
do echo ${bc};
basename="mapped/mm2_${runName}_called_${bc}_vs_genome";
# create index
samtools index ${basename}.bam;
# copy JBrowse2 config file to temporary folder
cp -v ${jbrowseLoc}/config.json /tmp/jbrowse/
# copy and create track in Jbrowse2 config file
jbrowse add-track ${basename}.bam --load symlink --out /tmp/jbrowse --subDir bam
# copy BAM files and updated config file to JBrowse 2 folder
cp -v /tmp/jbrowse/config.json ${jbrowseLoc}
cp -v ${basename}.bam ${basename}.bam.bai ${jbrowseLoc}/bam
done;