RatGTEx pipeline
Abraham Palmer, Daniel Munro, Laura M Saba, Hao Chen, Pejman Mohammadi
Abstract
This is the pipeline used to process data for the RatGTEx Portal. It is loosely based on the GTEx eQTL mapping pipeline, though it includes some utility scripts from there in their entirety. All code for this pipeline can be found in the repository. It is built on Snakemake, a Python-based framework for reproducible data analysis. The commands reproduced here use Snakemake-style templating, with variables in brackets to represent different input/output file names and parameters.
Steps
Align RNA-Seq reads
Generate the index for STAR.
Software
Value | Label |
---|---|
STAR | NAME |
https://github.com/alexdobin/STAR | LINK |
2.7.9a | VERSION |
#STAR index
STAR --runMode genomeGenerate \
--genomeDir {params.outdir} \
--genomeFastaFiles {input.fasta} \
--sjdbGTFfile {input.gtf} \
--sjdbOverhang {params.overhang} \
--runThreadN {resources.cpus}
Get an individual-specific VCF file. This is used by STAR to consider the individual's variants for better alignment.
Software
Value | Label |
---|---|
bcftools | NAME |
https://github.com/samtools/bcftools | REPOSITORY |
https://github.com/samtools/bcftools | LINK |
1.12 | VERSION |
#bcftools view
bcftools view -s {wildcards.rat_id} --min-ac=1 -O z -o {output} {input}
Align RNA-Seq reads for a sample using STAR.
Software
Value | Label |
---|---|
STAR | NAME |
https://github.com/alexdobin/STAR | LINK |
2.7.9a | VERSION |
#STAR align
STAR --runMode alignReads \
--runThreadN {resources.cpus} \
--genomeDir {params.index_dir} \
--readFilesIn {params.fastq_list} \
--readFilesCommand zcat \
--twopassMode Basic \
--varVCFfile <(zcat {input.vcf}) \
--waspOutputMode SAMtag \
--quantMode TranscriptomeSAM \
--outSAMstrandField intronMotif \
--outSAMattrRGline {params.read_groups} \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outFileNamePrefix {params.prefix}
Identify and correct sample mixups
Get all exon regions from gene annotations.
#Exons from GTF annotations
grep -v '^#' {input} | awk '$3=="exon"' | cut -f1,4,5 | gzip -c > {output}
Subset genotypes to only exon SNPs, SNPs with MAF >= 0.2, and SNPs with <10% missing values.
Software
Value | Label |
---|---|
bcftools | NAME |
https://github.com/samtools/bcftools | REPOSITORY |
https://github.com/samtools/bcftools | LINK |
1.12 | VERSION |
#bcftools view
bcftools view {input.vcf} \
--regions-file {input.regions} \
-Ou | bcftools view \
--min-af 0.2:minor \
-i 'F_MISSING<0.1' \
-Oz -o {output.vcf}
Software
Value | Label |
---|---|
samtools | NAME |
https://github.com/samtools/samtools | REPOSITORY |
https://github.com/samtools/samtools | LINK |
1.12 | VERSION |
#tabix
tabix -p vcf {output.vcf}
Count reads with REF vs. ALT allele in a sample for each SNP.
Software
Value | Label |
---|---|
gatk | NAME |
#ASEReadCounter
gatk ASEReadCounter \
-R {input.ref} \
-I {input.bam} \
-V {input.vcf} \
-O {output} \
--min-depth-of-non-filtered-base 10 \
--min-mapping-quality 250 \
--min-base-quality 15
Compare similarity of genotypes and allele counts from RNA-Seq to identify mixups.
Produces a matrix of RNA-Seq samples x VCF individuals giving similarity across the test SNPs. Similarity is mean(1 - abs(RNA_frac - geno_frac)), where RNA_frac is fraction of reads with ALT allele for each SNP, and geno_frac is fraction of DNA strands with ALT allele (i.e. 0, 0.5, or 1) for each SNP. Also produces a summary table with top similarity per RNA-Seq sample to quickly check for mismatches.
For samples without a genotype match, check for matches in all rats.
After running the previous step, some mismatched RNA samples might still not have a genotype match. This rule will compare their test SNPs to those of all 6000+ rats we have genotypes for to see if any match. It's good to also include at least one RNA sample ID that did match as a positive control (as long as it's included in the all-rat VCF).
Examine the outputs to identify samples that need to be relabeled (e.g. if two labels get swapped) or removed.
- To relabel a sample, edit the ID in the 2nd column of fastq_map.txt for all of its FASTQ files so that its BAM file gets labeled correctly. You'll then need to regenerate the BAM file since it will now use the correct VCF individual as input to STAR.
- To remove a sample, remove its ID from rat_ids.txt and delete its BAM and any other generated files.
If a match is found with a rat outside the current dataset in the previous step, you'll need to add the new matching genotypes to the VCF file. However, if the matching genotypes differ greatly from the current dataset, e.g. they include a very different set of SNPs, it may be better to just remove the sample.
Quantify gene expression
Generate the index for RSEM.
Software
Value | Label |
---|---|
RSEM | NAME |
#rsem-prepare-reference
rsem-prepare-reference \
{input.fasta} \
ref/rsem_reference/rsem_reference \
--gtf {input.gtf} \
--num-threads {resources.cpus}
Quantify expression from a BAM file.
Software
Value | Label |
---|---|
RSEM | NAME |
#rsem-calculate-expression
rsem-calculate-expression \
{params.paired_end_flag} \
--num-threads {resources.cpus} \
--quiet \
--estimate-rspd \
--no-bam-output \
--alignments {input.bam} \
{params.ref_prefix} \
{params.out_prefix}
Software
Value | Label |
---|---|
gzip | NAME |
#gzip
gzip {params.out_prefix}.genes.results
Combine all isoforms of a gene into a single transcript.
Originally used in the GTEx pipeline.
python3 src/collapse_annotation.py \
ref/Rattus_norvegicus.Rnor_6.0.99.gtf \
ref/Rattus_norvegicus.Rnor_6.0.99.genes.gtf
Combine RSEM output from all samples into log-count and TPM expression tables.
Also computes inverse-quantile normalized values to use for eQTL mapping. The filtered version is to avoid a tensorQTL error on phenotypes with 1 nonzero value.
python3 src/assemble_expression.py {params.rsem_dir} {input.anno} {params.prefix}
Software
Value | Label |
---|---|
samtools | NAME |
https://github.com/samtools/samtools | REPOSITORY |
https://github.com/samtools/samtools | LINK |
1.12 | VERSION |
#bgzip
bgzip {params.prefix}.log2.bed
bgzip {params.prefix}.tpm.bed
bgzip {params.prefix}.iqn.bed
bgzip {params.prefix}.iqn.filtered.bed
#tabix
tabix {params.prefix}.log2.bed.gz
tabix {params.prefix}.tpm.bed.gz
tabix {params.prefix}.iqn.bed.gz
tabix {params.prefix}.iqn.filtered.bed.gz
Map eQTLs
Get SNPs with sufficient variation in a given set of samples and convert VCF to plink.
Software
Value | Label |
---|---|
plink | NAME |
https://www.cog-genomics.org/plink/2.0/ | LINK |
2 | VERSION |
#plink2 make bed
plink2 --make-bed \
--vcf {input.vcf} \
--keep {input.samples} \
--maf 0.01 \
--mac 2 \
--max-alleles 2 \
--out {params.prefix}
Prune genotypes to compute covariate PCs.
--indep-pairwise parameters are based on GTEx methods.
Software
Value | Label |
---|---|
plink | NAME |
https://www.cog-genomics.org/plink/2.0/ | LINK |
2 | VERSION |
#plink2 prune
plink2 \
--bfile {params.prefix} \
--geno 0.05 \
--maf 0.05 \
--indep-pairwise 200 100 0.1 \
--out {params.pruned_prefix}
#plink2 subset to pruned
plink2 \
--bfile {params.prefix} \
--extract {params.pruned_prefix}.prune.in \
--export vcf bgz id-paste=iid \
--out {params.pruned_prefix}
Compute genotype (n=5) and expression (n=20) PCs and combine into covariates table.
Rscript src/covariates.R {input.vcf} {input.bed} {params.n_geno_pcs} {params.n_expr_pcs} {output}
Map cis-eQTLs, determining significance using permutations.
Outputs the top association per gene. This script calls tensorQTL and uses the random_tiebreak parameter, which is important for outbred populations in which there are often multiple top eSNPs in 100% LD. Without the random tiebreak, the first or last tied top SNP is returned, resulting in positional bias.
Software
Value | Label |
---|---|
tensorQTL | NAME |
python3 src/run_tensorqtl.py \
{params.geno_prefix} \
{input.bed} \
{output} \
--covariates {input.covar} \
--mode cis
Use stepwise regression to identify multiple conditionally independent cis-eQTLs per gene.
Software
Value | Label |
---|---|
tensorQTL | NAME |
python3 src/run_tensorqtl.py \
{params.geno_prefix} \
{input.bed} \
{output} \
--covariates {input.covar} \
--cis_output {input.cis} \
--mode cis_independent
Get summary statistics for all tested cis-window SNPs per gene.
Software
Value | Label |
---|---|
tensorQTL | NAME |
#tensorqtl cis_nominal
python3 -m tensorqtl \
{params.geno_prefix} \
{input.bed} \
{wildcards.tissue} \
--covariates {input.covar} \
--output_dir {params.outdir} \
--mode cis_nominal
Map trans-eQTLs.
Software
Value | Label |
---|---|
tensorQTL | NAME |
#tensorqtl trans
python3 -m tensorqtl \
{params.geno_prefix} \
{input.bed} \
{wildcards.tissue} \
--covariates {input.covar} \
--output_dir {params.outdir} \
--output_text \
--batch_size 10000 \
--mode trans
Extract all significant cis SNP-gene pairs.
python3 src/tensorqtl_all_signif.py {input.perm} {params.nom_prefix} {output}
Extract p-values for all tested cis-window SNPs per gene.
python3 src/tensorqtl_all_cis_pvals.py {params.nom_dir} {output}
Get effect size (allelic fold change) for top association per gene and all significant cis-eQTLs.
Software
Value | Label |
---|---|
aFC | NAME |
https://github.com/secastel/aFC | REPOSITORY |
python3 tools/aFC/aFC.py \
--vcf {input.vcf} \
--pheno {input.bed} \
--qtl <(python3 src/prepare_qtl_for_afc.py {input.qtl} {input.qtl_indep}) \
--cov {input.covar} \
--log_xform 1 \
--output {output}