Label-free quantification (LFQ) proteomic data analysis from DIA-NN output files

Yan Chen, Christopher J Petzold

Published: 2022-08-03 DOI: 10.17504/protocols.io.5qpvobk7xl4o/v1

Disclaimer

This protocol is for research purposes only.

Abstract

This protocol details the analysis of label-free quantification (LFQ) data from data independent acquisition (DIA) discovery (shotgun) proteomic experiments and generates a series of outputs.

Before start

INPUTS:

Required:

DIA-NN peptide report file (Experiment report.pr_matrix.csv)

Optional:

A list of selected proteins for bar chart visualization with Protein.Group identifiers (e.g., P0C054, P0C058)
A list of two-sample comparisons of different samples (Sample A vs. Sample B; Sample B vs. Sample C, etc.)

OUTPUTS:

Note

Abundance values correspond to summed peptide peak area in arbitrary units.
SVG files are provided for easy editing with Adobe Illustrator or similar programs.
You can visualize the .plotly files by using Plotly or a Colab Jupyter notebook. This provides an interactive view in which you can see data labels, zoom, and save parts of the plot as separate .png files.

Protein data table (CSV) - a full list of protein quantitative values from the summed peptide abundances
Summary Protein data table (CSV) - a full list of protein quantitative values averaged over the sample replicates
Protein data table in EDD upload format (CSV) - a full list of protein quantitative values from the summed peptide abundances in EDD data upload format with Time (e.g., 24h) and Units (e.g., counts)

If applicable:

Summary data table of a selected list of proteins (CSV) - a list of selected protein quantitative values averaged over the sample replicates
Bar charts of selected proteins in .png, .svg, and .plotly formats
Individual bar charts of protease, heat shock, or insoluble expression marker protein(s) abundance if they are present (.png, .svg, and .plotly formats)
Data tables with the Welch's t-test results
Data tables with a list of the significantly UP and DOWN regulated proteins if there are any
Volcano plots visualizing the Welch's t-test p-value significance and log(2) normalized Fold Change (FC) between the two samples (.png, .svg, and .plotly formats)
Volcano plots visualizing the t-test adjusted p-value (Benjamini-Hochberg) significance and log(2) normalized Fold Change (FC) between the two samples (.png, .svg, and .plotly formats)

Steps

Data processing

We start with the peptide-level output file from a DIA-NN search of the DIA data (link to DIA-NN paper) and trim unused columns from the report to simplify the analysis.

DIA-NN report restricted to:

Protein.Group
Protein.Name
Genes
Protein.Description
Sample
Replicate
Abundance value (Peptide peak area in arbitrary units)

The peptide abundance values are summed to give protein abundances. The resulting data table is exported as:

Full_list_proteins_XXXXXXXXX-xxxxxxx.csv
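The summing step can be sketched with a pandas group-by. This is a minimal illustration, not the protocol's actual script: it assumes the trimmed report is in long format with one row per peptide measurement, uses the column names listed above plus an "Abundance" column (a hypothetical name), and the values are invented.

```python
import pandas as pd

# Hypothetical long-format peptide table; abundance values are invented.
peptides = pd.DataFrame({
    "Protein.Group": ["P0C054", "P0C054", "P0C058", "P0C054", "P0C058"],
    "Sample": ["A", "A", "A", "A", "A"],
    "Replicate": [1, 1, 1, 2, 2],
    "Abundance": [1.0e6, 2.0e6, 5.0e5, 2.5e6, 7.0e5],
})

# Sum all peptide peak areas that belong to the same protein
# in the same sample and replicate.
proteins = (
    peptides
    .groupby(["Protein.Group", "Sample", "Replicate"], as_index=False)["Abundance"]
    .sum()
)
print(proteins)
```

Each row of `proteins` then holds one protein abundance per sample replicate, which is the shape the later averaging and t-test steps work from.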

Note

A file for Experiment Data Depot (EDD) data import is also generated with the name: Full_list_proteins_EDDformat_XXXXXXXXX-xxxxxxx.csv. Directions for the EDD import process can be found here.

The protein abundances of the sample replicates are then averaged (mean), and the standard deviation (SD) and percent coefficient of variation (CV%) are calculated. The resulting data table is exported as:

Full_list_proteins_summary_XXXXXXXXX-xxxxxxx.csv
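The replicate summary can be sketched as follows. The column names and values are hypothetical, and the sketch assumes the sample standard deviation (n - 1 denominator, the pandas default); the protocol does not state which SD convention its script uses.

```python
import pandas as pd

# Hypothetical protein-level table: three replicates of one sample.
proteins = pd.DataFrame({
    "Protein.Group": ["P0C054"] * 3 + ["P0C058"] * 3,
    "Sample": ["A"] * 6,
    "Replicate": [1, 2, 3, 1, 2, 3],
    "Abundance": [90.0, 100.0, 110.0, 200.0, 200.0, 200.0],
})

# Mean and SD across replicates for each protein in each sample.
summary = (
    proteins
    .groupby(["Protein.Group", "Sample"])["Abundance"]
    .agg(mean="mean", sd="std")  # pandas std defaults to ddof=1 (sample SD)
    .reset_index()
)

# CV% expresses the SD as a percentage of the mean.
summary["cv_percent"] = 100.0 * summary["sd"] / summary["mean"]
print(summary)
```

A low CV% indicates that the replicates agree closely; a protein with identical replicate values has a CV% of 0.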

Note

A similar output file is generated for a select list of proteins if one is provided. The resulting data table is exported as: Selected_proteins_summary_XXXXXXXXX-xxxxxxx.csv

Selected Proteins: Bar Charts

If a list of selected proteins is provided, bar charts are generated in .png, .svg, and .plotly formats:

selectproteins-bar_XXXXXXXX-xxxxxx.png

selectproteins-bar_XXXXXXXX-xxxxxx.svg

selectproteins-bar_XXXXXXXX-xxxxxx.plotly

Note

NOTE: You can visualize the .plotly files by using Plotly or a Colab Jupyter notebook. This provides an interactive view in which you can see data labels, zoom, and save parts of the plot as separate .png files.

If other commonly analyzed proteins (e.g., insoluble protein diagnostic marker proteins, proteases, heat shock proteins) are detected and quantified, then a bar chart is generated in .png, .svg, and .plotly formats with only the corresponding data:

Example filenames:

insol-marker-bar_XXXXXXXX-xxxxxx.png

insol-marker-bar_XXXXXXXX-xxxxxx.svg

insol-marker-bar_XXXXXXXX-xxxxxx.plotly

Sample A-B comparisons: t-Test and volcano plots

If a two-sample comparison is requested, the two samples (A and B) are compared with a Welch's t-test by using the ttest_ind_from_stats function from scipy (details here). This is comparable to the Excel function t-Test: Two-Sample Assuming Unequal Variances.

For this analysis:

Missing values and zero abundance values are filled with '1000', a value that is just below our level of detection.
Abundance values are log2 transformed prior to the t-test.
The False Discovery Rate (FDR; adjusted p-value; q-value) is calculated by the Benjamini-Hochberg method by using the statsmodels.stats.multitest.multipletests function.
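The three points above can be sketched together. This is an illustration under stated assumptions, not the protocol's actual script: the abundance matrices are invented (rows are proteins, columns are replicates), and the log2 fold change is taken as the difference of the mean log2 abundances, which the protocol does not specify. The fill value of 1000 and the two library functions are the ones named in the text.

```python
import numpy as np
from scipy.stats import ttest_ind_from_stats
from statsmodels.stats.multitest import multipletests

FILL_VALUE = 1000.0  # fill value stated in the protocol

# Hypothetical protein x replicate abundance matrices for samples A and B.
sample_a = np.array([[4.0e6, 4.2e6, 3.9e6],
                     [1.0e5, np.nan, 1.2e5],
                     [8.0e4, 8.5e4, 7.9e4]])
sample_b = np.array([[1.0e6, 1.1e6, 0.9e6],
                     [1.1e5, 1.0e5, 0.0],
                     [8.2e4, 8.1e4, 8.4e4]])

def prepare(x):
    """Replace missing and zero values with FILL_VALUE, then log2 transform."""
    filled = np.where(np.isnan(x) | (x == 0), FILL_VALUE, x)
    return np.log2(filled)

log_a, log_b = prepare(sample_a), prepare(sample_b)

# Welch's t-test from per-protein summary statistics;
# equal_var=False selects the unequal-variance (Welch) form.
t_stat, p_value = ttest_ind_from_stats(
    mean1=log_a.mean(axis=1), std1=log_a.std(axis=1, ddof=1), nobs1=log_a.shape[1],
    mean2=log_b.mean(axis=1), std2=log_b.std(axis=1, ddof=1), nobs2=log_b.shape[1],
    equal_var=False,
)

# Assumption: log2 fold change (A over B) as a difference of log2 means.
log2_fc = log_a.mean(axis=1) - log_b.mean(axis=1)

# Benjamini-Hochberg adjusted p-values.
reject, p_adj, _, _ = multipletests(p_value, alpha=0.05, method="fdr_bh")
print(log2_fc, p_value, p_adj)
```

Because the adjustment depends on how many proteins are tested, the adjusted p-values are never smaller than the raw ones.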

Significantly changing proteins are defined as:

a p-value (or adjusted p-value) < 0.05
a fold change of > 2 (UP) or < 0.5 (DOWN)
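Applied to a results table, these two criteria amount to a pair of boolean filters. The sketch below uses a hypothetical table with invented values and column names; the same filters apply to the adjusted p-value by swapping the column.

```python
import numpy as np
import pandas as pd

# Hypothetical t-test results; identifiers and values are invented.
results = pd.DataFrame({
    "Protein.Group": ["P1", "P2", "P3", "P4"],
    "log2_FC": [2.0, -1.5, 0.3, 3.0],
    "p_value": [0.001, 0.01, 0.001, 0.2],
})

P_CUTOFF = 0.05
LOG2_FC_CUTOFF = np.log2(2.0)  # fold change of 2 equals 1 on the log2 scale

significant = results["p_value"] < P_CUTOFF
# UP: fold change > 2; DOWN: fold change < 0.5 (log2 below -1).
up = results[significant & (results["log2_FC"] > LOG2_FC_CUTOFF)]
down = results[significant & (results["log2_FC"] < -LOG2_FC_CUTOFF)]

print(up["Protein.Group"].tolist())
print(down["Protein.Group"].tolist())
```

Here P3 fails the fold-change criterion and P4 fails the p-value criterion, so neither is reported as significantly changing.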

The resulting data tables are exported as:

Full t-test export:

t-Test_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

Significantly changing proteins (p-value < 0.05):

t-Test_signifDOWN_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

t-Test_signifUP_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

Significantly changing proteins (adjusted p-value < 0.05):

t-Test_signifDOWN_adj-p_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

t-Test_signifUP_adj-p_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

Note

Note: The definition of 'significance' for your experiment may be different from these values. You can use the full t-test output to select data based on your criteria or process the full dataset as needed.

If a list of selected proteins is provided, two volcano plots are generated in .png, .svg, and .plotly formats (six volcano plot outputs in total) for each two-sample comparison:

log2(Fold change) vs. -log10(p-value) plots:

Volcano_plot_p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.png

Volcano_plot_p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.svg

Volcano_plot_p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.plotly

log2(Fold change) vs. -log10(adjusted p-value) plots:

Volcano_plot_adj-p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.png

Volcano_plot_adj-p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.svg

Volcano_plot_adj-p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.plotly

The significance cutoffs are defined as:

Fold Change = 0.5x and 2x (-1 and 1 on the log2 axis)
p-value and adj-p-value = 0.05 (1.3 on the -log10 axis)
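The axis positions of these cutoffs follow directly from the transforms used for the plot. A short check of the arithmetic (variable names are illustrative only):

```python
import math

# Vertical cutoff lines: fold changes of 0.5x and 2x on the log2 axis.
fc_lines = [math.log2(0.5), math.log2(2.0)]

# Horizontal cutoff line: p = 0.05 on the -log10 axis.
p_line = -math.log10(0.05)

print(fc_lines)          # [-1.0, 1.0]
print(round(p_line, 2))  # 1.3
```

Points above the horizontal line and outside the two vertical lines are the proteins that meet both significance criteria.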

Note

NOTE: You can visualize the .plotly files by using Plotly or a Colab Jupyter notebook. This provides an interactive view in which you can see data labels, zoom, and save parts of the plot as separate .png files.

Note

NOTE: Typically there are more significantly changing (UP & DOWN) proteins observed in the p-value plot than the adjusted-p-value plot. Which plot is most applicable for your experiment will depend on the questions of interest.
