Label-free quantification (LFQ) proteomic data analysis from DIA-NN output files

Yan Chen, Christopher J Petzold

Published: 2022-08-03 DOI: 10.17504/protocols.io.5qpvobk7xl4o/v1

Disclaimer

This protocol is for research purposes only.

Abstract

This protocol details the analysis of label-free quantification (LFQ) data from data independent acquisition (DIA) discovery (shotgun) proteomic experiments and generates a series of outputs.

Before start

INPUTS:

Required:

  • DIA-NN peptide report file (Experiment report.pr_matrix.csv)

Optional:

  • A list of selected proteins for bar chart visualization with Protein.Group identifiers (e.g., P0C054, P0C058)

  • A list of two-sample comparisons of different samples (Sample A vs. Sample B; Sample B vs. Sample C, etc.)

OUTPUTS:

Note
  • Abundance values correspond to summed peptide peak area in arbitrary units.
  • SVG files are provided for easy editing with Adobe Illustrator or similar programs.
  • You can visualize the .plotly files by using Plotly or a Colab Jupyter notebook. This provides an interactive view in which you can see data labels, zoom, and save parts of the plot as separate .png files.

  • Protein data table (CSV) - a full list of protein quantitative values from the summed peptide abundances
  • Summary Protein data table (CSV) - a full list of protein quantitative values averaged over the sample replicates
  • Protein data table in EDD upload format (CSV) - a full list of protein quantitative values from the summed peptide abundances in EDD data upload format with Time (e.g., 24h) and Units (e.g., counts)

If applicable:

  • Summary data table of a selected list of proteins (CSV) - a list of selected protein quantitative values averaged over the sample replicates
  • Bar charts of selected proteins in .png, .svg, and .plotly formats
  • Individual bar charts of protease, heat shock, or insoluble expression marker protein(s) abundance if they are present (.png, .svg, and .plotly formats)
  • Data tables with the Welch's t-test results
  • Data tables with a list of the significantly UP and DOWN regulated proteins if there are any
  • Volcano plots visualizing the Welch's t-test p-value significance and log(2) normalized Fold Change (FC) between the two samples (.png, .svg, and .plotly formats)
  • Volcano plots visualizing the t-test adjusted p-value (Benjamini-Hochberg) significance and log(2) normalized Fold Change (FC) between the two samples (.png, .svg, and .plotly formats)

Steps

Data processing

1.

We start with the peptide output file from a DIA-NN search of the data-independent acquisition (DIA) data (link to DIA-NN paper) and trim unused columns from the report to simplify the analysis.

DIA-NN report restricted to:

  • Protein.Group
  • Protein.Name
  • Genes
  • Protein.Description
  • Sample
  • Replicate
  • Abundance value (Peptide peak area in arbitrary units)
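
As a rough illustration of this trimming step, the sketch below reshapes a small, made-up wide peptide matrix into the restricted long format. The toy columns, the gene names, the run names, and the `<Sample>_<Replicate>` naming convention are assumptions for illustration only; a real DIA-NN report has more columns, and its run names depend on the raw file names.

```python
import pandas as pd

# Toy stand-in for a DIA-NN peptide matrix (all values are made up).
wide = pd.DataFrame({
    "Protein.Group": ["P0C054", "P0C054", "P0C058"],
    "Genes": ["geneA", "geneA", "geneB"],          # placeholder gene names
    "Precursor.Id": ["PEPTIDEA2", "PEPTIDEB2", "PEPTIDEC3"],
    "SampleA_1": [1.0e6, 2.0e6, 5.0e5],
    "SampleA_2": [1.2e6, 1.8e6, None],             # None = not quantified
    "SampleB_1": [3.0e6, 4.0e6, 7.0e5],
})

id_cols = ["Protein.Group", "Genes", "Precursor.Id"]

# Melt the run columns into one row per peptide per run.
long = wide.melt(id_vars=id_cols, var_name="Run", value_name="Abundance")

# Assumed naming convention: "<Sample>_<Replicate>".
parts = long["Run"].str.rsplit("_", n=1, expand=True)
long["Sample"] = parts[0]
long["Replicate"] = parts[1].astype(int)

# Keep only the columns needed downstream and drop unquantified entries.
long = long.drop(columns="Run").dropna(subset=["Abundance"])
print(long)
```
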

2.

The peptide abundance values are summed to the protein abundances. The resulting data table is exported as:

Full_list_proteins_XXXXXXXXX-xxxxxxx.csv

Output file: Full_list_proteins_XXXXXXXXX-xxxxxxx.csv
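
The summation in this step can be sketched with a pandas group-by. This is a minimal example on made-up values, assuming a long-format table with the column names listed in step 1; it is not the protocol's actual script.

```python
import pandas as pd

# Made-up long-format peptide table (one row per peptide per run).
peptides = pd.DataFrame({
    "Protein.Group": ["P0C054", "P0C054", "P0C058", "P0C054", "P0C058"],
    "Sample": ["SampleA", "SampleA", "SampleA", "SampleB", "SampleB"],
    "Replicate": [1, 1, 1, 1, 1],
    "Abundance": [1.0e6, 2.0e6, 5.0e5, 4.0e6, 7.0e5],
})

# Sum peptide peak areas to one abundance per protein, sample, and replicate.
proteins = (
    peptides
    .groupby(["Protein.Group", "Sample", "Replicate"], as_index=False)["Abundance"]
    .sum()
)
print(proteins)
```

The resulting table could then be written out with `proteins.to_csv(...)` under whatever file name the run uses.
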

Note
A file for Experiment Data Depot (EDD) data import is also generated with the name: Full_list_proteins_EDDformat_XXXXXXXXX-xxxxxxx.csv. Directions for the EDD import process can be found here.

3.

Then the protein abundances of the sample replicates are averaged (mean), and the standard deviation (SD) and percent coefficient of variation (CV%) are calculated. The resulting data table is exported as:

Full_list_proteins_summary_XXXXXXXXX-xxxxxxx.csv

Output file: Full_list_proteins_summary_XXXXXXXXX-xxxxxxx.csv
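
The replicate summary can be sketched as follows on made-up values. The protocol does not state whether the SD is the sample or population form; this sketch assumes the sample SD (pandas default, ddof=1) and computes CV% as 100 × SD / mean.

```python
import pandas as pd

# Made-up protein abundances for three replicates of one sample.
proteins = pd.DataFrame({
    "Protein.Group": ["P0C054"] * 3 + ["P0C058"] * 3,
    "Sample": ["SampleA"] * 6,
    "Replicate": [1, 2, 3, 1, 2, 3],
    "Abundance": [90.0, 100.0, 110.0, 200.0, 200.0, 200.0],
})

# Mean, sample SD (ddof=1), and replicate count per protein and sample.
summary = (
    proteins
    .groupby(["Protein.Group", "Sample"])["Abundance"]
    .agg(Mean="mean", SD="std", N="count")
    .reset_index()
)

# Percent coefficient of variation.
summary["CV%"] = 100.0 * summary["SD"] / summary["Mean"]
print(summary)
```
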

Note
A similar output file is generated for a selected list of proteins if one is provided. The resulting data table is exported as: Selected_proteins_summary_XXXXXXXXX-xxxxxxx.csv
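
Restricting the summary to a selected list might look like the sketch below. It assumes exact matching on Protein.Group identifiers; the table values and the `P00000` identifier are placeholders.

```python
import pandas as pd

# Made-up summary table; "P00000" is a placeholder identifier.
summary = pd.DataFrame({
    "Protein.Group": ["P0C054", "P0C058", "P00000"],
    "Sample": ["SampleA"] * 3,
    "Mean": [3.0e6, 5.0e5, 1.0e6],
})

# Optional user-supplied list of Protein.Group identifiers.
selected = ["P0C054", "P0C058"]

# Keep only rows whose identifier is in the selected list (exact match).
selected_summary = summary[summary["Protein.Group"].isin(selected)]
print(selected_summary)
```
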

Selected Proteins: Bar Charts

4.

If a list of selected proteins is provided, bar charts are generated in .png, .svg, and .plotly formats:

  • selectproteins-bar_XXXXXXXX-xxxxxx.png
  • selectproteins-bar_XXXXXXXX-xxxxxx.svg
  • selectproteins-bar_XXXXXXXX-xxxxxx.plotly

Proteins labels

Note
NOTE: You can visualize the .plotly files by using Plotly or a Colab Jupyter notebook. This provides an interactive view in which you can see data labels, zoom, and save parts of the plot as separate .png files.

5.

If other commonly analyzed proteins (e.g., insoluble protein diagnostic marker proteins, proteases, heat shock proteins) are detected and quantified, then a bar chart is generated in .png, .svg, and .plotly formats with only the corresponding data:

Example filenames:

insol-marker-bar_XXXXXXXX-xxxxxx.png

insol-marker-bar_XXXXXXXX-xxxxxx.svg

insol-marker-bar_XXXXXXXX-xxxxxx.plotly

Insoluble protein bar chart

Sample A-B comparisons: t-Test and volcano plots

6.

If applicable, two samples (A and B) are selected for comparison, and a Welch's t-test is performed by using the ttest_ind_from_stats function from scipy (details here). This is comparable to the Excel function t-Test: Two-Sample Assuming Unequal Variances.

For this analysis:

  • Missing values and zero abundance values are filled with '1000', a value that is just below our level of detection.
  • Abundance values are log2 transformed prior to the t-Test
  • The False Discovery Rate (FDR; adjusted p-value; q-value) is calculated by the Benjamini-Hochberg method by using the statsmodels.stats.multitest.multipletests function.
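
The fill, log2 transform, and Welch's t-test for a single protein can be sketched as below. The replicate values are made up, the helper name `prepare` is hypothetical, and computing the log2 fold change as the difference of log2 means is an assumption, since the protocol does not spell out that formula.

```python
import numpy as np
from scipy.stats import ttest_ind_from_stats

FILL_VALUE = 1000.0  # fill value stated in the protocol

# Made-up replicate abundances for one protein in two samples.
sample_a = np.array([4.0e6, 4.4e6, 3.8e6])
sample_b = np.array([1.0e6, np.nan, 0.0])  # one missing, one zero value

def prepare(values):
    """Replace missing or zero abundances with the fill value, then log2."""
    filled = np.where(np.isnan(values) | (values == 0), FILL_VALUE, values)
    return np.log2(filled)

log_a, log_b = prepare(sample_a), prepare(sample_b)

# Welch's t-test from summary statistics; std must be the sample SD (ddof=1).
result = ttest_ind_from_stats(
    mean1=log_a.mean(), std1=log_a.std(ddof=1), nobs1=log_a.size,
    mean2=log_b.mean(), std2=log_b.std(ddof=1), nobs2=log_b.size,
    equal_var=False,  # unequal variances, i.e. Welch's t-test
)

# Assumed definition: log2(A over B) as the difference of log2 means.
log2_fc = log_a.mean() - log_b.mean()
print(result.statistic, result.pvalue, log2_fc)
```
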

Significantly changing proteins are defined as:

  • a p-value (or adjusted p-value) < 0.05
  • a fold change of > 2 (UP) or < 0.5 (DOWN)
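
To make the adjustment and the cutoffs concrete, here is a small sketch on made-up p-values and fold changes. The protocol uses the statsmodels multipletests function; the hand-written `benjamini_hochberg` helper below is only a dependency-free stand-in for the same step-up calculation, and `classify` is a hypothetical helper name.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted  # restore the original order
    return adjusted

def classify(fold_change, pvalue, alpha=0.05):
    """Apply the protocol's cutoffs: p < 0.05 and FC > 2 (UP) or < 0.5 (DOWN)."""
    if pvalue < alpha and fold_change > 2:
        return "UP"
    if pvalue < alpha and fold_change < 0.5:
        return "DOWN"
    return "not significant"

# Made-up results for five proteins.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
fold_changes = [4.0, 0.25, 3.0, 1.5, 0.4]

adj = benjamini_hochberg(pvals)
by_p = [classify(fc, p) for fc, p in zip(fold_changes, pvals)]
by_adj = [classify(fc, q) for fc, q in zip(fold_changes, adj)]
print(adj, by_p, by_adj)
```

In this toy set the third protein passes the raw p-value cutoff but not the adjusted one, which is why the adjusted tables are usually shorter.
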

The resulting data tables are exported as:

Full t-test export:

t-Test_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

Significantly changing proteins (p-value < 0.05):

t-Test_signifDOWN_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

t-Test_signifUP_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

Significantly changing proteins (adjusted p-value < 0.05):

t-Test_signifDOWN_adj-p_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

t-Test_signifUP_adj-p_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.csv

Note
Note: The definition of 'significance' for your experiment may be different from these values. You can use the full t-test output to select data based on your criteria or process the full dataset as needed.

7.

If a list of selected proteins is provided, two volcano plots are generated in .png, .svg, and .plotly formats (six total volcano plot visualization outputs) for the two sample comparisons:

  • log2 (Fold change) vs. -log10(p-value) plots:
      Volcano_plot_p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.png
      Volcano_plot_p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.svg
      Volcano_plot_p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.plotly
  • log2 (Fold change) vs. -log10(adjusted-p-value) plots:
      Volcano_plot_adj-p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.png
      Volcano_plot_adj-p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.svg
      Volcano_plot_adj-p-value_SampleA_OVER_SampleB_XXXXXXXX-xxxxxx.plotly

The significance cutoffs are defined as:

  • Fold Change = 0.5x and 2x (-1 and 1 on the log2 axis)
  • p-value and adj-p-value = 0.05 (1.3 on the -log10 axis)
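
The axis positions of these cutoffs follow directly from the transforms, as the short check below shows. The helper name `volcano_xy` is hypothetical and only illustrates how a single protein would be placed on the plot.

```python
import math

fc_cutoffs = (0.5, 2.0)   # fold-change cutoffs from the protocol
p_cutoff = 0.05           # p-value / adjusted p-value cutoff

# Vertical cutoff lines on the log2(fold change) axis.
x_lines = [math.log2(fc) for fc in fc_cutoffs]

# Horizontal cutoff line on the -log10(p) axis (about 1.3).
y_line = -math.log10(p_cutoff)

def volcano_xy(fold_change, pvalue):
    """Return the (x, y) position of one protein on a volcano plot."""
    return math.log2(fold_change), -math.log10(pvalue)

print(x_lines, round(y_line, 2))
print(volcano_xy(4.0, 0.001))
```
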
Volcano Plot

Note
NOTE: You can visualize the .plotly files by using Plotly or a Colab Jupyter notebook. This provides an interactive view in which you can see data labels, zoom, and save parts of the plot as separate .png files.

Note
NOTE: Typically there are more significantly changing (UP & DOWN) proteins observed in the p-value plot than the adjusted-p-value plot. Which plot is most applicable for your experiment will depend on the questions of interest.
