Since mass spectrographs and spectrometers were introduced in the early 1900s,1 mass spectrometry (MS) has undergone tremendous technological improvements. Once a methodology primarily used by chemists, MS is now an incredibly versatile analytical technique with several applications in research including structural biology, clinical diagnostics, environmental analysis, forensics, food and beverage analysis, omics and beyond.
MS produces a vast quantity of data that needs to be analyzed. Managing, processing and interpreting these large data outputs is computationally intensive and often prone to errors, particularly when manual or semi-automated processes are used. Consequently, artificial intelligence (AI) and machine learning (ML) have become immensely popular for processing and statistically analyzing MS-generated data, as they can be applied across biological disciplines,2 limit errors and enhance data analysis.
This article delves into what MS data analysis entails and its associated challenges, how AI/ML can aid analyses and exciting potential future developments in the field, with specific applications to proteomics and metabolomics research.
What does MS data analysis entail?
“Data analysis in proteomics and metabolomics is a complex, multi-step process that begins with the collection of biological samples and culminates in the extraction of meaningful biological insights,” Dr. Wout Bittremieux, assistant professor in the Adrem Data Laboratory at the University of Antwerp, said.
After laborious sample preparation to extract the proteins, peptides or metabolites of interest,3 the analytes are ionized and introduced into a mass spectrometer, where they are detected based on their mass-to-charge (m/z) ratio, producing a mass spectrum. The coupling of MS with other analytical tools, such as gas chromatography and liquid chromatography, allows for the further separation and identification of such analytes.
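To make the m/z ratio concrete, the following minimal Python sketch computes the monoisotopic mass of an unmodified peptide and its m/z at a given charge state. It assumes positive-mode ionization by protonation and standard monoisotopic residue masses; the function names `peptide_mass` and `mz` are illustrative and are not taken from any particular software package.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per peptide for the free N- and C-termini
PROTON = 1.007276   # mass of each charge-carrying proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(sequence, charge):
    """m/z of the peptide carrying `charge` protons (assumes charge >= 1)."""
    return (peptide_mass(sequence) + charge * PROTON) / charge
```

For example, `mz("PEPTIDE", 2)` gives roughly 400.69, while the singly charged ion of the same peptide appears near 800.37, which is why one analyte can produce several peaks in a spectrum.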
“One of the key challenges in MS is the accurate annotation of MS spectra to their corresponding molecules,” Dr. Bittremieux said.
“In proteomics, the dominant method for this task is sequence database searching. This relies on comparing experimental to theoretical spectra simulated from peptides assumed to be present. However, these theoretical spectra are often oversimplified and do not capture detailed fragment ion intensity information, which can lead to significant ambiguities and false identifications.”
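The idea behind sequence database searching can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the scoring function of any real search engine: it generates singly charged b- and y-ion m/z values for a candidate peptide and counts how many of them are found in the observed spectrum. The function names and the 0.02 m/z tolerance are assumptions chosen for the example. Note that the theoretical spectrum contains no intensity information at all, which is exactly the oversimplification described above.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mzs(sequence):
    """Singly charged b- and y-ion m/z values of an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    fragments = []
    for i in range(1, len(masses)):
        fragments.append(sum(masses[:i]) + PROTON)          # b-ion (N-terminal part)
        fragments.append(sum(masses[i:]) + WATER + PROTON)  # y-ion (C-terminal part)
    return sorted(fragments)


def shared_peak_count(observed_mzs, sequence, tolerance=0.02):
    """Number of theoretical fragments matched by at least one observed peak."""
    return sum(
        any(abs(obs - theo) <= tolerance for obs in observed_mzs)
        for theo in fragment_mzs(sequence)
    )


def best_candidate(observed_mzs, candidates, tolerance=0.02):
    """Candidate peptide with the most matched fragments (ties: first listed)."""
    return max(candidates, key=lambda seq: shared_peak_count(observed_mzs, seq, tolerance))
```

Because every matched fragment counts equally, two candidates that share many fragment masses can receive the same score, one source of the ambiguities mentioned in the quote.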
Once data are quantified, either relatively or absolutely, statistical analyses can take place to facilitate biological interpretation.
“To contextualize the results, pathway analysis tools can be used to map the identified proteins or metabolites onto known biological pathways to help in understanding the functional implications of the changes observed in the data. Alternatively, biomarker candidates can be identified based on their ability to distinguish between different biological conditions or groups,” Bittremieux explained.
Applying AI/ML to MS data analysis
Although some scientists still have concerns about large-scale AI implementation, AI and ML have become indispensable tools for MS data analysis, aiding clinical decisions, guiding metabolic engineering and stimulating fundamental biological discoveries.
Applying AI/ML in MS research aims to minimize errors associated with data analysis (including high noise levels, batch effects during measurements and missing values),4 enhance usability and maximize data outputs.5 Further, training ML models on large datasets of empirical MS spectra allows the generation of highly accurate predicted spectra that closely match the experimental data.6 This overcomes the limitations of traditional sequence database searching, which relies on crude, theoretical spectra.
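One common way to express how closely a predicted spectrum matches an experimental one is the cosine similarity of their binned intensities. The Python sketch below is a minimal illustration under stated assumptions: spectra are plain lists of (m/z, intensity) pairs, the 0.05 m/z bin width is arbitrary, and the function names are hypothetical. In practice the predicted spectrum would come from a trained model; here it is simply supplied by hand.

```python
import math


def bin_spectrum(peaks, bin_width=0.05):
    """Sum peak intensities into fixed-width m/z bins, keyed by bin index."""
    binned = {}
    for peak_mz, intensity in peaks:
        key = int(peak_mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spectrum_a, spectrum_b, bin_width=0.05):
    """Cosine similarity of two spectra given as (m/z, intensity) pairs."""
    a = bin_spectrum(spectrum_a, bin_width)
    b = bin_spectrum(spectrum_b, bin_width)
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or all-zero spectrum matches nothing
    return dot / (norm_a * norm_b)
```

Unlike a simple count of shared peaks, this score falls when the peaks are in the right places but their relative intensities disagree, so accurate intensity prediction helps separate correct from incorrect candidates. Fixed-width binning is a simplification: two peaks that lie very close together but on opposite sides of a bin boundary are treated as different.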
Developments in AI/ML have led to more accurate, efficient and comprehensive interpretations of biological data, including de novo peptide sequencing.7
“De novo peptide sequencing, which involves determining the peptide sequence directly from tandem (MS/MS) spectra without relying on a reference database, is a challenging problem. ML approaches are starting to impact this area significantly by learning patterns from known spectra and using them to predict peptide sequences from unknown spectra, making it feasible to analyze complex proteomes without relying solely on existing protein databases,” Dr. Bittremieux said.
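To illustrate why sequences can in principle be read from a spectrum without a database, the Python sketch below uses the classical mass-ladder idea: the m/z difference between consecutive fragment ions of the same series equals the mass of one amino acid residue. This is not the ML approach described in the quote, only a toy version of the underlying problem. It assumes a complete, noise-free series of singly charged ions, and the function name `read_ladder` and the 0.02 m/z tolerance are illustrative choices.

```python
# Monoisotopic residue masses (Da). Isoleucine is omitted because it has the
# same mass as leucine, so "L" here stands for "L or I".
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ion_mzs, tolerance=0.02):
    """Translate gaps between consecutive ions of one series into residues.

    Returns one letter per gap; "?" marks a gap that matches no single residue.
    """
    ordered = sorted(ion_mzs)
    residues = []
    for lower, upper in zip(ordered, ordered[1:]):
        delta = upper - lower
        # Pick the residue whose mass is closest to the observed gap.
        aa, mass = min(RESIDUE_MASS.items(), key=lambda item: abs(item[1] - delta))
        residues.append(aa if abs(mass - delta) <= tolerance else "?")
    return "".join(residues)
```

Given the b-ion series of the peptide PEPTIDE, the function returns "EPTLD": the first residue is not recovered from gaps alone, and isoleucine is reported as "L". Real spectra add missing peaks, noise and mixed ion series on top of these ambiguities, which is where learned models become valuable.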
Another area AI/ML has been applied to in MS data analysis is repository-scale data analysis.8 Public data repositories have continued to expand, now containing millions to billions of MS spectra. Despite existing data providing ample opportunity to extract new biological insights, the sheer volume of data presents significant challenges in terms of data processing and analysis.
“We have developed AI algorithms capable of performing large-scale analyses across these repositories, identifying patterns across experiments and detecting novel peptides and proteins that were previously missed. This has led to discoveries that would have been impossible with manual or traditional computational methods.”
Recent developments in AI/ML
Although advancements in AI have been fruitful, the unique nature of MS data makes a direct translation of these advancements to MS data analysis non-trivial.
“One of the most significant recent advancements in AI relevant to MS data analysis is the development of more sophisticated deep learning models capable of handling high-dimensional data and extracting intricate patterns,” said Bittremieux.
“For example, transformer neural networks, which were originally developed for natural language processing, are now effectively used to ‘translate’ between sequences of peaks in tandem MS spectra to sequences of amino acids during de novo peptide sequencing. These models can learn from vast amounts of empirical MS data, identifying subtle features that traditional methods might overlook.”
“Despite such advancements, the successful application of AI to MS data still requires deep expertise in both AI and MS. This multidisciplinary skill set remains relatively rare, which has slowed the broader adoption of AI in the field. However, as more researchers receive training in both areas and as AI tools become more accessible, we are beginning to see a new generation of scientists capable of bridging this gap.”
Looking towards the future
Although significant advancements in AI and ML have aided the continual development of MS data analysis, there is still room for improvement.
“One of the key areas where I believe future developments should be focused is on the generation and curation of high-quality, large-scale datasets. While advancements in AI model architectures have been impressive, these models are only as good as the data they are trained on,” said Bittremieux.
Greater availability of diverse MS datasets would ultimately enable the development of AI tools suitable for use across multiple experimental conditions and different areas of biology.9
“These datasets should include comprehensive annotations, such as accurate peptide and metabolite identifications, quantification data and metadata related to sample preparation and instrument settings. This diversity will enable AI models to learn more generalizable patterns, improving their performance across different applications.”
Researchers sometimes test their models on cherry-picked datasets, which contributes to a lack of standardized evaluations of model performance. Dr. Bittremieux explained that “the development of benchmarking suites would allow for a fair comparison of different algorithms, fostering transparency and driving genuine progress in the field.”
“As AI tools become more accessible and interpretable, we will likely see a surge in innovative applications, from personalized medicine to environmental monitoring.”
References
1. Wilkinson DJ. Historical and contemporary stable isotope tracer approaches to studying mammalian protein metabolism. Mass Spectrom Rev. 2018;37(1):57-80. doi:10.1002/mas.21507
2. Neagu AN, Jayathirtha M, Baxter E, Donnelly M, Petre BA, Darie CC. Applications of tandem mass spectrometry (MS/MS) in protein analysis for biomedical research. Molecules. 2022;27(8):2411. doi:10.3390/molecules27082411
3. Luque-Garcia JL, Neubert TA. Sample preparation for serum/plasma profiling and biomarker identification by mass spectrometry. J Chromatogr A. 2007;1153(1):259-276. doi:10.1016/j.chroma.2006.11.054
4. Liebal UW, Phan ANT, Sudhakar M, Raman K, Blank LM. Machine learning applications for mass spectrometry-based metabolomics. Metabolites. 2020;10(6):243. doi:10.3390/metabo10060243
5. Beck AG, Muhoberac M, Randolph CE, et al. Recent developments in machine learning for mass spectrometry. ACS Meas Sci Au. 2024;4(3):233-246. doi:10.1021/acsmeasuresciau.3c00060
6. Adams C, Gabriel W, Laukens K, et al. Fragment ion intensity prediction improves the identification rate of non-tryptic peptides in timsTOF. Nat Commun. 2024;15(1):3956. doi:10.1038/s41467-024-48322-0
7. Yilmaz M, Fondrie WE, Bittremieux W, et al. Sequence-to-sequence translation from mass spectra to peptides with a transformer model. Nat Commun. 2024;15(1):6427. doi:10.1038/s41467-024-49731-x
8. Bittremieux W, May DH, Bilmes J, Noble WS. A learned embedding for efficient joint analysis of millions of mass spectra. Nat Methods. 2022;19(6):675-678. doi:10.1038/s41592-022-01496-1
9. Dens C, Adams C, Laukens K, Bittremieux W. Machine learning strategies to tackle data challenges in mass spectrometry-based proteomics. J Am Soc Mass Spectrom. 2024;35(9):2143-2155. doi:10.1021/jasms.4c00180