Ever since Watson and Crick first described the molecular structure of nucleic acids in their seminal 1953 Nature paper, the world of science, and increasingly the biopharmaceutical/diagnostic industry, has been enthralled with genetics and genomics. Rightly so, as the genome lies at the root of all biological processes. The tiny alphabet of nucleic acids, now well characterized, serves as the blueprint for the immense complexity of life. This complexity, however, is not explained by the genome itself.
We have come to appreciate that the impact of the genome is critically determined by factors that modify, diversify, and amplify the relatively modest complexity of the genetic blueprint. This yields several levels of increasingly differentiated information. Among these are epigenetics, which refers to processes, such as methylation, that determine which subset of the genetic information is actually transcribed. Additionally, there are many pools of molecules downstream in the biological cascade, such as the transcriptome, the proteome, or the metabolome. Each of these molecular pools eclipses the information content found at the level of the genome by many orders of magnitude. The static nature of the germline genome and its modest complexity (2 × 10⁴ genes) inherently pose limitations on its information content. The dynamic range of genomic information is similarly narrow, typically framed in binary terms as the presence or absence of a given mutation.
Within the biological cascade, a much richer source of information resides at the level of proteins, at an estimated complexity of about 10⁶ different molecules. However, the information content of proteins is still limited by some of the structural characteristics that they share with nucleic acids, including a non-branching linear structure, a limited number of building blocks, and unvarying chemical linkage bonds.
It is at the level of further modification of the proteome where we encounter the third fundamental class of information-carrying macromolecules, alongside nucleic acids and proteins: glycans. The structure of glycans is characterized by branching architecture, a multitude of molecular linkages among individual carbohydrate residues, and a large number of different building blocks. Unlike nucleic acids and proteins, glycan structures are neither hardwired into the genome nor dependent upon a template for their synthesis. Rather, glycan structures result from the concerted actions of highly specific glycosyltransferases and glycosidases, whose actions in turn are dependent upon the concentrations and localization of high-energy nucleotide sugar donors, such as UDP-N-acetylglucosamine, the endpoint of the hexosamine biosynthetic pathway. Therefore, the glycoforms of a glycoprotein depend upon many factors directly tied to both gene expression and cellular metabolism.
Post-translational modification of proteins by the addition of glycan molecules—glycosylation—leverages the structural complexity and diversity of glycans, resulting in a vast new palette of potential information content, adding orders of magnitude to that of the proteome.
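The way glycosylation multiplies the information content of a single protein can be illustrated with a simple count. The sketch below assumes, as a simplification, that each glycosylation site is occupied independently of the others and may also remain unoccupied; the function name and the site counts are hypothetical and chosen only for illustration.

```python
from math import prod


def count_glycoforms(glycans_per_site, allow_unoccupied=True):
    """Count the distinct glycoforms of one protein.

    glycans_per_site: for each glycosylation site, the number of different
    glycan structures that may occupy it (hypothetical values).
    Assumes sites are occupied independently of one another.
    """
    extra = 1 if allow_unoccupied else 0
    return prod(n + extra for n in glycans_per_site)


# Hypothetical protein with three sites, each able to carry 10 glycans.
sites = [10, 10, 10]
print(count_glycoforms(sites))  # 11 * 11 * 11 = 1331
```

Under these assumptions, one polypeptide sequence gives rise to more than a thousand distinguishable molecular species, and the count grows multiplicatively with every additional site.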
Since glycosylation represents the most common and most complex class of post-translational modification, affecting up to 80% of all proteins, it is evident that the glycoproteome represents an extremely rich and attractive target for the discovery of highly differentiated biomarkers and clinically relevant analytes. Until recently, the challenges inherent in characterizing molecules of this complexity presented a nearly insurmountable barrier to a deeper understanding of the biological function of these myriad glycoforms.
The potential of glycoproteins as biomarkers is underlined by the fact that protein glycosylation dramatically affects protein structure, conformation, and function. Glycosylation thus plays a crucial role in the way that the “parent” protein affects intercellular and intracellular biological processes that are fundamentally important to multicellular organisms, such as cell signaling, host–pathogen interaction, immune response, and the pathogenesis of many diseases, prominently including cancer. The glycosylation pattern of a given protein, i.e., the amino acid residues to which glycans are attached and the glycan moieties present at each, is the critical determinant of the effects on protein function. In eukaryotes, protein glycosylation entails the covalent attachment of glycans either as N-linked glycosylation to asparagine residues, or as O-linked glycosylation to serine or threonine residues. Complex glycans are mainly attached to secreted or cell surface proteins, and generally do not cycle on and off the polypeptide backbone. In contrast, monosaccharide O-linked N-acetylglucosamine cycles rapidly on serine or threonine residues of many nuclear and cytoplasmic proteins.
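The residue rules for N- and O-linked attachment can be made concrete with a short sketch that lists candidate sites in a protein sequence. It relies on one fact not stated in the text above: N-linked glycans are typically found on asparagine within the consensus sequon Asn-X-Ser/Thr, where X is any residue except proline. The function names and the peptide are made up for illustration, and a matching motif marks only a potential site, not confirmed occupancy.

```python
import re

# Canonical N-glycosylation sequon: Asn, any residue except Pro, then Ser/Thr.
# The lookahead allows overlapping sequons (e.g. "NNTS") to both be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_n_sequons(sequence):
    """Return 1-based positions of Asn residues inside N-X-S/T sequons."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


def candidate_o_sites(sequence):
    """Return 1-based positions of every Ser/Thr (no consensus motif applied)."""
    return [i + 1 for i, aa in enumerate(sequence.upper()) if aa in "ST"]


peptide = "MNGTAPNPSLNNTS"  # made-up sequence, not a real protein
print(find_n_sequons(peptide))     # [2, 11, 12]
print(candidate_o_sites(peptide))  # [4, 9, 13, 14]
```

The asparagine at position 7 is skipped because it is followed by proline. No comparably simple motif is applied for O-linked sites here, so every serine and threonine is reported as a candidate.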
Detailed characterization of a glycoprotein, requiring identification both of the specific residue that carries the glycan and of its structure, represents a formidable challenge. Glycans are often profiled after their in vitro enzymatic release from polypeptides, which results in the loss of any information about the proteins and sites to which they were attached. Even though it is much more difficult to identify the individual amino acid residues to which glycans are conjugated, it is vastly more informative to do so, along with concomitant detailed profiling and structural analysis of these glycans.
Given the increasing recognition of the biological importance of protein glycosylation, the field has been the target of a rapidly growing body of basic research; however, comparatively little attention has so far been paid to its potential role in translational and applied research. The daunting task of identifying the number, structure, and function of glycans in cellular biology has been approached primarily using mass spectrometry (MS), which allows large-scale hypothesis-free analyses. However, the enormous structural diversity of glycans and the heterogeneous nature of glycosylation sites have made comprehensive analysis particularly challenging. Mass spectra tend to be complicated due to the presence of isomers, often requiring manual interpretation. Furthermore, searching databases to identify the multitude of features of a given MS chromatogram can quickly become an overpowering computational problem, requiring innovative data processing and bioinformatics solutions.
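Why isomers complicate interpretation can be seen from how a glycan's mass is computed. The sketch below sums standard monoisotopic residue masses for a given monosaccharide composition; the function name and the example composition are illustrative choices, not part of any particular software package. Because the calculation depends only on composition, glycans that differ in linkage or branching, or in which hexose they contain, yield exactly the same mass.

```python
# Monoisotopic residue masses (Da) for common monosaccharide classes.
RESIDUE_MASS = {
    "Hex": 162.05282,     # glucose, galactose, and mannose are all hexoses
    "HexNAc": 203.07937,  # GlcNAc and GalNAc
    "dHex": 146.05791,    # e.g. fucose
    "NeuAc": 291.09542,   # N-acetylneuraminic acid
}
WATER = 18.01056  # one water is added for the free reducing-end glycan


def glycan_mass(composition):
    """Monoisotopic mass of a free glycan from its residue composition."""
    return sum(RESIDUE_MASS[r] * n for r, n in composition.items()) + WATER


# Any glycan with five hexoses and two HexNAc residues has this mass,
# regardless of how the residues are linked or branched.
hex5_hexnac2 = {"Hex": 5, "HexNAc": 2}
print(round(glycan_mass(hex5_hexnac2), 3))  # 1234.433
```

Distinguishing such isomers therefore requires information beyond the intact mass, such as fragmentation patterns or chromatographic separation.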
Recent developments in MS instrumentation, fragmentation strategies, and high-throughput workflows have made analyzing intact glycoproteins more approachable, and specific enrichment strategies have made even low-abundance glycans and glycopeptides detectable. A spectrum of experimental workflows for both N- and O-linked glycans has been developed, and a variety of innovative software packages based on fragment-ion indexing strategies are now available. These offer substantial increases in speed for glycopeptide identification and site assignments in individual experiments.
However, despite these advances, scaling up the technology to allow efficient and economic processing of the much larger number of samples encountered in translational and clinical research remained a major challenge until recently. The application of powerful artificial intelligence technology for automated, high-throughput chromatogram interpretation, coupled with advanced unsupervised machine learning and neural network approaches for algorithmic compilation of panels of glycopeptides into multivariable classifiers, has recently made the application of glycoproteomics to clinically relevant questions feasible for the first time. Initial results of this approach have been highly intriguing, yielding high-accuracy predictors for a spectrum of disease indications and treatment outcomes, particularly in oncology.
In summary, there is now reason to believe that glycoproteomics will not only join the other -omics fields, which have so far taken the lion’s share of attention, but quite possibly outperform them. Comprehensive understanding of glycosylation at different levels of granularity is bound to become an increasingly important aspect of both basic and translational research.