August 7, 2019
The Multitude of Glycoforms
Dan Serie

It’s the late 2010s, and genomic technologies have matured. The knowledge gleaned from yesterday’s basic research is beginning to make inroads into clinical practice, so it’s natural to seek out new frontiers. The central dogma of biology dictates that DNA stores our genetic information, RNA carries a transcribed copy of it for translation into proteins, and it’s these proteins which carry out the functions of life. But the story does not end there, as proteins themselves are decorated with various molecules that alter their function – so-called post-translational modifications (PTMs). Exacting measurement of these states will help unlock the full power of proteomics, the next frontier.

The most prominently studied PTMs, such as methylation, acetylation, and phosphorylation, are relatively straightforward. They consist of a handful of atoms and function as on-off switches – they’re either present at a given site, or not. Downstream effects of these binary PTMs have been illuminated in the study of epigenetics, histones, and protein kinases. The true wilderness is a little further out, in the realm of complex PTMs such as glycosylation.

A glycan is a branching tree of sugar chains, often consisting of hundreds of atoms, whose sequence and composition determine its structure and function. Glycans are attached to asparagine (N-linked) or serine/threonine (O-linked) residues in translated peptide sequences. Glycans of various motifs can be found on membrane proteins in every cell in your body and enable a multitude of additional functions in secreted proteins. They facilitate intercellular communication and mediate interactions with the immune system, and aberrant patterns of glycosylation have been implicated in all manner of disease [1-3]. Indeed, a coherent claim can be made for their evolutionary import [4]. But apart from a handful of well-characterized proteins (immunoglobulins in particular [5]), the heuristics by which they modify function remain largely unknown. This is mostly due to issues in assessment; compared to binary PTMs, it’s significantly more difficult to measure long, branching oligosaccharides that can make up a substantial proportion of a proteoform’s weight.

But modern mass spectrometry has opened up new vistas of biological inquiry. Large-scale measurement of site-specific glycosylation has only been possible in recent years, and advances in instrumentation and software have facilitated workflows that were previously tedious or impossible. The ability to reliably assess these states will enable downstream elucidation of biological function at a scale we could only dream of before. We stand on the cusp of an “epiproteomic” revolution, in which knowledge of fully specific proteoforms will supplant the hazy capabilities currently associated with a given protein (which, as it stands, are typically a weighted average of its possible states).

This is where InterVenn comes in. Our technology allows us to quantify the abundance of proteins and their site-specific glycosylation, promising great insight into the underlying biology of disease. We have just begun to mark associations between these widely varying glycoforms and their effect on human health. And indeed the number of possible glycoforms is staggering – the multiplicative expansion is huge for any PTM, but with the variety of glycan motifs available, the scale becomes astronomical.

Just what sort of scale are we dealing with? Prominent groups have attempted to address this question through methods ranging from pure experimentation [6] to mathematical approaches [7]. Though the “correct” answer is currently out of reach, we can find approximations for several common PTMs. Here we calculate a rough, back-of-the-envelope estimate of the expansion of the proteome based on modifications taken one at a time.

To arrive at such an estimate, we work with genome-wide averages. We calculate the mean number of available amino acid residues for each PTM from the UniProt reference sequence (UP000005640 [8]), pull the approximate percentage of those sites that are occupied from the literature, and couple this with the number of possible states per PTM (plus one for a non-modified state). From here an exponential equation yields the multiplicative modifier for the proteome (Figure 1).
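As a minimal sketch of how the three ingredients might combine, here is one plausible reading of that exponential equation in Python. The exact equation lives in Figure 1, so the form used here, (states + 1) raised to the expected number of occupied sites, is an assumption, and the function name and inputs are hypothetical placeholders rather than values from our analysis.

```python
def ptm_multiplier(sites_per_protein: float, occupancy: float, states_per_site: int) -> float:
    """Per-protein multiplier for a single PTM.

    ASSUMED form (not confirmed by the post): the base is the number of
    modified states plus one for the non-modified state, and the exponent
    is the expected number of occupied sites per protein.
    """
    expected_occupied_sites = sites_per_protein * occupancy
    return (states_per_site + 1) ** expected_occupied_sites


# Placeholder inputs for illustration only: a binary PTM (one modified state)
# with 10 candidate sites per protein, half of them occupied.
example = ptm_multiplier(sites_per_protein=10.0, occupancy=0.5, states_per_site=1)
print(example)  # 2 ** 5 = 32.0
```

Under this reading, the base grows with the number of states while sites and occupancy only enter through the exponent, which is why a PTM with many possible forms can outweigh one with many possible sites.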

Of course, in doing so we have made a number of simplifying (possibly horrifying!) assumptions. For a complete rundown, see The Fine Print at the end of this post, but in brief: we ignore the contributions of protein isoforms, amino acid polymorphisms, and the combinatorics generated by multiple simultaneous PTMs; we maximize the possible sites of attachment; we ignore rare sites of modification; and we base site occupancy on a literature review. Please check out the fine print.

With those caveats out of the way, we present the back of the envelope.

Generally speaking, the mathematical interplay between the number of potential sites, percent occupancy, and possible states favors PTMs with complex forms, i.e. glycosylation and ubiquitination. Phosphorylation has the largest number of possible sites, but still doesn’t come close to making up this difference in base. The low rate of acetylation may reflect the historical study of epigenetic modification of histones, though this role has been expanded by further research [17].

O-linked glycosylation claims the biggest multiplier given its high number of sites, occupancy, and selection of motifs available. One can play with these numbers a bit, but qualitatively the complexity of the glyco-code remains dominant. Notably, the multiplier for N-linked glycosylation is one order of magnitude smaller than that for O-linked, though it’s likely that the larger N-glycans individually result in more structural and functional variation [18]. Interestingly, this point can be seen amidst the mathematics, where the selective residue pattern of N-glycans (with fewer than three sites per protein on average) and extraordinarily high site occupancy point towards the same deeper meaning: when and where N-linked glycosylation occurs, it’s vital. Recent studies assessing the complex interplay between glycan heterogeneity and protein dynamics [19] have borne out this conclusion.

Combining these estimates and multiplying by the number of genes in the database (20,874), we estimate a total of 4.19 billion possible proteoforms, based on singular PTM expansions alone. Without even considering PTM combinatorics, this is roughly twice the number of stars in the Milky Way! An obvious counterpoint is that the number of possible proteoforms is much greater than the number of existing proteoforms, and this is certainly true. But then, our colleagues working in the synthetic biology space may have something to say about that before too long…
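To get a feel for what that total means per protein, the arithmetic can be run in reverse. The short sketch below uses only the two figures quoted above (the total and the protein count); the variable names are ours, and it makes no assumption about how the individual PTM estimates were combined.

```python
TOTAL_PROTEOFORMS = 4.19e9   # total quoted in the text
PROTEIN_COUNT = 20_874       # entries in the UniProt reference proteome used here

# Combined per-protein multiplier implied by the two figures above.
implied_multiplier = TOTAL_PROTEOFORMS / PROTEIN_COUNT
print(f"{implied_multiplier:,.0f}")  # 200,728
```

In other words, the estimate corresponds to roughly two hundred thousand singly modified forms per protein on average.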

At InterVenn we’re excited to tap into this new level of -omics, and hope to enable the broader research community to do the same. Changes in protein glycosylation are extraordinarily important, and understanding them will be necessary to unlock the power of proteomics. One danger is that this new language is too dense and ad hoc to yield to human understanding. Luckily, we also sit on the cusp of a golden era in deep learning research, and neural nets will serve us well as translators, interpreters, and enablers. As we set these technologies loose to make sense of this galaxy of molecular interactions, our goal remains applications in the greater multiverse of human health.

The Fine Print

  • We utilize the 20,874 unique proteins from UniProt [8], which discards isoforms resulting from alternative splicing. This also disregards the expansion that occurs due to single-nucleotide polymorphisms (SNPs) in the genome, which result in single amino-acid polymorphisms (SAPs).
  • We ignore the combinatorics that result from multiple different PTMs existing simultaneously in a given protein, and disregard effects of competitive site occupancy (for example, at sites where both phosphorylation and O-glycosylation could occur, only one modification can be present). For an interesting paper taking a preliminary stab at this question, see here [7].
  • The number of sites for each PTM is a maximal estimate; that is, we count any possible site using the broadest rules for which amino acid residues could be modified. Additional heuristics limit which sites actually result in a bound modification, though this is accounted for somewhat in the site occupancy estimate.
  • Conversely, there are rare sites of attachment for all these PTMs: histidines for acetylation, cysteines for glycosylation, etc. We’re sticking with the classical sites of attachment to utilize their more accurate picture of site occupancy. Hopefully, these are just rounding errors.
  • And to address one other classical site – we’re ignoring N-terminus modifications since they are very common.
  • Our site-occupancy estimates, or “stoichiometry” in the analytical chemistry realm, are based on literature estimates, which are in turn based on samples of the entire proteome (from specific cellular states). These may lend the most uncertainty to our estimates.
  • The number of possible PTM states per site is probably an overestimate, as some potential sites couldn’t support larger branching glycans or long ubiquitin chains, due to the nature of the folded protein.
  • Finally, a word for ubiquitination. I had difficulty finding estimates for the prevalence of various chains and linkages. So we liberally allow for poly-ubiquitin chains of length 0-10 based on this paper [12], with any of the 8 possible linkages available at each juncture. This is the basis for our estimate of 8^9+2: one non-ubiquitinated state, one mono-ubiquitinated, and then 8 possibilities each for the 9 polyubiquitin states from length 2-10. The low stoichiometry for this PTM reins in the massive number of possible states.
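The ubiquitination state count in the last bullet is easy to check by hand. The sketch below evaluates 8^9+2 as stated, and then, purely as a cross-check under our own assumption (not a method from the post), counts chains length by length, treating a chain of length k as having k-1 junctures with 8 linkage choices each. All names are illustrative.

```python
LINKAGES = 8           # possible linkages at each juncture
MAX_CHAIN_LENGTH = 10  # longest polyubiquitin chain allowed

polyubiquitin_lengths = range(2, MAX_CHAIN_LENGTH + 1)  # lengths 2..10, nine in all

# Estimate as stated in the text: 8^9 plus the unmodified and mono-ubiquitinated states.
post_estimate = LINKAGES ** len(polyubiquitin_lengths) + 2
print(post_estimate)  # 134217730

# ASSUMED alternative counting: sum the arrangements over every chain length,
# with 8 ** (k - 1) arrangements for a chain of length k.
per_length_total = 2 + sum(LINKAGES ** (k - 1) for k in polyubiquitin_lengths)
print(per_length_total)  # 153391690
```

Both counts land at the same order of magnitude, since the longest chains dominate the total, so the choice between them barely moves a back-of-the-envelope estimate.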


  1. Pinho, S. S., & Reis, C. A. (2015). Glycosylation in cancer: mechanisms and clinical implications. Nature Reviews Cancer, 15(9), 540–555.
  2. Rudd, P. M. (2001). Glycosylation and the Immune System. Science, 291(5512), 2370–2376.
  3. Frenkel-Pinter, M., Shmueli, M. D., Raz, C., Yanku, M., Zilberzwige, S., Gazit, E., & Segal, D. (2017). Interplay between protein glycosylation pathways in Alzheimer’s disease. Science Advances, 3(9), e1601576.
  4. Lauc, G., Kristic, J., & Zoldos, V. (2014). Glycans – the third revolution in evolution. Frontiers in Genetics, 5.
  5. Jennewein, M. F., & Alter, G. (2017). The Immunoregulatory Roles of Antibody Glycosylation. Trends in Immunology, 38(5), 358–372.
  6. Aebersold, R., Agar, J. N., Amster, I. J., Baker, M. S., Bertozzi, C. R., Boja, E. S., … Zhang, B. (2018). How many human proteoforms are there? Nature Chemical Biology, 14(3), 206–214.
  7. Compton, P. D., Kelleher, N. L., & Gunawardena, J. (2018). Estimating the Distribution of Protein Post-Translational Modification States by Mass Spectrometry. Journal of Proteome Research, 17(8), 2727–2734.
  8. UniProtKB (May 16, 2019). Proteomes – Homo sapiens (Human).
  9. Hansen, B. K., Gupta, R., Baldus, L., Lyon, D., Narita, T., Lammers, M., … Weinert, B. T. (2019). Analysis of human acetylation stoichiometry defines mechanistic constraints on protein regulation. Nature Communications, 10(1).
  10. Lim, M. Y., O’Brien, J., Paulo, J. A., & Gygi, S. P. (2017). Improved Method for Determining Absolute Phosphorylation Stoichiometry Using Bayesian Statistics and Isobaric Labeling. Journal of Proteome Research, 16(11), 4217–4226.
  11. Li, Y., Evers, J., Luo, A., Erber, L., Postler, Z., & Chen, Y. (2018). A Quantitative Chemical Proteomics Approach for Site-specific Stoichiometry Analysis of Ubiquitination. Angewandte Chemie International Edition, 58(2), 537–541.
  12. Nguyen, L. K., Dobrzyński, M., Fey, D., & Kholodenko, B. N. (2014). Polyubiquitin chain assembly and organization determine the dynamics of protein activation and degradation. Frontiers in Physiology, 5.
  13. Stanley P, Taniguchi N, Aebi M. N-Glycans. 2017. In: Varki A, Cummings RD, Esko JD, et al., editors. Essentials of Glycobiology [Internet]. 3rd edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 2015-2017. Chapter 9. doi: 10.1101/glycobiology.3e.009
  14. Song, T., Aldredge, D., & Lebrilla, C. B. (2015). A Method for In-Depth Structural Annotation of Human Serum Glycans That Yields Biological Variations. Analytical Chemistry, 87(15), 7754–7762.
  15. Clark, P. M., Rexach, J. E., & Hsieh-Wilson, L. C. (2013). Visualization of O-GlcNAc Glycosylation Stoichiometry and Dynamics Using Resolvable Poly(ethylene glycol) Mass Tags. Current Protocols in Chemical Biology. John Wiley & Sons, Inc.
  16. UniProt Consortium. Swiss Institute of Bioinformatics.
  17. Narita, T., Weinert, B. T., & Choudhary, C. (2018). Functions and mechanisms of non-histone protein acetylation. Nature Reviews Molecular Cell Biology, 20(3), 156–174.
  18. Freeze, H. H., Chong, J. X., Bamshad, M. J., & Ng, B. G. (2014). Solving Glycosylation Disorders: Fundamental Approaches Reveal Complicated Pathways. The American Journal of Human Genetics, 94(2), 161–175.
  19. Wu, D., Struwe, W. B., Harvey, D. J., Ferguson, M. A. J., & Robinson, C. V. (2018). N-glycan microheterogeneity regulates interactions of plasma proteins. Proceedings of the National Academy of Sciences, 115(35), 8763–8768.
