Subjects

- Evolutionary biology

- Gene expression

- Genetic databases

- Open reading frames

- Protein databases

Abstract

A major scientific drive is to characterize the protein-coding genome, which is a primary basis for studying human health. But the fundamental question remains of what has been missed in previous analyses. Over the past decade, the translation of non-canonical open reading frames (ncORFs) has been observed across human cell types and disease states1,2,3, with major implications for biomedical science. However, a key gap in knowledge has been which ncORFs produce small microproteins or alternative protein molecules that contribute to the human proteome. Here we report the collaborative efforts of the TransCODE Consortium4 to produce a consensus landscape of protein-level evidence for ncORFs. We show that about 25% of a set of 7,264 ncORFs gives rise to detectable peptides in a large-scale analysis of 95,520 proteomics experiments. We develop an annotation framework for ncORF-encoded microproteins as human proteins and codify the new conceptual model of ‘peptideins’ as microproteins that have indeterminate potential as functional proteins. To probe the biological implications of peptideins, we create an evolutionary analysis approach, termed ORF relative branch length (ORBL), and determine that evolutionary constraint is common and associates with observation of ncORF-derived peptides. We then characterize a pan-essential cellular phenotype for one peptidein from the OLMALINC long non-coding RNA. Overall, we generate public research tools supported by GENCODE and PeptideAtlas and advance biomedical discovery for understudied components of the human proteome.

Main

Whether the human genome encodes substantially more than the approximately 19,500 canonical protein-coding genes has sparked a spirited debate in recent years. Protein-coding genes are the bedrock of biomedical investigations, including the overwhelming majority of drug development programmes. Therefore, any wholesale addition of protein-coding genes creates ripple effects across human bioscience.

Curation and maintenance of these genes is the task of reference annotation projects, such as Ensembl-GENCODE (hereafter, GENCODE) and UniProtKB/Swiss-Prot (hereafter, UniProt), the work of which builds on the Human Genome Project. Although the number of canonical protein-coding genes has been refined continuously over time, it was felt to be largely stable until recent evidence of translation of small polypeptide or protein sequences from thousands of unannotated ncORFs, which are variably referred to as microproteins, small ORF-encoded peptides (SEPs) or micropeptides (hereafter, microproteins). These ncORFs and their encoded polypeptides are now reported widely as part of a ‘dark proteome’, and their promise to advance medical science is manifested in their contributions to the genetic basis of disease5,6, mechanisms of cancer biology7, and cancer-restricted and HLA-presented cryptic antigens targetable by immunotherapy8.

Yet, the number of ncORFs that represent true protein-coding genes has been a subject of controversy. To date, few microproteins have been annotated as canonical proteins by reference annotation catalogues (such as GENCODE and UniProt) because their uncertain structure and low evolutionary constraint complicate their classification as conventional proteins. At the same time, peptides resulting from cryptic translation have become an emerging area of therapeutic target discovery in cancer and other disorders2,8,9,10.

In 2022, we launched the international TransCODE Consortium4 with the goal of defining standards for the reference annotation of ncORFs and their encoded microproteins, including members of GENCODE11, PeptideAtlas12, the Human Proteome Organization-Human Proteome Project (HUPO-HPP)13 and the HUPO-Human ImmunoPeptidome Project (HUPO-HIPP)14 (Fig. 1a). Here we develop a pathway for microproteins to be annotated as reference human proteins when annotation-quality proteomics support is present. To bring formal reference gene annotation status to less-well-characterized microproteins, we introduce ‘peptidein’ as a classification scheme, recognized by our consortia, to exist alongside conventional proteins. To illustrate that further characterization of a peptidein may elevate its classification, we use functional genomics and evolutionary constraint to pinpoint examples that exhibit a signature consistent with a protein-coding gene. Lastly, we propose a research agenda based on consensus among the multiconsortium group, intended to guide future efforts to bring ncORFs, microproteins and peptideins from research discoveries to biological, societal and biomedical impact through ongoing standardized annotation.

Fig. 1: Overview of the centres participating in the annotation effort and the PeptideAtlas framework for protease-digested sample (mostly trypsin) MS and immunopeptidomics bu

Expanding the human proteome with microproteins and peptideins | Nature