Subjects

- Genetic association study

- Genetic variation

- Genome assembly algorithms

- Sequencing

- Structural variation

Abstract

Pangenomes are revolutionizing our ability to resolve genomic regions with complex variations [1]. However, existing human pangenomes [2,3], constrained by small sample sizes, provide limited utility for medical and population genetic applications. Here we generated 1,116 diploid genome assemblies (55 de novo and 1,061 pangenome-informed) with an average size of 2.98 Gb and a mean quality value of 46 as part of the 1000 Chinese Pangenome (1KCP) project. On the basis of these assemblies, we constructed a pangenome comprising 405.3 million base pairs of sequences absent from the current references GRCh38 and CHM13, including 26.2 million base pairs of functional genic and predicted regulatory elements. We catalogued a full spectrum of genetic variation, including 35.4 million small variants, 110,530 structural variants (SVs), 485,575 tandem repeats (TRs) and 0.86 million nested variants embedded in non-reference sequences. This extensive dataset enabled detailed characterization of multiscale genic variations relevant to medical genetics, including gene-altering SVs, TR expansions, gene cluster variations and HLA gene haplotypes. Coupled with the 1KCP gene expression data, we conducted pan-variant expression quantitative trait locus (eQTL) mapping to analyse diverse variant types. We identified 3,256 eQTLs involving complex variants (SVs, TRs and nested variants) and elucidated their regulatory complexity. Finally, we developed a 1KCP pan-variant imputation reference panel, which provides multitype genetic markers to enhance the resolution of future association studies. This resource advances our understanding of complex variants and their functional implications to provide new insights into human health.

Main

Since the release of the first human reference genome [4], understanding the diversity of the human genome has become a fundamental task with far-reaching implications for human health and biology. The substantial reduction in the cost of short-read sequencing has enabled the sequencing of millions of human genomes, which has revealed a wealth of genetic variation essential for advancing genomic research and associated applications [5,6,7,8]. However, although short-read sequencing is effective for detecting small variants, such as single-nucleotide variants (SNVs) and small insertions and deletions (indels), it has limitations in identifying larger and more complex variants, such as SVs and TRs [9,10]. Recent advancements in long-read sequencing technology [11] and assembly algorithms [12] have enabled the generation of high-quality diploid genome assemblies, which provide a more comprehensive view of complex variants and enhance our understanding of their formation mechanisms and functional consequences [13]. To effectively integrate the full spectrum of genetic variants in populations into a unified framework, researchers have turned to the concept of the pangenome [14].

A pangenome refers to the collection of genome sequences in a population and is typically constructed from the diploid assemblies of multiple individuals [1]. Recent efforts by the Human Pangenome Reference Consortium (HPRC) [2] and the Chinese Pangenome Consortium (CPC) [3] have demonstrated the potential of pangenomes in resolving structurally divergent regions of the human genome. However, the relatively small sample sizes of current human pangenomes pose several challenges to their broader applications. First, rare variants, which are more likely to be pathogenic than common variants [15] and have pivotal roles in inherited diseases [16,17], are underrepresented in small-scale pangenomes. Second, the limited sample sizes hamper accurate estimation of allele frequencies (AFs), which is crucial for conducting association analyses and clinical diagnostics. Third, highly repetitive genomic regions, such as TRs [18], exhibit increased mutation rates, which makes their sequence diversity difficult to broadly resolve in small populations.
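The first two challenges can be illustrated with a short back-of-the-envelope calculation. The sketch below is not part of the 1KCP analysis. It assumes unrelated individuals whose 2N haplotypes are independent draws from the population, so that allele counts follow a binomial distribution. The function names and the small panel size of 50 are hypothetical and chosen only for illustration; 1,116 is the number of assemblies reported here.

```python
import math


def detection_probability(allele_freq: float, n_diploid: int) -> float:
    """Chance that an allele is seen at least once among 2N haplotypes.

    Assumes independent haplotype sampling (an illustrative simplification).
    """
    haplotypes = 2 * n_diploid
    return 1.0 - (1.0 - allele_freq) ** haplotypes


def af_standard_error(allele_freq: float, n_diploid: int) -> float:
    """Binomial standard error of an AF estimate from 2N haplotypes."""
    haplotypes = 2 * n_diploid
    return math.sqrt(allele_freq * (1.0 - allele_freq) / haplotypes)


if __name__ == "__main__":
    # 50 is a hypothetical small panel; 1,116 is the assembly count in the text.
    for n in (50, 1116):
        for af in (0.001, 0.01):
            p_seen = detection_probability(af, n)
            se = af_standard_error(af, n)
            print(f"N={n:5d}  AF={af:.3f}  P(observed)={p_seen:.3f}  SE={se:.5f}")
```

Under these assumptions, the probability of capturing a rare allele rises with the number of sampled haplotypes, and the standard error of an AF estimate shrinks in proportion to the inverse square root of that number. Relatedness or population structure among samples would change the exact values.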

To address these challenges, here we generate 1,116 diploid genome assemblies (55 de novo and 1,061 pangenome-informed) as part of the 1KCP project. We also construct a functionally annotated pangenome and provide an extensive catalogue of diverse genetic variants. Using this extensive dataset, we describe the medical relevance of genic variations at multiple resolutions and demonstrate the regulatory complexity of complex variants through eQTL analyses. Finally, we build a 1KCP imputation reference panel and develop a 1KCP data portal (https://yanglab.westlake.edu.cn/1kcp) with user-friendly tools for online browsing and imputation. Together, our results highlight the potential utility of the 1KCP dataset in human genetic research and associated applications.

Overview of the 1,116 diploid assemblies

The 1KCP project enrolled 1,379 participants, most of whom clustered with Han Chinese refer

The 1000 Chinese Pangenome empowers medical and population genetics | Nature