Polygenic height prediction for the Han Chinese in Taiwan

Introduction
Human height is a widely studied polygenic trait because it can be measured accurately and is readily available from large cohorts1,2,3,4. Besides sex differences and genetic factors, however, adult height is also influenced by nutrition, age, and environmental factors. For example, from 1985 to 2019, the average height of males and females in Taiwan increased from 169.2 cm to 173.5 cm and from 158.3 cm to 160.7 cm, respectively; those for males and females in the United Kingdom increased from 176.4 cm to 178.2 cm and from 162.7 cm to 163.9 cm, respectively5,6,7,8,9,10. In addition, height is associated with several human diseases, including cancer11,12, coronary heart disease13, stroke14, and macular degeneration15. It is speculated that genetic loci associated with height may be pleiotropic and influence one’s susceptibility to diseases. Genome-wide association studies (GWAS) and machine learning techniques have been used to identify genetic variants associated with height16,17,18,19,20,21,22.
Polygenic prediction of height has been examined extensively in European populations19,23,24 and briefly in admixed populations25,26, but compared with studies in European populations, fewer studies have examined polygenic height prediction in non-European populations27. Here, we report our findings of height prediction based on genetic and other factors in two large Han Chinese cohorts as part of the Taiwan Biobank (78,719 individuals) and the Taiwan Precision Medicine Initiative (40,641 individuals).
Methods
Sample characteristics
The Taiwan Biobank (https://www.twbiobank.org.tw/; TWB) is a community-based prospective community cohort study of the Taiwanese population. Those between 30 to 70 years old and without cancer can join the project, but there is no age limitation for tracking cases. A standard questionnaire is used at 44 recruitment centers in all counties and cities across Taiwan to collect participants’ demographics, socioeconomic status, environmental exposures, lifestyle, dietary habits, family history and self-reported disease status. Anthropometric measurements and blood / urine samples are collected at the time of enrollment, and genetic profiling is performed on custom genome-wide single nucleotide polymorphism (SNP) arrays, TWBv1 with 653,291 SNPs and TWBv2 (also named TPM1) with 752,921 SNPs28.
The Taiwan Precision Medicine Initiative (TPMI; https://tpmi.ibms.sinica.edu.tw/www/) is a cohort study of the Taiwanese population in partnership with 33 hospitals across Taiwan. Participants consent to providing their electronic medical records (EMR) and residual blood samples for genetic profiling. The EMR includes outpatient and admission/discharge notes, surgical records, together with imaging, pathological, and blood test reports. Genetic profiling was performed on two custom genome-wide SNP arrays. The first array, TPM1, also known as TWBv2, has 752,921 SNPs and was used for early participants (before 2022). The second array, TPM2, has 755,191 SNPs and was used for subsequent participants29.
This study includes 81,061 TWB participants genotyped on the TPM1/TWBv2 arrays as a training and testing set, and 68,610 TPMI participants genotyped on the TPM1/TWBv2 arrays as a validating set and for subsequent analysis. The latest height measurements are used in the analyses for those TWB and TPMI participants with follow-up data.
Written informed consent was obtained from all participants, with ethics approval granted by the Academia Sinica Institutional Review Board (AS-IRB01-23066). This study was conducted in accordance with the Declaration of Helsinki and relevant ethical regulations. Summary statistics are available from the corresponding author upon reasonable request, and individual data and biomaterials can be accessed through the Taiwan Biobank following established procedures.
Quality control
We conducted standard quality control (QC) for the two datasets30,31,32 (Supplementary Figure 1). Individuals with gender error, genotyping miss call rate > 0.1, birthdate outside 1946–1986 range, and 3rd degree or closer kinship relationship with other participants were excluded, resulting in 78,719 QC-passed samples from 81,061 TWB participants in the training and testing sets, and 40,641 QC-passed samples passed from 68,610 TPMI participants in the validating set. Next, we removed SNPs with call rates of <0.9, minor allele frequency (MAF) < 0.01, and deviation from Hardy-Weinberg equilibrium (HWE < 10-8), resulting in 543,064 and 543,701 high quality SNPs in the training/testing and validating sets, respectively. Finally, 542,988 SNPs in common between the training/testing and validating sets were selected for subsequent analyses.
Statistical analysis
Males and females are analyzed separately due to known height differences between them. Height predictors are developed using the following process (Supplementary Figure 2): (I) We employed the “10-Fold Cross-Validation” method33,34,35,36 to randomly divide all TWB samples into 10 subgroups, labeled G1, G2, …, G9, and G10. (II) When the G1 subgroup was used as the testing set then the other 9 subgroups, G2-G10, were used as the training set. Similarly, G2 was used as the testing set and the other 9 subgroups, G1, G3-G10, were used as the training sets and so on. In this step, 10 analysis groups of training and testing sets were obtained. (III) The genome-wide association study (GWAS) was conducted by regressing height (dependent variable) on the single nucleotide polymorphism (SNP) (independent variable), one SNP at a time, on the training set in each analysis group. Manhattan plot results for males and females were presented in Supplementary Figure 3. (IV) Next, we filtered out SNPs with P-values greater than 0.05 in the 10 analysis group training sets and further selected the intersecting SNPs that were present in all 10 groups for subsequent analysis. (V) To select a subset of informative SNPs that illustrate the relationship between the genome and height, the maximum R-square stepping algorithm least absolute shrinkage and selection operator (LASSO)37 method was used (via a least angle regression) in the 10 analysis group training sets. Only SNP information was included to select the most appropriate SNP combination. (VI) We then picked out the intersecting SNPs that were selected by the step (V) LASSO method in all the 10 analysis groups training set. There were 5,878 SNPs in the male groups and 20,311 SNPs in the female groups in the final combination of SNPs (Supplementary Table 1). (VII) Multiple linear regression was used to calculate the weights of four different combinations in the 10 analysis groups training set: SNPs (polygenic score, PGS) only, SNPs (PGS) + birth year in Anno Domini (AD), SNP (PGS) + age at measurement, and SNP (PGS) + birth year in AD + age at measurement. The weight, beta value, of each SNP, birth year in AD, and age at interview were used for subsequent height prediction. (VIII) We used the weights from step (VII) to calculate the average of the predicted heights over 10 runs for the TWB training and testing sets and, TPMI validating sets, respectively. The descriptive statistics of height and age, mean, standard deviation (SD), median, and range are presented in TWB training and testing and TPMI validating sets. All data analyses were performed using PLINK30, KING38, SAS 9.4 (SAS institute, Cary, NC, USA), and R 4.2.2 (R Foundation for Statistical Computing, Vienna, Austria).
Results
Clinical Characteristics of Taiwan Biobank (TWB) and Taiwan Precision Medicine Initiative (TPMI)
After performing quality control measures, a total of 119,360 individuals are included in this study (TWB: 54,064 females and 24,655 males; TPMI: 22,508 females and 18,133 males). There is no significant difference between the distribution of height in the TWB and TPMI datasets (Table 1 and Fig. 1B). Tabulation of the average height based on birth year (between 1946 and 1986) clearly shows an increase in average height in participants born in later years (Fig. 1A). The trendline slopes of the height are: 0.2258 in TWB males, 0.1890 in TWB females, 0.1789 in TPMI males and 0.1766 in TPMI females, respectively. Birth year is therefore included in the model to account for the effect of nutritional and other improvement in Taiwan between 1946 and 1986 on height. As height also varies with age, age at measurement is included in the model to account for this.

A The average height in the year of birth, the trendline equation: TWB Male = 0.2258*(year of birth in AD)-274.22; TPMI Male = 0.1789*(year of birth in AD)-183.20; TWB Female = 0.1890*(year of birth in AD)-213.90; TPMI Female = 0.1766*(year of birth in AD)-190.01, B count of the different heights.
Actual height vs. predicted height in TWB training and testing sets
Birth year, age at measurement, SNPs obtained by the univariable linear regression, and their weights obtained by the multivariable linear regression LASSO selection method are used to predict height in the TWB training set. Using only birth year, only age at measurement, or birth year plus age at measurement to predict height does not yield good predictions in the TWB training set (Supplementary Figure 4 and Supplementary Table 2) but adding height-related SNPs increases height prediction accuracy (Fig. 2A–C and Table 2). Combining birth year, age at measurement, and height-related SNPs simultaneously improves the accuracy of height prediction and decreases difference between actual and predicted height (Fig. 2D and Table 2).

A Polygenic score (PGS) only, B PGS + birth year in AD, C PGS + age at measurement, D PGS + birth year in AD + age at measurement.
Applying the same method to predict height in the TWB testing set, the Pearson correlation coefficient between actual and predicted height based on birth year, age at measurement, and height-related SNPs are found to be 0.7759 and 0.6084 for males and females, respectively (Table 2), substantially higher than all other combinations. Moreover, the difference between the height predicted by this combination and the actual height is also the smallest. The distribution of actual and predicted height in the TWB testing set also yields the same results, suggesting that combining birth year, age at measurement, and height-related SNPs will give the best height predictions (Fig. 3 and Supplementary Fig. 5).

A PGS only, B PGS + birth year in AD, C PGS + age at measurement, D PGS + birth year in AD + age at measurement.
Actual height vs. principal component analysis (PCA) adjustment predicted height in TWB training and testing sets
In most genomic related studies, the principal component analysis (PCA) adjustment is applied to correct for population stratification19,23,25,27,28. The TWB female and male PCA eigenvalues were tabulated in Supplementary Table 1. Since the PCA eigenvalue after the 20th in females is <1, PCA1-PCA20 were selected for subsequent analysis. Although males are <1 from the 15th onwards, based on the consistency of analysis, the same PCA1-PCA20 as females were selected for subsequent analysis (Supplementary Table 3). The outcomes are the same as those without 20 PCA factors (Figs. 4, 5, and Table 3). Though the accuracy was more precise, there were no statistically significant differences between the TWB testing set without and with PCA adjustment; the Pearson correlation coefficient in males was 0.7759 and 0.7816, respectively. And in the female group, it was 0.6084 and 0.6262 (Tables 2 and 3). The overall impact of PCA is minimal. In the subsequent validation analysis, we opted for a model that excluded PCA to mitigate the potential variations stemming from differing PCA coefficients across various databases.

A PGS only, B PGS + birth year in AD, C PGS + age at measurement, D PGS + birth year in AD + age at measurement.

A PGS only, B PGS + Birth year in AD, C PGS + Age at measurement, D PGS + Birth year in AD + Age at measurement.
Validation of height predictions in the TPMI dataset
To assess the reliability of the height prediction method, an independent dataset, TPMI, was employed for validation and performance evaluation. Birth year, age at measurement, and height-related SNPs with the same weighting was used for height prediction in the TPMI dataset. The distribution of actual and predicted height based on the combination model (birth year, age at measurement, and height-related SNPs) again shows improvement in the accuracy of height prediction compared to those using only one element or combinations of only two of the elements (Fig. 6). For example, the combination model improves the correlation of predicted to actual height from 0.2225 to 0.3980 for males and 0.2708 to 0.4444 for females, respectively (Table 4). Similarly, the proportion of males and females with >5% difference between their actual and predicted heights decreases significantly, from 19.51% to 11.70% for males and from 19.93% to 13.20% for females, respectively.

A PGS only, B PGS + birth year in AD, C PGS + age at measurement, D PGS + birth year in AD + age at measurement.
Discussion
Geographical location and environmental factors influence Taiwan’s population composition. The majority of Taiwanese ethnic groups are of Han ancestry (>95%), with ~2% being of Aboriginal ancestry (Austronesian)39,40. Furthermore, based on PCA results comparing TWB and TPMI samples with the 1000 Genomes Project, the TWB and TPMI samples cluster with East Asian ancestry, confirming that the majority of the samples belong to the Han Chinese ancestry group (Supplementary Fig. 6). The Taiwanese Han Chinese population comprises Min-Nan (also known as Holo), Hakka, and Mainlanders. Although there are genetic, lifestyle, and dietary habit differences among these ethnicities, there is no statistically significant difference in actual and predicted height when comparing the three major ethnic groups (ethnic information comes from TWB phenotypic data) within the Taiwanese Han Chinese population (Supplementary Table 4).
Generally, humans grow taller and consume more nutritious diets when food is abundant41. However, as age increases, height tends to gradually decrease due to factors such as spinal disc degeneration, osteoporosis, and muscle loss. Women typically lose around two inches between the ages of 30 and 70, while men lose about an inch by age 70 and two inches by age 8042,43,44,45,46. This is consistent with the results shown in this study, where bone density increased with birth year in both males and females but decreased with age at measurement (Supplementary Table 5). Although the inclusion of both birth year in AD and age at measurement in the prediction model raises concerns about collinearity and potential overfitting, our analysis indicates that the variance inflation factors (VIF) for birth year in AD and age at measurement were 9.36 and 8.72, respectively, both below the threshold of 10, indicating no collinearity issues. Therefore, including birth year in AD and age at measurement in the model increases the accuracy of height prediction.
The average age at measurement for the TPMI validating set (female: 53.09 ± 12.15; male: 55.90 ± 11.64) is slightly older than the TWB training and testing sets (female: 51.68 ± 10.40; male: 51.96 ± 11.16) for both sexes. This difference may cause a bias for the height prediction. However, age at measurement is included in the correction in analysis models to avoid the impact of differences. With the additional adjustment, we can also estimate the impact of birth cohort changes on height by using the deviation caused by birth year. The new method for height prediction that combines genetic and age factors as a surrogate for nutritional status in two large datasets (TWB and TPMI), is shown to estimate height accurately for the Han Chinese in Taiwan.
In our analysis, all 10 analysis groups (G1-G10) were used simultaneously as both training and testing sets. This dual role of the data could potentially lead to the testing set showing a somewhat inflated performance due to its inclusion in the training process. This observation may explain the notably high Pearson correlation coefficient of 0.7759 for male and 0.6084 for female observed in the model involving SNPs, birth year, and age at measurement in the TWB male testing set. However, while the result on the testing set may be somewhat inflated due to its dual role in the analysis, the independent TPMI dataset validation results remain robust and are the primary focus of our paper. That said, we acknowledge that there is a notable drop in the correlation coefficient, decreasing from over 0.7 in the testing set to ~0.4 in the validation set. This reduction highlights the challenges of generalizing the model to an independent dataset. The validation results provide a more reliable assessment of the model’s generalization performance, which is consistent with other articles47,48 and constitutes a key aspect of our findings.
Furthermore, we observed that females required a larger set of SNPs (20,311 SNPs) compared to males (5878 SNPs) to achieve higher prediction accuracy. One plausible explanation is the influence of hormonal dynamics, particularly estrogen levels, which play a significant role in skeletal growth and development. Hormonal fluctuations, such as those occurring during menopause49,50, can impact height-related genetic variants differently in females compared to males. Additionally, age-related processes, including height loss due to aging, may necessitate the inclusion of a broader array of genetic markers in females to account for these physiological changes.
A recent study analyzed the rare and low-frequency coding variants found in >200,000 individuals of six different ethnicities and identified >1000 variants associated with height51. The authors observed that these variants were associated with body mass index, bone mineral density, and lung function51. In a GWAS study of the Taiwan Biobank (TWB), four novel genes—NABP2, RASA2, RNF41, and SLC39A5—were identified for human height, and it was also discovered that these genes have associated with cardiovascular disease, diabetes, and cancer52. In our current study, with the exception of rs295321 in the RASA2 gene, all SNPs from these four height-related genes (NABP2, RASA2, RNF41, and SLC39A5) were incorporated into our height prediction model. Other studies using TWB data have suggested potential associations between height and certain health-related outcomes, though the most significant findings were related to anthropometric traits. While there may be trends indicating that taller individuals could have lower risks for some chronic diseases, such as cardiovascular disease, diabetes, and cancer, these associations are not definitive. Additionally, height has been suggested to be associated with longer life expectancy in some populations, but further research is needed to confirm these findings53. In addition, the relationship between height and mate choice and reproduction in Taiwan found that taller men were more likely to have a partner and have more children. They were also more likely to have shorter periods of celibacy and live with their partners for longer periods of their lives54. Furthermore, a polygenic risk predisposition score for familial short stature (FSS) in the Han Chinese population, comprising 10 novel SNPs and nine previously reported height-related SNPs, demonstrated high predictive accuracy for FSS risk, with an area under the curve of 0.940 in the testing group55. These nine height-related SNPs have also been included in our height prediction model to enhance its predictive capability. The height prediction study by Yengo et al.16 identified 12,111 independent SNPs significantly associated with height, based on data from a genome-wide association study of 5.4 million individuals from diverse ancestries16. In our analysis, 34.20% ((855 + 1155)/5878) of these SNPs were included in our male height prediction model, and 33.55% ((1155 + 5659)/20311) were included in our female model (supplementary Fig. 7). Although approximately one-third of the SNPs overlap, the heritability (h²) of the SNPs we selected for height prediction is 0.4775 ± 0.0069 in males and 0.4267 ± 0.0050 in females. Furthermore, gene ontology (GO) analysis based on the SNPs we selected for height prediction identified the top 30 GO terms with the smallest false discovery rate (FDR) (Supplementary Fig. 8). The top five GO terms in females—developmental process, system development, anatomical structure development, multicellular organism development, and multicellular organismal process—are all related to height (Supplementary Fig. 8A). Similarly, in males, the GO term with the smallest FDR is also related to height (Supplementary Fig. 8B). Therefore, the SNPs we selected are indeed associated with height and can be used for height prediction.
Limitations of this study include the following: First, TWB only includes individuals aged 30–70 years, limiting the generalizability of the study results to other age groups. Second, because it is not a long-term follow-up data and only the most recent height measurement is used, this may not fully explain the decrease in height with age. Third, due to data limitations of TPMI, which has not yet completed imputed SNP data, we lack fully available imputed SNP data and we can only use SNPs confirmed by genotyping microarrays, which limits the genetic resolution of the study. In addition, although we validated the model using the Taiwan Precision Medicine Initiative (TPMI) dataset, both TWB and TPMI datasets are from the same population, which may limit generalization to other populations. Finally, although non-genetic factors such as year of birth and age at measurement improve prediction accuracy, they may introduce collinearity and overfitting risks, even if we checked for these issues.
In conclusion, the height prediction model which matches theoretical expectations has been effectively developed and validated within the Han Chinese population of both TWB and TPMI databases. We employed a 10-fold cross-validation procedure33,34,35,36 to ensure methodological rigor in developing and evaluating the final model. It has long been known that an increase in height is correlated with improved nutrition56 and a decrease in height is correlated with advanced age57. Despite potential variations in environmental and genetic factors across different databases, this study consistently emphasizes the predictability of height based on combining genetic factors, birth year, and age at measurement. It also underscores the high data quality of the two Taiwanese databases. Understanding the genetics of height carries significant importance, given its associations with various diseases. These newly established predictors for Han Chinese height represent another crucial step toward achieving this overarching research objective.
Responses