Ancestry Analysis in Population-Scale Genomic Data
Keywords:
Principal Component Analysis, Polymorphisms, SNPs, Ancestry analysis
Abstract
Ethnic groups differ in traits such as height, eye color, skin tone, susceptibility to certain illnesses, and response to specific medications. The genetic basis of these differences, however, has not been explored sufficiently. The Human Genome Diversity Project has amassed extensive genotypic data from Asian populations. Principal Component Analysis (PCA) can help reveal differences among populations, but it does not show how individual Single Nucleotide Polymorphisms (SNPs) vary between them. Alternative statistical methods, such as the mutual information algorithm, are therefore valuable for identifying SNPs associated with specific ethnicities and for quantifying SNP differences within the Pakistani population. This study aims to uncover SNP variation among ethnic groups in Pakistan. Using the mutual information algorithm, we statistically compare each SNP across the ethnicities in our sample. We then construct a classifier that predicts an individual's ethnicity from their genetic data, likely through techniques such as feature engineering or dimensionality reduction. To assess the classifier's accuracy, we use a separate test dataset. The classifier correctly predicts ethnicity for 40% of individuals in the test dataset.
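To make the per-SNP comparison concrete, the following is a minimal sketch of how mutual information between one SNP and the ethnicity label could be estimated and used to rank SNPs. It is an illustration under stated assumptions, not the study's implementation: genotypes are assumed to be coded as minor-allele counts (0, 1, 2), the estimator is the simple plug-in (frequency-count) estimator, the function names `mutual_information` and `rank_snps` are hypothetical, and the toy genotypes and group labels are invented for the example.

```python
import math
from collections import Counter


def mutual_information(genotypes, labels):
    """Plug-in estimate of I(genotype; label) in bits for a single SNP.

    genotypes: sequence of genotype codes, one per individual (assumed 0/1/2).
    labels:    sequence of ethnicity labels, one per individual.
    """
    n = len(genotypes)
    if n == 0 or n != len(labels):
        raise ValueError("genotypes and labels must be non-empty and of equal length")

    joint = Counter(zip(genotypes, labels))   # counts of (genotype, label) pairs
    g_counts = Counter(genotypes)             # marginal genotype counts
    l_counts = Counter(labels)                # marginal label counts

    mi = 0.0
    for (g, l), c in joint.items():
        # p(g, l) * log2( p(g, l) / (p(g) * p(l)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (g_counts[g] * l_counts[l]))
    return mi


def rank_snps(snp_genotypes, labels):
    """Return (snp_id, MI) pairs sorted from most to least informative.

    snp_genotypes: dict mapping a SNP identifier to its list of genotypes.
    """
    scores = {snp: mutual_information(g, labels) for snp, g in snp_genotypes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (invented): six individuals from two hypothetical groups.
toy_labels = ["group_A"] * 3 + ["group_B"] * 3
toy_snps = {
    "snp_informative": [0, 0, 0, 2, 2, 2],    # genotype fully determined by group
    "snp_uninformative": [0, 1, 2, 0, 1, 2],  # same genotype distribution in both groups
}

toy_ranking = rank_snps(toy_snps, toy_labels)
for snp_id, score in toy_ranking:
    print(f"{snp_id}: {score:.3f} bits")
```

In this toy case the first SNP separates the two groups perfectly and carries the full one bit of label information, while the second carries none. On real data the plug-in estimator is biased upward for small samples, so a permutation test or bias correction would be a reasonable addition before treating high-scoring SNPs as ancestry-informative.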