Electronic ISSN 2287-0237

WHAT HAVE WE LEARNED FROM THE PAST 7 YEARS AND THE MILLIONS OF DOLLARS SPENT ON THE GENOME-WIDE ASSOCIATION STUDIES?

FEBRUARY 2014 - VOL.7 | OTHER FEATURE

Advances in genomic technology have allowed us to genotype millions of single nucleotide polymorphisms (SNPs) across the human genome at the same time. Genome-wide association studies (GWAS), which rely on this high-throughput technology, enable us to search for genes contributing to many diseases, from age-related macular degeneration, type 2 diabetes, coronary heart disease, and cancer to common traits such as obesity, height, and eye color. To date, over 1,779 studies have reported 12,126 SNPs associated with more than 300 diseases.1

Quite a few GWAS have been conducted in the Thai population. Searching HuGE Navigator (an integrated, searchable knowledge base of genetic associations and human genome epidemiology: http://hugenavigator.net/) with the search term “Thai” returned 7 articles. The disease phenotypes studied so far in the Thai population using the GWAS approach are: severity in β0-thalassemia/Hb E,2 susceptibility to tuberculosis,3 chronic hepatitis B,4,5 systemic lupus erythematosus (SLE),6 thyrotoxic hypokalemic periodic paralysis,7 and nevirapine-induced rash in HIV-infected patients.8 The results from these GWAS have been used for personalized medicine, ranging from disease risk prediction and treatment selection to medication dosing guidance.9

The GWAS design originated from an attempt to map genomic loci to a disease. Sequencing all 3 billion nucleotides in the genome was prohibitively expensive. To study the whole genome, geneticists employed the concept of a tagging SNP: using one SNP as a proxy marker for nearby genetic variants. Because untyped genetic variants are correlated with genotyped SNPs, several tagged genetic variants can be represented by a single SNP on a GWAS panel. The SNPs on the GWAS panel are then tested for association with diseases. A genome-wide significant association signal implies that the genomic location of the variant plays a role in the disease-causing mechanism. For example, in one genome-wide association study, rs11591147 was found to have a genome-wide significant association with hypercholesterolemia.10 Because rs11591147 is located within the PCSK9 gene, the function of PCSK9 is implicated in hypercholesterolemia. The functional implication of these SNPs is quite straightforward if the identified SNPs are themselves functional variants, e.g. non-synonymous mutations or stop codon mutations. However, as more than 80% of the signals from GWAS fall into intergenic or non-coding regions, establishing a causal relationship between these variants and disease mechanisms is still challenging.
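The tagging idea can be illustrated with a small calculation. The following is a minimal sketch, not the procedure used by any particular genotyping platform: it approximates the correlation between two variants by the squared Pearson correlation (r²) of their genotype dosages (0, 1, or 2 copies of an allele), and treats a variant as tagged when r² reaches a cutoff. The genotype vectors, the variant names, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two genotype dosage vectors.

    Dosages are coded 0/1/2. Using unphased dosages is an approximation
    of the haplotype-based r^2 measure.
    """
    n = len(geno_a)
    if n == 0 or n != len(geno_b):
        raise ValueError("genotype vectors must be non-empty and equal in length")
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a monomorphic variant")
    return (cov * cov) / (var_a * var_b)


def tagged_variants(tag_geno, untyped, threshold=0.8):
    """Return names of untyped variants whose r^2 with the tag SNP meets the cutoff."""
    return [name for name, geno in untyped.items()
            if ld_r2(tag_geno, geno) >= threshold]


# Hypothetical genotypes for 8 individuals (illustration only).
tag_snp = [0, 1, 2, 1, 0, 2, 1, 0]
untyped = {
    "variant_x": [0, 1, 2, 1, 0, 2, 1, 0],  # identical pattern to the tag SNP
    "variant_y": [1, 1, 0, 2, 1, 1, 0, 2],  # weakly correlated with the tag SNP
}
tagged = tagged_variants(tag_snp, untyped)
print(tagged)
```

In this toy data only variant_x would be represented by the tag SNP; an association at the tag SNP could therefore reflect variant_x rather than the tag SNP itself.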

The major success of GWAS is the identification of several new loci for common diseases. GWAS assume that most complex diseases can be explained by a few common genetic variants. This assumption is known as the common disease/common variant (CDCV) hypothesis. CDCV has been supported by several studies of common complex genetic diseases. For example, a region near genes on chromosome 9p21.3 has been associated with coronary artery disease in populations of European descent in studies of three common SNPs (rs1333049, rs10757274, and rs2383207),11 all of which have a minor allele frequency (MAF) close to 50%.12
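For readers unfamiliar with the term, minor allele frequency can be computed directly from genotype counts. The sketch below uses hypothetical counts (not taken from the cited studies) and labels a variant as common when its MAF is at least 5%, the cutoff used in this article.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """MAF from counts of the three genotypes (AA, AB, BB) at a biallelic SNP."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("at least one genotyped individual is required")
    # Each AA individual carries two A alleles; each AB individual carries one.
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1.0 - freq_a)


# Hypothetical genotype counts for 1,000 individuals (illustration only).
maf = minor_allele_frequency(n_aa=260, n_ab=490, n_bb=250)
is_common = maf >= 0.05  # 5% cutoff between common and less common variants
print(f"MAF = {maf:.3f}, common variant: {is_common}")
```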

Although correlations between common variants and some less common variants allow us to identify variants with MAF < 5% from the GWAS platform, many rare variants that are not tagged by common variants may have been missed. As a result, we are left with an incomplete characterization of the underlying genetic mechanisms of the studied diseases. Correlation between genetic markers (linkage disequilibrium, LD) can also make it difficult to identify the causal locus in a region where LD among genetic variants is high.

GWAS also allow us to estimate genetic risks from the associated genetic variants. Taking the coronary artery disease (CAD) variants mentioned above as an example, the 9p21.3 risk locus was estimated to confer an odds ratio (OR) for CAD of 1.25 (95% CI, 1.21-1.29).11 However, for most diseases, the identified variants have small effects and explain only a small proportion of the estimated heritability. The heritability of height, i.e. the proportion of variance in height explained by genetic factors, has been estimated to be as high as 80-90%. Yet the results from GWAS revealed genetic loci that explained only 5% of the variation in height in the human population.13 For risk prediction, these small effects limit the use of genetic markers found from GWAS in disease risk classification, especially for diseases with well-established risk factors such as coronary artery disease.14
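To make the odds ratio concrete, the following sketch computes an OR and its 95% confidence interval from a 2x2 table of allele counts, using the standard log-odds (Woolf) approximation. The counts are hypothetical and were chosen only so that the OR comes out at 1.25; they are not the data behind the published estimate, and the interval they produce is wider than the published one because the sample is much smaller.

```python
from math import exp, log, sqrt


def odds_ratio_ci(case_risk, case_other, control_risk, control_other, z=1.96):
    """Odds ratio with a Wald confidence interval on the log-odds scale.

    Arguments are allele counts: risk and non-risk alleles in cases and
    in controls. z=1.96 gives an approximate 95% interval.
    """
    counts = (case_risk, case_other, control_risk, control_other)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = sqrt(sum(1.0 / c for c in counts))
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts (illustration only).
or_est, ci_low, ci_high = odds_ratio_ci(
    case_risk=1000, case_other=1000,
    control_risk=800, control_other=1000,
)
print(f"OR = {or_est:.2f} (95% CI, {ci_low:.2f}-{ci_high:.2f})")
```

An OR of 1.25 means the odds of carrying the risk allele are 25% higher in cases than in controls, which is a modest effect compared with established clinical risk factors.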

One limitation of GWAS is its limited ability to characterize structural variation, such as insertion/deletion polymorphisms, duplications, and rearrangements. Although copy number variations (CNV) have been widely studied using GWAS arrays, the ability to detect CNV varies greatly among the different methods for CNV prediction. How well GWAS SNPs correlate with the structural variations in a given region determines whether GWAS will be able to detect an association between those structural variants and diseases.

Despite the discovery of many novel genes in these complex diseases, researchers worldwide are frustrated that they have not uncovered the complete biological mechanisms of these diseases. The main criticism of GWAS is that the genetic variants found on the GWAS panel explain only a small proportion of the phenotypes. Hence, when next-generation sequencing (NGS) came into the picture, many groups of scientists were ready to switch from GWAS to NGS.15

NGS technology fills in several gaps that GWAS leaves. Unlike GWAS, which genotypes only the limited number of variants included on the genotyping platform, NGS enables rapid sequencing of the whole human genome at a much lower cost than traditional sequencing methods. All variants in the genome can be detected, including structural variants and small and large insertions/deletions. NGS has been used for several clinical applications, such as diagnosing difficult-to-diagnose genetic diseases, choosing the appropriate chemotherapy for cancer patients, and investigating novel genetic causes of complex diseases. Although sequencing itself may get cheaper, the steps that follow it are still a big challenge. The downstream processes needed to interpret the genome require experts in several fields, e.g. bioinformatics, computational biology, statistical genetics, and biostatistics. Hence, it might cost a thousand dollars to sequence the genome, but 100,000 dollars to understand its meaning.16