Genome-wide association with structural variants
Genome-wide association is a powerful tool to identify the molecular causes of trait diversity within species. In most association studies, genotyping single nucleotide polymorphisms (SNPs) is regarded as sufficient. Some attempts have been made to include short insertions and deletions (INDELs) and larger structural variants, but few improvements have been reported. A new study now shows that INDEL association and integrated gene-based testing could explain a large proportion of missing heritability in both plants and animals.
A team of researchers from Germany and the UK, led by Xiangchao Gan at the Max Planck Institute for Plant Breeding Research in Cologne, demonstrated that a multi-allelic artefact caused by inconsistent alignments was a key obstacle to testing the association of INDELs and to integrated gene-based association methods, such as integrated burden testing. When a divergent sequence is aligned to a reference genome, alignment isomorphs frequently occur, in which essentially the same sequence is aligned in different ways. In variant calling, this ambiguity produces false multi-allelic calls for the same allele, even when the surrounding sequence is identical between samples.
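The effect of alignment isomorphs can be illustrated with a small Python sketch. The reference sequence, positions and helper function below are invented for illustration and are not taken from the study or from any variant caller. The sketch only shows that two differently placed deletion calls in a repeat can describe exactly the same sample sequence while looking like two distinct alleles.

```python
# Toy illustration (hypothetical data): one deletion in a CA repeat,
# reported at two different positions by two different alignments.

def apply_variant(reference, pos, ref_allele, alt_allele):
    """Return the sequence obtained by replacing ref_allele with alt_allele
    at 0-based position pos of the reference."""
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("ref_allele does not match the reference at pos")
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]

reference = "GATCACACAGT"  # made-up reference containing a CACACA repeat

# Two calls for the same biological event, placed differently in the repeat.
call_sample_1 = (2, "TCA", "T")  # deletion anchored at the left of the repeat
call_sample_2 = (6, "ACA", "A")  # deletion anchored further to the right

hap_1 = apply_variant(reference, *call_sample_1)
hap_2 = apply_variant(reference, *call_sample_2)

# Compared as variant records, the calls look like two different alleles ...
records_differ = call_sample_1 != call_sample_2
# ... but the sequences they describe are identical.
same_haplotype = hap_1 == hap_2

print(records_differ, same_haplotype, hap_1)
```

Comparing the reconstructed sequences rather than the raw records is one simple way to see that both samples carry the same allele. How Irisas itself synchronizes variants is not described here, so this should be read as a conceptual aid only.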
To address this problem, they developed the software Irisas, which synchronizes variants and integrates the impact of SNPs, INDELs and structural variants for testing. They re-analysed two publicly available datasets with multiple traits in A. thaliana and D. melanogaster. The new association tests identified novel loci containing well-established candidate genes that SNP-based GWAS had failed to detect. INDEL-specific genome-wide significant loci generally show weak local linkage disequilibrium with nearby SNPs.
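The general idea behind a gene-based burden test can be sketched as follows. This is a generic collapsing approach with made-up data, under the assumption that each sample's variants in a gene are summed into one score. It is not the statistic implemented in Irisas, and all sample names, genotypes and phenotype values are hypothetical.

```python
# Generic burden-score sketch (hypothetical data, not the Irisas method).

def burden_scores(genotypes):
    """Collapse per-variant presence flags (0/1) into one score per sample."""
    return {sample: sum(flags) for sample, flags in genotypes.items()}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Flags per sample for three variants in one gene: [SNP, INDEL, structural variant]
genotypes = {
    "s1": [0, 0, 0],
    "s2": [1, 0, 0],
    "s3": [0, 1, 0],
    "s4": [1, 1, 1],
}
# Invented phenotype values, e.g. a flowering-time measurement.
phenotype = {"s1": 10.0, "s2": 12.0, "s3": 12.0, "s4": 16.0}

scores = burden_scores(genotypes)
samples = sorted(scores)
r = pearson([scores[s] for s in samples], [phenotype[s] for s in samples])
print(scores, r)
```

Because SNPs, INDELs and structural variants all contribute to the same per-gene score, a gene can show a signal even when no single variant does. A real analysis would also need to account for population structure and multiple testing, which this sketch ignores.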
By leveraging INDEL association and integrated burden testing, Song et al. were able, for the first time, to associate the TFL1 gene with flowering-time-related phenotypes in a natural population of A. thaliana (through a 1 bp insertion). Similarly, FRIGIDA went undetected in most GWA studies but showed high enrichment with the proposed method (through a 16 bp insertion and a 245 bp complex deletion). Interestingly, none of the three INDELs was available in the original variant tables of the 1001 A. thaliana genome project, but they were captured by the software IMR/DENOM, developed previously in the same group (http://chi.mpipz.mpg.de/imrdenom/). This also indicates that a better genotyping method or a new sequencing platform could be a key requirement for recovering missing heritability in future GWA studies in plants, animals and humans.