Gene calling, or structural gene annotation, is critical to extracting biological knowledge from a genome, yet existing methods for gene calling in eukaryotes lag far behind genome sequencing and assembly in quality, ease, and speed. However, we have every reason to believe gene calling is a tractable problem: the information about what is or is not a gene is encoded in the raw DNA sequence. Hence, we need to update our modeling. Deep learning is a new and transformative technology that can model extraordinarily complex and non-linear relationships, like those found in biology, and has the potential to ‘decode’ the information in DNA. We have previously demonstrated the applicability of deep learning to gene calling in our project Helixer, which showed ground-breaking performance in classifying genic categories. Here we establish usability by post-processing the base-wise predictions from Helixer into full primary gene models with a hidden Markov model. Preliminary results in selected plant species indicate that the resulting gene models redefine the state of the art for de novo gene callers and, in some species, even approach reference quality as measured against RNA-seq data. In the future, we will expand applicability across eukaryotes and extend our annotation targets to include additional genomic features such as promoters. The improved annotations will support research methods from cloning to ’omics analyses and a wide variety of applications, including biomedical research and crop bioengineering. The code is available at https://github.com/weberlab-hhu/Helixer.
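
To illustrate the idea of turning base-wise predictions into contiguous gene structure with a hidden Markov model, the following is a minimal sketch and not Helixer's actual post-processing. It assumes four per-base classes (intergenic, UTR, CDS, intron), a hypothetical function name `viterbi_smooth`, and a hypothetical "sticky" transition model in which staying in the current state has probability `stay`. A real gene-model HMM would also constrain which state transitions are allowed and track coding phase.

```python
import math

# Assumed class labels for illustration; the abstract only says "genic categories".
STATES = ["intergenic", "UTR", "CDS", "intron"]


def viterbi_smooth(probs, stay=0.99, floor=1e-9):
    """Return the most likely state path for per-base class probabilities.

    probs: list of rows, one per base, each holding len(STATES) probabilities.
    stay:  hypothetical self-transition probability; the remainder is split
           evenly among the other states.
    floor: lower bound on probabilities to avoid log(0).
    """
    n = len(STATES)
    if not probs:
        return []
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (n - 1))

    # Uniform start distribution, so only the first emission matters.
    score = [math.log(max(p, floor)) for p in probs[0]]
    back = []  # back[t][j] = best predecessor of state j at base t + 1

    for row in probs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Score of reaching state j from each possible predecessor i.
            candidates = [
                score[i] + (log_stay if i == j else log_move) for i in range(n)
            ]
            best_i = max(range(n), key=candidates.__getitem__)
            new_score.append(candidates[best_i] + math.log(max(row[j], floor)))
            pointers.append(best_i)
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(range(n), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]
```

With a high `stay` value, a single base whose raw prediction weakly favors a different class is absorbed into the surrounding segment, whereas a per-base argmax would split the segment at that position.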