A simple means of finding mutations

An algorithm that compares genomes to find serious mutations

March 11, 2013

Cologne. For decades, searching for the mutation behind a new characteristic has been the equivalent of searching for a needle in a haystack. Korbinian Schneeberger, George Coupland and their colleagues at the Max Planck Institute for Plant Breeding Research in Cologne have developed a new algorithm for comparing closely related genomes, irrespective of their species. The algorithm efficiently identifies sequences in which the genomes differ. These include the mutation that makes a plant behave completely differently (Nature Biotechnology, doi:10.1038/nbt.2515).

Gene mapping, linkage analysis, sequence comparisons – these three terms stand for the long and difficult search for the genetic mutation behind an interesting phenotype. For a long time, scientific attempts to identify relevant mutations could only be described as piecemeal. The search for causal mutations was made simpler by the sequencing of entire genomes. Reconstructing a genome, however, requires the complete sequence of a representative individual, i.e. the reference sequence. As there is no matching reference sequence for every plant, the search for relevant mutations remains very difficult to this day.

Korbinian Schneeberger, George Coupland and their colleagues have now developed a method that does not need reference sequences. It rests on a simple premise: the DNA of the parental plant differs from the DNA of the mutant at the relevant mutation, so the method compares these closely related genomes directly. Once an algorithm has removed the identical sequences, only those that distinguish the two genomes are left. The comparison is carried out using so-called “k-mers”. K-mers are fragments roughly thirty base pairs in length that can be counted and grouped very easily and efficiently. All identical k-mers, i.e. all identical DNA sequences, are grouped together in a stack. As fragments with the relevant mutation have a different sequence from the parental sequence, a new k-mer stack is opened for their specific sequence information. In the end, the new algorithm shows which new stacks have arisen from the comparison and the genes that they belong to.
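The stacking idea can be illustrated with a few lines of Python. This is only a minimal sketch of the principle described above, not the NIKS implementation: the function name count_kmers, the toy sequences and the short k of 4 are assumptions chosen for readability, whereas the method itself works with k-mers of roughly thirty bases on real sequencing reads.

```python
from collections import Counter

def count_kmers(reads, k):
    """Group identical k-length fragments into 'stacks' by counting them."""
    stacks = Counter()
    for read in reads:
        # Slide a window of width k across the read, one base at a time.
        for i in range(len(read) - k + 1):
            stacks[read[i:i + k]] += 1
    return stacks

# Toy data (hypothetical): the mutant differs from the parent by one base.
parent_reads = ["ACGTTGCA"]
mutant_reads = ["ACGTAGCA"]

parent_stacks = count_kmers(parent_reads, k=4)
mutant_stacks = count_kmers(mutant_reads, k=4)

# Stacks that exist only in the mutant point to the changed position.
new_stacks = set(mutant_stacks) - set(parent_stacks)
print(sorted(new_stacks))
```

In this toy case the single changed base gives rise to four new stacks, one for every window of length 4 that overlaps it, while the shared fragment ACGT falls into the same stack in both plants and drops out of the comparison.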

How do Schneeberger and his colleagues now ensure that they do not end up spending their time on irrelevant mutations or sequencing errors in the genome comparison? “To exclude these sources of interference, there are various strategies that can be used. These are even applied to a certain extent during the actual comparison,” says Schneeberger. “We have to exclude non-causal mutations at an early stage.” When the genomes are sequenced, the genetic information is read a number of times. Sequencing errors appear only from time to time and not always in the same place, so the k-mers they produce are uncommon. Such rare sequence variants can be eliminated from the k-mer stacks. The exclusion of irrelevant mutations is more difficult. For this task, the selection of the plant material is important. Either two mutants are compared with one another (whereby the same gene has unmistakably mutated) or the parental plant is compared with mutant pools. Mutant pools result from a cross between the parental plant and mutants and represent the F2 generation. Each plant in these pools has exactly the same mutation for the new phenotype. The causal mutation is therefore in the majority compared to irrelevant mutations. This means that the irrelevant mutations are rare and can be eliminated from the k-mer stacks. “We gave the new method the name NIKS,” says Karl Nordström, the developer behind the algorithm. “NIKS for ‘needle in the k-stack’.”
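The filtering step can be pictured as a simple threshold on stack height. The following is a rough sketch under stated assumptions: the function name drop_rare_stacks, the cut-off of 3 and the example counts are hypothetical and are not taken from the NIKS publication, which would choose its thresholds according to the sequencing depth.

```python
from collections import Counter

def drop_rare_stacks(stacks, min_count=3):
    """Keep only k-mer stacks that were observed at least min_count times."""
    return Counter({kmer: n for kmer, n in stacks.items() if n >= min_count})

# Hypothetical stack heights from a mutant pool: two fragments were read
# many times, one was seen only once and is most likely a sequencing error.
pool_stacks = Counter({"ACGT": 12, "CGTA": 11, "CGTC": 1})

filtered = drop_rare_stacks(pool_stacks)
print(sorted(filtered))
```

Because every plant in the pool carries the causal mutation, its stacks are tall and survive the cut-off, whereas stacks produced by sporadic errors or by mutations present in only a few plants stay low and are discarded.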

If you compare genomes of parental plants with genomes from cross pools, you will find the relevant mutation in a k-mer stack. The stack will be missing from the parental plants, but present in the cross pool. If you compare two plants with different mutations in one and the same gene, you will see which new k-mer stack belongs to the same gene in both plants. “Our method is so robust that astonishingly few false positive results are produced,” says Schneeberger, commenting on the potential of NIKS. “The percentage of correctly identified mutations is well over 98 percent. And all this without the aid of a reference sequence.”

The bioinformatics specialist and his team have tested the new method in different ways. First, well-known mutations in rice were confirmed. Schneeberger and Coupland then looked for unknown mutations in the alpine rockcress Arabis alpina. A special feature of this plant is that it normally only flowers once it has been exposed to the cold of winter. Maria Albani and George Coupland isolated a mutant that no longer depends on the cold stimulus. “Using NIKS, we found the causal mutation among more than 350 million bases. This shows that we can find new and relevant mutations without having to resort to using a reference sequence,” says Schneeberger. “The greatest value of NIKS will be in the ability to get to the relevant mutation faster in an unknown genome.” The scientists in Cologne even see a new field of work here, as many interesting phenotypes – e.g. resistance to pests – only appear in species that are rarely studied and for which there are no reference sequences.