Relaxations of the seriation problem and applications to de novo genome assembly.

Authors Publication date
2018
Publication type
Thesis
Summary In a sequencing experiment, we can only “read” small fragments (reads) of DNA due to physical limitations, whose location on the genome is unknown. De novo assembly aims to put them together to retrieve the full DNA sequence, like a jigsaw puzzle. The OLC approach computes pairwise Overlaps between reads to find their Layout, and then derive a Consensus sequence. The layout can be cast as an instance of the Seriation combinatorial problem, seeking to reorder a set of elements based on their pairwise similarity, such that similar elements are nearby. In a noiseless setting, a spectral method can solve Seriation efficiently. Still, it often fails on noisy, real DNA data. Notably, assembly is challenged by repeated genomic regions (repeats) causing distant fragments to be similar. Most assembly engines follow hierarchical, greedy schemes, including modules dedicated to detect and disambiguate repeats while constructing the output sequence. We explore a simpler approach using Seriation to lay out all reads at once. Our first contribution is to show that the spectral method can be seamlessly integrated in an OLC framework, yielding competitive results compared to standard methods on real data. However, due to repeats, the method can only find fragmented assemblies (with a few large assembled fragments), i.e., it does not succeed to lay- out all the reads together at once. In our second contribution, we extend the spectral method using a multidimensional spectral embedding. It provides a unifying framework for seriation and circular seriation, a variant searching for a cyclic ordering of the data. This method significantly improves the robustness of the original algorithm on noisy data, and yields single contig assembly of bacterial genomes. As a third contribution, we introduce the Robust Seriation framework, formalizing the task of seriation on corrupted data. We outline the relation between (robust) seriation and other combinatorial problems, particularly for stylized matrices modeling DNA sequencing data. We pro- pose dedicated algorithms that experimentally improve robustness on synthetic and real data, although they turn out to be more sensitive than the method constituting our second contribution. In a fourth contribution, we introduce the problem of Seriation with Duplications, which is motivated by the application of assembling cancer genome from spatial conformation (Hi-C) data. We propose an alternated minimization algorithm that can utilize methods designed to solve Robust Seriation, and evaluate it on toy data.
Topics of the publication
Themes detected by scanR from retrieved publications. For more information, see https://scanr.enseignementsup-recherche.gouv.fr