Relaxations of the seriation problem and applications to de novo genome assembly.

Authors
  • RECANATI Antoine
  • ASPREMONT Alexandre d
  • VERT Jean philippe
  • ASPREMONT Alexandre d
  • VERT Jean philippe
  • LAVENIER Dominique
  • VIALETTE Stephane
  • BRULS Thomas
  • FOGEL Fajwel
  • LAVENIER Dominique
  • VIALETTE Stephane
Publication date
2018
Publication type
Thesis
Summary DNA sequencing technologies can only read short fragments, whose position on the genome is unknown. De novo assembly aims to reconstitute a whole DNA sequence by putting these fragments together like a puzzle. In the OLC (overlap-layout-consensus) approach, we calculate the overlap between fragments in order to rearrange them, and then extract a consensus sequence. The reordering can be written as a combinatorial seriation problem, where we reorder comparable elements between them, so that two adjacent elements are similar. This problem is efficiently solved by a spectral algorithm in the absence of noise, but the real genomic data is different. In particular, regions of the genome are similar although distant (repeated sequences), making assembly problematic. Assembly methods employ hierarchical and gluttonous algorithms to disambiguate repeated sequences. We propose here a purified approach where we rearrange all fragments "at once" via seriation resolution. Our first contribution shows that the use of the spectral method for rearrangement integrates well with the OLC scheme, producing results of similar quality to standard methods. However, due to the repeated sequences, this method produces fragmented assemblies (typically in a few subsequences instead of one). The second contribution is an extension of the spectral method related to dimension reduction under distance conservation, encompassing the problems of seriation and circular seriation (a variant where elements can be ordered in a cycle) in a unified framework. This extension makes the algorithm robust to noise and solves the fragmentation problem of the previous assembly. Our third contribution formalizes robust seriation, where one wishes to reorder noisy data. We describe connections with other combinatorial problems, in particular for matrices modeling real DNA data. We propose adapted al- gorithms, experimentally improving the robustness on synthetic and real data, although less clearly than the second contribution. The fourth contribution presents the problem of seriation with duplication, motivated by the assembly of cancerous genomes via spatial conformation data, which we attempt to solve with an alternating projection algorithm based in part on robust seriation methods, on synthetic data.
Topics of the publication
Themes detected by scanR from retrieved publications. For more information, see https://scanr.enseignementsup-recherche.gouv.fr