Approximate Algorithms for Resequencing of Human Genome using High Throughput Short
Human genetic variation has been traditionally studied at the single nucleotide polymorphism (SNP) level. It is clear that, human genetic variation extends beyond single nucleotide polymorphism. New projects aimed at identifying structural variation have been initiated through the use of high throughput sequencing technologies. Although the new sequencing technologies produce short reads with respect to Sanger sequencing experiments, the focus of search for differences between a sample and its reference is moving from single based differences to structural variations. The existing analysis methods concentrate on the mapping of individual reads and do not seek confirmation from other reads in the experimental data. As a consequence, they miss certain kinds of structural variations even if they are well‐represented in the data set. The internship offers a solution to this problem. The research will focus on data generated by single molecule sequencing technologies, with a special emphasis on the HellScope platform, which has a non‐uniform read length and a particular error model dominant with indels. Incorporating these features with a simultaneous mapping of multiple reads appears to be an NP hard problem. The research team will use specialized combinatorial optimization algorithms that provide provably approximate solutions for the worst case, to identify reads with overlapping alignments. The intern will monitor the convergence characteristics of the multi‐mapping calculations, and will default to reporting a subset of coordinates for reads with mapping multiplicity of more than a set of threshold.