Genome Sequence Compression Algorithms Using Locally Consistent Parsing


High throughput sequencing (HTS) platforms generate unprecedented amounts of data, introducing challenges in computational infrastructure. Data management, storage, and analysis become a major logistical undertaking for those adopting the new platforms. The large investments required for this purpose almost signaled the end of the Sequence Read Archive hosted at the NCBI, which holds most, if not all, of the sequence data generated worldwide. Currently, most HTS data is compressed with general-purpose algorithms such as gzip. These algorithms are not designed for the data generated by HTS platforms; for example, they do not take advantage of the specific nature of sequence data, i.e., its limited alphabet size and the high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data may be able to address some of these issues in data management, storage, and communication. Here we propose SCALP, a "boosting" scheme based on the Locally Consistent Parsing technique that reorganizes the reads in a way that yields higher compression speed and a better compression rate, independent of the compression algorithm in use.
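The abstract does not spell out SCALP's internals, but the core idea (reordering similar reads so a generic compressor can exploit their redundancy) can be illustrated independently of Locally Consistent Parsing. The sketch below is only an approximation of that idea: it substitutes a plain lexicographic sort for LCP-based grouping (an assumption for illustration, not the actual SCALP algorithm), simulates overlapping reads drawn from one reference, and compares zlib output sizes for arbitrary versus similarity-grouped read order.

```python
import random
import zlib

# Illustration only: SCALP's real reordering is based on Locally
# Consistent Parsing; a lexicographic sort is used here as a crude
# stand-in for "placing similar reads next to each other".
random.seed(42)

# Simulate HTS reads: overlapping 100-mers sampled from one reference,
# so many reads are near-duplicates of each other.
reference = "".join(random.choice("ACGT") for _ in range(1000))
reads = [reference[p:p + 100]
         for p in (random.randrange(900) for _ in range(2000))]

random.shuffle(reads)  # reads arrive in no particular order

# Concatenate reads in arrival order vs. similarity-grouped order.
shuffled_blob = "\n".join(reads).encode()
sorted_blob = "\n".join(sorted(reads)).encode()

# A generic compressor (zlib's 32 KB window) misses matches between
# similar reads that lie far apart; grouping them recovers those matches.
shuffled_size = len(zlib.compress(shuffled_blob, 9))
sorted_size = len(zlib.compress(sorted_blob, 9))

print(f"arrival order:  {shuffled_size} bytes compressed")
print(f"grouped order:  {sorted_size} bytes compressed")
```

With these simulated reads the grouped ordering compresses noticeably smaller, since exact and near-duplicate reads end up inside the compressor's match window; the reordering step changes nothing about the downstream compressor itself, which is the sense in which such a "boosting" scheme is compressor-independent.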

Faculty Supervisor:

Dr. Cenk Sahinalp


Ibrahim Numanagic


Vancouver Prostate Centre


Computer science


Life sciences


Simon Fraser University


