Genome Sequence Compression Algorithms Using Locally Consistent Parsing

High-throughput sequencing (HTS) platforms generate unprecedented amounts of data, which places heavy demands on computational infrastructure. The large investment required to meet these demands nearly led to the closure of the Sequence Read Archive hosted at the NCBI, which holds most of the sequence data generated worldwide.

Currently, most HTS data is compressed with general-purpose algorithms such as gzip. These algorithms are not designed for data generated by HTS platforms; for example, they do not take advantage of the specific nature of sequence data. Fast and efficient compression algorithms designed specifically for HTS data may help address some of the issues in data management, storage, and communication. Here we propose a “boosting” scheme based on the Locally Consistent Parsing technique that reorganizes the reads so as to achieve higher compression speed and a higher compression rate, independent of the compression algorithm in use.
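The idea of reorganizing reads before handing them to a general-purpose compressor can be illustrated with a minimal Python sketch. It makes several assumptions that are not part of the project description: the lexicographically smallest k-mer of each read is used as a simple stand-in for the core substrings that Locally Consistent Parsing would select, the reads are simulated from a random genome, and zlib (the DEFLATE algorithm also used by gzip) serves as the back-end compressor. The function names and parameters are hypothetical.

```python
import random
import zlib


def core_kmer(read: str, k: int = 8) -> str:
    """Return the lexicographically smallest k-mer of the read.

    Assumption: this is a simplified stand-in for an LCP core substring,
    not Locally Consistent Parsing itself.
    """
    if len(read) < k:
        return read
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def reorder_reads(reads: list[str], k: int = 8) -> list[str]:
    """Place reads that share a core k-mer next to each other."""
    return sorted(reads, key=lambda r: (core_kmer(r, k), r))


def compressed_size(reads: list[str]) -> int:
    """Size in bytes of the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


def simulate_reads(genome_len: int = 50000, read_len: int = 100,
                   n_reads: int = 5000, seed: int = 0) -> list[str]:
    """Sample error-free reads at random positions of a random genome."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    starts = [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


if __name__ == "__main__":
    reads = simulate_reads()
    print("bytes, original order:", compressed_size(reads))
    print("bytes, reordered:     ", compressed_size(reorder_reads(reads)))
```

Reads that overlap on the genome tend to share a core k-mer, so sorting on it brings similar reads close together, where a compressor with a limited look-back window can find the repeats. The reordering only changes read order, so it can be placed in front of any compressor; whether read order must be preserved or restored is a separate design decision that this sketch does not address.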

Faculty Supervisor:

Cenk Sahinalp

Student:

Partner:

University of British Columbia

Discipline:

Computer science

Sector:

University:

Simon Fraser University

Program:

Accelerate
