Deep learning for identifying structural variants in the genome - ON-112

Preferred Disciplines and Level: Software engineer (Master's, PhD or postdoctoral fellow)
Company: DNAstack
Project Length: 8-12 months (2 units)
Desired start date: ASAP
Location: Toronto, Ontario
No. of Positions: 1
Preferences: Language: English

About the Company: 

DNAstack is a Toronto-based software company developing a cutting-edge cloud-based platform for genomics analysis, interpretation and sharing. We are a small team of passionate software engineers, bioinformaticians, geneticists and entrepreneurs, helping to define the standards that will drive the field of genomics into the future.

We are looking for talented interns to join our team and assist in the design and development of various aspects of our platform's backend and of our open source projects. We are agile and move quickly. You can expect to tackle tough problems and to design and implement features for a robust, secure and scalable cloud-based platform. You will also have the opportunity to be involved in the development of standards defining the future of genomics.

Project Description:

The commoditization of Whole Genome Sequencing (WGS) represents an enormous and timely opportunity for scientists to discover the genetic causes of diseases. While WGS technologies have significantly reduced costs and increased throughput, the sequence reads that these technologies produce are mostly short and error-prone. These factors make it particularly challenging to identify the presence of complex and/or large genomic rearrangements, including Structural Variants (SVs) and Copy Number Variants (CNVs), from these data.

Google has recently open-sourced DeepVariant, a technology that uses a type of Machine Learning (ML) model called a Convolutional Neural Network (CNN) to identify the locations of SNPs and indels. The model is trained on images of read pileups at locations in the genome where these variants do and do not exist, and learns characteristic signatures of base mismatches, quality scores and other metrics. However, DeepVariant cannot yet detect large or complex genomic rearrangements, which are important drivers of genetic diseases.

The goal of this project is to develop Deep SV, an ML-based technology for the detection of SVs and CNVs from WGS data. It will use a CNN trained on examples encoded with a novel system for Split / Paired-End Mapping (SPEM) information that yields unique signatures for the major classes of SV and CNV events.

Research Objectives:

A CNN will be trained on custom representations of Depth of Coverage (DOC) and Split / Paired-End Mappings (SPEM) derived from read alignment data. The project will be divided into three parts:

  • Compilation of training, development and test sets consisting of read alignment data where SVs and CNVs are known to exist
  • Design of a custom representation of both DOC and SPEM data suitable for training the model
  • Development of the CNN

Methodology:

  • To be determined

Expertise and Skills Needed:

  • Experience with Python
  • Intermediate understanding of Neural Networks
  • Basic understanding of TensorFlow
  • Basic understanding of Bioinformatics/Genomics
  • Strong understanding of professional software development and design practices
  • Knowledge of Docker and Linux is an advantage

 

For more information or to apply to this applied research position, please:

  1. Check your eligibility and find more information about open projects.

  2. Complete this webform. You will be asked to upload your CV. Remember to indicate the title of the project(s) you are interested in and obtain your professor’s approval to proceed.

Program: