Interactive genomic search - ON-113

Preferred Disciplines and Level: Software engineer (Masters, PhD or PDF)
Company: DNAstack
Project Length: 8-12 months (2 units)
Desired start date: ASAP
Location: Toronto, Ontario
No. of Positions: 1
Preferences: Language: English

About the Company: 

DNAstack is a Toronto-based software company developing a cutting-edge cloud-based platform for genomics analysis, interpretation and sharing. We are a small team of passionate software engineers, bioinformaticians, geneticists and entrepreneurs, helping to define the standards that will drive the field of genomics into the future.

We are looking for talented interns to join our team and assist in the design and development of various aspects of our platform’s backend and of our open source projects. We are agile and move quickly. You can expect to tackle tough problems and to design and implement features for a robust, secure and scalable cloud-based platform. You will also have the opportunity to be involved in the development of standards defining the future of genomics.

Project Description:

One of the big problems we’re tackling at the moment is enabling interactive search and exploration of very large datasets.

The genetic information of a single patient often consists of 100-200 GB of data. Large population studies need to cover thousands of patients, so such datasets reach petabytes in size and contain many billions of genetic mutations. To make discoveries, researchers and clinicians need to cross-reference the data with standard databases of annotation metadata, which typically contain hundreds of millions of mutations. They then filter the information iteratively, based on various fields as well as the annotation metadata, until they identify a few mutations of interest.
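To make the workflow above concrete, here is a minimal in-memory sketch in Java of cross-referencing mutations with annotation metadata, counting one facet and narrowing the result set. It is an illustration only, not DNAstack’s implementation: the class FacetedFilter, the record AnnotatedVariant, its fields (gene, impact, alleleFrequency) and the "chrom:pos:ref>alt" key format are all assumptions made for this example, and a real system would run against a datastore rather than Java collections.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: names and fields are assumptions, not a real schema.
final class FacetedFilter {

    // One mutation joined with its (assumed) annotation fields.
    record AnnotatedVariant(String key, String gene, String impact, double alleleFrequency) {}

    // Cross-reference: keep only the mutations that have an annotation entry.
    static List<AnnotatedVariant> annotate(List<String> variantKeys,
                                           Map<String, AnnotatedVariant> annotations) {
        return variantKeys.stream()
                .map(annotations::get)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    // One filtering step; callers chain several of these to narrow the results.
    static List<AnnotatedVariant> filter(List<AnnotatedVariant> variants,
                                         Predicate<AnnotatedVariant> condition) {
        return variants.stream().filter(condition).collect(Collectors.toList());
    }

    // Facet counts: how many of the current results fall under each impact value.
    static Map<String, Long> facetCounts(List<AnnotatedVariant> variants) {
        return variants.stream().collect(
                Collectors.groupingBy(AnnotatedVariant::impact, TreeMap::new, Collectors.counting()));
    }
}
```

A caller would first call annotate, show facetCounts to the user, and then apply filter with a condition such as v -> v.alleleFrequency() < 0.01, repeating the last two steps until only a few mutations remain.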

Traditionally, this process needs to be done on small subsets of data, and takes a long time to complete. Our goal is to take this analysis to the next level by allowing it to run interactively on much larger datasets.

Research Objectives:

To accomplish this goal, we need to tackle several issues:

  • extracting the needed information from the data and transforming it into a normalized form that can be mapped to annotation metadata
  • storing the data in a way that allows for interactive faceted search
  • wrapping the functionality in an API allowing for design and implementation of upstream applications
  • making everything usable in a highly scalable, multi-tenant environment

The final deliverable should include a collection of web services that perform the import and search on the data and can be plugged into the DNAstack platform.
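As an illustration of the first issue, mapping to annotation metadata only works if the same mutation always ends up with the same representation. The Java sketch below shows one simplified way to do this: it strips an optional "chr" prefix, upper-cases the alleles and trims bases shared by the reference and alternate alleles. The class VariantKeys, the record Variant and the "chrom:pos:ref>alt" key format are assumptions made for this example. Full normalization would also left-align insertions and deletions against the reference genome, which this sketch does not attempt.

```java
import java.util.Locale;

// Hypothetical sketch: simplified allele trimming, no left-alignment.
final class VariantKeys {

    record Variant(String chrom, long pos, String ref, String alt) {}

    static Variant normalize(String chrom, long pos, String ref, String alt) {
        // Treat "chr1" and "1" as the same chromosome name.
        String c = chrom.toLowerCase(Locale.ROOT).startsWith("chr") ? chrom.substring(3) : chrom;
        String r = ref.toUpperCase(Locale.ROOT);
        String a = alt.toUpperCase(Locale.ROOT);

        // Trim shared trailing bases, keeping at least one base per allele.
        while (r.length() > 1 && a.length() > 1
                && r.charAt(r.length() - 1) == a.charAt(a.length() - 1)) {
            r = r.substring(0, r.length() - 1);
            a = a.substring(0, a.length() - 1);
        }
        // Trim shared leading bases, moving the position forward accordingly.
        while (r.length() > 1 && a.length() > 1 && r.charAt(0) == a.charAt(0)) {
            r = r.substring(1);
            a = a.substring(1);
            pos++;
        }
        return new Variant(c, pos, r, a);
    }

    // Assumed key format used to look a mutation up in annotation metadata.
    static String key(Variant v) {
        return v.chrom() + ":" + v.pos() + ":" + v.ref() + ">" + v.alt();
    }
}
```

With this sketch, a deletion written as chr1, 100, CTT, CT and the same deletion written as 1, 100, CT, C both produce the key 1:100:CT>C, so they map to the same annotation entry.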

Methodology:

To be more specific, the work consists of the following activities:

  1. Data preprocessing and extraction of relevant information
  2. Design of the way the data is stored, benchmarking of available technologies for storage and querying with respect to the format of the data. Different schemas, levels of normalization and ways of joining different types of data need to be investigated
  3. Implementation of DNAstack’s current search capabilities on the technology stack chosen in the previous step
  4. Extension of the search engine to support annotation metadata
  5. Finding an intelligent way to normalize the data in order to enable annotation mapping
  6. Efficient search across custom annotations compiled by DNAstack users
  7. Wrapping of search services in APIs allowing for upstream analysis and development of applications on top of search
  8. Advanced search API incorporating information from various external services for genomic data, such as phenotype or disease data

DNAstack is currently working on Activity 2. The scope of your project includes Activities 3-8 as well as potential minor modifications to deliverables from Activities 1-2, as required by Activities 3-8.

Expertise and Skills Needed:

  • Extensive experience with Java, particularly with Java EE or Spring
  • Experience with backend web development, design and implementation of RESTful web services
  • Hands-on experience with SQL and NoSQL databases and building systems on top of them
  • Strong understanding of professional software development and design practices
  • Familiarity with cloud computing and building distributed systems with microservice architecture
  • Motivation and ability to work independently and as part of an agile, global team

 

For more info or to apply to this applied research position, please

  1. Check your eligibility and find more information about open projects.

  2. Complete this webform. You will be asked to upload your CV. Remember to indicate the title of the project(s) you are interested in and obtain your professor’s approval to proceed.

Program: