Determine optimized flows for interpretation of data regarding product failure incidents - ON-210

Preferred Disciplines: Post Doc in Data Science
Company: Anonymous
Project Length: 1 year (3 units)
Desired start date: As soon as possible
Location: Toronto, ON
No. of Positions: 1
Preferences: Universities in the Kitchener/Waterloo area. Strong preference for institutions without onerous IP requirements.

About the Company: 

The organization is an AI startup focused on harvesting and enriching data related to product defects for sale to corporations.

Project Description:

The aim of this assignment is to investigate methods to gather, analyze, and present data relating to product failure incidents. The work will involve supervising graduate students in related disciplines.

Research Objectives:

  • Investigation and documentation of the ideal flow of data from internet sources through to information suitable for making business decisions.
  • Experimentation with designs related to this flow: data harvesting, data storage, processing in an AI context, interpretation for insights.
  • Supervision of other researchers.
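The first two objectives describe a harvest → store → process → interpret flow. A minimal sketch of that flow is below; all function names, fields, and the toy filtering heuristic are illustrative assumptions, not part of the project specification:

```python
# Minimal sketch of the harvest -> store -> process -> interpret flow.
# All names and the toy heuristic are illustrative assumptions.

def harvest(source: str) -> list[dict]:
    # Placeholder: a real implementation would scrape review sites
    # or call external APIs (weather, traffic, etc.).
    return [{"source": source, "text": "screen cracked after a drop", "stars": 1}]

def store(records: list[dict], repository: list[dict]) -> list[dict]:
    # Stand-in for loading raw records into a distributed database.
    repository.extend(records)
    return repository

def process(repository: list[dict]) -> list[dict]:
    # Toy heuristic flagging likely defect reports; a real pipeline
    # would apply NLP or other AI models here.
    return [r for r in repository if r["stars"] <= 2]

def interpret(flagged: list[dict]) -> dict:
    # Reduce flagged records to a decision-ready summary.
    return {"defect_reports": len(flagged)}

repo: list[dict] = []
store(harvest("example-review-site"), repo)
summary = interpret(process(repo))
```

Each stage is a candidate for the experimentation the objectives call for: swapping the placeholder scraper, the storage layer, or the processing model without disturbing the rest of the flow.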


The overall (multi-position) objective is to create a data lake with a visualization dashboard that can support analysis:

  1. Find unique sources of data that relate to reviews of consumer products
  2. Analyze these data sources for added value and clean them
  3. Set up an online repository to collect these raw sources of data
  4. Implement the ability to tap into external APIs (e.g. weather, traffic) for additional raw data
  5. Load the raw sources of data into a distributed database
  6. Attach a visualization tool that can support dashboard analysis
  7. Investigate any correlations using AI or statistical data science, e.g. between location and weather at the time of an incident
  8. Implement an emulation of a mobile app that can inject glitch data into the data lake, with the automatic ability to look up external data
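Step 7 above could be prototyped with a plain Pearson correlation before reaching for heavier AI tooling. In the sketch below, the incident record layout and the sample values are hypothetical; only the correlation formula itself is standard:

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class Incident:
    product_id: str
    temperature_c: float  # hypothetical field joined from an external weather API
    failures: int

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sample: failure counts recorded alongside weather at the incident.
incidents = [
    Incident("p1", 30.0, 12),
    Incident("p1", 25.0, 9),
    Incident("p1", 18.0, 4),
    Incident("p1", 35.0, 15),
]
r = pearson([i.temperature_c for i in incidents],
            [i.failures for i in incidents])
# r is strongly positive for this toy sample
```

In practice this computation would run over the distributed database from step 5, with the weather fields filled in by the external-API lookups from step 4.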


Challenges:

  • Data may be biased; for example, Amazon reviews that are not updated over a product's life cycle
  • Novel sources of data that can supplement primary sources need to be investigated; the required data may not be readily available
  • The sheer quantity of data available (that can be scraped) is enormous

Expertise and Skills Needed:

  • PhD in Computer Science or a related field in data management
  • Expertise in building distributed systems and query processing for big data ecosystems
  • Strong understanding of data lakes and warehousing; able to architect and design a data warehouse
  • Expertise with data schemas: logical and physical data modeling
  • Deep knowledge of ETL processes and tools
  • Experience with AWS or another major cloud platform such as GCP
  • Proficiency in Python and PostgreSQL
  • 5+ years’ experience leading data engineering teams in the development and maintenance of ETL software
  • 5+ years’ experience in hands-on development of automated ETL solutions

For more information or to apply to this applied research position, please:

  1. Check your eligibility and find more information about open projects
  2. Obtain approval from your supervisor, then apply through the webform, submitting your CV along with a link to your supervisor’s university webpage