Determine optimized methods for harvesting web data regarding defects - ON-207

Preferred Disciplines: Computer Science with a focus on data analytics; post-Masters preferred, post-graduate acceptable
Company: Anonymous
Project Length: 1 year (2 units)
Desired start date: As soon as possible
Location: Kitchener, ON; may be a split team with some members in Toronto
No. of Positions: 1
Preferences: Colleges and Universities in Kitchener/Waterloo area. Strong preference for institutions without onerous IP requirements.

About the Company: 

The organization is an AI startup focused on harvesting and enriching data related to product defects for sale to corporations.

Project Description:

The aim of this assignment is to investigate methods for optimizing the discovery of, and linkage to, a wide variety of data sources. Success will be demonstrated by the creation of a data lake that stores raw structured and unstructured data, as well as a data warehouse if necessary.

Research Objectives:

  • Investigation and discovery of data sources for product defects.
  • Investigation and discovery of data sources that can serve as proxies for product defects, in the event that item 1 yields data of insufficient quantity or quality.
  • Advanced methods and techniques for correlating data from unrelated sources to create reliable records of product-failure incidents (a toy record-linkage illustration follows this list).
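As a concrete starting point for the third objective, the sketch below shows one of the simplest record-linkage techniques: matching product mentions from two unrelated sources by string similarity. The source lists, field values, and the 0.5 threshold are illustrative assumptions; a production pipeline would add normalization, blocking, and probabilistic matching on top of this.

    # Toy record linkage: match product mentions across two unrelated sources
    # by string similarity. All data and the threshold are hypothetical.
    from difflib import SequenceMatcher

    recall_notices = ["Acme Toaster T-100", "Globex Kettle K9"]
    review_titles = ["acme toaster t100 stopped heating", "globex k9 kettle leaks"]

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    for notice in recall_notices:
        best = max(review_titles, key=lambda title: similarity(notice, title))
        if similarity(notice, best) > 0.5:  # assumed linkage threshold
            print(f"linked: {notice!r} <-> {best!r}")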

Methodology:

The overall (multi-position) objective is to create a data lake with a visualization dashboard that can support analysis:

  1. Find unique sources of data that relate to reviews of consumer products
  2. Analyze these data sources for added value and clean them
  3. Set up an online repository to collect these raw sources of data
  4. Implement the ability to tap into external APIs (e.g., weather, traffic) for additional raw data (see the ingestion sketch after this list)
  5. Load the raw data into a distributed database
  6. Attach a visualization tool that can support dashboard analysis
  7. Investigate any correlations using AI or statistical data science, e.g., between location and weather at the time of an incident (see the correlation sketch after this list)
  8. Implement an emulation of a mobile app that can inject glitch data into the data lake, with the ability to automatically look up external data
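A minimal sketch of steps 4 and 5, assuming PostgreSQL serves as the raw zone of the data lake: fetch a payload from an external API and land it unmodified, postponing all cleaning to later stages. The endpoint URL, connection details, and table name are all hypothetical.

    # Fetch raw JSON from an external API (step 4) and append it, untouched,
    # to a raw-zone table (step 5). Requires the requests and psycopg2 packages.
    import requests
    import psycopg2
    from psycopg2.extras import Json

    API_URL = "https://api.example-weather.test/v1/observations"  # hypothetical endpoint

    def fetch_raw(lat: float, lon: float) -> dict:
        resp = requests.get(API_URL, params={"lat": lat, "lon": lon}, timeout=10)
        resp.raise_for_status()
        return resp.json()

    def land_raw(payload: dict) -> None:
        # Connection parameters and schema are assumptions.
        conn = psycopg2.connect(dbname="datalake", user="etl")
        try:
            with conn, conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO raw_weather (ingested_at, payload) VALUES (now(), %s)",
                    (Json(payload),),
                )
        finally:
            conn.close()

    if __name__ == "__main__":
        land_raw(fetch_raw(43.45, -80.49))  # Kitchener, ON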
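For step 7, a correlation check can start as simply as the sketch below, which tests whether precipitation at the time and place of incidents tracks incident counts. The column names and toy values are assumptions; any association found this way is only a candidate until confounders (season, sales volume, reporting bias) are ruled out.

    # Test for a statistical association between weather and incident frequency.
    # Requires the pandas and scipy packages; the data here is invented.
    import pandas as pd
    from scipy.stats import pearsonr

    # Hypothetical joined dataset: one row per (location, day).
    df = pd.DataFrame({
        "precip_mm":      [0.0, 2.1, 14.8, 0.4, 9.3, 22.0],
        "incident_count": [1,   2,   6,    1,   4,   8],
    })

    r, p_value = pearsonr(df["precip_mm"], df["incident_count"])
    print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")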

Challenges:

  • Data might be biased; for example, Amazon reviews may not be updated over a product's life cycle
  • Novel sources of data that can supplement primary sources need to be investigated - the data may not be readily available
  • The sheer quantity of data available (that can be scraped) is enormous

Expertise and Skills Needed:

  • An understanding of computer science fundamentals, specifically databases, cloud computing, and infrastructure setup
  • Proficient in SQL and database setup, including PostgreSQL and MySQL
  • Proficient with ETL processes in the Python programming language
  • Proficient in Unix/Linux operating systems
  • Knowledge of and/or experience with message queues such as Kafka or Celery is a plus (see the sketch after this list).
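To make the message-queue item concrete, the sketch below shows one way a queue could connect the emulated mobile app (methodology step 8) to the ETL pipeline: a producer publishes glitch events for a downstream consumer to enrich and load. The broker address, topic name, and event fields are assumptions; it requires the kafka-python package and a running Kafka broker.

    # Stand-in for the emulated mobile app: publish glitch events to Kafka
    # for a downstream ETL consumer. Broker, topic, and fields are hypothetical.
    import json
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    event = {
        "product_id": "SKU-1234",       # hypothetical identifiers
        "glitch_type": "screen_freeze",
        "lat": 43.45,
        "lon": -80.49,
        "reported_at": time.time(),
    }

    producer.send("defect-glitch-events", value=event)  # hypothetical topic
    producer.flush()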

For more information or to apply to this applied research position, please:

  1. Check your eligibility and find more information about open projects
  2. Interested students need to get approval from their supervisor and send their CV, along with a link to their supervisor's university webpage, by applying through the webform.