Data Integration: how to define and implement a unified query interface for multiple independent data sources - MB-003

Preferred Disciplines: Data Science or Information Theory (Master, PhD or Post-Doc)
Project length: 6 months (1 unit)
Desired start date: As soon as possible
Location: Winnipeg, MB
No. of Positions: 1
Preferences: IP policies of the partner university should allow maximum flexibility to the industrial partner.
Company: Anonymous

About Company:

Partner organization is an established IT Strategy, Services and Products firm with operations in Canada and in the United States. It offers technology consulting and strategic resourcing across industries, including healthcare, financial services, insurance, and utilities/energy. Since its inception in 2006 the organization has assisted 75+ clients in achieving their Business System goals and initiatives by providing management consulting services, program & project management, software testing, systems development, and integration services to organizations of all sizes. 

Since 2014, the organization has been actively executing its R&D program in healthcare through its Innovations Division and, together with local and international partners, has been working on several innovative Health IT product and service offerings.

Project Description:

The project focuses on the problem of building a combined dataset (e.g. for a data researcher or data scientist) by creating a query that can pull data from multiple independent data sources, when 1) the independent data sources are loosely associated with each other, in that records in one data source may be related to records in another; and 2) a data catalog is available that contains metadata for each data source, but the source data itself is not accessible to the researcher for privacy reasons. Among the challenges/research aspects to be addressed: only the metadata is available, so combining the data “manually” (interpreting the records and drawing logical conclusions from the data) is not possible due to privacy concerns; and the data sources are independent, and therefore unlikely to conform to a single standard format or naming convention, i.e. the same data may have different labels, or the same label may refer to different things in different sources.
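To make the metadata-only constraint concrete, the following is a minimal sketch (not the partner's method) of how candidate “links” between sources might be proposed from catalog metadata alone. The `FieldMeta` structure, the source and field names, and the 0.8 similarity threshold are all assumptions made for illustration; no record values are read.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class FieldMeta:
    """One catalog entry (hypothetical structure): metadata only, no data."""
    source: str  # name of the data source the field belongs to
    name: str    # field label as it appears in that source
    dtype: str   # declared type, e.g. "string" or "date"


def normalize(label: str) -> str:
    """Lower-case a label and drop separators: 'Patient_ID' -> 'patientid'."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def candidate_links(fields, threshold=0.8):
    """Return (field_a, field_b, score) for cross-source pairs that look related.

    Pairs are compared on declared type and label similarity only.
    """
    links = []
    for a, b in combinations(fields, 2):
        # Skip pairs within one source and pairs with different declared types.
        if a.source == b.source or a.dtype != b.dtype:
            continue
        score = SequenceMatcher(None, normalize(a.name), normalize(b.name)).ratio()
        if score >= threshold:
            links.append((a, b, round(score, 2)))
    return sorted(links, key=lambda link: link[2], reverse=True)


# Illustrative catalog; every source and field name here is made up.
catalog = [
    FieldMeta("clinic_visits", "Patient_ID", "string"),
    FieldMeta("clinic_visits", "visit_date", "date"),
    FieldMeta("lab_results", "patientId", "string"),
    FieldMeta("lab_results", "collected_on", "date"),
]

for a, b, score in candidate_links(catalog):
    print(f"{a.source}.{a.name} <-> {b.source}.{b.name} (score={score})")
```

Label similarity is only a heuristic: as noted above, the same label may refer to different things in different sources, so any proposed link would still need confirmation from richer metadata (descriptions, units, value domains) or from a data steward.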

The goal of the project is to define/create a query mechanism and to define/create a user interface for that query mechanism, which will allow a data researcher or data scientist to overcome the problem (as stated above) as simply and efficiently as possible.

Project objectives:

  • Research the state-of-the-art (theory, architecture & user interfaces) to determine how the problem (as stated above) is/can be solved using existing technology and potential advancements of the current state-of-the-art
  • Identify the best strategy for the project
  • Design user interfaces:
    • For building queries that span multiple sources
    • To visualize the relationships (“links”) between data sets
  • Develop software:
    • For user interfaces
    • To implement the query functionality

Research Objectives/Sub-Objectives:

  • State-of-the-art and potential improvements
    • What is the state-of-the-art with respect to Data Integration and Data Virtualization for large collections of loosely-related data from multiple sources?
    • If there are existing solutions to this problem, which architecture do they follow, how do they perform query/search, what user interfaces do they employ, and how much of the process can be automated?
    • Can we improve on the state-of-the-art in any of these categories?
  • Architecture
    • Which approach is better for the project: Global-as-View (GAV) or Local-as-View (LAV)?
    • What information should be contained in the data catalog when adding a data source?
    • What structure should be used for metadata in the data catalog?
  • Query/Search mechanism
    • What methods can be used to search for available data, within the constraints of the problem (as stated above)?
    • What methods can be used to identify usable “links” between data sets (i.e. common identifiers or similar elements)?
    • What would be the format and content of a query?
  • User interface
    • What would be the ideal user interface (elements, composition, workflow, functionality) for a data researcher or data scientist who wants to create/edit a query as described above? For example: would a query language be sufficient, or would a drag-and-drop user interface be more suitable?
    • Would it be possible and/or beneficial to generate a visualization of the “links” between data elements in loosely-related data sets?
  • Automation
    • Is it possible to automate the process of finding “links” between similar elements in loosely-related data sets?
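As background for the GAV/LAV question above: in a GAV design each attribute of the global schema is defined in terms of source fields, so a global query can be unfolded directly into per-source requests; in a LAV design each source is instead described as a view over the global schema, which makes adding sources easier but query rewriting harder. The sketch below illustrates only the GAV-style unfolding, under stated assumptions: the `GAV_MAPPING` table, the `plan_query` helper, and all source and field names are hypothetical.

```python
# Hypothetical GAV-style mapping: global attribute -> {source: local field name}.
GAV_MAPPING = {
    "patient_id": {"clinic_visits": "Patient_ID", "lab_results": "patientId"},
    "visit_date": {"clinic_visits": "visit_date"},
    "test_code": {"lab_results": "test_cd"},
}


def plan_query(global_attrs, join_on):
    """Unfold a query over global attributes into per-source field selections.

    Returns {source: {global_attr: local_field}}, keeping only sources that
    contribute at least one requested attribute besides the join key.
    """
    for attr in [join_on, *global_attrs]:
        if attr not in GAV_MAPPING:
            raise KeyError(f"unknown global attribute: {attr}")

    # Start every joinable source with its local name for the join key.
    plan = {src: {join_on: local} for src, local in GAV_MAPPING[join_on].items()}

    for attr in global_attrs:
        for src, local in GAV_MAPPING[attr].items():
            if src not in plan:
                raise ValueError(f"{src} has no field mapped to {join_on}")
            plan[src][attr] = local

    return {src: fields for src, fields in plan.items() if len(fields) > 1}


plan = plan_query(["visit_date", "test_code"], join_on="patient_id")
for source, fields in plan.items():
    print(source, "->", fields)
```

Each entry of the resulting plan could then be executed inside its own source, with only the joined result (or an aggregate of it) returned to the researcher; how that execution and join would respect the privacy constraint is one of the open questions of the project.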

Methodology:

  • To determine the state-of-the-art: review of research papers and scientific publications, patent search, market analysis, etc.
  • To explore potential advancements: conduct user surveys, develop prototypes, evaluate prototypes.

Expertise and Skills Needed:

Required Qualifications:

  • Background in Data Science or Information Theory
  • Understanding of modern software development concepts, languages, methods & tools
  • Strong communication skills
  • Strong problem-solving skills

Ideal Qualifications:

  • Software development experience
  • Has written code in JavaScript (Node.js), Python
  • Has worked with GraphQL or other query languages
  • Experience with User Interface (UI) / User eXperience (UX) design

 

For more info or to apply to this applied research position, please

  1. Check your eligibility and find more information about open projects.
  2. Interested students need to get approval from their supervisor and send their CV, along with a link to their supervisor’s university webpage, by applying through the webform or directly to Iman Yahyaie at iyahyaie(a)mitacs.ca
Program: