Automated extraction and compilation of data from multiple and diverse data sources - BC-418

Preferred Disciplines: Computer Science, AI / Machine Learning (Masters, PhD or Post-Doc)
Project length: 4-6 months (1 unit)
Approx. start date: As soon as possible
Location: Vancouver, BC
No. of Positions: 1
Preferences: UBC, SFU
Company: Lunge Systems  

About Company:

Lunge Systems is a young startup in stealth mode. We are building an innovative and intuitive technology framework that will improve employee and personal productivity. Our product helps companies map their organizational capabilities and use AI/ML algorithms to enhance them. Improvements and innovation in workplace productivity will help our clients become more competitive in the global arena.

Summary of Project:

Our company has built an innovative framework to increase workplace productivity. To implement this technology, we need to map the tasks that employees in different roles perform at top companies in each sector. Much of this information is publicly available, but it is scattered across different places and is not in a structured format. Sources include “About Us” pages on company websites, “Responsibilities” sections of job postings, and LinkedIn profiles of companies’ current and past employees. The challenge is to extract the relevant information from these sources, to identify keywords for building a classification schema (roles, responsibilities/outputs and tasks in a company or a sector), and to create the relationship structure.

This project is a standalone piece that would help us build such a database. It will involve studying the structure of these information sources and then devising a strategy to extract the desired relationships (e.g. who does what in a given company/sector).
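
For illustration only, a "who does what" relationship could be held as a small record and indexed by company and role. This is a minimal sketch, not part of the project specification: the `Relation` class, its field names, the `group_tasks_by_role` helper and the sample values (including "ExampleCo") are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One 'who does what' fact. All field names are illustrative assumptions."""
    sector: str
    company: str
    role: str
    task: str
    source: str  # hypothetical labels, e.g. "job_posting", "about_page", "profile"


def group_tasks_by_role(relations):
    """Index tasks by (company, role); a set drops duplicates across sources."""
    index = defaultdict(set)
    for rel in relations:
        index[(rel.company, rel.role)].add(rel.task)
    return index


# Made-up sample data: the same task reported by two different sources.
relations = [
    Relation("Retail", "ExampleCo", "Data Analyst", "build sales dashboards", "job_posting"),
    Relation("Retail", "ExampleCo", "Data Analyst", "build sales dashboards", "profile"),
    Relation("Retail", "ExampleCo", "Data Analyst", "clean transaction data", "job_posting"),
]
tasks_by_role = group_tasks_by_role(relations)
```

Keeping the source on each record is one possible design choice: it would let the same task found in a job ad and in a profile be merged while still recording where it came from.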

Research Objectives/Sub-Objectives:

  • To study data sources, research the best approaches, and then create an algorithm that generates a knowledge graph of roles, responsibilities/outputs and tasks from commonly used types of job description (job ads, company websites, LinkedIn profiles).
  • To build an economy-wide database of sectors, companies, roles, responsibilities, outputs and tasks using the above algorithm.


    • The research will involve identifying the best approaches and then creating web parsers to extract content from the internet.
    • Use NLP to identify keywords and build the classification schema.
    • The algorithm's accuracy should be evaluated using ML or other techniques.
    • Run the algorithm to build the relationship structure for industry sectors using data on Fortune 500 companies.
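
As a rough illustration of the parsing and evaluation steps listed above, the sketch below pulls list items from under a "Responsibilities" heading and scores them against a hand-labelled list. It uses only the Python standard library and assumes the posting marks the section with an `h1` to `h4` heading followed by `<li>` items, which real pages may not do. The `ResponsibilityParser` class, the helper functions and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ResponsibilityParser(HTMLParser):
    """Collect <li> text that follows a heading containing 'responsibilities'."""

    HEADINGS = {"h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_heading = False
        self._in_section = False
        self._in_item = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._in_section = False  # a new heading ends the previous section
            self._buffer = []
        elif tag == "li" and self._in_section:
            self._in_item = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_heading or self._in_item:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._in_heading:
            self._in_heading = False
            self._in_section = "responsibilities" in "".join(self._buffer).lower()
        elif tag == "li" and self._in_item:
            self._in_item = False
            text = " ".join("".join(self._buffer).split())  # normalise whitespace
            if text:
                self.items.append(text)


def extract_responsibilities(html):
    """Return the responsibility bullet texts found in one HTML document."""
    parser = ResponsibilityParser()
    parser.feed(html)
    parser.close()
    return parser.items


def precision_recall(predicted, gold):
    """Exact-match precision and recall against a hand-labelled list."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up sample posting and hand-labelled answers.
sample_html = """
<h2>About Us</h2><ul><li>Founded in 2010</li></ul>
<h3>Responsibilities</h3>
<ul><li>Build sales dashboards</li><li>Clean   transaction data</li></ul>
<h3>Benefits</h3><ul><li>Health plan</li></ul>
"""
found = extract_responsibilities(sample_html)
precision, recall = precision_recall(
    found, ["Build sales dashboards", "Clean transaction data", "Present findings"]
)
```

Exact string matching is the simplest possible accuracy check; a real evaluation would probably need fuzzier matching, since the same task can be worded in many ways.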

Expertise and Skills Needed:

    • Programming
    • Web Scraping
    • Natural Language Processing
    • Machine Learning
    • Familiarity with data sources such as LinkedIn and Monster (Jobs)
    • Experience working with Developer APIs (especially LinkedIn)

    For more info or to apply to this applied research position, please:

    1. Check your eligibility and find more information about open projects.
    2. Get approval from your supervisor, then apply through the webform, sending your CV along with a link to your supervisor’s university webpage.