BoxOfDocs Tool Development using AI, Machine Learning, Pattern Recognition - AB-046

Preferred Disciplines: Computer Engineering, Mathematical and Statistical Sciences, or Computing Science (Masters, PhD or Post-Doc)
Project length: 4-6 months (1 unit)
Desired start date: As soon as possible
Location: Calgary, AB
No. of Positions: 1
Preferences: University of Calgary, University of Alberta
Company: Go-By Design Inc. o/a BoxOfDocs 

About Company:

BoxOfDocs is an online ‘one-stop’ curated platform for sharing documented information between peer members in any industry. An AI-driven, industry-specific search engine ensures that members can quickly and easily find the most recent and relevant documents available from their industry peers.

Project Description:

Go-By Design is building BoxOfDocs: an advanced web-based tool with industry-specific, customized document search and filter capabilities using Deep Learning / AI / Machine Learning / Pattern Recognition / NLP.

At the core is the ECOS system, the engine that will Extract, Cleanse, Organize and Self-Populate the repository of documents collected from various sources.

Research Objectives/Sub-Objectives:

OBJECTIVE 1 – IDENTIFY

  • Develop an automated method to identify all municipal websites within a given country.
  • The solution must be configurable, allowing the system administrator to change the countries of interest as needed.
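
The discovery approach could take many forms; below is a minimal, illustrative Python sketch, assuming a JSON file of per-country seed directory URLs and a simple keyword heuristic. The file name, URLs and filtering rule are assumptions for discussion, not part of the project specification.

    # Illustrative sketch only: discover candidate municipal websites from
    # per-country seed directory pages listed in a JSON config. File name,
    # URLs and the relevance heuristic are assumptions, not the design.
    import json
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def load_countries(path="countries.json"):
        # Example config: {"Canada": ["https://example.org/municipal-directory"]}
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)

    def discover_municipal_sites(seed_urls):
        """Collect candidate municipal website URLs from seed directory pages."""
        candidates = set()
        for seed in seed_urls:
            resp = requests.get(seed, timeout=30)
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "html.parser")
            for link in soup.select("a[href]"):
                url = urljoin(seed, link["href"])
                # Placeholder heuristic; real filtering would be configurable
                # by the system administrator.
                if any(k in url.lower() for k in ("city", "town", "municipal")):
                    candidates.add(url)
        return sorted(candidates)

    if __name__ == "__main__":
        for country, seeds in load_countries().items():
            sites = discover_municipal_sites(seeds)
            print(f"{country}: {len(sites)} candidate municipal sites")

With this shape, changing the countries of interest reduces to editing the configuration file, keeping the administrator requirement separate from the discovery logic.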

OBJECTIVE 2 – ECOS Build

  • Sub-Objective 2.1 – Extract
    • For the websites identified in Objective 1, develop an automated method, via web data extraction or another methodology, for identifying and extracting relevant content, documents and document links within those websites (an illustrative extraction sketch follows this objective).
    • The solution must also have the ability to extract data from secured locations where clients have given us permission and security credentials to access their documents.
    • Document types may include, but are not limited to, PDF, Word, Excel, images, diagrams, etc.
    • The definition of ‘relevant’ must be flexible, allowing a system administrator to modify its parameters.
  • Sub-Objective 2.2 – Cleanse
    • All downloaded page content and documents must be assigned a relevant name that lets a reader immediately understand the subject of that document.
    • Develop an automated methodology to sweep through the document, its filename and its metadata to identify its subject and assign it a reader-friendly document name.
  • Sub-Objective 2.3 – Organize
    • Each document downloaded must be assigned specific attributes.
    • Develop an automated methodology to determine the document type, source municipality and by-law issue dates, and to identify keywords, extract key phrases and find insights and relationships in the text.
  • Sub-Objective 2.4 – Self-Populate
    • All downloaded web content, documents, associated metadata and data produced from Sub-Objectives 2.2 and 2.3 must be loaded into the existing BoxOfDocs AWS-S3 database structure.
    • An efficient, cost-effective and sustainable data load process must be developed in collaboration with our development team. The current manual mapping solution can be studied to understand the as-is process.
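
As one possible starting point for the Extract step (consistent with the Scrapy suggestion under Methodology), the sketch below shows the general shape of a spider that yields candidate document links. The start URL, allowed domain, extension list and item fields are assumptions; handling of secured client locations is out of scope for this sketch.

    # Illustrative Scrapy sketch for Sub-Objective 2.1: crawl a predetermined
    # site and yield links to candidate documents (PDF, Word, Excel, images).
    # Start URL, allowed domain, extensions and item fields are assumptions.
    import scrapy

    DOC_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".png", ".jpg")

    class MunicipalDocSpider(scrapy.Spider):
        name = "municipal_docs"
        # In practice these would come from the site list built in Objective 1.
        allowed_domains = ["example-municipality.ca"]
        start_urls = ["https://www.example-municipality.ca/"]

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                url = response.urljoin(href)
                if url.lower().endswith(DOC_EXTENSIONS):
                    # Candidate document link; relevance filtering would be
                    # applied downstream using administrator-defined rules.
                    yield {"source_page": response.url, "document_url": url}
                else:
                    # Follow in-site pages so linked content is also swept.
                    yield response.follow(url, callback=self.parse)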

OBJECTIVE 3 – ECOS Maintenance

  • Sub-Objective 3.1 – Sweep & Extract
    • Develop an automated method to regularly sweep the websites from Objective 1 to identify any new, changed or deleted content on the pages and in the documents linked within (an illustrative change-detection sketch follows this objective).
    • New content:
      • Identify and extract previously existing pages with new relevant content and documents;
      • Identify new pages within the websites;
      • Identify and extract all new relevant content and documents on new pages within the website.
    • Changed content:
      • Identify pre-existing pages that have been modified;
      • Identify and extract all pre-existing pages with relevant modified content;
      • Identify and extract all pre-existing documents that have been modified.
    • Deleted content:
      • Identify previously existing pages that have been deleted;
      • Identify all relevant deleted content from previously existing pages;
      • Identify all deleted documents that were previously on the site.
  • Sub-Objective 3.2 – Cleanse-Organize-Load
    • For all new and modified content, data and documents collected in Sub-Objective 3.1, repeat the processes in Sub-Objective 2.2 (Cleanse), Sub-Objective 2.3 (Organize) and Sub-Objective 2.4 (Self-Populate).
  • Sub-Objective 3.3 – Retire
    • For all deleted content and documents identified in Sub-Objective 3.1, retire or archive the related content, data and documents.
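
One hedged way to frame the sweep in Sub-Objective 3.1 is hash-based change detection against a snapshot from the previous crawl. The snapshot format and labels below are assumptions; a production version would also track document-level changes and persist snapshots per sweep.

    # Illustrative sketch of hash-based change detection (Sub-Objective 3.1).
    # Snapshot format ({url: content hash}) and labels are assumptions only.
    import hashlib

    def content_hash(body: bytes) -> str:
        return hashlib.sha256(body).hexdigest()

    def classify_changes(previous: dict, current: dict) -> dict:
        """Compare two {url: hash} snapshots and bucket URLs as new/changed/deleted."""
        prev_urls, curr_urls = set(previous), set(current)
        return {
            "new": sorted(curr_urls - prev_urls),
            "deleted": sorted(prev_urls - curr_urls),
            "changed": sorted(
                url for url in prev_urls & curr_urls if previous[url] != current[url]
            ),
        }

    # Example: pages seen in the last sweep vs. the current sweep.
    previous = {"https://example.ca/bylaws": "a1", "https://example.ca/permits": "b2"}
    current = {"https://example.ca/bylaws": "a1", "https://example.ca/zoning": "c3"}
    print(classify_changes(previous, current))
    # {'new': ['https://example.ca/zoning'], 'deleted': ['https://example.ca/permits'], 'changed': []}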

OBJECTIVE 4 – BoxOfDocs Learning

  • Based on user profile, user behaviour, peer behaviour and other factors to be determined, develop algorithms that will create custom document search criteria and results to present to users on BoxOfDocs.
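
Purely as an illustration of the kind of signal this objective could start from, the sketch below ranks documents by how often peers in the same industry opened them. The event structure and the count-based score are assumptions; the actual algorithms would combine profile, behaviour and the other factors still to be determined.

    # Illustrative peer-behaviour ranking sketch for Objective 4. The event
    # records and the simple count-based score are assumptions for discussion.
    from collections import Counter

    def rank_for_user(user_industry, events):
        """Rank document IDs by how often peers in the same industry opened them."""
        peer_opens = Counter(
            e["doc_id"] for e in events if e["industry"] == user_industry
        )
        return peer_opens.most_common()

    events = [
        {"doc_id": "bylaw-001", "industry": "municipal"},
        {"doc_id": "bylaw-001", "industry": "municipal"},
        {"doc_id": "spec-housing", "industry": "construction"},
    ]
    print(rank_for_user("municipal", events))  # [('bylaw-001', 2)]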

OBJECTIVE 5 – BoxOfDocs Dashboard

  • Develop a dashboard, for internal use only, with key indicators in the areas of:
    • Application performance;
    • User behaviour;
    • Document and database content.

Methodology:

  • The use of existing and/or open-source solutions or modules shall be maximized, with custom integration as required. Key considerations will be cost of solution, availability, stability, maintenance, scalability and ease of integration between components.
  • For the Extract and Sweep components in Sub-Objective 2.1 (Extract) and Sub-Objective 3.1 (Sweep & Extract), consideration should be given to tools such as Scrapy (scrapy.org) or other similarly available web data extraction tools.
  • For the Cleansing, Organizing and Learning components in Sub-Objective 2.2 (Cleanse), Sub-Objective 2.3 (Organize), Sub-Objective 3.2 (Cleanse-Organize-Load) and Objective 4 (BoxOfDocs Learning), Amazon Comprehend and Amazon SageMaker must be explored (an illustrative Comprehend call is sketched after this list). Other tools, including but not limited to those listed below, can be considered so long as they conform to the key considerations mentioned above:
    • IBM Watson Natural Language Understanding;
    • Microsoft Azure (Text analytics API);
    • Google Cloud Natural Language;
    • Microsoft Azure (Linguistic Analysis API) - beta.
  • If applicable, the use of Amazon Machine Images (AMI) will be considered.
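
Since Amazon Comprehend must be explored for the Cleanse and Organize steps, the sketch below shows one hedged way its key-phrase and entity APIs could be called on a document's extracted text. The region, truncation length and score threshold are assumptions, and AWS credentials are assumed to be configured separately.

    # Illustrative boto3 sketch: call Amazon Comprehend to pull key phrases and
    # entities from extracted text (Sub-Objectives 2.2 / 2.3). Region, the
    # truncation length and the score threshold are assumptions only.
    import boto3

    comprehend = boto3.client("comprehend", region_name="us-east-1")

    def analyze_text(text, language="en"):
        """Return high-confidence key phrases and entities for one document."""
        snippet = text[:4500]  # stay well under the per-request size limit
        phrases = comprehend.detect_key_phrases(Text=snippet, LanguageCode=language)
        entities = comprehend.detect_entities(Text=snippet, LanguageCode=language)
        return {
            "key_phrases": [p["Text"] for p in phrases["KeyPhrases"] if p["Score"] > 0.9],
            "entities": [(e["Text"], e["Type"]) for e in entities["Entities"] if e["Score"] > 0.9],
        }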

Expertise and Skills Needed:

  • Machine Learning / Pattern Recognition / Natural Language Processing
  • AI
  • Data science
  • Knowledge of Web application services
  • Web data extraction
  • Process automation

For more info or to apply to this applied research position, please

  1. Check your eligibility and find more information about open projects.
  2. Interested students need to get approval from their supervisor and send their CV, along with a link to their supervisor’s university webpage, by applying through the webform.
Program: