BoxOfDocs ECOS - AB-046
Preferred Disciplines: Computer Engineering, or Mathematical and Statistical Sciences, or Computing Science (Masters, PhD or Post-Doc)
Project length: 4-6 months (1 unit). Potential for more.
Desired start date: As soon as possible
Location: Calgary, AB
No. of Positions: 1
Preferences: University of Calgary, University of Alberta
Company: Go-By Design Inc. o/a BoxOfDocs
BoxOfDocs is an online ‘one-stop’ curated platform for sharing documented information between peer members in any industry. An AI-driven industry-specific search engine ensures that members can quickly and easily find the most recent and relevant documents available from their industry peers.
The ECOS project requires the engineering of an intelligent agent that will, with no direct human control, collect key unstructured data from 3500 distinct websites.
Using Deep Learning, develop an algorithm that will determine how and when the websites must be searched, identify which data and documents to collect, categorize each, extract key information from the documents, and ensure all the data collected is kept up to date and secure.
Ensure the algorithm’s performance is efficient and accurate.
OBJECTIVE 1 – Collect
- Determine the best methodology for accessing and collecting key unstructured data and documents from 3500 distinct websites on a regularly scheduled basis.
- Document types may include, but are not limited to, PDF, Word, Excel, images, and diagrams.
- The definition of key data must be flexible, allowing a system administrator to modify its parameters.
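As a rough illustration of the collection step, the sketch below scans one page's HTML for links to document files. It uses only the Python standard library; the names `DOCUMENT_EXTENSIONS`, `DocumentLinkParser`, and `find_document_links` are hypothetical, and the extension set stands in for the administrator-editable definition of key data. It is not a proposed design for the project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: an administrator-editable set of file extensions that count as documents.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".png", ".jpg"}


class DocumentLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links that point at document files."""

    def __init__(self, base_url, extensions):
        super().__init__()
        self.base_url = base_url
        self.extensions = extensions
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        # Compare against the URL path only, so query strings do not hide the extension.
        path = urlparse(url).path.lower()
        if any(path.endswith(ext) for ext in self.extensions):
            self.links.append(url)


def find_document_links(html, base_url, extensions=DOCUMENT_EXTENSIONS):
    parser = DocumentLinkParser(base_url, extensions)
    parser.feed(html)
    return parser.links


page = '<a href="/files/report.PDF">Report</a><a href="about.html">About</a>'
print(find_document_links(page, "https://example.org/docs/"))
```

Passing a different `extensions` set changes what is collected without touching the parser, which is one simple way to keep the definition of key data adjustable.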
OBJECTIVE 2 – Schedule
- Develop an algorithm that determines the optimal schedule for searching the 3500 websites from Objective 1, ensuring minimal impact to the websites, networks and users.
- The frequency of searches must ensure that the data collected is kept up to date.
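One possible starting point for the scheduling objective is an adaptive revisit interval: visit a site more often when its content changed on the last visit and less often when it did not, within fixed bounds. The sketch below assumes this heuristic; the bounds, the halving and doubling factors, and the names `SiteSchedule`, `record_visit`, and `due_sites` are all hypothetical and would need tuning against real site behaviour.

```python
from dataclasses import dataclass

# Assumed bounds on how often any single site may be visited (in hours).
MIN_INTERVAL_H = 6.0
MAX_INTERVAL_H = 24.0 * 30


@dataclass
class SiteSchedule:
    url: str
    interval_h: float = 24.0   # current gap between visits
    next_visit_h: float = 0.0  # time of the next planned visit


def record_visit(site, now_h, changed):
    """Shorten the interval after a change, lengthen it otherwise, then reschedule."""
    factor = 0.5 if changed else 2.0
    site.interval_h = min(MAX_INTERVAL_H, max(MIN_INTERVAL_H, site.interval_h * factor))
    site.next_visit_h = now_h + site.interval_h


def due_sites(sites, now_h, budget):
    """Return at most `budget` sites that are due, most overdue first."""
    due = [s for s in sites if s.next_visit_h <= now_h]
    due.sort(key=lambda s: s.next_visit_h)
    return due[:budget]


site = SiteSchedule("https://example.org")
record_visit(site, now_h=0.0, changed=False)
print(site.interval_h, site.next_visit_h)
```

The `budget` argument caps how many sites are visited per cycle, which is one way to limit load on the websites and the network.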
OBJECTIVE 3 – Identify
- The solution to Objective #1 will be run on a regularly scheduled basis.
- Develop an algorithm to determine the status of a document collected in Objective #1. Each document must be classified as 1 of 4 options:
- New – documents not previously collected;
- Modified – documents that have been modified since last collected;
- Unchanged – documents that have not been modified since last collected;
- Removed – documents that were previously on a website but are no longer there.
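The four statuses above can be derived by comparing content fingerprints between two collection runs. The sketch below assumes each document is keyed by its URL and fingerprinted with SHA-256; the names `fingerprint` and `classify` are hypothetical, and a real solution would also have to handle documents that move between URLs.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Hash of the raw document bytes, used to detect modification."""
    return hashlib.sha256(content).hexdigest()


def classify(previous: dict, current: dict) -> dict:
    """Map each URL to New, Modified, Unchanged, or Removed.

    `previous` and `current` map URL -> fingerprint for the last and latest runs.
    """
    status = {}
    for url, digest in current.items():
        if url not in previous:
            status[url] = "New"
        elif previous[url] != digest:
            status[url] = "Modified"
        else:
            status[url] = "Unchanged"
    # Anything seen last time but absent now has been removed from the site.
    for url in previous.keys() - current.keys():
        status[url] = "Removed"
    return status


previous = {"a.pdf": fingerprint(b"v1"), "b.pdf": fingerprint(b"v1"), "c.pdf": fingerprint(b"v1")}
current = {"a.pdf": fingerprint(b"v1"), "b.pdf": fingerprint(b"v2"), "d.pdf": fingerprint(b"v1")}
result = classify(previous, current)
print(result)
```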
OBJECTIVE 4 – Recognize
- Sub-Objective 4.1 – Recognize
- Develop an algorithm using supervised machine learning to recognize and categorize each document into distinct document types.
- Sub-Objective 4.2 – Data Extraction
- Develop an algorithm to extract key data elements from the documents collected, including document source, issue date, etc.
- Sub-Objective 4.3 – Re-Title
- Develop an algorithm, most likely using Natural Language Processing and Natural Language Generation, that identifies documents with nondescript titles and generates a new reader-friendly title for each.
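As a simple baseline for Sub-Objective 4.3, against which an NLP/NLG approach could be compared, the sketch below flags titles that contain no informative words and proposes a replacement from the document's first substantial line of text. The word list, the thresholds, and the names `GENERIC_WORDS`, `is_nondescript`, and `fallback_title` are hypothetical; this rule-based heuristic is not Natural Language Generation.

```python
import re

# Assumed list of words that carry no information about a document's subject.
GENERIC_WORDS = {"doc", "document", "scan", "file", "untitled", "new",
                 "copy", "final", "draft", "img", "image"}


def is_nondescript(title: str) -> bool:
    """True when the title has no alphabetic word beyond generic filler."""
    stem = re.sub(r"\.[A-Za-z0-9]{1,5}$", "", title)  # drop a file extension
    words = re.findall(r"[A-Za-z]+", stem.lower())
    informative = [w for w in words if w not in GENERIC_WORDS and len(w) > 2]
    return len(informative) == 0


def fallback_title(text: str, max_words: int = 12) -> str:
    """Use the first line with at least three words as a candidate title."""
    for line in text.splitlines():
        words = line.split()
        if len(words) >= 3:
            return " ".join(words[:max_words])
    return ""


print(is_nondescript("scan_0001.pdf"))
print(fallback_title("\nPage 1\nAnnual Safety Report 2019\n"))
```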
OBJECTIVE 5 – Correlate
- Develop an algorithm that identifies documents that are related.
- Design a structure that maintains those document relationships within a database and makes them accessible to web applications.
- Complete a Build vs. Buy Assessment for each objective to ascertain whether a solution currently exists and determine the best approach for BoxOfDocs. Key considerations are:
- cost of solution,
- maintenance and support,
- scalability and
- ease of integration between components.
- For Objective 1, consideration should be given to tools such as Scrapy (scrapy.org), Python, and other similar web data extraction tools.
- When assessing existing AI tools, Amazon Comprehend, Amazon SageMaker, and comparable tools must be explored.
- The use of existing and/or open source solutions or modules shall be maximized, with custom integration as required.
Expertise and Skills Needed:
- Machine Learning / Pattern Recognition / Natural Language Processing
- Knowledge of Web application services
- Web data extraction
- Process automation
For more info or to apply to this applied research position, please