Experiment design, clustering, and creation of an ensemble of models to classify domain observations - QC-346

Project type: Research
Desired discipline(s): Computer science, Mathematical Sciences, Operations research, Statistics / Actuarial sciences
Company: Opslock
Project Length: 6 months to 1 year
Preferred start date: 10/01/2020
Language requirement: English
Location(s): Montreal, Canada
No. of positions: 1 or 2
Desired education level: College, Undergraduate/Bachelor, Master's, PhD, Postdoctoral fellow
Search across Mitacs’ international networks (check this box if you’d also like to receive profiles of researchers based outside of Canada): No

About the company: 

At Opslock we predict risk to the health and safety of workers, as well as the risk of various business operations, in the industrial sector. Our work is similar to the insurance business, but instead of being based on actuarial finance, our user-facing application is used to run day-to-day operations. This means our data is incredibly rich and we can draw many high-level inferences from it.

Describe the project.: 

We have a knowledge graph for labelling observations, along with the ability to query training data, manage experiments, and run ML deployment pipelines (including randomized sampling and pre-processing), but we do not yet have a researcher to drive the scientific discovery process itself. We are seeking to develop our own living hypothesis in the domain of risk management for health and safety. The knowledge graph constitutes an ontology that is intentionally unsound, but it provides enough structure to test hypotheses against reality.
Our current approach targets an ensemble of models (similar to MILABOT), except that an expert system (rather than a dialogue manager) uses forward chaining and regular business logic to discover the best queries for explaining novel observations. Those queries are classification tasks, but we do not yet have the models in place to predict the labels.
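To give a sense of the control flow, here is a toy sketch of forward chaining over facts about a novel observation. It is illustrative only: the fact, rule, and task names are invented and are not part of our system.

```python
# Toy forward-chaining loop: repeatedly fire rules whose conditions are all
# satisfied by the current fact base, until no new facts can be asserted.
# All fact, rule, and task names here are invented for illustration.
facts = {"observation:unlabelled", "site:offshore"}

# Each rule is (set of conditions, fact to assert when they all hold).
rules = [
    ({"observation:unlabelled"}, "query:predict_hazard_category"),
    ({"query:predict_hazard_category", "site:offshore"}, "task:classify_with_offshore_hazard_model"),
]

changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

# Asserted "task:*" facts name the classification queries to dispatch to the ensemble.
print(sorted(f for f in facts if f.startswith("task:")))
```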
We are able to auto-classify some data in the application code, and we are planning to integrate an annotation UI based on continuous/online learning (e.g. Prodigy), with a view to letting the first scientist who joins our team focus on the research rather than on the engineering and plumbing around the applications. We will outsource or otherwise delegate annotation work to preserve this focus.
We use some NLP features for query expansion when user-generated data is not rich enough on its own. This involves part-of-speech tagging and WordNet: appending synonyms, filtering out antonyms, and tuning other linguistic properties. Some of our models will definitely not be neural nets, so we are looking for more of a generalist with a knack for probability and for testing theories.
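As a rough illustration of that kind of expansion (a sketch assuming NLTK and WordNet tooling, not our production code; the function name and example query are made up), one could expand a query as follows:

```python
# Sketch of WordNet-based query expansion: tag parts of speech, append synonyms,
# and drop candidates that appear as antonyms of the original terms.
# Requires the NLTK data packages: punkt, averaged_perceptron_tagger, wordnet.
import nltk
from nltk.corpus import wordnet as wn

# Map Penn Treebank tag prefixes to WordNet part-of-speech constants.
POS_MAP = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}

def expand_query(query):
    tokens = nltk.pos_tag(nltk.word_tokenize(query))
    expanded = {token.lower() for token, _ in tokens}
    for token, tag in tokens:
        pos = POS_MAP.get(tag[0])
        if pos is None:
            continue
        synonyms, antonyms = set(), set()
        for synset in wn.synsets(token, pos=pos):
            for lemma in synset.lemmas():
                synonyms.add(lemma.name().replace("_", " ").lower())
                antonyms.update(a.name().replace("_", " ").lower() for a in lemma.antonyms())
        expanded |= synonyms - antonyms
    return expanded

print(sorted(expand_query("worker injured by falling object")))
```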
The researcher will join our backend & data engineering team for direct collaboration, so that our engineers can continuously build tooling to support the experiments and focus on the hard problems revealed by knowledge engineering.
 

Required expertise/skills: 

  • Ideally the scientist has some experience with empirical sciences that incorporate ontologies and taxonomies into their theoretical frameworks; for example, drug discovery, legal informatics, and computational history would all be relevant.
  • Ideally the scientist will be trained in gradient descent and most of the leading ML algorithms, but we value research methods and experiment design more than comprehensive memorization of the deep learning literature.
  • Interest in (and opinions about) model de-biasing, and “explainable” models.
  • Familiarity with NLP techniques such as TF-IDF representations, naive Bayes, and other algorithms that operate on bags of words (see the sketch after this list).
  • Experience with k-nearest neighbours (KNN) and other geometric approaches to comparing the structure of distant objects (we want to draw “analogies” in the data).
  • Clustering data to discover relationships that were not previously known.
  • Presenting and explaining research findings to the business, and communicating concepts like precision, recall, F-score, p-values, etc.
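To make the last four points concrete, here is a minimal, self-contained sketch. It is illustrative only: the observation texts, labels, and parameters are invented, and scikit-learn is assumed rather than prescribed. It builds TF-IDF bag-of-words features, fits a naive Bayes classifier, reports precision/recall/F-score, runs a nearest-neighbour “analogy”-style lookup, and clusters the observations.

```python
# Illustrative sketch only: toy safety-observation texts and labels, not Opslock data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical observation texts and hazard labels.
texts = [
    "worker not wearing harness near open edge",
    "ladder placed on unstable ground",
    "spill of hydraulic fluid near walkway",
    "missing lockout tag on breaker panel",
    "guardrail removed during maintenance",
    "oil leak beneath compressor unit",
    "scaffold erected without toe boards",
    "energized equipment serviced without isolation",
]
labels = ["fall", "fall", "chemical", "electrical", "fall", "chemical", "fall", "electrical"]

# Bag-of-words features via TF-IDF.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Naive Bayes classifier over the bag-of-words features, with a metrics report.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))  # precision/recall/F1

# Geometric neighbours: find the observations most similar to a novel one ("analogies").
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
query = vectorizer.transform(["harness not clipped while working at height"])
_, indices = nn.kneighbors(query)
print([texts[i] for i in indices[0]])

# Unsupervised clustering to surface groupings not captured by the existing labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(list(zip(texts, clusters)))
```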