Using Natural Language Processing to Detect Dataset Re-use in the Scientific Literature

This research addresses a challenging problem in open science: how to reliably ensure that authors share all of the datasets, code, protocols, and new lab materials associated with their articles. The solution will employ state-of-the-art natural language processing techniques to detect sentences in which authors describe data collection or the generation of other research outputs, and then check whether those outputs are publicly shared. The same approach can be applied to code and software, protocols, and lab materials. It will also let authors know whether all re-used existing resources are appropriately cited in their publication. The research outcomes will thus facilitate research data management, sharing, and citation in the research community and support Canada’s digital research infrastructure.

Faculty Supervisor:

Zheng Liu

Student:

Partner:

DataSeer

Discipline:

Engineering

Sector:

Professional, scientific and technical services

University:

The University of British Columbia - Okanagan

Program: