Using Natural Language Processing to Detect Dataset Re-use in the Scientific Literature
This research addresses a challenging problem in open science: how to reliably ensure that authors share all of the datasets, code, protocols, and new lab materials associated with their articles. The solution will use state-of-the-art natural language processing techniques to detect sentences in which authors describe data collection or the generation of other research outputs, and then check whether those outputs are publicly shared. The same approach can also be applied to detect code and software, protocols, and lab materials. It will also let authors know whether all of the existing resources they re-use are appropriately cited in their publication. The research outcomes will thus facilitate research data management, sharing, and citation in the research community and support Canada’s digital research infrastructure.
Zheng Liu
DataSeer
Engineering
Professional, scientific and technical services
The University of British Columbia - Okanagan
Accelerate