A machine learning based approach for supporting triage of HIV-related documents
The number of biomedical scientific publications available across multiple repositories is large and growing rapidly. As of April 2014, PubMed, the largest knowledge source for biomedical and life science literature, comprises more than 23 million citations. Querying PubMed with the keyword HIV returns a list of almost three hundred thousand citations. Retrieving data of particular interest for a specific research field from such a large volume of publications is often like looking for a needle in a haystack. Researchers querying various biomedical bibliographic databases collect a long list of potentially relevant papers, read the abstracts first, and then select the most relevant ones for further curation of the full papers. This step, called literature triage, often involves reading thousands of abstracts, of which only a small number are kept. The objective of this project is to support the manual triage of HIV-related PubMed abstracts by designing, implementing and evaluating an automatic classification system that predicts document relevance. The system will process sets of potentially interesting abstracts and then select or reject each one as a human reader would. Our system will be based on a machine learning approach.
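To make the select/reject step more concrete, the sketch below shows one way a relevance classifier for abstracts might look. It is only an illustration under stated assumptions: the text above does not name a learning algorithm, so multinomial Naive Bayes with add-one smoothing is an assumed stand-in, and the class name NaiveBayesTriage, the labels "relevant" and "irrelevant", and the four training strings are hypothetical, invented for the example rather than taken from PubMed or from any curated data set.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTriage:
    """Multinomial Naive Bayes with add-one smoothing.

    Assumption: this algorithm is an illustrative choice only; the
    project description does not specify which learner is used.
    """

    def fit(self, abstracts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        # Per-class word frequencies gathered from the training abstracts.
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(abstracts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.total_words = {
            c: sum(self.word_counts[c].values()) for c in self.classes
        }
        return self

    def log_score(self, text, label):
        """Return log P(label) plus the sum of log P(token | label)."""
        n_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / n_docs)
        denom = self.total_words[label] + len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:  # words never seen in training are skipped
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text):
        """Return the class with the highest log score."""
        return max(self.classes, key=lambda c: self.log_score(text, c))


# Hypothetical toy training set: made-up strings, not real abstracts.
train_abstracts = [
    "HIV protease inhibitor resistance mutations in treated patients",
    "antiretroviral therapy outcomes in an HIV cohort",
    "HIV prevalence survey methodology in rural districts",
    "cost analysis of HIV awareness campaigns",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = NaiveBayesTriage().fit(train_abstracts, train_labels)
decision = model.predict("resistance mutations under antiretroviral therapy")
```

In this sketch, predict plays the role of the human reader's keep-or-discard decision for a single abstract. A realistic system would be trained on abstracts that curators have already triaged and would be evaluated on held-out data before being trusted to filter a candidate list.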