College of Science & Engineering

Extracting complex and nested biomedical relations from published literature

Supervisor: Dr Jake Lever

School: Computing Science

Description:

Modern biomedical research increasingly relies on natural language processing and language models to extract essential knowledge directly from research papers. This structured knowledge is regularly used to interpret patient genetic data and to add context to experimental results. However, the extracted knowledge often takes a simplistic form that omits context and qualifiers. A paper reporting that a drug is effective against a disease in a certain population and at a certain dose can be reduced to “drug treats disease”. This lack of nuance is not acceptable for knowledge that is used to interpret patient data.

This project proposes to build a richer representation of biomedical knowledge and to develop AI methods that extract it from the text of research papers. The representation would capture conditional information, cause-and-effect relationships and correlations described in individual sentences and paragraphs.
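As a rough illustration of what such a representation might look like, the sketch below attaches qualifiers to a relation and allows one relation to be nested inside another. The class names, field names and example values (`Relation`, `Qualifier`, "drug X", "disease Y") are hypothetical placeholders and are not taken from the knowledge schema the project will build on.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Qualifier:
    """A piece of context attached to a relation (hypothetical structure)."""
    kind: str   # e.g. "population" or "dose"
    value: str


@dataclass(frozen=True)
class Relation:
    """A relation whose arguments may be entities or other relations."""
    subject: Union[str, "Relation"]
    predicate: str
    obj: Union[str, "Relation"]
    qualifiers: Tuple[Qualifier, ...] = ()

    def flatten(self):
        """Drop the qualifiers, leaving the simplistic triple."""
        return (self.subject, self.predicate, self.obj)


# Placeholder example: a qualified treatment relation.
rel = Relation(
    subject="drug X",
    predicate="treats",
    obj="disease Y",
    qualifiers=(
        Qualifier("population", "a certain population"),
        Qualifier("dose", "a certain dose"),
    ),
)

# A nested relation: the treatment relation is itself an argument.
nested = Relation(subject=rel, predicate="correlates_with", obj="outcome Z")

print(rel.flatten())
print(nested.subject.qualifiers)
```

Calling `flatten()` discards exactly the information that the simplified “drug treats disease” form loses, while the full object keeps the population and dose alongside the triple.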

The project will be undertaken in three stages. The first stage involves the construction of a dataset that builds on existing resources, including a recently developed knowledge schema. Initially, data would be generated using large language models (LLMs) and then supplemented by annotations created by domain experts. The second stage will investigate how well large, resource-intensive language models perform on this task. The third stage will investigate whether lighter-weight and more “industry-friendly” language models can be adapted for this task, which would enable larger-scale application to the millions of text-minable research articles.

Each of these three stages would produce a valuable contribution to the research community. The success of the project would be measured using traditional machine learning metrics evaluated on the constructed dataset. The results would be disseminated through a publication at a top-tier ACL family conference or similar venue. Generated data and code would be shared through GitHub.
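The description does not name the metrics. A minimal sketch, assuming they are precision, recall and F1 computed by exact match between extracted and annotated relations, might look like the following. The function name and the toy relations are illustrative only and are not project results.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over two collections of relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy placeholder data: relations as (subject, predicate, object) tuples.
gold = {
    ("drug A", "treats", "disease B"),
    ("gene C", "causes", "disease D"),
    ("drug E", "treats", "disease F"),
    ("gene G", "correlates_with", "disease H"),
}
predicted = {
    ("drug A", "treats", "disease B"),
    ("gene C", "causes", "disease D"),
    ("drug E", "causes", "disease F"),  # wrong predicate, so no match
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Exact matching is the strictest choice. For relations that carry qualifiers, a partial-credit scheme may be more informative, and that choice is left open here.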

This project has the potential for high impact in natural language processing and biomedical informatics. It proposes to develop new representations of knowledge that go beyond current knowledge graph representations, and new AI methods for capturing that knowledge. Future application of the developed methods would enable construction of a valuable knowledge base for cancer information with US-based collaborators.