The title of my research is "Enhancing the bioscience literature for people and machines". I am trying to make the knowledge within the scientific literature easier for machines to access, in the belief that if machines can access that knowledge, we can build tools that let humans access it in novel and more efficient ways.

It is important to do this because the volume of published work is growing very quickly: reportedly faster than bioscientists can comfortably engage with.

We don't know when, or whether, scholarly communication will start to be published with machines in mind, and the work already published is unlikely to be republished in machine-readable form. So we are developing tools to build a machine-readable scientific literature out of the very machine-unfriendly PDFs that are currently distributed.

This machine-readable literature will be open access, allowing other researchers and engineers to build their own tools to explore it, enabling research on the nature of the literature itself, and pulling the knowledge within the current scientific literature further into the public domain.

We plan to extract knowledge via three main methods: integration of existing public-access data (e.g. the open citation graph, Microsoft Academic Graph); automated analysis of the scientific papers themselves; and non-invasive crowdsourcing techniques.

The project engages with these research questions:

We are approaching these problems from an engineering angle: we'll try to build something to answer the question and then test whether it works.

At the moment, I'm preparing an experiment to study how bioscientists read scientific papers and whether we can infer anything useful from the reading behaviour we can detect. This is the first step in our crowdsourcing strategy.

A little more on knowledge extraction

The first problem with automated analysis of papers is how to get them. Most papers are published in paid-access journals that prevent bulk downloads of their work through legal and technical mechanisms.

We can legally work around these mechanisms by encouraging a large enough group of people to read the scientific literature and upload the papers they read to us for further analysis.

Perhaps the most innovative part of the project is developing techniques to non-invasively harvest valuable knowledge from expert readers. We want to do this because our automated extraction tools are not sophisticated enough to capture many important properties of papers. We want to get this information from experts non-invasively because expert time is very expensive and the literature is far too large for us to pay them to annotate it all for us.

We want expert readers to help us work out which parts of a paper are important or difficult, and which papers and parts of papers are relevant to which queries, among other less expert tasks.

We think we might be able to extract this kind of information by monitoring the reading behaviour of experts and by developing tools that both aid researchers and feed useful data back to us (e.g. tools to extract data tables, support post-publication review, or manage personal libraries).

So we've built a PDF reader that offers useful extra tools to researchers, and that will soon also upload papers for analysis and act as a platform for our crowdsourcing strategy.
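
To make the reading-behaviour idea concrete, here is a minimal sketch of the kind of signal such a reader could report and how we might aggregate it. The field names and the dwell-time heuristic are illustrative assumptions, not the reader's actual telemetry:

from dataclasses import dataclass, field
import time

@dataclass
class ReadingEvent:
    # One hypothetical unit of reading behaviour reported by the PDF reader.
    paper_id: str         # identifier of the paper being read
    page: int             # page currently on screen
    dwell_seconds: float  # how long the reader stayed on that page
    timestamp: float = field(default_factory=time.time)

def pages_by_attention(events):
    # Total dwell time per page: a crude proxy for which parts of a paper matter.
    totals = {}
    for e in events:
        key = (e.paper_id, e.page)
        totals[key] = totals.get(key, 0.0) + e.dwell_seconds
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)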

The general idea is that users don't mind being used like this because they and the general public benefit from the open publication of all the information and knowledge we extract and the tools we and others build on it.

The automated analysis of papers currently consists of building a structured representation of the text of the paper (a non-trivial problem in itself) ...

e.g.

<paper>
<title>Gene xb22 increases prevalence of pancreatic cancer in mice</title>
<abstract>blah</abstract>
<introduction>blahblah blah <citation>1134</citation></introduction>
etc.

... and then running some reasonably fancy text-mining tools over that representation. The text-mining tools are currently used to extract sentences and fragments of text that refer to certain predefined bioscience concepts. This is fairly crude but seemingly useful.
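
As a rough illustration of what that concept extraction looks like, here is a minimal dictionary-based sketch; the concept list and matching logic are assumptions for the example, and the real tools are more sophisticated:

import re

# Illustrative concept dictionary: concept name -> surface forms that indicate it.
CONCEPTS = {
    "pancreatic cancer": ["pancreatic cancer", "pancreatic carcinoma"],
    "gene xb22": ["xb22"],
}

def tag_sentences(text):
    # Return (sentence, matched concepts) pairs for a paper's plain text.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    tagged = []
    for sentence in sentences:
        lowered = sentence.lower()
        matched = [concept for concept, terms in CONCEPTS.items()
                   if any(term in lowered for term in terms)]
        if matched:
            tagged.append((sentence, matched))
    return tagged

print(tag_sentences("Gene xb22 increases prevalence of pancreatic cancer in mice. Methods follow."))
# [('Gene xb22 increases prevalence of pancreatic cancer in mice.', ['pancreatic cancer', 'gene xb22'])]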

The information extracted from the papers is uploaded into a big graph database that we run. We can then ask questions about the corpus, clustering papers by shared concepts and so on.
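
We use a graph database for this, but the flavour of the queries can be sketched in plain Python (the papers and concepts here are made up):

import networkx as nx

# Toy paper -> concept edges; in reality these come from the text-mining step.
paper_concepts = {
    "paper_A": {"pancreatic cancer", "gene xb22"},
    "paper_B": {"pancreatic cancer", "mouse model"},
    "paper_C": {"gene xb22"},
}

G = nx.Graph()
for paper, concepts in paper_concepts.items():
    for concept in concepts:
        G.add_edge(paper, concept)  # bipartite edge: paper mentions concept

def related_papers(paper):
    # Papers reachable through a shared concept, ranked by how many concepts they share.
    shared = {}
    for concept in G[paper]:
        for other in G[concept]:
            if other != paper:
                shared[other] = shared.get(other, 0) + 1
    return sorted(shared.items(), key=lambda kv: kv[1], reverse=True)

print(related_papers("paper_A"))  # e.g. [('paper_B', 1), ('paper_C', 1)]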

Soon, we will also have citation information for many papers, plus tools to extract citations from newly analysed works. We believe this will allow us to build the most comprehensive and open citation graph of the scientific literature.
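
The citation data fits the same picture; a small sketch, with made-up identifiers, of the kind of question the citation graph lets us ask:

import networkx as nx

# A directed edge "A -> B" means paper A cites paper B (identifiers are made up).
citations = [
    ("paper_A", "paper_B"),
    ("paper_A", "paper_C"),
    ("paper_D", "paper_B"),
]
C = nx.DiGraph(citations)

# Times each paper is cited within the corpus, most-cited first.
most_cited = sorted(C.nodes, key=C.in_degree, reverse=True)
print([(p, C.in_degree(p)) for p in most_cited])
# [('paper_B', 2), ('paper_C', 1), ('paper_A', 0), ('paper_D', 0)]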

Soon, we also plan to extract other simple but valuable information such as authors, institutions, journal, and publication date.
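
For illustration, the record we have in mind for each paper would be something simple along these lines (the field names are assumptions):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PaperMetadata:
    # Simple bibliographic record per paper; fields are illustrative.
    title: str
    authors: List[str] = field(default_factory=list)
    institutions: List[str] = field(default_factory=list)
    journal: Optional[str] = None
    publication_date: Optional[str] = None  # e.g. "2016-03-01"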

My research