‘If Citations Could Talk’: Extracting, Structuring And Linking References To Reveal Earlier Research Findings
Researchers have been making their early-stage research available on preprint servers since the early 90s, but it’s really over the past year or two that preprints have gone mainstream.
As well as the huge growth in submissions to established repositories such as arXiv and bioRxiv, there are now preprint servers for marine biology, the social sciences, psychology, chemistry and health sciences, and larger publishers are starting to get in on the action.
Monthly bioRxiv submissions
Unless you insist that authors write their preprints in LaTeX or XML, the majority of these preprints are not in web-friendly formats. Ben Firshman has done some great work with arxiv-vanity, which turns LaTeX articles from arXiv.org into responsive web documents with citations as hyperlinks. And bioRxiv is investing significant funds to convert its PDFs to XML.
But in the main, preprints are Word or PDF files that have:
- no linked citations
- little or no metadata
- no easy navigation
In this post, we’ll look at a possible solution to the first point – the lack of linked citations – and we’ll cover metadata and navigation in future posts.
At Scholarcy we’ve built software that can automatically extract, structure and link citations from any document in any format, but we wanted to go further than creating hyperlinks to all citations. Today, Scholarcy uncovers and displays the information behind those citations, using our own API, and the APIs from Crossref, Unpaywall and Scite.ai.
In Scholarcy, all citations are hyperlinked
In the example above, you can see how the citations in the original flat PDF have been turned into links to various resources, including the full text. The context of the citation is also highlighted.
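To make the linking step a little more concrete, here is a minimal sketch of how a DOI found in a flat reference string could be turned into a set of links. It is not Scholarcy's actual pipeline. The function names, the sample reference and its DOI are placeholders invented for illustration, and the DOI pattern is a loose approximation that will miss some older identifiers. The Crossref and Unpaywall URL shapes follow those services' public REST endpoints, and nothing here makes a network call.

```python
import re
from typing import Dict, Optional
from urllib.parse import quote

# Loose DOI pattern (an approximation; some legacy DOIs will not match).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(reference: str) -> Optional[str]:
    """Return the first DOI-like string in a reference, or None."""
    match = DOI_PATTERN.search(reference)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;")


def build_links(doi: str, email: str) -> Dict[str, str]:
    """Build resolver and metadata lookup URLs for a DOI (no requests are sent)."""
    encoded = quote(doi, safe="/")
    return {
        "resolver": f"https://doi.org/{encoded}",
        "crossref": f"https://api.crossref.org/works/{encoded}",
        # Unpaywall asks callers to identify themselves with an email address.
        "unpaywall": f"https://api.unpaywall.org/v2/{encoded}?email={quote(email, safe='@')}",
    }


# Hypothetical reference string with a placeholder DOI.
reference = "Author, A. (2019). A placeholder title. Journal of Examples. doi:10.1234/example.5678."
doi = extract_doi(reference)
if doi is not None:
    for name, url in build_links(doi, "you@example.org").items():
        print(name, url)
```

References without a DOI would need a different route, such as a bibliographic search against a metadata service, which this sketch leaves out.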
We use the Scite API to display the citation sentiment (how many times it’s been supported or contradicted), and link to the full report card on the Scite.ai website.
scite.ai API integration with Scholarcy
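As a rough illustration of what showing citation sentiment involves, the sketch below turns a tally of citation statements into a one-line label. The field names `supporting`, `contradicting` and `mentioning` are assumptions about the shape of the data, not a documented Scite response, and the numbers in the usage line are made up.

```python
from typing import Mapping


def summarise_tally(tally: Mapping[str, int]) -> str:
    """Turn a tally of citation statements into a short display label."""
    # Assumed field names; missing fields are treated as zero.
    supporting = tally.get("supporting", 0)
    contradicting = tally.get("contradicting", 0)
    mentioning = tally.get("mentioning", 0)
    total = supporting + contradicting + mentioning
    if total == 0:
        return "no citation statements found"
    return (
        f"{supporting} supporting, {contradicting} contradicting, "
        f"{mentioning} mentioning ({total} total)"
    )


# Invented numbers, for illustration only.
print(summarise_tally({"supporting": 3, "contradicting": 1, "mentioning": 10}))
```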
And we use the Scholarcy API to provide a quick summary of the article being cited.
Scholarcy summarises the findings of cited sources
It’s also useful to be able to compare what the citing author is claiming vs. what previous, cited work has claimed. This allows us to build on the excellent work that Scite.ai has done on identifying subjective sentiment – we can also show the objective differences between studies.
In the example above, we can see that the Scott and Schramke study had 207 participants, whereas the current study (below) has 2403, and the makeup of the cohorts in each study was different. These differences might partly explain why the current study does not replicate the Scott and Schramke findings. This is detail that goes beyond the broad ‘supporting’ or ‘contradicting’ categories usefully provided by Scite.
Scholarcy identifies study participants
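Once participant counts have been extracted, comparing them is simple arithmetic. The sketch below uses the two figures quoted above (207 and 2403) and flags the pair when one cohort is several times larger than the other. The `StudyFacts` class, the function name and the threshold of 5 are all hypothetical choices made for illustration. Extracting the counts from the text is the hard part and is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudyFacts:
    label: str
    participants: int


def compare_sample_sizes(
    cited: StudyFacts, citing: StudyFacts, ratio_threshold: float = 5.0
) -> Optional[str]:
    """Return a note if the cohort sizes differ by at least ratio_threshold."""
    if cited.participants <= 0 or citing.participants <= 0:
        return None  # nothing sensible to compare
    larger, smaller = sorted(
        [cited, citing], key=lambda s: s.participants, reverse=True
    )
    ratio = larger.participants / smaller.participants
    if ratio < ratio_threshold:
        return None
    return (
        f"{larger.label} has {ratio:.1f}x as many participants as "
        f"{smaller.label} ({larger.participants} vs {smaller.participants})"
    )


cited = StudyFacts("Scott and Schramke", 207)
citing = StudyFacts("the current study", 2403)
print(compare_sample_sizes(cited, citing))
```

A size ratio on its own says nothing about cohort makeup, so a flag like this is only a prompt for the reader to look more closely.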
Taking this a step further, we aim to automatically flag where the cited findings differ from those of the current study – and even identify if the citing paper is misreporting the cited study in some way.
In future, the summary information from each citation could be incorporated into the citation graph if expressed in a concise way. Sources would then carry metadata about their own findings in a machine-readable format, which would make comparing studies much easier.
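One way to picture such machine-readable metadata is as a small record attached to each edge of the citation graph. The sketch below is purely hypothetical: the class names, field names and the `relation` value are invented for illustration and do not correspond to any existing schema. The participant counts are the ones quoted earlier in this post.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SourceFindings:
    label: str
    participants: Optional[int] = None
    key_findings: List[str] = field(default_factory=list)


@dataclass
class CitationEdge:
    citing: SourceFindings
    cited: SourceFindings
    relation: str  # hypothetical vocabulary, e.g. "supporting" or "contradicting"


def to_json(edge: CitationEdge) -> str:
    """Serialise an edge and the findings of both sources as JSON."""
    return json.dumps(asdict(edge), indent=2, sort_keys=True)


edge = CitationEdge(
    citing=SourceFindings("the current study", participants=2403),
    cited=SourceFindings("Scott and Schramke", participants=207),
    relation="contradicting",  # illustrative value only
)
print(to_json(edge))
```

With records like this on every edge, comparing two studies becomes a lookup instead of a close reading of both papers.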