
Uncovering Previous Research Findings In Preprints

Phil Gooch
4 min read

‘If Citations Could Talk’: Extracting, Structuring And Linking References To Reveal Earlier Research Findings

Researchers have been making their early-stage research available on preprint servers since the early 90s, but it's really over the past year or two that preprints have gone mainstream. As well as the huge growth in submissions to established repositories such as arXiv and bioRxiv, there are now preprint servers for marine biology, the social sciences, psychology, chemistry and health sciences, and larger publishers are starting to get in on the action.

Machine learning arXiv papers per year


Monthly bioRxiv submissions

Unless you insist that authors write their preprints in LaTeX or XML, the majority of these preprints are not in Web-friendly formats. Ben Firshman has done some great work with arxiv-vanity, which turns LaTeX articles from arXiv into responsive web documents with citations as hyperlinks. And bioRxiv is investing significant funds to convert its PDFs to XML. But in the main, preprints are Word or PDF files that have:

  • no linked citations
  • little or no metadata
  • no easy navigation

In this post, we'll look at a possible solution to the first point - the lack of linked citations - and we'll cover metadata and navigation in future posts.


At Scholarcy we've built software that can automatically extract, structure and link citations from any document in any format, but we wanted to go further than creating hyperlinks to all citations. Today, Scholarcy uncovers and displays the information behind those citations, using our own API, and the APIs from CrossRef, Unpaywall and Scite.
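
To make the linking step concrete, here is a minimal sketch of how an extracted reference string might be resolved to a DOI and then to an open-access copy. It is not Scholarcy's implementation. It assumes the public Crossref REST API (the `query.bibliographic` search parameter) and the Unpaywall v2 endpoint, which expects a contact email; the helper names, the sample DOI and the trimmed sample responses are hypothetical placeholders.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"
UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def crossref_query_url(reference_text: str, rows: int = 1) -> str:
    """Build a Crossref search URL for a free-text reference string."""
    params = {"query.bibliographic": reference_text, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def unpaywall_url(doi: str, email: str) -> str:
    """Build an Unpaywall lookup URL for a DOI."""
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?{urlencode({'email': email})}"


def best_doi(crossref_response: dict) -> Optional[str]:
    """Return the DOI of the top-ranked Crossref match, if there is one."""
    items = crossref_response.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None


def open_access_link(unpaywall_response: dict) -> Optional[str]:
    """Prefer a direct PDF link, falling back to the landing page."""
    location = unpaywall_response.get("best_oa_location") or {}
    return location.get("url_for_pdf") or location.get("url")


def fetch_json(url: str) -> dict:
    """Fetch and decode a JSON response (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


# Trimmed, made-up responses in the shape the two APIs return,
# so the parsing helpers can be tried without a network call.
sample_crossref = {"message": {"items": [{"DOI": "10.1234/example"}]}}
sample_unpaywall = {
    "is_oa": True,
    "best_oa_location": {
        "url": "https://example.org/article",
        "url_for_pdf": None,
    },
}

doi = best_doi(sample_crossref)
link = open_access_link(sample_unpaywall)
```

With network access, `fetch_json(crossref_query_url(reference_text))` would take the place of the sample dictionary. In practice the top Crossref match is worth checking against the reference's title and year before it is turned into a link, since a free-text search always returns its best guess.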

In Scholarcy, all citations are hyperlinked

In the example above, you can see how the citations in the original flat PDF have been turned into links to various resources, including the full text. The context of the citation is also highlighted. We use the Scite API to display the citation sentiment (how many times it's been supported or contradicted), and link to the full report card on the Scite website.

Scite API integration with Scholarcy

And we use the Scholarcy API to provide a quick summary of the article being cited.

Scholarcy summarises the findings of cited sources

It's also useful to be able to compare what the citing author is claiming vs. what previous, cited work has claimed. This allows us to build on the excellent work that Scite has done on identifying subjective sentiment - we can also show the objective differences between studies. In the example above, we can see that the Scott and Schramke study had 207 participants, whereas the current study (below) has 2403, and the makeup of the cohorts in each study was different. These differences might partly explain why the current study does not replicate the Scott and Schramke findings, beyond the broad 'supporting' or 'contradicting' categories usefully provided by Scite.

Scholarcy identifies study participants

Taking this a step further, we aim to automatically flag where the cited findings differ from those of the current study - and even identify if the citing paper is misreporting the cited study in some way. In future, the summary information from each citation could be incorporated into the citation graph if expressed in a concise way. Sources would then carry metadata about their own findings in a machine-readable format, which would make comparing studies much easier.
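
As a rough illustration of what machine-readable findings could look like, the sketch below stores a study's participant count in a small record, serialises it to JSON, and flags a large gap in sample size between a citing and a cited study. The `StudyFindings` record, its field names and the 2x threshold are assumptions made for this example, not an existing Scholarcy or citation-graph format. The participant counts are the ones quoted above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StudyFindings:
    """Hypothetical findings metadata carried alongside a citation."""
    label: str
    participants: Optional[int] = None


def compare_participants(
    citing: StudyFindings,
    cited: StudyFindings,
    ratio_threshold: float = 2.0,
) -> Optional[str]:
    """Return a note when the sample sizes differ by the threshold or more."""
    if not citing.participants or not cited.participants:
        return None  # nothing to compare if either count is missing
    larger, smaller = sorted(
        [citing, cited], key=lambda study: study.participants, reverse=True
    )
    ratio = larger.participants / smaller.participants
    if ratio < ratio_threshold:
        return None
    return (
        f"{larger.label} has {ratio:.1f}x as many participants as "
        f"{smaller.label} ({larger.participants} vs {smaller.participants})"
    )


cited = StudyFindings(label="Scott and Schramke", participants=207)
citing = StudyFindings(label="Current study", participants=2403)

# Machine-readable form that could travel with the citation.
cited_json = json.dumps(asdict(cited))

note = compare_participants(citing, cited)
print(note)
```

A fuller record would need fields for cohort makeup and key results as well. Participant count is used here only because it is the one figure that can be compared without any interpretation.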