How Scholarcy contributes to and makes use of open citations

Open citations benefit researchers, journals and publishers

When researchers cite previous work, they recognise the foundations of their own research and provide evidence that may support or refute their methods and results. This creates a narrative path that contextualises their contributions within the larger body of scientific knowledge. But keeping up with this growing volume of literature is increasingly reliant on the use of digital discovery services. As a result, open access to citation metadata in machine-readable format is critical to the dissemination of knowledge. Thanks to the Initiative for Open Citations (I4OC) [1], the proportion of research papers with openly accessible and freely reusable citation data has grown from 1% to over 40% during the period September 2016 to April 2017 [2].

Publishers participating in I4OC deposit structured citation metadata with CrossRef, which makes this data freely available via its API. The modern publishing process typically involves an ‘XML first’ workflow [3], where the author manuscript, perhaps originating in word-processing software or LaTeX, is converted into XML (in a manual or semi-automated process), and then corrected and typeset in this format. As part of this task, the article’s references are converted into structured citation metadata. However, a publisher may have a large archive of older content that exists only in PDF format, originating either from earlier typesetting systems or from scans of the printed copy made before a digital version existed. Equally, researchers across the sciences increasingly deposit early versions of their papers on preprint servers, prior to journal submission or peer review. As with older, pre-XML papers, these preprints typically exist only in PDF or Word format. This is also true of articles and PhD theses residing in institutional repositories.

To solve this problem, a number of tools have been developed that automatically extract and structure citation metadata, either directly from PDF files, or from plain text files previously created from PDF or Word source documents [4]. Some recent examples include the ExCITE [5] and Deep Reference Parsing [6] projects. Until now, however, there has been no public API that lets anyone upload research papers of any reasonable size and format and quickly, accurately extract the citation metadata into a structured form.

Scholarcy has made its reference extraction API available to address this need [7]. The API currently accepts PDF, Word or plain text input, and uses machine learning to extract and output citation metadata in RIS, BibTeX, or XML format. As we are a self-funded, bootstrapped startup, free use of the public API is currently restricted to a limited number of calls per day. However, publishers and other organisations can license the API to convert their back-catalogue of articles into CrossRef XML metadata, enabling them to contribute to the open citations initiative. As part of this, we have recently partnered with BMJ Publishing Group to add millions of citations, from hundreds of thousands of articles, to the Open Citations Corpus [8] and CrossRef [9].
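To illustrate what the structured output looks like, here is a minimal sketch (not Scholarcy’s actual code) that serialises one parsed reference into the RIS format the API can return. The layout of the input dictionary is a hypothetical example for illustration, not Scholarcy’s schema.

```python
# Minimal sketch: serialise a parsed reference into an RIS record.
# The input dict layout is a hypothetical example, not Scholarcy's schema.

def to_ris(ref):
    """Render one parsed reference as an RIS journal-article record."""
    lines = ["TY  - JOUR"]                      # record type: journal article
    for author in ref.get("authors", []):
        lines.append(f"AU  - {author}")         # one AU line per author
    if "title" in ref:
        lines.append(f"TI  - {ref['title']}")
    if "journal" in ref:
        lines.append(f"JO  - {ref['journal']}")
    if "year" in ref:
        lines.append(f"PY  - {ref['year']}")
    if "doi" in ref:
        lines.append(f"DO  - {ref['doi']}")
    lines.append("ER  - ")                      # end-of-record marker
    return "\n".join(lines)
```

Feeding in a reference such as `{"authors": ["Smith J"], "title": "A Study", "year": "2018"}` yields a record any reference manager can import.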

The Scholarcy browser extension and Library web application combine our citation-extraction functionality with the CrossRef [10] and Unpaywall [11] APIs, so researchers can follow the knowledge trail to open-access versions of cited sources while they are reading a paper.
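The chaining step can be pictured as: given the DOI of a cited work, look up its metadata via the CrossRef REST API and an open-access copy via Unpaywall. The sketch below only constructs the two lookup URLs (a real client would fetch them and parse the JSON responses); the email address is a placeholder that Unpaywall requires as a query parameter.

```python
from urllib.parse import quote

# Sketch of the citation-chaining lookups: build the CrossRef and
# Unpaywall REST API URLs for a cited work identified by its DOI.

CROSSREF_API = "https://api.crossref.org/works/"
UNPAYWALL_API = "https://api.unpaywall.org/v2/"

def crossref_url(doi):
    """Metadata lookup for one work on the CrossRef REST API."""
    return CROSSREF_API + quote(doi)

def unpaywall_url(doi, email):
    """Open-access location lookup on Unpaywall (an email is required)."""
    return UNPAYWALL_API + quote(doi) + "?email=" + quote(email)
```

Resolving each extracted reference this way is what turns a static bibliography into clickable links to open-access full text.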

 
Scholarcy uses open citations to enable citation chaining — great for literature reviews

This is particularly useful when reviewing manuscripts and preprints, as readers can validate authors’ claims and sources without interrupting their workflow. Thanks to machine learning, this process is completely automated. We hope that providing ‘one-click’ access to cited papers will make it easier both for reviewers to check sources and for new readers to discover previously unknown works. While a number of open-access repositories, such as Semantic Scholar and Dimensions, provide this functionality for the published papers that they curate, Scholarcy is, as far as I am aware, the only tool to do this for any document, on demand, in real time. In future, this process could also be used to check computationally whether a citation confirms or refutes the claim it supports. A new startup, Scite [12], plans to use open citations data to do precisely this.

Open citations — originating either from CrossRef or this on-demand extraction process — also allow readers to import a work’s bibliography into their favourite reference management software. This citation ‘snowballing’ is seen as an important part of the systematic literature review process [13]. In future, a service that combines this with automatic fact extraction could prove very powerful.
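To support that import step, a reference manager only needs to split an RIS export back into tagged fields. Here is a minimal sketch of that parsing step, assuming well-formed, single-record input (real importers handle multiple records and many more tags).

```python
# Minimal sketch: parse one RIS record back into a dict of fields,
# as a reference manager's importer would. Assumes well-formed input.

def parse_ris(text):
    """Collect RIS tag/value pairs; repeatable tags (e.g. AU) become lists."""
    record = {}
    for line in text.splitlines():
        if len(line) < 6 or line[2:6] != "  - ":
            continue                      # skip blank or malformed lines
        tag, value = line[:2], line[6:].strip()
        if tag == "ER":                   # end-of-record marker
            break
        record.setdefault(tag, []).append(value)
    return record
```

Round-tripping a work’s bibliography through a format like this is what makes citation ‘snowballing’ a one-click operation rather than a retyping exercise.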

In summary, open citations encourage and enable:

  • a new ecosystem of collaboration between non-profits, startups, publishers and academia;

  • the development of new tools to improve the research workflow; and

  • easier fact checking, discovery and verification of scientific knowledge.

If you found this article useful, it would really help if you could hit like and share it. Thanks!

References

[1] I4OC (2018). Initiative for Open Citations. Available from: https://i4oc.org.

[2] Molteni M (2017). The Initiative for Open Citations Is Tearing Down Science’s Citation Paywall, One Link At A Time. Wired, 6 April 2017. Available from: https://www.wired.com/2017/04/tearing-sciences-citation-paywall-one-link-time/.

[3] Majurey M. (2008). From Books to Bytes: How Digital Content is Changing the Way a Publishing Business Functions, Editors’ Bulletin, 4:3, 115–119, DOI: 10.1080/17521740802651369. Available from: https://www.tandfonline.com/doi/pdf/10.1080/17521740802651369

[4] Knoth P, Gooch P, Jack K (2017). What Others Say About This Work? Scalable Extraction of Citation Contexts from Research Papers. Lecture Notes in Computer Science, 10450 pp. 287–299. Available from: http://oro.open.ac.uk/52924/

[5] Körner M, Ghavimi B, Mayr P, Hartmann H, Staab S (2017). Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications. In: Kirikova M. et al. (eds) New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Available from: http://west.uni-koblenz.de/en/research/excite

[6] Rodrigues Alves D, Colavizza, G, Kaplan F (2018). Deep Reference Mining From Scholarly Literature in the Arts and Humanities. Frontiers in Research Metrics and Analytics 3(21): 21. doi: 10.3389/frma.2018.00021. Available from: https://www.frontiersin.org/articles/10.3389/frma.2018.00021/full

[7] Scholarcy (2018). Reference extraction API. Available from: https://ref.scholarcy.com

[8] OCC (2018). Open Citations Corpus. Available from: http://opencitations.net/corpus

[9] BMJ Digital (2018). Machine learning for large-scale legacy reference extraction. Available from: https://digital.bmj.com/machine-learning-for-large-scale-legacy-reference-extraction-at-bmj/

[10] CrossRef (2017). REST API. Available from: https://www.crossref.org/services/metadata-delivery/rest-api/

[11] Unpaywall (2018). REST API. Available from: https://unpaywall.org/products/api

[12] Scite (2018). Making science more reliable. Available from: https://scite.ai

[13] Wohlin C (2014). Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (p. 38). ACM. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.709.9164&rep=rep1&type=pdf