How does Scholarcy work its magic?

Phil Gooch
4 min read

8 Steps To Structured Knowledge Extraction From Research Papers

We are often asked: 'How does Scholarcy summarise and identify the key points in research papers and other articles, and in what ways is it unique?' Without giving too much away about our secret sauce, here's an overview of what happens under the hood.

  • Whatever the input format (PDF, Word, HTML, XML, EPUB, PowerPoint or plain text), we convert the document into a unified, internal format, so that, for the purposes of our algorithms, all documents share the same structure (a simplified version of such a model is sketched after this list).
  • We classify each line or paragraph according to its semantic class: for example, whether it is a metadata item, section heading, section text, tabular text, caption, bibliographic reference, or noise such as headers and footers (a toy classifier is sketched below).
  • We identify key terms and knowledge statements using techniques inspired by Open Domain Information Extraction research [1, 2], which often forms the basis of methods for automated knowledge base construction [3]. This avoids the need for handcrafted taxonomies or ontologies (see the extraction sketch below).
  • We combine this with a summarisation approach inspired by Google's PageRank algorithm [4, 5], together with ideas from recent research on 'bottom-up attention' [6]; a minimal PageRank-style summariser is sketched below. By using extractive techniques, with a touch of abstractive summarisation in the right places, we ensure that all summary sentences are factually correct and can be traced back to the original source.
  • Summary sentences are simplified to remove transitional phrases and redundant subclauses, and coreferences are resolved where possible, to help ensure a coherent flow.
  • We map key terms to an external knowledge base, currently Wikipedia, using approaches inspired by various entity linking techniques [7, 8] (a simplified lookup is sketched below).
  • We map citation markers in the text to their entries in the bibliography, and map the bibliographic items themselves to their source documents, using resources such as PubMed, CrossRef and Unpaywall (a citation-matching sketch appears below).
  • Finally, we enrich the summary with these links to cited sources and to Wikipedia.
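
To make the first step concrete, here is a minimal sketch of what a unified internal document model could look like. It is illustrative only: the class names, fields and sections() helper are assumptions for this post, not Scholarcy's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Block:
    """One logical unit of the document: a heading, paragraph, caption, etc."""
    text: str
    kind: str = "body"          # e.g. "heading", "body", "caption", "reference"
    page: Optional[int] = None  # page number, where the source format has pages

@dataclass
class Document:
    """Format-agnostic container: PDF, Word and HTML parsers would all emit this."""
    title: str
    authors: List[str] = field(default_factory=list)
    blocks: List[Block] = field(default_factory=list)

    def sections(self) -> List[str]:
        """Return all heading texts, regardless of the original file format."""
        return [b.text for b in self.blocks if b.kind == "heading"]

doc = Document(title="Example paper",
               blocks=[Block("Introduction", kind="heading"),
                       Block("We study...", kind="body")])
print(doc.sections())  # ['Introduction']
```

Because every parser targets the same shape, all later steps (classification, summarisation, linking) only ever see one structure.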
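
The second step can be pictured as a classifier over lines or paragraphs. The rule-based version below is a deliberately simple stand-in; in practice a trained model with far richer features would be used, and the regular expressions and class labels here are illustrative assumptions.

```python
import re

# Hypothetical surface patterns; a production system would learn these.
HEADING = re.compile(r"^(\d+(\.\d+)*\.?\s+\w|abstract|introduction|methods?|"
                     r"results|discussion|conclusion|references)", re.I)
REFERENCE = re.compile(r"^\[?\d+\]?\s+[A-Z][a-z]+,?\s+[A-Z]\.")  # "[3] Smith, J."
CAPTION = re.compile(r"^(figure|fig\.|table)\s+\d+", re.I)

def classify_line(line: str) -> str:
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.isdigit():                  # bare page numbers and similar noise
        return "noise"
    if CAPTION.match(stripped):
        return "caption"
    if REFERENCE.match(stripped):
        return "reference"
    if HEADING.match(stripped) and len(stripped.split()) <= 8:
        return "heading"
    return "body"

for line in ["3. Results", "Figure 2: Model accuracy", "[12] Smith, J. (2019).",
             "We trained the model on the full corpus."]:
    print(classify_line(line), "->", line)
```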
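
For the third step, an Open IE-flavoured extraction can be approximated with a dependency parse. The sketch below uses spaCy (assuming the en_core_web_sm model is installed) and is only a rough illustration of the idea behind [1, 2], not the production pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_statements(text: str):
    """Yield rough (subject, relation, object) triples from the dependency parse."""
    doc = nlp(text)
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
        for subj in subjects:
            for obj in objects:
                yield (subj.text, token.lemma_, obj.text)

def key_terms(text: str):
    """Noun chunks serve as candidate key terms, with no handcrafted ontology."""
    return [chunk.text for chunk in nlp(text).noun_chunks]

sentence = "The proposed model outperforms previous baselines on three benchmarks."
print(list(extract_statements(sentence)))  # e.g. [('model', 'outperform', 'baselines')]
print(key_terms(sentence))
```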
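
The PageRank-inspired ranking in step four belongs to the same family of ideas as TextRank [4, 5]. The following minimal sketch ranks sentences with networkx's PageRank over a word-overlap graph; the similarity function and sentence splitter are simplistic placeholders rather than what we run in production.

```python
import itertools
import math
import re

import networkx as nx

def split_sentences(text: str):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence: str):
    return set(re.findall(r"\w+", sentence.lower()))

def similarity(a: str, b: str) -> float:
    """Word-overlap similarity, normalised by sentence length (as in TextRank)."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / (math.log(len(wa) + 1) + math.log(len(wb) + 1))

def summarise(text: str, n: int = 2):
    sentences = split_sentences(text)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        weight = similarity(sentences[i], sentences[j])
        if weight > 0:
            graph.add_edge(i, j, weight=weight)
    scores = nx.pagerank(graph, weight="weight")   # PageRank over the sentence graph
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]      # keep original document order

text = ("Scholarcy converts each document into a unified internal format. "
        "The unified format is classified line by line. "
        "Key statements are extracted from the classified document and ranked. "
        "Highly ranked statements form the final summary of the document.")
print(summarise(text))
```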
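
Linking key terms to Wikipedia can be illustrated with a naive lookup against the public MediaWiki search API. Real entity linking also scores candidates against the surrounding context [7, 8]; taking the top search hit, as below, is a simplification made purely for brevity.

```python
import requests

WIKI_API = "https://en.wikipedia.org/w/api.php"

def link_to_wikipedia(term: str):
    """Return a Wikipedia URL for the term, or None if nothing is found."""
    params = {"action": "query", "list": "search", "srsearch": term,
              "srlimit": 1, "format": "json"}
    hits = requests.get(WIKI_API, params=params, timeout=10).json()["query"]["search"]
    if not hits:
        return None
    return "https://en.wikipedia.org/wiki/" + hits[0]["title"].replace(" ", "_")

print(link_to_wikipedia("convolutional neural network"))
```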
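
Finally, matching in-text citation markers to bibliography entries and resolving those entries to source records can be sketched as follows. This toy version handles only numeric markers such as [4, 5] and queries the public CrossRef REST API; the regular expression and the query fields are illustrative choices, not our actual matching logic.

```python
import re

import requests

CITATION_MARKER = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")  # numeric markers like [4, 5]

def find_citations(sentence: str):
    """Return the reference numbers cited in a sentence."""
    numbers = []
    for match in CITATION_MARKER.finditer(sentence):
        numbers.extend(int(n) for n in match.group(1).split(","))
    return numbers

def lookup_doi(reference_text: str):
    """Ask CrossRef for the closest bibliographic match and return its DOI."""
    response = requests.get("https://api.crossref.org/works",
                            params={"query.bibliographic": reference_text, "rows": 1},
                            timeout=10)
    items = response.json()["message"]["items"]
    return items[0]["DOI"] if items else None

print(find_citations("We use a PageRank-style ranking [4, 5] with attention [6]."))
print(lookup_doi("Mihalcea and Tarau. TextRank: Bringing order into texts. 2004."))
```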

Bear in mind that all this is happening in real time, so we have highly optimised ways of achieving these steps. The slowest steps are usually Wikipedia linking, CrossRef lookup and figure/image extraction (which is disabled by default).

What is unique about Scholarcy, and how does it improve on existing services?

There are a number of areas which distinguish us:

  • First, we surface relevant, referenced facts, enriched with explanations and definitions. Other summarisation tools tend not to provide any context.
  • Second, our technology will work with almost any document in any format. Other solutions tend to require the document to be relatively short and to have a specific structure.
  • Third, the process can be easily customised by the user. You can determine how very long documents are handled (for example, processing them in full, or having our AI take a representative sample), the length and language variation of the summary, and so on.
  • Fourth, our tools are backed by powerful, flexible APIs that also provide metadata extraction and figure and table extraction, making them ideal for enhancing and automating article submission systems, or for converting preprints into web- and mobile-friendly formats.
Scholarcy converts preprints into a mobile-friendly format

We're constantly improving our algorithms and are currently developing new summarisation techniques. Our ultimate goal is to convert the world of unstructured documents from any domain into repositories of validated, referenced knowledge statements that are both machine-readable and understandable to the lay reader.

If you'd like to learn more about what we do, how we might help structure and enrich your content, or simply automate your metadata creation, we'd love to hear from you!

References

  1. https://en.wikipedia.org/wiki/Open_information_extraction
  2. https://nlp.stanford.edu/software/openie.html
  3. http://deepdive.stanford.edu/kbc
  4. https://en.wikipedia.org/wiki/PageRank
  5. https://nlpforhackers.io/textrank-text-summarization/
  6. https://arxiv.org/pdf/1808.10792.pdf
  7. https://en.wikipedia.org/wiki/Entity_linking
  8. http://nlp.cs.rpi.edu/paper/wikificationtutorial.pdf
