From: Emily Packer
Date: Fri, 4 Aug 2017 16:55:17 +0100

[With apologies for cross-posting]


Hi all,


eLife has today outlined Science Beam, a new project to convert PDF to XML
with high accuracy by complementing existing tools with computer vision
technology.

A vast trove of scientific research is locked inside the PDF format, and
extracting key information from these files is not trivial. It would
therefore be useful to extract and store this data in a more accessible and
reusable format such as XML, whether the publishing industry standard JATS
or another variety.

Science Beam uses computer vision algorithms to help ‘see’ the structure of
a research paper in PDF as a human would. This structural information can
then be used to assign the correct metadata to the document’s content.

You can read more about the project in our latest eLife Labs post:
https://elifesciences.org/labs/5b56aff6/science-beam-a-computer-vision-approach-to-the-extraction-of-pdf-data

For Science Beam to extract good metadata from PDFs that vary widely in
font, layout and content across sources, we need to train our system on a
wide variety of PDFs and their corresponding XML. To this end, we will be
collaborating with other
publishers to collate a broad corpus of valid PDF/XML pairs to help train
and test our neural networks. Our hope is that the wide variety of papers
and formats in this corpus will help our system learn to deduce the
structure of a research paper well enough to be useful in real-world
applications.

For more information, or to speak to us further about Science Beam, please
don’t hesitate to contact me.

Best wishes,

Emily

Emily Packer
Press Officer

+44 1223 855373 (office)

http://elifesciences.org

eLife Sciences Publications, Ltd is a limited liability non-profit
non-stock corporation incorporated in the State of Delaware, USA, with
company number 5030732, and is registered in the UK with company number
FC030576 and branch number BR015634 at the address First Floor, 24 Hills
Road, Cambridge CB2 1JP