CeON / Cermine

Licence: agpl-3.0
Content ExtRactor and MINEr

CERMINE is a Java library and a web service (cermine.ceon.pl) for extracting metadata and content from PDF files containing academic publications. CERMINE is written in Java and developed at the Centre for Open Science at the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw.

The code is licensed under the GNU Affero General Public License version 3.

How to cite CERMINE:

Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. 
CERMINE: automatic extraction of structured metadata from scientific literature. 
In International Journal on Document Analysis and Recognition (IJDAR), 2015, 
vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.

DOI of CERMINE release 1.13:

DOI

Using CERMINE

CERMINE can be used for:

  • extracting metadata, full text and parsed references from a PDF file,
  • extracting metadata from reference strings,
  • extracting metadata from affiliation strings.

In all tasks the default output format is NLM JATS.
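
Because the output is JATS XML, it can be read with the standard Java XML APIs. The following is a minimal sketch, not part of CERMINE itself: the class name `JatsTitleReader` and the sample document are hypothetical, and the XPath assumes the usual JATS location of the article title (`article/front/article-meta/title-group/article-title`). Real CERMINE output may contain more structure than shown here.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of CERMINE: reads the title from a JATS document.
final class JatsTitleReader {

    /** Returns the article title, or an empty string if the document has none. */
    static String readTitle(String jatsXml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Usual JATS location of the title; string() flattens any inline markup.
        String expr = "string(/article/front/article-meta/title-group/article-title)";
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(jatsXml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Not a readable XML document", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample for illustration only.
        String sample = "<article><front><article-meta><title-group>"
                + "<article-title>An Example Title</article-title>"
                + "</title-group></article-meta></front><body/></article>";
        System.out.println(readTitle(sample));
    }
}
```

The same approach should extend to other JATS elements (authors, abstract, references) by changing the XPath expression.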

There are three ways of using CERMINE, depending on the user's needs:

  • standalone application -- use this if you need to process larger amounts of data locally on your laptop or server
  • Maven dependency -- lets you use CERMINE's API in your own Java/Scala code
  • web application -- for demonstration purposes and only small amounts of data (fewer than 50 files)

Refer to one of the sections below for details.

Standalone application

The easiest way to process files on a laptop/server is using CERMINE as a standalone application. All you will need is a single JAR file containing all the tools, external libraries and learned models. The latest release can be downloaded from the repository (look for a file called cermine-impl-<VERSION>-jar-with-dependencies.jar). The current version is 1.13.

Processing PDF documents

The basic command for processing PDF files is the following:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path path/to/directory/with/pdfs/

An additional argument, -outputs, can be used to specify the types of output. The value should be a comma-separated list of one or more of the following:

  • jats - document metadata and content in NLM JATS format
  • text - raw document text with the reading order preserved
  • zones - text zones of the documents labeled with functional classes
  • trueviz - geometric structure of the document in TrueViz format
  • images - images from the document
  • bibtex - references in BibTeX format

Processing references

To extract metadata from a reference string use the following:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.bibref.CRFBibReferenceParser -reference "the text of the reference"

Processing affiliations

To extract metadata from an affiliation string use:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.metadata.affiliation.CRFAffiliationParser -affiliation "the text of the affiliation"

(Optional) If you would like to build the executable JAR yourself, clone the project and run:

$ cd CERMINE/cermine-impl
$ mvn compile assembly:single

This produces the file cermine-impl-<VERSION>-jar-with-dependencies.jar in the cermine-impl/target directory.

Maven dependency

CERMINE can be used in Java projects by adding the following dependency and repository to the project's pom.xml file:

<dependency>
	<groupId>pl.edu.icm.cermine</groupId>
	<artifactId>cermine-impl</artifactId>
	<version>${cermine.version}</version>
</dependency>

<repository>
	<id>icm</id>
	<name>ICM repository</name>
	<url>http://maven.icm.edu.pl/artifactory/repo</url>
</repository>

Example code to extract the content from a PDF file:

ContentExtractor extractor = new ContentExtractor();
InputStream inputStream = new FileInputStream("path/to/pdf/file");
extractor.setPDF(inputStream);
Element result = extractor.getContentAsNLM();

Example code to extract metadata from a reference string:

CRFBibReferenceParser parser = CRFBibReferenceParser.getInstance();
BibEntry reference = parser.parseBibReference(referenceText);

Example code to extract metadata from an affiliation string:

CRFAffiliationParser parser = new CRFAffiliationParser();
Element affiliation = parser.parse(affiliationText);

REST service

The third possibility is to use CERMINE's REST service with the cURL tool. Note, however, that this should only be used for small amounts of data, as the server has limited resources. Moreover, the web application might not run the latest version of the code. In most cases, using the executable JAR is a better choice.

To extract the content from a PDF file:

$ curl -X POST --data-binary @article.pdf \
  --header "Content-Type: application/binary" \
  http://cermine.ceon.pl/extract.do

To extract metadata from a reference string:

$ curl -X POST --data "reference=the text of the reference" \
  http://cermine.ceon.pl/parse.do

To extract metadata from an affiliation string:

$ curl -X POST --data "affiliation=the text of the affiliation" \
  http://cermine.ceon.pl/parse.do
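
The same requests can be built from Java with the standard `java.net.http` client (Java 11 or later). This is a sketch under the assumption that the service behaves as the cURL commands above suggest. The class `CermineRestRequests` and its method names are hypothetical. Unlike `curl --data`, the form field is URL-encoded here, which is the safer choice when the text contains characters such as `&` or `=`.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CERMINE: mirrors the cURL commands above.
final class CermineRestRequests {

    static final String BASE_URL = "http://cermine.ceon.pl";

    /** POST of raw PDF bytes to extract.do, mirroring the first cURL command. */
    static HttpRequest extractRequest(byte[] pdfBytes) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/extract.do"))
                .header("Content-Type", "application/binary")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    /** Form-encoded POST to parse.do; field is "reference" or "affiliation". */
    static HttpRequest parseRequest(String field, String text) {
        String body = field + "=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/parse.do"))
                // curl --data sends this content type by default.
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = parseRequest("reference", "the text of the reference");
        System.out.println(request.method() + " " + request.uri());
        // To send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The sketch only builds the requests, so it can be tried without touching the shared server; sending is left as a commented line.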