CeON / Cermine

Licence: agpl-3.0

Content ExtRactor and MINEr


CERMINE is a Java library and a web service (cermine.ceon.pl) for extracting metadata and content from PDF files containing academic publications. CERMINE is written in Java at the Centre for Open Science at the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw.

The code is licensed under the GNU Affero General Public License version 3.

How to cite CERMINE:

Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. 
CERMINE: automatic extraction of structured metadata from scientific literature. 
In International Journal on Document Analysis and Recognition (IJDAR), 2015, 
vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.

DOI of CERMINE release 1.13:

Using CERMINE

CERMINE can be used for:

extracting metadata, full text and parsed references from a PDF file,
extracting metadata from reference strings,
extracting metadata from affiliation strings.

In all tasks the default output format is NLM JATS.

There are three ways of using CERMINE, depending on the user's needs:

standalone application -- use this if you need to process larger amounts of data locally on your laptop or server
Maven dependency -- lets you use CERMINE's API in your own Java/Scala code
web application -- for demonstration purposes and only small amounts of data (fewer than 50 files)

Refer to one of the sections below for details.

Standalone application

The easiest way to process files on a laptop/server is using CERMINE as a standalone application. All you will need is a single JAR file containing all the tools, external libraries and learned models. The latest release can be downloaded from the repository (look for a file called cermine-impl-<VERSION>-jar-with-dependencies.jar). The current version is 1.13.

Processing PDF documents

The basic command for processing PDF files is the following:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path path/to/directory/with/pdfs/

An additional argument, -outputs, can be used to specify the output types. Its value should be a comma-separated list of one or more of the following:

jats - document metadata and content in NLM JATS format
text - raw document text with the reading order preserved
zones - text zones of the documents labeled with functional classes
trueviz - geometric structure of the document in TrueViz format
images - images from the document
bibtex - references in BibTeX format

Processing references

To extract metadata from a reference string use the following:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.bibref.CRFBibReferenceParser -reference "the text of the reference"

Processing affiliations

To extract metadata from an affiliation string use:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.metadata.affiliation.CRFAffiliationParser -affiliation "the text of the affiliation"

(Optional) If you would like to build an executable JAR yourself, clone the project and execute:

$ cd CERMINE/cermine-impl
$ mvn compile assembly:single

This produces the file cermine-impl-<VERSION>-jar-with-dependencies.jar in the cermine-impl/target directory.

Maven dependency

CERMINE can be used in Java projects by adding the following dependency and repository to the project's pom.xml file:

<dependency>
	<groupId>pl.edu.icm.cermine</groupId>
	<artifactId>cermine-impl</artifactId>
	<version>${cermine.version}</version>
</dependency>

<repository>
	<id>icm</id>
	<name>ICM repository</name>
	<url>http://maven.icm.edu.pl/artifactory/repo</url>
</repository>

Example code to extract the content from a PDF file:

ContentExtractor extractor = new ContentExtractor();
InputStream inputStream = new FileInputStream("path/to/pdf/file");
extractor.setPDF(inputStream);
Element result = extractor.getContentAsNLM();

Example code to extract metadata from a reference string:

CRFBibReferenceParser parser = CRFBibReferenceParser.getInstance();
BibEntry reference = parser.parseBibReference(referenceText);

Example code to extract metadata from an affiliation string:

CRFAffiliationParser parser = new CRFAffiliationParser();
Element affiliation = parser.parse(affiliationText);

REST service

The third possibility is to use CERMINE's REST service with the cURL tool. Note, however, that this should only be used for small amounts of data, as the server does not have a lot of resources. Moreover, the web application might not run the latest version of the code. In most cases, using the executable JAR is a better choice.

To extract the content from a PDF file:

$ curl -X POST --data-binary @article.pdf \
  --header "Content-Type: application/binary" \
  http://cermine.ceon.pl/extract.do
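
The same upload can be sketched in Java with the JDK's built-in HTTP client (Java 11 or later). This is an illustration that mirrors the cURL call above, not an official CERMINE client; the class name CermineExtractRequest and its build helper are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of CERMINE: mirrors the cURL upload shown above.
class CermineExtractRequest {

    // Builds a POST request carrying the raw PDF bytes as the request body.
    static HttpRequest build(byte[] pdfBytes) {
        return HttpRequest.newBuilder(URI.create("http://cermine.ceon.pl/extract.do"))
                .header("Content-Type", "application/binary")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: CermineExtractRequest <article.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Path.of(args[0]));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(build(pdf), HttpResponse.BodyHandlers.ofString());
        // Print whatever the service returns for the uploaded file.
        System.out.println(response.body());
    }
}
```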

To extract metadata from a reference string:

$ curl -X POST --data "reference=the text of the reference" \
  http://cermine.ceon.pl/parse.do

To extract metadata from an affiliation string:

$ curl -X POST --data "affiliation=the text of the affiliation" \
  http://cermine.ceon.pl/parse.do
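
The two parse.do calls differ only in the form field name, so a Java sketch can cover both. cURL's --data option sends the string as given with a form content type, so a reference containing characters such as & would need encoding; the sketch below URL-encodes the value for that reason. The class name CermineParseRequest and its helpers are hypothetical, and the sample reference text in the comment is made up for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CERMINE: mirrors the cURL parse.do calls shown above.
class CermineParseRequest {

    // field is "reference" or "affiliation"; text is the string to parse.
    // Example: formBody("reference", "A. Smith & B. Jones, 2015")
    //   returns "reference=A.+Smith+%26+B.+Jones%2C+2015"
    static String formBody(String field, String text) {
        return field + "=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    // Builds a form-encoded POST request for the parse.do endpoint.
    static HttpRequest build(String field, String text) {
        return HttpRequest.newBuilder(URI.create("http://cermine.ceon.pl/parse.do"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(field, text)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: CermineParseRequest <reference|affiliation> <text>");
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(build(args[0], args[1]), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```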
