inukshuk / Anystyle

Licence: other
Fast and smart citation reference parsing

Programming Languages

ruby

Projects that are alternatives of or similar to Anystyle

crossref
Client for the Crossref API
Stars: ✭ 29 (-93.38%)
Mutual labels:  science, bibliography
Deta parser
Fast Chinese word segmentation and analysis
Stars: ✭ 476 (+8.68%)
Mutual labels:  parser, science
Stream Parser
⚡ PHP7 / Laravel Multi-format Streaming Parser
Stars: ✭ 391 (-10.73%)
Mutual labels:  parser
Librmath.js
Javascript Pure Implementation of Statistical R "core" numerical libRmath.so
Stars: ✭ 425 (-2.97%)
Mutual labels:  science
Tomlplusplus
Header-only TOML config file parser and serializer for C++17 (and later!).
Stars: ✭ 403 (-7.99%)
Mutual labels:  parser
Astexplorer
A web tool to explore the ASTs generated by various parsers.
Stars: ✭ 4,330 (+888.58%)
Mutual labels:  parser
Crossplane
Quick and reliable way to convert NGINX configurations into JSON and back.
Stars: ✭ 407 (-7.08%)
Mutual labels:  parser
Swaggen
OpenAPI/Swagger 3.0 Parser and Swift code generator
Stars: ✭ 385 (-12.1%)
Mutual labels:  parser
Tiny Compiler
A tiny compiler for a language featuring LL(2) with Lexer, Parser, ASM-like codegen and VM. Complex enough to give you a flavour of how the "real" thing works whilst not being a mere toy example
Stars: ✭ 425 (-2.97%)
Mutual labels:  parser
Php Parser
🌿 NodeJS PHP Parser - extract AST or tokens (PHP5 and PHP7)
Stars: ✭ 400 (-8.68%)
Mutual labels:  parser
Recipy
Effortless method to record provenance in Python
Stars: ✭ 418 (-4.57%)
Mutual labels:  science
Datefinder
Find dates inside text using Python and get back datetime objects
Stars: ✭ 397 (-9.36%)
Mutual labels:  parser
Opensim Core
SimTK OpenSim C++ libraries and command-line applications, and Java/Python wrapping.
Stars: ✭ 392 (-10.5%)
Mutual labels:  science
Javalang
Pure Python Java parser and tools
Stars: ✭ 408 (-6.85%)
Mutual labels:  parser
Toml11
TOML for Modern C++
Stars: ✭ 390 (-10.96%)
Mutual labels:  parser
Binary Parser
Blazing-fast declarative parser builder for binary data
Stars: ✭ 422 (-3.65%)
Mutual labels:  parser
Verible
Verible is a suite of SystemVerilog developer tools, including a parser, style-linter, and formatter.
Stars: ✭ 384 (-12.33%)
Mutual labels:  parser
Json Rust
JSON implementation in Rust
Stars: ✭ 395 (-9.82%)
Mutual labels:  parser
Jsonparser
One of the fastest alternative JSON parser for Go that does not require schema
Stars: ✭ 4,323 (+886.99%)
Mutual labels:  parser
Picofeed
PHP library to parse and write RSS/Atom feeds
Stars: ✭ 439 (+0.23%)
Mutual labels:  parser

AnyStyle

AnyStyle is a very fast and smart parser for academic references. It was originally inspired by ParsCit and FreeCite; AnyStyle uses machine learning algorithms and aims to make it easy to train the model with data that is relevant to your parsing needs.

Using AnyStyle CLI

$ [sudo] gem install anystyle-cli
$ anystyle --help
$ anystyle help find
$ anystyle help parse

See anystyle-cli for more details.

Using AnyStyle in Ruby

Install the anystyle gem.

$ [sudo] gem install anystyle

Once installed, you can use the static Parser and Finder instances by calling the AnyStyle.parse or AnyStyle.find methods. For example:

require 'anystyle'

pp AnyStyle.parse 'Derrida, J. (1967). L’écriture et la différence (1 éd.). Paris: Éditions du Seuil.'
#-> [{
#  :author=>[{:family=>"Derrida", :given=>"J."}],
#  :date=>["1967"],
#  :title=>["L’écriture et la différence"],
#  :edition=>["1"],
#  :location=>["Paris"],
#  :publisher=>["Éditions du Seuil"],
#  :language=>"fr",
#  :scripts=>["Common", "Latin"],
#  :type=>"book"
#}]
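
The result is a plain Ruby array of hashes, so it can be post-processed with ordinary Ruby. As a minimal sketch, the snippet below turns one such hash into a short author-year label. The hash is copied by hand from the output above rather than produced by a live AnyStyle.parse call, and the helper name short_label is hypothetical, not part of AnyStyle.

```ruby
# A reference hash shaped like the AnyStyle.parse output shown above
# (copied by hand so the snippet runs without the gem installed).
reference = {
  author: [{ family: 'Derrida', given: 'J.' }],
  date: ['1967'],
  title: ['L’écriture et la différence'],
  edition: ['1'],
  location: ['Paris'],
  publisher: ['Éditions du Seuil'],
  language: 'fr',
  scripts: ['Common', 'Latin'],
  type: 'book'
}

# Hypothetical helper (not part of AnyStyle): build an
# "Author (Year). Title" label. Missing fields are tolerated,
# since not every parsed reference contains all of them.
def short_label(ref)
  families = Array(ref[:author]).map { |a| a[:family] }.compact
  year = Array(ref[:date]).first
  title = Array(ref[:title]).first

  label = families.empty? ? 'Anonymous' : families.join(', ')
  label += " (#{year})" if year
  label += ". #{title}" if title
  label
end

puts short_label(reference)
```

Most values are arrays because a field may occur more than once in a reference, which is why the helper wraps each lookup in Array() and takes the first entry.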

Alternatively, you can create your own AnyStyle::Parser or AnyStyle::Finder with custom options.

Using the AnyStyle Web App

AnyStyle is available as a web application at anystyle.io.

The web application is open source and you can also host it yourself!

Training

You can train custom Finder and Parser models. To do this, you need to prepare your own data sets for training. You can create your own data from scratch or build on AnyStyle's default sets. The default parser model is based on the core data set; the default finder model source data is not publicly available in its entirety, but you can find a number of tagged documents here.

When you have compiled a data set for training, you will be ready to create your own model:

$ anystyle train training-data.xml custom.mod

This will save your new model as custom.mod. To use your model instead of AnyStyle's default, use the -P or --parser-model flag and, respectively, -F or --finder-model to use a custom Finder model. For instance, the command below would parse all references in bib.txt using the custom model we just trained and print the result to STDOUT using the JSON output format:

$ anystyle -P custom.mod -f json parse bib.txt -
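
If you capture that JSON output, it can be read back with Ruby's standard json library. The sketch below is only an illustration: the JSON string is a hand-written stand-in for real CLI output, and it assumes the JSON mirrors the structure of the Ruby hash output shown earlier.

```ruby
require 'json'

# Stand-in for the CLI output; in practice this would be read from a
# file or a pipe. The structure is an assumption modeled on the Ruby
# hash output, not captured from a real run.
json = <<~JSON
  [
    {
      "author": [{ "family": "Derrida", "given": "J." }],
      "date": ["1967"],
      "title": ["L’écriture et la différence"],
      "type": "book"
    }
  ]
JSON

references = JSON.parse(json, symbolize_names: true)

# Group all titles by reference type (book, article, ...).
titles_by_type = references
  .group_by { |ref| ref[:type] }
  .transform_values { |refs| refs.flat_map { |ref| Array(ref[:title]) } }

titles_by_type.each do |type, titles|
  puts "#{type}: #{titles.join('; ')}"
end
```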

When training your own models, it is good practice to check the quality using a second data set. For example, using AnyStyle's own gold data set (a large, manually curated data set) we could check our custom model like this:

$ anystyle -P x.mod check ./res/parser/gold.xml
Checking gold.xml.................   1 seq  0.06%   3 tok  0.01%  3s

This command will print the sequence and token error rates. In the case of AnyStyle, the number of sequence errors is the number of references that the parser tagged differently than they were tagged in the input; the number of token errors is the total number of words across all references that were tagged differently. In the example above, we got one reference wrong (out of 1700 at the time); but even this one reference was mostly tagged correctly, because only a total of 3 words were tagged differently.
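
To make the two error counts concrete, here is a minimal sketch in plain Ruby. It is not AnyStyle's own check implementation: it assumes each reference is an array of [token, label] pairs, and the helper name error_rates and the toy data are hypothetical.

```ruby
# Hypothetical helper (not AnyStyle's implementation): compare gold
# labels with predicted labels and count sequence and token errors.
def error_rates(gold, predicted)
  raise ArgumentError, 'data sets differ in size' unless gold.length == predicted.length

  token_total = 0
  token_errors = 0
  sequence_errors = 0

  gold.zip(predicted).each do |expected, actual|
    raise ArgumentError, 'sequences differ in length' unless expected.length == actual.length

    # Count the tokens in this reference whose labels differ.
    wrong = expected.zip(actual).count { |(_, want), (_, got)| want != got }
    token_total += expected.length
    token_errors += wrong
    # A reference is one sequence error if any of its tokens is wrong.
    sequence_errors += 1 if wrong > 0
  end

  {
    sequence_errors: sequence_errors,
    sequence_rate: gold.empty? ? 0.0 : 100.0 * sequence_errors / gold.length,
    token_errors: token_errors,
    token_rate: token_total.zero? ? 0.0 : 100.0 * token_errors / token_total
  }
end

# Toy data: two references with five tokens in total.
gold = [
  [['Derrida,', 'author'], ['J.', 'author'], ['(1967).', 'date']],
  [['Paris:', 'location'], ['Seuil.', 'publisher']]
]
predicted = [
  [['Derrida,', 'author'], ['J.', 'author'], ['(1967).', 'date']],
  [['Paris:', 'publisher'], ['Seuil.', 'publisher']]
]

rates = error_rates(gold, predicted)
puts format('%d seq %.2f%%  %d tok %.2f%%',
            rates[:sequence_errors], rates[:sequence_rate],
            rates[:token_errors], rates[:token_rate])
```

In the toy data, one of the two references and one of the five tokens differ. By the same arithmetic, one wrong reference out of roughly 1700 gives the 0.06% sequence error rate in the check output above.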

When working with training data, it is a good idea to use the Wapiti::Dataset API in Ruby: it supports all the standard set operators and makes it very easy to combine or compare data sets.

Dictionary Adapters

During the statistical analysis of reference strings, AnyStyle relies on a large feature dictionary; by default, AnyStyle creates a persistent Ruby Hash in the folder of the anystyle-data Gem. This uses up about 2MB of disk space and keeps the entire dictionary in memory. If you prefer a smaller memory footprint, you can alternatively use AnyStyle's GDBM dictionary. GDBM bindings are part of the Ruby standard library and are supported on all platforms, but you may have to install GDBM on your platform before installing Ruby.

If you do not want to use the persistent Ruby Hash or the GDBM bindings, you can store your dictionary in memory (not recommended) or use Redis. The best way to change the default dictionary adapter is by adjusting AnyStyle's default configuration (when using the default parser instances, you must set the default before using the parser):

AnyStyle::Dictionary.defaults[:adapter] = :ruby
#-> Use a persistent Ruby hash;
#-> slower start-up than GDBM but no extra dependency

AnyStyle::Dictionary.defaults[:adapter] = :hash
#-> Use in-memory dictionary; slow start-up but uses no space on disk

require 'anystyle/dictionary/gdbm'
AnyStyle::Dictionary.defaults[:adapter] = :gdbm

To use Redis, install the redis and (optionally) redis-namespace Gems and configure AnyStyle to use the Redis adapter:

AnyStyle::Dictionary.defaults[:adapter] = :redis

# Adjust the Redis-specific configuration
require 'anystyle/dictionary/redis'
AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
AnyStyle::Dictionary::Redis.defaults[:port] = 6379

Contributing

The AnyStyle source code is hosted on GitHub. You can check out a copy of the latest code using Git:

$ git clone https://github.com/inukshuk/anystyle.git

If you've found a bug or have a question, please open an issue on the AnyStyle issue tracker. Or, for extra credit, clone the AnyStyle repository, write a failing example, fix the bug and submit a pull request.

Credits

AnyStyle is a volunteer effort and we encourage you to join us! Over the years our main contributors have been:

License

Copyright 2011-2020 Sylvester Keil. All rights reserved.

AnyStyle is distributed under a BSD-style license. See LICENSE for details.
