ATR4S

An open-source library for Automatic Term Recognition written in Scala.

To cite ATR4S:

N.Astrakhantsev. ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala. arXiv preprint arXiv:1611.07804, 2016.

Implemented algorithms

  1. AvgTermFreq
  2. ResidualIDF
  3. TotalTF-IDF
  4. CValue
  5. Basic
  6. ComboBasic
  7. PostRankDC
  8. Relevance
  9. Weirdness
  10. DomainPertinence
  11. NovelTopicModel
  12. LinkProbability
  13. KeyConceptRelatedness
  14. Voting
  15. PU-ATR

Requirements

Libraries

Scala 2.11

Spark 1.5+ (for Voting and PU-ATR)

Emory nlp4j

(Apache OpenNLP is also supported, but in preliminary experiments its quality was no better than that of Emory nlp4j, and it is not thread-safe. If you are going to use OpenNLP, download the models from Apache OpenNLP and place them into src/main/resources.)

(Stanford CoreNLP is also supported through a helper, which has been moved to a separate GPL-licensed module because Stanford CoreNLP itself is licensed under the GPL.)

Data

To use some of the algorithms, you need to download auxiliary files and either place them into the WORKING_DIRECTORY/data directory or specify their path in the corresponding configuration/builder class (e.g. Word2VecAdapterConfig of KeyConceptRelatedness). The working directory can be set in gradle.properties; by default, it is experiments.

Namely,

Datasets used in the experiments can be downloaded from the Release page.

OS

The PU-ATR algorithm may not work on Windows because of bugs in Spark; the related questions on Stack Overflow may help.

Linking

The library is published to Maven Central and JCenter. Add the lines below that match your build system.

Gradle

compile 'ru.ispras:atr4s:1.2.2'

Maven

<dependency>
    <groupId>ru.ispras</groupId>
    <artifactId>atr4s</artifactId>
    <version>1.2.2</version>
</dependency>

SBT

libraryDependencies += "ru.ispras" % "atr4s" % "1.2.2"

Building from Sources

Build the library with Gradle:

./gradlew jar

Usage

Command line example

./gradlew recognize -Pdataset=acl2 -PtopCount=10 -Pconfig=CValue.conf -Poutput=cvalueterms.txt

This command recognizes the top 10 terms in the text files stored in the acl2 directory (which should be a subdirectory of WORKING_DIRECTORY) using the CValue measure (configured in the CValue.conf file) and writes the recognized terms with their weights to cvalueterms.txt.

Note that if the input text files are not encoded in UTF-8, you should specify the correct encoding in the config of NLPPreprocessor (or convert the input files; many tools can do that).
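If you choose to convert the input files, a small program along the following lines can do it. This is a minimal sketch that is independent of ATR4S and uses only the Java standard library; the class name ToUtf8 and its command-line arguments are illustrative assumptions, not part of the toolkit.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper (not part of ATR4S): re-encodes a text file as UTF-8.
final class ToUtf8 {

    /** Reads input using sourceCharset and writes the same text to output as UTF-8. */
    public static void convert(Path input, Path output, Charset sourceCharset) throws IOException {
        // Decode the raw bytes with the encoding the file was written in.
        String text = new String(Files.readAllBytes(input), sourceCharset);
        // Encode the decoded text as UTF-8 and write it out.
        Files.write(output, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Expected arguments: <input file> <output file> <source charset name>
        convert(Paths.get(args[0]), Paths.get(args[1]), Charset.forName(args[2]));
    }
}
```

For example, `java ToUtf8 paper.txt paper.utf8.txt windows-1251` would write a UTF-8 copy of a file that was saved in windows-1251. The whole file is read into memory, which is usually acceptable for individual text documents.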

Program API

See the ATRConfig class, which is a configuration/builder for the facade class AutomaticTermsRecognizer.

See the AutomaticTermsRecognizer object for an example.

Program API (Java)

Usage from Java does not differ significantly, so see the same classes for examples. However, since Java does not support parameters with default values, we provide static helper functions named make() for most classes whose parameters have default values or are Scala collections; see the example below.

Also note that there is a special method that returns the weighted terms as a Java Iterable, so you do not need to convert Scala collections to Java ones.

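class ATRExample {
    public static void main(String[] args) {
        // args[0]: directory with the input text files
        String datasetDir = args[0];
        // args[1]: number of top-ranked terms to return (parsed from its string form)
        int topCount = Integer.parseInt(args[1]);
        // make() helpers supply the default parameter values of the Scala classes
        ATRConfig atrConfig = new ATRConfig(EmoryNLPPreprocessorConfig.make(),
                TCCConfig.make(),
                new OneFeatureTCWeighterConfig(Weirdness.make()));
        // Build the recognizer and get the weighted terms as a Java Iterable
        Iterable<WeightedTerm> terms = atrConfig.build().recognizeAsJavaIterable(datasetDir, topCount);
        for (WeightedTerm termAndWeight: terms) {
            System.out.println(termAndWeight);
        }
    }
}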
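
The make() helpers follow a common Java idiom for emulating default parameter values. The sketch below illustrates that idiom only. ExampleConfig, its fields, and its default values are hypothetical and are not taken from ATR4S, whose actual make() signatures should be checked in the source.

```java
// Hypothetical configuration class, NOT part of ATR4S: it only illustrates
// how static make() overloads can stand in for default parameter values.
final class ExampleConfig {
    final int minFrequency;
    final String encoding;

    ExampleConfig(int minFrequency, String encoding) {
        this.minFrequency = minFrequency;
        this.encoding = encoding;
    }

    // All parameters take their (assumed) default values.
    static ExampleConfig make() {
        return new ExampleConfig(2, "UTF-8");
    }

    // The caller overrides one parameter; the other keeps its default.
    static ExampleConfig make(int minFrequency) {
        return new ExampleConfig(minFrequency, "UTF-8");
    }
}
```

A Java caller can then write `ExampleConfig.make()` where Scala code would call the constructor with no arguments and rely on the defaults.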

License

Apache License Version 2.0.
