ICIJ / Datashare

Licence: agpl-3.0
Better analyze information, in all its forms

Datashare

Download

https://datashare.icij.org/

Documentation

Datashare's user guide can be found here: https://icij.gitbook.io/datashare/

Follow new updates and features

@ICIJorg publishes video tweets of new features with the hashtag #ICIJDatashare.

Frontend

This repository is only the backend part of Datashare.

The frontend is at https://github.com/ICIJ/datashare-client.

Description

Datashare is a free, open-source desktop application developed by the non-profit International Consortium of Investigative Journalists (ICIJ).

Datashare allows investigative journalists to:

  • access all their documents in one place, locally on their computer, while securing them from potential third-party interference
  • search PDFs, images, texts, spreadsheets, slides and any other files, simultaneously
  • automatically detect and filter by people, organizations and locations

Translation of the interface

You're welcome to suggest translations on Datashare's Crowdin https://crwd.in/datashare. Please contact us if you would like to add a language.

Installing and using

Using with elasticsearch

You can download the script at datashare.icij.org.

To access the web GUI, go to your documents folder, launch path/to/datashare.sh, then open http://localhost:8080 in a browser.

Using only Named Entity Recognition

You can use the Datashare Docker container solely for its name-finding API, which is exposed over HTTP.

Just run:

docker run -ti -p 8080:8080 -v /path/to/dist/:/home/datashare/dist icij/datashare:0.10 -m NER

A bit of explanation:

  • -w tells datashare to run the web server (this flag does not appear in the command above). The server listens on port 8080, which is why that port is mapped in the docker command
  • -m NER runs datashare in a stateless mode, without any index
  • -v /path/to/dist:/home/datashare/dist maps the directory from which the NLP models are read (and into which they are downloaded if they don't exist)

Then query the server with curl:

curl -i localhost:8080/ner/findNames/CORENLP --data-binary @path/to/a/file.txt

The last part of the path (CORENLP) is the NLP framework. Choose one of CORENLP, IXAPIPE, MITIE or OPENNLP.
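
As a sketch only, the curl query above can be wrapped in two small shell functions that reject anything other than the four framework names before sending the request. The function names ner_url and find_names and the DATASHARE_HOST variable are hypothetical, not part of Datashare. The response is printed as-is because its format is not described here.

```shell
# Hypothetical helper: print the name-finding URL for one NLP framework.
# DATASHARE_HOST is an assumed variable; it defaults to the host and port used above.
ner_url() {
  case "$1" in
    CORENLP|IXAPIPE|MITIE|OPENNLP)
      printf '%s/ner/findNames/%s\n' "${DATASHARE_HOST:-localhost:8080}" "$1"
      ;;
    *)
      echo "unknown framework: $1" >&2
      return 1
      ;;
  esac
}

# Hypothetical helper: send a text file to the endpoint and print the raw response.
# $1 is the framework name, $2 is the path to the text file.
find_names() {
  url=$(ner_url "$1") || return 1
  curl -i "$url" --data-binary "@$2"
}
```

For example, find_names OPENNLP path/to/a/file.txt sends the same request as the curl command above, with OPENNLP as the framework.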

Extract Text from Files

Implementations

Support

Tika File Formats

Extract Persons, Organizations or Locations from Text

Implementations

  • org.icij.datashare.text.nlp.corenlp.CorenlpPipeline

    Stanford CoreNLP v3.8.0, (Conditional Random Fields), Composite GPL v3+

  • org.icij.datashare.text.nlp.ixapipe.IxapipePipeline

    Ixa Pipes Nerc v1.6.1, (Perceptron), Apache Licence v2.0

  • org.icij.datashare.text.nlp.mitie.MitiePipeline

    MIT Information Extraction v0.8, (Structural Support Vector Machines), Boost Software License v1.0

  • org.icij.datashare.text.nlp.opennlp.OpennlpPipeline

    Apache OpenNLP v1.6.0, (Maximum Entropy), Apache Licence v2.0

Natural Language Processing Stages Support

NlpStage
TOKEN
SENTENCE
POS
NER

Named Entity Recognition Language Support

NlpStage.NER               ENGLISH  SPANISH  GERMAN  FRENCH   CHINESE
NlpPipeline.Type.CORENLP   X        X        X       (w/ EN)  X
NlpPipeline.Type.OPENNLP   X        X        -       X        -
NlpPipeline.Type.IXAPIPE   X        X        X       -        -
NlpPipeline.Type.MITIE     X        X        X       -        -
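
The table above can also be read as a lookup from a language to the pipelines that list NER support for it. The sketch below encodes it as a shell function. ner_pipelines_for is a hypothetical name, not part of Datashare, and the FRENCH entry keeps CORENLP even though the table qualifies it with (w/ EN).

```shell
# Hypothetical lookup built from the NER language support table.
# Prints the pipeline types that list support for the given language.
ner_pipelines_for() {
  case "$1" in
    ENGLISH|SPANISH) echo "CORENLP OPENNLP IXAPIPE MITIE" ;;
    GERMAN)          echo "CORENLP IXAPIPE MITIE" ;;
    FRENCH)          echo "CORENLP OPENNLP" ;;  # CORENLP is marked "(w/ EN)" in the table
    CHINESE)         echo "CORENLP" ;;
    *)
      echo "no NER support listed for: $1" >&2
      return 1
      ;;
  esac
}
```

For example, ner_pipelines_for GERMAN prints the three pipeline types that have an X in the GERMAN column.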

Named Entity Categories Support

NamedEntity.Category
ORGANIZATION
PERSON
LOCATION

Parts-of-Speech Language Support

NlpStage.POS               ENGLISH  SPANISH  GERMAN  FRENCH
NlpPipeline.Type.CORENLP   X        X        X       X
NlpPipeline.Type.OPENNLP   X        X        X       X
NlpPipeline.Type.IXAPIPE   X        X        X       X
NlpPipeline.Type.MITIE     -        -        -       -

Store and Search Documents and Named Entities

Implementations

  • org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer

    Elasticsearch v7.9.1, Apache Licence v2.0

Compilation / Build

Building requires JDK 11 and Maven 3, plus three running services:

  • a PostgreSQL server reachable at hostname postgres, with two databases, datashare and test, both writable by user test (password test)
  • an Elasticsearch instance reachable at hostname elasticsearch
  • a Redis server reachable at hostname redis

mvn validate
mvn -pl commons-test -am install
mvn -pl datashare-db liquibase:update
mvn test

Keeping the development environment up to date

It is important to keep datashare and datashare-client up to date by pulling from each repository's master branch.

To ensure that updates are registered, make clean dist must be run locally from each repository.

If dependencies have been updated on datashare-client, run yarn before make clean dist.

If the database models have changed within datashare, run the following commands before make clean dist:

sh datashare-db/scr/reset_datashare_db.sh
mvn -pl commons-test -am install
mvn -pl datashare-db liquibase:update
mvn test

License

Datashare is released under the GNU Affero General Public License.

Bug report, comment or (pull) request

We welcome feedback as well as contributions!

For any bug, question, comment or (pull) request, please contact us at [email protected]