
codeforkjeff / conciliator

License: GPL-3.0
OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.

Programming Languages

java
shell

Projects that are alternatives of or similar to conciliator

IATI.cloud
The open-source IATI datastore for IATI data with RESTful web API providing XML, JSON, CSV output. It extracts and parses IATI XML files referenced in the IATI Registry and powered by Apache Solr.
Stars: ✭ 35 (-63.16%)
Mutual labels:  solr
use-redux-hook
A simple react hook to get access to redux store
Stars: ✭ 13 (-86.32%)
Mutual labels:  openlibrary
jesterj
Document Ingestion Framework for Search Systems
Stars: ✭ 26 (-72.63%)
Mutual labels:  solr
hello-nlp
A natural language search microservice
Stars: ✭ 85 (-10.53%)
Mutual labels:  solr
nlpir-analysis-cn-ictclas
Lucene/Solr analyzer plugin. Supports macOS, Linux x86/64, and Windows x86/64. It is a Maven project, which lets you change the Lucene/Solr version for compatibility with the version you run.
Stars: ✭ 71 (-25.26%)
Mutual labels:  solr
Merge-Machine
Merge Dirty Data with Clean Reference Tables
Stars: ✭ 35 (-63.16%)
Mutual labels:  entity-resolution
jstarcraft-nlp
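Focuses on several core problems in natural language processing: lexical analysis, syntactic analysis, semantic analysis, language detection, information extraction, text clustering, and text classification. Provides a complete general-purpose design and reference implementation for developers in related fields. Covers a variety of natural language processing algorithms and adapts several natural language processing frameworks. Compatible with Lucene/Solr/ElasticSearch plugins.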
Stars: ✭ 92 (-3.16%)
Mutual labels:  solr
whatis
WhatIs.this: simple entity resolution through Wikipedia
Stars: ✭ 18 (-81.05%)
Mutual labels:  entity-resolution
solrq
Python Solr query utility // http://solrq.readthedocs.org/en/latest/
Stars: ✭ 18 (-81.05%)
Mutual labels:  solr
ezplatform-search-extra
Netgen's extra bits for eZ Platform search
Stars: ✭ 13 (-86.32%)
Mutual labels:  solr
OpenRefine-ecology-lesson
Data Cleaning with OpenRefine for Ecologists
Stars: ✭ 20 (-78.95%)
Mutual labels:  openrefine
specs
Specifications of the reconciliation API
Stars: ✭ 22 (-76.84%)
Mutual labels:  reconciliation-service
mdserver-web
Simple Linux Panel
Stars: ✭ 1,064 (+1020%)
Mutual labels:  solr
solr-stack
Ambari stack service for easily installing and managing Solr on HDP cluster
Stars: ✭ 18 (-81.05%)
Mutual labels:  solr
solr-ontology-tagger
Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri
Stars: ✭ 36 (-62.11%)
Mutual labels:  solr
orcidlink-LaTeX-command
LaTeX style file to add a macro for inserting a linked ORCiD logo
Stars: ✭ 53 (-44.21%)
Mutual labels:  orcid
turing
✨ 🧬 Turing AI - Semantic Navigation, Chatbot using Search Engine and Many NLP Vendors.
Stars: ✭ 30 (-68.42%)
Mutual labels:  solr
bitnami-docker-solr
Bitnami Docker Image for Solr
Stars: ✭ 33 (-65.26%)
Mutual labels:  solr
basic-solr-config
A starting point for solr schema, config and xslt.
Stars: ✭ 17 (-82.11%)
Mutual labels:  solr
BnLMetsExporter
Command Line Interface (CLI) to export METS/ALTO documents to other formats.
Stars: ✭ 11 (-88.42%)
Mutual labels:  solr

conciliator

conciliator is a growing collection of OpenRefine reconciliation services, as well as a Java framework for creating them. A reconciliation service tries to match variant text (usually names of things) to standard IDs for the entity represented by that text.

This project supersedes refine_viaf.

Public Server

If your needs are modest and you can't or don't want to run this software yourself, you can use the public server at http://refine.codefork.com/. Visit that address for more instructions.

General Features

  • Out of the box support for the following data sources:

    • VIAF - Virtual International Authority File
    • ORCID - digital identifiers for researchers
    • Open Library - an open, editable library catalog
    • Any Apache Solr collection
    • more to come (if you can contribute, please submit pull requests!)
  • Good performance (uses threads; stable memory usage; caches results)

  • Super easy to run (works on Linux, Mac, Windows)

Data Source Features

VIAF

  • Support for the following types of names provided by VIAF: Corporate Names, Geographic Names, Personal Names, Works, Expressions

  • "Proxy mode" to retrieve IDs used by source institutions, instead of VIAF IDs. (NOTE: hyperlinks to source record pages in OpenRefine are supported for BNE, DNB, ICCU, JPG, LC, NDL, SELIBR, SUDOC, and WKP. Links are BROKEN for BNC, BNF, DBC, and NUKAT. For all other sources, the links will take you to the VIAF page.)

ORCID

  • Uses the ORCID v2.1 API. The detailed search results of the v1.2 API are no longer supported, so n+1 requests are made to fetch name details, which is gross, but it's the best we can do. Heavy use may cause the ORCID API to start returning rate-limiting errors.

  • Properties are supported as a way to do fielded searches using Solr syntax. For lists of valid field names to use in the "As Property" box, see the section titled "Search for specific elements by field" on this page, and the list of identifier fields on the Supported Work Identifiers page.

    For example, if you have a column containing Scopus EIDs, you can select the "Include?" checkbox for it and enter "eid" in the "As Property" box on the reconciliation screen.

  • By default, queries are keyword searches on the entire ORCID bios, which can return odd results sometimes. The "smartnames" mode (see the instructions below) splits up names and searches on the given-names and family-name fields specifically; if there are no results, it falls back to a keyword search.
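The name-splitting step of "smartnames" mode can be pictured with the sketch below. It is only an illustration, not conciliator's actual code: the helper name smartnames_query is hypothetical, the splitting rules ("Family, Given" or "Given Family") are an assumption, and quoting each value as a phrase is a choice made here so that multi-word given names stay attached to their field. A single-word name is returned unchanged, which corresponds to a plain keyword search.

```shell
# Hypothetical helper: turn a personal name into a fielded query that uses
# the ORCID given-names and family-name search fields.
# Assumption: names look like "Family, Given" or "Given Family".
smartnames_query() {
  name=$1
  case $name in
    *,*)
      # "Family, Given": split on the first comma
      family=${name%%,*}
      given=${name#*,}
      given=${given# }        # drop one leading space after the comma
      ;;
    *" "*)
      # "Given Family": the last word is taken as the family name
      given=${name% *}
      family=${name##* }
      ;;
    *)
      # Single word: nothing to split, use it as a keyword query
      printf '%s' "$name"
      return
      ;;
  esac
  printf 'given-names:"%s" AND family-name:"%s"' "$given" "$family"
}
```

For example, smartnames_query 'Doe, Jane' and smartnames_query 'Jane Doe' both produce a query on given-names "Jane" and family-name "Doe". The fallback to a keyword search when the fielded query returns nothing is not shown here.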

Open Library

  • Open Library has rate limits on its API, so requests are not run in a threadpool. Expect it to be slow.

  • Support for including additional columns (useful for specifying author(s), for example, to help narrow down searches for common book titles). If no results are found, the code tries again with only the original column.
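The retry behavior described above (search with the extra columns first, then with the original column only if nothing comes back) follows a simple pattern, sketched below. This is not conciliator's implementation: search_with_fallback is a hypothetical name, and SEARCH_CMD stands in for whatever command or function performs the actual lookup and prints one result per line.

```shell
# search_with_fallback FULL_QUERY BASIC_QUERY
# Runs $SEARCH_CMD (a hypothetical search command or function) with the
# full query; if it prints nothing, retries with the basic query.
search_with_fallback() {
  results=$("$SEARCH_CMD" "$1")
  if [ -z "$results" ]; then
    # No hits with the additional column values, so try the title alone
    results=$("$SEARCH_CMD" "$2")
  fi
  printf '%s\n' "$results"
}
```

With SEARCH_CMD set to a real lookup function, a call such as search_with_fallback 'Dune Herbert' 'Dune' would only issue the second request when the first returns no results, which matters for a rate-limited API.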

Solr

  • Any Apache Solr collection can be used as a data source. See the sample commented-out lines in the conciliator.properties file for more details.

Running Conciliator on Your Own Computer

Install Java 1.8 or greater if you don't already have it.

Download the .jar file for the latest release. Alternatively, you can download the source code tarball or clone this repository, and build the .jar file using Maven.

Run this command:

# replace VERSION with the release you downloaded
java -jar conciliator-VERSION.jar

That's it! You should see some messages as the application starts up. Now you're ready to configure OpenRefine to use the service. When you're done with it, hit Ctrl-C to quit the application.

If a file named conciliator.properties exists in the current directory, conciliator will use the options found in it. See the sample file in this repository.

By default, conciliator will run on port 8080, which is used in the example URLs below. To use a different port, set the server.port property as follows when running the program:

java -Dserver.port=7000 -jar conciliator-VERSION.jar

Docker Image

A Docker image created by tobinski is available here:

https://hub.docker.com/r/tobinski/docker-codefork-conciliator/

Configuring OpenRefine

  1. In OpenRefine, choose a column of names you want to reconcile, and select "Reconcile" and "Start Reconciling..." in the column pull-down menu.

  2. Click "Add Standard Service..."

  3. Enter a URL based on the data source you wish to use.

    To reconcile against names from any VIAF source, type in:

    http://localhost:8080/reconcile/viaf
    

    To reconcile against a specific VIAF source, append its code to the end of the path. For example, to search only names from the Bibliothèque nationale de France, type in:

    http://localhost:8080/reconcile/viaf/BNF
    

    To retrieve the IDs used by source institutions, rather than VIAF IDs, use "proxy mode." For example, to search only names from the Library of Congress and retrieve their IDs, type in:

    http://localhost:8080/reconcile/viafproxy/LC
    

    To use ORCID:

    http://localhost:8080/reconcile/orcid
    

    To use ORCID with "smartnames" mode when reconciling names:

    http://localhost:8080/reconcile/orcid/smartnames
    

    To use Open Library: (On the reconciliation screen, under the "Also use relevant details from other columns" panel, you can check the "Include?" box for columns to include in the query. Give them any name in the "As Property" box. If no results are found with these column values added to the query, the service will try again with only the original selected column.)

    http://localhost:8080/reconcile/openlibrary
    
  4. Follow the instructions on the dialog box to start reconciling names.
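If you want to check an endpoint outside OpenRefine, the sketch below builds the URLs listed above and a one-entry "queries" batch in the shape defined by the Reconciliation Service API (a "query" string plus optional "properties" entries with "pid" and "v"). The helper names recon_endpoint and recon_queries are hypothetical, no JSON escaping is done (values are assumed to contain no double quotes or backslashes), and sending the batch as a form-encoded POST is an assumption based on how OpenRefine talks to reconciliation services.

```shell
# Build a reconciliation endpoint URL for a locally running conciliator.
# $1 = data source path as listed above (e.g. viaf, viafproxy/LC, orcid/smartnames)
# $2 = port (optional, defaults to 8080)
recon_endpoint() {
  printf 'http://localhost:%s/reconcile/%s' "${2:-8080}" "$1"
}

# Build a one-entry "queries" batch.
# $1 = text to reconcile
# $2, $3 = optional property id (the "As Property" value) and its cell value
# Assumption: values contain no double quotes or backslashes.
recon_queries() {
  if [ "$#" -ge 3 ]; then
    printf '{"q0":{"query":"%s","properties":[{"pid":"%s","v":"%s"}]}}' "$1" "$2" "$3"
  else
    printf '{"q0":{"query":"%s"}}' "$1"
  fi
}

# Usage against a running instance (not executed here):
#   curl --data-urlencode "queries=$(recon_queries 'Twain, Mark')" \
#        "$(recon_endpoint viaf)"
```

The second and third arguments of recon_queries mirror what the "Include?" checkbox and "As Property" box do on the reconciliation screen, for example recon_queries 'Dune' author 'Herbert' for Open Library.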

Creating Your Own Data Source

  1. Clone this repository to get the source code. The code you create in the next steps should live under a new com.codefork.refine.NEW_SOURCE package so that Spring's auto-scanning picks it up.

  2. Create a class for your data source that extends DataSource for very bare-bones functionality, or WebServiceDataSource if you are making requests to another web service. See the other data sources for some template code. Implement the abstract methods as required.

  3. Create a controller that autowires your new DataSource and hooks up a unique path, e.g. /reconcile/new_source. See VIAFController for an example.

  4. Write a test or two if you like.

  5. Set some default properties in Config if your data source has any settings you want to be configurable.

  6. Build a new .jar by running mvn clean package. Run the .jar file as in the instructions above, and you should be able to access the service for your new data source at:

    http://localhost:8080/reconcile/new_source
    

Advanced Usage

To build from the source code, install Maven and type:

mvn package

If you want to host this software on a server for long-term usage or if you want to enable logging for debugging purposes, take a look at run.sh for some helpful options.

You can change run-time options by editing the conciliator.properties file.

TODO

  • A few aspects of the Reconciliation Service API aren't implemented by this framework yet.
  • Use dependency injection instead of singleton for the threadpool shared by all VIAF instances. Might need to rework how data sources get instantiated on-the-fly in ReconcileController.

Resources

Specification for the Reconciliation Service API:

https://reconciliation-api.github.io/specs/latest/

This code drew inspiration from these other projects:

Do you use this thing??

Apparently, you do. Here's a bibliography of things that reference conciliator:

https://github.com/codeforkjeff/conciliator/wiki

If you use conciliator, please take a few seconds to leave a comment on this page. Hearing from users really motivates me to continue improving this project.

License

This code is distributed under a GNU General Public License. See the file LICENSE for details.
