
nsoft / jesterj

License: Apache-2.0
Document Ingestion Framework for Search Systems

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to jesterj

multi-select-facet
An example of multi-select facet with Solr, Vue and Go
Stars: ✭ 30 (+15.38%)
Mutual labels:  solr
kitodo-presentation
Kitodo.Presentation is a feature-rich framework for building a METS- or IIIF-based digital library. It is part of the Kitodo Digital Library Suite.
Stars: ✭ 33 (+26.92%)
Mutual labels:  solr
solrq
Python Solr query utility // http://solrq.readthedocs.org/en/latest/
Stars: ✭ 18 (-30.77%)
Mutual labels:  solr
ltr-tools
Set of command line tools for Learning To Rank
Stars: ✭ 13 (-50%)
Mutual labels:  solr
chorus
Towards an open source stack for e-commerce search
Stars: ✭ 86 (+230.77%)
Mutual labels:  solr
IATI.cloud
The open-source IATI datastore for IATI data with a RESTful web API providing XML, JSON, and CSV output. It extracts and parses IATI XML files referenced in the IATI Registry and is powered by Apache Solr.
Stars: ✭ 35 (+34.62%)
Mutual labels:  solr
django-solr
Solr Search Engine ORM for Django
Stars: ✭ 24 (-7.69%)
Mutual labels:  solr
BnLMetsExporter
Command Line Interface (CLI) to export METS/ALTO documents to other formats.
Stars: ✭ 11 (-57.69%)
Mutual labels:  solr
wasp
WASP is a framework to build complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (-26.92%)
Mutual labels:  solr
nlpir-analysis-cn-ictclas
Lucene/Solr Analyzer Plugin. Supports macOS, Linux x86/64, and Windows x86/64. It's a Maven project, which allows you to change the Lucene/Solr version to match the release you need.
Stars: ✭ 71 (+173.08%)
Mutual labels:  solr
solr-zkutil
Solr Cloud and ZooKeeper CLI
Stars: ✭ 14 (-46.15%)
Mutual labels:  solr
skipchunk
Extracts a latent knowledge graph from text and indexes/queries it in Elasticsearch or Solr
Stars: ✭ 18 (-30.77%)
Mutual labels:  solr
solr-stack
Ambari stack service for easily installing and managing Solr on an HDP cluster
Stars: ✭ 18 (-30.77%)
Mutual labels:  solr
searchhub
Fusion demo app searching open-source project data from the Apache Software Foundation
Stars: ✭ 42 (+61.54%)
Mutual labels:  solr
turing
✨ 🧬 Turing AI - Semantic Navigation, Chatbot using Search Engine and Many NLP Vendors.
Stars: ✭ 30 (+15.38%)
Mutual labels:  solr
yasa
Yet Another Solr Admin
Stars: ✭ 48 (+84.62%)
Mutual labels:  solr
jstarcraft-nlp
Focused on solving several core problems in natural language processing: lexical analysis, syntactic analysis, semantic analysis, language detection, information extraction, text clustering, and text classification. Provides a complete general-purpose design and reference implementation for practitioners in these fields. Covers a variety of NLP algorithms and adapts to multiple NLP frameworks. Compatible with Lucene/Solr/ElasticSearch plugins.
Stars: ✭ 92 (+253.85%)
Mutual labels:  solr
ezplatform-search-extra
Netgen's extra bits for eZ Platform search
Stars: ✭ 13 (-50%)
Mutual labels:  solr
mdserver-web
Simple Linux Panel
Stars: ✭ 1,064 (+3992.31%)
Mutual labels:  solr
hello-nlp
A natural language search microservice
Stars: ✭ 85 (+226.92%)
Mutual labels:  solr

JesterJ


A new highly flexible, highly scalable document ingestion system.

See the website and the documentation for more info.

Getting Started

Please see the documentation in the wiki.

Project Status

Current release version: 1.0-beta2. (But the head revision on GitHub is much better right now! A new release is coming soon.)

It can be used with the following Gradle configuration:

repositories {
  mavenCentral()
  maven {
    url 'https://jesterj.jfrog.io/jesterj/libs-release/'
  }
  maven {
    url 'https://clojars.org/repo'
  }
}

dependencies {
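  // 'compile' is deprecated in recent Gradle versions; 'implementation' is the modern equivalent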
  compile ('org.jesterj:ingest:1.0-beta2')
}

The extra repos are for a patched version of Cassandra and should go away in future releases (see https://issues.apache.org/jira/browse/CASSANDRA-13396). The Clojars repo is for a Clojure-based implementation of docopt, which will hopefully become unnecessary in future versions.

JDK versions

Presently only JDK 8 is supported in released versions. JDK 9 and 10 will not be explicitly supported. Now that JDK 11 is out as an LTS version, support for it has begun: JDK 11 is supported on the master branch, but not yet in a release.

Slack Channel

If you want an invite to our Slack channel, just send mail to:

                     ______    _           __            _                  
   ____ ___  _______/ ____ \  (_)__  _____/ /____  _____(_)____  _________ _
  / __ `/ / / / ___/ / __ `/ / / _ \/ ___/ __/ _ \/ ___/ // __ \/ ___/ __ `/
 / /_/ / /_/ (__  ) / /_/ / / /  __(__  ) /_/  __/ /  / // /_/ / /  / /_/ / 
 \__, /\__,_/____/\ \__,_/_/ /\___/____/\__/\___/_/__/ (_)____/_/   \__, /  
/____/             \____/___/                     /___/            /____/   

Features

This release includes the following features:

  • Embedded Cassandra server
  • Cassandra config and data location configurable, defaults to ~/.jj/cassandra
  • Initial support for fault tolerance via logging statuses to the embedded Cassandra server (WIP)
  • Log4j appender to write to Cassandra where desired
  • Initial API/process for user-written steps (see documentation; a rough sketch follows below)
  • 40% test coverage (jacoco)
  • Simple filesystem scanner
  • Copy Field processor
  • Date Reformat processor
  • Human Readable File Size processor
  • Tika processor to extract content
  • Solr sender to send documents to Solr in batches.
  • Runnable example that executes a plan which scans a filesystem and indexes the documents in Solr.

Release 0.1 is intended to be the smallest functional unit. Plans and steps need to be assembled in code, run locally only, and only a single node is supported. Documents indexed will have fields for mod-time, file name, and file size.
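The user-step API is described in the wiki. As a rough illustration only, a trivial custom processor might look something like the sketch below. The package, interface, and method names (org.jesterj.ingest.model.DocumentProcessor, Document, getName(), processDocument()) and the multimap-style field API are assumptions based on that description, so check the javadoc of the release you depend on for the exact signatures.

// Hedged sketch of a user-written step; names are assumptions, not a verified API.
import org.jesterj.ingest.model.Document;
import org.jesterj.ingest.model.DocumentProcessor;

public class TagDocumentProcessor implements DocumentProcessor {

  @Override
  public String getName() {
    return "tag-document";
  }

  @Override
  public Document[] processDocument(Document document) {
    // Add a marker field so the step's effect is visible in Solr
    // (assumes multimap-style field access on Document).
    document.put("processed_by", getName());
    // A processor may emit zero, one, or many documents; this one is one-for-one.
    return new Document[]{document};
  }
}

Such a class would then be wired into a step of the plan, per the wiki documentation.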

Progress for 1.0

  • JDBC scanner
  • Cassandra based FTI
  • Document hashing to detect changed docs (any scanner)
  • Node and Transport style senders for Elastic
  • Ability to load Java-based config from a jar file - experimental.
  • More processors: Fetch URL, Regex Replace Value, Delete Field, Parse Field as Template, URL Encode Field
  • Publish jars on Maven Central
  • Up-to-date docs in the wiki.

The Java config feature is experimental but working out better than expected. I wanted to use what I had built for a project, but the lack of externalized configuration was a blocker. It was a quick fix, but it's turning out to be quite pleasant to work with. The downside is I'm not sure how it would carry forward to later stages of the project, so it might still go away. Feedback welcome.

TODO for 1.0 final

  • 50% test coverage
  • Fix #84
  • Build a demo jar that can be run to demonstrate the java config usage
  • Demo/tutorial to demonstrate indexing a database and a filesystem simultaneously into solr

Release 1.0 is intended to be usable for single-node systems, and therefore suitable for production use on small to medium-sized projects.

TODO for 2.0

  • Serialized format for a plan/steps.
  • JINI Registrar
  • Register Node Service on JINI Registrar
  • Display nodes visible in control web app.
  • JINI Service to accept serialized format
  • Ability to build a plan in web-app.
  • 60% test coverage
  • Availability on Maven Central.
  • Build and run the 0.2 scenario via the control web-app.

Release 2.0 is intended to be similar to 1.0 but with a very basic web control UI. At this point it should be possible to install the war file, start a node, and build and run a plan via the control web app.

TODO for 3.0

  • Secure connections among nodes and with the web app (credential provider)
  • Ensure nodes namespace their Cassandra data dirs to avoid disasters if more than one node runs per user account
  • Cassandra cluster formation
  • Pass Documents among nodes using Java Spaces
  • Support for adding helper nodes that scale a step or several steps horizontally.
  • Make the control UI pretty.

Release 3.0 is intended to be the first release to spread work across nodes.

What is FTI?

FTI stands for Fault Tolerant Indexing. For our purposes this means that once a scanner is pointed at a document source, it is guaranteed to eventually do one of the following things with every qualifying document:

  • Process the document and send it to solr.
  • Log an error explaining why the document processing failed.

It will do this no matter how many nodes fail or how many times Solr is rebooted.
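To make that guarantee concrete, the sketch below illustrates the idea in miniature. This is not JesterJ's implementation: JesterJ records per-document status in the embedded Cassandra server so it survives restarts, while this toy version uses an in-memory map purely to show the state transitions. Every document ends in either an indexed or an errored state, and with a durable status table the loop could be interrupted and re-run at any point without losing track of work.

// Conceptual illustration of fault tolerant indexing; not JesterJ code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FtiSketch {

  enum Status { DIRTY, PROCESSING, INDEXED, ERRORED }

  // JesterJ would persist this table (in Cassandra); here it is in memory.
  private final Map<String, Status> statusTable = new ConcurrentHashMap<>();

  /** Called by a scanner for every qualifying document it discovers. */
  public void mark(String docId) {
    statusTable.putIfAbsent(docId, Status.DIRTY);
  }

  /** Retry loop: anything not yet in a terminal state is (re)processed. */
  public void drain(DocumentIndexer indexer) {
    statusTable.forEach((docId, status) -> {
      if (status == Status.INDEXED || status == Status.ERRORED) {
        return; // terminal states are never revisited
      }
      statusTable.put(docId, Status.PROCESSING);
      try {
        indexer.index(docId);                   // e.g. process and send to Solr
        statusTable.put(docId, Status.INDEXED); // success recorded
      } catch (Exception e) {
        statusTable.put(docId, Status.ERRORED); // failure logged with its cause
      }
    });
  }

  /** Stand-in for the processing pipeline plus the Solr sender. */
  public interface DocumentIndexer {
    void index(String docId) throws Exception;
  }
}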
