SeldonIO / semantic-vectors-lucene-tools

Licence: Apache-2.0 license
Tools for building a Lucene index for Semantic Vectors

Lucene Tool for Semantic Vectors

Allows creation of a Lucene index from metadata about items (e.g. movies, articles) held in a Seldon MySQL database. The resulting index can be used with the Semantic Vectors tool (https://code.google.com/p/semanticvectors/) to create a Semantic Vectors database that can be used for item similarity.

At present, the main use case is to call the library as follows:

java -cp target/semvec-lucene-tools-1.2-jar-with-dependencies.jar io.seldon.semvec.CreateLuceneIndexFromDb -l <lucene_folder> -raw-ids -use-item-attrs -attr-names <attr_names> -recreate -item-limit <item_limit> -jdbc <JDBC>
  • <lucene_folder> : the folder in which to recreate the Lucene index
  • <attr_names> : the list of attribute names from which to read metadata
  • <item_limit> : the maximum number of items to read from the items table
  • <JDBC> : the JDBC connection string for the database holding the Seldon metadata for items
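
As an illustration, the placeholders above might be filled in as in the following dry-run sketch. Every value is a hypothetical placeholder rather than a project default: the output folder, the single attribute name `description`, the item limit, and the MySQL-style JDBC URL (host, database name, and credentials) are all assumptions. A single attribute name is used because the separator for a list of names is not stated here. The script only prints the command so it can be checked before anything is run.

```shell
#!/bin/sh
# Dry-run sketch: assemble the documented flags and print the command.
# All values below are hypothetical placeholders, not project defaults.
set -eu

JAR="target/semvec-lucene-tools-1.2-jar-with-dependencies.jar"
LUCENE_FOLDER="/tmp/lucene-index"   # hypothetical output folder
ATTR_NAMES="description"            # hypothetical single attribute name
ITEM_LIMIT=1000                     # hypothetical cap on items read
# Assumed MySQL-style JDBC URL; host, database and credentials are made up.
# Quoting matters: an unquoted '&' would be interpreted by the shell.
JDBC="jdbc:mysql://localhost:3306/testdb?user=root&password=secret"

# Store the command as positional parameters so each argument stays intact.
set -- java -cp "$JAR" io.seldon.semvec.CreateLuceneIndexFromDb \
    -l "$LUCENE_FOLDER" -raw-ids -use-item-attrs \
    -attr-names "$ATTR_NAMES" -recreate \
    -item-limit "$ITEM_LIMIT" -jdbc "$JDBC"

# Print instead of executing.
echo "Would run: $*"
```

To execute the command for real, replace the final `echo` line with `"$@"`.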

There is also code to allow:

Caveats

At present the code is specific to a MySQL version of the Seldon database. Eventually, it could be made more generally useful by supporting a general metadata datastore that is not Seldon-specific.

A cut-down schema for the four tables needed is in schema-minimal.sql. It contains:

  • items : a table which provides ids and names for each document
  • item_attr : a list of attributes for each document
  • item_map_varchar : a table to hold varchar attributes (text < 256 characters)
  • item_map_text : a table to hold large text attributes

Example Use Case

The examples folder has some simple examples. First build the project with Maven:

mvn -DskipTests=true clean package

Go into the examples folder and edit common_vars.sh for your local MySQL database configuration.

Then run:

  • ./create_wiki_film_db.sh : this will download film abstracts from DBpedia and populate a MySQL database.

You can now run:

  • ./create_basic_index.sh : this will create a Lucene index from the film abstracts, create Semantic Vectors databases from it, and run an example query.
  • ./create_ner_index.sh : this will download OpenNLP models for person named-entity extraction, create a Lucene index in which each extracted name has its words joined by underscores, build Semantic Vectors databases, and run an example query.
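
The underscore joining in the NER example can be pictured with a small shell sketch. This is not the project's OpenNLP code: `join_name` is a hypothetical helper, and its input is assumed to be a person name that has already been detected. The likely purpose of the joining, stated here as an assumption, is that a multi-word name ends up as a single term in the index rather than as separate words.

```shell
#!/bin/sh
# Illustration only: join the words of one already-detected person name.
# join_name is a hypothetical helper, not part of this project.
set -eu

join_name() {
    # Turn spaces into underscores; -s collapses runs into a single underscore.
    printf '%s\n' "$1" | tr -s ' ' '_'
}

join_name "Stanley Kubrick"
```

Name detection itself is the part handled by the OpenNLP models that ./create_ner_index.sh downloads; the sketch covers only the joining step.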

License

This project is licensed under the Apache 2 license. See LICENSE.txt.

Seldon Prediction Engine

Join our Beta program for the Seldon prediction engine.
