tmaciejewski / see

License: GPL-3.0
Search Engine in Erlang

Programming Languages

Erlang, HTML, JavaScript

Projects that are alternatives of or similar to see

Vectorsinsearch
Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015
Stars: ✭ 71 (+162.96%)
Mutual labels:  search-engine, information-retrieval
evildork
Evildork targeting your fiancée
Stars: ✭ 46 (+70.37%)
Mutual labels:  search-engine, information-retrieval
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+12525.93%)
Mutual labels:  search-engine, information-retrieval
Pisa
PISA: Performant Indexes and Search for Academia
Stars: ✭ 489 (+1711.11%)
Mutual labels:  search-engine, information-retrieval
Conceptualsearch
Train a Word2Vec model or LSA model, and Implement Conceptual Search\Semantic Search in Solr\Lucene - Simon Hughes Dice.com, Dice Tech Jobs
Stars: ✭ 245 (+807.41%)
Mutual labels:  search-engine, information-retrieval
Resin
Hardware-accelerated vector-based search engine. Available as a HTTP service or as an embedded library.
Stars: ✭ 529 (+1859.26%)
Mutual labels:  search-engine, information-retrieval
lucene
Apache Lucene open-source search software
Stars: ✭ 1,009 (+3637.04%)
Mutual labels:  search-engine, information-retrieval
Rated Ranking Evaluator
Search Quality Evaluation Tool for Apache Solr & Elasticsearch search-based infrastructures
Stars: ✭ 134 (+396.3%)
Mutual labels:  search-engine, information-retrieval
Aquiladb
Drop in solution for Decentralized Neural Information Retrieval. Index latent vectors along with JSON metadata and do efficient k-NN search.
Stars: ✭ 222 (+722.22%)
Mutual labels:  search-engine, information-retrieval
Bm25
A Python implementation of the BM25 ranking function.
Stars: ✭ 159 (+488.89%)
Mutual labels:  search-engine, information-retrieval
Lucene Solr
Apache Lucene and Solr open-source search software
Stars: ✭ 4,217 (+15518.52%)
Mutual labels:  search-engine, information-retrieval
query-wellformedness
25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.
Stars: ✭ 80 (+196.3%)
Mutual labels:  search-engine, information-retrieval
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (+1240.74%)
Mutual labels:  search-engine, information-retrieval
Relevancyfeedback
Dice.com's relevancy feedback solr plugin created by Simon Hughes (Dice). Contains request handlers for doing MLT style recommendations, conceptual search, semantic search and personalized search
Stars: ✭ 19 (-29.63%)
Mutual labels:  search-engine, information-retrieval
Search Engine
A math-aware search engine.
Stars: ✭ 278 (+929.63%)
Mutual labels:  search-engine, information-retrieval
Sf1r Lite
Search Formula-1: A distributed high-performance massive data engine for enterprise/vertical search
Stars: ✭ 158 (+485.19%)
Mutual labels:  search-engine, information-retrieval
patzilla
PatZilla is a modular patent information research platform and data integration toolkit with a modern user interface and access to multiple data sources.
Stars: ✭ 71 (+162.96%)
Mutual labels:  search-engine, information-retrieval
solr
Apache Solr open-source search software
Stars: ✭ 651 (+2311.11%)
Mutual labels:  search-engine, information-retrieval
hohser
Highlight or Hide Search Engine Results
Stars: ✭ 89 (+229.63%)
Mutual labels:  search-engine
starter
Create vertical search web application in minutes with generator (based on ItemsAPI)
Stars: ✭ 21 (-22.22%)
Mutual labels:  search-engine

SEE

SEE (or see, whatever) is a simple search engine written in Erlang. It provides a web crawler, a search engine, and a web frontend. It is split into two applications: see_db and see_crawler.

see_db

The see_db application handles indexing and the web interface. It is designed to allow switching the storage backend and the ranking algorithm.

To start the application, run the start_db_node script.

Application parameters:

  • ip (e.g. {0,0,0,0}) -- web server IP address
  • port (e.g. 8888) -- web server port
  • domain_filter (e.g. "^localhost") -- regular expression filter for URLs (useful for restricting crawling to a specific domain)
  • storage -- storage backend (see below for available storage backends)
  • rank -- ranking algorithm (see below for available ranking algorithms)
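
For reference, here is a minimal configuration sketch. It assumes these parameters are plain application environment values for see_db, set in a sys.config-style file; the exact layout expected by the start script may differ.

    %% Sketch only: assumes see_db reads these as ordinary app env values.
    [{see_db, [
        {ip, {0,0,0,0}},                %% web server IP address
        {port, 8888},                   %% web server port
        {domain_filter, "^localhost"},  %% only follow URLs matching this regexp
        {storage, see_db_storage_ets},  %% storage backend (see below)
        {rank, see_rank_tfidf}          %% ranking algorithm (see below)
    ]}].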

Storage backends

The storage backend is responsible for storing web pages along with additional data structures that facilitate indexing. It is abstracted away from the engine logic, i.e. computing the final results, and is more or less a key/value store. The storage backend is selected by setting the storage option in the app file. Currently only ETS and Mnesia backends are implemented.
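
To illustrate the pluggable-backend idea in Erlang terms, a backend can be described as a behaviour. The callback names below are invented for this sketch and are not see_db's actual interface.

    %% Hypothetical behaviour definition; callback names are illustrative only.
    -module(see_db_storage).

    %% Store a page together with the words extracted from it.
    -callback put_page(Url :: string(), Words :: [string()]) -> ok.

    %% Return the URLs of pages containing the given normalized word.
    -callback find(Word :: string()) -> [string()].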

ETS storage

ETS storage is easy to set up, but it lacks persistence and distribution. Only one db node is allowed, so all of the data must fit into the RAM of a single machine.

To select ETS storage, use the see_db_storage_ets value as the storage app option.

Mnesia storage

Mnesia storage can be used to gain persistence and distribution. There can be as many db nodes as needed, though it has only been tested with a single node. All tables are disc_copies, so the data must still fit into RAM, as table fragmentation is not yet implemented.

To select Mnesia storage, use the see_db_storage_mnesia value as the storage app option. Then you need to create the schema and tables. To do this for a single node, run the create_mnesia_schema script.
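
For a rough idea of what single-node setup involves, the script presumably does something equivalent to the following; the table name and attributes are hypothetical, as the real schema belongs to see_db.

    %% Sketch of single-node Mnesia setup (hypothetical table definition).
    ok = mnesia:create_schema([node()]),
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(page,
        [{disc_copies, [node()]},       %% kept both on disc and in RAM
         {attributes, [url, words]}]).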

Ranking

Ranking is the most important part of a search engine: queries may return thousands or millions of results, far too many for a human to make use of. Users want only a dozen of the most relevant results, and producing them is the job of the ranking algorithm.

The ranking algorithm is selected by setting the rank option in the app file. Currently only tf-idf is implemented.

tf-idf ranking

tf-idf is a simple ranking algorithm that takes into account only word occurrences in a page versus in the whole index. To select this algorithm, use the see_rank_tfidf value as the rank option in the app file.
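
The underlying formula is simple; here is a minimal sketch (illustrative only, not the see_rank_tfidf source):

    %% tf-idf: term frequency in the page times inverse document frequency
    %% across the whole index. The +1 avoids division by zero.
    tf_idf(TermCountInPage, WordsInPage, TotalPages, PagesWithTerm) ->
        Tf = TermCountInPage / WordsInPage,
        Idf = math:log(TotalPages / (1 + PagesWithTerm)),
        Tf * Idf.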

see_crawler

This application is responsible for crawling the web. There may be many nodes running this application.

To start the application, run the start_crawler_node script.

Application parameters:

  • crawler_num (e.g. 1) -- number of crawler workers
  • db_node (e.g. 'db@localhost') -- name of the see_db node
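
A matching configuration sketch, under the same assumption as for see_db that these are plain application environment values:

    %% Sketch only: assumes see_crawler reads these as ordinary app env values.
    [{see_crawler, [
        {crawler_num, 1},            %% number of crawler workers
        {db_node, 'db@localhost'}    %% name of the see_db node
    ]}].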

Usage

By default, the web interface is available at http://localhost:8888 on the db node. You need to add the first URL to start crawling from. To find a page, type your query into the search text box and click "Search" or press Enter. Only the 100 most relevant results are shown.

Each crawler requests an unvisited URL from the db node, visits it, extracts words (as they are) and links from the page, and sends them back to the db node. After normalization, the words are saved into the index, and the links are inserted as unvisited URLs.
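
The cycle roughly corresponds to the sketch below; the see_db function names are invented for illustration and are not the project's actual API.

    %% Illustrative crawl cycle; next_url/0 and index/3 are hypothetical names.
    %% Requires the inets application to be started for httpc.
    crawl_once(DbNode) ->
        Url = rpc:call(DbNode, see_db, next_url, []),
        {ok, {{_, 200, _}, _Headers, Body}} = httpc:request(get, {Url, []}, [], []),
        %% Words are sent as they are; normalization happens on the db side.
        Words = string:lexemes(Body, " \t\r\n"),
        {match, Links} = re:run(Body, "href=\"([^\"]+)\"",
                                [global, {capture, all_but_first, list}]),
        rpc:call(DbNode, see_db, index, [Url, Words, Links]).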

Demo

Check out this site to see a running version: http://vps238545.ovh.net:8888/

It has indexed the whole Erlang documentation.

TODO

  • HTTPS support
  • support for different encodings (e.g. ISO-8859, CP-1250)
  • tf-idf ranking
  • Mnesia storage backend
  • Amazon S3 storage backend
  • PageRank
  • stemming
  • complex queries (phrases, logical operators, inurl:, intitle:, site:)
  • periodically updating already visited pages