tmaciejewski / see

License: GPL-3.0
Search Engine in Erlang

Programming Languages

Erlang, HTML, JavaScript

Projects that are alternatives of or similar to see

Vectorsinsearch
Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015
Stars: ✭ 71 (+162.96%)
Mutual labels:  search-engine, information-retrieval
evildork
Evildork targeting your fiancée
Stars: ✭ 46 (+70.37%)
Mutual labels:  search-engine, information-retrieval
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+12525.93%)
Mutual labels:  search-engine, information-retrieval
Pisa
PISA: Performant Indexes and Search for Academia
Stars: ✭ 489 (+1711.11%)
Mutual labels:  search-engine, information-retrieval
Conceptualsearch
Train a Word2Vec model or LSA model, and Implement Conceptual Search\Semantic Search in Solr\Lucene - Simon Hughes Dice.com, Dice Tech Jobs
Stars: ✭ 245 (+807.41%)
Mutual labels:  search-engine, information-retrieval
Resin
Hardware-accelerated vector-based search engine. Available as a HTTP service or as an embedded library.
Stars: ✭ 529 (+1859.26%)
Mutual labels:  search-engine, information-retrieval
lucene
Apache Lucene open-source search software
Stars: ✭ 1,009 (+3637.04%)
Mutual labels:  search-engine, information-retrieval
Rated Ranking Evaluator
Search Quality Evaluation Tool for Apache Solr & Elasticsearch search-based infrastructures
Stars: ✭ 134 (+396.3%)
Mutual labels:  search-engine, information-retrieval
Aquiladb
Drop in solution for Decentralized Neural Information Retrieval. Index latent vectors along with JSON metadata and do efficient k-NN search.
Stars: ✭ 222 (+722.22%)
Mutual labels:  search-engine, information-retrieval
Bm25
A Python implementation of the BM25 ranking function.
Stars: ✭ 159 (+488.89%)
Mutual labels:  search-engine, information-retrieval
Lucene Solr
Apache Lucene and Solr open-source search software
Stars: ✭ 4,217 (+15518.52%)
Mutual labels:  search-engine, information-retrieval
query-wellformedness
25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.
Stars: ✭ 80 (+196.3%)
Mutual labels:  search-engine, information-retrieval
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (+1240.74%)
Mutual labels:  search-engine, information-retrieval
Relevancyfeedback
Dice.com's relevancy feedback solr plugin created by Simon Hughes (Dice). Contains request handlers for doing MLT style recommendations, conceptual search, semantic search and personalized search
Stars: ✭ 19 (-29.63%)
Mutual labels:  search-engine, information-retrieval
Search Engine
A math-aware search engine.
Stars: ✭ 278 (+929.63%)
Mutual labels:  search-engine, information-retrieval
Sf1r Lite
Search Formula-1: A distributed high-performance massive data engine for enterprise/vertical search
Stars: ✭ 158 (+485.19%)
Mutual labels:  search-engine, information-retrieval
patzilla
PatZilla is a modular patent information research platform and data integration toolkit with a modern user interface and access to multiple data sources.
Stars: ✭ 71 (+162.96%)
Mutual labels:  search-engine, information-retrieval
solr
Apache Solr open-source search software
Stars: ✭ 651 (+2311.11%)
Mutual labels:  search-engine, information-retrieval
hohser
Highlight or Hide Search Engine Results
Stars: ✭ 89 (+229.63%)
Mutual labels:  search-engine
starter
Create vertical search web application in minutes with generator (based on ItemsAPI)
Stars: ✭ 21 (-22.22%)
Mutual labels:  search-engine

SEE

SEE (or see, whatever) is a simple search engine written in Erlang. It provides a web crawler, a search engine, and a web frontend. It is split into two applications: see_db and see_crawler.

see_db

The see_db application handles indexing and the web interface. It is designed to allow switching the storage backend and the ranking algorithm.

To start the application, run the start_db_node script.

Application parameters:

  • ip (e.g. {0,0,0,0}) -- web server IP address
  • port (e.g. 8888) -- web server port
  • domain_filter (e.g. "^localhost") -- regular expression filter for URLs (useful for restricting crawling to a specific domain)
  • storage -- storage backend (see below for available storage backends)
  • rank -- ranking algorithm (see below for available ranking algorithms)
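
For reference, here is a minimal configuration sketch. It assumes these parameters are plain application environment values for see_db, set in a sys.config-style file; the exact layout expected by the start script may differ.

    %% Sketch only: assumes see_db reads these as ordinary app env values.
    [{see_db, [
        {ip, {0,0,0,0}},                %% web server IP address
        {port, 8888},                   %% web server port
        {domain_filter, "^localhost"},  %% only follow URLs matching this regexp
        {storage, see_db_storage_ets},  %% storage backend (see below)
        {rank, see_rank_tfidf}          %% ranking algorithm (see below)
    ]}].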

Storage backends

The storage backend is responsible for storing web pages along with additional data structures that facilitate indexing. It is abstracted away from the engine logic, i.e. computing the final results, and is more or less a key/value store. The storage backend is selected by setting the storage option in the app file. Currently only ETS and Mnesia backends are implemented.
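
To illustrate the pluggable-backend idea in Erlang terms, a backend can be described as a behaviour. The callback names below are invented for this sketch and are not see_db's actual interface.

    %% Hypothetical behaviour definition; callback names are illustrative only.
    -module(see_db_storage).

    %% Store a page together with the words extracted from it.
    -callback put_page(Url :: string(), Words :: [string()]) -> ok.

    %% Return the URLs of pages containing the given normalized word.
    -callback find(Word :: string()) -> [string()].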

ETS storage

ETS storage is easy to set up, but it lacks persistence and distribution. Only one db node is allowed, so all of the data must fit into the RAM of a single machine.

To select ETS storage, use the see_db_storage_ets value as the storage app option.

Mnesia storage

Mnesia storage can be used to gain persistence and distribution. There can be as many db nodes as needed, though it has only been tested with a single node. All tables are disc_copies, so the data must still fit into RAM, as table fragmentation is not yet implemented.

To select Mnesia storage, use the see_db_storage_mnesia value as the storage app option. Then you need to create the schema and tables. To do this for a single node, run the create_mnesia_schema script.
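
For a rough idea of what single-node setup involves, the script presumably does something equivalent to the following; the table name and attributes are hypothetical, as the real schema belongs to see_db.

    %% Sketch of single-node Mnesia setup (hypothetical table definition).
    ok = mnesia:create_schema([node()]),
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(page,
        [{disc_copies, [node()]},       %% kept both on disc and in RAM
         {attributes, [url, words]}]).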

Ranking

Ranking is the most important part of a search engine: queries may return thousands or millions of results, far too many for a human to make use of. Users want only a dozen of the most relevant results, and producing them is the job of the ranking algorithm.

The ranking algorithm is selected by setting the rank option in the app file. Currently only tf-idf is implemented.

tf-idf ranking

tf-idf is a simple ranking algorithm that takes into account only word occurrences in a page versus in the whole index. To select this algorithm, use the see_rank_tfidf value as the rank option in the app file.
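
The underlying formula is simple; here is a minimal sketch (illustrative only, not the see_rank_tfidf source):

    %% tf-idf: term frequency in the page times inverse document frequency
    %% across the whole index. The +1 avoids division by zero.
    tf_idf(TermCountInPage, WordsInPage, TotalPages, PagesWithTerm) ->
        Tf = TermCountInPage / WordsInPage,
        Idf = math:log(TotalPages / (1 + PagesWithTerm)),
        Tf * Idf.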

see_crawler

This application is responsible for crawling the web. There may be many nodes running this application.

To start the application, run the start_crawler_node script.

Application parameters:

  • crawler_num (e.g. 1) -- number of crawler workers
  • db_node (e.g. 'db@localhost') -- name of the see_db node
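
A matching configuration sketch, under the same assumption as for see_db that these are plain application environment values:

    %% Sketch only: assumes see_crawler reads these as ordinary app env values.
    [{see_crawler, [
        {crawler_num, 1},            %% number of crawler workers
        {db_node, 'db@localhost'}    %% name of the see_db node
    ]}].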

Usage

By default, the web interface is available at http://localhost:8888 on the db node. You need to add the first URL to start crawling from. To find a page, type your query into the search text box and click "Search" or press Enter. Only the 100 most relevant results are shown.

Each crawler requests an unvisited URL from the db node, visits it, extracts words (as they are) and links from the page, and sends them back to the db node. After normalization, the words are saved into the index, and the links are inserted as unvisited URLs.
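
The cycle roughly corresponds to the sketch below; the see_db function names are invented for illustration and are not the project's actual API.

    %% Illustrative crawl cycle; next_url/0 and index/3 are hypothetical names.
    %% Requires the inets application to be started for httpc.
    crawl_once(DbNode) ->
        Url = rpc:call(DbNode, see_db, next_url, []),
        {ok, {{_, 200, _}, _Headers, Body}} = httpc:request(get, {Url, []}, [], []),
        %% Words are sent as they are; normalization happens on the db side.
        Words = string:lexemes(Body, " \t\r\n"),
        {match, Links} = re:run(Body, "href=\"([^\"]+)\"",
                                [global, {capture, all_but_first, list}]),
        rpc:call(DbNode, see_db, index, [Url, Words, Links]).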

Demo

Check out this site to see a running version: http://vps238545.ovh.net:8888/

It has indexed the whole Erlang documentation.

TODO

  • HTTPS support
  • support for different encodings (e.g. ISO-8859, CP-1250)
  • tf-idf ranking
  • Mnesia storage backend
  • Amazon S3 storage backend
  • PageRank
  • stemming
  • complex queries (phrases, logical operators, inurl:, intitle:, site:)
  • periodically updating already visited pages