
USCDataScience / Sparkler

Licence: apache-2.0

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Programming Languages

java

68154 projects - #9 most used programming language

Labels

search spark distributed-systems big-data search-engine information-retrieval solr web-crawler

Projects that are alternatives of or similar to Sparkler

Lucene Solr

Apache Lucene and Solr open-source search software

Stars: ✭ 4,217 (+1064.92%)

Mutual labels: search, search-engine, information-retrieval, solr

Relevancyfeedback

Dice.com's relevancy feedback solr plugin created by Simon Hughes (Dice). Contains request handlers for doing MLT style recommendations, conceptual search, semantic search and personalized search

Stars: ✭ 19 (-94.75%)

Mutual labels: search-engine, information-retrieval, solr

Vectorsinsearch

Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015

Stars: ✭ 71 (-80.39%)

Mutual labels: search-engine, information-retrieval, solr

Logisland

Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.

Stars: ✭ 97 (-73.2%)

Mutual labels: spark, big-data, solr

bigdata-fun

A complete (distributed) BigData stack, running in containers

Stars: ✭ 14 (-96.13%)

Mutual labels: big-data, spark, solr

Awesome Solr

A curated list of Awesome Apache Solr links and resources.

Stars: ✭ 69 (-80.94%)

Mutual labels: search, search-engine, solr

Conceptualsearch

Train a Word2Vec model or LSA model, and Implement Conceptual Search\Semantic Search in Solr\Lucene - Simon Hughes Dice.com, Dice Tech Jobs

Stars: ✭ 245 (-32.32%)

Mutual labels: search-engine, information-retrieval, solr

Rated Ranking Evaluator

Search Quality Evaluation Tool for Apache Solr & Elasticsearch search-based infrastructures

Stars: ✭ 134 (-62.98%)

Mutual labels: search, search-engine, information-retrieval

Resin

Hardware-accelerated vector-based search engine. Available as a HTTP service or as an embedded library.

Stars: ✭ 529 (+46.13%)

Mutual labels: search, search-engine, information-retrieval

Pisa

PISA: Performant Indexes and Search for Academia

Stars: ✭ 489 (+35.08%)

Mutual labels: search, search-engine, information-retrieval

Haystack

🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.

Stars: ✭ 3,409 (+841.71%)

Mutual labels: search, search-engine, information-retrieval

solr

Apache Solr open-source search software

Stars: ✭ 651 (+79.83%)

Mutual labels: search-engine, information-retrieval, solr

indieweb-search

Source code for the IndieWeb search engine.

Stars: ✭ 16 (-95.58%)

Mutual labels: search, search-engine

Minsql

High-performance log search engine.

Stars: ✭ 356 (-1.66%)

Mutual labels: search, search-engine

Darksearch

🔍 Search engine for hidden material. Scraping dark web onions, irc logs, deep web etc...

Stars: ✭ 260 (-28.18%)

Mutual labels: search, search-engine

SolrConfigExamples

Examples of Solr configuration entries for Solr plugins and Conceptual Search\Semantic Search from Simon Hughes Dice.com

Stars: ✭ 26 (-92.82%)

Mutual labels: information-retrieval, solr

Succinct

Enabling queries on compressed data.

Stars: ✭ 257 (-29.01%)

Mutual labels: spark, big-data

Searchcode Server

The official home of searchcode-server where you can run searchcode locally. Note that master is generally unstable in the sense that it is not a release. Check releases for release versions https://github.com/boyter/searchcode-server/releases

Stars: ✭ 262 (-27.62%)

Mutual labels: search, search-engine

Go Cyber

Your 🔵 Superintelligence

Stars: ✭ 270 (-25.41%)

Mutual labels: search, search-engine

Search Engine

A math-aware search engine.

Stars: ✭ 278 (-23.2%)

Mutual labels: search-engine, information-retrieval


Sparkler

A web crawler is a bot program that fetches resources from the web to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that draws on recent advances in distributed computing and information retrieval by combining various Apache projects such as Spark, Kafka, Lucene/Solr, and Tika, together with pf4j. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.

NOTE:

~~Sparkler is being proposed to the Apache Incubator. Review the proposal document and provide your suggestions here.~~ Will be done later, eventually!

Notable features of Sparkler:

Higher performance and fault tolerance: The crawl pipeline has been redesigned to take advantage of the caching and fault-tolerance capabilities of Apache Spark.
Complex, near-real-time analytics: The internal data structure is an indexed store powered by Apache Lucene that can answer complex queries in near real time. Apache Solr (standalone mode for a quick start, cloud mode to scale horizontally) exposes the crawler analytics via an HTTP API. These analytics can be visualized with charts in the admin dashboard (coming soon).
Real-time content streaming: Optionally, Apache Kafka can be configured to stream out the content as soon as it becomes available.
JavaScript rendering: Executes the JavaScript code in web pages to produce the final state of the page. Setup is easy, and the work scales by being distributed on Spark. Sessions and cookies are preserved for subsequent requests made to a host.
Extensible plugin framework: Sparkler is designed to be modular. It supports plugins that extend and customize the runtime behaviour.
Universal parser: Apache Tika, the most popular content detection and content analysis toolkit, which can deal with thousands of file formats, is used to discover links to outgoing web resources and to analyze fetched resources.

Quick Start: Running your first crawl job in minutes

To use Sparkler, install Docker and run the commands below:

# Step 0. Get this script
wget https://raw.githubusercontent.com/USCDataScience/sparkler/master/bin/dockler.sh
# Step 1. Run the script - it starts docker container and forwards ports to host
bash dockler.sh 
# Step 2. Inject seed urls
/data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job
/data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2     # id=1, top 100 URLs, do -i=2 iterations
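
When the job id, seed URL, or iteration count changes often, it can help to generate the inject and crawl commands from parameters and review them before running anything. The following is a minimal sketch, not part of Sparkler: the function name print_crawl_commands and the SPARKLER_SH variable are hypothetical, and only the flags shown in the steps above (-id, -su, -tn, -i) are used. It prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper (not part of Sparkler): prints the commands for
# one crawl job so they can be reviewed before being run inside the container.

# Path of sparkler.sh as used in the steps above; override if yours differs.
SPARKLER_SH="${SPARKLER_SH:-/data/sparkler/bin/sparkler.sh}"

# Usage: print_crawl_commands JOB_ID SEED_URL TOP_N ITERATIONS
print_crawl_commands() {
  local job_id="$1" seed_url="$2" top_n="$3" iterations="$4"
  # Step 2: inject a single seed URL for this job id.
  printf "%s inject -id %s -su '%s'\n" "$SPARKLER_SH" "$job_id" "$seed_url"
  # Step 3: crawl the top N URLs for the given number of iterations.
  printf "%s crawl -id %s -tn %s -i %s\n" "$SPARKLER_SH" "$job_id" "$top_n" "$iterations"
}
```

Calling print_crawl_commands 1 'http://www.bbc.com/news' 100 2 prints the same two commands as Steps 2 and 3; the output can then be pasted into the container shell.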

Running Sparkler with a seed URLs file:

1. Follow Steps 0-1.
2. Create a file named seed-urls.txt using the Emacs editor as follows:
       a. emacs sparkler/bin/seed-urls.txt
       b. Copy and paste your URLs
       c. Ctrl+x Ctrl+s to save
       d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]

* Note: You can also use Vim or Nano, or create the file with the command: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt

3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
$ bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job.

To crawl until there are no new URLs left, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
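
The seed file from step 2 can also be produced without an editor. The sketch below is one possible way to do it and is not part of Sparkler: the function name write_seed_file is hypothetical, and it assumes the seed file is plain text with one URL per line, as the echo example in the note suggests. It keeps only http(s) URLs and drops duplicates.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Sparkler): writes a seed file with one URL
# per line. Assumption: the seed file format is plain text, one URL per line.

# Usage: write_seed_file OUTPUT_FILE URL...
write_seed_file() {
  local out="$1"
  shift
  # Print each argument on its own line, keep only http(s) URLs,
  # then sort and remove duplicates before writing the file.
  printf '%s\n' "$@" | grep -E '^https?://' | sort -u > "$out"
}
```

For example, write_seed_file seed-urls.txt http://example1.com http://example2.com creates a file that can be passed to the inject command in step 3 via -sf seed-urls.txt.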

Access the dashboard at http://localhost:8983/banana/ (forwarded from the Docker container). It should look like the screenshot below:

[Screenshot: Sparkler dashboard]

Making Contributions:

Contact Us

Any questions or suggestions are welcome on our mailing list [email protected]. Alternatively, you may use the Slack channel to get help: http://irds.usc.edu/sparkler/#slack


Stars: ✭ 362
