
USCDataScience / Sparkler

Licence: apache-2.0
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Sparkler

A web crawler is a bot program that fetches resources from the web in order to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that builds on recent advances in distributed computing and information retrieval by combining several Apache projects, including Spark, Kafka, Lucene/Solr, and Tika, along with pf4j. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.

NOTE:

Sparkler is to be proposed to the Apache Incubator; this will be done later, eventually. Review the proposal document and provide your suggestions.

Notable features of Sparkler:

  • Provides higher performance and fault tolerance: The crawl pipeline has been redesigned to take advantage of the caching and fault tolerance capabilities of Apache Spark.
  • Supports complex and near real-time analytics: The internal data structure is an indexed store powered by Apache Lucene that can answer complex queries in near real time. Apache Solr (standalone mode for a quick start, cloud mode to scale horizontally) is used to expose the crawler analytics via an HTTP API. These analytics can be visualized using intuitive charts in the admin dashboard (coming soon).
  • Streams out the content in real time: Optionally, Apache Kafka can be configured to retrieve the output content as and when it becomes available.
  • JavaScript rendering: Executes the JavaScript code in webpages to create the final state of the page. The setup is easy and painless, and it scales by distributing the work on Spark. It preserves the sessions and cookies for subsequent requests made to a host.
  • Extensible plugin framework: Sparkler is designed to be modular. It supports plugins to extend and customize the runtime behaviour.
  • Universal parser: Apache Tika, the most popular content detection and content analysis toolkit, which can deal with thousands of file formats, is used to discover links to outgoing web resources and to analyze fetched resources.
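
Because the crawler analytics are exposed through Solr's HTTP API, they can be queried with any HTTP client. The sketch below builds a standard Solr facet query that counts documents per value of a field. The core name `crawldb` and the field name `status` are assumptions that are not stated in this README, and `solr_facet_url` is a hypothetical helper, so check the Solr admin UI for the actual names.

```shell
# Hypothetical helper (not part of Sparkler): builds a Solr facet-query URL.
# Arguments: Solr base URL, core name, field to facet on.
# rows=0 asks Solr for the facet counts only, without the documents themselves.
solr_facet_url() {
  printf '%s/solr/%s/select?q=*:*&rows=0&wt=json&facet=true&facet.field=%s\n' "$1" "$2" "$3"
}

# Example (ASSUMED core "crawldb" and field "status"; port 8983 is the one
# forwarded for the dashboard):
# curl -s "$(solr_facet_url http://localhost:8983 crawldb status)"
```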

Quick Start: Running your first crawl job in minutes

To use Sparkler, install Docker and run the commands below:

# Step 0. Get this script
wget https://raw.githubusercontent.com/USCDataScience/sparkler/master/bin/dockler.sh
# Step 1. Run the script - it starts docker container and forwards ports to host
bash dockler.sh 
# Step 2. Inject seed urls
/data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job
/data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2     # id=1, top 100 URLs, do -i=2 iterations
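
Steps 2 and 3 can be wrapped in a small helper so that the job id and seed URL are set in one place. This is only a sketch: `run`, `sparkler_quickstart`, and the `DRY_RUN` variable are hypothetical names, and the only Sparkler flags used are the ones shown above (`-id`, `-su`, `-tn`, `-i`).

```shell
# Hypothetical helper (not part of Sparkler): wraps the inject and crawl steps.
SPARKLER="${SPARKLER:-/data/sparkler/bin/sparkler.sh}"

# With DRY_RUN=1 the command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Arguments: job id, seed URL, top-N URLs (default 100), iterations (default 2).
# The crawl is started only if the inject step succeeds.
sparkler_quickstart() (
  id="$1"; seed="$2"; top_n="${3:-100}"; iterations="${4:-2}"
  run "$SPARKLER" inject -id "$id" -su "$seed" &&
  run "$SPARKLER" crawl -id "$id" -tn "$top_n" -i "$iterations"
)

# Example (inside the container started by dockler.sh):
# sparkler_quickstart 1 'http://www.bbc.com/news'
```

Setting `DRY_RUN=1` before the call is a cheap way to check the generated commands before starting a real crawl.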

Running Sparkler with a seed URLs file:

1. Follow Steps 0-1
2. Create a file named seed-urls.txt using the Emacs editor as follows:
       a. emacs sparkler/bin/seed-urls.txt
       b. copy and paste your URLs
       c. Ctrl+x Ctrl+s to save
       d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]

* Note: You can also use the Vim or Nano editors, or run: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt

3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
$ bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job.
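
The seed file can also be generated from a script. The sketch below uses `printf` instead of `echo -e`, whose handling of `\n` differs between shells, and it drops blank lines, non-HTTP entries, and duplicates. The function name `write_seed_file` is hypothetical, and the one-URL-per-line layout is an assumption taken from the `echo` example in the note above.

```shell
# Hypothetical helper (not part of Sparkler): writes one URL per line.
# Arguments: output file, then any number of URLs.
write_seed_file() (
  out="$1"; shift
  # Keep only http(s) URLs, drop duplicates, and preserve the original order.
  printf '%s\n' "$@" | grep -E '^https?://' | awk '!seen[$0]++' > "$out"
)

# Example (from the sparkler/bin directory):
# write_seed_file seed-urls.txt 'http://example1.com' 'http://example2.com'
# bash sparkler.sh inject -id 1 -sf seed-urls.txt
```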

To crawl until there are no new URLs left, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1

Access the dashboard at http://localhost:8983/banana/ (forwarded from the Docker container). The dashboard should look like the one below:

Dashboard
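
The dashboard may take a little while to respond after the container starts. A minimal sketch for polling the URL until it answers, assuming `curl` is installed on the host; `wait_for_url` is a hypothetical helper and the retry defaults are arbitrary.

```shell
# Hypothetical helper (not part of Sparkler): polls a URL until it responds.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 2).
wait_for_url() (
  url="$1"; tries="${2:-30}"; delay="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: silent, -f: fail on HTTP errors (>= 400), -o /dev/null: discard body
    if curl -sf -o /dev/null "$url"; then
      exit 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  exit 1
)

# Example:
# wait_for_url http://localhost:8983/banana/ && echo "dashboard is up"
```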

Making Contributions:

Contact Us

Questions and suggestions are welcome on our mailing list [email protected]. Alternatively, you can get help on the Slack channel: http://irds.usc.edu/sparkler/#slack
