580 open source projects that are alternatives to or similar to Nutch

GooglePlay-Web-Crawler
MapReduce project using Hadoop, Nutch, AWS EMR, Pig, Tez, Hive
Stars: ✭ 18 (-99.21%)
Mutual labels:  hadoop, nutch
Mimo-Crawler
A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.
Stars: ✭ 22 (-99.03%)
Mutual labels:  web-crawler, crawling
Hive
Apache Hive
Stars: ✭ 4,031 (+77.03%)
Mutual labels:  hadoop, apache
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Stars: ✭ 198 (-91.3%)
Mutual labels:  crawling, web-crawler
implyr
SQL backend to dplyr for Impala
Stars: ✭ 74 (-96.75%)
Mutual labels:  hadoop, apache
hive-bigquery-storage-handler
Hive Storage Handler for interoperability between BigQuery and Apache Hive
Stars: ✭ 16 (-99.3%)
Mutual labels:  hadoop, apache
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (-97.89%)
Mutual labels:  web-crawler, crawling
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-93.41%)
Mutual labels:  hadoop, apache
EngineeringTeam
A place for organizing the YBIGTA engineering team's materials.
Stars: ✭ 41 (-98.2%)
Mutual labels:  hadoop, crawling
yarn-prometheus-exporter
Export Hadoop YARN (resource-manager) metrics in prometheus format
Stars: ✭ 44 (-98.07%)
Mutual labels:  hadoop, apache
Spidy
The simple, easy to use command line web crawler.
Stars: ✭ 257 (-88.71%)
Mutual labels:  crawling, web-crawler
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-87.83%)
Mutual labels:  crawling, web-crawler
hadoop-data-ingestion-tool
OLAP and ETL of Big Data
Stars: ✭ 17 (-99.25%)
Mutual labels:  hadoop, apache
hive-jdbc-driver
An alternative to the "hive standalone" jar for connecting Java applications to Apache Hive via JDBC
Stars: ✭ 31 (-98.64%)
Mutual labels:  hadoop, apache
Tez
Apache Tez
Stars: ✭ 313 (-86.25%)
Mutual labels:  hadoop, apache
Hive Jdbc Uber Jar
Hive JDBC "uber" or "standalone" jar based on the latest Apache Hive version
Stars: ✭ 188 (-91.74%)
Mutual labels:  hadoop, apache
Owasp Mth3l3m3nt Framework
OWASP Mth3l3m3nt Framework is a penetration testing aiding tool and exploitation framework. It fosters a principle of attack the web using the web as well as pentest on the go through its responsive interface.
Stars: ✭ 139 (-93.9%)
Mutual labels:  apache
Gobblin
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Stars: ✭ 2,006 (-11.9%)
Mutual labels:  apache
Eel Sdk
Big Data Toolkit for the JVM
Stars: ✭ 140 (-93.85%)
Mutual labels:  hadoop
Xlearning
AI on Hadoop
Stars: ✭ 1,709 (-24.95%)
Mutual labels:  hadoop
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-92.49%)
Mutual labels:  crawling
Htaccess
✂A collection of useful .htaccess snippets.
Stars: ✭ 11,830 (+419.54%)
Mutual labels:  apache
Newspaper
News, full-text, and article metadata extraction in Python 3. Advanced docs:
Stars: ✭ 11,545 (+407.03%)
Mutual labels:  crawling
Aliyun Emapreduce Datasources
Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.
Stars: ✭ 132 (-94.2%)
Mutual labels:  hadoop
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (-13.88%)
Mutual labels:  web-crawler
Collector Http
Norconex HTTP Collector is a flexible web crawler for collecting, parsing, and manipulating data from the Internet (or Intranet) to various data repositories such as search engines.
Stars: ✭ 130 (-94.29%)
Mutual labels:  web-crawler
Massivedl
Download a large list of files concurrently
Stars: ✭ 141 (-93.81%)
Mutual labels:  crawling
Geode
Apache Geode
Stars: ✭ 2,016 (-11.46%)
Mutual labels:  apache
Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (-93.85%)
Mutual labels:  apache
Apache exporter
Prometheus exporter for Apache.
Stars: ✭ 172 (-92.45%)
Mutual labels:  apache
Instagram Bot
An Instagram bot developed using the Selenium Framework
Stars: ✭ 138 (-93.94%)
Mutual labels:  crawling
Presto
The official home of the Presto distributed SQL query engine for big data
Stars: ✭ 12,957 (+469.04%)
Mutual labels:  hadoop
Hbaseclient
HBase client data management software
Stars: ✭ 135 (-94.07%)
Mutual labels:  hadoop
N2h4
A tool for collecting Naver news
Stars: ✭ 177 (-92.23%)
Mutual labels:  crawling
Beyond Jupyter
🐍💻📊 All material from the PyCon.DE 2018 Talk "Beyond Jupyter Notebooks - Building your own data science platform with Python & Docker" (incl. Slides, Video, Udemy MOOC & other References)
Stars: ✭ 135 (-94.07%)
Mutual labels:  apache
Holiday Cn
📅🇨🇳 Chinese statutory holiday data, automatically scraped daily from State Council announcements
Stars: ✭ 157 (-93.1%)
Mutual labels:  crawling
Mod auth cas
An Apache httpd module for integrating with Apereo CAS Server project.
Stars: ✭ 130 (-94.29%)
Mutual labels:  apache
Htconvert
Convert .htaccess redirects to nginx.conf redirects
Stars: ✭ 171 (-92.49%)
Mutual labels:  apache
Hadoop Common
Mirror of Apache Hadoop common
Stars: ✭ 155 (-93.19%)
Mutual labels:  hadoop
Calcite Avatica
Mirror of Apache Calcite - Avatica
Stars: ✭ 130 (-94.29%)
Mutual labels:  hadoop
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (-27.89%)
Mutual labels:  hadoop
Airflow Pipeline
An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR
Stars: ✭ 128 (-94.38%)
Mutual labels:  hadoop
Goaccess
GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
Stars: ✭ 14,096 (+519.06%)
Mutual labels:  apache
Bigdata Playground
A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (-92.23%)
Mutual labels:  hadoop
Deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learni…
Stars: ✭ 12,277 (+439.17%)
Mutual labels:  hadoop
Movie recommend
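A Spark-based movie recommendation system, including a crawler project, a web site, an admin back-end system, and the Spark recommendation system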
Stars: ✭ 2,092 (-8.12%)
Mutual labels:  hadoop
Serverpilot Letsencrypt
Automate the installation of Let's Encrypt SSL on the free plan of ServerPilot
Stars: ✭ 129 (-94.33%)
Mutual labels:  apache
Spydra
Ephemeral Hadoop clusters using Google Compute Platform
Stars: ✭ 128 (-94.38%)
Mutual labels:  hadoop
Correios
A client library for Brazilian Correios APIs and services (SIGEP & SRO).
Stars: ✭ 153 (-93.28%)
Mutual labels:  apache
Griffon Vm
Griffon Data Science Virtual Machine
Stars: ✭ 128 (-94.38%)
Mutual labels:  hadoop
Bhban rpa
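Example code for the book "Work Automation: Finish Six Months of Work in One Day" (생능출판사, 2020). The examples are aimed at people who have never learned Python and cover a range of work-automation topics, from Excel to design, macros, and crawling.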
Stars: ✭ 124 (-94.55%)
Mutual labels:  crawling
Qpid Proton
Mirror of Apache Qpid Proton
Stars: ✭ 164 (-92.8%)
Mutual labels:  apache
Hadoop Hdfs
Mirror of Apache Hadoop HDFS
Stars: ✭ 152 (-93.32%)
Mutual labels:  hadoop
Newznab Tmux
Laravel based usenet indexer
Stars: ✭ 127 (-94.42%)
Mutual labels:  apache
Corpuscrawler
Crawler for linguistic corpora
Stars: ✭ 127 (-94.42%)
Mutual labels:  crawling
Hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (-94.47%)
Mutual labels:  hadoop
Parquet4s
Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
Stars: ✭ 125 (-94.51%)
Mutual labels:  hadoop
Guacamole Install Rhel 7
Apache Guacamole installation bash script for RHEL 7 and CentOS 7 including options for Nginx, HTTPS, SSL, LDAP, Let's Encrypt certificates and more
Stars: ✭ 174 (-92.36%)
Mutual labels:  apache
Big Whale
Scheduling of offline jobs (Spark, Flink, etc.) and monitoring of real-time jobs
Stars: ✭ 163 (-92.84%)
Mutual labels:  hadoop
Awesome Web Scraper
A collection of awesome web scrapers and crawlers.
Stars: ✭ 147 (-93.54%)
Mutual labels:  web-crawler
1-60 of 580 similar projects