apache / Nutch

Licence: apache-2.0
Apache Nutch is an extensible and scalable web crawler

Apache Nutch README

For the latest information about Nutch, please visit our website at:

https://nutch.apache.org/

and our wiki, at:

https://cwiki.apache.org/confluence/display/NUTCH/Home

To get started using Nutch, read the tutorial:

https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial

Contributing

To contribute a patch, follow these instructions (note that installing Hub is not strictly required, but is recommended).

0. Download and install hub.github.com
1. File a JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues
- you will get an issue ID of the form NUTCH-xxx, where xxx is the issue number.
2. git clone https://github.com/apache/nutch.git
3. cd nutch
4. git checkout -b NUTCH-xxx
5. edit files (please try to include a test case if possible)
6. git status (make sure it shows the files you expected to edit)
7. Make sure that your code complies with the [Nutch code formatting template](https://raw.githubusercontent.com/apache/nutch/master/eclipse-codeformat.xml), which is basically two-space indents
8. git add <files>
9. git commit -m "fix for NUTCH-xxx contributed by <your username>"
10. git fork (provided by hub)
11. git push -u <your git username> NUTCH-xxx
12. git pull-request (provided by hub)
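
Since hub is optional, the steps above can also be approximated with plain git. The following is only a dry-run sketch: it prints the commands rather than running them. The function name contribution_steps, the ISSUE and GH_USER variables, and the fork URL pattern are assumptions for illustration; it presumes you have already forked apache/nutch through the GitHub web interface.

```shell
#!/bin/sh
# Dry-run sketch: print plain-git commands for one contribution.
# Nothing here touches the network or a repository.
set -eu

# contribution_steps ISSUE GH_USER
#   ISSUE   - JIRA issue ID, e.g. NUTCH-xxx (hypothetical placeholder)
#   GH_USER - your GitHub username (hypothetical placeholder)
contribution_steps() {
  issue="$1"
  gh_user="$2"
  cat <<EOF
git clone https://github.com/apache/nutch.git
cd nutch
git checkout -b $issue
git add <files>
git commit -m "fix for $issue contributed by $gh_user"
git remote add $gh_user https://github.com/$gh_user/nutch.git
git push -u $gh_user $issue
EOF
}

contribution_steps "${ISSUE:-NUTCH-xxx}" "${GH_USER:-your-username}"
```

After the push, the pull request would be opened from the pushed branch in the GitHub web interface instead of with git pull-request.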

IDE setup

Generate Eclipse project files

ant eclipse

and follow the instructions in the Eclipse help topic "Importing existing projects".

For IntelliJ IDEA, first install the IvyIDEA plugin, then run ant eclipse.

Then open the project in IntelliJ. You may see popups like "Ant build scripts found", "Frameworks detected - IvyIDEA Framework detected". Just follow the simple steps in these dialogs.

You must configure nutch-site.xml before running. Make sure you have added the http.agent.name and plugin.folders properties. The plugin.folders property normally points to <project_root>/build/plugins.
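
As a minimal sketch of that configuration, the following shell snippet writes a nutch-site.xml containing only those two properties. The NUTCH_HOME and CONF_DIR variables, the conf/ location, and the agent name MyNutchCrawler are assumptions for illustration. Note that it overwrites any existing file at that path.

```shell
#!/bin/sh
# Sketch: write a minimal nutch-site.xml with the two required properties.
# WARNING: overwrites $CONF_DIR/nutch-site.xml if it already exists.
set -eu

# Assumed locations and values; adjust to your checkout.
NUTCH_HOME="${NUTCH_HOME:-$PWD}"            # project root (assumption)
CONF_DIR="${CONF_DIR:-$NUTCH_HOME/conf}"    # config directory (assumption)
AGENT_NAME="${AGENT_NAME:-MyNutchCrawler}"  # hypothetical crawler name

mkdir -p "$CONF_DIR"

# Unquoted heredoc so the shell variables are expanded into the XML.
cat > "$CONF_DIR/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$AGENT_NAME</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>$NUTCH_HOME/build/plugins</value>
  </property>
</configuration>
EOF

echo "Wrote $CONF_DIR/nutch-site.xml"
```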

Now create a Java Application run configuration, choose org.apache.nutch.crawl.Injector as the main class, and add two paths as program arguments: the first is the crawldb directory, and the second is the URL directory from which the injector reads URLs. Then run your configuration.
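
To make the two arguments concrete, here is a small sketch that prepares a URL directory with one seed file and prints the argument string for the run configuration. The directory names crawl/crawldb and urls, the file name seed.txt, and the seed URL are assumptions chosen for illustration. The crawldb directory is not created here, on the assumption that the injector creates it on first run.

```shell
#!/bin/sh
# Sketch: prepare the arguments for org.apache.nutch.crawl.Injector.
set -eu

CRAWLDB_DIR="${CRAWLDB_DIR:-crawl/crawldb}"  # first argument (assumed path)
URL_DIR="${URL_DIR:-urls}"                   # second argument (assumed path)

# The URL directory holds plain-text files with one URL per line.
mkdir -p "$URL_DIR"
printf '%s\n' 'https://nutch.apache.org/' > "$URL_DIR/seed.txt"

# Paste this into the "Program arguments" field of the run configuration.
echo "Program arguments: $CRAWLDB_DIR $URL_DIR"
```

Outside the IDE, the same step is normally run through the nutch launcher script as bin/nutch inject <crawldb> <url_dir>, assuming a built runtime is available.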

If you still see the error No plugins found on paths of property plugin.folders="plugins", you can update plugin.folders in nutch-default.xml. This works as a quick fix, but it is not recommended.

Export Control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See https://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-tika plugin for extracting textual content and metadata from encrypted PDF files. See https://pdfbox.apache.org/ for more details on PDFBox.
