
VIDA-NYU / Ache

License: Apache-2.0
ACHE is a web crawler for domain-specific search.

Programming Languages

Java
68,154 projects - #9 most used programming language

Projects that are alternatives of or similar to Ache

OLX Scraper
πŸ“» An OLX Scraper using Scrapy + MongoDB. It Scrapes recent ads posted regarding requested product and dumps to NOSQL MONGODB.
Stars: ✭ 15 (-95.31%)
Mutual labels:  web-crawler, web-scraping
Pulsar
Turn large websites into tables and charts using simple SQL.
Stars: ✭ 100 (-68.75%)
Mutual labels:  web-scraping, web-crawler
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+105%)
Mutual labels:  web-scraping, web-crawler
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-13.44%)
Mutual labels:  web-scraping, web-crawler
ComicBookMaker
Script to fetch webcomics and use them to create ebooks.
Stars: ✭ 27 (-91.56%)
Mutual labels:  web-crawler
Movie-Recommendation-System-with-Sentiment-Analysis
Content based movie recommendation system with sentiment analysis
Stars: ✭ 44 (-86.25%)
Mutual labels:  web-scraping
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (-85%)
Mutual labels:  web-crawler
PaperScraper
A web scraping tool to systematically extract the text of scientific papers and corresponding metadata from university accessible journals.
Stars: ✭ 63 (-80.31%)
Mutual labels:  web-scraping
Basketball reference web scraper
NBA Stats API via Basketball Reference
Stars: ✭ 279 (-12.81%)
Mutual labels:  web-scraping
Php Curl Class
PHP Curl Class makes it easy to send HTTP requests and integrate with web APIs
Stars: ✭ 2,903 (+807.19%)
Mutual labels:  web-scraping
comic-scraper
[Python] Scrapes comics and manga from various websites and creates cbz files from them
Stars: ✭ 16 (-95%)
Mutual labels:  web-scraping
comp thinking social science
Computational Thinking for Social Scientists book project
Stars: ✭ 42 (-86.87%)
Mutual labels:  web-scraping
UnChain
A tool to find redirection chains in multiple URLs
Stars: ✭ 77 (-75.94%)
Mutual labels:  web-crawler
article-summary-deep-learning
πŸ“– Using deep learning and scraping to analyze/summarize articles! Just drop in any URL!
Stars: ✭ 18 (-94.37%)
Mutual labels:  web-scraping
Apify Js
Apify SDK β€” The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+885.63%)
Mutual labels:  web-scraping
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-95.31%)
Mutual labels:  web-scraping
Stock-Fundamental-data-scraping-and-analysis
Project on building a web crawler to collect stock fundamentals and review their performance in one go
Stars: ✭ 40 (-87.5%)
Mutual labels:  web-scraping
Spidy
The simple, easy-to-use command-line web crawler.
Stars: ✭ 257 (-19.69%)
Mutual labels:  web-crawler
Text-Analysis
Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.
Stars: ✭ 48 (-85%)
Mutual labels:  web-scraping
Competitive Programming Score API
API to get user details for competitive coding platforms - Codeforces, Codechef, SPOJ, Interviewbit
Stars: ✭ 118 (-63.12%)
Mutual labels:  web-scraping


ACHE Focused Crawler

ACHE is a focused web crawler. It collects web pages that satisfy some specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern. ACHE differs from generic crawlers in the sense that it uses page classifiers to distinguish between relevant and irrelevant pages in a given domain. A page classifier can range from a simple regular expression (that matches every page containing a specific word, for example) to a machine-learning-based classification model. ACHE can also automatically learn how to prioritize links in order to efficiently locate relevant content while avoiding the retrieval of irrelevant content.

ACHE supports many features, such as:

  • Regular crawling of a fixed list of web sites
  • Discovery and crawling of new relevant web sites through automatic link prioritization
  • Configuration of different types of page classifiers (machine learning, regex, etc.)
  • Continuous re-crawling of sitemaps to discover new pages
  • Indexing of crawled pages using Elasticsearch
  • Web interface for searching crawled pages in real-time
  • REST API and web-based user interface for crawler monitoring
  • Crawling of hidden services using Tor proxies

License

Starting from version 0.11.0, ACHE is licensed under Apache 2.0. Previous versions were licensed under the GNU GPL.

Documentation

More information is available in the project's documentation.

Installation

You can build ACHE from the source code, download the executable binary using Conda, or use Docker to build an image and run ACHE in a container.

Build from source with Gradle

Prerequisite: You will need a recent version of Java installed (JDK 8 or later).

To build ACHE from source, you can run the following commands in your terminal:

git clone https://github.com/ViDA-NYU/ache.git
cd ache
./gradlew installDist

which will generate an installation package under ache/build/install/. You can then make the ache command available in the terminal by adding the ACHE binaries to the PATH environment variable:

export ACHE_HOME="{path-to-cloned-ache-repository}/ache/build/install/ache"
export PATH="$ACHE_HOME/bin:$PATH"

Running using Docker

Prerequisite: You will need to install a recent version of Docker. See https://docs.docker.com/engine/installation/ for details on how to install Docker for your platform.

We publish pre-built Docker images on Docker Hub for each released version. You can run the latest image using:

docker run -p 8080:8080 vidanyu/ache:latest

Alternatively, you can build the image yourself and run it:

git clone https://github.com/ViDA-NYU/ache.git
cd ache
docker build -t ache .
docker run -p 8080:8080 ache

The Dockerfile exposes two data volumes so that you can mount a directory with your configuration files (at /config) and preserve the crawler's stored data (at /data) after the container stops.
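For example, a container run that mounts both volumes from the current directory might look like the following sketch (the seeds.txt file name and the config directory layout are assumptions; substitute your own paths):

docker run -p 8080:8080 \
  -v $PWD/config:/config \
  -v $PWD/data:/data \
  vidanyu/ache startCrawl -c /config/ -s /config/seeds.txt -o /data/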

Download with Conda

Prerequisite: You need to have the Conda package manager installed on your system.

If you use Conda, you can install ache from Anaconda Cloud by running:

conda install -c vida-nyu ache

NOTE: Only tagged release versions are published to Anaconda Cloud, so the version available through Conda may not be up to date. If you want to try the most recent version, please clone the repository and build from source, or use the Docker version.

Running ACHE

Before starting a crawl, you need to create a configuration file named ache.yml. We provide some configuration samples in the repository's config directory that can help you get started.
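For reference, a minimal ache.yml can consist of just a few flattened keys. The key names below are taken from the repository's sample configs but may vary across ACHE versions, so treat them as assumptions and verify against the samples:

# Store only pages accepted by the page classifier, discarding negative pages
target_storage.use_classifier: true
target_storage.store_negative_pages: false
# Follow outlinks discovered on crawled pages, with a per-domain page limit
link_storage.link_strategy.outlinks: true
link_storage.max_pages_per_domain: 100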

You will also need a page classifier configuration file named pageclassifier.yml. For details on how to configure a page classifier, refer to the page classifiers documentation.
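As a quick illustration, a regex-based classifier can be configured in a few lines. This sketch follows the title_regex example from the documentation; the type name and parameter key should be verified there, and the expression itself is hypothetical:

# pageclassifier.yml: accept pages whose title matches the regular expression
type: title_regex
parameters:
  regular_expression: ".*(machine|deep) learning.*"

Other classifier types (machine-learning models, for instance) use the same file with a different type and different parameters.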

After you have configured a classifier, the last thing you will need is a seed file, i.e., a plain-text file containing one URL per line. The crawler will use these URLs to bootstrap the crawl.
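A seed file is nothing more than a list of URLs, one per line, for example (hypothetical entries):

https://en.wikipedia.org/wiki/Web_crawler
https://example.com/blog/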

Finally, you can start the crawler using the following command:

ache startCrawl -o <data-output-path> -c <config-path> -s <seed-file> -m <model-path>

where:

  • <config-path> is the path to the config directory that contains ache.yml.
  • <seed-file> is the path to the seed file that contains the seed URLs.
  • <model-path> is the path to the model directory that contains the file pageclassifier.yml.
  • <data-output-path> is the path to the data output directory.

Example of running ACHE using the sample pre-trained page classifier model and the sample seeds file available in the repository:

ache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model

The crawler will run and print its logs to the console. Hit Ctrl+C at any time to stop it (shutting down may take some time). For long crawls, you should run ACHE in the background using a tool such as nohup.
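A background run, assuming the sample configuration above, might look like this sketch (the log file name is arbitrary):

nohup ache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model > crawler.log 2>&1 &

You can then follow the crawl's progress with tail -f crawler.log, or through the monitoring web interface mentioned above (reachable on port 8080 unless configured otherwise).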

Data Formats

ACHE can output data in multiple formats. The data formats currently available are:

  • FILES (default) - raw content and metadata are stored in rolling compressed files of fixed size.
  • ELASTICSEARCH - raw content and metadata are indexed in an Elasticsearch index.
  • KAFKA - pushes raw content and metadata to an Apache Kafka topic.
  • WARC - stores data using the standard Web ARChive (WARC) format used by the Internet Archive and Common Crawl.
  • FILESYSTEM_HTML - only the raw page content is stored, in plain-text files.
  • FILESYSTEM_JSON - raw content and metadata are stored in JSON format in files.
  • FILESYSTEM_CBOR - raw content and some metadata are stored in CBOR format in files.

For more details on how to configure data formats, see the data formats documentation page.
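As an illustration, switching the output to Elasticsearch could look like the ache.yml snippet below. The exact key names vary across ACHE versions, so treat these as assumptions and check them against the data formats documentation page:

# Index crawled pages into a local Elasticsearch node (illustrative key names)
target_storage.data_format.type: ELASTICSEARCH
target_storage.data_format.elasticsearch.rest.hosts: ["http://localhost:9200"]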

Bug Reports and Questions

We welcome user feedback. Please submit any suggestions, questions, or bug reports using the GitHub issue tracker.

We also have a chat room on Gitter.

Contributing

Code contributions are welcome. We use a code style derived from the Google Style Guide, but with 4 spaces for indentation. An Eclipse Formatter configuration file is available in the repository.

Contact

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].