
tokenmill / crawling-framework

Licence: other
Easily crawl news portals or blog sites using Storm Crawler.

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to crawling-framework

Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+3486.36%)
Mutual labels:  scraping, crawling
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+7754.55%)
Mutual labels:  scraping, crawling
Grawler
Grawler is a tool written in PHP which comes with a web interface that automates the task of using google dorks, scrapes the results, and stores them in a file.
Stars: ✭ 98 (+345.45%)
Mutual labels:  scraping, crawling
scrapy-fieldstats
A Scrapy extension to log items coverage when the spider shuts down
Stars: ✭ 17 (-22.73%)
Mutual labels:  scraping, crawling
Memorious
Distributed crawling framework for documents and structured data.
Stars: ✭ 248 (+1027.27%)
Mutual labels:  scraping, crawling
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+2550%)
Mutual labels:  scraping, crawling
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+192368.18%)
Mutual labels:  scraping, crawling
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+1900%)
Mutual labels:  scraping, crawling
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+70513.64%)
Mutual labels:  scraping, crawling
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Stars: ✭ 198 (+800%)
Mutual labels:  scraping, crawling
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+23213.64%)
Mutual labels:  scraping, crawling
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+459.09%)
Mutual labels:  scraping, crawling
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+21886.36%)
Mutual labels:  scraping, crawling
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (+140.91%)
Mutual labels:  scraping, crawling
Dataflowkit
Extract structured data from web sites. Web sites scraping.
Stars: ✭ 456 (+1972.73%)
Mutual labels:  scraping, crawling
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. It is designed like other strong crawler libraries such as WebMagic and Scrapy, but can be extended to meet your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (+354.55%)
Mutual labels:  scraping, crawling
Sasila
A flexible, friendly crawler framework.
Stars: ✭ 286 (+1200%)
Mutual labels:  scraping, crawling
Spidermon
Scrapy Extension for monitoring spiders execution.
Stars: ✭ 309 (+1304.55%)
Mutual labels:  scraping, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+677.27%)
Mutual labels:  scraping, crawling
scrape-github-trending
Tutorial for web scraping / crawling with Node.js.
Stars: ✭ 42 (+90.91%)
Mutual labels:  scraping, crawling

Crawling Framework


Crawling Framework provides tools to configure and run a crawler based on Storm Crawler. It is mainly intended to ease crawling of sites that publish article content, such as news portals and blogs. With the GUI tool that Crawling Framework provides, you can:

  1. Specify which sites to crawl.
  2. Configure URL inclusion and exclusion filters, thus controlling which sections of the site will be fetched.
  3. Specify which elements of the page provide information about article publication name, its title and main body.
  4. Define tests which validate that extraction rules are working.
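
As an illustration of item 2, URL inclusion and exclusion filters can be thought of as two lists of regular expressions. The sketch below is not the framework's implementation. The class name `UrlFilterSketch`, the method `accepts`, and the example patterns are all hypothetical. It assumes that an exclusion match always wins, and that an empty inclusion list means "include everything".

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only: not part of Crawling Framework or Storm Crawler.
public class UrlFilterSketch {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    public UrlFilterSketch(List<String> includes, List<String> excludes) {
        this.includes = includes.stream().map(Pattern::compile).collect(Collectors.toList());
        this.excludes = excludes.stream().map(Pattern::compile).collect(Collectors.toList());
    }

    // A URL is accepted when no exclusion pattern matches it and,
    // if any inclusion patterns are defined, at least one of them matches.
    public boolean accepts(String url) {
        for (Pattern p : excludes) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        if (includes.isEmpty()) {
            return true;
        }
        for (Pattern p : includes) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed example: crawl only the news section, skipping tag pages and images.
        UrlFilterSketch filter = new UrlFilterSketch(
                Arrays.asList("^https://example\\.com/news/"),
                Arrays.asList("/tag/", "\\.(jpg|png)$"));

        for (String url : Arrays.asList(
                "https://example.com/news/2019/story.html",
                "https://example.com/news/tag/politics",
                "https://example.com/about")) {
            System.out.println(url + " accepted: " + filter.accepts(url));
        }
    }
}
```

Checking exclusions first keeps the rule easy to reason about: a section can be included broadly and then trimmed with a few narrow exclusion patterns.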

Once configuration is done, Crawling Framework runs a Storm Crawler based crawl that follows the rules specified in the configuration.

Introduction

We have recorded a video on how to set up and use Crawling Framework. Click on the image below to watch it on YouTube.

Crawling Framework Intro

Requirements

Crawling Framework writes its configuration and stores crawled data in ElasticSearch. Before starting a crawl project, install ElasticSearch (Crawling Framework is tested to work with Elastic v7.x).

Crawling Framework is a Java library that has to be extended to run a Storm Crawler topology, so a Java toolchain (JDK 8 and Maven) is required.

Using password-protected ElasticSearch

Some providers put ElasticSearch behind an authentication step (which makes sense). Set the environment variables ES_USERNAME and ES_PASSWORD accordingly; everything else can remain the same. Authentication is performed implicitly when both credentials are present.
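
To make the "implicit" behaviour concrete, here is a minimal sketch of how a client could pick up the two variables and authenticate only when both are set. It is not the framework's actual code. The class `EsCredentialsSketch` and the method `basicAuthHeader` are hypothetical, and the use of HTTP Basic authentication is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

// Hypothetical illustration only: the framework's real credential handling may differ.
public class EsCredentialsSketch {

    // Returns an HTTP Basic "Authorization" header value when both credentials
    // are present, or an empty Optional when either one is missing.
    public static Optional<String> basicAuthHeader(String username, String password) {
        if (username == null || username.isEmpty()
                || password == null || password.isEmpty()) {
            return Optional.empty();
        }
        String pair = username + ":" + password;
        String token = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return Optional.of("Basic " + token);
    }

    public static void main(String[] args) {
        // ES_USERNAME and ES_PASSWORD are the variable names given above.
        Optional<String> header = basicAuthHeader(
                System.getenv("ES_USERNAME"),
                System.getenv("ES_PASSWORD"));

        if (header.isPresent()) {
            System.out.println("Credentials found; requests would carry an Authorization header.");
        } else {
            System.out.println("No credentials set; connecting without authentication.");
        }
    }
}
```

Before launching the crawler, the variables would typically be exported in the shell that starts the JVM, so that `System.getenv` can see them.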

Configuring and Running a crawl

See the documentation of the Crawling Framework Example project.

License

Copyright © 2017-2019 TokenMill UAB.

Distributed under the Apache License, Version 2.0.
