
istresearch / Scrapy Cluster

License: MIT

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives to or similar to Scrapy Cluster

Scrapy Redis
Redis-based components for Scrapy.
Stars: ✭ 4,998 (+442.67%)
Mutual labels:  scrapy, redis, distributed
Jeesuite Libs
A distributed architecture development kit, including caching (two-level caches, automatic cache management), queues, distributed scheduled tasks, file services (Qiniu, Aliyun OSS, fastDFS), logging, search, distributed locks, distributed transactions, Dubbo integration, Spring Boot support, and common utility packages.
Stars: ✭ 584 (-36.59%)
Mutual labels:  kafka, redis, distributed
Haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
Stars: ✭ 4,993 (+442.13%)
Mutual labels:  scrapy, redis, distributed
Dataengineeringproject
Example end-to-end data engineering project.
Stars: ✭ 82 (-91.1%)
Mutual labels:  kafka, scraping, redis
Java Study
java-study is code I recorded while learning Java! It covers Java basics such as data types, JDK 1.8 lambdas, streams and date handling, I/O streams, collections, multithreading, concurrent programming, sample code for the 23 design patterns, common utility classes, and frequently used frameworks such as Netty, Mina, Spring Boot, Kafka, Storm, ZooKeeper, Redis, Elasticsearch, HBase, Hive, and more.
Stars: ✭ 571 (-38%)
Mutual labels:  kafka, redis
Dnc
dnc: a decentralized, open source community light alliance. dncto.com, QQ group 779699538
Stars: ✭ 551 (-40.17%)
Mutual labels:  kafka, redis
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (-36.7%)
Mutual labels:  scrapy, scraping
Go Streams
A lightweight stream processing library for Go
Stars: ✭ 615 (-33.22%)
Mutual labels:  kafka, redis
Python Spider
Douban Top 250 movies; scraping JSON data from Douyu and scraping photo galleries; Taobao; Youyuan; using CrawlSpider to scrape basic profile data from the Hongniang matchmaking site, plus distributed crawling of Hongniang with storage in Redis; small crawler demos; Selenium; scraping Duodian; building APIs with Django; scraping Youyuan listings; simulated logins to Zhihu, GitHub, and Tuchong; full-site scraping of the Duodian mall; scraping the article history of WeChat official accounts; scraping articles shared in WeChat groups or by WeChat friends; and using itchat to monitor articles shared by specified WeChat official accounts.
Stars: ✭ 615 (-33.22%)
Mutual labels:  scrapy, redis
Freestyle
A cohesive & pragmatic framework of FP-centric Scala libraries
Stars: ✭ 627 (-31.92%)
Mutual labels:  kafka, redis
Stream Reactor
Streaming reference architecture for ETL with Kafka and Kafka Connect. See http://lenses.io for more on how we provide a unified solution to manage your connectors, the most advanced SQL engine for Kafka and Kafka Streams, cluster monitoring and alerting, and more.
Stars: ✭ 753 (-18.24%)
Mutual labels:  kafka, redis
Arq
Fast job queuing and RPC in Python with asyncio and Redis.
Stars: ✭ 695 (-24.54%)
Mutual labels:  redis, distributed
Funpyspidersearchengine
Word2vec-based personalized search + Scrapy 2.3.0 (data crawling) + Elasticsearch 7.9.1 (data storage with an external RESTful API) + Django 3.1.1 search frontend
Stars: ✭ 782 (-15.09%)
Mutual labels:  scrapy, redis
Netdiscovery
NetDiscovery is a general-purpose crawler framework/middleware built on frameworks such as Vert.x and RxJava 2.
Stars: ✭ 573 (-37.79%)
Mutual labels:  kafka, redis
Javakeeper
✍️ A summary of the architecture knowledge every Java engineer needs: distributed systems, microservices, RPC, and other architectures commonly used at internet companies, plus essential skills such as data storage, caching, and search.
Stars: ✭ 502 (-45.49%)
Mutual labels:  kafka, redis
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (-49.62%)
Mutual labels:  scrapy, scraping
Springbootexamples
Spring Boot learning tutorials
Stars: ✭ 794 (-13.79%)
Mutual labels:  kafka, redis
Spring Boot Study
Hands-on Spring Boot framework examples (updated to Spring Boot 2): basic usage, REST, controllers, event listeners, MySQL database connections, JPA, Redis integration, MyBatis integration (declarative and XML styles, each with CRUD examples), logging, devtools configuration, interceptors, reading resource configuration, test integration, web-layer request mapping, Security authentication, RabbitMQ integration, Kafka integration, a distributed ID generator, and more. Real-world project: https://github.com/hemin1003/yfax-parent, already in production use.
Stars: ✭ 440 (-52.23%)
Mutual labels:  kafka, redis
Testcontainers Spring Boot
Container auto-configurations for spring-boot based integration tests
Stars: ✭ 460 (-50.05%)
Mutual labels:  kafka, redis
Redlock Php
Redis distributed locks in PHP
Stars: ✭ 651 (-29.32%)
Mutual labels:  redis, distributed

Scrapy Cluster

Join the chat at https://gitter.im/istresearch/scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster.

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.
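
Concretely, a crawl request is a small JSON document naming the page to fetch, the submitting application, and the crawl job; this is the same message shape used by the feed example in the test environment section below:

{"url": "http://dmoztools.net", "appid": "testapp", "crawlid": "abc123"}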

Dependencies

Please see the requirements.txt within each sub-project for pip package dependencies.

Other important components required to run the cluster include Redis, Zookeeper, and Kafka.

Core Concepts

This project brings a number of new concepts to Scrapy and to large-scale distributed crawling in general. Some highlights include:

  • The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
  • Scale Scrapy instances across a single machine or multiple machines
  • Coordinate and prioritize their scraping effort for desired sites
  • Persist data across scraping jobs
  • Execute multiple scraping jobs concurrently
  • Allows for in-depth access to information about your scraping job, what is upcoming, and how the sites are ranked
  • Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
  • Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results); see the sketch after this list
  • Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP address
  • Enables completely different spiders to yield crawl requests to each other, giving flexibility in how the crawl job is tackled
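
To make the Kafka data-bus idea concrete, here is a minimal sketch of submitting a crawl job as a raw Kafka message with the kafka-python package. The broker address and the demo.incoming topic name are assumptions based on a default demo setup, not confirmed by this README; the feed endpoint shown in the test environment below is the simpler route.

# Minimal sketch: submit a crawl request straight onto the Kafka data bus.
# Assumptions: kafka-python is installed, a broker listens on localhost:9092,
# and the cluster reads requests from the default "demo.incoming" topic.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Same request shape as the curl feed example in the test environment below.
request = {"url": "http://dmoztools.net", "appid": "testapp", "crawlid": "abc123"}
producer.send("demo.incoming", request)
producer.flush()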

Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have Docker installed.

Steps to launch the test environment:

  1. Build your containers (or omit --build to pull from Docker Hub):
docker-compose up -d --build
  2. Tail Kafka to view your future results:
docker-compose exec kafka_monitor python kafkadump.py dump -t demo.crawled_firehose -ll INFO
  3. From another terminal, feed a request to Kafka:
curl localhost:5343/feed -H "content-type:application/json" -d '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc123"}'
  4. Validate you've got data!
# wait a couple of seconds; your terminal from step 2 should dump JSON data
{u'body': '...content...', u'crawlid': u'abc123', u'links': [], u'encoding': u'utf-8', u'url': u'http://dmoztools.net', u'status_code': 200, u'status_msg': u'OK', u'response_url': u'http://dmoztools.net', u'request_headers': {u'Accept-Language': [u'en'], u'Accept-Encoding': [u'gzip,deflate'], u'Accept': [u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], u'User-Agent': [u'Scrapy/1.5.0 (+https://scrapy.org)']}, u'response_headers': {u'X-Amz-Cf-Pop': [u'IAD79-C3'], u'Via': [u'1.1 82c27f654a5635aeb67d519456516244.cloudfront.net (CloudFront)'], u'X-Cache': [u'RefreshHit from cloudfront'], u'Vary': [u'Accept-Encoding'], u'Server': [u'AmazonS3'], u'Last-Modified': [u'Mon, 20 Mar 2017 16:43:41 GMT'], u'Etag': [u'"cf6b76618b6f31cdec61181251aa39b7"'], u'X-Amz-Cf-Id': [u'y7MqDCLdBRu0UANgt4KOc6m3pKaCqsZP3U3ZgIuxMAJxoml2HTPs_Q=='], u'Date': [u'Tue, 22 Dec 2020 21:37:05 GMT'], u'Content-Type': [u'text/html']}, u'timestamp': u'2020-12-22T21:37:04.736926', u'attrs': None, u'appid': u'testapp'}
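
If you would rather consume the firehose from your own code than run kafkadump.py, a small kafka-python consumer along these lines works; the broker address is an assumption for a default local setup, and the printed fields match the sample record above.

# Minimal sketch: read crawl results from the firehose topic tailed in step 2.
# Assumption: a broker is reachable at localhost:9092.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "demo.crawled_firehose",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    item = message.value
    print(item["crawlid"], item["status_code"], item["url"])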

Documentation

Please check out the official Scrapy Cluster documentation for more information on how everything works!

Branches

The master branch of this repository contains the latest stable release code for Scrapy Cluster 1.2.

The dev branch contains bleeding-edge code and is currently working towards Scrapy Cluster 1.3. Please note that not everything may be documented, finished, tested, or finalized, but we are happy to help guide those who are interested.
