All Projects → damklis → Dataengineeringproject

damklis / Dataengineeringproject

Licence: mit
Example end to end data engineering project.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Dataengineeringproject

Kafka Connect Ui
Web tool for Kafka Connect |
Stars: ✭ 388 (+373.17%)
Mutual labels:  s3, kafka, kafka-connect, redis, elasticsearch
Stream Reactor
Streaming reference architecture for ETL with Kafka and Kafka-Connect. You can find more on http://lenses.io on how we provide a unified solution to manage your connectors, most advanced SQL engine for Kafka and Kafka Streams, cluster monitoring and alerting, and more.
Stars: ✭ 753 (+818.29%)
Mutual labels:  kafka, kafka-connect, redis, mongodb, elasticsearch
Springbootexamples
Spring Boot 学习教程
Stars: ✭ 794 (+868.29%)
Mutual labels:  kafka, redis, mongodb, elasticsearch
Szt Bigdata
深圳地铁大数据客流分析系统🚇🚄🌟
Stars: ✭ 826 (+907.32%)
Mutual labels:  kafka, redis, mongodb, elasticsearch
Springboot Templates
springboot和dubbo、netty的集成,redis mongodb的nosql模板, kafka rocketmq rabbit的MQ模板, solr solrcloud elasticsearch查询引擎
Stars: ✭ 100 (+21.95%)
Mutual labels:  kafka, redis, mongodb, elasticsearch
Goodskill
🐂基于springcloud +dubbo构建的模拟秒杀项目,模块化设计,集成了分库分表、elasticsearch🔍、gateway、mybatis-plus、spring-session等常用开源组件
Stars: ✭ 786 (+858.54%)
Mutual labels:  kafka, redis, mongodb, elasticsearch
Spring Boot 2.x Examples
Spring Boot 2.x code examples
Stars: ✭ 104 (+26.83%)
Mutual labels:  kafka, redis, mongodb, elasticsearch
Streamx
kafka-connect-s3 : Ingest data from Kafka to Object Stores(s3)
Stars: ✭ 96 (+17.07%)
Mutual labels:  s3, kafka, big-data, kafka-connect
Operators
Collection of Kubernetes Operators built with KUDO.
Stars: ✭ 175 (+113.41%)
Mutual labels:  hacktoberfest, kafka, redis, elasticsearch
Firecamp
Serverless Platform for the stateful services
Stars: ✭ 194 (+136.59%)
Mutual labels:  kafka, redis, mongodb, elasticsearch
Testcontainers Spring Boot
Container auto-configurations for spring-boot based integration tests
Stars: ✭ 460 (+460.98%)
Mutual labels:  kafka, redis, mongodb, minio
Javakeeper
✍️ Java 工程师必备架构体系知识总结:涵盖分布式、微服务、RPC等互联网公司常用架构,以及数据存储、缓存、搜索等必备技能
Stars: ✭ 502 (+512.2%)
Mutual labels:  s3, kafka, redis, elasticsearch
Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+867.07%)
Mutual labels:  s3, airflow, data-engineering
Demo Scene
👾Scripts and samples to support Confluent Demos and Talks. ⚠️Might be rough around the edges ;-) 👉For automated tutorials and QA'd code, see https://github.com/confluentinc/examples/
Stars: ✭ 806 (+882.93%)
Mutual labels:  kafka, kafka-connect, elasticsearch
Kafka Elasticsearch Injector
Golang app to read records from a set of kafka topics and write them to an elasticsearch cluster
Stars: ✭ 70 (-14.63%)
Mutual labels:  hacktoberfest, kafka, elasticsearch
Mall Swarm
mall-swarm是一套微服务商城系统,采用了 Spring Cloud Hoxton & Alibaba、Spring Boot 2.3、Oauth2、MyBatis、Docker、Elasticsearch、Kubernetes等核心技术,同时提供了基于Vue的管理后台方便快速搭建系统。mall-swarm在电商业务的基础集成了注册中心、配置中心、监控中心、网关等系统功能。文档齐全,附带全套Spring Cloud教程。
Stars: ✭ 7,874 (+9502.44%)
Mutual labels:  redis, mongodb, elasticsearch
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Stars: ✭ 921 (+1023.17%)
Mutual labels:  kafka, scraping, redis
Mall Learning
mall学习教程,架构、业务、技术要点全方位解析。mall项目(40k+star)是一套电商系统,使用现阶段主流技术实现。涵盖了SpringBoot 2.3.0、MyBatis 3.4.6、Elasticsearch 7.6.2、RabbitMQ 3.7.15、Redis 5.0、MongoDB 4.2.5、Mysql5.7等技术,采用Docker容器化部署。
Stars: ✭ 10,236 (+12382.93%)
Mutual labels:  redis, mongodb, elasticsearch
Kafka Connect Elasticsearch Source
Kafka Connect Elasticsearch Source
Stars: ✭ 22 (-73.17%)
Mutual labels:  kafka, kafka-connect, elasticsearch
Community
一个仿照牛客网实现的讨论社区,不仅实现了基本的注册,登录,发帖,评论,点赞,回复功能,同时使用前缀树实现敏感词过滤,使用wkhtmltopdf生成长图和pdf,实现网站UV和DAU统计,并将用户头像等信息存于七牛云服务器。
Stars: ✭ 80 (-2.44%)
Mutual labels:  kafka, redis, elasticsearch

Data Engineering Project

Build Status Coverage Status Python 3.8

Data Engineering Project is an implementation of the data pipeline which consumes the latest news from RSS Feeds and makes them available for users via handy API. The pipeline infrastructure is built using popular, open-source projects.

Access the latest news and headlines in one place. 💪

Table of Contents

Architecture diagram

MVP Architecture

How it works

Data Scraping

Airflow DAG is responsible for the execution of Python scraping modules. It runs periodically every X minutes producing micro-batches.

  • First task updates proxypool. Using proxies in combination with rotating user agents can help get scrapers past most of the anti-scraping measures and prevent being detected as a scraper.

  • Second task extracts news from RSS feeds provided in the configuration file, validates the quality and sends data into Kafka topic A. The extraction process is using validated proxies from proxypool.

Data flow

  • Kafka Connect Mongo Sink consumes data from Kafka topic A and stores news in MongoDB using upsert functionality based on _id field.
  • Debezium MongoDB Source tracks a MongoDB replica set for document changes in databases and collections, recording those changes as events in Kafka topic B.
  • Kafka Connect Elasticsearch Sink consumes data from Kafka topic B and upserts news in Elasticsearch. Data replicated between topics A and B ensures MongoDB and ElasticSearch synchronization. Command Query Responsibility Segregation (CQRS) pattern allows the use of separate models for updating and reading information.
  • Kafka Connect S3-Minio Sink consumes records from Kafka topic B and stores them in MinIO (high-performance object storage) to ensure data persistency.

Data access

  • Data gathered by previous steps can be easily accessed in API service using public endpoints.

Prerequisites

Software required to run the project. Install:

Running project

Script manage.sh - wrapper for docker-compose works as a managing tool.

  • Build project infrastructure
./manage.sh up
  • Stop project infrastructure
./manage.sh stop
  • Delete project infrastructure
./manage.sh down

Testing

Script run_tests.sh executes unit tests against Airflow scraping modules and Django Rest Framework applications.

./run_tests.sh

API service

Read detailed documentation on how to interact with data collected by pipeline using search endpoints.

Example searches:

  • see all news
http://0.0.0.0:5000/api/v1/news/ 
  • add search_fields title and description, see all of the news containing the Robert Lewandowski phrase
http://0.0.0.0:5000/api/v1/news/?search=Robert%20Lewandowski 
  • find news containing the Lewandowski phrase in their titles
http://0.0.0.0:5000/api/v1/news/?search=title|Lewandowski 
  • see all of the polish news containing the Lewandowski phrase
http://0.0.0.0:5000/api/v1/news/?search=lewandowski&language=pl

References

Inspired by following codes, articles and videos:

Contributions

Contributions are what makes the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Please feel free to contact me if you have any questions. Damian Kliś @DamianKlis

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].