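Regularly updated documentation for commonly used big data components in the Hadoop ecosystem, with focus in this order: Flink, Solr, Sparksql, ES, Scala, Kafka, Hbase/phoenix, Redis, Kerberos (the project includes Hadoop mind maps in Evernote, simple Scala demos, and desensitized train code for common utility classes; continuously updated!)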

Stars: ✭ 567 (+173.91%)

Mutual labels: hadoop

Go spider

[Crawler framework (golang)] An awesome Go concurrent crawler (spider) framework. The crawler is flexible and modular. It can easily be extended into a customized crawler, or you can use only the default crawl components.

Stars: ✭ 1,745 (+743%)

Mutual labels: pipeline

Deep Forest

An Efficient, Scalable and Optimized Python Framework for Deep Forest (2021.2.1)

Stars: ✭ 547 (+164.25%)

Mutual labels: random-forest

Vistrails

VisTrails is an open-source data analysis and visualization tool. It provides a comprehensive provenance infrastructure that maintains detailed history information about the steps followed and data derived in the course of an exploratory task: VisTrails maintains provenance of data products, of the computational processes that derive these products and their executions.

Stars: ✭ 94 (-54.59%)

Mutual labels: pipeline

Ttyplot

a realtime plotting utility for terminal/console with data input from stdin

Stars: ✭ 532 (+157%)

Mutual labels: pipeline

Machine Learning Models

Decision Trees, Random Forest, Dynamic Time Warping, Naive Bayes, KNN, Linear Regression, Logistic Regression, Mixture Of Gaussian, Neural Network, PCA, SVD, Gaussian Naive Bayes, Fitting Data to Gaussian, K-Means

Stars: ✭ 160 (-22.71%)

Mutual labels: random-forest

Hyperparameter Optimization Of Machine Learning Algorithms

Implementation of hyperparameter optimization/tuning methods for machine learning & deep learning models (easy&clear)

Stars: ✭ 516 (+149.28%)

Mutual labels: random-forest

Wifi

A big data query and analysis system based on information captured over Wi-Fi

Stars: ✭ 93 (-55.07%)

Mutual labels: hadoop

Machinelearnjs

Machine Learning library for the web and Node.

Stars: ✭ 498 (+140.58%)

Mutual labels: random-forest

Xlearning

AI on Hadoop

Stars: ✭ 1,709 (+725.6%)

Mutual labels: hadoop

Gis Tools For Hadoop

The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.

Stars: ✭ 485 (+134.3%)

Mutual labels: hadoop

Mnemonic

Apache Mnemonic - A non-volatile hybrid memory storage oriented library

Stars: ✭ 91 (-56.04%)

Mutual labels: bigdata

Pdf

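Programming e-books, e-books, and programming books, covering C, C#, Docker, Elasticsearch, Git, Hadoop, HeadFirst, Java, Javascript, jvm, Kafka, Linux, Maven, MongoDB, MyBatis, MySQL, Netty, Nginx, Python, RabbitMQ, Redis, Scala, Solr, Spark, Spring, SpringBoot, SpringCloud, TCPIP, Tomcat, Zookeeper, artificial intelligence, big data, concurrent programming, databases, data mining, new interview questions, architecture design, algorithms, computer science, design patterns, software testing, refactoring and optimization, and more categories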

Stars: ✭ 12,009 (+5701.45%)

Mutual labels: hadoop

Chefboost

A Lightweight Decision Tree Framework supporting regular algorithms: ID3, C4.5, CART, CHAID and Regression Trees; some advanced techniques: Gradient Boosting (GBDT, GBRT, GBM), Random Forest and Adaboost w/categorical features support for Python

Stars: ✭ 176 (-14.98%)

Mutual labels: random-forest

Gaia

Build powerful pipelines in any programming language.

Stars: ✭ 4,534 (+2090.34%)

Mutual labels: pipeline

Drake

An R-focused pipeline toolkit for reproducibility and high-performance computing

Stars: ✭ 1,301 (+528.5%)

Mutual labels: pipeline

Data Science Ipython Notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Stars: ✭ 22,048 (+10551.21%)

Mutual labels: hadoop

Jenkins Pipeline Library

wcm.io Jenkins Pipeline Library for CI/CD

Stars: ✭ 134 (-35.27%)

Mutual labels: pipeline

The App

Sample application and CD Pipeline for DevOps Dojo

Stars: ✭ 88 (-57.49%)

Mutual labels: pipeline

Optimus

🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

Stars: ✭ 986 (+376.33%)

Mutual labels: bigdata

Circosjs

d3 library to build circular graphs

Stars: ✭ 436 (+110.63%)

Mutual labels: bigdata

Aws Serverless Cicd Workshop

Learn how to build a CI/CD pipeline for SAM-based applications

Stars: ✭ 158 (-23.67%)

Mutual labels: pipeline

Cortx

CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.

Stars: ✭ 426 (+105.8%)

Mutual labels: bigdata

Biglasso

biglasso: Extending Lasso Model Fitting to Big Data in R

Stars: ✭ 87 (-57.97%)

Mutual labels: bigdata

Rush

A cross-platform command-line tool for executing jobs in parallel

Stars: ✭ 421 (+103.38%)

Mutual labels: pipeline

Karton

Distributed malware processing framework based on Python, Redis and MinIO.

Stars: ✭ 134 (-35.27%)

Mutual labels: pipeline

Marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

Stars: ✭ 414 (+100%)

Mutual labels: hadoop

Text classification

Text Classification Algorithms: A Survey

Stars: ✭ 1,276 (+516.43%)

Mutual labels: random-forest

Serving

A flexible, high-performance carrier for machine learning models (PaddlePaddle serving deployment framework)

Stars: ✭ 403 (+94.69%)

Mutual labels: pipeline

Pipeline.rs

☔️ => ⛅️ => ☀️

Stars: ✭ 188 (-9.18%)

Mutual labels: pipeline

Pex Context

Modern WebGL state wrapper for PEX: allocate GPU resources (textures, buffers), setup state pipelines and passes, and combine them into commands.

Stars: ✭ 117 (-43.48%)

Mutual labels: pipeline

Weblogsanalysissystem

A big data platform for analyzing web access logs

Stars: ✭ 37 (-82.13%)

Mutual labels: hadoop

Bio embeddings

Get protein embeddings from protein sequences

Stars: ✭ 86 (-58.45%)

Mutual labels: pipeline

Pytorch classification

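A complete codebase for image classification with PyTorch: training, prediction, TTA, model ensembling, model deployment, CNN feature extraction, classification with SVM or random forest, and model distillation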

Stars: ✭ 395 (+90.82%)

Mutual labels: random-forest

Mara Pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

Stars: ✭ 1,841 (+789.37%)

Mutual labels: pipeline

Iceberg

Iceberg is a table format for large, slow-moving tabular data

Stars: ✭ 393 (+89.86%)

Mutual labels: hadoop

Clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.

Stars: ✭ 85 (-58.94%)

Mutual labels: pipeline

Orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

Stars: ✭ 389 (+87.92%)

Mutual labels: hadoop

Spacy Wordnet

spacy-wordnet creates annotations that easily allow the use of wordnet and wordnet domains by using the nltk wordnet interface

Stars: ✭ 156 (-24.64%)

Mutual labels: pipeline

Learning Spark

Learning Spark from scratch; big data learning

Stars: ✭ 37 (-82.13%)

Mutual labels: hadoop

Airbyte

Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

Stars: ✭ 4,919 (+2276.33%)

Mutual labels: pipeline

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

Stars: ✭ 150 (-27.54%)

Mutual labels: hadoop

Lastbackend

System for containerized apps management. From build to scaling.

Stars: ✭ 1,536 (+642.03%)

Mutual labels: pipeline

Mlj.jl

A Julia machine learning framework

Stars: ✭ 982 (+374.4%)

Mutual labels: pipeline

Jsr203 Hadoop

A Java NIO file system provider for HDFS

Stars: ✭ 35 (-83.09%)

Mutual labels: hadoop

Datax

DataX is an open source universal ETL tool that supports Cassandra, ClickHouse, DBF, Hive, InfluxDB, Kudu, MySQL, Oracle, Presto(Trino), PostgreSQL, SQL Server

Stars: ✭ 116 (-43.96%)

Mutual labels: hadoop

Cimonitor

Displays CI statuses on a dashboard and triggers fun modules representing the status!

Stars: ✭ 34 (-83.57%)

Mutual labels: pipeline
