This package contains a generic implementation of greedy information-theoretic feature selection (FS) methods. The implementation is based on the common theoretical framework presented by Gavin Brown. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided.

Stars: ✭ 123 (-42.79%)

Mutual labels: spark

Spark Bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.

Stars: ✭ 65 (-69.77%)

Mutual labels: spark

Spark Tsne

Distributed t-SNE via Apache Spark

Stars: ✭ 151 (-29.77%)

Mutual labels: spark

W2v

Word2Vec models with Twitter data using Spark. Blog:

Stars: ✭ 64 (-70.23%)

Mutual labels: spark

Dynamometer

A tool for scale and performance testing of HDFS with a specific focus on the NameNode.

Stars: ✭ 122 (-43.26%)

Mutual labels: hadoop

Pysparkgeoanalysis

🌐 Interactive Workshop on GeoAnalysis using PySpark

Stars: ✭ 63 (-70.7%)

Mutual labels: spark

Facebook Hive Udfs

Facebook's Hive UDFs

Stars: ✭ 213 (-0.93%)

Mutual labels: hadoop

Roffildlibrary

Library for MQL5 (MetaTrader) with Python, Java, Apache Spark, AWS

Stars: ✭ 63 (-70.7%)

Mutual labels: spark

Benchm Ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

Stars: ✭ 1,835 (+753.49%)

Mutual labels: spark

Warp

Convert and analyze large data sets at light speed, on Mac and iOS.

Stars: ✭ 62 (-71.16%)

Mutual labels: big-data

Deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Stars: ✭ 2,020 (+839.53%)

Mutual labels: spark

Nabhash

An extremely fast Non-crypto-safe AES Based Hash algorithm for Big Data

Stars: ✭ 62 (-71.16%)

Mutual labels: big-data

Zparkio

Boilerplate framework to use Spark and ZIO together.

Stars: ✭ 121 (-43.72%)

Mutual labels: spark

Silex

something to help you spark

Stars: ✭ 61 (-71.63%)

Mutual labels: spark

Data Science Cookbook

🎓 Jupyter notebooks from UFC data science course

Stars: ✭ 60 (-72.09%)

Mutual labels: spark

Spark Nlp

State of the Art Natural Language Processing

Stars: ✭ 2,518 (+1071.16%)

Mutual labels: spark

Hudi

Upserts, Deletes And Incremental Processing on Big Data.

Stars: ✭ 2,586 (+1102.79%)

Mutual labels: bigdata

Eat pyspark in 10 days

pyspark🍒🥭 is delicious, just eat it!😋😋

Stars: ✭ 116 (-46.05%)

Mutual labels: spark

Zemberek Nlp Server

REST Docker server built on the Zemberek Turkish NLP Java library

Stars: ✭ 60 (-72.09%)

Mutual labels: spark

Verticapy

VerticaPy is a Python library that exposes scikit-like functionality to conduct data science projects on data stored in Vertica, thus taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.

Stars: ✭ 59 (-72.56%)

Mutual labels: big-data

Example Spark Kafka

Apache Spark and Apache Kafka integration example

Stars: ✭ 120 (-44.19%)

Mutual labels: spark

Likelike

An implementation of locality sensitive hashing with Hadoop

Stars: ✭ 58 (-73.02%)

Mutual labels: hadoop

Attic Lens

Mirror of Apache Lens

Stars: ✭ 58 (-73.02%)

Mutual labels: big-data

Aztk

AZTK powered by Azure Batch: On-demand, Dockerized, Spark Jobs on Azure

Stars: ✭ 152 (-29.3%)

Mutual labels: spark

Sigmf

The Signal Metadata Format Specification

Stars: ✭ 120 (-44.19%)

Mutual labels: big-data

Ymcache

YMCache is a lightweight object caching solution for iOS and Mac OS X that is designed for highly parallel access scenarios.

Stars: ✭ 58 (-73.02%)

Mutual labels: big-data

Pyspark Examples

Code examples on Apache Spark using python

Stars: ✭ 58 (-73.02%)

Mutual labels: spark

Teddy

Spark Streaming monitoring platform with support for job deployment, alerting, and automatic startup

Stars: ✭ 120 (-44.19%)

Mutual labels: spark

Rumble

⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Stars: ✭ 58 (-73.02%)

Mutual labels: spark

Data Science Live Book

An open source book to learn data science, data analysis and machine learning, suitable for all ages!

Stars: ✭ 193 (-10.23%)

Mutual labels: big-data

Keyvi

Keyvi - a key value index that powers Cliqz search engine. It is an in-memory FST-based data structure highly optimized for size and lookup performance.

Stars: ✭ 171 (-20.47%)

Mutual labels: big-data

Kinesis Sql

Kinesis Connector for Structured Streaming

Stars: ✭ 120 (-44.19%)

Mutual labels: spark

Model Serving Tutorial

Code and presentation for Strata Model Serving tutorial

Stars: ✭ 57 (-73.49%)

Mutual labels: spark

Athenacli

AthenaCLI is a CLI tool for AWS Athena service that can do auto-completion and syntax highlighting.

Stars: ✭ 151 (-29.77%)

Mutual labels: bigdata

Elassandra

Elassandra = Elasticsearch + Apache Cassandra

Stars: ✭ 1,610 (+648.84%)

Mutual labels: spark

Net.jgp.labs.spark

Apache Spark examples exclusively in Java

Stars: ✭ 55 (-74.42%)

Mutual labels: spark

Sparkit Learn

PySpark + Scikit-learn = Sparkit-learn

Stars: ✭ 1,073 (+399.07%)

Mutual labels: apache-spark

Kibble 1

Apache Kibble - a tool to collect, aggregate and visualize data about any software project

Stars: ✭ 54 (-74.88%)

Mutual labels: big-data

Attic Predictionio

PredictionIO, a machine learning server for developers and ML engineers.

Stars: ✭ 12,522 (+5724.19%)

Mutual labels: big-data

Albedo

A recommender system for discovering GitHub repos, built with Apache Spark

Stars: ✭ 149 (-30.7%)

Mutual labels: apache-spark

Lifion Kinesis

A native Node.js producer and consumer library for Amazon Kinesis Data Streams

Stars: ✭ 54 (-74.88%)

Mutual labels: big-data

Utils4s

A collection of test cases and related reference material gathered while using Scala and Spark

Stars: ✭ 1,070 (+397.67%)

Mutual labels: spark

Spark Submit Ui

A Play Framework-based application for submitting Spark apps

Stars: ✭ 53 (-75.35%)

Mutual labels: spark

Cube.js

📊 Cube — Open-Source Analytics API for Building Data Apps

Stars: ✭ 11,983 (+5473.49%)

Mutual labels: spark

Macro ml

Course Website on Macroeconomic Analysis with Machine Learning and Big Data

Stars: ✭ 53 (-75.35%)

Mutual labels: big-data

Oodt

Mirror of Apache OODT

Stars: ✭ 52 (-75.81%)

Mutual labels: big-data

Cc Pyspark

Process Common Crawl data with Python and Spark

Stars: ✭ 147 (-31.63%)

Mutual labels: spark

Cmak

CMAK is a tool for managing Apache Kafka clusters

Stars: ✭ 10,544 (+4804.19%)

Mutual labels: big-data

Awesome Spark

A curated list of awesome Apache Spark packages and resources.

Stars: ✭ 1,061 (+393.49%)

Mutual labels: apache-spark

Datumbox Framework

Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

Stars: ✭ 1,063 (+394.42%)

Mutual labels: big-data

Datax

DataX is an open source universal ETL tool that supports Cassandra, ClickHouse, DBF, Hive, InfluxDB, Kudu, MySQL, Oracle, Presto (Trino), PostgreSQL, SQL Server

Stars: ✭ 116 (-46.05%)

Mutual labels: hadoop

Play Spark Scala

Stars: ✭ 51 (-76.28%)

Mutual labels: spark

Attic Predictionio Sdk Ruby

PredictionIO Ruby SDK

Stars: ✭ 192 (-10.7%)

Mutual labels: big-data

Avro

Apache Avro is a data serialization system.

Stars: ✭ 2,005 (+832.56%)

Mutual labels: bigdata

Amazon S3 Find And Forget

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

Stars: ✭ 115 (-46.51%)

Mutual labels: big-data
