CloudbreakA tool for provisioning and managing Apache Hadoop clusters in the cloud. Cloudbreak, as part of the Hortonworks Data Platform, makes it easy to provision, configure and elastically grow HDP clusters on cloud infrastructure. Cloudbreak can be used to provision Hadoop across cloud infrastructure providers including AWS, Azure, GCP and OpenStack.
Stars: ✭ 301 (-15.45%)
check-engineData validation library for PySpark 3.0.0
Stars: ✭ 29 (-91.85%)
pipelineOONI data processing pipeline
Stars: ✭ 36 (-89.89%)
classifai🔥 One of the most comprehensive open-source data annotation platforms.
Stars: ✭ 99 (-72.19%)
storm-mlAn online learning algorithm library for Storm
Stars: ✭ 18 (-94.94%)
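BaizeAutomated operations and maintenance system (白泽): configuration management, network probing, asset management, business management, CMDB, CD, DevOps, job orchestration, task orchestration, and more. Monitoring, alerting, log analysis, and big data analysis are planned for the future.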
Stars: ✭ 296 (-16.85%)
ByteSlice"Byteslice: Pushing the envelop of main memory data processing with a new storage layout" (SIGMOD'15)
Stars: ✭ 24 (-93.26%)
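leaflet heatmapA simple visualization of Huzhou call data. When the data volume is too large to draw a heatmap directly in the browser, the heatmap rendering step is moved to offline computation. Apache Spark processes the data in parallel and then renders the heatmap; leafletjs then loads the OpenStreetMap layer and the heatmap layer for good interactivity. Rendering is currently implemented with Apache Spark, but the parallel version is slower than single-machine computation, perhaps because Apache Spark is not well suited to this kind of work or because I did not design the algorithm well. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .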
Stars: ✭ 13 (-96.35%)
StroomStroom is a highly scalable data storage, processing and analysis platform.
Stars: ✭ 344 (-3.37%)
meetups-archivosSlides, code, and videos from the meetups, data science days, video calls, and workshops. Data Science Research is a non-profit organization that seeks to spread and decentralize knowledge of Data Science and Artificial Intelligence in Peru, giving opportunities to new talent through MeetUps, Workshops, and Semilleros …
Stars: ✭ 60 (-83.15%)
lensMirror of Apache Lens
Stars: ✭ 57 (-83.99%)
CrateCrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of data in real-time.
Stars: ✭ 3,254 (+814.04%)
LoL-Match-PredictionWin probability predictions for League of Legends matches using neural networks
Stars: ✭ 34 (-90.45%)
Uproot3ROOT I/O in pure Python and NumPy.
Stars: ✭ 312 (-12.36%)
incubator-liminalApache Liminal's goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.
Stars: ✭ 117 (-67.13%)
beekeeperService for automatically managing and cleaning up unreferenced data
Stars: ✭ 43 (-87.92%)
Oie ResourcesA curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (-20.51%)
siembolAn open-source, real-time Security Information & Event Management tool based on big data technologies, providing a scalable, advanced security analytics framework.
Stars: ✭ 153 (-57.02%)
alluxio-pyAlluxio Python client - Access Any Data Source with Python
Stars: ✭ 18 (-94.94%)
Devops RoadmapDevOps methodology & roadmap for a devops developer in 2019. Interesting books to learn new technologies.
Stars: ✭ 349 (-1.97%)
CS Book🔥 Latest computer science e-books. Provides downloads of the latest technical e-books. “I just want to out-compete all of you, or be out-competed by all of you!”
Stars: ✭ 40 (-88.76%)
spark-recordsBulletproof Apache Spark jobs with fast root cause analysis of failures.
Stars: ✭ 67 (-81.18%)
Knowage ServerKnowage is the professional open source suite for modern business analytics over traditional sources and big data systems.
Stars: ✭ 276 (-22.47%)
RemoteShuffleServiceCeleborn provides an elastic and high-performance service for shuffle and spilled data.
Stars: ✭ 262 (-26.4%)
pyparEfficient and scalable parallelism using the message passing interface (MPI) to handle big data and computationally intensive problems.
Stars: ✭ 66 (-81.46%)
MistServerless proxy for Spark cluster
Stars: ✭ 309 (-13.2%)
dxramA distributed in-memory key-value storage for billions of small objects.
Stars: ✭ 25 (-92.98%)
pytorch kmeansImplementation of the k-means algorithm in PyTorch that works for large datasets
Stars: ✭ 38 (-89.33%)
img2datasetEasily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
Stars: ✭ 1,173 (+229.49%)
GDLibraryMatlab library for gradient descent algorithms: Version 1.0.1
Stars: ✭ 50 (-85.96%)
pyspark-cheatsheetPySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (-67.7%)
lcbo-apiA crawler and API server for Liquor Control Board of Ontario retail data
Stars: ✭ 152 (-57.3%)
OzoneScalable, redundant, and distributed object store for Apache Hadoop
Stars: ✭ 330 (-7.3%)
gan deeplearning4jAutomatic feature engineering using Generative Adversarial Networks with Deeplearning4j and Apache Spark.
Stars: ✭ 19 (-94.66%)
hyper-enginePython library for Bayesian hyper-parameter optimization
Stars: ✭ 80 (-77.53%)
FlameStreamDistributed stream processing model and its implementation
Stars: ✭ 14 (-96.07%)
SuccinctEnabling queries on compressed data.
Stars: ✭ 257 (-27.81%)
ngmswissgeol.ch gives you insight into geoscientific data - above and below the surface.
Stars: ✭ 23 (-93.54%)
big-data-liteSamples for the Oracle Big Data Lite VM
Stars: ✭ 41 (-88.48%)
automile-netAutomile offers a simple, smart, cutting-edge telematics solution for businesses to track and manage their business vehicles.
Stars: ✭ 24 (-93.26%)
HelixMirror of Apache Helix
Stars: ✭ 304 (-14.61%)
falconMirror of Apache Falcon
Stars: ✭ 95 (-73.31%)
VespaThe open big data serving engine. https://vespa.ai
Stars: ✭ 3,747 (+952.53%)
Grouparoo🦘 The Grouparoo Monorepo - open source customer data sync framework
Stars: ✭ 334 (-6.18%)
MorpheusMorpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
Stars: ✭ 303 (-14.89%)
mmtf-workshop-2018Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (-85.96%)
couchdb-mangoMirror of Apache CouchDB Mango
Stars: ✭ 34 (-90.45%)