alibaba / Clusterdata

cluster data collected from production clusters in Alibaba for cluster management research

Alibaba Cluster Trace Program

Overview

The Alibaba Cluster Trace Program is published by Alibaba Group. By providing cluster traces from real production environments, the program helps researchers, students, and others interested in the field better understand the characteristics of modern internet data centers (IDCs) and their workloads.

So far, two versions of traces have been released:

  • cluster-trace-v2017 covers about 1300 machines over a period of 12 hours. It was the first trace to introduce the collocation of online services (also known as long-running applications) and batch workloads. For more about this trace, see the related documents (trace_2017). The download link is available after a short survey (survey link).
  • cluster-trace-v2018 covers about 4000 machines over a period of 8 days. Besides being larger in scale than trace-v2017, this trace also contains the DAG information of our production batch workloads. See the related documents for more details (trace_2018). The download link is available after a survey (less than a minute, survey link).

We encourage anyone to use the traces for study or research purposes. If you have any questions when using the traces, please contact us via email: aliababa-clusterdata, or file an issue on GitHub. Filing an issue is recommended, as the discussion helps the whole community. Note that the more clearly you ask the question, the more likely you are to get a clear answer.

We would much appreciate it if you could tell us once any publication using our traces is available, as we maintain a list of related publications to help researchers communicate with each other.

In the future, we will try to release new traces at a regular pace. Please stay tuned.

Our motivation

As said at the beginning, our motivation for publishing the data is to help people in related fields get a better understanding of modern data centers, and to provide production data for researchers to verify their ideas. You may use the traces however you want, as long as it is for research or study purposes.

From our perspective, the data is provided to address the challenges Alibaba faces in IDCs where online services and batch jobs are collocated. We distill the challenges into the following topics:

  1. Workload characterization. How to characterize Alibaba workloads so that we can simulate various production workloads in a representative way for studies of scheduling and resource management strategies.
  2. New algorithms to assign workloads to machines. How to assign and reschedule workloads to machines for better resource utilization while ensuring the performance SLA for different applications (e.g. by reducing resource contention and defining proper priorities).
  3. Collaboration between the online service scheduler (Sigma) and the batch job scheduler (Fuxi). How to adjust resource allocation between online services and batch jobs to improve the throughput of batch jobs while maintaining acceptable QoS (Quality of Service) and fast failure recovery for online services. As the scale of collocation (workloads managed by different schedulers) keeps growing, the design of the collaboration mechanism is becoming more and more critical.

Last but not least, we are always open to working with researchers to improve the efficiency of our clusters, and there are positions open for research interns. If you have any ideas, please contact us via aliababa-clusterdata or Haiyang Ding (Haiyang maintains this cluster trace and works for Alibaba's resource management & scheduling group).

Outcomes from the trace

Papers using Alibaba cluster trace

The fundamental idea behind releasing the cluster data is to enable researchers and practitioners to do research and simulation with more realistic data, and thus bring their results closer to industry adoption. It is a huge encouragement to us to see more works using our data. Here is a list of existing works using Alibaba cluster data. If your paper uses our trace, it would be great if you let us know by sending us an email (aliababa-clusterdata).

Tech reports and projects on analysing the trace

So far this section is empty. In the future, we are going to link some reports and open source repos on how to analyze the trace here, with the permission of their owners.

The purpose of this is to help beginners get started on learning either basic data analysis or how to inspect a cluster from a statistical perspective.