All Projects → yangjufo → Learned Indexes

yangjufo / Learned Indexes

Licence: mit
Implementation of BTree part for paper 'The Case for Learned Index Structures'

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Learned Indexes

Hypopg
Hypothetical Indexes for PostgreSQL
Stars: ✭ 594 (+828.13%)
Mutual labels:  index, database
Sonic
🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
Stars: ✭ 12,347 (+19192.19%)
Mutual labels:  index, database
Stackexchange Dump To Postgres
Python scripts to import StackExchange data dump into Postgres DB.
Stars: ✭ 58 (-9.37%)
Mutual labels:  database
Reporting Services Examples
📕 Various example reports I use for SQL Server Reporting Services (SSRS) as well as documents for unit testing, requirements and a style guide template.
Stars: ✭ 63 (-1.56%)
Mutual labels:  database
Hermitdb
A private decentralized database replicated over Git (or any other distributed log)
Stars: ✭ 61 (-4.69%)
Mutual labels:  database
Data Science Best Resources
Carefully curated resource links for data science in one place
Stars: ✭ 1,104 (+1625%)
Mutual labels:  database
Blazor.indexeddb.framework
A framework for blazor which acts as an interface to IndexedDB
Stars: ✭ 62 (-3.12%)
Mutual labels:  database
Dolt
Dolt – It's Git for Data
Stars: ✭ 9,880 (+15337.5%)
Mutual labels:  database
Semantic forms
Form generators leveraging semantic web standards (RDF(S), OWL, SPARQL , ...
Stars: ✭ 63 (-1.56%)
Mutual labels:  database
Mssql Docker
Official Microsoft repository for SQL Server in Docker resources
Stars: ✭ 1,111 (+1635.94%)
Mutual labels:  database
Eventql
Distributed "massively parallel" SQL query engine
Stars: ✭ 1,121 (+1651.56%)
Mutual labels:  database
Skunk
A data access library for Scala + Postgres.
Stars: ✭ 1,107 (+1629.69%)
Mutual labels:  database
Pyhawk
Searches the directory of choice for interesting files. Such as database files and files with passwords stored on them
Stars: ✭ 60 (-6.25%)
Mutual labels:  database
Laravel Cadillac
🍺 A database tool for laravel.
Stars: ✭ 63 (-1.56%)
Mutual labels:  database
Restfeel
RESTFeel: 一个企业级的API管理&测试平台。RESTFeel帮助你设计、开发、测试您的API。
Stars: ✭ 59 (-7.81%)
Mutual labels:  database
Aliyun Tablestore Go Sdk
TableStore SDK for Golang
Stars: ✭ 63 (-1.56%)
Mutual labels:  database
Sql.js
A javascript library to run SQLite on the web.
Stars: ✭ 9,594 (+14890.63%)
Mutual labels:  database
Layr
Dramatically simplify full‑stack development
Stars: ✭ 1,111 (+1635.94%)
Mutual labels:  database
Awesome Cn
awesome项目中文翻译,提升查阅效率
Stars: ✭ 62 (-3.12%)
Mutual labels:  index
Littletable
An in-memory database of Python objects, searchable using quasi-SQL API
Stars: ✭ 64 (+0%)
Mutual labels:  database

Learned Indexes

Implementation of BTree part for paper 'The Case for Learned Index Structures'.

T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The Case for Learned Index Structures. https://arxiv.org/abs/1712.01208, 2017

Language: Python
Support content: Integer values, Random and Exponential distribution

Files Structures

data/: test data
model/: learned NN model
perfromance/:NN and BTree performance
Learned_BTree.py: main file
Trained_NN.py: NN structures

HOW TO RUN

First, you need to install python2.7.x and package tensorflow, pandas, numpy, enum.
Second, use command to run the Learned_BTree.py fule, that is,
python Learned_BTree.py -t <Type> -d <Distribution> [-p|-n] [Percent]|[Number] [-c] [New data] [-h].

Parameters:
'type': 'Type: sample, full',
'distribution': 'Distribution: random, exponential',
'percent': 'Percent: 0.1-1.0, default value = 0.5; sample train data size = 300,000',
'number': 'Number: 10,000-10,000,000, default value = 300,000',
'new data' 'New Data: INTEGER, 0 for no creating new data file, others for creating'

Example:
python Learned_BTree.py -t full -d random -n 100000 -c 1

Other Content

Sample Training

Sample training is also included in this project, you can use parameter 'sample' for '-t' to test sample training, while '-p' is used for change the sample training percent.

Example:
python Learned_BTree.py -t sample -d random -p 0.3 -c 0

Storage Optimization

More Information will be added soon.


中文

本项目代码是对《The Case For Learned Index Structures》一文的简单实现,实现了文章中BTree的部分,目前支持整数测试集,可以选用随机分布或者指数分布进行测试。

T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The Case for Learned Index Structures. https://arxiv.org/abs/1712.01208, 2017

此外,项目还对有新数据插入的场景进行了探索。

数据索引

主要步骤

  1. 依据论文中思想,搭建混合多级神经网络架构 Stage Models
 Input: int threshold, int stages[]
 Data: record data[]
 Result: trained index
1 M = stages.size;
2 tmp_records[][];
3 tmp_records[1][1] = all_data;
4 for i ← 1 to M do
5   for j ← 1 to stages[i] do
6     index[i][j] = new NN trained on tmp.records[i][j];
7     if i < M then
8       for r ∈ tmp.records[i][j] do
9         𝑞 = f(r.key) / stages[i + 1];
10        tmp.records[i + 1][ 𝑞].add(r);
11 for j ← 1 to index[M].size do
12   index[M][j].calc_err(tmp.records[M][j]);
13   if index[M][j].max_abs_err > threshold then
14     index[M][j] = new B-Tree trained on tmp_records[M][j];
15 return index;

在以上程序中,从整个数据集开始(第3行)开始,首先训练第1级模型。基于第1级模型的预测,从下一级挑选模型,并添加相应的关键字到该模型训练集(第9行和第10行),然后训练这个模型。最后,检查最后一级的模型,如果平均误差高于预定义的阈值,则用B树替换神经网络模型(第11-14行)。 所使用的模型均为全连接网络,随机分布用的是没有隐藏的全连接网络;指数分布用的是有1个有8个核的隐藏层的全连接网络

  1. 使用数据测试神经网络索引和B树索引,对比两者性能

抽样学习

神经网络模型的训练需要较长的时间,通过抽取一部分数据训练的方式,加快训练的速度。


存储优化

基于后续插入数据的分布与现有分布相近的观点。

主要步骤

  1. 根据建立的数据索引估计数据分布,并移动数据的位置,预留出空间。比如原先0-100的数据占据100个BLOCK,预计最终存储数据是现在的2倍,则预留100个BLOCK。

  2. 插入数据,与不进行优化比较。

优势

  1. 新插入数据冲突少,加快插入速度。
  2. 无需重新调整索引,降低索引维护代价,支持了新数据插入场景。
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].