All Projects → openacid → Slim

openacid / Slim

Licence: mit
Surprisingly space efficient trie in Golang(11 bits/key; 100 ns/get).

Programming Languages

go
31211 projects - #10 most used programming language
shell
77523 projects
python
139335 projects - #7 most used programming language
Jinja
831 projects
Makefile
30231 projects

Projects that are alternatives of or similar to Slim

slimarray
SlimArray compresses uint32 into several bits, by using a polynomial to describe overall trend of an array.
Stars: ✭ 39 (-97.71%)
Mutual labels:  memory, compress, compacted
Algo Tree
Algo-Tree is a collection of Algorithms and data structures which are fundamentals to efficient code and good software design. Creating and designing excellent algorithms is required for being an exemplary programmer. It contains solutions in various languages such as C++, Python and Java.
Stars: ✭ 166 (-90.26%)
Mutual labels:  tree, trie
Algorithms
Data Structures & Algorithms. Includes solutions for Cracking the Coding Interview 6th Edition
Stars: ✭ 89 (-94.78%)
Mutual labels:  tree, trie
treap
🍃 🌳 🍂 Efficient implementation of the implicit treap data structure
Stars: ✭ 64 (-96.25%)
Mutual labels:  tree, datastructure
Data Structures
Go datastructures.
Stars: ✭ 336 (-80.29%)
Mutual labels:  tree, trie
AlgoDaily
just for fun
Stars: ✭ 118 (-93.08%)
Mutual labels:  tree, trie
Leetcode
High-quality LeetCode solutions
Stars: ✭ 178 (-89.56%)
Mutual labels:  tree, trie
treelike
A trait to abstract over common tree functionality
Stars: ✭ 33 (-98.06%)
Mutual labels:  tree, datastructure
treap
A thread-safe, persistent Treap (tree + heap) for ordered key-value mapping and priority sorting.
Stars: ✭ 23 (-98.65%)
Mutual labels:  tree, datastructure
Data-Structures-and-Algorithms
Implementation of various Data Structures and algorithms - Linked List, Stacks, Queues, Binary Search Tree, AVL tree,Red Black Trees, Trie, Graph Algorithms, Sorting Algorithms, Greedy Algorithms, Dynamic Programming, Segment Trees etc.
Stars: ✭ 144 (-91.55%)
Mutual labels:  tree, trie
Java Ds Algorithms
Data Structures and Algorithms in Java
Stars: ✭ 125 (-92.67%)
Mutual labels:  tree, trie
Span Tree
🌳 Tree for GitLab
Stars: ✭ 123 (-92.79%)
Mutual labels:  tree
Data Structures
Data-Structures using C++.
Stars: ✭ 121 (-92.9%)
Mutual labels:  trie
Tree Online
An online tree-like utility for generating ASCII folder structure diagrams. Written in TypeScript and React.
Stars: ✭ 119 (-93.02%)
Mutual labels:  tree
Polygon Wind
Wind shader for low poly assets in Unity.
Stars: ✭ 119 (-93.02%)
Mutual labels:  tree
Containers
This library provides various containers. Each container has utility functions to manipulate the data it holds. This is an abstraction as to not have to manually manage and reallocate memory.
Stars: ✭ 125 (-92.67%)
Mutual labels:  tree
Androidofferkiller
💪 Help you get a better offer.
Stars: ✭ 1,669 (-2.11%)
Mutual labels:  datastructure
Ack2
**ack 2 is no longer being maintained. ack 3 is the latest version.**
Stars: ✭ 1,504 (-11.79%)
Mutual labels:  tree
Wmztreeview
类似前端elementUI的树形控件,可自定义节点内容,支持无限极节点,可拖拽增删节点等等,非递归实现
Stars: ✭ 118 (-93.08%)
Mutual labels:  tree
Bplustree
A minimal but extreme fast B+ tree indexing structure demo for billions of key-value storage
Stars: ✭ 1,598 (-6.28%)
Mutual labels:  tree

Slim - surprisingly space efficient data types in Golang

Travis test

Report card Coverage Status

GoDoc PkgGoDev Sourcegraph

Slim is collection of surprisingly space efficient data types, with corresponding serialization APIs to persisting them on-disk or for transport.

Why slim

As data on internet keeps increasing exponentially, the capacity gap between memory and disk becomes greater.

Most of the time, a data itself does not need to be loaded into expensive main memory. Only the much more important information, WHERE-A-DATA-IS, deserve a seat in main memory.

This is what slim does, keeps as little information as possible in main memory, as a minimized index of huge amount external data.

  • SlimIndex: is a common index structure, building on top of SlimTrie.

    GoDoc

  • SlimTrie is the underlying index data structure, evolved from trie.

    GoDoc

    Features:

    • Minimized: 11 bits per key(far less than an 64-bits pointer!!).

    • Stable: memory consumption is stable in various scenarios. The Worst case converges to average consumption tightly. See benchmark.

    • Loooong keys: You can have VERY long keys(16K bytes), without any waste of memory(and money). Do not waste your life writing another prefix compression:). (aws-s3 limits key length to 1024 bytes). Memory consumption only relates to key count, not to key length.

    • Ordered: like btree, keys are stored. Range-scan will be ready in 0.6.0.

    • Fast: ~150 ns per Get(). Time complexity for a get is O(log(n) + k); n: key count; k: key length.

    • Ready for transport: a single proto.Marshal() is all it requires to serialize, transport or persisting on disk etc.

Memory overhead

  • Random string, fixed length, default mode, no label is store if possible:

    Bits/key: memory or disk-space in bits a key consumed in average. It does not change when key-length(k) becomes larger!

  • 1 million var-length string, 10 to 20 byte in different mode SlimTrie:

    - size gzip-size
    Original 15.0M 14.0M
    Complete 14.0M 10.0M
    InnerLabel 1.3M 0.9M
    NoLabel 1.3M 0.8M

    Raw string list and serialized slim is stored in: https://github.com/openacid/testkeys/tree/master/assets

    • Original: raw string lines in a text file.

    • Complete: NewSlimTrie(..., Opt{Complete:Bool(true)}): lossless SlimTrie, stores complete info of every string. This mode provides accurate query.

    • InnerLabel: NewSlimTrie(..., Opt{InnerPrefix:Bool(true)}) SlimTrie stores only label strings of inner nodes(but not label to a leaf). There is false positive in this mode.

    • NoLabel: No label info is stored. False positive rate is higher.

Performance

Time(in nano second) spent on a Get() with golang-map, SlimTrie, array and btree by google.

  • 3.3 times faster than the btree.
  • 2.3 times faster than binary search.

Time(in nano second) spent on a Get() with different key count(n) and key length(k):

False Positive Rate

Bloom filter requires about 9 bits/key to archieve less than 1% FPR.

See: trie/report/

Status

  • SlimTrie APIs are stable, and has been used in a production env.

    Meanwhile we focus on optimizing memory usage and query performance.

  • Internal data structure are promised to be backward compatible for ever. No data migration issue!

Roadmap

  • 2021-01-15 v0.5.11 Query by range
  • 2019-09-18 v0.5.10 Reduce false positive rate to less than 0.05%
  • 2019-06-03 v0.5.9 Large key set benchmark
  • 2019-05-29 v0.5.6 Support up to 2 billion keys
  • 2019-05-18 v0.5.4 Reduce memory usage from 40 to 14 bits/key
  • 2019-04-20 v0.4.3 Range index: many keys share one index item
  • 2019-04-18 v0.4.1 Marshaling support
  • 2019-03-08 v0.1.0 SlimIndex SlimTrie

Change-log

Change-log

Synopsis

Index keys, get by key

package index_test

import (
	"fmt"
	"strings"

	"github.com/openacid/slim/index"
)

type Data string

func (d Data) Read(offset int64, key string) (string, bool) {
	kv := strings.Split(string(d)[offset:], ",")[0:2]
	if kv[0] == key {
		return kv[1], true
	}
	return "", false
}

func Example() {

	// Accelerate external data accessing (in memory or on disk) by indexing
	// them with a SlimTrie:

	// `data` is a sample of some unindexed data. In our example it is a comma
	// separated key value series.
	//
	// In order to let SlimTrie be able to read data, `data` should have
	// a `Read` method:
	//     Read(offset int64, key string) (string, bool)
	data := Data("Aaron,1,Agatha,1,Al,2,Albert,3,Alexander,5,Alison,8")

	// keyOffsets is a prebuilt index that stores key and its offset in data accordingly.
	keyOffsets := []index.OffsetIndexItem{
		{Key: "Aaron", Offset: 0},
		{Key: "Agatha", Offset: 8},
		{Key: "Al", Offset: 17},
		{Key: "Albert", Offset: 22},
		{Key: "Alexander", Offset: 31},
		{Key: "Alison", Offset: 43},
	}

	// `SlimIndex` is simply a container of SlimTrie and its data.
	st, err := index.NewSlimIndex(keyOffsets, data)
	if err != nil {
		fmt.Println(err)
	}

	// Lookup
	v, found := st.Get("Alison")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Alison", found, v)

	v, found = st.Get("foo")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "foo", found, v)

	// Output:
	// key: "Alison"
	//   found: true
	//   value: "8"
	// key: "foo"
	//   found: false
	//   value: ""
}

Index key ranges, get by key

Create an index item for every 4(or more as you wish) keys.

Let several adjacent keys share one index item reduces a lot memory cost if there are huge amount keys in external data. Such as to index billions of 4KB objects on a 4TB disk(because one disk IO costs 20ms for either reading 4KB or reading 1MB).

package index_test

import (
	"fmt"
	"strings"

	"github.com/openacid/slim/index"
)

type RangeData string

func (d RangeData) Read(offset int64, key string) (string, bool) {
	for i := 0; i < 4; i++ {
		if int(offset) >= len(d) {
			break
		}

		kv := strings.Split(string(d)[offset:], ",")[0:2]
		if kv[0] == key {
			return kv[1], true
		}
		offset += int64(len(kv[0]) + len(kv[1]) + 2)

	}
	return "", false
}

func Example_indexRanges() {

	// Index ranges instead of keys:
	// In this example at most 4 keys shares one index item.

	data := RangeData("Aaron,1,Agatha,1,Al,2,Albert,3,Alexander,5,Alison,8")

	// keyOffsets is a prebuilt index that stores range start, range end and its offset.
	keyOffsets := []index.OffsetIndexItem{
		// Aaron  +--> 0
		// Agatha |
		// Al     |
		// Albert |

		// Alexander +--> 31
		// Alison    |

		{Key: "Aaron", Offset: 0},
		{Key: "Agatha", Offset: 0},
		{Key: "Al", Offset: 0},
		{Key: "Albert", Offset: 0},

		{Key: "Alexander", Offset: 31},
		{Key: "Alison", Offset: 31},
	}

	st, err := index.NewSlimIndex(keyOffsets, data)
	if err != nil {
		panic(err)
	}

	v, found := st.RangeGet("Aaron")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Aaron", found, v)

	v, found = st.RangeGet("Al")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Al", found, v)

	v, found = st.RangeGet("foo")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "foo", found, v)

	// Output:
	// key: "Aaron"
	//   found: true
	//   value: "1"
	// key: "Al"
	//   found: true
	//   value: "2"
	// key: "foo"
	//   found: false
	//   value: ""
}

Scan

package trie

import (
	"fmt"

	"github.com/openacid/slim/encode"
)

func ExampleSlimTrie_ScanFrom() {
	var keys = []string{
		"",
		"`",
		"a",
		"ab",
		"abc",
		"abca",
		"abcd",
		"abcd1",
		"abce",
		"be",
		"c",
		"cde0",
		"d",
	}
	values := makeI32s(len(keys))

	codec := encode.I32{}
	st, _ := NewSlimTrie(codec, keys, values, Opt{
		Complete: Bool(true),
	})

	// untilD stops when encountering "d".
	untilD := func(k, v []byte) bool {
		if string(k) == "d" {
			return false
		}

		_, i32 := codec.Decode(v)
		fmt.Println(string(k), i32)
		return true
	}

	fmt.Println("scan (ab, +∞):")
	st.ScanFrom("ab", false, true, untilD)

	fmt.Println()
	fmt.Println("scan [be, +∞):")
	st.ScanFrom("be", true, true, untilD)

	fmt.Println()
	fmt.Println("scan (ab, be):")
	st.ScanFromTo(
		"ab", false,
		"be", false,
		true, untilD)

	// Output:
	//
	// scan (ab, +∞):
	// abc 4
	// abca 5
	// abcd 6
	// abcd1 7
	// abce 8
	// be 9
	// c 10
	// cde0 11
	//
	// scan [be, +∞):
	// be 9
	// c 10
	// cde0 11
	//
	// scan (ab, be):
	// abc 4
	// abca 5
	// abcd 6
	// abcd1 7
	// abce 8
}

Getting started

Install

go get github.com/openacid/slim/trie

Who are using slim

baishancloud

Feedback and contributions

Feedback and Contributions are greatly appreciated.

At this stage, the maintainers are most interested in feedback centered on:

  • Do you have a real life scenario that slim supports well, or doesn't support at all?
  • Do any of the APIs fulfill your needs well?

Let us know by filing an issue, describing what you did or wanted to do, what you expected to happen, and what actually happened:

Or other type of issue.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].