All Projects → luispedro → Diskhash

luispedro / Diskhash

Licence: other
Diskbased (persistent) hashtable

Programming Languages

python
139335 projects - #7 most used programming language
c
50402 projects - #5 most used programming language
haskell
3896 projects

Projects that are alternatives of or similar to Diskhash

Instantobjects
Pupular OOP-OPF Library for Delphi (from D2010 to 10.4 Sydney)
Stars: ✭ 50 (-51.92%)
Mutual labels:  persistence
Phpsploit
Full-featured C2 framework which silently persists on webserver with a single-line PHP backdoor
Stars: ✭ 1,188 (+1042.31%)
Mutual labels:  persistence
Slowpoke
Low-level key/value store in pure Go.
Stars: ✭ 98 (-5.77%)
Mutual labels:  persistence
Backbone Faux Server
A framework for mocking up server-side persistence / processing for Backbone.js
Stars: ✭ 55 (-47.12%)
Mutual labels:  persistence
Query Validator
Compile time validation for HQL and JPQL queries in Java code
Stars: ✭ 70 (-32.69%)
Mutual labels:  persistence
Cistern
Ruby API client framework
Stars: ✭ 81 (-22.12%)
Mutual labels:  persistence
Rom Repository
THIS PROJECT WAS MOVED TO rom-rb/rom
Stars: ✭ 48 (-53.85%)
Mutual labels:  persistence
Fcuuid
iOS UUID / Universally Unique Identifiers library as alternative to UDID and identifierForVendor. 📱
Stars: ✭ 1,387 (+1233.65%)
Mutual labels:  persistence
Drain3
Drain log template miner in Python3
Stars: ✭ 71 (-31.73%)
Mutual labels:  persistence
Unissist
⛑ A ~300b unistore helper to persist your data using equally tiny storage adapters
Stars: ✭ 94 (-9.62%)
Mutual labels:  persistence
Moor
Moor is an easy to use, reactive, typesafe persistence library for Dart & Flutter
Stars: ✭ 1,090 (+948.08%)
Mutual labels:  persistence
Dr0p1t Framework
A framework that create an advanced stealthy dropper that bypass most AVs and have a lot of tricks
Stars: ✭ 1,132 (+988.46%)
Mutual labels:  persistence
Tupl
The Unnamed Persistence Library
Stars: ✭ 83 (-20.19%)
Mutual labels:  persistence
Batchman
This library for Android will take any set of events and batch them up before sending it to the server. It also supports persisting the events on disk so that no event gets lost because of an app crash. Typically used for developing any in-house analytics sdk where you have to make a single api call to push events to the server but you want to optimize the calls so that the api call happens only once per x events, or say once per x minutes. It also supports exponential backoff in case of network failures
Stars: ✭ 50 (-51.92%)
Mutual labels:  persistence
Kratosknife
KratosKnife is a Advanced BOTNET Written in python 3 for Windows OS. Comes With Lot of Advanced Features such as Persistence & VM Detection Methods, Built-in Binder, etc
Stars: ✭ 97 (-6.73%)
Mutual labels:  persistence
Multibootusb
Create multiboot live Linux on a USB disk...
Stars: ✭ 1,042 (+901.92%)
Mutual labels:  persistence
Winpayloads
Undetectable Windows Payload Generation
Stars: ✭ 1,211 (+1064.42%)
Mutual labels:  persistence
Malwarepersistencescripts
A collection of scripts I've written to help red and blue teams with malware persistence techniques.
Stars: ✭ 103 (-0.96%)
Mutual labels:  persistence
Lychee
The most complete and powerful data-binding library and persistence infra for Kotlin 1.3, Android & Splitties Views DSL, JavaFX & TornadoFX, JSON, JDBC & SQLite, SharedPreferences.
Stars: ✭ 102 (-1.92%)
Mutual labels:  persistence
Hibernate Performance
Samples for "Hibernate performance tuning" talk
Stars: ✭ 87 (-16.35%)
Mutual labels:  persistence

Disk-based hashtable

Travis License: MIT

A simple disk-based hash table (i.e., persistent hash table).

It is a hashtable implemented on memory-mapped disk, so that it can be loaded with a single mmap() system call and used in memory directly (being as fast as an in-memory hashtable once it is loaded from disk).

The code is in C, wrappers are provided for Python, Haskell, and C++. The wrappers follow similar APIs with variations to accommodate the language specificity. They all use the same underlying code, so you can open a hashtable created in C from Haskell, modify it within your Haskell code, and later open the result in Python.

Cross-language functionality will only work for simple types where you can control their binary representation (64-bit integers, for example).

Reading does not touch the disk representation at all and, thus, can be done on top of read-only files or using multiple threads (and different processes will share the memory: the operating system does that for you). Writing or modifying values is, however, not thread-safe.

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    long i = 9;
    dht_insert(ht, "key", &i);
    
    long* val = (long*) dht_lookup(ht, "key");
    printf("Looked up value: %l\n", *val);

    dht_free(ht);
    return 0;
}

The C API relies on error codes and error strings (the &err argument above). The header file has decent documentation.

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables. Read-write operations are IO operations, read-only hashtables are pure.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python's interface is based on the struct module. For example, 'll' refers to a pair of 64-bit ints (longs):

import diskhash

tb = diskhash.StructHash(
    fname="testing.dht", 
    keysize=15, 
    structformat='ll',  # store pairs of longs
    mode='rw',
)
value = [1, 2]  # pair of longs
tb.insert("key", *value)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it's not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <iostream>
#include <string>

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cine, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

Stability

This is beta software. It is good enough that I am using it, but the API can change in the future with little warning. The binary format is versioned (the magic string encodes its version, so changes can be detected and you will get an error message in the future rather than some silent misbehaviour.

Automated unit testing ensures that basic mistakes will not go uncaught.

Limitations

  • You must specify the maximum key size. This can be worked around either by pre-hashing the keys (with a strong hash) or using multiple hash tables for different key sizes. Neither is currently implemented in diskhash.

  • You cannot delete objects. This was not a necessity for my uses, so it was not implemented. A simple implementation could be done by marking objects as "deleted" in place and recompacting when the hash table size changes or with an explicit dht_gc() call. It may also be important to add functionality to shrink hashtables so as to not waste disk space.

  • The algorithm is a rather naïve implementation of linear addression. It would not be hard to switch to robin hood hashing and this may indeed happen in the near future.

License: MIT

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].