
nlfiedler / fastcdc-rs

License: MIT
FastCDC implementation in Rust

Programming Languages

rust

Projects that are alternatives to or similar to fastcdc-rs

IntraArchiveDeduplicator
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Stars: ✭ 87 (+180.65%)
Mutual labels:  deduplication
UMICollapse
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Stars: ✭ 31 (+0%)
Mutual labels:  deduplication
Talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (+1783.87%)
Mutual labels:  deduplication
dduper
Fast block-level out-of-band BTRFS deduplication tool.
Stars: ✭ 108 (+248.39%)
Mutual labels:  deduplication
record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (+116.13%)
Mutual labels:  deduplication
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+10583.87%)
Mutual labels:  deduplication
cargo-limit
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Stars: ✭ 105 (+238.71%)
Mutual labels:  deduplication
Borgmatic
Simple, configuration-driven backup software for servers and workstations
Stars: ✭ 902 (+2809.68%)
Mutual labels:  deduplication
RocketMQDedupListener
An idempotent, deduplicating message consumer for RocketMQ, supporting MySQL or Redis as the idempotency table; works out of the box.
Stars: ✭ 132 (+325.81%)
Mutual labels:  deduplication
Recordlinkage
A toolkit for record linkage and duplicate detection in Python
Stars: ✭ 532 (+1616.13%)
Mutual labels:  deduplication
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+483.87%)
Mutual labels:  deduplication
gencore
Generate duplex/single consensus reads to reduce sequencing noises and remove duplications
Stars: ✭ 91 (+193.55%)
Mutual labels:  deduplication
Alertmanager
Prometheus Alertmanager
Stars: ✭ 4,574 (+14654.84%)
Mutual labels:  deduplication
acid-store
A library for secure, deduplicated, transactional, and verifiable data storage
Stars: ✭ 48 (+54.84%)
Mutual labels:  deduplication
Rdedup
Data deduplication engine, supporting optional compression and public key encryption.
Stars: ✭ 690 (+2125.81%)
Mutual labels:  deduplication
yadf
Yet Another Dupes Finder
Stars: ✭ 32 (+3.23%)
Mutual labels:  deduplication
lieu
Dedupe/batch geocode addresses and venues around the world with libpostal
Stars: ✭ 73 (+135.48%)
Mutual labels:  deduplication
Dupandas
📊 python package for performing deduplication using flexible text matching and cleaning in pandas dataframe
Stars: ✭ 20 (-35.48%)
Mutual labels:  deduplication
Jdupes
A powerful duplicate file finder and an enhanced fork of 'fdupes'.
Stars: ✭ 790 (+2448.39%)
Mutual labels:  deduplication
Kopia
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Stars: ✭ 507 (+1535.48%)
Mutual labels:  deduplication

FastCDC

This crate implements the "FastCDC" content defined chunking algorithm in pure Rust. A critical aspect of its behavior is that it is deterministic: it returns exactly the same chunks for the same input. To learn more about content defined chunking and its applications, see the reference material linked below.

Requirements

  • Rust stable (2018 edition)

Building and Testing

$ cargo clean
$ cargo build
$ cargo test

Example Usage

An example can be found in the examples directory of the source repository. It demonstrates reading files of arbitrary size into a memory-mapped buffer, passing them through the chunker, and computing the SHA-256 hash digest of each chunk.

$ cargo run --example dedupe -- --size 32768 test/fixtures/SekienAkashita.jpg
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/examples/dedupe --size 32768 test/fixtures/SekienAkashita.jpg`
hash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857
hash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408
hash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201

The unit tests also contain short examples of using the chunker, such as this snippet:

use fastcdc::{Chunk, FastCDC};
use std::fs;

let read_result = fs::read("test/fixtures/SekienAkashita.jpg");
assert!(read_result.is_ok());
let contents = read_result.unwrap();
// Minimum, average, and maximum chunk sizes in bytes.
let chunker = FastCDC::new(&contents, 16384, 32768, 65536);
let results: Vec<Chunk> = chunker.collect();
assert_eq!(results.len(), 3);
assert_eq!(results[0].offset, 0);
assert_eq!(results[0].length, 32857);
assert_eq!(results[1].offset, 32857);
assert_eq!(results[1].length, 16408);
assert_eq!(results[2].offset, 49265);
assert_eq!(results[2].length, 60201);

Reference Material

The algorithm is as described in "FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication"; see the paper and presentation for details. There are some minor differences, as described below.

Differences with the FastCDC paper

The explanation below is copied from ronomon/deduplication, since this codebase is little more than a translation of that implementation:

The following optimizations and variations on FastCDC are involved in the chunking algorithm:

  • 31-bit integers, to avoid 64-bit arithmetic for the sake of the JavaScript reference implementation.
  • A right shift instead of a left shift, which removes the need for an additional modulus operator that would otherwise be necessary to prevent overflow.
  • Masks are no longer zero-padded, since a right shift is used instead of a left shift.
  • A more adaptive threshold, based on a combination of average and minimum chunk size (rather than just average chunk size), to decide the pivot point at which to switch masks. A larger minimum chunk size now switches from the strict mask to the eager mask earlier.
  • Masks use 1 bit of chunk-size normalization instead of 2 bits.
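The right-shift variant at the heart of these changes can be sketched as a cut-point search over a rolling gear hash. The function shape and the `gear` table here are illustrative assumptions, not the crate's actual internals:

```rust
// Illustrative sketch of a right-shift gear-hash cut-point search.
// The `gear` table contents and this function are assumptions for
// illustration; they are not copied from the crate.
fn find_cutpoint(data: &[u8], mask: u32, gear: &[u32; 256]) -> usize {
    let mut hash: u32 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // Right shift: high bits drain out naturally, so no zero-padded
        // mask (or extra modulus) is needed to prevent overflow.
        hash = (hash >> 1).wrapping_add(gear[byte as usize]);
        if hash & mask == 0 {
            return i + 1; // cut point just after this byte
        }
    }
    data.len() // no boundary found; the caller cuts at the maximum size
}
```

In the full algorithm, a stricter mask (more bits set) is used before the adaptive threshold and an eager mask (fewer bits set) after it, which is the pivot behavior described in the fourth bullet above.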

The primary objective of this codebase was to have a Rust implementation with a permissive license, which could be used for new projects, without concern for data parity with existing implementations.

Other Implementations

This crate is essentially a Rust rewrite of the implementation by Joran Dirk Greef (see the ronomon link below), with greatly simplified usage. One significant difference is that the chunker in this crate does not calculate a hash digest of the chunks.
