
sergey-dryabzhinsky / dedupsqlfs

License: MIT
Deduplicating filesystem via Python3, FUSE and SQLite


Projects that are alternatives of or similar to dedupsqlfs

zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Stars: ✭ 86 (+258.33%)
Mutual labels:  backup, compression, deduplication
Frost
A backup program that does deduplication, compression, encryption
Stars: ✭ 25 (+4.17%)
Mutual labels:  backup, compression, deduplication
Borgmatic
Simple, configuration-driven backup software for servers and workstations
Stars: ✭ 902 (+3658.33%)
Mutual labels:  backup, compression, deduplication
acid-store
A library for secure, deduplicated, transactional, and verifiable data storage
Stars: ✭ 48 (+100%)
Mutual labels:  fuse, compression, deduplication
Vdo
Userspace tools for managing VDO volumes.
Stars: ✭ 138 (+475%)
Mutual labels:  compression, deduplication
Kvdo
A pair of kernel modules which provide pools of deduplicated and/or compressed block storage.
Stars: ✭ 168 (+600%)
Mutual labels:  compression, deduplication
Bareos
Main repository with the code for the libraries and daemons
Stars: ✭ 651 (+2612.5%)
Mutual labels:  backup, compression
JFileSync3
File Syncing with encryption and compression (partly) compatible with encfs / boxcryptor (classic) volumes for local folders and WebDAV backends. Based on JFileSync - hence the name.
Stars: ✭ 20 (-16.67%)
Mutual labels:  backup, compression
Rdedup
Data deduplication engine, supporting optional compression and public key encryption.
Stars: ✭ 690 (+2775%)
Mutual labels:  backup, deduplication
Btrfs Sxbackup
Incremental btrfs snapshot backups with push/pull support via SSH
Stars: ✭ 105 (+337.5%)
Mutual labels:  backup, compression
Snebu
Simple Network Encrypting Backup Utility
Stars: ✭ 92 (+283.33%)
Mutual labels:  backup, compression
Fsarchiver
file system archiver for linux
Stars: ✭ 135 (+462.5%)
Mutual labels:  backup, compression
Kopia
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Stars: ✭ 507 (+2012.5%)
Mutual labels:  backup, deduplication
Restic
Fast, secure, efficient backup program
Stars: ✭ 15,105 (+62837.5%)
Mutual labels:  backup, deduplication
ratarmount
Random Access Read-Only Tar Mount
Stars: ✭ 217 (+804.17%)
Mutual labels:  fuse, compression
lz4ultra
Optimal LZ4 compressor, that produces files that decompress faster while keeping the best compression ratio
Stars: ✭ 49 (+104.17%)
Mutual labels:  compression
py-lz4framed
LZ4-frame library for Python (via C bindings)
Stars: ✭ 42 (+75%)
Mutual labels:  compression
FastIntegerCompression.js
Fast integer compression library in JavaScript
Stars: ✭ 46 (+91.67%)
Mutual labels:  compression
ruby-xz
Ruby bindings for liblzma, using fiddle
Stars: ✭ 33 (+37.5%)
Mutual labels:  compression
blz4
Example of LZ4 compression with optimal parsing using BriefLZ algorithms
Stars: ✭ 24 (+0%)
Mutual labels:  compression

DedupSQLfs

Deduplicating filesystem via FUSE and SQLite written in Python

Based on code written by Peter Odding: http://github.com/xolox/dedupfs/

Rewritten for Python 3 (3.4+), with new compression methods and snapshot / subvolume support.

I know about ZFS and Btrfs. But they are still complicated to use under Linux and have disadvantages: they need a block device, use weak block hash algorithms, and offer very few compression methods.
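The core idea behind a deduplicating filesystem can be sketched in a few lines of Python: split data into fixed-size blocks, hash each block, and store a block only the first time its hash is seen. This is a toy illustration only; the class name, block size, and the sha1/zlib choices are assumptions for the example, not dedupsqlfs internals:

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size for the example

class BlockStore:
    """Toy content-addressed store illustrating block-level deduplication."""

    def __init__(self):
        self.blocks = {}   # block hash -> compressed block, stored once
        self.files = {}    # file name -> ordered list of block hashes

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha1(block).digest()
            if digest not in self.blocks:        # duplicate blocks cost nothing
                self.blocks[digest] = zlib.compress(block)
            hashes.append(digest)
        self.files[name] = hashes

    def read(self, name):
        return b"".join(zlib.decompress(self.blocks[h])
                        for h in self.files[name])

store = BlockStore()
payload = b"x" * 10000
store.write("a.bin", payload)
store.write("b.bin", payload)   # a duplicate file adds no new blocks
```

Writing the same payload twice leaves only 2 unique blocks in the store (the full 4096-byte block and the 1808-byte tail), which is why copying duplicate files onto the mount barely grows the databases.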

Usage

The following shell commands show how to install and use the DedupSqlFS file system on Ubuntu (where it was developed):

$ sudo apt-get install python3-pip libfuse-dev
#
$ sudo pip3 install llfuse==1.4.1
#
# llfuse must be version 1.4.1
#
$ git clone https://github.com/sergey-dryabzhinsky/dedupsqlfs.git
#
$ mkdir mount_point
$ ./bin/mount.dedupsqlfs --mountpoint mount_point
# Now copy some files to mount_point/ and observe that the size of the
# databases doesn't grow much when you copy duplicate files again :-)
# The databases are by default stored in the following locations:
# ~/data/dedupsqlfs/*.sqlite3 contains the tree, meta and blocks data
# You can choose another location with the --data option.
#
# As of version 1.2.919 a cache_flusher helper starts
# and touches a hidden file in the mount_point directory,
# so umount may fail at that moment. Retry it with a small delay:
$ umount mount_point || sleep 0.5 && umount mount_point
# Or you can disable the helper with the --no-cache-flusher switch

Status

Development on DedupSqlFS began as a proof of concept to find out how much disk space the author could free by employing deduplication to store his daily backups. Since then it's become more or less usable as a way to archive old backups, i.e. for secondary storage deduplication. It's not recommended to use the file system for primary storage though, simply because the file system is too slow. I also wouldn't recommend depending on DedupFS just yet, at least until a proper set of automated tests has been written and successfully run to prove the correctness of the code.

The file system initially stored everything in a single SQLite database. It turned out that once a single-file database grew beyond 8 GB, the write speed would drop from 8-12 MB/s to 2-3 MB/s. With multiple database files and other changes applied, it drops only to 6-8 MB/s, even beyond 150 GB. Speed depends heavily on hardware: memory, disks, and the filesystem used for data storage.

It has been used for more than three years to store a large amount of VM backups, and has survived several power outages.

What's new

  • Filesystem data stored in multiple SQLite files.
  • Tuned SQLite connections.
  • Delayed writes for blocks (hashing and compression too).
  • Stream-like writes and reads of data blocks; complete files are not stored in memory.
  • Cached filesystem tree nodes, inodes and data blocks.
  • Many hashing methods: md5, sha1, and others supported by the hashlib module.
  • Many compression methods: zlib, bzip2, lzma, lzo, lz4, quicklz, zstd, snappy, brotli.
  • Support for data storage in a locally running MySQL server.
  • Hashes can be rehashed with the do command.
  • Data blocks can be recompressed with the do command.
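The pluggable hashing and compression listed above map directly onto Python's standard library. A hedged sketch comparing a few stdlib codecs on the same block (lzo, quicklz, zstd, snappy and brotli need the bundled C modules and are omitted here; the block contents are made up for the example):

```python
import hashlib
import zlib
import bz2
import lzma

block = b"some block of file data " * 100   # a repetitive sample block

# Any algorithm hashlib knows can key blocks, e.g. sha256:
digest = hashlib.new("sha256", block).hexdigest()

# Several of the supported compression methods come from the stdlib:
codecs = {
    "zlib": zlib.compress,
    "bzip2": bz2.compress,
    "lzma": lzma.compress,
}
sizes = {name: len(fn(block)) for name, fn in codecs.items()}
```

Each codec trades speed for ratio on the same data, which is why the filesystem makes the method selectable per mount.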

Limitations

In the current implementation a file's content does NOT need to fit in a cStringIO instance, so the maximum file size is no longer limited by your free RAM. However, you may need to tune the caching timeouts to drop caches more frequently, for example on massive reads.

There is a limit on SQLite database size: about 4 TB with the default pages_count (2**30) and page_size (4096). The page_size can be raised to 64 kB, so a database file is theoretically limited to about 140 TB. page_size is adjusted automatically depending on the database file size or the median written block size.
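The default-settings arithmetic above, and the page_size PRAGMA behind it, can be checked with the stdlib sqlite3 module. The 2**30 pages_count is taken from the text; actual SQLite limits vary by build, so treat the numbers as assumptions:

```python
import sqlite3

# Default limit quoted above: pages_count * page_size.
pages_count, page_size = 2 ** 30, 4096
default_limit = pages_count * page_size   # 2**42 bytes, i.e. 4 TiB

# page_size can be raised (up to 64 KiB) if set before the database
# is populated; afterwards it is fixed without a VACUUM.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA page_size = 65536")
con.execute("CREATE TABLE blocks (hash BLOB PRIMARY KEY, data BLOB)")
actual = con.execute("PRAGMA page_size").fetchone()[0]
```

With the larger page size the same page count covers a proportionally larger file, which is why page_size is tuned as the database grows.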

Note: dynamic subvolume and snapshot creation is available only with the MySQL storage engine, because SQLite keeps the database locked. Dynamic subvolume switching is not available either. For now the default MySQL table engine is MyISAM: it's fast and not bloated. InnoDB with page compression works well but is slower.

MariaDB's Aria engine works more slowly than MyISAM - it does too much logging.

More benchmarks are available in the docs/benchmarks folder.

Dependencies

DedupSQLfs was developed using Python 3.4, and it also works with newer versions. Python 3.7-3.10 is recommended now.

Additional compression modules can be built with these commands:

$ sudo apt-get install build-essential python3-dev liblzo2-dev libsnappy-dev liblz4-dev liblzma-dev libzstd-dev libbrotli-dev
$ cd lib-dynload/lzo
$ python3 setup.py clean -a
$ python3 setup.py build_ext clean
## ... same for lz4, snappy,..
# If you need extra optimization - tune for your CPU for example - then call
$ python3 setup.py clean -a
$ python3 setup.py build_ext --extra-optimization clean

The additional MySQL storage engine can be enabled with:

$ sudo pip3 install pymysql

or use bundled one.

An additional performance gain of about 1-5% (depending on the Python version) is possible via Cython:

## Install setuptools if not already present
$ sudo pip3 install setuptools
$ sudo pip3 install cython
$ python3 setup.py build_ext --cython-build
$ python3 setup.py stripall
## Warning! This deletes all .py files
$ python3 setup.py cleanpy

Lower memory usage via RecordClass:

$ sudo pip3 install recordclass

or use bundled one.

Notes about Cython

  1. Profiling via cProfile does not work for compiled code.
  2. Always keep a copy of the dedupsqlfs directory if you are going to run the cleanpy command.
  3. RecordClass is not compatible with Cython - install only one of them.
  4. Python 3.9 seems to work faster by itself, with no need for Cython.

Contact

If you have questions, bug reports, or suggestions, the author can be contacted at [email protected] or via the GitHub issues list. The latest version of DedupSqlFS is available at https://github.com/sergey-dryabzhinsky/dedupsqlfs.

License

This software is licensed under the MIT license.

© 2013-2022 Sergey Dryabzhinsky <[email protected]>.

© 2010 Peter Odding <[email protected]>.
