    _______ ____   __         _      __
   / /  ___|  _ \ / /_ _ _ __| |_   / /
  / /| |_  | |_) / / _` | '__| __| / /
 / / |  _| |  __/ / (_| | |  | |_ / /
/_/  |_|   |_| /_/ \__,_|_|   \__/_/

What is fpart ?

Fpart is a tool that helps you sort file trees and pack them into bags (called "partitions"). It is developed in C and available under the BSD license.

It splits a list of directories and file trees into a certain number of partitions, trying to produce partitions with the same size and number of files. It can also produce partitions with a given number of files or of a limited size. Fpart uses a bin packing algorithm to optimize space utilization amongst partitions.

Once generated, partitions are either printed as file lists to stdout (default) or to files. Those lists can then be used by third party programs.

Fpart also includes a live mode, which allows it to crawl very large filesystems and produce partitions on the fly. Hooks are available to act on those partitions (e.g. immediately start a transfer using rsync(1) or cpio(1)) without having to wait for the whole filesystem traversal to finish. Used that way, fpart can be seen as a powerful basis for a data migration tool.

Fpart can also generate lists of directories instead of files. That mode can be useful with options requiring overall knowledge of directories, such as rsync's --delete.
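
For example, the following minimal sketch (using fpart's -E option, presented later in this document) produces two balanced lists of directories instead of files :

$ fpart -n 2 -E -o dir-parts /data/src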

As a demonstration of fpart's possibilities, a tool called fpsync is provided in the tools/ directory (see below for more details).

Compatibility :

Fpart is primarily developed on FreeBSD.

It has been successfully tested on :

  • FreeBSD (i386, amd64)
  • GNU/Linux (x86_64, arm)
  • Solaris 9, 10 (Sparc, i386)
  • OpenIndiana (i386)
  • NetBSD (amd64, alpha)
  • Mac OS X (10.6, 10.8)

and will probably work on other operating systems too (*).

(*) fpart built as a static binary within a Debian (armel) chroot will give you a powerful tool for backing up your Android (arm) phone ;-)

Examples :

Common usage :

The following will produce 3 partitions, with (approximately) the same size and number of files. Three files, "var-parts.[0-2]", are generated as output :

$ fpart -n 3 -o var-parts /var

$ ls var-parts*
var-parts.0 var-parts.1 var-parts.2

$ head -n 2 var-parts.0
/var/some/file1
/var/some/file2

The following will produce partitions of 4.3 GB each, containing music files ready to be burnt to DVDs (for example). Files "music-parts.[0-n]" are generated as output :

$ fpart -s 4617089843 -o music-parts /path/to/my/music

The following will produce partitions containing 10000 files each by examining /usr first and then /home, displaying only partition 0 on stdout :

$ find /usr ! -type d | fpart -f 10000 -i - /home | grep '^0:'

The following will produce two partitions by re-using du(1) output. Fpart will not examine the filesystem but instead re-use arbitrary values printed by du(1) and sort them :

$ du * | fpart -n 2 -a
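
With option -a, fpart does not compute sizes itself; it expects each input line to carry an arbitrary value followed by a path, which is exactly what du(1) prints, e.g. (illustrative values) :

120     docs
480     music
64      src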

Live mode :

By default, fpart will wait for FS crawling to terminate before generating and displaying partitions. If you use the live mode (option -L), fpart will display each partition as soon as it is complete. You can combine that option with hooks; they will be triggered just before (pre-part hook, option -w) or after (post-part hook, option -W) partitions' completion.

Hooks provide several environment variables (see fpart(1)); they are a convenient way of getting information about fpart's and the current partition's states. For example, ${FPART_PARTFILENAME} contains the name of the output file of the partition that has just been generated; using that variable within a post-part hook makes it possible to start manipulating a partition's files as soon as it has been generated.

See the following example :

$ mkdir foo && touch foo/{bar,baz}
$ fpart -L -f 1 -o /tmp/part.out -W \
    'echo == ${FPART_PARTFILENAME} == ; cat ${FPART_PARTFILENAME}' foo/
== /tmp/part.out.0 ==
foo/bar
== /tmp/part.out.1 ==
foo/baz

This example crawls foo/ in live mode (option -L). For each file (option -f, 1 file per partition), it generates a partition into /tmp/part.out.<n> (option -o; <n> is the partition index, automatically appended by fpart) and executes the following post-part hook (option -W) :

echo == ${FPART_PARTFILENAME} == ; cat ${FPART_PARTFILENAME}

This hook displays the name of the current partition's output file as well as its contents.

Migrating data :

Here is a more complex example that shows how to use fpart, GNU Parallel and rsync to split up a directory and immediately schedule data synchronization of smaller file lists, while FS crawling goes on. We will be synchronizing data from /data/src to /data/dest.

First, go to the source directory (as rsync's --files-from option takes a file list relative to its source directory) :

$ cd /data/src

Then, run fpart from here :

$ fpart -L -f 10000 -x '.snapshot' -x '.zfs' -zz -o /tmp/part.out -W \
  '/usr/local/bin/sem -j 3
    "/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME}
      /data/src/ /data/dest/"' .

This command will start fpart in live mode (option -L), making it generate partitions while crawling the filesystem. Fpart will produce partitions containing at most 10000 files each (option -f), will skip files and folders named '.snapshot' or '.zfs' (option -x) and will list empty and non-accessible directories (option -zz; that option is necessary when working with rsync to make sure the whole file tree is re-created within the destination directory). Last but not least, each partition will be written to /tmp/part.out.<n> (option -o) and used within the post-part hook (option -W), run by fpart as soon as the partition is complete :

/usr/local/bin/sem -j 3
    "/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME} /data/src/ /data/dest/"

This hook is itself a nested command. It will run GNU Parallel's sem scheduler (any other scheduler would do) to run at most 3 rsync jobs in parallel.

The scheduler will finally trigger the following command :

/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME} /data/src/ /data/dest/

where ${FPART_PARTFILENAME} is part of rsync's environment when it runs and contains the file name of the partition that has just been generated.

That's all, folks ! Pretty simple, isn't it ?

In this example, FS crawling and data transfer are run from the same -local- machine, but you can use it as the basis of a much more sophisticated solution: at $work, by using a cluster of machines connected to our filers through NFS and running Open Grid Scheduler, we successfully migrated over 400 TB of data.

Note: several successive fpart runs can be launched using the above example to perform incremental synchronizations. Be aware that files deleted from the source directory will not be removed from the destination unless rsync's --delete option is used. Unfortunately, that option cannot be used with a list of files (files that do not appear in the list are just ignored). To use the --delete option in conjunction with fpart, you have to feed rsync's --files-from option with a list of directories only; that can be done using fpart's -E option (see the sketch below).
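
Here is a minimal sketch of such a directory-oriented run (re-using the directories from the example above; the -f value is arbitrary, and rsync needs an explicit -r because -a does not imply recursion when --files-from is used) :

$ cd /data/src
$ fpart -L -E -f 500 -o /tmp/dpart.out -W \
    'rsync -av -r --delete --files-from=${FPART_PARTFILENAME}
      /data/src/ /data/dest/' .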

Fpsync :

To demonstrate fpart's possibilities, a program called 'fpsync' is provided within the tools/ directory. This tool is a shell script that wraps fpart(1) and rsync(1) (or cpio(1)) to launch several synchronization jobs in parallel, as presented in the previous section. While the previous example used GNU Parallel to schedule transfers, fpsync provides its own -embedded- scheduler. It can execute several synchronization processes locally or launch them on several nodes (workers) through SSH.

Despite its initial 'proof of concept' status, fpsync has quickly evolved into a powerful (yet simple to use) migration tool and has been successfully used to boost the migration of several hundred TB of data (initially at $work, but it has also been tested by several organizations such as UCI, Intel and Amazon; see the 'See also' section at the end of this document).

In addition to being very fast (transfers start during FS crawling and are parallelized), fpsync is able to resume synchronization jobs (see option -r) and presents an overall progress status. It also has a small memory footprint compared to rsync itself when migrating filesystems with a large number of files.

Last but not least, fpsync is very easy to set up and only requires a few common pieces of software to run: fpart, rsync and/or cpio, a POSIX shell, sudo and ssh.
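
A quick way to check that those prerequisites are available (a minimal POSIX shell sketch) :

$ for t in fpart rsync cpio sudo ssh; do
    command -v "$t" >/dev/null || echo "missing: $t"
  done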

See fpsync(1) to learn more about that tool and get a list of all supported options.

Here is a simple representation of how it works :

fpsync [args] /data/src/ /data/dst/
  |
  +-- fpart (live mode) crawls /data/src/, generates parts.[1] + sync jobs ->
  |    \    \    \
  |     \    \    +___ part. #n + job #n
  |      \    \
  |       \    +______ part. #1 + job #1
  |        \
  |         +_________ part. #0 + job #0
  |
  +-- fpsync scheduler, executes jobs either locally or remotely ----------->
       \    \    \
        \    \    +___ sync job #n... --------------------------------------> +
         \    \                                                               |
          \    +______ sync job #1 ---------------------------------->        |
           \                                                                  |
            +_________ sync job #0 ----------------------------->             +
                                                                             /
                                                                            /
              Filesystem tree rebuilt and synchronized! <------------------+

[1] Either containing file lists (default mode) or directory lists (option -E)

File mode :

In its default mode, fpsync uses rsync(1) and works with file lists to perform incremental (only) synchronizations. Option -m makes it use cpio(1) instead, performing the same kind of synchronizations (see 'Notes about cpio tool' below).
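
A cpio-based run could look like the following minimal sketch (assuming option -m takes no argument, as described in this document; the other options are explained with the rsync examples below) :

$ fpsync -m -n 2 -f 1000 /data/src/ /data/dst/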

The following examples show two typical usages.

The command :

$ fpsync -n 4 -f 1000 -s $((100 * 1024 * 1024)) \
    /data/src/ /data/dst/

will synchronize /data/src/ to /data/dst/ using 4 local workers, each one transferring at most 1000 files and 100 MB per synchronization job.

The command :

$ fpsync -n 8 -f 1000 -s $((100 * 1024 * 1024)) \
    -w login@machine1 -w login@machine2 -d /mnt/nfs/fpsync \
    /data/src/ /data/dst/

will synchronize /data/src/ to /data/dst/ using the same transfer limits, but through 8 concurrent synchronization jobs spread over two machines (machine1 and machine2). Those machines must both be able to access /data/src/ and /data/dst/, as well as /mnt/nfs/fpsync, which is fpsync's shared working directory.

As previously mentioned, those two examples work with file lists and perform incremental synchronizations. As a consequence, they will require a final -manual- 'rsync --delete' pass to delete extra files from the /data/dst/ directory.
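
Such a final pass could look like the following (a minimal sketch; add your usual rsync options) :

$ rsync -av --delete /data/src/ /data/dst/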

Directory mode :

If you want to avoid that final pass, use fpsync's option -E (only compatible with the rsync tool). That option makes fpsync work with lists of directories (instead of files) and (forcibly) enables rsync's --delete option with each synchronization job. The downside of that mode is that directory lists are coarse-grained and will probably be less balanced than file lists. The best option is probably to run several incremental jobs and keep the -E option to speed up the final pass only.

(you can read the file 'Solving_the_final_pass_challenge.txt' in the docs/ directory for more details about fpsync's option -E)
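
For instance, a directory-oriented final pass could re-use the first example's options (a sketch; here, the limits apply to directory lists) :

$ fpsync -E -n 4 -f 1000 -s $((100 * 1024 * 1024)) \
    /data/src/ /data/dst/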

Notes about cpio tool :

Fpsync's option '-m' allows you to use cpio(1) instead of rsync(1) to copy files. Cpio(1) is much faster than rsync(1) but there is a catch: when re-creating a complex file tree, missing parent directories are created on-the-fly. In that case, original directory metadata (e.g. timestamps) are not copied from source.

To overcome that limitation, fpsync uses fpart's -zzz option to pack every single directory (0-sized) with file lists. Making directories appear in file lists asks cpio to copy their metadata when the directory entry is processed (of course, fpart ensures that a parent directory entry appears after the files beneath it: if the parent directory is missing, it is first created on the fly; the directory entry then makes cpio update its metadata).

This works fine with a single cpio process (fpsync's option -n 1) but not with 2 or more parallel processes, which may treat partitions out of order. Indeed, if several workers copy files to the same directory at the same time, it is possible that the parent directory's original metadata gets re-applied while another worker is still adding files to that directory. That can occur when a directory list spreads over more than one partition. In such a situation, the original metadata (here, mtime) gets overwritten while new files get added to the directory.

That race condition is unavoidable (fpart would have to guarantee that a directory entry belongs to the same partition as the files beneath it, which would probably lead to unbalanced partitions as well as increased -and useless- complexity).

You've been warned. Anyway, maybe you do not care about copying original directory mtimes; if that is the case, you can ignore that situation. If you do care about them, running a second pass of fpsync will fix the timestamps.

Notes about GNU cpio (specifically) :

Development has been done with BSD cpio (the FreeBSD version). Fpsync will work with GNU cpio too, but there are small behaviour differences you must be aware of :

  • for an unknown reason, GNU cpio will not apply mtime to the main target directory (AKA './' when received by cpio).

  • when using GNU cpio, you will get the following warnings when performing a second pass :

    not created: newer or same age version exists

You can ignore those warnings as that second pass will fix directory timestamps anyway.

Warning: if you pass option '-u' to cpio (through fpsync's option '-o') to get rid of those messages, you will possibly re-touch directory mtimes (losing the original ones). Also, be aware of what that option implies: re-transferring every single file.

Notes about hard links :

Rsync can detect and replicate hard links with option -H, but that will NOT work with fpsync because rsync collects hard link information on a per-run basis.

So, as for directory metadata (see above), being able to propagate hard links with fpsync would require fpart to guarantee that all related links belong to the same partition.

Unfortunately, this is not something fpart can do because, in live mode (used by fpsync to start synchronization as soon as possible), it crawls the filesystem as it comes. As a consequence, there is no means of knowing whether a hard link connected to a file already written to a partition (and probably already synchronized through an independent rsync process) will appear later or not. Also, in non-live mode, trying to group related hard links into the same partitions would probably lead to unbalanced partitions as well as more complex code.

If you need to propagate hard links, you have 3 options:

  • Re-create hard links on the target, but this is not optimal as you may not want to link 2 files together, even if they are similar

  • Pre-copy hard linked files together (using find's '-type f -links +1' options) before running fpsync; see the sketch after this list. That will work, but linked files that have changed since your first synchronization will be converted back to regular files when running fpsync

  • Use a final -monolithic- rsync pass with option -H that will re-create them
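
Here is a minimal sketch of the second option (a single rsync run, so that option -H can see all the links together; with --files-from, rsync reads the list from stdin when given '-') :

$ cd /data/src
$ find . -type f -links +1 | rsync -aH --files-from=- ./ /data/dst/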

SSH options :

When dealing with SSH options and keys, keep in mind that fpsync uses SSH for two kinds of operations :

  • data synchronization (when ssh is forked by rsync), can occur locally or on remote workers (if using any)
  • communication with workers (when ssh is forked by fpsync), only occurs locally (on the scheduler)

If you need specific options for the first case, you can pass ssh options by using rsync's option '-e' (through fpsync's option '-o') and triple-escaping the quote characters :

$ fpsync [...] -o "-lptgoD -v --numeric-ids -e \\\"ssh -i ssh_key\\\"" \
    /data/src/ login@remote:/data/dst/

The key will have to be present and accessible on all workers.

Fpsync does not offer options to deal with the second case. You will have to tune your ssh config file to enable passwordless communication with workers. Something like :

$ cat ~/.ssh/config
Host remote
IdentityFile /path/to/the/passwordless/key

should work.

Limitations :

  • Fpart will NOT modify data, it will NOT split your files !

    As a consequence, if you have a directory containing several small files and a huge one, it will be unable to produce partitions with the same size. Fpart does magic, but not that much ;-)

  • Fpart will not deduplicate paths !

    If you provide several paths to fpart, it will examine all of them. If those paths overlap or if the same path is specified more than once, the same files will appear more than once within the generated partitions. This is not a bug: fpart does not deduplicate FS crawling results.

  • Fpsync only synchronizes directory contents !

    Contrary to rsync, fpsync enforces the final '/' on the source directory. It means that directory contents are synchronized, not the source directory itself (i.e. you will not get a subdirectory named after the source directory within the target directory after synchronization).

Installing :

Pre-compiled packages are already available for many operating systems.

If a pre-compiled package is not available for your favourite operating system, installing from sources is simple. First, if there is no 'configure' script in the main directory, run :

$ autoreconf -i

(autoreconf comes from the GNU autotools), then run :

$ ./configure
$ make

to configure and build fpart.

Finally, install fpart (as root) :

# make install

Portability considerations :

On OpenIndiana, if you need to use fpsync(1), the script will need adjustments :

  • Change shebang from /bin/sh to a more powerful shell that understands local variables, such as /bin/bash.
  • Adapt fpart(1) and grep(1) paths (use ggrep(1) instead of grep(1), as the default grep(1) doesn't know about the -E flag).
  • Remove -0 and --quiet options from cpio call (they are not supported). As a consequence, also remove -0 from fpart options.

On Alpine Linux, you will need the 'fts-dev' package to build fpart(1).
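
For example (a sketch using Alpine's apk package manager; on recent releases the package may be named musl-fts-dev) :

# apk add fts-dev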

See also :

They use Fpart or talk about it :

Author / Licence :

Fpart has been written by Ganael LAPLANCHE <ganael.laplanche@martymac.org> and is available under the BSD license (see COPYING for details).

Thanks to Jean-Baptiste Denis for having given me the idea of this program !

Donation :

If fpart is useful to you or your organization, you can make a donation via PayPal.

That will help me not run out of coffee :)

Contributions :

FTS code comes from FreeBSD :

lib/libc/gen/fts.c -> fts.c
include/fts.h      -> fts.h

and is available under the BSD license.
