shashi / FileTrees.jl

License: other
Parallel file processing made easy


FileTrees


Easy everyday parallelism with a file tree abstraction.

Installation

using Pkg
Pkg.add("FileTrees")

With FileTrees you can

  • Read a directory structure as a Julia data structure, (lazy-)load the files, apply map and reduce operations on the data while not exceeding available memory if possible. (docs)
  • Filter data by file name using familiar Unix syntax (docs)
  • Build a file tree in memory, create some data to go with each file (in parallel), and write the tree to disk (in parallel). (See the example below.)
  • Virtually mv and cp files within trees, merge and diff trees, apply different functions to different subtrees. (docs)
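As a small sketch of the filtering point above (the tree layout and file names here are made up for illustration; glob patterns come from Glob.jl, which FileTrees integrates with):

```julia
using FileTrees, Glob

# Build a small tree purely in memory -- nothing touches disk
t = maketree("data" => ["a.csv", "b.csv", "notes.txt"])

# Filter files by a familiar Unix-style glob pattern
csvs = t[glob"*.csv"]
```

This returns a new tree containing only the matching files; the original tree is untouched.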

Go to the documentation →

Example

Here is an example that uses FileTrees to create 3025 images which together form one large 16500x16500 image of the Mandelbrot set. (I tried my best to make the tiles contiguous; it's almost right, but I'm still figuring out those parameters.)

Then we load the images back and compute a histogram of the HSV values across all of them in parallel using OnlineStats.jl.

using Distributed
@everywhere using Images, FileTrees, FileIO

tree = maketree("mandel"=>[]) # an empty file tree
params = [(x, y) for x=-1:0.037:1, y=-1:0.037:1]
for i = 1:size(params,1)
    for j = 1:size(params,2)
        tree = touch(tree, "$i/$j.png"; value=params[i, j])
    end
end

# map over the values to create an image at each node.
# 300x300 tile per image.
t1 = FileTrees.mapvalues(tree, lazy=true) do params
    mandelbrot(50, params..., 300) # zoom level, moveX, moveY, size
end
 
# save it
@time FileTrees.save(t1) do file
    FileIO.save(path(file), file.value)
end
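The `mandelbrot` helper called above is not defined in this snippet. A minimal escape-time sketch (an assumption for illustration, not the original implementation) might look like:

```julia
using Images  # for Gray

# Hypothetical helper: render an n-by-n grayscale tile of the Mandelbrot set,
# offset by moveX/moveY at the given zoom level.
function mandelbrot(zoom, moveX, moveY, n; maxiter=255)
    img = zeros(Gray{Float64}, n, n)
    for i in 1:n, j in 1:n
        # map the pixel to a point c in the complex plane
        c = complex(1.5 * (j - n / 2) / (0.5 * zoom * n) + moveX,
                    (i - n / 2) / (0.5 * zoom * n) + moveY)
        z = zero(c)
        k = 0
        # iterate z <- z^2 + c until it escapes or we hit maxiter
        while abs2(z) < 4 && k < maxiter
            z = z^2 + c
            k += 1
        end
        img[i, j] = Gray(k / maxiter)  # shade by escape time
    end
    img
end
```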

This takes about 150 seconds when Julia is started with 10 processes of 4 threads each, i.e. on a 12-core machine (oversubscribing this much gives good performance in this case). In other words:

export JULIA_NUM_THREADS=4
julia -p 10

Then load it back in a new session:

using Distributed
@everywhere using FileTrees, FileIO, Images, .Threads, OnlineStats, Distributed

t = FileTree("mandel")

# Lazy-load each image and compute its histogram
t1 = FileTrees.load(t; lazy=true) do f
    h = Hist(0:0.05:1)
    img = FileIO.load(path(f))
    println("pid ", myid(), " threadid ", threadid(), ": ", path(f))
    fit!(h, map(x->x.v, HSV.(img)))
end

# combine them all into one histogram using the `merge` method from OnlineStats

@time h = reducevalues(merge, t1) |> exec # exec computes a lazy value

Plot the Histogram:

                                                 
    0.0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100034205   
   0.05  302199                                   
    0.1  666776                                   
   0.15  378473                                   
    0.2  864297                                   
   0.25  1053490                                  
    0.3  602937                                   
   0.35  667619                                   
    0.4  1573476                                  
   0.45  949928                                   
    0.5 ┤■ 2370727                                 
   0.55  1518383                                  
    0.6 ┤■ 3946507                                 
   0.65 ┤■■ 6114414                                
    0.7 ┤■ 4404784                                 
   0.75 ┤■■ 5920436                                
    0.8 ┤■■■■■■ 20165086                           
   0.85 ┤■■■■■■ 19384068                           
    0.9 ┤■■■■■■■■■■■■■■■■■■■■■■ 77515666           
   0.95 ┤■■■■■■■ 23816529                          
                                                 

This takes about 100 seconds.

At any point in time the whole computation holds at most 40 files in memory, because there are 40 computing elements (4 threads × 10 processes). The scheduler also takes care of freeing any memory that it knows will not be used after the result is computed. This means you can work on data that does not fit in memory as a whole.
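The lazy scheduling described above can be tried at toy scale with an in-memory tree (API usage here mirrors the examples in this README; no actual files are created):

```julia
using FileTrees

# an empty in-memory tree, then four files with attached values 1..4
t = maketree("toy" => [])
for i in 1:4
    t = touch(t, "$i.txt"; value=i)
end

# lazy map over the values -- nothing is computed yet
t1 = FileTrees.mapvalues(x -> x^2, t, lazy=true)

# reducevalues builds a lazy reduction; exec actually runs it
total = exec(reducevalues(+, t1))  # 1 + 4 + 9 + 16 = 30
```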

See the docs →
