
shirosaidev / saisoku

License: Apache-2.0
Saisoku is a Python module that helps you build complex pipelines of batch file/directory transfer/sync jobs.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to saisoku

ob bulkstash
Bulk Stash is a Docker rclone service to sync, or copy, files between different storage services. For example, you can copy files to or from a remote storage service such as Amazon S3 or Google Cloud Storage, or from your laptop to remote storage.
Stars: ✭ 113 (+182.5%)
Mutual labels:  sync, s3, rclone, data-pipeline
Rclone
"rsync for cloud storage" - Google Drive, S3, Dropbox, Backblaze B2, One Drive, Swift, Hubic, Wasabi, Google Cloud Storage, Yandex Files
Stars: ✭ 30,541 (+76252.5%)
Mutual labels:  sync, s3, rclone
Luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Stars: ✭ 15,226 (+37965%)
Mutual labels:  scheduling, orchestration-framework, luigi
S4
🔄 Fast and cheap synchronisation of files using Amazon S3
Stars: ✭ 69 (+72.5%)
Mutual labels:  sync, s3
Drone Cache
A Drone plugin for caching current workspace files between builds to reduce your build times
Stars: ✭ 194 (+385%)
Mutual labels:  pipeline, s3
serverless-s3bucket-sync
Serverless Plugin to sync local folders with an S3 bucket
Stars: ✭ 24 (-40%)
Mutual labels:  sync, s3
docker-aws-s3-sync
Docker container to sync a folder to Amazon S3
Stars: ✭ 21 (-47.5%)
Mutual labels:  sync, s3
docker base images
Vlad's Base Images for Docker
Stars: ✭ 61 (+52.5%)
Mutual labels:  sync, s3
PyFiSync
Python (+ rsync or rclone) based intelligent file sync with automatic backups and file move/delete tracking.
Stars: ✭ 88 (+120%)
Mutual labels:  sync, rclone
Cloudexplorer
Cloud Explorer
Stars: ✭ 170 (+325%)
Mutual labels:  sync, s3
Docker S3 Volume
Docker container with a data volume from s3.
Stars: ✭ 166 (+315%)
Mutual labels:  sync, s3
S3sync
Really fast sync tool for S3
Stars: ✭ 224 (+460%)
Mutual labels:  sync, s3
Scrapy S3pipeline
Scrapy pipeline to store chunked items into Amazon S3 or Google Cloud Storage bucket.
Stars: ✭ 57 (+42.5%)
Mutual labels:  pipeline, s3
re-mote
Re-mote operations using SSH and Re-gent
Stars: ✭ 61 (+52.5%)
Mutual labels:  pipeline, scheduling
nifi
Deploy a secured, clustered, auto-scaling NiFi service in AWS.
Stars: ✭ 37 (-7.5%)
Mutual labels:  pipeline, s3
datajob
Build and deploy a serverless data pipeline on AWS with no effort.
Stars: ✭ 101 (+152.5%)
Mutual labels:  pipeline, data-pipeline
rclone-drive
☁️ Simple web cloud storage based on rclone; transforms cloud storage (S3, Google Drive, OneDrive, Dropbox) into your own custom web-based storage
Stars: ✭ 30 (-25%)
Mutual labels:  s3, rclone
acid-store
A library for secure, deduplicated, transactional, and verifiable data storage
Stars: ✭ 48 (+20%)
Mutual labels:  s3, rclone
Akubra
Simple solution to keep independent S3 storages in sync
Stars: ✭ 79 (+97.5%)
Mutual labels:  sync, s3
aws-pdf-textract-pipeline
🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS textract. Built with AWS CDK + TypeScript
Stars: ✭ 141 (+252.5%)
Mutual labels:  s3, data-pipeline

saisoku - Fast file transfer orchestration pipeline


Saisoku is a Python (2.7 and 3.6 tested) package that helps you build complex pipelines of batch file/directory transfer/sync jobs. It supports threaded transferring of files locally, over network mounts, or via HTTP. With Saisoku you can also transfer files to and from AWS S3 buckets, sync directories using Rclone, and keep directories in sync in "real-time" with Watchdog.

Saisoku includes a Transfer Server and Client which support copying over TCP sockets.

Saisoku uses Luigi for task management and its web UI. To learn more about Luigi, see its GitHub repo or Read the Docs.


Requirements

  • luigi
  • tornado
  • scandir
  • pyfastcopy
  • tqdm
  • requests
  • beautifulsoup4
  • boto3
  • watchdog

Install the above Python modules using pip

$ pip install -r requirements.txt

Download

$ git clone https://github.com/shirosaidev/saisoku.git
$ cd saisoku

Or download the latest release version

How to use

Start Luigi

Create a directory for Luigi's state file

$ mkdir /usr/local/var/luigi-server

Start the Luigi scheduler daemon in the foreground with

$ luigid --state-path=/usr/local/var/luigi-server/state.pickle

or in the background with

$ luigid --background --state-path=/usr/local/var/luigi-server/state.pickle --logdir=/usr/local/var/log

Luigi defaults to port 8082, so you can point your browser to http://localhost:8082 to access the web UI.

Configure Boto 3

If you are going to use the S3 copy Luigi tasks, first set up Boto 3 (the AWS SDK for Python) using the quick start instructions on the Boto 3 GitHub page.
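
For example, Boto 3 will pick up credentials from a ~/.aws/credentials file (the region goes in ~/.aws/config); the values below are placeholders

[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY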

Usage - Luigi tasks

Local/network mount copy

With the Luigi centralized scheduler running, we can send a copy-files task to Luigi

$ python run_luigi.py CopyFiles --src /source/path --dst /dest/path

See below for the different parameters for each Luigi task.
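
For example, a copy task with more worker threads (a sketch, assuming the CopyFiles task exposes a --threads parameter like the underlying ThreadedCopy class documented below)

$ python run_luigi.py CopyFiles --src /source/path --dst /dest/path --threads 8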

Tarball package copy

To run a copy-package task, which creates a tar.gz (gzipped tarball) of all files at src and copies it to dst

$ python run_luigi.py CopyFilesPackage --src /source/path --dst /dest/path

HTTP copy

Start up two Saisoku HTTP servers; GET requests from Saisoku clients will be load balanced across them.

$ python saisoku_server.py --httpserver -p 5005 -d /src/dir
$ python saisoku_server.py --httpserver -p 5006 -d /src/dir

This will create an index.html file on http://localhost:5005 serving up the files in /src/dir.

To send an HTTP copy-files task to Luigi

$ python run_luigi.py CopyFilesHTTP --src http://localhost --dst /dest/path --ports [5005,5006] --threads 2

S3 copy

To copy a local file to an S3 bucket

$ python run_luigi.py CopyLocalFileToS3 --src /source/file --dst s3://bucket/foo/bar

To copy an S3 bucket object to a local file

$ python run_luigi.py CopyS3FileToLocal --src s3://bucket/foo/bar --dst /dest/file

Rclone sync

Saisoku can use Rclone to sync directories, etc. First, make sure you have Rclone installed and in your PATH.

To do a dry-run sync from source to dest using Rclone (the default cmdargs include --dry-run):

$ python run_luigi.py SyncDirsRclone --src /source/path --dst /dest/path

To do an actual sync from source to dest, override cmdargs so --dry-run is dropped

$ python run_luigi.py SyncDirsRclone --src /source/path --dst /dest/path --cmdargs '["-vv"]'

To change the subcommand that Rclone uses (default is sync)

$ python run_luigi.py SyncDirsRclone --src /source/path --dst /dest/path --command 'subcommand'
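
For example, to have Rclone copy instead of sync (Rclone's copy subcommand transfers files without deleting anything at the destination)

$ python run_luigi.py SyncDirsRclone --src /source/path --dst /dest/path --command 'copy'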

Watchdog directory sync

Saisoku can use watchdog to keep directories synced in "real-time". First, make sure you have rsync installed and in your PATH.

To keep directories in sync from source to dest using Watchdog

$ python run_luigi.py SyncDirsWatchdog --src /source/path --dst /dest/path

Usage - Server -> Client transfer

Start up Saisoku Transfer server listening on all interfaces on port 5005 (default)

$ python saisoku_server.py --host 0.0.0.0 -p 5005

Run the client to download a file from the server

$ python saisoku_client.py --host 192.168.2.3 -p 5005 /path/to/file

Log file

Saisoku output is logged to a saisoku.log file in the directory set by the TEMP/TMPDIR environment variable.
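
To find the log from Python (a sketch, assuming the log sits in the directory Python's tempfile module reports)

>>> import os, tempfile
>>> os.path.join(tempfile.gettempdir(), 'saisoku.log')  # e.g. '/tmp/saisoku.log' on Linux/macOS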

Using the saisoku module in Python

ThreadedCopy

Saisoku's ThreadedCopy class requires two parameters:

src source directory containing files you want to copy

dst destination directory of where you want the files to go (directory will be created if not there already)

Optional parameters:

filelist optional text file containing one filename per line (not full paths) of files in the src directory

ignore optional list of filename patterns to ignore, e.g. ['*.pyc', 'tmp*']

threads number of worker copy threads (default 16)

symlinks copy symlinks (default False)

copymeta copy file stat info (default True)

>>> from saisoku import ThreadedCopy

>>> ThreadedCopy(src='/source/dir', dst='/dest/dir', filelist='filelist.txt')
calculating total file size..
100%|██████████████████████████████████████████████████████████| 173/173 [00:00<00:00, 54146.30files/s]
copying 173 files..
100%|██████████████████████████████████████████████| 552M/552M [00:06<00:00, 97.6MB/s, file=dk-9.4.zip]
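
The optional parameters can be combined; a sketch with example ignore patterns and thread count

>>> ThreadedCopy(src='/source/dir', dst='/dest/dir', ignore=['*.pyc', 'tmp*'], threads=8)  # example values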

ThreadedHTTPCopy

Saisoku's ThreadedHTTPCopy class requires two parameters:

src source HTTP Tornado server (tserv) serving a directory of files you want to copy

dst destination directory of where you want the files to go (directory will be created if not there already)

Optional parameters:

threads number of worker copy threads (default 1)

ports Tornado server (tserv) ports; requests will be load balanced across these ports (default [5000])

fetchmode file get mode, either requests or urlretrieve (default urlretrieve)

chunksize chunk size for requests fetchmode (default 8192)

>>> from saisoku import ThreadedHTTPCopy

>>> ThreadedHTTPCopy('http://localhost', '/dest/dir')
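
A sketch combining the optional parameters to load balance across the two example servers started earlier, fetching with requests

>>> ThreadedHTTPCopy('http://localhost', '/dest/dir', threads=2, ports=[5005, 5006], fetchmode='requests')  # example values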

Rclone

Saisoku's Rclone class requires two parameters:

src source directory of files you want to sync

dst destination directory of where you want the files to go

Optional parameters:

def __init__(self, src, dst, flags=[], command='sync', cmdargs=[]):

flags a list of Rclone flags (default [])

command subcommand you want Rclone to use (default sync)

cmdargs a list of command args to use (default ['--dry-run', '-vv'])

>>> from saisoku import Rclone

>>> Rclone('/src/dir', '/dest/dir')
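
A sketch overriding the defaults, using Rclone's copy subcommand and --checksum flag and dropping --dry-run from cmdargs

>>> Rclone('/src/dir', '/dest/dir', flags=['--checksum'], command='copy', cmdargs=['-vv'])  # example values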

Watchdog

Saisoku's Watchdog class requires two parameters:

src source directory of files you want to sync

dst destination directory of where you want the files to go

Optional parameters:

def __init__(self, src, dst, recursive, patterns, ignore_patterns, ignore_directories, case_sensitive)

recursive bool used for recursively checking all subdirectories for changes (default True)

patterns file name patterns to use when checking for changes (default *)

ignore_patterns file name patterns to ignore when checking for changes (default *)

ignore_directories bool used for ignoring directories (default False)

case_sensitive bool used for being case sensitive (default True)

>>> from saisoku import Watchdog

>>> Watchdog('/src/dir', '/dest/dir')
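
A sketch using the optional parameters, with hypothetical patterns

>>> Watchdog('/src/dir', '/dest/dir', patterns=['*.txt'], ignore_patterns=['tmp*'], ignore_directories=True)  # example values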

Patreon

If you are a fan of the project or using Saisoku in production, please consider becoming a Patron to help advance the project.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].