wimleers / Fileconveyor

Licence: unlicense
File Conveyor is a daemon written in Python to detect, process and sync files. In particular, it's designed to sync files to CDNs. Amazon S3 and Rackspace Cloud Files, as well as any Origin Pull or (S)FTP Push CDN, are supported. Originally written for my bachelor thesis at Hasselt University in Belgium.

Description

File Conveyor is designed to discover new, changed and deleted files via the operating system's built-in file system monitor. After the files have been discovered, they can optionally be processed by a chain of processors – you can easily write new ones yourself. After files have been processed, they can also optionally be transported to a server.

Discovery happens through inotify on Linux (with kernel >= 2.6.13), through FSEvents on Mac OS X (>= 10.5) and through polling on other operating systems.

Processors are simple Python scripts that can change the file's base name (it is impossible to change the path) and apply any sort of processing to the file's contents. Examples are image optimization and video transcoding.

Transporters are simple threaded abstractions around Django storage systems.

For a detailed description of the innards of File Conveyor, see my bachelor thesis text (find it via http://wimleers.com/tags/bachelor-thesis).

This application was written as part of the bachelor thesis [1] of Wim Leers at Hasselt University [2].

[1] http://wimleers.com/tags/bachelor-thesis
[2] http://uhasselt.be/

IMPORTANT WARNING

I've attempted to provide a solid enough README to get you started, but I'm well aware that it isn't superb. As this is just a bachelor thesis, time was fairly limited. I've opted to create a solid basis instead of an extremely rigorously documented piece of software. If you cannot find the answer in the README.txt, the INSTALL.txt or the API.txt files, then please look at my bachelor thesis text instead. If none of these is sufficient, then please contact me.

Upgrading

If you're upgrading from a previous version of File Conveyor, please run upgrade.py.

==============================================================================
| The basics |

Configuring File Conveyor

The sample configuration file (config.sample.xml) should be self-explanatory. Copy this file to config.xml, which is the file File Conveyor will look for, and edit it to suit your needs. For a detailed description, see my bachelor thesis text (look for the "Configuration file design" section).

Each rule consists of 3 components:

  • filter
  • processorChain
  • destinations

A rule can also be configured to delete source files after they have been synced to the destination(s).

The filter and processorChain components are optional. You must have at least one destination. If you want to use File Conveyor to process files locally, i.e. without transporting them to a server, then use the Symlink or Copy transporter (see below).
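The constraints above can be sketched as a small check. This is only an illustration under stated assumptions: the element names filter, processorChain and destinations come from this README, but the rule, processor and destination element names, their attributes, and the helper rule_is_valid are hypothetical. Consult config.sample.xml for the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule fragment; config.sample.xml defines the real schema.
RULE_XML = """
<rule>
  <processorChain>
    <processor name="unique_filename.MD5" />
  </processorChain>
  <destinations>
    <destination server="origin pull cdn" />
  </destinations>
</rule>
"""


def rule_is_valid(rule):
    """filter and processorChain are optional; one destination is required."""
    destinations = rule.find("destinations")
    return destinations is not None and len(list(destinations)) >= 1


print(rule_is_valid(ET.fromstring(RULE_XML)))
```

A rule without a filter passes this check, while a rule with an empty or missing destinations element does not.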

Starting File Conveyor

File Conveyor must be started by starting its arbitrator (which links everything together; it controls the file system monitor, the processor chains, the transporters and so on). You can start the arbitrator like this:

    python /path/to/fileconveyor/arbitrator.py

Stopping File Conveyor

File Conveyor listens to standard signals to know when it should end, just like the Apache HTTP server does. Send the TERMinate signal to terminate it:

    kill -TERM `cat ~/.fileconveyor.pid`
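If you prefer to stop File Conveyor from a Python script, the same step can be sketched as follows. The helper names read_pid and stop_daemon are hypothetical, and the sketch assumes the PID file contains nothing but the decimal process ID.

```python
import os
import signal


def read_pid(pid_file):
    """Return the process ID stored in a PID file."""
    with open(os.path.expanduser(pid_file)) as f:
        return int(f.read().strip())


def stop_daemon(pid_file="~/.fileconveyor.pid"):
    """Send SIGTERM to the process named in the PID file."""
    os.kill(read_pid(pid_file), signal.SIGTERM)
```

Calling stop_daemon() with no arguments targets the default PID file location mentioned above.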

You can configure File Conveyor to store the PID file in the more typical /var/run location on *nix:

  • You can change the PID_FILE setting in settings.py to /var/run/fileconveyor.pid. However, this requires File Conveyor to be run with root permissions (/var/run requires root permissions).
  • Alternatively, you can create a new directory in /var/run which then no longer requires root permissions. This can be achieved through these commands:
  1. sudo mkdir /var/run/fileconveyor
  2. sudo chown fileconveyor-user /var/run/fileconveyor
  3. sudo chmod 700 /var/run/fileconveyor

Then, you can change the PID_FILE setting in settings.py to /var/run/fileconveyor/fileconveyor.pid, and you won't need to run File Conveyor with root permissions anymore.

File Conveyor's behavior

Upon startup, File Conveyor starts the file system monitor and then performs a "manual" scan to detect changes since the last time it ran. If you've got a lot of files, this may take a while.

Just for fun, type the following while File Conveyor is syncing:

    killall -9 python

Now File Conveyor is dead. Upon starting it again, you should see something like:

    2009-05-17 03:52:13,454 - Arbitrator - WARNING - Setup: initialized 'pipeline' persistent queue, contains 2259 items.
    2009-05-17 03:52:13,455 - Arbitrator - WARNING - Setup: initialized 'files_in_pipeline' persistent list, contains 47 items.
    2009-05-17 03:52:13,455 - Arbitrator - WARNING - Setup: initialized 'failed_files' persistent list, contains 0 items.
    2009-05-17 03:52:13,671 - Arbitrator - WARNING - Setup: moved 47 items from the 'files_in_pipeline' persistent list into the 'pipeline' persistent queue.
    2009-05-17 03:52:13,672 - Arbitrator - WARNING - Setup: moved 0 items from the 'failed_files' persistent list into the 'pipeline' persistent queue.

As you can see, 47 items were still in the pipeline when File Conveyor was killed. They're now simply added to the pipeline queue again and they will be processed once again.
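The recovery step visible in that log can be sketched roughly as follows. This is not File Conveyor's actual code: the recover function is hypothetical, and plain in-memory containers stand in for the persistent queue and lists, which File Conveyor keeps in persistent_data.db.

```python
from collections import deque


def recover(pipeline_queue, files_in_pipeline, failed_files):
    """Requeue in-flight and failed items; return how many of each were moved."""
    moved_in_flight = len(files_in_pipeline)
    pipeline_queue.extend(files_in_pipeline)
    files_in_pipeline.clear()

    moved_failed = len(failed_files)
    pipeline_queue.extend(failed_files)
    failed_files.clear()
    return moved_in_flight, moved_failed


queue = deque(["a.css", "b.js"])   # files waiting to enter the pipeline
in_flight = ["logo.gif"]           # files that were being processed at the kill
failed = []                        # files whose processing or transport failed

print(recover(queue, in_flight, failed))  # (1, 0)
print(len(queue))                         # 3
```

Because interrupted items go back into the queue rather than being dropped, a crash costs repeated work but no lost changes.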

The initial sync

To get a feel for File Conveyor's speed, you may want to run it in the console and look at its output.

Verifying the synced files

Running the verify.py script will open the synced files database and verify that each synced file actually exists.

==============================================================================
| Processors |

Addressing processors

You can address a specific processor by first specifying its processor module and then the exact processor name (which is its class name):

  • unique_filename.MD5
  • image_optimizer.KeepMetadata
  • yui_compressor.YUICompressor
  • link_updater.CSSURLUpdater

But, it works with third-party processors too! Just make sure the third-party package is in the Python path and then you can just use this in config.xml:

  • MyProcessorPackage.SomeProcessorClass

Processor module: filename

Available processors:

  1. SpacesToUnderscores: changes a filename by replacing spaces with underscores. E.g.: this is a test.txt --> this_is_a_test.txt
  2. SpacesToDashes: changes a filename by replacing spaces with dashes. E.g.: this is a test.txt --> this-is-a-test.txt
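The renaming these two processors perform can be illustrated with a standalone function. This is a sketch only, not the processor API (see API.txt for that); the name rename_basename is hypothetical. It touches only the base name, matching the rule that a processor cannot change the path.

```python
import os


def rename_basename(path, old=" ", new="_"):
    """Replace characters in the base name only; the directory is untouched."""
    directory, basename = os.path.split(path)
    return os.path.join(directory, basename.replace(old, new))


print(rename_basename("this is a test.txt"))           # this_is_a_test.txt
print(rename_basename("this is a test.txt", new="-"))  # this-is-a-test.txt
```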

Processor module: unique_filename

Available processors:

  1. Mtime: changes a filename based on the file's mtime. E.g.: logo.gif --> logo_1240668971.gif
  2. MD5: changes a filename based on the file's MD5 hash. E.g.: logo.gif --> logo_2f0342a2b9aaf48f9e75aa7ed1d58c48.gif
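The naming scheme shown in these examples can be reproduced with a few lines of standard-library Python. The function names below are hypothetical and this is not the processors' actual code; it only shows how a suffix derived from the mtime or the MD5 hash ends up between the stem and the extension.

```python
import hashlib
import os


def with_suffix(path, suffix):
    """Insert _<suffix> between the file name's stem and its extension."""
    stem, ext = os.path.splitext(path)
    return "%s_%s%s" % (stem, suffix, ext)


def mtime_name(path):
    """Name based on the file's modification time (whole seconds)."""
    return with_suffix(path, int(os.path.getmtime(path)))


def md5_name(path):
    """Name based on the MD5 hash of the file's contents."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return with_suffix(path, digest)


print(with_suffix("logo.gif", 1240668971))  # logo_1240668971.gif
```

An MD5-based name changes only when the contents change, whereas an mtime-based name also changes when the file is merely touched.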

Processor module: image_optimizer

It's important to note that all metadata is stripped from JPEG images, as that is the most effective way to reduce the image size. However, this might also strip copyright information, i.e. this can also have legal consequences. Choose one of the "keep metadata" classes if you want to avoid this. When optimizing GIF images, they are converted to the PNG format, which also changes their filename.

Available processors:

  1. Max: optimizes image files losslessly (GIF, PNG, JPEG, animated GIF)
  2. KeepMetadata: same as Max, but keeps JPEG metadata
  3. KeepFilename: same as Max, but keeps the original filename (no GIF optimization)
  4. KeepMetadataAndFilename: same as Max, but keeps JPEG metadata and the original filename (no GIF optimization)

Processor module: yui_compressor

Warning: this processor is CPU-intensive! Since you typically don't get new CSS and JS files all the time, it's still fine to use this. But the initial sync may cause a lot of CSS and JS files to be processed and thereby cause a lot of load!

Available processors:

  1. YUICompressor: compresses .css and .js files with the YUI Compressor

Processor module: google_closure_compiler

Warning: this processor is CPU-intensive! Since you typically don't get new JS files all the time, it's still fine to use this. But the initial sync may cause a lot of JS files to be processed and thereby cause a lot of load!

Available processors:

  1. GoogleClosureCompiler: compresses .js files with the Google Closure Compiler

Processor module: link_updater

Warning: this processor is CPU-intensive! Since you typically don't get new CSS files all the time, it's still fine to use this. But the initial sync may cause a lot of CSS files to be processed and thereby cause a lot of load! Note that this processor will skip a CSS file if not all of the files referenced from it have been synced to the CDN yet. This means the CSS files may need to be parsed over and over again until the images have been synced.

This processor is a good candidate for optimization. It uses the cssutils Python module, which validates every CSS property. This is an enormous slowdown: on a 2.66 GHz Core 2 Duo, it causes 100% CPU usage every time it runs. This module also seems to suffer from rather massive memory leaks. Memory usage can easily top 30 MB on Mac OS X, where it would never go over 17 MB without this processor!

This processor will replace all URLs in CSS files with references to their counterparts on the CDN. There are a couple of important gotchas when using this processor module:

  • absolute URLs (http://, https://) are ignored, only relative URLs are processed
  • if a referenced file doesn't exist, its URL will remain unchanged
  • if one of the referenced images or fonts is changed and therefore resynced, and if it is configured to have a unique filename, the CDN URL referenced from the updated CSS file will no longer be valid. Therefore, when you update an image file or font file that is referenced by CSS files, you should modify the CSS files as well. Just modifying the mtime (by using the touch command) is sufficient.
  • it requires the referenced files to be synced to the same server the CSS file is being synced to. This implies that all the referenced files must also be synced to the same server, or the CSS file will never get synced!
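The first two gotchas can be illustrated with a simplified sketch. Note the swap: this uses a regular expression rather than cssutils, so it is not how the processor is implemented, and it does not model the "skip the file until everything is synced" behaviour. The function update_urls and the cdn_urls mapping are hypothetical.

```python
import re

# Matches url(...), with optional single or double quotes around the URL.
URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")


def update_urls(css, cdn_urls):
    """Rewrite relative url(...) references using a {relative_url: cdn_url} map."""
    def replace(match):
        quote, url = match.group(1), match.group(2)
        if url.startswith(("http://", "https://")):
            return match.group(0)         # absolute URLs are ignored
        new_url = cdn_urls.get(url, url)  # unknown files keep their URL
        return "url(%s%s%s)" % (quote, new_url, quote)
    return URL_RE.sub(replace, css)


css = "body { background: url(images/bg.png); }"
mapping = {"images/bg.png": "http://cdn.example.com/images/bg_1240668971.png"}
print(update_urls(css, mapping))
```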

Available processors:

  1. CSSURLUpdater: replaces URLs in .css files with their counterparts on the CDN

==============================================================================
| Transporters |

Addressing transporters

You can address a specific transporter by only specifying its module:

  • cf
  • ftp
  • cloudfiles
  • s3
  • sftp
  • symlink_or_copy

But, it works with third-party transporters too! Just make sure the third-party package is in the Python path and then you can just use this in config.xml:

  • MyTransporterPackage

Transporter: FTP (ftp)

Value to enter: "ftp".

Available settings:

  • host
  • username
  • password
  • url
  • port
  • path
  • key

Transporter: SFTP (sftp)

Value to enter: "sftp".

Available settings:

  • host
  • username
  • password
  • url
  • port
  • path

Transporter: Amazon S3

Value to enter: "s3".

Available settings:

  • access_key_id
  • secret_access_key
  • bucket_name
  • bucket_prefix

Using more than 4 concurrent connections does not yield a significant speedup.

Transporter: Amazon CloudFront

Value to enter: "cf".

Available settings:

  • access_key_id
  • secret_access_key
  • bucket_name
  • bucket_prefix
  • distro_domain_name

Transporter: Rackspace Cloud Files

Value to enter: "cloudfiles".

Available settings:

  • username
  • api_key
  • container

Transporter: Symlink or Copy

Value to enter: "symlink_or_copy".

Available settings:

  • location
  • url

Transporter: Amazon CloudFront - Creating a CloudFront distribution

You can either use the S3Fox Firefox add-on to create a distribution or use the included Python function to do so. In the latter case, do the following:

    >>> import sys
    >>> sys.path.append('/path/to/fileconveyor/transporters')
    >>> sys.path.append('/path/to/fileconveyor/dependencies')
    >>> from transporter_cf import create_distribution
    >>> create_distribution("access_key_id", "secret_access_key", "bucketname.s3.amazonaws.com")
    Created distribution
    - domain name: dqz4yxndo4z5z.cloudfront.net
    - origin: bucketname.s3.amazonaws.com
    - status: InProgress
    - comment:
    - id: E3FERS845MCNLE

    Over the next few minutes, the distribution will become active. This
    function will keep running until that happens.
    ............................
    The distribution has been deployed!

==============================================================================
| The advanced stuff |

Constants in Arbitrator.py

The following constants can be tweaked to change where File Conveyor stores its files, or to change its behavior.

RESTART_AFTER_UNHANDLED_EXCEPTION = True
    Whether File Conveyor should restart itself after it has encountered an unhandled exception (i.e., a bug).

RESTART_INTERVAL = 10
    How long File Conveyor should wait before restarting itself after it has encountered an unhandled exception. Thus, this setting only has an effect when RESTART_AFTER_UNHANDLED_EXCEPTION == True.

LOG_FILE = './fileconveyor.log'
    The log file.

PERSISTENT_DATA_DB = './persistent_data.db'
    Where to store persistent data (pipeline queue, 'files in pipeline' list and 'failed files' list).

SYNCED_FILES_DB = './synced_files.db'
    Where to store the input_file, transported_file_basename, url and server for each synced file.

WORKING_DIR = '/tmp/fileconveyor'
    The working directory.

MAX_FILES_IN_PIPELINE = 50
    The maximum number of files in the pipeline. Should be high enough to prevent transporters from idling too long.

MAX_SIMULTANEOUS_PROCESSORCHAINS = 1
    The maximum number of processor chains that may be executed simultaneously. If you've got CPU-intensive processors and you're running File Conveyor on the web server, you'll want to keep this very low, probably at 1.

MAX_SIMULTANEOUS_TRANSPORTERS = 10
    The maximum number of transporters that may be running simultaneously. This effectively caps the number of simultaneous connections. It can also be used to have some, although limited, control over the throughput consumed by the transporters.

MAX_TRANSPORTER_QUEUE_SIZE = 1
    The maximum number of files queued for each transporter. It's recommended to keep this low enough to ensure files are not waiting unnecessarily. If you set this too high, no new transporters will be spawned, because all files will be queued on the existing transporters. Setting this to 0 can only be recommended in environments with a continuous stream of files that need syncing. The default of 1 ensures each transporter idles as little as possible.

QUEUE_PROCESS_BATCH_SIZE = 20
    The number of files that will be processed when processing one of the many queues. Setting this too low will cause overhead. Setting this too high will cause delays for files that are ready to be processed or transported. See the "Pipeline design pattern" section in my bachelor thesis text.

CALLBACKS_CONSOLE_OUTPUT = False
    Controls whether output will be generated for each callback. (There are callbacks for the file system monitor, processor chains and transporters.)

CONSOLE_LOGGER_LEVEL = logging.WARNING
    Controls the output level of the logging to the console. For a full list of possibilities, see http://docs.python.org/release/2.6/library/logging.html#logging-levels.

FILE_LOGGER_LEVEL = logging.DEBUG
    Controls the output level of the logging to the log file. For a full list of possibilities, see http://docs.python.org/release/2.6/library/logging.html#logging-levels.

RETRY_INTERVAL = 30
    Sets the interval at which the 'failed files' list is appended to the pipeline queue, to retry syncing these failed files.

Understanding persistent_data.db

We'll go through this by using a sample database I created. You should be able to reproduce similar output on your persistent_data.db file using the exact same commands. Access the database by using the SQLite console application.

    $ sqlite3 persistent_data.db
    SQLite version 3.6.11
    Enter ".help" for instructions
    Enter SQL statements terminated with a ";"
    sqlite>

As you can see, there are three tables in the database, one for every persistent data structure:

    sqlite> .table
    failed_files_list  pipeline_list  pipeline_queue

Simple count queries show how many items there are in each persistent data structure. In this case for example, there are 2560 files waiting to enter the pipeline, 50 were in the pipeline at the time of stopping File Conveyor (these will be added to the queue again once we restart File Conveyor) and 0 files are in the list of failed files. Files end up in there when their processor chain or (one of) their transporters fails.

    sqlite> SELECT COUNT(*) FROM pipeline_queue;
    2560
    sqlite> SELECT COUNT(*) FROM pipeline_list;
    50
    sqlite> SELECT COUNT(*) FROM failed_files_list;
    0

You can also look at the database schemas of these tables:

    sqlite> .schema pipeline_queue
    CREATE TABLE pipeline_queue(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);
    sqlite> .schema pipeline_list
    CREATE TABLE pipeline_list(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);
    sqlite> .schema failed_files_list
    CREATE TABLE failed_files_list(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);

As you can see, the three tables have identical schemas. The type for the stored item is 'pickle', which means that you can store any Python object in there as long as it can be "pickled", which roughly means "convertible to a string representation". "Serialization" is the term PHP developers have given to this, although pickling is much more advanced. The Python object stored in there is the same for all three tables: a tuple of the filename (as a string) and the event (as an integer). The event is one of FSMonitor.CREATED, FSMonitor.MODIFIED, FSMonitor.DELETED.
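To make the storage format concrete, here is a minimal sketch that stores and reads back such a tuple with the standard pickle and sqlite3 modules. Several things are assumptions: the integer values of the event constants are made up (they are not taken from FSMonitor), the helper names enqueue and peek are hypothetical, and the pickled bytes are written directly rather than through whatever adapter File Conveyor itself registers for the 'pickle' column type.

```python
import pickle
import sqlite3

# Hypothetical event values for illustration; not FSMonitor's real constants.
CREATED, MODIFIED, DELETED = 1, 2, 4

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pipeline_queue("
             "id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle)")


def enqueue(conn, filename, event):
    """Append a pickled (filename, event) tuple to the queue table."""
    conn.execute("INSERT INTO pipeline_queue (item) VALUES (?)",
                 (pickle.dumps((filename, event)),))


def peek(conn):
    """Return the oldest queued tuple without removing it, or None."""
    row = conn.execute(
        "SELECT item FROM pipeline_queue ORDER BY id LIMIT 1").fetchone()
    return pickle.loads(row[0]) if row else None


enqueue(conn, "/htdocs/logo.gif", MODIFIED)
print(peek(conn))  # ('/htdocs/logo.gif', 2)
```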

This file is what tracks the current state of File Conveyor. Thanks to this file, it is possible for File Conveyor to crash and not lose any data. Deleting this file would cause File Conveyor to lose all of its current work. Only new changes in the file system (as in: those made after the file was deleted) would be picked up. Changes that still had to be synced would be forgotten.

Understanding fsmonitor.db

This database has a single table: pathscanner (which is inherited from the pathscanner module around which the fsmonitor module is built). Its schema is:

    sqlite> .schema pathscanner
    CREATE TABLE pathscanner(path text, filename text, mtime integer);

This file is what tracks the current state of the directory tree associated with each source. When an operating system's file system monitor is used, this database will be updated through its callbacks. When no such file system monitor is available, it will be updated through polling. Deleting this file would cause File Conveyor to have to sync all files again.
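The polling fallback can be pictured as a comparison between the directory tree and the pathscanner table. The sketch below is a simplified illustration, not the pathscanner module's code: the scan function and the event names are hypothetical, and it assumes the table only holds entries for the single root being scanned.

```python
import os
import sqlite3


def scan(conn, root):
    """Compare the tree under root with the pathscanner table; return events."""
    known = {(path, name): mtime for path, name, mtime in
             conn.execute("SELECT path, filename, mtime FROM pathscanner")}
    seen, events = set(), []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            key = (dirpath, name)
            mtime = int(os.path.getmtime(os.path.join(dirpath, name)))
            seen.add(key)
            if key not in known:
                events.append((os.path.join(dirpath, name), "created"))
                conn.execute("INSERT INTO pathscanner VALUES (?, ?, ?)",
                             (dirpath, name, mtime))
            elif known[key] != mtime:
                events.append((os.path.join(dirpath, name), "modified"))
                conn.execute("UPDATE pathscanner SET mtime = ? "
                             "WHERE path = ? AND filename = ?",
                             (mtime, dirpath, name))
    # Anything recorded earlier but no longer on disk has been deleted.
    for path, name in set(known) - seen:
        events.append((os.path.join(path, name), "deleted"))
        conn.execute("DELETE FROM pathscanner WHERE path = ? AND filename = ?",
                     (path, name))
    return events
```

With an empty table every file is reported as created, which is consistent with the statement that deleting fsmonitor.db forces all files to be synced again.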

Understanding synced_files.db

We'll go through this by using a sample database I created. You should be able to reproduce similar output on your synced_files.db file using the exact same commands. Access the database by using the SQLite console application.

    $ sqlite3 synced_files.db
    SQLite version 3.6.11
    Enter ".help" for instructions
    Enter SQL statements terminated with a ";"
    sqlite>

As you can see, there's only one table: synced_files.

    sqlite> .table
    synced_files

Let's look at the schema. There are 4 fields: input_file, transported_file_basename, url and server. input_file is the full path. transported_file_basename is the base name of the file that was transported to the server. This is stored because the filename might have been altered by the processors that have been applied to it, but the path cannot change. I use this to delete the previous version of a file if a file has been modified. The url field is of course the URL to retrieve the file from the server. Finally, the server field contains the name you've assigned to the server in the configuration file. Each file may be synced to multiple servers and this allows you to check if a file has been synchronized to a specific server.

    sqlite> .schema synced_files
    CREATE TABLE synced_files(input_file text, transported_file_basename text, url text, server text);

We can again use simple count queries to learn more about the synced files. As you can see, 845 files have been synced, of which 602 have been synced to the server that was named "origin pull cdn" and 243 to the server that was named "ftp push cdn".

    sqlite> SELECT COUNT(*) FROM synced_files;
    845
    sqlite> SELECT COUNT(*) FROM synced_files WHERE server="origin pull cdn";
    602
    sqlite> SELECT COUNT(*) FROM synced_files WHERE server="ftp push cdn";
    243
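The same per-server counts can be obtained in one query from Python. This is a small sketch using the standard sqlite3 module and the schema shown above; the function name synced_counts and the sample rows are made up for illustration, and an in-memory database stands in for your synced_files.db.

```python
import sqlite3


def synced_counts(conn):
    """Return {server: number of synced files} from the synced_files table."""
    rows = conn.execute(
        "SELECT server, COUNT(*) FROM synced_files GROUP BY server")
    return dict(rows.fetchall())


# In-memory stand-in with made-up rows; point connect() at your own
# synced_files.db to inspect real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synced_files(input_file text, "
             "transported_file_basename text, url text, server text)")
conn.executemany("INSERT INTO synced_files VALUES (?, ?, ?, ?)", [
    ("/htdocs/logo.gif", "logo.gif",
     "http://cdn.example.com/logo.gif", "origin pull cdn"),
    ("/htdocs/style.css", "style.css",
     "http://cdn.example.com/style.css", "origin pull cdn"),
    ("/htdocs/video.flv", "video.flv",
     "http://ftp.example.com/video.flv", "ftp push cdn"),
])
print(synced_counts(conn))
```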

License

This application is dual-licensed under the GPL and the UNLICENSE.

Due to the dependencies that were initially included within File Conveyor, which were all subject to GPL-compatible licenses, it made sense to initially release the source code under the GPL. Then, it was decided the UNLICENSE was a better fit.

Author

Wim Leers ~ http://wimleers.com/

This application was written as part of the bachelor thesis of Wim Leers at Hasselt University.