All Projects → novoid → guess-filename.py

novoid / guess-filename.py

Licence: GPL-3.0 License
Derive a file name according to old file name cues and/or PDF file content

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to guess-filename.py

Hypertag
Knowledge Management for Humans using Machine Learning & Tags
Stars: ✭ 116 (+329.63%)
Mutual labels:  tags, tagging, file
file access
File.io for Web, Desktop and Mobile for Flutter
Stars: ✭ 59 (+118.52%)
Mutual labels:  files, file
org-contacts2vcard
Converting Emacs Org-mode org-contacts contact information to VCard format suitable for importing to Android 4.4
Stars: ✭ 21 (-22.22%)
Mutual labels:  pim, personal-information-management
additional tags
Redmine Plugin for adding tags functionality to issues and wiki pages.
Stars: ✭ 25 (-7.41%)
Mutual labels:  tags, tagging
PHP-FileUpload
Simple and convenient file uploads — secure by default
Stars: ✭ 53 (+96.3%)
Mutual labels:  files, file
tagify
Tagify produces a set of tags from a given source. Source can be either an HTML page, a Markdown document or a plain text. Supports English, Russian, Chinese, Hindi, Spanish, Arabic, Japanese, German, Hebrew, French and Korean languages.
Stars: ✭ 24 (-11.11%)
Mutual labels:  tags, tagging
preact-token-input
🔖 A text field that tokenizes input, for things like tags.
Stars: ✭ 57 (+111.11%)
Mutual labels:  tags, tagging
Uploadcare Widget
Uploadcare Widget, an ultimate tool for HTML5 file upload supporting multiple file upload, drag&drop, validation by file size/file extension/MIME file type, progress bar for file uploads, image preview.
Stars: ✭ 183 (+577.78%)
Mutual labels:  files, file
notes
我的笔记
Stars: ✭ 21 (-22.22%)
Mutual labels:  pim, personal-information-management
Voila
Voila is a domain-specific language launched through CLI tool for operating with files and directories in massive amounts in a fast & reliable way.
Stars: ✭ 78 (+188.89%)
Mutual labels:  files, file
files
Useful methods to manage files and directories
Stars: ✭ 27 (+0%)
Mutual labels:  files, file
Com2Kube
Web application that convert docker-compose files to kubernetes files
Stars: ✭ 26 (-3.7%)
Mutual labels:  files, file
offPIM
Decentralized, Offline-first, Personal Information Manager (PIM) using PouchDB/CouchDB. Includes task-, note-, and contact-management, as well as journaling.
Stars: ✭ 63 (+133.33%)
Mutual labels:  pim, personal-information-management
SSCTaglistView
Customizable iOS tag list view, in Swift.
Stars: ✭ 54 (+100%)
Mutual labels:  tags, tagging
Copy Webpack Plugin
Copy files and directories with webpack
Stars: ✭ 2,679 (+9822.22%)
Mutual labels:  files, file
tag-picker
Better tags input interaction with JavaScript.
Stars: ✭ 27 (+0%)
Mutual labels:  tags, tagging
tagflow
TagFlow is a file manager working with tags.
Stars: ✭ 22 (-18.52%)
Mutual labels:  tags, tagging
File Storage
File storage abstraction for Yii2
Stars: ✭ 116 (+329.63%)
Mutual labels:  files, file
React Files
A file input (dropzone) management component for React
Stars: ✭ 126 (+366.67%)
Mutual labels:  files, file
lumen-file-manager
File manager module for the Lumen PHP framework.
Stars: ✭ 40 (+48.15%)
Mutual labels:  files, file

guessfilename.py

This Python script tries to come up with a new file name for each file from command line argument.

It does this with several methods: first, the current file name is analyzed and any ISO date/timestamp and filetags are re-used. Secondly, if the parsing of the file name did not lead to any new file name, the content of the file is analyzed. Following file types are supported by now:

  • PDF files

The script accepts an arbitrary number of files (see your shell for possible length limitations).

Why

I do scan almost all paper mail. Many of those documents are sent to me regularily. Such documents are bills or insurance informations, for example.

Being too lazy to name those files manually with high chances of getting many variants for the same document type, I came up with a method to derive file names from either the old file name (cues I enter without knowing the exact target file name) or the file content.

Analyzing the content enables this script to recognize bills via customer numbers or phone numbers, amounts to pay, and so on.

Examples

Here are some examples that demonstrate the purpose of this script. The generated file names are following my file name convention.

For better user experience, I like to define an abbreviation in my shell which also makes the examples easier to read:

alias gf=guessfilename.py

A very simple example is a simple bill:

gf "2016-03-05 phone 12,34 €.pdf"
 → "2016-03-05 COMPANY landline 12,34€ -- scan bill.pdf"

Some mobile apps generate weird formatted file names. Here is some recording:

gf "rec_20171129-0902 A nice recording .wav"
 → "2017-11-29T09.02 A nice recording.wav"

Android screenshot files tend to look like that:

gf "Screenshot_2017-11-29_10-32-12.png"
 → "2017-11-29T10.32.12 -- screenshots.png"

Android photographs are handled similarily:

gf "IMG_20190118_133928.jpg"
 → "2019-01-18T13.39.28.jpg"

Files saved from Signal do have strange default names as well:

gf "signal-2018-03-08-102332.jpg"
 → "2018-03-08T10.23.32.jpg"

Many companies like to generate really silly file names. This is from my bank:

gf "C110014365208EUR20150930001.pdf"
 → "2015-09-30 Bank statement 2015-001 10014365208.pdf"

This script is able to parse content of PDF file in order to get meta-data to generate the new file name. This can be applied to you salary, for example:

gf "2020-03-04 salary.pdf"
 → "2020-02-29 MYCOMPANY salary for February 1234,56€ -- finance.pdf"

As you can see, “guessfilename” makes your digital life easier when you do have recurring file rename tasks.

Usage

guessfilename --help
Usage:
    guessfilename [<options>] <list of files>

This little Python script tries to rename files according to pre-defined rules.

It does this with several methods: first, the current file name is analyzed and
any ISO date/timestamp and filetags are re-used. Secondly, if the parsing of the
file name did not lead to any new file name, the content of the file is analyzed.

You have to adapt the rules in the Python script to meet your requirements.
The default rule-set follows the filename convention described on
http://karl-voit.at/managing-digital-photographs/


:copyright: (c) by Karl Voit
:license: GPL v3 or any later version
:URL: https://github.com/novoid/guess-filename.py
:bugreports: via github or <[email protected]>


Options:
  -h, --help     show this help message and exit
  -d, --dryrun   enable dryrun mode: just simulate what would happen, do not
                 modify files
  -v, --verbose  enable verbose mode
  -q, --quiet    enable quiet mode
  --version      display version and exit

Pixel Images and Videos

I added handling for my Pixel 4a camera results: JPEG images and MP4 videos.

Due to a somewhat messy meta data situation I had to use the File:FileModifyDate Exif meta-data in order to get time-stamps from the local time zone. If you happen to apply guessfilename after modifying the file due to copying or editing, you will get wrong time-stamps. Therefore, use Syncthing or similar synchronzation tools that preserve file modification time to get the files from the mobile to your computer. Apply guessfilename before modifying the files any further.

Furthermore, you will need to install ExifTool as an external dependency. I was not able to find a Python-only Exif library that provided me read access to advanced Exif values the Pixel is using.

MediathekView

When downloading TV shows using MediathekView, you should use the following download pattern:

  • MediathekView v11:
    %DT%d %s - %t - %T -ORIGINAL- %N.mp4
        
  • MediathekView v13:
    • Einstellungen > Aufzeichnen und Abspielen > Set bearbeiten
      • [Set-Name] > Hilfsprogramme:
        • ffmpeg > Zieldateiname > %DT%d %s - %t - %T -ORIGINALhd- %N.mp4
        • ffmpeg > Schalter > -user_agent "Mozilla" -i %f -c copy -bsf:a aac_adtstoasc **

When applying guessfilename on the resulting files, you will get something like this:

20180509T235000 ORF - ZIB 24 - Auswirkungen nach US-Aus für Atomdeal -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Auswirkungen-na__13976363__o__1735069995__s14297628_8__BCK1HD_23514710P_23540405P_Q4A.mp4  ...
    →  2018-05-09T23.51.47 ORF - ZIB 24 - Auswirkungen nach US-Aus für Atomdeal -- lowquality.mp4

20180509T235000 ORF - ZIB 24 - Hirntoter Bub plötzlich aufgewacht -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Hirntoter-Bub-p__13976363__o__5119815115__s14297631_1__BCK1HD_00045915P_00072303P_Q4A.mp4  ...
    →  2018-05-09T00.04.59 ORF - ZIB 24 - Hirntoter Bub plötzlich aufgewacht -- lowquality.mp4

20180509T235000 ORF - ZIB 24 - Meldungen -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Meldungen__13976363__o__1117657593__s14297632_2__BCK1HD_00072303P_00085816P_Q4A.mp4  ...
    →  2018-05-09T00.07.23 ORF - ZIB 24 - Meldungen -- lowquality.mp4

20180509T235000 ORF - ZIB 24 - Neuerung bei Filmfestspielen in Cannes -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Neuerung-bei-Fi__13976363__o__1941003027__s14297634_4__BCK1HD_00085816P_00111715P_Q4A.mp4  ...
    →  2018-05-09T00.08.58 ORF - ZIB 24 - Neuerung bei Filmfestspielen in Cannes -- lowquality.mp4

20180509T235000 ORF - ZIB 24 - Trumps CIA-Kandidatin umstritten -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Trumps-Kandidat__13976363__o__1488806017__s14297630_0__BCK1HD_00020922P_00045915P_Q4A.mp4  ...
    →  2018-05-09T00.02.09 ORF - ZIB 24 - Trumps CIA-Kandidatin umstritten -- lowquality.mp4

20180509T235000 ORF - ZIB 24 - Wetter -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Wetter__13976363__o__2966973785__s14297635_5__BCK1HD_00111715P_00120000P_Q4A.mp4  ...
    →  2018-05-09T00.11.17 ORF - ZIB 24 - Wetter -- lowquality.mp4

As you can see, the temporal order of the chunks is extracted so that the files are in their correct order.

Please note that this does not work with a show whose chunks do cross midnight since the date is always taken from the start of the show and the time from the actual time being shown.

.info.json Meta-Data Files

If you do download a media file and its associated separate .info.json file (both base-names without file extension need to match), this tool is able to parse the meta-data to derive a new file name.

Currently, there are two meta-data formats supported: ORG TVthek and YouTube, both via http://rg3.github.io/youtube-dl/

youtube-dl --write-info-json <URL>

This results, for example, with files like these:

Durchbruch bei Brexit-Verhandlungen-14577219.info.json
Durchbruch bei Brexit-Verhandlungen-14577219.mp4
Isolierte Familie - 58-jähriger Österreicher in U-Haft-14577221.info.json
Isolierte Familie - 58-jähriger Österreicher in U-Haft-14577221.mp4
The Star7 PDA Prototype-Ahg8OBYixL0.info.json
The Star7 PDA Prototype-Ahg8OBYixL0.mp4

Please notice the associated mp4 files as well as the info.json files.

Applying guess-filename on these files look like this:

vk@sherri ~tmp % guessfilename *mp4

   Durchbruch bei Brexit-Verhandlungen-14577219.mp4  ...
       →  2019-10-17T16.59.07 ORF - ZIB 17 00 - Durchbruch bei Brexit-Verhandlungen -- highquality.mp4

   Isolierte Familie - 58-jähriger Österreicher in U-Haft-14577221.mp4  ...
       →  2019-10-17T17.01.44 ORF - ZIB 17 00 - Isolierte Familie: 58-jähriger Österreicher in U-Haft -- highquality.mp4

   The Star7 PDA Prototype-Ahg8OBYixL0.mp4  ...
       →  2007-09-13 youtube - The Star7 PDA Prototype - Ahg8OBYixL0.mp4

The info.json files are not removed or renamed.

Extending with your own regular expressions

The structure of the script is like the following:

  • general header, command-line argument parser, …
  • handle_logging()
  • error_exit()
  • FileSizePlausibilityException()
  • class GuessFilename()
    • a long list of regular expression definitions
    • derive_new_filename_from_old_filename()
      • here, you can add code to interpret the regular expressions
    • derive_new_filename_from_content()
      • if you want to parse PDF content, add your code here
    • derive_new_filename_from_json_metadata()
      • this handles the JSON meta-data files generated by youtube-dl (see above)
    • handle_file()
      • the function that loops over all files is probing for new file names until a function is returning with a new name:
        1. derive_new_filename_from_old_filename()
        2. derive_new_filename_from_content()
        3. derive_new_filename_from_json_metadata()
        4. if no name returned until here: prints out a warning that no new name could be derived
    • The rest of the class consist of a bunch of tool functions, e.g., for parsing and querying:
    • adding_tags()
    • split_filename_entities()
    • contains_one_of()
    • contains_all_of()
    • fuzzy_contains_one_of()
    • fuzzy_contains_all_of()
    • has_euro_charge()
    • get_euro_charge()
    • get_euro_charge_from_context_or_basename()
    • get_euro_charge_from_context()
    • rename_file()
    • get_datetime_string_from_named_groups()
    • get_date_string_from_named_groups()
    • get_datetime_description_extension_filename()
    • get_date_description_extension_filename()
    • NumToMonth()
    • translate_ORF_quality_string_to_tag()
    • get_file_size()
    • warn_if_ORF_file_seems_to_small_according_to_duration_and_quality_indicator()
  • move_to_success_dir()
  • move_to_error_dir()
  • main()

For the most basic pattern matching, you just have to add regular expressions to the GuessFilename() class and add the regex matching code to derive_new_filename_from_old_filename().

Do not forget to add simple tests to guessfilename_test.py as well!

Related tools and workflows

This tool is part of a tool-set which I use to manage my digital files such as photographs. My work-flows are described in this blog posting you might like to read.

In short:

For tagging, please refer to filetags and its documentation.

See date2name for easily adding ISO time-stamps or date-stamps to files.

For easily naming and tagging files within file browsers that allow integration of external tools, see appendfilename (once more) and filetags.

Moving to the archive folders is done using move2archive.

Having tagged photographs gives you many advantages. For example, I automatically choose my desktop background image according to the current season.

Files containing an ISO time/date-stamp gets indexed by the filename-module of Memacs.


Jonas Sjöberg took my idea and developed the much more advanced (and thus a bit more complicated) autonameow. It uses rule-based renaming, analyzes content of plain text, epub, pdf and rtf files, extracts meta-data from many different file formats via exiftool and so forth.


This reddit thread brought me to fs-curator whose documentation looks promising. I did not test it and it’s still in an early stage. However, it could be a future user-friendly part of a workflow that watches folders for file changes and applies processes like guessfilename.

Alternatives

I you don’t need the full power of a programming language, organize might do the trick for you. Instead of coding Python, you define your rules within a text file.

Contribute!

I am looking for your ideas!

If you want to contribute to this cool project, please fork and contribute!

Local Variables

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].