bloomberg / repofactor

Licence: Apache-2.0
Tools for refactoring history of git repositories

Programming Languages: perl, shell

Finding the causes of repository bloat

This project contains a set of tools for analysing the largest blobs (measured by "on disk" storage) in a Git repository.

Here is a sample sequence of commands showing typical usage:

  • Start with a clean clone of the repository that you want to analyse; a bare clone is fine. For reasonable performance, clone it onto "local" disk on a reasonably fast Linux machine.

  • Add these tools to your PATH or use a full path to each script or executable.

  • Run these tools from the repository undergoing analysis and cleaning.

  • Work out a suitable threshold size by running generate-larger-than with a few trial values; 50000 might be a good starting point. The size is measured as "average bytes after compression by Git".

  • Generate a sorted list of objects with file information

    generate-larger-than 50000 | sort -k3n | add-file-info >../largeobjs.txt

  • Make a report showing the summary of each commit together with the paths which introduce the large objects, their uncompressed size and file information

    report-on-large-objects ../largeobjs.txt
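
To sanity-check a threshold independently of these tools, stock Git can report each object's on-disk size directly. The sketch below is not how generate-larger-than works internally (this README does not say); it only uses `git cat-file` plumbing, and the function name `list_blobs_larger_than` is made up for illustration. It reports the bytes each object currently occupies on disk, which can be very small for delta-compressed objects, so the numbers may differ from the "average bytes after compression" measure used above.

```shell
#!/bin/sh
# Hypothetical helper, not part of repofactor: list blobs whose on-disk
# size is at least $1 bytes, smallest first.
# Output columns: blob id, on-disk size, uncompressed size.
# Run from inside the repository being analysed (bare or non-bare).
list_blobs_larger_than() {
    threshold=$1
    git cat-file --batch-all-objects \
        --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(objectsize)' |
    awk -v min="$threshold" '$1 == "blob" && $3 >= min { print $2, $3, $4 }' |
    sort -k2n
}
```

For example, `list_blobs_larger_than 50000` gives a quick feel for how many blobs a threshold of 50000 would select before committing to the full report.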

Filtering out large blobs

  • Create a temporary work directory and export RFWORK_DIR to point to this directory (defaults to the current directory).

  • Again, run all commands from the repository being analysed.

  • From the report above, edit the list down to the blob ids that can be eliminated, and save it as large-objects.txt.

  • Generate a remove script

    make-remove-blobs large-objects.txt >"$RFWORK_DIR"/remove-blobs.pl
    chmod +x "$RFWORK_DIR"/remove-blobs.pl
    
  • Optionally edit the remove script so that it also filters out any paths that are no longer required

  • Run the filter branch

    run-filter-branch

  • Create a new "easy rebase" script for moving work-in-progress branches from the old history to the new history

    make-mtnh >"$RFWORK_DIR"/move-to-new-history

  • Push the rewritten refs and the rewrite-commit-map branch to all central repositories

  • Deploy move-to-new-history for users to use
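
Two of the steps above have no command shown: creating the work directory and pushing the rewritten refs. A minimal sketch of both follows, under stated assumptions: that a fresh `mktemp -d` directory is an acceptable work area, that the rewrite-commit-map branch exists as an ordinary local branch after the rewrite, and that a forced push is acceptable on your central repositories. The function names and the remote names passed in are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not part of repofactor.

# Create a temporary work directory and export RFWORK_DIR to point at it.
# Call this before the make-remove-blobs step.
setup_rfwork_dir() {
    RFWORK_DIR=$(mktemp -d) || return 1
    export RFWORK_DIR
}

# Force-push all local branches and tags to each named remote.
# Assumption: rewrite-commit-map is a local branch, so it is included.
# Rewritten history cannot fast-forward, hence --force.
push_rewritten_history() {
    for remote in "$@"; do
        git push --force "$remote" \
            'refs/heads/*:refs/heads/*' 'refs/tags/*:refs/tags/*' || return 1
    done
}
```

With remotes named, say, `central1` and `central2`, this would be run as `push_rewritten_history central1 central2` from the rewritten repository. Because a forced push replaces the old history on the remote, it is worth confirming the rewritten result locally first.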
