jgoenetxea
11/13/2017 - 9:58 AM

Remove large files from repo history

Before you start

This document lists some commands to remove specific elements from the repository history. If you want to remove elements with certain properties (minimum size, name pattern, etc.), you can use the BFG Repo-Cleaner app instead (not as powerful, but a lot faster). Related link: https://rtyley.github.io/bfg-repo-cleaner/

Identify the large files in the repo

List the SHA identifiers of all the objects in the repo:

$ git rev-list --objects --all | sort -k 2 > allfileshas.txt

Get a list of the files ordered by size (from biggest to smallest):

$ git gc && git verify-pack -v .git/objects/pack/pack-*.idx | grep -E "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

The file generated in the previous step identifies each file only by its SHA value. Now add the file name/path to each entry:

$ for SHA in $(cut -f 1 -d ' ' < bigobjects.txt); do
    echo $(grep "$SHA" bigobjects.txt) $(grep "$SHA" allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
  done
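
As an alternative to the three steps above, the same kind of size-ordered list can be built in a single pass. This is only a minimal sketch and not part of the original procedure: the function name list_blobs_by_size is hypothetical, and the sizes are the uncompressed object sizes reported by git cat-file for every blob reachable from any ref.

```shell
# Hypothetical helper: list every blob reachable from any ref, largest first.
# Output columns: <size-in-bytes> <sha> <path>
list_blobs_by_size() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        grep '^blob ' |      # keep blobs only (drop commits, trees, tags)
        cut -d ' ' -f 2- |   # drop the leading object type column
        sort -n -r           # numeric sort on size, biggest first
}
```

For example, `list_blobs_by_size | head -n 20` would show the twenty largest blobs.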

Filter the repository history

  • Clone the repo to get a clean copy.
  • Check which big files are in the history.
  • Remove the big files/folders with the git filter-branch command:
$ git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch PATH-TO-FILE-TO-BE-REMOVED' --prune-empty --tag-name-filter cat -- --all
Note: If you want to remove a folder, add '-r' after 'git rm', so
=> '... git rm -r --cached ...'

Where 'PATH-TO-FILE-TO-BE-REMOVED' is the path to the file or folder you want to remove.

  • [OPTIONAL] Add your file with sensitive or big data to .gitignore to ensure that you don't accidentally commit it again.
  • Double-check that you've removed everything you wanted to from your repository's history, and that all of your branches are checked out.
  • After all the changes are validated (ideally after some time), collect and erase the garbage with:
$ git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
$ git reflog expire --expire=now --all
$ git gc --prune=now
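
The filter and double-check steps above can be sketched as two small helpers, to be used before the garbage is erased. This is an illustration under stated assumptions, not the definitive procedure: the names purge_path and path_in_history are hypothetical, the path must not contain single quotes, and FILTER_BRANCH_SQUELCH_WARNING only silences the startup notice that recent Git versions print. The check looks at branches and tags only, because refs/original and the remote-tracking refs still point at the old history at this stage.

```shell
# Hypothetical helper: drop one file or folder from every commit on all refs.
# $1: path to remove (assumed to contain no single quotes).
purge_path() {
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
        --index-filter "git rm -r --cached --ignore-unmatch -- '$1'" \
        --prune-empty --tag-name-filter cat -- --all
}

# Hypothetical helper: exit status 0 if $1 still appears in any commit
# reachable from a local branch or tag, non-zero otherwise.
path_in_history() {
    [ -n "$(git log --branches --tags --oneline -- "$1")" ]
}
```

A possible use is `purge_path PATH-TO-FILE-TO-BE-REMOVED` followed by `path_in_history PATH-TO-FILE-TO-BE-REMOVED && echo "still present"`.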

Upload changes to the remote

  • Once the repository is in the desired state, force-push all the branches to the remote:
$ git push origin --force --all
  • To update the tagged releases, force-push the tags as well:
$ git push origin --force --tags
  • Tell all the collaborators to rebase, NOT MERGE, any branches they created off of your old (tainted) repository history. One merge commit could reintroduce some or all of the tainted history that you just went to the trouble of purging.
  • [OPTIONAL] If you want to prune the data on the server, go to the location of the repo on the server and run:
$ git reflog expire --expire=now --all
$ git gc --aggressive --prune=now

IMPORTANT: If you need to upload branches to other servers and they are not present in the current cloned repo (the one you have pruned), DO NOT PULL changes from the remote. Instead, check out only the branches you need, and that's it.

Tell your partners to sync their local repos

They must not pull the changes (this could be catastrophic), but there is a safe way to synchronize the repos. For those with extra commits:

$ cd MY_LOCAL_GIT_REPO
$ git fetch origin
$ git rebase
$ git reflog expire --expire=now --all
$ git gc --aggressive --prune=now

For those with no extra data (Warning: this option erases all unpushed data):

$ cd MY_LOCAL_GIT_REPO
$ git fetch origin
# WARNING: can destroy unpublished data!
$ git reset --hard origin/master
$ git reflog expire --expire=now --all
$ git gc --aggressive --prune=now
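
Whether the expire-and-gc commands above actually reclaimed space can be checked by reading the pack size before and after running them. The helper below is a minimal sketch with a hypothetical name (pack_size_kib); it relies on the size-pack line of git count-objects -v, which reports the size of the packs in KiB.

```shell
# Hypothetical helper: print the total size of the repository's packs in KiB.
pack_size_kib() {
    git count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Possible use (run inside the repository):
#   before=$(pack_size_kib)
#   git reflog expire --expire=now --all
#   git gc --aggressive --prune=now
#   echo "pack size: ${before} KiB -> $(pack_size_kib) KiB"
```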

References:

https://help.github.com/articles/removing-sensitive-data-from-a-repository/
http://naleid.com/blog/2012/01/17/finding-and-purging-big-files-from-git-history
http://blog.ostermiller.org/git-remove-from-history
https://git-scm.com/docs/git-filter-branch