Stripping out certain files from Mercurial history

A repository in a version control system sometimes grows too big to fit within the storage quota of its host, such as the 2 GB limit of the free account from Bitbucket. In most cases the limit is reached because of binary files under version control, such as images and videos.

I recently ran into this with paazmaya.fi, since I used to host the video files myself. Later I moved them to vimeo.com, but the history still contains plenty of mp4 and avi files, even psd and fla files… Time to strip those out. Some images that might never have been used, for example those that came with JavaScript libraries, could be removed as well.

Please note that the method described here most likely requires rewriting the whole history of the given repository, so everyone who uses it needs to get a fresh clone after the stripped version has been pushed.

The following examples were run on Mac OS X 10.9, with Mercurial version 3.2.4 and the convert extension enabled. The output of the hg version command:

Mercurial Distributed SCM (version 3.2.4)
(see https://mercurial.selenic.com for more information)

Copyright (C) 2005-2014 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Enabled extensions:

  color
  convert
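
The "Enabled extensions" list above corresponds to an [extensions] section in the Mercurial configuration file. A minimal sketch of such a section follows; the scratch file name hgrc-extensions.txt is made up for this example, and the contents would need to be merged by hand into the real configuration file (commonly ~/.hgrc).

```shell
# Write the configuration fragment to a scratch file (hypothetical name)
# instead of touching the real ~/.hgrc.
cat > hgrc-extensions.txt <<'EOF'
[extensions]
color =
convert =
EOF
```

An empty value after the equals sign is enough to enable an extension that ships with Mercurial.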

The commands shown in this post should give the same results on Linux distributions, and even on Windows when MSys Bash is used and the required tools are available.

The --filemap argument of the hg convert command is the key feature for excluding files from the history. It takes the path of a text file that lists the files to be excluded, included or renamed, depending on the use case.
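
As a rough illustration of the file format, a filemap might look like the one written below. The paths and the file name example-map.txt are invented for this sketch and do not come from the real repository. A filemap that contains include lines keeps only the included paths, which is why this example sticks to exclude and rename.

```shell
# Write a small, hypothetical filemap; every path here is made up.
cat > example-map.txt <<'EOF'
# Lines starting with a hash are comments.
# Drop one file and one directory from every revision.
exclude "videos/intro.mp4"
exclude "design/psd"
# Move a directory to a new path in the converted history.
rename "old-docs" "docs"
EOF
```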

First the files need to be identified. This has to be done by searching the history, since simply excluding all *.mp4 and *.avi files with a pattern is not supported. A patch adding globbing support was proposed back in March 2010, but it was not merged. Luckily Mercurial provides a manifest command that can list all files that have been under version control at any point in the history.

hg manifest --all -R paazmaya-original | grep -i \
  -E '\.(mp4|mkv|avi|psd|fla|exe|jbf|zip|bmp|ai|jar)$'

To see how many files the grep matches, append | wc -l to the previous command, which pipes the lines from grep to a line counter.
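
Beyond the total count, a breakdown per file extension can show which file types dominate. The helper below is a sketch with a made-up name, count_by_extension; it assumes every line on its input ends with a file extension, which holds when it is fed from the grep above.

```shell
# Hypothetical helper: reads file paths on stdin and prints "<count> <extension>"
# lines, most frequent first. Extensions are lowercased so MP4 and mp4 are merged.
count_by_extension() {
  awk -F. '{ n[tolower($NF)]++ } END { for (e in n) print n[e], e }' | sort -rn
}

# Usage (needs Mercurial and the repository from this post):
#   hg manifest --all -R paazmaya-original | grep -i \
#     -E '\.(mp4|mkv|avi|psd|fla|exe|jbf|zip|bmp|ai|jar)$' | count_by_extension
```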

The map file can be written by piping the command further, by adding | awk '{print "exclude \""$0"\""}' > exclude-map.txt. The quotes around the file name are needed because it might contain whitespace or other special characters.
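
Put together, the listing, filtering and map writing steps could be wrapped as in the sketch below. The function name make_exclude_map is made up for this example; the extension list is the same one used in the grep above.

```shell
# Hypothetical helper: reads file paths on stdin and writes one quoted
# "exclude" line per matching path to stdout.
make_exclude_map() {
  grep -i -E '\.(mp4|mkv|avi|psd|fla|exe|jbf|zip|bmp|ai|jar)$' |
    awk '{ print "exclude \"" $0 "\"" }'
}

# Usage (needs Mercurial and the repository from this post):
#   hg manifest --all -R paazmaya-original | make_exclude_map > exclude-map.txt
```

Keeping the filter in a function that reads standard input makes it easy to try on a few sample paths before running it against the full manifest.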

The conversion can now be done with the below example command. Since the repository size is expected to be near 2 GB, it will take a while.

hg convert --filemap exclude-map.txt paazmaya-original paazmaya-exclude-map

Once paazmaya-original has been converted to paazmaya-exclude-map, the sizes can be compared:

du -sh paazmaya-original
du -sh paazmaya-exclude-map

However, the du command above is not the most accurate measure, because it reports the disk space allocated by the file system rather than the size of the data itself, so in most cases the result is a bit too high. More accurate numbers can be retrieved by bundling the repository into a single file with the Mercurial bundle command:

hg bundle --all -R paazmaya-original paazmaya-original.hg
hg bundle --all -R paazmaya-exclude-map exclude-map.hg
ls -lh
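
To turn the two bundle sizes into a single figure, the relative reduction can be computed from the byte counts. This is a small sketch; the function name size_reduction_percent is made up, and it assumes the first file is not empty.

```shell
# Hypothetical helper: prints by how many percent the second file is smaller
# than the first one, based on their sizes in bytes.
size_reduction_percent() {
  awk -v o="$(wc -c < "$1")" -v r="$(wc -c < "$2")" \
    'BEGIN { printf "%.1f\n", (o - r) * 100 / o }'
}

# Usage, with the bundle files created above:
#   size_reduction_percent paazmaya-original.hg exclude-map.hg
```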

There might be a need, or at least curiosity, to see which files were removed. In order to save them, the opposite conversion should be made, in which only the previously excluded files are included.

Replace the word “exclude” with “include” wherever it starts a line and is followed by a space, by using sed:

sed 's/^exclude /include /' exclude-map.txt > include-map.txt

Use the resulting include map file while converting:

hg convert --filemap include-map.txt paazmaya-original paazmaya-include-map

Once done, check the size:

hg bundle --all -R paazmaya-include-map include-map.hg
ls -lh

Now that the two repositories have been converted from the original, they should be inspected, possibly with a graphical interface, to make sure that the history looks good and that any tags are preserved.
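
Part of that inspection can be scripted. The sketch below checks that no path with one of the excluded extensions is left in a file listing; the function name no_excluded_files_left is made up, and the check only covers the extensions used for the map earlier in this post.

```shell
# Hypothetical helper: reads file paths on stdin, prints any path that still
# has an excluded extension, and succeeds only when there are none.
no_excluded_files_left() {
  ! grep -i -E '\.(mp4|mkv|avi|psd|fla|exe|jbf|zip|bmp|ai|jar)$'
}

# Usage (needs Mercurial and the converted repository):
#   hg manifest --all -R paazmaya-exclude-map | no_excluded_files_left \
#     && echo "no excluded files left"
#   hg tags -R paazmaya-exclude-map   # list the tags that survived the conversion
```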

Strip the original repository on Bitbucket via “Settings” -> “Strip changesets”, under the “General” category. The “Revision to strip” should be 0, which is the first commit of the given repository.

If the repository with the binary files stripped out looks good, it can be pushed to Bitbucket. Before pushing, it might be useful to enable the progress extension so that the push can be monitored more easily. The --debug option could also be added in case the push takes too long or fails in the middle.

hg push -R paazmaya-exclude-map \
  ssh://hg@bitbucket.org/username/repository

If the push fails because it is too much to handle in one session, a single revision, or the revisions up to a certain number, could be pushed in one go, for example only the first commit:

hg push -R paazmaya-exclude-map -r 0 \
  ssh://hg@bitbucket.org/username/repository

The next push could use -r 4, which would push the commits from index 1 to 4, since pushing a revision also pushes any of its ancestors that the server does not have yet.
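
Pushing in steps can be repeated in a loop instead of typing each command by hand. The following is only a sketch: the function name push_in_steps, its arguments and the HG override variable are made up for this example, and the step size would need tuning to what the server accepts in one session.

```shell
# Hypothetical helper: push revisions 0, step, 2*step, ... below "last",
# stopping at the first failure, then push whatever remains.
# Set HG=echo to preview the commands without running Mercurial.
push_in_steps() {
  repo=$1 url=$2 last=$3 step=$4
  [ "$step" -gt 0 ] || return 2   # a zero step would loop forever
  rev=0
  while [ "$rev" -lt "$last" ]; do
    "${HG:-hg}" push -R "$repo" -r "$rev" "$url" || return 1
    rev=$((rev + step))
  done
  # Final push without -r sends everything that is still missing.
  "${HG:-hg}" push -R "$repo" "$url"
}

# Usage, pushing 1401 changesets (revisions 0 to 1400) in steps of 100:
#   push_in_steps paazmaya-exclude-map \
#     ssh://hg@bitbucket.org/username/repository 1400 100
```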

Once the whole repository is up, it’s time to inform the other possible developers/users of the given repository.

The sizes I received when doing this for paazmaya.fi:

  • Size reported on the Bitbucket settings page: 1.5 GB
  • Original size locally before conversion, 1424 changesets: 1.4 GB
  • Locally after excluding most binaries, 1401 changesets: 852 MB
  • Excluded files separately, 167 changesets: 614 MB
  • Size of the stripped version reported on the Bitbucket settings page: 2.4 GB, since they seem to have caching problems…

Anyhow, the total size reduction is huge: removing the binary files took away about 40% of the original size.

For future reference, it is worthwhile to use services like Dropbox for sharing binary files.