tarsum-0.2 – A read-only version of tarsum

When I first scratched the itch of calculating checksums for every file in a tar archive, this was my original intention. When I decided I wanted the script in bash for simplicity, I forfeited the idea and settled for extracting the files and then going over them to calculate their checksums.

So when Jon Flowers asked in the comments of the original tarsum post about the possibility of getting the checksums of files in a tar file without extracting the whole archive, I decided to re-tackle the problem.

This time I chose Python, and by using the tarfile and hashlib modules I came up with a solution that goes over tar files and calculates the checksum values without extracting anything to disk. However, some sacrifices were made in the form of backward compatibility of the output. I’ve tried to make the interface similar to the old one and have kept all the command-line options. Instead of specifying a program to calculate the checksum values (such as sha1sum) as the argument to --checksum, you now specify the name of a checksum algorithm such as md5, sha1, sha256 or sha512 (or any other supported by hashlib).
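The core of that approach can be sketched as follows (a minimal illustration using tarfile and hashlib; the function and parameter names are mine, not the actual tarsum source):

```python
import hashlib
import tarfile

def tar_checksums(path, algorithm="md5", blocksize=128 * 1024):
    """Yield (hexdigest, member name) for each regular file in a tar archive.

    Mode "r:*" lets tarfile autodetect gzip/bzip2 compression. Each member
    is read in chunks straight from the archive, so nothing is written to
    disk.
    """
    with tarfile.open(path, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # hashlib.new() accepts any algorithm name hashlib supports
            digest = hashlib.new(algorithm)
            extracted = tar.extractfile(member)
            for chunk in iter(lambda: extracted.read(blocksize), b""):
                digest.update(chunk)
            yield digest.hexdigest(), member.name
```

Reading in fixed-size chunks keeps memory use flat even when individual members are huge.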

Other changes were made so tar files can be piped directly into tarsum (which also works transparently with bzip2 and gzip compression).

tarsum < sometarfile.tar.gz > sometarfile.tar.gz.md5
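The pipe-friendly behaviour can be sketched with tarfile's stream mode (again an illustration with made-up names, not the actual tarsum source):

```python
import hashlib
import sys
import tarfile

def tarsum_stream(fileobj, algorithm="md5"):
    """Checksum every regular file in a tar stream.

    Mode "r|*" makes tarfile read the archive strictly sequentially and
    autodetect gzip/bzip2 compression, which is what lets a compressed
    tarball be piped straight in on stdin.
    """
    lines = []
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            digest = hashlib.new(algorithm)
            extracted = tar.extractfile(member)
            for chunk in iter(lambda: extracted.read(128 * 1024), b""):
                digest.update(chunk)
            lines.append(f"{digest.hexdigest()}  {member.name}")
    return lines

if __name__ == "__main__":
    # e.g. tarsum < sometarfile.tar.gz > sometarfile.tar.gz.md5
    print("\n".join(tarsum_stream(sys.stdin.buffer)))
```

In stream mode each member must be consumed before advancing to the next, which the loop above does naturally.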

Performance-wise, according to some tests I’ve carried out, the new version is faster than the old one with big tar files, but it’s the other way around with small archives (which I find less important).

Update 2009-08-12: Removed excess argument to tarsum() and switched the filemode to r|* (from r:*). Bumped version string.

6 thoughts on “tarsum-0.2 – A read-only version of tarsum”

  1. Mike T.


    Tried out your program using Ubuntu 9.04, but I encountered two problems.

    First, in the last line of main() you called tarsum() with 4 parameters while tarsum() is defined to accept only 3. Python aborts the program because of this. I’m not a Python programmer, but when I took out the 4th parameter, the program runs.

    Then I tried to use the program on a 21GB bzipped tarball that contains a 100GB file. While the program runs, it finishes in less than 1 second and prints out a checksum that is wrong. The file is OK when tested using “bzip2 -t”.

    After some research, I changed the filemode from “r:*” to “r|*” to use stream IO. After this change, “tarsum-0.2 file.tar.bz2” now aborts, but “bunzip2 -c file.tar.bz2 | tarsum-0.2” now seems to work.

  2. Guy Post author

    @Mike: Thanks for pointing out the tarsum() signature error. I guess when I cleaned up the script before the release I missed that I had also changed the signature.

    I admit that I’ve never tested the script with a file as big as yours. What surprised me is that setting the filemode to r|* slowed the script down a bit (at least for my ~300MB tar); I assumed that giving up random access should make things faster, but it didn’t.

    Anyway I’ve fixed both issues. Thanks again.

  3. Goran Tornqvist

    I had to do a workaround due to disk space limitations, since I cannot extract my archives, which contain extremely large log files. This is what I came up with:


    # Read "md5  filename" pairs from the checksum file and verify each
    # member against the archive without extracting it to disk.
    while read -r md5 filename; do
        md5archivefile=$(tar -zxOvf myfile.tgz "${filename}" 2>/dev/null | md5sum - | awk '{print $1}')
        if [ "${md5archivefile}" != "${md5}" ]; then
            echo "NOT OK: $filename,$md5,$md5archivefile"
        else
            echo "OK: $filename,$md5,$md5archivefile"
        fi
    done < "${md5file}"

  4. Georges Dupéron


    Thanks for the script !

    I have built a script that hashes the files in a directory and its subdirectories when run for the first time; on subsequent runs it recomputes the hash only for files whose size and/or last change time have changed. This allows me to update the hashes for really huge directories (several hundred gigabytes) in very little time.
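    The idea Georges describes can be sketched like this (a hypothetical reconstruction, not his actual script; the cache maps each path to a ((size, mtime), digest) pair and would be persisted between runs, e.g. with json or pickle):

```python
import hashlib
import os

def update_hashes(root, cache):
    """Re-hash only files whose (size, mtime) changed since the last run.

    `cache` maps path -> ((size, mtime), hexdigest). Unchanged files keep
    their cached digest, so a second pass over a huge tree is cheap.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            stamp = (st.st_size, st.st_mtime)
            cached = cache.get(path)
            if cached is not None and cached[0] == stamp:
                continue  # unchanged: keep the old digest
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(128 * 1024), b""):
                    digest.update(chunk)
            cache[path] = (stamp, digest.hexdigest())
    return cache
```

    Note that mtime alone can miss same-second rewrites, which is presumably why the size is checked as well.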

    I’ll use your script to also hash the contents of archives (I use my script to detect duplicate files, so it’ll be nice for it to also find files which have a duplicate within an archive).

    By looking at your code, I think the “store_digests = {}” line in the tarsum function is useless, since that variable is never read from.

    Georges Dupéron
