Calculating checksums for every file in a tar archive, without extracting it first, was my original intention when I started tarsum. But when I decided I wanted the script in Bash for simplicity, I gave up on that idea and settled for extracting the files and then walking over them to calculate their checksums.
So when Jon Flowers asked in the comments on the original tarsum post about the possibility of getting the checksums of files in the tar file without extracting the whole archive, I decided to re-tackle the problem.
This time I chose Python, and by using the tarfile and hashlib modules I came up with a solution that walks over a tar file and calculates each member's checksum without extracting anything to disk. However, some sacrifices were made in the form of backward compatibility of the output. I've tried to keep the interface similar to the old one and have kept all the command-line options. Instead of specifying a program that calculates the checksum values (such as sha1sum) as the argument to --checksum, you now specify the name of a checksum algorithm, such as md5, sha1, sha256 or sha512 (or any other algorithm supported by hashlib).
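The core of that approach fits in a few lines. This is a minimal sketch of the technique, not the actual tarsum code; the function name and the 64 KB chunk size are my own choices:

```python
import hashlib
import tarfile

def tar_checksums(path, algorithm="md5"):
    """Yield (hexdigest, member_name) for each regular file in a tar archive."""
    # "r:*" auto-detects gzip/bzip2 compression from the file itself.
    with tarfile.open(path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            digest = hashlib.new(algorithm)
            # extractfile() returns a file-like object read straight from
            # the archive; nothing is written to disk.
            f = tar.extractfile(member)
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
            yield digest.hexdigest(), member.name
```

Hashing in fixed-size chunks keeps memory use constant even for very large members.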
Other changes were made so tar files can be piped directly into tarsum (which also works transparently with bzip2 and gzip compression).
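The piping support comes from tarfile's stream modes. A sketch of the idea, under my own assumptions (`open_tar` is a hypothetical helper, not tarsum's actual function, and `sys.stdin.buffer` is the Python 3 spelling of binary stdin):

```python
import sys
import tarfile

def open_tar(path=None):
    """Open a tar archive for forward-only reading.

    Mode "r|*" treats the input as a non-seekable stream, which is what
    makes reading from a pipe possible; the "*" still auto-detects gzip
    and bzip2 compression.
    """
    if path is None or path == "-":
        return tarfile.open(fileobj=sys.stdin.buffer, mode="r|*")
    return tarfile.open(path, mode="r|*")
```

With something like this, both `tarsum file.tar.gz` and `cat file.tar.gz | tarsum` end up on the same code path.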
Performance-wise, according to some tests I've carried out, the new version is faster than the old one on big tar files, but slower on small archives (which I find less important).
7 thoughts on “tarsum-0.2 – A read-only version of tarsum”
Hi.
Tried out your program using Ubuntu 9.04, but I encountered two problems.
First, in the last line of main() you call tarsum() with 4 parameters, while tarsum() is defined to accept only 3. Python aborts the program because of this. I'm not a Python programmer, but when I took out the 4th parameter, the program runs.
Then I tried to use the program on a 21GB bzipped tarball that contains a 100GB file. The program runs, but it finishes in less than 1 second and prints out a checksum that is different from the expected one. The file itself is OK when tested using "bzip2 -t".
After some research, I changed the filemode from "r:*" to "r|*" to use stream IO. After this change, "tarsum-0.2 file.tar.bz2" now aborts, but "bunzip2 -c file.tar.bz2 | tarsum-0.2" seems to work.
@Mike: Thanks for pointing out the tarsum() signature error. I guess that when I cleaned up the script before the release, I missed that I had also changed the signature.
I admit that I've never tested the script with a file as big as yours. What surprised me is that setting the filemode to r|* actually slowed the script down a bit (at least for my ~300MB tar); I assumed that giving up random access should make things faster, but it didn't.
Anyway, I've fixed both issues. Thanks again.
I had to do a workaround due to disk space limitations, since I cannot extract my archives (they contain extremely large log files). This is what I came up with:
IFS=$'\n'
for line in $(cat ${md5file})
do
    # first field: expected md5; second field: path inside the archive
    md5=$(echo ${line} | awk '{print $1}')
    filename=$(echo ${line} | awk '{print $2}')
    # extract the single member to stdout and checksum the stream
    md5archivefile=$(tar -zxOvf myfile.tgz ${filename} 2>/dev/null | md5sum - | awk '{print $1}')
    if [ "${md5archivefile}" != "${md5}" ]; then
        echo "NOT OK: $filename,$md5,$md5archivefile"
    else
        echo "OK: $filename,$md5,$md5archivefile"
    fi
done
Thanks for the great code! I’ve made a few changes, at
https://github.com/mikemccabe/code/blob/master/tarsum
- add --pipe option, to copy input to output for use in shell pipelines
– add a ‘total’ sum of input tarfile.
Comments welcome! Still in progress 🙂
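I don't know how that fork implements it, but a 'total' sum of the input combined with a pass-through pipe could be sketched like this (a hypothetical illustration; `HashingReader` and its layout are my own invention):

```python
import hashlib

class HashingReader:
    """Wrap a binary stream: hash (and optionally echo) every byte read."""

    def __init__(self, stream, algorithm="md5", echo=None):
        self.stream = stream
        self.digest = hashlib.new(algorithm)
        # echo could be e.g. sys.stdout.buffer for a --pipe pass-through
        self.echo = echo

    def read(self, size=-1):
        data = self.stream.read(size)
        self.digest.update(data)
        if self.echo is not None:
            self.echo.write(data)
        return data
```

One would wrap stdin with this before handing it to `tarfile.open(fileobj=..., mode="r|*")`; after parsing, draining the wrapper with `read()` until it returns empty makes sure the total also covers any trailing padding that tarfile did not consume.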
Hi,
Thanks for the script!
I have built a script that hashes the files in a directory and its subdirectories the first time it is run; on subsequent runs it recomputes the hash only for files whose size and/or last-change time have changed. This lets me update the hashes for really huge directories (several hundred gigabytes) in very little time.
I’ll use your script to also hash the contents of archives (I use my script to detect duplicate files, so it’ll be nice for it to also find files which have a duplicate within an archive).
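The size/mtime caching scheme described there can be sketched as follows. This is my own reconstruction of the idea, not the commenter's actual script; `update_hashes` and the cache layout are assumptions:

```python
import hashlib
import os

def update_hashes(root, cache):
    """Recompute digests only for files whose size or mtime changed.

    cache maps path -> (size, mtime, digest); returns a fresh cache.
    """
    fresh = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            old = cache.get(path)
            if old is not None and old[:2] == (st.st_size, st.st_mtime):
                fresh[path] = old  # unchanged: reuse the cached digest
                continue
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            fresh[path] = (st.st_size, st.st_mtime, h.hexdigest())
    return fresh
```

On a second run over an unchanged tree, every file hits the cache and no data is read at all, which is what makes huge directories cheap to re-scan.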
By looking at your code, I think the “store_digests = {}” line in the tarsum function is useless, since that variable is never read from.
Cheers,
Georges Dupéron
> Thanks for the great code! I’ve made a few changes, at
>
> https://github.com/mikemccabe/code/blob/master/tarsum
>
> – add --pipe option, to copy input to output for use in shell pipelines
>
> – add a ‘total’ sum of input tarfile.
This was very useful. Thanks!
I had to slightly change it to work with Python 3 (this computer doesn’t have Python 2 installed).
I moved my version to its own repo –
https://github.com/mikemccabe/tarsump
The updated file is at https://gist.github.com/sjmurdoch/5e089249bc465706f1ca32f195787ad8