Calculating checksums for every file in a tar archive, without extracting it first, was my original intention when I started tarsum. But when I decided I wanted the script in Bash for simplicity, I gave up on that idea and settled for extracting the files and then walking over them to calculate their checksums.
So when Jon Flowers asked in the comments on the original tarsum post about the possibility of getting the checksums of files in the tar file without extracting the whole archive, I decided to re-tackle the problem.
This time I chose Python, and by using the tarfile and hashlib modules I came up with a solution that walks over a tar file and calculates each member's checksum without extracting anything to disk. However, some sacrifices were made in the form of backward compatibility of the output. I've tried to keep the interface similar to the old one and have kept all the command-line options. Instead of specifying a program that calculates the checksum values (such as sha1sum) as the argument to --checksum, you now specify the name of a checksum algorithm, such as md5, sha1, sha256 or sha512 (or any other algorithm supported by hashlib).
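The core of that approach fits in a few lines. This is a minimal sketch of the technique, not the actual tarsum code; the function name and the 64 KB chunk size are my own choices:

```python
import hashlib
import tarfile

def tar_checksums(path, algorithm="md5"):
    """Yield (hexdigest, member_name) for each regular file in a tar archive."""
    # "r:*" auto-detects gzip/bzip2 compression from the file itself.
    with tarfile.open(path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            digest = hashlib.new(algorithm)
            # extractfile() returns a file-like object read straight from
            # the archive; nothing is written to disk.
            f = tar.extractfile(member)
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
            yield digest.hexdigest(), member.name
```

Hashing in fixed-size chunks keeps memory use constant even for very large members.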
Other changes were made so tar files can be piped directly into tarsum (which also works transparently with bzip2 and gzip compression).
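The piping support comes from tarfile's stream modes. A sketch of the idea, under my own assumptions (`open_tar` is a hypothetical helper, not tarsum's actual function, and `sys.stdin.buffer` is the Python 3 spelling of binary stdin):

```python
import sys
import tarfile

def open_tar(path=None):
    """Open a tar archive for forward-only reading.

    Mode "r|*" treats the input as a non-seekable stream, which is what
    makes reading from a pipe possible; the "*" still auto-detects gzip
    and bzip2 compression.
    """
    if path is None or path == "-":
        return tarfile.open(fileobj=sys.stdin.buffer, mode="r|*")
    return tarfile.open(path, mode="r|*")
```

With something like this, both `tarsum file.tar.gz` and `cat file.tar.gz | tarsum` end up on the same code path.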
Performance-wise, according to some tests I've carried out, the new version is faster than the old one on big tar files, but slower on small archives (which I find less important).
7 thoughts on “tarsum-0.2 – A read-only version of tarsum”
Hi.
Tried out your program using Ubuntu 9.04, but I encountered two problems.
First, in the last line of main() you call tarsum() with 4 parameters, while tarsum() is defined to accept only 3. Python aborts the program because of this. I'm not a Python programmer, but when I took out the 4th parameter, the program runs.
Then I tried to use the program on a 21GB bzipped tarball that contains a 100GB file. The program runs, but it finishes in less than 1 second and prints out a checksum that is different from the expected one. The file itself is OK when tested using "bzip2 -t".
After some research, I changed the filemode from "r:*" to "r|*" to use stream IO. After this change, "tarsum-0.2 file.tar.bz2" now aborts, but "bunzip2 -c file.tar.bz2 | tarsum-0.2" seems to work.
@Mike: Thanks for pointing out the tarsum() signature error. I guess that when I cleaned up the script before the release, I missed that I had also changed the signature.
I admit that I've never tested the script with a file as big as yours. What surprised me is that setting the filemode to r|* actually slowed the script down a bit (at least for my ~300MB tar); I assumed that giving up random access should make things faster, but it didn't.
Anyway, I've fixed both issues. Thanks again.
I had to do a workaround due to disk space limitations, since I cannot extract my archives (they contain extremely large log files). This is what I came up with:
IFS=$'\n'
for line in $(cat ${md5file})
do
    # first field: expected md5; second field: path inside the archive
    md5=$(echo ${line} | awk '{print $1}')
    filename=$(echo ${line} | awk '{print $2}')
    # extract the single member to stdout and checksum the stream
    md5archivefile=$(tar -zxOvf myfile.tgz ${filename} 2>/dev/null | md5sum - | awk '{print $1}')
    if [ "${md5archivefile}" != "${md5}" ]; then
        echo "NOT OK: $filename,$md5,$md5archivefile"
    else
        echo "OK: $filename,$md5,$md5archivefile"
    fi
done
Thanks for the great code! I’ve made a few changes, at
https://github.com/mikemccabe/code/blob/master/tarsum
- add --pipe option, to copy input to output for use in shell pipelines
– add a ‘total’ sum of input tarfile.
Comments welcome! Still in progress 🙂
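I don't know how that fork implements it, but a 'total' sum of the input combined with a pass-through pipe could be sketched like this (a hypothetical illustration; `HashingReader` and its layout are my own invention):

```python
import hashlib

class HashingReader:
    """Wrap a binary stream: hash (and optionally echo) every byte read."""

    def __init__(self, stream, algorithm="md5", echo=None):
        self.stream = stream
        self.digest = hashlib.new(algorithm)
        # echo could be e.g. sys.stdout.buffer for a --pipe pass-through
        self.echo = echo

    def read(self, size=-1):
        data = self.stream.read(size)
        self.digest.update(data)
        if self.echo is not None:
            self.echo.write(data)
        return data
```

One would wrap stdin with this before handing it to `tarfile.open(fileobj=..., mode="r|*")`; after parsing, draining the wrapper with `read()` until it returns empty makes sure the total also covers any trailing padding that tarfile did not consume.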
Hi,
Thanks for the script!
I have built a script that hashes the files in a directory and its subdirectories the first time it is run; on subsequent runs it recomputes the hash only for files whose size and/or last-change time have changed. This lets me update the hashes for really huge directories (several hundred gigabytes) in very little time.
I’ll use your script to also hash the contents of archives (I use my script to detect duplicate files, so it’ll be nice for it to also find files which have a duplicate within an archive).
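The size/mtime caching scheme described there can be sketched as follows. This is my own reconstruction of the idea, not the commenter's actual script; `update_hashes` and the cache layout are assumptions:

```python
import hashlib
import os

def update_hashes(root, cache):
    """Recompute digests only for files whose size or mtime changed.

    cache maps path -> (size, mtime, digest); returns a fresh cache.
    """
    fresh = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            old = cache.get(path)
            if old is not None and old[:2] == (st.st_size, st.st_mtime):
                fresh[path] = old  # unchanged: reuse the cached digest
                continue
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            fresh[path] = (st.st_size, st.st_mtime, h.hexdigest())
    return fresh
```

On a second run over an unchanged tree, every file hits the cache and no data is read at all, which is what makes huge directories cheap to re-scan.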
By looking at your code, I think the “store_digests = {}” line in the tarsum function is useless, since that variable is never read from.
Cheers,
Georges Dupéron
> Thanks for the great code! I’ve made a few changes, at
>
> https://github.com/mikemccabe/code/blob/master/tarsum
>
> – add --pipe option, to copy input to output for use in shell pipelines
>
> – add a ‘total’ sum of input tarfile.
This was very useful. Thanks!
I had to slightly change it to work with Python 3 (this computer doesn’t have Python 2 installed).
I moved my version to its own repo –
https://github.com/mikemccabe/tarsump
The updated file is at https://gist.github.com/sjmurdoch/5e089249bc465706f1ca32f195787ad8