Tarsum in Rust

Almost 14 years ago, I wrote a small utility named tarsum to calculate checksums of files inside a tar archive. It was useful for verifying data inside backups. Recently, I decided to rewrite it in Rust. It’s available from https://github.com/guyru/tarsum.

Installation using cargo is straightforward:

$ cargo install --git https://github.com/guyru/tarsum

Surprisingly, when tested on a large tar archive (a recent Linux tarball, 1.3 GB), the Python and Rust implementations perform very similarly.

tarsum-0.2 – A read only version of tarsum

When I first scratched the itch of calculating checksums for every file in a tar archive, a read-only tool was my original intention. But once I decided to write the script in bash for simplicity, I gave up on the idea and settled for extracting the files and then going over them to calculate each checksum.

So when Jon Flowers asked in the comments of the original tarsum post about the possibility of getting the checksums of files in a tar file without extracting the whole archive, I decided to re-tackle the problem.
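
To illustrate the read-only approach, here is a minimal Python sketch that hashes every regular file in an archive by reading it straight from the tar stream, without writing anything to disk. It is not the actual tarsum code: the names `tar_checksums`, `format_sums` and `build_demo_archive` are made up for this example, and the md5sum-like output format is an assumption.

```python
import hashlib
import io
import tarfile


def tar_checksums(fileobj, algorithm="md5", chunk_size=64 * 1024):
    """Return {member_name: hex_digest} for regular files in a tar archive.

    Members are read directly from the archive; nothing is extracted to disk.
    """
    sums = {}
    # Mode "r:*" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, symlinks, devices, etc.
            digest = hashlib.new(algorithm)
            stream = archive.extractfile(member)
            # Hash in chunks so large members never sit fully in memory.
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
            sums[member.name] = digest.hexdigest()
    return sums


def format_sums(sums):
    """Render 'digest  name' lines (assumed format), sorted by file name."""
    return "".join(f"{sums[name]}  {name}\n" for name in sorted(sums))


def build_demo_archive(files):
    """Build a small in-memory tar archive from {name: bytes} (demo helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_demo_archive({"b.txt": b"world\n", "a.txt": b"hello\n"})
    print(format_sums(tar_checksums(demo)), end="")
```

For a real archive, passing `open("backup.tar", "rb")` instead of the demo buffer should work the same way. Sorting the output by name keeps the listing stable regardless of the order in which files were added to the archive.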


tarsum – Calculate Checksum for Files inside Tar Archive

Update: I’ve released tarsum-0.2, a new version of tarsum.

Some time ago, I got a hard disk back from data recovery. One of the annoying issues I encountered with the recovered data was corrupted files. Some files looked like they were recovered successfully, but their content was corrupted. The ones that were configuration files were usually easy to detect, as they raised errors in programs that tried to use them. But when such an error occurs in some general text file (or inside the data of an SQL dump), the file may seem perfectly fine unless closely inspected.

I have a habit of storing old backups on CDs (they are initially made to online storage) in order to reduce backup costs. But the corrupted recovered data raised some concerns about my ability to recover using these discs. Assuming I have a disk failure and, for some reason, can’t recover from my online backups, how can I check the integrity of my CD backups?

Storing and comparing a hash signature only for the whole archive is almost useless. It lets you validate whether all the files are probably fine, but it can’t tell a single corrupted file apart from a completely corrupted archive. My idea was to calculate a checksum (hash) for each file in the data and store the signatures in a way that would let me see which individual files are corrupted.

This is where tarsum comes to the rescue. As its name implies, it calculates a checksum for each file in the archive. You can download tarsum from here.

Using tarsum is pretty straightforward.

tarsum backup.tar > backup.tar.md5

This calculates the MD5 checksum of each file in the archive. You can use other hashes as well by passing a tool that calculates them (it must work like md5sum):

tarsum --checksum=sha256sum backup.tar > backup.tar.sha256

To verify the integrity of the files inside the archive we use the diff command:

tarsum backup.tar | diff backup.tar.md5 -

where backup.tar.md5 is the original signature file we created. This works because the signatures are sorted alphabetically by file name inside the archive, so the order of the files is always the same.
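
When diff output gets long, the same comparison can be done by file name instead of by line. The following Python sketch assumes the signature files use md5sum-style lines (`digest`, two spaces, `name`); that format, and the helper names `parse_sums` and `compare_sums`, are assumptions made for illustration and not part of tarsum.

```python
def parse_sums(text):
    """Parse md5sum-style 'digest  name' lines into {name: digest}."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first double space only, so names may contain spaces.
        digest, _, name = line.partition("  ")
        sums[name] = digest
    return sums


def compare_sums(stored, current):
    """Return (corrupted, missing, added) lists of file names."""
    corrupted = sorted(
        name for name in stored
        if name in current and stored[name] != current[name]
    )
    missing = sorted(name for name in stored if name not in current)
    added = sorted(name for name in current if name not in stored)
    return corrupted, missing, added


if __name__ == "__main__":
    # Made-up digests, just to show the shape of the result.
    stored = parse_sums("aaa  docs/a.txt\nbbb  docs/b.txt\n")
    current = parse_sums("aaa  docs/a.txt\nccc  docs/b.txt\n")
    print(compare_sums(stored, current))  # prints (['docs/b.txt'], [], [])
```

Because the comparison is keyed on file names, it reports exactly which files changed even if the two listings are not in the same order.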

Note that if you use a recent version of GNU tar, tarsum can also operate directly on compressed archives (e.g. tar.bz2, tar.gz).