Tarsum in Rust

Almost 14 years ago, I wrote a small utility, named tarsum, to calculate checksums on files inside a tar archive. It was useful for verifying data inside backups. Recently, I decided to rewrite it in Rust. It’s available from https://github.com/guyru/tarsum.

Installation using cargo is straightforward:

$ cargo install --git https://github.com/guyru/tarsum

Surprisingly, when tested on a large tar archive (a recent Linux tarball, 1.3 GB), the Python and Rust implementations performed very similarly.

tarsum-0.2 – A read-only version of tarsum

When I first scratched the itch of calculating checksums for every file in a tar archive, my original intention was to do it without extracting the archive. When I decided I wanted the script in bash for simplicity, I abandoned that idea and settled for extracting the files and then going over them to calculate their checksums.

So when Jon Flowers asked in the comments on the original tarsum post about the possibility of getting the checksums of files in the tar file without extracting the whole archive, I decided to re-tackle the problem.
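To illustrate what "without extracting" means in practice, here is a minimal sketch in Python. It is not the actual tarsum code; it only assumes the standard tarfile and hashlib modules, and the names tar_checksums and format_manifest are made up for this example. The archive is read as a forward-only stream and each member is hashed in chunks, so nothing is written to disk.

```python
import hashlib
import tarfile


def tar_checksums(fileobj, algorithm="md5"):
    """Yield (hex digest, member name) for each regular file in a tar stream.

    Hypothetical helper for illustration; not part of tarsum itself.
    """
    # Mode "r|*" treats the archive as a forward-only stream and detects
    # compression transparently. Members are read in place, never extracted.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, symlinks, devices, etc.
            digest = hashlib.new(algorithm)
            data = archive.extractfile(member)
            # Hash in 64 KiB chunks so large members never sit fully in memory.
            for chunk in iter(lambda: data.read(64 * 1024), b""):
                digest.update(chunk)
            yield digest.hexdigest(), member.name


def format_manifest(entries):
    """Return md5sum-style lines ("<digest>  <name>"), sorted by file name."""
    return [f"{digest}  {name}" for digest, name in sorted(entries, key=lambda e: e[1])]
```

Opening an archive with open("backup.tar", "rb") and passing the file object to tar_checksums would produce one digest per regular file. Sorting by name in format_manifest is a choice made here so that two runs over the same archive give identical output.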

tarsum – Calculate Checksums for Files inside a Tar Archive

Update: I’ve released tarsum-0.2, a new version of tarsum.

Some time ago, I got a hard disk back from data recovery. One of the annoying issues I encountered with the recovered data was corrupted files. Some files looked like they were recovered successfully, but their content was corrupted. The ones that were configuration files were usually easy to detect, as they raised errors in programs that tried to use them. But when such an error occurs in some general text file (or inside the data of an SQL dump), the file may seem perfectly fine unless closely inspected.

I have a habit of moving old backups to CDs (they are initially made to online storage) in order to reduce backup costs. But the corrupted files in the recovered data raised some concerns about my ability to recover from these discs. Suppose I have a disk failure and can't recover from my online backups for some reason: how can I check the integrity of my CD backups?

Storing and comparing only a hash signature for the whole archive is almost useless. It lets you check whether all the files are probably fine, but it can't distinguish a single corrupted file in the archive from a completely corrupted archive. My idea was to calculate a checksum (hash) for each file in the archive and store the signatures in a way that would allow me to see which individual files are corrupted.

This is where tarsum comes to the rescue. As its name implies, it calculates a checksum for each file in the archive. You can download tarsum from here.

Using tarsum is pretty straightforward.

tarsum backup.tar > backup.tar.md5

This calculates the MD5 checksums of the files in the archive. You can use other hashes as well by passing a tool that calculates them (it must work like md5sum):

tarsum --checksum=sha256sum backup.tar > backup.tar.sha256

To verify the integrity of the files inside the archive, we use the diff command:

tarsum backup.tar | diff backup.tar.md5 -

where backup.tar.md5 is the original signature file we created. This is possible because the signatures are sorted alphabetically by the file name inside the archive, so the order of the files is always the same.
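If you would rather get a list of the affected files than read a diff, the comparison can be sketched in a few lines of Python. This is only an illustration, not part of tarsum: it assumes each signature line has the md5sum text-mode layout (digest, two spaces, file name), and the names parse_manifest and compare_manifests are made up for this example.

```python
def parse_manifest(text):
    """Map file name -> digest from lines shaped like "<digest>  <name>".

    Assumes md5sum's text-mode layout: hex digest, two spaces, file name.
    """
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # The digest contains no spaces, so the first double space is the
        # separator even when the file name itself contains spaces.
        digest, name = line.split("  ", 1)
        entries[name] = digest
    return entries


def compare_manifests(original, current):
    """Report which files changed, disappeared, or appeared between two runs."""
    old, new = parse_manifest(original), parse_manifest(current)
    return {
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
        "missing": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
    }
```

Feeding it the contents of backup.tar.md5 and the output of a fresh tarsum run gives the names of the files whose checksums no longer match, regardless of the order of the lines.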

Note that if you use a recent version of GNU tar, tarsum can also operate directly on compressed archives (e.g. tar.bz2, tar.gz).