Scanning Lecture Notes – Compression

A new semester is about to begin, so I again set out to organize lecture notes and scan them. This time I intend to invest more time in investigating and perfecting this process. Hopefully, I’ll present my conclusions in a few posts, each focusing on a different aspect.

In the first post, I’ll discuss the various ways to compress scanned lecture notes. Because my lecture notes aren’t especially colorful (I only use one pen at a time), I want the result to be black and white (line art). This keeps the notes readable while keeping the size per page small (as you can see in Some Tips on Scanning Lecture Notes).

Generating the Benchmarks

When scanning documents, PDF acts as a container, and each scanned image is stored inside it as a compressed binary stream. Given a PDF, you can see which compression was used by running strings and grep on it and looking for entries like /F /SomethingDecode or /Filter /SomethingDecode. For example:

$ strings lectures.pdf | grep Decode
  /F /FlateDecode

(The last line is repeated for every page.) This PDF was created directly by XSane. FlateDecode means the document uses Deflate (the algorithm behind Zip) for compression. However, several other compression algorithms can be used, and each has its own filter name. Below (taken from the ImageMagick documentation) is a concise list of the -compress options and the filters they produce:

"-compress none"    '/ASCII85Decode'
"-compress zip"     '/FlateDecode'
"-compress jpeg"    '/DCTDecode'
"-compress lzw"     '/LZWDecode'
"-compress fax"     '/CCITTFaxDecode'
"+compress"
"-compress rle"
anything else       '/RunLengthDecode'
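
To summarize which filters a PDF uses without reading the raw grep output, the table above can be turned into a small lookup. This is only a minimal sketch: the function names filter_to_compress and count_filters are made up for illustration, and the mapping covers just the filters listed above.

```shell
# Map a PDF stream filter name to the ImageMagick -compress value from the
# table above. (Function names are illustrative, not part of any tool.)
filter_to_compress() {
    case "$1" in
        /ASCII85Decode)   echo none ;;
        /FlateDecode)     echo zip ;;
        /DCTDecode)       echo jpeg ;;
        /LZWDecode)       echo lzw ;;
        /CCITTFaxDecode)  echo fax ;;
        /RunLengthDecode) echo rle ;;
        *)                echo unknown ;;
    esac
}

# Read text on stdin (e.g. the output of `strings file.pdf`) and print one
# line per distinct filter: occurrence count, filter name, -compress value.
# The pattern requires at least one character before "Decode", so keys such
# as /DecodeParms are not counted as filters.
count_filters() {
    grep -oE '/[A-Za-z0-9]+Decode' | sort | uniq -c |
    while read -r count filter; do
        echo "$count $filter $(filter_to_compress "$filter")"
    done
}
```

It would be used as strings lectures.pdf | count_filters. Filters that are not in the table are reported as unknown rather than guessed.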

So, I’ve generated benchmark PDFs with LZW, Zip (what XSane uses), and Group 4 (which turned out to be identical to fax).

$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress lzw XXX_lzw.pdf
$ pdftk *_lzw.pdf cat output lzw.pdf

$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress zip XXX_zip.pdf
$ pdftk *_zip.pdf cat output deflate.pdf

$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress group4 XXX_gp4.pdf
$ pdftk *_gp4.pdf cat output group4.pdf

The -density 600 parameter indicates the DPI for the scans and allows us to retain the correct physical dimensions in the PDF, which is important for documents.
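
The relationship is plain arithmetic: the physical size in inches is the pixel count divided by the DPI. A small sketch follows; the helper name px_to_inches is made up for this example, the 4960x7016 figures assume a roughly A4-sized page scanned at 600 DPI, and the 72 DPI line assumes that is the density applied when none is given.

```shell
# Convert a pixel dimension to inches for a given DPI.
px_to_inches() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px / dpi }'
}

px_to_inches 4960 600   # width of a roughly A4-sized scan
px_to_inches 7016 600   # height of the same scan
px_to_inches 4960 72    # the same width interpreted at 72 DPI
```

At 600 DPI this gives about 8.27 by 11.69 inches, while at 72 DPI the same width would come out at almost 69 inches, which is why the density has to be stated explicitly.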

ImageMagick supports several other compression algorithms, as convert -list compress shows. The most interesting one that appears there is JBIG2. JBIG2 is a modern bi-tonal compression standard introduced in 2000, and it has been part of the PDF standard since version 1.4. As I understand it, it surpasses Group 4 compression (which is also bi-tonal and was designed for faxes), and it should work very similarly to the bi-tonal JB2 compression in DjVu.

Unfortunately, it seems that ImageMagick doesn’t support encoding JBIG2 images into PDFs (only decoding), as trying to encode with it resulted in /RunLengthDecode streams. From what I’ve heard, JBIG2 should give DjVu fair competition in terms of compression, so I looked for other means and found jbig2enc. jbig2enc provides two useful programs: jbig2, which takes several images and builds the compression index for them, and pdf.py, which takes those indexes and embeds them inside a PDF. jbig2enc isn’t (yet) in Ubuntu’s repository, but it’s very easy to compile (few dependencies, automake). So I used it to create a PDF with JBIG2 compression:

$ ./agl-jbig2enc-d5cb3d5/src/jbig2 -s --pdf *.pbm
# This creates a bunch of output.* files, which can be discarded afterwards
$ python agl-jbig2enc-d5cb3d5/pdf.py output > jbig2_pbm.pdf
$ rm output.*

(I’ve also had to patch pdf.py to use the right DPI, as jbig2 can’t extract it from the pbm files.)

For the sake of comparison, I’ve also used minidjvu to create both lossless and lossy DjVu files.

$ minidjvu --dpi 600 -a 0 *.pbm lossless.djvu
$ minidjvu --dpi 600 --lossy *.pbm lossy.djvu

Results

All the benchmarks were run on the same 38 pages of handwritten notes. The raw results (du reports sizes in kilobytes) are:

$ du deflate.pdf lzw.pdf group4.pdf jbig2_pbm.pdf lossless.djvu lossy.djvu 
10036   deflate.pdf
9680    lzw.pdf
2716    group4.pdf
2136    jbig2_pbm.pdf
2164    lossless.djvu
1828    lossy.djvu

Both Zip (Deflate) and LZW are general-purpose compression algorithms, so it’s not surprising that they perform worst. The old Group 4 performs quite well, but it is left behind by the modern options of JBIG2 and DjVu. The JBIG2 encoding I did was lossy (I had some problems with the lossless one), and its size is comparable to that of the lossless DjVu file. The lossy DjVu encoding beat JBIG2, but not by much.
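
To put those figures in proportion, here is a rough sketch that turns the du output into a size per page and a percentage of the first (Deflate) file. The function name size_report is made up for this example, and it assumes du reports 1K blocks.

```shell
# Read `du` output (size in 1K blocks, then file name) on stdin and print,
# for each file, its size per page and its size relative to the first file.
# The page count is passed as the first argument.
size_report() {
    awk -v pages="$1" '
        NR == 1 { base = $1 }
        { printf "%-14s %6.1f KB/page %6.1f%%\n", $2, $1 / pages, 100 * $1 / base }
    '
}

size_report 38 <<'EOF'
10036   deflate.pdf
9680    lzw.pdf
2716    group4.pdf
2136    jbig2_pbm.pdf
2164    lossless.djvu
1828    lossy.djvu
EOF
```

On these figures, Group 4 comes out at about 27% of the Deflate file, and JBIG2 and both DjVu variants at roughly 18 to 22%, or about 48 to 57 KB per page.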

Conclusion

Using XSane’s (or probably any other scanning software’s) built-in PDF creation is a waste of bits. These programs use the wrong tool for the job, and the PDFs they produce are far too large. DjVu is the best choice for lossless compression, and the visual difference between its lossy and lossless output is negligible. If you must use PDF, I suggest going for JBIG2. If you’re fine with DjVu, go for lossless, as the lossy file is only about 15% smaller (1828 KB versus 2164 KB).

Further Work

So far, I’ve assumed that I already have black-and-white images. I’ve relied on a simple conversion from grayscale, which doesn’t always perform as well as it should. It looks like better color separation can be achieved with ImageMagick and a smarter filter, especially when the notes are written in non-black ink.

Update: I’ve posted my results in Scanning Lecture Notes – Separating Colors.

The other, less significant, issue is deskewing, i.e., automatically straightening the scanned images. This should remove small tilts introduced while scanning (and while writing).

I plan to cover both topics in the following posts.

Update 2012-10-12: The lossless DjVu creation command was missing -a 0. Fixing that resulted in minor file size changes for lossless.djvu, and as such, I took the liberty of not updating the figures.
