A new semester is about to begin, so I am once again setting out to organize my lecture notes and scan them. This time I intend to invest more time in investigating and perfecting the process. Hopefully, I’ll present my conclusions in a few posts, each focusing on a different aspect.
In the first post, I’ll discuss the various ways to compress the scanned lecture notes. Because lecture notes (at least mine) aren’t especially colorful, as I only use one pen at a time, I want the result to be black and white (line art). This keeps the lecture notes readable while preserving a small size per page (as you can see in Some Tips on Scanning Lecture Notes).
Generating the Benchmarks
When scanning documents, PDF acts as a container and the scanned images are stored inside it as binary streams. Given a PDF, you can see which compression it uses by running grep and looking for strings like /F /SomethingDecode or /Filter /SomethingDecode. For example:
$ strings lectures.pdf | grep Decode
/F /FlateDecode
(The last line is repeated for every page.) This PDF was created directly by XSane. A bit of searching shows that this filter means the document uses Deflate (the algorithm behind Zip) as its compression algorithm. However, several other compression algorithms can be used, and each has a different filter name. Below (taken from the ImageMagick documentation) is a concise list of the possible compression algorithms:
"-compress none"    '/ASCII85Decode'
"-compress zip"     '/FlateDecode'
"-compress jpeg"    '/DCTDecode'
"-compress lzw"     '/LZWDecode'
"-compress fax"     '/CCITTFaxDecode'
"+compress" "-compress rle" any thing else    '/RunLengthDecode'
To compare them, I converted each scanned page with ImageMagick using the different compression options and merged the resulting pages with pdftk:
$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress lzw XXX_lzw.pdf
$ pdftk *_lzw.pdf output lzw.pdf
$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress zip XXX_zip.pdf
$ pdftk *_zip.pdf output deflate.pdf
$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress group4 XXX_gp4.pdf
$ pdftk *_gp4.pdf output group4.pdf
The -density 600 parameter indicates the DPI of the scans and lets us retain the correct physical dimensions in the PDF (which is important for documents).
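The arithmetic behind the density setting is simply pixels divided by DPI. Here is a minimal sketch of it; page_inches is a hypothetical helper, and the 4960 x 7016 pixel size is an assumed A4 scan at 600 DPI rather than a figure taken from my own scans.

```shell
# Hypothetical helper: physical page size in inches for a scan of
# WIDTH x HEIGHT pixels at the given DPI (inches = pixels / DPI).
page_inches() {
    LC_ALL=C awk -v w="$1" -v h="$2" -v dpi="$3" \
        'BEGIN { printf "%.2f x %.2f\n", w / dpi, h / dpi }'
}

# Assumed example: an A4 page scanned at 600 DPI is about 4960 x 7016 pixels.
page_inches 4960 7016 600
# The same pixels tagged with a much lower DPI would claim a far larger page.
page_inches 4960 7016 72
```

The second call shows why the density matters: the pixel data is identical, but a viewer or printer would treat the page as several times larger.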
ImageMagick supports several other compression algorithms, as can be seen via convert -list compress. The most interesting one that appears there is JBIG2.
JBIG2 is a modern bi-tonal compression standard introduced in 2000, and it has been part of the PDF standard since version 1.4. As I understand it, it surpasses Group 4 compression (which is also bi-tonal and was designed for faxes), and it should work very similarly to the bi-tonal JB2 compression in DjVu.
Unfortunately, it seems that ImageMagick doesn’t support encoding JBIG2 images into PDFs (only decoding), as trying to encode with it resulted in /RunLengthDecode streams. Since I had heard that JBIG2 should give DjVu fair competition in terms of compression, I looked for other means and found jbig2enc. It provides two useful programs:
jbig2, which is able to take several images and build the compression index for them, and
pdf.py, which takes those indexes and embeds them inside a PDF.
jbig2enc isn’t (yet) in Ubuntu’s repository, but it’s very easy to compile (few dependencies, automake). So I’ve used it to create a PDF with JBIG2 compression:
$ ./agl-jbig2enc-d5cb3d5/src/jbig2 -s --pdf *.pbm
# This creates a bunch of output.* files which can be discarded afterwards
$ python agl-jbig2enc-d5cb3d5/pdf.py output > jbig2_pbm.pdf
$ rm output.*
(I also had to patch pdf.py to use the right DPI, as jbig2 can’t extract it from the input images.)
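To confirm which filter a generated PDF actually ended up with, the strings/grep check from the beginning of the post can be turned into a small tally. This is only a sketch: pdf_filters is a hypothetical helper, it uses grep -a instead of strings, and it only sees filter names that are stored uncompressed in the file. For the JBIG2 file I would expect the PDF filter name /JBIG2Decode to show up.

```shell
# Hypothetical helper: count how often each stream filter name occurs in a PDF.
# grep -a treats the binary file as text; -o prints only the matched names.
pdf_filters() {
    grep -a -o '/[A-Za-z0-9]*Decode' "$1" | sort | uniq -c
}

# Usage: pdf_filters jbig2_pbm.pdf
```

Each output line is a count followed by a filter name, so a file that mixes filters (or silently fell back to /RunLengthDecode) is easy to spot.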
For the sake of comparison, I’ve also used minidjvu to create both lossless and lossy DjVu files:
$ minidjvu --dpi 600 -a 0 *.pbm lossless.djvu
$ minidjvu --dpi 600 --lossy *.pbm lossy.djvu
All the benchmarks have been made on 38 pages of handwritten notes. The raw results are:
$ du deflate.pdf lzw.pdf group4.pdf jbig2_pbm.pdf lossless.djvu lossy.djvu
10036 deflate.pdf
9680 lzw.pdf
2716 group4.pdf
2136 jbig2_pbm.pdf
2164 lossless.djvu
1828 lossy.djvu
Both Zip (Deflate) and LZW are general-purpose compression algorithms, hence it’s not surprising that they perform worst. The old Group 4 performs considerably well (about 27% of the Deflate size), but it is left behind by the modern options, JBIG2 and DjVu. The JBIG2 encoding I did was lossy (I had some problems with the lossless one), and it is comparable in size to the lossless DjVu file. The lossy DjVu encoding surpassed JBIG2, but not by far (about 14% smaller).
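To make the comparison easier to read, the raw du figures can be turned into relative sizes and per-page averages. This is a rough sketch: size_report is a hypothetical helper, the first line is taken as the baseline, and the per-page column assumes that du reports 1 KiB blocks (the GNU default).

```shell
# Hypothetical helper: reads du-style lines ("SIZE NAME") on stdin and prints,
# for each file, its size relative to the first line and the average per page.
size_report() {
    LC_ALL=C awk -v pages="$1" '
        NR == 1 { base = $1 }
        { printf "%-16s %6.1f%% %7.1f KiB/page\n", $2, 100 * $1 / base, $1 / pages }'
}

# Figures copied from the du output above; 38 is the number of scanned pages.
size_report 38 <<'EOF'
10036 deflate.pdf
9680 lzw.pdf
2716 group4.pdf
2136 jbig2_pbm.pdf
2164 lossless.djvu
1828 lossy.djvu
EOF
```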
Using XSane’s PDF creation (or probably that of any other scanning software) is a waste of bits. They just use the wrong tools for the job, and the PDF produced is way too large. DjVu is the best choice for lossless compression, but the difference between lossy and lossless is perceptually negligible. If you must use PDF, I suggest going for JBIG2. If you’re fine with DjVu, go for the lossless option (as the difference in bytes isn’t big).
In this part I’ve assumed that I have black and white images. I’ve relied on a simple conversion from gray-scale, which doesn’t always perform as well as it should. It looks like better color separation can be achieved by using ImageMagick and some smart filter, especially when the notes are written with a pen with non-black ink.
Update: I’ve posted my results in Scanning Lecture Notes – Separating Colors.
The other, less significant, issue is deskewing, i.e. automatically aligning the scanned images. This should remove small tilts introduced while scanning (and while writing).
I plan to cover both topics in the following posts.
Update 2012-10-12: The lossless DjVu creation command was missing -a 0. Fixing that resulted in only minor file size changes for lossless.djvu, so I took the liberty of not updating the figures.