Fixing Archive.org's PDFs

Here’s the webpage for a very early edition of Huckleberry Finn. If you open the PDF on a modern PC or tablet, it will look fine, though it is a little slow to load. If you open it on a Kindle, Nook Color, or some other older ebook reader that displays PDFs, you’re in for a shock.

Each page in these PDFs is actually 3 images. When put together by a modern PDF reader, they make one nice scanned PDF page. If you’re not using a modern reader, you see all 3 layers separately. This makes the book unreadable. Even if you are using a modern reader, these PDFs have a noticeable lag compared to other documents because the reader is loading 3 images per page.

This guide shows you how to eliminate the first two images and reverse the third image to be white on black. Will this fix the book 100%? No. However, if you value text over presentation, it does make the book readable on any device, including the good old E-ink Kindle.

Step 1. Install the applications (openSUSE):

sudo zypper in pdfmod imagemagick pandoc grename

Step 2. Create a directory for the image files, then convert the PDF to images:

mkdir huck
pdfimages huckleberry.pdf huck/

Step 3. The files that are created are all named -xxx.ppm or -xxx.pbm. Bash doesn’t like this, because most command-line tools treat a name that begins with a hyphen as an option. I use grename to rename every file so that it no longer begins with a hyphen.
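If you would rather stay in the terminal than use grename, the renaming can be sketched as a small shell function. This is only an illustration and not the method used above: the function name `rename_hyphen_files` and the default prefix `page` are made up for this example.

```shell
# Prefix every file whose name starts with a hyphen, so that tools no longer
# mistake the name for a command-line option.
# Both the function name and the default prefix "page" are hypothetical.
rename_hyphen_files() {
    rhf_dir=$1
    rhf_prefix=${2:-page}
    for rhf_file in "$rhf_dir"/-*; do
        [ -e "$rhf_file" ] || continue        # the glob matched nothing
        rhf_name=${rhf_file##*/}              # e.g. -000.pbm
        # "--" marks the end of options for mv
        mv -- "$rhf_file" "$rhf_dir/$rhf_prefix$rhf_name"   # e.g. page-000.pbm
    done
}
```

Run from the directory that contains huck, `rename_hyphen_files huck` would turn -000.ppm into page-000.ppm and so on, leaving other files alone.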

Step 4. cd to the directory and delete the extra image files:

cd huck
rm *.ppm

Step 5. Invert the colors of the .pbm files. This creates a new copy of each file, prefixed with in-, with inverted colors.

for i in *.pbm; do convert "$i" -monochrome -colors 2 -depth 1 -negate "in-$i"; done

Step 6. Move the completed files to a new directory and delete the originals:

mkdir finished
mv in-* finished/
rm *.pbm

Step 7. cd to the finished directory and create a new PDF. This will take time and may freeze your computer. Be patient.

cd finished
convert $(ls -v) huck_bw.pdf
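The page order of the new PDF depends on `ls -v`, which sorts the numbers inside file names naturally instead of character by character. To preview that order before starting the long conversion, something like the following sketch could be used. It swaps in `sort -V` from GNU coreutils for `ls -v`, and the function name `list_pages` is invented for this example.

```shell
# Print the .pbm files of a directory in natural (version) order, one per line.
# Assumes GNU coreutils, whose "sort -V" compares embedded numbers numerically.
# The function name list_pages is hypothetical.
list_pages() {
    printf '%s\n' "$1"/*.pbm | sort -V
}
```

Calling `list_pages .` inside the finished directory should list a file numbered 999 before one numbered 1000, which a plain alphabetical sort would get the wrong way round.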

Step 8. Shrink your newly created PDF because it is far too large right now.

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
   -dNOPAUSE -dQUIET -dBATCH -sOutputFile=huck_bw_final.pdf huck_bw.pdf

Your new PDF is complete. It is not as pretty as the original, but it is handier.

I then use pdfmod to edit the metadata so the ebook is easier to work with in calibre.

I’d be very interested to hear whether anyone has found a better way to do this with open source software, one that retains the color of the original without the multiple layers.