Fixing Archive.org's PDFs
Here’s the webpage for a very early edition of Huckleberry Finn. If you open the PDF on a modern PC or tablet, it will look fine, though it is a little slow to load. If you open it on your Kindle, Nook Color, or some other older ebook reader that displays PDFs, you’re in for a shock.
Each page in these PDFs is actually 3 images. When put together by a modern PDF reader, they make one nice scanned PDF page. If you’re not using a modern reader, you see all 3 layers separately, which makes the book unreadable. Even in a modern reader, these PDFs lag noticeably compared to other documents because the reader is loading 3 images per page.
This guide shows you how to eliminate the first two images and invert the colors of the third. Will this 100% fix the book? No. However, if you value text over presentation, it does make the book readable on any device, including the good old E-ink Kindle.
Step 1. Install the applications (openSUSE):
sudo zypper in pdfmod imagemagick pandoc grename
Step 2. Convert the PDF to images. Create a directory for the files to go to first:
mkdir huck
pdfimages huckleberry.pdf huck/
Step 3. The files that are created are all named -xxx.ppm and -xxx.pbm. Bash doesn’t like this, because most commands treat a name that begins with a hyphen as an option. I use grename to rename every file so that none of them begins with a hyphen.
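If grename is not available, the same renaming can be sketched in plain shell. This is a minimal sketch rather than the method used in this guide: `strip_leading_hyphen` is a hypothetical helper name, and the demonstration at the end runs on empty throwaway files in a temporary directory, not on the real images.

```shell
# Sketch: strip the leading hyphen from every file name in a directory.
# strip_leading_hyphen is a hypothetical helper, not part of any tool.
strip_leading_hyphen() {
    dir=$1
    for f in "$dir"/-*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        base=${f##*/}                    # file name only, e.g. -000.ppm
        mv -- "$f" "$dir/${base#-}"      # same name without the hyphen
    done
}

# Demonstration on throwaway files
demo=$(mktemp -d)
touch "$demo/-000.ppm" "$demo/-001.ppm" "$demo/-002.pbm"
strip_leading_hyphen "$demo"
ls "$demo"
```

For the book itself, the call would be `strip_leading_hyphen huck`, run from the directory that contains huck. Going through the directory path means no command ever sees an argument that starts with a hyphen.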
Step 4. cd to the directory and delete the extra image files:
cd huck
rm *.ppm
Step 5. Reverse the images of the .pbm files. This will create a new copy of the files with inverted colors.
for i in *.pbm; do convert "$i" -monochrome -colors 2 -depth 1 -negate "in-$i"; done
Step 6. Move the completed files to a new directory and delete the originals:
mkdir finished
mv in-* finished/
rm *.pbm
Step 7. cd to the finished directory and create a new pdf. This will take time and may freeze your computer. Be patient.
cd finished
convert `ls -v` huck_bw.pdf
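The `ls -v` in that command is what keeps the pages in order once the image numbers pass 999. The sketch below compares a plain byte-wise sort with the natural sort of `ls -v`, assuming (this is an assumption, not something stated above) that the image numbers are padded to three digits and so gain a digit after 999. It uses empty throwaway files and needs GNU ls.

```shell
# Sketch: why natural sort matters for numbered page images.
demo=$(mktemp -d)
touch "$demo/in-998.pbm" "$demo/in-999.pbm" "$demo/in-1000.pbm"

# Byte-wise order: "1000" sorts before "998" because '1' < '9'
plain=$(cd "$demo" && printf '%s\n' in-*.pbm | LC_ALL=C sort | paste -sd' ' -)

# Natural (version) order: digit runs are compared as numbers
natural=$(cd "$demo" && ls -v | paste -sd' ' -)

echo "plain:   $plain"
echo "natural: $natural"
```

With a plain sort, page 1000 would land before page 998 in the finished PDF. The natural sort keeps the scan order.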
Step 8. Shrink your newly created PDF because it is far too large right now.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
   -dNOPAUSE -dQUIET -dBATCH -sOutputFile=huck_bw_final.pdf huck_bw.pdf
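To see how much the shrink step saved, the two file sizes can be compared. This is a small sketch, not part of the original workflow: `report_size` is a hypothetical helper, and the demonstration uses dummy files of 1000 and 250 bytes in place of real PDFs.

```shell
# Sketch: report the size of a shrunk file as a percentage of the original.
# report_size is a hypothetical helper name.
report_size() {
    before=$(( $(wc -c < "$1") ))    # size of the original, in bytes
    after=$(( $(wc -c < "$2") ))     # size of the shrunk copy, in bytes
    if [ "$before" -eq 0 ]; then
        echo "$1 is empty" >&2
        return 1
    fi
    echo "$2 is $(( after * 100 / before ))% the size of $1"
}

# Demonstration on dummy files
demo=$(mktemp -d)
head -c 1000 /dev/zero > "$demo/big.pdf"
head -c 250 /dev/zero > "$demo/small.pdf"
summary=$(cd "$demo" && report_size big.pdf small.pdf)
echo "$summary"
```

For the book, the call would be `report_size huck_bw.pdf huck_bw_final.pdf`, run in the finished directory.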
Your new PDF is complete. It is not as pretty as the original, but it is handier.
I then use pdfmod to edit the metadata so the ebook is easier to work with in calibre.
I’d be very interested to hear from anyone who has found a better way to do this with open source software, one that retains the color of the original without the multiple layers.