Sunday, February 20, 2011

DoggyBook


DoggyBook is an open project to digitize books, pamphlets and papers for use on e-readers like the Kindle and Nook.

I have recently gotten very interested in the availability of classic texts in digital format. The reason? Dad just bought himself an Amazon Kindle 3 e-reader, and he lets me use it as well. Most of my reading list is older technical manuals, and many of them are only available as PDFs. The text is "frozen" and will not rewrap to fit a small page. Many of these PDFs are actually scanned page images, which makes them even less flexible: you can view a page "fit to screen" (very small) or full size (hard to navigate). So I have been converting the Handbook of Biomass Downdraft Gasifier Engine Systems to a more "e-friendly" format. This is a government publication in the public domain.

After looking around, I discovered that the process of digitizing an old book consists of three major steps.

1) Take a snapshot of the physical page, using a scanner or similar device. This can be exported directly for "snapshot" PDFs.

2) "Read" the text on the page using Optical Character Recognition (OCR), which leaves you with an unformatted mass of misspelled text - but text nonetheless. All images, tables and equations must be skipped by the OCR, as they only produce garbled nonsense. 

3) Spell-check the text (mostly by hand), edit it back to the original formatting as much as possible, and add the images and tables back in as graphics. This is the human element, and it is why so many older texts are only available as PDF snapshots. Once formatted properly, the book can be saved out to an e-book format like EPUB or MOBI.
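Some of the cleanup in step 3 is mechanical and can be scripted before the hand proofreading starts. Here is a minimal Python sketch of that idea, assuming the OCR output is plain text with hard line breaks and blank lines between paragraphs. The function name clean_ocr_text is my own invention, not part of any OCR tool.

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output: rejoin hyphenated words and unwrap paragraphs."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split across a line break: "gasi-\nfier" -> "gasifier".
    # Assumption: a hyphen at the end of a line is a line-break hyphen.
    # This will also merge real compounds ("well-\nknown"), so proofread.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        # Collapse line breaks and runs of spaces into single spaces.
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)

if __name__ == "__main__":
    sample = "The gasi-\nfier burns\nwood chips.\n\n\nSecond  paragraph."
    print(clean_ocr_text(sample))
```

This only fixes line wrapping and spacing. Misread letters, tables and equations still need a human eye.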

It is a lot of work, especially if you are not in the business of doing it. I have found some free tools to help with the job. The first is Adobe Reader, most versions of which will export a PDF to a text file. The only catch is that there has to be some text to export, and "snapshot" PDFs are only images, with no text.

So instead we need FreeOCR, which is built on the Tesseract OCR engine, reportedly the same engine at the heart of the Google Books OCR software. It is plain and unadorned: you see a panel with the PDF page, draw a box around the text area, and hit "Convert". Depending on the amount of text, each chunk takes about a minute to process, and the plain text output then appears in the other panel. Copy this to a Word document, then clear the cache and do some more.

I did a 140-page book in about 2 hours. That leaves you with one massive .doc file. A file this size is difficult to work on with a slow PC like mine, so breaking it into chapters makes sense.
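Breaking the text into chapters can also be scripted if the book is saved as plain text first. This is a rough Python sketch, assuming each chapter starts with a heading line of the form "Chapter 1 ...". That pattern and the function name split_into_chapters are assumptions for illustration, so adjust the pattern to match the headings in your own book.

```python
import re

# Assumed heading style: a line beginning with "Chapter" and a number.
CHAPTER_HEADING = re.compile(r"^Chapter\s+\d+\b.*$", re.MULTILINE)

def split_into_chapters(text: str) -> list[tuple[str, str]]:
    """Return (title, body) pairs, one per chapter, in document order."""
    starts = [match.start() for match in CHAPTER_HEADING.finditer(text)]
    if not starts:
        # No headings found: hand back the whole text as one chunk.
        return [("front_matter", text)]
    chunks = []
    # Anything before the first heading (title page, preface, ...).
    front = text[:starts[0]].strip()
    if front:
        chunks.append(("front_matter", front))
    for i, start in enumerate(starts):
        # A chapter runs until the next heading, or the end of the text.
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        body = text[start:end].strip()
        title = body.splitlines()[0].strip()
        chunks.append((title, body))
    return chunks

if __name__ == "__main__":
    sample = "Preface\nChapter 1 Intro\nbody one\nChapter 2 Fuel\nbody two\n"
    for title, body in split_into_chapters(sample):
        print(title, "-", len(body), "characters")
```

Each chunk can then be written to its own file or pasted onto its own page, which keeps every piece small enough to edit comfortably.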

Then I decided to move the whole project out into the open, so that more eyes can help spot the typos. I created a wiki called DoggyBook, and the whole text is posted on various pages there. I will continue to edit it online as I can, and maybe interested folks will join me and speed this thing along. Periodically I will compile the whole thing as a .doc file and post it for download on the front page. Eventually, all the typos will get smoothed out, and images will be inserted in the right places.

This is only worth the effort for a special book. In the area of biomass gasification, this is certainly one of the classic texts. In fact, it is still in print: the Biomass Energy Foundation will sell you a spiral-bound photocopy for $35. But nobody has it available for e-readers yet. I aim to fix that.