Sunday, October 5, 2008

DIY Ebooks, Part 3: Clean Up

The cool thing about scanning text pages is that it frees text from the confines of the page. You can take the raw text and use it the way you want -- convert it to ebooks or other platforms for reading, index it for future reference, or even grab long passages for citations.


But, there is one complication that's easy to overlook in the initial rush of utility: Individual pages provide a fundamental organizational unit for books. References are organized by page -- lose the page and the text of the original index is useless. Footnotes sit relative to text on paper pages. Even endnotes sometimes reference the page of their citation.


Not only that, pages contain lots of information the eye can easily filter out, but scanners and OCR software don't.


I don't have a perfect method for handling these issues -- yet -- but I can offer some advice about how to work around them.


Before OCRing:


1. Make a copy of all image files before you start editing, in case you mess up or decide to try something else in the future.


2. Remove page numbers, headers, and footers -- Although OmniPage OCR software can use templates that ignore information on certain parts of each page, I've never had much luck using this, since my scans aren't always uniform in size and position. Instead, I remove this stuff using PaperPort, either erasing or deleting it page-by-page. Sounds time-consuming, but actually goes very quickly.


3. Make a decision about footnotes -- Most times, I simply remove footnotes so they don't get OCR'd, but when I see citations I want to keep, I handle it post-OCR. (See below)


4. Check for cut-off text and embedded images -- I often remove stylized initial caps to make OCRing more accurate.


5. Straighten text -- It OCRs better. PaperPort can do a batch straighten on all pages.


6. Remove pictures -- I scan these separately as TIFFs and either OCR or retype captions later. (Most times, it's just as quick to retype captions.)


7. Dump the index (or at least remove page references) -- You won't need it anyway, since you can search by keyword with software.


After OCRing, with raw text in Word:


1. As with original images, I save the raw text before I start making changes to it, so I don't have to OCR the whole thing again, if I screw something up.


2. Remove stray line breaks -- I run a macro on the text that looks for every line break that is not immediately preceded by a period or other punctuation. This removes most stray line breaks.


3. Remove stray hyphens -- PaperPort (and OmniPage) do a stellar job of deleting printed hyphens that are no longer needed in digitized text, but there are usually a few hanging around. I use a macro to find and delete these.


4. Break out chapters -- I put some extra space in front of each chapter heading, for formatting.


5. Deal with remaining footnotes -- If I've decided to keep a few footnotes, I search the text for them and re-insert them immediately following the paragraph that contains the citation. I typically insert extra line breaks between text and citations to create a break when I'm reading.


6. Save the file.
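For anyone who would rather script steps 2 and 3 than write Word macros, here is a rough Python sketch of the same idea. It is not the macro I actually use. The function names, the list of "end of paragraph" punctuation, and the assumption that a stray hyphen is one sitting between two lowercase letters at the end of a line are all mine, and it assumes the text uses plain "\n" line endings.

```python
import re

# Characters assumed to mark a real paragraph end (an assumption; adjust for your book).
END_PUNCT = ".!?:;\"')"


def join_hyphenated(text: str) -> str:
    """Rejoin a word split by a hyphen at a line end, e.g. 'para-' + 'graph'.

    Assumption: a stray hyphen sits between two lowercase letters,
    directly before a line break.
    """
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def remove_stray_breaks(text: str) -> str:
    """Turn a single line break into a space unless punctuation precedes it.

    Blank lines (two breaks in a row) are left alone so paragraph
    spacing survives.
    """
    pattern = r"(?<![" + re.escape(END_PUNCT) + r"\n])\n(?!\n)"
    return re.sub(pattern, " ", text)


def clean_ocr_text(text: str) -> str:
    """Fix hyphens first, then line breaks, so split words are rejoined whole."""
    return remove_stray_breaks(join_hyphenated(text))
```

Usage would be something like clean_ocr_text(open("raw.txt").read()), where raw.txt is a hypothetical file holding the saved raw OCR text. Note that the hyphen rule will also join a genuine compound that happened to break at a line end (well-known becomes wellknown), so a quick search afterward is still worth it.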


In eReader's eBook Studio:


1. Paste text into eBook Studio from Word.


2. Find chapter headers, bold them, and create links to the table of contents (automatically created by the software)


3. Decide about end notes. You can create hotlinks to end notes (or anything else) in eBook Studio, although I usually don't bother. It seems needlessly time-consuming to me, but it is possible for those who want it.


4. Place photos. Since I've already converted the photo TIFFs to PNGs (247px high by 147px wide max), I drag and drop the pictures wherever I want them and insert the appropriate captions. Embedding them in the appropriate positions in the text is nice, although it also can be time consuming to find the right reference/location. Instead, most of the time I just create a photo section chapter at the end of the ebook.


5. Press "make book" and you're done. Read it on your PC, laptop, or PDA phone.
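On the photo sizes in step 4: if you want to work out what dimensions a scan should be shrunk to before converting it, the arithmetic is simple. This is only a sketch under my own assumptions. The function name fit_within is made up, and it assumes you want to keep the picture's proportions and never enlarge a small one. It only computes the numbers; the actual resizing still happens in whatever image editor you use.

```python
def fit_within(width: int, height: int,
               max_width: int = 147, max_height: int = 247) -> tuple[int, int]:
    """Return (width, height) scaled to fit inside the maximum box.

    Assumptions: the aspect ratio is preserved and images already
    smaller than the box are left at their original size.
    """
    # Use the tighter of the two limits; cap at 1.0 so nothing is enlarged.
    scale = min(max_width / width, max_height / height, 1.0)
    # Round to whole pixels, never dropping below 1px on either side.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, fit_within(600, 400) gives the size a 600px-wide by 400px-high landscape scan should be reduced to so it stays within 147px wide.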


And that's the whole process. Using macros -- and, if possible, OmniPage's masking capabilities -- can make the effort pretty simple. I would estimate that it takes me maybe an hour to do all these steps on a book of about 500 pages.


Key thing to keep in mind: As long as you have the original images, don't worry too much about the details on things like endnotes and footnotes. It's quicker to look those up in the original images (or the book itself) than it is to handle all that stuff in an ebook.



4 comments:

Anonymous said...

kal,

have you thought of making the page size and margins identical to the original book, and setting the typeface to something similar? i would think that would preserve the pagination issue....

keith

Kalnel said...

Hi Keith,

Good thought, but it wouldn't work -- ereaders generally flow raw text, they don't use any kind of page container.

(Besides that would all but eliminate reading on the handheld, which is my favorite reader. It would be awful -- like trying to read documents in Adobe Reader.)

Getting away from physical containers -- even the geography of a fixed page -- is a high priority for going digital. One of my goals is to turn my data into a format that will flow easily into any container I want, not creating new containers.

Keep thinking, though... Thanks!
kal

Unknown said...

Hi Kal,

I finally made it over to check out your blog. Great! I'm looking forward to lots of cool new ideas.

Today's subject is of particular interest to me. I have a huge dead tree library that I've dreamed of scanning for years but it was such a daunting task. But now I feel up to giving it a try.

Kalnel said...

Hi Catherine,

Glad you stopped by and found the info helpful!

I know what you mean about a whole library being a daunting task. I'd love to have my entire library digitized, but right now I'm focused just on new books (that I can't otherwise get as ebooks) and a few favorites.

I wish there was a legal way to scan and trade the texts -- like Napster for books -- with others who already own the same book. Maybe allow users to download only as many texts as they've uploaded, just to encourage contributions and fairness.

kal