The Kalnel Files: eReader


Sunday, October 5, 2008

DIY Ebooks, Part 3: Clean Up

The cool thing about scanning text pages is that it frees text from the confines of the page. You can take the raw text and use it the way you want -- convert it to ebooks or other platforms for reading, index it for future reference, or even grab long passages for citations.

But, there is one complication that's easy to overlook in the initial rush of utility: Individual pages provide a fundamental organizational unit for books. References are organized by page -- lose the page and the text of the original index is useless. Footnotes sit relative to text on paper pages. Even endnotes sometimes reference the page of their citation.

Not only that, but pages contain lots of information that the eye easily filters out and that scanners and OCR software don't.

I don't have a perfect method for handling these issues -- yet -- but I can offer some advice about how to work around them.

Before OCRing:

1. Make a copy of all image files before you start editing, in case you mess up or decide to try something else in the future.

2. Remove page numbers, headers, and footers -- Although OmniPage OCR software can use templates that ignore information on certain parts of each page, I've never had much luck with this, since my scans aren't always uniform in size and position. Instead, I remove this stuff using PaperPort, either erasing or deleting it page by page. It sounds time-consuming, but it actually goes very quickly.

3. Make a decision about footnotes -- Most times, I simply remove footnotes so they don't get OCR'd, but when I see citations I want to keep, I handle it post-OCR. (See below)

4. Check for cut-off text and embedded images -- I often remove stylized initial caps to make OCRing more accurate.

5. Straighten text -- It OCRs better. PaperPort can do a batch straighten on all pages.

6. Remove pictures -- I scan these separately as TIFFs and either OCR or retype captions later. (Most times, it's just as quick to retype captions.)

7. Dump the index (or at least remove page references) -- You won't need it anyway, since you can search by keyword with software.
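For anyone who would rather script the backup in step 1 than copy files by hand, here is a minimal Python sketch. It is not part of my PaperPort workflow; the function name and the folder names in the usage note are made up for illustration, and it assumes all the page images for one book sit in a single folder.

```python
import shutil
from pathlib import Path


def backup_scans(scan_dir: str, backup_dir: str) -> int:
    """Copy the whole scan folder to backup_dir and return how many files were copied.

    Hypothetical helper: both paths are whatever folders you choose.
    """
    src = Path(scan_dir)
    dst = Path(backup_dir)
    # Refuse to overwrite an existing backup, so an earlier copy is never clobbered.
    if dst.exists():
        raise FileExistsError(f"Backup folder already exists: {dst}")
    shutil.copytree(src, dst)
    # Count the files in the backup as a quick sanity check.
    return sum(1 for p in dst.rglob("*") if p.is_file())
```

Calling something like `backup_scans("scans/my_book", "scans/my_book_originals")` before opening the images for editing leaves an untouched set to fall back on.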

After OCRing, with raw text in Word:

1. As with original images, I save the raw text before I start making changes to it, so I don't have to OCR the whole thing again, if I screw something up.

2. Remove stray line breaks -- I run a macro on the text that looks for every line break that is not immediately preceded by a period or other punctuation. This removes most stray line breaks.

3. Remove stray hyphens -- PaperPort (and OmniPage) do a stellar job of deleting printed hyphens that are no longer needed in digitized text, but there are usually a few hanging around. I use a macro to find and delete these.

4. Break out chapters -- I put some extra space in front of each chapter heading, for formatting.

5. Deal with remaining footnotes -- If I've decided to keep a few footnotes, I search the text for them and re-insert them immediately following the paragraph that contains the citation. I typically insert extra line breaks between text and citations to create a break when I'm reading.

6. Save the file.
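My macros for steps 2 and 3 live in Word, but the same idea can be sketched in Python for anyone working from the plain .txt file. This is an approximation under stated assumptions, not a copy of the macros: the function names are made up, the list of "ending" punctuation is my guess at a reasonable set, and the hyphen rule only handles a hyphen that sits right before a line break.

```python
import re

# Characters assumed to mark a genuine end of line (an assumption; adjust to taste).
END_PUNCTUATION = ".!?:;\"')"


def remove_line_end_hyphens(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line, e.g. 'exam-\\nple'.

    Caveat: this also joins real compounds that happened to break at the
    hyphen (such as 'well-\\nknown'), so spot-check the result.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def join_stray_line_breaks(text: str) -> str:
    """Turn a single line break into a space unless punctuation precedes it.

    Blank lines (two breaks in a row) are left alone so paragraph and
    chapter spacing survives.
    """
    pattern = r"(?<![" + re.escape(END_PUNCTUATION) + r"\n])\n(?!\n)"
    return re.sub(pattern, " ", text)


def clean_ocr_text(text: str) -> str:
    """Run the hyphen fix first, then the line-break fix."""
    return join_stray_line_breaks(remove_line_end_hyphens(text))
```

The hyphen pass runs first because, once the line breaks are gone, a leftover hyphen is much harder to tell apart from a legitimate one.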

In eReader's eBook Studio:

1. Paste text into eBook Studio from Word.

2. Find chapter headers, bold them, and create links to the table of contents (automatically created by the software).

3. Decide about end notes. You can create hotlinks to end notes (or anything else) in eBook Studio, although I usually don't bother. It seems needlessly time-consuming to me, but it is possible for those who want it.

4. Place photos. Since I've already converted the photo TIFFs to PNGs (247px high by 147px wide max), I drag and drop the pictures wherever I want them and insert the appropriate captions. Embedding them in the appropriate positions in the text is nice, although it also can be time consuming to find the right reference/location. Instead, most of the time I just create a photo section chapter at the end of the ebook.

5. Press "make book" and you're done. Read it on your PC, laptop, or PDA phone.
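The size limit in step 4 is easy to get wrong by eye, so here is a small Python sketch of the arithmetic for shrinking a photo to fit while keeping its proportions. The function name is made up, the defaults simply restate the 147px-wide by 247px-high maximum from step 4, and it only computes the target size; the actual resizing would still happen in whatever image editor you use.

```python
def fit_within(width: int, height: int, max_width: int = 147, max_height: int = 247) -> tuple:
    """Return (new_width, new_height) scaled to fit inside the maximum box.

    The aspect ratio is preserved, and images already small enough are
    returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Use the tighter of the two constraints, and never scale above 1.0.
    scale = min(max_width / width, max_height / height, 1.0)
    # Keep at least 1 pixel in each direction for extremely thin images.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 294 x 200 photo is limited by its width, so it comes out at 147 x 100, while a 100 x 100 photo is left as it is.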

And that's the whole process. Using macros -- and, if possible, OmniPage's masking capabilities -- can make the effort pretty simple. I would estimate that it takes me maybe an hour to do all these steps on a book of about 500 pages.

Key thing to keep in mind: As long as you have the original images, don't worry too much about the details on things like endnotes and footnotes. It's quicker to look those up in the original images (or the book itself) than it is to handle all that stuff in an ebook.

Thursday, October 2, 2008

DIY Ebooks, part 2

(Responding to the post below, Caroline asked a great question about ebook and scan formats. As I started to answer her, I realized that my answer was getting so long it probably makes a better blog post than comment response. So, Caroline, here goes:)

Trying to use ebooks (like reference books) as PDFs can be cumbersome, especially if the file is really large and not optimized for use as a book.

As an alternative, I turn the PDFs I create into .pdb files, which I can annotate and organize with chapter headings and photos. I can read these in either eReader or Mobipocket, both of which are free and available on multiple platforms, including Windows Mobile.

Here's my process:

1. Scan as PDFs -- Creates an image of the page for future use and reference.

2. OCR them and save as plain .txt files -- Gives me an open-format copy of the book's text, which, like the PDFs, I should be able to use well into the future.

3. Edit/clean up as Word documents -- I work with this program all the time, so it's the easiest one for me to work with. I've also created several macros to help me clean up text after scanning.

4. Convert to .pdb -- These files are easy to create (through the eReader book creator), handle pictures well, and work with both eReader and Mobipocket.

5. For books with lots of pictures, such as biographies, I also save the photos as TIFFs. To insert these into a .pdb file, I have to convert them to PNGs first.

Mobipocket also offers a much more streamlined (and free) alternative for creating ebooks -- just drag and drop a PDF, Word doc, or other file onto the Mobipocket window, and it will be converted to an ebook automatically. It's not as "clean" as an edited ebook, but it's a quick way to make a smaller ebook.

Wednesday, October 1, 2008

DIY Ebooks

As a voracious ebook reader, I have one big frustration: Although the selection of titles available through my favorite book stores, eReader and Mobipocket, is impressive -- and growing -- the stores still lack most out-of-print and "less popular" titles.

The other day, though, I found a great -- and free -- source for new ebooks: The public library. While my local does not have an ebook "lending" program as some systems do, they have shelves of what the do-it-yourself ebookmaker wants: Thousands of old books whose bindings are already broken, flexible, and easy to flatten for scanning. And, it's all free.

So, I've been taking matters into my own hands lately, by scanning some of the traditional books I've bought and never gotten around to reading -- and library books. Although it's a fairly time-consuming process -- an hour or two to scan and another hour to clean up the text and convert it to a .pdb file -- like many scanning tasks, it's easy to fit in for a few minutes here and there during the day or as I'm watching TV.

The process isn't perfect. Getting a book flat enough to get a decent scan is tough on the binding, and books with a lot of footnotes and references -- like many of the biographies I enjoy -- require a lot of clean up to create a good text.

Still, it gets the job done.

EDITED: To correct text pasted out of order
