How To: OCR an article from print

Discussion in 'How To' started by slobeck, Nov 9, 2011.

  1. slobeck Member

    Hopefully this will elicit some suggestions from people for various operating systems. I am figuring this out on both PowerPC and Intel-based Macs (OS X 10.5.8 and 10.7.1 respectively).

    Report to follow unless someone else beats me to it.
  2. telomere Member

    bump for beating it
  3. Anonymous Member

    Decent OCR software isn't free, and the stuff people get packaged along with their scanners tends to be shit. If you have good software and a decent scanner then you are sorted, but it will cost you a bit.
  4. moarxenu Member

    Recommendations on OCR software?
  5. slobeck Member

    Strange... this got split off from another thread. It's been a while since I OCR'd the Newsweek cover article about Anonymous (which, if I remember, was the thread this got split from). So, I'll get on it.
  6. Anonymous Member

    OmniPage Professional. Not cheap, mind.
  7. Anonymous Member

    Linux is free. Well, if you get past that point, it's not that difficult even if you normally use Windows. The OCR is pretty good and usable for anything other than arbitrary handwriting.

    You can now install Ubuntu Linux as an application, without adding a partition for a dual-boot system. Of course, you can also download Ubuntu and boot it from a CD or USB flash drive.

    From the Ubuntu software centre, you can install the package named:

    Open up a terminal. The command for tesseract is
    #tesseract image.tif outputfilename

    for cuneiform
    #cuneiform image.jpg

    The output is a text file.
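
    A rough sketch of how the tesseract command above could be wrapped to handle a whole folder of scans. The function names `ocr_base` and `ocr_all` are made up for illustration, and it assumes `tesseract` is already installed and on your PATH. Tesseract adds the `.txt` extension to the output name itself, so only the base name is passed.

```shell
#!/bin/sh
# Hypothetical helpers, not part of tesseract itself.

# Derive the output base name from an image path:
# strips the directory and the last extension (scans/page1.tif -> page1).
ocr_base() {
    name=$(basename "$1")
    printf '%s\n' "${name%.*}"
}

# Run tesseract on every image given as an argument.
# Unreadable files are reported on stderr and skipped.
ocr_all() {
    for img in "$@"; do
        [ -r "$img" ] || { echo "skipping $img" >&2; continue; }
        # Writes <base>.txt into the current directory.
        tesseract "$img" "$(ocr_base "$img")"
    done
}
```

    Something like `ocr_all scans/*.tif` would then leave one text file per page in the current directory.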

    For a typical printed letter, you shouldn't expect errors. Tesseract is supposed to be more accurate, but it only accepts (uncompressed) TIF, while cuneiform accepts most formats. TIF is basically a 1 bit bitmap.
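
    If your scan is a JPG or PNG, one possible workaround is to convert it to an uncompressed TIF first. This is only a sketch: it assumes ImageMagick's `convert` is installed (it is not mentioned anywhere else in this thread), and `tif_name` and `to_tif_and_ocr` are hypothetical names. It writes a grayscale TIF rather than a strict 1-bit one.

```shell
#!/bin/sh
# Hypothetical helpers; assumes ImageMagick `convert` and `tesseract` are on the PATH.

# Name of the TIF written next to the source image (scan.jpg -> scan.tif).
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert one image to an uncompressed grayscale TIF, then OCR it.
# tesseract only runs if the conversion succeeded.
to_tif_and_ocr() {
    src=$1
    tif=$(tif_name "$src")
    convert "$src" -colorspace Gray -compress none "$tif" &&
        tesseract "$tif" "${tif%.tif}"
}
```

    For example, `to_tif_and_ocr scan.jpg` should leave `scan.tif` and `scan.txt` beside the original.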

    OCRfeeder is the GUI that detects and uses both engines, and others. It can deal with multi-column layouts and then some. But it doesn't work. As usual the bug is reported but, well, nobody can tell you anything.
  8. Zhent Member

    Adobe Acrobat Professional has pretty decent OCR software. Just scan an article as normal to a PDF, then run Adobe's OCR from within the PDF.

    Many scanners also come with bundled OEM software. For example, all Samsung products come with their own software that you can freely use for OCR. The results are quite good as well, though I found Adobe's slightly better.
  9. Anonymous Member

    Found it. The GUI on Linux that works is gImageReader. You have to google and download the .deb file for Ubuntu. Double click to install, as in Windows. The GUI is simple to use. Everything from installation onward can be done without typing a command.

    It only supports tesseract, but will convert images transparently. It has the usual goodies such as selecting an area for recognition, multi-page documents, spell check and post-editing. Tesseract supports many languages, while most others only support Western alphabets.
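
    On the command line, the language is picked with tesseract's `-l` option and a three-letter code such as `eng`, `deu` or `fra`. A small sketch that only prints the command it would run, so you can check it first; `tess_cmd` is a made-up name, and it assumes the matching language data is installed.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the tesseract command
# for an image and a language code. Defaults to English.
tess_cmd() {
    img=$1
    lang=${2:-eng}
    base=${img%.*}   # output base; tesseract appends .txt itself
    printf 'tesseract %s %s -l %s\n' "$img" "$base" "$lang"
}
```

    For example, `tess_cmd page.tif deu` prints `tesseract page.tif page -l deu`.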

    Most scanners should work on Linux, eventually, except for the very new and very old.
