Optical Character Recognition with Tesseract

Using Tesseract OCR on a PDF file:

  1. Tesseract doesn’t read PDFs.

  2. However, ImageMagick can convert a PDF to TIFF.

  3. Warning: TIFF files can be very large, and creating them can require a lot of RAM.
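
To get a feel for the warning above, here is a rough size estimate for the raw pixel data of one page. It is only a sketch: the function name page_bytes is made up for this note, and the US Letter page size (8.5 x 11 inches) and 3 colour channels (RGB) are assumptions. The 1 byte per channel matches the -depth 8 option. The real file size and RAM use will differ depending on compression and on how ImageMagick was built.

```shell
# Rough size in bytes of the uncompressed pixel data for one page.
# Arguments: width and height in hundredths of an inch, dpi, channels.
# Assumes 8 bits (1 byte) per channel, as with -depth 8.
page_bytes() {
    w_px=$(( $1 * $3 / 100 ))   # page width in pixels
    h_px=$(( $2 * $3 / 100 ))   # page height in pixels
    echo $(( w_px * h_px * $4 ))
}

page_bytes 850 1100 72 3     # US Letter, RGB, 72 dpi  -> 1454112
page_bytes 850 1100 300 3    # US Letter, RGB, 300 dpi -> 25245000
```

So under these assumptions a 300 dpi page holds about 17 times more pixel data than a 72 dpi page (roughly 25 MB versus 1.5 MB), and that is per page.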

First try a 72 dpi TIFF (72 dpi is what ImageMagick uses when no -density is given). If the OCR result is not usable, go to 300 dpi.

convert file.pdf -depth 8 file.tiff

tesseract file.tiff output
  1. 300 dpi version. Tesseract adds the .txt suffix to the output name by default, so the text is written to output.txt.
convert -density 300 file.pdf -depth 8 file.tiff

tesseract file.tiff output
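
The two attempts above can be wrapped in a small shell function. This is a minimal sketch, not a tested tool: the function names ocr_pdf and ocr_pdf_with_fallback and the TIFF naming scheme are invented for this note, and treating an empty output file as "did not work" is only an assumption. In practice you will probably want to read the 72 dpi text yourself before deciding.

```shell
# Convert a PDF to TIFF at the given density, then run Tesseract on it.
# Usage: ocr_pdf file.pdf output [dpi]
# The text ends up in output.txt (Tesseract adds the suffix itself).
ocr_pdf() {
    pdf=$1
    out=$2
    dpi=${3:-72}
    tiff="${pdf%.pdf}-${dpi}dpi.tiff"   # hypothetical naming scheme

    convert -density "$dpi" "$pdf" -depth 8 "$tiff" || return 1
    tesseract "$tiff" "$out"
}

# Try 72 dpi first, then fall back to 300 dpi.
# Assumption: an empty or missing output.txt counts as "did not work".
ocr_pdf_with_fallback() {
    ocr_pdf "$1" "$2" 72 || return 1
    if [ ! -s "$2.txt" ]; then
        ocr_pdf "$1" "$2" 300
    fi
}

# Example call (needs ImageMagick and Tesseract installed):
#   ocr_pdf_with_fallback file.pdf output
```

Each density gets its own TIFF name here so that the 72 dpi file is not overwritten by the 300 dpi one. Delete the TIFFs afterwards if disk space is tight.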