AMHARIC PDF OCR

Fitseyotor
Sep 25, 2020

Around 2,500 languages are spoken in Africa, and some of them have their own indigenous scripts. As a result, a large body of printed documents sits in libraries, information centers, museums, and offices. Digitizing these documents makes it possible to apply existing information technologies to local information needs and development. Today we show an Optical Character Recognition (OCR) system for converting digitized documents in a local language, Amharic, into text.

The ideas behind optical character recognition date back to the 19th century, when early devices were patented as reading aids for the blind. In 1870, C. R. Carey patented an image transmission system using photocells, and in 1890 P. G. Nipkow invented sequential scanning. Practical OCR technology for reading characters, however, was introduced in the early 1950s as a replacement for keypunching. Shortly afterwards, D. H. Shepard developed the first commercial OCR system for typewritten data. The 1980s saw the emergence of OCR systems intended for use with personal computers.

Optical character recognition converts scanned images of printed, typewritten, or handwritten documents into a computer-readable format (such as ASCII or Unicode) to ease online data processing. The potential of OCR for data entry is obvious: it offers a faster, more automated, and presumably less expensive alternative to manual data entry, improving the accuracy and speed of transcribing data into a computer system. Consequently, it increases efficiency and effectiveness in information storage and retrieval by reducing cost, time, and labor.
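As a small illustration of what "computer-readable format" means for Amharic: each recognized character ends up as a Unicode code point in the Ethiopic block (U+1200 to U+137F). The helper below is a hypothetical sanity check, not part of any OCR library. It reports what share of the non-whitespace characters in a string falls inside that block, which could be one rough way to spot output that is mostly noise.

```python
# Hypothetical helper (not part of any OCR library): measure how much of a
# string is written in Ethiopic script, the script used for Amharic.

ETHIOPIC_START = 0x1200  # first code point of the Unicode "Ethiopic" block
ETHIOPIC_END = 0x137F    # last code point of the Unicode "Ethiopic" block


def ethiopic_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Ethiopic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    ethiopic = sum(1 for c in chars if ETHIOPIC_START <= ord(c) <= ETHIOPIC_END)
    return ethiopic / len(chars)


if __name__ == "__main__":
    print(ethiopic_ratio("አማርኛ"))     # an Amharic word
    print(ethiopic_ratio("Amharic"))  # Latin text
```

What counts as a "good" ratio is a judgment call and depends on the document, since real pages often mix Ethiopic text with digits and Latin words.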

The OCR engine used here is Tesseract. The Tesseract package contains an OCR engine (libtesseract) and a command line program (tesseract). Tesseract 4 adds a new neural network (LSTM) based OCR engine that focuses on line recognition, but it still supports the legacy engine of Tesseract 3, which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by the Legacy OCR Engine mode (--oem 0), which also needs traineddata files that support the legacy engine, for example those from the tessdata repository.
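To make the engine modes concrete, here is a minimal sketch that only builds (and does not run) a command line for the tesseract program. The function name and the file names are made up for illustration; -l and --oem are options of the tesseract command, with --oem 0 selecting the legacy engine and --oem 1 the LSTM engine.

```python
from typing import List


def build_tesseract_command(image_path: str, output_base: str,
                            lang: str = "amh", oem: int = 1) -> List[str]:
    """Build a command of the form: tesseract IMAGE OUTPUTBASE -l LANG --oem N."""
    if oem not in (0, 1, 2, 3):
        raise ValueError("oem must be 0 (legacy), 1 (LSTM), 2 (both) or 3 (default)")
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]


if __name__ == "__main__":
    # Legacy engine, as used for Tesseract 3 compatibility.
    print(" ".join(build_tesseract_command("page.jpg", "page", oem=0)))
    # LSTM engine introduced in Tesseract 4.
    print(" ".join(build_tesseract_command("page.jpg", "page", oem=1)))
```

Returning a list rather than a single string means the result could be passed to subprocess.run without shell quoting issues, should you want to execute it.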

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

ImageMagick is a tool for converting images from one format to another. It supports a wide range of image formats, is precise and simple to use, and has a large community. It can also render the pages of a PDF file as images.
Wand is a ctypes-based ImageMagick binding for Python. It opens and manipulates images and provides a large number of functions for image manipulation.

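Start by installing the libraries. pytesseract is a Python wrapper around Tesseract, so the Tesseract engine and its Amharic (amh) traineddata file must also be installed on the system. To install pytesseract, use this command: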

pip install pytesseract
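pytesseract works by calling the tesseract program, so it can help to confirm that the program and the Amharic language data are present before running any OCR. The sketch below uses only the standard library. The function name is hypothetical, and it assumes the installed tesseract command supports --list-langs, which prints the available languages.

```python
import shutil
import subprocess


def amharic_ocr_ready(lang: str = "amh") -> bool:
    """Return True if the tesseract program is on PATH and lists `lang`."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False  # Tesseract itself is not installed or not on PATH
    try:
        result = subprocess.run([exe, "--list-langs"], capture_output=True,
                                text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    # Depending on the version, the list may be printed to stdout or stderr.
    listed = (result.stdout + "\n" + result.stderr).split()
    return lang in listed


if __name__ == "__main__":
    print("Amharic OCR ready:", amharic_ocr_ready())
```

If this reports False, the usual causes are a missing Tesseract installation or a missing amh traineddata file.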

Wand exposes the functionality of the ImageMagick API in Python 2.6, 2.7, 3.3+, and PyPy. Besides image processing, it interoperates with NumPy, which is useful in machine learning code. Wand needs the ImageMagick library itself to be installed, and reading PDF files additionally requires Ghostscript. Wand can be installed by running:

pip install Wand

Let’s begin by creating a new file named pdfocr.py

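import io  # wraps image bytes in a file-like object
from PIL import Image  # opens the page images for Tesseract
import pytesseract  # Python wrapper around Tesseract
from wand.image import Image as wi  # renders PDF pages through ImageMagick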

We start our code by importing the libraries.

# Open the PDF, rendering it at 300 DPI
pdf = wi(filename="path of file/.pdf", resolution=300)
# Convert the PDF pages to JPEG images
pdfImg = pdf.convert('jpeg')

# Collect each page as a JPEG blob
imgBlobs = []
for img in pdfImg.sequence:
    page = wi(image=img)
    imgBlobs.append(page.make_blob('jpeg'))

# Run Amharic OCR on each page image and collect the text
extracted_text = []
for imgBlob in imgBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='amh')
    extracted_text.append(text)

# Print the text of the first page
print(extracted_text[0])
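Printing only the first page is fine for a quick check. To keep the whole result, the page strings can be joined and written to a UTF-8 file so the Ethiopic characters survive. This is a minimal sketch, assuming extracted_text is the list of page strings built by the script; the function name, the separator, and the output file name are illustrative choices.

```python
from pathlib import Path
from typing import List


def save_pages(pages: List[str], out_path: str, separator: str = "\n\n") -> int:
    """Write OCR page texts to a UTF-8 file and return the number of pages."""
    # Trim leading and trailing whitespace that OCR often leaves on each page.
    cleaned = [page.strip() for page in pages]
    Path(out_path).write_text(separator.join(cleaned), encoding="utf-8")
    return len(cleaned)
```

At the end of pdfocr.py, a call such as save_pages(extracted_text, "output.txt") would store every page in one text file.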
