What is OCR?
Suppose you have a paper document, such as a deed, a book, or an RFP, and you need to enter it into a computer as an editable text file so that you can use it in your research, a report, and so on.
Retyping the document is obviously time-consuming, and the cost becomes even clearer when there are several pages to retype. Another approach, made possible by advances in information technology, is to scan the documents and keep them as digital images.
Although this method improves archiving by producing an electronic archive and eliminates the need for large office spaces to store paper documents, it offers no way to search the text of these documents or to apply computer technologies such as data mining.
OCR (optical character recognition) software converts scanned images into searchable files. It creates digital files by identifying the different parts of a document image and converting the text parts into editable text.
OCR Technology
If we look at OCR software as a black box, it is an entity that takes images of documents and generates editable, searchable digital files.
Once the image has been received, the first step is to analyze its layout. The layout is divided into table, text, and image blocks.
Afterwards, ARAX performs the steps required for each zone type and recovers the zone's information:
1-Text zones are processed, and their content and font information are read.
2-Images are kept as is.
3-Tables are read cell by cell and placed in the output as tables, preserving their layout.
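The per-zone handling described above can be sketched as a simple dispatch on zone type. This is only an illustrative sketch, not ARAX's actual implementation: the `Zone` class, the handler names, and the stand-in data (already "recognized" text instead of pixels) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Zone:
    kind: str  # "text", "image" or "table", as decided by layout analysis
    data: Any  # stand-in for the zone's pixels (hypothetical)


def read_text(data: Any) -> Dict[str, Any]:
    # A real engine would recognize characters and fonts here; this stub
    # simply unpacks a pre-filled (content, font) pair.
    content, font = data
    return {"type": "text", "content": content, "font": font}


def keep_image(data: Any) -> Dict[str, Any]:
    # Images are passed through unchanged.
    return {"type": "image", "pixels": data}


def read_table(data: Any) -> Dict[str, Any]:
    # Read the table cell by cell, keeping each cell in its row and column.
    rows = [[read_text(cell)["content"] for cell in row] for row in data]
    return {"type": "table", "rows": rows}


HANDLERS = {"text": read_text, "image": keep_image, "table": read_table}


def process_zones(zones: List[Zone]) -> List[Dict[str, Any]]:
    results = []
    for zone in zones:
        handler = HANDLERS.get(zone.kind)
        if handler is None:
            raise ValueError(f"unknown zone kind: {zone.kind!r}")
        results.append(handler(zone.data))
    return results


page = [
    Zone("text", ("Invoice 17", "Times")),
    Zone("image", b"raw-image-bytes"),
    Zone("table", [[("Item", "Arial"), ("Price", "Arial")],
                   [("Pen", "Arial"), ("2", "Arial")]]),
]
result = process_zones(page)
```

Keeping one handler per zone type means a new kind of zone could be supported by adding one entry to `HANDLERS` without touching the loop.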
In the next stage, ARAX shows the recognized document in a WYSIWYG editor, where you can correct any mistakes with the help of a spellchecker.
At the end of the process, ARAX generates files in your desired format, containing all the information from the document that the format can hold.
Comparison between Farsi and Latin OCRs
For Latin-script languages such as English and French, OCR software has existed for years and has gone through a history of change and improvement. Unfortunately, there has been no suitable OCR for Farsi, despite the language's 2000-year history.
One reason is the high complexity of Farsi writing compared with Latin. For example, in Latin texts the characters are written separately, which makes them very easy to identify, whereas in Farsi the words must be identified first, and each word must then be broken into the segments that compose it. Because of the variety of Farsi fonts, this segmentation is the most difficult part.
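The difference between separated and connected writing can be illustrated with a toy example. The sketch below splits a tiny bitmap wherever a column contains no ink, a common textbook simplification and not necessarily what ARAXPage does. The function name `ink_runs` and both bitmaps are made up for illustration.

```python
from typing import List, Tuple


def ink_runs(bitmap: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges that contain ink.

    `bitmap` is a list of equal-length rows where '#' is ink and '.' is
    paper. Ranges are split wherever a whole column is blank.
    """
    width = len(bitmap[0]) if bitmap else 0
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]
    runs = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                # a new run of inked columns begins
        elif not ink and start is not None:
            runs.append((start, x))  # a blank column closes the run
            start = None
    if start is not None:
        runs.append((start, width))
    return runs


# Three strokes separated by blank columns, as with separately written letters.
separate = [
    "#.#.#",
    "#.#.#",
    "#.#.#",
]

# The same strokes joined along the baseline, as in connected writing.
connected = [
    "#...#",
    "#.#.#",
    "#####",
]

separate_runs = ink_runs(separate)
connected_runs = ink_runs(connected)
```

Here `separate_runs` contains three ranges, one per stroke, while `connected_runs` contains a single range covering the whole word. In the connected case the blank-column test finds no cut points, so a further segmentation step is needed.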
ARAXPage, the result of continuous effort in the research and development department of HODA System, has solved many of the problems facing Farsi OCR systems and, after years of work, has equipped the Farsi language with a powerful OCR system. To give users as much capability as possible, ARAXPage can currently read English texts as well as English OCR software does. In addition, ARAXPage can identify English words and phrases in the midst of Farsi text.
OCR applications
Because business and office processes are still based on paper documents, OCR can be used in every part of governmental and private organizations. This section describes some OCR applications.
OCR as the optimal way of entering information
Typing information manually from printed documents is a common task performed every day in the office activities of many organizations. The job is time-consuming and costly. In addition, the typed information always contains a percentage of operator mistakes. These errors are reduced through several stages of review, but in some cases errors remain after all the stages. ARAX, as the most powerful OCR software, can eliminate this tedious procedure by automating it.
Some OCR applications in this regard:
-Fast recovery of letters, contracts, ... that are available in print.
-Faster completion of tender documents and answers to RFPs.
-Completion and update of technical and financial reports, marketing plans and more, using available printed papers.
OCR as the only way of creating digital libraries
Farsi, as one of the oldest living languages, is not only a source of pride for Iran but has also gained a most valuable place in world literature. Despite this, and despite the many books written in this language, the absence of a good digital library has been a serious obstacle to the spread of the language.
ARAXPage, as the most powerful Farsi OCR, can close the gap between the current situation and the creation of rich digital libraries, with high speed and accuracy.
For more information about OCR uses, please visit the ARAXPage applications page.