Tuesday, November 30, 2010

OCR (Optical Character Recognition)



OCR (Optical Character Recognition) technology is the process in which an electronic device (such as a digital camera or scanner) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text. In other words, printed text is scanned, and the resulting image file is analyzed to obtain the text and layout information.

OCR: optical character recognition technology

Because OCR is a technology locked in a constant tug-of-war with its recognition rate, the most important task is finding ways to correct errors or to use auxiliary information to improve recognition accuracy. The term ICR (Intelligent Character Recognition) arose from this. In addition, because text exists on different media and is captured from those media in different ways, many different kinds of applications have emerged.

The history of OCR

OCR research began in countries around the world in the 1960s and 1970s. The early work focused mainly on text recognition methods, and the only characters recognized were the digits 0 to 9. Take Japan, which also uses square-shaped characters, as an example: research on basic OCR recognition theory began there around 1960, initially with digits as the target. Between 1965 and 1970 the first simple products appeared, such as systems for recognizing printed postal codes, which read the postal code on mail and helped post offices sort it by region. This is also why writing the postal code as part of the address has been promoted in so many countries.

OCR can be described as research into an inexact technology. Its accuracy behaves like a function approaching a limit: we know the value it approaches, but it can only get closer to 100% and never reach it. Too many factors are involved: handwriting habits or document print quality, the scanner and the quality of the scan, the recognition method, the training and test samples, and so on. Each of them affects accuracy to some degree. For this reason, besides a strong recognition core, the ease of operation of a product and the error-correction functions and methods it provides are also important factors in its quality.

The purpose of an OCR recognition system is very simple: to perform a conversion. Graphics within the image continue to be stored as graphics, while the contents of tables and the text within the image are all converted into computer text. This reduces the storage needed for the image data, allows the recognized text to be reused and analyzed, and of course saves the manpower and time of keyboard input.

From image to output, the process passes through image input, image preprocessing, text feature extraction, and comparison and recognition. Finally, after manual correction has fixed the misrecognized text, the result is output.

Image input:

The material to be processed by OCR must be brought into the computer as an image through an optical device, such as an image scanner, a fax machine, or any photographic equipment. With advances in technology, input devices such as scanners have become more refined, thinner, lighter, and smaller, and their quality has improved. This helps OCR considerably: higher scanner resolution makes images clearer, and faster scanning speeds improve OCR processing efficiency.

Image preprocessing: image preprocessing is the module that has to solve the largest number of problems in an OCR system. Everything from obtaining a black-and-white binary image, or a grayscale or color image, to isolating the image of each individual character belongs to image preprocessing. It includes image processing steps such as image normalization, noise removal, and image correction, as well as document preprocessing steps such as separating graphics from text and separating text lines and characters. The theory and technology of image processing have reached a mature stage, so many usable libraries are available commercially or on the web. Document preprocessing, by contrast, depends on the abilities of each developer. The image must first be divided into picture, table, and text regions. It may even be possible to determine the layout direction of the article, to separate its headings from its body content, and to work out the size and typeface of the text so that they match the original document.
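As a minimal sketch of two of the preprocessing steps mentioned above, the following Python binarizes a grayscale image and removes isolated black pixels as noise. The image representation (a list of rows of 0 to 255 values), the fixed threshold of 128, and the function names are assumptions made for illustration; a real system would more likely rely on an image processing library and an adaptive threshold.

```python
def binarize(gray, threshold=128):
    """Convert grayscale rows (0-255) to binary: 1 = black (ink), 0 = white."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_isolated_pixels(binary):
    """Clear black pixels with no black pixel among their 8 neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]  # copy so the input is not modified
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            has_neighbour = any(
                binary[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


if __name__ == "__main__":
    gray = [
        [255, 255, 255, 255],
        [255, 0, 255, 255],   # a single dark speck
        [255, 255, 255, 255],
        [10, 20, 255, 255],   # two adjacent dark pixels (part of a stroke)
    ]
    for row in remove_isolated_pixels(binarize(gray)):
        print(row)
```

The speck in the second row has no dark neighbour and is cleared, while the two adjacent pixels in the last row are kept.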

Text feature extraction: as far as recognition rate alone is concerned, feature extraction is the core of OCR. Which features are used and how they are extracted directly determine how good the recognition is, which is why reports on feature extraction were especially numerous in the early days of OCR research. Features are the basis of recognition and can be divided simply into two categories. The first is statistical features, such as the ratio of black to white pixels in the character region. When a character is divided into several zones, the black/white pixel ratios of the zones are combined into a numerical vector in a space, and basic mathematical theory is enough for the comparison. The second is structural features: for example, after the character image has been thinned into lines, the number and positions of the stroke endpoints and intersections are obtained, or the stroke segments themselves are used as features, and these are compared with special comparison methods. Most online handwriting input software on the market uses structural methods as its main recognition approach.
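The statistical (zone-based) feature described above can be sketched as follows. This is an illustration under assumptions, not a specific product's method: the glyph is a list of rows of 0/1 values with 1 meaning black, the grid size is arbitrary, and the code uses the fraction of black pixels per zone instead of the black/white ratio so that an all-black zone does not cause a division by zero.

```python
def zone_density_features(glyph, zones_y=2, zones_x=2):
    """Split a binary glyph (1 = black) into a grid of zones and return
    the fraction of black pixels in each zone as one flat feature vector."""
    height, width = len(glyph), len(glyph[0])
    features = []
    for zy in range(zones_y):
        y0 = zy * height // zones_y
        y1 = (zy + 1) * height // zones_y
        for zx in range(zones_x):
            x0 = zx * width // zones_x
            x1 = (zx + 1) * width // zones_x
            area = (y1 - y0) * (x1 - x0)
            black = sum(glyph[y][x] for y in range(y0, y1) for x in range(x0, x1))
            features.append(black / area if area else 0.0)
    return features


if __name__ == "__main__":
    glyph = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ]
    # Zones are listed row by row: top-left, top-right, bottom-left, bottom-right.
    print(zone_density_features(glyph))
```

The same function must be applied both to the input characters and to the reference characters, so that the resulting vectors are comparable.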

Comparison database: once the features of the input text have been computed, whether they are statistical or structural, there must be a comparison database or feature database to compare them against. The database should contain the whole character set to be recognized, with a feature set for each character obtained by the same feature extraction method that is applied to the input text.

Comparison and recognition:

This is the module where mathematical theory can be brought fully into play. Different mathematical distance functions are chosen according to the features used. Well-known comparison methods include Euclidean space comparison, relaxation comparison, dynamic programming (DP) comparison, building and comparing a database with neural networks, and the hidden Markov model (HMM). To make recognition results more stable, so-called expert systems have also been proposed, which exploit the complementarity of different feature comparison methods to produce recognition results with especially high confidence.
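Of the methods listed, Euclidean comparison is the simplest to show. The sketch below matches an input feature vector against a small feature database and returns the nearest label. The database contents, the function names, and the optional rejection threshold are hypothetical values chosen for illustration only.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(features, database, reject_above=None):
    """Return (label, distance) for the closest database entry.
    If reject_above is given and the best distance exceeds it,
    the label is None, meaning the character is rejected."""
    best_label, best_dist = None, math.inf
    for label, reference in database.items():
        dist = euclidean_distance(features, reference)
        if dist < best_dist:
            best_label, best_dist = label, dist
    if reject_above is not None and best_dist > reject_above:
        return None, best_dist
    return best_label, best_dist


if __name__ == "__main__":
    # Hypothetical reference vectors, e.g. zone densities of two characters.
    database = {
        "I": [0.5, 0.5, 0.5, 0.5],
        "L": [0.5, 0.0, 0.5, 0.5],
    }
    print(recognize([0.5, 0.1, 0.5, 0.5], database))
    print(recognize([0.0, 1.0, 0.0, 0.0], database, reject_above=0.5))
```

Rejecting a character when even the best match is far away trades a higher rejection rate for a lower misrecognition rate, which is one of the design choices behind the evaluation criteria discussed later.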

Word post-processing: because the recognition rate of OCR cannot reach 100%, and to strengthen the correctness and confidence of the comparison, functions for removing errors or assisting correction have become a necessary module of an OCR system. Word post-processing is one example: using the similar candidate characters produced for each recognized character, it finds the most plausible word according to the characters recognized before and after it, and makes the correction.

Word database: the lexicon built for word post-processing.
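A minimal sketch of lexicon-based post-processing, assuming the recognizer returns a ranked list of candidate characters for each position. The function name, the sample candidates, and the tiny lexicon are hypothetical; the search simply tries candidate combinations in `itertools.product` order, starting from all top choices, and stops at the first one found in the lexicon.

```python
from itertools import product


def correct_word(candidates, lexicon):
    """candidates: one list of candidate characters per position, best first.
    Return the first combination that is a word in the lexicon, or fall
    back to the top candidate of every position if none matches."""
    for combo in product(*candidates):
        word = "".join(combo)
        if word in lexicon:
            return word
    return "".join(chars[0] for chars in candidates)


if __name__ == "__main__":
    lexicon = {"clock", "block"}
    # The recognizer confused 'l' with '1'; the lexicon resolves it.
    candidates = [["c", "e"], ["1", "l"], ["o", "0"], ["c", "e"], ["k"]]
    print(correct_word(candidates, lexicon))
```

Trying every combination grows quickly with word length and candidate count, so this brute-force search is only suitable for short words and a few candidates per position.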

Manual correction:

This is the final checkpoint of OCR. Up to this point the user may only have held the mouse, watching or operating along with the flow of the software, but here the user may need to spend real attention and time correcting, or even hunting for, the places where the OCR may have gone wrong. In good OCR software, besides a stable image processing and recognition core that lowers the error rate, the manual correction workflow and its functions also affect OCR processing efficiency. Side-by-side display of the text image and the recognized text, the on-screen placement of the text information, candidate characters for each character, a reject function, and special highlighting of words that post-processing flagged as possibly wrong are all functions designed so that users touch the keyboard as little as possible. Of course, this does not mean that text the system does not flag is necessarily correct. Just as with staff who enter everything by keyboard, mistakes will still occur. Whether to proofread once more or to tolerate a few errors depends entirely on the needs of the organization using the system.

Output:

Output is in fact a simple matter, but it depends on what the user wants the OCR for. Some people only want a text file so that parts of the text can be reused, so an ordinary text file is enough. Some want the result to look exactly like the input document, so a function for reproducing the original layout is needed. Some care about the text in tables, so the output must work together with software such as Excel. Whatever the variation, only the output file format changes. If the original format has to be restored by hand after recognition, the manual typesetting is time-consuming and laborious.

OCR criteria

The main indexes for measuring the performance of an OCR system are: rejection rate, misrecognition rate, recognition speed, user-interface friendliness, product stability, usability, and feasibility.
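The first two indexes can be computed from a labeled test set. The sketch below assumes one common convention, which the text itself does not spell out: a rejected character is recorded as `None`, the rejection rate is the share of rejected characters, and the misrecognition rate is the share of characters that were accepted but wrong. The function and key names are illustrative.

```python
def evaluate(results, truth):
    """results: recognized labels, with None for a rejected character.
    truth: the correct labels, in the same order."""
    if not truth or len(results) != len(truth):
        raise ValueError("results and truth must be non-empty and equal in length")
    total = len(truth)
    rejected = sum(1 for r in results if r is None)
    wrong = sum(1 for r, t in zip(results, truth) if r is not None and r != t)
    return {
        "rejection_rate": rejected / total,
        "misrecognition_rate": wrong / total,
        "accuracy": (total - rejected - wrong) / total,
    }


if __name__ == "__main__":
    truth = ["a", "b", "c", "d"]
    results = ["a", None, "c", "x"]  # one rejection, one wrong answer
    print(evaluate(results, truth))
```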

OCR working principle

Recognition process:

Language identification: Chinese or English, Simplified or Traditional

Layout analysis: vertical or horizontal text, with or without columns

Line segmentation

Character segmentation

Recognition: the actual OCR recognition step, which converts image information back into text information

Post-processing: manual intervention, which is mainly concentrated in the first four stages. Recognition accuracy can reach 99%.
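The line and character segmentation steps in the process above are often illustrated with projection profiles: rows (or columns) that contain no black pixels mark the gaps between lines (or characters). The sketch below assumes horizontal text, a clean binary page given as rows of 0/1 values, and characters that do not touch; the function names are illustrative.

```python
def split_runs(profile):
    """Return (start, end) index pairs, end exclusive, for each run of
    consecutive non-zero entries in a projection profile."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary):
    """Find text lines from the count of black pixels in each row."""
    return split_runs([sum(row) for row in binary])


def segment_characters(binary, line):
    """Find characters in one line from the count of black pixels per column."""
    top, bottom = line
    width = len(binary[0])
    columns = [sum(binary[y][x] for y in range(top, bottom)) for x in range(width)]
    return split_runs(columns)


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0],
        [1, 1, 0, 1, 0],
        [1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
    ]
    for line in segment_lines(page):
        print(line, segment_characters(page, line))
```

Vertical text would swap the roles of rows and columns, and skewed or touching characters need the correction steps from preprocessing before this approach works.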

Deciding factors in the OCR recognition rate

1. Image quality: 150 dpi or above is generally recommended.

2. Color: recognition of color images is generally poor, and black-and-white images do better, so black-and-white TIFF is the recommended format for OCR.

3. Most important is the font: for handwriting, the recognition rate is very low.

OCR is a computer input technique. Through pattern recognition it converts image files of written text into editable text files, fundamentally changing the concept of how paper material is entered into a computer. Text that is scanned into the computer as an image can be converted into a text file that can be edited, which is dozens of times faster than manual input. As OCR is applied more widely, it is becoming better known. While developing the XP system, the international software giant Microsoft recognized the market demand for OCR and announced that Office 2003 would be fully equipped with TH-OCR (developed by Beijing Wintone Information Technology Co., Ltd.). The hardware leader Intel also designated TH-OCR as a project supported by its MMX technology.

Recently, some big companies that have realized the benefits of OCR have begun bundling OCR technology into their products. Google has begun OCR software development work, and in its recruitment notice it said: "Google currently" brief "almost faster web criticallife will Come in the world. For all of us the maritime material as well!" With Google starting its OCR exploration, OCR applications are entering a full-blown era.

Whether we want the computer to typeset and output text, or want it to understand the text it sees, all of this is in the service of our lives. With the process of informatization and digitization, we are never content to enter data by striking the keyboard with our fingers. People would like to put their time and energy into more creative work, and hope that auxiliary equipment such as computers can become more intelligent. OCR (Optical Character Recognition) technology is one such technology. Relative to printing, it is the technology that lets computers read, and it is far more complex than printing.

Economic competition brings more business activity, and business cards are an indispensable protagonist in every business activity. Business card management products have emerged accordingly, and business card recognition and management tools are also products with OCR technology at their core. With a card scanning and recognition tool, cards are recognized and classified, and the information can not only be imported into mobile phones, PDAs, and so on, but also backed up, so there is no need to worry about losing it. Wintone's e-hello is one excellent business card recognition and management product. OCR technology can keep business life in good order and save more time. Today, almost all scanners and all-in-one machines come with OCR software, and scanner manufacturers such as HP, UNISCAN, EPSON, CANON, and LENOVO bundle Wintone's TH-OCR.
