OCR and text extraction can be resource-intensive and need an appropriate system architecture to achieve the throughput the business requires. This article provides information to assist with planning.
The first step is to analyze the number and type of documents by gathering the following information:
OCR and text extraction can be highly CPU-intensive, so we recommend using a high-performance server with an Intel i5 processor or better. You can use the OCR connector on any DocuNECT station server; however, you will need to take into account the existing utilization of the station's resources. To maximize throughput, we recommend using a dedicated server for the OCR connector.
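The number of hours required is:
(P * 2.5 seconds) / PR / 3,600 seconds per hour
Where:
P is the number of pages to process and PR is the number of OCR connector processes (instances) working on the job.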
Note: the 2.5 seconds per page includes OCR, text extraction, and associated I/O operations.
For example:
A medium-complexity conversion job processing 500,000 pages using two OCR connector instances:
((500,000 pages * 2.5 seconds per page) / 2 processes) = 625,000 seconds, or approximately 174 hours.
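The estimate can also be scripted so that different page counts and instance counts are easy to compare. The following is a minimal sketch, not part of DocuNECT: the function name `estimate_hours` is hypothetical, and it assumes the work divides evenly across the OCR connector processes at the 2.5 seconds per page quoted in this article.

```python
SECONDS_PER_PAGE = 2.5   # figure quoted in this article: OCR, text extraction, and I/O
SECONDS_PER_HOUR = 3600


def estimate_hours(pages: int, processes: int,
                   seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Estimate elapsed hours for a job (hypothetical helper).

    Assumes the pages are split evenly across the processes, so that
    doubling the number of processes halves the elapsed time.
    """
    if pages < 0:
        raise ValueError("pages must not be negative")
    if processes < 1:
        raise ValueError("processes must be at least 1")
    total_seconds = pages * seconds_per_page / processes
    return total_seconds / SECONDS_PER_HOUR


if __name__ == "__main__":
    # 500,000 pages handled by two OCR connector instances
    hours = estimate_hours(pages=500_000, processes=2)
    print(f"Estimated run time: {hours:.1f} hours")
```

For 500,000 pages and two instances, this works out to 625,000 seconds, or about 173.6 hours. Treat the result as a planning figure only; actual throughput depends on document complexity and on what else is running on the server.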
In v4.8, the Document Processor Connector has multi-processor capabilities (with an additional license) that can help with the throughput of OCR'd documents. Each additional instance of the connector increases throughput. We recommend adding one CPU core for each instance to cover the additional resources required.
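When the job has a deadline, the same arithmetic can be turned around to estimate how many connector instances are needed. This is a rough sketch under stated assumptions: the function name `instances_for_deadline` is hypothetical, throughput is assumed to scale linearly with the number of instances, and the 2.5 seconds per page figure from this article is used as the default.

```python
import math

SECONDS_PER_PAGE = 2.5   # figure quoted in this article
SECONDS_PER_HOUR = 3600


def instances_for_deadline(pages: int, target_hours: float,
                           seconds_per_page: float = SECONDS_PER_PAGE) -> int:
    """Return the smallest instance count that meets the deadline.

    Hypothetical helper; assumes each instance adds the same throughput.
    """
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    single_instance_hours = pages * seconds_per_page / SECONDS_PER_HOUR
    return max(1, math.ceil(single_instance_hours / target_hours))


if __name__ == "__main__":
    instances = instances_for_deadline(pages=500_000, target_hours=48)
    # One CPU core per instance, following the recommendation in this article
    cores = instances
    print(f"Instances needed: {instances}, CPU cores to allow for them: {cores}")
```

Linear scaling is an optimistic assumption; disk and network I/O can become the bottleneck before CPU does, so it is worth validating the figure with a small trial run before sizing the server.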