Planning an Architecture for OCR and Text Extraction


Applies to Version: v5.0

OCR and text extraction can be resource intensive and need an appropriate system architecture to achieve the throughput the business requires. This article provides information to assist with the planning.

Document Analysis

The first step is to analyze the number and type of documents by gathering the following information:

  • Number of documents
  • Average number of pages per document
  • Image Color (Black & White, Grayscale, Full Color)
  • Type of input file (TIFF, PDF …)
  • Typical resolution (200dpi, 300dpi…)
  • Typical page size (US Letter, A4, A3…)
  • Typical text density
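
The figures gathered above can be kept in a simple record so that the total page count is derived consistently. The following is a minimal sketch only; the class name, field names, and example values are hypothetical and are not part of DocuNECT.

```python
import math
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    """Hypothetical record of the figures gathered during document analysis."""
    document_count: int
    avg_pages_per_document: float
    color_mode: str       # e.g. "bw", "grayscale", "color"
    file_type: str        # e.g. "TIFF", "PDF"
    resolution_dpi: int   # e.g. 200 or 300
    page_size: str        # e.g. "US Letter", "A4"
    text_density: str     # e.g. "low", "medium", "high"

    def total_pages(self) -> int:
        # Total pages = number of documents x average pages per document,
        # rounded up so a fractional average never under-counts.
        return math.ceil(self.document_count * self.avg_pages_per_document)


profile = DocumentProfile(
    document_count=2000,
    avg_pages_per_document=2.5,
    color_mode="bw",
    file_type="TIFF",
    resolution_dpi=300,
    page_size="A4",
    text_density="medium",
)
print(profile.total_pages())
```

The total page count is the main input to the resource calculation; the other fields are recorded because they affect how long each page takes to process.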

Calculating the Required Resources

OCR and text extraction can be highly CPU-intensive, so we recommend using a high-performance server with an Intel i5 processor or better. You can use the OCR connector on any DocuNECT station server; however, you will need to take into account the existing utilization of the station's resources. To maximize throughput, we recommend using a dedicated server for the OCR connector.

The estimated processing time in seconds is:
(P * 2.5 seconds) / PR

Divide the result by 3,600 to convert it to hours.

Where:

  • P = Number of Pages
  • PR = Number of Processes (Number of OCR Connector instances)

Note: the figure of 2.5 seconds per page includes OCR, text extraction, and the associated I/O operations.

For example:
A medium-complexity conversion job that processes 5,000 pages using two OCR connector instances:

(5,000 pages * 2.5 seconds per page) / 2 processes = 6,250 seconds, or about 104 minutes (1.7 hours).
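
The same estimate can be scripted for different job sizes. This is a rough sketch that assumes throughput scales linearly with the number of OCR connector instances and uses the 2.5 seconds per page figure from this article; the function names are illustrative only.

```python
SECONDS_PER_PAGE = 2.5  # from this article: OCR, text extraction, and I/O


def estimate_seconds(pages: int, processes: int,
                     seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Estimated wall-clock seconds, assuming linear scaling across processes."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    if processes < 1:
        raise ValueError("at least one OCR connector instance is required")
    return pages * seconds_per_page / processes


def estimate_hours(pages: int, processes: int,
                   seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Same estimate converted to hours (3,600 seconds per hour)."""
    return estimate_seconds(pages, processes, seconds_per_page) / 3600


# Illustrative job: 100,000 pages across four connector instances.
print(estimate_seconds(100_000, 4))          # 62500.0
print(round(estimate_hours(100_000, 4), 1))  # 17.4
```

Actual timings depend on resolution, color depth, and text density, so the per-page figure is best confirmed with a sample batch before committing to a schedule.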

Multi-Processor Capabilities

As of v4.8, the Document Processor Connector has multi-processor capabilities (with an additional license) that can increase the throughput of OCR'd documents. Each additional instance of the connector increases throughput. We recommend adding one CPU core for each instance to cover the additional resources required.
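
When the job has a deadline, the calculation can be reversed to estimate how many connector instances are needed. The sketch below assumes linear scaling, the 2.5 seconds per page figure from this article, and one additional CPU core per instance; the function names and example figures are hypothetical.

```python
import math


def instances_needed(pages: int, deadline_hours: float,
                     seconds_per_page: float = 2.5) -> int:
    """Smallest number of connector instances that meets the deadline."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    total_seconds = pages * seconds_per_page
    # Round up: a fractional instance means one more is required.
    return max(1, math.ceil(total_seconds / (deadline_hours * 3600)))


def additional_cores(instances: int) -> int:
    """Assumption: one additional CPU core for each connector instance."""
    return instances


# Illustrative job: 500,000 pages to be finished within 48 hours.
n = instances_needed(500_000, 48)
print(n, additional_cores(n))  # 8 8
```

The result is a starting point for sizing; each instance may also require its own license, as noted above.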

