Extracting data from documents

Sphereon’s Smart Data Capture functions enable you to extract data from structured documents, such as forms and questionnaires, as well as from complex, unstructured documents, such as e-mail, e-mail attachments, correspondence, requests, AP invoices and orders, where the required data can be located anywhere on any page of the document.

Sphereon uses several technologies and methods to extract the data: Optical Character Recognition (OCR), barcode recognition, handwritten text recognition (ICR), Optical Mark Recognition (OMR), Business Rules Management (BRMS) and Knowledge Base systems. Sphereon combines multiple technologies and multiple engines to achieve the best possible results.

Optical Character Recognition (OCR)

OCR recognizes characters in digital image files, such as photos, faxes or scanned documents. OCR will not achieve 100% recognition: the result depends on the quality of the image and is sensitive to contrast, sharpness, brightness, smudges, etc. By running multiple OCR engines in parallel, we can achieve optimum results.
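One simple way to combine the output of several engines is confidence-weighted voting. The sketch below is an illustration only, not a description of how Sphereon combines engines: the function name `pick_best_reading` and the (text, confidence) input format are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pick_best_reading(readings: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Pick the text with the highest summed confidence across engines.

    `readings` holds one (recognized_text, confidence) pair per engine.
    Both the input format and the voting rule are assumptions for this sketch.
    """
    if not readings:
        raise ValueError("at least one engine reading is required")
    totals: Dict[str, float] = defaultdict(float)
    for text, confidence in readings:
        totals[text] += confidence
    # The winner is the reading that the engines collectively trust most.
    return max(totals.items(), key=lambda item: item[1])


# Two engines agree; a third misreads the digit 0 as the letter O.
best_text, total = pick_best_reading(
    [("INV-1001", 0.80), ("INV-1001", 0.70), ("INV-1O01", 0.90)]
)
```

Here two moderately confident engines that agree outweigh a single engine that is more confident but stands alone.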

Intelligent Character Recognition (ICR)

ICR is the name for technologies that recognize handwritten characters. Recognizing handwritten text remains a big challenge with varying results, but big improvements have been made in recent years. We have completed several successful projects, for which we have even received several industry awards.

Barcode recognition

Barcodes are the most reliable technology to recognize data. They are often used to capture specific data, such as dossier or case numbers, client or supplier codes, part numbers, etc.

Here, too, we use multiple engines to achieve the best possible results. They support the traditional 1D barcodes as well as the newer 2D barcodes, such as QR codes.

Optical Mark Recognition (OMR)

OMR recognizes marks in the checkboxes and circles on documents such as forms and questionnaires, and passes the corresponding values to the next steps in the process.


The recognized text can be used to find and extract data.

Format extraction

Regular expressions can be used to find values with a known format in a document.

For example, a valid Visa card number matches: ^4[0-9]{12}(?:[0-9]{3})?$
All Visa card numbers start with a 4. New cards have 16 digits; old cards have 13.
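The Visa pattern above is anchored with ^ and $, so it validates a complete string. To search inside a larger block of recognized text, an unanchored variant is needed. The following is a minimal sketch using Python's standard `re` module; the lookaround variant and the function name `find_visa_numbers` are assumptions made for this example.

```python
import re
from typing import List

# Anchored pattern from the text: checks that a whole string is a Visa number.
VISA_FULL = re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$")

# Unanchored variant (an assumption for this sketch): the lookarounds stop a
# match from starting or ending in the middle of a longer run of digits.
VISA_IN_TEXT = re.compile(r"(?<![0-9])4[0-9]{12}(?:[0-9]{3})?(?![0-9])")


def find_visa_numbers(text: str) -> List[str]:
    """Return every 13- or 16-digit candidate starting with 4 found in text."""
    return VISA_IN_TEXT.findall(text)


candidates = find_visa_numbers("Paid with card 4111111111111111 on 3 May.")
```

A format match only shows that the digits have the right shape. It does not prove that the card number exists.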

Key-Value extraction

Regular expressions can also be used to specify a key, for instance ‘Invoice Number’, and the value that needs to be extracted. Sphereon is also able to evaluate the relative position between the key and the value, such as ‘Right of’ or ‘Below’, and score the results based on that.
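The idea of scoring a candidate value by its position relative to the key can be sketched as follows. Everything in this example is hypothetical: the `Fragment` type, the two patterns, the tolerances and the weights are assumptions chosen for illustration and do not reflect Sphereon's actual scoring.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Fragment:
    """A piece of recognized text with the position of its top-left corner."""
    text: str
    x: float
    y: float


# Hypothetical patterns: one for the key, one for the shape of the value.
KEY = re.compile(r"invoice\s*(?:number|no\.?)", re.IGNORECASE)
VALUE = re.compile(r"^[A-Z0-9-]{4,}$")


def position_score(key: Fragment, candidate: Fragment,
                   line_tol: float = 5.0, col_tol: float = 20.0) -> float:
    """Score a candidate by where it sits relative to the key (0.0 = unrelated)."""
    dx = candidate.x - key.x
    dy = candidate.y - key.y
    if abs(dy) <= line_tol and dx > 0:   # 'Right of': same line, further right
        return 1.0 / (1.0 + dx / 100.0)
    if abs(dx) <= col_tol and dy > 0:    # 'Below': same column, further down
        return 0.8 / (1.0 + dy / 100.0)
    return 0.0


def extract_value(fragments: List[Fragment]) -> Optional[Tuple[str, float]]:
    """Return the best (value, score) pair for the key, or None if none fits."""
    best: Optional[Tuple[str, float]] = None
    for key in fragments:
        if not KEY.search(key.text):
            continue
        for candidate in fragments:
            if candidate is key or not VALUE.match(candidate.text):
                continue
            score = position_score(key, candidate)
            if score > 0 and (best is None or score > best[1]):
                best = (candidate.text, score)
    return best


page = [
    Fragment("Invoice Number:", 50, 100),
    Fragment("INV-2024-001", 200, 100),  # right of the key, on the same line
    Fragment("PO-7788", 50, 300),        # below the key, but far away
]
result = extract_value(page)
```

In this sketch a nearby value to the right of the key outranks a more distant value below it. A real system would tune such weights per document type.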

Knowledge Base assisted extraction

Using our Knowledge Base system for extraction improves the extraction results over time. It can also weigh other dimensions when deciding on the best possible result, such as formats, historical results and relationships with other data.


To confirm the data or to increase the confidence in the recognized data, several checks can be performed on the captured data.

Format checks

Regular expressions enable simple to complex checks on the format of the data, from simply checking for numbers to complex checks such as the validity of an IBAN. They can also be used to split a value into multiple values or to substitute data.
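As an illustration of a more complex check, the sketch below validates an IBAN in two steps: a regular expression checks the general format, and the standard mod-97 checksum confirms the check digits. The checksum step goes beyond what a regular expression alone can express. The function name `is_valid_iban` is chosen for this example, and country-specific lengths are deliberately not checked.

```python
import re

# General IBAN shape: country code, two check digits, 11 to 30 further characters.
IBAN_FORMAT = re.compile(r"^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$")


def is_valid_iban(raw: str) -> bool:
    """Check the general IBAN format and the mod-97 checksum.

    Country-specific lengths are not verified in this sketch.
    """
    iban = re.sub(r"\s+", "", raw).upper()  # substitution: strip spaces
    if not IBAN_FORMAT.match(iban):
        return False
    # Move the country code and check digits to the end of the string.
    rearranged = iban[4:] + iban[:4]
    # Replace letters by numbers: A=10, B=11, ..., Z=35 (digits stay as-is).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


ok = is_valid_iban("GB82 WEST 1234 5698 7654 32")
```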

Database lookups

One of the most powerful checks is the validation of a value against known data in a trusted database. This also makes it possible to retrieve data from the database and add it to the document as additional data.
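A lookup that both validates a captured value and enriches the document could look like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database. The `suppliers` table, its columns and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3
from typing import Dict


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory stand-in for a trusted database (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE suppliers (code TEXT PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO suppliers VALUES (?, ?, ?)",
        [("S-1001", "Acme Supplies", "Utrecht"), ("S-1002", "Globex", "Delft")],
    )
    conn.commit()
    return conn


def validate_and_enrich(conn: sqlite3.Connection,
                        captured: Dict[str, object]) -> Dict[str, object]:
    """Validate the captured supplier code and add the known supplier data."""
    row = conn.execute(
        "SELECT name, city FROM suppliers WHERE code = ?",
        (captured["supplier_code"],),
    ).fetchone()
    result = dict(captured)
    result["supplier_code_valid"] = row is not None
    if row is not None:
        # Enrichment: copy trusted values from the database into the document.
        result["supplier_name"], result["supplier_city"] = row
    return result


conn = build_demo_db()
document = validate_and_enrich(conn, {"supplier_code": "S-1001"})
```

The parameterized query (the `?` placeholder) keeps recognized text, which may contain arbitrary characters, from being interpreted as SQL.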

Validation rules, Business Rules

Logical validation checks can be performed by using validation rules, also known as Business Rules Management (BRMS). Examples are checking that Net Amount + VAT Amount = Gross Amount, or that a Last Name found in the document is the same as the Last Name retrieved from a database using the Policy Number.
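The two example rules can be written as small functions. This is a minimal sketch: the function names, the rounding tolerance of 0.01 and the case-insensitive name comparison are assumptions for this example, not part of any actual rule set.

```python
from decimal import Decimal


def amounts_consistent(net: str, vat: str, gross: str,
                       tolerance: Decimal = Decimal("0.01")) -> bool:
    """Rule: Net Amount + VAT Amount = Gross Amount, within a small tolerance.

    Decimal avoids the rounding surprises of binary floating point.
    """
    difference = Decimal(net) + Decimal(vat) - Decimal(gross)
    return abs(difference) <= tolerance


def last_names_match(captured: str, from_database: str) -> bool:
    """Rule: the captured Last Name equals the one on record for the policy.

    Ignoring case and surrounding whitespace is an assumption of this sketch.
    """
    return captured.strip().casefold() == from_database.strip().casefold()


invoice_ok = amounts_consistent("100.00", "21.00", "121.00")
name_ok = last_names_match(" Jansen", "JANSEN")
```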

User validation

When automatic checks are not possible, do not give enough confidence, give conflicting results or even fail, documents and data can be checked manually by users. They can even be checked “blind” by multiple users.
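A “blind” check by multiple users can be resolved by comparing their entries. The sketch below assumes that “blind” means each user keys the value without seeing the other entries, and that a value is accepted only when at least two users agree exactly. Both assumptions, and the function name `resolve_blind_entries`, are made for this example.

```python
from typing import List, Optional


def resolve_blind_entries(entries: List[str]) -> Optional[str]:
    """Return the agreed value, or None when the entries need escalation.

    Assumption for this sketch: at least two independent entries are
    required, and they must all match after trimming whitespace.
    """
    normalized = {entry.strip() for entry in entries}
    if len(entries) >= 2 and len(normalized) == 1:
        return normalized.pop()
    return None


agreed = resolve_blind_entries(["POL-44871", "POL-44871 "])
conflict = resolve_blind_entries(["POL-44871", "POL-44877"])
```

When the function returns None, the field could be routed to another user or a supervisor for a final decision.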

Knowledge bases

Data can be processed into Knowledge Bases during validation. Combining the results of the different checks with user input “teaches” the system and makes it “smarter”.