OCR: The first step in using the content in your documents.

For many people, an office filled with giant filing cabinets and overflowing in-trays is a picture from the past. At least, that is what the increasing digitalisation in businesses and public authorities suggests, with invoices and orders now often arriving in electronic format. Read on to find out what unstructured, semi-structured and structured data means for electronic documents, what OCR is, and how data needs to be prepared so that processing can be fully automated. We will also look at the steps that follow OCR in automated processing.

Part of our IT series SEEOcta, this article looks at optical character recognition. Further articles written from a data perspective look at data governance, seamless data exchange and big data.

The SEEOcta blog series highlights the eight most important perspectives for successful project management. Discover all the areas you need to consider when planning digitalisation and integration projects in your company. Armed with the ideas and knowledge in the articles, you will have a solid foundation for planning your IT project and a guide to help you ensure that no one gets left behind.

What is unstructured, semi-structured and structured data?

Unstructured data may refer to paper-based documents. However, scanned images, text files, audio files and video files are also unstructured data, as are Word, Excel or PDF files which have been sent by e-mail. None of these can be processed automatically without some extra work.

Figure 1: Unstructured data is not necessarily a paper-based document.

(Semi)-structured data can mean a format such as XML, HTML or JSON. It could also refer to a table from a database. The formats used in e-invoicing such as XRechnung, which is an XML file, are (semi)-structured. There are also hybrid e-invoicing formats, such as ZUGFeRD, which provides an XML file for machines to read and a PDF for people to read.
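To illustrate why (semi)-structured data is easier to process, here is a minimal Python sketch that reads named fields straight from an XML document, with no OCR involved. The element names (`invoice`, `number`, `issueDate`, `total`) are invented for this example and are not the real XRechnung or ZUGFeRD schema, which is considerably more detailed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified invoice. Real e-invoicing formats use
# standardised schemas with namespaces that are not shown here.
SAMPLE_XML = """
<invoice>
  <number>INV-1001</number>
  <issueDate>2024-03-15</issueDate>
  <total currency="EUR">119.00</total>
</invoice>
"""


def read_invoice(xml_text: str) -> dict:
    """Pull a few named fields out of the XML without any OCR step."""
    root = ET.fromstring(xml_text.strip())
    total = root.find("total")
    return {
        "number": root.findtext("number"),
        "issue_date": root.findtext("issueDate"),
        "total": float(total.text),
        "currency": total.get("currency"),
    }


invoice = read_invoice(SAMPLE_XML)
```

Because every value is labelled by its element name, the program does not have to guess which number is the total or which date is the invoice date.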

People are still sending more paper-based documents, scanned images or e-mail attachments than files with structured data. Before this unstructured data can join the other documents to be automatically processed, it needs to be standardised, made readable with OCR technology, and pertinent content extracted and validated.

Modern OCR and processing solutions should be able to deal with as many data types and formats as possible. This means structured, semi-structured and unstructured data. This essentially allows any document to be mapped for a digital business process.

What is OCR?

OCR stands for optical character recognition, and means that the letters and other characters in an image can be automatically ‘read’.

Originally, the technology was developed to read items such as cheques. These days, many types of OCR software can read printed characters, while some have been specially developed for reading handwritten documents and forms. More advanced versions, known as intelligent character recognition (ICR), can actually correct what they read, even on handwritten forms.

10 rules for getting good results from OCR

The quality of the OCR results from scanned or faxed paper-based documents – or even PDFs – is highly dependent on the quality of the original document. You get the best results from a black font on a white background. Use these tips to get the most out of your OCR software:

The background should not contain colour gradients.
Do not use a white font on a black/grey or other coloured background.
The font size should be at least 8pt, with 9pt delivering better results.
Letters should not be widely spaced.
Do not use handwritten, angled or vertically orientated text.
Use the original document where possible and avoid copies or faxed copies.
The ‘zebra look’ (one line on a white background, the next on a grey background) really doesn’t work.
Do not stamp a document.
Do not put general terms and conditions on the reverse of the document; these may bleed through, and have no relevance to what you need to process.
The original document should be clean, with no smudging, doodles, scribbled notes, coffee stains, etc.

You can find further tips and tricks for getting good OCR results, extracting data and machine learning in this cheat sheet from DocProStar (DPS) Invoice.

Figure 2: Optical character recognition – OCR software would struggle with these types of documents.

What are the steps after OCR?

We often talk about using OCR to process unstructured data. Most people automatically think of paper-based documents here, although PDFs sent by e-mail are also unstructured. However, it is rare that using OCR alone will give you the results you need. Modern OCR and processing solutions offer a range of functions which go far beyond pure OCR. They can extract, classify and validate unstructured, semi-structured and structured data. They can also add further information from other systems.

Once the file formats have been standardised and run through OCR, there are a number of further steps. These are illustrated in the simplified diagram in figure 3. An extraction engine and automated checking rules are used to ensure that the data recognised by OCR is correct and matches the master data.

Figure 3: There are a number of steps which follow OCR.

The humble date is a good example of why OCR often needs help from other software. Although OCR software can recognise a date, a document usually contains several different dates. There may be an invoice date, a delivery date, the date the order was placed, and the date when the document was generated. However, extraction and checking rules can help to find the correct date using clues such as its position on the document, or its proximity to phrases such as ‘invoice date’.
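The proximity rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the OCR text, the labels and the `DD.MM.YYYY` date format are invented, and the helper `find_labelled_date` is hypothetical rather than part of any particular product.

```python
import re
from datetime import date

# Hypothetical OCR output; labels and layout are invented for illustration.
OCR_TEXT = """Order date: 02.03.2024
Delivery date: 10.03.2024
Invoice date: 15.03.2024
Printed on 16.03.2024"""

# Assumes dates are written as DD.MM.YYYY.
DATE_PATTERN = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def find_labelled_date(text, label):
    """Return the first date that follows `label` on the same line."""
    for line in text.splitlines():
        position = line.lower().find(label.lower())
        if position == -1:
            continue
        match = DATE_PATTERN.search(line, position)
        if match:
            day, month, year = (int(part) for part in match.groups())
            return date(year, month, day)
    return None


# OCR alone finds every date; the label narrows it down to the right one.
all_dates = DATE_PATTERN.findall(OCR_TEXT)
invoice_date = find_labelled_date(OCR_TEXT, "invoice date")
```

A real extraction engine would typically combine several clues, such as position on the page and label proximity, rather than rely on a single same-line rule.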

Extracting and automatically validating data means document processing can be fully automated. Automatically checking the data against that held in the master data also highlights potential errors – or items in the database which need updating.
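As a rough sketch of what such automated checks might look like, the Python example below compares extracted values with master data and runs one plausibility check. The field names, the sample supplier and the three rules are assumptions made for illustration; real rule sets are usually far more extensive.

```python
# Hypothetical master data, keyed by supplier VAT ID.
MASTER_DATA = {
    "DE123456789": {
        "name": "Example Supplier GmbH",
        "iban": "DE00123456780000000001",
    },
}


def validate_invoice(extracted: dict, master_data: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    supplier = master_data.get(extracted.get("vat_id"))
    if supplier is None:
        problems.append("unknown supplier VAT ID")
        return problems
    if extracted.get("iban") != supplier["iban"]:
        problems.append("IBAN differs from master data")
    # Net plus tax should equal gross, allowing for rounding to whole cents.
    expected_gross = round(extracted["net"] + extracted["tax"], 2)
    if abs(expected_gross - extracted["gross"]) > 0.005:
        problems.append("net + tax does not match gross")
    return problems


good_record = {
    "vat_id": "DE123456789",
    "iban": "DE00123456780000000001",
    "net": 100.0,
    "tax": 19.0,
    "gross": 119.0,
}
problems = validate_invoice(good_record, MASTER_DATA)
```

A mismatch does not necessarily mean the OCR result is wrong. A differing IBAN, for example, could equally indicate that the master data needs updating, so flagged records would normally go to a person for review.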

Layout classification software is a separate technology from traditional OCR. It employs artificial intelligence to digitalise incoming messages and automatically put documents into categories (e.g. ‘invoice’) based on typical characteristics in the structure of a document, without needing OCR to read the words. Content classification, on the other hand, involves taking OCR results and using AI to decode the content and put the document in the correct category. These two types of classification can be combined. There is even special software which has been developed to categorise sensitive documents such as ID cards.
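To make the idea of content classification more concrete, here is a deliberately simple Python sketch. Note that it is not the AI-based approach described above: it swaps in plain keyword counting as a stand-in, and the categories and keyword lists are invented for this example.

```python
# Hypothetical keyword lists; a production classifier would learn such
# features from training data rather than use a hand-written table.
CATEGORY_KEYWORDS = {
    "invoice": ["invoice", "vat", "amount due"],
    "order": ["purchase order", "order number", "quantity"],
    "delivery note": ["delivery note", "delivered", "shipment"],
}


def classify(ocr_text: str) -> str:
    """Pick the category whose keywords occur most often in the text."""
    text = ocr_text.lower()
    scores = {
        category: sum(text.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to "unknown" when no keyword matched at all.
    return best if scores[best] > 0 else "unknown"


category = classify("Invoice no. 42, VAT 19 %, amount due 119.00 EUR")
```

Unlike layout classification, this kind of check can only run after OCR has produced text, which is exactly the distinction the paragraph above draws.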

Structured data

Once data has been through an entire input management process consisting of OCR, extraction, validation and – if required – classification, it is in a structured format. It can now be digitally archived and passed on to the right system. As a rule, the structured data is processed and exported as an XML file. The disadvantage of this is that although machines can read the content easily, humans can’t. Therefore, software solutions such as DocProStar (DPS) Invoice also export the original PDF document, which often also needs to be archived for legal reasons, alongside the XML file.
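The export step could look roughly like the following Python sketch, which serialises validated fields to XML and records the name of the original PDF next to them. The element names and the `originalDocument` reference are assumptions for illustration only and do not reflect the actual export format of any product.

```python
import xml.etree.ElementTree as ET


def export_invoice(record: dict, pdf_name: str) -> str:
    """Serialise validated fields to XML and reference the original PDF."""
    root = ET.Element("invoice")
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    # Keep a pointer to the human-readable original alongside the data.
    ET.SubElement(root, "originalDocument").text = pdf_name
    return ET.tostring(root, encoding="unicode")


xml_output = export_invoice(
    {"number": "INV-1001", "issueDate": "2024-03-15", "gross": "119.00"},
    "INV-1001.pdf",
)
```

Keeping the link between the XML and the PDF means that a downstream system can process the data while a person can still open the original document when needed.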

Conclusion

OCR is an indispensable tool for bridging the gap between paper-based or other unstructured invoices and an automated bookkeeping or ERP system. Despite the increasing number of documents in digital formats, there will always be a need for optical character recognition to some extent. Thanks to the better-quality images that modern scanners capture, OCR can deliver very good results. However, OCR cannot work in isolation. To deliver its full benefit, it needs to go hand in hand with a comprehensive, intelligent input management strategy and process.

This post is part of the SEEOcta project management series. In the blog category SEEOcta you will find all of the collected posts of this series related to the introduction of a new IT project.
