Today, data is a highly valued commodity which can be derived from a variety of different sources, making discoverability a key attribute for making the most of the information held in any organisation. This insight led us to start an internal project to improve the discoverability of information in Endava’s internal document stores. The first deliverables included an API and two communication interfaces, a web application and a chat bot, which provide powerful search capabilities across our internal documents.
An important part of improving the discoverability of these documents is augmenting the information which is extracted from them, with tags that indicate the named entities (e.g. persons, organisations, locations), the client context and business domain that the documents relate to. These extracted metadata fields help to create more accurate search results. This information is relatively straightforward to extract from textual documents, but we found that it is much more difficult to extract from graphical information found in many presentations.
THE PROBLEM WITH PRESENTATIONS
A modern presentation usually contains a significant number of graphics and minimises the amount of text on each slide, relying on the presenter’s verbal skills to deliver the details of the message (see Figure 1).
Figure 1: Sample presentation slides showing the graphical nature of modern presentations
The question then arises: how can we make this information searchable?
The challenge with presentation files was how to augment the small amount of text that could be extracted from them with further information derived from the graphical slides. We realised that emerging technologies, such as deep learning, offer a potential solution to this challenge, given their ability to segment images and recognise text and objects within them.
This article explains how we implemented text extraction from images using standard computer vision techniques and technologies that can be applied to other problems, such as object detection.
The process we designed for text extraction is shown in Figure 2, and we implemented it as a service which loads a given image, detects the location of text regions in images (as rotated bounding boxes) and extracts the information from these detected ROIs (regions of interest).
Figure 2: Text extraction process
The recent growth in reliable open-source technology in the artificial intelligence area meant that we were able to lean heavily on proven third-party libraries in our implementation. The technology we used was largely Python-based: OpenCV for image processing, a pre-trained TensorFlow model for ROI detection and the well-known Tesseract library for OCR (optical character recognition) processing.
REGIONS OF INTEREST (ROI) DETECTION
ROI detection is one of the most difficult parts of the solution to implement. Even with the assumption that we do not need to process generic images such as photographs of the natural world (for example, landscapes or flowers), our solution needs to allow for different graphics resolutions, unknown slide layout, non-planar objects and different viewing angles. Obtaining an accurate result that allows for all of these complications using traditional image processing algorithms would be very difficult and involve the use of complex and fragile heuristics.
To solve this problem, we decided to use a “learning” approach. However, we did not have annotated data with which to train a model, and we had neither the time nor enough people to perform a comprehensive data annotation exercise, so we decided to investigate the use of a pre-trained deep learning model.
We discovered a promising deep-learning-based ROI detection pipeline for text called EAST (Efficient and Accurate Scene Text), which was described in the 2017 paper by Xinyu Zhou et al., EAST: An Efficient and Accurate Scene Text Detector. According to the authors, the EAST pipeline, composed of a Fully-Convolutional Network (FCN) and a locality-aware Non-Maximum Suppression (NMS) algorithm, is capable of detecting lines of text without using expensive traditional algorithms and is able to process images with a resolution of 1280x720 at around 16 frames per second (fps).
Figure 3: The structure of the EAST text detection Fully-Convolutional Network (from Zhou’s paper).
We don’t have space in this article to explain EAST in detail, and so we refer you to Zhou’s paper for the details. However, we reproduce Figure 3 from the paper to give an overview of the architecture of the EAST neural network. For our situation, using a pre-trained model based on a known neural network, the key parts of the architecture are the input and the output of the model.
In EAST there are 4 layers of feature maps, denoted as f_i, which are extracted such that their sizes are 1/32, 1/16, 1/8 and 1/4 of the input image. So, for the neural network to process an image, each of its dimensions must be a multiple of 32.
The first output layer, named “score map” in Figure 3, is the output of the sigmoid activation function, which indicates the probability of a region containing text or not. The second output layer required for text detection is shown in the diagram as “RBOX geometry” and it is used to derive the rotated bounding boxes for the text regions detected. The last output layer is labelled “QUAD geometry” and it describes detected text regions by their 4 vertices. We don’t use this output, as the “RBOX geometry” is sufficient for our purposes.
Figure 4 shows an example image from a sample presentation which will be used as our example to describe the steps in our implementation.
Figure 4: Example image from presentation
Step 1 – Image Preparation
The first step is to load the image and process it using OpenCV in order to get it into a form suitable for use by the EAST model. This involves resizing it if its dimensions are not multiples of 32, while retaining the width:height ratio to allow transformation back after the extraction. We also shrink large images to a maximum size of 2400 pixels wide and high, to avoid running out of memory during the process.
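The sizing rule described above can be sketched as follows. This is a minimal illustration rather than the code of our service: the function name, the choice to round down, and the returned scale ratios are assumptions made for the example.

```python
def east_input_size(width, height, max_side=2400, multiple=32):
    """Return (new_width, new_height, ratio_w, ratio_h) for an EAST input.

    Assumption: we shrink (never enlarge) so that neither side exceeds
    max_side, then round each side down to the nearest multiple of 32.
    """
    # Shrink factor so that the longer side is at most max_side pixels.
    scale = min(1.0, max_side / max(width, height))
    scaled_w = int(width * scale)
    scaled_h = int(height * scale)

    # Round down to a multiple of 32, but never below one full block.
    new_w = max(multiple, scaled_w // multiple * multiple)
    new_h = max(multiple, scaled_h // multiple * multiple)

    # Ratios kept so detected coordinates can be mapped back to the
    # original image after extraction.
    return new_w, new_h, width / new_w, height / new_h


if __name__ == "__main__":
    print(east_input_size(1000, 750))
    print(east_input_size(4800, 2400))
```

With OpenCV, the resulting size would typically be passed to `cv2.resize(image, (new_w, new_h))`, and the two ratios multiplied into the detected box coordinates to return to the original scale.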
Step 2 – Initial Image Processing
The next step is to process the image with a forward pass through a proven pre-trained TensorFlow EAST model, which achieved an F1-score of 80.83 on the ICDAR 2015 data set, a very strong result. This step creates the two output layers described above, namely the score map and the RBOX geometry data. Example output shapes for our sample image are shown in Figure 5. The shape of the input matrix is given by the dimensions in pixels of the image together with its RGB channels. The shapes of the output matrices reflect the forward pass of the input matrix through the 4 feature map layers, so their height and width are a quarter of the input dimensions. The content of the output matrices is typical sigmoid confidence scores in the case of the “score_map” and RBOX coordinates in the case of the “geometry” matrix.
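To make the relationship between the input and output shapes concrete, here is a small sketch that computes them for a given input size. The leading batch dimension of 1 and the 5 geometry channels (4 box distances plus 1 rotation angle, as in the RBOX format of the EAST paper) are assumptions for illustration; the example size is hypothetical and not the one in Figure 5.

```python
def east_shapes(height, width):
    """Return the expected tensor shapes for one image of the given size."""
    if height % 32 or width % 32:
        raise ValueError("EAST input dimensions must be multiples of 32")
    return {
        # One RGB image: (batch, height, width, channels).
        "input": (1, height, width, 3),
        # One text probability per 4x4 pixel region.
        "score_map": (1, height // 4, width // 4, 1),
        # Assumed RBOX layout: 4 distances to the box edges + 1 angle.
        "geometry": (1, height // 4, width // 4, 5),
    }


if __name__ == "__main__":
    for name, shape in east_shapes(704, 1280).items():
        print(name, shape)
```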
Figure 5: Input and Output shapes for the example image in Figure 4
Step 3 – Region of Interest Identification
After the EAST model solves the challenging problem of computing the probability of a region containing text or not, the next step is to translate the output of the model into the rotated bounding boxes of the ROIs in the original image.
We first eliminate regions with a low probability of containing text by filtering out those with a score map value of less than a threshold value (we used 0.8 in this example, although this is configurable). The remaining regions are considered good candidates, and a series of matrix operations and math formulas, designed with performance in mind, are used to adapt the EAST model’s geometry-form representation of each region to a simpler 4 point Cartesian coordinate format, describing a rotated bounding box (see our example in Figure 6).
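The filtering and coordinate conversion can be sketched in plain Python as below. This is an illustration under stated assumptions, not our optimised matrix implementation: we assume each score map cell maps to the pixel at 4 times its coordinates, that the geometry holds the distances from that pixel to the top, right, bottom and left edges of the box, and that the angle is a rotation in radians about that pixel. The function names and the sign convention are hypothetical.

```python
import math


def candidate_cells(score_map, threshold=0.8):
    """Return (x, y) score map cells whose text probability passes the threshold."""
    return [
        (x, y)
        for y, row in enumerate(score_map)
        for x, score in enumerate(row)
        if score >= threshold
    ]


def rbox_to_corners(x, y, top, right, bottom, left, angle, stride=4):
    """Convert one RBOX entry to 4 Cartesian corner points.

    Corners are returned in the order top-left, top-right,
    bottom-right, bottom-left (before rotation).
    """
    # Position of the score map cell in input-image pixels.
    origin_x, origin_y = x * stride, y * stride
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Corner offsets relative to the origin in the unrotated frame.
    offsets = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    # Rotate each offset by the angle and translate to the origin.
    return [
        (origin_x + dx * cos_a - dy * sin_a, origin_y + dx * sin_a + dy * cos_a)
        for dx, dy in offsets
    ]


if __name__ == "__main__":
    scores = [[0.10, 0.95], [0.85, 0.30]]
    print(candidate_cells(scores))
    print(rbox_to_corners(10, 5, top=5, right=10, bottom=5, left=10, angle=0.0))
```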
Figure 6: EAST geometry format vs final extracted format of the solution
Step 4 – Rationalisation of Bounding Boxes
In Figure 7 we show our example image, annotated with the bounding boxes that we detected. As can be observed, we have a problem with the precision of the bounding boxes and the degree to which they overlap.
Figure 7: Detected rotated bounding boxes before NMS
To solve this problem, the EAST pipeline paper suggests merging the intersected geometries using the NMS algorithm. Our implementation uses a modified NMS algorithm, as recommended in the original paper, which assumes that nearby pixels tend to be highly correlated so it can merge geometries in a row-by-row manner. This hugely improves the speed of the process, compared to the original algorithm, as it reduces the complexity from O(n²) to a best-case scenario of O(n).
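The idea behind the row-by-row merge can be sketched as follows. To keep the example short and dependency-free, it assumes axis-aligned boxes of the form (x1, y1, x2, y2, score) rather than the rotated boxes our service handles, and an overlap threshold of 0.3 chosen only for illustration. Candidates are assumed to arrive in row-major order, so each box only needs to be compared with the previously merged one before a standard NMS pass cleans up what remains.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def weighted_merge(a, b):
    """Average the coordinates weighted by score; the scores are summed."""
    total = a[4] + b[4]
    coords = tuple((a[i] * a[4] + b[i] * b[4]) / total for i in range(4))
    return coords + (total,)


def standard_nms(boxes, threshold):
    """Greedy NMS: keep the highest scoring boxes that do not overlap."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, other) <= threshold for other in kept):
            kept.append(box)
    return kept


def locality_aware_nms(boxes, threshold=0.3):
    """Merge consecutive overlapping boxes, then run standard NMS."""
    merged, previous = [], None
    for box in boxes:  # assumed to be in row-major order
        if previous is not None and iou(box, previous) > threshold:
            previous = weighted_merge(box, previous)
        else:
            if previous is not None:
                merged.append(previous)
            previous = box
    if previous is not None:
        merged.append(previous)
    return standard_nms(merged, threshold)


if __name__ == "__main__":
    candidates = [(0, 0, 10, 10, 0.9), (1, 0, 11, 10, 0.9), (100, 100, 110, 110, 0.8)]
    print(locality_aware_nms(candidates))
```

Because most overlapping candidates come from neighbouring pixels, the single linear pass usually leaves only a few boxes for the quadratic standard NMS step.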
Figure 8: Regions of interest containing text result
The result of merging the candidate bounding boxes in our example image using NMS is shown in Figure 8.
Having located the regions of interest in a given image, we then attempt to extract the text we believe is in those regions. Previous experience has shown that we need to crop images carefully to get good results from OCR processing, so we use standard image processing techniques to crop each ROI to a configurable padding level (2% in the case of our example image, as shown in Figure 9).
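A padded crop of this kind might be computed as in the sketch below. We assume here that the padding is a fraction of the box's own width and height and that the crop is the axis-aligned rectangle enclosing the rotated box; both are simplifications for the example, and the function name is hypothetical.

```python
import math


def padded_crop_box(corners, image_width, image_height, padding=0.02):
    """Return (left, top, right, bottom) of the padded crop rectangle.

    corners is a list of (x, y) points describing the detected box.
    The result is clamped so that it never leaves the image.
    """
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    # Padding proportional to the size of the region itself.
    pad_x = (x_max - x_min) * padding
    pad_y = (y_max - y_min) * padding

    left = max(0, math.floor(x_min - pad_x))
    top = max(0, math.floor(y_min - pad_y))
    right = min(image_width, math.ceil(x_max + pad_x))
    bottom = min(image_height, math.ceil(y_max + pad_y))
    return left, top, right, bottom


if __name__ == "__main__":
    box = [(30, 15), (50, 15), (50, 25), (30, 25)]
    print(padded_crop_box(box, image_width=100, image_height=100))
```

For an image loaded with OpenCV as a NumPy array, the crop itself would then be the slice `image[top:bottom, left:right]`.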
Figure 9: Cropped ROI
From previous projects, we also know that an image pre-processing step to simplify the image will improve the accuracy of the extracted results. In our implementation, we transform the image to a grayscale version and then to a binary image by applying Otsu’s binarization method. The result for one of our images is shown in Figure 10.
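To show what these two pre-processing steps do, here is a dependency-free sketch that works on flat lists of pixel values. It is an illustration only: in practice OpenCV provides both steps (`cv2.cvtColor` with `cv2.COLOR_BGR2GRAY`, and `cv2.threshold` with the `cv2.THRESH_OTSU` flag), and the helper names below are assumptions made for the example.

```python
def to_gray(red, green, blue):
    """Convert one RGB pixel to a gray level using the usual luma weights."""
    return round(0.299 * red + 0.587 * green + 0.114 * blue)


def otsu_threshold(pixels):
    """Return the gray level that best separates dark from bright pixels.

    Otsu's method tries every threshold and keeps the one that maximises
    the between-class variance of the two resulting groups.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0

    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        weight_bright = total - weight_dark
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        variance = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]


if __name__ == "__main__":
    gray = [10] * 50 + [200] * 50
    level = otsu_threshold(gray)
    print(level, binarize(gray, level)[:3], binarize(gray, level)[-3:])
```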
Figure 10: Pre-processed image
TEXT EXTRACTION VIA OCR
As mentioned earlier, we use Tesseract as our OCR engine. Tesseract has long been considered one of the most accurate open-source OCR engines available, and the fact that it had a new beta version that used LSTM (long short-term memory) internally meant that it was the obvious choice for OCR processing in our implementation.
Our implementation passes the pre-processed images to Tesseract to extract the text from them, and then stores the extracted text as metadata to improve the discoverability of the presentation documents. Figure 11 shows a partial example result returned by the service for the example image.
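A hedged sketch of this step is shown below. It assumes the `pytesseract` wrapper is used to call Tesseract and that each cropped region holds a single line of text; the function name, the configuration string and the injectable `ocr` parameter (which lets the function be exercised without Tesseract installed) are choices made for the example and not a description of our service.

```python
def extract_text(roi_images, ocr=None):
    """Run OCR on each cropped region and return the non-empty results."""
    if ocr is None:
        # Imported lazily so the function can be tested with a stand-in.
        import pytesseract

        def ocr(image):
            # --oem 1 selects the LSTM engine; --psm 7 treats the image
            # as a single line of text (an assumption about our crops).
            return pytesseract.image_to_string(image, config="--oem 1 --psm 7")

    texts = []
    for image in roi_images:
        text = ocr(image).strip()
        if text:  # skip regions where nothing was recognised
            texts.append(text)
    return texts


if __name__ == "__main__":
    # Stand-in OCR function used purely for demonstration.
    fake_results = {"roi-1": " Agile Delivery \n", "roi-2": "", "roi-3": "Roadmap"}
    print(extract_text(list(fake_results), ocr=fake_results.get))
```

The returned list of strings is the kind of value that could be attached to a document as searchable metadata.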
Figure 11: Text extraction results for the example image
Starting with images extracted from presentations, we have used a sophisticated, pre-trained EAST neural network to identify possible text within each image. We have then cropped and pre-processed each region to simplify the job of an OCR processor, and used the Tesseract OCR engine to extract the text that it contains.
Figure 12 shows the transformation of the example image in each step of the text extraction process.
Figure 12: Example image during text extraction process
While effective, our current implementation can be improved further.
As can be seen from the examples in Figure 13, the extraction results are not always perfect, and improving the current implementation’s text extraction capabilities will be an iterative, empirical process.
Figure 13: Example results for images with different resolutions and formats
However, extracting the text from presentation images is only the first step in improving the ability to search in presentation documents. For example, with minor adaptations to how we provide the input and translate the output, a pre-trained model for object detection could also be used to detect if certain known image elements (e.g. chart, map, building) are detected in the visual elements of a document, which could provide another improvement in search capabilities.
As we hope we have demonstrated, pre-annotated data and in-depth AI knowledge are not necessarily required in order to build useful solutions that involve image processing. The more knowledge and better data you have, the more you will be able to do, but simply by understanding how to use the available open-source libraries effectively, you can still work wonders!
1. Xinyu Zhou et al., EAST: An Efficient and Accurate Scene Text Detector, 2017 - https://arxiv.org/abs/1704.03155
2. EAST Model - https://drive.google.com/file/d/0B3APw5BZJ67ETHNPaU9xUkVoV0U/view
3. Nobuyuki Otsu, "A threshold selection method from gray-level histograms", 1979, IEEE Trans. Sys., Man., Cyber. 9 (1): 62–66. doi:10.1109/TSMC.1979.4310076.