
AI | Alexandru Mortan |
20 August 2019


Today, data is a highly valued commodity which can be derived from a variety of different sources, making discoverability a key attribute for making the most of the information held in any organisation. This insight led us to start an internal project to improve the discoverability of information in Endava’s internal document stores. The first deliverables included an API and two communication interfaces, a web application and a chat bot, which provide powerful search capabilities across our internal documents.

An important part of improving the discoverability of these documents is augmenting the information extracted from them with tags that indicate the named entities (e.g. persons, organisations, locations), the client context and the business domain that the documents relate to. These extracted metadata fields help to create more accurate search results. This information is relatively straightforward to extract from textual documents, but we found that it is much more difficult to extract from the graphical information found in many presentations.


A modern presentation usually contains a significant number of graphics and minimizes the amount of text on each slide, relying on the presenter’s verbal skills to deliver the details of the message (see Figure 1).
Figure 1: Sample presentation slides showing the graphical nature of modern presentations

The question then arises: how can we make this information searchable?

The challenge with presentation files was how to augment the small amount of text that could be extracted from them with further information derived from the graphical slides. We realized that emerging technologies, such as deep learning, offer a potential solution to this challenge, given their ability to segment images and recognize text and objects within them.

This article explains how we implemented text extraction from images using standard computer vision techniques and technologies that can be applied to other problems, such as object detection.


The process we designed for text extraction is shown in Figure 2, and we implemented it as a service which loads a given image, detects the location of text regions in images (as rotated bounding boxes) and extracts the information from these detected ROIs (regions of interest).
Figure 2: Text extraction process

The recent growth in reliable open-source technology in the artificial intelligence area meant that we were able to lean heavily on proven third-party libraries in our implementation. The technology we used was largely Python-based: OpenCV for image processing, a pre-trained TensorFlow model for ROI detection and the well-known Tesseract library for OCR (optical character recognition) processing.


ROI detection is one of the most difficult parts of the solution to implement. Even with the assumption that we do not need to process generic images such as photographs of the natural world (for example, landscapes or flowers), our solution needs to allow for different graphics resolutions, unknown slide layout, non-planar objects and different viewing angles. Obtaining an accurate result that allows for all of these complications using traditional image processing algorithms would be very difficult and involve the use of complex and fragile heuristics.

To solve this problem, we decided to use a “learning” approach. However, we did not have annotated data with which to train a model, and we had neither the time nor enough people to perform a comprehensive data annotation exercise, so we decided to investigate the use of a pre-trained deep learning model.

We discovered a promising deep-learning-based ROI detection pipeline for text called EAST (Efficient and Accurate Scene Text), which was described in Xinyu Zhou’s 2017 paper, EAST: An Efficient and Accurate Scene Text Detector [1]. According to the authors, the EAST pipeline, composed of a Fully-Convolutional Network (FCN) and a locality-aware Non-Maximum Suppression (NMS) algorithm, is capable of detecting lines of text without using expensive traditional algorithms and is able to process images at a resolution of 1280x720 at around 16 frames per second (fps).
Figure 3: The structure of the EAST text detection Fully-Convolutional Network (from Zhou’s paper [1]).

We don’t have space in this article to explain EAST in detail, and so we refer you to Zhou’s paper for the details. However, we reproduce Figure 3 from the paper to give an overview of the architecture of the EAST neural network. For our situation, using a pre-trained model based on a known neural network, the key parts of the architecture are the input and the output of the model.

In EAST there are 4 layers of feature maps, denoted as f_i, which are extracted such that their sizes are 1/32, 1/16, 1/8 and 1/4 of the input image. So, for the neural network to process an image, its width and height must each be a multiple of 32.

The first output layer, named “score map” in Figure 3, is the output of a sigmoid activation function and indicates the probability that a region contains text. The second output layer required for text detection is shown in the diagram as “RBOX geometry” and is used to derive the rotated bounding boxes for the detected text regions. The last output layer is labeled “QUAD geometry” and describes text regions by the 4 vertices of a quadrangle. We don’t use this output, as the “RBOX geometry” is sufficient for our purposes.

Figure 4 shows an example image from a sample presentation which will be used as our example to describe the steps in our implementation.
Figure 4: Example image from presentation

Step 1 – Image Preparation

The first step is to load the image and process it using OpenCV in order to get it into a form suitable for use by the EAST model. This involves resizing it, if its dimensions are not multiples of 32, while retaining the width:height ratio to allow transformation back after the extraction. For large images we also shrink them to a maximum size of 2400 pixels wide and high, to avoid running out of memory during the process.
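The size calculation in this step can be sketched in a few lines of plain Python. This is a minimal illustration rather than our production code: the function name `east_input_size`, the choice to round to the nearest multiple of 32, and the decision to return separate width and height scaling ratios are all assumptions made for the example. The resulting size would then be passed to an image resizing routine such as OpenCV's `cv2.resize`.

```python
def east_input_size(width, height, max_side=2400, multiple=32):
    """Return (new_width, new_height, ratio_w, ratio_h) for an EAST-ready image.

    Hypothetical helper: the name, the rounding rule and the return format
    are assumptions for illustration only.
    """
    # Shrink large images so that neither side exceeds max_side.
    scale = min(1.0, max_side / max(width, height))
    scaled_w = width * scale
    scaled_h = height * scale
    # Snap each side to the nearest multiple of 32, never going below 32.
    new_w = max(multiple, int(round(scaled_w / multiple)) * multiple)
    new_h = max(multiple, int(round(scaled_h / multiple)) * multiple)
    # The ratios let us map detected coordinates back to the original image.
    return new_w, new_h, new_w / width, new_h / height


if __name__ == "__main__":
    print(east_input_size(1000, 600))
    print(east_input_size(3000, 2000))
```

Dividing a detected x coordinate by `ratio_w` (and a y coordinate by `ratio_h`) maps it back onto the original image.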

Step 2 – Initial Image Processing

The next step is to process the image with a forward pass through a proven pre-trained TensorFlow EAST model [2], which achieved an F1-score of 80.83 on the ICDAR 2015 data set, a very strong result. This step creates the two output layers described above, namely the score map and the RBOX geometry data. Example output shapes for our sample image are shown in Figure 5. The shape of the input matrix is given by the dimensions in pixels of the image passed to the model together with the image RGB channels. The output matrices are produced at the scale of the last of the 4 feature map layers, which is 1/4 of the input size, hence their dimensions are those of the input divided by 4. The output matrices contain typical sigmoid confidence scores in the case of the “score_map” and RBOX coordinates in the case of the “geometry” matrix.

Figure 5: Input and Output shapes for the example image in Figure 4
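The relationship between the input and output shapes can be written down directly. The sketch below assumes a batch-first, channels-last layout and 5 geometry channels (4 edge distances plus a rotation angle, as in the RBOX description in the EAST paper); the function name is hypothetical and the exact layout of a particular pre-trained model may differ.

```python
def east_tensor_shapes(width, height):
    """Return the assumed (input, score_map, geometry) shapes for one image."""
    if width % 32 or height % 32:
        raise ValueError("width and height must be multiples of 32")
    # One RGB image: batch of 1, height, width, 3 colour channels.
    input_shape = (1, height, width, 3)
    # Both outputs are produced at a quarter of the input resolution.
    score_shape = (1, height // 4, width // 4, 1)
    # RBOX geometry: 4 distances to the box edges plus 1 rotation angle.
    geometry_shape = (1, height // 4, width // 4, 5)
    return input_shape, score_shape, geometry_shape


if __name__ == "__main__":
    for shape in east_tensor_shapes(1280, 704):
        print(shape)
```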

Step 3 – Region of Interest Identification

After the EAST model solves the challenging problem of computing the probability of a region containing text or not, the next step is to translate the output of the model into the rotated bounding boxes of the ROIs in the original image.
We first eliminate regions with a low probability of containing text by filtering out those with a score map value below a threshold (we used 0.8 in this example, although this is configurable). The remaining regions are considered good candidates, and a series of matrix operations and mathematical formulas, designed with performance in mind, is used to convert the EAST model’s geometry representation of each region into a simpler four-point Cartesian coordinate format describing a rotated bounding box (see our example in Figure 6).

Figure 6: EAST geometry format vs final extracted format of the solution
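To make the conversion concrete, here is a simplified, unvectorised sketch of the idea in plain Python. It assumes that each geometry entry holds the distances from a pixel to the top, right, bottom and left edges of its box followed by a rotation angle in radians, and that each score map cell corresponds to 4 pixels in the model input. The channel ordering, the rotation convention and the function names are assumptions for illustration; a real implementation would use matrix operations for speed.

```python
import math


def rbox_to_corners(x, y, top, right, bottom, left, angle):
    """Convert one RBOX entry into 4 corner points (assumed convention)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Corners relative to the pixel, before rotation:
    # top-left, top-right, bottom-right, bottom-left.
    local = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    # Rotate each offset by the angle and translate to the pixel position.
    return [(x + dx * cos_a - dy * sin_a, y + dx * sin_a + dy * cos_a)
            for dx, dy in local]


def candidate_boxes(score_map, geometry, threshold=0.8, stride=4):
    """Keep cells scoring at least the threshold and restore their boxes.

    score_map is a 2D list of floats; geometry is a 2D list of
    (top, right, bottom, left, angle) tuples with the same layout.
    """
    boxes = []
    for row, scores in enumerate(score_map):
        for col, score in enumerate(scores):
            if score < threshold:
                continue  # low probability of text, discard
            top, right, bottom, left, angle = geometry[row][col]
            corners = rbox_to_corners(col * stride, row * stride,
                                      top, right, bottom, left, angle)
            boxes.append((corners, score))
    return boxes
```

Each returned entry pairs the four corner points with the confidence score, which is the form needed by the merging step that follows.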

Step 4 – Rationalisation of Bounding Boxes

In Figure 7 we show our example image, annotated with the bounding boxes that we detected. As can be observed, we have a problem with the precision of the bounding boxes and the degree to which they overlap.
Figure 7: Detected rotated bounding boxes before NMS

To solve this problem, the EAST pipeline paper suggests merging the intersecting geometries using the NMS algorithm. Our implementation uses the locality-aware NMS variant recommended in the original paper, which assumes that geometries from nearby pixels tend to be highly correlated, so it can merge them in a row-by-row manner. This hugely improves the speed of the process compared to the standard NMS algorithm, as it reduces the complexity from O(n^2) to a best-case scenario of O(n).
Figure 8: Regions of interest containing text result

The result of merging the candidate bounding boxes in our example image using NMS is shown in Figure 8.
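The merging idea can be sketched as follows. To keep the example short it works on axis-aligned boxes of the form (x1, y1, x2, y2, score) rather than rotated quadrilaterals, which is a deliberate simplification; the function names and the 0.2 overlap threshold are also assumptions. The structure follows the locality-aware NMS described in the EAST paper: score-weighted merging of consecutive overlapping boxes, followed by a standard NMS pass over the much smaller merged set.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def weighted_merge(a, b):
    """Average the coordinates weighted by score; the scores are summed."""
    total = a[4] + b[4]
    coords = [(a[i] * a[4] + b[i] * b[4]) / total for i in range(4)]
    return (*coords, total)


def standard_nms(boxes, threshold):
    """Classic NMS: keep the best box, drop others that overlap it."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept


def locality_aware_nms(boxes, threshold=0.2):
    """Merge consecutive overlapping boxes, then run standard NMS.

    The boxes are assumed to arrive in row-by-row order, so neighbours
    in the list come from neighbouring pixels.
    """
    merged = []
    previous = None
    for box in boxes:
        if previous is not None and iou(box, previous) > threshold:
            previous = weighted_merge(box, previous)
        else:
            if previous is not None:
                merged.append(previous)
            previous = box
    if previous is not None:
        merged.append(previous)
    return standard_nms(merged, threshold)
```

When most overlapping candidates are adjacent in the input, the first pass collapses them in a single sweep and the quadratic standard NMS only sees a handful of boxes.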


Having located the regions of interest in a given image, we then attempt to extract the text we believe is in those regions. Previous experience has shown that we need to crop images carefully to get good results from OCR processing, so we use standard image processing techniques to crop each ROI to a configurable padding level (2% in the case of our example image, as shown in Figure 9).

Figure 9: Cropped ROI
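A padded crop of this kind can be sketched as below. The example takes the axis-aligned rectangle enclosing the four corner points and grows it by a fraction of its own width and height, clamped to the image borders. Treating the 2% as a fraction of the box size, and ignoring the rotation of the box, are assumptions made for illustration, as is the function name.

```python
import math


def padded_crop_bounds(corners, image_width, image_height, padding=0.02):
    """Return (left, top, right, bottom) pixel bounds for a padded crop."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Padding is taken as a fraction of the box size (an assumption).
    pad_x = (x_max - x_min) * padding
    pad_y = (y_max - y_min) * padding
    # Round outwards and clamp so the crop never leaves the image.
    left = max(0, int(math.floor(x_min - pad_x)))
    top = max(0, int(math.floor(y_min - pad_y)))
    right = min(image_width, int(math.ceil(x_max + pad_x)))
    bottom = min(image_height, int(math.ceil(y_max + pad_y)))
    return left, top, right, bottom
```

With an image held as a NumPy array, as OpenCV returns it, the crop itself would then be `image[top:bottom, left:right]`.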


From previous projects, we also know that an image pre-processing step to simplify the image will improve the accuracy of the extracted results. In our implementation, we transform the image to a grayscale version and then to a binary image by applying Otsu’s binarization method [3]. The result for one of our images is shown in Figure 10.

Figure 10: Pre-processed image
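For readers unfamiliar with Otsu's method, the following plain-Python sketch shows what it computes: the threshold that maximises the between-class variance of the grayscale histogram. It is written for clarity over a flat list of 8-bit pixel values and is not how we run it in practice; the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of integers in 0..255."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        # Background = pixels with a value <= level.
        weight_bg += histogram[level]
        sum_bg += level * histogram[level]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is background from here on
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]
```

In OpenCV the same two steps are available as `cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)` followed by `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`.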


As mentioned earlier, we use Tesseract as our OCR engine. Tesseract has long been considered one of the most accurate open-source OCR engines available, and the fact that it had a new beta version that used LSTM (long short-term memory) internally meant that it was the obvious choice for OCR processing in our implementation.
Our implementation passes the pre-processed images to Tesseract to extract the text from them, then stores that text as metadata to improve the discoverability of the presentation documents. Figure 11 shows a partial example result returned by the service for the example image.
Figure 11: Text extraction results for the example image
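One simple way to hand a cropped, binarized ROI to Tesseract is through its command-line interface, as sketched below. This assumes Tesseract 4 or later is installed and on the PATH; the helper names, the choice of `--oem 1` (LSTM engine only) and `--psm 7` (treat the image as a single line of text) are assumptions for the example and not necessarily the settings our service uses.

```python
import subprocess


def tesseract_command(image_path, language="eng", oem=1, psm=7):
    """Build the Tesseract CLI call for one ROI image (hypothetical helper)."""
    return [
        "tesseract", image_path, "stdout",  # write the text to stdout
        "-l", language,                     # recognition language
        "--oem", str(oem),                  # 1 = LSTM engine only
        "--psm", str(psm),                  # 7 = single text line
    ]


def extract_text(image_path, **options):
    """Run Tesseract on the image file and return the recognised text."""
    result = subprocess.run(
        tesseract_command(image_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Because each ROI is expected to hold one line of text, a single-line page segmentation mode is a natural starting point, but the best mode depends on the images and is worth testing empirically.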


Starting with images extracted from presentations, we used a sophisticated, pre-trained EAST neural network to identify possible text within each image. We then cropped and pre-processed each detected region to simplify the job of the OCR processor, and used the Tesseract OCR engine to extract the text that it contains.

Figure 12 shows the transformation of the example image in each step of the text extraction process.
Figure 12: Example image during text extraction process


While effective, our current implementation can be improved further.

As can be seen from the examples in Figure 13, the extraction results are not always perfect, and improving the current implementation’s text extraction capabilities will be an iterative, empirical process.
Figure 13: Example results for images with different resolutions and formats

However, extracting the text from presentation images is only the first step in improving the ability to search in presentation documents. For example, with minor adaptations to how we provide the input and translate the output, a pre-trained model for object detection could also be used to detect if certain known image elements (e.g. chart, map, building) are detected in the visual elements of a document, which could provide another improvement in search capabilities.


As we hope we have demonstrated, pre-annotated data and in-depth AI knowledge are not necessarily required in order to build useful solutions that involve image processing. The more knowledge and better data you have, the more you will be able to do, but just by understanding how to use the available open-source libraries effectively, you can still work wonders!


[1]. Xinyu Zhou, EAST: An Efficient and Accurate Scene Text Detector, 2017 - https://arxiv.org/abs/1704.03155
[2]. EAST Model - https://drive.google.com/file/d/0B3APw5BZJ67ETHNPaU9xUkVoV0U/view
[3]. Nobuyuki Otsu, "A threshold selection method from gray-level histograms", 1979, IEEE Trans. Sys., Man., Cyber. 9 (1): 62–66. doi:10.1109/TSMC.1979.4310076.

Alexandru Mortan

Software Development Consultant

Alex is a passionate senior software engineer with a background in Java, who enjoys experimenting with cutting-edge practices and technologies in the field of Artificial Intelligence. Alex is involved in a number of initiatives which aim to automate repetitive business processes, mostly in the financial sector. When he isn’t exploring new ways to make businesses more efficient and allow our clients to focus on more important and valuable work, you can find Alex travelling to remote countries and exploring new and different cultures.

