<img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=4958233&amp;fmt=gif">
 
RSS Feed

Technology | Alveiro Garcia |
08 June 2021

I still remember those days of using search engines in different portals and not getting even one relevant result. That could be the reason Google became so popular, and Google certainly resolved that problem.

As the amount of content is growing daily and with an increased pace, giving such powerful search capabilities to users is getting more important as well. Which is why I want to share some fundamental information on the topic.

Currently, Elasticsearch is ranked as the most popular search engine according to DB-Engines. This article explores fundamental Elasticsearch concepts such as indexes, documents, and inverted indexes, and how these concepts work together to provide storage and relevance scoring. Additionally, I will present a case where these Elasticsearch features were evaluated and Elasticsearch was proposed as the main data repository for an internal project.

FUNDAMENTAL CONCEPTS OF THE APACHE LUCENE LIBRARY

The core of Elasticsearch is the Apache Lucene library, which includes features for indexing, searching, retrieving and updating documents, and text analysis. The figure below depicts the integration between Elasticsearch and Lucene, and how they interact with external systems:

Light installation showing 3D cubes

The fundamental concepts required to understand the theory behind Apache Lucene are indexes, documents, inverted indexes, scoring, and tokenisation. I will briefly describe these concepts below.

An index is a collection of documents sharing conceptual and logical similarities. You can think of an index as a folder with multiple related documents. Given that Elasticsearch is a distributed system and clusters can be added on demand, there is virtually no limit to the number of documents an Elasticsearch server can store.

Light installation showing 3D cubes

A document is a record containing information related to the index. Documents do not have a specific scheme and every document pushed into the index is tagged with a unique identifier. A document contains a set of fields and is generally stored in JSON format. Fields are the minimal unit of storage in the Lucene ecosystem. A simple document is illustrated in the following figure:

Light installation showing 3D cubes

An inverted index is a data structure storing information in a complex HashMap, aiming to facilitate the search of terms contained within the fields of the documents. The concept of the inverted index is close to the concept of a book index. When you need to search for a specific term in the book, you use the book index and check which pages contain information about the queried term. The inverted index is composed of two substructures:

  • the term dictionary which groups all the terms included in the documents in a sorted list.
  • the postings list which creates a list of each term, indicating the documents where the term appears.

The following table shows an example of an inverted index substructures build based on a set of documents:

Light installation showing 3D cubes

Scoring is the process that compares the user input against the stored documents and assigns relevance values for each result. Similar to how the Google search engine works, the outcome of the scoring process is a sorted array containing all the search matches ordered by score.

Three aspects should be considered for the scoring process:

  • relevance, 
  • user satisfaction, and
  • algorithms.

The first aspect is related to the relevance definition itself. Even though different users perform the same search, the relevance results could be different for each of them; therefore, the relevance should be aligned to the user expectations.

The second aspect concerns knowledge about a user’s satisfaction with the relevance score provided in the search. In fact, giving feedback to the algorithm on the user’s perception is a very fascinating field related to reinforcement learning techniques, which by itself would warrant a separate article.

The third aspect encompasses the algorithms which calculate the scores. There are two main algorithms used for scoring: Term Frequency-Inverse Document Frequency (TD-IDF) and Best Match 25 (BM25). Both algorithms are rooted in the concept of tokenisation.

Tokenisation is a fundamental concept of the Natural Language Processing (NLP) field, which is also being applied to search engines. Tokenisation consists of splitting the text string or documents into tokens, or smaller chunks. Although tokenisation libraries provide a set of methods for splitting text, users can implement their own rules using regular expressions. Common tokenisation methods include breaking up the text by words or by punctuation marks.

Although tokenisation is commonly applied to text, in certain situations, special techniques allow for the tokenisation of other kinds of data, such as numbers or geographical coordinates. Consequently, each specific use case should be taken into consideration during the configuration of Elasticsearch tokenisation capabilities.

HOW THE FUNDAMENTAL CONCEPTS WORK TOGETHER

The previously reviewed concepts work together to establish relevance scores from semi-structured data. When a document is saved in Elasticsearch, two processes happen in parallel: in the first, the raw document is stored in the index; in the second, each document passes through an analysis phase before being stored in the inverted index. This is described in the following figure:

Light installation showing 3D cubes

The analysis layer is a configurable step which, by default, includes three actions: character filter, tokeniser, and token filters. These actions facilitate the search algorithms. Additionally, each document pushed into the search engine goes through a metadata enrichment process, which can be as simple as the counting of terms or as complex as the calculation of a sentiment trend.

Naturally, when the query API is used for retrieving documents, the search terms pass through the same actions used when documents were stored. That way, search terms can be matched against the inverted index, and the search server can calculate the scores accordingly and return the results sorted by relevance. This process is depicted in the next figure:

Light installation showing 3D cubes

The following example will clarify the process described above. Let us say we have the next document from a review:

{
description”: “Barcelona has beautifullllll beaches.”
}

After applying the analysis step, the information in the inverted index looks something like this:

[“barcelona”, “beautiful”, “beach”]

When a user inputs the search string “Barcelona’s beaches” and the analysis layer is applied, the search string is transformed into something like this:

[“barcelona”, “beach”]

The resulting terms can now be matched against the inverted index, and the result can be obtained by direct comparison.

GETTING THE BEST OUT OF THIS STACK 

For Endava’s Innovation Lab, an internal competition where Endavans develop innovative tech solutions, we proposed a product using Elasticsearch as data storage. The goal was to provide the recruiting team with a list of the most relevant candidates for a specific opening.

To achieve this, the system extracts professional profiles from various sources and then creates a cherry-picked selection of candidates for the recruiting team, based on NLP techniques leveraged by Elasticsearch’s built-in algorithms.

All the features explained before fed the idea of using Elasticsearch as principal storage. However, there were two design decisions remaining for the final architectural approach: the principal storage mechanism and the deployment method (cloud or on-premises).

For the principal storage decision,

  • the first option was to use Elasticsearch just as a search engine returning indexes to retrieve content from a dedicated database. This dedicated database would be a relational database, like PostgreSQL or MySQL, or a more robust document database, like Mongo or DynamoDB.
  • the other option was to use Elasticsearch as main storage as well.

Ultimately, Elasticsearch was selected as the principal storage mechanism, due to the medium-scale storage requirement for the entire solution (estimated at just 36 Gigabytes) and the non-transactional nature of the business requirements.

We deployed this solution in the AWS cloud because it provides a fully maintained Elasticsearch service, and it can be easily integrated with other services within the AWS ecosystem.

The following diagram reflects the final design, including the forementioned decisions:

Light installation showing 3D cubes

CONCLUSIONS

The purpose of this article was to review how the main components of Elasticsearch and Apache Lucene interact to provide an effortless way to resolve relevance problems with search results.

Apache Lucene is the heart of Elasticsearch and provides an interface which helps with abstracting the complexity and algorithms behind the scenes.

For most business requirements, a default configuration of Elasticsearch will be sufficient. However, some cases may require improvements in how documents are scored. This kind of score tuning is possible, but such implementations require a deep understanding of the relevance algorithms used in Apache Lucene.

Alveiro Garcia

Data Engineer & Bogota Data Community Lead

Alveiro is a Data Engineer with more than 8 years of experience building products based on Java and Python technologies in different layers but mostly focused on data. He has participated in projects in different sectors like telecommunications, banking, health, arts and insurance, and he has been a trainer for software-related programmes. In his free time, Alveiro loves spending time with his family, reading, and continuously learning new things. He also likes travelling to learn about history.

 

From This Author

 

Archive

  • 13 November 2023

    Delving Deeper Into Generative AI: Unlocking Benefits and Opportunities

  • 07 November 2023

    Retrieval Augmented Generation: Combining LLMs, Task-chaining and Vector Databases

  • 19 September 2023

    The Rise of Vector Databases

  • 27 July 2023

    Large Language Models Automating the Enterprise – Part 2

  • 20 July 2023

    Large Language Models Automating the Enterprise – Part 1

  • 11 July 2023

    Boost Your Game’s Success with Tools – Part 2

  • 04 July 2023

    Boost Your Game’s Success with Tools – Part 1

  • 01 June 2023

    Challenges for Adopting AI Systems in Software Development

  • 07 March 2023

    Will AI Transform Even The Most Creative Professions?

  • 14 February 2023

    Generative AI: Technology of Tomorrow, Today

  • 25 January 2023

    The Joy and Challenge of being a Video Game Tester

  • 14 November 2022

    Can Software Really Be Green

  • 26 July 2022

    Is Data Mesh Going to Replace Centralised Repositories?

  • 09 June 2022

    A Spatial Analysis of the Covid-19 Infection and Its Determinants

  • 17 May 2022

    An R&D Project on AI in 3D Asset Creation for Games

  • 07 February 2022

    Using Two Cloud Vendors Side by Side – a Survey of Cost and Effort

  • 25 January 2022

    Scalable Microservices Architecture with .NET Made Easy – a Tutorial

  • 04 January 2022

    Create Production-Ready, Automated Deliverables Using a Build Pipeline for Games – Part 2

  • 23 November 2021

    How User Experience Design is Increasing ROI

  • 16 November 2021

    Create Production-Ready, Automated Deliverables Using a Build Pipeline for Games – Part 1

  • 19 October 2021

    A Basic Setup for Mass-Testing a Multiplayer Online Board Game

  • 24 August 2021

    EHR to HL7 FHIR Integration: The Software Developer’s Guide – Part 3

  • 20 July 2021

    EHR to HL7 FHIR Integration: The Software Developer’s Guide – Part 2

  • 29 June 2021

    EHR to HL7 FHIR Integration: The Software Developer’s Guide – Part 1

  • 08 June 2021

    Elasticsearch and Apache Lucene: Fundamentals Behind the Relevance Score

  • 27 May 2021

    Endava at NASA’s 2020 Space Apps Challenge

  • 27 January 2021

    Following the Patterns – The Rise of Neo4j and Graph Databases

  • 12 January 2021

    Data is Everything

  • 05 January 2021

    Distributed Agile – Closing the Gap Between the Product Owner and the Team – Part 3

  • 02 December 2020

    8 Tips for Sharing Technical Knowledge – Part 2

  • 12 November 2020

    8 Tips for Sharing Technical Knowledge – Part 1

  • 30 October 2020

    API Management

  • 22 September 2020

    Distributed Agile – Closing the Gap Between the Product Owner and the Team – Part 2

  • 25 August 2020

    Cloud Maturity Level: IaaS vs PaaS and SaaS – Part 2

  • 18 August 2020

    Cloud Maturity Level: IaaS vs PaaS and SaaS – Part 1

  • 08 July 2020

    A Virtual Hackathon Together with Microsoft

  • 30 June 2020

    Distributed safe PI planning

  • 09 June 2020

    The Twisted Concept of Securing Kubernetes Clusters – Part 2

  • 15 May 2020

    Performance and security testing shifting left

  • 30 April 2020

    AR & ML deployment in the wild – a story about friendly animals

  • 16 April 2020

    Cucumber: Automation Framework or Collaboration Tool?

  • 25 February 2020

    Challenges in creating relevant test data without using personally identifiable information

  • 04 January 2020

    Service Meshes – from Kubernetes service management to universal compute fabric

  • 10 December 2019

    AWS Serverless with Terraform – Best Practices

  • 05 November 2019

    The Twisted Concept of Securing Kubernetes Clusters

  • 01 October 2019

    Cognitive Computing Using Cloud-Based Resources II

  • 17 September 2019

    Cognitive Computing Using Cloud-Based Resources

  • 03 September 2019

    Creating A Visual Culture

  • 20 August 2019

    Extracting Data from Images in Presentations

  • 06 August 2019

    Evaluating the current testing trends

  • 23 July 2019

    11 Things I wish I knew before working with Terraform – part 2

  • 12 July 2019

    The Rising Cost of Poor Software Security

  • 09 July 2019

    Developing your Product Owner mindset

  • 25 June 2019

    11 Things I wish I knew before working with Terraform – part 1

  • 30 May 2019

    Microservices and Serverless Computing

  • 14 May 2019

    Edge Services

  • 30 April 2019

    Kubernetes Design Principles Part 1

  • 09 April 2019

    Keeping Up With The Norm In An Era Of Software Defined Everything

  • 25 February 2019

    Infrastructure as Code with Terraform

  • 11 February 2019

    Distributed Agile – Closing the Gap Between the Product Owner and the Team

  • 28 January 2019

    Internet Scale Architecture

OLDER POSTS