The Rise of Vector Databases

Artificial Intelligence | Gabriel Preda | 19 September 2023

Today’s data-driven applications deal with highly complex, multi-dimensional data, including images, sound, video and text, which traditional databases and search engines were not designed to handle. Over the past few years, vector databases have emerged to address the dual challenge of enabling efficient searches while maintaining relatively high accuracy for these data types. Let’s look at why these databases have become increasingly popular.

What are vector databases?

Vector databases, also known as ‘vector search databases’ or ‘vector similarity search engines’, are designed for storing, indexing, fast retrieval and similarity search of data represented as vectors. In mathematics, vectors are a simple way to represent both magnitude and direction in space. Vectors can also be decomposed into their components. In the illustration below, a vector v in a two-dimensional space x0y is split into its two components along the x and y axes: vx and vy. One possible notation for this vector is v = [vx, vy].

[Figure: example vector v in a two-dimensional space x0y, decomposed into its components vx and vy]

In data science, vectors can be used to represent complex information in a way that machines can understand. For this purpose, unstructured data like images, sound or text is transformed into a numerical form, that is, a list of numbers. The length of the list is the vector dimension, and each number in the list is one of the vector’s components. The transformation of the unstructured data into a vector representation is typically done using a machine learning model, and the encoding of data as a vector is called ‘embedding’. Embeddings have been used for quite a while in machine learning, most notably for text data.

In the following illustration, we observe the transformation of the terms ‘king’, ‘queen’, ‘man’ and ‘woman’ into two-dimensional vectors. In this simplified depiction, ‘king’ corresponds to the components [1, 1], ‘queen’ to [2, 2] and so on. Adopting the vector representation allows us to apply vector algebra operations within the respective two-dimensional space. Using the metrics of this vector space, we can calculate distances and vector similarities just as we would for any mathematical vector.

[Figure: two-dimensional vector representations of the terms ‘king’, ‘queen’, ‘man’ and ‘woman’]
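To make this concrete, here is a minimal sketch of such vector algebra in Python: ‘king’ and ‘queen’ use the components from the illustration, while the components for ‘man’ and ‘woman’ are hypothetical values chosen purely for illustration.

```python
import numpy as np

# Toy 2-D word vectors: 'king' and 'queen' use the components from the
# illustration; 'man' and 'woman' are hypothetical, for illustration only.
king = np.array([1.0, 1.0])
queen = np.array([2.0, 2.0])
man = np.array([1.0, 0.0])    # hypothetical
woman = np.array([2.0, 1.0])  # hypothetical

# Vector algebra: the classic word-analogy arithmetic king - man + woman
analogy = king - man + woman
print(analogy)  # [2. 2.] -> coincides with 'queen' in this toy example

# A distance in the vector space: Euclidean distance between 'king' and 'queen'
print(np.linalg.norm(king - queen))  # 1.414...
```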

More generally, text embedding creates vector representations of words, phrases or sentences based on their semantic and syntactic relationship to a larger language corpus. Some examples of methods to create text embeddings are Bag-of-Words (BOW) [1], TF-IDF (term frequency – inverse document frequency) [1], word embeddings (word2vec, GloVe) [1] or pre-trained language models, such as BERT [1] or GPT [2]. The vector dimension depends on the type of embeddings we adopt. As an example, with GloVe, a frequently used vector dimension is 300.
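As a simple illustration of one of these methods, the following sketch builds TF-IDF vectors for a tiny corpus using scikit-learn; the corpus is invented for demonstration purposes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny illustrative corpus; each document becomes one TF-IDF vector.
corpus = [
    "vector databases store embeddings",
    "embeddings represent text as vectors",
    "databases index vectors for fast search",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

# For TF-IDF, the vector dimension equals the vocabulary size of the corpus.
print(X.shape)  # (3, vocabulary_size)
print(vectorizer.get_feature_names_out())
```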

For image data, embeddings can be created using convolutional neural networks (CNNs) [3]. Once a CNN has been trained on image data, the activations of its intermediate layers can be used as feature vectors (embeddings), which correspond to a latent representation of the images. Another method is to use autoencoders, where the compressed representation of the input image, also known as its projection into the latent space, serves as the embedding.
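A common pattern for the first approach, sketched below with PyTorch and torchvision, is to drop the final classification layer of a pre-trained CNN and use the remaining network as a feature extractor; the choice of ResNet-18 and the image path are assumptions for illustration.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained ResNet-18 with its final classification layer removed; the
# remaining network maps an image to a 512-dimensional feature vector.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
extractor.eval()

# Standard ImageNet preprocessing for the pre-trained network.
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("cat.jpg").convert("RGB")  # hypothetical image path
with torch.no_grad():
    embedding = extractor(preprocess(image).unsqueeze(0)).flatten()
print(embedding.shape)  # torch.Size([512])
```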

Similarities between vectors are calculated using metrics such as cosine similarity, Euclidean distance or Jaccard distance [4]. Cosine similarity, the preferred metric for text data, evaluates how similar two vectors are by calculating the cosine of the angle between them in a multi-dimensional space. Two vectors pointing in the same direction have a cosine similarity of 1; two orthogonal vectors, whose dot product is zero, have a cosine similarity of 0.
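A minimal sketch of the cosine similarity calculation in NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b: (a . b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 1.0])
b = np.array([2.0, 2.0])   # same direction as a
c = np.array([1.0, -1.0])  # orthogonal to a

print(cosine_similarity(a, b))  # 1.0 (same direction)
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
```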

The value of vector databases

Each piece of data we want to store in a vector database is represented as a high-dimensional vector encoded with one of the methods described above. The actual data storage can be on disk, in memory or hybrid. The search is based on the similarity of the stored vectors to a given query. For a vector database to be efficient, however, it is not enough to represent the data as vectors. We also need to index the vectors to reduce the computational effort of searching, which is where approximate nearest neighbour (ANN) [5] algorithms come in. Rather than performing a full similarity computation against every stored vector, we use one of the ANN implementations to reduce the overall computational cost of the search.
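To see what ANN algorithms are optimising away, the sketch below shows the brute-force alternative they replace: an exact search that compares the query against every stored vector; the data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
database = rng.normal(size=(100_000, 128))  # 100k stored 128-d vectors
query = rng.normal(size=128)

# Brute-force search: cosine similarity against every vector in the database.
# ANN indexes exist precisely to avoid this O(n) scan per query.
norms = np.linalg.norm(database, axis=1) * np.linalg.norm(query)
similarities = database @ query / norms
top_5 = np.argsort(similarities)[-5:][::-1]  # indices of the 5 nearest vectors
print(top_5, similarities[top_5])
```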

What all ANN algorithms have in common is that they preprocess the data to create an index that accelerates the search for nearest neighbours, eliminating the need to compare a query with every vector in the database. Some of the most frequently used ANN algorithms are k-d trees, locality-sensitive hashing (LSH), Hierarchical Navigable Small World (HNSW) and Approximate Nearest Neighbors Oh Yeah (Annoy) [5]. A k-d tree is a k-dimensional tree that partitions the space using hyperplanes. LSH hashes similar input items into the same ‘buckets’ with high probability, and we then calculate the similarity only for vectors in the same bucket; one particularity of this algorithm is that it maximises hash collisions instead of minimising them. HNSW constructs graph structures optimised to reduce the number of steps needed to traverse them between any pair of vertices. Annoy was developed by Spotify to search for points in a multi-dimensional space that are close to a given query point; it is a C++ library with bindings for Python, used in the sketch below.
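As a concrete example, here is a minimal sketch of building and querying an Annoy index; the dimension, tree count and data are illustrative choices, not recommendations.

```python
import numpy as np
from annoy import AnnoyIndex

dim = 128
index = AnnoyIndex(dim, "angular")  # 'angular' is a cosine-style distance

# Fill the index with illustrative random vectors.
rng = np.random.default_rng(0)
for i in range(10_000):
    index.add_item(i, rng.normal(size=dim).tolist())

index.build(10)  # 10 trees: more trees improve accuracy but enlarge the index

# Approximate search: the 5 stored items nearest to a query vector.
query = rng.normal(size=dim).tolist()
neighbours, distances = index.get_nns_by_vector(query, 5, include_distances=True)
print(neighbours, distances)
```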

Current implementations of search algorithms in vector databases significantly reduce the overall cost of search by combining:

  • Parallelism: taking advantage of the intrinsic parallelism of the similarity comparison,
  • Data reduction: leveraging data space partitioning using the indexing algorithm,
  • Pruning: discarding large portions of the dataset, focusing only on those regions that are most likely to contain the nearest neighbours,
  • Approximation: prioritising speed over accuracy, terminating the search once a ‘good enough’ result has been found.

 

Initially, vector databases found traction in applications such as ranking, recommendation systems, semantic search and similarity search across very large collections of images, sound, video or text. With the rising public interest in large language models (LLMs), it became obvious that vector databases can also serve as long-term memory for LLMs. By combining the LangChain framework, vector databases and LLMs, we can create applications where agents are tasked with solving a specific problem, for example finding a certain piece of information in our local collection of documents based on a query formulated in natural language.
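A minimal sketch of this pattern using LangChain, as its API looked in 2023 (the API surface changes frequently); the embedding model, vector store and documents are assumptions for illustration.

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Illustrative local document snippets; in practice these would be chunks
# of our own document collection.
texts = [
    "Invoices are archived under the finance share.",
    "The deployment guide lives in the engineering wiki.",
    "Holiday requests are handled through the HR portal.",
]

# Embed the texts and index them in a FAISS vector store.
# Requires the faiss package and an OPENAI_API_KEY environment variable.
store = FAISS.from_texts(texts, OpenAIEmbeddings())

# A natural-language query retrieves the most similar documents; an LLM
# agent can then use the retrieved text as context for its answer.
results = store.similarity_search("Where can I find last year's invoices?", k=1)
print(results[0].page_content)
```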

The rapidly rising interest in vector databases is underscored by substantial investments in various start-ups. Pinecone, founded in 2019, raised a $100 million investment round in April 2023, shortly after the announcement of GPT-4, elevating its valuation to $750 million. Qdrant, which offers an API service for nearest vector search, also secured a $10 million investment in 2023. Weaviate, one of the fastest-growing vector database start-ups, likewise received a $50 million investment in April 2023. Furthermore, established companies and long-standing organisations now offer vector database services as well; examples include Elastic’s vector search and Google Cloud’s Vertex AI Matching Engine.

As vector databases continue to gain momentum, their still untapped potential remains a focal point of interest. The anticipated applications span a wide range of sectors and problem types, including healthcare, genetics, drug discovery, e-commerce, recommendation systems, financial services, content creation, and virtual and augmented reality. The list could go on, which shows that vector databases promise to add substantial value in the coming years.

References

[1] Neeraj Agarwal, The Ultimate Guide To Different Word Embedding Techniques in NLP, KDnuggets, https://www.kdnuggets.com/2021/11/guide-word-embedding-techniques-nlp.html (accessed at 16.08.2023)

[2] Ajay Halthor, Word Embeddings, Explained, Towards Data Science, https://towardsdatascience.com/word-embeddings-explained-c07c5ea44d64 (accessed at 16.08.2023)

[3] Mayank Mishra, Convolutional Neural Networks, Explained, Towards Data Science, https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939 (accessed at 16.08.2023)

[4] Maarten Grootendorst, 9 Distance Measures in Data Science, Towards Data Science, https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa (accessed at 16.08.2023)

[5] Labelbox, How vector similarity search works, https://labelbox.com/blog/how-vector-similarity-search-works/ (accessed at 16.08.2023)

Gabriel Preda

Principal Data Scientist

Gabriel has a PhD in computational electromagnetics and started his career in academic and private research. He co-founded two technology start-ups and has worked in software development for 15+ years. Currently, Gabriel is a Principal Data Scientist at Endava, working for a range of industries and writing about advanced data analytics, geospatial analysis, natural language processing (NLP), anomaly detection, MLOps and generative AI. He is a high-profile contributor in the world of competitive machine learning and currently one of the few triple Kaggle Grandmasters. Outside of data science and machine learning, Gabriel enjoys hiking, climbing and reading.

 
