Today’s data-driven applications deal with highly complex, multi-dimensional data, including images, sound, video and text, for which traditional databases and search engines are not well adapted by design. Over the past few years, vector databases have emerged as a solution to the dual challenge of enabling efficient searches while maintaining relatively high accuracy for these data types. Let’s look at why these databases have become increasingly popular.
What are vector databases?
Vector databases, also known as ‘vector search databases’ or ‘vector similarity search engines’, are designed for storing, indexing, fast retrieval and similarity search of data represented as vectors. In mathematics, vectors are a simple way to represent both magnitude and direction in space. Vectors can also be decomposed into their components. In the illustration below, a vector v in a two-dimensional space xOy is split into its two components along the x and y axes: vx and vy. One possible notation for this vector is v = [vx, vy].
In data science, vectors can be used to represent complex information in a way that machines can understand. For this purpose, unstructured data like images, sound or text is transformed into a numerical form, that is, a list of numbers. The dimension of the list is the vector dimension, and each number in the list is one of the vector’s components. The transformation of the unstructured data into a vector representation is typically done using a machine learning model, and the encoding of data as a vector is called ‘embedding’. Embeddings have been used for quite a while in machine learning, most notably for text data.
In the following illustration, we observe the transformation of the terms ‘king’, ‘queen’, ‘man’ and ‘woman’ into two-dimensional vectors. In this simplified depiction, ‘king’ corresponds to the numerical components [1, 1], ‘queen’ corresponds to [2, 2] and so on. Adopting the vector representation allows us to apply vector algebra operations within the respective two-dimensional space. By using the metrics defined on this vector space, we can calculate distances and vector similarities in the same way we would for a mathematical vector.
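As a minimal sketch of what ‘applying vector algebra’ means here, the snippet below subtracts the two word vectors quoted in the text and measures the length of the result. Only the ‘king’ and ‘queen’ coordinates come from the article; the helper names `subtract` and `magnitude` are illustrative.

```python
import math

# Coordinates taken from the simplified two-dimensional depiction in the text.
king = [1.0, 1.0]
queen = [2.0, 2.0]

def subtract(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

offset = subtract(queen, king)  # displacement from 'king' to 'queen'
distance = magnitude(offset)    # Euclidean distance between the two words

print(offset)    # [1.0, 1.0]
print(distance)  # 1.4142135623730951
```

The same two helpers work unchanged for vectors with hundreds of dimensions, which is the case for real embeddings.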
More generally, text embedding creates vector representations of words, phrases or sentences based on their semantic and syntactic relationship to a larger language corpus. Some examples of methods to create text embeddings are Bag-of-Words (BOW), TF-IDF (term frequency – inverse document frequency), word embeddings (word2vec, GloVe) or pre-trained language models, such as BERT or GPT. The vector dimension depends on the type of embeddings we adopt. As an example, with GloVe, a frequently used vector dimension is 300.
For image data, embeddings can be created using convolutional neural networks (CNN). The intermediate layers of the CNN, when trained with the image data for which we create the embeddings, can be used to extract feature vectors (embeddings). This corresponds to a latent representation of the images. Another method to create embeddings for images is by using autoencoders, where the compressed input image representation, also known as the projection in the latent space, serves as the embedding.
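To make the two simplest methods concrete, here is a rough sketch of Bag-of-Words and TF-IDF in plain Python. The three-sentence corpus is hypothetical, tokenisation is a naive whitespace split, and the TF-IDF formula is the basic unsmoothed variant; production libraries usually apply smoothing and normalisation on top of this.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
corpus = [
    "the king rules the land",
    "the queen rules the land",
    "the man walks",
]

tokenised = [doc.split() for doc in corpus]
# The vocabulary fixes the vector dimension: one component per distinct word.
vocabulary = sorted({word for doc in tokenised for word in doc})

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def tf_idf(tokens):
    """Term frequency weighted by an unsmoothed inverse document frequency."""
    counts = Counter(tokens)
    n_docs = len(tokenised)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)
        df = sum(1 for doc in tokenised if word in doc)  # documents containing the word
        idf = math.log(n_docs / df)
        vector.append(tf * idf)
    return vector

bow_vectors = [bag_of_words(doc) for doc in tokenised]
tfidf_vectors = [tf_idf(doc) for doc in tokenised]

print(vocabulary)      # ['king', 'land', 'man', 'queen', 'rules', 'the', 'walks']
print(bow_vectors[0])  # [1, 1, 0, 0, 1, 2, 0]
```

Note that ‘the’ occurs in every document, so its TF-IDF weight is zero: the weighting suppresses words that do not help to tell documents apart. In this sketch the vector dimension equals the vocabulary size (7), whereas learned embeddings such as GloVe use a fixed dimension chosen in advance.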
Similarities between vectors are calculated using metrics such as cosine similarity, Euclidean distance or Jaccard distance. Cosine similarity, often the preferred method for text data, evaluates the similarity of two vectors by calculating the cosine of the angle between them in a multi-dimensional space. Two vectors that point in the same direction, including two identical vectors, have a cosine similarity of 1. Two orthogonal vectors, for example two vectors that share no non-zero components, have a cosine similarity of 0.
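The three metrics can be sketched in a few lines of plain Python. The function names are illustrative, and the sketch assumes non-zero vectors (for cosine similarity) and non-empty sets (for Jaccard distance), so it does not guard against division by zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two non-empty collections, treated as sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

# 'king' and 'queen' from the depiction above point in the same direction:
# the cosine similarity is 1 (up to floating-point rounding), although the
# Euclidean distance between them is not 0.
same_direction = cosine_similarity([1, 1], [2, 2])
apart = euclidean_distance([1, 1], [2, 2])

# Orthogonal vectors score 0, even when all of their components are non-zero.
print(cosine_similarity([1, 0], [0, 1]))   # 0.0
print(cosine_similarity([1, 1], [1, -1]))  # 0.0
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is popular for text, where document length should not dominate the comparison. Jaccard distance operates on sets rather than coordinates, for example on the sets of tokens in two documents.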
The value of vector databases
Each piece of data we want to store in a vector database is represented as a high-dimensional vector encoded with one of the methods described above. The actual data storage can be on disk, in memory or hybrid. The search is based on the similarity of the stored vectors to a given query vector. For a vector database to be efficient, it is not enough to create a data representation using a vector approach. We also need to index the vectors to reduce the computational effort when searching, for which approximate nearest neighbour (ANN) algorithms are used. Rather than computing the similarity between the query and every stored vector, we use one of the ANN implementations to reduce the overall computational cost of the search.
What all ANN algorithms have in common is that they preprocess the data to create an index that accelerates the search for nearest neighbours, eliminating the need to compare a query with all vectors in the database. Some of the most frequently used ANN algorithms are k-d trees, locality-sensitive hashing (LSH), Hierarchical Navigable Small World (HNSW) and Approximate Nearest Neighbors Oh Yeah (Annoy). A k-d tree is a binary tree that recursively partitions a k-dimensional space with axis-aligned hyperplanes. LSH hashes similar input items into the same ‘buckets’ with high probability; we then calculate the similarity only for vectors in the same bucket as the query. One particularity of this algorithm is that it maximises hash collisions for similar items, instead of minimising them. Annoy was developed at Spotify to search for points in a multi-dimensional space that are close to a given query point. It is a C++ library with bindings for Python. HNSW constructs a layered graph optimised to reduce the number of steps needed to traverse it between any pair of vertices.
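For reference, this is roughly what the unindexed baseline looks like: an exact search that scores the query against every stored vector. It is a sketch with hypothetical toy data and an illustrative function name (`exact_search`); its cost grows linearly with the number of stored vectors, which is what ANN indexing is meant to avoid.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_search(query, vectors, k=2):
    """Return the k most similar (id, score) pairs by scanning every stored vector."""
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    best = heapq.nlargest(k, scored)  # highest similarity first
    return [(vec_id, score) for score, vec_id in best]

# Hypothetical toy 'database' of four two-dimensional vectors.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}

results = exact_search([1.0, 0.05], vectors, k=2)
print([vec_id for vec_id, _ in results])  # ['a', 'b']
```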
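One common way to build such buckets for cosine similarity is random-hyperplane hashing, sketched below. This is an illustrative choice, not necessarily the scheme any particular database uses; the function names, the number of hyperplanes and the seed are all assumptions. Each hyperplane contributes one bit, depending on which side of it a vector lies, and vectors that share all bits land in the same bucket.

```python
import random
from collections import defaultdict

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random hyperplane normals; the seed is fixed only for repeatability."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def hash_vector(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side, else 0."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group vector ids by hash; similar vectors collide with high probability."""
    buckets = defaultdict(list)
    for vec_id, vec in vectors.items():
        buckets[hash_vector(vec, hyperplanes)].append(vec_id)
    return buckets

def candidates(query, buckets, hyperplanes):
    """Only the vectors in the query's bucket need a full similarity computation."""
    return buckets.get(hash_vector(query, hyperplanes), [])

# Hypothetical toy data: 'a' and 'b' point the same way, 'c' points the opposite way.
vectors = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],
    "c": [-1.0, -2.0, -3.0],
}
planes = make_hyperplanes(n_planes=4, dim=3)
index = build_index(vectors, planes)
found = candidates([3.0, 6.0, 9.0], index, planes)  # contains 'a' and 'b', not 'c'
```

Using more hyperplanes makes buckets smaller and searches faster, but increases the chance that a true neighbour falls into a different bucket. Practical implementations typically compensate by querying several independent hash tables.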
Current implementations of search algorithms in vector databases significantly reduce the overall cost of search by combining:
- Parallelism: taking advantage of the intrinsic parallelism of the similarity comparison,
- Data reduction: leveraging data space partitioning using the indexing algorithm,
- Pruning: discarding large portions of the dataset, focusing only on those regions that are most likely to contain the nearest neighbours,
- Approximation: prioritising speed over accuracy, terminating the search once a ‘good enough’ result has been found.
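A rough sketch of how partitioning, pruning and approximation fit together is a partition-and-probe search, in the spirit of inverted-file indexes. Everything here is an assumption made for illustration: the centroids are fixed by hand (real systems typically learn them, for example with k-means), and the names `build_partitions`, `approximate_search` and `n_probe` are hypothetical.

```python
import math

def nearest_centroids(vec, centroids, n):
    """Indices of the n centroids closest to vec (Euclidean distance)."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
    return order[:n]

def build_partitions(vectors, centroids):
    """Data reduction: assign every vector to the partition of its closest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        partitions[nearest_centroids(vec, centroids, 1)[0]].append(vec_id)
    return partitions

def approximate_search(query, vectors, centroids, partitions, k=1, n_probe=1):
    """Pruning: scan only the n_probe partitions nearest to the query."""
    candidate_ids = [
        vec_id
        for part in nearest_centroids(query, centroids, n_probe)
        for vec_id in partitions[part]
    ]
    candidate_ids.sort(key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return candidate_ids[:k]

# Hypothetical toy data with two hand-picked centroids.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {
    "a": [1.0, 1.0],
    "b": [0.5, 2.0],
    "c": [9.0, 9.5],
    "d": [11.0, 10.0],
}
partitions = build_partitions(vectors, centroids)
print(partitions)  # {0: ['a', 'b'], 1: ['c', 'd']}
print(approximate_search([8.0, 9.0], vectors, centroids, partitions))  # ['c']
```

With `n_probe=1` only half of the toy dataset is scanned. The result is approximate because a true neighbour sitting just across a partition boundary would be missed; raising `n_probe` trades speed for accuracy. The per-partition scans are independent of each other, which is where parallelism could be applied.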
Initially, vector databases found traction in applications such as rankings, recommendation systems, semantic search, and similarity searches across very large databases containing images, sound, video or text. With the rising public interest in the use of large language models (LLMs), it became obvious that vector databases can also be used for long-term memory representation for LLMs. By combining the LangChain framework, vector databases and LLMs, we can create applications in which agents are given a specific task, for example finding a certain piece of information in a local collection of documents based on a request formulated in natural language.
The rapidly rising interest in vector databases is underscored by substantial investments in various start-ups. Pinecone, which was founded in 2019, witnessed a $100 million investment round in April 2023, shortly after the announcement of GPT-4, elevating its valuation to $750 million. Qdrant, which offers an API service for nearest vector search, also secured a $10 million investment in 2023. Weaviate, one of the fastest-growing vector database start-ups, received a $50 million investment in April 2023 as well. Furthermore, established companies and long-standing organisations now also offer vector database services. Examples are Elastic’s vector search and Google Cloud’s Vertex AI Matching Engine.
As vector databases continue to gain momentum, their still untapped potential remains a focal point of interest. The anticipated application possibilities encompass a wide range of sectors and problem types, including healthcare, genetics, drug discovery, e-commerce, recommendation systems, financial services, content creation, virtual and augmented reality… The list could go on, which shows that vector databases promise to add substantial value in the coming years.
Neeraj Agarwal, The Ultimate Guide To Different Word Embedding Techniques in NLP, KDNuggets, https://www.kdnuggets.com/2021/11/guide-word-embedding-techniques-nlp.html (accessed on 16.08.2023)
Ajay Halthor, Word Embeddings, Explained, Towards Data Science, https://towardsdatascience.com/word-embeddings-explained-c07c5ea44d64 (accessed on 16.08.2023)
Mayank Mishra, Convolutional Neural Networks, Explained, Towards Data Science, https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939 (accessed on 16.08.2023)
Maarten Grootendorst, 9 Distance Measures in Data Science, Towards Data Science, https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa (accessed on 16.08.2023)
Labelbox, How vector similarity search works, https://labelbox.com/blog/how-vector-similarity-search-works/ (accessed on 16.08.2023)
Principal Data Scientist
Gabriel has a PhD in computational electromagnetics and started his career in academic and private research. He co-founded two technology start-ups and has worked in software development for 15+ years. Currently, Gabriel is a Principal Data Scientist at Endava, working for a range of industries and writing about advanced data analytics, geospatial analysis, natural language processing (NLP), anomaly detection, MLOps and generative AI. He is a high-profile contributor in the world of competitive machine learning and currently one of the few triple Kaggle Grandmasters. Outside of data science and machine learning, Gabriel enjoys hiking, climbing and reading.