Artificial Intelligence | Gabriel Preda |
07 November 2023

The recent development and commoditization of large language models (LLMs) have opened numerous ways to apply them to a variety of tasks. From domain-specific chatbots to personal assistants to AI Agents for process automation, a plethora of business problems can now be approached to provide solutions with high accuracy and impressive productivity. There is indeed a true explosion of creativity in regard to leveraging the amazing syntactic prowess of LLMs. Looking in retrospect over recent times, there have been three main approaches, which we will review below.

Usage options for large language models

Prompt the model (prompt engineering)

The simplest approach to using a large language model is by directly prompting it [1]. This can be done directly through the chat interface where available (like with ChatGPT), through an API (like with GPT-3 or GPT-4) or by using a query pipeline with a locally deployed model. In all these cases, the quality of the response will highly depend on the users’ ability to formulate their prompts in a way that will guide the LLM to address exactly the envisioned task.

There is now an entire genre of literature aiming to educate potential users on how to craft perfectly adapted prompts for a specific task. We even talk about a new technical discipline: prompt engineering [1]. By chaining several prompts, we can model complex, multi-step tasks and create powerful applications. Prompt engineering offers easy implementation and domain adaptability but exposes users to LLM issues such as missing new or non-public information, low answer accuracy and plausible-sounding yet partly fabricated responses, known as hallucinations.
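The prompt chaining mentioned above can be sketched in a few lines. This is a minimal illustration, not a production pattern: the templates, the `run_chain` helper and the `echo_llm` stand-in are all hypothetical names introduced here, and a real application would replace `echo_llm` with a call to a chat API or a local model.

```python
from typing import Callable

# Hypothetical stand-in type for any LLM call (chat API or local model).
LLM = Callable[[str], str]

# Illustrative prompt templates; the wording is an assumption for this example.
SUMMARY_TEMPLATE = "Summarise the following text in one sentence:\n{text}"
TRANSLATE_TEMPLATE = "Translate the following sentence into French:\n{text}"


def run_chain(llm: LLM, templates: list[str], text: str) -> str:
    """Run the templates in order, feeding each answer into the next prompt."""
    for template in templates:
        prompt = template.format(text=text)
        text = llm(prompt)
    return text


def echo_llm(prompt: str) -> str:
    """Toy model so the example runs offline: upper-cases the last prompt line."""
    return prompt.splitlines()[-1].upper()


result = run_chain(
    echo_llm,
    [SUMMARY_TEMPLATE, TRANSLATE_TEMPLATE],
    "rag combines retrieval and generation",
)
print(result)
```

Each step only sees the output of the previous one, which is why a weak answer early in the chain propagates to the final result.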

Fine-tune the model

To increase the accuracy of the model and reduce hallucinations, we can fine-tune the model on our own data [2]. Fine-tuning is generally performed for models trained for natural language processing (NLP) tasks that are based on the Transformer architecture, which includes the foundation LLMs of companies like OpenAI, Google or Meta. The aim is to adapt the foundation models to a downstream NLP task (specialise them for a certain task, such as translation, summarising or answering questions) or to make them more accurate when queried on domain-specific information.

Fine-tuning has the advantage that it reduces hallucinations (not entirely since the model still has information from a variety of sources besides the newly added information, but considerably) and improves the precision regarding the specific task or answering queries about the new corpus of information. The disadvantages are related to the considerable effort to label the new data, the huge computational resources needed for the training itself and the need to regularly repeat the process to keep the model up to date with the latest information.

Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) combines two main components to retain the advantages of the two previously described methods while reducing the drawbacks. The components are a retriever and a generator [3]. The retriever can be described as a system that is able to encode our data so that we can easily retrieve its relevant parts with our queries. We then compose the initial query with the context information retrieved and feed the generator. The answer of the generator is then returned as the final response.
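To make the division of labour between retriever and generator concrete, here is a toy sketch of the flow described above. It assumes a deliberately naive retriever (word overlap instead of embeddings) and takes the generator as any callable; the function names `retrieve`, `compose_prompt` and `rag_answer` and the prompt wording are hypothetical.

```python
from typing import Callable


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep the best matches and drop documents with no overlap at all.
    return [doc for score, doc in scored[:top_k] if score > 0]


def compose_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context with the initial query."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}"
    )


def rag_answer(query: str, documents: list[str], generator: Callable[[str], str]) -> str:
    """Retrieve, compose the prompt, then let the generator produce the answer."""
    context = retrieve(query, documents)
    return generator(compose_prompt(query, context))


docs = [
    "ChromaDB is a vector database",
    "Llama 2 is a large language model",
    "Hiking is fun",
]
# The identity function stands in for an LLM so we can inspect the final prompt.
print(rag_answer("what is a vector database", docs, generator=lambda prompt: prompt))
```

Swapping the toy retriever for a vector database and the identity function for an LLM gives the architecture discussed in the rest of the article.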

The retriever part is normally implemented using vector databases [4]. For that, we first have to select the information that we want to store in the vector database for future retrieval, based on our queries. If the information is not already in text format but is instead video, sound, image, PDF, email, etc., we will have to convert it. Then, we split the resulting text documents into chunks of a suitable size so that we get meaningful context when querying the database. The chunks are then encoded with a text embedding model, and special indexes are used for fast queries. Some options for vector databases are FAISS, ChromaDB, Weaviate or Pinecone [5].
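As a rough illustration of what a vector database does at query time, the sketch below stores embedding vectors with their text and returns the entries most similar to a query vector. It is a brute-force toy under stated assumptions: the class name `TinyVectorIndex` is hypothetical, the vectors are hand-written rather than produced by an embedding model, and real products rely on specialised (often approximate) indexes instead of scanning every entry.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorIndex:
    """Hypothetical in-memory index: a list of (vector, text) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((vector, text))

    def query(self, vector: list[float], top_k: int = 1) -> list[str]:
        # Brute-force scan: score every stored vector against the query.
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:top_k]]


index = TinyVectorIndex()
index.add([1.0, 0.0], "chunk about cats")
index.add([0.0, 1.0], "chunk about dogs")
print(index.query([0.9, 0.1]))
```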

In the following diagram, the initial data transformation into text format, chunking, encoding and indexing is represented as step 1. Some of the possible original data formats are also represented, including various text documents, video, audio and emails.

Illustration of RAG process and components

The result of querying the vector database will be a collection of documents that match the query (step 2 in the diagram). These form the context that is added to the query (step 3) to compose the prompt (step 4) that is used with the second component, the generator, implemented using an LLM (such as OpenAI GPT-3.5 Turbo or Llama 2). The end-to-end process can easily be orchestrated using one of the task-chaining frameworks, such as LangChain or LlamaIndex [4].

RAG retains much of the accuracy provided by fine-tuning while avoiding the high costs of labelling the data and of the computing resources for training. It also reduces hallucinations, since the LLM is instructed to answer queries only from the context extracted from the vector database. However, it still requires periodically indexing new data in the vector database. The main issue with RAG systems is that the relevance of the context depends on how accurately the query is matched with the documents in the vector database. Various techniques can improve the accuracy of RAG systems, which we will go into in the next section.

We can also combine the methods described above. For example, we can improve the efficiency of RAG by fine-tuning the LLM used as generator. The relevance of the context extracted from the vector database can be improved by carefully crafting the query using prompt engineering techniques.

Having reviewed the technical options, we have chosen RAG as the most promising option for a search system we want to develop. In the next section, we will examine how to implement such a system.

Build a Retrieval Augmented Generation system

In this section, we will detail the implementation of an RAG system, for which there are multiple options available. We opt for the combination of a vector database to index the information, a task-chaining framework to orchestrate the whole process and an LLM for generating the answer after being prompted with a combination of the initial query and the context [4].

The information we store in the vector database is first ingested, staged and transformed into text format. For some source formats, we will also need to implement format conversion. For example, videos would be transcribed with one of the available solutions for video-to-text extraction, such as OpenAI Whisper. For web pages, we can use Beautiful Soup to extract text-only content from HTML. PDF-to-text and image-to-text solutions, like Google Tesseract OCR, can cover additional formats.
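As an illustration of the HTML-to-text step, the sketch below extracts visible text from a page. Note the substitution: to stay dependency-free it uses Python's standard `html.parser` module rather than Beautiful Soup, which the text above names. The `TextExtractor` class and `html_to_text` helper are hypothetical names for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(page))
```

Whatever tool performs the conversion, it is worth storing the source URL or file name next to the extracted text so that answers can later link back to the original document.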

Once all the data is transformed into text format, while keeping the source information, the next step is to apply a chunking procedure to split the data into partially overlapping chunks of a predefined size. We do this so that, upon querying, we can extract only the relevant context from larger texts in the retrieval process. The partial overlap between chunks helps to ensure that we do not lose important context at chunk boundaries. The chunking can be done with a custom implementation for full control or with one of the ready-to-use LangChain classes dedicated to this operation.
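A custom chunking routine of the kind mentioned above could look like the following minimal sketch. It assumes purely character-based chunks, and `chunk_text` is a hypothetical helper; the LangChain splitters additionally try to break on separators such as paragraphs and sentences, which this sketch does not attempt.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With a chunk size of 4 and an overlap of 1, each chunk repeats the last character of the previous one, so a phrase cut at a boundary still appears whole in at least one chunk as long as it is no longer than the overlap.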

After all documents are converted into text, conveniently chunked and stored in the new form (we can use either CSV or Parquet format, depending on the data size), we are ready to apply the embedding. Various embedding approaches are available, from simple TF-IDF text embeddings to a spaCy embedding model to HuggingFace transformers to GPT. One popular and widely used option for text embeddings is the sentence transformer from HuggingFace, due to its ready availability and effectiveness.
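To show what the simplest of these options, a TF-IDF embedding, computes, here is a small pure-Python sketch. It assumes whitespace tokenisation and the plain variant tf × log(N / df); libraries such as scikit-learn use smoothed and normalised variants, so their numbers differ. `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(chunks: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return the sorted vocabulary and one TF-IDF vector per chunk."""
    tokenized = [chunk.lower().split() for chunk in chunks]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many chunks does each token appear?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty chunks
        vectors.append([counts[token] / total * idf[token] for token in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["vector database", "vector search"])
print(vocabulary)
print(vectors)
```

A word that appears in every chunk ("vector" here) gets weight zero, which is exactly why TF-IDF highlights the terms that distinguish one chunk from another. Unlike sentence transformer embeddings, however, it captures no semantic similarity between different words.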

There are also multiple options for vector databases. Popular choices include the easy-to-use FAISS, ChromaDB and Milvus, all of which allow for in-memory and persistent storage and have a user-friendly interface. Additional options are Weaviate, Pinecone and Redis. Cloud providers offer integration with some of these products as well as their own solutions. In our example, we opted for ChromaDB and sentence transformer embeddings. The Python code to add documents to the database is shown below and is an excerpt from the full code available as a Notebook on Kaggle [6].

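# import paths as of the LangChain releases current when this article was written
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# `documents` is the list of LangChain documents loaded earlier (see [6])
# define the LangChain text splitter with chunk size and chunk overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
# apply the text splitter to all documents
all_splits = text_splitter.split_documents(documents)

# initialize the model used for embeddings
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}  # use "cpu" if no GPU is available

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# add the chunked documents to the ChromaDB database
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")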

In our approach, we use one of the most common LLM options, the Llama 2 model. It is available for direct download through HuggingFace, after Meta’s approval, and we can also use it on Kaggle. To run it with a lower memory footprint or even on CPU, that is, with lower computational resources, one common technique is to quantize the model. Quantization is a model compression technique that converts the model weights to a lower precision (for example, 4 bits) while retaining most of the accuracy. After quantizing the model, we can create a querying pipeline including the model. One option could be the HuggingFace pipeline. The code snippet below shows a simplified implementation available on Kaggle [6].

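import transformers
from torch import cuda, bfloat16
from transformers import AutoTokenizer
from langchain.llms import HuggingFacePipeline

# path to Kaggle model (Llama 2, 7b chat version from HuggingFace) on Kaggle environment
model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'

# set the device
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# NOTE: the call arguments are elided in this excerpt; the values shown below
# are illustrative assumptions, see [6] for the full code

# set quantization configuration to load large model with less GPU memory
# this requires the 'bitsandbytes' library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=bfloat16,
)

# prepare the model configuration
model_config = transformers.AutoConfig.from_pretrained(model_id)
# initialize the model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
# initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# create a query pipeline with the model and tokenizer initialized before
query_pipeline = transformers.pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
)

# create a HuggingFace pipeline
llm = HuggingFacePipeline(pipeline=query_pipeline)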
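To give an intuition for what quantization does to the weights, the following sketch applies plain symmetric "absmax" quantization to a handful of numbers. It is an illustration only: the 'bitsandbytes' library uses more elaborate block-wise schemes, and the helper names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    max_level = 2 ** (bits - 1) - 1  # 7 for 4 bits
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / max_level if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [1.4, -0.6, 0.2, 0.0]
quantized, scale = quantize_absmax(weights, bits=4)
restored = dequantize(quantized, scale)
print(quantized)
print(restored)
```

Each weight is stored as a small integer plus one shared scale factor, so memory use drops sharply, while the reconstruction error per weight is bounded by half the scale. That bounded error is the reason accuracy is mostly, though not perfectly, preserved.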

The end-to-end process (query the vector database to retrieve the context, compose the prompt from the initial query and the retrieved context, and prompt the LLM) can be implemented using the RetrievalQA chain from LangChain. The output of the chain is the answer provided by the model. We can also add the retrieved documents to the response for more context, as well as the links to the original document sources.

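from langchain.chains import RetrievalQA

# set the retriever for the RetrievalQA chain: the vector DB initialized before is used as retriever
retriever = vectordb.as_retriever()

# initialize the RetrievalQA chain
# NOTE: the arguments are elided in this excerpt; the values shown are
# illustrative assumptions, see [6] for the full code
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)

# run a query (`query` is the user's question as a string)
result = qa.run(query)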

The implementation described above (full code available in [6]) includes all the elements of a RAG system. Each of the key building blocks can be modified by using different tools and products.

There are some limitations to the RAG system, which we highlighted in the previous section. One is that a query might not capture all the targeted context due to the limitations of the similarity search, which relies only on the comparison metric implemented for the embeddings in the vector database. Elaborate approaches have been developed to overcome this limitation, such as RAG combined with Reciprocal Rank Fusion and Generated Queries [7]. In this approach, an LLM generates several complementary queries from the original one, each query is run against the vector database, and the partial result lists are merged with Reciprocal Rank Fusion to capture the full context.
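The fusion step can be sketched as follows. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over all result lists in which it appears; k = 60 is the commonly used default. The function name and the document identifiers below are hypothetical, and the example assumes the per-query rankings have already been obtained from the vector database.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# One ranked result list per generated query (hypothetical document ids).
rankings = [
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["b", "a", "d"],
]
print(reciprocal_rank_fusion(rankings))
```

Documents that rank well across several of the generated queries rise to the top, even if no single query placed them first, which is how the approach recovers context that one similarity search alone might miss.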


We reviewed the three main approaches for leveraging LLMs in our business applications: prompt engineering, fine-tuning and Retrieval Augmented Generation (RAG), comparing the pros and cons of each. We concluded that RAG combines the advantages of both prompt engineering and fine-tuning while avoiding the main drawbacks of the two alternative methods. We then introduced a basic implementation of a RAG system, using Llama 2 as the LLM, ChromaDB as the vector database and LangChain as the task-chaining framework. The Python code, model and data used are available on Kaggle [6].


[1] Fareed Khan, Prompt Engineering Complete Guide, Medium (accessed Oct 2023), https://medium.com/@fareedkhandev/prompt-engineering-complete-guide-2968776f0431

[2] Shawhin Talebi, Fine-Tuning Large Language Models (LLMs), A conceptual overview with example Python code, Towards Data Science (accessed Oct 2023), https://towardsdatascience.com/fine-tuning-large-language-models-llms-23473d763b91

[3] Patrick Lewis, Ethan Perez, et al, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, ArXiv (accessed Oct 2023), https://browse.arxiv.org/pdf/2005.11401.pdf

[4] Murtuza Kazmi, Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data, Medium (accessed Oct 2023), https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476

[5] Gabriel Preda, The Rise of Vector Databases, Endava Engineering Blog (accessed Oct 2023), endava.com/en/blog/engineering/2023/the-rise-of-vector-databases

[6] Gabriel Preda, RAG using Llama 2, Langchain and ChromaDB, Kaggle Notebooks (accessed Oct 2023), https://www.kaggle.com/code/gpreda/rag-using-llama-2-langchain-and-chromadb

[7] Adrian H. Raudaschl, Forget RAG, the Future is RAG-Fusion, The Next Frontier of Search: Retrieval Augmented Generation meets Reciprocal Rank Fusion and Generated Queries, Towards Data Science (accessed Oct 2023), https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1

Gabriel Preda

Principal Data Scientist

Gabriel has a PhD in computational electromagnetics and started his career in academic and private research. He co-founded two technology start-ups and has worked in software development for 15+ years. Currently, Gabriel is a Principal Data Scientist at Endava, working for a range of industries and writing about advanced data analytics, geospatial analysis, natural language processing (NLP), anomaly detection, MLOps and generative AI. He is a high-profile contributor in the world of competitive machine learning and currently one of the few triple Kaggle Grandmasters. Outside of data science and machine learning, Gabriel enjoys hiking, climbing and reading.


