Artificial Intelligence
Gabriel Preda
07 November 2023
The recent development and commoditization of large language models (LLMs) have opened numerous ways to apply them to a variety of tasks. From domain-specific chatbots to personal assistants to AI agents for process automation, a wide range of business problems can now be tackled with high accuracy and impressive productivity. There has been a genuine explosion of creativity in leveraging the remarkable linguistic prowess of LLMs. Looking back over recent developments, three main approaches have emerged, which we will review below.
Usage options for large language models
Prompt the model (prompt engineering)
The simplest approach to using a large language model is to prompt it directly [1]. This can be done through the chat interface where available (as with ChatGPT), through an API (as with GPT-3 or GPT-4) or by using a query pipeline with a locally deployed model. In all these cases, the quality of the response depends heavily on the user's ability to formulate prompts that guide the LLM to address exactly the intended task.
There is now an entire genre of literature aiming to educate potential users on how to craft prompts that are well adapted to a specific task. We even talk about a new technical discipline: prompt engineering [1]. By chaining several prompts, we can model complex, multi-step tasks and create powerful applications. Prompt engineering offers easy implementation and domain adaptability, but it exposes users to the inherent limitations of LLMs: missing recent or non-public information, low answer accuracy and plausible-sounding yet partly fabricated responses, known as hallucinations.
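As a minimal illustration of prompt chaining, the sketch below first asks an LLM to extract the key facts from a document and then feeds them into a second, task-specific prompt. It assumes the openai Python package (pre-1.0 interface) and an API key in the environment; the prompts, model name and document are placeholders rather than a recommendation.

import os
import openai

# the API key is read from the environment; never hard-code credentials
openai.api_key = os.environ["OPENAI_API_KEY"]

def ask(prompt: str) -> str:
    # send a single prompt to the chat model and return the text of the reply
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response["choices"][0]["message"]["content"]

document = "..."  # the source text to analyse (placeholder)

# step 1: extract the key facts from the document
facts = ask(f"List the key facts in the following text:\n{document}")

# step 2: chain the output of the first prompt into a second, task-specific prompt
summary = ask(f"Using only these facts, write a three-sentence executive summary:\n{facts}")
print(summary)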
Fine-tune the model
To increase the accuracy of the model and reduce hallucinations, we can fine-tune it with our own data [2]. Fine-tuning is typically applied to models trained for natural language processing (NLP) tasks that are based on the Transformer architecture, which includes the foundation LLMs from companies like OpenAI, Google or Meta. The aim is to adapt the foundation model to a downstream NLP task (specialising it for a certain task, such as translation, summarising or answering questions) or to make it more accurate when queried on domain-specific information.
Fine-tuning has the advantage that it reduces hallucinations considerably (though not entirely, since the model still retains information from a variety of sources besides the newly added data) and improves precision on the specific task or when answering queries about the new corpus of information. The disadvantages are the considerable effort needed to label the new data, the substantial computational resources required for training and the need to repeat the process regularly to keep the model up to date with the latest information.
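For concreteness, a parameter-efficient fine-tuning run might look like the sketch below. This is a minimal illustration, assuming the HuggingFace transformers, peft and datasets libraries, a Llama 2 base model and a plain-text training file; the model name, file name and hyperparameters are placeholders, not values from the referenced article.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# wrap the base model with low-rank adapters so only a small fraction of the weights is trained
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# tokenize the domain-specific corpus (train.txt is a placeholder file)
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-domain-ft", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()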
Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) combines two main components to retain the advantages of the two previously described methods while reducing their drawbacks: a retriever and a generator [3]. The retriever is a system that encodes our data so that the parts relevant to a query can be retrieved easily. We then combine the initial query with the retrieved context and feed the result to the generator. The generator's answer is returned as the final response.
The retriever part is normally implemented using a vector database [4]. For that, we first have to select the information we want to store in the vector database for future retrieval based on our queries. If the information is not already in text format but rather video, sound, image, PDF, email etc., we have to convert it. We then chunk the resulting text documents to an optimal size so that we get meaningful context when querying the information in the database. The text is then encoded using a text embedding model, and special indexes are used for fast queries. Some options for vector databases are FAISS, ChromaDB, Weaviate or Pinecone [5].
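Conceptually, the encode-and-retrieve step can be illustrated in a few lines, here using a sentence-transformers model directly and cosine similarity; the chunks and the query below are made up for illustration only.

from sentence_transformers import SentenceTransformer, util

# encode a few text chunks and a query with the same embedding model
model = SentenceTransformer("all-mpnet-base-v2")
chunks = ["Invoices are processed within 30 days.",
          "Refunds require a signed approval form.",
          "Our office is closed on public holidays."]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

query = "How long does invoice processing take?"
query_embedding = model.encode(query, convert_to_tensor=True)

# rank the chunks by cosine similarity to the query and keep the best match
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
best = scores.argmax().item()
print(chunks[best])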
In the following diagram, the initial data transformation into text format, chunking, encoding and indexing are represented as step 1. Some of the possible original data formats are also shown, including various text documents, video, audio and emails.
The result of querying the vector database is a collection of documents that match the query (step 2 in the diagram). These form the context that is added to the query (step 3) to compose the prompt (step 4) used with the second component, the generator, implemented using an LLM (such as OpenAI GPT-3.5 Turbo or Llama 2). The end-to-end process can easily be orchestrated using one of the task-chaining frameworks, such as Langchain or LlamaIndex [4].
RAG retains the accuracy gains of fine-tuning while avoiding the high costs of labelling the data and of the computing resources for training. It also greatly reduces hallucinations, since the LLM is instructed to answer queries only from the context extracted from the vector database. However, it still requires periodically indexing new data in the vector database. The main issue with RAG systems is that the relevance of the context depends on how accurately the query is matched against the documents in the vector database. Various techniques can improve the accuracy of RAG systems, which we will go into in the next section.
We can also combine the methods described above. For example, we can improve the efficiency of RAG by fine-tuning the LLM used as the generator, and the relevance of the context extracted from the vector database can be improved by carefully crafting the query using prompt engineering techniques.
Having reviewed the technical options, we have chosen RAG as the most promising approach for a search system we want to develop. In the next section, we will examine how to implement such a system.
Build a Retrieval Augmented Generation system
In this section, we will detail the implementation of a RAG system, for which there are multiple options available. We opt for the combination of a vector database to index the information, a task-chaining framework to orchestrate the whole process and an LLM for generating the answer after being prompted with a combination of the initial query and the retrieved context [4].
The information we store in the vector database is first ingested, staged and transformed into text format. For some source formats, we also need to implement format conversion. For example, videos can be transcribed with one of the available speech-to-text solutions, such as OpenAI Whisper. For web pages, we can use Beautiful Soup to extract text-only content from the HTML. PDF-to-text and image-to-text solutions, like Google Tesseract OCR, can cover additional formats.
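As an example of such a conversion step, extracting the visible text of a web page with Beautiful Soup could look like the sketch below; the URL is a placeholder and the exact cleaning rules will depend on the source pages.

import requests
from bs4 import BeautifulSoup

# fetch a web page and keep only its visible text content
url = "https://example.com/article"          # placeholder URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# drop script and style elements, then collapse the remaining text into one string
for tag in soup(["script", "style"]):
    tag.decompose()
text = " ".join(soup.get_text(separator=" ").split())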
Once all the data is transformed into text format, while keeping the source information, the next step is to apply a chunking procedure that splits the data into partially overlapping chunks of a predefined size. We do this so that, upon querying, we can extract only the relevant context from larger texts during retrieval. The partial overlap between chunks helps ensure that we do not lose important context when splitting the text. The chunking can be done with a custom implementation, for full control, or with one of Langchain's ready-to-use classes dedicated to this operation.
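A custom chunking implementation with partial overlap can be as simple as the sketch below; the chunk size and overlap values are placeholders that would normally be tuned to the embedding model and the data.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 20) -> list[str]:
    """Split text into partially overlapping chunks of roughly chunk_size characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # advance by chunk_size minus overlap so consecutive chunks share some context
        start += chunk_size - overlap
    return chunks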
After all documents are converted into text, conveniently chunked and stored in the new form – we can use either CSV or Parquet format, depending on the data size – we are ready to apply the embedding. Various embedding approaches are available, from simple TF-IDF text representations to spaCy embedding models to HuggingFace transformers to GPT. One popular option for text embeddings is the sentence transformer from HuggingFace, due to its ready availability and effectiveness.
There are also multiple options for vector databases. Popular choices include the easy-to-use FAISS, ChromaDB and Milvus, all of which allow for in-memory and persistent storage and have a user-friendly interface. Additional options are Weaviate, Pinecone and Redis. Cloud providers offer integrations with some of these products as well as their own solutions. In our example, we opted for ChromaDB and sentence transformer embeddings. The Python code to add documents to the database is shown below; it is an excerpt from the full code available as a Notebook on Kaggle [6].
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# define the Langchain text splitter with chunk size and chunk overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

# apply the text splitter to all documents
all_splits = text_splitter.split_documents(documents)

# initialize the model used for embeddings
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# add the chunked documents to the ChromaDB database
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")
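To verify that the index behaves as expected, we can run a quick similarity search directly against the database before wiring in the LLM; the query text below is just a placeholder.

# retrieve the chunks most similar to a test query (the query text is a placeholder)
query = "What are the main benefits of the proposed approach?"
docs = vectordb.similarity_search(query, k=4)
for doc in docs:
    # print the source (if recorded in the metadata) and the beginning of each matching chunk
    print(doc.metadata.get("source"), doc.page_content[:200])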
In our approach, we use one of the most common LLM options, the Llama 2 model. It is available for direct download through HuggingFace, after Meta's approval, and we can also use it on Kaggle. To run it with a lower memory footprint, or even on CPU, one common technique is to quantize the model. Quantization is a model compression technique that converts the model weights to lower precision (for example, 4-bit) while largely retaining accuracy. After quantizing the model, we can create a query pipeline around it, for instance using the HuggingFace pipeline API. The code snippet below shows a simplified implementation available on Kaggle [6].
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from langchain.llms import HuggingFacePipeline

# path to the Kaggle model (Llama 2, 7b chat version from HuggingFace) in the Kaggle environment
model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'

# set the device
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set the quantization configuration to load the large model with less GPU memory
# this requires the 'bitsandbytes' library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# prepare the model configuration
model_config = transformers.AutoConfig.from_pretrained(model_id)

# initialize the model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)

# initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# create a query pipeline with the model and tokenizer initialized above
query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

# wrap the HuggingFace pipeline as a Langchain LLM
llm = HuggingFacePipeline(pipeline=query_pipeline)
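Before connecting the model to the retriever, it is worth sanity-checking the wrapped pipeline with a standalone prompt; the prompt below is only illustrative.

# quick sanity check of the wrapped LLM outside the RAG chain (the prompt is a placeholder)
test_prompt = "Explain in one sentence what a vector database is."
print(llm(test_prompt))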
The end-to-end process to query the vector database, retrieve the context, compose the prompt from the initial query and prompt the LLM can be implemented using the RetrievalQA chain from Langchain. The output of the chain is the answer provided by the model. We can also add the retrieved documents to the response for more context, as well as links to the original document sources.
from langchain.chains import RetrievalQA

# use the vector database initialized before as the retriever for the RetrievalQA Langchain chain
retriever = vectordb.as_retriever()

# initialize the Langchain chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

# run a query
result = qa.run(query)
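To also return the supporting documents mentioned above, the chain can be configured to include them in its output; a possible variant is sketched below.

# variant that also returns the source documents used to build the answer
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
output = qa_with_sources({"query": query})
print(output["result"])
for doc in output["source_documents"]:
    # print the origin of each retrieved chunk, if recorded in the metadata
    print(doc.metadata.get("source"))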
The implementation described above (full code available in [6]) includes all the elements of a RAG system. Each of the key building blocks can be modified by using different tools and products.
There are some limitations to the RAG approach, which we highlighted in the previous section. One is that a query might not capture all the targeted context due to the limitations of the similarity search, which relies solely on the comparison metric implemented for the embeddings in the vector database. More elaborate approaches have been developed to overcome this limitation, such as RAG combined with Reciprocal Rank Fusion and generated queries [7]. In this approach, an LLM generates several complementary queries from the original one, each query is run against the vector database, and the partial result lists are merged using reciprocal rank fusion to capture the full context.
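As an illustration of the fusion step, reciprocal rank fusion scores each document by summing 1 / (k + rank) over the ranked lists in which it appears; the sketch below is a generic implementation of this scoring, not code from the referenced article.

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document ids into one list ordered by RRF score."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # documents appearing near the top of many lists accumulate the highest scores
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# example: result lists of three complementary queries against the vector database
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],
    ["doc1", "doc4", "doc3"],
    ["doc7", "doc1", "doc2"],
])
print(fused)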
Summary
We reviewed the three main approaches for leveraging LLMs in business applications: prompt engineering, fine-tuning and Retrieval Augmented Generation (RAG), comparing the pros and cons of each. We concluded that RAG combines the advantages of both prompt engineering and fine-tuning without retaining the main drawbacks of the two alternative methods. We then introduced a basic implementation of a RAG system, using Llama 2 as the LLM, ChromaDB as the vector database and Langchain as the task-chaining framework. The Python code, model and data used are available on Kaggle [6].
References
[1] Fareed Khan, Prompt Engineering Complete Guide, Medium (accessed Oct 2023), https://medium.com/@fareedkhandev/prompt-engineering-complete-guide-2968776f0431
[2] Shawhin Talebi, Fine-Tuning Large Language Models (LLMs), A conceptual overview with example Python code, Towards Data Science (accessed Oct 2023), https://towardsdatascience.com/fine-tuning-large-language-models-llms-23473d763b91
[3] Patrick Lewis, Ethan Perez, et al, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, ArXiv (accessed Oct 2023), https://browse.arxiv.org/pdf/2005.11401.pdf
[4] Murtuza Kazmi, Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data, Medium (accessed Oct 2023), https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476
[5] Gabriel Preda, The Rise of Vector Databases, Endava Engineering Blog (accessed Oct 2023), https://www.endava.com/en/blog/engineering/2023/the-rise-of-vector-databases
[6] Gabriel Preda, RAG using Llama 2, Langchain and ChromaDB, Kaggle Notebooks (accessed Oct 2023), https://www.kaggle.com/code/gpreda/rag-using-llama-2-langchain-and-chromadb
[7] Adrian H. Raudaschl, Forget RAG, the Future is RAG-Fusion, The Next Frontier of Search: Retrieval Augmented Generation meets Reciprocal Rank Fusion and Generated Queries, Towards Data Science (accessed Oct 2023), https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1
Gabriel Preda
Principal Data Scientist
Gabriel has a PhD in computational electromagnetics and started his career in academic and private research. He co-founded two technology start-ups and has worked in software development for 15+ years. Currently, Gabriel is a Principal Data Scientist at Endava, working for a range of industries and writing about advanced data analytics, geospatial analysis, natural language processing (NLP), anomaly detection, MLOps and generative AI. He is a high-profile contributor in the world of competitive machine learning and currently one of the few triple Kaggle Grandmasters. Outside of data science and machine learning, Gabriel enjoys hiking, climbing and reading.