The Modern AI Stack: Mastering Large Language Models with Prompt Engineering and Retrieval Techniques
Development in the world of Artificial Intelligence (AI) is moving at breakneck speed. The broader public’s interest in AI began to climb rapidly with the launch of ChatGPT, a Large Language Model (LLM) application released by OpenAI on 30 November 2022.
Large Language Models are a subset of machine learning models that can understand and output data in the form of human language. They grew out of rapid progress in NLP (Natural Language Processing), the field that helps AI models interpret and understand natural language.
Nowadays, we hear a lot about the term “Generative AI.” Generative AI means AI that can generate different forms of data, for example text, audio, and images. Generative AI models used to be unimodal, meaning they could only process one type of data (e.g. text) at a time; the original ChatGPT is an example. Now we’re seeing multimodal Generative AI models such as Gemini 1.5 by Google, which can handle multiple types of data (text, audio, video).
In this article, we’ll explore how Large Language Models can be used to power modern web applications, specifically on our server-side code.
The Modern Developer AI Stack
Here is the stack that we’ll be diving into with this article:
Foundation Models
LLM Frameworks: LangChain
Vector Databases
Prompt Engineering
Knowledge Retrieval
Let’s go!
Foundation Models
According to IBM Research, Foundation Models (FMs) are a new type of AI model trained on a “broad set of unlabeled data that can be used for different tasks, with minimal fine-tuning.” In the past, training AI models was expensive, time-consuming, and difficult because each model had to be developed for a very specific use case, so the training and fine-tuning had to be just as specific.
Today, however, developers worldwide can use a “general purpose model” through these Foundation Models. This reduces the development costs of new projects and shortens the idea-to-market timeline. One example of FMs is the GPT series of models developed by OpenAI: they can understand human language, find information from text (with some accuracy), debug code, and so on.
LLM Frameworks: LangChain
Foundation Models have lowered the barrier to entry and allowed many developers to enter the AI-based products market quickly. To speed up the development of such AI applications, developers can use LangChain, an open-source framework for developing LLM-powered applications (think of it as ReactJS for AI apps!). You can probably do just fine without LangChain, e.g. by writing your own libraries, but LangChain provides us with a wide array of tools that have been tested and evaluated by the open source community.
It helps developers simplify each phase of the application lifecycle, from development, through productionization, to deployment.
Vector Databases
[Non-technical primer]
Databases, in general terms, are organized collections of data. In a lot of cases, databases are stored electronically in computer systems. The software we use to manage databases is called a DBMS, or Database Management System. A DBMS supports storing and querying data in an electronic system, with lots (and I mean lots!) of extra features on top. To interact with a relational DBMS we usually use a standardized language named SQL (Structured Query Language).
These “Relational” Databases store data in a way that represents their relations via tables. A School DB, for example, might have a “Students” table and a “Teachers” table with a “Relation” of “One-to-many” between Teacher and Student.
While standard (“relational”) databases store data in a structured way (e.g. row-based in MySQL), vector databases store information in the form of vector embeddings (“embeddings”). Embeddings are vector representations of data that convey the original data’s semantic information. Simply put, embeddings are just arrays of numbers (binary values, integers, floating-point numbers, etc.). An example of an embedding is [-0.13, 0.43, -0.01, 0.07, ... ], which might carry the semantic meaning of “cat”.
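To make “semantic closeness” a bit more concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The vectors are made up for illustration (real embedding models output hundreds or thousands of dimensions), and the function name is my own.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional "embeddings", purely illustrative.
cat = [-0.13, 0.43, -0.01, 0.07]
kitten = [-0.11, 0.40, 0.02, 0.09]
invoice = [0.52, -0.20, 0.31, -0.44]

# Vectors pointing in a similar direction score closer to 1.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

A score near 1 means the two vectors point in almost the same direction, which is how “cat” and “kitten” can end up close together while “invoice” ends up far away.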
Now you might ask: how exactly do we generate these embeddings? Researchers have already figured it out: we use an embedding model! According to AWS, embedding models are “algorithms trained to encapsulate information into dense representations in a multi-dimensional space.” Some examples of embedding models are Principal Component Analysis, Singular Value Decomposition, Word2Vec, and BERT (you can google these terms for more detail, but be wary: lots of maths!).
Another key difference between vector databases and traditional relational databases is the way they query data (so they differ in both storing and querying!). In traditional databases such as MySQL or PostgreSQL, we query data by selecting rows that match our conditions (e.g. SELECT name, grade FROM students WHERE grade = 12;). In vector databases, we typically use an Approximate Nearest Neighbor (ANN) search algorithm, which finds the stored vector embeddings that are mathematically “closest” to the one we’re querying with. Under the hood, there are more things such as indexing, hashing, etc. but we won’t go into that detail :)
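For intuition, here is a small sketch of the exact (brute-force) nearest neighbor search that ANN algorithms approximate. It compares the query against every stored vector, which is fine for a handful of items but too slow for millions, hence the approximate indexes used by real vector databases. The store contents and the function name are assumptions made for this example.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def nearest_neighbors(
    query: Sequence[float],
    store: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k stored items closest to `query` by Euclidean distance."""
    scored = ((math.dist(query, vector), key) for key, vector in store.items())
    return [(key, distance) for distance, key in heapq.nsmallest(k, scored)]


# A toy "vector database": keys mapped to made-up embeddings.
store = {
    "cat": [-0.13, 0.43, -0.01, 0.07],
    "kitten": [-0.11, 0.40, 0.02, 0.09],
    "invoice": [0.52, -0.20, 0.31, -0.44],
}

results = nearest_neighbors([-0.12, 0.42, 0.0, 0.08], store, k=2)
print([key for key, _ in results])
```

An ANN index trades a little accuracy for speed: it may occasionally miss the true closest vector, but it avoids scanning the whole store on every query.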
Prompt Engineering
Prompt engineering is a set of techniques used to get optimized LLM outputs with minimal post-processing of the generated result. In layman’s terms, it’s tactics to “get the best ChatGPT output by only giving it the best questions.” There are various prompt engineering techniques, but one of the most famous and easiest is “Few Shot Prompting.” This basically means giving the LLM some examples in advance, to give it the necessary context for solving your upcoming query. You can do this by giving it question-answer pairs, a definition, or some other information before entering your query. Touvron et al. 2023 observed that these “few shot” properties emerge as models are scaled to larger and larger parameter counts. Hence prompt engineering should work better on GPT-4 than on GPT-2, in theory.
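As an illustration of few-shot prompting, the sketch below assembles a prompt from a few worked examples followed by the real query. The task (sentiment classification), the example reviews, and the function name are all made up for this example; the resulting string is what you would send to the model.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Prepend worked input/output pairs to the query we actually care about."""
    lines = ["Classify the sentiment of each review as Positive or Negative.", ""]
    for review, sentiment in examples:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {sentiment}")
        lines.append("")
    # The final entry is left open so the model completes it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


examples = [
    ("I loved the camera quality.", "Positive"),
    ("The screen cracked within a week.", "Negative"),
]
prompt = build_few_shot_prompt(examples, "The battery died after two days.")
print(prompt)
```

The examples show the model both the task and the expected output format, which is often enough to get consistent answers without any fine-tuning.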
Knowledge Retrieval
Since LLMs are basically a “function of their training data”, they cannot reliably answer queries about information that is absent from their training data. We can overcome this limitation by pairing the raw brainpower of LLMs with data outside of their training set. This type of knowledge retrieval is usually done at inference time, in other words: whenever you ask a question to a chatbot/LLM.
RAG overall architecture, stackoverflow.com
One widespread technique for knowledge retrieval in LLMs is Retrieval Augmented Generation, or RAG. In RAG, we take an input (the user’s query), retrieve a set of relevant documents (often looked up via their embeddings!), append that context to the original prompt, and voila! Your LLM now has access to information outside of its training data.
RAG at inference-time, stackoverflow.com
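The flow above (query in, relevant documents retrieved, context appended to the prompt) can be sketched in a few lines. Note the assumptions: the retriever below ranks documents by shared words as a stand-in for a real embedding search, the documents are made-up sentences, and the final call to the LLM is left out.

```python
from typing import List


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_augmented_prompt(query: str, context_docs: List[str]) -> str:
    """Append the retrieved context to the user's original question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


documents = [
    "The launch party is held at the management center.",
    "Airtime on Radio East has been negotiated by Alison.",
    "The annual catalog is sent to booksellers in December.",
]
query = "Who negotiated airtime on Radio East?"
prompt = build_augmented_prompt(query, retrieve(query, documents, k=1))
print(prompt)
```

In a real pipeline the `retrieve` step would query a vector database, and the augmented prompt would then be passed to the LLM.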
What We’re Going to Build
We’re going to build a Quiz API service that receives only 2 things:
note_id, which is an ID for a digital note, a structure containing hypertext (think of a Notion page)
question_count, which is an integer from the set {3, 5, 10} that sets the number of questions we want generated
We need to combine the external information of the Note object (mapped via note_id), the scope limitation of question_count, and the capabilities of Large Language Models (LLMs) to build multiple-choice Quiz question sets.
A Quiz QuestionAnswer Pair
We want to generate a QuestionSet, which is an array of these QuestionAnswer pair objects. It could contain 3, 5, or 10 of these items. Our solution will be divided into two parts: generating the right questions (via LLMs), and generating the right answers to those questions (via LLMs also!).
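To picture the target data shape, here is a rough sketch using dataclasses. The class and field names are hypothetical (the real service uses its own schema classes); only the idea of one correct answer plus incorrect ones, and the allowed counts of 3, 5, or 10, come from the description above.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_QUESTION_COUNTS = {3, 5, 10}


@dataclass
class QuestionAnswer:
    """One multiple-choice question with its correct and incorrect options."""
    question: str
    answer_correct: str
    answers_incorrect: List[str]


@dataclass
class QuestionSet:
    """A quiz generated from one note."""
    note_id: str
    items: List[QuestionAnswer]

    def __post_init__(self) -> None:
        # Reject sets whose size is not one of the supported question counts.
        if len(self.items) not in ALLOWED_QUESTION_COUNTS:
            raise ValueError(
                f"expected {sorted(ALLOWED_QUESTION_COUNTS)} questions, got {len(self.items)}"
            )


sample = QuestionAnswer(
    question="Where is airtime being negotiated?",
    answer_correct="Radio East",
    answers_incorrect=["A TV network", "The annual catalog", "The management center"],
)
quiz = QuestionSet(note_id="example-note", items=[sample, sample, sample])
print(len(quiz.items))
```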
Building the Solution
1: Import dependencies
import uuid
from typing import List, Optional
from fastapi import (
    Request,
)
from sqlalchemy.orm import Session
from bson.objectid import ObjectId
from langchain_openai import (
    ChatOpenAI,
    OpenAIEmbeddings,
)
from langchain.prompts import PromptTemplate
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.retrieval_qa.base import (
    BaseRetrievalQA,
    RetrievalQA,
)
from langchain.chains.summarize import load_summarize_chain
from src.models.users import User
from src.schemas.note import (
    NoteSchema,
)
from src.schemas.qna import (
    QNAAnswerSchema,
    QNAQuestionSchema,
    QNAQuestionSetSchema,
)
from src.utils.settings import (
    OPENAI_MODEL_NAME,
    OPENAI_API_KEY,
)
from src.utils.time import get_datetime_now_jkt
2: Instantiate a class-based Service object
You can add class fields such as the model temperature for each question / answer model. A higher temperature means a more random answer from the model.
class QNAService:
    MODEL_TEMPERATURE_QUESTION = 0.6
    MODEL_TEMPERATURE_ANSWER = 0.3
    # ...
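To see why a higher temperature gives more random output, here is a small sketch of the textbook softmax-with-temperature formula. The scores are made up, and how a hosted model applies the setting internally is an assumption on my part; the point is only the general effect of the parameter.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temperature in (0.3, 0.6, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability goes to the top-scoring token, so answers are more repeatable; at a higher temperature the probabilities flatten out, so sampling picks less likely tokens more often. That is the reasoning behind using a lower value for answers than for questions.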
3: Define the Create Quiz method
Here, we also prepared the LLM object for each step
def generate_qna_set(
    self,
    note: NoteSchema,
    question_count: int,
) -> dict:
    # PREPARE LANGUAGE MODELS
    LLM_QUESTION_GEN = ChatOpenAI(
        temperature=self.MODEL_TEMPERATURE_QUESTION,
        model=OPENAI_MODEL_NAME,
        api_key=OPENAI_API_KEY,
    )
    LLM_ANSWER_GEN = ChatOpenAI(
        temperature=self.MODEL_TEMPERATURE_ANSWER,
        model=OPENAI_MODEL_NAME,
        api_key=OPENAI_API_KEY,
    )
    # ...
4: Data Preprocessing
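In this step, we first flatten the digital Note content into a single formatted string, then recursively split the string into documents of a bounded size. Note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, not tokens. A token is the basic unit of data processed by LLMs; for English text, one token is roughly four characters, or about three-quarters of a word (there is no exact figure, since it depends on the model’s tokenizer).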
def generate_qna_set(...):
    # Initialize ChatOpenAI above
    # DATA PREPROCESSING AND PERSISTENCE
    note_documents_chunk = self.split_note_into_chunks(
        note=note,
    )

def split_note_into_chunks(
    self,
    note: NoteSchema,
):
    CHUNK_SIZE = 400
    CHUNK_OVERLAP = 60
    recursive_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        add_start_index=True,
    )
    # NOTE - the current method flattens a Note's contents into one long string
    note_flattened = self.flatten_note_contents(
        note=note,
    )
    # Generate chunk documents
    note_documents_chunk = recursive_splitter.create_documents(
        texts=[note_flattened],
    )
    return note_documents_chunk
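To show what the chunk size and chunk overlap settings do, here is a simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that one prefers to break on separators such as paragraphs and sentences); this sketch just slides a character window so the overlap is easy to see. The function name is my own.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which helps retrieval later on.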
5: Implement Retrieval Pipeline
First, we initialize a vector database instance using the open-source Chroma database. The great thing about Chroma is that it can work in-memory, so we don’t have to set up a database on a remote server just to store a handful of embeddings for one method’s lifetime!
Then we convert the vectorstore instance into a retriever interface using the as_retriever method. We pass in the search strategy to use (similarity search) and how many results Chroma should return from that search (I picked 5 here).
def generate_qna_set(...):
    # Above code
    # Use the OpenAI Embedding model and create Vector Store
    vectorstore = Chroma.from_documents(
        documents=note_documents_chunk,
        embedding=OpenAIEmbeddings(),
    )
    # Convert vectorstore into Retriever interface
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={
            "k": 5,
        },
    )
6: Question Generation
Now here’s the fun part: generating questions that are relevant to our Note contents. First, we initialize our custom prompts via get_q_base_prompt and get_q_refined_prompt. Second, we create a prebuilt summarization chain with load_summarize_chain; a chain basically combines multiple LLM calls in one sequence, and this one is optimized for summarization-style tasks. Next, we run the input note_documents_chunk through the new chain, question_gen_chain. Lastly, we split the generated questions by newline.
def generate_qna_set(...):
    # Above code
    # QUESTION GENERATION
    Q_BASE_PROMPT = self.get_q_base_prompt(question_count=question_count)
    Q_REFINED_PROMPT = self.get_q_refined_prompt(question_count=question_count)
    question_gen_chain = load_summarize_chain(
        llm=LLM_QUESTION_GEN,
        chain_type="refine",
        verbose=True,
        question_prompt=Q_BASE_PROMPT,
        refine_prompt=Q_REFINED_PROMPT,
    )
    generated_questions = question_gen_chain.run(note_documents_chunk)
    # Split by newline and skip blank lines so no empty "questions" slip through
    question_list = [q.strip() for q in generated_questions.split("\n") if q.strip()]
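Conceptually, a “refine” style chain drafts an answer from the first chunk and then asks the model to improve that draft with each following chunk. The sketch below mimics that loop with a fake LLM so it runs without an API key. The function names, the placeholder names in the prompt strings, and the fake model are all assumptions for illustration, not LangChain’s internals.

```python
from typing import Callable, List


def refine_over_chunks(
    chunks: List[str],
    llm: Callable[[str], str],
    base_prompt: str,
    refine_prompt: str,
) -> str:
    """Draft from the first chunk, then refine the draft once per remaining chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")
    draft = llm(base_prompt.format(text=chunks[0]))
    for chunk in chunks[1:]:
        draft = llm(refine_prompt.format(existing_answer=draft, text=chunk))
    return draft


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: records the prompt and returns a numbered draft."""
    calls.append(prompt)
    return f"draft v{len(calls)}"


result = refine_over_chunks(
    chunks=["chunk one", "chunk two", "chunk three"],
    llm=fake_llm,
    base_prompt="Write quiz questions about:\n{text}",
    refine_prompt="Existing questions:\n{existing_answer}\nImprove them using:\n{text}",
)
print(result, len(calls))
```

One consequence of this design is that the number of LLM calls grows with the number of chunks, so long notes cost more and take longer to process.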
7: Answer Generation
After determining the right questions to ask, we can now ask the LLM to figure out the right answers. First, we initialize our custom answer prompt with get_ans_prompt and create a RetrievalQA object, which is a “Chain for question-answering against an index,” according to LangChain’s API documentation. This RetrievalQA receives the set of arguments seen below and serves as our answer generation chain.
The remaining lines of code run the chain against each question generated earlier, via generate_answers_for_question.
def generate_qna_set(...):
    # Above code
    # ANSWER GENERATION
    ANS_GEN_PROMPT = self.get_ans_prompt()
    answer_gen_chain = RetrievalQA.from_chain_type(
        llm=LLM_ANSWER_GEN,
        chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={
            "prompt": ANS_GEN_PROMPT,
        }
    )
    # Run answer chain for each question
    result = {}
    for q_id in range(len(question_list)):
        question = question_list[q_id]
        answers = []
        for cnt in range(5):
            if len(answers) >= 4:
                break
            if cnt > 0:
                # If not the first iteration, then previous iteration resulted in < 4 answers
                print("Answer array has less than 4 items!")
            answers = self.generate_answers_for_question(
                answer_gen_chain=answer_gen_chain,
                question=question,
            )
        # Append to Result Dict
        result[str(q_id)] = {
            "question": question,
            # First line = correct answer, the rest = incorrect answers
            "answer_correct": answers[0],
            "answers_incorrect": answers[1:],
        }
        # Print values
        print(result[str(q_id)])
        print("--------------------------------------------------\n\n")
    return result
def generate_answers_for_question(
    self,
    answer_gen_chain: BaseRetrievalQA,
    question: str,
) -> List[str]:
    answers = answer_gen_chain.run(question)
    # One answer per line; drop blank lines so they are not counted as options
    answers = [line.strip() for line in answers.split("\n") if line.strip()]
    return answers
8: Done!
And we are done! The code in step 7 should output a dictionary containing the question/answer pairs of the quiz. The sample output of the API service should look something like this (after some modifications, e.g. adding a MongoDB _id and some generic class fields like created_at):
{
"id": "66274d6576b67a539exxx",
"owner_id": "af81b095-20ca-453a-8caf-x",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_count": 3,
"questions": [
{
"id": "x-8844-46ec-84ce-f654b2292a19",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"qna_set_id": "66274d6576b67a539e2xxx",
"question": "What promotional materials are planned for the Pocket dictionary main campaign?",
"answer_options": [
{
"id": "x-a5e2-4f68-af5b-e655e03b6f80",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-8844-46ec-84ce-f654b2292a19",
"content": "Free gifts like calendars, key rings, and possibly umbrellas are being considered for exhibitions",
"is_correct_answer": true
},
{
"id": "b6715add-ada4-47bc-978c-67c9bbeef641",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-8844-46ec-84ce-f654b2292a19",
"content": "Airtime on Radio East has been negotiated by Alison",
"is_correct_answer": false
},
{
"id": "x-af06-4643-822f-967291fcb65f",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-8844-46ec-84ce-f654b2292a19",
"content": "Visit to a TV network planned for Friday in relation to future titles",
"is_correct_answer": false
},
{
"id": "x-4d98-47d3-abf6-e61c6316fa60",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-8844-46ec-84ce-f654b2292a19",
"content": "Publicity material is listed in the annual catalog to be sent to booksellers in December",
"is_correct_answer": false
}
],
"answer_key": {
"id": "x-a5e2-4f68-af5b-e655e03b6f80",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-8844-46ec-84ce-f654b2292a19",
"content": "Free gifts like calendars, key rings, and possibly umbrellas are being considered for exhibitions",
"is_correct_answer": true
},
"question_score": 33.333,
"marked_irrelevant": false
},
{
"id": "x-aacd-491e-8768-0ff2a959965d",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"qna_set_id": "66274d6576b67a539e2f6xx",
"question": "Where is Alison negotiating airtime for the dictionary launch party?",
"answer_options": [
{
"id": "x-64b2-4946-906d-a9608ab27b26",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-aacd-491e-8768-0ff2a959965d",
"content": "Airtime on Radio East has been negotiated by Alison.",
"is_correct_answer": true
},
{
"id": "x-9fe8-45b0-a37f-8227facfe89f",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-aacd-491e-8768-0ff2a959965d",
"content": "Visit to a TV network planned for Friday in relation to future titles.",
"is_correct_answer": false
},
{
"id": "x-8487-4275-b762-5aae04ce487c",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-aacd-491e-8768-0ff2a959965d",
"content": "Publicity material is listed in the annual catalog to be sent to booksellers in December.",
"is_correct_answer": false
},
{
"id": "x-56c0-4619-9569-6aef541b3474",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-aacd-491e-8768-0ff2a959965d",
"content": "Bookseller mail shot scheduled for September.",
"is_correct_answer": false
}
],
"answer_key": {
"id": "x-64b2-4946-906d-a9608ab27b26",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-aacd-491e-8768-0ff2a959965d",
"content": "Airtime on Radio East has been negotiated by Alison.",
"is_correct_answer": true
},
"question_score": 33.333,
"marked_irrelevant": false
},
{
"id": "x-b3ae-47ca-b40a-bf229ec8d5a5",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"qna_set_id": "66274d6576b67a539e2xxxx",
"question": "What catering options are available at the management center for events?",
"answer_options": [
{
"id": "x-5c84-4740-a7ab-90982b1fcc57",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-b3ae-47ca-b40a-bf229ec8d5a5",
"content": "Good catering at the management center",
"is_correct_answer": true
},
{
"id": "x-c4d3-4c97-804f-1e6c562fb3b8",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-b3ae-47ca-b40a-bf229ec8d5a5",
"content": "Airtime on Radio East has been secured",
"is_correct_answer": false
},
{
"id": "x-5214-49b7-a712-ff5405cac211",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-b3ae-47ca-b40a-bf229ec8d5a5",
"content": "Visit to TV network planned for Friday",
"is_correct_answer": false
},
{
"id": "x-5efb-45cc-956e-ce93cec4039f",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-b3ae-47ca-b40a-bf229ec8d5a5",
"content": "Publicity material is listed in the annual catalog",
"is_correct_answer": false
}
],
"answer_key": {
"id": "x-5c84-4740-a7ab-90982b1fcc57",
"created_at": "2024-04-23T05:55:50.018000",
"updated_at": "2024-04-23T05:55:50.018000",
"is_deleted": false,
"question_id": "x-b3ae-47ca-b40a-bf229ec8d5a5",
"content": "Good catering at the management center",
"is_correct_answer": true
},
"question_score": 33.333,
"marked_irrelevant": false
}
]
}
Program Evaluation
So we managed to ship this feature on the 3rd sprint of our software engineering project. It is now the 4th sprint and I’d like to share some evaluations on results and what to improve for future iterations.
Evaluations
We used GPT-3.5-turbo on our production backend, and the results have been satisfying given that the model is more than a year old. We used prompt engineering to improve the quality and format of the model responses. The model consistently outputs valid JSON, thanks to the few-shot prompting we provided in the backend code. OpenAI has not published parameter counts for GPT-3.5 Turbo or GPT-4; the commonly quoted figures are 175 billion (the published size of GPT-3) and roughly 1 trillion (an unconfirmed estimate for GPT-4). Even as the smaller model, GPT-3.5 can benefit from proper prompt engineering.
In-memory retrieval performs surprisingly well. We used Chroma, a lightweight vector database that can run in-memory, which eliminates the need to set up a remote database service such as Pinecone. Because it’s open source and lightweight, we expected a noticeable hit to quality. We were wrong: the retrieval results have been factually correct thus far. With no network round trip to a remote database, retrieval is also very fast. The remaining challenge is ensuring that retrieval is not only factually correct but also logically coherent. Retrieving context that spans multiple sentences was quite difficult to nail down, and would likely demand more advanced algorithms and optimizations down the road.
Although the results were factually correct thanks to retrieval, GPT-3.5 still finds it quite hard to output the EXACT number of items at generation time. We found odd cases where it generated more or fewer than 4 answer options for a question in a QuestionAnswer set.
QnA quality and question uniqueness drop as the amount of content in a vlecture Note decreases. We observed that notes with fewer than 200 words (in general) have a much higher chance of generating two or more questions with the same content but different wording.
Future Improvements
The LLM needs a minimum amount of content to produce acceptable results. We found 200 words of main-section content to be a good starting point for the minimum content length. Main sections (one of the three parts of the Cornell note-taking format that we used) with more than 200 words tend to generate better results in our observation.
LLMs sometimes generate an incorrect number of items. We handled this by adding checking logic within the service code. Whenever answer generation fails to produce exactly 4 options, we re-run it, up to 5 times in total (the probability of getting it wrong 5 times in a row is much lower). If the result still contains more than 4 options, we truncate it to the first 4 items before inserting them into the database.
Using a more powerful model, such as GPT-4, will most likely produce better QnA results. If you have the budget to do so, by all means do it. Since our team was on a budget, we opted for the 3.5 turbo instead.
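The retry-and-truncate check described above could be sketched roughly like this. The function and parameter names are hypothetical, the generator is a stand-in for the real answer chain, and raising an error when there are still too few options is my own assumption about how to handle that case.

```python
from typing import Callable, List


def generate_options(
    generate: Callable[[], List[str]],
    required: int = 4,
    max_attempts: int = 5,
) -> List[str]:
    """Re-run `generate` until it yields exactly `required` options, else truncate."""
    options: List[str] = []
    for _ in range(max_attempts):
        options = [opt.strip() for opt in generate() if opt.strip()]
        if len(options) == required:
            return options
    if len(options) > required:
        # Still too many after all attempts: keep only the first `required` items.
        return options[:required]
    # Assumption: too few options is treated as a hard failure.
    raise ValueError(f"could not get {required} options after {max_attempts} attempts")


# Fake generator: the first call returns too few options, the second returns four.
outputs = iter([["A", "B"], ["A", "B", "C", "D"]])
result = generate_options(lambda: next(outputs))
print(result)
```

Keeping the counts as parameters makes it easy to tune how many retries you are willing to pay for, since each retry is another LLM call.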
Thanks and I hope you enjoyed this article! Give a clap if you liked it.