Article by Teodor Petrov

FHNW RAG Chatbot

The FHNW chatbot is a prototype developed by Teodor Petrov as part of a ToBIT project in the BSc Business Information Technology program at the Fachhochschule Nordwestschweiz (FHNW). Prof. Dr. Thomas Hanne supervised the ToBIT paper. Users can ask questions about admission requirements, degree programs, student FAQs, or university policies and receive immediate answers with links to the relevant pages. The chatbot uses Retrieval-Augmented Generation (RAG) combined with a Large Language Model (LLM) to create accurate responses based on the information found on the FHNW website.

This part of the paper describes the end-to-end process of developing the FHNW RAG chatbot, from gathering and processing the data to implementing the chatbot and deploying its user-facing interface. It highlights the technical steps, tools, and methods used to create a robust system that demonstrates the potential of RAG in a real-world scenario.

Table of Contents

FHNW RAG Chatbot Prototype

Use case for the chatbot

Choice of architecture

Data gathering and indexing

Scraping the website

Cleaning the HTML files

Cleaning the links in the JSON files

Chunking the cleaned markdown files

Generating summaries and populating the graph database

Creating the vector database and generating embeddings

Implementation of the RAG chatbot

Conversation and query analysis

Retrieval

Generation and response stream

Frontend and deployment

Deployment

Frontend technologies and layout

User input and conversation

Evaluation of the FHNW RAG chatbot prototype

Performance

Cost

References

Use case for the chatbot

The fundamental idea behind the RAG approach, demonstrated in the development of the FHNW chatbot prototype, is to leverage the well-structured websites that many businesses and organizations already maintain. These websites contain the valuable information people look for and can thus serve as an excellent data source. Rather than requiring customers, students, or other users to click through multiple pages and sections to find what they are searching for, a RAG-based chatbot can utilize this existing high-quality, structured information resource and provide relevant responses directly. Additionally, such a system can accurately cite its sources of information, allowing the user to shortcut the website navigation.

This RAG use case offers a relatively low-effort and low-investment way for businesses and organizations to level up their customer support or provide a new way for their users to find relevant information, as they can capitalize on the work and investment they have already made in organizing and creating their website's content. Websites' consistent styling and highly repetitive template structure open an excellent opportunity for building practical chunking algorithms that can create high-quality information stores for the RAG. Additionally, a website's inherent hierarchy and interconnections allow for many additional techniques to enhance the RAG's performance.

In the case of the FHNW RAG chatbot prototype, the goal is to help people quickly find information about admission requirements, organizational details, common student questions, regulations, and other topics that are available on FHNW's official website. Although the website is well-structured, navigating it can take time when unsure which page holds the needed information. The chatbot shortens the search process, and users can type in their questions directly and receive immediate answers, along with links to the relevant source pages on the website.

Potential new students can learn about admission criteria, deadlines, and program offerings without digging through pages on the website. Current bachelor's or master's students can get instant guidance on frequently asked questions. Even university staff can use the chatbot to access regulations or organizational policies without manually going through every link.

The prototype effectively utilizes the structure, hierarchy, and navigation of the FHNW site. A tailor-made chunking algorithm was implemented to split the university's pages into logical chunks, which enhanced retrieval performance. Additionally, graph-based algorithms exploit the underlying website hierarchy and connections between pages. This technique allows for improving RAG performance even further.

Choice of architecture

The FHNW RAG chatbot operates with the following general architecture:


Figure 2 - General architecture of the FHNW RAG chatbot prototype

The overall system is split into frontend and backend. The frontend serves as the user interface and is additionally tasked with verifying the input and sending valid requests with the entire conversation to the backend of the FHNW chatbot.

To ensure accurate and contextually aware responses, an LLM analyzes the user's query to determine its conversational context. The query is then rewritten into multiple versions optimized for the subsequent retrieval process. This technique significantly improves retrieval performance, particularly in multi-message conversations. Additionally, the LLM decides if the query requires retrieval at all, sometimes bypassing the entire process, saving unnecessary API calls, and reducing overall response latency.

Each query version is embedded using a text embedding model to capture its semantic meaning. The resulting query vectors are then used to perform a similarity search against the embedding vectors of page summaries stored in the vector database. These page summaries are created using an LLM in the chatbot's data gathering and processing development stage.

The returned pages are ranked based on their semantic closeness to the queries. A reciprocal rank fusion (RRF) function is applied to the top pages from each query vector, yielding the best overall pages. This technique is also used in Contextual Retrieval by Anthropic.

As previously mentioned, this unique RAG approach utilizes the hierarchy and connections between pages to refine the retrieved results further. After the best pages are initially identified based solely on vector similarity, they are refined using the Neo4j graph database. This step in the retrieval process filters results by leveraging their positions within the graph structure, which is derived from the original website. By incorporating this graph-based perspective, the technique enhances retrieval outcomes to complement traditional vector similarity search.

Once the final best pages are identified, the query vectors are used again to identify the most relevant chunks within these pages. The RRF function is similarly applied to determine the top chunks.

Inspired by the summarization techniques used in GraphRAG (Edge et al., 2024), the architecture of this RAG relies heavily on page summaries as the foundation for determining relevant content, offering several advantages over a simple, naive RAG pipeline. First, it minimizes the risk of retrieving misleading or irrelevant chunks that, despite their high semantic similarity to the query, might include information from unintended contexts. Grounding responses in accurate and contextually appropriate content significantly reduces the likelihood of introducing errors. Additionally, it improves vector similarity search performance, as page summaries are information-dense and contain far more keywords and phrases than individual chunks. This method also allows the final retrieved context to include chunks that may not be semantically close to the query but carry critical information necessary for generating more accurate responses. Furthermore, the page summary approach enables the graph-based retrieval enhancements, leveraging the page hierarchy and connections to refine results.

However, the approach does have a disadvantage. Queries containing highly specific information that might not be captured in the page summaries could result in retrieval failures, particularly if the broader context of the query cannot be identified even after query analysis. Despite this limitation, the architecture provides far better practical performance and fits well with the use case for the FHNW chatbot.

Finally, the gathered information is assembled and passed to an LLM for response generation.

Data gathering and indexing

Data gathering is the essential starting point for building the RAG system, as it provides data that the chatbot relies on to generate accurate and domain-specific responses. For this project, all the required information is sourced from the official FHNW website, https://www.fhnw.ch. The primary goal of this phase is to ensure that the collected data comprehensively covers the topics and queries relevant to the chatbot's users. Afterward, the data is integrated into a graph and vector database used in the RAG pipeline.

Scraping the website

The scraping process was done using the Scrapy[1] Python web-crawling framework. The scraper was designed to systematically crawl and save the page content from the official FHNW website, starting at the root URL https://www.fhnw.ch. The crawler recursively follows all links on a page and saves the content of each linked page. It was limited to FHNW's primary domain, and a maximum depth limit of 15 was imposed; the depth represents how many links, starting from the main page, the crawler had to follow to reach a page.
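A minimal sketch of such a crawler is shown below, assuming a standard Scrapy project; the class name, item fields, and file-writing details are illustrative rather than the prototype's exact code.

```python
import scrapy

class FhnwSpider(scrapy.Spider):
    """Illustrative crawler: follows links recursively within the FHNW domain."""
    name = "fhnw"
    allowed_domains = ["www.fhnw.ch"]      # restrict crawling to the primary domain
    start_urls = ["https://www.fhnw.ch"]
    custom_settings = {"DEPTH_LIMIT": 15}  # stop 15 link-hops away from the root

    def parse(self, response):
        # Persist the page body; the prototype also writes a companion JSON
        # file with the URL path and outgoing links (described below).
        yield {"url": response.url, "body": response.text}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```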

The screenshot below shows the total number of pages and corresponding JSON files saved after the scraping process:


Figure 3 - Files extracted after the scraping process

The scraper was configured to save files with common extensions, such as .html, .pdf, and other relevant formats. Each file was saved under the name of its corresponding URL path on the website, ensuring logical organization, with filenames limited to 200 characters for manageability. To avoid filename collisions, an MD5 hash of the whole URL path was appended to each filename as a suffix.
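The naming scheme could look roughly like the following sketch; the helper name and the exact truncation rule are assumptions for illustration.

```python
import hashlib
from urllib.parse import urlparse

def filename_for(url: str, extension: str = ".html") -> str:
    """Build a filesystem-safe name from a URL path, capped at 200 characters,
    with an MD5 hash of the full path appended to avoid collisions."""
    path = urlparse(url).path.strip("/").replace("/", "_") or "index"
    digest = hashlib.md5(urlparse(url).path.encode("utf-8")).hexdigest()
    return f"{path[:200]}_{digest}{extension}"
```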

The screenshot below shows a small part of the downloaded HTML pages and their names:


Figure 4 - Example screenshot showing filenames of the scraped HTML files

In addition to saving the files, the scraper created a companion JSON file for every HTML page it processed. Each JSON file contained metadata that included the original URL path, the saved file name, and a list of all links to other pages found on the saved page. This metadata recorded the website's inter-page connections and hierarchy, which was later used to facilitate better retrieval in subsequent stages.

Cleaning the HTML files

The raw HTML pages saved by the scraper naturally include many unnecessary elements that do not contribute to meaningful responses from the chatbot. The BeautifulSoup[2] library was used to remove all scripts, styles, and code comments within the HTML. Structural components such as headers and footers, which contain repetitive or navigational information irrelevant to the chatbot's functionality, were also excluded.

Certain tags and their associated content, deemed irrelevant to the chatbot's context, were explicitly removed. Tags lacking textual content were recursively stripped away to refine the dataset further. However, <a> and <button> tags containing valid https:// links were retained, as these were essential for the RAG system to include its sources when responding to the user.

After cleaning, the HTML files were converted into markdown format using the html2text[3] library. This step is essential for the chunking algorithm later, and markdown is also used to implement the user interface. SHA-256 hashes were computed for each file's content to identify and eliminate duplicates. At the end of the cleaning process, the JSON files corresponding to duplicates were removed.
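A condensed sketch of this cleaning pipeline is shown below, assuming default settings for BeautifulSoup and html2text; the prototype's exact tag lists are not reproduced here.

```python
import hashlib
from bs4 import BeautifulSoup, Comment
import html2text

def clean_html_to_markdown(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts, styles, and repetitive structural components.
    for tag in soup(["script", "style", "header", "footer", "nav"]):
        tag.decompose()
    # Strip code comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return html2text.html2text(str(soup))

def content_hash(markdown: str) -> str:
    # SHA-256 over the cleaned content, used to detect duplicate pages.
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()
```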

This thorough cleaning and deduplication produced a concise, high-quality dataset.

Cleaning the links in the JSON files

An important step before creating the graph database was cleaning the links between pages recorded in the JSON files corresponding to each page. As this is a prototype, only the English part of the website was kept. This subset was identified by selecting all pages with URLs beginning with /en/ and extending it to include any additional pages linked to by these English-language pages.

Next, each JSON file was processed to remove duplicate links. For every unique link in a file, its occurrence was counted to record the frequency of links across the dataset. This information was particularly useful in the subsequent filtering steps.

Once duplicates were removed, the remaining links in each JSON file were further refined. Only links pointing to pages within the previously defined English subset were preserved, guaranteeing that all retained links led to pages available in the subset.

Finally, frequency-based filtering was applied to reduce the inclusion of overly common links that added little value to the dataset. Links found in over 30% of the JSON files were discarded, with one important exception: if a page is itself among these highly linked-to pages, it retains all of its links regardless of their frequency. This exemption ensured that essential pages, often serving as hubs, were not inadvertently excluded from the final dataset.
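The frequency-based filtering could be expressed as in the sketch below, which assumes pages maps each page URL to its list of outgoing links; the names are illustrative, not the prototype's exact code.

```python
from collections import Counter

def filter_common_links(pages: dict[str, list[str]],
                        threshold: float = 0.3) -> dict[str, list[str]]:
    n = len(pages)
    # Count in how many JSON files each unique link appears.
    freq = Counter(link for links in pages.values() for link in set(links))
    too_common = {link for link, count in freq.items() if count / n > threshold}
    filtered = {}
    for page, links in pages.items():
        if page in too_common:
            filtered[page] = links  # hub pages keep all their links
        else:
            filtered[page] = [l for l in links if l not in too_common]
    return filtered
```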

The screenshot below shows the number of files in the cleaned dataset:


Figure 5 - Number of files left after dataset cleaning

This multi-step process resulted in a refined set of links for each page. Subsequently, all markdown pages without a corresponding JSON file were removed. Cleaning the links was a crucial step to ensure the effectiveness of the graph-based algorithms, which relied on the graph database constructed using the relationships and hierarchy between pages.

Chunking the cleaned markdown files

Chunking the cleaned Markdown files is an essential step in preparing a dataset for RAG. The goal is to ensure that the data is divided into manageable, relatively uniform-sized, and meaningful sections that align with the structural and contextual organization of the original content.

The already cleaned markdown files are first processed by a parse_markdown function. This function scans the content line by line, identifying and categorizing blocks such as headings, paragraphs, empty lines, and list items denoted by asterisks. This block-level segmentation forms the foundation for the subsequent chunking process.

Next, a process_blocks function is called, which iterates over the identified blocks from each page. The blocks are then passed to a process_chunk function, which either splits or combines them to form coherent chunks. This step creates well-structured chunks of uniform size, measured in tokens, while maintaining the contextual integrity of the content. The number of tokens is determined by tokenizing with the tiktoken[4] library and the o200k_base[5] tokenizer.

The process_chunk function incrementally adds blocks to a chunk, continuously measuring its token count after each addition. Hierarchical thresholds determine when a chunk is complete, ensuring a balance between chunk size and contextual fidelity. The first threshold is set at 200 tokens, and if an entire page contains fewer than 200 tokens, it is returned as a single chunk. The function imposes increasingly relaxed splitting criteria for chunks as token counts exceed successive thresholds.

As the token count grows beyond 200, the chunk may end before including certain blocks, such as heading levels 1 or 2. If, despite this, the token count surpasses 300, splitting can occur at heading levels 1, 2, or 3. At 400 tokens, additional splitting is allowed at heading level 4 or after three consecutive empty lines. As thresholds increase, the criteria for splitting expand to include fewer consecutive new lines or structural markers, such as asterisks, making it more likely for the chunk to end.

The table below shows the thresholds in detail:

Table 1 - Chunk processing thresholds in detail

| Threshold (tokens) | Stopping heading levels | Stopping newlines | Stopping characters |
|--------------------|-------------------------|-------------------|---------------------|
| 200                | 1, 2                    | None              | None                |
| 300                | 1, 2, 3                 | None              | None                |
| 400                | 1, 2, 3, 4              | 3 consecutive     | None                |
| 500                | 1, 2, 3, 4              | 2 consecutive     | None                |
| 600                | 1, 2, 3, 4              | 2 consecutive     | Asterisk (*)        |
| 800                | 1, 2, 3, 4              | 1 consecutive     | Asterisk (*)        |

This hierarchical chunking mechanism is custom-designed to leverage the natural structure of the original content. Once the chunks from a markdown page are created, they are saved in a directory corresponding to the name of the original page.
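The threshold logic in Table 1 can be condensed into a decision function like the sketch below; the real process_chunk function tracks more block types and state, so this is only an approximation.

```python
def may_split(token_count: int, heading_level: int | None,
              empty_run: int, starts_with_asterisk: bool) -> bool:
    """Decide whether the current chunk may end before the next block,
    following the hierarchical thresholds in Table 1."""
    if token_count < 200:
        return False                      # pages under 200 tokens stay whole
    if heading_level is not None:         # split in front of headings
        if token_count >= 400:
            return heading_level <= 4
        if token_count >= 300:
            return heading_level <= 3
        return heading_level <= 2
    if token_count >= 800:
        return empty_run >= 1 or starts_with_asterisk
    if token_count >= 600:
        return empty_run >= 2 or starts_with_asterisk
    if token_count >= 500:
        return empty_run >= 2
    if token_count >= 400:
        return empty_run >= 3
    return False
```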

The screenshot below shows the final statistics after the chunking process:

Figure 6 - Chunking process final statistics

After this process, there are three directories: markdown_pages containing the clean pages, chunked_pages holding directories of chunks corresponding to each page, and json_files storing metadata for each page.

Generating summaries and populating the graph database

The next step in the data preparation process involved generating concise summaries for each cleaned page and populating the Neo4j[6] graph database with the processed data. Summaries were generated for each cleaned markdown page using the gpt-4o-mini[7] model. These summaries captured the essence of the content in highly information-dense text of consistent length. Each summary was appended to the respective JSON file under a new field named summary, adding to the metadata associated with each page.

Below is an example summary of a page:

"Bachelor of Science FHNW in Business Information Technology focuses on digitalisation, offering 180 ECTS points, full-time or part-time study modes, and specialisations in Data Science, Digital Trust, and Digital Business Management; graduates bridge IT and business, optimising processes and developing innovative solutions. "

The distribution below highlights the difference in token lengths between whole pages and their summaries. Summaries have far more uniform lengths and are generally much smaller in size.


Figure 7 - Token count distribution comparison between summaries and whole pages

The graph database was implemented to store the information and represent the relationships within the dataset. Each page was represented as a node with the following attributes: a unique page_id, the original file name (file_name), the number of chunks derived from the page (number_of_chunks), the generated summary (page_summary), and the page's URL (url). Similarly, the individual chunks of each page were stored as separate nodes with a unique chunk_id, a chunk_number, and the chunk's content (chunk_content). Relationships between pages and their chunks were established using the HAS_CHUNK relationship type, while connections between pages, as refined during the link-cleaning phase, were modeled using the LINKS_TO relationship type.
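Populating the graph could look like the following sketch using the official neo4j Python driver; the property names follow the schema described above, while connection details and helper names are assumptions.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def write_page(tx, page: dict, chunks: list[dict]) -> None:
    tx.run(
        "MERGE (p:Page {page_id: $id}) "
        "SET p.file_name = $file, p.page_summary = $summary, "
        "    p.url = $url, p.number_of_chunks = $n",
        id=page["page_id"], file=page["file_name"],
        summary=page["summary"], url=page["url"], n=len(chunks),
    )
    for c in chunks:
        tx.run(
            "MATCH (p:Page {page_id: $pid}) "
            "MERGE (c:Chunk {chunk_id: $cid}) "
            "SET c.chunk_number = $num, c.chunk_content = $content "
            "MERGE (p)-[:HAS_CHUNK]->(c)",
            pid=page["page_id"], cid=c["chunk_id"],
            num=c["chunk_number"], content=c["content"],
        )
    # LINKS_TO relationships between pages are created in a similar pass.
```

Each page would then be written inside a session, for example via session.execute_write(write_page, page, chunks).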

The schema below shows the different nodes' attributes and the graph relationships:


Figure 8 - Attributes of nodes inside the graph database


Figure 9 - Relation schema for nodes inside the graph database

After the graph was fully populated, the Leiden community detection algorithm was applied using the Graph Data Science (GDS)[8] library in Neo4j to reveal the underlying communities of highly interconnected pages. The first step in this process identified all weakly connected components within the original directed graph. These components were primarily single nodes with no connections to the rest of the graph, such as personal pages of university staff. Components containing fewer than ten pages were flagged and added to a list of "small components" to avoid introducing too many tiny communities. A temporary undirected subgraph was then created by excluding these small components.

The Leiden algorithm revealed 19 distinct communities in the refined graph based on the structural relationships found in the data. Each community was assigned a unique community_id, which was added as a property to the corresponding page nodes. Nodes belonging to the previously identified small components were assigned a default community_id of 9999, ensuring all nodes belong to a community.
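The community detection step might be invoked as in the sketch below, assuming the graphdatascience Python client; the projection name is illustrative, and the exclusion of small components is omitted for brevity.

```python
from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))

# Project an undirected view of the page graph (Leiden expects undirected input).
G, _ = gds.graph.project(
    "page-graph", "Page", {"LINKS_TO": {"orientation": "UNDIRECTED"}}
)

# Run Leiden and write each node's community id back as a node property.
gds.leiden.write(G, writeProperty="community_id")
```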

The image below shows the final directed graph with each unique community highlighted in a different color:

Figure 10 - The final graph structure with each unique community highlighted in a different color

This process of identifying communities establishes a meaningful separation within the graph database. These communities are later leveraged to significantly enhance the retrieval process by improving the retrieved context's accuracy and efficiency, which enables user queries to be met with precise responses.

Creating the vector database and generating embeddings

The final step in data preparation for the FHNW RAG chatbot prototype was to set up a vector database with embeddings of the summaries and chunks derived from the processed dataset. This phase's goal was to create the key component of the retrieval system for identifying the relevant pages and chunks.

The vector database was implemented using ChromaDB[9], with two separate collections: one for the page summaries and another for the content chunks. The process began by querying the Neo4j graph database to retrieve all the summaries associated with the page nodes. Each summary was embedded using OpenAI's text-embedding-3-large[10] model. These embeddings, alongside their corresponding page_id, were stored in the summaries collection within ChromaDB.

Next, Neo4j was queried again to extract all chunks linked to the pages. Each chunk was embedded using the same text-embedding-3-large model. These chunk embeddings were then added to the chunks collection in ChromaDB, including the chunk_id and the page_id of the parent page. This metadata ensured an explicit mapping between chunks and their respective pages, allowing for flexible querying of the vector database later on.
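Indexing both collections could look roughly like the sketch below, assuming the chromadb and openai packages; pages and all_chunks stand for the records queried from Neo4j, and all names are illustrative.

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma")

def embed(texts: list[str]) -> list[list[float]]:
    result = openai_client.embeddings.create(
        model="text-embedding-3-large", input=texts
    )
    return [item.embedding for item in result.data]

def index_dataset(pages: list[dict], all_chunks: list[dict]) -> None:
    summaries = chroma.get_or_create_collection("summaries")
    chunks = chroma.get_or_create_collection("chunks")
    # Page summaries keyed by page_id.
    summaries.add(
        ids=[p["page_id"] for p in pages],
        embeddings=embed([p["summary"] for p in pages]),
    )
    # Chunks keyed by chunk_id, with the parent page_id kept as metadata.
    chunks.add(
        ids=[c["chunk_id"] for c in all_chunks],
        embeddings=embed([c["content"] for c in all_chunks]),
        metadatas=[{"page_id": c["page_id"]} for c in all_chunks],
    )
```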

The schema below shows the two collections and their attributes in the vector database:


Figure 11 - Attributes of collections in the vector database

This process provided a solid foundation for semantic search, enabling the retrieval process to find the most relevant information.

Implementation of the RAG chatbot

The data gathering and processing phase resulted in two databases containing all the information and attributes required for the RAG. The graph database contains the content of pages in chunks and relationships between pages with defined communities. The vector database stores the embeddings for all pages' summaries and individual chunks. Together, these databases provide the foundation for the chatbot's retrieval capabilities.

This section delves into the core implementation of the FHNW RAG chatbot, detailing how the graph and vector databases are leveraged for efficient information retrieval. It explores the processes and techniques used to analyze the user's query, retrieve, rank, and build the relevant information for generating precise, adequate responses to user queries, forming the backbone of the chatbot's RAG functionality.

Conversation and query analysis

The conversation and query analysis process is the starting point for handling user requests within the RAG system, which is implemented in Python and runs on a Flask[11] server. Each request to the flask_api contains the entire conversation history between the user and the assistant, enabling the chatbot to maintain context when generating responses.

Once a request reaches the RAG backend, the entire conversation is passed into the generate_response function. This function plays a central role in the RAG pipeline. It first isolates the latest user query while retaining the conversation history of the previous interactions. This separation is essential to ensure the system can analyze the current request and consider the broader conversational context.

The latest query and the context are then forwarded to the rewrite_query function. Here, the gpt-4o model determines whether retrieval is necessary for answering the query. The model evaluates the query and returns a structured output with a boolean flag indicating if retrieval is required. Additionally, it generates three rewritten versions of the query. These rewrites are optimized for retrieval, considering both the context of the conversation and the phrasing of the original query.
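A sketch of this structured-output call is shown below, using the openai package's Pydantic-based parsing; the schema field names are illustrative, not the prototype's exact ones.

```python
from openai import OpenAI
from pydantic import BaseModel

class QueryAnalysis(BaseModel):
    requires_retrieval: bool
    rewritten_queries: list[str]  # three retrieval-optimized rewrites

client = OpenAI()

def rewrite_query(conversation: list[dict], system_prompt: str) -> QueryAnalysis:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "system", "content": system_prompt}] + conversation,
        response_format=QueryAnalysis,
    )
    return completion.choices[0].message.parsed
```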

The screenshot below shows the exact system instruction prompt used for query analysis:


Figure 12 - System prompt used for query analysis

If the query does not require retrieval, the system directly provides the conversation and query to the gpt-4o-mini model, which generates the response. However, if retrieval is deemed necessary, the retrieval process is initiated.

By integrating this process, the queries used for retrieval are contextually aware. Additionally, using three separate versions of the rewritten query allows for consistent and accurate semantic search.

Retrieval

The retrieve_context function is the centerpiece of the RAG process, facilitating the retrieval and construction of relevant information for generating accurate responses. Finding the most relevant content based on the user's rewritten queries is the first and most important step. The function begins by generating embeddings for all rewritten queries using the same text-embedding-3-large model. These embeddings are then used for a similarity search in the vector database, specifically the summaries collection, to identify the top 36 most relevant pages for each rewritten query.

The reciprocal_rank_fusion function is then called to combine these three ranked lists of pages and extract the top 12 pages overall.

Reciprocal rank fusion (RRF) is a method to combine multiple ranked lists of items into a unified ranking. It works by calculating a score for each item based on its position in the lists. The function takes several ranked lists of the same length, where items ranked higher (with lower rank numbers) are considered more relevant. For each item, the RRF score is computed using the formula:

\text{RRF}(d) = \sum_{r \in R} \frac{1}{K + r(d)}

where R represents the set of ranked lists, r(d) is the rank of item d in list r (0-based), and K is a smoothing constant set to the number of input lists. Items not present in a list contribute zero to the score. Contributions from higher ranks are more significant, while lower-ranked items have less influence. The scores from all lists are summed for each item, and the items are then sorted by their total scores in descending order to produce a unified ranking. Finally, the top 12 pages are extracted.
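A direct implementation of this formula, with 0-based ranks and K equal to the number of input lists, might look like this:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], top_n: int) -> list[str]:
    k = len(ranked_lists)                      # smoothing constant K
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking):  # 0-based rank r(d)
            scores[item] += 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]
```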

These selected pages are retrieved from the Neo4j graph database and passed to the select_ids function. Here, a further refinement occurs by utilizing the previously established communities from the graph structure. The 12 ranked pages are reordered by community frequency: the highest-ranked page from the most common community among the retrieved pages takes the first position, while the worst-ranked page from the least common community takes the last position. After this ordering, the function cuts the bottom four pages, reducing the candidate set by a third.
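The community-based reordering can be sketched as follows; a stable sort by community frequency preserves the similarity ranking within each community, and the field names are assumptions.

```python
from collections import Counter

def select_ids(ranked_pages: list[dict], keep: int = 8) -> list[dict]:
    """Reorder the 12 ranked pages by community frequency, then drop the tail."""
    counts = Counter(p["community_id"] for p in ranked_pages)
    # Stable sort: pages from the most common community come first, and the
    # original similarity ranking breaks ties within each community.
    reordered = sorted(
        ranked_pages, key=lambda p: counts[p["community_id"]], reverse=True
    )
    return reordered[:keep]
```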

Once the final 8 pages are identified, the rewritten queries are used again to query the chunks collection, extracting the top 128 chunks from the selected pages. Once again, the reciprocal_rank_fusion method narrows the results to the top 40 chunks. After finding the 8 most relevant pages and their 40 most relevant chunks, the retrieve_context function begins assembling the information. Each context part includes a page URL, its summary, and any chunks from the top 40 that belong to that page. A page is discarded if none of those 40 chunks belong to it. The screenshot below shows the general structure of a context part:


Figure 13 - Example structure of retrieved context part

Finally, the retrieve_context function combines and returns the final structured and relevant information that allows for the generation of accurate responses.

Generation and response stream

The final step in the RAG pipeline focuses on generating and delivering an adequate response to the user. Once the retrieve_context function has constructed the final information, along with the user's latest query and the entire conversation history, it is passed to the gpt-4o-mini model. The model is specifically instructed to generate responses in markdown format for later parsing and styling in the prototype's frontend.

The screenshot below shows the exact system instruction prompt used for response generation:


Figure 14 - System instruction prompt for response generation

If retrieval was deemed unnecessary in earlier stages, the model works directly with the conversation and query, without additional retrieved context. This flexibility allows the chatbot to handle both straightforward or unrelated queries and complex, information-dependent requests efficiently.

The flask_api plays an important role in returning the generated response. Rather than delivering the entire response at once, it streams the output back to the client in real time. The streaming capability enhances the interaction by reducing perceived latency, creating a more natural and responsive conversation flow.
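A minimal sketch of the streaming endpoint is shown below, assuming the flask and openai packages; the route and payload shape are illustrative, and the retrieval steps described earlier are omitted.

```python
from flask import Flask, Response, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()

@app.post("/chat")
def chat() -> Response:
    conversation = request.get_json()["messages"]

    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini", messages=conversation, stream=True
        )
        for event in stream:
            delta = event.choices[0].delta.content
            if delta:
                yield delta  # forward each token as soon as it arrives

    return Response(generate(), mimetype="text/plain")
```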

With these features, the chatbot allows user-friendly interactions by combining markdown formatting and real-time streaming.

Frontend and deployment

The FHNW RAG chatbot is hosted entirely on a self-managed physical server. The deployment involved creating two Linux virtual machines and building a user-facing interface that leverages the markdown format of the chatbot's responses to render styled, user-friendly chat messages. This design ensures that conversations are intuitive and visually engaging and that the overall system is secure and performant.

Deployment

The FHNW RAG chatbot is divided into two parts: a frontend and a backend, each running on separate virtual machines. These machines operate on the same server local network, with only the frontend assigned a public IP address. This configuration is intentional, providing an additional layer of security by limiting external access to the backend. By restricting access in this way, the backend ensures that sensitive operations, such as information retrieval and OpenAI API calls, remain protected from unauthorized access.

The frontend handles all user interactions, acting as the chatbot's public-facing interface. This deployment strategy secures the backend and provides a clear separation of the system components, allowing them to be developed independently.

Frontend technologies and layout

The frontend of the FHNW RAG chatbot is built using modern web development technologies. It uses the Node.js[12] JavaScript runtime environment to host Next.js 14[13], a React-based framework that allows for server- and client-side rendering. Tailwind CSS[14] and shadcn/ui[15] components are utilized for the interface and styling, ensuring a sleek and responsive design. The user interface is minimalistic and intuitive, consisting of a single-page layout that prioritizes user interaction and accessibility.

The screenshot below shows the FHNW chatbot user interface:


Figure 15 - User interface of the FHNW RAG chatbot prototype

At the center of the interface is the chat component, with an input field where users can type their queries. The layout includes a theme toggle button, which allows users to switch between light and dark modes based on their preference, and a "New Chat" button that lets the user start a new conversation. The page also displays the FHNW logo while maintaining a clean and professional appearance.

User input and conversation

The user interaction with the FHNW RAG chatbot begins with a simulated greeting message, creating a friendly and approachable entry point for the conversation. When the user submits a query, the entire conversation history, including the latest input, is sent to the /api/chat endpoint. This endpoint, implemented as a Next.js server component, validates the input on the server side and ensures the conversation is in the correct format before forwarding it to the local network flask_api for processing.

The flask_api returns a response stream generated in markdown format. The markdown stream is then parsed on the frontend using the react-markdown[16] and remark-gfm[17] libraries. These libraries convert the markdown into styled HTML elements, creating the messages. The responses support headings, numbered lists, bullet points, tables, and other elements that enhance the overall user experience when chatting with the FHNW RAG chatbot.

The screenshot below is an example styled message from the chatbot:


Figure 16 - Example styled message from the FHNW RAG chatbot prototype

The final styled response message is displayed to the user in real time, creating a dynamic interaction. As the response streams in, the interface updates progressively, offering a responsive and engaging user experience.

The FHNW RAG chatbot prototype is now publicly available at this URL:

https://chatfhnw.bulpost.com

Evaluation of the FHNW RAG chatbot prototype

This section evaluates the practical usefulness of the FHNW RAG chatbot prototype, focusing on both its performance and the cost of running the system. Simpler configurations of the retrieval system are also tested, offering a clear comparison and highlighting the improvements achieved through the various techniques applied during the chatbot's implementation.

Performance

A dataset of 40 queries and corresponding factually correct information was created to evaluate the chatbot's performance and usefulness. These queries were designed to represent realistic user submissions, resembling chat-like messages with abbreviations and intentional spelling mistakes. A Python script was developed to simulate API requests to the backend flask_api. The responses of the FHNW RAG chatbot prototype, alongside the factually correct information, were given to an LLM, which judged whether each response was accurate. Due to the non-deterministic nature of LLMs and embedding models, each query was submitted 7 times to ensure a reliable performance assessment, resulting in a total of 280 requests.
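The evaluation loop could be sketched as below; the judge model, prompt, and endpoint URL are assumptions for illustration, not the exact setup from the published scripts.

```python
import requests
from openai import OpenAI

client = OpenAI()
RUNS = 7  # each query is submitted 7 times

def is_accurate(answer: str, reference: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[
            {"role": "system",
             "content": "Reply 'yes' if the response is factually consistent "
                        "with the reference information, otherwise 'no'."},
            {"role": "user",
             "content": f"Response:\n{answer}\n\nReference:\n{reference}"},
        ],
    ).choices[0].message.content
    return verdict.strip().lower().startswith("yes")

def evaluate(dataset: list[dict], endpoint: str) -> float:
    correct = 0
    for item in dataset:
        for _ in range(RUNS):
            answer = requests.post(
                endpoint,
                json={"messages": [{"role": "user", "content": item["query"]}]},
            ).text
            correct += is_accurate(answer, item["reference"])
    return correct / (len(dataset) * RUNS)
```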

Multiple system versions were tested for comparison. System instruction prompts for generating the response remained the same for all versions. Each system used a total of 40 chunks for relevant context.

  • Basic Naive System: Used only the chunks.

  • Enhanced Naive System: Included a query analysis step and a single rewritten query.

  • Summaries-Based System: Used page summaries to determine relevant content, along with query analysis and a single rewritten query.

  • Complete System: Added graph-based filtering, multiple rewritten queries, and reciprocal rank fusion (RRF) to the summaries-based system.

Table 2 - System performance evaluation results

| System          | Percentage correct | Total correct |
|-----------------|--------------------|---------------|
| Basic Naive     | 18.93%             | 53/280        |
| Enhanced Naive  | 32.14%             | 90/280        |
| Summaries-Based | 86.07%             | 241/280       |
| Complete        | 97.50%             | 273/280       |

The Basic Naive System managed to respond correctly to only ~19% of queries, highlighting the significant limitations of the naive approach. Query rewriting provided a substantial boost, with 90 out of 280 responses (~32%) judged accurate. The biggest improvement came from switching to the summaries-based system, which achieved over 85% response accuracy. The complete system demonstrates a robust real-world use case for RAG, with the FHNW chatbot responding accurately 97.5% of the time; 7 of the 280 requests did not receive a satisfactory answer. The queries that were not answered correctly in all 7 runs are:

  • “do I need to know programming before I study bit?” - 6/7 correct

  • “give me the link to the embedded systems design specialization in the ics trinational program” - 5/7 correct

  • “what methods for supervised learning should students learn in the foundation in machine learning module?” - 5/7 correct

  • “When is next information event for Visual Communication?” - 5/7 correct

The evaluation dataset, script, and results are available on GitHub, along with the full implementation of the FHNW RAG chatbot prototype. Additionally, the scripts used during the data gathering and indexing stage are also provided: GitHub Repository Link

Cost

The cost of running the FHNW RAG chatbot prototype consists of two main components: the cost of virtual machines hosting the system and the cost of OpenAI API calls for the LLM and embedding model.

An EX44 dedicated server from Hetzner Online provides more than sufficient resources to host both the backend and frontend of the chatbot; such a server is currently available for €39.00 per month. The cost of OpenAI API calls to generate all 280 responses during the complete system evaluation was $1.24 (~CHF 1.13), equating to less than CHF 0.005 per answered question. If the chatbot answers 1,000 questions per day, the total daily OpenAI API cost would be less than CHF 5.00. The table below shows the estimated monthly cost of running the FHNW RAG chatbot while answering 1,000 questions a day:

Table 3 - Estimated costs of running the FHNW RAG chatbot

| Cost             | Per day  | Per month  |
|------------------|----------|------------|
| EX44 server      | CHF 1.20 | CHF 36.52  |
| OpenAI API calls | CHF 5.00 | CHF 152.50 |
| TOTAL            | CHF 6.20 | CHF 189.02 |

References

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. http://arxiv.org/abs/2404.16130

Anthropic. (2024). Introducing Contextual Retrieval. https://www.anthropic.com/news/contextual-retrieval

 


  1. Scrapy open-source web-crawling framework written in Python - https://scrapy.org

  2. BeautifulSoup4 library - https://pypi.org/project/beautifulsoup4

  3. html2text Python script - https://pypi.org/project/html2text

  4. tiktoken BPE tokenizer for OpenAI models - https://pypi.org/project/tiktoken

  5. OpenAI cookbook for counting tokens - https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken

  6. Neo4j Graph Database - https://neo4j.com

  7. OpenAI GPT-4o mini model - https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence

  8. Neo4j Graph Data Science - https://neo4j.com/product/graph-data-science

  9. ChromaDB vector database - https://www.trychroma.com

  10. OpenAI vector embeddings - https://platform.openai.com/docs/guides/embeddings

  11. Flask lightweight WSGI web application framework - https://flask.palletsprojects.com/en/stable

  12. Node.js JavaScript runtime environment - https://nodejs.org

  13. Next.js React web framework - https://nextjs.org

  14. Tailwind CSS open-source CSS framework - https://tailwindcss.com

  15. shadcn/ui component library - https://ui.shadcn.com

  16. react-markdown React component for markdown rendering - https://github.com/remarkjs/react-markdown

  17. remark-gfm remark plugin to support GitHub Flavored Markdown - https://github.com/remarkjs/remark-gfm
