Starting with a RAG

Photo by Pierre Bamin / Unsplash

Thousands of teams around the world use Confluence to collaborate and share information. As with any system of record for knowledge, a lot of effort goes into writing documents and keeping them up to date. As time passes and organizations and their content grow, answers become harder to find. Search is often not good enough, and finding an answer takes a lot of sifting and reading through documents, especially for people new to the organization.

Thankfully, foundation models, with their ability to summarize, extract and even reason, are making high-quality question answering much more feasible. The first leg of our journey starts by creating an app able to answer questions using the information contained in a Confluence space.

These days, a common pattern for question answering using LLMs is based on the concept of Retrieval-Augmented Generation (RAG), first described in the 2020 paper by Patrick Lewis et al. As of early 2023, the pattern involves, as a very rough first approximation:

  • Building an index from the source documents.
  • At query time, retrieving the most relevant documents from the index and passing them along with the question as part of the prompt to the LLM.
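
The two steps above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the Connie AI implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the function names, sample documents and prompt wording are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Step 1: store each document alongside its vector."""
    return [(doc_id, text, embed(text)) for doc_id, text in documents.items()]


def retrieve(index, question: str, k: int = 2) -> list[tuple[str, str]]:
    """Step 2a: return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(doc_id, text) for doc_id, text, _ in ranked[:k]]


def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Step 2b: pass the retrieved text and the question to the LLM as one prompt."""
    sources = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer the question using only the sources below, and cite them.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical sample documents, for illustration only.
documents = {
    "vacation-policy": "Employees request vacation days through the HR portal.",
    "onboarding": "New hires receive a laptop on their first day.",
    "expenses": "Submit expense reports before the end of each month.",
}
index = build_index(documents)
question = "How do I request vacation days?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the prompt would then be sent to an LLM; that call is left out here because it depends on the provider.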

In this post, we go over our initial approach to question answering in Connie AI.

Building and maintaining the index

Every time a document is created or changes in your Confluence space, we perform the following steps:

  • Conversion: documents in Confluence are stored in an XML-type format called Atlassian Document Format. We convert the text to markdown, which is more readily processable by LLMs, while preserving non-markdown features such as mentions.
  • Chunking: we divide the document into smaller pieces. We use the structure of the document (paragraphs, lists, tables, etc.) to help us separate self-contained pieces. This lets us search directly for the relevant parts of a document rather than retrieve whole documents and feed them to the LLM.
  • Diff-ing: we detect which chunks of the document have changed. This allows us to update only the necessary chunks and to identify the contributors to different parts of the document.
  • Embedding: we compute an embedding vector for each chunk. An embedding vector is akin to a set of coordinates that situates a piece of text in a multi-dimensional space. Similar pieces of text are closer together in this embedding space, allowing us to retrieve pieces of text that are somewhat related to the query.
  • Indexing: we use Elastic as a hybrid index with both the embeddings and the text of the chunk.
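
As an illustration of the chunking and diff-ing steps, here is a minimal Python sketch. It assumes chunks are separated by blank lines in the converted markdown and that changes are detected by comparing SHA-256 fingerprints of chunk text. Both choices are assumptions made for this example, as are the function names; the actual pipeline may work differently.

```python
import hashlib
import re


def chunk_markdown(markdown: str) -> list[str]:
    """Split on blank lines so each paragraph, list or table stays in one piece."""
    blocks = re.split(r"\n\s*\n", markdown)
    return [block.strip() for block in blocks if block.strip()]


def fingerprint(chunk: str) -> str:
    """Stable identifier for a chunk's content."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def diff_chunks(old_chunks: list[str], new_chunks: list[str]):
    """Return (chunks that need embedding, fingerprints to drop from the index)."""
    old = {fingerprint(c) for c in old_chunks}
    new = {fingerprint(c): c for c in new_chunks}
    to_embed = [chunk for digest, chunk in new.items() if digest not in old]
    to_delete = sorted(old - set(new))
    return to_embed, to_delete


# Hypothetical document versions, for illustration only.
old_doc = "# Title\n\nFirst paragraph.\n\n- item a\n- item b"
new_doc = "# Title\n\nFirst paragraph, edited.\n\n- item a\n- item b"

to_embed, to_delete = diff_chunks(chunk_markdown(old_doc), chunk_markdown(new_doc))
print(to_embed)   # only the edited paragraph needs a new embedding
print(to_delete)  # fingerprint of the paragraph that no longer exists
```

With this approach, unchanged chunks keep their existing embeddings, so an edit to one paragraph costs one embedding call rather than one per chunk.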

Other considerations:

  • Bootstrapping: when Connie AI is first installed in a space, we index the existing documents through a queuing mechanism, which helps ensure we don't run into rate limits with the LLM.
  • Index only what's needed: we only index documents that a user has access to. If someone is not using Connie AI on a given space, their private documents in that space are not indexed.
  • Privacy, GDPR, right to be forgotten: we have built a mechanism to honor requests to remove documents from the index. We automatically expire any cached private information such as usernames, emails, etc.
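
To make the bootstrapping idea concrete, here is a small Python sketch of a queue drained under a rate limit. It assumes a sliding-window limit of `max_calls` per `per_seconds`; the class name, the parameters and the limiting strategy are hypothetical and chosen only for illustration. The clock and sleep functions are injectable so the example runs instantly with a fake clock.

```python
import time
from collections import deque


class RateLimitedIndexer:
    """Drains a queue of document IDs without exceeding max_calls per window."""

    def __init__(self, index_fn, max_calls, per_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.index_fn = index_fn
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self.clock = clock
        self.sleep = sleep
        self.queue = deque()
        self.recent_calls = deque()  # timestamps of calls inside the window

    def enqueue(self, doc_id):
        self.queue.append(doc_id)

    def drain(self):
        processed = []
        while self.queue:
            now = self.clock()
            # Forget calls that have fallen out of the sliding window.
            while self.recent_calls and now - self.recent_calls[0] >= self.per_seconds:
                self.recent_calls.popleft()
            if len(self.recent_calls) >= self.max_calls:
                # Wait until the oldest call leaves the window, then re-check.
                self.sleep(self.per_seconds - (now - self.recent_calls[0]))
                continue
            doc_id = self.queue.popleft()
            self.index_fn(doc_id)
            self.recent_calls.append(now)
            processed.append(doc_id)
        return processed


class FakeClock:
    """Test double: sleeping advances time immediately."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds


fake = FakeClock()
indexed = []
indexer = RateLimitedIndexer(indexed.append, max_calls=2, per_seconds=1.0,
                             clock=fake.time, sleep=fake.sleep)
for doc_id in ["d1", "d2", "d3", "d4", "d5"]:
    indexer.enqueue(doc_id)
processed = indexer.drain()
print(processed, fake.slept)
```

In production the defaults (`time.monotonic` and `time.sleep`) would be used, and `index_fn` would run the conversion, chunking and embedding steps for one document.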

Retrieval and question answering

To answer a question, we perform the following steps:

  • Access control: we determine which documents the user has access to at query time. If a user has lost access to an indexed document, we make sure that document is not used for the answer.
  • Embedding: we compute an embedding vector for the query.
  • Classification: we run the query through a classifier (currently LLM-based) to determine the query intent and route the query appropriately to different agents.
  • Retrieval: we have built several (somewhat redundant) document retrieval pipelines, namely: K-nearest neighbours in embedding space, keyword search using the Confluence API, and text search using Elastic. We can combine the results of several pipelines in order to rank the best candidate chunks and documents. We determine the weighting and thresholds through systematic, continuous evaluation of the system.
  • De-chunking: when multiple candidate chunks for the same document are retrieved, we partially recompose the original document by placing the chunks in their original order.
  • Fact-extraction: to help prevent hallucination and obtain higher-quality answers, we issue a separate query to the LLM for each candidate document in order to extract the most relevant excerpts and more clearly separate the sources. This introduces additional latency, so we are currently experimenting with skipping this step when the number of candidate documents is low. This is viable because we have fine-tuned our generation prompts and have noticed that newer models follow instructions better.
  • Answer generation: finally, we feed the relevant excerpts and the original query to the LLM in order to generate an answer. This answer contains references to the source documents.
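
The access control, result combination and de-chunking steps can be sketched as follows. The post does not say how the pipelines are combined, so this example swaps in weighted reciprocal rank fusion, a common general-purpose technique. The `Chunk` type, the pipeline names, the weights, the threshold and the constant `k = 60` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Chunk(NamedTuple):
    doc_id: str
    position: int  # index of the chunk within its source document
    text: str


def fuse(rankings, weights, allowed_docs, k=60):
    """Weighted reciprocal rank fusion over chunks the user may read."""
    scores = defaultdict(float)
    for pipeline, ranked in rankings.items():
        # Access control: drop chunks from documents the user cannot see.
        visible = [c for c in ranked if c.doc_id in allowed_docs]
        for rank, chunk in enumerate(visible, start=1):
            scores[chunk] += weights.get(pipeline, 1.0) / (k + rank)
    return dict(scores)


def select(scores, threshold):
    """Keep chunks scoring at or above the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, score in ranked if score >= threshold]


def dechunk(chunks):
    """Regroup chunks by document and restore their original order."""
    by_doc = defaultdict(list)
    for chunk in chunks:
        by_doc[chunk.doc_id].append(chunk)
    return {
        doc_id: "\n\n".join(c.text for c in sorted(group, key=lambda c: c.position))
        for doc_id, group in by_doc.items()
    }


# Hypothetical retrieval results, for illustration only.
a1 = Chunk("doc-a", 0, "Intro to deploys.")
a2 = Chunk("doc-a", 3, "Rollback steps.")
b1 = Chunk("doc-b", 1, "Restricted content.")
c1 = Chunk("doc-c", 2, "Unrelated note.")

rankings = {"knn": [a2, b1, a1], "text": [a1, a2, c1]}
weights = {"knn": 1.0, "text": 0.5}
allowed_docs = {"doc-a", "doc-c"}  # the user has no access to doc-b

scores = fuse(rankings, weights, allowed_docs)
selected = select(scores, threshold=0.01)
excerpts = dechunk(selected)
print(excerpts)
```

Rank-based fusion is used here because scores from different pipelines (vector distance, keyword relevance) are not directly comparable; a system with calibrated scores could combine them differently.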

Just the beginning

This is an overview of our first steps, from prototype in early 2023 to initial release in June. We plan to go into a lot more detail in future posts about security, privacy, quality, and anything else worth writing about.
