Zen and the art of Q.A.

Our initial RAG-based question answering system worked surprisingly well out of the box, a testament to how good foundation models are. They are also fairly brittle: a change to a prompt can cause a whole category of questions to suddenly be answered incorrectly.

In order to make sure we don't break something as we improve the system, we run a battery of questions with each commit affecting prompts or other parts of the question-answering pipeline. This is useful as a sort of smoke test before we are tempted to move any changes to production, but it's not a good benchmark to measure the quality of our answers in general.

We want to make sure Connie AI is able to answer a wide variety of question types, from those for which the answer is readily available in one of the source documents, to those that require a reasonable amount of reasoning. Perhaps most importantly, we want Connie to say "I don't know" when there is no information in the Confluence space that answers the user's question. Confident assertion of made-up facts ("hallucinations") has been widely reported in foundation models.

A popular dataset to benchmark question-answering models is the Stanford Question Answering Dataset (SQuAD). It includes more than 100,000 questions posed by crowdworkers on a set of Wikipedia articles. In its second version, it also includes 50,000+ unanswerable questions, which is a good way to detect undue eagerness to respond.
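As a rough illustration of how the unanswerable questions come into play: in SQuAD 2.0, a prediction for an unanswerable question counts as correct when it is the empty string. The sketch below maps an abstention to an empty prediction before scoring. It is an assumption about how this could be done, not a description of Connie's pipeline; the phrase list and the name `to_squad_prediction` are hypothetical.

```python
# Hypothetical abstention phrases; a real system would match its own wording.
ABSTENTION_PHRASES = ("i don't know", "i do not know", "i couldn't find")


def to_squad_prediction(answer: str) -> str:
    """Return '' when the answer is an abstention, else the trimmed answer."""
    # Normalize case and curly apostrophes before comparing.
    normalized = answer.strip().lower().replace("\u2019", "'")
    if any(normalized.startswith(phrase) for phrase in ABSTENTION_PHRASES):
        return ""
    return answer.strip()
```

With this mapping, "I don't know." becomes an empty prediction, while an answer such as "a + bi" passes through unchanged.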

One of the challenges of evaluating with a dataset like SQuAD is that non-fine-tuned foundation models tend to give verbose answers. SQuAD's official scoring script computes two scores: F1 and exact. The latter is 1.0 when the prediction exactly matches one of the gold answers specified in the dataset, and 0.0 otherwise. The F1 score is the harmonic mean of a precision score (the percentage of words in the prediction that are in the gold answer) and a recall score (the percentage of words in the gold answer that match words in the prediction). When there are multiple gold answers, the highest score is taken.

While the foundation models we use often paraphrase the question when giving the answer (e.g. Q: "What form do complex Gaussian integers have?" A: "Complex Gaussian integers have the form a + bi, where a and b are integers and i is the imaginary unit."), the gold answers in SQuAD tend to be terse (G: "a + bi"). Thus, while our system is giving a correct answer (arguably a better answer), the exact score is usually 0.0 and the F1 score drops due to low precision.
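To make the effect concrete, here is a minimal sketch of the two scores. It is a simplified stand-in for the official script, not a copy of it: the tokenization here only lowercases and strips ASCII punctuation, and the function names are our own.

```python
import string
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, split on whitespace (simplified)."""
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return cleaned.split()


def exact_score(prediction: str, golds: list[str]) -> float:
    """1.0 if the prediction matches any gold answer token for token."""
    pred = tokenize(prediction)
    return 1.0 if any(pred == tokenize(gold) for gold in golds) else 0.0


def f1_single(prediction: str, gold: str) -> float:
    """Harmonic mean of word-level precision and recall for one gold answer."""
    pred_tokens, gold_tokens = tokenize(prediction), tokenize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_score(prediction: str, golds: list[str]) -> float:
    """Best F1 over all gold answers."""
    return max(f1_single(prediction, gold) for gold in golds)


verbose = ("Complex Gaussian integers have the form a + bi, where a and b "
           "are integers and i is the imaginary unit.")
print(exact_score(verbose, ["a + bi"]), f1_score(verbose, ["a + bi"]))
```

For the verbose answer, recall is perfect (both gold words appear) but only 2 of the 20 predicted words are in the gold answer, so F1 comes out at roughly 0.18 and exact is 0.0. The terse answer "a + bi" scores 1.0 on both.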

To make SQuAD more useful for us, we needed to make some tweaks. We didn't want to taint the evaluation dataset or the evaluation script with changes, so the initial solution we came up with was a pre-processing script that removes any paraphrasing or repetition of the question from our system's answers.

Our script uses spaCy to tag parts of speech and parse dependencies in the question, and then matches its structure against rules. One such rule is, for example:

      # E.g. Who did King David I of Scotland marry?
      { # anchor: verb
       'RIGHT_ID': 'anchor',
       'RIGHT_ATTRS': {'DEP': 'ROOT', 'POS': 'VERB'}
      },
      { # Auxiliary verb
       'LEFT_ID': 'anchor',
       'REL_OP': '>--',
       'RIGHT_ID': 'aux',
       'RIGHT_ATTRS': {'DEP': 'aux'}
      },
      { # Wh-pronoun as a dependency of the auxiliary verb
       'LEFT_ID': 'aux',
       'REL_OP': '>--',
       'RIGHT_ID': 'wh-question',
       'RIGHT_ATTRS': {'DEP': 'nsubj', 'TAG': 'WP'}
      },
      { # Subject
       'LEFT_ID': 'anchor',
       'REL_OP': '>--',
       'RIGHT_ID': 'subject',
       'RIGHT_ATTRS': {'DEP': 'nsubj'}
      },

Using a spaCy matcher we obtain a basic pattern for the question's syntax, and we use the different parts to rewrite the question in the form of a sentence:

  from itertools import chain

  # Input: Who did King David I of Scotland marry?
  q_doc = nlp(question)
  matches = q_matcher(q_doc)
  if matches:
    # Tokens captured by the first matching rule
    tokens = [q_doc[i] for i in matches[0][1]]
    # Full subject phrase, skipping a bare wh-pronoun
    subj = [*next(t for t in tokens if t.dep_ in ['nsubj', 'nsubjpass'] and t.tag_ != 'WP').subtree]
    if subj[0].tag_ in ['WP', 'WDT']:
      # Replace a leading wh-word with "The ... that" (FakeToken is a helper not shown here)
      subj = chain([FakeToken('The')], subj[1:], [FakeToken('that')])
    verb = next(t for t in tokens if t.dep_ == 'ROOT')
    # Optional auxiliary and negation attached to the main verb
    aux = next(([t] for t in verb.children if t.dep_ in ['aux', 'auxpass']), [])
    neg = next(([t] for t in verb.children if t.dep_ == 'neg'), [])
    # Everything to the right of the verb, except the question mark
    complements = chain(*[t.subtree for t in verb.rights if t.lemma_ != '?'])
    rewrite = list(chain(subj, aux, neg, [verb], complements))
    a_doc = nlp(answer)
  # Output: [King David I of Scotland](SUBJ) [did](AUX) [marry](VERB)

Finally, we use the rewritten question to remove the matching text from the start of our system's answer before evaluation. Matching considers each word's lemma (its base form), not just the exact text.

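    a_i = 0
    r_i = 0
    while a_i < len(a_doc) and r_i < len(rewrite):
      a_token = a_doc[a_i]
      r_token = rewrite[r_i]
      if a_token.text == r_token.text or a_token.lemma_ == r_token.lemma_:
        # Same word or same lemma: consume both tokens
        a_i += 1
        r_i += 1
      elif r_token.dep_ == 'aux' or r_token.pos_ == 'DET':
        # The answer may drop auxiliaries and determiners: skip them in the rewrite
        r_i += 1
      else:
        # Mismatch: stop here, otherwise the loop would never advance
        break
    new_a_doc = a_doc[a_i:].as_doc()
    answer = new_a_doc.text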

Since our system also includes citations for the sources of the answer (in the form [1]), this script also removes them from the output. While crude, this system increased our initial scores by more than 10 points overall. Going forward, we're planning to improve it by using an LLM to produce more concise answers for evaluation purposes.
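The citation clean-up can be sketched with a regular expression. This is a minimal illustration that assumes citations are always a bracketed number such as [1]; the name `strip_citations` is hypothetical and the actual script may handle more cases.

```python
import re

# Bracketed numeric markers such as [1] or [12], plus any whitespace before them.
CITATION_RE = re.compile(r"\s*\[\d+\]")


def strip_citations(answer: str) -> str:
    """Remove [n] citation markers from an answer."""
    return CITATION_RE.sub("", answer).strip()


print(strip_citations("The form is a + bi [1]."))
```

Consuming the whitespace before each marker keeps the punctuation that follows attached to the preceding word, so the example prints "The form is a + bi." rather than leaving a stray space before the period.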

Our current evaluation setup includes a Confluence site with all of the Wikipedia articles that are used for the publicly available section of SQuAD, along with an evaluation pipeline that includes the pre-processing steps described above. We also have a manually created set of questions based on our own company's Confluence space content, and test cases we've been adding as we refine Connie's abilities. For example, we have documents with tables, and corresponding evaluation questions that verify that Connie is able to extract information from those tables.

While SQuAD gives us an external baseline of general text retrieval and question-answering performance, our own test cases keep us from regressing on more advanced features such as responses that involve multiple source documents, formatted content, mentions, attachments, etc.

In future articles, we'll talk about our internal evaluation tool, Gaucho, which allows us to execute and store evaluation runs, and visualize and analyze the performance of our pipeline.
