LLM Latency

The latency to complete inference with a large language model (LLM) such as OpenAI's GPT can be uncomfortably high. Multiply that by the need to make several LLM requests in sequence and you have a recipe for a poor user experience.

You have probably noticed or read about how ChatGPT, for example, streams its inference response, specifically to make the user experience of a high-latency operation more bearable. By showing steady progress right away, instead of the traditional spinner, a request that may take 27 seconds feels far more responsive:

[Video: ChatGPT streaming its response]

Connie AI does not yet stream the tokens of its answer (although we may add that in the future). We do, however, use response streaming via the recently released AWS Lambda response streaming feature to make waiting for your answers less unpleasant:

[Video: Connie AI streaming progress updates while it answers a question]

A lot is happening behind the scenes while Connie AI figures out the final answer to your question. There are a few steps you can see in the video above (sketched in code after the list):

  1. Classify the request; for example, we differentiate Q&A requests from document enumeration requests (notice the ? icon appearing)
  2. Search for and retrieve relevant information in that space from our index (notice the text mentioning the number of relevant documents found)
  3. Extract relevant facts from the documents. This drastically cuts down on hallucinations and lets us merge information from a variety of documents (notice the text mentioning the number of facts found)
  4. Compose a natural language answer with references
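
To make the flow concrete, here is a minimal sketch of how a pipeline like this could emit a progress update after each stage. The helper functions (classifyRequest, searchIndex, extractFacts, composeAnswer) and the onProgress callback are illustrative assumptions, not Connie AI's actual implementation:

```javascript
// A rough sketch of the four stages above. Every name here is hypothetical;
// the point is that each stage reports progress as soon as it finishes.
async function answerQuestion(question, spaceId, onProgress) {
  // 1. Classify the request (e.g. Q&A vs. document enumeration)
  const requestType = await classifyRequest(question);
  onProgress({ step: 'classified', requestType });

  // 2. Search for and retrieve relevant documents for this space from the index
  const documents = await searchIndex(spaceId, question);
  onProgress({ step: 'retrieved', documentCount: documents.length });

  // 3. Extract relevant facts from those documents
  const facts = await extractFacts(documents, question);
  onProgress({ step: 'facts', factCount: facts.length });

  // 4. Compose a natural language answer with references
  const answer = await composeAnswer(question, facts);
  onProgress({ step: 'answer', answer });

  return answer;
}
```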

What is neat about the way Connie AI does this is just how simple it is: it is a single HTTPS request. As mentioned earlier, we make use of AWS Lambda response streaming: https://aws.amazon.com/blogs/compute/introducing-aws-lambda-response-streaming/

There is no need to set up WebSockets or server-sent events. The client makes the usual fetch request, reads the response chunks as they arrive, and updates the UI. Our backend code stays very simple as well: we have a Lambda function wrapped with exports.handler = awslambda.streamifyResponse(async (event, responseStream) => {}), and any time we want to update the client with some data we call responseStream.write('data').
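
Putting those two pieces together, a minimal version of the handler and the client could look roughly like the sketch below. It reuses the hypothetical answerQuestion pipeline from earlier and assumes updates are sent as newline-delimited JSON; only awslambda.streamifyResponse, responseStream.write, and responseStream.end come from the actual Lambda response streaming API.

```javascript
// Server: Lambda handler using response streaming (Node.js runtime).
// answerQuestion is the hypothetical pipeline sketched earlier.
exports.handler = awslambda.streamifyResponse(async (event, responseStream) => {
  const { question, spaceId } = JSON.parse(event.body || '{}');

  // Write one JSON line per progress update so the client can render it
  // the moment the chunk arrives.
  await answerQuestion(question, spaceId, (update) => {
    responseStream.write(JSON.stringify(update) + '\n');
  });

  // Close the stream once the final answer has been written.
  responseStream.end();
});
```

On the client, it really is just a fetch call that reads the body as a stream. The endpoint URL and the updateUI function are placeholders, and the chunk handling below assumes each chunk arrives as whole lines, which is good enough for a sketch:

```javascript
// Client: read the streamed response chunk by chunk and update the UI.
async function askConnie(question, spaceId) {
  const response = await fetch('https://example.lambda-url.us-east-1.on.aws/', {
    method: 'POST',
    body: JSON.stringify({ question, spaceId }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Each update was written as a newline-delimited JSON object.
    for (const line of decoder.decode(value, { stream: true }).split('\n')) {
      if (line.trim()) updateUI(JSON.parse(line)); // updateUI is a placeholder
    }
  }
}
```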


If you'd like to be notified of new posts and announcements from us about Connie AI and future products, you can subscribe below: