Evaluating LLM Responses

You may have noticed in the previous lesson that you ran a unit test that validated the interaction with the LLM.

Unit tests are not only a convenient way to test individual elements of an application. They also provide an automated way to test the response.

The actual response, however, can be more challenging to validate.

LLM Evaluation

The test uses an evaluation chain to determine whether the response from the LLM has answered the original question. This is a take on a technique called LLM as Judge, where an LLM is asked to judge the quality of a response.

Before running any tests, the beforeAll() function creates an instance of the following chain.

typescript
Answer Evaluation Chain
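evalChain = RunnableSequence.from([
  // 1. Prompt asking whether the response answers the question
  PromptTemplate.fromTemplate(`
    Does the following response answer the question provided?

    Question: {question}
    Response: {response}

    Respond simply with "yes" or "no".
  `),
  // 2. The LLM acts as the judge
  llm,
  // 3. Convert the LLM output to a plain string
  new StringOutputParser(),
]);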
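To make the shape of the evaluation clearer, here is a minimal sketch of the same idea written without LangChain. The helpers buildEvalPrompt and parseVerdict are hypothetical and are not part of the course repository. They only illustrate how the prompt might be filled in and how a raw "yes" or "no" reply could be normalized before it is asserted on.

```typescript
// Hypothetical helpers for illustration only; not part of the course repository.

type Verdict = "yes" | "no" | "unclear";

/** Fill the evaluation prompt with a question and a response. */
function buildEvalPrompt(question: string, response: string): string {
  return [
    "Does the following response answer the question provided?",
    "",
    `Question: ${question}`,
    `Response: ${response}`,
    "",
    'Respond simply with "yes" or "no".',
  ].join("\n");
}

/**
 * Reduce the judge's raw text to a verdict.
 * Tolerates surrounding whitespace, mixed case, quotes and trailing punctuation.
 * Anything other than a bare yes/no is reported as "unclear".
 */
function parseVerdict(raw: string): Verdict {
  const cleaned = raw
    .trim()
    .toLowerCase()
    .replace(/^["']+|["'.!]+$/g, "");
  if (cleaned === "yes") return "yes";
  if (cleaned === "no") return "no";
  return "unclear";
}
```

Treating anything other than a bare yes or no as "unclear" is a design choice in this sketch. It surfaces replies where the model ignored the instruction to respond simply, rather than guessing at what it meant.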

Each test in the suite uses this chain to evaluate the answer provided in the test.

typescript
Evaluating an answer
const evaluation = await evalChain.invoke({ question, response });

expect(`${evaluation.toLowerCase()} - ${response}`).toContain("yes");

The assertion concatenates the lowercased output of the evaluation chain with the original response. If the evaluation does not return yes, the response appears in the failed test output, which helps you debug the issue.
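One thing to be aware of with this approach: because the asserted string also contains the response, a response that itself includes the text "yes" would satisfy toContain("yes") even when the evaluation is "no". If that matters for your tests, one possible alternative is to check the verdict on its own and put the response in the failure message instead. The sketch below uses a hypothetical assertAnswered helper, which is not part of the course repository.

```typescript
// Hypothetical helper for illustration only; not part of the course repository.

/**
 * Throw unless the evaluation starts with "yes".
 * The response is only used in the error message, so its content
 * cannot influence whether the check passes.
 */
function assertAnswered(evaluation: string, response: string): void {
  const verdict = evaluation.trim().toLowerCase();
  if (!verdict.startsWith("yes")) {
    // Keep the debugging aid: show the response that was judged.
    throw new Error(`Evaluation was "${verdict}" for response: ${response}`);
  }
}
```

Inside a test, calling assertAnswered(evaluation, response) after invoking the evaluation chain would fail the test in the same situations as the original assertion, minus the false positive described above.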

Prompt Iteration

As you develop your application and test different models, you may need to modify the prompt iteratively until the model returns a consistent response. Running the tests in watch mode will allow you to receive instant feedback.
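One rough way to put a number on "consistent" is to run the same evaluation several times and measure how often the verdicts agree. The agreementRate helper below is a hypothetical sketch under that assumption and is not part of the course repository. It takes the verdicts you have collected and returns the share that match the most common one.

```typescript
// Hypothetical helper for illustration only; not part of the course repository.

/**
 * Return the fraction of verdicts that match the most common verdict.
 * Verdicts are compared after trimming whitespace and lowercasing.
 * An empty list returns 0.
 */
function agreementRate(verdicts: string[]): number {
  if (verdicts.length === 0) return 0;

  const counts: Record<string, number> = {};
  let top = 0;

  for (const v of verdicts) {
    const key = v.trim().toLowerCase();
    const next = (counts[key] || 0) + 1;
    counts[key] = next;
    if (next > top) top = next;
  }

  return top / verdicts.length;
}
```

For example, you could call evalChain.invoke({ question, response }) a handful of times, collect the returned strings in an array, and pass that array to agreementRate. A value close to 1 suggests the prompt is producing a stable verdict for that input.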

This repository includes an npm run test:watch command.

Watching for changes
npm run test:watch

Check your understanding

Purpose of LLM Evaluation

What is the purpose of using a unit test in the context of evaluating LLM responses?

  • ❏ To manually review the responses for accuracy.

  • ✓ To automate the testing of individual elements of an application and its interaction with the LLM.

  • ❏ To replace the need for an evaluation chain in determining the quality of LLM responses.

  • ❏ To solely test the computational efficiency of the LLM.

Hint

The correct answer focuses on automating tests for application components, especially their interaction with an LLM, to ensure they work as expected.

Solution

You can use unit tests to automate the testing of individual elements of an application and its interaction with the LLM.

Summary

In this lesson, you learned how to run automated tests that evaluate the response generated by your chains.

In the next module, you will learn how to use conversation history to rephrase questions and provide more accurate responses.