How to Evaluate Chat Quality Using Standard NLP Benchmarks

SDA Research
Oct 18, 2021 · 4 min read


Proxy Indicators for the Quality of Open-domain Dialogues

Written by Rostislav Nedelchev with feedback from Prof. Ricardo Usbeck and Prof. Jens Lehmann.

Our previous article briefly introduced chatbots and their automatic evaluation using language models like GPT2, BERT, and XLNet. There, we discussed the importance of open-domain dialogue systems and showed that the probabilities inferred by a language model (LM) correlate with human evaluation scores, which makes them suitable for estimating dialogue quality.

As discussed earlier, there are various criteria along which we judge a conversation. For example, a response can use fluent language but be completely incoherent, i.e., unrelated to the preceding context. LM-based evaluation is practical since it does not need supervision. However, it has one major downside: it yields only a single score for the overall quality, without providing insight into individual criteria like fluency or coherence.

Fear not, we have a solution to this challenge! However, before we get to it, we need to cover some groundwork. What is one of the most important concepts in Artificial Intelligence and Machine Learning? That is right: benchmarking. We come across it in almost every subfield of AI and ML. In Natural Language Processing, one popular benchmark has driven much of the research and development of language models like BERT, XLNet, and many others: the General Language Understanding Evaluation (GLUE) benchmark, which provides resources for training, evaluating, and analyzing natural language understanding systems. It contains a total of eleven tasks that aim to assess a system’s language understanding abilities. The most famous example is sentiment analysis, where systems have to flag sentences as positive or negative. Another example is duplicate question detection, where systems have to decide whether two questions are semantically equivalent.

How does GLUE help us to evaluate dialogue quality? A machine learning model that performs well on CoLA, the GLUE task of judging whether a sentence is linguistically acceptable, can infer the fluency of a response. Similarly, we use pair-sentence tasks like RTE or STSB to check whether a response is coherent with its preceding dialogue context. All of the pair-sentence tasks look for various semantic relations that can model dialogues as well. We do not use WNLI and MNLI since they cannot be easily matched to the problem of dialogue evaluation.

First, we need models that have been trained on the GLUE tasks. We took a shortcut by re-using fine-tuned BERT instances that are part of the TextAttack framework and available on the HuggingFace Model Hub. We use BERT because it is an established approach. However, we think one should be able to use any neural architecture. To validate the idea, we ran experiments on data from knowledge-based conversations that were evaluated by humans. The assessment involved six criteria: Understandable, Natural, Maintains Context, Interesting, Uses Knowledge, and Overall Quality. We then took the probabilities predicted by the models and ran a correlation analysis against the human annotations.
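To make the scoring step more concrete, here is a minimal sketch of how one might obtain a probability from a fine-tuned GLUE classifier. It is an illustration under stated assumptions, not the code used in the study: the Hub identifier in `COLA_MODEL`, the helper names `softmax` and `score_pair`, and the choice of index 1 as the "positive" class are all assumptions. The model call needs the third-party `transformers` and `torch` packages, which are imported lazily inside `score_pair`.

```python
import math

# Assumed Hub identifier for a TextAttack BERT checkpoint fine-tuned on CoLA.
COLA_MODEL = "textattack/bert-base-uncased-CoLA"


def softmax(logits):
    """Turn raw classifier logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def score_pair(model_name, first, second=None, positive_index=1):
    """Probability of the positive class for a sentence or a sentence pair.

    `positive_index=1` is an assumption about the label order of the
    checkpoint; check the model card before relying on it.
    """
    # Lazy imports: only needed when a real model is actually queried.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    # For single-sentence tasks `second` stays None; for pair tasks it
    # holds the response and `first` holds the context.
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return softmax(logits)[positive_index]
```

A fluency estimate for a response would then be something like `score_pair(COLA_MODEL, response)`, and a coherence estimate would pass the dialogue context as `first` and the response as `second` to a pair-sentence checkpoint. STSB is a regression task with a single output, so it would need the raw model output instead of a softmax.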

As expected, we obtain Pearson’s and Spearman’s correlation coefficients of up to 0.7. For example, the “STSB” scores and the criterion “Uses Knowledge” have Pearson and Spearman correlations of 0.7329 and 0.7173, respectively. Also, using “STSB”, we obtain correlations of 0.3620 and 0.3463 with “Maintains Context”. But this is not all!
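For readers who want to reproduce this kind of analysis on their own annotations, the two coefficients can be sketched in plain Python as follows. The function names are our own, and the implementation is a textbook version (Spearman as Pearson on average ranks), not the exact routine used for the numbers above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Here `xs` would hold the model scores for a set of responses and `ys` the human ratings for one criterion.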

Now we have quality indicators for the individual criteria. It would also be great to have a mixture of them that indicates the overall quality, since no single indicator correlated that well with it. We ran a linear regression using the scores of the individual GLUE tasks as independent variables and the Overall Quality criterion as the target. The composite indicator reaches correlation coefficients of almost 0.5, whereas the best single GLUE task reaches only 0.4.
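The idea of a composite indicator can be sketched with an ordinary least-squares fit, where each row holds the task scores for one response and the target is the human Overall Quality rating. This is a dependency-free illustration with hypothetical function names; it does not claim to match the solver or settings used in the study.

```python
def fit_linear_regression(features, targets):
    """Least-squares weights via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w_1, ..., w_k] for k input features.
    """
    # Prepend a constant 1.0 so the first weight acts as the intercept.
    rows = [[1.0] + list(row) for row in features]
    n_cols = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = []
    for i in range(n_cols):
        line = [sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
        line.append(sum(r[i] * y for r, y in zip(rows, targets)))
        aug.append(line)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_cols):
        pivot = max(range(col, n_cols), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [value / pivot_value for value in aug[col]]
        for r in range(n_cols):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n_cols] for i in range(n_cols)]


def predict(weights, row):
    """Composite score for one response from its task scores."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))
```

The predictions of such a model can then be correlated with the human Overall Quality ratings in the same way as the single-task scores.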

All these correlation coefficients indicate a relationship between the variables. In the last example, a value of 0.5 means that, once both quantities are expressed in standard deviations, a one-unit increase in the human Overall score goes along with an average increase of half a unit in the output of the linear regression. In other words, if the evaluators’ score rises by one standard deviation, the model’s score is expected to rise by about half a standard deviation.
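A short sketch of how a correlation coefficient relates to a slope may help here: the two coincide only when both variables have the same spread, because the slope of a simple regression equals the correlation multiplied by the ratio of the standard deviations. The function names and the toy numbers in the usage note are ours and purely illustrative.

```python
from statistics import mean, pstdev


def covariance(xs, ys):
    """Population covariance of two equally long lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    return mean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))


def ols_slope(xs, ys):
    """Slope of the least-squares line that predicts ys from xs."""
    return covariance(xs, ys) / pstdev(xs) ** 2
```

With `xs = [1, 2, 3, 4, 5]` and `ys = [2, 1, 4, 3, 5]`, both lists have the same spread, so the correlation and the slope agree. Multiplying every `ys` value by 10 leaves the correlation unchanged but makes the slope ten times larger.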

In the figure below, we visualized the learned weights from the linear regression.

We see that many of the tasks have an influence, and some of them a fairly strong one. The weights with a “fact” prefix belong to scores computed between the conversation’s knowledge base and the target response. Semantic overlap tasks like MRPC and STSB play the strongest role, both when measured against the dialogue context and when measured against the knowledge base.
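One way to picture how the regression inputs are organized: each pair-sentence task contributes one score against the dialogue context and one, carrying a “fact” prefix, against the knowledge base. The sketch below is hypothetical throughout. The function `build_features`, the exact key format `fact_<task>`, and the toy scorer `word_overlap` (a stand-in for a real fine-tuned model) are our assumptions, not the paper's implementation.

```python
def build_features(context, knowledge, response, single_scorers, pair_scorers):
    """Collect one named score per task for a single response.

    single_scorers: task name -> function(response) -> float
    pair_scorers:   task name -> function(first, second) -> float
    """
    features = {}
    for task, scorer in single_scorers.items():
        # Single-sentence tasks (e.g. CoLA) look at the response alone.
        features[task] = scorer(response)
    for task, scorer in pair_scorers.items():
        # Pair tasks are scored once against the dialogue context ...
        features[task] = scorer(context, response)
        # ... and once against the knowledge base ("fact" prefix).
        features["fact_" + task] = scorer(knowledge, response)
    return features


def word_overlap(first, second):
    """Toy stand-in for a pair-sentence model: Jaccard overlap of word sets."""
    a, b = set(first.lower().split()), set(second.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

In a real setup, the scorers would wrap fine-tuned GLUE models, and the resulting dictionary would supply one row of inputs for the linear regression.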

While the correlation coefficients are decent, there is still room for improvement. In the future, we will look into other such benchmarks that could provide additional gains. So until we close that gap, be sure to follow us here on the blog so as not to miss any news!

Link to paper:
Link to code:




The Smart Data Analytics (SDA) research group at the University of Bonn works on #semantics, #machinelearning and #bigdata.