Evaluating Chit-Chat Using Language Models

SDA Research
Nov 4, 2020


Language Model Transformers as Evaluators for Open-domain Dialogues

This blog post was written by Rostislav Nedelchev.

Dialogue systems, nowadays more commonly referred to as chatbots, have been around since the 1960s. One of the first well-known examples is ELIZA by Joseph Weizenbaum, which used keyword matching and rules to mimic a Rogerian psychotherapist. Since then, the research field has evolved massively, and dialogue systems are now part of everyday life. They are widely used in voice assistants like Siri or Alexa, and in chatbots on social media platforms that help us book a restaurant table or provide support when problems arise. They also play an increasingly important role in industrial dialogue settings.

But how do you know that the chatbot you developed actually works? Task-oriented chatbots (e.g., for booking a flight) are usually component-based. Their task is broken down into subtasks, which makes automated evaluation possible. Even so, developing such systems still involves rigorous testing by real people as a final stage. Non-task-oriented chatbots (e.g., for small talk or chit-chat) need evaluation as well. So far, the best option available is to compare their responses against a reference. However, in an informal conversation there is often more than one acceptable answer. To date, no one has managed to create a tool, a method, or an algorithm that can measure how well these programs converse.

For many, this may sound like an automated Turing test (also called an “imitation game” by Alan Turing himself). However, such an evaluation requires two significant capabilities. First, one must be able to judge whether a dialogue meets certain quality criteria, for example, whether the conversation is fluent (correct language usage) or coherent (each response is relevant to the current context). Second, one must be able to hold a conversation, i.e., generate a response that meets those criteria. Since we do not have a system that can do the latter properly (at least not yet), we cannot automate the Turing test. Instead, we focus on measuring a dialogue’s fluency and coherence.

But to know what a fluent and coherent dialogue is, does one not need to know how to converse well?
Not quite! It is well known that reading books helps with mastering a language, whether as a native or a non-native speaker. In essence, this is more or less what language models (LMs) like BERT (Devlin et al., 2018), GPT2 (Radford et al., 2019), or XLNet (Yang et al., 2019) do. They “read” many articles from news websites or Wikipedia, and by doing so, they “acquire knowledge” of the language they consume. However, none of them has learned to participate in a dialogue.

In our work “Language Model Transformers as Evaluators for Open-domain Dialogues”, we show that language models “have a feeling” for what a coherent and fluent dialogue might be. They acquired that “sense” just by “reading books.” In simple terms, language models have learned to guess the most likely word(s) given a specific context. Each of the three models above does that in its own way. We wanted to find out whether their “skill” can be a good indicator of the quality of a conversation.
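To make the idea of “guessing the most likely word given a context” concrete, here is a minimal sketch in plain Python. It is not the approach from the paper: a tiny bigram model trained on a made-up four-sentence corpus stands in for BERT, GPT2, or XLNet, and the names `CORPUS`, `train_bigram`, and `response_score` are hypothetical. It only illustrates how a model that predicts words from left to right can assign a likelihood score to a response.

```python
import math
from collections import Counter

# Made-up toy corpus standing in for the text a real LM "reads".
CORPUS = [
    "hello how are you",
    "i am fine thank you",
    "how are you today",
    "i am doing well today",
]


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-separated tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def response_score(context, response, unigrams, bigrams):
    """Average log-probability of the response tokens, read left to right.

    Uses add-one smoothing; one extra vocabulary slot is reserved for
    unseen words. Only the last context word is used, because a bigram
    model cannot see further back.
    """
    vocab_size = len(unigrams) + 1
    context_tokens = context.split()
    prev = context_tokens[-1] if context_tokens else "<s>"
    tokens = response.split()
    total = 0.0
    for tok in tokens:
        prob = (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size)
        total += math.log(prob)
        prev = tok
    return total / len(tokens)


unigrams, bigrams = train_bigram(CORPUS)
fluent = response_score("how are you", "i am fine thank you", unigrams, bigrams)
scrambled = response_score("how are you", "you fine am thank i", unigrams, bigrams)
print(f"fluent response:    {fluent:.3f}")
print(f"scrambled response: {scrambled:.3f}")
```

The well-formed response receives a higher (less negative) score than the scrambled one. Note that the bigram model only looks at the previous word, whereas a transformer LM conditions on a much longer context.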

We asked language models how “probable” responses in dialogues are. For our tests, we used the dialogues of the systems that participated in the ConvAI1 and ConvAI2 challenges. We then checked whether there is a correlation between the LMs’ “likeliness score” and the human annotator evaluations. It turned out that there is (some)! Depending on the LM and dialogue dataset used, we found positive correlation coefficients (Pearson’s and Spearman’s) ranging between 0.13 and 0.49, with high statistical significance. BERT’s Next Sentence Prediction (NSP) performs best, since it works at the utterance level rather than the token level. It is followed by XLNet, which uses position information for each target word. Finally, GPT2 comes last with its standard left-to-right, word-by-word prediction.
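For readers who want to see what such a correlation check looks like, below is a small self-contained sketch of Pearson’s and Spearman’s coefficients in plain Python. The scores and ratings in it are invented purely for illustration and are not results from the paper; the function names are likewise hypothetical. The sketch does not compute statistical significance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes both inputs are non-constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented example data: one LM score and one human rating per response.
lm_scores = [-3.2, -1.1, -2.5, -0.9, -4.0, -1.8]
human_ratings = [2, 4, 2, 5, 1, 3]

print(f"Pearson:  {pearson(lm_scores, human_ratings):.2f}")
print(f"Spearman: {spearman(lm_scores, human_ratings):.2f}")
```

Spearman’s coefficient only looks at the ordering of the values, so it is less sensitive than Pearson’s to the scale on which an LM reports its scores.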

That’s amazing! So, if language models “have an opinion of their own,” did you, besides asking them to score dialogues, also ask them what a good response would be from their perspective?

Yes, we did! Two of them, GPT2 and XLNet, are capable of generating whole sentences. So, we asked the two to generate their own responses for the conversations from ConvAI1 and ConvAI2. While their responses were not entirely fluent, they were understandable and made sense in the context. Furthermore, the probability scores of these hypothetical responses correlated even more strongly with the human annotator scores than the likeliness score mentioned earlier. Depending on the LM and dataset, the correlation increased by about 0.05 on average. So, does that mean that LMs are better than the ConvAI1 and ConvAI2 systems? Maybe! Both competitions took place before the dawn of transformer LMs, so such a comparison would be unfair.

At first sight, the sampled generated responses appear far from good. However, if one removes the first and last tokens, one gets a perfect response.

Currently, the approach works only on pairs of utterances. It needs to be improved to consider the whole dialogue context, rather than just the last utterance. We saw that a more encompassing approach like BERT’s NSP is beneficial. Thus, we plan to look into how one can obtain a dialogue score without the need for an aggregation step. Until then, stay tuned to our blog!

Link to paper: http://jens-lehmann.org/file/2020/coling_lm_dialogue_eval.pdf
Link to code: https://github.com/SmartDataAnalytics/transformers_dialogue_evaluators




The Smart Data Analytics (SDA) research group at the University of Bonn works on #semantics, #machinelearning and #bigdata.