Inspired by the principles of behavioral testing in software engineering, Ribeiro et al. (2020) introduced the CheckList tool, a task-agnostic methodology for testing NLP models. With CheckList, we can write templates and automatically generate a large number of diverse examples that test a particular capability or phenomenon.

The capabilities listed in Ribeiro et al. (2020) target a minimal set of system properties that are necessary yet feasible to check. For the Natural Language Inference (NLI) task, such capabilities can check the robustness of systems by perturbing examples along each capability. However, an NLI example typically mixes many types of reasoning, spanning the lexical, syntactic and semantic levels. To test an NLI system, it is important to check, in a deterministic and scalable manner, whether these types of reasoning are captured both individually and in combination. Some reasoning types, even if not deemed strictly necessary, should still inform the evaluator about the system's properties in a holistic manner. We describe the creation of a CheckList for the NLI task that systematically covers a set of reasoning capabilities necessary for NLI, inspired by the recently proposed TaxiNLI.
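As a rough illustration of testing capabilities individually, the sketch below computes a failure rate per capability over a handful of labeled examples. It is not part of CheckList: the function `failure_rate_by_capability`, the stand-in model `always_entailment`, and the capability name "antonyms" are all assumptions made for this example.

```python
from collections import defaultdict


def failure_rate_by_capability(examples, predict):
    """Return {capability: fraction of examples the model gets wrong}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for ex in examples:
        totals[ex["capability"]] += 1
        if predict(ex["premise"], ex["hypothesis"]) != ex["label"]:
            failures[ex["capability"]] += 1
    return {cap: failures[cap] / totals[cap] for cap in totals}


def always_entailment(premise, hypothesis):
    """Hypothetical stand-in for a real NLI model: always predicts one label."""
    return "entailment"


# A few hand-written examples; "antonyms" is an assumed capability name.
examples = [
    {"premise": "Amjad is poor", "hypothesis": "Amjad is rich",
     "label": "contradiction", "capability": "antonyms"},
    {"premise": "William is cynical and selfish.", "hypothesis": "William is cynical",
     "label": "entailment", "capability": "use of and"},
    {"premise": "Rachel lives in Seoul.", "hypothesis": "Rachel lives in France",
     "label": "contradiction", "capability": "knowledge about countries and cities"},
]

rates = failure_rate_by_capability(examples, always_entailment)
for capability, rate in rates.items():
    print(f"{capability}: {rate:.0%} failures")
```

Reporting one number per capability, rather than a single aggregate accuracy, is what lets an evaluator see which reasoning types a system handles and which it does not.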

The table below lists a few templates along with the capability each one tests.

| Template | Label | Example | Capability |
|---|---|---|---|
| P: {NAME} is {ADJ}<br>H: {NAME} is {Antonym(ADJ)}. | contradiction | P: Amjad is poor<br>H: Amjad is rich | |
| P: {NAME} is {ADJ1} and {ADJ2}<br>H: {NAME} is {ADJ1/2} | entailment | P: William is cynical and selfish.<br>H: William is cynical | use of and |
| P: Among {NAME1}, {NAME2} and {NAME3} the {SUP ADJ} is {NAME1}<br>H: {NAME1} is {COM ADJ} than {NAME2} | entailment | P: Among James, Lily and Smith the tallest is James<br>H: James is taller than Lily | comparatives and superlatives |
| P: {NAME1} was born in {YEAR1} and {NAME2} was born in {YEAR2}.<br>H: {NAME1} was born earlier than {NAME2}<br>(condition YEAR1 < YEAR2) | entailment | P: Martha was born in 1992 and Peter was born in 2001<br>H: Martha was born earlier than Peter | reasoning about time |
| P: {NAME} lives in {CITY}.<br>H: {NAME} lives in {COUNTRY}<br>(condition CITY does not belong to COUNTRY) | contradiction | P: Rachel lives in Seoul.<br>H: Rachel lives in France | knowledge about countries and cities |
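To make the first template concrete, here is a minimal sketch in plain Python that expands it into premise/hypothesis pairs. It does not use the CheckList library itself; the word lists `NAMES` and `ANTONYMS` and the helper `antonym_examples` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical lexicons, chosen for illustration only.
NAMES = ["Amjad", "William", "Rachel"]
ANTONYMS = {"poor": "rich", "tall": "short", "happy": "sad"}


def antonym_examples(names, antonyms):
    """Expand 'P: {NAME} is {ADJ}' / 'H: {NAME} is {Antonym(ADJ)}'."""
    examples = []
    # Every name is combined with every (adjective, antonym) pair.
    for name, (adj, opposite) in product(names, antonyms.items()):
        examples.append({
            "premise": f"{name} is {adj}",
            "hypothesis": f"{name} is {opposite}",
            "label": "contradiction",
        })
    return examples


examples = antonym_examples(NAMES, ANTONYMS)
print(len(examples))  # 9 (3 names x 3 adjective pairs)
print(examples[0])
```

Even with these tiny lists, one template yields nine test cases; larger lexicons scale the test set without any additional manual writing.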

In the above templates, {NAME}, {ADJ}, {COM ADJ}, {SUP ADJ}, {CITY}, {COUNTRY} and {YEAR} are placeholders that can be populated with different values (satisfying the stated condition) to generate a large number of test examples. When the same placeholder is used more than once in a template, we number the occurrences ({NAME1}, {NAME2}, etc.), and the numbered placeholders are filled with distinct values.
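The two rules described above, distinct values for numbered placeholders and a condition on the filled values, can be sketched in plain Python as follows. This is an illustrative approximation rather than the CheckList API; the lists, the `CITY_TO_COUNTRY` lookup, and both helper functions are assumptions made for this example.

```python
from itertools import permutations, product

# Hypothetical fill values, chosen for illustration only.
NAMES = ["Martha", "Peter", "Rachel"]
YEARS = [1992, 2001, 2010]
CITY_TO_COUNTRY = {"Seoul": "South Korea", "Paris": "France", "Delhi": "India"}


def birth_year_examples(names, years):
    """Fill the birth-year template, keeping only fills with YEAR1 < YEAR2."""
    examples = []
    # permutations(..., 2) guarantees that NAME1 and NAME2 are distinct.
    for name1, name2 in permutations(names, 2):
        for year1, year2 in product(years, repeat=2):
            if year1 < year2:  # the template's condition
                examples.append({
                    "premise": f"{name1} was born in {year1} and "
                               f"{name2} was born in {year2}.",
                    "hypothesis": f"{name1} was born earlier than {name2}",
                    "label": "entailment",
                })
    return examples


def city_country_examples(names, city_to_country):
    """Fill the city/country template with countries the city is NOT in."""
    examples = []
    countries = sorted(set(city_to_country.values()))
    for name, city, country in product(names, city_to_country, countries):
        if city_to_country[city] != country:  # the template's condition
            examples.append({
                "premise": f"{name} lives in {city}.",
                "hypothesis": f"{name} lives in {country}",
                "label": "contradiction",
            })
    return examples


time_examples = birth_year_examples(NAMES, YEARS)
place_examples = city_country_examples(NAMES, CITY_TO_COUNTRY)
print(len(time_examples), len(place_examples))  # 18 18
print(time_examples[0]["premise"])
```

Filtering on the condition at generation time keeps the gold label fixed for every generated example, which is what makes the resulting test deterministic.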

A Jupyter Notebook illustrating the use of CheckList to create templates is available for download. The full set of tutorials on using CheckList is also available.