Title: BERTCoherence: Evaluating Text Generation with BERT and Discourse Coherence

Speaker: Wei Zhao (HITS)


There has been growing interest in developing text generation systems that produce discourse-coherent output, e.g., by modeling the interdependence between sentences. Recently, BERT-based metrics have become popular in system evaluation. While strong at modeling semantics, they cannot recognize coherence and thus fail to penalize incoherent elements in system outputs. In this work, we introduce two unsupervised reference-based evaluation metrics, FocusDiff and SentGraph, for summarization and document-level machine translation (MT). Both use BERT to model discourse coherence according to Centering theory, which formulates coherence through the lens of readers' focus in a text. To interpret the metrics, we analyze the two regularities they rely on in terms of how well each distinguishes hypotheses from references. Our experiments cover 14 non-discourse and discourse metrics (including ours), as well as coherence models from the discourse community recast as metrics. We show that (i) previous BERT-based metrics do not correlate with human-rated coherence and perform even worse than early attempts at discourse metrics (Wong and Kit, 2012), and (ii) there is a strong relation between the regularities and the results of the metrics: the more discriminative the regularities are, the better our metrics perform. This finding encourages future research into discovering other novel regularities for better, self-interpretable metrics.
