Reconciling the Multilingual Debate: Knowing vs Doing in Cross-Lingual Generalization
Abstract
What drives cross-lingual generalization in multilingual language models? A widely discussed explanation is the anchor-point hypothesis: as Pires et al. (2019) put it, "having word pieces used in all languages (numbers, URLs, etc) which have to be mapped to a shared space forces the co-occurring pieces to also be mapped to a shared space, thus spreading the effect to other word pieces, until different languages are close to a shared space." This idea is echoed across the literature, and many studies report a link between token overlap and transfer. Yet other results show substantial generalization even with no overlap at all, and some even suggest no relationship between overlap and performance. In this talk, I reconcile these findings using controlled synthetic languages with fully disjoint input and output vocabularies. I show that models can generalize cross-lingually by reusing language-agnostic computation even when their embeddings remain language-specific. This motivates a distinction between doing (shared mechanisms for computation) and knowing (shared, language-independent storage of features and world knowledge), and I outline toy experimental designs to probe when models can and cannot learn the latter without explicit anchors.
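
The abstract does not spell out the synthetic-language construction, but the sketch below illustrates one minimal way such "fully disjoint vocabularies" could be set up: two languages share the same underlying generative process, while every surface token ID in the second is offset into a disjoint range, so no anchor tokens are shared. All names and parameters here (VOCAB_SIZE, the toy copy "grammar", the offset scheme) are illustrative assumptions, not the talk's actual experimental setup.

```python
# Minimal illustrative sketch (assumed construction, not the talk's exact setup):
# two synthetic "languages" with identical structure but fully disjoint token IDs.
import random

VOCAB_SIZE = 50   # assumed per-language vocabulary size
SEQ_LEN = 12      # assumed sequence length for the toy task


def sample_base_sequence(rng: random.Random) -> list[int]:
    """Sample an abstract sequence from a shared underlying 'grammar'.

    Here the 'grammar' is a toy copy task: the second half repeats the first,
    so the correct computation is the same regardless of surface vocabulary.
    """
    half = [rng.randrange(VOCAB_SIZE) for _ in range(SEQ_LEN // 2)]
    return half + half


def to_language(base_seq: list[int], language: str) -> list[int]:
    """Render the abstract sequence in language A (IDs 0..49) or language B
    (IDs 50..99). The two ID ranges never intersect, so there is zero token
    overlap and no shared anchor points."""
    offset = 0 if language == "A" else VOCAB_SIZE
    return [tok + offset for tok in base_seq]


rng = random.Random(0)
base = sample_base_sequence(rng)
seq_a = to_language(base, "A")  # same structure as seq_b, IDs in 0..49
seq_b = to_language(base, "B")  # same structure as seq_a, IDs in 50..99
assert set(seq_a).isdisjoint(seq_b)  # vocabularies are fully disjoint
```

Under this kind of setup, any cross-lingual transfer has to come from reused computation over the shared structure rather than from overlapping tokens, which is the contrast the abstract draws between doing and knowing.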

