Recent approaches to multi-modal understanding struggle to fuse the representations of the respective modalities and lack interpretability. To address this, we explicitly align text and image. More precisely, we start by solving the task of phrase grounding, which localizes objects in an image via phrases contained in its caption. Recent work focuses on (weakly) supervised training of complex neural architectures that rely only on the visual and textual information given in the respective datasets and obscure the explanation for the alignment. In contrast, we propose Text to Scene Graph Alignment (T2SGA), an unsupervised and dataset-agnostic approach to phrase grounding that rivals the performance of supervised systems. We conclude with a discussion of how our explicit alignment can be exploited in future work on multi-modal understanding and reasoning.