Translating Middle Egyptian Hieroglyphs

I want to welcome you with a warmhearted which means

in Middle Egyptian around 5000 years ago and can simply be interpreted as

‘Hello’ 😙

I hope you’re now curious to read about the steps we took to translate a large number of Middle Egyptian resources. So, enjoy and read on!

Once upon a time in Egypt…

For this post, I decided to put the disclaimer early on: we are not actually translating hieroglyphs into a contemporary language by their visual appearance. The signs that I welcomed you with are Middle Egyptian graphemes that developed from pictograms. They were used in Ancient Egypt from around 3200 BCE to the 5th century of our common era, so it is definitely one of the longest writing traditions out there. But these symbols, which you can observe on the remains of tomb walls and temples, were not used for letters or for administrative or legal documents. For these occasions, cursive hieroglyphs, hieratic and demotic emerged instead, the latter being considered an independent language. Coming back to my disclaimer from above: we do have multiple written forms of the hieroglyphs (not demotic, actually, as it is not exactly Middle Egyptian), but we have them in an encoded format. This encoding is based on the monumental research work of Sir Alan Gardiner:

Source Data
Hieroglyph
Encoding: H6 G43 M17 M17 X1 N5 Z1 G17 D4 G17 H6 Z7 N5
Transcription: šw,yt m jri̯ m šwi
Part-of-speech tags: substantive verb verb preposition substantive
Word-level translation: shadow , not be as sun
Interpreted translation: Shade, don't be as the blazing sun (by Mark-Jan Nederhof)

So, we actually have letter/number combinations that map the hieroglyphs to a more manageable code. This scheme originally comes from grouping the signs into semantic categories. The table also mirrors a further source: from an Egyptologist’s point of view it is called a transliteration, but we treated it as a kind of transcription. This rendering of the hieroglyphs was once a method for publishing Ancient Egyptian resources, but it also represents the translator’s interpretation of word and sentence boundaries and the insertion of missing signs. Although vowels are not reflected in the hieroglyphs, the consonantal transliteration alphabet is expressive enough to be considered a kind of transcription. That’s amazing, as it lets us apply machine translation methods from both AST (Automatic Speech Translation) and NMT (Neural Machine Translation). For our corpus, which we gratefully received in cooperation with the Thesaurus Linguae Aegyptiae (TLA) project, we also have access to part-of-speech tags. Sounds good, right? But the whole parallel corpus consisted of only around 30,000 pairs, so we were dealing with a pretty tough low-resource scenario.
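To make the letter/number scheme a bit more tangible, here is a minimal sketch that splits an encoded line into category, number and optional variant suffix. The function names, the regular expression and the small category table are my own illustration and not part of our actual preprocessing; the category headings are abbreviated from Gardiner's sign list and only cover the letters that occur in the example above.

```python
import re

# A few Gardiner category letters with abbreviated headings.
# Only the letters from the example line are listed (illustrative subset).
GARDINER_CATEGORIES = {
    "D": "parts of the human body",
    "G": "birds",
    "H": "parts of birds",
    "M": "trees and plants",
    "N": "sky, earth, water",
    "X": "loaves and cakes",
    "Z": "strokes and geometrical figures",
}

# Category (one capital letter, or "Aa"), a number, and an optional variant letter.
SIGN_PATTERN = re.compile(r"([A-Z][a-z]?)(\d+)([A-Za-z]?)")


def parse_sign(code):
    """Split one code such as 'G43' into (category, number, variant)."""
    match = SIGN_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a Gardiner-style code: {code!r}")
    category, number, variant = match.groups()
    return category, int(number), variant


def parse_encoding(line):
    """Parse a whitespace-separated line of codes."""
    return [parse_sign(token) for token in line.split()]


signs = parse_encoding("H6 G43 M17 M17 X1 N5 Z1 G17 D4 G17 H6 Z7 N5")
for category, number, _ in signs[:3]:
    print(category, number, GARDINER_CATEGORIES.get(category, "unknown"))
```

In a translation system, each code would simply become one source token; the split into category and number is only useful if one wants to inspect or group the signs.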

How to exploit all these resources?

We experimented with several techniques: Backtranslation (from NMT) and Pipeline models (from AST) served as viable opponents to our best player, Multi-Task Learning. Here is a short introduction to the three of them:

Pipeline Model

This setup is really close to the human approach to translating Middle Egyptian texts: first, train an encoder/decoder model that learns to translate hieroglyphs $\to$ transcription. In parallel, we also train a model from transcription $\to$ translation. We can then translate any Egyptian text by first generating the transcription and then using this output to generate the translation, just like Egyptologists do.
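The chaining of the two stages can be sketched in a few lines. Note that this is only an illustration of the composition: the two "models" are hypothetical lookup tables that know exactly one sentence (the example from the source data table), standing in for trained encoder/decoder models, and all names are made up for this sketch.

```python
from typing import Callable, Dict

Model = Callable[[str], str]
UNKNOWN = "<unk>"


def make_lookup_model(table: Dict[str, str]) -> Model:
    """Hypothetical stand-in for a trained model: a plain dictionary lookup."""
    def translate(sentence: str) -> str:
        return table.get(sentence, UNKNOWN)
    return translate


def make_pipeline(first: Model, second: Model) -> Model:
    """Feed the output of the first stage into the second stage."""
    def translate(sentence: str) -> str:
        return second(first(sentence))
    return translate


# The single example sentence from the source data table.
ENCODING = "H6 G43 M17 M17 X1 N5 Z1 G17 D4 G17 H6 Z7 N5"
TRANSCRIPTION = "šw,yt m jri̯ m šwi"
TRANSLATION = "Shade, don't be as the blazing sun"

to_transcription = make_lookup_model({ENCODING: TRANSCRIPTION})
to_translation = make_lookup_model({TRANSCRIPTION: TRANSLATION})
pipeline = make_pipeline(to_transcription, to_translation)

result = pipeline(ENCODING)
print(result)
```

The sketch also shows the typical weakness of a pipeline: whatever the first stage gets wrong is passed on unchanged, so an unknown encoding ends up as `<unk>` after the second stage as well.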

Backtranslation

Backtranslation is a well-known tool when dealing with low resources: one first trains a backward model on the available data that translates from the target language to the source language. After that, one can take any additional (ideally in-domain) target-language corpus and backtranslate it into the source language, and voilà: there’s our additional data that we can use to train our main system! In our case, we were faced with a very special situation: the database that our corpus was extracted from is being filled in reverse order. The TLA first entered the translations and other sources and has not yet finished integrating the Gardiner encoding. This means that, on the one hand, we were missing around 60,000 Egyptian encodings, but on the other hand we could use the available target-side data to backtranslate them. We did that and added parts of these “synthetic” sentences bit by bit to the parallel corpus to evaluate whether that helped our system to learn.
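As a rough sketch of this data augmentation step: the snippet below backtranslates target-only sentences and mixes a growing fraction of the synthetic pairs into the parallel corpus. Everything here is an assumption made for illustration, including the placeholder backward model, the toy sentences, the function names and the chosen fractions; our real setup used a trained backward model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source, target)


def backtranslate(backward_model: Callable[[str], str],
                  targets: Sequence[str]) -> List[Pair]:
    """Create synthetic (source, target) pairs from target-only sentences."""
    return [(backward_model(target), target) for target in targets]


def mix_in_synthetic(parallel: Sequence[Pair], synthetic: Sequence[Pair],
                     fraction: float, seed: int = 0) -> List[Pair]:
    """Return the parallel corpus plus a random share of the synthetic pairs."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    k = round(fraction * len(synthetic))
    return list(parallel) + rng.sample(list(synthetic), k)


def toy_backward_model(target: str) -> str:
    """Placeholder for a trained target-to-source model (assumption)."""
    return "synthetic-src(" + target + ")"


parallel = [("src-1", "tgt-1"), ("src-2", "tgt-2")]
monolingual_targets = ["tgt-3", "tgt-4", "tgt-5", "tgt-6"]
synthetic = backtranslate(toy_backward_model, monolingual_targets)

# Add the synthetic data bit by bit; one would retrain and evaluate at each size.
corpus_sizes = {
    fraction: len(mix_in_synthetic(parallel, synthetic, fraction))
    for fraction in (0.0, 0.5, 1.0)
}
print(corpus_sizes)
```

Fixing the random seed keeps the mixed corpora reproducible, which makes the comparison between different fractions a bit fairer.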

Multi-Task Learning

Last but not least, we implemented a multi-task training schedule. This technique originates from human learning: when tackling a difficult problem, it might help to

• first learn a simpler problem (e.g. creating word boundaries before translating)
• deal with a related problem that helps to generalize the solution of the main problem (e.g. POS tagging before translating)

And this is exactly how we could make the most of our sparse data! So, we implemented a schedule within our training that switched from translating hieroglyphs $\to$ transcription to, let’s say, transcription $\to$ hieroglyphs or hieroglyphs $\to$ pos-tags. The key is that the encoder/decoder was shared during that process and learned to encode and decode multiple resources!
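To illustrate the idea of switching between tasks while sharing parameters, here is a small framework-free sketch. It assumes a simple round-robin order, which is not necessarily the schedule from the paper, and the "model" is only a stub that counts updates instead of doing gradient steps; the class and function names are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice

# The (source, target) tasks mentioned above.
TASKS = [
    ("hieroglyphs", "transcription"),
    ("transcription", "hieroglyphs"),
    ("hieroglyphs", "pos-tags"),
]


def round_robin_schedule(tasks, steps):
    """Assumed schedule: cycle through the tasks, one task per training step."""
    return list(islice(cycle(tasks), steps))


class SharedModelStub:
    """Bookkeeping stand-in for a model whose parameters are shared by all tasks."""

    def __init__(self):
        self.shared_updates = 0        # updates to the shared encoder/decoder
        self.task_updates = Counter()  # how often each task was trained

    def train_step(self, task):
        # A real implementation would draw a batch for `task` and take a
        # gradient step; here we only record that the step happened.
        self.shared_updates += 1
        self.task_updates[task] += 1


model = SharedModelStub()
for task in round_robin_schedule(TASKS, steps=7):
    model.train_step(task)

print(model.shared_updates, dict(model.task_updates))
```

The point of the stub is the asymmetry in the counters: every task only sees a share of the steps, but the shared parameters are updated in every single step, which is where the additional training signal comes from.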

And the winner is…

How did we do? Which system dealt best with the extreme data sparsity? For details, I’ll just point to the paper (Wiesenbach & Riezler, 2019) 😋 But as a spoiler: the one-to-many MTL system took first place by learning from an additional 30% of transcriptions and POS tags. The Pipeline and Backtranslation models both fell short, as they just couldn’t leverage the small amount of data. I hope you had a good read and learned some facts about dealing with ancient languages and some techniques for low-resource use cases.

Acknowledgment: Thanks to Stefan Riezler for his valuable and much needed feedback for improving this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of his affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, cite the paper instead:

1. Philipp Wiesenbach and Stefan Riezler
Multi-Task Modeling of Phonographic Languages: Translating Middle Egyptian Hieroglyphs
Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2019
@article{wiesenbach19,
title = {Multi-Task Modeling of Phonographic Languages: Translating Middle Egyptian Hieroglyphs},
author = {Wiesenbach, Philipp and Riezler, Stefan},
journal = {Proceedings of the International Workshop on Spoken Language Translation},
journal-abbrev = {IWSLT},
year = {2019},
url = {https://www.cl.uni-heidelberg.de/statnlpgroup/publications/IWSLT2019_v2.pdf}
}

