Ruprecht-Karls-Universität Heidelberg

Language change - Project information

This is a project recommendation. Should you wish to do another project, come talk to me.

What to submit

  • A project report (details in the "Project report" section below)
  • Your data and annotations

Steps

  1. choose 3 words (I suggest English, but any language is fine as long as you have corpora from at least 2 different time frames that lie more than 50 years apart)

  2. research their etymology and establish the relation between them and their source (or whether they are themselves “basic” words, e.g. “mother” and “father” in English)

  3. using the different time frame corpora, obtain their collocations or larger contexts, and analyze their senses

    You can use whatever corpora you find and like. Google n-grams are a very good (and free!) resource. The 5-grams are downloaded and available at /bfcorpora/googlebooks/5grams. A processed version, containing the contexts for each word in WordNet (probably with the exception of adverbs, I don't remember 100% what I did), is at /bfcorpora/googlebooks/5grams_contexts_sorted_by_word/. In this processed version the context frequencies are also aggregated over 25-year intervals. You don't have to use these, though! Use whatever you like.
    NOTE: the /bfcorpora partition is automatically mounted when you log into one of the "work" servers: last, ella, petty. On last (maybe also on ella) you should prefix the path with /mnt, i.e. /mnt/bfcorpora/googlebooks/. On petty the path works as written, no /mnt necessary.

  4. establish what senses they had in each time frame, and analyze the relation between their “basic” sense and the others (could be metaphoric expansions, or generalizations, etc.)

  5. determine (using clustering for example, or some other automatic method) the closest words to each sense in each time frame, and their associated contexts. Compare them with contemporary senses.

    If you are very new to this, Python's scikit-learn library has lots and lots of clustering and learning functions, and their documentation is very clear. Try some of those! You can also use other libraries or tools you find online. You are also welcome to develop your own code should you wish, but the point here is not necessarily to develop new clustering algorithms, but to use a few and analyse what happens. That said, if you wish to build a variation (possibly inspired by the semantic chaining paper, for example) you are welcome to do it!
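
To make steps 3 and 5 a bit more concrete, here is a minimal Python sketch of one way to collect the contexts of a word per time interval and compare two intervals. It is only an illustration under stated assumptions, not the way the processed files on /bfcorpora were built. It assumes each input line is tab-separated and starts with the n-gram, the year and the match count (check this against the files you actually use). The sample lines are invented toy data, and the function names (period_start, collect_contexts, cosine) are made up for this example. The cosine comparison is a simple stand-in for a first look at the data; it is not a clustering method.

```python
import math
from collections import Counter, defaultdict


def period_start(year, width=25):
    """Return the first year of the `width`-year interval containing `year`."""
    return (year // width) * width


def collect_contexts(lines, target, width=25):
    """Sum context-word counts for `target`, grouped by time interval.

    ASSUMPTION: each line is tab-separated and its first three fields are
    ngram, year, match_count. Any further columns are ignored.
    """
    contexts = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not have the assumed layout
        tokens = fields[0].lower().split()
        if target not in tokens:
            continue
        year, count = int(fields[1]), int(fields[2])
        bucket = contexts[period_start(year, width)]
        for tok in tokens:
            if tok != target:
                # A word occurring twice in one n-gram is counted twice.
                bucket[tok] += count
    return dict(contexts)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy lines in the assumed layout (NOT real n-gram counts).
sample = [
    "the mouse ate the cheese\t1900\t12\t10\t8",
    "a mouse in the house\t1910\t5\t5\t4",
    "click the mouse button twice\t1995\t30\t25\t20",
    "the mouse and keyboard are\t1998\t7\t6\t6",
]

by_period = collect_contexts(sample, "mouse")
for start in sorted(by_period):
    print(start, by_period[start].most_common(3))

periods = sorted(by_period)
print("similarity:", cosine(by_period[periods[0]], by_period[periods[-1]]))
```

A low similarity between two intervals is a hint that the word's typical contexts have shifted, which is a reasonable place to start the sense analysis. The per-interval counters are also the kind of input you could turn into vectors and hand to one of scikit-learn's clustering functions for step 5. On real data you would probably want to remove function words such as "the" first, since they dominate the counts.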

Grading criteria

  • interesting insights -- I will appreciate you linking your project and the approach you chose to papers we discussed in class and to other related work, while explaining why you chose this approach and what it reveals about the words you chose. Explain what you have learned from each step of the process about language evolution in general and your chosen words in particular.
  • clarity of the report -- The easier it is for me to understand what you did and how, the more likely it is that your efforts will be justly appreciated and graded.

Project report

I expect a project report organized like a paper, with:
  • an introduction that explains your idea
  • related work, showing which previous work your idea and your implementation build on
  • a description of the data you are using (including all the additional resources used!), the experimental set-up, the experiments and the results
  • and, very importantly, a DISCUSSION (!!!). This could include results obtained with several variations of your system (using different subsets of features, for example), so that you can observe and discuss the impact of different types of information on your system
  • some conclusions about what you learned during the project and what the experiments tell you about your initial idea
There is no minimum or maximum limit on the length of this document, but give enough details to convince me you did a good job while sparing me irrelevant details.

Important dates

Project start: NOW!
Projects due: August 26th. No extensions possible (!!!), because everything needs to be wrapped up by the end of August.