Aufbauphase: Methoden der Computerlinguistik

University of Potsdam

Weekly Schedule

Contact me if you are interested in the slides.

CORRECTION: If you downloaded the slides a while ago, note that on slides 13 and 14 (statistical significance testing) the if statement of the algorithms should use >= instead of >.
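
The slides are not reproduced on this page, so the following is only an illustrative sketch of why the comparison operator matters, not the algorithm from slides 13 and 14. It assumes a two-sided paired approximate randomization test over per-item scores; the function name `approximate_randomization` and its parameters are hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test (illustrative sketch).

    scores_a and scores_b hold per-item scores of two systems on the same
    test items. Returns an estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs score lists of equal length")
    rng = random.Random(seed)
    # Observed absolute difference between the two systems' total scores.
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b)))
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(scores_a, scores_b):
            # Swap the two systems' scores for this item with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        # >= rather than >: a shuffled difference equal to the observed one
        # counts as "at least as extreme".
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

With two identical systems every shuffled difference is 0 and ties with the observed difference, so this sketch returns 1.0. With a strict > no trial would count, and the estimate would drop to 1 / (trials + 1), wrongly suggesting a significant difference.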

| Week | Date | Focus | Slides |
| --- | --- | --- | --- |
| 1 | 04.11.2020 | Introduction / Course guidelines | |
| 2 | 11.11.2020 | Paradigms | |
| 3 | 18.11.2020 | Common procedures | |
| 4 | 25.11.2020 | Annotation | |
| 5 | 02.12.2020 | Metrics & Measurements | |
| 6 | 09.12.2020 | Statistical significance testing | |
| 7 | 16.12.2020 | Best practices / Presentation guidelines | |
| 8 | 06.01.2021 | Presentation preparation | |
| 9 | 13.01.2021 | Group 1: Machine Translation & History of Evaluation in NLP | |
| 10 | 20.01.2021 | Group 2: Natural Language Generation & Shared Tasks | |
| 11 | 27.01.2021 | Group 3: Dialogue & Replication Crisis | |
| 12 | 03.02.2021 | Group 4: Speech Synthesis & Ethics | |
| 13 | 10.02.2021 | Tutorial / Project guidelines | |

Weekly Reading List


Introduction

  1. COHEN, Paul R.; HOWE, Adele E. How evaluation guides AI research: The message still counts more than the medium. AI magazine, v. 9, n. 4, p. 35-35, 1988. link
  2. KING, Margaret. Evaluating natural language processing systems. Communications of the ACM, v. 39, n. 1, p. 73-79, 1996. link


Paradigms

  1. SPÄRCK JONES, Karen. Towards better NLP system evaluation. In: Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. link
  2. HIRSCHMAN, Lynette; THOMPSON, Henry S. Overview of evaluation in speech and natural language processing. In: Survey of the state of the art in human language technology. Cambridge University Press, 1997. link (pages 409-414)
  3. BELZ, Anja. That’s nice… what can you do with it?. Computational Linguistics, v. 35, n. 1, p. 111-118, 2009. link

Common Procedures

  1. POTTS, Christopher. Evaluation methods in NLP. 2020. link
  2. Communicating results with scientific graphs. The University of Queensland. link (click on each type of plot to see presentation best practices)
  3. REITER, Ehud. Use proper baselines. Blogpost, 2018. link
  4. (Optional) Wikipedia’s overview of cross-validation. link


Annotation

  1. SCHNEIDER, Nathan. What I’ve learned about annotating informal text (and why you shouldn’t take my word for it). In: Proceedings of The 9th Linguistic Annotation Workshop. 2015. p. 152-157. link
  2. BENDER, Emily M.; FRIEDMAN, Batya. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, v. 6, p. 587-604, 2018. link (sections 1-5)
  3. FORT, Karën; ADDA, Gilles; COHEN, K. Bretonnel. Amazon Mechanical Turk: Gold mine or coal mine?. Computational Linguistics, v. 37, n. 2, p. 413-420, 2011. link
  4. (Optional) FORT, Karën. Corpus Linguistics: Inter-Annotator Agreements (slides), 2011. link

Metrics and Measurements

  1. POTTS, Christopher. Evaluation metrics in NLP. 2020. Read or watch.
  2. REITER, Ehud. Why do we still use 18-year old BLEU? / Small differences in BLEU are meaningless / Learning does not require evaluation metrics. Blogposts, 2018 and 2020. link1 link2 link3
  3. HUYEN, Chip. Evaluation Metrics for Language Modeling. Blogpost, 2019. link
  4. (Optional) Slides on Information Theory metrics. link
  5. (Optional) Slides on Minimum Edit Distance algorithm. link
  6. (Optional) Learning curves for diagnosing machine learning performance. link

Statistical Significance Testing

  1. Chapters 2, 3 and 4 of DROR, Rotem; PELED-COHEN, Lotem; SHLOMOV, Segev & REICHART, Roi. Statistical Significance Testing for Natural Language Processing. Synthesis Lectures on Human Language Technologies, v. 13, n. 2, p. 1-116, 2020. link (pages 3-33)
  2. (Optional) DROR, Rotem; BAUMER, Gili; SHLOMOV, Segev & REICHART, Roi. The hitchhiker’s guide to testing statistical significance in natural language processing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 1383-1392, 2018. link
  3. (Optional) SØGAARD, Anders; JOHANNSEN, Anders; PLANK, Barbara; HOVY, Dirk & MARTINEZ, Hector. What’s in a p-value in NLP?. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, p. 1-10, 2014. link
  4. (Optional) BERG-KIRKPATRICK, Taylor; BURKETT, David; KLEIN, Dan. An empirical investigation of statistical significance in NLP. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p. 995-1005, 2012. link
  5. (Optional) William Morgan’s slides about Approximate Randomization. link
  6. (Optional) KÖHN, Arne. We need to talk about significance tests. Blogpost, 2019. link

Best Practices

  1. HOVY, Dirk; SPRUIT, Shannon L. The social impact of natural language processing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2016. p. 591-598. link
  2. REITER, Ehud. My guidelines to evaluating AI systems, 2017, and Do people “cheat” by overfitting test data, 2020. Blogposts. link1 and link2
  3. HEAVEN, Will Douglas. AI is wrestling with a replication crisis. MIT Technology Review, 2020. link
  4. GRUS, Joel. Reproducibility as a Vehicle for Engineering Best Practices. ICLR, 2019. link (the first 19 minutes)
  5. (Optional) LIPTON, Zachary C.; STEINHARDT, Jacob. Troubling trends in machine learning scholarship. Queue, v. 17, n. 1, p. 45-77, 2019. link
  6. (Optional) WIRED. Artificial intelligence confronts a reproducibility crisis, 2019. link
  7. (Optional) BENDER, Emily. The #BenderRule: On Naming the Languages We Study and Why It Matters. Blogpost, 2019. link
  8. More material about Ethics in NLP. link