Aufbauphase: Methoden der Computerlinguistik (Methods of Computational Linguistics)

University of Potsdam

Weekly Schedule

Contact me if you are interested in the slides.

CORRECTION: If you downloaded the slides a while ago: in slides 13 and 14 on statistical significance testing, the comparison in the if statement of the algorithms should be >= (instead of >).
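
To make the correction concrete, here is a minimal sketch of a paired approximate randomization test (the function name and structure are my own, not the exact algorithm from the slides); the counting step uses >= as stated above, so that pseudo-statistics exactly equal to the observed one are counted as at least as extreme:

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided approximate randomization test for paired per-item scores
    of two systems. Returns an estimated p-value."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    count = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap the system labels per item
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        pseudo = abs(sum(shuffled_a) - sum(shuffled_b))
        if pseudo >= observed:  # note: >=, not > (see correction)
            count += 1
    return (count + 1) / (trials + 1)  # add-one smoothed p-value estimate
```

With identical score lists the estimate is 1.0 (every shuffle is at least as extreme); with clearly separated scores it is close to 0.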

| Week | Date | Focus |
|------|------------|-------|
| 1 | 04.11.2020 | Introduction / Course guidelines |
| 2 | 11.11.2020 | Paradigms |
| 3 | 18.11.2020 | Common procedures |
| 4 | 25.11.2020 | Annotation |
| 5 | 02.12.2020 | Metrics & Measurements |
| 6 | 09.12.2020 | Statistical significance testing |
| 7 | 16.12.2020 | Best practices / Presentation guidelines |
| | | Break |
| 8 | 06.01.2021 | Presentation preparation |
| 9 | 13.01.2021 | Group 1: Machine Translation & History of Evaluation in NLP |
| 10 | 20.01.2021 | Group 2: Natural Language Generation & Shared Tasks |
| 11 | 27.01.2021 | Group 3: Dialogue & Replication Crisis |
| 12 | 03.02.2021 | Group 4: Speech Synthesis & Ethics |
| 13 | 10.02.2021 | Tutorial / Project guidelines |

Weekly Reading List

Introduction

  1. COHEN, Paul R.; HOWE, Adele E. How evaluation guides AI research: The message still counts more than the medium. AI magazine, v. 9, n. 4, p. 35-35, 1988. link
  2. KING, Margaret. Evaluating natural language processing systems. Communications of the ACM, v. 39, n. 1, p. 73-79, 1996. link

Paradigms

  1. SPÄRCK JONES, Karen. Towards better NLP system evaluation. In: Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. link
  2. HIRSCHMAN, Lynette; THOMPSON, Henry S. Overview of evaluation in speech and natural language processing. In: Survey of the state of the art in human language technology. Cambridge University Press, 1997. link (pages 409-414)
  3. BELZ, Anja. That’s nice… what can you do with it? Computational Linguistics, v. 35, n. 1, p. 111-118, 2009. link

Common Procedures

  1. POTTS, Christopher. Evaluation methods in NLP. 2020. link
  2. Communicating results with scientific graphs. The University of Queensland. link (click on each type of plot to see presentation best practices)
  3. REITER, Ehud. Use proper baselines. Blogpost, 2018. link
  4. (Optional) Wikipedia’s overview on cross validation. link
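
As a companion to the optional cross-validation reading, here is a minimal sketch (a hypothetical helper, plain Python) of how k-fold splits partition a dataset's indices: each item ends up in the test set exactly once, and the remaining items form the training set for that fold:

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    Fold sizes differ by at most one when n_items is not divisible by k.
    """
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]          # held-out fold
        train_idx = indices[:start] + indices[start + size:]  # the rest
        yield train_idx, test_idx
        start += size
```

In practice one would shuffle (or stratify) the indices first; libraries such as scikit-learn provide this, but the index arithmetic above is the core of the procedure.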

Annotation

  1. SCHNEIDER, Nathan. What I’ve learned about annotating informal text (and why you shouldn’t take my word for it). In: Proceedings of The 9th Linguistic Annotation Workshop. 2015. p. 152-157. link
  2. BENDER, Emily M.; FRIEDMAN, Batya. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, v. 6, p. 587-604, 2018. link (sections 1-5)
  3. FORT, Karën; ADDA, Gilles; COHEN, K. Bretonnel. Amazon mechanical turk: Gold mine or coal mine? Computational Linguistics, v. 37, n. 2, p. 413-420, 2011. link
  4. (Optional) FORT, Karën. Corpus Linguistics: Inter-Annotator Agreements (slides), 2011. link
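
The inter-annotator agreement slides cover chance-corrected agreement measures; as a minimal illustration (the function below is my own sketch, not from the slides), Cohen's kappa compares observed agreement between two annotators against the agreement expected by chance from their label distributions:

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators labeling the same items.

    Undefined (division by zero) when expected agreement is 1,
    i.e. both annotators always use a single identical label.
    """
    assert len(ann1) == len(ann2)
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    # chance agreement: product of each annotator's marginal label probabilities
    expected = sum((c1[label] / n) * (c2[label] / n) for label in set(c1) | set(c2))
    return (observed - expected) / (1 - expected)
```

Perfect agreement gives kappa = 1, while agreement at chance level gives kappa = 0, which is what makes it more informative than raw percent agreement.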

Metrics and Measurements

  1. POTTS, Christopher. Evaluation metrics in NLP. 2020. Read or watch.
  2. REITER, Ehud. Why do we still use 18-year old BLEU? / Small differences in BLEU are meaningless / Learning does not require evaluation metrics. Blogposts, 2018 and 2020. link1 link2 link3
  3. HUYEN, Chip. Evaluation Metrics for Language Modeling. Blogpost, 2019. link
  4. (Optional) Slides on Information Theory metrics. link
  5. (Optional) Slides on Minimum Edit Distance algorithm. link
  6. (Optional) Learning curves for diagnosing machine learning performance. link
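
For the optional slides on minimum edit distance, the classic dynamic-programming formulation (a sketch of the standard Wagner–Fischer algorithm with unit costs, not necessarily the variant in the slides) looks like this:

```python
def edit_distance(s, t):
    """Minimum edit distance (Levenshtein) between strings s and t,
    with unit cost for insertion, deletion, and substitution."""
    m, n = len(s), len(t)
    # dp[i][j] = distance between the first i chars of s and first j chars of t
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]
```

The same table structure underlies word error rate (WER) in speech evaluation, computed over word sequences instead of characters.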

Statistical Significance Testing

  1. Chapters 2, 3 and 4 of DROR, Rotem; PELED-COHEN, Lotem; SHLOMOV, Segev & REICHART, Roi. Statistical Significance Testing for Natural Language Processing. Synthesis Lectures on Human Language Technologies, v. 13, n. 2, p. 1-116, 2020. link (pages 3-33)
  2. (Optional) DROR, R.; BAUMER, G.; SHLOMOV, S.; REICHART, R. The hitchhiker’s guide to testing statistical significance in natural language processing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 1383-1392, 2018. link
  3. (Optional) SØGAARD, Anders; JOHANNSEN, Anders; PLANK, Barbara; HOVY, Dirk & MARTINEZ, Hector. What’s in a p-value in NLP?. In: Proceedings of the eighteenth conference on computational natural language learning, p. 1-10, 2014. link
  4. (Optional) BERG-KIRKPATRICK, Taylor; BURKETT, David; KLEIN, Dan. An empirical investigation of statistical significance in NLP. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p. 995-1005, 2012. link
  5. (Optional) William Morgan’s slides about Approximate Randomization. link
  6. (Optional) KÖHN, Arne. We need to talk about significance tests. Blogpost, 2019. link

Best Practices

  1. HOVY, Dirk; SPRUIT, Shannon L. The social impact of natural language processing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2016. p. 591-598. link
  2. REITER, Ehud. My guidelines to evaluating AI systems (2017) and Do people “cheat” by overfitting test data? (2020). Blogposts. link1 and link2
  3. HEAVEN, Will Douglas. AI is wrestling with a replication crisis. MIT Technology Review, 2020. link
  4. GRUS, Joel. Reproducibility as a Vehicle for Engineering Best Practices. ICLR, 2019. link (the first 19 minutes)
  5. (Optional) LIPTON, Zachary C.; STEINHARDT, Jacob. Troubling trends in machine learning scholarship. Queue, v. 17, n. 1, p. 45-77, 2019. link
  6. (Optional) WIRED. Artificial intelligence confronts a reproducibility crisis, 2019. link
  7. (Optional) BENDER, Emily. The #BenderRule: On Naming the Languages We Study and Why It Matters. Blogpost, 2019. link
  8. More material about Ethics in NLP. link