Weekly Schedule
Contact me if you are interested in the slides.
CORRECTION: If you downloaded the slides a while ago: in slides 13 and 14 on statistical significance testing, the comparison in the if statement of the algorithms should be >= (instead of >).
Week | Date | Focus | Slides
--- | --- | --- | ---
1 | 04.11.2020 | Introduction / Course guidelines |
2 | 11.11.2020 | Paradigms |
3 | 18.11.2020 | Common procedures |
4 | 25.11.2020 | Annotation |
5 | 02.12.2020 | Metrics & Measurements |
6 | 09.12.2020 | Statistical significance testing |
7 | 16.12.2020 | Best practices / Presentation guidelines |
Break | | |
8 | 06.01.2021 | Presentation preparation |
9 | 13.01.2021 | Group 1: Machine Translation & History of Evaluation in NLP |
10 | 20.01.2021 | Group 2: Natural Language Generation & Shared Tasks |
11 | 27.01.2021 | Group 3: Dialogue & Replication Crisis |
12 | 03.02.2021 | Group 4: Speech Synthesis & Ethics |
13 | 10.02.2021 | Tutorial / Project guidelines |
Weekly Reading List
Introduction
- COHEN, Paul R.; HOWE, Adele E. How evaluation guides AI research: The message still counts more than the medium. AI magazine, v. 9, n. 4, p. 35-43, 1988. link
- KING, Margaret. Evaluating natural language processing systems. Communications of the ACM, v. 39, n. 1, p. 73-79, 1996. link
Paradigms
- SPÄRCK JONES, Karen. Towards better NLP system evaluation. In: Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. link
- HIRSCHMAN, Lynette; THOMPSON, Henry S. Overview of evaluation in speech and natural language processing. In: Survey of the state of the art in human language technology. Cambridge University Press, 1997. link (pages 409-414)
- BELZ, Anja. That’s nice… what can you do with it?. Computational Linguistics, v. 35, n. 1, p. 111-118, 2009. link
Common Procedures
- POTTS, Christopher. Evaluation methods in NLP. 2020. link
- Communicating results with scientific graphs. The University of Queensland. link (click on each type of plot to see presentation best practices)
- REITER, Ehud. Use proper baselines. Blogpost, 2018. link
- (Optional) Wikipedia’s overview on cross validation. link
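The k-fold cross-validation procedure covered in the Wikipedia overview can be sketched in a few lines. This is a minimal illustration, not code from any of the readings; the function name `k_fold_splits` is mine:

```python
import random

def k_fold_splits(n_examples, k, seed=0):
    """Partition example indices into k folds; each fold serves once
    as the held-out test set while the remaining folds form the
    training set."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With 10 examples and 5 folds, every example is held out exactly once.
splits = list(k_fold_splits(10, 5))
```

Averaging the evaluation metric over the k held-out folds gives a less noisy estimate than a single train/test split, at the cost of k training runs.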
Annotation
- SCHNEIDER, Nathan. What I’ve learned about annotating informal text (and why you shouldn’t take my word for it). In: Proceedings of The 9th Linguistic Annotation Workshop. 2015. p. 152-157. link
- BENDER, Emily M.; FRIEDMAN, Batya. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, v. 6, p. 587-604, 2018. link (sections 1-5)
- FORT, Karën; ADDA, Gilles; COHEN, K. Bretonnel. Amazon mechanical turk: Gold mine or coal mine?. Computational Linguistics, v. 37, n. 2, p. 413-420, 2011. link
- (Optional) FORT, Karën. Corpus Linguistics: Inter-Annotator Agreements (slides), 2011. link
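One common inter-annotator agreement measure from the slides above, Cohen's kappa for two annotators, can be sketched as follows. This is my own minimal illustration, not code from the readings:

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Chance-corrected agreement between two annotators who labeled
    the same items: kappa = (p_o - p_e) / (1 - p_e)."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Expected chance agreement: product of each annotator's
    # label proportions, summed over labels.
    c1, c2 = Counter(ann1), Counter(ann2)
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)
```

Kappa is 1.0 for perfect agreement and approaches 0 when annotators agree no more often than chance, which is why raw percent agreement alone can be misleading on skewed label sets.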
Metrics and Measurements
- POTTS, Christopher. Evaluation metrics in NLP. 2020. Read or watch.
- REITER, Ehud. Why do we still use 18-year old BLEU? / Small differences in BLEU are meaningless / Learning does not require evaluation metrics. Blogposts, 2018 and 2020. link1 link2 link3
- HUYEN, Chip. Evaluation Metrics for Language Modeling. Blogpost, 2019. link
- (Optional) Slides on Information Theory metrics. link
- (Optional) Slides on Minimum Edit Distance algorithm. link
- (Optional) Learning curves for diagnosing machine learning performance. link
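The minimum edit distance algorithm from the optional slides above is standard dynamic programming; a minimal sketch with unit costs (Levenshtein distance — an assumption, since the slides may use different costs):

```python
def edit_distance(src, tgt):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn src into tgt."""
    m, n = len(src), len(tgt)
    # dp[i][j] = distance between src[:i] and tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of src[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of tgt[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[m][n]
```

The same table applies to word sequences instead of characters, which is how edit-distance-based metrics such as WER for speech recognition are computed.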
Statistical Significance Testing
- Chapters 2, 3 and 4 of DROR, Rotem; PELED-COHEN, Lotem; SHLOMOV, Segev; REICHART, Roi. Statistical Significance Testing for Natural Language Processing. Synthesis Lectures on Human Language Technologies, v. 13, n. 2, p. 1-116, 2020. link (pages 3-33)
- (Optional) DROR, Rotem; BAUMER, Gili; SHLOMOV, Segev; REICHART, Roi. The hitchhiker’s guide to testing statistical significance in natural language processing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 1383-1392, 2018. link
- (Optional) SØGAARD, Anders; JOHANNSEN, Anders; PLANK, Barbara; HOVY, Dirk; MARTINEZ, Hector. What’s in a p-value in NLP?. In: Proceedings of the eighteenth conference on computational natural language learning, p. 1-10, 2014. link
- (Optional) BERG-KIRKPATRICK, Taylor; BURKETT, David; KLEIN, Dan. An empirical investigation of statistical significance in NLP. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p. 995-1005, 2012. link
- (Optional) William Morgan’s slides about Approximate Randomization. link
- (Optional) KÖHN, Arne. We need to talk about significance tests. Blogpost, 2019. link
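The approximate randomization test covered in Morgan's slides can be sketched as below, using the >= comparison noted in the correction at the top of this page. This is my own illustration under simple assumptions (per-item scores, difference of means as the test statistic), not code from the readings:

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Approximate randomization test for the difference in mean score
    between systems A and B on the same test items.

    Under the null hypothesis the systems are interchangeable, so we
    randomly swap the two scores for each item and count how often the
    shuffled difference is at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) / n - sum(scores_b) / n)
    count = 0
    for _ in range(trials):
        a, b = [], []
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this item's scores
                x, y = y, x
            a.append(x)
            b.append(y)
        shuffled = abs(sum(a) / n - sum(b) / n)
        if shuffled >= observed:  # >=, per the slide correction
            count += 1
    # Add-one smoothing keeps the p-value estimate strictly positive.
    return (count + 1) / (trials + 1)
```

If the two systems produce identical scores the observed difference is zero, every shuffled difference matches it, and the p-value is 1; large, consistent gaps drive the p-value toward its minimum of 1/(trials + 1).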
Best Practices
- HOVY, Dirk; SPRUIT, Shannon L. The social impact of natural language processing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2016. p. 591-598. link
- REITER, Ehud. My guidelines to evaluating AI systems, 2017, and Do people “cheat” by overfitting test data, 2020. Blogposts. link1 and link2
- HEAVEN, Will Douglas. AI is wrestling with a replication crisis. MIT Technology Review, 2020. link
- GRUS, Joel. Reproducibility as a Vehicle for Engineering Best Practices. ICLR, 2019. link (the first 19 minutes)
- (Optional) LIPTON, Zachary C.; STEINHARDT, Jacob. Troubling trends in machine learning scholarship. Queue, v. 17, n. 1, p. 45-77, 2019. link
- (Optional) WIRED. Artificial intelligence confronts a reproducibility crisis, 2019. link
- (Optional) BENDER, Emily. The #BenderRule: On Naming the Languages We Study and Why It Matters. Blogpost, 2019. link
- More material about Ethics in NLP. link