Bibliography
This course does not follow a single source. Each lecture links to its corresponding material, and I also list some relevant works here.
A comprehensive overview is available in:
- RESNIK, Philip & LIN, Jimmy. Evaluation of NLP Systems. In: Clark, Alexander; Fox, Chris & Lappin, Shalom (Eds.). The handbook of computational linguistics and natural language processing. John Wiley & Sons, pp. 271-295, 2010.
Books & book chapters
- DROR, Rotem; PELED-COHEN, Lotem; SHLOMOV, Segev & REICHART, Roi. Statistical Significance Testing for Natural Language Processing. Synthesis Lectures on Human Language Technologies, v. 13, n. 2, p. 1-116, 2020. link
- GALLIERS, Julia R.; SPÄRCK JONES, Karen. Evaluating natural language processing systems. University of Cambridge, Computer Laboratory, 1993.
- HIRSCHMAN, Lynette; THOMPSON, Henry S. Overview of evaluation in speech and natural language processing. In: Survey of the state of the art in human language technology. Cambridge University Press, 1997. p. 409-414.
- (Experimentation, Appendix B of) SMITH, Noah A. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies, v. 4, n. 2, 2011. link
- SPÄRCK JONES, Karen & GALLIERS, Julia R. Evaluating Natural Language Processing Systems: An Analysis and Review. Berlin: Springer, 1996. link
Papers
- BARR, Valerie; KLAVANS, Judith L. Verification and validation of language processing systems: is it evaluation?. In: Proceedings of the ACL 2001 Workshop on Evaluation Methodologies for Language and Dialogue Systems. 2001. link
- BELINKOV, Yonatan & GLASS, James. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, v. 7, p. 49-72, 2019. link
- BELZ, Anja. That’s nice… what can you do with it?. Computational Linguistics, v. 35, n. 1, p. 111-118, 2009.
- BENDER, Emily M.; FRIEDMAN, Batya. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, v. 6, p. 587-604, 2018. link
- BERG-KIRKPATRICK, Taylor; BURKETT, David; KLEIN, Dan. An empirical investigation of statistical significance in NLP. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p. 995-1005, 2012. link
- DROR, Rotem; BAUMER, Gili; BOGOMOLOV, Marina & REICHART, Roi. Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets. Transactions of the Association for Computational Linguistics, v. 5, p. 471-486, 2017. link
- DROR, Rotem; BAUMER, Gili; SHLOMOV, Segev & REICHART, Roi. The hitchhiker’s guide to testing statistical significance in natural language processing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 1383-1392, 2018. link
- ESCARTÍN, Carla Parra et al. Ethical Considerations in NLP Shared Tasks. In: Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. 2017. p. 66-73. link
- FORT, Karën; ADDA, Gilles; COHEN, K. Bretonnel. Amazon mechanical turk: Gold mine or coal mine?. Computational Linguistics, v. 37, n. 2, p. 413-420, 2011. link
- GORMAN, Kyle; BEDRICK, Steven. We need to talk about standard splits. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. p. 2786-2791. link
- HOVY, Dirk; SPRUIT, Shannon L. The social impact of natural language processing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2016. p. 591-598. link
- KING, Margaret. Evaluating natural language processing systems. Communications of the ACM, v. 39, n. 1, p. 73-79, 1996. link
- NOVIKOVA, Jekaterina et al. Why We Need New Evaluation Metrics for NLG. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, p. 2241-2252, 2017. link
- PAROUBEK, Patrick; CHAUDIRON, Stéphane & HIRSCHMAN, Lynette. Principles of Evaluation in Natural Language Processing. Traitement Automatique des Langues, ATALA, v. 48, n. 1, p. 7-31, 2007. hal-00502700.
- SØGAARD, Anders; JOHANNSEN, Anders; PLANK, Barbara; HOVY, Dirk & MARTINEZ, Hector. What’s in a p-value in NLP?. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, p. 1-10, 2014. link
- SPÄRCK JONES, Karen. Towards better NLP system evaluation. In: Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. link
- VAN DER LEE, Chris; GATT, Albert; VAN MILTENBURG, Emiel; WUBBEN, Sander & KRAHMER, Emiel. Best practices for the human evaluation of automatically generated text. In: Proceedings of the 12th International Conference on Natural Language Generation, p. 355-368, 2019. link
Miscellaneous
- COHEN, Paul R.; HOWE, Adele E. How evaluation guides AI research: The message still counts more than the medium. AI magazine, v. 9, n. 4, p. 35-35, 1988. link
- HUYEN, Chip. Evaluation Metrics for Language Modeling. Blogpost, 2019. link
- KING M., Maegaard B., Schütz J., des Tombes L., Bech A., Neville A., Arppe A., Balkan L., Brace C., Bunt H., Carlson L., Douglas S., Höge M., Krauwer S., Manzi S., Mazzi C., Sieleman A. J., Steenbakkers R. EAGLES Evaluation of Natural Language Processing Systems: Final Report. EAGLES Document EAG-EWG-PR.2. Center for Sprogteknologi, Copenhagen, 1996. link
- LIPTON, Zachary C.; STEINHARDT, Jacob. Troubling trends in machine learning scholarship. Queue, v. 17, n. 1, p. 45-77, 2019. link
- POTTS, Christopher. Evaluation methods and metrics in NLP. 2020. link and link
- REITER, Ehud. His blog has several posts related to evaluation. link
Group presentations
Dialogue
- DERIU, Jan et al. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, p. 1-56, 2020. link
- LIU, Chia-Wei et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. p. 2122-2132. link
- DUSEK, O. & HUDECEK, V. Course about Dialogue Systems, 2019. link
Machine Translation
- CALLISON-BURCH, Chris et al. (Meta-) evaluation of machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation. 2007. p. 136-158. link
- CHATZIKOUMI, Eirini. How to evaluate machine translation: A review of automated and human metrics. Natural Language Engineering, v. 26, n. 2, p. 137-161, 2020. link
- MÜLLER, Mathias. Seven recommendations for machine translation evaluation. Blogpost on Dec 15, 2020. link
- RIEZLER, Stefan; MAXWELL III, John T. On some pitfalls in automatic evaluation and significance testing for MT. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005. p. 57-64. link
Natural Language Generation
- CELIKYILMAZ, Asli; CLARK, Elizabeth; GAO, Jianfeng. Evaluation of Text Generation: A Survey. Preprint. link and slides
- GATT, Albert; KRAHMER, Emiel. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, v. 61, p. 65-170, 2018. link
- HOWCROFT, David M. et al. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In: Proceedings of the 13th International Conference on Natural Language Generation. 2020. p. 169-182. link
- NOVIKOVA, Jekaterina et al. Why We Need New Evaluation Metrics for NLG. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. p. 2241-2252. link
- REITER, Ehud; BELZ, Anja. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, v. 35, n. 4, p. 529-558, 2009. link
- VAN DER LEE, Chris et al. Best practices for the human evaluation of automatically generated text. In: Proceedings of the 12th International Conference on Natural Language Generation. 2019. p. 355-368. link
Speech Synthesis
- LE MAGUER, Sébastian. Speech Synthesis Evaluation. Lecture at Saarland University, 2020. link
- WAGNER, Petra et al. Speech Synthesis Evaluation—State-of-the-Art Assessment and Suggestion for a Novel Research Program. In: Proceedings of the 10th Speech Synthesis Workshop (SSW10). 2019. link