Bibliography

This course will not follow a single source. Each lecture will link to its corresponding material, and I also list some relevant works here.

A comprehensive overview is available in:

RESNIK, Philip & LIN, Jimmy. Evaluation of NLP Systems. In: Clark, Alexander; Fox, Chris & Lappin, Shalom (Eds.). The handbook of computational linguistics and natural language processing. John Wiley & Sons, pp. 271-295, 2010.

Books & book chapters

DROR, Rotem; PELED-COHEN, Lotem; SHLOMOV, Segev & REICHART, Roi. Statistical Significance Testing for Natural Language Processing. Synthesis Lectures on Human Language Technologies, v. 13, n. 2, p. 1-116, 2020. link
GALLIERS, Julia R.; SPÄRCK JONES, Karen. Evaluating natural language processing systems. University of Cambridge, Computer Laboratory, 1993.
HIRSCHMAN, Lynette; THOMPSON, Henry S. Overview of evaluation in speech and natural language processing. In: Survey of the state of the art in human language technology. Cambridge University Press, 1997. p. 409-414.
(Experimentation, Appendix B of) SMITH, Noah A. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies, v. 4, n. 2, 2011. link
SPÄRCK JONES, Karen & GALLIERS, Julia R. Evaluating Natural Language Processing Systems: An Analysis and Review. Berlin: Springer, 1996. link

Papers

BARR, Valerie; KLAVANS, Judith L. Verification and validation of language processing systems: is it evaluation?. In: Proceedings of the ACL 2001 Workshop on Evaluation Methodologies for Language and Dialogue Systems. 2001. link
BELINKOV, Yonatan & GLASS, James. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, v. 7, p. 49-72, 2019. link
BELZ, Anja. That’s nice… what can you do with it?. Computational Linguistics, v. 35, n. 1, p. 111-118, 2009.
BENDER, Emily M.; FRIEDMAN, Batya. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, v. 6, p. 587-604, 2018. link
BERG-KIRKPATRICK, Taylor; BURKETT, David; KLEIN, Dan. An empirical investigation of statistical significance in NLP. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p. 995-1005, 2012. link
DROR, Rotem; BAUMER, Gili; BOGOMOLOV, Marina & REICHART, Roi. Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets. Transactions of the Association for Computational Linguistics, v. 5, p. 471-486, 2017. link
DROR, Rotem; BAUMER, Gili; SHLOMOV, Segev & REICHART, Roi. The hitchhiker’s guide to testing statistical significance in natural language processing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 1383-1392, 2018. link
ESCARTÍN, Carla Parra et al. Ethical Considerations in NLP Shared Tasks. In: Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. 2017. p. 66-73. link
FORT, Karën; ADDA, Gilles; COHEN, K. Bretonnel. Amazon Mechanical Turk: Gold mine or coal mine?. Computational Linguistics, v. 37, n. 2, p. 413-420, 2011. link
GORMAN, Kyle; BEDRICK, Steven. We need to talk about standard splits. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. p. 2786-2791. link
HOVY, Dirk; SPRUIT, Shannon L. The social impact of natural language processing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2016. p. 591-598. link
KING, Margaret. Evaluating natural language processing systems. Communications of the ACM, v. 39, n. 1, p. 73-79, 1996. link
NOVIKOVA, Jekaterina et al. Why We Need New Evaluation Metrics for NLG. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, p. 2241-2252, 2017. link
PAROUBEK, Patrick; CHAUDIRON, Stéphane & HIRSCHMAN, Lynette. Principles of Evaluation in Natural Language Processing. Traitement Automatique des Langues, ATALA, v. 48, n. 1, p. 7-31, 2007. hal-00502700.
SØGAARD, Anders; JOHANNSEN, Anders; PLANK, Barbara; HOVY, Dirk & MARTINEZ, Hector. What’s in a p-value in NLP?. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, p. 1-10, 2014. link
SPÄRCK JONES, Karen. Towards better NLP system evaluation. In: Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. link
VAN DER LEE, Chris; GATT, Albert; VAN MILTENBURG, Emiel; WUBBEN, Sander & KRAHMER, Emiel. Best practices for the human evaluation of automatically generated text. In: Proceedings of the 12th International Conference on Natural Language Generation, p. 355-368, 2019. link

Miscellaneous

COHEN, Paul R.; HOWE, Adele E. How evaluation guides AI research: The message still counts more than the medium. AI Magazine, v. 9, n. 4, p. 35-43, 1988. link
HUYEN, Chip. Evaluation Metrics for Language Modeling. Blogpost, 2019. link
KING M., Maegaard B., Schütz J., des Tombes L., Bech A., Neville A., Arppe A., Balkan L., Brace C., Bunt H., Carlson L., Douglas S., Höge M., Krauwer S., Manzi S., Mazzi C., Sieleman A. J., Steenbakkers R. EAGLES Evaluation of Natural Language Processing Systems: Final Report. EAGLES Document EAGEWG-PR. 2. Center for Sprogteknologi, Copenhagen, 1996. link
LIPTON, Zachary C.; STEINHARDT, Jacob. Troubling trends in machine learning scholarship. Queue, v. 17, n. 1, p. 45-77, 2019. link
POTTS, Christopher. Evaluation methods and metrics in NLP. 2020. link and link
REITER, Ehud. His blog has several posts related to evaluation. link

Group presentations

Dialogue

DERIU, Jan et al. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, p. 1-56, 2020. link
LIU, Chia-Wei et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. p. 2122-2132. link
DUSEK, O. & HUDECEK, V. Course about Dialogue Systems, 2019. link

Machine Translation

CALLISON-BURCH, Chris et al. (Meta-) evaluation of machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation. 2007. p. 136-158. link
CHATZIKOUMI, Eirini. How to evaluate machine translation: A review of automated and human metrics. Natural Language Engineering, v. 26, n. 2, p. 137-161, 2020. link
MÜLLER, Mathias. Seven recommendations for machine translation evaluation. Blogpost on Dec 15, 2020. link
RIEZLER, Stefan; MAXWELL III, John T. On some pitfalls in automatic evaluation and significance testing for MT. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005. p. 57-64. link

Natural Language Generation

CELIKYILMAZ, Asli; CLARK, Elizabeth; GAO, Jianfeng. Evaluation of Text Generation: A Survey. Preprint. link and slides
GATT, Albert; KRAHMER, Emiel. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, v. 61, p. 65-170, 2018. link
HOWCROFT, David M. et al. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In: Proceedings of the 13th International Conference on Natural Language Generation. 2020. p. 169-182. link
NOVIKOVA, Jekaterina et al. Why We Need New Evaluation Metrics for NLG. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. p. 2241-2252. link
REITER, Ehud; BELZ, Anja. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, v. 35, n. 4, p. 529-558, 2009. link
VAN DER LEE, Chris et al. Best practices for the human evaluation of automatically generated text. In: Proceedings of the 12th International Conference on Natural Language Generation. 2019. p. 355-368. link

Speech Synthesis

LE MAGUER, Sébastian. Speech Synthesis Evaluation. Lecture at Saarland University, 2020. link
WAGNER, Petra et al. Speech Synthesis Evaluation—State-of-the-Art Assessment and Suggestion for a Novel Research Program. In: Proceedings of the 10th Speech Synthesis Workshop (SSW10). 2019. link