Bibliography
The reading list for each week will be available on Moodle and in the schedule. Below is an overview of the relevant literature.
Books & book chapters
- DROR, Rotem; PELED-COHEN, Lotem; SHLOMOV, Segev & REICHART, Roi. Statistical Significance Testing for Natural Language Processing. Synthesis Lectures on Human Language Technologies, v. 13, n. 2, p. 1-116, 2020. link
- GALLIERS, Julia R.; SPÄRCK JONES, Karen. Evaluating natural language processing systems. University of Cambridge, Computer Laboratory, 1993.
- HIRSCHMAN, Lynette; THOMPSON, Henry S. Overview of evaluation in speech and natural language processing. In: Survey of the state of the art in human language technology. Cambridge University Press, 1997. p. 409-414.
- RESNIK, Philip & LIN, Jimmy. Evaluation of NLP Systems. In: Clark, Alexander; Fox, Chris & Lappin, Shalom (Eds.). The handbook of computational linguistics and natural language processing. John Wiley & Sons, pp. 271-295, 2010.
- (Experimentation, Appendix B of) SMITH, Noah A. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies, v. 4, n. 2, 2011. link
- SPÄRCK JONES, Karen & GALLIERS, Julia R. Evaluating Natural Language Processing Systems: An Analysis and Review. Berlin: Springer, 1996. link
Papers
- BARR, Valerie; KLAVANS, Judith L. Verification and validation of language processing systems: is it evaluation?. In: Proceedings of the ACL 2001 Workshop on Evaluation Methodologies for Language and Dialogue Systems. 2001. link
- BELINKOV, Yonatan & GLASS, James. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, v. 7, p. 49-72, 2019. link
- BELZ, Anja. That’s nice… what can you do with it?. Computational Linguistics, v. 35, n. 1, p. 111-118, 2009.
- BENDER, Emily M.; FRIEDMAN, Batya. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, v. 6, p. 587-604, 2018. link
- BERG-KIRKPATRICK, Taylor; BURKETT, David; KLEIN, Dan. An empirical investigation of statistical significance in NLP. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p. 995-1005, 2012. link
- DROR, Rotem; BAUMER, Gili; BOGOMOLOV, Marina & REICHART, Roi. Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets. Transactions of the Association for Computational Linguistics, v. 5, p. 471-486, 2017. link
- DROR, Rotem; BAUMER, Gili; SHLOMOV, Segev & REICHART, Roi. The hitchhiker’s guide to testing statistical significance in natural language processing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 1383-1392, 2018. link
- ESCARTÍN, Carla Parra et al. Ethical Considerations in NLP Shared Tasks. In: Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. 2017. p. 66-73. link
- FORT, Karën; ADDA, Gilles; COHEN, K. Bretonnel. Amazon mechanical turk: Gold mine or coal mine?. Computational Linguistics, v. 37, n. 2, p. 413-420, 2011. link
- GORMAN, Kyle; BEDRICK, Steven. We need to talk about standard splits. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. p. 2786-2791. link
- HOVY, Dirk; SPRUIT, Shannon L. The social impact of natural language processing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2016. p. 591-598. link
- KING, Margaret. Evaluating natural language processing systems. Communications of the ACM, v. 39, n. 1, p. 73-79, 1996. link
- NOVIKOVA, Jekaterina et al. Why We Need New Evaluation Metrics for NLG. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, p. 2241-2252, 2017. link
- PAROUBEK, Patrick; CHAUDIRON, Stéphane & HIRSCHMAN, Lynette. Principles of Evaluation in Natural Language Processing. Traitement Automatique des Langues, ATALA, v. 48, n. 1, p. 7-31, 2007. hal-00502700.
- SØGAARD, Anders; JOHANNSEN, Anders; PLANK, Barbara; HOVY, Dirk & MARTINEZ, Hector. What’s in a p-value in NLP?. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, p. 1-10, 2014. link
- SPÄRCK JONES, Karen. Towards better NLP system evaluation. In: Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. link
- VAN DER LEE, Chris; GATT, Albert; VAN MILTENBURG, Emiel; WUBBEN, Sander & KRAHMER, Emiel. Best practices for the human evaluation of automatically generated text. In: Proceedings of the 12th International Conference on Natural Language Generation, p. 355-368, 2019. link
Miscellaneous
- COHEN, Paul R.; HOWE, Adele E. How evaluation guides AI research: The message still counts more than the medium. AI Magazine, v. 9, n. 4, p. 35-43, 1988. link
- HUYEN, Chip. Evaluation Metrics for Language Modeling. Blogpost, 2019. link
- KING M., Maegaard B., Schütz J., des Tombes L., Bech A., Neville A., Arppe A., Balkan L., Brace C., Bunt H., Carlson L., Douglas S., Höge M., Krauwer S., Manzi S., Mazzi C., Sieleman A. J., Steenbakkers R. EAGLES Evaluation of Natural Language Processing Systems: Final Report. EAGLES Document EAG-EWG-PR.2. Center for Sprogteknologi, Copenhagen, 1996. link
- LIPTON, Zachary C.; STEINHARDT, Jacob. Troubling trends in machine learning scholarship. Queue, v. 17, n. 1, p. 45-77, 2019. link
- POTTS, Christopher. Evaluation methods and metrics in NLP. 2020. link and link
- REITER, Ehud. Blog with several posts related to evaluation. link