How We Evaluate Large Language Models
Patrycja Cieplicka
Tooploox
Abstract
Good evaluation helps us understand what large language models can really do. This talk gives an accessible overview of how large language models are evaluated in practice. It covers common open-source benchmarks and tools used to test model behaviour and capabilities, along with recent research trends, common pitfalls, and practical tips for real-world evaluation.
Bio
Patrycja Cieplicka is a Machine Learning Engineer with around six years of experience. At Tooploox, she focuses on Large Language Models, especially post-training, evaluation, and optimization. She holds a degree in Computer Science from the Warsaw University of Technology. In 2024, she was named one of the TOP100 Women in Data Science in Poland.