1h Workshop: Hands-on AI agent evaluation: building benchmarks with Harbor
Piotr Migdal
Quesma
Abstract
This is a hands-on introduction to Harbor, an open-source framework for evaluating AI agents, creating benchmarks, and running reinforcement learning. You will learn why Harbor is a game-changer for AI development and see real-world examples, including our migration of CompileBench to Harbor.
During the workshop, we will build a small benchmark from scratch. You will leave with a working setup and the skills to benchmark AI models and agents effectively.
To get the most out of this session, please bring your laptop. We recommend installing Harbor before the event. The only technical prerequisites are uv and Docker. Please also make sure you have an API key for your preferred model provider (e.g., Anthropic, OpenAI, Gemini, or OpenRouter).
Bio
Piotr Migdal is a founding engineer and AI specialist with a strong background in data visualization, machine learning, and applied research. He is a Founding Engineer at Quesma, where he leads development of Quesma Charts, using AI to transform data from sources such as CSV, SQL, and APIs into accurate, high-quality visualizations for tools including ggplot2 and Grafana. Previously, he worked as an independent AI consultant, delivering user-facing AI and data science solutions across medtech, biotech, gaming, and emerging technologies. Piotr is also the co-founder and former CTO of Quantum Flytrap, focused on intuitive interfaces for quantum computing, and has experience as an AI researcher in game development and physics-based simulation. He combines deep technical expertise with a strong emphasis on usability and clear communication of complex data.