LLM-as-a-Judge: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation

Note: Most of the content in this blog post was previously published on the Zalando Engineering Blog.

We introduce a novel approach to large-scale product retrieval evaluation using Multimodal Large Language Models (MLLMs). Evaluated on 20,000 examples, our method shows how MLLMs can help automate the relevance assessment of retrieved products, achieving accuracy comparable to human annotators and enabling scalable evaluation for high-traffic e-commerce platforms.

Our contributions include:

- A multimodal LLM-based evaluation framework for large-scale product retrieval systems that uses LLMs to (i) generate context-specific annotation guidelines and (ii) conduct relevance assessments.
- A performance evaluation against human annotations on real-world production search queries in a multilingual setting, with an analysis of the different types of errors that humans and LLMs tend to make.
- A demonstration of the cost-effectiveness and efficiency of conducting large-scale evaluations, comparing different LLMs including GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo.

Why Evaluate Product Retrieval at Scale?

Search functionality is a fundamental component of e-commerce platforms, with the objective of finding the most relevant products in a dynamic product database. Customers using search often exhibit a higher intent to find specific products, leading to greater engagement and conversion rates. However, they may struggle to articulate their needs in a search query. And even when they do express their intent clearly, information retrieval systems and search engines may fail to interpret it correctly, resulting in irrelevant search results. ...
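To make the two-step framework concrete, here is a minimal sketch of the relevance-assessment step. It only builds the multimodal judge prompt and parses the model's answer; `call_mllm`, the label set, and the prompt wording are illustrative assumptions, not the exact guidelines or labels used in our work, and the commented-out API call stands in for any multimodal chat-completion backend such as GPT-4o.

```python
# Hypothetical label set; the actual annotation labels may differ.
LABELS = ("partially_relevant", "irrelevant", "relevant")

def build_judge_messages(query, product_title, image_url, guidelines):
    """Assemble a multimodal chat payload: the (LLM-generated) annotation
    guidelines go into the system prompt; the query/product pair, including
    the product image, forms the user turn."""
    return [
        {"role": "system", "content": guidelines},
        {"role": "user", "content": [
            {"type": "text",
             "text": f"Search query: {query!r}\n"
                     f"Product title: {product_title!r}\n"
                     f"Answer with exactly one label: {', '.join(LABELS)}."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ]

def parse_label(raw_answer):
    """Map a free-form model answer onto one of the fixed labels.
    Longest label first, so 'irrelevant' is not shadowed by 'relevant'."""
    answer = raw_answer.strip().lower()
    for label in sorted(LABELS, key=len, reverse=True):
        if label in answer:
            return label
    return "irrelevant"  # conservative fallback for unparseable answers

# With a multimodal chat-completion client the judge call would be roughly:
# def judge(client, query, title, image_url, guidelines):
#     resp = client.chat.completions.create(
#         model="gpt-4o",
#         messages=build_judge_messages(query, title, image_url, guidelines))
#     return parse_label(resp.choices[0].message.content)
```

Keeping the prompt construction and answer parsing as pure functions makes the judge easy to batch over many query/product pairs and to unit-test without network access.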

November 15, 2024 · Kasra Hosseini