About the Leaderboard

The GENEA Gesture Generation Leaderboard is an upcoming living benchmark for gesture-generation models. It allows researchers to test and showcase their models’ performance using a standardised dataset and evaluation methodology.

Initially, the leaderboard will feature results from existing models that have been adapted to the BEAT-2 dataset. Following this initial phase, we’ll open the leaderboard to all researchers interested in submitting and evaluating new models.

Our goals

  • Establish a continuously updated definitive ranking of state-of-the-art models on the most common speech-gesture motion capture datasets, based on human evaluation.
  • Raise the standards for reproducibility by releasing all collected model outputs, human ratings, and scripts for motion visualisation, running user studies, and more.
  • Use the collected human ratings to develop better objective metrics that are aligned with human perception of motion.
  • Unify gesture-generation researchers from computer vision, computer graphics, machine learning, NLP, HCI, and robotics.
  • Evolve with the community as new datasets, evaluations, and methodologies become available.

Outcomes

Once the Leaderboard is operational, you will be able to:

  • Submit your gesture-generation model's outputs and receive human evaluation results within 2-4 weeks, free of charge, managed by our expert team;
  • Compare to any state-of-the-art method on the Leaderboard using our comprehensive collection of rendered video outputs, without having to reproduce baselines;
  • Visualise your generated motion and conduct our user studies on your own using our easy-to-use open-source tools;
  • ...and much more!

BEAT-2 in the SMPL-X Format

The leaderboard will initially evaluate models using the English recordings in the test split of the BEAT-2 dataset. Submissions will be required to be in the same SMPL-X format as the dataset, but we will hide facial expressions in order to focus on hand and body movements.

An example video clip rendered from the BEAT-2 dataset. The avatar is a textured SMPL-X mesh.
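To make the submission format more concrete, below is a minimal Python sketch of how facial parameters could be stripped from an SMPL-X motion file before evaluation. The .npz layout and key names ('poses', 'expressions') are assumptions for illustration, not a confirmed submission specification.

```python
import numpy as np

def strip_face(in_path: str, out_path: str) -> None:
    """Zero out facial parameters in an SMPL-X motion file (illustrative sketch).

    Assumes an .npz file with a (num_frames, 165) 'poses' array in the common
    SMPL-X layout (root, body, jaw, eyes, hands) and an 'expressions' array.
    These key names are assumptions, not the official submission spec.
    """
    data = dict(np.load(in_path))

    poses = data["poses"]
    poses[:, 66:75] = 0.0                                      # jaw + eye rotations
    data["poses"] = poses
    data["expressions"] = np.zeros_like(data["expressions"])   # facial expression coefficients

    np.savez(out_path, **data)
```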

We think this dataset is the best candidate for an initial benchmark for several reasons:

  1. It’s the largest public mocap dataset of gesturing (with 60 hours of data in total).
  2. BEAT, its predecessor, is one of the most commonly used gesture-generation datasets in recent years.
  3. It has a high variety of speakers and emotions, and it includes semantic gesture annotations.
  4. The SMPL-X format is compatible with many other mocap datasets and with the majority of pose-estimation models, and it supports potential extensions (e.g., facial expressions for future iterations).

Since this is a living leaderboard, the dataset used for benchmarking is expected to change in the future as better mocap datasets become available.

Evaluation methodology

We will recruit large numbers of evaluators to conduct human evaluations that follow best practices. Our perceptual user studies will be designed to carefully disentangle key aspects of gesture evaluation, following what we learned from organising the 2020–2023 GENEA challenges.

Evaluation tasks

Motion quality

The first evaluation task will measure motion quality: in other words, the degree to which evaluators perceive the overall movements as natural-looking gesturing, without considering the speech. For this evaluation, the stimuli will be silent videos, and we will perform pairwise comparisons of motion from different sources (e.g., gesture-generation systems, baselines, or mocap data).

The statistical analysis will use an Elo-style ranking system, specifically the Bradley-Terry model, similar to the methodology of Chatbot Arena. You can read more about Elo scores and the Bradley-Terry model in this blog post; the key points are that 1) Elo-like systems naturally suit a leaderboard setting where scores are continuously updated and the comparisons are not necessarily exhaustive, and 2) the difference between the Elo scores of two systems directly translates into the probability that users prefer the output of one system over the output of the other.
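As a rough illustration of the statistics involved, the sketch below fits Bradley-Terry strengths to a matrix of pairwise preference counts using the classic Zermelo/MM iteration and converts them to an Elo-like scale. The function names and the example counts are made up for illustration and do not reflect our actual analysis code.

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Estimate Bradley-Terry strengths p from pairwise preference counts.

    wins[i, j] = number of times system i was preferred over system j.
    Under the model, P(i preferred over j) = p[i] / (p[i] + p[j]).
    Uses the classic Zermelo/MM fixed-point iteration.
    """
    p = np.ones(wins.shape[0])
    comparisons = wins + wins.T                       # total comparisons per pair
    for _ in range(iters):
        denom = comparisons / (p[:, None] + p[None, :])
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p /= p.sum()                                  # strengths are only defined up to scale
    return p

def to_elo_scale(p: np.ndarray, base: float = 400.0, anchor: float = 1000.0) -> np.ndarray:
    """Map strengths to an Elo-like scale: a 'base'-point gap means ~10:1 preference odds."""
    return anchor + base * np.log10(p / p.mean())

# Made-up example with three systems.
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]], dtype=float)
p = fit_bradley_terry(wins)
print("P(system 0 preferred over system 1):", p[0] / (p[0] + p[1]))
print("Elo-like scores:", to_elo_scale(p).round(1))
```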

We believe this approach will prove to be a highly scalable and efficient method with interpretable results, allowing us to conduct sustainable, recurring evaluations for each future submission separately.

A preview of the evaluation interface used in our studies.

Motion specificity to speech

The second evaluation task will measure whether the outputs of the gesture-generation system are actually related to the speech input. As discussed in the GENEA Challenge 2022 paper, a naive evaluation of this question – e.g., directly asking evaluators to choose which of two systems generated movements that are more appropriate for the speech – carries a significant risk of confounding with other factors, such as motion quality.

Therefore, we will use a mismatching procedure based on the one developed for the GENEA Challenges. In a nutshell, our approach is to show two clips from the same system: one with correctly paired speech and motion, and the other with independent, intentionally misaligned motion and speech signals. Evaluators will then be asked to indicate which of the two videos shows a better connection between speech and motion.

This is also a pairwise comparison, but unlike the motion quality assessment, it can be performed for each system independently, which avoids motion quality as a confounding factor.
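To make the mismatching idea concrete, here is a small sketch (with hypothetical segment names) of how mismatched stimuli could be assembled: each speech segment is paired with motion that the same system generated for a different segment.

```python
import random

def build_mismatched_pairs(segment_ids: list[str], seed: int = 0) -> list[tuple[str, str]]:
    """Pair every speech segment with motion from a *different* segment of the same system.

    Returns (speech_segment, motion_segment) tuples; the matched stimulus simply
    uses the same segment for both. Needs at least two segments.
    """
    rng = random.Random(seed)
    motion_ids = segment_ids[:]
    # Reshuffle until no segment keeps its own motion (i.e., a derangement).
    while any(s == m for s, m in zip(segment_ids, motion_ids)):
        rng.shuffle(motion_ids)
    return list(zip(segment_ids, motion_ids))

# Evaluators then compare the matched clip (speech + its own motion) against the
# mismatched clip (same speech + unrelated motion from the same system).
print(build_mismatched_pairs(["seg01", "seg02", "seg03", "seg04"]))
```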

Future evaluations

After the leaderboard becomes established, we will include new evaluation tasks based on which datasets become available and which challenges become more important in the field. Some possibilities that are already compatible with the BEAT-2 dataset are evaluating facial expressions or emotion expressivity. For future datasets, it might become possible to test motion specificity to the meaning of the speech, as well as other types of grounding information. (See our position paper for more ideas.)

Tooling

Standardised visualisation

Visualisation is one of the most important design choices for perceptual user studies that evaluate motion synthesis. Currently, almost every gesture-generation paper uses a different character model and 3D scene configuration, due to the difficulty of using animation software and the lack of shared 3D assets. Because character appearance and other environmental factors can have a subtle but important effect on the evaluation, human evaluation results are largely incomparable with each other.

We will use a standardised visualisation setup containing a textured SMPL-X mesh as the human character model and a minimal 3D scene with lighting. There will be an option to hide the face in the videos, since our first evaluations will only be based on hand and body motion. The videos shown above on this page are previews of our visualisation setup.

We are currently working on an open-source, automated pipeline for rendering videos for our user studies, based on our previous GENEA Blender visualiser. The updated pipeline will be shared with the community after we release the leaderboard.

User-study automation

To standardise the human evaluation process, we are rewriting the HEMVIP codebase, which was originally developed for the GENEA challenges, with an emphasis on stability and ease of use. This software will also be open-sourced – our vision is to enable independent replication of our evaluations, and to lower the barriers for crowd-sourced evaluations.

Objective evaluation

The leaderboard will also feature many commonly used objective metrics (e.g., FGD and beat consistency), and we are planning to develop new automated evaluation methods based on the collected human preference data. Each of these will be open-sourced with the release of the leaderboard.
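For reference, FGD follows the same recipe as the Fréchet Inception Distance: fit a Gaussian to latent features of natural and generated motion and compute the Fréchet distance between the two distributions. Below is a minimal sketch; the feature extractor (typically a pretrained motion autoencoder) is assumed to exist separately and is not shown, and the function name is illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet (Gesture) Distance between two sets of latent feature vectors,
    each of shape (num_clips, feature_dim), extracted from natural and
    generated motion respectively.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r @ cov_g)^(1/2))
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```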

Frequently Asked Questions

Why do we need a leaderboard?
  • Gesture generation research is currently fragmented across different datasets and evaluation protocols.
  • Objective metrics are inconsistently applied, and their validity is not sufficiently established in the literature.
  • At the same time, subjective evaluation methods often have low reproducibility, and their results are impossible to directly compare to each other.
  • This leads to a situation where it is impossible to know what the state of the art is, or which method works better for which purpose when comparing two publications.
  • The leaderboard is designed to directly counter these issues.
How are the evaluations funded?

We currently have academic funding for running the leaderboard for a period of time. Having your system evaluated by the leaderboard will be free of charge. However, if there are a lot of systems submitted, we might not be able to evaluate all of them.


Contact address

genea-leaderboard@googlegroups.com