Announcing the GENEA Leaderboard: an upcoming online benchmark for gesture generation
TL;DR: To improve benchmarking of speech-driven gesture generation, we are developing the online GENEA Leaderboard based on large-scale human evaluation. This is a cross between the previous GENEA Challenges in gesture generation and recent leaderboards in NLP and computer vision, such as Chatbot Arena and HEIM.
New position paper
We have released a new position paper about the problems of gesture-generation evaluation and the details of the GENEA Leaderboard initiative! Click here or on the image below to view the arXiv preprint.
Our goals
- Establish a living benchmark of gesture-generation models using human evaluation
- Ensure high reproducibility by releasing all collected model outputs and human ratings, together with scripts for motion visualisation, conducting user studies, and more
- Develop better objective metrics based on the collected human ratings
- Unify gesture-generation researchers from computer vision, computer graphics, machine learning, NLP, and HCI
- Evolve with the community
Leaderboard setup and timeline
The leaderboard will be released in two stages.
To construct the leaderboard, we are inviting the authors of a selection of already published gesture-generation models to participate in a large-scale evaluation. The organisers will conduct a comprehensive evaluation of the submitted systems; the results will then be published on our website, alongside all collected outputs, ratings, and scripts necessary for reproducing the evaluation.
Afterwards, the leaderboard will become open to new submissions, and will be continuously updated by the GENEA team. Our current plan is to release the leaderboard by the end of the year.
Dataset
The leaderboard will use the English recordings of the BEAT-v2 dataset in the SMPL-X format, without facial expressions (a minimal loading sketch follows the list below). We believe this data is the best candidate for an initial benchmark dataset for several reasons:
- It’s the largest public mocap dataset of gesturing (with 60 hours of data)
- It has a high variety of speakers and emotions
- It includes semantic gesture annotations
- The SMPL-X format is compatible with many other datasets
- It also includes facial expressions (a possible future addition for the leaderboard)
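As an illustration of what working with this data might look like, here is a minimal sketch of inspecting one BEAT-v2 SMPL-X recording. The file name and the .npz key names (`poses`, `trans`, `betas`, `expressions`) follow the common SMPL-X/AMASS convention and are assumptions on our part; check the official BEAT-v2 release for the exact layout.

```python
# Sketch: inspecting a hypothetical BEAT-v2 SMPL-X recording.
# Key names follow the usual SMPL-X/AMASS convention (an assumption,
# not a guarantee about the official release).
import numpy as np

data = np.load("example_recording.npz")   # hypothetical file name
print(list(data.keys()))

poses = data["poses"]    # per-frame SMPL-X pose parameters (axis-angle)
trans = data["trans"]    # per-frame root translation
betas = data["betas"]    # body shape coefficients
print(poses.shape, trans.shape, betas.shape)

# The leaderboard excludes facial expressions, so any 'expressions'
# parameters in the files would simply be ignored.
```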
Since this is a living leaderboard, the dataset used for benchmarking is expected to evolve as newer datasets become available.
Evaluation methodology
We will recruit a large number of evaluators on a crowd-sourcing platform to conduct best-practice human evaluation of three aspects:
- Motion naturalness
- Motion appropriateness for the speech
- Emotional expression
For motion naturalness and emotional expression, we will use an Elo-style ranking system based on pairwise comparisons (the Bradley-Terry model), similar to Chatbot Arena.
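To make the ranking idea concrete, here is a minimal sketch of estimating Bradley-Terry strengths from pairwise preference counts, the same principle behind Elo-style leaderboards such as Chatbot Arena. The systems and win counts are made up for illustration; the actual leaderboard tooling and data format have not been released yet.

```python
# Sketch: Bradley-Terry strength estimation from pairwise wins
# via a simple fixed-point (MM) iteration.
import numpy as np

def bradley_terry(wins, n_iter=200, tol=1e-9):
    """wins[i, j] = number of times system i was preferred over system j."""
    n = wins.shape[0]
    p = np.ones(n)                    # initial strength estimates
    comparisons = wins + wins.T       # total pairings between i and j
    for _ in range(n_iter):
        p_new = np.empty(n)
        for i in range(n):
            denom = sum(comparisons[i, j] / (p[i] + p[j])
                        for j in range(n) if j != i and comparisons[i, j] > 0)
            p_new[i] = wins[i].sum() / denom if denom > 0 else p[i]
        p_new /= p_new.sum()          # strengths are only defined up to scale
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    return p

# Hypothetical example with three systems:
wins = np.array([[0, 14, 20],
                 [6,  0, 11],
                 [4,  9,  0]], dtype=float)
strengths = bradley_terry(wins)
print(np.argsort(-strengths))  # ranking from strongest to weakest
```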
To accurately quantify motion appropriateness, we will use a mismatching procedure based on the GENEA Challenges, in which raters compare motion paired with its matching speech against motion the same system generated for unrelated speech, so that preferences reflect speech-motion appropriateness rather than motion quality alone.
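The sketch below shows one way such matched/mismatched stimulus pairs could be constructed. The segment IDs and pairing details are illustrative only and loosely follow the GENEA Challenge idea, not the leaderboard's final protocol.

```python
# Sketch: pairing each speech segment with its own (matched) motion and
# with motion generated for a different segment (mismatched), so both
# stimuli in a comparison come from the same system.
import random

def make_appropriateness_pairs(segment_ids, seed=0):
    rng = random.Random(seed)
    shuffled = segment_ids[:]
    # Reshuffle until no segment is paired with its own motion
    # (a simple derangement).
    while True:
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(segment_ids, shuffled)):
            break
    return [
        {
            "speech": speech_id,                 # audio the rater hears
            "matched_motion": speech_id,         # motion generated for that audio
            "mismatched_motion": mismatched_id,  # motion from another segment
        }
        for speech_id, mismatched_id in zip(segment_ids, shuffled)
    ]

print(make_appropriateness_pairs(["seg_001", "seg_002", "seg_003", "seg_004"]))
```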
To standardise human evaluation, our tooling for running experiments will be released alongside the necessary visualisation scripts and 3D models.
The leaderboard will also feature many commonly used objective metrics (e.g., FGD and beat consistency) as well as model properties such as size, memory usage, etc.
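For reference, FGD (Fréchet Gesture Distance) is the Fréchet distance between Gaussians fitted to feature embeddings of real and generated motion, the same formula as FID in image generation. The sketch below assumes a feature extractor (typically a pretrained motion autoencoder) that is not shown; it is an illustration, not the leaderboard's official metric implementation.

```python
# Sketch: Fréchet Gesture Distance between two sets of motion features.
import numpy as np
from scipy.linalg import sqrtm

def frechet_gesture_distance(real_feats, gen_feats):
    """real_feats, gen_feats: arrays of shape (num_clips, feature_dim)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```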
How to participate
To participate in the evaluation, you will need to:
- Train your model on the BEAT-v2 dataset, with the official test set held out (see the hold-out sketch after this list).
- Generate motion for the leaderboard’s test set (a superset of the BEAT-v2 test set, to be provided at a later time).
- Submit the generated motion to the leaderboard organisers, alongside your paper or brief technical report describing the details of your model. If submitting an already published model, you only need to document the adaptations you made for the new dataset.
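As a small illustration of the first step, here is a sketch of excluding the official test set when building a training split. The file-list names and format are hypothetical; the leaderboard's actual test-set list will be provided by the organisers.

```python
# Sketch: holding out the official test set from training data.
from pathlib import Path

def load_ids(path):
    return {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}

all_ids = load_ids("beat_v2_all_recordings.txt")   # hypothetical file list
test_ids = load_ids("leaderboard_test_set.txt")    # provided later by the organisers

train_ids = sorted(all_ids - test_ids)              # never train on test segments
print(f"{len(train_ids)} training recordings, {len(test_ids)} held out")
```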
We are currently reaching out to potential participants for the first evaluation. We will share more details in the upcoming months.
Frequently Asked Questions
Why do we need a leaderboard?
- Gesture generation research is currently fragmented across different datasets and evaluation protocols.
- Objective metrics are inconsistently applied, and their validity is not sufficiently established in the literature.
- At the same time, subjective evaluation methods often have low reproducibility, and their results cannot be directly compared with one another.
- As a result, when comparing two publications it is impossible to know what the state of the art is, or which method works better for which purpose.
- The leaderboard is designed to directly counter these issues.
How are the evaluations funded?
We currently have academic funding for running the leaderboard for a period of time. Having your system evaluated by the leaderboard will be free of charge. However, if there are a lot of systems submitted, we might not be able to evaluate all of them.
Organisers
The evaluations will be managed by the GENEA team:
Scientific advisors
Contact
If you have any questions, please feel free to contact us at genea-leaderboard@googlegroups.com.