Conversational Gesture Model (CGM): Extending Speaker-Centric Audio-Driven Motion Generation to Full Conversation Gestures

T. Koren¹, A. Rosenthal¹, D. Friedman², A. Shamir¹

¹School of Computer Science, Reichman University, ²School of Communication, Reichman University

Eurographics 2026

Conversational Gesture Model teaser showing speaking and listening gestures over time

Abstract

In this work we extend speaker-centric audio-driven gesture synthesis toward a unified conversational model that jointly captures both speaking and listening behaviors.

Existing speaker-centric models effectively generate gestures aligned with speech but overlook the bidirectional dynamics that characterize natural dialogue. To address this limitation, we propose the Conversational Gesture Model (CGM), a cross-attention-based model capable of synthesizing gestures conditioned on interlocutor conversational cues such as gestures, tone, and textual semantics. By leveraging cross-attention mechanisms, CGM fuses interlocutor audio and text features with character gesture encodings, enabling a single system to seamlessly alternate between speaking and listening roles of the same character.

Experiments demonstrate that this approach preserves the quality of speaker-driven gestures while significantly improving the realism, coherence, and responsiveness of full conversational interactions.

Our Approach

CGM augments a pretrained speaker-centric diffusion transformer with an interlocutor-aware cross-attention pathway. Character audio, text, seed pose, and noisy motion latents form the Character stream, while Interlocutor audio and text provide conversational context. The model fuses these signals to predict Character motion, enabling a single system to generate both speaking gestures and responsive listening behavior across role shifts.

CGM architecture with Character inputs, Interlocutor inputs, cross-attention, and predicted Character motion

Full Conversation Character Behavior

To assess how CGM handles natural conversational dynamics, we evaluate it on full dyadic interactions where the Character alternates between speaking and listening in response to the Interlocutor. This setting tests whether the model synthesizes co-articulated behaviors and smooth transitions between conversational roles. The evaluation uses 7,320 Talking With Hands frames, approximately 4.1 minutes, spanning four dyadic conversations between paired participants.

Method	ID 1 (Wayne)			ID 2 (Scott)			ID 6 (Carla)
Method	FGD ↓	BC × 10⁻¹ ↑	Diversity ↑	FGD ↓	BC × 10⁻¹ ↑	Diversity ↑	FGD ↓	BC × 10⁻¹ ↑	Diversity ↑
GT	0.000	7.855	9.578	0.000	6.696	9.578	0.000	6.966	9.578
3 Random Listener	3.247	7.188	7.274	3.247	7.188	7.274	3.247	4.988	7.274
SynTalker	2.327	8.149	5.777	1.453	8.314	9.162	1.099	7.546	5.937
DiffuseStyleGesture+	0.919	7.441	7.738	0.919	7.441	7.738	0.919	5.923	7.738
Audio2Photoreal	0.902	7.638	7.613	0.902	7.638	7.613	0.902	7.638	7.613
Ours w/o Motion Encoding/Decoding	0.731	7.584	6.412	1.102	7.903	7.221	0.682	6.944	6.938
Ours w/o Listening-Aware Loss	0.497	7.581	8.412	0.836	7.901	10.602	0.381	7.134	8.236
Ours w/o Interlocutor-aware Module	0.594	7.642	8.436	0.873	7.981	9.421	0.428	6.917	8.201
Ours (CGM, Interlocutor-Aware)	0.462	8.210	8.085	0.782	8.537	10.155	0.339	7.782	7.889

Personalized Listening Behavior

We investigate whether CGM can generate Character-specific listening gestures by leveraging pretrained speaker representations. For three Character IDs, we compare generated listening segments against ground-truth listening motion and speaking segments from each identity. Lower FID against ground-truth listening and the same Character's speaking style indicates stronger personalization, while higher distance from other Characters indicates clearer identity differentiation. The listening-focused Talking With Hands evaluation contains 2,702 frames, approximately 1.5 minutes.

Generated Listening	GT Listening (FID)	ID 1 Speaking (FID)	ID 2 Speaking (FID)	ID 6 Speaking (FID)
ID 1	51.99	20.61	308.36	180.87
ID 2	181.87	299.54	48.60	242.44
ID 6	29.71	191.76	264.75	16.13

Character Head Alignment to Interlocutor Audio

To evaluate rhythmic alignment during listening, we compare Character head motion against the Interlocutor audio. Generated head motion synchronizes more strongly with the true Interlocutor audio than with random audio offsets, approaching the ground-truth alignment reference.

Head Alignment Setting ↑	ID 1	ID 2	ID 6	Audio2Photoreal
GT vs. Interlocutor Audio	0.850	0.850	0.850	0.850
Generated vs. Interlocutor Audio	0.744	0.750	0.813	0.749
Generated vs. Random Audio	0.674	0.647	0.691	0.711

Character Performance Preservation

We verify that the interlocutor-aware modules do not degrade the original Character motion generation quality. This evaluation uses the BEATX test set, consisting of 29,310 frames, approximately 16.3 minutes of speech, collected from 15 different sessions.

Method	ID 1 (Wayne)			ID 2 (Scott)			ID 6 (Carla)
Method	FGD ↓	BC × 10⁻¹ ↑	Diversity ↑	FGD ↓	BC × 10⁻¹ ↑	Diversity ↑	FGD ↓	BC × 10⁻¹ ↑	Diversity ↑
GT	0.000	6.580	14.141	0.000	8.440	12.638	0.000	1.907	8.637
SynTalker	0.258	6.781	5.466	0.307	8.364	10.697	0.481	3.565	7.794
Ours w/o Motion Encoding/Decoding	0.356	7.520	3.221	0.527	7.137	4.266	0.768	5.445	5.832
Ours w/o Listening-Aware Loss	0.213	6.402	7.914	0.417	8.063	12.601	0.589	5.911	8.674
Ours w/o Interlocutor-aware Module	0.241	6.612	6.882	0.352	8.201	11.146	0.512	4.982	8.031
Ours (CGM, Interlocutor-Aware)	0.190	6.796	6.101	0.484	8.794	14.074	0.621	6.774	9.286

Perceptual Study

In side-by-side comparisons, participants preferred CGM over DiffuseStyleGesture+ in the pooled setting, while CGM was statistically indistinguishable from ground truth. This suggests the model improves over the conversational baseline while approaching real motion quality in the tested clips.

Condition	Ours (n)	Not Ours (n)	Prop. Ours	Binomial p
DiffuseStyleGesture+ pooled	87	53	0.62	.005
Ground truth pooled	36	34	0.51	.905

Citation

@article{https://doi.org/10.1111/cgf.70412,
author = {Koren, T. and Rosenthal, A. and Friedman, D. and Shamir, A.},
title = {Conversational Gesture Model (CGM): Extending Speaker-Centric Audio-Driven Motion Generation to Full Conversation Gestures},
journal = {Computer Graphics Forum},
volume = {n/a},
number = {n/a},
pages = {e70412},
doi = {https://doi.org/10.1111/cgf.70412},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.70412},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.70412}
}