Social Structure Matters in 3D Human–Human Interaction Generation

Zhongju Wang¹ Beier Wang¹ Yatao Bian² Pichao Wang³ Zhi Wang⁴
Daoyi Dong⁵ Hongdong Li⁶ Huadong Mo¹,✉ Zhenhong Sun⁶

¹University of New South Wales ²National University of Singapore ³NVIDIA ⁴Nanjing University
⁵University of Technology Sydney ⁶Australian National University

✉ Corresponding author


Social structure of text-driven HHI generation. (a) Solo motion execution provides strong intra-personal motion priors but lacks interaction-level coordination. (b) HHI requires social structure along two dimensions: phase-level temporal organization and partner-aware coordination. (c) LLMs offer social planning abilities to make such structure explicit for interaction motion execution.

Abstract

Although text-to-motion generation has made strong progress in synthesizing realistic single-person motions from language, extending it to text-driven 3D human–human interaction (HHI) remains non-trivial, because HHI requires modeling the underlying social structure that governs phase progression, actor roles, and inter-actor coordination. In this paper, we formulate HHI generation as a social structure modeling and grounding problem: the model must first infer how an interaction unfolds and how the two actors coordinate their roles, and then realize this structure as continuous, physically plausible, and partner-aware 3D motion. To study how such structure should be modeled, we first examine the capability boundary of large language models (LLMs) for HHI generation. Our analysis shows that LLMs can "think", in the sense that they recover phase decompositions and partner-aware roles, but cannot directly "move": they fail to generate dynamic, physically plausible, and interaction-aware motion. This motivates our planner–executor paradigm, "Think with LLM, Move with Motion Skill". The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning both with the motion sequence. The motion executor then grounds the planned social structure in coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. Together, our Solo-to-Social framework bridges social organization and motion realization, producing 3D HHI with improved phase consistency, role alignment, and partner-aware coordination.
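To make the planner's output concrete, the following is a minimal sketch of what a phase-level plan with per-actor roles, aligned to frame ranges, might look like. The schema (`Phase`, `role_p1`, `role_p2`), the helper functions, and the handshake example are all hypothetical illustrations and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    """One phase of a two-person interaction (hypothetical schema)."""
    name: str         # short phase label, e.g. "approach"
    start_frame: int  # first frame of the phase (inclusive)
    end_frame: int    # one past the last frame of the phase (exclusive)
    role_p1: str      # what person 1 does during this phase
    role_p2: str      # what person 2 does during this phase


def validate_plan(phases: List[Phase], num_frames: int) -> bool:
    """Return True if the phases tile [0, num_frames) with no gap or overlap."""
    if not phases:
        return False
    cursor = 0
    for ph in phases:
        if ph.start_frame != cursor or ph.end_frame <= ph.start_frame:
            return False
        cursor = ph.end_frame
    return cursor == num_frames


def phase_at(phases: List[Phase], frame: int) -> Phase:
    """Look up the phase that covers a given frame index."""
    for ph in phases:
        if ph.start_frame <= frame < ph.end_frame:
            return ph
    raise ValueError(f"frame {frame} is not covered by any phase")


# Illustrative plan for the prompt "two people shake hands" (made-up values).
plan = [
    Phase("approach", 0, 40, "walks toward P2", "walks toward P1"),
    Phase("contact", 40, 90, "extends right hand and shakes", "grasps P1's hand and shakes"),
    Phase("withdraw", 90, 120, "lowers hand and steps back", "lowers hand and steps back"),
]
```

Under this assumed representation, each frame maps to exactly one phase and one role description per actor, which is one way the "motion-aligned social supervision" described above could be stored.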

Method

Overview of our proposed planner–executor paradigm for social-structure-centered HHI generation. (a) The LLM serves as a social structure planner that recovers plausible phase decompositions and partner-aware role assignments from the global prompt. (b) The motion skill is built on a solo motion backbone equipped with self and partner conditioning for motion execution.
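As an illustration of what ego-relative partner conditioning can mean geometrically, the sketch below expresses a partner's joint positions in the ego actor's local frame by translating to the ego root on the ground plane and rotating by the ego heading. The conventions here are assumptions made for the example (y-up coordinates, heading given as a yaw angle with forward direction (cos θ, sin θ) in the x-z plane, heights left unchanged); the paper's actual motion representation may differ.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def to_ego_frame(points: List[Vec3], ego_root: Vec3, ego_yaw: float) -> List[Vec3]:
    """Express world-space points in the ego actor's ground-plane frame.

    Assumed conventions (hypothetical, not from the paper):
      * y is up; x and z span the ground plane.
      * ego_yaw is the heading angle, with forward = (cos yaw, sin yaw) in (x, z).
      * In the output, the ego's forward direction is +x and heights are unchanged.
    """
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    local = []
    for x, y, z in points:
        dx, dz = x - ego_root[0], z - ego_root[2]  # translate to the ego root
        local.append((c * dx + s * dz, y, -s * dx + c * dz))  # rotate by -yaw
    return local


# Ego stands at (1, 0, 2) facing +z; the partner's head is 1 m in front of it.
partner_head = [(1.0, 1.5, 3.0)]
head_local = to_ego_frame(partner_head, ego_root=(1.0, 0.0, 2.0), ego_yaw=math.pi / 2)
```

In this example the partner's head lands roughly 1 m along the ego's forward axis at a height of 1.5 m. A representation of this kind does not change when both actors are moved or rotated together in the world, which is one reason a solo backbone might be conditioned on the partner in this form.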


BibTeX

@misc{wang2026socialstructure,
  title  = {Social Structure Matters in 3D Human--Human Interaction Generation},
  author = {Zhongju Wang and Beier Wang and Yatao Bian and Pichao Wang and Zhi Wang
            and Daoyi Dong and Hongdong Li and Huadong Mo and Zhenhong Sun},
  year   = {2026},
  eprint = {2606.24255},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}