Social Structure Matters in 3D Human–Human Interaction Generation

Zhongju Wang1   Beier Wang1   Yatao Bian2   Pichao Wang3   Zhi Wang4
Daoyi Dong5   Hongdong Li6   Huadong Mo1,✉   Zhenhong Sun6

1University of New South Wales   2National University of Singapore   3NVIDIA   4Nanjing University
5University of Technology Sydney   6Australian National University

✉ Corresponding author

SocialStructureHHI teaser

Social structure of text-driven HHI generation. (a) Solo motion execution provides strong intra-personal motion priors but lacks interaction-level coordination. (b) HHI requires social structure along two dimensions: phase-level temporal organization and partner-aware coordination. (c) LLMs offer social planning abilities to make such structure explicit for interaction motion execution.

Abstract

Although text-to-motion generation has made strong progress in synthesizing realistic single-person motions from language, extending it to text-driven 3D human–human interaction (HHI) remains non-trivial, because HHI requires modeling the underlying social structure that governs phase progression, actor roles, and inter-actor coordination. In this paper, we formulate HHI generation as a social structure modeling and grounding problem: the model must first infer how an interaction unfolds and how the two actors coordinate their roles, and then realize this structure as continuous, physically plausible, and partner-aware 3D motion. To study how such structure should be modeled, we first examine the capability boundary of large language models (LLMs) for HHI generation. Our analysis shows that LLMs can think, recovering phase decompositions and partner-aware roles, but cannot directly move, as they fail to generate dynamic, physically plausible, and interaction-aware motion. This motivates our planner–executor paradigm, Think with LLM, Move with Motion Skill. The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning them with the motion sequence. The motion executor then grounds the planned social structure into coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. Together, our Solo-to-Social framework bridges social organization and motion realization, producing 3D HHI with improved phase consistency, role alignment, and partner-aware coordination.
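
To make the planner's output concrete, the sketch below shows one possible representation of a phase-level plan with per-actor roles, plus a simple alignment of phases to frame spans. The `Phase` schema, the `weight` field, the proportional alignment rule, and the handshake example are all illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Phase:
    """One interaction phase with a role per actor (hypothetical schema)."""
    name: str
    role_p1: str   # what actor P1 does in this phase
    role_p2: str   # what actor P2 does in this phase
    weight: float  # assumed field: relative duration suggested by the planner


def align_phases(phases: List[Phase], num_frames: int) -> List[Tuple[int, int]]:
    """Map phases to contiguous half-open frame spans [start, end).

    Assumption: spans are proportional to the phase weights. The paper's
    actual alignment procedure may differ.
    """
    total = sum(p.weight for p in phases)
    spans, start, acc = [], 0, 0.0
    for i, p in enumerate(phases):
        acc += p.weight
        # The last phase always ends at num_frames, so rounding leaves no gap.
        end = num_frames if i == len(phases) - 1 else round(num_frames * acc / total)
        spans.append((start, end))
        start = end
    return spans


# Illustrative plan for a prompt such as "two people shake hands" (made up).
plan = [
    Phase("approach", "walks toward P2", "stands and waits", 2.0),
    Phase("handshake", "extends right hand", "grasps P1's hand", 1.0),
    Phase("depart", "steps back", "turns away", 1.0),
]
spans = align_phases(plan, 120)
```

In this sketch, each span could serve as the window over which a phase-level role description conditions the motion executor.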

Method

SocialStructureHHI method overview

Overview of our proposed planner–executor paradigm for social-structure-centered HHI generation. (a) The LLM serves as a social structure planner that recovers plausible phase decompositions and partner-aware role assignments from the global prompt. (b) The motion skill is built on a solo motion backbone equipped with self and partner conditioning for motion execution.
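
As a rough illustration of what ego-relative partner conditioning can mean, the sketch below expresses a partner's joint positions in the ego actor's local frame: translate by the ego root on the ground plane, then rotate so the ego heading maps to +z. The y-up convention, the yaw definition (facing direction `(sin yaw, 0, cos yaw)`), and the function name `to_ego_frame` are assumptions for this example; the paper's exact features may differ.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def to_ego_frame(partner_joints: List[Vec3], ego_root: Vec3, ego_yaw: float) -> List[Vec3]:
    """Express partner joints relative to the ego actor (hypothetical helper).

    Assumes a y-up world where the ego faces (sin(yaw), 0, cos(yaw)).
    Heights (y) are kept in world coordinates.
    """
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    out = []
    for x, y, z in partner_joints:
        # Offset from the ego root on the ground plane.
        dx, dz = x - ego_root[0], z - ego_root[2]
        # Rotate by -yaw about the vertical axis so the ego heading becomes +z.
        out.append((dx * c - dz * s, y, dx * s + dz * c))
    return out


# Ego stands at (1, 0, 1) facing +x; the partner joint is 2 m along +x.
rel = to_ego_frame([(3.0, 1.0, 1.0)], (1.0, 0.0, 1.0), math.pi / 2)
```

Here the partner joint ends up about 2 m straight ahead of the ego actor (local +z), regardless of where the pair stands in the world, which is the property that makes such features useful as conditioning.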

BibTeX

@misc{wang2026socialstructure,
  title  = {Social Structure Matters in 3D Human--Human Interaction Generation},
  author = {Zhongju Wang and Beier Wang and Yatao Bian and Pichao Wang and Zhi Wang
            and Daoyi Dong and Hongdong Li and Huadong Mo and Zhenhong Sun},
  year   = {2026},
  eprint = {2606.24255},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}