Posted 2026-03-25 | Updated 2026-03-30 | Note | 3 minutes read (About 428 words)

MotionGPT3

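## Research Background

Text and motion are two distinct modalities. They differ in that text is discrete (quantized), while motion is continuous. Correspondingly, a typical LLM excels at processing **symbolic (discrete)** text, while a DiT excels at generating high-quality **continuous signals**. This gives rise to two families of approaches:

- Encode motion into quantized tokens (VQ-VAE) and run inference on them together with text in a single-stream backbone. This introduces cross-modal interference, and the quality of motion decoded from quantized tokens is limited.
- Give each modality its own backbone, i.e. a multi-branch approach.

This paper takes the multi-branch approach, specifically a **dual-stream transformer with shared attention**.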
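To make the quality cost of quantization concrete, here is a minimal sketch of the nearest-neighbour lookup at the core of vector quantization. The codebook, the 2-D feature, and the `quantize` helper are hypothetical toy values for illustration, not anything from the paper; a real VQ-VAE learns its codebook and works in a much higher-dimensional latent space.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector`.

    Distance is squared Euclidean. The index plays the role of a discrete
    motion token that a single-stream LLM backbone could consume.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical toy codebook with four entries (assumption, not learned).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Made-up continuous motion feature.
pose_feature = (0.9, 0.2)

token, reconstructed = quantize(pose_feature, codebook)

# Squared error between the continuous feature and its quantized version.
error = sum((x - y) ** 2 for x, y in zip(pose_feature, reconstructed))
print(token, reconstructed, error)
```

The feature is snapped to the closest codebook entry, so any detail between entries is lost. That residual error is the information a continuous latent representation can retain.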

## Method

### Motion Representation

Motion is represented by a motion VAE trained with a reconstruction loss and KL regularization.
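As a rough sketch of that objective, the snippet below computes a reconstruction term plus a KL term for a diagonal-Gaussian posterior against a standard normal prior. The mean-squared-error reconstruction, the `beta` weight, and the function name `vae_loss` are assumptions made for illustration; the post does not say which reconstruction loss or weighting the paper uses.

```python
import math


def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Return (total, recon, kl) for one sample.

    x, x_hat   : original and reconstructed motion features (lists of floats)
    mu, logvar : mean and log-variance of the diagonal Gaussian posterior
    beta       : hypothetical weight on the KL term (assumption)
    """
    # Reconstruction term: mean squared error (assumed choice).
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

    # KL( N(mu, sigma^2) || N(0, 1) ) summed over latent dimensions.
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

    return recon + beta * kl, recon, kl


# Made-up values: a perfect reconstruction with a posterior equal to the prior.
total, recon, kl = vae_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0])
print(total, recon, kl)
```

The KL term vanishes only when the posterior matches the prior, so it pulls the latent space toward a standard normal while the reconstruction term keeps the latents informative.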

http://chen-yulin.github.io/2026/03/25/[OBS]Deep Learning-MotionGen-MotionGPT3/

Author

Chen Yulin
