Chen Yulin's Blog
MotionGPT3
Posted 2026-03-25 | Updated 2026-03-30 | Note

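## Research Background

Text and motion are two different modalities: text is quantized (discrete), while motion is continuous. Correspondingly, a typical LLM is good at processing **symbolic (discrete)** text, whereas a DiT is good at generating high-quality **continuous signals**.

Two families of approaches follow from this:

- Encode motion into quantized tokens (VQ-VAE) and run them together with text through a single-stream backbone. This introduces cross-modal interference, and the motion decoded from quantized tokens is also of lower quality.
- Give each modality its own backbone, i.e. a multi-branch approach.

This paper takes the multi-branch route, specifically a **dual-stream transformer with shared attention**.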

Related Work

Human Motion Modeling

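  • Early: a shared embedding space for motion and text
  • Recent: operate on raw motion, or reconstruct a VAE latent
    • To fit next-token prediction (LLMs), some work uses a discretized motion representation (VQ-VAE), but a fundamental symbolic-continuous mismatch remains.
      This paper instead uses a standard (continuous) VAE.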
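
To make the symbolic-continuous mismatch concrete, here is a minimal sketch of the nearest-neighbour lookup used in vector quantization. The codebook, its size, and the latent vector are random stand-ins chosen for illustration, not values from the paper. The point is only that snapping a continuous latent to a discrete code generally leaves a nonzero residual.

```python
import random


def quantize(vec, codebook):
    """Return (index, code) of the codebook entry nearest to vec (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))
    return idx, codebook[idx]


random.seed(0)
dim, num_codes = 4, 8  # illustrative sizes (assumption, not from the paper)
codebook = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]
latent = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for a continuous motion latent

idx, code = quantize(latent, codebook)
error = sum((x - y) ** 2 for x, y in zip(latent, code))
print(f"token id = {idx}, squared quantization error = {error:.4f}")
```

A continuous VAE latent skips this lookup, so it does not pay this particular quantization error, at the cost of no longer being a discrete token that an LLM can predict with a softmax over a vocabulary.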

Multimodal Understanding and Generation Frameworks

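Many single-stream frameworks exist today, but single-stream architectures often suffer from cross-modal interference, which limits scalability and robustness. Even with carefully tuned objectives, a newly introduced modality can disrupt existing representations, which highlights the challenge of preserving modality-specific capabilities while extending to new domains.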

MoE & Multi-Stream Architectures

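Inputs are routed to modality-specific experts for processing while a shared fusion interface is retained, which avoids gradient interference between different modalities. Each modality, and thus each branch, can have its own training objective.
This paper uses MoT.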
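
The routing idea can be sketched as a single block in which each token uses the parameters of its own modality while attention runs over the joint sequence. This is a toy illustration under several assumptions: MoT is read here as Mixture-of-Transformers, the attention is single-head with no causal mask or layer norm, and names such as `make_branch` and `mot_block` are hypothetical, not the paper's code.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_branch(rng, dim):
    # One independent parameter set per modality (hypothetical layout).
    return {
        name: [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
        for name in ("q", "k", "v", "ffn")
    }


def mot_block(tokens, branches):
    """tokens: list of (modality, vector). Returns one output vector per token."""
    dim = len(tokens[0][1])
    # Modality-specific projections: each token uses its own branch's weights.
    q = [matvec(branches[m]["q"], x) for m, x in tokens]
    k = [matvec(branches[m]["k"], x) for m, x in tokens]
    v = [matvec(branches[m]["v"], x) for m, x in tokens]
    outputs = []
    for i, (m, x) in enumerate(tokens):
        # Shared attention: every token attends over the joint text+motion sequence.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        attended = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)]
        h = [xi + ai for xi, ai in zip(x, attended)]  # residual connection
        # Modality-specific feed-forward layer (single ReLU layer for brevity).
        f = [max(0.0, t) for t in matvec(branches[m]["ffn"], h)]
        outputs.append([hi + fi for hi, fi in zip(h, f)])
    return outputs


rng = random.Random(0)
dim = 4
branches = {"text": make_branch(rng, dim), "motion": make_branch(rng, dim)}
tokens = [("text", [rng.gauss(0, 1) for _ in range(dim)]) for _ in range(3)]
tokens += [("motion", [rng.gauss(0, 1) for _ in range(dim)]) for _ in range(2)]
out = mot_block(tokens, branches)
print(len(out), len(out[0]))
```

In this sketch the only place where text and motion exchange information is the attention step, so the two branches can keep separate weights, and in a real model separate training objectives, while still conditioning on each other.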

Method

Motion Representation

A motion VAE trained with a reconstruction loss and KL regularization.
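
As a rough sketch of such an objective, the snippet below combines a mean-squared reconstruction error with the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The choice of MSE, the `beta` weight, and the function name `vae_loss` are assumptions for illustration. The note does not state the exact reconstruction term or weighting used in the paper.

```python
import math


def vae_loss(x, x_hat, mu, logvar, beta=1e-4):
    """Reconstruction MSE plus beta-weighted KL(N(mu, sigma^2) || N(0, I)).

    beta is an illustrative default, not a value taken from the paper.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    # Closed-form KL for a diagonal Gaussian, with logvar = log(sigma^2).
    kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))
    return recon + beta * kl, recon, kl


total, recon, kl = vae_loss(
    x=[0.1, -0.3, 0.5],      # toy motion features
    x_hat=[0.0, -0.2, 0.4],  # toy decoder output
    mu=[0.2, -0.1],          # posterior mean
    logvar=[-0.5, 0.3],      # posterior log-variance
)
print(f"recon={recon:.4f} kl={kl:.4f} total={total:.6f}")
```

The KL term vanishes exactly when the posterior equals the prior (`mu = 0`, `logvar = 0`), so the weight `beta` trades reconstruction fidelity against how closely the latent space is pulled toward a standard normal.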

http://chen-yulin.github.io/2026/03/25/[OBS]Deep Learning-MotionGen-MotionGPT3/

Author: Chen Yulin

Tags: Research-paper, Diffusion, Transformer, MultiModal, MotionGeneration