Posted 2026-02-06 · Updated 2026-03-23 · Review · 10 minutes read (about 1443 words)
BAGEL: Unified Multimodal Pretraining (paper link | project homepage)
Tags: #Research-paper #Multi-modal #VLM #Diffusion #Transformer #MoE #Unified-Multimodal #FoundationModel #Image-generation #Image2Text
Posted 2026-02-03 · Updated 2026-03-23 · Review · 15 minutes read (about 2285 words)
UniDiffuser (paper link | GitHub)
Tags: #Research-paper #Multi-modal #Transformer #Image2Text #CV #DiffusionModel #ImgGen
Posted 2025-03-18 · Updated 2026-03-23 · Note · a few seconds read (about 83 words)
(UVtransE) Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation
Tags: #Research-paper #Scene-graph #Visual-Relation #Image2Text #CV #Translation-Embedding
Posted 2025-03-18 · Updated 2026-03-23 · Review · a few seconds read (about 7 words)
Visual Translation Embedding Network for Visual Relation Detection (VTransE)
Tags: #Research-paper #Scene-graph #Visual-Relation #Image2Text #CV #Translation-Embedding
Posted 2025-03-04 · Updated 2026-03-23 · Review · a minute read (about 154 words)
ALBEF: the backbone is BERT (trained with MLM). The paper argues that the image encoder should be larger than the text encoder, so the text encoder uses only the first six self-attention layers to extract text features, while the remaining six layers serve as cross-attention layers for the multimodal encoder.
Tags: #Research-paper #Image2Text #CV #Contrastive-Learning #MultiModal #VLP #Image-Text
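The layer split described above can be sketched as follows. This is a minimal illustration, not ALBEF's actual implementation: the function name `albef_text_path` and the single-head, projection-free attention are assumptions for brevity; the real model uses full BERT blocks (multi-head projections, feed-forward sublayers, layer norm).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def albef_text_path(text_feats, image_feats, n_layers=12, split=6):
    """Hypothetical sketch of ALBEF's asymmetric text encoder:
    the first `split` layers do text-only self-attention; the
    remaining layers cross-attend to image features, acting as
    the multimodal encoder."""
    h = text_feats
    for i in range(n_layers):
        if i < split:
            # Layers 1..6: self-attention over text tokens only.
            h = h + attention(h, h, h)
        else:
            # Layers 7..12: queries from text, keys/values from image.
            h = h + attention(h, image_feats, image_feats)
    return h

text = np.random.randn(16, 64)    # 16 text tokens, dim 64
image = np.random.randn(196, 64)  # 196 image patches (14x14), dim 64
out = albef_text_path(text, image)
print(out.shape)  # (16, 64): one fused feature per text token
```

The point of the split is that the six cross-attention layers come "for free" from the BERT depth budget, so the multimodal fusion adds no extra parameters relative to a standard 12-layer text encoder.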
Posted 2025-03-04 · Updated 2026-03-23 · Review · a few seconds read (about 3 words)
ViLT
Tags: #Research-paper #Transformer #Image2Text #CV #MultiModal #VLP #Image-Text
Posted 2025-01-06 · Updated 2026-03-23 · Note · a minute read (about 197 words)
CLIP (https://blog.csdn.net/h661975/article/details/135116957)
Tags: #Research-paper #Image2Text #CV #CLIP #Contrastive-Learning #MultiModal #VLP #Image-Text