Chen Yulin's Blog

Posted 2025-03-04 | Updated 2026-03-30 | Review | a minute read (About 154 words)

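The backbone is BERT (trained with MLM).
The study argues that the image encoder should be larger than the text encoder. The text encoder therefore uses only six self-attention layers to extract features, and the remaining six cross-attention layers serve as the multi-modal encoder.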
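
As a rough illustration of the six-plus-six split described above, here is a toy sketch in plain Python. It is not the paper's implementation: there are no learned projections, residual connections, or feed-forward blocks, and the names `attention` and `run_stack` are hypothetical. It only shows where the attention inputs come from in each half of the stack.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def run_stack(text_tokens, image_patches, num_text_layers=6, num_fusion_layers=6):
    """Toy 12-layer stack: 6 text-only layers, then 6 fusion layers (assumed split)."""
    h = text_tokens
    for _ in range(num_text_layers):
        # Text encoder: text attends only to text (self-attention).
        h = attention(h, h, h)
    for _ in range(num_fusion_layers):
        # Multi-modal encoder: text queries attend to image features (cross-attention).
        h = attention(h, image_patches, image_patches)
    return h


# Made-up 2-dimensional features for three text tokens and two image patches.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[0.5, 0.2], [0.1, 0.9]]
fused = run_stack(text, image)
print(len(fused), len(fused[0]))
```

The output keeps one vector per text token, but after the fusion layers each vector is a weighted mix of image patch features, which is the point of placing cross-attention in the upper half.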

#Research-paper Image2Text CV MultiModal Contrastive-Learning VLP Image-Text

Posted 2025-03-04 | Updated 2026-03-30 | Review | a few seconds read (About 3 words)

ViLT

#Research-paper Transformer Image2Text CV MultiModal VLP Image-Text

Posted 2025-03-03 | Updated 2026-03-30 | Review | a few seconds read (About 108 words)

BLIP

A vision-language model that unifies vision-language understanding and generation tasks.

#Research-paper Multi-modal CV Semantic CLIP VLP Image-Text

Posted 2025-02-19 | Updated 2026-03-30 | Review | 2 minutes read (About 273 words)

GLIP

GLIP is a model that learns object-level, language-aware, and semantic-rich visual representations.
It unifies object detection and phrase grounding for pre-training.
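
One way to picture this unification is to treat detection as grounding against a text prompt built from the class names. The sketch below is a simplified illustration under assumptions: the function names `build_detection_prompt` and `alignment_scores` are hypothetical, the spans are character offsets rather than tokenizer positions, and the features are made-up numbers rather than model outputs.

```python
def build_detection_prompt(class_names):
    """Join class names into one prompt and record each class's character span."""
    spans = {}
    cursor = 0
    for name in class_names:
        spans[name] = (cursor, cursor + len(name))
        cursor += len(name) + len(". ")  # skip the separator after each name
    prompt = ". ".join(class_names) + "."
    return prompt, spans


def alignment_scores(region_features, token_features):
    """Dot-product score between every region feature and every token feature."""
    return [
        [sum(r * t for r, t in zip(region, token)) for token in token_features]
        for region in region_features
    ]


prompt, spans = build_detection_prompt(["person", "bicycle", "hair dryer"])
print(prompt)
print(spans)

# Made-up features: two candidate regions scored against three text tokens.
regions = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(alignment_scores(regions, tokens))
```

With this framing, a detection label becomes a span in the prompt, so the same region-to-text matching can serve both detection data and phrase grounding data.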

#Research-paper Multi-modal CV Object-Detection CLIP Contrastive-Learning VLP Image-Grounding

Posted 2025-01-06 | Updated 2026-03-30 | Note | a minute read (About 197 words)

CLIP

https://blog.csdn.net/h661975/article/details/135116957

#Research-paper Image2Text CV MultiModal CLIP Contrastive-Learning VLP Image-Text
