Chen Yulin's Blog

Posted 2025-03-04 | Updated 2026-02-22 | Review | a minute read (About 154 words)

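The backbone is BERT (pretrained with masked language modeling, MLM).
The study argues that the image encoder should be larger than the text encoder. The text encoder therefore uses only six self-attention layers to extract features, and the remaining six layers, which use cross-attention, form the multi-modal encoder.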
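The layer split described above can be sketched as a simple partition of a 12-layer BERT stack. This is only an illustrative sketch, not the paper's code: the names `LayerSpec` and `split_bert_layers` are hypothetical, and the total of 12 layers is an assumption based on the "six plus six" description (it matches BERT-base).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical description of one transformer layer in the stack."""
    index: int             # position in the original BERT layer stack
    cross_attention: bool  # True if the layer also attends to image features


def split_bert_layers(
    num_layers: int = 12, num_text_layers: int = 6
) -> Tuple[List[LayerSpec], List[LayerSpec]]:
    """Partition the stack into a text encoder and a multi-modal encoder.

    The first `num_text_layers` layers use self-attention only (text encoder);
    the remaining layers add cross-attention (multi-modal encoder).
    """
    if not 0 < num_text_layers < num_layers:
        raise ValueError("num_text_layers must be between 1 and num_layers - 1")
    text_encoder = [LayerSpec(i, False) for i in range(num_text_layers)]
    multimodal_encoder = [
        LayerSpec(i, True) for i in range(num_text_layers, num_layers)
    ]
    return text_encoder, multimodal_encoder


text_layers, fusion_layers = split_bert_layers()
print(len(text_layers), len(fusion_layers))
```

Keeping the text encoder to half of the stack leaves the image encoder as the larger component, which is the design choice the excerpt attributes to the study.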

#Research-paper CV Image2Text MultiModal VLP Image-Text Contrastive-Learning

Posted 2025-03-04 | Updated 2026-02-22 | Review | a few seconds read (About 3 words)

ViLT

#Research-paper CV Transformer Image2Text MultiModal VLP Image-Text

Posted 2025-03-03 | Updated 2026-02-22 | Review | a few seconds read (About 108 words)

BLIP

A vision-language model that unifies vision-language understanding and generation tasks.

#Research-paper CV Multi-modal Semantic CLIP VLP Image-Text

Posted 2025-01-06 | Updated 2026-02-22 | Note | a minute read (About 197 words)

CLIP

https://blog.csdn.net/h661975/article/details/135116957

#Research-paper CV Image2Text CLIP MultiModal VLP Image-Text Contrastive-Learning
