Posted 2026-02-06 · Updated 2026-03-01 · Review · 10 minutes read (About 1443 words)
BAGEL: Unified Multimodal Pretraining
Paper link | Project homepage
Tags: Research-paper, Multi-modal, VLM, Diffusion, Transformer, MoE, Unified-Multimodal, FoundationModel, Image-generation, Image2Text
Posted 2026-02-03 · Updated 2026-03-01 · Review · 15 minutes read (About 2285 words)
UniDiffuser
Paper link | GitHub
Tags: Research-paper, CV, Multi-modal, Transformer, Image2Text, DiffusionModel, ImgGen
Posted 2025-03-18 · Updated 2026-03-01 · Note · a few seconds read (About 83 words)
(UVtransE) Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation
Tags: Scene-graph, Visual-Relation, Research-paper, CV, Image2Text, Translation-Embedding
Posted 2025-03-18 · Updated 2026-03-01 · Review · a few seconds read (About 7 words)
Visual Translation Embedding Network for Visual Relation Detection (VTransE)
Tags: Scene-graph, Visual-Relation, Research-paper, CV, Image2Text, Translation-Embedding
Posted 2025-03-04 · Updated 2026-03-01 · Review · a minute read (About 154 words)
ALBEF
ALBEF uses BERT (trained with MLM) as its text backbone. The paper argues that the image encoder should be larger than the text encoder, so only the first six self-attention layers serve as the text encoder, while the remaining six layers, equipped with cross attention, form the multimodal encoder.
Tags: Research-paper, CV, Image2Text, Contrastive-Learning, MultiModal, VLP, Image-Text
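The ALBEF layer split described above can be sketched with a toy numpy example. This is a minimal illustration only: the real model uses full transformer blocks (multi-head attention, feed-forward layers, layer norm) and a ViT image encoder; the dimensions, layer functions, and random weights here are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v):
    # Single-head scaled dot-product attention (toy version).
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 32                                 # hypothetical hidden size
text = rng.standard_normal((10, d))    # 10 text token embeddings
image = rng.standard_normal((49, d))   # 7x7 image patch features (assumed)

def make_w():
    return rng.standard_normal((d, d)) / np.sqrt(d)

# Layers 1-6: unimodal text encoder -- self-attention over text only.
h = text
for _ in range(6):
    h = h + attention(h, h, make_w(), make_w(), make_w())

# Layers 7-12: multimodal encoder -- self-attention plus
# cross attention where text queries attend to image features.
for _ in range(6):
    h = h + attention(h, h, make_w(), make_w(), make_w())
    h = h + attention(h, image, make_w(), make_w(), make_w())

print(h.shape)  # (10, 32): one fused representation per text token
```

The point of the split is parameter reuse: the same 12-layer BERT stack provides both the text encoder and the fusion module, rather than adding a separate multimodal transformer on top.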
Posted 2025-03-04 · Updated 2026-03-01 · Review · a few seconds read (About 3 words)
ViLT
Tags: Research-paper, CV, Transformer, Image2Text, MultiModal, VLP, Image-Text
Posted 2025-01-06 · Updated 2026-03-01 · Note · a minute read (About 197 words)
CLIP
https://blog.csdn.net/h661975/article/details/135116957
Tags: Research-paper, CV, Image2Text, CLIP, Contrastive-Learning, MultiModal, VLP, Image-Text