Posted 2025-03-06 | Updated 2026-03-01 | Review | a few seconds read (About 26 words)
MaskDINO
Note: this DINO is not the self-distillation, self-supervised [[DINO]]; it is derived from [[DETR]].
#Research-paper CV Transformer Object-Detection SemanticSegmentation MultiModal

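Posted 2025-03-04 | Updated 2026-03-01 | Review | a minute read (About 154 words)
ALBEF
The backbone used is BERT (trained with MLM). The study argues that the image encoder should be larger than the text encoder, so the text encoder uses only six self-attention layers to extract features, and the remaining six layers, which add cross-attention, serve as the multi-modal encoder.
#Research-paper CV Image2Text Contrastive-Learning MultiModal VLP Image-Text
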
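The 6/6 layer split described in the ALBEF note can be sketched in plain Python. This is only an illustrative layer plan, not ALBEF's actual code: the names `LayerSpec` and `split_bert_layers` are hypothetical, and the 12-layer total is assumed from the "six plus remaining six" wording.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerSpec:
    index: int             # position in the (assumed) 12-layer BERT stack
    role: str              # "text" for the text encoder, "fusion" for the multi-modal encoder
    cross_attention: bool  # True if the layer also attends to image features


def split_bert_layers(num_layers: int = 12, fusion_start: int = 6) -> List[LayerSpec]:
    """Hypothetical helper: assign each BERT layer to the text or multi-modal encoder."""
    if not 0 <= fusion_start <= num_layers:
        raise ValueError("fusion_start must lie between 0 and num_layers")
    plan = []
    for i in range(num_layers):
        is_fusion = i >= fusion_start
        # Layers before fusion_start use self-attention only;
        # layers from fusion_start onward add cross-attention to image features.
        plan.append(LayerSpec(i, "fusion" if is_fusion else "text", is_fusion))
    return plan


plan = split_bert_layers()
text_layers = [s.index for s in plan if s.role == "text"]
fusion_layers = [s.index for s in plan if s.role == "fusion"]
print("text encoder layers:  ", text_layers)
print("multi-modal layers:   ", fusion_layers)
```

Keeping the split point as a parameter makes it easy to see how the text-encoder depth trades off against the fusion depth under the same total layer budget.
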
Posted 2025-03-04 | Updated 2026-03-01 | Review | a few seconds read (About 3 words)
ViLT
#Research-paper CV Transformer Image2Text MultiModal VLP Image-Text

Posted 2025-02-16 | Updated 2026-03-01 | Review | a minute read (About 216 words)
Grounding-DINO
#Research-paper CV Transformer Object-Detection Open-Vocabulary Contrastive-Learning MultiModal DINO Image-Grounding

Posted 2025-01-06 | Updated 2026-03-01 | Note | a minute read (About 197 words)
CLIP
https://blog.csdn.net/h661975/article/details/135116957
#Research-paper CV Image2Text CLIP Contrastive-Learning MultiModal VLP Image-Text