Chen Yulin's Blog
ALBEF
Posted 2025-03-04 | Updated 2026-03-08 | Review | a minute read (About 154 words)

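The backbone is BERT (trained with MLM).
The paper argues that the image encoder should be larger than the text encoder. The text encoder therefore uses only six self-attention layers to extract features, and the remaining six layers, which use cross-attention, form the multi-modal encoder.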
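
The layer split described above can be sketched in a few lines. This is only an illustration of the six/six partition, assuming a 12-layer BERT stack (implied by "six layers" plus "the remaining six"); `LayerSpec` and `split_bert_layers` are hypothetical names and not part of the ALBEF code base.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LayerSpec:
    index: int             # position in the original BERT stack (0-based)
    role: str              # "text" or "fusion"
    cross_attention: bool  # True if the layer also attends to image features


def split_bert_layers(
    num_layers: int = 12, num_text_layers: int = 6
) -> Tuple[List[LayerSpec], List[LayerSpec]]:
    """Partition a BERT stack into text-encoder and multi-modal-encoder layers.

    Hypothetical helper: the first `num_text_layers` layers use self-attention
    only; the remaining layers add cross-attention over image features.
    """
    if not 0 < num_text_layers < num_layers:
        raise ValueError("num_text_layers must be between 1 and num_layers - 1")
    text = [LayerSpec(i, "text", False) for i in range(num_text_layers)]
    fusion = [LayerSpec(i, "fusion", True) for i in range(num_text_layers, num_layers)]
    return text, fusion


text_layers, fusion_layers = split_bert_layers()
print(len(text_layers), len(fusion_layers))
```

With the defaults, both halves contain six layers, and only the second half is marked as using cross-attention.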

#Research-paper #Image2Text #CV #Contrastive-Learning #MultiModal #VLP #Image-Text
ViLT
Posted 2025-03-04 | Updated 2026-03-08 | Review | a few seconds read (About 3 words)

#Research-paper #Transformer #Image2Text #CV #MultiModal #VLP #Image-Text
ZegCLIP
Posted 2025-03-03 | Updated 2026-03-08 | Review | a few seconds read (About 0 words)

#Research-paper #CV #Semantic #CLIP #Open-Vocabulary #Segmentation
BLIP
Posted 2025-03-03 | Updated 2026-03-08 | Review | a few seconds read (About 108 words)

A vision-language model that unifies vision-language understanding and generation tasks.

#Research-paper #Multi-modal #CV #Semantic #CLIP #VLP #Image-Text
GLIP
Posted 2025-02-19 | Updated 2026-03-08 | Review | 2 minutes read (About 273 words)

GLIP is a model that learns object-level, language-aware, and semantic-rich visual representations.
It unifies object detection and phrase grounding for pre-training.

#Research-paper #Multi-modal #CV #Object-Detection #CLIP #Contrastive-Learning #VLP #Image-Grounding
Extract Free Dense Labels from CLIP
Posted 2025-02-18 | Updated 2026-03-08 | Review | a few seconds read (About 0 words)

#Research-paper #CV #Semantic #CLIP #Open-Vocabulary #Segmentation
ConceptFusion
Posted 2025-02-17 | Updated 2026-03-08 | Review | 2 minutes read (About 297 words)

The formula for fusing the features from different frames $X_t$ into the feature points of the map $M$:

#Research-paper #Multi-modal #CV #Reconstruct #3D-Scene #Semantic #CLIP
Grounding-DINO
Posted 2025-02-16 | Updated 2026-03-08 | Review | a minute read (About 216 words)

#Research-paper #Transformer #CV #Object-Detection #Open-Vocabulary #Contrastive-Learning #MultiModal #DINO #Image-Grounding
Grounded-SAM
Posted 2025-02-16 | Updated 2026-03-08 | Review | a few seconds read (About 17 words)

https://github.com/IDEA-Research/Grounded-Segment-Anything

Combines [[Grounding-DINO]] with SAM to achieve open-vocabulary detection and segmentation.
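
The combination can be pictured as a two-stage pipeline: a text-prompted detector proposes boxes, and a box-prompted segmenter turns each box into a mask. The sketch below is a toy illustration of that data flow only. Every name in it (`Detection`, `SegmentedObject`, `grounded_segmentation`, `toy_detect`, `toy_segment`) is hypothetical, the " . " prompt separator and the score threshold are assumptions, and none of it is the API of the linked repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1/y1 exclusive
Image = List[List[int]]          # toy stand-in for an image: rows of pixel values
Mask = List[List[bool]]


@dataclass
class Detection:
    phrase: str
    box: Box
    score: float


@dataclass
class SegmentedObject:
    phrase: str
    box: Box
    mask: Mask


def grounded_segmentation(
    image: Image,
    text_prompt: str,
    detect: Callable[[Image, str], List[Detection]],
    segment: Callable[[Image, Box], Mask],
    score_threshold: float = 0.3,  # assumed value, for illustration only
) -> List[SegmentedObject]:
    """Stage 1: text-prompted boxes. Stage 2: one mask per kept box."""
    results = []
    for det in detect(image, text_prompt):
        if det.score < score_threshold:
            continue  # drop low-confidence boxes before segmenting
        results.append(SegmentedObject(det.phrase, det.box, segment(image, det.box)))
    return results


def toy_detect(image: Image, text_prompt: str) -> List[Detection]:
    """Stub detector: one full-image box per phrase in the prompt."""
    height, width = len(image), len(image[0])
    phrases = [p.strip() for p in text_prompt.split(".") if p.strip()]
    return [Detection(p, (0, 0, width, height), 0.9) for p in phrases]


def toy_segment(image: Image, box: Box) -> Mask:
    """Stub segmenter: marks every pixel inside the box as foreground."""
    x0, y0, x1, y1 = box
    return [
        [x0 <= x < x1 and y0 <= y < y1 for x in range(len(image[0]))]
        for y in range(len(image))
    ]


image = [[0] * 4 for _ in range(3)]
objects = grounded_segmentation(image, "cat . dog", toy_detect, toy_segment)
print([obj.phrase for obj in objects])
```

In a real setup the two stubs would be replaced by the actual detector and segmenter; passing them in as callables keeps the two stages swappable.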

#Research-paper #CV #Object-Detection #Semantic #Open-Vocabulary #Segmentation
Scene-LLM
Posted 2025-02-15 | Updated 2026-03-08 | Review | 6 minutes read (About 919 words)

The model proposed in this paper mainly aims to address 3D dense annotation and interactive planning.
Combining

#Robotics #Research-paper #LLM #Multi-modal #VLM #3D-Scene #Embodied-AI #CLIP
© 2026 Chen Yulin  Powered by Hexo & Icarus