
The backbone is BERT (trained with MLM).
The paper argues that the image encoder should be larger than the text encoder, so the text encoder uses only six self-attention layers to extract text features, while the remaining six layers, equipped with cross-attention, serve as the multi-modal encoder.
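A minimal sketch of this split (single-head attention, no layer norm or FFN; all names and dimensions are illustrative, not the paper's actual code): the lower six layers run text-only self-attention, and the upper six layers additionally cross-attend to image features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention (single head, no masking, no projections)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def text_encoder(text, n_layers=6):
    # layers 1-6: self-attention over text tokens only
    h = text
    for _ in range(n_layers):
        h = h + attention(h, h, h)
    return h

def multimodal_encoder(text_h, image_feats, n_layers=6):
    # layers 7-12: self-attention plus cross-attention to image features
    h = text_h
    for _ in range(n_layers):
        h = h + attention(h, h, h)                       # self-attention
        h = h + attention(h, image_feats, image_feats)   # cross-attention
    return h

rng = np.random.default_rng(0)
text = rng.normal(size=(16, 64))    # 16 text tokens, dim 64
image = rng.normal(size=(49, 64))   # 49 image patches, dim 64
fused = multimodal_encoder(text_encoder(text), image)
print(fused.shape)  # (16, 64)
```

The point of the split is parameter reuse: the same 12-layer budget yields both a text encoder and a fusion module, keeping most capacity in the image encoder.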
See MoCo: [[Moco- Momentum Contrast for Unsupervised Visual Representation Learning]]
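MoCo's two key mechanics, sketched with illustrative names (not the paper's code): the key encoder is an EMA copy of the query encoder, and negatives are drawn from a FIFO queue of past keys.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """EMA update of the momentum (key) encoder, as in MoCo."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class KeyQueue:
    """Fixed-size FIFO queue of past key embeddings used as negatives."""
    def __init__(self, size, dim):
        self.buf = np.zeros((size, dim))
        self.ptr = 0
        self.size = size

    def enqueue(self, keys):
        n = keys.shape[0]
        idx = (self.ptr + np.arange(n)) % self.size
        self.buf[idx] = keys
        self.ptr = (self.ptr + n) % self.size

q = [np.ones(4)]
k = [np.zeros(4)]
k = momentum_update(q, k, m=0.9)
print(k[0])  # [0.1 0.1 0.1 0.1]
```

The large momentum (e.g. 0.999) keeps the key encoder slowly moving, so keys in the queue stay consistent with each other.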
See [[BLIP]], which builds on this work.
GLIP is a model that learns object-level, language-aware, and semantic-rich visual representations. It unifies object detection and phrase grounding for pre-training.
What is phrase grounding?
Phrase Grounding refers to the task of associating or “grounding” a natural language phrase (like a sentence or a word) to a specific region or object in an image. In other words, it’s about finding which part of the image corresponds to the object or concept described by a given text phrase.
For instance, given the phrase "the red ball on the table" and an image of a room with a red ball placed on a table, the goal of phrase grounding is to identify the exact region in the image that corresponds to the "red ball on the table", distinguishing it from other objects in the image.

## Grounded Language Image Pre-training

GLIP casts the classic object detection task as a grounding problem and proposes a **Unified Formulation**.
Traditional object detectors classify each region into one of c classes; this paper instead treats object detection as phrase grounding.
Detection is reformulated as a grounding task by grounding/aligning each region to one of the c class phrases in a text prompt, e.g. the classification prompt "person. bicycle. car. … . toothbrush".
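The reformulation can be sketched as follows (shapes and names are illustrative): instead of scoring each region against a fixed, learned c-way classifier, region features are scored against the phrase features extracted from the prompt, so the classification logits become a region-phrase alignment matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, n_phrases, dim = 5, 4, 32

O = rng.normal(size=(n_regions, dim))   # region (box) features
P = rng.normal(size=(n_phrases, dim))   # phrase features encoded from the prompt,
                                        # e.g. "person. bicycle. car. toothbrush"

# Classical detector: logits = O @ W.T with a learned (c, dim) classifier W.
# Grounding view: replace W with the prompt's phrase features P.
S_ground = O @ P.T                      # (n_regions, n_phrases) alignment scores
pred = S_ground.argmax(axis=1)          # best-matching phrase per region
print(S_ground.shape, pred.shape)       # (5, 4) (5,)
```

Because the "classifier" is now just encoded text, swapping the prompt swaps the label space — which is what enables open-vocabulary detection.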
Features from different frames $X_t$ are aggregated into $M$.
By combining [[DINO]] with grounded pre-training, Grounding DINO can detect arbitrary objects given human input such as category names or referring expressions.
Open-Vocabulary Detection
an open-set object detector that can detect any objects with respect to an arbitrary free-form text prompt. The model was trained on over 10 million images, including detection data, visual grounding data, and image-text pairs. It has a strong zero-shot detection performance. However, the model needs text as inputs and can only detect boxes with corresponding phrases.
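A sketch of how such text-prompted output could be post-processed (threshold value and function names are illustrative assumptions, not the model's API): each predicted box carries per-phrase similarity scores, and only boxes whose best score exceeds a threshold are kept, each paired with its best-matching phrase.

```python
import numpy as np

def select_grounded_boxes(boxes, phrase_scores, phrases, thresh=0.35):
    """Keep boxes whose max phrase score exceeds `thresh`, and
    attach the best-matching phrase to each kept box."""
    best = phrase_scores.max(axis=1)
    keep = best > thresh
    labels = [phrases[i] for i in phrase_scores.argmax(axis=1)[keep]]
    return boxes[keep], labels, best[keep]

boxes = np.array([[0.1, 0.1, 0.4, 0.4],
                  [0.5, 0.5, 0.9, 0.9],
                  [0.0, 0.0, 0.2, 0.2]])
scores = np.array([[0.8, 0.1],    # strongly matches phrase 0
                   [0.2, 0.6],    # matches phrase 1
                   [0.1, 0.2]])   # below threshold -> dropped
kept, labels, conf = select_grounded_boxes(boxes, scores, ["red ball", "table"])
print(labels)  # ['red ball', 'table']
```

This also makes the stated limitation concrete: a box is only ever emitted together with some phrase from the input text, so the model cannot label objects outside the prompt.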
What is feature fusion?
- In the multi-modal setting, feature fusion specifically refers to fusing features from different modalities (vision, text, audio, etc.). CLIP can be viewed as a form of middle fusion: alignment happens right after feature extraction.

#### large-scale grounded pre-train for concept generalization

Reformulating **object detection** as a **phrase grounding task** and introducing **contrastive training** between object regions and language phrases on large-scale data.
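The contrastive objective between object regions and language phrases can be sketched as a cross-entropy over a temperature-scaled alignment matrix — a simplification (names and temperature are illustrative), not GLIP's exact word-region alignment loss.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_phrase_contrastive_loss(regions, phrases, targets, tau=0.07):
    """Cross-entropy over region-to-phrase alignment scores.
    targets[i] = index of the ground-truth phrase for region i."""
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    p = phrases / np.linalg.norm(phrases, axis=1, keepdims=True)
    logits = (r @ p.T) / tau                  # (n_regions, n_phrases)
    probs = softmax(logits, axis=1)
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

rng = np.random.default_rng(0)
regions = rng.normal(size=(6, 32))   # region features
phrases = rng.normal(size=(3, 32))   # phrase features
targets = np.array([0, 1, 2, 0, 1, 2])
loss = region_phrase_contrastive_loss(regions, phrases, targets)
print(loss > 0)  # True
```

Training on web-scale grounding data with this kind of loss is what lets the learned representation generalize to concepts never seen as detection labels.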