Chen Yulin's Blog

Posted 2025-02-19 | Updated 2026-06-22 | Review | 2 minutes read (About 273 words)

GLIP is a model that learns object-level, language-aware, and semantic-rich visual representations.
It unifies object detection and phrase grounding for pre-training.

Key questions

What is phrase grounding?
Phrase Grounding refers to the task of associating or “grounding” a natural language phrase (like a sentence or a word) to a specific region or object in an image. In other words, it’s about finding which part of the image corresponds to the object or concept described by a given text phrase.

For instance, if you have the phrase “the red ball on the table” and an image of a room with a red ball placed on a table, the goal of phrase grounding is to identify the exact region in the image that corresponds to the “red ball on the table”, distinguishing it from other objects in the image.

## Grounded Language Image Pre-training

GLIP casts the classic object detection task as a grounding problem and proposes a **Unified Formulation**.

Unified Formulation

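Traditional object detection methods classify each region into one of c classes, whereas this paper treats object detection as phrase grounding.
Detection is reformulated as a grounding task by grounding (aligning) each region to the c class phrases in a text prompt, for example
the classification prompt “person. bicycle. car. … . toothbrush”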
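
To make the reformulation concrete, here is a minimal sketch, assuming toy hand-written feature vectors in place of real image and text encoders. The helper names `build_detection_prompt` and `alignment_scores` are hypothetical and are not taken from the GLIP codebase. The sketch joins class names into a prompt, records where each class phrase sits in the prompt, and assigns each region to the phrase with the highest dot-product alignment score.

```python
def build_detection_prompt(class_names):
    """Join class names into an 'a. b. c' prompt and record each name's character span."""
    spans = {}
    cursor = 0
    for name in class_names:
        spans[name] = (cursor, cursor + len(name))
        cursor += len(name) + len(". ")  # skip the separator after each name
    return ". ".join(class_names), spans


def alignment_scores(region_feats, phrase_feats):
    """Dot product between every region feature and every phrase feature."""
    return [
        [sum(r * p for r, p in zip(region, phrase)) for phrase in phrase_feats]
        for region in region_feats
    ]


classes = ["person", "bicycle", "car"]
prompt, spans = build_detection_prompt(classes)

# Toy features (assumption): one vector per region and one per class phrase.
region_feats = [[1.0, 0.0, 0.0], [0.0, 0.2, 0.9]]
phrase_feats = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.1, 1.0]]

scores = alignment_scores(region_feats, phrase_feats)
# Each region is "classified" by picking its best-aligned phrase.
predicted = [classes[max(range(len(row)), key=row.__getitem__)] for row in scores]
```

In a real model the region features would come from the visual backbone and the phrase features from a language encoder applied to the prompt; only the scoring step is illustrated here.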

Posted 2025-02-16 | Updated 2026-06-22 | Review | a minute read (About 216 words)

Grounding-DINO

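By combining [[DINO]] with grounded pre-training, Grounding-DINO can detect arbitrary objects from human inputs such as category names or referring expressions.
Open-vocabulary detection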

Grounding-DINO is an open-set object detector that can detect any object described by an arbitrary free-form text prompt. The model was trained on over 10 million images, including detection data, visual grounding data, and image-text pairs. It has strong zero-shot detection performance. However, the model requires text as input and can only detect boxes that have corresponding phrases.

Grounding-DINO

Principle

Tight modality fusion based on [[DINO]]

What is feature fusion?

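- In the multimodal setting, feature fusion refers specifically to techniques that fuse features from different modalities (such as vision, text, and audio). CLIP should be regarded as a form of Middle Fusion, in which fusion and alignment happen right after feature extraction.

#### Large-scale grounded pre-training for concept generalization

Reformulating **object detection** as a **phrase grounding task** and introducing **contrastive training** between object regions and language phrases on large-scale data.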
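
The idea of contrastive training between object regions and language phrases can be sketched as follows. This is a generic softmax-based contrastive objective under simplifying assumptions (toy 2-D features, one matched phrase per region, a hypothetical `temperature` parameter) and is not necessarily the exact loss used by Grounding-DINO. The function name `region_phrase_contrastive_loss` is made up for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def region_phrase_contrastive_loss(region_feats, phrase_feats, targets, temperature=0.1):
    """Mean cross-entropy of softmax(similarity / temperature) against each region's matched phrase."""
    total = 0.0
    for region, target in zip(region_feats, targets):
        logits = [dot(region, phrase) / temperature for phrase in phrase_feats]
        # Numerically stable log-sum-exp over all phrases.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matched phrase.
        total += log_z - logits[target]
    return total / len(region_feats)


# Toy features (assumption): region i is perfectly aligned with phrase i.
regions = [[1.0, 0.0], [0.0, 1.0]]
phrases = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = region_phrase_contrastive_loss(regions, phrases, targets=[0, 1])
swapped_loss = region_phrase_contrastive_loss(regions, phrases, targets=[1, 0])
```

The loss is small when each region is most similar to its matched phrase and large when the matches are swapped, which is the pressure that pulls region and phrase features into a shared space.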
