Chen Yulin's Blog

Posted 2025-03-11Updated 2025-07-24Reviewa minute read (About 161 words)

DETR

参考： https://www.youtube.com/watch?v=T35ba_VXkMY&t=1744s

DETR是一个使用transformer作为基本架构的 object detection 模型。

Insight

Object queries (something that can be learned):

>Visualization of all box predictions on all images from COCO 2017 val set for 20 out of total N = 100 prediction slots in DETR decoder. Each box prediction is represented as a point with the coordinates of its center in the 1-by-1 square normalized by each image size. The points are color-coded so that green color corresponds to small boxes, red to large horizontal boxes and blue to large vertical boxes. We observe that each slot learns to specialize on certain areas and box sizes with several operating modes. We note that almost all slots have a mode of predicting large image-wide boxes that are common in COCO dataset.

Posted 2025-03-06Updated 2025-07-24Reviewa minute read (About 206 words)

Semantic-SAM

这片文章可以成为场景物理重建的基石之一
类似的后续工作有OMG-Seg

是什么

通用图像分割模型，以实现细分并识别任何所需粒度的任何内容

Semantic-awareness
Granularity-abundance

数据集

### 难点 - 目前的一些通用的物体和分割数据集虽然确实提供了大体量的数据和丰富的语义信息，但只局限在object level。 - 目前一些分割了细分part的数据集却体量有限。 - SAM使用的数据集为多粒度的大体量数据集，但是并不包含语义标注。 ### 解决方式合并了多个不同分割粒度的数据集

Posted 2025-03-06Updated 2025-07-24Reviewa few seconds read (About 26 words)

MaskDINO

注：此DINO并非自蒸馏自监督的那个[[DINO]]，而是派生自[[DETR]]

Posted 2025-03-04Updated 2025-07-24Reviewa minute read (About 154 words)

ALBEF

## Align Before Fuse ### Image Encoder 标准的六层self attention ViT，初始化为DeiT论文中的在ImageNet-1K数据集上训练的参数

Text Encoder

使用的backbone是BERT(通过MLM训练)
该研究认为，image encoder的模型大小应该大于text encoder,所以在text encoder这里，只使用六层self attention来提取特征，剩余六层cross attention用于multi-modal encoder。

ITC Loss & Momentum

参考Moco [[Moco- Momentum Contrast for Unsupervised Visual Representation Learning]]

Improve Noisy Web Data

见[[BLIP]]，是沿用的工作

Loss

#### ITC 旨在在融合之前学习更好的单模态表示

ITM

MLM

Posted 2025-03-04Updated 2025-07-24Reviewa few seconds read (About 3 words)

ViLT

Posted 2025-03-03Updated 2025-07-24Reviewa few seconds read (About 0 words)

ZegCLIP

Posted 2025-03-03Updated 2025-07-24Reviewa few seconds read (About 108 words)

BLIP

A vision-language model that unifies vision-language understanding and generation tasks.

主要分为两块工作：

去除图文检索所使用的数据集中的噪声
vision language understanding and generation

Model

Noise Filtering

Caption 模型生成图像文本对，然后Filt将caption和真实互联网数据（可能存在噪声）进行对比，如果差异过大则使用Caption模型生成的结果

Understanding & Generation

Posted 2025-02-19Updated 2025-07-24Review2 minutes read (About 273 words)

GLIP

GLIP是一个学习了object-level, language-aware, and semantic-rich visual representations 的模型。
统一对象检测和短语接地进行预训练。

重要的问题

什么是 phrase grounding:
Phrase Grounding refers to the task of associating or “grounding” a natural language phrase (like a sentence or a word) to a specific region or object in an image. In other words, it’s about finding which part of the image corresponds to the object or concept described by a given text phrase.

For instance, if you have the phrase “the red ball on the table” and an image of a room with a red ball placed on a table, the goal of phrase grounding is to identify the exact region in the image that corresponds to the “red ball on the table”, distinguishing it from other objects in the image.

## Grounded Language Image Pre-training 将经典对象检测任务投入到grounding问题中，并提出**Unified Formulation**

Unified Formulation

传统的物体检测方法会把每个region分类进c个classes，而本文使用的Object detection as phrase grounding.
我们通过将每个区域与文本提示中的c(class)短语进行接地/对齐，将检测重新制定为基础任务
the classification prompt “person. bicycle. car. … . toothbrush”

Posted 2025-02-18Updated 2025-07-24Reviewa few seconds read (About 0 words)

Extract Free Dense Labels from CLIP

Posted 2025-02-17Updated 2025-07-24Review2 minutes read (About 297 words)

ConceptFusion

## Approach 目标是构建一个open-set multimodal 3D map `M`. 可以使用特定于模态的编码器（基础模型）$F_{Mode}$将图像，文本，音频和点击等多维信号编码为矢量空间其中，`M` 由一系列点构成，每个点都包含：顶点位置，法向向量，置信度数量，颜色和概念向量（concept vector）组成首先是帧（单张输入图片）预处理：通过一系列输入的深度图片获取顶点法相maps和相机方位，再通过计算获得每张图片中每个像素的语义上下文嵌入。其中，语义上下文的嵌入是通过结合局部和全局的CLIP features获得的。