
GLIP is a model that learns object-level, language-aware, and semantic-rich visual representations.
It unifies object detection and phrase grounding for pre-training.
What is phrase grounding?
Phrase grounding refers to the task of associating, or "grounding", a natural language phrase (such as a word or a sentence) to a specific region or object in an image. In other words, it is about finding which part of the image corresponds to the object or concept described by a given text phrase. For instance, given the phrase "the red ball on the table" and an image of a room with a red ball placed on a table, the goal of phrase grounding is to identify the exact region in the image that corresponds to the "red ball on the table", distinguishing it from other objects in the image.
## Grounded Language Image Pre-training
GLIP casts the classic object detection task as a grounding problem and proposes a **Unified Formulation**.
Traditional object detection methods classify each region into one of c classes, whereas this paper treats object detection as phrase grounding: detection is reformulated as a grounding task by grounding/aligning each region with the c class phrases in a text prompt, for example
the classification prompt “person. bicycle. car. … . toothbrush”
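A minimal sketch of this reformulation, with random projections standing in for GLIP's visual and language encoders (the encoder functions and dimensions below are illustrative assumptions, not GLIP's actual architecture): the fixed c-way classification head is replaced by alignment scores between region features and phrase features in a shared space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoders: in GLIP these are a visual backbone and a language
# model; here random projections stand in purely for illustration.
def encode_regions(num_regions, dim=8):
    # Region features O: one row per candidate box.
    return rng.standard_normal((num_regions, dim))

def encode_prompt(class_names, dim=8):
    # Phrase features P: one row per class phrase in the text prompt.
    return rng.standard_normal((len(class_names), dim))

# A small subset of class names, joined into a classification prompt.
classes = ["person", "bicycle", "car", "toothbrush"]
prompt = ". ".join(classes) + "."  # "person. bicycle. car. toothbrush."

O = encode_regions(num_regions=5)
P = encode_prompt(classes)

# Detection as grounding: alignment logits S = O @ P^T replace the
# conventional c-way classifier; each region is scored against each phrase.
S = O @ P.T
print(S.shape)  # (5, 4): 5 regions scored against 4 class phrases
```

The key design point is that the "classifier weights" are now language embeddings, so the detector can score regions against any phrases placed in the prompt.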
## CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
Question: what is a spatial-semantic memory?
A spatial-semantic memory is an implicit scene model that can be used for a variety of tasks, such as segmentation, instance identification, spatial-semantic search, and view localization.
CLIP-Fields learns a mapping from spatial locations to semantic embedding vectors.
This mapping can be trained with supervision coming solely from models that were themselves trained on web images and web text (e.g. CLIP [[CLIP多模态预训练模型]], Detic, and Sentence-BERT); no direct human supervision is used.
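A sketch of how such weak labels can be assembled, assuming an RGB-D camera with a simple pinhole model (the intrinsics, the `fake_text_embedding` placeholder, and the specific detection below are all hypothetical; a real pipeline would call Detic and Sentence-BERT/CLIP here): each detected pixel is back-projected to a 3D point and paired with an embedding from a web-trained model.

```python
import numpy as np

def backproject(u, v, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    # Pinhole back-projection of pixel (u, v) at the given depth to a 3D point.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Placeholder for a web-trained text encoder (e.g. Sentence-BERT or the
# CLIP text tower); deterministic random vectors stand in for real embeddings.
def fake_text_embedding(label, dim=16):
    seed = sum(map(ord, label))
    return np.random.default_rng(seed).standard_normal(dim)

# One hypothetical detection: a detector labels pixel (400, 300) at depth
# 2.0 m as "chair". This yields one (3D point, label embedding) training
# pair with no human annotation involved.
point = backproject(400, 300, 2.0)
label_vec = fake_text_embedding("chair")
training_pair = (point, label_vec)
print(point.shape, label_vec.shape)  # (3,) (16,)
```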
We aim to build a system that can connect points of a 3D scene with their visual and semantic meaning.
The interface is a pair of scene-dependent implicit functions $f, h : \mathbb{R}^3 \to \mathbb{R}^n$ such that, for the coordinates of any point $P$ in the scene, $f(P)$ is a vector representing its semantic features and $h(P)$ is another vector representing its visual features.
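This interface can be sketched as a pair of small coordinate MLPs, each mapping $\mathbb{R}^3$ to $\mathbb{R}^n$ (the layer sizes and initialization below are illustrative assumptions; the networks are untrained and only demonstrate the shape of the interface):

```python
import numpy as np

rng = np.random.default_rng(2)

class ImplicitField:
    """A tiny coordinate MLP mapping R^3 -> R^n (untrained; illustration only)."""
    def __init__(self, out_dim, hidden=32):
        self.W1 = rng.standard_normal((3, hidden)) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, out_dim)) * 0.1
        self.b2 = np.zeros(out_dim)

    def __call__(self, p):
        # p: (..., 3) coordinates -> (..., out_dim) feature vectors.
        z = np.tanh(p @ self.W1 + self.b1)
        return z @ self.W2 + self.b2

n = 16
f = ImplicitField(out_dim=n)  # semantic features f(P)
h = ImplicitField(out_dim=n)  # visual features  h(P)

P = np.array([0.5, -1.0, 2.0])  # coordinates of a point in the scene
print(f(P).shape, h(P).shape)   # each an n-dimensional vector
```

Because the weights encode one particular scene, querying any 3D coordinate returns that scene's semantic and visual features at that location.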
It appears that the model must be retrained for each new scene to obtain that scene's coordinate-to-semantics mapping.