Chen Yulin's Blog

Posted 2025-01-09 | Updated 2025-08-15 | Note | a few seconds read (About 0 words)

Vision Transformers Need Registers

Posted 2025-01-09 | Updated 2025-08-15 | Note | a few seconds read (About 0 words)

DINOv2- Learning Robust Visual Features without Supervision

Posted 2025-01-09 | Updated 2025-08-15 | Note | a few seconds read (About 71 words)

AN IMAGE IS WORTH 16X16 WORDS- TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

https://www.youtube.com/watch?v=j3VNqtJUoz0&t=16s

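Core idea:

Split the image into patches, apply a linear projection to each patch, then add position embeddings before feeding the sequence into the transformer encoder.
An extra cls token is prepended as a placeholder (the ViT output is the output token that corresponds to this cls input token).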
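
The input pipeline described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `vit_input_tokens`, the 224x224 image, the 16x16 patch size, the embedding width `D = 64`, and the random weights are all made up for the example, and the transformer encoder itself is left out.

```python
import numpy as np

def vit_input_tokens(image, patch_size, W_proj, cls_token, pos_embed):
    """Turn an image [H, W, C] into the token sequence fed to a ViT encoder."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image size must be divisible by patch size"
    # Cut into non-overlapping p x p patches and flatten each one: [N, p*p*C]
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, p * p * C)
    tokens = patches @ W_proj                      # linear projection -> [N, D]
    tokens = np.concatenate([cls_token, tokens])   # prepend cls token -> [N+1, D]
    return tokens + pos_embed                      # add position embeddings

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
D, p = 64, 16
n_patches = (224 // p) ** 2
W_proj = rng.normal(size=(p * p * 3, D)) * 0.02
cls_token = np.zeros((1, D))
pos_embed = rng.normal(size=(n_patches + 1, D)) * 0.02

tokens = vit_input_tokens(image, p, W_proj, cls_token, pos_embed)
print(tokens.shape)
```

After the encoder runs, the classifier would read only the output at position 0, the slot that the cls token occupies.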

Posted 2025-01-08 | Updated 2025-08-15 | Note | 4 minutes read (About 561 words)

DINO

https://github.com/facebookresearch/dino/tree/main

# Emerging Properties in Self-Supervised Vision Transformers

https://juejin.cn/post/7224738994825789496
https://www.youtube.com/watch?v=h3ij3F3cPIk&t=1005s
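DI+NO (distillation + no labels)
Specifically, DINO uses a method called "unsupervised self-distillation", which learns the model's representations through self-supervised learning. In this method, the model uses its own outputs to generate "pseudo-labels" and is then trained on those pseudo-labels, which further improves its performance and generalization.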
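
To make the self-distillation idea more concrete: in the DINO paper the "pseudo-labels" come from a teacher network whose weights are an exponential moving average (EMA) of the student's, and the teacher output is centered and sharpened before the student is trained to match it on a different augmented view. The sketch below follows that recipe in NumPy under heavy assumptions: the "networks" are single random linear layers, the "views" are random arrays, the temperatures and momenta are only illustrative values, and the gradient step on the student is omitted.

```python
import numpy as np

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    t = np.exp(log_softmax((teacher_out - center) / tau_t))  # teacher target
    log_s = log_softmax(student_out / tau_s)
    return -(t * log_s).sum(axis=1).mean()

def ema(old, new, momentum):
    """Exponential moving average update."""
    return momentum * old + (1.0 - momentum) * new

rng = np.random.default_rng(0)
K = 8                                          # output dimension (hypothetical)
student_w = rng.normal(size=(16, K)) * 0.1     # toy linear "student"
teacher_w = student_w.copy()                   # teacher starts as a copy
center = np.zeros(K)

# Stand-ins for two augmented views of the same batch of 4 images.
view_a = rng.normal(size=(4, 16))
view_b = rng.normal(size=(4, 16))

s_a, s_b = view_a @ student_w, view_b @ student_w
t_a, t_b = view_a @ teacher_w, view_b @ teacher_w

# The student on one view is matched to the teacher on the other view.
loss = 0.5 * (dino_loss(s_a, t_b, center) + dino_loss(s_b, t_a, center))

# A real implementation would take a gradient step on student_w here.
teacher_w = ema(teacher_w, student_w, momentum=0.996)
center = ema(center, np.concatenate([t_a, t_b]).mean(axis=0), momentum=0.9)
print(loss)
```

No labels appear anywhere: the only training signal is the agreement between the two networks on different views of the same image.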

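Knowledge Distillation

https://blog.csdn.net/xbinworld/article/details/83063726

The key idea is to train with soft targets alongside the hard targets, where the soft targets come from the predictions of a large model. One might ask: if the true label (hard target) is entirely correct, why use soft targets at all?
A hard target carries little information (low entropy), whereas a soft target carries more, including the relationships between classes. For example, when classifying donkeys and horses, even if an image shows a horse, the soft target is not 1 at the horse index and 0 everywhere else, as the hard target is; it also assigns some probability to donkey. [5]
The benefit is that the soft target expresses that this image looks more like a donkey than like a car or a dog. This soft information lives in the probabilities, as do the relative similarities between labels. However, a soft target such as [0.98 0.01 0.01] is of little use, so a temperature parameter T is added to the softmax (it is not needed at inference time once training is finished).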
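
A small NumPy sketch of the temperature softmax shows the effect. The logits and the class order (horse, donkey, car) are invented for illustration; only the formula softmax(z / T) is standard.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """softmax(logits / T); larger T gives a flatter (softer) distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for the classes (horse, donkey, car).
logits = [6.0, 2.0, 1.0]

p_t1 = softmax_with_temperature(logits, T=1.0)   # nearly one-hot
p_t4 = softmax_with_temperature(logits, T=4.0)   # softer targets

print(np.round(p_t1, 3))
print(np.round(p_t4, 3))
```

With T = 1 almost all the mass sits on the horse class, so the student learns little beyond the hard label. With T = 4 the ordering horse > donkey > car is unchanged, but the donkey and car probabilities become large enough to carry a usable signal about class similarity.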

ViT

DINO

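Overall, the task DINO suits best is grouping together the same object seen in different states.

On the emergent behavior observed in DINO
https://juejin.cn/post/7280436457142501388

Work prior to DINO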

We have also seen emerged two properties that can be leveraged in future applications: the quality of the features in k-NN classification has a potential for image retrieval. The presence of information about the scene layout in the features can also benefit weakly supervised image segmentation.

Posted 2025-01-06 | Updated 2025-08-15 | Note | a minute read (About 197 words)

CLIP

https://blog.csdn.net/h661975/article/details/135116957

loss: ITC (Image Text Contrastive)

# image_encoder - ResNet or Vision Transformer 
# text_encoder - CBOW or Text Transformer 
# I[n, h, w, c] - minibatch of aligned images 
# T[n, l] - minibatch of aligned texts 
# W_i[d_i, d_e] - learned proj of image to embed 
# W_t[d_t, d_e] - learned proj of text to embed 
# t - learned temperature parameter  

# extract feature representations of each modality 
I_f = image_encoder(I) #[n, d_i] 
T_f = text_encoder(T) #[n, d_t]  

# joint multimodal embedding [n, d_e] 
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n] 
logits = np.dot(I_e, T_e.T) * np.exp(t)  

# symmetric loss function 
labels = np.arange(n) 
loss_i = cross_entropy_loss(logits, labels, axis=0) 
loss_t = cross_entropy_loss(logits, labels, axis=1) 
loss = (loss_i + loss_t)/2

Cross_entropy_loss:
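
For a row of logits $z$ and a correct index $y$, the cross-entropy is $-\log\big(e^{z_y} / \sum_j e^{z_j}\big)$, averaged over the batch. The pseudocode above leaves `l2_normalize` and `cross_entropy_loss` undefined, so here is one possible runnable NumPy reading of it. The helper implementations, the random stand-in embeddings, and the temperature value `t = log(1 / 0.07)` are assumptions made for this sketch; the encoders and projections are skipped entirely.

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy_loss(logits, labels, axis):
    """Mean of -log softmax(logits)[label], with the softmax taken along `axis`."""
    z = logits - logits.max(axis=axis, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=axis, keepdims=True))
    idx = np.arange(len(labels))
    # axis=1: row i is a distribution over columns.
    # axis=0: column i is a distribution over rows.
    picked = log_probs[idx, labels] if axis == 1 else log_probs[labels, idx]
    return -picked.mean()

rng = np.random.default_rng(0)
n, d_e = 4, 8                                   # batch size, embedding dim (hypothetical)
I_e = l2_normalize(rng.normal(size=(n, d_e)))   # stand-in image embeddings
T_e = l2_normalize(rng.normal(size=(n, d_e)))   # stand-in text embeddings
t = np.log(1 / 0.07)                            # temperature parameter (assumed value)

logits = I_e @ T_e.T * np.exp(t)                # scaled pairwise cosine similarities
labels = np.arange(n)                           # pair i matches pair i
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
print(loss)
```

The labels are simply `0..n-1` because the i-th image and the i-th text form the only positive pair in the batch; every other entry of the `[n, n]` similarity matrix acts as a negative.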

CLIP is essentially a global image embedding, which makes it ill-suited to extracting pixel-aligned features.

Posted 2025-01-06 | Updated 2025-08-15 | Note | 5 minutes read (About 790 words)

LERF- Language Embedded Radiance Fields

NeRF+CLIP

Intro

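Background

Neural Radiance Fields (NeRFs) have emerged as a powerful technique for capturing photorealistic digital representations of intricate real-world 3D scenes. However, the immediate output of a NeRF is nothing but a colorful density field, devoid of meaning or context, which inhibits building interfaces for interacting with the resulting 3D scenes.
Natural language is an intuitive interface for interacting with a 3D scene. Consider a capture of a kitchen. Imagine being able to navigate this kitchen by asking where the "utensils" are, or more specifically for a tool that could be used for "stirring", or even for your favorite mug with a specific logo on it, all with the comfort and familiarity of everyday conversation. This requires not only the capacity to handle natural language input queries, but also the ability to incorporate semantics at multiple scales and relate to long-tail and abstract concepts.

Solution

A Language Field
LERF grounds language within NeRF by optimizing embeddings from an off-the-shelf vision-language model (such as CLIP) into 3D scenes.
LERF provides an added benefit: because CLIP embeddings are extracted from multiple views over multiple scales, the relevancy maps of text queries obtained through the 3D CLIP embedding are more localized than those obtained through 2D CLIP embeddings. They are also 3D consistent by definition, enabling queries directly in the 3D field without having to render to multiple views.

Compared with CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]], LERF is denser.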

CLIP-Fields [32] and NLMaps-SayCan [8] fuse CLIP embeddings of crops into pointclouds, using a contrastively supervised field and classical pointcloud fusion respectively. In CLIP-Fields, the crop locations are guided by Detic [40]. On the other hand, NLMaps-SayCan relies on region proposal networks. These maps are sparser than LERF as they primarily query CLIP on detected objects rather than densely throughout views of the scene. Concurrent work ConceptFusion [19] fuses CLIP features more densely in RGBD pointclouds, using Mask2Former [9] to predict regions of interest, meaning it can lose objects which are out of distribution to Mask2Former’s training set. In contrast, LERF does not use region or mask proposals.

LERF

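Given a set of calibrated input images, we ground CLIP embeddings into a 3D field within NeRF. However, querying a CLIP embedding for a single 3D point is ambiguous, as CLIP is inherently a global image embedding and not conducive to pixel-aligned feature extraction. To account for this property, we propose a novel approach that involves learning a field of language embeddings over volumes centered at the sample point. Specifically, the output of this field is the average CLIP embedding across all training views of image crops containing the specified volume. By reframing the query from points to volumes, we can effectively supervise a dense field from coarse crops of input images, which can be rendered in a pixel-aligned manner by conditioning on a given volume scale.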
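
A toy NumPy sketch of the "average CLIP embedding over crops that contain the volume" target described above. It is not LERF's training code: the crop embeddings and the text embedding are random stand-ins rather than real CLIP outputs, the function name `volume_language_target` is hypothetical, and the relevancy score is a plain cosine similarity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def volume_language_target(crop_embeddings):
    """Average the embeddings of all crops containing a volume, then renormalize."""
    return l2_normalize(np.mean(crop_embeddings, axis=0))

rng = np.random.default_rng(0)
d = 16                                    # embedding dimension (hypothetical)

# Stand-ins for CLIP embeddings of crops from 5 training views that all
# contain the same volume (one 3D point at one scale).
shared = rng.normal(size=d)
crops = l2_normalize(shared + 0.3 * rng.normal(size=(5, d)))

target = volume_language_target(crops)    # what the language field should output
text_query = l2_normalize(shared)         # stand-in for a text embedding
score = float(target @ text_query)        # cosine similarity as a simple relevancy
print(score)
```

Changing the volume scale changes which crops contain the volume, so the same 3D point can carry different embeddings at different scales.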

https://blog.csdn.net/amusi1994/article/details/129701012

Posted 2025-01-06 | Updated 2025-08-15 | Note | 3 minutes read (About 480 words)

Some Thoughts Regarding -Reconstruct Anything-

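Mainly a record of thoughts that came up while reading papers on semantic scene reconstruction

Key questions

Which tasks fall under multimodal learning

Image-text retrieval
Visual question answering (VQA)
Visual reasoning
Visual entailment

Which losses are used in multimodal learning

Image Text Contrastive (ITC) [[CLIP多模态预训练模型]]
Word Patch Alignment (WPA), used in object detection ViT
Image Text Matching (ITM)
Masked Language Modeling (MLM): BERT-style cloze (fill in the blank)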

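Given a concrete task, what scene information does a robot need to carry it out successfully (general-purpose robot)

Constraint: ignore the robot's mobility for now, i.e. no navigation across fields of view (OK-Robot); assume a tabletop robot for the time being

Specifically, the characteristics of a general-purpose robot include:

Multi-task capability: able to perform many different kinds of tasks, such as assembly, transport, cleaning, and inspection.
Strong adaptability: able to adapt to a variety of environments and working conditions, for example working on different terrain or on different production lines.
Intelligent control: able to make autonomous decisions and plan tasks by means of advanced sensors, AI algorithms, machine learning, and similar techniques.

The specific shape of objects (for grasping, grab-anything)
Object semantic information (grounded caption, CLIP)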

Recognize The Relationships Between Child & Parent

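Inspired by DINO's self-distilled self-supervision, the physical relationships between the parts of an object could be inferred from image sequences of the object in motion (attention map) [[DINO]]

The training set could be generated in Unity with varied lighting and objects, with semantics attached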

Build the physics world in robot mind

Voxel colliders for detected objects and joints; a physics agent interacts with the physics engine.
Point cloud data, grounded caption => object property, hierarchy relation, joints (maybe a new model should be proposed)

Reconstructing object models from semantics

Inspired by [[BLIP]]: understanding for language & the existing point cloud, generation for the rest of the point cloud (already achieved by Wonder3D)

Posted 2025-01-06 | Updated 2025-08-15 | Note | 4 minutes read (About 541 words)

CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory

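Questions:

How it differs from LERF [[LERF- Language Embedded Radiance Fields]]

What it is

A spatial-semantic memory
It is an implicit scene model that can be used for a variety of tasks, such as segmentation, instance identification, semantic search over space, and view localization
CLIP-Fields learns a mapping from spatial locations to semantic embedding vectors.
This mapping can be trained with supervision coming only from models trained on web images and web text (such as CLIP[[CLIP多模态预训练模型]], Detic, and Sentence-BERT); it therefore uses no direct human supervision.

Prior work it builds on

CLIP [[CLIP多模态预训练模型]] : trains a pair of image and language embedding networks so that an image and a text string describing it have similar embeddings. This work makes heavy use of CLIP models and embeddings, because they can serve as a shared representation between an object's visual features and its possible language labels.
Detic : open-label object detection and image segmentation that lets users define the label set at run time, with no extra training or fine-tuning. Used to generate the dataset
Sentence-BERT : a sentence embedding network for text similarity
Instant-NGP : builds a mapping from spatial (and possibly temporal) coordinates to some physical property, such as RGB color and density in the case of neural radiance fields, or signed distance in the case of instant signed distance fields

Method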

Goal

We aim to build a system that can connect points of a 3D scene with their visual and semantic meaning.
Provide an interface with a pair of scene-dependent implicit functions $f, h : \mathbb{R}^3 \to \mathbb{R}^n$ such that for the coordinates of any point $P$ in our scene, $f(P)$ is a vector representing its semantic features, and $h(P)$ is another vector representing its visual features.

Dataset creation

> **MHE** is the multi-resolution hash encoding introduced in [[Instant Neural Graphics Primitives with a Multiresolution Hash Encoding]].
> MHEs build an implicit representation over coordinates with a feature-pyramid-like structure, which can flexibly maintain both local and global information, unlike purely voxel-based encodings ([[Scene-LLM]]), which focus on local structures only.

It seems that the model has to be retrained for every new scene to obtain the mapping from coordinates to semantics.

Posted 2025-01-06 | Updated 2025-08-15 | Note | a few seconds read (About 3 words)

Simple Open-Vocabulary Object Detection with Vision Transformers

Posted 2025-01-06 | Updated 2025-08-15 | Note | 6 minutes read (About 959 words)

OK-Robot- What Really Matters in Integrating Open-Knowledge Models for Robotics

Intro

Creating a general-purpose robot has been a longstanding dream of the robotics community.

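Background

Current systems that pursue this goal are brittle and closed, and they fail when they encounter unseen situations. Even the largest robotics models can often only be deployed in previously seen environments [5, 6]. This brittleness is further exacerbated in settings where little robot data is available, such as unstructured home environments.

Large vision models have shown capabilities in semantic understanding, detection, and connecting visual representations to language. At the same time, basic robotic skills such as navigation, grasping, and rearrangement are fairly mature.
Yet robot systems that combine modern vision models with robot-specific primitives perform very poorly.

This may be because naively combining several uncertain systems causes accuracy to deteriorate sharply.
So we need a careful framework that combines VLMs with robot primitives (navigation, grasping, placing), namely OK-Robot.

Findings

Pre-trained VLMs are highly effective for open-vocabulary navigation: current open-vocabulary vision-language models such as CLIP or OWL-ViT offer strong performance in identifying arbitrary objects in the real world, and enable navigating to them in a zero-shot manner.
Pre-trained grasping models can be applied directly to mobile manipulation: as with VLMs, special-purpose robot models pre-trained on large amounts of data can be applied out of the box to open-vocabulary grasping in homes. These robot models do not require any additional training or fine-tuning.
How the components are combined is crucial: given the pre-trained models, we find that they can be combined with no training using a simple state-machine model. We also find that using heuristics to counteract the robot's physical limitations can lead to a higher success rate in the real world.
Several challenges remain: while OK-Robot improves on prior work given the immense challenge of operating zero-shot in arbitrary homes, by analyzing the failure modes we find that significant improvements can be made to the VLMs, the robot models, and the robot morphology, which would directly increase the performance of open-knowledge manipulation agents.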
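
A back-of-the-envelope illustration of why chaining uncertain modules hurts: if the stages were independent, the overall success rate would be the product of the per-stage rates. The 90% figures below are invented for the example and are not results from the paper, and real stages are not necessarily independent.

```python
def pipeline_success_rate(stage_rates):
    """Overall success probability if every stage must succeed independently."""
    p = 1.0
    for rate in stage_rates:
        p *= rate
    return p

# Hypothetical per-stage success rates, for illustration only.
stages = {"navigation": 0.9, "grasping": 0.9, "placing": 0.9}
overall = pipeline_success_rate(stages.values())
print(f"{overall:.3f}")
```

Three stages that each work 90% of the time already bring the whole pipeline down to roughly 73%, which is why the way the components are combined matters so much.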

Methodology

Main task the framework performs

"Pick up A (from B) and drop it on/in C", where A is an object and B and C are places in a real-world environment such as homes

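Open-home, open-vocabulary object navigation

Responsible for spatial reconstruction, identifying the approximate location of objects, and robot navigation.
Methods used: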

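CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]] : an RGB-D video of the home -> a sequence of posed (with camera pose and positions) RGB-D images, used to reconstruct the environment; the work also uses this to obtain the floor surface next to the objects and receptacles in the environment.
OWL-ViT [[Simple Open-Vocabulary Object Detection with Vision Transformers]] : the detector is applied to every frame, and each object's bounding box, CLIP embedding, and detector confidence are extracted and passed to the object memory module
SAM: used to convert the OWL-ViT detection boxes into masks
VoxelMap: similar to the object-centric memory of CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]]; based on the CLIP semantic vector of every point in the point cloud, each 5 cm voxel holds a detector-confidence weighted average of the CLIP embeddings.
Querying the memory module: first convert the language query into a CLIP semantic vector, then, using the CLIP embeddings in the voxel map, find the semantically closest voxel and use it to localize the object.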
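
The voxel memory and its query can be sketched as follows. This is a minimal illustration under assumptions, not the OK-Robot implementation: the class `VoxelMap` and its methods are hypothetical, the 3-dimensional "CLIP" vectors are toy stand-ins for real embeddings, and a linear scan replaces whatever index a real system would use.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x)

class VoxelMap:
    """Toy voxel memory: each voxel keeps a confidence-weighted mean embedding."""

    def __init__(self, voxel_size=0.05):           # 5 cm voxels
        self.voxel_size = voxel_size
        self.weighted_sum = {}                      # voxel index -> sum(conf * emb)
        self.weight = {}                            # voxel index -> sum(conf)

    def add(self, point, embedding, confidence):
        key = tuple(np.floor(np.asarray(point) / self.voxel_size).astype(int))
        emb = confidence * np.asarray(embedding, dtype=float)
        self.weighted_sum[key] = self.weighted_sum.get(key, 0.0) + emb
        self.weight[key] = self.weight.get(key, 0.0) + confidence

    def query(self, text_embedding):
        """Return the center of the voxel most similar to the query embedding."""
        q = l2_normalize(np.asarray(text_embedding, dtype=float))

        def similarity(key):
            mean_emb = self.weighted_sum[key] / self.weight[key]
            return float(l2_normalize(mean_emb) @ q)

        best = max(self.weighted_sum, key=similarity)
        return (np.array(best) + 0.5) * self.voxel_size

# Toy 3-D stand-ins for CLIP embeddings of two objects.
mug, book = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])

vmap = VoxelMap()
vmap.add([0.51, 0.22, 0.82], mug, confidence=0.9)
vmap.add([0.52, 0.23, 0.83], mug + 0.1 * book, confidence=0.4)  # same voxel
vmap.add([1.21, 0.41, 0.76], book, confidence=0.8)

location = vmap.query(np.array([0.9, 0.1, 0.0]))  # stand-in for a "mug" text query
print(location)
```

Weighting by detector confidence means that low-confidence detections falling into the same voxel pull the stored embedding less than high-confidence ones.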
