Chen Yulin's Blog

Posted 2025-03-24 | Updated 2025-07-20 | Note | 4 minutes read (About 539 words)

My repository: https://github.com/Chen-Yulin/Semantic-SAM
My venv: ssam

Installation

Tested Python versions: 3.8, 3.10
Official steps:

pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
pip install git+https://github.com/cocodataset/panopticapi.git
git clone https://github.com/UX-Decoder/Semantic-SAM
cd Semantic-SAM
python -m pip install -r requirements.txt

export DATASET=/pth/to/dataset  # path to your coco data

Some stumbling blocks ^ ^

1

Following [[Cuda+Torch]], install both cudatoolkit and cuda-toolkit first:

conda install nvidia/label/cuda-11.7.0::cuda-toolkit -c nvidia/label/cuda-11.7.0 
conda install cudatoolkit # no need to specify version
conda env config vars set LD_LIBRARY_PATH="/home/cyl/miniconda3/envs/<name>/lib/"
conda env config vars set CPATH="/home/cyl/miniconda3/envs/<name>/include/" # add `/usr/include` if `crypt.h` is missing
conda env config vars set CUDA_HOME="/home/cyl/miniconda3/envs/<name>/"

Then follow the install command from the PyTorch website:

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

2

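Running the second line of the official steps (the detectron2 install) directly may fail with a complaint that the system gcc version is too new. Install gcc=11.2.0: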

conda install -c conda-forge gcc=11.2.0
conda install -c conda-forge gxx=11.2.0

# Point the build at the conda compilers
export CC=$CONDA_PREFIX/bin/gcc
export CXX=$CONDA_PREFIX/bin/g++

# If crypt.h cannot be found
sudo pacman -S libxcrypt-compat

export CXXFLAGS="${CXXFLAGS} -fuse-ld=/usr/bin/ld"

If compilation fails with `ld: cannot find -lcudart: No such file or directory` followed by `collect2: error: ld returned 1 exit status`, it is simply because cudatoolkit is not installed ^ ^
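Before kicking off the long detectron2 build, it can save time to check that the pieces from steps 1 and 2 are actually in place. Below is a minimal sketch, assuming the conda layout used above (`CUDA_HOME` pointing at the env root, with `bin/nvcc` and `lib/libcudart*` underneath). The function name `check_cuda_env` is made up for this note, and the mapping of each file to a conda package follows the notes above rather than anything verified.

```python
import os
from pathlib import Path


def check_cuda_env(env=None):
    """Return a list of human-readable problems found in the CUDA build environment."""
    env = os.environ if env is None else env
    problems = []

    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
        return problems

    # Assumption: nvcc lives in <env>/bin once cuda-toolkit is installed.
    if not (Path(cuda_home) / "bin" / "nvcc").exists():
        problems.append("nvcc not found under CUDA_HOME/bin (is cuda-toolkit installed?)")

    # libcudart is what the linker looks for when it sees `-lcudart`.
    lib_dir = Path(cuda_home) / "lib"
    if not any(lib_dir.glob("libcudart*")):
        problems.append("libcudart not found under CUDA_HOME/lib (is cudatoolkit installed?)")

    return problems
```

Calling `check_cuda_env()` inside the activated env and printing the returned list should point at the missing package before the compiler does.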

3

After installation, a plain `import semantic_sam` fails with `ModuleNotFoundError: No module named 'MultiScaleDeformableAttention'` ^ ^
The message says:

Please compile MultiScaleDeformableAttention CUDA op with the following commands:
	`cd mask2former/modeling/pixel_decoder/ops`
	`sh make.sh`

The Mask2Former ops have to be built manually:

cd Mask2Former/mask2former/modeling/pixel_decoder/ops/
sh make.sh

4

Some version issues:

pip install gradio==3.37.0
pip install matplotlib==3.7.0

Demo 🐱

Generate multi-granularity Mask on CLICK

python demo.py --ckpt ./weights/swinl_only_sam_many2many.pth

Comment: compared with SAM, the results reflect semantic consistency more than texture-based segmentation.

Automatically Generate Mask on Different Granularity

python demo_auto_generation.py --ckpt ./weights/swinl_only_sam_many2many.pth

Problems to solve

Masks at the same level overlap

Solved by `utils.psg_utils.mask.discard_submask` in `psg_data.segment_pipeline`
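The body of `discard_submask` is not shown in this note, so the following is only a guess at the idea: within one granularity level, drop any mask that is (almost) entirely covered by a larger mask. The function name `discard_submasks`, the `containment_thresh` parameter, and its default of 0.9 are all assumptions made for illustration.

```python
import numpy as np


def discard_submasks(masks, containment_thresh=0.9):
    """Drop every mask that is mostly covered by a larger mask in the same list.

    masks: list of boolean arrays of identical shape (one granularity level).
    Returns the indices of the masks that are kept, in ascending order.
    """
    # Visit masks from largest to smallest so a container is always seen first.
    order = sorted(range(len(masks)), key=lambda i: int(masks[i].sum()), reverse=True)
    kept = []
    for i in order:
        area = int(masks[i].sum())
        if area == 0:
            continue  # empty masks carry no information
        contained = any(
            np.logical_and(masks[i], masks[j]).sum() / area >= containment_thresh
            for j in kept
        )
        if not contained:
            kept.append(i)
    return sorted(kept)
```

Measuring overlap relative to the smaller mask's own area (rather than IoU) is a deliberate choice in this sketch: a small part inside a large object has a tiny IoU with it but a containment ratio close to 1.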

Part-seg-dataset Generation

Results

Original image:


Instance identification:

Part segmentation

Posted 2025-03-12 | Updated 2025-07-20 | Review | a few seconds read (About 94 words)

SceneGraphFusion- Incremental 3D Scene Graph Prediction from RGB-D Sequences

Overview of the proposed SceneGraphFusion framework. Our method takes a stream of RGB-D images a) as input to create an incremental geometric segmentation b). Then, the properties of each segment and a neighbor graph between segments are constructed. The properties d) and neighbor graph e) of the segments that have been updated in the current frame c) are used as the inputs to compute node and edge features f) and to predict a 3D scene graph g). Finally, the predictions are h) fused back into a globally consistent 3D graph.

Posted 2025-03-06 | Updated 2025-07-20 | Review | a minute read (About 206 words)

Semantic-SAM

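This paper could become one of the cornerstones of physical scene reconstruction.
Similar follow-up work includes OMG-Seg.

What it is

A universal image segmentation model that segments and recognizes anything at any desired granularity.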

Semantic-awareness
Granularity-abundance

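Datasets

### Challenges
- Current general-purpose object and segmentation datasets do provide large volumes of data and rich semantic information, but they stop at the object level.
- Datasets that do segment fine-grained parts are limited in size.
- The dataset used by SAM is large and multi-granularity, but it contains no semantic annotations.

### Solution
Merge multiple datasets with different segmentation granularities.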

Posted 2025-03-06 | Updated 2025-07-20 | Review | a few seconds read (About 26 words)

MaskDINO

Note: this DINO is not the self-distillation, self-supervised [[DINO]]; it is derived from [[DETR]].

Posted 2025-03-03 | Updated 2025-07-20 | Review | a few seconds read (About 0 words)

ZegCLIP

Posted 2025-03-03 | Updated 2025-07-20 | Review | a few seconds read (About 108 words)

BLIP

A vision-language model that unifies vision-language understanding and generation tasks.

The work has two main parts:

Removing noise from the datasets used for image-text retrieval
Vision-language understanding and generation

Model

Noise Filtering

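The Captioner generates image-text pairs; the Filter then compares the generated caption against the real web text (which may be noisy), and if the two differ too much, the Captioner's output is used instead.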

Understanding & Generation

Posted 2025-02-18 | Updated 2025-07-20 | Review | a few seconds read (About 0 words)

Extract Free Dense Labels from CLIP

Posted 2025-02-17 | Updated 2025-07-20 | Review | 2 minutes read (About 297 words)

ConceptFusion

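## Approach
The goal is to build an open-set multimodal 3D map `M`. Modality-specific encoders (foundation models) $F_{Mode}$ can encode multi-dimensional signals such as images, text, audio, and clicks into a vector space.

`M` consists of a set of points, each of which holds: a vertex position, a normal vector, a confidence count, a color, and a concept vector.

The first step is per-frame (single input image) preprocessing: vertex-normal maps and camera poses are obtained from the sequence of input depth images, and then a semantic context embedding is computed for every pixel of every image. The semantic context embedding is obtained by combining local and global CLIP features.

Next comes feature fusion: using the camera pose, each frame's vertex and normal maps are mapped into the global coordinate system. Every pixel $(u, v)_t$ in frame $X_t$ has a corresponding point $P_k$ in `M`.

The formula for aggregating the features from different frames $X_t$ into the feature points of `M`: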
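As a rough illustration only, and assuming the aggregation is a confidence-weighted running average (the usual choice in point-based fusion), the update of one map point $P_k$ by one pixel $(u, v)_t$ could look like the sketch below. The function name `fuse_feature` and the way `alpha` is supplied are assumptions; how the weight of a new measurement is actually computed is not covered in this note.

```python
import numpy as np


def fuse_feature(f_map, c_map, f_pixel, alpha):
    """Confidence-weighted running average of a map point's concept vector.

    f_map:   current concept vector stored at map point P_k
    c_map:   confidence accumulated so far at P_k
    f_pixel: feature of the pixel (u, v)_t that projects onto P_k
    alpha:   weight of the new measurement (assumed to be given)
    Returns the updated concept vector and the updated confidence.
    """
    f_new = (c_map * f_map + alpha * f_pixel) / (c_map + alpha)
    return f_new, c_map + alpha
```

With this form, a point that has been observed many times (large `c_map`) changes little when one more frame arrives, while a freshly created point adopts the new feature almost directly.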

Posted 2025-01-06 | Updated 2025-07-20 | Note | 5 minutes read (About 790 words)

LERF- Language Embedded Radiance Fields

NeRF+CLIP

Intro

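Background

Neural radiance fields (NeRF) have emerged as a powerful technique for capturing photorealistic digital representations of complex real-world 3D scenes. However, the direct output of a NeRF is nothing more than a colored density field, devoid of meaning or context, which hinders building interfaces for interacting with the resulting 3D scenes.
Natural language is an intuitive interface for interacting with a 3D scene. Consider a capture of a kitchen. Imagine being able to navigate this kitchen by asking where the "utensils" are, or more specifically for a tool you could use for "stirring", or even for your favorite mug with a specific logo on it, all with the comfort and familiarity of everyday conversation. This requires not only the ability to handle natural-language input queries but also the ability to incorporate semantics at multiple scales and to relate to long-tail and abstract concepts.

Solution

A language field
It grounds language within NeRF by optimizing embeddings from an off-the-shelf vision-language model (such as CLIP) into the 3D scene.
LERF brings an added benefit: because CLIP embeddings are extracted from multiple views at multiple scales, the relevancy maps of text queries obtained through the 3D CLIP embedding are more localized than those obtained through 2D CLIP embeddings. They are also 3D-consistent by definition and can be queried directly in the 3D field without rendering to multiple views.

Compared with CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]], LERF is denser.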

CLIP-Fields [32] and NLMaps-SayCan [8] fuse CLIP embeddings of crops into pointclouds, using a contrastively supervised field and classical pointcloud fusion respectively. In CLIP-Fields, the crop locations are guided by Detic [40]. On the other hand, NLMaps-SayCan relies on region proposal networks. These maps are sparser than LERF as they primarily query CLIP on detected objects rather than densely throughout views of the scene. Concurrent work ConceptFusion [19] fuses CLIP features more densely in RGBD pointclouds, using Mask2Former [9] to predict regions of interest, meaning it can lose objects which are out of distribution to Mask2Former’s training set. In contrast, LERF does not use region or mask proposals.

LERF

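Given a set of calibrated input images, we ground CLIP embeddings into a 3D field within NeRF. However, querying a CLIP embedding for a single 3D point is ambiguous, because CLIP is inherently a global image embedding and is not conducive to pixel-aligned feature extraction. To account for this property, we propose a novel approach that learns a field of language embeddings over volumes centered at the sample point. Specifically, the output of this field is the average CLIP embedding across all training views of image crops containing the specified volume. By reframing the query from points to volumes, we can effectively supervise a dense field from coarse crops of the input images, which can then be rendered in a pixel-aligned manner by conditioning on a given volume scale.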

https://blog.csdn.net/amusi1994/article/details/129701012

Posted 2025-01-06 | Updated 2025-07-20 | Note | 4 minutes read (About 541 words)

CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory

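Open question:

How does it differ from LERF [[LERF- Language Embedded Radiance Fields]]?

What it is

A spatial-semantic memory
An implicit scene model that can be used for a variety of tasks, such as segmentation, instance identification, semantic search over space, and view localization.
CLIP-Fields learns a mapping from spatial locations to semantic embedding vectors.
This mapping can be trained with supervision coming only from models trained on web images and web text (such as CLIP [[CLIP多模态预训练模型]], Detic, and Sentence-BERT), so no direct human supervision is used.

Builds on

CLIP [[CLIP多模态预训练模型]]: trains a pair of image and language embedding networks so that an image and a text string describing it have similar embeddings. This work relies heavily on CLIP models and embeddings because they serve as a shared representation between an object's visual features and its possible language labels.
Detic: open-vocabulary object detection and image segmentation that lets the user define the label set at run time, with no extra training or fine-tuning. Used to generate the dataset.
Sentence-BERT: a sentence embedding network for text similarity.
Instant-NGP: builds a mapping from spatial (and possibly temporal) coordinates to some physical property, such as RGB color and density in the neural radiance field case, or signed distance in the instant signed distance field case.

Method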

Goal

We aim to build a system that can connect points of a 3D scene with their visual and semantic meaning.
Provide an interface with a pair of scene-dependent implicit functions $f, h : \mathbb{R}^3 \to \mathbb{R}^n$ such that for the coordinates of any point $P$ in our scene, $f(P)$ is a vector representing its semantic features, and $h(P)$ is another vector representing its visual features.
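To make the interface concrete, here is a toy sketch of how such a field could be queried: embed a query, evaluate $f$ at candidate points, and pick the point with the highest cosine similarity. Everything here is hypothetical. `toy_f` is a fixed linear map standing in for the learned field, and `locate` is a name invented for this note; in the real system the query embedding would come from a language model such as CLIP or Sentence-BERT.

```python
import numpy as np


def cosine_similarity(features, query):
    """Cosine similarity between each row of `features` and one `query` vector."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return features @ query


def locate(points, f, query_embedding):
    """Return the point whose semantic feature f(P) best matches the query."""
    scores = cosine_similarity(f(points), query_embedding)
    best = int(np.argmax(scores))
    return points[best], float(scores[best])


# Toy stand-in for the learned field f: a fixed linear map R^3 -> R^4.
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])


def toy_f(points):
    return points @ W
```

The same `locate` call works with $h$ and an image embedding, which is what makes a shared embedding space convenient: text queries and visual queries go through one code path.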

Dataset creation

> **MHE** is the multi-resolution hash encoding introduced in [[Instant Neural Graphics Primitives with a Multiresolution Hash Encoding]].
> MHEs build an implicit representation over coordinates with a feature-pyramid-like structure, which can flexibly maintain both local and global information, unlike purely voxel-based encodings ([[Scene-LLM]]), which focus on local structures only.

It seems that every new scene requires training from scratch to obtain the coordinate-to-semantics mapping.
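A stripped-down sketch of the multi-resolution hash idea, to show where the "feature pyramid" comes from: each level has its own grid resolution and its own small feature table, and the per-level features are concatenated. This is a simplification under stated assumptions: it looks up only the lower grid corner instead of trilinearly interpolating the eight surrounding corners, the tables are plain arrays rather than trained parameters, and `base_res` and `growth` are illustrative defaults.

```python
import numpy as np

# Primes of the XOR spatial hash described in the Instant-NGP paper.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)


def hash_coords(cells, table_size):
    """XOR-based spatial hash of integer grid coordinates, shape (N, 3) -> (N,)."""
    cells = cells.astype(np.uint64)
    h = cells[:, 0] * PRIMES[0]
    h ^= cells[:, 1] * PRIMES[1]  # uint64 products wrap around, which is intended
    h ^= cells[:, 2] * PRIMES[2]
    return (h % np.uint64(table_size)).astype(np.int64)


def encode(points, tables, base_res=4, growth=2.0):
    """Concatenate per-level features for points in [0, 1)^3.

    tables: list of (T, F) arrays, one hash table per resolution level.
    Simplification: nearest lower grid corner only, no trilinear interpolation.
    """
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)  # finer grid at each level
        cells = np.floor(points * res).astype(np.int64)
        idx = hash_coords(cells, table.shape[0])
        feats.append(table[idx])
    return np.concatenate(feats, axis=1)
```

Coarse levels map many nearby points to the same table entry (global context), while fine levels separate them (local detail), which is the contrast with a single-resolution voxel grid drawn in the quote above.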