
SceneGraphFusion- Incremental 3D Scene Graph Prediction from RGB-D Sequences
Overview of the proposed SceneGraphFusion framework. Our method takes a stream of RGB-D images a) as input to create an incremental geometric segmentation b). Then, the properties of each segment and a neighbor graph between segments are constructed. The properties d) and neighbor graph e) of the segments that have been updated in the current frame c) are used as the inputs to compute node and edge features f) and to predict a 3D scene graph g). Finally, the predictions are h) fused back into a globally consistent 3D graph.
Formula for aggregating the features from the different frames $X_t$ into the feature points of M:
https://github.com/IDEA-Research/Grounded-Segment-Anything
By [[Grounding-DINO]] + SAM
Achieving Open-Vocab. Det & Seg
LERF- Language Embedded Radiance Fields
NeRF+CLIP
A language field.
Grounds language in NeRF by optimizing embeddings from off-the-shelf vision-language models (such as CLIP) into the 3D scene.
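LERF offers an added benefit: because CLIP embeddings are extracted from multiple views at multiple scales, the relevancy maps for text queries obtained through the 3D CLIP embeddings are more localized than those obtained through 2D CLIP embeddings. They are also 3D-consistent by definition and can be queried directly in the 3D field without rendering to multiple views.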
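To make the idea of a text-query relevancy map concrete, here is a minimal sketch that scores 3D points by cosine similarity between per-point embeddings and a text embedding. This is an illustration under assumptions, not LERF's actual scoring function: the helper names (`cosine`, `relevancy`) are hypothetical, and the toy 3-dimensional vectors stand in for real CLIP embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def relevancy(point_embeddings, text_embedding):
    """One score per 3D point: higher means more relevant to the text query."""
    return [cosine(e, text_embedding) for e in point_embeddings]

# Toy data (assumption): three points with 3-dimensional stand-in embeddings.
points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.0]

scores = relevancy(points, query)
best = max(range(len(scores)), key=scores.__getitem__)  # index of the most relevant point
```

Because the scores live on the 3D points themselves, the query is answered once in 3D rather than once per rendered view.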
Compared with CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]], LERF is denser.
CLIP-Fields [32] and NLMaps-SayCan [8] fuse CLIP embeddings of crops into pointclouds, using a contrastively supervised field and classical pointcloud fusion respectively. In CLIP-Fields, the crop locations are guided by Detic [40]. On the other hand, NLMaps-SayCan relies on region proposal networks. These maps are sparser than LERF as they primarily query CLIP on detected objects rather than densely throughout views of the scene. Concurrent work ConceptFusion [19] fuses CLIP features more densely in RGBD pointclouds, using Mask2Former [9] to predict regions of interest, meaning it can lose objects which are out of distribution to Mask2Former’s training set. In contrast, LERF does not use region or mask proposals.
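A rough sketch of what fusing crop embeddings into a point cloud can look like: each crop contributes its embedding to the points it covers, and every point keeps a weighted running mean. This is a generic averaging scheme written for illustration; it is not the CLIP-Fields training procedure or the NLMaps-SayCan pipeline, and `PointFeature`, `fuse_crops`, and the toy data are hypothetical.

```python
class PointFeature:
    """Weighted running mean of the embeddings observed for one 3D point."""

    def __init__(self, dim):
        self.mean = [0.0] * dim
        self.weight = 0.0

    def fuse(self, embedding, weight=1.0):
        if weight <= 0.0:
            raise ValueError("weight must be positive")
        total = self.weight + weight
        self.mean = [
            (m * self.weight + e * weight) / total
            for m, e in zip(self.mean, embedding)
        ]
        self.weight = total

def fuse_crops(num_points, dim, crops):
    """crops: iterable of (point_ids, embedding, weight), one entry per image crop."""
    points = [PointFeature(dim) for _ in range(num_points)]
    for point_ids, embedding, weight in crops:
        for pid in point_ids:
            points[pid].fuse(embedding, weight)
    return points

# Toy data (assumption): two crops with 2-dimensional stand-in embeddings.
crops = [
    ([0, 1], [1.0, 0.0], 1.0),  # crop A covers points 0 and 1
    ([1, 2], [0.0, 1.0], 3.0),  # crop B covers points 1 and 2, with higher weight
]
fused = fuse_crops(num_points=3, dim=2, crops=crops)
```

Points covered by no crop keep a zero weight, which is the sparsity the passage above describes: only regions that a detector or proposal network selected ever receive a feature.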