
GLIP is a model that learns object-level, language-aware, and semantic-rich visual representations.
It unifies object detection and phrase grounding for pre-training.
What is phrase grounding?
Phrase grounding refers to the task of associating, or "grounding", a natural language phrase (such as a word or a sentence) to a specific region or object in an image. In other words, it is about finding which part of the image corresponds to the object or concept described by a given text phrase. For instance, given the phrase "the red ball on the table" and an image of a room with a red ball placed on a table, the goal of phrase grounding is to identify the exact region in the image that corresponds to the "red ball on the table", distinguishing it from other objects in the image.
## Grounded Language Image Pre-training
GLIP casts the classic object detection task as a grounding problem and proposes a **Unified Formulation**.
Traditional object detection methods classify each region into one of c classes, whereas this paper treats object detection as phrase grounding: detection is reformulated as a grounding task by grounding/aligning each region with the c class phrases in a text prompt, for example
the classification prompt “person. bicycle. car. … . toothbrush”
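A minimal sketch of this reformulation, with random projections standing in for GLIP's visual and language encoders (the encoder functions and dimensions below are illustrative assumptions, not GLIP's actual architecture): the fixed c-way classification head is replaced by alignment scores between region features and phrase features in a shared space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoders: in GLIP these are a visual backbone and a language
# model; here random projections stand in purely for illustration.
def encode_regions(num_regions, dim=8):
    # Region features O: one row per candidate box.
    return rng.standard_normal((num_regions, dim))

def encode_prompt(class_names, dim=8):
    # Phrase features P: one row per class phrase in the text prompt.
    return rng.standard_normal((len(class_names), dim))

# A small subset of class names, joined into a classification prompt.
classes = ["person", "bicycle", "car", "toothbrush"]
prompt = ". ".join(classes) + "."  # "person. bicycle. car. toothbrush."

O = encode_regions(num_regions=5)
P = encode_prompt(classes)

# Detection as grounding: alignment logits S = O @ P^T replace the
# conventional c-way classifier; each region is scored against each phrase.
S = O @ P.T
print(S.shape)  # (5, 4): 5 regions scored against 4 class phrases
```

The key design point is that the "classifier weights" are now language embeddings, so the detector can score regions against any phrases placed in the prompt.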
## CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
Question: what is a spatial-semantic memory?
A spatial-semantic memory is an implicit scene model that can be used for a variety of tasks, such as segmentation, instance identification, spatial-semantic search, and view localization.
CLIP-Fields learns a mapping from spatial locations to semantic embedding vectors.
This mapping can be trained with supervision coming solely from models that were themselves trained on web images and web text (e.g. CLIP [[CLIP多模态预训练模型]], Detic, and Sentence-BERT); no direct human supervision is used.
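A sketch of how such weak labels can be assembled, assuming an RGB-D camera with a simple pinhole model (the intrinsics, the `fake_text_embedding` placeholder, and the specific detection below are all hypothetical; a real pipeline would call Detic and Sentence-BERT/CLIP here): each detected pixel is back-projected to a 3D point and paired with an embedding from a web-trained model.

```python
import numpy as np

def backproject(u, v, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    # Pinhole back-projection of pixel (u, v) at the given depth to a 3D point.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Placeholder for a web-trained text encoder (e.g. Sentence-BERT or the
# CLIP text tower); deterministic random vectors stand in for real embeddings.
def fake_text_embedding(label, dim=16):
    seed = sum(map(ord, label))
    return np.random.default_rng(seed).standard_normal(dim)

# One hypothetical detection: a detector labels pixel (400, 300) at depth
# 2.0 m as "chair". This yields one (3D point, label embedding) training
# pair with no human annotation involved.
point = backproject(400, 300, 2.0)
label_vec = fake_text_embedding("chair")
training_pair = (point, label_vec)
print(point.shape, label_vec.shape)  # (3,) (16,)
```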
We aim to build a system that can connect points of a 3D scene with their visual and semantic meaning.
The interface is a pair of scene-dependent implicit functions $f, h : \mathbb{R}^3 \to \mathbb{R}^n$ such that, for the coordinates of any point $P$ in the scene, $f(P)$ is a vector representing its semantic features and $h(P)$ is another vector representing its visual features.
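This interface can be sketched as a pair of small coordinate MLPs, each mapping $\mathbb{R}^3$ to $\mathbb{R}^n$ (the layer sizes and initialization below are illustrative assumptions; the networks are untrained and only demonstrate the shape of the interface):

```python
import numpy as np

rng = np.random.default_rng(2)

class ImplicitField:
    """A tiny coordinate MLP mapping R^3 -> R^n (untrained; illustration only)."""
    def __init__(self, out_dim, hidden=32):
        self.W1 = rng.standard_normal((3, hidden)) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, out_dim)) * 0.1
        self.b2 = np.zeros(out_dim)

    def __call__(self, p):
        # p: (..., 3) coordinates -> (..., out_dim) feature vectors.
        z = np.tanh(p @ self.W1 + self.b1)
        return z @ self.W2 + self.b2

n = 16
f = ImplicitField(out_dim=n)  # semantic features f(P)
h = ImplicitField(out_dim=n)  # visual features  h(P)

P = np.array([0.5, -1.0, 2.0])  # coordinates of a point in the scene
print(f(P).shape, h(P).shape)   # each an n-dimensional vector
```

Because the weights encode one particular scene, querying any 3D coordinate returns that scene's semantic and visual features at that location.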
It appears that the model must be retrained for each new scene to obtain that scene's coordinate-to-semantics mapping.