
# SGTR+: End-to-end Scene Graph Generation with Transformer
SGTR is a top-down method: it first uses a Transformer-based generator to produce a set of learnable triplet queries (subject–predicate–object), and then a cascaded triplet detector progressively refines these queries into the final scene graph. It also proposes an entity-aware relation representation built with a structured generator, which exploits the compositional property of relations (a minimal code sketch of the triplet-query idea follows the list below).
Top-down approach (SGTR):
- Starts with higher-level structures (triplet queries) and refines them
- Begins by generating complete subject-predicate-object triplet candidates
- Then progressively refines these triplets to match the image content
- Works with the complete structural units from the beginning
- Analogous to starting with a rough sketch of the entire tree and then refining each branch
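To make the "learnable triplet queries refined by a Transformer decoder" idea concrete, here is a minimal sketch in the spirit of SGTR, not the authors' implementation; the module name, the single-layer read-out heads, and sizes such as `num_queries=100` or `num_predicates=51` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripletQueryDecoder(nn.Module):
    """Toy sketch: learnable triplet queries attend to image features through a
    Transformer decoder and are read out as (subject box, object box, predicate)."""
    def __init__(self, d_model=256, num_queries=100, num_predicates=51):
        super().__init__()
        # one learnable embedding per candidate <subject, predicate, object> triplet
        self.triplet_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        # naive read-out heads; the real SGTR decomposes queries into entity-aware parts
        self.subj_box = nn.Linear(d_model, 4)
        self.obj_box = nn.Linear(d_model, 4)
        self.predicate = nn.Linear(d_model, num_predicates)

    def forward(self, image_features):
        # image_features: (batch, num_tokens, d_model) from any backbone
        b = image_features.size(0)
        queries = self.triplet_queries.weight.unsqueeze(0).expand(b, -1, -1)
        refined = self.decoder(queries, image_features)   # progressive refinement against the image
        return self.subj_box(refined), self.obj_box(refined), self.predicate(refined)

feats = torch.randn(2, 196, 256)                  # fake backbone features
subj, obj, pred = TripletQueryDecoder()(feats)
print(subj.shape, obj.shape, pred.shape)          # (2, 100, 4) (2, 100, 4) (2, 100, 51)
```

Each query starts as an image-independent guess at a whole triplet and is refined layer by layer, which is what "working with the complete structural units from the beginning" means here.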
DETR is an object detection model that uses the Transformer as its basic architecture.
Object queries (learnable embeddings):
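Object queries are the same learnable-query mechanism sketched above, in its original DETR form: a fixed set of N embeddings decoded in parallel into box and class predictions. The sketch below is a simplification with made-up sizes, not the DETR codebase; the quoted figure caption that follows describes how each of these N = 100 slots ends up specializing.

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100
object_queries = nn.Embedding(num_queries, d_model)   # N = 100 learnable prediction slots

layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)
box_head = nn.Linear(d_model, 4)                      # (cx, cy, w, h), normalized to [0, 1]
cls_head = nn.Linear(d_model, 92)                     # 91 COCO classes + "no object"

memory = torch.randn(1, 850, d_model)                 # flattened CNN feature map (fake)
slots = decoder(object_queries.weight.unsqueeze(0), memory)   # all slots attend to the image in parallel
boxes, logits = box_head(slots).sigmoid(), cls_head(slots)
# boxes[0, :, :2] are the normalized box centers that the caption below plots per slot
print(boxes.shape, logits.shape)                      # (1, 100, 4) (1, 100, 92)
```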
> Visualization of all box predictions on all images from COCO 2017 val set for 20 out of total N = 100 prediction slots in DETR decoder. Each box prediction is represented as a point with the coordinates of its center in the 1-by-1 square normalized by each image size. The points are color-coded so that green color corresponds to small boxes, red to large horizontal boxes and blue to large vertical boxes. We observe that each slot learns to specialize on certain areas and box sizes with several operating modes. We note that almost all slots have a mode of predicting large image-wide boxes that are common in COCO dataset.
# AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
https://www.youtube.com/watch?v=j3VNqtJUoz0&t=16s
Core idea:
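A minimal sketch of that core idea under the standard ViT recipe (16x16 patches, a [CLS] token, learned position embeddings, a plain Transformer encoder); the sizes below roughly follow ViT-Base but are otherwise placeholders.

```python
import torch
import torch.nn as nn

# "An image is worth 16x16 words": cut the image into fixed-size patches, project each
# patch to a token, prepend a [CLS] token, add position embeddings, run a Transformer encoder.
img = torch.randn(1, 3, 224, 224)
patch, d_model = 16, 768

to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)   # patchify + linear projection
tokens = to_tokens(img).flatten(2).transpose(1, 2)                   # (1, 196, 768): 14x14 "words"

cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, tokens.size(1) + 1, d_model))
x = torch.cat([cls_token, tokens], dim=1) + pos_embed                # (1, 197, 768)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=12)
out = encoder(x)
print(out[:, 0].shape)   # (1, 768) -- the [CLS] token, used as the image representation
```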
https://github.com/facebookresearch/dino/tree/main
# Emerging Properties in Self-Supervised Vision Transformers
https://juejin.cn/post/7224738994825789496
https://www.youtube.com/watch?v=h3ij3F3cPIk&t=1005s
DI + NO (Distillation + No labels)
Specifically, DINO uses a method called self-distillation with no labels, which learns the model's representations through self-supervised learning: the model uses its own outputs to generate "pseudo-labels" and then trains against those pseudo-labels, further improving its performance and generalization ability.
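A toy sketch of one training step of that idea, assuming the standard DINO recipe (a student and a teacher with identical architecture, the teacher's centered and sharpened output as the pseudo-label, an EMA teacher update); the tiny MLP backbone, temperatures, and momentum below are placeholders, not DINO's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net():   # placeholder backbone + projection head
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.GELU(), nn.Linear(256, 64))

student, teacher = make_net(), make_net()
teacher.load_state_dict(student.state_dict())     # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)                       # the teacher is never trained by gradients

center = torch.zeros(1, 64)                       # running center of teacher outputs (anti-collapse)
opt = torch.optim.SGD(student.parameters(), lr=0.01)
t_s, t_t, momentum = 0.1, 0.04, 0.996             # student temp, sharper teacher temp, EMA momentum

view1, view2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)   # two augmented views of a batch

with torch.no_grad():
    t_out = teacher(view1)                                     # the model's own output ...
    target = F.softmax((t_out - center) / t_t, dim=-1)         # ... becomes the soft pseudo-label
log_pred = F.log_softmax(student(view2) / t_s, dim=-1)
loss = -(target * log_pred).sum(dim=-1).mean()                 # cross-entropy against the pseudo-label

opt.zero_grad(); loss.backward(); opt.step()
with torch.no_grad():
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)         # teacher = EMA of the student
    center = 0.9 * center + 0.1 * t_out.mean(dim=0, keepdim=True)
```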
https://blog.csdn.net/xbinworld/article/details/83063726
The key idea is to use soft targets, taken from a large model's predicted outputs, to assist training alongside the hard targets. One might ask: the true labels (hard targets) are completely correct, so why do we still need soft targets?
A hard target carries very little information (low entropy), while a soft target carries much more, including information about the relationships between classes. For example, when classifying donkeys versus horses, even if a given image is a horse, the soft target is not 1 at the horse index and 0 everywhere else like the hard target; it also puts some probability on the donkey class. [5]
The benefit is that the soft target says this image looks more like a donkey than, say, a car or a dog; that kind of similarity information between labels lives in the predicted probabilities. But if the soft target looks like [0.98, 0.01, 0.01], it adds little, which is why a temperature parameter T is added to the softmax (this temperature is no longer needed at inference time once training is finished).
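A small numeric sketch of this point; the class set [horse, donkey, car], the logits, and T = 4 are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# Teacher logits for an image of a horse, over the classes [horse, donkey, car]
logits = torch.tensor([5.0, 3.0, -2.0])

print(F.softmax(logits, dim=-1))        # T = 1: ~[0.88, 0.12, 0.001] -- almost a hard target
T = 4.0
print(F.softmax(logits / T, dim=-1))    # T = 4: ~[0.56, 0.34, 0.10] -- "more like a donkey than a car" survives

# Distillation loss: the student matches the teacher's temperature-softened distribution
student_logits = torch.tensor([2.0, 2.5, 0.0], requires_grad=True)
soft_target = F.softmax(logits / T, dim=-1)
kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1), soft_target,
                   reduction="sum") * T * T    # the T^2 factor keeps gradient magnitudes comparable
kd_loss.backward()
```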
On the emergent properties observed in DINO
https://juejin.cn/post/7280436457142501388
Work prior to DINO
> We have also seen emerged two properties that can be leveraged in future applications: the quality of the features in k-NN classification has a potential for image retrieval. The presence of information about the scene layout in the features can also benefit weakly supervised image segmentation.
The Transformer is a sequence transduction model based entirely on attention mechanisms, requiring no recurrence or convolutions, and it is easier to train.
It reviews the basic logic of gated RNNs/LSTMs [[Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling]] and points out:
This inherent sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths because memory constraints limit batching across examples. Later work has optimized some aspects of this, but the fundamental limitation remains.
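A sketch of the contrast: an RNN has to walk through the sequence one step at a time, while self-attention relates every position to every other one in a single matrix product, so the whole sequence is processed in parallel. Shapes and sizes below are illustrative.

```python
import torch
import torch.nn as nn

seq_len, d_model = 128, 64
x = torch.randn(1, seq_len, d_model)

# RNN/LSTM: hidden states must be computed one time step after another
rnn = nn.LSTM(d_model, d_model, batch_first=True)
h_rnn, _ = rnn(x)                                   # internally a loop over 128 steps

# Scaled dot-product self-attention: all pairwise interactions in one matmul
q = k = v = x
scores = q @ k.transpose(1, 2) / d_model ** 0.5     # (1, 128, 128), no recurrence over time
h_attn = scores.softmax(dim=-1) @ v                 # (1, 128, 64)
print(h_rnn.shape, h_attn.shape)
```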
https://github.com/hkproj/pytorch-transformer/
https://www.youtube.com/watch?v=ISNdQcPhsts