
Scene Reconstruction with Functional Objects for Robot Autonomy
DETR is an object detection model that uses a transformer as its basic architecture.
Object queries (learned embeddings):
>Visualization of all box predictions on all images from COCO 2017 val set for 20 out of total N = 100 prediction slots in DETR decoder. Each box prediction is represented as a point with the coordinates of its center in the 1-by-1 square normalized by each image size. The points are color-coded so that green color corresponds to small boxes, red to large horizontal boxes and blue to large vertical boxes. We observe that each slot learns to specialize on certain areas and box sizes with several operating modes. We note that almost all slots have a mode of predicting large image-wide boxes that are common in COCO dataset.

The backbone used is BERT (trained with masked language modeling, MLM).
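The study argues that the image encoder should be larger than the text encoder. The text encoder therefore uses only six self-attention layers to extract features, and the remaining six layers, with cross attention, are used for the multi-modal encoder.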
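
A minimal sketch of the layer split described above, assuming a 12-layer BERT stack (six text layers plus the "remaining six"). It only records which layers go where and does not build a real model. The names `LayerSpec` and `split_bert_layers` are hypothetical and do not come from any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerSpec:
    index: int             # position in the original BERT layer stack
    role: str              # "text_encoder" or "multimodal_encoder"
    cross_attention: bool  # True if the layer also attends to image features


def split_bert_layers(num_layers: int = 12, num_text_layers: int = 6) -> List[LayerSpec]:
    """Assign the first layers to the text encoder and the rest to the multi-modal encoder.

    Assumption: 12 total layers, matching the six + six split in the notes.
    """
    if not 0 <= num_text_layers <= num_layers:
        raise ValueError("num_text_layers must be between 0 and num_layers")
    specs = []
    for i in range(num_layers):
        is_text = i < num_text_layers
        specs.append(
            LayerSpec(
                index=i,
                role="text_encoder" if is_text else "multimodal_encoder",
                # Only the multi-modal layers get cross attention over image features.
                cross_attention=not is_text,
            )
        )
    return specs


if __name__ == "__main__":
    for spec in split_bert_layers():
        print(spec)
```

Keeping the text side at six self-attention layers leaves more of the parameter budget for the image encoder, which is the design argument in the note above.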
See MoCo: [[Moco- Momentum Contrast for Unsupervised Visual Representation Learning]]
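
For quick reference, the core idea borrowed from MoCo is the momentum update of the key (momentum) encoder: its parameters follow the query encoder as an exponential moving average, `theta_k = m * theta_k + (1 - m) * theta_q`. The sketch below uses plain Python lists in place of real tensors. The function name `momentum_update` and the dictionary layout are hypothetical, and `m = 0.999` is only an illustrative default.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def momentum_update(key_params: Params, query_params: Params, m: float = 0.999) -> Params:
    """Return updated key-encoder parameters as an EMA of the query-encoder parameters.

    Implements theta_k <- m * theta_k + (1 - m) * theta_q, element by element.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must be in [0, 1]")
    if key_params.keys() != query_params.keys():
        raise ValueError("key and query encoders must have the same parameter names")
    return {
        name: [m * k + (1.0 - m) * q for k, q in zip(key_params[name], query_params[name])]
        for name in key_params
    }


if __name__ == "__main__":
    key = {"w": [1.0, 0.0]}
    query = {"w": [0.0, 1.0]}
    # With a large m, the key encoder moves only slightly toward the query encoder.
    print(momentum_update(key, query, m=0.9))
```

A value of `m` close to 1 keeps the key encoder changing slowly, so the features it produces stay consistent across training steps.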
See [[BLIP]], a follow-up work that carries this approach forward.