
The backbone is BERT (trained with MLM).
The paper argues that the image encoder should be larger than the text encoder, so the text encoder uses only six self-attention layers to extract text features, while the remaining six layers, equipped with cross-attention, serve as the multi-modal encoder.
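A minimal sketch of this split (single-head attention, no layer norm or FFN; all names and dimensions are illustrative, not the paper's actual code): the lower six layers run text-only self-attention, and the upper six layers additionally cross-attend to image features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention (single head, no masking, no projections)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def text_encoder(text, n_layers=6):
    # layers 1-6: self-attention over text tokens only
    h = text
    for _ in range(n_layers):
        h = h + attention(h, h, h)
    return h

def multimodal_encoder(text_h, image_feats, n_layers=6):
    # layers 7-12: self-attention plus cross-attention to image features
    h = text_h
    for _ in range(n_layers):
        h = h + attention(h, h, h)                       # self-attention
        h = h + attention(h, image_feats, image_feats)   # cross-attention
    return h

rng = np.random.default_rng(0)
text = rng.normal(size=(16, 64))    # 16 text tokens, dim 64
image = rng.normal(size=(49, 64))   # 49 image patches, dim 64
fused = multimodal_encoder(text_encoder(text), image)
print(fused.shape)  # (16, 64)
```

The point of the split is parameter reuse: the same 12-layer budget yields both a text encoder and a fusion module, keeping most capacity in the image encoder.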
See MoCo: [[Moco- Momentum Contrast for Unsupervised Visual Representation Learning]]
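MoCo's two key mechanics, sketched with illustrative names (not the paper's code): the key encoder is an EMA copy of the query encoder, and negatives are drawn from a FIFO queue of past keys.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """EMA update of the momentum (key) encoder, as in MoCo."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class KeyQueue:
    """Fixed-size FIFO queue of past key embeddings used as negatives."""
    def __init__(self, size, dim):
        self.buf = np.zeros((size, dim))
        self.ptr = 0
        self.size = size

    def enqueue(self, keys):
        n = keys.shape[0]
        idx = (self.ptr + np.arange(n)) % self.size
        self.buf[idx] = keys
        self.ptr = (self.ptr + n) % self.size

q = [np.ones(4)]
k = [np.zeros(4)]
k = momentum_update(q, k, m=0.9)
print(k[0])  # [0.1 0.1 0.1 0.1]
```

The large momentum (e.g. 0.999) keeps the key encoder slowly moving, so keys in the queue stay consistent with each other.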
See [[BLIP]], which builds on this work.
GLIP is a model that learns object-level, language-aware, and semantic-rich visual representations. It unifies object detection and phrase grounding for pre-training.
What is phrase grounding?
Phrase Grounding refers to the task of associating or “grounding” a natural language phrase (like a sentence or a word) to a specific region or object in an image. In other words, it’s about finding which part of the image corresponds to the object or concept described by a given text phrase.
For instance, given the phrase "the red ball on the table" and an image of a room with a red ball placed on a table, the goal of phrase grounding is to identify the exact region in the image that corresponds to the "red ball on the table", distinguishing it from other objects in the image.

## Grounded Language Image Pre-training

GLIP casts the classic object detection task as a grounding problem and proposes a **Unified Formulation**.
Traditional object detectors classify each region into one of c classes; this paper instead treats object detection as phrase grounding.
Detection is reformulated as a grounding task by grounding/aligning each region to one of the c class phrases in a text prompt, e.g. the classification prompt "person. bicycle. car. … . toothbrush".
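The reformulation can be sketched as follows (shapes and names are illustrative): instead of scoring each region against a fixed, learned c-way classifier, region features are scored against the phrase features extracted from the prompt, so the classification logits become a region-phrase alignment matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, n_phrases, dim = 5, 4, 32

O = rng.normal(size=(n_regions, dim))   # region (box) features
P = rng.normal(size=(n_phrases, dim))   # phrase features encoded from the prompt,
                                        # e.g. "person. bicycle. car. toothbrush"

# Classical detector: logits = O @ W.T with a learned (c, dim) classifier W.
# Grounding view: replace W with the prompt's phrase features P.
S_ground = O @ P.T                      # (n_regions, n_phrases) alignment scores
pred = S_ground.argmax(axis=1)          # best-matching phrase per region
print(S_ground.shape, pred.shape)       # (5, 4) (5,)
```

Because the "classifier" is now just encoded text, swapping the prompt swaps the label space — which is what enables open-vocabulary detection.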
Features from different frames $X_t$ are aggregated into $M$.
By combining [[DINO]] with grounded pre-training, Grounding DINO can detect arbitrary objects given human input such as category names or referring expressions.
Open-Vocabulary Detection
an open-set object detector that can detect any objects with respect to an arbitrary free-form text prompt. The model was trained on over 10 million images, including detection data, visual grounding data, and image-text pairs. It has a strong zero-shot detection performance. However, the model needs text as inputs and can only detect boxes with corresponding phrases.
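A sketch of how such text-prompted output could be post-processed (threshold value and function names are illustrative assumptions, not the model's API): each predicted box carries per-phrase similarity scores, and only boxes whose best score exceeds a threshold are kept, each paired with its best-matching phrase.

```python
import numpy as np

def select_grounded_boxes(boxes, phrase_scores, phrases, thresh=0.35):
    """Keep boxes whose max phrase score exceeds `thresh`, and
    attach the best-matching phrase to each kept box."""
    best = phrase_scores.max(axis=1)
    keep = best > thresh
    labels = [phrases[i] for i in phrase_scores.argmax(axis=1)[keep]]
    return boxes[keep], labels, best[keep]

boxes = np.array([[0.1, 0.1, 0.4, 0.4],
                  [0.5, 0.5, 0.9, 0.9],
                  [0.0, 0.0, 0.2, 0.2]])
scores = np.array([[0.8, 0.1],    # strongly matches phrase 0
                   [0.2, 0.6],    # matches phrase 1
                   [0.1, 0.2]])   # below threshold -> dropped
kept, labels, conf = select_grounded_boxes(boxes, scores, ["red ball", "table"])
print(labels)  # ['red ball', 'table']
```

This also makes the stated limitation concrete: a box is only ever emitted together with some phrase from the input text, so the model cannot label objects outside the prompt.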
What is feature fusion?
- In the multi-modal setting, feature fusion specifically refers to fusing features from different modalities (vision, text, audio, etc.). CLIP can be viewed as a form of middle fusion: alignment happens right after feature extraction.

#### large-scale grounded pre-train for concept generalization

Reformulating **object detection** as a **phrase grounding task** and introducing **contrastive training** between object regions and language phrases on large-scale data.
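The contrastive objective between object regions and language phrases can be sketched as a cross-entropy over a temperature-scaled alignment matrix — a simplification (names and temperature are illustrative), not GLIP's exact word-region alignment loss.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_phrase_contrastive_loss(regions, phrases, targets, tau=0.07):
    """Cross-entropy over region-to-phrase alignment scores.
    targets[i] = index of the ground-truth phrase for region i."""
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    p = phrases / np.linalg.norm(phrases, axis=1, keepdims=True)
    logits = (r @ p.T) / tau                  # (n_regions, n_phrases)
    probs = softmax(logits, axis=1)
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

rng = np.random.default_rng(0)
regions = rng.normal(size=(6, 32))   # region features
phrases = rng.normal(size=(3, 32))   # phrase features
targets = np.array([0, 1, 2, 0, 1, 2])
loss = region_phrase_contrastive_loss(regions, phrases, targets)
print(loss > 0)  # True
```

Training on web-scale grounding data with this kind of loss is what lets the learned representation generalize to concepts never seen as detection labels.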