
# SGTR+: End-to-end Scene Graph Generation with Transformer
SGTR is a top-down method: it first uses a Transformer-based generator to produce a set of learnable triplet queries (subject–predicate–object), and then a cascaded triplet detector progressively refines these queries into the final scene graph. It also proposes an entity-aware relation representation built with a structured generator, which exploits the compositional property of relations (a minimal code sketch of the triplet-query idea follows the list below).
Top-down approach (SGTR):
- Starts with higher-level structures (triplet queries) and refines them
- Begins by generating complete subject-predicate-object triplet candidates
- Then progressively refines these triplets to match the image content
- Works with the complete structural units from the beginning
- Analogous to starting with a rough sketch of the entire tree and then refining each branch
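To make the "learnable triplet queries refined by a Transformer decoder" idea concrete, here is a minimal sketch in the spirit of SGTR, not the authors' implementation; the module name, the single-layer read-out heads, and sizes such as `num_queries=100` or `num_predicates=51` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripletQueryDecoder(nn.Module):
    """Toy sketch: learnable triplet queries attend to image features through a
    Transformer decoder and are read out as (subject box, object box, predicate)."""
    def __init__(self, d_model=256, num_queries=100, num_predicates=51):
        super().__init__()
        # one learnable embedding per candidate <subject, predicate, object> triplet
        self.triplet_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        # naive read-out heads; the real SGTR decomposes queries into entity-aware parts
        self.subj_box = nn.Linear(d_model, 4)
        self.obj_box = nn.Linear(d_model, 4)
        self.predicate = nn.Linear(d_model, num_predicates)

    def forward(self, image_features):
        # image_features: (batch, num_tokens, d_model) from any backbone
        b = image_features.size(0)
        queries = self.triplet_queries.weight.unsqueeze(0).expand(b, -1, -1)
        refined = self.decoder(queries, image_features)   # progressive refinement against the image
        return self.subj_box(refined), self.obj_box(refined), self.predicate(refined)

feats = torch.randn(2, 196, 256)                  # fake backbone features
subj, obj, pred = TripletQueryDecoder()(feats)
print(subj.shape, obj.shape, pred.shape)          # (2, 100, 4) (2, 100, 4) (2, 100, 51)
```

Each query starts as an image-independent guess at a whole triplet and is refined layer by layer, which is what "working with the complete structural units from the beginning" means here.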
DETR is an object detection model that uses the Transformer as its basic architecture.
Object queries (learnable embeddings):
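Object queries are the same learnable-query mechanism sketched above, in its original DETR form: a fixed set of N embeddings decoded in parallel into box and class predictions. The sketch below is a simplification with made-up sizes, not the DETR codebase; the quoted figure caption that follows describes how each of these N = 100 slots ends up specializing.

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100
object_queries = nn.Embedding(num_queries, d_model)   # N = 100 learnable prediction slots

layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)
box_head = nn.Linear(d_model, 4)                      # (cx, cy, w, h), normalized to [0, 1]
cls_head = nn.Linear(d_model, 92)                     # 91 COCO classes + "no object"

memory = torch.randn(1, 850, d_model)                 # flattened CNN feature map (fake)
slots = decoder(object_queries.weight.unsqueeze(0), memory)   # all slots attend to the image in parallel
boxes, logits = box_head(slots).sigmoid(), cls_head(slots)
# boxes[0, :, :2] are the normalized box centers that the caption below plots per slot
print(boxes.shape, logits.shape)                      # (1, 100, 4) (1, 100, 92)
```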
> Visualization of all box predictions on all images from COCO 2017 val set for 20 out of total N = 100 prediction slots in DETR decoder. Each box prediction is represented as a point with the coordinates of its center in the 1-by-1 square normalized by each image size. The points are color-coded so that green color corresponds to small boxes, red to large horizontal boxes and blue to large vertical boxes. We observe that each slot learns to specialize on certain areas and box sizes with several operating modes. We note that almost all slots have a mode of predicting large image-wide boxes that are common in COCO dataset.
# AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
https://www.youtube.com/watch?v=j3VNqtJUoz0&t=16s
Core idea:
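A minimal sketch of that core idea under the standard ViT recipe (16x16 patches, a [CLS] token, learned position embeddings, a plain Transformer encoder); the sizes below roughly follow ViT-Base but are otherwise placeholders.

```python
import torch
import torch.nn as nn

# "An image is worth 16x16 words": cut the image into fixed-size patches, project each
# patch to a token, prepend a [CLS] token, add position embeddings, run a Transformer encoder.
img = torch.randn(1, 3, 224, 224)
patch, d_model = 16, 768

to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)   # patchify + linear projection
tokens = to_tokens(img).flatten(2).transpose(1, 2)                   # (1, 196, 768): 14x14 "words"

cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, tokens.size(1) + 1, d_model))
x = torch.cat([cls_token, tokens], dim=1) + pos_embed                # (1, 197, 768)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=12)
out = encoder(x)
print(out[:, 0].shape)   # (1, 768) -- the [CLS] token, used as the image representation
```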
https://github.com/facebookresearch/dino/tree/main
# Emerging Properties in Self-Supervised Vision Transformers
https://juejin.cn/post/7224738994825789496
https://www.youtube.com/watch?v=h3ij3F3cPIk&t=1005s
DI + NO (Distillation + No labels)
Specifically, DINO uses a method called self-distillation with no labels, which learns the model's representations through self-supervised learning: the model uses its own outputs to generate "pseudo-labels" and then trains against those pseudo-labels, further improving its performance and generalization ability.
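A toy sketch of one training step of that idea, assuming the standard DINO recipe (a student and a teacher with identical architecture, the teacher's centered and sharpened output as the pseudo-label, an EMA teacher update); the tiny MLP backbone, temperatures, and momentum below are placeholders, not DINO's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net():   # placeholder backbone + projection head
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.GELU(), nn.Linear(256, 64))

student, teacher = make_net(), make_net()
teacher.load_state_dict(student.state_dict())     # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)                       # the teacher is never trained by gradients

center = torch.zeros(1, 64)                       # running center of teacher outputs (anti-collapse)
opt = torch.optim.SGD(student.parameters(), lr=0.01)
t_s, t_t, momentum = 0.1, 0.04, 0.996             # student temp, sharper teacher temp, EMA momentum

view1, view2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)   # two augmented views of a batch

with torch.no_grad():
    t_out = teacher(view1)                                     # the model's own output ...
    target = F.softmax((t_out - center) / t_t, dim=-1)         # ... becomes the soft pseudo-label
log_pred = F.log_softmax(student(view2) / t_s, dim=-1)
loss = -(target * log_pred).sum(dim=-1).mean()                 # cross-entropy against the pseudo-label

opt.zero_grad(); loss.backward(); opt.step()
with torch.no_grad():
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)         # teacher = EMA of the student
    center = 0.9 * center + 0.1 * t_out.mean(dim=0, keepdim=True)
```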
https://blog.csdn.net/xbinworld/article/details/83063726
The key idea is to use soft targets, taken from a large model's predicted outputs, to assist training alongside the hard targets. One might ask: the true labels (hard targets) are completely correct, so why do we still need soft targets?
A hard target carries very little information (low entropy), while a soft target carries much more, including information about the relationships between classes. For example, when classifying donkeys versus horses, even if a given image is a horse, the soft target is not 1 at the horse index and 0 everywhere else like the hard target; it also puts some probability on the donkey class. [5]
The benefit is that the soft target says this image looks more like a donkey than, say, a car or a dog; that kind of similarity information between labels lives in the predicted probabilities. But if the soft target looks like [0.98, 0.01, 0.01], it adds little, which is why a temperature parameter T is added to the softmax (this temperature is no longer needed at inference time once training is finished).
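A small numeric sketch of this point; the class set [horse, donkey, car], the logits, and T = 4 are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# Teacher logits for an image of a horse, over the classes [horse, donkey, car]
logits = torch.tensor([5.0, 3.0, -2.0])

print(F.softmax(logits, dim=-1))        # T = 1: ~[0.88, 0.12, 0.001] -- almost a hard target
T = 4.0
print(F.softmax(logits / T, dim=-1))    # T = 4: ~[0.56, 0.34, 0.10] -- "more like a donkey than a car" survives

# Distillation loss: the student matches the teacher's temperature-softened distribution
student_logits = torch.tensor([2.0, 2.5, 0.0], requires_grad=True)
soft_target = F.softmax(logits / T, dim=-1)
kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1), soft_target,
                   reduction="sum") * T * T    # the T^2 factor keeps gradient magnitudes comparable
kd_loss.backward()
```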
On the emergent properties observed in DINO
https://juejin.cn/post/7280436457142501388
Work prior to DINO
> We have also seen emerged two properties that can be leveraged in future applications: the quality of the features in k-NN classification has a potential for image retrieval. The presence of information about the scene layout in the features can also benefit weakly supervised image segmentation.
The Transformer is a sequence transduction model based entirely on attention mechanisms, requiring no recurrence or convolutions, and it is easier to train.
It reviews the basic logic of gated RNNs/LSTMs [[Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling]] and points out:
This inherent sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths because memory constraints limit batching across examples. Later work has optimized some aspects of this, but the fundamental limitation remains.
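A sketch of the contrast: an RNN has to walk through the sequence one step at a time, while self-attention relates every position to every other one in a single matrix product, so the whole sequence is processed in parallel. Shapes and sizes below are illustrative.

```python
import torch
import torch.nn as nn

seq_len, d_model = 128, 64
x = torch.randn(1, seq_len, d_model)

# RNN/LSTM: hidden states must be computed one time step after another
rnn = nn.LSTM(d_model, d_model, batch_first=True)
h_rnn, _ = rnn(x)                                   # internally a loop over 128 steps

# Scaled dot-product self-attention: all pairwise interactions in one matmul
q = k = v = x
scores = q @ k.transpose(1, 2) / d_model ** 0.5     # (1, 128, 128), no recurrence over time
h_attn = scores.softmax(dim=-1) @ v                 # (1, 128, 64)
print(h_rnn.shape, h_attn.shape)
```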
https://github.com/hkproj/pytorch-transformer/
https://www.youtube.com/watch?v=ISNdQcPhsts