

In the figure, each step corresponds to one processed batch (batch size 16).
Part-level Dataset Available for Augmentation
Single Instance
Complicated Scene
from shelves_dataset 3000 images -> ? train & ? val samples
from desk_dataset 699 images -> ? train & ? val samples
from tool_dataset (simplified) 1335 images -> 2729 train & 911 val samples
from 130k Images/furniture 1983 images -> ? train & ? val samples
from coco2017 2212 images -> ? train & ? val samples
current total:
Feature Pyramid Networks for Object Detection

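Recognizing objects at different scales is a fundamental challenge in object detection, and feature pyramids have long been a basic component of multi-scale detection. Because feature pyramids are computationally expensive and slow down detection, however, most methods avoid them for the sake of speed and predict from high-level features only. High-level features carry rich semantic information, but their low resolution makes it hard to preserve object locations accurately. Low-level features, by contrast, carry less semantic information, but their high resolution lets them capture object locations accurately. Fusing low-level and high-level features should therefore yield a detector that is accurate at both recognition and localization. The paper aims to design such a structure so that detection is both accurate and fast.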
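The fusion idea can be illustrated with a minimal sketch of a top-down merge: the coarsest map is upsampled and added element-wise to the next finer map, level by level. This is a simplification made for illustration only. It assumes single-channel maps stored as nested lists, uses nearest-neighbour upsampling, and leaves out the 1x1 lateral and 3x3 smoothing convolutions of the actual FPN. The names `upsample2x`, `add_maps`, and `top_down_merge` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D map given as a list of rows."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out


def add_maps(a, b):
    """Element-wise sum of two 2D maps of identical size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_merge(levels):
    """Fuse feature maps ordered fine -> coarse.

    Each level is assumed to have half the resolution of the previous one.
    Returns the merged maps in the same fine -> coarse order.
    """
    merged = [levels[-1]]                          # start from the coarsest map
    for lateral in reversed(levels[:-1]):
        merged.append(add_maps(lateral, upsample2x(merged[-1])))
    return merged[::-1]


# Toy inputs (made-up values): a 4x4 low-level map, a 2x2 map, a 1x1 high-level map.
c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[10.0, 20.0], [30.0, 40.0]]
c5 = [[100.0]]

p3, p4, p5 = top_down_merge([c3, c4, c5])
print(p4)  # the 2x2 level now also carries the coarse signal from c5
```

After the merge, every level combines its own high-resolution detail with the signal propagated down from the coarser levels, which is the property the paragraph above argues for.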
Deformable Convolutional Networks

Used in [[CenterNet]]
Associative Embedding: End-to-End Learning for Joint Detection and Grouping
What is standard dense supervised learning? Mentioned in [[CenterNet]].
Standard dense supervised learning typically refers to a supervised learning setup where:
In contrast to sparse supervision, where only a subset of the input (e.g., bounding boxes, keypoints) is labeled, dense supervision provides full annotations for every relevant part of the input.
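The difference can be sketched in terms of which positions contribute to the loss. The toy example below is only an illustration under stated assumptions: a 2x2 "image", two classes, per-pixel cross-entropy, and a hypothetical `IGNORE` label marking unlabeled pixels. It is not taken from CenterNet.

```python
import math

IGNORE = -1  # hypothetical marker for "this pixel has no label"


def mean_cross_entropy(pred, labels):
    """Average cross-entropy over all pixels whose label is not IGNORE.

    pred:   H x W grid of per-class probability lists
    labels: H x W grid of class indices (or IGNORE)
    """
    losses = [
        -math.log(probs[y])
        for prow, yrow in zip(pred, labels)
        for probs, y in zip(prow, yrow)
        if y != IGNORE
    ]
    return sum(losses) / len(losses)


# Predicted class probabilities for each of the 4 pixels (made-up values).
pred = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.5, 0.5], [0.5, 0.5]],
]

dense_labels = [[0, 1], [0, 1]]             # every pixel is annotated
sparse_labels = [[0, IGNORE], [IGNORE, IGNORE]]  # only one pixel is annotated

dense = mean_cross_entropy(pred, dense_labels)    # averages over 4 pixels
sparse = mean_cross_entropy(pred, sparse_labels)  # averages over 1 pixel
print(dense, sparse)
```

With dense labels every output position receives a training signal. With sparse labels only the annotated positions do.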
Example
In semantic segmentation: