Posted 2025-07-18 | Updated 2025-10-16 | Note | 2 minutes read (About 281 words)

In the figures, each step corresponds to one processed batch (16 samples).

15k dataset

Resdcn + FPNx2 (↓:32,16,8,4)

18 layer

with warm-up and cosine schedule

python train.py --log_name resdcn18_all \
                --data_dir ~/cyl/Data/PSR_final \
                --arch resdcn_18 \
                --lr 1e-5 \
                --batch_size 36 \
                --num_epochs 90 --num_workers 0 --log_interval 5
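The command above does not show how the warm-up and cosine schedule is configured, so here is only a minimal sketch of what such a schedule usually computes: a linear ramp up to the base learning rate, then a cosine decay. The function name `lr_at_step` and the parameters `warmup_steps` and `min_lr` are assumptions for illustration and are not taken from train.py.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_steps=500, min_lr=0.0):
    """Learning rate with linear warm-up followed by cosine decay.

    warmup_steps and min_lr are hypothetical knobs, not flags of train.py.
    """
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up schedule that has elapsed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000
    for s in (0, 249, 499, 500, 750, 1000):
        print(s, lr_at_step(s, total))
```

With `--lr 1e-5` as the base rate, the value peaks at the end of warm-up and falls to `min_lr` at the final step.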

50 layer

with warm-up and cosine schedule

python train.py --log_name resdcn50_all \
                --data_dir ~/cyl/Data/PSR_final \
                --arch resdcn_50 \
                --lr 1e-5 \
                --batch_size 24 \
                --num_epochs 90 --num_workers 0 --log_interval 5

101 layer

Use PSR dataset before removing bg samples (12415 training samples)

python train.py --log_name resdcn_all \
                --data_dir ~/cyl/Data/PSR_final \
                --arch resdcn_101 \
                --lr 5e-5 \
                --lr_step 40,70 \
                --batch_size 16 \
                --num_epochs 90 --num_workers 0 --log_interval 5

Use PSR dataset after removing bg samples (11729 samples)

python train.py --log_name resdcn_all \
                --data_dir ~/cyl/Data/PSR_final \
                --arch resdcn_101 \
                --lr 5e-5 \
                --lr_step 40,70 \
                --batch_size 16 \
                --num_epochs 90 --num_workers 0 --log_interval 5

HRNet (↓:4,4,4,4)

Use PSR dataset after removing bg samples (11729 samples)

python train.py --log_name hrnet_all \
                --data_dir ~/cyl/Data/PSR_final \
                --arch hrnet \
                --lr 5e-5 \
                --lr_step 40,70 \
                --batch_size 16 \
                --num_epochs 90 --num_workers 0 --log_interval 5

Hourglass (↓:4,4,4,4)

Use PSR dataset after removing bg samples (11729 samples)

python train.py --log_name hg_all \
                --data_dir ~/cyl/Data/PSR_final \
                --arch hourglass_small \
                --lr 5e-5 \
                --lr_step 40,70 \
                --batch_size 16 \
                --num_epochs 90 --num_workers 0 --log_interval 5

25k dataset

Resdcn+FPNx2 (↓:32,16,8,4)

50 layer

Resnet (pretrained) (↓:32,16,8,4)

Posted 2025-05-13 | Updated 2025-10-16 | Note | 3 minutes read (About 398 words)

Part-level Dataset Available for Augmentation

Sources

Single Instance

Image Classification - 32 Classes - Fourniture: About 10K

Complicated Scene

Real
- Indoor Training Set (ITS) [RESIDE-Standard]: 1.4K
- MIT Indoor Scenes: 15K
Synthetic
- InteriorVerse

chair_dataset

image_1~image_3000: kaggle furniture image dataset
image_3001~image_4887: DeepFurniture

desk_dataset

image_1~image_700: pix3d

home_appliance_dataset

image_1~image_3000: kaggle furniture image dataset fridge only
image_3001~image_3429: DeepFurniture home-appliance category

shelves_dataset

image_1~image_3000: kaggle furniture image dataset
image_3001~image_3243: pix3d wardrobe category
image_3244~image_3604: pix3d bookcase category

sofa_dataset

image_1~image_1947: pix3d
image_1498~image_3888: DeepFurniture

table_dataset

image_1~image_3000: kaggle furniture image dataset
image_3001~image_4870: pix3d
image_4871~image_7293: DeepFurniture
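Since each merged dataset is just a concatenation of index ranges, a small lookup can recover which source an image came from. The sketch below uses the table_dataset ranges listed above; the names `TABLE_DATASET_SOURCES`, `source_of`, and `count_per_source` are made up for illustration.

```python
# Inclusive index ranges copied from the table_dataset listing.
TABLE_DATASET_SOURCES = [
    (1, 3000, "kaggle furniture image dataset"),
    (3001, 4870, "pix3d"),
    (4871, 7293, "DeepFurniture"),
]


def source_of(index, ranges=TABLE_DATASET_SOURCES):
    """Return the source name for image_<index>, or None if out of range."""
    for first, last, name in ranges:
        if first <= index <= last:
            return name
    return None


def count_per_source(ranges=TABLE_DATASET_SOURCES):
    """Number of images contributed by each source."""
    return {name: last - first + 1 for first, last, name in ranges}


if __name__ == "__main__":
    print(source_of(3500))
    print(count_per_source())
```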

tool_dataset

image_1~image_115: pix3d
image_116~image_1441: kaggle mechanical tool dataset hammer
image_1442~image_1812: kaggle mechanical tool dataset plier
image_1813~image_3138: kaggle mechanical tool dataset screw driver
image_3139~image_4469: kaggle mechanical tool dataset wrench

tv_dataset

image_1~image_3000: kaggle furniture image dataset

PSR Dataset

Cabinet

from shelves_dataset 3000 images -> ? train & ? val samples

Desk (Processing)

from desk_dataset 699 images -> ? train & ? val samples

Tool (Processing)

from tool_dataset (simplified) 1335 images -> 2729 train & 911 val samples

Furniture (Processing)

from 130k Images/furniture 1983 images -> ? train & ? val samples

Coco (Processing)

from coco2017 2212 images -> ? train & ? val samples

current total:

train: 12,415 - 686 (bg) = 11,729
val: 3,028 - 182 (bg) = 2,846

Posted 2025-05-08 | Updated 2025-10-16 | Review | 10 minutes read (About 1552 words)

Feature Pyramid Networks for Object Detection

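Purpose

Recognizing objects at different scales is a fundamental challenge in object detection, and feature pyramids have long been a basic component of multi-scale detectors. Because feature pyramids are computationally expensive and slow down detection, most methods avoid them for the sake of speed and predict from high-level features only. High-level features carry rich semantic information, but their low resolution makes it hard to preserve object locations accurately. Low-level features are the opposite: they carry less semantic information, but their high resolution lets them capture object locations accurately. Fusing low-level and high-level features should therefore yield a detector that is accurate at both recognition and localization. The paper aims to design such a structure so that detection is both accurate and fast.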

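FPN Structure

To give features at every scale rich semantic information without excessive computational cost, the authors use a top-down pathway and lateral connections to fuse low-level features (high resolution, weak semantics) with high-level features (low resolution, strong semantics), so that the resulting feature maps at all scales are semantically rich.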

bottom-up

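The bottom-up pathway is the feature extraction pass of the backbone ConvNet on the input image. Some backbone layers keep the feature map size unchanged, while others reduce it by a factor of 2. Layers whose outputs have the same size are grouped into one stage, and the output of the last layer of each stage is extracted. Taking ResNet as an example, the outputs of conv2, conv3, conv4, and conv5 are denoted {$C_2, C_3, C_4, C_5$}. Each is the output of the last residual block of its stage, at {$\frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{32}$} of the input resolution, so consecutive feature maps differ in size by a factor of 2.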
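As a quick check of the stride arithmetic, the sketch below computes the spatial size of each stage output for a given input size. The strides 4, 8, 16, 32 come from the text; rounding up with `ceil` is an assumption about how padded stride-2 layers behave, and the names `STRIDES` and `feature_map_sizes` are made up for illustration.

```python
import math

# Total downsampling factor of each stage output relative to the input image.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def feature_map_sizes(height, width, strides=STRIDES):
    """Return {stage: (h, w)}; ceil rounding is an assumption, see lead-in."""
    return {
        name: (math.ceil(height / s), math.ceil(width / s))
        for name, s in strides.items()
    }


if __name__ == "__main__":
    for name, size in feature_map_sizes(512, 512).items():
        print(name, size)
```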

top-down

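The top-down pathway upsamples the high-level feature maps and passes them downward. High-level features carry rich semantic information, and top-down propagation brings that information to the low-level features so that they become semantically rich as well. The paper uses nearest-neighbor upsampling, which enlarges the feature map by a factor of 2. Upsampling enlarges an image by inserting new pixels between the existing ones with an interpolation algorithm. Nearest-neighbor interpolation is the simplest such method and needs no arithmetic: among the four pixels neighboring the target pixel, the value of the closest one is assigned to the target.
Nearest-neighbor interpolation is cheap, but it can produce discontinuities in the interpolated image, and visible jagged edges can appear where the intensity changes.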
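For an exact factor of 2, nearest-neighbor upsampling reduces to repeating every value twice along each axis. A minimal sketch on a plain list-of-lists feature map (the function name is made up for illustration):

```python
def upsample_nearest_2x(fmap):
    """Upsample a 2D list-of-lists by 2 with nearest-neighbor interpolation."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


if __name__ == "__main__":
    for row in upsample_nearest_2x([[1, 2], [3, 4]]):
        print(row)
```

Every output value is a copy of an input value, which is why the result looks blocky where the input changes.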

Lateral connection

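The feature map $C_n$ output by each stage first goes through a 1x1 convolution to reduce its channel dimension.
The result is then fused with the upsampled feature map from the level above, $P_{n+1}$, by element-wise addition. Because the feature maps of adjacent stages differ by a factor of 2, the upsampled map from the level above has the same size as the current level, so corresponding elements can be added directly.
After the addition, a 3x3 convolution produces the output feature $P_n$ of the current level. This 3x3 convolution is meant to reduce the aliasing effect of upsampling, which presumably refers to the discontinuities and jagged edges of nearest-neighbor interpolation noted above. Because all pyramid levels share the same classifiers/regressors, the output dimension is fixed at 256 everywhere, i.e. each of these 3x3 convolutions has 256 channels.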
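The merge order can be sketched without any deep learning library. In the sketch below the 1x1 lateral convolution and the 3x3 smoothing convolution are passed in as plain callables that default to the identity, which is a simplification for illustration only. Passing down the merged map before smoothing is also an assumption about the implementation; with identity smoothing the two choices coincide. All names are made up.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor upsampling of a 2D list-of-lists by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise addition of two maps of equal size."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "maps must have equal size"
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def identity(fmap):
    return fmap


def top_down(c_maps, lateral=identity, smooth=identity):
    """c_maps is ordered fine to coarse, e.g. [C2, C3, C4, C5]; returns [P2..P5]."""
    merged = lateral(c_maps[-1])      # the coarsest level has no top-down input
    outputs = [smooth(merged)]
    for c in reversed(c_maps[:-1]):
        # lateral branch + upsampled map from the level above
        merged = add_maps(lateral(c), upsample_nearest_2x(merged))
        outputs.append(smooth(merged))
    return outputs[::-1]


if __name__ == "__main__":
    fine = [[1, 1], [1, 1]]
    coarse = [[10]]
    print(top_down([fine, coarse]))
```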

FPN&RPN

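In Faster R-CNN, the RPN takes a single-scale feature map as input, applies a 3x3 convolution, and generates 9 anchors at each feature map location (3 sizes, each with 3 aspect ratios). Two parallel 1x1 convolution branches then classify the anchors and regress their boxes. This is the RPN with single-scale input.

When FPN is combined with RPN, the RPN input becomes a set of multi-scale feature maps, so an RPN head (one 3x3 convolution and two 1x1 convolutions) is attached to every pyramid level, where $P_6$ is obtained by downsampling $P_5$.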

Formally, we define the anchors to have areas of {$32^2, 64^2, 128^2, 256^2, 512^2$} pixels on {$P_2, P_3, P_4, P_5, P_6$}
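To make the anchor definition concrete, the sketch below enumerates anchor widths and heights per level. The areas come from the sentence above; the aspect ratios 1:2, 1:1, 2:1 follow the FPN paper and are not spelled out in these notes, and the names are made up for illustration.

```python
import math

# Side length of the square whose area defines each level's anchors.
ANCHOR_SIDES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
# Aspect ratios as height / width (assumed from the FPN paper: 1:2, 1:1, 2:1).
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_shapes(sides=ANCHOR_SIDES, ratios=ASPECT_RATIOS):
    """Return {level: [(width, height), ...]} with the area fixed per level."""
    shapes = {}
    for level, side in sides.items():
        area = side * side
        level_shapes = []
        for r in ratios:
            w = math.sqrt(area / r)   # from w * h = area and h = r * w
            h = w * r
            level_shapes.append((w, h))
        shapes[level] = level_shapes
    return shapes


if __name__ == "__main__":
    for level, level_shapes in anchor_shapes().items():
        print(level, [(round(w, 1), round(h, 1)) for w, h in level_shapes])
```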

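When generating anchors, the input is already multi-scale, so there is no need to use 3 anchor sizes at every level. Each level gets a single anchor size (the green numbers in the original figure), but each size still has 3 aspect ratios, giving 15 kinds of anchors in total. The ground-truth labels of anchors are defined as in Faster R-CNN: an anchor is positive if it has the highest IoU with some ground-truth box or an IoU above 0.7 with any ground-truth box, and negative if its IoU is below 0.3 for all ground-truth boxes. Note also that the RPN heads share parameters across all levels.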
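The labeling rule can be written down directly. This is a minimal sketch, assuming boxes are `(x1, y1, x2, y2)` tuples and using 1 / 0 / -1 for positive / negative / ignored; the function names and the label encoding are choices made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    if not anchors:
        return []
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row) if row else 0.0
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    # The anchor with the highest IoU for each ground-truth box is also positive.
    for j in range(len(gt_boxes)):
        best_i = max(range(len(anchors)), key=lambda i: ious[i][j])
        if ious[best_i][j] > 0:
            labels[best_i] = 1
    return labels


if __name__ == "__main__":
    gt = [(0, 0, 10, 10)]
    anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)]
    print(label_anchors(anchors, gt))
```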

Posted 2025-05-06 | Updated 2025-10-16 | Review | a minute read (About 135 words)

Deformable Convolutional Networks

Used in [[CenterNet]]

pre: https://www.youtube.com/watch?v=HRLMSrxw2To&t=308s

Problem addressed

Modeling spatial transformations is a long standing problem in computer vision

Deformation (human pose)
Scale
Viewpoint variation
Intra-class variation (the same kind of object in different designs)

Traditional approaches:

build datasets with sufficient desired variations
use transformation-invariant features and algorithms

Architecture

Advantages

Same inputs and outputs as a conventional CNN

regular convolution -> deformable convolution
regular RoI pooling -> deformable RoI pooling

Can be trained end to end without extra supervision signals

Can simply be treated as a plug-and-play module for object detection
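To see what changes when a regular convolution becomes deformable, the sketch below evaluates a single output location of a 3x3 deformable convolution: each of the 9 sampling points is shifted by its own fractional offset and read with bilinear interpolation. In the real module the offsets are predicted by an extra convolution; here they are given by hand, the map has one channel, and all names are made up for illustration.

```python
import math


def bilinear_sample(fmap, y, x):
    """Sample a 2D list-of-lists at fractional (y, x); outside the map counts as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy][xx]
    return value


# Regular 3x3 sampling grid as (dy, dx) relative to the output location.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def deformable_conv_at(fmap, weights, offsets, y, x):
    """Output at (y, x): sum over k of w_k * fmap(p0 + p_k + offset_k)."""
    total = 0.0
    for w_k, (dy, dx), (oy, ox) in zip(weights, GRID, offsets):
        total += w_k * bilinear_sample(fmap, y + dy + oy, x + dx + ox)
    return total


if __name__ == "__main__":
    fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    ones = [1.0] * 9
    zero_offsets = [(0.0, 0.0)] * 9
    # With all offsets at zero this is an ordinary 3x3 convolution.
    print(deformable_conv_at(fmap, ones, zero_offsets, 1, 1))
```

With zero offsets the result matches a regular convolution, which is consistent with the module having the same inputs and outputs as the layer it replaces.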

Posted 2025-04-24 | Updated 2025-10-16 | Review | 2 minutes read (About 228 words)

Associative Embedding: End-to-End Learning for Joint Detection and Grouping

Q&A

1

What is standard dense supervised learning? Mentioned in [[CenterNet]].

Standard dense supervised learning typically refers to a supervised learning setup that combines two ideas:

Standard supervised learning means:
- You have input data X and corresponding ground truth labels Y.
- The goal is to train a model $f_\theta(X)$ that maps inputs to outputs by minimizing a loss function (e.g., cross-entropy, MSE) between the predicted labels and ground truth.
- The training dataset is fully labeled (i.e., each input has a corresponding label).
Dense refers to:
- A per-pixel or per-element prediction task, where every element in the input gets a corresponding label.
- Common in vision tasks like:
  - Semantic segmentation (each pixel is labeled with a class).
  - Depth estimation (each pixel has a depth value).
  - Optical flow (each pixel has a motion vector).
  - Surface normal estimation (each pixel has a 3D orientation vector).

In contrast to sparse supervision, where only a subset of the input (e.g., bounding boxes, keypoints) is labeled, dense supervision provides full annotations for every relevant part of the input.

Example
In semantic segmentation:

Input: an RGB image (e.g., 512×512 pixels).
Output: a label map of the same size (512×512), where each pixel has a class label like “road”, “car”, “sky”, etc.
Model: often a Fully Convolutional Network (FCN) or encoder-decoder like U-Net or DeepLab.
Loss: usually pixel-wise cross-entropy.
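The pixel-wise cross-entropy in the last item is the ordinary cross-entropy averaged over every pixel. A minimal sketch on nested lists, assuming logits of shape H x W x C and integer class labels of shape H x W (the function name is made up for illustration):

```python
import math


def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.

    logits: H x W x C nested lists of raw scores.
    labels: H x W nested lists of class indices.
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for pixel_logits, target in zip(logit_row, label_row):
            m = max(pixel_logits)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(v - m) for v in pixel_logits))
            total += log_sum_exp - pixel_logits[target]  # -log softmax[target]
            count += 1
    return total / count


if __name__ == "__main__":
    logits = [[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]]  # one row, two pixels, three classes
    labels = [[0, 1]]
    print(pixelwise_cross_entropy(logits, labels))
```

Every pixel contributes a loss term, which is what makes the supervision dense.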

Posted 2025-04-24 | Updated 2025-10-16 | Review | a few seconds read (About 12 words)

CenterNet

Posted 2025-04-23 | Updated 2025-10-16 | Note | 6 minutes read (About 870 words)

FCSGG Repo Explanation

FCSGG Repository Summary

FCSGG (Fully Convolutional Scene Graph Generation) is a PyTorch implementation of the paper “Fully Convolutional Scene Graph Generation” published in CVPR 2021. The project focuses on scene graph generation, which is the task of detecting objects in an image and identifying the relationships between them.

Core Components:

Architecture:
- Built on Detectron2, a popular object detection framework by Facebook
- Uses a one-stage detector approach (CenterNet) as the meta-architecture
- Supports various backbones including ResNet, HRNet (High-Resolution Network), Hourglass networks, and DLA
Key Features:
- Fully convolutional approach to scene graph generation
- Multiple backbone options with different feature pyramid networks (FPN, BiFPN, HRFPN)
- Various head designs including multiscale heads and attention mechanisms
- Support for different input resolutions and training strategies
Dataset:
- Primarily designed for the Visual Genome dataset, a large-scale dataset for scene understanding
- Includes custom data loaders and preprocessing for scene graph generation
Model Components:
- Backbones: Various CNN architectures (ResNet, HRNet, Hourglass, DLA)
- Necks: Feature pyramid networks and variants (FPN, BiFPN, HRFPN, Trident)
- Heads: Detection and relationship prediction heads
- Loss Functions: Custom losses for object detection and relationship prediction
Utilities:
- Visualization tools for scene graphs
- Evaluation metrics for scene graph generation
- Training and inference scripts

Project Structure:

fcsgg/: Main module containing model implementation
- modeling/: Neural network architecture components
  - backbone/: Feature extraction networks
  - necks/: Feature pyramid networks
  - heads/: Detection and relationship prediction heads
  - meta_arch/: High-level model architecture (CenterNet)
- data/: Dataset handling and preprocessing
- evaluation/: Metrics and evaluation code
- utils/: Helper functions and utilities
- layers/: Custom neural network layers
- structures/: Data structures for scene graphs
configs/: Configuration files for different model variants
tools/: Training, evaluation, and visualization scripts
GraphViz/: Visualization tools for scene graphs

Key Innovations:

The project implements a fully convolutional approach to scene graph generation, which differs from traditional two-stage methods. Instead of first detecting objects and then predicting relationships, it uses a one-stage detector to simultaneously predict objects and their relationships in a fully convolutional manner.

Benchmarks:

The repository provides several pre-trained models with different backbones:

HRNetW32-1S
ResNet50-4S-FPN×2
HRNetW48-5S-FPN×2

These models achieve competitive performance on the Visual Genome dataset for scene graph generation tasks.

Usage:

The project provides tools for training, evaluation, and visualization of scene graphs. It requires the Visual Genome dataset and can be run using Docker or directly with PyTorch.

In summary, FCSGG is a comprehensive implementation of a state-of-the-art approach to scene graph generation using fully convolutional networks, offering various model architectures and training configurations.

How Detectron2 is Used in FCSGG

FCSGG is built on top of Detectron2, Facebook’s object detection framework, and leverages many of its components while extending it for scene graph generation. Here’s a detailed breakdown:

1. Core Architecture Integration

Meta Architecture: FCSGG registers a custom meta architecture called “CenterNet” with Detectron2’s META_ARCH_REGISTRY. This extends Detectron2’s modular architecture system while maintaining compatibility.
Backbone Networks: FCSGG uses Detectron2’s backbone networks (ResNet, etc.) directly and also implements custom backbones like HRNet while following Detectron2’s backbone interface.
Feature Pyramid Networks (FPN): The repository uses Detectron2’s FPN implementation and extends it with custom variants like BiFPN and HRFPN.
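The registration mechanism is a name-to-class lookup driven by the config. The sketch below is a pure-Python stand-in for that pattern, written so it runs without Detectron2 installed; `Registry`, `META_ARCH`, and `build_model` here are hypothetical and are not Detectron2's or FCSGG's actual code.

```python
class Registry:
    """Tiny name -> class registry, a stand-in for the pattern only."""

    def __init__(self, name):
        self.name = name
        self._items = {}

    def register(self):
        def decorator(cls):
            if cls.__name__ in self._items:
                raise KeyError(f"{cls.__name__} already registered in {self.name}")
            self._items[cls.__name__] = cls
            return cls
        return decorator

    def get(self, key):
        return self._items[key]


META_ARCH = Registry("META_ARCH")  # hypothetical stand-in object


@META_ARCH.register()
class CenterNet:
    """Placeholder for the custom meta architecture."""

    def __init__(self, cfg):
        self.cfg = cfg


def build_model(cfg):
    """Look up the class named in the config and instantiate it."""
    return META_ARCH.get(cfg["MODEL"]["META_ARCHITECTURE"])(cfg)


if __name__ == "__main__":
    model = build_model({"MODEL": {"META_ARCHITECTURE": "CenterNet"}})
    print(type(model).__name__)
```

The point of the pattern is that a config string such as `META_ARCHITECTURE: "CenterNet"` is enough to select a custom model class without touching the framework's build code.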

2. Configuration System

YAML Configuration: FCSGG adopts Detectron2’s YAML-based configuration system, extending it with custom configurations for scene graph generation through add_fcsgg_config().
Command Line Arguments: The training script uses Detectron2’s default_argument_parser() to maintain the same command-line interface.

3. Data Handling

Dataset Registration: Visual Genome dataset is registered with Detectron2’s DatasetCatalog and MetadataCatalog, making it available through Detectron2’s data loading pipeline.
Custom Dataset Mapper: FCSGG implements a custom DatasetMapper class that extends Detectron2’s mapper to handle scene graph annotations.
Data Loaders: The repository uses Detectron2’s build_detection_train_loader and build_detection_test_loader with custom mappers.

4. Training and Evaluation

Trainer Class: FCSGG extends Detectron2’s DefaultTrainer class to customize the training loop, evaluation metrics, and data loading.
Checkpointing: The repository uses Detectron2’s DetectionCheckpointer for model saving and loading.
Distributed Training: FCSGG leverages Detectron2’s distributed training utilities through detectron2.utils.comm and the launch function.
Custom Evaluators: The repository implements a custom VGEvaluator for scene graph evaluation while following Detectron2’s evaluator interface.

5. Visualization and Logging

Event Storage: FCSGG uses Detectron2’s event storage system for logging metrics during training.
Visualization Tools: The repository leverages Detectron2’s visualization utilities for debugging and result analysis.

6. Extensions for Scene Graph Generation

Custom Heads: While using Detectron2’s architecture, FCSGG implements custom prediction heads for relationship detection.
Scene Graph Structures: The repository defines custom data structures for scene graphs that integrate with Detectron2’s Instances class.
Loss Functions: FCSGG implements specialized loss functions for scene graph generation while maintaining compatibility with Detectron2’s loss computation framework.

7. Installation and Dependencies

Submodule Integration: Detectron2 is included as a Git submodule, ensuring version compatibility.
Build Process: The installation process includes building Detectron2 from source to ensure proper integration.

In summary, FCSGG uses Detectron2 as its foundation, leveraging its modular architecture, data handling, training infrastructure, and configuration system while extending it with custom components for scene graph generation. This approach allows FCSGG to benefit from Detectron2’s robust implementation and optimizations while adding specialized functionality for relationship detection between objects.

Posted 2025-04-22 | Updated 2025-10-16 | Note | a few seconds read (About 0 words)

Detectron

Posted 2025-04-22 | Updated 2025-10-16 | Note | 4 minutes read (About 570 words)

FCSGG Repository Application

Official repo:
https://github.com/liuhengyue/fcsgg
Our repo:
https://github.com/PSGBOT/KAF-Generation

My venv: fcsgg

Installation

Environment Preparation

git clone git@github.com:liuhengyue/fcsgg.git
cd fcsgg
git submodule init
git submodule update
conda create -n fcsgg python=3.10
conda activate fcsgg
conda install nvidia/label/cuda-11.8.0::cuda-toolkit -c nvidia/label/cuda-11.8.0
conda install cudatoolkit
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# for building detectron
conda install -c conda-forge gcc=11.2.0
conda install -c conda-forge gxx=11.2.0

conda env config vars set LD_LIBRARY_PATH="/home/cyl/miniconda3/envs/fcsgg/lib/"
conda env config vars set CPATH="/home/cyl/miniconda3/envs/fcsgg/include/"
conda env config vars set CUDA_HOME="/home/cyl/miniconda3/envs/fcsgg/"

conda deactivate
conda activate fcsgg

export CC=$CONDA_PREFIX/bin/gcc
export CXX=$CONDA_PREFIX/bin/g++

pip install -r requirements.txt
python -m pip install -e detectron2

Downloads

Datasets:

cd ~/Reconst
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P ./Data/vg/
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P ./Data/vg/
unzip -j ./Data/vg/images.zip -d ./Data/vg/VG_100K
unzip -j ./Data/vg/images2.zip -d ./Data/vg/VG_100K

Download the scene graphs and extract them to datasets/vg/VG-SGG-with-attri.h5.

Issues

1

AttributeError: module 'PIL.Image' has no attribute 'LINEAR'. Did you mean: 'BILINEAR'?

LINEAR -> BILINEAR: commit

2

An error occurred while trying to train:

  File "/home/cyl/Reconst/fcsgg/fcsgg/data/detection_utils.py", line 432, in generate_score_map
    masked_fmap = torch.max(masked_fmap, gaussian_mask * k)
RuntimeError: The size of tensor a (55) must match the size of tensor b (56) at non-singleton dimension 1

modify detection_utils.py: commit

Training

First, edit the training config file ./configs/quick_schedules/Quick-FCSGG-HRNet-W32.yaml (the original file uses pretrained weights):

MODEL:
  META_ARCHITECTURE: "CenterNet"
  HRNET:
    WEIGHTS: "output/FasterR-CNN-HR32-3x.pth"

Change it to train from scratch:

MODEL:
  META_ARCHITECTURE: "CenterNet"
  HRNET:
    WEIGHTS: ""  # Empty string to train from scratch

Then run:

python tools/train_net.py --num-gpus 1 --config-file configs/quick_schedules/Quick-FCSGG-HRNet-W32.yaml

Training succeeded ✌

...
[04/23 10:21:01] d2.utils.events INFO:  eta: 0:05:37  iter: 1159  total_loss: 1.042  loss_cls: 0.7593  loss_box_wh: 0.08827  loss_center_reg: 0.02485  loss_raf: 0.1754    time: 0.4042  last_time: 0.4519  data_time: 0.0043  last_data_time: 0.0044   lr: 0.001  max_mem: 4141M
[04/23 10:21:09] d2.utils.events INFO:  eta: 0:05:29  iter: 1179  total_loss: 1.028  loss_cls: 0.7208  loss_box_wh: 0.09246  loss_center_reg: 0.02669  loss_raf: 0.1625    time: 0.4042  last_time: 0.4035  data_time: 0.0041  last_data_time: 0.0044   lr: 0.001  max_mem: 4141M
[04/23 10:21:17] d2.utils.events INFO:  eta: 0:05:21  iter: 1199  total_loss: 1.01  loss_cls: 0.671  loss_box_wh: 0.1038  loss_center_reg: 0.02432  loss_raf: 0.1635    time: 0.4042  last_time: 0.3943  data_time: 0.0042  last_data_time: 0.0043   lr: 0.001  max_mem: 4141M
[04/23 10:21:25] d2.utils.events INFO:  eta: 0:05:13  iter: 1219  total_loss: 0.9737  loss_cls: 0.6887  loss_box_wh: 0.0929  loss_center_reg: 0.02574  loss_raf: 0.1749    time: 0.4041  last_time: 0.4101  data_time: 0.0041  last_data_time: 0.0042   lr: 0.001  max_mem: 4141M
...

Explanation

See [[FCSGG Repo Explanation]]

Posted 2025-04-15 | Updated 2025-10-16 | Review | a few seconds read (About 0 words)

OpenPose Using Part Affinity Fields