Pixtral 12B API Inference
Posted 2025-05-23 · Updated 2025-12-07 · Note · 3 minutes read (About 435 words)

Repository:
https://github.com/PSGBOT/pixtral-12B-Inference

Local Image Upload

import base64

def encode_image(image_path):
    """Encode the image to base64."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    except FileNotFoundError:
        print(f"Error: The file {image_path} was not found.")
        return None
    except Exception as e:  # general exception handling
        print(f"Error: {e}")
        return None
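
Once encoded, the image is sent to Pixtral as a data URL inside a chat message. A minimal sketch using the Mistral Python client (the prompt is truncated, and the image path and API-key handling are assumptions; the message layout follows Mistral's documented image-input format):

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])  # assumed env var
base64_image = encode_image("masked_scene.jpg")           # hypothetical path

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Focus on the area highlighted in green..."},
            {"type": "image_url",
             "image_url": f"data:image/jpeg;base64,{base64_image}"},
        ],
    }],
)
print(response.choices[0].message.content)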

Prompt

The prompt for VLM object description:

Core requirements: accurately localize the object, avoid recognizing distant background as an object, and keep false positives low.

Focus on the area highlighted in green in the image.

Step 1: Determine if the highlighted area represents a distinct, identifiable object or instance:
- If the highlighted area is clearly a distinct object, proceed to Step 2.
- If the highlighted area is abstract, ambiguous, or you cannot confidently identify it as a specific object (e.g., part of background, texture, partial view), respond with "Valid: No".

Step 2: If the highlighted area is a distinct object, provide:
1. The specific name of the object (be precise and use technical terms when appropriate)
2. The primary function or purpose of this object
3. Any notable features visible in the highlighted area (no color description)
4. If there is text visible on the object, include what it says

Remember, if you're uncertain about the highlighted area being a distinct object, respond only with "Valid: No".

Sample outputs:

  • Valid
    Valid: Yes

    1. The specific name of the object: Soap dispenser
    2. The primary function or purpose of this object: To dispense liquid soap or hand sanitizer.
    3. Notable features visible in the highlighted area:
    - The dispenser has a pump mechanism at the top.
    - The body of the dispenser is cylindrical.
    - The material appears to be translucent plastic.
    4. There is no visible text on the object.
  • Invalid
    Valid: No

VLM Output -> Structured Output

Another LLM is used to parse the VLM output and convert it into a JSON file, implemented through the API provided by Mistral AI:

import json
from typing import List, Optional

from pydantic import BaseModel, Field

class Instance(BaseModel):
    valid: str
    name: Optional[str] = None
    feature: Optional[List[str]] = Field(default_factory=list)
    usage: Optional[List[str]] = Field(default_factory=list)

def parse_description_msg(msg):
    message = [
        {"role": "system", "content": "Extract the description information."},
        {
            "role": "user",
            "content": msg,
        },
    ]
    return message

# Inside the inference class (excerpt):
chat_response = self.client.chat.parse(
    model=self.llm,
    messages=msg,
    response_format=Instance,
    max_tokens=self.llm_max_tokens,
    temperature=self.llm_temperature,
)
return json.loads(chat_response.choices[0].message.content)
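
The parsed result is a plain dict validated against Instance; for the soap-dispenser output above it would look roughly like this (values illustrative):

{
    "valid": "Yes",
    "name": "Soap dispenser",
    "feature": ["pump mechanism at the top", "cylindrical body", "translucent plastic"],
    "usage": ["dispense liquid soap or hand sanitizer"],
}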

2025 Disneyland
Posted 2025-05-19 · Updated 2025-12-07 · Note · a few seconds read (About 24 words)

https://pan.sjtu.edu.cn/web/desktop/personalSpace?path=Disney

https://pan.sjtu.edu.cn/web/share/268e334a73ed7a82daca26aecf2bfd67

Matlab on Archlinux
Posted 2025-05-13 · Updated 2025-12-07 · Note · a minute read (About 160 words)

Install using mpm (https://wiki.archlinux.org/title/MATLAB).
Download mpm from https://www.mathworks.com/mpm/glnxa64/mpm and make it executable.

Install:

./mpm install --release=R2024b --destination=/home/cyl/matlab MATLAB

After installation, launch MATLAB, complete license activation, and then apply the patch (https://bbs.archlinux.org/viewtopic.php?id=303177):

patchelf --clear-execstack /home/user/.MathWorks/ServiceHost/-mw_shared_installs/v2024.13.0.2/bin/glnxa64/libmwfoundation_crash_handling.so
patchelf --clear-execstack /home/user/.MathWorks/ServiceHost/-mw_shared_installs/v2024.13.0.2/bin/glnxa64/mathworksservicehost/rcf/matlabconnector/serviceprocess/rcf/service/libmwmshrcfservice.so # may not be needed

If a blank window appears with only "Ready" shown in the lower left, see www.reddit.com/r/matlab/comments/1dhejp5/matlab_gui_not_loading_properly_on_arch/ and set the environment variable:

export _JAVA_AWT_WM_NONREPARENTING=1

Part-level Dataset Available for Augmentation
Posted 2025-05-13 · Updated 2025-12-07 · Note · 3 minutes read (About 398 words)

Sources

Single Instance

  • Image Classification - 32 Classes - Furniture: About 10K

Complicated Scene

  • Real
    • Indoor Training Set (ITS) [RESIDE-Standard]: 1.4K
    • MIT Indoor Scenes: 15K
  • Synthetic
    • InteriorVerse

chair_dataset

  • image_1~image_3000: kaggle furniture image dataset
  • image_3001~image_4887: DeepFurniture

desk_dataset

  • image_1~image_700: pix3d

home_appliance_dataset

  • image_1~image_3000: kaggle furniture image dataset (fridge only)
  • image_3001~image_3429: DeepFurniture home-appliance category

shelves_dataset

  • image_1~image_3000: kaggle furniture image dataset
  • image_3001~image_3243: pix3d wardrobe category
  • image_3244~image_3604: pix3d bookcase category

sofa_dataset

  • image_1~image_1947: pix3d
  • image_1498~image_3888: DeepFurniture

table_dataset

  • image_1~image_3000: kaggle furniture image dataset
  • image_3001~image_4870: pix3d
  • image_4871~image_7293: DeepFurniture

tool_dataset

  • image_1~image_115: pix3d
  • image_116~image_1441: kaggle mechanical tool dataset hammer
  • image_1442~image_1812: kaggle mechanical tool dataset plier
  • image_1813~image_3138: kaggle mechanical tool dataset screwdriver
  • image_3139~image_4469: kaggle mechanical tool dataset wrench

tv_dataset

  • image_1~image_3000: kaggle furniture image dataset

PSR Dataset

Cabinet

from shelves_dataset 3000 images -> ? train & ? val samples

Desk (Processing)

from desk_dataset 699 images -> ? train & ? val samples

Tool (Processing)

from tool_dataset (simplified) 1335 images -> 2729 train & 911 val samples

Furniture (Processing)

from 130k Images/furniture 1983 images -> ? train & ? val samples

Coco (Processing)

from coco2017 2212 images -> ? train & ? val samples

current total:

  • train: 12,415 - 686 (bg) = 11,729
  • val: 3,028 - 182 (bg) = 2,846

Write Latex in Neovim on Archlinux
Posted 2025-05-07 · Updated 2025-12-07 · Note · a few seconds read (About 9 words)

https://www.youtube.com/watch?app=desktop&v=HVcTPeitxmw

Davinci-resolve on Archlinux
Posted 2025-05-07 · Updated 2025-12-07 · Note · a few seconds read (About 31 words)

Download pkg from https://apps.cloud.blackmagicdesign.com/davinci-resolve

git clone https://aur.archlinux.org/davinci-resolve.git
cd davinci-resolve
mv ~/Downloads/DaVinci_Resolve_19.1.4_Linux.zip ./
makepkg -si

FCSGG Repo Explanation
Posted 2025-04-23 · Updated 2025-12-07 · Note · 6 minutes read (About 870 words)

FCSGG Repository Summary

FCSGG (Fully Convolutional Scene Graph Generation) is a PyTorch implementation of the paper “Fully Convolutional Scene Graph Generation” published in CVPR 2021. The project focuses on scene graph generation, which is the task of detecting objects in an image and identifying the relationships between them.

Core Components:

  1. Architecture:

    • Built on Detectron2, a popular object detection framework by Facebook
    • Uses a one-stage detector approach (CenterNet) as the meta-architecture
    • Supports various backbones including ResNet, HRNet (High-Resolution Network), Hourglass networks, and DLA
  2. Key Features:

    • Fully convolutional approach to scene graph generation
    • Multiple backbone options with different feature pyramid networks (FPN, BiFPN, HRFPN)
    • Various head designs including multiscale heads and attention mechanisms
    • Support for different input resolutions and training strategies
  3. Dataset:

    • Primarily designed for the Visual Genome dataset, a large-scale dataset for scene understanding
    • Includes custom data loaders and preprocessing for scene graph generation
  4. Model Components:

    • Backbones: Various CNN architectures (ResNet, HRNet, Hourglass, DLA)
    • Necks: Feature pyramid networks and variants (FPN, BiFPN, HRFPN, Trident)
    • Heads: Detection and relationship prediction heads
    • Loss Functions: Custom losses for object detection and relationship prediction
  5. Utilities:

    • Visualization tools for scene graphs
    • Evaluation metrics for scene graph generation
    • Training and inference scripts

Project Structure:

  • fcsgg/: Main module containing model implementation

    • modeling/: Neural network architecture components
      • backbone/: Feature extraction networks
      • necks/: Feature pyramid networks
      • heads/: Detection and relationship prediction heads
      • meta_arch/: High-level model architecture (CenterNet)
    • data/: Dataset handling and preprocessing
    • evaluation/: Metrics and evaluation code
    • utils/: Helper functions and utilities
    • layers/: Custom neural network layers
    • structures/: Data structures for scene graphs
  • configs/: Configuration files for different model variants

  • tools/: Training, evaluation, and visualization scripts

  • GraphViz/: Visualization tools for scene graphs

Key Innovations:

The project implements a fully convolutional approach to scene graph generation, which differs from traditional two-stage methods. Instead of first detecting objects and then predicting relationships, it uses a one-stage detector to simultaneously predict objects and their relationships in a fully convolutional manner.

Benchmarks:

The repository provides several pre-trained models with different backbones:

  1. HRNetW32-1S
  2. ResNet50-4S-FPN×2
  3. HRNetW48-5S-FPN×2

These models achieve competitive performance on the Visual Genome dataset for scene graph generation tasks.

Usage:

The project provides tools for training, evaluation, and visualization of scene graphs. It requires the Visual Genome dataset and can be run using Docker or directly with PyTorch.

In summary, FCSGG is a comprehensive implementation of a state-of-the-art approach to scene graph generation using fully convolutional networks, offering various model architectures and training configurations.

How Detectron2 is Used in FCSGG

FCSGG is built on top of Detectron2, Facebook’s object detection framework, and leverages many of its components while extending it for scene graph generation. Here’s a detailed breakdown:

1. Core Architecture Integration

  • Meta Architecture: FCSGG registers a custom meta architecture called “CenterNet” with Detectron2’s META_ARCH_REGISTRY. This extends Detectron2’s modular architecture system while maintaining compatibility (a registration sketch follows this list).

  • Backbone Networks: FCSGG uses Detectron2’s backbone networks (ResNet, etc.) directly and also implements custom backbones like HRNet while following Detectron2’s backbone interface.

  • Feature Pyramid Networks (FPN): The repository uses Detectron2’s FPN implementation and extends it with custom variants like BiFPN and HRFPN.
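
A minimal sketch of what such a registration looks like in Detectron2 (the real FCSGG CenterNet class is far more involved; the class body here is illustrative):

import torch.nn as nn
from detectron2.modeling import META_ARCH_REGISTRY, build_backbone

@META_ARCH_REGISTRY.register()
class CenterNet(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.backbone = build_backbone(cfg)  # any backbone following the interface

    def forward(self, batched_inputs):
        ...  # predict centers, box sizes, and relation fields from backbone features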

2. Configuration System

  • YAML Configuration: FCSGG adopts Detectron2’s YAML-based configuration system, extending it with custom configurations for scene graph generation through add_fcsgg_config() (sketched after this list).

  • Command Line Arguments: The training script uses Detectron2’s default_argument_parser() to maintain the same command-line interface.
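
A sketch of the usual Detectron2 config-extension pattern (the import path of add_fcsgg_config is an assumption; the YAML file is one shipped in configs/):

from detectron2.config import get_cfg

from fcsgg.config import add_fcsgg_config  # assumed import path

cfg = get_cfg()        # Detectron2's base config
add_fcsgg_config(cfg)  # add the scene-graph-specific keys
cfg.merge_from_file("configs/quick_schedules/Quick-FCSGG-HRNet-W32.yaml")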

3. Data Handling

  • Dataset Registration: The Visual Genome dataset is registered with Detectron2’s DatasetCatalog and MetadataCatalog, making it available through Detectron2’s data loading pipeline (see the sketch after this list).

  • Custom Dataset Mapper: FCSGG implements a custom DatasetMapper class that extends Detectron2’s mapper to handle scene graph annotations.

  • Data Loaders: The repository uses Detectron2’s build_detection_train_loader and build_detection_test_loader with custom mappers.
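
A sketch of how such a registration typically looks (the loader body and class list are illustrative, not copied from the repo):

from detectron2.data import DatasetCatalog, MetadataCatalog

def load_vg_train():
    """Return a list of dicts in Detectron2's standard dataset format,
    extended with per-image relationship annotations."""
    ...

DatasetCatalog.register("vg_train", load_vg_train)
MetadataCatalog.get("vg_train").set(
    thing_classes=["person", "chair"],  # truncated; the VG split uses 150 object classes
)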

4. Training and Evaluation

  • Trainer Class: FCSGG extends Detectron2’s DefaultTrainer class to customize the training loop, evaluation metrics, and data loading (a minimal example follows this list).

  • Checkpointing: The repository uses Detectron2’s DetectionCheckpointer for model saving and loading.

  • Distributed Training: FCSGG leverages Detectron2’s distributed training utilities through detectron2.utils.comm and the launch function.

  • Custom Evaluators: The repository implements a custom VGEvaluator for scene graph evaluation while following Detectron2’s evaluator interface.
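
A minimal sketch of that trainer pattern (VGEvaluator's import path and constructor signature are assumptions):

from detectron2.engine import DefaultTrainer

from fcsgg.evaluation import VGEvaluator  # assumed import path

class Trainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name):
        # plug in the scene-graph evaluator instead of a COCO one
        return VGEvaluator(dataset_name)  # assumed signature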

5. Visualization and Logging

  • Event Storage: FCSGG uses Detectron2’s event storage system for logging metrics during training (a short usage sketch follows this list).

  • Visualization Tools: The repository leverages Detectron2’s visualization utilities for debugging and result analysis.
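
For reference, Detectron2's event-storage API in isolation (the metric name echoes the loss_raf entries in the training log later on this page; the value is a placeholder):

from detectron2.utils.events import EventStorage

with EventStorage(start_iter=0) as storage:
    storage.put_scalar("loss_raf", 0.175)  # placeholder value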

6. Extensions for Scene Graph Generation

  • Custom Heads: While using Detectron2’s architecture, FCSGG implements custom prediction heads for relationship detection.

  • Scene Graph Structures: The repository defines custom data structures for scene graphs that integrate with Detectron2’s Instances class (see the sketch after this list).

  • Loss Functions: FCSGG implements specialized loss functions for scene graph generation while maintaining compatibility with Detectron2’s loss computation framework.
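
Detectron2's Instances accepts arbitrary named fields of equal length, which is one natural way to carry relationship predictions alongside boxes (the pred_rels field name is assumed, not the repo's):

import torch
from detectron2.structures import Boxes, Instances

inst = Instances((480, 640))                # image height, width
inst.pred_boxes = Boxes(torch.zeros(3, 4))  # three dummy detections
inst.pred_classes = torch.tensor([1, 2, 3])
inst.pred_rels = torch.tensor(              # assumed extra field: one
    [[0, 1, 5], [1, 2, 7], [2, 0, 3]]       # (subject, object, predicate)
)                                           # triple per detection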

7. Installation and Dependencies

  • Submodule Integration: Detectron2 is included as a Git submodule, ensuring version compatibility.

  • Build Process: The installation process includes building Detectron2 from source to ensure proper integration.

In summary, FCSGG uses Detectron2 as its foundation, leveraging its modular architecture, data handling, training infrastructure, and configuration system while extending it with custom components for scene graph generation. This approach allows FCSGG to benefit from Detectron2’s robust implementation and optimizations while adding specialized functionality for relationship detection between objects.

Detectron
Posted 2025-04-22 · Updated 2025-12-07 · Note · a few seconds read (About 0 words)

FCSGG Repository Application
Posted 2025-04-22 · Updated 2025-12-07 · Note · 4 minutes read (About 570 words)

Official repo:
https://github.com/liuhengyue/fcsgg
Our repo:
https://github.com/PSGBOT/KAF-Generation

My venv: fcsgg

Installation

Environment Preparation

git clone git@github.com:liuhengyue/fcsgg.git
cd fcsgg
git submodule init
git submodule update
conda create -n fcsgg python=3.10
conda activate fcsgg
conda install nvidia/label/cuda-11.8.0::cuda-toolkit -c nvidia/label/cuda-11.8.0
conda install cudatoolkit
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# for building detectron
conda install -c conda-forge gcc=11.2.0
conda install -c conda-forge gxx=11.2.0

conda env config vars set LD_LIBRARY_PATH="/home/cyl/miniconda3/envs/fcsgg/lib/"
conda env config vars set CPATH="/home/cyl/miniconda3/envs/fcsgg/include/"
conda env config vars set CUDA_HOME="/home/cyl/miniconda3/envs/fcsgg/"

conda deactivate
conda activate fcsgg

export CC=$CONDA_PREFIX/bin/gcc
export CXX=$CONDA_PREFIX/bin/g++

pip install -r requirements.txt
python -m pip install -e detectron2

Downloads

Datasets:

cd ~/Reconst
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P ./Data/vg/
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P ./Data/vg/
unzip -j ./Data/vg/images.zip -d ./Data/vg/VG_100K
unzip -j ./Data/vg/images2.zip -d ./Data/vg/VG_100K

Download the scene graphs and extract them to datasets/vg/VG-SGG-with-attri.h5.

Issues

Issue 1

AttributeError: module 'PIL.Image' has no attribute 'LINEAR'. Did you mean: 'BILINEAR'?

Fix: LINEAR -> BILINEAR (commit).
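
Pillow >= 10 removed the deprecated Image.LINEAR alias, so the referenced call needs the surviving constant. Roughly (the exact file and variable are whichever ones reference the alias):

interp = PIL.Image.LINEAR    # before: alias removed in Pillow >= 10
interp = PIL.Image.BILINEAR  # after: the surviving constant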

Issue 2

During a training attempt, the following error is raised:

  File "/home/cyl/Reconst/fcsgg/fcsgg/data/detection_utils.py", line 432, in generate_score_map
    masked_fmap = torch.max(masked_fmap, gaussian_mask * k)
RuntimeError: The size of tensor a (55) must match the size of tensor b (56) at non-singleton dimension 1

Fix: modify detection_utils.py (commit).
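
One way to reconcile the shapes (the actual commit may differ, e.g. by fixing the rounding that produces the off-by-one) is to crop both tensors to their common region before the element-wise max:

# inside generate_score_map, replacing the failing line (sketch):
h = min(masked_fmap.shape[0], gaussian_mask.shape[0])
w = min(masked_fmap.shape[1], gaussian_mask.shape[1])
masked_fmap = torch.max(masked_fmap[:h, :w], gaussian_mask[:h, :w] * k)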

Training

First, edit the training config file ./configs/quick_schedules/Quick-FCSGG-HRNet-W32.yaml (the original file uses pretrained weights):

MODEL:
  META_ARCHITECTURE: "CenterNet"
  HRNET:
    WEIGHTS: "output/FasterR-CNN-HR32-3x.pth"

Change it to train from scratch:

MODEL:
  META_ARCHITECTURE: "CenterNet"
  HRNET:
    WEIGHTS: "" # Empty string to train from scratch

Then run:

python tools/train_net.py --num-gpus 1 --config-file configs/quick_schedules/Quick-FCSGG-HRNet-W32.yaml

Training succeeds ✌

...
[04/23 10:21:01] d2.utils.events INFO: eta: 0:05:37 iter: 1159 total_loss: 1.042 loss_cls: 0.7593 loss_box_wh: 0.08827 loss_center_reg: 0.02485 loss_raf: 0.1754 time: 0.4042 last_time: 0.4519 data_time: 0.0043 last_data_time: 0.0044 lr: 0.001 max_mem: 4141M
[04/23 10:21:09] d2.utils.events INFO: eta: 0:05:29 iter: 1179 total_loss: 1.028 loss_cls: 0.7208 loss_box_wh: 0.09246 loss_center_reg: 0.02669 loss_raf: 0.1625 time: 0.4042 last_time: 0.4035 data_time: 0.0041 last_data_time: 0.0044 lr: 0.001 max_mem: 4141M
[04/23 10:21:17] d2.utils.events INFO: eta: 0:05:21 iter: 1199 total_loss: 1.01 loss_cls: 0.671 loss_box_wh: 0.1038 loss_center_reg: 0.02432 loss_raf: 0.1635 time: 0.4042 last_time: 0.3943 data_time: 0.0042 last_data_time: 0.0043 lr: 0.001 max_mem: 4141M
[04/23 10:21:25] d2.utils.events INFO: eta: 0:05:13 iter: 1219 total_loss: 0.9737 loss_cls: 0.6887 loss_box_wh: 0.0929 loss_center_reg: 0.02574 loss_raf: 0.1749 time: 0.4041 last_time: 0.4101 data_time: 0.0041 last_data_time: 0.0042 lr: 0.001 max_mem: 4141M
...

Explanation

See [[FCSGG Repo Explanation]]

Vision-Language Interpreter for Robot Task Planning
Posted 2025-04-16 · Updated 2025-12-07 · Note · a few seconds read (About 3 words)
