Posted 2024-11-21 | Updated 2025-07-24 | Note | a few seconds read (About 78 words)

Convert raw RGB-D into a tree-structured scene (maybe in Unity), for more

Convert the raw point cloud (voxels) into a semantically segmented cloud
Classify the segmented objects into different structures and establish parent-child relationships
Identify the state of each object
1. pose, i.e. numeric values (\\)
2. state, e.g. open/closed (LLM)
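
The tree described above could be represented roughly as follows. This is only a minimal sketch: the `SceneNode` class, its fields and the example objects are hypothetical names chosen for illustration, not part of any existing pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SceneNode:
    """One object in the scene tree (hypothetical structure)."""
    label: str
    pose: Optional[List[float]] = None   # numeric pose, e.g. [x, y, z, qx, qy, qz, qw]
    state: Optional[str] = None          # symbolic state, e.g. "open" / "closed"
    children: List["SceneNode"] = field(default_factory=list)

    def add_child(self, child: "SceneNode") -> "SceneNode":
        # Establish the parent-child relationship and return the child for chaining.
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        # Depth-first traversal yielding (depth, node) pairs.
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


root = SceneNode("room")
table = root.add_child(SceneNode("table", pose=[0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 1.0]))
table.add_child(SceneNode("drawer", state="open"))
table.add_child(SceneNode("cup", pose=[0.1, 0.2, 0.75, 0.0, 0.0, 0.0, 1.0]))

lines = [
    "  " * depth + node.label + (f" [{node.state}]" if node.state else "")
    for depth, node in root.walk()
]
print("\n".join(lines))
```

Keeping pose (numeric) and state (symbolic) as separate optional fields mirrors the two kinds of object state listed above.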

I found that this idea closely matches a paper recently published by lff: https://arxiv.org/html/2410.07408v1

Comparison with scene understanding

Posted 2024-11-17 | Updated 2025-07-24 | Note | 18 minutes read (About 2722 words)

Cosypose modification

Setup

Repository: https://github.com/Simple-Robotics/cosypose

git clone --recurse-submodules https://github.com/Simple-Robotics/cosypose.git
cd cosypose
conda env create -n cosypose --file environment.yaml

Note: during this step pip reports that setuptools and matplotlib-inline are incompatible with Python 3.7.6; install compatible versions manually inside the environment:

conda activate cosypose
pip install setuptools==63.4.1
pip install matplotlib-inline==0.1.6

git lfs pull
python setup.py install
python setup.py develop

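Download the data as described in the README.
Note that the first block of commands fails to download anything: according to https://bop.felk.cvut.cz/datasets/, the download links have moved to Hugging Face. The test set can be downloaded manually from https://huggingface.co/datasets/bop-benchmark/datasets/tree/main/ycbv and placed in local_data/bop_datasets/ycbv/test.

Set the models used for testing: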

cp ./local_data/bop_datasets/ycbv/model_bop_compat_eval ./local_data/bop_datasets/ycbv/models

Debug

`np.where(mask)[0].item()`

Running

export CUDA_VISIBLE_DEVICES=0
python -m cosypose.scripts.run_cosypose_eval --config ycbv

fails with the following error:

Traceback (most recent call last):
  File "/home/cyl/.conda/envs/cosypose/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cyl/.conda/envs/cosypose/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cyl/cosypose/cosypose/scripts/run_cosypose_eval.py", line 491, in <module>
    main()
  File "/home/cyl/cosypose/cosypose/scripts/run_cosypose_eval.py", line 332, in main
    scene_ds = make_scene_dataset(ds_name)
  File "/home/cyl/cosypose/cosypose/datasets/datasets_cfg.py", line 68, in make_scene_dataset
    ids.append(np.where(mask)[0].item())
ValueError: can only convert an array of size 1 to a Python scalar

After adding debug output, we get:

Debug - scene_id: 48, view_id: 1
Debug - mask matches: 1
Debug - where result shape: (1,), values: [225]
Debug - scene_id: 48, view_id: 36
Debug - mask matches: 1
Debug - where result shape: (1,), values: [226]
Debug - scene_id: 48, view_id: 47
Debug - mask matches: 1
Debug - where result shape: (1,), values: [227]
Debug - scene_id: 48, view_id: 83
Debug - mask matches: 1
Debug - where result shape: (1,), values: [228]
Debug - scene_id: 48, view_id: 112
Debug - mask matches: 1
Debug - where result shape: (1,), values: [229]
Debug - scene_id: 48, view_id: 135
Debug - mask matches: 0
Debug - where result shape: (0,), values: []
0:00:00.912023 - Expected exactly one match, got 0 matches for scene_id=48, view_id=135

It turns out that the downloaded test set does not contain every frame listed in the dataset's keyframe.txt, so some keyframes cannot be found.
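
One way to make this failure mode visible is to collect the missing keyframes instead of calling `.item()` on an empty result. The sketch below is not the code in datasets_cfg.py: `match_keyframes` and its arguments are hypothetical names, and it assumes the frame index can be reduced to two parallel arrays of scene ids and view ids.

```python
import numpy as np


def match_keyframes(scene_ids, view_ids, keyframes):
    """Return (ids, missing): row indices of the keyframes that exist exactly
    once in the frame index, and the (scene_id, view_id) pairs that do not."""
    scene_ids = np.asarray(scene_ids)
    view_ids = np.asarray(view_ids)
    ids, missing = [], []
    for scene_id, view_id in keyframes:
        mask = (scene_ids == scene_id) & (view_ids == view_id)
        matches = np.where(mask)[0]
        if matches.size == 1:
            ids.append(int(matches[0]))
        else:
            # Zero (or several) matches: record the pair instead of raising.
            missing.append((scene_id, view_id))
    return ids, missing


# Toy frame index with three frames of scene 48; view 135 is absent.
ids, missing = match_keyframes(
    scene_ids=[48, 48, 48],
    view_ids=[1, 36, 47],
    keyframes=[(48, 1), (48, 36), (48, 135)],
)
print(ids, missing)
```

Skipping missing keyframes only hides the symptom: results are comparable with published numbers only when the complete test set is present.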

When a run is interrupted halfway

To start a fresh run from scratch, clear local_data/joblib_cache.

Framework

Prediction Script `cosypose.scripts.run_cosypose_eval`

AI explanation

The script predicts object poses based on multi-view input by following these steps:

Dataset Loading: It first loads the dataset using the make_scene_dataset function, which prepares the scene data for evaluation. The dataset is wrapped in a MultiViewWrapper to handle multiple views.
Model Loading: The script loads pre-trained models for pose prediction using the load_models function. It loads both coarse and refiner models based on the configuration specified in the command-line arguments.
Prediction Setup: The script sets up the prediction parameters, including the number of iterations for coarse and refiner models, and whether to skip multi-view processing based on the number of views specified.
Multi-view Prediction: The MultiviewScenePredictor is initialized with the mesh database, which is used to predict poses across multiple views. The MultiviewPredictionRunner is then used to run predictions on the dataset, leveraging the multi-view setup to improve pose estimation accuracy.
Pose Estimation: The script uses the loaded models to predict object poses. It processes detections from either pix2pose or posecnn depending on the dataset, and refines these predictions using the refiner model.
Evaluation: After predictions, the script evaluates the predicted poses using the PoseEvaluation class. It calculates various metrics like ADD-S and AUC to assess the accuracy of the pose predictions.
Results Logging: Finally, the script logs the results, including evaluation metrics, and saves them to a specified directory.

The multi-view approach allows the script to leverage information from different viewpoints, which can help resolve ambiguities and improve the robustness of the pose estimation.

Prediction Script `run_custom_scenario`

Terms

TCO

Transformation from Camera to Object.
It represents the transformation matrix or parameters that describe the pose of an object relative to the camera’s coordinate system

TWO

Transformation from World to Object.
It represents the transformation matrix or parameters that describe the pose of an object relative to the world’s coordinate system
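
The relation between the two terms can be checked numerically. The sketch below assumes the common convention that `TAB` is a 4x4 homogeneous matrix expressing frame B in frame A, so that `TWO = TWC @ TCO`; the helper `make_transform` and all numbers are made up for illustration.

```python
import numpy as np


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


# Camera pose in the world: rotated 90 degrees about z, shifted 1 m along x.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
TWC = make_transform(Rz90, [1.0, 0.0, 0.0])

# Object pose in the camera frame: no rotation, 2 m in front, 0.5 m to the side.
TCO = make_transform(np.eye(3), [0.5, 0.0, 2.0])

# Chain the transforms to get the object pose in the world frame.
TWO = TWC @ TCO

# Going back: the camera-frame pose is recovered with the inverse camera pose.
TCO_back = np.linalg.inv(TWC) @ TWO
print(TWO[:3, 3])
```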

Model dataset

class MeshDataBase:
    def __init__(self, obj_list):
        self.infos = {obj['label']: obj for obj in obj_list}
        self.meshes = {l: trimesh.load(obj['mesh_path']) for l, obj in self.infos.items()}

    @staticmethod
    def from_object_ds(object_ds):
        obj_list = [object_ds[n] for n in range(len(object_ds))]
        return MeshDataBase(obj_list)
...

Typical initialization:

object_ds = BOPObjectDataset(scenario_dir / 'models')
mesh_db = MeshDataBase.from_object_ds(object_ds)

It can also be loaded together with the models through load_models:

predictor, mesh_db = load_models(coarse_run_id, refiner_run_id, n_workers=n_plotters, object_set=object_set)

Important Classes

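`MultiViewWrapper`

Purpose:
Reads scene_dataset and splits its data into separate scenes according to the number of views n_views, which makes it easy to iterate over the scene elements (everything here is ground truth).
Each iteration returns:

n_views RGB images from different viewpoints
n_views corresponding masks

n_views corresponding observations

the poses and types of the recognized objects
the camera pose and intrinsics

frame_info, which is of little use

scene_ds_pred = MultiViewWrapper(scene_ds, n_views=n_views)
scene_ds_pred[0][2] # observations of scene 48, multi-view group 1, in five views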

[
 {'objects': 
  [
   {'label': 'obj_000001',
    'name': 'obj_000001',
    'TWO': array([[-0.02062261, -0.99870347, -0.04654345, -0.05380909],
           [ 0.99854439, -0.022895  ,  0.04883047,  0.00189095],
           [-0.04983272, -0.04546878,  0.9977229 ,  0.07060698],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[-0.02062261, -0.99870347, -0.04654345, -0.05380909],
           [ 0.99854439, -0.022895  ,  0.04883047,  0.00189095],
           [-0.04983272, -0.04546878,  0.9977229 ,  0.07060698],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.7769277845777234,
    'id_in_segm': 1,
    'bbox': [347, 210, 467, 374]},
   {'label': 'obj_000006',
    'name': 'obj_000006',
    'TWO': array([[-0.40056693,  0.91475543, -0.05262471,  0.03103553],
           [-0.91622629, -0.39934108,  0.03248866, -0.02365388],
           [ 0.00870386,  0.06123014,  0.9980863 ,  0.01391488],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[-0.40056693,  0.91475543, -0.05262471,  0.03103553],
           [-0.91622629, -0.39934108,  0.03248866, -0.02365388],
           [ 0.00870386,  0.06123014,  0.9980863 ,  0.01391488],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.9990349353406678,
    'id_in_segm': 2,
    'bbox': [328, 343, 422, 405]},
   {'label': 'obj_000014',
    'name': 'obj_000014',
    'TWO': array([[ 0.24178672, -0.96941339, -0.04215706, -0.05206396],
           [ 0.96977496,  0.2399519 ,  0.0442575 ,  0.0179453 ],
           [-0.03278805, -0.05158388,  0.99813144,  0.16636215],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[ 0.24178672, -0.96941339, -0.04215706, -0.05206396],
           [ 0.96977496,  0.2399519 ,  0.0442575 ,  0.0179453 ],
           [-0.03278805, -0.05158388,  0.99813144,  0.16636215],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.9938250428816466,
    'id_in_segm': 3,
    'bbox': [372, 143, 490, 241]},
   {'label': 'obj_000019',
    'name': 'obj_000019',
    'TWO': array([[-0.69888905,  0.1926738 , -0.68878937,  0.01412755],
           [ 0.711967  ,  0.27928957, -0.64428215,  0.05127768],
           [ 0.06823575, -0.94067797, -0.33237011,  0.06472594],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[-0.69888905,  0.1926738 , -0.68878937,  0.01412755],
           [ 0.711967  ,  0.27928957, -0.64428215,  0.05127768],
           [ 0.06823575, -0.94067797, -0.33237011,  0.06472594],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.9890470974808324,
    'id_in_segm': 4,
    'bbox': [419, 222, 527, 410]},
   {'label': 'obj_000020',
    'name': 'obj_000020',
    'TWO': array([[-0.74512542, -0.66691536,  0.00352083,  0.07854437],
           [-0.6669148 ,  0.74507458, -0.00940455, -0.15283599],
           [ 0.00364864, -0.00935569, -0.99995023,  0.01854317],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[-0.74512542, -0.66691536,  0.00352083,  0.07854437],
           [-0.6669148 ,  0.74507458, -0.00940455, -0.15283599],
           [ 0.00364864, -0.00935569, -0.99995023,  0.01854317],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.9953060637992145,
    'id_in_segm': 5,
    'bbox': [92, 328, 288, 442]}],
  'camera': 
  {'T0C': array([[-0.0792652 ,  0.241296  , -0.967209  ,  0.946419  ],
          [ 0.996102  ,  0.0568396 , -0.0674529 , -0.02116569],
          [ 0.0386997 , -0.968786  , -0.244861  ,  0.36645836],
          [ 0.        ,  0.        ,  0.        ,  1.        ]]),
   'K': array([[1.066778e+03, 0.000000e+00, 3.129869e+02],
          [0.000000e+00, 1.067487e+03, 2.413109e+02],
          [0.000000e+00, 0.000000e+00, 1.000000e+00]]),
   'TWC': array([[-0.0792652 ,  0.241296  , -0.967209  ,  0.946419  ],
          [ 0.996102  ,  0.0568396 , -0.0674529 , -0.02116569],
          [ 0.0386997 , -0.968786  , -0.244861  ,  0.36645836],
          [ 0.        ,  0.        ,  0.        ,  1.        ]]),
   'resolution': torch.Size([480, 640])},
  'frame_info': 
  {
   'scene_id': 48,
   'cam_id': 'cam',
   'view_id': 1626,
   'cam_name': 'cam',
   'group_id': 0
   }
  },
  ... # other views
]
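
The grouping idea behind the wrapper can be illustrated with a short sketch. This is not the actual implementation: `group_views` is a hypothetical helper, and the real wrapper may select or shuffle views differently. Here consecutive frames of the same scene are chunked into groups of `n_views`, and incomplete groups are dropped.

```python
from collections import defaultdict


def group_views(frame_infos, n_views):
    """Split frame indices into groups of n_views frames from the same scene."""
    by_scene = defaultdict(list)
    for idx, info in enumerate(frame_infos):
        by_scene[info["scene_id"]].append(idx)

    groups = []
    for scene_id in sorted(by_scene):
        idxs = by_scene[scene_id]
        # Step through the scene in chunks of n_views; drop a trailing partial chunk.
        for start in range(0, len(idxs) - n_views + 1, n_views):
            groups.append(idxs[start:start + n_views])
    return groups


frames = [{"scene_id": 48, "view_id": v} for v in (1, 36, 47, 83, 112)]
frames += [{"scene_id": 49, "view_id": v} for v in (5, 9)]
groups = group_views(frames, n_views=2)
print(groups)
```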

`MultiviewPredictionRunner`

Purpose:
Takes a MultiViewWrapper as input and makes predictions.

First, the dataset is received:

dataloader = DataLoader(scene_ds, batch_size=batch_size,
						num_workers=n_workers,
						sampler=sampler,
						collate_fn=self.collate_fn)

collate_fn is used to process the raw data (the trailing comments mark the data that is actually used):

def collate_fn(self, batch):
	batch_im_id = -1

	cam_infos, K = [], []
	det_infos, bboxes = [], []
	for n, data in enumerate(batch): # normally only one batch
		assert n == 0
		images, masks, obss = data
		for c, obs in enumerate(obss): # iterate along different views
			batch_im_id += 1
			frame_info = obs['frame_info']
			im_info = {k: frame_info[k] for k in ('scene_id', 'view_id', 'group_id')} # info for the image
			im_info.update(batch_im_id=batch_im_id)
			cam_info = im_info.copy() # info for camera

			K.append(obs['camera']['K']) # camera intrinsics
			cam_infos.append(cam_info)

			for o, obj in enumerate(obs['objects']):
				obj_info = dict(
					label=obj['name'],
					score=1.0,
				)
				obj_info.update(im_info) # add key-value pair from im_info to obj_info
				bboxes.append(obj['bbox'])
				det_infos.append(obj_info)

	gt_detections = tc.PandasTensorCollection(
		infos=pd.DataFrame(det_infos),
		bboxes=torch.as_tensor(np.stack(bboxes)),
	) # basic info and bounding box of every ground-truth detection
	cameras = tc.PandasTensorCollection(
		infos=pd.DataFrame(cam_infos),
		K=torch.as_tensor(np.stack(K)),
	) # basic info of the camera for every view (same fields as the detection info) and its intrinsics
	data = dict(
		images=images,
		cameras=cameras,
		gt_detections=gt_detections,
	)
	return data

The most important function: get_predictions

def get_predictions(self, pose_predictor, mv_predictor,
					detections=None,
					n_coarse_iterations=1, n_refiner_iterations=1,
					sv_score_th=0.0, skip_mv=True,
					use_detections_TCO=False):

Responsible for generating predictions for object poses in a scene using both single-view and multi-view approaches.

Input Parameters:
- pose_predictor: the single-view predictor; the ycbv dataset, for example, uses the PoseCNN detection model
- mv_predictor: An object or function that predicts scene states using multi-view information.
- detections: A collection of detected objects with associated information, pre-generated and saved in a .pkl file
- n_coarse_iterations, n_refiner_iterations: Number of iterations for coarse and refinement pose estimation.
- sv_score_th: Score threshold for single-view detections.
- skip_mv: A flag to skip multi-view predictions.
- use_detections_TCO: A flag to use detections for initial pose estimation.
Filtering Detections:
Note that the detections used here come directly from pre-saved detection results (not ground truth):
posecnn_detections = load_posecnn_results()
- The function filters the input detections based on the sv_score_th threshold.
- It assigns a unique detection ID to each detection and creates an index based on scene_id and view_id.
Iterating Over Data:
- The function iterates over batches of data from the dataloader.
- For each batch, it extracts images, camera information, and ground truth detections.
Matching Detections:
- It matches the detections with the current batch of data using the index created earlier.
- It filters and prepares the detections for processing.
Pose Prediction:
- If there are detections, it uses the pose_predictor to get single-view predictions.
- It registers the initial bounding boxes with the candidates.
Multi-View Prediction:
- If skip_mv is False, it uses the mv_predictor to predict the scene state using multi-view information.
Collecting Predictions:
- It collects the single-view and multi-view predictions into a dictionary.
Concatenating Results:
- It concatenates the predictions across all batches and returns the final predictions.
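
The filtering and matching steps above can be sketched with plain pandas. This is an illustration under assumptions, not the code of get_predictions: `select_detections` is a hypothetical helper, the threshold is assumed to be inclusive (`>=`), and a boolean mask replaces the (scene_id, view_id) index for brevity.

```python
import pandas as pd


def select_detections(det_infos, sv_score_th, scene_id, view_ids):
    """Filter detections by score, tag them with a detection id, and keep
    those that belong to the views of the current batch."""
    kept = det_infos[det_infos["score"] >= sv_score_th].copy()
    kept["det_id"] = list(range(len(kept)))  # unique id per surviving detection
    in_batch = (kept["scene_id"] == scene_id) & kept["view_id"].isin(view_ids)
    return kept[in_batch].reset_index(drop=True)


det_infos = pd.DataFrame({
    "scene_id": [48, 48, 48, 49],
    "view_id": [1, 1, 36, 1],
    "score": [0.9, 0.2, 0.5, 0.8],
    "label": ["a", "b", "c", "d"],
})
selected = select_detections(det_infos, sv_score_th=0.3, scene_id=48, view_ids=[1, 36])
print(selected)
```

The low-score detection `b` is dropped by the threshold and `d` is dropped because it belongs to another scene.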

`MultiviewScenePredictor`

Purpose:
Used by MultiviewPredictionRunner.get_predictions.
In run_cosypose_eval, MultiviewScenePredictor is initialized like this:

mv_predictor = MultiviewScenePredictor(mesh_db)

In the MultiviewScenePredictor we use the mesh_db to initialize MultiviewRefinement and solve:

problem = MultiviewRefinement(candidates=candidates_n,
                    cameras=cameras,
	                pairs_TC1C2=pairs_TC1C2,
	                mesh_db=self.mesh_db_ba)
ba_outputs = problem.solve(
	n_iterations=ba_n_iter,
	optimize_cameras=not use_known_camera_poses,
)

The solve function of MultiviewRefinement:

def solve(self, sample_n_init=1, **lm_kwargs):
	timer_init = Timer()
	timer_opt = Timer()
	timer_misc = Timer()

	timer_init.start()
	TWO_9d_init, TCW_9d_init = self.robust_initialization_TWO_TCW(n_init=sample_n_init)
	timer_init.pause()

	timer_opt.start()
	TWO_9d_opt, TCW_9d_opt, history = self.optimize_lm(
		TWO_9d_init, TCW_9d_init, **lm_kwargs)
	timer_opt.pause()

	timer_misc.start()
	objects, cameras = self.make_scene_infos(TWO_9d_opt, TCW_9d_opt)
	objects_init, cameras_init = self.make_scene_infos(TWO_9d_init, TCW_9d_init)
	history = self.convert_history(history)
	timer_misc.pause()

	outputs = dict(
		objects_init=objects_init,
		cameras_init=cameras_init,
		objects=objects,
		cameras=cameras,
		history=history,
		time_init=timer_init.stop(),
		time_opt=timer_opt.stop(),
		time_misc=timer_misc.stop(),
	)
	return outputs

Adaptation

The plan is to build on run_custom_scenario and modify it.
Usage of run_custom_scenario:

python -m cosypose.scripts.run_custom_scenario --scenario=example

Setting OMP and MKL num threads to 1.
pybullet build time: Jan 28 2022 20:13:03
0:00:00.000859 - --------------------------------------------------------------------------------
0:00:00.000921 - scenario: example
0:00:00.000942 - sv_score_th: 0.3
0:00:00.000956 - n_symmetries_rot: 64
0:00:00.000956 - n_symmetries_rot: 64
0:00:00.000968 - ransac_n_iter: 2000
0:00:00.000980 - ransac_dist_threshold: 0.02
0:00:00.001002 - nms_th: 0.04
0:00:00.001015 - no_visualization: False
0:00:00.001026 - --------------------------------------------------------------------------------
0:00:00.569089 - Loaded 796 candidates in 8 views.
0:00:00.570278 - Loaded cameras intrinsics.
0:00:00.690990 - Loaded 30 3D object models.
0:00:00.691047 - Running stage 2 and 3 of CosyPose...
0:00:01.145408 - Num candidates: 107
0:00:01.145468 - Num views: 8
0:00:01.145728 - Estimating camera poses using RANSAC.
0:00:04.588304 - Matched candidates: 49
0:00:04.588375 - RANSAC time_models: 0:00:02.390068
0:00:04.588398 - RANSAC time_score: 0:00:00.990740
0:00:04.588415 - RANSAC time_misc: 0:00:00.061626
0:00:04.902268 - BA time_init: 0:00:00.005349
0:00:04.902333 - BA time_opt: 0:00:00.091822
0:00:04.902351 - BA time_misc: 0:00:00.004793
0:00:04.491746 - Subscene 0 has 8 objects and 7 cameras.
0:00:04.512850 - Wrote predicted scene (objects+cameras): /home/cyl/cosypose/local_data/custom_scenarios/example/results/subscene=0/predicted_scene.json
0:00:04.512906 - Wrote predicted objects with pose expressed in camera frame: /home/cyl/cosypose/local_data/custom_scenarios/example/results/subscene=0/scene_reprojected.csv

The script takes only the candidates, mesh_db and camera_K information and runs mv_predictor directly.

A function that builds the candidates from list input:

def read_list_candidates_cameras(self, data_list, cameras_K_list):
	"""
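	Creates PandasTensorCollections of candidates and cameras from per-view lists.

	Args:
		data_list (list): One element per view. Each element is a dictionary with the key:
			- "candidates" (list of dict): Each candidate dictionary includes:
				- "label" (str): The label of the object.
				- "score" (float): The confidence score of the object.
				- "pose" (torch.Tensor): A [4, 4] torch.Tensor representing the pose matrix.
		cameras_K_list (list of torch.Tensor): One [3, 3] intrinsics matrix per view, in the same order as data_list.

	Returns:
		tuple: (candidates, cameras), two PandasTensorCollection objects. The first contains poses and infos, the second contains K and infos.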
	"""
	all_poses = []
	all_infos = []
	all_K = []

	# Initialize view_id to be assigned automatically
	view_id = 0
	scene_id = 0  # Fixed value for scene_id

	for view, K in zip(data_list, cameras_K_list):
		all_K.append(K)
		for candidate in view["candidates"]:
			label = candidate["label"]
			score = candidate["score"]
			pose = candidate["pose"]

			# Append the pose tensor
			all_poses.append(pose)

			# Append the metadata
			all_infos.append({
				"view_id": view_id,
				"scene_id": scene_id,
				"score": score,
				"label": label
			})

		# Increment view_id for the next set of candidates
		view_id += 1

	K_tensor = torch.stack(all_K).to(dtype=torch.float32, device="cuda:0")

	# Stack poses into a single tensor
	poses_tensor = torch.stack(all_poses).to(dtype=torch.float32, device="cuda:0")

	# Create a Pandas DataFrame for infos
	infos_df = pd.DataFrame(all_infos)
	# Return the PandasTensorCollection-like structure
	ptc_candidate = tc.PandasTensorCollection(poses=poses_tensor, infos=infos_df)
	# One row per view, so infos stays aligned with K even if a view has no candidates
	cam_info = pd.DataFrame({"view_id": list(range(len(all_K)))})
	ptc_cam = tc.PandasTensorCollection(K=K_tensor, infos=cam_info)
	return ptc_candidate, ptc_cam

# Example usage:
example_data = [
    {
        "candidates": [
            {"label": "obj_000017", "score": 0.829675, "pose": torch.eye(4)},
            {"label": "obj_000010", "score": 0.820436, "pose": torch.eye(4) * 2},
        ]
    },
    {
        "candidates": [
            {"label": "obj_000005", "score": 0.104478, "pose": torch.eye(4) * 3},
        ]
    }
]
example_cameras_K = [
    torch.eye(3),
    torch.eye(3) * 2,
]

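# self is not used in the function body, so None stands in for it when calling outside a class
cd, cam = read_list_candidates_cameras(None, example_data, example_cameras_K)
cd, cam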

(PandasTensorCollection(
     poses: torch.Size([3, 4, 4]) torch.float32 cuda:0,
 ----------------------------------------
     infos:
    view_id  scene_id     score       label
 0        0         0  0.829675  obj_000017
 1        0         0  0.820436  obj_000010
 2        1         0  0.104478  obj_000005
 ),
 PandasTensorCollection(
     K: torch.Size([2, 3, 3]) torch.float32 cuda:0,
 ----------------------------------------
     infos:
    view_id
 0        0
 1        1
 ))

After that, call MultiviewScenePredictor.predict_scene_state() as usual to estimate the scene:

predictions = self.mv_predictor.predict_scene_state(candidates, cameras,
									   score_th=self.sv_score_th,
									   use_known_camera_poses=False,
									   ransac_n_iter= self.ransac_n_iter,
									   ransac_dist_threshold= self.ransac_dist_threshold,
									   ba_n_iter= self.ba_n_iter)

Then use Non-Maximum Suppression to merge objects that were detected more than once:

objects = predictions['scene/objects']
cameras = predictions['scene/cameras']
reproj = predictions['ba_output']
#print(predictions)
for view_group in np.unique(objects.infos['view_group']):
	objects_ = objects[np.where(objects.infos['view_group'] == view_group)[0]]
	cameras_ = cameras[np.where(cameras.infos['view_group'] == view_group)[0]]
	reproj_ = reproj[np.where(reproj.infos['view_group'] == view_group)[0]]
	objects_ = nms3d(objects_, th= self.nms_th, poses_attr='TWO')

The final output, objects_:

PandasTensorCollection(
    TWO: torch.Size([10, 4, 4]) torch.float32 cuda:0,
----------------------------------------
    infos:
   obj_id     score       label  n_cand  view_group  group_id  scene_id
0       2  5.469747  obj_000016       7           0         0        16
1       0  5.450335  obj_000017       8           0         0        16
2       4  4.098602  obj_000012       8           0         0        16
3       1  3.380887  obj_000010       6           0         0        16
4       5  2.771779  obj_000015       6           0         0        16
5       3  1.453180  obj_000011       4           0         0        16
6       9  1.183983  obj_000014       3           0         0        16
7       8  1.106775  obj_000013       2           0         0        16
)
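
The non-maximum suppression step used above can be illustrated with a simplified sketch. This is not cosypose's `nms3d`: the function name `nms3d_sketch`, the use of translation distance only, and the rule that only objects with the same label suppress each other are all assumptions made for illustration.

```python
import numpy as np


def nms3d_sketch(positions, scores, labels, th):
    """Greedy 3D NMS: visit objects by decreasing score and drop any object
    that lies closer than th to an already kept object with the same label."""
    positions = np.asarray(positions, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    for i in order:
        duplicate = any(
            labels[i] == labels[j]
            and np.linalg.norm(positions[i] - positions[j]) < th
            for j in keep
        )
        if not duplicate:
            keep.append(int(i))
    return keep


positions = [[0.0, 0.0, 0.0], [0.01, 0.0, 0.0], [0.5, 0.0, 0.0], [0.01, 0.0, 0.0]]
scores = [1.0, 2.0, 0.5, 0.3]
labels = ["obj_a", "obj_a", "obj_a", "obj_b"]
keep = nms3d_sketch(positions, scores, labels, th=0.04)
print(keep)
```

Object 0 is removed because object 1 has the same label, a higher score and lies 1 cm away; object 3 survives because its label differs.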

Usage

Please refer to the notebook custom_scene.ipynb.

Posted 2024-11-07 | Updated 2025-07-24 | Review | a minute read (About 135 words)

CosyPose-- Consistent multi-view multi-object 6D pose estimation

Goal

Estimate accurate 6D poses of multiple known objects in a 3D scene captured by multiple cameras with unknown positions

Challenges

Object pose hypotheses made in individual images cannot easily be expressed in a common reference frame when the relative transformations between the cameras are unknown (the relative camera positions are not known).
The single-view 6D object pose hypotheses have gross errors in the form of false positives and missed detections (occlusion from a given viewpoint causes false alarms and misses).
The candidate 6D object poses estimated from input images are noisy as they suffer from depth ambiguities inherent to single-view methods (depth information is usually not very accurate).

Approach

Posted 2024-11-05 | Updated 2025-07-24 | Review | 4 minutes read (About 649 words)

AR2-D2 -- Training a Robot Without a Robot

https://ar2d2.site/

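Background

Video datasets of robots performing tasks are very important, especially for Visual Imitation Learning.
Traditionally, such training videos are obtained by manually guiding a robot through the relevant motions and recording it, which costs a great deal of labor and time. Most importantly, the robot is fixed inside a lab, so the objects and tasks it can reach are limited, and the resulting training data does not cover more everyday scenes.

Solution

The authors propose an iOS app that tracks the motion of the user's hand and renders an AR robot performing the action in the video.

AR2-D2 system details

As shown in the figure above, the design and implementation of AR2-D2 consist of two main components. The first is a phone application that projects an AR robot into the real world, allowing users to interact with physical objects and the AR robot. The second converts the collected videos into a format that can be used to train different behavior cloning agents, which can then be deployed on real robots.

iOS Application

Unity + AR Foundation kit (used to generate a virtual robot arm and place it in the scene)
Sensors: the Apple device's camera and built-in LiDAR
Hand motion is obtained from iOS's own hand pose algorithm together with depth information. This yields the keypoints the robot arm needs to reach, and the arm in the AR view can then move to the specified position.

Training Data Generation

Once the app has produced a video, the human hand is removed and the removed region is inpainted (E2FGVI). The result is a video of the robot arm manipulating objects, which can be used as training data for vision-based imitation learning.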

APP Evaluation

Real Deployment Evaluation

Demonstrations are collected around three common robot tasks: {press, push, pick up}

A Transformer-based, language-guided behavior cloning policy is trained with Perceiver-Actor (PERACT).

PERACT takes a 3D voxel observation and a language goal (v, l) as input and produces discretized outputs for translation, rotation, and gripper state of the end-effector. These outputs, coupled with a motion planner, enable the execution of the task specified by the language goal.

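Each agent performs one task ({press, push, pick up}). It is first trained for 3k iterations and then fine-tuned (3k iterations) to reduce the gap between the iPhone camera and the Kinect v2 camera used by the agent.

Fine-tuning results

Test results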

Posted 2024-10-29 | Updated 2025-07-24 | Review | a few seconds read (About 18 words)

Human-robot interaction for robotic manipulator programming in Mixed Reality

This work is very similar to my graduation project, and it has already been published at ICRA?

Posted 2024-10-28 | Updated 2025-07-24 | Review | 7 minutes read (About 1119 words)

Augmented Reality and Robotics - A Survey and Taxonomy for AR-enhanced Human-Robot Interaction and Robotic Interfaces

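Overview

Although there has been a lot of research in recent years on applying AR to human-robot interaction, most of it lacks systematic analysis.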

Recently, an increasing number of studies in HCI, HRI, and robotics have demonstrated how AR enables better interactions between people and robots. However, often research remains focused on individual explorations and key design strategies, and research questions are rarely analyzed systematically.

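This paper mainly provides a taxonomy of the current AR human-robot interaction field (based on 460 papers).
Research on AR human-robot interaction is divided into the following dimensions: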

approaches to augmenting reality
characteristics of robots
purposes and benefits
classification of presented information
design components and strategies for visual augmentation
interaction techniques and modalities
application domains
evaluation strategies

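The biggest advantage of AR is that it can provide rich visual feedback beyond physical constraints and reduce workers' cognitive load.
The ultimate goal of this survey is to provide a common ground and understanding for the field.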

Definition, Scope, Contribution, Methodology

HRI & Robotic Interfaces

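Robotic systems are not limited to traditional industrial robots; this survey does not restrict itself to any particular kind of robot.
Robotic interfaces mainly refers to "Interfaces that use robots or other actuated systems as medium for HCI".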

Contribution

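The survey presents a taxonomy of the field through design space dimensions
It broadens the literature review across HCI and HRI
It discusses open research questions and opportunities that encourage further research in the field
There is an interactive website: https://ilab.ucalgary.ca/ar-and-robotics/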

Future

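Making AR-HRI more practical
1. Tracking error of head-mounted AR devices (gyroscope); reliability still needs improvement
2. Limitations when used outdoors
New design explorations for AR-HRI
1. AR makes it possible to design robots that are not bound by physical constraints
2. Better development environments (AR development still mainly happens on flat displays; one could consider applications that do the programming inside an AR display)
AR for better decision making (for users)
1. Visualizing scene data
2. Explainable robot operation
Novel interaction design
1. More natural interaction (for example, specifying task objects more naturally)
2. Further blending of the virtual and physical worlds (letting virtual interaction affect real physics; as usual, the most far-fetched item goes last)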

Posted 2024-10-28 | Updated 2025-07-24 | Note | 4 minutes read (About 566 words)

Blog Template For New Hexo User

Preliminaries

Basic principle

Add blog content locally (markdown files) -> hexo generates the web page source from the files -> push it to GitHub with a command -> GitHub deploys the static pages by itself

Basic preparation

Install git

https://www.cnblogs.com/xueweisuoyong/p/11914045.html

GitHub SSH key

An SSH key has to be bound because locally written content is pushed to GitHub.
See: https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent

ssh-keygen -t ed25519 -C "guanshengyuanlu@163.com"

Add the generated public key in the GitHub SSH settings.

Install npm

https://blog.csdn.net/lizhong2008/article/details/133844070
The latest version is fine.

Local setup

Set the local git config

git config --global user.email "guanshengyuanlu@163.com"
git config --global user.name "Draumurvakna"

Clone the repository

git clone git@github.com:Draumurvakna/MIGAO-Blog-Src.git
cd MIGAO-Blog-Src

Set up the environment

git submodule update --recursive --init   
npm update
cd themes/icarus
npm update

Update the website

./show.sh # preview

./deploy.sh # the site can then be visited at https://draumurvakna.github.io/

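Main content

Every article on the site is a markdown text file stored locally in source/_posts.

For example, there are two sample articles here.

Create a new blog post with the command hexo new "article title"

Then just edit the corresponding file.
Typora is the recommended markdown editor, though a plain txt editor works perfectly well if you are hardcore enough!

To add images, put the pictures into the asset folder in the same directory (the folder with the same name as the post) and then enter in the text

See Sample Blog for reference.

After adding the content you want, deploy the site with ./deploy.sh, wait a moment, and the latest changes will show up on the website.

Afterword

I want to change the blog background

Go to the themes/icarus/source/img folder 📁

Replace `lightBG.png` and `darkBG.jpg` with other images, keeping the same names.

If nothing has changed after redeploying, press <Ctrl>+F5 in the web page.

I want to change the avatar 👤

As in the figure above, change avatar.png.

The comment system is something you need to set up yourself.

See https://chen-yulin.github.io/2024/09/03/%5BOBS%5Dhexo-Hexo%20Comment%20System%20--%20Twikoo/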

Posted 2024-10-24 | Updated 2025-07-24 | Review | 10 minutes read (About 1451 words)

Federated Learning Atlas

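Federated Learning (FL), an emerging distributed machine learning approach, has attracted a great deal of research attention. To understand the related research systematically, the following structured reading map is suggested, so that its principles, applications and challenges can be understood step by step.

1. Foundational and conceptual papers

These papers introduce the basic concepts, goals and classic algorithms of federated learning and are the starting point for understanding it.

Konečný, J., et al. (2016). “Federated Learning: Strategies for Improving Communication Efficiency” arXiv
- Introduces the concept of federated learning and proposes structured and sketched updates to reduce the uplink communication cost.
McMahan, H. B., et al. (2017). “Communication-Efficient Learning of Deep Networks from Decentralized Data” arXiv
- Proposes the classic Federated Averaging (FedAvg) algorithm and systematically discusses communication efficiency when training deep learning models in a decentralized setting.
Yang, Q., Liu, Y., Cheng, Y., Kang, Y., Chen, T., & Yu, H. (2019). “Federated Learning” ACM Transactions on Intelligent Systems and Technology (TIST)
- A detailed overview of the basic framework, challenges, techniques and applications of federated learning; suitable as survey reading.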
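
The aggregation step of FedAvg mentioned above can be sketched in a few lines. This is a minimal illustration of the weighted averaging only, assuming each client returns its locally trained weights as a list of NumPy arrays; local training, client sampling and communication are left out, and the function name `fedavg` is chosen here for illustration.

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Average client models layer by layer, weighting each client by the
    number of local training examples it holds."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(weights[k] * (size / total)
            for weights, size in zip(client_weights, client_sizes))
        for k in range(n_layers)
    ]


# Two clients with a single-layer "model"; client 2 holds three times more data.
client_weights = [[np.array([1.0, 1.0])], [np.array([3.0, 5.0])]]
client_sizes = [1, 3]
global_weights = fedavg(client_weights, client_sizes)
print(global_weights[0])
```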

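2. Privacy and security

An important goal of federated learning is to keep data private and secure; research in this area provides the theoretical basis and the technical tools for it.

Bonawitz, K., et al. (2017). “Practical Secure Aggregation for Federated Learning on User-Held Data” arXiv
- Discusses how to implement Secure Aggregation in federated learning, i.e. ensuring that the server cannot learn the content of any individual client's model update, in order to protect user privacy.
Geyer, R. C., Klein, T., & Nabi, M. (2017). “Differentially Private Federated Learning: A Client Level Perspective” arXiv
- Explores how to apply Differential Privacy to federated learning to protect privacy when clients send model updates.
Zhao, Y., et al. (2018). “Federated Learning with Non-IID Data” arXiv
- Discusses how to train models in federated learning when the data is non-IID (not independent and identically distributed), one of the major challenges in practice.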
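
The core idea behind secure aggregation, as summarized for Bonawitz et al. above, is that pairwise random masks cancel in the sum. The sketch below shows only that cancellation: in the real protocol the masks are derived from pairwise key agreement, arithmetic is done over a finite field, and client dropouts are handled. The function name `mask_updates` and the shared random generator are simplifications for illustration.

```python
import numpy as np


def mask_updates(updates, seed=0):
    """Add a random mask to client i and subtract the same mask from client j
    for every pair (i, j), so each masked update looks random on its own."""
    rng = np.random.default_rng(seed)
    masked = [u.astype(float).copy() for u in updates]
    n = len(updates)
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.normal(size=updates[0].shape)
            masked[i] += mask
            masked[j] -= mask
    return masked


updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
masked = mask_updates(updates)

# The server only sees the masked updates, yet their sum equals the true sum.
aggregate = sum(masked)
print(aggregate)
```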

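3. Optimization and efficiency

Communication and computation efficiency is a key research direction in federated learning, and many studies try to reduce the resources consumed during model training in various ways.

Li, T., et al. (2020). “Federated Optimization in Heterogeneous Networks” arXiv
- Discusses how to perform federated optimization when clients differ in compute capability and network resources.
Kairouz, P., et al. (2021). “Advances and Open Problems in Federated Learning” arXiv
- A systematic survey of the state of the art, the challenges and the future research directions of federated learning, covering communication efficiency, model performance, privacy protection and more.
Chen, M., et al. (2020). “A Joint Learning and Communications Framework for Federated Learning over Wireless Networks” arXiv
- Explores how to optimize both learning efficiency and communication efficiency for federated learning over wireless networks.

4. System implementation and tools

To better understand how federated learning is applied in practice and how such systems are built, it helps to look at open-source frameworks and real implementations.

Google AI. “Federated Learning for Mobile Keyboard Prediction” Blog Post
- One of the earliest practical applications of federated learning, describing how Google uses it to improve the predictions of a mobile keyboard.
TensorFlow Federated (TFF): GitHub
- TensorFlow Federated is an open-source framework released by Google for experimenting with federated learning systems. Reading its documentation gives a deeper understanding of the concrete implementation details.

5. Applications of federated learning in different domains

Federated learning is widely applicable across many industries; knowing these applications helps to appreciate its practical significance.

Rieke, N., et al. (2020). “The Future of Digital Health with Federated Learning” arXiv
- Explores the application of federated learning in healthcare, especially how to train models when data cannot be pooled across hospitals.
Hard, A., et al. (2019). “Federated Learning for Mobile Keyboard Prediction” arXiv
- Describes how federated learning is used on smartphones to improve keyboard input prediction.

6. Challenges and future research directions

Looking ahead, federated learning still faces many challenges, such as system heterogeneity and the balance between model performance and privacy protection.

Li, T., et al. (2020). “Federated Learning: Challenges, Methods, and Future Directions” arXiv
- Analyzes the main challenges of federated learning, such as unbalanced data, communication cost and model performance, and proposes some future research directions.

Summary of the federated learning reading map:

Foundations: first understand the basic framework and the classic algorithms of federated learning.
Privacy and security: study data privacy protection and security mechanisms in depth.
Optimization and efficiency: focus on how to optimize communication and computation in federated learning.
System implementation: understand implementation details through tools and real cases.
Application domains: learn how federated learning is applied in different fields.
Challenges and future directions: look ahead to the upcoming challenges and potential research directions.

With this map, you can systematically learn the key areas of federated learning and gradually go deeper into the solutions to specific problems and the research frontier.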

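Posted 2024-10-22 | Updated 2025-07-24 | Note | 2 minutes read (About 232 words)

UM-SJTU Joint Institute (JI) and Ross School of Business dual-degree program

Graduate-level dual-degree program with the Ross School of Business
Master's degrees (course-based) offered by the University of Michigan:

Management, 30 credits
Supply Chain (Management), 30 credits
Business Analytics (BA), 36.5 credits; well suited for switching to business, quantitative analysis (programming required)

The year at the University of Michigan falls in the final year of the three-year program, from June to May of the following year; we recognize 6 of their credits, and they recognize our Gateway.
Admission requirements:

JI graduate student
The other side's online interview

Tuition is 50k to 60k USD (it may be discounted if credits are recognized); living in Ann Arbor is estimated at 1000 USD/month, food at 1000 USD/month and miscellaneous expenses at 500 USD/month.

It can provide a visa opportunity for job hunting (good for those who plan to build a career in the US)
However, the timing conflicts with the autumn recruitment season and makes job interviews in China difficult

The two sides do not issue their degree certificates at the same time

The 2026 cohort includes professional master's admissions (tied to the dual degree)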
