Posted 2024-12-31 | Updated 2025-07-02 | Note | 7 minutes read (About 1016 words)

A Survey of Imitation Learning: Algorithms, Recent Developments, and Challenges

Introduction

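Imitation learning (IL) is an alternative to traditional manual programming for giving robots autonomous capabilities.
IL lets a machine learn the desired behavior from demonstrations (a human expert demonstrating the behavior), which removes the need for explicit programming or a task-specific reward function.
IL has two main categories:

Behavior cloning (BC)
Inverse reinforcement learning (IRL)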

Behavior Cloning

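BC is an IL technique that treats learning a behavior as a supervised learning task. BC trains a model to reproduce the expert's behavior by learning a mapping from environment states to the corresponding expert actions. The expert's behavior is recorded as a set of state-action pairs, also called demonstrations. During training, the model uses these demonstrations to learn a function that maps the current state to the corresponding expert action. Once trained, the model uses this learned function to generate actions for newly encountered states.

It does not require knowledge of the underlying dynamics of the environment, is computationally efficient, and is a relatively simple method.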
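As a minimal sketch of BC as supervised learning, the mapping from states to expert actions can be fit by least squares on state-action pairs. The linear policy, the synthetic expert `W_expert`, and the function names are illustrative assumptions, not something taken from the survey.

```python
import numpy as np

def fit_bc_policy(states: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Fit a linear policy a = s @ W by least squares on expert (state, action) pairs."""
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W

def act(W: np.ndarray, state: np.ndarray) -> np.ndarray:
    """Apply the learned mapping to a (possibly unseen) state."""
    return state @ W

# Hypothetical expert: a fixed linear controller that generates the demonstrations.
rng = np.random.default_rng(0)
W_expert = np.array([[1.0, -0.5], [0.0, 2.0], [0.3, 0.1]])  # 3-dim state -> 2-dim action
demo_states = rng.normal(size=(200, 3))
demo_actions = demo_states @ W_expert

W_bc = fit_bc_policy(demo_states, demo_actions)
new_state = np.array([0.2, -1.0, 0.5])
predicted_action = act(W_bc, new_state)
```

Nothing in this fit uses the environment dynamics, which is why BC is cheap. It also says nothing about states far from the demonstrations.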

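The covariate shift problem: the state distribution observed at test time may differ from the one observed during training, so the agent is prone to errors when it encounters unseen states and has no clear guidance on how to act. The problem with the supervised BC approach is that when the agent drifts and encounters out-of-distribution states, it does not know how to return to the demonstrated states.

To address this problem: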

Inverse Reinforcement Learning

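IRL involves an apprentice agent whose task is to infer the reward function behind the observed demonstrations, which are assumed to come from an expert acting optimally. The inferred reward function is then used to train the learner's policy with RL.

To resolve the ambiguity of the "policy -> reward function" mapping (many reward functions can explain the same policy), there are three families of IRL methods:

maximum-margin methods (the reward function should explain the optimal policy better than any other policy by some margin. In essence this means finding a solution that maximizes a specified margin, so that the derived reward function captures the essence of the expert's behavior.)
maximum entropy (a promising ability to handle expert suboptimality and stochasticity)
guided cost learning (a method that optimizes a nonlinear reward function inside the inner loop of policy optimization. It departs from the traditional IRL paradigm by building the reward function directly from the raw state of the system, which removes the need for extensive feature engineering.)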

Adversarial Imitation Learning

The agent strives to deceive the discriminator by generating trajectories closely resembling those of the expert.

Imitation From Observation

Learning only from image sequences, without needing data on the specific joint actions.

Unlike the traditional methods, IfO presents a more organic approach to learning from experts, mirroring how humans and animals approach imitation. Humans often learn new behaviors by observing others without detailed knowledge of their actions (e.g., the muscle commands). People learn a diverse range of tasks, from weaving to swimming to playing games, by watching online videos. Despite differences in body shapes, sensory inputs, and timing, humans exhibit an impressive ability to apply knowledge gained from the online demonstrations

This extends the pool of learnable resources to online videos.

Latent Action Policies (LAPOs)

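By analyzing the observed dynamics, LAPO infers the underlying structure of the action space, which makes it possible to train latent action policies. These policies can then be fine-tuned efficiently to reach expert-level performance, in both offline and online settings: offline fine-tuning is feasible with a small dataset of labeled actions, while online fine-tuning can be done with rewards. Unlike approaches that rely on labeled data to train an inverse dynamics model, LAPO derives latent action information directly from the observed environment dynamics, without any labels.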

Challenges And Limitations

...

Posted 2024-12-23 | Updated 2025-07-02 | Note | 13 minutes read (About 2017 words)

477 Challenge

Background

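https://validator.w3.org/ : an online tool provided by the World Wide Web Consortium (W3C) for checking whether a page's HTML, XHTML, or other markup conforms to the relevant standards and specifications. It helps developers improve page quality and compatibility and ensures that pages display correctly across browsers and devices.

As an intern you need to build websites, with the following requirements (not important):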

During the next few hours you are explained in details all the requirements you have to match. Among the most emphasized points you learn that you must (i) use the last version of Microsoft Front Page Express to write the websites; (ii) include as many buttons as possible (even if one is enough); (iii) when a hidden box is expanded, do so as high as possible above the button which opened it and do not notify the user; (iv) as much as possible do not disable or hide irrelevant information, simply include it in the middle of the useful content; (v) use and abuse pop ups; (vi) feel free to include Chinese text in the middle of an English text; (vii) if a page includes videos, ensure they are all fully downloaded before the user can do anything. You also learn that they pay much attention to the quality of their product, and as such you should never forget to test your website, IE 6 being recommended.

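Probably 马牛's trademark sarcasm... In short, it sums up a pile of things not to do in web development.
Because of these requirements (in particular, the web development environment only runs on Windows), we, as an intern with a single dual-boot computer, run into an obstacle:

Every time we work on a webpage we have to switch to Windows and pause the tasks running under Linux. Yet we have tasks to run under Linux that need to finish as early as possible.

马牛 seems to have split these Linux tasks into T independent tasks. Judging from the expected output, each task can also be paused so that we can switch to Windows for development.

Some webpage tasks can be removed from the schedule by handing them over to a friend, at most two webpages. Each webpage has an associated amount of exercises used as payment, and the total must not exceed the threshold that triggers the HC.

Notation

Note that the time a computing task needs, $t_i$, and the deadline of a webpage, $t_i$, are not the same thing; it is best to give them distinct symbols.

Input

Information on the computing tasks, the webpages, and the outside help: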

P H
s_0 t_0 h_0 # info for each page
... # omit P-1 lines
T # number of tasks
t_0
... # omit T-1 lines
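A minimal sketch of reading this layout, assuming every value is an integer and that the `#` annotations above are not part of the real input. The function name `parse_input` and the sample values are made up for illustration.

```python
from typing import List, Tuple

def parse_input(text: str) -> Tuple[int, List[Tuple[int, int, int]], List[int]]:
    """Parse 'P H', then P lines 's t h', then 'T', then T task durations."""
    tokens = iter(text.split())
    p, h = int(next(tokens)), int(next(tokens))
    # One (s, t, h) triple per webpage.
    pages = [(int(next(tokens)), int(next(tokens)), int(next(tokens))) for _ in range(p)]
    t = int(next(tokens))
    # One duration per computing task.
    tasks = [int(next(tokens)) for _ in range(t)]
    return h, pages, tasks

# Hypothetical input: 2 webpages, H = 5, 3 computing tasks.
sample = "2 5\n0 3 2\n4 6 4\n3\n1\n2\n10\n"
h, pages, tasks = parse_input(sample)
```

Splitting on whitespace rather than on lines keeps the parser indifferent to how the values are wrapped.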

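Output

We do not need to complete all the computing tasks within one plan. For each computing task, it is enough to compute **the earliest time at which it can finish**. For computing tasks of different lengths, **the allocation strategy may differ**.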

The following solution comes entirely from AI

To solve this problem, you need to combine optimization, scheduling, and greedy strategies. Here’s a step-by-step guide:

1. Parse the Input

Read:
- P,H: Number of webpages and maximum allowable exercises for delegation.
- Details for each webpage (s_i, t_i, h_i): Start time, deadline, and exercise cost.
- T: Number of computational tasks and their durations (t_i).

2. Determine Delegation Strategy

Key Idea:

Delegate up to 2 webpages to the friend such that the total exercise cost (h_i) is minimized and ≤H.

Steps:

Sort webpages by their exercise cost (h_i).
Select up to 2 webpages to delegate:
- Choose pages with the smallest h_i that satisfy h_1+h_2≤H.
Remove these pages from your schedule, as they are handled by the friend.

3. Calculate Free Time Intervals

Key Idea:

After delegating pages, compute the blocked time intervals caused by the remaining webpages.

Steps:

For all non-delegated webpages, construct intervals [s_i, t_i).
Merge overlapping intervals to avoid double-counting blocked time.
Identify the free intervals:
- Initially, it is [0,∞).
- Subtract merged blocked intervals to get the free time slots.
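Step 3 can be sketched as follows, assuming integer times that start at 0 and half-open intervals [s, t). The names `merge_intervals` and `free_intervals`, and the use of `None` for an unbounded end, are choices made for this illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]

def merge_intervals(blocked: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching half-open intervals [s, t)."""
    merged: List[Interval] = []
    for s, t in sorted(blocked):
        if merged and s <= merged[-1][1]:
            # Extend the previous interval instead of counting the overlap twice.
            merged[-1] = (merged[-1][0], max(merged[-1][1], t))
        else:
            merged.append((s, t))
    return merged

def free_intervals(blocked: List[Interval]) -> List[Tuple[int, Optional[int]]]:
    """Return the gaps in [0, inf); an end of None means the slot is unbounded."""
    free: List[Tuple[int, Optional[int]]] = []
    now = 0
    for s, t in merge_intervals(blocked):
        if s > now:
            free.append((now, s))
        now = max(now, t)
    free.append((now, None))
    return free

slots = free_intervals([(4, 6), (0, 3), (5, 9)])
```

Here the blocked intervals merge into [0, 3) and [4, 9), which leaves the free slots [3, 4) and [9, inf).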

4. Schedule Computational Tasks

Key Idea:

Fit each computational task into the free intervals to minimize its completion time.

Steps:

Sort computational tasks by their required time (t_i).
Iterate over each task:
- Start placing the task in the earliest available free interval.
- If the task fits, calculate its completion time and reduce the interval size.
- If it doesn’t fit, move to the next free interval.
Store the completion time for each task.

5. Output Results

For each computational task, output the minimum completion time determined in the previous step.
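The sketch below treats every task as an independent query against the free slots, which is how the output is described earlier in these notes. `earliest_fit` follows step 4 as written, where a task must fit inside a single slot. `earliest_with_pauses` is an alternative under the assumption that a task may be paused and resumed. Both names and the slot representation are hypothetical.

```python
from typing import List, Optional, Tuple

Slot = Tuple[int, Optional[int]]  # (start, end); end=None means unbounded

def earliest_fit(free: List[Slot], duration: int) -> int:
    """Finish time when the task has to fit inside one free slot."""
    for start, end in free:
        if end is None or end - start >= duration:
            return start + duration
    raise ValueError("the last free slot is expected to be unbounded")

def earliest_with_pauses(free: List[Slot], duration: int) -> int:
    """Finish time when the task may be paused: consume free slots in order."""
    remaining = duration
    for start, end in free:
        if end is None or end - start >= remaining:
            return start + remaining
        remaining -= end - start  # use the whole slot, then continue in the next one
    raise ValueError("the last free slot is expected to be unbounded")

free_slots = [(0, 2), (5, 6), (9, None)]
fit_times = [earliest_fit(free_slots, d) for d in (1, 3)]
pause_times = [earliest_with_pauses(free_slots, d) for d in (1, 3)]
```

For the 3-unit task the two variants disagree (12 versus 6), so which assumption matches the real problem statement matters.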

Algorithm Complexity

Input Parsing: O(P+T)
Delegation:
- Sorting webpages: O(P \log P)
- Selecting 2 webpages: O(1)
Free Interval Calculation:
- Merging intervals: O(P \log P)
Task Scheduling:
- Sorting tasks: O(T \log T)
- Scheduling: O(T×free intervals), typically O(T + P).

Overall complexity: O((P + T) \log (P + T)).
This complexity is acceptable for the given input range.

OCaml Implementation

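Here is a sketch in OCaml that follows the steps above. The helper bodies (Scanf-based parsing, the open-ended last free slot encoded as max_int, tasks handled as independent queries) are assumptions filled in to make the outline runnable:

(* Input parsing: "P H", then P lines "s t h", then T, then T durations *)
let read_input () =
  let p, h = Scanf.scanf " %d %d" (fun p h -> (p, h)) in
  let webpages =
    List.init p (fun _ -> Scanf.scanf " %d %d %d" (fun s t c -> (s, t, c))) in
  let t = Scanf.scanf " %d" (fun t -> t) in
  let tasks = List.init t (fun _ -> Scanf.scanf " %d" (fun d -> d)) in
  (h, webpages, tasks)

(* Delegate at most two webpages, cheapest exercise cost first *)
let delegate_webpages webpages h =
  let sorted_pages =
    List.sort (fun (_, _, h1) (_, _, h2) -> compare h1 h2) webpages in
  let rec select_delegation remaining_pages acc h_left count =
    match remaining_pages with
    | (s, t, c) :: tl when count < 2 && c <= h_left ->
        select_delegation tl ((s, t, c) :: acc) (h_left - c) (count + 1)
    | _ -> (acc, remaining_pages)
  in
  select_delegation sorted_pages [] h 0

(* Merge overlapping or adjacent blocked intervals *)
let merge_intervals intervals =
  List.sort compare intervals
  |> List.fold_left (fun acc (s, t) ->
       match acc with
       | (ps, pt) :: tl when s <= pt -> (ps, max pt t) :: tl
       | _ -> (s, t) :: acc) []
  |> List.rev

(* Gaps between merged blocked intervals; the last gap ends at max_int *)
let calculate_free_time_slots merged =
  let rec gaps now = function
    | [] -> [ (now, max_int) ]
    | (s, t) :: tl ->
        if s > now then (now, s) :: gaps (max now t) tl else gaps (max now t) tl
  in
  gaps 0 merged

(* Compute free intervals *)
let compute_free_intervals remaining_pages =
  let intervals = List.map (fun (s, t, _) -> (s, t)) remaining_pages in
  calculate_free_time_slots (merge_intervals intervals)

(* First free slot that can hold the whole task, as step 4 describes *)
let find_earliest_completion free_intervals task =
  let a, _ = List.find (fun (a, b) -> b - a >= task) free_intervals in
  a + task

(* Schedule tasks: each task is an independent query *)
let schedule_tasks free_intervals tasks =
  List.map (find_earliest_completion free_intervals) tasks

(* Main function *)
let () =
  let h, webpages, tasks = read_input () in
  let _delegated, remaining_pages = delegate_webpages webpages h in
  let free_intervals = compute_free_intervals remaining_pages in
  List.iter (Printf.printf "%d\n") (schedule_tasks free_intervals tasks)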

The following proof comes entirely from AI

Proof of Correctness for the Algorithm

The algorithm is designed to solve the problem of scheduling computational tasks on a single computer while meeting constraints related to webpage deadlines and delegation. Below is a formal proof of its correctness.

1. Problem Understanding

The problem involves:

Allocating time for webpage writing (either by handling them yourself or delegating to a friend).
Maximizing free intervals for computational tasks.
Minimizing the completion times of computational tasks.

To ensure correctness, we need to prove:

Webpage constraints are respected.
Free time intervals are accurately calculated.
Computational tasks are scheduled optimally.

2. Proof Components

2.1 Webpage Delegation

Key Idea: Delegate up to 2 webpages with the smallest exercise costs such that the total cost does not exceed H.
Proof:
- By sorting webpages by h_i (exercise cost) and selecting the top 2 that satisfy h1+h2≤H, the algorithm guarantees that:
  - The delegation respects the maximum exercise limit H.
  - The webpages with the least impact on constraints are delegated, maximizing the remaining free time.

2.2 Free Interval Calculation

Key Idea: Compute free intervals by merging blocked intervals caused by non-delegated webpages.
Proof:
- Non-delegated webpages define intervals [s_i, t_i) that block the computer.
- Merging overlapping intervals ensures no double-counting of blocked time, leaving only valid free intervals.
- The algorithm iteratively merges intervals by sorting them by s_i and combining overlapping or adjacent intervals, which is a standard and proven approach.

2.3 Task Scheduling

Key Idea: Fit each computational task into the earliest available free interval to minimize its completion time.
Proof:
- Tasks are processed in increasing order of required time (t_i), ensuring that smaller tasks are scheduled earlier, leaving larger free intervals for longer tasks.
- For each task:
  - The algorithm scans free intervals and places the task in the earliest slot where it fits.
  - This greedy approach guarantees the earliest possible completion time for each task because:
    - Free intervals are disjoint and sorted.
    - A task placed in an earlier slot cannot complete earlier in any other configuration without violating disjointness.

2.4 Overall Algorithm

Correctness Statement: The algorithm produces the earliest possible completion times for all tasks while satisfying webpage constraints and delegation limits.
Proof:
- Webpages: Constraints are respected by delegating up to 2 pages within H exercises and blocking the necessary intervals for non-delegated pages.
- Free Intervals: Accurate calculation ensures no overlap or error in available time slots.
- Task Scheduling: Greedy placement in the earliest available interval ensures optimal completion times.

4. Edge Cases

P=0,T=0: No webpages or tasks, trivial solution.
H=0: No delegation possible, all webpages must be handled by the user.
Overlapping webpage intervals: Correctly merged to avoid double-counting.
Tasks larger than any free interval: Scheduled in the earliest free interval where they fit.

5. Conclusion

The algorithm is correct as it:

Adheres to all constraints (webpage deadlines, delegation limits, task requirements).
Ensures optimal scheduling of computational tasks.
Handles edge cases gracefully.


Posted 2024-12-17 | Updated 2025-07-02 | Note | 17 minutes read (About 2557 words)

The Bitter Lesson

Rich Sutton

March 13, 2019

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent.

In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that “brute force” search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers’ initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.

In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked—they tried to put that knowledge in their systems—but it proved ultimately counterproductive, and a colossal waste of researcher’s time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.

In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.


Posted 2024-12-14 | Updated 2025-07-02 | Note | 3 minutes read (About 485 words)

Cuda+Torch

nvidia-smi reports the newest CUDA version that the driver supports.
The CUDA version installed system-wide does not matter; torch prefers the CUDA version installed in the virtual environment.

Managing CUDA with Conda

Install a specific version of cuda-toolkit:

conda install nvidia/label/cuda-12.4.0::cuda-toolkit -c nvidia/label/cuda-12.4.0

Install the latest version:

conda install cuda-toolkit

Some repositories need the CUDA paths to be set before their packages can be compiled:

conda env config vars set LD_LIBRARY_PATH="/home/cyl/miniconda3/envs/gsam/lib/python3.10/site-packages/nvidia/cuda_runtime/lib/:$LD_LIBRARY_PATH"
conda env config vars set CPATH="/home/cyl/miniconda3/envs/gsam/lib/python3.10/site-packages/nvidia/cuda_runtime/include/:$CPATH"
conda env config vars set CUDA_HOME="/home/cyl/miniconda3/envs/gsam/"

Note: after changing the library paths, the LSP in nvim reports errors, so it is best to change them back afterwards:

conda env config vars set LD_LIBRARY_PATH=""
conda env config vars set CPATH=""
conda env config vars set CUDA_HOME=""

Note: To find the correct path for CUDA_HOME, use which nvcc. In my case, the output of the command was:

>>> which nvcc
/home/user/miniconda3/envs/py12/bin/nvcc

Therefore, I set CUDA_HOME to /home/user/miniconda3/envs/py12/.

Note: To find the correct path for LD_LIBRARY_PATH use find ~ -name cuda_runtime_api.h. In my case, output of the command was:

>>> find ~ -name cuda_runtime_api.h
...
/home/user/miniconda3/envs/py12/targets/x86_64-linux/include/cuda_runtime_api.h
...

So I set the LD_LIBRARY_PATH as /home/user/miniconda3/envs/py12/targets/x86_64-linux/lib/ and CPATH as /home/user/miniconda3/envs/py12/targets/x86_64-linux/include/. If you have multiple CUDA installations, the output of find ~ -name cuda_runtime_api.h will display multiple paths. Make sure to choose the path that corresponds to the environment you have created.
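The `which nvcc` lookup can also be done from Python. This is a small sketch that assumes the usual `<prefix>/bin/nvcc` layout, so that the prefix is what goes into CUDA_HOME. The helper names are hypothetical.

```python
import shutil
from pathlib import PurePosixPath
from typing import Optional

def cuda_home_from_nvcc(nvcc_path: str) -> str:
    """Strip the trailing bin/nvcc to get the prefix used as CUDA_HOME."""
    return str(PurePosixPath(nvcc_path).parent.parent)

def detect_cuda_home() -> Optional[str]:
    """Return the prefix of the nvcc found on PATH, or None if nvcc is not available."""
    nvcc = shutil.which("nvcc")
    return cuda_home_from_nvcc(nvcc) if nvcc else None

# Path taken from the example output above.
example = cuda_home_from_nvcc("/home/user/miniconda3/envs/py12/bin/nvcc")
```

Running `detect_cuda_home()` inside the activated conda environment should pick up that environment's nvcc, provided the environment's bin directory comes first on PATH.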

ref:https://github.com/IDEA-Research/GroundingDINO/issues/355

Note: Always reboot the computer after CUDA is upgraded.

Note: changing LD_LIBRARY_PATH may stop pyright from running in neovim, so it is best to reset the variable once compilation is done:

conda env config vars set LD_LIBRARY_PATH=""

`cudatoolkit` and `cuda-toolkit`

Both can be installed at the same time.
Without cudatoolkit, compilation may fail with the error ld: cannot find -lcudart: No such file or directory collect2: error: ld returned 1 exit status

Use the following command to get the version information:

python -c 'import torch;print(torch.__version__);print(torch.version.cuda)'

2.0.0+cu117
11.7

Posted 2024-12-10 | Updated 2025-07-02 | Note | a minute read (About 174 words)

Use SSH to Connect Jupyter-lab

Use ssh as the remote command-line tool, start jupyter lab on the remote machine, and open it in the local browser.

Run on the remote machine:

jupyter lab --no-browser --port=8080

--no-browser is very important

output:

...
[I 2024-12-10 14:30:24.585 ServerApp]     http://127.0.0.1:8080/lab?token=0061d1eb31396b1bc3cd77a7161b2084da1dedcdeca0600c
[I 2024-12-10 14:30:24.586 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-12-10 14:30:24.603 ServerApp]

    To access the server, open this file in a browser:
        file:///home/bohanfeng/.local/share/jupyter/runtime/jpserver-11659-open.html
    Or copy and paste one of these URLs:
        http://localhost:8080/lab?token=0061d1eb31396b1bc3cd77a7161b2084da1dedcdeca0600c
        http://127.0.0.1:8080/lab?token=0061d1eb31396b1bc3cd77a7161b2084da1dedcdeca0600c

Run locally:

ssh -L 8080:localhost:8080 bohanfeng@192.168.2.102

Open in the local browser:
http://127.0.0.1:8080/lab?token=0061d1eb31396b1bc3cd77a7161b2084da1dedcdeca0600c
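If local port 8080 is already taken, the tunnel can use a different local port, and the URL printed by jupyter lab has to be adjusted to match. A small sketch of that bookkeeping follows. The helper names, the local port 9090, and the placeholder token are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def tunnel_command(user: str, host: str, remote_port: int, local_port: int) -> str:
    """Build the ssh command that forwards local_port to the remote Jupyter port."""
    return f"ssh -L {local_port}:localhost:{remote_port} {user}@{host}"

def local_url(remote_url: str, local_port: int) -> str:
    """Rewrite the URL printed by jupyter lab so it points at the local end of the tunnel."""
    parts = urlsplit(remote_url)
    return urlunsplit(parts._replace(netloc=f"127.0.0.1:{local_port}"))

cmd = tunnel_command("bohanfeng", "192.168.2.102", remote_port=8080, local_port=9090)
url = local_url("http://127.0.0.1:8080/lab?token=TOKEN", local_port=9090)
```

The path and the token query string are left untouched; only the host and port change.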

Posted 2024-12-03 | Updated 2025-07-02 | Note | 4 minutes read (About 609 words)

FLamby

Repository: https://github.com/owkin/FLamby

Installation

git clone https://github.com/owkin/FLamby.git
cd FLamby
conda env create -f environment.yml
conda activate flamby
pip install -e .[all_extra]
pip install wget
pip install lifelines
pip install jupyterlab

Dataset

Fed-TCGA-BRCA
https://owkin.github.io/FLamby/fed_tcga_brca.html

Baseline Learning

import torch
from flamby.utils import evaluate_model_on_tests

# 2 lines of code to change to switch to another dataset
from flamby.datasets.fed_tcga_brca import (
    BATCH_SIZE,
    LR,
    NUM_EPOCHS_POOLED,
    Baseline,
    BaselineLoss,
    metric,
    NUM_CLIENTS,
    Optimizer,
)
from flamby.datasets.fed_tcga_brca import FedTcgaBrca as FedDataset

Import several macros, datasets and metrics.

# Instantiation of local train set (and data loader), baseline loss function, baseline model, default optimizer
train_dataset = FedDataset(center=0, train=True, pooled=False)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
lossfunc = BaselineLoss()
model = Baseline()
optimizer = Optimizer(model.parameters(), lr=LR)

In this script, the pooled parameter is set to False when creating the FedDataset instances. This indicates that the dataset is not pooled, meaning that the data is kept separate for each client or center. Each client or center has its own local dataset, which is a common setup in federated learning to simulate real-world scenarios where data is distributed across different locations or devices.

# Traditional pytorch training loop
for epoch in range(0, NUM_EPOCHS_POOLED):
    for idx, (X, y) in enumerate(train_dataloader):
        optimizer.zero_grad()
        outputs = model(X)
        loss = lossfunc(outputs, y)
        loss.backward()
        optimizer.step()

A standard training loop.

# Evaluation
# Instantiation of a list of the local test sets
test_dataloaders = [
            torch.utils.data.DataLoader(
                FedDataset(center=i, train=False, pooled=False),
                batch_size=BATCH_SIZE,
                shuffle=False,
                num_workers=0,
            )
            for i in range(NUM_CLIENTS)
        ]
# Function performing the evaluation
dict_cindex = evaluate_model_on_tests(model, test_dataloaders, metric)
print(dict_cindex)

The evaluation metric is lifelines.utils.concordance_index, which returns the c-index.

Federated Learning

import torch
from flamby.utils import evaluate_model_on_tests

# 2 lines of code to change to switch to another dataset
from flamby.datasets.fed_tcga_brca import (
    BATCH_SIZE,
    LR,
    NUM_EPOCHS_POOLED,
    Baseline,
    BaselineLoss,
    metric,
    NUM_CLIENTS,
    get_nb_max_rounds
)
from flamby.datasets.fed_tcga_brca import FedTcgaBrca as FedDataset

# 1st line of code to change to switch to another strategy
from flamby.strategies.fed_avg import FedAvg as strat

Use `FedAvg` as the strategy.

# We loop on all the clients of the distributed dataset and instantiate associated data loaders
train_dataloaders = [
            torch.utils.data.DataLoader(
                FedDataset(center = i, train = True, pooled = False),
                batch_size = BATCH_SIZE,
                shuffle = True,
                num_workers = 0
            )
            for i in range(NUM_CLIENTS)
        ]

lossfunc = BaselineLoss()
m = Baseline()

# Federated Learning loop
# 2nd line of code to change to switch to another strategy (feed the FL strategy the right HPs)
args = {
    "training_dataloaders": train_dataloaders,
    "model": m,
    "loss": lossfunc,
    "optimizer_class": torch.optim.SGD,
    "learning_rate": LR / 10.0,
    "num_updates": 100,
    # This helper function returns the number of rounds necessary to perform approximately as many
    # epochs on each local dataset as with the pooled training
    "nrounds": get_nb_max_rounds(100),
}
s = strat(**args)
m = s.run()[0]

# Evaluation
# We only instantiate one test set in this particular case: the pooled one
test_dataloaders = [
            torch.utils.data.DataLoader(
                FedDataset(train = False, pooled = True),
                batch_size = BATCH_SIZE,
                shuffle = False,
                num_workers = 0,
            )
        ]
dict_cindex = evaluate_model_on_tests(m, test_dataloaders, metric)
print(dict_cindex)
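For intuition, the aggregation step of the standard FedAvg algorithm is a sample-weighted average of the client parameters. The sketch below shows only that step with NumPy. It is not FLamby's implementation, and whether the `FedAvg` class weights clients exactly this way has not been checked here. The parameter name and the sample counts are made up.

```python
import numpy as np
from typing import Dict, List

def fed_avg(client_weights: List[Dict[str, np.ndarray]],
            num_samples: List[int]) -> Dict[str, np.ndarray]:
    """Average client parameters, weighting each client by its number of samples."""
    total = float(sum(num_samples))
    return {
        name: sum(w[name] * (n / total) for w, n in zip(client_weights, num_samples))
        for name in client_weights[0]
    }

# Two hypothetical clients holding the same parameter tensor after local training.
clients = [
    {"linear.weight": np.array([1.0, 1.0])},
    {"linear.weight": np.array([3.0, 3.0])},
]
global_weights = fed_avg(clients, num_samples=[1, 3])
```

With 1 and 3 samples, the second client contributes three quarters of the average, so the aggregated weight is 2.5 in each coordinate.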

FedAvg vs FedAvgFineTuning

FedAvg

FedAvgFineTuning

Posted 2024-11-21 | Updated 2025-07-02 | Note | a few seconds read (About 78 words)

Reconstruct Anything

Convert Raw RGB-D to tree-structure scene(maybe in unity), for more

Raw point cloud (voxel) to semantic segmented cloud
classify the segmented object into different structure and establish parent-child relationship
identify the state of object
1. pose like, numbers (\\)
2. state like, open/close (LLM)

The idea turns out to be very close to that of a paper recently published by lff: https://arxiv.org/html/2410.07408v1

Comparison with scene understanding

Posted 2024-11-17 | Updated 2025-07-02 | Note | 18 minutes read (About 2722 words)

Cosypose modification

Setup

Repository: https://github.com/Simple-Robotics/cosypose

git clone --recurse-submodules https://github.com/Simple-Robotics/cosypose.git
cd cosypose
conda env create -n cosypose --file environment.yaml

Note that during this step pip reports that setuptools and matplotlib-inline are not compatible with Python 3.7.6; install compatible versions manually inside the environment:

conda activate cosypose
pip install setuptools==63.4.1
pip install matplotlib-inline==0.1.6

git lfs pull
python setup.py install
python setup.py develop

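Download the data as described in the README.
Note that the first block of commands fails to download. According to https://bop.felk.cvut.cz/datasets/ the download links have moved to Hugging Face; the test set can be downloaded manually from https://huggingface.co/datasets/bop-benchmark/datasets/tree/main/ycbv and placed in local_data/bop_datasets/ycbv/test

Set the models used for testing (-r because the source is a directory):

cp -r ./local_data/bop_datasets/ycbv/model_bop_compat_eval ./local_data/bop_datasets/ycbv/models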

Debug

`np.where(mask)[0].item()`

Running

export CUDA_VISIBLE_DEVICES=0
python -m cosypose.scripts.run_cosypose_eval --config ycbv

raises the following error:

Traceback (most recent call last):
  File "/home/cyl/.conda/envs/cosypose/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cyl/.conda/envs/cosypose/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cyl/cosypose/cosypose/scripts/run_cosypose_eval.py", line 491, in <module>
    main()
  File "/home/cyl/cosypose/cosypose/scripts/run_cosypose_eval.py", line 332, in main
    scene_ds = make_scene_dataset(ds_name)
  File "/home/cyl/cosypose/cosypose/datasets/datasets_cfg.py", line 68, in make_scene_dataset
    ids.append(np.where(mask)[0].item())
ValueError: can only convert an array of size 1 to a Python scalar

Adding debug output gives:

Debug - scene_id: 48, view_id: 1
Debug - mask matches: 1
Debug - where result shape: (1,), values: [225]
Debug - scene_id: 48, view_id: 36
Debug - mask matches: 1
Debug - where result shape: (1,), values: [226]
Debug - scene_id: 48, view_id: 47
Debug - mask matches: 1
Debug - where result shape: (1,), values: [227]
Debug - scene_id: 48, view_id: 83
Debug - mask matches: 1
Debug - where result shape: (1,), values: [228]
Debug - scene_id: 48, view_id: 112
Debug - mask matches: 1
Debug - where result shape: (1,), values: [229]
Debug - scene_id: 48, view_id: 135
Debug - mask matches: 0
Debug - where result shape: (0,), values: []
0:00:00.912023 - Expected exactly one match, got 0 matches for scene_id=48, view_id=135

It turns out that the downloaded test set does not contain every frame listed in the dataset's keyframe.txt, so some keyframes cannot be found.
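One way to make this lookup tolerant of missing frames is to check the number of matches before converting to a scalar. The sketch below is a standalone illustration, not the code in `datasets_cfg.py`. The function name and the toy arrays are hypothetical, and skipping frames changes which frames are evaluated, so the results may not be comparable with numbers computed on the full keyframe list.

```python
import numpy as np
from typing import Optional

def find_frame_index(scene_ids: np.ndarray, view_ids: np.ndarray,
                     scene_id: int, view_id: int) -> Optional[int]:
    """Return the index of the unique (scene_id, view_id) row, or None if the frame is missing."""
    mask = (scene_ids == scene_id) & (view_ids == view_id)
    matches = np.where(mask)[0]
    if matches.size == 0:
        return None  # keyframe is listed but absent from the downloaded data
    if matches.size > 1:
        raise ValueError(f"duplicate frame: scene_id={scene_id}, view_id={view_id}")
    return int(matches[0])

# Toy frame index: three frames of scene 48 are available locally.
scene_ids = np.array([48, 48, 48])
view_ids = np.array([1, 36, 47])
keyframes = [(48, 1), (48, 36), (48, 135)]

ids = []
for s, v in keyframes:
    idx = find_frame_index(scene_ids, view_ids, s, v)
    if idx is not None:
        ids.append(idx)
```

Here the keyframe (48, 135) is skipped instead of triggering the `ValueError` from `.item()` on an empty array.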

When a run is terminated halfway

To start over with a fresh run: clear local_data/joblib_cache

Framework

Prediction Script `cosypose.scripts.run_cosypose_eval`

AI explanation

The script predicts object poses based on multi-view input by following these steps:

Dataset Loading: It first loads the dataset using the make_scene_dataset function, which prepares the scene data for evaluation. The dataset is wrapped in a MultiViewWrapper to handle multiple views.
Model Loading: The script loads pre-trained models for pose prediction using the load_models function. It loads both coarse and refiner models based on the configuration specified in the command-line arguments.
Prediction Setup: The script sets up the prediction parameters, including the number of iterations for coarse and refiner models, and whether to skip multi-view processing based on the number of views specified.
Multi-view Prediction: The MultiviewScenePredictor is initialized with the mesh database, which is used to predict poses across multiple views. The MultiviewPredictionRunner is then used to run predictions on the dataset, leveraging the multi-view setup to improve pose estimation accuracy.
Pose Estimation: The script uses the loaded models to predict object poses. It processes detections from either pix2pose or posecnn depending on the dataset, and refines these predictions using the refiner model.
Evaluation: After predictions, the script evaluates the predicted poses using the PoseEvaluation class. It calculates various metrics like ADD-S and AUC to assess the accuracy of the pose predictions.
Results Logging: Finally, the script logs the results, including evaluation metrics, and saves them to a specified directory.

The multi-view approach allows the script to leverage information from different viewpoints, which can help resolve ambiguities and improve the robustness of the pose estimation.

Prediction Script `run_custom_scenario`

Terms

TCO

Transformation from Camera to Object.
It is the transformation matrix (or equivalent parameters) that describes the pose of an object relative to the camera's coordinate system.

TWO

Transformation from World to Object.
It is the transformation matrix (or equivalent parameters) that describes the pose of an object relative to the world's coordinate system.
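The two terms are linked through the camera pose TWC. Here is a small numeric sketch, assuming the usual convention that T_AB maps coordinates in frame B to frame A, so that TWO = TWC @ TCO. The helper names and the numbers are made up for illustration.

```python
import numpy as np


def make_transform(R, t) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def invert_transform(T: np.ndarray) -> np.ndarray:
    """Invert a rigid transform using R^T instead of a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


# Camera pose in the world frame: rotated 90 degrees about z, shifted along x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])
TWC = make_transform(Rz, [1.0, 0.0, 0.0])
# Object pose in the camera frame: 0.5 units along the camera z axis, no rotation.
TCO = make_transform(np.eye(3), [0.0, 0.0, 0.5])

TWO = TWC @ TCO                         # object pose in the world frame
TCO_back = invert_transform(TWC) @ TWO  # back to the camera frame
```

Going the other way, a camera-frame pose is recovered from a world-frame pose with the inverse camera pose, as the last line shows.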

Model dataset

class MeshDataBase:
    def __init__(self, obj_list):
        self.infos = {obj['label']: obj for obj in obj_list}
        self.meshes = {l: trimesh.load(obj['mesh_path']) for l, obj in self.infos.items()}

    @staticmethod
    def from_object_ds(object_ds):
        obj_list = [object_ds[n] for n in range(len(object_ds))]
        return MeshDataBase(obj_list)
...

The usual way to initialize it:

object_ds = BOPObjectDataset(scenario_dir / 'models')
mesh_db = MeshDataBase.from_object_ds(object_ds)

It can also be loaded together with the models via load_models:

predictor, mesh_db = load_models(coarse_run_id, refiner_run_id, n_workers=n_plotters, object_set=object_set)

Important Classes

`Multiview_wrapper`

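Purpose:
Reads scene_dataset and splits the data into separate multi-view scenes of n_views views each, so that the scene elements can be iterated over conveniently (everything here is ground truth).
Each item returned during iteration contains:

- n_views RGB images taken from different viewpoints
- the n_views corresponding masks
- the n_views corresponding observations, each holding:
  - the poses and labels of the objects
  - the camera pose and intrinsics
  - frame_info, which is of little use

scene_ds_pred = MultiViewWrapper(scene_ds, n_views=n_views)
scene_ds_pred[0][2] # scene 48, multiview group 1: observations in five views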

[
 {'objects': 
  [
   {'label': 'obj_000001',
    'name': 'obj_000001',
    'TWO': array([[-0.02062261, -0.99870347, -0.04654345, -0.05380909],
           [ 0.99854439, -0.022895  ,  0.04883047,  0.00189095],
           [-0.04983272, -0.04546878,  0.9977229 ,  0.07060698],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[-0.02062261, -0.99870347, -0.04654345, -0.05380909],
           [ 0.99854439, -0.022895  ,  0.04883047,  0.00189095],
           [-0.04983272, -0.04546878,  0.9977229 ,  0.07060698],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.7769277845777234,
    'id_in_segm': 1,
    'bbox': [347, 210, 467, 374]},
   {'label': 'obj_000006',
    'name': 'obj_000006',
    'TWO': array([[-0.40056693,  0.91475543, -0.05262471,  0.03103553],
           [-0.91622629, -0.39934108,  0.03248866, -0.02365388],
           [ 0.00870386,  0.06123014,  0.9980863 ,  0.01391488],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[-0.40056693,  0.91475543, -0.05262471,  0.03103553],
           [-0.91622629, -0.39934108,  0.03248866, -0.02365388],
           [ 0.00870386,  0.06123014,  0.9980863 ,  0.01391488],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.9990349353406678,
    'id_in_segm': 2,
    'bbox': [328, 343, 422, 405]},
   {'label': 'obj_000014',
    'name': 'obj_000014',
    'TWO': array([[ 0.24178672, -0.96941339, -0.04215706, -0.05206396],
           [ 0.96977496,  0.2399519 ,  0.0442575 ,  0.0179453 ],
           [-0.03278805, -0.05158388,  0.99813144,  0.16636215],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[ 0.24178672, -0.96941339, -0.04215706, -0.05206396],
           [ 0.96977496,  0.2399519 ,  0.0442575 ,  0.0179453 ],
           [-0.03278805, -0.05158388,  0.99813144,  0.16636215],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.9938250428816466,
    'id_in_segm': 3,
    'bbox': [372, 143, 490, 241]},
   {'label': 'obj_000019',
    'name': 'obj_000019',
    'TWO': array([[-0.69888905,  0.1926738 , -0.68878937,  0.01412755],
           [ 0.711967  ,  0.27928957, -0.64428215,  0.05127768],
           [ 0.06823575, -0.94067797, -0.33237011,  0.06472594],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[-0.69888905,  0.1926738 , -0.68878937,  0.01412755],
           [ 0.711967  ,  0.27928957, -0.64428215,  0.05127768],
           [ 0.06823575, -0.94067797, -0.33237011,  0.06472594],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.9890470974808324,
    'id_in_segm': 4,
    'bbox': [419, 222, 527, 410]},
   {'label': 'obj_000020',
    'name': 'obj_000020',
    'TWO': array([[-0.74512542, -0.66691536,  0.00352083,  0.07854437],
           [-0.6669148 ,  0.74507458, -0.00940455, -0.15283599],
           [ 0.00364864, -0.00935569, -0.99995023,  0.01854317],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'T0O': array([[-0.74512542, -0.66691536,  0.00352083,  0.07854437],
           [-0.6669148 ,  0.74507458, -0.00940455, -0.15283599],
           [ 0.00364864, -0.00935569, -0.99995023,  0.01854317],
           [ 0.        ,  0.        ,  0.        ,  1.        ]]),
    'visib_fract': 0.9953060637992145,
    'id_in_segm': 5,
    'bbox': [92, 328, 288, 442]}],
  'camera': 
  {'T0C': array([[-0.0792652 ,  0.241296  , -0.967209  ,  0.946419  ],
          [ 0.996102  ,  0.0568396 , -0.0674529 , -0.02116569],
          [ 0.0386997 , -0.968786  , -0.244861  ,  0.36645836],
          [ 0.        ,  0.        ,  0.        ,  1.        ]]),
   'K': array([[1.066778e+03, 0.000000e+00, 3.129869e+02],
          [0.000000e+00, 1.067487e+03, 2.413109e+02],
          [0.000000e+00, 0.000000e+00, 1.000000e+00]]),
   'TWC': array([[-0.0792652 ,  0.241296  , -0.967209  ,  0.946419  ],
          [ 0.996102  ,  0.0568396 , -0.0674529 , -0.02116569],
          [ 0.0386997 , -0.968786  , -0.244861  ,  0.36645836],
          [ 0.        ,  0.        ,  0.        ,  1.        ]]),
   'resolution': torch.Size([480, 640])},
  'frame_info': 
  {
   'scene_id': 48,
   'cam_id': 'cam',
   'view_id': 1626,
   'cam_name': 'cam',
   'group_id': 0
   }
  },
  ... # other views
]
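The grouping idea can be sketched without CosyPose. The code below assumes consecutive chunks of n_views views per scene and drops incomplete groups. The real MultiViewWrapper may order or sample views differently, and `group_views` is a hypothetical name.

```python
import pandas as pd


def group_views(frame_index: pd.DataFrame, n_views: int):
    """Split each scene's views into groups of n_views (incomplete groups are dropped)."""
    groups = []
    for scene_id, scene_frames in frame_index.groupby("scene_id"):
        view_ids = scene_frames["view_id"].tolist()
        starts = range(0, len(view_ids) - n_views + 1, n_views)
        for group_id, start in enumerate(starts):
            groups.append({
                "scene_id": int(scene_id),
                "group_id": group_id,
                "view_ids": view_ids[start:start + n_views],
            })
    return groups


# Toy frame table: scene 48 has five views, scene 49 has three.
frames = pd.DataFrame({"scene_id": [48] * 5 + [49] * 3,
                       "view_id": [1, 2, 3, 4, 5, 10, 11, 12]})
groups = group_views(frames, n_views=2)
```

Each entry of `groups` corresponds to one multi-view item, similar in spirit to what `scene_ds_pred[i]` returns for one view group.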

`MultiviewPredictionRunner`

Purpose:
Takes a MultiViewWrapper dataset as input and produces predictions.

First, how the dataset is received:

dataloader = DataLoader(scene_ds, batch_size=batch_size,
						num_workers=n_workers,
						sampler=sampler,
						collate_fn=self.collate_fn)

It uses collate_fn to process the raw data (the comments at the end note which data is actually used):

def collate_fn(self, batch):
	batch_im_id = -1

	cam_infos, K = [], []
	det_infos, bboxes = [], []
	for n, data in enumerate(batch): # normally only one batch
		assert n == 0
		images, masks, obss = data
		for c, obs in enumerate(obss): # iterate along different views
			batch_im_id += 1
			frame_info = obs['frame_info']
			im_info = {k: frame_info[k] for k in ('scene_id', 'view_id', 'group_id')} # info for the image
			im_info.update(batch_im_id=batch_im_id)
			cam_info = im_info.copy() # info for camera

			K.append(obs['camera']['K']) # camera intrinsics
			cam_infos.append(cam_info)

			for o, obj in enumerate(obs['objects']):
				obj_info = dict(
					label=obj['name'],
					score=1.0,
				)
				obj_info.update(im_info) # add key-value pair from im_info to obj_info
				bboxes.append(obj['bbox'])
				det_infos.append(obj_info)

	gt_detections = tc.PandasTensorCollection(
		infos=pd.DataFrame(det_infos),
		bboxes=torch.as_tensor(np.stack(bboxes)),
	) # basic info and bounding box of every ground-truth detection
	cameras = tc.PandasTensorCollection(
		infos=pd.DataFrame(cam_infos),
		K=torch.as_tensor(np.stack(K)),
	) # basic info of each view's camera (same fields as the detection info) and its intrinsics
	data = dict(
		images=images,
		cameras=cameras,
		gt_detections=gt_detections,
	)
	return data

The most important function: get_predictions

def get_predictions(self, pose_predictor, mv_predictor,
					detections=None,
					n_coarse_iterations=1, n_refiner_iterations=1,
					sv_score_th=0.0, skip_mv=True,
					use_detections_TCO=False):

Responsible for generating predictions for object poses in a scene using both single-view and multi-view approaches.

Input Parameters:
- pose_predictor: The single-view predictor; for example, the YCB-V dataset uses the PoseCNN detection model.
- mv_predictor: An object or function that predicts scene states using multi-view information.
- detections: A collection of detected objects with associated information, pre-generated and saved in a .pkl file
- n_coarse_iterations, n_refiner_iterations: Number of iterations for coarse and refinement pose estimation.
- sv_score_th: Score threshold for single-view detections.
- skip_mv: A flag to skip multi-view predictions.
- use_detections_TCO: A flag to use detections for initial pose estimation.
Filtering Detections:
Note that the detections used here come directly from pre-saved detection results (not ground truth):
posecnn_detections = load_posecnn_results()
- The function filters the input detections based on the sv_score_th threshold.
- It assigns a unique detection ID to each detection and creates an index based on scene_id and view_id.
Iterating Over Data:
- The function iterates over batches of data from the dataloader.
- For each batch, it extracts images, camera information, and ground truth detections.
Matching Detections:
- It matches the detections with the current batch of data using the index created earlier.
- It filters and prepares the detections for processing.
Pose Prediction:
- If there are detections, it uses the pose_predictor to get single-view predictions.
- It registers the initial bounding boxes with the candidates.
Multi-View Prediction:
- If skip_mv is False, it uses the mv_predictor to predict the scene state using multi-view information.
Collecting Predictions:
- It collects the single-view and multi-view predictions into a dictionary.
Concatenating Results:
- It concatenates the predictions across all batches and returns the final predictions.
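The filtering and matching steps above can be sketched with plain pandas. The table below is a toy stand-in for the detections' `infos`. Its column names follow the fields shown earlier, but the values and the `det_id` bookkeeping are assumptions for illustration, not the exact CosyPose code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the infos table of the pre-computed detections.
det_infos = pd.DataFrame({
    "scene_id": [48, 48, 48, 49],
    "view_id": [112, 112, 135, 7],
    "label": ["obj_000001", "obj_000006", "obj_000014", "obj_000019"],
    "score": [0.95, 0.20, 0.80, 0.60],
})
sv_score_th = 0.3

# 1. Drop detections below the single-view score threshold.
kept = det_infos[det_infos["score"] >= sv_score_th].reset_index(drop=True)
# 2. Give every remaining detection a unique id.
kept["det_id"] = np.arange(len(kept))
# 3. Index by (scene_id, view_id) so each batch can look up its own detections.
det_index = kept.set_index(["scene_id", "view_id"]).sort_index()

# Views contained in the current batch (normally the n_views of one group).
batch_views = [(48, 112), (48, 135)]
batch_det_ids = det_index.loc[batch_views, "det_id"].to_numpy()
```

The resulting `batch_det_ids` select the detections that belong to the views of the current batch, and those are then handed to the single-view pose predictor.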

`MultiviewScenePredictor`

Purpose:
Used by MultiviewPredictionRunner.get_predictions.
In run_cosypose_eval we initialize MultiviewScenePredictor in this way:

mv_predictor = MultiviewScenePredictor(mesh_db)

In the MultiviewScenePredictor we use the mesh_db to initialize MultiviewRefinement and solve:

problem = MultiviewRefinement(candidates=candidates_n,
                    cameras=cameras,
	                pairs_TC1C2=pairs_TC1C2,
	                mesh_db=self.mesh_db_ba)
ba_outputs = problem.solve(
	n_iterations=ba_n_iter,
	optimize_cameras=not use_known_camera_poses,
)

The solve function of MultiviewRefinement:

def solve(self, sample_n_init=1, **lm_kwargs):
	timer_init = Timer()
	timer_opt = Timer()
	timer_misc = Timer()

	timer_init.start()
	TWO_9d_init, TCW_9d_init = self.robust_initialization_TWO_TCW(n_init=sample_n_init)
	timer_init.pause()

	timer_opt.start()
	TWO_9d_opt, TCW_9d_opt, history = self.optimize_lm(
		TWO_9d_init, TCW_9d_init, **lm_kwargs)
	timer_opt.pause()

	timer_misc.start()
	objects, cameras = self.make_scene_infos(TWO_9d_opt, TCW_9d_opt)
	objects_init, cameras_init = self.make_scene_infos(TWO_9d_init, TCW_9d_init)
	history = self.convert_history(history)
	timer_misc.pause()

	outputs = dict(
		objects_init=objects_init,
		cameras_init=cameras_init,
		objects=objects,
		cameras=cameras,
		history=history,
		time_init=timer_init.stop(),
		time_opt=timer_opt.stop(),
		time_misc=timer_misc.stop(),
	)
	return outputs

Adaptation

The plan is to adapt run_custom_scenario.
How run_custom_scenario is used:

python -m cosypose.scripts.run_custom_scenario --scenario=example

Setting OMP and MKL num threads to 1.
pybullet build time: Jan 28 2022 20:13:03
0:00:00.000859 - --------------------------------------------------------------------------------
0:00:00.000921 - scenario: example
0:00:00.000942 - sv_score_th: 0.3
0:00:00.000956 - n_symmetries_rot: 64
0:00:00.000968 - ransac_n_iter: 2000
0:00:00.000980 - ransac_dist_threshold: 0.02
0:00:00.001002 - nms_th: 0.04
0:00:00.001015 - no_visualization: False
0:00:00.001026 - --------------------------------------------------------------------------------
0:00:00.569089 - Loaded 796 candidates in 8 views.
0:00:00.570278 - Loaded cameras intrinsics.
0:00:00.690990 - Loaded 30 3D object models.
0:00:00.691047 - Running stage 2 and 3 of CosyPose...
0:00:01.145408 - Num candidates: 107
0:00:01.145468 - Num views: 8
0:00:01.145728 - Estimating camera poses using RANSAC.
0:00:04.588304 - Matched candidates: 49
0:00:04.588375 - RANSAC time_models: 0:00:02.390068
0:00:04.588398 - RANSAC time_score: 0:00:00.990740
0:00:04.588415 - RANSAC time_misc: 0:00:00.061626
0:00:04.902268 - BA time_init: 0:00:00.005349
0:00:04.902333 - BA time_opt: 0:00:00.091822
0:00:04.902351 - BA time_misc: 0:00:00.004793
0:00:04.491746 - Subscene 0 has 8 objects and 7 cameras.
0:00:04.512850 - Wrote predicted scene (objects+cameras): /home/cyl/cosypose/local_data/custom_scenarios/example/results/subscene=0/predicted_scene.json
0:00:04.512906 - Wrote predicted objects with pose expressed in camera frame: /home/cyl/cosypose/local_data/custom_scenarios/example/results/subscene=0/scene_reprojected.csv

The script takes only the candidates, mesh_db, and the camera intrinsics (K), and runs mv_predictor directly.

Write a function that builds the candidates from list input:

def read_list_candidates_cameras(self, data_list, cameras_K_list):
	"""
	Creates PandasTensorCollections for the candidates and the cameras from per-view lists.

	Args:
		data_list (list): One dictionary per view, with the key:
			- "candidates" (list of dict): Each candidate dictionary includes:
				- "label" (str): The label of the object.
				- "score" (float): The confidence score of the object.
				- "pose" (torch.Tensor): A [4, 4] torch.Tensor representing the pose matrix.
		cameras_K_list (list): One [3, 3] torch.Tensor of camera intrinsics per view,
			in the same order as data_list.

	Returns:
		tuple: (candidates, cameras) as PandasTensorCollections. The candidates hold
			poses and infos; the cameras hold K and infos (view_id).
	"""
	all_poses = []
	all_infos = []
	all_K = []

	# Initialize view_id to be assigned automatically
	view_id = 0
	scene_id = 0  # Fixed value for scene_id

	for view, K in zip(data_list, cameras_K_list):
		all_K.append(K)
		for candidate in view["candidates"]:
			label = candidate["label"]
			score = candidate["score"]
			pose = candidate["pose"]

			# Append the pose tensor
			all_poses.append(pose)

			# Append the metadata
			all_infos.append({
				"view_id": view_id,
				"scene_id": scene_id,
				"score": score,
				"label": label
			})

		# Increment view_id for the next set of candidates
		view_id += 1

	K_tensor = torch.stack(all_K).to(dtype=torch.float32, device="cuda:0")

	# Stack poses into a single tensor
	poses_tensor = torch.stack(all_poses).to(dtype=torch.float32, device="cuda:0")

	# Create a Pandas DataFrame for infos
	infos_df = pd.DataFrame(all_infos)
	# Return the PandasTensorCollection-like structure
	ptc_candidate = tc.PandasTensorCollection(poses=poses_tensor, infos=infos_df)
	cam_info = infos_df.loc[:,["view_id"]]
	cam_info = cam_info.drop_duplicates()
	ptc_cam = tc.PandasTensorCollection(K=K_tensor, infos=cam_info)
	return ptc_candidate, ptc_cam

# Example usage:
example_data = [
    {
        "candidates": [
            {"label": "obj_000017", "score": 0.829675, "pose": torch.eye(4)},
            {"label": "obj_000010", "score": 0.820436, "pose": torch.eye(4) * 2},
        ]
    },
    {
        "candidates": [
            {"label": "obj_000005", "score": 0.104478, "pose": torch.eye(4) * 3},
        ]
    }
]
example_cameras_K = [
    torch.eye(3),
    torch.eye(3) * 2,
]

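# self is not used inside the function, so None is passed when calling it outside its class.
cd, cam = read_list_candidates_cameras(None, example_data, example_cameras_K)
cd, cam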

(PandasTensorCollection(
     poses: torch.Size([3, 4, 4]) torch.float32 cuda:0,
 ----------------------------------------
     infos:
    view_id  scene_id     score       label
 0        0         0  0.829675  obj_000017
 1        0         0  0.820436  obj_000010
 2        1         0  0.104478  obj_000005
 ),
 PandasTensorCollection(
     K: torch.Size([2, 3, 3]) torch.float32 cuda:0,
 ----------------------------------------
     infos:
    view_id
 0        0
 1        1
 ))

Then call MultiviewScenePredictor.predict_scene_state() as usual to estimate the scene:

predictions = self.mv_predictor.predict_scene_state(candidates, cameras,
									   score_th=self.sv_score_th,
									   use_known_camera_poses=False,
									   ransac_n_iter= self.ransac_n_iter,
									   ransac_dist_threshold= self.ransac_dist_threshold,
									   ba_n_iter= self.ba_n_iter)

Then apply Non-Maximum Suppression to merge objects that were detected more than once:

objects = predictions['scene/objects']
cameras = predictions['scene/cameras']
reproj = predictions['ba_output']
#print(predictions)
for view_group in np.unique(objects.infos['view_group']):
	objects_ = objects[np.where(objects.infos['view_group'] == view_group)[0]]
	cameras_ = cameras[np.where(cameras.infos['view_group'] == view_group)[0]]
	reproj_ = reproj[np.where(reproj.infos['view_group'] == view_group)[0]]
	objects_ = nms3d(objects_, th= self.nms_th, poses_attr='TWO')

The final output is objects_:

PandasTensorCollection(
    TWO: torch.Size([10, 4, 4]) torch.float32 cuda:0,
----------------------------------------
    infos:
   obj_id     score       label  n_cand  view_group  group_id  scene_id
0       2  5.469747  obj_000016       7           0         0        16
1       0  5.450335  obj_000017       8           0         0        16
2       4  4.098602  obj_000012       8           0         0        16
3       1  3.380887  obj_000010       6           0         0        16
4       5  2.771779  obj_000015       6           0         0        16
5       3  1.453180  obj_000011       4           0         0        16
6       9  1.183983  obj_000014       3           0         0        16
7       8  1.106775  obj_000013       2           0         0        16
)
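The idea behind the 3D non-maximum suppression step can be sketched as follows. This is a simplified stand-in, not the `nms3d` implementation itself. It assumes that duplicates are objects with the same label whose positions (the translation part of TWO) lie closer than the threshold, and `nms3d_sketch` is a hypothetical name.

```python
import numpy as np


def nms3d_sketch(labels, scores, positions, th):
    """Greedy 3D NMS: keep the best-scoring object, drop same-label ones closer than th."""
    positions = np.asarray(positions, dtype=float)
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    for i in order:
        duplicate = any(
            labels[i] == labels[j]
            and np.linalg.norm(positions[i] - positions[j]) < th
            for j in keep
        )
        if not duplicate:
            keep.append(int(i))
    return keep


labels = ["obj_000017", "obj_000017", "obj_000010", "obj_000017"]
scores = [5.4, 1.2, 3.3, 0.9]
positions = [[0.00, 0.00, 0.0],  # best obj_000017
             [0.01, 0.00, 0.0],  # same label, 0.01 away: duplicate of the first
             [0.01, 0.00, 0.0],  # different label: kept
             [0.30, 0.00, 0.0]]  # same label but far away: kept
kept = nms3d_sketch(labels, scores, positions, th=0.04)
```

The threshold 0.04 mirrors the `nms_th` value printed in the log above. Processing objects in descending score order means the strongest hypothesis of each cluster is the one that survives.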

Usage

Please refer to the notebook custom_scene.ipynb.

Posted 2024-10-28 | Updated 2025-07-02 | Note | 4 minutes read (About 566 words)

Blog Template For New Hexo User

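Before You Start

How it works

Add blog content locally (markdown files) -> hexo generates the site source from those files -> push it to GitHub with a command -> GitHub deploys the static pages by itself

Prerequisites

Install git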

https://www.cnblogs.com/xueweisuoyong/p/11914045.html

GitHub SSH key

Pushing locally written content to GitHub requires binding an SSH key.
See: https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent

ssh-keygen -t ed25519 -C "guanshengyuanlu@163.com"

Add the generated public key to the SSH settings on GitHub.

Install npm

https://blog.csdn.net/lizhong2008/article/details/133844070
The latest version is fine.

Local setup

Set the local git config

git config --global user.email "guanshengyuanlu@163.com"
git config --global user.name "Draumurvakna"

Clone the repository

git clone git@github.com:Draumurvakna/MIGAO-Blog-Src.git
cd MIGAO-Blog-Src

Install the environment

git submodule update --recursive --init   
npm update
cd themes/icarus
npm update

Update the website

./show.sh # preview

./deploy.sh # after this the site can be visited directly at https://draumurvakna.github.io/

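Writing Posts

Every article on the website is a markdown text file stored locally in source/_posts.

For example, there are two sample articles here.

Create a new post with the command hexo new "article title".

Then just edit the corresponding file.
Typora is the recommended markdown editor, though a plain text editor works perfectly well if you are hardcore enough!

To add an image, put it in the asset folder next to the post (the folder with the same name as the post) and then reference it in the text.

See Sample Blog for reference.

After adding the content you want, deploy the website with ./deploy.sh, wait a moment, and the latest changes will be visible on the site.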

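Afterwards

I want to change the blog background

Go to the themes/icarus/source/img folder 📁

Replace `lightBG.png` and `darkBG.jpg` with other images, keeping the same file names.

If nothing has changed after redeploying, press <Ctrl>+F5 in the browser.

I want to change the avatar 👤

As in the image above, replace avatar.png.

The comment system has to be set up by yourself

See https://chen-yulin.github.io/2024/09/03/%5BOBS%5Dhexo-Hexo%20Comment%20System%20--%20Twikoo/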

Posted 2024-10-22 | Updated 2025-07-02 | Note | 2 minutes read (About 232 words)

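JI and Ross School of Business Dual-Degree Program

A graduate-level dual-degree program with the Ross School of Business.
Master's degrees (taught) offered by the University of Michigan:

Management, 30 credits
Supply Chain (Management), 30 credits
Business Analytics (BA), 36.5 credits: well suited to switching into business; quantitative analysis (programming required)

The year at Michigan falls in the final year of the three-year program, from June to May of the following year. We recognize 6 of their credits, and they recognize our Gateway.
Admission requirements:

JI graduate student
An online interview conducted by the other side

Tuition is $50k to $60k (it can be discounted if credits are recognized). Living in Ann Arbor is estimated at $1000/month, food at $1000/month, and miscellaneous expenses at $500/month.

It can provide a visa opportunity for job hunting (good for those who want to build a career in the US).
However, the timing conflicts with the autumn recruitment season, which makes job interviews in China difficult.

The two sides do not issue their degree certificates at the same time.

The 2026 cohort includes professional master's admissions (tied to the dual degree).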
