A Survey of Imitation Learning: Algorithms, Recent Developments, and Challenges

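Imitation learning (IL) is an alternative to traditional manual programming for giving robots autonomous capabilities.
IL lets a machine learn the desired behavior from demonstrations (a human expert demonstrating the behavior), which removes the need for explicit programming or a task-specific reward function.
IL falls into two main categories: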

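https://validator.w3.org/ : an online tool provided by the World Wide Web Consortium (W3C) that checks whether a page's HTML, XHTML, or other markup conforms to the relevant standards and specifications. It helps developers improve the quality and compatibility of their pages so that they render correctly across browsers and devices.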
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent.
In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that “brute force” search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.
A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers’ initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.
In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked—they tried to put that knowledge in their systems—but it proved ultimately counterproductive, and a colossal waste of researcher’s time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.
In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.
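nvidia-smi reports the newest CUDA version that the installed driver can support, not the version that is actually installed.
The system-wide CUDA version can be anything: torch prefers the CUDA version installed in the virtual environment.
Install a specific version of cuda-toolkit:
    conda install nvidia/label/cuda-12.4.0::cuda-toolkit -c nvidia/label/cuda-12.4.0
Install the latest version:
    conda install cuda-toolkit
Some repositories need the CUDA library path set explicitly before their packages will compile:
    conda env config vars set LD_LIBRARY_PATH="/home/cyl/miniconda3/envs/gsam/lib/python3.10/site-packages/nvidia/cuda_runtime/lib/:$LD_LIBRARY_PATH"
Note: after the library path is changed, the LSP in nvim reports errors, so change it back afterwards:
    conda env config vars set LD_LIBRARY_PATH=""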
Note: To find the correct path for CUDA_HOME use which nvcc. In my case, output of the command was:
    >>> which nvcc
Therefore, I set the CUDA_HOME as /home/user/miniconda3/envs/py12/.
Note: To find the correct path for LD_LIBRARY_PATH use find ~ -name cuda_runtime_api.h. In my case, output of the command was:
    >>> find ~ -name cuda_runtime_api.h
So I set the LD_LIBRARY_PATH as /home/user/miniconda3/envs/py12/targets/x86_64-linux/lib/ and CPATH as /home/user/miniconda3/envs/py12/targets/x86_64-linux/include/. If you have multiple CUDA installations, the output of find ~ -name cuda_runtime_api.h will display multiple paths. Make sure to choose the path that corresponds to the environment you have created.
ref:https://github.com/IDEA-Research/GroundingDINO/issues/355
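The path lookups in the notes above can also be scripted. The following is a minimal sketch, not part of the referenced issue: the helper name `cuda_paths_from_nvcc` is hypothetical, and it assumes that nvcc lives at `<env>/bin/nvcc` and that the headers and libraries sit under `<env>/targets/x86_64-linux/`, as in the example paths quoted above.

```python
from pathlib import Path
import shutil


def cuda_paths_from_nvcc(nvcc_path):
    """Derive candidate CUDA_HOME, LD_LIBRARY_PATH and CPATH values from an nvcc path.

    Assumption: nvcc is at <env>/bin/nvcc and the toolkit files are under
    <env>/targets/x86_64-linux/ (the layout described in the notes above).
    """
    env_root = Path(nvcc_path).parent.parent  # strip the trailing bin/nvcc
    target = env_root / "targets" / "x86_64-linux"
    return {
        "CUDA_HOME": str(env_root),
        "LD_LIBRARY_PATH": str(target / "lib"),
        "CPATH": str(target / "include"),
    }


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")  # same lookup as `which nvcc`
    if nvcc is None:
        print("nvcc not found on PATH")
    else:
        for name, value in cuda_paths_from_nvcc(nvcc).items():
            print(f"export {name}={value}")
```

If several CUDA installations exist, the printed values should still be checked by hand against the environment that was actually created.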
Note: Always reboot the computer after upgrading CUDA.
Note: changing LD_LIBRARY_PATH may stop pyright from running in neovim, so reset the variable once compilation has finished:
    conda env config vars set LD_LIBRARY_PATH=""
cudatoolkit and cuda-toolkit can be installed side by side.
If cudatoolkit is not installed, compilation may fail with the error ld: cannot find -lcudart: No such file or directory collect2: error: ld returned 1 exit status
Use the following command to get the version information:
    python -c 'import torch;print(torch.__version__);print(torch.version.cuda)'

    2.0.0+cu117
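To read a version string like the one above programmatically, the CUDA tag can be split off the release number. This is a small sketch with a hypothetical helper name, `parse_torch_build`. It assumes the local tag has the form `cu` followed by the major digits and one minor digit (so `cu117` means CUDA 11.7). Other tags such as `cpu` yield `None`.

```python
def parse_torch_build(version):
    """Split a torch version string such as '2.0.0+cu117' into (release, cuda).

    cuda is a string like '11.7', or None when there is no CUDA tag.
    Assumption: the tag is 'cu' + major digits + exactly one minor digit.
    """
    release, _, tag = version.partition("+")
    digits = tag[2:]
    if tag.startswith("cu") and digits.isdigit() and len(digits) >= 2:
        return release, f"{digits[:-1]}.{digits[-1]}"
    return release, None


print(parse_torch_build("2.0.0+cu117"))
```

Comparing this value with the toolkit version in the environment is a quick way to spot a mismatch before compiling extensions.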
Use SSH to Connect to Jupyter Lab

Use ssh as the remote command-line tool: start Jupyter Lab on the remote machine, then open it in a browser on the local machine.
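One common way to do this is SSH local port forwarding. The sketch below only builds the command line and does not run it. The host name, user name, and port 8888 are placeholder assumptions, and the remote side is assumed to have been started with something like `jupyter lab --no-browser --port=8888`.

```python
import shlex


def jupyter_tunnel_command(host, user, remote_port=8888, local_port=8888):
    """Build an ssh command that forwards a local port to a remote Jupyter Lab."""
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{host}",
    ]


# Placeholder host and user, for illustration only.
cmd = jupyter_tunnel_command("gpu-server.example.com", "alice")
print(shlex.join(cmd))
```

With the tunnel running, the notebook should be reachable at `http://localhost:8888` in the local browser, using the token that Jupyter prints on the remote side.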

Repository: https://github.com/owkin/FLamby
Convert raw RGB-D into a tree-structured scene (maybe in Unity), for more
I found that this idea closely matches a paper recently published by lff: https://arxiv.org/html/2410.07408v1
Comparison with scene understanding
Repository: https://github.com/Simple-Robotics/cosypose
    git clone --recurse-submodules https://github.com/Simple-Robotics/cosypose.git
Note: during this step, pip warns that setuptools and matplotlib-inline are not compatible with Python 3.7.6. Install compatible versions manually in the environment.
    conda activate cosypose

    git lfs pull
Download the data as described in the README.
Note: the first block of download commands fails. According to https://bop.felk.cvut.cz/datasets/ the download links have moved to Hugging Face. The test set can be downloaded manually from https://huggingface.co/datasets/bop-benchmark/datasets/tree/main/ycbv and placed in local_data/bop_datasets/ycbv/test
Set the models used for testing:
    cp ./local_data/bop_datasets/ycbv/model_bop_compat_eval ./local_data/bop_datasets/ycbv/models
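np.where(mask)[0].item() error
Running
    export CUDA_VISIBLE_DEVICES=0
raises an error:
    Traceback (most recent call last):
After adding debug output, I got:
    Debug - scene_id: 48, view_id: 1
It turns out that the downloaded test set does not contain every frame listed in the dataset's keyframe.txt, so some keyframes cannot be found.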
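A quick way to confirm this diagnosis is to list the keyframes that have no matching frame on disk. The sketch below is independent of cosypose. It assumes that each line of keyframe.txt looks like `0048/000001` (scene id, then view id), which should be checked against the actual file. The helper name and the sample values are illustrative only.

```python
def missing_keyframes(keyframe_lines, available):
    """Return the keyframes that are listed but absent from the download.

    keyframe_lines: iterable of strings, assumed to look like '0048/000001'.
    available: set of (scene_id, view_id) integer pairs found on disk.
    """
    missing = []
    for line in keyframe_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        scene_str, view_str = line.split("/")
        key = (int(scene_str), int(view_str))
        if key not in available:
            missing.append(key)
    return missing


# Illustrative data: one listed keyframe is not on disk.
lines = ["0048/000001", "0048/000036", "0049/000001"]
on_disk = {(48, 36), (49, 1)}
print(missing_keyframes(lines, on_disk))
```

Any pair reported here would produce an empty match in a lookup such as `np.where(mask)[0]`, and calling `.item()` on an empty array raises an error.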
To start a fresh training run, clear local_data/joblib_cache.
cosypose.scripts.run_cosypose_eval
The script predicts object poses based on multi-view input by following these steps:
Dataset Loading: It first loads the dataset using the make_scene_dataset function, which prepares the scene data for evaluation. The dataset is wrapped in a MultiViewWrapper to handle multiple views.
Model Loading: The script loads pre-trained models for pose prediction using the load_models function. It loads both coarse and refiner models based on the configuration specified in the command-line arguments.
Prediction Setup: The script sets up the prediction parameters, including the number of iterations for coarse and refiner models, and whether to skip multi-view processing based on the number of views specified.
Multi-view Prediction: The MultiviewScenePredictor is initialized with the mesh database, which is used to predict poses across multiple views. The MultiviewPredictionRunner is then used to run predictions on the dataset, leveraging the multi-view setup to improve pose estimation accuracy.
Pose Estimation: The script uses the loaded models to predict object poses. It processes detections from either pix2pose or posecnn depending on the dataset, and refines these predictions using the refiner model.
Evaluation: After predictions, the script evaluates the predicted poses using the PoseEvaluation class. It calculates various metrics like ADD-S and AUC to assess the accuracy of the pose predictions.
Results Logging: Finally, the script logs the results, including evaluation metrics, and saves them to a specified directory.
The multi-view approach allows the script to leverage information from different viewpoints, which can help resolve ambiguities and improve the robustness of the pose estimation.
run_custom_scenario
Transformation from Camera to Object.
It represents the transformation matrix or parameters that describe the pose of an object relative to the camera's coordinate system.
Transformation from World to Object.
It represents the transformation matrix or parameters that describe the pose of an object relative to the world's coordinate system.
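The two poses are linked through the camera pose: if the pose of the camera in the world frame is known, the object pose in the world frame is the product of the camera pose and the object pose in the camera frame. The sketch below uses plain 4x4 homogeneous matrices with pure translations. The names `T_WC`, `T_CO`, and `T_WO` and the numeric values are illustrative assumptions, not identifiers or data from the repository.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x, y, z):
    """Homogeneous transform with identity rotation and translation (x, y, z)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


# Hypothetical poses: camera offset by 1 along world z, object 2 along camera z.
T_WC = translation(0.0, 0.0, 1.0)  # camera pose in the world frame
T_CO = translation(0.0, 0.0, 2.0)  # object pose in the camera frame
T_WO = matmul4(T_WC, T_CO)         # object pose in the world frame

print([row[3] for row in T_WO[:3]])  # translation part of T_WO
```

With rotations included the same product applies. Only the order of multiplication matters.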
    class MeshDataBase:
The usual way to initialize it:
    object_ds = BOPObjectDataset(scenario_dir / 'models')
It can also be loaded together with the models through load_models:
    predictor, mesh_db = load_models(coarse_run_id, refiner_run_id, n_workers=n_plotters, object_set=object_set)
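What MultiViewWrapper does:
It reads scene_dataset and splits the data into separate scenes according to the number of views n_views, which makes it easy to iterate over the scene elements (all of these are ground truth).
Each iteration returns:
- n_views RGB images taken from different viewpoints
- the n_views corresponding masks
- the n_views corresponding observations
    scene_ds_pred = MultiViewWrapper(scene_ds, n_views=n_views)

    [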
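The grouping idea can be illustrated without the library. The sketch below is not the MultiViewWrapper implementation. It assumes that frames are identified by `(scene_id, view_id)` pairs, that views are taken in sorted order, and that leftover views that cannot fill a complete group are dropped. The real wrapper may select views differently.

```python
from collections import defaultdict


def group_views(frames, n_views):
    """Split (scene_id, view_id) frames into per-scene groups of n_views views."""
    by_scene = defaultdict(list)
    for scene_id, view_id in frames:
        by_scene[scene_id].append(view_id)

    groups = []
    for scene_id in sorted(by_scene):
        views = sorted(by_scene[scene_id])
        # Keep complete groups only; leftover views are dropped in this sketch.
        for start in range(0, len(views) - n_views + 1, n_views):
            groups.append((scene_id, views[start:start + n_views]))
    return groups


# Illustrative frames: five views of scene 48 and two views of scene 49.
frames = [(48, 1), (48, 2), (48, 3), (48, 4), (48, 5), (49, 1), (49, 2)]
for scene_id, views in group_views(frames, n_views=2):
    print(scene_id, views)
```

Each resulting group corresponds to one element of the iteration described above: n_views images of the same scene, with their masks and observations.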
What MultiviewPredictionRunner does:
It takes the MultiViewWrapper as input and makes predictions.
First, it receives the dataset:
    dataloader = DataLoader(scene_ds, batch_size=batch_size,
collate_fn is used to process the raw data (the comments at the end list the data that is actually used):
    def collate_fn(self, batch):
The most important function is get_predictions:
    def get_predictions(self, pose_predictor, mv_predictor,
Responsible for generating predictions for object poses in a scene using both single-view and multi-view approaches.
Input Parameters:
- pose_predictor: the single-view predictor. For the ycbv dataset, for example, the posecnn detection model is used.
- mv_predictor: An object or function that predicts scene states using multi-view information.
- detections: A collection of detected objects with associated information, pre-generated and saved in a .pkl file.
- n_coarse_iterations, n_refiner_iterations: Number of iterations for coarse and refinement pose estimation.
- sv_score_th: Score threshold for single-view detections.
- skip_mv: A flag to skip multi-view predictions.
- use_detections_TCO: A flag to use detections for initial pose estimation.
Filtering Detections:
Note that the detections used here come directly from pre-saved detection results (not ground truth):
    posecnn_detections = load_posecnn_results()
- Filters detections based on the sv_score_th threshold.
- Indexes the remaining detections by scene_id and view_id.
Iterating Over Data:
- Iterates over the dataloader.
Matching Detections:
Pose Prediction:
- Uses the pose_predictor to get single-view predictions.
Multi-View Prediction:
- If skip_mv is False, it uses the mv_predictor to predict the scene state using multi-view information.
Collecting Predictions:
Concatenating Results:
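The filtering and matching steps can be pictured with plain dictionaries. This is a simplified sketch rather than the code in get_predictions. The record fields, the `>=` comparison against `sv_score_th`, and the sample scores are assumptions made for illustration.

```python
from collections import defaultdict


def filter_and_index(detections, sv_score_th):
    """Drop detections below sv_score_th and group the rest by (scene_id, view_id)."""
    index = defaultdict(list)
    for det in detections:
        if det["score"] >= sv_score_th:
            index[(det["scene_id"], det["view_id"])].append(det)
    return dict(index)


# Illustrative detections; labels and scores are made up.
detections = [
    {"scene_id": 48, "view_id": 1, "label": "obj_000001", "score": 0.9},
    {"scene_id": 48, "view_id": 1, "label": "obj_000002", "score": 0.2},
    {"scene_id": 48, "view_id": 36, "label": "obj_000001", "score": 0.7},
]
det_index = filter_and_index(detections, sv_score_th=0.3)

# Matching: look up the detections of each frame in the current batch.
for frame in [(48, 1), (48, 36), (49, 1)]:
    matched = det_index.get(frame, [])  # empty list when the frame has no detections
    print(frame, [d["label"] for d in matched])
```

Frames without any detection simply yield an empty list here, which is the case that has to be handled before the single-view predictor is called.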
What MultiviewScenePredictor does:
It is used by MultiviewPredictionRunner.get_predictions.
In run_cosypose_eval, MultiviewScenePredictor is initialized like this:
    mv_predictor = MultiviewScenePredictor(mesh_db)
Inside MultiviewScenePredictor, mesh_db is used to initialize MultiviewRefinement and call solve:
    problem = MultiviewRefinement(candidates=candidates_n,
The solve function of MultiviewRefinement:
    def solve(self, sample_n_init=1, **lm_kwargs):
Planned modifications based on run_custom_scenario
Usage of run_custom_scenario:
    python -m cosypose.scripts.run_custom_scenario --scenario=example

    Setting OMP and MKL num threads to 1.
The script takes only the candidates, mesh_db, and camera_k information and runs mv_predictor directly.
Write a function that builds candidates from list input:
    def read_list_candidates_cameras(self, data_list, cameras_K_list):

    # Example usage:

    (PandasTensorCollection(
After that, call MultiviewScenePredictor.predict_scene_state() as usual to estimate the scene:
    predictions = self.mv_predictor.predict_scene_state(candidates, cameras,
Then use Non-Maximum Suppression to merge objects that were detected more than once:
    objects = predictions['scene/objects']
The final output is objects_:
    PandasTensorCollection(
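For intuition, the suppression step can be sketched as a greedy filter over 3D positions. This is a generic illustration, not the routine used in the script. The function name, the record fields, and the 0.04 distance threshold are assumptions.

```python
import math


def nms_by_distance(objects, dist_threshold):
    """Greedy non-maximum suppression on object positions.

    Objects are visited from highest to lowest score. An object is dropped
    when a kept object with the same label lies closer than dist_threshold.
    """
    kept = []
    for obj in sorted(objects, key=lambda o: o["score"], reverse=True):
        duplicate = any(
            obj["label"] == k["label"]
            and math.dist(obj["position"], k["position"]) < dist_threshold
            for k in kept
        )
        if not duplicate:
            kept.append(obj)
    return kept


# Illustrative objects: the first two are the same cup seen twice.
objects = [
    {"label": "cup", "score": 0.9, "position": (0.00, 0.0, 0.0)},
    {"label": "cup", "score": 0.5, "position": (0.01, 0.0, 0.0)},
    {"label": "box", "score": 0.4, "position": (0.00, 0.0, 0.0)},
]
for obj in nms_by_distance(objects, dist_threshold=0.04):
    print(obj["label"], obj["score"])
```

Two objects with different labels are kept even when they overlap, because only same-label duplicates are suppressed in this sketch.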
Please refer to the notebook custom_scene.ipynb.
Blog Template For New Hexo User
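Add blog content locally (Markdown files) -> Hexo generates the site source from the files -> upload (push) it to GitHub with a command -> GitHub deploys the static pages automatically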
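Graduate-level dual-degree program with the Ross School of Business
Master's degrees (taught) offered by the University of Michigan:
The year at the University of Michigan falls in the final year of the three-year program, from June to the following May. We recognize 6 of their credits, and they recognize our Gateway.
Admission requirements:
Tuition is $50,000 to $60,000 (it may be discounted if credits are recognized). Living costs in Ann Arbor are estimated at $1,000/month, food at $1,000/month, and miscellaneous expenses at $500/month.
It can provide a visa opportunity for job hunting (good for those who plan to build a career in the US).
However, the schedule conflicts with the autumn recruitment season, which makes job interviews in China difficult.
The two sides do not issue their degree certificates at the same time.
The 2026 cohort includes professional master's admissions (tied to the dual degree).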