We have also seen two properties emerge that can be leveraged in future applications: the quality of the features in k-NN classification shows potential for image retrieval, and the presence of information about the scene layout in the features can also benefit weakly supervised image segmentation.
```python
# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]
```
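For context, the remainder of this pseudocode (as given in the CLIP paper) computes the joint multimodal embedding and the symmetric contrastive loss. It stays in the paper's numpy-style pseudocode, so `l2_normalize` and `cross_entropy_loss` are stand-ins rather than runnable functions:

```python
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t) / 2
```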
Neural Radiance Fields (NeRF) have emerged as a powerful technique for capturing photorealistic digital representations of complex real-world 3D scenes. However, the direct output of NeRF is nothing more than a colored density field, devoid of meaning or context, which makes it hard to build interfaces for interacting with the resulting 3D scenes.

Natural language is an intuitive interface for interacting with a 3D scene. Consider a capture of a kitchen: imagine navigating it by asking where the "utensils" are, or more specifically for a tool you could use for "stirring", or even for your favorite mug with a particular logo on it, with the comfort and familiarity of everyday conversation. This requires not only handling natural-language input queries, but also incorporating semantics at multiple scales and relating to long-tail and abstract concepts.

Solution

A Language Embedded Radiance Field (LERF) grounds language into NeRF by optimizing embeddings from an off-the-shelf vision-language model such as CLIP into the 3D scene. LERF offers an additional benefit: because CLIP embeddings are extracted from multiple views at multiple scales, the relevancy maps obtained for a text query through the 3D CLIP embeddings are more localized than those obtained through 2D CLIP embeddings. They are also 3D-consistent by definition and can be queried directly in the 3D field without rendering from multiple views.

Compared with CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]], LERF is denser.
CLIP-Fields [32] and NLMaps-SayCan [8] fuse CLIP embeddings of crops into pointclouds, using a contrastively supervised field and classical pointcloud fusion respectively. In CLIP-Fields, the crop locations are guided by Detic [40]. On the other hand, NLMaps-SayCan relies on region proposal networks. These maps are sparser than LERF as they primarily query CLIP on detected objects rather than densely throughout views of the scene. Concurrent work ConceptFusion [19] fuses CLIP features more densely in RGBD pointclouds, using Mask2Former [9] to predict regions of interest, meaning it can lose objects which are out of distribution to Mask2Former’s training set. In contrast, LERF does not use region or mask proposals.
LERF
Given a set of calibrated input images, we embed CLIP into a 3D field inside NeRF. However, querying the CLIP embedding of a single 3D point is ill-posed, because CLIP is fundamentally a global image embedding and is not suited to pixel-aligned feature extraction. To account for this, we propose a novel approach that learns a field of language embeddings over volumes centered at sample points. Concretely, the output of the field is the average CLIP embedding, across all training views, of the image crops that contain the specified volume. By reframing the query from a point to a volume, we can effectively supervise a dense field from coarse crops of the input images, and render it in a pixel-aligned way by conditioning on a given volume scale.
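A minimal sketch of that idea, not LERF's actual implementation (the real system uses a multi-resolution hash grid and renders embeddings along rays): a field conditioned on position and physical scale, trained to match the average CLIP embedding of the training-view crops that contain that volume. All names and dimensions below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

class LanguageField(torch.nn.Module):
    """Maps a 3D position plus a physical scale to a CLIP-sized embedding."""
    def __init__(self, clip_dim=512, hidden=256):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3 + 1, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, clip_dim),
        )

    def forward(self, xyz, scale):
        # scale selects "how large a crop around this point" the embedding describes
        feat = self.mlp(torch.cat([xyz, scale], dim=-1))
        return F.normalize(feat, dim=-1)  # CLIP embeddings live on the unit sphere

def language_loss(field, xyz, scale, avg_crop_clip):
    """avg_crop_clip: average CLIP embedding, over all training views, of the
    image crops that contain the volume at this position and scale."""
    pred = field(xyz, scale)
    target = F.normalize(avg_crop_clip, dim=-1)
    return -(pred * target).sum(dim=-1).mean()  # maximize cosine similarity
```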
Voxel colliders for detected objects, joints, and a physics agent that interacts with the physics engine. Point cloud data and grounded captions => object properties, hierarchy relations, joints (a new model may need to be proposed).

Semantic reconstruction of object models

Inspired by [[BLIP]]: understanding for language & the existing point cloud, generation for the rest of the point cloud (already achieved by Wonder3D).
We aim to build a system that can connect points of a 3D scene with their visual and semantic meaning. It provides an interface through a pair of scene-dependent implicit functions $f, h : \mathbb{R}^3 \to \mathbb{R}^n$ such that, for the coordinates of any point $P$ in the scene, $f(P)$ is a vector representing its semantic features and $h(P)$ is another vector representing its visual features.
Dataset creation
> **MHE** is multi-resolution hash encoding, as introduced in [[Instant Neural Graphics Primitives with a Multiresolution Hash Encoding]].
> MHEs build an implicit representation over coordinates with a feature-pyramid-like structure, which can flexibly maintain both local and global information, unlike purely voxel-based encodings ([[Scene-LLM]]), which focus on local structure only.
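A rough sketch of how a pair of such fields could sit on top of an MHE-style backbone. This is not the paper's code: the hash encoding below is heavily simplified (single-corner lookup, no trilinear interpolation), and every dimension is an assumption.

```python
import torch

class HashGrid(torch.nn.Module):
    """Very simplified multi-resolution hash encoding: one hash table per level."""
    def __init__(self, levels=8, table_size=2**16, feat_dim=2, base_res=16):
        super().__init__()
        self.tables = torch.nn.ParameterList(
            [torch.nn.Parameter(1e-4 * torch.randn(table_size, feat_dim))
             for _ in range(levels)]
        )
        self.resolutions = [base_res * 2**l for l in range(levels)]
        self.primes = torch.tensor([1, 2654435761, 805459861])

    def forward(self, xyz):                        # xyz assumed in [0, 1]^3
        feats = []
        for table, res in zip(self.tables, self.resolutions):
            idx = (xyz * res).long()               # nearest-corner lookup only
            h = (idx * self.primes.to(idx.device)).sum(-1) % table.shape[0]
            feats.append(table[h])
        return torch.cat(feats, dim=-1)            # concatenated pyramid features

class SceneFields(torch.nn.Module):
    """f: semantic features, h: visual features, sharing one MHE backbone."""
    def __init__(self, sem_dim=512, vis_dim=256):
        super().__init__()
        self.encoding = HashGrid()
        enc_dim = 8 * 2                            # levels * feat_dim
        self.f = torch.nn.Sequential(torch.nn.Linear(enc_dim, 256),
                                     torch.nn.ReLU(), torch.nn.Linear(256, sem_dim))
        self.h = torch.nn.Sequential(torch.nn.Linear(enc_dim, 256),
                                     torch.nn.ReLU(), torch.nn.Linear(256, vis_dim))

    def forward(self, xyz):
        z = self.encoding(xyz)
        return self.f(z), self.h(z)
```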
It appears that each new scene requires retraining to obtain the coordinate-to-semantics mapping.
"Pick up A (from B) and drop it on/in C", where A is an object and B and C are places in a real-world environment such as a home.
Open-home, open-vocabulary object navigation
Responsible for spatial reconstruction, identifying the approximate locations of objects, and robot navigation. Methods used:

CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]]: an RGB-D video of the home -> a sequence of posed RGB-D images (with camera poses and positions), used to reconstruct the environment; the work also extracts the floor surfaces next to objects and containers from this reconstruction.

VoxelMap: similar to the object-centric memory of CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]]; built from a CLIP semantic vector for every point in the point cloud, each 5 cm voxel stores a detector-confidence-weighted average of the CLIP embeddings.

Querying the memory module: first convert the language query into a CLIP semantic vector, then search the voxel map's CLIP embeddings for the voxel that is semantically closest, and use it to localize the target.
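A small sketch of this weighted-averaging and querying scheme, with invented helper names (not the actual implementation):

```python
import numpy as np

VOXEL = 0.05  # 5 cm

def build_voxel_map(points, clip_vecs, confidences):
    """points: [N,3] metres; clip_vecs: [N,D] CLIP vectors; confidences: [N]."""
    voxel_map = {}
    for p, v, c in zip(points, clip_vecs, confidences):
        key = tuple(np.floor(p / VOXEL).astype(int))
        acc_vec, acc_w = voxel_map.get(key, (np.zeros_like(v), 0.0))
        voxel_map[key] = (acc_vec + c * v, acc_w + c)   # confidence-weighted sum
    return {k: vec / max(w, 1e-8) for k, (vec, w) in voxel_map.items()}

def query(voxel_map, text_clip_vec):
    """Return the centre of the voxel whose embedding best matches the query."""
    q = text_clip_vec / np.linalg.norm(text_clip_vec)
    best_key, best_sim = None, -np.inf
    for key, vec in voxel_map.items():
        sim = np.dot(vec, q) / (np.linalg.norm(vec) + 1e-8)
        if sim > best_sim:
            best_key, best_sim = key, sim
    return (np.array(best_key) + 0.5) * VOXEL, best_sim
```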
Experiment
Imitation learning (IL) is an alternative to traditional manual programming for giving robots autonomous capabilities. IL lets machines learn the desired behavior from demonstrations (a human demonstrating expert behavior), removing the need for explicit programming or task-specific reward functions. IL has two main categories:

Behavior cloning (BC)

Inverse reinforcement learning (IRL)
Behavior Cloning
BC is an IL technique that treats learning a behavior as a supervised learning task. BC trains a model to replicate the expert's behavior by establishing a mapping between environment states and the corresponding expert actions. The expert's behavior is recorded as a set of state-action pairs, also called demonstrations. During training, the model learns a function that, using these demonstrations as input, maps the current state to the corresponding expert action. Once trained, the model uses this learned function to generate actions for newly encountered states.

It does not require knowledge of the underlying dynamics of the environment, is computationally efficient, and is a relatively simple method.
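A minimal behavior-cloning sketch under assumed state/action dimensions: the policy is trained by plain supervised regression on the demonstrated state-action pairs.

```python
import torch
import torch.nn.functional as F

obs_dim, act_dim = 17, 6  # placeholder dimensions, not from the text
policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_update(states, expert_actions):
    """One supervised step on a batch of demonstrated (state, action) pairs."""
    loss = F.mse_loss(policy(states), expert_actions)  # imitate the expert's action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```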
The covariate shift problem: the distribution of states observed at test time may differ from the distribution observed during training, making the agent error-prone when it encounters unseen states, with no clear guidance on how to act. The issue with BC's supervised approach is that once the agent drifts and encounters out-of-distribution states, it does not know how to return to the demonstrated states.
The agent strives to deceive the discriminator by generating trajectories closely resembling those of the expert.
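This is the adversarial imitation setup popularized by GAIL-style methods; a rough sketch of that objective (assumed dimensions, not from the text): the discriminator is trained to separate expert (state, action) pairs from the agent's, and the agent is rewarded for being mistaken for the expert.

```python
import torch
import torch.nn.functional as F

obs_dim, act_dim = 17, 6  # placeholder dimensions
disc = torch.nn.Sequential(
    torch.nn.Linear(obs_dim + act_dim, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

def discriminator_update(expert_sa, agent_sa):
    """Train D to label expert (state, action) pairs 1 and the agent's 0."""
    logits_e, logits_a = disc(expert_sa), disc(agent_sa)
    loss = (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e))
            + F.binary_cross_entropy_with_logits(logits_a, torch.zeros_like(logits_a)))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

def imitation_reward(agent_sa):
    """RL reward for the agent: high when D mistakes it for the expert."""
    with torch.no_grad():
        return -F.logsigmoid(-disc(agent_sa))  # equals -log(1 - D(s, a))
```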
Imitation From Observation
Learning purely from image sequences, without requiring explicit joint-level action data.
Unlike traditional methods, IfO presents a more organic approach to learning from experts, mirroring how humans and animals imitate. Humans often learn new behaviors by observing others without detailed knowledge of their actions (e.g., the muscle commands). People learn a diverse range of tasks, from weaving to swimming to playing games, by watching online videos. Despite differences in body shape, sensory input, and timing, humans exhibit an impressive ability to apply knowledge gained from such online demonstrations.
During the next few hours, all the requirements you have to meet are explained to you in detail. Among the most emphasized points, you learn that you must (i) use the latest version of Microsoft Front Page Express to write the websites; (ii) include as many buttons as possible (even if one is enough); (iii) when a hidden box is expanded, expand it as high as possible above the button that opened it and do not notify the user; (iv) as much as possible, do not disable or hide irrelevant information, simply include it in the middle of the useful content; (v) use and abuse pop-ups; (vi) feel free to include Chinese text in the middle of an English text; (vii) if a page includes videos, ensure they are all fully downloaded before the user can do anything. You also learn that they pay much attention to the quality of their product, and as such you should never forget to test your website, IE 6 being recommended.
```ocaml
(* Input parsing. The format is assumed here: P and H on the first line,
   then P webpage descriptions "s t h", then T and the T task durations. *)
let read_input () =
  let p, h = Scanf.scanf " %d %d" (fun a b -> (a, b)) in
  let webpages =
    List.init p (fun _ -> Scanf.scanf " %d %d %d" (fun s t h -> (s, t, h)))
  in
  let t = Scanf.scanf " %d" (fun x -> x) in
  let tasks = List.init t (fun _ -> Scanf.scanf " %d" (fun x -> x)) in
  (p, h, webpages, tasks)

(* Delegate webpages: greedily hand off the cheapest pages (by exercise cost)
   while the exercise budget allows it. *)
let delegate_webpages webpages h =
  let sorted_pages =
    List.sort (fun (_, _, h1) (_, _, h2) -> compare h1 h2) webpages
  in
  let rec select_delegation remaining_pages acc h_left =
    match remaining_pages with
    | [] -> (acc, remaining_pages)
    | (s, t, h) :: tl when h <= h_left ->
        select_delegation tl ((s, t, h) :: acc) (h_left - h)
    | _ -> (acc, remaining_pages)
  in
  select_delegation sorted_pages [] h

(* Compute free intervals from the pages we keep; merge_intervals and
   calculate_free_time_slots are helpers defined elsewhere. *)
let compute_free_intervals remaining_pages =
  let intervals = List.map (fun (s, t, _) -> (s, t)) remaining_pages in
  let merged_intervals = merge_intervals intervals in
  calculate_free_time_slots merged_intervals

(* Main function *)
let () =
  let _p, h, webpages, tasks = read_input () in
  let _delegated_pages, remaining_pages = delegate_webpages webpages h in
  let free_intervals = compute_free_intervals remaining_pages in
  let task_completions = schedule_tasks free_intervals tasks in
  List.iter (fun c -> Printf.printf "%d\n" c) task_completions
```
The following proof comes purely from an AI.
Proof of Correctness for the Algorithm
The algorithm is designed to solve the problem of scheduling computational tasks on a single computer while meeting constraints related to webpage deadlines and delegation. Below is a formal proof of its correctness.
1. Problem Understanding
The problem involves:
Allocating time for webpage writing (either by handling them yourself or delegating to a friend).
Maximizing free intervals for computational tasks.
Minimizing the completion times of computational tasks.
To ensure correctness, we need to prove:
Webpage constraints are respected.
Free time intervals are accurately calculated.
Computational tasks are scheduled optimally.
2. Proof Components
2.1 Webpage Delegation
Key Idea: Delegate up to 2 webpages with the smallest exercise costs such that the total cost does not exceed H.
Proof:
By sorting webpages by h_i (exercise cost) and selecting the top 2 that satisfy h_1 + h_2 ≤ H, the algorithm guarantees that:
The delegation respects the maximum exercise limit H.
The webpages with the least impact on constraints are delegated, maximizing the remaining free time.
2.2 Free Interval Calculation
Key Idea: Compute free intervals by merging blocked intervals caused by non-delegated webpages.
Proof:
Non-delegated webpages define intervals [s_i, t_i) that block the computer.
Merging overlapping intervals ensures no double-counting of blocked time, leaving only valid free intervals.
The algorithm iteratively merges intervals by sorting them by s_i and combining overlapping or adjacent intervals, which is a standard and proven approach.
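A compact sketch of this standard merge step and of taking its complement to get the free slots (plain Python for illustration; the names are not taken from the OCaml snippets above):

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent [s, t) intervals."""
    merged = []
    for s, t in sorted(intervals):
        if merged and s <= merged[-1][1]:          # overlaps or touches the last one
            merged[-1][1] = max(merged[-1][1], t)
        else:
            merged.append([s, t])
    return [tuple(i) for i in merged]

def free_intervals(blocked, horizon):
    """Complement of the merged blocked intervals within [0, horizon)."""
    free, cursor = [], 0
    for s, t in merge_intervals(blocked):
        if cursor < s:
            free.append((cursor, s))
        cursor = max(cursor, t)
    if cursor < horizon:
        free.append((cursor, horizon))
    return free
```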
2.3 Task Scheduling
Key Idea: Fit each computational task into the earliest available free interval to minimize its completion time.
Proof:
Tasks are processed in increasing order of required time (t_i), ensuring that smaller tasks are scheduled earlier, leaving larger free intervals for longer tasks.
For each task:
The algorithm scans free intervals and places the task in the earliest slot where it fits.
This greedy approach guarantees the earliest possible completion time for each task because:
Free intervals are disjoint and sorted.
A task placed in an earlier slot cannot complete earlier in any other configuration without violating disjointness.
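A sketch of this greedy placement, consuming free intervals like those produced above (again illustrative Python, with the no-fit case handled explicitly):

```python
def schedule_tasks(free, durations):
    """free: sorted disjoint (start, end) slots; durations: task lengths.
    Each task (shortest first) is placed in the earliest slot that fits it."""
    free = [list(slot) for slot in free]
    completions = []
    for d in sorted(durations):
        for slot in free:
            if slot[1] - slot[0] >= d:             # earliest slot that fits
                completions.append(slot[0] + d)    # completion time of this task
                slot[0] += d                       # shrink the slot
                break
        else:
            completions.append(None)               # no slot is large enough
    return completions
```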
2.4 Overall Algorithm
Correctness Statement: The algorithm produces the earliest possible completion times for all tasks while satisfying webpage constraints and delegation limits.
Proof:
Webpages: Constraints are respected by delegating up to 2 pages within H exercises and blocking the necessary intervals for non-delegated pages.
Free Intervals: Accurate calculation ensures no overlap or error in available time slots.
Task Scheduling: Greedy placement in the earliest available interval ensures optimal completion times.
3. Edge Cases
P=0,T=0: No webpages or tasks, trivial solution.
H=0: No delegation possible, all webpages must be handled by the user.
Overlapping webpage intervals: Correctly merged to avoid double-counting.
Tasks that do not fit in the earlier free intervals: scheduled in the earliest free interval where they do fit.
4. Conclusion
The algorithm is correct as it:
Adheres to all constraints (webpage deadlines, delegation limits, task requirements).
Ensures optimal scheduling of computational tasks.
Handles edge cases gracefully.