16 posts tagged with "vlm"

Vision-Language Models


π0.5 Review

December 10, 2025 · 5 min read

Gracefullight

Owner

1. Abstract

Core Concept: $\pi_{0.5}$ is a model designed for broad generalization by utilizing co-training on heterogeneous tasks.
Method: It combines hybrid multi-modal examples including image observations, language commands, object detection, semantic subtask prediction, and low-level actions.
Impact: This knowledge transfer is essential for effective generalization, enabling the execution of long-horizon and dexterous manipulation skills in the wild.

2. Introduction

Goal: Design training recipes that provide the breadth of knowledge required for robots to generalize at multiple levels of abstraction, from physical behaviors to scene semantics.
Unified Framework: By casting different modalities into a single sequence modeling framework, VLAs can be trained on diverse sources: robot data, language data, computer vision tasks, and combinations thereof.
Capabilities: The model can control mobile manipulators to perform varied household tasks even in homes never seen during training.
Hierarchical Architecture:
- Training: Pre-trains on a heterogeneous mixture of tasks, then fine-tunes specifically for mobile manipulation using both low-level action examples and high-level semantic actions (e.g., predicting "pick up the cutting board").
- Inference: At runtime, the model first predicts a semantic subtask (inferring appropriate next behavior based on scene semantics) and then predicts the robot action chunk based on this subtask.
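The two-step inference described above can be sketched as a small control-flow example. This is only an illustration under stated assumptions: `predict_subtask`, `predict_actions`, and the toy stubs are hypothetical names standing in for two calls to the same model, and the string observations replace real images.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the two inference calls; neither name comes from the paper.
SubtaskFn = Callable[[str, str], str]          # (observation, task prompt) -> subtask text
ActionFn = Callable[[str, str], List[float]]   # (observation, subtask) -> action chunk


def hierarchical_step(obs: str, prompt: str,
                      predict_subtask: SubtaskFn,
                      predict_actions: ActionFn) -> Tuple[str, List[float]]:
    """Run one high-level pass, then one low-level pass."""
    subtask = predict_subtask(obs, prompt)    # decide what to do next
    actions = predict_actions(obs, subtask)   # decide how to move, given the subtask
    return subtask, actions


# Toy stubs so the sketch runs end to end.
def toy_subtask(obs: str, prompt: str) -> str:
    return "pick up the cutting board" if "cutting board" in obs else "look around"


def toy_actions(obs: str, subtask: str) -> List[float]:
    return [0.0] * 4 if subtask == "look around" else [0.1, 0.0, -0.05, 1.0]


subtask, chunk = hierarchical_step("kitchen with cutting board", "clean the kitchen",
                                   toy_subtask, toy_actions)
```

In a real system both callables would be backed by the same network, so the split here is about what each call is conditioned on, not about separate models.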

3. Model Structure

Figure: π0.5 model architecture

Unified Transformer Architecture

The model corresponds to a transformer taking in $N$ multimodal input tokens $x_{1:N}$ (images, text, and actions) and producing multimodal outputs.
Input Processing: Different token types are processed by specific encoders (e.g., Vision Encoder for images, Embedding Matrix for text).
Output Split: The output is split into two streams:
- Text Logits ( $y^{l}_{1:M}$ ): Used for QA, reasoning, and dividing the task (predicting subtasks $\hat{l}$ ).
- Action Tokens ( $y^{a}_{1:H}$ ): Produced by a separate Action Expert to create continuous outputs for robot control.

Probabilistic Decomposition

The distribution captured by the model is decomposed using the chain rule and a conditional independence assumption:

$$\pi_{\theta}(a_{t:t+H}, \hat{l} | o_{t}, l) = \pi_{\theta}(a_{t:t+H} | o_{t}, \hat{l}) \cdot \pi_{\theta}(\hat{l} | o_{t}, l)$$

Assumption: The action distribution ( $a_{t:t+H}$ ) does not depend on the overall task prompt ( $l$ ), but only on the predicted subtask ( $\hat{l}$ ).
High-Level Inference: $\pi_{\theta}(\hat{l} | o_{t}, l)$ (Predicting "what to do next").
Low-Level Inference: $\pi_{\theta}(a_{t:t+H} | o_{t}, \hat{l})$ (Predicting "how to move").
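As a short derivation sketch, the factorization follows in two steps. The first line is the standard chain rule; the second applies the conditional independence assumption stated above. Nothing here goes beyond those two ingredients.

```latex
\begin{aligned}
\pi_{\theta}(a_{t:t+H}, \hat{l} \mid o_{t}, l)
  &= \pi_{\theta}(a_{t:t+H} \mid o_{t}, l, \hat{l}) \, \pi_{\theta}(\hat{l} \mid o_{t}, l)
  && \text{(chain rule)} \\
  &= \pi_{\theta}(a_{t:t+H} \mid o_{t}, \hat{l}) \, \pi_{\theta}(\hat{l} \mid o_{t}, l)
  && \text{(actions independent of } l \text{ given } o_{t}, \hat{l}\text{)}
\end{aligned}
```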

4. Combining Discrete & Continuous Actions

The model employs a hybrid approach to balance training efficiency with inference speed and quality.

The Dilemma:
- Discrete Tokens (FAST): Fast training, but require slow autoregressive decoding during inference.
- Continuous (Flow Matching): High quality and smooth control, but computationally expensive to train from scratch on massive datasets.
The Solution: Train on discretized actions (FAST) but use Flow Matching for inference.
- Attention Masking: Ensures discrete and continuous action representations do not attend to each other during joint training.

Hybrid Loss Function

The model minimizes a combined objective:

$$\mathbb{E} \left[ \underbrace{H(x, f^l_\theta)}_{\text{Cross Entropy}} + \alpha \underbrace{\| \omega - a - f^a_\theta \|^2}_{\text{MSE for Flow}} \right]$$

Cross Entropy: For text and discrete action tokens.
MSE: For the Flow Matching vector field (Action Expert).
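A minimal numeric sketch of the combined objective for a single toy sample, assuming plain Python lists in place of tensors. The function names and the toy values are illustrative and not taken from the paper; only the structure (cross entropy plus α times the squared flow error) follows the formula above.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def flow_mse(omega: Sequence[float], action: Sequence[float],
             predicted: Sequence[float]) -> float:
    """Squared error between the target (omega - a) and the predicted vector field."""
    return sum((w - a - f) ** 2 for w, a, f in zip(omega, action, predicted))


def hybrid_loss(logits, target, omega, action, predicted, alpha: float) -> float:
    """Cross entropy plus alpha times the flow-matching term, for one sample."""
    return cross_entropy(logits, target) + alpha * flow_mse(omega, action, predicted)


logits = [2.0, 0.5, -1.0]      # toy logits over three text / discrete action tokens
omega = [0.3, -0.2]            # toy noise sample
action = [0.1, 0.4]            # toy ground-truth action
perfect = [w - a for w, a in zip(omega, action)]  # field that matches omega - a exactly

loss_alpha_0 = hybrid_loss(logits, 0, omega, action, [0.0, 0.0], alpha=0.0)
loss_alpha_10 = hybrid_loss(logits, 0, omega, action, [0.0, 0.0], alpha=10.0)
```

With `alpha=0.0` the flow term drops out entirely and only next-token prediction is trained; with a positive α the same sample also penalizes the vector-field error.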

5. Training Recipe

The training is split into two distinct stages based on the $\alpha$ parameter and the inclusion of the Action Expert.

Stage 1: Pre-training ( $\alpha = 0$ )

Goal: Efficient large-scale learning.
Method: Action Expert is OFF. Trains as a standard auto-regressive transformer using next-token prediction for text and discrete FAST action tokens.
Datasets:
- MM: Mobile Manipulator data (100+ homes).
- ME: Multi-Environment non-mobile robots.
- CE: Cross-Embodiment laboratory data (diverse tasks like folding).
- HL: High-Level subtask prediction data.
- WD: Multimodal Web Data (VQA, captioning).

Stage 2: Post-training ( $\alpha = 10.0$ )

Goal: Specialization for mobile manipulation and enabling continuous control.
Method: Action Expert is ON.
- Initialized with random weights.
- Jointly trains next-token prediction (to preserve text capabilities) and Flow Matching for continuous actions.
Key Addition (Verbal Instructions - VI):
- Data collected by "teleoperating" the robot using language commands (e.g., expert users selecting sub-tasks step-by-step).
- Crucial for training the model to predict high-quality subtasks ( $\hat{l}$ ).

6. Evaluation

Methodology

Settings: Tested in entirely new kitchens and bedrooms not seen during training.
Tasks: Long-horizon tasks like cleaning kitchens, putting laundry away, and making beds.
Metrics: Task progress (percentage of steps completed) and Language Following Rate.

Key Findings

Generalization: $\pi_{0.5}$ successfully performs multi-stage tasks in real, unseen homes.
Scaling: Performance improves consistently as the number of training environments increases.
Ablation Studies:
- Cross-Embodiment (CE/ME): Excluding data from other robots significantly degrades performance, indicating strong transfer learning.
- Web Data (WD): While less critical for general task progress, it is essential for Out-of-Distribution (OOD) object generalization and language following.
Comparison: Significantly outperforms $\pi_0$ and the $\pi_0$-FAST+Flow baseline.

7. Conclusions & Future Work

Current Status: $\pi_{0.5}$ demonstrates that co-training with heterogeneous data enables end-to-end robotic systems to perform long-horizon, dexterous skills in open-world settings.
Limitations:
- Struggles with physical constraints (hard-to-open cabinets) or partial observability.
- Limited to relatively simple prompts based on training data.
Future Directions:
- Incorporating richer context and memory for better handling of partial observability.
- Expanding data sources, particularly exploring verbal instructions as a powerful new supervision modality.

Ref

Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., & Fusai, N. (2025). π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054.

VLA Test Review

September 3, 2025 · 6 min read

Gracefullight

Owner

VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

VLATest fuzzes 18,604 manipulation scenes (10 operators, 4 tasks) to systematically stress-test VLA robustness.
Seven VLA models show low success and brittleness to confounders, lighting/camera changes, unseen objects, and instruction mutations; larger pretraining helps.
Priorities: scale/augment demo data (incl. sim2real), use stepwise/CoT prompting & multi-agent setups, and expand benchmarks with online risk assessment.

Motivation & Gap

Problem: Current VLA models are typically evaluated on small, hand-crafted scenes, leaving general performance and robustness in diverse scenarios underexplored.
Goal: Introduce VLATest, a generation-based fuzzing framework that automatically creates robotic manipulation scenes to test performance and robustness of VLA models.

What Are VLA Models?

Vision-Language-Action (VLA) models take natural language instructions + camera images and output low-level robot actions (Δx, Δθ, Δgrip).
Inference loop: Tokenize text/image → transformer predicts action token A₁ → execute → append A₁ + new image tokens I₂ → predict A₂ → … until success or step limit.
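The inference loop above can be sketched as follows. This is a toy under stated assumptions: `policy`, `step_env`, and the string "tokens" are hypothetical placeholders for a real tokenizer, transformer, and simulator.

```python
from typing import Callable, List, Tuple

Action = Tuple[float, float, float]   # toy (dx, dtheta, dgrip)


def rollout(instruction: str,
            first_image: str,
            policy: Callable[[List[str]], Action],
            step_env: Callable[[Action], Tuple[str, bool]],
            max_steps: int = 10) -> Tuple[List[Action], bool]:
    """Autoregressive control loop: predict, execute, append, repeat."""
    context: List[str] = [instruction, first_image]   # stands in for text + image tokens
    actions: List[Action] = []
    for _ in range(max_steps):
        action = policy(context)                 # predict the next action from the full history
        image, done = step_env(action)           # execute it and observe the next image
        actions.append(action)
        context += [f"action:{action}", image]   # append the action and the new image
        if done:
            return actions, True
    return actions, False                        # step limit reached without success


def make_toy_env(steps_to_success: int):
    """Toy environment that reports success after a fixed number of steps."""
    state = {"t": 0}

    def step_env(action: Action) -> Tuple[str, bool]:
        state["t"] += 1
        return f"image_{state['t'] + 1}", state["t"] >= steps_to_success

    return step_env


def toy_policy(context: List[str]) -> Action:
    return (0.01, 0.0, 1.0)


taken, success = rollout("pick up the can", "image_1", toy_policy, make_toy_env(3))
```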

Figure: VLA architecture

Training & Evaluation

Training: (1) Train from scratch on robot demonstrations, or (2) fine-tune a large pretrained VLM with more than 1B parameters (e.g., LLaVA).
Evaluation: Task-specific metrics (e.g., grasp, lift, hold for “pick up”), either in sim (auto-metrics) or real (manual labels).

VLATest Framework

Ten testing operators grouped across:
- Target objects: type, position, orientation
- Confounding objects: type, position, orientation, count
- Lighting: intensity
- Camera: position, orientation
Scene generation (Alg. 1): sample valid targets → (optional) confounders → mutate lighting (factor α) → mutate camera pose (d, θ). Semantic validity checks prevent infeasible scenes.
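A rough sketch of the generation order described above (targets, optional confounders, lighting, camera). It is not the paper's Algorithm 1: the `Scene` fields, the workspace bounds, the lighting range, and the overlap check are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

XY = Tuple[float, float]


@dataclass
class Scene:
    target: str
    target_xy: XY
    confounders: List[Tuple[str, XY]] = field(default_factory=list)
    light_factor: float = 1.0      # alpha; 1.0 means default lighting
    camera_shift_m: float = 0.0    # d
    camera_rot_deg: float = 0.0    # theta


def far_enough(p: XY, q: XY, min_dist: float = 0.05) -> bool:
    """Toy semantic validity check: two objects must not overlap."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 >= min_dist ** 2


def generate_scene(rng: random.Random, objects: List[str], n_confounders: int,
                   mutate_light: bool, mutate_camera: bool) -> Scene:
    def sample_xy() -> XY:
        return (rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3))  # hypothetical workspace

    target, *rest = rng.sample(objects, n_confounders + 1)
    scene = Scene(target, sample_xy())
    for name in rest:
        xy = sample_xy()
        placed = [scene.target_xy] + [c[1] for c in scene.confounders]
        while not all(far_enough(xy, q) for q in placed):
            xy = sample_xy()                     # resample until the placement is valid
        scene.confounders.append((name, xy))
    if mutate_light:
        scene.light_factor = rng.uniform(0.1, 10.0)     # hypothetical alpha range
    if mutate_camera:
        scene.camera_shift_m = rng.uniform(0.0, 0.05)   # hypothetical d range
        scene.camera_rot_deg = rng.uniform(0.0, 5.0)    # hypothetical theta range
    return scene


scene = generate_scene(random.Random(0), ["can", "apple", "sponge", "cup"],
                       n_confounders=2, mutate_light=True, mutate_camera=True)
```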

Figure: VLATest

Research Questions (RQ)

RQ1: Basic performance on popular manipulation tasks
RQ2: Effect of confounding object count
RQ3: Effect of lighting changes
RQ4: Effect of camera pose changes
RQ5: Robustness to unseen objects (OOD)
RQ6: Robustness to instruction mutations

Tasks & Prompting

Tasks:
1. Pick up an object (grasp + lift ≥0.02 m for 5 frames)
2. Move A near B (≤0.05 m)
3. Put A on B (stable stacking)
4. Put A into B (fully inside)
Standard prompts (RQ1–RQ5):
- pick up [obj] · move [objA] near [objB] · put [objA] on [objB] · put [objA] into [objB]
Instruction mutations (RQ6): 10 paraphrases per task (GPT-4o), manually validated for semantic equivalence.
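The Task 1 and Task 2 success criteria listed above can be written as simple checks. This sketch assumes per-frame grasp flags and lift heights measured relative to the object's starting height, and it reads "for 5 frames" as five consecutive frames; both are assumptions, and the function names are made up for illustration.

```python
import math
from typing import Sequence


def pick_up_success(grasped: Sequence[bool], lift_heights: Sequence[float],
                    lift_m: float = 0.02, frames: int = 5) -> bool:
    """True once the object stays grasped and lifted >= lift_m for `frames` frames in a row."""
    run = 0
    for is_grasped, height in zip(grasped, lift_heights):
        run = run + 1 if (is_grasped and height >= lift_m) else 0
        if run >= frames:
            return True
    return False


def moved_near(a_xyz: Sequence[float], b_xyz: Sequence[float],
               max_dist: float = 0.05) -> bool:
    """True if object A ends within max_dist metres of object B."""
    return math.dist(a_xyz, b_xyz) <= max_dist


held = pick_up_success([True] * 6, [0.0, 0.02, 0.03, 0.03, 0.03, 0.03])
dropped = pick_up_success([True] * 6, [0.0, 0.02, 0.03, 0.01, 0.03, 0.03])
```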

Experimental Setup

Scenes: 18,604 across 4 tasks (ManiSkill2).
Models: 7 public VLAs (RT-1-1k/58k/400k, RT-1-X, Octo-small/base, OpenVLA-7b).
Compute: >580 GPU hours.

Key Results & Findings

RQ1 — Overall Performance

VLA models underperform overall; no single model dominates across tasks.
Example best-case rates (default settings): 34.4% (Task1, RT-1-400k), 12.7% (Task2, OpenVLA-7b), 2.2% (Task3, RT-1-X), 2.1% (Task4, Octo-small).
Stepwise breakdown (Task 1): grasp 23.3% → lift 15.7% → hold 12.4% ⇒ difficulty composing sequential actions.
- Implication (Finding 2): Consider stepwise prompting / chain-of-thought to decompose complex tasks.

RQ1 — Coverage Metric

No established coverage for VLA; adopted trajectory coverage (pragmatic).
Increasing cases from n=10 to n=1000 achieved 100% coverage across tasks (object-position novelty relative to workspace).

RQ2 — Confounding Objects

More confounders ⇒ worse performance; models struggle to locate the correct object.
Similarity doesn’t matter much: Mann–Whitney U shows no significant difference between similar vs dissimilar distractors (p = 0.443, 0.614, 0.657, 0.443; effect sizes ≈ 0.23–0.29).

RQ3 — Lighting Robustness

Lighting perturbations significantly hurt performance.
OpenVLA-7b most robust (77.9% of previously passed cases still pass), plausibly due to SigLIP + DINOv2 pretraining and LLaVA 1.5 mixture.
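Sensitivity: even a modest increase in light intensity (α < 2.5) drops success to roughly 0.7× the default; for α > 8, only about 40% of the scenes that pass under default lighting still succeed.
Decreasing light hurts less than increasing it; even at α < 0.2, about 60% still pass.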

RQ4 — Camera Pose Robustness

Small pose changes (≤5° rotation, ≤5 cm shift) reduce success to 34.0% of default.
RT-1-400k most robust (45.6% retain), OpenVLA-7b at 31.3%; Octo models <10%.
- Likely due to training data scale differences.

RQ5 — Unseen Objects

Using YCB (56 unseen objects) leads to large performance drops versus seen objects: avg –74.2%, –66.7%, –66.7%, –20.0% on Tasks 1–4.
Transfer rate across steps:
- $\displaystyle T_r^n = \frac{\text{Success rate}_n}{\text{Success rate}_{n-1}}$ , with $\text{Success rate}_0 = 100\%$
- Paired t-tests show significant differences on $T_r^1$ for Task 1 & 2 (p = 0.011, 0.007; Cohen’s d = 1.34, 0.891).
- Primary failure mode: recognizing/locating unseen objects.
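To make the transfer-rate definition concrete, here is a small sketch that applies it to the Task 1 stepwise rates quoted under RQ1 (grasp 23.3%, lift 15.7%, hold 12.4%). Those numbers are reused purely as an arithmetic illustration and are not the unseen-object results; the function name is hypothetical.

```python
from typing import List, Sequence


def transfer_rates(success_rates: Sequence[float]) -> List[float]:
    """T_r^n = success_rate_n / success_rate_{n-1}, with success_rate_0 = 100%."""
    rates: List[float] = []
    prev = 100.0
    for rate in success_rates:
        # If no case reached the previous step, the ratio is undefined.
        rates.append(rate / prev if prev > 0 else float("nan"))
        prev = rate
    return rates


task1 = transfer_rates([23.3, 15.7, 12.4])   # roughly 0.23, 0.67, 0.79
```

Multiplying the per-step rates back together recovers the final success rate, which is why a low first-step rate dominates the overall outcome.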

RQ6 — Instruction Mutations

Mutated instructions generally reduce performance (avg drops: –32.8% T1, –1.7% T2, –8.3% T3; negligible on T4).
Larger language backbones help: OpenVLA-7b (Llama 2-7B) is more robust, sometimes improving under mutations (e.g., T1, T4).

Implications & Directions

Scale matters: larger pretraining and robot-demo datasets improve robustness (lighting/camera).
Data enrichment: use data augmentation and sim-to-real to diversify external factors; leverage traditional controllers to auto-generate demonstrations.
Prompting strategies: adopt stepwise/CoT prompting; consider multi-agent decompositions.
Benchmarking: the 18,604 VLATest scenes serve as an early benchmark; expand to more tasks/robots/conditions.
Online risk assessment: explore uncertainty estimation and safety monitoring for runtime quality control.

Related Work

Robotics foundation models: (1) LLMs for planning/rewards; (2) Multi-modal FMs (VLMs/VLAs) for manipulation & perception.
CPS testing: gray-box/black-box fuzzing and search-based testing exist, but not directly applicable to VLAs (multimodality, autoregression, scale).
FM evaluation: beyond static benchmarks, VLATest dynamically generates 3D manipulation test cases—distinct from text-only testing.

Threats to Validity (mitigations in study)

Internal: randomness (mitigated by 18,604 scenes); potential prompt bias (mutations manually validated).
External: generalization to other tasks/models; chose popular tasks (Open X-Embodiment) and SOTA public models.
Construct: limited operators (lighting/camera/confounders chosen; future: #lights, camera intrinsics, resolution).
- Coverage: trajectory coverage used as a pragmatic proxy.

Conclusion

VLATest: early, generation-based fuzzing framework (10 operators) for VLA testing in ManiSkill2.
Empirical evidence across 7 models / 4 tasks / 18,604 scenes shows limited robustness (lighting, camera, unseen objects, instruction variation).
Points to data scaling, prompting, benchmarking, and risk assessment as practical paths to more reliable VLA systems.

Ref

Wang, Z., Zhou, Z., Song, J., Huang, Y., Shu, Z., & Ma, L. (2025). VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation. Proceedings of the ACM on Software Engineering, 2(FSE), 1615–1638.

Open X-Embodiment review

September 1, 2025 · 5 min read

Gracefullight

Owner

RT-X

RT-X trains generalist robot policies by co-training RT-1/RT-2 on an X-embodiment mix of multi-robot, multi-task data, enabling efficient adaptation to new robots, tasks, and environments.
It standardizes 1M+ trajectories from 22 embodiments into the Open X-Embodiment (RLDS/tfrecord) repository, unifying observations and 7-DoF actions via coarse alignment.
Experiments show strong positive transfer and emergent skills (≈3× with RT-2-X on cross-robot tasks); performance scales with model capacity, short image histories, and web pretraining, while sensing/actuation diversity and frame alignment remain open problems.

Figure: RT-X architecture

Motivation

Seeks a generalist X-robot policy that can be efficiently adapted to new robots, tasks, and environments.
Mirrors a trend from CV/NLP where general-purpose, web-scale pretrained models outperform narrow, task-specific models.
Robotics lacks comparably large, diverse interaction datasets, making direct transfer of these lessons challenging.

Objectives

Positive transfer: Test whether co-training on data from many robots improves performance on each training domain.
Ecosystem building: Organize large robotic datasets to enable future X-embodiment research.

Core Approach

Train RT-1 and RT-2 on data from 9 different manipulators, producing RT-X variants that outperform policies trained only on the evaluation domain and show better generalization and new capabilities.

What’s Different From Prior Transfer Methods

Many prior works reduce the embodiment gap via specialized mechanisms (shared action spaces, representation learning objectives, policy adaptation using embodiment metadata, decoupled robot/environment representations, domain translation).
RT-X directly trains on X-embodiment data without explicit gap-reduction machinery and still observes positive transfer.

Dataset & Format (Open X-Embodiment)

1M+ real robot trajectories, 22 embodiments (single-arm, bimanual, quadrupeds), pooled from 60 datasets / 34 labs, standardized for easy use.
Uses RLDS (serialized tfrecord), supporting varied action spaces and input modalities (RGB, depth, point clouds), and efficient parallel loading across major DL frameworks.
Language annotations are leveraged; PaLM is used to extract objects/behaviors from instructions.

Figure: RLDS

Data Format Consolidation (Coarse Alignment)

Observations: History of recent images + language instruction. One canonical camera view per dataset is resized to a common resolution.
Actions: Convert original controls to a 7-DoF end-effector vector (x, y, z, roll, pitch, yaw, gripper or their rates). Actions are normalized before discretization; outputs are de-normalized per embodiment.
Deliberate non-alignment: Camera poses/properties are not standardized; action frame alignment across datasets is not enforced. The same action vector may cause different motions on different robots (absolute/relative, position/velocity allowed).
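The normalize, discretize, and de-normalize steps for a 7-DoF action can be sketched as below. The bin count of 256, the uniform binning, and the per-embodiment bounds are assumptions for illustration; the actual statistics and binning used for RT-X may differ.

```python
from typing import List, Sequence

N_BINS = 256  # assumed bin count, not stated in this section


def normalize(action: Sequence[float], low: Sequence[float],
              high: Sequence[float]) -> List[float]:
    """Map each dimension from [low, high] to [-1, 1]."""
    return [2.0 * (a - lo) / (hi - lo) - 1.0 for a, lo, hi in zip(action, low, high)]


def discretize(norm: Sequence[float]) -> List[int]:
    """Map values in [-1, 1] to integer bins 0..N_BINS-1, clipping out-of-range values."""
    tokens = []
    for v in norm:
        v = min(1.0, max(-1.0, v))
        tokens.append(min(N_BINS - 1, int((v + 1.0) / 2.0 * N_BINS)))
    return tokens


def undiscretize(tokens: Sequence[int]) -> List[float]:
    """Map each bin index back to the centre of its bin in [-1, 1]."""
    return [(t + 0.5) / N_BINS * 2.0 - 1.0 for t in tokens]


def denormalize(norm: Sequence[float], low: Sequence[float],
                high: Sequence[float]) -> List[float]:
    """Map from [-1, 1] back to this embodiment's own range."""
    return [lo + (v + 1.0) / 2.0 * (hi - lo) for v, lo, hi in zip(norm, low, high)]


# Hypothetical bounds for one embodiment: xyz deltas, rpy deltas, gripper.
low = [-0.05] * 3 + [-0.5] * 3 + [0.0]
high = [0.05] * 3 + [0.5] * 3 + [1.0]
action = [0.01, -0.02, 0.03, 0.1, -0.1, 0.0, 1.0]

tokens = discretize(normalize(action, low, high))
recovered = denormalize(undiscretize(tokens), low, high)
```

Because the bounds are applied per embodiment, the same token sequence can decode to different physical motions on different robots, which matches the deliberate non-alignment noted above.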

Policy Architectures

RT-1 (≈35M params): Transformer for control. Inputs: 15-frame image history + natural-language instruction.
- Vision via ImageNet-pretrained EfficientNet; language via USE embedding.
- Fuse via FiLM → 81 vision–language tokens → decoder-only Transformer outputs tokenized actions.
RT-2 (VLA family): Internet-scale VLM co-fine-tuned to output action as text tokens (e.g., 1 128 91 241 5 101 127).
- Any pretrained VLM can be adapted; this work uses RT-2–PaLI-X (ViT backbone + UL2 LM; primarily pretrained on WebLI).

Training Setup

Robotics data mixture: Data from 9 manipulators (a union of multiple well-known robotics datasets).
Loss: Standard categorical cross-entropy over tokenized actions.
Regimes:
- RT-1-X: Trained solely on the robotics mixture.
- RT-2-X: Co-fine-tuned on a ~1:1 mix of original VLM data and the robotics mixture.
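One simple way to realize a roughly 1:1 mixture is to pick the data source for each example with equal probability. The sketch below shows that idea only; the generator name and the toy datasets are hypothetical, and the actual RT-2-X data pipeline is not described at this level of detail here.

```python
import random
from itertools import islice
from typing import Iterator, Sequence, Tuple


def mixed_stream(vlm_data: Sequence[str], robot_data: Sequence[str],
                 rng: random.Random, p_robot: float = 0.5) -> Iterator[Tuple[str, str]]:
    """Yield (source, example) pairs, choosing the source with probability p_robot."""
    while True:
        if rng.random() < p_robot:
            yield "robot", rng.choice(robot_data)
        else:
            yield "vlm", rng.choice(vlm_data)


stream = mixed_stream(["caption_1", "vqa_1"], ["traj_1", "traj_2", "traj_3"],
                      random.Random(0))
draws = list(islice(stream, 10_000))
robot_share = sum(1 for source, _ in draws if source == "robot") / len(draws)
```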

Experimental Questions

Does X-embodiment co-training improve in-domain performance (positive transfer)?
Does it improve generalization to unseen tasks?
How do model size, architecture, and dataset composition influence performance/generalization?

Key Results

Small-scale domains: RT-1-X outperforms the Original Method (the authors’ per-dataset baselines) on 4/5 datasets with a large average gain → limited data domains benefit greatly from X-embodiment co-training.
Large-scale domains:
- RT-1-X does not beat an RT-1 trained only on the embodiment-specific large dataset (suggests underfitting for this class).
- RT-2-X (larger capacity) outperforms both Original Method and RT-1 → X-robot training helps even in data-rich regimes when using sufficient capacity.

Generalization & Emergent Skills

Unseen objects/backgrounds/environments: RT-2 and RT-2-X perform on par (VLM backbone already strong here).
Emergent skills (transfer across robots): On Google Robot tasks that do not appear in RT-2’s dataset but exist in Bridge (for WidowX), RT-2-X ≈ 3× RT-2.
- Removing Bridge from RT-2-X training significantly reduces hold-out performance → skills likely transferred from WidowX data.

Design Insights (Ablations)

Short image history notably improves generalization.
Web pretraining is critical for large models’ high performance.
Model capacity matters: 55B model succeeds more than 5B on emergent skills → greater capacity ⇒ greater cross-dataset transfer.
Co-fine-tuning vs. fine-tuning: Similar performance in this study (attributed to the greater diversity of robotics data in RT-2-X vs. prior works).

Limitations (Open Problems)

Does not cover robots with very different sensing/actuation modalities.
Does not study generalization to new robots nor define a decision criterion for when positive transfer will occur.
Camera pose/properties and control frame remain unaligned; a deliberate but still challenging domain gap to address in future work.

Ref

O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., & Jain, A. (2024). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. 2024 IEEE International Conference on Robotics and Automation (ICRA).

π0 Review

August 31, 2025 · 4 min read

Gracefullight

Owner

π0

Problem & Motivation

Achieving real-world generality in robot learning is blocked by data scarcity, generalization, and robustness limits.
Human intelligence most outpaces machines in versatility—solving diverse, physically situated tasks under constraints, language commands, and perturbations.
In NLP/CV, foundation models pre-trained on diverse multi-task data, then fine-tuned (aligned) on curated datasets, outperform narrow specialists; the same paradigm is hypothesized for robotics.

Core Proposal

A novel flow-matching architecture built on a pre-trained Vision-Language Model (VLM) to inherit Internet-scale semantics.
Further training adds robot actions, turning the model into a Vision-Language-Action (VLA) policy.
Use cross-embodiment training to combine data from many robot types (single/dual-arm, mobile), despite differing configuration/action spaces.
Employ action chunking + flow matching (diffusion variant) to model complex, continuous, high-frequency actions.
Introduce an Action Expert (separate weights for action/state tokens), akin to a Mixture-of-Experts, augmenting the standard VLM.

Training Recipe (Pre- vs Post-Training)

Pre-training on highly diverse data builds broad, general physical abilities.
Post-training on curated, task-specific data instills fluent, efficient strategies.
Rationale: high-quality-only training lacks recovery behaviors; low-quality-only training lacks efficiency/robustness; combining both yields desired behavior.

Data & Backbone

~10,000 hours of demonstrations + the OXE dataset; data spans 7 robot configurations and 68 tasks.
VLM backbone initialized from PaliGemma (3B); add ~300M parameters for the action expert (total ~3.3B).
Pre-training mixture: weighted combination of internal datasets + full OXE; n^0.43 weighting to down-weight overrepresented task-robot pairs.
Unify interfaces: zero-pad $q_t$ and $a_t$ to the largest robot dimension (18); mask missing image slots; late-fusion encoders map images/states to the same token space as language.
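Two of the steps above are easy to show numerically: the $n^{0.43}$ down-weighting and the zero-padding to 18 dimensions. The dataset names and sizes below are invented for illustration, and normalizing the weights to sum to 1 is an assumption about how they would be used as sampling probabilities.

```python
from typing import Dict, List, Sequence


def mixture_weights(counts: Dict[str, int], exponent: float = 0.43) -> Dict[str, float]:
    """Weight each task-robot pair by n ** exponent, then normalize to sum to 1."""
    raw = {name: n ** exponent for name, n in counts.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def zero_pad(vec: Sequence[float], dim: int = 18) -> List[float]:
    """Pad a state or action vector with zeros up to the largest robot dimension."""
    if len(vec) > dim:
        raise ValueError("vector is longer than the padded dimension")
    return list(vec) + [0.0] * (dim - len(vec))


counts = {"large_pair": 1_000_000, "small_pair": 1_000}   # hypothetical sizes
weights = mixture_weights(counts)
padded = zero_pad([0.1] * 7)                              # e.g. a 7-DoF arm
```

With these toy sizes the large pair would make up about 99.9% of a size-proportional mixture, while the sub-linear exponent pulls its share down noticeably without inverting the ordering.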

Modeling Details

Conditional flow matching models the continuous distribution over action chunks.
Train with a diffusion-style loss on individual sequence elements (instead of cross-entropy), with separate weights for diffusion-related tokens.
Flow path uses a linear-Gaussian schedule; sample noisy actions with ε∼N(0, I); predict denoising vector field; Euler integration from τ=0→1 at inference.
Efficient inference by caching K/V for the observation prefix; action tokens recomputed per integration step.
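The Euler integration step can be illustrated with a toy vector field. The sketch follows the direction given above (τ = 0 is pure noise, τ = 1 is the action) and assumes 10 integration steps. The oracle field is a stand-in for the learned network: it is the exact field for a straight-line path to one known target action, so it only demonstrates the integrator. Sign conventions for the field differ between write-ups.

```python
import random
from typing import Callable, List, Sequence

VectorField = Callable[[Sequence[float], float], List[float]]


def euler_sample(vector_field: VectorField, noise: Sequence[float],
                 n_steps: int = 10) -> List[float]:
    """Integrate dx/dtau = v(x, tau) from tau = 0 to tau = 1 with forward Euler."""
    x = list(noise)
    delta = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * delta
        v = vector_field(x, tau)
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_field(target: Sequence[float]) -> VectorField:
    """Toy stand-in for the network: exact field of the straight path toward `target`."""
    def field(x: Sequence[float], tau: float) -> List[float]:
        return [(a - xi) / (1.0 - tau) for a, xi in zip(target, x)]
    return field


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]   # epsilon ~ N(0, I)
target = [0.2, -0.1, 0.5]                          # hypothetical action
sample = euler_sample(make_oracle_field(target), noise)
```

In the real model the field is predicted by the action expert at every step, which is why only the action tokens need recomputing while the observation prefix can be cached.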

High-Level Language Policy

Because the policy consumes language, a high-level VLM can decompose tasks (e.g., bussing) into intermediate language subgoals (SayCan-style planning), improving performance on complex, temporally extended tasks.

Evaluation Setup & Baselines

Out-of-box (direct prompting), fine-tuning on downstream tasks, and with high-level VLM providing intermediate commands.
Compare against OpenVLA (7B, autoregressive discretization; no action chunks/high-frequency control) and Octo (93M; diffusion), trained on the same mixture.
Include a compute-parity π0 (160k steps vs 700k) and a π0-small variant (no VLM init).

Key Results

Out-of-box: π0 outperforms all baselines; even compute-parity π0 beats OpenVLA/Octo; π0-small still surpasses them—highlighting the benefits of expressive architectures + diffusion/flow matching + VLM pre-training.
Language following: π0 clearly exceeds π0-small across conditions:
- π0-flat: only overall task command.
- π0-human: human-provided intermediate steps.
- π0-HL: high-level VLM-provided steps (fully autonomous).
- Better language-following accuracy directly translates into stronger autonomous performance with high-level guidance.
New dexterous tasks (e.g., bowls stacking, towel folding, microwave, drawer items, paper towel replacement):
- Fine-tuned π0 generally outperforms OpenVLA, Octo, and small-data methods ACT / Diffusion Policy.
- Pre-training helps most when tasks resemble pre-training data; pretrained π0 often beats from-scratch by up to 2×.
Complex multi-stage tasks (laundry folding, table bussing, box building, to-go box, eggs):
- π0 solves many tasks; full pre-training + fine-tuning performs best.
- Gains from pre-training are especially large on harder tasks; absolute performance varies with task difficulty and pre-training coverage.

Takeaways & Limitations

π0 mirrors LLM training: pre-train for knowledge, post-train for alignment (instruction-following and execution).
Limitations/open questions:
- Optimal composition/weighting of pre-training data remains unclear.
- Not all tasks work reliably; difficult to predict how much/what kind of data is needed for near-perfect performance.
- Uncertain positive transfer across very diverse tasks/robots and to distinct domains (e.g., driving, navigation, legged locomotion).

Ref

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L. X, … Zhilinsky, U. (2025, June 21). π₀: A vision-language-action flow model for general robot control. Robotics: Science and Systems (RSS), Los Angeles, CA, United States. https://roboticsconference.org/program/papers/10/

Vima Review

August 31, 2025 · 2 min read

Gracefullight

Owner

VIMA

Unified Multimodal Prompts: Reformulates diverse robot tasks (language, images, video) into a single sequence modeling problem.
Object-Centric Tokenization: Uses object-level tokens (Mask R-CNN + ViT) instead of raw pixels, improving data efficiency and semantic generalization.
Cross-Attention Conditioning: Conditions the policy on prompts via cross-attention, maintaining strong zero-shot performance even with small models or novel tasks.

Motivation

Robot task specification comes in many forms: one-shot demonstrations, language instructions, and visual goals.
Traditionally, each task required distinct architectures and pipelines, leading to siloed systems with poor generalization.

Figure: VIMA architecture

Key Contributions

Multimodal Prompting
- A novel formulation that unifies diverse robot manipulation tasks into a sequence modeling problem.
- Prompts are defined as interleaved sequences of text and images, enabling flexibility across task formats.
VIMA-BENCH
- A large-scale benchmark with 17 tasks across six categories (object manipulation, goal reaching, novel concept grounding, video imitation, constraint satisfaction, visual reasoning).
- Provides 650K expert trajectories and a four-level evaluation protocol for systematic generalization.
VIMA Agent
- A transformer-based visuomotor agent with encoder-decoder architecture and object-centric design.
- Encodes prompts with a pre-trained T5 model, parses images into object tokens via Mask R-CNN + ViT, and decodes actions autoregressively using cross-attention.

Design Insights

Object-Centric Representation: Passing variable-length object token sequences directly to the controller is more effective than pixel-based tokenization.
Cross-Attention Conditioning: Stronger prompt focus and efficiency compared to simple concatenation (e.g., GPT-style).
Robustness: Minimal degradation under distractors or corrupted prompts, aided by T5 backbone and object augmentation.

Results

Performance:
- Outperforms baselines (VIMA-Gato, VIMA-Flamingo, VIMA-GPT) by up to 2.9× success rate in hardest zero-shot generalization.
- With 10× less training data, still 2.7× better than best competitor.
Scaling:
- Sample-efficient: with just 1% of data, matches baselines trained with 10× more.
- Generalization holds across L1–L4 evaluation, with smaller regression than alternatives.

Conclusion

VIMA demonstrates that multimodal prompting is a powerful unifying framework for robot learning.
It achieves strong scalability, data efficiency, and generalization, establishing a solid starting point for future generalist robot agents.

Ref

Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., & Fan, L. (2023). VIMA: Robot Manipulation with Multimodal Prompts. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v202/jiang23b.html

RoboFlamingo Review

August 31, 2025 · 2 min read

Gracefullight

Owner

RoboFlamingo

RoboFlamingo decouples vision-language understanding and control, using OpenFlamingo for perception and a lightweight policy head for sequential decision-making.
Unlike prior VLM-based approaches, it requires only small-scale imitation fine-tuning on language-conditioned manipulation data, without large-scale co-fine-tuning.
This design enables data-efficient, zero-shot generalizable, and deployable robot manipulation policies on modest compute resources.

Key Idea

Proposes RoboFlamingo, a simple framework to adapt existing VLMs for robotic manipulation with lightweight fine-tuning.
Built on OpenFlamingo, decoupling vision-language understanding from decision-making.
Pre-trained VLM handles language and visual comprehension, while a dedicated policy head models sequential history.
Fine-tuned only on language-conditioned manipulation datasets using imitation learning.

Advantages

Requires only a small amount of demonstrations to adapt to downstream manipulation tasks.
Provides open-loop control capability → deployable on low-performance platforms.
Can be trained/evaluated on a single GPU server, making it a cost-effective and accessible solution.

Benchmarks

Evaluated on CALVIN benchmark (34 tasks, 1000 instruction chains).
RoboFlamingo achieves 2× performance improvements over previous state-of-the-art methods.

Performance

Imitation Learning: Outperforms all baselines across all metrics.
Zero-shot Generalization:
- Vision: Stronger generalization in ABC→D setting.
- Language: Robust to GPT-4 generated synonymous instructions.
Ablation Studies:
- Ignoring history (MLP w/o hist) gives worst results.
- LSTM and GPT-based policy heads perform best (LSTM chosen as default).
- VL pre-training is crucial for downstream manipulation.
- Larger VLMs show better data efficiency.
- Instruction fine-tuning improves both seen and unseen tasks.

Flexibility of Deployment

Supports open-loop control by predicting entire action sequences with a single inference → reduces latency and test-time compute.
Direct open-loop use without retraining can degrade performance; mitigated with jump-step demonstrations.

Conclusion

Demonstrates that pre-trained VLMs enable data efficiency and strong zero-shot generalization in robotic manipulation.
RoboFlamingo is presented as an intuitive, efficient, and open solution, with high potential when combined with large-scale real robot data.

Ref

Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., & Liu, H. (2024). Vision-language foundation models as effective robot imitators. International Conference on Learning Representations (ICLR 2024), Vienna, Austria.

OpenVLA Review

August 29, 2025 · 3 min read

Gracefullight

Owner

OpenVLA

OpenVLA is a 7B open-source VLA model built on Llama 2 + DINOv2 + SigLIP, trained on 970k demos, achieving stronger generalization and robustness than closed RT-2-X (55B) and outperforming Diffusion Policy.
It introduces efficient adaptation via LoRA (1.4% params, 8× compute reduction) and 4-bit quantization (half memory, same accuracy), enabling fine-tuning and inference on consumer GPUs.
Limitations remain (single-image input, <90% reliability, limited throughput), but OpenVLA provides the first open, scalable framework for generalist robot policies.

Figure: OpenVLA architecture

Motivation

Training robot policies from scratch struggles with robustness and generalization.
Fine-tuning vision-language-action (VLA) models offers reusable, generalizable visuomotor policies.
Barriers: prior VLAs are closed-source, lack best practices for adaptation, and need server-class hardware.

Model & Training

OpenVLA: 7B parameters, open-source.
Built on Llama 2 with fused DINOv2 + SigLIP vision encoders.
Trained on 970k robot demonstrations from Open-X Embodiment dataset.
Represents robot actions as tokens (discretized into 256 bins, replacing unused Llama tokens).
Standard next-token prediction objective.
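
A minimal sketch of the action-as-token idea, assuming uniform bins over a per-dimension range and assuming the 256 action bins take over the last 256 ids of a 32,000-entry vocabulary. The range bounds, the function names, and the exact id layout are illustrative assumptions rather than details stated above.

```python
VOCAB_SIZE = 32000   # Llama 2 tokenizer size
N_BINS = 256


def action_to_token_ids(action, low, high, vocab_size=VOCAB_SIZE, n_bins=N_BINS):
    """Map each continuous action dimension to one token id (assumed layout: last n_bins ids)."""
    ids = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)
        # Uniform bins over [lo, hi]; the upper edge falls into the last bin.
        bin_index = min(int((clipped - lo) / (hi - lo) * n_bins), n_bins - 1)
        ids.append(vocab_size - n_bins + bin_index)
    return ids


def token_ids_to_action(ids, low, high, vocab_size=VOCAB_SIZE, n_bins=N_BINS):
    """Invert the mapping by returning the centre of each bin."""
    action = []
    for token_id, lo, hi in zip(ids, low, high):
        bin_index = token_id - (vocab_size - n_bins)
        action.append(lo + (bin_index + 0.5) * (hi - lo) / n_bins)
    return action


low, high = [-1.0] * 3, [1.0] * 3          # hypothetical per-dimension bounds
ids = action_to_token_ids([-1.0, 0.0, 1.0], low, high)
print(ids, token_ids_to_action(ids, low, high))
```

Because every action dimension becomes an ordinary vocabulary id, the standard next-token objective applies unchanged; the cost is a quantization error of at most half a bin width per dimension.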

Architecture & Approach

End-to-end fine-tuning of VLM to generate robot actions as tokens.
Differs from modular methods (e.g., Octo) that stitch separate encoders/decoders.
Vision features are obtained by encoding the same input image with both SigLIP and DINOv2, concatenating the two feature maps channel-wise, and passing the result through an MLP projector. This preserves SigLIP's semantic alignment with language and DINOv2's spatial reasoning, giving the VLM richer multimodal context for manipulation tasks.
Uses Prismatic VLM backbone with multi-resolution features (spatial reasoning + semantics).
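
The fusion step described above can be sketched in plain Python. All sizes are toy values, the weights are random, and ReLU stands in for whatever activation the real projector uses; none of these are taken from the paper.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse_and_project(siglip_patches, dino_patches, w1, b1, w2, b2):
    """Concatenate per-patch features channel-wise, then apply a 2-layer MLP."""
    if len(siglip_patches) != len(dino_patches):
        raise ValueError("both encoders must produce the same number of patches")
    tokens = []
    for s, d in zip(siglip_patches, dino_patches):
        fused = list(s) + list(d)                               # channel-wise concatenation
        hidden = [max(0.0, h) for h in linear(fused, w1, b1)]   # ReLU as a stand-in activation
        tokens.append(linear(hidden, w2, b2))
    return tokens


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N_PATCHES, SIGLIP_DIM, DINO_DIM, HIDDEN_DIM, LLM_DIM = 4, 3, 2, 8, 6  # toy sizes
siglip = random_matrix(N_PATCHES, SIGLIP_DIM, rng)
dino = random_matrix(N_PATCHES, DINO_DIM, rng)
w1, b1 = random_matrix(HIDDEN_DIM, SIGLIP_DIM + DINO_DIM, rng), [0.0] * HIDDEN_DIM
w2, b2 = random_matrix(LLM_DIM, HIDDEN_DIM, rng), [0.0] * LLM_DIM
visual_tokens = fuse_and_project(siglip, dino, w1, b1, w2, b2)
print(len(visual_tokens), len(visual_tokens[0]))
```

Concatenating along the channel axis keeps the number of visual tokens equal to the number of patches, so using two encoders widens each token instead of lengthening the sequence the language model must process.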

Performance

Outperforms closed RT-2-X (55B) by 16.5 percentage points in absolute task success rate with 7× fewer parameters.
Beats Diffusion Policy (from-scratch imitation learning) by +20.4% on multi-task language-grounded settings.
Demonstrates robust behaviors (distractor resistance, error recovery).

Efficiency

Introduces parameter-efficient fine-tuning:
- LoRA updates only 1.4% of parameters yet matches full fine-tuning.
- Can fine-tune on a single A100 GPU in ~10–15 hours (8× compute reduction).
Quantization:
- 4-bit inference matches bfloat16 accuracy while halving memory footprint.
- Runs at 3Hz on consumer GPUs (e.g., A5000, 16GB).
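
For intuition on why LoRA touches so few parameters, here is a minimal sketch of a low-rank update on a single dense layer, together with the fraction of that layer's parameters it trains. The layer shape, the rank, and the helper names are illustrative assumptions; the resulting fraction describes this toy layer and is not a reproduction of the 1.4% figure.

```python
def matvec(matrix, x):
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def lora_forward(w, a, b, x, alpha):
    """Frozen weight W plus a scaled low-rank update: W x + (alpha / r) * B (A x)."""
    rank = len(a)                      # A has shape (r, d_in), B has shape (d_out, r)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / rank) * u for y, u in zip(base, update)]


def trainable_fraction(d_in, d_out, rank):
    """Share of a dense layer's parameters that LoRA trains (A and B only)."""
    return (rank * d_in + d_out * rank) / (d_in * d_out)


# Hypothetical 4096 x 4096 projection with rank 32.
print(trainable_fraction(4096, 4096, 32))
```

Only `a` and `b` receive gradients while `w` stays frozen, which is where the savings in optimizer state and compute come from.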

Evaluations

Tested across 29 tasks and multiple robots (WidowX, Google robot, Franka).
Strong generalization on:
- Visual (unseen backgrounds/distractors).
- Motion (new object positions/orientations).
- Physical (new object shapes/sizes).
- Semantic (unseen tasks, instructions).
First generalist open-source VLA achieving ≥50% success rate across all tested tasks.

Design Insights

Fine-tuning the vision encoder (vs. freezing it) is crucial for robotic control.
Higher image resolution (384px vs. 224px) adds 3× compute without performance gains.
Training required 27 epochs, far more than typical VLM runs, to surpass 95% action token accuracy.

Limitations & Future Work

Supports only single-image observations (no proprioception, no history).
Inference throughput (~6Hz on RTX 4090) insufficient for high-frequency control (e.g., ALOHA at 50Hz).
Success rates remain below 90% in challenging tasks.
Open questions:
- Impact of base VLM size on performance.
- Benefits of co-training with Internet-scale data.
- Best visual features for VLAs.

Contributions

First open-source generalist VLA with strong performance.
Scalable end-to-end training pipeline (action-as-token).
Demonstrates LoRA + quantization for consumer-grade GPU adaptation.
Provides code, checkpoints, and data curation recipes to support future research.

Ref

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E. P., Sanketi, P. R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., & Finn, C. (2025). OpenVLA: An Open-Source Vision-Language-Action Model. Proceedings of The 8th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v270/kim25c.html

Octo Review

August 27, 2025 · 4 min read

Gracefullight

Owner

Octo

Octo is a transformer-based policy with modular tokenizers (language via T5, images via CNN patches), blockwise masking, and readout tokens, trained on 800k multi-robot trajectories.
Actions are generated through a diffusion head that produces continuous, multimodal, chunked predictions, enabling precise control and broad generalization.
It achieves state-of-the-art zero-shot performance across 7 robots and allows efficient finetuning to new sensors and action spaces, while being fully open-source.

| Category | Simple Analogy | Actual Tokenization |
| --- | --- | --- |
| Language | `[Sentence]` | `[l₁, l₂, l₃, …]` → multiple tokens from a tokenized sentence |
| Goal Image | `[Goal]` | `[g₁, g₂, g₃, …]` → image split into patches |
| Observation (time t) | `[Observation]` | `[oₜ¹, oₜ², oₜ³, …]` → camera frames/sensors tokenized into patches |
| Readout Token | `[ ]` (empty slot) | `[TR,t]` → one per timestep, reserved for predicting actions |

Time t-1: [l] [g] [o_{t-1}] [TR,t-1]
Time t:   [l] [g] [o_t]     [TR,t]
Time t+1: [l] [g] [o_{t+1}] [TR,t+1]

[TR,t-1], [TR,t], [TR,t+1]  ──►  Diffusion head  ──►  [a_t, a_{t+1}, …]

Motivation

Traditional robot learning trains policies from scratch on robot/task-specific datasets → costly data collection, narrow generalization.
Generalist Robot Policies (GRPs) pretrained on diverse robots/tasks can be finetuned with little in-domain data while generalizing broadly.
Real-world deployments face challenges across robot embodiments, sensor setups, action spaces, task specs, and environments.

Prior GRPs & Gaps

GRPs aim for low-level visuomotor control across tasks, environments, and robotic systems.
Existing models often have restricted inputs (e.g., a single camera) and lack efficient finetuning to new domains; importantly, the largest models are not publicly available.

Contribution (What is Octo?)

Octo: a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
Accepts language instructions or goal images, and can be finetuned within hours on consumer GPUs to new sensors and action spaces.
First GRP to support effective finetuning to new observations and actions and to be fully open-source (training pipeline, checkpoints, data).
Novelty lies in combining: transformer backbone + language/goal image conditioning + diffusion head for expressive action distributions.

Architecture

Input tokenizers:
- Language via pretrained T5-base
- Images via shallow CNN → patch tokens
Transformer backbone: processes unified token sequence.
Blockwise masking + Readout tokens:
- Nonexistent modalities are masked
- Readout tokens attend to the task tokens and to past and current observation tokens, but no task or observation token attends to a readout token
Diffusion action head: predicts continuous, multimodal, chunked actions.
Modularity: new sensors/outputs can be added by only training lightweight encoders or heads; pretrained backbone remains unchanged.
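
The attention pattern described above can be sketched as a boolean mask over a toy token sequence. The layout (task tokens first, then observation tokens and one readout token per timestep) and the rule that a readout token sees only itself among readout tokens are one reading of the description, stated here as assumptions; masking of missing modalities is not covered.

```python
def build_tokens(n_task, n_obs_per_step, horizon):
    """Return (kind, timestep) tags; task tokens carry timestep -1."""
    tokens = [("task", -1)] * n_task
    for t in range(horizon):
        tokens += [("obs", t)] * n_obs_per_step
        tokens.append(("readout", t))
    return tokens


def can_attend(query, key, same_token):
    q_kind, q_t = query
    k_kind, k_t = key
    if k_kind == "readout":
        return same_token            # nothing reads a readout token except itself
    if k_kind == "task":
        return True                  # task tokens are visible to every token
    if q_kind == "task":
        return False                 # task tokens do not look at observations
    return k_t <= q_t                # observations: same or earlier timestep only


def blockwise_mask(tokens):
    return [[can_attend(q, k, i == j) for j, k in enumerate(tokens)]
            for i, q in enumerate(tokens)]


tokens = build_tokens(n_task=2, n_obs_per_step=2, horizon=2)
mask = blockwise_mask(tokens)
for (kind, t), row in zip(tokens, mask):
    print(f"{kind:8s}{t:3d} ", "".join("x" if allowed else "." for allowed in row))
```

Since no task or observation token reads from a readout token, readout tokens and their heads can be added or swapped without changing what the backbone computes for the other tokens, which is what makes the modular finetuning above possible.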

Octo Architecture

Training Data & Objective

Mixture of 25 heterogeneous robot datasets: diverse robots, sensors (with/without wrist cams), labels (with/without language).
Conditional diffusion decoding predicts continuous, multimodal action distributions.
- Transformer runs one forward pass; denoising steps are contained in the small diffusion head.
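
A sketch of why the denoising loop is cheap: the backbone is called once to produce an embedding, and only the small head runs at every denoising step. The update below is a standard DDPM-style ancestral sampling step used as a stand-in; the noise schedule, the toy backbone, and the toy head are hypothetical and do not reproduce Octo's actual components.

```python
import math
import random


def sample_action_chunk(backbone, noise_head, observation, chunk_len, action_dim, betas, rng):
    """DDPM-style ancestral sampling conditioned on one backbone embedding."""
    stats = {"backbone_calls": 0, "head_calls": 0}
    embedding = backbone(observation)            # the expensive transformer pass, run once
    stats["backbone_calls"] += 1

    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(chunk_len * action_dim)]  # start from noise
    for k in reversed(range(len(betas))):
        eps = noise_head(x, embedding, k)        # only the small head runs per step
        stats["head_calls"] += 1
        coef = betas[k] / math.sqrt(1.0 - alpha_bars[k])
        sigma = math.sqrt(betas[k]) if k > 0 else 0.0   # no noise on the final step
        x = [(xi - coef * ei) / math.sqrt(alphas[k]) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    chunk = [x[i * action_dim:(i + 1) * action_dim] for i in range(chunk_len)]
    return chunk, stats


def toy_backbone(observation):
    """Hypothetical stand-in for the transformer and its readout token."""
    return [sum(observation) / len(observation)]


def toy_noise_head(noisy_actions, embedding, step):
    """Hypothetical stand-in for the learned denoiser (not a trained model)."""
    return [a - embedding[0] for a in noisy_actions]


betas = [0.02] * 20   # hypothetical noise schedule
chunk, stats = sample_action_chunk(toy_backbone, toy_noise_head, [0.1, 0.2, 0.3],
                                   chunk_len=4, action_dim=7, betas=betas,
                                   rng=random.Random(0))
print(stats, len(chunk), len(chunk[0]))
```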

Experiments

Evaluated on 7 robotic platforms across 4 institutions.
Key questions:
1. Zero-shot multi-robot control?
2. Do Octo weights improve finetuning vs. scratch or standard pretrained representations?
3. Which design choices matter for generalist robot policies?

Results

Achieves state-of-the-art zero-shot multi-robot control among openly available policies: outperforms RT-1-X and performs on par with the much larger RT-2-X.
Provides a versatile policy initialization: significantly outperforms baselines for data-efficient finetuning to new obs/action spaces.

Limitations / Future Work

Needs better language conditioning, improved wrist camera support, and data beyond optimal demonstrations.

One-line Takeaway

Octo = modular, efficient, open-source GRP:
A transformer + diffusion policy trained on large-scale multi-robot data that adapts quickly with little in-domain data to new sensors and action spaces, enabling broad generalization.

Ref

Mees, O., Ghosh, D., Pertsch, K., Black, K., Walke, H. R., Dasari, S., Hejna, J., Kreiman, T., Xu, C., & Luo, J. (2024). Octo: An open-source generalist robot policy. First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024.

RT-2, Robotic Transformer 2 Review

August 24, 2025 · 4 min read

Gracefullight

Owner

Trains a Vision-Language-Action (VLA) model by co-fine-tuning web-scale VLMs with robot trajectories, and treats robot actions as text tokens.
Yields strong generalization and emergent capabilities (symbol understanding, reasoning, human recognition) beyond what appears in robot data.
Runs in direct closed-loop control; largest evaluated model (55B) executes at ~1–3 Hz via a cloud (multi-TPU) inference setup.

RT-2 Architecture

What RT-2 Is

A family of VLA models (RT-2-PaLI-X, RT-2-PaLM-E) that fine-tune large VLMs on robot trajectories to output low-level actions.
Target: generalizable, semantically aware manipulation policies that map images + instructions → actions end-to-end.
RT-2 does not rely on a restricted 2D action space or calibrated cameras.
The unified output space lets language and action tokens share the same model weights, without action-only layers.

Core Recipe

Directly train open-vocabulary VQA/dialogue VLMs to output robot actions while they still solve standard vision-language tasks.
Build on RT-1 protocol/data, but replace the policy backbone with a large VLM.

Action as Language (Tokenization)

Discretize continuous action dims (Δpos/Δrot, gripper, terminate) into 256 bins; represent each dimension with an integer token.
PaLI-X: integers up to 1000 each already have their own token, so each bin maps directly to its numeric token. PaLM-E: overwrite the 256 least-frequent tokens to serve as the action vocabulary (a form of symbol tuning).
Form a single output string per step (e.g., `terminate Δpos_x Δpos_y Δpos_z Δrot_x Δrot_y Δrot_z gripper`).
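
To show how such an output string becomes a robot command again, here is a minimal decoding sketch. It assumes the first field is a 0/1 termination flag, that the remaining seven fields are bin indices in 0..255, and that bins are uniform over the given ranges; the ranges and field names are hypothetical.

```python
N_BINS = 256
FIELDS = ("terminate", "dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper")


def bin_center(index, low, high, n_bins=N_BINS):
    """Continuous value at the centre of a uniform bin."""
    if not 0 <= index < n_bins:
        raise ValueError(f"bin index {index} outside [0, {n_bins})")
    return low + (index + 0.5) * (high - low) / n_bins


def decode_action_string(text, motion_range=(-1.0, 1.0), gripper_range=(0.0, 1.0)):
    """Parse 'terminate dx dy dz droll dpitch dyaw gripper' into a dict of values."""
    parts = [int(p) for p in text.split()]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} integers, got {len(parts)}")
    action = {"terminate": parts[0] == 1}
    for name, index in zip(FIELDS[1:7], parts[1:7]):
        action[name] = bin_center(index, *motion_range)
    action["gripper"] = bin_center(parts[7], *gripper_range)
    return action


print(decode_action_string("0 128 91 241 5 101 127 255"))
```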

Co-Fine-Tuning & Output Constraint

Mix robot data with original web VQA/caption data in training batches (up-weight robot samples) to prevent forgetting and improve generalization.
During decoding on robot tasks, restrict sampling to valid action tokens so outputs are always executable.
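
The output constraint amounts to renormalizing the next-token distribution over the action vocabulary before sampling. The sketch below does this on a toy six-token vocabulary; the logits and the choice of which ids count as action tokens are made up for illustration.

```python
import math
import random


def constrained_probs(logits, valid_ids):
    """Softmax over valid token ids only; every other id gets probability zero."""
    valid = sorted(set(valid_ids))
    if not valid:
        raise ValueError("need at least one valid token id")
    peak = max(logits[i] for i in valid)                 # subtract the max for stability
    weights = {i: math.exp(logits[i] - peak) for i in valid}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_action_token(logits, valid_ids, rng):
    probs = constrained_probs(logits, valid_ids)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


# Toy vocabulary of 6 ids; ids 2..4 play the role of action tokens (hypothetical).
logits = [9.0, 1.0, 0.5, 2.0, 1.5, 7.0]
token = sample_action_token(logits, valid_ids=range(2, 5), rng=random.Random(0))
print(token)
```

Even though ids 0 and 5 have the largest logits here, they can never be emitted, so every decoded string parses as an action.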

Closed-Loop Control & Real-Time Inference

RT-2 is trained and deployed for direct closed-loop control (camera → action → camera …), not just high-level planning.
For large models, inference runs via a multi-TPU cloud service; RT-2-PaLI-X-55B reaches ~1–3 Hz; smaller models ~5 Hz.

Generalization & Benchmarks

Matches RT-1 on seen tasks but far exceeds baselines on unseen objects/backgrounds/environments (~2× vs RT-1/MOO; up to ~6× vs others).
Open-source Language-Table sim: co-fine-tuned PaLI-3B outperforms baselines, showing the approach transfers to other robots/sims.

Emergent Capabilities

Symbol understanding (e.g., “move apple to 3 / heart / star”).
Reasoning (visual matching, simple math like “sum of two plus one”, multilingual commands).
Human recognition (e.g., “person with glasses”); none of these were present as low-level actions in robot data.
Chain-of-thought (CoT) variant adds a Plan step before actions → supports multi-stage semantic reasoning (e.g., pick a rock as an improvised hammer; pick an energy drink for a tired person).

rt-2-cot

Scaling & Ablations

From-scratch training (even 5B) performs poorly; fine-tuning helps; co-fine-tuning helps most.
Bigger models (55B > 5B) generalize better.
PaLM-E variant shows an edge on math reasoning; PaLI-X stronger on symbols/vision reasoning on average.

Limitations

Does not learn fundamentally new motor skills beyond the distribution in robot data; mainly transfers semantic/visual knowledge.
Compute and latency costs are high, so inference can become the bottleneck for real-time control. Few strong open VLMs or convenient fine-tuning APIs are available.

Future Directions (from the text)

Acquire new skills from human videos or richer datasets.
Quantization/distillation for faster/cheaper inference.
More open VLMs / FT APIs to make VLA models broadly buildable.

Ref

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P. R., Salazar, G., Ryoo, M. S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.-W. E., Leal, I., Kuang, Y., Kalashnikov, D., Julian, R., Joshi, N. J., Irpan, A., Ichter, B., Hsu, J., Herzog, A., Hausman, K., Gopalakrishnan, K., Fu, C., Florence, P., Finn, C., Dubey, K. A., Driess, D., Ding, T., Choromanski, K. M., Chen, X., Chebotar, Y., Carbajal, J., Brown, N., Brohan, A., Arenas, M. G., & Han, K. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Proceedings of The 7th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v229/zitkovich23a.html

PaLM-E An Embodied Multimodal Language Model Review

August 24, 2025 · 4 min read

Gracefullight

Owner

PaLM-E

ViT (e.g., ViT-4B, ViT-22B) extracts image embeddings.
OSRT builds object-centric slot representations.
These are injected into the LLM embedding space (PaLM variants: 8B, 62B, 540B) for high-level abstraction and planning, with execution delegated to low-level policies (e.g., RT-1).

PaLM-E Architecture

Core idea

Build embodied language models by injecting continuous sensor inputs (images, states, other modalities) directly into a pretrained LLM’s embedding space, linking words ↔ percepts.
Inputs are multimodal sentences that interleave text tokens with encoded visual/state tokens; outputs are text (answers or high-level plans).

Architecture & representations

Start from a decoder-only, autoregressive LLM (PaLM) and condition on a prefix that mixes text and encoder-produced vectors.
Provide multiple encoder options:
- State vectors (simplest).
- ViT features with a learned projector ψ to match LLM embedding dimensionality.
- Object-centric, 3D-aware OSRT (neural scene representations). Supports entity-label tokens (<obj j>) so the model can refer to specific objects in generated plans.
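
A minimal sketch of a "multimodal sentence": text token ids are looked up in the LLM embedding table, encoder features are mapped by a projector into the same space, and both are interleaved in order. The linear projector stands in for the learned ψ, and all sizes and values are toy assumptions.

```python
def project(feature, weight):
    """Linear stand-in for the learned projector psi: encoder dim -> LLM embedding dim."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weight]


def build_multimodal_prefix(segments, embedding_table, weight):
    """Turn interleaved text/observation segments into one list of LLM-sized vectors.

    segments: list of ("text", [token ids]) or ("obs", [encoder feature vectors]).
    """
    prefix = []
    for kind, payload in segments:
        if kind == "text":
            prefix.extend(embedding_table[t] for t in payload)
        elif kind == "obs":
            prefix.extend(project(f, weight) for f in payload)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return prefix


# Toy sizes: vocabulary of 5 tokens, LLM embedding dim 3, encoder feature dim 2.
embedding_table = [[float(i), 0.0, 1.0] for i in range(5)]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # 3 x 2 projector
segments = [("text", [1, 2]), ("obs", [[0.5, -0.5], [2.0, 1.0]]), ("text", [3])]
prefix = build_multimodal_prefix(segments, embedding_table, weight)
print(len(prefix), [len(v) for v in prefix])
```

After this step the language model sees one sequence of same-sized vectors and does not need to know which positions came from words and which from sensors.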

Training setup

Train end-to-end (encoders + projector + optionally the LLM) to output sequential decisions as natural text or answers (VQA, captioning).
Dataset items contain (continuous observations, text sequence, prefix index); loss is cross-entropy on non-prefix tokens.
Explore freezing the LLM (train encoders/projection only), and co-training across diverse tasks ("full mixture"; only ~9% is embodied data).
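
The loss described above can be sketched as an average cross-entropy that skips the prefix. Here `logits_per_position[i]` is taken to be the model's scores for the token at position `i`, which is a simplifying convention for this example and not a statement about the paper's implementation.

```python
import math


def non_prefix_cross_entropy(logits_per_position, target_ids, prefix_len):
    """Mean cross-entropy over positions at or after `prefix_len`; the prefix is ignored."""
    total, count = 0.0, 0
    for pos in range(prefix_len, len(target_ids)):
        logits = logits_per_position[pos]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))  # log-sum-exp
        total += log_z - logits[target_ids[pos]]
        count += 1
    if count == 0:
        raise ValueError("no non-prefix tokens to score")
    return total / count


# Toy example: vocabulary of 4, sequence of 5 tokens, first 2 positions are prefix.
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(5)]
targets = [0, 1, 2, 3, 0]
print(non_prefix_cross_entropy(uniform, targets, prefix_len=2))
```

Whatever the model predicts at prefix positions (the prompt text and the injected observation vectors) has no effect on this value; only the answer or plan tokens are scored.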

Planning & control loop

For planning/control, PaLM-E emits textual subgoals/skills drawn from a small skill vocabulary; a separate low-level policy executes them.
The system runs closed-loop: execute → observe → (re)plan; PaLM-E acts as a high-level policy sequencing low-level skills.
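
The execute → observe → (re)plan loop can be sketched with stub components. The planner, the skill names, the `"terminate"` keyword, and the toy world below are all hypothetical stand-ins: in the real system the planner is PaLM-E and the executor is a learned low-level policy.

```python
def run_closed_loop(plan_next_skill, execute_skill, observe, instruction, max_steps=10):
    """High-level policy loop: pick a skill, run it, look again, replan."""
    history = []
    observation = observe()
    for _ in range(max_steps):
        skill = plan_next_skill(instruction, observation, history)
        if skill == "terminate":
            break
        execute_skill(skill)            # delegated to a low-level policy
        history.append(skill)
        observation = observe()         # replanning uses the fresh observation
    return history


def make_toy_world():
    """Toy kitchen where the first grasp attempt fails, forcing a replan."""
    world = {"drawer_open": False, "holding": False, "delivered": False}
    attempts = {"pick the chips": 0}

    def observe():
        return dict(world)

    def plan(instruction, obs, history):
        if obs["delivered"]:
            return "terminate"
        if not obs["drawer_open"]:
            return "open the drawer"
        if not obs["holding"]:
            return "pick the chips"
        return "bring it to the user"

    def execute(skill):
        if skill == "open the drawer":
            world["drawer_open"] = True
        elif skill == "pick the chips":
            attempts[skill] += 1
            world["holding"] = attempts[skill] >= 2   # first attempt fails
        elif skill == "bring it to the user":
            world["delivered"] = True

    return plan, execute, observe


plan, execute, observe = make_toy_world()
print(run_closed_loop(plan, execute, observe, "bring me the chips from the drawer"))
```

The repeated "pick the chips" in the printed history comes from the planner seeing that the grasp failed, which is the kind of recovery that the closed loop provides.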

Why not text-only LLMs or affordance-only grounding?

Prior work that feeds only text to the LLM (and uses external affordance models) is insufficient when spatial layout matters.
PaLM-E instead grounds inside the LLM by injecting continuous observations, enabling direct plan generation while leveraging the LLM’s world knowledge.

Environments & use cases

Three domains: TAMP (grasp/stack planning), Language-Table (multi-object tabletop pushing), Mobile manipulation (kitchen tasks).
Use cases to test embodied reasoning: affordance prediction, failure detection, long-horizon planning (low-level policies from RT-1).

Results (high level)

Transfer via co-training: One model trained on mixed tasks/embodiments achieves higher performance than task-specialists; "full mixture" yields >2× gains (Fig. 3).
Few-shot/data efficiency: Solves robotics tasks with very few examples (e.g., 10–80 for Language-Table, 320 for TAMP). OSRT further improves data efficiency.
Mobile manipulation: End-to-end embodied planning works in real kitchens, robust to disturbances; PaLM-E beats PaLI (zero-shot) and QT-OPT/CLIP baselines on affordance/failure detection.
General V+L: The 562B generalist achieves state-of-the-art on OK-VQA and strong VQAv2/COCO without task-specific finetuning.
Language retention & scaling: Freezing the LLM preserves language ability but can struggle on some robotics tasks; unfreezing the LLM and scaling it up significantly reduces catastrophic forgetting.
Emergent behaviors: Multimodal chain-of-thought and multi-image reasoning emerge in PaLM-E-562B, despite training on single-image prompts.

Takeaways

Injecting neural scene representations (OSRT) and entity-labeled multimodal tokens is effective even without massive embodied data.
Diverse, joint training transfers vision-language knowledge into embodied decision-making, enabling data-efficient robot planning.
Two viable paths to retain language skills during multimodal finetuning:
1. Freeze the LLM, train encoders (max language retention, sometimes weaker robotics),
2. Unfreeze and scale the LLM (much less forgetting, strong embodied performance).

Ref

Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., & Florence, P. (2023). PaLM-E: An Embodied Multimodal Language Model. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v202/driess23a.html
