
16 posts tagged with "vlm"

Vision-Language Models


π0.5 Review

· 5 min read

1. Abstract

  • Core Concept: $\pi_{0.5}$ is a model designed for broad generalization by utilizing co-training on heterogeneous tasks.
  • Method: It combines hybrid multi-modal examples including image observations, language commands, object detection, semantic subtask prediction, and low-level actions.
  • Impact: This knowledge transfer is essential for effective generalization, enabling the execution of long-horizon and dexterous manipulation skills in the wild.

2. Introduction

  • Goal: Design training recipes that provide the breadth of knowledge required for robots to generalize at multiple levels of abstraction, from physical behaviors to scene semantics.
  • Unified Framework: By casting different modalities into a single sequence modeling framework, VLAs can be trained on diverse sources: robot data, language data, computer vision tasks, and combinations thereof.
  • Capabilities: The model can control mobile manipulators to perform varied household tasks even in homes never seen during training.
  • Hierarchical Architecture:
    • Training: Pre-trains on a heterogeneous mixture of tasks, then fine-tunes specifically for mobile manipulation using both low-level action examples and high-level semantic actions (e.g., predicting "pick up the cutting board").
    • Inference: At runtime, the model first predicts a semantic subtask (inferring appropriate next behavior based on scene semantics) and then predicts the robot action chunk based on this subtask.

3. Model Structure

Pi 0.5 model architecture

Unified Transformer Architecture

  • The model corresponds to a transformer taking in $N$ multimodal input tokens $x_{1:N}$ (images, text, and actions) and producing multimodal outputs.
  • Input Processing: Different token types are processed by specific encoders (e.g., Vision Encoder for images, Embedding Matrix for text).
  • Output Split: The output is split into two streams:
    • Text Logits ($y^{l}_{1:M}$): Used for QA, reasoning, and dividing the task (predicting subtasks $\hat{l}$).
    • Action Tokens ($y^{a}_{1:H}$): Produced by a separate Action Expert to create continuous outputs for robot control.

Probabilistic Decomposition

The distribution captured by the model is decomposed using the chain rule and a conditional independence assumption:

$$\pi_{\theta}(a_{t:t+H}, \hat{l} \mid o_{t}, l) = \pi_{\theta}(a_{t:t+H} \mid o_{t}, \hat{l}) \cdot \pi_{\theta}(\hat{l} \mid o_{t}, l)$$
  • Assumption: The action distribution over $a_{t:t+H}$ does not depend on the overall task prompt $l$, but only on the predicted subtask $\hat{l}$.
  • High-Level Inference: $\pi_{\theta}(\hat{l} \mid o_{t}, l)$ (predicting "what to do next").
  • Low-Level Inference: $\pi_{\theta}(a_{t:t+H} \mid o_{t}, \hat{l})$ (predicting "how to move").
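
The two-stage factorization can be sketched as a single control step. This is a minimal illustration, not the authors' implementation: `predict_subtask` and `predict_action_chunk` are hypothetical stand-ins for the high-level and low-level outputs of the model, and the toy functions exist only so the snippet runs.

```python
from typing import Callable, List, Sequence, Tuple

Observation = dict                      # placeholder for images + robot state
ActionChunk = List[Sequence[float]]


def hierarchical_step(
    obs: Observation,
    task_prompt: str,
    predict_subtask: Callable[[Observation, str], str],
    predict_action_chunk: Callable[[Observation, str], ActionChunk],
) -> Tuple[str, ActionChunk]:
    """One inference step: sample l_hat from (o, l), then actions from (o, l_hat)."""
    subtask = predict_subtask(obs, task_prompt)       # "what to do next"
    # The action head is conditioned on the subtask, not on the task prompt.
    actions = predict_action_chunk(obs, subtask)      # "how to move"
    return subtask, actions


# Toy stand-ins (hypothetical) so the sketch is runnable.
def toy_high_level(obs, prompt):
    return "pick up the plate" if obs["plate_on_table"] else "done"


def toy_low_level(obs, subtask, horizon=3):
    return [[0.0, 0.0, 0.1]] * horizon if subtask != "done" else []


subtask, chunk = hierarchical_step(
    {"plate_on_table": True}, "clean the kitchen", toy_high_level, toy_low_level
)
```

Passing the subtask string rather than the task prompt into the low-level call is what encodes the conditional independence assumption.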

4. Combining Discrete & Continuous Actions

The model employs a hybrid approach to balance training efficiency with inference speed and quality.

  • The Dilemma:
    • Discrete Tokens (FAST): Fast training, but requires slow autoregressive decoding during inference.
    • Continuous (Flow Matching): High quality and smooth control, but computationally expensive to train from scratch on massive datasets.
  • The Solution: Train on discretized actions (FAST) but use Flow Matching for inference.
    • Attention Masking: Ensures discrete and continuous action representations do not attend to each other during joint training.
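
A block attention mask of this kind might look like the sketch below. Only the rule that discrete and continuous action tokens are hidden from each other comes from the summary above; the token layout (prefix, then discrete, then continuous) and the remaining visibility rules are assumptions made for illustration.

```python
def build_attention_mask(n_prefix, n_discrete, n_continuous):
    """mask[i][j] is True when query token i may attend to key token j."""
    kinds = (["prefix"] * n_prefix
             + ["discrete"] * n_discrete
             + ["continuous"] * n_continuous)
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i, query in enumerate(kinds):
        for j, key in enumerate(kinds):
            if key == "prefix":
                # Assumption: every token can read the image/text prefix.
                mask[i][j] = True
            elif query == key == "discrete":
                # Assumption: discrete action tokens are decoded causally.
                mask[i][j] = j <= i
            elif query == key == "continuous":
                # Assumption: flow-matching tokens see the whole action chunk.
                mask[i][j] = True
            # Otherwise False: the prefix never reads action tokens, and the
            # discrete and continuous action blocks are hidden from each other.
    return mask


mask = build_attention_mask(n_prefix=2, n_discrete=2, n_continuous=2)
```

With this layout, rows 2 to 3 (discrete) and rows 4 to 5 (continuous) share the prefix columns but have no True entries in each other's columns.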

Hybrid Loss Function

The model minimizes a combined objective:

$$\mathbb{E} \left[ \underbrace{H(x, f^l_\theta)}_{\text{Cross Entropy}} + \alpha \underbrace{\| \omega - a - f^a_\theta \|^2}_{\text{MSE for Flow}} \right]$$
  • Cross Entropy: For text and discrete action tokens.
  • MSE: For the Flow Matching vector field (Action Expert).
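
For a single training sample, the combined objective can be written out as follows. This is a sketch under simplifying assumptions: one target token instead of a sequence, plain Python lists instead of tensors, and hypothetical function names.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def flow_mse(noise, action, predicted_field):
    """|| omega - a - f ||^2, summed over action dimensions."""
    return sum((w - a - f) ** 2
               for w, a, f in zip(noise, action, predicted_field))


def hybrid_loss(logits, target_index, noise, action, predicted_field, alpha):
    """Cross entropy on tokens plus alpha times the flow-matching error."""
    return (cross_entropy(logits, target_index)
            + alpha * flow_mse(noise, action, predicted_field))


# A field that equals (omega - a) exactly contributes zero flow loss.
loss = hybrid_loss(
    logits=[0.0, 0.0], target_index=0,
    noise=[1.0, 0.0], action=[0.5, 0.5], predicted_field=[0.5, -0.5],
    alpha=10.0,
)
```

Setting `alpha=0.0` reduces the objective to pure next-token prediction, whatever the flow term evaluates to.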

5. Training Recipe

The training is split into two distinct stages based on the $\alpha$ parameter and the inclusion of the Action Expert.

Stage 1: Pre-training ($\alpha = 0$)

  • Goal: Efficient large-scale learning.
  • Method: Action Expert is OFF. Trains as a standard auto-regressive transformer using next-token prediction for text and discrete FAST action tokens.
  • Datasets:
    • MM: Mobile Manipulator data (100+ homes).
    • ME: Multi-Environment non-mobile robots.
    • CE: Cross-Embodiment laboratory data (diverse tasks like folding).
    • HL: High-Level subtask prediction data.
    • WD: Multimodal Web Data (VQA, captioning).

Stage 2: Post-training ($\alpha = 10.0$)

  • Goal: Specialization for mobile manipulation and enabling continuous control.
  • Method: Action Expert is ON.
    • Initialized with random weights.
    • Jointly trains next-token prediction (to preserve text capabilities) and Flow Matching for continuous actions.
  • Key Addition (Verbal Instructions - VI):
    • Data collected by "teleoperating" the robot using language commands (e.g., expert users selecting sub-tasks step-by-step).
    • Crucial for training the model to predict high-quality subtasks ($\hat{l}$).

6. Evaluation

Methodology

  • Settings: Tested in entirely new kitchens and bedrooms not seen during training.
  • Tasks: Long-horizon tasks like cleaning kitchens, putting laundry away, and making beds.
  • Metrics: Task progress (percentage of steps completed) and Language Following Rate.

Key Findings

  • Generalization: $\pi_{0.5}$ successfully performs multi-stage tasks in real, unseen homes.
  • Scaling: Performance improves consistently as the number of training environments increases.
  • Ablation Studies:
    • Cross-Embodiment (CE/ME): Excluding data from other robots significantly degrades performance, indicating strong transfer learning.
    • Web Data (WD): While less critical for general task progress, it is essential for Out-of-Distribution (OOD) object generalization and language following.
  • Comparison: Significantly outperforms $\pi_0$ and the $\pi_0$-FAST+Flow baseline.

7. Conclusions & Future Work

  • Current Status: $\pi_{0.5}$ demonstrates that co-training with heterogeneous data enables end-to-end robotic systems to perform long-horizon, dexterous skills in open-world settings.
  • Limitations:
    • Struggles with physical constraints (hard-to-open cabinets) or partial observability.
    • Limited to relatively simple prompts based on training data.
  • Future Directions:
    • Incorporating richer context and memory for better handling of partial observability.
    • Expanding data sources, particularly exploring verbal instructions as a powerful new supervision modality.

Ref

  • Physical Intelligence, Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., & Fusai, N. (2025). π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054.

VLA Test Review

· 6 min read

VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

  • VLATest fuzzes 18,604 manipulation scenes (10 operators, 4 tasks) to systematically stress-test VLA robustness.
  • Seven VLA models show low success and brittleness to confounders, lighting/camera changes, unseen objects, and instruction mutations; larger pretraining helps.
  • Priorities: scale/augment demo data (incl. sim2real), use stepwise/CoT prompting & multi-agent setups, and expand benchmarks with online risk assessment.

Motivation & Gap

  • Problem: Current VLA models are typically evaluated on small, hand-crafted scenes, leaving general performance and robustness in diverse scenarios underexplored.
  • Goal: Introduce VLATest, a generation-based fuzzing framework that automatically creates robotic manipulation scenes to test performance and robustness of VLA models.

What Are VLA Models?

  • Vision-Language-Action (VLA) models take natural language instructions + camera images and output low-level robot actions (Δx, Δθ, Δgrip).
  • Inference loop: Tokenize text/image → transformer predicts action token A₁ → execute → append A₁ + new image tokens I₂ → predict A₂ → … until success or step limit.
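
The inference loop can be sketched as a closed-loop rollout. `policy` and `env_step` are hypothetical callables, and the toy environment below is only there to make the snippet runnable; a real VLA would tokenize the context and decode action tokens.

```python
def rollout(policy, env_step, instruction, first_image, max_steps=10):
    """Predict an action, execute it, append the result, and repeat."""
    context = [("text", instruction), ("image", first_image)]
    for step in range(1, max_steps + 1):
        action = policy(context)                  # predict A_k from the context
        image, success = env_step(action)         # execute and observe I_{k+1}
        context += [("action", action), ("image", image)]
        if success:
            return step, True
    return max_steps, False                       # step limit reached


class ToyEnv:
    """Hypothetical 1-D environment: success once position reaches the goal."""

    def __init__(self, goal=3):
        self.position, self.goal = 0, goal

    def step(self, action):
        self.position += action
        return f"frame@{self.position}", self.position >= self.goal


steps, ok = rollout(lambda ctx: 1, ToyEnv().step, "move right", "frame@0")
```

The growing `context` list mirrors how each new action and image is appended before the next prediction.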

VLA Architecture

Training & Evaluation

  • Training: (1) Train from scratch on robot demonstrations, or (2) fine-tune a large pre-trained VLM with more than 1B parameters (e.g., LLaVA).
  • Evaluation: Task-specific metrics (e.g., grasp, lift, hold for “pick up”), either in sim (auto-metrics) or real (manual labels).

VLATest Framework

  • Ten testing operators grouped across:
    • Target objects: type, position, orientation
    • Confounding objects: type, position, orientation, count
    • Lighting: intensity
    • Camera: position, orientation
  • Scene generation (Alg. 1): sample valid targets → (optional) confounders → mutate lighting (factor α) → mutate camera pose (d, θ). Semantic validity checks prevent infeasible scenes.
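
A generation loop in the spirit of this procedure might look like the sketch below. It is not the paper's Algorithm 1: the workspace bounds, sampling ranges, the minimum-gap validity check, and all function names are assumptions chosen for illustration.

```python
import math
import random


def sample_pose(rng, bounds):
    """Random (x, y, yaw) inside the workspace bounds."""
    (x_lo, x_hi), (y_lo, y_hi) = bounds
    return (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi),
            rng.uniform(0.0, 360.0))


def is_valid(pose, placed, min_gap):
    """Stand-in validity check: objects must not overlap on the table."""
    return all(math.dist(pose[:2], other["pose"][:2]) >= min_gap
               for other in placed)


def generate_scene(rng, object_types, max_confounders=3, min_gap=0.05,
                   bounds=((-0.3, 0.3), (-0.3, 0.3)), max_tries=100):
    target_type = rng.choice(object_types)
    placed = [{"role": "target", "type": target_type,
               "pose": sample_pose(rng, bounds)}]
    other_types = [t for t in object_types if t != target_type]
    for _ in range(rng.randint(0, max_confounders)):
        for _ in range(max_tries):          # resample until placement is valid
            pose = sample_pose(rng, bounds)
            if is_valid(pose, placed, min_gap):
                placed.append({"role": "confounder",
                               "type": rng.choice(other_types), "pose": pose})
                break
    return {
        "objects": placed,
        "lighting_factor": rng.uniform(0.1, 10.0),     # alpha (assumed range)
        "camera_offset_m": rng.uniform(0.0, 0.05),     # d (assumed range)
        "camera_rotation_deg": rng.uniform(0.0, 5.0),  # theta (assumed range)
    }


scene = generate_scene(random.Random(0), ["can", "apple", "sponge"])
```

Passing a seeded `random.Random` keeps each generated scene reproducible, which matters when a failing test case has to be replayed.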

VLA Test

Research Questions (RQ)

  • RQ1: Basic performance on popular manipulation tasks
  • RQ2: Effect of confounding object count
  • RQ3: Effect of lighting changes
  • RQ4: Effect of camera pose changes
  • RQ5: Robustness to unseen objects (OOD)
  • RQ6: Robustness to instruction mutations

Tasks & Prompting

  • Tasks:
    1. Pick up an object (grasp + lift ≥0.02 m for 5 frames)
    2. Move A near B (≤0.05 m)
    3. Put A on B (stable stacking)
    4. Put A into B (fully inside)
  • Standard prompts (RQ1–RQ5):
    • pick up [obj] · move [objA] near [objB] · put [objA] on [objB] · put [objA] into [objB]
  • Instruction mutations (RQ6): 10 paraphrases per task (GPT-4o), manually validated for semantic equivalence.
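
The first two success criteria can be expressed as simple checks. This sketch assumes the five frames must be consecutive and that heights are measured relative to the object's starting height; both are interpretations, and the function names are hypothetical.

```python
import math


def pick_up_success(grasped, heights, min_lift=0.02, frames=5):
    """True if the object stays grasped and lifted for `frames` frames in a row."""
    run = 0
    for is_grasped, height in zip(grasped, heights):
        run = run + 1 if (is_grasped and height >= min_lift) else 0
        if run >= frames:
            return True
    return False


def move_near_success(pos_a, pos_b, max_dist=0.05):
    """True if object A ends within `max_dist` metres of object B."""
    return math.dist(pos_a, pos_b) <= max_dist


held = pick_up_success([True] * 6, [0.0, 0.03, 0.03, 0.03, 0.03, 0.03])
dropped = pick_up_success([True] * 6, [0.03, 0.03, 0.01, 0.03, 0.03, 0.03])
near = move_near_success((0.0, 0.0, 0.0), (0.03, 0.0, 0.0))
```

In the second call the object dips below the threshold on the third frame, so the run restarts and never reaches five frames.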

Experimental Setup

  • Scenes: 18,604 across 4 tasks (ManiSkill2).
  • Models: 7 public VLAs (RT-1-1k/58k/400k, RT-1-X, Octo-small/base, OpenVLA-7b).
  • Compute: >580 GPU hours.

Key Results & Findings

RQ1 — Overall Performance

  • VLA models underperform overall; no single model dominates across tasks.
  • Example best-case rates (default settings): 34.4% (Task1, RT-1-400k), 12.7% (Task2, OpenVLA-7b), 2.2% (Task3, RT-1-X), 2.1% (Task4, Octo-small).
  • Stepwise breakdown (Task 1): grasp 23.3% → lift 15.7% → hold 12.4% ⇒ difficulty composing sequential actions.
    • Implication (Finding 2): Consider stepwise prompting / chain-of-thought to decompose complex tasks.

RQ1 — Coverage Metric

  • No established coverage for VLA; adopted trajectory coverage (pragmatic).
  • Increasing cases from n=10 to n=1000 achieved 100% coverage across tasks (object-position novelty relative to workspace).

RQ2 — Confounding Objects

  • More confounders ⇒ worse performance; models struggle to locate the correct object.
  • Similarity doesn’t matter much: Mann–Whitney U shows no significant difference between similar vs dissimilar distractors (p = 0.443, 0.614, 0.657, 0.443; effect sizes ≈ 0.23–0.29).

RQ3 — Lighting Robustness

  • Lighting perturbations significantly hurt performance.
  • OpenVLA-7b most robust (77.9% of previously passed cases still pass), plausibly due to SigLIP + DINOv2 pretraining and LLaVA 1.5 mixture.
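  • Sensitivity: even a modest increase in intensity (α < 2.5) drops success to about 0.7× the default; for α > 8, only about 40% of the scenes that pass under default lighting still succeed.
  • Decreasing the light hurts less than increasing it; at α < 0.2, about 60% of those scenes still pass.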

RQ4 — Camera Pose Robustness

  • Small pose changes (a slight rotation and a shift of at most 5 cm) reduce success to 34.0% of the default rate.
  • RT-1-400k most robust (45.6% retain), OpenVLA-7b at 31.3%; Octo models <10%.
    • Likely due to training data scale differences.

RQ5 — Unseen Objects

  • Using YCB (56 unseen objects) leads to large performance drops versus seen objects: avg –74.2%, –66.7%, –66.7%, –20.0% on Tasks 1–4.
  • Transfer rate across steps:
    • $\displaystyle T_r^n = \frac{\text{Success rate}_n}{\text{Success rate}_{n-1}}$, with $\text{Success rate}_0 = 100\%$
    • Paired t-tests show significant differences on $T_r^1$ for Task 1 & 2 (p = 0.011, 0.007; Cohen’s d = 1.34, 0.891).
    • Primary failure mode: recognizing/locating unseen objects.
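
As a worked example of the transfer rate, the snippet below applies the definition to the Task 1 stepwise success rates quoted under RQ1 (grasp 23.3%, lift 15.7%, hold 12.4%). Reusing those numbers here is purely illustrative, and the function name is hypothetical.

```python
def transfer_rates(stepwise_success):
    """T_r^n = success_n / success_(n-1), with success_0 = 1.0 (100%)."""
    rates, previous = [], 1.0
    for success in stepwise_success:
        # Undefined when no trial reached the previous step.
        rates.append(success / previous if previous > 0 else float("nan"))
        previous = success
    return rates


# grasp -> lift -> hold
rates = transfer_rates([0.233, 0.157, 0.124])
# rates[0] is 0.233; rates[1] is roughly 0.67; rates[2] is roughly 0.79
```

Read this way, the first step (reaching a grasp at all) loses far more trials than the later lift and hold steps.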

RQ6 — Instruction Mutations

  • Mutated instructions generally reduce performance (avg drops: –32.8% T1, –1.7% T2, –8.3% T3; negligible on T4).
  • Larger language backbones help: OpenVLA-7b (Llama 2-7B) is more robust, sometimes improving under mutations (e.g., T1, T4).

Implications & Directions

  • Scale matters: larger pretraining and robot-demo datasets improve robustness (lighting/camera).
  • Data enrichment: use data augmentation and sim-to-real to diversify external factors; leverage traditional controllers to auto-generate demonstrations.
  • Prompting strategies: adopt stepwise/CoT prompting; consider multi-agent decompositions.
  • Benchmarking: the 18,604 VLATest scenes serve as an early benchmark; expand to more tasks/robots/conditions.
  • Online risk assessment: explore uncertainty estimation and safety monitoring for runtime quality control.

Related Work

  • Robotics foundation models: (1) LLMs for planning/rewards; (2) Multi-modal FMs (VLMs/VLAs) for manipulation & perception.
  • CPS testing: gray-box/black-box fuzzing and search-based testing exist, but not directly applicable to VLAs (multimodality, autoregression, scale).
  • FM evaluation: beyond static benchmarks, VLATest dynamically generates 3D manipulation test cases—distinct from text-only testing.

Threats to Validity (mitigations in study)

  • Internal: randomness (mitigated by 18,604 scenes); potential prompt bias (mutations manually validated).
  • External: generalization to other tasks/models; chose popular tasks (Open X-Embodiment) and SOTA public models.
  • Construct: limited operators (lighting/camera/confounders chosen; future: #lights, camera intrinsics, resolution).
    • Coverage: trajectory coverage used as a pragmatic proxy.

Conclusion

  • VLATest: early, generation-based fuzzing framework (10 operators) for VLA testing in ManiSkill2.
  • Empirical evidence across 7 models / 4 tasks / 18,604 scenes shows limited robustness (lighting, camera, unseen objects, instruction variation).
  • Points to data scaling, prompting, benchmarking, and risk assessment as practical paths to more reliable VLA systems.

Ref

  • Wang, Z., Zhou, Z., Song, J., Huang, Y., Shu, Z., & Ma, L. (2025). VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation. Proceedings of the ACM on Software Engineering, 2(FSE), 1615–1638.

Open X-Embodiment review

· 5 min read

RT-X

  • RT-X trains generalist robot policies by co-training RT-1/RT-2 on an X-embodiment mix of multi-robot, multi-task data, enabling efficient adaptation to new robots, tasks, and environments.
  • It standardizes 1M+ trajectories from 22 embodiments into the Open X-Embodiment (RLDS/tfrecord) repository, unifying observations and 7-DoF actions via coarse alignment.
  • Experiments show strong positive transfer and emergent skills (≈3× with RT-2-X on cross-robot tasks); performance scales with model capacity, short image histories, and web pretraining, while sensing/actuation diversity and frame alignment remain open problems.

RT-X Architecture

Motivation

  • Seeks a generalist X-robot policy that can be efficiently adapted to new robots, tasks, and environments.
  • Mirrors a trend from CV/NLP where general-purpose, web-scale pretrained models outperform narrow, task-specific models.
  • Robotics lacks comparably large, diverse interaction datasets, making direct transfer of these lessons challenging.

Objectives

  1. Positive transfer: Test whether co-training on data from many robots improves performance on each training domain.
  2. Ecosystem building: Organize large robotic datasets to enable future X-embodiment research.

Core Approach

  • Train RT-1 and RT-2 on data from 9 different manipulators, producing RT-X variants that outperform policies trained only on the evaluation domain and show better generalization and new capabilities.

What’s Different From Prior Transfer Methods

  • Many prior works reduce the embodiment gap via specialized mechanisms (shared action spaces, representation learning objectives, policy adaptation using embodiment metadata, decoupled robot/environment representations, domain translation).
  • RT-X directly trains on X-embodiment data without explicit gap-reduction machinery and still observes positive transfer.

Dataset & Format (Open X-Embodiment)

  • 1M+ real robot trajectories, 22 embodiments (single-arm, bimanual, quadrupeds), pooled from 60 datasets / 34 labs, standardized for easy use.
  • Uses RLDS (serialized tfrecord), supporting varied action spaces and input modalities (RGB, depth, point clouds), and efficient parallel loading across major DL frameworks.
  • Language annotations are leveraged; PaLM is used to extract objects/behaviors from instructions.

RLDS

Data Format Consolidation (Coarse Alignment)

  • Observations: History of recent images + language instruction. One canonical camera view per dataset is resized to a common resolution.
  • Actions: Convert original controls to a 7-DoF end-effector vector (x, y, z, roll, pitch, yaw, gripper or their rates). Actions are normalized before discretization; outputs are de-normalized per embodiment.
  • Deliberate non-alignment: Camera poses/properties are not standardized; action frame alignment across datasets is not enforced. The same action vector may cause different motions on different robots (absolute/relative, position/velocity allowed).
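
The normalize, discretize, and de-normalize round trip for a 7-DoF action can be sketched as follows. The per-dimension min/max normalization, the 256-bin resolution, and the example bounds are assumptions for illustration; the summary above does not state which statistics the authors used.

```python
def normalize(action, low, high):
    """Map each dimension from [low, high] to [-1, 1]."""
    return [2.0 * (a - lo) / (hi - lo) - 1.0
            for a, lo, hi in zip(action, low, high)]


def discretize(normalized, bins=256):
    """Clip to [-1, 1] and map each value to an integer bin index."""
    tokens = []
    for x in normalized:
        x = min(max(x, -1.0), 1.0)
        tokens.append(min(int((x + 1.0) / 2.0 * bins), bins - 1))
    return tokens


def undiscretize(tokens, bins=256):
    """Return the centre of each bin in [-1, 1]."""
    return [(t + 0.5) / bins * 2.0 - 1.0 for t in tokens]


def denormalize(normalized, low, high):
    """Map [-1, 1] back to the embodiment's own action range."""
    return [(x + 1.0) / 2.0 * (hi - lo) + lo
            for x, lo, hi in zip(normalized, low, high)]


# Hypothetical bounds for one embodiment: xyz, roll/pitch/yaw, gripper.
low = [-0.1] * 3 + [-0.5] * 3 + [0.0]
high = [0.1] * 3 + [0.5] * 3 + [1.0]
action = [0.02, -0.05, 0.0, 0.1, -0.3, 0.25, 1.0]

tokens = discretize(normalize(action, low, high))
recovered = denormalize(undiscretize(tokens), low, high)
```

Each recovered value lies within half a bin width of the original. Because `low` and `high` are per-embodiment, the same token can decode to different physical motions on different robots, which is the non-alignment noted above.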

Policy Architectures

  • RT-1 (≈35M params): Transformer for control. Inputs: 15-frame image history + natural-language instruction.
    • Vision via ImageNet-pretrained EfficientNet; language via USE embedding.
    • Fuse via FiLM → 81 vision–language tokens → decoder-only Transformer outputs tokenized actions.
  • RT-2 (VLA family): Internet-scale VLM co-fine-tuned to output action as text tokens (e.g., 1 128 91 241 5 101 127).
    • Any pretrained VLM can be adapted; this work uses RT-2–PaLI-X (ViT backbone + UL2 LM; primarily pretrained on WebLI).

Training Setup

  • Robotics data mixture: Data from 9 manipulators (a union of multiple well-known robotics datasets).
  • Loss: Standard categorical cross-entropy over tokenized actions.
  • Regimes:
    • RT-1-X: Trained solely on the robotics mixture.
    • RT-2-X: Co-fine-tuned on a ~1:1 mix of original VLM data and the robotics mixture.
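
A roughly 1:1 co-fine-tuning mixture can be approximated by drawing half of every batch from each source. This is a minimal sketch with a hypothetical `mixed_batches` helper; the actual data pipeline is not described in the summary above.

```python
import random


def mixed_batches(vlm_data, robot_data, batch_size, robot_fraction=0.5, seed=0):
    """Yield batches with a fixed share of robot examples (with replacement)."""
    rng = random.Random(seed)
    n_robot = round(batch_size * robot_fraction)
    while True:
        batch = (rng.choices(robot_data, k=n_robot)
                 + rng.choices(vlm_data, k=batch_size - n_robot))
        rng.shuffle(batch)          # avoid a fixed ordering inside the batch
        yield batch


vlm_data = [("vlm", i) for i in range(10)]
robot_data = [("robot", i) for i in range(10)]
batch = next(mixed_batches(vlm_data, robot_data, batch_size=8))
n_robot_examples = sum(1 for source, _ in batch if source == "robot")
```

Setting `robot_fraction=1.0` would correspond to the RT-1-X regime, which trains on the robotics mixture alone.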

Experimental Questions

  1. Does X-embodiment co-training improve in-domain performance (positive transfer)?
  2. Does it improve generalization to unseen tasks?
  3. How do model size, architecture, and dataset composition influence performance/generalization?

Key Results

  • Small-scale domains: RT-1-X outperforms the Original Method (the authors’ per-dataset baselines) on 4/5 datasets with a large average gain → limited data domains benefit greatly from X-embodiment co-training.
  • Large-scale domains:
    • RT-1-X does not beat an RT-1 trained only on the embodiment-specific large dataset (suggests underfitting for this class).
    • RT-2-X (larger capacity) outperforms both Original Method and RT-1 → X-robot training helps even in data-rich regimes when using sufficient capacity.

Generalization & Emergent Skills

  • Unseen objects/backgrounds/environments: RT-2 and RT-2-X perform on par (VLM backbone already strong here).
  • Emergent skills (transfer across robots): On Google Robot tasks that do not appear in RT-2’s dataset but exist in Bridge (for WidowX), RT-2-X ≈ 3× RT-2.
    • Removing Bridge from RT-2-X training significantly reduces hold-out performance → skills likely transferred from WidowX data.

Design Insights (Ablations)

  • Including a short history of images notably improves generalization compared with using no history.
  • Web pretraining is critical for large models’ high performance.
  • Model capacity matters: 55B model succeeds more than 5B on emergent skills → greater capacity ⇒ greater cross-dataset transfer.
  • Co-fine-tuning vs. fine-tuning: Similar performance in this study (attributed to the greater diversity of robotics data in RT-2-X vs. prior works).

Limitations (Open Problems)

  • Does not cover robots with very different sensing/actuation modalities.
  • Does not study generalization to new robots nor define a decision criterion for when positive transfer will occur.
  • Camera pose/properties and control frame remain unaligned; a deliberate but still challenging domain gap to address in future work.

Ref

  • O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., & Jain, A. (2024). Open X-Embodiment: Robotic learning datasets and RT-X models. 2024 IEEE International Conference on Robotics and Automation (ICRA).

π0 Review

· 4 min read

π0

Problem & Motivation

  • Achieving real-world generality in robot learning is blocked by data scarcity, generalization, and robustness limits.
  • Human intelligence most outpaces machines in versatility—solving diverse, physically situated tasks under constraints, language commands, and perturbations.
  • In NLP/CV, foundation models pre-trained on diverse multi-task data, then fine-tuned (aligned) on curated datasets, outperform narrow specialists; the same paradigm is hypothesized for robotics.

Core Proposal

  • A novel flow-matching architecture built on a pre-trained Vision-Language Model (VLM) to inherit Internet-scale semantics.
  • Further training adds robot actions, turning the model into a Vision-Language-Action (VLA) policy.
  • Use cross-embodiment training to combine data from many robot types (single/dual-arm, mobile), despite differing configuration/action spaces.
  • Employ action chunking + flow matching (diffusion variant) to model complex, continuous, high-frequency actions.
  • Introduce an Action Expert (separate weights for action/state tokens), akin to a Mixture-of-Experts, augmenting the standard VLM.

Training Recipe (Pre- vs Post-Training)

  • Pre-training on highly diverse data builds broad, general physical abilities.
  • Post-training on curated, task-specific data instills fluent, efficient strategies.
  • Rationale: high-quality-only training lacks recovery behaviors; low-quality-only training lacks efficiency/robustness; combining both yields desired behavior.

Data & Backbone

  • ~10,000 hours of demonstrations + the OXE dataset; data spans 7 robot configurations and 68 tasks.
  • VLM backbone initialized from PaliGemma (3B); add ~300M parameters for the action expert (total ~3.3B).
  • Pre-training mixture: weighted combination of internal datasets + full OXE; n^0.43 weighting to down-weight overrepresented task-robot pairs.
  • Unify interfaces: zero-pad $q_t$ / $a_t$ to the largest robot dimension (18); mask missing image slots; late-fusion encoders map images/states to the same token space as language.
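
Two of these steps are simple enough to sketch directly: the n^0.43 down-weighting and zero-padding to 18 dimensions. The dataset names and sizes below are made up, and the helper names are hypothetical.

```python
def sampling_weights(counts, exponent=0.43):
    """Weight each task-robot pair by n**exponent, then normalize to sum to 1."""
    raw = {name: n ** exponent for name, n in counts.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def zero_pad(vector, target_dim=18):
    """Right-pad a state or action vector with zeros up to `target_dim`."""
    if len(vector) > target_dim:
        raise ValueError("vector is longer than the target dimension")
    return list(vector) + [0.0] * (target_dim - len(vector))


# Hypothetical sizes: one pair has 1000x more samples than the other.
weights = sampling_weights({"large_pair": 1_000_000, "small_pair": 1_000})
ratio = weights["large_pair"] / weights["small_pair"]

padded = zero_pad([0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 1.0])
```

With an exponent of 0.43, a 1000:1 imbalance in sample counts shrinks to a sampling ratio of roughly 20:1, so the large pair still dominates but no longer drowns out the small one.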

Modeling Details

  • Conditional flow matching models the continuous distribution over action chunks.
  • Train with a diffusion-style loss on individual sequence elements (instead of cross-entropy), with separate weights for diffusion-related tokens.
  • Flow path uses a linear-Gaussian schedule; sample noisy actions with ε∼N(0, I); predict denoising vector field; Euler integration from τ=0→1 at inference.
  • Efficient inference by caching K/V for the observation prefix; action tokens recomputed per integration step.
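
Euler integration of a flow from noise (τ=0) to an action (τ=1) can be illustrated with a toy vector field. This is a sketch under stated assumptions: the path is the linear interpolation between noise and a known target, the velocity points from noise toward the action (the sign convention may differ from the paper's), 10 steps are used, and `oracle_velocity` replaces the learned network.

```python
import random


def euler_sample(velocity, noise, steps=10):
    """Integrate dx/dtau = velocity(x, tau) from tau=0 to tau=1."""
    x, delta = list(noise), 1.0 / steps
    for k in range(steps):
        tau = k * delta
        v = velocity(x, tau)
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(target):
    """Toy stand-in for the network: exact field for one known target action."""
    def velocity(x, tau):
        # On the path x = tau * a + (1 - tau) * eps this equals a - eps.
        return [(a - xi) / (1.0 - tau) for a, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target = [0.2, -0.4, 0.7]
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle_velocity(target), noise)
```

Because the toy field is constant along the straight path, Euler steps land on the target up to rounding. A learned field is only approximately straight, which is why the number of integration steps becomes a quality and latency trade-off.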

High-Level Language Policy

  • Because the policy consumes language, a high-level VLM can decompose tasks (e.g., bussing) into intermediate language subgoals (SayCan-style planning), improving performance on complex, temporally extended tasks.

Evaluation Setup & Baselines

  • Out-of-box (direct prompting), fine-tuning on downstream tasks, and with high-level VLM providing intermediate commands.
  • Compare against OpenVLA (7B, autoregressive discretization; no action chunks/high-frequency control) and Octo (93M; diffusion), trained on the same mixture.
  • Include a compute-parity π0 (160k steps vs 700k) and a π0-small variant (no VLM init).

Key Results

  • Out-of-box: π0 outperforms all baselines; even compute-parity π0 beats OpenVLA/Octo; π0-small still surpasses them—highlighting the benefits of expressive architectures + diffusion/flow matching + VLM pre-training.
  • Language following: π0 clearly exceeds π0-small across conditions:
    • π0-flat: only overall task command.
    • π0-human: human-provided intermediate steps.
    • π0-HL: high-level VLM-provided steps (fully autonomous).
    • Better language-following accuracy directly translates into stronger autonomous performance with high-level guidance.
  • New dexterous tasks (e.g., bowl stacking, towel folding, microwave, drawer items, paper towel replacement):
    • Fine-tuned π0 generally outperforms OpenVLA, Octo, and small-data methods ACT / Diffusion Policy.
    • Pre-training helps most when tasks resemble pre-training data; pretrained π0 often beats the from-scratch model.
  • Complex multi-stage tasks (laundry folding, table bussing, box building, to-go box, eggs):
    • π0 solves many tasks; full pre-training + fine-tuning performs best.
    • Gains from pre-training are especially large on harder tasks; absolute performance varies with task difficulty and pre-training coverage.

Takeaways & Limitations

  • π0 mirrors LLM training: pre-train for knowledge, post-train for alignment (instruction-following and execution).
  • Limitations/open questions:
    • Optimal composition/weighting of pre-training data remains unclear.
    • Not all tasks work reliably; difficult to predict how much/what kind of data is needed for near-perfect performance.
    • Uncertain positive transfer across very diverse tasks/robots and to distinct domains (e.g., driving, navigation, legged locomotion).

Ref

  • Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li‑Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L. X, … Zhilinsky, U. (2025, June 21). π₀: A vision‑language‑action flow model for general robot control. Robotics: Science and Systems (RSS), Los Angeles, CA, United States. https://roboticsconference.org/program/papers/10/

Vima Review

· 2 min read

VIMA

  • Unified Multimodal Prompts: Reformulates diverse robot tasks (language, images, video) into a single sequence modeling problem.
  • Object-Centric Tokenization: Uses object-level tokens (Mask R-CNN + ViT) instead of raw pixels, improving data efficiency and semantic generalization.
  • Cross-Attention Conditioning: Conditions the policy on prompts via cross-attention, maintaining strong zero-shot performance even with small models or novel tasks.

Motivation

  • Robot task specification comes in many forms: one-shot demonstrations, language instructions, and visual goals.
  • Traditionally, each task required distinct architectures and pipelines, leading to siloed systems with poor generalization.

VIMA Architecture

Key Contributions

  1. Multimodal Prompting

    • A novel formulation that unifies diverse robot manipulation tasks into a sequence modeling problem.
    • Prompts are defined as interleaved sequences of text and images, enabling flexibility across task formats.
  2. VIMA-BENCH

    • A large-scale benchmark with 17 tasks across six categories (object manipulation, goal reaching, novel concept grounding, video imitation, constraint satisfaction, visual reasoning).
    • Provides 650K expert trajectories and a four-level evaluation protocol for systematic generalization.
  3. VIMA Agent

    • A transformer-based visuomotor agent with encoder-decoder architecture and object-centric design.
    • Encodes prompts with a pre-trained T5 model, parses images into object tokens via Mask R-CNN + ViT, and decodes actions autoregressively using cross-attention.

Design Insights

  • Object-Centric Representation: Passing variable-length object token sequences directly to the controller is more effective than pixel-based tokenization.
  • Cross-Attention Conditioning: Stronger prompt focus and efficiency compared to simple concatenation (e.g., GPT-style).
  • Robustness: Minimal degradation under distractors or corrupted prompts, aided by T5 backbone and object augmentation.

Results

  • Performance:

    • Outperforms baselines (VIMA-Gato, VIMA-Flamingo, VIMA-GPT) by up to 2.9× success rate in hardest zero-shot generalization.
    • With 10× less training data, still 2.7× better than best competitor.
  • Scaling:

    • Sample-efficient: with just 1% of data, matches baselines trained with 10× more.
    • Generalization holds across L1–L4 evaluation, with smaller regression than alternatives.

Conclusion

VIMA demonstrates that multimodal prompting is a powerful unifying framework for robot learning.
It achieves strong scalability, data efficiency, and generalization, establishing a solid starting point for future generalist robot agents.

Ref

  • Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., & Fan, L. (2023). VIMA: Robot Manipulation with Multimodal Prompts. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v202/jiang23b.html

RoboFlamingo Review

· 2 min read

RoboFlamingo

  • RoboFlamingo decouples vision-language understanding and control, using OpenFlamingo for perception and a lightweight policy head for sequential decision-making.
  • Unlike prior VLM-based approaches, it requires only small-scale imitation fine-tuning on language-conditioned manipulation data, without large-scale co-fine-tuning.
  • This design enables data-efficient, zero-shot generalizable, and deployable robot manipulation policies on modest compute resources.

Key Idea

  • Proposes RoboFlamingo, a simple framework to adapt existing VLMs for robotic manipulation with lightweight fine-tuning.
  • Built on OpenFlamingo, decoupling vision-language understanding from decision-making.
  • Pre-trained VLM handles language and visual comprehension, while a dedicated policy head models sequential history.
  • Fine-tuned only on language-conditioned manipulation datasets using imitation learning.

Advantages

  • Requires only a small amount of demonstrations to adapt to downstream manipulation tasks.
  • Provides open-loop control capability → deployable on low-performance platforms.
  • Can be trained/evaluated on a single GPU server, making it a cost-effective and accessible solution.

Benchmarks

  • Evaluated on CALVIN benchmark (34 tasks, 1000 instruction chains).
  • RoboFlamingo achieves 2× performance improvements over previous state-of-the-art methods.

Performance

  • Imitation Learning: Outperforms all baselines across all metrics.
  • Zero-shot Generalization:
    • Vision: Stronger generalization in ABC→D setting.
    • Language: Robust to GPT-4 generated synonymous instructions.
  • Ablation Studies:
    • Ignoring history (MLP w/o hist) gives worst results.
    • LSTM and GPT-based policy heads perform best (LSTM chosen as default).
    • VL pre-training is crucial for downstream manipulation.
    • Larger VLMs show better data efficiency.
    • Instruction fine-tuning improves both seen and unseen tasks.

Flexibility of Deployment

  • Supports open-loop control by predicting entire action sequences with a single inference → reduces latency and test-time compute.
  • Direct open-loop use without retraining can degrade performance; mitigated with jump-step demonstrations.

Conclusion

  • Demonstrates that pre-trained VLMs enable data efficiency and strong zero-shot generalization in robotic manipulation.
  • RoboFlamingo is presented as an intuitive, efficient, and open solution, with high potential when combined with large-scale real robot data.

Ref

  • Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., & Liu, H. (2024). Vision-language foundation models as effective robot imitators. International Conference on Learning Representations (ICLR 2024), Vienna, Austria.

OpenVLA Review

· 3 min read

OpenVLA

  • OpenVLA is a 7B open-source VLA model built on Llama2 + DINOv2 + SigLIP, trained on 970k demos, achieving stronger generalization and robustness than closed RT-2-X (55B) and outperforming Diffusion Policy.
  • It introduces efficient adaptation via LoRA (1.4% params, 8× compute reduction) and 4-bit quantization (half memory, same accuracy), enabling fine-tuning and inference on consumer GPUs.
  • Limitations remain (single-image input, <90% reliability, limited throughput), but OpenVLA provides the first open, scalable framework for generalist robot policies.

OpenVLA Architecture

Motivation

  • Training robot policies from scratch struggles with robustness and generalization.
  • Fine-tuning vision-language-action (VLA) models offers reusable, generalizable visuomotor policies.
  • Barriers: prior VLAs are closed-source, lack best practices for adaptation, and need server-class hardware.

Model & Training

  • OpenVLA: 7B parameters, open-source.
  • Built on Llama 2 with fused DINOv2 + SigLIP vision encoders.
  • Trained on 970k robot demonstrations from Open-X Embodiment dataset.
  • Represents robot actions as tokens: each action dimension is discretized into 256 bins, and the bin ids overwrite the 256 least-used tokens in the Llama tokenizer vocabulary.
  • Standard next-token prediction objective.
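
The action-as-token idea can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the OpenVLA code: the vocabulary size (32000, the Llama 2 tokenizer size), the choice of the last 256 ids as the reserved action slots, and the per-dimension bounds `low`/`high` are assumptions made here for the example.

```python
N_BINS = 256
VOCAB_SIZE = 32000  # assumption: Llama 2 tokenizer size


def action_to_token_ids(action, low, high, vocab_size=VOCAB_SIZE, n_bins=N_BINS):
    """Map each continuous action dimension to one reserved token id."""
    ids = []
    for value, lo, hi in zip(action, low, high):
        value = min(max(value, lo), hi)  # clip to the bounds
        bin_idx = min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)
        ids.append(vocab_size - n_bins + bin_idx)  # assumed: last 256 ids
    return ids


def token_ids_to_action(ids, low, high, vocab_size=VOCAB_SIZE, n_bins=N_BINS):
    """Invert the mapping by returning the centre of each bin."""
    action = []
    for token_id, lo, hi in zip(ids, low, high):
        bin_idx = token_id - (vocab_size - n_bins)
        width = (hi - lo) / n_bins
        action.append(lo + (bin_idx + 0.5) * width)
    return action


# Hypothetical 3-dimensional action with bounds [-1, 1] per dimension.
low, high = [-1.0] * 3, [1.0] * 3
ids = action_to_token_ids([-1.0, 0.0, 1.0], low, high)
recovered = token_ids_to_action(ids, low, high)
print(ids, recovered)
```

Because the actions are now ordinary token ids, the standard next-token prediction loss applies without any action-specific head; the price is a quantization error of at most half a bin width per dimension.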

Architecture & Approach

  • End-to-end fine-tuning of VLM to generate robot actions as tokens.
  • Differs from modular methods (e.g., Octo) that stitch separate encoders/decoders.
  • Vision features are obtained by encoding the same input image with both SigLIP and DINOv2, then channel-wise concatenated and passed through an MLP projector. This preserves SigLIP’s semantic alignment with language and DINOv2's spatial reasoning, giving the VLM richer multimodal context for manipulation tasks.
  • Uses Prismatic VLM backbone with multi-resolution features (spatial reasoning + semantics).
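
The fused vision pathway described above (two encoders, channel-wise concatenation, MLP projector) can be sketched with plain lists. Everything here is a toy stand-in: the feature sizes, the random features and weights, and the ReLU activation are assumptions for illustration, not the actual OpenVLA dimensions or projector.

```python
import random


def concat_channels(feats_a, feats_b):
    """Per-patch channel-wise concat: (P, Da) and (P, Db) -> (P, Da + Db)."""
    if len(feats_a) != len(feats_b):
        raise ValueError("both encoders must yield the same number of patches")
    return [a + b for a, b in zip(feats_a, feats_b)]


def linear(x, weight, bias):
    """Dense layer: weight has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_projector(patch, w1, b1, w2, b2):
    """Two-layer MLP mapping a fused patch feature to the LLM embedding size."""
    hidden = [max(0.0, h) for h in linear(patch, w1, b1)]  # ReLU is a placeholder
    return linear(hidden, w2, b2)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_patches, d_siglip, d_dino, d_hidden, d_llm = 4, 3, 2, 6, 5  # toy sizes
siglip_feats = random_matrix(n_patches, d_siglip, rng)  # stand-in encoder output
dino_feats = random_matrix(n_patches, d_dino, rng)      # stand-in encoder output

fused = concat_channels(siglip_feats, dino_feats)
w1, b1 = random_matrix(d_hidden, d_siglip + d_dino, rng), [0.0] * d_hidden
w2, b2 = random_matrix(d_llm, d_hidden, rng), [0.0] * d_llm
visual_tokens = [mlp_projector(p, w1, b1, w2, b2) for p in fused]
print(len(visual_tokens), len(visual_tokens[0]))
```

Concatenating along the channel axis keeps one token per image patch, so the sequence length seen by the language model does not grow when a second encoder is added.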

Performance

  • Outperforms closed RT-2-X (55B) by +16.5% task success with 7× fewer parameters.
  • Beats Diffusion Policy (from-scratch imitation learning) by +20.4% on multi-task language-grounded settings.
  • Demonstrates robust behaviors (distractor resistance, error recovery).

Efficiency

  • Introduces parameter-efficient fine-tuning:
    • LoRA updates only 1.4% of parameters yet matches full fine-tuning.
    • Can fine-tune on a single A100 GPU in ~10–15 hours (8× compute reduction).
  • Quantization:
    • 4-bit inference matches bfloat16 accuracy while halving memory footprint.
    • Runs at 3Hz on consumer GPUs (e.g., A5000, 24GB).
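
To see why LoRA trains so few parameters, it helps to count them for a single weight matrix. The sketch below assumes a hypothetical 4096×4096 layer and rank 32; these are illustrative values, and the result is a per-layer ratio, not a reproduction of the 1.4% whole-model figure.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA params (A: rank x d_in, B: d_out x rank) over full params."""
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)
    return lora_params / full_params


# Hypothetical layer size and rank, for illustration only.
fraction = lora_param_fraction(d_in=4096, d_out=4096, rank=32)
print(f"{fraction:.4%}")
```

The ratio scales as `rank * (d_in + d_out) / (d_in * d_out)`, so it shrinks as layers get wider, which is why low-rank updates stay cheap on large models.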

Evaluations

  • Tested across 29 tasks and multiple robots (WidowX, Google robot, Franka).
  • Strong generalization on:
    • Visual (unseen backgrounds/distractors).
    • Motion (new object positions/orientations).
    • Physical (new object shapes/sizes).
    • Semantic (unseen tasks, instructions).
  • First generalist open-source VLA achieving ≥50% success rate across all tested tasks.

Design Insights

  • Fine-tuning the vision encoder (vs. freezing it) is crucial for robotic control.
  • Higher image resolution (384px vs. 224px) adds 3× compute without performance gains.
  • Training required 27 epochs, far more than typical VLM runs, to surpass 95% action token accuracy.

Limitations & Future Work

  • Supports only single-image observations (no proprioception, no history).
  • Inference throughput (~6Hz on RTX 4090) insufficient for high-frequency control (e.g., ALOHA at 50Hz).
  • Success rates remain below 90% in challenging tasks.
  • Open questions:
    • Impact of base VLM size on performance.
    • Benefits of co-training with Internet-scale data.
    • Best visual features for VLAs.

Contributions

  1. First open-source generalist VLA with strong performance.
  2. Scalable end-to-end training pipeline (action-as-token).
  3. Demonstrates LoRA + quantization for consumer-grade GPU adaptation.
  4. Provides code, checkpoints, and data curation recipes to support future research.

Ref

  • Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E. P., Sanketi, P. R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., & Finn, C. (2025). OpenVLA: An Open-Source Vision-Language-Action Model. Proceedings of The 8th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v270/kim25c.html

Octo Review

· 4 min read

Octo

  • Octo is a transformer-based policy with modular tokenizers (language via T5, images via CNN patches), blockwise masking, and readout tokens, trained on 800k multi-robot trajectories.
  • Actions are generated through a diffusion head that produces continuous, multimodal, chunked predictions, enabling precise control and broad generalization.
  • It achieves state-of-the-art zero-shot performance across 7 robots and allows efficient finetuning to new sensors and action spaces, while being fully open-source.

| Category | Simple Analogy | Actual Tokenization |
| --- | --- | --- |
| Language | [Sentence] | [l₁, l₂, l₃, …] → multiple tokens from a tokenized sentence |
| Goal Image | [Goal] | [g₁, g₂, g₃, …] → image split into patches |
| Observation (time t) | [Observation] | [oₜ¹, oₜ², oₜ³, …] → camera frames/sensors tokenized into patches |
| Readout Token | [ ] (empty slot) | [TR,t] → one per timestep, reserved for predicting actions |

Time t-1: [l] [g] [o_{t-1}] [TR,t-1]
Time t: [l] [g] [o_t] [TR,t]
Time t+1: [l] [g] [o_{t+1}] [TR,t+1]

[TR,t-1], [TR,t], [TR,t+1] ──► Diffusion head ──► [a_t, a_{t+1}, …]

Motivation

  • Traditional robot learning trains policies from scratch on robot/task-specific datasets → costly data collection, narrow generalization.
  • Generalist Robot Policies (GRPs) pretrained on diverse robots/tasks can be finetuned with little in-domain data while generalizing broadly.
  • Real-world deployments face challenges across robot embodiments, sensor setups, action spaces, task specs, and environments.

Prior GRPs & Gaps

  • GRPs aim for low-level visuomotor control across tasks, environments, and robotic systems.
  • Existing models often have restricted inputs (e.g., a single camera), lack efficient finetuning to new domains, and, importantly, the largest models are not publicly available.

Contribution (What is Octo?)

  • Octo: a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
  • Accepts language instructions or goal images, and can be finetuned within hours on consumer GPUs to new sensors and action spaces.
  • First GRP to support effective finetuning to new observations and actions and to be fully open-source (training pipeline, checkpoints, data).
  • Novelty lies in combining: transformer backbone + language/goal image conditioning + diffusion head for expressive action distributions.

Architecture

  • Input tokenizers:
    • Language via pretrained T5-base
    • Images via shallow CNN → patch tokens
  • Transformer backbone: processes unified token sequence.
  • Blockwise masking + Readout tokens:
    • Nonexistent modalities are masked
    • Readout tokens attend to the task and observation tokens that precede them, but task and observation tokens never attend to readout tokens
  • Diffusion action head: predicts continuous, multimodal, chunked actions.
  • Modularity: new sensors/outputs can be added by only training lightweight encoders or heads; pretrained backbone remains unchanged.
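
The blockwise masking and readout-token rules above can be written out as a boolean attention mask. This is a sketch of the rules as summarized in these notes, not Octo's implementation; the token layout (task tokens first, then observation tokens plus one readout token per timestep) and the function name are assumptions for illustration.

```python
def build_blockwise_mask(n_task: int, n_obs: int, n_steps: int):
    """Return (tokens, mask) where mask[q][k] is True if query q may attend to key k."""
    tokens = [("task", -1)] * n_task
    for t in range(n_steps):
        tokens += [("obs", t)] * n_obs + [("readout", t)]

    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for q, (q_kind, q_t) in enumerate(tokens):
        for k, (k_kind, k_t) in enumerate(tokens):
            if k_kind == "readout":
                allowed = q == k  # nothing reads from a readout token but itself
            elif k_kind == "task":
                allowed = True  # task tokens are visible to every token
            else:  # observation key: visible to same or later timesteps only
                allowed = q_kind != "task" and k_t <= q_t
            mask[q][k] = allowed
    return tokens, mask


tokens, mask = build_blockwise_mask(n_task=1, n_obs=2, n_steps=2)
for (kind, t), row in zip(tokens, mask):
    print(f"{kind:8s}{t:3d} ", "".join("x" if allowed else "." for allowed in row))
```

Because no other token depends on a readout token, readout tokens (and the heads attached to them) can be added or swapped during finetuning without changing what the pretrained backbone computes for the observation and task tokens.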

Octo Architecture

Training Data & Objective

  • Mixture of 25 heterogeneous robot datasets: diverse robots, sensors (with/without wrist cams), labels (with/without language).
  • Conditional diffusion decoding predicts continuous, multimodal action distributions.
    • Transformer runs one forward pass; denoising steps are contained in the small diffusion head.
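
The point that only the small head iterates can be illustrated with a toy sampler. This sketch uses the standard DDPM reverse update with a linear noise schedule; `backbone` and `noise_head` are hypothetical stand-ins (not trained networks), and the chunk length, action dimension, and step count are illustrative assumptions rather than Octo's settings.

```python
import math
import random

CALLS = {"backbone": 0, "head": 0}


def backbone(observation):
    """Hypothetical stand-in for the transformer: returns one readout embedding."""
    CALLS["backbone"] += 1
    return sum(observation) / len(observation)


def noise_head(x, embedding, k):
    """Hypothetical stand-in for the denoising network eps(x_k, e, k)."""
    CALLS["head"] += 1
    return [xi - embedding for xi in x]


def sample_action_chunk(observation, chunk_len=4, action_dim=2, n_steps=20, seed=0):
    rng = random.Random(seed)
    betas = [1e-4 + (0.02 - 1e-4) * i / (n_steps - 1) for i in range(n_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    embedding = backbone(observation)  # the expensive model runs exactly once
    x = [rng.gauss(0.0, 1.0) for _ in range(chunk_len * action_dim)]
    for k in reversed(range(n_steps)):  # only the small head is iterated
        eps = noise_head(x, embedding, k)
        coef = betas[k] / math.sqrt(1.0 - alpha_bars[k])
        mean = [(xi - coef * ei) / math.sqrt(alphas[k]) for xi, ei in zip(x, eps)]
        noise_scale = math.sqrt(betas[k]) if k > 0 else 0.0
        x = [m + noise_scale * rng.gauss(0.0, 1.0) for m in mean]
    # Reshape the flat vector into a chunk of `chunk_len` actions.
    return [x[i * action_dim:(i + 1) * action_dim] for i in range(chunk_len)]


chunk = sample_action_chunk([0.1, 0.2, 0.3])
print(CALLS, len(chunk), len(chunk[0]))
```

The call counter makes the cost split visible: one backbone pass per control step, with all denoising iterations confined to the lightweight head.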

Experiments

  • Evaluated on 7 robotic platforms across 4 institutions.
  • Key questions:
    1. Zero-shot multi-robot control?
    2. Do Octo weights improve finetuning vs. scratch or standard pretrained representations?
    3. Which design choices matter for generalist robot policies?

Results

  • Achieves state-of-the-art zero-shot multi-robot control, competitive with RT-1-X and RT-2-X.
  • Provides a versatile policy initialization: significantly outperforms baselines for data-efficient finetuning to new obs/action spaces.

Limitations / Future Work

  • Needs better language conditioning, improved wrist camera support, and data beyond optimal demonstrations.

One-line Takeaway

  • Octo = modular, efficient, open-source GRP:
    A transformer + diffusion policy trained on large-scale multi-robot data that adapts quickly with little in-domain data to new sensors and action spaces, enabling broad generalization.

Ref

  • Mees, O., Ghosh, D., Pertsch, K., Black, K., Walke, H. R., Dasari, S., Hejna, J., Kreiman, T., Xu, C., & Luo, J. (2024). Octo: An open-source generalist robot policy. First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024.

RT-2, Robotic Transformer 2 Review

· 4 min read

  • Trains a Vision-Language-Action (VLA) model by co-fine-tuning web-scale VLMs with robot trajectories, and treats robot actions as text tokens.
  • Yields strong generalization and emergent capabilities (symbol understanding, reasoning, human recognition) beyond what appears in robot data.
  • Runs in direct closed-loop control; largest evaluated model (55B) executes at ~1–3 Hz via a cloud (multi-TPU) inference setup.

RT-2 Architecture

What RT-2 Is

  • A family of VLA models (RT-2-PaLI-X, RT-2-PaLM-E) that fine-tune large VLMs on robot trajectories to output low-level actions.
  • Target: generalizable, semantically aware manipulation policies that map images + instructions → actions end-to-end.
  • RT-2 does not rely on a restricted 2D action space or calibrated cameras.
  • The unified output space lets language and action tokens share the same model weights, without action-only layers.

Core Recipe

  • Directly train open-vocabulary VQA/dialogue VLMs to output robot actions while they still solve standard vision-language tasks.
  • Build on RT-1 protocol/data, but replace the policy backbone with a large VLM.

Action as Language (Tokenization)

  • Discretize continuous action dims (Δpos/Δrot, gripper, terminate) into 256 bins; represent each dimension with an integer token.
  • PaLI-X: reuse numeric tokens (≤1000). PaLM-E: overwrite 256 least-frequent tokens as action vocabulary (symbol tuning).
  • Form a single output string per step (e.g., terminate Δpos_x Δpos_y Δpos_z Δrot_x Δrot_y Δrot_z gripper).
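
The tokenization steps above can be sketched as a small encoder and parser. This is an illustration under assumptions, not RT-2's code: the value ranges (`-1..1` for deltas, `0..1` for the gripper) and the function names are hypothetical.

```python
N_BINS = 256


def to_bin(value, low, high, n_bins=N_BINS):
    """Uniformly discretize a clipped value into an integer bin in [0, n_bins)."""
    value = min(max(value, low), high)
    return min(int((value - low) / (high - low) * n_bins), n_bins - 1)


def action_to_string(terminate, deltas, gripper, low=-1.0, high=1.0):
    """Order: terminate, 3 position deltas, 3 rotation deltas, gripper."""
    if len(deltas) != 6:
        raise ValueError("expected 3 position deltas and 3 rotation deltas")
    bins = [int(terminate)]
    bins += [to_bin(d, low, high) for d in deltas]
    bins.append(to_bin(gripper, 0.0, 1.0))
    return " ".join(str(b) for b in bins)


def string_to_bins(text):
    """Parse a model output string back into 8 integer bins."""
    bins = [int(tok) for tok in text.split()]
    if len(bins) != 8 or any(not 0 <= b < N_BINS for b in bins):
        raise ValueError(f"not a valid action string: {text!r}")
    return bins


# Hypothetical action: no termination, zero motion, gripper fully at its upper bound.
encoded = action_to_string(False, [0.0] * 6, 1.0)
print(encoded)
print(string_to_bins(encoded))
```

Since the action is just a short string of integers, the same decoder that answers VQA prompts can emit it, which is what allows language and action outputs to share one set of weights.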

Co-Fine-Tuning & Output Constraint

  • Mix robot data with original web VQA/caption data in training batches (up-weight robot samples) to prevent forgetting and improve generalization.
  • During decoding on robot tasks, restrict sampling to valid action tokens so outputs are always executable.
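
The output constraint amounts to masking logits before picking a token. Below is a minimal sketch with a toy 4-token vocabulary; the logits and the set of valid action ids are made up for illustration and the helper names are hypothetical.

```python
import math


def mask_logits(logits, valid_ids):
    """Set every logit outside `valid_ids` to -inf."""
    valid = set(valid_ids)
    if not valid:
        raise ValueError("at least one valid token id is required")
    return [l if i in valid else -math.inf for i, l in enumerate(logits)]


def constrained_argmax(logits, valid_ids):
    """Greedy decoding restricted to the valid action tokens."""
    masked = mask_logits(logits, valid_ids)
    return max(range(len(masked)), key=masked.__getitem__)


def constrained_probs(logits, valid_ids):
    """Softmax over the masked logits; invalid tokens get probability 0."""
    masked = mask_logits(logits, valid_ids)
    peak = max(masked)
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: token 0 has the highest logit but is not an action token.
logits = [5.0, 1.0, 3.0, 2.0]
action_ids = [1, 2, 3]
choice = constrained_argmax(logits, action_ids)
probs = constrained_probs(logits, action_ids)
print(choice, probs)
```

Masking before sampling guarantees that every decoded token parses as an action bin, even though the co-fine-tuned model still assigns probability to ordinary words.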

Closed-Loop Control & Real-Time Inference

  • RT-2 is trained and deployed for direct closed-loop control (camera → action → camera …), not just high-level planning.
  • For large models, inference runs via a multi-TPU cloud service; RT-2-PaLI-X-55B reaches ~1–3 Hz; smaller models ~5 Hz.

Generalization & Benchmarks

  • Matches RT-1 on seen tasks but far exceeds baselines on unseen objects/backgrounds/environments (~2× vs RT-1/MOO; up to ~6× vs others).
  • Open-source Language-Table sim: co-fine-tuned PaLI-3B outperforms baselines, showing the approach transfers to other robots/sims.

Emergent Capabilities

  • Symbol understanding (e.g., “move apple to 3 / heart / star”).
  • Reasoning (visual matching, simple math like “sum of two plus one”, multilingual commands).
  • Human recognition (e.g., “person with glasses”); none of these were present as low-level actions in robot data.
  • Chain-of-thought (CoT) variant adds a Plan step before actions → supports multi-stage semantic reasoning (e.g., pick a rock as an improvised hammer; pick an energy drink for a tired person).

rt-2-cot

Scaling & Ablations

  • From-scratch training (even 5B) performs poorly; fine-tuning helps; co-fine-tuning helps most.
  • Bigger models (55B > 5B) generalize better.
  • PaLM-E variant shows an edge on math reasoning; PaLI-X stronger on symbols/vision reasoning on average.

Limitations

  • Does not learn fundamentally new motor skills beyond the distribution in robot data; mainly transfers semantic/visual knowledge.
  • Compute/latency costly; real-time control can bottleneck. Limited availability of strong open VLMs and convenient FT APIs.

Future Directions (from the text)

  • Acquire new skills from human videos or richer datasets.
  • Quantization/distillation for faster/cheaper inference.
  • More open VLMs / FT APIs to make VLA models broadly buildable.

Ref

  • Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P. R., Salazar, G., Ryoo, M. S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.-W. E., Leal, I., Kuang, Y., Kalashnikov, D., Julian, R., Joshi, N. J., Irpan, A., Ichter, B., Hsu, J., Herzog, A., Hausman, K., Gopalakrishnan, K., Fu, C., Florence, P., Finn, C., Dubey, K. A., Driess, D., Ding, T., Choromanski, K. M., Chen, X., Chebotar, Y., Carbajal, J., Brown, N., Brohan, A., Arenas, M. G., & Han, K. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Proceedings of The 7th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v229/zitkovich23a.html

PaLM-E An Embodied Multimodal Language Model Review

· 4 min read

PaLM-E

  • ViT (e.g., ViT-4B, ViT-22B) extracts image embeddings.
  • OSRT builds object-centric slot representations.
  • These are injected into the LLM embedding space (PaLM variants: 8B, 62B, 540B) for high-level abstraction and planning, with execution delegated to low-level policies (e.g., RT-1).

PaLM-E Architecture

Core idea

  • Build embodied language models by injecting continuous sensor inputs (images, states, other modalities) directly into a pretrained LLM’s embedding space, linking words ↔ percepts.
  • Inputs are multimodal sentences that interleave text tokens with encoded visual/state tokens; outputs are text (answers or high-level plans).

Architecture & representations

  • Start from a decoder-only, autoregressive LLM (PaLM) and condition on a prefix that mixes text and encoder-produced vectors.
  • Provide multiple encoder options:
    • State vectors (simplest).
    • ViT features with a learned projector ψ to match LLM embedding dimensionality.
    • Object-centric, 3D-aware OSRT (neural scene representations). Supports entity-label tokens (<obj j>) so the model can refer to specific objects in generated plans.
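
A multimodal sentence can be pictured as one list of same-sized vectors, some looked up from the text embedding table and some produced by an encoder and projector. The sketch below uses toy sizes (2-d encoder features, 4-d LLM embeddings) and made-up values; `embed_multimodal_sentence` and the item format are hypothetical names for illustration.

```python
def project(vec, weight):
    """Learned projector psi: maps an encoder feature to the LLM embedding size."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def embed_multimodal_sentence(items, embedding_table, weight):
    """Interleave text-token embeddings with projected observation vectors."""
    sequence = []
    for kind, value in items:
        if kind == "text":
            sequence.extend(embedding_table[tok] for tok in value)
        elif kind == "obs":
            sequence.extend(project(vec, weight) for vec in value)
        else:
            raise ValueError(f"unknown item kind: {kind}")
    return sequence


# Toy setup: 3-token vocabulary, 4-d LLM embeddings, 2-d encoder features.
embedding_table = {0: [0.0] * 4, 1: [0.1] * 4, 2: [0.2] * 4}
psi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
items = [
    ("text", [1, 2]),                       # e.g. words before the image
    ("obs", [[0.5, 0.25], [0.25, 0.5]]),    # stand-in encoder outputs
    ("text", [0]),                          # e.g. words after the image
]
prefix = embed_multimodal_sentence(items, embedding_table, psi)
print(len(prefix), prefix[2])
```

After this step the LLM sees a single sequence of embeddings and does not need to know which positions came from text and which from sensors.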

Training setup

  • Train end-to-end (encoders + projector + optionally the LLM) to output sequential decisions as natural text or answers (VQA, captioning).
  • Dataset items contain (continuous observations, text sequence, prefix index); loss is cross-entropy on non-prefix tokens.
  • Explore freezing the LLM (train encoders/projection only), and co-training across diverse tasks ("full mixture"; only ~9% is embodied data).
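
The "cross-entropy on non-prefix tokens" objective simply skips the prompt positions when averaging the loss. A minimal sketch, assuming we are handed the probability the model assigned to each correct token (the example values are made up):

```python
import math


def non_prefix_cross_entropy(correct_token_probs, prefix_len):
    """Average negative log-likelihood over tokens after the prefix."""
    targets = correct_token_probs[prefix_len:]
    if not targets:
        raise ValueError("the sequence has no non-prefix tokens")
    return -sum(math.log(p) for p in targets) / len(targets)


# Toy sequence: the first 2 positions are the multimodal prefix (prompt),
# the last 2 are the text the model should produce.
probs = [0.1, 0.2, 0.5, 0.5]
loss = non_prefix_cross_entropy(probs, prefix_len=2)
print(loss)
```

The prefix positions (including the injected observation vectors) still condition the prediction; they just do not contribute loss terms of their own.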

Planning & control loop

  • For planning/control, PaLM-E emits textual subgoals/skills drawn from a small skill vocabulary; a separate low-level policy executes them.
  • The system runs closed-loop: execute → observe → (re)plan; PaLM-E acts as a high-level policy sequencing low-level skills.
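
The execute → observe → (re)plan loop can be sketched with toy stand-ins. Nothing below is PaLM-E or RT-1: `toy_planner` is a hand-written rule standing in for the high-level model, `toy_low_level_policy` mutates a dictionary instead of moving a robot, and the two skill strings are a hypothetical skill vocabulary.

```python
def run_closed_loop(planner, execute_skill, observe, max_steps=10):
    """Ask the planner for one skill at a time, re-observing after each one."""
    executed = []
    for _ in range(max_steps):
        observation = observe()
        skill = planner(observation, executed)
        if skill == "done":
            break
        execute_skill(skill)
        executed.append(skill)
    return executed


# Toy world state standing in for the real scene.
world = {"drawer_open": False, "holding_chips": False}


def observe():
    return dict(world)


def toy_planner(observation, executed):
    """Hypothetical high-level policy choosing from a tiny skill vocabulary."""
    if not observation["drawer_open"]:
        return "open the drawer"
    if not observation["holding_chips"]:
        return "pick up the chips"
    return "done"


def toy_low_level_policy(skill):
    """Hypothetical low-level policy: applies the skill's effect to the world."""
    if skill == "open the drawer":
        world["drawer_open"] = True
    elif skill == "pick up the chips":
        world["holding_chips"] = True


plan = run_closed_loop(toy_planner, toy_low_level_policy, observe)
print(plan)
```

Because the planner is queried again after every skill, a disturbance that undoes a step (for example, the drawer being closed again) would show up in the next observation and lead to that skill being re-issued.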

Why not text-only LLMs or affordance-only grounding?

  • Prior work that feeds only text to the LLM (and uses external affordance models) is insufficient when spatial layout matters.
  • PaLM-E instead grounds inside the LLM by injecting continuous observations, enabling direct plan generation while leveraging the LLM’s world knowledge.

Environments & use cases

  • Three domains: TAMP (grasp/stack planning), Language-Table (multi-object tabletop pushing), Mobile manipulation (kitchen tasks).
  • Use cases to test embodied reasoning: affordance prediction, failure detection, long-horizon planning (low-level policies from RT-1).

Results (high level)

  • Transfer via co-training: One model trained on mixed tasks/embodiments achieves higher performance than task-specialists; "full mixture" yields >2× gains (Fig. 3).
  • Few-shot/data efficiency: Solves robotics tasks with very few examples (e.g., 10–80 for Language-Table, 320 for TAMP). OSRT further improves data efficiency.
  • Mobile manipulation: End-to-end embodied planning works in real kitchens, robust to disturbances; PaLM-E beats PaLI (zero-shot) and QT-OPT/CLIP baselines on affordance/failure detection.
  • General V+L: The 562B generalist achieves state-of-the-art on OK-VQA and strong VQAv2/COCO without task-specific finetuning.
  • Language retention & scaling: Freezing the LLM preserves language ability but can hurt performance on some robotics tasks; unfreezing it and scaling up the model significantly reduces catastrophic forgetting.
  • Emergent behaviors: Multimodal chain-of-thought and multi-image reasoning emerge in PaLM-E-562B, despite training on single-image prompts.

Takeaways

  • Injecting neural scene representations (OSRT) and entity-labeled multimodal tokens is effective even without massive embodied data.
  • Diverse, joint training transfers vision-language knowledge into embodied decision-making, enabling data-efficient robot planning.
  • Two viable paths to retain language skills during multimodal finetuning:
    1. Freeze the LLM, train encoders (max language retention, sometimes weaker robotics),
    2. Unfreeze and scale the LLM (much less forgetting, strong embodied performance).

Ref

  • Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., & Florence, P. (2023). PaLM-E: An Embodied Multimodal Language Model. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v202/driess23a.html