OpenVLA Review

August 29, 2025 · 3 min read

Owner

OpenVLA

OpenVLA is a 7B open-source VLA model built on Llama2 + DINOv2 + SigLIP, trained on 970k demos, achieving stronger generalization and robustness than closed RT-2-X (55B) and outperforming Diffusion Policy.
It introduces efficient adaptation via LoRA (1.4% params, 8× compute reduction) and 4-bit quantization (half memory, same accuracy), enabling fine-tuning and inference on consumer GPUs.
Limitations remain (single-image input, <90% reliability, limited throughput), but OpenVLA provides the first open, scalable framework for generalist robot policies.

OpenVLA Architecture

Training robot policies from scratch struggles with robustness and generalization.
Fine-tuning vision-language-action (VLA) models offers reusable, generalizable visuomotor policies.
Barriers: prior VLAs are closed-source, lack best practices for adaptation, and need server-class hardware.

OpenVLA: 7B parameters, open-source.
Built on Llama 2 with fused DINOv2 + SigLIP vision encoders.
Trained on 970k robot demonstrations from Open-X Embodiment dataset.
Represents robot actions as tokens (discretized into 256 bins, replacing unused Llama tokens).
Standard next-token prediction objective.

End-to-end fine-tuning of VLM to generate robot actions as tokens.
Differs from modular methods (e.g., Octo) that stitch separate encoders/decoders.
Vision features are obtained by encoding the same input image with both SigLIP and DINOv2, then channel-wise concatenated and passed through an MLP projector. This preserves SigLIP’s semantic alignment with language and DINOv2's spatial reasoning, giving the VLM richer multimodal context for manipulation tasks.
Uses Prismatic VLM backbone with multi-resolution features (spatial reasoning + semantics).

Outperforms closed RT-2-X (55B) by +16.5% task success with 7× fewer parameters.
Beats Diffusion Policy (from-scratch imitation learning) by +20.4% on multi-task language-grounded settings.
Demonstrates robust behaviors (distractor resistance, error recovery).

Introduces parameter-efficient fine-tuning:
- LoRA updates only 1.4% of parameters yet matches full fine-tuning.
- Can fine-tune on a single A100 GPU in ~10–15 hours (8× compute reduction).
Quantization:
- 4-bit inference matches bfloat16 accuracy while halving memory footprint.
- Runs at 3Hz on consumer GPUs (e.g., A5000, 16GB).

Tested across 29 tasks and multiple robots (WidowX, Google robot, Franka).
Strong generalization on:
- Visual (unseen backgrounds/distractors).
- Motion (new object positions/orientations).
- Physical (new object shapes/sizes).
- Semantic (unseen tasks, instructions).
First generalist open-source VLA achieving ≥50% success rate across all tested tasks.

Fine-tuning the vision encoder (vs. freezing) crucial for robotic control.
Higher image resolution (384px vs. 224px) adds 3× compute without performance gains.
Training required 27 epochs, far more than typical VLM runs, to surpass 95% action token accuracy.

Supports only single-image observations (no proprioception, no history).
Inference throughput (~6Hz on RTX 4090) insufficient for high-frequency control (e.g., ALOHA at 50Hz).
Success rates remain below 90% in challenging tasks.
Open questions:
- Impact of base VLM size on performance.
- Benefits of co-training with Internet-scale data.
- Best visual features for VLAs.

First open-source generalist VLA with strong performance.
Scalable end-to-end training pipeline (action-as-token).
Demonstrates LoRA + quantization for consumer-grade GPU adaptation.
Provides code, checkpoints, and data curation recipes to support future research.

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E. P., Sanketi, P. R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., & Finn, C. (2025). OpenVLA: An Open-Source Vision-Language-Action Model Proceedings of The 8th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v270/kim25c.html