RoboFlamingo Review

August 31, 2025 · 2 min read

Owner

RoboFlamingo

RoboFlamingo decouples vision-language understanding and control, using OpenFlamingo for perception and a lightweight policy head for sequential decision-making.
Unlike prior VLM-based approaches, it requires only small-scale imitation fine-tuning on language-conditioned manipulation data, without large-scale co-fine-tuning.
This design enables data-efficient, zero-shot generalizable, and deployable robot manipulation policies on modest compute resources.

Proposes RoboFlamingo, a simple framework to adapt existing VLMs for robotic manipulation with lightweight fine-tuning.
Built on OpenFlamingo, decoupling vision-language understanding from decision-making.
Pre-trained VLM handles language and visual comprehension, while a dedicated policy head models sequential history.
Fine-tuned only on language-conditioned manipulation datasets using imitation learning.

Requires only a small amount of demonstrations to adapt to downstream manipulation tasks.
Provides open-loop control capability → deployable on low-performance platforms.
Can be trained/evaluated on a single GPU server, making it a cost-effective and accessible solution.

Evaluated on CALVIN benchmark (34 tasks, 1000 instruction chains).
RoboFlamingo achieves 2× performance improvements over previous state-of-the-art methods.

Imitation Learning: Outperforms all baselines across all metrics.
Zero-shot Generalization:
- Vision: Stronger generalization in ABC→D setting.
- Language: Robust to GPT-4 generated synonymous instructions.
Ablation Studies:
- Ignoring history (MLP w/o hist) gives worst results.
- LSTM and GPT-based policy heads perform best (LSTM chosen as default).
- VL pre-training is crucial for downstream manipulation.
- Larger VLMs show better data efficiency.
- Instruction fine-tuning improves both seen and unseen tasks.

Supports open-loop control by predicting entire action sequences with a single inference → reduces latency and test-time compute.
Direct open-loop use without retraining can degrade performance; mitigated with jump-step demonstrations.

Demonstrates that pre-trained VLMs enable data efficiency and strong zero-shot generalization in robotic manipulation.
RoboFlamingo is presented as an intuitive, efficient, and open solution, with high potential when combined with large-scale real robot data.

Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., & Liu, H. (2024). Vision-language foundation models as effective robot imitators. International Conference on Learning Representations (ICLR 2024), Vienna, Austria.