RoboFlamingo Review
· 2 min read
RoboFlamingo
- RoboFlamingo decouples vision-language understanding and control, using OpenFlamingo for perception and a lightweight policy head for sequential decision-making.
- Unlike prior VLM-based approaches, it requires only small-scale imitation fine-tuning on language-conditioned manipulation data, without large-scale co-fine-tuning.
- This design enables data-efficient, zero-shot generalizable, and deployable robot manipulation policies on modest compute resources.
Key Idea
- Proposes RoboFlamingo, a simple framework to adapt existing VLMs for robotic manipulation with lightweight fine-tuning.
- Built on OpenFlamingo, decoupling vision-language understanding from decision-making.
- Pre-trained VLM handles language and visual comprehension, while a dedicated policy head models sequential history.
- Fine-tuned only on language-conditioned manipulation datasets using imitation learning.
Advantages
- Requires only a small amount of demonstrations to adapt to downstream manipulation tasks.
- Provides open-loop control capability → deployable on low-performance platforms.
- Can be trained/evaluated on a single GPU server, making it a cost-effective and accessible solution.
Benchmarks
- Evaluated on CALVIN benchmark (34 tasks, 1000 instruction chains).
- RoboFlamingo achieves 2× performance improvements over previous state-of-the-art methods.
Performance
- Imitation Learning: Outperforms all baselines across all metrics.
- Zero-shot Generalization:
- Vision: Stronger generalization in ABC→D setting.
- Language: Robust to GPT-4 generated synonymous instructions.
- Ablation Studies:
- Ignoring history (MLP w/o hist) gives worst results.
- LSTM and GPT-based policy heads perform best (LSTM chosen as default).
- VL pre-training is crucial for downstream manipulation.
- Larger VLMs show better data efficiency.
- Instruction fine-tuning improves both seen and unseen tasks.
Flexibility of Deployment
- Supports open-loop control by predicting entire action sequences with a single inference → reduces latency and test-time compute.
- Direct open-loop use without retraining can degrade performance; mitigated with jump-step demonstrations.
Conclusion
- Demonstrates that pre-trained VLMs enable data efficiency and strong zero-shot generalization in robotic manipulation.
- RoboFlamingo is presented as an intuitive, efficient, and open solution, with high potential when combined with large-scale real robot data.
Ref
- Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., & Liu, H. (2024). Vision-language foundation models as effective robot imitators. International Conference on Learning Representations (ICLR 2024), Vienna, Austria.