Skip to main content

RoboFlamingo Review

· 2 min read

RoboFlamingo

  • RoboFlamingo decouples vision-language understanding and control, using OpenFlamingo for perception and a lightweight policy head for sequential decision-making.
  • Unlike prior VLM-based approaches, it requires only small-scale imitation fine-tuning on language-conditioned manipulation data, without large-scale co-fine-tuning.
  • This design enables data-efficient, zero-shot generalizable, and deployable robot manipulation policies on modest compute resources.

Key Idea

  • Proposes RoboFlamingo, a simple framework to adapt existing VLMs for robotic manipulation with lightweight fine-tuning.
  • Built on OpenFlamingo, decoupling vision-language understanding from decision-making.
  • Pre-trained VLM handles language and visual comprehension, while a dedicated policy head models sequential history.
  • Fine-tuned only on language-conditioned manipulation datasets using imitation learning.

Advantages

  • Requires only a small amount of demonstrations to adapt to downstream manipulation tasks.
  • Provides open-loop control capability → deployable on low-performance platforms.
  • Can be trained/evaluated on a single GPU server, making it a cost-effective and accessible solution.

Benchmarks

  • Evaluated on CALVIN benchmark (34 tasks, 1000 instruction chains).
  • RoboFlamingo achieves 2× performance improvements over previous state-of-the-art methods.

Performance

  • Imitation Learning: Outperforms all baselines across all metrics.
  • Zero-shot Generalization:
    • Vision: Stronger generalization in ABC→D setting.
    • Language: Robust to GPT-4 generated synonymous instructions.
  • Ablation Studies:
    • Ignoring history (MLP w/o hist) gives worst results.
    • LSTM and GPT-based policy heads perform best (LSTM chosen as default).
    • VL pre-training is crucial for downstream manipulation.
    • Larger VLMs show better data efficiency.
    • Instruction fine-tuning improves both seen and unseen tasks.

Flexibility of Deployment

  • Supports open-loop control by predicting entire action sequences with a single inference → reduces latency and test-time compute.
  • Direct open-loop use without retraining can degrade performance; mitigated with jump-step demonstrations.

Conclusion

  • Demonstrates that pre-trained VLMs enable data efficiency and strong zero-shot generalization in robotic manipulation.
  • RoboFlamingo is presented as an intuitive, efficient, and open solution, with high potential when combined with large-scale real robot data.

Ref

  • Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., & Liu, H. (2024). Vision-language foundation models as effective robot imitators. International Conference on Learning Representations (ICLR 2024), Vienna, Austria.