Full Stack JavaScript Developer | Half-time Open Sourcerer.

π0 Review

· about 4 min read

π0

Problem & Motivation

  • Achieving real-world generality in robot learning is blocked by data scarcity, generalization, and robustness limits.
  • Human intelligence most outpaces machines in versatility—solving diverse, physically situated tasks under constraints, language commands, and perturbations.
  • In NLP/CV, foundation models pre-trained on diverse multi-task data, then fine-tuned (aligned) on curated datasets, outperform narrow specialists; the same paradigm is hypothesized for robotics.

Core Proposal

  • A novel flow-matching architecture built on a pre-trained Vision-Language Model (VLM) to inherit Internet-scale semantics.
  • Further training adds robot actions, turning the model into a Vision-Language-Action (VLA) policy.
  • Use cross-embodiment training to combine data from many robot types (single/dual-arm, mobile), despite differing configuration/action spaces.
  • Employ action chunking + flow matching (diffusion variant) to model complex, continuous, high-frequency actions.
  • Introduce an Action Expert (separate weights for action/state tokens), akin to a Mixture-of-Experts, augmenting the standard VLM.

Training Recipe (Pre- vs Post-Training)

  • Pre-training on highly diverse data builds broad, general physical abilities.
  • Post-training on curated, task-specific data instills fluent, efficient strategies.
  • Rationale: high-quality-only training lacks recovery behaviors; low-quality-only training lacks efficiency/robustness; combining both yields desired behavior.

Data & Backbone

  • ~10,000 hours of demonstrations + the OXE dataset; data spans 7 robot configurations and 68 tasks.
  • VLM backbone initialized from PaliGemma (3B); add ~300M parameters for the action expert (total ~3.3B).
  • Pre-training mixture: weighted combination of internal datasets + full OXE; n^0.43 weighting to down-weight overrepresented task-robot pairs.
  • Unify interfaces: zero-pad the configuration q_t and action a_t vectors to the largest robot dimension (18); mask missing image slots; late-fusion encoders map images/states to the same token space as language.
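
The two data-handling steps above (n^0.43 weighting and zero-padding to 18 dimensions) can be sketched roughly as follows. This is only an illustration: the dataset names and sizes are made up, and normalizing the weights so they sum to 1 is an assumption that the summary does not state.

```python
def mixture_weights(sizes, exponent=0.43):
    """Weight each task-robot pair by n**exponent, then normalize (assumed)."""
    raw = {name: n ** exponent for name, n in sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def zero_pad(vector, target_dim=18):
    """Right-pad a configuration/action vector with zeros up to target_dim."""
    if len(vector) > target_dim:
        raise ValueError("vector is longer than the target dimension")
    return list(vector) + [0.0] * (target_dim - len(vector))


# Hypothetical dataset sizes (number of examples per task-robot pair).
sizes = {"big_dataset": 1_000_000, "small_dataset": 10_000}
weights = mixture_weights(sizes)

# Hypothetical 7-dimensional action from a single-arm robot.
padded = zero_pad([0.1, -0.2, 0.3, 0.0, 0.5, 0.9, 1.0])
```

With these made-up sizes the larger dataset has 100 times more examples but only about 7 times the sampling weight, which is the down-weighting effect the bullet describes.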

Modeling Details

  • Conditional flow matching models the continuous distribution over action chunks.
  • Train with a diffusion-style loss on individual sequence elements (instead of cross-entropy), with separate weights for diffusion-related tokens.
  • Flow path uses a linear-Gaussian schedule; sample noisy actions with ε∼N(0, I); predict denoising vector field; Euler integration from τ=0→1 at inference.
  • Efficient inference by caching K/V for the observation prefix; action tokens recomputed per integration step.
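
The inference loop described above (start from Gaussian noise, follow a vector field with Euler steps from τ=0 to τ=1) can be sketched as below. This is a toy under stated assumptions: `oracle_field` is a hypothetical stand-in for the learned network that already knows the target, the sign convention (velocity pointing from noise toward data) is assumed, and the step count of 10 is an arbitrary choice.

```python
import random


def euler_sample(velocity_field, dim, steps=10, seed=0):
    """Integrate dx/dtau = v(x, tau) from tau=0 (noise) to tau=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from eps ~ N(0, I)
    delta = 1.0 / steps
    for k in range(steps):
        tau = k / steps
        v = velocity_field(x, tau)
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical flattened action chunk that the toy field points toward.
target = [0.5, -1.0, 0.25]


def oracle_field(x, tau):
    """Toy stand-in for the network: velocity of the straight path to target."""
    return [(t - xi) / (1.0 - tau) for t, xi in zip(target, x)]


sample = euler_sample(oracle_field, dim=len(target))
```

In the real model the field would be predicted from the cached observation prefix plus the current noisy action tokens, so only the action part is recomputed at each of the integration steps.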

High-Level Language Policy

  • Because the policy consumes language, a high-level VLM can decompose tasks (e.g., bussing) into intermediate language subgoals (SayCan-style planning), improving performance on complex, temporally extended tasks.

Evaluation Setup & Baselines

  • Out-of-box (direct prompting), fine-tuning on downstream tasks, and with high-level VLM providing intermediate commands.
  • Compare against OpenVLA (7B, autoregressive discretization; no action chunks/high-frequency control) and Octo (93M; diffusion), trained on the same mixture.
  • Include a compute-parity π0 (160k steps vs 700k) and a π0-small variant (no VLM init).

Key Results

  • Out-of-box: π0 outperforms all baselines; even compute-parity π0 beats OpenVLA/Octo; π0-small still surpasses them—highlighting the benefits of expressive architectures + diffusion/flow matching + VLM pre-training.
  • Language following: π0 clearly exceeds π0-small across conditions:
    • π0-flat: only overall task command.
    • π0-human: human-provided intermediate steps.
    • π0-HL: high-level VLM-provided steps (fully autonomous).
    • Better language-following accuracy directly translates into stronger autonomous performance with high-level guidance.
  • New dexterous tasks (e.g., bowls stacking, towel folding, microwave, drawer items, paper towel replacement):
    • Fine-tuned π0 generally outperforms OpenVLA, Octo, and small-data methods ACT / Diffusion Policy.
    • Pre-training helps most when tasks resemble pre-training data; pretrained π0 often beats the from-scratch model by a wide margin.
  • Complex multi-stage tasks (laundry folding, table bussing, box building, to-go box, eggs):
    • π0 solves many tasks; full pre-training + fine-tuning performs best.
    • Gains from pre-training are especially large on harder tasks; absolute performance varies with task difficulty and pre-training coverage.

Takeaways & Limitations

  • π0 mirrors LLM training: pre-train for knowledge, post-train for alignment (instruction-following and execution).
  • Limitations/open questions:
    • Optimal composition/weighting of pre-training data remains unclear.
    • Not all tasks work reliably; difficult to predict how much/what kind of data is needed for near-perfect performance.
    • Uncertain positive transfer across very diverse tasks/robots and to distinct domains (e.g., driving, navigation, legged locomotion).

Ref

  • Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li‑Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L. X., … Zhilinsky, U. (2025, June 21). π₀: A vision‑language‑action flow model for general robot control. Robotics: Science and Systems (RSS), Los Angeles, CA, United States. https://roboticsconference.org/program/papers/10/

Vima Review

· about 2 min read

VIMA

  • Unified Multimodal Prompts: Reformulates diverse robot tasks (language, images, video) into a single sequence modeling problem.
  • Object-Centric Tokenization: Uses object-level tokens (Mask R-CNN + ViT) instead of raw pixels, improving data efficiency and semantic generalization.
  • Cross-Attention Conditioning: Conditions the policy on prompts via cross-attention, maintaining strong zero-shot performance even with small models or novel tasks.

Motivation

  • Robot task specification comes in many forms: one-shot demonstrations, language instructions, and visual goals.
  • Traditionally, each task required distinct architectures and pipelines, leading to siloed systems with poor generalization.

VIMA Architecture

Key Contributions

  1. Multimodal Prompting

    • A novel formulation that unifies diverse robot manipulation tasks into a sequence modeling problem.
    • Prompts are defined as interleaved sequences of text and images, enabling flexibility across task formats.
  2. VIMA-BENCH

    • A large-scale benchmark with 17 tasks across six categories (object manipulation, goal reaching, novel concept grounding, video imitation, constraint satisfaction, visual reasoning).
    • Provides 650K expert trajectories and a four-level evaluation protocol for systematic generalization.
  3. VIMA Agent

    • A transformer-based visuomotor agent with encoder-decoder architecture and object-centric design.
    • Encodes prompts with a pre-trained T5 model, parses images into object tokens via Mask R-CNN + ViT, and decodes actions autoregressively using cross-attention.

Design Insights

  • Object-Centric Representation: Passing variable-length object token sequences directly to the controller is more effective than pixel-based tokenization.
  • Cross-Attention Conditioning: Stronger prompt focus and efficiency compared to simple concatenation (e.g., GPT-style).
  • Robustness: Minimal degradation under distractors or corrupted prompts, aided by T5 backbone and object augmentation.

Results

  • Performance:

    • Outperforms baselines (VIMA-Gato, VIMA-Flamingo, VIMA-GPT) by up to 2.9× success rate in hardest zero-shot generalization.
    • With 10× less training data, still 2.7× better than best competitor.
  • Scaling:

    • Sample-efficient: with just 1% of data, matches baselines trained with 10× more.
    • Generalization holds across L1–L4 evaluation, with smaller regression than alternatives.

Conclusion

VIMA demonstrates that multimodal prompting is a powerful unifying framework for robot learning.
It achieves strong scalability, data efficiency, and generalization, establishing a solid starting point for future generalist robot agents.

Ref

  • Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., & Fan, L. (2023). VIMA: Robot Manipulation with Multimodal Prompts. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v202/jiang23b.html

RoboFlamingo Review

· about 2 min read

RoboFlamingo

  • RoboFlamingo decouples vision-language understanding and control, using OpenFlamingo for perception and a lightweight policy head for sequential decision-making.
  • Unlike prior VLM-based approaches, it requires only small-scale imitation fine-tuning on language-conditioned manipulation data, without large-scale co-fine-tuning.
  • This design enables data-efficient, zero-shot generalizable, and deployable robot manipulation policies on modest compute resources.

Key Idea

  • Proposes RoboFlamingo, a simple framework to adapt existing VLMs for robotic manipulation with lightweight fine-tuning.
  • Built on OpenFlamingo, decoupling vision-language understanding from decision-making.
  • Pre-trained VLM handles language and visual comprehension, while a dedicated policy head models sequential history.
  • Fine-tuned only on language-conditioned manipulation datasets using imitation learning.

Advantages

  • Requires only a small amount of demonstrations to adapt to downstream manipulation tasks.
  • Provides open-loop control capability → deployable on low-performance platforms.
  • Can be trained/evaluated on a single GPU server, making it a cost-effective and accessible solution.

Benchmarks

  • Evaluated on CALVIN benchmark (34 tasks, 1000 instruction chains).
  • RoboFlamingo achieves 2× performance improvements over previous state-of-the-art methods.

Performance

  • Imitation Learning: Outperforms all baselines across all metrics.
  • Zero-shot Generalization:
    • Vision: Stronger generalization in ABC→D setting.
    • Language: Robust to GPT-4 generated synonymous instructions.
  • Ablation Studies:
    • Ignoring history (MLP w/o hist) gives worst results.
    • LSTM and GPT-based policy heads perform best (LSTM chosen as default).
    • VL pre-training is crucial for downstream manipulation.
    • Larger VLMs show better data efficiency.
    • Instruction fine-tuning improves both seen and unseen tasks.

Flexibility of Deployment

  • Supports open-loop control by predicting entire action sequences with a single inference → reduces latency and test-time compute.
  • Direct open-loop use without retraining can degrade performance; mitigated with jump-step demonstrations.

Conclusion

  • Demonstrates that pre-trained VLMs enable data efficiency and strong zero-shot generalization in robotic manipulation.
  • RoboFlamingo is presented as an intuitive, efficient, and open solution, with high potential when combined with large-scale real robot data.

Ref

  • Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., & Liu, H. (2024). Vision-language foundation models as effective robot imitators. International Conference on Learning Representations (ICLR 2024), Vienna, Austria.

OpenVLA Review

· about 3 min read

OpenVLA

  • OpenVLA is a 7B open-source VLA model built on Llama2 + DINOv2 + SigLIP, trained on 970k demos, achieving stronger generalization and robustness than closed RT-2-X (55B) and outperforming Diffusion Policy.
  • It introduces efficient adaptation via LoRA (1.4% params, 8× compute reduction) and 4-bit quantization (half memory, same accuracy), enabling fine-tuning and inference on consumer GPUs.
  • Limitations remain (single-image input, <90% reliability, limited throughput), but OpenVLA provides the first open, scalable framework for generalist robot policies.

OpenVLA Architecture

Motivation

  • Training robot policies from scratch struggles with robustness and generalization.
  • Fine-tuning vision-language-action (VLA) models offers reusable, generalizable visuomotor policies.
  • Barriers: prior VLAs are closed-source, lack best practices for adaptation, and need server-class hardware.

Model & Training

  • OpenVLA: 7B parameters, open-source.
  • Built on Llama 2 with fused DINOv2 + SigLIP vision encoders.
  • Trained on 970k robot demonstrations from Open-X Embodiment dataset.
  • Represents robot actions as tokens (discretized into 256 bins, replacing unused Llama tokens).
  • Standard next-token prediction objective.
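
The action-as-token idea above can be illustrated with a small discretization sketch. It assumes uniform bins between fixed per-dimension bounds; the bounds and the example action are hypothetical, and mapping bin indices onto actual vocabulary token IDs is left out.

```python
def discretize(value, low, high, n_bins=256):
    """Map a continuous value in [low, high] to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # value == high falls into the last bin


def undiscretize(idx, low, high, n_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width


# Hypothetical per-dimension bounds for a normalized action.
low, high = -1.0, 1.0
action = [0.0, -1.0, 1.0, 0.3]
tokens = [discretize(a, low, high) for a in action]
recovered = [undiscretize(t, low, high) for t in tokens]
```

Each recovered value differs from the original by at most half a bin width, which is the precision cost of treating continuous actions as discrete tokens.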

Architecture & Approach

  • End-to-end fine-tuning of VLM to generate robot actions as tokens.
  • Differs from modular methods (e.g., Octo) that stitch separate encoders/decoders.
  • Vision features are obtained by encoding the same input image with both SigLIP and DINOv2, then channel-wise concatenated and passed through an MLP projector. This preserves SigLIP’s semantic alignment with language and DINOv2's spatial reasoning, giving the VLM richer multimodal context for manipulation tasks.
  • Uses Prismatic VLM backbone with multi-resolution features (spatial reasoning + semantics).

Performance

  • Outperforms closed RT-2-X (55B) by +16.5% task success with 7× fewer parameters.
  • Beats Diffusion Policy (from-scratch imitation learning) by +20.4% on multi-task language-grounded settings.
  • Demonstrates robust behaviors (distractor resistance, error recovery).

Efficiency

  • Introduces parameter-efficient fine-tuning:
    • LoRA updates only 1.4% of parameters yet matches full fine-tuning.
    • Can fine-tune on a single A100 GPU in ~10–15 hours (8× compute reduction).
  • Quantization:
    • 4-bit inference matches bfloat16 accuracy while halving memory footprint.
    • Runs at 3Hz on consumer GPUs (e.g., A5000, 16GB).
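
To give a feel for why LoRA trains so few parameters, here is a minimal sketch of a low-rank update on a single weight matrix. The layer size, rank, and scale are hypothetical values chosen for illustration; the resulting fraction is for this one made-up matrix and is not meant to reproduce the 1.4% figure reported for the whole model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def lora_forward(W, A, B, x, scale=1.0):
    """Compute (W + scale * B @ A) @ x without forming the full update matrix."""
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]   # rank-sized vector
    bax = [sum(b * v for b, v in zip(row, ax)) for row in B]   # d_out-sized vector
    return [b0 + scale * u for b0, u in zip(base, bax)]


# Hypothetical layer size and rank.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=32)
fraction = lora / full

# Tiny worked example: frozen identity W plus a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
y = lora_forward(W, A, B, [2.0, 3.0])
```

Only `A` and `B` would receive gradients during fine-tuning; `W` stays frozen, which is where the memory and compute savings come from.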

Evaluations

  • Tested across 29 tasks and multiple robots (WidowX, Google robot, Franka).
  • Strong generalization on:
    • Visual (unseen backgrounds/distractors).
    • Motion (new object positions/orientations).
    • Physical (new object shapes/sizes).
    • Semantic (unseen tasks, instructions).
  • First generalist open-source VLA achieving ≥50% success rate across all tested tasks.

Design Insights

  • Fine-tuning the vision encoder (vs. freezing) crucial for robotic control.
  • Higher image resolution (384px vs. 224px) adds 3× compute without performance gains.
  • Training required 27 epochs, far more than typical VLM runs, to surpass 95% action token accuracy.

Limitations & Future Work

  • Supports only single-image observations (no proprioception, no history).
  • Inference throughput (~6Hz on RTX 4090) insufficient for high-frequency control (e.g., ALOHA at 50Hz).
  • Success rates remain below 90% in challenging tasks.
  • Open questions:
    • Impact of base VLM size on performance.
    • Benefits of co-training with Internet-scale data.
    • Best visual features for VLAs.

Contributions

  1. First open-source generalist VLA with strong performance.
  2. Scalable end-to-end training pipeline (action-as-token).
  3. Demonstrates LoRA + quantization for consumer-grade GPU adaptation.
  4. Provides code, checkpoints, and data curation recipes to support future research.

Ref

  • Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E. P., Sanketi, P. R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., & Finn, C. (2025). OpenVLA: An Open-Source Vision-Language-Action Model. Proceedings of The 8th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v270/kim25c.html

Data Visualization Decision Tree

· about 3 min read

Color Legend

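| Icon | Category | Description | Example |
| --- | --- | --- | --- |
| 🟡 | Distribution | When you want to show a distribution | Histogram, Density plot |
| ⚫ | Correlation | When you want to show a correlation | Scatterplot, Correlogram |
| 🟢 | Ranking | When you want to show a ranking | Bar chart, Lollipop chart |
| 🔴 | Part of a whole | When you want to show a part of the whole | Pie chart, Treemap |
| 🔵 | Evolution | When you want to show change over time | Line chart, Area chart |
| 🟣 | Maps | When you want to show spatial information on a map | Choropleth map, Bubble Map |
| 🟤 | Flow | When you want to show flows (flow diagrams, movement paths, etc.) | Flow map, Sankey-like |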

Categoric

  • One Variable
    • ⚫ Waffle
    • 🟢 Bar Plot
    • 🟢 Lollipop
    • 🟢 Word Cloud
    • 🔴 Circular Packing
    • 🔴 Doughnut
    • 🔴 Pie
    • 🔴 Treemap
  • Two or More Variables
    • Two Independent Lists
      • 🔴 Venn Diagram
    • Nested
      • 🟢 Bar Plot
      • 🔴 Circular Packing
      • 🔴 Dendrogram
      • 🔴 Sunburst
      • 🔴 Treemap
    • Subgroup
      • ⚫ Grouped Scatter
      • ⚫ Heatmap
      • 🟢 Lollipop
      • 🟢 Parallel Plot
      • 🟢 Spider Plot
      • 🔴 Grouped Bar Plot
      • 🟤 Sankey Diagram
    • Adjacency
      • 🟤 Arc
      • 🟤 Chord
      • 🟤 Network
      • 🟤 Sankey
      • ⚫ Heatmap

Relational

  • Network
    • ⚫ Heatmap
    • 🟢 Hive
    • 🟤 Arc
    • 🟤 Chord
    • 🟤 Network
    • 🟤 Sankey
  • Nested
    • No Value
      • 🔴 Circular Packing
      • 🔴 Dendrogram
      • 🔴 Sunburst
      • 🔴 Treemap
      • 🟤 Sankey
    • Value for Leaf
      • 🔴 Circular Packing
      • 🔴 Dendrogram
      • 🔴 Sunburst
      • 🔴 Treemap
      • 🟤 Sankey
    • Value for Edges
      • 🔴 Dendrogram
      • 🟤 Chord
      • 🟤 Sankey
    • Value for Connection
      • Edge Bundling

Map

  • 🟣 Bubble Map
  • 🟣 Choropleth
  • 🟣 Connected Map
  • 🟣 Map
  • 🟣 Map Hexbin

Time Series

  • One Series
    • 🟡 Box Plot
    • 🟡 Violin
    • 🟡 Ridge Line
    • 🔵 Area
    • 🔵 Line Plot
    • 🟢 Bar Plot
    • 🟢 Lollipop
  • Several Series
    • 🟡 Box Plot
    • 🟡 Violin
    • 🟡 Ridge Line
    • ⚫ Heatmap
    • 🔵 Line Plot
    • 🔵 Stacked Area
    • 🔵 Stream Graph

Categoric and Numeric

  • One Numeric + One Categoric
    • One Observation, per Group
      • ⚫ Waffle
      • 🟢 Bar Plot
      • 🟢 Lollipop
      • 🟢 Word Cloud
      • 🔴 Circular Packing
      • 🔴 Doughnut
      • 🔴 Pie
      • 🔴 Treemap
    • Several Observations, per Group
      • 🟡 Box Plot
      • 🟡 Violin
      • 🟡 Ridge Line
      • 🟡 Density
      • 🟡 Histogram
  • One Category, Several Numeric
    • No Order
      • 🟡 Box Plot
      • 🟡 Violin
      • ⚫ Grouped Scatter
      • ⚫ 2D Density
      • ⚫ PCA
      • ⚫ Correlogram
    • A Numeric is Ordered
      • ⚫ Connected Scatter
      • 🔵 Area
      • 🔵 Line Plot
      • 🔵 Stacked Area
      • 🔵 Stream Graph
    • One Value Per Group
      • ⚫ Grouped Scatter
      • ⚫ Heatmap
      • 🟢 Lollipop
      • 🟢 Parallel Plot
      • 🟢 Spider Plot
      • 🔴 Grouped Bar Plot
      • 🟤 Sankey Diagram
  • Several Categories, One Numeric
    • Subgroup
      • One Observation, per Group
        • ⚫ Grouped Scatter
        • ⚫ Heatmap
        • 🟢 Lollipop
        • 🟢 Parallel Plot
        • 🟢 Spider Plot
        • 🔴 Grouped Bar Plot
        • 🟤 Sankey Diagram
      • Several Observations, per Group
        • 🟡 Box Plot
        • 🟡 Violin
    • Nested
      • One Observation, per Group
        • 🟢 Bar Plot
        • 🔴 Circular Packing
        • 🔴 Dendrogram
        • 🔴 Sunburst
        • 🔴 Treemap
      • Several Observations, per Group
        • 🟡 Box Plot
        • 🟡 Violin
    • Adjacency
      • ⚫ Heatmap
      • 🟤 Arc
      • 🟤 Chord
      • 🟤 Network
      • 🟤 Sankey

Numeric

  • One Numeric Variable
    • 🟡 Density
    • 🟡 Histogram
  • Two Numeric Variables
    • Not Ordered
      • Few Points
        • 🟡 Box Plot
        • 🟡 Histogram
        • ⚫ Scatter Plot
      • Many Points
        • 🟡 Density
        • 🟡 Violin
        • ⚫ 2D Density
        • 🔵 Marginal Distribution
    • Ordered
      • ⚫ Connected Scatter
      • 🔵 Area Plot
      • 🔵 Line Plot
  • Three Numeric Variables
    • Not Ordered
      • 🟡 Box Plot
      • 🟡 Violin
      • ⚫ Bubble Plot
      • ⚫ 3d Scatter or Surface
    • Ordered
      • 🔵 Area
      • 🔵 Line Plot
      • 🔵 Stacked Area
      • 🔵 Stream Graph
  • Several Numeric Variables
    • Ordered
      • 🔵 Area
      • 🔵 Line Plot
      • 🔵 Stacked Area
      • 🔵 Stream Graph
    • Not Ordered
      • 🟡 Box Plot
      • 🟡 Ridge Line
      • 🟡 Violin
      • ⚫ Correlogram
      • ⚫ Heatmap
      • ⚫ PCA
      • 🔴 Dendrogram

Vocabulary for AI +006

· about 2 min read

Vocabulary & Expressions

| Term/Expression | Definition | Simpler Paraphrase | Meaning (Korean) |
| --- | --- | --- | --- |
| prevalence | The state of being widespread or common | Commonness | 유행, 널리 퍼짐 |
| instantiation | The act of creating a specific instance of something | Creation of a specific example | 구체적인 값의 생성 |
| triviality | The quality of being trivial or unimportant | Unimportance | 사소함, 하찮음 |
| intermediary | A person or thing that acts as a link between two others | Middleman | 중개자, 매개체 |
| dreaded | Regarded with great fear or apprehension | Feared | 두려운, 걱정되는 |
| i.i.d. | Independent and identically distributed | Same distribution, no dependence | 독립적이고 동일한 분포 |
| a posteriori | Relating to knowledge gained through experience or empirical evidence | Based on observation | 경험적, 관찰에 기초한 |
| posterior | Relating to the back or rear | Back | 뒤쪽의, 후방의 |
| resemblance | The state of resembling or being alike | Similarity | 유사성, 닮음 |
| stipulate | To demand or specify a requirement | Specify | 규정하다, 명시하다 |
| rectify | To correct or make right | Correct | 수정하다, 바로잡다 |
| schematic | Relating to a diagram or representation | Diagrammatic | 도식적인, 다이어그램의 |
| proposition | A statement or assertion that expresses a judgment or opinion | Proposal | 제안, 명제 |
| cavity | A hollow place in a tooth caused by decay | Tooth decay | 충치 |
| tautological | Relating to or involving tautology (the saying of the same thing twice in different words) | Redundant | 동의어 반복의, 중복적인 |
| retrospectively | Looking back on or dealing with past events or situations | Looking back | 회고적으로, 과거를 돌아보며 |
| perturbations | Disturbances or deviations from a normal state | Disturbances | 교란, 변동 |
| deformable | Capable of being changed in shape or form | Changeable | 변형 가능한 |
| consolidation | The process of combining multiple elements into a single, more effective whole | Integration | 통합 |
| oscillation | Fluctuation or variation in a state or condition | Fluctuation | 진동, 변동 |
| homogeneous | Of the same kind; alike | Uniform | 동질의, 균일한 |
| nonstationary | Not stationary; changing over time | Changing | 비정상적인, 시간에 따라 변하는 |
| whereupon | Immediately after which | After which | 그 후에, 그 다음에 |
| magnitude | The great size or extent of something | Size | 크기, 규모 |
| maneuver | A movement or series of moves requiring skill and care | Move | 조작, 움직임 |

Developing ML Systems

· about 5 min read

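Problem formulation

  • The first step is to figure out what problem you want to solve.
    1. “What problem do I want to solve for my users?” → Define it concretely, not vaguely.
    2. “What part of that problem can be solved with machine learning?” → e.g., learning a function that maps photos to labels.
  • To make this concrete, you need to specify a loss function for the ML component.
  • When you break the problem down, some parts may be solvable with traditional software engineering, and only some may need ML.
  • Learning types lie on a continuum that spans supervised, unsupervised, reinforcement, and semisupervised learning.
    • Semisupervised learning: uses a few labeled examples to extract more information from unlabeled data.
    • Weakly supervised learning: uses inaccurate or noisy labels.
  • Conclusion: noise and a lack of labels create a continuum between supervised and unsupervised learning.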

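Data collection & management

  • Data can be produced by hand, crowdsourced, or collected from user behavior.
  • When data is scarce, use transfer learning.
  • Consider privacy review and consent, fairness, and options such as federated learning.
  • Data provenance: track each field's definition, range of values, who produces it, whether its feed was ever interrupted, and how its definition has changed over time → a reliable pipeline matters more than the algorithm.
  • Always ask: “Is this data suitable for my problem? Does it capture enough of both the inputs and the outputs?”
  • Use a learning curve to check whether more data would help or whether learning has plateaued.
  • Be defensive: handle input errors, missing values, adversarial users, inconsistent spellings, and so on.
  • Data augmentation (rotation, translation, adding noise, etc.) improves model robustness.
  • Mitigate unbalanced classes with undersampling, oversampling, SMOTE/ADASYN, boosting, and similar techniques.
  • Reduce the influence of outliers with, e.g., a log transform; tree-based models are comparatively robust to them.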

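Feature engineering

  • Quantization: force a continuous value into discrete bins.
  • One-hot encoding: convert a categorical attribute into multiple Boolean attributes.
  • Add new features based on domain knowledge (e.g., date → whether it is a weekend or holiday).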
  • “At the end of the day, some ML projects succeed and some fail… the most important factor is the features used.” (Pedro Domingos)
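
The first two transformations above (quantization and one-hot encoding) can be sketched in a few lines. The bin edges, ages, and color categories are hypothetical examples.

```python
from bisect import bisect_right


def quantize(value, edges):
    """Return the index of the bin that value falls into, given sorted bin edges."""
    return bisect_right(edges, value)


def one_hot(value, categories):
    """Encode a categorical value as a list of Booleans, one per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [value == c for c in categories]


# Hypothetical example: ages bucketed at 18 and 65, and a color attribute.
age_bins = [quantize(a, edges=[18, 65]) for a in (5, 18, 40, 70)]
color = one_hot("green", ["red", "green", "blue"])
```

Raising on an unknown category is one defensive choice; mapping unseen values to a dedicated "other" slot would be a reasonable alternative.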

Exploratory data analysis (EDA) & visualization

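  • Goal: understand the data, not make predictions or test hypotheses.
  • Use histograms and scatter plots to check distributions, missing values, errors, and outliers.
  • Clustering → visualize a prototype for each cluster and detect outliers (“a cat vs. a cat in a lion costume”).
  • Dimensionality reduction (e.g., t-SNE) to visualize high-dimensional data in 2D/3D.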

Model selection & training

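  • Once the data is cleaned up, it is time to build the model.
  • Random forests → when there are many categorical features and some may be irrelevant.
  • Nonparametric methods → when you have a lot of data, little prior knowledge, and do not want to worry about feature selection.
  • Logistic regression → when the data is linearly separable (possibly after feature engineering).
  • SVM → when the dataset is small and high-dimensional.
  • Deep neural nets → pattern recognition (images, speech).
  • Tune hyperparameters through a combination of experience and search.
  • Overusing the validation data risks overfitting to it → keep several validation sets.
  • Performance evaluation: ROC curve, AUC, confusion matrix.
  • What matters most is making the idea–experiment–validation cycle fast.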
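
As a small illustration of the evaluation metrics mentioned above, the sketch below computes a binary confusion matrix and AUC by hand. The labels, scores, and the 0.5 decision threshold are made up; AUC is computed via its pairwise-ranking interpretation rather than by tracing the ROC curve.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
area = auc(y_true, scores)
```

The confusion matrix depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.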

Trust, interpretability, explainability

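  • Good metric scores alone are not enough for trust → regulators, the press, and users also want trustworthiness.
  • Accountability: when errors occur, there must be a responsible party and an appeals process.
  • Interpretability: understand the model by inspecting it directly (decision trees, linear regression).
    • Key question: “If I change x, how will the output change?”
  • Explainability: a black-box model plus a separate module that explains it (e.g., LIME).
  • A simple explanation can create false confidence. → Testing and real-world performance build more trust.
  • Analogy: “an experimental aircraft that only comes with an explanation of why it is safe vs. an aircraft that has completed 100 flights without incident.”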

Operation, monitoring, maintenance

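  • In operation, the long tail of inputs appears → unanticipated inputs keep arriving. → Real-time monitoring and human raters are needed.
  • Nonstationarity: the world and user behavior change → trade-off between fresh data and a stable model.
  • Freshness requirements differ: some problems need a new model every day or every hour, others can keep the same model for months.
  • Automate deployment → small changes are approved automatically, large changes go through review.
  • Online vs. offline model: incrementally update the existing model vs. retrain from scratch each time.
  • The data itself can change (spam email → spam text messages, voice, video, etc.).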

Checklist

Tests for Features and Data

  • Feature expectations are captured in a schema.
  • All features are beneficial.
  • No feature’s cost is too much.
  • Features adhere to meta-level requirements.
  • The data pipeline has appropriate privacy controls.
  • New features can be added quickly.
  • All input feature code is tested.

Tests for Model Development

  • Every model specification undergoes a code review.
  • Every model is checked in to a repository.
  • Offline proxy metrics correlate with actual metrics.
  • All hyperparameters have been tuned.
  • The impact of model staleness is known.
  • A simpler model is not better.
  • Model quality is sufficient on all important data slices.
  • The model has been tested for considerations of inclusion.

Tests for Machine Learning Infrastructure

  • Training is reproducible.
  • Model specification code is unit tested.
  • The full ML pipeline is integration tested.
  • Model quality is validated before attempting to serve it.
  • The model allows debugging by observing the step-by-step computation of training or inference on a single example.
  • Models are tested via a canary process before they enter production serving environments.
  • Models can be quickly and safely rolled back to a previous serving version.

Monitoring Tests for Machine Learning

  • Dependency changes result in notification.
  • Data invariants hold in training and serving inputs.
  • Training and serving features compute the same values.
  • Models are not too stale.
  • The model is numerically stable.
  • The model has not experienced regressions in training speed, serving latency, throughput, or RAM usage.
  • The model has not experienced a regression in prediction quality on served data.

Ref

  • Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2016). What’s your ML test score? A rubric for ML production systems. NIPS Workshop on Reliable Machine Learning in the Wild.

Nonparametric Models

· about 2 min read

Nearest-neighbor Models

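  • For a query point $x_q$, find the $k$ nearest neighbors and use them for classification or regression.
    • Classification: majority vote
    • Regression: mean, median, or a local linear regression
  • Distance metric: Minkowski distance
    • $L_p(x_j, x_q) = \left( \sum_i |x_{j,i} - x_{q,i}|^p \right)^{1/p}$
    • $p=2$ → Euclidean distance
    • $p=1$ → Manhattan distance
    • Boolean attributes → Hamming distance
    • Accounting for covariance → Mahalanobis distance
  • Curse of dimensionality:
    • Average neighborhood volume: $\ell^n = k/N \;\Rightarrow\; \ell = (k/N)^{1/n}$, where $\ell$ is the side length of the neighborhood, $N$ the number of points, and $n$ the number of dimensions
    • As $n$ grows, $\ell$ grows (toward 1), so the neighbors become “far away”.
    • In high-dimensional spaces most points lie near the boundary (the outer shell).
    • Low dimensions: interpolation is possible
    • High dimensions: more extrapolation, so generalization is hard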
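
A minimal sketch of the ideas above: Minkowski distance, k-nearest-neighbor classification by majority vote, and the neighborhood side length $\ell = (k/N)^{1/n}$. The example points and labels are made up, and the brute-force sort is used for clarity rather than speed.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """L_p distance between two equal-length points."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


def knn_classify(examples, query, k=3, p=2):
    """examples: list of (point, label); majority vote among the k nearest."""
    nearest = sorted(examples, key=lambda ex: minkowski(ex[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def neighborhood_side(k, N, n):
    """Side length of a hypercube expected to hold k of N uniform points in n dims."""
    return (k / N) ** (1.0 / n)


# Hypothetical two-class toy data.
examples = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
            ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label = knn_classify(examples, query=(0.5, 0.5), k=3)

# The same k and N need a much larger neighborhood as the dimension grows.
sides = [neighborhood_side(10, 1_000_000, n) for n in (2, 17, 200)]
```

The `sides` list grows toward 1 as the dimension increases, which is the curse of dimensionality in numeric form.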

k-d trees

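  • A binary tree built by splitting the data along one dimension at a time.
  • At each node, split on the median $m$ of a chosen dimension $i$: examples with $x_i \le m$ go to the left, the rest to the right.
  • Search: descend the branch on the query point's side to find candidates, but if the query is close to the split boundary, the subtree on the other side must be checked too.
  • Efficiency condition: there must be many more examples than dimensions, at least $2^n$.
  • Practical range:
    • Up to about 10 dimensions with thousands of examples
    • Up to about 20 dimensions with millions of examples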
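
The build and search procedure above can be sketched as follows. This is a simplified version under a few assumptions: the split dimension simply cycles with depth, the tree is stored as nested dicts, and only the single nearest neighbor is returned. The points are made up.

```python
def build_kdtree(points, depth=0):
    """Recursively split on the median of one dimension, cycling through dims."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(node, query, best=None):
    """Return the stored point closest to query (squared Euclidean distance)."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], query) < sq_dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff <= 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Check the other side only if the split plane is closer than the best match.
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
found = nearest(tree, (9, 2))
```

The pruning test on `diff ** 2` is what makes the search cheaper than brute force in low dimensions; in high dimensions the test rarely prunes, which matches the efficiency condition above.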

Support Vector Machines (SVM)

  • Finds the maximum margin separator.
  • Goal: minimize expected generalization loss rather than empirical loss
  • Decision boundary: $\{x : w \cdot x + b = 0\}$
  • Training is formulated as a quadratic programming (QP) optimization problem.
    • Dual form:
      $\arg\max_\alpha \sum_j \alpha_j - \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k (x_j \cdot x_k)$
    • Constraints: $\alpha_j \ge 0,\; \sum_j \alpha_j y_j = 0$
  • At the optimum most $\alpha_j = 0$; only the points closest to the boundary (the support vectors) have $\alpha_j > 0$.
  • Prediction function:
    $h(x) = \text{sign}\Big(\sum_j \alpha_j y_j (x \cdot x_j) - b \Big)$
  • Advantages:
    • Efficient, since only the support vectors need to be kept
    • Nonparametric flexibility + parametric stability (resistance to overfitting)

The Kernel Trick

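  • Kernel trick: replace the dot product with a kernel function instead of explicitly computing the high-dimensional feature space $F(x)$.
    • $K(x,z) = F(x)\cdot F(z)$
  • Common kernel functions:
    • Polynomial kernel: $K(x,z) = (1 + x \cdot z)^d$
    • Gaussian kernel (RBF): $K(x,z) = e^{-\gamma \|x-z\|^2}$
  • Soft margin classifier: allows some misclassification, assigning each misclassified point a penalty proportional to the distance needed to move it to the correct side.
  • The kernel method can be applied to any other algorithm that depends only on dot products.
  • Mercer's theorem: any “reasonable” kernel function corresponds to a dot product in some feature space.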
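
The identity $K(x,z) = F(x)\cdot F(z)$ can be checked numerically for the degree-2 polynomial kernel on 2-D inputs. The explicit feature map below is the standard one for this particular case; the input vectors and the `gamma` default are arbitrary.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, z, d=2):
    return (1.0 + dot(x, z)) ** d


def rbf_kernel(x, z, gamma=1.0):
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def explicit_features(x):
    """Feature map F for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


x, z = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(x, z)                                  # no feature map needed
via_features = dot(explicit_features(x), explicit_features(z))  # same value, more work
```

For 2-D inputs the saving is small, but the feature space grows quickly with input dimension and degree, and for the Gaussian kernel it is infinite-dimensional, so the kernel form is the only practical option.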

Logistic regression

· about 3 min read

Univariate Linear Regression

  • With a single input $x$, the hypothesis is $h(x) = w_1 x + w_0$
  • Loss function: squared error
  • Find the optimal $(w_0, w_1)$ with gradient descent
    • $w_0 \leftarrow w_0 + \alpha (y - h(x))$
    • $w_1 \leftarrow w_1 + \alpha (y - h(x)) \cdot x$
  • The loss function is convex → a global minimum is guaranteed
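
The two update rules above translate almost directly into code. The sketch below applies them one example at a time on noise-free toy data; the learning rate, epoch count, and data are arbitrary choices for illustration.

```python
def fit_line(xs, ys, alpha=0.01, epochs=2000):
    """Fit h(x) = w1*x + w0 with the per-example update rule."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - (w1 * x + w0)
            w0 += alpha * err
            w1 += alpha * err * x
    return w0, w1


# Noise-free toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w0, w1 = fit_line(xs, ys)
```

Because the squared-error loss is convex, the result does not depend on the starting weights, only on running long enough with a small enough learning rate.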

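Batch vs. Stochastic Gradient Descent (SGD)

  • Batch gradient descent: uses all the data for each update → accurate but slow, inefficient on large datasets
  • Stochastic gradient descent (SGD): updates from a single random example (or a small minibatch) → fast and efficient
  • Minibatch: balances speed and stability
  • A decaying schedule for the learning rate $\alpha$ → guarantees convergence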

Multivariable Linear Regression

  • With an $n$-dimensional input, the hypothesis is $h(x) = w \cdot x = \sum_i w_i x_i$
  • Normal equation: $w^* = (X^TX)^{-1}X^Ty$
  • $(X^TX)^{-1}X^T$ is the pseudoinverse
  • In high dimensions the risk of overfitting is large, so regularization is needed

Regularization

  • Cost function: $Cost(h) = Loss(h) + \lambda \cdot Complexity(h)$
  • Complexity function: $Complexity(h_w) = \sum_i |w_i|^q$
  • $q = 1$ → L1 regularization (sparse model, many $w_i = 0$)
  • $q = 2$ → L2 regularization (minimizes the sum of squared weights)
  • L1 → not rotationally invariant (suited to cases where the axes are meaningful)
  • L2 → rotationally invariant (suited to cases where the axes are arbitrary)

Perceptron Learning Rule

  • Linear function + hard threshold → linear classifier
  • Weight update: $w_i \leftarrow w_i + \alpha (y - h(x)) \cdot x_i$
  • Linearly separable data → converges to a perfect separator
  • Non-separable data → no convergence guarantee; a decaying $\alpha$ schedule is needed
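
A small sketch of the perceptron rule on a linearly separable problem (logical AND). The convention of prepending a constant 1.0 as the bias input, the threshold at zero, and the learning rate are assumptions made for this example.

```python
def predict(w, x):
    """Hard-threshold linear classifier: 1 if w . x >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0


def perceptron_train(data, alpha=1.0, max_epochs=100):
    """data: list of (features, label in {0, 1}); features start with a 1.0 bias input."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            h = predict(w, x)
            if h != y:
                mistakes += 1
                w = [wi + alpha * (y - h) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return w


# Logical AND is linearly separable; each input is (bias, x1, x2).
and_data = [((1.0, 0.0, 0.0), 0), ((1.0, 0.0, 1.0), 0),
            ((1.0, 1.0, 0.0), 0), ((1.0, 1.0, 1.0), 1)]
w = perceptron_train(and_data)
```

On a non-separable problem such as XOR the same loop would simply run until `max_epochs` without ever completing an error-free pass.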

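Logistic Regression

  • Problems with the hard threshold
    • Discontinuous and non-differentiable → unstable learning
    • Always outputs a confident 0 or 1 → poorly suited to examples near the boundary
  • Solution: the logistic function $g(z) = \frac{1}{1 + e^{-z}}$
  • Hypothesis: $h_w(x) = g(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}$
  • Output $\in (0,1)$ → can be interpreted as a probability, forming a soft boundary
  • 0.5 at the center of the boundary, approaching 0 or 1 farther away from it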

Derivative of the Logistic Function

  • Logistic function: $g(z) = \frac{1}{1+e^{-z}}$
  • Derivative: $g'(z) = \frac{e^{-z}}{(1+e^{-z})^2}$
  • $1 - g(z) = \frac{e^{-z}}{1+e^{-z}}$
  • Therefore $g(z)(1-g(z)) = \frac{e^{-z}}{(1+e^{-z})^2}$
  • Conclusion: $g'(z) = g(z)(1-g(z))$
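
The identity $g'(z) = g(z)(1-g(z))$ is easy to sanity-check numerically. The sketch below compares the closed form against a central finite difference at a few arbitrary points; the step size `eps` is an arbitrary small value.

```python
import math


def g(z):
    return 1.0 / (1.0 + math.exp(-z))


def g_prime_closed_form(z):
    return g(z) * (1.0 - g(z))


def g_prime_numeric(z, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (g(z + eps) - g(z - eps)) / (2.0 * eps)


max_gap = max(abs(g_prime_closed_form(z) - g_prime_numeric(z))
              for z in (-3.0, -1.0, 0.0, 0.5, 2.0))
```

The derivative peaks at $z = 0$ with value 0.25 and shrinks toward 0 on both sides, which is why updates become small once the output saturates.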

Deriving the Logistic Regression Weight Update

  • Loss function: $Loss(w) = (y - h_w(x))^2$
  • $\frac{\partial}{\partial w_i} Loss(w) = \frac{\partial}{\partial w_i}(y - h_w(x))^2$
  • $= 2(y - h_w(x)) \cdot \frac{\partial}{\partial w_i}(y - h_w(x))$
  • $= -2(y - h_w(x)) \cdot \frac{\partial}{\partial w_i} h_w(x)$
  • Since $h_w(x) = g(w \cdot x)$, we have $\frac{\partial}{\partial w_i} h_w(x) = g'(w \cdot x) \cdot x_i$
  • $g'(w \cdot x) = h_w(x)(1-h_w(x))$
  • Result: $\frac{\partial}{\partial w_i} Loss(w) = -2(y - h_w(x)) \cdot h_w(x)(1-h_w(x)) \cdot x_i$
  • Gradient descent update:
    $w_i \leftarrow w_i - \alpha \cdot \frac{\partial}{\partial w_i} Loss(w)$
  • Therefore, absorbing the constant factor 2 into $\alpha$:
    $w_i \leftarrow w_i + \alpha (y - h_w(x)) \cdot h_w(x)(1-h_w(x)) \cdot x_i$
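
The final update can be tried on a toy problem. This is a minimal sketch under stated assumptions: the four 1-D data points, the learning rate, the epoch count, and the function names are all chosen for illustration, and the squared-error loss is the one used in the derivation above.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(w, x):
    """Logistic hypothesis h_w(x) = g(w . x)."""
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))

def update(w, x, y, alpha):
    """One step of w_i <- w_i + alpha * (y - h) * h * (1 - h) * x_i."""
    p = h(w, x)
    step = alpha * (y - p) * p * (1.0 - p)
    return [wi + step * xi for wi, xi in zip(w, x)]

def total_loss(w, data):
    """Sum of squared errors over the dataset."""
    return sum((y - h(w, x)) ** 2 for x, y in data)

# Hypothetical toy data; x[0] = 1 is the bias input, labels flip between x[1] = 1 and 2.
data = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 2.0], 1), ([1.0, 3.0], 1)]
w = [0.0, 0.0]
loss_before = total_loss(w, data)
for _ in range(2000):
    for x, y in data:
        w = update(w, x, y, alpha=0.5)
loss_after = total_loss(w, data)
```

With zero initial weights every prediction starts at 0.5, so the initial loss is 1.0, and repeated updates push the predictions toward the labels on either side of the boundary.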

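Conclusion

  • Progression: linear regression → gradient descent → multivariable extension → regularization → perceptron → logistic regression
  • L1 vs. L2 regularization
    • L1: sparse models (axes are meaningful)
    • L2: rotation-invariant (axes are arbitrary)
  • Perceptron: works perfectly only when the data are linearly separable
  • Logistic regression: provides a soft boundary → probabilistic predictions and better behavior on real-world data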

Octo Review

· About 4 min read

Octo

  • Octo is a transformer-based policy with modular tokenizers (language via T5, images via CNN patches), blockwise masking, and readout tokens, trained on 800k multi-robot trajectories.
  • Actions are generated through a diffusion head that produces continuous, multimodal, chunked predictions, enabling precise control and broad generalization.
  • It achieves state-of-the-art zero-shot performance across 7 robots and allows efficient finetuning to new sensors and action spaces, while being fully open-source.

| Category | Simple Analogy | Actual Tokenization |
| --- | --- | --- |
| Language | [Sentence] | [l₁, l₂, l₃, …] → multiple tokens from a tokenized sentence |
| Goal Image | [Goal] | [g₁, g₂, g₃, …] → image split into patches |
| Observation (time t) | [Observation] | [oₜ¹, oₜ², oₜ³, …] → camera frames/sensors tokenized into patches |
| Readout Token | [ ] (empty slot) | [TR,t] → one per timestep, reserved for predicting actions |

Time t-1: [l] [g] [o_{t-1}] [TR,t-1]
Time t: [l] [g] [o_t] [TR,t]
Time t+1: [l] [g] [o_{t+1}] [TR,t+1]

[TR,t-1], [TR,t], [TR,t+1] ──► Diffusion head ──► [a_t, a_{t+1}, …]

Motivation

  • Traditional robot learning trains policies from scratch on robot/task-specific datasets → costly data collection, narrow generalization.
  • Generalist Robot Policies (GRPs) pretrained on diverse robots/tasks can be finetuned with little in-domain data while generalizing broadly.
  • Real-world deployments face challenges across robot embodiments, sensor setups, action spaces, task specs, and environments.

Prior GRPs & Gaps

  • GRPs aim for low-level visuomotor control across tasks, environments, and robotic systems.
  • Existing models often accept only restricted inputs (e.g., a single camera), lack efficient finetuning to new domains, and, importantly, the largest models are not publicly available.

Contribution (What is Octo?)

  • Octo: a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
  • Accepts language instructions or goal images, and can be finetuned within hours on consumer GPUs to new sensors and action spaces.
  • First GRP to support effective finetuning to new observations and actions and to be fully open-source (training pipeline, checkpoints, data).
  • Novelty lies in combining: transformer backbone + language/goal image conditioning + diffusion head for expressive action distributions.

Architecture

  • Input tokenizers:
    • Language via pretrained T5-base
    • Images via shallow CNN → patch tokens
  • Transformer backbone: processes unified token sequence.
  • Blockwise masking + Readout tokens:
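    • Tokens for missing modalities (e.g., no language instruction in a dataset) are masked out
    • Readout tokens attend to the task tokens and to past and current observations, but no task or observation token attends to a readout token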
  • Diffusion action head: predicts continuous, multimodal, chunked actions.
  • Modularity: new sensors/outputs can be added by only training lightweight encoders or heads; pretrained backbone remains unchanged.
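
The masking rules above can be sketched as a boolean attention mask. This is a simplified illustration, not Octo's actual implementation: the `Token` type, the `build_attention_mask` helper, the convention that task tokens carry timestep -1, and the choice to let a readout token attend to itself are all assumptions made for this example.

```python
from typing import NamedTuple

class Token(NamedTuple):
    kind: str             # "task", "obs", or "readout"
    t: int                # timestep; task tokens use -1 so they precede all steps
    present: bool = True  # False when the modality is missing from the sample

def build_attention_mask(tokens):
    """mask[i][j] is True when token i (query) may attend to token j (key)."""
    mask = []
    for i, query in enumerate(tokens):
        row = []
        for j, key in enumerate(tokens):
            if not key.present:
                allowed = False           # missing modalities are masked out
            elif key.kind == "readout":
                allowed = i == j          # nothing else attends to a readout token
            else:
                allowed = key.t <= query.t  # task tokens and non-future observations
            row.append(allowed)
        mask.append(row)
    return mask

tokens = [
    Token("task", -1, present=False),  # language instruction missing
    Token("task", -1),                 # goal image
    Token("obs", 0), Token("readout", 0),
    Token("obs", 1), Token("readout", 1),
]
mask = build_attention_mask(tokens)
```

Because no other token ever reads from a readout token, readout tokens act as passive observers of the sequence, which fits the modularity claim: heads that consume them can be added or swapped without changing what the backbone computes for the task and observation tokens.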

Octo Architecture

Training Data & Objective

  • Mixture of 25 heterogeneous robot datasets: diverse robots, sensors (with/without wrist cams), labels (with/without language).
  • Conditional diffusion decoding predicts continuous, multimodal action distributions.
    • Transformer runs one forward pass; denoising steps are contained in the small diffusion head.
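
To make the "one forward pass, many denoising steps" point concrete, here is a toy sketch using a standard DDPM-style reverse loop. Everything in it is hypothetical: `backbone` stands in for the transformer, `noise_head` is an oracle that already knows the clean action chunk (a trained head would be a small learned network), and the noise schedule is arbitrary. It is not Octo's code.

```python
import math
import random

calls = {"backbone": 0, "head": 0}

def backbone(observation):
    """Hypothetical stand-in for the transformer; returns a readout embedding."""
    calls["backbone"] += 1
    # Toy assumption: the embedding directly stores the clean action chunk.
    return list(observation)

def noise_head(x, embedding, alpha_bar):
    """Hypothetical oracle noise predictor conditioned on the embedding."""
    calls["head"] += 1
    scale = math.sqrt(alpha_bar)
    return [(xi - scale * ei) / math.sqrt(1.0 - alpha_bar)
            for xi, ei in zip(x, embedding)]

def sample_action_chunk(observation, num_steps=20, seed=0):
    rng = random.Random(seed)
    # Linear noise schedule (an arbitrary choice for this sketch).
    betas = [0.05 + 0.45 * k / (num_steps - 1) for k in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    embedding = backbone(observation)              # the expensive pass runs once
    x = [rng.gauss(0.0, 1.0) for _ in embedding]   # start from pure noise
    for k in reversed(range(num_steps)):           # denoising stays inside the head
        eps = noise_head(x, embedding, alpha_bars[k])
        coef = betas[k] / math.sqrt(1.0 - alpha_bars[k])
        x = [(xi - coef * ei) / math.sqrt(alphas[k]) for xi, ei in zip(x, eps)]
        if k > 0:  # no noise is added on the final step
            sigma = math.sqrt(betas[k])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x

target_chunk = [0.2, -0.1, 0.4, 0.0]  # hypothetical 4-value action chunk
actions = sample_action_chunk(target_chunk)
```

After sampling, `calls` records a single backbone call against twenty head calls, which is the cost structure the bullet describes: the iterative part of diffusion only touches the lightweight head.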

Experiments

  • Evaluated on 7 robotic platforms across 4 institutions.
  • Key questions:
    1. Zero-shot multi-robot control?
    2. Do Octo weights improve finetuning vs. scratch or standard pretrained representations?
    3. Which design choices matter for generalist robot policies?

Results

  • Achieves state-of-the-art zero-shot multi-robot control, competitive with RT-1-X and RT-2-X.
  • Provides a versatile policy initialization: significantly outperforms baselines for data-efficient finetuning to new obs/action spaces.

Limitations / Future Work

  • Needs better language conditioning, improved wrist camera support, and data beyond optimal demonstrations.

One-line Takeaway

  • Octo = modular, efficient, open-source GRP:
    A transformer + diffusion policy trained on large-scale multi-robot data that adapts quickly with little in-domain data to new sensors and action spaces, enabling broad generalization.

Ref

  • Mees, O., Ghosh, D., Pertsch, K., Black, K., Walke, H. R., Dasari, S., Hejna, J., Kreiman, T., Xu, C., & Luo, J. (2024). Octo: An open-source generalist robot policy. First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024.