Blog | gracefullight.dev

Octo Review

August 27, 2025 · about 4 min read

Owner

Octo

Octo is a transformer-based policy with modular tokenizers (language via T5, images via CNN patches), blockwise masking, and readout tokens, trained on 800k multi-robot trajectories.
Actions are generated through a diffusion head that produces continuous, multimodal, chunked predictions, enabling precise control and broad generalization.
It achieves state-of-the-art zero-shot performance across 7 robots and allows efficient finetuning to new sensors and action spaces, while being fully open-source.

Category	Simple Analogy	Actual Tokenization
Language	`[Sentence]`	`[l₁, l₂, l₃, …]` → multiple tokens from a tokenized sentence
Goal Image	`[Goal]`	`[g₁, g₂, g₃, …]` → image split into patches
Observation (time t)	`[Observation]`	`[oₜ¹, oₜ², oₜ³, …]` → camera frames/sensors tokenized into patches
Readout Token	`[ ]` (empty slot)	`[TR,t]` → one per timestep, reserved for predicting actions

Time t-1: [l] [g] [o_{t-1}] [TR,t-1]
Time t:   [l] [g] [o_t]     [TR,t]
Time t+1: [l] [g] [o_{t+1}] [TR,t+1]

[TR,t-1], [TR,t], [TR,t+1]  ──►  Diffusion head  ──►  [a_t, a_{t+1}, …]

Motivation

Traditional robot learning trains policies from scratch on robot/task-specific datasets → costly data collection, narrow generalization.
Generalist Robot Policies (GRPs) pretrained on diverse robots/tasks can be finetuned with little in-domain data while generalizing broadly.
Real-world deployments face challenges across robot embodiments, sensor setups, action spaces, task specs, and environments.

Prior GRPs & Gaps

GRPs aim for low-level visuomotor control across tasks, environments, and robotic systems.
Existing models often have restricted inputs (e.g., a single camera), lack efficient finetuning to new domains, and, importantly, the largest models are not publicly available.

Contribution (What is Octo?)

Octo: a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
Accepts language instructions or goal images, and can be finetuned within hours on consumer GPUs to new sensors and action spaces.
First GRP to support effective finetuning to new observations and actions and to be fully open-source (training pipeline, checkpoints, data).
Novelty lies in combining: transformer backbone + language/goal image conditioning + diffusion head for expressive action distributions.

Architecture

Input tokenizers:
- Language via pretrained T5-base
- Images via shallow CNN → patch tokens
Transformer backbone: processes unified token sequence.
Blockwise masking + Readout tokens:
- Nonexistent modalities are masked
- Readout tokens only attend to past observations/tasks, not vice versa
Diffusion action head: predicts continuous, multimodal, chunked actions.
Modularity: new sensors/outputs can be added by only training lightweight encoders or heads; pretrained backbone remains unchanged.
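The blockwise masking rule can be made concrete with a small attention-mask sketch. This is a minimal illustration and not Octo's actual implementation: the function name `build_attention_mask`, the token layout, and the exact visibility rules (task tokens see only task tokens, observation tokens see task tokens plus same-or-earlier observations, readout tokens are visible only to themselves) are assumptions chosen to match the description above.

```python
def build_attention_mask(num_task, num_obs, horizon):
    """Return (tokens, mask); mask[q][k] is True if token q may attend to token k.

    Hypothetical layout: task tokens first, then per timestep
    num_obs observation tokens followed by one readout token.
    """
    tokens = [("task", None)] * num_task
    for t in range(horizon):
        tokens += [("obs", t)] * num_obs
        tokens.append(("readout", t))

    def allowed(q_idx, k_idx):
        q_kind, q_t = tokens[q_idx]
        k_kind, k_t = tokens[k_idx]
        if k_kind == "task":
            return True            # every token can read the task tokens
        if q_kind == "task":
            return False           # task tokens never read observations or readouts
        if k_kind == "obs":
            return k_t <= q_t      # only same-or-earlier observations are visible
        return q_idx == k_idx      # a readout token is visible only to itself

    size = len(tokens)
    mask = [[allowed(q, k) for k in range(size)] for q in range(size)]
    return tokens, mask


tokens, mask = build_attention_mask(num_task=1, num_obs=1, horizon=2)
```

Because nothing attends to a readout token, adding or removing readouts leaves the embeddings of the other tokens untouched, which is what makes swapping action heads cheap.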

Octo Architecture

Training Data & Objective

Mixture of 25 heterogeneous robot datasets: diverse robots, sensors (with/without wrist cams), labels (with/without language).
Conditional diffusion decoding predicts continuous, multimodal action distributions.
- Transformer runs one forward pass; denoising steps are contained in the small diffusion head.

Experiments

Evaluated on 7 robotic platforms across 4 institutions.
Key questions:
1. Zero-shot multi-robot control?
2. Do Octo weights improve finetuning vs. scratch or standard pretrained representations?
3. Which design choices matter for generalist robot policies?

Results

Achieves state-of-the-art zero-shot multi-robot control, competitive with RT-1-X and RT-2-X.
Provides a versatile policy initialization: significantly outperforms baselines for data-efficient finetuning to new obs/action spaces.

Limitations / Future Work

Needs better language conditioning, improved wrist camera support, and data beyond optimal demonstrations.

One-line Takeaway

Octo = modular, efficient, open-source GRP:
A transformer + diffusion policy trained on large-scale multi-robot data that adapts quickly with little in-domain data to new sensors and action spaces, enabling broad generalization.

Ref

Mees, O., Ghosh, D., Pertsch, K., Black, K., Walke, H. R., Dasari, S., Hejna, J., Kreiman, T., Xu, C., & Luo, J. (2024). Octo: An open-source generalist robot policy. First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024.

IAI +004

August 26, 2025 · about 21 min read

Eunkwang Shin

Owner

Influencing improvement

The agent's component that is to be improved
The agent's prior knowledge, which influences the model it builds
The feedback available to learn from

Components

Given an intelligent agent that performs some intelligent tasks, any component of the agent program can be improved by learning.

Prior knowledge

Inductive learning (귀납적 학습): learning a general function or rule (possibly incorrect) from specific input-output pairs
- Bottom to top
- Specific to general
Deductive learning (연역적 학습): going from a general rule to a new specific rule that is logically entailed, but is useful because it allows more efficient processing
- Top to bottom
- general to specific

Feedback

feedback on its percept sequence
no feedback on its percept sequence
rewards for taking a sequence of actions based on its percept sequence

-	Supervised Learning	Unsupervised Learning
Training Data	labeled	unlabeled
Computational complexity	simpler	Computationally complex
Accuracy	high	less accurate

Supervised learning

the agent observes some examples of input-output pairs first and then learns a function or a relationship that maps from inputs to output.
Attributes/Features: the inputs are independent variables in the problem domain
Target attribute: the output is the dependent variable which is dependent on the inputs.
Model: the learned function or relationship
The agent learns a model using examples and uses this model to predict the outcomes for new inputs.

Unsupervised learning

The agent collects adequate examples in the problem domain but it does not get any explicit feedback to the examples.
The agent can make sense of the examples through identifying clusters or frequent patterns in the data.
When shown a large number of examples, the agent can learn to identify clusters of similar examples.

Reinforcement learning

the agent learns from a series of reinforcements (rewards or punishments) received for its actions, improving its performance on the task under consideration.
the feedback helps the agent reinforce positive actions and reduce negative actions by adjusting its policy.

Supervised Learning Technique

Decision tree
Random forests
Linear regression
Logistic regression
K nearest neighbours
Support vector machines
Neural networks

Regression problem

to predict a continuous value as the output for a given input
weather temperature: solar radiation, wind direction and speed, geographic location..
how to predict the output value of a new data instance on the basis of observed features from the existing data (historical examples) in the problem domain.
Elements
- Collection of existing or historical data samples which are represented by a set of attributes or independent variables
- The output values of the existing data samples
  - the output variable or attribute must be continuous
Regressor: a function that describes the relationship between the attributes of a data sample and the output.
- takes the values of attributes of a data sample and predicts the output value of this given data sample.

Evaluate a regressor

R-squared / adjusted R-squared
MSE (Mean Squared Error) / RMSE (Root Mean Squared Error)
MAE (Mean Absolute Error)

Examples of regression problems

Predict the fuel price using the Brent crude oil price, financial performance of the oil related companies (cash flow, projects lined up, etc.) and/or geopolitical risks (OPEC announcements, government sanctions, etc.)
Predict the house price of a suburb from the suburb's profile
Predict the blood pressure of a patient based on the patient's health profile
Predict the electricity price using temperature, demand and time.

Classification problem

to predict discrete or categorical value as the output for a given input
Pass or Failed given learning outcomes, student ID, prior learning, attitude, commitment and attendance.
how to put a new data instance into one of predefined categories or classes on the basis of observed features from the existing data in the problem domain.
Elements
- Collection of existing or historical data samples with class labels
- Predefined categories or classes
- Adequate samples in each category or class in the existing or historical data.
Logic-based techniques
- Decision tree
- Learning set of rules
Perceptron-based techniques
- Single-layer perceptron
- Multi-layer perceptron
- RBF network
SVM
Statistical learning techniques
- Naive Bayes classifier
- Bayesian networks
Instance-based learning
- K-nearest neighbor (KNN)

Evaluate a classifier

Confusion matrix
Precision
Recall/Sensitivity
Specificity
F1-Score
Area Under Curve & Receiver Operating Characteristics Curve (AUC-ROC)

Examples of classification problems

banking, healthcare, medical diagnosis, marketing (sentiment analysis), telecommunication, agriculture, security (fraud detection).
e-mails into spam or non-spam class
loan applications into an approved or a rejected class.
patients into having a certain disease or not having that disease groups.
text into positive or negative sentiment.
customers into churn or non-churn classes.

Overfitting

general phenomenon with all types of learning models.
a modeling error that occurs when a function is too closely or exactly fit to a limited set of data points.
more likely as the complexity of models and the number of input attributes increase
less likely as the number of training examples increases.

Decision Tree

if-then statements to define patterns in data
An if-then statement splits the training data into two or more branches based on some values
Best Split: the results of each branch should be as homogeneous as possible, i.e., have the lowest possible impurity.
- Information gain
- Gini index

Implement Decision Tree

At each node, greedily choose the split (a feature and a condition) that leads to the lowest impurity in the resulting child nodes.
For categorical features: each unique value can be a split condition.
For continuous features: midpoints between consecutive sorted unique values are used as split conditions.
For each potential split condition, the algorithm calculates the impurity of the resulting child nodes.
The condition with the lowest resulting impurity becomes the split for that branch.
The process is then repeated recursively for each child node until all leaf nodes are pure, or the stopping criteria are met.
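The split search for a single continuous feature can be sketched as follows. This is a minimal illustration under stated assumptions: Gini impurity is used as the impurity measure, the child impurities are combined as a size-weighted average, and the names `gini` and `best_threshold` are hypothetical helpers rather than part of any library.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try every midpoint between consecutive sorted unique values.

    Returns (threshold, weighted_impurity) for the best binary split
    `value <= threshold` versus `value > threshold`.
    """
    unique = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Size-weighted average impurity of the two child nodes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


threshold, impurity = best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"])
```

A full tree builder would run this search over every feature, keep the overall best split, and recurse on the two resulting subsets until a stopping criterion is met.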

The selection of best split attributes

ID3: employs a top-down, greedy search through the space of possible branches with no backtracking using information gain
C4.5: using information gain ratio
CART: using Gini Index
Gini Index
Chi-Square
Reduction in Variance

Entropy

the fundamental quantity in information theory. It is a measure of the uncertainty of a random variable.

A more homogeneous node with a clear majority class has low impurity and low entropy, while a more mixed distribution of classes has high impurity and high entropy.

Information gain

the decrease in entropy.

The information gain from the attribute test on (split on A) is the expected reduction in entropy.
Information gain computes the difference between entropy before the split and average entropy after the split of the dataset based on given attribute values
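$Entropy(S) = - \sum_{i=1}^{K} p_i \log_2 p_i$
$Gain(S,A) = Entropy(S) - Entropy_{remain}(S,A)$ , where $Entropy_{remain}(S,A) = \sum_{v} \frac{|S_v|}{|S|} Entropy(S_v)$ is the size-weighted average entropy of the subsets $S_v$ produced by splitting on $A$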
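As a rough sketch of these two formulas, the snippet below computes entropy and information gain for a categorical attribute. It assumes the remaining entropy is the size-weighted average entropy of the subsets created by the split; the function names are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
perfect = information_gain(labels, ["a", "a", "b", "b"])   # attribute separates the classes
useless = information_gain(labels, ["a", "b", "a", "b"])   # attribute tells us nothing
```

An attribute that separates the classes perfectly recovers the full entropy of the parent node as gain, while an uninformative attribute yields a gain of zero.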

Gini index

Another impurity measure commonly used for classification tasks in decision trees.
a lower Gini index indicates lower impurity, meaning that the samples in the node predominantly belong to a single class
slightly more computationally efficient than entropy because it involves no logarithm calculations, though the results are usually quite similar.
$Gini(S) = 1 - \sum_{i=1}^{K} p_i^2$

Variance

For regression tree, the target variable is continuous rather than categorical.
use variance as a measure of impurity in regression trees.
lower variance indicates that the data points are closely clustered around the mean
$Var(S) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$

Prediction

For classification tasks: the predicted class label is the majority class among the training samples in the leaf node.
For regression tasks: the predicted value is the mean of the target values of the training samples in the leaf node.

Dealing with Overfitting

Overfitting: a common issue in decision trees, where the model captures noise or outliers in the training data rather than the underlying pattern.
the model performs poorly when applied to new, unseen data.

Setting Stopping Criteria

prevent the tree from becoming overly complex, which may lead to overfitting
applied during the tree construction process
limiting the maximum depth of the tree
setting a minimum number of samples per leaf node
requiring a minimum impurity decrease for a split

Pruning Strategies

applied after the tree has been fully grown
removing branches from the fully grown tree to simplify its structure
ensure that it captures the underlying patterns in the data rather than noise or outliers
Pruned trees perform significantly better than unpruned trees when the data contain a large amount of noise.

Ensemble Methods

combine multiple decision trees to form a more robust and accurate model.
address overfitting by averaging the predictions of the individual trees, reducing variance and improving generalization.
Random Forests
Gradient Boosted Trees

Random Forest

combines multiple weak decision tree models to create a stronger learning model.

two types of randomness are introduced to ensure that the individual decision trees are diverse and less prone to overfitting.
Random sampling of the input data
Bootstrapping:
- involves sampling with replacement from the original dataset (so some instances appear multiple times and others not at all), creating a new dataset.
- each decision tree is trained on a slightly different set of data points, reducing the likelihood of overfitting.
Random selection of features at each split
- At each split in each decision tree, a random subset of features is considered when determining the best split.
- each tree in the ensemble does not rely on the same set of features for making decisions, resulting in a more diverse set of trees.
- By considering only a subset of features at each split, the model is less likely to be influenced by a small number of dominant features, leading to a more balanced and accurate prediction.
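The two sources of randomness can be sketched with the standard library alone. This is an illustrative sketch, not a full random forest: `bootstrap_sample` and `random_feature_subset` are hypothetical helper names, and the feature names in the usage lines are made up for the example.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; some repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features to consider at one split."""
    return rng.sample(feature_names, k)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
candidates = random_feature_subset(["age", "weight", "height", "income"], 2, rng)
```

In a forest, each tree would get its own bootstrap sample once, while a fresh feature subset would be drawn at every split.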

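Aspect	Data randomness (Bootstrapping)	Feature randomness (Feature Subset Selection)
Where applied	When selecting each tree's training data	At each split of a tree
Method	Sample from the original dataset with replacement to create a new training subset	Randomly select a subset of all features, then search for the split criterion among only those features
Characteristics	- Each tree is trained on different data points - Some samples may appear multiple times, others may be left out	- Each split may use different features - Avoids over-reliance on the same features
Effect	- Ensures data diversity across trees - Reduces overfitting	- Ensures feature diversity across trees - Reduces the influence of a few dominant features
Result	Trees that reflect more diverse data scenarios	Trees that reflect more diverse decision rules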

Predict with Random Forest

aggregating the predictions of all individual decision trees in the forest.
Majority voting: for classification, count the number of times each class is predicted by the individual decision trees. The class with the highest count is the final prediction.
Averaging: for regression, calculate the mean of the predictions made by the individual decision trees. The mean value is the final prediction.
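The two aggregation rules are short enough to sketch directly. The per-tree predictions below are made-up values for illustration, and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Classification: return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Regression: return the mean of the per-tree predictions."""
    return mean(predictions)


label = majority_vote(["spam", "ham", "spam"])
value = average_prediction([2.0, 3.0, 4.0])
```

When two classes tie, `Counter.most_common` returns the one that was encountered first, so a real implementation may want an explicit tie-breaking rule.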

Linear regression

a learning technique that estimates the relationship between the input variables and the target variable, based on the fundamental assumption that this relationship is linear

e.g. the input variables (engine size, weight and car age) ➡️ target variable (car fuel efficiency)
- assumption that there is a linear relationship
A linear regression technique learns a set of coefficients to estimate the linear relationship between $x$ and $y$ , denoted as $h_w$ , which can be represented by the following equation.
- $h_w(x) = w_0 + w_1x_1 + ... + w_nx_n = \sum_{i=0}^{n} w_ix_i$ (with $x_0 = 1$ )
- $w$ is a weight vector
- $\hat{y} = \sum_{i=0}^{n} w_ix_i$
A linear regression model only approximates the function between the input variables and the target variable, so there will be an error between the output of the model and the actual output value for a data sample
- This error can be represented by a loss function, which calculates the mean square error
- $Loss(h_w) = \frac{1}{2m}\sum_{j=1}^{m}(h_w(x_j) - y_j)^2 = \frac{1}{2m}\sum_{j=1}^{m}(y_j - \sum_{i=0}^{n} w_ix_{j,i})^2$
for solving regression problems

Solving a linear regression problem

to find the linear relationship $h_w$ that best fits the training data of $m$ data samples.
- i.e., the one that minimises the loss.
to find the best weight vector $w^*$ that minimises the loss for a given training dataset of $m$ data samples.
- $w^* = \arg\min_{w} Loss(h_w)$
gradient descent: repeatedly moves the weight coefficients $w$ a small step in the direction opposite to the gradient of the loss.
- until the loss function $Loss(h_w)$ converges to its global minimum (the squared-error loss is convex in $w$ )
- to change the individual components of $w$ a little bit in the direction that minimises $Loss(h_w)$ , and to do this many times.
$w_i \leftarrow w_i + \alpha \sum_{j=1}^{m} x_{j,i} \big( y_j - h_w(x_j) \big)$
- $\alpha$ : the step size, the learning rate
Training model: the process of iteratively updating weights with a learning rate to minimise loss, where the final weight vector defines the model used for predicting new data.
use regularisation on a multivariate linear function to avoid overfitting.
Batch gradient descent: consider the entire training dataset $(X, y)$ at once.
- $w_0 \;\leftarrow\; w_0 + \alpha \sum_{j=1}^{m} \Big(y_j - (w_0 + w_1 x_j)\Big)$
- $w_1 \;\leftarrow\; w_1 + \alpha \Big(\sum_{j=1}^{m} (y_j - (w_0 + w_1 x_j)) \cdot x_j\Big)$
Stochastic gradient descent (SGD): consider only a single training data sample $(x_j, y_j)$ at a time.
- $w_0 \leftarrow w_0 + \alpha \big( y_j - (w_0 + w_1 x_j) \big)$
- $w_1 \leftarrow w_1 + \alpha \big( (y_j - (w_0 + w_1 x_j)) \cdot x_j \big)$
- can be used in an online setting, where new data is coming one at a time, or offline, where we cycle through the same data as many times as is necessary, taking a step after considering each single example.
- With a fixed learning rate $\alpha$ , the stochastic version does not guarantee convergence.
- often faster than batch gradient descent.
- With a schedule of decreasing learning rates (SA), the stochastic version does guarantee convergence.
These update rules are derived by taking the partial derivatives of the loss function with respect to $w_0$ and $w_1$ .
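The batch and stochastic update rules for the univariate case translate almost line by line into code. The sketch below is a minimal illustration: the function names, the learning rate, the epoch count, and the toy data (generated from $y = 1 + 2x$) are all assumptions made for the example.

```python
def batch_step(w0, w1, xs, ys, alpha):
    """One batch update: sum the errors over the whole dataset."""
    errors = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
    new_w0 = w0 + alpha * sum(errors)
    new_w1 = w1 + alpha * sum(e * x for e, x in zip(errors, xs))
    return new_w0, new_w1


def sgd_step(w0, w1, x, y, alpha):
    """One stochastic update: use a single sample (x, y)."""
    error = y - (w0 + w1 * x)
    return w0 + alpha * error, w1 + alpha * error * x


def fit_batch(xs, ys, alpha=0.01, epochs=5000):
    """Repeat the batch step from zero-initialised weights."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        w0, w1 = batch_step(w0, w1, xs, ys, alpha)
    return w0, w1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # toy data from y = 1 + 2x
w0, w1 = fit_batch(xs, ys)          # converges towards w0 = 1, w1 = 2
```

If the learning rate is too large for the scale of the data, the updates overshoot and the weights diverge, which is one reason features are often rescaled before training.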

Logistic Regression

an extension of linear regression in such a way that the output of a linear regression model goes through a logistic function

$y(x) = \frac{1}{1 + e^{-x}}$
The output value of this logistic function is between 0 and 1.
0 is for certainly being labeled "0" and 1 is for certainly being labeled "1", and a value between 0 and 1 represents the probability of being labeled "1"
a logistic regression model: a linear regression model + a logistic function
mainly for solving classification problems
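The "linear model plus logistic function" idea can be sketched in a few lines. The weights below are made-up values rather than trained ones, and the 0.5 decision threshold is a common convention assumed here, not something stated above.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, features):
    """Probability of label 1; weights[0] is the intercept w0."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def predict_label(weights, features, threshold=0.5):
    """Turn the probability into a hard 0/1 label."""
    return 1 if predict_proba(weights, features) >= threshold else 0


weights = [-1.0, 2.0]                  # hypothetical intercept and slope
label_a = predict_label(weights, [1.0])   # z = 1, probability above 0.5
label_b = predict_label(weights, [0.0])   # z = -1, probability below 0.5
```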

Nearest Neighbor

a technique to predict the output of a given new sample based on a collection of existing samples.

is to find the k-nearest neighbours of given sample in the collection and determine the output based on these k neighbours.
k is usually chosen to be an odd number so that a binary majority vote cannot tie.
can be used for both classification and regression problems.
- For classification: majority vote of the neighbours.
- For regression: mean/median (or regression) of the neighbours.
Instance-based learning
- KNN does not learn a separate model.
- Instead, it stores all training data and uses them directly at prediction time.
Non-parametric model
- KNN has no parameters (like weights in linear regression) to train.
- The model is essentially the full dataset plus a distance measure.
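A brute-force classifier shows how little "model" there is beyond the stored data and a distance measure. This sketch assumes Euclidean distance (via `math.dist`, available from Python 3.8) and a toy 2D dataset invented for the example; `knn_classify` is a hypothetical name.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """train is a list of (point, label) pairs; return the majority label of the k nearest points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [
    ((0.0, 0.0), "a"),
    ((0.0, 1.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.0), "b"),
    ((5.0, 6.0), "b"),
]
near_origin = knn_classify(train, (0.5, 0.5), k=3)
near_cluster_b = knn_classify(train, (5.0, 5.0), k=3)
```

There is no training step at all: the whole cost is paid at prediction time, when every stored example is compared against the query.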

Distance measures

Minkowski distance or $L^p$ norm
- $L^p(x_j, x_q) = \left( \sum_i |x_{j,i} - x_{q,i}|^p \right)^{1/p}$
- Euclidean distance: $p = 2$ , used when the dimensions measure similar properties, such as the width, height and depth of 3D objects.
- Manhattan distance: $p = 1$ , used when the dimensions measure dissimilar properties, such as the age, weight, and gender of a patient.
- Hamming distance: the number of attributes on which the two points differ, for Boolean attribute values
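The three distance measures can be written as two small functions, since Euclidean and Manhattan distance are the $p = 2$ and $p = 1$ cases of the Minkowski distance. The function names are illustrative.

```python
def minkowski(a, b, p):
    """L^p distance between two equal-length numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def hamming(a, b):
    """Number of positions at which two attribute vectors differ."""
    return sum(x != y for x, y in zip(a, b))


euclidean = minkowski((0, 0), (3, 4), p=2)   # straight-line distance
manhattan = minkowski((0, 0), (3, 4), p=1)   # sum of per-axis differences
differing = hamming((True, False, True), (True, True, False))
```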

Normalization

If we use the raw data from each dimension, the total distance will be affected by a change in scale in any dimension
To avoid this, apply normalization to the measurements in each dimension.
Compute the mean $\mu_i$ and standard deviation $\sigma_i$ of the values in each dimension, and rescale them
The rescaling is done using the formula:
- $x'_{j,i} = \frac{x_{j,i} - \mu_i}{\sigma_i}$ where $x'_{j,i}$ is the normalized value, $x_{j,i}$ is the original value, $\mu_i$ is the mean, and $\sigma_i$ is the standard deviation.
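The rescaling formula can be applied one dimension (column) at a time. This sketch assumes the population standard deviation (`statistics.pstdev`); the text above does not say whether the population or sample version is intended, and the helper name and example values are made up.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale one dimension to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in column]


ages = [20.0, 30.0, 40.0]
scaled_ages = standardize(ages)
```

A column whose values are all identical has a standard deviation of zero, so a real implementation would need to guard against division by zero.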

Time complexity

Conceptually trivial: Given a set of N examples and a query $x_q$ , iterate through the examples, measure the distance to $x_q$ from each one, and keep the best k.
$NN(k, x_q)$ 's time complexity is $O(N)$ , N is the number of examples in the training dataset.
Use a k-dimensional tree: a balanced binary tree with an arbitrary number of dimensions.
- Time complexity can be improved to $O(\log N)$
- appropriate only when there are many more examples than dimensions
- It works well with up to 10 dimensions with thousands of examples.
Use a Hash table with a locality-sensitive hash (LSH)
- Time complexity can be improved to $O(1)$

SVM

a framework for finding a boundary that distinctly classifies the data points in an optimal way.

supervised learning, binary classification
SVM chooses the boundary with the maximum possible geometric margin, which has the largest distance to the nearest training data points of any class
initially designed for binary classification problems but can also be applied for solving multi-class classification problems

Linear discriminant

$X_i$ is multiplied by its matching weight $w_i$
all these products are added together and passed to a threshold function
Decision surface: if $g(x) = w \cdot x \gt 0$ then $f(x) = +1 (class1)$ else $f(x) = -1 (class2)$
Decision function: $f(x) = \text{sign}(g(x)) = \text{sign}(w_0 + w_1x)$
- To make a decision, the continuous value $g(x)$ is passed through the sign function so that it outputs either +1 or -1.
If the data from the two classes can be separated with a hyperplane, they are linearly separable.

Hyperplane

separates the data in 2D by a line or in 3D by a plane
The orientation of the hyperplane is given by the vector $w$
the location of the hyperplane is given by $w_0$
The distance from the origin to the hyperplane is $\frac{|w_0|}{\|w\|}$
If a given data sample $x^*$ satisfies $g(x^*) = 0$ , then this data sample is on the separation boundary. It can normally be assigned to any class.
geometric margin: the minimum distance between the samples and the hyperplane; SVM maximizes it by constructing and solving a constrained optimization problem
- $\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \right)$
primal optimization problem: to maximize the minimal geometric margin across the training dataset of $N$ samples.
- $\max_{w, w_0} \Big( \min_{i=1,\ldots,N} \gamma_i \Big) = \max_{w, w_0} \Big( \min_{i=1,\ldots,N} \Big( y_i \Big( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \Big) \Big) \Big)$
- $\min_{w, w_0} \; \frac{1}{2}\|w\|^2$
- $\text{s.t. } \; y_i (w \cdot x_i + w_0) \geq 1, \quad i=1,\ldots,N$ (after rescaling $w$ and $w_0$ so that the minimum of $y_i (w \cdot x_i + w_0)$ equals 1)
dual optimization problem: easier to solve. More importantly the dual optimisation problem enables the so-called kernel trick in SVM
- $\max_{\alpha} \; \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
- $\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) \;-\; \sum_{i=1}^N \alpha_i$
- $\text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i=1,2,\ldots,N$
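To make the decision function and the geometric margin concrete, the sketch below evaluates both for a fixed hyperplane. It does not solve the optimization problem; the weights and the three labelled points are made-up values, and the function names are hypothetical.

```python
import math


def g(w0, w, x):
    """Linear discriminant value w0 + w . x."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))


def classify(w0, w, x):
    """Sign decision: +1 for class 1, -1 for class 2."""
    return 1 if g(w0, w, x) > 0 else -1


def geometric_margin(w0, w, samples):
    """Minimum over samples of y_i * (w . x_i + w0) / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * g(w0, w, x) / norm for x, y in samples)


w0, w = 0.0, [3.0, 4.0]                     # hypothetical hyperplane, ||w|| = 5
samples = [((3.0, 4.0), 1), ((-3.0, -4.0), -1), ((0.6, 0.8), 1)]
margin = geometric_margin(w0, w, samples)   # set by the closest point, (0.6, 0.8)
```

A positive margin means every sample lies on its correct side of the hyperplane; the SVM searches for the $w$ and $w_0$ that make this minimum as large as possible.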

Attractive Properties

SVMs construct a maximum margin separator
- the largest possible distance to example points, helping to improve generalization
SVMs create a linear separating hyperplane
- kernel trick: to embed the data into a higher-dimensional space
- Often data that are not linearly separable in the original input space are easily separable in a higher-dimensional space
- In general (except in some special cases), if we have $N$ data points then they will always be separable in spaces of $N$ dimensions or more
SVMs are a nonparametric method
- retain training examples and potentially need to store them all
- In practice, they often end up retaining only a small fraction of examples
- have the flexibility to represent complex functions, but they are resistant to overfitting
We do not usually expect to find a linear separator in the input space $x$ , but we can find linear separators in the high-dimensional feature space $F(x)$ simply by replacing $(x_j \cdot x_k)$ in
- $\arg\max_{\alpha} \sum_{j}\alpha_j - \frac{1}{2} \sum_{j,k}\alpha_j \alpha_k y_j y_k (x_j \cdot x_k)$
  - with $K(x_j, x_k) = F(x_j) \cdot F(x_k)$
  - $F(x_j) \cdot F(x_k)$ can often be computed without first computing $F$ for each point.
In a higher-dimensional feature space created by the transformation $F(x)$ , if we can express $K(x_j, x_k) = F(x_j) \cdot F(x_k)$ , the kernel function $K(x_j, x_k)$ can be applied to pairs of input data to evaluate the dot product in the corresponding feature space.
- The kernel trick is to plug a kernel function $K(x_j, x_k)$ into the dual optimisation problem to replace $(x_j \cdot x_k)$
- Optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.
- we can learn in the higher-dimensional space, but we compute only kernel functions rather than the full list of features for each data point.
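A classic small case illustrates why the kernel trick works. For 2D inputs, the quadratic kernel $K(x, z) = (x \cdot z)^2$ equals the dot product of the explicit feature map $F(x) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2)$. This particular kernel and feature map are chosen here as an example and are not taken from the notes above.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def feature_map(x):
    """Explicit 3D feature map F(x) for a 2D input."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)


def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed in the original 2D space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = quadratic_kernel(x, z)                  # no feature map needed
via_features = dot(feature_map(x), feature_map(z))   # same value, more work
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors, which is what makes very large (even infinite-dimensional) feature spaces tractable.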

Classification evaluation metrics

Term	Description
True Positive (TP)	Actual 1, predicted 1
True Negative (TN)	Actual 0, predicted 0
False Positive (FP)	Actual 0, predicted 1 (a 0 wrongly predicted as 1)
False Negative (FN)	Actual 1, predicted 0 (a 1 missed and predicted as 0)

Accuracy

the proportion of correctly classified instances (data points or samples) among the total instances.

$Accuracy = \frac{TP + TN}{\text{Total number of predictions}}$

the ratio of the number of correct predictions to the total number of predictions
it may not be suitable for imbalanced datasets where the class distribution is skewed
- a model that always predicts the majority class will have high accuracy but may not be useful in practice.

Precision

the proportion of true positives among all positive predictions.

$Precision = \frac{TP}{\text{number of positive predictions}} = \frac{TP}{TP + FP}$

the model's ability to not mistakenly view negatives as positives
A high precision value indicates that the model has made fewer false positive predictions.
it's useful to minimize the number of false positives.

Recall

Sensitivity, True Positive Rate, the proportion of true positive instances among the actual positive instances

$Recall = \frac{TP}{\text{number of positive instances}} = \frac{TP}{TP + FN}$

the model's ability to not mistakenly view actual positives as negatives
A high recall value indicates that the model has successfully identified a large portion of the actual positive instances
it's useful when the cost of false negatives is high
- e.g. in medical diagnosis, where failing to identify a disease can have severe consequences

F1 Score

the harmonic mean of precision and recall

$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$

a balanced evaluation of the model's performance
It is particularly useful for imbalanced datasets, where one class is significantly more prevalent than the other, because it balances precision and recall.
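The four classification metrics follow directly from the confusion-matrix counts. The counts in this sketch are invented to show an imbalanced case (18 actual positives out of 100), and the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up imbalanced example: 18 actual positives, 82 actual negatives.
metrics = classification_metrics(tp=8, tn=80, fp=2, fn=10)
```

Here accuracy is high because negatives dominate, while recall shows that more than half of the actual positives were missed, and F1 reflects that weakness. The sketch does not guard against zero denominators, which occur when there are no positive predictions or no actual positives.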

Regression evaluation metrics

Mean Absolute Error (MAE)

the average of the absolute differences between the predicted values and the actual values

$MAE = \frac{1}{n} \sum_{i=1}^n \lvert y_i - \hat{y}_i \rvert$

It measures the average magnitude of errors made by the model, without considering their direction
A lower MAE value indicates that the model has made smaller prediction errors

Mean Squared Error (MSE)

the average squared difference between the predicted and actual values

$MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$

It's useful when you want to penalise larger errors more heavily, making it more sensitive to outliers at the same time
often used as a loss function when training regression models
A lower MSE value indicates that the model has smaller prediction errors, with a strong preference for avoiding large errors.

Root Mean Squared Error (RMSE)

the square root of MSE

$RMSE = \sqrt{MSE}$

RMSE is more interpretable than MSE because it is in the same units as the dependent variable.

R-Squared (Coefficient of Determination)

how well the regression model approximates the actual data

$R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$

one minus the ratio of the sum of squared residuals (SSR) to the total sum of squares (SST)
- SSR captures the model's prediction errors
- SST is proportional to the variance of the target variable ( $n$ times the variance).
  - can be viewed as a naive model ( $\hat{y} = \bar{y}$ ) using the average value of the target variable as the prediction
$R^2 = 1$ : The model predicts perfectly (the error is 0).
$R^2 = 0$ : The model does no better than a naive model that always predicts the mean of the target variable.
$R^2 < 0$ : The model performs worse than simply predicting the mean, meaning its predictions increase the error compared to the naive baseline.
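The regression metrics can be computed together from the residuals. The function name and the toy values below are assumptions for illustration; SSR here means the sum of squared residuals, matching the formula for $R^2$ above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 for paired actual and predicted values."""
    n = len(y_true)
    residuals = [y - p for y, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    y_bar = sum(y_true) / n
    ssr = sum(r * r for r in residuals)              # model's squared errors
    sst = sum((y - y_bar) ** 2 for y in y_true)      # naive mean model's squared errors
    r2 = 1 - ssr / sst
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


y_true = [1.0, 2.0, 3.0, 4.0]
one_big_miss = regression_metrics(y_true, [1.0, 2.0, 3.0, 6.0])
mean_model = regression_metrics(y_true, [2.5, 2.5, 2.5, 2.5])   # R^2 of the naive baseline
```

A single large error dominates MSE far more than MAE, which is the sensitivity to outliers described above.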

FSD +005

August 25, 2025 · about 2 min read

Eunkwang Shin

Owner

Method

a named block of code grouped together
can be invoked by its name to perform certain action
can have parameters that represent the values needed for the method to run
can have local variables usable only within its own code block.

Function vs Procedure

Procedure: has no return value, performs an action
- Example: move(), run(), deposit(), eat()
Function: has a return value, does not perform any action
- Example: total(), sum(), area()
A method can both return a value and perform an action (a combined function-procedure), but this is not recommended.

Method Overloading

Java allows methods in the same class to have the same name but different parameters.
method signature: the method name together with the number and types of the method's parameters.

Parameter vs Arguments

Parameters: placeholder variables used in the method definition; they indicate the type and order of the arguments
Arguments: data values passed to the method when the method is invoked or called.

Patterns

The read pattern

def <name>():
  <prompt>;
  return <type>

The update read-loop pattern

<read function>
while (<value> != <end value>):
   <use the value>
   <read function>

The array-loop pattern

for <value> in <range>:
  <use the item from array>

The any-pattern

for <item> in <collection>:
  if (<test>):
    return True
return False

The every-pattern

for <item> in <collection>:
  if (not(<test>)):
    return False
return True

The none-pattern

for <item> in <collection>:
  if (<test>):
    return False
return True
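
The any-, every-, and none-patterns above can be filled in as follows. This is a sketch: the helper names and the `is_negative` test are illustrative. Python's built-in `any()` and `all()` express the same ideas more compactly.

```python
def any_match(items, test):
    """Any-pattern: True as soon as one item passes the test."""
    for item in items:
        if test(item):
            return True
    return False


def every_match(items, test):
    """Every-pattern: False as soon as one item fails the test."""
    for item in items:
        if not test(item):
            return False
    return True


def none_match(items, test):
    """None-pattern: False as soon as one item passes the test."""
    for item in items:
        if test(item):
            return False
    return True


def is_negative(x):
    return x < 0


scores = [3, 8, -1, 6]
has_negative = any_match(scores, is_negative)
all_negative = every_match(scores, is_negative)
no_negative = none_match(scores, is_negative)
```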

Boolean Functions

def isEven(number):
  if number % 2 == 0:
    return True
  else:
    return False

def isEven(number):
  return (number % 2 == 0)

Recursion

A technique where a method calls itself repeatedly.
A recursive method must provide termination logic (a base case) to avoid infinite execution.

def factorial(n):
  return 1 if (n == 1 or n == 0) else n * factorial(n - 1)

def factorial(n):
  F = lambda n: n * F(n-1) if n > 1 else 1
  return F(n)

Process in Programming

A process is the method used to solve a problem.
"Break it down, build it up" is a structured technique for handling complex problems.

RT-2, Robotic Transformer 2 Review

August 24, 2025 · about 4 min

Eunkwang Shin

Owner

Trains a Vision-Language-Action (VLA) model by co-fine-tuning web-scale VLMs with robot trajectories, and treats robot actions as text tokens.
Yields strong generalization and emergent capabilities (symbol understanding, reasoning, human recognition) beyond what appears in robot data.
Runs in direct closed-loop control; largest evaluated model (55B) executes at ~1–3 Hz via a cloud (multi-TPU) inference setup.

RT-2 Architecture

What RT-2 Is

A family of VLA models (RT-2-PaLI-X, RT-2-PaLM-E) that fine-tune large VLMs on robot trajectories to output low-level actions.
Target: generalizable, semantically aware manipulation policies that map images + instructions → actions end-to-end.
RT-2 does not rely on a restricted 2D action space or calibrated cameras.
The unified output space lets language and action tokens share the same model weights, without action-only layers.

Core Recipe

Directly train open-vocabulary VQA/dialogue VLMs to output robot actions while they still solve standard vision-language tasks.
Build on RT-1 protocol/data, but replace the policy backbone with a large VLM.

Action as Language (Tokenization)

Discretize continuous action dims (Δpos/Δrot, gripper, terminate) into 256 bins; represent each dimension with an integer token.
PaLI-X: reuse numeric tokens (≤1000). PaLM-E: overwrite 256 least-frequent tokens as action vocabulary (symbol tuning).
Form a single output string per step (e.g., terminate Δposx Δposy Δposz Δrotx Δroty Δrotz gripper).
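
A rough sketch of the action-as-text idea. The value range `[-1, 1]`, uniform bin edges, the 0/1 terminate token, and all function names are assumptions for illustration; the summary above does not give the actual normalization.

```python
NUM_BINS = 256


def discretize(value, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer bin in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper edge (fraction == 1.0) is folded into the last bin.
    return min(int(fraction * num_bins), num_bins - 1)


def action_to_string(terminate, delta_pos, delta_rot, gripper):
    """Emit one integer token per dimension: terminate, Δpos xyz, Δrot xyz, gripper."""
    tokens = [int(terminate)]
    tokens += [discretize(v) for v in (*delta_pos, *delta_rot, gripper)]
    return " ".join(str(t) for t in tokens)


text = action_to_string(False, (0.0, -1.0, 1.0), (0.5, -0.5, 0.0), 1.0)
```

Each step thus becomes a short string of eight integers that the language model can emit like any other text.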

Co-Fine-Tuning & Output Constraint

Mix robot data with original web VQA/caption data in training batches (up-weight robot samples) to prevent forgetting and improve generalization.
During decoding on robot tasks, restrict sampling to valid action tokens so outputs are always executable.
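
The output constraint can be pictured as masking the vocabulary before choosing a token. This sketch uses greedy selection rather than sampling, and the toy logits and token ids are hypothetical.

```python
def constrained_argmax(logits, valid_token_ids):
    """Pick the highest-scoring token among the valid action tokens only."""
    return max(valid_token_ids, key=lambda token_id: logits[token_id])


logits = [2.5, 0.1, 1.7, 3.9, 0.4]   # toy scores over a 5-token vocabulary
action_tokens = {1, 2, 4}            # hypothetical ids reserved for actions
chosen = constrained_argmax(logits, action_tokens)
```

Token 3 has the highest score overall but is not an action token, so it is never selected.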

Closed-Loop Control & Real-Time Inference

RT-2 is trained and deployed for direct closed-loop control (camera → action → camera …), not just high-level planning.
For large models, inference runs via a multi-TPU cloud service; RT-2-PaLI-X-55B reaches ~1–3 Hz; smaller models ~5 Hz.

Generalization & Benchmarks

Matches RT-1 on seen tasks but far exceeds baselines on unseen objects/backgrounds/environments (~2× vs RT-1/MOO; up to ~6× vs others).
Open-source Language-Table sim: co-fine-tuned PaLI-3B outperforms baselines, showing the approach transfers to other robots/sims.

Emergent Capabilities

Symbol understanding (e.g., “move apple to 3 / heart / star”).
Reasoning (visual matching, simple math like “sum of two plus one”, multilingual commands).
Human recognition (e.g., “person with glasses”); none of these were present as low-level actions in robot data.
Chain-of-thought (CoT) variant adds a Plan step before actions → supports multi-stage semantic reasoning (e.g., pick a rock as an improvised hammer; pick an energy drink for a tired person).

Figure: RT-2 chain-of-thought (CoT) example.

Scaling & Ablations

From-scratch training (even 5B) performs poorly; fine-tuning helps; co-fine-tuning helps most.
Bigger models (55B > 5B) generalize better.
PaLM-E variant shows an edge on math reasoning; PaLI-X stronger on symbols/vision reasoning on average.

Limitations

Does not learn fundamentally new motor skills beyond the distribution in robot data; mainly transfers semantic/visual knowledge.
Compute/latency costly; real-time control can bottleneck. Limited availability of strong open VLMs and convenient FT APIs.

Future Directions (from the text)

Acquire new skills from human videos or richer datasets.
Quantization/distillation for faster/cheaper inference.
More open VLMs / FT APIs to make VLA models broadly buildable.

Ref

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P. R., Salazar, G., Ryoo, M. S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.-W. E., Leal, I., Kuang, Y., Kalashnikov, D., Julian, R., Joshi, N. J., Irpan, A., Ichter, B., Hsu, J., Herzog, A., Hausman, K., Gopalakrishnan, K., Fu, C., Florence, P., Finn, C., Dubey, K. A., Driess, D., Ding, T., Choromanski, K. M., Chen, X., Chebotar, Y., Carbajal, J., Brown, N., Brohan, A., Arenas, M. G., & Han, K. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Proceedings of The 7th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v229/zitkovich23a.html

PaLM-E An Embodied Multimodal Language Model Review

August 24, 2025 · about 4 min

Eunkwang Shin

Owner

PaLM-E

ViT (e.g., ViT-4B, ViT-22B) extracts image embeddings.
OSRT builds object-centric slot representations.
These are injected into the LLM embedding space (PaLM variants: 8B, 62B, 540B) for high-level abstraction and planning, with execution delegated to low-level policies (e.g., RT-1).

PaLM-E Architecture

Core idea

Build embodied language models by injecting continuous sensor inputs (images, states, other modalities) directly into a pretrained LLM’s embedding space, linking words ↔ percepts.
Inputs are multimodal sentences that interleave text tokens with encoded visual/state tokens; outputs are text (answers or high-level plans).

Architecture & representations

Start from a decoder-only, autoregressive LLM (PaLM) and condition on a prefix that mixes text and encoder-produced vectors.
Provide multiple encoder options:
- State vectors (simplest).
- ViT features with a learned projector ψ to match LLM embedding dimensionality.
- Object-centric, 3D-aware OSRT (neural scene representations). Supports entity-label tokens (<obj j>) so the model can refer to specific objects in generated plans.

Training setup

Train end-to-end (encoders + projector + optionally the LLM) to output sequential decisions as natural text or answers (VQA, captioning).
Dataset items contain (continuous observations, text sequence, prefix index); loss is cross-entropy on non-prefix tokens.
Explore freezing the LLM (train encoders/projection only), and co-training across diverse tasks ("full mixture"; only ~9% is embodied data).
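
The loss described above (cross-entropy on non-prefix tokens only) can be sketched as below. The input format, the averaging over target positions, and the function name are assumptions; `token_probs[i]` stands for the probability the model assigned to the correct token at position `i`.

```python
import math


def prefix_masked_cross_entropy(token_probs, prefix_index):
    """Average negative log-likelihood over the tokens after the prefix."""
    # Positions before prefix_index are conditioning context and add no loss.
    targets = token_probs[prefix_index:]
    return -sum(math.log(p) for p in targets) / len(targets)


# Four positions; the first two form the prefix (e.g. text plus encoded observations).
loss = prefix_masked_cross_entropy([0.1, 0.2, 0.5, 0.25], prefix_index=2)
```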

Planning & control loop

For planning/control, PaLM-E emits textual subgoals/skills drawn from a small skill vocabulary; a separate low-level policy executes them.
The system runs closed-loop: execute → observe → (re)plan; PaLM-E acts as a high-level policy sequencing low-level skills.

Why not text-only LLMs or affordance-only grounding?

Prior work that feeds only text to the LLM (and uses external affordance models) is insufficient when spatial layout matters.
PaLM-E instead grounds inside the LLM by injecting continuous observations, enabling direct plan generation while leveraging the LLM’s world knowledge.

Environments & use cases

Three domains: TAMP (grasp/stack planning), Language-Table (multi-object tabletop pushing), Mobile manipulation (kitchen tasks).
Use cases to test embodied reasoning: affordance prediction, failure detection, long-horizon planning (low-level policies from RT-1).

Results (high level)

Transfer via co-training: One model trained on mixed tasks/embodiments achieves higher performance than task-specialists; "full mixture" yields >2× gains (Fig. 3).
Few-shot/data efficiency: Solves robotics tasks with very few examples (e.g., 10–80 for Language-Table, 320 for TAMP). OSRT further improves data efficiency.
Mobile manipulation: End-to-end embodied planning works in real kitchens, robust to disturbances; PaLM-E beats PaLI (zero-shot) and QT-OPT/CLIP baselines on affordance/failure detection.
General V+L: The 562B generalist achieves state-of-the-art on OK-VQA and strong VQAv2/COCO without task-specific finetuning.
Language retention & scaling: Freezing the LLM preserves language ability but can struggle on some robotics tasks; unfreezing and scaling up the LLM significantly reduces catastrophic forgetting.
Emergent behaviors: Multimodal chain-of-thought and multi-image reasoning emerge in PaLM-E-562B, despite training on single-image prompts.

Takeaways

Injecting neural scene representations (OSRT) and entity-labeled multimodal tokens is effective even without massive embodied data.
Diverse, joint training transfers vision-language knowledge into embodied decision-making, enabling data-efficient robot planning.
Two viable paths to retain language skills during multimodal finetuning:
1. Freeze the LLM, train encoders (max language retention, sometimes weaker robotics),
2. Unfreeze and scale the LLM (much less forgetting, strong embodied performance).

Ref

Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., & Florence, P. (2023). PaLM-E: An Embodied Multimodal Language Model. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v202/driess23a.html

RT-1, Robot Transformer 1 Review

August 24, 2025 · about 3 min

Eunkwang Shin

Owner

RT-1

RT-1 discretizes robot actions into 256-bin tokens, creating a shared "action language" across robots.
It absorbs heterogeneous data from simulation and other robot morphologies without losing performance.
It generalizes robustly to new tasks, environments, and long-horizon scenarios (up to 50 steps).

RT-1 Architecture

Introduction & Motivation

Leveraging large, diverse, task-agnostic datasets enables high performance in zero-shot or small task-specific settings.
Data collection and curation is a critical bottleneck in robotics ("the unsung hero" of large-scale ML).
Transformer-based controllers are powerful but inefficient for real-time robotics, requiring architectural adaptations.

Model & Architecture

RT-1 architecture: EfficientNet + FiLM layers + TokenLearner for compact vision-language tokenization.
Action tokenization: 11 action dimensions (7 arm, 3 base, 1 mode) discretized into 256 bins each.
This abstraction converts continuous robot actions into a discrete "token language", enabling cross-domain and cross-robot transfer.
Real-time feasibility: optimized design achieves ~3Hz inference speed suitable for real-world control.

Experiments & Results

General Performance

RT-1 executes over 700 unique instructions at 97% success rate.
On unseen instructions: 76% success, outperforming next-best baseline by +24%.
Robustness: 83% success with distractors, 59% with background changes (significantly higher than baselines).

Absorbing Simulation Data

Adding sim data does not degrade real-task performance.
Objects/tasks only seen in simulation: performance boosted 23% ⇒ 87%.
Unseen instructions with sim objects: 7% ⇒ 33%, showing strong sim-to-real domain transfer.

Absorbing Multi-Robot Data

Mixed RT-1 + Kuka datasets: only 2% drop in original tasks.
Bin-picking eval: RT-1 only 22% ⇒ mixed training 39% (almost 2×).
Kuka-only training: 0% on EDR robots ⇒ morphology transfer alone fails.
Mixed data enables RT-1 to leverage cross-robot experiences without explicit demonstrations.

Long-Horizon Scenarios (SayCan Integration)

Evaluated in two kitchens:
- Kitchen1: 67% execution success.
- Kitchen2 (novel environment): also 67% execution success.
Outperforms Gato (0% in Kitchen2) and BC-Z (13% in Kitchen2).
Demonstrated execution of ultra-long tasks up to 50 steps.

Data Quantity vs Diversity

Data Diversity

Reducing dataset size ⇒ gradual performance/generalization decline.
Reducing task diversity ⇒ much sharper decline, especially in generalization.
Key takeaway: Data diversity is more critical than data quantity.

Conclusions & Limitations

RT-1 proves large-scale data absorption and strong generalization in robotics.
Limitations:
- Based on imitation learning ⇒ cannot surpass demonstrator performance.
- Generalization limited to recombinations of known concepts ⇒ fails on truly novel motions.
- Dataset is large but not dexterous (fine manipulation limited).

Future Directions

Enable non-experts to collect training data and prompt models for faster skill scaling.
Increase environmental diversity to strengthen robustness to backgrounds/environments.
Improve reaction speed and context retention via scalable attention and memory.

Ref

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., & Hsu, J. (2022). Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.

Do As I Can, Not As I Say Review

August 24, 2025 · about 4 min

Eunkwang Shin

Owner

Say Can

The core of SayCan is using an LLM to decompose high-level instructions into low-level skills, and reinforcement-learned affordance value functions to evaluate whether each skill is feasible in the current environment.
The Say × Can structure is modular: different LLMs or affordance models can be swapped in, but each module’s inherent biases are carried into the system.
To mitigate limitations, loop-based strategies are essential — CoT and RLHF provide feedback loops for LLMs, while closed-loop feedback enables affordance functions to adapt during execution.

Motivation (Why LLMs alone fall short)

LLMs lack embodiment. They haven’t acted in the physical world, so using them for decision-making on a specific robot is unreliable.
LLMs don’t know robot’s abilities or state. They may split instructions into subtasks, but without context of capabilities and environment, plans can be irrelevant.
Prompting alone isn’t enough. Structured prompts help, but they don’t guarantee admissible or executable steps.

Core Proposal (What SayCan adds)

Ground with pretrained skills. Constrain LLM to propose actions that the robot can actually perform in context.
Say × Can factorization.
- Say (task-grounding): LLM estimates relevance of each skill to the instruction.
- Can (world-grounding): Affordance functions estimate probability of success from current state.

Probabilistic Formulation

Two probabilities multiplied:
- $p(\ell_\pi|i)$ : LLM score of relevance.
- $p(c_\pi|s,\ell_\pi)$ : affordance score of success.
- Select: $\pi = \arg\max p(c_\pi|s,\ell_\pi)\,p(\ell_\pi|i)$ .
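
The selection rule can be sketched in a few lines. The skill names and scores below are invented for illustration; in SayCan the first score comes from the LLM and the second from a learned value function.

```python
def select_skill(llm_scores, affordance_scores):
    """Return the skill maximizing p(c | s, l) * p(l | i), plus all combined scores."""
    combined = {
        skill: llm_scores[skill] * affordance_scores[skill]
        for skill in llm_scores
    }
    return max(combined, key=combined.get), combined


# Hypothetical scores for one decision step.
llm_scores = {"pick up the sponge": 0.6, "go to the table": 0.3, "pick up the apple": 0.1}
affordance = {"pick up the sponge": 0.2, "go to the table": 0.9, "pick up the apple": 0.8}
best, combined = select_skill(llm_scores, affordance)
```

The LLM alone would prefer picking up the sponge, but its low affordance score (for example, the sponge is not within reach) shifts the choice to `go to the table`.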

Planning Procedure

Planning is structured as a dialog: user gives high-level instruction, LLM produces a step sequence, loop until "done."
Benefit: Interpretability—scores provide transparency.
Caveat: Without affordances, chosen steps may be irrelevant to the current scene.

Affordances via RL

Affordance = value function. In sparse reward settings, value ≈ success probability.
TD RL and MDP formalism used to learn $Q_\pi(s,a)$ .

Implementation

Skill training:
- BC-Z (behavioral cloning) and MT-Opt (reinforcement learning).
- Multi-task BC/RL amortizes training cost.
Language conditioning: Pretrained sentence encoder frozen, text embeddings as input.
Action space: 6-DoF end-effector, gripper open/close, base x-y & yaw deltas, terminate.

Metrics

Plan success rate: 2/3 human raters agree that the plan is valid.
Execution success rate: 2/3 raters agree robot achieved the task.

Key Results

Grounding nearly doubles performance vs non-grounded baselines.
Understands sequence order (approach → pick → bring).
Failures: Long-horizon tasks (early termination), negation, ambiguous references.
Error split: ~65% LLM, 35% affordance.

Ablations

Remove LLM (task-grounding):
- BC-NL: 0% all tasks.
- BC-USE: 60% on single primitives, 0% otherwise.
Remove affordances (world-grounding):
- No-VF: 67%, Generative: 74% vs 84% (SayCan).

Scaling & Models

PaLM > FLAN. PaLM-SayCan achieves 84% plan / 74% execute.
Stronger LMs improve robotics performance.

Extensibility

Add new skills easily: register skill, affordance, prompt example.
Chain-of-Thought: Add "Explanation" → helps with negation and reasoning-heavy queries.
Multilingual: Almost no performance drop (English, Chinese, French, Spanish).

Open-Source Variant

CLIPort for pick-and-place.
Affordances approximated by ViLD open-vocabulary object detector.
GPT-3 as language model.

Limitations & Future Work

Limits: Inherits LLM biases; skill library is bottleneck; hard to react to skill failures.
Closed-loop extensions: Huang et al. use environment feedback + inner monologue for replanning.
Future directions: Expand/robustify skills, explore new grounding sources (non-robotic), test if natural language is the right ontology, combine planning + language, use LMs for policy pretraining.

Ref

Ichter, B., Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., Kalashnikov, D., Levine, S., Lu, Y., Parada, C., Rao, K., Sermanet, P., Toshev, A. T., Vanhoucke, V., Xia, F., Xiao, T., Xu, P., Yan, M., Brown, N., Ahn, M., Cortes, O., Sievers, N., Tan, C., Xu, S., Reyes, D., Rettinghouse, J., Quiambao, J., Pastor, P., Luu, L., Lee, K.-H., Kuang, Y., Jesmonth, S., Joshi, N. J., Jeffrey, K., Ruano, R. J., Hsu, J., Gopalakrishnan, K., David, B., Zeng, A., & Fu, C. K. (2023). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. Proceedings of The 6th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v205/ichter23a.html

FDA +004

August 22, 2025 · about 15 min

Eunkwang Shin

Owner

Data Preparation

In real world applications, data can be inconsistent, incomplete, and noisy.
Data Collection problems: when data is collected incorrectly
Incomplete Data: when information is missing
Data entry problems: when data is entered incorrectly
Contradictions in data: when the data says something in one place, and then says a different thing elsewhere in the dataset. We can think of this data as noisy.
Discrepancy in naming conventions: when data descriptions are unclear, people may misinterpret their meaning.
Duplicated records: when integrating data from different sources, the same data may get entered multiple times.
Data transmission problems: when data is sent between different people or databases or companies, things can get lost in the process.

Data mining tasks

Classification
Estimation
Prediction
Characterisation
Discrimination
Affinity grouping
Clustering
Time series analysis

Data Cleaning

Missing data
- Ignore the record
- Fill the missing value manually
- Fill missing values with calculated values
  - The missing values can be filled using the average value for a particular attribute
  - or by using attribute mean for all samples belonging to the same class as the given record.
  - Missing values can also be filled using methods such as Bayesian classification or decision trees to infer the values automatically.
Noisy data: a meaningless variation that cannot be interpreted properly by machines
- Binning
  - binning methods use the neighbour's data, this is referred to as local smoothing
  - can replace all data in a segment by its mean or boundary values
- Clustering
  - grouping of data points according to a distance measure
  - use a clustering algorithm to classify each data point into a specific group
  - can detect outliers
- Regression
  - a data mining function that deals with the prediction of a continuous value rather than a class
  - maps data values to a function
  - Using regression to fit data by finding a mathematical equation may be used to smooth noisy data.
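
The two mean-based strategies for missing data listed above can be sketched as follows. Missing values are represented as `None`, the function names are illustrative, and each class is assumed to have at least one known value.

```python
def fill_with_mean(values):
    """Replace None with the mean of the known values of the attribute."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def fill_with_class_mean(records):
    """records: list of (class_label, value or None); fill with the same-class mean."""
    by_class = {}
    for label, value in records:
        if value is not None:
            by_class.setdefault(label, []).append(value)
    means = {label: sum(vs) / len(vs) for label, vs in by_class.items()}
    return [(label, means[label] if value is None else value)
            for label, value in records]


overall = fill_with_mean([10, None, 30, 20])
per_class = fill_with_class_mean(
    [("a", 10), ("a", None), ("b", 40), ("b", 60), ("b", None)]
)
```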

Binning

Price	Equi-width	Equi-depth
7	`[0, 10]`	`[7, 20]`
20	`[11, 20]`	`[7, 20]`
22	`[21, 30]`	`[22, 50]`
50	`[41, 50]`	`[22, 50]`
51	`[51, 60]`	`[51, 53]`
53	`[51, 60]`	`[51, 53]`

Equi-width: Bins have equal width.
Equi-depth: Bins have the same number of values in them or almost the same number if they don't divide equally.

Equi-width binning

Equal-interval binning, split the whole range of numbers into intervals with equal size.

Price: 4, 8, 9, 15, 21, 21, 22, 26, 27, 28, 29, 36
Equal-width binning
- Bin1 [4, 12]: 4, 8, 9
- Bin2 (12, 20]: 15
- Bin3 (20, 28]: 21, 21, 22, 26, 27, 28
- Bin4 (28, 36]: 29, 36
Smoothing by bin means
- Bin1: 7, 7, 7
- Bin2: 15
- Bin3: 24, 24, 24, 24, 24, 24
- Bin4: 33, 33
Smoothing by bin boundaries
- Bin1: 4, 9, 9
- Bin2: 15
- Bin3: 21, 21, 21, 28, 28, 28
- Bin4: 29, 36
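
The equal-width example can be reproduced with a short sketch. The function names are illustrative, bins are treated as right-closed intervals as in the listing above, and means are rounded half up to match the integer values shown.

```python
import math


def equal_width_bins(values, num_bins):
    """Split values into num_bins right-closed intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    bins = [[] for _ in range(num_bins)]
    for v in sorted(values):
        # Bin k covers (low + k*width, low + (k+1)*width]; the first also includes low.
        if v == low:
            index = 0
        else:
            index = min(math.ceil((v - low) / width) - 1, num_bins - 1)
        bins[index].append(v)
    return bins


def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded half up; empty bins stay empty."""
    return [[math.floor(sum(b) / len(b) + 0.5)] * len(b) if b else [] for b in bins]


prices = [4, 8, 9, 15, 21, 21, 22, 26, 27, 28, 29, 36]
bins = equal_width_bins(prices, 4)
means = smooth_by_means(bins)
```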

Equi-depth binning

Equal-frequency binning, use intervals containing an equal number of values.

Price: 4, 8, 9, 15, 21, 21, 22, 26, 27, 28, 29, 36
Equal-depth binning
- Bin1: 4, 8, 9
- Bin2: 15, 21, 21
- Bin3: 22, 26, 27
- Bin4: 28, 29, 36
Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
- Bin1: 7, 7, 7
- Bin2: 19, 19, 19
- Bin3: 25, 25, 25
- Bin4: 31, 31, 31
Smoothing by bin boundaries: each bin value is replaced by the closest boundary value.
- Bin1: 4, 9, 9
- Bin2: 15, 21, 21
- Bin3: 22, 27, 27
- Bin4: 28, 28, 36
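
A sketch of equal-depth binning with boundary smoothing, reproducing the example above. The function names are illustrative, the smallest and largest values in each bin serve as its boundaries, and ties go to the upper boundary (an assumption; the example contains no ties).

```python
def equal_depth_bins(values, num_bins):
    """Split sorted values into num_bins bins holding (almost) equal counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), num_bins)
    bins, start = [], 0
    for k in range(num_bins):
        # The first `extra` bins take one additional value when counts don't divide evenly.
        end = start + size + (1 if k < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        if not b:
            smoothed.append([])
            continue
        low, high = min(b), max(b)
        smoothed.append([low if v - low < high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 22, 26, 27, 28, 29, 36]
bins = equal_depth_bins(prices, 4)
boundaries = smooth_by_boundaries(bins)
```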

Data Integration

provides unified data by combining data from various heterogeneous data sources into a coherent data store

The sources can include flat files, databases or multiple data cubes.
Careful integration may help to avoid and reduce inconsistencies and redundancies in the final dataset.
Building an enterprise's data warehouse is considered one of the most popular data integration implementations.
Redundant attributes: An attribute (feature or column of a dataset) is called redundant if it can be derived from any other attribute or set of attributes.
- In the process of data integration in data mining, the use of multiple data stores may lead to the problem of redundancy in data.
- Dimension naming or inconsistencies in an attribute can also lead to redundancies in the dataset.

Pearson correlation coefficient

Correlation analysis can be used to detect redundancies in Numerical data
It can measure how strongly one attribute implies the other on the basis of the available data.
> 0.5: a strong positive correlation, A⬆️ B⬆️
< -0.5: a strong negative correlation, A⬆️ B⬇️
0: no linear correlation (this alone does not imply that A and B are independent).
correlation != causation

$r_{A,B} = \frac{n\sum{}xy - (\sum{}x)(\sum{}y)} {\sqrt{(n\sum_{} x^2 - (\sum_{} x)^2) \, (n\sum_{} y^2 - (\sum_{} y)^2)}}$

$r_{A,B} = \frac{\sum_{} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{} (x_i - \bar{x})^2} \, \sqrt{\sum_{} (y_i - \bar{y})^2}}$

step-by-step derivation
- Variance: $Var(X) = \frac{1}{n} \sum (x_i - \bar{x})^2$
- Covariance: $Cov(X,Y) = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})$
- Correlation coefficient (normalized): $\rho = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$
- Mean: $\bar{x} = \frac{1}{n} \sum x_i \quad \bar{y} = \frac{1}{n} \sum y_i$
- Expanding the numerator
  - $\sum (x_i - \bar{x})(y_i - \bar{y})$
  - $\sum (x_i y_i - x_i \bar{y} - y_i \bar{x} + \bar{x}\bar{y})$
  - $\sum x_i y_i - \bar{y}\sum x_i - \bar{x}\sum y_i + n\bar{x}\bar{y}$
  - Substituting the means
    - $\sum x_i y_i - \frac{1}{n}(\sum y_i)(\sum x_i) - \frac{1}{n}(\sum x_i)(\sum y_i) + \frac{1}{n}(\sum x_i)(\sum y_i)$
    - $\sum x_i y_i - \frac{1}{n}(\sum x_i)(\sum y_i)$
- Expanding the denominator
  - $\sqrt{\sum (x_i - \bar{x})^2} \, \sqrt{\sum (y_i - \bar{y})^2}$
  - $\sqrt{\sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2} \, \sqrt{\sum y_i^2 - 2\bar{y}\sum y_i + n\bar{y}^2}$
  - Substituting the means
    - $\sqrt{\sum x_i^2 - \frac{2}{n}(\sum x_i)(\sum x_i) + \frac{1}{n}(\sum x_i)^2} \, \sqrt{\sum y_i^2 - \frac{2}{n}(\sum y_i)(\sum y_i) + \frac{1}{n}(\sum y_i)^2}$
    - $\sqrt{\sum x_i^2 - \frac{1}{n}(\sum x_i)^2} \, \sqrt{\sum y_i^2 - \frac{1}{n}(\sum y_i)^2}$
- Restated: $r_{A,B} = \frac{\sum x_i y_i - \frac{1}{n}(\sum x_i)(\sum y_i)} {\sqrt{(\sum x_i^2 - \frac{1}{n}(\sum x_i)^2) \, (\sum y_i^2 - \frac{1}{n}(\sum y_i)^2)}}$
- Multiplying the numerator and denominator by $n$ and dropping the indices: $r_{A,B} = \frac{n\sum xy - (\sum x)(\sum y)} {\sqrt{(n\sum x^2 - (\sum x)^2) \, (n\sum y^2 - (\sum y)^2)}}$
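
As a numerical check on the derivation, the mean-centered form and the final computational form should give the same value. The function names and sample data in this sketch are illustrative.

```python
import math


def pearson_definition(xs, ys):
    """r from mean-centered sums (the starting form of the derivation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (math.sqrt(sum((x - mean_x) ** 2 for x in xs))
                   * math.sqrt(sum((y - mean_y) ** 2 for y in ys)))
    return numerator / denominator


def pearson_computational(xs, ys):
    """r from raw sums only (the final form of the derivation)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_def = pearson_definition(xs, ys)
r_comp = pearson_computational(xs, ys)
```

Both functions return roughly 0.77 for this data, a strong positive correlation by the thresholds above.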

Data Transformation

The data is consolidated or transformed so that the patterns found are easier to understand, and the consequent mining process is more efficient.

Smoothing: smoothing is used to remove noise from the data to improve clarity around the important features in the dataset
Normalization: the method of scaling your data, into a regularized range, so that you can compare and represent it more accurately
Discretization & Concept hierarchy generation
- Discretisation is the process of putting values into buckets so that there are a limited number of possible states.
- Discretisation transforms a continuous attribute into a categorical attribute; it usually happens after the data is cleaned.
- This process includes replacing lower-level data (primitive) with higher-level concepts through the use of concept hierarchies.
- Street may be replaced with city, country or region.
- Age may be replaced with senior, adult, younger and youth.
Binarization: transforming data into binary numbers (e.g. 0, 1).
- This helps make classifier algorithms more efficient.

Data Normalization

The data should be standardised or normalised to avoid dependency on the choice of measurement units.
This constitutes transforming data to lie within a common or smaller range, like [0.0, 1.0] or [−1, 1].
Min-max normalization
- Min-max normalisation maps a value $v_n$ of attribute $K$ to a new value $v'_n$ within the range $[new\_min_K, new\_max_K]$.
  - $v'_n = \frac{(v_n - min_K)}{(max_K - min_K)} \cdot (new\_max_K - new\_min_K) + new\_min_K$
  - calculates the relative position within the original range and reflects it in the new range accordingly.
- preserves the relationships between the original data values.
- It will encounter an out-of-bounds error if a future input case for normalisation falls outside of the original data range for K.
Z-Score normalization
- normalises attribute values using the average (i.e., mean) and standard deviation of $K$.
  - $v'_n = \frac{(v_n - \mu_K)}{\sigma_K}$
  - It converts the distance of a data point from the mean into a unitless measure.
- is useful when there are outliers that dominate the min-max normalisation
- is useful when the actual minimum and maximum of attribute $K$ are unknown.
Decimal scaling normalization
- The number of decimal points moved is based on the maximum absolute value of $K$.
  - $v'_n = \frac{v_n}{10^j}$
  - where $j$ is the smallest integer such that $max(|v'_n|) < 1$ .
  - divides all values by the power of 10 just larger than the maximum absolute value, bringing them into the range $(-1, 1)$ .
Softmax normalization
- a nonlinear transformation that yields an 's'-shaped curve that approaches 0 and 1 asymptotically.
- New values will be mapped between 0 and 1 even if they are beyond the range of your existing data.
- $\alpha = \frac{\nu - \mu}{\lambda \, (\sigma / 2\pi)}, \qquad \nu' = \frac{1}{1 + e^{-\alpha}}$
  - Center the data around the mean :: use $(\nu - \mu)$
  - Remove units by scaling with the standard deviation :: divide by $\sigma$ .
  - Control how steep or flat the curve is :: adjust with $\lambda$ .
  - Add $(2\pi)$ as a conventional constant to better match the logistic curve with statistical distributions.
  - the formula naturally arises by centering at the mean, standardizing by the spread, letting the user control the slope, and refining with a scaling constant.
Sigmoid normalization
- a nonlinear transformation similar to softmax. It ranges between −1 and 1 (asymptotically), and has a fixed linear portion within $\pm\mu$.
  - $\alpha = \frac{\nu - \mu}{\lambda \, (\sigma / 2\pi)}, \qquad \nu' = \frac{1 - e^{-\alpha}}{1 + e^{-\alpha}}$
  - Center the data around the mean :: use $(\nu - \mu)$
  - Remove units by scaling with the standard deviation :: divide by $\sigma$ .
  - Control how steep or flat the curve is :: adjust with $\lambda$ .
  - Add $(2\pi)$ as a conventional constant to refine scaling with respect to statistical distributions.
  - Apply the hyperbolic tangent (the expression for $\nu'$ equals $\tanh(\alpha/2)$) :: map the result smoothly into the range $(-1, 1)$.
  - the formula naturally arises by centering at the mean, standardizing by spread, letting the user control the slope, and using tanh to compress all values into $(-1, 1)$.
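
The five normalization formulas in this section can be written out directly. This is a sketch that follows the formulas as given in these notes; the function names are illustrative, and the mean and standard deviation are passed in rather than estimated from data.

```python
import math


def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Relative position in the old range, mapped onto the new range."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Distance from the mean in units of standard deviation."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def scaled_alpha(v, mean, std, lam=1.0):
    """alpha = (v - mean) / (lam * (std / 2*pi)), shared by the two curves below."""
    return (v - mean) / (lam * (std / (2 * math.pi)))


def softmax_scaling(v, mean, std, lam=1.0):
    """Logistic curve of alpha; output lies strictly between 0 and 1."""
    return 1 / (1 + math.exp(-scaled_alpha(v, mean, std, lam)))


def sigmoid_scaling(v, mean, std, lam=1.0):
    """(1 - e^-alpha) / (1 + e^-alpha); output lies strictly between -1 and 1."""
    e = math.exp(-scaled_alpha(v, mean, std, lam))
    return (1 - e) / (1 + e)


scaled = decimal_scaling([-986, 917])
```

A value equal to the mean maps to 0.5 under the softmax curve and to 0 under the sigmoid curve. For inputs extremely far below the mean, `math.exp` can overflow, so a production version would need a numerically safer form.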

Discretization & Concept hierarchy generation

Data discretisation is a form of numerosity reduction that transforms a continuous attribute into a categorical attribute.
Higher concept labels or a smaller number of intervals (i.e. binning) are used to replace the raw data in order to simplify the original data and increase the efficiency of mining.
Discretisation is very beneficial for generating concept hierarchies automatically, which allow data mining at multiple levels of data abstraction.
One or more concept hierarchies can be defined for a single attribute to accommodate the requirements of various users.

Salary	Age	➡️	Salary	Age
2000	20	-	`[2000, 2900)`	`[20, 25)`
2800	25	-	`[2000, 2900)`	`[25, 30)`
3500	23	-	`[2900, 3800)`	`[20, 25)`
2400	26	-	`[2000, 2900)`	`[25, 30)`
5600	32	-	`[5600, 6500)`	`[30, 35)`
4200	36	-	`[3800, 4700)`	`[35, 40]`
5000	39	-	`[4700, 5600)`	`[35, 40]`
5000	40	-	`[4700, 5600)`	`[35, 40]`
3400	35	-	`[2900, 3800)`	`[35, 40]`
3600	34	-	`[2900, 3800)`	`[30, 35)`

If dependent and independent variables have only a few values, a wide range of classification algorithms can be used.

Data Binarization

Maps a categorical or continuous attribute into one or more binary variables.
Binarisation can convert a continuous attribute to a categorical attribute, which can then be converted into a set of binary attributes.
A single binary variable can keep the meaning of only one categorical value at a time, losing the meaning of the others.

ID	Gender
1	Male
2	Female
3	Not specified
4	Female

ID	Male	Female	Not specified
1	1	0	0
2	0	1	0
3	0	0	1
4	0	1	0

Outlook	Temperature	Humidity	Windy	Play
Sunny	85	85	False	No
Sunny	80	90	True	No
Overcast	83	78	False	Yes
Rain	70	95	False	Yes
Rain	68	80	False	Yes

Outlook=Overcast	Outlook=Rain	Outlook=Sunny	Temperature	Humidity	Windy	Play
0	0	1	85	85	0	0
0	0	1	80	90	1	0
1	0	0	83	78	0	1
0	1	0	70	95	0	1
0	1	0	68	80	0	1
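
The conversion of the Outlook column into three binary columns can be sketched as below. The function name `one_hot` is illustrative, and categories are sorted alphabetically, which matches the column order in the table above.

```python
def one_hot(values):
    """Return the sorted categories and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain"]
categories, rows = one_hot(outlook)
```

Each row contains exactly one 1, so every categorical value keeps its own column.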

Data Reduction

to acquire a reduced data set representation which is much smaller in quantity and maintains the quality of the data close to the original data.
to reduce data storage and analysis costs while increasing storage efficiency

Aggregation

storing and presenting data as a summary, using statistical metrics like means, median and variance.
Data aggregation is often used to construct a data cube for data analysis at multiple levels of abstraction.
Multidimensional aggregated information is stored in data cubes

Data cube aggregation

Dimensionality reduction

to minimize the number of features
feature subset selection or feature selection detects and removes weakly relevant, redundant, or irrelevant dimensions or attributes
to determine a minimum set of attributes so that the resulting probability distribution of the data classes is as near as possible to the original distribution obtained using all attributes.
Feature subset selection: uses only a subset of the available features to reduce the dimensionality of the data
- Redundant features: Duplicates of all or much of the information present in one or more attributes.
  - the amount of sales tax paid / purchase price of a product
- Irrelevant features: Contain no information that is important for the data mining process at hand.
  - the color of a product when predicting its price
- Some redundant and irrelevant attributes can be eliminated immediately using domain knowledge or common sense.
- The ideal approach to feature selection is to try all possible subsets of features as input to the data mining algorithm of interest, and then choose the subset that gives the best outcome.
Feature subset selection techniques
- Brute-force approach
- Embedded approaches:
  - Feature selection occurs naturally as part of the data mining algorithm.
  - The algorithm decides by itself which attributes are to be ignored.
- Filter approaches:
  - Features are chosen before running the data mining algorithm, using criteria that are independent of the data mining task.
  - For example, features can be selected so that their pairwise correlation is as low as possible.
- Wrapper approaches:
  - Treat the target data mining algorithm as a black box to determine the best subset of attributes.
  - Instead of evaluating all possible combinations, they search only a subset of them to find a near-optimal feature set.
  - Heuristic methods: Forward selection, backward elimination, genetic algorithm, greedy search.
  - Decision tree induction
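
The wrapper idea with forward selection can be sketched as follows. With n features the brute-force approach would have to score 2^n subsets, so the greedy search below only ever extends the current best subset by one feature. The scoring function `toy_score` is a hypothetical stand-in: in practice it would be something like the validation accuracy of the target algorithm, and the feature names and point values are invented for illustration.

```python
def forward_selection(features, score):
    """Greedy wrapper search: keep adding the feature that improves the score most."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break  # no extension helps, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical black-box score in integer points (higher is better)."""
    gain = {"purchase_price": 40, "brand": 20, "sales_tax": 35, "color": 0}
    chosen = set(subset)
    total = sum(gain[f] for f in chosen)
    # sales_tax is redundant once purchase_price is present.
    if {"purchase_price", "sales_tax"} <= chosen:
        total -= gain["sales_tax"]
    # Small penalty per feature so useless features are never added.
    return total - len(chosen)


features = ["purchase_price", "brand", "sales_tax", "color"]
selected, best = forward_selection(features, toy_score)
print(selected, best)
```

In this toy setup the redundant feature (`sales_tax`) and the irrelevant one (`color`) are never selected, because adding them does not raise the score.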

Numerosity reduction

Regression, clustering, histograms, sampling
Reduces the volume of the data by replacing it with a smaller representation, while losing as little information as possible.
- parametric models: store only the model parameters rather than the actual data, regression, log-linear models
- non-parametric approaches: clustering, sampling, histograms
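
A minimal sketch of the parametric idea: fit a straight line by ordinary least squares and keep only its two parameters instead of the data points. The data values and the helper name `fit_line` are hypothetical, and the reconstruction is approximate rather than exact.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes xs are not all equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# Store two numbers instead of five (x, y) pairs.
slope, intercept = fit_line(xs, ys)
approx = [slope * x + intercept for x in xs]
print(slope, intercept)
print(approx)
```

The stored model reproduces the original values only approximately, which is the trade-off of this kind of reduction.
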
Histograms
- An unsupervised technique that does not use class labels
- Singleton bucket: each bucket represents a single attribute-value/frequency pair
- Equal-width histogram: the value range is divided into buckets of equal width
- Equal-frequency (equal-depth) histogram: each bucket holds roughly the same number of data points
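
The difference between the two bucketing schemes can be seen in a short sketch. The function names and the skewed `prices` list are hypothetical and used only for illustration.

```python
def equal_width_buckets(values, k):
    """Count values in k buckets that each span the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    counts = [0] * k
    for v in values:
        # The maximum value would land in bucket k, so clamp it to the last one.
        index = min(int((v - lo) / width), k - 1)
        counts[index] += 1
    return counts


def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


prices = [1, 2, 3, 4, 5, 6, 7, 8, 20, 30, 40, 100]
width_counts = equal_width_buckets(prices, 3)
depth_buckets = equal_frequency_buckets(prices, 3)
print(width_counts)
print(depth_buckets)
```

On skewed data like this, equal-width buckets put almost everything in the first bucket, while equal-frequency buckets adapt their boundaries to the data.
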
Sampling
- Allows a large dataset to be represented by a much smaller random subset (sample) of the data.
- Often used in preliminary exploration as well as in final analysis.
- Useful when the entire dataset is too large or too expensive to process.
- If the sample preserves the important properties of the original dataset (e.g., the mean), the sample is said to be representative.
- Simple random sampling: every data point has an equal probability of being chosen.
- Sampling without replacement: once a data point is chosen, it cannot be selected again.
- Sampling with replacement (bootstrap): the same data point can be picked multiple times, since it is placed back into the dataset after selection.
- Cluster sampling: the dataset is divided into clusters (groups), and sampling is performed at the cluster level.
- Stratified sampling: the dataset is split into strata (partitions), and random samples are drawn from each stratum. This is especially useful when the data is imbalanced, e.g., sampling customers across different age groups.
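
Three of the schemes above can be sketched with Python's standard `random` module. The customer records, the 10% sampling fraction, and the helper name `stratified_sample` are assumptions for illustration. Cluster sampling is omitted for brevity.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        # Take at least one record so small strata are still represented.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical imbalanced dataset: (age group, customer id)
customers = (
    [("youth", i) for i in range(60)]
    + [("middle-aged", i) for i in range(30)]
    + [("senior", i) for i in range(10)]
)

without_replacement = rng.sample(customers, 10)    # no record appears twice
with_replacement = rng.choices(customers, k=10)    # a record may repeat
stratified = stratified_sample(customers, lambda c: c[0], 0.1, rng)
print(stratified)
```

With a 10% fraction the stratified sample always contains at least one senior customer, whereas a simple random sample of the same size may contain none.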

Stratified sampling

Strata: Youth, Middle-aged, Senior

Vocabulary for AI +005

August 21, 2025 · about 3 min read

Eunkwang Shin

Owner

Vocabulary & Expressions

Term/Expression	Definition	Simpler Paraphrase	Korean Meaning
subconsciously	in a way that one is not fully aware of	without thinking about it	무의식적으로
interleave	to arrange or mix things by placing them alternately	to alternate or weave together	교차 배치하다, 섞다
induce	to cause something to happen or exist	to bring about or give rise to	유도하다, 초래하다
polynomial	a mathematical expression consisting of variables and coefficients, involving only the operations of addition, subtraction, multiplication, and non-negative integer exponentiation of variables	a type of expression with multiple terms	다항식
sinusoidal	having the shape or characteristics of a sine wave	wave-like	사인 곡선의
piecewise	defined or done in separate parts or segments	in segments	구간별로, 조각조각
impose	to force something to be accepted or put in place	to establish or apply	부과하다, 강요하다
dubious	hesitating or doubting	uncertain or questionable	의심스러운
presumably	used to convey that what is assumed is likely to be true	probably	아마, 추정컨대
arbitrary	based on random choice or personal whim, rather than any reason or system	random or capricious	임의의, 자의적인
skimpy	insufficient in quantity or quality	scanty or meager	부족한, 빈약한
disjunctive	relating to or denoting a logical operation that joins two or more propositions with "or"	expressing alternatives (either/or)	분리적인, 대립적인
parity	the state or condition of being equal or equivalent	equality	동등성
intractable	difficult to manage or control	stubborn or unmanageable	다루기 힘든
at someone's disposal	available to be used by someone	at their command	~가 다룰 수 있는
deviate	to depart from an established course or norm	to diverge or stray	벗어나다
asymptotic	approaching a limit as closely as possible	nearing a boundary	점근적인
univariate	involving only one variable	single-variable	단일 변수의
heterogeneous	composed of different or diverse elements	mixed or varied	이질적인
derivation	the process of obtaining something from a source or origin	extraction	유도, 파생
consolidate	to combine or unite into a single entity	to merge or strengthen	통합하다, 강화하다
asymptotically	in a manner that approaches a limit	nearing a boundary	점근적으로
preliminary	serving as a preparation or introduction	initial or preparatory	예비의, 준비의
harness	to make use of something effectively	to utilize	활용하다
repertoire	a collection or set of skills, abilities, or resources	a range or inventory	레퍼토리
visuomotor	relating to the coordination of visual and motor functions	visual-motor	시각 운동의
owing to	because of	due to	~때문에, ~덕분에
trajectory	the path followed by a moving object	path or course	궤적
quadratic	relating to a polynomial of the second degree	second-degree	이차의
stationarity	the property of a process whose statistical properties do not change over time	stability	정상성
pseudoinverse	a generalization of the matrix inverse to non-square or singular matrices	generalized inverse	유사 역행렬
logarithmic	relating to the logarithm of a quantity	log-based	로그의
spherical	relating to a sphere	sphere-based	구형의
interchangeable	able to be exchanged or replaced with something else	replaceable	교체 가능한
admit	to acknowledge or accept the existence or truth of something	to confess or recognize	인정하다
differentiable	having a derivative at every point of its domain	able to be differentiated	미분 가능한
parametric	relating to or expressed in terms of parameters	parameter-based	모수의
extrapolation	the process of estimating values beyond the known data points	estimation beyond known data	외삽
interpolation	the process of estimating values within the range of known data points	estimation within known data	보간, 내삽
plateaued	having reached a state of little or no change after a period of activity or progress	stabilized	정체된
compelling	evoking interest, attention, or admiration in a powerfully irresistible way	captivating	매력적인

CLIPort Review

August 21, 2025 · about 2 min read

Eunkwang Shin

Owner

Key Idea

CLIPort proposes a two-stream architecture for vision-based manipulation:
- Semantic pathway (what): leverages CLIP for broad semantic understanding.
- Spatial pathway (where): leverages Transporter for fine-grained spatial reasoning.
This design is inspired by the two-stream hypothesis in cognitive psychology (ventral/dorsal pathways).

Framework Contributions

Benchmark Extension: Expanded the Ravens benchmark with language-grounding tasks for manipulation.
Two-Stream Architecture: Uses pre-trained vision-language models (CLIP) to condition precise manipulation policies with language goals.
Empirical Results: Demonstrates robustness on diverse manipulation tasks, including multi-task settings and real-robot experiments.

Architectural Design

CLIPort integrates semantic (CLIP) and spatial (Transporter) features through lateral fusion.
The semantic stream is conditioned with language features from CLIP’s text encoder and fused with intermediate spatial features.
Enables end-to-end learning of affordance predictions (pick-and-place) without explicit object models, segmentations, or symbolic states.

Key Insights

Formulates manipulation as action detection (where to act), instead of object detection.
Tabula rasa systems (like plain Transporter) require new demonstrations for every goal/task. CLIPort addresses this with a strong semantic prior (from CLIP) to generalize across tasks and concepts.
Language-conditioned policies provide an intuitive interface for specifying goals and transferring concepts.
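
To make "action detection" concrete, here is a minimal sketch that assumes the policy outputs a dense per-pixel affordance map and that the pick location is simply the highest-scoring pixel. The map values and the helper name `best_pixel` are hypothetical. The snippet shows only the final selection step, not the CLIPort network itself.

```python
def best_pixel(affordance_map):
    """Return the (row, col) of the highest-scoring pixel in a 2D score map."""
    scored = (
        (value, (r, c))
        for r, row in enumerate(affordance_map)
        for c, value in enumerate(row)
    )
    # Compare by score only; ties keep the first pixel encountered.
    _, location = max(scored, key=lambda item: item[0])
    return location


# Hypothetical 3x4 pick-affordance map for one observation and instruction.
pick_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.0, 0.9, 0.3, 0.1],
    [0.0, 0.1, 0.2, 0.1],
]
pick_location = best_pixel(pick_map)
print(pick_location)
```

The output is a location to act on rather than a bounding box or object label, which is why no explicit object model is needed at this step.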

Experimental Results

Simulation (PyBullet, UR5 robot with suction gripper):
- 10 language-conditioned tasks with thousands of unique instances.
- Multi-task CLIPort outperformed or matched single-task models, even with fewer demonstrations.
- CLIP-only or Transporter-only baselines saturate, while CLIPort exceeds 90% success with just 100 demos.
Generalization:
- CLIPort generalizes to unseen attributes (e.g., new colors, shapes, object categories).
- It still struggles with completely novel attributes (e.g., “pink” or “orange” when these were never seen in training).
Real-World Robot Experiments (Franka Panda):
- Achieved ~70% success on real tasks with just 179 demonstrations.
- Performance trends were consistent with simulation, validating sim-to-real transfer.

Conclusion

CLIPort shows that multi-task, language-conditioned policies generalize across tasks better than object-centric or tabula rasa methods.
With action abstraction and spatio-semantic priors, end-to-end models can learn new skills without requiring hand-engineered pipelines.
Limitations remain for dexterous 6-DoF manipulation and complex continuous control.

Ref

Shridhar, M., Manuelli, L., & Fox, D. (2022). CLIPort: What and where pathways for robotic manipulation. Conference on Robot Learning.
