
IAI +004

· About 21 min read

Influencing improvement

  • The agent's components
  • The agent's prior knowledge, which influences the model it builds
  • The feedback available to learn from

Components

  • Given an intelligent agent that performs some intelligent tasks, any component of the agent program can be improved by learning.

Prior knowledge

  • Inductive learning: learning a general function or rule (possibly incorrect) from specific input-output pairs
    • Bottom to top
    • Specific to general
  • Deductive learning: going from a general rule to a new specific rule that is logically entailed, which is useful because it allows more efficient processing
    • Top to bottom
    • general to specific

Feedback

  • feedback on its percept sequence
  • no feedback on its percept sequence
  • rewards for taking a sequence of actions based on its percept sequence
| Aspect | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Training data | labeled | unlabeled |
| Computational complexity | simpler | computationally complex |
| Accuracy | high | less accurate |

Supervised learning

  • the agent observes some examples of input-output pairs first and then learns a function or a relationship that maps from inputs to output.
  • Attributes/Features: the inputs are independent variables in the problem domain
  • Target attribute: the output is the dependent variable which is dependent on the inputs.
  • Model: the learned function or relationship
  • The agent learns a model using examples and uses this model to predict the outcomes for new inputs.

Unsupervised learning

  • The agent collects adequate examples in the problem domain but does not get any explicit feedback on the examples.
  • The agent can make sense of the examples through identifying clusters or frequent patterns in the data.
  • When shown a large number of examples, the agent can learn to identify clusters of similar examples.

Reinforcement learning

  • the agent learns from a series of rewards or punishments that follow its actions, and uses them to improve its performance on the task under consideration.
  • the feedback helps the agent reinforce positive actions and reduce negative actions by adjusting its policy.

Supervised Learning Technique

  • Decision tree
  • Random forests
  • Linear regression
  • Logistic regression
  • K nearest neighbours
  • Support vector machines
  • Neural networks

Regression problem

  • to predict a continuous value as the output for a given input
  • weather temperature: solar radiation, wind direction and speed, geographic location..
  • how to predict the output value of a new data instance on the basis of observed features from the existing data (historical examples) in the problem domain.
  • Elements
    • Collection of existing or historical data samples which are represented by a set of attributes or independent variables
    • The output values of the existing data samples
      • the output variable or attribute must be continuous
  • Regressor: a function that describes the relationship between the attributes of a data sample and the output.
    • takes the values of attributes of a data sample and predicts the output value of this given data sample.

Evaluate a regressor

  • R-squared / Adjusted R-squared
  • MSE (Mean Squared Error) / RMSE (Root Mean Squared Error)
  • MAE (Mean Absolute Error)

Examples of regression problems

  • Predict the fuel price using the Brent crude oil price, financial performance of the oil related companies (cash flow, projects lined up, etc.) and/or geopolitical risks (OPEC announcements, government sanctions, etc.)
  • Predict the house price of a suburb from the suburb's profile
  • Predict the blood pressure of a patient based on the patient's health profile
  • Predict the electricity price using temperature, demand and time.

Classification problem

  • to predict discrete or categorical value as the output for a given input
  • Pass or Failed given learning outcomes, student ID, prior learning, attitude, commitment and attendance.
  • how to put a new data instance into one of predefined categories or classes on the basis of observed features from the existing data in the problem domain.
  • Elements
    • Collection of existing or historical data samples with class labels
    • Predefined categories or classes
    • Adequate samples in each category or class in the existing or historical data.
  • Logic-based techniques
    • Decision tree
    • Learning set of rules
  • Perceptron-based techniques
    • Single-layer perceptron
    • Multi-layer perceptron
    • RBF network
  • SVM
  • Statistical learning techniques
    • Naive Bayes classifier
    • Bayesian networks
  • Instance-based learning
    • K-nearest neighbor (KNN)

Evaluate a classifier

  • Confusion matrix
  • Precision
  • Recall/Sensitivity
  • Specificity
  • F1-Score
  • Area Under Curve & Receiver Operating Characteristics Curve (AUC-ROC)

Examples of classification problems

  • banking, healthcare, medical diagnosis, marketing (sentiment analysis), telecommunication, agriculture, security (fraud detection).
  • e-mails into spam or non-spam class
  • loan applications into an approved or a rejected class.
  • patients into having a certain disease or not having that disease groups.
  • text into positive or negative sentiment.
  • customers into churn or non-churn classes.

Overfitting

  • general phenomenon with all types of learning models.
  • a modeling error that occurs when a function is too closely or exactly fit to a limited set of data points.
  • more likely as the complexity of models and the number of input attributes increase
  • less likely as the number of training examples increases.

Decision Tree

  • if-then statements to define patterns in data
  • An if-then statement splits the training data into two or more branches based on some values
  • Best Split: The results of each branch should be as homogeneous as possible, i.e. have the lowest impurity possible.
    • Information gain
    • Gini index

Implement Decision Tree

  • At each node, greedily choose the split (a feature and a condition) that leads to the lowest impurity in the resulting child nodes.
  • For categorical features: each unique value can be a split condition.
  • For continuous features: midpoints between consecutive sorted unique values are used as split conditions.
  • For each potential split condition, the algorithm calculates the impurity of the resulting child nodes.
  • The split condition with the lowest resulting impurity becomes the split point for that branch.
  • The process is then repeated recursively for each child node until all leaf nodes are pure, or the stopping criteria are met.
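The split search for a single continuous feature can be sketched as follows. This is a minimal illustration, assuming Gini impurity as the criterion and a binary split of the form `x <= threshold`; the function names and the toy data are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted unique values and return
    (threshold, weighted child impurity) for the lowest-impurity split."""
    uniq = sorted(set(values))
    n = len(labels)
    best = (None, float("inf"))
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

ages = [22, 25, 30, 35, 40]                  # hypothetical feature values
bought = ["no", "no", "yes", "yes", "yes"]   # hypothetical class labels
print(best_threshold(ages, bought))          # (27.5, 0.0)
```

A full tree builder would run this search over every feature, keep the best one, and recurse on the two child subsets until a stopping criterion is met.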

The selection of best split attributes

  • ID3: employs a top-down, greedy search through the space of possible branches with no backtracking using information gain
  • C4.5: using information gain ratio
  • CART: using Gini Index
  • Gini Index
  • Chi-Square
  • Reduction in Variance

Entropy

the fundamental quantity in information theory. It is a measure of the uncertainty of a random variable.

  • A more homogeneous node with a clear majority class has low impurity and low entropy, while a more mixed distribution of classes has high impurity and high entropy.

Information gain

the decrease in entropy.

  • The information gain from the attribute test on (split on A) is the expected reduction in entropy.
  • Information gain computes the difference between entropy before the split and average entropy after the split of the dataset based on given attribute values
  • $Entropy(S) = - \sum_{i} p_i \log_2 p_i$
  • $Gain(S,A) = Entropy(S) - Entropy_{remain}(S,A)$
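A small worked example of the two formulas, as a sketch: `entropy` and `information_gain` are illustrative helper names, and the remaining entropy is computed as the size-weighted average entropy of the subsets produced by the split. The toy attributes are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A): entropy before the split minus the weighted
    average entropy of the subsets after splitting on A."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute_values, labels):
        groups.setdefault(a, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

play = ["yes", "yes", "no", "no"]
windy = [False, False, True, True]   # separates the classes perfectly
humid = [True, False, True, False]   # tells us nothing about the class
print(entropy(play))                  # 1.0
print(information_gain(play, windy))  # 1.0
print(information_gain(play, humid))  # 0.0
```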

Gini index

  • Another impurity measure commonly used for classification tasks in decision trees.
  • a lower Gini index indicates lower impurity, meaning that the samples in the node predominantly belong to a single class
  • a bit more computationally efficient than entropy because it does not involve logarithm calculations, but the results are quite similar.
  • $Gini(S) = 1 - \sum_{i=1}^{K} p_i^2$

Variance

  • For regression tree, the target variable is continuous rather than categorical.
  • use variance as a measure of impurity in regression trees.
  • lower variance indicates that the data points are closely clustered around the mean
  • $Var(S) = \frac{1}{N} \sum_{i} (y_i - \mu)^2$
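To illustrate how variance acts as an impurity measure, the sketch below computes the variance of a node and the reduction obtained by a candidate split. Weighting the child variances by subset size is an assumption, chosen to mirror how information gain is computed; the prices are invented.

```python
def variance(ys):
    """Var(S) = (1/N) * sum_i (y_i - mu)^2."""
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys) / len(ys)

def variance_reduction(ys, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(ys)
    after = len(left) / n * variance(left) + len(right) / n * variance(right)
    return variance(ys) - after

prices = [100, 110, 300, 310]   # hypothetical target values in one node
print(variance(prices))                                       # 10025.0
print(variance_reduction(prices, [100, 110], [300, 310]))     # 10000.0
```

The split that groups the two cheap and the two expensive samples removes almost all of the variance, so a regression tree would prefer it.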

Prediction

  • For classification tasks: the predicted class label is the majority class among the training samples in the leaf node.
  • For regression tasks: the predicted value is the mean of the target values of the training samples in the leaf node.

Dealing with Overfitting

  • Overfitting: a common issue in decision trees, where the model captures noise or outliers in the training data rather than the underlying pattern.
  • the model performs poorly when applied to new, unseen data.

Setting Stopping Criteria

  • prevent the tree from becoming overly complex, which may lead to overfitting
  • applied during the tree construction process
  • limiting the maximum depth of the tree
  • setting a minimum number of samples per leaf node
  • requiring a minimum impurity decrease for a split

Pruning Strategies

  • applied after the tree has been fully grown
  • removing branches from the fully grown tree to simplify its structure
  • ensure that it captures the underlying patterns in the data rather than noise or outliers
  • Pruned trees perform significantly better than unpruned trees when the data contain a large amount of noise.

Ensemble Methods

  • combine multiple decision trees to form a more robust and accurate model.
  • address overfitting by averaging the predictions of the individual trees, reducing variance and improving generalization.
  • Random Forests
  • Gradient Boosted Trees

Random Forest

combines multiple weak decision tree models to create a stronger learning model.

  • two types of randomness are introduced to ensure that the individual decision trees are diverse and less prone to overfitting.
  • Random sampling of the input data
  • Bootstrapping:
    • involves sampling with replacement from the original dataset (so some instances appear multiple times and others not at all), creating a new dataset.
    • each decision tree is trained on a slightly different set of data points, reducing the likelihood of overfitting.
  • Random selection of features at each split
    • At each split in each decision tree, a random subset of features is considered when determining the best split.
    • each tree in the ensemble does not rely on the same set of features for making decisions, resulting in a more diverse set of trees.
    • By considering only a subset of features at each split, the model is less likely to be influenced by a small number of dominant features, leading to a more balanced and accurate prediction.
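| Category | Data randomness (Bootstrapping) | Feature randomness (Feature Subset Selection) |
| --- | --- | --- |
| Where applied | When selecting the training data for a tree | At each split of a tree |
| Method | Sample from the original dataset with replacement to create a new training subset | Randomly select a subset of all features, then search for the split criterion among those features only |
| Characteristics | Each tree is trained on different data points; some samples appear several times, some may be left out | Each split can use different features; no excessive reliance on the same features |
| Effect | Data diversity across trees; reduced overfitting | Feature diversity across trees; reduced influence of a few dominant features |
| Result | Trees that reflect more diverse data scenarios | Trees that reflect more diverse decision rules |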
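The two sources of randomness can be sketched with the standard library alone. The helper names, the toy rows, and the feature names are hypothetical; a real implementation would draw a fresh feature subset at every split of every tree.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: some repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features to consider at one split."""
    return rng.sample(feature_names, k)

rng = random.Random(0)                            # seeded for repeatability
rows = ["r%d" % i for i in range(6)]              # stand-ins for training rows
features = ["age", "income", "tenure", "region"]  # hypothetical feature names

print(bootstrap_sample(rows, rng))            # training set for one tree
print(random_feature_subset(features, 2, rng))  # candidates for one split
```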

Predict with Random Forest

  • aggregating the predictions of all individual decision trees in the forest.
  • Majority voting: for classification, count the number of times each class is predicted by the individual decision trees. The class with the highest count is the final prediction.
  • Averaging: for regression, calculate the mean of the predictions made by the individual decision trees. The mean value is the final prediction.
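Both aggregation rules fit in a few lines. This sketch assumes the per-tree predictions have already been collected into a list; the function names are illustrative.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Classification: return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regression: return the mean of the trees' predictions."""
    return mean(predictions)

print(majority_vote(["spam", "ham", "spam"]))  # spam
print(average([10.0, 12.0, 14.0]))             # 12.0
```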

Linear regression

a learning technique that fits a linear function from the input variables to the target variable, under the fundamental assumption that the relationship between them is linear

  • e.g. the input variables (engine size, weight and car age) ➡️ target variable (car fuel efficiency)
    • assumption that there is a linear relationship
  • A linear regression technique learns a set of coefficients to estimate the linear relationship between $x$ and $y$, denoted $h_w$, which can be represented by the following equation.
    • $h_w(x) = w_0 + w_1x_1 + ... + w_nx_n = \sum_{i=0}^{n} w_ix_i$, with $x_0 = 1$
    • $w$ is a weight vector
    • $\hat{y} = \sum_{i=0}^{n} w_ix_i$
  • Because a linear regression model only approximates the relationship between the input variables and the target variable, there will be an error between the output of the model and the actual output value for a data sample
    • This error can be represented by a loss function, which calculates the (halved) mean squared error
    • $Loss(h_w) = \frac{1}{2m}\sum_{j=1}^{m}(h_w(x_j) - y_j)^2 = \frac{1}{2m}\sum_{j=1}^{m}(y_j - \sum_{i=0}^{n} w_ix_{j,i})^2$
  • for solving regression problems

Solving a linear regression problem

  • to find the linear relationship $h_w$ that best fits the training data of $m$ data samples.
    • i.e. the one that minimises the loss.
  • to find the best weight vector $w^*$ for a given training dataset of $m$ data samples:
    • $w^* = \arg\min_{w} Loss(h_w)$
  • gradient descent: repeatedly moves the weights $w$ a small step in the direction opposite to the gradient of the loss.
    • until the loss function $Loss(h_w)$ converges to its minimum (the squared-error loss is convex, so this is the global minimum)
    • to change the individual components of $w$ a little bit in the direction that minimises $Loss(h_w)$, and to do this many times.
  • $w_i \leftarrow w_i + \alpha \sum_{j=1}^{m} x_{j,i} \big( y_j - h_w(x_j) \big)$
    • $\alpha$: the step size, the learning rate
  • Training model: the process of iteratively updating weights with a learning rate to minimise loss, where the final weight vector defines the model used for predicting new data.
  • use regularisation on a multivariate linear function to avoid overfitting.
  • Batch gradient descent: consider the entire training dataset $(X, y)$ at once.
    • $w_0 \leftarrow w_0 + \alpha \sum_{j=1}^{m} \big(y_j - (w_0 + w_1 x_j)\big)$
    • $w_1 \leftarrow w_1 + \alpha \sum_{j=1}^{m} \big(y_j - (w_0 + w_1 x_j)\big) \cdot x_j$
  • Stochastic gradient descent (SGD): consider only a single training data sample $(x_j, y_j)$ at a time.
    • $w_0 \leftarrow w_0 + \alpha \big( y_j - (w_0 + w_1 x_j) \big)$
    • $w_1 \leftarrow w_1 + \alpha \big( y_j - (w_0 + w_1 x_j) \big) \cdot x_j$
    • can be used in an online setting, where new data is coming one at a time, or offline, where we cycle through the same data as many times as is necessary, taking a step after considering each single example.
    • With a fixed learning rate $\alpha$, the stochastic version does not guarantee convergence.
    • often faster than batch gradient descent.
    • With a schedule of decreasing learning rates (as in simulated annealing, SA), the stochastic version does guarantee convergence.
  • These update rules are obtained by taking the partial derivatives of the loss function with respect to $w_0$ and $w_1$.
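The batch and stochastic update rules for the univariate case can be tried out directly. This is a sketch under assumptions: the learning rate, the epoch count, and the noise-free toy data (generated from y = 1 + 2x) are arbitrary choices for illustration, not values from the notes.

```python
def batch_gd(xs, ys, alpha=0.01, epochs=5000):
    """Batch gradient descent: one update per pass over all samples."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        errors = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        step0 = sum(errors)
        step1 = sum(e * x for e, x in zip(errors, xs))
        w0 += alpha * step0
        w1 += alpha * step1
    return w0, w1

def sgd(xs, ys, alpha=0.01, epochs=5000):
    """Stochastic gradient descent: one update per individual sample."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - (w0 + w1 * x)
            w0 += alpha * err
            w1 += alpha * err * x
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x, so the best fit is w0=1, w1=2

print(tuple(round(w, 3) for w in batch_gd(xs, ys)))  # (1.0, 2.0)
print(tuple(round(w, 3) for w in sgd(xs, ys)))       # (1.0, 2.0)
```

Because this toy data has no noise, SGD with a fixed learning rate also settles at the exact solution; on noisy data it would keep fluctuating around the minimum unless the learning rate is decreased over time.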

Logistic Regression

an extension of linear regression in such a way that the output of a linear regression model goes through a logistic function

  • $y(x) = \frac{1}{1 + e^{-x}}$
  • The output value of this logistic function is between 0 and 1.
  • 0 is for certainly being labeled "0" and 1 is for certainly being labeled "1", and a value between 0 and 1 represents the probability of being labeled "1"
  • a logistic regression model: a linear regression model + a logistic function
  • mainly for solving classification problems
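The "linear model plus logistic function" structure can be sketched as below. The weights are invented for the example, and the 0.5 decision threshold is a common default rather than something the notes prescribe.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    """weights[0] is the intercept w_0; the rest pair up with the features in x."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return sigmoid(z)

def predict_label(weights, x, threshold=0.5):
    """Turn the probability of label 1 into a hard 0/1 decision."""
    return 1 if predict_proba(weights, x) >= threshold else 0

print(sigmoid(0))                          # 0.5
print(predict_label([-3.0, 1.0], [5.0]))   # linear output is 2.0, so label 1
print(predict_label([-3.0, 1.0], [1.0]))   # linear output is -2.0, so label 0
```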

Nearest Neighbor

a technique to predict the output of a given new sample based on a collection of existing samples.

  • The idea is to find the k nearest neighbours of a given sample in the collection and determine the output based on these k neighbours.
  • k is usually chosen to be an odd number, which avoids tied votes in binary classification.
  • can be used for both classification and regression problems.
    • For classification: majority vote of the neighbours.
    • For regression: mean/median (or regression) of the neighbours.
  • Instance-based learning
    • KNN does not learn a separate model.
    • Instead, it stores all training data and uses them directly at prediction time.
  • Non-parametric model
    • KNN has no parameters (like weights in linear regression) to train.
    • The model is essentially the full dataset plus a distance measure.
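A brute-force version of both uses of KNN, as a sketch: it assumes Euclidean distance (`math.dist`, available from Python 3.8) and represents the "model" as a plain list of `(features, output)` pairs. The data points are invented.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training items closest to the query point."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: mean of the k nearest target values."""
    targets = [t for _, t in k_nearest(train, query, k)]
    return sum(targets) / len(targets)

points = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(knn_classify(points, (2, 2)))   # A

series = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]
print(knn_regress(series, (2,)))      # 20.0
```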

Distance measures

  • Minkowski distance or $L^p$ norm
    • $L^p(x_j, x_q) = \left( \sum_i |x_{j,i} - x_{q,i}|^p \right)^{1/p}$
    • Euclidean distance: $p = 2$, used when the dimensions measure similar properties, such as the width, height and depth of 3D objects.
    • Manhattan distance: $p = 1$, used when the dimensions measure dissimilar properties, such as age, weight, and gender of a patient.
    • Hamming distance: the number of attributes on which the two points differ, for Boolean attribute values
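The three distance measures are short enough to write out. This is an illustrative sketch with made-up function names; the 3-4-5 triangle makes the Euclidean and Manhattan results easy to check by hand.

```python
def minkowski(a, b, p):
    """L^p distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    """Number of attributes on which the two points differ."""
    return sum(x != y for x, y in zip(a, b))

print(minkowski((0, 0), (3, 4), 2))   # 5.0  (Euclidean)
print(minkowski((0, 0), (3, 4), 1))   # 7.0  (Manhattan)
print(hamming((True, False, True), (True, True, False)))   # 2
```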

Normalization

  • If we use the raw data from each dimension, the total distance will be affected by a change in scale in any dimension.
  • To avoid this, apply normalization to the measurements in each dimension.
  • A common approach is to compute the mean $\mu_i$ and standard deviation $\sigma_i$ of the values in each dimension, and rescale them.
  • The rescaling is done using the formula:
    • $x'_{j,i} = \frac{x_{j,i} - \mu_i}{\sigma_i}$ where $x'_{j,i}$ is the normalized value, $x_{j,i}$ is the original value, $\mu_i$ is the mean, and $\sigma_i$ is the standard deviation.
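Applied to one dimension (one column of the dataset), the rescaling looks like this. Using the population standard deviation (`statistics.pstdev`) is an assumption; the notes do not say whether the population or the sample version is meant, and a column with zero spread would need special handling.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale one dimension to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)   # population standard deviation (assumed)
    return [(v - mu) / sigma for v in column]

heights_cm = [150.0, 160.0, 170.0]   # hypothetical raw measurements
print(standardize(heights_cm))       # symmetric values around 0.0
```

After this step a 10 cm difference in height and a 10 kg difference in weight are both expressed in standard deviations, so neither dominates the distance purely because of its unit.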

Time complexity

  • Conceptually trivial: Given a set of $N$ examples and a query $x_q$, iterate through the examples, measure the distance to $x_q$ from each one, and keep the best $k$.
  • The time complexity of $NN(k, x_q)$ is $O(N)$, where $N$ is the number of examples in the training dataset.
  • Use a k-d tree (k-dimensional tree): a balanced binary tree over data with an arbitrary number of dimensions.
    • Time complexity can be improved to $O(\log N)$
    • appropriate only when there are many more examples than dimensions
    • It works well with up to 10 dimensions with thousands of examples.
  • Use a hash table with a locality-sensitive hash (LSH)
    • Time complexity can be improved to $O(1)$ (for approximate nearest neighbours)

SVM

a framework for finding a boundary that distinctly classifies the data points in an optimal way.

  • supervised learning, binary classification
  • SVM chooses the boundary with the maximum possible geometric margin, which has the largest distance to the nearest training data points of any class
  • initially designed for binary classification problems but can also be applied for solving multi-class classification problems

Linear discriminant

  • Each input $x_i$ is multiplied by its matching weight $w_i$
  • all these products are added together and passed to a threshold function
  • Decision surface: if $g(x) = w \cdot x > 0$ then $f(x) = +1$ (class 1) else $f(x) = -1$ (class 2)
  • Decision function: $f(x) = \text{sign}(g(x)) = \text{sign}(w_0 + w_1x)$
    • To make a decision, the continuous value $g(x)$ is passed through the sign function so that it outputs either +1 or -1.
  • If the data from the two classes can be separated with a hyperplane, they are linearly separable.
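A linear discriminant amounts to a weighted sum followed by a sign check. In this sketch the intercept $w_0$ is kept as a separate argument, the weights are invented, and points exactly on the boundary are assigned to class -1, which is an arbitrary choice.

```python
def g(w, w0, x):
    """Linear score: w0 + sum_i w_i * x_i."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def classify(w, w0, x):
    """Threshold the score: +1 if g(x) > 0, otherwise -1."""
    return 1 if g(w, w0, x) > 0 else -1

w, w0 = (1.0, 1.0), -3.0          # hypothetical boundary: x1 + x2 = 3
print(classify(w, w0, (2.0, 2.0)))   # 1   (score  1.0)
print(classify(w, w0, (1.0, 1.0)))   # -1  (score -1.0)
```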

Hyperplane

  • separates the data in 2D by a line or in 3D by a plane
  • The orientation of the hyperplane is given by the vector ww
  • the location of the hyperplane is given by w0w_0
  • The distance from the origin to the hyperplane is $\frac{|w_0|}{\|w\|}$
  • If a data sample $x^*$ satisfies $g(x^*) = 0$, it lies on the separation boundary and can normally be assigned to either class.
  • geometric margin: the signed distance from a sample to the hyperplane; the margin of the dataset is the minimum of these distances over all samples. The best hyperplane is found by constructing and solving a constrained optimization problem.
    • $\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \right)$
  • primal optimization problem: to maximize the minimal geometric margin across the training dataset of $N$ samples.
    • $\max_{w, w_0} \Big( \min_{i=1,\ldots,N} \gamma_i \Big) = \max_{w, w_0} \Big( \min_{i=1,\ldots,N} \Big( y_i \Big( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \Big) \Big) \Big)$
    • which can be rewritten as $\min_{w, w_0} \; \frac{1}{2}\|w\|^2$
    • $\text{s.t. } \; y_i (w \cdot x_i + w_0) \geq \min_{i=1,\ldots,N} \big( y_i (w \cdot x_i + w_0) \big)$ (by convention the right-hand side is fixed to 1, giving $y_i (w \cdot x_i + w_0) \geq 1$)
  • dual optimization problem: easier to solve. More importantly, the dual optimisation problem enables the so-called kernel trick in SVM
    • $\max_{\alpha} \; \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
    • or equivalently $\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) - \sum_{i=1}^N \alpha_i$
    • $\text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i=1,2,\ldots,N$
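The geometric margin formula can be checked numerically. The hyperplane and the two samples below are invented; the second call shows that scaling $(w, w_0)$ by a constant leaves the geometric margin unchanged, which is why the primal problem is free to fix the scale of $w$.

```python
import math

def geometric_margin(w, w0, x, y):
    """gamma_i = y_i * (w . x_i + w0) / ||w||  (signed distance to the hyperplane)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm

def dataset_margin(w, w0, samples):
    """Margin of the dataset: the smallest geometric margin over all samples."""
    return min(geometric_margin(w, w0, x, y) for x, y in samples)

samples = [((3.0, 4.0), 1), ((0.0, 0.0), -1)]   # (features, label in {+1, -1})

print(dataset_margin((3.0, 4.0), -12.5, samples))   # 2.5
print(dataset_margin((6.0, 8.0), -25.0, samples))   # 2.5 (same hyperplane, rescaled)
```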

Attractive Properties

  • SVMs construct a maximum margin separator
    • the largest possible distance to example points, helping to improve generalization
  • SVMs create a linear separating hyperplane
    • kernel trick: to embed the data into a higher-dimensional space
    • Often data that are not linearly separable in the original input space are easily separable in a higher-dimensional space
    • In general (except in some special cases) if we have $N$ data points then they will always be separable in spaces of $N$ dimensions or more
  • SVMs are a nonparametric method
    • retain training examples and potentially need to store them all
    • In practice, they often end up retaining only a small fraction of examples
    • have the flexibility to represent complex functions, but they are resistant to overfitting
  • We do not usually expect to find a linear separator in the input space $x$, but we can find linear separators in the high-dimensional feature space $F(x)$ simply by replacing $(x_j \cdot x_k)$ in
    • $\arg\max_{\alpha} \sum_{j}\alpha_j - \frac{1}{2} \sum_{j,k}\alpha_j \alpha_k y_j y_k (x_j \cdot x_k)$
      • with $K(x_j, x_k) = F(x_j) \cdot F(x_k)$
      • $F(x_j) \cdot F(x_k)$ can often be computed without first computing $F$ for each point.
  • In a higher-dimensional feature space created by the transformation $F(x)$, if we can express $K(x_j, x_k) = F(x_j) \cdot F(x_k)$, the kernel function $K(x_j, x_k)$ can be applied to pairs of input data to evaluate the dot product in the corresponding feature space.
    • The kernel trick is to plug a kernel function $K(x_j, x_k)$ into the dual optimisation problem to replace $(x_j \cdot x_k)$
    • Optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.
    • we can learn in the higher-dimensional space, but we compute only kernel functions rather than the full list of features for each data point.
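A standard textbook illustration of the kernel trick (not taken from these notes) is the quadratic kernel $K(x, z) = (x \cdot z)^2$ on 2-D inputs, whose feature map is $F(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$. The sketch below checks that the kernel, computed in the input space, matches the dot product in the 3-D feature space.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def feature_map(x):
    """Explicit map F for 2-D input that corresponds to K(x, z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def quadratic_kernel(x, z):
    """Same value as dot(F(x), F(z)), computed without ever building F."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
print(quadratic_kernel(x, z))                # 121.0
print(dot(feature_map(x), feature_map(z)))   # about 121.0, up to rounding
```

For this small map the saving is negligible, but the same idea lets an SVM work in feature spaces that are far too large to enumerate.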

Classification evaluation metrics

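| Term | Description |
| --- | --- |
| True Positive (TP) | actual 1, predicted 1 |
| True Negative (TN) | actual 0, predicted 0 |
| False Positive (FP) | actual 0, predicted 1 (a 0 wrongly predicted as 1) |
| False Negative (FN) | actual 1, predicted 0 (a 1 missed and predicted as 0) |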

Accuracy

the proportion of correctly classified instances (data points or samples) among the total instances.

$$Accuracy = \frac{TP + TN}{\text{Total number of predictions}}$$

  • the ratio of the number of correct predictions to the total number of predictions
  • it may not be suitable for imbalanced datasets where the class distribution is skewed
    • a model that always predicts the majority class will have high accuracy but may not be useful in practice.

Precision

the proportion of true positives among all positive predictions.

$$Precision = \frac{TP}{\text{number of positive predictions}} = \frac{TP}{TP + FP}$$

  • the model's ability to not mistakenly view negatives as positives
  • A high precision value indicates that the model has made fewer false positive predictions.
  • it's useful to minimize the number of false positives.

Recall

Also called sensitivity or true positive rate: the proportion of true positive instances among the actual positive instances

$$Recall = \frac{TP}{\text{number of positive instances}} = \frac{TP}{TP + FN}$$

  • the model's ability to not mistakenly view actual positives as negatives
  • A high recall value indicates that the model has successfully identified a large portion of the actual positive instances
  • it's useful when the cost of false negatives is high
    • e.g. in medical diagnosis, where failing to identify a disease can have severe consequences

F1 Score

the harmonic mean of precision and recall

$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$

  • a balanced evaluation of the model's performance
  • It is particularly useful for imbalanced datasets, where one class is significantly more prevalent than the other, because it balances precision and recall.
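The four classification metrics follow directly from the confusion-matrix counts. The counts below are invented to mimic an imbalanced dataset; the sketch does not guard against division by zero (for example when there are no positive predictions).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 18 actual positives out of 100 samples.
m = classification_metrics(tp=8, tn=80, fp=2, fn=10)
for name, value in m.items():
    print(name, round(value, 3))
# accuracy 0.88, precision 0.8, recall 0.444, f1 0.571
```

Accuracy looks good here, yet recall shows that more than half of the actual positives were missed, which is the situation the F1 score is meant to expose.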

Regression evaluation metrics

Mean Absolute Error (MAE)

the average of the absolute differences between the predicted values and the actual values

$$MAE = \frac{1}{n} \sum_{i=1}^n \lvert y_i - \hat{y}_i \rvert$$

  • It measures the average magnitude of errors made by the model, without considering their direction
  • A lower MAE value indicates that the model has made smaller prediction errors

Mean Squared Error (MSE)

the average squared difference between the predicted and actual values

$$MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

  • It's useful when you want to penalise larger errors more heavily, making it more sensitive to outliers at the same time
  • often used as a loss function when training regression models
  • A lower MSE value indicates that the model has smaller prediction errors, with a strong preference for avoiding large errors.

Root Mean Squared Error (RMSE)

the square root of MSE

$$RMSE = \sqrt{MSE}$$

  • It is more interpretable than MSE because it is in the same units as the dependent variable.

R-Squared (Coefficient of Determination)

how well the regression model approximates the actual data

$$R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

  • one minus the ratio of the "sum of squared residuals (SSR)" to the "total sum of squares (SST)"
    • SSR captures the model's prediction errors
    • SST is proportional to the variance of the target variable.
      • can be viewed as the error of a naive model ($\hat{y} = \bar{y}$) that uses the average value of the target variable as the prediction
  • $R^2 = 1$: The model predicts perfectly (the error is 0).
  • $R^2 = 0$: The model does no better than a naive model that always predicts the mean of the target variable.
  • $R^2 < 0$: The model performs worse than simply predicting the mean, meaning its predictions increase the error compared to the naive baseline.
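All four regression metrics can be computed from the same list of residuals. The values below are invented; the second call uses the naive "always predict the mean" model, for which $R^2$ comes out as 0.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R^2 for paired actual and predicted values."""
    n = len(y_true)
    residuals = [y - p for y, p in zip(y_true, y_pred)]
    y_bar = sum(y_true) / n
    ssr = sum(e * e for e in residuals)              # sum of squared residuals
    sst = sum((y - y_bar) ** 2 for y in y_true)      # total sum of squares
    mse = ssr / n
    return {
        "mae": sum(abs(e) for e in residuals) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1 - ssr / sst,
    }

y_true = [1.0, 2.0, 3.0, 4.0]
print(regression_metrics(y_true, [1.0, 2.0, 3.0, 6.0]))   # one large error: R^2 about 0.2
print(regression_metrics(y_true, [2.5, 2.5, 2.5, 2.5]))   # naive mean model: R^2 is 0.0
```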

FSD +005

· About 2 min read

Method

  • a block of code that is grouped together and has a name
  • can be invoked by its name to perform a certain action
  • can have parameters that represent the values needed for the method to run
  • can have local variables usable only within its own code block.

Function vs Procedure

  • Procedure: has no return value, performs an action
    • Example: move(), run(), deposit(), eat()
  • Function: has a return value, does not perform any action
    • Example: total(), sum(), area()
  • A method can both return a value and perform an action (a combined function-procedure), but this is not recommended.

Method Overloading

  • Java allows methods in the same class to have the same name but different parameters.
  • method signature: the method name together with the number and types of the method's parameters.

Parameter vs Arguments

  • Parameters: placeholder variables used in the method definition; they indicate the type and order of the arguments
  • Arguments: data values passed to the method when the method is invoked or called.
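The distinction is easiest to see side by side. This is a made-up example in Python; `area`, `width` and `height` are illustrative names.

```python
def area(width, height):      # width and height are parameters (placeholders)
    return width * height

result = area(3, 4)           # 3 and 4 are arguments (actual values passed in)
print(result)                 # 12
```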

Patterns

The read pattern

def <name>():
    <prompt>
    return <type>

The update read-loop pattern

<read function>
while (<value> != <end value>):
    <use the value>
    <read function>

The array-loop pattern

for <value> in <range>:
    <use the item from array>

The any-pattern

for <item> in <collection>:
    if (<test>):
        return True
return False

The every-pattern

for <item> in <collection>:
    if (not(<test>)):
        return False
return True

The none-pattern

for <item> in <collection>:
    if (<test>):
        return False
return True

Boolean Functions

# Verbose version
def isEven(number):
    if number % 2 == 0:
        return True
    else:
        return False

# Equivalent concise version: return the result of the test directly
def isEven(number):
    return (number % 2 == 0)

Recursion

  • a technique where a method calls itself repeatedly.
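  • A recursive method must provide termination logic (a base case) to avoid infinite execution.

# Both versions assume n is a non-negative integer.
def factorial(n):
    # Base case: 0! = 1! = 1; otherwise recurse on n - 1
    return 1 if (n == 1 or n == 0) else n * factorial(n - 1)

# Same computation with a recursive lambda
def factorial(n):
    F = lambda n: n * F(n-1) if n > 1 else 1
    return F(n)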

Process in Programming

  • process is the method used to solve a problem
  • "Break it down, build it up" is a structured technique for handling complex problems.

RT-2, Robotic Transformer 2 Review

· About 4 min read
  • Trains a Vision-Language-Action (VLA) model by co-fine-tuning web-scale VLMs with robot trajectories, and treats robot actions as text tokens.
  • Yields strong generalization and emergent capabilities (symbol understanding, reasoning, human recognition) beyond what appears in robot data.
  • Runs in direct closed-loop control; largest evaluated model (55B) executes at ~1–3 Hz via a cloud (multi-TPU) inference setup.

RT-2 Architecture

What RT-2 Is

  • A family of VLA models (RT-2-PaLI-X, RT-2-PaLM-E) that fine-tune large VLMs on robot trajectories to output low-level actions.
  • Target: generalizable, semantically aware manipulation policies that map images + instructions → actions end-to-end.
  • RT-2 does not rely on a restricted 2D action space or calibrated cameras.
  • The unified output space lets language and action tokens share the same model weights, without action-only layers.

Core Recipe

  • Directly train open-vocabulary VQA/dialogue VLMs to output robot actions while they still solve standard vision-language tasks.
  • Build on RT-1 protocol/data, but replace the policy backbone with a large VLM.

Action as Language (Tokenization)

  • Discretize continuous action dims (Δpos/Δrot, gripper, terminate) into 256 bins; represent each dimension with an integer token.
  • PaLI-X: reuse numeric tokens (≤1000). PaLM-E: overwrite 256 least-frequent tokens as action vocabulary (symbol tuning).
  • Form a single output string per step (e.g., terminate Δposx Δposy Δposz Δrotx Δroty Δrotz gripper).
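The tokenization idea can be sketched as uniform binning followed by string formatting. Everything numeric here is an assumption made for illustration: the value ranges ([-1, 1] for the deltas, [0, 1] for the gripper), the uniform bin edges, and the helper names are not taken from the RT-2 paper or code.

```python
def discretize(value, low, high, bins=256):
    """Map a continuous value in [low, high] to an integer bin in 0..bins-1."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)   # the upper edge falls into the last bin

def action_to_string(terminate, deltas, gripper, low=-1.0, high=1.0):
    """Build one output string: terminate, six pose deltas, gripper."""
    tokens = [int(terminate)]
    tokens += [discretize(d, low, high) for d in deltas]   # assumed delta range
    tokens.append(discretize(gripper, 0.0, 1.0))           # assumed gripper range
    return " ".join(str(t) for t in tokens)

print(action_to_string(0, (0.0, -1.0, 1.0, 0.0, 0.0, 0.0), 1.0))
# 0 128 0 255 128 128 128 255
```

Decoding is the reverse: split the string, map each integer back to the centre of its bin, and send the resulting vector to the robot controller.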

Co-Fine-Tuning & Output Constraint

  • Mix robot data with original web VQA/caption data in training batches (up-weight robot samples) to prevent forgetting and improve generalization.
  • During decoding on robot tasks, restrict sampling to valid action tokens so outputs are always executable.
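The output constraint can be pictured as choosing only among the token ids that represent actions. The sketch below is a greedy (argmax) simplification with a toy 5-token vocabulary; the logits, the ids, and the function name are hypothetical, and the actual model samples from a much larger vocabulary.

```python
def constrained_argmax(logits, valid_token_ids):
    """Pick the highest-scoring token among the valid action tokens only."""
    return max(valid_token_ids, key=lambda token_id: logits[token_id])

logits = [0.1, 5.0, 2.0, 3.5, 0.7]   # toy scores; token 1 is a non-action token
action_token_ids = [2, 3, 4]         # hypothetical ids reserved for actions

print(constrained_argmax(logits, action_token_ids))   # 3
```

Token 1 has the highest raw score, but because it is not an action token it can never be emitted on a robot task.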

Closed-Loop Control & Real-Time Inference

  • RT-2 is trained and deployed for direct closed-loop control (camera → action → camera …), not just high-level planning.
  • For large models, inference runs via a multi-TPU cloud service; RT-2-PaLI-X-55B reaches ~1–3 Hz; smaller models ~5 Hz.

Generalization & Benchmarks

  • Matches RT-1 on seen tasks but far exceeds baselines on unseen objects/backgrounds/environments (~ vs RT-1/MOO; up to ~6× vs others).
  • Open-source Language-Table sim: co-fine-tuned PaLI-3B outperforms baselines, showing the approach transfers to other robots/sims.

Emergent Capabilities

  • Symbol understanding (e.g., “move apple to 3 / heart / star”).
  • Reasoning (visual matching, simple math like “sum of two plus one”, multilingual commands).
  • Human recognition (e.g., “person with glasses”); none of these were present as low-level actions in robot data.
  • Chain-of-thought (CoT) variant adds a Plan step before actions → supports multi-stage semantic reasoning (e.g., pick a rock as an improvised hammer; pick an energy drink for a tired person).

[Figure: RT-2 chain-of-thought (CoT) example]

Scaling & Ablations

  • From-scratch training (even 5B) performs poorly; fine-tuning helps; co-fine-tuning helps most.
  • Bigger models (55B > 5B) generalize better.
  • PaLM-E variant shows an edge on math reasoning; PaLI-X stronger on symbols/vision reasoning on average.

Limitations

  • Does not learn fundamentally new motor skills beyond the distribution in robot data; mainly transfers semantic/visual knowledge.
  • Compute/latency costly; real-time control can bottleneck. Limited availability of strong open VLMs and convenient FT APIs.

Future Directions (from the text)

  • Acquire new skills from human videos or richer datasets.
  • Quantization/distillation for faster/cheaper inference.
  • More open VLMs / FT APIs to make VLA models broadly buildable.

Ref

  • Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P. R., Salazar, G., Ryoo, M. S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.-W. E., Leal, I., Kuang, Y., Kalashnikov, D., Julian, R., Joshi, N. J., Irpan, A., Ichter, B., Hsu, J., Herzog, A., Hausman, K., Gopalakrishnan, K., Fu, C., Florence, P., Finn, C., Dubey, K. A., Driess, D., Ding, T., Choromanski, K. M., Chen, X., Chebotar, Y., Carbajal, J., Brown, N., Brohan, A., Arenas, M. G., & Han, K. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control Proceedings of The 7th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v229/zitkovich23a.html

PaLM-E An Embodied Multimodal Language Model Review

· about 4 min

PaLM-E

  • ViT (e.g., ViT-4B, ViT-22B) extracts image embeddings.
  • OSRT builds object-centric slot representations.
  • These are injected into the LLM embedding space (PaLM variants: 8B, 62B, 540B) for high-level abstraction and planning, with execution delegated to low-level policies (e.g., RT-1).

PaLM-E Architecture

Core idea

  • Build embodied language models by injecting continuous sensor inputs (images, states, other modalities) directly into a pretrained LLM’s embedding space, linking words ↔ percepts.
  • Inputs are multimodal sentences that interleave text tokens with encoded visual/state tokens; outputs are text (answers or high-level plans).

Architecture & representations

  • Start from a decoder-only, autoregressive LLM (PaLM) and condition on a prefix that mixes text and encoder-produced vectors.
  • Provide multiple encoder options:
    • State vectors (simplest).
    • ViT features with a learned projector ψ to match LLM embedding dimensionality.
    • Object-centric, 3D-aware OSRT (neural scene representations). Supports entity-label tokens (<obj j>) so the model can refer to specific objects in generated plans.

Training setup

  • Train end-to-end (encoders + projector + optionally the LLM) to output sequential decisions as natural text or answers (VQA, captioning).
  • Dataset items contain (continuous observations, text sequence, prefix index); loss is cross-entropy on non-prefix tokens.
  • Explore freezing the LLM (train encoders/projection only), and co-training across diverse tasks ("full mixture"; only ~9% is embodied data).
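
The loss over non-prefix tokens can be sketched as below. This is a minimal illustration under stated assumptions: `token_probs` stands in for the probabilities a model assigned to the correct tokens, and averaging over the non-prefix positions is an assumption made here for the example.

```python
import math

def non_prefix_cross_entropy(token_probs, prefix_index):
    """Average negative log-likelihood over positions >= prefix_index.

    token_probs[i] is the probability assigned to the correct token at position i.
    Positions before prefix_index (the prompt/observation prefix) contribute no loss.
    """
    targets = token_probs[prefix_index:]
    if not targets:
        raise ValueError("no non-prefix tokens to train on")
    return -sum(math.log(p) for p in targets) / len(targets)

probs = [0.9, 0.2, 0.5, 0.25]  # toy values
loss = non_prefix_cross_entropy(probs, prefix_index=2)
```

Only the last two positions are scored here, so the low probability at position 1 (inside the prefix) does not affect the loss.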

Planning & control loop

  • For planning/control, PaLM-E emits textual subgoals/skills drawn from a small skill vocabulary; a separate low-level policy executes them.
  • The system runs closed-loop: execute → observe → (re)plan; PaLM-E acts as a high-level policy sequencing low-level skills.

Why not text-only LLMs or affordance-only grounding?

  • Prior work that feeds only text to the LLM (and uses external affordance models) is insufficient when spatial layout matters.
  • PaLM-E instead grounds inside the LLM by injecting continuous observations, enabling direct plan generation while leveraging the LLM’s world knowledge.

Environments & use cases

  • Three domains: TAMP (grasp/stack planning), Language-Table (multi-object tabletop pushing), Mobile manipulation (kitchen tasks).
  • Use cases to test embodied reasoning: affordance prediction, failure detection, long-horizon planning (low-level policies from RT-1).

Results (high level)

  • Transfer via co-training: One model trained on mixed tasks/embodiments achieves higher performance than task-specialists; "full mixture" yields >2× gains (Fig. 3).
  • Few-shot/data efficiency: Solves robotics tasks with very few examples (e.g., 10–80 for Language-Table, 320 for TAMP). OSRT further improves data efficiency.
  • Mobile manipulation: End-to-end embodied planning works in real kitchens, robust to disturbances; PaLM-E beats PaLI (zero-shot) and QT-OPT/CLIP baselines on affordance/failure detection.
  • General V+L: The 562B generalist achieves state-of-the-art on OK-VQA and strong VQAv2/COCO without task-specific finetuning.
  • Language retention & scaling: Freezing LLM preserves language ability but can struggle on some robotics tasks; unfrozen + scale up significantly reduces catastrophic forgetting.
  • Emergent behaviors: Multimodal chain-of-thought and multi-image reasoning emerge in PaLM-E-562B, despite training on single-image prompts.

Takeaways

  • Injecting neural scene representations (OSRT) and entity-labeled multimodal tokens is effective even without massive embodied data.
  • Diverse, joint training transfers vision-language knowledge into embodied decision-making, enabling data-efficient robot planning.
  • Two viable paths to retain language skills during multimodal finetuning:
    1. Freeze the LLM, train encoders (max language retention, sometimes weaker robotics),
    2. Unfreeze and scale the LLM (much less forgetting, strong embodied performance).

Ref

  • Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., & Florence, P. (2023). PaLM-E: An Embodied Multimodal Language Model Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v202/driess23a.html

RT-1, Robot Transformer 1 Review

· about 3 min

RT-1

  • RT-1 discretizes robot actions into 256-bin tokens, creating a shared "action language" across robots.
  • It absorbs heterogeneous data from simulation and other robot morphologies without losing performance.
  • It generalizes robustly to new tasks, environments, and long-horizon scenarios (up to 50 steps).

RT-1 Architecture

Introduction & Motivation

  • Leveraging large, diverse, task-agnostic datasets enables high performance in zero-shot or small task-specific settings.
  • Data collection and curation is a critical bottleneck in robotics ("the unsung hero" of large-scale ML).
  • Transformer-based controllers are powerful but inefficient for real-time robotics, requiring architectural adaptations.

Model & Architecture

  • RT-1 architecture: EfficientNet + FiLM layers + TokenLearner for compact vision-language tokenization.
  • Action tokenization: 11 action dimensions (7 arm, 3 base, 1 mode) discretized into 256 bins each.
  • This abstraction converts continuous robot actions into a discrete "token language", enabling cross-domain and cross-robot transfer.
  • Real-time feasibility: optimized design achieves ~3Hz inference speed suitable for real-world control.

Experiments & Results

General Performance

  • RT-1 executes over 700 unique instructions at 97% success rate.
  • On unseen instructions: 76% success, outperforming next-best baseline by +24%.
  • Robustness: 83% success with distractors, 59% with background changes (significantly higher than baselines).

Absorbing Simulation Data

  • Adding sim data does not degrade real-task performance.
  • Objects/tasks only seen in simulation: performance boosted 23% ⇒ 87%.
  • Unseen instructions with sim objects: 7% ⇒ 33%, showing strong sim-to-real domain transfer.

Absorbing Multi-Robot Data

  • Mixed RT-1 + Kuka datasets: only 2% drop in original tasks.
  • Bin-picking eval: RT-1 only 22% ⇒ mixed training 39% (almost 2×).
  • Kuka-only training: 0% on EDR robots ⇒ morphology transfer alone fails.
  • Mixed data enables RT-1 to leverage cross-robot experiences without explicit demonstrations.

Long-Horizon Scenarios (SayCan Integration)

  • Evaluated in two kitchens:
    • Kitchen1: 67% execution success.
    • Kitchen2 (novel environment): also 67% execution success.
  • Outperforms Gato (0% in Kitchen2) and BC-Z (13% in Kitchen2).
  • Demonstrated execution of ultra-long tasks up to 50 steps.

Data Quantity vs Diversity

Data Diversity

  • Reducing dataset size ⇒ gradual performance/generalization decline.
  • Reducing task diversity ⇒ much sharper decline, especially in generalization.
  • Key takeaway: Data diversity is more critical than data quantity.

Conclusions & Limitations

  • RT-1 proves large-scale data absorption and strong generalization in robotics.
  • Limitations:
    • Based on imitation learning ⇒ cannot surpass demonstrator performance.
    • Generalization limited to recombinations of known concepts ⇒ fails on truly novel motions.
    • Dataset is large but not dexterous (fine manipulation limited).

Future Directions

  • Enable non-experts to collect training data and prompt models for faster skill scaling.
  • Increase environmental diversity to strengthen robustness to backgrounds/environments.
  • Improve reaction speed and context retention via scalable attention and memory.

Ref

  • Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., & Hsu, J. (2022). Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.

Do As I Can, Not As I Say Review

· about 4 min

SayCan

  • The core of SayCan is using an LLM to decompose high-level instructions into low-level skills, and reinforcement-learned affordance value functions to evaluate whether each skill is feasible in the current environment.
  • The Say × Can structure is modular: different LLMs or affordance models can be swapped in, but each module’s inherent biases are carried into the system.
  • To mitigate limitations, loop-based strategies are essential — CoT and RLHF provide feedback loops for LLMs, while closed-loop feedback enables affordance functions to adapt during execution.

Motivation (Why LLMs alone fall short)

  • LLMs lack embodiment. They haven’t acted in the physical world, so using them for decision-making on a specific robot is unreliable.
  • LLMs don’t know robot’s abilities or state. They may split instructions into subtasks, but without context of capabilities and environment, plans can be irrelevant.
  • Prompting alone isn’t enough. Structured prompts help, but they don’t guarantee admissible or executable steps.

Core Proposal (What SayCan adds)

  • Ground with pretrained skills. Constrain LLM to propose actions that the robot can actually perform in context.
  • Say × Can factorization.
    • Say (task-grounding): LLM estimates relevance of each skill to the instruction.
    • Can (world-grounding): Affordance functions estimate probability of success from current state.

Probabilistic Formulation

  • Two probabilities multiplied:
    • $p(\ell_\pi|i)$: LLM score of relevance.
    • $p(c_\pi|s,\ell_\pi)$: affordance score of success.
    • Select: $\pi = \arg\max p(c_\pi|s,\ell_\pi)\,p(\ell_\pi|i)$.
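
A small numeric sketch of this selection rule follows. The skill names and scores are made up for illustration; in the real system the two scores come from an LLM and from learned value functions.

```python
def select_skill(llm_scores, affordance_scores):
    """Return the skill maximising p(c | s, l) * p(l | i), plus all combined scores."""
    combined = {skill: llm_scores[skill] * affordance_scores[skill]
                for skill in llm_scores}
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical scores: the LLM prefers the sponge, but no sponge is reachable.
llm_scores = {"pick up the sponge": 0.6, "pick up the apple": 0.3, "go to the table": 0.1}
affordance_scores = {"pick up the sponge": 0.1, "pick up the apple": 0.8, "go to the table": 0.9}

best, combined = select_skill(llm_scores, affordance_scores)
print(best)  # pick up the apple
```

The product lets a low affordance score veto a skill that the language model alone would have ranked first.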

Planning Procedure

  • Planning is structured as a dialog: user gives high-level instruction, LLM produces a step sequence, loop until "done."
  • Benefit: Interpretability—scores provide transparency.
  • Caveat: Without affordances, chosen steps may be irrelevant to the current scene.
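
The dialog-style loop can be sketched as below. Everything here is a toy stand-in: `toy_llm_score` follows a fixed script instead of querying a language model, and `toy_affordance_score` treats every skill as feasible, so the sketch only shows the control flow (score, pick, append, stop at "done").

```python
def plan(instruction, skills, llm_score, affordance_score, max_steps=10):
    """Repeatedly pick the best-scoring skill until 'done' is selected."""
    steps = []
    for _ in range(max_steps):
        best = max(skills,
                   key=lambda s: llm_score(instruction, steps, s) * affordance_score(s))
        if best == "done":
            break
        steps.append(best)
    return steps

SCRIPT = ["find a sponge", "pick up the sponge", "bring it to the user", "done"]

def toy_llm_score(instruction, steps_so_far, skill):
    # Stand-in for an LLM: favours the next step of a fixed script.
    expected = SCRIPT[min(len(steps_so_far), len(SCRIPT) - 1)]
    return 0.9 if skill == expected else 0.05

def toy_affordance_score(skill):
    # Assumption: every skill is feasible in this toy scene.
    return 1.0

steps = plan("bring me a sponge", SCRIPT, toy_llm_score, toy_affordance_score)
print(steps)  # ['find a sponge', 'pick up the sponge', 'bring it to the user']
```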

Affordances via RL

  • Affordance = value function. In sparse reward settings, value ≈ success probability.
  • Temporal-difference (TD) RL under the MDP formalism is used to learn $Q_\pi(s,a)$.

Implementation

  • Skill training:
    • BC-Z (behavioral cloning) and MT-Opt (reinforcement learning).
    • Multi-task BC/RL amortizes training cost.
  • Language conditioning: Pretrained sentence encoder frozen, text embeddings as input.
  • Action space: 6-DoF end-effector, gripper open/close, base x-y & yaw deltas, terminate.

Metrics

  • Plan success rate: 2/3 human raters agree that the plan is valid.
  • Execution success rate: 2/3 raters agree robot achieved the task.

Key Results

  • Grounding nearly doubles performance vs non-grounded baselines.
  • Understands sequence order (approach → pick → bring).
  • Failures: Long-horizon tasks (early termination), negation, ambiguous references.
  • Error split: ~65% LLM, 35% affordance.

Ablations

  • Remove LLM (task-grounding):
    • BC-NL: 0% all tasks.
    • BC-USE: 60% on single primitives, 0% otherwise.
  • Remove affordances (world-grounding):
    • No-VF: 67%, Generative: 74% vs 84% (SayCan).

Scaling & Models

  • PaLM > FLAN. PaLM-SayCan achieves 84% plan / 74% execute.
  • Stronger LMs improve robotics performance.

Extensibility

  • Add new skills easily: register skill, affordance, prompt example.
  • Chain-of-Thought: Add "Explanation" → helps with negation and reasoning-heavy queries.
  • Multilingual: Almost no performance drop (English, Chinese, French, Spanish).

Open-Source Variant

  • CLIPort for pick-and-place.
  • Affordances approximated by ViLD open-vocabulary object detector.
  • GPT-3 as language model.

Limitations & Future Work

  • Limits: Inherits LLM biases; skill library is bottleneck; hard to react to skill failures.
  • Closed-loop extensions: Huang et al. use environment feedback + inner monologue for replanning.
  • Future directions: Expand/robustify skills, explore new grounding sources (non-robotic), test if natural language is the right ontology, combine planning + language, use LMs for policy pretraining.

Ref

  • Ichter, B., Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., Kalashnikov, D., Levine, S., Lu, Y., Parada, C., Rao, K., Sermanet, P., Toshev, A. T., Vanhoucke, V., Xia, F., Xiao, T., Xu, P., Yan, M., Brown, N., Ahn, M., Cortes, O., Sievers, N., Tan, C., Xu, S., Reyes, D., Rettinghouse, J., Quiambao, J., Pastor, P., Luu, L., Lee, K.-H., Kuang, Y., Jesmonth, S., Joshi, N. J., Jeffrey, K., Ruano, R. J., Hsu, J., Gopalakrishnan, K., David, B., Zeng, A., & Fu, C. K. (2023). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances Proceedings of The 6th Conference on Robot Learning, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v205/ichter23a.html

FDA +004

· about 15 min

Data Preparation

  • In real-world applications, data can be inconsistent, incomplete, and noisy.
  • Data Collection problems: when data is collected incorrectly
  • Incomplete Data: when information is missing
  • Data entry problems: when data is entered incorrectly
  • Contradictions in data: when the data says something in one place, and then says a different thing elsewhere in the dataset. We can think of this data as noisy.
  • Discrepancy in naming conventions: when data descriptions are unclear, people may misinterpret their meaning.
  • Duplicated records: when integrating data from different sources, the same data may get entered multiple times.
  • Data transmission problems: when data is sent between different people or databases or companies, things can get lost in the process.

Data mining tasks

  • Classification
  • Estimation
  • Prediction
  • Characterisation
  • Discrimination
  • Affinity grouping
  • Clustering
  • Time series analysis

Data Cleaning

  • Missing data
    • Ignore the record
    • Fill the missing value manually
    • Fill missing values with calculated values
      • Missing values can be filled with the average value of the attribute,
      • or with the attribute mean over all samples belonging to the same class as the given record.
      • They can also be inferred automatically using methods such as Bayesian classification or decision trees.
  • Noisy data: a meaningless variation that cannot be interpreted properly by machines
    • Binning
      • binning methods smooth a value using its neighbours' data; this is referred to as local smoothing
      • can replace all data in a segment by its mean or boundary values
    • Clustering
      • grouping of data points according to a distance measure
      • use a clustering algorithm to classify each data point into a specific group
      • can detect outliers
    • Regression
      • a data mining function that deals with the prediction of a continuous value rather than a class
      • maps data values to a function
      • Using regression to fit data by finding a mathematical equation may be used to smooth noisy data.

Binning

| Price | Equi-width | Equi-depth |
| --- | --- | --- |
| 7 | [0, 10] | [7, 20] |
| 20 | [11, 20] | [7, 20] |
| 22 | [21, 30] | [22, 50] |
| 50 | [41, 50] | [22, 50] |
| 51 | [51, 60] | [51, 53] |
| 53 | [51, 60] | [51, 53] |

  • Equi-width: Bins have equal width.
  • Equi-depth: Bins have the same number of values in them or almost the same number if they don't divide equally.

Equi-width binning

Equal-interval binning splits the whole range of values into intervals of equal size.

  • Price: 4, 8, 9, 15, 21, 21, 22, 26, 27, 28, 29, 36
  • Equal-width binning
    • Bin1 [4, 12]: 4, 8, 9
    • Bin2 (12, 20]: 15
    • Bin3 (20, 28]: 21, 21, 22, 26, 27, 28
    • Bin4 (28, 36]: 29, 36
  • Smoothing by bin means (each value is replaced by its bin mean, rounded to the nearest integer)
    • Bin1: 7, 7, 7
    • Bin2: 15
    • Bin3: 24, 24, 24, 24, 24, 24
    • Bin4: 33, 33
  • Smoothing by bin boundaries
    • Bin1: 4, 9, 9
    • Bin2: 15
    • Bin3: 21, 21, 21, 28, 28, 28
    • Bin4: 29, 36

Equi-depth binning

Equal-frequency binning uses intervals that each contain an equal number of values.

  • Price: 4, 8, 9, 15, 21, 21, 22, 26, 27, 28, 29, 36
  • Equal-depth binning
    • Bin1: 4, 8, 9
    • Bin2: 15, 21, 21
    • Bin3: 22, 26, 27
    • Bin4: 28, 29, 36
  • Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
    • Bin1: 7, 7, 7
    • Bin2: 19, 19, 19
    • Bin3: 25, 25, 25
    • Bin4: 31, 31, 31
  • Smoothing by bin boundaries: each bin value is replaced by the closest boundary value.
    • Bin1: 4, 9, 9
    • Bin2: 15, 21, 21
    • Bin3: 22, 27, 27
    • Bin4: 28, 28, 36
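
The two binning schemes and both smoothing methods can be reproduced with a short sketch. The function names are introduced here for illustration. Two conventions are assumptions chosen to match the worked numbers above: means are rounded half up, and a value exactly halfway between the two boundaries goes to the lower one.

```python
import math

def equal_width_bins(values, num_bins):
    """Split values into num_bins intervals of equal width.

    The first interval is closed on both sides; later ones are (low, high].
    """
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    bins = [[] for _ in range(num_bins)]
    for v in sorted(values):
        # A value sitting exactly on an edge goes to the lower bin.
        index = 0 if v == low else min(math.ceil((v - low) / width) - 1, num_bins - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, num_bins):
    """Split sorted values into num_bins groups of (almost) equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins

def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded half up."""
    return [[math.floor(sum(b) / len(b) + 0.5)] * len(b) if b else [] for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        if not b:
            smoothed.append([])
            continue
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 22, 26, 27, 28, 29, 36]
width_bins = equal_width_bins(prices, 4)
depth_bins = equal_depth_bins(prices, 4)
print(width_bins)                        # [[4, 8, 9], [15], [21, 21, 22, 26, 27, 28], [29, 36]]
print(smooth_by_means(depth_bins))       # [[7, 7, 7], [19, 19, 19], [25, 25, 25], [31, 31, 31]]
print(smooth_by_boundaries(depth_bins))  # [[4, 9, 9], [15, 21, 21], [22, 27, 27], [28, 28, 36]]
```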

Data Integration

Data integration provides unified data by combining data from various heterogeneous sources into a coherent data store.

  • The sources can include flat files, databases or multiple data cubes.
  • Careful integration may help to avoid and reduce inconsistencies and redundancies in the final dataset.
  • Building an enterprise's data warehouse is considered one of the most popular data integration implementations.
  • Redundant attributes: An attribute (feature or column of a dataset) is called redundant if it can be derived from any other attribute or set of attributes.
    • In the process of data integration in data mining, the use of multiple data stores may lead to the problem of redundancy in data.
    • Dimension naming or inconsistencies in an attribute can also lead to redundancies in the dataset.

Pearson correlation coefficient

  • Correlation analysis can be used to detect redundancies in numerical data.
  • It can measure how strongly one attribute implies the other on the basis of the available data.
  • > 0.5: a strong positive correlation, A⬆️ B⬆️
  • < -0.5: a strong negative correlation, A⬆️ B⬇️
  • 0: no linear correlation. This alone does not guarantee that A and B are independent.
  • correlation != causation

$$r_{A,B} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{(n\sum x^2 - (\sum x)^2) \, (n\sum y^2 - (\sum y)^2)}}$$

$$r_{A,B} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \, \sqrt{\sum (y_i - \bar{y})^2}}$$

  • Step-by-step derivation
    • Variance: $Var(X) = \frac{1}{n} \sum (x_i - \bar{x})^2$
    • Covariance: $Cov(X,Y) = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})$
    • Correlation coefficient (normalized): $\rho = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$ (the $\frac{1}{n}$ factors in the covariance and in $\sigma_X \sigma_Y$ cancel, leaving a ratio of sums of deviations)
    • Means: $\bar{x} = \frac{1}{n} \sum x_i, \quad \bar{y} = \frac{1}{n} \sum y_i$
    • Expanding the numerator
      • $\sum (x_i - \bar{x})(y_i - \bar{y})$
      • $\sum (x_i y_i - x_i \bar{y} - y_i \bar{x} + \bar{x}\bar{y})$
      • $\sum x_i y_i - \bar{y}\sum x_i - \bar{x}\sum y_i + n\bar{x}\bar{y}$
      • Substituting the means
        • $\sum x_i y_i - \frac{1}{n}(\sum y_i)(\sum x_i) - \frac{1}{n}(\sum x_i)(\sum y_i) + \frac{1}{n}(\sum x_i)(\sum y_i)$
        • $\sum x_i y_i - \frac{1}{n}(\sum x_i)(\sum y_i)$
    • Expanding the denominator
      • $\sqrt{\sum (x_i - \bar{x})^2} \, \sqrt{\sum (y_i - \bar{y})^2}$
      • $\sqrt{\sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2} \, \sqrt{\sum y_i^2 - 2\bar{y}\sum y_i + n\bar{y}^2}$
      • Substituting the means
        • $\sqrt{\sum x_i^2 - \frac{2}{n}(\sum x_i)(\sum x_i) + \frac{1}{n}(\sum x_i)^2} \, \sqrt{\sum y_i^2 - \frac{2}{n}(\sum y_i)(\sum y_i) + \frac{1}{n}(\sum y_i)^2}$
        • $\sqrt{\sum x_i^2 - \frac{1}{n}(\sum x_i)^2} \, \sqrt{\sum y_i^2 - \frac{1}{n}(\sum y_i)^2}$
    • Restated: $r_{A,B} = \frac{\sum x_i y_i - \frac{1}{n}(\sum x_i)(\sum y_i)}{\sqrt{(\sum x_i^2 - \frac{1}{n}(\sum x_i)^2) \, (\sum y_i^2 - \frac{1}{n}(\sum y_i)^2)}}$
    • Multiplying numerator and denominator by $n$ and dropping the indices: $r_{A,B} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{(n\sum x^2 - (\sum x)^2) \, (n\sum y^2 - (\sum y)^2)}}$
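
As a quick numerical check of the derivation, the sketch below computes the coefficient with both the deviation form and the computational form and confirms that they agree. The function names and the sample data are made up for illustration.

```python
import math

def pearson_deviation_form(xs, ys):
    """r = sum((x - mean_x)(y - mean_y)) / (sqrt(sum((x - mean_x)^2)) * sqrt(sum((y - mean_y)^2)))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (math.sqrt(sum((x - mean_x) ** 2 for x in xs))
                   * math.sqrt(sum((y - mean_y) ** 2 for y in ys)))
    return numerator / denominator

def pearson_computational_form(xs, ys):
    """r = (n*sum(xy) - sum(x)*sum(y)) / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))

xs = [1, 2, 3, 4, 5]  # illustrative data
ys = [2, 4, 5, 4, 5]
r1 = pearson_deviation_form(xs, ys)
r2 = pearson_computational_form(xs, ys)
print(round(r1, 4), round(r2, 4))  # 0.7746 0.7746
```

The computational form needs only running sums, so it can be evaluated in a single pass over the data.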

Data Transformation

The data is consolidated or transformed so that the patterns found are easier to understand, and the consequent mining process is more efficient.

  • Smoothing: smoothing is used to remove noise from the data to improve clarity around the important features in the dataset
  • Normalization: the method of scaling your data, into a regularized range, so that you can compare and represent it more accurately
  • Discretization & Concept hierarchy generation
    • Discretisation is the process of putting values into buckets so that there are a limited number of possible states.
    • Discretisation transforms a continuous attribute into a categorical attribute; it usually happens after the data is cleaned.
    • This process includes replacing lower-level data (primitive) with higher-level concepts through the use of concept hierarchies.
    • Street may be replaced with city, country or region.
    • Age may be replaced with senior, adult, younger and youth.
  • Binarization: transforming data into binary numbers (e.g. 0, 1).
    • This helps make classifier algorithms more efficient.

Data Normalization

  • the data should be standardised or normalised in order to avoid dependency on the selection of measurement units.

  • This constitutes transforming data to lie within a common or smaller range, like [0.0, 1.0] or [−1, 1].

  • Min-max normalization

    • Min-max normalisation maps a value $v_n$ of attribute $K$ to a new value $v'_n$ within the range $[new\_min_K, new\_max_K]$
      • $v'_n = \frac{(v_n - min_K)}{(max_K - min_K)} \cdot (new\_max_K - new\_min_K) + new\_min_K$
      • calculates the relative position within the original range and reflects it in the new range accordingly.
    • preserves the relationships between the original data values.
    • It will encounter an out-of-bounds error if a future input case for normalisation falls outside of the original data range for K.
  • Z-Score normalization

    • normalises attribute values using the average (i.e., mean) and standard deviation of $K$.
      • $v'_n = \frac{(v_n - \mu_K)}{\sigma_K}$
      • It converts the distance of a data point from the mean into a unitless measure.
    • is useful when there are outliers that dominate the min-max normalisation
    • is useful when the actual minimum and maximum of attribute $K$ are unknown.
  • Decimal scaling normalization

    • The number of decimal points moved is based on the maximum absolute value of $K$.
      • $v'_n = \frac{v_n}{10^j}$
      • where $j$ is the smallest integer such that $max(|v'_n|) < 1$.
      • divides all values by the power of 10 just larger than the maximum absolute value, bringing them into the range $(-1, 1)$.
  • Softmax normalization

    • a nonlinear transformation that yields an 's'-shaped curve that approaches 0 and 1 asymptotically.
    • New values will be mapped between 0 and 1 even if they are beyond the range of your existing data.
    • $\alpha = \frac{\nu - \mu}{\lambda \, (\sigma / 2\pi)}, \qquad \nu' = \frac{1}{1 + e^{-\alpha}}$
      • Center the data around the mean :: use $(\nu - \mu)$
      • Remove units by scaling with the standard deviation :: divide by $\sigma$.
      • Control how steep or flat the curve is :: adjust with $\lambda$.
      • Add $(2\pi)$ as a conventional constant to better match the logistic curve with statistical distributions.
      • the formula naturally arises by centering at the mean, standardizing by the spread, letting the user control the slope, and refining with a scaling constant.
  • Sigmoid normalization

    • a nonlinear transformation similar to softmax. It ranges between −1 and 1 (asymptotically), and has a fixed linear portion within $\pm\mu$.
      • $\alpha = \frac{\nu - \mu}{\lambda \, (\sigma / 2\pi)}, \qquad \nu' = \frac{1 - e^{-\alpha}}{1 + e^{-\alpha}}$
      • Center the data around the mean :: use $(\nu - \mu)$
      • Remove units by scaling with the standard deviation :: divide by $\sigma$.
      • Control how steep or flat the curve is :: adjust with $\lambda$.
      • Add $(2\pi)$ as a conventional constant to refine scaling with respect to statistical distributions.
      • Apply the hyperbolic tangent, since $\frac{1 - e^{-\alpha}}{1 + e^{-\alpha}} = \tanh(\alpha/2)$ :: map the result smoothly into the range $(-1, 1)$.
      • the formula naturally arises by centering at the mean, standardizing by spread, letting the user control the slope, and using tanh to compress all values into $(-1, 1)$.
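
The five normalisation formulas above can be written as small functions. This is a minimal sketch: the function names and the sample numbers (an income-like attribute ranging from 12000 to 98000 with mean 54000 and standard deviation 16000) are illustrative assumptions, not values from the notes.

```python
import math

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Rescale v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Distance from the mean in units of standard deviation."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

def _alpha(v, mean, std, lam):
    # Shared scaling term used by the softmax and sigmoid forms.
    return (v - mean) / (lam * (std / (2 * math.pi)))

def softmax_scaling(v, mean, std, lam=1.0):
    """Logistic squashing into (0, 1)."""
    return 1 / (1 + math.exp(-_alpha(v, mean, std, lam)))

def sigmoid_scaling(v, mean, std, lam=1.0):
    """Squashing into (-1, 1); equal to tanh(alpha / 2)."""
    a = _alpha(v, mean, std, lam)
    return (1 - math.exp(-a)) / (1 + math.exp(-a))

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(round(z_score(73600, 54000, 16000), 3))  # 1.225
print(decimal_scaling([-986, 917]))            # ([-0.986, 0.917], 3)
print(softmax_scaling(54000, 54000, 16000))    # 0.5 at the mean
print(sigmoid_scaling(54000, 54000, 16000))    # 0.0 at the mean
```

Unlike min-max, the softmax and sigmoid forms keep out-of-range future inputs inside the target interval, at the cost of being nonlinear. For inputs very far below the mean, `math.exp` can overflow in this naive form.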

Discretization & Concept hierarchy generation

  • Data discretisation is a form of numerosity reduction that transforms a continuous attribute into a categorical attribute.
  • Higher concept labels or a smaller number of intervals (i.e. binning) are used to replace the raw data in order to simplify the original data and increase the efficiency of mining.
  • Discretisation is very beneficial for generating concept hierarchies automatically, which allow data mining at multiple levels of data abstraction.
  • One or more concept hierarchies can be defined for a single attribute to accommodate the requirements of various users.
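
| Salary | Age | ➡️ | Salary (interval) | Age (interval) |
| --- | --- | --- | --- | --- |
| 2000 | 20 | - | [2000, 2900) | [20, 25) |
| 2800 | 25 | - | [2000, 2900) | [25, 30) |
| 3500 | 23 | - | [2900, 3800) | [20, 25) |
| 2400 | 26 | - | [2000, 2900) | [25, 30) |
| 5600 | 32 | - | [5600, 6500) | [30, 35) |
| 4200 | 36 | - | [3800, 4700) | [35, 40] |
| 5000 | 39 | - | [4700, 5600) | [35, 40] |
| 5000 | 40 | - | [4700, 5600) | [35, 40] |
| 3400 | 35 | - | [2900, 3800) | [35, 40] |
| 3600 | 34 | - | [2900, 3800) | [30, 35) |
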
  • If dependent and independent variables have only a few values, a wide range of classification algorithms can be used.
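
The salary and age intervals in the table follow an equal-width scheme (width 900 starting at 2000 for salary, width 5 starting at 20 for age). A minimal sketch of that mapping is shown below; the function name and the `closed_last` flag are introduced here for illustration.

```python
def interval_label(value, start, width, num_bins, closed_last=True):
    """Return the equal-width interval containing value, formatted as text.

    Intervals are [low, high); with closed_last=True the final one is [low, high]
    so that the maximum value is included.
    """
    index = min(int((value - start) // width), num_bins - 1)
    low = start + index * width
    high = low + width
    closing = "]" if closed_last and index == num_bins - 1 else ")"
    return f"[{low}, {high}{closing}"

print(interval_label(3500, 2000, 900, 5, closed_last=False))  # [2900, 3800)
print(interval_label(40, 20, 5, 4))                           # [35, 40]
```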

Data Binarization

  • maps a categorical or continuous attribute into one or more binary variables.
  • Binarisation can convert a continuous attribute to a categorical attribute, which can then be converted into a set of binary attributes.
  • only possible to keep the meaning of one categorical value at one time, losing the meaning of the others.
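
| ID | Gender |
| --- | --- |
| 1 | Male |
| 2 | Female |
| 3 | Not specified |
| 4 | Female |

| ID | Male | Female | Not specified |
| --- | --- | --- | --- |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 |

| Outlook | Temperature | Humidity | Windy | Play |
| --- | --- | --- | --- | --- |
| Sunny | 85 | 85 | False | No |
| Sunny | 80 | 90 | True | No |
| Overcast | 83 | 78 | False | Yes |
| Rain | 70 | 95 | False | Yes |
| Rain | 68 | 80 | False | Yes |

| Outlook = Overcast | Outlook = Rain | Outlook = Sunny | Temperature | Humidity | Windy | Play |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 85 | 85 | 0 | 0 |
| 0 | 0 | 1 | 80 | 90 | 1 | 0 |
| 1 | 0 | 0 | 83 | 78 | 0 | 1 |
| 0 | 1 | 0 | 70 | 95 | 0 | 1 |
| 0 | 1 | 0 | 68 | 80 | 0 | 1 |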
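
The conversion shown in the tables above (one binary column per category) can be sketched as follows. The function name and the `attribute=value` column naming are assumptions made for this example.

```python
def one_hot(records, attribute):
    """Replace a categorical attribute with one 0/1 column per category."""
    categories = sorted({r[attribute] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != attribute}
        for c in categories:
            row[f"{attribute}={c}"] = 1 if r[attribute] == c else 0
        encoded.append(row)
    return encoded, categories

weather = [
    {"Outlook": "Sunny", "Temperature": 85},
    {"Outlook": "Overcast", "Temperature": 83},
    {"Outlook": "Rain", "Temperature": 70},
]
encoded, categories = one_hot(weather, "Outlook")
print(categories)  # ['Overcast', 'Rain', 'Sunny']
print(encoded[0])  # {'Temperature': 85, 'Outlook=Overcast': 0, 'Outlook=Rain': 0, 'Outlook=Sunny': 1}
```

Exactly one of the new columns is 1 in each row, so no ordering between categories is implied.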

Data Reduction

  • to acquire a reduced data set representation which is much smaller in quantity and maintains the quality of the data close to the original data.
  • to reduce data storage and analysis costs while increasing storage efficiency

Aggregation

  • storing and presenting data as a summary, using statistical metrics like means, median and variance.
  • Data aggregation is often used to construct a data cube for data analysis at multiple levels of abstraction.
  • Multidimensional aggregated information is stored in data cubes

Data cube aggregation

Dimensionality reduction

  • to minimize the number of features
  • feature subset selection or feature selection detects and removes weakly relevant, redundant, or irrelevant dimensions or attributes
  • to determine a minimum set of attributes so that the resulting probability distribution of the data classes is as near as possible to the original distribution obtained using all attributes.
  • Feature subset selection: uses only available subsets of the features to reduce the dimensionality of the data
    • Redundant features: Duplicates of all or much of the information present in one or more attributes.
      • the amount of sales tax paid / purchase price of a product
    • Irrelevant features: Contain no information that is important for the data mining process at hand.
      • the color of a product when predicting its price
    • Some redundant and irrelevant attributes can be eliminated immediately using domain knowledge or common sense.
    • The ideal approach to feature selection is to try all possible subsets of features in the input for the data mining algorithm of interest, and then consider the subset that gives the best outcome.
  • Feature subset selection techniques
    • Brute-force approach
    • Embedded approaches:
      • Feature selection occurs naturally as part of the data mining algorithm.
      • The algorithm decides by itself which attributes are to be ignored.
    • Filter approaches:
      • Features are chosen before running the data mining algorithm by taking some of the approaches which are independent of the data mining process.
      • can be selected with pairwise correlation as low as possible.
    • Wrapper approaches:
      • consider the target data mining algorithm as a black box to determine the best subset of attributes.
      • Instead of evaluating all possible combinations, it intelligently searches only a subset to find a near-optimal feature set.
      • Heuristic methods: Forward selection, backward elimination, genetic algorithm, greedy search.
      • Decision tree induction

Numerosity reduction

  • Regression, clustering, histograms, sampling
  • reducing the volume of the data, without any loss of data
    • parametric models: store only the model parameters rather than the actual data, regression, log-linear models
    • non-parametric approaches: clustering, sampling, histograms
  • Histograms
    • an unsupervised technique: bucket construction does not use class labels
    • Singleton bucket: each bucket represents a single attribute value and its frequency
    • Equal-width histogram: the value range is divided into buckets of equal width
    • Equal-frequency (equal-depth) histogram: each bucket holds roughly the same number of data points
  • Sampling
    • represents a large dataset by a smaller random subset (sample) of the data
    • often used in both preliminary exploration and final analysis.
    • useful when processing the entire dataset is too expensive or time-consuming.
    • If the sample preserves the important properties of the original dataset (e.g., the mean), it is said to be representative.
    • Simple random sampling: every data point has an equal probability of being chosen.
    • Sampling without replacement: once a data point is chosen, it cannot be selected again.
    • Sampling with replacement (bootstrap): the same data point can be picked multiple times, since it is placed back into the dataset after selection.
    • Cluster sampling: the dataset is divided into clusters (groups), and sampling is performed at the cluster level.
    • Stratified sampling: the dataset is split into strata (partitions), and random samples are drawn from each stratum. This is especially useful when the data is imbalanced, e.g., sampling customers across different age groups.
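
For the two histogram bucketing schemes listed above, the difference is easiest to see on a small skewed dataset. The following is a minimal sketch with made-up values; the helper names are illustrative, and `equal_width_counts` assumes the data contains at least two distinct values.

```python
def equal_width_counts(values, k):
    """Count values in k buckets that each span the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    counts = [0] * k
    for v in values:
        # The maximum value would land in bucket k, so clamp it to k - 1.
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    return counts


def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    buckets, start = [], 0
    for i in range(k):
        # The first n % k buckets take one extra element.
        end = start + n // k + (1 if i < n % k else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets


data = [1, 2, 2, 3, 3, 3, 10, 11, 12, 30, 31, 32]  # made-up values
width_counts = equal_width_counts(data, 3)
depth_buckets = equal_frequency_buckets(data, 3)
print(width_counts)
print(depth_buckets)
```

With equal-width buckets most of the points fall into the first bucket, whereas equal-frequency buckets each hold four points but cover very different value ranges.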

Stratified sampling

  • Strata: Youth, Middle-aged, Senior
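
The sampling schemes above can be sketched with Python's standard `random` module. The customer records, the stratum sizes, and the 50% sampling fraction are made up for illustration, and `stratified_sample` is a hypothetical helper rather than a library function.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_of, fraction, rng):
    """Draw the same fraction from every stratum (at least one record each)."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # without replacement per stratum
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
groups = ["Youth"] * 4 + ["Middle-aged"] * 6 + ["Senior"] * 2
customers = [(f"c{i}", group) for i, group in enumerate(groups)]

# Simple random sampling without replacement: no record appears twice.
without_replacement = rng.sample(customers, 6)
# Sampling with replacement: the same record may be drawn more than once.
with_replacement = rng.choices(customers, k=6)
# Stratified sampling: every age group is represented in proportion.
stratified = stratified_sample(customers, lambda c: c[1], 0.5, rng)

strata_counts = Counter(group for _, group in stratified)
print(strata_counts)
```

A plain random sample of six customers could miss the small Senior group entirely; the stratified draw always includes it.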

Vocabulary for AI +005

· About 3 min read

Vocabulary & Expressions

| Term/Expression | Definition | Simpler Paraphrase | Meaning (Korean) |
| --- | --- | --- | --- |
| subconsciously | In a way that is not fully aware or conscious | Without thinking about it | 무의식적으로 |
| interleave | to arrange or mix things by placing them alternately | to alternate or weave together | 교차 배치하다, 섞다 |
| induce | to cause something to happen or exist | to bring about or give rise to | 유도하다, 초래하다 |
| polynomial | a mathematical expression consisting of variables and coefficients, involving only the operations of addition, subtraction, multiplication, and non-negative integer exponentiation of variables | a type of equation with multiple terms | 다항식 |
| sinusoidal | having the shape or characteristics of a sine wave | wave-like | 사인 곡선의 |
| piecewise | defined or done in separate parts or segments | in segments | 구간별로, 조각조각 |
| impose | to force something to be accepted or put in place | to establish or apply | 부과하다, 강요하다 |
| dubious | hesitating or doubting | uncertain or questionable | 의심스러운 |
| presumably | used to convey that what is assumed is likely to be true | probably | 아마, 추정컨대 |
| arbitrary | based on random choice or personal whim, rather than any reason or system | random or capricious | 임의의, 자의적인 |
| skimpy | insufficient in quantity or quality | scanty or meager | 부족한, 빈약한 |
| disjunctive | relating to or denoting a logical operation that combines two or more propositions | separating or contrasting | 분리적인, 대립적인 |
| parity | the state or condition of being equal or equivalent | equality | 동등성 |
| intractable | difficult to manage or control | stubborn or unmanageable | 다루기 힘든 |
| at someone's disposal | available to be used by someone | at their command | ~가 다룰 수 있는 |
| deviate | to depart from an established course or norm | to diverge or stray | 벗어나다 |
| asymptotic | approaching a limit as closely as possible | nearing a boundary | 점근적인 |
| univariate | involving only one variable | single-variable | 단일 변수의 |
| heterogeneous | composed of different or diverse elements | mixed or varied | 이질적인 |
| derivation | the process of obtaining something from a source or origin | extraction | 유도, 파생 |
| consolidate | to combine or unite into a single entity | to merge or strengthen | 통합하다, 강화하다 |
| asymptotically | in a manner that approaches a limit | nearing a boundary | 점근적으로 |
| preliminary | serving as a preparation or introduction | initial or preparatory | 예비의, 준비의 |
| harness | to make use of something effectively | to utilize | 활용하다 |
| repertoire | a collection or set of skills, abilities, or resources | a range or inventory | 레퍼토리 |
| visuomotor | relating to the coordination of visual and motor functions | visual-motor | 시각 운동의 |
| Owing to | because of | due to | ~때문에, ~덕분에 |
| trajectory | the path followed by a moving object | path or course | 궤적 |
| quadratic | relating to a polynomial of the second degree | second-degree | 이차의 |
| stationarity | the property of a process whose statistical properties do not change over time | stability | 정상성 |
| pseudoinverse | a generalization of the inverse matrix for non-square matrices | generalized inverse | 유사 역행렬 |
| logarithmic | relating to the logarithm of a quantity | log-based | 로그의 |
| spherical | relating to a sphere | sphere-based | 구형의 |
| interchangeable | able to be exchanged or replaced with something else | replaceable | 교체 가능한 |
| admit | to acknowledge or accept the existence or truth of something | to confess or recognize | 인정하다 |
| differentiable | capable of being differentiated | able to be derived | 미분 가능한 |
| derivation | the process of obtaining something from a source or origin | extraction | 유도, 파생 |
| parametric | relating to or expressed in terms of parameters | variable | 모수의 |
| extrapolation | the process of estimating values beyond the known data points | estimation beyond known data | 외삽 |
| interpolation | the process of estimating values within the range of known data points | estimation within known data | 보간, 내삽 |
| plateaued | having reached a state of little or no change after a period of activity or progress | stabilized | 정체된 |
| compelling | evoking interest, attention, or admiration in a powerfully irresistible way | captivating | 매력적인 |

CLIPort Review

· About 2 min read

Key Idea

  • CLIPort proposes a two-stream architecture for vision-based manipulation:
    • Semantic pathway (what): leverages CLIP for broad semantic understanding.
    • Spatial pathway (where): leverages Transporter for fine-grained spatial reasoning.
  • This design is inspired by the two-stream hypothesis in cognitive psychology (ventral/dorsal pathways).

Framework Contributions

  • Benchmark Extension: Expanded the Ravens benchmark with language-grounding tasks for manipulation.
  • Two-Stream Architecture: Uses pre-trained vision-language models (CLIP) to condition precise manipulation policies with language goals.
  • Empirical Results: Demonstrates robustness on diverse manipulation tasks, including multi-task settings and real-robot experiments.

Architectural Design

  • CLIPort integrates semantic (CLIP) with spatial (Transporter) features by lateral fusion.
  • The semantic stream is conditioned with language features from CLIP’s text encoder and fused with intermediate spatial features.
  • Enables end-to-end learning of affordance predictions (pick-and-place) without explicit object models, segmentations, or symbolic states.

Key Insights

  • Formulates manipulation as action detection (where to act), instead of object detection.
  • Tabula rasa systems (like plain Transporter) require new demonstrations for every goal/task. CLIPort addresses this with a strong semantic prior (from CLIP) to generalize across tasks and concepts.
  • Language-conditioned policies provide an intuitive interface for specifying goals and transferring concepts.
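
The "action detection" framing can be pictured as choosing the pixel with the highest predicted affordance, rather than first detecting an object. The sketch below assumes a dense per-pixel score map is already available; the 3×3 values and the helper name are made up, and this is not CLIPort's actual code.

```python
def best_action_pixel(affordance):
    """Return ((row, col), score) of the highest-scoring pixel."""
    score, location = max(
        ((value, (row, col))
         for row, line in enumerate(affordance)
         for col, value in enumerate(line)),
        key=lambda item: item[0],
    )
    return location, score


# Toy per-pixel pick scores (assumption, for illustration only).
pick_map = [
    [0.01, 0.02, 0.01],
    [0.03, 0.90, 0.05],
    [0.02, 0.04, 0.01],
]
pixel, score = best_action_pixel(pick_map)
print(pixel, score)
```

No object model or segmentation appears anywhere in this step: the policy only has to say where to act.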

Experimental Results

  • Simulation (PyBullet, UR5 robot with suction gripper):
    • 10 language-conditioned tasks with thousands of unique instances.
    • Multi-task CLIPort outperformed or matched single-task models, even with fewer demonstrations.
    • CLIP-only or Transporter-only baselines saturate, while CLIPort exceeds 90% success with just 100 demos.
  • Generalization:
    • CLIPort generalizes to unseen attributes (e.g., new colors, shapes, object categories).
    • Struggles with completely novel attributes (e.g., “pink” or “orange” never seen in training).
  • Real-World Robot Experiments (Franka Panda):
    • Achieved ~70% success on real tasks with just 179 demonstrations.
    • Performance trends were consistent with those in simulation, suggesting that the simulation findings carry over to real hardware.

Conclusion

  • CLIPort shows that multi-task, language-conditioned policies generalize across tasks better than object-centric or tabula rasa methods.
  • With action abstraction and spatio-semantic priors, end-to-end models can learn new skills without requiring hand-engineered pipelines.
  • Limitations remain for dexterous 6-DoF manipulation and complex continuous control.

Ref

  • Shridhar, M., Manuelli, L., & Fox, D. (2022). CLIPort: What and Where Pathways for Robotic Manipulation. Conference on Robot Learning (CoRL).

Mitigating Hallucinations on Object Attributes Review

· About 4 min read

Overview

  • Introduces a HoOA benchmark that isolates hallucinations on object attributes (color, shape) from existence/relationship errors.
  • Proposes MIAVLM: leverages multiview images (generated from a single image’s 3D representation) and a Multiview Attributes Perceiver (MAP) to make fusion order-invariant.
  • Adds negative instructions during tuning to counter LVLMs’ tendency to answer "Yes".
  • Results: best HoOA metric (0.775 with the original image only / 0.787 with 8 additional generated views) with fastest inference (0.071 / 0.105 s). "9in1" tiling is ineffective; separate multiview inputs help.
  • Training: LM loss, Adam (lr=0.001), cosine annealing, 20 epochs, single NVIDIA 3090.

Hallucinations on Object Attributes (HoOA)

Issues

  • HoOA = incorrect attribute descriptions for existing objects (distinct from HoOE/HoOR).
  • Root causes analyzed:
    • Single-view insufficiency: fine-grained details can be invisible from a single viewpoint.
    • Instruction bias: overexposure to positive/affirmative patterns → "Yes" bias.
    • Order sensitivity: multi-image inputs change predictions when view order changes.

Mitigation Methods (this paper)

  • Multiview prompts: sample views from a single image’s 3D reconstruction to recover missed details.
  • MAP (order-invariant fusion): learn view weights and fuse per-view features via weighted sum; input order has no effect; supports any number of views.
  • Negative instructions: incorporate "No"-answerable questions in tuning to suppress "Yes" bias.

Benchmark (HoOA)

Construction

  • Based on CelebAText-HQ; manual attribute descriptions rewritten into Yes/No questions.
    • Positive questions → correct answer "Yes".
    • Negative questions → attribute flipped/opposite → correct answer "No" (to expose "Yes" bias).
  • Scale: 1,430 images, 14,291 positive + 14,291 negative questions.
  • Split: 9:1 train:test.
  • Metric: average of accuracy on positive and negative questions (balanced HoOA score).

Model: MIAVLM

Visual Extractor (VE)

  • 6 stacked Transformer decoder blocks.
  • Soft prompts $P \in \mathbb{R}^{l \times d}$ are queries; image embeddings $e_i$ are keys/values.
  • Per-view cross-attention computed in parallel (no autoregressive chaining; no assumed order).
  • Per-view output: $o_i = \mathrm{softmax}\!\left(\frac{(P W_Q)(e_i W_K)^\top}{\sqrt{d}}\right) e_i W_V,\quad O_{VE}=\{o_1,\dots,o_n\}.$
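
The per-view output formula above can be checked numerically with a small pure-Python sketch. It shows a single cross-attention step only (not the six stacked decoder blocks), and all sizes, values, and identity projection matrices are toy assumptions chosen for readability.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(P, e_i, W_Q, W_K, W_V):
    """o_i = softmax((P W_Q)(e_i W_K)^T / sqrt(d)) (e_i W_V)."""
    d = len(P[0])
    queries = matmul(P, W_Q)    # l x d, from the soft prompts
    keys = matmul(e_i, W_K)     # t x d, from the image embeddings
    values = matmul(e_i, W_V)   # t x d
    scores = matmul(queries, transpose(keys))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), values)  # l x d


identity = [[1.0, 0.0], [0.0, 1.0]]   # toy projections (assumption)
P = [[1.0, 0.0], [0.0, 1.0]]          # l = 2 soft prompts, d = 2
views = [                             # n = 2 views, 3 patch embeddings each
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]],
]

# Each view is processed independently, so the views could run in parallel.
O_VE = [cross_attention(P, e, identity, identity, identity) for e in views]
print(O_VE)
```

Because each $o_i$ depends only on $P$ and its own $e_i$, reordering the views merely reorders the entries of $O_{VE}$ and leaves every individual output unchanged.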

Multihead Sampler (MS)

  • Learns view weights for fusion.
  • Decomposer (2-layer MLP) maps each view’s $[CLS]$ to $m=4$ tokens $\{e_i^{1},\dots,e_i^{m}\}$.
  • For each token/head $j$: compute attention scores vs. $P$ → mean over prompt tokens → $\mathrm{weights}^j \in \mathbb{R}^n$.
  • Average across heads: $w_{MS} = \tfrac{1}{m}\sum_{j=1}^{m}\mathrm{weights}^j \in \mathbb{R}^n.$

[Figure: Multihead Sampler (MS)]

MAP (Multiview Attributes Perceiver)

  • Order-invariant weighted fusion: $\text{Output}=\sum_{i=1}^{n} w_i\,o_i.$
  • Properties: supports any number of views; permutation-invariant to input order.
  • By learning weights for each view, MAP highlights informative perspectives and suppresses less useful ones, ensuring consistent predictions even when the view order changes. This directly addresses the input-order sensitivity observed in baselines such as OpenFlamingo.
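
The order-invariance claim can be verified on toy numbers. The sketch below averages per-head view weights as in the $w_{MS}$ formula and then applies the weighted sum; the view outputs are flattened to short vectors, and every value is made up for illustration rather than taken from the paper.

```python
import math


def average_heads(weights_per_head):
    """w_MS: average the per-head view weights (m heads, n views)."""
    m = len(weights_per_head)
    n = len(weights_per_head[0])
    return [sum(head[i] for head in weights_per_head) / m for i in range(n)]


def fuse(view_outputs, view_weights):
    """Output = sum_i w_i * o_i, computed element-wise."""
    dim = len(view_outputs[0])
    return [
        sum(w * o[k] for w, o in zip(view_weights, view_outputs))
        for k in range(dim)
    ]


# Toy per-view outputs (flattened) and per-head weights: all assumptions.
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]      # n = 3 views
heads = [[0.5, 0.25, 0.25], [0.5, 0.5, 0.0]]         # m = 2 heads

fused = fuse(outputs, average_heads(heads))

# Present the same views in a different order: each weight travels with its
# view, so the weighted sum is unchanged.
order = [2, 0, 1]
shuffled_outputs = [outputs[i] for i in order]
shuffled_heads = [[head[i] for i in order] for head in heads]
fused_shuffled = fuse(shuffled_outputs, average_heads(shuffled_heads))

print(fused, fused_shuffled)
```

Since a sum does not depend on the order of its terms, any permutation of the views yields the same fused output, and the same code works for any number of views.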

[Figure: Multiview Attributes Perceiver (MAP)]

Benchmarks

Baselines & Input Modes

  • Baselines: BLIP3, OpenFlamingo (4 variants), OPERA, Idefics2, LLaVA-UHD.
  • Two input modes:
    1. Original image only.
    2. Original + 8 generated views.
      • Models that accept only one image use 9in1 tiling (nine images stitched into one).

Main Results

  • MIAVLM:
    • HoOA metric: 0.775 / 0.787 (modes 1 / 2)
    • Positive accuracy: 0.752 / 0.762
    • Negative accuracy: 0.797 / 0.812
    • Inference time: 0.071 / 0.105 s (fastest)
  • 9in1 tiling did not improve results (likely harder to interpret).
  • Nine separate multiview images generally improved performance.
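
As a quick arithmetic check, the HoOA metric is the average of the positive and negative accuracies, so the reported scores can be recomputed from the accuracies listed above. The function name is illustrative, and the "always Yes" case is a hypothetical model added to show why the balanced metric exposes the "Yes" bias.

```python
def hooa_score(positive_accuracy, negative_accuracy):
    """Balanced HoOA score: mean of accuracy on positive and negative questions."""
    return (positive_accuracy + negative_accuracy) / 2


mode1 = hooa_score(0.752, 0.797)   # original image only
mode2 = hooa_score(0.762, 0.812)   # original + 8 generated views

# Hypothetical model that answers "Yes" to everything: perfect on positive
# questions, always wrong on negative ones.
always_yes = hooa_score(1.0, 0.0)

print(mode1, mode2, always_yes)
```

The recomputed values agree with the reported 0.775 and 0.787 up to rounding, while the always-"Yes" model scores only 0.5 despite its perfect positive accuracy.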

Ablations

  • Negative instructions: boost negative-question accuracy but slightly reduce positive-question accuracy; overall HoOA increases (approx. 0.665 → 0.787).
  • Input-order sensitivity:
    • MIAVLM is order-invariant
    • OpenFlamingo's accuracy varies when the view order is shuffled.

Limitations & Notes

  • Trade-off from negative instructions (negatives ↑, positives ↓).
  • Effectiveness depends on the quality of generated views.

Insights

  • This approach seems especially suitable for perception settings where multiple views of a scene may arrive in arbitrary order, since order-invariant fusion keeps attribute recognition consistent.

Ref

  • Tan, Z., Li, Y., Meng, S., Yuan, X., Li, W., Mo, T., Wang, B., & Chu, X. (2025, April 6–11). Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).