IAI +004
Influencing improvement
- Agent's component
- Agent's prior knowledge, which influence that mode lit builds
- the feedback available to learn from.
Components
- Given an intelligent agent that performs some intelligent tasks, any components of agent program can be improved by learning.
Prior knowledge
- Inductive learning (귀납적 학습): learning a general function or rule (possibly incorrect) from specific input-output pairs
- Bottom to top
- Specific to general
- Deductive learning (연역적 학습): going from a general rule to a new specific rule that is logically entailed, but is useful because it allows more efficient processing
- Top to bottom
- general to specific
Feedback
- feedback on its percept sequence
- no feedback on its percept sequence
- rewards for taking a sequence of actions based on its percept sequence
| - | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Training Data | labeled | unlabeled |
| Computational complexity | simpler | Computationally complex |
| Accuracy | high | less accurate |
Supervised learning
- the agent observes some examples of input-output pairs first and then learns a function or a relationship that maps from inputs to output.
- Attributes/Features: the inputs are independent variables in the problem domain
- Target attribute: the output is the dependent variable which is dependent on the inputs.
- Model: the learned function or relationship
- The agent learns a model using examples and uses this model to predict the outcomes for new inputs.
Unsupervised learning
- The agent collects adequate examples in the problem domain but it does not get any explicit feedback to the examples.
- The agent can make sense of the examples through identifying clusters or frequent patterns in the data.
- When shown a large number of examples, the agent can learn to identify clusters of similar examples.
Reinforcement learning
- the agent learns from a series of actions which can be rewards or punishments to improve its performance in completing the task under consideration.
- the feedback helps the agent to enforce positive actions and reduce the negative actions through adjusting the policy.
Supervised Learning Technique
- Decisinon tree
- Random forests
- Linear regrasssion
- Logistic regression
- K nearest neighbours
- Support vector machines
- Neural networks
Regression problem
- to predict a continuous value as the output for a given input
- weather temperature: solar radiation, wind direction and speed, geographic location..
- how to predict the output value of a new data instance on the basis of observed features from the existing data (historical examples) in the problem domain.
- Elements
- Collection of existing or historical data samples which are represented by a set of attributes or independent variables
- The output values of the existing data samples
- the output variable or attribute must be continuous
- Regressor: a function describes the relationship between the attributes of a data sample and the output.
- takes the values of attributes of a data sample and predicts the output value of this given data sample.
Evaluate a regressor
- R Square/Adjust R Square
- MSE Mean Square Error/RMSE Root Mean Square Error
- MAD Mean Absolute Error
Examples of regression problems
- Predict the fuel price using the Brent crude oil price, financial performance of the oil related companies (cash flow, projects lined up, etc.) and/or geopolitical risks (OPEC announcements, government sanctions, etc.)
- Predict the house price of a suburb from the suburb's profile
- Predict the blood pressure of a patient based on the patient's health profile
- Predict the electricity price using temperature, demand and time.
Classification problem
- to predict discrete or categorical value as the output for a given input
- Pass or Failed given learning outcomes, student ID, prior learning, attitude, commitment and attendance.
- how to put a new data instance into one of predefined categories or classes on the basis of observed features from the existing data in the problem domain.
- Elements
- Collection of existing or historical data samples with class labels
- Predefined categories or classes
- Adequate samples in each category or class in the existing or historical data.
- Logic-based techniques
- Decision tree
- Learning set of rules
- Perceptron-based techniques
- Single-layer perceptron
- Multi-layer perceptron
- RBF network
- SVM
- Statistical learning techniques
- Naive Bayes classifier
- Bayesian networks
- Instance-based learning
- K-nearest neighbor (KNN)
Evaluate a classifier
- Confusion matrix
- Precision
- Recall/Sensitivity
- Specificity
- F1-Score
- Area Under Curve & Receiver Operating Characteristics Curve (AUC-ROC)
Examples of classification problems
- banking, healthcare, medical diagnosis, marketing (sentiment aalysis), telecommunication, agriculture, security (fraud detection).
- e-mails into spam or non-spam class
- loan applications into an approved or a rejected class.
- patients into having a certain disease or not having that disease groups.
- text into positive or negative sentiment.
- customers into churn or non-churn classes.
Overfitting
- general phenomenon with all types of learning models.
- a modeling error that occurs when a function is too closely or exactly fit to a limited set of data points.
- more likely as the complexity of models and the number of input attributes increase
- less likely as the number of training examples is large.
Decision Tree
- if-then statements to define patterns in data
- A if-then statement splits the training data into two or more branches based on some values
- Best Split: The results of each branch should be as homogeneous as possible, or has the lowest impurity possible.
- Information gain
- Gini index
Implement Decision Tree
- the split (a feature and a condition) that leads to the lowest impurity in the resulting child nodes, in a greedy manner
- For categorical features: each unique value can be a split condition.
- For continuous features: midpoints between consecutive sorted unique values are used as split conditions.
- For each potential split condition, the algorithm calculates the impurity of the resulting child nodes.
- The lowest impurity node becomes the split point for that branch.
- The process is then repeated recursively for each child node until all leaf nodes are pure, or the stopping criteria are met.
The selectino of best split attributes
- ID3: employs a top-down, greedy search through the space of possible branches with no backtracking using information gain
- C4.5: using information gain ratio
- CART: using Gini Index
- Gini Index
- Chi-Square
- Reduction in Variance
Entropy
the fundamental quantity in information theory. It is a measure of the uncertainty of a random variable.
- the fundamental quantity in information theory. It is a measure of the uncertainty of a random variable
- A more homogeneous node with a clear majority class has low impurity and low entropy, while a more mixed distribution of classes has high impurity and high entropy.
Information gain
the decrease in entropy.
- The information gain from the attribute test on (split on A) is the expected reduction in entropy.
- Information gain computes the difference between entropy before the split and average entropy after the split of the dataset based on given attribute values
Gini index
- For classification, another impurity measure commonly used for classification tasks in decision trees.
- a lower Gini index indicates lower impurity, meaning that the samples in the node predominantly belong to a single class
- a bit more computational efficient than entropy as it does not involve logarithm calculations. but results are quite similar.
Variance
- For regression tree, the target variable is continuous rather than categorical.
- use variance as a measure of impurity in regression trees.
- lower variance indicates that the data points are closely clustered around the mean
Prediction
- For classification tasks: the predicted class label is the majority class among the training samples in the leaf node.
- For regression tasks: the predicted value is the mean of the target values of the training samples in the leaf node.
Dealing with Overfitting
- Overfitting: a common issue in decision trees, where the model captures noise or outliers in the training data rather than the underlying pattern.
- the model performs poorly when applied to new, unseen data.
Setting Stopping Criteria
- prevent the tree from becoming overly complex, which may lead to overfitting
- applied during the tree construction process
- limiting the maximum depth of the tree
- setting a minimum number of samples per leaf node
- requiring a minimum impurity decrease for a split
Pruning Strategies
- applied after the tree has been fully grown
- removing branches from the fully grown tree to simplify its structure
- ensure that it captures the underlying patterns in the data rather than noise or outliers
- Pruned trees perform significantly better than unpruned trees when the data contain a large amount of noise.
Ensemble Methods
- combine multiple decision trees to form a more robust and accurate model.
- address overfitting by averaging the predictions of the individual trees, reducing variance and improving generalization.
- Random Forests
- Gradient Boosted Trees
Random Forest
combines multiple weak decision tree models to create a stronger learning model.
- two types of randomness are introduced to ensure that the individual decision trees are diverse and less prone to overfitting.
- Random sampling of the input data
- Bootstraping:
- involves sampling with replacement 복원추출 (meaning that some instances appearing multiple times and others not appearing) from the original dataset, creating a new dataset.
- each decision tree is trained on a slightly different set of data points, reducing the likelihood of overfitting.
- Random selection of features at each split
- At each split in each decision tree, a random subset of features is considered when determining the best split.
- each tree in the ensemble does not rely on the same set of features for making decisions, resulting in a more diverse set of trees.
- By considering only a subset of features at each split, the model is less likely to be influenced by a small number of dominant features, leading to a more balanced and accurate prediction.
| 구분 | 데이터 무작위성 (Bootstrapping) | 속성 무작위성 (Feature Subset Selection) |
|---|---|---|
| 적용 위치 | 트리 훈련 데이터 선택 단계 | 트리의 각 분할(split) 단계 |
| 방법 | 원본 데이터셋에서 복원 추출(with replacement)로 샘플링하여 새로운 학습용 부분집합 생성 | 전체 속성 중 무작위로 일부 속성만 선택 후, 그 속성들로만 분할 기준 탐색 |
| 특징 | - 각 나무가 다른 데이터 포인트로 학습됨 - 일부 샘플은 여러 번 등장, 일부는 제외될 수 있음 | - 각 분할이 다른 속성을 사용 가능 - 동일한 속성에 과도하게 의존하지 않음 |
| 효과 | - 트리 간의 데이터 다양성 확보 - 과적합 감소 | - 트리 간의 속성 다양성 확보 - 소수 지배적 속성의 영향 축소 |
| 결과 | 더 다양한 데이터 시나리오를 반영한 트리들 생성 | 더 다양한 의사결정 규칙을 반영한 트리들 생성 |
Predict with Random Forest
- aggregating the predictions of all individual decision trees in the forest.
- Majority voting: For classification, Count the number of times each class is predicted by the individual decision trees. The class with the highest count is considered as the final prediction.
- Averaging: For regression, Calculate the mean of the predictions made by the individual decision trees. The mean value is considered as the final prediction.
Linear regression
a learning technique that finds a linear relationship between input variables and the target variable based on a fundamental assumption that there is a linear relationship between input variables and the target variable
- e.g. the input variables (engine size, weight and car age) ➡️ target variable (car fuel efficiency)
- assumption that there is a linear relationship
- A linear regression technique learns a set of coefficients to estimate the linear relationship between and , denoted as , which can be represented by the following equation.
- is a weight vector
- linear regression model is an approximate function between the input variables and the target variable, there will be an error between the output of the model and the actual output value for a data sample
- This error can be represented by a loss function, which calculates the mean square error
- for solving regression problems
Solving a linear regression problem
- to find the best linear relationship that best fits the training data of data samples.
- makes the loss to be minimised.
- to find the best weight vector , such that for a given training dataset of data samples.
- gradient descent: continuously resamples the gradient of the weight coefficients in the opposite direction depending on the weight .
- Until the loss function reaches the global minimum
- to change the individual components of a little bit in the direction that minimises , and to do this many times.
-
- : the step size, the learning rate
- Training model: the process of iteratively updating weights with a learning rate to minimise loss, where the final weight vector defines the model used for predicting new data.
- use regularisation on a multivariate linear function to avoid overfitting.
- Batch gradient descent: consider the entire training dataset at once.
- Stochastic gradient descent (SGD): consider only a single training data sample at a time.
- can be used in an online setting, where new data is coming one at a time, or offline, where we cycle through the same data as many times as is necessary, taking a step after considering each single example.
- With a fixed learning rate , the stochastic version does not guarantee convergence.
- often faster than batch gradient descent.
- With a schedule of decreasing learning rates (SA), the stochastic version does guarantee convergence.
- These update rules are derived as the next weight update equations by taking the partial derivatives of the loss function with respect to and .
Logistic Regression
an extension of linear regression in such a way that the output of a linear regression model goes through a logistic function
- The output value of this logistic function is between 0 and 1.
- 0 is for certainly being labeled "0" and 1 is for certainly being labeled "1", and a value between 0 and 1 represents the probability of being labeled "1"
- a logistic regression model: a linear regression model + a logistic function
- mainly for solving classification problems
Nearest Neighbor
a technique to predict the output of a given new sample based on a collection of existing samples.
- is to find the k-nearest neighbours of given sample in the collection and determine the output based on these k neighbours.
- k is always chosen to be an odd number.
- can be used for both classification and regression problems.
- For classification: majority vote of the neighbours.
- For regression: mean/median (or regression) of the neighbours.
- Instance-based learning
- KNN does not learn a separate model.
- Instead, it stores all training data and uses them directly at prediction time.
- Non-parametric model
- KNN has no parameters (like weights in linear regression) to train.
- The model is essentially the full dataset plus a distance measure.
Distance measures
- Minkowski distance or norm
- Euclidean distance: , for the dimensions are measuring similar properties, such as the width, height and depth of 3D objects.
- Manhattan distance: , for the dimensions are measuring dissimilar properties, such as age, weight, and gender of a patient.
- Hamming distance: the number of attributes on which the two points differ, for Boolean attribute values
Nomarlization
- use the raw data from each dimension then the total distance will be affected by a change in scale in any dimension
- To avoid this, apply normalization to the measurements in each dimension.
- to compute the mean and standard deviation of the values in each dimension, and rescale them
- The rescaling is done using the formula:
- where is the normalized value, is the original value, is the mean, and is the standard deviation.
Time complexity
- Conceptually trivial: Given a set of N examples and a query , iterate through the examples, measure the distance to from each one, and keep the best k.
- 's time complexity is , N is the number of examples in the training dataset.
- Use a k-dimensional tree: a balanced binary tree with an arbitrary number of dimensions.
- Time complexity can be improved to
- appropriate only when there are many more examples than dimensions
- It works well with up to 10 dimensions with thousands of examples.
- Use a Hash table with a locality-sensitive hash (LSH)
- Time complexity can be improved to
SVM
a framework for finding a boundary that distinctly classifies the data points in an optimal way.
- supervised learning, binary classification
- SVM chooses the boundary with the maximum possible geometric margin, which has the largest distance to the nearest training data points of any class
- initially designed for binary classification problems but can also be applied for solving multi-class classification problems
Linear discriminant
- is multiplied by its matching weight
- all these products are added together and passed to a threshold function
- Decision surface: if then else
- Decision function:
- To make a decision, the continuous value is passed through the sign function so that it outputs either +1 or -1.
- If the data from the two classes can be separated with a hyperplane, linearly separable.
Hyperplane
- separates the data in 2D by a line or in 3D by a plane
- The orientation of the hyperplane is given by the vector
- the location of the hyperplane is given by
- The distance from the origin to the hyperplane is
- If a given data sample and , then this data sample is on the separation boundary. It can normally be assigned to any class.
- geometric margin: the minimum distance between the samples and the hyperplane by constructing and solving a constrained optimization problem
- primary optimization problem: to maximize the minimal geometric distance across the training dataset of m samples.
- dual optimization problem: easier to solve. More importantly the dual optimisation problem enables the so-called kernel trick in SVM
Attractive Properties
- SVMs construct a maximum margin separator
- the largest possible distance to example points, helping to improve generalization
- SVMs create a linear separating hyperplane
- kernel trick: to embed the data into a higher-dimensional space
- Often data that are not linearly separable in the original input space are easily separable in a higher-dimensional space
- In general (excepted some special cases) if we have data points then they will always be separable in spaces of dimensions or more
- SVMs are a nonparametric method
- retain training examples and potentially need to store them all
- In practice, they often end up retaining only a small fraction of examples
- have the flexibility to represent complex functions, but they are resistant to overfitting
- not usually expect to find a linear separator in the input space , but we can find linear separators in the high-dimensional feature space simply by replacing in
-
- with
- can often be computed without first computing for each point.
-
- In a higher dimensional feature space, which is created by transformation , if we can express , the kernel function can be applied to pairs of input data to evaluate dot product in some corresponding feature space.
- kernel trick is to plug a kernel function into the dual optimisation problem to replace
- Optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.
- we can learn in the higher-dimensional space, but we compute only kernel functions rather than the full list of features for each data point.
Classification evaluation metrics
| 용어 | 설명 |
|---|---|
| True Positive (TP) | 실제 1, 예측 1 |
| True Negative (TN) | 실제 0, 예측 0 |
| False Positive (FP) | 실제 0, 예측 1 (0을 잘못 1로 예측) |
| False Negative (FN) | 실제 1, 예측 0 (1을 놓쳐서 0으로 예측) |
Accuracy
the proportion of correctly classified instances (data points or samples) among the total instances.
- a ratio of the number of correct predictions
- it may not be suitable for imbalanced datasets where the class distribution is skewed
- a model that always predicts the majority class will have high accuracy but may not be useful in practice.
Precision
the proportion of true positives among all positive predictions.
- the model's ability to not mistakenly view negatives as positives
- A high precision value indicates that the model has made fewer false positive predictions.
- it's useful to minimize the number of false positives.
Recall
Sensitivity, True Positive Rate, the proportion of true positive instances among the actual positive instances
- the model's ability to not mistakenly view actual positives as negatives
- A high recall value indicates that the model has successfully identified a large portion of the actual positive instances
- it's useful when the cost of false negatives is high
- e.g. in medical diagnosis, where failing to identify a disease can have severe consequences
F1 Score
the harmonic mean of precision and recall
- a balanced evaluation of the model's performance
- It is particularly useful when dealing with imbalanced datasets, where one class is significantly more prevalent than the other.
- It's useful for imbalanced datasets to balance precision and recall.
Regression evaluation metrics
Mean Absolute Error (MAE)
the average of the absolute differences between the predicted values and the actual values
- It measures the average magnitude of errors made by the model, without considering their direction
- A lower MAE value indicates that the model has made smaller prediction errors
Mean Squared Error (MSE)
the average squared difference between the predicted and actual values
- It's useful when you want to penalise larger errors more heavily, making it more sensitive to outliers at the same time
- often used as a loss function when training regression models
- A lower MSE value indicates that the model has smaller prediction errors, with a strong preference for avoiding large errors.
Root Mean Squared Error (RMSE)
the square root of MSE
- it more interpretable as it is in the same units as the dependent variable.
R-Squared (Coefficient of Determination)
how well the regression model approximates the actual data
- the proportion of "sum squared regression (SSR)" and "total sum of squares (SST)"
- SSR obviously captures the model's prediction errors
- SST is the variance of the target variable.
- can be viewed as a naive model () using the average value of the target variable as the prediction
- : The model predicts perfectly (the error is 0).
- : The model does no better than a naive model that always predicts the mean of the target variable.
- : The model performs worse than simply predicting the mean, meaning its predictions increase the error compared to the naive baseline.








