"iai" 태그로 연결된 12개 게시물개의 게시물이 있습니다.

Structural Knowledge

Structural knowledge: the relationship between concepts and objects in the world.
Hierarchical approach: built on classification and uses hierarchies as the structures for knowledge representation, presented in a graphical format.
- Semantic networks: nodes represent concepts, and edges represent relationships between concepts.
- Ontologies
  - A formal specification of concepts, relationships, and constraints within a domain that enables machines to reason automatically.
  - It defines classes (concepts), subclasses (hierarchical relations), and properties (attributes or relations) that describe how entities interact.
  - Unlike simple taxonomies, ontologies also include logical axioms and constraints that specify allowable relationships and permit inference and consistency checking through automated reasoning.
  - Classes: Main concepts or categories (e.g. Person, Animal).
  - Subclasses: Subcategories (e.g. Dog ⊂ Animal).
  - Properties: Attributes (e.g. hasPart, hasColor).
  - Relations: Connection rules (e.g. MemberOf, SubsetOf).
  - Constraints: Logical constraints (e.g. disjointness, transitivity) that enable inference.

Requirement of hierarchices

Inclusiveness: Dog ⊆ Mammal ⊆ Animal
Species/differentia: Dog = Mammal + (barks, canine traits, etc.)
Inheritance: Mammal: has_fur, gives_live_birth → Dog: inherits all Mammal properties
Transitivity: If Dog ⊆ Mammal and Mammal ⊆ Animal, then Dog ⊆ Animal
Systematic and predictable rules for association and distinction
Mutual exclusivity: Reptile ∩ Mammal = ∅
Necessary and sfficient criteria:
- Necessary: $\forall x,\ \text{Mammal}(x) \Rightarrow \text{Vertebrate}(x) \land \text{ProducesMilk}(x)$
- Sufficient: $\forall x,\ \text{Vertebrate}(x) \land \text{ProducesMilk}(x) \Rightarrow \text{Mammal}(x)$

Advantage of hierarchical approach

Inferring from incomplete evidence (if the shared criteria are not obvious or easily observable).
- Animal → Mammal → Dog: If an entity is classified as a Mammal, we can infer properties such as having fur and giving live birth, even if the entity is not explicitly identified as a Dog.
Excellent representations in mature domains
- Domains where entities and relationships are well understood and stable
- e.g. medical diagnosis, biological taxonomy, type systems in programming languages.
Useful for entities that are well defined and have clear class boudnaries.
- Good fit: HTTP status codes, chemical elements, and biological species
- Poor fit: emotions, social roles, and cultural practices
Some theory or model is necessary to guide the identification
- Provides criteria for defining entities and relationships
- e.g. Evolutionary theory in biological taxonomy, Type theory in programming languages

Partition

A partition of a category is a set of subcategories that form a disjoint, exhaustive composition of that category.

Disjoint: Two or more categories are disjoint if they don't share common members.
Exhaustive composition: The subcategories together cover all members of the parent category, leaving no member unclassified.
Examples:
- ❌ Category: Animal (Not a partition)
  - Mammal, Bird (Reptiles, fish, insects are missing, not exhaustive)
- ✅ Category: Integer (Partition)
  - Even, Odd (Disjoint and exhaustive)

Physical composition

PartOf Relation: Partof(a, b) is a relation representing that one thing, 'a', is a part of another thing, 'b'.
BunchOf Relation: BunchOf(a) is a relation, taking a set of objects 'a', to represent a composite object made up of those parts.
Examples:
- Partof(Wheel, Car): A wheel is part of a car (one-to-one relation).
- BunchOf({Wheel1, Wheel2, Wheel3, Wheel4}): A car is a bunch of four wheels (many-to-one relation).
Link between PartOf and BunchOf:
- $\forall x (x \in s \implies PartOf(x, BunchOf(s)))$
- $\forall y \Big[\big(\forall x (x \in s \implies PartOf(x, y)\big) \implies PartOf(BunchOf(s), y)\Big]$
Why useful?
- Reasoning from individual parts -> group -> larger object.
- Avoiding ambiguity between: "this thing is part of", "these things together form"
- Without BunchOf, ontologies cannot represent: piles, colleciton, aggregates, composite physical structures.

Measurements

Quantitive measures (Ratio, Interval)
- Represented as numbers with units
- Support arithmetic and unit conversion
- Enable numeric reasoing (e.g. 2.54cm = 1 inch)
Non-quantitative measures (Ordinal)
- Cannot be meaningufully represented as numbers
- Can still be compared using ordering relactions (<, >, =)
- Suppor qualitative reasoning (e.g. one task is more difficult than another)

Objects

Stuff
- Represents substances
- Uncountable masses
- Definitions include only intrinsic properties (e.g. Butter, Unsalted Butter)
- $b \in Butter \land PartOf(p, b) \implies p \in Butter$
Things
- Represents discrete objects
- Countable entities
- Definitions include extrinsic properties (e.g. PoundOfButter, StickOfButter)
  - It depends on measurement
  - It depends on contextual constraints

Time

Fluent: a condition whose truth value can change over time.
- e.g. "The box is on the table", $On(box, table)$ , $On(box, table, t)$
- a time-dependent proposition.
Time scale and absolute time
- Ontology represents time along a single continuous timeline with a fixed reference point.
- $Date(0, 20, 21, 24, 1, 1995) = Seconds(300000000)$
- It allows arithmetic operations on time and comparisons between time points.
Time intertvals
- Time can be represented as moments (instants) or intervals (durations).
- $Duration(i) = Time(End(i)) - Time(Start(i))$
Partition of time
- $Partition({Moments, ExtendedIntervals}, Intervals)$
- All time intervals can be partitioned into moments and extended (non-zero-length) intervals.
- It provides a complete and precise ontology of time.

Event calculus

$T(f, t_1, t_2)$ : Fluent $f$ is true for all time between time $t_1$ and $t_2$ .
$Happens(e, t_1, t_2)$ : Event $e$ start at time $t_1$ and ends at time $t_2$ .
$Initiate(e, f, t)$ : Event $e$ causes fluent $f$ to become true at time $t$ .
$Terminates(e, f, t)$ : Event $e$ causes fluent $f$ to cease to be true at time $t$ .
$Initiated(f, t_1, t_2)$ : Fluent $f$ becomes true at some point between $t_1$ and $t_2$ .
$Terminated(f, t_1, t_2)$ : Fluent $f$ ceases to be true at some point between $t_1$ and $t_2$ .
$t_1 < t_2$ : Time point $t_1$ occurs before time point $t_2$ .
Happens(PutBoxOnTable, 10, 12)
Initiate(PutBoxOnTable, On(Box, Table), 12)
- Initiated(On(box, table), 10, 12)
T(On(Box, Table), 12, 20)
Terminates(RemoveBoxFromTable, On(Box, Table), 20)
- Terminated(On(box, table), 18, 22)

Successor-state axiom

후속상태공리

Define how the world changes after an action occurs.
- what changes when an action happens
- what stays the same.
Without successor-state axioms, we face the frame problem:
- after every action, we would need to explicitly list all facts that did not change.
$Move(A, B, X)$ $M o v e (A, B, X)$ : moving block A from the top of block B to position X.
- Preconditions:
  - $On(A, B)$ : A is on top of B
  - $Clear(A)$ : nothing is on top of A
  - $Clear(X)$ : nothing is on position X
- Effects:
  - $On(A, X)$ : A is now on position X
  - $Clear(B)$ : B is now clear
  - $Clear(X)$ : X is no longer clear (X is now occupied by A)

Semantic networks

Visually represent a knowledge base.
Support efficient inference.
Allow properties of an object ot be inferred from its category membership.
Representing individual objects, categories of objects, and relations among objects.
Categories are the primary buildling blokcs of large-scale knowledge representation schemes.

Taxonomy hierarchy

A hierarchical structure of categories
Each lower category is a more specific kind of its parent
Organized from general → specific

University Ontology

Entity	Class	Example Statemnets
Alice	Student	$enrolledIn(CS101)$ , $assessedBy(Prof.Smith)$
Prof.Smith	Lecturer	$memberOf(Lecturer)$ , $teaches(CS101)$
CS101	Course	$belongsTo(CSDepartment)$

Knowledge graph

represents information and its relationships using a graph structure.
Nodes: entities or concepts (e.g. people, places, things).
Edges: relationships between nodes (e.g. "is a", "part of", "located in").

Types of Knowledge Graphs

General Knowledge Graphs
Domain-Specific Knowledge Graphs
Semantic Knowledge Graphs
Social Knowledge Graphs
Temporal Knowledge Graphs
Special Knowledge Graphs
Statistical Knowledge Graphs
Probabilistic Knowledge Graphs
Textual Knowledge Graphs
Multi-modal Knowledge Graphs

General Knowledge Graphs

Comprehensive information representation
Entity-Relationship Structure
Linked Data
Semantic Enrichment
Capabilities & Use Cases
- Querying and Analysis
- Data Integration
- ML and AI applications

Examples

DBPedia: RDF, resource descritions framework
Wikidata
YAGO
Google Knowledge Graph
Microsoft Academic Graph
IBM Watson Knowledge Studio

Reasoning for categories

Infer the presence of certain objects from perceptual input
Infer category membership from perceived properties
Use category information to make predictions
It enables an agent to identify objects from observed properties and to predict further characteristics using category knowledge.

Reasoning using Semantic networks

Semantic networks are:
- systems designed specifically for organizing categories reasoning with categories.
- Provide graphical representations of a knowledge base
Using a semanctic network, reasoning can be performed based on:
- relationships between objects
- category membership
- inheritance between categories
- properties associated with categories
- Allows an object to inherit general knowledge from its category.
The inheritance algorithm:
- Starts from the object itself
- Follows links upwards through the category hierarchy
- Stops as soon as it finds a value for the property
- This suppors efficient reasoning, default values, and exception handling.

RDF

Resource Description Framework (RDF)

Triple Structure: subject -> predicate -> object
Subject: represents the resource being described or identified by a URI.
Predicate (or Property): desribes the relationship between the subject and the object. also represented by a URI.
Object: represents the value or target of the relationship. It can be a URI or a literal (a string, number, or date).

Uncertain Domain

Agent may need to handle uncertainty
Environment is partially observable
- Fully observable: An agent knows where it is.
- The drone never knows its exact position, only an estimated one.
- The true user intent is not directly observable, only noisy speech data.
- It cannot see the whole environment or detect humans perfectly.
- The true health state (disease present or not) is hidden.
Environment is non-deterministic
- AI Trading Agent
  - even if the agent takes the same action, the outcome can vary dramatically.
  - from market prices changes unpredictably due to external events (news, other traders, economic data).
  - network latency, competing traders' behavior are stochasic and independent, random fluctuations can affect the outcome.
  - buy action could lead to profit, loss, or no change depending on uncontrollable external factors.
- AI Helpdesk Chatbot
  - even when the chatbot executes the same action, it sends queries to multiple backend systems, the result may differ each time.
  - network latency or failures, backend load, concurrent requests, and external dependencies can lead to different response times or outcomes.
  - same command doesn't guarantee the same outcome.
Autonomous Driving
- Partial Observability
  - Sensors cannot detect everything, blind spots, weather effects, occluded vehicles.
- Non-Determinism
  - Even if the car signals a turn and starts moving, other drivers might brake suddenly, pedestrians might cross unexpectedly, or traffic lights might malfunction.
  - depending on random external events.
Medical Diagnosis & Treatment Assistant
- Partial Observability
  - The patient's internal health state is not directly visible.
  - tests can be inaccurate or incomplete: false positives/negatives, missing data
- Non-Determinism
  - The same treatment can have different effects on different patients due to genetics, lifestyle, or random biological responses.
  - A drug may succeed, fail, or cause side effects.

Bayes' Theorem

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

$P(A|B)$ : Posterior probability, the probability of event A given taht B has occurred.
$P(B|A)$ : Likelihood, the probability of event B occurring given that A is true.
$P(A)$ : Prior probability, the initial probability of A being true before considering B.
$P(B)$ : Evidence, the total probability of B occurring.
examples
- 1% of people have a certain disease (prior probability). $P(Disease) = 0.01$
- the test for the disease is 99% accurate (likelihood).
  - $P(PositiveTest|Disease) = 0.99$
  - $P(NegativeTest|NoDisease) = 0.99$
  - so $P(PositiveTest|NoDisease) = 0.01$
- $P(PositiveTest) = P(PositiveTest|Disease) \cdot P(Disease) + P(PositiveTest|NoDisease) \cdot P(NoDisease)$ $P (P os i t i v e T es t) = P (P os i t i v e T es t ∣ D i se a se) \cdot P (D i se a se) + P (P os i t i v e T es t ∣ N oD i se a se) \cdot P (N oD i se a se)$
  - $= 0.99 \cdot 0.01 + 0.01 \cdot 0.99 = 0.0198$
- $P(Disease|PositiveTest) = \frac{P(PositiveTest|Disease) \cdot P(Disease)}{P(PositiveTest)}$ $P (D i se a se ∣ P os i t i v e T es t) = \frac{P ( P os i t i v e T es t ∣ D i se a se ) \cdot P ( D i se a se )}{P ( P os i t i v e T es t )}$
  - $= \frac{0.99 \cdot 0.01}{0.0198} \approx 0.5$
the conditional probability $P(effect|cause)$ quantifies the relationship in the causal direction from cause to effect.
but $P(cause|effect)$ $P (c a u se ∣ e ff ec t)$ is often what we really want to know, describing the relationship in the diagnostic direction from effect to cause.
- In medical dignosis, the doctor knows $P(symptoms|disease)$ from medical studies, and want to derive a $P(disease|symptoms)$ for a particular patient.
- $P(disease|symptoms) = \frac{P(symptoms|disease) \cdot P(disease)}{P(symptoms)}$

General Form of Bayes' Rule

$P(Y|X) = \alpha \cdot P(X|Y) \cdot P(Y)$

$\alpha$ is the normalization constant needed to make the entries in $P(Y|X)$ sum to 1 for each value of X.
conditional independency
- two variables X and Y are conditionally independent given a third variable Z if the value of X provides no information about Y once Z is known.
- $P(X, Y|Z) = P(X|Z) \cdot P(Y|Z)$

Constructing Bayesian Network

constructing a Bayesian network in which resulting joint probability distributions are a good representation of the agent's knowledge.
1. $P(x_1, ..., x_n) = \prod_{i=1}^{n} P(x_i | \text{parents}(x_i))$
1. rewrite the entries in the joint probability distribution in terms of conditional probabilities using the product rule.
- $P(x_1 \land ... \land x_n) = P(x_n | x_{n-1} \land ... \land x_{1}) \cdot P(x_{n-1} \land ... \land x_{1})$
1. repeat step 2 until all variables are included.
- $P(x_1, ..., x_n) = P(x_n | x_{n-1} \land ... \land x_{1}) \cdot P(x_{n-1} | x_{n-2} \land ... \land x_{1}) \cdots P(x_2 | x_1) \cdot P(x_1)$

Knowledge Representation

The agent's knowledge can at best provide only a degree of belief in the relevant logical sentences.
A probabilistic agent may have a numerical degree of belief between 0 and 1.

Bayesian Network

a directed graph
each node repressents a random variable and edge with an arrow from X to Y represents that X is a parent of Y.
each node is associated with a probability distribution
if a node Y has parents X, the Y node has a conditional probability distribution with respect to its parents X, i.e. $P(Y| \text{parents}(Y))$
if a node has no any parents, it has an unconditional probability distribution.

Burglary Example

a new burglar alarm installed at home.
it is reliable at detecting a burglary, but also responds on occasion to minor earthquakes
you have two neighbors, John and Mary who promised to call you at work when they hear the alarm.
given the evidence of who has or has not called, estimate the probability of a burglary.
Knowledge representation as the probability of each event that can occur in this case.

$P(\text{event}_{j}) = \frac{\text{number of ccurrences of event }_j}{\text{total number of experiments}}$

calulate the full joint probability distribution for all relevant variables.
- specifying probabilities for all possible events one by one is unnatural and tedious.

$P(B, E, A, J, M) = \prod_{i}^{5} P(X_i | \text{parents}(X_i)) = P(B) \cdot P(E) \cdot P(A|B,E) \cdot P(J|A) \cdot P(M|A)$

\begin{align*} & P(M, J, A, E, B) = P(M | J, A, E, B) \cdot P(J, A, E, B) \\ & P(J | A, E, B) \cdot P(A, E, B) \\ & \quad P(A | E, B) \cdot P(E, B) \\ & \quad \quad P(E | B) \cdot P(B) \\ & = P(M|J,A,E,B) \cdot P(J|A,E,B) \cdot P(A|E,B) \cdot P(E|B) \cdot P(B) \\ & = P(M|J,A,E,B) \cdot P(J|A,E,B) \cdot P(A|E,B) \cdot P(E) \cdot P(B) \\ & \quad i.e.\space P(E|B) = P(E) \text{ (Earthquake is independent of Burglary)} \\ & = P(M|A) \cdot P(J|A) \cdot P(A|E,B) \cdot P(E) \cdot P(B) \\ & \quad i.e.\space P(J|A,E,B) = P(J|A) \text{ (JohnCalls depends only on Alarm)} \\ & \quad i.e.\space P(M|J,A,E,B) = P(M|A) \text{ (MaryCalls depends only on Alarm)} \end{align*}

$P(M, J, A, E, B) = P(B) \cdot P(E) \cdot P(A|B,E) \cdot P(J|A) \cdot P(M|A)$

Probabilistic reasnoing

to explain how to use network models to reason under uncertainty.
Exact inference: calculate the exact posterior probabilities.
Approximate inference: Stochastic approximation techniques such as likelihood weighting and Markov Chain Monte Carlo
- can give reasonable estimates of the true posterior probabilities in a network
- can cope with much larger networks than exact inference algorithms.
Static worlds: Each random variable has a single fixed value.
Dynamic worlds: view the world as a series of snapshots, or time slices.
- Each of snapshots contains a set of random variables, some observed and some hidden.
- The values of these variables can change over time.
- With the assumption of state sequences are Markov, the future is independent of the past given the present.

Inference in Bayesian Networks

Computing the posterior probability distribution of a set of query variables, given some observed events, called some assignment of values to a set of evidence variables.
$X$ $X$ : query variable(s)
- is it raining?
$E$ $E$ : the set of evidence variables with observed values $E_{1}, ..., E_{m}$ $E_{1}, ..., E_{m}$
- weather report says it is cloudy.
- weather report says there is high humidity.
$e$ $e$ : a particular observed event.
- $E_{1} = \text{cloudy}$
- $E_{2} = \text{high humidity}$
$Y$ $Y$ : the non-evidence, non-query variables.
- the other variables in the network.
$X = \{X\} \cup E \cup Y$ $X = {X} \cup E \cup Y$
- the complete set of variables in the network.
$P(X|e)$ : the posterior probability distribution of the query variable(s) given the evidence.
$P(X|e) = \alpha \sum_{y} P(X, e, y)$

Exact inference in Bayesian Networks

One exact inference method is called by inference by enumeration.
Observed some event $A = true$ $A = t r u e$ and $B = true$ $B = t r u e$ , want to compute the posterior probability distribution $P(C|A, B)$ $P (C ∣ A, B)$ .
- Use the full conditional probability table
- Use the Bayesian Network
It calculates all the required probability components in each branch in the evaluation tree, and this results in repetitive calculations.
To avoid this, we can use variable elimination algorithm and clustering algorithms.
Calculation of probability distribution for a query variable is complexity exponential.
Enumeration method can become intractable in large, multiple connected networks. -> use approximate inference methods.

For any Bayesian network with given nodes, $X = \{X_1, X_2, ..., X_n\}$ , the joint probability distribution is given by: $P(X) = P(X_1 \land X_2 \land \ldots \land X_n) = \prod_{i=1}^{n} P(X_i | \text{parents}(X_i))$

Using the Bayesian network, we can compute the conditional probability.

$P(B | event) = \alpha \sum_{e}\sum_{a} P(B, E, A, j, m)$

$i=1, P(X_1 | \text{parents}(X_1)) = P(B)$
$i=2, P(X_2 | \text{parents}(X_2)) = P(E)$
$i=3, P(X_3 | \text{parents}(X_3)) = P(A|B,E)$
$i=4, P(X_4 | \text{parents}(X_4)) = P(j|A)$
$i=5, P(X_5 | \text{parents}(X_5)) = P(m|A)$
$P(B, E, A, j, m) = P(B) \cdot P(E) \cdot P(A|B,E) \cdot P(j|A) \cdot P(m|A)$

\begin{align*} & P(B | event) = \alpha \sum_{e}\sum_{a} P(B, E, A, j, m) \\ & = \alpha \sum_{e}\sum_{a} P(B) \cdot P(E) \cdot P(A|B,E) \cdot P(j|A) \cdot P(m|A) \\ & = \alpha \cdot P(B) \sum_{e} P(E) \sum_{a} P(A|B,E) \cdot P(j|A) \cdot P(m|A) \end{align*}

Berglary example calculation:

\begin{align*} & P(b | event) = \alpha P(b) \sum_{e} P(E) \sum_{a} P(A|b,E) \cdot P(j|A) \cdot P(m|A) \\ & = \alpha P(b) \bigg[ P(e) \sum_{a} P(A|b,e) \cdot P(j|A) \cdot P(m|A) + P(\neg e) \sum_{a} P(A|b,\neg e) \cdot P(j|A) \cdot P(m|A) \bigg] \\ & \quad \sum_{a} P(A,b,e) \cdot P(j|A) \cdot P(m|A) \\ & \quad = P(a,b,e) \cdot P(j|a) \cdot P(m|a) + P(\neg a,b,e) \cdot P(j|\neg a) \cdot P(m|\neg a) \\ & \quad \sum_{a} P(A|b,\neg e) \cdot P(j|A) \cdot P(m|A) \\ & \quad = P(a,b,\neg e) \cdot P(j|a) \cdot P(m|a) + P(\neg a,b,\neg e) \cdot P(j|\neg a) \cdot P(m|\neg a) \\ & = \alpha P(b) \bigg[ P(e) \big( P(a|b,e) \cdot P(j|a) \cdot P(m|a) + P(\neg a|b,e) \cdot P(j|\neg a) \cdot P(m|\neg a) \big) \\ & \quad \quad + P(\neg e) \big( P(a|b,\neg e) \cdot P(j|a) \cdot P(m|a) + P(\neg a|b,\neg e) \cdot P(j|\neg a) \cdot P(m|\neg a) \big) \bigg] \\ \end{align*}

IAI +010

2025년 10월 12일 · 약 11분

Eunkwang Shin

Owner

Knowledge representation

Intelligent agents need knowledge about the world in order to reach good decisions.
Declarative knowledge is represented in a form of sentences in a knowledge representation language and stored in a knowledge base.
Knowledge base is used by an inference engine to infer a new sentence which will be used for the agent to decide what action to take next.
Formal languages are defined by its grammar and semantic rules
- grammar: defines the syntax of legal sentences
- semantic rules: defines the meaning.
Common knowledge representation formalisms:
- Propositional logic
- First-order logic
- Fuzzy logic
- Semantic networks
- Ontologies

Propositional logic

a declarative language in which can handle propositions that are known true, known false, or completely (unknown true or false)
a BNK (Backus-Naur Form) grammer of sentences in propositional logic.
- $\neg$ = NOT
- $\land$ = AND
- $\lor$ = OR
- $\implies$ = IMPLIES
- $\iff$ = IF AND ONLY IF
Negation: a sentence using $\neg$ is called negation
Literal: either an atomic sentence or a negated atomic sentence
Conjunction: two sentences connected by $\land$ . Eac hof them is called conjunct
Disjunction: two sentences connected by $\lor$ . Each of them is called disjunct
Implication: two sentences connected by $\implies$ $⟹$ . $P \implies Q$ $P ⟹ Q$ . P is called premise or antecedent and Q is the conclusion or consequent.
- An implication is if-then statement.

Truth tables

P	Q	$\neg$ P	P $\land$ Q	P $\lor$ Q	P $\implies$ Q	P $\iff$ Q
T	T	F	T	T	T	T
T	F	F	F	T	F	F
F	T	T	F	T	T	F
F	F	T	F	F	T	T

FOL, First-order logic

a declarative language
Syntax of FOL builds on that of propositional logic
terms to represent objects, universal quantifier, and existential quantifier
a model in FOL must provide the information required to determine the truth value of every atomic sentence in the language.
$\forall x P(x)$ means for all x, P(x) is true
$\exists x P(x)$ means there exists an x such that P(x) is true

Logical equivalence

logical equivalence	meaning
$(\alpha \land \beta) \equiv (\beta \land \alpha)$	Commutativity of $\land$
$(\alpha \lor \beta) \equiv (\beta \lor \alpha)$	Commutativity of $\lor$
$(\alpha \land (\beta \land \gamma)) \equiv ((\alpha \land \beta) \land \gamma)$	Associativity of $\land$
$(\alpha \lor (\beta \lor \gamma)) \equiv ((\alpha \lor \beta) \lor \gamma)$	Associativity of $\lor$
$\neg(\neg \alpha) \equiv \alpha$	Double negation elimination
$(\alpha \implies \beta) \equiv (\neg \alpha \lor \beta)$	Implication elimination
$(\alpha \implies \beta) \equiv (\neg \beta \implies \neg \alpha)$	Contraposition
( $\alpha \iff \beta) \equiv ((\alpha \implies \beta) \land (\beta \implies \alpha))$	Biconditional elimination
$\neg(\alpha \land \beta) \equiv (\neg \alpha \lor \neg \beta)$	De Morgan's law for $\land$
$\neg(\alpha \lor \beta) \equiv (\neg \alpha \land \neg \beta)$	De Morgan's law for $\lor$
$(\alpha \land (\beta \lor \gamma)) \equiv ((\alpha \land \beta) \lor (\alpha \land \gamma))$	Distributivity of $\land$ over $\lor$
$(\alpha \lor (\beta \land \gamma)) \equiv ((\alpha \lor \beta) \land (\alpha \lor \gamma))$	Distributivity of $\lor$ over $\land$

$\forall x \neg P(x) \equiv \neg \exists x P(x)$

For all x, not P(x) is logically equivalent to "it is not the case that there exists an x such that P(x) is true."

Reasoning

Deductive reasoning

a process of reasoning from one or more statements (premises) to reach a logical conclusion.
first premise, second premise, therefore conclusion.

Inductive reasoning

the process of resoning from specific observations to borader generalizations and theories.
also described as a method where one's experiences and observations are synthesized to cmoe up with a general truth.
premises and then conclusion

Inference

steps in reasoning
moves from premises to logical consequences
In AI Context, inference is to derive new logical sentences (as the conclusion) from existing logical sentences (as premises).
researchers develop automated inference systems to emulate human inference.

Inference Problem

$KB \models \gamma$

KB, Knowledge Base, is a set of propositions that represent what is known about the world.
$\gamma$ , Query sentence, is the target conclusion which needs to be confirmed based on the given KB.
where $\models$ denotes the relation of logical entailment between KB and the sentence $\gamma$ , reading as "KB entails $\gamma$ " or "if KB is true, then $\gamma$ must also be true".
$\alpha \models \beta \iff M(\alpha) \subseteq M(\beta)$ $α ⊨ β ⟺ M (α) \subseteq M (β)$
- $M(\alpha)$ is the set of all models that satisfy $\alpha$ .
- $M(\beta)$ is the set of all models that satisfy $\beta$ .
- The statement $\alpha \models \beta$ means that in every model where $\alpha$ is true, $\beta$ is also true. In other words, if $\alpha$ holds, then $\beta$ must also hold.
Model Checking: enumerates all possible models and checks if the entailment holds in each model.
- Knowledge base will be used to draw inferences.
- Query sentence, $\gamma$ , is needed to be checked whether it is entailed by the KB.
- Symbols, a list of all symbols (or atomic propositions) used in the problem context.
- Models, assignments of truth and false values to those identified symbols.
Model Checking Procedure
1. Identify the propositional symbols involved in the KB sentences and query sentence
2. Enumerate all possible models by assigning truth values to the identified symbols.
3. Evaluate the KB sentence in each model and fined the models in which KB is true.
4. Evaluate the query sentence in the models from step 3 and check if query sentence is true in these models.
5. Conclude that the KB entails the query sentence $\gamma$ if and only if the query sentence is true in all models where the KB is true.

Inference Problem Example

Obersvation P: "It is raining"
Query sentence Q: "The ground is wet"
Knowledge: $P \implies Q$ (If it is raining, then the ground is wet)
Knowledge Base KB: $P \land (P \implies Q)$
Inference Problem: $KB \models Q$ from KB=T, get Q=T
Proof:
1. From the 4 possible models, only Model 1 makes KB true.
2. M(KB) = model 1
3. Model 1 also makes Q true.
4. M(Q) = model 1
5. $M(KB) \subseteq M(Q)$ , therefore $KB \models Q$

Model	P	Q	P $\implies$ Q	P $\land$ (P $\implies$ Q)
1	T	T	T	T
2	T	F	F	F
3	F	T	T	F
4	F	F	T	F

Inference by theorem proving

to apply rules of inference directly to the sentences in the KB to construct a proof of the desired sentence without consulting models.

Proof: a chain of consequences that leads to the desired goal.
KB will be used to draw inferences.
Desired sentence is needed to be checked whether it is entailed by the KB.
The rules of inference are the approved logical equivalences and rules.

Aspect	Model Checking	Theorem Proving
방식	참/거짓으로 실제 계산	논리 규칙을 사용해 증명
예시	진리표	Modus Ponens, Resolution
장점	단순함	복잡한 문장도 처리 가능
단점	계산 많음	규칙 익혀야 함

Modus Ponens Rule

$\frac{\alpha \implies \beta, \alpha}{\therefore \beta}$

whenever any sentences of the form $\alpha \implies \beta$ and $\alpha$ are given, then the sentence $\beta$ can be inferred.

Add-Elimination Rule

$\frac{\alpha \land \beta}{\therefore \alpha}$

from a conjunction, one of the conjuncts can be inferred.

Terms to contradiction and resolution

CNF
┌────────────────────────────────────────────────┐
│            (P ∨ Q)         ∧    (¬Q ∨ R)       │
│   ┌──────────────────────┐   ┌───────────────┐ │
│   │  Clause 1            │   │  Clause 2     │ │
│   │  (P ∨ Q)             │   │  (¬Q ∨ R)     │ │
│   │  ┌───────┬───────┐   │   │  ┌──────┬─────┐ │
│   │  │  P    │   Q   │   │   │  │ ¬Q   │  R  │ │
│   │  └───────┴───────┘   │   │  └──────┴─────┘ │
│   └──────────────────────┘   └───────────────┘ │
└────────────────────────────────────────────────┘

Literal: an atomic sentence or its negation
Complementary literals: a literal and its negation are complementary literals
Clause: an expression formed from a collection of finite literals
- In most $l_1 \lor l_2 \lor ... \lor l_n$ cases, a clause is a disjunction of finite literals.
- written as the symbol $l_i$ .
Conjunctive Normal Form: a sentence expressed as a conjunction of clauses.
Satisfiability: a sentence is satisfiable if it is true in, or satisfied by, some models.

Propositional satisfiability (SAT) problem

to determine the satisfiability of sentences in propositional logic.
if there exists a model that satisfies a given logical sentence, then the sentence is satisfiable.
Satisfiability can be checked by enumerating the possible models until one is found that satisfies the sentence, or by resolving complementary literals until an empty clause is derived.
many problems in CS are really SAT problems.

Resolution rule

Unit resolution rule

$\frac{(l_1 \lor l_2 \lor ... \lor l_{i-1} \lor l_i \lor l_{i+1} \lor ... \lor l_k,\space m)}{l_1 \lor ... \lor l_{i-1} \lor l_{i+1} \lor ... \lor l_k}$

where $l$ is a literal and $l_i$ and $m$ are complementary literals.
Unit resolution rule takes a clause which is a disjunction of literals, and a literal and produces a new clause as the resolvent.

Full resolution rule

$\frac{(l_1 \lor l_2 \lor ... \lor l_{i-1} \lor l_i \lor l_{i+1} \lor ... \lor l_k,\space m_1 \lor m_2 \lor ... \lor m_{j-1} \lor m_j \lor m_{j+1} \lor ... \lor m_n)}{(l_1 \lor ... \lor l_{i-1} \lor l_{i+1} \lor ... \lor l_k \lor m_1 \lor ... \lor m_{j-1} \lor m_{j+1} \lor ... \lor m_n)}$

where $l$ is a literal and $l_i$ and $m_j$ are complementary literals.
Full resolution rule takes two clauses which are disjuctions of literals and produces a new clause containing all the literals of the two original clauses except for the two complementary literals.

Inference via proof by contradiction through resolution

$\alpha \models \beta$

to prove that $\alpha \models \beta$ , we can show that the sentence $\alpha \land \neg \beta$ is unsatisfiable.
by deriving an empty clause $()$ from $\alpha \land \neg \beta$ using the resolution rule.
In order to derive an empty clause, we use resolution which is a process to resolve complementary literals until to find an empty clause.
1. R1: P (observation)
2. R2: P $\implies$ Q (knowledge, raining implies ground is wet)
3. $KB = P \land (P \implies Q)$
4. Query sentence: Q
5. Inference problem: $KB \models Q$ , i.e. from $KB=T$ , get $Q=T$
6. Proof:
- Let $(P \land (P \implies Q)) \land \neg Q$ valid
- Convert $(P \land (P \implies Q)) \land \neg Q$ to CNF
- Apply implication elimination rule, one has $(P \land (\neg P \lor Q)) \land \neg Q$ $(P \land (\neg P \lor Q)) \land \neg Q$
  - $C_1: P$
  - $C_2: \neg P \lor Q$
  - $C_3: \neg Q$
- Apply unit resolution rule to $C_1$ and $C_2$ , resolve $P$ and $\neg P$ , one has $C_4: Q$
- Apply unit resolution rule to $C_3$ and $C_4$ , resolve $Q$ and $\neg Q$ , one has $C_5: ()$

Inference in FOL

convert the first-order inference to propositional inference using the ruls for quantifiers.
- universal instantiation (UI)
- existential instantiation (EI)
do inference in propositional logic using the methods about inference in propositional logic.
this approach to first-order logic inference via propositionalization is complete, means any entailed sentences can be proved.
- in most cases, this approach works
- in some cases, it is slow and only useful when the domain is small.
get rid of quantifiers by instantiating them with specific constants or variables.

Universal instantiation (UI)

$\frac{\forall v \space \alpha}{\text{Subst}(\{v/g\},\alpha)}$

to convert sentences with universal quantifiers to sentences without universal quantifiers.
it can infer any sentence obtained by substituting a ground term for the universally quantified variable.
$Subst(\{\theta, \alpha\})$ denotes the result of applying the subsitution $\theta$ to the sentence $\alpha$ and $g$ is a ground term or a constant symbol.
universal instantiation can be applied many times to produce may different consequences.
$\forall x Loves(x, Mary)$ $\forall xL o v es (x, M a ry)$
- $Loves(John, Mary)$
- $Loves(Sue, Mary)$
- $Loves(Bill, Mary)$
- ...

Existential instantiation (EI)

to convert sentences with existential quantifiers to sentences without existential quantifiers.
the quantified variable can be replaced by a single new constant symbol.
for any sentence $\alpha$ , variable $v$ , and constant symbol $k$ not appear elsewhere in the knowldge base, the following rule is sound: $\frac{\exists v \space \alpha}{\text{Subst}(\{v/k\},\alpha)}$
$\text{Subst}\{\theta, \alpha\}$ denotes the result of applying the subsitiution $\theta={v/k}$ in the sentence $\alpha$ and $k$ is ground term or a new constant symbol.
Existential instantiation can be applied only once, and then the existentially quantified sentence can be discarded.
$\exists x Loves(x, Mary)$ $\exists xL o v es (x, M a ry)$
- $Loves(K, Mary)$
- where $K$ is a new constant symbol not appear elsewhere in the knowledge base

CNF, Conjunctive Normal Form

Step	Rule Name	Description	Example
1	제거	$\implies$ , $\iff$ 없애기	$P \implies Q$ → $\neg P \lor Q$
2	De Morgan's	$\lnot$ 분배	$\lnot(P \lor Q) \rightarrow (\lnot P \land \lnot Q)$
3	Double negation	이중부정 제거	$\lnot(\lnot P) \rightarrow P$
4	Distribution	$\lor$ over $\land$ 분배	$(P \lor (Q \land R))$ → $(P \lor Q) \land (P \lor R)$
5	And-Elimination	$\land$ 분리	$(A \land B) \rightarrow A, B$ 따로

IAI +009

2025년 10월 9일 · 약 7분

Eunkwang Shin

Owner

Generative AI

Gen AI refers to a category of AI models designed to generte new content, synt99hetic data that resembles a given dataset.
Gen AI models create new content, including text, images, audio, and video.

Historcal context of Gen AI

1980s: The development of statistical approaches to AI emerged, focusing on probabilistic models
1990s: Hidden Markov Models (HMMs) became popular in speech recognition and sequence generation tasks, making a shift towards using statistical methods in generative processes.
1990s-2000s: The resurgence of Neural Networks, with the introduction of deep learning techniques. However, hardware and data limitations hampered progress.
2010s: The advent of deep learning algorithms, especially convolutional neural networks (CNNs) and recurrent neural networks (RNNs), transformed the landscape of AI.
2013: VAEs were introduced, providing a probabilistic approach to data generation, allowing for smooth latent space interpolation and structure.
2014: GANs, a novel framework where two neural networks (a generator and a discriminator) are trained simultaneously, allowing for the generation of highly realistic images and other data types.
2015-2020: Generative models began to find applications beyond image synthesis, including text generation, music composition, and even video generation. OpenAI's GPT-2 showcased the potential of transformer-based models for generating coherent text.
2020s: The introduction of larger and more capable models like OpenAI's GPT-3 and subsequent iterations revolutionized natural language processing.
- DALL-E and Stable Diffusion pushed the boundaries of image generation, allowing users to create images from textual descriptions. This sparked creative exploration and practical applications in marketing, art, and design.
Present: The integration of multiple data modalities (text, image, audio) led to the development of models like CLIP and GPT-4, which can understand and generate content across different formats, enhancing the versatility of generative AI.

Variational Autoencoders, VAEs

VAEs are from the probabilistic approach to build a generative AI model
VAE learns a latent distribution instead of a fixed latent presentation, allowing for the generation of new samples.
Latent space
- sample from a Gaussian distribution
- $\mu$ and $\sigma$ are learned during training
- $z$ is a new sample from the latent space
$q_\phi(z|x)$ is the encoder that maps input data $x$ to a distribution over the latent space $z$
$p_\theta(x|z)$ is the decoder that maps a latent variable $z$ back to the data space $x$
During training, the model optimizes the reconstruction loss and a regularization term (KL divergence) to ensure the learned latent distribution is close to a prior distribution (usually a standard normal distribution).
Latent Representation
- Dimensionality Reduction: latent space is typically lower-dimensional than the input space
- Smoothness: In a well-structured latent space, similar inputs will be represented by nearby points.
- Generative Capabilities: Once trained, can sample from the latent space to generate new data that resembles the training set.
- Regularization: The VAE incorporates a regularization term in its loss function (KL divergence), which encourages the lerned latent distribution to be close to a standard normal distribution.
Decoder
- The decoder takes the sampled latent representation $z$ and reconstructs the original input data.
- The goal is to make the reconstructed data as close as possible to the original input.

VAE Examples

Face Generation Example

Trains a VAE on face photos.
Latent space learns meaningful features (e.g. hair color, glasses, smile)
By sampling and interpolating in latent space, can generate new faces or smoothly morph one face into another.

Anomaly Detection Example

Train a VAE on normal sensor data.
When fed unusual data, reconstruction error will be high.
Use this for fault detection in machines, fraud detction, etc.

Autoencoder

Encoder
- Input Layer: takes in the original data
- Hidden Layer: These layers progrssively reduce the dimensionality of the input through operations like linear transformations and non-linear activations. (e.g., ReLU-Rectified Linear Unit)
- Ouput Layer: Produces the final latent represntation. Sometimes Sigmoid or Tanh are used depending on the nature of the input data.
- Learns to capture meaningful patterns in the data by minimizaing reconstruction loss, which mesures the difference between the original input and the reconstructed output produced by the decoder.
Decoder
- typically mirrors the structure of the encoder but in reverse.
- During trainng, Mean squared Error (MSE) for continuous data or Binary Cross-Entropy (BCE) for binary data are commonly used to quantify the reconstruction loss.

Image Reconstuction Example

input: 28x28 pixel
encoder: compresses it to just 16 numbers (latent vector)
decoder: expands those 16 numbers back into a 28x28 image
result: the reconstructed digit looks similar to the original but not in the training set.

Denoising Images

input: a noisy photo of a cat
encoder: learns to ignore the noise and compress meaningful features.
decoder: rebuilds the image without the noise.
output: a clearer cat image.

Loss function

Reconstruction loss: Ensure output similar to input
KL divergence loss: Push the learned distribution to be close to a standard normal distribution. (can sample new data)

Generative Adversarial Network, GAN

Generator: Learns to generate fake (generated) data that resembles data distribution.
Discriminator: Learns to distinguish between real data and data generated by the Generator.
Steps
1. Generator creates fake data samples fro mrandom input (noise).
2. Discriminator evaludates these samples together with real data samples.
3. Discriminator outputs probabilities indicatcing whether each sample is real or fake.
4. Based on the prediction of the Discriminator, the Generator and Discriminator are updated using specific loss functions.
Applications
- Labels to Street Scenes
- Labels to Facade
- BW to Color
- Aerial to Map
- Day to Nihght
- Edges to Photo
- Text-to-Image Synthesis

Autoregressive Models

a class of generative models where the current value of a time series is expressed as a linear function of its own past values plus some noise.
foundational in generative AI and widely used in generative AI, particularly for tasks like text generation, speech synthesis.
Examples
- GPT series: state-of-the-art autoregressive language models used for text generation.
- WaveNet: an autoregressive model for generating high-quality audio by predicting each sample conditioned on previous samples.

Training Autoregressive Models

Pretraining
- the model sees sequences of tokens and learns to guess the next token.
- objective: minimize cross-entropy loss between its guess and the true next token.
- this teaches grammar, facts, reasoning patterns, and style from raw text.
Supervised fine-tuning (optional)
- smaller curated datasets (questions to answer, instructions to respond) teach it to follow directions.
Reinforcement learning from Human Feedback (optional)
- Humans rank multiple model outputs; a reward model is trained on those rankings; GPT is then optimized to produce higher-reward responses (safer, more helpful, less toxic)
Text to Tokens via a tokenizer (e.g. BPE, Byte Pair Encoding)
Each token becomes a vector (Embedding)
Positional encodings inject order information.
Transformer layer
- self-attention: each token looks at all previous token (causal mask) and decides which ones matter, computing weighted combinations.
- Multiple heads let it attend to different patterns (syntax, long-range links, etc.).
- feed-forward network: a nonlinear MLP refines each token's representation.
- residual connections & layer norm stabilize tranining.

Interfence

a prompt (the context)
GPT computes probabiliites for the next token.
A decoding stragety samples a token
- Decoding knobs
  - Greedy: take the top token
  - Top-K
  - Nucleus (Top-p) sampling (limit to likely options)
  - Temperature scales randomness, controls how random the next token choice is
    - lower = more determistic, higher = more creative
append the token and repeat autoregressive generation.

IAI +008

2025년 9월 22일 · 약 13분

Eunkwang Shin

Owner

NLP

Tokenization

The process of dividing a text into a sequence of words.

Different models uses different tokenization methods.

BERT TOKEN	BERT ID	GPT TOKEN	GPT ID
my	2026	My	3666
grandson	7631	Ġgrandson	31845
loved	3866	Ġloved	10140
it	2009	Ġit	340
!	999	Ġ!	0
so	2061	Ġso	1406
much	2172	Ġmuch	881
fun	4569	Ġfun	1257
!	999	Ġ!	0

Corpus

a large and structured collection of text

a corpus typically consists of at least a million words of text
at least tens of thousands of distinct vocabulary words.

Text classification

the process of categorizing text into organized groups.
text classifiers can automatically analyze text and then assign a set of predefined tags or categories based on its content.
machine learning approach
- Features
  - BoW
  - TF-IDF
- Features + classifier
  - Logistic regression
  - SVM
  - Naive Bayes
Deep learning approach
- Neural models: CNNs (capture local n-grams)
- RNNs, LSTMs (sequence-aware)
- Transformers (e.g. BERT)
  - Contextual embedding
Modern enhancements
- Embedding + classifier: Pretrained embeddings (Word2Vec) + classifier
- Fine-tuned transformers: BERT fine-tuned

Bag of Words, BoW

it represents a text as a bag of its words
we can understand the meaning of a document from its content (words) their multiplicity (frequency, number of occurrences)
mainly used as a tool of feature extraction.
limitations
- ignores syntax and the context
- disregarding grammar
- discards word order - words are independent of each other
- considers only the meanings of the words in the sentence

# examples

Example1 = "He likes to watch movies. Mary likes movies too."
Example2 = "Mary also likes to watch football games."

# Vocabulary
Vocab = {"He", "likes", "to", "watch", "movies", "Mary", "also", "football", "games"}

# BoW representation
BoW1 = {He: 1, likes: 2, to: 1, watch: 1, movies: 2, Mary: 1, also: 0, football: 0, games: 0}
BoW2 = {He: 0, likes: 1, to: 1, watch: 1, movies: 0, Mary: 1, also: 1, football: 1, games: 1}

[[1,2,1,1,2,1,0,0,0], [0,1,1,1,0,1,1,1,1]] 

TF-IDF

Term Frequency - Inverse Document Frequency

Model based on the statistics of word counts.
idea is that key terms and important ideas are likely to repeat.
includes a scoring function that measure the relevance of a document to a query.
- the function takes a document with a corpus and a query as input and returns a numeric score.
- the doucments that have the highest scores are considered as the most relevant documents.
Term Frequency
- $TF(q_i, d_j, D)$
Inverse Document Frequency
- $IDF(q_i, D) = \log\frac{N}{DF(q_i, D)}$ $I D F (q_{i}, D) = lo g \frac{N}{D F ( q _{i} , D )}$
  - $DF(q_i, D)$ : number of documents in the corpus $D$ that contain the term $q_i$
$N = |D|$ : total number of documents in the corpus $D$
TF-IDF score
- $TFIDF(q_i, d_j, D) = TF(q_i, d_j) \cdot IDF(q_i, D)$

Examples of TF-IDF

Document 1: "John likes to watch movies."
Document 2: "Mary likes movies too."
Document 3: "John also likes football."

Step	Term	Document	Number of times term `t` appears in document `d`	Total number of terms in document `d`	`TF(t, d)`	Total number of documents	Number of documents containing the term `t`	`IDF(t)`	`IDF(t) (base 10)`	`TF-IDF(t,d)`	`TF-IDF(t,d)` (base 10)
1	t=likes	d1	1	5	0.2	3	3	0	0	0	0
	t=likes	d2	1	4	0.25					0	0
	t=likes	d3	1	4	0.25					0	0
2	t=watch	d1	1	5	0.2	3	1	1.098612289	0.477121255	0.219722458	0.095424251
	t=watch	d2	0	4	0					0	0
	t=watch	d3	0	4	0					0	0
3	t=mary	d1	0	5	0	3	1	1.098612289	0.477121255	0	0
	t=mary	d2	1	4	0.25					0.274653072	0.119280314
	t=mary	d3	0	4	0					0	0
4	t=football	d1	0	5	0	3	1	1.098612289	0.477121255	0	0
	t=football	d2	0	4	0					0	0
	t=football	d3	1	4	0.25					0.274653072	0.119280314

Naive Bayes

naive refers to a very strong simplifying assumption about the features
Conditional independence assumption
- Given class $Y$ , all the features $X = {X_1, X_2, ..., X_n}$ are assumed mutually independent.
$P(X | Y) = P(X_1, ..., X_n | Y) = \prod_{i=1}^{n} P(X_i | Y)$
Bayes rule
- $P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)} \propto P(Y) P(X | Y)$
- $P(Y | X_1, ..., X_n) \propto P(Y) \prod_{i=1}^{n} P(X_i | Y)$
Sentiment Analysis
- $p(C_k | x_1, ..., x_n) = \frac{1}{Z} p(C_k) \prod_{i=1}^{n} p(x_i | C_k)$ $p (C_{k} ∣ x_{1}, ..., x_{n}) = \frac{1}{Z} p (C_{k}) \prod_{i = 1}^{n} p (x_{i} ∣ C_{k})$
  - $Z = p(x) = \sum_{k} p(C_k) p(x | C_k)$ $Z = p (x) = \sum_{k} p (C_{k}) p (x ∣ C_{k})$
    - scaling factor for normalization of $p(C_k | x)$
- maximum a posteriori (MAP) decision rule
  - $\hat{y} = \argmax\limits_{k \in 1, ..., K}p(C_k) \prod_{i=1}^{n} p(x_i | C_k)$

Examples of Naive Bayes

$P(Class | w_{1:N}) = \alpha \cdot P(Class) \cdot \prod_{j} P(w_j | Class)$

Analyze the sentiment
- my grandson loved it
- $x_1 = \text{my}$ , $x_2 = \text{grandson}$ , $x_3 = \text{loved}$ , $x_4 = \text{it}$
- $C = {positive, negative}$
- $\hat{y} = \argmax\limits_{k \in \{positive, negative\}} p(C_k) \cdot p(x_1 | C_k) \cdot p(x_2 | C_k) \cdot p(x_3 | C_k) \cdot p(x_4 | C_k)$ $y^=k∈{positive,negative}argmaxp(Ck)⋅p(x1∣Ck)⋅p(x2∣Ck)⋅p(x3∣Ck)⋅p(x4∣Ck)$
  - positive case
    - $p(positive) = 0.49$
    - $p(my | positive) = 0.30$
    - $p(grandson | positive) = 0.01$
    - $p(loved | positive) = 0.32$
    - $p(it | positive) = 0.30$
    - $p(positive | x) \propto p(positive) \cdot p(my | positive) \cdot p(grandson | positive) \cdot p(loved | positive) \cdot p(it | positive) = 0.49 \times 0.30 \times 0.01 \times 0.32 \times 0.30 = 0.00014256$
  - nagative case
    - $p(negative) = 0.51$
    - $p(my | negative) = 0.20$
    - $p(grandson | negative) = 0.02$
    - $p(loved | negative) = 0.08$
    - $p(it | negative) = 0.4$
    - $p(negative | x) \propto p(negative) \cdot p(my | negative) \cdot p(grandson | negative) \cdot p(loved | negative) \cdot p(it | negative) = 0.51 \times 0.20 \times 0.02 \times 0.08 \times 0.4 = 0.00006528$
  - $Z = p(x) = \sum_{k} p(C_k) p(x | C_k)$ $Z = p (x) = \sum_{k} p (C_{k}) p (x ∣ C_{k})$
    - $= p(positive) \cdot p(my, grandson, loved, it | positive) + p(negative) \cdot p(my, grandson, loved, it | negative)$
    - $= 0.00014256 + 0.00006528 = 0.00020784$
    - the probability of $x$ , obtained by adding up its probabilities under each class.
  - $p(positive | x) = \frac{p(positive) \cdot p(my, grandson, loved, it | positive)}{Z} = \frac{0.00014256}{0.00020784} \approx 0.6867$
  - $p(negative | x) = \frac{p(negative) \cdot p(my, grandson, loved, it | negative)}{Z} = \frac{0.00006528}{0.00020784} \approx 0.3133$
- Decision
  - $\hat{y} = \argmax\limits_{k \in \{positive, negative\}} \{k = positive : 0.6867, k = negative : 0.3133\}$
  - The sentiment is positive.
Spam detection
- Probabilities learned from data:
  - $P(\text{Spam}) = 0.4$ , $P(\text{Ham}) = 0.6$
  - $P(\text{free}|\text{Spam}) = 0.8$ , $P(\text{free}|\text{Ham}) = 0.1$
  - $P(\text{win}|\text{Spam}) = 0.7$ , $P(\text{win}|\text{Ham}) = 0.05$
- New document: "free win"
  - Spam score: $0.4 \times 0.8 \times 0.7 = 0.224$
  - Ham score: $0.6 \times 0.1 \times 0.05 = 0.003$
    → Classified as Spam.

N-gram model

N-gram: a sequence of written symbols of length n
- unigram, bigram, trigram
the probability of each symbol is dependent only on the n-1 previous symbols.
$P(w_j|w_{1:j-1}) = P(w_j|w_{j-n+1:j-1})$
$P(w_1:N) = \prod_{j=1}^{N} P(w_j|w_{1:j-1}) \approx \prod_{j=1}^{N} P(w_j|w_{j-n+1:j-1})$

Examples of N-gram

$W_{1:N}$ $W_{1 : N}$ is "This article is on NLP"
- N=5
- Bigram (n-gram with n=2)

P("This article is on NLP")
/* full chain rule,  */
= P("This") // j = 1
  * P("article" | "This") // j = 2
  * P("is" | "This article") // j = 3
  * P("on" | "This article is") // j = 4
  * P("NLP" | "This article is on") // j = 5
/* bigram approximation */
= P("This") // j =1
  * P("article" | "This") // j = 2
  * P("is" | "article") // j = 3
  * P("on" | "is") // j = 4
  * P("NLP" | "on") // j = 5

Step (j)	Word (W_j)	Bigram	Trigram
1	This	$(P(\text{This}))$	$(P(\text{This}))$
2	article	$(P(\text{article} \mid \text{This}))$	$(P(\text{article} \mid \text{This}))$
3	is	$(P(\text{is} \mid \text{article}))$	$(P(\text{is} \mid \text{This, article}))$
4	on	$(P(\text{on} \mid \text{is}))$	$(P(\text{on} \mid \text{article, is}))$
5	NLP	$(P(\text{NLP} \mid \text{on}))$	$(P(\text{NLP} \mid \text{is, on}))$

Transformer

given a set of input vectors (tokens), attention lets each token look at the others and form a weighted average of them.
The weights are data-dependent.
The model learns the weights representing which tokens are relevant to which other tokens.

$X = \text{Input Embeddings},\qquad Q = X W_Q,\quad K = X W_K,\quad V = X W_V$

Scores: $S = \frac{QK^T}{\sqrt{d}}$ $S = \frac{Q K ^{T}}{d}$
- dot products between every query and every key
- the scale $\sqrt{d}$ keeps gradients stable
Weights: $A = Softmax(S)$ $A = S o f t ma x (S)$
- row-wise softmax
- each row sums to 1
Output: $Attention(Q, K, V) = A V$ $A tt e n t i o n (Q, K, V) = A V$
- each output token is a weighted sum of the value vectors, with weights determined by the attention scores.(how much to pay attention to each position)
Query (Q): What am I looking for?
Key (K): What is the label/address of the information I have?
Value (V): What is the actual information I want to convey?

Cross-Attention

look up relevant information in another sequence.
- e.g. decoder attending to encoder outputs in translation
$Q$ $Q$ from the current sequence, $K$ $K$ and $V$ $V$ from the other sequence.
- $Q = X_{\text{target}} W_Q, \quad K = X_{\text{source}} W_K, \quad V = X_{\text{source}} W_V$
Cross-attention = Which source tokens are relevant to this target token?
Self-attention = Which other tokens in this sequence are relevant to this token?
$head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)$
$MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O$

Word representation

One-hot representation

represents each word as a vector of length N.
where N is the size of the vocabulary.

vocab = {he, is, singing, she, dancing, stage}

she = [0, 0, 0, 1, 0, 0]
is = [0, 1, 0, 0, 0, 0]
singing = [0, 0, 1, 0, 0, 0]

limitations
- incredibly inefficient for large vocabularies
- not embed any intrinsic meaning of words
- unable to represent similarity between likely words
- the representation of documents is sparse vectors
  - can cause challenges in computation

Word Embedding

represents individual words as vectors in a low-dimensional continuous space.
a distributed representation of a word.
generate a unique value for each word while using smaller vectors compared with one-hot encoding.
common vector dictionaries
- word2vec
- Glove (Global Vectors)

Contextual embedding

generates different vectors for the same word based on its context.
the word "bank" would have different embeddings in the sentences:
- "He went to the bank to deposit money."
- "She sat by the river bank and enjoyed the view."
BERT or GPT use deep neural networks to process a sequence of tokens.
Each token's embedding is computed by considering the token itself, its position in the sequence and the surrounding tokens, context.
captures both semantic and syntactic role in that specific context.

Part of Speech (POS) tagging

lexical category or tag that indicates the grammatical role of a word in a sentence.
parts of speech allow language models to capture generalizations such as "adjectives often modify nouns" or "verbs often follow subjects".

From  the start , it  took  a   person
IN    DT  NN    , PRP VBD   DT  NN

with great qualities  to  succeed
IN   JJ    NNS        TO  VB

Tag	Description	Example
CC	Coordinating conjunction	and, but
CD	Cardinal number	one, two
DT	Determiner	the, a
EX	Existential there	there
FW	Foreign word	doppelgänger
IN	Preposition or subordinating conjunction	in, of
JJ	Adjective	big, old
JJR	Adjective, comparative	bigger, older
JJS	Adjective, superlative	biggest, oldest
LS	List item marker	1, 2, One
MD	Modal	can, will
NN	Noun, singular or mass	cat, tree
NNS	Noun, plural	cats, trees
NNP	Proper noun, singular	John, London
NNPS	Proper noun, plural	Smiths, Londons
PDT	Predeterminer	all, both
POS	Possessive ending	's, s'
PRP	Personal pronoun	I, you, he
PRP$	Possessive pronoun	my, your, his
RB	Adverb	quickly, very
RBR	Adverb, comparative	faster, better
RBS	Adverb, superlative	fastest, best
RP	Particle	up, off
SYM	Symbol	$, %, &
TO	to	to
UH	Interjection	oh, wow
VB	Verb, base form	be, have
VBD	Verb, past tense	was, had
VBG	Verb, gerund or present participle	being, having
VBN	Verb, past participle	been, had
VBP	Verb, non-3rd person singular present	talk, have
VBZ	Verb, 3rd person singular present	talks, has
WDT	Wh-determiner	which, that
WP	Wh-pronoun	who, what
WP$	Possessive wh-pronoun	whose
WRB	Wh-adverb	where, when
#	Pound sign	#
$	Dollar sign	$
,	Comma	,
.	Sentence-final punctuation	. ! ?

Example of POS tagging

Hidden Markov Model (HMM)
- takes in a temproral sequence of evidence observations
- predicts the lexical categories
Logistic regression
- build 45 different logistics regression models, one for each part of speech
- ask each model how probable it is that the example word is a member of that category, given the feature values for that word in its particular context.

Machine translation

translate a sentence from a source lnaguae to a trget language.
train an MT model: a large corpus of source/target sentence paris and hope that the trained MT model can accurately translate new sentences.
want to generate a target language sentence that corresponds to the source language sentence
the geneartion of each target word is conditional on the entire source sentence and on all previously generated target words.

Example of machine translation

a sequance-to-sequence model
- use two RNNs (LSTM)
attentional sequence-to-sequence model
- use attentino to create a context-based summarization of the source sentence into a fixed-dimension representation
transformer-based model
- encoder: reads the source sentence and turns it into a rich, contextual set of vectors.
- decoder: generates the target sentence one token at a time, using what it has generated so far and the encoder's representations.

Text generation

a subfield of NLP
leverages knowledge in computational linguistics and AI to automatically generate natural language texts
can satisfy certain communicativa requirements

Example of text generation

Classifier based on word embeddings, e.g. RNN and LSTM
- RNN: each input word is encoded as a word embedding vector $x_i$ $x_{i}$ , a hidden layer $z_t$ $z_{t}$ , the classes are the words of the vocabulary
  - the output $y_t$ will be a softmax probability distribution over the possible values of the next word in the sentence.
- LSTM: can choose to remember som parts of the input, copying it over to the next time step, and to forget toher parts.
Pre-trained languaeg model using deep learning
- BERT
- GPT-X, Generativ ePre-trained Transformer

Transfer learning

experience with one learning task helps an agent learn better on another task.
pretraining: a form of transfer learning in which we use a large amount of shared general-domain language data to train an initial version of an NLP model.
- we can use a smaller amount of domain-specific data to refine the model
- the refined model can learn the vocabulary, idioms, syntactic structure, and other linguistic phenomena that are specific to the new domain.
For NN, learning consist of adjusting weight, so the most plausible approach for transfer learning is to copy over the weights learned for task A to a network that will be trained for task B.
- The weights are then updated by gradient descent in the usual way using data for task B.
the popularity of transfer learning is the availability of high-quality pretrained models.
will want to freeze the first few layers of the pretrained model
- these layers serve as feature detectors that will be useful for new model.
- new data set will be allowed to modify the parameters of the higher levels only
  - these are the layers that identify problem-specific features and do classification.

IAI +007

2025년 9월 15일 · 약 8분

Eunkwang Shin

Owner

Computer vision

a field of artificial intelligence enabling computers and systems to derive meaningful information from digital images, videos and other visual inputs.
vision is a perceptual channel that accepts a stimulus and reports soome representation of the world
computer vision enables intelligent agents to see, observe and understand of environment

Core problems of CV

reconstruction: an agent builds a mode lfo the world from an image or a set of images
recognition: an agent draws distinctions amont the objects it encounters based on visual and other information.
- image classification
- object detection
- image segmentation

Classic approaches to object recognition problems

feature-based object recognition approach
- works well for faces looking directly at the camera.
pattern-element-based object recognition approach
- a useful abstraction is to assume that some objects are made up of local patterns which tend to move around with respect to one another.
- we can model objects with pattern elements.

Modern approaches to object recognition problems

deep learning networks
- to recognition problems enables that features can be automatically learned and extracted from raw image data compares with the manual feature extraction in the classic approaches.
- AlexNet, VGGNet, GoogleNet, ResNet, DenseNet, EfficientNet, RegNet...
basic models and derived models
- YOLO, SSD, RetinaNet, R-CNN...

Evaluation metrics for image classification

Metric	Definition	Use case
Accuracy	the percentage of correcly predicted labels out of all predictions made	commonly used in balanced datasets but can be mis leading in imbalanced classes
Precisino	the ratio of correctly predicted positive observations to the total predicted positives.	useful when the cost of false positives is high (e.g., spam detection)
Recall (Sensitivity or True Positive Rate)	the ratio of correctly predictted observations to all the actual positives	important when the cost of false negatives s high (e.g., disease detection)
F1 Score	the harmonic mean of Precision and Recall	used when you need to balance precision and recall, especially in imbalanced datasets
Specificity (True Negative Rate)	the ratio of correctly predicted negative observations to all the actual negatives.	important when false positives should be minimized (e.g. medical tests for diseases)
Confusion Matrix	a table used to describe the performance of a classification model by showing the true positive, false positive, true negative, and false negative counts	provides a comprehensive understanding of a models' performance across all classes
ROC Curve (Receiver Operation Characteristic)	a graphical representation of the classifier's performance across all thresholds, plotting the true positive rate (recall) against the false positive rate (1 - specificity)	used to evaludate binary classirifres and compare models
AUC (Area Under the Curve)	the area under the ROC curve, providing a single number summary of the models' ability to discriminate between positive and negative classes	useful for evaluating binary classification models, particularly when dealing with imbalanced datasets.

Confusion Matrix

Actual class ➡️ Predicted class ⬇️	Positive	Negative	Metric
Positive	TP: True Positive	FP: False Positive	Precision: $\frac{TP}{TP + FP}$
Negative	FN: False Negative	TN: True Negative	Negative Predictive Value: $\frac{TN}{TN + FN}$
Metric	Recall or Sensitivity: $\frac{TP}{TP + FN}$	Specificity: $\frac{TN}{TN + FP}$	Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$

ROC Curve

Sort predicted probabilities
Try multiple thresholds
For each threshold, compute predicted labels
Compute TPR and FPR
Plot ROC curve (FPR vs TPR) i. $TPR = \frac{TP}{TP + FN}$ ii. $FPR = \frac{FP}{FP + TN}$
Compute AUC as area under ROC curve

from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

Evaluation metric for object detection

Metric	Definition	Use case
Intersection over Union (IoU)	measures how much the predicted bounding box overlaps with the ground truth.	$IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}$
Precision	the ratio of correctly predicted postive observations to the total predicted positives	$Precision = \frac{TP}{TP + FP}$
Recall	the ratio of correctly predicted observations to all the actual positives	$Recall = \frac{TP}{TP + FN}$
F1 Score	the harmonic mean of Precision and Recall	$F1 = 2 \times \frac{Precision \cdot Recall}{Precision + Recall}$
Average Precision (AP)	Precision-Recall Curve: Precision vs Recall at different thresholds AP: Area under the Precision-Recall curve (per class)	$AP = \int_{0}^{1} P(r) \, dr$
Mean Average Precision (mAP)	mean of AP across all classes	provides a comprehensive understanding of a models' performance across all classes
AP@[.50:.95]	COCO benchmark	Averages AP at IoU threshold from 0.5 to 0.95 in steps of 0.05

AP is computed by summing trapezoid areas under the Precision-Recall curve.

Convolutional Neural Network (CNN)

contains spatially local connectinos at lest in the early layers
has patterns of weights that are replicated across the units in each layer.
use a kernel to detect patterns of weights that is replicated across multiple local regions in an image
use convolution that applies the kernel to the pixels of the image

Input
   ↓
[Conv → ReLU] → [Pooling]
   ↓
[Conv → ReLU] → [Pooling]
   ↓
Flatten (2D → 1D 벡터)
   ↓
Fully Connected Layer
   ↓
Output Layer (Softmax/Sigmoid)

$z_i = \sum_{j=1}^{l} k_j \cdot x_{j + (i-1)s}$

$z_i$ : the $i$ -th output element (convolution result)
$j$ : index inside the kernel (from 1 to $l$ )
$l$ : kernel size (number of elements in the kernel)
$k_j$ : the $j$ -th kernel weight (filter element)
$x$ : input sequence (the original data)
$x_{j+(i-1)s}$ : the input element aligned with kernel position $j$ when shifted by stride
$s$ : stride (step size of the kernel movement)
$(i-1)s$ : total shift in the input for the $i$ -th convolution

Receptive Field

the receptive field of a neuron is the portion of the sensory input that can affect aht neuron's activation.
In CNNs, the receptive field of a unit in the first hidden layer is small.
just the size of the kernel, e.g., 3x3 or 5x5.
In the deeper layers of the network, it can be much larger.

1D CNN

$R_{L} = R_{L-1} + (k_{L} - 1) \cdot \prod_{i=1}^{L-1} s_i$

$R_L$ : receptive field size at the $L-th$ layer
$k_L$ : kernel size at the $L-th$ layer
$s_i$ : stride at the $i-th$ layer

Layer	Kernel k	Stride s	Calculation	Result RF
Conv1	3	1	1 + (3-1)*1	3
Pool1	2	2	3 + (2-1)*1	4
Conv2	3	1	4 + (3-1)*2	8

Pooling

works like a convolutional layer, with a kernel size $l$ and stride $s$ , but the operation is applied is fixed rather than learned.
no activation fucntion is associated with the pooling layer.
common forms of pooling
- average pooling
- max pooling: saying a feature exists somewhere in the unit's receptive field.

Dropout

a way to reduce the test-set error of a network to increase its ability of generalization.
makes a etwork herder to fit the traninig set.
the network is created by deactivating a randomly chosen subset of the units (dropout rate).
cannot explain why it works but the results is better

Batch Normalization

modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD).
Batch normlization rescales the values generated at the internal layers of the network from the examples within each minibatch.
- it standardizes the input to a layer for each mini-batch.
standardizes the mean and variance of the values.
maeks it much simpler to train a deep network.

Tranining in a CNN

$Cross-entropy(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)$

Forward pass
Backward pass
Parameter update

Variants of CNNs

AlexNet

AlexNet Layers

ResNet

stands for a residual neural network.
was designed to enable hundreds or thousands of convolutional layers.
Residual neural networks do this by utilizing skip connections, or shortcurts to jump over some layers.
was an innovative solution to the "vanishing gradient" problem.

x ────────────────► (+) ──► y
   │                     ▲
   ▼                     │
 [Conv → ReLU → Conv] = F(x)

VGGNet

increases the depth of the network through adding more convolutional layers by using small convolution filters (3x3) while other parameters are fixed.

VGGNet

Derived model for object detection

YOLO

Split an images to S times S blocks.
Apply object classification to each block and get confidence score of different objecxt for each block.
Based on the class probability map to locate objects.

YOLO

Calculation AP and mAP

Gather Detection results
Match Predictions to Ground Truth
Compute Precision and Recall to generate a set of (recall, precision) points
Build the Precision-Recall (PR) Curve (presioin decreases as recall increases)
Smooth the Precision-Recall Curve
Calculate AP
Extend to mAP

IAI +006

2025년 9월 12일 · 약 8분

Eunkwang Shin

Owner

Markov process or Markov chain

the state sequence satisfies the Markov assumption

The current state depends on only a finite fixed number of previous states.
$P(X_t | X_{0:t-1}) = P(X_t | X_{t-1})$

Decision Theory

Choosing actions based on the desirability of their immediate outcomes.
The outcome of taking action $a$ in state $s_0$ is deterministic then $Result(s_0, a)$ .
If the outcome is non-deterministic (stochastic), fully observable, the agent knows the current state $s_0$ $s_{0}$ .
- It is represented as a random variable whose values are the possible outcome states.
- The transition model specifies the probabilities of possbile outcome states.
- $P(s, a, s')$

MEU, Maximum Expected Utility

A rational agent should choose the action that maximizes its expcted utility.

The MEU principle could be seen as defining all of AI.
$action = argmax_a(EU(a|e))$ $a c t i o n = a r g ma x_{a} (E U (a ∣ e))$
- among this set of actions, give the highest expected utility.
$EU(a|e) = \sum_{s'} (P(Result(a) = s' | a, e) * U(s'))$

Sequential decision problem

type of problem or scenario where a decision-maker must make a series of choices or decisions over time in order to achieve a desired outcome.

a sequential decision-making problem or sequential decision-making task

MDP, Markov Decision Process

a sequential decision problem for a fully observable, stochastic environment environment with a Markovian transition model and additive rewards.

a formal framework for modeling sequential decision problems.
including states, actions, transition probabilities, and rewards
used in reinforcement learning and operations research.

Component	Description
Environment	fully observable, Stochastic
States	a set of states (with an initial state S₀)
Transition model	Markovian
Reward	additive
Agent	Make decisions at time step for actions Interacts with environment through percepts and actions

A set of states: an initial state $S_0$ , possible terminal state(s)
A set of actions in each state: $ACTIONS(s)$
A stochastic Markov transition model: $P(s' | s, a)$
A reward function: $R(s)$ or $R(s, a, s')$
In MDP, Additive Reward is defined as a function that assigns a numerical value to each state-action pair or state-transition pair.
- represents the immediate benefit or cost associated with taking a specific action in a particular state.
The solution must specify what the agent should do for any state that the agent might reach.
Policy: $π(s)$ $π (s)$ , an action for each state
- A solution to an MDP.
- $\pi(s)$ is the action recommended by the policy $\pi$ for state $s$
Q state: $Q(s, a)$ , the expected utility of doing action $a$ in state $s$

RL, Reinforcement Learning

an agent interacts with an environment
takes actions to maximize a cumulative reward signal over time.
learns a policy that balances exploration and exploitation
- exploration: trying new actions.
- exploitation: choosing known good actions.

Sample MDP

Environment

The agent lives in a 4x3 fully observable stochastic environment
Walls block the agent's path; Begin in the start state, the agent must choose an action at each time step
Actions available to the agent in each state: Up, Down, Left, or Right
The interaction with the environment terminates when the agent reaches one of the goal states, marked +1 or -1

4x3 Grid World

	1	2	4
3			+1
2		🚫	-1
1	START

START: Initial position (1,1)
🚫: Wall (impassable)
+1: Goal state with positive reward
-1: Goal state with negative reward

Transition Model

Stochastic environment with uncertainty
When agent chooses an action, there's:
- 0.8 probability of moving in intended direction
- 0.1 probability of moving perpendicular to intended direction (each side)

Reward Function

The agent receives rewards each time step:
- Small "living" reward each step (-0.04 in all states except the terminal states)
- Big rewards come at the end (+1 for good and -1 for bad terminal states)

Goal

To maximize sum of rewards from the start state to a terminate state

Additive Reward

a type of reward structure
the total reward earned by an agent in an MDP
calculated by summing up the individual rewards received at each time step as the agent interacts with the environment.

Discounted reward

a techinque used to calcluate the expected cumulative reward over time.
applying a discount factor to future rewards, prioritizing more immediate rewards over distant ones.

Utility

$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, \pi(S_t), S_{t+1})\right]$

The expectation $E$ is with repect to the probability distribution over state sequences determined by $s$ and $\pi$ .
$\gamma$ is the discount factor, $0 \leq \gamma < 1$ , which determines the present value of future rewards.
$R(S_t, \pi(S_t), S_{t+1})$ is the reward received when transitioning from state $S_t$ to state $S_{t+1}$ by taking action $\pi(S_t)$ .

Bellman equation

$U(s) = max_{a \in A(s)} \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma U(s'))$

Formula Component	Description
$U(s)$	the utility of state $s$
$max_a$	the maximum over all possible actions $a \in ACTIONS(s)$
$\sum_{s'}$	the sum over all possible successor states $s'$
$P[s'\vert s,a]$	the transition probability of reaching state $s'$
$R(s,a,s')$	the reward received after transitioning from state $s$ to state $s'$ by taking action $a$
$\gamma$	the discount factor, $0 \leq \gamma < 1$
$U(s')$	the utility of the successor state $s'$

In optimal setting, the neighbor state meets the Bellman equation.
From $s_0$ to termination, each step in the sequence should match the Bellman equation.

$\pi^{*}_s = argmax_{\pi}U^{\pi}(s)$

Q function

the expected utility of taking a given action in a given state.
enables the agent to act optimally simply by choosing
- $U(s) = max_a Q(s,a)$
The optimal policy can be extracted from the Q-function.
- $\pi^{*}(s) = argmax_a Q(s,a)$

$Q(s,a) = \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma U(s')) \\ \space = \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma max_{a'}Q(s',a'))$

Reinforcement Learning

Key components of RL

Agent
Environment
State
Action
Reward
Policy

Application of RL

Robotics
Autonomous systems
Game playing
- Atari video games from raw visual input
- Playing poker
- Alpha Go
Recommendation systems

RL Algorithms

Model-based RL
- often learns a utility function $U(s)$
- the sum of rewards from state $s$ onward from observing the effects of its actions on the environment.
Model-free RL
- Action-utility learning, such as Q-learning
- Policy learning
Passive RL
- the agent's policy is fixed
- the task is to learn the utilities of states.
Active RL
- the agent must figure out what to do through exploration and exploitation.
  - Exploration: trying new actions to discover their effects.
  - Exploitation: leveraging known actions that yield high rewards.

Direct Utility Estimation

the utility of a state is the expected total reward from the state onward.
- the expecgted reward-to-go from that state.
- each trial provides a sample of this quantity for state visited.
end of each sequence, the algorighm calculates the observed reward-to-go for each state and updates the estimated utility for the state accordingly.
In the limit of infinitly many trials, the sample average will converge to the expectation in teh Bellman equation.
Ignores the connections between states.
- misses opportunities to learning, it learns nothing until the end of the trial.
- often converges very slowly.

Adaptive Dynamic Programming, ADP

$U_i(s) = \sum_{s'}P(s'|s,\pi_{i}(s))[R(s,\pi_{i}(s),s') + \gamma U_{i}(s')]$

advantage of the constraints among the utilities of states by learning the transition model
- that connects them and solves the corresponding MDP
- using dynamic programming to calculate the utilities of the states.
These Bellman equations are linear when the policy $\pi_i$ $π_{i}$ is fixed
- can be solved using any linear algebra package.

Learning a transition model

the environment is fully observable
the agent can learn the mapping from a state-action pair $(s, a)$ to the resulting state $s'$ .
The transition model $P(s'|s,a)$ is represented as a table and it is estimated directly from the counts that are accumulated in $N_{s'|s,a}$ .
The counts record how often state $s'$ $s^{'}$ is reached when executing $a$ $a$ in $s$ $s$ .
- $P(s'|s,a) = \frac{N_{s'|s,a}}{\sum_{s''}N_{s''|s,a}}$

Temporal difference, TD

$U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha[R(s,\pi(s),s') + \gamma U^{\pi}(s') - U^{\pi}(s)]$

when a transition occurs from state $s$ to state $s'$ via action $\pi(s)$
$\alpha$ is the learning rate, $0 < \alpha \leq 1$
no need a transition model to perform updates
the environment is itself supplies the connection between neighboring states in the form of observed transitions.

Q learning

model-free reinforcement learning algorithm
the agent learns a Q-function, $Q(s,a)$
avoids the need for a model by learning an action-utility function $Q(s,a)$ instead of utility function $U(s)$ .
a model-free Q-learning TD update
- $Q(s,a) \leftarrow Q(s,a) + \alpha[R(s,a,s') + \gamma max_{a'}Q(s',a') - Q(s,a)]$
- this update is calculated whenever action $a$ is executed in state $s$ leading to state $s'$
TD Q-learning agent doesn't need a transition model $P(s'|s,a)$ $P (s^{'} ∣ s, a)$
- either for learning or for action selection.

IAI +005

2025년 9월 2일 · 약 13분

Eunkwang Shin

Owner

Neural Network Development History

1950s-1960s: Early Foundations
- McCulloch & Pitts (1943): mathematical neuron model
- Rosenblatt's Perceptron (1958): first trainable network
- Minsky & Papert (1969): limitations (XOR problem) → AI Winter
1970s–1980s: First Revival
- Werbos (1974); Rumelhart, Hinton, Williams (1986): Backpropagation
- Hopfield Networks (1982): associative memory
- Renewed optimism but limited by hardware
1990s: Consolidation
- LeCun's CNN (LeNet, 1989): digit recognition
- Elman, Jordan: Recurrent Neural Networks
- Symbolic AI still dominated mainstream
2000s: Deep Learning Foundations
- Better hardware (GPUs) + large datasets
- Hinton (2006): Deep Belief Networks (unsupervised pretraining)
- Connectionism regains attention
2010s: Deep Learning Boom
- ImageNet (2012): AlexNet breakthrough
- RNNs, LSTMs, GRUs → speech & translation
- Transformers (2017): revolutionized NLP
2020s: Scaling & Foundation Models
- Large Language Models (GPT, BERT, etc.)
- Multimodal AI: vision, text, speech integration
- Connectionism dominates AI research & industry

Neural Network Models

a collection of units (neurons) connected together
The properties of the network are determined by its topology and the properties of the neurons.
Roughly speaking, the neuron fires when a linear combination of its inputs exceeds some (hard or soft) threshold.

Simple Neuron

$in_j = \sum_{i=0}^{n} w_{ij}a_i$
$out_j = g(in_j)$
$a_j = g(\sum_{i=0}^{n} w_{ij} a_i)$

Activation function

ReLU function

$ReLU(x) = max(0, x)$

an abbreviation for rectified linear unit
Commonly used

Softplus function

$Softplus(x) = \log(1 + e^x)$

A smooth version of the ReLU function

Logistic or Sigmoid function

$Logistic(x) = \frac{1}{1 + e^{-x}}$

Non-linear, can represent a nonlinear function

Tanh function

$tanh(x) = \frac{e^{2x} -1}{e^{2x} + 1}$

Topology of a neural network

Feed-forward network (FFN):
- Every node receives inputs from "upstream" nodes and delivers output to "downstream" nodes.
- There are no loops.
- FFN represents a function of its current inputs, thus it has no internal state other than the weights themselves.
Recurrent Network (RNN):
- A recurrent network feeds its outputs back into its own inputs.
- In a recurrent network, the neuron values can eventually settle down, keep cycling, or behave unpredictably.
- can support short-term memory

FFN	RNN

Training Process

Go through each training sample.
If correctly classified → do nothing.
If misclassified → update the weights:
$w_i \leftarrow w_i + \alpha(y - \hat{y})x_i$

Perceptron for Binary Classification

A perceptron separates data into two classes with a hyperplane.
if $w \cdot x \geq 0 \rightarrow 1$
if $w \cdot x \le 0 \rightarrow 0$

Learning Rules

Aspect	Perceptron Learning Rule	Gradient Descent (with Sigmoid)
Activation function	Hard threshold (계단 함수) $Threshold(z) = 1 \; \text{if} \; z \ge 0,\; 0 \; \text{otherwise}$	Sigmoid (연속 함수) $h_w(x) = \frac{1}{1+e^{-w \cdot x}}$
Output	0 또는 1	0과 1 사이의 실수 값
Loss function	없음 (틀리면 조정, 맞으면 유지) 규칙 기반 학습	$L = (y - h_w(x))^2$ (L2 loss) 또는 Cross-Entropy (실무에서 자주 사용)
Update rule	틀렸을 때만: $w \leftarrow w + \alpha (y - h_w(x))x$	경사하강법: $w \leftarrow w + \alpha (y - h_w(x)) \cdot h_w(x)(1-h_w(x)) \cdot x$
Why derivative?	Hard threshold는 미분 불가능 → 단순 규칙 사용	Sigmoid는 연속적이고 미분 가능 → Loss 함수의 기울기(gradient)를 따라 업데이트. 여기서 $h_w(x)(1-h_w(x))$ 항은 sigmoid의 도함수에서 나온 것.
Interpretation	틀리면 정답 방향으로 한 걸음 이동	Loss가 줄어드는 방향으로 점진적으로 이동

Feadforward NN, FNN

a multilayer perceptron network
one input layer, N hidden layers, N >= 1, and one output layer.
Except for the input layer, each layer has a same activation function g.
The final output is represented by a vector function of inputs and weights.
If it has three layers, Shallow Neural Network, otherwise Deep Neural Network.

Traning a FNN

Forward
- Activation passing from the input layer to the output layer
- Calculate the output
Backward
- Errors propagating backward from the output layer to the input layer
- Update weights

Forward phase

Activation of each node is computed in two steps:
1. Weighted sum (in): sum of activations from the previous layer, multiplied by weights.
2. Apply activation function g: pass the weighted sum through g to produce the node's activation.
Process: propagate activations layer by layer towards the output layer.
Output value (example with 2 layers):
- $h_w(x) = g^{(2)}\big(W^{(2)} g^{(1)}(W^{(1)} x)\big)$

Backward phase

Loss function: choose squared error loss (L2)
- $L_2(y, \hat{y}) = (y - \hat{y})^2$
Prediction:
- $\hat{y} = h_w(x)$
Gradient descent: compute gradient of the loss with respect to weights, then update weights along the negative gradient direction.
- $w_{i,j} \leftarrow w_{i,j} - \alpha \cdot gradient_{w_{i,j}}$
Example: sigmoid activation:
- $\hat{y} = \frac{1}{1 + e^{-w \cdot x}}$
- Gradient of the loss:
  - $gradient_{w_{i,j}} = \frac{\partial}{\partial w_{i,j}} Loss(h_w) = 2 (y - h_w(x)) \cdot \Big(- \frac{\partial}{\partial w_{i,j}} h_w(x)\Big)$
- Chain rule applied:
  - $\frac{\partial g(f(x))}{\partial x} = g'(f(x)) \cdot f'(x)$
Example: Gradient derivation for sigmoid
- Weighted input
  - $W \cdot X = w_{1,3}x_1 + w_{2,3}x_2 + w_{0,3}x_0$
  - (where $x_0 = 1$ for the bias)
- Gradient of the loss
  - $gradient_{w_{i,j}} = \frac{\partial}{\partial w_{i,j}} Loss(h_w) = 2 (y - h_w(x)) \cdot \Big(-\frac{\partial}{\partial w_{i,j}} h_w(x)\Big)$
- Derivative of sigmoid output
  - $\frac{\partial}{\partial w_{i,j}} h_w(x) = h_w(x)(1 - h_w(x)) \cdot \frac{\partial}{\partial w_{i,j}} (W \cdot X)$ $\frac{\partial}{\partial w _{i, j}} h_{w} (x) = h_{w} (x) (1 - h_{w} (x)) \cdot \frac{\partial}{\partial w _{i, j}} (W \cdot X)$
    - $\frac{\partial}{\partial w_{i,j}} \left( \frac{1}{1 + e^{-W X}} \right)$
    - $= \left( \frac{1}{1 + e^{-W X}} \right) \left(1 - \frac{1}{1 + e^{-W X}} \right) \cdot \frac{\partial}{\partial w_{i,j}} (W X)$
    - $= h_w(x) \big(1 - h_w(x)\big) \cdot \frac{\partial}{\partial w_{i,j}} (W X)$
- Derivative of weighted input
  - $\frac{\partial}{\partial w_{0,3}}(W \cdot X) = x_0 = 1$
  - $\frac{\partial}{\partial w_{1,3}}(W \cdot X) = x_1$
  - $\frac{\partial}{\partial w_{2,3}}(W \cdot X) = x_2$
- Weight update rule
  - General form: $w_{i,j} \leftarrow w_{i,j} - \alpha \cdot gradient_{w_{i,j}}$
  - $w_{0,3} \leftarrow w_{0,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x))$
  - $w_{1,3} \leftarrow w_{1,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x)) x_1$
  - $w_{2,3} \leftarrow w_{2,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x)) x_2$

Backward phase Steps

Select a loss function
- For example, squared error loss:
- $L(y, \hat{y}) = (y - \hat{y})^2, \quad \hat{y} = h_w(x)$
Choose an activation function
- Suppose we use a sigmoid:
- $h_w(x) = \frac{1}{1 + e^{-W \cdot X}}$
Calculate the error at the output node
- The delta (error term) at the output is
- $\Delta_{out} = 2(\hat{y} - y) \cdot g'(in_{out})$
Calculate the error at hidden nodes
- A hidden unit may connect to multiple nodes in the next layer.
- Therefore, its error is the weighted sum of all deltas it feeds into, scaled by its own derivative:
- $\Delta_i = g'(in_i) \sum_j w_{i,j} \Delta_j$
- The summation appears because the hidden node's output influences several downstream nodes, and all those error signals must be aggregated.
Update the weights with gradient descent
- The gradient with respect to weight $w_{i,j}$ is simply the input times the delta:
- $\frac{\partial L}{\partial w_{i,j}} = a_i \Delta_j$
- Update rule:
- $w_{i,j} \leftarrow w_{i,j} - \alpha \, a_i \Delta_j$

Vanishing gradient

The error signal are extinguished altogher as they are propagated back through the network
In deep feedforward networks with sigmoid/tanh, repeated multiplication of small derivatives ( $0 < g'(z) < 1$ ) during backpropagation causes the gradient to vanish.

Optimizer

Training a neural network consists of modifying the network's parameters, minimizing the loss function on the training set.
any kind of optimization algorithm could be used.
modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD). Adam Optimizer
The optimiser is specified in the compilation step with tensorflow.

Recurrent NN, RNN

units may take as input a value computed from their own output at an earlier step in the computation.
have internal state, or memory: inputs received at earlier time steps affect the RNN's response to the current input.
be used to perform more general computations.
- to analyze sequential data in which a new input vector $x_t$ arrives at each time step
Markov assumption: the hidden state $z_t$ $z_{t}$ of the network suffices to capture the information from all previous inputs.
- $z_t = f(z_{t-1}, x_t)$
- Once trained, this function represents a time-homogeneous process
- The same update rule $f_w$ applies at every time step, regardless of whether it’s the first input or the hundredth.
RNNs are designed for sequential data.
a hidden state that captures information from previous steps.
suffer from vanishing/exploding gradients.
Good for short-term dependencies.

Backpropagtion Through Time, BPTT

gradient expression is recursive.
- $\frac{\partial z_t}{\partial w_{z,z}}$
- $= \frac{\partial}{\partial w_{z,z}} g_z(in_{z,t})$
- $= g_z'(in_{z,t}) \frac{\partial in_{z,t}}{\partial w_{z,z}}$
- $= g_z'(in_{z,t}) \frac{\partial}{\partial w_{z,z}} (w_{z,z} z_{t-1} + w_{x,z} x_t + w_{0,z})$
- $= g_z'(in_{z,t}) \left( z_{t-1} + w_{z,z} \frac{\partial z_{t-1}}{\partial w_{z,z}} \right)$ $= g_{z}^{'} (i n_{z, t}) (z_{t - 1} + w_{z, z} \frac{\partial z _{t - 1}}{\partial w _{z, z}})$
  - $\frac{\partial z_t}{\partial W_{z,z}}$ includes $\frac{\partial z_{t-1}}{\partial W_{z,z}}$
the gradient with run time being linear in the size of the network
handled automatically by deep learning software systems.
Iterating the recursion shows that the gradient at time $T$ $T$ includes a term proportional to:
- $w_{z,z} \prod_{t=1}^{T} g'_z(in_{z,t})$
Since for sigmoid, tanh, and ReLU we have $g' \leq 1$ , if $w_{z,z} < 1$ the RNN will suffer from the vanishing gradient problem.
If $w_{z,z} > 1$ , we may encounter the exploding gradient problem.

Long Short-Term Memory, LSTM

memory cell is essentially copied from time step to time step.
New information enters the memory by adding updates.
- the gradient expressions do not accumulate multiplicatively over time.
include gating units: vectors control the flow of information in the LSTM, elementwise multiplication of the corresponding information vector.
a type of RNN designed to overcome vanishing gradient.
use gates (input, forget, output) to control information flow.
Capable of learning long-term dependencies.
Widely used in NLP, speech recognition, and time series forecasting.

Gates in LSTM

Forget gate: decides what information to discard from the cell state.
Input gate: decides what new information to store in the cell state.
Output gate: decides what information to output from the cell state.
- similar role to the hidden state in basic RNNs.
Update equations:
- $f_t = \sigma(W_{x,f}x_t + W_{z,f}z_{t-1})$ $f_{t} = σ (W_{x, f} x_{t} + W_{z, f} z_{t - 1})$
  - Decides which parts of the previous cell state $c_{t-1}$ should be kept or discarded.
- $i_t = \sigma(W_{x,i}x_t + W_{z,i}z_{t-1})$ $i_{t} = σ (W_{x, i} x_{t} + W_{z, i} z_{t - 1})$
  - Determines how much of the new information from the current input $x_t$ and the previous hidden state $z_{t-1}$ should be added.
- $o_t = \sigma(W_{x,o}x_t + W_{z,o}z_{t-1})$ $o_{t} = σ (W_{x, o} x_{t} + W_{z, o} z_{t - 1})$
  - Controls which parts of the current cell state $c_t$ are exposed as the hidden state $z_t$ .
- $c_t = c_{t-1} \odot f_t + i_t \odot \text{tanh}(W_{x,C}x_t + W_{z,C}z_{t-1})$ $c_{t} = c_{t - 1} ⊙ f_{t} + i_{t} ⊙ tanh (W_{x, C} x_{t} + W_{z, C} z_{t - 1})$
  - Cell state update
  - Past information ( $c_{t-1}$ ) is partially retained through the forget gate.
  - New information is added through the input gate and $\tanh$ .
  - Thus, $c_t$ serves as the long-term memory of the LSTM.
- $z_t = o_t \odot \text{tanh}(c_t)$ $z_{t} = o_{t} ⊙ tanh (c_{t})$
  - Hidden state update
  - The cell state is normalized with $\tanh(c_t)$ and filtered by the output gate.
  - $z_t$ is the hidden state passed forward to the next time step.

Gated Recurrent Unit, GRU

Variant of RNN with gating mechanisms.
Designed to capture long-term dependencies without complex architecture.
a simpler alternative to LSTMs. (lightweight, effective RNN variant)
Captures temporal dependencies (short & long).
Combine input and forget gates into a single update gate.
Require fewer parameters than LSTM, making them faster to train.
Perform comparably to LSTMs in many tasks.
Prevents vanishing gradient.
Good balance between complexity & performance.
Excels in time series forecasting tasks.
Widely used in finance, energy and IoT.

Gates in GRU

Update gate (z): decides how much past information to keep
Reset Gate (r): decides how much past information to forget
Candidate hidden state ( $\tilde{h}$ ): potential new memory
Final hidden state ( $h$ ): weighted combination of old and new information.

GRU Workflow

Reset gate ( $r$ $r$ )
- Controls how much of the previous hidden state should be "forgotten."
- A small value means most of the past memory is erased, while a large value means much of it is retained.
Update gate ( $z$ $z$ )
- Acts as a switch to decide whether to keep the previous state $h_{prev}$ or replace it with the new candidate $\tilde{h}$ .
- If $z=1$ , the past is fully kept; if $z=0$ , it is completely replaced by the new candidate.
Candidate state ( $\tilde{h}$ $\tilde{h}$ )
- Combines the current input $x_t$ with the reset-gated previous hidden state to generate the "candidate" new information.
Final hidden state ( $h$ $h$ )
- Blends the past and the candidate using the update gate $z$ .
- If $z$ is large → the past memory dominates.
- If $z$ is small → the new candidate dominates.
$h=(1−z)\tilde{h}+zh_{prev}$

Comparison: RNN vs LSTM vs GRU

Attribute	RNN	LSTM	GRU
Architecture	Simple, hidden state	Complex, memory cell + 3 gates	Simplified, 2 gates (update/reset)
Information Flow	Stored in hidden state	Controlled by gates	Controlled by merged gates
Long-term Dependency	Weak (vanishing gradient)	Strong (gates solve vanishing gradient)	Strong (gates solve vanishing gradient)
Short-term Dependency	Strong	Strong	Strong
Number of Parameters	Few	Many	Fewer than LSTM
Training Speed	Fast	Slow	Fast
Performance	Good for short-term	Good for long-term	Efficient, similar to LSTM
Application Areas	Simple time series, basic NLP	NLP, speech, time series forecasting	Finance, IoT, energy, time series
Vanishing Gradient	Yes	No	No
Typical Use Cases	Text generation, simple prediction	Translation, speech recognition	Time series prediction, sensor data
Simplicity	Very simple, rarely used	More complex, expressive (3 gates)	Simpler than LSTM, fewer parameters
Expressiveness	Limited, struggles with long-term	High, handles very complex sequences	Moderate, good for moderate data size
Training Efficiency	Fast, but limited	Slower, better for complex data	Fast, efficient, similar performance
Trade-off	Simple but weak for long-term	Capacity for complex, long sequences	Simplicity vs. capacity

IAI +004

2025년 8월 26일 · 약 21분

Eunkwang Shin

Owner

Influencing improvement

Agent's component
Agent's prior knowledge, which influence that mode lit builds
the feedback available to learn from.

Components

Given an intelligent agent that performs some intelligent tasks, any components of agent program can be improved by learning.

Prior knowledge

Inductive learning (귀납적 학습): learning a general function or rule (possibly incorrect) from specific input-output pairs
- Bottom to top
- Specific to general
Deductive learning (연역적 학습): going from a general rule to a new specific rule that is logically entailed, but is useful because it allows more efficient processing
- Top to bottom
- general to specific

Feedback

feedback on its percept sequence
no feedback on its percept sequence
rewards for taking a sequence of actions based on its percept sequence

-	Supervised Learning	Unsupervised Learning
Training Data	labeled	unlabeled
Computational complexity	simpler	Computationally complex
Accuracy	high	less accurate

Supervised learning

the agent observes some examples of input-output pairs first and then learns a function or a relationship that maps from inputs to output.
Attributes/Features: the inputs are independent variables in the problem domain
Target attribute: the output is the dependent variable which is dependent on the inputs.
Model: the learned function or relationship
The agent learns a model using examples and uses this model to predict the outcomes for new inputs.

Unsupervised learning

The agent collects adequate examples in the problem domain but it does not get any explicit feedback to the examples.
The agent can make sense of the examples through identifying clusters or frequent patterns in the data.
When shown a large number of examples, the agent can learn to identify clusters of similar examples.

Reinforcement learning

the agent learns from a series of actions which can be rewards or punishments to improve its performance in completing the task under consideration.
the feedback helps the agent to enforce positive actions and reduce the negative actions through adjusting the policy.

Supervised Learning Technique

Decisinon tree
Random forests
Linear regrasssion
Logistic regression
K nearest neighbours
Support vector machines
Neural networks

Regression problem

to predict a continuous value as the output for a given input
weather temperature: solar radiation, wind direction and speed, geographic location..
how to predict the output value of a new data instance on the basis of observed features from the existing data (historical examples) in the problem domain.
Elements
- Collection of existing or historical data samples which are represented by a set of attributes or independent variables
- The output values of the existing data samples
  - the output variable or attribute must be continuous
Regressor: a function describes the relationship between the attributes of a data sample and the output.
- takes the values of attributes of a data sample and predicts the output value of this given data sample.

Evaluate a regressor

R Square/Adjust R Square
MSE Mean Square Error/RMSE Root Mean Square Error
MAD Mean Absolute Error

Examples of regression problems

Predict the fuel price using the Brent crude oil price, financial performance of the oil related companies (cash flow, projects lined up, etc.) and/or geopolitical risks (OPEC announcements, government sanctions, etc.)
Predict the house price of a suburb from the suburb's profile
Predict the blood pressure of a patient based on the patient's health profile
Predict the electricity price using temperature, demand and time.

Classification problem

to predict discrete or categorical value as the output for a given input
Pass or Failed given learning outcomes, student ID, prior learning, attitude, commitment and attendance.
how to put a new data instance into one of predefined categories or classes on the basis of observed features from the existing data in the problem domain.
Elements
- Collection of existing or historical data samples with class labels
- Predefined categories or classes
- Adequate samples in each category or class in the existing or historical data.
Logic-based techniques
- Decision tree
- Learning set of rules
Perceptron-based techniques
- Single-layer perceptron
- Multi-layer perceptron
- RBF network
SVM
Statistical learning techniques
- Naive Bayes classifier
- Bayesian networks
Instance-based learning
- K-nearest neighbor (KNN)

Evaluate a classifier

Confusion matrix
Precision
Recall/Sensitivity
Specificity
F1-Score
Area Under Curve & Receiver Operating Characteristics Curve (AUC-ROC)

Examples of classification problems

banking, healthcare, medical diagnosis, marketing (sentiment aalysis), telecommunication, agriculture, security (fraud detection).
e-mails into spam or non-spam class
loan applications into an approved or a rejected class.
patients into having a certain disease or not having that disease groups.
text into positive or negative sentiment.
customers into churn or non-churn classes.

Overfitting

general phenomenon with all types of learning models.
a modeling error that occurs when a function is too closely or exactly fit to a limited set of data points.
more likely as the complexity of models and the number of input attributes increase
less likely as the number of training examples is large.

Decision Tree

if-then statements to define patterns in data
A if-then statement splits the training data into two or more branches based on some values
Best Split: The results of each branch should be as homogeneous as possible, or has the lowest impurity possible.
- Information gain
- Gini index

Implement Decision Tree

the split (a feature and a condition) that leads to the lowest impurity in the resulting child nodes, in a greedy manner
For categorical features: each unique value can be a split condition.
For continuous features: midpoints between consecutive sorted unique values are used as split conditions.
For each potential split condition, the algorithm calculates the impurity of the resulting child nodes.
The lowest impurity node becomes the split point for that branch.
The process is then repeated recursively for each child node until all leaf nodes are pure, or the stopping criteria are met.

The selectino of best split attributes

ID3: employs a top-down, greedy search through the space of possible branches with no backtracking using information gain
C4.5: using information gain ratio
CART: using Gini Index
Gini Index
Chi-Square
Reduction in Variance

Entropy

the fundamental quantity in information theory. It is a measure of the uncertainty of a random variable.

the fundamental quantity in information theory. It is a measure of the uncertainty of a random variable
A more homogeneous node with a clear majority class has low impurity and low entropy, while a more mixed distribution of classes has high impurity and high entropy.

Information gain

the decrease in entropy.

The information gain from the attribute test on (split on A) is the expected reduction in entropy.
Information gain computes the difference between entropy before the split and average entropy after the split of the dataset based on given attribute values
$Entropy(S) = - \sum_{} p_i \log_2 p_i$
$Gain(S,A)=Entropy(S) − Entropy_{remain}(S,A)$

Gini index

For classification, another impurity measure commonly used for classification tasks in decision trees.
a lower Gini index indicates lower impurity, meaning that the samples in the node predominantly belong to a single class
a bit more computational efficient than entropy as it does not involve logarithm calculations. but results are quite similar.
$Gini(S) = 1 - \sum_{i=1}^{K} p_i^2$

Variance

For regression tree, the target variable is continuous rather than categorical.
use variance as a measure of impurity in regression trees.
lower variance indicates that the data points are closely clustered around the mean
$Var(S) = \frac{1}{N} \sum_{} (y_i - \mu)^2$

Prediction

For classification tasks: the predicted class label is the majority class among the training samples in the leaf node.
For regression tasks: the predicted value is the mean of the target values of the training samples in the leaf node.

Dealing with Overfitting

Overfitting: a common issue in decision trees, where the model captures noise or outliers in the training data rather than the underlying pattern.
the model performs poorly when applied to new, unseen data.

Setting Stopping Criteria

prevent the tree from becoming overly complex, which may lead to overfitting
applied during the tree construction process
limiting the maximum depth of the tree
setting a minimum number of samples per leaf node
requiring a minimum impurity decrease for a split

Pruning Strategies

applied after the tree has been fully grown
removing branches from the fully grown tree to simplify its structure
ensure that it captures the underlying patterns in the data rather than noise or outliers
Pruned trees perform significantly better than unpruned trees when the data contain a large amount of noise.

Ensemble Methods

combine multiple decision trees to form a more robust and accurate model.
address overfitting by averaging the predictions of the individual trees, reducing variance and improving generalization.
Random Forests
Gradient Boosted Trees

Random Forest

combines multiple weak decision tree models to create a stronger learning model.

two types of randomness are introduced to ensure that the individual decision trees are diverse and less prone to overfitting.
Random sampling of the input data
Bootstraping:
- involves sampling with replacement 복원추출 (meaning that some instances appearing multiple times and others not appearing) from the original dataset, creating a new dataset.
- each decision tree is trained on a slightly different set of data points, reducing the likelihood of overfitting.
Random selection of features at each split
- At each split in each decision tree, a random subset of features is considered when determining the best split.
- each tree in the ensemble does not rely on the same set of features for making decisions, resulting in a more diverse set of trees.
- By considering only a subset of features at each split, the model is less likely to be influenced by a small number of dominant features, leading to a more balanced and accurate prediction.

구분	데이터 무작위성 (Bootstrapping)	속성 무작위성 (Feature Subset Selection)
적용 위치	트리 훈련 데이터 선택 단계	트리의 각 분할(split) 단계
방법	원본 데이터셋에서 복원 추출(with replacement)로 샘플링하여 새로운 학습용 부분집합 생성	전체 속성 중 무작위로 일부 속성만 선택 후, 그 속성들로만 분할 기준 탐색
특징	- 각 나무가 다른 데이터 포인트로 학습됨 - 일부 샘플은 여러 번 등장, 일부는 제외될 수 있음	- 각 분할이 다른 속성을 사용 가능 - 동일한 속성에 과도하게 의존하지 않음
효과	- 트리 간의 데이터 다양성 확보 - 과적합 감소	- 트리 간의 속성 다양성 확보 - 소수 지배적 속성의 영향 축소
결과	더 다양한 데이터 시나리오를 반영한 트리들 생성	더 다양한 의사결정 규칙을 반영한 트리들 생성

Predict with Random Forest

aggregating the predictions of all individual decision trees in the forest.
Majority voting: For classification, Count the number of times each class is predicted by the individual decision trees. The class with the highest count is considered as the final prediction.
Averaging: For regression, Calculate the mean of the predictions made by the individual decision trees. The mean value is considered as the final prediction.

Linear regression

a learning technique that finds a linear relationship between input variables and the target variable based on a fundamental assumption that there is a linear relationship between input variables and the target variable

e.g. the input variables (engine size, weight and car age) ➡️ target variable (car fuel efficiency)
- assumption that there is a linear relationship
A linear regression technique learns a set of coefficients to estimate the linear relationship between $x$ $x$ and $y$ $y$ , denoted as $h_w$ $h_{w}$ , which can be represented by the following equation.
- $h\_w(x) = w_0 + w_1x_1 + ... + w_nx_n = \sum_{i=0}^{n} w_ix_i$
- $w$ is a weight vector
- $\hat{y} = \sum_{i=0}^{n} w_ix_i$
linear regression model is an approximate function between the input variables and the target variable, there will be an error between the output of the model and the actual output value for a data sample
- This error can be represented by a loss function, which calculates the mean square error
- $Loss(h_w) = \frac{1}{2m}\sum_{j=1}^{m}(h_w(x_j) - y_j)^2 = \frac{1}{2m}\sum_{j=1}^{m}(y_j - \sum_{i=0}^{n} w_ix_{j,i})^2$
for solving regression problems

Solving a linear regression problem

to find the best linear relationship $h_w$ $h_{w}$ that best fits the training data of $m$ $m$ data samples.
- makes the loss to be minimised.
to find the best weight vector $w^*$ $w^{*}$ , such that for a given training dataset of $m$ $m$ data samples.
- $w^* = \arg\min_{w} Loss(h_w)$
gradient descent: continuously resamples the gradient of the weight coefficients in the opposite direction depending on the weight $w$ $w$ .
- Until the loss function $Loss(h_w)$ reaches the global minimum
- to change the individual components of $w$ a little bit in the direction that minimises $Loss(h_w)$ , and to do this many times.
$w_i \;\leftarrow\; w_i + \alpha \sum_{j=1}^{m} x_{j,i} \Big( y_j - h_w(x_j) \Big)$ $w_{i} \leftarrow w_{i} + α \sum_{j = 1}^{m} x_{j, i} (y_{j} - h_{w} (x_{j}))$
- $\alpha$ : the step size, the learning rate
Training model: the process of iteratively updating weights with a learning rate to minimise loss, where the final weight vector defines the model used for predicting new data.
use regularisation on a multivariate linear function to avoid overfitting.
Batch gradient descent: consider the entire training dataset $(X, y)$ $(X, y)$ at once.
- $w_0 \;\leftarrow\; w_0 + \alpha \sum_{j=1}^{m} \Big(y_j - (w_0 + w_1 x_j)\Big)$
- $w_1 \;\leftarrow\; w_1 + \alpha \Big(\sum_{j=1}^{m} (y_j - (w_0 + w_1 x_j)) \cdot x_j\Big)$
Stochastic gradient descent (SGD): consider only a single training data sample $(x_j, y_j)$ $(x_{j}, y_{j})$ at a time.
- $w_0 \leftarrow w_0 + \alpha \big( y_j - (w_0 + w_1 x_j) \big)$
- $w_1 \leftarrow w_1 + \alpha \big( (y_j - (w_0 + w_1 x_j)) \cdot x_j \big)$
- can be used in an online setting, where new data is coming one at a time, or offline, where we cycle through the same data as many times as is necessary, taking a step after considering each single example.
- With a fixed learning rate $\alpha$ , the stochastic version does not guarantee convergence.
- often faster than batch gradient descent.
- With a schedule of decreasing learning rates (SA), the stochastic version does guarantee convergence.
These update rules are derived as the next weight update equations by taking the partial derivatives of the loss function with respect to $w_0$ and $w_1$ .

Logistic Regression

an extension of linear regression in such a way that the output of a linear regression model goes through a logistic function

$y(x) = \frac{1}{1 + e^{-x}}$
The output value of this logistic function is between 0 and 1.
0 is for certainly being labeled "0" and 1 is for certainly being labeled "1", and a value between 0 and 1 represents the probability of being labeled "1"
a logistic regression model: a linear regression model + a logistic function
mainly for solving classification problems

Nearest Neighbor

a technique to predict the output of a given new sample based on a collection of existing samples.

is to find the k-nearest neighbours of given sample in the collection and determine the output based on these k neighbours.
k is always chosen to be an odd number.
can be used for both classification and regression problems.
- For classification: majority vote of the neighbours.
- For regression: mean/median (or regression) of the neighbours.
Instance-based learning
- KNN does not learn a separate model.
- Instead, it stores all training data and uses them directly at prediction time.
Non-parametric model
- KNN has no parameters (like weights in linear regression) to train.
- The model is essentially the full dataset plus a distance measure.

Distance measures

Minkowski distance or $L^p$ norm
- $L^p(x_j, x_q) = \left( \sum_i |x_{j,i} - x_{q,i}|^p \right)^{1/p}$
- Euclidean distance: $p = 2$ , for the dimensions are measuring similar properties, such as the width, height and depth of 3D objects.
- Manhattan distance: $p = 1$ , for the dimensions are measuring dissimilar properties, such as age, weight, and gender of a patient.
- Hamming distance: the number of attributes on which the two points differ, for Boolean attribute values

Nomarlization

use the raw data from each dimension then the total distance will be affected by a change in scale in any dimension
To avoid this, apply normalization to the measurements in each dimension.
to compute the mean $\mu_i$ and standard deviation $\sigma_i$ of the values in each dimension, and rescale them
The rescaling is done using the formula:
- $x'_{j,i} = \frac{x_{j,i} - \mu_i}{\sigma_i}$ where $x'_{j,i}$ is the normalized value, $x_{j,i}$ is the original value, $\mu_i$ is the mean, and $\sigma_i$ is the standard deviation.

Time complexity

Conceptually trivial: Given a set of N examples and a query $x_q$ , iterate through the examples, measure the distance to $x_q$ from each one, and keep the best k.
$NN(k, x_q)$ 's time complexity is $O(N)$ , N is the number of examples in the training dataset.
Use a k-dimensional tree: a balanced binary tree with an arbitrary number of dimensions.
- Time complexity can be improved to $O(\log N)$
- appropriate only when there are many more examples than dimensions
- It works well with up to 10 dimensions with thousands of examples.
Use a Hash table with a locality-sensitive hash (LSH)
- Time complexity can be improved to $O(1)$

SVM

a framework for finding a boundary that distinctly classifies the data points in an optimal way.

supervised learning, binary classification
SVM chooses the boundary with the maximum possible geometric margin, which has the largest distance to the nearest training data points of any class
initially designed for binary classification problems but can also be applied for solving multi-class classification problems

Linear discriminant

$X_i$ is multiplied by its matching weight $w_i$
all these products are added together and passed to a threshold function
Decision surface: if $g(x) = w \cdot x \gt 0$ then $f(x) = +1 (class1)$ else $f(x) = -1 (class2)$
Decision function: $f(x) = \text{sign}(g(x)) = \text{sign}(w_0 + w_1x)$ $f (x) = sign (g (x)) = sign (w_{0} + w_{1} x)$
- To make a decision, the continuous value $g(x)$ is passed through the sign function so that it outputs either +1 or -1.
If the data from the two classes can be separated with a hyperplane, linearly separable.

Hyperplane

separates the data in 2D by a line or in 3D by a plane
The orientation of the hyperplane is given by the vector $w$
the location of the hyperplane is given by $w_0$
The distance from the origin to the hyperplane is $\frac{|w_0|}{\|w\|}$
If a given data sample $x^*$ and $g(x^*) = 0$ , then this data sample is on the separation boundary. It can normally be assigned to any class.
geometric margin: the minimum distance between the samples and the hyperplane by constructing and solving a constrained optimization problem
- $\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \right)$
primary optimization problem: to maximize the minimal geometric distance across the training dataset of m samples.
- $\max_{w, w_0} \Big( \min_{i=1,\ldots,N} \gamma_i \Big) = \max_{w, w_0} \Big( \min_{i=1,\ldots,N} \Big( y_i \Big( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \Big) \Big) \Big)$
- $\min_{w, w_0} \; \frac{1}{2}\|w\|^2$
- $\text{s.t. } \; y_i (w \cdot x_i + w_0) \geq \min_{i=1,\ldots,N} \big( y_i (w \cdot x_i + w_0) \big)$
dual optimization problem: easier to solve. More importantly the dual optimisation problem enables the so-called kernel trick in SVM
- $\max_{\alpha} \; \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
- $\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) \;-\; \sum_{i=1}^N \alpha_i$
- $\text{s.t.} \quad \sum_{i=1}^m \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i=1,2,\ldots,N$

Attractive Properties

SVMs construct a maximum margin separator
- the largest possible distance to example points, helping to improve generalization
SVMs create a linear separating hyperplane
- kernel trick: to embed the data into a higher-dimensional space
- Often data that are not linearly separable in the original input space are easily separable in a higher-dimensional space
- In general (excepted some special cases) if we have $N$ data points then they will always be separable in spaces of $N$ dimensions or more
SVMs are a nonparametric method
- retain training examples and potentially need to store them all
- In practice, they often end up retaining only a small fraction of examples
- have the flexibility to represent complex functions, but they are resistant to overfitting
not usually expect to find a linear separator in the input space $x$ $x$ , but we can find linear separators in the high-dimensional feature space $F(x)$ $F (x)$ simply by replacing $(x_j x_k)$ $(x_{j} x_{k})$ in
- $argmax_{\alpha} \sum_{j}\alpha_j - \frac{1}{2} \sum_{j,k}\alpha_j \alpha_k y_j y_k (x_j \cdot x_k)$ $a r g ma x_{α} \sum_{j} α_{j} - \frac{1}{2} \sum_{j, k} α_{j} α_{k} y_{j} y_{k} (x_{j} \cdot x_{k})$
  - with $K(x_j, x_k) = F(x_j) \cdot F(x_k)$
  - $F(x_j) \cdot F(x_k)$ can often be computed without first computing $F$ for each point.
In a higher dimensional feature space, which is created by transformation $F(x)$ $F (x)$ , if we can express $K(x_j \cdot x_k) = F(x_j) \cdot F(x_k)$ $K (x_{j} \cdot x_{k}) = F (x_{j}) \cdot F (x_{k})$ , the kernel function $K(x_j \cdot x_k)$ $K (x_{j} \cdot x_{k})$ can be applied to pairs of input data to evaluate dot product in some corresponding feature space.
- kernel trick is to plug a kernel function $K(x_j \cdot x_k)$ into the dual optimisation problem to replace $(x_j \cdot x_k)$
- Optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.
- we can learn in the higher-dimensional space, but we compute only kernel functions rather than the full list of features for each data point.

Classification evaluation metrics

용어	설명
True Positive (TP)	실제 1, 예측 1
True Negative (TN)	실제 0, 예측 0
False Positive (FP)	실제 0, 예측 1 (0을 잘못 1로 예측)
False Negative (FN)	실제 1, 예측 0 (1을 놓쳐서 0으로 예측)

Accuracy

the proportion of correctly classified instances (data points or samples) among the total instances.

$Accuracy = \frac{TP + TN}{\text{Total number of predictions}}$

a ratio of the number of correct predictions
it may not be suitable for imbalanced datasets where the class distribution is skewed
- a model that always predicts the majority class will have high accuracy but may not be useful in practice.

Precision

the proportion of true positives among all positive predictions.

$Precision = \frac{TP}{\text{number of positive predictions}} = \frac{TP}{TP + FP}$

the model's ability to not mistakenly view negatives as positives
A high precision value indicates that the model has made fewer false positive predictions.
it's useful to minimize the number of false positives.

Recall

Sensitivity, True Positive Rate, the proportion of true positive instances among the actual positive instances

$Recall = \frac{TP}{\text{number of positive instances}} = \frac{TP}{TP + FN}$

the model's ability to not mistakenly view actual positives as negatives
A high recall value indicates that the model has successfully identified a large portion of the actual positive instances
it's useful when the cost of false negatives is high
- e.g. in medical diagnosis, where failing to identify a disease can have severe consequences

F1 Score

the harmonic mean of precision and recall

$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$

a balanced evaluation of the model's performance
It is particularly useful when dealing with imbalanced datasets, where one class is significantly more prevalent than the other.
It's useful for imbalanced datasets to balance precision and recall.

Regression evaluation metrics

Mean Absolute Error (MAE)

the average of the absolute differences between the predicted values and the actual values

$MAE = \frac{1}{n} \sum_{i=1}^n \lvert y_i - \hat{y}_i \rvert$

It measures the average magnitude of errors made by the model, without considering their direction
A lower MAE value indicates that the model has made smaller prediction errors

Mean Squared Error (MSE)

the average squared difference between the predicted and actual values

$MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$

It's useful when you want to penalise larger errors more heavily, making it more sensitive to outliers at the same time
often used as a loss function when training regression models
A lower MSE value indicates that the model has smaller prediction errors, with a strong preference for avoiding large errors.

Root Mean Squared Error (RMSE)

the square root of MSE

$RMSE = \sqrt{MSE}$

it more interpretable as it is in the same units as the dependent variable.

R-Squared (Coefficient of Determination)

how well the regression model approximates the actual data

$R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$

the proportion of "sum squared regression (SSR)" and "total sum of squares (SST)"
- SSR obviously captures the model's prediction errors
- SST is the variance of the target variable.
  - can be viewed as a naive model ( $\hat{y} = \bar{y}$ ) using the average value of the target variable as the prediction
$R^2 = 1$ : The model predicts perfectly (the error is 0).
$R^2 = 0$ : The model does no better than a naive model that always predicts the mean of the target variable.
$R^2 < 0$ : The model performs worse than simply predicting the mean, meaning its predictions increase the error compared to the naive baseline.

IAI +003

2025년 8월 15일 · 약 5분

Eunkwang Shin

Owner

Local Search Problem

To find the state that gives the optimal/best value of the evaluation function

It can be seen as an optimization problem.
a computational problem that finds the best solution (a state) that satisfies the given constraints
evaluation function === objective function
Only cares about the optimal solution/best state without considering the paths to reach the best state (the optimal solution)
Not systematic

Feasible region & solution

Feasible region: the set of all possible or candidate solutions which are the solutions that satisfies the problem's constraints
Feasible solution: a solution in the feasible region

Search Problem vs Local Search Problem

Path-based vs State-based

Aspects	Search Problem	Local Search Problem
State	All possible states - state-space landscape	Range of decision variables and constraints
Goal	Goal state & goal test	Evaluation function & objective function
Evaluation	Measure closeness to goal - distance/fitness	Minimize cost or maximize fitness
Transition/Successor	Transition function	Successor function

Discrete & Continuous Optimization

Discrete optimization: optimization problems where the solution space is discrete (e.g., 8 queens problem)
Continuous optimization: optimization problems where the solution space is continuous (e.g., real numbers, any value within a range)

Information needed for Local Search

All possible states: state-space landscape
Transition function: To find neighbor or successor state
Goal state
Objective function: A way to measure how close to the goal state
Start state

Search state-space

Global Maximum: A state that maximizes the objective function over the entire state space
Local Maximum: A state that maximizes the objective function within a small area around it.
Plateau: A state such that the objective function is constant in an area around it.
- Shoulder: A plateau that has uphill edge.
- Flat: A plateau whose edges go downhill.

Advantages

use little memory
can often find reasonably good solution in large or infinite search spaces
useful for solving pure optimization problems
don't need to know the path to the solution.

Hill climbing

keeps track of one current state and on each iteration moves to the neighboring state with highest value.

$f = max(-cost(X))$
Steps
- Evaluate the initial stat
- If it is equal to the goal state, return. Otherwise, continue.
- Find a neighboring state
- Evaluate this state. If it is closer to the goal state than before, replace the initial state with this state.
- Repeat steps 2-4 until it reaches a goal state (local or global maximum) or runs out of time.
No search tree, No backtracking, Don't look ahead beyond the current state.
- get stuck due to local maxima, plateaus, or ridges.

Variations of HC

Simple HC: greedy local search which expands the current state and moves on to the best neighbor.
Stochastic HC: choose randomly among the neighbors going uphill.
First-choice HC: generate random successor until one is better. Good for states with high numbers of successors.
Random restart: conducts a series of hill climbing searches from random initial states until a goal state is found.

Simulated Annealing

based upon the annealing process to model the search process for finding an optimal solution to an optimisation problem

annealing schedule, temperature, energy
finds the minimal value of the objective function (energy function)
starts with a high temperature and then gradually reduces the temperature
$P = e^{-\Delta E / kT}$ $P = e^{- Δ E / k T}$
- $\Delta E$ : how bad the new state is compared to the old state
- $T$ : temperature is getting lower over time
- $k$ : a scaling factor
Swap condition: $\Delta E <= 0$ or ${-\Delta E / kT} > \text{random}$

Evolutionary algorithms

Local beam search
Stochastic beam search
Genetic algorithms

Characteristics

size of the population
representation of each individual
mixing number
selection process for selecting the individuals who will become the parents of the next generation
recombination procedure
mutation rate
makeup of the next generation

Genetic algorithm

It uses operators, such reproduction, crossover and mutation, inspired by the natural evolutionary principles.

State: is represented by an individual in a population. Traditional representation is a chromosome
Objective function: is used to evaluate the fitness of an individual (= fitness function, 적합도 함수)
Successor function: consists of three operators: reproduction, crossover, and mutation
Solution: is found through evolution from one generation to another generation

Genetic Algorithm

Roulette Wheel Selection

Compute total fitness of all individuals.
- Example: A=30, B=20, C=40, D=10 → Total = 100.
Calculate probability of each individual being selected
- Formula: $P(i) = \frac{fitness(i)}{total\_fitness}$ $P (i) = \frac{f i t n ess ( i )}{t o t a l _ f i t n ess}$
  - A = 30/100 = 0.30
  - B = 20/100 = 0.20
  - C = 40/100 = 0.40
  - D = 10/100 = 0.10
Convert to cumulative probabilities
- P4 = 0.10
- P4 + P3 = 0.50
- P4 + P3 + P2 = 0.90
- P4 + P3 + P2 + P1 = 1.00
Generate a random number between 0 and 1.
Select an individual based on the random number and cumulative probabilities.

Roulette Wheel Selection

⚫ random = 0.07 → falls in P4 [0, 0.10)
🔺 random = 0.37 → falls in P3 [0.10, 0.50)
⬟ random = 0.82 → falls in P2 [0.50, 0.90)

Applications of GA

Parameter tuning: optimize the parameters in NN
Planning: economic dispatch, train timetabling
Design & Control problems: robotic control, adaptive control systems
Successful use of GA requires careful engineering of the representation

Structural Knowledge​

Requirement of hierarchices​

Advantage of hierarchical approach​

Partition​

Physical composition​

Measurements​

Objects​

Time​

Event calculus​

Successor-state axiom​

Semantic networks​

Taxonomy hierarchy​

University Ontology​

Knowledge graph​

Types of Knowledge Graphs​

General Knowledge Graphs​

Examples​

Reasoning for categories​

Reasoning using Semantic networks​

RDF​

Uncertain Domain​

Bayes' Theorem​

General Form of Bayes' Rule​

Constructing Bayesian Network​

Knowledge Representation​

Bayesian Network​

Burglary Example​

Probabilistic reasnoing​

Inference in Bayesian Networks​

Exact inference in Bayesian Networks​

Knowledge representation​

Propositional logic​

FOL, First-order logic​

Logical equivalence​

Reasoning​

Deductive reasoning​

Inductive reasoning​

Inference​

Inference Problem​

Inference Problem Example​

Inference by theorem proving​

Modus Ponens Rule​

Add-Elimination Rule​

Terms to contradiction and resolution​

Propositional satisfiability (SAT) problem​

Resolution rule​

Unit resolution rule​

Full resolution rule​

Inference via proof by contradiction through resolution​

Inference in FOL​

Universal instantiation (UI)​

Existential instantiation (EI)​

CNF, Conjunctive Normal Form​

Generative AI​

Historcal context of Gen AI​

Variational Autoencoders, VAEs​

VAE Examples​

Face Generation Example​

Anomaly Detection Example​

Autoencoder​

Image Reconstuction Example​

Denoising Images​

Loss function​

Generative Adversarial Network, GAN​

Autoregressive Models​

Training Autoregressive Models​

Interfence​

NLP​

Tokenization​

Corpus​

Text classification​

Bag of Words, BoW​

TF-IDF​

Examples of TF-IDF​

Naive Bayes​

Examples of Naive Bayes​

N-gram model​

Examples of N-gram​

Transformer​

Cross-Attention​

Structural Knowledge

Requirement of hierarchices

Advantage of hierarchical approach

Partition

Physical composition

Measurements

Objects

Time

Event calculus

Successor-state axiom

Semantic networks

Taxonomy hierarchy

University Ontology

Knowledge graph

Types of Knowledge Graphs

General Knowledge Graphs

Examples

Reasoning for categories

Reasoning using Semantic networks

RDF

Uncertain Domain

Bayes' Theorem

General Form of Bayes' Rule

Constructing Bayesian Network

Knowledge Representation

Bayesian Network

Burglary Example

Probabilistic reasnoing

Inference in Bayesian Networks

Exact inference in Bayesian Networks

Knowledge representation

Propositional logic

FOL, First-order logic

Logical equivalence

Reasoning

Deductive reasoning

Inductive reasoning

Inference

Inference Problem

Inference Problem Example

Inference by theorem proving

Modus Ponens Rule

Add-Elimination Rule

Terms to contradiction and resolution

Propositional satisfiability (SAT) problem

Resolution rule

Unit resolution rule

Full resolution rule

Inference via proof by contradiction through resolution

Inference in FOL

Universal instantiation (UI)

Existential instantiation (EI)

CNF, Conjunctive Normal Form

Generative AI

Historcal context of Gen AI

Variational Autoencoders, VAEs

VAE Examples

Face Generation Example

Anomaly Detection Example

Autoencoder

Image Reconstuction Example

Denoising Images

Loss function

Generative Adversarial Network, GAN

Autoregressive Models

Training Autoregressive Models

Interfence

NLP

Tokenization

Corpus

Text classification

Bag of Words, BoW

TF-IDF

Examples of TF-IDF

Naive Bayes

Examples of Naive Bayes

N-gram model

Examples of N-gram

Transformer

Cross-Attention