본문으로 건너뛰기

"iai" 태그로 연결된 12개 게시물개의 게시물이 있습니다.

모든 태그 보기

IAI +012

· 약 8분

Structural Knowledge

  • Structural knowledge: the relationship between concepts and objects in the world.
  • Hierarchical approach: built on classification and uses hierarchies as the structures for knowledge representation, presented in a graphical format.
    • Semantic networks: nodes represent concepts, and edges represent relationships between concepts.
    • Ontologies
      • A formal specification of concepts, relationships, and constraints within a domain that enables machines to reason automatically.
      • It defines classes (concepts), subclasses (hierarchical relations), and properties (attributes or relations) that describe how entities interact.
      • Unlike simple taxonomies, ontologies also include logical axioms and constraints that specify allowable relationships and permit inference and consistency checking through automated reasoning.
      • Classes: Main concepts or categories (e.g. Person, Animal).
      • Subclasses: Subcategories (e.g. Dog ⊂ Animal).
      • Properties: Attributes (e.g. hasPart, hasColor).
      • Relations: Connection rules (e.g. MemberOf, SubsetOf).
      • Constraints: Logical constraints (e.g. disjointness, transitivity) that enable inference.

Requirement of hierarchices

  • Inclusiveness: Dog ⊆ Mammal ⊆ Animal
  • Species/differentia: Dog = Mammal + (barks, canine traits, etc.)
  • Inheritance: Mammal: has_fur, gives_live_birthDog: inherits all Mammal properties
  • Transitivity: If Dog ⊆ Mammal and Mammal ⊆ Animal, then Dog ⊆ Animal
  • Systematic and predictable rules for association and distinction
  • Mutual exclusivity: Reptile ∩ Mammal = ∅
  • Necessary and sfficient criteria:
    • Necessary: x, Mammal(x)Vertebrate(x)ProducesMilk(x)\forall x,\ \text{Mammal}(x) \Rightarrow \text{Vertebrate}(x) \land \text{ProducesMilk}(x)
    • Sufficient: x, Vertebrate(x)ProducesMilk(x)Mammal(x)\forall x,\ \text{Vertebrate}(x) \land \text{ProducesMilk}(x) \Rightarrow \text{Mammal}(x)

Advantage of hierarchical approach

  • Inferring from incomplete evidence (if the shared criteria are not obvious or easily observable).
    • Animal → Mammal → Dog: If an entity is classified as a Mammal, we can infer properties such as having fur and giving live birth, even if the entity is not explicitly identified as a Dog.
  • Excellent representations in mature domains
    • Domains where entities and relationships are well understood and stable
    • e.g. medical diagnosis, biological taxonomy, type systems in programming languages.
  • Useful for entities that are well defined and have clear class boudnaries.
    • Good fit: HTTP status codes, chemical elements, and biological species
    • Poor fit: emotions, social roles, and cultural practices
  • Some theory or model is necessary to guide the identification
    • Provides criteria for defining entities and relationships
    • e.g. Evolutionary theory in biological taxonomy, Type theory in programming languages

Partition

A partition of a category is a set of subcategories that form a disjoint, exhaustive composition of that category.

  • Disjoint: Two or more categories are disjoint if they don't share common members.
  • Exhaustive composition: The subcategories together cover all members of the parent category, leaving no member unclassified.
  • Examples:
    • ❌ Category: Animal (Not a partition)
      • Mammal, Bird (Reptiles, fish, insects are missing, not exhaustive)
    • ✅ Category: Integer (Partition)
      • Even, Odd (Disjoint and exhaustive)

Physical composition

  • PartOf Relation: Partof(a, b) is a relation representing that one thing, 'a', is a part of another thing, 'b'.
  • BunchOf Relation: BunchOf(a) is a relation, taking a set of objects 'a', to represent a composite object made up of those parts.
  • Examples:
    • Partof(Wheel, Car): A wheel is part of a car (one-to-one relation).
    • BunchOf({Wheel1, Wheel2, Wheel3, Wheel4}): A car is a bunch of four wheels (many-to-one relation).
  • Link between PartOf and BunchOf:
    • x(xs    PartOf(x,BunchOf(s)))\forall x (x \in s \implies PartOf(x, BunchOf(s)))
    • y[(x(xs    PartOf(x,y))    PartOf(BunchOf(s),y)]\forall y \Big[\big(\forall x (x \in s \implies PartOf(x, y)\big) \implies PartOf(BunchOf(s), y)\Big]
  • Why useful?
    • Reasoning from individual parts -> group -> larger object.
    • Avoiding ambiguity between: "this thing is part of", "these things together form"
    • Without BunchOf, ontologies cannot represent: piles, colleciton, aggregates, composite physical structures.

Measurements

  • Quantitive measures (Ratio, Interval)
    • Represented as numbers with units
    • Support arithmetic and unit conversion
    • Enable numeric reasoing (e.g. 2.54cm = 1 inch)
  • Non-quantitative measures (Ordinal)
    • Cannot be meaningufully represented as numbers
    • Can still be compared using ordering relactions (<, >, =)
    • Suppor qualitative reasoning (e.g. one task is more difficult than another)

Objects

  • Stuff
    • Represents substances
    • Uncountable masses
    • Definitions include only intrinsic properties (e.g. Butter, Unsalted Butter)
    • bButterPartOf(p,b)    pButterb \in Butter \land PartOf(p, b) \implies p \in Butter
  • Things
    • Represents discrete objects
    • Countable entities
    • Definitions include extrinsic properties (e.g. PoundOfButter, StickOfButter)
      • It depends on measurement
      • It depends on contextual constraints

Time

  • Fluent: a condition whose truth value can change over time.
    • e.g. "The box is on the table", On(box,table)On(box, table), On(box,table,t)On(box, table, t)
    • a time-dependent proposition.
  • Time scale and absolute time
    • Ontology represents time along a single continuous timeline with a fixed reference point.
    • Date(0,20,21,24,1,1995)=Seconds(300000000)Date(0, 20, 21, 24, 1, 1995) = Seconds(300000000)
    • It allows arithmetic operations on time and comparisons between time points.
  • Time intertvals
    • Time can be represented as moments (instants) or intervals (durations).
    • Duration(i)=Time(End(i))Time(Start(i))Duration(i) = Time(End(i)) - Time(Start(i))
  • Partition of time
    • Partition(Moments,ExtendedIntervals,Intervals)Partition({Moments, ExtendedIntervals}, Intervals)
    • All time intervals can be partitioned into moments and extended (non-zero-length) intervals.
    • It provides a complete and precise ontology of time.

Event calculus

  • T(f,t1,t2)T(f, t_1, t_2): Fluent ff is true for all time between time t1t_1 and t2t_2.
  • Happens(e,t1,t2)Happens(e, t_1, t_2): Event ee start at time t1t_1 and ends at time t2t_2.
  • Initiate(e,f,t)Initiate(e, f, t): Event ee causes fluent ff to become true at time tt.
  • Terminates(e,f,t)Terminates(e, f, t): Event ee causes fluent ff to cease to be true at time tt.
  • Initiated(f,t1,t2)Initiated(f, t_1, t_2): Fluent ff becomes true at some point between t1t_1 and t2t_2.
  • Terminated(f,t1,t2)Terminated(f, t_1, t_2): Fluent ff ceases to be true at some point between t1t_1 and t2t_2.
  • t1<t2t_1 < t_2: Time point t1t_1 occurs before time point t2t_2.
  • Happens(PutBoxOnTable, 10, 12)
  • Initiate(PutBoxOnTable, On(Box, Table), 12)
    • Initiated(On(box, table), 10, 12)
  • T(On(Box, Table), 12, 20)
  • Terminates(RemoveBoxFromTable, On(Box, Table), 20)
    • Terminated(On(box, table), 18, 22)

Successor-state axiom

후속상태공리

  • Define how the world changes after an action occurs.
    • what changes when an action happens
    • what stays the same.
  • Without successor-state axioms, we face the frame problem:
    • after every action, we would need to explicitly list all facts that did not change.
  • Move(A,B,X)Move(A, B, X): moving block A from the top of block B to position X.
    • Preconditions:
      • On(A,B)On(A, B): A is on top of B
      • Clear(A)Clear(A): nothing is on top of A
      • Clear(X)Clear(X): nothing is on position X
    • Effects:
      • On(A,X)On(A, X): A is now on position X
      • Clear(B)Clear(B): B is now clear
      • Clear(X)Clear(X): X is no longer clear (X is now occupied by A)

Semantic networks

  • Visually represent a knowledge base.
  • Support efficient inference.
  • Allow properties of an object ot be inferred from its category membership.
  • Representing individual objects, categories of objects, and relations among objects.
  • Categories are the primary buildling blokcs of large-scale knowledge representation schemes.

Taxonomy hierarchy

  • A hierarchical structure of categories
  • Each lower category is a more specific kind of its parent
  • Organized from general → specific

University Ontology

EntityClassExample Statemnets
AliceStudentenrolledIn(CS101)enrolledIn(CS101), assessedBy(Prof.Smith)assessedBy(Prof.Smith)
Prof.SmithLecturermemberOf(Lecturer)memberOf(Lecturer), teaches(CS101)teaches(CS101)
CS101CoursebelongsTo(CSDepartment)belongsTo(CSDepartment)

Knowledge graph

  • represents information and its relationships using a graph structure.
  • Nodes: entities or concepts (e.g. people, places, things).
  • Edges: relationships between nodes (e.g. "is a", "part of", "located in").

Types of Knowledge Graphs

  • General Knowledge Graphs
  • Domain-Specific Knowledge Graphs
  • Semantic Knowledge Graphs
  • Social Knowledge Graphs
  • Temporal Knowledge Graphs
  • Special Knowledge Graphs
  • Statistical Knowledge Graphs
  • Probabilistic Knowledge Graphs
  • Textual Knowledge Graphs
  • Multi-modal Knowledge Graphs

General Knowledge Graphs

  • Comprehensive information representation
  • Entity-Relationship Structure
  • Linked Data
  • Semantic Enrichment
  • Capabilities & Use Cases
    • Querying and Analysis
    • Data Integration
    • ML and AI applications

Examples

  • DBPedia: RDF, resource descritions framework
  • Wikidata
  • YAGO
  • Google Knowledge Graph
  • Microsoft Academic Graph
  • IBM Watson Knowledge Studio

Reasoning for categories

  • Infer the presence of certain objects from perceptual input
  • Infer category membership from perceived properties
  • Use category information to make predictions
  • It enables an agent to identify objects from observed properties and to predict further characteristics using category knowledge.

Reasoning using Semantic networks

  • Semantic networks are:
    • systems designed specifically for organizing categories reasoning with categories.
    • Provide graphical representations of a knowledge base
  • Using a semanctic network, reasoning can be performed based on:
    • relationships between objects
    • category membership
    • inheritance between categories
    • properties associated with categories
    • Allows an object to inherit general knowledge from its category.
  • The inheritance algorithm:
    • Starts from the object itself
    • Follows links upwards through the category hierarchy
    • Stops as soon as it finds a value for the property
    • This suppors efficient reasoning, default values, and exception handling.

RDF

Resource Description Framework (RDF)

  • Triple Structure: subject -> predicate -> object
  • Subject: represents the resource being described or identified by a URI.
  • Predicate (or Property): desribes the relationship between the subject and the object. also represented by a URI.
  • Object: represents the value or target of the relationship. It can be a URI or a literal (a string, number, or date).

IAI +011

· 약 10분

Uncertain Domain

  • Agent may need to handle uncertainty
  • Environment is partially observable
    • Fully observable: An agent knows where it is.
    • The drone never knows its exact position, only an estimated one.
    • The true user intent is not directly observable, only noisy speech data.
    • It cannot see the whole environment or detect humans perfectly.
    • The true health state (disease present or not) is hidden.
  • Environment is non-deterministic
    • AI Trading Agent
      • even if the agent takes the same action, the outcome can vary dramatically.
      • from market prices changes unpredictably due to external events (news, other traders, economic data).
      • network latency, competing traders' behavior are stochasic and independent, random fluctuations can affect the outcome.
      • buy action could lead to profit, loss, or no change depending on uncontrollable external factors.
    • AI Helpdesk Chatbot
      • even when the chatbot executes the same action, it sends queries to multiple backend systems, the result may differ each time.
      • network latency or failures, backend load, concurrent requests, and external dependencies can lead to different response times or outcomes.
      • same command doesn't guarantee the same outcome.
  • Autonomous Driving
    • Partial Observability
      • Sensors cannot detect everything, blind spots, weather effects, occluded vehicles.
    • Non-Determinism
      • Even if the car signals a turn and starts moving, other drivers might brake suddenly, pedestrians might cross unexpectedly, or traffic lights might malfunction.
      • depending on random external events.
  • Medical Diagnosis & Treatment Assistant
    • Partial Observability
      • The patient's internal health state is not directly visible.
      • tests can be inaccurate or incomplete: false positives/negatives, missing data
    • Non-Determinism
      • The same treatment can have different effects on different patients due to genetics, lifestyle, or random biological responses.
      • A drug may succeed, fail, or cause side effects.

Bayes' Theorem

P(AB)=P(BA)P(A)P(B)P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

  • P(AB)P(A|B): Posterior probability, the probability of event A given taht B has occurred.
  • P(BA)P(B|A): Likelihood, the probability of event B occurring given that A is true.
  • P(A)P(A): Prior probability, the initial probability of A being true before considering B.
  • P(B)P(B): Evidence, the total probability of B occurring.
  • examples
    • 1% of people have a certain disease (prior probability). P(Disease)=0.01P(Disease) = 0.01
    • the test for the disease is 99% accurate (likelihood).
      • P(PositiveTestDisease)=0.99P(PositiveTest|Disease) = 0.99
      • P(NegativeTestNoDisease)=0.99P(NegativeTest|NoDisease) = 0.99
      • so P(PositiveTestNoDisease)=0.01P(PositiveTest|NoDisease) = 0.01
    • P(PositiveTest)=P(PositiveTestDisease)P(Disease)+P(PositiveTestNoDisease)P(NoDisease)P(PositiveTest) = P(PositiveTest|Disease) \cdot P(Disease) + P(PositiveTest|NoDisease) \cdot P(NoDisease)
      • =0.990.01+0.010.99=0.0198= 0.99 \cdot 0.01 + 0.01 \cdot 0.99 = 0.0198
    • P(DiseasePositiveTest)=P(PositiveTestDisease)P(Disease)P(PositiveTest)P(Disease|PositiveTest) = \frac{P(PositiveTest|Disease) \cdot P(Disease)}{P(PositiveTest)}
      • =0.990.010.01980.5= \frac{0.99 \cdot 0.01}{0.0198} \approx 0.5
  • the conditional probability P(effectcause)P(effect|cause) quantifies the relationship in the causal direction from cause to effect.
  • but P(causeeffect)P(cause|effect) is often what we really want to know, describing the relationship in the diagnostic direction from effect to cause.
    • In medical dignosis, the doctor knows P(symptomsdisease)P(symptoms|disease) from medical studies, and want to derive a P(diseasesymptoms)P(disease|symptoms) for a particular patient.
    • P(diseasesymptoms)=P(symptomsdisease)P(disease)P(symptoms)P(disease|symptoms) = \frac{P(symptoms|disease) \cdot P(disease)}{P(symptoms)}

General Form of Bayes' Rule

P(YX)=αP(XY)P(Y)P(Y|X) = \alpha \cdot P(X|Y) \cdot P(Y)

  • α\alpha is the normalization constant needed to make the entries in P(YX)P(Y|X) sum to 1 for each value of X.
  • conditional independency
    • two variables X and Y are conditionally independent given a third variable Z if the value of X provides no information about Y once Z is known.
    • P(X,YZ)=P(XZ)P(YZ)P(X, Y|Z) = P(X|Z) \cdot P(Y|Z)

Constructing Bayesian Network

  • constructing a Bayesian network in which resulting joint probability distributions are a good representation of the agent's knowledge.
    1. P(x1,...,xn)=i=1nP(xiparents(xi))P(x_1, ..., x_n) = \prod_{i=1}^{n} P(x_i | \text{parents}(x_i))
    1. rewrite the entries in the joint probability distribution in terms of conditional probabilities using the product rule.
    • P(x1...xn)=P(xnxn1...x1)P(xn1...x1)P(x_1 \land ... \land x_n) = P(x_n | x_{n-1} \land ... \land x_{1}) \cdot P(x_{n-1} \land ... \land x_{1})
    1. repeat step 2 until all variables are included.
    • P(x1,...,xn)=P(xnxn1...x1)P(xn1xn2...x1)P(x2x1)P(x1)P(x_1, ..., x_n) = P(x_n | x_{n-1} \land ... \land x_{1}) \cdot P(x_{n-1} | x_{n-2} \land ... \land x_{1}) \cdots P(x_2 | x_1) \cdot P(x_1)

Knowledge Representation

  • The agent's knowledge can at best provide only a degree of belief in the relevant logical sentences.
  • A probabilistic agent may have a numerical degree of belief between 0 and 1.

Bayesian Network

  • a directed graph
  • each node repressents a random variable and edge with an arrow from X to Y represents that X is a parent of Y.
  • each node is associated with a probability distribution
  • if a node Y has parents X, the Y node has a conditional probability distribution with respect to its parents X, i.e. P(Yparents(Y))P(Y| \text{parents}(Y))
  • if a node has no any parents, it has an unconditional probability distribution.

Burglary Example

  • a new burglar alarm installed at home.
  • it is reliable at detecting a burglary, but also responds on occasion to minor earthquakes
  • you have two neighbors, John and Mary who promised to call you at work when they hear the alarm.
  • given the evidence of who has or has not called, estimate the probability of a burglary.
  • Knowledge representation as the probability of each event that can occur in this case.

P(eventj)=number of ccurrences of event jtotal number of experimentsP(\text{event}_{j}) = \frac{\text{number of ccurrences of event }_j}{\text{total number of experiments}}

  • calulate the full joint probability distribution for all relevant variables.
    • specifying probabilities for all possible events one by one is unnatural and tedious.

P(B,E,A,J,M)=i5P(Xiparents(Xi))=P(B)P(E)P(AB,E)P(JA)P(MA)P(B, E, A, J, M) = \prod_{i}^{5} P(X_i | \text{parents}(X_i)) = P(B) \cdot P(E) \cdot P(A|B,E) \cdot P(J|A) \cdot P(M|A)

P(M,J,A,E,B)=P(MJ,A,E,B)P(J,A,E,B)P(JA,E,B)P(A,E,B)P(AE,B)P(E,B)P(EB)P(B)=P(MJ,A,E,B)P(JA,E,B)P(AE,B)P(EB)P(B)=P(MJ,A,E,B)P(JA,E,B)P(AE,B)P(E)P(B)i.e. P(EB)=P(E) (Earthquake is independent of Burglary)=P(MA)P(JA)P(AE,B)P(E)P(B)i.e. P(JA,E,B)=P(JA) (JohnCalls depends only on Alarm)i.e. P(MJ,A,E,B)=P(MA) (MaryCalls depends only on Alarm)\begin{align*} & P(M, J, A, E, B) = P(M | J, A, E, B) \cdot P(J, A, E, B) \\ & P(J | A, E, B) \cdot P(A, E, B) \\ & \quad P(A | E, B) \cdot P(E, B) \\ & \quad \quad P(E | B) \cdot P(B) \\ & = P(M|J,A,E,B) \cdot P(J|A,E,B) \cdot P(A|E,B) \cdot P(E|B) \cdot P(B) \\ & = P(M|J,A,E,B) \cdot P(J|A,E,B) \cdot P(A|E,B) \cdot P(E) \cdot P(B) \\ & \quad i.e.\space P(E|B) = P(E) \text{ (Earthquake is independent of Burglary)} \\ & = P(M|A) \cdot P(J|A) \cdot P(A|E,B) \cdot P(E) \cdot P(B) \\ & \quad i.e.\space P(J|A,E,B) = P(J|A) \text{ (JohnCalls depends only on Alarm)} \\ & \quad i.e.\space P(M|J,A,E,B) = P(M|A) \text{ (MaryCalls depends only on Alarm)} \end{align*}

P(M,J,A,E,B)=P(B)P(E)P(AB,E)P(JA)P(MA) P(M, J, A, E, B) = P(B) \cdot P(E) \cdot P(A|B,E) \cdot P(J|A) \cdot P(M|A)

Probabilistic reasnoing

  • to explain how to use network models to reason under uncertainty.
  • Exact inference: calculate the exact posterior probabilities.
  • Approximate inference: Stochastic approximation techniques such as likelihood weighting and Markov Chain Monte Carlo
    • can give reasonable estimates of the true posterior probabilities in a network
    • can cope with much larger networks than exact inference algorithms.
  • Static worlds: Each random variable has a single fixed value.
  • Dynamic worlds: view the world as a series of snapshots, or time slices.
    • Each of snapshots contains a set of random variables, some observed and some hidden.
    • The values of these variables can change over time.
    • With the assumption of state sequences are Markov, the future is independent of the past given the present.

Inference in Bayesian Networks

  • Computing the posterior probability distribution of a set of query variables, given some observed events, called some assignment of values to a set of evidence variables.
  • XX: query variable(s)
    • is it raining?
  • EE: the set of evidence variables with observed values E1,...,EmE_{1}, ..., E_{m}
    • weather report says it is cloudy.
    • weather report says there is high humidity.
  • ee: a particular observed event.
    • E1=cloudyE_{1} = \text{cloudy}
    • E2=high humidityE_{2} = \text{high humidity}
  • YY: the non-evidence, non-query variables.
    • the other variables in the network.
  • X={X}EYX = \{X\} \cup E \cup Y
    • the complete set of variables in the network.
  • P(Xe)P(X|e): the posterior probability distribution of the query variable(s) given the evidence.
  • P(Xe)=αyP(X,e,y)P(X|e) = \alpha \sum_{y} P(X, e, y)

Exact inference in Bayesian Networks

  • One exact inference method is called by inference by enumeration.
  • Observed some event A=trueA = true and B=trueB = true, want to compute the posterior probability distribution P(CA,B)P(C|A, B).
    • Use the full conditional probability table
    • Use the Bayesian Network
  • It calculates all the required probability components in each branch in the evaluation tree, and this results in repetitive calculations.
  • To avoid this, we can use variable elimination algorithm and clustering algorithms.
  • Calculation of probability distribution for a query variable is complexity exponential.
  • Enumeration method can become intractable in large, multiple connected networks. -> use approximate inference methods.

For any Bayesian network with given nodes, X={X1,X2,...,Xn}X = \{X_1, X_2, ..., X_n\}, the joint probability distribution is given by: P(X)=P(X1X2Xn)=i=1nP(Xiparents(Xi))P(X) = P(X_1 \land X_2 \land \ldots \land X_n) = \prod_{i=1}^{n} P(X_i | \text{parents}(X_i))

Using the Bayesian network, we can compute the conditional probability.

P(Bevent)=αeaP(B,E,A,j,m) P(B | event) = \alpha \sum_{e}\sum_{a} P(B, E, A, j, m)

  • i=1,P(X1parents(X1))=P(B)i=1, P(X_1 | \text{parents}(X_1)) = P(B)
  • i=2,P(X2parents(X2))=P(E)i=2, P(X_2 | \text{parents}(X_2)) = P(E)
  • i=3,P(X3parents(X3))=P(AB,E)i=3, P(X_3 | \text{parents}(X_3)) = P(A|B,E)
  • i=4,P(X4parents(X4))=P(jA)i=4, P(X_4 | \text{parents}(X_4)) = P(j|A)
  • i=5,P(X5parents(X5))=P(mA)i=5, P(X_5 | \text{parents}(X_5)) = P(m|A)
  • P(B,E,A,j,m)=P(B)P(E)P(AB,E)P(jA)P(mA)P(B, E, A, j, m) = P(B) \cdot P(E) \cdot P(A|B,E) \cdot P(j|A) \cdot P(m|A)
P(Bevent)=αeaP(B,E,A,j,m)=αeaP(B)P(E)P(AB,E)P(jA)P(mA)=αP(B)eP(E)aP(AB,E)P(jA)P(mA)\begin{align*} & P(B | event) = \alpha \sum_{e}\sum_{a} P(B, E, A, j, m) \\ & = \alpha \sum_{e}\sum_{a} P(B) \cdot P(E) \cdot P(A|B,E) \cdot P(j|A) \cdot P(m|A) \\ & = \alpha \cdot P(B) \sum_{e} P(E) \sum_{a} P(A|B,E) \cdot P(j|A) \cdot P(m|A) \end{align*}

Berglary example calculation:

P(bevent)=αP(b)eP(E)aP(Ab,E)P(jA)P(mA)=αP(b)[P(e)aP(Ab,e)P(jA)P(mA)+P(¬e)aP(Ab,¬e)P(jA)P(mA)]aP(A,b,e)P(jA)P(mA)=P(a,b,e)P(ja)P(ma)+P(¬a,b,e)P(j¬a)P(m¬a)aP(Ab,¬e)P(jA)P(mA)=P(a,b,¬e)P(ja)P(ma)+P(¬a,b,¬e)P(j¬a)P(m¬a)=αP(b)[P(e)(P(ab,e)P(ja)P(ma)+P(¬ab,e)P(j¬a)P(m¬a))+P(¬e)(P(ab,¬e)P(ja)P(ma)+P(¬ab,¬e)P(j¬a)P(m¬a))]\begin{align*} & P(b | event) = \alpha P(b) \sum_{e} P(E) \sum_{a} P(A|b,E) \cdot P(j|A) \cdot P(m|A) \\ & = \alpha P(b) \bigg[ P(e) \sum_{a} P(A|b,e) \cdot P(j|A) \cdot P(m|A) + P(\neg e) \sum_{a} P(A|b,\neg e) \cdot P(j|A) \cdot P(m|A) \bigg] \\ & \quad \sum_{a} P(A,b,e) \cdot P(j|A) \cdot P(m|A) \\ & \quad = P(a,b,e) \cdot P(j|a) \cdot P(m|a) + P(\neg a,b,e) \cdot P(j|\neg a) \cdot P(m|\neg a) \\ & \quad \sum_{a} P(A|b,\neg e) \cdot P(j|A) \cdot P(m|A) \\ & \quad = P(a,b,\neg e) \cdot P(j|a) \cdot P(m|a) + P(\neg a,b,\neg e) \cdot P(j|\neg a) \cdot P(m|\neg a) \\ & = \alpha P(b) \bigg[ P(e) \big( P(a|b,e) \cdot P(j|a) \cdot P(m|a) + P(\neg a|b,e) \cdot P(j|\neg a) \cdot P(m|\neg a) \big) \\ & \quad \quad + P(\neg e) \big( P(a|b,\neg e) \cdot P(j|a) \cdot P(m|a) + P(\neg a|b,\neg e) \cdot P(j|\neg a) \cdot P(m|\neg a) \big) \bigg] \\ \end{align*}

IAI +010

· 약 11분

Knowledge representation

  • Intelligent agents need knowledge about the world in order to reach good decisions.
  • Declarative knowledge is represented in a form of sentences in a knowledge representation language and stored in a knowledge base.
  • Knowledge base is used by an inference engine to infer a new sentence which will be used for the agent to decide what action to take next.
  • Formal languages are defined by its grammar and semantic rules
    • grammar: defines the syntax of legal sentences
    • semantic rules: defines the meaning.
  • Common knowledge representation formalisms:
    • Propositional logic
    • First-order logic
    • Fuzzy logic
    • Semantic networks
    • Ontologies

Propositional logic

  • a declarative language in which can handle propositions that are known true, known false, or completely (unknown true or false)
  • a BNK (Backus-Naur Form) grammer of sentences in propositional logic.
    • ¬\neg = NOT
    • \land = AND
    • \lor = OR
    •     \implies = IMPLIES
    •     \iff = IF AND ONLY IF
  • Negation: a sentence using ¬\neg is called negation
  • Literal: either an atomic sentence or a negated atomic sentence
  • Conjunction: two sentences connected by \land. Eac hof them is called conjunct
  • Disjunction: two sentences connected by \lor. Each of them is called disjunct
  • Implication: two sentences connected by     \implies. P    QP \implies Q. P is called premise or antecedent and Q is the conclusion or consequent.
    • An implication is if-then statement.

Truth tables

PQ¬\negPP \land QP \lor QP     \implies QP     \iff Q
TTFTTTT
TFFFTFF
FTTFTTF
FFTFFTT

FOL, First-order logic

  • a declarative language
  • Syntax of FOL builds on that of propositional logic
  • terms to represent objects, universal quantifier, and existential quantifier
  • a model in FOL must provide the information required to determine the truth value of every atomic sentence in the language.
  • xP(x)\forall x P(x) means for all x, P(x) is true
  • xP(x)\exists x P(x) means there exists an x such that P(x) is true

Logical equivalence

logical equivalencemeaning
(αβ)(βα)(\alpha \land \beta) \equiv (\beta \land \alpha)Commutativity of \land
(αβ)(βα)(\alpha \lor \beta) \equiv (\beta \lor \alpha)Commutativity of \lor
(α(βγ))((αβ)γ)(\alpha \land (\beta \land \gamma)) \equiv ((\alpha \land \beta) \land \gamma)Associativity of \land
(α(βγ))((αβ)γ)(\alpha \lor (\beta \lor \gamma)) \equiv ((\alpha \lor \beta) \lor \gamma)Associativity of \lor
¬(¬α)α\neg(\neg \alpha) \equiv \alphaDouble negation elimination
(α    β)(¬αβ)(\alpha \implies \beta) \equiv (\neg \alpha \lor \beta)Implication elimination
(α    β)(¬β    ¬α)(\alpha \implies \beta) \equiv (\neg \beta \implies \neg \alpha)Contraposition
(α    β)((α    β)(β    α))\alpha \iff \beta) \equiv ((\alpha \implies \beta) \land (\beta \implies \alpha))Biconditional elimination
¬(αβ)(¬α¬β)\neg(\alpha \land \beta) \equiv (\neg \alpha \lor \neg \beta)De Morgan's law for \land
¬(αβ)(¬α¬β)\neg(\alpha \lor \beta) \equiv (\neg \alpha \land \neg \beta)De Morgan's law for \lor
(α(βγ))((αβ)(αγ))(\alpha \land (\beta \lor \gamma)) \equiv ((\alpha \land \beta) \lor (\alpha \land \gamma))Distributivity of \land over \lor
(α(βγ))((αβ)(αγ))(\alpha \lor (\beta \land \gamma)) \equiv ((\alpha \lor \beta) \land (\alpha \lor \gamma))Distributivity of \lor over \land

x¬P(x)¬xP(x)\forall x \neg P(x) \equiv \neg \exists x P(x)

  • For all x, not P(x) is logically equivalent to "it is not the case that there exists an x such that P(x) is true."

Reasoning

Deductive reasoning

  • a process of reasoning from one or more statements (premises) to reach a logical conclusion.
  • first premise, second premise, therefore conclusion.

Inductive reasoning

  • the process of resoning from specific observations to borader generalizations and theories.
  • also described as a method where one's experiences and observations are synthesized to cmoe up with a general truth.
  • premises and then conclusion

Inference

  • steps in reasoning
  • moves from premises to logical consequences
  • In AI Context, inference is to derive new logical sentences (as the conclusion) from existing logical sentences (as premises).
  • researchers develop automated inference systems to emulate human inference.

Inference Problem

KBγKB \models \gamma

  • KB, Knowledge Base, is a set of propositions that represent what is known about the world.
  • γ\gamma, Query sentence, is the target conclusion which needs to be confirmed based on the given KB.
  • where \models denotes the relation of logical entailment between KB and the sentence γ\gamma, reading as "KB entails γ\gamma" or "if KB is true, then γ\gamma must also be true".
  • αβ    M(α)M(β)\alpha \models \beta \iff M(\alpha) \subseteq M(\beta)
    • M(α)M(\alpha) is the set of all models that satisfy α\alpha.
    • M(β)M(\beta) is the set of all models that satisfy β\beta.
    • The statement αβ\alpha \models \beta means that in every model where α\alpha is true, β\beta is also true. In other words, if α\alpha holds, then β\beta must also hold.
  • Model Checking: enumerates all possible models and checks if the entailment holds in each model.
    • Knowledge base will be used to draw inferences.
    • Query sentence, γ\gamma, is needed to be checked whether it is entailed by the KB.
    • Symbols, a list of all symbols (or atomic propositions) used in the problem context.
    • Models, assignments of truth and false values to those identified symbols.
  • Model Checking Procedure
    1. Identify the propositional symbols involved in the KB sentences and query sentence
    2. Enumerate all possible models by assigning truth values to the identified symbols.
    3. Evaluate the KB sentence in each model and fined the models in which KB is true.
    4. Evaluate the query sentence in the models from step 3 and check if query sentence is true in these models.
    5. Conclude that the KB entails the query sentence γ\gamma if and only if the query sentence is true in all models where the KB is true.

Inference Problem Example

  • Obersvation P: "It is raining"
  • Query sentence Q: "The ground is wet"
  • Knowledge: P    QP \implies Q (If it is raining, then the ground is wet)
  • Knowledge Base KB: P(P    Q)P \land (P \implies Q)
  • Inference Problem: KBQKB \models Q from KB=T, get Q=T
  • Proof:
    1. From the 4 possible models, only Model 1 makes KB true.
    2. M(KB) = model 1
    3. Model 1 also makes Q true.
    4. M(Q) = model 1
    5. M(KB)M(Q)M(KB) \subseteq M(Q), therefore KBQKB \models Q
ModelPQP     \implies QP \land (P     \implies Q)
1TTTT
2TFFF
3FTTF
4FFTF

Inference by theorem proving

to apply rules of inference directly to the sentences in the KB to construct a proof of the desired sentence without consulting models.

  • Proof: a chain of consequences that leads to the desired goal.
  • KB will be used to draw inferences.
  • Desired sentence is needed to be checked whether it is entailed by the KB.
  • The rules of inference are the approved logical equivalences and rules.
AspectModel CheckingTheorem Proving
방식참/거짓으로 실제 계산논리 규칙을 사용해 증명
예시진리표Modus Ponens, Resolution
장점단순함복잡한 문장도 처리 가능
단점계산 많음규칙 익혀야 함

Modus Ponens Rule

α    β,αβ\frac{\alpha \implies \beta, \alpha}{\therefore \beta}

  • whenever any sentences of the form α    β\alpha \implies \beta and α\alpha are given, then the sentence β\beta can be inferred.

Add-Elimination Rule

αβα\frac{\alpha \land \beta}{\therefore \alpha}

  • from a conjunction, one of the conjuncts can be inferred.

Terms to contradiction and resolution

CNF
┌────────────────────────────────────────────────┐
(P ∨ Q)(¬Q ∨ R)
│ ┌──────────────────────┐ ┌───────────────┐ │
│ │ Clause 1 │ │ Clause 2 │ │
│ │ (P ∨ Q) │ │ (¬Q ∨ R) │ │
│ │ ┌───────┬───────┐ │ │ ┌──────┬─────┐ │
│ │ │ P │ Q │ │ │ │ ¬Q │ R │ │
│ │ └───────┴───────┘ │ │ └──────┴─────┘ │
│ └──────────────────────┘ └───────────────┘ │
└────────────────────────────────────────────────┘
  • Literal: an atomic sentence or its negation
  • Complementary literals: a literal and its negation are complementary literals
  • Clause: an expression formed from a collection of finite literals
    • In most l1l2...lnl_1 \lor l_2 \lor ... \lor l_n cases, a clause is a disjunction of finite literals.
    • written as the symbol lil_i.
  • Conjunctive Normal Form: a sentence expressed as a conjunction of clauses.
  • Satisfiability: a sentence is satisfiable if it is true in, or satisfied by, some models.

Propositional satisfiability (SAT) problem

  • to determine the satisfiability of sentences in propositional logic.
  • if there exists a model that satisfies a given logical sentence, then the sentence is satisfiable.
  • Satisfiability can be checked by enumerating the possible models until one is found that satisfies the sentence, or by resolving complementary literals until an empty clause is derived.
  • many problems in CS are really SAT problems.

Resolution rule

Unit resolution rule

(l1l2...li1lili+1...lk, m)l1...li1li+1...lk \frac{(l_1 \lor l_2 \lor ... \lor l_{i-1} \lor l_i \lor l_{i+1} \lor ... \lor l_k,\space m)}{l_1 \lor ... \lor l_{i-1} \lor l_{i+1} \lor ... \lor l_k}

  • where ll is a literal and lil_i and mm are complementary literals.
  • Unit resolution rule takes a clause which is a disjunction of literals, and a literal and produces a new clause as the resolvent.

Full resolution rule

(l1l2...li1lili+1...lk, m1m2...mj1mjmj+1...mn)(l1...li1li+1...lkm1...mj1mj+1...mn) \frac{(l_1 \lor l_2 \lor ... \lor l_{i-1} \lor l_i \lor l_{i+1} \lor ... \lor l_k,\space m_1 \lor m_2 \lor ... \lor m_{j-1} \lor m_j \lor m_{j+1} \lor ... \lor m_n)}{(l_1 \lor ... \lor l_{i-1} \lor l_{i+1} \lor ... \lor l_k \lor m_1 \lor ... \lor m_{j-1} \lor m_{j+1} \lor ... \lor m_n)}

  • where ll is a literal and lil_i and mjm_j are complementary literals.
  • Full resolution rule takes two clauses which are disjuctions of literals and produces a new clause containing all the literals of the two original clauses except for the two complementary literals.

Inference via proof by contradiction through resolution

αβ \alpha \models \beta

  • to prove that αβ\alpha \models \beta, we can show that the sentence α¬β\alpha \land \neg \beta is unsatisfiable.
  • by deriving an empty clause ()() from α¬β\alpha \land \neg \beta using the resolution rule.
  • In order to derive an empty clause, we use resolution which is a process to resolve complementary literals until to find an empty clause.
    1. R1: P (observation)
    2. R2: P     \implies Q (knowledge, raining implies ground is wet)
    3. KB=P(P    Q)KB = P \land (P \implies Q)
    4. Query sentence: Q
    5. Inference problem: KBQKB \models Q, i.e. from KB=TKB=T, get Q=TQ=T
    6. Proof:
    • Let (P(P    Q))¬Q(P \land (P \implies Q)) \land \neg Q valid
    • Convert (P(P    Q))¬Q(P \land (P \implies Q)) \land \neg Q to CNF
    • Apply implication elimination rule, one has (P(¬PQ))¬Q(P \land (\neg P \lor Q)) \land \neg Q
      • C1:PC_1: P
      • C2:¬PQC_2: \neg P \lor Q
      • C3:¬QC_3: \neg Q
    • Apply unit resolution rule to C1C_1 and C2C_2, resolve PP and ¬P\neg P, one has C4:QC_4: Q
    • Apply unit resolution rule to C3C_3 and C4C_4, resolve QQ and ¬Q\neg Q, one has C5:()C_5: ()

Inference in FOL

  • convert the first-order inference to propositional inference using the ruls for quantifiers.
    • universal instantiation (UI)
    • existential instantiation (EI)
  • do inference in propositional logic using the methods about inference in propositional logic.
  • this approach to first-order logic inference via propositionalization is complete, means any entailed sentences can be proved.
    • in most cases, this approach works
    • in some cases, it is slow and only useful when the domain is small.
  • get rid of quantifiers by instantiating them with specific constants or variables.

Universal instantiation (UI)

v αSubst({v/g},α)\frac{\forall v \space \alpha}{\text{Subst}(\{v/g\},\alpha)}

  • to convert sentences with universal quantifiers to sentences without universal quantifiers.
  • it can infer any sentence obtained by substituting a ground term for the universally quantified variable.
  • Subst({θ,α})Subst(\{\theta, \alpha\}) denotes the result of applying the subsitution θ\theta to the sentence α\alpha and gg is a ground term or a constant symbol.
  • universal instantiation can be applied many times to produce may different consequences.
  • xLoves(x,Mary)\forall x Loves(x, Mary)
    • Loves(John,Mary)Loves(John, Mary)
    • Loves(Sue,Mary)Loves(Sue, Mary)
    • Loves(Bill,Mary)Loves(Bill, Mary)
    • ...

Existential instantiation (EI)

  • to convert sentences with existential quantifiers to sentences without existential quantifiers.
  • the quantified variable can be replaced by a single new constant symbol.
  • for any sentence α\alpha, variable vv, and constant symbol kk not appear elsewhere in the knowldge base, the following rule is sound: v αSubst({v/k},α)\frac{\exists v \space \alpha}{\text{Subst}(\{v/k\},\alpha)}
  • Subst{θ,α}\text{Subst}\{\theta, \alpha\} denotes the result of applying the subsitiution θ=v/k\theta={v/k} in the sentence α\alpha and kk is ground term or a new constant symbol.
  • Existential instantiation can be applied only once, and then the existentially quantified sentence can be discarded.
  • xLoves(x,Mary)\exists x Loves(x, Mary)
    • Loves(K,Mary)Loves(K, Mary)
    • where KK is a new constant symbol not appear elsewhere in the knowledge base

CNF, Conjunctive Normal Form

StepRule NameDescriptionExample
1제거    \implies,     \iff 없애기P    QP \implies Q¬PQ\neg P \lor Q
2De Morgan's¬\lnot 분배¬(PQ)(¬P¬Q)\lnot(P \lor Q) \rightarrow (\lnot P \land \lnot Q)
3Double negation이중부정 제거¬(¬P)P\lnot(\lnot P) \rightarrow P
4Distribution\lor over \land 분배(P(QR))(P \lor (Q \land R))(PQ)(PR)(P \lor Q) \land (P \lor R)
5And-Elimination\land 분리(AB)A,B(A \land B) \rightarrow A, B 따로

IAI +009

· 약 7분

Generative AI

  • Gen AI refers to a category of AI models designed to generte new content, synt99hetic data that resembles a given dataset.
  • Gen AI models create new content, including text, images, audio, and video.

Historcal context of Gen AI

  • 1980s: The development of statistical approaches to AI emerged, focusing on probabilistic models
  • 1990s: Hidden Markov Models (HMMs) became popular in speech recognition and sequence generation tasks, making a shift towards using statistical methods in generative processes.
  • 1990s-2000s: The resurgence of Neural Networks, with the introduction of deep learning techniques. However, hardware and data limitations hampered progress.
  • 2010s: The advent of deep learning algorithms, especially convolutional neural networks (CNNs) and recurrent neural networks (RNNs), transformed the landscape of AI.
  • 2013: VAEs were introduced, providing a probabilistic approach to data generation, allowing for smooth latent space interpolation and structure.
  • 2014: GANs, a novel framework where two neural networks (a generator and a discriminator) are trained simultaneously, allowing for the generation of highly realistic images and other data types.
  • 2015-2020: Generative models began to find applications beyond image synthesis, including text generation, music composition, and even video generation. OpenAI's GPT-2 showcased the potential of transformer-based models for generating coherent text.
  • 2020s: The introduction of larger and more capable models like OpenAI's GPT-3 and subsequent iterations revolutionized natural language processing.
    • DALL-E and Stable Diffusion pushed the boundaries of image generation, allowing users to create images from textual descriptions. This sparked creative exploration and practical applications in marketing, art, and design.
  • Present: The integration of multiple data modalities (text, image, audio) led to the development of models like CLIP and GPT-4, which can understand and generate content across different formats, enhancing the versatility of generative AI.

Variational Autoencoders, VAEs

  • VAEs are from the probabilistic approach to build a generative AI model
  • VAE learns a latent distribution instead of a fixed latent presentation, allowing for the generation of new samples.
  • Latent space
    • sample from a Gaussian distribution
    • μ\mu and σ\sigma are learned during training
    • zz is a new sample from the latent space
  • qϕ(zx)q_\phi(z|x) is the encoder that maps input data xx to a distribution over the latent space zz
  • pθ(xz)p_\theta(x|z) is the decoder that maps a latent variable zz back to the data space xx
  • During training, the model optimizes the reconstruction loss and a regularization term (KL divergence) to ensure the learned latent distribution is close to a prior distribution (usually a standard normal distribution).
  • Latent Representation
    • Dimensionality Reduction: latent space is typically lower-dimensional than the input space
    • Smoothness: In a well-structured latent space, similar inputs will be represented by nearby points.
    • Generative Capabilities: Once trained, can sample from the latent space to generate new data that resembles the training set.
    • Regularization: The VAE incorporates a regularization term in its loss function (KL divergence), which encourages the lerned latent distribution to be close to a standard normal distribution.
  • Decoder
    • The decoder takes the sampled latent representation zz and reconstructs the original input data.
    • The goal is to make the reconstructed data as close as possible to the original input.

VAE Examples

Face Generation Example

  • Trains a VAE on face photos.
  • Latent space learns meaningful features (e.g. hair color, glasses, smile)
  • By sampling and interpolating in latent space, can generate new faces or smoothly morph one face into another.

Anomaly Detection Example

  • Train a VAE on normal sensor data.
  • When fed unusual data, reconstruction error will be high.
  • Use this for fault detection in machines, fraud detction, etc.

Autoencoder

  • Encoder
    • Input Layer: takes in the original data
    • Hidden Layer: These layers progrssively reduce the dimensionality of the input through operations like linear transformations and non-linear activations. (e.g., ReLU-Rectified Linear Unit)
    • Ouput Layer: Produces the final latent represntation. Sometimes Sigmoid or Tanh are used depending on the nature of the input data.
    • Learns to capture meaningful patterns in the data by minimizaing reconstruction loss, which mesures the difference between the original input and the reconstructed output produced by the decoder.
  • Decoder
    • typically mirrors the structure of the encoder but in reverse.
    • During trainng, Mean squared Error (MSE) for continuous data or Binary Cross-Entropy (BCE) for binary data are commonly used to quantify the reconstruction loss.

Image Reconstuction Example

  • input: 28x28 pixel
  • encoder: compresses it to just 16 numbers (latent vector)
  • decoder: expands those 16 numbers back into a 28x28 image
  • result: the reconstructed digit looks similar to the original but not in the training set.

Denoising Images

  • input: a noisy photo of a cat
  • encoder: learns to ignore the noise and compress meaningful features.
  • decoder: rebuilds the image without the noise.
  • output: a clearer cat image.

Loss function

  • Reconstruction loss: Ensure output similar to input
  • KL divergence loss: Push the learned distribution to be close to a standard normal distribution. (can sample new data)

Generative Adversarial Network, GAN

  • Generator: Learns to generate fake (generated) data that resembles data distribution.
  • Discriminator: Learns to distinguish between real data and data generated by the Generator.
  • Steps
    1. Generator creates fake data samples fro mrandom input (noise).
    2. Discriminator evaludates these samples together with real data samples.
    3. Discriminator outputs probabilities indicatcing whether each sample is real or fake.
    4. Based on the prediction of the Discriminator, the Generator and Discriminator are updated using specific loss functions.
  • Applications
    • Labels to Street Scenes
    • Labels to Facade
    • BW to Color
    • Aerial to Map
    • Day to Nihght
    • Edges to Photo
    • Text-to-Image Synthesis

Autoregressive Models

  • a class of generative models where the current value of a time series is expressed as a linear function of its own past values plus some noise.
  • foundational in generative AI and widely used in generative AI, particularly for tasks like text generation, speech synthesis.
  • Examples
    • GPT series: state-of-the-art autoregressive language models used for text generation.
    • WaveNet: an autoregressive model for generating high-quality audio by predicting each sample conditioned on previous samples.

Training Autoregressive Models

  • Pretraining
    • the model sees sequences of tokens and learns to guess the next token.
    • objective: minimize cross-entropy loss between its guess and the true next token.
    • this teaches grammar, facts, reasoning patterns, and style from raw text.
  • Supervised fine-tuning (optional)
    • smaller curated datasets (questions to answer, instructions to respond) teach it to follow directions.
  • Reinforcement learning from Human Feedback (optional)
    • Humans rank multiple model outputs; a reward model is trained on those rankings; GPT is then optimized to produce higher-reward responses (safer, more helpful, less toxic)
  • Text to Tokens via a tokenizer (e.g. BPE, Byte Pair Encoding)
  • Each token becomes a vector (Embedding)
  • Positional encodings inject order information.
  • Transformer layer
    • self-attention: each token looks at all previous token (causal mask) and decides which ones matter, computing weighted combinations.
    • Multiple heads let it attend to different patterns (syntax, long-range links, etc.).
    • feed-forward network: a nonlinear MLP refines each token's representation.
    • residual connections & layer norm stabilize tranining.

Interfence

  • a prompt (the context)
  • GPT computes probabiliites for the next token.
  • A decoding stragety samples a token
    • Decoding knobs
      • Greedy: take the top token
      • Top-K
      • Nucleus (Top-p) sampling (limit to likely options)
      • Temperature scales randomness, controls how random the next token choice is
        • lower = more determistic, higher = more creative
  • append the token and repeat autoregressive generation.

IAI +008

· 약 13분

NLP

Tokenization

The process of dividing a text into a sequence of words.

  • Different models uses different tokenization methods.
BERT TOKENBERT IDGPT TOKENGPT ID
my2026My3666
grandson7631Ġgrandson31845
loved3866Ġloved10140
it2009Ġit340
!999Ġ!0
so2061Ġso1406
much2172Ġmuch881
fun4569Ġfun1257
!999Ġ!0

Corpus

a large and structured collection of text

  • a corpus typically consists of at least a million words of text
  • at least tens of thousands of distinct vocabulary words.

Text classification

  • the process of categorizing text into organized groups.
  • text classifiers can automatically analyze text and then assign a set of predefined tags or categories based on its content.
  • machine learning approach
    • Features
      • BoW
      • TF-IDF
    • Features + classifier
      • Logistic regression
      • SVM
      • Naive Bayes
  • Deep learning approach
    • Neural models: CNNs (capture local n-grams)
    • RNNs, LSTMs (sequence-aware)
    • Transformers (e.g. BERT)
      • Contextual embedding
  • Modern enhancements
    • Embedding + classifier: Pretrained embeddings (Word2Vec) + classifier
    • Fine-tuned transformers: BERT fine-tuned

Bag of Words, BoW

  • it represents a text as a bag of its words
  • we can understand the meaning of a document from its content (words) their multiplicity (frequency, number of occurrences)
  • mainly used as a tool of feature extraction.
  • limitations
    • ignores syntax and the context
    • disregarding grammar
    • discards word order - words are independent of each other
    • considers only the meanings of the words in the sentence
# examples

Example1 = "He likes to watch movies. Mary likes movies too."
Example2 = "Mary also likes to watch football games."

# Vocabulary
Vocab = {"He", "likes", "to", "watch", "movies", "Mary", "also", "football", "games"}

# BoW representation
BoW1 = {He: 1, likes: 2, to: 1, watch: 1, movies: 2, Mary: 1, also: 0, football: 0, games: 0}
BoW2 = {He: 0, likes: 1, to: 1, watch: 1, movies: 0, Mary: 1, also: 1, football: 1, games: 1}

[[1,2,1,1,2,1,0,0,0], [0,1,1,1,0,1,1,1,1]]

TF-IDF

Term Frequency - Inverse Document Frequency

  • Model based on the statistics of word counts.
  • idea is that key terms and important ideas are likely to repeat.
  • includes a scoring function that measure the relevance of a document to a query.
    • the function takes a document with a corpus and a query as input and returns a numeric score.
    • the doucments that have the highest scores are considered as the most relevant documents.
  • Term Frequency
    • TF(qi,dj,D)TF(q_i, d_j, D)
  • Inverse Document Frequency
    • IDF(qi,D)=logNDF(qi,D)IDF(q_i, D) = \log\frac{N}{DF(q_i, D)}
      • DF(qi,D)DF(q_i, D): number of documents in the corpus DD that contain the term qiq_i
  • N=DN = |D|: total number of documents in the corpus DD
  • TF-IDF score
    • TFIDF(qi,dj,D)=TF(qi,dj)IDF(qi,D)TFIDF(q_i, d_j, D) = TF(q_i, d_j) \cdot IDF(q_i, D)

Examples of TF-IDF

  • Document 1: "John likes to watch movies."
  • Document 2: "Mary likes movies too."
  • Document 3: "John also likes football."
StepTermDocumentNumber of times term t
appears in document d
Total number of terms
in document d
TF(t, d)Total number of documentsNumber of documents
containing the term t
IDF(t)IDF(t) (base 10)TF-IDF(t,d)TF-IDF(t,d) (base 10)
1t=likesd1150.2330000
t=likesd2140.2500
t=likesd3140.2500
2t=watchd1150.2311.0986122890.4771212550.2197224580.095424251
t=watchd204000
t=watchd304000
3t=maryd1050311.0986122890.47712125500
t=maryd2140.250.2746530720.119280314
t=maryd304000
4t=footballd1050311.0986122890.47712125500
t=footballd204000
t=footballd3140.250.2746530720.119280314

Naive Bayes

  • naive refers to a very strong simplifying assumption about the features
  • Conditional independence assumption
    • Given class YY, all the features X=X1,X2,...,XnX = {X_1, X_2, ..., X_n} are assumed mutually independent.
  • P(XY)=P(X1,...,XnY)=i=1nP(XiY)P(X | Y) = P(X_1, ..., X_n | Y) = \prod_{i=1}^{n} P(X_i | Y)
  • Bayes rule
    • P(YX)=P(XY)P(Y)P(X)P(Y)P(XY)P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)} \propto P(Y) P(X | Y)
    • P(YX1,...,Xn)P(Y)i=1nP(XiY)P(Y | X_1, ..., X_n) \propto P(Y) \prod_{i=1}^{n} P(X_i | Y)
  • Sentiment Analysis
    • p(Ckx1,...,xn)=1Zp(Ck)i=1np(xiCk)p(C_k | x_1, ..., x_n) = \frac{1}{Z} p(C_k) \prod_{i=1}^{n} p(x_i | C_k)
      • Z=p(x)=kp(Ck)p(xCk)Z = p(x) = \sum_{k} p(C_k) p(x | C_k)
        • scaling factor for normalization of p(Ckx)p(C_k | x)
    • maximum a posteriori (MAP) decision rule
      • y^=arg maxk1,...,Kp(Ck)i=1np(xiCk)\hat{y} = \argmax\limits_{k \in 1, ..., K}p(C_k) \prod_{i=1}^{n} p(x_i | C_k)

Examples of Naive Bayes

P(Classw1:N)=αP(Class)jP(wjClass)P(Class | w_{1:N}) = \alpha \cdot P(Class) \cdot \prod_{j} P(w_j | Class)

  • Analyze the sentiment
    • my grandson loved it
    • x1=myx_1 = \text{my}, x2=grandsonx_2 = \text{grandson}, x3=lovedx_3 = \text{loved}, x4=itx_4 = \text{it}
    • C=positive,negativeC = {positive, negative}
    • y^=arg maxk{positive,negative}p(Ck)p(x1Ck)p(x2Ck)p(x3Ck)p(x4Ck)\hat{y} = \argmax\limits_{k \in \{positive, negative\}} p(C_k) \cdot p(x_1 | C_k) \cdot p(x_2 | C_k) \cdot p(x_3 | C_k) \cdot p(x_4 | C_k)
      • positive case
        • p(positive)=0.49p(positive) = 0.49
        • p(mypositive)=0.30p(my | positive) = 0.30
        • p(grandsonpositive)=0.01p(grandson | positive) = 0.01
        • p(lovedpositive)=0.32p(loved | positive) = 0.32
        • p(itpositive)=0.30p(it | positive) = 0.30
        • p(positivex)p(positive)p(mypositive)p(grandsonpositive)p(lovedpositive)p(itpositive)=0.49×0.30×0.01×0.32×0.30=0.00014256p(positive | x) \propto p(positive) \cdot p(my | positive) \cdot p(grandson | positive) \cdot p(loved | positive) \cdot p(it | positive) = 0.49 \times 0.30 \times 0.01 \times 0.32 \times 0.30 = 0.00014256
      • nagative case
        • p(negative)=0.51p(negative) = 0.51
        • p(mynegative)=0.20p(my | negative) = 0.20
        • p(grandsonnegative)=0.02p(grandson | negative) = 0.02
        • p(lovednegative)=0.08p(loved | negative) = 0.08
        • p(itnegative)=0.4p(it | negative) = 0.4
        • p(negativex)p(negative)p(mynegative)p(grandsonnegative)p(lovednegative)p(itnegative)=0.51×0.20×0.02×0.08×0.4=0.00006528p(negative | x) \propto p(negative) \cdot p(my | negative) \cdot p(grandson | negative) \cdot p(loved | negative) \cdot p(it | negative) = 0.51 \times 0.20 \times 0.02 \times 0.08 \times 0.4 = 0.00006528
      • Z=p(x)=kp(Ck)p(xCk)Z = p(x) = \sum_{k} p(C_k) p(x | C_k)
        • =p(positive)p(my,grandson,loved,itpositive)+p(negative)p(my,grandson,loved,itnegative)= p(positive) \cdot p(my, grandson, loved, it | positive) + p(negative) \cdot p(my, grandson, loved, it | negative)
        • =0.00014256+0.00006528=0.00020784= 0.00014256 + 0.00006528 = 0.00020784
        • the probability of xx, obtained by adding up its probabilities under each class.
      • p(positivex)=p(positive)p(my,grandson,loved,itpositive)Z=0.000142560.000207840.6867p(positive | x) = \frac{p(positive) \cdot p(my, grandson, loved, it | positive)}{Z} = \frac{0.00014256}{0.00020784} \approx 0.6867
      • p(negativex)=p(negative)p(my,grandson,loved,itnegative)Z=0.000065280.000207840.3133p(negative | x) = \frac{p(negative) \cdot p(my, grandson, loved, it | negative)}{Z} = \frac{0.00006528}{0.00020784} \approx 0.3133
    • Decision
      • y^=arg maxk{positive,negative}{k=positive:0.6867,k=negative:0.3133}\hat{y} = \argmax\limits_{k \in \{positive, negative\}} \{k = positive : 0.6867, k = negative : 0.3133\}
      • The sentiment is positive.
  • Spam detection
    • Probabilities learned from data:
      • P(Spam)=0.4P(\text{Spam}) = 0.4, P(Ham)=0.6P(\text{Ham}) = 0.6
      • P(freeSpam)=0.8P(\text{free}|\text{Spam}) = 0.8, P(freeHam)=0.1P(\text{free}|\text{Ham}) = 0.1
      • P(winSpam)=0.7P(\text{win}|\text{Spam}) = 0.7, P(winHam)=0.05P(\text{win}|\text{Ham}) = 0.05
    • New document: "free win"
      • Spam score: 0.4×0.8×0.7=0.2240.4 \times 0.8 \times 0.7 = 0.224
      • Ham score: 0.6×0.1×0.05=0.0030.6 \times 0.1 \times 0.05 = 0.003
        → Classified as Spam.

N-gram model

  • N-gram: a sequence of written symbols of length n
    • unigram, bigram, trigram
  • the probability of each symbol is dependent only on the n-1 previous symbols.
  • P(wjw1:j1)=P(wjwjn+1:j1)P(w_j|w_{1:j-1}) = P(w_j|w_{j-n+1:j-1})
  • P(w1:N)=j=1NP(wjw1:j1)j=1NP(wjwjn+1:j1)P(w_1:N) = \prod_{j=1}^{N} P(w_j|w_{1:j-1}) \approx \prod_{j=1}^{N} P(w_j|w_{j-n+1:j-1})

Examples of N-gram

  • W1:NW_{1:N} is "This article is on NLP"
    • N=5
    • Bigram (n-gram with n=2)
P("This article is on NLP")
/* full chain rule, */
= P("This") // j = 1
* P("article" | "This") // j = 2
* P("is" | "This article") // j = 3
* P("on" | "This article is") // j = 4
* P("NLP" | "This article is on") // j = 5
/* bigram approximation */
= P("This") // j =1
* P("article" | "This") // j = 2
* P("is" | "article") // j = 3
* P("on" | "is") // j = 4
* P("NLP" | "on") // j = 5
Step (j)Word (W_j)BigramTrigram
1This(P(This))(P(\text{This}))(P(This))(P(\text{This}))
2article(P(articleThis))(P(\text{article} \mid \text{This}))(P(articleThis))(P(\text{article} \mid \text{This}))
3is(P(isarticle))(P(\text{is} \mid \text{article}))(P(isThis, article))(P(\text{is} \mid \text{This, article}))
4on(P(onis))(P(\text{on} \mid \text{is}))(P(onarticle, is))(P(\text{on} \mid \text{article, is}))
5NLP(P(NLPon))(P(\text{NLP} \mid \text{on}))(P(NLPis, on))(P(\text{NLP} \mid \text{is, on}))

Transformer

  • given a set of input vectors (tokens), attention lets each token look at the others and form a weighted average of them.
  • The weights are data-dependent.
  • The model learns the weights representing which tokens are relevant to which other tokens.

X=Input Embeddings,Q=XWQ,K=XWK,V=XWVX = \text{Input Embeddings},\qquad Q = X W_Q,\quad K = X W_K,\quad V = X W_V

  • Scores: S=QKTdS = \frac{QK^T}{\sqrt{d}}
    • dot products between every query and every key
    • the scale d\sqrt{d} keeps gradients stable
  • Weights: A=Softmax(S)A = Softmax(S)
    • row-wise softmax
    • each row sums to 1
  • Output: Attention(Q,K,V)=AVAttention(Q, K, V) = A V
    • each output token is a weighted sum of the value vectors, with weights determined by the attention scores.(how much to pay attention to each position)
  • Query (Q): What am I looking for?
  • Key (K): What is the label/address of the information I have?
  • Value (V): What is the actual information I want to convey?

Cross-Attention

  • look up relevant information in another sequence.
    • e.g. decoder attending to encoder outputs in translation
  • QQ from the current sequence, KK and VV from the other sequence.
    • Q=XtargetWQ,K=XsourceWK,V=XsourceWVQ = X_{\text{target}} W_Q, \quad K = X_{\text{source}} W_K, \quad V = X_{\text{source}} W_V
  • Cross-attention = Which source tokens are relevant to this target token?
  • Self-attention = Which other tokens in this sequence are relevant to this token?
  • headi=Attention(QWiQ,KWiK,VWiV)head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
  • MultiHead(Q,K,V)=Concat(head1,...,headh)WOMultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

Word representation

One-hot representation

  • represents each word as a vector of length N.
  • where N is the size of the vocabulary.
vocab = {he, is, singing, she, dancing, stage}

she = [0, 0, 0, 1, 0, 0]
is = [0, 1, 0, 0, 0, 0]
singing = [0, 0, 1, 0, 0, 0]
  • limitations
    • incredibly inefficient for large vocabularies
    • not embed any intrinsic meaning of words
    • unable to represent similarity between likely words
    • the representation of documents is sparse vectors
      • can cause challenges in computation

Word Embedding

  • represents individual words as vectors in a low-dimensional continuous space.
  • a distributed representation of a word.
  • generate a unique value for each word while using smaller vectors compared with one-hot encoding.
  • common vector dictionaries
    • word2vec
    • Glove (Global Vectors)

Contextual embedding

  • generates different vectors for the same word based on its context.
  • the word "bank" would have different embeddings in the sentences:
    • "He went to the bank to deposit money."
    • "She sat by the river bank and enjoyed the view."
  • BERT or GPT use deep neural networks to process a sequence of tokens.
  • Each token's embedding is computed by considering the token itself, its position in the sequence and the surrounding tokens, context.
  • captures both semantic and syntactic role in that specific context.

Part of Speech (POS) tagging

  • lexical category or tag that indicates the grammatical role of a word in a sentence.
  • parts of speech allow language models to capture generalizations such as "adjectives often modify nouns" or "verbs often follow subjects".
From the start , it took a person
IN DT NN , PRP VBD DT NN

with great qualities to succeed
IN JJ NNS TO VB
TagDescriptionExample
CCCoordinating conjunctionand, but
CDCardinal numberone, two
DTDeterminerthe, a
EXExistential therethere
FWForeign worddoppelgänger
INPreposition or subordinating conjunctionin, of
JJAdjectivebig, old
JJRAdjective, comparativebigger, older
JJSAdjective, superlativebiggest, oldest
LSList item marker1, 2, One
MDModalcan, will
NNNoun, singular or masscat, tree
NNSNoun, pluralcats, trees
NNPProper noun, singularJohn, London
NNPSProper noun, pluralSmiths, Londons
PDTPredeterminerall, both
POSPossessive ending's, s'
PRPPersonal pronounI, you, he
PRP$Possessive pronounmy, your, his
RBAdverbquickly, very
RBRAdverb, comparativefaster, better
RBSAdverb, superlativefastest, best
RPParticleup, off
SYMSymbol$, %, &
TOtoto
UHInterjectionoh, wow
VBVerb, base formbe, have
VBDVerb, past tensewas, had
VBGVerb, gerund or present participlebeing, having
VBNVerb, past participlebeen, had
VBPVerb, non-3rd person singular presenttalk, have
VBZVerb, 3rd person singular presenttalks, has
WDTWh-determinerwhich, that
WPWh-pronounwho, what
WP$Possessive wh-pronounwhose
WRBWh-adverbwhere, when
#Pound sign#
$Dollar sign$
,Comma,
.Sentence-final punctuation. ! ?

Example of POS tagging

  • Hidden Markov Model (HMM)
    • takes in a temproral sequence of evidence observations
    • predicts the lexical categories
  • Logistic regression
    • build 45 different logistics regression models, one for each part of speech
    • ask each model how probable it is that the example word is a member of that category, given the feature values for that word in its particular context.

Machine translation

  • translate a sentence from a source lnaguae to a trget language.
  • train an MT model: a large corpus of source/target sentence paris and hope that the trained MT model can accurately translate new sentences.
  • want to generate a target language sentence that corresponds to the source language sentence
  • the geneartion of each target word is conditional on the entire source sentence and on all previously generated target words.

Example of machine translation

  • a sequance-to-sequence model
    • use two RNNs (LSTM)
  • attentional sequence-to-sequence model
    • use attentino to create a context-based summarization of the source sentence into a fixed-dimension representation
  • transformer-based model
    • encoder: reads the source sentence and turns it into a rich, contextual set of vectors.
    • decoder: generates the target sentence one token at a time, using what it has generated so far and the encoder's representations.

Text generation

  • a subfield of NLP
  • leverages knowledge in computational linguistics and AI to automatically generate natural language texts
  • can satisfy certain communicativa requirements

Example of text generation

  • Classifier based on word embeddings, e.g. RNN and LSTM
    • RNN: each input word is encoded as a word embedding vector xix_i, a hidden layer ztz_t, the classes are the words of the vocabulary
      • the output yty_t will be a softmax probability distribution over the possible values of the next word in the sentence.
    • LSTM: can choose to remember som parts of the input, copying it over to the next time step, and to forget toher parts.
  • Pre-trained languaeg model using deep learning
    • BERT
    • GPT-X, Generativ ePre-trained Transformer

Transfer learning

  • experience with one learning task helps an agent learn better on another task.
  • pretraining: a form of transfer learning in which we use a large amount of shared general-domain language data to train an initial version of an NLP model.
    • we can use a smaller amount of domain-specific data to refine the model
    • the refined model can learn the vocabulary, idioms, syntactic structure, and other linguistic phenomena that are specific to the new domain.
  • For NN, learning consist of adjusting weight, so the most plausible approach for transfer learning is to copy over the weights learned for task A to a network that will be trained for task B.
    • The weights are then updated by gradient descent in the usual way using data for task B.
  • the popularity of transfer learning is the availability of high-quality pretrained models.
  • will want to freeze the first few layers of the pretrained model
    • these layers serve as feature detectors that will be useful for new model.
    • new data set will be allowed to modify the parameters of the higher levels only
      • these are the layers that identify problem-specific features and do classification.

IAI +007

· 약 8분

Computer vision

  • a field of artificial intelligence enabling computers and systems to derive meaningful information from digital images, videos and other visual inputs.
  • vision is a perceptual channel that accepts a stimulus and reports soome representation of the world
  • computer vision enables intelligent agents to see, observe and understand of environment

Core problems of CV

  • reconstruction: an agent builds a mode lfo the world from an image or a set of images
  • recognition: an agent draws distinctions amont the objects it encounters based on visual and other information.
    • image classification
    • object detection
    • image segmentation

Classic approaches to object recognition problems

  • feature-based object recognition approach
    • works well for faces looking directly at the camera.
  • pattern-element-based object recognition approach
    • a useful abstraction is to assume that some objects are made up of local patterns which tend to move around with respect to one another.
    • we can model objects with pattern elements.

Modern approaches to object recognition problems

  • deep learning networks
    • to recognition problems enables that features can be automatically learned and extracted from raw image data compares with the manual feature extraction in the classic approaches.
    • AlexNet, VGGNet, GoogleNet, ResNet, DenseNet, EfficientNet, RegNet...
  • basic models and derived models
    • YOLO, SSD, RetinaNet, R-CNN...

Evaluation metrics for image classification

MetricDefinitionUse case
Accuracythe percentage of correcly predicted labels out of all predictions madecommonly used in balanced datasets but can be mis leading in imbalanced classes
Precisinothe ratio of correctly predicted positive observations to the total predicted positives.useful when the cost of false positives is high (e.g., spam detection)
Recall (Sensitivity or True Positive Rate)the ratio of correctly predictted observations to all the actual positivesimportant when the cost of false negatives s high (e.g., disease detection)
F1 Scorethe harmonic mean of Precision and Recallused when you need to balance precision and recall, especially in imbalanced datasets
Specificity (True Negative Rate)the ratio of correctly predicted negative observations to all the actual negatives.important when false positives should be minimized (e.g. medical tests for diseases)
Confusion Matrixa table used to describe the performance of a classification model by showing the true positive, false positive, true negative, and false negative countsprovides a comprehensive understanding of a models' performance across all classes
ROC Curve (Receiver Operation Characteristic)a graphical representation of the classifier's performance across all thresholds, plotting the true positive rate (recall) against the false positive rate (1 - specificity)used to evaludate binary classirifres and compare models
AUC (Area Under the Curve)the area under the ROC curve, providing a single number summary of the models' ability to discriminate between positive and negative classesuseful for evaluating binary classification models, particularly when dealing with imbalanced datasets.

Confusion Matrix

Actual class ➡️
Predicted class ⬇️
PositiveNegativeMetric
PositiveTP: True PositiveFP: False PositivePrecision: TPTP+FP\frac{TP}{TP + FP}
NegativeFN: False NegativeTN: True NegativeNegative Predictive Value: TNTN+FN\frac{TN}{TN + FN}
MetricRecall or Sensitivity: TPTP+FN\frac{TP}{TP + FN}Specificity: TNTN+FP\frac{TN}{TN + FP}Accuracy: TP+TNTP+TN+FP+FN\frac{TP + TN}{TP + TN + FP + FN}

ROC Curve

  1. Sort predicted probabilities
  2. Try multiple thresholds
  3. For each threshold, compute predicted labels
  4. Compute TPR and FPR
  5. Plot ROC curve (FPR vs TPR) i. TPR=TPTP+FNTPR = \frac{TP}{TP + FN} ii. FPR=FPFP+TNFPR = \frac{FP}{FP + TN}
  6. Compute AUC as area under ROC curve
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

Evaluation metric for object detection

MetricDefinitionUse case
Intersection over Union (IoU)measures how much the predicted bounding box overlaps with the ground truth.IoU=Area of OverlapArea of UnionIoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
Precisionthe ratio of correctly predicted postive observations to the total predicted positivesPrecision=TPTP+FPPrecision = \frac{TP}{TP + FP}
Recallthe ratio of correctly predicted observations to all the actual positivesRecall=TPTP+FNRecall = \frac{TP}{TP + FN}
F1 Scorethe harmonic mean of Precision and RecallF1=2×PrecisionRecallPrecision+RecallF1 = 2 \times \frac{Precision \cdot Recall}{Precision + Recall}
Average Precision (AP)Precision-Recall Curve: Precision vs Recall at different thresholds
AP: Area under the Precision-Recall curve (per class)
AP=01P(r)drAP = \int_{0}^{1} P(r) \, dr
Mean Average Precision (mAP)mean of AP across all classesprovides a comprehensive understanding of a models' performance across all classes
AP@[.50:.95]COCO benchmarkAverages AP at IoU threshold from 0.5 to 0.95 in steps of 0.05
  • AP is computed by summing trapezoid areas under the Precision-Recall curve.

Convolutional Neural Network (CNN)

  • contains spatially local connectinos at lest in the early layers
  • has patterns of weights that are replicated across the units in each layer.
  • use a kernel to detect patterns of weights that is replicated across multiple local regions in an image
  • use convolution that applies the kernel to the pixels of the image
Input

[Conv → ReLU][Pooling]

[Conv → ReLU][Pooling]

Flatten (2D → 1D 벡터)

Fully Connected Layer

Output Layer (Softmax/Sigmoid)

zi=j=1lkjxj+(i1)sz_i = \sum_{j=1}^{l} k_j \cdot x_{j + (i-1)s}

  • ziz_i: the ii-th output element (convolution result)
  • jj: index inside the kernel (from 1 to ll)
  • ll: kernel size (number of elements in the kernel)
  • kjk_j: the jj-th kernel weight (filter element)
  • xx: input sequence (the original data)
  • xj+(i1)sx_{j+(i-1)s}: the input element aligned with kernel position jj when shifted by stride
  • ss: stride (step size of the kernel movement)
  • (i1)s(i-1)s: total shift in the input for the ii-th convolution

Receptive Field

  • the receptive field of a neuron is the portion of the sensory input that can affect aht neuron's activation.
  • In CNNs, the receptive field of a unit in the first hidden layer is small.
  • just the size of the kernel, e.g., 3x3 or 5x5.
  • In the deeper layers of the network, it can be much larger.

1D CNN

RL=RL1+(kL1)i=1L1siR_{L} = R_{L-1} + (k_{L} - 1) \cdot \prod_{i=1}^{L-1} s_i

  • RLR_L: receptive field size at the LthL-th layer
  • kLk_L: kernel size at the LthL-th layer
  • sis_i: stride at the ithi-th layer
LayerKernel kStride sCalculationResult RF
Conv1311 + (3-1)*13
Pool1223 + (2-1)*14
Conv2314 + (3-1)*28

Pooling

  • works like a convolutional layer, with a kernel size ll and stride ss, but the operation is applied is fixed rather than learned.
  • no activation fucntion is associated with the pooling layer.
  • common forms of pooling
    • average pooling
    • max pooling: saying a feature exists somewhere in the unit's receptive field.

Dropout

  • a way to reduce the test-set error of a network to increase its ability of generalization.
  • makes a etwork herder to fit the traninig set.
  • the network is created by deactivating a randomly chosen subset of the units (dropout rate).
  • cannot explain why it works but the results is better

Batch Normalization

  • modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD).
  • Batch normlization rescales the values generated at the internal layers of the network from the examples within each minibatch.
    • it standardizes the input to a layer for each mini-batch.
  • standardizes the mean and variance of the values.
  • maeks it much simpler to train a deep network.

Tranining in a CNN

Crossentropy(y,y^)=iyilog(y^i)Cross-entropy(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)

  • Forward pass
  • Backward pass
  • Parameter update

Variants of CNNs

AlexNet

AlexNet

AlexNet Layers

ResNet

  • stands for a residual neural network.
  • was designed to enable hundreds or thousands of convolutional layers.
  • Residual neural networks do this by utilizing skip connections, or shortcurts to jump over some layers.
  • was an innovative solution to the "vanishing gradient" problem.
x ────────────────► (+) ──► y
│ ▲
▼ │
[Conv → ReLU → Conv] = F(x)

VGGNet

  • increases the depth of the network through adding more convolutional layers by using small convolution filters (3x3) while other parameters are fixed.

VGGNet

Derived model for object detection

YOLO

  1. Split an images to S times S blocks.
  2. Apply object classification to each block and get confidence score of different objecxt for each block.
  3. Based on the class probability map to locate objects.

YOLO

Calculation AP and mAP

  1. Gather Detection results
  2. Match Predictions to Ground Truth
  3. Compute Precision and Recall to generate a set of (recall, precision) points
  4. Build the Precision-Recall (PR) Curve (presioin decreases as recall increases)
  5. Smooth the Precision-Recall Curve
  6. Calculate AP
  7. Extend to mAP

IAI +006

· 약 8분

Markov process or Markov chain

the state sequence satisfies the Markov assumption

  • The current state depends on only a finite fixed number of previous states.
  • P(XtX0:t1)=P(XtXt1)P(X_t | X_{0:t-1}) = P(X_t | X_{t-1})

Decision Theory

  • Choosing actions based on the desirability of their immediate outcomes.
  • The outcome of taking action aa in state s0s_0 is deterministic then Result(s0,a)Result(s_0, a).
  • If the outcome is non-deterministic (stochastic), fully observable, the agent knows the current state s0s_0.
    • It is represented as a random variable whose values are the possible outcome states.
    • The transition model specifies the probabilities of possbile outcome states.
    • P(s,a,s)P(s, a, s')

MEU, Maximum Expected Utility

A rational agent should choose the action that maximizes its expcted utility.

  • The MEU principle could be seen as defining all of AI.
  • action=argmaxa(EU(ae))action = argmax_a(EU(a|e))
    • among this set of actions, give the highest expected utility.
  • EU(ae)=s(P(Result(a)=sa,e)U(s))EU(a|e) = \sum_{s'} (P(Result(a) = s' | a, e) * U(s'))

Sequential decision problem

type of problem or scenario where a decision-maker must make a series of choices or decisions over time in order to achieve a desired outcome.

  • a sequential decision-making problem or sequential decision-making task

MDP, Markov Decision Process

a sequential decision problem for a fully observable, stochastic environment environment with a Markovian transition model and additive rewards.

  • a formal framework for modeling sequential decision problems.
  • including states, actions, transition probabilities, and rewards
  • used in reinforcement learning and operations research.
ComponentDescription
Environmentfully observable, Stochastic
Statesa set of states (with an initial state S₀)
Transition modelMarkovian
Rewardadditive
AgentMake decisions at time step for actions
Interacts with environment through percepts and actions
  • A set of states: an initial state S0S_0, possible terminal state(s)
  • A set of actions in each state: ACTIONS(s)ACTIONS(s)
  • A stochastic Markov transition model: P(ss,a)P(s' | s, a)
  • A reward function: R(s)R(s) or R(s,a,s)R(s, a, s')
  • In MDP, Additive Reward is defined as a function that assigns a numerical value to each state-action pair or state-transition pair.
    • represents the immediate benefit or cost associated with taking a specific action in a particular state.
  • The solution must specify what the agent should do for any state that the agent might reach.
  • Policy: π(s)π(s), an action for each state
    • A solution to an MDP.
    • π(s)\pi(s) is the action recommended by the policy π\pi for state ss
  • Q state: Q(s,a)Q(s, a), the expected utility of doing action aa in state ss

RL, Reinforcement Learning

  • an agent interacts with an environment
  • takes actions to maximize a cumulative reward signal over time.
  • learns a policy that balances exploration and exploitation
    • exploration: trying new actions.
    • exploitation: choosing known good actions.

Sample MDP

Environment

  • The agent lives in a 4x3 fully observable stochastic environment
  • Walls block the agent's path; Begin in the start state, the agent must choose an action at each time step
  • Actions available to the agent in each state: Up, Down, Left, or Right
  • The interaction with the environment terminates when the agent reaches one of the goal states, marked +1 or -1

4x3 Grid World

1234
3+1
2🚫-1
1START
  • START: Initial position (1,1)
  • 🚫: Wall (impassable)
  • +1: Goal state with positive reward
  • -1: Goal state with negative reward

Transition Model

  • Stochastic environment with uncertainty
  • When agent chooses an action, there's:
    • 0.8 probability of moving in intended direction
    • 0.1 probability of moving perpendicular to intended direction (each side)

Reward Function

  • The agent receives rewards each time step:
    • Small "living" reward each step (-0.04 in all states except the terminal states)
    • Big rewards come at the end (+1 for good and -1 for bad terminal states)

Goal

  • To maximize sum of rewards from the start state to a terminate state

Additive Reward

  • a type of reward structure
  • the total reward earned by an agent in an MDP
  • calculated by summing up the individual rewards received at each time step as the agent interacts with the environment.

Discounted reward

  • a techinque used to calcluate the expected cumulative reward over time.
  • applying a discount factor to future rewards, prioritizing more immediate rewards over distant ones.

Utility

Uπ(s)=E[t=0γtR(St,π(St),St+1)]U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, \pi(S_t), S_{t+1})\right]

  • The expectation EE is with repect to the probability distribution over state sequences determined by ss and π\pi.
  • γ\gamma is the discount factor, 0γ<10 \leq \gamma < 1, which determines the present value of future rewards.
  • R(St,π(St),St+1)R(S_t, \pi(S_t), S_{t+1}) is the reward received when transitioning from state StS_t to state St+1S_{t+1} by taking action π(St)\pi(S_t).

Bellman equation

U(s)=maxaA(s)sP[ss,a](R(s,a,s)+γU(s))U(s) = max_{a \in A(s)} \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma U(s'))

Formula ComponentDescription
U(s)U(s)the utility of state ss
maxamax_athe maximum over all possible actions aACTIONS(s)a \in ACTIONS(s)
s\sum_{s'}the sum over all possible successor states ss'
P[ss,a]P[s'\vert s,a]the transition probability of reaching state ss'
R(s,a,s)R(s,a,s')the reward received after transitioning from state ss to state ss' by taking action aa
γ\gammathe discount factor, 0γ<10 \leq \gamma < 1
U(s)U(s')the utility of the successor state ss'
  • In optimal setting, the neighbor state meets the Bellman equation.
  • From s0s_0 to termination, each step in the sequence should match the Bellman equation.

πs=argmaxπUπ(s)\pi^{*}_s = argmax_{\pi}U^{\pi}(s)

Q function

  • the expected utility of taking a given action in a given state.
  • enables the agent to act optimally simply by choosing
    • U(s)=maxaQ(s,a)U(s) = max_a Q(s,a)
  • The optimal policy can be extracted from the Q-function.
    • π(s)=argmaxaQ(s,a)\pi^{*}(s) = argmax_a Q(s,a)

Q(s,a)=sP[ss,a](R(s,a,s)+γU(s)) =sP[ss,a](R(s,a,s)+γmaxaQ(s,a))Q(s,a) = \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma U(s')) \\ \space = \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma max_{a'}Q(s',a'))

Reinforcement Learning

Key components of RL

  • Agent
  • Environment
  • State
  • Action
  • Reward
  • Policy

Application of RL

  • Robotics
  • Autonomous systems
  • Game playing
    • Atari video games from raw visual input
    • Playing poker
    • Alpha Go
  • Recommendation systems

RL Algorithms

  • Model-based RL
    • often learns a utility function U(s)U(s)
    • the sum of rewards from state ss onward from observing the effects of its actions on the environment.
  • Model-free RL
    • Action-utility learning, such as Q-learning
    • Policy learning
  • Passive RL
    • the agent's policy is fixed
    • the task is to learn the utilities of states.
  • Active RL
    • the agent must figure out what to do through exploration and exploitation.
      • Exploration: trying new actions to discover their effects.
      • Exploitation: leveraging known actions that yield high rewards.

Direct Utility Estimation

  • the utility of a state is the expected total reward from the state onward.
    • the expecgted reward-to-go from that state.
    • each trial provides a sample of this quantity for state visited.
  • end of each sequence, the algorighm calculates the observed reward-to-go for each state and updates the estimated utility for the state accordingly.
  • In the limit of infinitly many trials, the sample average will converge to the expectation in teh Bellman equation.
  • Ignores the connections between states.
    • misses opportunities to learning, it learns nothing until the end of the trial.
    • often converges very slowly.

Adaptive Dynamic Programming, ADP

Ui(s)=sP(ss,πi(s))[R(s,πi(s),s)+γUi(s)]U_i(s) = \sum_{s'}P(s'|s,\pi_{i}(s))[R(s,\pi_{i}(s),s') + \gamma U_{i}(s')]

  • advantage of the constraints among the utilities of states by learning the transition model
    • that connects them and solves the corresponding MDP
    • using dynamic programming to calculate the utilities of the states.
  • These Bellman equations are linear when the policy πi\pi_i is fixed
    • can be solved using any linear algebra package.

Learning a transition model

  • the environment is fully observable
  • the agent can learn the mapping from a state-action pair (s,a)(s, a) to the resulting state ss'.
  • The transition model P(ss,a)P(s'|s,a) is represented as a table and it is estimated directly from the counts that are accumulated in Nss,aN_{s'|s,a}.
  • The counts record how often state ss' is reached when executing aa in ss.
    • P(ss,a)=Nss,asNss,aP(s'|s,a) = \frac{N_{s'|s,a}}{\sum_{s''}N_{s''|s,a}}

Temporal difference, TD

Uπ(s)Uπ(s)+α[R(s,π(s),s)+γUπ(s)Uπ(s)]U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha[R(s,\pi(s),s') + \gamma U^{\pi}(s') - U^{\pi}(s)]

  • when a transition occurs from state ss to state ss' via action π(s)\pi(s)
  • α\alpha is the learning rate, 0<α10 < \alpha \leq 1
  • no need a transition model to perform updates
  • the environment is itself supplies the connection between neighboring states in the form of observed transitions.

Q learning

  • model-free reinforcement learning algorithm
  • the agent learns a Q-function, Q(s,a)Q(s,a)
  • avoids the need for a model by learning an action-utility function Q(s,a)Q(s,a) instead of utility function U(s)U(s).
  • a model-free Q-learning TD update
    • Q(s,a)Q(s,a)+α[R(s,a,s)+γmaxaQ(s,a)Q(s,a)]Q(s,a) \leftarrow Q(s,a) + \alpha[R(s,a,s') + \gamma max_{a'}Q(s',a') - Q(s,a)]
    • this update is calculated whenever action aa is executed in state ss leading to state ss'
  • TD Q-learning agent doesn't need a transition model P(ss,a)P(s'|s,a)
    • either for learning or for action selection.

IAI +005

· 약 13분

Neural Network Development History

  • 1950s-1960s: Early Foundations
    • McCulloch & Pitts (1943): mathematical neuron model
    • Rosenblatt's Perceptron (1958): first trainable network
    • Minsky & Papert (1969): limitations (XOR problem) → AI Winter
  • 1970s–1980s: First Revival
    • Werbos (1974); Rumelhart, Hinton, Williams (1986): Backpropagation
    • Hopfield Networks (1982): associative memory
    • Renewed optimism but limited by hardware
  • 1990s: Consolidation
    • LeCun's CNN (LeNet, 1989): digit recognition
    • Elman, Jordan: Recurrent Neural Networks
    • Symbolic AI still dominated mainstream
  • 2000s: Deep Learning Foundations
    • Better hardware (GPUs) + large datasets
    • Hinton (2006): Deep Belief Networks (unsupervised pretraining)
    • Connectionism regains attention
  • 2010s: Deep Learning Boom
    • ImageNet (2012): AlexNet breakthrough
    • RNNs, LSTMs, GRUs → speech & translation
    • Transformers (2017): revolutionized NLP
  • 2020s: Scaling & Foundation Models
    • Large Language Models (GPT, BERT, etc.)
    • Multimodal AI: vision, text, speech integration
    • Connectionism dominates AI research & industry

Neural Network Models

  • a collection of units (neurons) connected together
  • The properties of the network are determined by its topology and the properties of the neurons.
  • Roughly speaking, the neuron fires when a linear combination of its inputs exceeds some (hard or soft) threshold.

Simple Neuron

  • inj=i=0nwijaiin_j = \sum_{i=0}^{n} w_{ij}a_i
  • outj=g(inj)out_j = g(in_j)
  • aj=g(i=0nwijai)a_j = g(\sum_{i=0}^{n} w_{ij} a_i)

Activation function

ReLU function

ReLU(x)=max(0,x)ReLU(x) = max(0, x)

  • an abbreviation for rectified linear unit
  • Commonly used

Softplus function

Softplus(x)=log(1+ex)Softplus(x) = \log(1 + e^x)

  • A smooth version of the ReLU function

Logistic or Sigmoid function

Logistic(x)=11+exLogistic(x) = \frac{1}{1 + e^{-x}}

  • Non-linear, can represent a nonlinear function

Tanh function

tanh(x)=e2x1e2x+1tanh(x) = \frac{e^{2x} -1}{e^{2x} + 1}

Topology of a neural network

  • Feed-forward network (FFN):
    • Every node receives inputs from "upstream" nodes and delivers output to "downstream" nodes.
    • There are no loops.
    • FFN represents a function of its current inputs, thus it has no internal state other than the weights themselves.
  • Recurrent Network (RNN):
    • A recurrent network feeds its outputs back into its own inputs.
    • In a recurrent network, the neuron values can eventually settle down, keep cycling, or behave unpredictably.
    • can support short-term memory
FFNRNN
FFNRNN

Training Process

  • Go through each training sample.
  • If correctly classified → do nothing.
  • If misclassified → update the weights:
  • wiwi+α(yy^)xiw_i \leftarrow w_i + \alpha(y - \hat{y})x_i

Perceptron for Binary Classification

  • A perceptron separates data into two classes with a hyperplane.
  • if wx01w \cdot x \geq 0 \rightarrow 1
  • if wx00w \cdot x \le 0 \rightarrow 0

Learning Rules

AspectPerceptron Learning RuleGradient Descent (with Sigmoid)
Activation functionHard threshold (계단 함수)
Threshold(z)=1  if  z0,  0  otherwiseThreshold(z) = 1 \; \text{if} \; z \ge 0,\; 0 \; \text{otherwise}
Sigmoid (연속 함수)
hw(x)=11+ewxh_w(x) = \frac{1}{1+e^{-w \cdot x}}
Output0 또는 10과 1 사이의 실수 값
Loss function없음 (틀리면 조정, 맞으면 유지)
규칙 기반 학습
L=(yhw(x))2L = (y - h_w(x))^2 (L2 loss)
또는 Cross-Entropy (실무에서 자주 사용)
Update rule틀렸을 때만:
ww+α(yhw(x))xw \leftarrow w + \alpha (y - h_w(x))x
경사하강법:
ww+α(yhw(x))hw(x)(1hw(x))xw \leftarrow w + \alpha (y - h_w(x)) \cdot h_w(x)(1-h_w(x)) \cdot x
Why derivative?Hard threshold는 미분 불가능 → 단순 규칙 사용Sigmoid는 연속적이고 미분 가능 → Loss 함수의 기울기(gradient)를 따라 업데이트.
여기서 hw(x)(1hw(x))h_w(x)(1-h_w(x)) 항은 sigmoid의 도함수에서 나온 것.
Interpretation틀리면 정답 방향으로 한 걸음 이동Loss가 줄어드는 방향으로 점진적으로 이동

Feadforward NN, FNN

  • a multilayer perceptron network
  • one input layer, N hidden layers, N >= 1, and one output layer.
  • Except for the input layer, each layer has a same activation function g.
  • The final output is represented by a vector function of inputs and weights.
  • If it has three layers, Shallow Neural Network, otherwise Deep Neural Network.

Traning a FNN

  • Forward
    • Activation passing from the input layer to the output layer
    • Calculate the output
  • Backward
    • Errors propagating backward from the output layer to the input layer
    • Update weights

Forward phase

  • Activation of each node is computed in two steps:

    1. Weighted sum (in): sum of activations from the previous layer, multiplied by weights.
    2. Apply activation function g: pass the weighted sum through g to produce the node's activation.
  • Process: propagate activations layer by layer towards the output layer.

  • Output value (example with 2 layers):

    • hw(x)=g(2)(W(2)g(1)(W(1)x))h_w(x) = g^{(2)}\big(W^{(2)} g^{(1)}(W^{(1)} x)\big)

Backward phase

  • Loss function: choose squared error loss (L2)
    • L2(y,y^)=(yy^)2L_2(y, \hat{y}) = (y - \hat{y})^2
  • Prediction:
    • y^=hw(x)\hat{y} = h_w(x)
  • Gradient descent: compute gradient of the loss with respect to weights, then update weights along the negative gradient direction.
    • wi,jwi,jαgradientwi,jw_{i,j} \leftarrow w_{i,j} - \alpha \cdot gradient_{w_{i,j}}
  • Example: sigmoid activation:
    • y^=11+ewx\hat{y} = \frac{1}{1 + e^{-w \cdot x}}
    • Gradient of the loss:
      • gradientwi,j=wi,jLoss(hw)=2(yhw(x))(wi,jhw(x))gradient_{w_{i,j}} = \frac{\partial}{\partial w_{i,j}} Loss(h_w) = 2 (y - h_w(x)) \cdot \Big(- \frac{\partial}{\partial w_{i,j}} h_w(x)\Big)
    • Chain rule applied:
      • g(f(x))x=g(f(x))f(x)\frac{\partial g(f(x))}{\partial x} = g'(f(x)) \cdot f'(x)
  • Example: Gradient derivation for sigmoid
    • Weighted input
      • WX=w1,3x1+w2,3x2+w0,3x0W \cdot X = w_{1,3}x_1 + w_{2,3}x_2 + w_{0,3}x_0
      • (where x0=1x_0 = 1 for the bias)
    • Gradient of the loss
      • gradientwi,j=wi,jLoss(hw)=2(yhw(x))(wi,jhw(x))gradient_{w_{i,j}} = \frac{\partial}{\partial w_{i,j}} Loss(h_w) = 2 (y - h_w(x)) \cdot \Big(-\frac{\partial}{\partial w_{i,j}} h_w(x)\Big)
    • Derivative of sigmoid output
      • wi,jhw(x)=hw(x)(1hw(x))wi,j(WX)\frac{\partial}{\partial w_{i,j}} h_w(x) = h_w(x)(1 - h_w(x)) \cdot \frac{\partial}{\partial w_{i,j}} (W \cdot X)
        • wi,j(11+eWX)\frac{\partial}{\partial w_{i,j}} \left( \frac{1}{1 + e^{-W X}} \right)
        • =(11+eWX)(111+eWX)wi,j(WX)= \left( \frac{1}{1 + e^{-W X}} \right) \left(1 - \frac{1}{1 + e^{-W X}} \right) \cdot \frac{\partial}{\partial w_{i,j}} (W X)
        • =hw(x)(1hw(x))wi,j(WX)= h_w(x) \big(1 - h_w(x)\big) \cdot \frac{\partial}{\partial w_{i,j}} (W X)
    • Derivative of weighted input
      • w0,3(WX)=x0=1\frac{\partial}{\partial w_{0,3}}(W \cdot X) = x_0 = 1
      • w1,3(WX)=x1\frac{\partial}{\partial w_{1,3}}(W \cdot X) = x_1
      • w2,3(WX)=x2\frac{\partial}{\partial w_{2,3}}(W \cdot X) = x_2
    • Weight update rule
      • General form: wi,jwi,jαgradientwi,jw_{i,j} \leftarrow w_{i,j} - \alpha \cdot gradient_{w_{i,j}}
      • w0,3w0,3+α(yhw(x))hw(x)(1hw(x))w_{0,3} \leftarrow w_{0,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x))
      • w1,3w1,3+α(yhw(x))hw(x)(1hw(x))x1w_{1,3} \leftarrow w_{1,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x)) x_1
      • w2,3w2,3+α(yhw(x))hw(x)(1hw(x))x2w_{2,3} \leftarrow w_{2,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x)) x_2

Backward phase Steps

  1. Select a loss function
    • For example, squared error loss:
    • L(y,y^)=(yy^)2,y^=hw(x)L(y, \hat{y}) = (y - \hat{y})^2, \quad \hat{y} = h_w(x)
  2. Choose an activation function
    • Suppose we use a sigmoid:
    • hw(x)=11+eWXh_w(x) = \frac{1}{1 + e^{-W \cdot X}}
  3. Calculate the error at the output node
    • The delta (error term) at the output is
    • Δout=2(y^y)g(inout)\Delta_{out} = 2(\hat{y} - y) \cdot g'(in_{out})
  4. Calculate the error at hidden nodes
    • A hidden unit may connect to multiple nodes in the next layer.
    • Therefore, its error is the weighted sum of all deltas it feeds into, scaled by its own derivative:
    • Δi=g(ini)jwi,jΔj \Delta_i = g'(in_i) \sum_j w_{i,j} \Delta_j
    • The summation appears because the hidden node's output influences several downstream nodes, and all those error signals must be aggregated.
  5. Update the weights with gradient descent
    • The gradient with respect to weight wi,jw_{i,j} is simply the input times the delta:
    • Lwi,j=aiΔj\frac{\partial L}{\partial w_{i,j}} = a_i \Delta_j
    • Update rule:
    • wi,jwi,jαaiΔjw_{i,j} \leftarrow w_{i,j} - \alpha \, a_i \Delta_j

Vanishing gradient

  • The error signal are extinguished altogher as they are propagated back through the network
  • In deep feedforward networks with sigmoid/tanh, repeated multiplication of small derivatives (0<g(z)<10 < g'(z) < 1) during backpropagation causes the gradient to vanish.

Optimizer

  • Training a neural network consists of modifying the network's parameters, minimizing the loss function on the training set.
  • any kind of optimization algorithm could be used.
  • modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD). Adam Optimizer
  • The optimiser is specified in the compilation step with tensorflow.

Recurrent NN, RNN

  • units may take as input a value computed from their own output at an earlier step in the computation.
  • have internal state, or memory: inputs received at earlier time steps affect the RNN's response to the current input.
  • be used to perform more general computations.
    • to analyze sequential data in which a new input vector xtx_t arrives at each time step
  • Markov assumption: the hidden state ztz_t of the network suffices to capture the information from all previous inputs.
    • zt=f(zt1,xt)z_t = f(z_{t-1}, x_t)
    • Once trained, this function represents a time-homogeneous process
    • The same update rule fwf_w applies at every time step, regardless of whether it’s the first input or the hundredth.
  • RNNs are designed for sequential data.
  • a hidden state that captures information from previous steps.
  • suffer from vanishing/exploding gradients.
  • Good for short-term dependencies.

Backpropagtion Through Time, BPTT

  • gradient expression is recursive.
    • ztwz,z\frac{\partial z_t}{\partial w_{z,z}}
    • =wz,zgz(inz,t)= \frac{\partial}{\partial w_{z,z}} g_z(in_{z,t})
    • =gz(inz,t)inz,twz,z= g_z'(in_{z,t}) \frac{\partial in_{z,t}}{\partial w_{z,z}}
    • =gz(inz,t)wz,z(wz,zzt1+wx,zxt+w0,z)= g_z'(in_{z,t}) \frac{\partial}{\partial w_{z,z}} (w_{z,z} z_{t-1} + w_{x,z} x_t + w_{0,z})
    • =gz(inz,t)(zt1+wz,zzt1wz,z)= g_z'(in_{z,t}) \left( z_{t-1} + w_{z,z} \frac{\partial z_{t-1}}{\partial w_{z,z}} \right)
      • ztWz,z\frac{\partial z_t}{\partial W_{z,z}} includes zt1Wz,z\frac{\partial z_{t-1}}{\partial W_{z,z}}
  • the gradient with run time being linear in the size of the network
  • handled automatically by deep learning software systems.
  • Iterating the recursion shows that the gradient at time TT includes a term proportional to:
    • wz,zt=1Tgz(inz,t)w_{z,z} \prod_{t=1}^{T} g'_z(in_{z,t})
  • Since for sigmoid, tanh, and ReLU we have g1g' \leq 1, if wz,z<1w_{z,z} < 1 the RNN will suffer from the vanishing gradient problem.
  • If wz,z>1w_{z,z} > 1, we may encounter the exploding gradient problem.

Long Short-Term Memory, LSTM

  • memory cell is essentially copied from time step to time step.
  • New information enters the memory by adding updates.
    • the gradient expressions do not accumulate multiplicatively over time.
  • include gating units: vectors control the flow of information in the LSTM, elementwise multiplication of the corresponding information vector.
  • a type of RNN designed to overcome vanishing gradient.
  • use gates (input, forget, output) to control information flow.
  • Capable of learning long-term dependencies.
  • Widely used in NLP, speech recognition, and time series forecasting.

Gates in LSTM

  • Forget gate: decides what information to discard from the cell state.
  • Input gate: decides what new information to store in the cell state.
  • Output gate: decides what information to output from the cell state.
    • similar role to the hidden state in basic RNNs.
  • Update equations:
    • ft=σ(Wx,fxt+Wz,fzt1)f_t = \sigma(W_{x,f}x_t + W_{z,f}z_{t-1})
      • Decides which parts of the previous cell state ct1c_{t-1} should be kept or discarded.
    • it=σ(Wx,ixt+Wz,izt1)i_t = \sigma(W_{x,i}x_t + W_{z,i}z_{t-1})
      • Determines how much of the new information from the current input xtx_t and the previous hidden state zt1z_{t-1} should be added.
    • ot=σ(Wx,oxt+Wz,ozt1)o_t = \sigma(W_{x,o}x_t + W_{z,o}z_{t-1})
      • Controls which parts of the current cell state ctc_t are exposed as the hidden state ztz_t.
    • ct=ct1ft+ittanh(Wx,Cxt+Wz,Czt1)c_t = c_{t-1} \odot f_t + i_t \odot \text{tanh}(W_{x,C}x_t + W_{z,C}z_{t-1})
      • Cell state update
      • Past information (ct1c_{t-1}) is partially retained through the forget gate.
      • New information is added through the input gate and tanh\tanh.
      • Thus, ctc_t serves as the long-term memory of the LSTM.
    • zt=ottanh(ct)z_t = o_t \odot \text{tanh}(c_t)
      • Hidden state update
      • The cell state is normalized with tanh(ct)\tanh(c_t) and filtered by the output gate.
      • ztz_t is the hidden state passed forward to the next time step.

Gated Recurrent Unit, GRU

  • Variant of RNN with gating mechanisms.
  • Designed to capture long-term dependencies without complex architecture.
  • a simpler alternative to LSTMs. (lightweight, effective RNN variant)
  • Captures temporal dependencies (short & long).
  • Combine input and forget gates into a single update gate.
  • Require fewer parameters than LSTM, making them faster to train.
  • Perform comparably to LSTMs in many tasks.
  • Prevents vanishing gradient.
  • Good balance between complexity & performance.
  • Excels in time series forecasting tasks.
  • Widely used in finance, energy and IoT.

Gates in GRU

  • Update gate (z): decides how much past information to keep
  • Reset Gate (r): decides how much past information to forget
  • Candidate hidden state (h~\tilde{h}): potential new memory
  • Final hidden state (hh): weighted combination of old and new information.

GRU Workflow

  • Reset gate (rr)
    • Controls how much of the previous hidden state should be "forgotten."
    • A small value means most of the past memory is erased, while a large value means much of it is retained.
  • Update gate (zz)
    • Acts as a switch to decide whether to keep the previous state hprevh_{prev} or replace it with the new candidate h~\tilde{h}.
    • If z=1z=1, the past is fully kept; if z=0z=0, it is completely replaced by the new candidate.
  • Candidate state (h~\tilde{h})
    • Combines the current input xtx_t with the reset-gated previous hidden state to generate the "candidate" new information.
  • Final hidden state (hh)
    • Blends the past and the candidate using the update gate zz.
    • If zz is large → the past memory dominates.
    • If zz is small → the new candidate dominates.
  • h=(1z)h~+zhprevh=(1−z)\tilde{h}+zh_{prev}

Comparison: RNN vs LSTM vs GRU

AttributeRNNLSTMGRU
ArchitectureSimple, hidden stateComplex, memory cell + 3 gatesSimplified, 2 gates (update/reset)
Information FlowStored in hidden stateControlled by gatesControlled by merged gates
Long-term DependencyWeak (vanishing gradient)Strong (gates solve vanishing gradient)Strong (gates solve vanishing gradient)
Short-term DependencyStrongStrongStrong
Number of ParametersFewManyFewer than LSTM
Training SpeedFastSlowFast
PerformanceGood for short-termGood for long-termEfficient, similar to LSTM
Application AreasSimple time series, basic NLPNLP, speech, time series forecastingFinance, IoT, energy, time series
Vanishing GradientYesNoNo
Typical Use CasesText generation, simple predictionTranslation, speech recognitionTime series prediction, sensor data
SimplicityVery simple, rarely usedMore complex, expressive (3 gates)Simpler than LSTM, fewer parameters
ExpressivenessLimited, struggles with long-termHigh, handles very complex sequencesModerate, good for moderate data size
Training EfficiencyFast, but limitedSlower, better for complex dataFast, efficient, similar performance
Trade-offSimple but weak for long-termCapacity for complex, long sequencesSimplicity vs. capacity

IAI +004

· 약 21분

Influencing improvement

  • Agent's component
  • Agent's prior knowledge, which influence that mode lit builds
  • the feedback available to learn from.

Components

  • Given an intelligent agent that performs some intelligent tasks, any components of agent program can be improved by learning.

Prior knowledge

  • Inductive learning (귀납적 학습): learning a general function or rule (possibly incorrect) from specific input-output pairs
    • Bottom to top
    • Specific to general
  • Deductive learning (연역적 학습): going from a general rule to a new specific rule that is logically entailed, but is useful because it allows more efficient processing
    • Top to bottom
    • general to specific

Feedback

  • feedback on its percept sequence
  • no feedback on its percept sequence
  • rewards for taking a sequence of actions based on its percept sequence
-Supervised LearningUnsupervised Learning
Training Datalabeledunlabeled
Computational complexitysimplerComputationally complex
Accuracyhighless accurate

Supervised learning

  • the agent observes some examples of input-output pairs first and then learns a function or a relationship that maps from inputs to output.
  • Attributes/Features: the inputs are independent variables in the problem domain
  • Target attribute: the output is the dependent variable which is dependent on the inputs.
  • Model: the learned function or relationship
  • The agent learns a model using examples and uses this model to predict the outcomes for new inputs.

Unsupervised learning

  • The agent collects adequate examples in the problem domain but it does not get any explicit feedback to the examples.
  • The agent can make sense of the examples through identifying clusters or frequent patterns in the data.
  • When shown a large number of examples, the agent can learn to identify clusters of similar examples.

Reinforcement learning

  • the agent learns from a series of actions which can be rewards or punishments to improve its performance in completing the task under consideration.
  • the feedback helps the agent to enforce positive actions and reduce the negative actions through adjusting the policy.

Supervised Learning Technique

  • Decisinon tree
  • Random forests
  • Linear regrasssion
  • Logistic regression
  • K nearest neighbours
  • Support vector machines
  • Neural networks

Regression problem

  • to predict a continuous value as the output for a given input
  • weather temperature: solar radiation, wind direction and speed, geographic location..
  • how to predict the output value of a new data instance on the basis of observed features from the existing data (historical examples) in the problem domain.
  • Elements
    • Collection of existing or historical data samples which are represented by a set of attributes or independent variables
    • The output values of the existing data samples
      • the output variable or attribute must be continuous
  • Regressor: a function describes the relationship between the attributes of a data sample and the output.
    • takes the values of attributes of a data sample and predicts the output value of this given data sample.

Evaluate a regressor

  • R Square/Adjust R Square
  • MSE Mean Square Error/RMSE Root Mean Square Error
  • MAD Mean Absolute Error

Examples of regression problems

  • Predict the fuel price using the Brent crude oil price, financial performance of the oil related companies (cash flow, projects lined up, etc.) and/or geopolitical risks (OPEC announcements, government sanctions, etc.)
  • Predict the house price of a suburb from the suburb's profile
  • Predict the blood pressure of a patient based on the patient's health profile
  • Predict the electricity price using temperature, demand and time.

Classification problem

  • to predict discrete or categorical value as the output for a given input
  • Pass or Failed given learning outcomes, student ID, prior learning, attitude, commitment and attendance.
  • how to put a new data instance into one of predefined categories or classes on the basis of observed features from the existing data in the problem domain.
  • Elements
    • Collection of existing or historical data samples with class labels
    • Predefined categories or classes
    • Adequate samples in each category or class in the existing or historical data.
  • Logic-based techniques
    • Decision tree
    • Learning set of rules
  • Perceptron-based techniques
    • Single-layer perceptron
    • Multi-layer perceptron
    • RBF network
  • SVM
  • Statistical learning techniques
    • Naive Bayes classifier
    • Bayesian networks
  • Instance-based learning
    • K-nearest neighbor (KNN)

Evaluate a classifier

  • Confusion matrix
  • Precision
  • Recall/Sensitivity
  • Specificity
  • F1-Score
  • Area Under Curve & Receiver Operating Characteristics Curve (AUC-ROC)

Examples of classification problems

  • banking, healthcare, medical diagnosis, marketing (sentiment aalysis), telecommunication, agriculture, security (fraud detection).
  • e-mails into spam or non-spam class
  • loan applications into an approved or a rejected class.
  • patients into having a certain disease or not having that disease groups.
  • text into positive or negative sentiment.
  • customers into churn or non-churn classes.

Overfitting

  • general phenomenon with all types of learning models.
  • a modeling error that occurs when a function is too closely or exactly fit to a limited set of data points.
  • more likely as the complexity of models and the number of input attributes increase
  • less likely as the number of training examples is large.

Decision Tree

  • if-then statements to define patterns in data
  • A if-then statement splits the training data into two or more branches based on some values
  • Best Split: The results of each branch should be as homogeneous as possible, or has the lowest impurity possible.
    • Information gain
    • Gini index

Implement Decision Tree

  • the split (a feature and a condition) that leads to the lowest impurity in the resulting child nodes, in a greedy manner
  • For categorical features: each unique value can be a split condition.
  • For continuous features: midpoints between consecutive sorted unique values are used as split conditions.
  • For each potential split condition, the algorithm calculates the impurity of the resulting child nodes.
  • The lowest impurity node becomes the split point for that branch.
  • The process is then repeated recursively for each child node until all leaf nodes are pure, or the stopping criteria are met.

The selectino of best split attributes

  • ID3: employs a top-down, greedy search through the space of possible branches with no backtracking using information gain
  • C4.5: using information gain ratio
  • CART: using Gini Index
  • Gini Index
  • Chi-Square
  • Reduction in Variance

Entropy

the fundamental quantity in information theory. It is a measure of the uncertainty of a random variable.

  • the fundamental quantity in information theory. It is a measure of the uncertainty of a random variable
  • A more homogeneous node with a clear majority class has low impurity and low entropy, while a more mixed distribution of classes has high impurity and high entropy.

Information gain

the decrease in entropy.

  • The information gain from the attribute test on (split on A) is the expected reduction in entropy.
  • Information gain computes the difference between entropy before the split and average entropy after the split of the dataset based on given attribute values
  • Entropy(S)=pilog2piEntropy(S) = - \sum_{} p_i \log_2 p_i
  • Gain(S,A)=Entropy(S)Entropyremain(S,A)Gain(S,A)=Entropy(S) − Entropy_{remain}(S,A)

Gini index

  • For classification, another impurity measure commonly used for classification tasks in decision trees.
  • a lower Gini index indicates lower impurity, meaning that the samples in the node predominantly belong to a single class
  • a bit more computational efficient than entropy as it does not involve logarithm calculations. but results are quite similar.
  • Gini(S)=1i=1Kpi2Gini(S) = 1 - \sum_{i=1}^{K} p_i^2

Variance

  • For regression tree, the target variable is continuous rather than categorical.
  • use variance as a measure of impurity in regression trees.
  • lower variance indicates that the data points are closely clustered around the mean
  • Var(S)=1N(yiμ)2Var(S) = \frac{1}{N} \sum_{} (y_i - \mu)^2

Prediction

  • For classification tasks: the predicted class label is the majority class among the training samples in the leaf node.
  • For regression tasks: the predicted value is the mean of the target values of the training samples in the leaf node.

Dealing with Overfitting

  • Overfitting: a common issue in decision trees, where the model captures noise or outliers in the training data rather than the underlying pattern.
  • the model performs poorly when applied to new, unseen data.

Setting Stopping Criteria

  • prevent the tree from becoming overly complex, which may lead to overfitting
  • applied during the tree construction process
  • limiting the maximum depth of the tree
  • setting a minimum number of samples per leaf node
  • requiring a minimum impurity decrease for a split

Pruning Strategies

  • applied after the tree has been fully grown
  • removing branches from the fully grown tree to simplify its structure
  • ensure that it captures the underlying patterns in the data rather than noise or outliers
  • Pruned trees perform significantly better than unpruned trees when the data contain a large amount of noise.

Ensemble Methods

  • combine multiple decision trees to form a more robust and accurate model.
  • address overfitting by averaging the predictions of the individual trees, reducing variance and improving generalization.
  • Random Forests
  • Gradient Boosted Trees

Random Forest

combines multiple weak decision tree models to create a stronger learning model.

  • two types of randomness are introduced to ensure that the individual decision trees are diverse and less prone to overfitting.
  • Random sampling of the input data
  • Bootstraping:
    • involves sampling with replacement 복원추출 (meaning that some instances appearing multiple times and others not appearing) from the original dataset, creating a new dataset.
    • each decision tree is trained on a slightly different set of data points, reducing the likelihood of overfitting.
  • Random selection of features at each split
    • At each split in each decision tree, a random subset of features is considered when determining the best split.
    • each tree in the ensemble does not rely on the same set of features for making decisions, resulting in a more diverse set of trees.
    • By considering only a subset of features at each split, the model is less likely to be influenced by a small number of dominant features, leading to a more balanced and accurate prediction.
구분데이터 무작위성 (Bootstrapping)속성 무작위성 (Feature Subset Selection)
적용 위치트리 훈련 데이터 선택 단계트리의 각 분할(split) 단계
방법원본 데이터셋에서 복원 추출(with replacement)로 샘플링하여 새로운 학습용 부분집합 생성전체 속성 중 무작위로 일부 속성만 선택 후, 그 속성들로만 분할 기준 탐색
특징- 각 나무가 다른 데이터 포인트로 학습됨
- 일부 샘플은 여러 번 등장, 일부는 제외될 수 있음
- 각 분할이 다른 속성을 사용 가능
- 동일한 속성에 과도하게 의존하지 않음
효과- 트리 간의 데이터 다양성 확보
- 과적합 감소
- 트리 간의 속성 다양성 확보
- 소수 지배적 속성의 영향 축소
결과더 다양한 데이터 시나리오를 반영한 트리들 생성더 다양한 의사결정 규칙을 반영한 트리들 생성

Predict with Random Forest

  • aggregating the predictions of all individual decision trees in the forest.
  • Majority voting: For classification, Count the number of times each class is predicted by the individual decision trees. The class with the highest count is considered as the final prediction.
  • Averaging: For regression, Calculate the mean of the predictions made by the individual decision trees. The mean value is considered as the final prediction.

Linear regression

a learning technique that finds a linear relationship between input variables and the target variable based on a fundamental assumption that there is a linear relationship between input variables and the target variable

  • e.g. the input variables (engine size, weight and car age) ➡️ target variable (car fuel efficiency)
    • assumption that there is a linear relationship
  • A linear regression technique learns a set of coefficients to estimate the linear relationship between xx and yy, denoted as hwh_w, which can be represented by the following equation.
    • h_w(x)=w0+w1x1+...+wnxn=i=0nwixih\_w(x) = w_0 + w_1x_1 + ... + w_nx_n = \sum_{i=0}^{n} w_ix_i
    • ww is a weight vector
    • y^=i=0nwixi\hat{y} = \sum_{i=0}^{n} w_ix_i
  • linear regression model is an approximate function between the input variables and the target variable, there will be an error between the output of the model and the actual output value for a data sample
    • This error can be represented by a loss function, which calculates the mean square error
    • Loss(hw)=12mj=1m(hw(xj)yj)2=12mj=1m(yji=0nwixj,i)2Loss(h_w) = \frac{1}{2m}\sum_{j=1}^{m}(h_w(x_j) - y_j)^2 = \frac{1}{2m}\sum_{j=1}^{m}(y_j - \sum_{i=0}^{n} w_ix_{j,i})^2
  • for solving regression problems

Solving a linear regression problem

  • to find the best linear relationship hwh_wthat best fits the training data of mm data samples.
    • makes the loss to be minimised.
  • to find the best weight vector ww^*, such that for a given training dataset of mm data samples.
    • w=argminwLoss(hw)w^* = \arg\min_{w} Loss(h_w)
  • gradient descent: continuously resamples the gradient of the weight coefficients in the opposite direction depending on the weight ww.
    • Until the loss function Loss(hw)Loss(h_w) reaches the global minimum
    • to change the individual components of ww a little bit in the direction that minimises Loss(hw)Loss(h_w), and to do this many times.
  • wi    wi+αj=1mxj,i(yjhw(xj))w_i \;\leftarrow\; w_i + \alpha \sum_{j=1}^{m} x_{j,i} \Big( y_j - h_w(x_j) \Big)
    • α\alpha: the step size, the learning rate
  • Training model: the process of iteratively updating weights with a learning rate to minimise loss, where the final weight vector defines the model used for predicting new data.
  • use regularisation on a multivariate linear function to avoid overfitting.
  • Batch gradient descent: consider the entire training dataset (X,y)(X, y) at once.
    • w0    w0+αj=1m(yj(w0+w1xj))w_0 \;\leftarrow\; w_0 + \alpha \sum_{j=1}^{m} \Big(y_j - (w_0 + w_1 x_j)\Big)
    • w1    w1+α(j=1m(yj(w0+w1xj))xj)w_1 \;\leftarrow\; w_1 + \alpha \Big(\sum_{j=1}^{m} (y_j - (w_0 + w_1 x_j)) \cdot x_j\Big)
  • Stochastic gradient descent (SGD): consider only a single training data sample (xj,yj)(x_j, y_j) at a time.
    • w0w0+α(yj(w0+w1xj))w_0 \leftarrow w_0 + \alpha \big( y_j - (w_0 + w_1 x_j) \big)
    • w1w1+α((yj(w0+w1xj))xj)w_1 \leftarrow w_1 + \alpha \big( (y_j - (w_0 + w_1 x_j)) \cdot x_j \big)
    • can be used in an online setting, where new data is coming one at a time, or offline, where we cycle through the same data as many times as is necessary, taking a step after considering each single example.
    • With a fixed learning rate α\alpha, the stochastic version does not guarantee convergence.
    • often faster than batch gradient descent.
    • With a schedule of decreasing learning rates (SA), the stochastic version does guarantee convergence.
  • These update rules are derived as the next weight update equations by taking the partial derivatives of the loss function with respect to w0w_0 and w1w_1.

Logistic Regression

an extension of linear regression in such a way that the output of a linear regression model goes through a logistic function

  • y(x)=11+exy(x) = \frac{1}{1 + e^{-x}}
  • The output value of this logistic function is between 0 and 1.
  • 0 is for certainly being labeled "0" and 1 is for certainly being labeled "1", and a value between 0 and 1 represents the probability of being labeled "1"
  • a logistic regression model: a linear regression model + a logistic function
  • mainly for solving classification problems

Nearest Neighbor

a technique to predict the output of a given new sample based on a collection of existing samples.

  • is to find the k-nearest neighbours of given sample in the collection and determine the output based on these k neighbours.
  • k is always chosen to be an odd number.
  • can be used for both classification and regression problems.
    • For classification: majority vote of the neighbours.
    • For regression: mean/median (or regression) of the neighbours.
  • Instance-based learning
    • KNN does not learn a separate model.
    • Instead, it stores all training data and uses them directly at prediction time.
  • Non-parametric model
    • KNN has no parameters (like weights in linear regression) to train.
    • The model is essentially the full dataset plus a distance measure.

Distance measures

  • Minkowski distance or LpL^p norm
    • Lp(xj,xq)=(ixj,ixq,ip)1/pL^p(x_j, x_q) = \left( \sum_i |x_{j,i} - x_{q,i}|^p \right)^{1/p}
    • Euclidean distance: p=2p = 2, for the dimensions are measuring similar properties, such as the width, height and depth of 3D objects.
    • Manhattan distance: p=1p = 1, for the dimensions are measuring dissimilar properties, such as age, weight, and gender of a patient.
    • Hamming distance: the number of attributes on which the two points differ, for Boolean attribute values

Nomarlization

  • use the raw data from each dimension then the total distance will be affected by a change in scale in any dimension
  • To avoid this, apply normalization to the measurements in each dimension.
  • to compute the mean μi\mu_i and standard deviation σi\sigma_i of the values in each dimension, and rescale them
  • The rescaling is done using the formula:
    • xj,i=xj,iμiσix'_{j,i} = \frac{x_{j,i} - \mu_i}{\sigma_i} where xj,ix'_{j,i} is the normalized value, xj,ix_{j,i} is the original value, μi\mu_i is the mean, and σi\sigma_i is the standard deviation.

Time complexity

  • Conceptually trivial: Given a set of N examples and a query xqx_q, iterate through the examples, measure the distance to xqx_q from each one, and keep the best k.
  • NN(k,xq)NN(k, x_q)'s time complexity is O(N)O(N), N is the number of examples in the training dataset.
  • Use a k-dimensional tree: a balanced binary tree with an arbitrary number of dimensions.
    • Time complexity can be improved to O(logN)O(\log N)
    • appropriate only when there are many more examples than dimensions
    • It works well with up to 10 dimensions with thousands of examples.
  • Use a Hash table with a locality-sensitive hash (LSH)
    • Time complexity can be improved to O(1)O(1)

SVM

a framework for finding a boundary that distinctly classifies the data points in an optimal way.

  • supervised learning, binary classification
  • SVM chooses the boundary with the maximum possible geometric margin, which has the largest distance to the nearest training data points of any class
  • initially designed for binary classification problems but can also be applied for solving multi-class classification problems

Linear discriminant

  • XiX_i is multiplied by its matching weight wiw_i
  • all these products are added together and passed to a threshold function
  • Decision surface: if g(x)=wx>0g(x) = w \cdot x \gt 0 then f(x)=+1(class1)f(x) = +1 (class1) else f(x)=1(class2)f(x) = -1 (class2)
  • Decision function: f(x)=sign(g(x))=sign(w0+w1x)f(x) = \text{sign}(g(x)) = \text{sign}(w_0 + w_1x)
    • To make a decision, the continuous value g(x)g(x) is passed through the sign function so that it outputs either +1 or -1.
  • If the data from the two classes can be separated with a hyperplane, linearly separable.

Hyperplane

  • separates the data in 2D by a line or in 3D by a plane
  • The orientation of the hyperplane is given by the vector ww
  • the location of the hyperplane is given by w0w_0
  • The distance from the origin to the hyperplane is w0w\frac{|w_0|}{\|w\|}
  • If a given data sample xx^* and g(x)=0g(x^*) = 0, then this data sample is on the separation boundary. It can normally be assigned to any class.
  • geometric margin: the minimum distance between the samples and the hyperplane by constructing and solving a constrained optimization problem
    • γi=yi(wwxi+w0w)\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \right)
  • primary optimization problem: to maximize the minimal geometric distance across the training dataset of m samples.
    • maxw,w0(mini=1,,Nγi)=maxw,w0(mini=1,,N(yi(wwxi+w0w)))\max_{w, w_0} \Big( \min_{i=1,\ldots,N} \gamma_i \Big) = \max_{w, w_0} \Big( \min_{i=1,\ldots,N} \Big( y_i \Big( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \Big) \Big) \Big)
    • minw,w0  12w2\min_{w, w_0} \; \frac{1}{2}\|w\|^2
    • s.t.   yi(wxi+w0)mini=1,,N(yi(wxi+w0))\text{s.t. } \; y_i (w \cdot x_i + w_0) \geq \min_{i=1,\ldots,N} \big( y_i (w \cdot x_i + w_0) \big)
  • dual optimization problem: easier to solve. More importantly the dual optimisation problem enables the so-called kernel trick in SVM
    • maxα  i=1Nαi12i=1Nj=1Nαiαjyiyj(xixj)\max_{\alpha} \; \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
    • minα  12i=1Nj=1Nαiαjyiyj(xixj)    i=1Nαi\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) \;-\; \sum_{i=1}^N \alpha_i
    • s.t.i=1mαiyi=0,αi0,i=1,2,,N\text{s.t.} \quad \sum_{i=1}^m \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i=1,2,\ldots,N

Attractive Properties

  • SVMs construct a maximum margin separator
    • the largest possible distance to example points, helping to improve generalization
  • SVMs create a linear separating hyperplane
    • kernel trick: to embed the data into a higher-dimensional space
    • Often data that are not linearly separable in the original input space are easily separable in a higher-dimensional space
    • In general (excepted some special cases) if we have NN data points then they will always be separable in spaces of NN dimensions or more
  • SVMs are a nonparametric method
    • retain training examples and potentially need to store them all
    • In practice, they often end up retaining only a small fraction of examples
    • have the flexibility to represent complex functions, but they are resistant to overfitting
  • not usually expect to find a linear separator in the input space xx, but we can find linear separators in the high-dimensional feature space F(x)F(x) simply by replacing (xjxk)(x_j x_k) in
    • argmaxαjαj12j,kαjαkyjyk(xjxk)argmax_{\alpha} \sum_{j}\alpha_j - \frac{1}{2} \sum_{j,k}\alpha_j \alpha_k y_j y_k (x_j \cdot x_k)
      • with K(xj,xk)=F(xj)F(xk)K(x_j, x_k) = F(x_j) \cdot F(x_k)
      • F(xj)F(xk)F(x_j) \cdot F(x_k) can often be computed without first computing FF for each point.
  • In a higher dimensional feature space, which is created by transformation F(x)F(x), if we can express K(xjxk)=F(xj)F(xk)K(x_j \cdot x_k) = F(x_j) \cdot F(x_k), the kernel function K(xjxk)K(x_j \cdot x_k) can be applied to pairs of input data to evaluate dot product in some corresponding feature space.
    • kernel trick is to plug a kernel function K(xjxk)K(x_j \cdot x_k) into the dual optimisation problem to replace (xjxk)(x_j \cdot x_k)
    • Optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.
    • we can learn in the higher-dimensional space, but we compute only kernel functions rather than the full list of features for each data point.

Classification evaluation metrics

용어설명
True Positive (TP)실제 1, 예측 1
True Negative (TN)실제 0, 예측 0
False Positive (FP)실제 0, 예측 1 (0을 잘못 1로 예측)
False Negative (FN)실제 1, 예측 0 (1을 놓쳐서 0으로 예측)

Accuracy

the proportion of correctly classified instances (data points or samples) among the total instances.

Accuracy=TP+TNTotal number of predictionsAccuracy = \frac{TP + TN}{\text{Total number of predictions}}

  • a ratio of the number of correct predictions
  • it may not be suitable for imbalanced datasets where the class distribution is skewed
    • a model that always predicts the majority class will have high accuracy but may not be useful in practice.

Precision

the proportion of true positives among all positive predictions.

Precision=TPnumber of positive predictions=TPTP+FPPrecision = \frac{TP}{\text{number of positive predictions}} = \frac{TP}{TP + FP}

  • the model's ability to not mistakenly view negatives as positives
  • A high precision value indicates that the model has made fewer false positive predictions.
  • it's useful to minimize the number of false positives.

Recall

Sensitivity, True Positive Rate, the proportion of true positive instances among the actual positive instances

Recall=TPnumber of positive instances=TPTP+FNRecall = \frac{TP}{\text{number of positive instances}} = \frac{TP}{TP + FN}

  • the model's ability to not mistakenly view actual positives as negatives
  • A high recall value indicates that the model has successfully identified a large portion of the actual positive instances
  • it's useful when the cost of false negatives is high
    • e.g. in medical diagnosis, where failing to identify a disease can have severe consequences

F1 Score

the harmonic mean of precision and recall

F1=2PrecisionRecallPrecision+RecallF1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

  • a balanced evaluation of the model's performance
  • It is particularly useful when dealing with imbalanced datasets, where one class is significantly more prevalent than the other.
  • It's useful for imbalanced datasets to balance precision and recall.

Regression evaluation metrics

Mean Absolute Error (MAE)

the average of the absolute differences between the predicted values and the actual values

MAE=1ni=1nyiy^iMAE = \frac{1}{n} \sum_{i=1}^n \lvert y_i - \hat{y}_i \rvert

  • It measures the average magnitude of errors made by the model, without considering their direction
  • A lower MAE value indicates that the model has made smaller prediction errors

Mean Squared Error (MSE)

the average squared difference between the predicted and actual values

MSE=1ni=1n(yiy^i)2MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2

  • It's useful when you want to penalise larger errors more heavily, making it more sensitive to outliers at the same time
  • often used as a loss function when training regression models
  • A lower MSE value indicates that the model has smaller prediction errors, with a strong preference for avoiding large errors.

Root Mean Squared Error (RMSE)

the square root of MSE

RMSE=MSERMSE = \sqrt{MSE}

  • it more interpretable as it is in the same units as the dependent variable.

R-Squared (Coefficient of Determination)

how well the regression model approximates the actual data

R2=1SSRSST=1i=1n(yiy^i)2i=1n(yiyˉ)2R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}

  • the proportion of "sum squared regression (SSR)" and "total sum of squares (SST)"
    • SSR obviously captures the model's prediction errors
    • SST is the variance of the target variable.
      • can be viewed as a naive model (y^=yˉ\hat{y} = \bar{y}) using the average value of the target variable as the prediction
  • R2=1R^2 = 1: The model predicts perfectly (the error is 0).
  • R2=0R^2 = 0: The model does no better than a naive model that always predicts the mean of the target variable.
  • R2<0R^2 < 0: The model performs worse than simply predicting the mean, meaning its predictions increase the error compared to the naive baseline.

IAI +003

· 약 5분

Local Search Problem

To find the state that gives the optimal/best value of the evaluation function

  • It can be seen as an optimization problem.
  • a computational problem that finds the best solution (a state) that satisfies the given constraints
  • evaluation function === objective function
  • Only cares about the optimal solution/best state without considering the paths to reach the best state (the optimal solution)
  • Not systematic

Feasible region & solution

  • Feasible region: the set of all possible or candidate solutions which are the solutions that satisfies the problem's constraints
  • Feasible solution: a solution in the feasible region

Search Problem vs Local Search Problem

Path-based vs State-based

AspectsSearch ProblemLocal Search Problem
StateAll possible states - state-space landscapeRange of decision variables and constraints
GoalGoal state & goal testEvaluation function & objective function
EvaluationMeasure closeness to goal - distance/fitnessMinimize cost or maximize fitness
Transition/SuccessorTransition functionSuccessor function

Discrete & Continuous Optimization

  • Discrete optimization: optimization problems where the solution space is discrete (e.g., 8 queens problem)
  • Continuous optimization: optimization problems where the solution space is continuous (e.g., real numbers, any value within a range)
  • All possible states: state-space landscape
  • Transition function: To find neighbor or successor state
  • Goal state
  • Objective function: A way to measure how close to the goal state
  • Start state

Search state-space

  • Global Maximum: A state that maximizes the objective function over the entire state space
  • Local Maximum: A state that maximizes the objective function within a small area around it.
  • Plateau: A state such that the objective function is constant in an area around it.
    • Shoulder: A plateau that has uphill edge.
    • Flat: A plateau whose edges go downhill.

Advantages

  • use little memory
  • can often find reasonably good solution in large or infinite search spaces
  • useful for solving pure optimization problems
  • don't need to know the path to the solution.

Hill climbing

keeps track of one current state and on each iteration moves to the neighboring state with highest value.

  • f=max(cost(X))f = max(-cost(X))
  • Steps
    • Evaluate the initial stat
    • If it is equal to the goal state, return. Otherwise, continue.
    • Find a neighboring state
    • Evaluate this state. If it is closer to the goal state than before, replace the initial state with this state.
    • Repeat steps 2-4 until it reaches a goal state (local or global maximum) or runs out of time.
  • No search tree, No backtracking, Don't look ahead beyond the current state.
    • get stuck due to local maxima, plateaus, or ridges.

Variations of HC

  • Simple HC: greedy local search which expands the current state and moves on to the best neighbor.
  • Stochastic HC: choose randomly among the neighbors going uphill.
  • First-choice HC: generate random successor until one is better. Good for states with high numbers of successors.
  • Random restart: conducts a series of hill climbing searches from random initial states until a goal state is found.

Simulated Annealing

based upon the annealing process to model the search process for finding an optimal solution to an optimisation problem

  • annealing schedule, temperature, energy
  • finds the minimal value of the objective function (energy function)
  • starts with a high temperature and then gradually reduces the temperature
  • P=eΔE/kTP = e^{-\Delta E / kT}
    • ΔE\Delta E: how bad the new state is compared to the old state
    • TT: temperature is getting lower over time
    • kk: a scaling factor
  • Swap condition: ΔE<=0\Delta E <= 0 or ΔE/kT>random{-\Delta E / kT} > \text{random}

Evolutionary algorithms

  • Local beam search
  • Stochastic beam search
  • Genetic algorithms

Characteristics

  • size of the population
  • representation of each individual
  • mixing number
  • selection process for selecting the individuals who will become the parents of the next generation
  • recombination procedure
  • mutation rate
  • makeup of the next generation

Genetic algorithm

It uses operators, such reproduction, crossover and mutation, inspired by the natural evolutionary principles.

  • State: is represented by an individual in a population. Traditional representation is a chromosome
  • Objective function: is used to evaluate the fitness of an individual (= fitness function, 적합도 함수)
  • Successor function: consists of three operators: reproduction, crossover, and mutation
  • Solution: is found through evolution from one generation to another generation

Genetic Algorithm

Roulette Wheel Selection

  • Compute total fitness of all individuals.
    • Example: A=30, B=20, C=40, D=10 → Total = 100.
  • Calculate probability of each individual being selected
    • Formula: P(i)=fitness(i)total_fitnessP(i) = \frac{fitness(i)}{total\_fitness}
      • A = 30/100 = 0.30
      • B = 20/100 = 0.20
      • C = 40/100 = 0.40
      • D = 10/100 = 0.10
  • Convert to cumulative probabilities
    • P4 = 0.10
    • P4 + P3 = 0.50
    • P4 + P3 + P2 = 0.90
    • P4 + P3 + P2 + P1 = 1.00
  • Generate a random number between 0 and 1.
  • Select an individual based on the random number and cumulative probabilities.

Roulette Wheel Selection

  • ⚫ random = 0.07 → falls in P4 [0, 0.10)
  • 🔺 random = 0.37 → falls in P3 [0.10, 0.50)
  • ⬟ random = 0.82 → falls in P2 [0.50, 0.90)

Applications of GA

  • Parameter tuning: optimize the parameters in NN
  • Planning: economic dispatch, train timetabling
  • Design & Control problems: robotic control, adaptive control systems
  • Successful use of GA requires careful engineering of the representation