Blog | gracefullight.dev

FDA +010

2025년 10월 8일 · 약 4분

Eunkwang Shin

Owner

SVM

a supervised learning framework for finding a boundary between data points belonging to different classes.

Convex Hull

the lines surrounding the outermost points of each class.
since the classes are linearly separable, convex hulls do not intersect.

Margins

the distance between data points of different classes seprarated by a hyperplane.
multiple possible hyperplanes can separate the classes, but the optimal hyperplane maximizes the margin.
Maximum Margin Hyperplane
- a separation line/plane orthogonal to the shortest line connecting the convex hull.
- the line that is farthest apart from each convex hull.
- it separates the data points with the widest margin.

Kernels

kernel is used to represent kernel functions
- which are used to convert low-dimensional space to high-dimensional space
- by applying a function on low-dimensional data points to come up with higher dimensions
  - which can be used to linearly separate the data points of classes using a hyperplane.
the function of a kernel is to take data as input and transform it into the required form.
- different algorithms use different types of kernel functions.
- linear, non-linear, polynomial, radial basis function (RBF), and sigmoid.

Kernel trick

Basic idea of SVM and Kernel trick is to find the plane which can separate, classify or split the data with maximum margin as possible.
The distance from a point $(x_0, y_0)$ to a line $Ax + By + C = 0$ is given by the formula: $\text{Distance} = \frac{|Ax_0 + By_0 + C|}{\sqrt{A^2 + B^2}}$
The distance between $H_0$ $H_{0}$ , $H_1$ $H_{1}$ is then: $|w \cdot x + b| / ||w|| = 1/||w||$ $∣ w \cdot x + b ∣/∣∣ w ∣∣ = 1/∣∣ w ∣∣$ .
- The total distance between $H_1$ and $H_2$ is $2/||w||$ .
- $H_1: w \cdot x + b = 1$
- $H_2: w \cdot x + b = -1$ .
$\text{Distance} = \frac{|(+1) - (-1)|}{||w||} = \frac{2}{||w||}$
to maximiza the margin, we need to minimize $||w||$
this can be solved through Lagrangian formula or Lagrangian multipliers. $L = \frac{1}{2} ||w||^2 - \sum_{i=1}^{n} \alpha_i [y_i (w \cdot x_i + b) - 1]$
Setting the gradient of $L$ $\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i y_i x_i$ $\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0$
The dual form of the optimization problem is: $\text{Maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
Capital letters such as A, X, Y denote matrices;
Greek letters such as $\Phi$ , $\kappa$ denote functions;
lower-case bold letters such as $a$ , $b$ denote vectors;
script letters such as $\mathcal{A}$ , $\mathcal{B}$ , $\mathcal{V}$ , $\mathcal{E}$ denote sets or spaces;
$||a||$ denotes the $L^2$ norm of vector;
$||x||_F$ denotes the Frobenius norm of matrix, and is given by $||x||_F = \big(\sum_{i,j} |x_{ij}|^2\big)^{1/2}$ ;

Linear Kernel

$K(x_i, x_j) = x_i \cdot x_j$

Gaussian Kernel / RBF Kernel

$K(x_i, x_j) = \exp\Big(-\frac{||x_i - x_j||^2}{2\sigma^2}\Big)$

Polynomial Kernel

$K(x_i, x_j) = (x_i \cdot x_j + 1)^h$

Sequential Minimal Optimization (SMO)

an algorithm for solving the quadratic programming (QP) problem that arises during the training of support vector machines (SVMs).

Advantages

Overfitting is unlikely
- SVM uses the maximum margin hyperplane, which is relatively stable.
- the maximum margin hyperplane is only sensitive to the changes in the support vectors.
- the variance in the data may have a relatively low effect on the performance
Computational complexity
- since every time we need to classify a new sample we need to examine that sample with all the support vectors.
- Using the kernel trick can help to alleviate this by calculating the dot product before doing the nonlinear mapping.

Disadvantages

not perform very well when the data set has more noise
where the number of features for each data point exceeds the number of training data samples, the SVM will underperform.

Applications

Image classification
Handwriting recognition
Bioinformatics
Text classification
Top 10 SVM Applications

Vocabulary for AI 010

2025년 10월 4일 · 약 2분

Eunkwang Shin

Owner

Vocabulary & Expressions

Term/Expression	Definition	Simpler Paraphrase	Meaning
cavity	a hollow space within a solid object	hollow space	구멍, 공동
marginalization	treatment of a person, group, or concept as insignificant or peripheral	sidelining	주변화
abbreviate	shorten a word, phrase, or text	shorten	줄이다, 생략하다
summation	the process of adding things together	addition	합계, 총합
multivalue	having multiple values or meanings	multiple values	다중 값
quantify	express or measure the quantity of something	measure	수량화하다
derive	obtain something from (a specified source)	obtain	끌어내다, 얻다
meningitis	inflammation of the protective membranes covering the brain and spinal cord	brain inflammation	수막염
casual	relating to or showing the cause of something	cause-related	인과 관계의
epidemic	a widespread occurrence of an infectious disease in a community at a particular time	outbreak	유행병
proportionately	in a way that corresponds in size or amount to something else	correspondingly	비례하여
aching	experiencing a continuous or prolonged dull pain	painful	아픈, 쑤시는
soup sth up	modify or improve something to make it more powerful or effective	enhance	성능을 향상시키다
perpendicular	at an angle of 90 degrees to a given line, plane, or surface	at right angles	수직의
wiggly	having many curves or bends	curvy	구불구불한
versatility	ability to adapt or be adapted to many different functions or activities	adaptability	다재다능함
interpolate	estimate or insert (a value or function) between two known values in a sequence	estimate between	보간하다
commutativity	the property that the order of applying an operation does not change the result	order-independence	교환법칙
associativity	the property that the grouping of operations does not change the result	grouping-independence	결합법칙
contraposition	a logical operation that involves reversing and negating both the hypothesis and conclusion of a conditional statement	reverse and negate	대우
distributivity	the property that an operation can be distributed over another operation	distribution property	분배법칙

Vocabulary for AI 009

2025년 9월 27일 · 약 6분

Eunkwang Shin

Owner

Vocabulary & Expressions

Term/Expression	Definition	Simpler Paraphrase	Meaning
canonical	conforming to a general rule or acceptable procedure	standard	정통의, 표준의
assert	state a fact or belief confidently and forcefully	declare	단언하다, 주장하다
lurking	remaining hidden so as to wait in ambush	hidden	숨어있는, 잠복하는
bleak	(of an area of land) lacking vegetation and exposed to the elements	desolate	황량한, 적막한
albeit	although	though	비록 ~일지라도
stench	a strong and very unpleasant smell	foul smell	악취
woeful	characterized by, expressive of, or causing sorrow or misery	sorrowful	슬픈, 비참한
As for	with regard to; concerning	regarding	~에 관하여
ignorance	lack of knowledge or information	unawareness	무지, 무식
utterly	completely and without qualification; absolutely	completely	완전히, 전적으로
prudent	acting with or showing care and thought for the future	wise	신중한, 현명한
arithmetic	the branch of mathematics dealing with the properties and manipulation of numbers	math	산수, 계산
entailment	a relationship between sentences in which one sentence logically follows from one or more others	implication	함축, 수반
syntactic	relating to the arrangement of words and phrases to create well-formed sentences in a language	grammatical	구문의, 통사론의
stands for	represents or signifies	represents	~을 나타내다
mnemonic	a device such as a pattern of letters, ideas, or associations that assists in remembering something	memory aid	기억을 돕는 장치
parentheses	a pair of round brackets () used to mark off a parenthetical word or expression	brackets	괄호
negation	the contradiction or denial of something	denial	부정, 반대
antecedent	a thing or event that existed before or logically precedes another	predecessor	선행사, 앞서는 것
precedence	the condition of being considered more important than someone or something else; priority in importance, order, or rank	priority	우선, 우선권
disjuncts	a word or phrase that is grammatically independent of the other parts of the sentence in which it occurs	separate part	분리된 부분
causation	the action of causing something	causing	인과, 원인 제공
decidedly	in a manner that is clear and definite	clearly	단호하게, 명확히
suffice	be enough or adequate	be sufficient	충분하다
tautology	a statement that is true by necessity or by virtue of its logical form	redundancy	동어 반복, 자명한 진리
conversely	introducing a statement or idea that reverses one that has just been made or referred to	in contrast	반대로
contrapositively	in a way that involves the contrapositive of a statement	by contrapositive	대우적으로
refutation	the action of proving a statement or theory to be wrong or false; disproof	disproving	반박, 논박
monotonicity	the property of a function to be either entirely non-increasing or non-decreasing	consistency	단조성
resolvent	a clause obtained by resolving two clauses containing complementary literals	derived clause	해석절
soundness	the quality of being based on valid reasoning or good judgment	validity	타당성
yield	produce or provide (a natural, agricultural, or industrial product)	produce	산출하다, 양보하다
ontological	relating to the branch of metaphysics dealing with the nature of being	existential	존재론의
commitment	the state or quality of being dedicated to a cause, activity, etc.	dedication	헌신, 약속
pedagogical	relating to teaching	educational	교육의, 교수법의
arity	the number of arguments or operands that a function or operation takes	number of arguments	(함수의) 인수 개수
surrogate mothers	women who carry and give birth to a child for another person or couple	gestational carriers	대리모
predicate	a symbol or function that represents a property or relation	property	술어, 속성, 명제함수
kinship	blood relationship	family relationship	친족 관계
theorem	a general proposition not self-evident but proved by a chain of reasoning; a truth established by means of accepted truths	proven statement	정리, 명제
existentially	relating to existence	relating to existence	존재에 관한
Universal	a type of quantification that states that a predicate holds for all members of a specified set	for all	전체에 대한 한정
existential	a type of quantification that states that a predicate holds for at least one member of a specified set	there exists	존재에 대한 한정
latent	existing but not yet developed or manifest; hidden or concealed	hidden	잠재적인, 숨어있는

Knowledge Base

TELL, ASKS, TELL
- TELLs the knowledge base what it preceives
- ASKs the knowledge base what action it should perform
  - reasoning may be done about the current state of the world
  - about the outcomes of possible action sequences and so on
- TELLs the knowledge base which action was chosen, and returns the action so that it can be executed
MAKE-PERCEPT-SENTENCE
- constructs a sentence asserting that the agent preceived the given percept at the given time.
MAKE-ACTION-QUERY
- constructs a sentence that asks what action should be done at the current time.
MAKE-ACTION-SENTENCE
- constructs a sentence asserting that the chosen action was executed.

Logical connectives

BNF (Backus–Naur Form) grammar of sentences Symbols from Logic and Set Theory

$\neg$ $\neg$ : negation (NOT), $\neg W_{1,3}$ $\neg W_{1, 3}$
- A literal is either an atomic sentence (a positive literal) or a negated atomic sentence (a negative literal).
$\land$ $\land$ : conjunction (AND), $W_{1,3} \land P_{3, 1}$ $W_{1, 3} \land P_{3, 1}$
- its parts are the conjuncts.
$\lor$ $\lor$ : disjunction (OR), $(W_{1,3} \land P_{3,1}) \lor W_{2,2}$ $(W_{1, 3} \land P_{3, 1}) \lor W_{2, 2}$
- its parts are the disjuncts.
$\implies$ $⟹$ : implication (IMPLIES), $(W_{1,3} \land P_{3,1}) \implies \neg W_{2,2}$ $(W_{1, 3} \land P_{3, 1}) ⟹ \neg W_{2, 2}$
- its premise or antecedent, and its conclusion or consequent is the part that follows the $\implies$ .
- Implications are also called rules or if-then statements.
- Sometimes written as $\rightarrow$ or $\supset$ .
$\iff$ : biconditional (IF AND ONLY IF), $W_{1,3} \iff \neg W_{2,2}$

Truth tables

$P$	$Q$	$\lnot P$	$P \land Q$	$P \lor Q$	$P \implies Q$	$P \iff Q$
false	false	true	false	false	true	true
false	true	true	false	true	true	false
true	false	false	false	true	false	false
true	true	false	true	true	true	true

Logical equivalence: $P \equiv Q$ :
Validity: a sentence is valid if it is true in all models.
- Deduction Theorem: For any sentences $\alpha$ and $\beta$ , $\alpha \models \beta$ if and only if $(\alpha \implies \beta)$ is valid.
Satisfiability: a sentence is satisfiable if it is true in, or satisfied by, some model.
Modus Ponens: $\frac{\alpha \implies \beta, \alpha}{\therefore \beta}$ $\frac{α ⟹ β , α}{∴ β}$
- whenever any sentences of the form $\alpha \implies \beta$ and $\alpha$ are given, then the sentence $\beta$ can be inferred.
And-Elimination: $\frac{\alpha \land \beta}{\therefore \alpha}$ $\frac{α \land β}{∴ α}$
- from a conjunction, one of the conjuncts can be inferred.
monotonicity: the set of entailed sentences can only increase as information is added to the knowledge base.
- inference rules can be applied whenever suitable premises are found in the knowledge base, regardless of what other sentences are present.

IAI +008

2025년 9월 22일 · 약 13분

Eunkwang Shin

Owner

NLP

Tokenization

The process of dividing a text into a sequence of words.

Different models uses different tokenization methods.

BERT TOKEN	BERT ID	GPT TOKEN	GPT ID
my	2026	My	3666
grandson	7631	Ġgrandson	31845
loved	3866	Ġloved	10140
it	2009	Ġit	340
!	999	Ġ!	0
so	2061	Ġso	1406
much	2172	Ġmuch	881
fun	4569	Ġfun	1257
!	999	Ġ!	0

Corpus

a large and structured collection of text

a corpus typically consists of at least a million words of text
at least tens of thousands of distinct vocabulary words.

Text classification

the process of categorizing text into organized groups.
text classifiers can automatically analyze text and then assign a set of predefined tags or categories based on its content.
machine learning approach
- Features
  - BoW
  - TF-IDF
- Features + classifier
  - Logistic regression
  - SVM
  - Naive Bayes
Deep learning approach
- Neural models: CNNs (capture local n-grams)
- RNNs, LSTMs (sequence-aware)
- Transformers (e.g. BERT)
  - Contextual embedding
Modern enhancements
- Embedding + classifier: Pretrained embeddings (Word2Vec) + classifier
- Fine-tuned transformers: BERT fine-tuned

Bag of Words, BoW

it represents a text as a bag of its words
we can understand the meaning of a document from its content (words) their multiplicity (frequency, number of occurrences)
mainly used as a tool of feature extraction.
limitations
- ignores syntax and the context
- disregarding grammar
- discards word order - words are independent of each other
- considers only the meanings of the words in the sentence

# examples

Example1 = "He likes to watch movies. Mary likes movies too."
Example2 = "Mary also likes to watch football games."

# Vocabulary
Vocab = {"He", "likes", "to", "watch", "movies", "Mary", "also", "football", "games"}

# BoW representation
BoW1 = {He: 1, likes: 2, to: 1, watch: 1, movies: 2, Mary: 1, also: 0, football: 0, games: 0}
BoW2 = {He: 0, likes: 1, to: 1, watch: 1, movies: 0, Mary: 1, also: 1, football: 1, games: 1}

[[1,2,1,1,2,1,0,0,0], [0,1,1,1,0,1,1,1,1]] 

TF-IDF

Term Frequency - Inverse Document Frequency

Model based on the statistics of word counts.
idea is that key terms and important ideas are likely to repeat.
includes a scoring function that measure the relevance of a document to a query.
- the function takes a document with a corpus and a query as input and returns a numeric score.
- the doucments that have the highest scores are considered as the most relevant documents.
Term Frequency
- $TF(q_i, d_j, D)$
Inverse Document Frequency
- $IDF(q_i, D) = \log\frac{N}{DF(q_i, D)}$ $I D F (q_{i}, D) = lo g \frac{N}{D F ( q _{i} , D )}$
  - $DF(q_i, D)$ : number of documents in the corpus $D$ that contain the term $q_i$
$N = |D|$ : total number of documents in the corpus $D$
TF-IDF score
- $TFIDF(q_i, d_j, D) = TF(q_i, d_j) \cdot IDF(q_i, D)$

Examples of TF-IDF

Document 1: "John likes to watch movies."
Document 2: "Mary likes movies too."
Document 3: "John also likes football."

Step	Term	Document	Number of times term `t` appears in document `d`	Total number of terms in document `d`	`TF(t, d)`	Total number of documents	Number of documents containing the term `t`	`IDF(t)`	`IDF(t) (base 10)`	`TF-IDF(t,d)`	`TF-IDF(t,d)` (base 10)
1	t=likes	d1	1	5	0.2	3	3	0	0	0	0
	t=likes	d2	1	4	0.25					0	0
	t=likes	d3	1	4	0.25					0	0
2	t=watch	d1	1	5	0.2	3	1	1.098612289	0.477121255	0.219722458	0.095424251
	t=watch	d2	0	4	0					0	0
	t=watch	d3	0	4	0					0	0
3	t=mary	d1	0	5	0	3	1	1.098612289	0.477121255	0	0
	t=mary	d2	1	4	0.25					0.274653072	0.119280314
	t=mary	d3	0	4	0					0	0
4	t=football	d1	0	5	0	3	1	1.098612289	0.477121255	0	0
	t=football	d2	0	4	0					0	0
	t=football	d3	1	4	0.25					0.274653072	0.119280314

Naive Bayes

naive refers to a very strong simplifying assumption about the features
Conditional independence assumption
- Given class $Y$ , all the features $X = {X_1, X_2, ..., X_n}$ are assumed mutually independent.
$P(X | Y) = P(X_1, ..., X_n | Y) = \prod_{i=1}^{n} P(X_i | Y)$
Bayes rule
- $P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)} \propto P(Y) P(X | Y)$
- $P(Y | X_1, ..., X_n) \propto P(Y) \prod_{i=1}^{n} P(X_i | Y)$
Sentiment Analysis
- $p(C_k | x_1, ..., x_n) = \frac{1}{Z} p(C_k) \prod_{i=1}^{n} p(x_i | C_k)$ $p (C_{k} ∣ x_{1}, ..., x_{n}) = \frac{1}{Z} p (C_{k}) \prod_{i = 1}^{n} p (x_{i} ∣ C_{k})$
  - $Z = p(x) = \sum_{k} p(C_k) p(x | C_k)$ $Z = p (x) = \sum_{k} p (C_{k}) p (x ∣ C_{k})$
    - scaling factor for normalization of $p(C_k | x)$
- maximum a posteriori (MAP) decision rule
  - $\hat{y} = \argmax\limits_{k \in 1, ..., K}p(C_k) \prod_{i=1}^{n} p(x_i | C_k)$

Examples of Naive Bayes

$P(Class | w_{1:N}) = \alpha \cdot P(Class) \cdot \prod_{j} P(w_j | Class)$

Analyze the sentiment
- my grandson loved it
- $x_1 = \text{my}$ , $x_2 = \text{grandson}$ , $x_3 = \text{loved}$ , $x_4 = \text{it}$
- $C = {positive, negative}$
- $\hat{y} = \argmax\limits_{k \in \{positive, negative\}} p(C_k) \cdot p(x_1 | C_k) \cdot p(x_2 | C_k) \cdot p(x_3 | C_k) \cdot p(x_4 | C_k)$ $y^=k∈{positive,negative}argmaxp(Ck)⋅p(x1∣Ck)⋅p(x2∣Ck)⋅p(x3∣Ck)⋅p(x4∣Ck)$
  - positive case
    - $p(positive) = 0.49$
    - $p(my | positive) = 0.30$
    - $p(grandson | positive) = 0.01$
    - $p(loved | positive) = 0.32$
    - $p(it | positive) = 0.30$
    - $p(positive | x) \propto p(positive) \cdot p(my | positive) \cdot p(grandson | positive) \cdot p(loved | positive) \cdot p(it | positive) = 0.49 \times 0.30 \times 0.01 \times 0.32 \times 0.30 = 0.00014256$
  - nagative case
    - $p(negative) = 0.51$
    - $p(my | negative) = 0.20$
    - $p(grandson | negative) = 0.02$
    - $p(loved | negative) = 0.08$
    - $p(it | negative) = 0.4$
    - $p(negative | x) \propto p(negative) \cdot p(my | negative) \cdot p(grandson | negative) \cdot p(loved | negative) \cdot p(it | negative) = 0.51 \times 0.20 \times 0.02 \times 0.08 \times 0.4 = 0.00006528$
  - $Z = p(x) = \sum_{k} p(C_k) p(x | C_k)$ $Z = p (x) = \sum_{k} p (C_{k}) p (x ∣ C_{k})$
    - $= p(positive) \cdot p(my, grandson, loved, it | positive) + p(negative) \cdot p(my, grandson, loved, it | negative)$
    - $= 0.00014256 + 0.00006528 = 0.00020784$
    - the probability of $x$ , obtained by adding up its probabilities under each class.
  - $p(positive | x) = \frac{p(positive) \cdot p(my, grandson, loved, it | positive)}{Z} = \frac{0.00014256}{0.00020784} \approx 0.6867$
  - $p(negative | x) = \frac{p(negative) \cdot p(my, grandson, loved, it | negative)}{Z} = \frac{0.00006528}{0.00020784} \approx 0.3133$
- Decision
  - $\hat{y} = \argmax\limits_{k \in \{positive, negative\}} \{k = positive : 0.6867, k = negative : 0.3133\}$
  - The sentiment is positive.
Spam detection
- Probabilities learned from data:
  - $P(\text{Spam}) = 0.4$ , $P(\text{Ham}) = 0.6$
  - $P(\text{free}|\text{Spam}) = 0.8$ , $P(\text{free}|\text{Ham}) = 0.1$
  - $P(\text{win}|\text{Spam}) = 0.7$ , $P(\text{win}|\text{Ham}) = 0.05$
- New document: "free win"
  - Spam score: $0.4 \times 0.8 \times 0.7 = 0.224$
  - Ham score: $0.6 \times 0.1 \times 0.05 = 0.003$
    → Classified as Spam.

N-gram model

N-gram: a sequence of written symbols of length n
- unigram, bigram, trigram
the probability of each symbol is dependent only on the n-1 previous symbols.
$P(w_j|w_{1:j-1}) = P(w_j|w_{j-n+1:j-1})$
$P(w_1:N) = \prod_{j=1}^{N} P(w_j|w_{1:j-1}) \approx \prod_{j=1}^{N} P(w_j|w_{j-n+1:j-1})$

Examples of N-gram

$W_{1:N}$ $W_{1 : N}$ is "This article is on NLP"
- N=5
- Bigram (n-gram with n=2)

P("This article is on NLP")
/* full chain rule,  */
= P("This") // j = 1
  * P("article" | "This") // j = 2
  * P("is" | "This article") // j = 3
  * P("on" | "This article is") // j = 4
  * P("NLP" | "This article is on") // j = 5
/* bigram approximation */
= P("This") // j =1
  * P("article" | "This") // j = 2
  * P("is" | "article") // j = 3
  * P("on" | "is") // j = 4
  * P("NLP" | "on") // j = 5

Step (j)	Word (W_j)	Bigram	Trigram
1	This	$(P(\text{This}))$	$(P(\text{This}))$
2	article	$(P(\text{article} \mid \text{This}))$	$(P(\text{article} \mid \text{This}))$
3	is	$(P(\text{is} \mid \text{article}))$	$(P(\text{is} \mid \text{This, article}))$
4	on	$(P(\text{on} \mid \text{is}))$	$(P(\text{on} \mid \text{article, is}))$
5	NLP	$(P(\text{NLP} \mid \text{on}))$	$(P(\text{NLP} \mid \text{is, on}))$

Transformer

given a set of input vectors (tokens), attention lets each token look at the others and form a weighted average of them.
The weights are data-dependent.
The model learns the weights representing which tokens are relevant to which other tokens.

$X = \text{Input Embeddings},\qquad Q = X W_Q,\quad K = X W_K,\quad V = X W_V$

Scores: $S = \frac{QK^T}{\sqrt{d}}$ $S = \frac{Q K ^{T}}{d}$
- dot products between every query and every key
- the scale $\sqrt{d}$ keeps gradients stable
Weights: $A = Softmax(S)$ $A = S o f t ma x (S)$
- row-wise softmax
- each row sums to 1
Output: $Attention(Q, K, V) = A V$ $A tt e n t i o n (Q, K, V) = A V$
- each output token is a weighted sum of the value vectors, with weights determined by the attention scores.(how much to pay attention to each position)
Query (Q): What am I looking for?
Key (K): What is the label/address of the information I have?
Value (V): What is the actual information I want to convey?

Cross-Attention

look up relevant information in another sequence.
- e.g. decoder attending to encoder outputs in translation
$Q$ $Q$ from the current sequence, $K$ $K$ and $V$ $V$ from the other sequence.
- $Q = X_{\text{target}} W_Q, \quad K = X_{\text{source}} W_K, \quad V = X_{\text{source}} W_V$
Cross-attention = Which source tokens are relevant to this target token?
Self-attention = Which other tokens in this sequence are relevant to this token?
$head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)$
$MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O$

Word representation

One-hot representation

represents each word as a vector of length N.
where N is the size of the vocabulary.

vocab = {he, is, singing, she, dancing, stage}

she = [0, 0, 0, 1, 0, 0]
is = [0, 1, 0, 0, 0, 0]
singing = [0, 0, 1, 0, 0, 0]

limitations
- incredibly inefficient for large vocabularies
- not embed any intrinsic meaning of words
- unable to represent similarity between likely words
- the representation of documents is sparse vectors
  - can cause challenges in computation

Word Embedding

represents individual words as vectors in a low-dimensional continuous space.
a distributed representation of a word.
generate a unique value for each word while using smaller vectors compared with one-hot encoding.
common vector dictionaries
- word2vec
- Glove (Global Vectors)

Contextual embedding

generates different vectors for the same word based on its context.
the word "bank" would have different embeddings in the sentences:
- "He went to the bank to deposit money."
- "She sat by the river bank and enjoyed the view."
BERT or GPT use deep neural networks to process a sequence of tokens.
Each token's embedding is computed by considering the token itself, its position in the sequence and the surrounding tokens, context.
captures both semantic and syntactic role in that specific context.

Part of Speech (POS) tagging

lexical category or tag that indicates the grammatical role of a word in a sentence.
parts of speech allow language models to capture generalizations such as "adjectives often modify nouns" or "verbs often follow subjects".

From  the start , it  took  a   person
IN    DT  NN    , PRP VBD   DT  NN

with great qualities  to  succeed
IN   JJ    NNS        TO  VB

Tag	Description	Example
CC	Coordinating conjunction	and, but
CD	Cardinal number	one, two
DT	Determiner	the, a
EX	Existential there	there
FW	Foreign word	doppelgänger
IN	Preposition or subordinating conjunction	in, of
JJ	Adjective	big, old
JJR	Adjective, comparative	bigger, older
JJS	Adjective, superlative	biggest, oldest
LS	List item marker	1, 2, One
MD	Modal	can, will
NN	Noun, singular or mass	cat, tree
NNS	Noun, plural	cats, trees
NNP	Proper noun, singular	John, London
NNPS	Proper noun, plural	Smiths, Londons
PDT	Predeterminer	all, both
POS	Possessive ending	's, s'
PRP	Personal pronoun	I, you, he
PRP$	Possessive pronoun	my, your, his
RB	Adverb	quickly, very
RBR	Adverb, comparative	faster, better
RBS	Adverb, superlative	fastest, best
RP	Particle	up, off
SYM	Symbol	$, %, &
TO	to	to
UH	Interjection	oh, wow
VB	Verb, base form	be, have
VBD	Verb, past tense	was, had
VBG	Verb, gerund or present participle	being, having
VBN	Verb, past participle	been, had
VBP	Verb, non-3rd person singular present	talk, have
VBZ	Verb, 3rd person singular present	talks, has
WDT	Wh-determiner	which, that
WP	Wh-pronoun	who, what
WP$	Possessive wh-pronoun	whose
WRB	Wh-adverb	where, when
#	Pound sign	#
$	Dollar sign	$
,	Comma	,
.	Sentence-final punctuation	. ! ?

Example of POS tagging

Hidden Markov Model (HMM)
- takes in a temproral sequence of evidence observations
- predicts the lexical categories
Logistic regression
- build 45 different logistics regression models, one for each part of speech
- ask each model how probable it is that the example word is a member of that category, given the feature values for that word in its particular context.

Machine translation

translate a sentence from a source lnaguae to a trget language.
train an MT model: a large corpus of source/target sentence paris and hope that the trained MT model can accurately translate new sentences.
want to generate a target language sentence that corresponds to the source language sentence
the geneartion of each target word is conditional on the entire source sentence and on all previously generated target words.

Example of machine translation

a sequance-to-sequence model
- use two RNNs (LSTM)
attentional sequence-to-sequence model
- use attentino to create a context-based summarization of the source sentence into a fixed-dimension representation
transformer-based model
- encoder: reads the source sentence and turns it into a rich, contextual set of vectors.
- decoder: generates the target sentence one token at a time, using what it has generated so far and the encoder's representations.

Text generation

a subfield of NLP
leverages knowledge in computational linguistics and AI to automatically generate natural language texts
can satisfy certain communicativa requirements

Example of text generation

Classifier based on word embeddings, e.g. RNN and LSTM
- RNN: each input word is encoded as a word embedding vector $x_i$ $x_{i}$ , a hidden layer $z_t$ $z_{t}$ , the classes are the words of the vocabulary
  - the output $y_t$ will be a softmax probability distribution over the possible values of the next word in the sentence.
- LSTM: can choose to remember som parts of the input, copying it over to the next time step, and to forget toher parts.
Pre-trained languaeg model using deep learning
- BERT
- GPT-X, Generativ ePre-trained Transformer

Transfer learning

experience with one learning task helps an agent learn better on another task.
pretraining: a form of transfer learning in which we use a large amount of shared general-domain language data to train an initial version of an NLP model.
- we can use a smaller amount of domain-specific data to refine the model
- the refined model can learn the vocabulary, idioms, syntactic structure, and other linguistic phenomena that are specific to the new domain.
For NN, learning consist of adjusting weight, so the most plausible approach for transfer learning is to copy over the weights learned for task A to a network that will be trained for task B.
- The weights are then updated by gradient descent in the usual way using data for task B.
the popularity of transfer learning is the availability of high-quality pretrained models.
will want to freeze the first few layers of the pretrained model
- these layers serve as feature detectors that will be useful for new model.
- new data set will be allowed to modify the parameters of the higher levels only
  - these are the layers that identify problem-specific features and do classification.

FDA +009

2025년 9월 21일 · 약 2분

Eunkwang Shin

Owner

Linear Separability

the data is linearly separable if
- it can be separated by a point on a single dimension line of data points
- by a line on a two-dimensional representation of data points
- by a plane (a two-dimensional surface) in a three-dimensional representation of the data points
If it is non-linearly separable, look at other options for classification.

Hyperplane

the conceptual divide between data
Weight vector: represented and generated in weight space.
Choosing the hyperplane
- Minimum distance between samples
- Least-squares method
- Gradient Descent

Artificial Neural Networks, ANN

Strength: for high dimensionality problems, the complex relations between variables
Weaknesses: theoretically complex, computationally intensive, needs large data sets, complicated to implement
Kinds of ANN
- Perceptrons, Multilayer Perceptrons
- Deep learning neural networks
- Kohonen networks
- Convolutional neural networks
- Radial Basis Functions
- Recurrent neural networks
- Support Vector Machines
- Competitive learning
- Boltzmann machines

Multilayer Perceptrons, MLP

Challenges
- Decide on the network topology.
  - how many hidden layers are needed
  - how many neurons in each of the hidden layers
- Find values for the weights which make the network produce the correct output values for the given input values.
Neural networks only accept numeric data.
- need to convert the categorical into numeric.
- One-Hot encoding, Thermometer encoding.
high values may need to be scaled into a similar range as neural networks
- need to do a log transform to pull the values into a target range.
- [-1, +1] or [0, 1]
input neurons should be as small as possible.
- adding neurons -> more parameters and weights -> amplify any bias. (overtrain the network)
one categorical attribute may have many attribute values
- each adding a parameter -> adding risk of overtraining

Resilient Propagation, RProp

directly adjust the weight step based on the local gradient information
introduces a weight update value $u_{ij}$ for each weight $w_{ij}$
updates it based on the sign of the partial derivative of the error with respect to the weight
update value $u_{ij}$ $u_{ij}$
- if sign changes (i.e. jumped over local minima) -> $u_{ij}$ slightly decreased.
- if sign remains the same -> $u_{ij}$ slightly increased.
weight $w_{ij}$ $w_{ij}$
- if derivative is $+ve$ (i.e. error increasing) -> $w_{ij}$ decreased by $u_{ij}$
- if the derivative is negative (i.e. error decreasing) -> $w_{ij}$ increased by $u_{ij}$
- if the derivative changes sign, the last weight update is reverted. (backtracks the last weight update)

KNIME

RProp MLP Learner + MultiLayerPerceptron Predictor
MultilayerPerceptron + Weka Predictor (back propagation with momentum)

FDA +008

2025년 9월 17일 · 약 8분

Eunkwang Shin

Owner

Mearues of performance

Accuracy

$A = \frac{TP + TN}{TP + TN + FP + FN}$

the ratio of correct predictions to all predictions
it is the number of true positives and true negatives (the correct predictions) divided by the total number of predictions.

Error rate

$E = \frac{FP + FN}{TP + TN + FP + FN}$

the ratio of incorrect predictions to all predictions
It is equivalent to $1 – Accuracy$ .
It is the number of false positives and false negatives divided by the total number of predictions.

True positive rate, Recall, Sensitivity

$TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$

the proportion of actual positives for which the test result is positive.
it shows how sensitive the model is to detecting positive instances.
the ratio of true positives to all positives
the data points that were correctly predicted as positive divided by the number of true positives and false negatives.

False positive rate, Fall Out, False Alarm

$FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$

the proportion of actual negatives for which the test result is positive.
It is equivalent to $1 – Specificity$ $1- Sp ec i f i c i t y$ .
- large values of specificity indicate small false negative rates.

The true negative rate, Specificity

$SPC = TNR = \frac{TN}{N} = \frac{TN}{FP + TN}$

the proportion of actual negatives for which the test result is negative
It shows how well the model does at identifying actual negatives as negative.

The false negative rate, Miss Rate

$FNR = \frac{FN}{TP+FN}$

the proportion of the actual positives for which the test result is negative.
It is equivalent to $1 – Sensitivity$ $1- S e n s i t i v i t y$ .
- large values of sensitivity indicate small false positive rates.
It is a common trick that often change classifiers to bias them towards making type 1 FP versus type 2 FN errors.
- This can often change classifier to bias towards making FP errors rather than FN errors.
- It depends on which are worse to make
- They are biased if there are different numbers of items in each class

Precision, Positive Predictive Value

PPV, Positive Predictive Value
a measure of how accurate and precise the positive predictions are
It is the ratio of true positives to predicted positives.

Accuracy and Error rate

in practice, the previous measures don't always work well
- they are biased, especially if there are different numbers of items in each class.
if the data is imbalanced, it's much better to use a true positive rate or true negative rate instead.

F1

$F = 2 * \frac{Precision * Recall}{Precision + Recall}$

known as F-score or F-measure
a measure of the accuracy of the test
It is the harmonic mean of the recall and precision, where an F1 score reaches its best value at 1 (perfect precision and recall).
It allows the Recall and Precision to be assessed in the same calculation.

Predicted \ Actual	Positive	Negative	Total
Positive	110	5	115
Negative	10	60	70
Total	120	65	185

$Accuracy = \frac{110 + 60}{185} = 0.91$
$Error Rate = \frac{10 + 5}{185} = 0.08$
$True Positive Rate (Sensitivity/Recall) = \frac{110}{115} = 0.95$
$False Positive Rate = \frac{10}{70} = 0.14$
$True Negative Rate (Specificty) = \frac{60}{70} = 0.85$
$False Negative Rate = \frac{5}{115} = 0.04$
$Precision = \frac{110}{120} = 0.91$
$F1 = (2 * 0.91 * 0.95) / (0.91 + 0.95) = 0.92$

ROC curve

Receiver Operating Characteristic Curve

a graphical plot that explains how well a binary classifier system performs as the threshold at which it calls a data point as positive is varied.
ROC graphs were originally used in the communications area to look at false alarm rates.
The x-axis is the false positive
The y-axis is the true positive rate (sensitivity, recall)
ROC graphs contain all the information in the confusion matrix.
- TPR(=TP/(TP+FN)), FPR(=FP/(FP+TN))
a visual tool to compare trade-offs between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly identified.
an essential evaluation metric for checking the performance of a classification model.

AUC

Area Under the ROC Curve

AUC 0: the model predicts a negative class as a positive class and vice versa.
AUC 0.5: the model has no discriminative capacity to differentiate between negative class and positive class
- diagonal line from (0,0) to (1,1)
AUC 0.7: 70% chance that the model will be able to differentiate between the positive and negative classes.
AUC 1: the model predicts all positive class as positive class and all negative class as negative class.
ROC is a probability curve and AUC represents the degree of separability
It shows how capable the model is of differentiating between the classes.

Training/testing split

train-test split procedure is used to estimate the performance of machine learning algorithms
should not be used
- when the dataset is small
- when the dataset is imbalanced
- where additional configuration is required
- train_test_split(..., stratify=y)
most common approach to validate model performance
- traning data is 80%
- testing data is 20%

k-fold cross-validation

helps select the model which will perform best on unseen data
overcoming the problem of overfitting and underfitting.
a parameter called k defines the number of portions that a given data sample will be split into.
k-fold cross-validation has less bias because it ensures that every single discovery from the main dataset has a chance to appear in both training and test sets.

Iteration of a 5-fold cross-validation

# X = Test set
# T = Train set
1st fold : XXXX TTTTTTTTTTTTT
2nd fold : TTTT XXXX TTTTTTTT
3rd fold : TTTTTT XXXX TTTTTT
4th fold : TTTTTTTT XXXX TTTT
5th fold : TTTTTTTTTTTTT XXXX

k = 3
dataset = [1, 2, 3, 4, 5, 6]

fold1 = [5, 2]
fold2 = [1, 3]
fold3 = [4, 6]

Model1 will be trained on fold1 and fold2, and tested on fold3
Model2 will be trained on fold2 and fold3, and tested on fold1
Model3 will be trained on fold1 and fold3, and tested on fold2
we can take the accuracy as the average of all rounds to get the final accuracy.

Bias-variance decomposition

a formal method for understanding the prediction error of a model.

the average of the distance between the target and model predictions.
simple model: High bias, Low variance → Underfitting
- not complex, high error component
complex model: Low bias, High variance → Overfitting
- quite sensitive to the specific training set

Bias

the average of the distance between the target and model predictions.

how well the model can do for any training set.
the difference between the expected value and the parameter that we want to estimate
$\text{Bias} = E[\hat{\theta}] - \theta$ $Bias = E [\hat{θ}] - θ$
- If the bias is exactly zero, the estimator is unbiased
- If the bias is greater than zero, the estimator is positively biased
- If the bias is less than zero, the estimator is negatively biased

Variance

the deviation between the average prediction value and the predicted value.

Classifier error is also affected by variability in the training data because different training sets lead to different decision boundaries.
High variance: it produces different results for different training sets.
- sensitive to the particular training set.
- models with less parameters tend to have lower variance.
$Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$

Noise

changes in the target value

objects with the same attribute values leading to different class labels.
these errors are unavoidable even when you know the correct decision boundary.

Evaluating

Underfitting

the model did not capture enough patterns in the data

The model provides poor performance on both the training and the test set.
Reasons:
- The training model is not trained as tightly as possible.
- The model is not able to learn more.
  - model is not suitable for the task.
Avoid underfitting by:
- using more training data
- choosing / training a more complex model
- increase the number of parameters in the model, the type or complexity of the model, or the traninig time till a cost function is minimized.

Overfitting

the model captures noise and patterns which do not generalize well to new data.

The model has extremely good performance on the training set, but poor performance on the test set.
Reasons:
- the training data is not a perfect standard
Avoid overfitting by:
- regularization
- pruning (parameters, strucutres of classifiers)
- reducing the descriptive length (minimize the sum of the model's complexity and the dscription of the traning data)
- optimization (use a separate subset for validating the model)
- expansion (use or generate more training data)
  - adding synthetic samples to the dataset.

K-NN

k-Nearest Neighbors

a non-parametric technique
- not involving any assumptions as to the form or parameters of a frequency distribution
supervised learning classifier
- uses proximity to make classifications or predictions about the grouping of an individual data point
- defining new cases based on similarity measures (e.g., distance functions).
- Euclidean: $d(x, y) = \sqrt{\sum_{} (x_i - y_i)^2}$
- Minkowski: $d(x, y) = (\sum_{} |x_i - y_i|^p)^{1/p}$
- Hamming: $d(x, y) = \sum_{} |x_i - y_i|$
In weighted k-NN, weigh the votes according to distance
- $w_i = \frac{1}{d^2}$
If k is too small, it may be sensitive to noise points
If k is too large, its neighbourhood may include points from other classes
Choose an odd value for k, to eliminate ties

Industrial Applications of KNN

Retail business data analysis: Identify customer patterns and generate business value.
Security and operational management: Simplify daily operations such as theft prevention.
Credit card usage monitoring: Detect unusual patterns to identify fraudulent transactions.
Transaction scrutiny software: Spot unfamiliar patterns and flag suspicious activity.
Point-of-sale (POS) data analysis: Analyse register data for operational insights.

IAI +007

2025년 9월 15일 · 약 8분

Eunkwang Shin

Owner

Computer vision

a field of artificial intelligence enabling computers and systems to derive meaningful information from digital images, videos and other visual inputs.
vision is a perceptual channel that accepts a stimulus and reports soome representation of the world
computer vision enables intelligent agents to see, observe and understand of environment

Core problems of CV

reconstruction: an agent builds a mode lfo the world from an image or a set of images
recognition: an agent draws distinctions amont the objects it encounters based on visual and other information.
- image classification
- object detection
- image segmentation

Classic approaches to object recognition problems

feature-based object recognition approach
- works well for faces looking directly at the camera.
pattern-element-based object recognition approach
- a useful abstraction is to assume that some objects are made up of local patterns which tend to move around with respect to one another.
- we can model objects with pattern elements.

Modern approaches to object recognition problems

deep learning networks
- to recognition problems enables that features can be automatically learned and extracted from raw image data compares with the manual feature extraction in the classic approaches.
- AlexNet, VGGNet, GoogleNet, ResNet, DenseNet, EfficientNet, RegNet...
basic models and derived models
- YOLO, SSD, RetinaNet, R-CNN...

Evaluation metrics for image classification

Metric	Definition	Use case
Accuracy	the percentage of correcly predicted labels out of all predictions made	commonly used in balanced datasets but can be mis leading in imbalanced classes
Precisino	the ratio of correctly predicted positive observations to the total predicted positives.	useful when the cost of false positives is high (e.g., spam detection)
Recall (Sensitivity or True Positive Rate)	the ratio of correctly predictted observations to all the actual positives	important when the cost of false negatives s high (e.g., disease detection)
F1 Score	the harmonic mean of Precision and Recall	used when you need to balance precision and recall, especially in imbalanced datasets
Specificity (True Negative Rate)	the ratio of correctly predicted negative observations to all the actual negatives.	important when false positives should be minimized (e.g. medical tests for diseases)
Confusion Matrix	a table used to describe the performance of a classification model by showing the true positive, false positive, true negative, and false negative counts	provides a comprehensive understanding of a models' performance across all classes
ROC Curve (Receiver Operation Characteristic)	a graphical representation of the classifier's performance across all thresholds, plotting the true positive rate (recall) against the false positive rate (1 - specificity)	used to evaludate binary classirifres and compare models
AUC (Area Under the Curve)	the area under the ROC curve, providing a single number summary of the models' ability to discriminate between positive and negative classes	useful for evaluating binary classification models, particularly when dealing with imbalanced datasets.

Confusion Matrix

Actual class ➡️ Predicted class ⬇️	Positive	Negative	Metric
Positive	TP: True Positive	FP: False Positive	Precision: $\frac{TP}{TP + FP}$
Negative	FN: False Negative	TN: True Negative	Negative Predictive Value: $\frac{TN}{TN + FN}$
Metric	Recall or Sensitivity: $\frac{TP}{TP + FN}$	Specificity: $\frac{TN}{TN + FP}$	Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$

ROC Curve

Sort predicted probabilities
Try multiple thresholds
For each threshold, compute predicted labels
Compute TPR and FPR
Plot ROC curve (FPR vs TPR) i. $TPR = \frac{TP}{TP + FN}$ ii. $FPR = \frac{FP}{FP + TN}$
Compute AUC as area under ROC curve

from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

Evaluation metric for object detection

Metric	Definition	Use case
Intersection over Union (IoU)	measures how much the predicted bounding box overlaps with the ground truth.	$IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}$
Precision	the ratio of correctly predicted postive observations to the total predicted positives	$Precision = \frac{TP}{TP + FP}$
Recall	the ratio of correctly predicted observations to all the actual positives	$Recall = \frac{TP}{TP + FN}$
F1 Score	the harmonic mean of Precision and Recall	$F1 = 2 \times \frac{Precision \cdot Recall}{Precision + Recall}$
Average Precision (AP)	Precision-Recall Curve: Precision vs Recall at different thresholds AP: Area under the Precision-Recall curve (per class)	$AP = \int_{0}^{1} P(r) \, dr$
Mean Average Precision (mAP)	mean of AP across all classes	provides a comprehensive understanding of a models' performance across all classes
AP@[.50:.95]	COCO benchmark	Averages AP at IoU threshold from 0.5 to 0.95 in steps of 0.05

AP is computed by summing trapezoid areas under the Precision-Recall curve.

Convolutional Neural Network (CNN)

contains spatially local connectinos at lest in the early layers
has patterns of weights that are replicated across the units in each layer.
use a kernel to detect patterns of weights that is replicated across multiple local regions in an image
use convolution that applies the kernel to the pixels of the image

Input
   ↓
[Conv → ReLU] → [Pooling]
   ↓
[Conv → ReLU] → [Pooling]
   ↓
Flatten (2D → 1D 벡터)
   ↓
Fully Connected Layer
   ↓
Output Layer (Softmax/Sigmoid)

$z_i = \sum_{j=1}^{l} k_j \cdot x_{j + (i-1)s}$

$z_i$ : the $i$ -th output element (convolution result)
$j$ : index inside the kernel (from 1 to $l$ )
$l$ : kernel size (number of elements in the kernel)
$k_j$ : the $j$ -th kernel weight (filter element)
$x$ : input sequence (the original data)
$x_{j+(i-1)s}$ : the input element aligned with kernel position $j$ when shifted by stride
$s$ : stride (step size of the kernel movement)
$(i-1)s$ : total shift in the input for the $i$ -th convolution

Receptive Field

the receptive field of a neuron is the portion of the sensory input that can affect aht neuron's activation.
In CNNs, the receptive field of a unit in the first hidden layer is small.
just the size of the kernel, e.g., 3x3 or 5x5.
In the deeper layers of the network, it can be much larger.

1D CNN

$R_{L} = R_{L-1} + (k_{L} - 1) \cdot \prod_{i=1}^{L-1} s_i$

$R_L$ : receptive field size at the $L-th$ layer
$k_L$ : kernel size at the $L-th$ layer
$s_i$ : stride at the $i-th$ layer

Layer	Kernel k	Stride s	Calculation	Result RF
Conv1	3	1	1 + (3-1)*1	3
Pool1	2	2	3 + (2-1)*1	4
Conv2	3	1	4 + (3-1)*2	8

Pooling

works like a convolutional layer, with a kernel size $l$ and stride $s$ , but the operation is applied is fixed rather than learned.
no activation fucntion is associated with the pooling layer.
common forms of pooling
- average pooling
- max pooling: saying a feature exists somewhere in the unit's receptive field.

Dropout

a way to reduce the test-set error of a network to increase its ability of generalization.
makes a etwork herder to fit the traninig set.
the network is created by deactivating a randomly chosen subset of the units (dropout rate).
cannot explain why it works but the results is better

Batch Normalization

modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD).
Batch normlization rescales the values generated at the internal layers of the network from the examples within each minibatch.
- it standardizes the input to a layer for each mini-batch.
standardizes the mean and variance of the values.
maeks it much simpler to train a deep network.

Tranining in a CNN

$Cross-entropy(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)$

Forward pass
Backward pass
Parameter update

Variants of CNNs

AlexNet

AlexNet Layers

ResNet

stands for a residual neural network.
was designed to enable hundreds or thousands of convolutional layers.
Residual neural networks do this by utilizing skip connections, or shortcurts to jump over some layers.
was an innovative solution to the "vanishing gradient" problem.

x ────────────────► (+) ──► y
   │                     ▲
   ▼                     │
 [Conv → ReLU → Conv] = F(x)

VGGNet

increases the depth of the network through adding more convolutional layers by using small convolution filters (3x3) while other parameters are fixed.

VGGNet

Derived model for object detection

YOLO

Split an images to S times S blocks.
Apply object classification to each block and get confidence score of different objecxt for each block.
Based on the class probability map to locate objects.

YOLO

Calculation AP and mAP

Gather Detection results
Match Predictions to Ground Truth
Compute Precision and Recall to generate a set of (recall, precision) points
Build the Precision-Recall (PR) Curve (presioin decreases as recall increases)
Smooth the Precision-Recall Curve
Calculate AP
Extend to mAP

FSD +008

2025년 9월 15일 · 약 5분

Eunkwang Shin

Owner

OOP Principles

Encapsulation

bundles data attributes (fields) with methods that use the data in a single unit called "Object"
hides sensitive data attributes by declaring the fields private.
fields can be declared protected in the parent class, allowing access from the child class.
exposes the data attributes values only through public getters/setters to allow access or modification of the field values.

Intheritance

class A:
    pass

class B(A):
    pass

class C(A):
    pass

Superclass defines common method signatures (with/without implementation) and fields.
Subclasses provide the implementations for (or override) these method signatures
private attributes cannot be directly accessed from the child classes.
to refer to superclass properties (fields, methods) or constructor, use the keyword super()

Polymorphism

means many forms.
permits an object to have multiple types.
allows object of different types but with a common parent to be stored in the same collection.
enables inherited methods from the parent to perform different tasks when called by subclasses.

Abstraction

the process of hiding the implementation details and exposing only the necessary behavior to the user.
abstract class must at least contain one abstract method.
abstract class ccan have concrete methods.
abtract methods are only prototypes in the parent class, the actual implementation is provided by the subclasses that inherit the abstract class.
rules
- if a class contains at least on abstract methods, it should be declared abstract.
- if another class inherits an abstract class, it must implement all the abstract methods of that class.

from abc import ABC, abstractmethod

class Person(ABC):
    def __init__(self, name):
        self.name = name

    @abstractmethod
    def show(self):
        pass

    def show_name(self):
        print(f"Name: {self.name}")

class Student(Person):
    def __init__(self, name, id):
        self.id = id
        super().__init__(name)

    def show(self):
        super().show_name()
        print(f"ID: {self.id}")

interfaces are complete abstract classes declared with the interface keyword.
only contain prototype (abstract) methods with empty body code.
cannot be instantiated nor inherited as they do not have constructors.
a class can implement multiple interfaces.
a class implementinig an interface must provide an implementation for all its abstract methods.
add an access layer to further hide the methods impelmentation.
act as a middleware inside the program, allowing human, machines, and other software to interact with the program's functionalities without knowing the implementation details.

# interface A is a fully abstract class
# interface A must inherit ABC
# class B(A) must implement all abstract methods of A
# interface A methods have the @abstractmethod decorator with pass as body-code
from abc import ABC, abstractmethod

class Person(ABC):
    @abstractmethod
    def show_info(self):
        pass

class Student(Person):
    def __init__(self, name, id):
        self.name = name
        self.id = id
    
    def show_info(self):
        print(f"Name: {self.name}, ID: {self.id}")

OOP Design Rules

How to split the code into seperate classes and preserve encapsulation
How to organizae the code into methods to hide the implementation details and expose the behavior
How the objects interact at runtime
How to name the class entities

5 Design Rules

Encapsulation
- hide fields behind methdos
- requires fields to be private while allowing methods to be public
- requires a methods related to a field (such as access, modify, use) be defind in the same class.
Push code to the right
- requires the code/methods used by objects of a class be written in the same class.
- ensures methods are written for reusability and placed in the correct class so they can make use of the class's fields.
- determines which class is responsible for defining the methods needed to achieve the program's goals.
Spread plans across classes
- involves planning the distribution of code across multiple classes from the start.
- focuses on distributing responsibilities logically among different classes, so that each class has a clear and focused role.
- by convention, this rule requires using the same method name (for the same goal) across all classes.
Hide by default
- hiding implementation details within a class and exposing only the necessary functionality to the outside.
- helps in achieving encapsulation and abstraction.
- promotes better design and maintainability.
Follow naming conventions
- use nouns to name fiels.
- use nouns to name functions.
- use verbs to name procedures.
- if an entitiy is compsed of two or more words.
  - use camelCase/snake_case for fields and methods.
  - use PascalCase for class names.

consolidates the action-trieggers of a program into a single method.
common design choice in applications with user interaction.
the menu methods is executed in the program's main method.
- offers users interactive CLI command choices.
menu() requires a read-function in a while loop, allowing the menu() tor repeatedly read inpus from STDIN.

<condition> = read_choice()

loop (<condition>):
    check (<condition>):
        case 1: do task <action_1()>
        case 2: do task <action_2()>
        ...
        case n: do task <action_n()>
        default: <alternative-action>

menu(self):
    choice = input("Enter choice: (d/w/b/h/x): ").lower()
    while choice != "x":
        match choice:
            case "d":
                self.deposit()
            case "w":
                self.withdraw()
            case "b":
                self.show_balance()
            case "h":
                self.show_history()
            case "x":
                self.exit()
            case _:
                self.show_invalid_choice_message()

        choice = input("Enter choice: (d/w/b/h/x): ").lower()

FDA +007

2025년 9월 14일 · 약 7분

Eunkwang Shin

Owner

Classification and Prediction

Classification

classifies data based on the training set and the class labels and uses it in classifying new data

Model construction (learning)
Model evaluation (accuracy)
Model use (classification)

Prediction

predicts categorical class labels based on unseen data

models continuous-valued functions

Classfication Methods

Decision tree induction
Bayesian classification
Nearest neighbour classification, case-based reasoning
Neural networks
Support Vector Machines
Ensemble methods

Issues around classification and prediction

Predictive accuracy
Speed and scalability
Robustness
Interpretability
'Goodness' of classifier

Classification process

Decision Tree

Builds trees to describe data
Easy to translate into rules
Robust to niosy data
able to build disjunctive expressions
Inductive bias prefers small trees over larger.

Decision Tree Methodologies

Itertative Dichotomiser 3 (ID3)
C4.5
Classification And Regression Trees (CART)
Chi-square Automatic Interaction Detection (CHAID)
Multivariate Adaptive Regression Splines (MARS)

Decision Tree Forms

Balanced
- each branch has the same depth from the root to leaves.
- all nodes have the same number of splits
Deep
- some nodes have different levels, wherein some of them are split into more branches.
Bushy
- split into multi-way from the root
- undesirable because the split may lead to small numbers of instance in each leaf node.

Entropy and Information Gain

Entropy

measures the amount of disorder / (im)purity in a collection of things. i.e. the unpredictability of the data.

$Entropy(S) = -p_{+} log_2(p_{+}) - p_{-} log_2(p_{-})$

Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.
$S$ stands for total number of samples
$P_{+}$ denotes the likelihood of a yes (positive) answer.
$P_{-}$ denotes the likelihood of a no (negative) outcome.

$E(S) = \sum_{i=1}^{c} -p_i log_2(p_i)$

Information Gain

measures how well a given attribute separates the training examples according to the target classification.

The larger the information gain is, the stronger the feature will be.

$Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$

$S$ = set of training examples
$A$ = the particular attribute to be tested
$Values(A)$ = the set of values for the attribute A
$S_v$ = subset of $S$ with attribute $A$ having value $v$ s

Iterative Dichotomiser 3

constructs trees in a top-down manner.
check each instance attribute with a statistical test to see how well it alone classifies (splits) the training examples.
this becomes the root node.
descendent is created for each possible value of the attribute and training examples split to the appropriate descendent.
repeat the procedure for each descendent.
the algorithm is going to issue recursion on each of the partitions.
the output of this algorithm is the creation of a Model.

function id3 (examples, target, attrs):
  create root node for tree
  
  if examples all +ve, return root with label=+
  if examples all -ve, return root with label=-
  if attrs is empty
    return root w/ label=most common value of target in examples else
   else
      A ← attribute from attrs that best splits examples
      root ← A for each possible value, vi , in A
        add a new branch below root corresp. to the test A = vi
        examples_v ← the subset of examples with A = vi

        if examples_vi is empty
          add a leaf node below branch w/ label = most common value of target from examples
        else below the branch add the subtree given by
          id3(examples_vi , target, attrs - {A})

      return root

Hypothesis space search

ID3 can be seen as a search through a space of hypotheses for one that matches the training data.
It's a greedy search
Hypothesis space = all the possible trees
Simple to complex, hill climbing search
Complete search = decision tree can represent all possible hypotheses
Maintains only one current hypothesis
- cannot find alternative decision trees
No backtracking, maybe stuck in local optima
Uses all training data at each step
- less sensitive to errors in individual examples

Inductive Bias

Shorter trees preferred
High information gain near the root
cos' simple to complex search

Issues with decision trees

Overfitting training data

occurs when a tree gives higher accuracy on the training data than another tree, but lower accuracy on the unseen data.
because the training data is noisy or not representative of the unseen data.
can measure how well the tree generalizes by checking error on test data.

Dividing the data

Training set
- used to build the initial model
- may need to enrich the data to get enough of the special cases
Cross validation set
- used to adjust the initial model
- used to work out the correct values of parameters in model
- models can be tweaked to be less dependent on idiosyncrasies in the training data to be a more general model
- idea is to prevent over-training (i.e. finding patterns where none exist)
Test set
- used to evaluate the model performance

Avoiding Overfitting

stop growing the tree once the test error decreases
grow the tree as normal (i.e. wit hoverfitting) then post-prune it.
use a separate set of data apart from training to test when to prune nodes (training & cross-validation set)
use all data to train, but apply a statistical test whether to expand/prune a node.
use an explicit complexity measure (e.g. Minimum Description Length, MDL) to trade off accuracy vs complexity.

Reduced-Error pruning

build tree
consider each node in tree for pruning
pruning
- remove subtree
- make into a leaf
- assign label as most common class in associated training examples.
if pruned tree has as good error on the cross validation set as the unpruned tree, do the prune.
keep pruning until the error on cross validation set increases.

Cross validation set

the main difficulty comes when you don't have much data and need it all for training.
k-fold cross-validation or leave-one-out training can help.

Rule Post-Pruning

converts the decision tree to rules
removes preconditions that do not worsen the accuracy (on the cross validation set or with a statistical test)
sorts the rules by estimated accuracy and uses this order when classifying new data

Continuous Valued Variables

use real valued attributes for tests at nodes.
Dynamically define new discrete attributes that partition the continuous attribute.
discrete: $A < v = true$ and $A >= v = false$
the boundary poins can be estimated from the training data.

The confusion matrix

a basic method of evaluation of classifiers
The columns have numbers associated with the actual number of positive data points in the test set and the actual number of negative data points in the test set

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

TP: the number of correct prediction of positive samples.
- the number of data points in the test set that were positive and the classifier correctly assigned them to the positive group
TN: the number of correct prediction of negative samples
- the number of data points in our test set that was actually negative and predicted as negative by the classifiers
FP: the number of incorrect predictions of positive samples
- the value of data points in our test set that was actually negative but were predicted by the classifier as positives.
- These ones are obvious errors, which is called a type 1 error.
- type I error
FN: the number of incorrect prediction of negative samples.
- the number of data points in our test set that was actually positive but were predicted by the classifier as negative
- usually called the type 2 error, and when we are trying to build a classification.
- might have a trade-off between the number of the type 1 error or the type 2 error.
- type II error

Vocabulary for AI 008

2025년 9월 14일 · 약 3분

Eunkwang Shin

Owner

Vocabulary & Expressions

Term/Expression	Definition	Simpler Paraphrase	Meaning
invariance	the quality of remaining unchanged when something else changes	unchangingness	불변성
disjunction	a logical operation that outputs true whenever at least one of its inputs is true	or operation	논리합, OR
convergence	the process of coming together to a common point	coming together	수렴
impractically	not in a practical or realistic manner	not practical	비현실적으로
receptor	a cell or group of cells that receives stimuli and transmits them to sensory nerves	sensor	수용체, 감각기
sensation	the process of sensing our environment through touch, taste, sight, sound, and smell	feeling	감각
diffuse	spread out over a large area; not concentrated	spread out	확산되다
albedo	the proportion of the incident light or radiation that is reflected by a surface	reflectivity	반사율
color constancy	the feature of the human color perception system which ensures that the perceived color of objects remains relatively constant under varying illumination conditions	consistent color perception	색채 항등성
occlusion	the blocking of light or other radiation by an object	blockage	폐색, 차폐
deformation	the action or process of changing in shape or distorting, especially through the application of pressure	distortion	변형
foreshortening	the visual effect or optical illusion that causes an object or distance to appear shorter than it actually is because it is angled toward the viewer	perspective shortening	단축, 단축법
courtesy	provided at no cost	free of charge	무료 제공
apprenticeship learning	a type of learning where an agent learns to perform tasks by observing and imitating a more experienced agent	learning by imitation	도제 학습
consolidate	make (something) physically stronger or more solid	strengthen	강화하다
aquamarine	a light bluish-green color	light blue-green	아쿠아마린
fulvous	a dull yellowish-brown color	dull yellow-brown	황갈색의
verbatim	in exactly the same words as were used originally	word for word	말 그대로, 축어적으로
lexicon	the vocabulary of a person, language, or branch of knowledge	vocabulary	어휘집
adherent	someone who supports a particular party, person, or set of ideas	supporter	지지자
syntactic consitituent	a word or a group of words that function as a single unit within a hierarchical structure	sentence unit	구문 성분
pragmatics	the branch of linguistics dealing with language in use and the contexts in which it is used	language use	화용론
case	a grammatical category that marks the relationship between a noun and other words in a sentence	grammatical role	격
person	a grammatical category that distinguishes between different participants in a conversation (e.g., first person, second person, third person)	participant distinction	인칭
number	a grammatical category that expresses count distinctions (e.g., singular, plural)	count distinction	수
head	the main word in a phrase that determines its syntactic type	main word	중심어
separability	the quality of being able to be separated or divided	divisibility	분리 가능성
intrinsic	belonging naturally; essential	inherent	본질적인

기법	아이디어	비유
Laplace (Add-One)	모든 경우의 수에 +1을 더해줌. 안 본 것도 최소한의 기회를 줌.	시험 점수를 0점 맞아도 최소 1점은 줘
Backoff	n-gram이 없으면 더 짧은 n-gram으로 내려가서 확률을 씀.	3단어 조합 못 봤네? → 그럼 2단어 조합으로 보자 → 그것도 없으면 1단어라도 보자.
Linear Interpolation	여러 레벨(n=1,2,3)을 동시에 섞어서 확률 계산.	국어, 영어, 수학 점수를 일정 비율(λ)로 합쳐서 성적 보는 것.
Witten-Bell / Kneser-Ney	더 정교하게 확률 질량을 재분배. (현업에서 성능 좋음)	선생님이 본 적 없는 문제지만 유사 문제를 잘 풀었으니 점수 좀 더 주자 하는 것.
Stupid Backoff	대충 큰 코퍼스를 모아서, 그냥 단순 backoff만 사용. (빅데이터 시대 구글에서 씀)	데이터가 워낙 많으니까, 그냥 단순한 방법도 잘 먹힌다.

SVM​

Convex Hull​

Margins​

Kernels​

Kernel trick​

Linear Kernel​

Gaussian Kernel / RBF Kernel​

Polynomial Kernel​

Sequential Minimal Optimization (SMO)​

Advantages​

Disadvantages​

Applications​

Vocabulary & Expressions​

Vocabulary & Expressions​

Knowledge Base​

Logical connectives​

Truth tables​

NLP​

Tokenization​

Corpus​

Text classification​

Bag of Words, BoW​

TF-IDF​

Examples of TF-IDF​

Naive Bayes​

Examples of Naive Bayes​

N-gram model​

Examples of N-gram​

Transformer​

Cross-Attention​

Word representation​

One-hot representation​

Word Embedding​

Contextual embedding​

Part of Speech (POS) tagging​

Example of POS tagging​

Machine translation​

Example of machine translation​

Text generation​

Example of text generation​

Transfer learning​

Linear Separability​

Hyperplane​

Artificial Neural Networks, ANN​

Multilayer Perceptrons, MLP​

Resilient Propagation, RProp​

KNIME​

Mearues of performance​

Accuracy​

Error rate​

True positive rate, Recall, Sensitivity​

False positive rate, Fall Out, False Alarm​

The true negative rate, Specificity​

The false negative rate, Miss Rate​

Precision, Positive Predictive Value​

Accuracy and Error rate​

F1​

ROC curve​

AUC​

Training/testing split​

k-fold cross-validation​

Bias-variance decomposition​

Bias​

Variance​

Noise​

Evaluating​

Underfitting​

Overfitting​

K-NN​

Industrial Applications of KNN​

Computer vision​

Core problems of CV​

Classic approaches to object recognition problems​

Modern approaches to object recognition problems​

Evaluation metrics for image classification​

Confusion Matrix​

ROC Curve​

Evaluation metric for object detection​

Convolutional Neural Network (CNN)​

Receptive Field​

SVM

Convex Hull

Margins

Kernels

Kernel trick

Linear Kernel

Gaussian Kernel / RBF Kernel

Polynomial Kernel

Sequential Minimal Optimization (SMO)

Advantages

Disadvantages

Applications

Vocabulary & Expressions

Vocabulary & Expressions

Knowledge Base

Logical connectives

Truth tables

NLP

Tokenization

Corpus

Text classification

Bag of Words, BoW

TF-IDF

Examples of TF-IDF

Naive Bayes

Examples of Naive Bayes

N-gram model

Examples of N-gram

Transformer

Cross-Attention

Word representation

One-hot representation

Word Embedding

Contextual embedding

Part of Speech (POS) tagging

Example of POS tagging

Machine translation

Example of machine translation

Text generation

Example of text generation

Transfer learning

Linear Separability

Hyperplane

Artificial Neural Networks, ANN

Multilayer Perceptrons, MLP

Resilient Propagation, RProp

KNIME

Mearues of performance

Accuracy

Error rate

True positive rate, Recall, Sensitivity

False positive rate, Fall Out, False Alarm

The true negative rate, Specificity

The false negative rate, Miss Rate

Precision, Positive Predictive Value

Accuracy and Error rate

F1

ROC curve

AUC

Training/testing split

k-fold cross-validation

Bias-variance decomposition

Bias

Variance

Noise

Evaluating

Underfitting

Overfitting

K-NN

Industrial Applications of KNN

Computer vision

Core problems of CV

Classic approaches to object recognition problems

Modern approaches to object recognition problems

Evaluation metrics for image classification

Confusion Matrix

ROC Curve

Evaluation metric for object detection

Convolutional Neural Network (CNN)

Receptive Field