Skip to main content

Vocabulary for AI 010

· 2 min read

Vocabulary & Expressions

Term/ExpressionDefinitionSimpler ParaphraseMeaning
cavitya hollow space within a solid objecthollow space구멍, 공동
marginalizationtreatment of a person, group, or concept as insignificant or peripheralsidelining주변화
abbreviateshorten a word, phrase, or textshorten줄이다, 생략하다
summationthe process of adding things togetheraddition합계, 총합
multivaluehaving multiple values or meaningsmultiple values다중 값
quantifyexpress or measure the quantity of somethingmeasure수량화하다
deriveobtain something from (a specified source)obtain끌어내다, 얻다
meningitisinflammation of the protective membranes covering the brain and spinal cordbrain inflammation수막염
casualrelating to or showing the cause of somethingcause-related인과 관계의
epidemica widespread occurrence of an infectious disease in a community at a particular timeoutbreak유행병
proportionatelyin a way that corresponds in size or amount to something elsecorrespondingly비례하여
achingexperiencing a continuous or prolonged dull painpainful아픈, 쑤시는
soup sth upmodify or improve something to make it more powerful or effectiveenhance성능을 향상시키다
perpendicularat an angle of 90 degrees to a given line, plane, or surfaceat right angles수직의
wigglyhaving many curves or bendscurvy구불구불한
versatilityability to adapt or be adapted to many different functions or activitiesadaptability다재다능함
interpolateestimate or insert (a value or function) between two known values in a sequenceestimate between보간하다
commutativitythe property that the order of applying an operation does not change the resultorder-independence교환법칙
associativitythe property that the grouping of operations does not change the resultgrouping-independence결합법칙
contrapositiona logical operation that involves reversing and negating both the hypothesis and conclusion of a conditional statementreverse and negate대우
distributivitythe property that an operation can be distributed over another operationdistribution property분배법칙

Vocabulary for AI 009

· 6 min read

Vocabulary & Expressions

Term/ExpressionDefinitionSimpler ParaphraseMeaning
canonicalconforming to a general rule or acceptable procedurestandard정통의, 표준의
assertstate a fact or belief confidently and forcefullydeclare단언하다, 주장하다
lurkingremaining hidden so as to wait in ambushhidden숨어있는, 잠복하는
bleak(of an area of land) lacking vegetation and exposed to the elementsdesolate황량한, 적막한
albeitalthoughthough비록 ~일지라도
stencha strong and very unpleasant smellfoul smell악취
woefulcharacterized by, expressive of, or causing sorrow or miserysorrowful슬픈, 비참한
As forwith regard to; concerningregarding~에 관하여
ignorancelack of knowledge or informationunawareness무지, 무식
utterlycompletely and without qualification; absolutelycompletely완전히, 전적으로
prudentacting with or showing care and thought for the futurewise신중한, 현명한
arithmeticthe branch of mathematics dealing with the properties and manipulation of numbersmath산수, 계산
entailmenta relationship between sentences in which one sentence logically follows from one or more othersimplication함축, 수반
syntacticrelating to the arrangement of words and phrases to create well-formed sentences in a languagegrammatical구문의, 통사론의
stands forrepresents or signifiesrepresents~을 나타내다
mnemonica device such as a pattern of letters, ideas, or associations that assists in remembering somethingmemory aid기억을 돕는 장치
parenthesesa pair of round brackets () used to mark off a parenthetical word or expressionbrackets괄호
negationthe contradiction or denial of somethingdenial부정, 반대
antecedenta thing or event that existed before or logically precedes anotherpredecessor선행사, 앞서는 것
precedencethe condition of being considered more important than someone or something else; priority in importance, order, or rankpriority우선, 우선권
disjunctsa word or phrase that is grammatically independent of the other parts of the sentence in which it occursseparate part분리된 부분
causationthe action of causing somethingcausing인과, 원인 제공
decidedlyin a manner that is clear and definiteclearly단호하게, 명확히
sufficebe enough or adequatebe sufficient충분하다
tautologya statement that is true by necessity or by virtue of its logical formredundancy동어 반복, 자명한 진리
converselyintroducing a statement or idea that reverses one that has just been made or referred toin contrast반대로
contrapositivelyin a way that involves the contrapositive of a statementby contrapositive대우적으로
refutationthe action of proving a statement or theory to be wrong or false; disproofdisproving반박, 논박
monotonicitythe property of a function to be either entirely non-increasing or non-decreasingconsistency단조성
resolventa clause obtained by resolving two clauses containing complementary literalsderived clause해석절
soundnessthe quality of being based on valid reasoning or good judgmentvalidity타당성
yieldproduce or provide (a natural, agricultural, or industrial product)produce산출하다, 양보하다
ontologicalrelating to the branch of metaphysics dealing with the nature of beingexistential존재론의
commitmentthe state or quality of being dedicated to a cause, activity, etc.dedication헌신, 약속
pedagogicalrelating to teachingeducational교육의, 교수법의
aritythe number of arguments or operands that a function or operation takesnumber of arguments(함수의) 인수 개수
surrogate motherswomen who carry and give birth to a child for another person or couplegestational carriers대리모
predicatea symbol or function that represents a property or relationproperty술어, 속성, 명제함수
kinshipblood relationshipfamily relationship친족 관계
theorema general proposition not self-evident but proved by a chain of reasoning; a truth established by means of accepted truthsproven statement정리, 명제
existentiallyrelating to existencerelating to existence존재에 관한
Universala type of quantification that states that a predicate holds for all members of a specified setfor all전체에 대한 한정
existentiala type of quantification that states that a predicate holds for at least one member of a specified setthere exists존재에 대한 한정
latentexisting but not yet developed or manifest; hidden or concealedhidden잠재적인, 숨어있는

Knowledge Base

  • TELL, ASKS, TELL
    • TELLs the knowledge base what it preceives
    • ASKs the knowledge base what action it should perform
      • reasoning may be done about the current state of the world
      • about the outcomes of possible action sequences and so on
    • TELLs the knowledge base which action was chosen, and returns the action so that it can be executed
  • MAKE-PERCEPT-SENTENCE
    • constructs a sentence asserting that the agent preceived the given percept at the given time.
  • MAKE-ACTION-QUERY
    • constructs a sentence that asks what action should be done at the current time.
  • MAKE-ACTION-SENTENCE
    • constructs a sentence asserting that the chosen action was executed.

Logical connectives

BNF (Backus–Naur Form) grammar of sentences Symbols from Logic and Set Theory

  • ¬\neg : negation (NOT), ¬W1,3\neg W_{1,3}
    • A literal is either an atomic sentence (a positive literal) or a negated atomic sentence (a negative literal).
  • \land : conjunction (AND), W1,3P3,1W_{1,3} \land P_{3, 1}
    • its parts are the conjuncts.
  • \lor : disjunction (OR), (W1,3P3,1)W2,2(W_{1,3} \land P_{3,1}) \lor W_{2,2}
    • its parts are the disjuncts.
  •     \implies : implication (IMPLIES), (W1,3P3,1)    ¬W2,2(W_{1,3} \land P_{3,1}) \implies \neg W_{2,2}
    • its premise or antecedent, and its conclusion or consequent is the part that follows the     \implies.
    • Implications are also called rules or if-then statements.
    • Sometimes written as \rightarrow or \supset.
  •     \iff : biconditional (IF AND ONLY IF), W1,3    ¬W2,2W_{1,3} \iff \neg W_{2,2}

Truth tables

PPQQ¬P\lnot PPQP \land QPQP \lor QP    QP \implies QP    QP \iff Q
falsefalsetruefalsefalsetruetrue
falsetruetruefalsetruetruefalse
truefalsefalsefalsetruefalsefalse
truetruefalsetruetruetruetrue
  • Logical equivalence: PQP \equiv Q:
  • Validity: a sentence is valid if it is true in all models.
    • Deduction Theorem: For any sentences α\alpha and β\beta, αβ\alpha \models \beta if and only if (α    β)(\alpha \implies \beta) is valid.
  • Satisfiability: a sentence is satisfiable if it is true in, or satisfied by, some model.
  • Modus Ponens: α    β,αβ\frac{\alpha \implies \beta, \alpha}{\therefore \beta}
    • whenever any sentences of the form α    β\alpha \implies \beta and α\alpha are given, then the sentence β\beta can be inferred.
  • And-Elimination: αβα\frac{\alpha \land \beta}{\therefore \alpha}
    • from a conjunction, one of the conjuncts can be inferred.
  • monotonicity: the set of entailed sentences can only increase as information is added to the knowledge base.
    • inference rules can be applied whenever suitable premises are found in the knowledge base, regardless of what other sentences are present.

IAI +008

· 13 min read

NLP

Tokenization

The process of dividing a text into a sequence of words.

  • Different models uses different tokenization methods.
BERT TOKENBERT IDGPT TOKENGPT ID
my2026My3666
grandson7631Ġgrandson31845
loved3866Ġloved10140
it2009Ġit340
!999Ġ!0
so2061Ġso1406
much2172Ġmuch881
fun4569Ġfun1257
!999Ġ!0

Corpus

a large and structured collection of text

  • a corpus typically consists of at least a million words of text
  • at least tens of thousands of distinct vocabulary words.

Text classification

  • the process of categorizing text into organized groups.
  • text classifiers can automatically analyze text and then assign a set of predefined tags or categories based on its content.
  • machine learning approach
    • Features
      • BoW
      • TF-IDF
    • Features + classifier
      • Logistic regression
      • SVM
      • Naive Bayes
  • Deep learning approach
    • Neural models: CNNs (capture local n-grams)
    • RNNs, LSTMs (sequence-aware)
    • Transformers (e.g. BERT)
      • Contextual embedding
  • Modern enhancements
    • Embedding + classifier: Pretrained embeddings (Word2Vec) + classifier
    • Fine-tuned transformers: BERT fine-tuned

Bag of Words, BoW

  • it represents a text as a bag of its words
  • we can understand the meaning of a document from its content (words) their multiplicity (frequency, number of occurrences)
  • mainly used as a tool of feature extraction.
  • limitations
    • ignores syntax and the context
    • disregarding grammar
    • discards word order - words are independent of each other
    • considers only the meanings of the words in the sentence
# examples

Example1 = "He likes to watch movies. Mary likes movies too."
Example2 = "Mary also likes to watch football games."

# Vocabulary
Vocab = {"He", "likes", "to", "watch", "movies", "Mary", "also", "football", "games"}

# BoW representation
BoW1 = {He: 1, likes: 2, to: 1, watch: 1, movies: 2, Mary: 1, also: 0, football: 0, games: 0}
BoW2 = {He: 0, likes: 1, to: 1, watch: 1, movies: 0, Mary: 1, also: 1, football: 1, games: 1}

[[1,2,1,1,2,1,0,0,0], [0,1,1,1,0,1,1,1,1]]

TF-IDF

Term Frequency - Inverse Document Frequency

  • Model based on the statistics of word counts.
  • idea is that key terms and important ideas are likely to repeat.
  • includes a scoring function that measure the relevance of a document to a query.
    • the function takes a document with a corpus and a query as input and returns a numeric score.
    • the doucments that have the highest scores are considered as the most relevant documents.
  • Term Frequency
    • TF(qi,dj,D)TF(q_i, d_j, D)
  • Inverse Document Frequency
    • IDF(qi,D)=logNDF(qi,D)IDF(q_i, D) = \log\frac{N}{DF(q_i, D)}
      • DF(qi,D)DF(q_i, D): number of documents in the corpus DD that contain the term qiq_i
  • N=DN = |D|: total number of documents in the corpus DD
  • TF-IDF score
    • TFIDF(qi,dj,D)=TF(qi,dj)IDF(qi,D)TFIDF(q_i, d_j, D) = TF(q_i, d_j) \cdot IDF(q_i, D)

Examples of TF-IDF

  • Document 1: "John likes to watch movies."
  • Document 2: "Mary likes movies too."
  • Document 3: "John also likes football."
StepTermDocumentNumber of times term t
appears in document d
Total number of terms
in document d
TF(t, d)Total number of documentsNumber of documents
containing the term t
IDF(t)IDF(t) (base 10)TF-IDF(t,d)TF-IDF(t,d) (base 10)
1t=likesd1150.2330000
t=likesd2140.2500
t=likesd3140.2500
2t=watchd1150.2311.0986122890.4771212550.2197224580.095424251
t=watchd204000
t=watchd304000
3t=maryd1050311.0986122890.47712125500
t=maryd2140.250.2746530720.119280314
t=maryd304000
4t=footballd1050311.0986122890.47712125500
t=footballd204000
t=footballd3140.250.2746530720.119280314

Naive Bayes

  • naive refers to a very strong simplifying assumption about the features
  • Conditional independence assumption
    • Given class YY, all the features X=X1,X2,...,XnX = {X_1, X_2, ..., X_n} are assumed mutually independent.
  • P(XY)=P(X1,...,XnY)=i=1nP(XiY)P(X | Y) = P(X_1, ..., X_n | Y) = \prod_{i=1}^{n} P(X_i | Y)
  • Bayes rule
    • P(YX)=P(XY)P(Y)P(X)P(Y)P(XY)P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)} \propto P(Y) P(X | Y)
    • P(YX1,...,Xn)P(Y)i=1nP(XiY)P(Y | X_1, ..., X_n) \propto P(Y) \prod_{i=1}^{n} P(X_i | Y)
  • Sentiment Analysis
    • p(Ckx1,...,xn)=1Zp(Ck)i=1np(xiCk)p(C_k | x_1, ..., x_n) = \frac{1}{Z} p(C_k) \prod_{i=1}^{n} p(x_i | C_k)
      • Z=p(x)=kp(Ck)p(xCk)Z = p(x) = \sum_{k} p(C_k) p(x | C_k)
        • scaling factor for normalization of p(Ckx)p(C_k | x)
    • maximum a posteriori (MAP) decision rule
      • y^=arg maxk1,...,Kp(Ck)i=1np(xiCk)\hat{y} = \argmax\limits_{k \in 1, ..., K}p(C_k) \prod_{i=1}^{n} p(x_i | C_k)

Examples of Naive Bayes

P(Classw1:N)=αP(Class)jP(wjClass)P(Class | w_{1:N}) = \alpha \cdot P(Class) \cdot \prod_{j} P(w_j | Class)

  • Analyze the sentiment
    • my grandson loved it
    • x1=myx_1 = \text{my}, x2=grandsonx_2 = \text{grandson}, x3=lovedx_3 = \text{loved}, x4=itx_4 = \text{it}
    • C=positive,negativeC = {positive, negative}
    • y^=arg maxk{positive,negative}p(Ck)p(x1Ck)p(x2Ck)p(x3Ck)p(x4Ck)\hat{y} = \argmax\limits_{k \in \{positive, negative\}} p(C_k) \cdot p(x_1 | C_k) \cdot p(x_2 | C_k) \cdot p(x_3 | C_k) \cdot p(x_4 | C_k)
      • positive case
        • p(positive)=0.49p(positive) = 0.49
        • p(mypositive)=0.30p(my | positive) = 0.30
        • p(grandsonpositive)=0.01p(grandson | positive) = 0.01
        • p(lovedpositive)=0.32p(loved | positive) = 0.32
        • p(itpositive)=0.30p(it | positive) = 0.30
        • p(positivex)p(positive)p(mypositive)p(grandsonpositive)p(lovedpositive)p(itpositive)=0.49×0.30×0.01×0.32×0.30=0.00014256p(positive | x) \propto p(positive) \cdot p(my | positive) \cdot p(grandson | positive) \cdot p(loved | positive) \cdot p(it | positive) = 0.49 \times 0.30 \times 0.01 \times 0.32 \times 0.30 = 0.00014256
      • nagative case
        • p(negative)=0.51p(negative) = 0.51
        • p(mynegative)=0.20p(my | negative) = 0.20
        • p(grandsonnegative)=0.02p(grandson | negative) = 0.02
        • p(lovednegative)=0.08p(loved | negative) = 0.08
        • p(itnegative)=0.4p(it | negative) = 0.4
        • p(negativex)p(negative)p(mynegative)p(grandsonnegative)p(lovednegative)p(itnegative)=0.51×0.20×0.02×0.08×0.4=0.00006528p(negative | x) \propto p(negative) \cdot p(my | negative) \cdot p(grandson | negative) \cdot p(loved | negative) \cdot p(it | negative) = 0.51 \times 0.20 \times 0.02 \times 0.08 \times 0.4 = 0.00006528
      • Z=p(x)=kp(Ck)p(xCk)Z = p(x) = \sum_{k} p(C_k) p(x | C_k)
        • =p(positive)p(my,grandson,loved,itpositive)+p(negative)p(my,grandson,loved,itnegative)= p(positive) \cdot p(my, grandson, loved, it | positive) + p(negative) \cdot p(my, grandson, loved, it | negative)
        • =0.00014256+0.00006528=0.00020784= 0.00014256 + 0.00006528 = 0.00020784
        • the probability of xx, obtained by adding up its probabilities under each class.
      • p(positivex)=p(positive)p(my,grandson,loved,itpositive)Z=0.000142560.000207840.6867p(positive | x) = \frac{p(positive) \cdot p(my, grandson, loved, it | positive)}{Z} = \frac{0.00014256}{0.00020784} \approx 0.6867
      • p(negativex)=p(negative)p(my,grandson,loved,itnegative)Z=0.000065280.000207840.3133p(negative | x) = \frac{p(negative) \cdot p(my, grandson, loved, it | negative)}{Z} = \frac{0.00006528}{0.00020784} \approx 0.3133
    • Decision
      • y^=arg maxk{positive,negative}{k=positive:0.6867,k=negative:0.3133}\hat{y} = \argmax\limits_{k \in \{positive, negative\}} \{k = positive : 0.6867, k = negative : 0.3133\}
      • The sentiment is positive.
  • Spam detection
    • Probabilities learned from data:
      • P(Spam)=0.4P(\text{Spam}) = 0.4, P(Ham)=0.6P(\text{Ham}) = 0.6
      • P(freeSpam)=0.8P(\text{free}|\text{Spam}) = 0.8, P(freeHam)=0.1P(\text{free}|\text{Ham}) = 0.1
      • P(winSpam)=0.7P(\text{win}|\text{Spam}) = 0.7, P(winHam)=0.05P(\text{win}|\text{Ham}) = 0.05
    • New document: "free win"
      • Spam score: 0.4×0.8×0.7=0.2240.4 \times 0.8 \times 0.7 = 0.224
      • Ham score: 0.6×0.1×0.05=0.0030.6 \times 0.1 \times 0.05 = 0.003
        → Classified as Spam.

N-gram model

  • N-gram: a sequence of written symbols of length n
    • unigram, bigram, trigram
  • the probability of each symbol is dependent only on the n-1 previous symbols.
  • P(wjw1:j1)=P(wjwjn+1:j1)P(w_j|w_{1:j-1}) = P(w_j|w_{j-n+1:j-1})
  • P(w1:N)=j=1NP(wjw1:j1)j=1NP(wjwjn+1:j1)P(w_1:N) = \prod_{j=1}^{N} P(w_j|w_{1:j-1}) \approx \prod_{j=1}^{N} P(w_j|w_{j-n+1:j-1})

Examples of N-gram

  • W1:NW_{1:N} is "This article is on NLP"
    • N=5
    • Bigram (n-gram with n=2)
P("This article is on NLP")
/* full chain rule, */
= P("This") // j = 1
* P("article" | "This") // j = 2
* P("is" | "This article") // j = 3
* P("on" | "This article is") // j = 4
* P("NLP" | "This article is on") // j = 5
/* bigram approximation */
= P("This") // j =1
* P("article" | "This") // j = 2
* P("is" | "article") // j = 3
* P("on" | "is") // j = 4
* P("NLP" | "on") // j = 5
Step (j)Word (W_j)BigramTrigram
1This(P(This))(P(\text{This}))(P(This))(P(\text{This}))
2article(P(articleThis))(P(\text{article} \mid \text{This}))(P(articleThis))(P(\text{article} \mid \text{This}))
3is(P(isarticle))(P(\text{is} \mid \text{article}))(P(isThis, article))(P(\text{is} \mid \text{This, article}))
4on(P(onis))(P(\text{on} \mid \text{is}))(P(onarticle, is))(P(\text{on} \mid \text{article, is}))
5NLP(P(NLPon))(P(\text{NLP} \mid \text{on}))(P(NLPis, on))(P(\text{NLP} \mid \text{is, on}))

Transformer

  • given a set of input vectors (tokens), attention lets each token look at the others and form a weighted average of them.
  • The weights are data-dependent.
  • The model learns the weights representing which tokens are relevant to which other tokens.

X=Input Embeddings,Q=XWQ,K=XWK,V=XWVX = \text{Input Embeddings},\qquad Q = X W_Q,\quad K = X W_K,\quad V = X W_V

  • Scores: S=QKTdS = \frac{QK^T}{\sqrt{d}}
    • dot products between every query and every key
    • the scale d\sqrt{d} keeps gradients stable
  • Weights: A=Softmax(S)A = Softmax(S)
    • row-wise softmax
    • each row sums to 1
  • Output: Attention(Q,K,V)=AVAttention(Q, K, V) = A V
    • each output token is a weighted sum of the value vectors, with weights determined by the attention scores.(how much to pay attention to each position)
  • Query (Q): What am I looking for?
  • Key (K): What is the label/address of the information I have?
  • Value (V): What is the actual information I want to convey?

Cross-Attention

  • look up relevant information in another sequence.
    • e.g. decoder attending to encoder outputs in translation
  • QQ from the current sequence, KK and VV from the other sequence.
    • Q=XtargetWQ,K=XsourceWK,V=XsourceWVQ = X_{\text{target}} W_Q, \quad K = X_{\text{source}} W_K, \quad V = X_{\text{source}} W_V
  • Cross-attention = Which source tokens are relevant to this target token?
  • Self-attention = Which other tokens in this sequence are relevant to this token?
  • headi=Attention(QWiQ,KWiK,VWiV)head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
  • MultiHead(Q,K,V)=Concat(head1,...,headh)WOMultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

Word representation

One-hot representation

  • represents each word as a vector of length N.
  • where N is the size of the vocabulary.
vocab = {he, is, singing, she, dancing, stage}

she = [0, 0, 0, 1, 0, 0]
is = [0, 1, 0, 0, 0, 0]
singing = [0, 0, 1, 0, 0, 0]
  • limitations
    • incredibly inefficient for large vocabularies
    • not embed any intrinsic meaning of words
    • unable to represent similarity between likely words
    • the representation of documents is sparse vectors
      • can cause challenges in computation

Word Embedding

  • represents individual words as vectors in a low-dimensional continuous space.
  • a distributed representation of a word.
  • generate a unique value for each word while using smaller vectors compared with one-hot encoding.
  • common vector dictionaries
    • word2vec
    • Glove (Global Vectors)

Contextual embedding

  • generates different vectors for the same word based on its context.
  • the word "bank" would have different embeddings in the sentences:
    • "He went to the bank to deposit money."
    • "She sat by the river bank and enjoyed the view."
  • BERT or GPT use deep neural networks to process a sequence of tokens.
  • Each token's embedding is computed by considering the token itself, its position in the sequence and the surrounding tokens, context.
  • captures both semantic and syntactic role in that specific context.

Part of Speech (POS) tagging

  • lexical category or tag that indicates the grammatical role of a word in a sentence.
  • parts of speech allow language models to capture generalizations such as "adjectives often modify nouns" or "verbs often follow subjects".
From the start , it took a person
IN DT NN , PRP VBD DT NN

with great qualities to succeed
IN JJ NNS TO VB
TagDescriptionExample
CCCoordinating conjunctionand, but
CDCardinal numberone, two
DTDeterminerthe, a
EXExistential therethere
FWForeign worddoppelgänger
INPreposition or subordinating conjunctionin, of
JJAdjectivebig, old
JJRAdjective, comparativebigger, older
JJSAdjective, superlativebiggest, oldest
LSList item marker1, 2, One
MDModalcan, will
NNNoun, singular or masscat, tree
NNSNoun, pluralcats, trees
NNPProper noun, singularJohn, London
NNPSProper noun, pluralSmiths, Londons
PDTPredeterminerall, both
POSPossessive ending's, s'
PRPPersonal pronounI, you, he
PRP$Possessive pronounmy, your, his
RBAdverbquickly, very
RBRAdverb, comparativefaster, better
RBSAdverb, superlativefastest, best
RPParticleup, off
SYMSymbol$, %, &
TOtoto
UHInterjectionoh, wow
VBVerb, base formbe, have
VBDVerb, past tensewas, had
VBGVerb, gerund or present participlebeing, having
VBNVerb, past participlebeen, had
VBPVerb, non-3rd person singular presenttalk, have
VBZVerb, 3rd person singular presenttalks, has
WDTWh-determinerwhich, that
WPWh-pronounwho, what
WP$Possessive wh-pronounwhose
WRBWh-adverbwhere, when
#Pound sign#
$Dollar sign$
,Comma,
.Sentence-final punctuation. ! ?

Example of POS tagging

  • Hidden Markov Model (HMM)
    • takes in a temproral sequence of evidence observations
    • predicts the lexical categories
  • Logistic regression
    • build 45 different logistics regression models, one for each part of speech
    • ask each model how probable it is that the example word is a member of that category, given the feature values for that word in its particular context.

Machine translation

  • translate a sentence from a source lnaguae to a trget language.
  • train an MT model: a large corpus of source/target sentence paris and hope that the trained MT model can accurately translate new sentences.
  • want to generate a target language sentence that corresponds to the source language sentence
  • the geneartion of each target word is conditional on the entire source sentence and on all previously generated target words.

Example of machine translation

  • a sequance-to-sequence model
    • use two RNNs (LSTM)
  • attentional sequence-to-sequence model
    • use attentino to create a context-based summarization of the source sentence into a fixed-dimension representation
  • transformer-based model
    • encoder: reads the source sentence and turns it into a rich, contextual set of vectors.
    • decoder: generates the target sentence one token at a time, using what it has generated so far and the encoder's representations.

Text generation

  • a subfield of NLP
  • leverages knowledge in computational linguistics and AI to automatically generate natural language texts
  • can satisfy certain communicativa requirements

Example of text generation

  • Classifier based on word embeddings, e.g. RNN and LSTM
    • RNN: each input word is encoded as a word embedding vector xix_i, a hidden layer ztz_t, the classes are the words of the vocabulary
      • the output yty_t will be a softmax probability distribution over the possible values of the next word in the sentence.
    • LSTM: can choose to remember som parts of the input, copying it over to the next time step, and to forget toher parts.
  • Pre-trained languaeg model using deep learning
    • BERT
    • GPT-X, Generativ ePre-trained Transformer

Transfer learning

  • experience with one learning task helps an agent learn better on another task.
  • pretraining: a form of transfer learning in which we use a large amount of shared general-domain language data to train an initial version of an NLP model.
    • we can use a smaller amount of domain-specific data to refine the model
    • the refined model can learn the vocabulary, idioms, syntactic structure, and other linguistic phenomena that are specific to the new domain.
  • For NN, learning consist of adjusting weight, so the most plausible approach for transfer learning is to copy over the weights learned for task A to a network that will be trained for task B.
    • The weights are then updated by gradient descent in the usual way using data for task B.
  • the popularity of transfer learning is the availability of high-quality pretrained models.
  • will want to freeze the first few layers of the pretrained model
    • these layers serve as feature detectors that will be useful for new model.
    • new data set will be allowed to modify the parameters of the higher levels only
      • these are the layers that identify problem-specific features and do classification.

FDA +009

· 2 min read

Linear Separability

  • the data is linearly separable if
    • it can be separated by a point on a single dimension line of data points
    • by a line on a two-dimensional representation of data points
    • by a plane (a two-dimensional surface) in a three-dimensional representation of the data points
  • If it is non-linearly separable, look at other options for classification.

Hyperplane

  • the conceptual divide between data
  • Weight vector: represented and generated in weight space.
  • Choosing the hyperplane
    • Minimum distance between samples
    • Least-squares method
    • Gradient Descent

Artificial Neural Networks, ANN

  • Strength: for high dimensionality problems, the complex relations between variables
  • Weaknesses: theoretically complex, computationally intensive, needs large data sets, complicated to implement
  • Kinds of ANN
    • Perceptrons, Multilayer Perceptrons
    • Deep learning neural networks
    • Kohonen networks
    • Convolutional neural networks
    • Radial Basis Functions
    • Recurrent neural networks
    • Support Vector Machines
    • Competitive learning
    • Boltzmann machines

Multilayer Perceptrons, MLP

  • Challenges
    • Decide on the network topology.
      • how many hidden layers are needed
      • how many neurons in each of the hidden layers
    • Find values for the weights which make the network produce the correct output values for the given input values.
  • Neural networks only accept numeric data.
    • need to convert the categorical into numeric.
    • One-Hot encoding, Thermometer encoding.
  • high values may need to be scaled into a similar range as neural networks
    • need to do a log transform to pull the values into a target range.
    • [-1, +1] or [0, 1]
  • input neurons should be as small as possible.
    • adding neurons -> more parameters and weights -> amplify any bias. (overtrain the network)
  • one categorical attribute may have many attribute values
    • each adding a parameter -> adding risk of overtraining

Resilient Propagation, RProp

  • directly adjust the weight step based on the local gradient information
  • introduces a weight update value uiju_{ij} for each weight wijw_{ij}
  • updates it based on the sign of the partial derivative of the error with respect to the weight
  • update value uiju_{ij}
    • if sign changes (i.e. jumped over local minima) -> uiju_{ij} slightly decreased.
    • if sign remains the same -> uiju_{ij} slightly increased.
  • weight wijw_{ij}
    • if derivative is +ve+ve (i.e. error increasing) -> wijw_{ij} decreased by uiju_{ij}
    • if the derivative is negative (i.e. error decreasing) -> wijw_{ij} increased by uiju_{ij}
    • if the derivative changes sign, the last weight update is reverted. (backtracks the last weight update)

KNIME

  • RProp MLP Learner + MultiLayerPerceptron Predictor
  • MultilayerPerceptron + Weka Predictor (back propagation with momentum)

FDA +008

· 8 min read

Mearues of performance

Accuracy

A=TP+TNTP+TN+FP+FNA = \frac{TP + TN}{TP + TN + FP + FN}

  • the ratio of correct predictions to all predictions
  • it is the number of true positives and true negatives (the correct predictions) divided by the total number of predictions.

Error rate

E=FP+FNTP+TN+FP+FNE = \frac{FP + FN}{TP + TN + FP + FN}

  • the ratio of incorrect predictions to all predictions
  • It is equivalent to 1Accuracy1 – Accuracy.
  • It is the number of false positives and false negatives divided by the total number of predictions.

True positive rate, Recall, Sensitivity

TPR=TPP=TPTP+FNTPR = \frac{TP}{P} = \frac{TP}{TP + FN}

  • the proportion of actual positives for which the test result is positive.
  • it shows how sensitive the model is to detecting positive instances.
  • the ratio of true positives to all positives
  • the data points that were correctly predicted as positive divided by the number of true positives and false negatives.

False positive rate, Fall Out, False Alarm

FPR=FPN=FPFP+TNFPR = \frac{FP}{N} = \frac{FP}{FP + TN}

  • the proportion of actual negatives for which the test result is positive.
  • It is equivalent to 1Specificity1 – Specificity.
    • large values of specificity indicate small false negative rates.

The true negative rate, Specificity

SPC=TNR=TNN=TNFP+TNSPC = TNR = \frac{TN}{N} = \frac{TN}{FP + TN}

  • the proportion of actual negatives for which the test result is negative
  • It shows how well the model does at identifying actual negatives as negative.

The false negative rate, Miss Rate

FNR=FNTP+FNFNR = \frac{FN}{TP+FN}

  • the proportion of the actual positives for which the test result is negative.
  • It is equivalent to 1Sensitivity1 – Sensitivity.
    • large values of sensitivity indicate small false positive rates.
  • It is a common trick that often change classifiers to bias them towards making type 1 FP versus type 2 FN errors.
    • This can often change classifier to bias towards making FP errors rather than FN errors.
    • It depends on which are worse to make
    • They are biased if there are different numbers of items in each class

Precision, Positive Predictive Value

  • PPV, Positive Predictive Value
  • a measure of how accurate and precise the positive predictions are
  • It is the ratio of true positives to predicted positives.

Accuracy and Error rate

  • in practice, the previous measures don't always work well
    • they are biased, especially if there are different numbers of items in each class.
  • if the data is imbalanced, it's much better to use a true positive rate or true negative rate instead.

F1

F=2PrecisionRecallPrecision+RecallF = 2 * \frac{Precision * Recall}{Precision + Recall}

  • known as F-score or F-measure
  • a measure of the accuracy of the test
  • It is the harmonic mean of the recall and precision, where an F1 score reaches its best value at 1 (perfect precision and recall).
  • It allows the Recall and Precision to be assessed in the same calculation.
Predicted \ ActualPositiveNegativeTotal
Positive1105115
Negative106070
Total12065185
  • Accuracy=110+60185=0.91Accuracy = \frac{110 + 60}{185} = 0.91
  • ErrorRate=10+5185=0.08Error Rate = \frac{10 + 5}{185} = 0.08
  • TruePositiveRate(Sensitivity/Recall)=110115=0.95True Positive Rate (Sensitivity/Recall) = \frac{110}{115} = 0.95
  • FalsePositiveRate=1070=0.14False Positive Rate = \frac{10}{70} = 0.14
  • TrueNegativeRate(Specificty)=6070=0.85True Negative Rate (Specificty) = \frac{60}{70} = 0.85
  • FalseNegativeRate=5115=0.04False Negative Rate = \frac{5}{115} = 0.04
  • Precision=110120=0.91Precision = \frac{110}{120} = 0.91
  • F1=(20.910.95)/(0.91+0.95)=0.92F1 = (2 * 0.91 * 0.95) / (0.91 + 0.95) = 0.92

ROC curve

Receiver Operating Characteristic Curve

  • a graphical plot that explains how well a binary classifier system performs as the threshold at which it calls a data point as positive is varied.
  • ROC graphs were originally used in the communications area to look at false alarm rates.
  • The x-axis is the false positive
  • The y-axis is the true positive rate (sensitivity, recall)
  • ROC graphs contain all the information in the confusion matrix.
    • TPR(=TP/(TP+FN)), FPR(=FP/(FP+TN))
  • a visual tool to compare trade-offs between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly identified.
  • an essential evaluation metric for checking the performance of a classification model.

AUC

Area Under the ROC Curve

  • AUC 0: the model predicts a negative class as a positive class and vice versa.
  • AUC 0.5: the model has no discriminative capacity to differentiate between negative class and positive class
    • diagonal line from (0,0) to (1,1)
  • AUC 0.7: 70% chance that the model will be able to differentiate between the positive and negative classes.
  • AUC 1: the model predicts all positive class as positive class and all negative class as negative class.
  • ROC is a probability curve and AUC represents the degree of separability
  • It shows how capable the model is of differentiating between the classes.

Training/testing split

  • train-test split procedure is used to estimate the performance of machine learning algorithms
  • should not be used
    • when the dataset is small
    • when the dataset is imbalanced
    • where additional configuration is required
    • train_test_split(..., stratify=y)
  • most common approach to validate model performance
    • traning data is 80%
    • testing data is 20%

k-fold cross-validation

  • helps select the model which will perform best on unseen data
  • overcoming the problem of overfitting and underfitting.
  • a parameter called k defines the number of portions that a given data sample will be split into.
  • k-fold cross-validation has less bias because it ensures that every single discovery from the main dataset has a chance to appear in both training and test sets.
Iteration of a 5-fold cross-validation

# X = Test set
# T = Train set
1st fold : XXXX TTTTTTTTTTTTT
2nd fold : TTTT XXXX TTTTTTTT
3rd fold : TTTTTT XXXX TTTTTT
4th fold : TTTTTTTT XXXX TTTT
5th fold : TTTTTTTTTTTTT XXXX
k = 3
dataset = [1, 2, 3, 4, 5, 6]

fold1 = [5, 2]
fold2 = [1, 3]
fold3 = [4, 6]
  • Model1 will be trained on fold1 and fold2, and tested on fold3
  • Model2 will be trained on fold2 and fold3, and tested on fold1
  • Model3 will be trained on fold1 and fold3, and tested on fold2
  • we can take the accuracy as the average of all rounds to get the final accuracy.

Bias-variance decomposition

a formal method for understanding the prediction error of a model.

  • the average of the distance between the target and model predictions.
  • simple model: High bias, Low variance → Underfitting
    • not complex, high error component
  • complex model: Low bias, High variance → Overfitting
    • quite sensitive to the specific training set

Bias

the average of the distance between the target and model predictions.

  • how well the model can do for any training set.
  • the difference between the expected value and the parameter that we want to estimate
  • Bias=E[θ^]θ\text{Bias} = E[\hat{\theta}] - \theta
    • If the bias is exactly zero, the estimator is unbiased
    • If the bias is greater than zero, the estimator is positively biased
    • If the bias is less than zero, the estimator is negatively biased

Variance

the deviation between the average prediction value and the predicted value.

  • Classifier error is also affected by variability in the training data because different training sets lead to different decision boundaries.
  • High variance: it produces different results for different training sets.
    • sensitive to the particular training set.
    • models with less parameters tend to have lower variance.
  • Var(θ^)=E[(θ^E[θ^])2]Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]

Noise

changes in the target value

  • objects with the same attribute values leading to different class labels.
  • these errors are unavoidable even when you know the correct decision boundary.

Evaluating

Underfitting

the model did not capture enough patterns in the data

  • The model provides poor performance on both the training and the test set.
  • Reasons:
    • The training model is not trained as tightly as possible.
    • The model is not able to learn more.
      • model is not suitable for the task.
  • Avoid underfitting by:
    • using more training data
    • choosing / training a more complex model
    • increase the number of parameters in the model, the type or complexity of the model, or the traninig time till a cost function is minimized.

Overfitting

the model captures noise and patterns which do not generalize well to new data.

  • The model has extremely good performance on the training set, but poor performance on the test set.
  • Reasons:
    • the training data is not a perfect standard
  • Avoid overfitting by:
    • regularization
    • pruning (parameters, strucutres of classifiers)
    • reducing the descriptive length (minimize the sum of the model's complexity and the dscription of the traning data)
    • optimization (use a separate subset for validating the model)
    • expansion (use or generate more training data)
      • adding synthetic samples to the dataset.

K-NN

k-Nearest Neighbors

  • a non-parametric technique
    • not involving any assumptions as to the form or parameters of a frequency distribution
  • supervised learning classifier
    • uses proximity to make classifications or predictions about the grouping of an individual data point
    • defining new cases based on similarity measures (e.g., distance functions).
    • Euclidean: d(x,y)=(xiyi)2d(x, y) = \sqrt{\sum_{} (x_i - y_i)^2}
    • Minkowski: d(x,y)=(xiyip)1/pd(x, y) = (\sum_{} |x_i - y_i|^p)^{1/p}
    • Hamming: d(x,y)=xiyid(x, y) = \sum_{} |x_i - y_i|
  • In weighted k-NN, weigh the votes according to distance
    • wi=1d2w_i = \frac{1}{d^2}
  • If k is too small, it may be sensitive to noise points
  • If k is too large, its neighbourhood may include points from other classes
  • Choose an odd value for k, to eliminate ties

Industrial Applications of KNN

  • Retail business data analysis: Identify customer patterns and generate business value.
  • Security and operational management: Simplify daily operations such as theft prevention.
  • Credit card usage monitoring: Detect unusual patterns to identify fraudulent transactions.
  • Transaction scrutiny software: Spot unfamiliar patterns and flag suspicious activity.
  • Point-of-sale (POS) data analysis: Analyse register data for operational insights.

IAI +007

· 8 min read

Computer vision

  • a field of artificial intelligence enabling computers and systems to derive meaningful information from digital images, videos and other visual inputs.
  • vision is a perceptual channel that accepts a stimulus and reports soome representation of the world
  • computer vision enables intelligent agents to see, observe and understand of environment

Core problems of CV

  • reconstruction: an agent builds a mode lfo the world from an image or a set of images
  • recognition: an agent draws distinctions amont the objects it encounters based on visual and other information.
    • image classification
    • object detection
    • image segmentation

Classic approaches to object recognition problems

  • feature-based object recognition approach
    • works well for faces looking directly at the camera.
  • pattern-element-based object recognition approach
    • a useful abstraction is to assume that some objects are made up of local patterns which tend to move around with respect to one another.
    • we can model objects with pattern elements.

Modern approaches to object recognition problems

  • deep learning networks
    • to recognition problems enables that features can be automatically learned and extracted from raw image data compares with the manual feature extraction in the classic approaches.
    • AlexNet, VGGNet, GoogleNet, ResNet, DenseNet, EfficientNet, RegNet...
  • basic models and derived models
    • YOLO, SSD, RetinaNet, R-CNN...

Evaluation metrics for image classification

MetricDefinitionUse case
Accuracythe percentage of correcly predicted labels out of all predictions madecommonly used in balanced datasets but can be mis leading in imbalanced classes
Precisinothe ratio of correctly predicted positive observations to the total predicted positives.useful when the cost of false positives is high (e.g., spam detection)
Recall (Sensitivity or True Positive Rate)the ratio of correctly predictted observations to all the actual positivesimportant when the cost of false negatives s high (e.g., disease detection)
F1 Scorethe harmonic mean of Precision and Recallused when you need to balance precision and recall, especially in imbalanced datasets
Specificity (True Negative Rate)the ratio of correctly predicted negative observations to all the actual negatives.important when false positives should be minimized (e.g. medical tests for diseases)
Confusion Matrixa table used to describe the performance of a classification model by showing the true positive, false positive, true negative, and false negative countsprovides a comprehensive understanding of a models' performance across all classes
ROC Curve (Receiver Operation Characteristic)a graphical representation of the classifier's performance across all thresholds, plotting the true positive rate (recall) against the false positive rate (1 - specificity)used to evaludate binary classirifres and compare models
AUC (Area Under the Curve)the area under the ROC curve, providing a single number summary of the models' ability to discriminate between positive and negative classesuseful for evaluating binary classification models, particularly when dealing with imbalanced datasets.

Confusion Matrix

Actual class ➡️
Predicted class ⬇️
PositiveNegativeMetric
PositiveTP: True PositiveFP: False PositivePrecision: TPTP+FP\frac{TP}{TP + FP}
NegativeFN: False NegativeTN: True NegativeNegative Predictive Value: TNTN+FN\frac{TN}{TN + FN}
MetricRecall or Sensitivity: TPTP+FN\frac{TP}{TP + FN}Specificity: TNTN+FP\frac{TN}{TN + FP}Accuracy: TP+TNTP+TN+FP+FN\frac{TP + TN}{TP + TN + FP + FN}

ROC Curve

  1. Sort predicted probabilities
  2. Try multiple thresholds
  3. For each threshold, compute predicted labels
  4. Compute TPR and FPR
  5. Plot ROC curve (FPR vs TPR) i. TPR=TPTP+FNTPR = \frac{TP}{TP + FN} ii. FPR=FPFP+TNFPR = \frac{FP}{FP + TN}
  6. Compute AUC as area under ROC curve
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

Evaluation metric for object detection

MetricDefinitionUse case
Intersection over Union (IoU)measures how much the predicted bounding box overlaps with the ground truth.IoU=Area of OverlapArea of UnionIoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
Precisionthe ratio of correctly predicted postive observations to the total predicted positivesPrecision=TPTP+FPPrecision = \frac{TP}{TP + FP}
Recallthe ratio of correctly predicted observations to all the actual positivesRecall=TPTP+FNRecall = \frac{TP}{TP + FN}
F1 Scorethe harmonic mean of Precision and RecallF1=2×PrecisionRecallPrecision+RecallF1 = 2 \times \frac{Precision \cdot Recall}{Precision + Recall}
Average Precision (AP)Precision-Recall Curve: Precision vs Recall at different thresholds
AP: Area under the Precision-Recall curve (per class)
AP=01P(r)drAP = \int_{0}^{1} P(r) \, dr
Mean Average Precision (mAP)mean of AP across all classesprovides a comprehensive understanding of a models' performance across all classes
AP@[.50:.95]COCO benchmarkAverages AP at IoU threshold from 0.5 to 0.95 in steps of 0.05
  • AP is computed by summing trapezoid areas under the Precision-Recall curve.

Convolutional Neural Network (CNN)

  • contains spatially local connectinos at lest in the early layers
  • has patterns of weights that are replicated across the units in each layer.
  • use a kernel to detect patterns of weights that is replicated across multiple local regions in an image
  • use convolution that applies the kernel to the pixels of the image
Input

[Conv → ReLU][Pooling]

[Conv → ReLU][Pooling]

Flatten (2D → 1D 벡터)

Fully Connected Layer

Output Layer (Softmax/Sigmoid)

zi=j=1lkjxj+(i1)sz_i = \sum_{j=1}^{l} k_j \cdot x_{j + (i-1)s}

  • ziz_i: the ii-th output element (convolution result)
  • jj: index inside the kernel (from 1 to ll)
  • ll: kernel size (number of elements in the kernel)
  • kjk_j: the jj-th kernel weight (filter element)
  • xx: input sequence (the original data)
  • xj+(i1)sx_{j+(i-1)s}: the input element aligned with kernel position jj when shifted by stride
  • ss: stride (step size of the kernel movement)
  • (i1)s(i-1)s: total shift in the input for the ii-th convolution

Receptive Field

  • the receptive field of a neuron is the portion of the sensory input that can affect aht neuron's activation.
  • In CNNs, the receptive field of a unit in the first hidden layer is small.
  • just the size of the kernel, e.g., 3x3 or 5x5.
  • In the deeper layers of the network, it can be much larger.

1D CNN

RL=RL1+(kL1)i=1L1siR_{L} = R_{L-1} + (k_{L} - 1) \cdot \prod_{i=1}^{L-1} s_i

  • RLR_L: receptive field size at the LthL-th layer
  • kLk_L: kernel size at the LthL-th layer
  • sis_i: stride at the ithi-th layer
LayerKernel kStride sCalculationResult RF
Conv1311 + (3-1)*13
Pool1223 + (2-1)*14
Conv2314 + (3-1)*28

Pooling

  • works like a convolutional layer, with a kernel size ll and stride ss, but the operation is applied is fixed rather than learned.
  • no activation fucntion is associated with the pooling layer.
  • common forms of pooling
    • average pooling
    • max pooling: saying a feature exists somewhere in the unit's receptive field.

Dropout

  • a way to reduce the test-set error of a network to increase its ability of generalization.
  • makes a etwork herder to fit the traninig set.
  • the network is created by deactivating a randomly chosen subset of the units (dropout rate).
  • cannot explain why it works but the results is better

Batch Normalization

  • modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD).
  • Batch normlization rescales the values generated at the internal layers of the network from the examples within each minibatch.
    • it standardizes the input to a layer for each mini-batch.
  • standardizes the mean and variance of the values.
  • maeks it much simpler to train a deep network.

Tranining in a CNN

Crossentropy(y,y^)=iyilog(y^i)Cross-entropy(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)

  • Forward pass
  • Backward pass
  • Parameter update

Variants of CNNs

AlexNet

AlexNet

AlexNet Layers

ResNet

  • stands for a residual neural network.
  • was designed to enable hundreds or thousands of convolutional layers.
  • Residual neural networks do this by utilizing skip connections, or shortcurts to jump over some layers.
  • was an innovative solution to the "vanishing gradient" problem.
x ────────────────► (+) ──► y
│ ▲
▼ │
[Conv → ReLU → Conv] = F(x)

VGGNet

  • increases the depth of the network through adding more convolutional layers by using small convolution filters (3x3) while other parameters are fixed.

VGGNet

Derived model for object detection

YOLO

  1. Split an images to S times S blocks.
  2. Apply object classification to each block and get confidence score of different objecxt for each block.
  3. Based on the class probability map to locate objects.

YOLO

Calculation AP and mAP

  1. Gather Detection results
  2. Match Predictions to Ground Truth
  3. Compute Precision and Recall to generate a set of (recall, precision) points
  4. Build the Precision-Recall (PR) Curve (presioin decreases as recall increases)
  5. Smooth the Precision-Recall Curve
  6. Calculate AP
  7. Extend to mAP

FSD +008

· 5 min read

OOP Principles

Encapsulation

  • bundles data attributes (fields) with methods that use the data in a single unit called "Object"
  • hides sensitive data attributes by declaring the fields private.
  • fields can be declared protected in the parent class, allowing access from the child class.
  • exposes the data attributes values only through public getters/setters to allow access or modification of the field values.

Intheritance

class A:
pass

class B(A):
pass

class C(A):
pass
  • Superclass defines common method signatures (with/without implementation) and fields.
  • Subclasses provide the implementations for (or override) these method signatures
  • private attributes cannot be directly accessed from the child classes.
  • to refer to superclass properties (fields, methods) or constructor, use the keyword super()

Polymorphism

  • means many forms.
  • permits an object to have multiple types.
  • allows object of different types but with a common parent to be stored in the same collection.
  • enables inherited methods from the parent to perform different tasks when called by subclasses.

Abstraction

  • the process of hiding the implementation details and exposing only the necessary behavior to the user.
  • abstract class must at least contain one abstract method.
  • abstract class ccan have concrete methods.
  • abtract methods are only prototypes in the parent class, the actual implementation is provided by the subclasses that inherit the abstract class.
  • rules
    • if a class contains at least on abstract methods, it should be declared abstract.
    • if another class inherits an abstract class, it must implement all the abstract methods of that class.
from abc import ABC, abstractmethod

class Person(ABC):
def __init__(self, name):
self.name = name

@abstractmethod
def show(self):
pass

def show_name(self):
print(f"Name: {self.name}")

class Student(Person):
def __init__(self, name, id):
self.id = id
super().__init__(name)

def show(self):
super().show_name()
print(f"ID: {self.id}")
  • interfaces are complete abstract classes declared with the interface keyword.
  • only contain prototype (abstract) methods with empty body code.
  • cannot be instantiated nor inherited as they do not have constructors.
  • a class can implement multiple interfaces.
  • a class implementinig an interface must provide an implementation for all its abstract methods.
  • add an access layer to further hide the methods impelmentation.
  • act as a middleware inside the program, allowing human, machines, and other software to interact with the program's functionalities without knowing the implementation details.
# interface A is a fully abstract class
# interface A must inherit ABC
# class B(A) must implement all abstract methods of A
# interface A methods have the @abstractmethod decorator with pass as body-code
from abc import ABC, abstractmethod

class Person(ABC):
@abstractmethod
def show_info(self):
pass

class Student(Person):
def __init__(self, name, id):
self.name = name
self.id = id

def show_info(self):
print(f"Name: {self.name}, ID: {self.id}")

OOP Design Rules

  • How to split the code into seperate classes and preserve encapsulation
  • How to organizae the code into methods to hide the implementation details and expose the behavior
  • How the objects interact at runtime
  • How to name the class entities

5 Design Rules

  • Encapsulation
    • hide fields behind methdos
    • requires fields to be private while allowing methods to be public
    • requires a methods related to a field (such as access, modify, use) be defind in the same class.
  • Push code to the right
    • requires the code/methods used by objects of a class be written in the same class.
    • ensures methods are written for reusability and placed in the correct class so they can make use of the class's fields.
    • determines which class is responsible for defining the methods needed to achieve the program's goals.
  • Spread plans across classes
    • involves planning the distribution of code across multiple classes from the start.
    • focuses on distributing responsibilities logically among different classes, so that each class has a clear and focused role.
    • by convention, this rule requires using the same method name (for the same goal) across all classes.
  • Hide by default
    • hiding implementation details within a class and exposing only the necessary functionality to the outside.
    • helps in achieving encapsulation and abstraction.
    • promotes better design and maintainability.
  • Follow naming conventions
    • use nouns to name fiels.
    • use nouns to name functions.
    • use verbs to name procedures.
    • if an entitiy is compsed of two or more words.
      • use camelCase/snake_case for fields and methods.
      • use PascalCase for class names.
  • consolidates the action-trieggers of a program into a single method.
  • common design choice in applications with user interaction.
  • the menu methods is executed in the program's main method.
    • offers users interactive CLI command choices.
  • menu() requires a read-function in a while loop, allowing the menu() tor repeatedly read inpus from STDIN.
<condition> = read_choice()

loop (<condition>):
check (<condition>):
case 1: do task <action_1()>
case 2: do task <action_2()>
...
case n: do task <action_n()>
default: <alternative-action>
menu(self):
choice = input("Enter choice: (d/w/b/h/x): ").lower()
while choice != "x":
match choice:
case "d":
self.deposit()
case "w":
self.withdraw()
case "b":
self.show_balance()
case "h":
self.show_history()
case "x":
self.exit()
case _:
self.show_invalid_choice_message()

choice = input("Enter choice: (d/w/b/h/x): ").lower()

FDA +007

· 7 min read

Classification and Prediction

Classification

classifies data based on the training set and the class labels and uses it in classifying new data

  • Model construction (learning)
  • Model evaluation (accuracy)
  • Model use (classification)

Prediction

predicts categorical class labels based on unseen data

  • models continuous-valued functions

Classfication Methods

  • Decision tree induction
  • Bayesian classification
  • Nearest neighbour classification, case-based reasoning
  • Neural networks
  • Support Vector Machines
  • Ensemble methods

Issues around classification and prediction

  • Predictive accuracy
  • Speed and scalability
  • Robustness
  • Interpretability
  • 'Goodness' of classifier

Classification process

Decision Tree

  • Builds trees to describe data
  • Easy to translate into rules
  • Robust to niosy data
  • able to build disjunctive expressions
  • Inductive bias prefers small trees over larger.

Decision Tree Methodologies

  • Itertative Dichotomiser 3 (ID3)
  • C4.5
  • Classification And Regression Trees (CART)
  • Chi-square Automatic Interaction Detection (CHAID)
  • Multivariate Adaptive Regression Splines (MARS)

Decision Tree Forms

  • Balanced
    • each branch has the same depth from the root to leaves.
    • all nodes have the same number of splits
  • Deep
    • some nodes have different levels, wherein some of them are split into more branches.
  • Bushy
    • split into multi-way from the root
    • undesirable because the split may lead to small numbers of instance in each leaf node.

Entropy and Information Gain

Entropy

measures the amount of disorder / (im)purity in a collection of things. i.e. the unpredictability of the data.

Entropy(S)=p+log2(p+)plog2(p)Entropy(S) = -p_{+} log_2(p_{+}) - p_{-} log_2(p_{-})

  • Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.
  • SS stands for total number of samples
  • P+P_{+} denotes the likelihood of a yes (positive) answer.
  • PP_{-} denotes the likelihood of a no (negative) outcome.

E(S)=i=1cpilog2(pi)E(S) = \sum_{i=1}^{c} -p_i log_2(p_i)

Information Gain

measures how well a given attribute separates the training examples according to the target classification.

  • The larger the information gain is, the stronger the feature will be.

Gain(S,A)=Entropy(S)vValues(A)SvSEntropy(Sv)Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

  • SS = set of training examples
  • AA = the particular attribute to be tested
  • Values(A)Values(A) = the set of values for the attribute A
  • SvS_v = subset of SS with attribute AA having value vvs

Iterative Dichotomiser 3

  • constructs trees in a top-down manner.
  • check each instance attribute with a statistical test to see how well it alone classifies (splits) the training examples.
  • this becomes the root node.
  • descendent is created for each possible value of the attribute and training examples split to the appropriate descendent.
  • repeat the procedure for each descendent.
  • the algorithm is going to issue recursion on each of the partitions.
  • the output of this algorithm is the creation of a Model.
function id3 (examples, target, attrs):
create root node for tree

if examples all +ve, return root with label=+
if examples all -ve, return root with label=-
if attrs is empty
return root w/ label=most common value of target in examples else
else
A ← attribute from attrs that best splits examples
root ← A for each possible value, vi , in A
add a new branch below root corresp. to the test A = vi
examples_v ← the subset of examples with A = vi

if examples_vi is empty
add a leaf node below branch w/ label = most common value of target from examples
else below the branch add the subtree given by
id3(examples_vi , target, attrs - {A})

return root
  • ID3 can be seen as a search through a space of hypotheses for one that matches the training data.
  • It's a greedy search
  • Hypothesis space = all the possible trees
  • Simple to complex, hill climbing search
  • Complete search = decision tree can represent all possible hypotheses
  • Maintains only one current hypothesis
    • cannot find alternative decision trees
  • No backtracking, maybe stuck in local optima
  • Uses all training data at each step
    • less sensitive to errors in individual examples

Inductive Bias

  • Shorter trees preferred
  • High information gain near the root
  • cos' simple to complex search

Issues with decision trees

Overfitting training data

  • occurs when a tree gives higher accuracy on the training data than another tree, but lower accuracy on the unseen data.
  • because the training data is noisy or not representative of the unseen data.
  • can measure how well the tree generalizes by checking error on test data.

Dividing the data

  • Training set
    • used to build the initial model
    • may need to enrich the data to get enough of the special cases
  • Cross validation set
    • used to adjust the initial model
    • used to work out the correct values of parameters in model
    • models can be tweaked to be less dependent on idiosyncrasies in the training data to be a more general model
    • idea is to prevent over-training (i.e. finding patterns where none exist)
  • Test set
    • used to evaluate the model performance

Avoiding Overfitting

  • stop growing the tree once the test error decreases
  • grow the tree as normal (i.e. wit hoverfitting) then post-prune it.
  • use a separate set of data apart from training to test when to prune nodes (training & cross-validation set)
  • use all data to train, but apply a statistical test whether to expand/prune a node.
  • use an explicit complexity measure (e.g. Minimum Description Length, MDL) to trade off accuracy vs complexity.

Reduced-Error pruning

  • build tree
  • consider each node in tree for pruning
  • pruning
    • remove subtree
    • make into a leaf
    • assign label as most common class in associated training examples.
  • if pruned tree has as good error on the cross validation set as the unpruned tree, do the prune.
  • keep pruning until the error on cross validation set increases.

Cross validation set

  • the main difficulty comes when you don't have much data and need it all for training.
  • k-fold cross-validation or leave-one-out training can help.

Rule Post-Pruning

  • converts the decision tree to rules
  • removes preconditions that do not worsen the accuracy (on the cross validation set or with a statistical test)
  • sorts the rules by estimated accuracy and uses this order when classifying new data

Continuous Valued Variables

  • use real valued attributes for tests at nodes.
  • Dynamically define new discrete attributes that partition the continuous attribute.
  • discrete: A<v=trueA < v = true and A>=v=falseA >= v = false
  • the boundary poins can be estimated from the training data.

The confusion matrix

  • a basic method of evaluation of classifiers
  • The columns have numbers associated with the actual number of positive data points in the test set and the actual number of negative data points in the test set
Predicted PositivePredicted Negative
Actual PositiveTrue Positive (TP)False Negative (FN)
Actual NegativeFalse Positive (FP)True Negative (TN)
  • TP: the number of correct prediction of positive samples.
    • the number of data points in the test set that were positive and the classifier correctly assigned them to the positive group
  • TN: the number of correct prediction of negative samples
    • the number of data points in our test set that was actually negative and predicted as negative by the classifiers
  • FP: the number of incorrect predictions of positive samples
    • the value of data points in our test set that was actually negative but were predicted by the classifier as positives.
    • These ones are obvious errors, which is called a type 1 error.
    • type I error
  • FN: the number of incorrect prediction of negative samples.
    • the number of data points in our test set that was actually positive but were predicted by the classifier as negative
    • usually called the type 2 error, and when we are trying to build a classification.
    • might have a trade-off between the number of the type 1 error or the type 2 error.
    • type II error

Vocabulary for AI 008

· 3 min read

Vocabulary & Expressions

Term/ExpressionDefinitionSimpler ParaphraseMeaning
invariancethe quality of remaining unchanged when something else changesunchangingness불변성
disjunctiona logical operation that outputs true whenever at least one of its inputs is trueor operation논리합, OR
convergencethe process of coming together to a common pointcoming together수렴
impracticallynot in a practical or realistic mannernot practical비현실적으로
receptora cell or group of cells that receives stimuli and transmits them to sensory nervessensor수용체, 감각기
sensationthe process of sensing our environment through touch, taste, sight, sound, and smellfeeling감각
diffusespread out over a large area; not concentratedspread out확산되다
albedothe proportion of the incident light or radiation that is reflected by a surfacereflectivity반사율
color constancythe feature of the human color perception system which ensures that the perceived color of objects remains relatively constant under varying illumination conditionsconsistent color perception색채 항등성
occlusionthe blocking of light or other radiation by an objectblockage폐색, 차폐
deformationthe action or process of changing in shape or distorting, especially through the application of pressuredistortion변형
foreshorteningthe visual effect or optical illusion that causes an object or distance to appear shorter than it actually is because it is angled toward the viewerperspective shortening단축, 단축법
courtesyprovided at no costfree of charge무료 제공
apprenticeship learninga type of learning where an agent learns to perform tasks by observing and imitating a more experienced agentlearning by imitation도제 학습
consolidatemake (something) physically stronger or more solidstrengthen강화하다
aquamarinea light bluish-green colorlight blue-green아쿠아마린
fulvousa dull yellowish-brown colordull yellow-brown황갈색의
verbatimin exactly the same words as were used originallyword for word말 그대로, 축어적으로
lexiconthe vocabulary of a person, language, or branch of knowledgevocabulary어휘집
adherentsomeone who supports a particular party, person, or set of ideassupporter지지자
syntactic consitituenta word or a group of words that function as a single unit within a hierarchical structuresentence unit구문 성분
pragmaticsthe branch of linguistics dealing with language in use and the contexts in which it is usedlanguage use화용론
casea grammatical category that marks the relationship between a noun and other words in a sentencegrammatical role
persona grammatical category that distinguishes between different participants in a conversation (e.g., first person, second person, third person)participant distinction인칭
numbera grammatical category that expresses count distinctions (e.g., singular, plural)count distinction
headthe main word in a phrase that determines its syntactic typemain word중심어
separabilitythe quality of being able to be separated or divideddivisibility분리 가능성
intrinsicbelonging naturally; essentialinherent본질적인
기법아이디어비유
Laplace (Add-One)모든 경우의 수에 +1을 더해줌. 안 본 것도 최소한의 기회를 줌.시험 점수를 0점 맞아도 최소 1점은 줘
Backoffn-gram이 없으면 더 짧은 n-gram으로 내려가서 확률을 씀.3단어 조합 못 봤네? → 그럼 2단어 조합으로 보자 → 그것도 없으면 1단어라도 보자.
Linear Interpolation여러 레벨(n=1,2,3)을 동시에 섞어서 확률 계산.국어, 영어, 수학 점수를 일정 비율(λ)로 합쳐서 성적 보는 것.
Witten-Bell / Kneser-Ney더 정교하게 확률 질량을 재분배. (현업에서 성능 좋음)선생님이 본 적 없는 문제지만 유사 문제를 잘 풀었으니 점수 좀 더 주자 하는 것.
Stupid Backoff대충 큰 코퍼스를 모아서, 그냥 단순 backoff만 사용. (빅데이터 시대 구글에서 씀)데이터가 워낙 많으니까, 그냥 단순한 방법도 잘 먹힌다.

IAI +006

· 8 min read

Markov process or Markov chain

the state sequence satisfies the Markov assumption

  • The current state depends on only a finite fixed number of previous states.
  • P(XtX0:t1)=P(XtXt1)P(X_t | X_{0:t-1}) = P(X_t | X_{t-1})

Decision Theory

  • Choosing actions based on the desirability of their immediate outcomes.
  • The outcome of taking action aa in state s0s_0 is deterministic then Result(s0,a)Result(s_0, a).
  • If the outcome is non-deterministic (stochastic), fully observable, the agent knows the current state s0s_0.
    • It is represented as a random variable whose values are the possible outcome states.
    • The transition model specifies the probabilities of possbile outcome states.
    • P(s,a,s)P(s, a, s')

MEU, Maximum Expected Utility

A rational agent should choose the action that maximizes its expcted utility.

  • The MEU principle could be seen as defining all of AI.
  • action=argmaxa(EU(ae))action = argmax_a(EU(a|e))
    • among this set of actions, give the highest expected utility.
  • EU(ae)=s(P(Result(a)=sa,e)U(s))EU(a|e) = \sum_{s'} (P(Result(a) = s' | a, e) * U(s'))

Sequential decision problem

type of problem or scenario where a decision-maker must make a series of choices or decisions over time in order to achieve a desired outcome.

  • a sequential decision-making problem or sequential decision-making task

MDP, Markov Decision Process

a sequential decision problem for a fully observable, stochastic environment environment with a Markovian transition model and additive rewards.

  • a formal framework for modeling sequential decision problems.
  • including states, actions, transition probabilities, and rewards
  • used in reinforcement learning and operations research.
ComponentDescription
Environmentfully observable, Stochastic
Statesa set of states (with an initial state S₀)
Transition modelMarkovian
Rewardadditive
AgentMake decisions at time step for actions
Interacts with environment through percepts and actions
  • A set of states: an initial state S0S_0, possible terminal state(s)
  • A set of actions in each state: ACTIONS(s)ACTIONS(s)
  • A stochastic Markov transition model: P(ss,a)P(s' | s, a)
  • A reward function: R(s)R(s) or R(s,a,s)R(s, a, s')
  • In MDP, Additive Reward is defined as a function that assigns a numerical value to each state-action pair or state-transition pair.
    • represents the immediate benefit or cost associated with taking a specific action in a particular state.
  • The solution must specify what the agent should do for any state that the agent might reach.
  • Policy: π(s)π(s), an action for each state
    • A solution to an MDP.
    • π(s)\pi(s) is the action recommended by the policy π\pi for state ss
  • Q state: Q(s,a)Q(s, a), the expected utility of doing action aa in state ss

RL, Reinforcement Learning

  • an agent interacts with an environment
  • takes actions to maximize a cumulative reward signal over time.
  • learns a policy that balances exploration and exploitation
    • exploration: trying new actions.
    • exploitation: choosing known good actions.

Sample MDP

Environment

  • The agent lives in a 4x3 fully observable stochastic environment
  • Walls block the agent's path; Begin in the start state, the agent must choose an action at each time step
  • Actions available to the agent in each state: Up, Down, Left, or Right
  • The interaction with the environment terminates when the agent reaches one of the goal states, marked +1 or -1

4x3 Grid World

1234
3+1
2🚫-1
1START
  • START: Initial position (1,1)
  • 🚫: Wall (impassable)
  • +1: Goal state with positive reward
  • -1: Goal state with negative reward

Transition Model

  • Stochastic environment with uncertainty
  • When agent chooses an action, there's:
    • 0.8 probability of moving in intended direction
    • 0.1 probability of moving perpendicular to intended direction (each side)

Reward Function

  • The agent receives rewards each time step:
    • Small "living" reward each step (-0.04 in all states except the terminal states)
    • Big rewards come at the end (+1 for good and -1 for bad terminal states)

Goal

  • To maximize sum of rewards from the start state to a terminate state

Additive Reward

  • a type of reward structure
  • the total reward earned by an agent in an MDP
  • calculated by summing up the individual rewards received at each time step as the agent interacts with the environment.

Discounted reward

  • a techinque used to calcluate the expected cumulative reward over time.
  • applying a discount factor to future rewards, prioritizing more immediate rewards over distant ones.

Utility

Uπ(s)=E[t=0γtR(St,π(St),St+1)]U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, \pi(S_t), S_{t+1})\right]

  • The expectation EE is with repect to the probability distribution over state sequences determined by ss and π\pi.
  • γ\gamma is the discount factor, 0γ<10 \leq \gamma < 1, which determines the present value of future rewards.
  • R(St,π(St),St+1)R(S_t, \pi(S_t), S_{t+1}) is the reward received when transitioning from state StS_t to state St+1S_{t+1} by taking action π(St)\pi(S_t).

Bellman equation

U(s)=maxaA(s)sP[ss,a](R(s,a,s)+γU(s))U(s) = max_{a \in A(s)} \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma U(s'))

Formula ComponentDescription
U(s)U(s)the utility of state ss
maxamax_athe maximum over all possible actions aACTIONS(s)a \in ACTIONS(s)
s\sum_{s'}the sum over all possible successor states ss'
P[ss,a]P[s'\vert s,a]the transition probability of reaching state ss'
R(s,a,s)R(s,a,s')the reward received after transitioning from state ss to state ss' by taking action aa
γ\gammathe discount factor, 0γ<10 \leq \gamma < 1
U(s)U(s')the utility of the successor state ss'
  • In optimal setting, the neighbor state meets the Bellman equation.
  • From s0s_0 to termination, each step in the sequence should match the Bellman equation.

πs=argmaxπUπ(s)\pi^{*}_s = argmax_{\pi}U^{\pi}(s)

Q function

  • the expected utility of taking a given action in a given state.
  • enables the agent to act optimally simply by choosing
    • U(s)=maxaQ(s,a)U(s) = max_a Q(s,a)
  • The optimal policy can be extracted from the Q-function.
    • π(s)=argmaxaQ(s,a)\pi^{*}(s) = argmax_a Q(s,a)

Q(s,a)=sP[ss,a](R(s,a,s)+γU(s)) =sP[ss,a](R(s,a,s)+γmaxaQ(s,a))Q(s,a) = \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma U(s')) \\ \space = \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma max_{a'}Q(s',a'))

Reinforcement Learning

Key components of RL

  • Agent
  • Environment
  • State
  • Action
  • Reward
  • Policy

Application of RL

  • Robotics
  • Autonomous systems
  • Game playing
    • Atari video games from raw visual input
    • Playing poker
    • Alpha Go
  • Recommendation systems

RL Algorithms

  • Model-based RL
    • often learns a utility function U(s)U(s)
    • the sum of rewards from state ss onward from observing the effects of its actions on the environment.
  • Model-free RL
    • Action-utility learning, such as Q-learning
    • Policy learning
  • Passive RL
    • the agent's policy is fixed
    • the task is to learn the utilities of states.
  • Active RL
    • the agent must figure out what to do through exploration and exploitation.
      • Exploration: trying new actions to discover their effects.
      • Exploitation: leveraging known actions that yield high rewards.

Direct Utility Estimation

  • the utility of a state is the expected total reward from the state onward.
    • the expecgted reward-to-go from that state.
    • each trial provides a sample of this quantity for state visited.
  • end of each sequence, the algorighm calculates the observed reward-to-go for each state and updates the estimated utility for the state accordingly.
  • In the limit of infinitly many trials, the sample average will converge to the expectation in teh Bellman equation.
  • Ignores the connections between states.
    • misses opportunities to learning, it learns nothing until the end of the trial.
    • often converges very slowly.

Adaptive Dynamic Programming, ADP

Ui(s)=sP(ss,πi(s))[R(s,πi(s),s)+γUi(s)]U_i(s) = \sum_{s'}P(s'|s,\pi_{i}(s))[R(s,\pi_{i}(s),s') + \gamma U_{i}(s')]

  • advantage of the constraints among the utilities of states by learning the transition model
    • that connects them and solves the corresponding MDP
    • using dynamic programming to calculate the utilities of the states.
  • These Bellman equations are linear when the policy πi\pi_i is fixed
    • can be solved using any linear algebra package.

Learning a transition model

  • the environment is fully observable
  • the agent can learn the mapping from a state-action pair (s,a)(s, a) to the resulting state ss'.
  • The transition model P(ss,a)P(s'|s,a) is represented as a table and it is estimated directly from the counts that are accumulated in Nss,aN_{s'|s,a}.
  • The counts record how often state ss' is reached when executing aa in ss.
    • P(ss,a)=Nss,asNss,aP(s'|s,a) = \frac{N_{s'|s,a}}{\sum_{s''}N_{s''|s,a}}

Temporal difference, TD

Uπ(s)Uπ(s)+α[R(s,π(s),s)+γUπ(s)Uπ(s)]U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha[R(s,\pi(s),s') + \gamma U^{\pi}(s') - U^{\pi}(s)]

  • when a transition occurs from state ss to state ss' via action π(s)\pi(s)
  • α\alpha is the learning rate, 0<α10 < \alpha \leq 1
  • no need a transition model to perform updates
  • the environment is itself supplies the connection between neighboring states in the form of observed transitions.

Q learning

  • model-free reinforcement learning algorithm
  • the agent learns a Q-function, Q(s,a)Q(s,a)
  • avoids the need for a model by learning an action-utility function Q(s,a)Q(s,a) instead of utility function U(s)U(s).
  • a model-free Q-learning TD update
    • Q(s,a)Q(s,a)+α[R(s,a,s)+γmaxaQ(s,a)Q(s,a)]Q(s,a) \leftarrow Q(s,a) + \alpha[R(s,a,s') + \gamma max_{a'}Q(s',a') - Q(s,a)]
    • this update is calculated whenever action aa is executed in state ss leading to state ss'
  • TD Q-learning agent doesn't need a transition model P(ss,a)P(s'|s,a)
    • either for learning or for action selection.