
IAI +008


NLP

Tokenization

The process of dividing a text into a sequence of tokens (words or subword units).

  • Different models use different tokenization methods.
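| BERT token | BERT ID | GPT token | GPT ID |
|---|---|---|---|
| my | 2026 | My | 3666 |
| grandson | 7631 | Ġgrandson | 31845 |
| loved | 3866 | Ġloved | 10140 |
| it | 2009 | Ġit | 340 |
| ! | 999 | Ġ! | 0 |
| so | 2061 | Ġso | 1406 |
| much | 2172 | Ġmuch | 881 |
| fun | 4569 | Ġfun | 1257 |
| ! | 999 | Ġ! | 0 |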
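A minimal sketch of the token-to-ID step, assuming a simple regex tokenizer and a toy vocabulary that contains only the BERT-column IDs listed in the table. Real BERT and GPT tokenizers split text into subword units (WordPiece and byte-level BPE respectively), so this only illustrates the idea of mapping tokens to integer IDs; `TOY_VOCAB`, `UNK_ID`, `tokenize`, and `encode` are names made up for this example.

```python
import re

# Toy vocabulary: only the BERT-column IDs from the table, nothing more.
TOY_VOCAB = {
    "my": 2026, "grandson": 7631, "loved": 3866, "it": 2009,
    "!": 999, "so": 2061, "much": 2172, "fun": 4569,
}
UNK_ID = -1  # hypothetical placeholder for out-of-vocabulary tokens


def tokenize(text: str) -> list[str]:
    """Lowercase, then split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode(text: str) -> list[int]:
    """Map each token to its ID in the toy vocabulary."""
    return [TOY_VOCAB.get(token, UNK_ID) for token in tokenize(text)]


tokens = tokenize("My grandson loved it! So much fun!")
ids = encode("My grandson loved it! So much fun!")
print(tokens)  # ['my', 'grandson', 'loved', 'it', '!', 'so', 'much', 'fun', '!']
print(ids)     # [2026, 7631, 3866, 2009, 999, 2061, 2172, 4569, 999]
```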

Corpus

a large and structured collection of text

  • a corpus typically consists of at least a million words of text
  • its vocabulary typically contains at least tens of thousands of distinct words.

Text classification

  • the process of categorizing text into organized groups.
  • text classifiers can automatically analyze text and then assign a set of predefined tags or categories based on its content.
  • machine learning approach
    • Features
      • BoW
      • TF-IDF
    • Features + classifier
      • Logistic regression
      • SVM
      • Naive Bayes
  • Deep learning approach
    • Neural models: CNNs (capture local n-grams)
    • RNNs, LSTMs (sequence-aware)
    • Transformers (e.g. BERT)
      • Contextual embedding
  • Modern enhancements
    • Embedding + classifier: Pretrained embeddings (Word2Vec) + classifier
    • Fine-tuned transformers: BERT fine-tuned

Bag of Words, BoW

  • it represents a text as a bag (multiset) of its words
  • we can understand the meaning of a document from its content (words) and their multiplicity (frequency, number of occurrences)
  • mainly used as a tool for feature extraction.
  • limitations
    • ignores syntax and context
    • disregards grammar
    • discards word order - words are treated as independent of each other
    • considers only the meanings of the individual words in the sentence
# Examples

Example1 = "He likes to watch movies. Mary likes movies too."
Example2 = "Mary also likes to watch football games."

# Vocabulary (ordered, so that each word maps to a fixed vector position)
Vocab = ["He", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]

# BoW representation: word -> number of occurrences
BoW1 = {"He": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1, "also": 0, "football": 0, "games": 0}
BoW2 = {"He": 0, "likes": 1, "to": 1, "watch": 1, "movies": 0, "Mary": 1, "too": 0, "also": 1, "football": 1, "games": 1}

# The same counts as vectors, in vocabulary order
BoW_vectors = [[1,2,1,1,2,1,1,0,0,0], [0,1,1,1,0,1,0,1,1,1]]
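The counting itself can be sketched in a few lines of Python. This is one possible implementation, not a reference one: it assumes a simple alphabetic tokenizer and builds the vocabulary from every word in both sentences in order of first appearance, so "too" gets its own position in the vectors.

```python
import re
from collections import Counter

docs = [
    "He likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]


def words(text: str) -> list[str]:
    """Split into alphabetic words, dropping punctuation (case is kept)."""
    return re.findall(r"[A-Za-z]+", text)


# Vocabulary in order of first appearance across all documents.
vocab = list(dict.fromkeys(w for doc in docs for w in words(doc)))


def bow_vector(text: str) -> list[int]:
    """Count each vocabulary word in the text; word order is discarded."""
    counts = Counter(words(text))
    return [counts[w] for w in vocab]


vectors = [bow_vector(doc) for doc in docs]
print(vocab)
print(vectors)  # [[1, 2, 1, 1, 2, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 1, 0, 1, 1, 1]]
```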

TF-IDF

Term Frequency - Inverse Document Frequency

  • Model based on the statistics of word counts.
  • the idea is that key terms and important ideas are likely to repeat.
  • includes a scoring function that measures the relevance of a document to a query.
    • the function takes a document, a corpus, and a query as input and returns a numeric score.
    • the documents with the highest scores are considered the most relevant.
  • Term Frequency
    • $TF(q_i, d_j)$: how often the term $q_i$ occurs in the document $d_j$
  • Inverse Document Frequency
    • $IDF(q_i, D) = \log\frac{N}{DF(q_i, D)}$
      • $DF(q_i, D)$: number of documents in the corpus $D$ that contain the term $q_i$
      • $N = |D|$: total number of documents in the corpus $D$
  • TF-IDF score
    • $TFIDF(q_i, d_j, D) = TF(q_i, d_j) \cdot IDF(q_i, D)$

Examples of TF-IDF

  • Document 1: "John likes to watch movies."
  • Document 2: "Mary likes movies too."
  • Document 3: "John also likes football."
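Here TF(t, d) is the number of times t appears in d divided by the total number of terms in d, N is the total number of documents, and DF(t) is the number of documents containing t.

| Step | Term t | Document d | Count of t in d | Terms in d | TF(t, d) | N | DF(t) | IDF(t) (natural log) | IDF(t) (base 10) | TF-IDF(t, d) (natural log) | TF-IDF(t, d) (base 10) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | likes | d1 | 1 | 5 | 0.2 | 3 | 3 | 0 | 0 | 0 | 0 |
| 1 | likes | d2 | 1 | 4 | 0.25 | 3 | 3 | 0 | 0 | 0 | 0 |
| 1 | likes | d3 | 1 | 4 | 0.25 | 3 | 3 | 0 | 0 | 0 | 0 |
| 2 | watch | d1 | 1 | 5 | 0.2 | 3 | 1 | 1.098612289 | 0.477121255 | 0.219722458 | 0.095424251 |
| 2 | watch | d2 | 0 | 4 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 2 | watch | d3 | 0 | 4 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 3 | mary | d1 | 0 | 5 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 3 | mary | d2 | 1 | 4 | 0.25 | 3 | 1 | 1.098612289 | 0.477121255 | 0.274653072 | 0.119280314 |
| 3 | mary | d3 | 0 | 4 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 4 | football | d1 | 0 | 5 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 4 | football | d2 | 0 | 4 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 4 | football | d3 | 1 | 4 | 0.25 | 3 | 1 | 1.098612289 | 0.477121255 | 0.274653072 | 0.119280314 |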
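The natural-log values in this example can be reproduced with a short script. It is a sketch under two assumptions: tokens are lowercased and split on whitespace after dropping the final period, and TF is the raw count divided by the document length. The helper names `tf`, `idf`, and `tf_idf` are chosen for this example only.

```python
import math

docs = {
    "d1": "John likes to watch movies.",
    "d2": "Mary likes movies too.",
    "d3": "John also likes football.",
}

# Lowercase, strip the final period, split on whitespace.
tokens = {name: text.lower().rstrip(".").split() for name, text in docs.items()}


def tf(term: str, doc: str) -> float:
    """Count of the term in the document divided by the document length."""
    return tokens[doc].count(term) / len(tokens[doc])


def idf(term: str) -> float:
    """Natural log of N / DF; assumes the term occurs in at least one document."""
    df = sum(term in toks for toks in tokens.values())
    return math.log(len(tokens) / df)


def tf_idf(term: str, doc: str) -> float:
    return tf(term, doc) * idf(term)


print(round(tf_idf("watch", "d1"), 4))  # 0.2197
print(round(tf_idf("mary", "d2"), 4))   # 0.2747
print(tf_idf("likes", "d1"))            # 0.0, because "likes" is in every document
```

A term that appears in every document gets an IDF of zero, which is why "likes" contributes nothing to any score.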

Naive Bayes

  • "naive" refers to a very strong simplifying assumption about the features
  • Conditional independence assumption
    • Given class $Y$, all the features $X = \{X_1, X_2, ..., X_n\}$ are assumed to be mutually independent.
    • $P(X | Y) = P(X_1, ..., X_n | Y) = \prod_{i=1}^{n} P(X_i | Y)$
  • Bayes rule
    • $P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)} \propto P(Y) P(X | Y)$
    • $P(Y | X_1, ..., X_n) \propto P(Y) \prod_{i=1}^{n} P(X_i | Y)$
  • Sentiment Analysis
    • $p(C_k | x_1, ..., x_n) = \frac{1}{Z} p(C_k) \prod_{i=1}^{n} p(x_i | C_k)$
      • $Z = p(x) = \sum_{k} p(C_k) p(x | C_k)$
        • scaling factor that normalizes $p(C_k | x)$
    • maximum a posteriori (MAP) decision rule
      • $\hat{y} = \argmax\limits_{k \in \{1, ..., K\}} p(C_k) \prod_{i=1}^{n} p(x_i | C_k)$

Examples of Naive Bayes

$P(Class | w_{1:N}) = \alpha \cdot P(Class) \cdot \prod_{j} P(w_j | Class)$

  • Analyze the sentiment
    • my grandson loved it
    • $x_1 = \text{my}$, $x_2 = \text{grandson}$, $x_3 = \text{loved}$, $x_4 = \text{it}$
    • $C = \{positive, negative\}$
    • $\hat{y} = \argmax\limits_{k \in \{positive, negative\}} p(C_k) \cdot p(x_1 | C_k) \cdot p(x_2 | C_k) \cdot p(x_3 | C_k) \cdot p(x_4 | C_k)$
      • positive case
        • $p(positive) = 0.49$
        • $p(my | positive) = 0.30$
        • $p(grandson | positive) = 0.01$
        • $p(loved | positive) = 0.32$
        • $p(it | positive) = 0.30$
        • $p(positive | x) \propto p(positive) \cdot p(my | positive) \cdot p(grandson | positive) \cdot p(loved | positive) \cdot p(it | positive) = 0.49 \times 0.30 \times 0.01 \times 0.32 \times 0.30 = 0.00014112$
      • negative case
        • $p(negative) = 0.51$
        • $p(my | negative) = 0.20$
        • $p(grandson | negative) = 0.02$
        • $p(loved | negative) = 0.08$
        • $p(it | negative) = 0.4$
        • $p(negative | x) \propto p(negative) \cdot p(my | negative) \cdot p(grandson | negative) \cdot p(loved | negative) \cdot p(it | negative) = 0.51 \times 0.20 \times 0.02 \times 0.08 \times 0.4 = 0.00006528$
      • $Z = p(x) = \sum_{k} p(C_k) p(x | C_k)$
        • $= p(positive) \cdot p(my, grandson, loved, it | positive) + p(negative) \cdot p(my, grandson, loved, it | negative)$
        • $= 0.00014112 + 0.00006528 = 0.00020640$
        • the probability of $x$, obtained by adding up its joint probabilities with each class.
      • $p(positive | x) = \frac{p(positive) \cdot p(my, grandson, loved, it | positive)}{Z} = \frac{0.00014112}{0.00020640} \approx 0.6837$
      • $p(negative | x) = \frac{p(negative) \cdot p(my, grandson, loved, it | negative)}{Z} = \frac{0.00006528}{0.00020640} \approx 0.3163$
    • Decision
      • $\hat{y} = \argmax\limits_{k \in \{positive, negative\}} \{k = positive : 0.6837, k = negative : 0.3163\}$
      • The sentiment is positive.
  • Spam detection
    • Probabilities learned from data:
      • $P(\text{Spam}) = 0.4$, $P(\text{Ham}) = 0.6$
      • $P(\text{free}|\text{Spam}) = 0.8$, $P(\text{free}|\text{Ham}) = 0.1$
      • $P(\text{win}|\text{Spam}) = 0.7$, $P(\text{win}|\text{Ham}) = 0.05$
    • New document: "free win"
      • Spam score: $0.4 \times 0.8 \times 0.7 = 0.224$
      • Ham score: $0.6 \times 0.1 \times 0.05 = 0.003$
      • → Classified as Spam.
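The scoring and normalization steps can be sketched as a small function. This is an illustrative implementation, assuming the priors and per-word likelihoods are already given as dictionaries (here the spam numbers from the example); `naive_bayes_posterior` is a name invented for this sketch.

```python
def naive_bayes_posterior(priors, likelihoods, words):
    """Return P(class | words) under the conditional independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]  # multiply in P(word | class)
        scores[label] = score
    z = sum(scores.values())  # normalizer Z = p(x)
    return {label: score / z for label, score in scores.items()}


priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "win": 0.7},
    "ham": {"free": 0.1, "win": 0.05},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["free", "win"])
prediction = max(posterior, key=posterior.get)  # MAP decision
print(prediction, round(posterior["spam"], 3))  # spam 0.987
```

The same function applies to the sentiment example by swapping in its priors and likelihoods. For long documents, implementations usually sum log-probabilities instead of multiplying, because the product of many small numbers underflows.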

N-gram model

  • N-gram: a sequence of written symbols of length n
    • unigram, bigram, trigram
  • the probability of each symbol depends only on the n-1 previous symbols.
  • $P(w_j | w_{1:j-1}) = P(w_j | w_{j-n+1:j-1})$
  • $P(w_{1:N}) = \prod_{j=1}^{N} P(w_j | w_{1:j-1}) \approx \prod_{j=1}^{N} P(w_j | w_{j-n+1:j-1})$

Examples of N-gram

  • $w_{1:N}$ is "This article is on NLP"
    • N=5
    • Bigram (n-gram with n=2)
P("This article is on NLP")
/* full chain rule */
= P("This") // j = 1
* P("article" | "This") // j = 2
* P("is" | "This article") // j = 3
* P("on" | "This article is") // j = 4
* P("NLP" | "This article is on") // j = 5
/* bigram approximation */
= P("This") // j = 1
* P("article" | "This") // j = 2
* P("is" | "article") // j = 3
* P("on" | "is") // j = 4
* P("NLP" | "on") // j = 5
| Step (j) | Word ($w_j$) | Bigram | Trigram |
|---|---|---|---|
| 1 | This | $P(\text{This})$ | $P(\text{This})$ |
| 2 | article | $P(\text{article} \mid \text{This})$ | $P(\text{article} \mid \text{This})$ |
| 3 | is | $P(\text{is} \mid \text{article})$ | $P(\text{is} \mid \text{This, article})$ |
| 4 | on | $P(\text{on} \mid \text{is})$ | $P(\text{on} \mid \text{article, is})$ |
| 5 | NLP | $P(\text{NLP} \mid \text{on})$ | $P(\text{NLP} \mid \text{is, on})$ |
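To make the bigram approximation concrete, here is a sketch that estimates bigram probabilities by counting and multiplies them along a sentence. The three-sentence corpus is made up for illustration, and the sketch uses a start marker `<s>` so that the first word is scored as P(word | `<s>`) rather than as a separate unigram probability. There is no smoothing, so an unseen context raises `ZeroDivisionError`.

```python
from collections import Counter

# Hypothetical toy corpus; "<s>" marks the start of each sentence.
corpus = [
    "this article is on nlp",
    "this article is short",
    "this book is on nlp",
]
sentences = [["<s>"] + line.split() for line in corpus]

# How often each word occurs as a context (every position except the last).
context_counts = Counter(w for s in sentences for w in s[:-1])
# How often each adjacent pair occurs.
bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))


def bigram_prob(prev: str, word: str) -> float:
    """Maximum-likelihood estimate: count(prev, word) / count(prev as context)."""
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(sentence: str) -> float:
    """Product of bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


p = sentence_prob("this article is on nlp")
print(p)  # 1 * 2/3 * 1 * 2/3 * 1, i.e. about 0.444
```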

Transformer

  • given a set of input vectors (tokens), attention lets each token look at the others and form a weighted average of them.
  • The weights are data-dependent.
  • The model learns projection matrices ($W_Q$, $W_K$, $W_V$) so that the resulting weights reflect which tokens are relevant to which other tokens.

$X = \text{Input Embeddings},\qquad Q = X W_Q,\quad K = X W_K,\quad V = X W_V$

  • Scores: $S = \frac{QK^T}{\sqrt{d}}$
    • dot products between every query and every key
    • dividing by $\sqrt{d}$, where $d$ is the query/key dimension, keeps gradients stable
  • Weights: $A = Softmax(S)$
    • row-wise softmax
    • each row sums to 1
  • Output: $Attention(Q, K, V) = A V$
    • each output token is a weighted sum of the value vectors, with weights determined by the attention scores (how much to pay attention to each position).
  • Query (Q): What am I looking for?
  • Key (K): What is the label/address of the information I have?
  • Value (V): What is the actual information I want to convey?
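The three steps (scores, weights, output) translate almost line by line into NumPy. The sketch below assumes random, untrained projection matrices and made-up sizes (4 tokens, dimension 8), so the numbers themselves are meaningless; only the shapes and the row-sum property are of interest.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row max is for numerical stability only."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)  # scores: every query against every key
    A = softmax(S)            # weights: each row sums to 1
    return A @ V, A           # output: weighted sums of the value vectors


rng = np.random.default_rng(0)  # arbitrary seed
X = rng.normal(size=(4, 8))     # 4 tokens, embedding size 8 (made-up sizes)
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

output, A = attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape)   # (4, 8)
print(A.sum(axis=1))  # every entry is 1 up to floating-point rounding
```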

Cross-Attention

  • look up relevant information in another sequence.
    • e.g. decoder attending to encoder outputs in translation
  • $Q$ from the current sequence, $K$ and $V$ from the other sequence.
    • $Q = X_{\text{target}} W_Q, \quad K = X_{\text{source}} W_K, \quad V = X_{\text{source}} W_V$
  • Cross-attention = Which source tokens are relevant to this target token?
  • Self-attention = Which other tokens in this sequence are relevant to this token?
  • $head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)$
  • $MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O$
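The difference between self-attention and cross-attention is only where the keys and values come from, which a small multi-head sketch makes visible. Assumptions: weights are random and untrained, sizes are invented, and each head projects the input sequences directly (the per-head matrices play the role of $W_i^Q$, $W_i^K$, $W_i^V$). The function name `multi_head` and its argument layout are choices made for this example.

```python
import numpy as np


def softmax(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


def multi_head(X_q, X_kv, heads, W_O):
    """Queries come from X_q, keys and values from X_kv.

    `heads` is a list of (W_Q, W_K, W_V) triples, one per head.
    """
    outputs = [
        attention(X_q @ W_Q, X_kv @ W_K, X_kv @ W_V) for W_Q, W_K, W_V in heads
    ]
    return np.concatenate(outputs, axis=-1) @ W_O


rng = np.random.default_rng(0)            # arbitrary seed
d_model, n_heads = 8, 2                   # made-up sizes
d_head = d_model // n_heads
X_target = rng.normal(size=(3, d_model))  # 3 target tokens
X_source = rng.normal(size=(5, d_model))  # 5 source tokens
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
W_O = rng.normal(size=(n_heads * d_head, d_model))

cross = multi_head(X_target, X_source, heads, W_O)  # Q from target, K/V from source
self_ = multi_head(X_target, X_target, heads, W_O)  # Q, K, V from the same sequence
print(cross.shape, self_.shape)  # (3, 8) (3, 8)
```

In both cases the output has one row per query token, regardless of how many source tokens were attended to.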

Word representation

One-hot representation

  • represents each word as a vector of length N, with a 1 at the word's index and 0 everywhere else.
  • N is the size of the vocabulary.
vocab = {he, is, singing, she, dancing, stage}

she = [0, 0, 0, 1, 0, 0]
is = [0, 1, 0, 0, 0, 0]
singing = [0, 0, 1, 0, 0, 0]
  • limitations
    • incredibly inefficient for large vocabularies
    • does not encode any intrinsic meaning of words
    • unable to represent similarity between related words
    • documents are represented as sparse vectors
      • which can cause challenges in computation
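A short sketch of one-hot encoding over the same six-word vocabulary shows why similarity is lost: the dot product of any two different words is zero. The helper names `one_hot` and `dot` are made up for this example.

```python
vocab = ["he", "is", "singing", "she", "dancing", "stage"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word: str) -> list[int]:
    """Vector of length len(vocab) with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def dot(u: list[int], v: list[int]) -> int:
    return sum(a * b for a, b in zip(u, v))


print(one_hot("she"))                                # [0, 0, 0, 1, 0, 0]
print(dot(one_hot("singing"), one_hot("dancing")))   # 0
print(dot(one_hot("singing"), one_hot("stage")))     # 0
```

"singing" is exactly as dissimilar to "dancing" as it is to "stage", which is the limitation that dense embeddings address.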

Word Embedding

  • represents individual words as vectors in a low-dimensional continuous space.
  • a distributed representation of a word.
  • gives each word its own vector while using much smaller vectors than one-hot encoding.
  • common vector dictionaries
    • word2vec
    • GloVe (Global Vectors)
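With dense vectors, similarity between words can be measured, typically with cosine similarity. The 3-dimensional vectors below are invented for illustration and are not taken from word2vec or GloVe; real embeddings usually have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "singing": [0.9, 0.8, 0.1],
    "dancing": [0.8, 0.9, 0.2],
    "stage":   [0.1, 0.3, 0.9],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_close = cosine(embeddings["singing"], embeddings["dancing"])
sim_far = cosine(embeddings["singing"], embeddings["stage"])
print(sim_close > sim_far)  # True for these made-up vectors
```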

Contextual embedding

  • generates different vectors for the same word based on its context.
  • the word "bank" would have different embeddings in the sentences:
    • "He went to the bank to deposit money."
    • "She sat by the river bank and enjoyed the view."
  • models such as BERT and GPT use deep neural networks to process a sequence of tokens.
  • Each token's embedding is computed from the token itself, its position in the sequence, and the surrounding tokens (its context).
  • the embedding captures both the semantic and the syntactic role of the token in that specific context.

Part of Speech (POS) tagging

  • a part of speech is a lexical category or tag that indicates the grammatical role of a word in a sentence.
  • parts of speech allow language models to capture generalizations such as "adjectives often modify nouns" or "verbs often follow subjects".
From  the  start  ,  it   took  a   person
IN    DT   NN     ,  PRP  VBD   DT  NN

with  great  qualities  to  succeed
IN    JJ     NNS        TO  VB
| Tag | Description | Example |
|---|---|---|
| CC | Coordinating conjunction | and, but |
| CD | Cardinal number | one, two |
| DT | Determiner | the, a |
| EX | Existential there | there |
| FW | Foreign word | doppelgänger |
| IN | Preposition or subordinating conjunction | in, of |
| JJ | Adjective | big, old |
| JJR | Adjective, comparative | bigger, older |
| JJS | Adjective, superlative | biggest, oldest |
| LS | List item marker | 1, 2, One |
| MD | Modal | can, will |
| NN | Noun, singular or mass | cat, tree |
| NNS | Noun, plural | cats, trees |
| NNP | Proper noun, singular | John, London |
| NNPS | Proper noun, plural | Smiths, Londons |
| PDT | Predeterminer | all, both |
| POS | Possessive ending | 's, s' |
| PRP | Personal pronoun | I, you, he |
| PRP$ | Possessive pronoun | my, your, his |
| RB | Adverb | quickly, very |
| RBR | Adverb, comparative | faster, better |
| RBS | Adverb, superlative | fastest, best |
| RP | Particle | up, off |
| SYM | Symbol | $, %, & |
| TO | to | to |
| UH | Interjection | oh, wow |
| VB | Verb, base form | be, have |
| VBD | Verb, past tense | was, had |
| VBG | Verb, gerund or present participle | being, having |
| VBN | Verb, past participle | been, had |
| VBP | Verb, non-3rd person singular present | talk, have |
| VBZ | Verb, 3rd person singular present | talks, has |
| WDT | Wh-determiner | which, that |
| WP | Wh-pronoun | who, what |
| WP$ | Possessive wh-pronoun | whose |
| WRB | Wh-adverb | where, when |
| # | Pound sign | # |
| $ | Dollar sign | $ |
| , | Comma | , |
| . | Sentence-final punctuation | . ! ? |

Example of POS tagging

  • Hidden Markov Model (HMM)
    • takes in a temporal sequence of evidence observations (the words)
    • predicts the lexical categories (the hidden states)
  • Logistic regression
    • build 45 different logistic regression models, one for each part of speech
    • ask each model how probable it is that the example word is a member of that category, given the feature values for that word in its particular context.
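For the HMM approach, the most probable tag sequence is usually found with the Viterbi algorithm. The sketch below is a minimal version for a three-tag toy model; all start, transition, and emission probabilities are hypothetical numbers chosen for illustration, not estimates from a real corpus.

```python
def viterbi(words, states, start, trans, emit):
    """Return the most probable tag sequence for `words` under the HMM."""
    # best[t][s]: probability of the best path that ends in state s at position t
    best = [{s: start[s] * emit[s][words[0]] for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans[p][s])
            best[t][s] = best[t - 1][prev] * trans[prev][s] * emit[s][words[t]]
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    tag = max(states, key=lambda s: best[-1][s])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = back[t][tag]
        path.append(tag)
    return path[::-1]


# Hypothetical model parameters (each row sums to 1).
states = ["DT", "NN", "VBZ"]
start = {"DT": 0.8, "NN": 0.15, "VBZ": 0.05}
trans = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.5, "NN": 0.4, "VBZ": 0.1},
}
emit = {
    "DT": {"the": 0.9, "dog": 0.05, "barks": 0.05},
    "NN": {"the": 0.05, "dog": 0.8, "barks": 0.15},
    "VBZ": {"the": 0.05, "dog": 0.15, "barks": 0.8},
}

tags = viterbi(["the", "dog", "barks"], states, start, trans, emit)
print(tags)  # ['DT', 'NN', 'VBZ']
```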

Machine translation

  • translate a sentence from a source language to a target language.
  • train an MT model on a large corpus of source/target sentence pairs, in the hope that the trained model can accurately translate new sentences.
  • the goal is to generate a target-language sentence that corresponds to the source-language sentence
  • the generation of each target word is conditional on the entire source sentence and on all previously generated target words.

Example of machine translation

  • a sequence-to-sequence model
    • uses two RNNs (LSTMs), one for the source sentence and one for the target sentence
  • attentional sequence-to-sequence model
    • uses attention to create a context-based summarization of the source sentence into a fixed-dimension representation
  • transformer-based model
    • encoder: reads the source sentence and turns it into a rich, contextual set of vectors.
    • decoder: generates the target sentence one token at a time, using what it has generated so far and the encoder's representations.

Text generation

  • a subfield of NLP
  • leverages knowledge in computational linguistics and AI to automatically generate natural language texts
  • the generated texts can satisfy certain communicative requirements

Example of text generation

  • Classifier based on word embeddings, e.g. RNN and LSTM
    • RNN: each input word is encoded as a word embedding vector $x_t$ and fed into a hidden layer $z_t$; the classes are the words of the vocabulary
      • the output $y_t$ is a softmax probability distribution over the possible values of the next word in the sentence.
    • LSTM: can choose to remember some parts of the input, copying them over to the next time step, and to forget other parts.
  • Pre-trained language model using deep learning
    • BERT
    • GPT-X, Generative Pre-trained Transformer
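The RNN bullet above can be sketched as a single recurrent step: embed the word, update the hidden state, and apply a softmax over the vocabulary. Everything here is an assumption made for illustration: the toy vocabulary, the sizes, the tanh activation, and the random untrained weights, so the predicted word is meaningless until the weights are trained.

```python
import numpy as np

vocab = ["my", "grandson", "loved", "it", "<eos>"]  # toy vocabulary
rng = np.random.default_rng(0)                      # arbitrary seed
d_embed, d_hidden = 4, 6                            # made-up sizes

E = rng.normal(size=(len(vocab), d_embed))     # word embedding table
W_x = rng.normal(size=(d_embed, d_hidden))     # input -> hidden
W_z = rng.normal(size=(d_hidden, d_hidden))    # hidden -> hidden (recurrence)
W_y = rng.normal(size=(d_hidden, len(vocab)))  # hidden -> vocabulary scores


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


def rnn_step(word, z_prev):
    """One time step: returns the next-word distribution y_t and the new hidden state z_t."""
    x_t = E[vocab.index(word)]
    z_t = np.tanh(x_t @ W_x + z_prev @ W_z)
    y_t = softmax(z_t @ W_y)
    return y_t, z_t


z = np.zeros(d_hidden)
for word in ["my", "grandson", "loved"]:
    y, z = rnn_step(word, z)

next_word = vocab[int(np.argmax(y))]  # greedy choice of the next word
print(y.round(3), next_word)
```

Generating text repeats this step, feeding each chosen word back in as the next input.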

Transfer learning

  • experience with one learning task helps an agent learn better on another task.
  • pretraining: a form of transfer learning in which we use a large amount of shared general-domain language data to train an initial version of an NLP model.
    • we can then use a smaller amount of domain-specific data to refine the model
    • the refined model can learn the vocabulary, idioms, syntactic structure, and other linguistic phenomena that are specific to the new domain.
  • For neural networks, learning consists of adjusting weights, so the most plausible approach to transfer learning is to copy the weights learned for task A into a network that will be trained for task B.
    • The weights are then updated by gradient descent in the usual way using data for task B.
  • one reason for the popularity of transfer learning is the availability of high-quality pretrained models.
  • we often want to freeze the first few layers of the pretrained model
    • these layers serve as feature detectors that will be useful for the new model.
    • the new data set is allowed to modify the parameters of the higher layers only
      • these are the layers that identify problem-specific features and do classification.
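Freezing can be sketched with a tiny two-layer network in NumPy: the lower layer is copied from "task A" and never updated, while the upper layer is trained by gradient descent on "task B". The pretrained weights here are just random numbers standing in for real ones, and the task-B data, layer sizes, learning rate, and step count are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Stand-in for weights learned on task A (here just random numbers).
W1_pretrained = rng.normal(size=(5, 8))  # lower layer: feature detector
W2 = np.zeros(8)                         # upper layer: new classifier for task B

# Hypothetical task-B data: 40 examples, 5 features, binary labels.
X = rng.normal(size=(40, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)


def forward(X, W1, W2):
    H = np.tanh(X @ W1)                  # features from the frozen layer
    p = 1.0 / (1.0 + np.exp(-(H @ W2)))  # logistic output
    return H, p


def loss(p, y):
    """Mean binary cross-entropy."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


W1 = W1_pretrained.copy()  # copy the task-A weights into the new network
_, p = forward(X, W1, W2)
initial_loss = loss(p, y)

for _ in range(200):
    H, p = forward(X, W1, W2)
    grad_W2 = H.T @ (p - y) / len(y)  # gradient with respect to W2 only
    W2 -= 0.5 * grad_W2               # W1 is frozen: it gets no update

_, p = forward(X, W1, W2)
final_loss = loss(p, y)
print(round(initial_loss, 4), final_loss < initial_loss)
```

In a deep-learning framework the same effect is usually achieved by marking the lower layers' parameters as non-trainable rather than by leaving them out of the update by hand.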