IAI +008
NLP
Tokenization
The process of dividing a text into a sequence of tokens (words, subwords, or punctuation marks).
- Different models use different tokenization methods, so the same text can map to different tokens and IDs.
| BERT TOKEN | BERT ID | GPT TOKEN | GPT ID |
|---|---|---|---|
| my | 2026 | My | 3666 |
| grandson | 7631 | Ġgrandson | 31845 |
| loved | 3866 | Ġloved | 10140 |
| it | 2009 | Ġit | 340 |
| ! | 999 | ! | 0 |
| so | 2061 | Ġso | 1406 |
| much | 2172 | Ġmuch | 881 |
| fun | 4569 | Ġfun | 1257 |
| ! | 999 | ! | 0 |
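To make the idea of subword tokenization concrete, here is a minimal sketch of greedy longest-match-first splitting, the inference strategy used by WordPiece-style tokenizers. The vocabulary `toy_vocab` and the function name `wordpiece_tokenize` are assumptions made for illustration; the real BERT vocabulary is far larger and, as the table shows, contains "grandson" as a single token.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"my", "grand", "##son", "loved", "it", "!", "fun"}
words = ["my", "grandson", "loved", "it", "!"]
tokens = [p for w in words for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)  # ['my', 'grand', '##son', 'loved', 'it', '!']
```

With a different vocabulary the same text splits differently, which is why token sequences and IDs are model-specific.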
Corpus
A large, structured collection of text.
- A corpus typically consists of at least a million words of text and at least tens of thousands of distinct vocabulary words.
Text classification
- the process of categorizing text into organized groups.
- text classifiers can automatically analyze text and then assign a set of predefined tags or categories based on its content.
- Machine learning approach
  - Features
    - BoW
    - TF-IDF
  - Features + classifier
    - Logistic regression
    - SVM
    - Naive Bayes
- Deep learning approach
  - Neural models
    - CNNs (capture local n-grams)
    - RNNs, LSTMs (sequence-aware)
    - Transformers (e.g. BERT)
      - Contextual embedding
- Modern enhancements
  - Embedding + classifier: pretrained embeddings (Word2Vec) + classifier
  - Fine-tuned transformers: BERT fine-tuned
Bag of Words, BoW
- it represents a text as a bag of its words
- the idea is that we can approximate the meaning of a document from its content (the words) and their multiplicity (frequency, number of occurrences)
- mainly used as a tool for feature extraction.
- limitations
  - ignores syntax and context
  - disregards grammar
  - discards word order: words are treated as independent of each other
  - considers only the individual words in the sentence, not how they combine
# examples
Example1 = "He likes to watch movies. Mary likes movies too."
Example2 = "Mary also likes to watch football games."
# Vocabulary (ordered, so that each word maps to one vector position)
Vocab = ["He", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]
# BoW representation
BoW1 = {"He": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1, "also": 0, "football": 0, "games": 0}
BoW2 = {"He": 0, "likes": 1, "to": 1, "watch": 1, "movies": 0, "Mary": 1, "too": 0, "also": 1, "football": 1, "games": 1}
# The same counts as vectors over Vocab
BoW_vectors = [[1,2,1,1,2,1,1,0,0,0], [0,1,1,1,0,1,0,1,1,1]]
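The construction can be sketched in a few lines of Python. This is a minimal illustration rather than a production feature extractor: the regex tokenizer and the choice to order the vocabulary by first appearance are assumptions, and every distinct word in the corpus (including "too") gets its own vector position.

```python
import re
from collections import Counter


def tokenize(text):
    # Keep alphabetic word tokens only; punctuation is dropped.
    return re.findall(r"[A-Za-z]+", text)


docs = [
    "He likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

# Vocabulary in order of first appearance across the corpus.
vocab = []
for doc in docs:
    for tok in tokenize(doc):
        if tok not in vocab:
            vocab.append(tok)


def bow_vector(text, vocab):
    # Count each token, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]


vectors = [bow_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)  # [[1, 2, 1, 1, 2, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 1, 0, 1, 1, 1]]
```

Because only counts are kept, any reordering of the words in a document produces the same vector.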
TF-IDF
Term Frequency - Inverse Document Frequency
- A model based on the statistics of word counts.
- The idea is that key terms and important ideas are likely to repeat.
- It includes a scoring function that measures the relevance of a document to a query.
- The function takes a document, the corpus it belongs to, and a query as input and returns a numeric score.
- The documents with the highest scores are considered the most relevant.
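- Term Frequency
  - $\mathrm{TF}(t, d) = \dfrac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$
- Inverse Document Frequency
  - $\mathrm{IDF}(t) = \log \dfrac{N}{\mathrm{DF}(t)}$
  - $\mathrm{DF}(t)$: number of documents in the corpus that contain the term $t$
  - $N$: total number of documents in the corpus
- TF-IDF score
  - $\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)$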
Examples of TF-IDF
- Document 1: "John likes to watch movies."
- Document 2: "Mary likes movies too."
- Document 3: "John also likes football."
| Step | Term t | Document d | Count of t in d | Total terms in d | TF(t, d) | Total documents N | Documents containing t | IDF(t) (natural log) | IDF(t) (base 10) | TF-IDF(t, d) (natural log) | TF-IDF(t, d) (base 10) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | likes | d1 | 1 | 5 | 0.2 | 3 | 3 | 0 | 0 | 0 | 0 |
| 1 | likes | d2 | 1 | 4 | 0.25 | 3 | 3 | 0 | 0 | 0 | 0 |
| 1 | likes | d3 | 1 | 4 | 0.25 | 3 | 3 | 0 | 0 | 0 | 0 |
| 2 | watch | d1 | 1 | 5 | 0.2 | 3 | 1 | 1.098612289 | 0.477121255 | 0.219722458 | 0.095424251 |
| 2 | watch | d2 | 0 | 4 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 2 | watch | d3 | 0 | 4 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 3 | mary | d1 | 0 | 5 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 3 | mary | d2 | 1 | 4 | 0.25 | 3 | 1 | 1.098612289 | 0.477121255 | 0.274653072 | 0.119280314 |
| 3 | mary | d3 | 0 | 4 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 4 | football | d1 | 0 | 5 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 4 | football | d2 | 0 | 4 | 0 | 3 | 1 | 1.098612289 | 0.477121255 | 0 | 0 |
| 4 | football | d3 | 1 | 4 | 0.25 | 3 | 1 | 1.098612289 | 0.477121255 | 0.274653072 | 0.119280314 |
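The values in the table can be reproduced with a short script. This is a sketch under simple assumptions: terms are lowercased, the final period is stripped, and IDF is the unsmoothed logarithm of (total documents / documents containing the term), so the term must occur in at least one document. The helper names `tf`, `idf`, and `tf_idf` are chosen here for illustration; libraries often use smoothed variants that give different numbers.

```python
import math

docs = {
    "d1": "John likes to watch movies.",
    "d2": "Mary likes movies too.",
    "d3": "John also likes football.",
}


def tokenize(text):
    # Lowercase, drop the period, split on whitespace.
    return text.lower().replace(".", "").split()


tokens = {name: tokenize(text) for name, text in docs.items()}


def tf(term, doc):
    # Share of the document's tokens that are this term.
    return tokens[doc].count(term) / len(tokens[doc])


def idf(term, log=math.log):
    # Assumes the term occurs in at least one document (no smoothing).
    df = sum(1 for toks in tokens.values() if term in toks)
    return log(len(tokens) / df)


def tf_idf(term, doc, log=math.log):
    return tf(term, doc) * idf(term, log)


print(tf_idf("likes", "d1"))             # 0.0, the term occurs in every document
print(tf_idf("watch", "d1"))             # about 0.2197 (natural log)
print(tf_idf("mary", "d2", math.log10))  # about 0.1193 (base 10)
```

A term that appears in every document, such as "likes", gets a score of zero no matter how often it occurs, which is the intended effect of the IDF factor.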
Naive Bayes
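- "Naive" refers to a very strong simplifying assumption about the features.
- Conditional independence assumption
  - Given class $c$, all the features $x_1, \dots, x_n$ are assumed mutually independent: $P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$
- Bayes rule
  - $P(c \mid x_1, \dots, x_n) = \dfrac{P(x_1, \dots, x_n \mid c)\, P(c)}{P(x_1, \dots, x_n)}$
- Sentiment Analysis
  - With the words $w_1, \dots, w_n$ of a text as features and the sentiment as the class: $P(c \mid w_1, \dots, w_n) = \dfrac{P(c) \prod_{i=1}^{n} P(w_i \mid c)}{P(w_1, \dots, w_n)}$
  - $P(w_1, \dots, w_n)$: scaling factor for normalization of the posterior
  - maximum a posteriori (MAP) decision rule
    - $\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(w_i \mid c)$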
Examples of Naive Bayes
- Analyze the sentiment of the sentence "my grandson loved it"
-
- positive case
- negative case
-
- the probability of the sentence (the evidence), obtained by adding up its joint probabilities under each class.
- positive case
- Decision
- The sentiment is positive.
- Spam detection
- Probabilities learned from data:
- ,
- ,
- ,
- New document: "free win"
- Spam score:
- Ham score:
→ Classified as Spam.
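A spam/ham decision of this kind can be sketched as follows. All probability values in `priors` and `likelihoods` are made-up placeholders, not the ones from the example above, and the function names are chosen for illustration. Scores are computed in log space, a common choice that avoids underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical parameters, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "win": 0.20, "meeting": 0.01},
    "ham": {"free": 0.02, "win": 0.01, "meeting": 0.20},
}


def log_score(words, label):
    # log P(c) + sum_i log P(w_i | c): the unnormalized log-posterior.
    return math.log(priors[label]) + sum(
        math.log(likelihoods[label][w]) for w in words
    )


def classify(words):
    # MAP decision rule: pick the class with the highest score.
    return max(priors, key=lambda label: log_score(words, label))


def posterior(words):
    # Divide by the evidence, the sum of the joint probabilities over classes.
    joint = {c: math.exp(log_score(words, c)) for c in priors}
    evidence = sum(joint.values())
    return {c: p / evidence for c, p in joint.items()}


print(classify(["free", "win"]))   # spam
print(posterior(["free", "win"]))
```

The evidence term is the same for every class, so it changes the reported probabilities but never the decision; that is why the MAP rule can ignore it.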
N-gram model
- N-gram: a sequence of written symbols of length `n`
  - unigram (`n = 1`), bigram (`n = 2`), trigram (`n = 3`)
- In an n-gram model, the probability of each symbol depends only on the `n-1` previous symbols.
Examples of N-gram
- The word sequence $W_1, \dots, W_N$ is "This article is on NLP"
- N=5
- Bigram (n-gram with n=2)
P("This article is on NLP")
/* full chain rule */
= P("This") // j = 1
* P("article" | "This") // j = 2
* P("is" | "This article") // j = 3
* P("on" | "This article is") // j = 4
* P("NLP" | "This article is on") // j = 5
/* bigram approximation */
= P("This") // j = 1
* P("article" | "This") // j = 2
* P("is" | "article") // j = 3
* P("on" | "is") // j = 4
* P("NLP" | "on") // j = 5
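| Step (j) | Word (W_j) | Bigram | Trigram |
|---|---|---|---|
| 1 | This | P("This") | P("This") |
| 2 | article | P("article" \| "This") | P("article" \| "This") |
| 3 | is | P("is" \| "article") | P("is" \| "This article") |
| 4 | on | P("on" \| "is") | P("on" \| "article is") |
| 5 | NLP | P("NLP" \| "on") | P("NLP" \| "is on") |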
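As a rough sketch of how the bigram probabilities could be estimated, the snippet below counts bigrams in a tiny corpus and multiplies the maximum-likelihood estimates. The three-sentence corpus is hypothetical, and modeling the first word as P(word | `<s>`) with a start marker is an assumption made here for convenience. There is no smoothing, so a sentence containing an unseen bigram gets probability 0 and a previously unseen context word raises a division error.

```python
from collections import Counter

# Hypothetical toy corpus; <s> marks the start of each sentence.
corpus = [
    "<s> this article is on nlp",
    "<s> this article is short",
    "<s> this book is on nlp",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))


def p_bigram(word, prev):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]


def sentence_prob(sentence):
    # Product of P(w_j | w_{j-1}) over the sentence, starting from <s>.
    words = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p_bigram(word, prev)
    return prob


print(sentence_prob("This article is on NLP"))  # 1 * 2/3 * 1 * 2/3 * 1, about 0.444
```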
Transformer
- given a set of input vectors (tokens), attention lets each token look at the others and form a weighted average of them.
- The weights are data-dependent.
- The model learns the weights representing which tokens are relevant to which other tokens.
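- Scores: $S = \dfrac{QK^\top}{\sqrt{d_k}}$
  - dot products between every query and every key
  - the scale $1/\sqrt{d_k}$ keeps gradients stable ($d_k$ is the dimension of the key vectors)
- Weights: $A = \mathrm{softmax}(S)$
  - row-wise softmax
  - each row sums to 1
- Output: $\mathrm{Attention}(Q, K, V) = AV$
  - each output token is a weighted sum of the value vectors, with weights determined by the attention scores (how much attention to pay to each position).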
- Query (Q): What am I looking for?
- Key (K): What is the label/address of the information I have?
- Value (V): What is the actual information I want to convey?
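The three steps (scores, weights, output) can be written out in plain Python. This is a minimal single-head sketch for illustration: the input `X` is a made-up toy matrix, and it is used directly as Q, K, and V, whereas a real transformer first applies learned linear projections to obtain them.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    d_k = len(K[0])
    # Scores: dot product of every query with every key, scaled by sqrt(d_k).
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    # Weights: row-wise softmax, so each row sums to 1.
    weights = [softmax(row) for row in scores]
    # Output: for each query, a weighted sum of the value vectors.
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights


# Hypothetical toy input: 3 tokens with 2-dimensional vectors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(X, X, X)  # self-attention with Q = K = V = X
for row in weights:
    print([round(w, 3) for w in row])
```

Passing a different sequence for `K` and `V` than for `Q` turns the same function into cross-attention.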
Cross-Attention
- look up relevant information in another sequence.
- e.g. decoder attending to encoder outputs in translation
- $Q$ comes from the current sequence, and $K$ and $V$ come from the other sequence.
- Cross-attention = Which source tokens are relevant to this target token?
- Self-attention = Which other tokens in this sequence are relevant to this token?
Word representation
One-hot representation
- represents each word as a vector of length `N`, where `N` is the size of the vocabulary.
vocab = {he, is, singing, she, dancing, stage}
she = [0, 0, 0, 1, 0, 0]
is = [0, 1, 0, 0, 0, 0]
singing = [0, 0, 1, 0, 0, 0]
- limitations
  - incredibly inefficient for large vocabularies
  - does not embed any intrinsic meaning of words
  - unable to represent similarity between related words (any two distinct one-hot vectors are orthogonal)
  - documents are represented as sparse vectors
    - this can cause challenges in computation
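A small sketch makes the similarity limitation visible: the dot product between the one-hot vectors of two different words is always 0, however related the words are. The helper names `one_hot` and `dot` are chosen here for illustration.

```python
vocab = ["he", "is", "singing", "she", "dancing", "stage"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    # A vector of length N with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


print(one_hot("she"))                               # [0, 0, 0, 1, 0, 0]
print(dot(one_hot("singing"), one_hot("dancing")))  # 0
print(dot(one_hot("singing"), one_hot("singing")))  # 1
```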
Word Embedding
- represents individual words as vectors in a low-dimensional continuous space.
- a distributed representation of a word.
- generates a unique vector for each word while using much smaller vectors than one-hot encoding.
- common vector dictionaries
  - word2vec
  - GloVe (Global Vectors)
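With dense vectors, similarity between words can be measured, for example with cosine similarity. The three-dimensional vectors below are invented for illustration and are not taken from word2vec or GloVe; learned embeddings usually have far more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "singing": [0.9, 0.8, 0.1],
    "dancing": [0.8, 0.9, 0.2],
    "stage": [0.1, 0.3, 0.9],
}


def cosine(a, b):
    # Cosine of the angle between two vectors: 1 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


sim_close = cosine(embeddings["singing"], embeddings["dancing"])
sim_far = cosine(embeddings["singing"], embeddings["stage"])
print(round(sim_close, 3), round(sim_far, 3))
```

In this toy setup "singing" is closer to "dancing" than to "stage", which is the kind of graded similarity that one-hot vectors cannot express.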
Contextual embedding
- generates different vectors for the same word based on its context.
- the word "bank" would have different embeddings in the sentences:
- "He went to the bank to deposit money."
- "She sat by the river bank and enjoyed the view."
- Models such as BERT or GPT use deep neural networks to process a sequence of tokens.
- Each token's embedding is computed from the token itself, its position in the sequence, and the surrounding tokens (the context).
- The embedding captures both the semantic and the syntactic role of the token in that specific context.
Part of Speech (POS) tagging
- A part of speech is a lexical category (tag) that indicates the grammatical role of a word in a sentence; POS tagging assigns such a tag to each word.
- parts of speech allow language models to capture generalizations such as "adjectives often modify nouns" or "verbs often follow subjects".
From/IN the/DT start/NN ,/, it/PRP took/VBD a/DT person/NN
with/IN great/JJ qualities/NNS to/TO succeed/VB
| Tag | Description | Example |
|---|---|---|
| CC | Coordinating conjunction | and, but |
| CD | Cardinal number | one, two |
| DT | Determiner | the, a |
| EX | Existential there | there |
| FW | Foreign word | doppelgänger |
| IN | Preposition or subordinating conjunction | in, of |
| JJ | Adjective | big, old |
| JJR | Adjective, comparative | bigger, older |
| JJS | Adjective, superlative | biggest, oldest |
| LS | List item marker | 1, 2, One |
| MD | Modal | can, will |
| NN | Noun, singular or mass | cat, tree |
| NNS | Noun, plural | cats, trees |
| NNP | Proper noun, singular | John, London |
| NNPS | Proper noun, plural | Smiths, Londons |
| PDT | Predeterminer | all, both |
| POS | Possessive ending | 's, s' |
| PRP | Personal pronoun | I, you, he |
| PRP$ | Possessive pronoun | my, your, his |
| RB | Adverb | quickly, very |
| RBR | Adverb, comparative | faster, better |
| RBS | Adverb, superlative | fastest, best |
| RP | Particle | up, off |
| SYM | Symbol | $, %, & |
| TO | to | to |
| UH | Interjection | oh, wow |
| VB | Verb, base form | be, have |
| VBD | Verb, past tense | was, had |
| VBG | Verb, gerund or present participle | being, having |
| VBN | Verb, past participle | been, had |
| VBP | Verb, non-3rd person singular present | talk, have |
| VBZ | Verb, 3rd person singular present | talks, has |
| WDT | Wh-determiner | which, that |
| WP | Wh-pronoun | who, what |
| WP$ | Possessive wh-pronoun | whose |
| WRB | Wh-adverb | where, when |
| # | Pound sign | # |
| $ | Dollar sign | $ |
| , | Comma | , |
| . | Sentence-final punctuation | . ! ? |
Example of POS tagging
- Hidden Markov Model (HMM)
  - takes in a temporal sequence of evidence observations (the words)
  - predicts the most likely sequence of hidden states (the lexical categories)
- Logistic regression
  - build 45 different logistic regression models, one for each part of speech
  - ask each model how probable it is that the example word is a member of that category, given the feature values for that word in its particular context.
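For the HMM approach, the most likely tag sequence is usually found with the Viterbi algorithm. The sketch below assumes a tiny tag set and hand-picked start, transition, and emission probabilities; all of these numbers are hypothetical, whereas a real tagger would estimate them from a tagged corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under an HMM."""
    # best[t][tag] = (probability of the best path ending in tag at step t, previous tag)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None) for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][tag] * emit_p[tag].get(word, 0.0), p)
                for p in tags
            )
            column[tag] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical probabilities over three tags, for illustration only.
tags = ["DT", "NN", "VBD"]
start_p = {"DT": 0.7, "NN": 0.2, "VBD": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBD": 0.05},
    "NN": {"DT": 0.1, "NN": 0.2, "VBD": 0.7},
    "VBD": {"DT": 0.6, "NN": 0.3, "VBD": 0.1},
}
emit_p = {
    "DT": {"the": 0.6, "a": 0.4},
    "NN": {"person": 0.5, "start": 0.4, "took": 0.1},
    "VBD": {"took": 0.9, "start": 0.1},
}
sentence = ["the", "person", "took", "a", "start"]
tagged = viterbi(sentence, tags, start_p, trans_p, emit_p)
print(list(zip(sentence, tagged)))
```

Note that "start" and "took" are ambiguous under these emission tables; the transition probabilities are what settle the choice.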
Machine translation
- Translate a sentence from a source language to a target language.
- To train an MT model, we use a large corpus of source/target sentence pairs, in the expectation that the trained model can accurately translate new sentences.
- We want to generate a target-language sentence that corresponds to the source-language sentence.
- The generation of each target word is conditional on the entire source sentence and on all previously generated target words.
Example of machine translation
- a sequence-to-sequence model
  - use two RNNs (LSTMs): one to encode the source sentence and one to generate the target sentence
- attentional sequence-to-sequence model
  - use attention to create a context-based summarization of the source sentence into a fixed-dimension representation
- transformer-based model
- encoder: reads the source sentence and turns it into a rich, contextual set of vectors.
- decoder: generates the target sentence one token at a time, using what it has generated so far and the encoder's representations.
Text generation
- a subfield of NLP
- leverages knowledge in computational linguistics and AI to automatically generate natural language texts
- the generated texts should satisfy certain communicative requirements
Example of text generation
- Classifier based on word embeddings, e.g. RNN and LSTM
  - RNN: each input word is encoded as a word embedding vector and fed to a hidden layer; the output classes are the words of the vocabulary
    - the output is a softmax probability distribution over the possible values of the next word in the sentence.
  - LSTM: can choose to remember some parts of the input, copying them over to the next time step, and to forget other parts.
- Pre-trained language model using deep learning
  - BERT
  - GPT-X, Generative Pre-trained Transformer
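The last step of such a generator, turning output-layer scores into a next word, can be sketched as below. The vocabulary and the `logits` values are hypothetical stand-ins for what a trained network would produce at one time step; the snippet only shows the softmax and two common ways of choosing a word from it.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and output-layer scores for one time step.
vocab = ["fun", "movies", "football", "it"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

# Greedy decoding: always take the most probable word.
greedy_word = vocab[probs.index(max(probs))]

# Sampling: draw a word at random in proportion to its probability.
rng = random.Random(0)
sampled_word = rng.choices(vocab, weights=probs, k=1)[0]

print([round(p, 3) for p in probs], greedy_word, sampled_word)
```

Greedy decoding is deterministic, while sampling trades some likelihood for variety; generating a full sentence repeats this step, feeding each chosen word back in as the next input.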
Transfer learning
- experience with one learning task helps an agent learn better on another task.
- pretraining: a form of transfer learning in which we use a large amount of shared general-domain language data to train an initial version of an NLP model.
- we can use a smaller amount of domain-specific data to refine the model
- the refined model can learn the vocabulary, idioms, syntactic structure, and other linguistic phenomena that are specific to the new domain.
- For neural networks, learning consists of adjusting weights, so the most plausible approach for transfer learning is to copy the weights learned for task A to a network that will be trained for task B.
- The weights are then updated by gradient descent in the usual way using data for task B.
- One reason for the popularity of transfer learning is the availability of high-quality pretrained models.
- We often want to freeze the first few layers of the pretrained model
  - these layers serve as feature detectors that will be useful for the new model.
- The new data set is then allowed to modify the parameters of the higher layers only
  - these are the layers that identify problem-specific features and do classification.
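The copy-then-freeze recipe can be illustrated without any deep learning framework. In this toy sketch a "layer" is just a list of weights with a `trainable` flag, and the gradients are made-up numbers standing in for what backpropagation on task-B data would produce; the names `freeze_first` and `sgd_step` are hypothetical.

```python
# Weights learned on task A; each layer records whether fine-tuning may change it.
pretrained = [
    {"name": "layer1", "weights": [0.5, -0.2], "trainable": True},
    {"name": "layer2", "weights": [0.1, 0.3], "trainable": True},
    {"name": "classifier", "weights": [0.7, -0.4], "trainable": True},
]


def freeze_first(layers, n):
    # Copy the task-A weights and mark the first n layers as frozen.
    copied = [dict(layer, weights=list(layer["weights"])) for layer in layers]
    for layer in copied[:n]:
        layer["trainable"] = False
    return copied


def sgd_step(layers, grads, lr=0.1):
    # Plain gradient descent that skips frozen layers.
    for layer, grad in zip(layers, grads):
        if layer["trainable"]:
            layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grad)]


model_b = freeze_first(pretrained, n=2)
# Hypothetical gradients computed on task-B data.
task_b_grads = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
sgd_step(model_b, task_b_grads)

for layer in model_b:
    print(layer["name"], layer["weights"], layer["trainable"])
```

Only the classifier weights move; the two frozen layers keep their task-A values, and the original `pretrained` list is left untouched so it can be reused for other tasks.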