
CNN 007

Transfer Learning

  • Knowledge acquired while solving one task can be reused to solve related tasks.
  • This is similar to the way humans apply knowledge acquired from one task to a new but related task.

Transfer Learning Benefits

  1. Less training data required: A model trained on a large, similar dataset can serve as the starting point for training on a smaller dataset.
  2. Faster training: Training can converge faster because it starts from existing knowledge (pre-trained weights) rather than from scratch.
  3. Better model generalization: The model has learned to identify features that can be applied to new contexts.

VGG-16

| Approach | Description | Use Case | When to Use |
| --- | --- | --- | --- |
| Use Pre-trained Model | Use ImageNet pre-trained model without any additional training | Dogs & cats classification | When dataset distribution is similar to ImageNet with few samples |
| Train FC Layers Only | Use CONV layers for feature extraction, train FC layers only | Different class classification on similar domain | When dataset is similar to ImageNet but different classes with limited samples |
| Train Last CONV + FC Layers | Train last CONV layers (specialized features) and FC layers | Significantly different data distribution domain | When dataset differs greatly from ImageNet, different classes, and limited samples |
| Train All CONV + FC Layers | Train all CONV layers and FC layers (with modifications) | Complex task with different domain | When dataset differs greatly from ImageNet, different classes, dataset is large, and task is complex |
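
The four approaches differ only in which layers stay frozen and which are updated. The sketch below makes that mapping explicit for a VGG-16-style network. The group names (`conv1` to `conv5`, `fc`), the approach keys, and both helper functions are hypothetical names chosen for illustration, and treating "last CONV layers" as the fifth convolutional block is an assumption, not something the table specifies.

```python
# Layer groups of a VGG-16-style network, ordered from input to output.
# These names are illustrative and do not come from any library.
LAYER_GROUPS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc"]

# Which groups are trained under each approach in the table.
# Assumption: "last CONV layers" means the final block, conv5.
TRAINABLE_BY_APPROACH = {
    "pretrained_as_is": [],
    "fc_only": ["fc"],
    "last_conv_and_fc": ["conv5", "fc"],
    "all_layers": list(LAYER_GROUPS),
}


def trainable_flags(approach):
    """Return {group: bool}; True means the group's weights are updated."""
    trainable = set(TRAINABLE_BY_APPROACH[approach])
    return {group: group in trainable for group in LAYER_GROUPS}


def frozen_groups(approach):
    """Return the groups whose pre-trained weights are kept fixed."""
    flags = trainable_flags(approach)
    return [group for group in LAYER_GROUPS if not flags[group]]


for name in TRAINABLE_BY_APPROACH:
    print(name, "-> frozen:", frozen_groups(name))
```

In a deep learning framework the same flags would typically decide which parameters are handed to the optimizer; the less the new dataset resembles ImageNet, the more groups are unfrozen.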

AlexNet

  • Input: 224x224x3 image
  • Activations: ReLU after each CONV and FC layer
  • Optimizer: SGD with Momentum
  • Regularization: Dropout in FC1 and FC2
  • Total Trainable Parameters: ~60 million
  • Training setup: two Nvidia GTX 580 3GB GPUs for about 6 days

GoogLeNet

  • Accuracy: top-5 test error rate of 6.7%
  • Close to human-level performance
  • 22 layer deep CNN
  • Optimizer: SGD with momentum (RMSProp was adopted later, with Inception V3)
  • Total Trainable Parameters: ~4 million (far fewer than AlexNet's ~60 million)
  • Introduced the novel Inception module

Inception Module

  • Applies filters of different sizes in parallel to the same input
  • Combines different layer types (CONV, POOL, etc.) in a single module
  • This improves performance and efficiency, at the cost of a more complicated architecture.
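
Parallel branches can only be combined if they all produce the same spatial size, so each branch is padded to preserve height and width and the outputs are concatenated along the channel axis. The sketch below checks that bookkeeping. The branch sizes are illustrative values in the style of an early GoogLeNet Inception block, and the function names are assumptions made for this example.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Standard output-size formula for a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1


def inception_output_shape(height, width, branches):
    """Shape after concatenating parallel branches along the channel axis.

    branches: list of (kernel_size, out_channels); every branch uses
    stride 1 and padding (kernel_size - 1) // 2 so the spatial size is kept.
    """
    total_channels = 0
    for kernel, out_channels in branches:
        pad = (kernel - 1) // 2
        h = conv_output_size(height, kernel, padding=pad)
        w = conv_output_size(width, kernel, padding=pad)
        if (h, w) != (height, width):
            raise ValueError("branch changes the spatial size")
        total_channels += out_channels
    return (height, width, total_channels)


# Illustrative branches: 1x1 conv, 3x3 conv, 5x5 conv, and a 3x3 max-pool
# followed by a 1x1 projection (listed by its 3x3 window).
BRANCHES = [(1, 64), (3, 128), (5, 32), (3, 32)]

print(inception_output_shape(28, 28, BRANCHES))  # (28, 28, 256)
```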

1x1 Convolution

Input image ($6 \times 6 \times 1$), 1x1 kernel, and output can be written as:

$$
X = \begin{bmatrix}
100&100&100&0&0&0\\
100&100&100&0&0&0\\
100&100&100&0&0&0\\
100&100&100&0&0&0\\
100&100&100&0&0&0\\
100&100&100&0&0&0
\end{bmatrix}, \quad K = \begin{bmatrix}3\end{bmatrix}
$$

$$
Y = K * X
$$

$$
Y = \begin{bmatrix}
300&300&300&0&0&0\\
300&300&300&0&0&0\\
300&300&300&0&0&0\\
300&300&300&0&0&0\\
300&300&300&0&0&0\\
300&300&300&0&0&0
\end{bmatrix}
$$

For channel reduction with a 1x1 convolution, each spatial location $(i,j)$ is a vector:

$$
\mathbf{x}_{i,j} \in \mathbb{R}^{256}
$$

One 1x1 layer with 128 filters is a matrix:

$$
W \in \mathbb{R}^{128 \times 256},\quad \mathbf{b} \in \mathbb{R}^{128}
$$

At each location, output channels are computed by matrix multiplication:

$$
\mathbf{z}_{i,j} = W\mathbf{x}_{i,j} + \mathbf{b},\quad \mathbf{y}_{i,j} = \mathrm{ReLU}(\mathbf{z}_{i,j})
$$

So the shape changes as:

$$
64\times64\times256 \;\xrightarrow{\;1\times1\;\text{Conv (128 filters)}+\mathrm{ReLU}\;} 64\times64\times128
$$

If we flatten all spatial positions ($64\times64=4096$):

$$
X_{\text{flat}} \in \mathbb{R}^{4096\times256},\quad Y_{\text{flat}} = \mathrm{ReLU}\left(X_{\text{flat}}W^T + \mathbf{1}\mathbf{b}^T\right) \in \mathbb{R}^{4096\times128}
$$
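
The per-location matrix multiplication can be sketched in plain Python. The example uses toy sizes (a 2x2 grid with 4 input channels reduced to 2) instead of 64x64x256, and the function name `conv1x1` and the weight values are assumptions chosen so the result is easy to check by hand.

```python
def relu(vector):
    return [max(0.0, value) for value in vector]


def conv1x1(x, weights, bias):
    """Apply a 1x1 convolution followed by ReLU.

    x:       H x W x C_in nested lists
    weights: C_out x C_in matrix, one row per filter
    bias:    C_out values
    """
    out = []
    for row in x:
        out_row = []
        for pixel in row:  # pixel is the C_in-dimensional vector x_{i,j}
            z = [sum(w * p for w, p in zip(filt, pixel)) + b
                 for filt, b in zip(weights, bias)]
            out_row.append(relu(z))
        out.append(out_row)
    return out


# Toy input: 2 x 2 spatial grid, 4 channels per location.
x = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
     [[-1.0, -2.0, -3.0, -4.0], [2.0, 0.0, 2.0, 0.0]]]
W = [[1.0, 0.0, 0.0, 0.0],        # filter 0 copies channel 0
     [0.25, 0.25, 0.25, 0.25]]    # filter 1 averages the 4 channels
b = [0.0, 0.0]

y = conv1x1(x, W, b)
print(y[0][0])  # [1.0, 2.5]
print(y[1][0])  # [0.0, 0.0] because ReLU clips the negative values
```

The spatial size is unchanged while the channel count drops from 4 to 2. The single-channel case with the kernel $K=[3]$ is the same function called with `weights=[[3.0]]` and `bias=[0.0]`.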

Inception V2 and V3

  • V2, relative to V1 (GoogLeNet): Replace each 5x5 conv with two stacked 3x3 conv layers, which cover the same 5x5 receptive field.
    • Weights per input/output channel pair: $5^2=25$ vs. $2\times3^2=18$ (28% reduction)
  • V2: Also factorize an $n\times n$ conv into $1\times n$ and $n\times1$ convs.
    • For $3\times3$: $3^2=9$ vs. $3+3=6$ (about 33% reduction)
  • V3: Use more aggressive factorization and branch design (e.g., $1\times7$ and $7\times1$), plus efficient grid-size reduction.
    • Improves the accuracy-efficiency tradeoff while keeping computation manageable
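
The reductions quoted above follow from counting kernel weights. The sketch below reproduces that arithmetic; it assumes every layer has the same number of input and output channels, so the channel factor cancels and only the kernel area matters. The helper names are hypothetical.

```python
def conv_weights(kernel_h, kernel_w, in_channels=1, out_channels=1):
    """Number of weights in one conv layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


def reduction(before, after):
    """Fraction of weights saved when going from `before` to `after`."""
    return (before - after) / before


# One 5x5 conv vs. two stacked 3x3 convs.
one_5x5 = conv_weights(5, 5)                # 25
two_3x3 = 2 * conv_weights(3, 3)            # 18

# One n x n conv vs. a 1 x n conv followed by an n x 1 conv.
full_3x3 = conv_weights(3, 3)                        # 9
split_3x3 = conv_weights(1, 3) + conv_weights(3, 1)  # 6
full_7x7 = conv_weights(7, 7)                        # 49
split_7x7 = conv_weights(1, 7) + conv_weights(7, 1)  # 14

print(f"5x5 -> 3x3 + 3x3: {reduction(one_5x5, two_3x3):.0%}")
print(f"3x3 -> 1x3 + 3x1: {reduction(full_3x3, split_3x3):.0%}")
print(f"7x7 -> 1x7 + 7x1: {reduction(full_7x7, split_7x7):.0%}")
```

The saving grows with kernel size, which is consistent with applying the asymmetric factorization to larger kernels such as 7x7.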

ResNet

Deep Residual Networks, skip connections, and identity mappings

  • Enabled the training of much deeper networks
  • ResNet is composed of residual blocks, which were introduced to address the vanishing gradient problem in deep networks.
    • Degradation problem: beyond a certain depth, adding more layers eventually hurts final performance
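
A residual block computes $\mathbf{y} = \mathrm{ReLU}(F(\mathbf{x}) + \mathbf{x})$, where the skip connection adds the input back to the learned residual $F(\mathbf{x})$. The sketch below is a minimal illustration under a simplifying assumption: small fully connected layers stand in for the convolutional layers of a real ResNet block, and the function names are made up for this example.

```python
def relu(vector):
    return [max(0.0, value) for value in vector]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def plain_block(x, w1, w2):
    """Two stacked layers without a skip connection."""
    return relu(matvec(w2, relu(matvec(w1, x))))


def residual_block(x, w1, w2):
    """Same two layers, but the input is added back before the final ReLU."""
    residual = matvec(w2, relu(matvec(w1, x)))      # F(x)
    return relu([f + xi for f, xi in zip(residual, x)])


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With all weights at zero the plain block destroys the signal,
# while the residual block passes the (non-negative) input through unchanged.
print(plain_block(x, zeros, zeros))     # [0.0, 0.0]
print(residual_block(x, zeros, zeros))  # [1.0, 2.0]
```

Because an added block can fall back to the identity mapping this easily, extra depth does not have to make the network worse, which is the intuition behind using residual blocks against the degradation problem.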
