
11 posts tagged with "cnn"


CNN 012

· 6 min read
  • Three main layers of a CNN
    • CONV: Convolution Layer
    • POOL: Pooling Layer
    • FC: Fully Connected Layer
    • CONV extracts features, POOL downsamples feature maps, and FC makes the final prediction.
  • Why CNNs are used over ANNs for image processing
    • Computationally efficient
    • Using Filters to capture spatial features
    • Sharing weights across the image
  • Overfitting
    • The model essentially memorizes the training data, leading to poor performance on unseen data
    • To prevent overfitting, we can use techniques like:
      • Dropout: Randomly dropping out neurons during training to prevent co-adaptation
      • Batch Normalization: Normalizing the inputs of each layer to stabilize learning
      • L1/L2 Regularization: Adding a penalty to the loss function to discourage large weights
        • L1 regularization adds a penalty based on the absolute value of the weights (weights can become exactly zero, which makes the model sparse and is useful for feature selection).
        • L2 regularization adds a penalty based on the squared value of the weights (weights shrink toward zero but do not become exactly zero, which reduces model complexity and overfitting).
      • Data Augmentation: Creating new training samples by applying transformations to existing data
  • ReLU
    • If the input is below zero, ReLU outputs 0.
    • If the input is above zero, it outputs the input value itself.
    • max(0, x)
    • ReLU can output any number from 0 to infinity, which allows it to capture a wide range of features in the data.
    • It mitigates the vanishing gradient problem: for positive inputs the gradient is 1, so gradients flow through the network without being squashed toward zero, which can happen with activation functions like sigmoid or tanh.
  • Sigmoid: Binary Classification
  • Softmax: Multi-class Classification
  • Backpropagation: sends the error backward through the network and calculates gradients, so the model knows how to update its weights and biases.
  • Gradient Descent
    • It uses Backpropagation to calculate the exact slope (the gradient) of the error (loss).
    • Then it takes a step in the opposite direction of the gradient to minimize the error.
    • It repeats this iterative process until it reaches a local minimum.
  • Vanishing Gradient Problem
    • The gradient becomes too small, so earlier layers learn very slowly or almost stop learning.
    • ReLU helps mitigate this problem by allowing gradients to flow through the network without being squashed to zero.
  • Learning Rate $\alpha$
    • It's a hyperparameter that controls how big of a step the model takes down the slope.
    • If $\alpha$ is too small, the model will take tiny steps and may take a long time to converge.
    • If $\alpha$ is too large, the model may overshoot the minimum and diverge.
  • Precision: $\frac{TP}{TP + FP}$
    • Of all the patients the model predicted as having the disease, how many actually have the disease?
  • Recall: $\frac{TP}{TP + FN}$
    • Of all the patients who actually have the disease, how many did the model successfully catch?
    • Recall is more important in medical diagnosis because we want to minimize false negatives (missing a disease).
  • Sliding Window
    • Computationally expensive because it requires multiple passes over the image with different window sizes and strides.
    • Multiple scales are needed to detect objects of varying sizes, which further increases the computational cost.
  • Stride: controls the step size of the sliding filter. Larger stride means smaller output.
  • Edge
    • The points or pixels in an image where brightness or intensities change sharply.
    • Sobel filter
    • Prewitt filter
    • Canny edge detector
  • Padding: adds zeros around the image so the CONV does not shrink the feature map too much.
  • To keep the output dimension the same as the input dimension (at stride 1, with an odd filter size $F$), we can use padding:
    • $P = \frac{F - 1}{2}$
  • Image Classification: Assigning a label to an entire image (e.g., cat, dog, car).
  • Object Detection: Identifying and localizing multiple objects within an image (e.g., bounding boxes)
  • Instance Segmentation: Identifying and segmenting each object instance in an image (e.g., pixel-level masks)
  • Momentum: uses an exponentially weighted average of past gradients to smooth updates and accelerate convergence.
  • RMSProp: uses an exponentially weighted average of squared gradients to adapt the learning rate for each parameter.
  • Adam: combines Momentum and RMSProp by using both the first moment, average gradient, and the second moment, average squared gradient.
  • Hyperparameters: learning rate, batch size, number of epochs, optimizer type, dropout rate, etc.
  • Supervised Learning: The model learns from labeled data (e.g., classification, regression).
  • Unsupervised Learning: The model learns from unlabeled data (e.g., clustering).
  • Loss/Cost function: an estimate of how far the model's predictions are from the actual target/answer.
  • AI is a broad concept of machines performing human-like tasks.
  • ML is a subset of AI that learns from data.
  • DL is a subset of ML that uses deep neural networks with many layers.
  • Major problems in ML
    • insufficient data
    • non-representative training data
    • poor-quality data
    • irrelevant features
    • overfitting
    • underfitting
  • When do we use ML?
    • a large amount of data for finding patterns and making predictions
    • too many rules or too much complexity for humans to handle
  • Faster R-CNN: Propose regions first, then classify them
    • RPN: Region Proposal Network, which generates candidate object proposals
  • YOLO: Predict boxes and class probabilities directly from the image in one pass
    • Anchor boxes: predefined bounding boxes of different sizes and aspect ratios used to predict the location of objects in YOLO.
  • NMS: Non-Maximum Suppression, selects the best bounding box among overlapping boxes based on confidence scores.
  • 1×1 convolution mixes channel information and can reduce the number of channels, so later convolutions become cheaper.
  • Inception module: learns small, medium, and large visual features at the same time.
  • Transfer Learning Strategies:
    • First, if the new dataset is small and similar to the original dataset, we can use the pre-trained model directly.
    • Second, if the dataset is similar but has different classes, we freeze the convolutional layers and train only the fully connected classification layer.
    • Third, if the dataset is small but not very similar, we freeze the early convolutional layers and fine-tune the later convolutional layers plus the FC layer.
    • Finally, if the dataset is large and different, we can fine-tune the whole network.
  • IoU: Intersection over Union, a metric used to evaluate the accuracy of object detection models by comparing the predicted bounding box with the ground truth bounding box.
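The precision and recall definitions in the notes above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the patient counts are made up for illustration, and returning 0.0 for an empty denominator is an assumed convention.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 when nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); 0.0 when there are no actual positives (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical screening result: 80 true positives, 20 false positives, 40 false negatives.
print(precision(80, 20))  # 0.8
print(recall(80, 40))     # about 0.667: one third of the sick patients were missed
```

In this made-up case the model looks fairly precise but still misses a third of the actual cases, which is why recall is the number to watch in a medical setting.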

CNN 011

· 4 min read

Sequence

  • has a lot of context to predict the next behavior

Sequence modelling types

  • One to One Binary classification
    • X -> Y'
    • Will it rain today? Yes/No
  • Many to One Sentiment Analysis
    • X1, X2, X3, ... -> Y'
    • Is this review positive or negative?
  • One to Many Image Captioning
    • X -> Y1, Y2, Y3, ...
    • Image: A woman is throwing a frisbee in the park
  • Many to Many Q&A with LLMs, Language translations
    • X1, X2, X3, ... -> Y1, Y2, Y3, ...
    • Q: Hey, Siri How's the weather today? A: It's sunny and warm outside.

RNN

Recurrent Neural Network

  • $y'_t = f(x_t, h_{t-1})$
    • $y'_t$: output at time t
    • $x_t$: input at time t
    • $h_{t-1}$: past memory
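A recurrent step can be sketched for scalar inputs as follows. The notes only say that the output depends on the current input and the past memory. The tanh nonlinearity, the weight names `w_x`, `w_h`, `b`, their values, and using the hidden state itself as the output are all assumptions made for this illustration.

```python
import math


def rnn_step(x_t: float, h_prev: float, w_x: float, w_h: float, b: float) -> float:
    """One recurrent step: the new state mixes the current input with the past memory."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same (shared) parameters at every time step and collect the states."""
    h = 0.0  # initial memory
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# Only the first input is non-zero, yet it still influences the later states.
print(run_rnn([1.0, 0.0, 0.0]))
```

Because the same function handles every step, the loop works for sequences of any length.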

Sequence Modelling

  • Support for Variable-Length input
  • Has Temporal Dependency (long-term and short-term)
  • Preserve the information order
  • Share parameters across sequence

Attention

  • Why
    • RNNs process sequences one step at a time
    • Long sentences lead to Long-term memory loss
    • Important words can be hidden in long dependencies
  • Attention helps to focus on relevant parts of the input
  • For each output word, attention decides which input words are most important
  • Computes a weighted sum of all input vectors
  • Words with higher weights are more important
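The weighted sum described above can be sketched with plain Python lists. Dot-product scoring followed by a softmax is one common way to obtain the weights, and it is an assumption here since the notes do not say how the weights are computed. The vectors are toy values.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score each input against the query, then return the weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attend([2.0, 0.0], keys, values)
print(weights)  # the first and third inputs get the highest weights
print(context)
```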

Transformer

  • Self-Attention is the foundation of the Transformer architecture
  • Entire sequence is processed in parallel
  • Has Encoder and Decoder block
  • Stack of Layers with Self Attention and Feed Forward Neural Network

Vision Transformer (ViT)

  • Vision transformers have extensive applications across computer vision tasks
  • ViT looks at images the way a language model looks at words
  • Images are represented as a sequence of patches

Steps to use ViT

  1. Split an image into patches
  2. Flatten the patches
  3. Produce lower-dimensional linear embeddings from the flattened patches
  4. Add positional embeddings
  5. Feed the sequence as input to a standard transformer encoder
  6. Pretrain the model with image labels (fully supervised on a huge dataset)
  7. Finetune on the downstream dataset for image classification
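Steps 1 and 2 above (split and flatten) can be sketched on a toy single-channel image. The function name and the 4x4 image are made up for illustration. The embedding, positional encoding, and encoder steps are not shown.

```python
def split_into_patches(image, patch_size):
    """Split an H x W grid (list of rows) into flattened patch vectors, in row-major order."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Toy 4x4 image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
print(len(patches), patches[0])  # 4 [0, 1, 4, 5]
```

Each flattened patch then plays the role that a token plays in a language model.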

ViT

CNNs vs Vision Transformer (ViT)

| Key Aspects | CNNs | ViT |
| --- | --- | --- |
| Input Handling | Processes the entire image using filters (kernels) | Splits image into fixed-size patches (like tokens) |
| Local vs. Global | Focuses on local patterns first (edges, textures) | Uses global self-attention to relate all patches |
| Architecture | Hierarchical (convs -> pools -> deeper features) | Flat transformer encoder stack |
| Training Data Need | Works well with limited data | Needs lots of data or pretraining |
| Computation | Efficient with low-res inputs | Computationally heavier, especially on large images |
| Parallelism | Limited; uses sequential feature stacking | High; patch processing is highly parallelizable |

RF-DETR

Roboflow Detection Transformer

  • Object detection techniques using Transformers
  • An improvement over the original DETR (Detection Transformer) model
  • DETR looks at everything globally but misses small things.
  • RF-DETR looks globally and understands the relationships between things.
  • First real-time Transformer-based object detection architecture
  • Reported to outperform other real-time object detection models, with 60+% mAP on the COCO dataset

RF-DETR

Diffusion Models

  • Generate new data samples (images, audio, text) that are similar to a training dataset by learning to reverse a gradual noising process
  • Forward Diffusion
    • Add noise gradually to the original image for many steps
    • Iterate until the image becomes pure noise
    • Gaussian noise used (no learning)
  • Reverse Diffusion
    • Denoising: the model is trained to predict and reverse this noise
    • Use the prediction to denoise the image
    • Given a noisy image, it predicts a slightly less noisy image version
    • After several steps, it reconstructs a clean and new image from pure noise

Steps to train a diffusion model

  1. Start with real data
  2. Add noise step by step, until the image becomes pure noise
  3. Train a model to reverse this process, denoising to recover the original image
  4. Once trained, the model can start from pure noise and generate new and realistic samples
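The forward (noise-adding) half of the process can be sketched without any learning. The update rule below, the constant schedule `betas`, and the three-value toy "image" are assumptions in the style of common diffusion formulations, not something stated in the notes. The learned reverse process is not shown.

```python
import math
import random


def forward_diffusion(x0, betas, rng):
    """Add Gaussian noise step by step; nothing is learned in this direction."""
    x = list(x0)
    for beta in betas:
        # Shrink the current sample slightly and mix in fresh Gaussian noise.
        x = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0) for v in x]
    return x


def remaining_signal(betas):
    """Scale factor still applied to the original sample after all steps."""
    scale = 1.0
    for beta in betas:
        scale *= math.sqrt(1.0 - beta)
    return scale


betas = [0.02] * 1000      # assumed constant noise schedule
rng = random.Random(0)     # seeded so the run is repeatable
noisy = forward_diffusion([1.0, -1.0, 0.5], betas, rng)
print(remaining_signal(betas))  # very close to 0: almost nothing of the original is left
```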

Applications of Diffusion Models

  • Given a lot of sprite sample images
  • Can generate New sprite images
    • New image generation from image input

CNN 010

· 5 min read

Drawbacks of Anchor-based detectors

It is sensitive to:

  • Size
  • Aspect Ratio
  • Number of Anchor boxes (Fixed)
  • Too much variation in shape
  • Small objects
  • May not generalize due to pre-defined anchor boxes
  • Computationally expensive

Anchor-free detectors

  • Localize objects without using boxes as proposals
  1. Key-point based
  2. Center-based

Key-point based

  • Locates key object parts in an image
  • Detects spatial locations or points unique to an object
  • With human body as an example
  • Key part of face: nose, eyes, eyebrows, mouth ...
  • Key point of human body: joints, elbows, knees ...
  • Object is represented using Key-points

Center-based

  • Finds positives in the center
  • Predicts four distances from the positive to the potential object boundary
    • Top, left, bottom, right
    • {x, y, T, R, B, L}
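Turning a center-based prediction into an ordinary box is simple arithmetic. The sketch below assumes image coordinates where y grows downward and takes the arguments in the same {x, y, T, R, B, L} order as above. The function name and numbers are hypothetical.

```python
def decode_center_box(x, y, top, right, bottom, left):
    """Convert a point plus four boundary distances into (x_min, y_min, x_max, y_max)."""
    return (x - left, y - top, x + right, y + bottom)


# A positive location at (50, 40) that is 10 px from the top edge, 30 from the right,
# 20 from the bottom, and 15 from the left edge of the object.
print(decode_center_box(50, 40, top=10, right=30, bottom=20, left=15))  # (35, 30, 80, 60)
```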

YOLO

  • Yolo V1: 2015
    • darknet backbone
  • Yolo V2: 2016
    • Anchor boxes
    • Batch normalization
  • Yolo V3: 2018
    • Objectness score
    • improvement for small objects
  • Yolo V4/V5: 2020
    • Solid Baseline Model
    • Lightweight and Fast
    • image classification, object detection, and instance segmentation
    • Multiple input processing (Video, Image, Live stream)
    • Optimize weights
    • Developed by Ultralytics (not original author)
  • Yolo X/R: 2021
    • Decoupled head
    • First version of Anchor free
    • Improved backbone efficiency
  • Yolo V6/V7: 2022
    • Faster and more accurate
  • Yolo NAS/V8: 2023
    • Anchor free
    • Architectural improvement
    • Strong baseline for realtime object detection
  • Yolo V9/V10/V11: 2024
    • Oriented bounding box
    • Strong baseline for oriented object detection
  • Yolo V12: 2025
    • Attention mechanism, introduced transformer
    • Little slower
  • Yolo 26: 2026
    • Deployment on a small form factor hardware
    • realtime object detection on edge devices
    • Strongest baseline for edge device deployment (realtime and accuracy)
    • Efficient Loss Function

YoloX

  • Anchor-free detector in the Yolo Family
  • Decoupled head used
  • Label assignment using SimOTA
  • Use YoloV3 SPP with DarkNet53 backbone
  • Uses advanced augmentation such as Mix-up & Mosaic
  • Backbone: Feature extraction
  • Neck: Aggregation of multi-scale feature
  • Head: Localization and Classification scores

Decoupled head

decoupled head

  • Coupled Head: one head gives regression score and classification score (Dog/Cat + Location, BBox)
  • Decoupled Head:
    • First head gives Classification score (Dog/Cat)
    • Another head gives Regression score (Location, BBox)

Data Augmentation

mixup augmentation

  • occluded and overlapped objects
  • improve model robustness

mosaic augmentation

  • four images are combined into one
  • crops and resizes the images to create a new training sample

Yolo 26

  • Realtime computer vision model
  • Detection, Segmentation, Classification, Pose, Tracking, OBB (Oriented Bounding Box)
  • Available in Nano, Small, Medium, Large, XLarge
  • E2E detection pipeline (NMS-free, Non-Maximum Suppression free)
  • Designed for edge AI and fast deployment

Why is it faster

  • NMS-free inference removes post-processing overhead
  • Direct bounding box regression (No DFL, Distribution Focal Loss)
  • Lower latency and simpler deployment graph
  • CPU-optimized architecture
  • Up to 43% faster on CPUs than V11

Key Changes

  • ProgLoss (Progressive Loss Balancing): improves training stability and convergence
  • STAL (Small-Target-Aware Label Assignment): improves small-object detection
  • MuSGD optimizer improves convergence speed
  • Better speed-accuracy trade-off than many previous YOLO models
  • Ideal for robotics, drones, surveillance, and edge devices

Inference pipeline

  • Backbone: Efficient Hybrid CNN + Attention
  • Neck: PAN-FPN (Multi-scale Feature Fusion)
  • Head (Decoupled & Dual Head)
    • One-to-Many Head: Dense supervision (training only, many positives)
    • One-to-One Head: Single best match, NMS-free inference (inference and training)

Training pipeline

Instance Segmentation

  • Identifies each pixel of an object instance
  • whereas Semantic Segmentation classifies object pixels to specific classes/categories
  • Instance Segmentation
    • DeepMask
    • SharpMask
    • Mask RCNN
  • Semantic Segmentation
    • SegNet
    • Conditional Random Field (CRF)
    • Fully Convolutional Networks (FCN)
    • U-Net
    • Pyramid scene parsing network (PSPNet)

Application of Instance Segmentation

  • Autonomous Driving
  • Scene Understanding
  • Aerial Image Processing

Mask R-CNN

Mask-Region Convolutional Neural Network

  • An addition to the RCNN family, performing instance segmentation
  • Improved over Faster RCNN
  • A Fully Convolutional Network predicts a mask for each class/object.
  • Two stages:
    1. RPN proposes candidate object bounding boxes
    2. Classify the candidates, refine bounding boxes, and predict masks.

Mask R-CNN architecture

Limitations of Mask R-CNN

  • Computational Complexity: Training and inference can be computationally intensive and require substantial resources, especially with high-resolution images or large datasets.
  • Small-Object Segmentation: may struggle to accurately segment very small objects due to limited pixel information.
  • Data Requirements: Training requires a large amount of annotated data, which can be time-consuming and expensive to acquire.
  • Limited Generalization to Unseen Categories: The model's ability to generalize to unseen object categories is limited.

Semantic Segmentation

u-net

  • input image -> u-net -> output segmentation map

References

  • Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv. https://doi.org/10.48550/arXiv.2107.08430
  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, & A. F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (Vol. 9351, pp. 234–241). Springer. https://doi.org/10.1007/978-3-319-24574-4_28

CNN 009

· 2 min read

Predicting Bounding Boxes

  • Using:
    • Sliding Window (Slow)
    • Selective Search
    • Region Proposals
  • Task:
    • Predict bounding boxes from a CNN

Non-Maximum Suppression (NMS)

  1. Check the probability of each detection and keep only those with a score above a certain threshold (e.g., 0.7)
  2. For the remaining boxes:
    a. The box with the highest score is a detection result.
    b. Discard any remaining boxes with IoU > 0.5 with that detected box, i.e., boxes that overlap heavily with the highest-scoring box.
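The two steps above can be sketched in plain Python. This is a minimal greedy version under a few assumptions: boxes are (x_min, y_min, x_max, y_max) tuples, the procedure repeats on whatever boxes survive so that several objects can be detected, and the example boxes and scores are made up.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.7, iou_thresh=0.5):
    """Return the indices of the kept boxes, highest score first."""
    # Step 1: drop low-confidence detections.
    candidates = [i for i, s in enumerate(scores) if s > score_thresh]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    # Step 2: keep the best box, discard boxes that overlap it too much, repeat.
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [i for i in candidates if iou(boxes[best], boxes[i]) <= iou_thresh]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.75, 0.3]
print(nms(boxes, scores))  # [0, 2]
```

Box 1 is removed because it overlaps box 0 with an IoU of about 0.68, and box 3 never passes the score threshold.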

Anchor Boxes

  • Associate each object to:
    • A cell which contains its mid-point and
    • Anchor box for the cell with highest IoU
  • Calculate the IoU of Anchor boxes and predicted Bounding Boxes.
    • $IoU(P_{bb}, A_{bb}) = \frac{\text{Area of Overlap}}{\text{Area of Union}}$
  • $\hat{y} = \{P_0, x, y, h, w, C_1, C_2, \quad P_0, x, y, h, w, C_1, C_2\}$
    • $P_0$ is the objectness score
    • $x, y$ are the coordinates of the center of the bounding box relative to
    • $h, w$ are the height and width of the bounding box
    • $C_1, C_2$ are the class information for the object in the bounding box

YOLO

  • Real-time performance with 45 FPS, 0.02 sec per image
  • Not suitable for small objects
  • Issues with new or multiple aspect ratios and unable to generalize

SSD, Single Shot Detector

  • Similar to YOLO, VGG16 base Convolutional Neural Network layers
  • Take advantage of Anchor boxes with different aspect ratios
  • Large number of anchors boxes are chosen
  • Not suitable for small objects
  • 3 times faster than Faster R-CNN
  • with ResNet-101 base SSD may help in detecting small objects with better features from the CONV layers

SSD 300 architecture

Overview of Object Detection

  • Base Networks
    • VGG16
    • ResNet-101
    • Inception-v2, v3
    • ResNet
    • MobileNet
    • Alexnet
    • ZFNet
  • Object Detection Framework
    • R-CNN family
    • YOLO family
    • SSD family
    • F-RCNN family
  • Faster-RCNN is more accurate but slower
  • YOLO/SSD are faster/real-time but may not be very accurate

CNN 008

· 4 min read

Datasets

PASCAL Visual Object Classes

PASCAL VOC

  • a popular dataset for object detection, classification and segmentation
  • 20 categories

ImageNet

  • a dataset for object detection
  • 500,000 images, 200 categories
  • Not very popular due to large number of classes and size of the dataset

COCO

Microsoft Common Objects in Context dataset

  • a large-scale object detection, segmentation, and captioning dataset.
  • 330,000 images, 80 categories
  • 200,000 labeled images, 1.5 million object instances
  • 91 stuff categories

Intersection over Union (IoU)

$$IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$

  • a metric used for the evaluation of an object detector
  • measures how closely the predicted bounding box matches the ground-truth box of the detected object

AP

Average Precision

| Metric | Description |
| --- | --- |
| $AP$ | AP at IoU=0.50:0.05:0.95 (primary challenge metric) |
| $AP^{IoU=.50}$ | AP at IoU=0.50 (PASCAL VOC metric) |
| $AP^{IoU=.75}$ | AP at IoU=0.75 (strict metric) |
| $AP^{small}$ | AP for small objects: $area < 32^2$ |
| $AP^{medium}$ | AP for medium objects: $32^2 < area < 96^2$ |
| $AP^{large}$ | AP for large objects: $area > 96^2$ |
| $AR^{max=1}$ | AR given 1 detection per image |
| $AR^{max=10}$ | AR given 10 detections per image |
| $AR^{max=100}$ | AR given 100 detections per image |
| $AR^{small}$ | AR for small objects: $area < 32^2$ |
| $AR^{medium}$ | AR for medium objects: $32^2 < area < 96^2$ |
| $AR^{large}$ | AR for large objects: $area > 96^2$ |
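Two pieces of the metric definitions above are easy to make concrete: the list of IoU thresholds behind the primary AP, and the area-based size buckets. The helper names are made up, and placing an area of exactly 32^2 or 96^2 in the larger bucket is an assumption, since the table leaves the boundaries open.

```python
def iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95 that the primary AP averages over."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def size_bucket(area: float) -> str:
    """Assign an object to a size bucket by its pixel area."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def mean_ap(ap_per_threshold):
    """Average per-threshold AP values into a single number."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


print(iou_thresholds())
print(size_bucket(20 * 20), size_bucket(50 * 50), size_bucket(200 * 200))  # small medium large
```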

Taxonomy of Object Detection

History of Object Detection


Classification with Localization

  • Classification Task
    • Input: Image
    • Output: Class label
    • Performance Metric: Accuracy
  • Localization Task
    • Input: Image
    • Output: Bounding box coordinates $(x, y, Ht, Wd)$ or $(x, y, x', y')$
    • Performance Metric: IoU

Localization Loss

Localization as a regression problem

Detection as a Classification Problem

Region Proposal

  • Find blobs in the image that are most likely to contain objects.
  • Selective Search: ~1000-2000 region proposal using CPU

R-CNN

Region based CNN

  • Convolutional Neural Network as feature extractor
  • SVM as classifier
  • Bounding box regression for localization
  • Pass each region through CNN to extract features, then classify using SVM and refine bounding box using regression
  • Warped image region to fixed size (e.g., 227x227) before passing through CNN
  • Region-of-Interest (RoI) from proposal method around 2000 per image, which is computationally expensive

Fast R-CNN

  • Run Whole image through CNN to get feature map, then classify each region proposal using RoI pooling and fully connected layers
    • Region of Interest (RoIs) from proposal method
    • Crop and Resize features
    • Per-Region Network
    • Linear + Softmax for Object category
    • Linear for Box offset
  • Reduce computation
  • ROIs from feature maps using selective search
  • mAP: 70% for PASCAL VOC 2007

Faster R-CNN

  • Use CNNs to make proposal
  • RPN (Region Proposal Network) to generate region proposals
    • Small neural network to predict proposals from the feature map
  • RoI pooling to extract features for each proposal
    • then classify and refine bounding box
  • mAP: 78.8% for PASCAL VOC 2007
| Model | Description |
| --- | --- |
| R-CNN | Look at every patch one by one |
| Fast R-CNN | Look once, and then inspect patches on feature map |
| Faster R-CNN | Propose patches using a neural network (RPN) |

R-CNN Family Comparison

| Feature | R-CNN | Fast R-CNN | Faster R-CNN |
| --- | --- | --- | --- |
| Region proposal | Selective search | Selective search | RPN (learned) |
| CNN Usage | Per region | Once per image | Once per image |
| Speed | Very slow | Faster | Can work in real-time |
| Training | Multi-stage, discrete | Partially end-to-end | Fully end-to-end |
| Accuracy | Good | Better | Best of all three |

Image Annotation for Object Detection

  • difficulty: not easy to annotate images even for humans

CNN 007

· 6 min read

Transfer Learning

  • Knowledge acquired while solving one task, can be used to solve related tasks
  • Similar to the way humans apply knowledge acquired from one task to solve a new but related task.

Transfer Learning Benefits

  1. Less training data required: Model trained using a large (similar) dataset can be used as a starting point for training on a smaller dataset.
  2. Faster training: Training can converge faster because it starts from existing knowledge (weights) rather than from scratch.
  3. Better model generalization: Model is trained to identify features which can be applied to new contexts.

VGG-16

| Approach | Description | Use Case | When to Use |
| --- | --- | --- | --- |
| Use Pre-trained Model | Use ImageNet pre-trained model without any additional training | Dogs & cats classification | When dataset distribution is similar to ImageNet with few samples |
| Train FC Layers Only | Use CONV layers for feature extraction, train FC layers only | Different class classification on similar domain | When dataset is similar to ImageNet but different classes with limited samples |
| Train Last CONV + FC Layers | Train last CONV layers (specialized features) and FC layers | Significantly different data distribution domain | When dataset differs greatly from ImageNet, different classes, and limited samples |
| Train All CONV + FC Layers | Train all CONV layers and FC layers (with modifications) | Complex task with different domain | When dataset differs greatly from ImageNet, different classes, dataset is large, and task is complex |

AlexNet

  • Input: 224x224x3 image
  • Activations: ReLU after each CONV and FC layer
  • Optimizer: SGD with Momentum
  • Regularization: Dropout in FC1 and FC2
  • Total Trainable Parameters: ~60 million
  • Training settings: Nvidia GTX 580 3GB GPUs for 6 days

GoogleNet

  • Accuracy: top-5 test error rate of 6.7%
  • Close to human level performance
  • 22 layer deep CNN
  • Optimizer: RMSProp
  • Total Trainable Parameters: ~4 million (Significantly reduced)
  • A novel inception module was introduced

GoogleNet

Inception Module

Inception Module

  • Use filters with different size together
  • Use different types of layers (CONV, POOL etc.) together
  • It leads to better performance and efficiency but complicated architecture.

1X1 Convolution

Input image ($6 \times 6 \times 1$), 1x1 kernel, and output can be declared as:

$$
X= \begin{bmatrix} 100&100&100&0&0&0\\ 100&100&100&0&0&0\\ 100&100&100&0&0&0\\ 100&100&100&0&0&0\\ 100&100&100&0&0&0\\ 100&100&100&0&0&0 \end{bmatrix}, \quad K=\begin{bmatrix}3\end{bmatrix}
$$

$$
Y = K * X
$$

$$
Y= \begin{bmatrix} 300&300&300&0&0&0\\ 300&300&300&0&0&0\\ 300&300&300&0&0&0\\ 300&300&300&0&0&0\\ 300&300&300&0&0&0\\ 300&300&300&0&0&0 \end{bmatrix}
$$

For channel reduction with a 1x1 convolution, each spatial location $(i,j)$ is a vector:

$$\mathbf{x}_{i,j} \in \mathbb{R}^{256}$$

One 1x1 layer with 128 filters is a matrix:

$$W \in \mathbb{R}^{128 \times 256},\quad \mathbf{b} \in \mathbb{R}^{128}$$

At each location, output channels are computed by matrix multiplication:

$$\mathbf{z}_{i,j}=W\mathbf{x}_{i,j}+\mathbf{b},\quad \mathbf{y}_{i,j}=\mathrm{ReLU}(\mathbf{z}_{i,j})$$

So the shape changes as:

$$64\times64\times256 \;\xrightarrow{\;1\times1\;\text{Conv (128 filters)}+\mathrm{ReLU}\;} 64\times64\times128$$

If we flatten all spatial positions ($64\times64=4096$):

$$X_{\text{flat}} \in \mathbb{R}^{4096\times256},\quad Y_{\text{flat}}=\mathrm{ReLU}\left(X_{\text{flat}}W^T+\mathbf{1}\mathbf{b}^T\right) \in \mathbb{R}^{4096\times128}$$
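The per-location matrix multiplication can be sketched in plain Python with much smaller, made-up sizes: a 2x2 grid, 4 input channels, and 2 filters instead of 64x64, 256, and 128. The function name and the weight values are assumptions chosen so the result is easy to verify by hand.

```python
def conv1x1_relu(feature_map, weights, bias):
    """Apply y = ReLU(W x + b) independently at every spatial location.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in
    bias:        C_out
    """
    out = []
    for row in feature_map:
        out_row = []
        for x in row:  # x is the C_in-dimensional channel vector at one pixel
            z = [sum(w * v for w, v in zip(w_row, x)) + b for w_row, b in zip(weights, bias)]
            out_row.append([max(0.0, value) for value in z])
        out.append(out_row)
    return out


# Toy sizes: 2 x 2 spatial grid, 4 input channels reduced to 2 output channels.
fmap = [[[1.0, 2.0, 3.0, 4.0] for _ in range(2)] for _ in range(2)]
W = [[0.25, 0.25, 0.25, 0.25],   # filter 1: average of the channels
     [1.0, 0.0, 0.0, -1.0]]      # filter 2: first channel minus last channel
b = [0.0, 0.0]
out = conv1x1_relu(fmap, W, b)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
print(out[0][0])                              # [2.5, 0.0]
```

The spatial size is unchanged while the channel count drops from 4 to 2, which is the same effect as going from 256 to 128 channels in the formulas above.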

Inception V2 and V3

  • V1 (GoogleNet): Replace one 5x5 conv with two stacked 3x3 conv layers.
    • Number of parameters: $5^2=25$ vs. $2\times3^2=18$ (about 28% reduction)
  • V2: Factorize an $n\times n$ conv into $1\times n$ and $n\times1$ convs.
    • For $3\times3$: $3^2=9$ vs. $3+3=6$ (about 33% reduction)
  • V3: Use more aggressive factorization and branch design (e.g., $1\times7$ and $7\times1$), plus efficient grid-size reduction.
    • Improves the accuracy-efficiency tradeoff while keeping computation manageable
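The parameter savings quoted above are simple arithmetic and can be reproduced directly. The counts are weights per input/output channel pair, biases are ignored, and the 7x7 case is added only as an illustration of the 1x7 and 7x1 factorization.

```python
def reduction(before: int, after: int) -> float:
    """Percentage of weights saved when `before` weights are replaced by `after`."""
    return 100.0 * (before - after) / before


# One 5x5 filter vs. two stacked 3x3 filters.
print(5 * 5, 2 * 3 * 3, round(reduction(25, 18)))   # 25 18 28
# One 3x3 filter vs. a 1x3 filter followed by a 3x1 filter.
print(3 * 3, 3 + 3, round(reduction(9, 6)))         # 9 6 33
# One 7x7 filter vs. a 1x7 filter followed by a 7x1 filter.
print(7 * 7, 7 + 7, round(reduction(49, 14)))       # 49 14 71
```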

ResNet

Deep Residual Networks, skip connections, and identity mappings

  • Enabled the development of the much deeper networks
  • ResNet is composed of residual blocks, which were introduced to address the vanishing gradient problem in deep networks.
    • Degradation problem: adding more layers eventually has a negative effect on the final performance

ResNet

CNN 006

· 4 min read

Data Preparation

  • Small Dataset (Range: 100 to 100,000 samples)
    • Train/Valid/Test: 60/20/20
    • Train/Test: 70/30
  • Large Dataset (Range: 500,000 to 1M+ samples)
    • Train/Valid/Test: 98%/10,000/10,000
    • Usually, more training data gives better performance.
  • Rule of Thumb: Validation and Test sets should come from the same distribution.

Bias and Variance

  • Bias: A value that shifts the activation function left or right to better fit the data.
    • With bias the curve/line will not always pass through origin
    • can get a better fit to training data
  • Variance: The sensitivity of the model to small fluctuations in the training data.
    • The change in prediction accuracy of an ML model between training data and test data.
    • A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before.
    • With high variance, the model performs very well on training data but poorly on test data.

Bias and Variance

  • High Bias
    • High training error, underfitting
    • Validation/test error nearly same as train error
    • Potential things to try:
      • Increase features
      • Make ML model more complicated
      • Decrease the regularization parameter
  • High Variance
    • Low training error, overfitting
    • High validation/test error
    • Potential things to try:
      • Increase dataset size
      • Reduce input features
      • Increase the regularization parameter

Accuracy

  • Bayesian Optimal Error (BOE): The lowest error that any model can achieve on a given dataset.
  • Human-Level performance:
    • Humans are very good at a lot of tasks
    • Can get labelled data from humans to help improve the model performance
    • Gain insights from manual error analysis

Regularization

  • a technique which makes slight modifications to the learning algorithm such that the model generalizes better on the unseen data.
  • Update the loss/cost function by adding a regularization term
    • $\text{Loss function} = \text{Loss} + \text{Regularization term}(\lambda)$
    • Due to $\lambda$, the weight matrices shrink; the assumption is that a neural network with smaller weight matrices is a simpler model.
    • Regularization penalizes the weight matrices of the nodes
  • L2 regularization
  • L1 regularization
  • Dropout

L2 Regularization

$$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum_{j=1}^{n_x} w_j^2$$

  • $\lambda$ is a hyper-parameter
  • Also known as weight decay, as it forces the weights to decay toward zero, but not exactly to zero.

L1 Regularization

$$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum_{j=1}^{n_x} |w_j|$$

  • Penalize the absolute value of the weights $w$
  • Weight may reduce to zero
  • Useful in compressing a model (sparse model)
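Both cost functions (the L2 version and the L1 version) can be written out directly from the formulas. The weights, loss value, `lam`, and `m` below are hypothetical numbers picked so the result can be checked by hand.

```python
def l2_cost(loss, weights, lam, m):
    """Loss + (lambda / 2m) * sum of squared weights."""
    return loss + lam / (2 * m) * sum(w * w for w in weights)


def l1_cost(loss, weights, lam, m):
    """Loss + (lambda / 2m) * sum of absolute weights."""
    return loss + lam / (2 * m) * sum(abs(w) for w in weights)


weights = [0.5, -2.0, 0.0, 1.5]              # hypothetical weights
print(l2_cost(1.0, weights, lam=0.1, m=10))  # 1.0 + 0.005 * 6.5
print(l1_cost(1.0, weights, lam=0.1, m=10))  # 1.0 + 0.005 * 4.0
```

The large weight (-2.0) dominates the L2 penalty because it is squared, while the L1 penalty grows only linearly with each weight's magnitude.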

Dropout

  • It produces good results and is the most popular regularization technique in deep learning.
  • At every iteration, it randomly selects and drops some nodes and remove all the connections to those nodes.
  • Each iteration has a different set of nodes.
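Dropout on a single layer's activations can be sketched as follows. This is the "inverted" variant, which also rescales the surviving nodes by 1 / keep_prob so the expected activation stays the same. That rescaling is an assumption beyond what the notes describe, and the layer values are made up.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each node with probability drop_prob and rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(42)           # seeded so the example is repeatable
layer = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dropout(layer, 0.5, rng))   # each call draws a new random set of dropped nodes
print(dropout(layer, 0.5, rng))
```

At inference time dropout is switched off, which corresponds to `drop_prob=0.0` here.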

Data Augmentation

  • A simple way to reduce overfitting is to increase the size of the training dataset.
  • By creating more sample using the existing set and applying the following simple operations
    • Flip
    • Rotate
    • Scale
    • Crop
    • Translate
    • Gaussian Noise

Cutout

  • Simple regularization technique of randomly masking out square regions of input during training.
  • Patch size: 16x16 to 64x64
  • Fill value: 0 or mean pixel value
  • Patches: 1-3 per image
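A rough sketch of Cutout on a toy single-channel image stored as nested lists. For simplicity it masks one patch that lies fully inside the image; the function name and the toy sizes are assumptions for illustration.

```python
import random


def cutout(image, patch_size, fill=0, rng=random):
    """Return a copy of `image` with one random square patch masked out."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)
    masked = [row[:] for row in image]        # leave the input untouched
    for r in range(top, top + patch_size):
        for c in range(left, left + patch_size):
            masked[r][c] = fill
    return masked


image = [[1] * 8 for _ in range(8)]           # toy 8x8 image
masked = cutout(image, patch_size=3, rng=random.Random(0))
```

Calling `cutout` repeatedly on its own output would give the "1-3 patches per image" setting listed above.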

Mixup

  • Trains a neural network on convex combinations of pairs of examples and their labels.
  • It regularizes the neural network to favor simple linear behavior in between training examples.
  • Image A (weight $\lambda = 0.55$) + Image B (weight $1 - \lambda = 0.45$) = Blended Output
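A small sketch of the blending step, assuming flattened images and one-hot labels stored as lists. The mixing weight `lam` is fixed at 0.55 to match the example above. In practice it is usually sampled at random for each pair, which this sketch does not do.

```python
def mixup(image_a, image_b, label_a, label_b, lam):
    """Blend two flattened images and their one-hot labels with weight lam."""
    image = [lam * a + (1 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label


# Toy two-pixel images and two-class labels (hypothetical data).
image, label = mixup([1.0, 1.0], [0.0, 0.5], [1, 0], [0, 1], lam=0.55)
```

The blended label is soft: roughly 0.55 for the class of image A and 0.45 for the class of image B.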

CutMix

  • Patches are cut and pasted among training images, where the ground truth labels are also mixed proportionally to the area of the patches.
  • Image A + Image B (Patch) = Pasted Patch Output.
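CutMix can be sketched the same way. Here the patch location is passed in explicitly rather than sampled, and the label weight for image B equals the fraction of the image area covered by the patch. The function name and toy images are assumptions.

```python
def cutmix(image_a, image_b, label_a, label_b, top, left, size):
    """Paste a size x size patch of image_b into image_a and mix the labels."""
    mixed = [row[:] for row in image_a]
    for r in range(top, top + size):
        for c in range(left, left + size):
            mixed[r][c] = image_b[r][c]
    # Fraction of the image that now comes from image_b.
    area_b = (size * size) / (len(image_a) * len(image_a[0]))
    label = [(1 - area_b) * a + area_b * b for a, b in zip(label_a, label_b)]
    return mixed, label


a = [[0] * 4 for _ in range(4)]   # toy 4x4 image A
b = [[1] * 4 for _ in range(4)]   # toy 4x4 image B
mixed, label = cutmix(a, b, [1, 0], [0, 1], top=0, left=0, size=2)
```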

Random Augmentation

  • A set of augmentation operations is defined, and a random subset of these operations is applied to each image during training.
  • Identity, AutoContrast, Equalize, Rotate, Solarize, Color, Posterize, Contrast, Brightness, Sharpness, ShearX/Y, TranslateX/Y

Generative Adversarial Networks (GANs)

  • Able to generate images which look similar to the original ones
  • Proven to be very effective in data augmentation, especially when the dataset is small.

Neural Style Transfer

  • Using CNN to separate style
  • Transfer style to different image

CNN 005

· 3 min read

Computer Vision

  • Classification
  • Classification with Localization
  • Object Detection

| - | ANN | CNN |
| --- | --- | --- |
| Input | 1D vector | 3D tensor (height, width, channels) |
| Connections | Fully connected | Local connections (receptive fields) |
| Overfitting | Prone to overfitting | Less prone to overfitting |

Convolutional Neural Networks (CNN)

  1. Convolutional Layer (CONV)
  2. Pooling Layer (POOL)
  3. Fully Connected Layer (FC)

LENET-5

Convolutional Layer (CONV)

  • The first layer to extract features from an input image
  • Core building block of a CNN
  • Convolution is the basic operation in this layer
  • A number of filters (e.g. edge detectors) are applied to the input image.

Padding

  • Padding is used to control the spatial size of the output feature maps.
  • Negative values at the edges can naturally arise because of padding, and they usually are not a big problem because activation functions and later layers come afterward.
  • Input matrix dimension: $n \times n \times c$ (height, width, channels)
  • Filter size: $f \times f$
  • Padding ($P$): number of pixels added to each border of the input (e.g., $P = 1$)
  • $(n \times n) * (f \times f) \to (n + 2P - f + 1) \times (n + 2P - f + 1)$
    • Example: a $5 \times 5$ input with a $3 \times 3$ filter and padding of 1 results in a $5 \times 5$ output feature map.
  • If the input and output matrix dimensions are the same, then $P = \frac{f - 1}{2}$.
  • Valid padding ($P = 0$): no padding
  • Same padding ($P = \frac{f - 1}{2}$): the output size equals the input size, which requires appropriate padding.

Stride

  • It is the number of pixels by which the filter slides across the input image at each step.
  • Figures: No Padding Strides; Stride with Padding
  • Github: vdumoulin/conv_arithmetic
  • Input matrix dimension: $n \times n$
  • Filter size: $f \times f$
  • Padding: $P$
  • Stride: $S$
  • Output size = $\left\lfloor \frac{n + 2P - f}{S} + 1 \right\rfloor \times \left\lfloor \frac{n + 2P - f}{S} + 1 \right\rfloor$
    • Example: input matrix dimension $5 \times 5$, filter size $3 \times 3$, padding $1$, and stride $2$ result in an output size of $3 \times 3$, since $\left\lfloor \frac{5 + 2 - 3}{2} + 1 \right\rfloor = 3$.
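The output-size formula is easy to check in code. This is a minimal sketch with hypothetical function names; it assumes square inputs and filters, as in the notes.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output side length: floor((n + 2P - f) / S) + 1."""
    return (n + 2 * padding - f) // stride + 1


def same_padding(f):
    """Padding that keeps output size equal to input size (odd f, stride 1)."""
    return (f - 1) // 2


size_stride_1 = conv_output_size(5, 3, padding=1, stride=1)   # 5
size_stride_2 = conv_output_size(5, 3, padding=1, stride=2)   # 3
size_valid = conv_output_size(5, 3, padding=0, stride=1)      # 3
```

Integer floor division (`//`) plays the role of the floor brackets in the formula.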

Pooling Layer (POOL)

  • Down-sampling operation which reduces the dimensionality of a matrix.
  • Reduces the number of parameters for large images while retaining the valuable information.
  • Max pooling
  • Average pooling
  • Sum pooling
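The three pooling variants differ only in how each window is reduced. The sketch below assumes a 2x2 window with stride 2 on a small hypothetical feature map.

```python
def pool2d(matrix, size=2, stride=2, mode="max"):
    """Apply max, average, or sum pooling over size x size windows."""
    reducers = {
        "max": max,
        "sum": sum,
        "average": lambda window: sum(window) / len(window),
    }
    reduce_window = reducers[mode]
    rows = range(0, len(matrix) - size + 1, stride)
    cols = range(0, len(matrix[0]) - size + 1, stride)
    return [
        [
            reduce_window(
                [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            )
            for c in cols
        ]
        for r in rows
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
pooled_max = pool2d(feature_map, mode="max")   # [[6, 4], [7, 9]]
```

The 4x4 map shrinks to 2x2 in every mode; only the value kept per window changes.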

Fully Connected Layer (FC)

  • A traditional Multi-layer Perceptron (MLP) layer
  • For multi-class classification, usually Softmax activation is used.
  • Softmax ensures the outputs are probabilities that sum to 1.
  • The output of the CONV and POOL layers represents high-level features of the input image.
  • The FC layer takes these features to classify the input image into the desired output classes.

CNN 004

· 6 min read

Logistic Regression as Neural Network

  • if $y = 1$
    • $L = -\log(\hat{y})$
    • if $\hat{y} \to 1$, then $L \to 0$ (low loss)
    • if $\hat{y} \to 0$, then $L \to \infty$ (high loss)
  • if $y = 0$
    • $L = -\log(1 - \hat{y})$
    • if $\hat{y} \to 0$, then $L \to 0$ (low loss)
    • if $\hat{y} \to 1$, then $L \to \infty$ (high loss)
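Both cases combine into a single expression, sketched below for one example. The clipping constant `eps` is an assumption added to avoid `log(0)`; it is not part of the notes.

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one example: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    y_hat = min(max(y_hat, eps), 1 - eps)   # clip to avoid log(0) (assumption)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


low_loss = binary_cross_entropy(1, 0.99)    # confident and correct
high_loss = binary_cross_entropy(1, 0.01)   # confident and wrong
```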

Gradient Descent

  • It is an iterative approach for error correction in a machine learning model
  • Find $w$ and $b$ that minimize $GD(w, b)$ (requires a loss/cost function)
  1. Initialize $w$ and $b$
  2. Perform the forward pass calculations
  3. Compute the loss/cost function $L(a, y)$
  4. Compute the change in $w$ and $b$ (take the partial derivatives of the cost function with respect to the weights and bias: $dw$ and $db$)
  5. Update $w$ and $b$ ($w := w - \alpha \, dw$ and $b := b - \alpha \, db$)
  6. Repeat from Step 2 with the new values of $w$ and $b$ for $n$ iterations.
  • $\alpha$ is the learning rate (hyperparameter) that controls how much we adjust the weights and bias of the model with respect to the loss gradient. It is a small positive value (e.g., 0.01, 0.001) that determines the step size at each iteration while moving toward a minimum of the loss function.
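The six steps can be sketched for a one-feature logistic regression. The toy data, the function names, and the default `alpha` and `iterations` are assumptions for illustration. The gradient expressions use the standard result that, for the cross-entropy loss with a sigmoid output, the derivative with respect to the pre-activation is `a - y`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, alpha=0.1, iterations=1000):
    """Fit a one-feature logistic regression with batch gradient descent."""
    w, b = 0.0, 0.0                                  # 1. initialize
    m = len(xs)
    for _ in range(iterations):                      # 6. repeat
        preds = [sigmoid(w * x + b) for x in xs]     # 2. forward pass
        # 3./4. gradients of the average cross-entropy cost
        dw = sum((a - y) * x for a, y, x in zip(preds, ys, xs)) / m
        db = sum(a - y for a, y in zip(preds, ys)) / m
        w -= alpha * dw                              # 5. update
        b -= alpha * db
    return w, b


xs = [-2.0, -1.0, 1.0, 2.0]   # toy inputs (hypothetical data)
ys = [0, 0, 1, 1]             # toy labels
w, b = train_logistic(xs, ys)
```

After training, `sigmoid(w * x + b)` is close to 1 for the positive inputs and close to 0 for the negative ones.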

Gradient Descent Types

  • Batch Gradient Descent (BGD)
  • Stochastic Gradient Descent (SGD)
  • Mini-batch Gradient Descent (MBGD)

Batch Gradient Descent (BGD)

  1. Process each input sample and find the cost
  2. Find the average cost over all input samples
  3. Update $w$ and $b$ and repeat the steps for "n" epochs (iterations)
  • Disadvantages:
    • It uses the complete dataset to calculate the gradients at every step
    • Slow when training data is large
    • Difficult to find the learning rate
    • Difficult to ascertain the number of epochs(iterations)

Stochastic Gradient Descent (SGD)

Due to the random nature, the algorithm is much less regular than BGD.

  1. Process a random input sample and find the cost.
  2. Update $w$ and $b$, and repeat the steps for "n" iterations on the training samples.
  • Advantages:
    • Computes gradient based on single input sample, which is memory efficient.
    • Much faster compared to BGD.
    • Possible to train on large datasets.
    • Randomness is helpful to escape local minima.
  • Disadvantages:
    • Might not reach the optimal value, but gets very close to it. Possible remedies:
      • Simulated annealing: reduce the learning rate gradually
      • Create a learning schedule to determine the learning rate at each iteration.

Mini-batch Gradient Descent (MBGD)

  1. Divide the training set into mini-batches of size $n$ (e.g., 64, 128, 256).
  2. Process all the samples in a mini-batch and find the average cost
  3. Update $w$ and $b$, and repeat the steps for "n" iterations/epochs on the training samples.
  • Advantages:
    • Computes gradient based on small sets of input samples
    • Much faster compared to BGD.
    • Possible to train on large datasets.
    • Performance boost on matrix operations using GPUs.
    • Might not reach the optimal value, but gets very close to it and possibly closer than SGD.
  • Disadvantages:
    • It may be harder to escape the local minima compared to SGD.
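The three variants differ only in how many samples feed each update. A minimal sketch of the batching step follows; shuffling before splitting is an assumed (but common) practice, and the integer list stands in for real training samples.

```python
import random


def mini_batches(samples, batch_size, rng):
    """Shuffle the samples and yield consecutive mini-batches."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


samples = list(range(10))                          # stand-in for training samples
batches = list(mini_batches(samples, 4, random.Random(0)))
batch_sizes = [len(batch) for batch in batches]    # [4, 4, 2]
```

Setting `batch_size` to `len(samples)` gives batch gradient descent, and setting it to 1 gives stochastic gradient descent.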

GD

Exponentially Weighted Averages

  • One of the popular algorithms for smoothing sequential data (time series data), also known as a moving average.
  • Weights the observations, giving more recent ones larger weights, and uses their average

$$
\begin{aligned}
V_0 &= 0 \\
V_1 &= 0.9 \cdot V_0 + 0.1 \cdot \theta_1 \\
V_2 &= 0.9 \cdot V_1 + 0.1 \cdot \theta_2 \\
V_3 &= 0.9 \cdot V_2 + 0.1 \cdot \theta_3 \\
&\;\;\vdots \\
V_t &= 0.9 \cdot V_{t-1} + 0.1 \cdot \theta_t \\
V_t &= \beta \cdot V_{t-1} + (1 - \beta) \cdot \theta_t
\end{aligned}
$$

$V_t$ is approximately the average over the last $\frac{1}{1 - \beta}$ time steps.

  • For $\beta = 0.9$, $V_t$ is approximately the average over the last 10 time steps.
  • For $\beta = 0.98$, $V_t$ is approximately the average over the last 50 time steps.
  • For $\beta = 0.5$, $V_t$ is approximately the average over the last 2 time steps.
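The recurrence translates directly into a loop. The readings below are hypothetical observations used only to show the first few values.

```python
def exponentially_weighted_average(values, beta=0.9):
    """Return the running averages V_1..V_t, starting from V_0 = 0."""
    v = 0.0
    averages = []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages


readings = [10.0, 12.0, 11.0, 13.0]   # hypothetical observations
smoothed = exponentially_weighted_average(readings, beta=0.9)
```

Because $V_0 = 0$, the first few averages sit well below the readings (here $V_1 \approx 1.0$ and $V_2 \approx 2.1$) until enough terms have accumulated.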

Optimizers

SGD with Momentum

At iteration $t$:

  • Calculate $dw$ and $db$ on the current mini-batch (hyperparameters: $\alpha$ and $\beta$)
  • Update the velocity:
    • $V_{dw} = \beta V_{dw} + (1 - \beta) \, dw$ (same form as $V_t = \beta V_{t-1} + (1 - \beta) \, \theta_t$)
    • $V_{db} = \beta V_{db} + (1 - \beta) \, db$
  • Update parameters:
    • $w := w - \alpha V_{dw}$
    • $b := b - \alpha V_{db}$
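A sketch of one momentum update for a single scalar parameter, followed by a short run on a toy objective. The objective $f(w) = w^2$, the starting point, and the default hyperparameters are assumptions chosen for illustration.

```python
def momentum_step(w, v_dw, dw, alpha=0.1, beta=0.9):
    """One SGD-with-momentum update for a single parameter."""
    v_dw = beta * v_dw + (1 - beta) * dw   # update the velocity
    w = w - alpha * v_dw                   # update the parameter
    return w, v_dw


# Toy objective f(w) = w^2 with gradient dw = 2w (assumption for illustration).
w, v_dw = 5.0, 0.0
for _ in range(200):
    w, v_dw = momentum_step(w, v_dw, dw=2 * w)
```

The velocity is an exponentially weighted average of past gradients, so a single noisy gradient moves `w` less than it would in plain SGD.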

RMSProp

  • Root Mean Square Propagation.
  • Unpublished adaptive learning rate method proposed by Geoffrey Hinton.
  • Reduces oscillation but in a different way than Momentum.
  • Divides the learning rate by an exponentially decaying average of squared gradients.
  • Calculate $dw$ and $db$ on the current mini-batch
    • $S_{dw} = \beta S_{dw} + (1 - \beta) \, dw^2$
    • $S_{db} = \beta S_{db} + (1 - \beta) \, db^2$
  • Update parameters:
    • $w := w - \alpha \frac{dw}{\sqrt{S_{dw}} + \epsilon}$
    • $b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon}$
    • $\epsilon$ is a small number to prevent division by zero (e.g., $10^{-8}$ to $10^{-10}$)
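One RMSProp update for a single scalar parameter might look like the sketch below. The default values for `alpha`, `beta`, and `eps` and the example gradient are assumptions.

```python
import math


def rmsprop_step(w, s_dw, dw, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a single parameter."""
    s_dw = beta * s_dw + (1 - beta) * dw ** 2      # decaying average of dw^2
    w = w - alpha * dw / (math.sqrt(s_dw) + eps)   # step scaled by gradient size
    return w, s_dw


w, s_dw = rmsprop_step(w=1.0, s_dw=0.0, dw=4.0)
```

Dividing by $\sqrt{S_{dw}}$ shrinks the step for parameters whose gradients are consistently large, which is how RMSProp damps oscillation.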

Adam

  • Adaptive Moment Estimation
  • Combination of RMSProp and Momentum
  • Works well for a wide range of non-convex optimization problems in machine learning.
  • Calculate $dw$ and $db$ on the current mini-batch
    • $V_{dw} = \beta_1 V_{dw} + (1 - \beta_1) \, dw \leftarrow \text{Momentum}, \beta_1$
    • $V_{db} = \beta_1 V_{db} + (1 - \beta_1) \, db$
    • $S_{dw} = \beta_2 S_{dw} + (1 - \beta_2) \, dw^2 \leftarrow \text{RMSProp}, \beta_2$
    • $S_{db} = \beta_2 S_{db} + (1 - \beta_2) \, db^2$
  • Update parameters:
    • $w := w - \alpha \frac{V_{dw}}{\sqrt{S_{dw}} + \epsilon}$
    • $b := b - \alpha \frac{V_{db}}{\sqrt{S_{db}} + \epsilon}$
    • $\epsilon$ is a small number to prevent division by zero (e.g., $10^{-8}$ to $10^{-10}$)
  • Hyperparameter guide:
    • $\alpha = 0.001$
    • $\beta_1 = 0.9$: Momentum term
    • $\beta_2 = 0.999$: Moving weighted average
    • $\epsilon = 10^{-8}$: To prevent division by zero
  • ensmallen.org
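The Adam update, exactly as written in these notes, can be sketched for one scalar parameter. Note that the published Adam algorithm also applies a bias-correction step to both moving averages; that step is deliberately left out here to match the notes. The function name and the example gradient are assumptions.

```python
import math


def adam_step(w, v_dw, s_dw, dw, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the notes (no bias correction)."""
    v_dw = beta1 * v_dw + (1 - beta1) * dw          # momentum term
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2     # RMSProp term
    w = w - alpha * v_dw / (math.sqrt(s_dw) + eps)  # combined update
    return w, v_dw, s_dw


w, v_dw, s_dw = adam_step(w=1.0, v_dw=0.0, s_dw=0.0, dw=2.0)
```

The defaults mirror the hyperparameter guide listed for Adam in these notes.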

Learning Rate Decay

  • Speeds up the learning algorithm by slowly decreasing the learning rate $\alpha$ as the number of epochs increases.
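The notes do not give a specific decay rule, so the sketch below assumes one common choice, inverse-time decay, where $\alpha = \frac{\alpha_0}{1 + \text{decay rate} \cdot \text{epoch}}$. The function name and the values used are hypothetical.

```python
def decayed_learning_rate(alpha0, epoch, decay_rate=1.0):
    """Inverse-time decay: alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)


# Learning rate for the first four epochs, starting from alpha0 = 0.2.
schedule = [decayed_learning_rate(0.2, epoch) for epoch in range(4)]
```

Other schedules (step decay, exponential decay) follow the same pattern of mapping the epoch number to a smaller learning rate.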

Activation Functions

  • An activation function takes the output of a layer in a neural network and applies a non-linear function to it.
  • Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
    • Used for binary classification in the output layer.
  • Tanh: $\tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
  • ReLU: $A(x) = \max(0, x)$
    • Rectified Linear Unit
    • Mitigates the vanishing gradient problem
    • Best used in hidden layers
    • Computationally less expensive than sigmoid and tanh
  • Softmax: $S(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
    • Turns numbers into probabilities that sum to 1.
    • Used for multi-class classification in the output layer.
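The four functions can be written directly from their formulas. Subtracting the maximum inside `softmax` is an assumed numerical-stability trick that does not change the result.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def relu(x):
    return max(0.0, x)


def softmax(xs):
    shift = max(xs)                            # for numerical stability
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])   # three class scores -> three probabilities
```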

CNN 003

· 3 min read

Image Gradient

  • It is a directional change in the intensity or color in an image.
  • Can be used to extract valuable information from images.
  • Commonly used in edge detection.
  • ➡️ Change in the X direction, ⬇️ change in the Y direction.
  • Combine the X and Y gradients to estimate whether the change occurs in both directions.

HoG, Histogram of Oriented Gradient

To find edge and shape of the object in the image

  • Computing Image Gradient
    • Use the horizontal and vertical filters to compute gradient values
  • Compute the strength/magnitude and direction of gradient
    • Strength/Magnitude ($g$): $\sqrt{g_x^2 + g_y^2}$
    • Direction ($\theta$): $\tan^{-1}(g_y / g_x)$
  • Create orientation histogram
    • Divide the image into small connected regions called cells, each an 8x8 pixel patch
    • Create a cell histogram based on gradient direction and magnitude
    • The 64 (8x8) gradient vectors are put into a 9-bin histogram.
    • The bins are the gradient directions ($\theta$) quantized into 9 bins
  • Block Normalization
    • 16x16 pixel blocks (2x2 cells) are used for normalization; each block contains 4 histograms.
    • Normalization makes the descriptor invariant to scaling (multiplication) of the pixel intensities
    • Each block is represented by a 36x1 vector (4 histograms x 9 bins)
  • Intensity: brightness of the pixel
  • Saturation: HSV color space, the amount of gray in the color
  • Calculate the HoG feature vector
    • The 36x1 vectors of all blocks are concatenated into one big vector
    • The size of the vector will be 36xN, where N is the number of blocks in the image
  • Hog feature extractor
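The first two HoG steps can be sketched for a single pixel. This version assumes simple `[-1, 0, 1]` difference filters, unsigned directions in the range 0 to 180 degrees, and hard assignment to one of 9 bins. Full HoG implementations typically split each vote between neighbouring bins, which is omitted here.

```python
import math


def gradient_at(image, r, c):
    """Magnitude and unsigned direction (degrees) at an interior pixel."""
    gx = image[r][c + 1] - image[r][c - 1]    # horizontal filter [-1, 0, 1]
    gy = image[r + 1][c] - image[r - 1][c]    # vertical filter [-1, 0, 1]^T
    magnitude = math.sqrt(gx ** 2 + gy ** 2)
    direction = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, direction


def bin_index(direction, bins=9):
    """Map a direction in [0, 180) to one of `bins` equal-width bins."""
    return int(direction // (180.0 / bins)) % bins


image = [
    [0, 0, 0],
    [0, 0, 3],
    [0, 4, 0],
]                                             # toy 3x3 grayscale patch
magnitude, direction = gradient_at(image, 1, 1)
```

Here $g_x = 3$ and $g_y = 4$, so the magnitude is 5 and the direction is about 53 degrees, which falls in the third 20-degree bin (index 2).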

LBP, Local Binary Pattern

To describe the image textures

$$LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c) \cdot 2^p$$

  • An efficient texture operator which labels each pixel of an image by thresholding its neighbours.
  • A powerful feature for texture classification
  • The LBP operator describes image textures using two measures: local spatial patterns and the gray-scale contrast of their strength.
  • $s(x)$ is a thresholding function: $s(x) = 1$ if $x \ge 0$, otherwise $0$
  • $(x_c, y_c)$ is the center pixel in the 8-pixel neighbourhood
  • $g_c$ is the gray level of the center pixel
  • $g_p$ is the gray value of a sampling point in an equally spaced circular neighbourhood of $P$ sampling points and radius $R$ around the point $(x_c, y_c)$
  1. Sample pixel neighbourhood
  2. Difference result
  3. Thresholding result
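The three steps can be sketched for one pixel with $P = 8$ and $R = 1$. This version samples the square 3x3 neighbourhood instead of interpolating points on a circle, and the neighbour ordering is an assumed convention. Both are simplifications.

```python
def lbp_code(image, r, c):
    """LBP code for the 8-pixel neighbourhood (P = 8, R = 1) of pixel (r, c)."""
    # Neighbour offsets in clockwise order starting at the top-left;
    # the starting point and direction are an assumed convention.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = image[r][c]
    code = 0
    for p, (dr, dc) in enumerate(offsets):
        difference = image[r + dr][c + dc] - center   # g_p - g_c
        s = 1 if difference >= 0 else 0               # thresholding s(x)
        code += s * 2 ** p
    return code


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]                                 # toy 3x3 grayscale patch
code = lbp_code(patch, 1, 1)      # 1 + 16 + 32 + 64 + 128 = 241
```

A histogram of these codes over an image region is what serves as the texture descriptor.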

LBP

ANN

$$L(a, y) = -(y \log a + (1-y) \log (1-a))$$

$$GD(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)})$$