
CNN 012

May 19, 2026 · 6 min read

Gracefullight

Owner

Three main layers of a CNN
- CONV: Convolution Layer
- POOL: Pooling Layer
- FC: Fully Connected Layer
- CONV extracts features, POOL downsamples feature maps, and FC makes the final prediction.
Why CNNs are used over ANNs for image processing
- Computationally efficient
- Using Filters to capture spatial features
- Sharing weights across the image
Overfitting
- The model essentially memorizes the training data, leading to poor performance on unseen data
- To prevent overfitting, we can use techniques like:
  - Dropout: Randomly dropping out neurons during training to prevent co-adaptation
  - Batch Normalization: Normalizing the inputs of each layer to stabilize learning
  - L1/L2 Regularization: Adding a penalty to the loss function to discourage large weights
    - L1 regularization adds a penalty based on the absolute value of the weights (weights can become exactly zero, which makes the model sparse and is useful for feature selection).
    - L2 regularization adds a penalty based on the squared value of the weights (weights shrink toward zero but do not become exactly zero, which reduces model complexity and overfitting).
  - Data Augmentation: Creating new training samples by applying transformations to existing data
ReLU
- If the input is below zero, ReLU outputs 0.
- If the input is above zero, it outputs the input value itself.
- max(0, x)
- ReLU can output any number from 0 to infinity, which allows it to capture a wide range of features in the data.
- It mitigates the vanishing gradient problem: for positive inputs the gradient is 1, so gradients flow through the network without being squashed toward zero, which can happen with activation functions like sigmoid or tanh.
Sigmoid: Binary Classification
Softmax: Multi-class Classification
Backpropagation: sends the error backward through the network and calculates gradients, so the model knows how to update its weights and biases.
Gradient Descent
- It uses Backpropagation to calculate the exact slope (the gradient) of the error (loss).
- Then it takes a step in the opposite direction of the gradient to minimize the error.
- It repeats this iterative process until it reaches a local minimum.
Vanishing Gradient Problem
- The gradient becomes too small, so earlier layers learn very slowly or almost stop learning.
- ReLU helps mitigate this problem by allowing gradients to flow through the network without being squashed to zero.
Learning Rate $\alpha$
- It's a hyperparameter that controls how big of a step the model takes down the slope.
- If $\alpha$ is too small, the model will take tiny steps and may take a long time to converge.
- If $\alpha$ is too large, the model may overshoot the minimum and diverge.
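The effect of the learning rate can be sketched on a toy problem. The loss $L(w) = w^2$, the starting point, and the three values of $\alpha$ below are assumptions chosen only for illustration; they are not from a real training run.

```python
def gradient_descent(alpha, steps=20, w0=5.0):
    """Minimize the toy loss L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # slope of the loss at the current w
        w = w - alpha * grad  # step in the opposite direction of the gradient
    return w

small = gradient_descent(alpha=0.001)  # tiny steps: still far from the minimum at 0
good = gradient_descent(alpha=0.1)     # moves steadily toward the minimum
large = gradient_descent(alpha=1.1)    # overshoots on every step and diverges
```

With these hypothetical settings, only the middle learning rate ends up close to the minimum after 20 steps.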
Precision: $\frac{TP}{TP + FP}$
- Of all the patients the model predicted as having the disease, how many actually have the disease?
Recall: $\frac{TP}{TP + FN}$
- Of all the patients who actually have the disease, how many did the model successfully catch?
- Recall is more important in medical diagnosis because we want to minimize false negatives (missing a disease).
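As a small worked example, precision and recall can be computed from binary labels as below. The label lists are made-up data (1 = has the disease, 0 = healthy), and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: 4 sick patients, 4 healthy patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)  # TP=2, FP=1, FN=2
```

Here the model misses two of the four sick patients, so recall is lower than precision, which is the case a medical screening model should avoid.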
Sliding Window
- Computationally expensive because it requires multiple passes over the image with different window sizes and strides.
- Multiple scales are needed to detect objects of varying sizes, which further increases the computational cost.
Stride: controls the step size of the sliding filter. Larger stride means smaller output.
Edge
- The points or pixels in an image where brightness or intensities change sharply.
- Sobel filter
- Prewitt filter
- Canny edge detector
Padding: adds zeros around the image so the CONV does not shrink the feature map too much.
To keep the output dimension the same as the input dimension (with stride 1), we can use padding.
- $P = \frac{F - 1}{2}$
Image Classification: Assigning a label to an entire image (e.g., cat, dog, car).
Object Detection: Identifying and localizing multiple objects within an image (e.g., bounding boxes)
Instance Segmentation: Identifying and segmenting each object instance in an image (e.g., pixel-level masks)
Momentum: uses an exponentially weighted average of past gradients to smooth updates and accelerate convergence.
RMSProp: uses an exponentially weighted average of squared gradients to adapt the learning rate for each parameter.
Adam: combines Momentum and RMSProp by using both the first moment, average gradient, and the second moment, average squared gradient.
Hyperparameters: learning rate, batch size, number of epochs, optimizer type, dropout rate, etc.
Supervised Learning: The model learns from labeled data (e.g., classification, regression).
Unsupervised Learning: The model learns from unlabeled data (e.g., clustering).
Loss/Cost function: an estimate of how far the model's predictions are from the actual target/answer.
AI is a broad concept of machines performing human-like tasks.
ML is a subset of AI that learns from data
DL is a subset of ML that uses deep neural networks with many layers.
ML's major problem
- insufficient data
- non-representative training data
- poor-quality data
- irrelevant features
- overfitting
- underfitting
When do we use ML?
- a large amount of data for finding patterns and making predictions
- too many rules or too much complexity for humans to handle
Faster R-CNN: Propose regions first, then classify them
- RPN: Region Proposal Network, which generates candidate object proposals
YOLO: Predict boxes and class probabilities directly from the image in one pass
- Anchor boxes: predefined bounding boxes of different sizes and aspect ratios used to predict the location of objects in YOLO.
NMS: Non-Maximum Suppression, selects the best bounding box among overlapping boxes based on confidence scores.
1×1 convolution mixes channel information and can reduce the number of channels, so later convolutions become cheaper.
Inception module: learns small, medium, and large visual features at the same time.
Transfer Learning Strategies:
- First, if the new dataset is small and similar to the original dataset, we can use the pre-trained model directly.
- Second, if the dataset is similar but has different classes, we freeze the convolutional layers and train only the fully connected classification layer.
- Third, if the dataset is small but not very similar, we freeze the early convolutional layers and fine-tune the later convolutional layers plus the FC layer.
- Finally, if the dataset is large and different, we can fine-tune the whole network.
IoU: Intersection over Union, a metric used to evaluate the accuracy of object detection models by comparing the predicted bounding box with the ground truth bounding box.

CNN 011

May 11, 2026 · 4 min read

Gracefullight

Owner

Sequence

A sequence carries a lot of context that helps predict the next behavior.

Sequence modelling types

One to One: Binary classification
- X -> Y'
- Will it rain today? Yes/No
Many to One: Sentiment Analysis
- X1, X2, X3, ... -> Y'
- Is this review positive or negative?
One to Many: Image Captioning
- X -> Y1, Y2, Y3, ...
- Image: A woman is throwing a frisbee in the park
Many to Many: Q&A with LLMs, language translation
- X1, X2, X3, ... -> Y1, Y2, Y3, ...
- Q: Hey Siri, how's the weather today? A: It's sunny and warm outside.

RNN

Recurrent Neural Network

$y'_t = f(x_t, h_{t-1})$
- $y'_t$ : output at time t
- $x_t$ : input at time t
- $h_{t-1}$ : past memory (hidden state from the previous step)
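The recurrence above can be sketched with a single scalar unit. The `tanh` activation, the weights `w_x` and `w_h`, and using the hidden state directly as the output are all assumptions made for illustration.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: combine the current input with the past memory."""
    h_t = math.tanh(w_x * x_t + w_h * h_prev + b)
    y_t = h_t  # assumption: the output is simply the new hidden state
    return y_t, h_t

def run_rnn(xs):
    """Run the same step (shared parameters) over a variable-length sequence."""
    h = 0.0
    outputs = []
    for x in xs:
        y, h = rnn_step(x, h)
        outputs.append(y)
    return outputs

outputs_a = run_rnn([1.0, 0.0, 0.0])  # the first input still influences later outputs
outputs_b = run_rnn([0.0, 0.0, 0.0])  # no input signal, so every output stays at 0
```

The later outputs of `outputs_a` are non-zero even though their inputs are zero, which is the memory carried by $h_{t-1}$.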

Sequence Modelling

Support for Variable-Length input
Has Temporal Dependency (Long, Short-term)
Preserve the information order
Share parameters across sequence

Attention

Why
- RNNs process sequences one step at a time
- Long sentences lead to Long-term memory loss
- Important words can be hidden in long dependencies
Attention helps to focus on relevant parts of the input
For each output word, attention decides which input words are most important
Computes a weighted sum of all input vectors
Words with higher weights are more important
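A minimal sketch of the weighted sum described above, assuming dot-product scores followed by a softmax. The notes do not say how the weights are computed, so that choice, the toy vectors, and the name `attend` are illustrative only.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, inputs):
    """Score each input vector against the query, then take the weighted sum."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in inputs]
    weights = softmax(scores)
    dim = len(inputs[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, inputs)) for i in range(dim)]
    return weights, context

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy input word vectors
query = [2.0, 0.0]                             # toy vector for the current output word
weights, context = attend(query, inputs)
```

The second input is orthogonal to the query, so it receives the smallest weight and contributes least to the context vector.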

Transformer

Self-Attention is the foundation of the Transformer architecture
The entire sequence is processed in parallel
Has Encoder and Decoder blocks
Stack of Layers with Self Attention and Feed Forward Neural Network

Vision Transformer (ViT)

Vision transformers have extensive applications across computer vision tasks
ViT looks at images the way a language model looks at words
Images are represented as a sequence of patches

Steps to use ViT

Split an image into patches
Flatten the patches
Produce lower-dimensional linear embeddings from the flattened patches
Add positional embeddings
Feed the sequence as input to a standard transformer encoder
Pretrain the model with image labels (fully supervised on a huge dataset)
Finetune on the downstream dataset for image classification
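The first two steps (split into patches, flatten) can be sketched in plain Python. A single-channel image stored as nested lists and the helper name `image_to_patches` are assumptions for illustration; real pipelines work on multi-channel tensors.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image into non-overlapping patches and flatten each one."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)  # one flattened patch acts like one token
    return patches

# Toy 4x4 "image" with pixel values 0..15, split into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
```

Each flattened patch would then be projected to an embedding and combined with a positional embedding before entering the encoder.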


CNNs vs Vision Transformer (ViT)

| Key Aspects | CNNs | ViT |
| --- | --- | --- |
| Input Handling | Processes the entire image using filters (kernels) | Splits image into fixed-size patches (like tokens) |
| Local vs. Global | Focuses on local patterns first (edges, textures) | Uses global self-attention to relate all patches |
| Architecture | Hierarchical (`convs -> pools -> deeper features`) | Flat transformer encoder stack |
| Training Data Need | Works well with limited data | Needs lots of data or pretraining |
| Computation | Efficient with low-res inputs | Computationally heavier, especially on large images |
| Parallelism | Limited; uses sequential feature stacking | High; patch processing is highly parallelizable |

RF-DETR

Roboflow Detection Transformer

Object detection techniques using Transformers
An improvement over the original DETR (Detection Transformer) model
DETR looks at everything globally but misses small things.
RF-DETR looks globally and understands the relationships between things.
First real-time Transformer-based object detection architecture
Reported to outperform earlier real-time object detectors, reaching 60+ mAP on the COCO dataset


Diffusion Models

Generate new data samples (images, audio, text) that are similar to a training dataset by learning to reverse a gradual noising process
Forward Diffusion
- Add noise gradually to the original image for many steps
- Iterate until the image becomes pure noise
- Gaussian noise used (no learning)
Reverse Diffusion
- Denoising: the model is trained to predict and reverse this noise
- Use the prediction to denoise the image
- Given a noisy image, it predicts a slightly less noisy image version
- After several steps, it reconstructs a clean and new image from pure noise
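The forward (noising) half can be sketched without any learning. The update rule below (scale the image by $\sqrt{1-\beta}$ and add Gaussian noise scaled by $\sqrt{\beta}$) and the constant $\beta$ are assumptions borrowed from common DDPM-style formulations; the notes do not fix a schedule.

```python
import math
import random

def add_noise(pixels, beta, rng):
    """One forward step: shrink the signal slightly and add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    scale = math.sqrt(beta)
    return [keep * p + scale * rng.gauss(0.0, 1.0) for p in pixels]

def forward_diffusion(pixels, steps, beta=0.1, seed=0):
    """Apply the noising step repeatedly and return every intermediate image."""
    rng = random.Random(seed)
    trajectory = [list(pixels)]
    for _ in range(steps):
        trajectory.append(add_noise(trajectory[-1], beta, rng))
    return trajectory

# Toy 4-pixel "image"; after many steps almost none of the original signal remains.
trajectory = forward_diffusion([1.0, 1.0, 1.0, 1.0], steps=50)
```

A denoising model would be trained on pairs taken from such a trajectory to predict the noise added at each step.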

Steps to train a diffusion model

Start with real data
Add noise step by step, until the image becomes pure noise
Train a model to reverse this process, denoising to recover the original image
Once trained, the model can start from pure noise and generate new and realistic samples

Applications of Diffusion Models

Given a lot of sample sprite images
Can generate new sprite images
- New image generation from image input

CNN 010

May 2, 2026 · 5 min read

Gracefullight

Owner

Drawbacks of Anchor-based detectors

It is sensitive to:

Size
Aspect Ratio
Number of Anchor boxes (Fixed)
Too much variation in shape
Small objects
May not generalize due to pre-defined anchor boxes
Computationally expensive

Anchor-free detectors

Localize objects without using boxes as proposals

Key-point based
Center-based

Key-point based

Locates key object parts in an image
Detects spatial locations or points unique to an object
With human body as an example
Key part of face: nose, eyes, eyebrows, mouth ...
Key point of human body: joints, elbows, knees ...
Object is represented using Key-points

Center-based

Finds positives in the center
Predicts four distances from the positive to the potential object boundary
- Top, left, bottom, right
- {x, y, T, R, B, L}
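Turning a center-based prediction back into a bounding box is simple arithmetic. The sketch below assumes image coordinates where y grows downward and returns corner format `(x1, y1, x2, y2)`; the function name and numbers are illustrative.

```python
def decode_box(x, y, top, right, bottom, left):
    """Convert a positive location plus four distances into box corners."""
    x1 = x - left    # left edge
    y1 = y - top     # top edge (y grows downward in image coordinates)
    x2 = x + right   # right edge
    y2 = y + bottom  # bottom edge
    return (x1, y1, x2, y2)

box = decode_box(x=50, y=40, top=10, right=20, bottom=15, left=5)
```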

YOLO

Yolo V1: 2015
- darknet backbone
Yolo V2: 2016
- Anchor boxes
- Batch normalization
Yolo V3: 2018
- Objectness score
- improvement for small objects
Yolo V4/V5: 2020
- Solid Baseline Model
- Lightweight and Fast
- image classification, object detection, and instance segmentation
- Multiple input processing (Video, Image, Live stream)
- Optimize weights
- Developed by Ultralytics (not original author)
Yolo X/R: 2021
- Decoupled head
- First version of Anchor free
- Improved efficiency in the backbone
Yolo V6/V7: 2022
- Faster and more accurate
Yolo NAS/V8: 2023
- Anchor free
- Architectural improvement
- Strong baseline for realtime object detection
Yolo V9/V10/V11: 2024
- Oriented bounding box
- Strong baseline for oriented object detection
Yolo V12: 2025
- Attention mechanism, introduced transformer
- Little slower
Yolo 26: 2026
- Deployment on a small form factor hardware
- realtime object detection on edge devices
- Strongest baseline for edge device deployment (realtime and accuracy)
- Efficient Loss Function

YoloX

Anchor-free detector in the Yolo Family
Decoupled head used
Label assignment using SimOTA
Use YoloV3 SPP with DarkNet53 backbone
Uses advanced augmentation such as Mix-up & Mosaic

Backbone: Feature extraction
Neck: Aggregation of multi-scale feature
Head: Localization and Classification scores

Decoupled head


Coupled Head: one head gives regression score and classification score (Dog/Cat + Location, BBox)
Decoupled Head:
- First head gives Classification score (Dog/Cat)
- Another head gives Regression score (Location, BBox)

Data Augmentation

mixup augmentation

occluded and overlapped objects
improve model robustness

mosaic augmentation

four images are combined into one
crops and resizes the images to create a new training sample

Yolo 26

Realtime computer vision model
Detection, Segmentation, Classification, Pose, Tracking, OBB (Oriented Bounding Box)
Available in Nano, Small, Medium, Large, XLarge
E2E detection pipeline (NMS-free, Non-Maximum Suppression free)
Designed for edge AI and fast deployment

Why is it faster

NMS-free inference removes post-processing overhead
Direct bounding box regression (No DFL, Distribution Focal Loss)
Lower latency and simpler deployment graph
CPU-optimized architecture
Up to 43% faster on CPUs than V11

Key Changes

ProgLoss (Progressive Loss Balancing): improves training stability and convergence
STAL (Small-Target-Aware Label Assignment): improves small-object detection
MuSGD optimizer improves convergence speed
Better speed-accuracy trade-off than many previous YOLO models
Ideal for robotics, drones, surveillance, and edge devices

Inference pipeline

Backbone: Efficient Hybrid CNN + Attention
Neck: PAN-FPN (Multi-scale Feature Fusion)
Head (Decoupled & Dual Head)
- One-to-Many Head: Dense supervision (Training only, many positives)
- One-to-One Head: Single best match NMS-free inference (Inference & training)

Training pipeline

Instance Segmentation

Identifies each pixel of an object instance
whereas Semantic Segmentation classifies object pixels to specific classes/categories
Instance Segmentation
- SegNet
- DeepMask
- SharpMask
- Mask RCNN
Semantic Segmentation
- Conditional Random Field (CRF)
- Fully Convolutional Networks (FCN)
- U-Net
- Pyramid scene parsing network (PSPNet)

Application of Instance Segmentation

Autonomous Driving
Scene Understanding
Aerial Image Processing

Mask R-CNN

Mask-Region Convolutional Neural Network

An addition to the RCNN family, performing instance segmentation
Improved over Faster RCNN
Fully Convolutional Network for predicting a mask for each class/object.
Two stages:
1. RPN proposes candidate object bounding boxes
2. Classify the candidates, refine bounding boxes, and predict masks.

Mask R-CNN architecture

Limitations of Mask R-CNN

Computational Complexity: Training and inference can be computationally intensive, requiring substantial resources (especially with high-resolution images or large datasets).
Small-Object Segmentation: may struggle to accurately segment very small objects due to limited pixel information.
Data Requirements: Training requires a large amount of annotated data, which can be time-consuming and expensive to acquire.
Limited Generalization to Unseen Categories: The model's ability to generalize to unseen object categories is limited.

Semantic Segmentation

U-Net

input image -> U-Net -> output segmentation map

References

Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv. https://doi.org/10.48550/arXiv.2107.08430
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, & A. F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (Vol. 9351, pp. 234–241). Springer. https://doi.org/10.1007/978-3-319-24574-4_28

CNN 009

April 22, 2026 · 2 min read

Gracefullight

Owner

Predicting Bounding Boxes

Using:
- Sliding Window (Slow)
- Selective Search
- Region Proposals
Task:
- Predict bounding boxes from CNN

Non-Maximum Suppression (NMS)

Check the probability of each detection and keep the ones with a score above a certain threshold (e.g., 0.7)
For the remaining boxes:
- The box with the highest score is taken as a detection result.
- Discard any remaining box with IoU > 0.5 with that detected box, i.e., boxes that overlap heavily with the highest-scoring box.
- Repeat with the boxes that are left until none remain.
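The procedure above can be sketched as follows. Boxes are assumed to be in corner format `(x1, y1, x2, y2)`, the thresholds are the ones quoted in the notes, and the sample boxes and scores are made up.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, score_threshold=0.7, iou_threshold=0.5):
    """Return indices of the boxes kept after non-maximum suppression."""
    # Step 1: drop low-confidence detections, then sort by score (best first).
    candidates = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    while candidates:
        best = candidates.pop(0)  # highest remaining score is a detection
        kept.append(best)
        # Step 2: discard boxes that overlap the chosen box too much.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.75, 0.3]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first heavily and is suppressed, and the last box is dropped by the score threshold.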

Anchor Boxes

Associate each object to:
- A cell which contains its mid-point and
- Anchor box for the cell with highest IoU
Calculate the IoU of Anchor boxes and predicted Bounding Boxes.
- $IoU(P_{bb}, A_{bb}) = \frac{\text{Area of Overlap}}{\text{Area of Union}}$
$\hat{y} = \{P_0, x, y, h, w, C1, C2, \quad P_0, x, y, h, w, C1, C2\}$
- $P_0$ is objectness score
- $x, y$ are the coordinates of the center of the bounding box relative to
- $h, w$ are the height and width of the bounding box
- $C1, C2$ are the class information for the object in the bounding box

YOLO

Real-time performance with 45 FPS, 0.02 sec per image
Not suitable for small objects
Issues with new or multiple aspect ratios and unable to generalize

SSD, Single Shot Detector

Similar to YOLO, VGG16 base Convolutional Neural Network layers
Take advantage of Anchor boxes with different aspect ratios
A large number of anchor boxes is used
Not suitable for small objects
3 times faster than Faster R-CNN
With a ResNet-101 base, SSD may detect small objects better thanks to better features from the CONV layers

SSD 300 architecture

Overview of Object Detection

Base Networks
- VGG16
- ResNet-101
- Inception-v2, v3
- ResNet
- MobileNet
- AlexNet
- ZFNet
Object Detection Framework
- R-CNN family
- YOLO family
- SSD family
- F-RCNN family
Faster-RCNN is more accurate but slower
YOLO/SSD are faster/real-time but may not be very accurate

CNN 008

April 21, 2026 · 4 min read

Gracefullight

Owner

Datasets

PASCAL Visual Object Classes

PASCAL VOC

a popular dataset for object detection, classification and segmentation
20 categories

ImageNet

a dataset for object detection
500,000 images, 200 categories
Not very popular due to large number of classes and size of the dataset

COCO

Microsoft Common Objects in Context dataset

a large-scale object detection, segmentation, and captioning dataset.
330,000 images, 80 categories
200,000 labeled images, 1.5 million object instances
91 stuff categories

Intersection over Union (IoU)

$IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}$

a metric used for the evaluation of an object detector
measures how closely the predicted bounding box matches the ground-truth box of the detected object
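A minimal sketch of the IoU formula, assuming axis-aligned boxes in corner format `(x1, y1, x2, y2)`; the two example boxes are made up.

```python
def iou(box_a, box_b):
    """Area of overlap divided by area of union for two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    overlap = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0

pred = (0, 0, 10, 10)    # predicted box
truth = (5, 5, 15, 15)   # ground-truth box
score = iou(pred, truth)  # overlap 25, union 175
```

A score of 1.0 means a perfect match and 0.0 means no overlap, so this prediction would fail both the 0.50 and 0.75 thresholds used in the AP metrics.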

AP

Average Precision

| Metric | Description |
| --- | --- |
| $AP$ | AP at IoU=.50:0.05:0.95 (primary challenge metric) |
| $AP^{IoU=.50}$ | AP at IoU=0.50 (PASCAL VOC metric) |
| $AP^{IoU=.75}$ | AP at IoU=0.75 (strict metric) |
| $AP^{small}$ | AP for small objects: $area < 32^2$ |
| $AP^{medium}$ | AP for medium objects: $32^2 < area < 96^2$ |
| $AP^{large}$ | AP for large objects: $area > 96^2$ |
| $AR^{max=1}$ | AR given 1 detection per image |
| $AR^{max=10}$ | AR given 10 detections per image |
| $AR^{max=100}$ | AR given 100 detections per image |
| $AR^{small}$ | AR for small objects: $area < 32^2$ |
| $AR^{medium}$ | AR for medium objects: $32^2 < area < 96^2$ |
| $AR^{large}$ | AR for large objects: $area > 96^2$ |

Taxonomy of Object Detection

History of Object Detection

Classification with Localization

Classification Task
- Input: Image
- Output: Class label
- Performance Metric: Accuracy
Localization Task
- Input: Image
- Output: Bounding box coordinates $(x, y, Ht, Wd)$ or $(x, y, x', y')$
- Performance Metric: IoU

Localization Loss

Localization as a regression problem

Detection as a Classification Problem

Region Proposal

Find blobs in the image that are most likely to contain objects.
Selective Search: ~1000-2000 region proposals using CPU

R-CNN

Region based CNN

Convolutional Neural Network as feature extractor
SVM as classifier
Bounding box regression for localization
Pass each region through the CNN to extract features, then classify using the SVM and refine the bounding box using regression
Each image region is warped to a fixed size (e.g., 227x227) before passing through the CNN
The proposal method yields around 2000 Regions of Interest (RoIs) per image, which is computationally expensive

Fast R-CNN

Run the whole image through the CNN to get a feature map, then classify each region proposal using RoI pooling and fully connected layers
- Region of Interest (RoIs) from proposal method
- Crop and Resize features
- Per-Region Network
- Linear + Softmax for Object category
- Linear for Box offset
Reduce computation
RoIs are still proposed by selective search and then cropped from the feature map
mAP: 70% for PASCAL VOC 2007

Faster R-CNN

Use CNNs to make proposals
RPN (Region Proposal Network) to generate region proposals
- Small neural network to predict proposals from the feature map
RoI pooling to extract features for each proposal
- then classify and refine bounding box
mAP: 78.8% for PASCAL VOC 2007

| Model | Description |
| --- | --- |
| R-CNN | Look at every patch one by one |
| Fast R-CNN | Look once, and then inspect patches on feature map |
| Faster R-CNN | Propose patches using a neural network (RPN) |

R-CNN Family Comparison

| Feature | R-CNN | Fast R-CNN | Faster R-CNN |
| --- | --- | --- | --- |
| Region proposal | Selective search | Selective search | RPN (learned) |
| CNN Usage | Per region | Once per image | Once per image |
| Speed | Very slow | Faster | Can work in real-time |
| Training | Multi-stage, discrete | Partially end-to-end | Fully end-to-end |
| Accuracy | Good | Better | Best of all three |

Image Annotation for Object Detection

difficulty: not easy to annotate images even for humans

CNN 007

April 13, 2026 · 6 min read

Gracefullight

Owner

Transfer Learning

Knowledge acquired while solving one task can be used to solve related tasks
Similar to the way humans apply knowledge acquired from one task to solve a new but related task.

Transfer Learning Benefits

Less training data required: Model trained using a large (similar) dataset can be used as a starting point for training on a smaller dataset.
Faster training: Training can converge faster due to the use of existing knowledge (weights) as a starting point rather than training from scratch.
Better model generalization: Model is trained to identify features which can be applied to new contexts.

VGG-16

| Approach | Description | Use Case | When to Use |
| --- | --- | --- | --- |
| Use Pre-trained Model | Use ImageNet pre-trained model without any additional training | Dogs & cats classification | When dataset distribution is similar to ImageNet with few samples |
| Train FC Layers Only | Use CONV layers for feature extraction, train FC layers only | Different class classification on similar domain | When dataset is similar to ImageNet but different classes with limited samples |
| Train Last CONV + FC Layers | Train last CONV layers (specialized features) and FC layers | Significantly different data distribution domain | When dataset differs greatly from ImageNet, different classes, and limited samples |
| Train All CONV + FC Layers | Train all CONV layers and FC layers (with modifications) | Complex task with different domain | When dataset differs greatly from ImageNet, different classes, dataset is large, and task is complex |

AlexNet

Input: 224x224x3 image
Activations: ReLU after each CONV and FC layer
Optimizer: SGD with Momentum
Regularization: Dropout in FC1 and FC2
Total Trainable Parameters: ~60 million
Training settings: Nvidia GTX 580 3GB GPUs for 6 days

GoogleNet

Accuracy: top-5 test error rate of 6.7%
Close to human level performance
22 layer deep CNN
Optimizer: RMSProp
Total Trainable Parameters: ~4 million (Significantly reduced)
A novel inception module was introduced


Inception Module

Use filters with different size together
Use different types of layers (CONV, POOL etc.) together
It leads to better performance and efficiency but complicated architecture.

1x1 Convolution

Input image ( $6 \times 6 \times 1$ ), a 1x1 kernel, and the output can be written as:

$$
X= \begin{bmatrix} 100&100&100&0&0&0\\ 100&100&100&0&0&0\\ 100&100&100&0&0&0\\ 100&100&100&0&0&0\\ 100&100&100&0&0&0\\ 100&100&100&0&0&0 \end{bmatrix}, \quad K=\begin{bmatrix}3\end{bmatrix}
$$

$$
Y = K * X
$$

$$
Y= \begin{bmatrix} 300&300&300&0&0&0\\ 300&300&300&0&0&0\\ 300&300&300&0&0&0\\ 300&300&300&0&0&0\\ 300&300&300&0&0&0\\ 300&300&300&0&0&0 \end{bmatrix}
$$

For channel reduction with a 1x1 convolution, each spatial location $(i,j)$ is a vector:

$$
\mathbf{x}_{i,j} \in \mathbb{R}^{256}
$$

One 1x1 layer with 128 filters is a matrix:

$$
W \in \mathbb{R}^{128 \times 256},\quad \mathbf{b} \in \mathbb{R}^{128}
$$

At each location, output channels are computed by matrix multiplication:

$$
\mathbf{z}_{i,j}=W\mathbf{x}_{i,j}+\mathbf{b},\quad \mathbf{y}_{i,j}=\mathrm{ReLU}(\mathbf{z}_{i,j})
$$

So the shape changes as:

$$
64\times64\times256 \;\xrightarrow{\;1\times1\;\text{Conv (128 filters)}+\mathrm{ReLU}\;} 64\times64\times128
$$

If we flatten all spatial positions ( $64\times64=4096$ ):

$$
X_{\text{flat}} \in \mathbb{R}^{4096\times256},\quad Y_{\text{flat}}=\mathrm{ReLU}\left(X_{\text{flat}}W^T+\mathbf{1}\mathbf{b}^T\right) \in \mathbb{R}^{4096\times128}
$$
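The per-location matrix multiplication can be sketched in plain Python on a much smaller tensor. The 2x2x4 input, the two filters, and the bias values are toy assumptions standing in for the 64x64x256 and 128-filter case above.

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution followed by ReLU to an H x W x C_in feature map."""
    output = []
    for row in feature_map:
        out_row = []
        for x in row:  # x is the channel vector at one spatial location
            z = [sum(w * v for w, v in zip(w_row, x)) + b
                 for w_row, b in zip(weights, bias)]
            out_row.append([max(0.0, value) for value in z])  # ReLU
        output.append(out_row)
    return output

# Toy input: 2 x 2 spatial positions, 4 channels each.
feature_map = [[[1.0, 2.0, 3.0, 4.0] for _ in range(2)] for _ in range(2)]
weights = [[1.0, 0.0, 0.0, 0.0],    # filter 1: copies channel 0
           [0.0, 1.0, 0.0, -1.0]]   # filter 2: channel 1 minus channel 3
bias = [0.5, 0.0]
output = conv1x1(feature_map, weights, bias)  # shape 2 x 2 x 2
```

The spatial size is unchanged while the channel count drops from 4 to 2, which is the same effect as the 256 to 128 reduction described above.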

Inception V2 and V3

V1 (GoogleNet): Replace one 5x5 conv with two stacked 3x3 conv layers.
- Number of parameters: $5^2=25$ vs. $2\times3^2=18$ (about 28% reduction)
V2: Factorize an $n\times n$ conv into $1\times n$ and $n\times1$ convs.
- For $3\times3$ : $3^2=9$ vs. $3+3=6$ (about 33% reduction)
V3: Use more aggressive factorization and branch design (e.g., $1\times7$ and $7\times1$ ), plus efficient grid-size reduction.
- Improves the accuracy-efficiency tradeoff while keeping computation manageable

ResNet

Deep Residual Networks, skip connections, and identity mappings

Enabled the development of much deeper networks
ResNet is composed of residual blocks, which were introduced to address the vanishing gradient problem in deep networks.
- Degradation problem: adding more layers eventually has a negative effect on the final performance


CNN 006

March 28, 2026 · 4 min read

Gracefullight

Owner

Data Preparation

Small Dataset (Range: 100 to 100,000 samples)
- Train/Valid/Test: 60/20/20
- Train/Test: 70/30
Large Dataset (Range: 500,000 to 1M+ samples)
- Train/Valid/Test: 98%/10,000/10,000
- Usually, more training data gives better performance.
Rule of Thumb: Validation and Test sets should come from the same distribution.

Bias and Variance

Bias: A value that shifts the activation function left or right to better fit the data.
- With bias the curve/line will not always pass through origin
- can get a better fit to training data
Variance: The sensitivity of the model to small fluctuations in the training data.
- The change in prediction accuracy of ML model between training data and test data.
- A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before.
- With high variance, the model performs very well on training data but poorly on test data.


High Bias
- High training error, underfitting
- Validation/test error nearly same as train error
- Potential things to try:
  - Increase features
  - Make ML model more complicated
  - Decrease the regularization parameter
High Variance
- Low training error, overfitting
- High validation/test error
- Potential things to try:
  - Increase dataset size
  - Reduce input features
  - Increase the regularization parameter

Accuracy

Bayes Optimal Error (BOE): The lowest possible error that any model can achieve on a given dataset.
Human-Level performance:
- Humans are very good at a lot of tasks
- Can get labelled data from humans to help improve the model performance
- Gain insights from manual error analysis

Regularization

a technique which makes slight modifications to the learning algorithm such that the model generalizes better on the unseen data.
Update the loss/cost function by adding a regularization term
- $\text{Loss function} = \text{Loss} + \text{Regularization term}(\lambda)$
- Due to $\lambda$ , the values in the weight matrices decrease; the assumption is that a neural network with smaller weights is a simpler model.
- Regularization penalizes the weight matrices of the nodes
L2 regularization
L1 regularization
Dropout

L2 Regularization

$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum_{j=1}^{n_x} w_j^2$

$\lambda$ is a hyper-parameter
Also known as weight decay, because it forces the weights to decay towards zero, but not exactly to zero.

L1 Regularization

$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum_{j=1}^{n_x} |w_j|$

Penalize the absolute value of the $w$
Weight may reduce to zero
Useful in compressing a model (sparse model)
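The two penalty terms can be computed directly from the cost functions above. The weights, $\lambda$, and $m$ below are made-up values for illustration.

```python
def l2_penalty(weights, lam, m):
    """L2 term: lambda / (2m) times the sum of squared weights."""
    return lam / (2 * m) * sum(w * w for w in weights)

def l1_penalty(weights, lam, m):
    """L1 term: lambda / (2m) times the sum of absolute weights."""
    return lam / (2 * m) * sum(abs(w) for w in weights)

weights = [0.5, -2.0, 0.0, 1.0]            # toy weight vector
l2 = l2_penalty(weights, lam=0.1, m=10)    # 0.005 * 5.25
l1 = l1_penalty(weights, lam=0.1, m=10)    # 0.005 * 3.5
```

The large weight (-2.0) dominates the L2 term because it is squared, while the L1 term grows only linearly with each weight.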

Dropout

It produces good results and is one of the most popular regularization techniques in deep learning.
At every iteration, it randomly selects and drops some nodes and removes all the connections to those nodes.
Each iteration has a different set of nodes.
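A minimal sketch of dropout on one layer's activations. Rescaling the surviving values by `1 / keep_prob` ("inverted dropout") is a common convention assumed here; it is not stated in the notes.

```python
import random

def dropout(activations, drop_prob, rng):
    """Randomly zero activations; scale survivors so the expected value is unchanged."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
activations = [1.0] * 10
out1 = dropout(activations, 0.5, rng)  # one training iteration
out2 = dropout(activations, 0.5, rng)  # next iteration draws a new random mask
```

At inference time dropout is switched off, which corresponds to calling the function with `drop_prob=0.0`.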

Data Augmentation

A simple way to reduce overfitting is to increase the size of the training dataset.
This is done by creating more samples from the existing set by applying the following simple operations
- Flip
- Rotate
- Scale
- Crop
- Translate
- Gaussian Noise

Cutout

Simple regularization technique of randomly masking out square regions of input during training.
Patch size: 16x16 to 64x64
Fill value: 0 or mean pixel value
Patches: 1-3 per image

Mixup

Trains a neural network on convex combinations of pairs of examples and their labels.
It regularizes the neural network to favor simple linear behavior in-between training examples.
Image A ( $\lambda = 0.55$ ) + Image B ( $\lambda = 0.45$ ) = Blended Output
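The blend above can be sketched as a convex combination of both the pixels and the one-hot labels. The two-pixel "images" and the function name `mixup` are toy assumptions.

```python
def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two examples: lam of A plus (1 - lam) of B, for pixels and labels."""
    image = [lam * a + (1 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label

# Toy data: image A belongs to class 0, image B to class 1.
image, label = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.55)
```

The resulting soft label is roughly 55% class 0 and 45% class 1, matching the mixing weights used for the pixels.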

CutMix

Patches are cut and pasted among training images, where the ground truth labels are also mixed proportionally to the area of the patches.
Image A + Image B (Patch) = Pasted Patch Output.

Random Augmentation

A set of augmentation operations is defined, and a random subset of these operations is applied to each image during training.
Identity, AutoContrast, Equalize, Rotate, Solarize, Color, Posterize, Contrast, Brightness, Sharpness, ShearX/Y, TranslateX/Y

Generative Adversarial Networks (GANs)

Able to generate images which look similar to the original ones
Proven to be very effective in data augmentation, especially when the dataset is small.

Neural Style Transfer

Using CNN to separate style
Transfer style to different image

CNN 005

March 28, 2026 · 3 min read

Gracefullight

Owner

Computer Vision

Classification
Classification with Localization
Object Detection

| Aspect | ANN | CNN |
| --- | --- | --- |
| Input | 1D vector | 3D tensor (height, width, channels) |
| Connections | Fully connected | Local connections (receptive fields) |
| Overfitting | Prone to overfitting | Less prone to overfitting |

Convolutional Neural Networks (CNN)

Convolutional Layer (CONV)
Pooling Layer (POOL)
Fully Connected Layer (FC)

LENET-5

Convolutional Layer (CONV)

The first layer to extract features from an input image
Core building block of a CNN
Convolutions are the basic operation in this layer
A number of filters (e.g. edge detectors) are applied to the input image.

Padding

Padding is used to control the spatial size of the output feature maps.
Negative values at the edges can naturally arise because of padding, and they usually are not a big problem because activation functions and later layers come afterward.
Input Matrix dimension: $n \times n \times c$ (height, width, channels)
Filter size: $f \times f$
Padding ( $P$ ): 1, number of pixels added to the border of the input
$(n \times n) * (f \times f) \to (n + 2P - f + 1) \times (n + 2P - f + 1)$
- Example: $5 \times 5$ input with $3 \times 3$ filter and padding of 1 results in a $5 \times 5$ output feature map.
If the input and output matrix dimensions are the same, then $P = \frac{f - 1}{2}$ .
Valid padding ( $P = 0$ ): No padding
Same padding ( $P = \frac{f - 1}{2}$ ): The output size equals the input size (for stride 1 and odd $f$ ), which requires the appropriate amount of padding.
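
A minimal sketch that checks the padding formula on a square, single-channel input, assuming zero padding and stride 1. The helper names `pad2d` and `conv2d` are hypothetical, and the operation is the cross-correlation that most CNN libraries call convolution.

```python
def pad2d(image, p):
    """Add p rows/columns of zeros around a 2D image."""
    width = len(image[0])
    blank = [0] * (width + 2 * p)
    padded = [blank[:] for _ in range(p)]
    padded += [[0] * p + list(row) + [0] * p for row in image]
    padded += [blank[:] for _ in range(p)]
    return padded


def conv2d(image, kernel, p=0):
    """Stride-1 convolution of a square image with a square f x f kernel."""
    x = pad2d(image, p)
    f = len(kernel)
    out_size = len(x) - f + 1  # equals n + 2P - f + 1
    return [
        [
            sum(x[i + a][j + b] * kernel[a][b] for a in range(f) for b in range(f))
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


image = [[1] * 5 for _ in range(5)]
kernel = [[1] * 3 for _ in range(3)]
same = conv2d(image, kernel, p=1)   # same padding: 5 x 5 output
valid = conv2d(image, kernel, p=0)  # valid padding: 3 x 3 output
```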

Stride

It is the number of pixels by which the filter slides across the input image at each step.

Figures: no padding with strides; padding with strides (GitHub: vdumoulin/conv_arithmetic)
Input Matrix dimension: $n \times n$
Filter size: $f \times f$
Padding: $P$
Stride: $S$
Output Size = $\left\lfloor \frac{n + 2P - f}{S} + 1 \right\rfloor \times \left\lfloor \frac{n + 2P - f}{S} + 1 \right\rfloor$
- Example: Input Matrix dimension: $5 \times 5$ , Filter size: $3 \times 3$ , Padding: $1$ , Stride: $2$ results in an output size of $3 \times 3$ , since $\left\lfloor \frac{5 + 2 - 3}{2} + 1 \right\rfloor = 3$ .
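
The output-size formula is easy to check numerically. A minimal sketch, where the function name `conv_output_size` is hypothetical and the input is assumed square:

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output: floor((n + 2P - f) / S + 1)."""
    return (n + 2 * p - f) // s + 1


no_padding_strided = conv_output_size(n=7, f=3, p=0, s=2)  # floor(4 / 2 + 1) = 3
same_padding = conv_output_size(n=5, f=3, p=1, s=1)        # 5, same as the input
```

Integer division implements the floor, so a filter position that would run past the border is simply dropped.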

Pooling Layer (POOL)

Downsampling operation that reduces the spatial dimensions of a feature map.
Reduces the number of parameters for large images while retaining the valuable information.
Max pooling
Average pooling
Sum pooling
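
A minimal sketch of the three pooling variants on a single 2D feature map. The function name `pool2d` and the 2x2 window with stride 2 are assumptions chosen for illustration.

```python
def pool2d(matrix, size=2, stride=2, mode="max"):
    """Slide a size x size window over the matrix and reduce each window."""
    reducers = {
        "max": max,
        "sum": sum,
        "average": lambda window: sum(window) / len(window),
    }
    reduce_window = reducers[mode]
    rows = (len(matrix) - size) // stride + 1
    cols = (len(matrix[0]) - size) // stride + 1
    return [
        [
            reduce_window([
                matrix[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ])
            for j in range(cols)
        ]
        for i in range(rows)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
max_pooled = pool2d(feature_map, mode="max")          # [[6, 8], [3, 4]]
avg_pooled = pool2d(feature_map, mode="average")      # [[3.75, 5.25], [2.0, 2.0]]
sum_pooled = pool2d(feature_map, mode="sum")          # [[15, 21], [8, 8]]
```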

Fully Connected Layer (FC)

A traditional Multi-layer Perceptron (MLP) layer
For multi-class classification, usually Softmax activation is used.
Softmax ensures the outputs are probabilities that sum to 1.
The output of the CONV and POOL layers represents high-level features of the input image.
The FC layer takes these features to classify the input image into the desired output classes.

CNN 004

March 26, 2026 · 6 min read

Gracefullight

Owner

Logistic Regression as Neural Network

If $y = 1$
- $L = -\log(\hat{y})$
- if $\hat{y} \to 1$ , then $L \to 0$ (low loss)
- if $\hat{y} \to 0$ , then $L \to \infty$ (high loss)
If $y = 0$
- $L = -\log(1 - \hat{y})$
- if $\hat{y} \to 0$ , then $L \to 0$ (low loss)
- if $\hat{y} \to 1$ , then $L \to \infty$ (high loss)
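
Both cases combine into the single expression $L = -(y \log \hat{y} + (1 - y) \log(1 - \hat{y}))$. A minimal sketch, where the function name and the clamping constant `eps` (used only to avoid `log(0)`) are assumptions:

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one sample with true label y (0 or 1) and prediction y_hat."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # keep log() finite
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


confident_and_right = binary_cross_entropy(1, 0.99)  # small loss
confident_and_wrong = binary_cross_entropy(1, 0.01)  # large loss
```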

Gradient Descent

It is an iterative approach for error correction in a machine learning model
Find $w$ and $b$ that minimize the cost $GD(w, b)$ (requires a Loss/Cost function)

1. Initialize $w$ and $b$
2. Perform the forward pass operation/calculations
3. Compute the Loss/Cost function $L(a, y)$
4. Compute the change in $w$ and $b$ (take the partial derivatives of the cost function with respect to the weights and bias: $dw$ and $db$ )
5. Update $w$ and $b$ ( $w := w - \alpha dw$ and $b := b - \alpha db$ )
6. Repeat from Step 2 with the new values of $w$ and $b$ for 'n' iterations.

$\alpha$ is the learning rate (hyperparameter) that controls how much we are adjusting the weights and bias of our model with respect to the loss gradient. It is a small positive value (e.g., 0.01, 0.001) that determines the step size at each iteration while moving toward a minimum of the loss function.
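
The loop can be sketched for logistic regression with a single input feature. This is a minimal sketch on made-up toy data; the function names and default hyperparameters are assumptions. It uses the fact that, for a sigmoid output with the cross-entropy loss, the gradient with respect to the pre-activation is $a - y$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(a, y, eps=1e-12):
    a = min(max(a, eps), 1.0 - eps)  # keep log() finite
    return -(y * math.log(a) + (1 - y) * math.log(1.0 - a))


def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    w, b = 0.0, 0.0                                   # initialize w and b
    m = len(xs)
    costs = []
    for _ in range(iterations):                       # repeat for n iterations
        a = [sigmoid(w * x + b) for x in xs]          # forward pass
        costs.append(sum(loss(ai, y) for ai, y in zip(a, ys)) / m)  # cost
        dw = sum((ai - y) * x for ai, x, y in zip(a, xs, ys)) / m   # gradients
        db = sum(ai - y for ai, y in zip(a, ys)) / m
        w -= alpha * dw                               # update
        b -= alpha * db
    return w, b, costs


xs = [-2.0, -1.0, 1.0, 2.0]  # toy inputs (assumption)
ys = [0, 0, 1, 1]            # toy labels (assumption)
w, b, costs = gradient_descent(xs, ys)
```

The recorded `costs` should shrink from one iteration to the next when the learning rate is small enough.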

Gradient Descent Types

Batch Gradient Descent (BGD)
Stochastic Gradient Descent (SGD)
Mini-batch Gradient Descent (MBGD)

Batch Gradient Descent (BGD)

Process each input sample and find the cost
Find the average cost over all input samples
Update $w$ and $b$ and repeat the steps for "n" epochs (iterations)

Disadvantages:
- It uses the complete dataset to calculate the gradients at every step
- Slow when training data is large
- Difficult to find the learning rate
- Difficult to ascertain the number of epochs(iterations)

Stochastic Gradient Descent (SGD)

Due to the random nature, the algorithm is much less regular than BGD.

Process a random input sample and find the cost.
Update $w$ and $b$ , and repeat the steps for "n" iterations on the training samples.

Advantages:
- Computes gradient based on single input sample, which is memory efficient.
- Much faster compared to BGD.
- Possible to train on large datasets.
- Randomness is helpful to escape local minima.
Disadvantages:
- Might not reach the optimal value, but gets very close to it. Mitigations:
  - Simulated annealing: Reduce the learning rate gradually
  - Create a learning schedule to determine the learning rate at each iteration.

Mini-batch Gradient Descent (MBGD)

Divide the training set into mini-batches of a fixed size (e.g., 64, 128, 256).
Process all the samples in a mini-batch and find the average cost
Update $w$ and $b$ , and repeat the steps for "n" iterations/epochs on the training samples.

Advantages:
- Computes gradient based on small sets of input samples
- Much faster compared to BGD.
- Possible to train on large datasets.
- Performance boost on matrix operations using GPUs.
- Might not reach the optimal value, but gets very close to it and possibly closer than SGD.
Disadvantages:
- It may be harder to escape the local minima compared to SGD.
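
The three variants differ only in how many samples feed each update. A minimal sketch of a batch iterator, where the name `iterate_minibatches` is hypothetical: a batch size equal to the dataset size corresponds to BGD, a batch size of 1 to SGD, and anything in between to MBGD.

```python
import random


def iterate_minibatches(samples, batch_size, rng=random):
    """Yield shuffled batches; the last one may be smaller."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]


samples = list(range(10))
rng = random.Random(0)
mini_batches = list(iterate_minibatches(samples, 4, rng))   # MBGD: sizes 4, 4, 2
full_batch = list(iterate_minibatches(samples, 10, rng))    # BGD: one batch
single_samples = list(iterate_minibatches(samples, 1, rng)) # SGD: ten batches
```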

Exponentially Weighted Averages

One of the popular algorithms for smoothing sequential data (time series data), also known as a moving average.
Each new value is a weighted combination of the previous average and the current observation, so older observations count exponentially less.

$$
\begin{aligned}
V_0 &= 0 \\
V_1 &= 0.9 \cdot V_0 + 0.1 \cdot \theta_1 \\
V_2 &= 0.9 \cdot V_1 + 0.1 \cdot \theta_2 \\
V_3 &= 0.9 \cdot V_2 + 0.1 \cdot \theta_3 \\
&\vdots \\
V_t &= 0.9 \cdot V_{t-1} + 0.1 \cdot \theta_t \\
V_t &= \beta \cdot V_{t-1} + (1 - \beta) \cdot \theta_t
\end{aligned}
$$

$V_t$ is approximately the average over the last $\frac{1}{1 - \beta}$ time steps.

For $\beta = 0.9$ , $V_t$ is roughly the average over the last 10 time steps.
For $\beta = 0.98$ , $V_t$ is roughly the average over the last 50 time steps.
For $\beta = 0.5$ , $V_t$ is roughly the average over the last 2 time steps.
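
The recurrence translates directly into a loop. A minimal sketch, where the function name and the sample series are assumptions:

```python
def exponentially_weighted_average(thetas, beta=0.9):
    """Return V_1..V_t for V_t = beta * V_{t-1} + (1 - beta) * theta_t, V_0 = 0."""
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1.0 - beta) * theta
        averages.append(v)
    return averages


series = [10.0] * 50  # constant toy series (assumption)
smoothed = exponentially_weighted_average(series, beta=0.9)
```

Because $V_0 = 0$, the first values underestimate the series (the first one is about 1.0 here), and the average only approaches 10 after a few dozen steps.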

Optimizers

SGD with Momentum

At iteration $t$ :

Calculate $dw$ and $db$ on the current mini-batch (Hyper parameters: $\alpha$ and $\beta$ )
Update the velocity:
- $V_{dw} = \beta V_{dw} + (1 - \beta) dw \rightarrow V_t = \beta V_{t-1} + (1 - \beta) \theta_t$
- $V_{db} = \beta V_{db} + (1 - \beta) db$
Update parameters:
- $w := w - \alpha V_{dw}$
- $b := b - \alpha V_{db}$
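
A minimal sketch of the update for a single scalar parameter. The function name `momentum_step`, the toy objective $f(w) = w^2$, and the hyperparameter values are assumptions for illustration.

```python
def momentum_step(w, dw, v_dw, alpha=0.1, beta=0.9):
    """One SGD-with-momentum update for a scalar parameter."""
    v_dw = beta * v_dw + (1.0 - beta) * dw  # update the velocity
    w = w - alpha * v_dw                    # update the parameter
    return w, v_dw


# Toy problem (assumption): minimize f(w) = w^2, whose gradient is dw = 2w.
w, v_dw = 5.0, 0.0
for _ in range(300):
    w, v_dw = momentum_step(w, 2.0 * w, v_dw)
```

The same step applies to $b$ with its own velocity $V_{db}$.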

RMSProp

Root Mean Square Propagation.
An unpublished adaptive learning-rate method proposed by Geoffrey Hinton.
Reduces oscillation but in a different way than Momentum.
Divides the learning rate by an exponentially decaying average of squared gradients.
Calculate $dw$ and $db$ on the current mini-batch
- $S_{dw} = \beta S_{dw} + (1 - \beta) dw^2$
- $S_{db} = \beta S_{db} + (1 - \beta) db^2$
Update parameters:
- $w := w - \alpha \frac{dw}{\sqrt{S_{dw}} + \epsilon}$
- $b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon}$
- $\epsilon$ is a small number to prevent division by zero (e.g., $10^{-8} \text{ to } 10^{-10}$ )
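
A minimal sketch of one RMSProp update for a scalar parameter. The function name `rmsprop_step` and the example values are assumptions.

```python
import math


def rmsprop_step(w, dw, s_dw, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter."""
    s_dw = beta * s_dw + (1.0 - beta) * dw ** 2     # decaying average of dw^2
    w = w - alpha * dw / (math.sqrt(s_dw) + eps)    # scaled step
    return w, s_dw


w, s_dw = rmsprop_step(w=1.0, dw=2.0, s_dw=0.0, alpha=0.1)
```

Dividing by $\sqrt{S_{dw}}$ shrinks the step for parameters whose gradients have been large and enlarges it for parameters whose gradients have been small.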

Adam

Adaptive Moment Estimation
Combination of RMSProp and Momentum
Works well for a wide range of non-convex optimization problems in machine learning.
Calculate $dw$ and $db$ on the current mini-batch
- $V_{dw} = \beta_1 V_{dw} + (1 - \beta_1) dw$ (Momentum, $\beta_1$ )
- $V_{db} = \beta_1 V_{db} + (1 - \beta_1) db$
- $S_{dw} = \beta_2 S_{dw} + (1 - \beta_2) dw^2$ (RMSProp, $\beta_2$ )
- $S_{db} = \beta_2 S_{db} + (1 - \beta_2) db^2$
Update parameters:
- $w := w - \alpha \frac{V_{dw}}{\sqrt{S_{dw}} + \epsilon}$
- $b := b - \alpha \frac{V_{db}}{\sqrt{S_{db}} + \epsilon}$
- $\epsilon$ is a small number to prevent division by zero (e.g., $10^{-8} \text{ to } 10^{-10}$ )
Hyper parameter guide:
- $\alpha = 0.001$
- $\beta_1 = 0.9$ : Momentum term
- $\beta_2 = 0.999$ : Moving weighted average
- $\epsilon = 10^{-8}$ : To prevent division by zero
ensmallen.org
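
A minimal sketch of one Adam update for a scalar parameter, following the formulas in these notes. The function name `adam_step` is hypothetical. Note that the published Adam algorithm additionally divides $V$ and $S$ by $1 - \beta_1^t$ and $1 - \beta_2^t$ (bias correction); that step is omitted here because the notes omit it.

```python
import math


def adam_step(w, dw, v_dw, s_dw, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update without bias correction (as written in the notes)."""
    v_dw = beta1 * v_dw + (1.0 - beta1) * dw        # momentum term
    s_dw = beta2 * s_dw + (1.0 - beta2) * dw ** 2   # RMSProp term
    w = w - alpha * v_dw / (math.sqrt(s_dw) + eps)
    return w, v_dw, s_dw


w, v_dw, s_dw = adam_step(w=1.0, dw=2.0, v_dw=0.0, s_dw=0.0)
```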

Learning Rate Decay

Speeds up the learning algorithm by slowly decreasing the learning rate $\alpha$ as the number of epochs increases.
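
The notes do not name a specific schedule. One common choice, assumed here purely for illustration, is inverse-time decay: $\alpha = \frac{\alpha_0}{1 + \text{decay rate} \cdot \text{epoch}}$. The function and parameter names are hypothetical.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Inverse-time decay: the learning rate shrinks as the epoch number grows."""
    return alpha0 / (1.0 + decay_rate * epoch)


schedule = [decayed_learning_rate(0.1, 1.0, epoch) for epoch in range(4)]
# roughly [0.1, 0.05, 0.033, 0.025]
```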

Activation Functions

An activation function takes the output of a layer in a neural network and applies a non-linear function to it.
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Used for binary classification in the output layer.
Tanh: $\tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
ReLU: $A(x) = \max(0, x)$
- Rectified Linear Unit
- Mitigates the vanishing gradient problem
- Best used in hidden layers
- Computationally less expensive than sigmoid and tanh
Softmax: $S(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
- Turns numbers into probabilities that sum to 1.
- Used for multi-class classification in the output layer.
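
A minimal sketch of the four functions for scalar inputs (and a list for softmax). Subtracting the maximum inside `softmax` is a standard numerical-stability trick that does not change the result; the example inputs are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def relu(x):
    return max(0.0, x)


def softmax(xs):
    shift = max(xs)  # avoids overflow in exp(); cancels out in the ratio
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])  # largest input gets the largest share
```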

CNN 003

March 3, 2026 · 3 min read

Gracefullight

Owner

Image Gradient

An image gradient is a directional change in the intensity or color of an image.
It can be used to extract valuable information from images.
It is commonly used in edge detection.
➡️ Change in the X direction, ⬇️ change in the Y direction.
Combining the X and Y directions shows whether the intensity changes in both directions.

HoG, Histogram of Oriented Gradient

To find edge and shape of the object in the image

Computing Image Gradient
- Use the horizontal and vertical filters to compute gradient values
Compute the strength/magnitude and direction of gradient
- Strength/Magnitude(g): $\sqrt{g_x^2 + g_y^2}$
- Direction( $\theta$ ): $\tan^{-1}(g_y / g_x)$
Create orientation histogram
- Divide the image into small connected regions called cells, each an 8x8 pixel patch
- Create cell histogram based on gradient direction and magnitude
- 64 (8x8) gradient vectors are put into a 9-bin histogram.
- The bins are the gradient directions ( $\theta$ ) quantized into 9-bins
Block Normalization
- 16x16 pixel blocks (2x2 cells) are used for normalization; each block contains 4 histograms.
- Normalization makes the descriptor invariant to scaling (multiplication) of the pixel intensities
- Each block is represented by a 36x1 element vector (4 histograms x 9 bins)
Intensity: brightness of the pixel
Saturation: HSV color space, the amount of gray in the color
Calculate the HoG feature vector
- The 36x1 vectors of all blocks are concatenated into one big vector
- The size of the vector will be 36xN, where N is the number of blocks in the image
HoG feature extractor
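
A minimal sketch of the first three steps for one cell, assuming a grayscale image as a nested list, the simple `[-1, 0, 1]` horizontal and vertical filters, and unsigned directions in $[0°, 180°)$. Each gradient votes with its full magnitude into a single bin; standard HoG implementations typically split the vote between the two nearest bins. All function names are hypothetical.

```python
import math


def gradient_at(image, r, c):
    """Magnitude and unsigned direction (degrees) of the gradient at (r, c)."""
    gx = image[r][c + 1] - image[r][c - 1]  # horizontal filter [-1, 0, 1]
    gy = image[r + 1][c] - image[r - 1][c]  # vertical filter [-1, 0, 1]^T
    magnitude = math.hypot(gx, gy)          # sqrt(gx^2 + gy^2)
    direction = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, direction


def cell_histogram(image, bins=9):
    """Magnitude-weighted histogram of directions over the interior pixels."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            magnitude, direction = gradient_at(image, r, c)
            hist[int(direction // bin_width) % bins] += magnitude
    return hist


# Toy patch with a vertical edge (assumption): all change is along X.
patch = [[0, 0, 255, 255] for _ in range(4)]
histogram = cell_histogram(patch)
```

For this patch every gradient points along X, so all the weight lands in the first bin.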

LBP, Local Binary Pattern

To describe the image textures

$LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c) \cdot 2^p$

An efficient texture operator which labels each pixel of an image by thresholding its neighbours.
A powerful feature for texture classification
The LBP operator describes image textures using two measures: local spatial patterns and gray-scale contrast.
$s(x)$ is a thresholding function: $s(x) = 1$ if $x \ge 0$ , otherwise $0$
$(x_c, y_c)$ is the center pixel in the 8 pixel neighbourhood
$g_c$ is the gray level of the center pixel
$g_p$ is the gray value of a sampling point in an equally spaced circular neighbourhood of P sampling points and radius R around the point $(x_c, y_c)$
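
A minimal sketch of the formula for $P = 8$, $R = 1$ on a 3x3 patch, using the square 8-pixel neighbourhood instead of interpolated circular sampling. The function name `lbp_code`, the clockwise order starting at the top-left neighbour, and the sample patch are assumptions; the starting point and order are only a convention.

```python
def s(x):
    """Thresholding function: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def lbp_code(patch):
    """LBP code of the center pixel of a 3x3 patch."""
    g_c = patch[1][1]
    # Neighbours g_0..g_7, clockwise from the top-left corner (assumed order).
    neighbours = [
        patch[0][0], patch[0][1], patch[0][2], patch[1][2],
        patch[2][2], patch[2][1], patch[2][0], patch[1][0],
    ]
    return sum(s(g_p - g_c) * 2 ** p for p, g_p in enumerate(neighbours))


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch)  # bits 0, 4, 5, 6, 7 are set: 1 + 16 + 32 + 64 + 128 = 241
```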

Figure: sample pixel neighbourhood, difference result, thresholding result, and the resulting LBP code.

ANN

$L(a, y) = -(y \log a + (1-y) \log (1-a))$
$GD(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)})$

Sequence

Sequence modelling types

RNN

Sequence Modelling

Attention

Transformer

Vision Image Transformer (ViT)

Steps to use ViT

CNNs vs Vision Transformer (ViT)

RF-DETR

Diffusion Models

Steps to train a diffusion model

Applications of Diffusion Models

Drawbacks of Anchor-based detectors

Anchor-free detectors

Key-point based

Center-based

YOLO

YOLOX

Decoupled head

Data Augmentation

Yolo 26

Why is it faster

Key Changes

Inference pipeline

Training pipeline

Instance Segmentation

Application of Instance Segmentation

Mask R-CNN

Limitations of Mask R-CNN

Semantic Segmentation

References

Predicting Bounding Boxes

Non Maxima Suppression (NMS)

Anchor Boxes

YOLO

SSD, Single Shot Detector

Overview of Object Detection

Datasets

PASCAL Visual Object Classes (VOC)

ImageNet

COCO

Intersection over Union (IoU)

AP

Taxonomy of Object Detection

History of Object Detection

Classification with Localization

Localization Loss

Detection as a Classification Problem

Region Proposal

R-CNN

Fast R-CNN

Faster R-CNN

R-CNN Family Comparison

Image Annotation for Object Detection

Transfer Learning

Transfer Learning Benefits

VGG-16

AlexNet

GoogleNet

Inception Module

1X1 Convolution

Inception V2 and V3

ResNet

Data Preparation

Bias and Variance

Accuracy

Regularization

L2 Regularization

L1 Regularization

Dropout

Data Augmentation

Cutout

Mixup

CutMix

Random Augmentation

Generative Adversarial Networks (GANs)

Neural Style Transfer

Computer Vision

Convolutional Neural Networks (CNN)
