CNN 012

May 19, 2026 · 6 min read

Gracefullight

Owner

Three main layers of a CNN
- CONV: Convolution Layer
- POOL: Pooling Layer
- FC: Fully Connected Layer
- CONV extracts features, POOL downsamples feature maps, and FC makes the final prediction.
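
To make the POOL step concrete, here is a minimal sketch of 2×2 max pooling with stride 2 in plain Python. The function name `max_pool_2x2` and the sample feature map are illustrative assumptions, not part of any library.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2D feature map with a 2x2 window and stride 2."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            # Keep only the strongest activation inside each 2x2 window.
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map produced by a CONV layer.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
print(max_pool_2x2(feature_map))  # [[6, 2], [2, 9]]
```

The 4×4 map becomes 2×2, which is the downsampling that POOL provides.
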
Why CNNs are used over ANNs for image processing
- Computationally efficient: far fewer parameters than a fully connected network
- Filters capture local spatial features
- Weights are shared across the image
Overfitting
- The model essentially memorizes the training data, leading to poor performance on unseen data
- To prevent overfitting, we can use techniques like:
  - Dropout: Randomly dropping out neurons during training to prevent co-adaptation
  - Batch Normalization: Normalizing the inputs of each layer to stabilize learning
  - L1/L2 Regularization: Adding a penalty to the loss function to discourage large weights
    - L1 regularization adds a penalty based on the absolute value of the weights. It can drive some weights to exactly zero, which makes the model sparse and is useful for feature selection.
    - L2 regularization adds a penalty based on the squared value of the weights. It shrinks weights toward zero but rarely to exactly zero, which reduces model complexity and overfitting.
  - Data Augmentation: Creating new training samples by applying transformations to existing data
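
As a rough illustration of the L1/L2 penalties, the sketch below adds each penalty to a data loss. The weights, the strength `lam`, and the `data_loss` value are made-up numbers chosen only for the example.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam * sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam * sum of squared weight values."""
    return lam * sum(w * w for w in weights)


# Hypothetical values, for illustration only.
weights = [0.5, -2.0, 0.0, 1.5]
lam = 0.1
data_loss = 0.8

total_with_l1 = data_loss + l1_penalty(weights, lam)
total_with_l2 = data_loss + l2_penalty(weights, lam)
print(total_with_l1, total_with_l2)
```

The large weight (-2.0) contributes far more to the L2 penalty than to the L1 penalty, which is why L2 mainly discourages large weights.
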
ReLU
- If the input is zero or negative, ReLU outputs 0.
- If the input is positive, it outputs the input value itself.
- max(0, x)
- ReLU can output any number from 0 to infinity, which allows it to capture a wide range of features in the data.
- It mitigates the vanishing gradient problem: its gradient is 1 for positive inputs, so gradients are not squashed toward zero as they are by saturating activation functions like sigmoid or tanh.
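Sigmoid: maps a score to a probability between 0 and 1; used for binary classification.
Softmax: turns a vector of scores into probabilities that sum to 1; used for multi-class classification.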
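
A minimal sketch of both output activations in plain Python. Subtracting the maximum logit inside `softmax` is a common numerical-stability trick and an addition of this sketch; the sample logits are arbitrary.

```python
import math


def sigmoid(z):
    """Squash a single score into (0, 1) for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn a list of scores into class probabilities that sum to 1."""
    largest = max(logits)
    # Shifting by the max does not change the result but avoids overflow.
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))              # 0.5
print(softmax([2.0, 1.0, 0.1]))  # three probabilities, largest first
```
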
Backpropagation: sends the error backward through the network and calculates gradients, so the model knows how to update its weights and biases.
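
To show what "sending the error backward" means, here is a small sketch for a single sigmoid neuron with a squared-error loss. The neuron, the loss, and the sample values are assumptions chosen to keep the chain rule readable; real networks repeat the same idea layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of one sigmoid neuron: 0.5 * (a - y)^2."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2


def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw (and likewise for b)."""
    a = sigmoid(w * x + b)
    dloss_da = a - y          # derivative of the loss w.r.t. the activation
    da_dz = a * (1.0 - a)     # derivative of sigmoid
    grad_w = dloss_da * da_dz * x
    grad_b = dloss_da * da_dz
    return grad_w, grad_b


# Hypothetical weight, bias, input, and target.
w, b, x, y = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = gradients(w, b, x, y)
print(grad_w, grad_b)
```

Both gradients are negative here, so increasing `w` and `b` would reduce the loss for this sample.
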
Gradient Descent
- It uses Backpropagation to calculate the exact slope (the gradient) of the error (loss).
- Then it takes a step in the opposite direction of the gradient to minimize the error.
- It repeats this iterative process until it converges, ideally at a (local) minimum.
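
The loop above can be sketched on a one-parameter toy loss, `(w - 3)^2`, whose minimum is at `w = 3`. The loss, the starting point, and the step count are illustrative assumptions; the gradient is written by hand here instead of coming from backpropagation.

```python
def gradient_descent(grad_fn, w, alpha=0.1, steps=100):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        w = w - alpha * grad_fn(w)
    return w


def grad(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_final = gradient_descent(grad, w=0.0)
print(w_final)  # very close to 3.0
```
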
Vanishing Gradient Problem
- The gradient becomes too small, so earlier layers learn very slowly or almost stop learning.
- ReLU helps mitigate this problem by allowing gradients to flow through the network without being squashed to zero.
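
A rough way to see the effect: backpropagation multiplies one local derivative per layer. The sketch below multiplies only the activation derivatives across 10 layers and ignores the weights, which is a simplifying assumption. The sigmoid derivative is at most 0.25, while the ReLU derivative is 1 for positive inputs.

```python
import math


def sigmoid_grad(z):
    """Derivative of sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def chained_gradient(local_grads):
    """Product of per-layer derivatives, as in the chain rule."""
    product = 1.0
    for g in local_grads:
        product *= g
    return product


depth = 10
sigmoid_chain = chained_gradient([sigmoid_grad(0.0)] * depth)  # best case for sigmoid
relu_chain = chained_gradient([1.0] * depth)                   # ReLU on positive inputs
print(sigmoid_chain, relu_chain)
```

Even in the best case, the sigmoid chain shrinks below one millionth after 10 layers, while the ReLU chain stays at 1.
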
Learning Rate $\alpha$
- It's a hyperparameter that controls how big of a step the model takes down the slope.
- If $\alpha$ is too small, the model will take tiny steps and may take a long time to converge.
- If $\alpha$ is too large, the model may overshoot the minimum and diverge.
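
The three cases can be sketched on the toy loss `w^2` (gradient `2w`, minimum at 0). The specific values 0.001, 0.1, and 1.1 are assumptions picked to show each behavior on this loss; suitable values depend on the problem.

```python
def run(alpha, steps=20, w=1.0):
    """Run gradient descent on the toy loss w^2 and return the final w."""
    for _ in range(steps):
        w -= alpha * 2.0 * w
    return w


too_small = run(0.001)   # barely moves away from the start
reasonable = run(0.1)    # gets close to the minimum at 0
too_large = run(1.1)     # overshoots more on every step and diverges
print(too_small, reasonable, too_large)
```
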
Precision: $\frac{TP}{TP + FP}$
- Of all the patients the model predicted as having the disease, how many actually have the disease?
Recall: $\frac{TP}{TP + FN}$
- Of all the patients who actually have the disease, how many did the model successfully catch?
- Recall is more important in medical diagnosis because we want to minimize false negatives (missing a disease).
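
A small sketch of both formulas. The counts (80 true positives, 20 false positives, 40 false negatives) are hypothetical and only illustrate the arithmetic.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical screening result.
tp, fp, fn = 80, 20, 40
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # about 0.667
```

Here the model is right about 80% of the patients it flags, but it still misses a third of the patients who have the disease.
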
Sliding Window
- Computationally expensive because it requires multiple passes over the image with different window sizes and strides.
- Multiple scales are needed to detect objects of varying sizes, which further increases the computational cost.
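
To get a feel for the cost, the sketch below counts how many windows a classifier would have to evaluate. The 256×256 image, the window sizes, and the stride of 16 are assumptions chosen for illustration.

```python
def count_windows(image_h, image_w, window, stride):
    """Number of square window positions that fit inside the image."""
    if window > image_h or window > image_w:
        return 0
    rows = (image_h - window) // stride + 1
    cols = (image_w - window) // stride + 1
    return rows * cols


# Hypothetical setup: three scales on a 256x256 image.
window_sizes = [32, 64, 128]
total = sum(count_windows(256, 256, size, stride=16) for size in window_sizes)
print(total)  # 475 classifier evaluations for a single image
```
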
Stride: controls the step size of the sliding filter. Larger stride means smaller output.
Edge
- The points or pixels in an image where brightness or intensities change sharply.
- Sobel filter
- Prewitt filter
- Canny edge detector
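
As an illustration, the horizontal Sobel kernel can be slid over a tiny image in plain Python. The helper `correlate2d_valid` is a hypothetical name for this sketch; it applies the kernel without padding and without flipping it, which is how CNN frameworks usually apply filters.

```python
# Horizontal Sobel kernel: responds to left-to-right brightness changes.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Dark left half, bright right half: a vertical edge in the middle.
image = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
print(correlate2d_valid(image, SOBEL_X))  # [[40, 40], [40, 40]]
```

A flat image with no brightness change would give all zeros instead.
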
Padding: adds zeros around the image so the CONV does not shrink the feature map too much.
To keep the output dimension the same as the input dimension (with stride 1 and an odd filter size $F$), we can use padding $P$:
- $P = \frac{F - 1}{2}$
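
Stride and padding together determine the output size. The sketch below uses the standard formula `floor((N + 2P - F) / S) + 1`, which the notes above do not state explicitly; the function names are illustrative.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height for input size n, filter f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the size unchanged for stride 1 and odd f."""
    if f % 2 == 0:
        raise ValueError("this formula assumes an odd filter size")
    return (f - 1) // 2


print(conv_output_size(32, 5))                     # 28: no padding shrinks the map
print(conv_output_size(32, 3, p=same_padding(3)))  # 32: size preserved
print(conv_output_size(32, 3, p=1, s=2))           # 16: larger stride, smaller output
```
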
Image Classification: Assigning a label to an entire image (e.g., cat, dog, car).
Object Detection: Identifying and localizing multiple objects within an image, typically with bounding boxes.
Instance Segmentation: Identifying and segmenting each object instance in an image with a pixel-level mask.
Momentum: uses an exponentially weighted average of past gradients to smooth updates and accelerate convergence.
RMSProp: uses an exponentially weighted average of squared gradients to adapt the learning rate for each parameter.
Adam: combines Momentum and RMSProp by using both the first moment (the average gradient) and the second moment (the average squared gradient).
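
A minimal single-parameter sketch of the Adam update, with comments marking the Momentum part and the RMSProp part. The hyperparameter values are commonly used defaults apart from `alpha`, and the toy loss `(w - 3)^2` is an assumption for illustration.

```python
import math


def adam_minimize(grad_fn, w, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    m = 0.0  # first moment: running average of gradients (Momentum part)
    v = 0.0  # second moment: running average of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction, because m and v start at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss (w - 3)^2 with gradient 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # close to 3.0
```
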
Hyperparameters: learning rate, batch size, number of epochs, optimizer type, dropout rate, etc.
Supervised Learning: The model learns from labeled data (e.g., classification, regression).
Unsupervised Learning: The model learns from unlabeled data (e.g., clustering).
Loss/Cost function: a measure of how far the model's predictions are from the actual target/answer.
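
Two common examples, sketched in plain Python: mean squared error for regression and binary cross-entropy for binary classification. The notes do not name specific losses, so these choices and the sample values are assumptions.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```
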
AI is a broad concept of machines performing human-like tasks.
ML is a subset of AI that learns from data.
DL is a subset of ML that uses deep neural networks with many layers.
Major problems in ML
- insufficient data
- non-representative training data
- poor-quality data
- irrelevant features
- overfitting
- underfitting
When do we use ML?
- When a large amount of data is available for finding patterns and making predictions
- When the problem has too many rules or too much complexity for humans to handle
Faster R-CNN: Propose regions first, then classify them
- RPN: Region Proposal Network, which generates candidate object proposals
YOLO: Predict boxes and class probabilities directly from the image in one pass
- Anchor boxes: predefined bounding boxes of different sizes and aspect ratios that act as reference shapes when predicting object locations (used by YOLOv2 and several later versions).
NMS: Non-Maximum Suppression, keeps the highest-confidence bounding box and removes the other boxes that overlap it heavily (high IoU), so each object is reported once.
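
A minimal greedy NMS sketch, assuming boxes are `(x1, y1, x2, y2)` tuples and an overlap threshold of 0.5. Both the box format and the threshold are assumptions; detection libraries ship their own optimized versions.

```python
def iou(a, b):
    """Overlap ratio of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept after greedy suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest remaining confidence
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical detections: two boxes on the same object, one elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```
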
1×1 convolution mixes channel information and can reduce the number of channels, so later convolutions become cheaper.
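
The saving can be estimated by counting multiplications. The sketch below compares a direct 5×5 convolution with a 1×1 reduction followed by the same 5×5 convolution. The shapes (28×28×192 input, 16 intermediate channels, 32 output channels) are illustrative assumptions, and the spatial size is assumed to stay 28×28.

```python
def conv_multiplications(out_h, out_w, out_channels, kernel_size, in_channels):
    """Multiplications for one conv layer: one k*k*in_channels dot product per output value."""
    return out_h * out_w * out_channels * kernel_size * kernel_size * in_channels


# Direct 5x5 convolution: 192 -> 32 channels.
direct = conv_multiplications(28, 28, 32, 5, 192)

# 1x1 convolution first (192 -> 16), then the 5x5 convolution (16 -> 32).
bottleneck = (conv_multiplications(28, 28, 16, 1, 192)
              + conv_multiplications(28, 28, 32, 5, 16))

print(direct)      # 120422400
print(bottleneck)  # 12443648
```

With these shapes, the 1×1 reduction cuts the cost to roughly a tenth.
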
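Inception module: applies filters of several sizes (e.g., 1×1, 3×3, 5×5) in parallel and concatenates the results, so it learns small, medium, and large visual features at the same time.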
Transfer Learning Strategies:
- First, if the new dataset is small and similar to the original dataset, we can use the pre-trained model directly.
- Second, if the dataset is similar but has different classes, we freeze the convolutional layers and train only the fully connected classification layer.
- Third, if the dataset is small but not very similar, we freeze the early convolutional layers and fine-tune the later convolutional layers plus the FC layer.
- Finally, if the dataset is large and different, we can fine-tune the whole network.
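IoU: Intersection over Union, a metric used to evaluate the accuracy of object detection models by comparing the predicted bounding box with the ground truth bounding box: the area of their intersection divided by the area of their union (1 is a perfect match, 0 is no overlap).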
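
A short sketch of the computation, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`. The box format and the sample boxes are assumptions.

```python
def intersection_over_union(pred, truth):
    """IoU of a predicted box and a ground truth box."""
    # Width and height of the overlapping rectangle (0 if the boxes do not overlap).
    inter_w = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    inter_h = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    intersection = inter_w * inter_h

    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    truth_area = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = pred_area + truth_area - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes that overlap in a 5x5 square.
print(intersection_over_union((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175
```
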