Eunkwang Shin

Owner

Full Stack JavaScript Developer | Half-time Open Sourcerer.


Mitigating Hallucinations on Object Attributes Review

August 18, 2025 · about 4 min read

Eunkwang Shin

Owner

Overview

Introduces a HoOA benchmark that isolates hallucinations on object attributes (color, shape) from existence/relationship errors.
Proposes MIAVLM: leverages multiview images (generated from a single image’s 3D representation) and a Multiview Attributes Perceiver (MAP) to make fusion order-invariant.
Adds negative instructions during tuning to counter LVLMs’ tendency to answer "Yes".
Results: best HoOA metric (0.775 / 0.787) with fastest inference (0.071 / 0.105 s). "9in1" tiling is ineffective; separate multiview inputs help.
Training: LM loss, Adam (lr=0.001), cosine annealing, 20 epochs, single NVIDIA 3090.

Hallucinations on Object Attributes (HoOA)

Issues

HoOA = incorrect attribute descriptions for existing objects (distinct from HoOE/HoOR).
Root causes analyzed:
- Single-view insufficiency: fine-grained details can be invisible from a single viewpoint.
- Instruction bias: overexposure to positive/affirmative patterns → "Yes" bias.
- Order sensitivity: multi-image inputs change predictions when view order changes.

Mitigation Methods (this paper)

Multiview prompts: sample views from a single image’s 3D reconstruction to recover missed details.
MAP (order-invariant fusion): learn view weights and fuse per-view features via weighted sum; input order has no effect; supports any number of views.
Negative instructions: incorporate "No"-answerable questions in tuning to suppress "Yes" bias.

Benchmark (HoOA)

Construction

Based on CelebAText-HQ; manual attribute descriptions rewritten into Yes/No questions.
- Positive questions → correct answer "Yes".
- Negative questions → attribute flipped/opposite → correct answer "No" (to expose "Yes" bias).
Scale: 1,430 images, 14,291 positive + 14,291 negative questions.
Split: 9:1 train:test.
Metric: average of accuracy on positive and negative questions (balanced HoOA score).
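The balanced metric can be sketched in a few lines. This is a minimal illustration of "average of accuracy on positive and negative questions"; the function name and the question counts are hypothetical and not taken from the paper's code.

```python
def balanced_hooa_score(pos_correct, pos_total, neg_correct, neg_total):
    """Average of accuracy on positive ("Yes") and negative ("No") questions."""
    pos_acc = pos_correct / pos_total
    neg_acc = neg_correct / neg_total
    return (pos_acc + neg_acc) / 2


# Hypothetical counts: a model that always answers "Yes" gets every positive
# question right and every negative question wrong.
always_yes = balanced_hooa_score(100, 100, 0, 100)
print(always_yes)  # 0.5
```

Because the two accuracies are averaged, a model with a strong "Yes" bias cannot score well by answering affirmatively to everything.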

Model: MIAVLM

Visual Extractor (VE)

6 stacked Transformer decoder blocks.
Soft prompts $P \in \mathbb{R}^{l \times d}$ are queries; image embeddings $e_i$ are keys/values.
Per-view cross-attention computed in parallel (no autoregressive chaining; no assumed order).
Per-view output: $o_i = \mathrm{softmax}\!\left(\frac{(P W_Q)(e_i W_K)^\top}{\sqrt{d}}\right) e_i W_V,\quad O_{VE}=\{o_1,\dots,o_n\}.$
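A minimal NumPy sketch of the per-view formula above, assuming a single attention head and a single layer (the notes describe 6 stacked blocks). All shapes, the random weights, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def per_view_cross_attention(P, views, W_Q, W_K, W_V):
    """P: (l, d) soft prompts used as queries; views: list of (t, d) image embeddings."""
    d = P.shape[1]
    Q = P @ W_Q
    outputs = []
    for e_i in views:
        # Each view is attended to independently, so no view depends on another.
        scores = Q @ (e_i @ W_K).T / np.sqrt(d)          # (l, t)
        outputs.append(softmax(scores) @ (e_i @ W_V))    # (l, d)
    return outputs


rng = np.random.default_rng(0)
l, t, d, n = 4, 6, 8, 3  # hypothetical sizes
P = rng.normal(size=(l, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
views = [rng.normal(size=(t, d)) for _ in range(n)]
O_VE = per_view_cross_attention(P, views, W_Q, W_K, W_V)
print(len(O_VE), O_VE[0].shape)
```

Since every view is processed on its own, reordering the input views only reorders the outputs; it does not change any individual output.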

Multihead Sampler (MS)

Learns view weights for fusion.
Decomposer (2-layer MLP) maps each view’s $[CLS]$ to $m=4$ tokens $\{e_i^{1},\dots,e_i^{m}\}$ .
For each token/head $j$ : compute attention scores vs. $P$ → mean over prompt tokens → $\mathrm{weights}^j \in \mathbb{R}^n$ .
Average across heads: $w_{MS} = \tfrac{1}{m}\sum_{j=1}^{m}\mathrm{weights}^j \in \mathbb{R}^n.$

MAP (Multiview Attributes Perceiver)

Order-invariant weighted fusion: $\text{Output}=\sum_{i=1}^{n} w_i\,o_i.$
Properties: supports any number of views; permutation-invariant to input order.
By learning weights for each view, MAP highlights informative perspectives and suppresses less useful ones, ensuring consistent predictions even when the view order changes. This directly addresses the input-order sensitivity observed in baselines such as OpenFlamingo.
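The order-invariance claim can be checked numerically. The sketch below assumes each weight is attached to its own view (so shuffling the views shuffles the weights with them); the random outputs and weights are placeholders rather than learned values.

```python
import numpy as np


def map_fuse(view_outputs, view_weights):
    """Weighted sum over views: sum_i w_i * o_i."""
    o = np.stack(view_outputs)           # (n, l, d)
    w = np.asarray(view_weights)         # (n,)
    return np.tensordot(w, o, axes=1)    # (l, d)


rng = np.random.default_rng(0)
outputs = [rng.normal(size=(4, 8)) for _ in range(5)]  # placeholder per-view outputs
weights = rng.random(5)
weights = weights / weights.sum()                       # placeholder view weights

perm = rng.permutation(5)
fused = map_fuse(outputs, weights)
fused_shuffled = map_fuse([outputs[i] for i in perm], weights[perm])
print(np.allclose(fused, fused_shuffled))
```

A sum does not depend on the order of its terms, which is why this kind of fusion also accepts any number of views.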

[Figure: MAP]

Benchmarks

Baselines & Input Modes

Baselines: BLIP3, OpenFlamingo (4 variants), OPERA, Idefics2, LLaVA-UHD.
Two input modes:
1. Original image only.
2. Original + 8 generated views.
  - Models that accept only one image use 9in1 tiling (nine images stitched into one).

Main Results

MIAVLM:
- HoOA metric: 0.775 / 0.787 (modes 1 / 2)
- Positive accuracy: 0.752 / 0.762
- Negative accuracy: 0.797 / 0.812
- Inference time: 0.071 / 0.105 s (fastest)
9in1 tiling did not improve results (likely harder to interpret).
Nine separate multiview images generally improved performance.

Ablations

Negative instructions: boost negative-question accuracy but slightly reduce positive-question accuracy; overall HoOA increases (approx. 0.665 → 0.787).
Input-order sensitivity:
- MIAVLM is order-invariant.
- OpenFlamingo's accuracy varies when the view order is shuffled.

Limitations & Notes

Trade-off from negative instructions (negatives ↑, positives ↓).
Effectiveness depends on the quality of generated views.

Insights

This approach seems especially suitable for perception, where multiple scene views may arrive in arbitrary order, ensuring consistent attribute recognition.

Ref

Tan, Z., Li, Y., Meng, S., Yuan, X., Li, W., Mo, T., Wang, B., & Chu, X. (2025, April 6–11). Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

FSD +004

August 18, 2025 · about 1 min read

Eunkwang Shin

Owner

match

match term:
  case pattern-1:
    action-1
  case pattern-2:
    action-2
  case pattern-3:
    action-3
  # the underscore _ case executes the default code
  case _:
    action-default

Repetition Statements

The count-controlled repetition: a fixed number of times.
The sentinel-controlled repetition: a designated value that ends the loop.
The infinite repetition: continues until externally stopped.
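The first two kinds of repetition can be sketched as follows. The function names and the sentinel value -1 are illustrative choices.

```python
def repeat_greeting(times):
    # Count-controlled: the body runs exactly `times` times.
    greetings = []
    for i in range(times):
        greetings.append(f"hello #{i + 1}")
    return greetings


def sum_until_sentinel(values, sentinel=-1):
    # Sentinel-controlled: keep reading until the designated value appears.
    # Running out of values is treated the same as seeing the sentinel.
    source = iter(values)
    total = 0
    value = next(source, sentinel)
    while value != sentinel:
        total += value
        value = next(source, sentinel)
    return total


print(repeat_greeting(3))
print(sum_until_sentinel([4, 8, 15, -1, 16]))  # 27; 16 comes after the sentinel
```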

The For Loop

for <value> in <range of values>:
  <code>

total = 0

# range(1, 20) yields 1, 2, ..., 19 (the stop value 20 is excluded)
# adds the values from 1 to 19 to total
for e in range(1, 20):
  total += e

print(f"The sum is: {total}")

Loop-And-A-Half

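n = 5
total = 0

# n is never updated, so the condition n < 10 is always true;
# the loop ends only through the break in the middle of the body
while n < 10:
  total += n

  if total > 100:
    break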
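The same pattern is often written with `while True`, which makes it explicit that the exit test sits in the middle of the body. A small sketch with a hypothetical helper that collects lines until a blank one:

```python
def read_until_blank(lines):
    collected = []
    source = iter(lines)
    while True:
        line = next(source, "")          # first half: fetch the next item
        if line == "":                   # exit test in the middle of the loop
            break
        collected.append(line.upper())   # second half: process the item
    return collected


print(read_until_blank(["a", "b", "", "c"]))  # ['A', 'B']
```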

Zotero Initial Setup

August 17, 2025 · about 1 min read

Eunkwang Shin

Owner

Sync Settings

Settings - Sync - Data Syncing

Log in and enable automatic sync.

Download the Browser Extension

Zotero Connector

Citation Settings

Settings - Cite - APA 7th

Check that the style is registered.

Export - Item Format - APA 7th

Set the format.

Install the MS Word Plugin

Settings - Cite - Word Processors

Click Install/Reinstall Microsoft Word Add-in in the Microsoft Word section.
Restart Word.

Install Plugins

Tools - Plugins

Drag and drop the downloaded plugin file.

VSCode Plugin

Not found yet.

Trustworthiness in Vision-Language Models Review

August 17, 2025 · about 6 min read

Eunkwang Shin

Owner

Overview

Reviews how VLMs can expose private data, produce harmful outputs, or be vulnerable to attacks, and how these risks can be mitigated.
SOTA models: LLaVA, Flamingo, GPT-4

Privacy

Privacy Issues

Risk escalates significantly when relevant images are involved, as optimizing in the pixel domain is easier than in text.
VLMs can unintentionally memorize sensitive data, leading to leaks that require no knowledge of the model’s specifics.
Overfitting may also cause sensitive attributes to be retained and exposed during inference.
Gradient-based and backdoor attacks further jeopardize VLM privacy with open-source data.

Privacy Mitigation Methods

New metrics have been created to assess a model’s ability to reproduce training instances and facilitate cross-model comparisons
models utilizing multiple modalities provide better privacy
safety modules can be integrated to boost resilience against violations
adversarial training can enhance privacy but risks reducing accuracy
New architecture: differentially private CLIP model

Privacy Future Research Directions

Cryptography-based Privacy Preservation
- Secure multi-party computation (SMPC): divides secret information into shares among multiple parties, ensuring that individual shares reveal nothing unless combined
- Homomorphic encryption (HE): allows computations on encrypted data without decryption, and has also been utilized for privacy preservation in transformers
Federated Learning
- enhances privacy in vision-language models (VLMs) by localizing model training, which protects training data from leakage.
- challenges such as communication overhead among devices and statistical heterogeneity from diverse data distributions
Data Manipulation and Fine-tuning
- Data pseudonymization: substitutes sensitive information with synthetic alternatives.
- Data sanitization: removes duplicates to reduce memorization and privacy risks.
- Knowledge sanitization fine-tuning: trains the model to provide safe responses when leakage risks arise.
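To make the SMPC idea above concrete, here is a minimal sketch of additive secret sharing, one common building block. The modulus, the number of parties, and the function names are assumptions for illustration; this is not a hardened protocol.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical modulus; secrets must be smaller than this


def make_shares(secret, n_parties):
    # n - 1 shares are uniformly random, so any subset of fewer than n shares
    # looks random; the last share makes the total equal the secret.
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    return sum(shares) % MODULUS


# Two private values are shared among three parties. Each party adds the two
# shares it holds locally, and only the sum is ever reconstructed.
a_shares = make_shares(12, 3)
b_shares = make_shares(30, 3)
sum_shares = [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 42
```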

Fairness and Bias

Fairness and Bias Issues

Bias from training data
- disproportionately features men and lighter-skinned individuals
- outdated vocabulary and imbalanced representation
- clinical models may favor certain patient groups based on gender, language, etc.
Bias from Model
- Gender biases
- misclassification of race-related elements and biased outputs

Fairness and Bias Mitigation Methods

New Datasets and Benchmarks
- Harvard-FairVLMed, PATA, and BOLD enhance evaluations but often lack the scale of established benchmarks.
- create synthetic datasets to improve fairness assessments
  - gender-balanced dataset generated with DALL-E-3 and another consisting of gender-swapped images
  - counterfactual image-text pairs that highlight biases in datasets like COCO Captions
- new metrics
  - gender polarity
  - bias distance in embeddings
- human evaluation
De-biasing
- adjust model instructions and architectures for improved fairness
- detecting biased prompts in pre-trained models
- Post-hoc Bias Mitigation (PBM) effectively reduces bias in image retrieval
- Re-sampling underperforming clusters can enhance fairness
- modification of facial features also mitigates biases
- self-debiasing reduces biased text generation, especially when paired with other methods

Fairness Future Research Directions

Optimized De-biasing
- Additive residual learning: for fairer image representations.
- Calibration loss: retain semantically similar embeddings.
- Counterfactual inference framework: help models learn correct responses through cause and effect.
- Adversarial classifiers: classifiers that predict image attributes from visual-textual similarities can be combined with instruction tuning to reduce bias.
Disentangled Representation Learning (DRL): simplifies complex data by breaking it into independent feature groups, improving model predictions.
- Traditional DRL
  - Variational autoencoders (VAEs) for feature encoding based on impact
  - Generative adversarial networks (GANs) for separation.
- Attention in text encoders can be adjusted for fairer outputs.
- challenges: varying definitions of "disentanglement", ensuring fairness.
Human-in-the-Loop (HITL): integrates human intervention into model training to improve precision and fairness
- active learning
- reinforcement learning with human feedback
- explainable AI
- challenges: human bias, cost, and ethical and legal issues

Robustness

Robustness Issues

Out-of-Distribution (OOD) Robustness
- ChatGPT excels in adversarial tasks but struggles with OOD robustness and informal medical responses
- MLLMs often fail to generalize beyond training domains due to mapping issues
- vision-language models face difficulties with open-domain concepts, especially when overfitting during fine-tuning
- Large pre-trained image classifiers show initial robustness, which diminishes over time
- Current visual question answering (VQA) models are limited to specific benchmarks, hindering generalization to OOD datasets
- fine-tuning may impair model calibration in OOD contexts.
Adversarial Attack Robustness
- Studies indicate that open-sourced VLMs show performance gaps in red teaming tasks, highlighting the need for improved safety and security.
- misalignment between language and vision modalities creates a "modality gap", complicating adversarial vulnerability.

Robustness Mitigation Methods

Improving Out-of-Distribution Robustness
- enhance OOD detection and generalization. A simple maximum logit detector has been shown to outperform complex methods for anomaly segmentation
- In-context learning (ICL) can also improve multimodal generalization
- A fine-tuned CLIP excels in unsupervised OOD detection
- The OGEN method synthesizes OOD features
- Maximum Concept Matching aligns visual and textual features, and anchor-based fine-tuning improves performance under domain shift
Defense Against Adversarial Attacks
- VILLA is a two-stage framework for adversarial training of VLMs, featuring task-agnostic adversarial pre-training and task-specific finetuning
  - conducts adversarial training in the embedding space rather than on raw image pixels and text tokens, improving the model’s resilience against adversarial examples
  - SOTA performance across various tasks
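The maximum logit detector mentioned under OOD robustness is simple enough to sketch directly: the largest raw logit is used as a confidence score, and inputs scoring below a threshold are flagged. The threshold and the logits below are made-up values; in practice the threshold would be tuned on validation data.

```python
import numpy as np


def max_logit_score(logits):
    # Higher score = the classifier has at least one strongly activated class.
    return np.max(logits, axis=-1)


def flag_ood(logits, threshold):
    # Inputs whose best logit is below the threshold are treated as OOD.
    return max_logit_score(logits) < threshold


logits = np.array([
    [9.1, 0.3, -1.2],   # one class clearly stands out
    [0.4, 0.2, 0.1],    # no class stands out
])
flags = flag_ood(logits, threshold=2.0)  # hypothetical threshold
print(flags)
```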

Robustness Future Research Directions

Data Augmentation
- MixGen: a data augmentation method that generates new image-text pairs by interpolating images and concatenating text to preserve semantics.
- creating synthetic images involves extracting text prompts via an image captioning model for use in text-to-image diffusion, then mixing these with real datasets.
- bimodal augmentation (BiAug): decouples objects and attributes to synthesize vision-language examples and hard negatives, using LLMs and an object detector to generate detailed descriptions and inpaint corresponding images.
Improved Cross-Modal Alignment
- Sharing learnable parameters
- Applying bidirectional constraints
- Adjusting cross-modal projections
challenges: addressing the modality gap, which impacts robustness to OOD data and adversarial examples
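The MixGen description under Data Augmentation (interpolate images, concatenate texts) can be sketched as below. The mixing ratio of 0.5 and the toy arrays are assumptions for illustration.

```python
import numpy as np


def mixgen(image_a, text_a, image_b, text_b, lam=0.5):
    # Images are blended pixel-wise; texts are concatenated so that the
    # caption still describes everything visible in the mixed image.
    mixed_image = lam * image_a + (1.0 - lam) * image_b
    mixed_text = f"{text_a} {text_b}"
    return mixed_image, mixed_text


image_a = np.zeros((2, 2, 3))  # toy "images" of the same shape
image_b = np.ones((2, 2, 3))
mixed_image, mixed_text = mixgen(image_a, "a dog on grass", image_b, "a red car")
print(mixed_image[0, 0], mixed_text)
```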

Safety

Safety Issues

Toxicity
- LAION-400M: contains problematic content, including explicit materials and harmful stereotypes
- Advanced models like GeminiProVision and GPT-4V show inherent biases
- Assigning personas to ChatGPT can increase toxicity and reinforce harmful stereotypes
Jailbreaking Risk
- Perturbation-based attacks can be performed effectively, while FigStep converts harmful content into images and reaches an 82.5% attack success rate across multiple VLMs
- Another attack replaces captions with malicious prompts, enabling jailbreaks.

Safety Mitigation Methods

Safety Fine-Tuning
- VLGuard
- fine-tuned on synthetic data, reducing sensitivity to NSFW inputs and enhancing performance in cross-modal tasks
Other approaches
- Reinforce-Detoxify: uses reinforcement learning to mitigate toxicity and bias in transformer models
- Although simple mitigations improve automatic scores, they risk over-filtering texts from marginalized groups and creating discrepancies between automatic and human judgments

Safety Future Research Directions

Context Awareness
- Integrating Chain-of-Thought for improved reasoning can enhance context-aware emotion recognition (CAER) tasks with large VLMs.
- Dual-Aligned Prompt Tuning: combines explicit context from pre-trained LLMs with implicit modeling to create more context-aware prompts
- Visual In-Context Learning: optimizes image retrieval and summarization to enhance task-specific interactions.
Automated Red Teaming (ART)
- RTVLM: a dataset that benchmarks VLMs across faithfulness, privacy, safety, and fairness
- Arondight: automates multi-modal jailbreak attacks using reinforcement learning and uncovers significant security vulnerabilities
- GPT-4 and GPT-4V are more robust against jailbreaks than open-source models
- limited transferability of visual jailbreak methods compared to textual ones
- connects unsafe outputs to prompts, improving the detection of vulnerabilities in text-to-image models

Ref

Vu, K., & Lai, P. (2025). Trustworthiness in Vision-Language Models. In J. Kertesz, B. Li, T. Supnithi, & A. Takhom (Eds.), Computational Data and Social Networks. Singapore.

Vision-Language Models for Vision Tasks Review

August 16, 2025 · about 16 min read

Eunkwang Shin

Owner

Overview

Most visual recognition studies rely heavily on crowd-labelled data for DNN training.

Background: development of visual recognition paradigms
Foundations: VLM architectures
Datasets used in VLM pre-training and evaluation
Review and categorization of existing pre-training methods
Benchmarking, analysis, and discussion
Research challenges and potential research directions
Training hard
- New learning paradigm
Vision-Language Model Pre-training and Zero-shot Prediction
- Increasing attention
VLMs with transfer learning
- Prompt tuning
- Visual adaptation
VLMs with knowledge distillation
- distill knowledge from VLMs to downstream tasks

The development of visual recognition paradigms

Traditional ML: Hand-crafted features for prediction.
Deep Learning: Deep networks (e.g., ResNet) with large-scale labeled data.
Supervised Pre-training + Fine-tuning: Learned representations transferred to downstream tasks.
Unsupervised / Self-supervised Pre-training + Fine-tuning: Objectives like masked modeling and contrastive learning to learn representations.
Vision-Language Models & Zero-shot: Leverage large-scale web data, enabling zero-shot prediction without task-specific fine-tuning.
- Collecting large-scale informative image-text data
- Designing high-capacity models for effective learning from big data.
- Designing new pre-training objectives for learning effective VLMs.

Illustration of development of VLMs for visual recognition

CLIP: adopts an image-text contrastive objective and learns by pulling paired images and texts close and pushing others far away in the embedding space.
- enables effective usage of web data and allows zero-shot predictions without task-specific finetuning.

VLM Overview

Given Image-text pairs.
Employs a text encoder and an image encoder to extract image and text features.
Learns the vision-language correlation with certain pre-training objectives.
GAP: Global Average Pooling, a technique used to reduce the spatial dimensions of feature maps while retaining important information.
ViT: Vision Transformer: Transformers for image recognition at scale.
CNN Based: VGG, ResNet, EfficientNet
- ResNet: Adopts skip connections between convolutional blocks which mitigates gradient vanishing and explosion and enables DNN training.
- ResNet-D variant (as adopted in CLIP): replaces global average pooling with a transformer-style multi-head attention pooling.
Transformer Based: ViT
- Adding a normalization layer before the transformer encoder.

VLM pre-training Objectives

Contrastive Objectives

Pros
- Enforce positive pairs to have similar embeddings in contrast to negative pairs.
- Encourages VLMs to learn discriminative vision and language features, where more discriminative features lead to more confident and accurate zero-shot predictions.
Cons
- Jointly optimizing positive and negative pairs is complicated and challenging.
- Involves a heuristic temperature hyper-parameter for controlling the feature discriminability.

Image Contrastive Learning

Forces a query image to be close to its positive keys (its data augmentations)
and far away from its negative keys (other images).
Learn discriminative features in image modality, which often serves as an auxiliary objective for fully exploiting the image data potential.

Image-Text Contrastive Learning

Pulling the embeddings of paired images and texts close while pushing others away.
Minimizing a symmetrical image-text InfoNCE loss
Learn vision-language correlation by contrasting image-text pairs.
- CLIP: A symmetrical image-text InfoNCE loss
- ALIGN: scales up VLM pre-training with large-scale (but noisy) image-text pairs and noise-robust contrastive learning.
- DeCLIP: Nearest-neighbor supervision to utilize the information from similar pairs, enabling effective pre-training on limited data.
- OTTER: Optimal transport to pseudo-pair images and texts, reducing the required training data.
- ZeroVL: exploits limited data resources via debiased data sampling and data augmentation with coin-flipping mixup.
- FILIP: introduces region-word alignment into contrastive learning, enabling the model to learn fine-grained vision-language correspondence.
- Pyramid-CLIP: builds multiple semantic levels and performs both cross-level and peer-level contrastive learning for effective VLM pre-training.
- LA-CLIP, ALIP: use an LLM to augment synthetic captions for given images, while RA-CLIP retrieves relevant image-text pairs for image-text pair augmentation.
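A minimal NumPy sketch of a symmetrical image-text InfoNCE loss, assuming a batch in which row i of the image embeddings is paired with row i of the text embeddings. The temperature value and the random embeddings are illustrative assumptions; real training would use learned encoders and an autodiff framework.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_infonce(image_emb, text_emb, temperature=0.07):
    # L2-normalise so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B); diagonal = matched pairs
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
loss_aligned = symmetric_infonce(emb, emb)          # every pair matches
loss_shuffled = symmetric_infonce(emb, emb[::-1])   # every pair is mismatched
print(loss_aligned < loss_shuffled)
```

The temperature here is the heuristic hyper-parameter listed under the cons of contrastive objectives: smaller values sharpen the softmax and make the loss more sensitive to hard negatives.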

[Figure: CLIP]

Image-Text-Label Contrastive Learning

Introduces supervised contrastive learning into image-text contrastive learning.
Learn discriminative and task-specific features by exploiting both supervised labels and unsupervised image-text pairs.
- UniCL: pre-training allows learning both discriminative and task-specific (image classification) features simultaneously with around 900M image-text pairs.

[Figure: Image-Text-Label Contrastive Learning]

Generative Objectives

Encouraging VLMs to learn rich vision, language and vision-language contexts for better zero-shot predictions.
Generally adopted as additional objectives above other VLM pre-training objectives for learning rich context information.

Masked Image Modelling

Learns cross-patch correlation and image context information by masking and reconstructing images.
- MAE, BEiT: certain patches in an image are masked and the encoder is trained to reconstruct them conditioned on unmasked patches.

[Figure: Masked Image Modelling]

Masked Language Modelling

Adopted pre-training objectives in NLP.
Randomly masking a certain percentage of input tokens and predicting them. (15% in BERT)
Learn by masking a fraction of tokens in each input text and training networks to predict the masked tokens.
- FLAVA: masks out 15% text tokens and reconstructs them from the rest tokens for modelling cross-word correlation.
- FIBER: adopts masked language modelling as one of the VLM pre-training objectives to extract better language features.
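The masking step can be sketched as follows. This shows only the "mask a fraction of tokens and remember the targets" part; it leaves out refinements such as BERT's occasional random or unchanged replacements, and the `[MASK]` string, the seed, and the function name are illustrative.

```python
import random


def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    # Mask at least one token when the input is non-empty.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]   # what the model must predict
        masked[pos] = mask_token
    return masked, targets


tokens = [f"w{i}" for i in range(20)]
masked, targets = mask_tokens(tokens)
print(len(targets), "of", len(tokens), "tokens masked")
```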

Masked Cross-Modal Modelling

Integrates masked image modelling and masked language modelling.
Given an image-text pair, it randomly masks a subset of image patches and a subset of text tokens and then learns to reconstruct them.
Learn by masking a certain percentage of image patches and text tokens and training VLMs to reconstruct them based on the embeddings of unmasked image patches and text tokens.
- FLAVA: masks 40% of image patches and 15% of text tokens, and employs an MLP to predict the masked patches and tokens, capturing rich vision-language correspondence information.

Image-to-Text Generation

Generate descriptive texts for a given image for capturing fine-grained vision-language correlation by training VLMs to predict tokenized texts.
- CoCa, NLIP, PaLI: train VLMs with the standard encoder-decoder architecture and image captioning objectives.

[Figure: Image to caption]

Alignment Objectives

Align image–text pairs in the embedding space.

pros
- simple, easy to optimize
- can be easily extended to model fine-grained vision-language correlation
cons
- little correlation information within vision or language modality.
adopted as auxiliary losses to other VLM pre-training objectives for enhancing modelling the correlation across vision and language modalities.

Image-Text Matching

models the overall correlation between an entire image and an entire sentence. (global correlation)
Image-text matching models global image-text correlation by directly aligning paired images and texts
- FLAVA: matches the given image with its paired text via a classifier and a binary classification loss.
- FIBER: mines hard negatives with pair-wise similarities for better alignment between image and text.

Region-Word Matching

captures fine-grained correlations between image regions and specific words. (local correlation)
models local fine-grained vision-language correlation by aligning paired image regions and word tokens.
benefiting zero-shot dense predictions in object detection and semantic segmentation.
- GLIP, FIBER, DetCLIP: replace object classification logits by region-word alignment scores.
  - the dot-product similarity between regional visual features and token-wise features.
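A toy sketch of region-word alignment scores as dot products between regional visual features and token-wise text features. The two-dimensional features and the token labels in the comments are invented for illustration.

```python
import numpy as np


def region_word_scores(region_feats, token_feats):
    # One score per (region, token) pair; these replace fixed-class logits.
    return region_feats @ token_feats.T   # (num_regions, num_tokens)


regions = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
tokens = np.array([[0.9, 0.1],    # e.g. "cat"
                   [0.2, 0.8],    # e.g. "dog"
                   [0.5, 0.5]])   # e.g. "animal"
scores = region_word_scores(regions, tokens)
best_token = scores.argmax(axis=1)
print(scores)
print(best_token)
```

Because the "classes" are just token embeddings, the vocabulary can be changed at inference time by supplying different text.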

[Figure: Region-Word Matching, GLIP]

VLM Pre-Training Frameworks

[Figure: VLM pre-training frameworks]

Evaluation

Zero-shot Prediction

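Image Classification: classify images into pre-defined categories by comparing image and text embeddings, typically with text prompts built through "prompt engineering" (e.g., "a photo of a [label]").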
Semantic Segmentation: by comparing the embeddings of the given image pixels and texts.
Object Detection: localize and classify objects in images with the object locating ability learned from auxiliary datasets.
Image-Text Retrieval
- Text-to-image retrieval that retrieves images based on texts
- Image-to-text retrieval that retrieves texts based on images.
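Zero-shot classification by embedding comparison can be sketched as below. The embeddings are hand-made stand-ins for what an image encoder and a text encoder would produce for prompts such as "a photo of a cat"; no real model is called.

```python
import numpy as np


def zero_shot_classify(image_emb, class_text_embs, class_names):
    # Cosine similarity between the image and each class prompt embedding.
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img
    return class_names[int(np.argmax(sims))], sims


class_names = ["cat", "dog", "car"]
class_text_embs = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])   # stand-in text embeddings
image_emb = np.array([0.9, 0.1, 0.0])           # stand-in image embedding
label, sims = zero_shot_classify(image_emb, class_text_embs, class_names)
print(label)  # cat
```

Retrieval uses the same similarity matrix: text-to-image retrieval ranks images for a given text, and image-to-text retrieval ranks texts for a given image.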

Linear Probing

freezes the pre-trained VLM
trains a linear classifier to classify the VLM-encoded embeddings to assess the VLM representations.

Datasets

For Pre-training VLMs (dataset, year, number of image-text pairs, language)
- CLIP, 2021, 400M, English
- ALIGN, 2021, 1.8B, English
- FILIP, 2021, 300M, English
- WebLI, 2022, 12B, 129 Languages
For VLM Evaluation (dataset, metric)
- Image Classification
  - PASCAL VOC 2007 Classification, 11-point mAP
  - Oxford-IIIT PETS, Mean Per Class
  - EuroSAT, Accuracy
  - Hateful Memes, ROC AUC
  - Country211, Accuracy
- Image-Text Retrieval
  - Flickr30k, Recall
  - COCO Caption, Recall
- Action Recognition
  - UCF101, Accuracy
  - Kinetics700, Mean(top1, top5)
  - RareAct, mWAP, mSAP
- Object Detection
  - COCO 2017 Detection, box mAP
  - LVIS, box mAP
  - ODinW, box mAP
- Semantic Segmentation
  - Cityscapes, Mean IoU
  - ADE20K, Mean IoU

VLM Transfer learning

Adapts VLMs to downstream tasks via prompt tuning, feature adapters, and other transfer methods.

image and text distributions gap: downstream dataset may have task-specific image styles and text formats
training objectives gap: VLMs are generally trained with task-agnostic objectives, while downstream tasks often involve task-specific objectives. (coarse or fine-grained classification, region or pixel-level recognition)

Transfer via Prompt Tuning

Inspired by the "prompt learning" in NLP

pros
- simple, easy-to-implement
- requires little extra network layer or complex network modifications
- adapting VLMs in a black-box manner, which has clear advantages in transferring VLMs that involve concerns in intellectual property.
cons
- low flexibility by following the manifold (latent space) of the original VLMs in prompting.

Transfer with Text Prompt Tuning

Exploring more effective and efficient learnable text prompts with several labelled downstream samples for each class.
- supervised and few-shot supervised
  - CoOp: Exploring context optimization to learn context words for a single class name with learnable word vectors.
  - CoCoOp: Exploring conditional context optimization that generates a specific prompt for each image.
  - SubPT: designs subspace prompt tuning to improve the generalization of learned prompts.
  - LASP: regularizes learnable prompts with hand-engineered prompts.
  - VPT: models text prompts with instance-specific distribution with better generalization on downstream tasks.
  - KgCoOp: enhances the generalization of unseen class by mitigating the forgetting of textual knowledge.
  - SoftCPT: fine-tunes VLMs on multiple few-shot tasks simultaneously for benefiting from multi-task learning.
  - PLOT: employs optimal transport to learn multiple prompts to describe the diverse characteristics of a category.
  - DualCoOp, TaI-DP: transfer VLMs to multi-label classification tasks.
    - DualCoOp: adopts both positive and negative prompts for multi-label classification
    - TaI-DP: double-grained prompt tuning for capturing both coarse-grained and fine-grained embeddings.
  - DenseCLIP: explores language-guided fine-tuning that employs visual features to tune text prompts for dense prediction.
  - ProTeCt: improves the consistency of model predictions for hierarchical classification task.
- unsupervised
  - UPL: optimizes learnable prompts with self-training on selected pseudo-labeled samples.
  - TPT: explores test-time prompt tuning to learn adaptive prompts from a single downstream sample.

[Figure: Text Prompt Tuning]

In the figure, V denotes the learnable word vectors, which are optimized by minimizing the classification loss on the downstream samples.

Transfer with Visual Prompt Tuning

Transfers VLMs by modulating the input of image encoder.
- VP: adopts learnable image perturbations $v$ to modify the input image $x^I$ by $x^I + v$ , aiming to adjust $v$ to minimize a recognition loss.
- RePrompt: integrates retrieval mechanisms into visual prompt tuning, allowing leveraging the knowledge from downstream tasks.
enables pixel-level adaptation to downstream tasks, benefiting them greatly especially for dense prediction tasks.

[Figure: Visual Prompt Tuning]

Transfer with Text-Visual Prompt Tuning

modulate the text and image inputs simultaneously, benefiting from joint prompt optimization on multiple modalities.
- UPT: unifies prompt tuning to jointly optimize text and image prompts, demonstrating the complementary nature of the two prompt tuning tasks.
- MVLPT: explores multi-task vision-language prompt tuning to incorporate cross-task knowledge into text and image prompt tuning.
- MAPLE: conducts multi-modal prompt tuning by aligning visual prompts with their corresponding language prompts, enabling a mutual promotion between text prompts and image prompts.
- CAVPT: introduces a cross attention between class-aware visual prompts and text prompts, encouraging the visual prompts to concentrate more on visual concepts.

Transfer via Feature Adaptation

adapt image or text features with an additional light-weight feature adapter
- CLIP-Adapter: inserts several trainable linear layers after CLIP's language and image encoders and optimizes them while keeping the CLIP architecture and parameters frozen.
- Tip-Adapter: a training-free adapter that directly employs the embeddings of few-shot labelled images as the adapter weights.
- SVL-Adapter: a self-supervised adapter which employs an additional encoder for self-supervised learning on input images.
Flexible and effective, since the adapter architecture and the way it is inserted can be tailored to different and complex downstream tasks.
Requires modifying the network architecture and thus cannot handle VLMs that have intellectual property concerns.
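As an illustration of a training-free adapter in the spirit of Tip-Adapter, the sketch below treats the few-shot image embeddings as cache keys and their one-hot labels as cache values, then adds the cache prediction to the zero-shot logits. This is my reading of the idea; the exact form, the `alpha` and `beta` values, and all toy numbers are assumptions, and logit scaling is omitted.

```python
import numpy as np


def cache_adapter_logits(test_feat, cache_keys, cache_values, text_weights,
                         alpha=1.0, beta=5.5):
    # Similarity of the test image to every cached few-shot image.
    affinity = test_feat @ cache_keys.T                       # (N,)
    # Sharpen similarities, then vote with the cached one-hot labels.
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values   # (C,)
    # Plain zero-shot logits from the text-derived classifier.
    zero_shot_logits = test_feat @ text_weights.T             # (C,)
    return zero_shot_logits + alpha * cache_logits


cache_keys = np.array([[1.0, 0.0], [0.0, 1.0]])      # one cached image per class
cache_values = np.array([[1.0, 0.0], [0.0, 1.0]])    # their one-hot labels
text_weights = np.array([[0.7071, 0.7071],
                         [0.7071, 0.7071]])          # zero-shot alone is undecided
test_feat = np.array([0.6, 0.8])                     # unit-norm test feature

logits = cache_adapter_logits(test_feat, cache_keys, cache_values, text_weights)
print(int(np.argmax(logits)))  # 1: the cache breaks the tie
```

No parameter is trained here, which is what makes this kind of adapter cheap: building the cache is a single forward pass over the few-shot images.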

Other Transfer Methods

Direct fine-tuning, architecture modification, cross attention
- Wise-FT: combines the weights of a fine-tuned VLM and the original VLM for learning new information from downstream tasks.
- MaskCLIP: extracts dense image features by modifying the architecture of the CLIP image encoder.
- VT-CLIP: introduces visual-guided attention to semantically correlate text features with downstream images, leading to a better transfer performance.
- CALIP: introduces parameter-free attention for effective interaction and communication between visual and text features.
- TaskRes: directly tunes text-based classifier to exploit the old knowledge in the pre-trained VLM.
- CuPL, VCD: employ large language models like GPT-3 to augment text prompts for learning rich discriminative text information.
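The weight-combination idea behind Wise-FT can be sketched as a linear interpolation between the original and fine-tuned parameters. The parameter names, the toy values, and the mixing coefficient are illustrative.

```python
import numpy as np


def interpolate_weights(zero_shot, fine_tuned, alpha=0.5):
    # alpha = 0 keeps the original VLM, alpha = 1 keeps the fine-tuned one.
    return {
        name: (1 - alpha) * zero_shot[name] + alpha * fine_tuned[name]
        for name in zero_shot
    }


zero_shot = {"w": np.array([0.0, 2.0]), "b": np.array([1.0])}
fine_tuned = {"w": np.array([4.0, 2.0]), "b": np.array([3.0])}
merged = interpolate_weights(zero_shot, fine_tuned, alpha=0.25)
print(merged["w"], merged["b"])  # [1. 2.] [1.5]
```

Choosing `alpha` trades off the general knowledge of the original model against what was learned on the downstream task.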

[Figure: Feature Adaptation]

VLM Knowledge Distillation

distils general and robust VLM knowledge to task-specific models without the restriction of VLM architecture, benefiting task-specific designs while tackling various dense prediction tasks.
most VLM knowledge distillation methods focus on transferring image-level knowledge to region- or pixel-level tasks such as object detection and semantic segmentation.

Knowledge Distillation for Object Detection

To distill VLM knowledge to enlarge the detector vocabulary
To better align image-level and object-level representations
- ViLD: distills VLM knowledge to a two-stage detector whose embedding space is enforced to be consistent with that of CLIP image encoder.
- HierKD: hierarchical global-local knowledge distillation.
- RKD: region-based knowledge distillation for better aligning region-level and image-level embeddings.
- ZSD-YOLO: self-labeling data augmentation for exploiting CLIP for better object detection.
- OADP: preserves proposal features while transferring contextual knowledge.
- BARON: uses neighborhood sampling to distill a bag of regions instead of individual regions.
- RO-ViT: distills information from VLMs for open-vocabulary detection.
VLM distillation via prompt learning
- DetPro: a detection prompt technique for learning continuous prompt representations for open-vocabulary object detection.
- PromptDet: regional prompt learning for aligning word embeddings with regional image embeddings.
- PB-OVD: trains object detectors with VLM-predicted pseudo bounding boxes.
- XPM: a robust cross-modal pseudo-labeling strategy that employs VLM-generated pseudo masks for open-vocabulary instance segmentation.
- P3OVD: prompt-driven self-training that refines the VLM-generated pseudo labels with fine-grained prompt tuning.

Knowledge Distillation for Semantic Segmentation

Leverage VLMs to enlarge the vocabulary of segmentation models, aiming to segment pixels described by arbitrary text (i.e., any category of pixels beyond the base classes).
Tackling the mismatch between image-level and pixel-level representations.
- CLIPSeg: a lightweight transformer decoder to extend CLIP for semantic segmentation.
- LSeg: maximizes the correlation between CLIP text embeddings and pixel-wise image embedding encoded by segmentation models.
- ZegCLIP: employs CLIP to generate semantic masks and introduces a relationship descriptor to mitigate overfitting on base classes.
- MaskCLIP+, SSIW: distill knowledge with VLM-predicted pixel-level pseudo labels.
- FreeSeg: generates mask proposals first and then performs zero-shot classification for them.

Knowledge distillation for weakly-supervised semantic segmentation

Leverage both VLMs and weak supervision (e.g., image-level labels) for semantic segmentation.
CLIP-ES: employs CLIP to refine the class activation map by designing a softmax function and a class-aware attention-based affinity module for mitigating the category confusion issue.
CLIMS: employs CLIP knowledge to generate high-quality class activation maps for better weakly-supervised semantic segmentation.

Performance

The strong performance of VLMs is largely attributed to three factors: big data, big models, and task-agnostic learning.
Limitations
- When data/model size keeps increasing, the performance saturates and further scaling up won’t improve performance
- Adopting large-scale data in VLM pre-training necessitates extensive computation resources
- Adopting large models introduces excessive computation and memory overheads in both training and inference
Transfer Learning
- can mitigate the domain gaps by learning from task-specific data, being labelled or unlabelled.
- Supervised > few-shot supervised ≈ unsupervised transfer (unsupervised transfer is less prone to overfitting but remains challenging)
Knowledge Distillation
- brings clear performance improvement on detection and segmentation tasks
- introduces general and robust VLM knowledge while benefiting from task-specific designs
The development of VLM pre-training for dense visual recognition tasks (region- or pixel-level detection and segmentation) lags far behind.
Benchmarking requires certain norms in terms of training data, networks and downstream tasks.
- VLM transfer: most studies release their code and do not require intensive computation resources, easing reproduction and benchmarking.
- VLM pre-training: studied with different data and networks, making benchmarking very challenging; several studies also use non-public training data or require intensive computation resources.
- VLM knowledge distillation: studies adopt different task-specific backbones, which complicates benchmarking.

Challenges

VLM pre-training
- Fine-grained vision-language correlation modelling: can better recognize patches and pixels beyond images, greatly benefiting dense prediction tasks
- Unification of vision and language learning: enables efficient communications across data modalities which can benefit both training effectiveness and training efficiency.
- Pre-training VLMs with multiple languages: most existing VLMs are trained on a single language (English), which could introduce bias in terms of cultures and regions and hinder VLM applications in other language areas.
- Data-efficient VLMs: instead of merely learning from each image-text pair, more useful information could be learned with the supervision among image-text pairs.
- Pre-training VLMs with LLMs: employ LLMs to augment the texts in the raw image-text pairs, which provides richer language knowledge and helps better learn vision-language correlation.
VLM Transfer Learning
- Unsupervised VLM transfer: much lower risk of overfitting than few-shot supervised transfer.
- VLM transfer with visual prompt/adapter: Existing studies focus on text prompt learning. Visual prompt learning or visual adapter, which is complementary to text prompting and can enable pixel-level adaptation in various dense prediction tasks.
- Test-time VLM transfer: Existing studies conduct transfer by fine-tuning VLMs on each downstream task (i.e., prompt learning), leading to repetitive efforts while facing many downstream tasks. Adapting prompts on the fly during inference can circumvent the repetitive training in existing VLM transfer.
- VLM transfer with LLMs: Different from prompt engineering and prompt learning, exploit LLMs to generate text prompts that better describe downstream tasks. This approach is automatic and requires little labelled data.
VLM knowledge distillation
- Knowledge distillation from multiple VLMs: harvest their synergistic effect by coordinating knowledge distillation from multiple VLMs.
- Knowledge distillation for other visual recognition tasks: leverage the knowledge distilled from VLMs to improve performance on other visual recognition tasks. (instance segmentation, panoptic segmentation, person reidentification)

Ref

Zhang, J., Huang, J., Jin, S., & Lu, S. (2024). Vision-Language Models for Vision Tasks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5625–5644. https://doi.org/10.1109/TPAMI.2024.3369699

IAI +003

August 15, 2025 · about 5 min read

Eunkwang Shin

Owner

Local Search Problem

To find the state that gives the optimal/best value of the evaluation function

It can be seen as an optimization problem.
a computational problem that finds the best solution (a state) that satisfies the given constraints
evaluation function === objective function
Only cares about the optimal solution/best state without considering the paths to reach the best state (the optimal solution)
Not systematic

Feasible region & solution

Feasible region: the set of all candidate solutions, i.e., the solutions that satisfy the problem's constraints
Feasible solution: a solution in the feasible region

Search Problem vs Local Search Problem

Path-based vs State-based

Aspects	Search Problem	Local Search Problem
State	All possible states - state-space landscape	Range of decision variables and constraints
Goal	Goal state & goal test	Evaluation function & objective function
Evaluation	Measure closeness to goal - distance/fitness	Minimize cost or maximize fitness
Transition/Successor	Transition function	Successor function

Discrete & Continuous Optimization

Discrete optimization: optimization problems where the solution space is discrete (e.g., 8 queens problem)
Continuous optimization: optimization problems where the solution space is continuous (e.g., real numbers, any value within a range)

Information needed for Local Search

All possible states: state-space landscape
Transition function: To find neighbor or successor state
Goal state
Objective function: A way to measure how close a state is to the goal state
Start state

Search state-space

Global Maximum: A state that maximizes the objective function over the entire state space
Local Maximum: A state that maximizes the objective function within a small area around it.
Plateau: A state such that the objective function is constant in an area around it.
- Shoulder: A plateau that has an uphill edge, so progress is still possible.
- Flat local maximum: A plateau whose edges only go downhill, so no uphill exit exists.

Advantages

use little memory
can often find reasonably good solution in large or infinite search spaces
useful for solving pure optimization problems
don't need to know the path to the solution.

Hill climbing

keeps track of one current state and on each iteration moves to the neighboring state with highest value.

$f = \max(-\mathrm{cost}(X))$
Steps
- Evaluate the initial state.
- If it is a goal state, return it. Otherwise, continue.
- Find a neighboring state.
- Evaluate this state. If it is better (closer to the goal) than the current state, make it the current state.
- Repeat steps 2-4 until a goal state (local or global maximum) is reached or time runs out.
No search tree, no backtracking, no look-ahead beyond the neighbors of the current state.
- Can get stuck at local maxima, plateaus, or ridges.
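
The steps above can be sketched in a few lines. This is a minimal steepest-ascent version under stated assumptions: the `hill_climb` helper, the integer state space, and the toy objective $f(x) = -(x - 3)^2$ are all illustrative choices, not part of the notes.

```python
def hill_climb(start, neighbors, value, max_steps=1000):
    """Move to the best neighbor until no neighbor improves the current state."""
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=value, default=current)
        if value(best) <= value(current):
            return current  # local (possibly global) maximum or plateau
        current = best
    return current

def value(x):
    """Toy objective with a single peak at x = 3 (hypothetical)."""
    return -(x - 3) ** 2

def neighbors(x):
    """Neighboring states of an integer state: one step left or right."""
    return [x - 1, x + 1]

result = hill_climb(0, neighbors, value)
print(result)  # 3
```

Only the current state is stored, which is why the method uses so little memory. On an objective with several peaks the same loop would stop at whichever peak is nearest to the start.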

Variations of HC

Simple HC: greedy local search which expands the current state and moves on to the best neighbor.
Stochastic HC: choose randomly among the neighbors going uphill.
First-choice HC: generate random successor until one is better. Good for states with high numbers of successors.
Random restart: conducts a series of hill climbing searches from random initial states until a goal state is found.
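
Random restart is the easiest variation to illustrate. The sketch below assumes a made-up one-dimensional landscape (`heights`) with several local maxima; `climb` and `random_restart` are hypothetical helper names, and the number of restarts is arbitrary.

```python
import random

def climb(x, value, lo, hi):
    """Plain hill climbing on integer states in [lo, hi]."""
    while True:
        candidates = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(candidates, key=value)
        if value(best) <= value(x):
            return x  # no neighbor is better: a local maximum
        x = best

def random_restart(value, lo, hi, restarts, rng):
    """Run hill climbing from random starts and keep the best peak found."""
    best = None
    for _ in range(restarts):
        peak = climb(rng.randint(lo, hi), value, lo, hi)
        if best is None or value(peak) > value(best):
            best = peak
    return best

# Toy landscape: local maxima at indices 1, 3, 5 and 8; the global one is index 5.
heights = [1, 3, 2, 5, 4, 9, 7, 2, 6, 1]

def value(i):
    return heights[i]

best = random_restart(value, 0, len(heights) - 1, restarts=50, rng=random.Random(0))
print(best, value(best))
```

A single climb from index 0 stops at index 1, a local maximum. Repeating the climb from random starts makes it very likely that at least one run begins in the basin of the global maximum.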

Simulated Annealing

based upon the annealing process to model the search process for finding an optimal solution to an optimisation problem

annealing schedule, temperature, energy
finds the minimal value of the objective function (energy function)
starts with a high temperature and then gradually reduces the temperature
$P = e^{-\Delta E / kT}$
- $\Delta E$ : how bad the new state is compared to the old state
- $T$ : temperature is getting lower over time
- $k$ : a scaling factor
Swap condition: $\Delta E \le 0$ or $e^{-\Delta E / kT} > \text{random}$, where random is drawn uniformly from $[0, 1)$
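
The acceptance rule can be written directly from the formula. The following is a minimal sketch, assuming a geometric cooling schedule and a toy energy function; the function names, the schedule, and all parameter values are illustrative assumptions rather than anything prescribed by the notes.

```python
import math
import random

def accept(delta_e, temperature, k=1.0, rng=random):
    """Accept improvements always, worse moves with probability e^(-dE / kT)."""
    if delta_e <= 0:
        return True
    return math.exp(-delta_e / (k * temperature)) > rng.random()

def simulated_annealing(start, neighbor, energy, t0=10.0, cooling=0.95,
                        steps=2000, rng=random):
    """Minimize `energy`; the temperature shrinks geometrically (an assumption)."""
    current, best, t = start, start, t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        if accept(energy(candidate) - energy(current), t, rng=rng):
            current = candidate
            if energy(current) < energy(best):
                best = current
        t = max(t * cooling, 1e-9)  # keep T strictly positive
    return best

def energy(x):
    """Toy energy with its minimum at x = 3 (hypothetical)."""
    return (x - 3) ** 2

def neighbor(x, rng):
    """Random step of one unit to the left or right."""
    return x + rng.choice((-1, 1))

best = simulated_annealing(20, neighbor, energy, rng=random.Random(0))
print(best, energy(best))
```

At high temperature the exponent is close to zero, so even bad moves are usually accepted. As $T$ drops, the same $\Delta E$ gives a probability close to zero and the search behaves like plain hill climbing.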

Evolutionary algorithms

Local beam search
Stochastic beam search
Genetic algorithms

Characteristics

size of the population
representation of each individual
mixing number
selection process for selecting the individuals who will become the parents of the next generation
recombination procedure
mutation rate
makeup of the next generation

Genetic algorithm

It uses operators, such as reproduction, crossover and mutation, inspired by natural evolutionary principles.

State: is represented by an individual in a population. Traditional representation is a chromosome
Objective function: is used to evaluate the fitness of an individual (= fitness function)
Successor function: consists of three operators: reproduction, crossover, and mutation
Solution: is found through evolution from one generation to another generation
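
How the four pieces fit together can be shown on a toy problem. This is a minimal sketch, assuming bit-list chromosomes, the OneMax fitness (count of 1 bits), fitness-proportionate parent selection, single-point crossover, and one elite individual copied unchanged as the reproduction step. All names and parameter values are hypothetical.

```python
import random

def fitness(chromosome):
    """OneMax: the more 1 bits, the fitter the individual."""
    return sum(chromosome)

def evolve(pop_size=20, length=12, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # +1 keeps every weight positive even if a chromosome is all zeros.
        weights = [fitness(c) + 1 for c in population]
        # Reproduction: copy the best individual unchanged (elitism).
        next_generation = [max(population, key=fitness)]
        while len(next_generation) < pop_size:
            mother, father = rng.choices(population, weights=weights, k=2)
            point = rng.randint(1, length - 1)          # single-point crossover
            child = mother[:point] + father[point:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                    # bit-flip mutation
            next_generation.append(child)
        population = next_generation
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

The state is one chromosome, the successor function is the selection, crossover and mutation pipeline, and the solution is whatever individual is fittest after the last generation.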

Genetic Algorithm

Roulette Wheel Selection

Compute total fitness of all individuals.
- Example: A=30, B=20, C=40, D=10 → Total = 100.
Calculate probability of each individual being selected
- Formula: $P(i) = \frac{fitness(i)}{total\_fitness}$
  - A = 30/100 = 0.30
  - B = 20/100 = 0.20
  - C = 40/100 = 0.40
  - D = 10/100 = 0.10
Convert to cumulative probabilities (with P1 = A, P2 = B, P3 = C, P4 = D)
- P4 = 0.10
- P4 + P3 = 0.50
- P4 + P3 + P2 = 0.70
- P4 + P3 + P2 + P1 = 1.00
Generate a random number between 0 and 1.
Select an individual based on the random number and cumulative probabilities.

Roulette Wheel Selection

⚫ random = 0.07 → falls in P4 [0, 0.10)
🔺 random = 0.37 → falls in P3 [0.10, 0.50)
⬟ random = 0.82 → falls in P1 [0.70, 1.00)
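
The selection procedure can be sketched with a cumulative sum and a binary search. The snippet uses the fitness values from the example (A=30, B=20, C=40, D=10), laid out in the order D, C, B, A. The `roulette_select` helper is a hypothetical name, and it works on cumulative fitness rather than probabilities to avoid rounding issues.

```python
import bisect
import itertools
import random

def roulette_select(fitness, r):
    """Pick the individual whose cumulative-fitness slot contains r * total."""
    names = list(fitness)
    cumulative = list(itertools.accumulate(fitness[n] for n in names))
    total = cumulative[-1]
    index = bisect.bisect_right(cumulative, r * total)
    return names[min(index, len(names) - 1)]  # guard against r == 1.0

# Order defines the wheel layout: D, then C, then B, then A.
fitness = {"D": 10, "C": 40, "B": 20, "A": 30}

print(roulette_select(fitness, 0.07))  # D
print(roulette_select(fitness, 0.37))  # C
print(roulette_select(fitness, 0.60))  # B

# In a real run r comes from a random number generator:
print(roulette_select(fitness, random.random()))
```

Individuals with higher fitness own a wider slice of the wheel, so they are selected more often, but every individual keeps a non-zero chance.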

Applications of GA

Parameter tuning: optimize the parameters in NN
Planning: economic dispatch, train timetabling
Design & Control problems: robotic control, adaptive control systems
Successful use of GA requires careful engineering of the representation

FDA +003

August 14, 2025 · about 8 min read

Eunkwang Shin

Owner

CRISP-DM

CRISP-DM (Cross-Industry Standard Process for Data Mining)

Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment

Business understanding

Determine business objectives
Assess situation
Determine data mining goals
Produce project plan

Data understanding

Collect initial data
Describe data
Explore data
Verify data quality

Data preparation

Select data
Clean data
Construct data
Integrate data
Format data

Modeling

Select modeling technique
Generate test design
Build model
Assess model

Evaluation

Evaluate results
Review process
Determine next steps

Deployment

Plan deployment
Plan monitoring & maintenance
Produce final report
Review project

Instance & Attributes

Instance: the terms associated with specific objects. Instances are described by a set of values for the features.
Attributes: the collection of features of the object that are maintained in a dataset.
Object: a collection of features about which measurements can be taken.
- Car: fuel consumption, cylinders, horsepower...

Qualitative & Quantitative data

Qualitative data: less structured, non-statistical, measured using other descriptors and identifiers
- white, heavy, wild...
Quantitative data: statistical, measured using hard numbers.
- 130cm, 400kg, 4 legs...

Discrete & Continuous (Quantitative) data

Discrete data: fixed, round numbers, countable
- number of legs, count of aeroplane departures, number of times a person commutes for a job in a week
Continuous data: measured on a continuous scale; can take any value within a range
- weight, solar irradiation, temperature of a room

Summary

Qualitative	Quantitative (discrete)	Quantitative (continuous)
Title	Duration	Rating
Production Country	Release Year
Director
Genres
Description

Categorizing attributes

Item	Nominal (categorical)	Ordinal	Interval	Ratio
Definition	Values act only as labels or names. No order.	Values are ordered. The intervals between them are undefined.	Ordered, with a fixed, equal unit (interval). No absolute zero.	Interval properties plus an absolute zero. Both differences and ratios are meaningful.
Examples	Hair colour `{blonde, brown, ginger}`, postal codes, industry codes / research field codes, blood type, license number	Height: `tall > average > short`, weight: `light < average < heavy`, star ratings, T-shirt sizes	Height (cm), weight (kg) (as listed in the source), 12-hour clock time (comparing differences), time intervals (5 to 10 minutes), waist size, time	Age (years), income (thousand dollars), temperature in Kelvin, monetary amounts, counts, mass, length, electrical current, body weight, medicine dosage
Allowed comparisons	`=, ≠`	`=, ≠, <, >`	`=, ≠, <, >, +, −`	`=, ≠, <, >, +, −, ×, ÷`
Operations / analysis	Mode, entropy (measure of uncertainty), contingency table, correlation (chi-squared test of independence), chi-squared test	Median, percentiles, rank correlation (Spearman), run tests (Mann–Whitney U, Wilcoxon), sign tests	Mean, standard deviation, Pearson correlation, t-test, F-test (ANOVA)	Geometric mean, harmonic mean, percent variation (CV)
Notes	Mean and standard deviation are meaningless.	Ranks can be compared, but intervals and magnitudes cannot. Median and rank-based statistics are appropriate.	Equal intervals → +, − are valid. No absolute zero → ratios cannot be interpreted.	Absolute zero → all operations are valid. Ratios and multiplication are interpretable.
Variable characteristics	Named variables	Named & Ordered variables	Named & Ordered & Distance between variables	Named & Ordered & Distance between variables & Makes sense to multiply/divide
Analysis Method	Frequency	Frequency, median and percentiles	Frequency, median and percentiles, add or subtract, mean, standard deviation, standard error of the mean	Frequency, median and percentiles, add or subtract, mean, standard deviation, standard error of the mean, ratio
Data type	Qualitative	Qualitative	Quantitative	Quantitative

Attribute Type	Description	Examples	Operations
Nominal	The values of a nominal attribute are just different names, i.e. nominal attributes provide only enough information to distinguish one object from another. (`=, ≠`)	post codes, employee ID numbers, eye colour, sex: `{ male, female }`	mode, entropy, contingency, correlation, chi squared test
Ordinal	The values of an ordinal attribute provide enough information to order objects. (`<, >`)	hardness of minerals, `{ good, better, best }`, grades, street numbers	median, percentiles, rank correlation, run tests, sign tests
Interval	For interval attributes, the differences between values are meaningful, i.e. a unit of measurement exists. (`+, −`)	calendar dates, temperature in Celsius or Fahrenheit	mean, standard deviation, Pearson’s correlation, t and F tests
Ratio	For ratio variables both differences and ratios are meaningful. (`×, ÷`)	temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current	geometric mean, harmonic mean, percent variation
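
Which summary statistic is meaningful depends on the attribute type. The sketch below picks one statistic per type using Python's standard `statistics` module. The sample values and variable names are made up for illustration.

```python
import statistics

eye_colour = ["brown", "blue", "brown", "green"]        # nominal
grades = ["good", "best", "better", "good", "better"]   # ordinal
celsius = [18.0, 21.5, 19.0, 22.5]                      # interval
mass_kg = [2.0, 8.0, 4.0]                               # ratio

# Nominal: only equality is defined, so the mode is the natural summary.
most_common = statistics.mode(eye_colour)

# Ordinal: map labels to ranks, then take a median that is an actual label.
order = ["good", "better", "best"]
ranks = sorted(order.index(g) for g in grades)
median_grade = order[statistics.median_low(ranks)]

# Interval: differences are meaningful, so mean and standard deviation work.
mean_temp = statistics.mean(celsius)
stdev_temp = statistics.stdev(celsius)

# Ratio: an absolute zero exists, so the geometric mean is meaningful too.
geo_mean_mass = statistics.geometric_mean(mass_kg)

print(most_common, median_grade, mean_temp, round(geo_mean_mass, 6))
# brown better 20.25 4.0
```

Taking the mean of `eye_colour` is impossible, and a geometric mean of Celsius values would be misleading because 0 °C is not an absolute zero.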

Structured & Unstructured Data

Structured Data: which has an associated fixed data structure.
- Relational table
- Manageable
Unstructured Data: which is expressed in natural language and no specific structure and domain types are defined.
- Documents and sounds.
Semi-structured Data: the format is not fixed and has some degree of flexibility.
- XML, JSON
- emails, text data, image, video and sound, zipped files, web pages.

Curse of dimensionality

The explosive nature of increasing data dimensions and its resulting exponential increase in computational efforts required for its processing and/or analysis.

Characteristics of structured data
- Dimensionality: Datasets with more attributes have more dimensions, and high-dimensional data is challenging to work with.
- Sparsity: A dataset is called sparse (has the property of sparsity) when most of its attribute values are zero.
- Resolution: The patterns depend on the scale or level of resolution.
Real life data is usually in a lower dimensional manifold
- many dimensions can be either ignored or the dimensionality can be reduced.
Local smoothness: small changes in input values give small changes in output values.
- Local interpolation to make predictions.
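
One way to see why local interpolation becomes harder in high dimensions is to ask how wide a "local" neighborhood must be to hold a fixed share of uniformly spread data. A minimal sketch, assuming data uniform in a unit hypercube (the `edge_for_fraction` helper is a hypothetical name):

```python
def edge_for_fraction(fraction, dims):
    """Edge length of a sub-cube holding `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / dims)

for dims in (1, 2, 10, 100):
    print(dims, round(edge_for_fraction(0.10, dims), 3))
```

To capture 10% of the data, the neighborhood must span about 0.1 of each axis in 1 dimension, about 0.316 in 2, about 0.794 in 10, and about 0.977 in 100. In high dimensions a "local" neighborhood therefore covers almost the whole range of every attribute, which is why reducing dimensionality matters.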

Datasets

Record Data
- Data Matrix
- Document data: a special type of data matrix where the attributes are of the same type and are asymmetric.
- Transaction data: a special type of record data. Each record involves a set of items. Most often, the attributes are binary, indicating whether or not an item was purchased.
Graph data
- World Wide Web, molecular structures (Simplified Molecular Input Line Entry System, SMILES)
Ordered data: sequence data, this is a sequence of individual entities, such as a sequence of words or letters.
- Spatial data
- Temporal data
- Sequential data

Data collection

Quality

Missing values: The data was not collected (e.g. age), or some attributes may not be applicable in all cases (e.g. annual income for children).
Empty values: Unlike missing values, an empty value is the one that has no actual value, whereas a missing value has an actual value but it is missing somehow.
Noise: The modification of actual values.
Outlier: A single or very low frequency occurrence of a value of an attribute that is far from the bulk of attribute values.
Duplicate data: The same data is recorded multiple times.
Inconsistent formats: When the same set of data appears in multiple tables from different inputs.

Data auditing

attributes
measured values
comments
attribute type
operations we can do
data type (knime/py)
missing value
any comments about qualities

attributes	measured values	comments	attribute type	operations we can do	Data type (knime/python)	missing value
fixed acidity	`[3.8, 15.9]`	continuous number	ratio	all arithmetic	float	N/A
volatile acidity	`[0.08, 1.58]`	continuous number	ratio	all arithmetic	float	N/A
citric acid	`[0, 1.66]`	continuous number	ratio	all arithmetic	float	N/A
residual sugar	`[0.6, 65.8]`	continuous number	ratio	all arithmetic	float	N/A
chlorides	`[0.009, 0.611]`	continuous number	ratio	all arithmetic	float	N/A
free sulfur dioxide	`[1, 289]`	continuous number	ratio	all arithmetic	int	N/A
total sulfur dioxide	`[6, 440]`	continuous number	ratio	all arithmetic	int	N/A
density	`[0.98711, 1.03898]`	continuous number	ratio	all arithmetic	float	N/A
pH	`[2.72, 4.01]`	continuous number	interval	order, arithmetic	float	N/A
sulphates	`[0.22, 2]`	continuous number	ratio	all arithmetic	float	N/A
alcohol	`[8, 14.9]`	continuous number	ratio	all arithmetic	float	N/A
quality	`[extremely dissatisfied, extremely satisfied, moderately dissatisfied, moderately satisfied, neutral, slightly dissatisfied, slightly satisfied]`	distributed	ordinal	order, counting	str	N/A
color	`[white, red]`	distributed	nominal	counting	str	N/A

Vocabulary for AI +004

August 14, 2025 · about 4 min read

Eunkwang Shin

Owner

Vocabulary & Expressions

Term/Expression	Definition	Simpler Paraphrase	Meaning
relax	to make a rule or control less severe	to make less strict or severe	완화하다, 느슨하게 하다
reconstruct	to build or form again	to rebuild	재구성하다
reside	to live in a place; to exist or be present	to live; to be located	위치하다, 존재하다
lay out	to arrange or plan something in a clear and organized way	to arrange	배치하다, 설계하다
resemble	to look like or be similar to someone or something	to look like	닮다, 유사하다
amnesia	a condition in which a person is unable to remember things	memory loss	기억상실증
vicinity	the area near or surrounding a particular place	nearby area	인근, 근처
schematically	in a way that represents the main features or relationships of something in a simple and clear form	in a simplified way	도식적으로
superimpose	to place or lay something over something else	to overlay	겹쳐 놓다, 중첩하다
plateaus	a state of little or no change following a period of activity or progress	a period of stability	정체기, 안정기, 고원
wander	to move around without a fixed course, aim, or goal	to roam	방황하다, 헤매다
consecutive	following continuously; in unbroken or logical sequence	sequential	연속적인
converge	to come together from different directions	to meet	수렴하다, 모이다
adage	a saying or proverb expressing a common truth	a wise saying	격언, 속담
porcupine	a large rodent with sharp quills on its back	a spiny animal	호저
stumble	to trip or lose balance while walking or running	to trip	비틀거리다, 넘어지다
metallurgy	the science and technology of metals	metal science	금속공학
crystalline	having the structure and form of a crystal	crystal-like	결정질의
crevice	a narrow opening or fissure	a crack	틈, 균열
bumpy	having an uneven or jolting surface	uneven	울퉁불퉁한
dislodge	to remove or force out from a position	to remove	제거하다, 떼어내다
exponentially	in a way that increases rapidly and significantly	rapidly	기하급수적으로
halt	to stop or pause something	to stop	중단하다, 멈추다
unfruitful	not producing good results	unproductive	결실이 없는
analogous	similar in some way	comparable	유사한
proportional	corresponding in size or amount to something else	relative	비례하는
retained	kept or continued to have	kept	유지된
in accordance with	following or obeying a rule, law, or wish	according to	~에 따라, ~에 일치하여
constitute	to be a part of something	to form	구성하다
permute	to change the order or arrangement of something	to rearrange	순열하다, 배열을 바꾸다
chromosome	a thread-like structure of nucleic acids and protein found in the nucleus of most living cells	genetic structure	염색체
auxiliary	providing supplementary or additional help and support	supplementary	보조의
discriminative	able to distinguish or differentiate	distinguishing	구별 가능한
exploit	to make full use of and benefit from something	to utilize	활용하다
perturbation	a small change or variation	a disturbance	교란
modulate	to adjust or alter the intensity or frequency of something	to adjust	조절하다, 변조하다
retrieval	the process of getting stored information from a computer	search	검색
leverage	to use something to maximum advantage	to utilize	활용하다
discrepancy	a difference or inconsistency	difference	불일치
heterogeneity	the quality or state of being diverse in character or content	diversity	이질성
pseudonymization	the process of replacing private identifiers with fake identifiers or pseudonyms	anonymization	가명화
denote	to be a sign of something	to signify	나타내다, 의미하다

FSD +003

August 11, 2025 · about 4 min read

Eunkwang Shin

Owner

Terminology

Software: A set of statements written in a programming language to perform tasks
Statement: A single instruction in a program that performs an action when executed.
Snippet: A block of statements.
Software Development: The process of creating a software program.
OOP: Program composed of interconnected objects at runtime.
Expression: A code component of a statement that can be evaluated to produce a value.
Assign: The process of storing the result (a value) of one or more expressions.
Value: A data item (literal or computed) that is stored in a variable.
Compiler: A special program that translates a programming language's source code into machine code.
- Compilers complete the conversion process all at once after changes are made to the code and before the code is executed
Interpreter: A computer program that directly executes code without requiring it to be previously compiled into machine language.
- Interpreters complete the conversion process one step at a time while the code is being executed.

Software development

Software development process is an iterative approach.

java
- javac Welcome.java: Compiles the Java source file Welcome.java into a bytecode class file (Welcome.class).
- java Welcome: Executes the Java program Welcome.
python
- python welcome.py: Executes the Python script welcome.py.

OOP

Object: An object is a thing, tangible and intangible. An object has fields that contain the data and methods to access and modify the data.
Class: A class is an abstract definition of objects. A class is a template or blueprint that defines what data and methods are included in objects.
Method: A block of code grouped together to perform an operation. A method has a name, parameters, and a return type.
Field: A field is a data attribute of an object. A field value is exposed using object methods.
Organizing code into classes improves modularity, reusability, extendability, and scalability.

Java vs Python

Identifier type	Java	Python
Class	Use CamelCase for multi-word classes	Use CapWords (CamelCase) for classes, per PEP 8
Function	use verbs or verb phrases	use lowercase_with_underscores
Procedure	use verbs or verb phrases	use lowercase_with_underscores
Variable	camelCase	lowercase_with_underscores
Constant	All uppercase words separated by underscores	All uppercase words separated by underscores
Package	Lowercase words separated by dots	Short, all-lowercase names (underscores are discouraged)

Java uses the toString() method to return an object's information.
Python can refer to attributes directly or use the __str__() method to return an object's information.
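
A short Python sketch of both options: direct attribute access, and a `__str__` method that plays the role of Java's `toString()`. The `Car` class and its fields are hypothetical examples.

```python
class Car:
    def __init__(self, model, horsepower):
        self.model = model
        self.horsepower = horsepower

    def __str__(self):
        # Called by str(obj) and print(obj), like toString() in Java.
        return f"Car(model={self.model}, horsepower={self.horsepower})"

car = Car("Welcome", 130)
print(car.model)  # direct attribute access: Welcome
print(car)        # uses __str__: Car(model=Welcome, horsepower=130)
```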

Data types

Data Type	Size	Default value	Description
byte	1 byte	0	8-bit signed integer
short	2 bytes	0	16-bit signed integer
int	4 bytes	0	32-bit signed integer
long	8 bytes	0	64-bit signed integer
float	4 bytes	0.0f	32-bit floating point
double	8 bytes	0.0d	64-bit floating point
boolean	1 bit	false	true or false
char	2 bytes	'\u0000'	16-bit Unicode character

Non-Primitive Data Types

Non-primitive: Arrays, Classes, Interfaces, and Strings.
Non-primitive data types are by default set to null in Java, None in Python.

Variables

Static: enables the variable to be used without creating an object of its defining class.
Final: makes the variable unchangeable.

Operators

Operator Category	Java	Python
Unary	expr++ expr--
	++expr --expr +expr -expr	+expr -expr
Arithmetic	`* / %`	`* / %`
	`+ -`	`+ -`
Relational	`< > <= >=`	`< > <= >=`
	`== !=`	`== !=`
Logical	`! &&`	`not and`
	`||`	`or`
Ternary	`<expr1> ? <expr2> : <expr3>`	`<expr2> if <expr1> else <expr3>`
Assignment	`= += -= *= /= %=`	`= += -= *= /= %=`
Identity/Membership		`is is not in not in`

Java: boolean q = (5 % 2 != 2) ? true : false
Python: q = True if (5 % 2 != 2) else False

Standard Input

import java.util.Scanner;

public class Inputs {
  static Scanner in = new Scanner(System.in);

  public static void main(String[] args) {
    System.out.print("X = ");
    int x = in.nextInt();
    System.out.println("x squared = " + Math.pow(x, 2));
  }
}

x = int(input("x = "))

print("x squared =", pow(x, 2))

String

String (java)

Immutable

String s1 = "Hello";: initialize using literal syntax
String s2 = new String("Hello");: initialize using a constructor

s1 == s2 // false (different object references)
s1.equals(s2) // true

String Format (Python)

Symbol	Meaning	Example code	Output
`<`	Left align	`f'[{42:<5}]'`	`[42   ]`
`>`	Right align	`f'[{42:>5}]'`	`[   42]`
`^`	Center align	`f'[{42:^5}]'`	`[ 42  ]`
`<` with fill char	Left align with custom fill	`f'[{42:-<5}]'`	`[42---]`
`>` with fill char	Right align with custom fill	`f'[{42:->5}]'`	`[---42]`
`^` with fill char	Center align with custom fill	`f'[{42:-^5}]'`	`[-42--]`
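
The alignment options can be checked directly in Python. A small sketch follows; the dictionary keys are only labels chosen for this example.

```python
examples = {
    "left": f"[{42:<5}]",
    "right": f"[{42:>5}]",
    "center": f"[{42:^5}]",
    "left fill": f"[{42:-<5}]",
    "right fill": f"[{42:->5}]",
    "center fill": f"[{42:-^5}]",
}

for label, text in examples.items():
    # repr() makes the padding spaces visible in the output.
    print(f"{label:<12}{text!r}")
```

With width 5 and a two-character value, three padding characters are added. For center alignment the extra character goes on the right, giving `[ 42  ]` and `[-42--]`.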

Array

Array (java)

int[] x = {2, 4, -1, 11, 3};

Declaration: int[] x
Instantiation: x = new int[5];
Initialization: x[0] = 2; x[1] = 4; x[2] = -1;

IAI +002

August 9, 2025 · about 10 min read

Eunkwang Shin

Owner

Environment

All possible states and information about how the states are related.
The costs from one state to each of its adjacent states are also given.

Agent

A simulated intelligence that knows which state it is in.
If it takes an action at a given state, it knows the next state and the corresponding cost.

Characteristics of the environment

Fully Observable: The agent always knows the current state of the environment at each point in time.
Deterministic: The next state of the environment is completely determined by the current state and the action taken by the agent.
Static: The environment is unchanged.
Discrete: A limited number of distinct, clearly defined actions.
Single agent: An agent operating by itself in an environment.

Search problem

Finding a path from a starting point to a goal point in a space.

The initial state
State space: The environment or area where the search takes place
A set of actions: The possible actions that the agent can take in each state.
- ACTIONS(s)
A transition model:
- takes in a state and an action.
- returns the successor state, which is any state reachable from doing action a in state s.
- RESULT(s, a)
A goal state:
- The target location or position that needs to be reached.
- represented by a goal test function
A path cost function:
- The cost associated with a particular path taken through the state space.
- c(s1, a, s2)
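
The components above map naturally onto a small class. This is a minimal sketch, assuming states are city names, an action is "move to a neighboring city", and step costs are stored in a nested dictionary. The `SearchProblem` class, its method names, and the cost values are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchProblem:
    initial_state: str
    goal_state: str
    graph: dict  # state -> {neighbor state: step cost}

    def actions(self, s):
        """ACTIONS(s): the moves available in state s (here: neighbor names)."""
        return list(self.graph.get(s, {}))

    def result(self, s, a):
        """RESULT(s, a): the action 'go to a' leads to state a."""
        return a

    def is_goal(self, s):
        """Goal test."""
        return s == self.goal_state

    def step_cost(self, s1, a, s2):
        """c(s1, a, s2): cost of reaching s2 from s1 via action a."""
        return self.graph[s1][s2]

# Illustrative road map fragment; the costs are example values.
roads = {
    "Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
    "Timisoara": {"Arad": 118, "Lugoj": 111},
}
problem = SearchProblem("Arad", "Lugoj", roads)

path = ["Arad", "Timisoara", "Lugoj"]
path_cost = sum(problem.step_cost(s1, s2, s2) for s1, s2 in zip(path, path[1:]))
print(problem.actions("Arad"), path_cost)
# ['Sibiu', 'Timisoara', 'Zerind'] 229
```

Keeping the problem definition separate from the search algorithm means BFS, DFS, or any other strategy can be run against the same object.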

Frontier

A set of nodes that are under consideration to be expanded.
The set of leaf nodes in the search spanning tree that are available for expansion at any given step.
A search algorithm determines how to choose a node in the Frontier to grow the search spanning tree.

Search Algorithm

Tree Search vs Graph Search

Explored Set

The frontier in graph search separates the search-space graph into two regions, the explored region and the unexplored region, so that every path from the initial state to an unexplored state has to pass through a state in the frontier.

Performance measures

Completeness
Cost Optimality
Time complexity
Space complexity

BFS

Queue

BFS Tree

from collections import deque

def bfs_tree(start, goal_test, successors):
    """
    start: start state
    goal_test(s): goal test function -> bool
    successors(s): returns the list of states reachable from state s

    Returns: a path (list) that reaches the goal, or None
    (Tree search: no explored set / duplicate check)
    """
    if goal_test(start):
        return [start]

    # node = (state, parent_index)
    nodes = [(start, None)]
    frontier = deque([0])  # the queue stores indices into nodes

    while frontier:
        parent_idx = frontier.popleft()
        parent_state, _ = nodes[parent_idx]

        for nxt in successors(parent_state):
            nodes.append((nxt, parent_idx))
            child_idx = len(nodes) - 1

            if goal_test(nxt):
                # Reconstruct the path by following parent indices
                path, i = [], child_idx
                while i is not None:
                    path.append(nodes[i][0])
                    i = nodes[i][1]
                return list(reversed(path))

            frontier.append(child_idx)

    return None

from collections import deque

def bfs_graph(start, goal_test, successors):
    """
    start: start state (e.g., 'Arad')
    goal_test(s): returns True if s is a goal
    successors(s): returns (next state, cost) pairs or just a list of next states
                   for state s; a plain list of next states is assumed below
    Returns: the path list start -> ... -> goal, or None if there is none
    """
    # node = (state, parent_index)
    frontier = deque([(start, None)])   # FIFO queue
    frontier_states = {start}           # states on the frontier (prevents duplicates)
    explored = set()                    # states already expanded (closed set)

    # Store every node in a separate list so the path can be reconstructed
    nodes = [(start, None)]             # nodes[i] = (state, parent_index)
    index_in_queue = deque([0])         # index of each frontier entry (= index into nodes)

    if goal_test(start):
        return [start]

    while frontier:
        state, parent = frontier.popleft()
        node_idx = index_in_queue.popleft()
        frontier_states.discard(state)
        explored.add(state)

        for nxt in successors(state):
            if (nxt not in explored) and (nxt not in frontier_states):
                # Store the child node
                nodes.append((nxt, node_idx))
                child_idx = len(nodes) - 1

                if goal_test(nxt):
                    # Reconstruct the path by following parent indices
                    path = []
                    i = child_idx
                    while i is not None:
                        path.append(nodes[i][0])
                        i = nodes[i][1]
                    return list(reversed(path))

                # Push onto the frontier
                frontier.append((nxt, node_idx))
                index_in_queue.append(child_idx)
                frontier_states.add(nxt)

    return None

graph = {
    "Arad": ["Sibiu", "Timisoara", "Zerind"],
    "Sibiu": ["Arad", "Fagaras"],
    "Timisoara": ["Arad", "Lugoj"],
    "Zerind": ["Arad"],
    "Fagaras": [],
    "Lugoj": []
}

path = bfs_graph(
    start="Arad",
    goal_test=lambda s: s == "Lugoj",
    successors=lambda s: graph.get(s, [])
)
print(path)  # ['Arad', 'Timisoara', 'Lugoj']

BFS always holds the shallowest path to every node on the frontier, so the first solution it finds uses the fewest steps.
It is memory-intensive because it keeps every generated node in memory.

DFS

Stack

def depth_first_search(initial_state, goal_test, actions):
    """
    initial_state: initial state
    goal_test(s): returns True if state s is a goal
    actions(s): returns the list of states reachable from s
    Returns: the path [start, ..., goal] as a list, or None
    """

    # Every node ever generated: nodes[i] = (state, parent_index)
    nodes = [(initial_state, None)]

    # frontier: LIFO stack (stores only node indices)
    frontier = [0]

    # States currently on the frontier (prevents duplicate pushes)
    stacked_states = {initial_state}

    # explored: states that have already been expanded (children generated)
    explored = set()

    # If the initial state is already a goal, return immediately
    if goal_test(initial_state):
        return [initial_state]

    # Main DFS loop: an empty frontier means failure
    while frontier:
        # Pop the node on top of the stack
        node_idx = frontier.pop()
        state, parent_idx = nodes[node_idx]

        # The state leaves the frontier and is marked as expanded.
        # Marking it before generating children keeps a self-loop
        # from pushing the same state again.
        stacked_states.discard(state)
        explored.add(state)

        # Look at every child state reachable from the current state
        for child_state in actions(state):
            # Only handle children that are neither explored nor on the frontier
            if (child_state not in explored) and (child_state not in stacked_states):
                # Store the new node (its parent is the current node)
                nodes.append((child_state, node_idx))
                child_idx = len(nodes) - 1

                # If the child is a goal, reconstruct and return the path
                if goal_test(child_state):
                    path, i = [], child_idx
                    while i is not None:
                        path.append(nodes[i][0])
                        i = nodes[i][1]
                    return list(reversed(path))

                # Otherwise push it onto the stack
                frontier.append(child_idx)
                stacked_states.add(child_state)

    return None

Low memory usage compared with BFS.
Can get stuck in deep or infinite branches, and it is not cost-optimal: it returns the first solution it finds, not the cheapest one.

UCS

Priority Queue

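Best-first search with the evaluation function f(n) = g(n): it always expands the frontier node with the lowest path cost.
Uniform-cost search is complete and cost-optimal, provided every step cost is at least some ε > 0.
Dijkstra's algorithm finds the shortest path from the root node to every other node in a graph with non-negative edge weights.
Uniform-cost search can be viewed as a variant of Dijkstra's algorithm that searches for a path to a goal state instead of computing the shortest path to every node.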

import heapq

def uniform_cost_search(initial_state, goal_test, actions, step_cost):
    """
    initial_state: initial state
    goal_test(s): returns True if state s is a goal
    actions(s): returns the list of states reachable from s
    step_cost(s, s_next): cost of moving from s to s_next (assumed positive)

    Returns: the path [start, ..., goal] as a list, or None
    """

    # Every node ever generated: nodes[i] = (state, parent_idx, path_cost)
    nodes = [(initial_state, None, 0.0)]

    # frontier: min-heap ordered by path cost (entries: (cost, node_idx))
    frontier = [(0.0, 0)]

    # Cheapest known cost of each state currently on the frontier
    # (used for membership tests and cost comparisons)
    frontier_costs = {initial_state: 0.0}

    # explored: states that have already been expanded
    explored = set()

    # If the initial state is already a goal, return immediately
    if goal_test(initial_state):
        return [initial_state]

    while frontier:
        # Pop the node with the lowest path cost
        cost, node_idx = heapq.heappop(frontier)
        state, parent_idx, path_cost = nodes[node_idx]

        # Skip stale heap entries: the state was already expanded, or a
        # cheaper entry for the same state has been pushed since
        if state in explored or cost != frontier_costs.get(state):
            continue

        # Late goal test (right after the pop, as in the pseudocode)
        if goal_test(state):
            # Reconstruct the path by following parent indices
            path = []
            i = node_idx
            while i is not None:
                path.append(nodes[i][0])
                i = nodes[i][1]
            return list(reversed(path))

        # Mark the state as explored and drop it from the frontier bookkeeping
        explored.add(state)
        frontier_costs.pop(state, None)

        # Generate a child for each action available in this state
        for nxt in actions(state):
            new_cost = path_cost + step_cost(state, nxt)

            in_explored = (nxt in explored)
            in_frontier = (nxt in frontier_costs)

            # (1) In neither explored nor frontier: insert as a new node
            if not in_explored and not in_frontier:
                nodes.append((nxt, node_idx, new_cost))
                child_idx = len(nodes) - 1
                heapq.heappush(frontier, (new_cost, child_idx))
                frontier_costs[nxt] = new_cost

            # (2) Already on the frontier but reached more cheaply: "replace" it
            elif in_frontier and new_cost < frontier_costs[nxt]:
                nodes.append((nxt, node_idx, new_cost))
                child_idx = len(nodes) - 1
                heapq.heappush(frontier, (new_cost, child_idx))
                # Record the new best cost; the older heap entry is skipped
                # later because its cost no longer matches
                frontier_costs[nxt] = new_cost

    # Empty frontier: failure
    return None

Greedy Best First Search

f(n) = h(n)
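$h(n) = h_{SLD}(n)$, where $SLD$ stands for Straight-Line Distance
It expands the node with the lowest $h(n)$ value at each step and ignores the path cost $g(n)$, so the path it returns is not guaranteed to be the cheapest.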

from heapq import heappush, heappop

def gbfs_path(G, start, goal, heuristic):
    """
    Greedy Best-First Search (GBFS)
    G: adjacency-list dict, G[u] = list/iterable of neighbors
    heuristic(x, goal): estimated distance h(x)
    Returns: the path [start, ..., goal] as a list, or None
    """

    # Priority queue entries: (h(state), state, path)
    pq = []
    heappush(pq, (heuristic(start, goal), start, [start]))

    visited = set()          # nodes already popped and expanded (prevents revisits)
    in_frontier = {start}    # nodes currently in the queue (prevents duplicate pushes)

    while pq:
        # Pop the node with the smallest heuristic value
        _, vertex, path = heappop(pq)
        in_frontier.discard(vertex)

        # Skip nodes that were already expanded
        if vertex in visited:
            continue
        visited.add(vertex)

        # Return the path once the goal is reached
        if vertex == goal:
            return path

        # Push neighbors; the heap orders them by heuristic value
        for neighbor in G.get(vertex, []):
            if neighbor in visited or neighbor in in_frontier:
                continue
            heappush(pq, (heuristic(neighbor, goal), neighbor, path + [neighbor]))
            in_frontier.add(neighbor)

    return None

A* Search

f(n) = g(n) + h(n)
The most common informed search algorithm.
The tree-search version of A* is optimal if h(n) is an admissible heuristic.
The graph-search version is optimal if h(n) is consistent.

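import heapq
from itertools import count

import networkx as nx

def astar_path(G, start, goal):
    """
    Find a path from start to goal using A* Search.
    G: NetworkX graph whose edges carry a 'weight' attribute
    start: start node
    goal: goal node

    Assumes `heuristic(a, b)` and `cities` (node -> coordinates) are
    defined elsewhere; neither is defined in this snippet.
    """
    tie = count()  # tie-breaker so the heap never has to compare paths
    # Heap entries: (f, tie, node, path); the start node enters with f = 0
    pq = [(0, next(tie), start, [start])]
    visited = set()

    while pq:
        _, _, vertex, path = heapq.heappop(pq)

        # Skip nodes that were already expanded
        if vertex in visited:
            continue
        visited.add(vertex)

        # Return the path once the goal is reached
        if vertex == goal:
            return path

        # Examine the neighbors
        for neighbor in G[vertex]:
            if neighbor in visited:
                continue
            new_path = path + [neighbor]
            # g(n): actual cost of the path so far
            g_cost = nx.path_weight(G, new_path, 'weight')
            # h(n): heuristic estimate of the remaining cost to the goal
            h_cost = heuristic(cities[neighbor], cities[goal])
            f_cost = g_cost + h_cost

            heapq.heappush(pq, (f_cost, next(tie), neighbor, new_path))

    return None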
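The only difference between uniform-cost search, greedy best-first search, and A* is the evaluation function f(n). The sketch below makes that explicit with one generic best-first routine that takes f as a parameter. The function name `best_first`, the four-node graph, and the heuristic table `h` are made up for illustration and are not part of the code above.

```python
import heapq
from itertools import count

def best_first(graph, start, goal, f):
    """Generic best-first graph search.

    graph: dict mapping node -> {neighbor: step_cost}
    f(g, node): priority of a node reached with path cost g
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    tie = count()  # tie-breaker so the heap never compares paths
    frontier = [(f(0, start), next(tie), 0, start, [start])]
    explored = set()

    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node in explored:
            continue  # a better-ranked entry for this node was expanded already
        explored.add(node)
        if node == goal:
            return path, g
        for nxt, cost in graph[node].items():
            if nxt not in explored:
                new_g = g + cost
                heapq.heappush(
                    frontier, (f(new_g, nxt), next(tie), new_g, nxt, path + [nxt])
                )
    return None, float("inf")

# Hypothetical directed graph: S->A->G costs 11, S->B->G costs 6
graph = {"S": {"A": 1, "B": 4}, "A": {"G": 10}, "B": {"G": 2}, "G": {}}
h = {"S": 2, "A": 1, "B": 2, "G": 0}  # hand-picked consistent heuristic

greedy = best_first(graph, "S", "G", lambda g, n: h[n])       # f(n) = h(n)
astar = best_first(graph, "S", "G", lambda g, n: g + h[n])    # f(n) = g(n) + h(n)
ucs = best_first(graph, "S", "G", lambda g, n: g)             # f(n) = g(n)

print(greedy)  # (['S', 'A', 'G'], 11)
print(astar)   # (['S', 'B', 'G'], 6)
print(ucs)     # (['S', 'B', 'G'], 6)
```

Greedy search is drawn to A because h(A) is the smallest estimate, and ends up with the more expensive route; A* and uniform-cost search both account for g(n) and return the cheaper one.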

Admissibility

$h(n) \leq h^*(n)$, where $h^*(n)$ is the true cost of the cheapest path from $n$ to the goal
An admissible heuristic never overestimates the cost to reach the goal.
The straight-line distance between a node and the goal is an admissible heuristic: no route can be shorter than the straight line, so the estimate never exceeds the actual distance.
With an admissible heuristic, the tree-search version of A* is cost-optimal.
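On a small explicit graph, admissibility can be checked directly by computing $h^*(n)$ for every node and comparing it with $h(n)$. The sketch below does this with Dijkstra's algorithm run backwards from the goal. The helper names `true_costs_to_goal` and `is_admissible`, the graph, and both heuristic tables are illustrative assumptions.

```python
import heapq

def true_costs_to_goal(graph, goal):
    """Return h*(n) for every node that can reach the goal.

    graph: dict mapping node -> {neighbor: step_cost} (directed, non-negative costs)
    Runs Dijkstra's algorithm from the goal over the reversed edges.
    """
    reverse = {n: {} for n in graph}
    for u, edges in graph.items():
        for v, c in edges.items():
            reverse.setdefault(v, {})[u] = c

    dist = {goal: 0}
    heap = [(0, goal)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry
        for v, c in reverse.get(u, {}).items():
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def is_admissible(graph, h, goal):
    """True if h(n) <= h*(n) for every node (unreachable goal counts as infinite cost)."""
    h_star = true_costs_to_goal(graph, goal)
    return all(h[n] <= h_star.get(n, float("inf")) for n in graph)

# Hypothetical example graph and heuristics
graph = {"S": {"A": 1, "B": 4}, "A": {"G": 10}, "B": {"G": 2}, "G": {}}
h_ok = {"S": 2, "A": 1, "B": 2, "G": 0}
h_bad = {"S": 2, "A": 1, "B": 3, "G": 0}   # overestimates at B, since h*(B) = 2

print(true_costs_to_goal(graph, "G"))  # h*: G=0, B=2, S=6, A=10
print(is_admissible(graph, h_ok, "G"))   # True
print(is_admissible(graph, h_bad, "G"))  # False
```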

Consistency

$h(n) \leq c(n, a, n') + h(n')$
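h(n) is consistent if, for every node n and every successor n' reached by action a, the estimate at n is no greater than the step cost c(n, a, n') plus the estimate at n'. This is a form of the triangle inequality.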

Admissible vs Consistent

Consistent ⇒ Admissible (every consistent heuristic is also admissible)
Admissible ⇏ Consistent (the converse does not hold)
The tree search version of A* is optimal if h(n) is admissible
The graph search version of A* is optimal if h(n) is consistent
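A small counterexample shows why the converse fails. The sketch below checks the consistency inequality on every edge of a hand-made graph; the helper name `is_consistent`, the graph, and the heuristic tables are assumptions for illustration, and the true costs `h_star` were worked out by hand for this graph.

```python
def is_consistent(graph, h, goal):
    """True if h(goal) == 0 and h(n) <= c(n, n') + h(n') holds on every edge.

    graph: dict mapping node -> {neighbor: step_cost}
    """
    if h[goal] != 0:
        return False
    return all(
        h[n] <= cost + h[nxt]
        for n, edges in graph.items()
        for nxt, cost in edges.items()
    )

# Hypothetical directed graph: S->A->G costs 11, S->B->G costs 6
graph = {"S": {"A": 1, "B": 4}, "A": {"G": 10}, "B": {"G": 2}, "G": {}}
h_star = {"S": 6, "A": 10, "B": 2, "G": 0}   # true costs to G, computed by hand

h_consistent = {"S": 2, "A": 1, "B": 2, "G": 0}
h_admissible_only = {"S": 5, "A": 1, "B": 2, "G": 0}

# Both heuristics are admissible: neither exceeds h* anywhere
admissible = all(h_admissible_only[n] <= h_star[n] for n in graph)

print(is_consistent(graph, h_consistent, "G"))       # True
print(admissible)                                    # True
print(is_consistent(graph, h_admissible_only, "G"))  # False
```

The second heuristic fails on the edge S→A: h(S) = 5 is greater than c(S, A) + h(A) = 1 + 1 = 2, even though h(S) is still below the true cost of 6.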

Summary

Measure / Criteria	BFS	DFS	Uniform Cost	A*
Complete?	Yes	No	Yes	Yes
Time complexity	$O(b^d)$	$O(b^m)$	$O\left(b^{1 + \lfloor C^* / \epsilon \rfloor}\right)$	$O(b^d)$
Space complexity	$O(b^d)$	$O(bm)$	$O\left(b^{1 + \lfloor C^* / \epsilon \rfloor}\right)$	$O(b^d)$
Cost optimal?	Yes	No	Yes	Yes

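$b$ is the branching factor, $d$ is the depth of the shallowest solution, $m$ is the maximum depth of the search tree, $C^*$ is the cost of the optimal solution, and $\epsilon$ is the smallest positive cost of any single step (edge) in the search problem.
BFS is cost-optimal only when all step costs are equal, and the A* column assumes an admissible heuristic.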
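To get a feel for the gap between these bounds, the figures can be plugged in for a small problem. The sketch below uses arbitrary example values (b = 10, d = m = 5, C* = 12, ε = 3); the helper names are made up for this illustration, and the counts are worst-case bounds rather than measurements.

```python
from math import floor

def bfs_nodes(b, d):
    """Worst-case number of nodes BFS generates down to depth d: 1 + b + ... + b^d."""
    return sum(b ** k for k in range(d + 1))

def dfs_frontier(b, m):
    """Order-of-magnitude size of the DFS frontier: about b nodes per level, m levels."""
    return b * m

def ucs_exponent(c_star, eps):
    """Exponent in the uniform-cost bound O(b^(1 + floor(C*/eps)))."""
    return 1 + floor(c_star / eps)

print(bfs_nodes(10, 5))      # 111111
print(dfs_frontier(10, 5))   # 50
print(ucs_exponent(12, 3))   # 5
```

With the same branching factor and depth, BFS may have to hold on the order of 10^5 nodes while the DFS frontier stays at a few dozen, which is the memory trade-off the table summarizes.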
