Blog | gracefullight.dev

Research methodology

September 10, 2025 · One min read

Owner

Quantitative and Qualitative Methods

Category	Sub Category	Quantitative	Qualitative
Requirement	Question	Hypothesis	Interest
	Method	Control and randomization	Curiosity and reflexivity
	Data collection	Response	Viewpoint
	Outcome	Dependent variable	Accounts
Ideal	Data	Numerical	Textual
	Sample size	Large (power)	Small (saturation)
	Context	Eliminated	Highlighted
	Analysis	Rejection of the null hypothesis	Synthesis

Types of Research Methodology

Historical: Qualitative
Comparative: Qualitative
Descriptive: Qualitative
Correlation: Quantitative
Experimental: Quantitative
Evaluation: Qualitative
Action: Qualitative
Ethnographic: Various (not quantitative)
Ethnogenic: Various (not quantitative)
Feminist/Identity Politics: Various (not quantitative)
Cultural: Various (not quantitative)

Common data collection methods

Qualitative data collection methods

Observations: recording what you have seen, heard, or encountered in detailed field notes
Interviews: asking people questions in one-on-one conversations
Focus groups: asking questions and generating discussion among a group of people
Surveys: distributing questionnaires with open-ended questions
Secondary research: collecting existing data in the form of texts, images, audio or video recordings, etc.

Quantitative data collection methods

Experiments
Computer Simulation and Agent-Based Models
Controlled observations
Surveys: paper, kiosk, mobile, questionnaires
Longitudinal studies
Polls and Telephone interviews
Face-to-face interviews

FDA +006

September 9, 2025 · 9 min read

Gracefullight

Owner

Unsupervised machine learning

Item	Supervised machine learning	Unsupervised machine learning
Data availability	Input and output variables will be given.	Only the input data will be given.
Labeling	Algorithms are trained using labelled data.	Algorithms are used against data which is not labelled.
Algorithms	Support Vector Machine, Linear and Logistic Regression and Classification Trees.	Cluster algorithms, K-means, Hierarchical clustering, etc.
Complexity	Generally a simpler method.	Often more computationally complex.
Learning mode	Learning typically takes place offline.	Learning can take place in real time.
Reliability	Typically more accurate, since results can be checked against labels.	Typically less accurate and harder to validate, since there are no labels to check against.

Processing data

most common tasks are clustering, anomaly detection, and neural networks.
infer underlying patterns without human supervision or intervention and enable us to discover both the differences and similarities in a dataset.
can be considered ideal solutions for exploratory data mining.

Clustering

objects (unlabelled data) are organised into groups, where the members of each group are similar in some way to each other and less similar to those in other groups.

Classification assigns objects/data to the predefined (labelled) classes
Clustering groups the objects/data based on the similarities between them
used in pattern recognition, image analysis and bioinformatics.
different clustering algorithms can produce different results based on their own definition of a cluster
the parameters (such as the distance function, density threshold and the number of expected clusters) of the clustering algorithm should be set based on the particular characteristics of the dataset and the user’s intention

Domain	Use cases
Biology and bioinformatics	Cluster algorithms have been used in biological systematics for comparing the genus differences in organisms.
Medicine	Cluster analysis can be used to detect underlying factors of particular diseases, such as coronary artery disease. It is also used to describe patterns of antibiotic resistance.
Market basket	Cluster analysis has gained increasing popularity in market research. It can be used to classify different groups of consumers by behaviour analysis. It helps to build a better understanding of market segmentation, pricing and new product testing.
Computer science	Clustering is a powerful tool for various tasks in computer science, such as reforming functionality in software evolution, object recognition in computer vision and lexical ambiguity in natural language processing.
Car insurance	Identify customer groups with high average claim costs.

Similarity measure: a numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity, or distance measure: a numerical measure of how different two data objects are; it ranges from 0 (objects are alike) to $\infty$ (objects are completely different).
Proximity: Refers to a similarity or dissimilarity

Distance measures

Distance metrics or dissimilarity measures

These measures quantify the proximity or distance between data points and help determine whether they can be clustered together.
Manhattan distance: the distance between two vectors if movement is restricted to right angles (along the axes).
- $Dist(A, B) = \sum_{i} |a_{i} - b_{i}|$
- no diagonal movement is involved in calculating the distance.
Euclidean distance: the length of the straight line segment connecting two points.
- $Dist(A, B) = \sqrt{\sum_{i} (a_{i} - b_{i})^{2}}$
- calculated from the Cartesian coordinates of the points using the Pythagorean theorem.
- Typically, one needs to normalize the data before using this distance measure.
- The higher the dimensionality of the data, the less useful Euclidean distance becomes.
Cosine similarity: the cosine of the angle between two vectors.
- $Sim(A, B) = \frac{\sum_{i} a_{i} b_{i}}{\sqrt{\sum_{i} a_{i}^{2}} \sqrt{\sum_{i} b_{i}^{2}}}$
- a way to counteract Euclidean distance’s problem with high dimensionality.
- equals the inner product of the two vectors after both are normalized to length one.
- The magnitude of the vectors is not taken into account, merely their direction.
  - In practice, this means that differences in values are not fully taken into account.
Single link: the shortest distance between points in two clusters.
Complete link: the largest distance between points in two clusters.
Average link: the average distance between points in two clusters.
Centroid: the distance between the centroids of two clusters.
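
As a rough illustration of the point-to-point measures above, the following sketch implements Manhattan distance, Euclidean distance and cosine similarity with the standard library only. The function names and the sample vectors are assumptions made for this example.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences (no diagonal movement)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Length of the straight segment between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; magnitude is ignored."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
scaled_a = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(manhattan(a, b))                 # 3 + 4 + 0 = 7.0
print(euclidean(a, b))                 # sqrt(9 + 16 + 0) = 5.0
print(cosine_similarity(a, scaled_a))  # approximately 1.0
```

The last call shows why cosine similarity ignores magnitude: `scaled_a` is far from `a` in Euclidean terms, yet the two vectors point in the same direction.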

Weighted distance measures

$Dist(A, B) = \sqrt{\sum_{i} w_{i} (a_{i} - b_{i})^{2}}$

Assign a weight $w_i$ to each attribute, since some attributes are more important than others.
Weights force the clustering to pay more attention to higher-weight attributes and form clusters that depend more on those heavily weighted attributes.
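
A minimal sketch of the weighted Euclidean distance, assuming plain Python lists for the two points and the weights. The function name and the sample values are made up for illustration.

```python
import math


def weighted_euclidean(a, b, w):
    """Euclidean distance where attribute i contributes w[i] * (a[i] - b[i])**2."""
    return math.sqrt(sum(wi * (x - y) ** 2 for x, y, wi in zip(a, b, w)))


a = [0.0, 0.0]
b = [3.0, 4.0]

unweighted = weighted_euclidean(a, b, [1.0, 1.0])     # sqrt(9 + 16) = 5.0
first_attr_only = weighted_euclidean(a, b, [4.0, 0.0])  # sqrt(4 * 9) = 6.0

print(unweighted, first_attr_only)
```

With weights `[4.0, 0.0]` the second attribute no longer influences the distance at all, which is the extreme case of steering clusters toward the heavily weighted attribute.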

Dissimilarity

Simple matching coefficient (SMC): invariant; used when the binary variable is symmetric.
- $d(i, j) = \frac{b + c}{a + b + c + d}$
  - the proportion of mismatches (b+c) out of all attributes (a+b+c+d), where a counts the attributes that are 1 in both objects and d counts the attributes that are 0 in both.
- The simple matching coefficient is used when 0 and 1 are equally important, treating matches of both 1s and 0s the same way.
Jaccard coefficient: non-invariant; used when the binary variable is asymmetric.
- $d(i, j) = \frac{b + c}{a + b + c}$
  - ignores cases where both are 0 (d), and only considers mismatches relative to the attributes with at least one positive case.
- The Jaccard coefficient is used when 1 (presence) is more meaningful than 0 (absence).
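
The two coefficients can be sketched as follows. The contingency counts are assumed to mean: a = both 1, b = first object 1 and second 0, c = first object 0 and second 1, d = both 0, which is consistent with the descriptions above but is not spelled out in these notes. Function names and sample vectors are illustrative.

```python
def contingency(x, y):
    """Count the four match/mismatch cases for two binary vectors."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def smc_dissimilarity(x, y):
    """Mismatches over all attributes (symmetric binary variables)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(x, y):
    """Mismatches over attributes with at least one 1 (asymmetric variables).

    Undefined (division by zero) when both vectors are all zeros.
    """
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)


x = [1, 0, 1, 0, 0, 0]
y = [1, 1, 0, 0, 0, 0]

print(contingency(x, y))            # (1, 1, 1, 3)
print(smc_dissimilarity(x, y))      # 2 / 6
print(jaccard_dissimilarity(x, y))  # 2 / 3
```

The shared zeros make the two objects look more alike under SMC than under Jaccard, which is why Jaccard is preferred when absence carries little information.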

Similarity Matrix

After calculating all distances, we can create a similarity (distance) matrix containing the distance between each pair of data points.

ID	Gender	Age	Salary
1	M	45	45000
2	F	32	54000
3	F	23	32000
4	M	36	58000

Gender: binarized
Age: normalized
Salary: normalized

ID	Gender	Age	Salary
1	1	1	0.25
2	0	0.6	0.7
3	0	0	0
4	1	0.7	0.8

$dist(ID2, ID3) = \sqrt{(0-0)^2 + (0.6-0)^2 + (0.7-0)^2} = \sqrt{0.85} \approx 0.92$
$dist(ID2, ID4) = \sqrt{(0-1)^2 + (0.6-0.7)^2 + (0.7-0.8)^2} = \sqrt{1.02} \approx 1.01$
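
To build the full matrix rather than individual entries, one possible sketch is shown below. It copies the binarized and normalized rows from the table above and assumes unweighted Euclidean distance, as in the two worked entries. The variable names are illustrative.

```python
import math

# Rows from the normalized table: (gender, age, salary) keyed by ID.
rows = {
    1: (1, 1.0, 0.25),
    2: (0, 0.6, 0.7),
    3: (0, 0.0, 0.0),
    4: (1, 0.7, 0.8),
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


ids = sorted(rows)
# matrix[(i, j)] holds the distance between object i and object j.
matrix = {(i, j): euclidean(rows[i], rows[j]) for i in ids for j in ids}

# Print one row of the matrix per object, rounded to two decimals.
for i in ids:
    print(i, [round(matrix[(i, j)], 2) for j in ids])
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper or lower triangle needs to be stored. The entries for the pairs (2, 3) and (2, 4) correspond to the two distances worked out by hand.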

Clustering methodologies

Hierarchical approach: create trees of clusters and sub-clusters
- Divisive (Top-down): Start with all examples in a single cluster, and decide how to break the cluster into multiple sub-clusters.
- Agglomerative (Bottom-up): Start with each example in its own separate cluster. Decide which clusters to merge.
Partitional (K starting points): Start with $K$ random cluster centers, and decide which examples to put in each of the clusters.
- Adjust the cluster centers after each allocation of examples to clusters.
- k-means, k-medoids
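
A minimal sketch of the partitional idea, using the batch form of k-means (often called Lloyd's algorithm): all points are assigned first, and the centers are then recomputed, rather than adjusting after every single allocation. The initial centers are passed in explicitly instead of being drawn at random so that the run is reproducible. All names and data are assumptions for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centers, max_iter=100):
    """Assign each point to its nearest center, recompute centers, repeat until stable."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [mean_point(c) if c else centers[k] for k, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```

k-medoids follows the same loop but restricts each center to be an actual data point, which makes it less sensitive to outliers.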

Choosing a clustering method

Consideration	What to look for	Typical choices
Scalability	Near-linear time and bounded memory on large datasets.	MiniBatch K-Means, BIRCH, scalable DBSCAN with indexing.
Arbitrary shapes	Ability to find non-spherical clusters.	DBSCAN, HDBSCAN, Spectral clustering.
Noise and outliers	Robustness to noise; ability to mark points as noise.	DBSCAN, HDBSCAN (labels noise), GMM with low-weight components.
Mixed attribute types	Works with categorical + numeric or custom distances.	k-prototypes/k-modes, Agglomerative with Gower distance.
Few parameters	Minimal, intuitive hyperparameters; stable defaults.	Agglomerative (linkage, distance), HDBSCAN (min cluster size).
Order insensitivity	Results independent of input order.	Most batch methods; shuffle for MiniBatch K-Means.
High dimensionality	Handles curse of dimensionality or uses reduction.	PCA + K-Means/Agglomerative, Spectral after reduction, cosine distance.
User constraints	Must-link/cannot-link or size constraints supported.	COP-K-Means, constrained agglomerative, semi-supervised variants.
Interpretability	Easy to explain clusters and decisions.	K-Means centroids, Agglomerative dendrograms, GMM probabilities.

Clustering Terminology

Clustering Points

Centroid: a point in the middle of a cluster. It may not be an actual point in the dataset.
Medoid: an actual point in the dataset that is centrally located and is, therefore, representative of the cluster.
Representative points: points around the cluster that are representative of it.
High intra-class similarity: homogeneity, i.e. the closeness of data points within a single cluster.
Low inter-class similarity: separation, i.e. a large distance between two separate clusters.

Class in Cluster

A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.

Hierarchical clustering

Hierarchical approaches lead to the formation of dendrograms
The top and bottom of a dendrogram represent the two extremes of clustering
- At the bottom, a leaf is an individual cluster
- At the top, the root is one cluster

AGNES

AGglomerative NESting hierarchical clustering algorithm.

Agglomerative hierarchical clustering follows a bottom-up approach
- starting with clusters of single objects and merging them into bigger and bigger clusters
The agglomerative clustering process terminates when a termination condition is satisfied or when only one cluster containing all objects remains.
Merging is based on the Euclidean distance between two objects.
steps of the algorithm:
1. Make a cluster with only one object as member for all objects
2. Calculate the Euclidean distance between each pair of clusters
3. Choose the cluster pair with the smallest distance and merge them to make one cluster
4. Repeat step 2 with the new combined cluster and the other, older clusters
5. Repeat steps 3 and 4 until all the objects are merged into a single cluster.
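
The steps above can be sketched roughly as follows. The notes do not say how the distance between two multi-object clusters is derived from object distances, so this sketch assumes single linkage (the closest pair of members). The function names and the one-dimensional sample points are illustrative.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2):
    """Assumed single linkage: distance of the closest pair of members."""
    return min(euclidean(p, q) for p in c1 for q in c2)


def agnes(points):
    """Return the sequence of merges, from single objects up to one cluster."""
    clusters = [[p] for p in points]  # step 1: one cluster per object
    merges = []
    while len(clusters) > 1:          # step 5: stop at a single cluster
        # steps 2 and 3: find the closest pair of clusters
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # step 4: continue with the merged cluster and the remaining older ones
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


points = [(0.0,), (1.0,), (5.0,), (6.5,)]
merges = agnes(points)
for left, right in merges:
    print(left, "+", right)
```

The order of the merges is exactly the information a dendrogram draws, read from the leaves up to the root.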

DIANA

DIvisive ANAlysis clustering algorithm.

The top-to-bottom approach is followed in divisive hierarchical clustering
- starts with a cluster containing all objects.
This cluster is broken up into smaller clusters, and this process of breaking up clusters continues until each cluster contains one object or a given termination condition is satisfied.
steps of the algorithm:
1. The process starts at the root with all the points as one cluster.
2. It recursively splits higher-level clusters to build the dendrogram.
3. It can be considered as a global approach.
4. It can be more efficient than agglomerative clustering when the full hierarchy down to individual objects is not required.

AGNES vs DIANA

Single-linkage clustering

the minimum method, connectedness, or nearest neighbour method

two clusters are linked by a single element pair
The distance between clusters is defined as the shortest distance from a member of the first cluster to a member of the second cluster.

Complete-linkage clustering

the furthest neighbour method, maximum method, or diameter method.

the distance between two clusters is defined as the greatest distance between any member of the first cluster and any member of second cluster

Average-linkage clustering

the unweighted pair group method with arithmetic mean (UPGMA)

the distance between two clusters is calculated by averaging the distance between each member of first cluster and each member of second cluster
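
The three linkage definitions differ only in how they aggregate the pairwise member distances, which the following sketch tries to make concrete. Function names and sample clusters are assumptions for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [euclidean(p, q) for p in c1 for q in c2]


def single_link(c1, c2):
    return min(pairwise(c1, c2))    # nearest neighbours


def complete_link(c1, c2):
    return max(pairwise(c1, c2))    # furthest neighbours


def average_link(c1, c2):
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)


c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]

print(single_link(c1, c2))    # 3.0  (from (1, 0) to (4, 0))
print(complete_link(c1, c2))  # 6.0  (from (0, 0) to (6, 0))
print(average_link(c1, c2))   # (4 + 6 + 3 + 5) / 4 = 4.5
```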

FDA +005

September 9, 2025 · 3 min read

Gracefullight

Owner

Data visualization

Successful visualization requires that data be converted into a visual format.
The motivation is to play to the strengths of people
- people can quickly absorb a large amount of information and find patterns in it.

Data visualization process

A human looks at the visual and perceives information.
After perception, the human should be able to answer some questions by looking at the visual.

Item	Exploratory	Explanatory
Purpose	To analyse data to solve a question or develop a hypothesis.	To convey a message or idea.
Target audience	Expert users with prior knowledge of the subject.	Non-expert users with limited or no background knowledge.
When	Usually happens during the data analytics project and is internal facing.	Usually happens after the exploration phase and is often external facing.
Approach	Unguided, users explore, with no clear conclusion.	Guided through author-chosen comparisons, clear conclusions.
Representation	Has an analytical purpose and represents the complexity of data.	No analytical purpose and represents understandable data.

Driven by Data: Explanatory visualization examples

Descriptive statistics

Describe some data through a quantitative summarisation of its behaviour
Help us summarise data in a meaningful way.
Highlight things like whether there are any values that are ill defined, or which make no sense.
measures of central tendency
- mean
- median
- mode
measures of spread
- range
- variance
- standard deviation
- interquartile range

Inferential statistics

includes more advanced methods such as hypothesis tests, ANOVA, and regression.
make claims about how general this dataset is.
we can make inferences from a sample to a population.

Measures of central tendency

a quick and easy way to describe a dataset by condensing it down to just one representative value.
can easily compare one dataset to another.
Mean: the average of a dataset.
Median: the middle value in a dataset.
Mode: the most commonly occurring value in a dataset.
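
A short sketch using Python's standard `statistics` module; the sample data is made up for illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 10) / 6
median = statistics.median(data)  # average of the two middle values, 3 and 5
mode = statistics.mode(data)      # 3 occurs most often

print(mean, median, mode)
```

The three values differ here because the data is skewed toward larger values: the single 10 pulls the mean above the median.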

Distribution

Measures of spread

how similar or varied a set of values are for a particular variable
summarise the data in a more detailed way that shows how scattered the values are and how much they differ from the central tendency.
Range: the largest value of a variable minus the smallest.
- The bigger the range, the more spread out a dataset is.
Variance: how far a set of numbers is spread out from their average value.
Standard deviation: the square root of the variance
- gives us back a value that has the same units as the mean
Interquartile range (IQR): the difference between the upper quartile and the lower quartile, i.e. the medians of the upper and lower halves of the data.
- First, find the median of the dataset.
- Then, find the medians of the upper and lower halves of the data.
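
The measures of spread can be sketched with the standard `statistics` module. This example assumes population variance and standard deviation (`pvariance`, `pstdev`); the sample versions divide by n - 1 instead. The `iqr` helper follows the median-of-halves procedure described above and is a hypothetical name; other quartile conventions exist and may give slightly different results.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)     # 9 - 2
variance = statistics.pvariance(data)   # mean squared deviation from the mean
std_dev = statistics.pstdev(data)       # square root of the variance


def iqr(values):
    """Median of the upper half minus median of the lower half.

    For an odd number of values the middle value is left out of both halves.
    Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.median(upper) - statistics.median(lower)


print(value_range, variance, std_dev, iqr(data))
```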

Frequency distribution

Frequency is the number of times a data value occurs or repeats.
Relative frequency divides that count by the total number of objects $m$: $frequency(v_i) = \frac{\text{number of objects with attribute value}\thinspace v_i}{m}$
Visual displays that organise and present frequency counts so that the information can be interpreted more easily.
can show absolute frequencies or relative frequencies, such as proportions or percentages.
can be shown in a table or graph.
Some common methods of showing frequency distributions include frequency tables, histograms or bar charts.
Frequency tables: display the number of occurrences of a particular value or characteristic.
Histograms: a type of graph in which each column represents a range (bin) of values of a numeric variable
- useful for describing the shape, centre and spread to better understand the distribution of the dataset.
Bar charts: a type of graph in which each column represents a categorical variable or a discrete ungrouped numeric variable.
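
As a small illustration, a frequency table with absolute and relative frequencies can be built with `collections.Counter`. The sample values are invented, and the text bar at the end of each row is only a stand-in for a real bar chart.

```python
from collections import Counter

values = ["red", "blue", "red", "green", "red", "blue"]

counts = Counter(values)                               # absolute frequencies
m = len(values)                                        # total number of objects
relative = {v: c / m for v, c in counts.items()}       # relative frequencies

# Frequency table, most common value first.
for value, count in counts.most_common():
    print(f"{value:<6} {count:>2} {relative[value]:>6.1%} {'#' * count}")
```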

Vocabulary for AI +007

September 8, 2025 · 3 min read

Gracefullight

Owner

Vocabulary & Expressions

Term/Expression	Definition	Simpler Paraphrase	Meaning (Korean)
referee	A person who supervises a game or match to ensure the rules are followed	Umpire	심판
nonterminal state	The agent is still in the middle of the episode. The environment can continue to produce rewards, and the agent can still take actions.	ongoing state	비종결 상태
apprenticeship	a period of time working as an apprentice	internship	수습 기간
in a similar vein	in a similar way	similarly	비슷한 맥락에서
subsequent	coming after something in time; following	following	그 다음의
in the sense	in the most limited meaning of a word, phrase, etc.	in the meaning	~라는 의미에서
hypothesis	a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation	assumption	가설
untimely	happening or done at an unsuitable time	ill-timed	때 아닌, 시기상조의
utility	the state of being useful, profitable, or beneficial	usefulness	효용
thereafter	after that time	after that	그 후에
analogy	a comparison between two things, typically for the purpose of explanation or clarification	comparison	유추, 비유
inherent	existing in something as a permanent, essential, or characteristic attribute	intrinsic	내재하는, 타고난

RL

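Item	Policy evaluation	Passive learning	Policy search
Policy (π)	Fixed, given externally	Fixed, given externally	None; searched for and improved directly
Environment model (P, R)	Fully known (complete MDP)	Unknown, estimated from experience	May or may not be known
Learning goal	Compute $U^\pi(s)$ under the given policy	Estimate $U^\pi(s)$ from experience under the given policy	Find the optimal policy $\pi^*$ directly
Approach	Iterative computation based on the Bellman equation (iterative policy evaluation)	Explore the actual environment, observe transitions and rewards, and estimate utility from them	Change the policy parameters little by little to maximize return (policy gradient, evolutionary search, etc.)
Required data	None (the model is fully given)	Experience from the environment (trajectory, reward sequence)	Experience from the environment (reward feedback), sometimes gradients
Computation/learning characteristics	A computation problem; can converge without error	Low sample efficiency; uses Monte Carlo/TD methods	High gradient variance; risk of local optima
Example applications	Textbook Gridworld (the model is fully known)	The environment is a black box; fixed-policy experiments (following a simulation)	Robot control, continuous action spaces (PPO, REINFORCE, etc.)
Advantages	Accurate and fast; easy to interpret once the model is available	Works without a model; learns in the actual environment	Can directly optimize continuous, complex actions
Disadvantages	In real environments the model is often unknown	Cannot improve the policy (evaluation only)	Unstable training; needs a lot of data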

$R(S_t,π(S_t),S_{t+1})$

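At time $t$ the agent was in state $S_t$, took the action $π(S_t)$ chosen by the policy, and arrived at the next state $S_{t+1}$. The reward received for this is $R(S_t,π(S_t),S_{t+1})$.

In short: the reward received when taking the policy's action in the current state and moving to the next state.
$S_t$ : the current state the agent is in at time $t$
$π(S_t)$ : the action selected by policy π in the current state $S_t$
$S_{t+1}$ : the next state reached after performing that action
$R$ : the reward function, determined by these three things (current state, action, next state)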

Passive RL

Direct Utility Estimation
ADP (Adaptive Dynamic Programming)
TD (Temporal Difference Learning)
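
As one possible illustration of the last item, here is a sketch of the standard TD(0) update for passive learning under a fixed policy: after each observed transition, the utility of the current state is nudged toward the reward plus the discounted utility of the next state. The state names, rewards, learning rate `alpha` and discount `gamma` are all hypothetical values chosen for this example.

```python
def td0_update(U, s, reward, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) step: move U[s] toward reward + gamma * U[s_next]."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])


# A hypothetical episode produced by following the fixed policy:
# (state, reward R(S_t, pi(S_t), S_t+1), next state). "END" is terminal.
episode = [("A", -0.04, "B"), ("B", -0.04, "C"), ("C", 1.0, "END")]

U = {}
for _ in range(1000):  # replay the same experience many times
    for s, reward, s_next in episode:
        td0_update(U, s, reward, s_next)

print({state: round(value, 3) for state, value in U.items()})
```

Because this toy environment is deterministic, the estimates settle near the exact utilities (1.0 for C, 0.96 for B, 0.92 for A). Direct utility estimation would instead average the observed returns from each state, and ADP would first estimate the transition model and then solve the Bellman equations.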

FSD +007

September 8, 2025 · 3 min read

Gracefullight

Owner

Collection

A data structure that groups multiple elements together logically.

It is referenced by the collection name, and its elements are accessed using indexing or lookup methods.
Some collections allow dynamic sizing, expanding and shrinking with the data; others have a fixed size.
A collection can store mixed data types.
- Java collections are typically type-safe using generics.
- Python uses dynamic typing.

Python List

mylist = ["Tom", 30, 112.5]
len(mylist)
mylist[index]
mylist.append(item)
mylist.insert(index, item)
# returns a slice of a list from first to last-1
mylist[first:last]
mylist.index(item)

# replaces items from first to last-1 with the items of another list
mylist[first:last] = [value1, value2]
# creates a new list with the items of list1 followed by the items of list2
mylist = list1 + list2
# appends the items of list2 to the end of list1, in place
list1.extend(list2)

mylist.remove(item)
mylist.pop(index)
mylist.pop()
del mylist[index]
del mylist
mylist.clear()

# Sorts the list in place, ascending (items must be mutually comparable)
mylist.sort()
mylist.sort(reverse=True)
mylist.reverse()
# Counts how many times item occurs in the list
mylist.count(item)

Python Set

unordered, not indexed.
mutable, unique.

myset = { 'hello', 5, True, 3.5 }

for x in myset:
  print(x)

myset.add(item)
# Merges otherset into myset in place, retaining unique values
myset.update(otherset)
# Returns a new set with the items of both sets (only unique items are retained)
mynewset = myset.union(otherset)
# Returns a new set with only the items that exist in both set1 and set2
myset = set1.intersection(set2)

# raises KeyError if item is not found
myset.remove(item)
# does nothing if item is not found
myset.discard(item)

myset.pop()
myset.clear()
del myset

Python Tuple

a collection of items of any type
ordered, indexed
unchangeable: once a tuple is created, its elements are fixed

mytuple = ("Tom", 30, 112.5)
len(mytuple)
mytuple[index]
mytuple[first:last]
mytuple = tuple1 + tuple2

Python Dictionary

a collection of items represented as key-value pairs
indexed by unique keys; insertion order is preserved since Python 3.7
items are mutable
allows duplicate values but not duplicate keys

mydata = {
  "name": "Tom",
  "age": 30,
  "role": "admin"
}

mydata.keys()
len(mydata)
mydata[key]
mydata[key] = new_value
del mydata[key]
del mydata

# Removes the entry associated with key and returns its value
val = mydata.pop(key)
# Updates/Inserts { k: v } entry into the dictionary
mydata.update({ k: v })

Java List

an interface of the Java Collection Framework (JCF)
cannot be instantiated directly; an implementing class is used instead.
common implementations of the List interface
- ArrayList
- LinkedList

List<Integer> numbers = new ArrayList<>();
List<String> names = new LinkedList<>();

numbers.get(0);
numbers.get(numbers.size() - 1);

names.get(names.indexOf("Hello"));
names.get(names.lastIndexOf("Hello"));

numbers.add(5);
names.remove("Hello");
names.remove(names.indexOf("Hello"));

numbers.removeAll(otherList);
// set(2, 12) replaces the item at index 2 with 12
numbers.set(2, 12);

Java Set

an interface of the Java Collection Framework (JCF)
unordered, unique objects.

HashSet<String> names = new HashSet<>();
// Alternatively, initialize the set from an existing collection:
// HashSet<String> names = new HashSet<>(Arrays.asList("Tom", "Jerry", "Mickey"));

ArrayList<String> list1 = new ArrayList<>();
ArrayList<String> list2 = new ArrayList<>();

list1.add("Tom");
list1.add("Jerry");

names.addAll(list1);
names.addAll(list2);

names.remove("Tom");
boolean isRemoved = names.remove("Tom");

for (String name : names) {
  System.out.println(name);
}

Iterator<String> it = names.iterator();
while (it.hasNext()) {
  System.out.println(it.next());
}

names.clear();
names.isEmpty();
names.contains("Tim");
names.size();
names.removeAll(set2);
names.containsAll(set2);
// Keep only the elements that are also in set2 and discard the rest
names.retainAll(set2);

Java Map

An interface from java.util that stores data as key-value pairs.
It contains unique keys that are associated with specific values.

HashMap<Integer, String> people = new HashMap<>();
people.put(1, "Tom");
people.put(2, "Jerry");
people.put(3, "Mickey");

people.putIfAbsent(2, "Donald");
System.out.println(people.get(2));

people.put(2, "Lucy");
people.replace(2, "Amy");
people.remove(2);

System.out.println(people.keySet());
System.out.println(people.values());

people.clear();
people.isEmpty();
people.containsKey(2);
people.size();
people.getOrDefault(50, "Unknown");
// Checks if the value is mapped with one or more keys
people.containsValue("Jim");

Operation Patterns

Finding an item in a list: Using the lookup pattern
Finding multiple items in a list: Using the updated-lookup pattern
Removing certain items from a list: Using the remove-all pattern
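
The notes only name these patterns, so the following Python sketch is one plausible reading of them: lookup returns the first match, updated-lookup collects every match, and remove-all deletes every match in place. The function names, the predicate-based interface and the sample data are assumptions.

```python
def lookup(items, predicate):
    """Lookup pattern: return the first matching item, or None if there is none."""
    for item in items:
        if predicate(item):
            return item
    return None


def lookup_all(items, predicate):
    """Updated-lookup pattern: collect every matching item."""
    return [item for item in items if predicate(item)]


def remove_all(items, predicate):
    """Remove-all pattern: delete every matching item from the list in place."""
    items[:] = [item for item in items if not predicate(item)]


def starts_with_t(name):
    return name.startswith("T")


names = ["Tom", "Jerry", "Tim", "Mickey"]

first_match = lookup(names, starts_with_t)
all_matches = lookup_all(names, starts_with_t)
remove_all(names, starts_with_t)

print(first_match, all_matches, names)
```

Rebuilding the list in `remove_all` avoids the common bug of removing items from a list while iterating over it.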

Sentence structures

September 4, 2025 · 4 min read

Gracefullight

Owner

Types of sentence structure

The secret to good writing is variation and using a mix of these types of sentences within your paragraphs in your written work.

Simple sentence
- is one independent clause in a subject-verb pattern
- e.g. The Australian government introduced an official carbon tax on 1 July 2012.
Compound sentence
- is two independent clauses connected by a coordinating conjunction.
- e.g. The Australian government introduced an official carbon tax on 1 July 2012, but this was met with opposition from the general public.
Complex sentence
- consists of an independent clause and a dependent clause.
- e.g. As the Australian government recognized the necessity to significantly reduce greenhouse gas emissions, it introduced an official carbon tax on 1 July 2012.
Compound-complex sentence
- consists of more than one independent clause and one or more dependent clauses.
- e.g. As the Australian government recognized the necessity to significantly reduce greenhouse gas emissions, it introduced an official carbon tax on 1 July 2012, but this was met with opposition from the general public.

Common sentence structure errors

Sentence fragments

A sentence fragment is missing some of its parts.
There are three main reasons why a sentence may be incomplete.
- Missing subject
  - e.g. Becoming extinct because of rising sea temperatures.
  - Correction: Phytoplankton could become extinct because of rising sea temperatures.
- Missing verb
  - e.g. Significantly, one particular form of Western Australian finch.
  - Correction: Significantly, one particular form of Western Australian finch has decreased in numbers.
- Incomplete thought
  - e.g. In a recent article about loss of habitat due to climate change.
  - Correction: In a recent article about loss of habitat due to climate change, Australian animals were shown to be particularly vulnerable.
Sentences beginning with words like so, as, because, who, which, that are often incomplete.

Run-on sentences

A run-on sentence occurs when two simple sentences are incorrectly joined. e.g. Poverty, famine and major public health problems around the developing world are important indicators of a changing climate these issues are not being addressed globally.
Use a joining or linking word such as and, but, or, nor, for, so, yet.
- Correction: Poverty, famine and major public health problems around the developing world are important indicators of a changing climate, but these issues are not being addressed globally.
Make two separate sentences.
- Correction: Poverty, famine and major public health problems around the developing world are important indicators of a changing climate. These issues are not being addressed globally.

Lack of Meaning

Ensure that each sentence you write has clear meaning in English.
It must be fully understandable when read.
If you are not sure whether your sentence has clear meaning in English, consider rewriting it in a simpler and clearer way that you, and hopefully your reader, can fully understand.

Tips for writing

Consider the following example where each sentence follows a similar structure.
- Topic Sentence: main point of the paragraph.
- Supporting Sentence: Examples, evidence, or analysis.
- Concluding Sentence: wrap up paragraph by linking to broader topic, or linking to the next paragraph or section.
This uniformity leads to a lack of cohesion, making the paragraph feel disjointed and somewhat monotonous.
- e.g. Nursing education states that measures should be in place to avoid infection. Also, that infection rates tend to soar when hygiene standards decrease. Appropriate steps should be taken to decrease these risks. It is suggested that medical staff are educated to understand these risks.
- Correction: Nursing educators argue that strict measures should be implemented to avoid infection in medical institutions. There is also much evidence to demonstrate that infection rates rise dramatically when hygiene standards begin to fall. Therefore, it is argued that appropriate steps need to be in place to decrease and minimize these potential risks. Furthermore, aggressive steps should be taken to ensure that all staff maintain effective hygiene and infection control.

Ref

UTS: Sentence structures

VLA Test Review

September 3, 2025 · 6 min read

Gracefullight

Owner

VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

VLATest fuzzes 18,604 manipulation scenes (10 operators, 4 tasks) to systematically stress-test VLA robustness.
Seven VLA models show low success and brittleness to confounders, lighting/camera changes, unseen objects, and instruction mutations; larger pretraining helps.
Priorities: scale/augment demo data (incl. sim2real), use stepwise/CoT prompting & multi-agent setups, and expand benchmarks with online risk assessment.

Motivation & Gap

Problem: Current VLA models are typically evaluated on small, hand-crafted scenes, leaving general performance and robustness in diverse scenarios underexplored.
Goal: Introduce VLATest, a generation-based fuzzing framework that automatically creates robotic manipulation scenes to test performance and robustness of VLA models.

What Are VLA Models?

Vision-Language-Action (VLA) models take natural language instructions + camera images and output low-level robot actions (Δx, Δθ, Δgrip).
Inference loop: Tokenize text/image → transformer predicts action token A₁ → execute → append A₁ + new image tokens I₂ → predict A₂ → … until success or step limit.
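
The inference loop can be sketched as below. This is only a schematic of the control flow described above, not the API of any particular VLA model: `predict_action` and `step_env` are hypothetical callables standing in for the transformer and the robot or simulator, and the toy implementations exist purely so the sketch runs.

```python
def run_episode(instruction, first_image, predict_action, step_env, max_steps=20):
    """Autoregressive control loop; returns (success, list of actions taken)."""
    context = [("text", instruction), ("image", first_image)]
    actions = []
    for _ in range(max_steps):
        action = predict_action(context)        # next action from everything seen so far
        actions.append(action)
        next_image, success = step_env(action)  # execute on the robot or simulator
        if success:
            return True, actions
        # Append the executed action and the new observation, then predict again.
        context += [("action", action), ("image", next_image)]
    return False, actions                       # step limit reached


def toy_predict_action(context):
    """Hypothetical model: always closes the gripper a little."""
    return {"dx": 0.0, "dtheta": 0.0, "dgrip": -0.1}


class ToyEnv:
    """Hypothetical environment that reports success on the third step."""

    def __init__(self):
        self.steps = 0

    def step(self, action):
        self.steps += 1
        return f"image_{self.steps + 1}", self.steps >= 3


env = ToyEnv()
success, actions = run_episode("pick up [obj]", "image_1", toy_predict_action, env.step)
print(success, len(actions))
```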

VLA Architecture

Training & Evaluation

Training: (1) Train from scratch on robot demonstrations, or (2) fine-tune a large VLM (e.g., Llava) with >1B params pretraining.
Evaluation: Task-specific metrics (e.g., grasp, lift, hold for “pick up”), either in sim (auto-metrics) or real (manual labels).

VLATest Framework

Ten testing operators grouped across:
- Target objects: type, position, orientation
- Confounding objects: type, position, orientation, count
- Lighting: intensity
- Camera: position, orientation
Scene generation (Alg. 1): sample valid targets → (optional) confounders → mutate lighting (factor α) → mutate camera pose (d, θ). Semantic validity checks prevent infeasible scenes.

VLA Test

Research Questions (RQ)

RQ1: Basic performance on popular manipulation tasks
RQ2: Effect of confounding object count
RQ3: Effect of lighting changes
RQ4: Effect of camera pose changes
RQ5: Robustness to unseen objects (OOD)
RQ6: Robustness to instruction mutations

Tasks & Prompting

Tasks:
1. Pick up an object (grasp + lift ≥0.02 m for 5 frames)
2. Move A near B (≤0.05 m)
3. Put A on B (stable stacking)
4. Put A into B (fully inside)
Standard prompts (RQ1–RQ5):
- pick up [obj] · move [objA] near [objB] · put [objA] on [objB] · put [objA] into [objB]
Instruction mutations (RQ6): 10 paraphrases per task (GPT-4o), manually validated for semantic equivalence.

Experimental Setup

Scenes: 18,604 across 4 tasks (ManiSkill2).
Models: 7 public VLAs (RT-1-1k/58k/400k, RT-1-X, Octo-small/base, OpenVLA-7b).
Compute: >580 GPU hours.

Key Results & Findings

RQ1 — Overall Performance

VLA models underperform overall; no single model dominates across tasks.
Example best-case rates (default settings): 34.4% (Task1, RT-1-400k), 12.7% (Task2, OpenVLA-7b), 2.2% (Task3, RT-1-X), 2.1% (Task4, Octo-small).
Stepwise breakdown (Task 1): grasp 23.3% → lift 15.7% → hold 12.4% ⇒ difficulty composing sequential actions.
- Implication (Finding 2): Consider stepwise prompting / chain-of-thought to decompose complex tasks.

RQ1 — Coverage Metric

No established coverage for VLA; adopted trajectory coverage (pragmatic).
Increasing cases from n=10 to n=1000 achieved 100% coverage across tasks (object-position novelty relative to workspace).

RQ2 — Confounding Objects

More confounders ⇒ worse performance; models struggle to locate the correct object.
Similarity doesn’t matter much: Mann–Whitney U shows no significant difference between similar vs dissimilar distractors (p = 0.443, 0.614, 0.657, 0.443; effect sizes ≈ 0.23–0.29).

RQ3 — Lighting Robustness

Lighting perturbations significantly hurt performance.
OpenVLA-7b most robust (77.9% of previously passed cases still pass), plausibly due to SigLIP + DINOv2 pretraining and LLaVA 1.5 mixture.
Sensitivity: even α < 2.5 increase drops success to ~0.7×; α > 8 ⇒ ~40% of default-pass scenes succeed.
Decreasing light hurts less than increasing; α < 0.2 still ~60% pass.

RQ4 — Camera Pose Robustness

Small pose changes (≤5° rotation, ≤5 cm shift) reduce success to 34.0% of default.
RT-1-400k most robust (45.6% retain), OpenVLA-7b at 31.3%; Octo models <10%.
- Likely due to training data scale differences.

RQ5 — Unseen Objects

Using YCB (56 unseen objects) leads to large performance drops versus seen objects: avg –74.2%, –66.7%, –66.7%, –20.0% on Tasks 1–4.
Transfer rate across steps:
- $\displaystyle T_r^n = \frac{\text{Success rate}_n}{\text{Success rate}_{n-1}}$ , with $\text{Success rate}_0 = 100\%$
- Paired t-tests show significant differences on $T_r^1$ for Task 1 & 2 (p = 0.011, 0.007; Cohen’s d = 1.34, 0.891).
- Primary failure mode: recognizing/locating unseen objects.
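
As a rough illustration of the transfer-rate formula, the sketch below applies it to the Task 1 stepwise success rates quoted under RQ1 (grasp 23.3%, lift 15.7%, hold 12.4%). The function name `transfer_rates` is hypothetical, and the figures are reused only as sample input; they are not the RQ5 unseen-object measurements.

```python
def transfer_rates(success_rates):
    """Return T_r^n = rate_n / rate_{n-1}, taking rate_0 = 100%."""
    rates = []
    previous = 100.0  # Success rate_0 is defined as 100%
    for current in success_rates:
        rates.append(current / previous)
        previous = current
    return rates


# Task 1 stepwise success rates (percent): grasp, lift, hold
task1 = [23.3, 15.7, 12.4]
for step, rate in zip(["grasp", "lift", "hold"], transfer_rates(task1)):
    print(f"{step}: {rate:.3f}")
```

A low first-step rate with higher later-step rates would suggest that most failures happen before the object is even grasped, which matches the failure mode noted above.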

RQ6 — Instruction Mutations

Mutated instructions generally reduce performance (avg drops: –32.8% T1, –1.7% T2, –8.3% T3; negligible on T4).
Larger language backbones help: OpenVLA-7b (Llama 2-7B) is more robust, sometimes improving under mutations (e.g., T1, T4).

Implications & Directions

Scale matters: larger pretraining and robot-demo datasets improve robustness (lighting/camera).
Data enrichment: use data augmentation and sim-to-real to diversify external factors; leverage traditional controllers to auto-generate demonstrations.
Prompting strategies: adopt stepwise/CoT prompting; consider multi-agent decompositions.
Benchmarking: the 18,604 VLATest scenes serve as an early benchmark; expand to more tasks/robots/conditions.
Online risk assessment: explore uncertainty estimation and safety monitoring for runtime quality control.

Related Work

Robotics foundation models: (1) LLMs for planning/rewards; (2) Multi-modal FMs (VLMs/VLAs) for manipulation & perception.
CPS testing: gray-box/black-box fuzzing and search-based testing exist, but not directly applicable to VLAs (multimodality, autoregression, scale).
FM evaluation: beyond static benchmarks, VLATest dynamically generates 3D manipulation test cases—distinct from text-only testing.

Threats to Validity (mitigations in study)

Internal: randomness (mitigated by 18,604 scenes); potential prompt bias (mutations manually validated).
External: generalization to other tasks/models; chose popular tasks (Open X-Embodiment) and SOTA public models.
Construct: limited operators (lighting/camera/confounders chosen; future: #lights, camera intrinsics, resolution).
- Coverage: trajectory coverage used as a pragmatic proxy.

Conclusion

VLATest: early, generation-based fuzzing framework (10 operators) for VLA testing in ManiSkill2.
Empirical evidence across 7 models / 4 tasks / 18,604 scenes shows limited robustness (lighting, camera, unseen objects, instruction variation).
Points to data scaling, prompting, benchmarking, and risk assessment as practical paths to more reliable VLA systems.

Ref

Wang, Z., Zhou, Z., Song, J., Huang, Y., Shu, Z., & Ma, L. (2025). VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation. Proceedings of the ACM on Software Engineering, 2(FSE), 1615–1638.

IAI +005

September 2, 2025 · 13 min read

Gracefullight

Owner

Neural Network Development History

1940s–1960s: Early Foundations
- McCulloch & Pitts (1943): mathematical neuron model
- Rosenblatt's Perceptron (1958): first trainable network
- Minsky & Papert (1969): limitations (XOR problem) → AI Winter
1970s–1980s: First Revival
- Werbos (1974); Rumelhart, Hinton, Williams (1986): Backpropagation
- Hopfield Networks (1982): associative memory
- Renewed optimism but limited by hardware
1990s: Consolidation
- LeCun's CNN (LeNet, 1989): digit recognition
- Elman, Jordan: Recurrent Neural Networks
- Symbolic AI still dominated mainstream
2000s: Deep Learning Foundations
- Better hardware (GPUs) + large datasets
- Hinton (2006): Deep Belief Networks (unsupervised pretraining)
- Connectionism regains attention
2010s: Deep Learning Boom
- ImageNet (2012): AlexNet breakthrough
- RNNs, LSTMs, GRUs → speech & translation
- Transformers (2017): revolutionized NLP
2020s: Scaling & Foundation Models
- Large Language Models (GPT, BERT, etc.)
- Multimodal AI: vision, text, speech integration
- Connectionism dominates AI research & industry

Neural Network Models

A neural network is a collection of units (neurons) connected together.
The properties of the network are determined by its topology and the properties of the neurons.
Roughly speaking, the neuron fires when a linear combination of its inputs exceeds some (hard or soft) threshold.

Simple Neuron

$in_j = \sum_{i=0}^{n} w_{ij}a_i$
$out_j = g(in_j)$
$a_j = g(\sum_{i=0}^{n} w_{ij} a_i)$
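
A minimal sketch of the single-neuron computation above, assuming a sigmoid for $g$ and the convention that $a_0 = 1$ is a fixed bias input. The function names and the sample weights are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(weights, activations, g):
    """Compute a_j = g(sum_i w_ij * a_i)."""
    in_j = sum(w * a for w, a in zip(weights, activations))
    return g(in_j)


# a_0 = 1 is the fixed bias input, so weights[0] acts as the bias weight.
weights = [-1.0, 2.0, 0.5]
activations = [1.0, 0.5, 1.0]
print(neuron_output(weights, activations, sigmoid))
```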

Activation function

ReLU function

$ReLU(x) = \max(0, x)$

ReLU is an abbreviation for rectified linear unit.
It is the most commonly used activation function in modern networks.

Softplus function

$Softplus(x) = \log(1 + e^x)$

A smooth version of the ReLU function

Logistic or Sigmoid function

$Logistic(x) = \frac{1}{1 + e^{-x}}$

Nonlinear and differentiable, with outputs in (0, 1); the nonlinearity is what lets a network represent nonlinear functions.

Tanh function

$tanh(x) = \frac{e^{2x} -1}{e^{2x} + 1}$
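
The four activation functions above can be written directly from their formulas. This is a plain-Python sketch for checking values by hand; the naive exponential forms can overflow for large inputs, so it is not meant as a production implementation.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x):
    return math.log(1.0 + math.exp(x))


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), softplus(x), logistic(x), tanh(x))
```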

Topology of a neural network

Feed-forward network (FFN):
- Every node receives inputs from "upstream" nodes and delivers output to "downstream" nodes.
- There are no loops.
- FFN represents a function of its current inputs, thus it has no internal state other than the weights themselves.
Recurrent Network (RNN):
- A recurrent network feeds its outputs back into its own inputs.
- In a recurrent network, the neuron values can eventually settle down, keep cycling, or behave unpredictably.
- can support short-term memory

FFN	RNN

Training Process

Go through each training sample.
If correctly classified → do nothing.
If misclassified → update the weights:
$w_i \leftarrow w_i + \alpha(y - \hat{y})x_i$

Perceptron for Binary Classification

A perceptron separates data into two classes with a hyperplane.
if $w \cdot x \geq 0 \rightarrow 1$
if $w \cdot x < 0 \rightarrow 0$
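
The training loop and the threshold rule can be sketched as follows, assuming the logical AND function as a toy dataset, a learning rate of 1, and a bias input fixed at 1. The function names are hypothetical.

```python
def predict(w, x):
    """Hard threshold: 1 if w . x >= 0, otherwise 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0


def train_perceptron(samples, alpha=1.0, max_epochs=100):
    """Apply w_i <- w_i + alpha * (y - y_hat) * x_i until no sample is misclassified."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            y_hat = predict(w, x)
            if y_hat != y:  # correctly classified samples leave w unchanged
                w = [wi + alpha * (y - y_hat) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:
            break
    return w


# Logical AND; x[0] = 1 is the bias input.
and_data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train_perceptron(and_data)
print([predict(w, x) for x, _ in and_data])
```

Because AND is linearly separable, the loop stops after a finite number of updates; on XOR the same loop would run until `max_epochs` without converging.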

Learning Rules

Aspect	Perceptron Learning Rule	Gradient Descent (with Sigmoid)
Activation function	Hard threshold (step function) $Threshold(z) = 1 \; \text{if} \; z \ge 0,\; 0 \; \text{otherwise}$	Sigmoid (continuous function) $h_w(x) = \frac{1}{1+e^{-w \cdot x}}$
Output	0 or 1	A real value between 0 and 1
Loss function	None (adjust when wrong, keep when correct); rule-based learning	$L = (y - h_w(x))^2$ (L2 loss) or cross-entropy (common in practice)
Update rule	Only on misclassification: $w \leftarrow w + \alpha (y - h_w(x))x$	Gradient descent: $w \leftarrow w + \alpha (y - h_w(x)) \cdot h_w(x)(1-h_w(x)) \cdot x$
Why derivative?	The hard threshold is not differentiable → a simple rule is used instead	The sigmoid is continuous and differentiable → updates follow the gradient of the loss. The $h_w(x)(1-h_w(x))$ term comes from the derivative of the sigmoid.
Interpretation	When wrong, take one step toward the correct answer	Move gradually in the direction that reduces the loss

Feedforward NN, FNN

An FNN is a multilayer perceptron network.
It has one input layer, N hidden layers (N >= 1), and one output layer.
Except for the input layer, each layer has the same activation function g.
The final output is represented by a vector function of inputs and weights.
A network with three layers (a single hidden layer) is a shallow neural network; one with more hidden layers is a deep neural network.

Training an FNN

Forward
- Activation passing from the input layer to the output layer
- Calculate the output
Backward
- Errors propagating backward from the output layer to the input layer
- Update weights

Forward phase

Activation of each node is computed in two steps:
1. Weighted sum (in): sum of activations from the previous layer, multiplied by weights.
2. Apply activation function g: pass the weighted sum through g to produce the node's activation.
Process: propagate activations layer by layer towards the output layer.
Output value (example with 2 layers):
- $h_w(x) = g^{(2)}\big(W^{(2)} g^{(1)}(W^{(1)} x)\big)$
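
A minimal sketch of the forward phase for the two-layer formula above, assuming sigmoid activations in both layers and no bias terms (as in the formula). The weight values and function names are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(W, a, g):
    """One layer: weighted sum per node (in), then activation g."""
    return [g(sum(w * ai for w, ai in zip(row, a))) for row in W]


def forward(W1, W2, x, g1=sigmoid, g2=sigmoid):
    """h_w(x) = g2(W2 g1(W1 x)) for a network with one hidden layer."""
    hidden = layer(W1, x, g1)
    return layer(W2, hidden, g2)


W1 = [[0.5, -0.5], [0.25, 0.75]]  # 2 inputs -> 2 hidden units
W2 = [[1.0, -1.0]]                # 2 hidden units -> 1 output
print(forward(W1, W2, [1.0, 2.0]))
```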

Backward phase

Loss function: choose squared error loss (L2)
- $L_2(y, \hat{y}) = (y - \hat{y})^2$
Prediction:
- $\hat{y} = h_w(x)$
Gradient descent: compute gradient of the loss with respect to weights, then update weights along the negative gradient direction.
- $w_{i,j} \leftarrow w_{i,j} - \alpha \cdot gradient_{w_{i,j}}$
Example: sigmoid activation:
- $\hat{y} = \frac{1}{1 + e^{-w \cdot x}}$
- Gradient of the loss:
  - $gradient_{w_{i,j}} = \frac{\partial}{\partial w_{i,j}} Loss(h_w) = 2 (y - h_w(x)) \cdot \Big(- \frac{\partial}{\partial w_{i,j}} h_w(x)\Big)$
- Chain rule applied:
  - $\frac{\partial g(f(x))}{\partial x} = g'(f(x)) \cdot f'(x)$
Example: Gradient derivation for sigmoid
- Weighted input
  - $W \cdot X = w_{1,3}x_1 + w_{2,3}x_2 + w_{0,3}x_0$
  - (where $x_0 = 1$ for the bias)
- Gradient of the loss
  - $gradient_{w_{i,j}} = \frac{\partial}{\partial w_{i,j}} Loss(h_w) = 2 (y - h_w(x)) \cdot \Big(-\frac{\partial}{\partial w_{i,j}} h_w(x)\Big)$
- Derivative of sigmoid output
  - $\frac{\partial}{\partial w_{i,j}} h_w(x) = h_w(x)(1 - h_w(x)) \cdot \frac{\partial}{\partial w_{i,j}} (W \cdot X)$
    - $\frac{\partial}{\partial w_{i,j}} \left( \frac{1}{1 + e^{-W X}} \right)$
    - $= \left( \frac{1}{1 + e^{-W X}} \right) \left(1 - \frac{1}{1 + e^{-W X}} \right) \cdot \frac{\partial}{\partial w_{i,j}} (W X)$
    - $= h_w(x) \big(1 - h_w(x)\big) \cdot \frac{\partial}{\partial w_{i,j}} (W X)$
- Derivative of weighted input
  - $\frac{\partial}{\partial w_{0,3}}(W \cdot X) = x_0 = 1$
  - $\frac{\partial}{\partial w_{1,3}}(W \cdot X) = x_1$
  - $\frac{\partial}{\partial w_{2,3}}(W \cdot X) = x_2$
- Weight update rule
  - General form: $w_{i,j} \leftarrow w_{i,j} - \alpha \cdot gradient_{w_{i,j}}$ (in the updates below, the constant factor 2 from the squared loss is absorbed into $\alpha$)
  - $w_{0,3} \leftarrow w_{0,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x))$
  - $w_{1,3} \leftarrow w_{1,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x)) x_1$
  - $w_{2,3} \leftarrow w_{2,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x)) x_2$
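
One way to sanity-check the derivation above is to compare the analytic gradient with a finite-difference estimate. The sketch below does this for a single sigmoid unit with three weights; the sample values and helper names are hypothetical.

```python
import math


def h(w, x):
    """Sigmoid unit: h_w(x) = 1 / (1 + exp(-w . x))."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def loss(w, x, y):
    return (y - h(w, x)) ** 2


def analytic_gradient(w, x, y):
    """dL/dw_i = -2 (y - h) h (1 - h) x_i, from the chain rule."""
    out = h(w, x)
    return [-2.0 * (y - out) * out * (1.0 - out) * xi for xi in x]


def numeric_gradient(w, x, y, eps=1e-6):
    """Central finite differences, used only as a cross-check."""
    grads = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads


w = [0.1, -0.3, 0.8]   # w_{0,3}, w_{1,3}, w_{2,3}
x = [1.0, 0.5, -1.5]   # x_0 = 1 is the bias input
y = 1.0
print(analytic_gradient(w, x, y))
print(numeric_gradient(w, x, y))
```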

Backward phase Steps

Select a loss function
- For example, squared error loss:
- $L(y, \hat{y}) = (y - \hat{y})^2, \quad \hat{y} = h_w(x)$
Choose an activation function
- Suppose we use a sigmoid:
- $h_w(x) = \frac{1}{1 + e^{-W \cdot X}}$
Calculate the error at the output node
- The delta (error term) at the output is
- $\Delta_{out} = 2(\hat{y} - y) \cdot g'(in_{out})$
Calculate the error at hidden nodes
- A hidden unit may connect to multiple nodes in the next layer.
- Therefore, its error is the weighted sum of all deltas it feeds into, scaled by its own derivative:
- $\Delta_i = g'(in_i) \sum_j w_{i,j} \Delta_j$
- The summation appears because the hidden node's output influences several downstream nodes, and all those error signals must be aggregated.
Update the weights with gradient descent
- The gradient with respect to weight $w_{i,j}$ is simply the input times the delta:
- $\frac{\partial L}{\partial w_{i,j}} = a_i \Delta_j$
- Update rule:
- $w_{i,j} \leftarrow w_{i,j} - \alpha \, a_i \Delta_j$
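
The steps above can be sketched for a small network with one hidden layer and a single output, assuming sigmoid activations, squared error loss, and no bias terms. The function `backprop_step` and the starting weights are illustrative, not taken from the course material.

```python
import math


def g(x):
    return 1.0 / (1.0 + math.exp(-x))


def g_prime(x):
    return g(x) * (1.0 - g(x))


def backprop_step(W1, W2, x, y, alpha=0.5):
    """One gradient step for a network with one hidden layer and one output.

    W1[i][k]: weight from input k to hidden unit i.
    W2[i]:    weight from hidden unit i to the output.
    Returns the updated weights and the loss measured before the update.
    """
    # Forward phase
    in_h = [sum(w * xk for w, xk in zip(row, x)) for row in W1]
    a_h = [g(v) for v in in_h]
    in_out = sum(w * a for w, a in zip(W2, a_h))
    y_hat = g(in_out)

    # Backward phase: output delta, then hidden deltas
    delta_out = 2.0 * (y_hat - y) * g_prime(in_out)
    delta_h = [g_prime(in_h[i]) * W2[i] * delta_out for i in range(len(W2))]

    # Gradient descent: w <- w - alpha * a_i * delta_j
    new_W2 = [W2[i] - alpha * a_h[i] * delta_out for i in range(len(W2))]
    new_W1 = [[W1[i][k] - alpha * x[k] * delta_h[i] for k in range(len(x))]
              for i in range(len(W1))]
    return new_W1, new_W2, (y - y_hat) ** 2


W1, W2 = [[0.1, 0.2], [-0.3, 0.4]], [0.5, -0.5]
for _ in range(3):
    W1, W2, step_loss = backprop_step(W1, W2, [1.0, 0.0], 1.0)
    print(round(step_loss, 4))
```

With a single output node, the sum over downstream deltas in the hidden-node formula reduces to one term, which is why `delta_h` has no inner loop here.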

Vanishing gradient

The error signals are extinguished altogether as they are propagated back through the network.
In deep feedforward networks with sigmoid/tanh, repeated multiplication of small derivatives ( $0 < g'(z) < 1$ ) during backpropagation causes the gradient to vanish.
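
A small numeric illustration of why the product shrinks, assuming sigmoid activations and ignoring the weights entirely. The sigmoid derivative is at most 0.25, so even in the best case the factor contributed by the activations decays geometrically with depth.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


# The sigmoid derivative peaks at z = 0 with value 0.25, so this is the best case.
for depth in (1, 5, 10, 20):
    print(depth, sigmoid_prime(0.0) ** depth)
```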

Optimizer

Training a neural network consists of modifying the network's parameters to minimize the loss function on the training set.
In principle, any kind of optimization algorithm could be used.
In practice, modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD), such as the Adam optimizer.
In TensorFlow (Keras), the optimizer is specified in the compilation step.

Recurrent NN, RNN

Units may take as input a value computed from their own output at an earlier step in the computation.
RNNs have internal state, or memory: inputs received at earlier time steps affect the RNN's response to the current input.
RNNs can be used to perform more general computations.
- Here they are used to analyze sequential data in which a new input vector $x_t$ arrives at each time step
Markov assumption: the hidden state $z_t$ of the network suffices to capture the information from all previous inputs.
- $z_t = f(z_{t-1}, x_t)$
- Once trained, this function represents a time-homogeneous process
- The same update rule $f_w$ applies at every time step, regardless of whether it’s the first input or the hundredth.
RNNs are designed for sequential data.
They maintain a hidden state that captures information from previous steps.
They suffer from vanishing/exploding gradients.
They are good for short-term dependencies.
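
A minimal sketch of the recurrence $z_t = f(z_{t-1}, x_t)$, assuming a scalar hidden state and a tanh update with fixed, hand-picked weights. The names `rnn_step` and `run` are hypothetical.

```python
import math


def rnn_step(z_prev, x_t, w_zz=0.5, w_xz=1.0, w_0=0.0):
    """z_t = f(z_{t-1}, x_t); here f is tanh of a weighted sum."""
    return math.tanh(w_zz * z_prev + w_xz * x_t + w_0)


def run(inputs, z0=0.0):
    z = z0
    for x_t in inputs:  # the same update rule is applied at every step
        z = rnn_step(z, x_t)
    return z


# Same final input, different history -> different hidden state.
print(run([1.0, 0.0]), run([0.0, 0.0]))
```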

Backpropagation Through Time, BPTT

gradient expression is recursive.
- $\frac{\partial z_t}{\partial w_{z,z}}$
- $= \frac{\partial}{\partial w_{z,z}} g_z(in_{z,t})$
- $= g_z'(in_{z,t}) \frac{\partial in_{z,t}}{\partial w_{z,z}}$
- $= g_z'(in_{z,t}) \frac{\partial}{\partial w_{z,z}} (w_{z,z} z_{t-1} + w_{x,z} x_t + w_{0,z})$
- $= g_z'(in_{z,t}) \left( z_{t-1} + w_{z,z} \frac{\partial z_{t-1}}{\partial w_{z,z}} \right)$
  - $\frac{\partial z_t}{\partial W_{z,z}}$ includes $\frac{\partial z_{t-1}}{\partial W_{z,z}}$
The gradient can be computed with run time linear in the size of the network.
This is handled automatically by deep learning software systems.
Iterating the recursion shows that the gradient at time $T$ includes a term proportional to:
- $w_{z,z} \prod_{t=1}^{T} g'_z(in_{z,t})$
Since for sigmoid, tanh, and ReLU we have $g' \leq 1$ , if $w_{z,z} < 1$ the RNN will suffer from the vanishing gradient problem.
If $w_{z,z} > 1$ , we may encounter the exploding gradient problem.
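
The recursive gradient expression can be checked numerically. The sketch below assumes a scalar RNN with tanh as $g_z$ and $z_0 = 0$, carries $\partial z_t / \partial w_{z,z}$ forward alongside $z_t$, and compares the result with a finite-difference estimate. All names and sample values are illustrative.

```python
import math


def forward_with_gradient(xs, w_zz, w_xz=1.0, w_0=0.0):
    """Return z_T and dz_T/dw_zz using the recursion from the derivation."""
    z, dz = 0.0, 0.0  # z_0 = 0 does not depend on w_zz
    for x_t in xs:
        in_t = w_zz * z + w_xz * x_t + w_0
        g_prime = 1.0 - math.tanh(in_t) ** 2   # derivative of tanh
        dz = g_prime * (z + w_zz * dz)          # uses z_{t-1} and dz_{t-1}
        z = math.tanh(in_t)
    return z, dz


xs = [0.5, -0.2, 0.1, 0.3]
w = 0.8
_, analytic = forward_with_gradient(xs, w)
eps = 1e-6
numeric = (forward_with_gradient(xs, w + eps)[0]
           - forward_with_gradient(xs, w - eps)[0]) / (2 * eps)
print(analytic, numeric)
```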

Long Short-Term Memory, LSTM

The memory cell is essentially copied from time step to time step.
New information enters the memory by adding updates.
- As a result, the gradient expressions do not accumulate multiplicatively over time.
LSTMs include gating units: vectors that control the flow of information by elementwise multiplication with the corresponding information vector.
An LSTM is a type of RNN designed to overcome the vanishing gradient problem.
It uses gates (input, forget, output) to control information flow.
Capable of learning long-term dependencies.
Widely used in NLP, speech recognition, and time series forecasting.

Gates in LSTM

Forget gate: decides what information to discard from the cell state.
Input gate: decides what new information to store in the cell state.
Output gate: decides what information to output from the cell state.
- similar role to the hidden state in basic RNNs.
Update equations:
- $f_t = \sigma(W_{x,f}x_t + W_{z,f}z_{t-1})$
  - Decides which parts of the previous cell state $c_{t-1}$ should be kept or discarded.
- $i_t = \sigma(W_{x,i}x_t + W_{z,i}z_{t-1})$
  - Determines how much of the new information from the current input $x_t$ and the previous hidden state $z_{t-1}$ should be added.
- $o_t = \sigma(W_{x,o}x_t + W_{z,o}z_{t-1})$
  - Controls which parts of the current cell state $c_t$ are exposed as the hidden state $z_t$ .
- $c_t = c_{t-1} \odot f_t + i_t \odot \tanh(W_{x,C}x_t + W_{z,C}z_{t-1})$
  - Cell state update
  - Past information ( $c_{t-1}$ ) is partially retained through the forget gate.
  - New information is added through the input gate and $\tanh$ .
  - Thus, $c_t$ serves as the long-term memory of the LSTM.
- $z_t = o_t \odot \tanh(c_t)$
  - Hidden state update
  - The cell state is squashed into (-1, 1) with $\tanh(c_t)$ and filtered by the output gate.
  - $z_t$ is the hidden state passed forward to the next time step.
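
The update equations translate almost line by line into code. The sketch below assumes scalar inputs and states (so elementwise products become ordinary multiplication), omits bias terms as the equations do, and uses a hypothetical weight dictionary `W` with arbitrary values.

```python
import math


def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, z_prev, c_prev, W):
    """One LSTM step for scalar x, z, c. W maps names like 'xf' to weights."""
    f_t = sigma(W["xf"] * x_t + W["zf"] * z_prev)   # forget gate
    i_t = sigma(W["xi"] * x_t + W["zi"] * z_prev)   # input gate
    o_t = sigma(W["xo"] * x_t + W["zo"] * z_prev)   # output gate
    candidate = math.tanh(W["xc"] * x_t + W["zc"] * z_prev)
    c_t = c_prev * f_t + i_t * candidate            # cell state (long-term memory)
    z_t = o_t * math.tanh(c_t)                      # hidden state
    return z_t, c_t


W = {k: 0.5 for k in ("xf", "zf", "xi", "zi", "xo", "zo", "xc", "zc")}
z, c = 0.0, 0.0
for x_t in (1.0, -1.0, 0.5):
    z, c = lstm_step(x_t, z, c, W)
    print(round(z, 4), round(c, 4))
```

Note that the cell state is updated by addition (`c_prev * f_t + ...`), which is the property that keeps gradients from shrinking multiplicatively at every step.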

Gated Recurrent Unit, GRU

Variant of RNN with gating mechanisms.
Designed to capture long-term dependencies without complex architecture.
a simpler alternative to LSTMs. (lightweight, effective RNN variant)
Captures temporal dependencies (short & long).
Combine input and forget gates into a single update gate.
Require fewer parameters than LSTM, making them faster to train.
Perform comparably to LSTMs in many tasks.
Mitigates the vanishing gradient problem.
Good balance between complexity & performance.
Excels in time series forecasting tasks.
Widely used in finance, energy and IoT.

Gates in GRU

Update gate (z): decides how much past information to keep
Reset Gate (r): decides how much past information to forget
Candidate hidden state ( $\tilde{h}$ ): potential new memory
Final hidden state ( $h$ ): weighted combination of old and new information.

GRU Workflow

Reset gate ( $r$ )
- Controls how much of the previous hidden state should be "forgotten."
- A small value means most of the past memory is erased, while a large value means much of it is retained.
Update gate ( $z$ )
- Acts as a switch to decide whether to keep the previous state $h_{prev}$ or replace it with the new candidate $\tilde{h}$ .
- If $z=1$ , the past is fully kept; if $z=0$ , it is completely replaced by the new candidate.
Candidate state ( $\tilde{h}$ )
- Combines the current input $x_t$ with the reset-gated previous hidden state to generate the "candidate" new information.
Final hidden state ( $h$ )
- Blends the past and the candidate using the update gate $z$ .
- If $z$ is large → the past memory dominates.
- If $z$ is small → the new candidate dominates.
$h = (1 - z)\tilde{h} + z h_{prev}$
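
A rough sketch of the workflow above for scalar states. Only the final blend $h = (1 - z)\tilde{h} + z h_{prev}$ comes from these notes; the gate and candidate formulas inside `gru_step` are an assumed standard parameterization (sigmoid gates, tanh candidate, no biases), and the weight dictionary `W` is hypothetical.

```python
import math


def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))


def blend(z, h_prev, h_candidate):
    """Final hidden state: h = (1 - z) * h_candidate + z * h_prev."""
    return (1.0 - z) * h_candidate + z * h_prev


def gru_step(x_t, h_prev, W):
    """One scalar GRU step; the gate parameterization is an assumed standard form."""
    z = sigma(W["xz"] * x_t + W["hz"] * h_prev)   # update gate
    r = sigma(W["xr"] * x_t + W["hr"] * h_prev)   # reset gate
    h_candidate = math.tanh(W["xh"] * x_t + W["hh"] * (r * h_prev))
    return blend(z, h_prev, h_candidate)


W = {k: 0.5 for k in ("xz", "hz", "xr", "hr", "xh", "hh")}
h = 0.0
for x_t in (1.0, -1.0, 0.5):
    h = gru_step(x_t, h, W)
    print(round(h, 4))
```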

Comparison: RNN vs LSTM vs GRU

Attribute	RNN	LSTM	GRU
Architecture	Simple, hidden state	Complex, memory cell + 3 gates	Simplified, 2 gates (update/reset)
Information Flow	Stored in hidden state	Controlled by gates	Controlled by merged gates
Long-term Dependency	Weak (vanishing gradient)	Strong (gates mitigate vanishing gradient)	Strong (gates mitigate vanishing gradient)
Short-term Dependency	Strong	Strong	Strong
Number of Parameters	Few	Many	Fewer than LSTM
Training Speed	Fast	Slow	Fast
Performance	Good for short-term	Good for long-term	Efficient, similar to LSTM
Application Areas	Simple time series, basic NLP	NLP, speech, time series forecasting	Finance, IoT, energy, time series
Vanishing Gradient	Yes	Largely mitigated	Largely mitigated
Typical Use Cases	Text generation, simple prediction	Translation, speech recognition	Time series prediction, sensor data
Simplicity	Very simple, rarely used	More complex, expressive (3 gates)	Simpler than LSTM, fewer parameters
Expressiveness	Limited, struggles with long-term	High, handles very complex sequences	Moderate, good for moderate data size
Training Efficiency	Fast, but limited	Slower, better for complex data	Fast, efficient, similar performance
Trade-off	Simple but weak for long-term	Capacity for complex, long sequences	Simplicity vs. capacity

Open X-Embodiment review

September 1, 2025 · 5 min read

Gracefullight

Owner

RT-X

RT-X trains generalist robot policies by co-training RT-1/RT-2 on an X-embodiment mix of multi-robot, multi-task data, enabling efficient adaptation to new robots, tasks, and environments.
It standardizes 1M+ trajectories from 22 embodiments into the Open X-Embodiment (RLDS/tfrecord) repository, unifying observations and 7-DoF actions via coarse alignment.
Experiments show strong positive transfer and emergent skills (≈3× with RT-2-X on cross-robot tasks); performance scales with model capacity, short image histories, and web pretraining, while sensing/actuation diversity and frame alignment remain open problems.

RT-X Architecture

Motivation

Seeks a generalist X-robot policy that can be efficiently adapted to new robots, tasks, and environments.
Mirrors a trend from CV/NLP where general-purpose, web-scale pretrained models outperform narrow, task-specific models.
Robotics lacks comparably large, diverse interaction datasets, making direct transfer of these lessons challenging.

Objectives

Positive transfer: Test whether co-training on data from many robots improves performance on each training domain.
Ecosystem building: Organize large robotic datasets to enable future X-embodiment research.

Core Approach

Train RT-1 and RT-2 on data from 9 different manipulators, producing RT-X variants that outperform policies trained only on the evaluation domain and show better generalization and new capabilities.

What’s Different From Prior Transfer Methods

Many prior works reduce the embodiment gap via specialized mechanisms (shared action spaces, representation learning objectives, policy adaptation using embodiment metadata, decoupled robot/environment representations, domain translation).
RT-X directly trains on X-embodiment data without explicit gap-reduction machinery and still observes positive transfer.

Dataset & Format (Open X-Embodiment)

1M+ real robot trajectories, 22 embodiments (single-arm, bimanual, quadrupeds), pooled from 60 datasets / 34 labs, standardized for easy use.
Uses RLDS (serialized tfrecord), supporting varied action spaces and input modalities (RGB, depth, point clouds), and efficient parallel loading across major DL frameworks.
Language annotations are leveraged; PaLM is used to extract objects/behaviors from instructions.

RLDS

Data Format Consolidation (Coarse Alignment)

Observations: History of recent images + language instruction. One canonical camera view per dataset is resized to a common resolution.
Actions: Convert original controls to a 7-DoF end-effector vector (x, y, z, roll, pitch, yaw, gripper or their rates). Actions are normalized before discretization; outputs are de-normalized per embodiment.
Deliberate non-alignment: Camera poses/properties are not standardized; action frame alignment across datasets is not enforced. The same action vector may cause different motions on different robots (absolute/relative, position/velocity allowed).
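
The normalize, discretize, and de-normalize steps for one action dimension could look roughly like this. The bin count of 256 and the per-embodiment range are assumptions chosen for illustration; these notes do not state the exact values, and the function names are hypothetical.

```python
def discretize(value, low, high, bins=256):
    """Normalize value from [low, high] to [0, 1], then map to a bin index."""
    clipped = min(max(value, low), high)
    normalized = (clipped - low) / (high - low)
    return min(int(normalized * bins), bins - 1)


def undiscretize(token, low, high, bins=256):
    """Map a bin index back to the center of its bin in the robot's own range."""
    return low + (token + 0.5) / bins * (high - low)


# Hypothetical per-embodiment range for one action dimension (e.g. x in metres).
low, high = -0.05, 0.05
token = discretize(0.012, low, high)
print(token, undiscretize(token, low, high))
```

Because `low` and `high` are per embodiment, the same token can decode to different physical motions on different robots, which is the non-alignment described above.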

Policy Architectures

RT-1 (≈35M params): Transformer for control. Inputs: 15-frame image history + natural-language instruction.
- Vision via ImageNet-pretrained EfficientNet; language via USE embedding.
- Fuse via FiLM → 81 vision–language tokens → decoder-only Transformer outputs tokenized actions.
RT-2 (VLA family): Internet-scale VLM co-fine-tuned to output action as text tokens (e.g., 1 128 91 241 5 101 127).
- Any pretrained VLM can be adapted; this work uses RT-2–PaLI-X (ViT backbone + UL2 LM; primarily pretrained on WebLI).

Training Setup

Robotics data mixture: Data from 9 manipulators (a union of multiple well-known robotics datasets).
Loss: Standard categorical cross-entropy over tokenized actions.
Regimes:
- RT-1-X: Trained solely on the robotics mixture.
- RT-2-X: Co-fine-tuned on a ~1:1 mix of original VLM data and the robotics mixture.

Experimental Questions

Does X-embodiment co-training improve in-domain performance (positive transfer)?
Does it improve generalization to unseen tasks?
How do model size, architecture, and dataset composition influence performance/generalization?

Key Results

Small-scale domains: RT-1-X outperforms the Original Method (the authors’ per-dataset baselines) on 4/5 datasets with a large average gain → limited data domains benefit greatly from X-embodiment co-training.
Large-scale domains:
- RT-1-X does not beat an RT-1 trained only on the embodiment-specific large dataset (suggests underfitting for this class).
- RT-2-X (larger capacity) outperforms both Original Method and RT-1 → X-robot training helps even in data-rich regimes when using sufficient capacity.

Generalization & Emergent Skills

Unseen objects/backgrounds/environments: RT-2 and RT-2-X perform on par (VLM backbone already strong here).
Emergent skills (transfer across robots): On Google Robot tasks that do not appear in RT-2’s dataset but exist in Bridge (for WidowX), RT-2-X ≈ 3× RT-2.
- Removing Bridge from RT-2-X training significantly reduces hold-out performance → skills likely transferred from WidowX data.

Design Insights (Ablations)

Short image history notably improves generalization.
Web pretraining is critical for large models’ high performance.
Model capacity matters: 55B model succeeds more than 5B on emergent skills → greater capacity ⇒ greater cross-dataset transfer.
Co-fine-tuning vs. fine-tuning: Similar performance in this study (attributed to the greater diversity of robotics data in RT-2-X vs. prior works).

Limitations (Open Problems)

Does not cover robots with very different sensing/actuation modalities.
Does not study generalization to new robots nor define a decision criterion for when positive transfer will occur.
Camera pose/properties and control frame remain unaligned; a deliberate but still challenging domain gap to address in future work.

Ref

O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al. (2024). Open X-Embodiment: Robotic learning datasets and RT-X models. 2024 IEEE International Conference on Robotics and Automation (ICRA).

FSD +006

September 1, 2025 · 4 min read

Gracefullight

Owner

OOP vs Procedural Programming

OOP

OOP is a programming paradigm built around the concept of objects, which contain data and the code that manipulates that data.
The idea is to model real-world entities and their interactions.
Global data (fields) is enclosed in the objects.
Advantages / Disadvantages:
Program components/tasks are easily divided across the development team / Requires more planning and design preparation
Easier to manage and maintain dependencies between objects / OOP programs are much larger and more complex
Objects export the interface and hide the implementation and data / Tends to use more memory and CPU
Code is highly reusable and easy to scale and distribute / Changes in one class can affect others, which can complicate development

Procedural Programming

Procedural programming is based on the concept of procedure calls: the program is structured around procedures (also called functions or subroutines).
Statements execute in a sequential manner unless directed otherwise.
Global data (elements) is exposed to all the functions.
Advantages / Disadvantages:
Easier to compile and interpret / Difficult to scale or extend
Straightforward and simpler to code / Dependencies between elements are unclear and not well structured
Lower memory requirements / Data is insecure because it is exposed across the whole program
Easy to track the program flow / Hard to divide the work among programmers in a team

Classes

A class is a template/blueprint used to create objects

Java	Python
a pure OOP language	supports OOP
code must be written in classes	classes are optional
executable class must have `main()`	scripts run without including a class
Encapsulation can be enforced by declaring fields as private	fields (global variables) are public by default
Visibility is managed through access modifiers	N/A ("_" to identify private data attributes, but still accessible)

class <class-name>(<superclass>):
    <variable-name> = <value>  # class fields - data members

    def __init__(self, <parameters>):  # class constructor - object builder
        <code>

    def <method-name>(self, <parameters>):  # methods
        <code>
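
A concrete instance of the template above, using a hypothetical `Account` class and a `SavingsAccount` subclass. The names and values are made up for illustration.

```python
class Account:
    interest_rate = 0.02  # class field shared by all instances

    def __init__(self, owner, balance=0.0):  # constructor
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):  # method
        self.balance += amount
        return self.balance


class SavingsAccount(Account):  # SavingsAccount extends Account
    def add_interest(self):
        return self.deposit(self.balance * self.interest_rate)


acct = SavingsAccount("Kim", 100.0)
acct.add_interest()
print(acct.owner, acct.balance)
```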

Classes Py

Keywords	Functions
`class`	`__init__()`
`self`: name used by convention to refer to object properties	`del`: statement used to delete a reference to an object
`pass`: keyword used to occupy no-code placement in a function	`__str__()`: the function is used to return the string representation of instances
`cls`: name used by convention to refer to class properties	`super()`: the function is used to call a parent method in a child class

Accessors: functions (with no parameters) in a Python class that provide access to the data attributes of an object.
- known as getter methods, are named starting with the verb get, followed by the field name, which should start with an uppercase letter.
Mutators: procedures (with parameter) in a Python class that enable the developer to modify the values of object attributes.
- known as setter methods, are named starting with the verb set, followed by the field name, which should start with an uppercase letter.

def get<Variable>(self):
    return self.<field>

def set<Variable> (self, value):
    self.<field> = value
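
A small example of an accessor and a mutator following the naming rule described above, using a hypothetical `Customer` class. The leading underscore only signals "private" by convention; Python does not enforce it.

```python
class Customer:
    def __init__(self, name):
        self._name = name  # leading underscore marks the attribute as private by convention

    def getName(self):  # accessor: no parameters besides self
        return self._name

    def setName(self, value):  # mutator: takes the new value
        self._name = value


c = Customer("Lee")
c.setName("Park")
print(c.getName())
```

Idiomatic Python code often uses the `@property` decorator instead of explicit get/set methods; the explicit form is shown here because it mirrors the Java-style convention in these notes.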

Classes Java

public class Bank {
  private Customer customer;
  private String branch;

  public Bank() {
    customer = new Customer();
  }

  public Bank(String name) {
    this();
    this.branch = name;
  }

  public boolean find(Bank bank) {
    return this.branch.equals(bank.branch);
  }
}
