Research methodology

· One min read

Quantitative and Qualitative Methods

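| Category | Sub-category | Quantitative | Qualitative |
| --- | --- | --- | --- |
| Requirement | Question | Hypothesis | Interest |
| Requirement | Method | Control and randomization | Curiosity and reflexivity |
| Requirement | Data collection | Response | Viewpoint |
| Requirement | Outcome | Dependent variable | Accounts |
| Ideal | Data | Numerical | Textual |
| Ideal | Sample size | Large (power) | Small (saturation) |
| Ideal | Context | Eliminated | Highlighted |
| Ideal | Analysis | Rejection of null | Synthesis |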

Types of Research Methodology

  • Historical: Qualitative
  • Comparative: Qualitative
  • Descriptive: Qualitative
  • Correlation: Quantitative
  • Experimental: Quantitative
  • Evaluation: Qualitative
  • Action: Qualitative
  • Ethnographic: Various (not quantitative)
  • Ethnogenic: Various (not quantitative)
  • Feminist/Identity Politics: Various (not quantitative)
  • Cultural: Various (not quantitative)

Common data collection methods

Qualitative data collection methods

  • Observations: recording what you have seen, heard, or encountered in detailed field notes
  • Interviews: asking people questions in one-on-one conversations
  • Focus groups: asking questions and generating discussion among a group of people
  • Surveys: distributing questionnaires with open-ended questions
  • Secondary research: collecting existing data in the form of texts, images, audio or video recordings, etc.

Quantitative data collection methods

  • Experiments
  • Computer Simulation and Agent-Based Models
  • Controlled observations
  • Surveys: paper, kiosk, mobile, questionnaires
  • Longitudinal studies
  • Polls and Telephone interviews
  • Face-to-face interviews

FDA +006

· 9 min read

Unsupervised machine learning

| Item | Supervised machine learning | Unsupervised machine learning |
| --- | --- | --- |
| Data availability | Input and output variables will be given. | Only the input data will be given. |
| Labeling | Algorithms are trained using labelled data. | Algorithms are used against data which is not labelled. |
| Algorithms | Support Vector Machine, Linear and Logistic Regression and Classification Trees. | Cluster algorithms, K-means, Hierarchical clustering, etc. |
| Complexity | Simpler method. | Computationally complex. |
| Learning mode | The learning method takes place offline. | The learning method takes place in real-time. |
| Reliability | Highly accurate and trustworthy method. | Less accurate and less trustworthy method. |

Processing data

  • most common tasks are clustering, anomaly detection, and neural networks.
  • infer underlying patterns without human supervision or intervention and enable us to discover both the differences and similarities in a dataset.
  • can be considered ideal solutions for exploratory data mining.

Clustering

objects (unlabelled data) are organised into groups, where the members of each group are similar in some way to each other and less similar to those in other groups.

  • Classification assigns objects/data to the predefined (labelled) classes
  • Clustering groups the objects/data based on the similarities between them
  • used in pattern recognition, image analysis and bioinformatics.
  • different clustering algorithms can produce different results based on their own definition of a cluster
  • the parameters (such as the distance function, density threshold and the number of expected clusters) of the clustering algorithm should be set based on the particular characteristics of the dataset and the user’s intention
| Domain | Use cases |
| --- | --- |
| Biology and bioinformatics | Cluster algorithms have been used in biological systematics for comparing the genus differences in organisms. |
| Medicine | Cluster analysis can be used to detect underlying factors of particular diseases, such as coronary artery disease. It is also used to describe patterns of antibiotic resistance. |
| Market basket | Cluster analysis has gained increasing popularity in market research. It can be used to classify different groups of consumers by behaviour analysis. It helps to build a better understanding of market segmentation, pricing and new product testing. |
| Computer science | Clustering is a powerful tool for various tasks in the area of computer science, such as reforming functionality in software evolution, object recognition in computer vision and lexical ambiguity in natural language processing. |
| Car insurance | Identify customer groups with high average claim costs. |
  • Similarity measure: a numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1 (complete similarity)
  • Dissimilarity, or distance measure: a numerical measure of how different two data objects are; it ranges from 0 (objects are alike) to $\infty$ (objects are completely different)
  • Proximity: Refers to a similarity or dissimilarity

Distance measures

Distance metrics or dissimilarity measures

  • basically deal with finding the proximity or distance between data points and determining if they can be clustered together.
  • Manhattan distance: the distance between two vectors when movement is only allowed at right angles (along the axes).
    • $Dist(A, B) = \sum_{i} |a_i - b_i|$
    • no diagonal movement involved in calculating the distance.
  • Euclidean distance: can best be explained as the length of a segment connecting two points.
    • $Dist(A, B) = \sqrt{\sum_{i} (a_i - b_i)^2}$
    • calculated from the cartesian coordinates of the points using the Pythagorean theorem.
    • Typically, one needs to normalize the data before using this distance measure.
    • as the dimensionality of your data increases, Euclidean distance becomes less useful
  • Cosine similarity: the cosine of the angle between two vectors.
    • $Sim(A, B) = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2 \cdot \sum_{i} b_i^2}}$
    • a way to counteract Euclidean distance's problem with high dimensionality.
    • equals the inner product of the two vectors after both have been normalized to length one
    • The magnitude of vectors is not taken into account, merely their direction.
      • In practice, this means that the differences in values are not fully taken into account.
  • Single link: the shortest distance between points in two clusters.
  • Complete link: the largest distance between points in two clusters.
  • Average link: the average distance between points in two clusters.
  • Centroid: the distance between the centroids of two clusters.
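The three point-to-point measures listed above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation; the function names and the sample vectors are made up for the example.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences (no diagonal movement)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Length of the straight segment between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; magnitude is ignored."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # undefined for an all-zero vector

a, b = [1.0, 2.0], [4.0, 6.0]
print(manhattan(a, b))                             # 7.0
print(euclidean(a, b))                             # 5.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
```

The last call shows why cosine similarity ignores magnitude: the two vectors differ in length but point in the same direction, so the similarity is 1.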

Weighted distance measures

$Dist(A, B) = \sqrt{\sum_{i} w_i (a_i - b_i)^2}$

  • assign a weight $w_i$ to each attribute, since some attributes are more important than others.
  • force clustering to pay more attention to higher weight attributes and form clusters that depend more on those heavily weighted attributes.
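A small sketch of the weighted distance, assuming the weights are given as a list aligned with the attributes; the function name and the numbers are illustrative only.

```python
import math

def weighted_euclidean(a, b, w):
    """Euclidean distance where attribute i contributes w[i] times its squared difference."""
    return math.sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))

a, b = [0.0, 0.0], [3.0, 4.0]
print(weighted_euclidean(a, b, [1.0, 1.0]))  # 5.0, same as plain Euclidean
print(weighted_euclidean(a, b, [4.0, 0.0]))  # 6.0, only the first attribute counts
```

With equal weights the result matches the unweighted distance; raising one weight makes differences in that attribute dominate the result.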

Dissimilarity

  • Simple matching coefficient, SMC: invariant, if the binary variable is symmetric.
    • $d(i,j) = \frac{b+c}{a+b+c+d}$
      • the proportion of mismatches (b+c) out of all attributes (a+b+c+d).
    • The simple matching coefficient is used when 0 and 1 are equally important, treating matches of both 1s and 0s the same way.
  • Jaccard coefficient: non-invariant, if the binary variable is asymmetric.
    • $d(i,j) = \frac{b+c}{a+b+c}$
      • ignores cases where both are 0 (d), and only considers mismatches relative to at least one positive case.
    • The Jaccard coefficient is used when 1 (presence) is more meaningful than 0 (absence).
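The two coefficients can be compared on a pair of binary vectors. The sketch below assumes the usual contingency counts, which the notes do not spell out: a counts positions where both objects are 1, b and c count the two kinds of mismatch, and d counts positions where both are 0. The helper names are hypothetical.

```python
def binary_counts(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def smc_dissimilarity(x, y):
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_dissimilarity(x, y):
    a, b, c, _ = binary_counts(x, y)   # d (joint absences) is ignored
    return (b + c) / (a + b + c)       # undefined if both vectors are all zeros

x = [1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 0]
print(binary_counts(x, y))             # (1, 1, 1, 3)
print(smc_dissimilarity(x, y))         # 2/6
print(jaccard_dissimilarity(x, y))     # 2/3
```

The three shared zeros make the objects look fairly similar under SMC, while the Jaccard version disregards them and reports a larger dissimilarity.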

Similarity Matrix

  • After calculating all distances, we can create a similarity matrix containing the distance between each pair of data points.

| ID | Gender | Age | Salary |
| --- | --- | --- | --- |
| 1 | M | 45 | 45000 |
| 2 | F | 32 | 54000 |
| 3 | F | 23 | 32000 |
| 4 | M | 36 | 58000 |
  • Gender: binarized
  • Age: normalized
  • Salary: normalized
| ID | Gender | Age | Salary |
| --- | --- | --- | --- |
| 1 | 1 | 1 | 0.25 |
| 2 | 0 | 0.6 | 0.7 |
| 3 | 0 | 0 | 0 |
| 4 | 1 | 0.7 | 0.8 |
  • $dist(ID2, ID3) = \sqrt{(0-0)^2 + (0.6-0)^2 + (0.7-0)^2} \approx 0.92$
  • $dist(ID2, ID4) = \sqrt{(0-1)^2 + (0.6-0.7)^2 + (0.7-0.8)^2} \approx 1.01$
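The full matrix for the four encoded rows can be built by applying the same Euclidean formula to every pair. The sketch below copies the normalized values from the second table as given; the variable names are illustrative.

```python
import math

# Encoded rows from the table: (gender, age, salary)
rows = {
    1: (1, 1.0, 0.25),
    2: (0, 0.6, 0.7),
    3: (0, 0.0, 0.0),
    4: (1, 0.7, 0.8),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ids = sorted(rows)
# matrix[(i, j)] holds the distance between object i and object j
matrix = {(i, j): euclidean(rows[i], rows[j]) for i in ids for j in ids}

for i in ids:
    print(i, [round(matrix[(i, j)], 2) for j in ids])
```

The diagonal is zero and the matrix is symmetric, so in practice only the upper or lower triangle needs to be stored.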

Clustering methodologies

  • Hierarchical approach: create trees of clusters and sub-clusters
    • Divisive (Top-down): Start with all examples in a single cluster, and decide how to break the cluster into multiple sub-clusters.
    • Agglomerative (Bottom-up): Start with each example in its own separate cluster. Decide which clusters to merge.
  • Partitional (K starting points): Start with $K$ random cluster centers, and decide which examples to put in each of the clusters.
    • Adjust the cluster centers after each allocation of examples to clusters.
    • k-means, k-medoids
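The partitional idea can be sketched as a bare-bones k-means loop. This is an assumption-laden illustration: it uses the common batch variant (recompute all centers once per pass rather than after every single allocation), picks the starting centers from the data with a fixed seed, and runs a fixed number of iterations instead of testing for convergence.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # K random starting centers
    clusters = []
    for _ in range(iterations):
        # Allocation: put each point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Adjustment: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:                    # an empty cluster keeps its old center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

k-medoids follows the same loop but restricts each center to be an actual data point.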

Choosing a clustering method

| Consideration | What to look for | Typical choices |
| --- | --- | --- |
| Scalability | Near-linear time and bounded memory on large datasets. | MiniBatch K-Means, BIRCH, scalable DBSCAN with indexing. |
| Arbitrary shapes | Ability to find non-spherical clusters. | DBSCAN, HDBSCAN, Spectral clustering. |
| Noise and outliers | Robustness to noise; ability to mark points as noise. | DBSCAN, HDBSCAN (labels noise), GMM with low-weight components. |
| Mixed attribute types | Works with categorical + numeric or custom distances. | k-prototypes/k-modes, Agglomerative with Gower distance. |
| Few parameters | Minimal, intuitive hyperparameters; stable defaults. | Agglomerative (linkage, distance), HDBSCAN (min cluster size). |
| Order insensitivity | Results independent of input order. | Most batch methods; shuffle for MiniBatch K-Means. |
| High dimensionality | Handles curse of dimensionality or uses reduction. | PCA + K-Means/Agglomerative, Spectral after reduction, cosine distance. |
| User constraints | Must-link/cannot-link or size constraints supported. | COP-K-Means, constrained agglomerative, semi-supervised variants. |
| Interpretability | Easy to explain clusters and decisions. | K-Means centroids, Agglomerative dendrograms, GMM probabilities. |

Clustering Terminology

Clustering Points

  • Centroid: a point in the middle of a cluster. It may not be an actual point in the dataset.
  • Medoid: an actual point in the dataset that is centrally located and is, therefore, representative of the cluster.
  • Representative points: points around the cluster that are representative of it.
  • High intra-class similarity: the homogeneity, the closeness of data points within a single cluster
  • Low inter-class similarity: the separation, the distance between two separate clusters

Class in Cluster

  • A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.

Hierarchical clustering

  • Hierarchical approaches lead to the formation of dendrograms
  • The top and bottom of a dendrogram represent the two extremes of clustering
    • At the bottom, a leaf is an individual cluster
    • At the top, the root is one cluster

AGNES

AGglomerative NESting hierarchical clustering algorithm.

  • Agglomerative hierarchical clustering follows a bottom-up approach
    • starting with clusters of single objects and merging them into bigger and bigger clusters
  • agglomerative clustering process terminates (or finishes) when a termination condition is satisfied or there is only one cluster containing all objects.
  • based on Euclidean distance between two objects
  • steps of the algorithm:
    1. Place each object in its own single-member cluster
    2. Calculate the Euclidean distance between each pair of clusters
    3. Choose the cluster pair with the smallest distance and merge them to make one cluster
    4. Recalculate the distances between the new combined cluster and the other, older clusters
    5. Repeat steps 3 and 4 until all the objects are merged into a single cluster.
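The steps above can be sketched as a short loop. The notes do not say how the distance between two multi-object clusters is measured, so this sketch assumes single linkage (closest pair of points); it also recomputes all distances on every pass instead of caching them, which is simple but slow. The function names are illustrative.

```python
import math
from itertools import combinations

def cluster_distance(c1, c2):
    """Assumed single-link distance: closest pair of points across the two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agnes(points):
    clusters = [[p] for p in points]          # step 1: one cluster per object
    merges = []                               # record of which pairs were merged
    while len(clusters) > 1:                  # step 5: stop at a single cluster
        # steps 2-3: find the closest pair of clusters
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # step 4: replace the pair by the merged cluster before the next pass
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

for left, right in agnes([(0, 0), (0, 1), (5, 5)]):
    print(left, "+", right)
```

Reading the recorded merges from first to last traces the dendrogram from the leaves up to the root.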

DIANA

DIvisive ANAlysis clustering algorithm.

  • The top-to-bottom approach is followed in divisive hierarchical clustering
    • starts with a cluster containing all objects.
  • This cluster is broken up into smaller clusters, and this process of breaking up clusters continues until each cluster contains one object or a given termination condition is satisfied.
  • steps of the algorithm:
    1. The process starts at the root with all the points as one cluster.
    2. It recursively splits higher-level clusters to build the dendrogram.
    3. It can be considered as a global approach.
    4. It is more efficient when compared with agglomerative clustering.

AGNES vs DIANA

Single-linkage clustering

Also called the minimum method, connectedness method, or nearest neighbour method.

  • two clusters are linked by a single element pair
  • The distance between clusters is defined as the shortest distance from a member of the first cluster to a member of the second cluster.

Complete-linkage clustering

Also called the furthest neighbour method, maximum method, or diameter method.

  • the distance between two clusters is defined as the greatest distance between any member of the first cluster and any member of second cluster

Average-linkage clustering

Also called the unweighted pair group method with arithmetic mean (UPGMA).

  • the distance between two clusters is calculated by averaging the distance between each member of first cluster and each member of second cluster
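The three linkage definitions differ only in how they summarise the point-to-point distances between two clusters. A minimal sketch, with made-up clusters on a line so the numbers are easy to check by hand:

```python
import math

def pairwise(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [math.dist(p, q) for p in c1 for q in c2]

def single_link(c1, c2):
    return min(pairwise(c1, c2))

def complete_link(c1, c2):
    return max(pairwise(c1, c2))

def average_link(c1, c2):
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(left, right))    # 2.0
print(complete_link(left, right))  # 5.0
print(average_link(left, right))   # 3.5
```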

FDA +005

· 3 min read

Data visualization

  • Successful visualization requires that data be converted into a visual format.
  • The motivation is to play to the strengths of people
    • people can quickly absorb a large amount of information and find patterns in it.

Data visualization process

  • a human looks at the visual and perceives information.
  • after perception, the human should be able to answer some questions by looking at the visual.
| Item | Exploratory | Explanatory |
| --- | --- | --- |
| Purpose | To analyse data to solve a question or develop a hypothesis. | To convey a message or idea. |
| Target audience | Expert users with prior knowledge of the subject. | Non-expert users with limited or no background knowledge. |
| When | Usually happens during the data analytics project and is internal facing. | Usually happens after the exploration phase and is often external facing. |
| Approach | Unguided, users explore, with no clear conclusion. | Guided through author-chosen comparisons, clear conclusions. |
| Representation | Has an analytical purpose and represents the complexity of data. | No analytical purpose and represents understandable data. |

Descriptive statistics

  • Describe some data through a quantitative summarisation of its behaviour
  • Help us summarise data in a meaningful way.
  • Highlight things like whether there are any values that are ill defined, or which make no sense.
  • measures of central tendency
    • mean
    • median
    • mode
  • measures of spread
    • range
    • variance
    • standard deviation
    • interquartile range

Inferential statistics

  • includes more advanced methods such as hypothesis tests, ANOVA, and regression.
  • make claims about how general this dataset is.
  • we can make inferences from a sample to a population.

Measures of central tendency

  • a quick and easy way to describe a dataset by condensing it down to just one representative value.
  • can easily compare one dataset to another.
  • Mean: the average of a dataset.
  • Median: the middle value in a dataset.
  • Mode: the most commonly occurring value in a dataset.

Distribution

Measures of spread

  • how similar or varied a set of values are for a particular variable
  • summarise the data in a more detailed way that shows how scattered the values are and how much they differ from the central tendency.
  • Range: the largest value of a variable minus the smallest.
    • The bigger the range the more spread out a data set is.
  • Variance: how far a set of numbers are spread out from their average value.
  • Standard deviation: the square root of the variance
    • gives us back a value that has the same units as the mean
  • Interquartile range, IQR: the difference between the upper quartile (the median of the upper half) and the lower quartile (the median of the lower half).
    • First, find the median of a set of data.
    • Then, find the medians of the upper and lower half of the data.

Frequency distribution

  • Frequency is the number of times a data value occurs or repeats.
  • $frequency(v_i) = \frac{\text{number of objects with attribute value } v_i}{m}$, where $m$ is the total number of objects (this is the relative frequency).
  • Visual displays that organise and present frequency counts so that the information can be interpreted more easily.
  • can show absolute frequencies or relative frequencies, such as proportions or percentages.
  • can be shown in a table or graph.
  • Some common methods of showing frequency distributions include frequency tables, histograms or bar charts.
  • Frequency tables: display the number of occurrences of a particular value or characteristic.
  • Histograms: a type of graph in which each column represents a range (bin) of values of a numeric variable
    • useful for describing the shape, centre and spread to better understand the distribution of the dataset.
  • Bar charts: a type of graph in which each column represents a categorical variable or a discrete ungrouped numeric variable.
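A frequency table with both absolute and relative frequencies can be built with `collections.Counter`. The sample values are invented for the example.

```python
from collections import Counter

grades = ["B", "A", "C", "B", "B", "A"]

counts = Counter(grades)                 # absolute frequencies
m = len(grades)                          # total number of objects
relative = {value: n / m for value, n in counts.items()}

# Print a small frequency table: value, count, percentage
for value in sorted(counts):
    print(value, counts[value], f"{relative[value]:.0%}")
```

The relative frequencies always sum to 1, which makes distributions of different sizes comparable.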

Vocabulary for AI +007

· 3 min read

Vocabulary & Expressions

| Term/Expression | Definition | Simpler paraphrase | Meaning (Korean) |
| --- | --- | --- | --- |
| referee | A person who supervises a game or match to ensure the rules are followed | umpire | 심판 |
| nonterminal state | The agent is still in the middle of the episode. The environment can continue to produce rewards, and the agent can still take actions. | ongoing state | 비종결 상태 |
| apprenticeship | a period of time working as an apprentice | internship | 수습 기간 |
| in a similar vein | in a similar way | similarly | 비슷한 맥락에서 |
| subsequent | coming after something in time; following | following | 그 다음의 |
| in the sense | in the most limited meaning of a word, phrase, etc. | in the meaning | ~라는 의미에서 |
| hypothesis | a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation | assumption | 가설 |
| untimely | happening or done at an unsuitable time | ill-timed | 때 아닌, 시기상조의 |
| utility | the state of being useful, profitable, or beneficial | usefulness | 효용 |
| thereafter | after that time | after that | 그 후에 |
| analogy | a comparison between two things, typically for the purpose of explanation or clarification | comparison | 유추, 비유 |
| inherent | existing in something as a permanent, essential, or characteristic attribute | intrinsic | 내재하는, 타고난 |

RL

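| Item | Policy evaluation | Passive learning | Policy search |
| --- | --- | --- | --- |
| Policy ($\pi$) | Fixed, given externally | Fixed, given externally | None; searched for and improved directly |
| Environment model (P, R) | Fully known (complete MDP) | Unknown, estimated from experience | May or may not be known |
| Learning goal | Compute $U^\pi(s)$ under the given policy | Estimate $U^\pi(s)$ from experience under the given policy | Find the optimal policy $\pi^*$ directly |
| Approach | Iterative computation based on the Bellman equation (iterative policy evaluation) | Observe transitions and rewards while actually exploring the environment, then estimate utility from them | Change the policy parameters little by little to maximize return (policy gradient, evolutionary search, etc.) |
| Required data | None (the model is fully given) | Environment experience (trajectory, reward sequence) | Environment experience (reward feedback), sometimes gradients |
| Computation/learning characteristics | A computation problem; can converge without error | Low sample efficiency; uses Monte Carlo/TD methods | High gradient variance; risk of local optima |
| Example applications | Textbook Gridworld (the model is fully known) | Black-box environment, fixed-policy experiments (following a simulation along) | Robot control, continuous action spaces (PPO, REINFORCE, etc.) |
| Advantages | Accurate and fast; easy to interpret as long as a model is available | Works without a model; learns in the real environment | Can directly optimize continuous, complex actions |
| Disadvantages | In real environments the model is often unknown | Cannot improve the policy (evaluation only) | Unstable training; needs a lot of data |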

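$R(S_t, \pi(S_t), S_{t+1})$

At time $t$ the agent is in state $S_t$, takes the action $\pi(S_t)$ chosen by the policy, and arrives in the next state $S_{t+1}$. The reward it receives is $R(S_t, \pi(S_t), S_{t+1})$.

  • The reward received for taking the policy's action in the current state and moving to the next state
  • $S_t$: the current state of the agent at time $t$
  • $\pi(S_t)$: the action selected by policy $\pi$ in the current state $S_t$
  • $S_{t+1}$: the next state reached after performing that action
  • $R$: the reward function, determined by these three things (current state, action, next state)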
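To make the role of the reward function concrete, here is a minimal sketch of iterative policy evaluation for a fixed policy when the model is known. It repeatedly applies the Bellman update, setting each state's utility to the probability-weighted sum of reward plus discounted next-state utility. The three-state chain, its probabilities, its rewards, and the discount factor are all hypothetical values chosen for illustration.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical chain a -> b -> end, with a fixed policy that always goes "right".
policy = {"a": "right", "b": "right"}
# transitions[(state, action)] = list of (probability, next_state)
transitions = {
    ("a", "right"): [(0.8, "b"), (0.2, "a")],
    ("b", "right"): [(1.0, "end")],
}

def reward(s, action, s_next):
    """R(s, pi(s), s'): +1 for reaching the terminal state, a small cost otherwise."""
    return 1.0 if s_next == "end" else -0.1

def evaluate_policy(policy, transitions, reward, gamma, sweeps=100):
    # Terminal states have no entry in `policy`; their utility stays 0.
    states = set(policy) | {s2 for outs in transitions.values() for _, s2 in outs}
    utility = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s, action in policy.items():
            utility[s] = sum(
                prob * (reward(s, action, s_next) + gamma * utility[s_next])
                for prob, s_next in transitions[(s, action)]
            )
    return utility

utility = evaluate_policy(policy, transitions, reward, GAMMA)
print({s: round(u, 3) for s, u in sorted(utility.items())})
```

Passive learning methods estimate the same utilities without access to `transitions` or `reward`, using only observed trajectories.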

Passive RL

  • Direct Utility Estimation
  • ADP (Adaptive Dynamic Programming)
  • TD (Temporal Difference Learning)

FSD +007

· 3 min read

Collection

a data structure that groups multiple elements together logically.

  • referenced by the collection name, and its elements are accessed using indexing or look up methods.
  • some collections allow dynamic sizing, expanding and shrinking with the data, others have a fixed size.
  • can store mixed data types.
    • Java collections are typically type-safe using generics.
    • Python uses dynamic typing.

Python List

mylist = ["Tom", 30, 112.5]
len(mylist)
mylist[index]
mylist.append(item)
mylist.insert(index, item)
# returns a slice of a list from first to last-1
mylist[first:last]
mylist.index(item)

# replaces items from first to last-1 with the items of another list
mylist[first:last] = list_values
# creates a new list with list2 added at the end of list1
mylist = list1 + list2
# adds the items of list2 at the end of list1, in place
list1.extend(list2)

mylist.remove(item)
mylist.pop(index)
mylist.pop()
del mylist[index]
del mylist
mylist.clear()

# Sorts the list in place, ascending (items must be mutually comparable)
mylist.sort()
mylist.sort(reverse=True)
mylist.reverse()
# counts how many times item occurs in the list
mylist.count(item)

Python Set

  • unordered, not indexed.
  • mutable, unique.
myset = { 'hello', 5, True, 3.5 }

for x in myset:
    print(x)

myset.add(item)
# Merges otherset into myset in place, retaining unique values
myset.update(otherset)
# Returns a new set with the items of both sets (only unique items are retained)
mynewset = myset.union(otherset)
# Retains only the items that exist in both set1 and set2
myset = set1.intersection(set2)

# raises KeyError if item is not found
myset.remove(item)
# does nothing if item is not found
myset.discard(item)

myset.pop()
myset.clear()
del myset

Python Tuple

  • a collection of items of any type
  • ordered, indexed
  • unchangeable: once a tuple is created, its elements are fixed
mytuple = ("Tom", 30, 112.5)
len(mytuple)
mytuple[index]
mytuple[first:last]
mytuple = tuple1 + tuple2

Python Dictionary

  • a collection of items represented as key-value pairs
  • indexed by unique keys (insertion order is preserved since Python 3.7)
  • items are mutable
  • allow duplicate values but not duplicate keys
mydata = {
    "name": "Tom",
    "age": 30,
    "role": "admin"
}

mydata.keys()
len(mydata)
mydata[key]
mydata[key] = new_value
del mydata[key]
del mydata

# Removes the entry associated with key and returns its value
val = mydata.pop(key)
# Updates/Inserts { k: v } entry into the dictionary
mydata.update({ k: v })

Java List

  • an interface of the Java Collection Framework (JCF)
  • cannot be instantiated directly; use an implementing class.
  • common implementation of List interface
    • ArrayList
    • LinkedList
List<Integer> numbers = new ArrayList<>();
List<String> names = new LinkedList<>();

numbers.get(0);
numbers.get(numbers.size() - 1);

names.get(names.indexOf("Hello"));
names.get(names.lastIndexOf("Hello"));

numbers.add(5);
names.remove("Hello");
names.remove(names.indexOf("Hello"));

numbers.removeAll(anotherList);
// set(2, 12) replaces the item at index 2 with 12
numbers.set(2, 12);

Java Set

  • an interface of the Java Collection Framework (JCF)
  • unordered, unique objects.
// Either start with an empty set...
HashSet<String> names = new HashSet<>();
// ...or initialise it from a list (commented out to avoid redeclaring names)
// HashSet<String> names = new HashSet<>(Arrays.asList("Tom", "Jerry", "Mickey"));

ArrayList<String> list1 = new ArrayList<>();
ArrayList<String> list2 = new ArrayList<>();

list1.add("Tom");
list1.add("Jerry");

names.addAll(list1);
names.addAll(list2);

names.remove("Tom");
boolean isRemoved = names.remove("Tom");

for (String name : names) {
    System.out.println(name);
}

Iterator<String> it = names.iterator();
while (it.hasNext()) {
    System.out.println(it.next());
}

names.clear();
names.isEmpty();
names.contains("Tim");
names.size();
names.removeAll(set2);
names.containsAll(set2);
// Retain set2 elements and discard the rest
names.retainAll(set2);

Java Map

  • an interface from java.util that stores data as key-value pairs
  • contains unique keys that are associated with specific values.
HashMap<Integer, String> people = new HashMap<>();
people.put(1, "Tom");
people.put(2, "Jerry");
people.put(3, "Mickey");

people.putIfAbsent(2, "Donald");
System.out.println(people.get(2));

people.put(2, "Lucy");
people.replace(2, "Amy");
people.remove(2);

System.out.println(people.keySet());
System.out.println(people.values());

people.clear();
people.isEmpty();
people.containsKey(2);
people.size();
people.getOrDefault(50, "Unknown");
// Checks if the value is mapped with one or more keys
people.containsValue("Jim");

Operation Patterns

  • Finding an item in a list: Using the lookup pattern
  • Finding multiple items in a list: Using the updated-lookup pattern
  • Removing certain items from a list: Using the remove-all pattern

Sentence structures

· 4 min read

Types of sentence structure

The secret to good writing is variation and using a mix of these types of sentences within your paragraphs in your written work.

  • Simple sentence
    • is one independent clause in a subject-verb pattern
    • e.g. The Australian government introduced an official carbon tax on 1 July 2012.
  • Compound sentence
    • is two independent clauses connected by a coordinating conjunction.
    • e.g. The Australian government introduced an official carbon tax on 1 July 2012, but this was met with opposition from the general public.
  • Complex sentence
    • consists of an independent clause and a dependent clause.
    • e.g. As the Australian government recognized the necessity to significantly reduce greenhouse gas emissions, it introduced an official carbon tax on 1 July 2012.
  • Compound-complex sentence
    • consists of more than one independent clause and one or more dependent clauses
    • e.g. As the Australian government recognized the necessity to significantly reduce greenhouse gas emissions, it introduced an official carbon tax on 1 July 2012, but this was met with opposition from the general public.

Common sentence structure errors

Sentence fragments

  • A sentence fragment is missing some of its parts.
  • There are three main reasons why a sentence may be incomplete.
    • Missing subject
      • e.g. Becoming extinct because of rising sea temperatures.
      • Correction: Phytoplankton could become extinct because of rising sea temperatures.
    • Missing verb
      • e.g. Significantly, one particular form of Western Australian finch.
      • Correction: Significantly, one particular form of Western Australian finch has decreased in numbers.
    • Incomplete thought
      • e.g. In a recent article about loss of habitat due to climate change.
      • Correction: In a recent article about loss of habitat due to climate change, Australian animals were shown to be particularly vulnerable.
  • Sentences beginning with words like so, as, because, who, which, that are often incomplete.

Run-on sentences

  • A run-on sentence occurs when two simple sentences are incorrectly joined. e.g. Poverty, famine and major public health problems around the developing world are important indicators of a changing climate these issues are not being addressed globally.
  • Use a joining or linking word such as and, but, or, nor, for, so, yet.
    • Correction: Poverty, famine and major public health problems around the developing world are important indicators of a changing climate, but these issues are not being addressed globally.
  • Make two separate sentences.
    • Correction: Poverty, famine and major public health problems around the developing world are important indicators of a changing climate. These issues are not being addressed globally.

Lack of Meaning

  • Ensure that each sentence you write has clear meaning in English.
  • It must be fully understandable when read.
  • If you are not sure whether your sentence has clear meaning in English, consider rewriting it in a simpler and clearer way that you can fully understand (as, hopefully, will your reader).

Tips for writing

  • Consider the following example where each sentence follows a similar structure.
    • Topic Sentence: main point of the paragraph.
    • Supporting Sentence: Examples, evidence, or analysis.
    • Concluding Sentence: wrap up paragraph by linking to broader topic, or linking to the next paragraph or section.
  • This uniformity leads to a lack of cohesion, making the paragraph feel disjointed and somewhat monotonous.
    • e.g. Nursing education states that measures should be in place to avoid infection. Also, that infection rates tend to soar when hygiene standards decrease. Appropriate steps should be taken to decrease these risks. It is suggested that medical staff are educated to understand these risks.
    • Correction: Nursing educators argue that strict measures should be implemented to avoid infection in medical institutions. There is also much evidence to demonstrate that infection rates rise dramatically when hygiene standards begin to fall. Therefore, it is argued that appropriate steps need to be in place to decrease and minimize these potential risks. Furthermore, aggressive steps should be taken to ensure that all staff maintain effective hygiene and infection control.

VLA Test Review

· 6 min read

VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

  • VLATest fuzzes 18,604 manipulation scenes (10 operators, 4 tasks) to systematically stress-test VLA robustness.
  • Seven VLA models show low success and brittleness to confounders, lighting/camera changes, unseen objects, and instruction mutations; larger pretraining helps.
  • Priorities: scale/augment demo data (incl. sim2real), use stepwise/CoT prompting & multi-agent setups, and expand benchmarks with online risk assessment.

Motivation & Gap

  • Problem: Current VLA models are typically evaluated on small, hand-crafted scenes, leaving general performance and robustness in diverse scenarios underexplored.
  • Goal: Introduce VLATest, a generation-based fuzzing framework that automatically creates robotic manipulation scenes to test performance and robustness of VLA models.

What Are VLA Models?

  • Vision-Language-Action (VLA) models take natural language instructions + camera images and output low-level robot actions (Δx, Δθ, Δgrip).
  • Inference loop: Tokenize text/image → transformer predicts action token A₁ → execute → append A₁ + new image tokens I₂ → predict A₂ → … until success or step limit.

VLA Architecture

Training & Evaluation

  • Training: (1) Train from scratch on robot demonstrations, or (2) fine-tune a large pretrained VLM (e.g., LLaVA) with >1B parameters.
  • Evaluation: Task-specific metrics (e.g., grasp, lift, hold for “pick up”), either in sim (auto-metrics) or real (manual labels).

VLATest Framework

  • Ten testing operators grouped across:
    • Target objects: type, position, orientation
    • Confounding objects: type, position, orientation, count
    • Lighting: intensity
    • Camera: position, orientation
  • Scene generation (Alg. 1): sample valid targets → (optional) confounders → mutate lighting (factor α) → mutate camera pose (d, θ). Semantic validity checks prevent infeasible scenes.

VLA Test

Research Questions (RQ)

  • RQ1: Basic performance on popular manipulation tasks
  • RQ2: Effect of confounding object count
  • RQ3: Effect of lighting changes
  • RQ4: Effect of camera pose changes
  • RQ5: Robustness to unseen objects (OOD)
  • RQ6: Robustness to instruction mutations

Tasks & Prompting

  • Tasks:
    1. Pick up an object (grasp + lift ≥0.02 m for 5 frames)
    2. Move A near B (≤0.05 m)
    3. Put A on B (stable stacking)
    4. Put A into B (fully inside)
  • Standard prompts (RQ1–RQ5):
    • pick up [obj] · move [objA] near [objB] · put [objA] on [objB] · put [objA] into [objB]
  • Instruction mutations (RQ6): 10 paraphrases per task (GPT-4o), manually validated for semantic equivalence.

Experimental Setup

  • Scenes: 18,604 across 4 tasks (ManiSkill2).
  • Models: 7 public VLAs (RT-1-1k/58k/400k, RT-1-X, Octo-small/base, OpenVLA-7b).
  • Compute: >580 GPU hours.

Key Results & Findings

RQ1 — Overall Performance

  • VLA models underperform overall; no single model dominates across tasks.
  • Example best-case rates (default settings): 34.4% (Task1, RT-1-400k), 12.7% (Task2, OpenVLA-7b), 2.2% (Task3, RT-1-X), 2.1% (Task4, Octo-small).
  • Stepwise breakdown (Task 1): grasp 23.3% → lift 15.7% → hold 12.4% ⇒ difficulty composing sequential actions.
    • Implication (Finding 2): Consider stepwise prompting / chain-of-thought to decompose complex tasks.

RQ1 — Coverage Metric

  • No established coverage for VLA; adopted trajectory coverage (pragmatic).
  • Increasing cases from n=10 to n=1000 achieved 100% coverage across tasks (object-position novelty relative to workspace).

RQ2 — Confounding Objects

  • More confounders ⇒ worse performance; models struggle to locate the correct object.
  • Similarity doesn’t matter much: Mann–Whitney U shows no significant difference between similar vs dissimilar distractors (p = 0.443, 0.614, 0.657, 0.443; effect sizes ≈ 0.23–0.29).

RQ3 — Lighting Robustness

  • Lighting perturbations significantly hurt performance.
  • OpenVLA-7b most robust (77.9% of previously passed cases still pass), plausibly due to SigLIP + DINOv2 pretraining and LLaVA 1.5 mixture.
  • Sensitivity: even α < 2.5 increase drops success to ~0.7×; α > 8 ⇒ ~40% of default-pass scenes succeed.
  • Decreasing light hurts less than increasing; α < 0.2 still ~60% pass.

RQ4 — Camera Pose Robustness

  • Small pose changes (≤ rotation, ≤5 cm shift) reduce success to 34.0% of default.
  • RT-1-400k most robust (45.6% retain), OpenVLA-7b at 31.3%; Octo models <10%.
    • Likely due to training data scale differences.

RQ5 — Unseen Objects

  • Using YCB (56 unseen objects) leads to large performance drops versus seen objects: avg –74.2%, –66.7%, –66.7%, –20.0% on Tasks 1–4.
  • Transfer rate across steps:
    • $T_r^n = \dfrac{\text{Success rate}_n}{\text{Success rate}_{n-1}}$, with $\text{Success rate}_0 = 100\%$
    • Paired t-tests show significant differences on $T_r^1$ for Task 1 & 2 (p = 0.011, 0.007; Cohen’s d = 1.34, 0.891).
    • Primary failure mode: recognizing/locating unseen objects.
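
As a small worked example of the transfer rate, the sketch below applies the definition to the Task 1 stepwise success rates quoted under RQ1 (grasp 23.3%, lift 15.7%, hold 12.4%). The function name is illustrative.

```python
def transfer_rates(step_success_rates):
    """T_r^n = rate_n / rate_(n-1), with rate_0 = 100%."""
    rates = [100.0] + list(step_success_rates)
    return [cur / prev for prev, cur in zip(rates, rates[1:])]

# Stepwise success rates (%) for Task 1: grasp, lift, hold.
for step, t in zip(["grasp", "lift", "hold"], transfer_rates([23.3, 15.7, 12.4])):
    print(f"{step}: {t:.3f}")
```

A low first-step transfer rate points at recognizing or locating the object, while low later rates point at executing the remaining motion. The sketch does not guard against a step with a 0% success rate, which would cause a division by zero.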

RQ6 — Instruction Mutations

  • Mutated instructions generally reduce performance (avg drops: –32.8% T1, –1.7% T2, –8.3% T3; negligible on T4).
  • Larger language backbones help: OpenVLA-7b (Llama 2-7B) is more robust, sometimes improving under mutations (e.g., T1, T4).

Implications & Directions

  • Scale matters: larger pretraining and robot-demo datasets improve robustness (lighting/camera).
  • Data enrichment: use data augmentation and sim-to-real to diversify external factors; leverage traditional controllers to auto-generate demonstrations.
  • Prompting strategies: adopt stepwise/CoT prompting; consider multi-agent decompositions.
  • Benchmarking: the 18,604 VLATest scenes serve as an early benchmark; expand to more tasks/robots/conditions.
  • Online risk assessment: explore uncertainty estimation and safety monitoring for runtime quality control.
  • Robotics foundation models: (1) LLMs for planning/rewards; (2) Multi-modal FMs (VLMs/VLAs) for manipulation & perception.
  • CPS testing: gray-box/black-box fuzzing and search-based testing exist, but not directly applicable to VLAs (multimodality, autoregression, scale).
  • FM evaluation: beyond static benchmarks, VLATest dynamically generates 3D manipulation test cases—distinct from text-only testing.

Threats to Validity (mitigations in study)

  • Internal: randomness (mitigated by 18,604 scenes); potential prompt bias (mutations manually validated).
  • External: generalization to other tasks/models; chose popular tasks (Open X-Embodiment) and SOTA public models.
  • Construct: limited operators (lighting/camera/confounders chosen; future: #lights, camera intrinsics, resolution).
    • Coverage: trajectory coverage used as a pragmatic proxy.

Conclusion

  • VLATest: early, generation-based fuzzing framework (10 operators) for VLA testing in ManiSkill2.
  • Empirical evidence across 7 models / 4 tasks / 18,604 scenes shows limited robustness (lighting, camera, unseen objects, instruction variation).
  • Points to data scaling, prompting, benchmarking, and risk assessment as practical paths to more reliable VLA systems.

Ref

  • Wang, Z., Zhou, Z., Song, J., Huang, Y., Shu, Z., & Ma, L. (2025). VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation. Proceedings of the ACM on Software Engineering, 2(FSE), 1615–1638.

IAI +005

· 13 min read

Neural Network Development History

  • 1940s–1960s: Early Foundations
    • McCulloch & Pitts (1943): mathematical neuron model
    • Rosenblatt's Perceptron (1958): first trainable network
    • Minsky & Papert (1969): limitations (XOR problem) → AI Winter
  • 1970s–1980s: First Revival
    • Werbos (1974); Rumelhart, Hinton, Williams (1986): Backpropagation
    • Hopfield Networks (1982): associative memory
    • Renewed optimism but limited by hardware
  • 1990s: Consolidation
    • LeCun's CNN (LeNet, 1989): digit recognition
    • Elman, Jordan: Recurrent Neural Networks
    • Symbolic AI still dominated mainstream
  • 2000s: Deep Learning Foundations
    • Better hardware (GPUs) + large datasets
    • Hinton (2006): Deep Belief Networks (unsupervised pretraining)
    • Connectionism regains attention
  • 2010s: Deep Learning Boom
    • ImageNet (2012): AlexNet breakthrough
    • RNNs, LSTMs, GRUs → speech & translation
    • Transformers (2017): revolutionized NLP
  • 2020s: Scaling & Foundation Models
    • Large Language Models (GPT, BERT, etc.)
    • Multimodal AI: vision, text, speech integration
    • Connectionism dominates AI research & industry

Neural Network Models

  • A neural network is a collection of units (neurons) connected together.
  • The properties of the network are determined by its topology and the properties of the neurons.
  • Roughly speaking, the neuron fires when a linear combination of its inputs exceeds some (hard or soft) threshold.

Simple Neuron

  • $in_j = \sum_{i=0}^{n} w_{ij} a_i$
  • $out_j = g(in_j)$
  • $a_j = g\left(\sum_{i=0}^{n} w_{ij} a_i\right)$

Activation function

ReLU function

$$ReLU(x) = \max(0, x)$$

  • an abbreviation for rectified linear unit
  • Commonly used

Softplus function

$$Softplus(x) = \log(1 + e^x)$$

  • A smooth version of the ReLU function

Logistic or Sigmoid function

$$Logistic(x) = \frac{1}{1 + e^{-x}}$$

  • Non-linear, so a network built from it can represent non-linear functions

Tanh function

$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$
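
The four activation functions and the simple neuron can be written directly from their formulas. This is a minimal sketch using only the standard library; the weights and inputs in the example are arbitrary, and `inputs[0]` is fixed at 1 so that `weights[0]` plays the role of the bias.

```python
import math

def relu(x):
    return max(0.0, x)

def softplus(x):
    return math.log(1.0 + math.exp(x))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

def neuron(weights, inputs, g):
    """a_j = g(sum_i w_ij * a_i)"""
    return g(sum(w * a for w, a in zip(weights, inputs)))

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(softplus(x), 4), round(logistic(x), 4), round(tanh(x), 4))

# Weighted sum is -1 + 2 + 1 = 2, so the output equals logistic(2).
print(neuron([-1.0, 2.0, 0.5], [1.0, 1.0, 2.0], logistic))
```

These literal forms overflow for large |x|; numerical libraries use rearranged, more stable versions.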

Topology of a neural network

  • Feed-forward network (FFN):
    • Every node receives inputs from "upstream" nodes and delivers output to "downstream" nodes.
    • There are no loops.
    • FFN represents a function of its current inputs, thus it has no internal state other than the weights themselves.
  • Recurrent Network (RNN):
    • A recurrent network feeds its outputs back into its own inputs.
    • In a recurrent network, the neuron values can eventually settle down, keep cycling, or behave unpredictably.
    • can support short-term memory

Figure: FFN and RNN.

Training Process

  • Go through each training sample.
  • If correctly classified → do nothing.
  • If misclassified → update the weights:
  • $w_i \leftarrow w_i + \alpha (y - \hat{y}) x_i$

Perceptron for Binary Classification

  • A perceptron separates data into two classes with a hyperplane.
  • if $w \cdot x \geq 0 \rightarrow 1$
  • if $w \cdot x < 0 \rightarrow 0$
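
A minimal sketch of perceptron training on a linearly separable toy problem (logical AND). The data set, the learning rate of 1.0, and the epoch count are illustrative choices; each input carries a leading 1 so that `w[0]` acts as the bias.

```python
def predict(w, x):
    """Hard threshold: 1 if w . x >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train_perceptron(samples, alpha=1.0, epochs=20):
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            y_hat = predict(w, x)
            if y_hat != y:  # misclassified: w_i <- w_i + alpha * (y - y_hat) * x_i
                w = [wi + alpha * (y - y_hat) * xi for wi, xi in zip(w, x)]
    return w

# Logical AND with a constant bias input of 1 in the first position.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train_perceptron(data)
print(w, [predict(w, x) for x, _ in data])
```

The same loop never settles on XOR, because no single hyperplane separates its two classes.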

Learning Rules

| Aspect | Perceptron Learning Rule | Gradient Descent (with Sigmoid) |
| --- | --- | --- |
| Activation function | Hard threshold (step function): $Threshold(z) = 1 \; \text{if} \; z \ge 0,\; 0 \; \text{otherwise}$ | Sigmoid (continuous function): $h_w(x) = \frac{1}{1+e^{-w \cdot x}}$ |
| Output | 0 or 1 | A real value between 0 and 1 |
| Loss function | None (adjust when wrong, keep when right); rule-based learning | $L = (y - h_w(x))^2$ (L2 loss), or cross-entropy (common in practice) |
| Update rule | Only on a misclassification: $w \leftarrow w + \alpha (y - h_w(x))x$ | Gradient descent: $w \leftarrow w + \alpha (y - h_w(x)) \cdot h_w(x)(1-h_w(x)) \cdot x$ |
| Why derivative? | The hard threshold is not differentiable → use a simple rule | The sigmoid is continuous and differentiable → update along the gradient of the loss function. The $h_w(x)(1-h_w(x))$ term comes from the derivative of the sigmoid. |
| Interpretation | When wrong, take one step toward the correct answer | Move gradually in the direction that reduces the loss |

Feedforward NN, FNN

  • A multilayer perceptron network.
  • It has one input layer, N hidden layers (N >= 1), and one output layer.
  • Except for the input layer, each layer has the same activation function g.
  • The final output is represented by a vector function of inputs and weights.
  • With three layers in total it is a shallow neural network; otherwise it is a deep neural network.

Training an FNN

  • Forward
    • Activation passing from the input layer to the output layer
    • Calculate the output
  • Backward
    • Errors propagating backward from the output layer to the input layer
    • Update weights

Forward phase

  • Activation of each node is computed in two steps:

    1. Weighted sum (in): sum of activations from the previous layer, multiplied by weights.
    2. Apply activation function g: pass the weighted sum through g to produce the node's activation.
  • Process: propagate activations layer by layer towards the output layer.

  • Output value (example with 2 layers):

    • $h_w(x) = g^{(2)}\big(W^{(2)} g^{(1)}(W^{(1)} x)\big)$
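
The two-step computation per node and the layer-by-layer propagation can be sketched in plain Python as below. The weight values are hypothetical, bias terms are left out to match the formula, and both layers use a sigmoid purely for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(W, a, g):
    """One layer: weighted sum for each unit (one row of W), then activation g."""
    return [g(sum(w * ai for w, ai in zip(row, a))) for row in W]

def forward(W1, W2, x, g1=sigmoid, g2=sigmoid):
    """h_w(x) = g2(W2 g1(W1 x)) for a network with one hidden layer."""
    return layer(W2, layer(W1, x, g1), g2)

W1 = [[0.5, -0.5], [1.0, 1.0]]   # hypothetical weights: 2 inputs -> 2 hidden units
W2 = [[1.0, -1.0]]               # 2 hidden units -> 1 output
print(forward(W1, W2, [1.0, 2.0]))
```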

Backward phase

  • Loss function: choose squared error loss (L2)
    • $L_2(y, \hat{y}) = (y - \hat{y})^2$
  • Prediction:
    • $\hat{y} = h_w(x)$
  • Gradient descent: compute gradient of the loss with respect to weights, then update weights along the negative gradient direction.
    • $w_{i,j} \leftarrow w_{i,j} - \alpha \cdot gradient_{w_{i,j}}$
  • Example: sigmoid activation:
    • $\hat{y} = \frac{1}{1 + e^{-w \cdot x}}$
    • Gradient of the loss:
      • $gradient_{w_{i,j}} = \frac{\partial}{\partial w_{i,j}} Loss(h_w) = 2 (y - h_w(x)) \cdot \Big(- \frac{\partial}{\partial w_{i,j}} h_w(x)\Big)$
    • Chain rule applied:
      • $\frac{\partial g(f(x))}{\partial x} = g'(f(x)) \cdot f'(x)$
  • Example: Gradient derivation for sigmoid
    • Weighted input
      • $W \cdot X = w_{1,3}x_1 + w_{2,3}x_2 + w_{0,3}x_0$
      • (where $x_0 = 1$ for the bias)
    • Gradient of the loss
      • $gradient_{w_{i,j}} = \frac{\partial}{\partial w_{i,j}} Loss(h_w) = 2 (y - h_w(x)) \cdot \Big(-\frac{\partial}{\partial w_{i,j}} h_w(x)\Big)$
    • Derivative of sigmoid output
      • $\frac{\partial}{\partial w_{i,j}} h_w(x) = h_w(x)(1 - h_w(x)) \cdot \frac{\partial}{\partial w_{i,j}} (W \cdot X)$
        • $\frac{\partial}{\partial w_{i,j}} \left( \frac{1}{1 + e^{-W \cdot X}} \right)$
        • $= \left( \frac{1}{1 + e^{-W \cdot X}} \right) \left(1 - \frac{1}{1 + e^{-W \cdot X}} \right) \cdot \frac{\partial}{\partial w_{i,j}} (W \cdot X)$
        • $= h_w(x) \big(1 - h_w(x)\big) \cdot \frac{\partial}{\partial w_{i,j}} (W \cdot X)$
    • Derivative of weighted input
      • $\frac{\partial}{\partial w_{0,3}}(W \cdot X) = x_0 = 1$
      • $\frac{\partial}{\partial w_{1,3}}(W \cdot X) = x_1$
      • $\frac{\partial}{\partial w_{2,3}}(W \cdot X) = x_2$
    • Weight update rule
      • General form: $w_{i,j} \leftarrow w_{i,j} - \alpha \cdot gradient_{w_{i,j}}$
      • Substituting the gradient (the constant factor 2 is absorbed into $\alpha$):
      • $w_{0,3} \leftarrow w_{0,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x))$
      • $w_{1,3} \leftarrow w_{1,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x)) x_1$
      • $w_{2,3} \leftarrow w_{2,3} + \alpha (y - h_w(x)) h_w(x)(1 - h_w(x)) x_2$
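
One way to sanity-check the derivation is to compare the analytic gradient with a finite-difference estimate for a single sigmoid unit. The weights, inputs, target, and learning rate below are arbitrary illustrative values; `x[0] = 1` is the bias input.

```python
import math

def h(w, x):
    """Sigmoid unit: h_w(x) = 1 / (1 + exp(-w . x))."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def loss(w, x, y):
    return (y - h(w, x)) ** 2

def analytic_gradient(w, x, y):
    """dLoss/dw_i = -2 (y - h) h (1 - h) x_i"""
    out = h(w, x)
    return [-2.0 * (y - out) * out * (1.0 - out) * xi for xi in x]

def numeric_gradient(w, x, y, eps=1e-6):
    grads = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

w, x, y = [0.1, -0.2, 0.3], [1.0, 0.5, -1.5], 1.0
print(analytic_gradient(w, x, y))
print(numeric_gradient(w, x, y))

# One gradient-descent step should lower the loss on this example.
alpha = 0.5
w_new = [wi - alpha * gi for wi, gi in zip(w, analytic_gradient(w, x, y))]
print(loss(w, x, y), "->", loss(w_new, x, y))
```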

Backward phase Steps

  1. Select a loss function
    • For example, squared error loss:
    • $L(y, \hat{y}) = (y - \hat{y})^2, \quad \hat{y} = h_w(x)$
  2. Choose an activation function
    • Suppose we use a sigmoid:
    • $h_w(x) = \frac{1}{1 + e^{-W \cdot X}}$
  3. Calculate the error at the output node
    • The delta (error term) at the output is
    • $\Delta_{out} = 2(\hat{y} - y) \cdot g'(in_{out})$
  4. Calculate the error at hidden nodes
    • A hidden unit may connect to multiple nodes in the next layer.
    • Therefore, its error is the weighted sum of all deltas it feeds into, scaled by its own derivative:
    • $\Delta_i = g'(in_i) \sum_j w_{i,j} \Delta_j$
    • The summation appears because the hidden node's output influences several downstream nodes, and all those error signals must be aggregated.
  5. Update the weights with gradient descent
    • The gradient with respect to weight $w_{i,j}$ is simply the input times the delta:
    • $\frac{\partial L}{\partial w_{i,j}} = a_i \Delta_j$
    • Update rule:
    • $w_{i,j} \leftarrow w_{i,j} - \alpha \, a_i \Delta_j$
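
The five steps can be traced on a tiny network with two inputs, two sigmoid hidden units, and one sigmoid output. This is a sketch under simplifying assumptions: no bias terms, hypothetical starting weights, and a single training example. Because there is only one output unit, the sum in step 4 has a single term.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    return g(z) * (1.0 - g(z))

def forward(W1, W2, x):
    """W1[i][j]: input i -> hidden j; W2[j]: hidden j -> the single output unit."""
    in_h = [sum(W1[i][j] * x[i] for i in range(len(x))) for j in range(len(W2))]
    a_h = [g(z) for z in in_h]
    in_out = sum(W2[j] * a_h[j] for j in range(len(W2)))
    return in_h, a_h, in_out, g(in_out)

def backprop(W1, W2, x, y):
    in_h, a_h, in_out, y_hat = forward(W1, W2, x)
    delta_out = 2.0 * (y_hat - y) * g_prime(in_out)                       # step 3
    delta_h = [g_prime(in_h[j]) * W2[j] * delta_out for j in range(len(W2))]  # step 4
    grad_W2 = [a_h[j] * delta_out for j in range(len(W2))]                # step 5
    grad_W1 = [[x[i] * delta_h[j] for j in range(len(W2))] for i in range(len(x))]
    return grad_W1, grad_W2

def loss(W1, W2, x, y):
    return (y - forward(W1, W2, x)[3]) ** 2

W1 = [[0.2, -0.4], [0.7, 0.1]]   # hypothetical starting weights
W2 = [0.5, -0.3]
x, y, alpha = [1.0, 0.5], 1.0, 0.5

before = loss(W1, W2, x, y)
for _ in range(100):
    gW1, gW2 = backprop(W1, W2, x, y)
    W1 = [[W1[i][j] - alpha * gW1[i][j] for j in range(2)] for i in range(2)]
    W2 = [W2[j] - alpha * gW2[j] for j in range(2)]
print(before, "->", loss(W1, W2, x, y))
```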

Vanishing gradient

  • The error signals are extinguished altogether as they are propagated back through the network.
  • In deep feedforward networks with sigmoid/tanh, repeated multiplication of small derivatives ($0 < g'(z) \le 1$, and at most $0.25$ for the sigmoid) during backpropagation causes the gradient to vanish.
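
A quick numerical illustration, ignoring the weight factors and looking only at the chain of activation derivatives: even in the best case for the sigmoid, where every unit sits at z = 0 and the derivative peaks at 0.25, the product shrinks geometrically with depth.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# g'(0) = 0.25 is the largest value the sigmoid derivative can take.
for depth in (1, 5, 10, 20):
    print(depth, sigmoid_prime(0.0) ** depth)
```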

Optimizer

  • Training a neural network consists of modifying the network's parameters to minimize the loss function on the training set.
  • In principle, any kind of optimization algorithm could be used.
  • In practice, modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD), such as the Adam optimizer.
  • In TensorFlow, the optimizer is specified in the compilation step.
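
To make the idea of an SGD variant concrete, here is a plain-Python sketch of the Adam update rule applied to a toy quadratic loss. It is not TensorFlow code; the learning rate, step count, and loss function are illustrative choices, while the beta and epsilon values follow commonly used defaults.

```python
import math

def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: per-parameter step sizes derived from running moment estimates."""
    m = [0.0] * len(w)   # first moment (running mean of gradients)
    v = [0.0] * len(w)   # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        grads = grad_fn(w)
        for i, gi in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * gi
            v[i] = beta2 * v[i] + (1 - beta2) * gi * gi
            m_hat = m[i] / (1 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss L(w) = (w0 - 3)^2 + (w1 + 1)^2, minimised at (3, -1).
def grad(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

print(adam(grad, [0.0, 0.0]))
```

In a real training loop `grad_fn` would return the gradient of the loss on a mini-batch, which is what makes the method stochastic.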

Recurrent NN, RNN

  • Units may take as input a value computed from their own output at an earlier step in the computation.
  • RNNs have internal state, or memory: inputs received at earlier time steps affect the RNN's response to the current input.
  • They can be used to perform more general computations.
    • For example, to analyze sequential data in which a new input vector $x_t$ arrives at each time step.
  • Markov assumption: the hidden state $z_t$ of the network suffices to capture the information from all previous inputs.
    • $z_t = f_w(z_{t-1}, x_t)$
    • Once trained, this function represents a time-homogeneous process.
    • The same update rule $f_w$ applies at every time step, regardless of whether it’s the first input or the hundredth.
  • RNNs are designed for sequential data.
  • They keep a hidden state that captures information from previous steps.
  • They suffer from vanishing/exploding gradients.
  • They are good for short-term dependencies.
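
A minimal sketch of the recurrence with one scalar hidden unit, assuming a tanh activation and arbitrary illustrative weights. The same weights are reused at every step, and the first input keeps influencing later states through the hidden state.

```python
import math

def rnn_step(z_prev, x_t, w_zz, w_xz, w_0z):
    """z_t = f_w(z_(t-1), x_t), here tanh of a weighted sum."""
    return math.tanh(w_zz * z_prev + w_xz * x_t + w_0z)

def run_rnn(xs, w_zz=0.5, w_xz=1.0, w_0z=0.0, z0=0.0):
    z, states = z0, []
    for x_t in xs:   # identical update rule at every time step
        z = rnn_step(z, x_t, w_zz, w_xz, w_0z)
        states.append(z)
    return states

# Only the first input is non-zero, yet later states are still non-zero.
print(run_rnn([1.0, 0.0, 0.0, 0.0]))
```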

Backpropagation Through Time, BPTT

  • The gradient expression is recursive.
    • $\frac{\partial z_t}{\partial w_{z,z}}$
    • $= \frac{\partial}{\partial w_{z,z}} g_z(in_{z,t})$
    • $= g_z'(in_{z,t}) \frac{\partial in_{z,t}}{\partial w_{z,z}}$
    • $= g_z'(in_{z,t}) \frac{\partial}{\partial w_{z,z}} (w_{z,z} z_{t-1} + w_{x,z} x_t + w_{0,z})$
    • $= g_z'(in_{z,t}) \left( z_{t-1} + w_{z,z} \frac{\partial z_{t-1}}{\partial w_{z,z}} \right)$
      • $\frac{\partial z_t}{\partial w_{z,z}}$ includes $\frac{\partial z_{t-1}}{\partial w_{z,z}}$
  • The gradient can be computed with run time linear in the size of the network.
  • This is handled automatically by deep learning software systems.
  • Iterating the recursion shows that the gradient at time $T$ includes a term proportional to:
    • $w_{z,z} \prod_{t=1}^{T} g'_z(in_{z,t})$
  • Since for sigmoid, tanh, and ReLU we have $g' \leq 1$, if $w_{z,z} < 1$ the RNN will suffer from the vanishing gradient problem.
  • If $w_{z,z} > 1$, we may encounter the exploding gradient problem.
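
The recursion can be evaluated alongside the forward pass and checked against a finite-difference estimate. The sketch assumes a scalar hidden state with a tanh activation; the input sequence and weights are arbitrary illustrative values.

```python
import math

def tanh_prime(z_in):
    return 1.0 - math.tanh(z_in) ** 2

def bptt_gradient(xs, w_zz, w_xz, w_0z, z0=0.0):
    """Return (z_T, dz_T/dw_zz) using
    dz_t/dw = g'(in_t) * (z_(t-1) + w_zz * dz_(t-1)/dw)."""
    z, dz = z0, 0.0
    for x_t in xs:
        in_t = w_zz * z + w_xz * x_t + w_0z
        dz = tanh_prime(in_t) * (z + w_zz * dz)   # z still holds z_(t-1) here
        z = math.tanh(in_t)
    return z, dz

def final_state(xs, w_zz, w_xz, w_0z):
    return bptt_gradient(xs, w_zz, w_xz, w_0z)[0]

xs, w_xz, w_0z, eps = [1.0, 0.5, -0.3, 0.8], 1.0, 0.1, 1e-6
for w_zz in (0.5, 1.5):
    _, analytic = bptt_gradient(xs, w_zz, w_xz, w_0z)
    numeric = (final_state(xs, w_zz + eps, w_xz, w_0z)
               - final_state(xs, w_zz - eps, w_xz, w_0z)) / (2 * eps)
    print(w_zz, analytic, numeric)
```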

Long Short-Term Memory, LSTM

  • The memory cell is essentially copied from time step to time step.
  • New information enters the memory by adding updates.
    • As a result, the gradient expressions do not accumulate multiplicatively over time.
  • An LSTM includes gating units: vectors that control the flow of information through elementwise multiplication with the corresponding information vector.
  • It is a type of RNN designed to overcome the vanishing gradient problem.
  • It uses gates (input, forget, output) to control information flow.
  • It is capable of learning long-term dependencies.
  • It is widely used in NLP, speech recognition, and time series forecasting.

Gates in LSTM

  • Forget gate: decides what information to discard from the cell state.
  • Input gate: decides what new information to store in the cell state.
  • Output gate: decides what information to output from the cell state.
    • similar role to the hidden state in basic RNNs.
  • Update equations:
    • $f_t = \sigma(W_{x,f}x_t + W_{z,f}z_{t-1})$
      • Decides which parts of the previous cell state $c_{t-1}$ should be kept or discarded.
    • $i_t = \sigma(W_{x,i}x_t + W_{z,i}z_{t-1})$
      • Determines how much of the new information from the current input $x_t$ and the previous hidden state $z_{t-1}$ should be added.
    • $o_t = \sigma(W_{x,o}x_t + W_{z,o}z_{t-1})$
      • Controls which parts of the current cell state $c_t$ are exposed as the hidden state $z_t$.
    • $c_t = c_{t-1} \odot f_t + i_t \odot \tanh(W_{x,C}x_t + W_{z,C}z_{t-1})$
      • Cell state update
      • Past information ($c_{t-1}$) is partially retained through the forget gate.
      • New information is added through the input gate and $\tanh$.
      • Thus, $c_t$ serves as the long-term memory of the LSTM.
    • $z_t = o_t \odot \tanh(c_t)$
      • Hidden state update
      • The cell state is normalized with $\tanh(c_t)$ and filtered by the output gate.
      • $z_t$ is the hidden state passed forward to the next time step.
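
The update equations translate almost line by line into code. The sketch below uses plain Python lists, omits bias terms as the equations do, and picks hypothetical weights for one input feature and two hidden units; the dictionary keys simply mirror the subscripts of the weight matrices.

```python
import math

def sigma(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def affine(Wx, x, Wz, z):
    """W_x x + W_z z for matrices stored as lists of rows."""
    return [sum(a * b for a, b in zip(rx, x)) + sum(a * b for a, b in zip(rz, z))
            for rx, rz in zip(Wx, Wz)]

def lstm_step(x, z_prev, c_prev, W):
    f = sigma(affine(W["xf"], x, W["zf"], z_prev))     # forget gate
    i = sigma(affine(W["xi"], x, W["zi"], z_prev))     # input gate
    o = sigma(affine(W["xo"], x, W["zo"], z_prev))     # output gate
    cand = tanh(affine(W["xC"], x, W["zC"], z_prev))   # candidate update
    c = [cp * ft + it * ct for cp, ft, it, ct in zip(c_prev, f, i, cand)]
    z = [ot * ct for ot, ct in zip(o, tanh(c))]
    return z, c

# Hypothetical weights: 1 input feature, 2 hidden units.
W = {k: [[0.5], [-0.5]] for k in ("xf", "xi", "xo", "xC")}
W.update({k: [[0.1, 0.2], [0.3, -0.1]] for k in ("zf", "zi", "zo", "zC")})

z, c = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    z, c = lstm_step(x_t, z, c, W)
    print(z, c)
```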

Gated Recurrent Unit, GRU

  • Variant of RNN with gating mechanisms.
  • Designed to capture long-term dependencies without complex architecture.
  • a simpler alternative to LSTMs. (lightweight, effective RNN variant)
  • Captures temporal dependencies (short & long).
  • Combine input and forget gates into a single update gate.
  • Require fewer parameters than LSTM, making them faster to train.
  • Perform comparably to LSTMs in many tasks.
  • Mitigates the vanishing gradient problem.
  • Good balance between complexity & performance.
  • Excels in time series forecasting tasks.
  • Widely used in finance, energy and IoT.

Gates in GRU

  • Update gate (z): decides how much past information to keep
  • Reset Gate (r): decides how much past information to forget
  • Candidate hidden state ($\tilde{h}$): potential new memory
  • Final hidden state ($h$): weighted combination of old and new information.

GRU Workflow

  • Reset gate ($r$)
    • Controls how much of the previous hidden state should be "forgotten."
    • A small value means most of the past memory is erased, while a large value means much of it is retained.
  • Update gate ($z$)
    • Acts as a switch to decide whether to keep the previous state $h_{prev}$ or replace it with the new candidate $\tilde{h}$.
    • If $z=1$, the past is fully kept; if $z=0$, it is completely replaced by the new candidate.
  • Candidate state ($\tilde{h}$)
    • Combines the current input $x_t$ with the reset-gated previous hidden state to generate the "candidate" new information.
  • Final hidden state ($h$)
    • Blends the past and the candidate using the update gate $z$.
    • If $z$ is large → the past memory dominates.
    • If $z$ is small → the new candidate dominates.
  • $h = (1 - z)\,\tilde{h} + z\,h_{prev}$
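
The workflow can be sketched for a scalar input and scalar hidden state. The final blend follows the formula above; the gate and candidate equations use the usual sigmoid/tanh form, which is an assumption here since the notes do not spell them out, and all weight names and values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU step for a scalar input and scalar hidden state (no biases)."""
    r = sigmoid(p["w_xr"] * x + p["w_hr"] * h_prev)               # reset gate
    z = sigmoid(p["w_xz"] * x + p["w_hz"] * h_prev)               # update gate
    h_cand = math.tanh(p["w_xh"] * x + p["w_hh"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_cand + z * h_prev                        # final blend

params = {"w_xr": 1.0, "w_hr": 0.5, "w_xz": -1.0, "w_hz": 0.5,
          "w_xh": 1.0, "w_hh": 1.0}   # hypothetical weights

h = 0.0
for x_t in (1.0, 0.5, -0.5):
    h = gru_step(x_t, h, params)
    print(h)
```

Pushing the update gate towards 1 (for example with a very large `w_xz` and a positive input) makes the step return the previous state almost unchanged, which is how the unit carries information across many steps.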

Comparison: RNN vs LSTM vs GRU

| Attribute | RNN | LSTM | GRU |
| --- | --- | --- | --- |
| Architecture | Simple, hidden state | Complex, memory cell + 3 gates | Simplified, 2 gates (update/reset) |
| Information Flow | Stored in hidden state | Controlled by gates | Controlled by merged gates |
| Long-term Dependency | Weak (vanishing gradient) | Strong (gates mitigate vanishing gradient) | Strong (gates mitigate vanishing gradient) |
| Short-term Dependency | Strong | Strong | Strong |
| Number of Parameters | Few | Many | Fewer than LSTM |
| Training Speed | Fast | Slow | Fast |
| Performance | Good for short-term | Good for long-term | Efficient, similar to LSTM |
| Application Areas | Simple time series, basic NLP | NLP, speech, time series forecasting | Finance, IoT, energy, time series |
| Vanishing Gradient | Yes | Largely mitigated | Largely mitigated |
| Typical Use Cases | Text generation, simple prediction | Translation, speech recognition | Time series prediction, sensor data |
| Simplicity | Very simple, rarely used | More complex, expressive (3 gates) | Simpler than LSTM, fewer parameters |
| Expressiveness | Limited, struggles with long-term | High, handles very complex sequences | Moderate, good for moderate data size |
| Training Efficiency | Fast, but limited | Slower, better for complex data | Fast, efficient, similar performance |
| Trade-off | Simple but weak for long-term | Capacity for complex, long sequences | Simplicity vs. capacity |

Open X-Embodiment review

· 5 min read

RT-X

  • RT-X trains generalist robot policies by co-training RT-1/RT-2 on an X-embodiment mix of multi-robot, multi-task data, enabling efficient adaptation to new robots, tasks, and environments.
  • It standardizes 1M+ trajectories from 22 embodiments into the Open X-Embodiment (RLDS/tfrecord) repository, unifying observations and 7-DoF actions via coarse alignment.
  • Experiments show strong positive transfer and emergent skills (≈3× with RT-2-X on cross-robot tasks); performance scales with model capacity, short image histories, and web pretraining, while sensing/actuation diversity and frame alignment remain open problems.

RT-X Architecture

Motivation

  • Seeks a generalist X-robot policy that can be efficiently adapted to new robots, tasks, and environments.
  • Mirrors a trend from CV/NLP where general-purpose, web-scale pretrained models outperform narrow, task-specific models.
  • Robotics lacks comparably large, diverse interaction datasets, making direct transfer of these lessons challenging.

Objectives

  1. Positive transfer: Test whether co-training on data from many robots improves performance on each training domain.
  2. Ecosystem building: Organize large robotic datasets to enable future X-embodiment research.

Core Approach

  • Train RT-1 and RT-2 on data from 9 different manipulators, producing RT-X variants that outperform policies trained only on the evaluation domain and show better generalization and new capabilities.

What’s Different From Prior Transfer Methods

  • Many prior works reduce the embodiment gap via specialized mechanisms (shared action spaces, representation learning objectives, policy adaptation using embodiment metadata, decoupled robot/environment representations, domain translation).
  • RT-X directly trains on X-embodiment data without explicit gap-reduction machinery and still observes positive transfer.

Dataset & Format (Open X-Embodiment)

  • 1M+ real robot trajectories, 22 embodiments (single-arm, bimanual, quadrupeds), pooled from 60 datasets / 34 labs, standardized for easy use.
  • Uses RLDS (serialized tfrecord), supporting varied action spaces and input modalities (RGB, depth, point clouds), and efficient parallel loading across major DL frameworks.
  • Language annotations are leveraged; PaLM is used to extract objects/behaviors from instructions.

RLDS

Data Format Consolidation (Coarse Alignment)

  • Observations: History of recent images + language instruction. One canonical camera view per dataset is resized to a common resolution.
  • Actions: Convert original controls to a 7-DoF end-effector vector (x, y, z, roll, pitch, yaw, gripper or their rates). Actions are normalized before discretization; outputs are de-normalized per embodiment.
  • Deliberate non-alignment: Camera poses/properties are not standardized; action frame alignment across datasets is not enforced. The same action vector may cause different motions on different robots (absolute/relative, position/velocity allowed).
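
The normalize, discretize, and de-normalize steps for a 7-DoF action could look roughly like the sketch below. The bin count of 256 per dimension and the per-dimension bounds are assumptions made for illustration; the notes above do not state them, and the function names are hypothetical.

```python
def tokenize_action(action, low, high, n_bins=256):
    """Normalise each dimension to [0, 1] with per-embodiment bounds, then bin it."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        norm = (min(max(a, lo), hi) - lo) / (hi - lo)   # clip, then scale
        tokens.append(min(int(norm * n_bins), n_bins - 1))
    return tokens

def detokenize_action(tokens, low, high, n_bins=256):
    """Map bin indices back to bin centres in the embodiment's own action range."""
    return [lo + (t + 0.5) / n_bins * (hi - lo) for t, lo, hi in zip(tokens, low, high)]

# Hypothetical bounds for (x, y, z, roll, pitch, yaw, gripper).
low = [-0.05] * 3 + [-0.25] * 3 + [0.0]
high = [0.05] * 3 + [0.25] * 3 + [1.0]
action = [0.01, -0.02, 0.0, 0.1, 0.0, -0.1, 1.0]

tokens = tokenize_action(action, low, high)
print(tokens)
print(detokenize_action(tokens, low, high))
```

Because the bounds are applied per embodiment, the same token sequence can decode to different physical motions on different robots, which is the coarse alignment described above.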

Policy Architectures

  • RT-1 (≈35M params): Transformer for control. Inputs: 15-frame image history + natural-language instruction.
    • Vision via ImageNet-pretrained EfficientNet; language via USE embedding.
    • Fuse via FiLM → 81 vision–language tokens → decoder-only Transformer outputs tokenized actions.
  • RT-2 (VLA family): Internet-scale VLM co-fine-tuned to output action as text tokens (e.g., 1 128 91 241 5 101 127).
    • Any pretrained VLM can be adapted; this work uses RT-2–PaLI-X (ViT backbone + UL2 LM; primarily pretrained on WebLI).

Training Setup

  • Robotics data mixture: Data from 9 manipulators (a union of multiple well-known robotics datasets).
  • Loss: Standard categorical cross-entropy over tokenized actions.
  • Regimes:
    • RT-1-X: Trained solely on the robotics mixture.
    • RT-2-X: Co-fine-tuned on a ~1:1 mix of original VLM data and the robotics mixture.

Experimental Questions

  1. Does X-embodiment co-training improve in-domain performance (positive transfer)?
  2. Does it improve generalization to unseen tasks?
  3. How do model size, architecture, and dataset composition influence performance/generalization?

Key Results

  • Small-scale domains: RT-1-X outperforms the Original Method (the authors’ per-dataset baselines) on 4/5 datasets with a large average gain → limited data domains benefit greatly from X-embodiment co-training.
  • Large-scale domains:
    • RT-1-X does not beat an RT-1 trained only on the embodiment-specific large dataset (suggests underfitting for this class).
    • RT-2-X (larger capacity) outperforms both Original Method and RT-1 → X-robot training helps even in data-rich regimes when using sufficient capacity.

Generalization & Emergent Skills

  • Unseen objects/backgrounds/environments: RT-2 and RT-2-X perform on par (VLM backbone already strong here).
  • Emergent skills (transfer across robots): On Google Robot tasks that do not appear in RT-2’s dataset but exist in Bridge (for WidowX), RT-2-X ≈ 3× RT-2.
    • Removing Bridge from RT-2-X training significantly reduces hold-out performance → skills likely transferred from WidowX data.

Design Insights (Ablations)

  • Short image history notably improves generalization.
  • Web pretraining is critical for large models’ high performance.
  • Model capacity matters: 55B model succeeds more than 5B on emergent skills → greater capacity ⇒ greater cross-dataset transfer.
  • Co-fine-tuning vs. fine-tuning: Similar performance in this study (attributed to the greater diversity of robotics data in RT-2-X vs. prior works).

Limitations (Open Problems)

  • Does not cover robots with very different sensing/actuation modalities.
  • Does not study generalization to new robots nor define a decision criterion for when positive transfer will occur.
  • Camera pose/properties and control frame remain unaligned; a deliberate but still challenging domain gap to address in future work.

Ref

  • O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., & Jain, A. (2024). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. 2024 IEEE International Conference on Robotics and Automation (ICRA).

FSD +006

· 4 min read

OOP vs Procedural Programming

OOP

  • A programming paradigm built around the concept of objects, which contain data and the code that manipulates that data.
  • The idea is to model real-world entities and their interactions.
  • Global data (fields) is enclosed in the objects.
  • Pro: program components/tasks are easily divided across the development team / Con: requires more planning and design preparation
  • Pro: easier to manage and maintain dependencies between objects / Con: OOP programs are much larger and more complex
  • Pro: objects export the interface and hide the implementation and data / Con: tends to use more memory and CPU
  • Pro: code is highly reusable and easy to scale and distribute / Con: changes in one class can impact others, which can complicate development

Procedural Programming

  • A programming paradigm built on the concept of procedure calls, structuring the program around procedures (functions/subroutines).
  • Statements execute in a sequential manner unless directed otherwise.
  • Global data (elements) is exposed to all the functions.
  • Pro: easier to compile and interpret / Con: difficult to scale or extend
  • Pro: straightforward and simpler to code / Con: dependencies between elements are unclear and not well-structured
  • Pro: lower memory requirements / Con: data is insecure because it is exposed across the whole program
  • Pro: easy to track the program flow / Con: hard to divide the work among programmers in a team

Classes

  • A class is a template/blueprint used to create objects
| Java | Python |
| --- | --- |
| A pure OOP language | Supports OOP |
| Code must be written in classes | Classes are optional |
| An executable class must have main() | Scripts run without including a class |
| Encapsulation can be enforced by declaring fields as private | Fields (global variables) are public by default |
| Visibility is managed through access modifiers | N/A ("_" to identify private data attributes, but still accessible) |

class <class-name>(<superclass>):
    <variable-name> = <value>  # class fields (data members)

    def __init__(self, <parameters>):  # class constructor (object builder)
        <code>

    def <method-name>(self, <parameters>):  # methods
        <code>

Classes Py

| Keywords | Functions |
| --- | --- |
| `class` | `__init__()` |
| `self`: conventional name used to refer to object properties | `del`: statement used to delete an object reference |
| `pass`: keyword used as a placeholder where no code is needed | `__str__()`: returns the string representation of an instance |
| `cls`: conventional name used to refer to class properties | `super()`: used to call a parent method in a child class |

  • Accessors: functions (with no parameters other than self) in a Python class that provide access to the data attributes of an object.
    • Also known as getter methods, they are named with the verb get followed by the field name, which should start with an uppercase letter.
  • Mutators: procedures (with a parameter) in a Python class that enable the developer to modify the values of object attributes.
    • Also known as setter methods, they are named with the verb set followed by the field name, which should start with an uppercase letter.

def get<Variable>(self):
    return self.<field>

def set<Variable>(self, value):
    self.<field> = value
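
A concrete, runnable instance of the template, using the get/set naming convention described in these notes. The `Account` and `SavingsAccount` classes, their fields, and the interest rate are invented for illustration.

```python
class Account:
    bank_name = "Example Bank"            # class field shared by all instances

    def __init__(self, owner, balance=0.0):
        self._owner = owner               # "_" marks the attribute private by convention
        self._balance = balance

    def getBalance(self):                 # accessor (getter)
        return self._balance

    def setBalance(self, value):          # mutator (setter)
        self._balance = value

    def __str__(self):
        return f"{self._owner}: {self._balance:.2f}"


class SavingsAccount(Account):
    def __init__(self, owner, balance=0.0, rate=0.02):
        super().__init__(owner, balance)  # call the parent constructor
        self._rate = rate

    def addInterest(self):
        self.setBalance(self.getBalance() * (1 + self._rate))


acct = SavingsAccount("Ada", 100.0)
acct.addInterest()
print(acct)   # Ada: 102.00
del acct      # removes the name; the object is freed once nothing references it
```

Idiomatic Python often replaces explicit getters and setters with plain attributes or `@property`; the explicit methods are kept here to match the convention above.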

Classes Java

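import java.util.Objects;

public class Bank {
    private Customer customer;
    private String branch;

    // Default constructor: every bank starts with a customer.
    public Bank() {
        customer = new Customer();
    }

    // Overloaded constructor: reuse the default one, then set the branch name.
    public Bank(String name) {
        this();
        this.branch = name;
    }

    // Two banks match when their branch names are equal (null-safe comparison).
    public boolean find(Bank bank) {
        return Objects.equals(this.branch, bank.branch);
    }
}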

Packages

Packages Java

  • used to group related classes
  • like folders containing files (classes)
  • either Java defined or user-defined
  • used to write maintainable and portable code and to avoid class name conflicts.

Modules Py

  • used to group related functions and classes together
  • ordinary Python scripts that can be imported into other scripts
  • either Python defined or user-defined
  • used to write maintainable and portable code and to improve reusability
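
A small sketch of both kinds of module. To stay self-contained it writes a throwaway user-defined module to a temporary folder and imports it; in a real project the file would simply sit next to the importing script. The module name `shapes_demo` and its contents are made up.

```python
import importlib
import math          # a Python-defined (standard library) module
import os
import sys
import tempfile

# Create a tiny user-defined module on disk.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "shapes_demo.py"), "w") as fh:
    fh.write("PI = 3.14159\n\ndef circle_area(r):\n    return PI * r * r\n")

sys.path.insert(0, module_dir)                          # make the folder importable
shapes_demo = importlib.import_module("shapes_demo")    # same effect as `import shapes_demo`

print(shapes_demo.circle_area(2.0), math.pi)
```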