Building a Vision-Language-Action Model from Scratch
What are VLAs, and how do you build one from scratch?!
In this post I explain what these VLAs are, how people have approached building these ‘magical’ systems so far, and how one can go about building a VLA from scratch. I also justify the design decisions that helped me work around all the dead-ends I faced. This project assumes low-end hardware (e.g., just an old CPU). I do not go through the code line by line in this blog [I will do that later], as I was unable to find a decent way to display code correctly here on Substack. [if you have any recommendations, hmu!] Lastly, this blog was written with love and passion, and not with AI — I hope it connects with you at a human level. As always, if you have any questions, feel free to comment or reach out to me — this post is meant to educate and clarify, NOT confuse!
Prologue
VLAs have fascinated me since the moment I started reading about them in my Reinforcement Learning course @ Northeastern Uni. That pushed me to thoroughly read papers throughout this year, starting from RT-2, OpenVLA, π_0 (and the subsequent π papers — highly recommended!), GR00T, and so on!
You can easily spot a common architectural pattern across all these papers.
Why the world suddenly cares about VLAs
VLAs = generalization. Before RT-2, robots operated using what Google DeepMind described as “complex stacks of systems” playing an “imperfect game of telephone”. VLAs eliminate this architectural fragmentation by enabling a single model to perceive, reason, AND act.
Multi-modality bridges reasoning to control. VLAs solve the grounding problem by treating all modalities—vision, language, and action—as tokens in a unified sequence. The architecture flows naturally: visual encoders convert camera images into patch embeddings, projection layers map these into the language model’s embedding space, and the LLM processes the fused representation to generate action tokens. mini-VLA is a simplified version of the exact same concept delivered differently.
The generalist robot thesis
Physical Intelligence articulated the vision clearly in October 2024: “In the same way that LLMs provide a foundation model for language, these generalist robot policies will provide a robot foundation model for physical intelligence.”
This is the thesis that drew me toward VLA research, like “oh, that’s something! It’s like building ChatGPT for the physical world!” — oh boy, I was wrong — apparently it is MUCH HARDER!
Still, the foundation is strong. The Open X-Embodiment dataset in October 2023 provided 1M+ real robot trajectories across 22 embodiments — the “ImageNet of robot learning.” OpenVLA in June 2024 democratized research by exceeding the 55B-parameter RT-2-X with just 7B parameters. And π_0 in October 2024 became the first model to successfully fold laundry, make coffee, assemble boxes, and bus tables — tasks “no prior robot learning system has done successfully.”
How do I make my own VLA from scratch?
I hope that by now I have convinced you how exciting the VLA space is! Well, as the name suggests —
Vision → Language → Action pipeline (simple, right?)
The generic pipeline flows through four components: a vision encoder extracts visual features from RGB observations into patch embeddings; a vision-language projector maps these embeddings into the LLM’s representation space; a language model backbone processes the fused tokens alongside text instructions; and an action head converts LLM outputs to executable robot commands.
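To make that flow concrete, here is a minimal data-flow sketch with toy stand-ins for each stage. Everything in it (the module sizes, the 7-dimensional action, the name vla_forward) is a placeholder of my own, not any real model’s API; it only shows how the four pieces hand off to each other.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the four generic VLA stages. Real systems use a ViT-style vision
# encoder, a multi-billion-parameter LLM, and a learned action de-tokenizer here;
# the dimensions below are arbitrary placeholders chosen only to show the data flow.
vision_encoder = nn.Linear(3 * 224 * 224, 256)   # RGB pixels -> visual features ("patch embeddings")
projector = nn.Linear(256, 512)                  # visual features -> the LLM's embedding space
llm_backbone = nn.Linear(512 + 512, 512)         # fuse visual + instruction tokens
action_head = nn.Linear(512, 7)                  # fused representation -> robot command (e.g., 7-DoF)

def vla_forward(image, instruction_embedding):
    visual = projector(vision_encoder(image.flatten(1)))                      # (B, 512)
    fused = llm_backbone(torch.cat([visual, instruction_embedding], dim=-1))  # (B, 512)
    return action_head(fused)                                                 # (B, 7) action
```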
mini-VLA
I came up with an idea to dumb-ify all these components just enough to make them work (fast! — on my stupid MacBook with no GPU). In this section we will go through the entire pipeline of building your own VLA from scratch: collecting and organizing data, developing and training a model, design choices and tradeoffs, and inference in simulation. Let’s dive in!
Collecting data
Teaching a robot by showing, not telling. Before our VLA model can do anything impressive — it needs to learn from experience. And since robots don’t naturally have childhoods filled with trial-and-error (yet), we have to manufacture that experience for them.
What are we actually collecting? Think of every step through the environment as a frame in a movie paired with behind-the-scenes metadata. Each step (or transition) gives us a raw RGB image frame showing the robot hand, the object, and the goal position; the state vector (think of sensors) captures the low-level info the simulator tracks (positions, velocities, etc.); the action tells us what the ‘expert’ did at this step; and finally we collect a textual instruction describing the task, like “push the object to the goal”.
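Concretely, I store each transition as a small dictionary. The keys and shapes below are my own convention (a Meta-World-sized 39-dim state and 4-dim action), not a standard format:

```python
import numpy as np

# One transition from a demonstration. Keys and shapes are illustrative, not a standard.
transition = {
    "image": np.zeros((84, 84, 3), dtype=np.uint8),   # RGB frame: robot hand, object, goal
    "state": np.zeros(39, dtype=np.float32),          # low-level simulator state (positions, velocities, ...)
    "action": np.zeros(4, dtype=np.float32),          # what the expert did (xyz delta + gripper)
    "instruction": "push the object to the goal",     # textual task description
}
```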
Usually, one needs to tele-operate the robot to collect this data. Here, however, the Meta-World environment we use ships a scripted “expert policy” for each task. These experts serve as “teachers” for our dataset (a sketch of the collection loop follows below). The philosophy used here is,
imitate the expert until you become the expert
which is Behavior Cloning 101.
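A bare-bones collection loop might look like the sketch below. I pass in env and expert_policy as given; Meta-World ships scripted expert policies, but their exact import path and the render call vary by version, so treat this as a structural sketch that assumes a Gymnasium-style API rather than copy-paste code.

```python
def collect_episode(env, expert_policy, instruction, max_steps=200):
    """Roll out the scripted expert once and record every transition (Gymnasium-style API assumed)."""
    episode = []
    obs, _ = env.reset()
    for _ in range(max_steps):
        image = env.render()                        # RGB frame; requires render_mode="rgb_array"
        action = expert_policy.get_action(obs)      # the "teacher" chooses the action
        next_obs, reward, terminated, truncated, info = env.step(action)
        episode.append({
            "image": image,
            "state": obs,
            "action": action,
            "instruction": instruction,
        })
        obs = next_obs
        if terminated or truncated or info.get("success", 0) == 1:
            break
    return episode
```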
Designing VLA
pixels + words + numbers → actions. Now that we can collect demonstrations, we need a brain (aka policy) that can look, read, feel, and then act.
Given what I see (image), what I know (state), and what I’m told (instruction), which action should I take next?
The BIG Picture
Image → image encoder → image token
Text → text encoder → language token
State → state encoder → state token
Fusion → combine three tokens → fused context
Diffusion head → sample an action conditioned on the fused context
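Wired together, the whole policy is just these five modules calling each other. The class names below are my own, each encoder is sketched in its section further down, DiffusionHead is a placeholder for the policy head whose training math is sketched in the ACTing section, and d_model = 128 is the width I will assume throughout:

```python
import torch.nn as nn

class MiniVLA(nn.Module):
    """Minimal sketch: three encoders -> fusion -> diffusion action head."""
    def __init__(self, d_model=128, action_dim=4):
        super().__init__()
        self.image_encoder = ImageEncoder(d_model)              # sketched in the SEEing section
        self.text_encoder = TextEncoder(d_model=d_model)        # sketched in the READing section
        self.state_encoder = StateEncoder(d_model=d_model)      # sketched in the FEELing section
        self.fusion = FusionMLP(d_model)                        # sketched in the Fusion section
        self.action_head = DiffusionHead(d_model, action_dim)   # placeholder diffusion policy head

    def act(self, image, instruction_tokens, state):
        img_tok = self.image_encoder(image)                # (B, d_model)
        txt_tok = self.text_encoder(instruction_tokens)    # (B, d_model)
        st_tok = self.state_encoder(state)                 # (B, d_model)
        context = self.fusion(img_tok, txt_tok, st_tok)    # (B, d_model) fused context
        return self.action_head.sample(context)            # (B, action_dim) denoised action
```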
👀 SEEing — Image Encoder
When building a budget-friendly model, we cannot pick computationally expensive components (I will talk about them later). So, instead of a huge ResNet or Vision Transformer, we use a tiny CNN. This is like compressing a full photo into a compact “gist” that still knows roughly where the robot hand and the object are.
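Here is roughly what I mean by a tiny CNN. The channel counts and the assumed 84×84 input are arbitrary choices of mine, not requirements:

```python
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Tiny CNN: compresses an 84x84 RGB frame into a single d_model-dim image token."""
    def __init__(self, d_model=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # (B, 64, 1, 1): a global "gist" of the scene
        )
        self.proj = nn.Linear(64, d_model)

    def forward(self, image):                         # image: (B, 3, 84, 84), floats in [0, 1]
        feats = self.conv(image).flatten(1)           # (B, 64)
        return self.proj(feats)                       # (B, d_model) image token
```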
🗣️ READing — Text Encoder
We only keep the last hidden state, which acts as a summary of the instruction, e.g.,
push the object to the goal → a single 128-dimensional vector representing that intent
No fancy transformers here on purpose: the goal is clarity and minimalism, not chasing SOTA.
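One minimal way to get that summary vector: an embedding table plus a single GRU, keeping only the final hidden state. The GRU and the 1000-word vocabulary are my assumptions; any encoder that emits one 128-dimensional vector per instruction would do.

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embeds instruction tokens and keeps the GRU's last hidden state as the language token."""
    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, token_ids):                     # token_ids: (B, T) integer ids
        x = self.embed(token_ids)                     # (B, T, d_model)
        _, h_last = self.gru(x)                       # h_last: (1, B, d_model)
        return h_last.squeeze(0)                      # (B, d_model) language token
```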
🫠 FEELing — State Encoder
The state is already a vector from the simulator (positions, velocities, gripper status…).
We just run it through a small MLP (+ LayerNorm) to produce a state token. Think of it this way: if the image is “what the world looks like,” the state is “what the robot body is feeling and knowing.”
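In code this is about as small as a module gets. A sketch, assuming a flat 39-dimensional Meta-World state vector:

```python
import torch.nn as nn

class StateEncoder(nn.Module):
    """Small MLP + LayerNorm: turns the raw simulator state into a state token."""
    def __init__(self, state_dim=39, d_model=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
            nn.LayerNorm(d_model),
        )

    def forward(self, state):                         # state: (B, state_dim)
        return self.net(state)                        # (B, d_model) state token
```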
🔗 Fusion
Fusion is where the model forms a single, unified understanding of what it sees (image token), what it’s told (text/instruction token), and what the robot knows (state token).
This is not so complicated actually. Think of it as,
A short meeting where vision, language, and state each give a quick report, and the fusion MLP writes a concise summary.
Honestly, you could replace this with attention, cross-attention, or a transformer; the rest of the code doesn’t care—as long as fusion outputs a (B, d_model) token.
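In my version the “meeting” is literally a concatenation followed by a two-layer MLP; a minimal sketch:

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Concatenates the image/text/state tokens and summarizes them into one fused context."""
    def __init__(self, d_model=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, img_tok, txt_tok, st_tok):            # each: (B, d_model)
        x = torch.cat([img_tok, txt_tok, st_tok], dim=-1)   # (B, 3 * d_model)
        return self.net(x)                                  # (B, d_model) fused context
```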
🎲 ACTing via Diffusion
Aka the policy head! This part is slightly complicated if you have not studied diffusion models… so bear with me! During training, we corrupt the expert’s action with random noise. The model learns to undo the noise, using the image + text + state context as a guide. At test time, we start from pure noise and iteratively denoise it to get a plausible action. I know, this part is not as intuitive as the previous sections — but diffusion models have proven to be excellent generative models.
This is a standard DDPM-style setup in action space. q_sample implements the forward diffusion x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε, where ε ~ N(0, I). Here, the denoiser is a simple MLP conditioned on the timestep and the fused context.
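Here is a hedged sketch of that forward process and the training loss. The 50-step linear beta schedule and the crude timestep conditioning are simplifications of my own; the denoiser is assumed to be any MLP that maps (noisy action, timestep, fused context) back to the noise.

```python
import torch
import torch.nn.functional as F

T = 50                                                 # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)                  # linear beta schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative product: abar_t

def q_sample(x0, t, noise):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar[t].unsqueeze(-1)                    # (B, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

def diffusion_loss(denoiser, actions, context):
    """Train the denoiser to predict the noise that was added to the expert actions."""
    B = actions.shape[0]
    t = torch.randint(0, T, (B,))                      # a random timestep per sample
    noise = torch.randn_like(actions)                  # eps ~ N(0, I)
    noisy_actions = q_sample(actions, t, noise)
    t_embed = (t.float() / T).unsqueeze(-1)            # crude timestep conditioning in [0, 1)
    inp = torch.cat([noisy_actions, t_embed, context], dim=-1)
    pred_noise = denoiser(inp)                         # denoiser outputs (B, action_dim)
    return F.mse_loss(pred_noise, noise)
```

At inference you run the usual DDPM reverse loop: start from Gaussian noise and repeatedly subtract the predicted noise over the T steps, keeping the fused context fixed, until a clean action emerges.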
Design Philosophy of mini-VLA
As I said, I have focused on education more than performance. There are several deliberate choices in this VLA design that make the framework easy to understand and extend.
I am prioritizing modularity — encoders, fusion, and the diffusion head are separate; we can swap in a different vision encoder or a transformer encoder without touching the rest of the code — we’ll be fine!
The model must be small but expressive — most people would like to try this on a low-end CPU — the pipeline must still work.
It must be environment-agnostic as well — i.e., it must work beyond just Meta-World!
Instead of predicting actions directly, we sample from a learned distribution over possible actions, which tends to yield smoother, more robust behavior.
Lessons while building mini-VLA
Building this VLA from first principles was beyond exciting! Before this project, VLAs seemed like mysterious “everything machines” that could look at a scene, read an instruction, and act coherently. Once I built each component myself, I realized a VLA is less a monolithic model and more a negotiated agreement among modalities. Makes sense, right? I hope so.
This section could become really long, so I am keeping it for my next blog, where I will talk about the lessons in more technical terms.
Epilogue
If you squint a little, this tiny VLA pipeline is more than a toy project. It’s a sketch of what real robot intelligence looks like.
Scale that up—better encoders, richer language, more diverse tasks, real robot sensors instead of a simulator—and you’re walking toward robots that can learn new behaviors from a few demonstrations, or even from high-level natural language. And everything can be accomplished with slightly better hardware!
If there’s one takeaway from this whole journey, it’s this:
Real robot intelligence won’t be summoned. It will be engineered—carefully, incrementally, and transparently—by people willing to build, break, and rebuild systems like this.
And if you’ve made it this far in the blog, you’re already one of those people. I hope this project inspires you to build something small… that does something big.😃






