Annotation for Embodied AI: What Physical Intelligence, Figure, and 1X Need from Training Data
The biggest bet in AI right now isn’t another chatbot or image generator. It’s embodied AI—machines that exist in physical space and interact with the real world. Companies like Physical Intelligence, Figure, 1X, and others are building foundation models for robots, systems that can generalize across tasks and environments the way GPT generalizes across language tasks. These models are hungry for data. But not just any data—they need training examples that capture the complexity of physical interaction in ways that current annotation approaches weren’t designed to handle.
What Is Embodied AI?
Embodied AI refers to artificial intelligence systems that have physical presence—robots that can see, move, and manipulate objects in the real world. Unlike purely digital AI (chatbots, image generators, recommendation systems), embodied AI must deal with physics, uncertainty, and the consequences of physical actions.
The embodied AI thesis is simple: the scaling approach that produced GPT-4 and Claude can also produce general-purpose robot intelligence. Train a large enough model on enough diverse robot data, and you get a system that can adapt to new tasks and environments without explicit programming.
This is fundamentally different from traditional robotics, which relied on hand-coded behaviors for specific tasks. Traditional robots could weld car frames or pick items from bins because engineers programmed exactly how to do those tasks. Embodied AI aims to learn behaviors from data, generalizing across tasks the way language models generalize across topics.
The shift: from programming robots to perform specific tasks to training robots to learn tasks from examples. This changes how robot data is collected, annotated, and used.
The Players Building This Future
Several well-funded companies are racing to build embodied AI foundation models:
Physical Intelligence (Pi)
Founded by robotics researchers from Google, Berkeley, and Stanford, Physical Intelligence raised $400M to build foundation models for robotics. Their approach: train a single model that can control many different robot types across many different tasks. The model learns general principles of physical interaction, then adapts to specific robots and environments.
Pi’s data needs are enormous. They need demonstrations across different robot form factors (arms, humanoids, mobile robots), different tasks (manipulation, navigation, interaction), and different environments (homes, warehouses, offices). And all this data needs consistent annotation that captures what makes physical interaction successful or unsuccessful.
Figure
Figure is building humanoid robots designed to work alongside humans in commercial and industrial settings. Their Figure 01 and 02 robots learn from teleoperation demonstrations, building a foundation of human-like movement and manipulation skills.
Figure’s data strategy emphasizes human demonstration—people controlling robots through VR interfaces, generating training data that captures human intuition about physical tasks. This data needs annotation that describes not just what happened, but why certain approaches work and others don’t.
1X Technologies
1X (formerly Halodi Robotics) builds humanoid robots for household and commercial environments. Their NEO robot is designed for general-purpose tasks in human spaces—the kind of varied, unpredictable environments where traditional robotics fails.
1X’s approach emphasizes learning from real-world deployment. As their robots operate in actual homes and businesses, they generate data that captures edge cases and failure modes no simulation could anticipate. This deployment data needs rapid annotation to feed back into model improvement.
Tesla Optimus
Tesla’s humanoid robot program leverages the company’s existing AI infrastructure and massive workforce. Thousands of Tesla employees can contribute teleoperation demonstrations, generating training data at a scale most robotics companies can’t match.
Tesla’s advantage is volume. Their challenge is keeping demonstration quality consistent across many operators and keeping annotations consistent across massive datasets.
Why Traditional Annotation Falls Short
Embodied AI models have data requirements that traditional computer vision annotation wasn’t designed to meet:
Multimodal Complexity
Robots don’t just see: they feel forces, track joint positions, and measure distances with multiple sensors. Training data for embodied AI is inherently multimodal: camera images, depth maps, proprioceptive state, force/torque readings, sometimes audio. Annotation needs to span all these modalities, describing not just what’s visible but how the robot is interacting physically.
Traditional image annotation captures “there’s a cup on the table.” Embodied AI annotation needs “the cup is 23cm from the gripper, oriented 15 degrees from vertical, estimated 200g weight, smooth ceramic surface requiring moderate grip force.”
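One way to make that difference concrete is a per-frame record that carries sensor references and physical object state side by side. The sketch below is illustrative only: the class names, field names, units, and file paths are assumptions made for this example, not a schema used by any of the companies mentioned.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectState:
    """Physical description of one object (hypothetical fields)."""
    label: str
    distance_to_gripper_cm: float
    tilt_from_vertical_deg: float
    estimated_mass_g: float
    surface: str
    grip_force: str  # assumed vocabulary: "light", "moderate", "firm"


@dataclass
class FrameAnnotation:
    """One time step of a multimodal robot recording (hypothetical fields)."""
    timestamp_s: float
    rgb_path: str
    depth_path: Optional[str]
    joint_positions_rad: list[float]
    wrist_force_n: tuple[float, float, float]
    objects: list[ObjectState] = field(default_factory=list)

    def nearest_object(self) -> Optional[ObjectState]:
        """Return the annotated object closest to the gripper, if any."""
        return min(
            self.objects,
            key=lambda o: o.distance_to_gripper_cm,
            default=None,
        )


# The cup from the example above, expressed as structured data.
cup = ObjectState("cup", 23.0, 15.0, 200.0, "smooth ceramic", "moderate")
frame = FrameAnnotation(
    timestamp_s=0.0,
    rgb_path="episode_0001/rgb/000000.png",      # placeholder path
    depth_path="episode_0001/depth/000000.png",  # placeholder path
    joint_positions_rad=[0.0, -0.5, 1.2, 0.0, 0.8, 0.0],
    wrist_force_n=(0.0, 0.0, 0.0),
    objects=[cup],
)
print(frame.nearest_object().label)
```

Keeping the visual, proprioceptive, and force fields in one record means an annotator's description of the object can always be read against what the robot was sensing at the same instant.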
Temporal Dynamics
Physical interaction unfolds over time. The sequence of actions matters—approach angle, grip timing, lift velocity, placement precision. Annotation needs to capture not just static states but dynamic trajectories and the causal relationships between actions and outcomes.
Traditional annotation: “person picks up cup.” Embodied AI annotation: “approach from right (0.0-1.2s) → pre-shape gripper for cylindrical object (1.2-1.5s) → contact cup at mid-height (1.5s) → close gripper with 15N force (1.5-1.8s) → lift vertically 20cm (1.8-2.5s) → success: cup secured, no slip detected.”
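The timed sequence above maps naturally onto a list of labeled intervals. The following is a minimal sketch, assuming a hypothetical `Segment` record and a hypothetical `validate_timeline` helper, of how an annotation tool might check that a timeline has no gaps or overlaps before it is accepted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A labeled time interval in seconds (hypothetical record)."""
    label: str
    start_s: float
    end_s: float


def validate_timeline(segments, tol=1e-6):
    """Return a list of human-readable problems; empty means the timeline is clean."""
    problems = []
    for seg in segments:
        if seg.end_s < seg.start_s:
            problems.append(f"'{seg.label}' ends before it starts")
    for prev, cur in zip(segments, segments[1:]):
        gap = cur.start_s - prev.end_s
        if gap > tol:
            problems.append(f"gap between '{prev.label}' and '{cur.label}'")
        elif gap < -tol:
            problems.append(f"overlap between '{prev.label}' and '{cur.label}'")
    return problems


# The cup pickup from the example above; contact is an instantaneous event.
episode = [
    Segment("approach from right", 0.0, 1.2),
    Segment("pre-shape gripper", 1.2, 1.5),
    Segment("contact cup at mid-height", 1.5, 1.5),
    Segment("close gripper", 1.5, 1.8),
    Segment("lift vertically", 1.8, 2.5),
]
print(validate_timeline(episode))
```

Whether instantaneous events such as first contact should be zero-length segments or separate event markers is a schema decision; this sketch simply allows both.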
Physics Reasoning
Embodied AI needs to understand physics—how objects behave when pushed, lifted, dropped, stacked. Annotation needs to capture not just what happened but why it happened in terms of physical principles. Was the grasp stable because the center of mass was within the gripper span? Did the object slip because the surface was smooth? Did the stack fall because the base was uneven?
This requires annotators who understand physics, not just visual recognition.
Failure Mode Documentation
Failures are among the most valuable training data for embodied AI because they teach the model what not to do and how to recover. But failure annotation is harder than success annotation: you need to identify why the failure occurred, classify the failure type, and describe what a successful approach would have looked like.
Traditional annotation skips failures or marks them as simple negatives. Embodied AI annotation needs rich failure analysis that the model can learn from.
Generalization Requirements
Foundation models need to generalize. The annotation schema needs to support generalization—describing tasks and objects in ways that transfer across variations. “Pick up the cup” should connect to “pick up the mug” and “pick up the glass” through shared concepts of graspable cylindrical containers.
This requires careful taxonomy design that balances specificity (this particular cup) with abstraction (cup-like objects in general).
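A small parent-link taxonomy illustrates how specific labels can connect through shared abstractions. This is a toy sketch: the category names and the `PARENT`, `lineage`, and `shared_concept` identifiers are invented for illustration, and a production taxonomy would be far larger and likely not a strict tree.

```python
# Hypothetical child -> parent links.
PARENT = {
    "cup": "cylindrical_container",
    "mug": "cylindrical_container",
    "glass": "cylindrical_container",
    "cylindrical_container": "graspable_object",
    "book": "flat_rigid_object",
    "flat_rigid_object": "graspable_object",
}


def lineage(label):
    """Return the label followed by its ancestors, most specific first."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def shared_concept(a, b):
    """Return the most specific concept that covers both labels, or None."""
    ancestors_of_b = set(lineage(b))
    for concept in lineage(a):
        if concept in ancestors_of_b:
            return concept
    return None


print(shared_concept("cup", "mug"))   # cylindrical_container
print(shared_concept("cup", "book"))  # graspable_object
```

Annotators keep labeling the specific object, while the taxonomy supplies the abstraction a model can use to relate one demonstration to another.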
The gap: Traditional annotation asks “what’s in this image?” Embodied AI annotation asks “what’s happening physically, why is it happening, and how does this connect to other physical interactions?” The complexity increase is substantial.
What Embodied AI Annotation Actually Requires
Based on the needs of foundation model development, here’s what annotation for embodied AI looks like:
1. Action-Centric Description
Every annotation should center on actions—what the robot did, what the human demonstrated, what the outcome was. Object identification matters only in relation to actions performed on those objects.
Format: [Actor] + [Action] + [Object] + [Manner] + [Outcome]
Example: “Robot arm approaches cup from the right, grasps at mid-height with parallel gripper, lifts vertically 15cm—success, object secured.”
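The five-slot format can be stored as a structured record and rendered to text on demand, so that the same annotation is both machine-readable and human-readable. The sketch below assumes a hypothetical `ActionAnnotation` class; the field names simply mirror the slots in the format line above.

```python
from dataclasses import dataclass


@dataclass
class ActionAnnotation:
    """One action in [Actor] + [Action] + [Object] + [Manner] + [Outcome] form."""
    actor: str
    action: str
    obj: str
    manner: str
    outcome: str

    def to_text(self) -> str:
        """Render the record as a single readable sentence fragment."""
        return f"{self.actor} {self.action} {self.obj} {self.manner}: {self.outcome}"


grasp = ActionAnnotation(
    actor="Robot arm",
    action="grasps",
    obj="cup",
    manner="at mid-height with parallel gripper",
    outcome="success, object secured",
)
print(grasp.to_text())
# Robot arm grasps cup at mid-height with parallel gripper: success, object secured
```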
2. Physical Property Annotation
Objects need annotation beyond visual appearance—estimated weight, material properties, stability characteristics, graspable surfaces. These physical properties determine how robots should interact with objects.
Example properties: “Coffee mug—ceramic, ~300g full, cylindrical body (graspable), handle (alternative grasp point), hot liquid (careful handling required), stable base (can be placed on flat surfaces).”
3. Causal Relationship Mapping
When something goes wrong (or right), annotation should capture the causal chain. What action led to what outcome? What condition caused the failure? What would have prevented it?
Example: “Grasp failed → cause: gripper contacted cup too high (top 20%) → center of mass below contact point → cup rotated during lift → corrective action: contact point should be at 40-60% height for stable grasp.”
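A causal chain like this one can be captured as an ordered list rather than free text, which makes the root cause and the correction individually queryable. This is a minimal sketch with a hypothetical `FailureAnnotation` class; it uses an ASCII arrow as the separator when rendering.

```python
from dataclasses import dataclass


@dataclass
class FailureAnnotation:
    """A failure, the ordered chain of causes, and a suggested correction."""
    failure: str
    causal_chain: list[str]
    corrective_action: str

    def root_cause(self):
        """Return the first link in the chain, or None if no cause was recorded."""
        return self.causal_chain[0] if self.causal_chain else None

    def render(self) -> str:
        """Join failure, causes, and correction into one arrow-separated line."""
        steps = [
            self.failure,
            *self.causal_chain,
            f"corrective action: {self.corrective_action}",
        ]
        return " -> ".join(steps)


failed_grasp = FailureAnnotation(
    failure="grasp failed",
    causal_chain=[
        "gripper contacted cup too high (top 20%)",
        "center of mass below contact point",
        "cup rotated during lift",
    ],
    corrective_action="contact at 40-60% of cup height",
)
print(failed_grasp.root_cause())
print(failed_grasp.render())
```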
4. Semantic Action Segmentation
Continuous demonstrations need segmentation into meaningful action units. These segments should be semantically meaningful—complete actions that make sense as learning examples.
Good segmentation: “reach for cup” | “grasp cup” | “lift cup” | “transport cup” | “place cup”
Bad segmentation: “move arm 10cm” | “move arm 5cm” | “close gripper” — too granular, loses semantic meaning
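If annotators assign a semantic label to each frame, consecutive runs of the same label can be collapsed into segments automatically. The sketch below shows that run-length step using `itertools.groupby`; the function name `frames_to_segments`, the labels, and the 10 fps frame rate are assumptions for illustration.

```python
from itertools import groupby


def frames_to_segments(frame_labels, fps):
    """Collapse per-frame labels into (label, start_s, end_s) segments."""
    segments = []
    index = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, index / fps, (index + length) / fps))
        index += length
    return segments


labels = ["reach for cup"] * 3 + ["grasp cup"] * 2 + ["lift cup"] * 5
for segment in frames_to_segments(labels, fps=10):
    print(segment)
```

The semantic judgment still comes from the annotator's choice of labels; the code only turns those choices into clean boundaries.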
5. Variation Documentation
When the same task is performed multiple times, annotation should capture what varies and what stays constant. This helps models learn which aspects of a demonstration are essential vs. incidental.
Example: “Cup grasp demonstration #47—approach angle varied (this time from left vs. usual right approach), grasp height consistent (mid-height), outcome consistent (success). Variation suggests approach angle is flexible for this object type.”
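With structured annotations, the split between what varied and what stayed constant can be computed across repeated demonstrations instead of written by hand. The following is a rough sketch; the `summarize_variation` helper and the dictionary keys are hypothetical, and it assumes every demonstration records the same fields.

```python
def summarize_variation(demos):
    """Split annotation fields into those that vary and those that stay constant."""
    varied, constant = {}, {}
    for key in demos[0]:
        values = {demo[key] for demo in demos}
        if len(values) > 1:
            varied[key] = sorted(values)
        else:
            constant[key] = demos[0][key]
    return varied, constant


cup_grasps = [
    {"approach": "right", "grasp_height": "mid", "outcome": "success"},
    {"approach": "right", "grasp_height": "mid", "outcome": "success"},
    {"approach": "left", "grasp_height": "mid", "outcome": "success"},
]
varied, constant = summarize_variation(cup_grasps)
print(varied)    # {'approach': ['left', 'right']}
print(constant)  # {'grasp_height': 'mid', 'outcome': 'success'}
```

A field that varies while the outcome stays constant is a candidate for "incidental"; confirming that still takes enough demonstrations and a human review.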
The Scale Challenge
Embodied AI foundation models need massive training datasets—potentially millions of demonstration episodes across thousands of task types. The annotation challenge isn’t just complexity per sample; it’s maintaining quality at scale.
Consistency Across Annotators
When hundreds of annotators label millions of samples, consistency becomes critical. If different annotators describe similar actions in incompatible ways, the resulting label noise degrades model performance. Preventing this requires detailed guidelines, calibration procedures, and ongoing quality monitoring.
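One common way to monitor consistency is an agreement statistic computed on samples that two annotators labeled independently. The sketch below implements Cohen's kappa, which corrects raw agreement for the agreement expected by chance; the choice of this metric and the example labels are assumptions for illustration, not a description of any particular team's QA process.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["grasp", "grasp", "lift", "lift"]
annotator_2 = ["grasp", "grasp", "lift", "grasp"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5. Tracking a number like this per guideline version gives calibration sessions a concrete target.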
Evolving Requirements
As models improve, annotation requirements change. Early-stage models might need basic action descriptions. Later-stage models might need subtle failure mode analysis. Annotation processes need to evolve alongside model development.
Speed vs. Quality Tradeoffs
Foundation model development moves fast. Teams need annotation turnaround in days or weeks, not months. But rushing annotation sacrifices quality. Finding the right balance—fast enough for iteration velocity, careful enough for model quality—is an ongoing challenge.
Where We Fit
At Tech AI Remote, we’ve been building capabilities that align with embodied AI annotation needs:
- Action segmentation expertise: Our motion capture project delivered 3,255 action descriptions with frame-accurate timestamps—exactly the temporal annotation embodied AI requires
- Physics-aware annotation: Our robotics bin-picking work included grasp point identification and manipulation reasoning—the kind of physical property annotation foundation models need
- Consistency at scale: We’ve developed processes for maintaining annotation consistency across large projects—calibration procedures, QA workflows, and guideline documentation
- Flexibility: As a smaller specialized provider, we can adapt quickly to evolving requirements—new annotation schemas, new data types, new quality criteria
We’re not trying to be Scale AI. We’re building the specialized capabilities that embodied AI teams need when standard annotation approaches fall short.
The Road Ahead
Embodied AI is still early. The foundation models being built today will look primitive in five years. But the data infrastructure being built now—the annotation schemas, quality processes, and specialized capabilities—will compound over time.
Companies that invest in high-quality physical interaction data today are building moats that will matter as the field matures. And annotation providers that develop genuine expertise in embodied AI data will become essential partners as the industry scales.
The opportunity is clear. The question is who builds the capabilities to capture it.