A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

Investigating the intrinsic effects of co-training through theoretical analysis and empirical study

Yu Lei1 · Minghuan Liu1 · Abhiram Maddukuri1 · Zhenyu Jiang2 · Yuke Zhu1,3

1The University of Texas at Austin  ·  2Amazon FAR  ·  3NVIDIA

Overview

What is Co-Training?

Co-training — combining limited in-domain real-world data with abundant surrogate data such as simulation1,2,3, cross-embodiment robot data4,5, or even cross-modality data6 — is widely used for training generative visuomotor policies. The simplest strategy is mixed training on the pooled data. However, despite its empirical success, the mechanisms that determine when and why co-training is effective remain poorly understood.

What We Do

We investigate the working principles of sim-and-real co-training through theoretical analysis and multi-layered empirical study, and identify structured representation alignment as the primary intrinsic effect governing performance improvement. We validate these effects with controlled toy model experiments and extensive sim-and-sim and sim-and-real robotic manipulation experiments. Our analysis provides a unified interpretation of recent co-training techniques and motivates a simple method that achieves consistent improvements over prior approaches.

The underlying working mechanisms of co-training systems: structured representation alignment and importance reweighting effect

More broadly, our goal is to look beneath the surface of co-training and to encourage the community to push this important but challenging direction forward together.

You can jump to the Takeaways to learn the key insights.

Theoretical Insights: Two Intrinsic Effects

Training a co-trained diffusion policy corresponds to jointly learning a feature encoder $f_\phi:\mathcal{O}\rightarrow\mathcal{Z}$ and a policy model $\pi_\theta:\mathcal{Z}\rightarrow\mathcal{A}$. Our analysis identifies one intrinsic effect in each space.

1. Structured Representation Alignment: a balance between cross-domain representation alignment and domain-specific discernibility. Alignment enables transfer of task-relevant knowledge from the surrogate domain; discernibility allows actions to adapt to the target domain. Explains ~50% of loss variance.

2. Importance Reweighting Effect: domain-dependent modulation of the action distribution weighting. Determined by the mixing ratio $w$, dataset size $|\mathcal{D}|$, and domain gaps, it operates at a secondary, modulatory level. Explains ~20% of loss variance.

Structured Representation Alignment: in the latent space $p(\mathcal{Z})$

We prove the existence of an analytical optimal solution, which reveals that the behavior of the co-trained score function depends on the learned representations. Based on their alignment structure, three scenarios emerge:

• Disjoint: representations from the two domains occupy entirely separate clusters. No positive transfer occurs from the source domain. (No transfer)

• Structured Aligned: close but not collapsed. Action prediction is guided by source-domain neighbors but dominated by target-domain data. (Effective transfer)

• Overlapping: fully collapsed representations. The policy cannot distinguish domains, leading to bimodal action distributions. (Negative transfer)

Importance Reweighting: in the conditional action space $p(\mathcal{A|Z})$

Another effect further modulates the action sampling distribution by reweighting score functions across domains. Defining $r_k(a^t,t) := \frac{||a^t - \alpha_t a_k||}{\sigma_t \sqrt{d}}$, the per-sample weights on target and source samples follow:

$g_{k}=\text{Softmax}\big(\ln(w_k) - r_k^2 \cdot d/2\big), \quad w_t = w/N,\; w_s = (1-w)/M$

Comparing a single target sample against a single source sample, and writing $w_N := \frac{N}{N+M}$ for the natural mixing ratio, this gives: $\frac{g_t}{g_s} = \frac{1-w_N}{w_N}\cdot \frac{w}{1-w} \cdot \exp\big(\frac{(r_s^2 - r_t^2)\,d}{2}\big)$

The amplitude of this modulation depends on three factors: (i) the mixing ratio $w$, (ii) the dataset sizes $N$ and $M$, and (iii) the domain gaps between source and target actions.
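To make this concrete, the softmax weights above can be evaluated directly. Below is a minimal numpy sketch; the function and variable names are ours, not from the paper:

```python
import numpy as np

def reweighting_weights(a_t, actions, is_target, w, alpha_t, sigma_t):
    """Per-sample weights g_k for a noisy action a_t of dimension d.

    actions:   (N + M, d) clean actions from both domains
    is_target: boolean mask, True for the N target-domain samples
    w:         mixing ratio assigned to the target domain
    """
    d = a_t.shape[0]
    N, M = is_target.sum(), (~is_target).sum()
    # per-sample mixture weights: w_t = w/N for target, w_s = (1-w)/M for source
    w_k = np.where(is_target, w / N, (1 - w) / M)
    # normalized distances r_k = ||a^t - alpha_t a_k|| / (sigma_t sqrt(d))
    r = np.linalg.norm(a_t - alpha_t * actions, axis=1) / (sigma_t * np.sqrt(d))
    logits = np.log(w_k) - r**2 * d / 2
    logits -= logits.max()  # numerical stability before the softmax
    g = np.exp(logits)
    return g / g.sum()
```

Raising $w$ shifts weight toward target samples, but only among neighbors that are already close in the distance term; this previews why reweighting alone cannot rescue poorly aligned representations.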

Figure: illustration of the importance reweighting effect.

Illustrative Toy Models

To disentangle and understand the individual contributions of both effects, we design controlled toy co-training experiments. The policy model learns a mapping $\pi(y|x):\mathbb{R}^3 \rightarrow \mathbb{R}^2$, where we manually define source and target manifolds with different distributions and alignment structures. The adjustment of the data mixing ratio $w$ controls the importance reweighting effect:

Interactive demo: toy-model predictions as the mixing ratio varies, under the overlapping, structured aligned, and disjoint scenarios.
Finding 1: In the disjoint scenario, no transfer occurs; in the structured aligned scenario, the output is reconstructed with high fidelity; in the overlapping scenario, predictions are bimodal.

Finding 2: Structured alignment explains ~50% of loss variance, while the mixing ratio accounts for only ~20%. The importance reweighting effect is constrained by the underlying representation alignment — changing $w$ alone cannot compensate for poorly aligned representations.

Finding 3: With an appropriate mixing ratio, co-training can even achieve surprisingly good OOD generalization, e.g., in the structured aligned scenario with $w=0.1$ above.
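For reference, the data side of such a toy experiment can be sketched as below; the specific manifolds and the mixing-by-sampling scheme are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(n, shift):
    """Inputs on a (shifted) manifold in R^3, targets in R^2."""
    x = rng.normal(size=(n, 3)) + shift
    y = np.stack([np.sin(x[:, 0]), x[:, 1] * x[:, 2]], axis=1)
    return x, y

x_tgt, y_tgt = make_domain(50, 0.0)    # scarce target-domain data
x_src, y_src = make_domain(500, 0.5)   # abundant, shifted source-domain data

# Mixed training: each batch element comes from the target with probability w.
w, n_batch = 0.3, 64
from_target = rng.random(n_batch) < w
i_t = rng.integers(0, len(x_tgt), n_batch)
i_s = rng.integers(0, len(x_src), n_batch)
xb = np.where(from_target[:, None], x_tgt[i_t], x_src[i_s])
yb = np.where(from_target[:, None], y_tgt[i_t], y_src[i_s])
```

Varying the `shift` between domains controls the alignment structure, while `w` controls the reweighting effect, so the two factors can be swept independently.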

Structured representation alignment is the primary intrinsic effect that governs the performance of co-training.

Observations on Real Robots

Alignment Can Be Learned Implicitly

We visualize the latent embeddings with UMAP across different mixing ratios. Within a certain range of balanced mixing ratios ($w \in [0.016, 0.3]$), the shallow layers exhibit local geometry alignment while the deep layer features show global alignment — all without any explicit alignment objective. The phenomenon holds consistently across different tasks.

Figure (interactive, per task): local geometry alignment in shallow layers.

Figure (interactive, per task): global alignment in deep layers.

Alignment Correlates with Performance

The correlation between representation alignment (measured by Wasserstein distance) and task success rate is moderate-to-strong, with Pearson and Spearman coefficients in the range of $0.6 \sim 0.8$ across all settings except pure physics-only gaps.
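One way to compute such an alignment score is a sliced Wasserstein distance between the sim and real feature clouds, sketched here with `scipy`; the estimator the paper actually uses may differ:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_w1(feat_a, feat_b, n_proj=32, seed=0):
    """Average 1-D Wasserstein-1 distance over random unit projections."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        v = rng.normal(size=feat_a.shape[1])
        v /= np.linalg.norm(v)  # project both clouds onto a random direction
        total += wasserstein_distance(feat_a @ v, feat_b @ v)
    return total / n_proj
```

Lower values indicate better-aligned feature distributions; tracking this score across mixing ratios reproduces the kind of correlation analysis reported above.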

Figure: correlation between representation alignment and task success rate.

Discernibility is Indispensable

  • Even when representations appear aligned in low-dimensional space, a simple 2-layer MLP can achieve ~100% domain classification accuracy — indicating that domain-specific information is preserved.
  • In the physics-only setting, where domain discernibility is harder to maintain, correlation between alignment and performance even becomes negative, confirming that blind alignment without discernibility can be harmful.
  • Compared to the physics-only setting, policy performance is higher in the vis-phys setting.
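The domain-classification probe from the first bullet can be reproduced with a tiny 2-layer MLP in plain numpy; the synthetic features below, with one strongly shifted coordinate standing in for preserved domain information, are our own illustrative assumption:

```python
import numpy as np

def probe_accuracy(X, y, hidden=32, lr=0.3, steps=1000, seed=0):
    """Train a 2-layer MLP domain classifier; return its training accuracy."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden); b2 = 0.0
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
        g = (p - y) / len(y)                 # BCE gradient w.r.t. the logit
        gh = np.outer(g, W2) * (1.0 - h**2)  # back-prop through tanh
        W2 -= lr * (h.T @ g); b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(0)
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return ((p > 0.5) == y).mean()

# Synthetic stand-ins for encoder features: the domains look similar except
# along one domain-specific direction (a hypothetical residual signal).
rng = np.random.default_rng(1)
feat_sim = rng.normal(size=(300, 32))
feat_real = rng.normal(size=(300, 32)); feat_real[:, 0] += 4.0
X = np.concatenate([feat_sim, feat_real])
y = np.concatenate([np.zeros(300), np.ones(300)])
acc = probe_accuracy(X, y)
```

Even when a 2-D UMAP view shows overlap, a probe like this can separate the domains whenever such a residual direction survives in the full feature space.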

    Existence of Three Representation Regimes

    We introduce an additional and complementary control knob via discriminator regularization, which directly modulates domain discernibility. Specifically, we train models with discriminator loss weights of $\{0, 0.05, 0.5\}$, and within each setting sweep the mixing ratio to vary alignment.

We observe distinct behavior patterns in three regimes: (1) negative correlation in the overlapping regime; (2) weak positive correlation in the structured aligned regime; (3) strong positive correlation in the disjoint regime.

Figure: alignment-performance correlation across the three regimes.

    A Unified View of Co-Training Methods

    Existing co-training techniques can be understood through the lens of how they improve representation alignment and domain discernibility. We revisit three representative approaches, and further propose a simple combination method.

• Optimal Transport (alignment): explicitly matches representation distributions across domains via soft coupling.

• ADDA (alignment): an adversarial discriminator promotes domain-invariant representations.

• CFG (discernibility): classifier-free guidance preserves domain information via separate conditional pathways.

• CFG-ADDA (ours; alignment + discernibility): combines adversarial alignment with domain guidance to balance both objectives.
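At inference time, the domain-guidance component can be sketched as standard classifier-free guidance between an unconditional and a real-domain-conditioned score estimate; this is the generic CFG update, not necessarily the paper's exact parameterization:

```python
import numpy as np

def cfg_score(eps_uncond, eps_real, guidance=1.5):
    """Classifier-free guidance toward the real-domain conditional score.

    guidance = 0 recovers the unconditional estimate, guidance = 1 the purely
    real-conditioned one; larger values extrapolate past it.
    """
    return eps_uncond + guidance * (eps_real - eps_uncond)
```

Applied at every denoising step, guidance slightly above 1 sharpens the real-domain mode while training still benefits from the aligned sim data.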

    Sim-and-Sim Experiments

Figure: sim-and-sim co-training results.

    Real-World Evaluation

    We evaluate all methods in sim-and-real co-training on three challenging manipulation tasks, with 15 trials each:

| Method | NutAssembly (w=0.016 / 0.1 / 0.3) | MugCleanup (w=0.016 / 0.1 / 0.3) | MugHang (w=0.016 / 0.1 / 0.3) | Avg |
|---|---|---|---|---|
| Real-Only | 11 | 8 | 6 | 8.6 |
| Co-Training | 17 / 11 / 16 | 16 / 9 / 7 | 8 / 13 / 7 | 15.3 |
| + OT | 15 / 17 / 11 | 8 / 15 / 15 | 11 / 9 / 4 | 14.3 |
| + ADDA | 13 / 13 / 15 | 6 / 14 / 11 | 10 / 14 / 7 | 14.3 |
| + CFG | 15 / 14 / 11 | 6 / 17 / 14 | 8 / 14 / 10 | 15.3 |
| + CFG-ADDA | 23 / 15 / 18 | 11 / 22 / 17 | 18 / 15 / 8 | 21 |

All entries are successes out of 30 trials; Real-Only uses no mixing ratio, and Avg averages each method's best mixing ratio per task.
    Real-world policy performance (best across balanced mixing ratios). CFG-ADDA achieves ~74% average success rate, a substantial improvement over all individual methods. Videos are displayed at 2x speed.

    Takeaways

    1. Structured representation alignment is the primary intrinsic effect in co-training: it governs the latent condition space, while importance reweighting modulates the conditional action space at a secondary level.
    2. Both alignment and discernibility are necessary: representation alignment correlates with performance, but blind alignment without domain awareness leads to negative transfer, especially in the presence of physics domain gaps.
    3. Representation alignment can be learned implicitly and progressively: with balanced mixing ratios and no explicit alignment objective, it emerges from local geometry alignment in shallow layers to global alignment in deep layers.
    4. A principled guideline for mixing ratio selection: we provide a coarse range of balanced mixing ratios, a useful starting point that narrows the search space for future large-scale co-training experiments.
    Algorithm: Guideline for Co-Training Mixing Ratio Selection

    Given target dataset size $N$ and source dataset size $M$ with $M > N$, optionally with a desired target contribution $q$ (e.g., $q=0.8$), this procedure outputs a narrowed search range $(w_n, w_q)$ for the mixing ratio.

    1. Compute the natural mixing ratio:
      $$w_n = \frac{N}{N + M}.$$
      Use $w_n$ as the lower bound.
    2. If $M/N > 5$ (source much larger than target), set the upper bound as:
      $$w_q = \sqrt{\frac{N}{M}}.$$
    3. Else, set a desired target contribution ratio $q$ (e.g., $q=0.8$) and compute:
      $$w_q = \frac{N \cdot q}{(1-q)\cdot M + N \cdot q}.$$
      Optionally cap the upper bound at $0.5$ (often sufficient in practice).
    4. Adjust both $w_n$ and $w_q$ upward if the source-target domain gap (visual appearance, physics, or embodiment) is large; since no formal gap estimator is assumed, apply this adjustment heuristically.
    5. Perform a simple search within $(w_n, w_q)$ to find the best mixing ratio.
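The guideline transcribes directly into a small helper; here $N$ is the target dataset size and $M$ the source dataset size, matching the reweighting analysis above:

```python
def mixing_ratio_range(N, M, q=0.8, cap=0.5):
    """Return a narrowed search range (w_n, w_q) for the mixing ratio w.

    N: target (e.g. real) dataset size; M: source (e.g. sim) size, M > N.
    q: desired target contribution, used when M is not much larger than N.
    """
    w_n = N / (N + M)                 # natural mixing ratio, lower bound
    if M / N > 5:                     # source much larger than target
        w_q = (N / M) ** 0.5
    else:                             # contribution-matched upper bound
        w_q = N * q / ((1 - q) * M + N * q)
    return w_n, min(w_q, cap)         # cap at 0.5, often sufficient
```

For example, $N=100$ real demos and $M=1000$ sim demos give roughly $(0.09, 0.32)$, consistent with the balanced range $w \in [0.016, 0.3]$ observed above; remember to shift the range upward heuristically when domain gaps are large.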

    We hope this work provides a clearer understanding of the mechanisms behind co-training and informs the design of more principled, robust co-training algorithms. We sincerely invite the community to explore these questions in more diverse settings, with the goal of deepening our collective understanding.

    Citation

    @article{lei2025cotraining,
      title={A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies},
      author={Yu Lei and Minghuan Liu and Abhiram Maddukuri and Zhenyu Jiang and Yuke Zhu},
      year={2026},
      eprint={2604.13645},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2604.13645},
    }