Method
System Overview
AirVLA is a vision-language-action system for aerial manipulation that transfers manipulation capabilities from foundation models pretrained on fixed-base robot arms to an underactuated quadrotor platform. The system takes as input RGB images from multiple viewpoints and a natural language task description, and outputs relative end-effector pose commands executed by a low-level flight controller.
The key challenge in this transfer is the mismatch between the quasi-static regime of tabletop manipulation and the dynamic regime of aerial manipulation, where the platform must continuously stabilize against gravity while simultaneously executing precise gripper motions. Grasping an object introduces a step change in effective mass that, if uncompensated, causes the drone to sag and potentially fail the task.
Our approach addresses this challenge through two main contributions: (1) a physics-aware guidance mechanism that augments the pretrained policy's action generation process with payload-aware vertical compensation at inference time, and (2) a Gaussian-splatting data pipeline that enables efficient collection and synthesis of diverse training trajectories from a small number of seed flights. Together, these allow a VLA model pretrained on large-scale robot manipulation data to perform aerial pick-and-place and navigation with minimal drone-specific fine-tuning.
Hardware
The system integrates the ModalAI Starling 2 Max drone with a customized Universal Manipulation Interface (UMI) gripper and multiple cameras to enable autonomous aerial manipulation. The Starling 2 Max is powered by the VOXL 2 companion computer (Qualcomm QRB5165). The customized UMI-style gripper is attached to the underside of the drone, enabling dynamic grasping. One external camera and two onboard cameras (downward- and forward-facing) provide RGB images at 5 Hz.
Gripper Design: The gripper is designed to be built cheaply without specialized tools. The frame is entirely 3D printed, with two hobby-grade servos slotting in without screws. The fingers are adapted from the UMI gripper, facilitating direct comparison with arm-based policies trained with similar end-effector geometry. The design prioritizes low weight for extended flight time while maintaining sufficient grip strength for the target objects.
Observation and Action Spaces: The observation space consists of RGB images from the three cameras at 256×256 resolution, the drone's estimated pose from a motion capture system, and the gripper aperture. The action space consists of drone position, yaw, and gripper commands, represented as an action chunk A ∈ ℝ^{H×D}: each of the H steps in the horizon contains a D-dimensional action comprising a 4-DoF end-effector delta pose (position and yaw) and a gripper command. Actions are generated at 10 Hz and executed by the PX4 flight controller via position setpoints.
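The action-chunk layout above can be sketched as follows. The specific values H = 50 and D = 5, and the column ordering, are illustrative assumptions; the text only specifies 4-DoF delta poses plus a gripper command over a horizon.

```python
import numpy as np

# Assumed layout (illustrative): H = 50 steps, D = 5 dims
# per step: (Δx, Δy, Δz, Δyaw, gripper command).
H, D = 50, 5

def make_chunk() -> np.ndarray:
    """Return a zero-initialized action chunk A ∈ ℝ^{H×D}."""
    return np.zeros((H, D), dtype=np.float32)

def split_chunk(A: np.ndarray):
    """Split a chunk into 4-DoF delta poses and gripper commands."""
    delta_pose = A[:, :4]   # Δx, Δy, Δz, Δyaw for each step
    gripper = A[:, 4]       # gripper command for each step
    return delta_pose, gripper

A = make_chunk()
poses, grip = split_chunk(A)
```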
Policy Architecture
We build on π0, a vision-language-action model that represents the conditional action distribution using a flow-matching (continuous-time) generative model. Given an observation o (images, proprioception, language), π0 defines a velocity field v_θ(x_τ, o, τ) over latent action chunks and diffusion/flow time τ ∈ [0,1]. Sampling draws x_0 ~ 𝒩(0, I) and integrates the ODE to obtain the generated action chunk A = x_1.
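The sampling procedure can be sketched with a simple Euler integrator. The velocity field here is a toy stand-in (it pulls the sample toward a fixed target chunk under a linear interpolation schedule); the real v_θ is the learned network, and π0's actual integrator and step count are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def v_theta(x, obs, tau):
    """Toy velocity field standing in for the learned network.
    Pulls x toward a target chunk along a linear-interpolation schedule."""
    target = obs["target"]
    return (target - x) / max(1.0 - tau, 1e-3)

def sample_chunk(obs, steps=10, shape=(50, 5)):
    """Euler integration of dx/dτ = v_θ(x, o, τ) from τ = 0 to τ = 1."""
    x = rng.standard_normal(shape)      # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for k in range(steps):
        tau = k / steps
        x = x + dt * v_theta(x, obs, tau)
    return x                            # A = x_1
```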
For real-time execution, we employ Real-Time Chunking (RTC), which enables asynchronous inference by freezing the prefix of the next chunk that will execute before inference completes, and inpainting the remaining suffix conditioned on the frozen prefix. RTC defines a soft temporal mask over the horizon that blends previously committed actions with newly generated actions to avoid discontinuities at chunk boundaries.
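The soft temporal mask can be illustrated as a per-step blend between the committed chunk and the newly generated one. This is a simplified sketch: actual RTC inpaints the suffix conditioned on the frozen prefix during sampling rather than blending finished chunks, and the ramp shape here is an assumption.

```python
import numpy as np

def rtc_blend(prev_chunk, new_chunk, frozen_steps, ramp_steps):
    """Blend a previously committed chunk with a freshly generated one.

    Steps [0, frozen_steps) keep the committed actions exactly;
    the next ramp_steps linearly hand over to the new chunk;
    the remainder uses the new chunk alone.
    """
    H = prev_chunk.shape[0]
    w = np.ones(H)                                  # weight on committed actions
    ramp = np.linspace(1.0, 0.0, ramp_steps + 2)[1:-1]
    w[frozen_steps:frozen_steps + ramp_steps] = ramp
    w[frozen_steps + ramp_steps:] = 0.0
    return w[:, None] * prev_chunk + (1.0 - w[:, None]) * new_chunk
```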
Physics-Aware Guidance for Action Generation
RTC shows that inference-time steering can be implemented by modifying the velocity field during sampling. We generalize this idea by introducing a guidance loss Φ(A; o) over the denoised action chunk and adding its gradient as a correction term to the base velocity field. This steers sampling toward a 'sweet spot' between the policy's priors and physical constraints: actions that remain high-probability under the VLA (preserving learned manipulation skills) while minimizing the guidance loss (enforcing flight feasibility), effectively steering the drone at runtime without retraining.
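A minimal sketch of the gradient correction, assuming a quadratic guidance loss on a single action dimension (the exact form of Φ, the guidance weight λ, and the dimension index are illustrative assumptions):

```python
import numpy as np

def phi_grad(x, z_ref, z_dim=2):
    """Gradient of Φ(A) = 0.5 · Σ_t (A[t, z_dim] − z_ref)² w.r.t. A."""
    g = np.zeros_like(x)
    g[:, z_dim] = x[:, z_dim] - z_ref
    return g

def guided_step(x, v_base, tau, dt, z_ref, lam=1.0):
    """One Euler step with the guidance correction:
    v_guided = v_base(x, τ) − λ ∇_A Φ(A; o)."""
    v = v_base(x, tau) - lam * phi_grad(x, z_ref)
    return x + dt * v
```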
Payload-Aware Vertical Guidance: In aerial manipulation, the dominant disturbance during grasping is an effective mass increase that manifests as vertical sag under load. Rather than modeling full 6-DOF dynamics, we apply guidance only on the altitude-related action dimension. We bias the drone toward slightly higher altitude under load via a tuned offset capturing the expected sag for typical payloads.
The payload confidence α(o, A_{t−1}) ∈ [0,1] is computed from the previously executed action chunk and the current measured gripper aperture. When α ≈ 0, the loss vanishes and sampling reduces to vanilla RTC; when α ≈ 1, the sampler prefers chunks whose vertical component is biased toward the desired altitude, compensating for payload. This can be viewed as mode-dependent feedforward (gain scheduling / gravity compensation), but injected inside the generative policy's sampling dynamics.
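The mode-dependent vertical bias might look like the following. The thresholds, the gripper-command convention (values near 0 meaning "closed"), and the offset magnitude are all assumptions; the text specifies only that α depends on the previous chunk's gripper command and the measured aperture, and that the loss biases the vertical action dimension upward under load.

```python
import numpy as np

def payload_confidence(aperture, close_cmd, closed_thresh=0.3):
    """α ∈ [0,1]: high when the gripper was commanded closed (close_cmd
    near 0, an assumed convention) and the measured aperture indicates
    an object is actually held."""
    commanded_closed = float(close_cmd < 0.5)
    holding = float(aperture < closed_thresh)
    return commanded_closed * holding

def vertical_guidance_loss(A, alpha, z_offset=0.05, z_dim=2):
    """Φ(A) penalizing vertical actions that deviate from a small tuned
    upward offset compensating the expected sag under load."""
    return alpha * 0.5 * np.sum((A[:, z_dim] - z_offset) ** 2)
```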
Gaussian Splat Data Pipeline
Collecting aerial manipulation demonstrations is time-consuming and requires skilled pilots. To efficiently generate diverse training data, we adopt a "Flying in Gaussian Splats"-style approach that couples photorealistic Gaussian-splatting reconstructions with a lightweight drone dynamics model to synthesize large volumes of training data from a small set of seed demonstrations.
Scene Reconstruction: We reconstruct each scene from short walk-throughs captured with the drone's forward-facing camera, resulting in metrically scaled poses used to train a 3D Gaussian splatting model. The resulting model renders photorealistic images from arbitrary camera poses in the captured region.
Gripper Segmentation and Compositing: The downward-facing camera provides critical visual information for manipulation, but the gripper is persistently visible in its field of view. Including raw downward-facing images would introduce an observation bias into the policy. Thus, we explicitly treat the gripper as a separate foreground layer and composite it onto renders from a gripper-free scene model. We use Segment Anything (SAM) to extract gripper masks and create a library of representative gripper appearances, which are then composited onto clean scene renders at synthesis time.
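The compositing step amounts to standard alpha blending of the gripper layer over the clean render. This sketch assumes float images in [0, 1] and a soft SAM-derived mask; how the gripper appearance is selected from the library is not specified here.

```python
import numpy as np

def composite_gripper(render, gripper_rgb, gripper_mask):
    """Paste a segmented gripper foreground layer onto a clean scene render.

    render, gripper_rgb: (H, W, 3) float images in [0, 1]
    gripper_mask: (H, W) float alpha mask (e.g. from SAM segmentation)
    """
    m = gripper_mask[..., None]                 # broadcast mask over channels
    return m * gripper_rgb + (1.0 - m) * render
```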
Domain-Randomized Data Synthesis: To enable the policy to recover from dangerous states near obstacles, we synthesize recovery trajectories by randomizing task geometry. For each nominal trajectory, we generate multiple randomized rollouts by perturbing initial states, randomizing terminal hover locations and gate exit waypoints, and inserting intermediate waypoints that force the drone to pass near gate extremities (top, bottom, left, or right), eliciting recovery behavior. This procedure yields a large set of physically and geometrically plausible trajectories covering both nominal executions and off-nominal approaches that require recovery near gates.
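The randomization procedure can be sketched as below. The noise magnitude, the gate-edge offset, and the choice of which waypoint to perturb toward an extremity are illustrative assumptions; only the kinds of randomization (initial perturbations, randomized waypoints, near-edge insertions) come from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def randomize_trajectory(waypoints, pos_noise=0.10, edge_offset=0.4):
    """Perturb a nominal waypoint list and insert one near-edge waypoint.

    waypoints: (N, 3) array of x, y, z positions from a seed trajectory.
    Returns an (N+1, 3) randomized rollout.
    """
    wp = waypoints + rng.normal(0.0, pos_noise, waypoints.shape)
    # Insert a waypoint shifted toward one gate extremity:
    # axis 1 (y) for left/right, axis 2 (z) for top/bottom.
    mid = len(wp) // 2
    axis = rng.choice([1, 2])
    sign = rng.choice([-1.0, 1.0])
    near_edge = wp[mid].copy()
    near_edge[axis] += sign * edge_offset
    return np.insert(wp, mid + 1, near_edge, axis=0)
```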