Method
System Overview
AirVLA is a vision-language-action system for aerial manipulation that transfers manipulation capabilities from foundation models pretrained on fixed-base robot arms to an underactuated quadrotor platform. The system takes as input RGB images from multiple viewpoints and a natural language task description, and outputs relative end-effector pose commands executed by a low-level flight controller.
The key challenge in this transfer is the mismatch between the quasi-static regime of tabletop manipulation and the dynamic regime of aerial manipulation, where the platform must continuously stabilize against gravity while simultaneously executing precise gripper motions. Grasping an object introduces a step change in effective mass that, if uncompensated, causes the drone to sag and potentially fail the task.
Our approach addresses this challenge through two main contributions: (1) a physics-aware guidance mechanism that augments the pretrained policy's action generation process with payload-aware vertical compensation at inference time, and (2) a Gaussian-splatting data pipeline that enables efficient collection and synthesis of diverse training trajectories from a small number of seed flights. Together, these allow a VLA model pretrained on large-scale robot manipulation data to perform aerial pick-and-place and navigation with minimal drone-specific fine-tuning.
Hardware
The system integrates the ModalAI Starling 2 Max drone with a customized Universal Manipulation Interface (UMI) gripper and multiple cameras to enable autonomous aerial manipulation. The Starling 2 Max is powered by the VOXL 2 companion computer (Qualcomm QRB5165). The customized UMI-style gripper is attached to the underside of the drone, enabling dynamic grasping. One external camera and two onboard cameras (downward- and forward-facing) provide RGB images at 5 Hz.
Gripper Design: The gripper is designed to be built cheaply without specialized tools. The frame is entirely 3D printed, with two hobby-grade servos slotting in without screws. The fingers are adapted from the UMI gripper, facilitating direct comparison with arm-based policies trained with similar end-effector geometry. The design prioritizes low weight for extended flight time while maintaining sufficient grip strength for the target objects.
Observation and Action Spaces: The observation space consists of RGB images from the three cameras at 256×256 resolution, the drone's estimated pose from a motion capture system, and the gripper aperture. The action space consists of drone position, yaw, and gripper commands, represented as an action chunk A ∈ ℝ^{H×D}: each of the H steps in the horizon contains a D-dimensional action comprising a 4-DoF end-effector delta pose (position and yaw) and a gripper command. Actions are generated at 10 Hz and executed by the PX4 flight controller via position setpoints.
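The action-chunk layout above can be sketched as follows. The specific values H = 50 and D = 5, and the column ordering, are illustrative assumptions; the text only specifies 4-DoF delta poses plus a gripper command over a horizon.

```python
import numpy as np

# Assumed layout (illustrative): H = 50 steps, D = 5 dims
# per step: (Δx, Δy, Δz, Δyaw, gripper command).
H, D = 50, 5

def make_chunk() -> np.ndarray:
    """Return a zero-initialized action chunk A ∈ ℝ^{H×D}."""
    return np.zeros((H, D), dtype=np.float32)

def split_chunk(A: np.ndarray):
    """Split a chunk into 4-DoF delta poses and gripper commands."""
    delta_pose = A[:, :4]   # Δx, Δy, Δz, Δyaw for each step
    gripper = A[:, 4]       # gripper command for each step
    return delta_pose, gripper

A = make_chunk()
poses, grip = split_chunk(A)
```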
Policy Architecture
We build on π0, a vision-language-action model that represents the conditional action distribution using a flow-matching (continuous-time) generative model. Given an observation o (images, proprioception, language), π0 defines a velocity field v_θ(x_τ, o, τ) over latent action chunks and diffusion/flow time τ ∈ [0,1]. Sampling draws x_0 ~ 𝒩(0, I) and integrates the ODE to obtain the generated action chunk A = x_1.
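The sampling procedure can be sketched with a simple Euler integrator. The velocity field here is a toy stand-in (it pulls the sample toward a fixed target chunk under a linear interpolation schedule); the real v_θ is the learned network, and π0's actual integrator and step count are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def v_theta(x, obs, tau):
    """Toy velocity field standing in for the learned network.
    Pulls x toward a target chunk along a linear-interpolation schedule."""
    target = obs["target"]
    return (target - x) / max(1.0 - tau, 1e-3)

def sample_chunk(obs, steps=10, shape=(50, 5)):
    """Euler integration of dx/dτ = v_θ(x, o, τ) from τ = 0 to τ = 1."""
    x = rng.standard_normal(shape)      # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for k in range(steps):
        tau = k / steps
        x = x + dt * v_theta(x, obs, tau)
    return x                            # A = x_1
```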
For real-time execution, we employ Real-Time Chunking (RTC), which enables asynchronous inference by freezing the prefix of the next chunk that will execute before inference completes, and inpainting the remaining suffix conditioned on the frozen prefix. RTC defines a soft temporal mask over the horizon that blends previously committed actions with newly generated actions to avoid discontinuities at chunk boundaries.
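The soft temporal mask can be illustrated as a per-step blend between the committed chunk and the newly generated one. This is a simplified sketch: actual RTC inpaints the suffix conditioned on the frozen prefix during sampling rather than blending finished chunks, and the ramp shape here is an assumption.

```python
import numpy as np

def rtc_blend(prev_chunk, new_chunk, frozen_steps, ramp_steps):
    """Blend a previously committed chunk with a freshly generated one.

    Steps [0, frozen_steps) keep the committed actions exactly;
    the next ramp_steps linearly hand over to the new chunk;
    the remainder uses the new chunk alone.
    """
    H = prev_chunk.shape[0]
    w = np.ones(H)                                  # weight on committed actions
    ramp = np.linspace(1.0, 0.0, ramp_steps + 2)[1:-1]
    w[frozen_steps:frozen_steps + ramp_steps] = ramp
    w[frozen_steps + ramp_steps:] = 0.0
    return w[:, None] * prev_chunk + (1.0 - w[:, None]) * new_chunk
```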
Physics-Aware Guidance for Action Generation
RTC shows that inference-time steering can be implemented by modifying the velocity field during sampling. We generalize this idea by introducing a guidance loss Φ(A; o) over the denoised action chunk and adding its gradient as a correction term to the base velocity field. This steers sampling toward a 'sweet spot' between the policy's priors and physical constraints: actions that remain high-probability under the VLA (preserving learned manipulation skills) while minimizing the guidance loss (enforcing flight feasibility), effectively steering the drone at runtime without retraining.
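A minimal sketch of the gradient correction, assuming a quadratic guidance loss on a single action dimension (the exact form of Φ, the guidance weight λ, and the dimension index are illustrative assumptions):

```python
import numpy as np

def phi_grad(x, z_ref, z_dim=2):
    """Gradient of Φ(A) = 0.5 · Σ_t (A[t, z_dim] − z_ref)² w.r.t. A."""
    g = np.zeros_like(x)
    g[:, z_dim] = x[:, z_dim] - z_ref
    return g

def guided_step(x, v_base, tau, dt, z_ref, lam=1.0):
    """One Euler step with the guidance correction:
    v_guided = v_base(x, τ) − λ ∇_A Φ(A; o)."""
    v = v_base(x, tau) - lam * phi_grad(x, z_ref)
    return x + dt * v
```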
Payload-Aware Vertical Guidance: In aerial manipulation, the dominant disturbance during grasping is an effective mass increase that manifests as vertical sag under load. Rather than modeling full 6-DOF dynamics, we apply guidance only on the altitude-related action dimension. We bias the drone toward slightly higher altitude under load via a tuned offset capturing the expected sag for typical payloads.
The payload confidence α(o, A_{t−1}) ∈ [0,1] is computed from the previously executed action chunk and the current measured gripper aperture. When α ≈ 0, the loss vanishes and sampling reduces to vanilla RTC; when α ≈ 1, the sampler prefers chunks whose vertical component is biased toward the desired altitude, compensating for payload. This can be viewed as mode-dependent feedforward (gain scheduling / gravity compensation), but injected inside the generative policy's sampling dynamics.
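The mode-dependent vertical bias might look like the following. The thresholds, the gripper-command convention (values near 0 meaning "closed"), and the offset magnitude are all assumptions; the text specifies only that α depends on the previous chunk's gripper command and the measured aperture, and that the loss biases the vertical action dimension upward under load.

```python
import numpy as np

def payload_confidence(aperture, close_cmd, closed_thresh=0.3):
    """α ∈ [0,1]: high when the gripper was commanded closed (close_cmd
    near 0, an assumed convention) and the measured aperture indicates
    an object is actually held."""
    commanded_closed = float(close_cmd < 0.5)
    holding = float(aperture < closed_thresh)
    return commanded_closed * holding

def vertical_guidance_loss(A, alpha, z_offset=0.05, z_dim=2):
    """Φ(A) penalizing vertical actions that deviate from a small tuned
    upward offset compensating the expected sag under load."""
    return alpha * 0.5 * np.sum((A[:, z_dim] - z_offset) ** 2)
```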
Gaussian Splat Data Pipeline
Collecting aerial manipulation demonstrations is time-consuming and requires skilled pilots. To efficiently generate diverse training data, we adopt a "Flying in Gaussian Splats"-style approach that couples photorealistic Gaussian-splatting reconstructions with a lightweight drone dynamics model to synthesize large volumes of training data from a small set of seed demonstrations.
Scene Reconstruction: We reconstruct each scene from short walk-throughs captured with the drone's forward-facing camera, resulting in metrically scaled poses used to train a 3D Gaussian splatting model. The resulting model renders photorealistic images from arbitrary camera poses in the captured region.
Gripper Segmentation and Compositing: The downward-facing camera provides critical visual information for manipulation, but the gripper is persistently visible in its field of view. Including raw downward-facing images would introduce an observation bias into the policy. Thus, we explicitly treat the gripper as a separate foreground layer and composite it onto renders from a gripper-free scene model. We use Segment Anything (SAM) to extract gripper masks and create a library of representative gripper appearances, which are then composited onto clean scene renders at synthesis time.
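The compositing step amounts to standard alpha blending of the gripper layer over the clean render. This sketch assumes float images in [0, 1] and a soft SAM-derived mask; how the gripper appearance is selected from the library is not specified here.

```python
import numpy as np

def composite_gripper(render, gripper_rgb, gripper_mask):
    """Paste a segmented gripper foreground layer onto a clean scene render.

    render, gripper_rgb: (H, W, 3) float images in [0, 1]
    gripper_mask: (H, W) float alpha mask (e.g. from SAM segmentation)
    """
    m = gripper_mask[..., None]                 # broadcast mask over channels
    return m * gripper_rgb + (1.0 - m) * render
```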
Domain-Randomized Data Synthesis: To enable the policy to recover from dangerous states near obstacles, we synthesize recovery trajectories by randomizing task geometry. For each nominal trajectory, we generate multiple randomized rollouts by perturbing initial states, randomizing terminal hover locations and gate exit waypoints, and inserting intermediate waypoints that force the drone to pass near gate extremities (top, bottom, left, or right), eliciting recovery behavior. This procedure yields a large set of physically and geometrically plausible trajectories covering both nominal executions and off-nominal approaches that require recovery near gates.
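The randomization procedure can be sketched as below. The noise magnitude, the gate-edge offset, and the choice of which waypoint to perturb toward an extremity are illustrative assumptions; only the kinds of randomization (initial perturbations, randomized waypoints, near-edge insertions) come from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def randomize_trajectory(waypoints, pos_noise=0.10, edge_offset=0.4):
    """Perturb a nominal waypoint list and insert one near-edge waypoint.

    waypoints: (N, 3) array of x, y, z positions from a seed trajectory.
    Returns an (N+1, 3) randomized rollout.
    """
    wp = waypoints + rng.normal(0.0, pos_noise, waypoints.shape)
    # Insert a waypoint shifted toward one gate extremity:
    # axis 1 (y) for left/right, axis 2 (z) for top/bottom.
    mid = len(wp) // 2
    axis = rng.choice([1, 2])
    sign = rng.choice([-1.0, 1.0])
    near_edge = wp[mid].copy()
    near_edge[axis] += sign * edge_offset
    return np.insert(wp, mid + 1, near_edge, axis=0)
```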