From Detection to Autonomous Action: Engineering Drone Intelligence on the Edge
It is -18°C. Visibility is limited. A drone lifts off on a reconnaissance mission deep in the Swedish Arctic: no operator in the loop, no network connection, no GPS-assisted handholding. The drone must find, identify, and pursue a target entirely on its own.
This is not a speculative scenario. It is the operational reality that modern defense systems are increasingly designed for: contested, communication-degraded environments where autonomy is not a luxury but a prerequisite for mission success.
What makes such a mission possible is not any single AI model. It is the integrated architecture beneath it, a tightly coupled system of perception, spatial reasoning, decision logic and control that operates within hard real-time constraints. Autonomy, in practice, is not a model. It is a system.
This article describes the engineering behind one such system: a fully onboard autonomous drone capability developed and tested in real-world arctic conditions. From the first detection frame to a closed-loop intercept, every capability described here runs at the edge, locally, independently, and without reliance on external infrastructure.

Detection is a starting point, not a capability. What transforms detection into operational value is the chain of behavior that follows: Where is the object in physical space? Which target matters most? What happens when it disappears from view? How does the drone close in once it reacquires the target?
The system described here addresses the full chain. At a high level, it enables a drone to detect targets, localize them in physical space, prioritize among them, hold context through detection gaps, and close in autonomously once a target is selected.
Rather than interpreting each camera frame in isolation, the drone builds an evolving picture of its environment, storing contextual information, making prioritization decisions, and executing autonomous movement based on classified targets. This shift from reactive perception to persistent, mission-aware behavior is what distinguishes a capable autonomous system from a smart camera.
The architecture is built around a set of functionally decoupled modules: reconnaissance, position estimation, target prioritization, memory-driven navigation, and closed-loop steering. Each is designed to operate within the real-time constraints of the flight controller while remaining composable into a larger autonomy stack.
The reconnaissance module is designed to be hardware-agnostic: inference runs on an onboard compute platform paired with a dedicated AI accelerator, supporting a range of modern edge inference architectures. Detection models are similarly modular; the system was validated using YOLO-family architectures, but the pipeline is not tied to any specific model or framework. What matters is that inference is fast, local, and deterministic.
Achieving real-time inference for flight control requires deliberate hardware and software optimization. At 30 fps and roughly 20 m/s airborne speed, a target latency of 30 ms or less is necessary, since position estimation error scales with both speed and delay. In this example we are using a Jetson Orin Nano and the YOLOv8 Nano model for object detection.
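The latency budget above follows from simple kinematics: the position error contributed by pipeline delay is speed multiplied by latency. A quick sanity check using the article's own figures:

```python
def latency_position_error(speed_m_s: float, latency_s: float) -> float:
    """Distance the platform (or target) moves while a frame is in
    flight through the inference pipeline."""
    return speed_m_s * latency_s

# At 20 m/s with a 30 ms pipeline, the world has moved roughly 0.6 m
# by the time a detection reaches the controller.
error_m = latency_position_error(20.0, 0.030)
print(f"{error_m:.2f} m")  # 0.60 m
```

This is why the 30 ms target matters: every additional 10 ms of latency at this airspeed adds about 0.2 m of stale-data error before any estimation noise is even considered.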
Each inference cycle produces structured detection outputs: object class, bounding box, detection confidence, and timestamp. These outputs are passed downstream to the position estimation module while simultaneously being written to a lightweight onboard detection database. This detection record is shared with both the ground station and other drones in the swarm using opportunistic data synchronization. When a radio link is available, updates are pushed live. When it is not, whether due to range, terrain, or a contested RF environment, the system falls back to a store-and-forward model. The result is an eventually consistent operational picture that degrades gracefully rather than failing hard.
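The store-and-forward pattern described above can be sketched as follows. The record fields match those the article names; the queue and link-check mechanics are illustrative assumptions, not the system's actual implementation.

```python
import json
from collections import deque

class DetectionSync:
    """Push detection records live when a radio link exists; otherwise
    buffer them and flush on reconnect. The result is an eventually
    consistent shared picture that degrades gracefully."""

    def __init__(self, send_fn, link_up_fn):
        self._send = send_fn        # transmits one serialized record
        self._link_up = link_up_fn  # returns True when the link is usable
        self._backlog = deque()     # store-and-forward buffer

    def record(self, obj_class, bbox, confidence, timestamp):
        msg = json.dumps({"class": obj_class, "bbox": bbox,
                          "confidence": confidence, "timestamp": timestamp})
        self._backlog.append(msg)
        self.flush()

    def flush(self):
        # Drain the backlog only while the link stays up; anything
        # left over waits for the next window.
        while self._backlog and self._link_up():
            self._send(self._backlog.popleft())
```

When the link predicate returns False, records simply accumulate; the next successful flush delivers them in order, which is what makes the shared picture eventually consistent rather than silently lossy.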
This shared detection state can then be fed directly into situational awareness tools such as TAK (Team Awareness Kit), giving the ground team a continuously updated tactical picture of what has been observed, where, and with what confidence, without requiring the drone to return to base.
In its simplest configuration, reconnaissance operates as a standalone mission: the drone conducts its scan, populates the shared database, and returns. In more complex configurations, the reconnaissance output triggers the next phase of the autonomy stack, handing off a prioritized target to the approach and intercept pipeline.
Object detection operates in image space. Autonomous behavior requires precise movement in the physical world. The position estimation module bridges this gap, transforming 2D bounding boxes into 3D world coordinates using geometry, drone pose, and prior knowledge of object dimensions.
The system estimates range using the pinhole camera model, combining calibrated camera geometry with object-class information. By leveraging prior knowledge about supported targets, it can infer metric distance directly from image data without requiring additional depth-sensing hardware.
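Under the pinhole model, metric range falls out of similar triangles: distance equals the focal length (in pixels) times the object's known physical height, divided by its observed pixel height. A minimal sketch, with the focal length and per-class heights as assumed values:

```python
# Known physical heights per supported target class (assumed values).
CLASS_HEIGHT_M = {"person": 1.75, "vehicle": 1.9}

def estimate_range(obj_class: str, bbox_height_px: float,
                   focal_length_px: float = 1400.0) -> float:
    """Pinhole-model range estimate: similar triangles between the
    object's real height and its projected height on the sensor."""
    real_height_m = CLASS_HEIGHT_M[obj_class]
    return focal_length_px * real_height_m / bbox_height_px

# A 1.9 m vehicle spanning 38 px at f = 1400 px sits about 70 m away.
print(f"{estimate_range('vehicle', 38.0):.1f} m")  # 70.0 m
```

Note how the same bounding box implies very different ranges for different classes, which is why class-conditioned priors are essential to this approach.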
The result is a metric world position for the target, expressed in the same coordinate frame used by the flight controller and all downstream modules.
Edge conditions introduce compounding noise: GPS drift, bounding box jitter from frame-to-frame detection variance, and estimation error that grows with target distance or adverse lighting. Rather than running a full filtering framework, the system applies a confidence-weighted running average across successive detections of the same target: higher-confidence detections contribute more strongly to the estimate, while low-confidence ones are down-weighted but not discarded.
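The fusion step can be sketched as a confidence-weighted running average with a simple distance gate. The gate radius here is an assumed parameter, and the structure is illustrative rather than the system's exact code:

```python
class PositionFuser:
    """Confidence-weighted running average over per-frame position
    estimates, with a gate that rejects implausible jumps instead of
    letting a single bad frame drag the fused estimate around."""

    def __init__(self, gate_m: float = 10.0):
        self._gate_m = gate_m
        self._weighted_sum = [0.0, 0.0, 0.0]
        self._total_weight = 0.0
        self.estimate = None  # fused (x, y, z) in meters

    def update(self, pos, confidence):
        if self.estimate is not None:
            dist = sum((p - e) ** 2
                       for p, e in zip(pos, self.estimate)) ** 0.5
            if dist > self._gate_m:
                return self.estimate  # gated out: implausible jump
        w = confidence  # higher-confidence frames weigh more
        self._weighted_sum = [s + w * p
                              for s, p in zip(self._weighted_sum, pos)]
        self._total_weight += w
        self.estimate = [s / self._total_weight
                         for s in self._weighted_sum]
        return self.estimate
```

A low-confidence frame still nudges the estimate, just weakly, while a detection that lands far outside the gate is ignored entirely, which is what lets the fused output converge well below the single-frame error.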
The combined effect of the weighted average of the position estimates and a gating mechanism produces accuracy considerably better than any single-frame estimate could achieve. In field tests, individual frame estimates carried per-frame errors in the 0.5% to 5% range, but the fused output converged to within 1–2 meters of the target's true position, accurate enough for the downstream navigation and steering modules to act on with confidence.
By transforming perception into spatial intelligence, this module is what allows the system to move from seeing objects to knowing where they are, a necessary step toward any meaningful autonomous action.
Real operational environments are rarely clean. A reconnaissance pass over contested terrain may yield simultaneous detections of multiple objects with different classes, sighting counts, and confidence levels. The system needs a principled way to decide which target matters most, and it needs to make that decision locally, in real time, without operator input.
The target prioritization module solves this by evaluating all active detections through a weighted scoring formula.
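The article does not publish its scoring weights, so the formula below is a hedged sketch: a linear combination of the factors the text names (class priority, detection confidence, sighting count, and recency), with every weight and constant invented for illustration.

```python
import math

# Illustrative class priorities; the real system's values are not public.
CLASS_PRIORITY = {"vehicle": 1.0, "person": 0.6}

def target_score(obj_class, confidence, sightings, age_s,
                 w_class=0.4, w_conf=0.3, w_sight=0.2, w_recency=0.1,
                 half_life_s=30.0):
    """Weighted score over class, confidence, number of sightings,
    and how stale the track is (exponential recency decay)."""
    recency = math.exp(-math.log(2) * age_s / half_life_s)
    sighting_term = min(sightings / 5.0, 1.0)  # saturates after 5 sightings
    return (w_class * CLASS_PRIORITY.get(obj_class, 0.0)
            + w_conf * confidence
            + w_sight * sighting_term
            + w_recency * recency)

# Two active tracks: (class, confidence, sightings, seconds since last seen).
tracks = [("person", 0.95, 8, 2.0), ("vehicle", 0.80, 3, 5.0)]
ranked = sorted(tracks, key=lambda t: target_score(*t), reverse=True)
print(ranked[0][0])
```

The exponential recency term is one simple way to age out stale tracks smoothly rather than dropping them at a hard cutoff; under these particular weights, the higher-priority vehicle class outranks a fresher but lower-priority person track.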
The result is a single ranked target list, updated continuously as new detections arrive and old ones age. The highest-scoring target at any given moment is passed downstream as the selected target for navigation and approach. Because scoring runs within the same real-time loop as detection and position estimation, target selection remains responsive to changes in the scene; if a higher-priority target enters the field of view mid-mission, the system can react without operator intervention.
Visual contact with a target is not guaranteed to persist. Targets move behind cover, exit the camera's field of view during aggressive maneuvering, or temporarily disappear due to motion blur, lighting transitions, or detection dropout. A system that discards its target the moment it loses visual contact is operationally brittle. The memory-driven navigation module addresses this directly.
When the selected target is no longer detected, the system does not reset. Instead, it transitions into a transport behavior: holding the last confirmed world position of the target in memory and navigating toward it as an intermediate waypoint while the perception stack continues scanning for reacquisition. The stored record includes the estimated world position, the timestamp of the last detection, and the confidence at the time of last observation. Together these represent the system's best available belief about where the target is, and they provide the basis for deciding whether continued pursuit remains warranted.
The moment the target is reacquired above a confidence threshold, the system exits the transport behavior and transitions directly into closed-loop steering. This capability is what allows the drone to retain operational context across detection gaps, essential for reliable performance in the dynamic environments that edge autonomy is designed for.
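The stored belief and the pursue-or-abandon decision can be sketched as follows; the staleness and confidence thresholds are assumed parameters, chosen only to make the logic concrete.

```python
from dataclasses import dataclass

@dataclass
class TargetMemory:
    """Last confirmed belief about the selected target, held across
    detection gaps (the transport behavior)."""
    position: tuple      # last estimated world position (x, y, z), meters
    last_seen_s: float   # timestamp of last detection
    confidence: float    # confidence at time of last observation

    def pursuit_warranted(self, now_s: float,
                          max_gap_s: float = 20.0,
                          min_confidence: float = 0.4) -> bool:
        # Abandon pursuit once the belief is too stale or was weak to
        # begin with; otherwise keep flying to the stored waypoint.
        stale = (now_s - self.last_seen_s) > max_gap_s
        return (not stale) and self.confidence >= min_confidence

mem = TargetMemory(position=(512.0, -88.0, 30.0),
                   last_seen_s=100.0, confidence=0.82)
print(mem.pursuit_warranted(now_s=112.0))  # True: 12 s gap, strong last fix
print(mem.pursuit_warranted(now_s=140.0))  # False: 40 s gap, belief too stale
```

The key design point is that losing sight of the target changes the drone's mode, not its goal: the waypoint persists until the belief itself is judged no longer worth acting on.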
When a target is reacquired, the system transitions from navigating to an estimated position to actively closing on a live, continuously updated one. This is the closed-loop autonomous steering module, where the full autonomy stack converges into a single, tightly integrated control loop.
A PID controller converts the target's continuously updated position error into steering commands, yielding pursuit that remains smooth and stable even as detection quality fluctuates. The loop runs at approximately 30–50 Hz, aligned with the flight controller's update frequency.
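A minimal sketch of the kind of PID loop described. The gains and output limit are illustrative, and a real implementation would run one such controller per control axis at the flight controller's rate:

```python
class PID:
    """Textbook PID: command = Kp*e + Ki*(integral of e) + Kd*(de/dt).
    One instance per axis, e.g. yaw driven by horizontal pixel error,
    pitch or throttle driven by vertical or range error."""

    def __init__(self, kp=0.8, ki=0.05, kd=0.2, output_limit=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.limit = output_limit
        self._integral = 0.0
        self._prev_error = None

    def step(self, error: float, dt: float) -> float:
        self._integral += error * dt
        derivative = (0.0 if self._prev_error is None
                      else (error - self._prev_error) / dt)
        self._prev_error = error
        out = (self.kp * error + self.ki * self._integral
               + self.kd * derivative)
        # Clamp to the actuator's command range.
        return max(-self.limit, min(self.limit, out))

# At 30-50 Hz, dt is roughly 20-33 ms per step; the error is the
# target's offset from the image center or from the desired range.
yaw = PID()
cmd = yaw.step(error=0.25, dt=0.033)
```

The output clamp matters in practice: it keeps a momentary detection glitch (a large transient error) from commanding an aggressive maneuver that would itself throw the target out of frame.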
At a system level, closed-loop steering completes the autonomy cycle. The drone is no longer observing or navigating toward an estimate; it is actively tracking a dynamic target in real time, fully onboard, with no external compute and no operator in the loop.
The five modules described in this article (reconnaissance, position estimation, target prioritization, memory-driven navigation, and closed-loop steering) function as an integrated autonomy stack. But they are also individually composable. Each module exposes clean interfaces, consumes structured inputs, and produces structured outputs. Reconnaissance can run without triggering an approach. Position estimation can feed a situational awareness tool without feeding a control loop. This composability is a deliberate architectural choice: it allows the system to be deployed at different levels of autonomy depending on the mission type, drone platform, and rules of engagement.
A static detection model reflects the conditions it was trained on, not those it encounters in the field. Arctic environments, camouflage, or new target classes will degrade model performance over time. To address this, the system integrates with Scaleout Edge, an Edge AI orchestration platform that manages the full model lifecycle across distributed nodes. Updated models can then be packaged, versioned, and pushed to individual drones or entire swarms without manual intervention, keeping inference performance aligned with operational reality.
Scaleout Edge also enables federated learning across units. Each ground station trains models locally on the operational data gathered by its battalion (detections, conditions, target configurations encountered in the field) without that data ever needing to leave the unit. Only model updates are shared and aggregated into a common model, which is redistributed across all participating ground stations. A battalion that encounters a previously unseen camouflage pattern or a new target type in one area improves the detection capability of every other unit in the next training cycle. The model becomes a living asset, continuously sharpened by collective operational experience, while sensitive field data stays exactly where it belongs.
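The aggregation step in such a federated scheme is commonly a weighted average of per-unit model updates, FedAvg-style. The sketch below assumes models are flat parameter vectors and weights each unit by its local sample count; it illustrates the general pattern, not Scaleout Edge's actual implementation.

```python
def federated_average(updates):
    """FedAvg-style aggregation. Each entry is (params, n_samples).
    Raw field data never leaves the unit; only parameter vectors do."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    merged = [0.0] * dim
    for params, n in updates:
        for i, p in enumerate(params):
            merged[i] += p * (n / total)  # weight by local sample count
    return merged

# Two ground stations: one trained on 300 local samples, one on 100.
battalion_a = ([0.10, 0.40], 300)
battalion_b = ([0.30, 0.00], 100)
merged = federated_average([battalion_a, battalion_b])
print(merged)  # roughly [0.15, 0.30]: a 3:1 blend of the two updates
```

Weighting by sample count means a battalion that saw far more of a new condition pulls the common model further toward its experience, which is exactly the collective-sharpening effect described above.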
The logical next step is distribution. The onboard database and opportunistic sync architecture already lay the groundwork for swarm-level coordination: multiple drones sharing a common operational picture, dividing search areas, handing off targets and collectively maintaining situational awareness across a contested environment. Edge-first intelligence, designed correctly, scales not by centralizing compute but by distributing it.
The system described in this article was not developed in a lab and extrapolated to the field; it was built for the field and tested there. Validation took place during an arctic strike demonstration in Sweden, conducted in harsh winter conditions that represent some of the most demanding operational environments edge autonomy can face: sub-zero temperatures, low-contrast terrain, and heavy fog that challenged the object detection models.
Across the demonstration, the full autonomy stack (reconnaissance, position estimation, target prioritization, memory-driven navigation, and closed-loop steering) was tested end-to-end on a single airframe, with no external compute and no operator in the loop during target pursuit. Position estimates converged to within 1–2 meters under field conditions, the inference pipeline sustained roughly 30 ms latency throughout, and the control loop maintained stable target pursuit through detection dropouts and target movement.
The operational implication is straightforward: a single drone, operating without connectivity or operator input, can detect, prioritize, pursue, and reacquire a target in contested conditions using only onboard compute. That capability, validated in arctic field conditions, changes what a small unit can accomplish with limited assets in communication-denied environments.