Seeing the Invisible: Deep Learning for Infrared Small Target Detection

How we built a synthetic data pipeline and deep learning system to detect dim, fast-moving objects in infrared imagery — without access to real operational data.

The Problem

Detecting a small, fast-moving object against a cluttered, dynamic background is one of the harder open problems in computer vision. Now constrain it further: the object is dim, potentially occluded, moving unpredictably, and the sensor is itself in motion. The signal-to-noise ratio is low. The background, whether sky, terrain, or sea, is not static. And you need to detect, which means localizing the target in the frame, not just classifying an image that is already known to contain one.

This is the core challenge in infrared small target detection (IRSTD), and it sits at the intersection of low-level signal processing and high-level deep learning. My work at IIT Roorkee under a DRDO collaboration has been focused squarely on this problem.

Why Infrared?

Visible-spectrum cameras struggle in low light, in adverse weather, and at long range. Infrared sensors capture thermal emissions, making them effective in conditions where optical cameras fail. They are also harder to spoof than radar. For defence applications, IR imaging is a primary modality, which is precisely why detection algorithms that work robustly in IR are strategically important.

The challenge is that the targets of interest are often just a handful of pixels in a high-resolution frame. Traditional detection pipelines rely on handcrafted features: local contrast, saliency maps, morphological operations. These break down as backgrounds become more complex, because structures such as cloud edges and terrain discontinuities produce responses as strong as a real target's.
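To make "local contrast" concrete, here is a minimal sketch of one handcrafted detector of that kind, in Python with NumPy. It is an illustration only, not a method from this project: the function names `box_mean` and `local_contrast_map`, the window sizes, and the toy image are all assumptions. It compares the mean of a small inner window with the mean of the ring of pixels around it.

```python
import numpy as np

def box_mean(img, k):
    """Mean over a k x k window (k odd), using edge padding at the borders."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.mean(axis=(-2, -1))

def local_contrast_map(img, inner=3, outer=9):
    """Inner-window mean minus the mean of the surrounding ring (hypothetical helper)."""
    img = img.astype(np.float64)
    inner_mean = box_mean(img, inner)
    outer_mean = box_mean(img, outer)
    n_in, n_out = inner * inner, outer * outer
    # Subtract the inner window's share of the outer sum to get the ring mean.
    ring_mean = (outer_mean * n_out - inner_mean * n_in) / (n_out - n_in)
    return inner_mean - ring_mean

# Toy image (assumed values): smooth horizontal gradient plus one small blob.
yy, xx = np.mgrid[0:48, 0:64]
image = 100.0 + 0.5 * xx
image += 30.0 * np.exp(-((yy - 20) ** 2 + (xx - 30) ** 2) / 2.0)

contrast = local_contrast_map(image)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(contrast), contrast.shape))
print(peak)  # the strongest response sits on the blob at row 20, column 30
```

On a smooth gradient the inner and ring means cancel, so only the blob stands out. A sharp cloud edge or a bright terrain feature would not cancel in the same way, which is the failure mode described above.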

The Data Problem

The most significant constraint in this work is data. Real IR imagery from operational systems is classified. Annotated datasets simply do not exist in the public domain at the scale needed to train modern deep learning models. This is the central research problem that everything else flows from.

We cannot collect it. We cannot scrape it. We cannot crowd-source it. The only option is to generate it.

Synthetic Data Pipeline

We built a pipeline that procedurally generates IR image sequences with realistic versions of four components:

Background clutter — sky gradients, terrain texture, sea surface, urban heat signatures
Atmospheric effects — turbulence, haze, sensor noise models calibrated to real sensor specs
Target signatures — point spread functions derived from physical emissivity models at varying distances and velocities
Motion dynamics — target trajectories with realistic kinematic constraints

The pipeline outputs annotated sequences: bounding boxes, track IDs, depth estimates. The goal was not photorealism but distribution fidelity — the synthetic data should produce the same failure modes and ambiguities that make real IR data hard.
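As a rough illustration of the idea, and not the actual pipeline, the sketch below generates a toy annotated sequence in Python with NumPy. Everything in it is an assumption made for the example: the function names `make_frame` and `make_sequence`, the vertical gradient standing in for sky, the Gaussian blob standing in for a point spread function, the additive Gaussian noise, and the constant-velocity trajectory. Depth estimates are left out.

```python
import numpy as np

def make_frame(h, w, center, sigma=1.2, amplitude=25.0, noise_std=2.0, rng=None):
    """One toy IR frame: gradient background + Gaussian target + sensor noise."""
    rng = np.random.default_rng() if rng is None else rng
    yy, xx = np.mgrid[0:h, 0:w]
    background = 80.0 + 40.0 * yy / (h - 1)           # assumed sky-like gradient
    cy, cx = center                                   # sub-pixel positions allowed
    target = amplitude * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    noise = rng.normal(0.0, noise_std, size=(h, w))   # assumed Gaussian noise model
    frame = background + target + noise
    # Bounding box of about 3 sigma around the target, clipped to the image.
    r = int(np.ceil(3 * sigma))
    x_min, x_max = max(int(round(cx)) - r, 0), min(int(round(cx)) + r, w - 1)
    y_min, y_max = max(int(round(cy)) - r, 0), min(int(round(cy)) + r, h - 1)
    return frame, (x_min, y_min, x_max, y_max)

def make_sequence(n_frames, h, w, start, velocity, seed=0):
    """Constant-velocity track; returns frames (T, H, W) and per-frame annotations."""
    rng = np.random.default_rng(seed)
    frames, annotations = [], []
    for t in range(n_frames):
        cy = start[0] + velocity[0] * t
        cx = start[1] + velocity[1] * t
        frame, bbox = make_frame(h, w, (cy, cx), rng=rng)
        frames.append(frame)
        annotations.append({"frame": t, "track_id": 0, "bbox": bbox})
    return np.stack(frames), annotations

frames, annotations = make_sequence(5, 64, 64, start=(20.0, 10.0), velocity=(0.5, 2.0))
print(frames.shape, annotations[0])
```

A fixed seed keeps the sequence reproducible, which matters when the same synthetic set is reused across experiments.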

Getting this right required close coordination with domain experts and iterative validation. We had reference imagery we could compare against qualitatively but never directly use.

Model Architecture

The architecture we settled on is a two-stage system:

Stage 1 — Spatio-temporal feature extraction. A lightweight backbone processes consecutive frames jointly. The temporal dimension is critical: a true target moves consistently with physical dynamics; background clutter does not. Frame differencing alone is insufficient for slow-moving or distant targets, so we encode temporal context across a short window.
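The point about slow targets can be checked with a small noise-free experiment. The sketch below is in Python with NumPy and does not show the learned backbone. It only shows how consecutive frames can be stacked into short clips, and how little signal a one-frame difference carries when the target moves a tenth of a pixel per frame. The helper names `render_target` and `make_clips`, the window length of 8, and the target parameters are assumptions.

```python
import numpy as np

def render_target(h, w, cy, cx, sigma=1.2, amplitude=10.0):
    """Noise-free Gaussian blob at sub-pixel position (cy, cx)."""
    yy, xx = np.mgrid[0:h, 0:w]
    return amplitude * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))

def make_clips(frames, window):
    """Stack each run of `window` consecutive frames.

    (N, H, W) -> (N - window + 1, window, H, W)
    """
    clips = np.lib.stride_tricks.sliding_window_view(frames, window, axis=0)
    # sliding_window_view puts the window axis last; move it next to the clip index.
    return np.moveaxis(clips, -1, 1)

# Slow target: 0.1 pixel per frame along x (assumed speed).
frames = np.stack([render_target(32, 32, 16.0, 10.0 + 0.1 * t) for t in range(12)])
clips = make_clips(frames, window=8)

one_step = np.abs(frames[1] - frames[0]).max()             # consecutive-frame difference
across_window = np.abs(clips[0, -1] - clips[0, 0]).max()   # first vs last frame of a clip
print(clips.shape, one_step, across_window)
```

In this toy case the difference across the eight-frame window is several times larger than the one-step difference. A model that sees the whole clip has access to that longer baseline, whereas a single frame difference can easily sink below the sensor noise.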

Stage 2 — Detection head with false-alarm suppression. A small detection head predicts target locations, followed by a learned suppression module trained to reject high-response false alarms from background structure (cloud edges, terrain discontinuities, sensor artifacts).

The model is deliberately compact — real-time inference on embedded hardware is a hard constraint in this application domain.

Results and Observations

On our synthetic test sets, the system achieves strong detection rates at low false-alarm rates across a range of signal-to-clutter ratios. More importantly, ablation studies confirm the contribution of each component: removing the temporal branch degrades performance significantly on slow-moving targets; removing the suppression module increases false alarms in textured backgrounds.
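For readers unfamiliar with these quantities, the sketch below shows one common way to compute them in Python with NumPy. It is not the evaluation code behind the results above. The function names, the local-background margin, the distance threshold for counting a hit, and the choice to report false alarms per frame are all assumptions. Signal-to-clutter ratio is taken here as the absolute difference between the target mean and the local background mean, divided by the local background standard deviation.

```python
import numpy as np

def signal_to_clutter_ratio(img, target_mask, margin=10):
    """SCR = |mean(target) - mean(local background)| / std(local background).

    `target_mask` is a boolean array marking target pixels; the local
    background is a box extending `margin` pixels beyond the target.
    """
    ys, xs = np.nonzero(target_mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, img.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, img.shape[1])
    local = img[y0:y1, x0:x1]
    local_mask = target_mask[y0:y1, x0:x1]
    background = local[~local_mask]
    return abs(local[local_mask].mean() - background.mean()) / background.std()

def detection_metrics(detections, truths, max_dist=3.0):
    """Detection rate and false alarms per frame from (y, x) point lists.

    A detection is a hit if it lies within `max_dist` pixels of a
    ground-truth centre that has not been matched yet.
    """
    hits = false_alarms = total_targets = 0
    for dets, gts in zip(detections, truths):
        unmatched = list(gts)
        total_targets += len(gts)
        for d in dets:
            dists = [np.hypot(d[0] - g[0], d[1] - g[1]) for g in unmatched]
            if dists and min(dists) <= max_dist:
                unmatched.pop(int(np.argmin(dists)))
                hits += 1
            else:
                false_alarms += 1
    return hits / total_targets, false_alarms / len(detections)

# Two toy frames: one correct detection, two false alarms, one missed target.
detections = [[(10, 10), (30, 30)], [(5, 5)]]
truths = [[(11, 10)], [(20, 20)]]
p_d, fa_per_frame = detection_metrics(detections, truths)
print(p_d, fa_per_frame)  # 0.5 1.0
```

Sweeping the detector's confidence threshold and recording these two numbers at each setting gives the detection-rate versus false-alarm curve. Note that the sketch assumes at least one ground-truth target and a non-constant background, otherwise the divisions are undefined.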

The harder question — sim-to-real transfer — is one I can speak to only partially. Qualitative comparisons with reference data are encouraging. Quantitative evaluation on real operational data is not something I am in a position to publish.

Reflections

Working at the boundary of academic research and defence applications means navigating a genuine tension: the scientific instinct to open-source, reproduce, and publish everything collides with legitimate restrictions on what can be shared. The work is real, the methods are novel, and the engineering is non-trivial — but the output lives behind a classification wall.

What I can share is the methodology, the general architecture, and the lessons learned about building ML systems under data scarcity. If you are working on a related problem and want to talk through approaches, I am always open to that conversation.