Engineering · March 14, 2025 · 12 min read

DrikNetra: Multi-Frame Super-Resolution for License Plates That Cameras Can't Read

Single frames from CCTV are too blurry. We fuse multiple video frames into one crisp plate image — turning 32x16 pixel smears into readable text.

Hansraj Patel

The Universal ANPR Problem

Every ANPR system faces the same fundamental problem: the camera is too far away and the plate is too small.

A typical Indian CCTV camera is mounted 5-8 meters high on a pole, covering a 30-meter stretch of road. A standard Indian license plate is 500mm x 120mm. At 20 meters distance on a 720p camera with a 4mm lens, that plate occupies roughly 40x10 pixels in the frame.

40x10 pixels. Ten characters of text. That is 4 pixels per character width. OCR on 4-pixel-wide characters does not work. No amount of model architecture innovation changes the physics.

The traditional solution is to install dedicated ANPR cameras — high-resolution, narrow field of view, IR illuminated, positioned at optimal distance and angle. These work. They also cost 5-10x more than a standard CCTV camera, require dedicated mounting infrastructure, and cover a single lane.

India has 50 million CCTV cameras already installed. We do not need more cameras. We need more intelligence from the cameras that exist.

The Multi-Frame Insight

A vehicle does not appear in a single frame. It passes through the camera’s field of view over 1-3 seconds — that is 30-90 frames at standard frame rates. In each frame, the plate is captured from a slightly different angle, at a slightly different position, with slightly different motion blur and noise.

Each individual frame is too degraded to read. But collectively, those 30-90 frames contain enough information to reconstruct a readable plate image. The information is there. It is just spread across time.

This is the core insight behind DrikNetra: treat license plate reading as a video super-resolution problem, not a single-image OCR problem.

Frame 1:  [blurry, motion-right]     ──┐
Frame 2:  [blurry, slightly shifted]   │
Frame 3:  [less blur, noise]           │──► Super-Resolution ──► [Sharp plate image]
Frame 4:  [blur, partial occlusion]    │
Frame 5:  [moderate quality]          ──┘

The DrikNetra Pipeline

The pipeline has four stages:

Stage 1: Plate Detection and Tracking

Before we can super-resolve a plate, we need to find it and track it across frames.

The plate detector runs as a secondary detection head on tracked vehicles. When a vehicle enters a defined zone (e.g., approaching a toll booth or crossing a stop line), the system crops the lower region of the vehicle bounding box and runs a lightweight plate detection model.

from collections import defaultdict

import numpy as np

class PlateTracker:
    def __init__(self):
        self.plate_detector = PlateDetNet()  # lightweight CNN, 2ms inference
        self.plate_buffer = defaultdict(list)  # track_id -> list of plate crops

    def process(self, track: Track, frame: np.ndarray):
        # Crop vehicle region
        vehicle_crop = frame[track.bbox.y1:track.bbox.y2,
                             track.bbox.x1:track.bbox.x2]

        # Detect plate within vehicle crop
        plate_bbox = self.plate_detector(vehicle_crop)
        if plate_bbox is None:
            return

        # Extract plate crop with padding (clamped so the slice stays in bounds)
        plate_crop = vehicle_crop[
            max(plate_bbox.y1 - 5, 0) : plate_bbox.y2 + 5,
            max(plate_bbox.x1 - 5, 0) : plate_bbox.x2 + 5
        ]

        # Store with metadata
        self.plate_buffer[track.id].append({
            "crop": plate_crop,
            "frame_id": track.frame_id,
            "bbox_size": (plate_bbox.w, plate_bbox.h),
            "vehicle_speed": track.speed,
            "quality_score": self.assess_quality(plate_crop)
        })

        # Trigger SR when enough frames collected
        if len(self.plate_buffer[track.id]) >= 5:
            self.trigger_super_resolution(track.id)

The quality scoring function estimates per-crop quality based on:

  • Sharpness (Laplacian variance)
  • Size (larger crops have more information)
  • Aspect ratio (plates have known aspect ratios; deviations indicate perspective distortion)
  • Brightness (too dark or too bright reduces readability)

We keep the top-K crops ranked by quality for super-resolution. K is typically 5-8.
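
A minimal sketch of what such a scoring function can look like (the exact weights and scale factors here are illustrative, not the production values):

import cv2
import numpy as np

PLATE_ASPECT = 500 / 120  # nominal Indian plate width:height ratio

def assess_quality(crop: np.ndarray) -> float:
    """Heuristic per-crop quality score; higher is better. Weights are illustrative."""
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape

    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()        # blur -> low variance
    size_score = min(w * h / (80.0 * 20.0), 1.0)             # saturate at ~80x20 px
    aspect_penalty = abs(w / max(h, 1) - PLATE_ASPECT) / PLATE_ASPECT
    brightness_penalty = abs(gray.mean() - 128) / 128        # too dark or too bright

    return (0.5 * min(sharpness / 100.0, 1.0)
            + 0.3 * size_score
            - 0.1 * aspect_penalty
            - 0.1 * brightness_penalty)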

Stage 2: Alignment

The plate crops are from different frames, which means different positions, scales, and perspectives. Before fusion, they must be aligned to a common reference frame.

We use a two-stage alignment process:

Coarse alignment via homography estimation. We detect the four corners of the plate (or approximate them from the bounding box) and compute a perspective transform to a canonical front-parallel view.

Fine alignment via optical flow. After coarse alignment, we compute dense optical flow between consecutive crops using a lightweight flow network (similar to SpyNet). This handles sub-pixel misalignments that homography misses.

import cv2
import numpy as np

class PlateAligner:
    def align(self, crops: list[dict]) -> list[np.ndarray]:
        # Select reference frame (highest quality)
        ref_idx = max(range(len(crops)),
                      key=lambda i: crops[i]["quality_score"])
        ref = crops[ref_idx]["crop"]

        aligned = []
        for crop_data in crops:
            crop = crop_data["crop"]

            # Coarse: homography to reference
            H = self.estimate_homography(crop, ref)
            warped = cv2.warpPerspective(crop, H, ref.shape[:2][::-1])

            # Fine: optical flow refinement
            flow = self.flow_net(warped, ref)
            refined = self.warp_with_flow(warped, flow)

            aligned.append(refined)

        return aligned
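
The estimate_homography helper is not shown above. For the coarse step, a corner-based variant is enough to convey the idea: map four detected (or bbox-approximated) plate corners onto a front-parallel rectangle with a single cv2.getPerspectiveTransform call. The crop-to-reference version used in PlateAligner builds on the same primitive.

import cv2
import numpy as np

def corners_to_canonical(corners: np.ndarray, dst_w: int, dst_h: int) -> np.ndarray:
    """Homography mapping four plate corners (TL, TR, BR, BL) onto a
    front-parallel dst_w x dst_h rectangle."""
    dst = np.float32([[0, 0],
                      [dst_w - 1, 0],
                      [dst_w - 1, dst_h - 1],
                      [0, dst_h - 1]])
    return cv2.getPerspectiveTransform(np.float32(corners), dst)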

Alignment quality directly impacts super-resolution quality. Misaligned inputs produce ghosting artifacts — double edges that make text unreadable. We discard crops where alignment confidence falls below a threshold.
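
One simple way to measure alignment confidence is zero-mean normalized cross-correlation between each warped crop and the reference. This is a sketch of that check, not necessarily the exact production metric, and the threshold in the comment is hypothetical:

import cv2
import numpy as np

def alignment_confidence(warped: np.ndarray, ref: np.ndarray) -> float:
    """Zero-mean normalized cross-correlation between a warped crop and the reference."""
    w = cv2.cvtColor(warped, cv2.COLOR_BGR2GRAY).astype(np.float32)
    r = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY).astype(np.float32)
    w = (w - w.mean()) / (w.std() + 1e-6)
    r = (r - r.mean()) / (r.std() + 1e-6)
    return float((w * r).mean())  # ~1.0 when well aligned, near 0 when not

# e.g., inside PlateAligner.align, before returning (0.6 is a hypothetical threshold):
# aligned = [a for a in aligned if alignment_confidence(a, ref) >= 0.6]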

Stage 3: Multi-Frame Super-Resolution

This is the core of DrikNetra. We take 5-8 aligned, low-resolution plate crops and produce a single high-resolution plate image.

We evaluated three architectures:

BasicVSR++ (Chan et al., 2022): A recurrent architecture with bidirectional propagation and second-order grid propagation. It processes frames sequentially, propagating features forward and backward through the sequence. Strong temporal modeling. Our primary architecture.

RVRT (Liang et al., 2022): Recurrent Video Restoration Transformer. Attention-based architecture that captures long-range temporal dependencies. Higher accuracy than BasicVSR++ but 3x slower inference. We use this for offline/batch processing.

Real-ESRGAN (Wang et al., 2021): Single-image super-resolution with a powerful degradation model. We use this as a fallback when only 1-2 frames are available (vehicle moved too fast, camera frame rate too low).

class DrikNetraSR:
    def __init__(self):
        self.basicvsr = load_model("basicvsr_pp_plate_v2.engine")  # TensorRT
        self.rvrt = load_model("rvrt_plate_v1.onnx")              # ONNX (offline)
        self.realesrgan = load_model("realesrgan_plate_v2.engine") # TensorRT

    def super_resolve(self, aligned_crops: list[np.ndarray],
                      mode: str = "realtime") -> np.ndarray:
        n_frames = len(aligned_crops)

        if n_frames >= 5 and mode == "realtime":
            # Stack into tensor: [1, T, C, H, W]
            tensor = self.prepare_input(aligned_crops)
            sr_output = self.basicvsr(tensor)  # 4x upscale
            return sr_output

        elif n_frames >= 3 and mode == "offline":
            tensor = self.prepare_input(aligned_crops)
            sr_output = self.rvrt(tensor)
            return sr_output

        else:
            # Fallback: single-image SR on best crop
            best = max(aligned_crops, key=self.quality_score)
            sr_output = self.realesrgan(best)
            return sr_output
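
The prepare_input helper is not shown above; a plausible sketch, assuming a PyTorch-style [1, T, C, H, W] float tensor (the layout noted in the comment above), looks like this:

import cv2
import numpy as np
import torch

def prepare_input(crops: list[np.ndarray]) -> torch.Tensor:
    """Stack aligned BGR crops into a [1, T, C, H, W] float tensor."""
    ref_h, ref_w = crops[0].shape[:2]
    frames = []
    for crop in crops:
        resized = cv2.resize(crop, (ref_w, ref_h))           # unify size; cv2 takes (width, height)
        rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
        frames.append(rgb.astype(np.float32) / 255.0)         # normalize to [0, 1]

    stacked = np.stack(frames)                                # [T, H, W, C]
    tensor = torch.from_numpy(stacked).permute(0, 3, 1, 2)    # [T, C, H, W]
    return tensor.unsqueeze(0)                                # [1, T, C, H, W]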

The super-resolution models are trained specifically on license plate data. This is critical. General-purpose SR models optimize for perceptual quality — they produce images that look good to humans but may hallucinate texture details. Our models optimize for OCR accuracy — they produce images where character edges are sharp and unambiguous, even if the overall image looks slightly less “natural.”
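
One common way to bias an SR network toward OCR accuracy is to add a recognition-aware term to the pixel loss. The sketch below is illustrative, not our exact training loss; it assumes a frozen CTC recognizer and a simple Charbonnier pixel term:

import torch
import torch.nn.functional as F

def sr_training_loss(sr, hr, frozen_recognizer, targets, target_lengths):
    """Pixel fidelity plus a CTC term that rewards OCR-readable output."""
    # Charbonnier (smooth L1-like) pixel loss between SR output and ground truth
    pixel_loss = torch.sqrt((sr - hr) ** 2 + 1e-6).mean()

    # CTC loss from a frozen recognizer run on the SR output
    # (assumes the recognizer returns logits in [T, N, C] layout)
    log_probs = frozen_recognizer(sr).log_softmax(dim=-1)
    input_lengths = torch.full((sr.size(0),), log_probs.size(0), dtype=torch.long)
    ocr_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)

    return pixel_loss + 0.1 * ocr_loss  # weighting is illustrative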

Training data: We use our indian-plate-dataset — a curated collection of Indian license plates with:

  • 50,000 high-resolution plate images (ground truth)
  • Synthetic degradation applied to create low-resolution inputs (blur, noise, compression, perspective distortion)
  • Paired sequences: multiple degraded versions of the same plate simulating temporal observation

Degradation model:

import random

import cv2
import numpy as np

def synthesize_degradation(hr_plate: np.ndarray, n_frames: int) -> list[np.ndarray]:
    """Create realistic degraded inputs from a clean plate image."""
    degraded_frames = []

    for i in range(n_frames):
        frame = hr_plate.copy()

        # Random downscale (2x-8x)
        scale = random.uniform(2, 8)
        frame = cv2.resize(frame, None, fx=1/scale, fy=1/scale)

        # Motion blur (vehicle movement)
        kernel_size = random.randint(3, 15)
        angle = random.uniform(-15, 15)
        frame = apply_motion_blur(frame, kernel_size, angle)

        # Camera noise, clipped back to the valid 8-bit range
        noise_level = random.uniform(5, 30)
        noise = np.random.randn(*frame.shape) * noise_level
        frame = np.clip(frame.astype(np.float32) + noise, 0, 255).astype(np.uint8)

        # JPEG compression (CCTV artifact)
        quality = random.randint(40, 85)
        frame = jpeg_compress(frame, quality)

        # Perspective jitter (simulates frame-to-frame variation)
        jitter = random.uniform(0.5, 3.0)  # pixels
        frame = apply_perspective_jitter(frame, jitter)

        degraded_frames.append(frame)

    return degraded_frames

Stage 4: OCR

The super-resolved plate image is passed to an OCR model trained on Indian plate formats.

Indian plates have specific structure constraints that we exploit:

# Indian plate format regex
PLATE_PATTERN = r'^[A-Z]{2}\s?\d{1,2}\s?[A-Z]{1,3}\s?\d{1,4}$'

# State codes (first two characters)
VALID_STATE_CODES = {
    'AN', 'AP', 'AR', 'AS', 'BR', 'CG', 'CH', 'DD', 'DL', 'GA',
    'GJ', 'HP', 'HR', 'JH', 'JK', 'KA', 'KL', 'LA', 'LD', 'MH',
    'ML', 'MN', 'MP', 'MZ', 'NL', 'OD', 'PB', 'PY', 'RJ', 'SK',
    'TN', 'TR', 'TS', 'UK', 'UP', 'WB'
}

The OCR model outputs character probabilities. We apply format-aware beam search decoding — the decoder constrains the output to valid Indian plate formats, boosting accuracy by 3-5% compared to unconstrained decoding.

class PlateOCR:
    def __init__(self):
        self.recognizer = CRNNRecognizer()  # CNN + BiLSTM + CTC
        self.decoder = FormatAwareDecoder(
            valid_state_codes=VALID_STATE_CODES,
            plate_pattern=PLATE_PATTERN
        )

    def read(self, sr_plate: np.ndarray) -> PlateResult:
        # Get character probabilities
        logits = self.recognizer(sr_plate)  # [T, C] — T timesteps, C characters

        # Format-aware beam search
        candidates = self.decoder.beam_search(
            logits,
            beam_width=10,
            format_bonus=0.3  # bonus for valid format matches
        )

        return PlateResult(
            text=candidates[0].text,
            confidence=candidates[0].score,
            alternatives=candidates[1:3]
        )
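
The decoder internals are omitted above. The core of the format bonus can be expressed as a rescoring pass over beam candidates; this is a simplified illustration rather than the production decoder, reusing PLATE_PATTERN and VALID_STATE_CODES from earlier:

import re

def rescore_with_format(candidates, format_bonus: float = 0.3):
    """Boost beam candidates whose text matches a valid Indian plate format."""
    rescored = []
    for cand in candidates:
        score = cand.score
        if re.match(PLATE_PATTERN, cand.text) and cand.text[:2] in VALID_STATE_CODES:
            score += format_bonus  # valid state code and overall format
        rescored.append((score, cand))
    rescored.sort(key=lambda item: item[0], reverse=True)
    return [cand for _, cand in rescored]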

Results

We evaluated DrikNetra against single-frame baselines and commercial ANPR systems on our drik-bench-anpr test set.

Full Plate Accuracy by Resolution Bracket

Method                         <40px    40-80px   80-120px   >120px   Overall
Single-frame OCR (CRNN)         4.2%     22.8%     61.3%      84.1%    38.7%
Single-frame + Real-ESRGAN      8.1%     31.4%     68.7%      86.2%    44.3%
DrikNetra (3 frames)           15.3%     48.2%     78.4%      89.7%    55.6%
DrikNetra (5 frames)           22.7%     59.1%     83.2%      91.3%    63.2%
DrikNetra (8 frames)           28.4%     64.8%     85.1%      91.8%    67.1%
Commercial ANPR System A        6.3%     35.2%     72.1%      88.4%    46.8%
Commercial ANPR System B        3.8%     28.6%     65.9%      85.7%    41.2%

The pattern is clear: more frames means more accuracy, especially at low resolutions where single-frame methods fail completely.

At the <40px bracket — which represents 35% of plates in typical Indian CCTV footage — DrikNetra with 8 frames achieves 28.4% accuracy compared to 4.2% for single-frame OCR. That is a 6.7x improvement. Still not perfect, but the difference between “completely useless” and “useful for investigation leads.”

Accuracy by Condition

Condition                  Single-frame   DrikNetra (5 frames)
Day, clear                     45.2%            71.3%
Day, haze                      38.1%            64.8%
Night, street-lit              29.7%            55.2%
Night, headlights only         18.3%            42.1%
Rain                           31.4%            58.7%
Motion blur (>30 km/h)         22.6%            61.4%

The multi-frame approach is most valuable in degraded conditions. Night and motion blur see the largest relative improvements — exactly the conditions where single-frame methods fail most.

Before/After Examples

Consider a motorcycle traveling at 40 km/h past a 720p pole-mounted camera at night. The plate occupies 35x12 pixels in each frame. Motion blur smears each frame by 3-5 pixels horizontally. Individual characters are indistinguishable.

DrikNetra collects 6 frames of this plate as the motorcycle passes. Each frame captures slightly different pixel samples of the plate surface. The alignment stage compensates for the horizontal motion. The super-resolution network combines the 6 observations into a single 140x48 image where individual characters are clearly separated.

The OCR reads “MH 12 AB 3456” with 0.87 confidence. The single-frame OCR on the best individual frame outputs “MH 12 A_ 3_56” with 0.31 confidence — two characters missing, one uncertain.

Night and Rain: The Hard Cases

Night-time ANPR is fundamentally harder because:

  1. IR illumination. Many cameras switch to IR mode at night. Plates have IR-reflective surfaces, but the reflection can saturate (bloom) or create uneven illumination.
  2. Headlight glare. Oncoming vehicles create lens flare that washes out plate regions.
  3. Low SNR. Camera sensors in low light produce significant noise, which compression amplifies.

DrikNetra handles night conditions through two mechanisms:

Temporal noise averaging. Random noise is different in each frame. Averaging across aligned frames reduces noise by a factor of sqrt(N). With 8 frames, noise is reduced by nearly 3x before the SR network even processes the input.
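
The sqrt(N) claim is easy to verify numerically. This toy example uses synthetic Gaussian noise on perfectly aligned frames, which is an idealization of real footage:

import numpy as np

rng = np.random.default_rng(0)
clean = np.zeros((48, 140))
sigma = 25.0

# Simulate 8 perfectly aligned frames with independent Gaussian noise
frames = [clean + rng.normal(0, sigma, clean.shape) for _ in range(8)]

print(np.std(frames[0]))           # ~25   (single-frame noise)
print(np.std(np.mean(frames, 0)))  # ~8.8  (~25 / sqrt(8), nearly 3x lower)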

Night-specific degradation training. Our training pipeline includes night-specific degradations: IR illumination patterns, headlight bloom, high ISO noise profiles. The SR network learns to invert these degradations specifically.

Rain adds another layer: water droplets on the lens or on the plate surface create localized distortions. These distortions are frame-specific — a droplet may obscure one character in frame 3 but not in frame 7. Multi-frame fusion naturally routes around these per-frame corruptions, selecting clean information from whichever frame has it.

Deployment Considerations

DrikNetra runs as part of the larger edge pipeline. Latency breakdown:

Stage                            Latency
Plate detection                  2ms
Crop + buffer                    <1ms
Alignment (5 frames)             8ms
Super-resolution (BasicVSR++)    18ms
OCR                              7ms
Total                            ~35ms

The plate detection and crop stages run per-frame. Alignment, SR, and OCR run only when the buffer is full (every 5-8 frames for a given vehicle). This means the ANPR system adds ~35ms of latency per vehicle, not per frame.

Memory footprint: BasicVSR++ TensorRT engine occupies 180MB GPU memory. On a multi-camera deployment, ANPR is the second-largest memory consumer after the detector.

What is Next

Three areas of active development:

Temporal attention. Current BasicVSR++ treats all frames equally in the recurrent propagation. We are experimenting with attention mechanisms that learn which frames are most informative — giving less weight to severely blurred or occluded frames.

Bilingual OCR. Many Indian plates include text in regional scripts (Hindi, Tamil, Kannada, etc.) alongside English. Our current OCR reads only the English characters. A bilingual model would improve accuracy and enable reading older hand-painted plates.

Plate condition classification. Before running expensive SR + OCR, classify the plate condition: readable (skip SR), degraded (run SR), unreadable (skip entirely). This adaptive approach could reduce compute by 40% while maintaining accuracy.
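
A rough sketch of how that gating could look, assuming a hypothetical classify_condition model and the PlateAligner, DrikNetraSR, and PlateOCR instances from earlier (named aligner, sr, and ocr here):

def process_plate(crops: list[dict]) -> PlateResult | None:
    """Adaptive routing: run SR + OCR only where it can pay off."""
    best = max(crops, key=lambda c: c["quality_score"])
    condition = classify_condition(best["crop"])  # hypothetical lightweight classifier

    if condition == "readable":
        return ocr.read(best["crop"])             # already legible: skip SR
    elif condition == "degraded":
        aligned = aligner.align(crops)            # full DrikNetra path
        return ocr.read(sr.super_resolve(aligned))
    else:                                         # "unreadable": SR will not help
        return None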

Check out DrikNetra on our Open Source page. The architecture, training pipeline, and pre-trained models are available for research use.