Edge-First: Our Real-Time Traffic Intelligence Architecture
How we process 200+ camera feeds at 30 FPS with sub-100ms latency — from RTSP ingestion to scene reasoning — without touching the cloud.
Why Edge-First
The default architecture for video AI is simple: stream video to the cloud, process it there, send results back. It works for demo videos. It does not work for production traffic systems.
Here is why:
Latency. A round-trip to the cloud adds 100-500ms. For real-time violation detection or incident alerting, that is unacceptable. By the time the cloud tells you a vehicle ran a red light, the vehicle is gone.
Bandwidth. A single 1080p camera at 25 FPS generates approximately 4 Mbps of H.264 video. Scale to 200 cameras and you need 800 Mbps of sustained upstream bandwidth. Indian ISP infrastructure does not support this at most deployment sites. Even where it does, the cost is prohibitive.
Privacy. Streaming raw video of public roads to cloud servers creates legal and regulatory exposure. Edge processing means raw video never leaves the premises. Only metadata and events are transmitted.
Cost. Cloud GPU inference at scale is expensive. At 200 cameras, cumulative cloud compute spend exceeds the one-time cost of edge hardware within 3-4 months. Edge wins the economic argument decisively at scale.
Reliability. Internet connections fail. Edge devices keep processing. When connectivity returns, buffered events sync upstream. Zero data loss.
Our architecture processes everything at the edge. The cloud receives only structured events, aggregated statistics, and occasional evidence frames. Raw video stays local.
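To make "structured events" concrete, here is the shape of a typical payload that leaves the site. This is an illustrative sketch, not our exact wire schema; the field names are for explanation only.
from dataclasses import dataclass

@dataclass
class EdgeEvent:
    event_id: str            # idempotent ID derived from camera, timestamp, and type
    camera_id: str
    event_type: str          # e.g. "red_light_violation", "congestion_alert"
    timestamp_utc: float
    track_id: int
    vehicle_class: str       # one of the 50+ detector classes
    plate_text: str | None   # ANPR result, when a plate was read
    confidence: float
    evidence_frame: bytes | None = None  # JPEG crop, attached only when required
A site transmits a few kilobytes of events per minute instead of roughly 4 Mbps of raw video per camera.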
The Pipeline
The full pipeline from photon to event has six stages. Each stage has a latency budget. The total must stay under 100ms.
RTSP/ONVIF Stream
        │
        ▼
┌────────────────────┐
│   Frame Decoder    │  ~5ms
│ (FFmpeg/GStreamer) │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│    Preprocessor    │  ~2ms
│   (Resize, Norm)   │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│      Detector      │  ~25ms
│     (TensorRT)     │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│      Tracker       │  ~8ms
│    (ByteTrack)     │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│     Recognizer     │  ~35ms
│  (DrikNetra ANPR)  │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│      Reasoner      │  ~20ms
│   (Scene Engine)   │
└─────────┬──────────┘
          │
          ▼
     Event Output
Total: ~95ms per frame. Let me walk through each stage.
Stage 1: Stream Ingestion (5ms)
Traffic cameras speak two protocols: RTSP (Real Time Streaming Protocol) for the video stream itself and ONVIF (Open Network Video Interface Forum) for device discovery and control. Our ingestion layer handles both.
class StreamManager:
    def __init__(self, camera_configs: list[CameraConfig]):
        self.streams = {}
        for config in camera_configs:
            self.streams[config.id] = RTSPStream(
                url=config.rtsp_url,
                transport="tcp",       # TCP for reliability, UDP for lower latency
                buffer_size=2,         # frames — keep it small
                reconnect_interval=5,  # seconds
                hw_decode=True         # NVDEC hardware decoding
            )

    async def get_frame(self, camera_id: str) -> Frame:
        raw = await self.streams[camera_id].read()
        return Frame(
            data=raw.data,
            timestamp=raw.pts,
            camera_id=camera_id,
            resolution=raw.resolution
        )
Key design decisions:
- Hardware decoding. We use NVIDIA’s NVDEC for H.264/H.265 decoding. This offloads decode from the CPU entirely. On a Jetson Orin, NVDEC can decode 16 streams simultaneously.
- Minimal buffering. We buffer 2 frames maximum. Any more and we introduce latency. If processing falls behind, we drop frames rather than queue them. Stale data is worse than no data.
- TCP transport. RTSP over UDP has lower latency but drops packets on congested networks. Most Indian CCTV installations run on local networks with packet loss. TCP eliminates decode artifacts from dropped packets.
- Watchdog reconnection. Cameras go offline. Power cuts, network hiccups, firmware crashes. The stream manager detects disconnection within 2 seconds and reconnects automatically.
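A simplified sketch of the per-camera read loop ties the last two points together: drop frames when processing falls behind, and reconnect when the stream goes silent. The pipeline.busy() and pipeline.submit() helpers and the exact timeouts are illustrative rather than our production API.
import asyncio

STALE_AFTER_S = 2.0         # declare the stream dead after 2s without a frame
RECONNECT_INTERVAL_S = 5.0  # matches reconnect_interval above

async def camera_loop(stream, pipeline):
    while True:
        try:
            raw = await asyncio.wait_for(stream.read(), timeout=STALE_AFTER_S)
        except asyncio.TimeoutError:
            # Watchdog path: no frame within the stale window, so assume the
            # camera dropped and re-establish the RTSP session.
            await stream.reconnect()
            await asyncio.sleep(RECONNECT_INTERVAL_S)
            continue

        if pipeline.busy():
            # Drop rather than queue: stale detections are worse than none.
            continue
        await pipeline.submit(raw)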
Stage 2: Preprocessing (2ms)
Raw frames need three transformations before inference:
- Resize to model input resolution (640x640 for detection, preserving aspect ratio with letterboxing)
- Color space conversion from BGR to RGB
- Normalization to [0, 1] float32 range
We do all three on GPU using CUDA kernels. The frame never touches CPU memory after decode.
// Custom CUDA preprocessing kernel
__global__ void preprocess_kernel(
    const uint8_t* input,   // NV12 from decoder
    float* output,          // RGB float32, planar CHW (matches the NCHW model input)
    int src_w, int src_h,
    int dst_w, int dst_h,
    float scale, int pad_x, int pad_y
) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dst_w || y >= dst_h) return;

    // Letterbox mapping
    int src_x = (x - pad_x) / scale;
    int src_y = (y - pad_y) / scale;
    if (src_x < 0 || src_x >= src_w || src_y < 0 || src_y >= src_h) {
        // Outside the source image: gray padding in all three planes
        for (int c = 0; c < 3; ++c)
            output[c * dst_w * dst_h + y * dst_w + x] = 0.5f;
        return;
    }

    // NV12 to RGB + normalize
    // ...
}
This kernel runs in under 1ms for 1080p to 640x640 conversion on a Jetson Orin.
Stage 3: Detection (25ms)
Detection is the most compute-intensive stage. We run a custom-trained YOLO model optimized with TensorRT.
Model architecture. We use a YOLOv8-based architecture with modifications for our 50+ class taxonomy. The backbone is CSPDarknet with a P3-P5 feature pyramid. We added a P2 head for small object detection — critical for distant motorcycles and pedestrians.
TensorRT optimization. The PyTorch model is exported to ONNX, then compiled to a TensorRT engine with:
- FP16 precision (negligible accuracy loss, 2x throughput)
- Dynamic batching (batch multiple cameras when GPU utilization is low)
- Layer fusion (convolution + batch norm + activation fused into single kernels)
- INT8 calibration for Jetson deployments (4x throughput, ~1% mAP loss)
# TensorRT engine build
trtexec \
  --onnx=drik_detect_v3.onnx \
  --saveEngine=drik_detect_v3.engine \
  --fp16 \
  --workspace=4096 \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:4x3x640x640 \
  --maxShapes=images:8x3x640x640 \
  --verbose
On an NVIDIA A2 (our standard discrete GPU deployment), the detector runs at 25ms per frame for a single stream, or 8ms per frame when batching 4 streams. On a Jetson Orin NX, it is 35ms per frame in FP16 or 18ms in INT8.
Post-processing. NMS (Non-Maximum Suppression) runs on GPU using a custom CUDA kernel. We use class-aware NMS with an IoU threshold of 0.45 and a confidence threshold of 0.25. Low confidence threshold is deliberate — we prefer false positives that the tracker can filter over false negatives that create track gaps.
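The production NMS is a CUDA kernel, but its logic is equivalent to the off-the-shelf class-aware version below, shown here as a torchvision sketch with the same thresholds.
import torch
from torchvision.ops import batched_nms

CONF_THRESH = 0.25  # deliberately low: let the tracker filter false positives
IOU_THRESH = 0.45

def postprocess(boxes: torch.Tensor, scores: torch.Tensor, classes: torch.Tensor):
    """boxes: (N, 4) xyxy, scores: (N,), classes: (N,) raw detector outputs."""
    keep = scores > CONF_THRESH
    boxes, scores, classes = boxes[keep], scores[keep], classes[keep]
    # batched_nms suppresses overlaps only within the same class ("class-aware"),
    # so a motorcycle and the car behind it never suppress each other.
    kept = batched_nms(boxes, scores, classes, IOU_THRESH)
    return boxes[kept], scores[kept], classes[kept]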
Stage 4: Tracking (8ms)
Detection gives you objects per frame. Tracking gives you objects over time. Without tracking, you cannot count vehicles, measure speed, detect violations, or reason about behavior.
We use ByteTrack with modifications for Indian traffic:
class DrikTracker(ByteTrack):
    def __init__(self):
        super().__init__(
            track_thresh=0.3,
            track_buffer=60,   # frames to keep lost tracks (2s at 30fps)
            match_thresh=0.8,
            frame_rate=30
        )
        self.reid_model = ReIDNet()  # appearance-based re-identification

    def update(self, detections: list[Detection]) -> list[Track]:
        # Standard ByteTrack association
        tracks = super().update(detections)

        # Re-ID for tracks lost > 10 frames
        for lost_track in self.lost_tracks:
            if lost_track.frames_lost > 10:
                match = self.reid_model.match(
                    lost_track.appearance,
                    [d.crop for d in detections]
                )
                if match and match.score > 0.7:
                    self.reactivate(lost_track, match.detection)

        return tracks
Standard ByteTrack fails in Indian traffic because:
- Extreme occlusion duration. A motorcycle disappears behind a bus for 3+ seconds (90 frames). ByteTrack’s default buffer of 30 frames loses the track. We extend to 60 frames and add appearance-based re-identification.
- Non-linear motion. Kalman filters assume linear motion. An auto-rickshaw making a sudden U-turn violates this assumption completely. We use an adaptive motion model that detects non-linear behavior and switches to a wider search radius.
- Dense small objects. Twenty motorcycles in a cluster are nearly indistinguishable by IoU matching alone. Re-ID features (color, shape, rider appearance) are essential.
Our modified tracker maintains identity through 90%+ of occlusion events, compared to 65% for vanilla ByteTrack on our test set.
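The adaptive motion model is conceptually simple: watch how badly the constant-velocity prediction has been missing recently, and widen the association gate when it misses consistently. A minimal sketch follows; the base radius, window, and cap are illustrative, not our tuned values.
from collections import deque
import numpy as np

class AdaptiveGate:
    """Widen the matching radius for tracks whose motion stops being linear."""
    def __init__(self, base_radius_px: float = 40.0, window: int = 10):
        self.base_radius = base_radius_px
        self.errors = deque(maxlen=window)  # recent prediction errors in pixels

    def update(self, predicted_xy: np.ndarray, observed_xy: np.ndarray) -> None:
        self.errors.append(float(np.linalg.norm(predicted_xy - observed_xy)))

    def radius(self) -> float:
        if not self.errors:
            return self.base_radius
        # A U-turning auto-rickshaw produces a run of large errors; scale the
        # gate up with the recent mean error, capped at 4x the base radius.
        mean_err = sum(self.errors) / len(self.errors)
        return min(self.base_radius * 4.0, self.base_radius + 2.0 * mean_err)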
Stage 5: Recognition — DrikNetra ANPR (35ms)
For vehicles with license plates, we run DrikNetra — our multi-frame plate recognition system. This deserves its own blog post (coming soon), but the key insight is: do not try to read a plate from a single blurry frame. Accumulate multiple observations and fuse them.
The recognizer crops the plate region from each tracked frame, buffers 5-10 crops, and runs multi-frame super-resolution before OCR. The super-resolution model (based on BasicVSR++) turns 5 blurry 32x16 crops into one sharp 128x64 image.
This stage only runs for vehicles that cross a defined detection zone (e.g., a stop line), not for every vehicle in every frame. This keeps the computational budget manageable.
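In outline, the per-track flow looks like the sketch below. The super_resolve and ocr callables stand in for the BasicVSR++-based model and the OCR head; both names are placeholders, and the full system is the subject of the upcoming DrikNetra post.
from collections import defaultdict

MIN_CROPS = 5
MAX_CROPS = 10

class PlateRecognizer:
    def __init__(self, super_resolve, ocr):
        self.super_resolve = super_resolve     # 5-10 low-res crops -> one sharp 128x64 image
        self.ocr = ocr                         # sharp image -> plate string + confidence
        self.crop_buffers = defaultdict(list)  # track_id -> buffered plate crops

    def observe(self, track_id: int, plate_crop) -> None:
        buf = self.crop_buffers[track_id]
        if len(buf) < MAX_CROPS:
            buf.append(plate_crop)

    def read_on_zone_crossing(self, track_id: int):
        """Called only when the track crosses the detection zone (e.g. the stop line)."""
        crops = self.crop_buffers.pop(track_id, [])
        if len(crops) < MIN_CROPS:
            return None  # not enough observations for a reliable read
        fused = self.super_resolve(crops)
        return self.ocr(fused)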
Stage 6: Scene Reasoning (20ms)
This is where detection becomes intelligence. The reasoning engine takes tracked objects with their trajectories, speeds, and classes, and generates semantic events:
class SceneReasoner:
    def __init__(self, scene_config: SceneConfig):
        self.zones = scene_config.zones              # defined regions of interest
        self.rules = scene_config.rules              # violation rules
        self.speed_limit = scene_config.speed_limit  # site speed limit in km/h
        self.state = SceneState()                    # persistent scene state

    def reason(self, tracks: list[Track]) -> list[Event]:
        events = []
        for track in tracks:
            # Zone-based reasoning
            for zone in self.zones:
                if zone.contains(track.position):
                    if zone.type == "red_light" and zone.signal_state == "red":
                        if track.speed > 5:  # km/h — moving through red
                            events.append(RedLightViolation(track))
                    if zone.type == "no_entry" and track.direction_matches(zone.forbidden_dir):
                        events.append(WrongWayViolation(track))

            # Behavior-based reasoning
            if track.speed > self.speed_limit:
                events.append(SpeedViolation(track, track.speed))
            if track.stopped_duration > 30:  # seconds
                events.append(IllegalParking(track))

            # Predictive reasoning (Level 4-5)
            collision_risk = self.predict_collision(track, tracks)
            if collision_risk > 0.8:
                events.append(CollisionWarning(track, collision_risk))

        return events
The reasoner operates on trajectories, not frames. It maintains state across time — tracking how long a vehicle has been stopped, whether a signal has changed, how traffic flow patterns have shifted. This is reasoning, not detection. A detector can tell you a car is in an intersection. A reasoner can tell you the car ran a red light.
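As one example, predict_collision can be approximated by extrapolating each pair of trajectories forward and checking how close they get. The sketch below assumes tracks expose ground-plane position and velocity vectors in metres; the constant-velocity model, horizon, and distance scale are deliberate simplifications of the production predictor.
import numpy as np

HORIZON_S = 2.0    # look-ahead window
STEP_S = 0.1
NEAR_MISS_M = 1.5  # separation that maps to a risk near 1.0

def predict_collision(track, others) -> float:
    """Return a 0-1 collision risk from constant-velocity extrapolation."""
    risk = 0.0
    for other in others:
        if other is track:
            continue
        for t in np.arange(STEP_S, HORIZON_S, STEP_S):
            p1 = track.position + track.velocity * t  # metres, ground plane
            p2 = other.position + other.velocity * t
            gap = float(np.linalg.norm(p1 - p2))
            # Risk rises as the predicted separation shrinks toward a near-miss.
            risk = max(risk, min(1.0, NEAR_MISS_M / max(gap, 1e-3)))
    return risk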
Hardware Configurations
We deploy on three hardware tiers:
| Tier | Hardware | Cameras | Use Case |
|---|---|---|---|
| Edge Lite | Jetson Orin NX 16GB | 4-8 | Single intersection |
| Edge Pro | NVIDIA A2 + Xeon | 16-32 | Corridor / small city |
| Edge Max | 2x NVIDIA A2 + Xeon | 32-64 | District-level |
For deployments exceeding 64 cameras, we use multiple Edge Pro/Max units with a local orchestration layer. Each unit operates independently — no single point of failure.
Power consumption matters for Indian deployments. Many camera installations have unreliable power. The Jetson Orin NX draws 15-25W. With a small UPS, it survives 2+ hours of power cuts.
The Five Reasoning Levels in Architecture
Our five-level reasoning hierarchy maps directly to system components:
- Detect → TensorRT detector (drik-detect). Object presence and class.
- Recognize → DrikNetra ANPR + Re-ID. Object identity.
- Describe → Scene reasoner, state machine. What is happening.
- Reason → Rule engine + trajectory analysis. Why it is happening, whether it violates rules.
- Predict → Collision prediction, traffic flow forecasting. What will happen next.
Levels 1-3 are in production. Level 4 is deployed for specific violation types. Level 5 is in active research.
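One convenient way to wire this hierarchy into the system is to tag each event type with its level, so a deployment can be capped at, say, Level 3 until the Level 4 rules are certified for that site. The sketch below is illustrative, not our exact implementation.
from enum import IntEnum

class ReasoningLevel(IntEnum):
    DETECT = 1
    RECOGNIZE = 2
    DESCRIBE = 3
    REASON = 4
    PREDICT = 5

# Each event type declares the level it belongs to.
EVENT_LEVELS = {
    "vehicle_detected": ReasoningLevel.DETECT,
    "plate_read": ReasoningLevel.RECOGNIZE,
    "congestion_state": ReasoningLevel.DESCRIBE,
    "red_light_violation": ReasoningLevel.REASON,
    "collision_warning": ReasoningLevel.PREDICT,
}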
Scaling Lessons
Deploying edge AI on Indian roads taught us things that no architecture diagram can capture:
Thermal management is critical. A metal enclosure in direct Indian sun reaches 70°C internally. Our edge boxes have passive cooling rated for 50°C ambient. Above that, we automatically drop inference precision from FP16 to INT8 to reduce GPU power and heat.
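A stripped-down version of that check, assuming the standard Linux sysfs thermal interface on Jetson; the thresholds and hysteresis here are illustrative, not our tuned values.
THERMAL_ZONE = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees Celsius

THROTTLE_AT_C = 70.0  # switch to the INT8 engine (illustrative threshold)
RECOVER_AT_C = 62.0   # hysteresis before switching back to FP16

def read_soc_temp_c() -> float:
    with open(THERMAL_ZONE) as f:
        return int(f.read().strip()) / 1000.0

def select_engine(current: str) -> str:
    """Return which pre-built TensorRT engine to run, with hysteresis."""
    temp = read_soc_temp_c()
    if current == "fp16" and temp >= THROTTLE_AT_C:
        return "int8"
    if current == "int8" and temp <= RECOVER_AT_C:
        return "fp16"
    return current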
Network is unreliable. We design for offline-first. The edge device stores 72 hours of events locally. When connectivity returns, events sync in chronological order. Duplicate detection is handled server-side with idempotent event IDs.
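A sketch of the offline-first queue, assuming a local SQLite store. The schema and the deterministic ID scheme (camera, timestamp, and event type hashed into a stable ID) illustrate the approach rather than our exact implementation.
import hashlib
import json
import sqlite3
import time

RETENTION_S = 72 * 3600  # keep 72 hours of events locally

class EventQueue:
    def __init__(self, path: str = "events.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " event_id TEXT PRIMARY KEY, ts REAL, payload TEXT, synced INTEGER DEFAULT 0)"
        )

    def put(self, camera_id: str, event_type: str, ts: float, payload: dict) -> None:
        # Deterministic ID: replaying the same event after a reconnect produces
        # the same ID, so server-side duplicate detection stays trivial.
        event_id = hashlib.sha256(f"{camera_id}|{event_type}|{ts:.3f}".encode()).hexdigest()
        self.db.execute(
            "INSERT OR IGNORE INTO events (event_id, ts, payload) VALUES (?, ?, ?)",
            (event_id, ts, json.dumps(payload)),
        )
        self.db.execute("DELETE FROM events WHERE ts < ?", (time.time() - RETENTION_S,))
        self.db.commit()

    def pending(self):
        # Sync in chronological order once connectivity returns.
        return self.db.execute(
            "SELECT event_id, payload FROM events WHERE synced = 0 ORDER BY ts"
        ).fetchall()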
Power cycles are frequent. The system boots to full operation in under 45 seconds. TensorRT engines are pre-compiled and cached. Stream reconnection is automatic. No manual intervention required after power restoration.
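Engine caching is what keeps boot time down: deserializing a pre-built engine typically takes seconds, while rebuilding from ONNX takes minutes. The sketch below uses the TensorRT Python API, with a trtexec fallback mirroring the build command shown earlier; paths and flags are illustrative.
import os
import subprocess
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(engine_path: str, onnx_path: str) -> trt.ICudaEngine:
    """Deserialize a cached TensorRT engine; build it once if the cache is cold."""
    if not os.path.exists(engine_path):
        # Cold cache (first boot or model update): build via trtexec,
        # using the full flag set from the build command above.
        subprocess.run(
            ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}", "--fp16"],
            check=True,
        )
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())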
Camera quality varies wildly. The same intersection might have a 2023 Hikvision 4MP camera and a 2016 no-brand 720p camera. Our detector is trained on both quality levels. The ANPR system adjusts confidence thresholds based on resolution.
What is Next
We are working on three architectural improvements:
- Multi-camera fusion. Using overlapping camera views to resolve occlusions and improve tracking continuity across cameras.
- Adaptive inference. Dynamically switching model complexity based on scene difficulty. Simple scenes get a lightweight model. Complex scenes get the full model. This improves throughput by 40% on average.
- On-device learning. Fine-tuning the detector on-device using accumulated edge cases. The model improves over time without redeployment.
Explore the technical details on our Technology page. If you are building edge AI systems for any domain — traffic, security, retail, industrial — the architectural patterns are transferable. The hard constraints of Indian traffic forced us to build something robust enough for anywhere.
If it works here, it works everywhere.