Edge-First: Our Real-Time Traffic Intelligence Architecture
How we process 200+ camera feeds at 30 FPS with sub-100ms latency — from RTSP ingestion to scene reasoning — without touching the cloud.
Why Edge-First
The default architecture for video AI is simple: stream video to the cloud, process it there, send results back. It works for demo videos. It does not work for production traffic systems.
Here is why:
Latency. A round-trip to the cloud adds 100-500ms. For real-time violation detection or incident alerting, that is unacceptable. By the time the cloud tells you a vehicle ran a red light, the vehicle is gone.
Bandwidth. A single 1080p camera at 25 FPS generates approximately 4 Mbps of H.264 video. Scale to 200 cameras and you need 800 Mbps of sustained upstream bandwidth. Indian ISP infrastructure does not support this at most deployment sites. Even where it does, the cost is prohibitive.
Privacy. Streaming raw video of public roads to cloud servers creates legal and regulatory exposure. Edge processing means raw video never leaves the premises. Only metadata and events are transmitted.
Cost. Cloud GPU inference at scale is expensive. At 200 cameras, cumulative cloud compute spend exceeds the one-time cost of edge hardware within 3-4 months. Edge wins the economic argument decisively at scale.
Reliability. Internet connections fail. Edge devices keep processing. When connectivity returns, buffered events sync upstream. Zero data loss.
Our architecture processes everything at the edge. The cloud receives only structured events, aggregated statistics, and occasional evidence frames. Raw video stays local.
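To make "structured events" concrete, here is the shape of a typical payload that leaves the site. This is an illustrative sketch, not our exact wire schema; the field names are for explanation only.
from dataclasses import dataclass

@dataclass
class EdgeEvent:
    event_id: str            # idempotent ID derived from camera, timestamp, and type
    camera_id: str
    event_type: str          # e.g. "red_light_violation", "congestion_alert"
    timestamp_utc: float
    track_id: int
    vehicle_class: str       # one of the 50+ detector classes
    plate_text: str | None   # ANPR result, when a plate was read
    confidence: float
    evidence_frame: bytes | None = None  # JPEG crop, attached only when required
A site transmits a few kilobytes of events per minute instead of roughly 4 Mbps of raw video per camera.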
The Pipeline
The full pipeline from photon to event has six stages. Each stage has a latency budget. The total must stay under 100ms.
RTSP/ONVIF Stream
        │
        ▼
┌────────────────────┐
│   Frame Decoder    │  ~5ms
│ (FFmpeg/GStreamer) │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│    Preprocessor    │  ~2ms
│   (Resize, Norm)   │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│      Detector      │  ~25ms
│     (TensorRT)     │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│      Tracker       │  ~8ms
│    (ByteTrack)     │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│     Recognizer     │  ~35ms
│  (DrikNetra ANPR)  │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│      Reasoner      │  ~20ms
│   (Scene Engine)   │
└─────────┬──────────┘
          │
          ▼
     Event Output
Total: ~95ms per frame. Let me walk through each stage.
Stage 1: Stream Ingestion (5ms)
Traffic cameras speak two protocols: RTSP (Real Time Streaming Protocol) for the video stream itself and ONVIF (Open Network Video Interface Forum) for device discovery and control. Our ingestion layer handles both.
class StreamManager:
    def __init__(self, camera_configs: list[CameraConfig]):
        self.streams = {}
        for config in camera_configs:
            self.streams[config.id] = RTSPStream(
                url=config.rtsp_url,
                transport="tcp",       # TCP for reliability, UDP for lower latency
                buffer_size=2,         # frames — keep it small
                reconnect_interval=5,  # seconds
                hw_decode=True         # NVDEC hardware decoding
            )

    async def get_frame(self, camera_id: str) -> Frame:
        raw = await self.streams[camera_id].read()
        return Frame(
            data=raw.data,
            timestamp=raw.pts,
            camera_id=camera_id,
            resolution=raw.resolution
        )
Key design decisions:
- Hardware decoding. We use NVIDIA’s NVDEC for H.264/H.265 decoding. This offloads decode from the CPU entirely. On a Jetson Orin, NVDEC can decode 16 streams simultaneously.
- Minimal buffering. We buffer 2 frames maximum. Any more and we introduce latency. If processing falls behind, we drop frames rather than queue them. Stale data is worse than no data.
- TCP transport. RTSP over UDP has lower latency but drops packets on congested networks. Most Indian CCTV installations run on local networks with packet loss. TCP eliminates decode artifacts from dropped packets.
- Watchdog reconnection. Cameras go offline. Power cuts, network hiccups, firmware crashes. The stream manager detects disconnection within 2 seconds and reconnects automatically.
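A simplified sketch of the per-camera read loop ties the last two points together: drop frames when processing falls behind, and reconnect when the stream goes silent. The pipeline.busy() and pipeline.submit() helpers and the exact timeouts are illustrative rather than our production API.
import asyncio

STALE_AFTER_S = 2.0         # declare the stream dead after 2s without a frame
RECONNECT_INTERVAL_S = 5.0  # matches reconnect_interval above

async def camera_loop(stream, pipeline):
    while True:
        try:
            raw = await asyncio.wait_for(stream.read(), timeout=STALE_AFTER_S)
        except asyncio.TimeoutError:
            # Watchdog path: no frame within the stale window, so assume the
            # camera dropped and re-establish the RTSP session.
            await stream.reconnect()
            await asyncio.sleep(RECONNECT_INTERVAL_S)
            continue

        if pipeline.busy():
            # Drop rather than queue: stale detections are worse than none.
            continue
        await pipeline.submit(raw)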
Stage 2: Preprocessing (2ms)
Raw frames need three transformations before inference:
- Resize to model input resolution (640x640 for detection, preserving aspect ratio with letterboxing)
- Color space conversion from BGR to RGB
- Normalization to [0, 1] float32 range
We do all three on GPU using CUDA kernels. The frame never touches CPU memory after decode.
// Custom CUDA preprocessing kernel
__global__ void preprocess_kernel(
    const uint8_t* input,   // NV12 from decoder
    float* output,          // RGB float32, planar CHW (matches the NCHW model input)
    int src_w, int src_h,
    int dst_w, int dst_h,
    float scale, int pad_x, int pad_y
) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dst_w || y >= dst_h) return;

    // Letterbox mapping
    int src_x = (x - pad_x) / scale;
    int src_y = (y - pad_y) / scale;
    if (src_x < 0 || src_x >= src_w || src_y < 0 || src_y >= src_h) {
        // Outside the source image: gray padding in all three planes
        for (int c = 0; c < 3; ++c)
            output[c * dst_w * dst_h + y * dst_w + x] = 0.5f;
        return;
    }

    // NV12 to RGB + normalize
    // ...
}
This kernel runs in under 1ms for 1080p to 640x640 conversion on a Jetson Orin.
Stage 3: Detection (25ms)
Detection is the most compute-intensive stage. We run a custom-trained YOLO model optimized with TensorRT.
Model architecture. We use a YOLOv8-based architecture with modifications for our 50+ class taxonomy. The backbone is CSPDarknet with a P3-P5 feature pyramid. We added a P2 head for small object detection — critical for distant motorcycles and pedestrians.
TensorRT optimization. The PyTorch model is exported to ONNX, then compiled to a TensorRT engine with:
- FP16 precision (negligible accuracy loss, 2x throughput)
- Dynamic batching (batch multiple cameras when GPU utilization is low)
- Layer fusion (convolution + batch norm + activation fused into single kernels)
- INT8 calibration for Jetson deployments (4x throughput, ~1% mAP loss)
# TensorRT engine build
trtexec \
  --onnx=drik_detect_v3.onnx \
  --saveEngine=drik_detect_v3.engine \
  --fp16 \
  --workspace=4096 \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:4x3x640x640 \
  --maxShapes=images:8x3x640x640 \
  --verbose
On an NVIDIA A2 (our standard discrete GPU deployment), the detector runs at 25ms per frame for a single stream, or 8ms per frame when batching 4 streams. On a Jetson Orin NX, it is 35ms per frame in FP16 or 18ms in INT8.
Post-processing. NMS (Non-Maximum Suppression) runs on GPU using a custom CUDA kernel. We use class-aware NMS with an IoU threshold of 0.45 and a confidence threshold of 0.25. Low confidence threshold is deliberate — we prefer false positives that the tracker can filter over false negatives that create track gaps.
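The production NMS is a CUDA kernel, but its logic is equivalent to the off-the-shelf class-aware version below, shown here as a torchvision sketch with the same thresholds.
import torch
from torchvision.ops import batched_nms

CONF_THRESH = 0.25  # deliberately low: let the tracker filter false positives
IOU_THRESH = 0.45

def postprocess(boxes: torch.Tensor, scores: torch.Tensor, classes: torch.Tensor):
    """boxes: (N, 4) xyxy, scores: (N,), classes: (N,) raw detector outputs."""
    keep = scores > CONF_THRESH
    boxes, scores, classes = boxes[keep], scores[keep], classes[keep]
    # batched_nms suppresses overlaps only within the same class ("class-aware"),
    # so a motorcycle and the car behind it never suppress each other.
    kept = batched_nms(boxes, scores, classes, IOU_THRESH)
    return boxes[kept], scores[kept], classes[kept]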
Stage 4: Tracking (8ms)
Detection gives you objects per frame. Tracking gives you objects over time. Without tracking, you cannot count vehicles, measure speed, detect violations, or reason about behavior.
We use ByteTrack with modifications for Indian traffic:
class DrikTracker(ByteTrack):
    def __init__(self):
        super().__init__(
            track_thresh=0.3,
            track_buffer=60,   # frames to keep lost tracks (2s at 30fps)
            match_thresh=0.8,
            frame_rate=30
        )
        self.reid_model = ReIDNet()  # appearance-based re-identification

    def update(self, detections: list[Detection]) -> list[Track]:
        # Standard ByteTrack association
        tracks = super().update(detections)

        # Re-ID for tracks lost > 10 frames
        for lost_track in self.lost_tracks:
            if lost_track.frames_lost > 10:
                match = self.reid_model.match(
                    lost_track.appearance,
                    [d.crop for d in detections]
                )
                if match and match.score > 0.7:
                    self.reactivate(lost_track, match.detection)

        return tracks
Standard ByteTrack fails in Indian traffic because:
- Extreme occlusion duration. A motorcycle disappears behind a bus for 3+ seconds (90 frames). ByteTrack’s default buffer of 30 frames loses the track. We extend to 60 frames and add appearance-based re-identification.
- Non-linear motion. Kalman filters assume linear motion. An auto-rickshaw making a sudden U-turn violates this assumption completely. We use an adaptive motion model that detects non-linear behavior and switches to a wider search radius.
- Dense small objects. Twenty motorcycles in a cluster are nearly indistinguishable by IoU matching alone. Re-ID features (color, shape, rider appearance) are essential.
Our modified tracker maintains identity through 90%+ of occlusion events, compared to 65% for vanilla ByteTrack on our test set.
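The adaptive motion model is conceptually simple: watch how badly the constant-velocity prediction has been missing recently, and widen the association gate when it misses consistently. A minimal sketch follows; the base radius, window, and cap are illustrative, not our tuned values.
from collections import deque
import numpy as np

class AdaptiveGate:
    """Widen the matching radius for tracks whose motion stops being linear."""
    def __init__(self, base_radius_px: float = 40.0, window: int = 10):
        self.base_radius = base_radius_px
        self.errors = deque(maxlen=window)  # recent prediction errors in pixels

    def update(self, predicted_xy: np.ndarray, observed_xy: np.ndarray) -> None:
        self.errors.append(float(np.linalg.norm(predicted_xy - observed_xy)))

    def radius(self) -> float:
        if not self.errors:
            return self.base_radius
        # A U-turning auto-rickshaw produces a run of large errors; scale the
        # gate up with the recent mean error, capped at 4x the base radius.
        mean_err = sum(self.errors) / len(self.errors)
        return min(self.base_radius * 4.0, self.base_radius + 2.0 * mean_err)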
Stage 5: Recognition — DrikNetra ANPR (35ms)
For vehicles with license plates, we run DrikNetra — our multi-frame plate recognition system. This deserves its own blog post (coming soon), but the key insight is: do not try to read a plate from a single blurry frame. Accumulate multiple observations and fuse them.
The recognizer crops the plate region from each tracked frame, buffers 5-10 crops, and runs multi-frame super-resolution before OCR. The super-resolution model (based on BasicVSR++) turns 5 blurry 32x16 crops into one sharp 128x64 image.
This stage only runs for vehicles that cross a defined detection zone (e.g., a stop line), not for every vehicle in every frame. This keeps the computational budget manageable.
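In outline, the per-track flow looks like the sketch below. The super_resolve and ocr callables stand in for the BasicVSR++-based model and the OCR head; both names are placeholders, and the full system is the subject of the upcoming DrikNetra post.
from collections import defaultdict

MIN_CROPS = 5
MAX_CROPS = 10

class PlateRecognizer:
    def __init__(self, super_resolve, ocr):
        self.super_resolve = super_resolve     # 5-10 low-res crops -> one sharp 128x64 image
        self.ocr = ocr                         # sharp image -> plate string + confidence
        self.crop_buffers = defaultdict(list)  # track_id -> buffered plate crops

    def observe(self, track_id: int, plate_crop) -> None:
        buf = self.crop_buffers[track_id]
        if len(buf) < MAX_CROPS:
            buf.append(plate_crop)

    def read_on_zone_crossing(self, track_id: int):
        """Called only when the track crosses the detection zone (e.g. the stop line)."""
        crops = self.crop_buffers.pop(track_id, [])
        if len(crops) < MIN_CROPS:
            return None  # not enough observations for a reliable read
        fused = self.super_resolve(crops)
        return self.ocr(fused)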
Stage 6: Scene Reasoning (20ms)
This is where detection becomes intelligence. The reasoning engine takes tracked objects with their trajectories, speeds, and classes, and generates semantic events:
class SceneReasoner:
    def __init__(self, scene_config: SceneConfig):
        self.zones = scene_config.zones              # defined regions of interest
        self.rules = scene_config.rules              # violation rules
        self.speed_limit = scene_config.speed_limit  # site speed limit in km/h
        self.state = SceneState()                    # persistent scene state

    def reason(self, tracks: list[Track]) -> list[Event]:
        events = []
        for track in tracks:
            # Zone-based reasoning
            for zone in self.zones:
                if zone.contains(track.position):
                    if zone.type == "red_light" and zone.signal_state == "red":
                        if track.speed > 5:  # km/h — moving through red
                            events.append(RedLightViolation(track))
                    if zone.type == "no_entry" and track.direction_matches(zone.forbidden_dir):
                        events.append(WrongWayViolation(track))

            # Behavior-based reasoning
            if track.speed > self.speed_limit:
                events.append(SpeedViolation(track, track.speed))
            if track.stopped_duration > 30:  # seconds
                events.append(IllegalParking(track))

            # Predictive reasoning (Level 4-5)
            collision_risk = self.predict_collision(track, tracks)
            if collision_risk > 0.8:
                events.append(CollisionWarning(track, collision_risk))

        return events
The reasoner operates on trajectories, not frames. It maintains state across time — tracking how long a vehicle has been stopped, whether a signal has changed, how traffic flow patterns have shifted. This is reasoning, not detection. A detector can tell you a car is in an intersection. A reasoner can tell you the car ran a red light.
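As one example, predict_collision can be approximated by extrapolating each pair of trajectories forward and checking how close they get. The sketch below assumes tracks expose ground-plane position and velocity vectors in metres; the constant-velocity model, horizon, and distance scale are deliberate simplifications of the production predictor.
import numpy as np

HORIZON_S = 2.0    # look-ahead window
STEP_S = 0.1
NEAR_MISS_M = 1.5  # separation that maps to a risk near 1.0

def predict_collision(track, others) -> float:
    """Return a 0-1 collision risk from constant-velocity extrapolation."""
    risk = 0.0
    for other in others:
        if other is track:
            continue
        for t in np.arange(STEP_S, HORIZON_S, STEP_S):
            p1 = track.position + track.velocity * t  # metres, ground plane
            p2 = other.position + other.velocity * t
            gap = float(np.linalg.norm(p1 - p2))
            # Risk rises as the predicted separation shrinks toward a near-miss.
            risk = max(risk, min(1.0, NEAR_MISS_M / max(gap, 1e-3)))
    return risk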
Hardware Configurations
We deploy on three hardware tiers:
| Tier | Hardware | Cameras | Use Case |
|---|---|---|---|
| Edge Lite | Jetson Orin NX 16GB | 4-8 | Single intersection |
| Edge Pro | NVIDIA A2 + Xeon | 16-32 | Corridor / small city |
| Edge Max | 2x NVIDIA A2 + Xeon | 32-64 | District-level |
For deployments exceeding 64 cameras, we use multiple Edge Pro/Max units with a local orchestration layer. Each unit operates independently — no single point of failure.
Power consumption matters for Indian deployments. Many camera installations have unreliable power. The Jetson Orin NX draws 15-25W. With a small UPS, it survives 2+ hours of power cuts.
The Five Reasoning Levels in Architecture
Our five-level reasoning hierarchy maps directly to system components:
- Detect → TensorRT detector (drik-detect). Object presence and class.
- Recognize → DrikNetra ANPR + Re-ID. Object identity.
- Describe → Scene reasoner, state machine. What is happening.
- Reason → Rule engine + trajectory analysis. Why it is happening, whether it violates rules.
- Predict → Collision prediction, traffic flow forecasting. What will happen next.
Levels 1-3 are in production. Level 4 is deployed for specific violation types. Level 5 is in active research.
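One convenient way to wire this hierarchy into the system is to tag each event type with its level, so a deployment can be capped at, say, Level 3 until the Level 4 rules are certified for that site. The sketch below is illustrative, not our exact implementation.
from enum import IntEnum

class ReasoningLevel(IntEnum):
    DETECT = 1
    RECOGNIZE = 2
    DESCRIBE = 3
    REASON = 4
    PREDICT = 5

# Each event type declares the level it belongs to.
EVENT_LEVELS = {
    "vehicle_detected": ReasoningLevel.DETECT,
    "plate_read": ReasoningLevel.RECOGNIZE,
    "congestion_state": ReasoningLevel.DESCRIBE,
    "red_light_violation": ReasoningLevel.REASON,
    "collision_warning": ReasoningLevel.PREDICT,
}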
Scaling Lessons
Deploying edge AI on Indian roads taught us things that no architecture diagram can capture:
Thermal management is critical. A metal enclosure in direct Indian sun reaches 70°C internally. Our edge boxes have passive cooling rated for 50°C ambient. Above that, we automatically drop inference precision from FP16 to INT8 to reduce GPU power and heat.
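A stripped-down version of that check, assuming the standard Linux sysfs thermal interface on Jetson; the thresholds and hysteresis here are illustrative, not our tuned values.
THERMAL_ZONE = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees Celsius

THROTTLE_AT_C = 70.0  # switch to the INT8 engine (illustrative threshold)
RECOVER_AT_C = 62.0   # hysteresis before switching back to FP16

def read_soc_temp_c() -> float:
    with open(THERMAL_ZONE) as f:
        return int(f.read().strip()) / 1000.0

def select_engine(current: str) -> str:
    """Return which pre-built TensorRT engine to run, with hysteresis."""
    temp = read_soc_temp_c()
    if current == "fp16" and temp >= THROTTLE_AT_C:
        return "int8"
    if current == "int8" and temp <= RECOVER_AT_C:
        return "fp16"
    return current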
Network is unreliable. We design for offline-first. The edge device stores 72 hours of events locally. When connectivity returns, events sync in chronological order. Duplicate detection is handled server-side with idempotent event IDs.
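A sketch of the offline-first queue, assuming a local SQLite store. The schema and the deterministic ID scheme (camera, timestamp, and event type hashed into a stable ID) illustrate the approach rather than our exact implementation.
import hashlib
import json
import sqlite3
import time

RETENTION_S = 72 * 3600  # keep 72 hours of events locally

class EventQueue:
    def __init__(self, path: str = "events.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " event_id TEXT PRIMARY KEY, ts REAL, payload TEXT, synced INTEGER DEFAULT 0)"
        )

    def put(self, camera_id: str, event_type: str, ts: float, payload: dict) -> None:
        # Deterministic ID: replaying the same event after a reconnect produces
        # the same ID, so server-side duplicate detection stays trivial.
        event_id = hashlib.sha256(f"{camera_id}|{event_type}|{ts:.3f}".encode()).hexdigest()
        self.db.execute(
            "INSERT OR IGNORE INTO events (event_id, ts, payload) VALUES (?, ?, ?)",
            (event_id, ts, json.dumps(payload)),
        )
        self.db.execute("DELETE FROM events WHERE ts < ?", (time.time() - RETENTION_S,))
        self.db.commit()

    def pending(self):
        # Sync in chronological order once connectivity returns.
        return self.db.execute(
            "SELECT event_id, payload FROM events WHERE synced = 0 ORDER BY ts"
        ).fetchall()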
Power cycles are frequent. The system boots to full operation in under 45 seconds. TensorRT engines are pre-compiled and cached. Stream reconnection is automatic. No manual intervention required after power restoration.
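Engine caching is what keeps boot time down: deserializing a pre-built engine typically takes seconds, while rebuilding from ONNX takes minutes. The sketch below uses the TensorRT Python API, with a trtexec fallback mirroring the build command shown earlier; paths and flags are illustrative.
import os
import subprocess
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(engine_path: str, onnx_path: str) -> trt.ICudaEngine:
    """Deserialize a cached TensorRT engine; build it once if the cache is cold."""
    if not os.path.exists(engine_path):
        # Cold cache (first boot or model update): build via trtexec,
        # using the full flag set from the build command above.
        subprocess.run(
            ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}", "--fp16"],
            check=True,
        )
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())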
Camera quality varies wildly. The same intersection might have a 2023 Hikvision 4MP camera and a 2016 no-brand 720p camera. Our detector is trained on both quality levels. The ANPR system adjusts confidence thresholds based on resolution.
What is Next
We are working on three architectural improvements:
- Multi-camera fusion. Using overlapping camera views to resolve occlusions and improve tracking continuity across cameras.
- Adaptive inference. Dynamically switching model complexity based on scene difficulty. Simple scenes get a lightweight model. Complex scenes get the full model. This improves throughput by 40% on average.
- On-device learning. Fine-tuning the detector on-device using accumulated edge cases. The model improves over time without redeployment.
Explore the technical details on our Technology page. If you are building edge AI systems for any domain — traffic, security, retail, industrial — the architectural patterns are transferable. The hard constraints of Indian traffic forced us to build something robust enough for anywhere.
If it works here, it works everywhere.