Engineering · February 18, 2025 · 8 min read

Building DrikSynth: Synthetic Data for Traffic Scenes That Don't Exist Yet

How we use Unreal Engine to generate 100K annotated traffic scenes — covering edge cases that would take years to collect in the real world.

Hansraj Patel

The Annotation Bottleneck

Here is a number that should concern every computer vision team: annotating one hour of traffic video takes approximately 800 human-hours at production quality. That is 800 hours of someone drawing bounding boxes, assigning classes, verifying tracks, and correcting edge cases. For a single hour of footage.

We need thousands of hours of annotated data. The math does not work.

Real-world data collection has other problems too:

  • Rare events are rare. You might wait months to capture a vehicle fire, a wrong-way driver, or a band baarat procession at night. You need thousands of examples of each to train reliably.
  • Privacy constraints. Faces, license plates, and identifiable information in real footage create legal obligations. Synthetic data has no PII.
  • Weather coverage. You cannot schedule fog. You cannot order heavy rain at a specific intersection. You cannot control the angle of the sun.
  • Annotation errors. Human annotators disagree on class boundaries (is that an SUV or an MUV?), miss small objects, and lose tracks in occlusion. Ground truth from synthetic data is mathematically perfect.

This is why we built DrikSynth, internally called Maya. It generates photorealistic Indian traffic scenes in Unreal Engine 5, with pixel-perfect annotations, at a fraction of the cost of real data collection.

The Pipeline

DrikSynth is not a game. It is a data factory. The pipeline has five stages.

Stage 1: Procedural Road Generation

Indian roads are not standardized. A single stretch might transition from a four-lane divided highway to a two-lane undivided road to an unpaved village path. Intersections are irregular. Roundabouts are asymmetric. Service roads merge at odd angles.

We model this procedurally. The road generation system takes parameters:

road_config = {
    "type": "urban_arterial",
    "lanes": 4,
    "divider": "concrete_median",  # or "painted", "none", "broken"
    "surface": "asphalt_degraded",  # potholes, patches, cracks
    "markings": "faded",            # "fresh", "faded", "none"
    "sidewalk": "partial",          # "full", "partial", "none"
    "intersections": [
        {"type": "t_junction", "signal": "non_functional"},
        {"type": "roundabout", "signal": "none"}
    ],
    "roadside": ["shops", "trees", "construction", "open_drain"]
}

The system generates road geometry, applies surface materials with procedural wear, places roadside elements (shops, trees, electric poles, chai stalls), and bakes lightmaps. Each generation is unique.

We have templates for 12 road categories: national highway, state highway, urban arterial, urban collector, residential, village road, expressway, service road, flyover, underpass, bridge, and unpaved track.
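
To make the template step concrete, here is a minimal sketch of how a road category could be expanded into one randomized configuration. The template contents and option lists below are illustrative assumptions, not the actual DrikSynth templates.

# Illustrative sketch: template names and option lists are assumptions,
# not the actual DrikSynth template definitions.
import random

ROAD_TEMPLATES = {
    "urban_arterial": {
        "lanes": [2, 4, 6],
        "divider": ["concrete_median", "painted", "none", "broken"],
        "surface": ["asphalt_fresh", "asphalt_degraded"],
        "markings": ["fresh", "faded", "none"],
    },
    "village_road": {
        "lanes": [1, 2],
        "divider": ["none"],
        "surface": ["asphalt_degraded", "unpaved"],
        "markings": ["none"],
    },
}

def sample_road_config(category: str, seed=None) -> dict:
    """Expand a road category template into one concrete, randomized config."""
    rng = random.Random(seed)
    template = ROAD_TEMPLATES[category]
    return {
        "type": category,
        "lanes": rng.choice(template["lanes"]),
        "divider": rng.choice(template["divider"]),
        "surface": rng.choice(template["surface"]),
        "markings": rng.choice(template["markings"]),
    }

print(sample_road_config("urban_arterial", seed=42))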

Stage 2: Vehicle and Actor Spawning

This is where the Indian-specific taxonomy matters. We have 3D models for 50+ vehicle types, each with multiple variants (color, load, damage state, modification).

The spawning system follows traffic distribution profiles learned from real data:

spawn_profile = {
    "two_wheeler_ratio": 0.45,    # 45% of traffic
    "auto_rickshaw_ratio": 0.12,
    "car_ratio": 0.18,
    "bus_ratio": 0.04,
    "truck_ratio": 0.08,
    "tractor_ratio": 0.03,
    "bicycle_ratio": 0.05,
    "pedestrian_ratio": 0.15,     # per-frame density
    "animal_ratio": 0.02,
    "special_event_prob": 0.01    # band baarat, funeral, etc.
}

Vehicles are spawned with behavior profiles. A motorcycle might weave between lanes. A bus stops abruptly. A tractor moves at 15 km/h in the fast lane. An auto-rickshaw makes a U-turn across a divider. These behaviors are modeled from real-world trajectory data.
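
As a rough illustration of what a per-class behavior profile can look like, here is a hypothetical sketch. The parameter names and values are placeholders for the idea, not our actual behavior model.

# Hypothetical behavior-profile shape; names and values are illustrative.
behavior_profiles = {
    "motorcycle": {
        "lane_weave_prob": 0.6,            # chance of weaving between lanes
        "wrong_way_prob": 0.05,
        "target_speed_kmh": (25, 60),
    },
    "bus": {
        "abrupt_stop_prob": 0.3,           # unscheduled mid-road stops
        "stop_duration_s": (5, 40),
        "target_speed_kmh": (15, 45),
    },
    "tractor": {
        "lane_preference": "any",          # including the fast lane
        "target_speed_kmh": (10, 20),
    },
    "auto_rickshaw": {
        "u_turn_across_divider_prob": 0.1,
        "target_speed_kmh": (15, 40),
    },
}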

Each vehicle carries metadata: make, model, color, license plate (procedurally generated in Indian formats — state code, RTO code, number series), occupant count, cargo type, and damage state.
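
The plate text itself follows the standard Indian format (two-letter state code, two-digit RTO code, letter series, four-digit number), which is simple enough to sketch. The state list below is a small illustrative subset, and the generator is an assumption about how this could be done rather than our exact implementation.

# Sketch of procedural Indian-format plate text. STATE_CODES is a small
# illustrative subset, not the full list DrikSynth uses.
import random
import string

STATE_CODES = ["GJ", "MH", "DL", "KA", "TN", "UP", "RJ", "PB"]

def random_plate(rng: random.Random) -> str:
    state = rng.choice(STATE_CODES)
    rto = rng.randint(1, 99)                              # RTO district code
    series = "".join(rng.choices(string.ascii_uppercase, k=2))
    number = rng.randint(1, 9999)
    return f"{state} {rto:02d} {series} {number:04d}"

print(random_plate(random.Random(7)))   # e.g. "GJ 01 AB 1234"-style output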

Stage 3: Environment Randomization

Domain randomization is the technique of varying visual parameters during synthetic data generation so that the model learns to be invariant to them. We randomize:

Lighting and time of day. We simulate 24-hour cycles with accurate sun position for Indian latitudes. Dawn, harsh midday sun, golden hour, dusk, and night with headlight glare. Each generates dramatically different shadow patterns and color temperatures.

Weather. Clear sky, haze (Delhi-level smog with PM2.5 particle simulation), light rain, heavy rain (Mumbai monsoon), fog (Punjab winter), and dust storms (Rajasthan). Each weather condition affects visibility, surface reflectance, and vehicle behavior.

weather_config = {
    "condition": "heavy_rain",
    "visibility_range": 80,       # meters
    "rain_intensity": 0.8,        # 0-1 scale
    "road_wetness": 0.9,
    "puddle_coverage": 0.3,
    "spray_from_vehicles": True,
    "wiper_state": "fast",        # for ego-perspective renders
    "fog_density": 0.2
}
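
Per-scene randomization then reduces to sampling a config like the one above. A minimal sampler sketch, with illustrative preset names and ranges (the real pipeline randomizes far more than this):

# Sketch of per-scene environment sampling; preset names mirror the
# conditions described above, ranges are illustrative assumptions.
import random

WEATHER_PRESETS = ["clear", "haze", "light_rain", "heavy_rain", "fog", "dust_storm"]
TIME_SLOTS = ["dawn", "morning", "midday", "afternoon",
              "golden_hour", "dusk", "night", "late_night"]

def sample_environment(rng: random.Random) -> dict:
    condition = rng.choice(WEATHER_PRESETS)
    return {
        "condition": condition,
        "time_of_day": rng.choice(TIME_SLOTS),
        "visibility_range": rng.uniform(50, 2000) if condition != "clear" else 5000,
        "road_wetness": rng.uniform(0.6, 1.0) if "rain" in condition else 0.0,
    }

print(sample_environment(random.Random(3)))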

Camera parameters. We simulate the actual cameras deployed in Indian CCTV installations. That means:

  • Resolution: 720p (most common), 1080p, 4MP
  • Compression: H.264 with aggressive quantization (CRF 28-35)
  • Frame rate: 15 FPS (common) or 25 FPS
  • Lens: Fixed 2.8mm, 4mm, or 6mm (with appropriate distortion)
  • Night mode: IR cut filter simulation, low-light noise
  • Mounting: Pole-mounted at 4-8 meters, wall-mounted, gantry-mounted

This is critical. Training on clean 4K renders and deploying on compressed 720p CCTV creates a domain gap that kills accuracy. We simulate the deployment camera, not an ideal camera.
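
As a sketch of what deployment-camera simulation can look like, much of this degradation can be approximated offline with ffmpeg: downscale, drop the frame rate, recompress with an aggressive CRF, and add mild noise. The filter values below are illustrative, and lens distortion and IR night mode would need steps beyond this sketch.

# Sketch: degrade a clean render toward a CCTV-like feed with ffmpeg
# (720p, 15 FPS, aggressive H.264 quantization, mild temporal noise).
# CRF and noise strength are illustrative assumptions.
import subprocess

def simulate_cctv(clean_render: str, output: str,
                  width: int = 1280, height: int = 720,
                  fps: int = 15, crf: int = 30) -> None:
    subprocess.run([
        "ffmpeg", "-y", "-i", clean_render,
        "-vf", f"scale={width}:{height},noise=alls=8:allf=t+u",
        "-r", str(fps),
        "-c:v", "libx264", "-crf", str(crf),
        output,
    ], check=True)

simulate_cctv("scene_0421_clean.mp4", "scene_0421_cctv.mp4")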

Stage 4: Annotation Generation

Every frame comes with annotations that would take a human hours to produce:

  • 2D bounding boxes for every object, with class labels from our 50+ class taxonomy
  • Instance segmentation masks at pixel level
  • 3D bounding boxes with orientation
  • Depth maps (absolute depth per pixel)
  • Optical flow (per-pixel motion vectors)
  • Object tracks with consistent IDs across the full sequence
  • Occlusion flags (percentage occluded, occluding object ID)
  • License plate text ground truth
  • Scene-level metadata (weather, time, road type, traffic density)

All annotations are generated automatically from the render engine’s scene graph. Zero human labor. Zero annotation errors.

# Exported annotation for a single frame
{
    "frame_id": 4821,
    "timestamp_ms": 321400,
    "weather": "haze",
    "time_of_day": "10:42",
    "objects": [
        {
            "id": 147,
            "class": "auto_rickshaw",
            "bbox": [412, 283, 89, 67],
            "segmentation": "rle_encoded_mask",
            "depth_m": 12.4,
            "speed_kmh": 23.5,
            "heading_deg": 175.2,
            "occlusion_pct": 0.15,
            "plate_text": "GJ 01 AB 1234",
            "track_id": 147,
            "in_violation": False
        },
        # ... 80+ more objects
    ]
}
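
To show how a 2D box can fall straight out of the scene graph, here is a minimal pinhole-projection sketch: take the eight corners of an object's 3D box in camera coordinates, project them, and wrap the extremes. The intrinsics are illustrative, not our actual camera model.

# Sketch: derive a 2D bounding box from a 3D box known to the renderer.
# Intrinsics (fx, fy, cx, cy) are illustrative assumptions.
import numpy as np

def project_bbox_2d(corners_cam: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float):
    """corners_cam: (8, 3) array of XYZ corners in camera space, Z > 0."""
    x = fx * corners_cam[:, 0] / corners_cam[:, 2] + cx
    y = fy * corners_cam[:, 1] / corners_cam[:, 2] + cy
    x_min, y_min = x.min(), y.min()
    return [x_min, y_min, x.max() - x_min, y.max() - y_min]   # [x, y, w, h]

# Example: a roughly auto-rickshaw-sized box about 12 m from the camera
corners = np.array([[dx, dy, 12.0 + dz]
                    for dx in (-0.7, 0.7) for dy in (-0.9, 0.9) for dz in (-1.3, 1.3)])
print(project_bbox_2d(corners, fx=900.0, fy=900.0, cx=640.0, cy=360.0))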

Stage 5: Export and Training Integration

DrikSynth exports in COCO, YOLO, MOT, and our internal format. The export pipeline handles:

  • Class mapping — collapsing our 50+ classes into coarser taxonomies when needed
  • Split generation — train/val/test splits with stratification by weather, time, road type
  • Difficulty scoring — each frame gets a difficulty score based on occlusion density, object count, and weather degradation (a scoring sketch follows this list)
  • Curriculum scheduling — easy scenes first, hard scenes later in training
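
A minimal sketch of the difficulty score mentioned above; the weights and normalization constants are illustrative assumptions, not the production formula.

# Illustrative difficulty score from occlusion, object count, and weather.
# Weights and normalization constants are assumptions.
def difficulty_score(mean_occlusion: float, object_count: int,
                     visibility_range_m: float) -> float:
    occlusion_term = mean_occlusion                       # already 0-1
    density_term = min(object_count / 100.0, 1.0)         # saturate at 100 objects
    weather_term = 1.0 - min(visibility_range_m / 1000.0, 1.0)
    return 0.4 * occlusion_term + 0.3 * density_term + 0.3 * weather_term

# A hazy frame with 80 objects and moderate occlusion lands in the harder
# end of the range
print(difficulty_score(mean_occlusion=0.35, object_count=80, visibility_range_m=200))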

Domain Randomization vs. Domain Adaptation

There are two schools of thought on bridging the sim-to-real gap.

Domain randomization says: vary the synthetic data enough that the real world becomes just another variation. Make the textures random, the lighting random, the camera random. The model learns features that are invariant to all of it.

Domain adaptation says: make the synthetic data look as real as possible, then fine-tune on a small amount of real data to close the remaining gap.

We use both.

Our base training uses aggressive domain randomization — we render scenes with unrealistic textures, exaggerated weather, and extreme camera angles alongside photorealistic renders. This forces the feature extractor to learn shape and motion rather than texture.

Then we fine-tune on real Indian traffic data annotated with DrikLabel. The fine-tuning stage uses 10-20% real data mixed with synthetic data. We found this ratio critical: too little real data and the model has sim-to-real artifacts; too much and you lose the coverage benefits of synthetic data.

training_config = {
    "stage_1": {
        "data": "synth_randomized",
        "epochs": 100,
        "augmentation": "heavy",
        "lr": 1e-3
    },
    "stage_2": {
        "data": "synth_photorealistic + real_annotated",
        "ratio": "80:20",
        "epochs": 50,
        "augmentation": "moderate",
        "lr": 1e-4
    },
    "stage_3": {
        "data": "real_annotated",
        "epochs": 20,
        "augmentation": "light",
        "lr": 1e-5
    }
}
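
One way to enforce the stage-2 synthetic-to-real ratio is weighted sampling over a combined dataset. A minimal PyTorch sketch, assuming the two datasets already exist as map-style datasets; the function and variable names here are placeholders, not our training code.

# Sketch: sample batches that are ~80% synthetic, 20% real in expectation.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(synth_ds, real_ds, synth_frac=0.8, batch_size=32):
    combined = ConcatDataset([synth_ds, real_ds])
    # Per-sample weights so synthetic examples carry synth_frac of the
    # total sampling probability regardless of dataset sizes.
    w_synth = synth_frac / len(synth_ds)
    w_real = (1.0 - synth_frac) / len(real_ds)
    weights = torch.tensor([w_synth] * len(synth_ds) + [w_real] * len(real_ds))
    sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)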

Results

We have generated over 100,000 unique scenes with DrikSynth. Some key numbers:

  • Scene diversity: 12 road types × 6 weather conditions × 8 time-of-day slots × variable traffic density = effectively infinite variation
  • Object instances: 8.2 million annotated object instances across all scenes
  • Rare event coverage: 5,000+ scenes with rare events (wrong-way driving, vehicle fires, animal crossings, ceremonial processions) — events that would take years to collect naturally
  • Annotation cost: $0 marginal cost per annotation (compute cost is ~$0.02 per scene on cloud GPU instances)
  • Accuracy impact: Models trained with synthetic data pre-training show +4.2 mAP improvement on our real-world test set compared to real-data-only training

The biggest win is in rare classes. Our tractor-trolley detection improved from 31.2 mAP to 58.7 mAP with synthetic pre-training, simply because we could generate thousands of tractor-trolley examples in varied conditions. Real data had fewer than 200 examples.

What We Learned

Sim-to-real gap is real but manageable. The gap is largest for texture-dependent features (license plate reading) and smallest for shape-dependent features (vehicle detection). Our multi-stage training handles both.

Camera simulation matters more than scene realism. Rendering a photorealistic intersection is less important than accurately simulating the JPEG compression, resolution, and noise profile of the deployment camera. We spent more engineering time on camera simulation than on environment art.

Behavior realism is underrated. Early versions of DrikSynth had vehicles following traffic rules. The models trained on this data failed in real-world India because they had never seen a motorcycle traveling against traffic or a bus stopping in the middle of a highway. We now model chaotic behavior explicitly.

Diversity beats volume. 10,000 diverse scenes outperform 100,000 similar scenes. Our difficulty-aware curriculum is more important than raw scene count.

Try It

DrikSynth is part of our open-source toolkit. Check it out on the Open Source page to see how we generate training data at scale. If you are building traffic AI for any market — not just India — synthetic data will accelerate your development. The chaotic scenes we generate will stress-test your models in ways that clean datasets never will.