Engineering | Apr 10, 2025 | 7 min read

Scaling AI inference: how we route to the right model

CDNZero supports multiple AI models for image and video generation, but users shouldn't have to think about which one to use. Our model routing layer makes that decision automatically.

The routing logic considers three factors: content type (product photo vs. abstract art vs. video), quality requirements (thumbnail vs. hero image), and latency budget (real-time generation vs. batch processing).

For images, we maintain a fast model for thumbnails and social previews (generates in 2-3 seconds), a quality model for marketing assets (8-12 seconds, much higher detail), and a specialized model for product photography with background removal.
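As an illustration of how the three factors could combine for images, here is a minimal sketch. The model names (`fast-model`, `quality-model`, `product-model`), the `ImageRequest` type, and the rule order are all hypothetical; the only figure taken from the post is the 12-second upper bound for the quality model, used here as an assumed cutoff for the latency budget.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    PRODUCT_PHOTO = "product_photo"
    ABSTRACT_ART = "abstract_art"


class Quality(Enum):
    THUMBNAIL = "thumbnail"
    HERO = "hero"


# Assumption: the quality model needs up to 12 seconds (upper end of the
# 8-12 second range), so it is only chosen when the budget allows that.
QUALITY_MODEL_WORST_CASE_S = 12.0


@dataclass(frozen=True)
class ImageRequest:
    content_type: ContentType
    quality: Quality
    latency_budget_s: float


def route_image(req: ImageRequest) -> str:
    """Return a hypothetical model name for an image request."""
    # Assumption: product photography always goes to the specialized model.
    if req.content_type is ContentType.PRODUCT_PHOTO:
        return "product-model"
    # Thumbnails, or requests with a tight budget, go to the fast model.
    if req.quality is Quality.THUMBNAIL:
        return "fast-model"
    if req.latency_budget_s < QUALITY_MODEL_WORST_CASE_S:
        return "fast-model"
    return "quality-model"
```

In this sketch a hero image with a 30-second budget would go to the quality model, while the same request with a 5-second budget would fall back to the fast model.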

For video, routing is more complex. Short clips (under 15 seconds) use a lightweight model that renders in real time. Longer content uses a multi-pass pipeline that generates keyframes first, then interpolates motion between them, which is slower but produces significantly better quality.
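The duration split can be sketched as a function that returns the ordered stages a request would pass through. The stage names are hypothetical, and treating a clip of exactly 15 seconds as "longer content" is an assumption, since the post only says "under 15".

```python
SHORT_CLIP_LIMIT_S = 15.0  # from the post: short clips are under 15 seconds


def plan_video_pipeline(duration_s: float) -> list[str]:
    """Return the ordered (hypothetical) stage names for a video request."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s < SHORT_CLIP_LIMIT_S:
        # Single pass with the lightweight model.
        return ["lightweight-render"]
    # Multi-pass: keyframes first, then motion interpolation between them.
    return ["generate-keyframes", "interpolate-motion"]
```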

The routing layer also handles failover. If a model shows elevated latency or error rates, requests are automatically rerouted to an alternative model that can produce acceptable results. Users may see slightly different output characteristics, but they do not see a failed request.
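One way to picture the failover behavior is an ordered list of candidate models that is walked until one succeeds. This is a sketch under assumptions, not a description of the production router: `generate_with_failover`, the `is_healthy` callback, and the `AllModelsFailed` error are hypothetical names, and a real system would likely add timeouts and more nuanced health signals.

```python
from typing import Callable, Sequence


class AllModelsFailed(RuntimeError):
    """Raised when every candidate was unhealthy or raised an error."""


def generate_with_failover(
    prompt: str,
    candidates: Sequence[tuple[str, Callable[[str], bytes]]],
    is_healthy: Callable[[str], bool],
) -> tuple[str, bytes]:
    """Try candidates in priority order; return (model_name, output)."""
    errors: list[str] = []
    for name, model in candidates:
        # Skip models already flagged for high latency or error rates.
        if not is_healthy(name):
            errors.append(f"{name}: skipped (unhealthy)")
            continue
        try:
            return name, model(prompt)
        except Exception as exc:  # any model error triggers failover
            errors.append(f"{name}: {exc}")
    raise AllModelsFailed("; ".join(errors))
```

Returning the model name alongside the output lets the caller record which model actually served the request, which helps explain differences in output characteristics after a reroute.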

We run inference on a mix of A100 and H100 GPUs across three regions (US-East, EU-West, APAC). Regional routing ensures that generation requests land on the closest GPU cluster, while generated outputs are cached at the edge for subsequent retrievals.
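As a rough sketch of regional routing, "closest" can be approximated by the lowest measured round-trip time from the client to each region. Using round-trip time as the closeness signal is an assumption, as are the region identifiers and the `pick_region` function; the post does not say how proximity is determined.

```python
# Hypothetical identifiers for the three regions named in the post.
REGIONS = ("us-east", "eu-west", "apac")


def pick_region(rtt_ms: dict[str, float]) -> str:
    """Pick the known region with the lowest measured round-trip time."""
    # Ignore measurements for regions we do not run GPUs in.
    known = {region: ms for region, ms in rtt_ms.items() if region in REGIONS}
    if not known:
        raise ValueError("no round-trip measurements for any known region")
    return min(known, key=lambda region: known[region])
```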

The result: 95% of image generation requests complete in under 10 seconds, and 95% of video requests complete in under 45 seconds. All generated assets are immediately available via CDN with sub-50ms retrieval latency.
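For readers who want to check a similar target against their own latency samples, a 95th-percentile check can be sketched with the nearest-rank method. The choice of nearest-rank is an assumption made for this example; the post does not state which percentile definition is used, and the function names are hypothetical.

```python
def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_target(samples: list[float], target_s: float, pct: int = 95) -> bool:
    """True if the pct-th percentile latency is strictly under target_s."""
    return percentile_nearest_rank(samples, pct) < target_s
```

With this definition, `meets_target(image_latencies, 10.0)` would correspond to the image target above and `meets_target(video_latencies, 45.0)` to the video target, where both sample lists are supplied by the caller.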