Every AI consultancy can build a demo. The demo always works. It's compelling, polished, and ships in two weeks. Then you sign a six-figure contract, and six months later you're staring at a system that falls over under real load, hallucinates in front of customers, and costs 10x what anyone budgeted.
This isn't hypothetical. According to a 2024 RAND Corporation study, more than 80% of AI projects fail to reach meaningful production deployment — twice the failure rate of non-AI IT projects. And McKinsey's 2025 State of AI report found that while 88% of organizations now use AI in at least one function, only about one-third have managed to scale it beyond pilots.
The gap between "we built a cool prototype" and "this runs in production, reliably, at scale" is where most AI engagements go to die. Here's how to tell whether a consultancy can actually cross that gap — before you hand them the keys.
Any consultancy will show you a portfolio of successes. That tells you almost nothing. What you want to hear about is failure — specifically, production failures they've encountered and how they handled them.
A team that's actually shipped AI to production will have war stories about:

- Models that regressed silently after a provider update
- Systems that fell over under real traffic
- Hallucinations that reached customers
- Costs that blew far past budget
If a consultancy can't articulate specific production failures they've navigated, they haven't done production work. Full stop. The RAND study identified five root causes of AI project failure, including inadequate infrastructure to deploy completed models and misalignment between the problem and the technology. These aren't theoretical risks — they're the daily reality of production AI.
Prototype architecture and production architecture are fundamentally different. A prototype calls an LLM API and returns the result. A production system needs:

- Guardrails and output validation
- Fallbacks for when a model or provider fails
- Monitoring, audit trails, and cost controls
- Eval suites that gate changes before they ship
Ask them to whiteboard the architecture for a system they've deployed. If the diagram is just "user → API → LLM → response," they're a prototype shop.
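To make the contrast concrete, here is a minimal sketch of what sits between "user → API → LLM → response" in a production system: retries with backoff, a fallback provider, and output validation. The function names (`call_primary_model`, `call_fallback_model`) are hypothetical stand-ins for real provider SDK calls, and the stubs simulate an outage for illustration.

```python
import time

# Hypothetical stubs standing in for real provider SDK calls.
def call_primary_model(prompt: str) -> str:
    raise TimeoutError("primary provider unavailable")  # simulate an outage

def call_fallback_model(prompt: str) -> str:
    return "fallback answer"

def validate_output(text: str) -> bool:
    # Guardrail: reject empty or oversized responses before users see them.
    return 0 < len(text) < 4000

def generate(prompt: str, retries: int = 2) -> str:
    """An LLM call wrapped with retries, a fallback provider, and validation."""
    for attempt in range(retries):
        try:
            out = call_primary_model(prompt)
            if validate_output(out):
                return out
        except (TimeoutError, ConnectionError):
            time.sleep(0.1 * (2 ** attempt))  # exponential backoff between retries
    out = call_fallback_model(prompt)  # degrade gracefully instead of erroring out
    if not validate_output(out):
        raise RuntimeError("no valid response from any provider")
    return out

print(generate("What is our refund policy?"))  # -> "fallback answer"
```

The point isn't this exact code; it's that every box in the production diagram — validation, retry, fallback — exists because something failed without it.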
Traditional software has well-established testing patterns. AI systems are harder to test because they're non-deterministic — the same input can produce different outputs. A production-ready consultancy will have a clear answer to: "How do you test AI systems?"
Look for:

- A documented evaluation methodology, not ad-hoc spot checks
- Eval suites with curated test cases that run before every change
- Pass/fail thresholds that account for non-deterministic outputs
If their testing strategy is "we try it and see if it works," that's a prototype mindset.
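A minimal eval harness looks something like the sketch below: curated cases, a per-case check, and an aggregate threshold rather than exact-match unit tests. `run_model` is a hypothetical stand-in for the system under test, and the cases and threshold are illustrative.

```python
# Hypothetical stand-in for the AI system under test.
def run_model(prompt: str) -> str:
    return {"capital of France?": "Paris", "2+2?": "4"}.get(prompt, "I don't know")

# Curated eval cases with a simple containment check per case.
GOLDEN_CASES = [
    {"prompt": "capital of France?", "must_contain": "Paris"},
    {"prompt": "2+2?", "must_contain": "4"},
    {"prompt": "capital of Peru?", "must_contain": "Lima"},
]

def run_evals(cases) -> float:
    # Because outputs are non-deterministic, score pass rate across a suite
    # and gate releases on a threshold, not on any single exact match.
    passed = sum(1 for c in cases if c["must_contain"] in run_model(c["prompt"]))
    return passed / len(cases)

score = run_evals(GOLDEN_CASES)
print(f"eval pass rate: {score:.0%}")  # -> "eval pass rate: 67%"
assert score >= 0.6, "regression: eval pass rate below release threshold"
```

The real versions of these suites are larger and often use model-graded checks, but the shape — cases, scores, thresholds, run on every change — is the tell.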
Standard APM tools (Datadog, New Relic) aren't enough for AI systems. You need AI-specific observability. Ask what they monitor in production:

- Token usage and cost per request
- Latency and error rates, per model and per provider
- Output quality and hallucination incidents
- Model and data drift over time
A consultancy that can't tell you their monitoring stack and the specific metrics they track hasn't operated AI in production at meaningful scale.
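As a sketch of what "AI-specific observability" means at the request level: emit a structured event per call with tokens, cost, latency, and quality flags. The pricing constants below are illustrative assumptions, not any provider's actual rates.

```python
import json

# Illustrative pricing assumptions -- substitute your provider's actual rates.
COST_PER_1K_INPUT = 0.003   # assumed $/1K input tokens
COST_PER_1K_OUTPUT = 0.015  # assumed $/1K output tokens

def record_request(input_tokens: int, output_tokens: int,
                   latency_s: float, flagged_hallucination: bool) -> dict:
    """Build one structured telemetry event for a single AI request."""
    cost = (input_tokens / 1000) * COST_PER_1K_INPUT \
         + (output_tokens / 1000) * COST_PER_1K_OUTPUT
    event = {
        "latency_ms": round(latency_s * 1000),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
        "hallucination_flag": flagged_hallucination,  # set by a guardrail/eval check
    }
    print(json.dumps(event))  # in production, ship this to your metrics pipeline
    return event

record_request(1200, 400, 1.8, False)  # cost_usd -> 0.0096
```

Once every request emits an event like this, "cost per AI transaction" stops being a guess and becomes a dashboard.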
Production AI means being at the mercy of model providers: OpenAI, Anthropic, Google, and others. Models get updated, deprecated, rate-limited, and occasionally go down entirely. A production-ready consultancy:

- Has shipped against more than one provider
- Abstracts provider APIs so migration is a config change, not a rewrite
- Builds fallbacks for rate limits, outages, and deprecations
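One common way to avoid lock-in is to put an interface between application code and provider SDKs. A minimal sketch, with hypothetical stub providers standing in for real SDK clients:

```python
from typing import Protocol

class ChatProvider(Protocol):
    """The only surface application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

# Hypothetical stubs; real implementations would wrap provider SDKs.
class ProviderA:
    def complete(self, prompt: str) -> str:
        return f"A says: {prompt}"

class ProviderB:
    def complete(self, prompt: str) -> str:
        return f"B says: {prompt}"

class Router:
    """Routes to a primary provider, falling back on failure. Because callers
    only see ChatProvider, swapping providers is configuration, not a rewrite."""
    def __init__(self, primary: ChatProvider, fallback: ChatProvider):
        self.primary, self.fallback = primary, fallback

    def complete(self, prompt: str) -> str:
        try:
            return self.primary.complete(prompt)
        except Exception:
            return self.fallback.complete(prompt)

router = Router(ProviderA(), ProviderB())
print(router.complete("hello"))  # -> "A says: hello"
```

A team that has actually migrated providers will have some version of this abstraction already; a team that hasn't will have provider-specific calls scattered through the codebase.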
AI is only one layer of a production system. A consultancy that only knows AI but can't talk about deployment, scaling, security, and DevOps is going to hand you a model that works on their laptop but not in your cloud.
Red flags:

- No one on the team who can go deep on deployment, scaling, or security
- Demos that run on a laptop with no story for your cloud environment
- No experience with CI/CD, incident response, or on-call operations
Production AI is a systems engineering problem, not just a machine learning problem. If the team is all data scientists and no infrastructure engineers, that's a signal.
Shipping v1 is the beginning, not the end. AI systems require ongoing care: models drift, data changes, user patterns evolve, and costs need continuous optimization. Ask:

- Who monitors the system after launch, and how?
- How do you detect and respond to model drift?
- What does ongoing cost optimization look like?
- What's the handoff, support, or retainer model for post-launch operations?
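Drift detection in particular can start simple: record the eval pass rate at launch, keep running the suite on live traffic samples, and alert when quality drops. A minimal sketch — the baseline and threshold values are illustrative assumptions:

```python
# Illustrative values: the baseline would be measured when v1 ships,
# and the alert threshold set with the client.
BASELINE_PASS_RATE = 0.92   # eval pass rate recorded at launch
ALERT_DROP = 0.05           # alert if quality drops more than 5 points

def check_drift(recent_results: list[bool]) -> bool:
    """Return True if the recent eval pass rate has drifted below threshold."""
    current = sum(recent_results) / len(recent_results)
    return (BASELINE_PASS_RATE - current) > ALERT_DROP

# 8/10 recent checks passing = a 12-point drop from baseline -> alert fires.
assert check_drift([True] * 8 + [False] * 2) is True
# 9/10 passing = a 2-point drop -> within tolerance, no alert.
assert check_drift([True] * 9 + [False] * 1) is False
```

The specific mechanics vary, but a consultancy with real post-launch experience will have an answer this concrete: a baseline, a threshold, and a person who gets paged.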
Before signing with an AI consultancy, run them through these questions. Score each honestly:
| Question | Pass? |
|---|---|
| Can they describe specific production failures they've encountered and resolved? | ☐ |
| Do they have a documented evaluation and testing methodology for AI systems? | ☐ |
| Can they whiteboard a production architecture with guardrails, fallbacks, and monitoring? | ☐ |
| Do they have multi-model experience and provider migration strategies? | ☐ |
| Can they articulate unit economics (cost per AI transaction)? | ☐ |
| Do they have infrastructure and DevOps expertise, not just AI/ML? | ☐ |
| Is there a plan for post-launch operations, monitoring, and optimization? | ☐ |
| Have they dealt with compliance requirements (SOC 2, HIPAA, GDPR) in AI contexts? | ☐ |
If a consultancy can't check most of these boxes with specific, concrete examples — not vague assurances — keep looking.
We've been building production systems for over a decade — complex web platforms, headless CMS architectures, high-traffic e-commerce sites. When we moved into AI, we brought that production mindset with us. We don't think of AI as a separate discipline; it's another layer in a production stack that needs the same rigor as everything else: CI/CD, monitoring, testing, incident response, and operational documentation.
Every AI system we build ships with guardrails, audit trails, cost controls, and eval suites. Not because we're paranoid — because we've seen what happens when you skip them. The prototype-to-production gap isn't closed by better models. It's closed by better engineering.