Building custom AI software is the easy part. Keeping it running well six months later — that's where most companies fall apart.

According to a 2024 RAND Corporation report, more than 80% of AI projects fail, twice the rate of non-AI IT projects. And the failures aren't usually in the prototype phase. They happen after launch, when the system quietly degrades and nobody notices until it's too late.

Gartner predicts that by 2030, 50% of enterprises will face delayed AI upgrades or rising maintenance costs due to unmanaged GenAI technical debt. That's not a distant risk — it's the trajectory most organizations are already on.

This post breaks down what maintaining and improving custom AI software actually looks like in practice: what degrades, what breaks, and what you need to build into your process from day one.

AI Software Degrades Differently Than Traditional Software

Traditional software breaks when code changes or infrastructure fails. AI software breaks while you're not touching it at all.

The landmark Google Research paper "Hidden Technical Debt in Machine Learning Systems" (NeurIPS 2015) identified this problem a decade ago: in ML systems, only a small fraction of the code is the model itself. The vast majority is the surrounding infrastructure — data pipelines, feature engineering, configuration, monitoring, and serving layers. All of it needs ongoing maintenance.

Custom AI software faces three categories of degradation that traditional software doesn't:

  • Model drift: The world changes, but your model's training data doesn't. Customer behavior shifts, new products launch, seasonal patterns evolve — and your model's accuracy quietly declines.
  • Dependency volatility: LLM providers ship breaking changes. OpenAI, Anthropic, and Google update their APIs and model versions frequently. A model you tuned prompts for in January may behave differently by March.
  • Data pipeline rot: Upstream data sources change schemas, add fields, or alter formats. If your AI system ingests data from CRMs, CMSes, or analytics platforms, any upstream change can corrupt your inputs.

The Five Pillars of AI Software Maintenance

After building and maintaining AI systems across multiple enterprise environments, we've found that ongoing AI maintenance comes down to five disciplines. Skip any one and you'll pay for it later.

1. Model and Prompt Monitoring

You can't improve what you don't measure. Every production AI system needs observability that goes beyond standard application monitoring.

For LLM-based systems, this means tracking:

  • Output quality scores — automated evaluation of responses against known-good examples
  • Latency distributions — not just averages, but P95 and P99 where user experience actually suffers
  • Cost per task — token usage trends that reveal prompt bloat or unnecessary reasoning loops
  • Error and fallback rates — how often the system fails to produce a usable result
  • User feedback signals — thumbs up/down, corrections, escalations to humans

For traditional ML models, add drift detection: statistical tests that compare incoming production data distributions against your training data. When they diverge past a threshold, you know it's time to retrain.
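The drift check described above can be sketched in a few lines with a two-sample Kolmogorov-Smirnov statistic — the largest gap between the empirical distributions of training and production data. This is a minimal illustration, not a specific library's API; the 0.2 threshold is an assumption you'd tune per feature in practice:

```python
import bisect

def ks_statistic(reference, production):
    """Two-sample KS statistic: the largest gap between the empirical CDFs
    of the two samples (0 = identical distributions, 1 = fully disjoint)."""
    ref = sorted(reference)
    prod = sorted(production)

    def ecdf(sorted_sample, x):
        # Fraction of the sorted sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(ref, x) - ecdf(prod, x))
               for x in set(reference) | set(production))

def check_drift(training_sample, production_sample, threshold=0.2):
    # threshold=0.2 is illustrative; tune it per feature and alert volume.
    stat = ks_statistic(training_sample, production_sample)
    return {"statistic": stat, "drifted": stat > threshold}
```

In production you'd more likely reach for `scipy.stats.ks_2samp` (which also gives a p-value) and run this per feature on a schedule, but the core idea is exactly this comparison.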

2. Prompt and Configuration Management

In LLM-based systems, prompts are code. They need the same rigor you'd give to any other critical codebase.

That means version control for every prompt template, with diffs and rollback capability. It means staging environments where prompt changes are tested against evaluation datasets before they hit production. And it means ownership — someone is responsible for each prompt's performance, not just its initial creation.

We version prompts alongside application code, run automated evaluation suites on every change, and maintain a changelog that ties prompt updates to observed behavior changes. It's not glamorous work, but it's the difference between a system you can debug and one that's a black box.
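A minimal sketch of what "prompts are code" looks like in practice — versioned templates, a fingerprint that ties changelog entries to exact template text, and an eval runner. All names here (`summarize:v2`, the `model_call` parameter) are hypothetical, and `model_call` stands in for whatever provider SDK you actually use:

```python
import hashlib

PROMPT_TEMPLATES = {
    # Each template carries a version suffix; old versions stay in history.
    "summarize:v2": "Summarize the following support ticket in one sentence:\n{ticket}",
}

def prompt_fingerprint(name: str) -> str:
    """Short hash of the template text, recorded with every eval run so
    observed behavior changes can be tied to an exact prompt revision."""
    return hashlib.sha256(PROMPT_TEMPLATES[name].encode()).hexdigest()[:12]

def run_eval(name, eval_cases, model_call):
    """Render the template for each case, call the model, score the output.

    Each eval case supplies template inputs and a `check` predicate that
    encodes what a known-good response looks like. Returns the pass rate.
    """
    template = PROMPT_TEMPLATES[name]
    passed = sum(
        1 for case in eval_cases
        if case["check"](model_call(template.format(**case["inputs"])))
    )
    return passed / len(eval_cases)
```

Gating deployment on `run_eval` returning a pass rate above a fixed bar is what turns prompt changes from guesswork into a reviewable diff.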

3. Dependency and Model Version Management

AI systems have a unique dependency challenge: the models themselves are third-party services that change without your permission.

When OpenAI deprecates a model version or Anthropic updates Claude's behavior, your application's behavior can change overnight. Managing this requires:

  • Model abstraction layers that let you swap providers without rewriting application logic
  • Pinned model versions where the API supports it, with deliberate upgrade cycles
  • Regression test suites that run against new model versions before you adopt them
  • Fallback chains so a single provider outage doesn't take your system down

This isn't theoretical. We've seen production systems break because a model provider quietly changed default parameters in a minor version bump. You need automated testing that catches these changes before your users do.
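The abstraction-plus-fallback pattern from the list above can be sketched like this. The provider names and `client_fn` callables are placeholders — in a real system each would wrap a provider SDK with a pinned model version and map that SDK's errors to a common exception type:

```python
class ProviderError(Exception):
    """Common error type that each provider wrapper raises on failure."""

def call_with_fallback(prompt, providers):
    """Try each (name, client_fn) pair in order; return the first success.

    `providers` is an ordered list, e.g. primary first, cheaper or
    self-hosted backup last. Application code never imports a provider
    SDK directly, so swapping the chain is a configuration change.
    """
    errors = []
    for name, client_fn in providers:
        try:
            return {"provider": name, "output": client_fn(prompt)}
        except ProviderError as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"All providers failed: {errors}")
```

The same seam is where you run regression suites: point the chain at a candidate model version in staging, replay your evals, and only then promote it.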

4. Data Pipeline Maintenance

Your AI system is only as good as the data flowing into it. And data pipelines are fragile.

Upstream systems — CRMs, content management systems, analytics platforms, third-party APIs — change constantly. A field gets renamed in Salesforce. A content type gets restructured in Contentful. An API response adds a new nested object. Any of these can silently corrupt your AI system's inputs.

Robust data pipeline maintenance includes:

  • Schema validation at every ingestion point, with alerts on unexpected changes
  • Data quality checks — completeness, freshness, statistical distribution
  • Contract testing between your system and upstream data sources
  • Automated backfill procedures for when data issues are discovered retroactively
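Schema validation at an ingestion point can be as simple as checking every record against the fields your system actually depends on. The field names below are hypothetical examples of what a CRM export might carry, not a real Salesforce schema:

```python
EXPECTED_SCHEMA = {
    # Assumption: the fields this pipeline relies on from an upstream CRM.
    "account_id": str,
    "mrr": (int, float),
    "plan_name": str,
}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the record passes.

    Run this at every ingestion point and alert when the failure rate
    jumps — a sudden spike usually means an upstream schema change.
    """
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    return problems
```

Libraries like Pydantic or Great Expectations do this with far more depth, but even this level of checking catches the silent renames and type changes that otherwise corrupt model inputs unnoticed.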

5. Continuous Evaluation and Improvement

Maintenance isn't just about keeping things from breaking. It's about making the system better over time.

McKinsey's 2025 State of AI survey found that the transition from pilots to scaled impact remains "a work in progress at most organizations." One key reason: teams build and deploy, but don't establish feedback loops that drive iterative improvement.

A healthy improvement cycle looks like:

  • Weekly eval runs against curated test datasets that grow over time
  • Monthly prompt optimization informed by production performance data
  • Quarterly architecture reviews to evaluate whether new models, tools, or patterns could improve performance or reduce cost
  • Ongoing collection of edge cases that become new test cases, steadily hardening the system
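The last bullet — edge cases becoming test cases — can be sketched as a small harvesting step that appends flagged production failures to the curated eval dataset. The structure and field names here are illustrative:

```python
import json

class EvalSet:
    """Curated eval dataset that grows as production edge cases are flagged."""

    def __init__(self):
        self.cases = []

    def harvest(self, inputs, bad_output, note):
        # Record the inputs that tripped the system, the bad output it
        # produced, and a reviewer note; weekly eval runs replay these
        # inputs so the same failure can't silently recur.
        self.cases.append(
            {"inputs": inputs, "bad_output": bad_output, "note": note}
        )

    def to_jsonl(self):
        """Serialize for storage alongside the application code."""
        return "\n".join(json.dumps(case) for case in self.cases)
```

The mechanism matters less than the habit: every escalation, thumbs-down, or human correction is a candidate test case, and a system whose eval set grows this way gets measurably harder to break.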

What Most Companies Get Wrong

The most common mistake isn't neglecting maintenance entirely. It's treating it as an afterthought — something to figure out after the build.

Here's what that looks like in practice:

No budget for ongoing work. The project budget covers design, development, and launch. There's no line item for the 15–25% annual maintenance that custom AI software demands. Six months later, the system is degrading and there's no team allocated to fix it.

No observability from day one. Monitoring is treated as a "nice to have" rather than a launch requirement. By the time someone notices the system is underperforming, there's no data to diagnose why.

Treating prompts as "set and forget." A prompt that works perfectly at launch will not work perfectly forever. Models change, user patterns change, and edge cases accumulate. Without a prompt maintenance practice, quality erodes silently.

No separation between the AI layer and the application layer. When model logic is tightly coupled with business logic, upgrading one means risking the other. Clean abstractions aren't just good architecture — they're a maintenance necessity.

How We Approach AI Maintenance at Last Rev

We build maintenance into the architecture from the start, not as a phase that comes after launch. Here's what that looks like:

Abstraction by default. Every AI integration goes through an abstraction layer. Model providers, prompt templates, and data sources are all swappable without touching application code. When a client needs to switch from one LLM to another — or use different models for different tasks — it's a configuration change, not a rewrite.

Observability as a launch requirement. No AI feature ships without monitoring for output quality, latency, cost, and error rates. We define what "healthy" looks like during development so we can detect degradation automatically.

Evaluation-driven development. We build evaluation datasets alongside the feature. Every prompt change, every model upgrade, every data pipeline modification runs through automated evals before it reaches production. This catches regressions that manual testing would miss.

Documented runbooks. When something breaks at 2 AM, you don't want to rely on tribal knowledge. Every AI system we build includes runbooks for common failure modes: model provider outages, data pipeline failures, cost spikes, and quality degradation.

Planned improvement cycles. We work with clients to establish regular review cadences — not just "is it broken?" but "could it be better?" New model releases, accumulated user feedback, and production performance data all feed into structured improvement sprints.

The Real Cost of Not Maintaining AI Software

Gartner's concept of "AI debt" captures this well: shortcuts taken during AI development that create hidden ongoing costs. Just like traditional technical debt, AI debt compounds. Unmonitored model drift leads to bad outputs, which lead to lost user trust, which leads to the AI feature being abandoned entirely.

As Forrester notes, organizations need to "maintain the lifecycle — support, development, training, and maintenance — of their AI solutions" to maximize value. AI technical services aren't a nice-to-have; they're essential to getting returns on your AI investment.

The companies that succeed with custom AI aren't the ones that build the most impressive prototypes. They're the ones that plan for what comes after launch.

Key Takeaways

  • AI software degrades even when you don't touch it. Model drift, dependency changes, and data pipeline rot are constant forces working against your system.
  • Maintenance isn't optional — budget for it. Plan for 15–25% of your initial build cost annually for ongoing maintenance and improvement.
  • Observability is a launch requirement. You can't maintain what you can't measure. Build monitoring into every AI feature from day one.
  • Prompts are code. Version them, test them, review them, and assign ownership.
  • Architect for change. Clean abstractions between your AI layer and application layer make upgrades and provider switches manageable.
  • Establish improvement cycles. The best AI systems get better over time because someone is deliberately making them better.

Custom AI software is a living system. Treat it like one, and it compounds in value. Neglect it, and it compounds in debt.

Sources

  1. RAND Corporation — "The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed" (2024)
  2. Gartner — "Gartner Identifies Critical GenAI Blind Spots That CIOs Must Urgently Address" (2025)
  3. Sculley et al. (Google) — "Hidden Technical Debt in Machine Learning Systems" (NeurIPS 2015)
  4. McKinsey — "The State of AI: Global Survey 2025" (2025)
  5. Gartner — "AI Debt: Understanding It, Planning for It, and Paying It Back" (2025)
  6. Forrester — "Navigate The Evolving Landscape Of AI Technical Services" (2025)