At Merapar, we’ve seen it time and again: enterprise AI pilots that start with big ambitions but fail to deliver measurable business value. Most don’t fail because the technology isn’t ready; they fail because organizations treat AI like traditional software. This observation isn’t just ours. A 2025 MIT Media Lab study found that more than 95% of enterprise AI pilots never translate into real business impact. We believe the reason isn’t that the technology falls short, but that companies approach AI with the mindset of traditional, deterministic software projects. The problem? A fundamental misconception about what AI development truly entails.
As a Head of Product or VP of Engineering, you’re used to deterministic systems: press button A, get outcome X. AI doesn’t work that way. Models are probabilistic: they operate on likelihoods, not guarantees. A result may be correct 95% of the time and still surprise you the other 5%. As our CTO, Klaus Seiler, keeps reminding us in internal meetings: “It’s not about code so much anymore, it’s about the evaluation dataset you use to optimize the outcome.” That simple shift changes everything.
AI systems are extremely sensitive to input. A single word in a prompt can completely alter the output. Think Ashton Kutcher’s 2004 film The Butterfly Effect: change one small thing, and the whole story shifts. Or a bit like DiCaprio’s dream layers in Inception, where one small change reshapes the entire world above it. That’s exactly what happens in AI: small changes in data or phrasing can cascade into vastly different results.
You’re no longer in a neat, linear Design → Build → Test → Launch cycle. You’re in a continuous experiment loop, where code, data and evaluation constantly evolve together. To build reliable, scalable AI systems, such as streaming recommendation engines or media-intelligence platforms, you need to adopt three core principles.
The prompt is not magic; it’s your application’s logic. Manage it, version it, and track it like any other codebase. Many companies, Merapar included, are applying this principle in practice: prompts managed like code, versioned, traceable, and tested against evaluation datasets. Several evaluation approaches can complement this: human-in-the-loop, LLM-as-a-judge, or hybrid models. But as we often remind our teams: each judge LLM brings its own prompt, and its own uncertainty. Calibration is part of your QA.
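To make that concrete, here is a minimal sketch in Python of a prompt treated as versioned logic that is only changed once it passes a small evaluation set. The prompt text, the eval cases, the stubbed model call, and the keyword-based judge are all illustrative placeholders; in practice the judge could be a human reviewer or a judge LLM with its own versioned, calibrated prompt.

```python
# Minimal sketch: a prompt treated as versioned application logic,
# validated against a small evaluation set before any change ships.
# The prompt, eval cases, model stub, and keyword judge are illustrative only.

from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: str   # tracked like a code release, e.g. tagged in git
    template: str


PROMPT = PromptVersion(
    version="summarize-ticket-v2.1",
    template="Summarize the following support ticket in one sentence:\n{ticket}",
)

# A tiny "golden" evaluation set: known inputs with expected behaviour.
EVAL_SET = [
    {"ticket": "Playback stops after 10 minutes on Android TV.", "must_mention": "playback"},
    {"ticket": "The billing page shows the wrong currency.", "must_mention": "billing"},
]


def call_model(prompt: str) -> str:
    """Placeholder for your real model client (hosted API, local model, ...)."""
    return f"(model output for: {prompt[:40]}...)"


def judge(output: str, case: dict) -> bool:
    """Simplest possible judge: a keyword check. Swap in human review or an
    LLM-as-a-judge here, remembering that a judge LLM has its own prompt to
    version and calibrate."""
    return case["must_mention"].lower() in output.lower()


def evaluate(prompt: PromptVersion) -> float:
    """Run the full eval set against one prompt version and report a pass rate."""
    passed = sum(
        judge(call_model(prompt.template.format(ticket=case["ticket"])), case)
        for case in EVAL_SET
    )
    score = passed / len(EVAL_SET)
    print(f"{prompt.version}: {score:.0%} of eval cases passed")
    return score


if __name__ == "__main__":
    evaluate(PROMPT)
```

The point of the sketch is the workflow, not the scoring logic: every prompt change gets a version, and every version gets the same evaluation run before it replaces the previous one.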
You can’t trust what you don’t trace. Comprehensive observability, from prompt call to model output, is non-negotiable. Instrumentation should capture not only latency and error rates, but also signals such as semantic accuracy, bias drift, and compliance behavior. Modern frameworks like OpenTelemetry, MLflow, LangSmith, or Galileo AI make this easier than ever. But tools only assist; they’re not the solution. Process and discipline come first.
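As a sketch of what that instrumentation can look like, here is a traced model call using OpenTelemetry’s Python SDK (assuming the opentelemetry-sdk package is installed), exporting spans to the console. The attribute names (prompt.version, eval.judge_score, and so on) are our own illustrative conventions rather than an official schema, and the model call itself is stubbed.

```python
# Sketch: wrapping an LLM call in an OpenTelemetry span so that prompt
# version, output characteristics, and evaluation signals travel on the
# same trace as latency and errors. Attribute names are illustrative.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai.inference")


def traced_model_call(prompt_version: str, prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("prompt.version", prompt_version)
        span.set_attribute("prompt.length", len(prompt))
        output = "stubbed model output"  # replace with your real model client
        span.set_attribute("output.length", len(output))
        # Evaluation signals (e.g. from an async judge) recorded on the same span:
        span.set_attribute("eval.judge_score", 0.92)
        return output


if __name__ == "__main__":
    traced_model_call("summarize-ticket-v2.1", "Summarize this ticket: ...")
```

Swap the console exporter for your observability backend of choice; the discipline of attaching prompt version and evaluation scores to every call is what matters.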
Your evaluation dataset is your ground truth, the benchmark against which every iteration is validated. Google Cloud calls this a golden reference dataset; we call it the backbone of trustworthy AI.
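In practice, that golden dataset works as a regression gate: a new prompt or model version only ships if it scores at least as well as the current baseline on the same fixed cases. A small sketch, with an illustrative file format, scores, and tolerance:

```python
# Sketch: the golden dataset as a regression gate. The JSONL file name,
# score values, and tolerance below are illustrative.

import json


def load_golden_set(path: str = "golden_set.jsonl") -> list[dict]:
    """One evaluation case per line, stored and versioned alongside the code."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def regression_gate(candidate_score: float, baseline_score: float,
                    tolerance: float = 0.01) -> bool:
    """Block a release that regresses beyond the agreed tolerance."""
    return candidate_score >= baseline_score - tolerance


if __name__ == "__main__":
    # Scores would come from an evaluation loop like the one sketched earlier.
    baseline, candidate = 0.93, 0.95
    print("ship" if regression_gate(candidate, baseline) else "hold back")
```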
For the CTO or VP of Engineering, the true ROI of AI isn’t the model’s initial accuracy; it’s the speed of iteration and the reduction of operational risk. By establishing a solid evaluation framework from Day One, you proactively avoid crippling costs later on. When development teams neglect evaluation, they end up “debugging in production,” leading to downtime, loss of customer trust, and technical debt. A 2025 EY survey found that more than 70% of companies deploying AI reported financial losses linked to flawed or biased outputs. EY also notes that organizations with stronger governance and evaluation processes experience significantly fewer issues, reinforcing how critical a structured testing and validation approach is from the start.
So try thinking like scientists: define clear goals, metrics, and feedback loops. You will accelerate what we call the Data Flywheel, which means faster learning, cheaper failure, and continuous improvement.
AI development is no longer just coding: it’s experimentation at scale. The companies that win are those that optimize for iteration speed, not initial perfection. At Merapar, we help teams turn AI experimentation into production-ready systems: reliable, measurable, and fast to evolve. Because in the end, innovation isn’t about being right once. It’s about improving every day.