I've been building software for over two decades, and I've never seen anything change the daily work as fast as AI coding tools have. You can get a working prototype up in hours that would once have taken weeks. You can create complex database queries, utility scripts, test harnesses and so on with simple prompts. This has lowered the threshold for experimentation considerably: I've built things in the past year that I wouldn't even have started before, because the effort just wasn't worth it.
The interesting question isn't what the tools can do. The tools are genuinely valuable, but the real question is what they change about how we work, and how we should rethink our work to get the best out of them. AI makes writing code fast, but it makes specifying the right system the new engineering bottleneck. Unless we focus on getting the specifications and the necessary checks right, we soon find that we ship faster but spend more time debugging, fixing and tuning afterwards. Our pull requests grow larger while reviews get shallower, because there's just too much to check.
The new bottleneck
Judging from the number of social media posts touting how many tens of thousands of lines of code have been written without human intervention, it appears as if we're now back to measuring volume. You know, more PRs merged, more lines. And to be fair, for well-scoped tasks the productivity gains are real.
But here's what I've seen on actual projects: the code comes out fast, and then the team spends the next two weeks figuring out why it doesn't quite work. Not because the code is bad in an obvious way. It compiles, it passes linting, it often even passes the tests (especially if AI wrote those too). It just doesn't do the right thing, or does the right thing in a way that quietly breaks something else. Or it's kind of close, but not quite.
The common assumption was that writing code was the slow, expensive part. That was always a limited view: most project time was never spent writing new code, but debugging, refactoring, testing, integrating and so forth. Now, though, the slow part is knowing whether the generated code is correct and aligned with what the system actually needs. I think this is what many of us haven't adjusted to yet. We're still celebrating that the old bottleneck disappeared, without noticing that a new one took its place.
Why does 90% accurate still fail?
So why do we get variable, close-but-not-quite-right code? The short answer is randomness. The longer answer includes other variables, such as how complete and clear our intent (the specification) was, how well the architecture was defined in the agent instructions, and so forth.
Let's look at some rudimentary back-of-the-envelope numbers first.
Say an AI-assisted step in your delivery pipeline is 90% reliable. Sounds pretty good, right? Meaning that given clear (enough) specifications, we get 9/10 times the correct enough answer.
But software delivery is not a single step. It's a chain: requirements interpreted, architecture decided, code generated, tests written, integration verified, edge cases handled. Chain ten steps at 90% each and your end-to-end success rate drops to about 35%.
That's assuming independent steps for simplicity. In practice the steps are correlated: an early architectural mistake biases everything downstream, and good context can lift accuracy across the board. But the basic point stands, and correlated failures tend to make it worse, not better.
If you've worked with reliability engineering, this is the same thing as MTBF: chain components in series and each one that can fail drags the overall reliability down.
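The back-of-the-envelope math above is easy to check. This sketch assumes independent steps, as the text notes; the numbers and step count are the article's own:

```python
# Chained reliability: if each of n independent steps succeeds with
# probability p, the whole pipeline succeeds with probability p ** n.

def end_to_end_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed."""
    return p ** n

# A 90%-reliable step looks fine in isolation...
print(f"1 step:   {end_to_end_success(0.9, 1):.0%}")   # 90%
# ...but chain ten of them and you succeed only about a third of the time.
print(f"10 steps: {end_to_end_success(0.9, 10):.0%}")  # 35%
```

This is the series-reliability model: every additional step that can fail multiplies the failure opportunities, which is why per-step accuracy that "sounds pretty good" still produces unreliable pipelines.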

A misunderstood requirement produces code that compiles fine but solves the wrong problem. A generated test suite passes because it validates the implementation rather than the intent. Each step looks fine on its own. The system breaks at the seams.
There are strategies to mitigate this, much like what we already do with flesh-and-blood developers: testing, review and so forth. Luckily, much of that can be automated too. Given enough care, you have an amazing test-case generator and code reviewer on your team.
Also, especially for more complex work, viable mitigation tactics include manual checkpoints that reset the chain before errors propagate, restarting, and even looping until we get the answer right. That last one borders on the 'monkeys typing the works of Shakespeare' metaphor: brute force rather than smart engineering.
The shift-left thing
We've been told for years to "shift left": catch problems early, invest in planning, test sooner. The defect-cost curve backs this up: the bulk of the cost of a defect comes from finding it late. Fix it in requirements and it's cheap. Fix it in production and it's expensive.
AI was supposed to help with this. Generate faster, iterate faster, find problems faster. In theory, that should push effort to the left side of the delivery cycle. More time thinking, less time typing.
In practice, I'm seeing something different on many teams. AI compresses the coding phase so much that the effort doesn't just shift left. It splits. You get a bimodal curve: heavy planning on the left (if you're disciplined), and heavy debugging and rework on the right (whether you're disciplined or not).

The coding phase collapsed into almost nothing. Great. But the generated code still needs to be understood, reviewed, integrated, and debugged. And because there's so much more of it, and because you didn't build up the mental context line by line, the rework phase balloons. You didn't write the code, but you still have to own it.
Here's the thing: the same tool that was supposed to shift effort left is also shifting effort right. But the teams that invest on the left (specs, acceptance criteria, architecture) keep the right side under control. They get the speed without the rework. The ones that skip the left-side investment just get both peaks, and wonder why it feels like they're running in circles.
From writing code to specifying what you want
Despite what many seem to believe, I don't think software engineering is dead and engineers are no longer needed. But their daily work will look somewhat different. Instead of pulling tickets from JIRA and implementing them, you can imagine people being more involved in writing those tickets so the AI has a fighting chance of getting them done.
And yes, you still need people to understand, fix and modify what was produced. Quite often the AI speed advantage doesn't even materialize: simple changes are still simple, and often easier and faster to do the old way.
Think about what actually goes wrong when AI-generated code fails. It's rarely a syntax error: AI agents are really good at producing code that compiles and runs. The failures originate from incomplete or vague intent: the code does something, just not the right thing. It handles the happy path but misses the business rule that only matters on the third Tuesday of the month. And it looks good enough that you might not catch it until production. Or it includes something you didn't ask for but the model thought might be useful or related, perhaps because similar apps in its training data had it.
I keep coming back to the same thing: the specification becomes the product. A vague prompt produces vague code. A precise spec with clear acceptance criteria produces something you can actually trust and verify. People spend so much energy debating which model is best. I'd rather spend that energy on writing better specs.
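To make the difference concrete, here's a hypothetical illustration (the feature and its criteria are invented for this example). Instead of prompting "add password reset", the spec might read:

```
Feature: password reset via email

Acceptance criteria:
- A reset link is emailed only to addresses of existing accounts;
  unknown addresses get the same "check your inbox" response
  (no account enumeration).
- The link expires after 30 minutes and is single-use.
- On successful reset, all other active sessions are invalidated.

Out of scope: SMS reset, admin-initiated resets.
```

Each bullet is something a tester, human or AI, can verify against intent rather than against whatever the implementation happens to do.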

This shift from writing code to writing down, very clearly, what needs to be done is harder than it sounds. Although you can and should use AI to help you with it, you'll also need to describe your practices and architecture to a much deeper level than before. You need to review what was produced, in the form of a task list, and verify that it makes sense. That means knowing what "done" looks like before the first line gets generated, and understanding the domain well enough to catch the edge cases AI will miss. This stuff was always at the heart of good engineering. AI just made it impossible to skip.
As a side note, a common misunderstanding is that this 'new' way of working, referred to as Specification-Driven Development in corporate speak, means that you end up in a waterfall process. It doesn't. You can still iterate, and you will.
What this means for organizations
I certainly don't have all the answers about what works and what does not for organizations. But some patterns are getting pretty clear, and the upside for teams that get this right is real.
For engineering leaders, I'd recommend that rather than spending training budgets on "how to write better prompts", you use them to train your people on "how to write better specs" and "how to keep systems coherent as they grow." The teams I've seen get the most out of AI are the ones where these skills were already strong. AI made them even better instead of exposing gaps.
Business leaders should regard AI as a capability multiplier, not a cost-cutting tool. The savings come from building the right thing once, not from building the wrong thing faster.
For practitioners, the ones actually using this stuff, I'd stress that AI is a power tool that needs more skill to use safely, not less. A chainsaw is faster than a handsaw; that's exactly why it needs more training and more attention to where you point it. The developers who get good at specs and systems thinking are going to be more valuable than ever.
What we actually built
So what does this look like in practice? On a recent project we set up a team of specialized AI agents that follow the exact same delivery cycle a human team would. A planner agent breaks down user stories into implementation tasks, what you'd do in a backlog grooming or sprint planning. An engineering agent writes the code based on this plan, preferably broken down to rather small segments. A tester agent designs and runs end-to-end tests, and finally, a reviewer agent checks the code against project conventions and architecture. And after each cycle, a learning loop captures what worked and what didn't, so the next round is better.
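The loop described above can be sketched in a few lines. This is a minimal, deliberately simplified sketch, not the project's actual implementation: the agent functions here are stubs, where a real setup would have each one call an LLM with its own role prompt, context and tools.

```python
def plan(story: str) -> list[str]:
    """Planner agent (stub): break a story into small implementation tasks."""
    return [part.strip() for part in story.split(";") if part.strip()]

def build(task: str) -> str:
    """Engineering agent (stub): produce an artifact for one small task."""
    return f"implementation of: {task}"

def run_tests(artifact: str, task: str) -> bool:
    """Tester agent (stub): validate against the task's intent."""
    return task in artifact

def review(artifact: str) -> bool:
    """Reviewer agent (stub): check against project conventions."""
    return artifact.startswith("implementation of:")

def deliver(story: str, max_retries: int = 2) -> tuple[list[str], list[str]]:
    """Plan, build, test, review; collect lessons for the learning loop."""
    shipped, lessons = [], []
    for task in plan(story):
        for _attempt in range(1 + max_retries):
            artifact = build(task)
            if run_tests(artifact, task) and review(artifact):
                shipped.append(artifact)
                break
            lessons.append(f"retry needed for: {task}")  # feeds the next cycle
    return shipped, lessons

shipped, lessons = deliver("add login form; validate email; show error banner")
print(len(shipped))  # 3
```

The structural point is in `deliver`: each task passes through distinct plan, build, test and review gates, and failures are recorded rather than silently retried, which is what gives the next cycle something to learn from.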

None of this is all that exotic; AI telling AI what to do, and finally checking that it did what it was supposed to do. It's plan, build, test, review, learn. The same loop every functioning software team has been running for decades. We just have different workers doing parts of it now.
The interesting part is that the governance is the same too. The planner needs clear acceptance criteria or it plans the wrong thing. The engineer needs architectural constraints or it reinvents the wheel. The tester needs to validate intent, not just implementation. The reviewer needs standards to review against. Take any of those away and you get the same mess you'd get with a human team that skipped those steps.
So, AI didn't change what good engineering looks like. What's lost in the hype right now is that making this work like a factory is hard, and requires constant attention, real expertise and commitment.
Summary and recommendations
The question I asked in the title wasn't rhetorical. Are your teams making deliberate decisions about what to build and why? Or just generating code as fast as the tools allow and hoping volume turns into value?
Really ambitious AI-assisted development is still new territory. Very few have it figured out, myself included. Getting organized, planned and specification-driven with AI takes considerable time and effort, and a rethinking of how we work at both the team and individual level.
Certainly the tools will keep getting better, and the teams that learn to combine AI speed with real engineering discipline are going to have a serious advantage.
If you want a starting point, here are some things I'd recommend you focus on first:
Write specifications before you prompt. Even a few bullet points of acceptance criteria will dramatically improve what you get back. Vague in, vague out. Make sure cross-cutting concerns and required templates are available for your agents.
Make sure a human regularly reviews what gets done, including under the hood. Not everywhere, but at the seams: requirements, architecture decisions, integration points. Otherwise you end up with a bloated, unmaintainable mess that doesn't fly far.
Treat AI output like a junior developer's pull request. Review it, question it, test it against the intent, not just against whether it compiles. Use deterministic static analysis tools, log analytics and the strictest linters you can get your hands on. And be prepared: this alone won't be enough.
Invest in your team's engineering skills, not just their prompting skills. Things like decomposition (what would be your task list if you were to do this yourself?), system thinking, knowing what "done" looks like.
These are things that separate teams that ship from teams that generate.
Happy agenting!
About the author
Antti Koivisto

Further reading
These themes are explored in depth in Modern Software Engineering with AI by Antti Koivisto, available for free at anttikoivisto.org.
