AI is rapidly transforming software development. Tools like GitHub Copilot introduced AI-powered code autocompletion, and now more advanced autonomous coding agents (e.g. Devin, OpenHands) are emerging. These agents can take high-level instructions and perform multi-step coding tasks, acting almost like a junior developer working asynchronously. The excitement is high – some engineers report 10× productivity boosts – but skepticism exists as well. The difference often comes down to having realistic expectations and using the agents properly. As Robert Brennan (CEO of All Hands AI and co-creator of OpenHands) noted in his ODSC 2025 talk, we need to “cut through the hype” and understand what today’s AI coding agents can actually handle – and what they can’t. In essence, these agents are powerful generalists, but they are not magic – success with them depends on smart usage (clear prompts, proper sandboxing, human oversight).
Importantly, AI coding agents are changing how we spend our time as developers. “Coding” (the manual act of writing code) is starting to “go away” in routine tasks, but software engineering is not going away. Instead, the human role shifts more toward high-level thinking: understanding user needs, defining requirements, architecting solutions, and verifying the AI’s work. The AI excels at the “inner loop” of development – the rapid cycle of writing code, running it, and iterating – but it struggles with the “outer loop”: the big-picture planning and empathizing with end-user or business objectives. In short, we’ll spend less time typing boilerplate code and more time reviewing, guiding, and thinking critically about what needs to be built.
From Autocomplete to Autonomous Agents
Early AI dev tools like Copilot acted as assistive autocompletes, suggesting a few lines of code in your editor. In contrast, modern coding agents possess agency: they can take actions through tools on your behalf. Rather than you writing every line, you might give an agent a one-sentence goal and it will work for 5, 10, or even 15 minutes autonomously, then return with a proposed solution. This represents a shift from synchronous, line-by-line assistance to asynchronous, multi-step problem solving.
Today’s agents can modify codebases, execute commands, run tests, and even browse the web for documentation. In fact, tools like OpenHands (formerly OpenDevin) aim to handle a “full spectrum” of dev tasks – not just coding, but also compiling, running, testing, and researching solutions online. As one reviewer put it, we’ve seen AI evolve from “simple code completion aids to sophisticated systems capable of understanding complex requirements and generating functional applications”. In other words, an agent is more like an autonomous junior engineer, whereas Copilot is more like an autocomplete on steroids.
However, with greater power comes new challenges. Agents operate autonomously, which means they might make several changes or decisions before you see the result. This is incredibly powerful – you can be coding in parallel or focusing on other work while agents churn through tasks – but it also means you must trust but verify their output. They can accomplish a lot more on their own than a simple autocomplete, yet they can also go off track without guidance. That’s why understanding how they work and how to direct them is critical.
How Coding Agents Work (Under the Hood)
At the heart of every coding agent is a loop between an LLM “brain” and the outside world. The agent uses a large language model (LLM) to decide on actions step by step, and it has tools to carry out those actions in a development environment. The typical loop is: the agent reads the current state (code, errors, etc.), the LLM decides “what’s the next action to get closer to the goal?”, then the agent executes that action and gathers the result, feeding it back into the LLM for the next step. This continues until the agent believes it has achieved the goal (or gets stuck/fails).
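To make that loop concrete, here is a minimal sketch in Python. The `Action`, `llm_decide_next_action`, and `execute` names are illustrative stand-ins, not OpenHands APIs; a real agent framework wraps far more machinery around each step.

```python
import subprocess
from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str      # e.g. "run", "edit", "browse", "finish"
    argument: str  # e.g. a shell command or a file diff

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)  # (action, observation) pairs

def llm_decide_next_action(state: AgentState) -> Action:
    """Stand-in for the LLM 'brain': a real agent sends the goal plus the
    action/observation history to a model and parses its chosen next action.
    Here we simply stop so the sketch stays self-contained."""
    return Action(tool="finish", argument="")

def execute(action: Action) -> str:
    """Stand-in for the tool layer: run a shell command and capture both
    output and exit status so they can be fed back to the model."""
    result = subprocess.run(action.argument, shell=True,
                            capture_output=True, text=True)
    return f"exit={result.returncode}\n{result.stdout}{result.stderr}"

def run_agent(goal: str, max_steps: int = 20) -> AgentState:
    """The decide-act-observe loop: decide, execute, observe, repeat
    until the model says it is done or the step budget runs out."""
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        action = llm_decide_next_action(state)
        if action.tool == "finish":
            break
        observation = execute(action)
        state.history.append((action, observation))  # feedback for the next decision
    return state
```

The step budget (`max_steps`) is the crude safeguard that keeps a confused agent from looping forever; production frameworks add cost limits, timeouts, and user confirmation on risky actions.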
Core tools that coding agents use include:
- Code Editor – The agent can read and modify files in the codebase. Modern agents don’t rewrite entire files blindly; instead they apply targeted edits (diffs or find-and-replace operations) to be efficient. They may be provided with an abstract syntax tree or other navigational aids to find where to make changes. This lets the agent handle large codebases by focusing only on relevant sections rather than exceeding token limits.
- Terminal/Command Execution – The agent can run shell commands (for building, testing, running the app, etc.) in a sandboxed environment. Handling the terminal autonomously raises questions like how to deal with long-running commands or parallel processes. Agents like OpenHands support running servers and then hitting them with test requests, managing multiple processes. The key is that any command output or exit status is captured and fed back into the LLM so it knows what happened.
- Web Browsing – When the agent needs external information (documentation, Stack Overflow solutions, library references), it can issue web requests. Rather than dumping raw HTML (which is full of irrelevant markup) into the LLM, a good agent will extract readable content (using techniques like parsing the accessibility tree or converting HTML to Markdown). Some agents allow scrolling through pages or even simulating clicks and form inputs if needed. This capability is actively improving – for example, a recent contribution to OpenHands “doubled [its] accuracy on web browsing”, highlighting how quickly this area is evolving. Effective web integration means the agent can fetch the latest docs or find solutions to novel problems autonomously.
- File System – The agent can create, read, and delete files in its workspace. This is necessary for adding new source files, updating config files, or writing out generated content. Typically the agent’s file access is confined to a specified project directory or a Docker volume mount to avoid messing with anything outside its scope.
- Sandboxing (Safety) – Autonomy is powerful but dangerous without isolation. A coding agent essentially has the powers of a developer on your machine – which could include running destructive commands if something goes wrong. All reputable agent frameworks run the AI in a sandbox (often a Docker container) by default. This ensures the agent can’t, say, accidentally run rm -rf / on your actual system or exfiltrate sensitive data. The sandbox has only the tools and permissions it needs for the task at hand. For example, if the agent needs credentials (GitHub tokens, cloud keys), you provide scoped credentials with minimal privileges. Following the principle of least privilege is essential when granting an AI access to external systems. In short, sandboxing contains the agent’s actions so it can experiment freely without real harm (a minimal sketch of such a sandboxed tool layer follows this list).
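To illustrate the terminal and sandboxing points above, here is a hypothetical sketch of a tool layer that runs every command in a throwaway Docker container with networking disabled and only the project directory mounted, and that applies targeted find-and-replace edits rather than rewriting whole files. The image name and workspace path are assumptions for the example, not part of any particular framework.

```python
import subprocess
from pathlib import Path

WORKSPACE = Path("./workspace")       # assumed project directory mounted into the sandbox
SANDBOX_IMAGE = "python:3.12-slim"    # placeholder image; real agents use a fuller toolchain

def run_in_sandbox(command: str, timeout: int = 120) -> str:
    """Run a shell command inside a disposable container: only the workspace
    is mounted and networking is off, following least privilege."""
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",
         "-v", f"{WORKSPACE.resolve()}:/workspace",
         "-w", "/workspace",
         SANDBOX_IMAGE, "sh", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    # Output and exit status are both returned so the LLM knows what happened.
    return f"exit={result.returncode}\n{result.stdout}{result.stderr}"

def apply_edit(relative_path: str, old: str, new: str) -> str:
    """Targeted edit: replace one snippet in one file instead of regenerating
    the whole file, which keeps token usage and blast radius small."""
    path = WORKSPACE / relative_path
    text = path.read_text()
    if old not in text:
        return f"ERROR: snippet not found in {relative_path}"
    path.write_text(text.replace(old, new, 1))
    return f"edited {relative_path}"
```

Even this toy version shows why an agent can experiment safely: the worst a bad command can do is trash a disposable container and a project directory you keep under version control.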
Understanding this loop and toolset is more than just technical trivia – it helps you develop an intuition for the agent’s behavior. For instance, knowing that the agent reads feedback from tests or error messages means you should run the tests early so the agent sees failures and can fix them. Knowing it browses the web means if it’s stuck, maybe the docs online could help. Essentially, you are managing an AI coworker who has a very fast but literal mind: it will do exactly (and only) what you ask and what its tools allow.
Best Practices for Using Coding Agents
Given how they work, what can we do to get the most out of coding agents? Brennan’s talk and community experiences suggest several best practices to use these agents effectively without creating noise or tech debt:
- Start Small and Simple: In the beginning, give the agent small, contained tasks rather than an entire project. The ideal tasks take maybe one commit’s worth of changes and have a clear definition of done (e.g. “all tests pass” or “merge conflict resolved”). This way, the agent can verify it succeeded (by running tests, etc.) and you can easily check the result. Good starter tasks are often the tedious chores developers dislike – for example, fixing a single failing test, resolving lint errors, or updating a config file. These tend to be straightforward and repetitive, which AIs handle well, and you can quickly confirm the fix is correct.
- Be Explicit in Instructions: When prompting an agent, clarity is crucial. Don’t just state what you want done – also how you want it done. Mention specific frameworks, function names, or files relevant to the task. For instance, instead of saying “implement user login,” you might say “Implement a user login using the existing AuthService class and following the pattern used in loginController.js (use JWT tokens). Add corresponding unit tests.” Specific guidance prevents the agent from taking misguided paths, and it also speeds it up (it doesn’t waste time searching the entire codebase if you point to the right spot). Investing a minute to write a detailed prompt can save the agent from ten minutes of flailing.
- Iterate and Refine: Treat each agent attempt as a draft. If the agent’s first attempt isn’t satisfactory, you can give follow-up instructions in the same session to refine it. Agents remember the conversation context (to a limit) and can adjust their output. If things go really off track, don’t be afraid to stop and reset with a new approach. One advantage of AI-generated code is that code is cheap – you haven’t sunk hours into writing it yourself. You can throw away a bad AI attempt without much loss. This “easy come, easy go” dynamic encourages experimentation and rapid prototyping. In fact, Brennan noted that sometimes he’ll spin up an agent to scaffold an idea he had on his commute – if it works, great; if not, he can discard it and try something else, no harm done.
- Always Review and Test the Output: Never blindly merge code from an agent. As useful as these tools are, they make mistakes and can introduce subtle bugs or messy code if unchecked. Brennan warns against trying to “vibe code” a complete production application with no human oversight – that’s a recipe for accumulating massive tech debt (duplicate code, poor structure, hidden bugs). Always do a code review on what the agent produces. Ensure that you understand the changes and that they align with your intent. Run the code yourself or in a test environment to verify it actually solves the problem. In practice, this means keeping a human-in-the-loop for quality control. Some teams even require that any AI-authored pull request is attributed to a human engineer (for example, OpenHands PRs are assigned to the dev who invoked the agent, not to “OpenHands” itself) so that a human is accountable for getting that PR merged and fixing any issues. This policy reinforces that the engineer is responsible for the AI’s output.
- Leverage Tools Like Tests and Linters: To aid the agent and your review, make sure you have a solid suite of automated tests and linters. Agents love having tests – tests provide a clear success criterion (pass/fail) that the agent can aim for. If you ask an agent to fix a bug or add a feature, ask it to also run the tests to confirm everything passes. This not only guides the agent but also gives you confidence in the result. Similarly, if you have linters or formatters, you can have the agent run those, so it adheres to style guidelines automatically. Essentially, the more feedback you can provide to the agent (via failing tests, error logs, etc.), the better it can correct and converge on a good solution; a minimal example of a test used as a target follows this list.
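As a concrete example of giving the agent a pass/fail target, you might hand it a failing test like the one below and ask it to make the suite green. The module and function names (`app.config`, `parse_memory_limit`) are hypothetical, chosen purely for illustration.

```python
# test_parse_memory_limit.py
# A failing test that defines "done" for the agent: its task is to implement
# parse_memory_limit() in the (hypothetical) app/config.py so these pass.
import pytest

from app.config import parse_memory_limit  # hypothetical module the agent must create

def test_parses_mebibytes():
    assert parse_memory_limit("512Mi") == 512 * 1024 * 1024

def test_parses_gibibytes():
    assert parse_memory_limit("2Gi") == 2 * 1024 ** 3

def test_rejects_unknown_units():
    with pytest.raises(ValueError):
        parse_memory_limit("10 potatoes")
```

The agent can run pytest in its sandbox, read the failures, and iterate until the output is green – the same feedback loop described earlier.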
By following these practices – starting with well-scoped tasks, giving explicit guidance, iterating as needed, and rigorously reviewing – you set the agent (and yourself) up for success. Used this way, agents become a powerful accelerator rather than a source of noise.
Tasks Where AI Agents Excel
So, what kinds of tasks are coding agents especially good at today? Based on Brennan’s experience with OpenHands and reports from the community, several “day one” use cases stand out as big wins:
- Resolving Merge Conflicts: This is a classic developer headache that agents handle very well. Merge conflicts often involve rote reconciliation of two changes. An agent can parse the conflict markers, understand the intent from each side, and merge code logically. Brennan noted that on the fast-moving OpenHands codebase, no pull request escapes conflict-free, and yet having the agent auto-merge those conflicts saves tons of time. It correctly resolves the vast majority of conflicts by analyzing what changed where, sparing you the manual diff comparison.
- Addressing Code Review Feedback: If a teammate leaves clear comments on a PR (“please rename this variable for clarity” or “use X library function here instead of custom code”), you can instruct the agent to apply all that feedback. The agent can read the review comments and make the requested changes across the codebase. This is a great use case because the desired outcome is explicitly described in human language by the reviewer – essentially a ready-made prompt. Brennan gave an example where a front-end expert requested specific React changes that he (Brennan) wasn’t familiar with; OpenHands was able to implement those exact changes properly based on the review notes. The agent acts like an obedient pair programmer who does exactly “what that reviewer said.”
- Bug Fixes & Simple Feature Edits: For small bugs where the problem is localized (e.g. “the input should be a number, but it’s currently treated as text”), agents are very effective. Rather than you grepping through the code, you can ask the agent to find and fix the issue. It will locate the relevant file, apply the fix (changing an <input> to type="number", for instance), and even update or add a test if asked. This can be done via a quick prompt (even via a chat interface or CLI) without loading up your whole IDE. It’s the convenience of having a junior dev on call to handle minor issues.
- Infrastructure as Code & Config Changes: Tasks like tweaking a Terraform configuration, updating a CI pipeline, or increasing a server memory limit are often straightforward but require looking up syntax or config keys. Agents shine here: they often “know” the syntax (thanks to training on docs) or can quickly fetch it. You might say “increase the memory limit of our Kubernetes pod in the config,” and the agent will update the YAML/JSON or Terraform accordingly. It handles the boilerplate and ensures you don’t forget any linked setting. As Brennan mentioned, when an out-of-memory alert popped up, they could just tell OpenHands to bump the memory setting and it handled the change across the infra code.
- Database Migrations: Writing migrations (adding a column, creating an index, etc.) is another repetitive task where agents do well. An AI can generate the migration script (SQL or using an ORM’s migration tool) following best practices, because it has seen many examples. Interestingly, Brennan found that AI often exceeds humans at following best practices here – it will faithfully add indices, foreign keys, or not-null constraints that a human might overlook when rushing. The agent basically acts as a meticulous DBA assistant (a sketch of the kind of migration it produces follows this list).
- Increasing Test Coverage: If you notice a part of your code lacks tests, you can ask the agent to generate tests for it. This is a low-risk task (since it doesn’t change production code, only adds tests) and thus safe to hand off to AI. The agent can write unit tests or integration tests based on how it understands the code’s expected behavior. It’s a fast way to improve your test suite. As long as the tests pass (and ideally, you quickly inspect that the tests make sense), this is a nice offloading of grunt work. It’s worth noting that writing tests also forces the agent to think through the code’s behavior, which can indirectly surface hidden assumptions or bugs.
- Scaffolding New Modules or Apps: Agents are surprisingly good at creating the initial skeleton of an app or module. You can say “set up a new microservice that does X” and the agent will lay down a basic structure – e.g. create a new directory, add boilerplate code, perhaps a README, etc. This is especially useful for internal tools or prototypes where speed matters more than perfection. Brennan’s team uses OpenHands to spin up small internal apps (for example, a debugging interface for agent sessions) very quickly. However, caution: for production-facing systems, you’ll still want to heavily review and polish any AI-generated scaffolding. The agent gets you 80% of the way in minutes, but that last 20% (making it production-grade) is on the humans. Think of this as a way to beat writer’s block on a blank project – the agent gives you a starting point that you can then refine.
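For the database-migration use case, the output often takes the shape of a standard Alembic revision like the hedged sketch below. The table, column, and revision identifiers are invented for illustration; the point is the boilerplate, including the index a rushed human might forget.

```python
"""Add last_login_at to users (illustrative, agent-style Alembic migration)."""
from alembic import op
import sqlalchemy as sa

# placeholder revision identifiers
revision = "a1b2c3d4e5f6"
down_revision = "f6e5d4c3b2a1"

def upgrade():
    op.add_column(
        "users",
        sa.Column("last_login_at", sa.DateTime(timezone=True), nullable=True),
    )
    # Index the new column since it will be filtered and sorted on frequently.
    op.create_index("ix_users_last_login_at", "users", ["last_login_at"])

def downgrade():
    op.drop_index("ix_users_last_login_at", table_name="users")
    op.drop_column("users", "last_login_at")
```

As with any agent output, a human still reviews the migration (and runs it against a staging database) before it ships.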
All these use cases are well-bounded tasks with clear success criteria (tests pass, configuration updated, etc.), which is why agents excel at them. They relieve developers of tedious or formulaic work, allowing us to focus on harder problems. Many teams report that once they trust an agent for these chores, it frees up a lot of time and mental energy.
Tasks That Remain Challenging for Agents
On the flip side, there are certain things that coding agents are not so good at (yet). It’s crucial to recognize these limitations to avoid frustration or disaster:
- Big-Picture Design and Requirements: Agents are not product managers or senior architects. They lack true understanding of user needs or business context. They cannot empathize with an end-user’s pain points or decide what features make sense for the business strategy. For example, an agent won’t intuit that a fast-and-dirty solution might hurt maintainability, or that a certain feature might confuse users – it just follows instructions. High-level tasks like deciding the overall architecture of a new system, performing trade-off analyses, or interpreting ambiguous requirements are best left to humans, now and likely in the future. AI can assist with implementation details, but defining the problem and ensuring the solution makes sense require human judgment.
- Open-Ended Problem Solving: If your prompt is vague or the goal is open-ended (“make the app better” or “optimize this codebase”), the agent will flail. Agents need a well-defined target. They excel at converging toward a known-good outcome (passing tests, matching a spec). When the task is exploratory or creative without clear criteria, the agent might produce irrelevant output or get stuck in loops. For instance, asking an agent to design a new feature with no further guidance could lead it to build something misaligned with what you actually wanted. Ill-defined tasks = poor results. Until AI gains more true understanding, humans must break down big problems into concrete sub-tasks for the agent.
- Maintaining Coherence in Large Codebases: LLM-based agents have limited memory (context window) and can struggle to keep a whole large project in mind. They may make changes in one file that conflict with assumptions in another file if those weren’t within the prompt context. As one engineer put it, LLMs often “generate inconsistent, poorly structured code and struggle to maintain coherence across multiple files” when left unguided. In a big codebase, an agent might not realize that the approach it’s taking in one module violates a convention used elsewhere. Humans are still better at holistic understanding of a complex system’s architecture. Agents work best when you can point them to a specific part of the code; if they have to refactor or update dozens of files at once, you’ll need to carefully supervise or partition the work.
- Long-Running Autonomy Without Checkpoints: The longer you let an agent run without supervision, the more chances for error accumulate. Agents don’t have an “executive self-correction” beyond what the LLM can infer, so they might go down a wrong path repeatedly. Brennan’s advice is to keep tasks short and verify frequently. Research has shown that partial autonomy – giving the user an “autonomy slider” – tends to work better than fully hands-off approaches. In practice, this means if an agent hasn’t made progress after a few minutes or a few iterations, it’s wise to step in, check its intermediate output, and possibly reset or guide it anew. Fully autonomous multi-hour coding sprees sound exciting but often end in the agent getting confused or producing low-quality code that must be unwound.
- Quality of Generated Code (Tech Debt Risk): While agents can write a lot of code quickly, quantity is not quality. If used naively, they can introduce subtle bugs, security issues, or just messy code that “works” but is hard to maintain. For example, an agent might duplicate logic in two places because it didn’t realize they were related, or use an inefficient approach that technically passes tests but wouldn’t scale. As AI engineer Alberto Fortin cautioned, relying on an agent without understanding the code yourself is “a recipe for disaster”. Over-reliance on AI can even erode your own skills over time. Therefore, treat agent-written code with the same scrutiny you’d treat a human junior developer’s code. Don’t assume it’s elegantly structured or optimal. Often, some refactoring by a human will be needed to keep the codebase clean.
In summary, coding agents are not a substitute for human insight. They are incredibly useful assistants for well-defined programming tasks, but they are not ready (and may never be ready) to be lead engineers owning a project’s direction. The successful pattern is to let the agent handle the heavy lifting on clearly-scoped problems, while you provide the vision, guidance, and critical review.
Evolving Capabilities by 2025 and Beyond
The pace of improvement in AI agents is rapid. By 2025 and beyond, we can expect them to successfully tackle some of the tasks that are challenging today, though with caveats. For instance, agents might get better at multi-step reasoning across a codebase as context window sizes grow and architectures improve. This could help with more coherent large-scale refactors or updates handled by AI. Tool use is also getting more sophisticated – as seen with web browsing improvements and possibly GUI interactions – meaning agents will have richer ways to gather context and verify their work. We may see agents taking on slightly higher-level tasks, like coordinating multiple microservices changes or performing end-to-end testing flows, which today still often require a person in the loop.
However, it’s also likely that human oversight will remain essential. Even if the agents become more reliable, having a human to double-check critical changes is just prudent engineering. The goal is not to remove humans from the process, but to amplify our productivity. In Brennan’s vision, the future might involve developers orchestrating multiple agents (an “AI development team”) to work in parallel on different tasks, while the human acts as a tech lead supervising the swarm. This is a fundamentally different way of working – more about supervising and integrating, less about hand-coding everything – but it still requires skill and judgment on the part of the engineer.
As one observer succinctly put it, “It’s not about the AI being smart – it’s about being smart about the AI.” By staying up-to-date on best practices (prompt engineering, tool setup, etc.) and maintaining strong software engineering fundamentals, developers can ride this wave rather than be drowned by it. The companies and engineers who figure out how to effectively partner with AI agents will likely outperform those who don’t, especially on routine tasks, but creative design and complex problem-solving will remain uniquely human fortes for the foreseeable future.
Conclusion
AI coding agents like OpenHands are proving to be valuable additions to the developer toolkit, automating away mundane tasks and accelerating the development cycle. When used wisely, they enable a workflow where you “code less, but accomplish more.” Mundane chores – from merge conflict resolution to writing boilerplate tests – can be offloaded to a tireless AI assistant, freeing you to focus on higher-level design and critical thinking. What works today is leveraging agents for well-scoped, verifiable tasks and using them as junior collaborators under your guidance. What doesn’t work is expecting the agent to magically handle ambiguous or strategic work, or turning it loose without oversight – that path leads to bugs and maintainability nightmares.
The key takeaway is balance: combine the speed and breadth of AI with the judgment and intuition of humans. Brennan’s lessons echo this – success with coding agents “depends on smart prompts, sandboxing, and human review”. In practice, that means treating the agent as a powerful tool that still needs a skilled operator. By understanding both its capabilities and its limits, software engineers can harness AI agents to dramatically boost productivity while still maintaining quality and control. The developers who master this partnership will navigate the changing landscape of software development most effectively, allowing them to build more and better software – with a little help from their AI friends.
Sources:
- Brennan, R. (2025). AI Software Engineering Agents: What Works and What Doesn’t – Talk at ODSC East 2025
- Ponomarev, M. (2025). OpenHands: The Open Source Devin AI Alternative – Apidog Blog
- Steinberger, P. (2025). Essential Reading for Agentic Engineers – Blog post
- Fortin, A. (2025). A Cautionary Perspective on AI Coding – Blog post