New Waterloo study shows top AI coding tools still err about 25% of the time on structured outputs, anyone else seeing this in daily use?

Procuder
· AI News & Releases

Yo. Just came across some fresh research from the University of Waterloo, and it feels like an important reality check for all the hype around AI coding assistants. The team benchmarked 11 different LLMs (including the large closed models from OpenAI, Google, and Anthropic) on their ability to produce well-structured outputs – think JSON, XML, Markdown, etc. – across 44 tasks.

The headline result: even the best models only hit around 75% accuracy. That means they fail roughly 1 in 4 attempts to follow the formatting rules and deliver accurate results. Open-source models scored even lower, closer to 65%. They did okay on pure text tasks but really fell apart on anything involving images, video, or generating websites.

What stood out to me was the quote from PhD student Dongfu Jiang (co-first author): they weren’t just checking if the syntax looked right, but whether the actual output was correct and usable. The conclusion is pretty clear – structured outputs are a great step forward for making AI easier to plug into real workflows, but we’re nowhere near the point where you can trust these tools to run without serious human supervision.

I’ve been using Cursor, Claude, and Copilot quite a bit for my own projects. Sometimes it feels like magic when it nails a complex function on the first try, but I’ve definitely caught it quietly inventing APIs, breaking JSON schemas, or producing code that “looks right” until you actually run it. This 25% error rate lines up with some of the frustrating debugging sessions I’ve had lately.

The paper is called “StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs” and it’s heading to ICLR 2026. Full story here if you want the details: <https://www.miragenews.com/ai-coding-tools-err-25-of-time-1638285/>

Curious what the rest of you think. Are you still reviewing every line of AI-generated code, or have you started trusting certain tools for specific tasks? Has anyone found workflows that reliably cut down the error rate (prompt engineering tricks, validation layers, etc.)? Or does this study make you more cautious about leaning on AI for production code?
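For what it's worth, the validation-layer approach has been the biggest win for me. Here's a rough sketch of the pattern I mean – parse, check the shape, and re-prompt on failure. Everything here is illustrative (the field names, the retry count, the `call_model` hook are all made up), not anything from the paper:

```python
import json

# Hypothetical schema: field names and types are just for illustration.
REQUIRED_FIELDS = {"name": str, "version": str, "dependencies": list}

def validate_output(raw: str) -> dict:
    """Parse model output and reject anything that breaks the expected shape."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for {field}")
    return data

def generate_with_retries(call_model, prompt, max_attempts=3):
    """Re-prompt the model until its output passes validation.

    call_model is whatever function wraps your LLM API of choice;
    it just needs to take a prompt string and return a string.
    """
    for _ in range(max_attempts):
        try:
            return validate_output(call_model(prompt))
        except ValueError as err:
            # Feed the failure back so the model can self-correct.
            prompt += f"\nYour last output was invalid ({err}). Return only valid JSON."
    raise RuntimeError("model never produced valid output")
```

It doesn't fix the underlying error rate, but it turns silent schema breaks into loud, retryable failures, which is the part that actually burns you in production.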

Would love to hear real experiences from people shipping stuff daily.

Cheers,
