What I Learned After a Year of Building Production-Grade AI Agents
Hi, Sarah Hirschfield here, founder and engineer at Carly AI, the personal assistant for calendar management. I wanted to take a few moments to reflect on my experience building production-grade agents that now serve many businesses across the globe. I hope improvements to large language models make these comments obsolete, but in case they don't, I'm sharing this to help other engineers and product people as they develop AI and agentic applications. This is the stuff I wish I had known earlier, and I had to learn it the hard way.
Hallucinations Are Not the Problem
There's a lot of talk about hallucinations being the big problem, but I think that's overstated. With a modest amount of engineering work, you can dampen the effect of hallucinations considerably.
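To illustrate the kind of engineering work I mean, here is a minimal sketch (the function, field names, and attendee list are hypothetical, not Carly's actual code): ask the model for structured output, then validate every field against data you already trust before acting on it.

```python
import json

# Hypothetical sketch: never act on free-text model output. Ask the model
# for JSON, then check each field against ground truth we already hold.
# KNOWN_ATTENDEES stands in for real calendar/contact data.

KNOWN_ATTENDEES = {"alice@example.com", "bob@example.com"}

def parse_and_validate(model_output: str) -> dict:
    """Parse the model's JSON reply and reject hallucinated fields."""
    event = json.loads(model_output)          # malformed JSON fails loudly
    for attendee in event.get("attendees", []):
        if attendee not in KNOWN_ATTENDEES:   # model invented a person
            raise ValueError(f"hallucinated attendee: {attendee}")
    if "start" not in event:
        raise ValueError("missing start time")
    return event
```

Nothing here is clever; the point is that a deterministic check between the model and the calendar catches a large share of hallucinations before they reach a user.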
The bigger problem, in my experience, is the models’ lack of common sense. People talk about this in the academic literature, but I’m really just speaking as a practitioner.
How Common Sense Failures Show Up in Scheduling
In a scheduling context, this shows up in the kinds of mistakes the model makes—mistakes where, if a human made them, you’d be like: whoa, that person has no common sense.
For example: the LLM doesn’t reliably understand that if people are scheduling something, it probably isn’t going to be in the past. Humans automatically get that—if we’re trying to get together, the date we’re choosing is going to be in the future. But the model won’t always understand that. If there’s evidence of a past date in the thread—like someone referencing an earlier meeting—it can decide, “Oh, they must be trying to schedule in the past.” That’s a common sense failure.
Social Norms Are Hard for LLMs
LLMs don't understand social norms. One example: what should you say to the other person you're coordinating with to schedule a meeting? It might not seem like a big deal, but it is, because you can inadvertently offend someone if you're not careful. To be concrete: we might have a rule like "default meeting length is 30 minutes," and the model often wants to say that explicitly to the recipient, supposedly for transparency: "This will be a 30-minute meeting." Some people don't care. But for others, it can come off poorly, like you're rubbing it in their face that they're not a priority. A human usually wouldn't phrase it that way.
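For norms like this, one cheap approach is to enforce the rule deterministically after generation rather than hoping a prompt instruction sticks. A minimal sketch (the pattern list and function are illustrative assumptions, not Carly's actual rules): lint drafted messages for phrasing you've decided is off-limits, and regenerate when a draft trips a rule.

```python
import re

# Hypothetical sketch: lint model-drafted scheduling messages for
# phrasing our style rules disallow, e.g. spelling out the default
# meeting length to the recipient.

DISALLOWED_PATTERNS = [
    re.compile(r"\b\d+\s*-?\s*minute meeting\b", re.IGNORECASE),
]

def style_violations(draft: str) -> list[str]:
    """Return the disallowed phrases found in a drafted message."""
    return [m.group(0) for p in DISALLOWED_PATTERNS for m in p.finditer(draft)]
```

A regex list won't capture tone in general, but it reliably blocks the specific phrasings you've already identified as problematic.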
These are the kinds of things that are genuinely shocking when you first start building with LLMs: they can generate fluent language, but they don’t naturally apply the basic human understanding of a situation, or the social norms around what’s appropriate to say.
AGI and the Common Sense Gap
In the academic literature, people sometimes treat the common sense gap as the thing that will hold us back from AGI. The models can pull in lots of information and put together words that sound fine, but they still struggle to act—in the sense of speech acts—the way a human would.
What this means for us as builders is not revolutionary: working with LLMs requires understanding the use case of your customers inside and out. Only then can you know how to guide the model to behave in a way that makes sense.
See you last year!
Sarah


