Quick context before we dive in: I came to software engineering from tech sales.
Spent years on the other side—sending cold emails, booking meetings, trying to hit quota.
Most of my project ideas come from that world.
I'm building the tools I wish I'd had back then, except now I actually know how to code them.
I'm working on an email generator for a hypothetical conference outreach campaign.
It needs to create personalized emails at scale, and those emails need to be good.
The question that's been eating at me: how do you validate AI-generated content reliably?
My first instinct was to use AI to judge it.
It was also completely wrong for my use case, and it took me hours of debugging to figure out why.
The Experiment: AI Judging AI
Here's what I tried:
Setup:
Claude Sonnet generates personalized email (subject + body)
GPT-4o-mini judges the output and gives a score 0-100
If score < 70, regenerate and try again
Iterate until good enough
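In code, the loop was roughly this shape (a sketch, not the actual implementation; `generate` and `judge` stand in for the Claude Sonnet and GPT-4o-mini calls, and the iteration cap is an assumption):

```python
from typing import Callable, Dict, Tuple

def optimize_with_ai_judge(
    generate: Callable[[Dict], Dict],           # wraps the Claude Sonnet generation call
    judge: Callable[[Dict], Tuple[int, str]],   # wraps GPT-4o-mini, returns (score 0-100, feedback)
    prospect: Dict,
    threshold: int = 70,
    max_iterations: int = 20,
) -> Dict:
    """Regenerate until the judge's score clears the threshold, or give up."""
    best_email, best_score = None, -1
    for _ in range(max_iterations):
        email = generate(prospect)
        score, feedback = judge(email)
        if score > best_score:
            best_email, best_score = email, score
        if score >= threshold:
            break
        # Feed the judge's critique into the next generation attempt
        prospect = {**prospect, "judge_feedback": feedback}
    return best_email
```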
Expected result: Emails improve with each iteration, scores climb to 90+
Actual result:
Iteration 1:
Subject: "Site Progress" (clearly 2 words)
Judge: "Subject has 4 words, needs 2-3"
Score: 45/100
Wait, what?
Iteration 2:
New subject: "Project Update" (also 2 words)
Judge: "Missing booth number" (it's right there in line 3)
Score: 52/100
Iteration 8:
Judge: "Body is 48 words, exceeds 45 max" (I counted: 43 words)
Score: 63/100
Iteration 15:
Still bouncing around 55-65/100
Judge contradicting itself between runs
No upward trajectory at all
The AI judge couldn't count words correctly.
It hallucinated missing elements. It scored the same email differently on consecutive runs.
Judge accuracy: ~40% (I validated this by manually checking its claims)
Why This Matters
When your validator is wrong 60% of the time, iterative improvement is impossible.
You're not optimizing—you're just regenerating randomly and hoping to get lucky.
The real problem was more fundamental: I was using AI for a task that didn't need AI.
The Pivot: Just Write Some Python
I had specific rules for what makes a good email:
Subject: 2-3 words, Title Case, no punctuation
Body: 45-55 words, starts with "Hey [FirstName],", max 2 commas, no jargon
If I can write these rules down explicitly, why am I asking an AI to approximate them?
So I deleted the AI judge and wrote this instead:
```python
issues = []

# Actual word counting
words = subject.split()
word_count = len(words)
if word_count < 2 or word_count > 3:
    issues.append(f"Subject has {word_count} words, needs 2-3")

# Real Title Case check
expected = " ".join(word.capitalize() for word in words)
if subject != expected:
    issues.append("Not perfect Title Case")

# Character-level punctuation detection
if any(char in '.,;:!?-' for char in subject):
    issues.append("Contains punctuation")
```

Results after switching:
Judge accuracy: 95%+ (no more hallucinations)
Average score: 92/100 (up from 57-65)
Convergence: 1-3 iterations (down from never)
The improvement was immediate and dramatic. I felt dumb for not doing this from the start.
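The body rules got the same deterministic treatment. Here's a sketch of what those checks look like (paraphrased from the same idea, not copied from the repo; the jargon list is my own placeholder):

```python
def check_body(body: str, first_name: str) -> list:
    """Deterministic checks for the body rules: 45-55 words, greeting, max 2 commas, no jargon."""
    issues = []

    word_count = len(body.split())
    if not 45 <= word_count <= 55:
        issues.append(f"Body has {word_count} words, needs 45-55")

    if not body.startswith(f"Hey {first_name},"):
        issues.append("Body must start with 'Hey [FirstName],'")

    comma_count = body.count(",")
    if comma_count > 2:
        issues.append(f"Body has {comma_count} commas, max is 2")

    # Blocklist of jargon words; the exact list here is an assumption
    jargon = {"synergy", "leverage", "optimization", "streamline"}
    found = sorted(w for w in jargon if w in body.lower())
    if found:
        issues.append(f"Contains jargon: {', '.join(found)}")

    return issues
```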
But Wait—What About AI Judges?
After sharing this, someone asked me a good question: "So what's the point of AI judges if they're not good?"
The answer isn't "AI judges are bad"—it's "I was using them wrong."
Here's what I'm realizing:
AI judges fail at objective tasks:
Counting words
Checking formatting
Verifying required elements exist
Anything with a clear right/wrong answer
AI judges might excel at subjective tasks:
"Does this sound natural?"
"Is the tone appropriate for this audience?"
"Does this feel authentic or generic?"
"Would a human find this persuasive?"
The problems start when we reach for AI judges on tasks in the first category, where simple logic already does the job perfectly.
A Possible Hybrid Architecture
I'm considering adding an AI judge back—but only for the subjective layer.
Stage 1 - Structural Gate (Logic):
Word counts ✓
Format rules ✓
Required elements ✓
If this fails, immediately regenerate—no AI call needed
Stage 2 - Quality Assessment (AI):
Tone and naturalness
Persuasiveness
Authenticity
Personalization quality
Only evaluate emails that pass structure
This way:
Logic handles what it's great at (objective rules, instant, free, accurate)
AI handles what logic can't (subjective judgment, nuance, context)
They don't compete; they complement each other
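A sketch of how the two stages might compose (nothing here is built yet; `quality_score` is a hypothetical wrapper around an LLM call, and the 80 cutoff is arbitrary):

```python
from typing import Callable, Dict, List

def evaluate_email(
    email: Dict,
    structural_check: Callable[[Dict], List[str]],  # Stage 1: deterministic rule violations
    quality_score: Callable[[Dict], float],         # Stage 2: hypothetical LLM wrapper, 0-100
    quality_threshold: float = 80.0,                # assumed cutoff, not a tested number
) -> Dict:
    # Stage 1: free, instant, accurate. Fail fast with no AI call.
    issues = structural_check(email)
    if issues:
        return {"passed": False, "stage": "structure", "issues": issues}

    # Stage 2: only spend an LLM call on emails that already pass the rules
    score = quality_score(email)
    return {"passed": score >= quality_threshold, "stage": "quality", "score": score}
```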
The Turn-Based Learning Part
Once I had reliable judging, the next problem was: how do you improve a 72/100 email efficiently?
Naive approach: Regenerate the whole thing.
Better approach: Figure out what's specifically wrong and fix only that part.
I built what I'm calling a "turn-based optimizer."
Each turn:
Judge the current email precisely
Identify specific failures (subject has punctuation, body too long, contains "optimization")
Categorize by component (subject vs body issues)
Regenerate only the broken part with targeted prompts
Track what didn't work to avoid repeating mistakes
The system maintains memory across attempts:
```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class OptimizationSession:
    tried_subjects: Set[str]        # Don't regenerate these
    failed_approaches: Set[str]     # Avoid these patterns
    successful_patterns: List[str]  # Learn from these
```

Example sequence:
Turn 1:
Subject has punctuation, body has jargon
Score: 72/100
Regenerate both with specific constraints
Turn 2:
Subject fixed, body perfect
Score: 94/100
Done ✓
Average convergence: 1-3 turns to hit 90+/100.
This only works because the judge gives accurate feedback.
With the hallucinating AI judge, targeted fixes were impossible—the system couldn't tell what was actually broken.
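Put together, one turn looks roughly like this (a sketch of the idea, not the repo's code; `check_subject` is assumed to wrap the structural checks from earlier, and `regenerate_subject`/`regenerate_body` stand in for the targeted-prompt LLM calls):

```python
def run_turn(email: dict, session: OptimizationSession) -> dict:
    """One turn: judge each component precisely, regenerate only what failed."""
    subject_issues = check_subject(email["subject"])
    body_issues = check_body(email["body"], email["first_name"])

    if subject_issues:
        session.tried_subjects.add(email["subject"])        # never offer this subject again
        session.failed_approaches.update(subject_issues)    # remember why it failed
        email["subject"] = regenerate_subject(email, subject_issues, session)

    if body_issues:
        session.failed_approaches.update(body_issues)
        email["body"] = regenerate_body(email, body_issues, session)

    if not subject_issues and not body_issues:
        session.successful_patterns.append(email["subject"])  # keep track of what worked

    return email
```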
What I'm Taking Away (So Far)
1. Use the simplest thing that works
I defaulted to AI because it felt modern.
But when you have deterministic rules, write deterministic code.
Don't approximate something you can calculate exactly.
2. Accurate feedback is everything for iteration
Whether it's AI or logic doing the validation, if your feedback is wrong 40% of the time, you can't build a learning system.
You just have expensive randomness.
3. Maybe there's still a place for AI judges
Just not where I was using them.
I'm still figuring out if the subjective quality layer is worth building, or if it's solving a problem I don't actually have.
4. The right abstraction matters
Framing this as "turns" with memory between attempts made the learning behavior obvious to implement.
What's Next
Things I'm exploring:
Whether to build that Stage 2 AI quality layer or if structural validation is enough
Testing this pattern on other content types (support responses, code review feedback)
Figuring out if there's a general framework here or if this is just specific to email
Building out the full outreach workflow (email finding, validation, sending via webhooks)
The code is open source at github.com/mherzog4/dd-gtm if you want to see the actual implementation. The precise judge and turn-based optimizer are in there.
If you've dealt with similar AI validation problems, or have thoughts on when AI judges actually make sense, I'd genuinely love to hear about it. Reply to this email—I read everything.
