Quick context before we dive in: I came to software engineering from tech sales.
Spent years on the other side—sending cold emails, booking meetings, trying to hit quota.
Most of my project ideas come from that world.
I'm building the tools I wish I'd had back then, except now I actually know how to code them.
I'm working on an email generator for a hypothetical conference outreach campaign.
It needs to create personalized emails at scale, and those emails need to be good.
The question that's been eating at me: how do you validate AI-generated content reliably?
My first instinct was to use AI to judge it.
It was also completely wrong for my use case, and it took me hours of debugging to figure out why.
The Experiment: AI Judging AI
Here's what I tried:
Setup:
Claude Sonnet generates personalized email (subject + body)
GPT-4o-mini judges the output and gives a score 0-100
If score < 70, regenerate and try again
Iterate until good enough
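In code, the loop was roughly this shape (a sketch, not the actual implementation; `generate` and `judge` stand in for the Claude Sonnet and GPT-4o-mini calls, and the iteration cap is an assumption):

```python
from typing import Callable, Dict, Tuple

def optimize_with_ai_judge(
    generate: Callable[[Dict], Dict],           # wraps the Claude Sonnet generation call
    judge: Callable[[Dict], Tuple[int, str]],   # wraps GPT-4o-mini, returns (score 0-100, feedback)
    prospect: Dict,
    threshold: int = 70,
    max_iterations: int = 20,
) -> Dict:
    """Regenerate until the judge's score clears the threshold, or give up."""
    best_email, best_score = None, -1
    for _ in range(max_iterations):
        email = generate(prospect)
        score, feedback = judge(email)
        if score > best_score:
            best_email, best_score = email, score
        if score >= threshold:
            break
        # Feed the judge's critique into the next generation attempt
        prospect = {**prospect, "judge_feedback": feedback}
    return best_email
```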
Expected result: Emails improve with each iteration, scores climb to 90+
Actual result:
Iteration 1:
Subject: "Site Progress" (clearly 2 words)
Judge: "Subject has 4 words, needs 2-3"
Score: 45/100
Wait, what?
Iteration 2:
New subject: "Project Update" (also 2 words)
Judge: "Missing booth number" (it's right there in line 3)
Score: 52/100
Iteration 8:
Judge: "Body is 48 words, exceeds 45 max" (I counted: 43 words)
Score: 63/100
Iteration 15:
Still bouncing around 55-65/100
Judge contradicting itself between runs
No upward trajectory at all
The AI judge couldn't count words correctly.
It hallucinated missing elements. It scored the same email differently on consecutive runs.
Judge accuracy: ~40% (I validated this by manually checking its claims)
Why This Matters
When your validator is wrong 60% of the time, iterative improvement is impossible.
You're not optimizing—you're just regenerating randomly and hoping to get lucky.
The real problem was more fundamental: I was using AI for a task that didn't need AI.
The Pivot: Just Write Some Python
I had specific rules for what makes a good email:
Subject: 2-3 words, Title Case, no punctuation
Body: 45-55 words, starts with "Hey [FirstName],", max 2 commas, no jargon
If I can write these rules down explicitly, why am I asking an AI to approximate them?
So I deleted the AI judge and wrote this instead:
```python
issues = []

# Actual word counting
words = subject.split()
word_count = len(words)
if word_count < 2 or word_count > 3:
    issues.append(f"Subject has {word_count} words, needs 2-3")

# Real Title Case check
expected = " ".join(word.capitalize() for word in words)
if subject != expected:
    issues.append("Not perfect Title Case")

# Character-level punctuation detection
if any(char in '.,;:!?-' for char in subject):
    issues.append("Contains punctuation")
```

Results after switching:
Judge accuracy: 95%+ (no more hallucinations)
Average score: 92/100 (up from 57-65)
Convergence: 1-3 iterations (down from never)
The improvement was immediate and dramatic. I felt dumb for not doing this from the start.
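The body rules got the same deterministic treatment. Here's a sketch of what those checks look like (paraphrased from the same idea, not copied from the repo; the jargon list is my own placeholder):

```python
def check_body(body: str, first_name: str) -> list:
    """Deterministic checks for the body rules: 45-55 words, greeting, max 2 commas, no jargon."""
    issues = []

    word_count = len(body.split())
    if not 45 <= word_count <= 55:
        issues.append(f"Body has {word_count} words, needs 45-55")

    if not body.startswith(f"Hey {first_name},"):
        issues.append("Body must start with 'Hey [FirstName],'")

    comma_count = body.count(",")
    if comma_count > 2:
        issues.append(f"Body has {comma_count} commas, max is 2")

    # Blocklist of jargon words; the exact list here is an assumption
    jargon = {"synergy", "leverage", "optimization", "streamline"}
    found = sorted(w for w in jargon if w in body.lower())
    if found:
        issues.append(f"Contains jargon: {', '.join(found)}")

    return issues
```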
But Wait—What About AI Judges?
After sharing this, someone asked me a good question: "So what's the point of AI judges if they're not good?"
The answer isn't "AI judges are bad"—it's "I was using them wrong."
Here's what I'm realizing:
AI judges fail at objective tasks:
Counting words
Checking formatting
Verifying required elements exist
Anything with a clear right/wrong answer
AI judges might excel at subjective tasks:
"Does this sound natural?"
"Is the tone appropriate for this audience?"
"Does this feel authentic or generic?"
"Would a human find this persuasive?"
The problems start when we reach for AI judges on tasks in the first category, where simple logic already does the job perfectly.
A Possible Hybrid Architecture
I'm considering adding an AI judge back—but only for the subjective layer.
Stage 1 - Structural Gate (Logic):
Word counts ✓
Format rules ✓
Required elements ✓
If this fails, immediately regenerate—no AI call needed
Stage 2 - Quality Assessment (AI):
Tone and naturalness
Persuasiveness
Authenticity
Personalization quality
Only evaluate emails that pass structure
This way:
Logic handles what it's great at (objective rules, instant, free, accurate)
AI handles what logic can't (subjective judgment, nuance, context)
They don't compete; they complement each other
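A sketch of how the two stages might compose (nothing here is built yet; `quality_score` is a hypothetical wrapper around an LLM call, and the 80 cutoff is arbitrary):

```python
from typing import Callable, Dict, List

def evaluate_email(
    email: Dict,
    structural_check: Callable[[Dict], List[str]],  # Stage 1: deterministic rule violations
    quality_score: Callable[[Dict], float],         # Stage 2: hypothetical LLM wrapper, 0-100
    quality_threshold: float = 80.0,                # assumed cutoff, not a tested number
) -> Dict:
    # Stage 1: free, instant, accurate. Fail fast with no AI call.
    issues = structural_check(email)
    if issues:
        return {"passed": False, "stage": "structure", "issues": issues}

    # Stage 2: only spend an LLM call on emails that already pass the rules
    score = quality_score(email)
    return {"passed": score >= quality_threshold, "stage": "quality", "score": score}
```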
The Turn-Based Learning Part
Once I had reliable judging, the next problem was: how do you improve a 72/100 email efficiently?
Naive approach: Regenerate the whole thing.
Better approach: Figure out what's specifically wrong and fix only that part.
I built what I'm calling a "turn-based optimizer."
Each turn:
Judge the current email precisely
Identify specific failures (subject has punctuation, body too long, contains "optimization")
Categorize by component (subject vs body issues)
Regenerate only the broken part with targeted prompts
Track what didn't work to avoid repeating mistakes
The system maintains memory across attempts:
```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class OptimizationSession:
    tried_subjects: Set[str]        # Don't regenerate these
    failed_approaches: Set[str]     # Avoid these patterns
    successful_patterns: List[str]  # Learn from these
```

Example sequence:
Turn 1:
Subject has punctuation, body has jargon
Score: 72/100
Regenerate both with specific constraints
Turn 2:
Subject fixed, body perfect
Score: 94/100
Done ✓
Average convergence: 1-3 turns to hit 90+/100.
This only works because the judge gives accurate feedback.
With the hallucinating AI judge, targeted fixes were impossible—the system couldn't tell what was actually broken.
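Put together, one turn looks roughly like this (a sketch of the idea, not the repo's code; `check_subject` is assumed to wrap the structural checks from earlier, and `regenerate_subject`/`regenerate_body` stand in for the targeted-prompt LLM calls):

```python
def run_turn(email: dict, session: OptimizationSession) -> dict:
    """One turn: judge each component precisely, regenerate only what failed."""
    subject_issues = check_subject(email["subject"])
    body_issues = check_body(email["body"], email["first_name"])

    if subject_issues:
        session.tried_subjects.add(email["subject"])        # never offer this subject again
        session.failed_approaches.update(subject_issues)    # remember why it failed
        email["subject"] = regenerate_subject(email, subject_issues, session)

    if body_issues:
        session.failed_approaches.update(body_issues)
        email["body"] = regenerate_body(email, body_issues, session)

    if not subject_issues and not body_issues:
        session.successful_patterns.append(email["subject"])  # keep track of what worked

    return email
```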
What I'm Taking Away (So Far)
1. Use the simplest thing that works
I defaulted to AI because it felt modern.
But when you have deterministic rules, write deterministic code.
Don't approximate something you can calculate exactly.
2. Accurate feedback is everything for iteration
Whether it's AI or logic doing the validation, if your feedback is wrong 40% of the time, you can't build a learning system.
You just have expensive randomness.
3. Maybe there's still a place for AI judges
Just not where I was using them.
I'm still figuring out if the subjective quality layer is worth building, or if it's solving a problem I don't actually have.
4. The right abstraction matters
Framing this as "turns" with memory between attempts made the learning behavior obvious to implement.
What's Next
Things I'm exploring:
Whether to build that Stage 2 AI quality layer or if structural validation is enough
Testing this pattern on other content types (support responses, code review feedback)
Figuring out if there's a general framework here or if this is just specific to email
Building out the full outreach workflow (email finding, validation, sending via webhooks)
The code is open source at github.com/mherzog4/dd-gtm if you want to see the actual implementation. The precise judge and turn-based optimizer are in there.
If you've dealt with similar AI validation problems, or have thoughts on when AI judges actually make sense, I'd genuinely love to hear about it. Reply to this email—I read everything.
