The Story of Making AI Indistinguishable from Humans: Implementing a Turing Test with LLM Judges

Source: DEV Community
## Starting from HL 4.1

The first prototype of human-persona scored 4.1 out of 10 on "human-likeness," a score far below the threshold between "AI-like" and "human-like." And I was the one who built and scored it. It took 5 versions to raise this to HL 7.7. This article is the story of that journey: what I tried, what didn't work, and what worked dramatically.

## Evaluation Method: LLM Judge

I had Claude Sonnet act as an "expert in distinguishing humans from AI" to score the outputs.

```python
JUDGE_PROMPT = """
You are an expert in distinguishing humans from AI.
Evaluate the following message and respond with JSON only:
{
  "human_likeness_score": 1-10,
  "style_variation_rate": 0.0-1.0,
  "timing_naturalness": 1-10,
  "reason_human_likeness": "Reason in one sentence",
  "improvement_suggestion": "Improvement suggestion in one sentence"
}
"""
```

Three metrics:

| Metric | Meaning | Target Value |
| --- | --- | --- |
| HL (human_likeness_score) | How non-AI-like it is | 7.5 or higher |
| SV (style_variation_rate) | Not too homogeneous | Lower is better |
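To make the judge's scores usable in a loop, its reply has to be parsed and checked against the targets. Here is a minimal sketch of how that might look; the helper names (`parse_judge_response`, `meets_hl_target`) and the tolerance for prose around the JSON are my own assumptions, not from the original setup:

```python
import json

# Target from the article: HL must reach 7.5 or higher.
HL_TARGET = 7.5

# The three metric keys the judge is asked to return.
REQUIRED_KEYS = (
    "human_likeness_score",
    "style_variation_rate",
    "timing_naturalness",
)

def parse_judge_response(raw: str) -> dict:
    """Extract the JSON object from the judge's reply.

    LLM judges sometimes wrap the JSON in prose, so we slice from the
    first '{' to the last '}' before parsing. (Hypothetical helper.)
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in judge output")
    scores = json.loads(raw[start : end + 1])
    for key in REQUIRED_KEYS:
        if key not in scores:
            raise ValueError(f"missing metric: {key}")
    return scores

def meets_hl_target(scores: dict) -> bool:
    """Check the human-likeness score against the 7.5 threshold."""
    return scores["human_likeness_score"] >= HL_TARGET
```

A version that scored HL 7.7 would pass `meets_hl_target`, while the 4.1 prototype would fail it, which is the pass/fail signal driving the iteration described below.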