🧪 How We Tested
We ran three models (Claude 3.5 Sonnet, GPT‑4o, and Gemini 1.5 Pro) through a battery of tasks. All tests used fixed settings (temperature 0.2 for code, 0.7 for creative) and identical prompts. We evaluated code generation (function writing, debugging, explanation, test case creation) and creative writing (flash fiction, dialogue, descriptive rewriting, style mimicry).
🧑‍💻 Code Generation – Detailed Results
Task 1: Parse nested JSON with error handling
Prompt: “Write a Python function safe_parse_json(data, default=None) that takes a string, tries to parse it as JSON. If it’s nested, extract all leaf values into a flat list. Handle malformed JSON gracefully, returning default.”
Claude 3.5 Sonnet’s output: Included try/except with specific exception types (JSONDecodeError), recursive traversal, and type checking. Also added docstring and example usage. Score: 5/5 – production‑ready.
GPT‑4o’s output: Also correct but used a simpler iterative stack. Handled errors but didn’t catch recursion depth edge cases. Score: 4/5 – good but less robust.
Gemini 1.5 Pro: Correct but verbose; included unnecessary logging. Failed to handle extremely deep nesting (recursion limit). Score: 3.5/5.
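For reference, here is a minimal sketch of one way to satisfy the Task 1 prompt. It is our own illustration, not any model's actual output. It assumes that "leaf values" means every non-dict, non-list value, and that non-string input should also return default. It walks the parsed structure with an explicit stack rather than recursion, and it also catches the RecursionError that json.loads can raise on very deeply nested input.

```python
import json
from typing import Any, List, Optional


def safe_parse_json(data: str, default: Optional[Any] = None) -> Any:
    """Parse a JSON string and return its leaf values as a flat list.

    Returns `default` if the input is not a string or is not valid JSON.
    """
    try:
        parsed = json.loads(data)
    except (json.JSONDecodeError, TypeError, RecursionError):
        # JSONDecodeError: malformed JSON.
        # TypeError: input that is not str/bytes (for example None).
        # RecursionError: nesting too deep for the parser itself.
        return default

    leaves: List[Any] = []
    stack = [parsed]  # explicit stack, so traversal depth is not tied to the call stack
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            # Push in reverse so values come off the stack in document order.
            stack.extend(reversed(list(node.values())))
        elif isinstance(node, list):
            stack.extend(reversed(node))
        else:
            leaves.append(node)
    return leaves


flat = safe_parse_json('{"a": 1, "b": {"c": [2, 3], "d": "x"}}')
broken = safe_parse_json('{"a": ', default=[])
```

With the inputs above, flat holds the four leaf values in document order and broken falls back to the supplied default.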
Task 2: Debug a recursive SQL CTE
Prompt: “This query is returning duplicate rows. It’s a CTE that builds an employee hierarchy. Find the bug and fix it.” (We provided a broken query with a missing DISTINCT in the recursive part.)
Claude: Immediately identified the missing DISTINCT and explained that recursive CTEs can repeat paths if the join condition isn’t strict. Provided a fixed version with an anchor that uses ROW_NUMBER(). 5/5.
GPT‑4o: Correctly identified the issue but suggested adding DISTINCT at the end, which is less efficient than fixing the recursion. 4/5.
Gemini: Took two tries to identify the root cause; initially suggested indexing. 3/5.
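To make the failure mode concrete, here is a small reproduction using Python's built-in sqlite3 module. It is a hypothetical stand-in, not the query from our test: the reports table, its rows, and the chain CTE are invented for illustration. In SQLite the deduplicating form of a recursive CTE is written with UNION rather than UNION ALL, so the sketch compares those two set operators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports (employee TEXT, manager TEXT);
    INSERT INTO reports VALUES
        ('alice', NULL),
        ('bob',   'alice'),
        ('bob',   'alice'),   -- accidental duplicate row
        ('carol', 'bob');
""")

# Hypothetical hierarchy query; {set_op} is filled in with UNION ALL or UNION.
QUERY = """
    WITH RECURSIVE chain(employee, manager) AS (
        SELECT employee, manager FROM reports WHERE manager IS NULL
        {set_op}
        SELECT r.employee, r.manager
        FROM reports AS r
        JOIN chain AS c ON r.manager = c.employee
    )
    SELECT employee FROM chain ORDER BY employee
"""


def hierarchy(set_op: str) -> list:
    """Run the hierarchy query with the given set operator and return the names."""
    rows = conn.execute(QUERY.format(set_op=set_op)).fetchall()
    return [name for (name,) in rows]


with_duplicates = hierarchy("UNION ALL")  # every duplicate spawns its own subtree
deduplicated = hierarchy("UNION")         # repeated rows are discarded during recursion
```

The duplicate 'bob' row is carried through each recursive step under UNION ALL, so 'carol' is also returned twice. Deduplicating inside the recursion stops the repeats from multiplying. A DISTINCT on the final SELECT only hides them after the extra work is done, which is the efficiency point behind the scores above.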
Why Claude wins for code
Our read is that Claude 3.5 was trained with a heavy emphasis on code and reasoning traces (think chain‑of‑thought), likely including synthetic data from verified code execution, which would expose it to many correct/incorrect pairs. It also tends to be more cautious, adding error handling and comments. For production systems, this reduces bugs.
✍️ Creative Writing – Detailed Results
Task 1: Flash fiction in the style of Ursula K. Le Guin
Prompt: “Write a 300‑word story in the style of Ursula K. Le Guin. Theme: a first contact where the alien’s language has no word for ‘ownership’.”
Claude’s output: Opened with “They came not in ships but in silence, drifting down like dandelion seeds.” The narration used Le Guin’s anthropological, slightly distant tone. Avoided “delve,” “testament,” or other AI clichés. Included a moment of cultural misunderstanding about a child’s toy. 5/5 – felt authentic.
GPT‑4o’s output: Also good, but used phrases like “the ambassador was taken aback” – a bit more melodramatic. Still impressive. 4/5.
Gemini: Safe, predictable plot. Used “the alien said” repeatedly. Felt like a middle‑school creative writing assignment. 3/5.
Task 2: Rewrite a dull sentence vividly
Input: “He walked to the store.”
Claude: “He drifted down the cracked sidewalk, past the shuttered pharmacy and the hydrant where a terrier once bit his ankle, until the fluorescents of the 24‑hour bodega blinked him back to the present.” (Added character memory, setting, mood.)
GPT‑4o: “His boots scuffed the pavement as he ambled toward the corner market, the cold wind biting at his cheeks.” (Clean, effective, but less surprising.)
Gemini: “He walked quickly to the store because he needed milk.” (Missed the point entirely.)
Why Claude wins for creative writing
Anthropic placed a strong emphasis on harmlessness and nuance during RLHF. This seems to have also improved stylistic range: the model avoids overused “AI‑ese” phrases. We also suspect that Claude’s training data included a large corpus of literature (with proper permissions). GPT‑4o is a close second, especially for dialogue and poetry, but Claude has an edge in narrative voice.
🎯 Task‑Specific Recommendations
| Use case | Model | Why |
|---|---|---|
| Writing production‑ready code (APIs, backends) | Claude 3.5 Sonnet | Most robust error handling, comments, and edge‑case coverage. |
| Fast prototyping / hackathons | GPT‑4o | Slightly faster, very good at generating working code quickly, even if not perfect. |
| Refactoring legacy code | GPT‑4o or Claude | Both are good; test both. |
| Creative writing (stories, marketing copy) | Claude 3.5 Sonnet | Better voice, fewer clichés. |
| Poetry or song lyrics | GPT‑4o | Surprisingly strong rhythmic sense. |
| Long document analysis (100+ pages) | Gemini 1.5 Pro | 1M token context window – can process entire books. |
| Video understanding / transcript + visual | Gemini 1.5 Pro | Native video input (frame‑by‑frame). |
| Multilingual translation | GPT‑4o | Supports ~100 languages, high quality. |
| Role‑play / character simulation | Claude 3.5 Sonnet | Remembers persona well, less prone to breaking character. |
Don’t commit your entire application to a single model. Use an abstraction layer (like litellm.completion() or LangChain’s BaseChatModel interface), then implement a simple policy engine: if the prompt contains “write code” → Claude 3.5 Sonnet; if it contains “translate to Spanish” → GPT‑4o; if the attached document exceeds 500,000 tokens → Gemini 1.5 Pro; otherwise → a cheaper model like Claude Haiku or GPT‑4o mini. This gives you the best of all worlds and protects you from price hikes or API changes.
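The routing policy described above can be sketched in a few lines of plain Python. Everything named here is illustrative: the Request class, the choose_model function, and the model identifier strings are placeholders, and the keyword match is generalized from “translate to Spanish” to any “translate to” request. The sketch checks document size first, which is our own ordering choice, on the assumption that an attachment that large will not fit the other models' context windows whatever the prompt says.

```python
from dataclasses import dataclass

# Placeholder model identifiers; substitute the names your provider or
# abstraction layer actually expects.
CODE_MODEL = "claude-3.5-sonnet"
TRANSLATION_MODEL = "gpt-4o"
LONG_CONTEXT_MODEL = "gemini-1.5-pro"
FALLBACK_MODEL = "gpt-4o-mini"

LONG_DOC_THRESHOLD = 500_000  # tokens, taken from the rule in the text


@dataclass
class Request:
    prompt: str
    attached_tokens: int = 0  # token count of any attached document


def choose_model(request: Request) -> str:
    """Return a model name for the request using simple keyword rules."""
    text = request.prompt.lower()
    # Size check comes first (assumption: oversized attachments need long context).
    if request.attached_tokens > LONG_DOC_THRESHOLD:
        return LONG_CONTEXT_MODEL
    if "write code" in text:
        return CODE_MODEL
    if "translate to" in text:
        return TRANSLATION_MODEL
    return FALLBACK_MODEL


picked = choose_model(Request("Please write code for a CSV parser"))
```

The returned name can then be handed to whichever abstraction layer you use, for example as the model argument of litellm.completion(). Keyword matching is deliberately crude; a real policy would probably classify intent more carefully.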

