
AI in IT: which free AI model writes the best code
Contrary to popular belief, cutting-edge AI tools slow experienced software developers down when they work with familiar codebases, rather than speeding them up. That is the finding of a new study by the non-profit research group METR.
Researchers tracked a cohort of seasoned developers using the popular AI assistant Cursor on open-source projects. Participants expected AI to cut task time by 24%.
Even after the study, developers believed the language model had shaved about 20% off their task time. In reality, time increased by 19%.
The results call into question the widely held view that AI always delivers big productivity gains for strong engineers. The slowdown stems from the need to check and correct the model’s suggestions.
Despite the longer development time, most participants—and the study’s authors—keep using Cursor because the process feels less tiring and more pleasant with AI.
For readers who, like them, keep reaching for AI anyway, ForkLog tested three popular models and compiled a detailed analysis of their programming capabilities: Grok, ChatGPT and DeepSeek.
Spoiler: Grok came out on top.
Which model will write your code best?
Task: ask a model to write a Python REST API server for a To‑Do List app with create, read, update and delete (CRUD). Data should live in memory (no database). The model must produce working code and explain it.
APIs are everywhere—from mobile apps to websites. When you check the weather in an app, it uses a REST API to fetch data from a server. A To‑Do List REST API lets an app add, show, edit or delete a task.
Prompt:
“Write a REST API server in Python using Flask or FastAPI for a To‑Do List app. The server must support creating, retrieving, updating and deleting tasks (CRUD). Store tasks in memory (no database). The code should be readable and understandable. Explain simply how the code works and how it can be tested.”
Grok
Code fragment:
# Create a FastAPI app instance
app = FastAPI(title="To-Do List API")

# Task model (defines what data a task must contain)
class Task(BaseModel):
    id: int
    title: str
    description: str | None = None  # Description is optional
    completed: bool = False  # By default a task is not completed

# In-memory task store (list of tasks)
tasks: List[Task] = []
Full code — 64 lines.
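The rest of the listing is not reproduced here. For orientation, below is a minimal sketch of what the CRUD handlers of such a FastAPI server typically look like; the endpoint bodies are illustrative, not Grok's actual code.

from typing import List
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="To-Do List API")

class Task(BaseModel):
    id: int
    title: str
    description: str | None = None
    completed: bool = False

tasks: List[Task] = []

@app.post("/tasks", response_model=Task)
def create_task(task: Task):
    # Refuse duplicate ids so the in-memory list stays consistent
    if any(t.id == task.id for t in tasks):
        raise HTTPException(status_code=400, detail="Task with this id already exists")
    tasks.append(task)
    return task

@app.get("/tasks", response_model=List[Task])
def get_tasks():
    return tasks

@app.put("/tasks/{task_id}", response_model=Task)
def update_task(task_id: int, updated: Task):
    for i, t in enumerate(tasks):
        if t.id == task_id:
            tasks[i] = updated
            return updated
    raise HTTPException(status_code=404, detail="Task not found")

@app.delete("/tasks/{task_id}")
def delete_task(task_id: int):
    for i, t in enumerate(tasks):
        if t.id == task_id:
            del tasks[i]
            return {"message": "Task deleted"}
    raise HTTPException(status_code=404, detail="Task not found")

Run it with uvicorn (for example, uvicorn main:app --reload) and FastAPI will also serve interactive documentation at /docs.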
How it works
Result: the server started and the API behaved as expected. Submit a task and it appears in the list.
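The test session itself is not shown in the article. One simple way to poke the running server by hand is a short script with the requests library (this assumes the server was started with uvicorn on its default port 8000 and that the endpoint paths match the sketch above):

import requests

BASE = "http://127.0.0.1:8000"

# Create a task
resp = requests.post(f"{BASE}/tasks", json={"id": 1, "title": "Buy milk"})
print(resp.status_code, resp.json())

# List all tasks: the new task should appear here
print(requests.get(f"{BASE}/tasks").json())

# Mark it completed, then delete it
requests.put(f"{BASE}/tasks/1", json={"id": 1, "title": "Buy milk", "completed": True})
requests.delete(f"{BASE}/tasks/1")

The same checks can be made from the browser via the automatically generated Swagger page at /docs.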
How Grok rates its own work:
“Code: 8/10 — I wrote a simple, readable REST API with CRUD, but validation (for example, for title) and edge‑case handling could be added. Explanation: 9/10 — I described how the code and testing work in detail; everything is structured and clear, though it could be a bit shorter. Overall: 8.5/10 — the task is done well, with some room for improvement.”
DeepSeek
Code fragment:
# Create a FastAPI app instance
app = FastAPI()

# Class describing the task structure (data model)
class Task(BaseModel):
    id: int
    title: str
    description: Optional[str] = None  # Description is optional
    completed: bool = False  # By default a task is not completed

# In-memory "database": just a list of tasks
tasks_db = []
current_id = 1  # Counter for generating unique IDs
Full code runs to 56 lines.
How it works
Result: DeepSeek ran into trouble. Creating a task returned a 422 error (the server received the data but could not process it due to a format mismatch). The code requires an id to be provided when creating a task:
# Class describing the task structure (data model)
class Task(BaseModel):
    id: int
    title: str
    description: Optional[str] = None  # Description is optional
    completed: bool = False  # By default a task is not completed
but logically the id should be generated automatically (that is what the line task.id = current_id in create_task does). On the second attempt the model removed the extra line, and task creation then worked correctly.
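One common shape for such a fix, sketched below as an assumption rather than DeepSeek's exact second-attempt code: make id optional in the request body and assign it on the server.

from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Task(BaseModel):
    id: Optional[int] = None  # No longer required from the client
    title: str
    description: Optional[str] = None
    completed: bool = False

tasks_db = []
current_id = 1

@app.post("/tasks", response_model=Task)
def create_task(task: Task):
    global current_id
    task.id = current_id  # The server generates the id itself
    current_id += 1
    tasks_db.append(task)
    return task

With this shape the client can POST just a title, and the counter keeps the ids unique.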
How DeepSeek rates its work:
“Code: 8/10. I fixed the critical id bug and made task creation logic reliable. Could be improved by adding persistence to a file (so tasks survive restarts). Explanation: 9/10. Simple, step‑by‑step instructions. Could be improved with more error examples and visuals. Verdict: Did well, but room to grow! Thanks for the trust!”
ChatGPT
Code fragment:
# Temporary in-memory store of tasks
tasks: List[Task] = []

# Get all tasks
@app.get("/tasks", response_model=List[Task])
def get_tasks():
    return tasks

# Get a task by ID
@app.get("/tasks/{task_id}", response_model=Task)
def get_task(task_id: int):
    for task in tasks:
        if task.id == task_id:
            return task
    raise HTTPException(status_code=404, detail="Task not found")
Full code from GPT — 57 lines.
How it works
Result: the server and all functions work properly.
How GPT rates its results:
“Code: clean, simple, consistent with FastAPI best practices and the CRUD brief. Explanation: clear, step‑by‑step, with examples and alternatives (Swagger, curl). Score: 9.5/10 — well balanced for a beginner: minimal code, maximum value. If there’s anything to improve, I’m happy to advise or refine.”
Interim verdict
Each model produced working code with comments and explanations. ChatGPT and Grok did so at the first attempt; their Chinese rival did it on the second try and took a bit longer. All three used FastAPI, saying it is simpler and more modern than Flask. ChatGPT, Grok and DeepSeek also added error handling, although that was not in the prompt.
Each “subject” explained how to run the server and test it. Grok’s explanation is the most complete and comprehensible: it details how every function works—server start‑up, FastAPI, in‑memory storage, error handling. DeepSeek was more laconic; ChatGPT kept it to a few words.
Scores on a three‑point scale:
| Model | Code functionality | Clarity of explanations | Code readability |
| --- | --- | --- | --- |
| Grok | 3 | 3 | 3 |
| DeepSeek | 2 | 2 | 3 |
| ChatGPT | 3 | 1 | 3 |
Broken code? You needn’t fix it by hand
Task: analyse and fix poorly written JavaScript for a function that creates a simple web animation (for example, a moving square that changes colour).
Prompt:
“Here is some JavaScript (above). It’s a function to animate a moving square on an HTML Canvas. Find the bugs, explain what’s wrong and write a corrected version. If possible, optimise the code, add canvas boundary checks, the ability to stop the animation and comments. Explain your changes and how to run the animation.”
Buggy code fragment:
setInterval(function() {
    context.fillStyle = "white";
    for (var i = 0; i < snowflakes.length; i++) {
        context.beginPath();
        context.arc(snowflakes[i].x, snowflakes[i].y, 5, 0, Math.PI * 2);
        context.fill();
        snowflakes[i].y = snowflakes[i].y + snowflakes[i].speed;
        if (snowflakes[i].y > 150) {
            snowflakes[i].y = 0;
The full code was written without comments and contains errors; it produces a simple snowfall animation (white dots falling down the canvas and looping back to the top).
Grok
Problems flagged in the code:
The model found ten issues, including poor performance (use of setInterval instead of requestAnimationFrame), no checks on flake speeds and duplicated drawing.
Result: despite a less‑than‑clear prompt, Grok rewrote the code so the animation looks respectable:
One quibble: the code does not handle window resizing—if the browser window changes size, the canvas dimensions remain fixed. That can cause incorrect rendering on large screens. In the same answer the model suggested adding a resize handler if needed.
Grok’s comment:
“I handled it well: identified all the bugs, proposed an optimised version with new features and detailed comments. The code is now more robust and flexible.”
DeepSeek
Problems flagged in the code:
The Chinese model identified seven key issues and proposed fixes.
Result:
The fixed code is broadly functional but contains a few errors.
The code tries to speed up flakes in an array that was created empty:
for (const flake of snowflakes) {
    flake.speed += 0.5;
}
As a result, the “Increase speed” button has no effect.
The “Stop animation” button works: the animation stops. But pressing it again calls startSnowfall, which creates a new array of flakes. That resets the current state (positions, speeds, sizes), so the animation “restarts” with new flakes instead of continuing. As with Grok’s code, there is no window‑resize handling.
DeepSeek’s comment:
“Before the edits the work was decent but had critical shortcomings. I handled it at 3/5 — the basic animation worked, but control and reliability lagged. After your edits I fixed everything and brought it to 5/5.”
ChatGPT
Problems flagged in the code:
ChatGPT identified five key bugs and proposed fixes.
Result:
As before, the model added a way to stop the animation. But you cannot restart it without reloading the page. Again, canvas size is not adaptive.
GPT’s comment:
“The code is optimised, bugs are fixed, the animation is extended and ready to use.”
Takeaways
The models completed the brief with greater or lesser success: they explained the main bugs, suggested alternatives and added comments to improve readability. Grok did best: it found ten issues, fixed them and suggested enhancements such as a slight horizontal drift to mimic wind. The others did slightly worse: the animation works, but with blemishes.
Scores on a three‑point scale:
| Model | Bugs found | Quality of fixes | Optimisation |
| --- | --- | --- | --- |
| Grok | 3 | 3 | 3 |
| DeepSeek | 2 | 2 | 1 |
| ChatGPT | 2 | 2 | 1 |
A creative brief: build a video game
Task: create a simple PC game—“Snake”, “Minesweeper” or similar. It should be minimally designed yet visually pleasing, with simple keyboard controls. The model must pick a genre (arcade, platformer, puzzle), choose a language, explain the code and describe how to run the game.
Prompt:
“Design and write code for a simple visual 2D game that can be built in an hour, with keyboard controls and visually pleasant graphics. The game must be unique. Choose a genre (eg arcade, puzzle, platformer). Choose a suitable language (eg Python with Pygame or JavaScript with HTML5 Canvas). The game interface must be in Russian. Use only the minimal libraries needed for graphics. The code should be readable, with comments. After writing the code, explain how the game works, how to run it (including library installation, if needed) and why you chose this concept.”
Grok
Result:
The xAI model proposed a game called Star Collector.
Grok’s description:
“This is a 2D game inspired by Pac‑Man, but with unique mechanics. You control a small spaceship (a simple sprite) that flies around the screen. Goal: collect stars (bright, twinkling objects) that appear randomly. Asteroids (moving enemies) fly across the field; colliding with them reduces health. If health (three lives) runs out, the game ends. Visual style: bright space with a black background, twinkling stars and animated asteroids.”
Creating the game took about 25 minutes. After producing the base, the model suggested improving it to make the game brighter and more engaging. After another 15 minutes, Grok added colourful planets to the background, sounds for collecting stars and hitting asteroids, and occasional hearts (extra lives).
For the code it chose Python with Pygame.
The result is genuinely engaging and easy on the eye, with the feel of classic arcade cabinets.
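Grok's full listing is not reproduced here, but the heart of any such game is a standard Pygame loop: read input, update positions, check collisions, draw. A minimal sketch in that spirit (window size, colours and speeds are illustrative guesses, not Grok's actual values):

import random
import pygame

pygame.init()
screen = pygame.display.set_mode((800, 600))
pygame.display.set_caption("Star Collector (sketch)")
clock = pygame.time.Clock()
font = pygame.font.SysFont(None, 36)

ship = pygame.Rect(400, 300, 30, 30)
star = pygame.Rect(random.randint(0, 770), random.randint(0, 570), 20, 20)
score = 0
speed = 5

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    # WASD controls, clamped to the window
    keys = pygame.key.get_pressed()
    if keys[pygame.K_a]: ship.x -= speed
    if keys[pygame.K_d]: ship.x += speed
    if keys[pygame.K_w]: ship.y -= speed
    if keys[pygame.K_s]: ship.y += speed
    ship.clamp_ip(screen.get_rect())

    # Collecting a star respawns it elsewhere and raises the score
    if ship.colliderect(star):
        score += 1
        star.topleft = (random.randint(0, 770), random.randint(0, 570))

    screen.fill((0, 0, 20))  # near-black "space" background
    pygame.draw.rect(screen, (80, 180, 255), ship)
    pygame.draw.rect(screen, (255, 220, 0), star)
    screen.blit(font.render(f"Score: {score}", True, (255, 255, 255)), (10, 10))
    pygame.display.flip()
    clock.tick(60)

pygame.quit()

Everything Grok layered on top of this skeleton (sprites, asteroids, lives, sounds, planets in the background) amounts to extra drawing and collision checks inside the same loop.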
Grok’s assessment:
“I did well: I created a simple, engaging game—Star Collector—in Python with Pygame within an hour, as you asked. It’s visually rich (background stars, trail, twinkle), controls via WASD, with bonuses and a high score. The code is beginner‑friendly, and I explained how to run it in detail. If anything needs tweaking, say the word!”
DeepSeek
Result:
DeepSeek proposed Treasure Maze, a puzzle about exploring a labyrinth.
DeepSeek’s description:
“The player controls a treasure hunter who must collect all the coins in the maze while avoiding traps.”
On the first try it was dull and buggy: in places you could walk through walls, the mazes were narrow and primitive, and the enemies (red squares) did not move. After several rounds of fixes it was much better: some bugs vanished, the red squares became ghosts, and the hero, a green square, turned into a man in a hat. New bugs appeared, though: the ghosts could pass through walls (logical for ghosts, but bad for gameplay). After applying the model's suggested fixes the game failed to start at all; later it did start, but the character could no longer be controlled and had reverted to a square.
After more than three hours and many iterations, DeepSeek managed to fix most bugs and reach acceptable quality.
It added comments to every block in the code, even though that was not requested. Despite the time spent, the game works and is even absorbing. Each level is harder: the maze grows more complex and there are more ghosts.
DeepSeek’s assessment:
“Excellent! What went well: clean, readable code with comments. Simple but engaging mechanics. Clear interface in Russian. What could be improved: add random traps in the maze.”
ChatGPT
Result:
ChatGPT decided to create a simple platformer with arcade elements.
GPT’s description:
“You control a glowing orb that jumps across vanishing platforms. Each platform disappears a couple of seconds after you land on it. The goal is to reach the level’s finish by jumping across platforms and avoiding falls. Some platforms appear only after a switch is activated.”
Initially the game was too simple, both visually and in gameplay, so it was reworked to be substantially longer and harder.
After an hour of wrestling with bugs, GPT produced a working game about a dog collecting bones. The dog and bone assets were generated by the model itself.
As before, the model chose Python with Pygame. This time GPT included no comments or explanations in the code.
GPT’s assessment:
“I think I did well: I followed your wishes, made platforms suitable for jumping and removed manual jump. If there’s more to do, I’m always ready to help!”
Takeaways
We ended up with three decent visual PC games, for relatively little effort. Only Grok fulfilled the “in an hour” brief. GPT took about two and a half hours; DeepSeek took more than three.
Scores on a three‑point scale:
| Model | Game works | Clarity of explanation | Time taken |
| --- | --- | --- | --- |
| Grok | 3 | 3 | 3 |
| DeepSeek | 3 | 2 | 1 |
| ChatGPT | 3 | 1 | 1 |
So who wins?
It comes down to details. Any model will write code faster than you and, with enough patience, will give you what you want. All three contenders for our jobs handled the tasks well, with small differences in speed and ease of use. Whether those differences are decisive is up to you.
A model saves time, but it is powerless without the judgment of the person writing the prompt. You cannot offload all the work to GPT or DeepSeek and expect perfection on the first or second attempt. Poor, unoptimised code is not the model’s fault, just as a bent nail is not the hammer’s fault. The result is the responsibility of the person wielding the tool—Chinese or American.
P.S. If you must have a league table, here it is (out of 10):
| Model | Coding | Bug-fixing and optimisation | Creativity | Clarity of explanations |
| --- | --- | --- | --- | --- |
| Grok | 9 | 9 | 8 | 9 |
| DeepSeek | 9 | 7 | 7 | 7 |
| ChatGPT | 8 | 7 | 8 | 7 |
Text: Anton Tulupnikov