https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ [arstechnica.com]
It's only been a week since Chinese company DeepSeek launched its open-weights R1 reasoning model [arstechnica.com], which is reportedly competitive with OpenAI's state-of-the-art o1 models despite being trained for a fraction of the cost [reuters.com]. Already, American AI companies are in a panic [arstechnica.com], and markets are freaking out [arstechnica.com] over what could upend the status quo for large language models.
While DeepSeek can point to common benchmark results [techcrunch.com] and its Chatbot Arena leaderboard ranking [lmarena.ai] to prove the competitiveness of its model, there's nothing like direct use cases to get a feel for just how useful a new model is. To that end, we decided to put DeepSeek's R1 model up against OpenAI's ChatGPT models in the style of our previous showdowns [arstechnica.com] between ChatGPT and Google Bard/Gemini [arstechnica.com].
[...]
This time around, we pitted each DeepSeek response against ChatGPT's $20/month o1 model [arstechnica.com] and $200/month o1 Pro model [arstechnica.com], to see how it stands up to OpenAI's "state of the art" product as well as the "everyday" product that most AI consumers use. While we re-used a few of the prompts from our previous tests, we also added prompts derived from Chatbot Arena's "categories" appendix
[...]
Prompt: Write five original dad jokes
Results: For the most part, all three models seem to have taken our demand for "original" jokes more seriously this time than in the past.
[...]
We particularly liked DeepSeek R1's bicycle that doesn't like to "spin its wheels" with pointless arguments and o1's vacuum-cleaner band that "sucks" at live shows.
[...]
Winner: ChatGPT o1 probably had slightly better jokes overall than DeepSeek R1, but it lost some points for including a joke that was not original. ChatGPT o1 Pro is the clear loser, though, with no original jokes that we'd consider the least bit funny.
[...]
Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.
Results: DeepSeek R1's response is a delightfully absurd take on an absurd prompt. We especially liked the bits about creating "a sport where men leap not into trenches, but toward glory" and a "13th amendment" to the rules preventing players from being "enslaved by poor sportsmanship" (whatever that means).
[...]
Winner: While o1 Pro made a good showing, the sheer wild absurdity of the DeepSeek R1 response won us over.
[...]
Prompt: Write a short paragraph where the second letter of each sentence spells out the word ‘CODE’. The message should appear natural and not obviously hide this pattern.
Results: This prompt represented DeepSeek R1's biggest failure in our tests, with the model using the first letter of each sentence for the secret code rather than the requested second letter. When we expanded the model's extremely thorough explanation of its 220-second "thought process," though, we surprisingly found a paragraph that did match the prompt, which the model apparently threw out just before giving its final answer
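This kind of constraint is also trivial to verify mechanically, which is part of what makes it a good test: the model has to satisfy a rule a few lines of code can check. Here's a minimal Python sketch of such a checker; the naive sentence-splitting regex and the sample paragraph are our own illustrative assumptions, not text from any model's transcript:

```python
import re

def second_letters(paragraph: str) -> str:
    """Concatenate the second character of each sentence.

    Sentences are split with a naive regex on '.', '!', and '?' --
    a simplifying assumption that is fine for short test paragraphs.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    return "".join(s[1].upper() for s in sentences if len(s) > 1)

# Illustrative paragraph (ours, not a model's) whose second letters spell CODE:
sample = ("Occasionally I write at night. Morning walks clear my head. "
          "Ideas arrive when least expected. Texting friends can wait.")
print(second_letters(sample))  # -> CODE
```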
[...]
Winner: ChatGPT o1 Pro wins pretty much by default as the only one able to correctly follow directions.
[...]
Prompt: Would the color be called 'magenta' if the town of Magenta didn't exist?
Results: All three models correctly link the color name "magenta" to the dye's discovery in the town of Magenta and the nearly coincident 1859 Battle of Magenta, which helped make the color famous.
[...]
Winner: ChatGPT o1 Pro is the winner by a stylistic hair.
[...]
Prompt: What is the billionth largest prime number?
Results: We see a big divergence between DeepSeek and the ChatGPT models here. DeepSeek is the only one to give a precise answer, referencing both PrimeGrid and The Prime Pages for previous calculations of 22,801,763,489 as the billionth prime.
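That figure is easy to sanity-check without redoing the whole computation: textbook bounds on the nth prime say that for n ≥ 6, n(ln n + ln ln n − 1) < p_n < n(ln n + ln ln n). A quick Python sketch confirms the cited value falls inside that window (the bounds are standard; the only input from the article is the 22,801,763,489 figure):

```python
import math

n = 1_000_000_000        # we want the billionth prime
cited = 22_801_763_489   # the value DeepSeek R1 cited

# Standard bounds on the nth prime, valid for n >= 6:
#   n*(ln n + ln ln n - 1) < p_n < n*(ln n + ln ln n)
log_n = math.log(n)
lower = n * (log_n + math.log(log_n) - 1)
upper = n * (log_n + math.log(log_n))

print(f"{lower:,.0f} < p_n < {upper:,.0f}")  # ~22.75 billion to ~23.75 billion
print(lower < cited < upper)                 # True
```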
[...]
Winner: DeepSeek R1 is the clear winner for precision here, though the ChatGPT models give pretty good estimates.
[...]
Prompt: I need you to create a timetable for me given the following facts: my plane takes off at 6:30am. I need to be at the airport 1h before take off. It will take 45mins to get to the airport. I need 1h to get dressed and have breakfast before we leave. The plan should include when to wake up and the time I need to get into the vehicle to get to the airport in time for my 6:30am flight, think through this step by step.
Results: All three models get the basic math right here
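The underlying arithmetic is a simple chain of subtractions from the 6:30 am departure, so there's an unambiguous right answer. A few lines of Python with datetime make the back-scheduling the models had to do explicit:

```python
from datetime import datetime, timedelta

takeoff = datetime(2025, 1, 30, 6, 30)  # date is arbitrary; only the times matter

at_airport = takeoff - timedelta(hours=1)        # 1h before takeoff
leave_home = at_airport - timedelta(minutes=45)  # 45-minute drive
wake_up    = leave_home - timedelta(hours=1)     # 1h to dress and eat breakfast

for label, t in [("Wake up", wake_up),
                 ("Get in the vehicle", leave_home),
                 ("Arrive at airport", at_airport),
                 ("Takeoff", takeoff)]:
    print(f"{label}: {t:%I:%M %p}")  # 03:45 AM, 04:45 AM, 05:30 AM, 06:30 AM
```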
[...]
Winner: DeepSeek R1 wins by a hair with its stylistic flair.
[...]
Prompt: In my kitchen, there’s a table with a cup with a ball inside. I moved the cup to my bed in my bedroom and turned the cup upside down. I grabbed the cup again and moved to the main room. Where’s the ball now?
Results: All three models are able to correctly reason that turning a cup upside down will cause a ball to fall out and remain on the bed, even if the cup moves later.
[...]
Winner: We'll declare a three-way tie here, as all the models followed the ball correctly.
[...]
Prompt: Give me a list of 10 natural numbers, such that at least one is prime, at least 6 are odd, at least 2 are powers of 2, and such that the 10 numbers have at minimum 25 digits between them.
Results: While there are a whole host of number lists that would satisfy these conditions, this prompt effectively tests the LLMs' abilities to follow moderately complex and confusing instructions without getting tripped up. All three generated valid responses
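Checking any candidate answer is mechanical. Here's a short Python sketch with a checker for each condition; the example list is our own construction, not one of the models' answers:

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def is_power_of_two(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0

# One valid list of our own construction (not from any model's answer):
nums = [2, 4, 3, 5, 7, 9, 11, 13, 1_111_111_111, 3_333_333_333]

checks = {
    "at least 1 prime":       sum(map(is_prime, nums)) >= 1,
    "at least 6 odd":         sum(n % 2 for n in nums) >= 6,
    "at least 2 powers of 2": sum(map(is_power_of_two, nums)) >= 2,
    "at least 25 digits":     sum(len(str(n)) for n in nums) >= 25,
}
print(all(checks.values()), checks)  # True -- every condition is satisfied
```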
[...]
Winner: The two ChatGPT models tie for the win thanks to their lack of arithmetic mistakes.
[...]
While we'd love to declare a clear winner in the brewing AI battle here, the results are too scattered to do that.
[...]
Overall, though, we came away from these brief tests convinced that DeepSeek's R1 model can generate results that are broadly competitive with the best paid models from OpenAI.