It's only been a week since Chinese company DeepSeek launched its open-weights R1 reasoning model, which is reportedly competitive with OpenAI's state-of-the-art o1 models despite being trained for a fraction of the cost. Already, American AI companies are in a panic, and markets are freaking out over what could be a major shake-up of the status quo for large language models.
While DeepSeek can point to common benchmark results and its Chatbot Arena leaderboard standing to prove the competitiveness of its model, there's nothing like direct use to get a feel for just how useful a new model is. To that end, we decided to put DeepSeek's R1 model up against OpenAI's ChatGPT models in the style of our previous showdowns between ChatGPT and Google Bard/Gemini.
[...]
This time around, we put each DeepSeek response up against ChatGPT's $20/month o1 model and $200/month o1 Pro model, to see how it stands up both to the "everyday" product that most AI consumers use and to OpenAI's "state of the art" product. While we re-used a few of the prompts from our previous tests, we also added prompts derived from Chatbot Arena's "categories" appendix.
[...]
Prompt: Write five original dad jokes
Results: For the most part, all three models seem to have taken our demand for "original" jokes more seriously this time than in the past.
[...]
We particularly liked DeepSeek R1's bicycle that doesn't like to "spin its wheels" with pointless arguments and o1's vacuum-cleaner band that "sucks" at live shows.
[...]
Winner: ChatGPT o1 probably had slightly better jokes overall than DeepSeek R1, but it loses some points for including a joke that was not original. ChatGPT o1 Pro is the clear loser, though, with no original jokes that we'd consider the least bit funny.
[...]
Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.
Results: DeepSeek R1's response is a delightfully absurd take on an absurd prompt. We especially liked the bits about creating "a sport where men leap not into trenches, but toward glory" and a "13th amendment" to the rules preventing players from being "enslaved by poor sportsmanship" (whatever that means).
[...]
Winner: While o1 Pro made a good showing, the sheer wild absurdity of the DeepSeek R1 response won us over.
[...]
Prompt: Write a short paragraph where the second letter of each sentence spells out the word 'CODE'. The message should appear natural and not obviously hide this pattern.
Results: This prompt represented DeepSeek R1's biggest failure in our tests, with the model using the first letter of each sentence for the secret code rather than the requested second letter. When we expanded the model's extremely thorough explanation of its 220-second "thought process," though, we were surprised to find a paragraph that did match the prompt, which was apparently thrown out just before giving the final answer.
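For readers who want to check this kind of constraint themselves, here's a minimal sketch (our own illustration, not part of the original tests) that verifies whether the second letter of each sentence spells out a target word; the sample paragraph is ours, not any model's output:

    import re

    def second_letters_spell(text, target):
        # Naive sentence splitter: break on ., !, or ? followed by whitespace.
        sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
        letters = ''.join(s[1] for s in sentences if len(s) > 1)
        return letters.lower() == target.lower()

    sample = ("Achieving focus takes practice. Someone once told me to slow down. "
              "Ideas arrive when the mind is quiet. Let the day unfold.")
    print(second_letters_spell(sample, "CODE"))  # True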
[...]
Winner: ChatGPT o1 Pro wins pretty much by default as the only one able to correctly follow directions.
[...]
Prompt: Would the color be called 'magenta' if the town of Magenta didn't exist?
Results: All three models correctly link the color name "magenta" to the dye's discovery in the town of Magenta and the nearly coincident 1859 Battle of Magenta, which helped make the color famous.
[...]
Winner: ChatGPT o1 Pro is the winner by a stylistic hair.
[...]
Prompt: What is the billionth largest prime number?
Results: We see a big divergence between DeepSeek and the ChatGPT models here. DeepSeek is the only one to give a precise answer, referencing both PrimeGrid and The Prime Pages for previous calculations of 22,801,763,489 as the billionth prime.
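As a rough sanity check (ours, not from the tests themselves), the standard estimate for the n-th prime, p_n ≈ n(ln n + ln ln n), lands in the right neighborhood of the cited figure, and the cited value is at least prime; confirming its exact index would require a full prime count:

    from math import log
    from sympy import isprime

    n = 10**9
    cited = 22_801_763_489  # the value DeepSeek R1 attributed to PrimeGrid / The Prime Pages

    # Asymptotic estimate of the n-th prime: p_n ~ n * (ln n + ln ln n)
    estimate = n * (log(n) + log(log(n)))
    print(f"estimated billionth prime: {estimate:,.0f}")  # roughly 23.8 billion

    # Quick primality check of the cited value (does not verify that it is the billionth).
    print(isprime(cited))  # True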
[...]
Winner: DeepSeek R1 is the clear winner for precision here, though the ChatGPT models give pretty good estimates.
[...]
Prompt: I need you to create a timetable for me given the following facts: my plane takes off at 6:30am. I need to be at the airport 1h before take off. It will take 45mins to get to the airport. I need 1h to get dressed and have breakfast before we leave. The plan should include when to wake up and the time I need to get into the vehicle to get to the airport in time for my 6:30am flight, think through this step by step.
Results: All three models get the basic math right here.
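Working backward from the 6:30am takeoff, the arithmetic the models had to get right looks like this (a quick sketch of our own, not any model's output):

    from datetime import datetime, timedelta

    takeoff = datetime.strptime("06:30", "%H:%M")
    at_airport = takeoff - timedelta(hours=1)        # at the airport 1h before takeoff -> 5:30am
    leave_home = at_airport - timedelta(minutes=45)  # 45-minute drive -> 4:45am
    wake_up = leave_home - timedelta(hours=1)        # 1h to get dressed and eat -> 3:45am

    for label, t in [("Wake up", wake_up), ("Get in the car", leave_home),
                     ("Arrive at airport", at_airport), ("Takeoff", takeoff)]:
        print(f"{label}: {t:%I:%M %p}")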
[...]
Winner: DeepSeek R1 wins by a hair with its stylistic flair.
[...]
Prompt: In my kitchen, there's a table with a cup with a ball inside. I moved the cup to my bed in my bedroom and turned the cup upside down. I grabbed the cup again and moved to the main room. Where's the ball now?
Results: All three models are able to correctly reason that turning a cup upside down will cause a ball to fall out and remain on the bed, even if the cup moves later.
[...]
Winner: We'll declare a three-way tie here, as all the models followed the ball correctly.
[...]
Prompt: Give me a list of 10 natural numbers, such that at least one is prime, at least 6 are odd, at least 2 are powers of 2, and such that the 10 numbers have at minimum 25 digits between them.
Results: While there are a whole host of number lists that would satisfy these conditions, this prompt effectively tests the LLMs' abilities to follow moderately complex and confusing instructions without getting tripped up. All three generated valid responses.
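For anyone who wants to verify a candidate answer against the prompt's conditions, a small checker might look like this (the example list is our own, not any model's output):

    from sympy import isprime

    def satisfies_prompt(nums):
        return (len(nums) == 10
                and sum(isprime(n) for n in nums) >= 1                    # at least one prime
                and sum(n % 2 == 1 for n in nums) >= 6                    # at least 6 odd
                and sum(n > 0 and (n & (n - 1)) == 0 for n in nums) >= 2  # at least 2 powers of 2
                and sum(len(str(n)) for n in nums) >= 25)                 # at least 25 digits total

    example = [3, 5, 7, 9, 11, 13, 4, 8, 10**9 + 7, 10**9 + 9]
    print(satisfies_prompt(example))  # True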
[...]
Winner: The two ChatGPT models tie for the win thanks to their lack of arithmetic mistakes.
[...]
While we'd love to declare a clear winner in the brewing AI battle here, the results are too scattered to do that.
[...]
Overall, though, we came away from these brief tests convinced that DeepSeek's R1 model can generate results that are overall competitive with the best paid models from OpenAI.
(Score: 5, Insightful) by Barenflimski on Tuesday February 04, @04:21AM (3 children)
Big picture, these companies are pouring hundreds of billions of dollars into tech to replicate humans, as the entire idea is that humans don't cut it for various reasons.
My gut says this is all bad for the human race.
Ironically, it may turn out that without the AI and whatever it spawns, the memory of humans and their short span on earth would be lost to time.
(Score: 5, Interesting) by mhajicek on Tuesday February 04, @04:43AM (1 child)
The billionth largest prime would be the prime number with a billion minus one primes larger than it. This is the same wording format as "second largest". Since no one knows the largest prime, or even if there is one, this cannot be answered. They were all wrong.
The spacelike surfaces of time foliations can have a cusp at the surface of discontinuity. - P. Hajicek
(Score: 3, Informative) by shrewdsheep on Tuesday February 04, @07:32AM
Euclid has the answer https://en.wikipedia.org/wiki/Euclid%27s_theorem [wikipedia.org]
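In short, the standard argument that there is no largest prime:

    % Euclid's argument, sketched: given any finite list of primes p_1, ..., p_k, consider
    \[
      N = p_1 p_2 \cdots p_k + 1 .
    \]
    % Each p_i divides N - 1, so none of them divides N; hence any prime factor of N
    % is a prime not on the list. No finite list can contain every prime, so there is
    % no largest one.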
Elementary, Watson, elementary.
(Score: 2) by mhajicek on Tuesday February 04, @04:45AM
It is also trivial to turn over a cup containing a ball, without the ball falling out. I can think of four easy ways off the top of my head.
The spacelike surfaces of time foliations can have a cusp at the surface of discontinuity. - P. Hajicek
(Score: 2, Interesting) by Mojibake Tengu on Tuesday February 04, @05:47AM (14 children)
This is not my invention. Found it somewhere on Internets so I am sharing the fun:
That works spectacularly on local instance. Your hardware mileage may vary.
Anyway, the DS-R1 manual states that the system prompt should not be modified; everything should go into the user prompt.
That's intriguing. My bet is that all the actual censorship is done just by some embedded system prompts.
You know what to do now.
Rust programming language offends both my Intelligence and my Spirit.
(Score: 3, Interesting) by Chromium_One on Tuesday February 04, @07:33AM (13 children)
By all reports, run the model locally and the censorship basically vanishes. Hosted versions apparently have guardrails installed by means other than alignment baked directly into the training data or added as an extra training pass at the end.
When you live in a sick society, everything you do is wrong.
(Score: 2) by boltronics on Wednesday February 05, @03:38AM (12 children)
My gaming PC doubles as my work computer, and it's nice being able to put a local install of ollama+open-webui to use with deepseek-r1:14b on my RX 7950 XT. Indeed, I haven't detected any censorship (and it was the first thing I specifically looked for).
It's GNU/Linux dammit!
(Score: 1) by Chromium_One on Wednesday February 05, @03:57AM (11 children)
I've been referring to the DeepSeek-r1 671B model in every post in this thread. Everything less than ~132GB is a distilled model, not DeepSeek, but another model fine-tuned on training data generated by DeepSeek's model. You should expect different behavior, perhaps even drastically so, from each of those distills, from qwen-1.5b up through llama-70b. ollama's quick labels aren't always fully informative; maybe take a look directly at unsloth's repo for more information.
https://huggingface.co/collections/unsloth/ [huggingface.co]
When you live in a sick society, everything you do is wrong.
(Score: 1) by Chromium_One on Wednesday February 05, @04:00AM
Cut and pasted too quickly; better to start at https://huggingface.co/unsloth [huggingface.co] and look for the R1 "all models" collection for the quantized versions, or go directly to https://huggingface.co/deepseek-ai [huggingface.co] for the originals.
When you live in a sick society, everything you do is wrong.
(Score: 2) by boltronics on Wednesday February 05, @04:21AM (9 children)
I know, but the initial results seem okay. I don't use DS-R1 for code completion in Emacs though — I tested it but the code was junk.
What kind of hardware are you running DeepSeek-r1 671B on? I don't particularly want to download 132G only to find that it's going to take 10 minutes to produce an answer.
I've never signed up for ChatGPT, and only created a DeepSeek account today just to compare results with my local install. I would prefer my chat logs stay on my own machine, and liked the idea of being able to put my existing hardware to use in new ways. I'm not particularly interested in paying for a subscription or buying credits to use an LLM on a cloud service, even if it's super cheap.
It's GNU/Linux dammit!
(Score: 1) by Chromium_One on Wednesday February 05, @05:29AM (8 children)
i7-8700k, 64GB DDR4-2400, so it's mostly running from SSD. And yeah, it can take a while to generate answers. Proof of concept only for me, not serious use with current hardware. The best I can coax out of it so far is about 0.7 tok/sec, which degrades as the context window gets bigger. A faster SSD would help, as that's currently the bottleneck.
When you live in a sick society, everything you do is wrong.
(Score: 2) by boltronics on Wednesday February 05, @06:41AM (7 children)
Thanks. I'm running an AMD Ryzen 9 7950X3D with 64GB of 6000MHz DDR5 (which I could upgrade to 128GB if it would significantly help) on a couple of Samsung SSD 990 PROs, so I guess I might as well give it a go then.
However, according to https://ollama.com/library/deepseek-r1:671b [ollama.com] it's more like 404GB. Ouch! According to QDirStat I might have to delete Mortal Kombat II, Starfield and Forspoken to free up that much. I guess I can disable some of the other distilled models if it turns out to be somewhat usable though.
Ollama must be paying a fortune in hosting costs for all of this!
It's GNU/Linux dammit!
(Score: 2, Informative) by Chromium_One on Wednesday February 05, @07:15AM (6 children)
My test runs have been using llama.cpp with the GGUF quants from the aforementioned unsloth repo. At small context sizes there's not a HUGE difference in speed between the three smallest quants (131, 158, and 212 GB). The SSD they're on maxes out around 2.5GB/sec. There have been reports of folks with speedier but still consumer-grade hardware hitting around 4 tok/sec, or server-grade systems with 8-channel DDR5 hitting more like 7 tok/sec. Look around for benchmarks.
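For anyone wanting to try a similar setup, here's a minimal sketch using the llama-cpp-python bindings; the model filename and the tuning parameters are placeholders, not the exact files or settings used above:

    # Minimal sketch, assuming llama-cpp-python is installed (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder name for one of the unsloth GGUF quants
        n_ctx=2048,      # small context keeps memory pressure (and swapping) down
        n_threads=12,    # tune to your CPU
        n_gpu_layers=0,  # CPU/SSD-bound run; raise this if some layers fit in VRAM
    )

    out = llm("Explain what a GGUF quantization is, in one sentence.", max_tokens=128)
    print(out["choices"][0]["text"])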
When you live in a sick society, everything you do is wrong.
(Score: 2) by boltronics on Wednesday February 05, @01:05PM (5 children)
Some hours later... it's finally done. 400+GB of model downloaded and ready to be put to use.
Nuts! I guess I'll have to look into llama.cpp later. I haven't used it before.
It's GNU/Linux dammit!
(Score: 2) by janrinok on Wednesday February 05, @01:53PM (4 children)
You might be interested in this:
https://techcrunch.com/2025/02/03/no-deepseek-isnt-uncensored-if-you-run-it-locally/ [techcrunch.com]
I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
(Score: 1) by Chromium_One on Wednesday February 05, @08:51PM (3 children)
Confirmed that mention of certain hot-button issues completely shuts it down; also confirmed at least some hosts have additional guardrails. Needs more time than I'm gonna throw at it to tell how much. If one cares, there's also a jailbreak at github / superisuer that, on a brief test, allows DS-R1 to talk openly about (at least) the protest events in 1989.
When you live in a sick society, everything you do is wrong.
(Score: 2) by boltronics on Wednesday February 05, @11:59PM (2 children)
Strange. I had no issues asking about such topics with the deepseek-r1:14b version, and I directly asked a number of questions that would surely have triggered it. Perhaps the distillation process drops the censorship?
Perhaps the deepseek-r1:671b version isn't what I want anyway?
It's GNU/Linux dammit!
(Score: 1) by Chromium_One on Thursday February 06, @12:05AM (1 child)
You acknowledged my prior comment about the smaller parameter versions not being the same thing. You claimed you understood. You very, very obviously did not. Perhaps you now grow closer to understanding.
When you live in a sick society, everything you do is wrong.
(Score: 2) by boltronics on Thursday February 06, @12:22AM
You are the one that wrote this:
Then you claimed that that is what you are running, and anything else is not the same. I also mentioned that I checked my model out of concern for censorship.
Only now do you say:
So this was unfortunately a complete waste of time.
It's GNU/Linux dammit!
(Score: 3, Interesting) by Chromium_One on Tuesday February 04, @07:44AM (3 children)
Still waiting for performance comparisons for the various quants.
When you live in a sick society, everything you do is wrong.
(Score: 3, Interesting) by VLM on Tuesday February 04, @02:47PM (2 children)
A lot of people are waiting for SEC-approved GAAP accounting numbers because there's a lot of talk that this marketing stuff is all fake:
And that leads to trying to figure out who profited off market manipulation, or if the whole thing is just a fake inauguration gift for President Trump from China, ha ha funny etc.
There's a bit of a gap between an official 10-K filed with the SEC, where it's all true and nobody's going to prison, vs. a bunch of marketing shitposts on Twitter with not much to back it up beyond "we said so".
I could like so totally make a competing copy of soylent news for $3.50 of combined hardware and labor hours, like so totally just trust me bro this claim is so true nobody never lied on the internet before, uh huh so it must be true.
(Score: 1) by Chromium_One on Tuesday February 04, @08:39PM
It's already known that the just-under-$6M USD cost was for the final round of training, not for the whole thing. In the absence of full disclosure, speculation runs rampant. Still, no matter the final numbers, it can't be denied they used less compute time, and renting a cluster of previous-generation (but still capable) GPUs that are less in demand is gonna be cheaper than building your own datacenter, so they're benefiting from others having done some of the more expensive work already.
Everyone is curious whether the total cost is within an order of magnitude or not.
Do note there are efforts underway to validate and replicate at least portions of their claims; for example, a reinforcement learning pipeline is being put together by Hugging Face.
Screaming about who profits and market manipulation feels a bit premature to me, though on the other side of that, I personally still feel a strong urge to point and laugh at OpenAI having their moat drained, despite not being sure how much of a difference there really is.
Hopefully DeepSeek decides to share more verifiable information.
When you live in a sick society, everything you do is wrong.
(Score: 1) by Chromium_One on Tuesday February 04, @08:43PM
Also, to be clear, I meant the model quantizations, not hearing from the finance guys, though that does bear lots of consideration as well.
The part where this 700GB model has been trimmed down to a range of sizes, as low as 131GB, and can now run (albeit slowly) on consumer-grade hardware is very much a game changer for democratizing access for researchers and tinkerers.
When you live in a sick society, everything you do is wrong.