New AI Just Beat DeepSeek With Almost No Effort! (This Shouldn't Be Possible!)

AI Revolution
A new open-source AI model, OpenThinker-32B, has outperformed DeepSeek R1 and other major models des...
Video Transcript:
AI reasoning just took a massive leap forward, and the open-source world is shaking things up. A model trained on just 14% of its competitor's data is outpacing giants, while another is rewriting the rules of logical problem-solving by thinking in hidden loops. If you thought cutting-edge AI was reserved for billion-dollar labs, think again, because these new models are proving that smarter design beats brute force.
First, let's talk about this fascinating open-source model called OpenThinker-32B. It's developed by the Open Thoughts team, and it's causing quite a stir. One of the big reasons everyone's excited is that it's fine-tuned from Alibaba's Qwen 2.5 32B Instruct and comes packed with 32.8 billion parameters and a 16,000-token context window. Now, the training approach for OpenThinker-32B is pretty interesting.
The model was trained using the Open Thoughts 114K dataset, which, yes, you guessed it, has exactly 114,000 training examples. And it's not just random data thrown at the model; these training examples come with super detailed metadata, think ground truth solutions, domain-specific guidance, plus test cases for coding problems. The team even used a custom curator framework to verify code solutions, and for all those number-crunching tasks, an AI-based judge was there to confirm mathematical proofs.
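To make that verification step concrete, here is a minimal Python sketch of checking a candidate code solution against ground-truth test cases. It is an illustration only, not the Open Thoughts team's actual pipeline; the function name and the test-case format are assumptions for the example.
```python
# Minimal sketch: run a candidate solution against test cases and compare stdout.
# Illustrative only; not the Open Thoughts pipeline. The test-case format
# ({"input": ..., "expected_output": ...}) is an assumption for this example.
import subprocess
import tempfile

def passes_test_cases(solution_code: str, test_cases: list[dict], timeout: float = 5.0) -> bool:
    """Return True only if the solution produces the expected output on every case."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code)
        path = f.name
    for case in test_cases:
        try:
            result = subprocess.run(
                ["python", path],
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # hung or too slow: treat as a failed verification
        if result.returncode != 0 or result.stdout.strip() != case["expected_output"].strip():
            return False
    return True
```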
All that verification helps keep the model's quality high as it learns to reason through tricky problems. Another cool thing is how they trained OpenThinker-32B: they ran it through three epochs, basically three passes over the data, using the LLaMA-Factory framework.
The setup included a learning rate of 1e-5 and a cosine learning rate scheduler. Fancy jargon, but in simpler terms, it means they used a careful method to gradually dial down the model's learning speed over the course of training.
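If "cosine learning rate scheduler" still sounds fuzzy, here is a tiny self-contained Python sketch of what such a schedule looks like when it starts at 1e-5. The step counts are invented for illustration; this is not the team's actual training script.
```python
# Illustration of a cosine learning-rate schedule that starts at 1e-5 and
# decays to zero. Step counts are made up; not the actual training code.
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 1e-5, min_lr: float = 0.0) -> float:
    """Decay the learning rate from peak_lr to min_lr along a half cosine."""
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 3 * 1000  # pretend there are 1,000 optimizer steps per epoch, over three epochs
for s in (0, total // 2, total):
    print(f"step {s:4d}: lr = {cosine_lr(s, total):.2e}")
# step    0: lr = 1.00e-05
# step 1500: lr = 5.00e-06
# step 3000: lr = 0.00e+00
```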
If you're curious about compute resources, they used AWS SageMaker with four nodes, each packing eight H100 GPUs, and they managed to wrap up the training in about 90 hours. They also trained a version on an unverified dataset of around 137,000 samples using Italy's Leonardo supercomputer, which burned through about 11,520 A100 hours over 30 hours.
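Those compute figures are easy to sanity-check with quick arithmetic, using only the numbers quoted above:
```python
# Back-of-envelope check of the compute figures mentioned above.
nodes, gpus_per_node, hours = 4, 8, 90
print(nodes * gpus_per_node * hours)      # 2880 H100 GPU-hours for the verified run

a100_gpu_hours, wall_clock_hours = 11_520, 30
print(a100_gpu_hours / wall_clock_hours)  # 384.0 -> roughly 384 A100s running in parallel
```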
The verdict? Even these unverified versions performed decently, but the fully verified model still took the crown in terms of raw performance. So how does OpenThinker-32B actually stack up against other open-source reasoning models out there? According to the benchmarks, it totally rocks!
On the MATH500 benchmark, it scored a dazzling 90.6%. That's higher than what some big proprietary models manage. It also bagged a score of 61.6 on the GPQA Diamond benchmark, which tests its general problem-solving chops. Coding tasks show some promise, too. The model hit 68.9 on the LCBv2 coding benchmark, although it did get edged out by DeepSeek, another big name in this space, whose similarly sized model sits at 71.2 on the same coding tasks.
But hey, it's open source, so people can tweak and fine-tune it further, and that score might shoot up in no time. This whole open-source angle is critical. A lot of big players like OpenAI and Anthropic prefer to keep their data and training techniques behind locked doors, which makes it tough for smaller research teams or hobbyists to reproduce or improve on their results.
But with OpenThinker-32B, everything is out in the open. That means anyone can download it, study its code, look at the data, and maybe even replicate or refine the entire training process. This is one reason folks are calling it a game changer.
It matches or sometimes outperforms high-profile closed models despite using only about 14% of the data that a competitor like DeepSeek needed. That's right, DeepSeek used around 800,000 training examples, while OpenThinker-32B only used 114,000. Talk about data efficiency!
Now let's compare OpenThinker-32B directly to DeepSeek R1. That model was developed by a Chinese team and is also open source in terms of the final model weights, but the big difference is that they haven't made their training data available. Performance-wise, OpenThinker-32B slightly surpasses R1 on the MATH500 benchmark (90.6% vs. 89.4%) and also wins on GPQA Diamond (61.6 vs. 57.6), but DeepSeek has some advantages in coding and certain math tests, like the AIME benchmarks, where OpenThinker doesn't quite take the top spot.
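To keep the figures quoted in this section straight, here they are side by side (all numbers as cited in the video; the 71.2 coding score was quoted earlier for DeepSeek's similarly sized model):

Benchmark         OpenThinker-32B   DeepSeek (as cited)
MATH500           90.6%             89.4%
GPQA Diamond      61.6              57.6
LCBv2 (coding)    68.9              71.2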
Overall, though, it's a very close competition, and the fact that OpenThinker uses far less data but still hangs in there, sometimes even surpassing DeepSeek, makes it a huge success for the open-source community. The team says they're open to future developments, so we might see expansions to the context window or other architectural tweaks soon. And if you're on a budget or don't have a monstrous GPU to run a 32B-parameter model, there's also a smaller 7B-parameter variant.
Obviously, it's less powerful, but it's great for those who just want to experiment without crazy hardware. Now, there's another intriguing model on the scene that we really should talk about: Huginn 3.5B. It aims for powerhouse performance with significantly fewer parameters, and it uses a distinct strategy to tackle the problem of AI reasoning.
It was developed by an international team drawn from the ELLIS Institute Tübingen, the Max Planck Institute for Intelligent Systems, the Tübingen AI Center, the University of Maryland at College Park, and Lawrence Livermore National Laboratory. This eclectic roster already hints at the model's broad goals and the serious research going into it. One of Huginn 3.5B's key features is called latent reasoning. Instead of depending on explicit verbalizations of each intermediate reasoning step, like you might see with chain-of-thought methods, this model pushes most of the heavy lifting under the hood. The advantage is that you don't need to see or store those step-by-step tokens in the output; instead, the model refines its internal states repeatedly until it's confident enough to produce a final answer.
This is particularly appealing in scenarios where you're dealing with large or intricate queries but don't want to blow through a massive context window. At the heart of Huginn 3.5B is its recurrent depth: in simpler terms, the model loops through its hidden state multiple times during inference, effectively running multiple passes over the same internal representation. Think of it like a person quietly working out math on the back of an envelope; they keep going back over their notes, making small corrections or adding details without having to say each step out loud. Because Huginn 3.5B keeps that work internal, it's well suited to tasks that typically require extensive step-by-step reasoning, like complex proofs or multi-step code generation. Moreover, reusing the same hidden state drastically helps with memory efficiency.
Traditional chain-of-thought methods often rely on generating a bunch of intermediate reasoning tokens that can get unwieldy, especially for tasks that already eat up a lot of context. Huginn 3.5B sidesteps this by iteratively polishing its internal representations. It's almost as if it's performing mini updates to the same mental scratch pad rather than needing more tokens or a bigger context buffer every time it wants to refine its thoughts. Under the hood, Huginn 3.5B is a Transformer model, no surprise there, but with a twist.
Its architecture incorporates a looped processing unit that enables additional rounds of computation on the hidden states. Essentially, each iteration can be viewed as a deeper layer of logical processing, but it happens at inference time instead of requiring a bigger parameter count in the model's static design. By carefully balancing the number of these inference-time iterations, Huginn 3.5B can either crank up the complexity handling for tough problems or speed through easier ones with fewer loops.
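To make the looping idea concrete, here is a minimal PyTorch-style sketch of recurrent depth: one reusable block is applied to the hidden state as many times as you choose at inference time, so "thinking longer" means more loops rather than more parameters. This is an illustration of the concept only, not Huginn 3.5B's actual architecture; the class name, layer sizes, and loop counts are invented for the example.
```python
# Conceptual sketch of recurrent depth; not Huginn 3.5B's real architecture.
import torch
import torch.nn as nn

class RecurrentDepthCore(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # One transformer layer serves as the reusable "looped processing unit".
        self.core = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, num_loops: int = 4) -> torch.Tensor:
        # Refine the same hidden representation num_loops times instead of
        # emitting intermediate chain-of-thought tokens.
        for _ in range(num_loops):
            hidden = self.core(hidden)
        return hidden

model = RecurrentDepthCore()
x = torch.randn(1, 16, 512)        # (batch, sequence length, hidden size)
easy = model(x, num_loops=2)       # fewer passes for a simple query
hard = model(x, num_loops=16)      # more passes for a harder reasoning problem
print(easy.shape, hard.shape)      # torch.Size([1, 16, 512]) both times
```
The num_loops knob here is the same lever the video describes: crank it up when the problem is hard, dial it down when it isn't.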
Let's talk about training. Huginn 3.5B was fed a colossal 800 billion tokens drawn from various domains, including general text, code, and mathematical reasoning. This diverse and massive training corpus ensures the model can tackle everything from coding tasks to more academic question-and-answer sets. What's even more interesting is the synergy between the training approach and the architecture's recurrent depth. The team behind Huginn 3.5B intentionally pushed the model to handle tasks that stretch beyond direct memorization or single-shot reasoning, so the model had to learn how to think internally. Once trained, Huginn 3.5B was benchmarked on a variety of reasoning-heavy datasets. For example, it showed impressive results on ARC, a well-known dataset aimed at challenging AI with questions from standardized tests, and GSM8K, a popular math reasoning benchmark. These tasks typically measure whether a model can handle multi-step logical or arithmetic processes without simply repeating memorized answers.
Despite its relatively modest size of 3.5 billion parameters, Huginn 3.5B outperformed larger models like Pythia 6.9B and Pythia 12B, a notable achievement that underscores the effectiveness of recurrent depth. Another key point is how Huginn 3.5B adjusts its performance based on task difficulty.
For trickier problems, you can allow more iterative passes during inference, giving the model extra mental cycles to refine its solution. If a question is simpler, like a straightforward fact lookup or basic arithmetic, you can let the model run fewer loops, speeding up the entire process and using fewer computational resources. In practice, this means you can tailor Huginn 3.5B's behavior to your hardware constraints or time requirements, which is a big plus for many real-world deployments. All right, thanks for watching, and I'll see you in the next one!