Microsoft Accidentally Created the Most Efficient AI Ever

37.15k vues1578 MotsCopier le textePartager

AI Revolution

Microsoft’s new AI model, BitNet b1.58 2B4T, is a groundbreaking transformer that uses ternary weigh...

Transcription vidéo:

[Music] microsoft's general artificial intelligence team has dropped a new AI called BitNet B1. 582B4T And yeah the name sounds like a Wi-Fi password but the idea is surprisingly elegant Run a serious large language model on nothing more exotic than a vanilla CPU without wrecking your electric bill The kicker they're doing it with weights that aren't 32bit or 16 bit or even the crunchy 8bit you've heard about Instead every single weight in the network can only be negative 1 0 or positive 1 which averages out to just 1. 58 bits of information In other words they've squeezed the precision so hard that three possible values cover the whole show because the log base 2 of three is 1.

58496 You get the picture Now you might ask hang on don't we already have one bit or 4-bit quantized models We do kind of but most of them were full precision models first and got compressed after the fact That post-training quantization trick saves memory sure yet it generally leaks accuracy like a balloon with a slow pin prick Microsoft flipped the script by training bit first They let the model learn from scratch in turnary so there's no memory of the float point life to miss The result is a two billion parameter transformer trained on a 4 trillion tokens And the team insists it keeps up with heavyweight open- source rivals that still carry all their float baggage Let's talk hardware impact because that's where this gets spicy A regular 2 billion parameter model in full precision parks itself in something like 2 to 5GB of VRAM once you ditch the embedding table Bitnet It strolls in at 0. 4 That slashes the working set so hard that the model fits comfortably in the lcash layers on many CPUs which is why the demo on an Apple M2 chip spits out 5 to 7 tokens per second About the speed at which a human reads a paperback line by line And while it's generating the researchers measured 85 to 96% lower energy draw than similar float models which is basically the difference between idling a Prius and flooring a muscle car Yet of course none of that matters if the answers stink So the team hit it with the usual alphabet soup of benchmarks MMLU GSM8K Arc Challenge Hellwag Pi QA Truthful QA The greatest hits averaged across 17 different tests Bitnet lands a 54. 19% macro score barely a point behind the best floatbased competitor in its weight class llama derived QN 2.

5 which sits at 55. 23 Where Bitnet really flexes is logical reasoning It tops the chart on Ark Challenge with 49. 91% leads Arc Easy at 74.

79 and edges past everyone on the notoriously tricky Winnow Grande with 71. 9 Math isn't a fluke either On GSM 8K it cracks 58. 38 exact match outscoring every other 2 billion regular model on the list and beating Quen's 56.

79 while running on maybe a tenth of the watts If you're wondering how that stacks up against 4-bit post-training tricks the paper spells it out They took Quen 2. 51. 5B threw the standard GPTQ and AWQ int 4 hammers at it and got the memory down to 0.

7GB Nice but still nearly double Bitnet's footprint More importantly the quantized Quen dropped three full points of accuracy to the low 52s while Bitnet held its 55-ish line So the moral is native turnery greater than retrofitted in 4 at least in this neighborhood Now let's peek inside the model I'll use simple examples so even if you're not super technical you'll get the idea Think of a normal AI model as a huge warehouse full of shelves Every shelf is packed with big jars of exact numbers And anytime the model answers a question it has to haul all those jars around Bitnet replaces the jars with tiny color-coded poker chips Red for negative 1 white for zero blue for positive one Because there are only three kinds of chips they weigh almost nothing So the whole warehouse shrinks from a few gigabytes down to about the size of a single mobile game download A little worker called the ABS mean quantizer decides which chip fits each spot and does it live while the model runs Meanwhile the messages racing between shelves get squeezed into kids-sized Lego bricks eight-bit numbers so the corridors stay clear and everything moves faster That chip and brick diet can make the structure wobble So the designers add a sprinkle of balance powder that sub layer norm and they swap a fancy activation function for a simpler squared relu because simple handles rough treatment better They also borrow Llama 3's tokenizer which is like bringing an already filled dictionary so the model doesn't have to learn a new alphabet from scratch Why train it in three rounds Imagine teaching a child First you read them every book in the library at top speed That's the four trillion token pre-training with a high learning rate Halfway through you slow down so the kid stops skimming and starts absorbing details That's the cool down Next you give them practice exams with clear answers the fine-tuning stage so they learn how to talk to people without rambling Here the teachers discovered that adding up the grading points instead of averaging them keeps this low bit brain steadier And because the tiny poker chips don't explode when you poke them they could push the lessons a little harder Finally you show them pairs of answers and say "People like this one better Do more of that. " That's direct preference optimization It's a gentle nudge two short passes with a microscopic learning rate So the student keeps their knowledge but learns some manners Crucially the kid never switches back to heavyweight textbooks It's chips and Lego bricks all the way through so nothing gets lost in translation Running the model needs special plumbing because graphics cards expect normalized jars not chips Microsoft wrote custom software that bundles four chips into a single bite slides that bundle across the GPU highway unpacks it right next to the math engine and multiplies it with those little 8-bit bricks That trick means Bitnet can read 5 to seven words a second using nothing but a laptop C If you don't have a GPU the Bitnet CPP program does the same dance on a regular desktop or Mac You only need about 400 MB of spare memory so even an ultrabook can play The payoff shows up on a simple graph One axis is memory The other is test score smarts Most small open models squat in a blob that needs 2 to 5 GB and scores somewhere in the 50s Bitnet lands way over to the left at 4GB yet floats above 60 on the score line Even bigger rivals that were later crushed down to low bits can't catch it because they still lug more memory and fall a handful of points behind In plain terms Bitnet squeezes more brain power into every bite and every watt which is why it looks like such a leap forward for anyone who wants solid AI on everyday gear Naturally Microsoft isn't calling it job done The final section of the paper reads like a to-do list They want to test how well native one-bit scaling laws hold at 7 and 13 billion parameters and beyond And they're practically begging hardware designers to build accelerators with specialized low-bit logic so the math no longer has to pretend turnary values are int 8 refugees for decades They also admit that the current 4K token context needs stretching for document length tasks that the data was Englishheavy and should branch into multilingual territory and that multimodal text plus headbug hybrids are still uncharted for the Turner approach Plus the theory heads remain puzzled about why such brutal quantization doesn't trash the learning trajectory So expect papers on lost landscapes and bit flip resilience in the months to come But let's zoom out What bitnet B1. 58 really shows is that we might not need a farm of H100s to push useful AI into everyday devices If you can carry a model that rivals the best two billion millimeter floats in a fifth of a gig and push it at reading speed on a single CPU core while sipping 30 milligles a token then smart keyboards offline chat bots and edge device copilots suddenly look plausible without tripping over battery life or data center bill You can grab the packed weights right now on hugging face three formats in fact the inference ready pack the BF-16 master for anyone crazy enough to retrain and a GGUF file for Bitnet CPP And there's even a web demo if you just want to make it tell dad jokes before bedtime Oh yeah everybody loves the giant models with 100,000 token context windows and billion-dollar clusters but Bitnet B 1.

58 is a reminder that sometimes a tight coupe beats a roaring muscle car If the streets are narrow and gas costs a fortune keep an eye on this space Once the hardware catches up and the sequence length grows we might see an allout Turner renaissance Until then grab a CPU download 400 megs of weights and see what a 1.