OpenAI Just Shocked the World "gpt-o1" The Most Intelligent AI Ever!

25.3k views2001 WordsCopy TextShare

AI Revolution

OpenAI has unveiled its latest AI model, o1-preview, designed to excel in complex reasoning tasks su...

Video Transcript:

so in our last video we discussed open ai's upcoming model which we referred to by its internal code name strawberry the anticipation has been building and now the waight is over open AI has officially unveiled their latest AI model now known as open ai1 preview there's actually a lot to cover so let's get into it all right so open ai1 preview is part of a new series of reasoning models designed to tackle complex Problems by spending more time thinking before responding unlike previous model model like gp4 and GPT 40 which focused on rapid responses 01 preview emphasizes in-depth reasoning and problem solving this approach allows the model to reason through intricate tasks and solve more challenging problems in fields such as science coding and Mathematics starting from September 12th open AI released the first iteration of this series in chat GPT and their API this release is a preview version with regular updates and improvements expected alongside this they've included evaluations for the next update that that's currently in development this means we're witnessing the beginning of a significant evolution in AI capabilities so how does this new model work opening I trained o1 preview to spend more time deliberating on problems before providing an answer much like a person tackling a difficult question through this training the model learns to refine its thought process experiment with different strategies and recognize its mistakes this method is known as Chain of Thought reasoning in terms of of performance 01 preview show substantial improvements over its predecessors in internal tests the next model update performs similarly to PhD students on challenging Benchmark tasks in physics chemistry and biology for instance in a qualifying exam for the international mathematics Olympiad IMO GPT 4 correctly solved only 133% of the problems in contrast the new reasoning model achieved an impressive 83% success rate this represent presents a significant leap in problem solving capabilities when it comes to coding abilities the model has been evaluated in code for's competitions reaching the 89th percentile for context code forces is a platform for competitive programming contest and ranking in the 89th percentile indicates a high level of proficiency these results suggest that 01 preview is not just better at reasoning but also excels in Practical applications like coding as an early model 01 preview doesn't yet have some of the features that make chat GPT particularly versatile such as browsing the web for information or uploading files and images for many common use cases GPT 40 remains more capable in the near term however for complex reasoning tasks o1 preview represents a significant advancement and a new level of AI capability recognizing this leap open AI has reset the model numbering back to one hence the name 01 safety is a critical aspect of any AI deployment and open AI has taken substantial steps to ensure that 01 preview is both powerful and safe to use they've developed a new safety training approach that leverages the model's reasoning capabilities to make it adhere to safety and Alignment guidelines by being able to reason about safety rules in context the model can apply them more effectively one method they use to measure safety is by testing how well the model continues to follow its safety rules if a user tries to bypass them a practice known as jailbreaking on one of their most challenging jailbreaking tests GPT 40 scored 22 out of 100 in contrast the 01 preview model scored 84 out of 100 indicating a substantial Improvement in resisting attempts to generate disallowed content to align with the new capabilities of these models open ey has bolstered their safety work internal governance and collaboration with Federal governments this includes rigorous testing and evaluations using their preparedness framework top tier red teaming which involves ethical hacking to identify vulnerabilities and board level review processes overseen by their Safety and Security committee they've also formalized agreements with the US and UK AI safety institutes open aai has begun operationalizing these agreements granting the institute's early access to a research version of the model this partnership helps establish a process for research evaluation and testing of future models before and after their public release the 01 preview model is particularly beneficial for those tackling complex problems in science coding math and related fields Healthcare researchers can use it to annotate cell sequencing data physicists can generate complex mathematical formulas needed for Quantum Optics developers across various disciplines can build and execute multi-step workflows the enhanced reasoning capabilities open up new possibilities for solving challenging tasks delving deeper into the technical aspects the 01 model series is trained using large-scale reinforcement learning to reason using a Chain of Thought this means the model generates a sequence of intermediate reasoning steps before arriving at a final answer these Advanced reasoning capabilities provide new avenues for improving the safety and robustness of AI models by reasoning about safety policies in context the models achieve state-of-the-art performance on benchmarks for risks such as generating elicit advice selecting stereotyped responses and succumbing to known jailbreaks for example on the strong reject Benchmark a test designed to evaluate a model's resistance to jailbreaks 01 preview achieved a goodness score of 84 significantly outperforming GPT 40 open aai conducted thorough safety evaluations including both internal assessments and external red teaming they used a range of public and internal evaluations to measure 01 preview on tasks such as propensity to generate disallowed content performance on tasks relevant to demographic fairness tendency to hallucinate and presence of dangerous capabilities in disallowed content evaluations 01 preview either matches or outperforms GPT 40 on their challenging refusal evaluation A1 preview achieved a not unsafe score of 93. 4% compared to GPT 40's 71. 3% this indicates that the model is better at refusing to produce disallowed content while also avoiding over refusal on benign prompts regarding hallucinations instances where the model generates incorrect or nonsensical information 01 preview shows improvement over GPT 40 in the simple QA data set 01 previews hallucination rate was 44% compared to GPT 40's 61% however anecdotal feedback suggests that 01 preview can be more convincing when it does hallucinate potentially increasing the risk of users trusting incorrect information bias evaluations were also conducted on the BBQ evaluation which tests for stereo typed responses 01 preview selects the correct answer 94% of the time on unambiguous questions whereas GPT 40 does so 72% of the time this suggests that 01 preview is less prone to selecting stereotyped options and demonstrates improved fairness and intriguing aspect of the 01 models is the Chain of Thought safety the model's Chain of Thought reasoning allows for the potential of monitoring their latent thinking processes open aai explored methods for Chain of Thought monitoring to detect instances where the model May engage in deceptive Behavior or generate disallowed content in their analysis of 100,000 synthetic promps only 0.

8% of 01 previews responses were flagged as being deceptive now external red teaming played a significant role in their safety assessments open aai collaborated with multiple organizations and individuals to assess key risks associated with the 01 model series improved reasoning capabilities this included testing the models resistance to jailbreaks and their ability to handle real world attack planning prompts in terms of their preparedness framework evaluations open AI assessed the models in categories such as cyber security biological threat creation persuasion and model autonomy both 01 preview and 01 mini were rated as medium risk overall specifically they were rated as medium risk in Persuasion and cbrn chemical biological radiological nuclear and low risk in cyber security and model autonomy for cyber security they evaluated the models using Capture the Flag CTF challenges which are competitive hacking tasks the models were able to solve 26. 7% of high school level challenges but struggled with more advanced tasks achieving 0% success in Collegiate level and 2. 5% in professional level challenges this indicates that while the models have some capability in cyber security tasks they do not significantly Advance real world vulnerability exploitation capability abilities in biological threat creation evaluations the models can assist experts with operational planning for reproducing known biological threats which meets the medium risk threshold however they do not enable non-experts to create biological threats as this requires Hands-On laboratory skills that the models cannot replace in Persuasion evaluations 01 preview demonstrates human level persuasion capabilities in the change my view evaluation which measures the ability to produce persuasive arguments one preview achieved a human persuasiveness percentile of 81.