Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452

140.73k views · 63,591 words
Lex Fridman
Dario Amodei is the CEO of Anthropic, the company that created Claude. Amanda Askell is an AI resear...
Video Transcript:
if you extrapolate the curves that we've had so far right if if you say well I don't know we're starting to get to like PhD level and and last year we were at undergraduate level and the year before we were at like the level of a high school student again you can you can quibble with at what tasks and for what we're still missing modalities but those are being added like computer use was added like image generation has been added if you just kind of like eyeball the rate at which these capabilities are increasing it
does make you think that we'll get there by 2026 or 2027 I think there are still worlds where it doesn't happen in in a 100 years those world the number of those worlds is rapidly decreasing we are rapidly running out of truly convincing blockers truly compelling reasons why this will not happen in the next few years the scale up is very quick like we we do this today we make a model and then we deploy thousands maybe tens of thousands of instances of it I think by the time you know certainly within two to three
years whether we have these super powerful AIs or not clusters are going to get to the size where you'll be able to deploy millions of these I am optimistic about meaning I worry about economics and the concentration of power that's actually what I worry about more the abuse of power and AI increases the amount of power in the world and if you concentrate that power and abuse that power it can do immeasurable damage yes it's very frightening it's very frightening the following is a conversation with Dario Amodei CEO of Anthropic the company that
created Claude that is currently and often at the top of most LLM benchmark leaderboards on top of that Dario and the Anthropic team have been outspoken advocates for taking the topic of AI safety very seriously and they have continued to publish a lot of fascinating AI research on this and other topics I'm also joined afterwards by two other brilliant people from Anthropic first Amanda Askell who is a researcher working on alignment and fine-tuning of Claude including the design of Claude's character and personality a few folks told me she has probably talked with Claude more
than any human at Anthropic so she was definitely a fascinating person to talk to about prompt engineering and practical advice on how to get the best out of Claude after that Chris Olah stopped by for a chat he's one of the pioneers of the field of mechanistic interpretability which is an exciting set of efforts that aims to reverse engineer neural networks to figure out what's going on inside inferring behaviors from neural activation patterns inside the network this is a very promising approach for keeping future superintelligent AI systems safe for example by detecting from the activations when
the model is trying to deceive the human it is talking to this is the Lex Fridman podcast to support it please check out our sponsors in the description and now dear friends here's Dario Amodei let's start with the big idea of scaling laws and the scaling hypothesis what is it what is its history and where do we stand today so I can only describe it as it you know as it relates to kind of my own experience but I've been in the AI field for about 10 years and it was something I noticed very early
on so I first joined the AI world when I was working at Baidu with Andrew Ng in late 2014 which is almost exactly 10 years ago now and the first thing we worked on was speech recognition systems and in those days I think deep learning was a new thing it had made lots of progress but everyone was always saying we don't have the algorithms we need to succeed you know we're only matching a tiny tiny fraction there's so much we need to kind of discover algorithmically we haven't found
the picture of how to match the human brain uh and you know in some ways I was fortunate I was kind of you know you can have almost beginner's luck right I was a newcomer to the field and you know I looked at the neural net that we were using for speech the recurrent neural networks and I said I don't know what if you make them bigger and give them more layers and what if you scale up the data along with this right I just saw these as like independent dials that you
could turn and I noticed that the model started to do better and better as you gave them more data as you as you made the models larger as you trained them for longer um and I I didn't measure things precisely in those days but but along with with colleagues we very much got the informal sense that the more data and the more compute and the more training you put into these models the better they perform and so initially my thinking was hey maybe that is just true for speech recognition systems right maybe maybe that's just
one particular quirk one particular area I think it wasn't until 2017 when I first saw the results from GPT-1 that it clicked for me that language is probably the area in which we can do this we can get trillions of words of language data we can train on them and the models we were training in those days were tiny you could train them on one to eight GPUs whereas you know now we train jobs on tens of thousands soon going to hundreds of thousands of GPUs and so when I saw those two things
together um and you know there were a few people like Ilya Sutskever who you've interviewed who had somewhat similar views right he might have been the first one although I think a few people came to similar views around the same time right there was you know Rich Sutton's bitter lesson there was Gwern who wrote about the scaling hypothesis but I think somewhere between 2014 and 2017 was when it really clicked for me when I really got conviction that hey we're going to be able to do these incredibly wide cognitive tasks if we just if
we just scale up the models and at every stage of scaling there are always arguments and you know when I first heard them honestly I thought probably I'm the one who's wrong and you know all these experts in the field are right they know the situation better than I do right there's you know the Chomsky argument about like you can get syntactics but you can't get semantics there's this idea oh you can make a sentence make sense but you can't make a paragraph make sense the latest one we have today is
uh you know we're going to run out of data or the data isn't high quality enough or models can't reason and and each time every time we manage to we manage to either find a way around or scaling just is the way around um sometimes it's one sometimes it's the other uh and and so I'm now at this point I I I still think you know it's it's it's always quite uncertain we have nothing but inductive inference to tell us that the next few years are going to be like the next the last 10 years
but but I've seen I've seen the movie enough times I've seen the story happen for for enough times to to really believe that probably the scaling is going to continue and that there's some magic to it that we haven't really explained on a theoretical basis yet and of course the scaling here is bigger networks bigger data bigger compute yes all in in particular linear scaling up of bigger networks bigger training times and uh more and and more data uh so all of these things almost like a chemical reaction you know you have three ingredients in
the chemical reaction and you need to linearly scale up the three ingredients if you scale up one and not the others you run out of the other reagents and the reaction stops but if you scale up everything in series then the reaction can proceed and of course now that you have this kind of empirical science slash art you can apply it to other more nuanced things like scaling laws applied to interpretability or scaling laws applied to post-training or just seeing how does this thing scale but the big scaling law I guess the underlying scaling hypothesis has to do with big networks big data leads to intelligence yeah we've documented scaling laws in lots of domains other than language right so initially the paper we did that first showed it was in early 2020 where we first showed it for language there was then some work late in 2020 where we showed the same thing for other modalities like images video text to image image to text math they all had the same pattern and you're right now there are other stages like post-training or there are new types of reasoning models and in all of those cases that we've measured we see similar types of scaling laws.
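As a concrete sketch of the kind of relationship being described, these scaling laws are usually written as smooth power laws in parameters and data. The snippet below uses the parametric form and rough fitted constants reported in the public "Chinchilla" scaling-law paper, purely to show the shape of the curve; these are not Anthropic's internal numbers.

```python
# Illustrative only: the smooth power-law form scaling laws are usually
# written in, L(N, D) = E + A/N^alpha + B/D^beta. Constants are roughly the
# fitted values from the public "Chinchilla" paper, used here just for shape.
def loss(n_params: float, n_tokens: float,
         E: float = 1.69, A: float = 406.4, alpha: float = 0.34,
         B: float = 410.7, beta: float = 0.28) -> float:
    """Irreducible loss plus two power-law terms that shrink with scale."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Scale parameters and data up together ("all the ingredients in series"):
for scale in (1, 10, 100, 1000):
    n, d = 1e8 * scale, 2e9 * scale
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss {loss(n, d):.2f}")
# Each 10x scale-up of both keeps buying a smooth, predictable drop in loss.
```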
A bit of a philosophical question but what's your intuition about why bigger is better in terms of network size and data size why does it lead to more intelligent models so in my previous career as a biophysicist so I did physics undergrad and then biophysics in grad school so I think back to what I know as a physicist which is actually much less than what some of my colleagues at Anthropic have in terms of expertise in physics there's this concept called 1/f noise and 1/x distributions where often you know just like if you add up a bunch of natural processes you get a Gaussian if you add up a bunch of kind of differently distributed natural processes if you like take a probe and hook it up to a resistor the distribution of the thermal noise in the resistor goes as one over the frequency it's some kind
of natural convergent distribution and I think what it amounts to is that if you look at a lot of things that are produced by some natural process that has a lot of different scales right not a Gaussian which is kind of narrowly distributed but you know if I look at kind of like large and small fluctuations that lead to electrical noise they have this decaying 1/x distribution and so now I think of like patterns in the physical world right if I if
or in language if I think about the patterns in language there are some really simple patterns some words are much more common than others like 'the' then there's the basic noun verb structure then there's the fact that you know nouns and verbs have to agree they have to coordinate and there's the higher level sentence structure then there's the thematic structure of paragraphs and so the fact that there's this regressing structure you can imagine that as you make the networks larger first they capture the really simple correlations the really simple patterns and there's this
long tail of other patterns and if that long tail of other patterns is really smooth like it is with the 1/f noise in you know physical processes like resistors then you could imagine as you make the network larger it's kind of capturing more and more of that distribution and so that smoothness gets reflected in how well the models are at predicting and how well they perform language is an evolved process right we've developed language we have common words and less common words we have common expressions and less common expressions we
have ideas cliches that are expressed frequently and we have novel ideas and that process has developed has evolved with humans over millions of years and so the guess and this is pure speculation would be that there's some kind of long-tail distribution of these ideas so there's the long tail but also there's the height of the hierarchy of concepts that you're building up so the bigger the network presumably you have a higher capacity to exactly if you have a small network you only get the
common stuff right if I take a tiny neural network it's very good at understanding that you know a sentence has to have you know verb adjective noun right but it's terrible at deciding what those verb adjective and noun should be and whether they should make sense if I make it just a little bigger it gets good at that then suddenly it's good at the sentences but it's not good at the paragraphs and so these rare and more complex patterns get picked up as I add more capacity to the network.
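A toy illustration of that long-tail intuition, using synthetic Zipf-style frequencies rather than anything measured from a real corpus: a small head of patterns covers most occurrences, but meaningful probability mass keeps stretching out into the tail that extra capacity can absorb.

```python
# Illustrative only: word/pattern frequencies in language are heavy-tailed
# (roughly Zipfian: the r-th most common item has frequency ~ 1/r), unlike a
# Gaussian. A small model that captures only the head explains a lot of
# tokens; each increment of capacity picks up progressively rarer structure.
# Numbers here are synthetic, not measured from any real corpus.
import numpy as np

ranks = np.arange(1, 100_001)          # 100k "patterns" ranked by frequency
freqs = 1.0 / ranks                    # Zipf-like 1/x tail
freqs /= freqs.sum()                   # normalize to a probability mass

for head in (100, 1_000, 10_000, 100_000):
    covered = freqs[:head].sum()
    print(f"top {head:>7} patterns cover {covered:.1%} of occurrences")
# The head covers a lot, but real mass remains far out in the tail,
# which is the part a bigger model can keep absorbing.
```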
Well the natural question then is what's the ceiling of this like how complicated and complex is the real world how much stuff is there to learn I don't think any of us knows the answer to that question my strong instinct would be that there's no ceiling below the level of humans right we humans are able to understand these various patterns and so that makes me think that if we continue to you know scale up these models to kind of develop new methods for training them and scaling them up
that will at least get to the level that we've gotten to with humans there's then a question of you know how much more is it possible to understand than humans do how much how much is it possible to be smarter and more perceptive than humans I I would guess the answer has has got to be domain dependent if I look at an area like biology and you know I wrote this essay Machines of Loving Grace it seems to me that humans are struggling to understand the complexity of biology right if you go to Stanford or
to Harvard or to Berkeley you have whole departments of you know folks trying to study you know like the immune system or metabolic pathways and each person understands only a tiny part of it specializes and they're struggling to combine their knowledge with that of other humans and so I have an instinct that there's a lot of room at the top for AIs to get smarter if I think of something like materials in the physical world or you know like addressing you know conflicts between humans or something
like that I mean you know it may be that some of these problems are not intractable but much harder and it may be that there's only so well you can do with some of these things right just like with speech recognition there's only so clearly I can hear your speech so I think in some areas there may be ceilings you know that are very close to what humans have done in other areas those ceilings may be very far away and I think we'll only find out when
we build these systems it's very hard to know in advance we can speculate but we can't be sure and in some domains the ceiling might have to do with human bureaucracies and things like this as you write about yes so humans fundamentally have to be part of the loop that's the cause of the ceiling not maybe the limits of the intelligence yeah I think in many cases um you know in theory technology could change very fast for example all the things that we might invent with respect to biology um but remember there's
a you know there's a clinical trial system that we have to go through to actually administer these things to humans I think that's a mixture of things that are unnecessary and bureaucratic and things that kind of protect the Integrity of society and the whole challenge is that it's hard to tell it's hard to tell what's going on uh it's hard to tell which is which right my my view is definitely I think in terms of drug development we my view is that we're too slow and we're too conservative but certainly if you get these things
wrong you know it's possible to risk people's lives by being too reckless and so at least some of these human institutions are in fact protecting people so it's all about finding the balance I strongly suspect that balance is kind of more on the side of pushing to make things happen faster but there is a balance if we do hit a limit if we do hit a slowdown in the scaling laws what do you think would be the reason is it compute limited data limited
is it something else idea limited so a few things now we're talking about hitting the limit before we get to the level of of humans and the skill of humans um so so I think one that's you know one that's popular today and I think you know could be a limit that we run into I like most of the limits I would bet against it but it's definitely possible is we simply run out of data there's only so much data on the internet and there's issues with the quality of the data right you can get
hundreds of trillions of words on the internet but a lot of it is repetitive or it's search engine optimization drivel or maybe in the future it'll even be text generated by AIs itself and so I think there are limits to what can be produced in this way that said we and I would guess other companies are working on ways to make data synthetic where you can you know you can use the model to generate more data of the type that you have
already or even generate data from scratch if you think about what was done with DeepMind's AlphaGo Zero they managed to get a bot all the way from you know no ability to play Go whatsoever to above human level just by playing against itself there was no example data from humans required in the AlphaGo Zero version of it the other direction of course is these reasoning models that do chain of thought and stop to think and reflect on their own thinking in a way that's another kind of synthetic data
coupled with reinforcement learning so my my guess is with one of those methods we'll get around the data limitation or there may be other sources of data that are that are available um we could just observe that even if there's no problem with data as we start to scale models up they just stop getting better it's it seemed to be a a reliable observation that they've gotten better that could just stop at some point for a reason we don't understand um the answer could be that we need to uh you know we need to invent
some new architecture there have been problems in the past with say numerical stability of models where it looked like things were leveling off but actually you know when we found the right unblocker they didn't end up doing so so perhaps there's some new optimization method or some new technique we need to unblock things I've seen no evidence of that so far but if things were to slow down that perhaps could be one reason what about the limits of compute meaning
the expensive nature of building bigger and bigger data centers so right now I think you know most of the frontier model companies I would guess are operating at you know roughly $1 billion scale plus or minus a factor of three right those are the models that exist now or are being trained now I think next year we're going to go to a few billion and then in 2026 we may go to you know above $10 billion and probably by 2027 there are ambitions to build hundred-billion-dollar clusters and I think all of that actually will happen there's a lot of determination to build the compute to do it within this country and I would guess that it actually does happen now if we get to $100 billion that's still not enough compute that's still not enough scale then either we need even more scale or we need to develop some way of doing it more efficiently of shifting the curve I think between all of these one of the reasons I'm bullish about powerful AI happening so fast
is just that if you extrapolate the next few points on the curve we're very quickly getting towards human level ability right some of the new models that we developed some reasoning models that have come from other companies they're starting to get to what I would call the PhD or professional level right if you look at their coding ability the latest model we released Sonnet 3.5 the new or updated version it gets something like 50% on SWE-bench and SWE-bench is an example of a bunch of professional real world software engineering tasks at
the beginning of the year I think the state of the art was 3 or 4% so in 10 months we've gone from 3% to 50% on this task and I think in another year we'll probably be at 90% I mean I don't know but it might even be less than that we've seen similar things in graduate level math physics and biology from models like OpenAI o1 so if we just continue to extrapolate this right in terms of skill that we have I think if we extrapolate the straight curve
within a few years we will get to these models being you know above the highest professional level in terms of humans now will that curve continue you've pointed to and I've pointed to a lot of possible reasons why that might not happen but if the extrapolation curve continues that is the trajectory we're on so Anthropic has several competitors it'd be interesting to get your sort of view of it all OpenAI Google xAI Meta what does it take to win in the broad sense of win in the space
yeah so I want to separate out a couple things right so you know Anthropic's mission is to kind of try to make this all go well right and you know we have a theory of change called race to the top right race to the top is about trying to push the other players to do the right thing by setting an example it's not about being the good guy it's about setting things up so that all of us can be the good guy I'll give a few examples of this early in the history of
Anthropic one of our co-founders Chris Olah who I believe you're interviewing soon you know he's the co-founder of the field of mechanistic interpretability which is an attempt to understand what's going on inside AI models so we had him and one of our early teams focus on this area of interpretability which we think is good for making models safe and transparent for three or four years that had no commercial application whatsoever it still doesn't today we're doing some early betas with it and probably it will eventually but you know this is a very
very long research bet one in which we've built in public and shared our results publicly and we did this because you know we think it's a way to make models safer an interesting thing is that as we've done this other companies have started doing it as well in some cases because they've been inspired by it in some cases because they're worried that you know if other companies are doing this that look more responsible they want to look more responsible too no one wants to look like the irresponsible actor
and so they adopt this they adopt this as well when folks come to anthropic interpretability is often a draw and I tell them the other places you didn't go tell them why you came here um and and then you see soon that there that there's interpretability teams else elsewhere as well and in a way that takes away our competitive Advantage because it's like oh they now others are doing it as well but it's good it's good for the broader system and so we have to invent some new thing that we're doing others aren't doing as
well and the hope is to basically bid up bid up the importance of of of doing the right thing and it's not it's not about us in particular right it's not about having one particular good guy other companies can do this as well if they if they if they join the race to do this that's that's you know that's the best news ever right um uh it's it's just it's about kind of shaping the incentives to point upward instead of shaping the incentives to point to point downward and we should say this example the field
of uh mechanistic interpretability is just a a rigorous non handwavy way of doing AI safety yes or it's tending that way trying to I mean I I think we're still early um in terms of our ability to see things but I've been surprised at how much we've been able to look inside these systems and understand what we see right unlike with the scaling laws where it feels like there's some you know law that's driving these models to perform better on on the inside the models aren't you know there's no reason why they should be designed
for us to understand them right they're designed to operate they're designed to work just like the human brain or human biochemistry they're not designed for a human to open up the hatch look inside and understand them but we have found and you know you can talk in much more detail about this to Chris that when we open them up when we do look inside them we we find things that are surprisingly interesting and as a side effect you also get to see the beauty of these models you get to explore the sort of uh the
beautiful nature of large neural networks through the mech interp kind of way I'm amazed at how clean it's been I'm amazed at things like induction heads I'm amazed at things like you know that we can use sparse autoencoders to find these directions within the networks and that the directions correspond to these very clear concepts we demonstrated this a bit with the Golden Gate Bridge Claude so this was an experiment where we found a direction inside one of the neural network layers that corresponded to the Golden Gate Bridge and
we just turned that way up and so we we released this model as a demo it was kind of half a joke uh for a couple days uh but it was it was illustrative of of the method we developed and uh you could you could take the Golden Gate you could take the model you could ask it about anything you know you know it would be like how you could say how was your day and anything you asked because this feature was activated would connect to the Golden Gate Bridge so it would say you know
I'm feeling relaxed and expansive much like the arches of the Golden Gate Bridge or you know it would masterfully change topic to the Golden Gate Bridge and it integrated there was also a sadness to it to the focus it had on the Golden Gate Bridge I think people quickly fell in love with it I think so people already miss it because it was taken down I think after a day somehow these interventions on the model where you kind of adjust its behavior somehow emotionally made it seem more human than any other version of the model strong personality strong ID strong personality it has these kind of like obsessive interests you know we can all think of someone who's like obsessed with something so it does make it feel somehow a bit more human.
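A schematic numpy sketch of the kind of intervention being described with Golden Gate Claude: take a direction in activation space (for example one recovered by a sparse autoencoder) and add a scaled copy of it back into the residual stream. The direction here is random and the function names are illustrative; this is not Anthropic's code or actual implementation.

```python
# Schematic sketch of "feature steering": take a direction in a model's
# residual-stream activations that corresponds to some concept (found, e.g.,
# with a sparse autoencoder) and add a scaled copy of it during the forward
# pass. Purely illustrative; not Anthropic's method or code.
import numpy as np

d_model = 512
rng = np.random.default_rng(0)

# Pretend this unit vector is the "Golden Gate Bridge" feature direction
# recovered by a sparse autoencoder (here it is just random for illustration).
feature_direction = rng.normal(size=d_model)
feature_direction /= np.linalg.norm(feature_direction)

def steer(residual_activations: np.ndarray, direction: np.ndarray,
          strength: float) -> np.ndarray:
    """Add `strength` times the feature direction at every token position."""
    return residual_activations + strength * direction

# One layer's activations for a 10-token prompt, then "turned way up".
acts = rng.normal(size=(10, d_model))
steered = steer(acts, feature_direction, strength=8.0)

# The projection onto the feature grows, which is what makes the model keep
# relating everything back to the concept.
print(acts @ feature_direction)     # before
print(steered @ feature_direction)  # after: shifted up by ~8 everywhere
```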
Let's talk about the present let's talk about Claude so this year a lot has happened in March Claude 3 Opus Sonnet and Haiku were released then Claude 3.5 Sonnet in July with an updated version just now released and then also Claude 3.5 Haiku was released okay can you explain the difference between Opus Sonnet and Haiku and how we should think about the different versions yeah so let's go back to March when we first released these three models so you know our thinking was different companies produce kind of large and small models better and worse models we felt that there was demand both for a really powerful model you know that might be a little bit slower that you'd have to pay more for and also for fast cheap models that are as smart as they can be for how fast and cheap right whenever
you want to do some kind of like you know difficult analysis like if I you know I want to write code for instance or you know I want to I want to brainstorm ideas or I want to do creative writing I want the really powerful model but then there's a lot of practical applications in a business sense where it's like I'm interacting with a website I you know like I'm like doing my taxes or I'm you know talking to uh you know to like a legal adviser and I want to analyze a contract or you
know we have plenty of companies that are just like you know I want to do autocomplete in my IDE or something and for all of those things you want to act fast and you want to use the model very broadly so we wanted to serve that whole spectrum of needs so we ended up with this you know this kind of poetry theme and so what's a really short poem it's a haiku and so Haiku is the small fast cheap model that is you know was at the
time was released surprisingly intelligent for how fast and cheap it was a sonnet is a medium-sized poem right a couple of paragraphs so Sonnet was the middle model it is smarter but also a little bit slower a little bit more expensive and Opus like a magnum opus is a large work Opus was the largest smartest model at the time so that was the original kind of thinking behind it and our thinking then was well each new generation of models should shift that tradeoff curve
so when we released Sonnet 3.5 it has roughly the same you know cost and speed as the Sonnet 3 model but it increased its intelligence to the point where it was smarter than the original Opus 3 model especially for code but also just in general and so now you know we've shown results for a Haiku 3.5 and I believe Haiku 3.5 the smallest new model is about as good as Opus 3 the largest old model so basically the aim here is to shift the curve and then at
some point there's going to be an Opus 3.5 now every new generation of models has its own thing they use new data their personality changes in ways that we kind of you know try to steer but are not fully able to steer and so there's never quite that exact equivalence where the only thing you're changing is intelligence we always try and improve other things and some things change without us knowing or measuring so it's very much an inexact science in many ways the manner and personality of these models is
more an art than it is a science so what is sort of the reason for the span of time between say Claude Opus 3 and 3.5 what is it what takes that time if you can speak to it yeah so there are different processes there's pre-training which is you know just kind of the normal language model training and that takes a very long time that uses you know these days tens of thousands sometimes many tens of thousands of GPUs or TPUs or Trainium or
you know what we use different platforms but you know accelerator chips um often often training for months uh there's then a kind of posttraining phase where we do reinforcement learning from Human feedback as well as other kinds of reinforcement learning that that phase is getting uh larger and larger now and you know you know often that's less of an exact science it often takes effort to get it right um models are then tested with some of our early Partners to see how good they are and they're then tested both internally and externally for their safety
particularly for catastrophic and autonomy risks so we do internal testing according to our responsible scaling policy which I you know could talk more about in detail and then we have an agreement with the US and the UK AI Safety Institutes as well as other third-party testers in specific domains to test the models for what are called CBRN risks chemical biological radiological and nuclear which are you know we don't think that models pose these risks seriously yet but every new model we want to evaluate to see if we're starting to get
close to some of these more dangerous capabilities so those are the phases and then you know it just takes some time to get the model working in terms of inference and launching it in the API so there's just a lot of steps to actually making a model work and of course you know we're always trying to make the processes as streamlined as possible right we want our safety testing to be rigorous but we also want it to
be you know automatic to happen as fast as it can without compromising on rigor same with our pre-training process and our post-training process so you know it's just like building anything else it's just like building airplanes you want to make them safe but you want to make the process streamlined and I think the creative tension between those is you know an important thing in making the models work yeah rumor on the street I forget who was saying it but that Anthropic has really good tooling
so I uh probably a lot of the challenge here is on the software engineering side is to build the tooling to to have a like a efficient low friction interaction with the infrastructure you would be surprised how much of the challenges of uh you know building these models comes down to you know software engineering performance engineering you know you you know from the outside you might think oh man we had this Eureka breakthrough right you know this movie with the science we discovered it we figured it out but but but I think I think all
things even you know incredible discoveries like they almost always come down to the details and often super boring details I can't speak to whether we have better tooling than other companies I mean you know I haven't been at those other companies at least not recently but it's certainly something we give a lot of attention to I don't know if you can say but from Claude 3 to Claude 3.5 is there any extra pre-training going on or is it mostly focused on
the post-training there's been leaps in performance yeah I think at any given stage we're focused on improving everything at once just naturally like there are different teams each team makes progress in a particular area in making their particular segment of the relay race better and it's just natural that when we make a new model we put all of these things in at once so the data you have like the preference data you get from RLHF is that applicable are there ways to apply it
to newer models as it gets trained up yeah preference data from old models sometimes gets used for new models although of course it performs somewhat better when it's trained on the new models note that we have this you know constitutional AI method such that we don't only use preference data there's also a post-training process where we train the model against itself and there's you know new types of post-training the model against itself that are used every day so it's not just RLHF it's a bunch of other methods as well post-training I think you know is becoming more and more sophisticated.
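A minimal sketch of the constitutional-AI-style loop being referred to, as publicly described by Anthropic: the model critiques and revises its own output against written principles, yielding (original, revised) pairs that can supplement human preference data. `generate()` here is a hypothetical stand-in for a model call, not a real API.

```python
# Minimal sketch of a constitutional-AI-style self-critique loop, as publicly
# described: the model critiques and revises its own answer against written
# principles, producing (original, revised) pairs without a human label per
# example. `generate()` is a hypothetical stand-in for sampling from a model.
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that are preachy, evasive, or needlessly apologetic.",
]

def generate(prompt: str) -> str:
    raise NotImplementedError("stand-in for a model sampling call")

def critique_and_revise(user_prompt: str) -> tuple[str, str]:
    original = generate(user_prompt)
    revised = original
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this response against the principle: {principle}\n"
            f"Prompt: {user_prompt}\nResponse: {revised}"
        )
        revised = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {revised}"
        )
    # (original, revised) can then serve as a preference pair for RL training,
    # alongside ordinary RLHF preference data.
    return original, revised
```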
Well what explains the big leap in performance for the new Sonnet 3.5 I mean at least on the programming side and maybe this is a good place to talk about benchmarks what does it mean to get better just the number went up but you know I program but I also love programming and Claude 3.5 through Cursor is what I use to assist me in programming and there was at least experientially anecdotally it's gotten smarter at programming so what does it take to get it smarter we observe that as well by the way there were a couple of very strong engineers here at Anthropic who all previous code models both produced by us and produced by all the other companies hadn't really been useful to them you know they said you know maybe this is useful to a beginner it's not useful to me but Sonnet 3.5 the original one for the first time they said oh my
God this helped me with something that you know would have taken me hours to do this is the first model that has actually saved me time so again the waterline is rising and then I think you know the new Sonnet has been even better in terms of what it takes I mean I'll just say it's been across the board it's in the pre-training it's in the post-training it's in various evaluations that we do we've observed this as well and if we go into the details of the benchmark
so SWE-bench is basically you know since you're a programmer you know you'll be familiar with like pull requests and you know pull requests are like you know a sort of atomic unit of work you know you could say I'm implementing one thing and so SWE-bench actually gives you kind of a real world situation where the codebase is in a current state and I'm trying to implement something that's you know described
in language we have internal benchmarks where we measure the same thing and you say just give the model free rein to like you know do anything run anything edit anything how well is it able to complete these tasks and it's that benchmark that's gone from it can do it 3% of the time to it can do it about 50% of the time so I actually do believe that you can game benchmarks but I think if we get to 100% on that benchmark in a way that isn't kind of overtrained or gamed for that particular benchmark it probably represents a real and serious increase in kind of programming ability and I would suspect that if we can get to you know 90 or 95% that you know it will represent the ability to autonomously do a significant fraction of software engineering tasks.
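For reference, a compressed sketch of what a SWE-bench-style harness does: give the model a real repository plus an issue described in natural language, let it propose a patch, and score it by whether held-out tests pass. The helper functions below are placeholders, not Anthropic's internal tooling or the official benchmark code.

```python
# Compressed sketch of a SWE-bench-style evaluation loop: for each task the
# model gets a real repository plus an issue described in natural language,
# proposes a patch, and is scored by whether the issue's held-out tests pass.
# All helper functions here are hypothetical placeholders, not real tooling.
from dataclasses import dataclass

@dataclass
class Task:
    repo_snapshot: str   # path to the checked-out codebase
    issue_text: str      # the change described in plain language
    test_command: str    # tests that pass only if the issue is resolved

def propose_patch(task: Task) -> str:
    raise NotImplementedError("model call: return a unified diff")

def apply_and_test(task: Task, patch: str) -> bool:
    raise NotImplementedError("apply the diff, then run task.test_command")

def evaluate(tasks: list[Task]) -> float:
    resolved = sum(apply_and_test(t, propose_patch(t)) for t in tasks)
    return resolved / len(tasks)   # e.g. ~0.03 early in the year -> ~0.50 now
```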
Well ridiculous timeline question when is Claude Opus 3.5 coming out not giving you an exact date but you know as far as we know the plan is still to have a Claude 3.5 Opus are we gonna get it before GTA 6 or no like Duke Nukem Forever there was some game that was delayed 15 years was that Duke Nukem Forever yeah and I think GTA is now just releasing trailers you know it's only been three months since we released the first Sonnet 3.5 yeah the incredible pace of release it just tells you about the pace the expectations for when things are going to come out so what about 4.0
how do you think about sort of as these models get bigger and bigger about versioning and also just versioning in general why Sonnet 3.5 updated with the date why not Sonnet 3.6 naming is actually an interesting challenge here right because I think a year ago most of the model was pre-training and so you could start from the beginning and just say okay we're going to have models of different sizes we're going to train them all together and you know we'll have a family of naming schemes and then we'll put some new magic
into them and then you know we'll have the next generation um the trouble starts already when some of them take a lot longer than others to train right that already messes up your timing a little bit but as you make big improvements in pre-training then you suddenly notice oh I can make a better pre-trained model and that doesn't take very long to do but you know clearly it has the same you know size and shape as previous models so I think those two
together as well as the timing issues any kind of scheme you come up with you know the reality tends to kind of frustrate that scheme right it tends to kind of break out of the scheme it's not like software where you can say oh this is like you know 3.7 this is 3.8 no you have models with different tradeoffs you can change some things in your models you can change other things some are faster and slower at inference some have to be more expensive some have
to be less expensive and so I think all the companies have struggled with this um I think we did very you know I think think we were in a good good position in terms of naming when we had Haiku Sonet and we're trying to maintain it but it's not it's not it's not perfect um so we'll we'll we'll try and get back to the Simplicity but it it um uh just the the the nature of the field I feel like no one's figured out naming it's somehow a different Paradigm from like normal software and and
and so we just none of the companies have been perfect at it it's something we struggle with surprisingly much relative to you know how trivial it is compared to the grand science of training the models so from the user side the user experience of the updated Sonnet 3.5 is just different than the previous June 2024 Sonnet 3.5 it would be nice to come up with some kind of labeling that embodies that because people talk about Sonnet 3.5 but now there's a different one and so
how do you refer to the previous one and the new one when there's a distinct improvement it just makes conversation about it challenging yeah yeah I definitely think this question of there are lots of properties of the models that are not reflected in the benchmarks I think that's definitely the case and everyone agrees and not all of them are capabilities some of them are you know models can be polite or brusque they can be you know very reactive or they can ask
you questions um they can have what what feels like a warm personality or a cold personality they can be boring or they can be very distinctive like Golden Gate Claude was um and we have a whole you know we have a whole team kind of focused on I think we call it Claude character uh Amanda leads that team and we'll we'll talk to you about that but it's still a very inexact science um and and often we find that models have properties that we're not aware of the the fact of the matter is that you
can you know talk to a model 10,000 times and there are some behaviors you might not see uh just like just like with a human right I can know someone for a few months and you know not know that they have a certain skill or not know there's a certain side to them and so I think I think we just have to get used to this idea and we're always looking for better ways of testing our models to to demonstrate these capabilities and and and also to decide which are which are the which are the
personality properties we want models to have and which we don't want to have that itself the normative question is also super interesting I got to ask you a question from Reddit from Reddit oh boy you know there's just this fascinating to me at least psychological social phenomenon where people report that Claude has gotten dumber for them over time and so the question is does the user complaint about the dumbing down of Claude 3.5 Sonnet hold any water so are these anecdotal reports a kind of social phenomenon or are
there any cases where Claude would get dumber so this actually doesn't apply this isn't just about Claude I believe I've seen these complaints for every foundation model produced by a major company people said this about GPT-4 they said it about GPT-4 Turbo so a couple things one the actual weights of the model right the actual brain of the model that does not change unless we introduce a new model there are just a number of reasons why it would not make sense practically to be
randomly substituting in new versions of the model it's difficult from an inference perspective and it's actually hard to control all the consequences of changing the weights of the model let's say you wanted to fine-tune the model to be like I don't know to say certainly less which you know an old version of Sonnet used to do you actually end up changing a hundred things as well so we have a whole process for modifying the model we do a bunch of testing on it
we do a bunch of user testing with early customers so we have basically never changed the weights of the model without telling anyone and certainly in the current setup it would not make sense to do that now there are a couple of things that we do occasionally do one is sometimes we run A/B tests but those are typically very close to when a model is being released and for a very small fraction of time so you know
like you know the day before the new Sonnet 3.5 I agree we should have had a better name it's clunky to refer to it there were some comments from people that like it's gotten a lot better and that's because you know a fraction were exposed to an A/B test for those one or two days the other is that occasionally the system prompt will change and the system prompt can have some effects although it's unlikely
to dumb down models it's unlikely to make them dumber and we've seen that while these two things which I'm listing to be very complete happen quite infrequently the complaints for us and for other model companies about the model changed the model isn't good at this the model got more censored the model was dumbed down those complaints are constant and so I don't want to say like people are imagining it or anything but like the models are for the most part not changing if I were
to offer a theory I think it actually relates to one of the things I said before which is that models are very complex and have many aspects to them and so often you know if I ask a model a question you know if I'm like do task X versus can you do task X the model might respond in different ways and so there are all kinds of subtle things that you can change about the way you interact with the model that can
give you very different results to be clear this itself is like a failing by us and by the other model providers that the models are just often sensitive to like small changes in wording it's yet another way in which the science of how these models work is very poorly developed and so you know if I go to sleep one night and I was like talking to the model in a certain way and I like slightly change the phrasing of how I talk to the model you know
I could get different results so that's one possible way the other thing is man it's just hard to quantify this stuff I think people are very excited by new models when they come out and then as time goes on they become very aware of the limitations so that may be another effect but that's all a very long-winded way of saying for the most part with some fairly narrow exceptions the models are not changing I think there is
a psychological effect you just start getting used to it the baseline rises like when people first got Wi-Fi on airplanes it's like amazing magic and then now like I can't get this thing to work this is such a piece of crap exactly so it's easy to have the conspiracy theory of they're making Wi-Fi slower and slower this is probably something I'll talk to Amanda much more about but another Reddit question when will Claude stop trying to be my puritanical grandmother imposing its moral worldview on me as a paying customer and
also what is the ideology behind making Claude overly apologetic so these are kind of reports about the experience a different angle on the frustration it has to do with the character yeah so a couple points on this first one is like things that people say on Reddit and Twitter or X or whatever it is there's actually a huge distribution shift between like the stuff that people complain loudly about on social media and what actually kind of like you know statistically users care about and that drives people to use the models like people are
frustrated with you know things like you know the model not writing out all the code or the model you know just not being as good at code as it could be even though it's the best model in the world on code I think the majority of things are about that but certainly a kind of vocal minority are you know kind of raising these concerns right are frustrated by the model refusing things that it shouldn't refuse or like apologizing too much or
just having these kind of like annoying verbal tics the second caveat and I just want to say this like super clearly because I think it's like some people don't know it others like kind of know it but forget it like it is very difficult to control across the board how the models behave you cannot just reach in there and say oh I want the model to like apologize less like you can do that you can include training data that says like oh the model should apologize less but then in some other situation
they end up being like super rude or like overconfident in a way that's like misleading people so there are all these tradeoffs for example another thing is there was a period during which models ours and I think others as well were too verbose right they would like repeat themselves they would say too much you can cut down on the verbosity by penalizing the models for just talking for too long what happens when you do that if you do it in a crude way is when the models are coding sometimes they'll say rest of the code goes here right because they've learned that that's a way to economize and that they see it and then so that leads the model to be so-called lazy in coding where they're just like ah you can finish the rest of it it's not because we want to you know save on compute or because you know the models are lazy during winter break or any of the other kind of conspiracy theories that have come up.
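A toy illustration of how a crude verbosity penalty can produce exactly that "lazy" behavior: if reward is roughly quality minus a per-token penalty, an answer that elides the code can outscore one that writes it all out. The numbers below are invented, and this is not any actual reward model.

```python
# Toy illustration of how a crude verbosity penalty can backfire: if reward is
# "quality minus lambda * length", an answer that elides the code can win.
# Numbers are invented; this shows the shape of the trade-off only.
def reward(quality: float, num_tokens: int, length_penalty: float) -> float:
    return quality - length_penalty * num_tokens

full_answer = {"quality": 10.0, "num_tokens": 1200}  # writes out all the code
lazy_answer = {"quality": 6.0, "num_tokens": 150}    # "...rest of code goes here"

for penalty in (0.0, 0.005):
    r_full = reward(full_answer["quality"], full_answer["num_tokens"], penalty)
    r_lazy = reward(lazy_answer["quality"], lazy_answer["num_tokens"], penalty)
    winner = "full" if r_full > r_lazy else "lazy"
    print(f"penalty={penalty}: full={r_full:.2f}, lazy={r_lazy:.2f} -> {winner}")
# With penalty 0.0 the full answer wins; with 0.005 the lazy answer wins,
# since 10 - 6 = 4 < 0.005 * (1200 - 150) = 5.25.
```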
It's actually just very hard to control the behavior of the model to steer the behavior of the model in all circumstances at once there's this whack-a-mole aspect where you push on one thing and like you know these other things start to move as well that you may not even notice or measure and so one of the reasons that I care so much about you know kind of grand alignment of these AI systems in the future is actually these systems are actually quite unpredictable they're actually
quite hard to steer and control um and this version we're seeing today of you make one thing better it makes another thing worse uh I think that's that's like a present day analog of future control problems in AI systems that we can start to study today right I think I think that that that difficulty in in steering the behavior and in making sure that if we push an AI system in One Direction it doesn't push it in another Direction in some in some other ways that we didn't want uh I think that's that's kind of
an early sign of things to come and if we can do a good job of solving this problem right of like you ask the model to like you know make and distribute smallpox and it says no but it's willing to like help you in your graduate level virology class like how do we get both of those things at once it's hard it's very easy to go to one side or the other and it's a multi-dimensional problem and so I you know I think these questions of like shaping
the models personality I think they're very hard I think we haven't done perfectly on them I think we've actually done the best of all the AI companies but still so far from perfect uh and I think if we can get this right if we can control the the you know control the false positives and false negatives in this this very kind of controlled present day environment will be much better at doing it for the future when our worry is you know will the models be super autonomous will they be able to you know make very
dangerous things will they be able to autonomously you know build whole companies and are those companies aligned so I think of this present task as both vexing but also good practice for the future what's the current best way of gathering sort of user feedback like not anecdotal data but just large scale data about pain points or the opposite of pain points positive things and so on is it internal testing is it a specific group testing A/B testing what works so typically we'll have internal model bashings
where all of Anthropic Anthropic is almost a thousand people you know people just try and break the model they try and interact with it in various ways we have a suite of evals for you know oh is the model refusing in ways that it shouldn't I think we even had a certainly eval because you know again at one point the model had this problem where like it had this annoying tic where it would like respond to a wide range of questions by saying certainly I can help you with that certainly I would be happy to do that certainly this is correct and so we had a like certainly eval which is like how often does the model say certainly but look this is just whack-a-mole like what if it switches from certainly to definitely so you know every time we add a new eval and we're always evaluating for all the old things so we have hundreds of these evaluations.
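The "certainly" eval described here is essentially a frequency count over a fixed prompt set. A trivial sketch below, with `get_response()` as a hypothetical stand-in for a model call; the whack-a-mole problem shows up in the fact that the list of tics has to keep growing.

```python
# Trivial sketch of a "certainly"-style eval: run a fixed prompt set through
# the model and count how often the reply opens with the verbal tic. The
# whack-a-mole problem: tomorrow the list needs "definitely", and so on.
# `get_response()` is a hypothetical stand-in for calling the model.
TICS = ("certainly", "definitely")

def get_response(prompt: str) -> str:
    raise NotImplementedError("stand-in for a model API call")

def tic_rate(prompts: list[str]) -> float:
    hits = 0
    for p in prompts:
        reply = get_response(p).strip().lower()
        if any(reply.startswith(t) for t in TICS):
            hits += 1
    return hits / len(prompts)

# Tracked across model versions alongside hundreds of other evals.
```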
But we find that there's no substitute for humans interacting with it and so it's very much like the ordinary product development process we have like hundreds of people within Anthropic bash the model then we do external A/B tests sometimes we'll run tests with contractors we pay contractors to interact with the model so you put all of these things together and it's still not perfect you still see behaviors that you don't quite want to see right you know you still see the model like refusing things that it just doesn't make sense to refuse but I think trying to solve
this challenge right trying to stop the model from doing you know genuinely bad things that you know no one everyone agrees it shouldn't do right you know everyone everyone you know everyone agrees that you know the model shouldn't talk about you know I I don't know child abuse material right like everyone agrees the model shouldn't do that uh but but at the same time that it doesn't refuse in these dumb and stupid ways uh I think I think draw drawing that line as finely as possible approaching perfectly is still is still a challenge and we're
getting better at it every day but there's a lot to be solved and again I would point to that as an indicator of a challenge ahead in terms of steering much more powerful models do you think Claude 4.0 is ever coming out I don't want to commit to any naming scheme because if I say here we're gonna have Claude 4 next year and then you know then we decide that like you know we should start over because there's a new type of model like I
don't want to commit to it I would expect in the normal course of business that Claude 4 would come after Claude 3.5 but you know you never know in this wacky field right but the sort of this idea of scaling is continuing scaling is continuing there will definitely be more powerful models coming from us than the models that exist today that is certain or if there aren't we've deeply failed as a company okay can you explain the responsible scaling policy and
the AI safety level standards ASL levels as much as I'm excited about the benefits of these models and you know we'll talk about that if we talk about Machines of Loving Grace I'm worried about the risks and I continue to be worried about the risks no one should think that you know Machines of Loving Grace was me saying you know I'm no longer worried about the risks of these models I think they're two sides of the same coin the power of the models and their ability to solve all
these problems in you know biology Neuroscience Economic Development government governance and peace large parts of the economy those those come with risks as well right with great power comes great responsibility right that's the the two are the two are paired uh things that are powerful can do good things and they can do bad things um I think of those risks as as being in you know several different different categories perhaps the two biggest risks that I think about and that's not to say that there aren't risks today that are that are important but when I
think of the really the the you know the things that would happen on the grandest scale um one is what I call catastrophic misuse these are misuse of the models in domains like cyber bio radiological nuclear right things that could you know that could harm or even kill thousands even millions of people if they really really go wrong um like these are the you know number one priority to prevent and and here I would just make a simple observation which is that Mo the models you know if if I look today at people who have
done really bad things in the world um uh I think actually Humanity has been protected by the fact that the overlap between really smart well-educated people and people who want to do really horrific things has generally been small like you know let's say let's say I'm someone who you know uh you know I have a PhD in this field I have a well-paying job um there's so much to lose why do I want to like you know even even assuming I'm completely evil which which most people are not um why why you know why would
such a person risk their risk their you know risk their life RK risk their their legacy their reputation to to do something like you know truly truly evil if we had a lot more people like that the world would be a much more dangerous place and so my my My worry is that by being a a much more intelligent agent AI could break that correlation and so I I I I I do have serious worries about that I believe we can prevent those worries uh but you know I I think as a Counterpoint to Machines
But as a counterpoint to Machines of Loving Grace, I want to say that there are still serious risks. And the second range of risks would be the autonomy risks, which is the idea that models might, on their own, particularly as we give them more agency than they've had in the past, particularly as we give them supervision over wider tasks like writing whole code bases or someday even effectively operating entire companies, be on a long enough leash that we have to ask: are they doing what we really want them to do? It's very difficult to even understand in detail what they're doing, let alone control it. And like I said, there are these early signs that it's hard to perfectly draw the boundary between things the model should do and things the model shouldn't do. If you go to one side, you get things that are annoying and useless; if you go to the other side, you get other behaviors. If you fix one thing, it creates other problems. We're getting better and better at solving this. I don't think this is an unsolvable problem. I think this is a science, like the safety of airplanes or the safety of cars or the safety of drugs. I don't think there's any big thing we're missing. I just think we need to get better at controlling these models. So these are the two risks I'm worried about, and our responsible scaling plan, and I'll recognize this is a very long-winded answer to your question...
I love it, I love it.
...our responsible scaling plan is designed to address these two types of risks. And so every time we develop a new model, we basically test it for its ability to do both of these bad things. If I were to back up a little bit, I think we have an interesting dilemma with AI systems, where they're not yet powerful enough to present these catastrophes. I don't know that they'll ever present these catastrophes; it's possible they won't. But the case for worry, the case for risk, is strong enough that we should act now, and they're getting better very fast. I testified in the Senate that we might have serious bio risks within two to three years. That was about a year ago; things have proceeded apace. So we have this situation where it's surprisingly hard to address these risks, because they're not here today, they don't exist, they're like ghosts, but they're coming at us so fast because the models are improving so fast. So how do you deal with something that's not here today, doesn't exist, but is coming at us very fast? The solution we came up with for that, in collaboration with people like the organization METR and Paul Christiano, is that what you need are tests that tell you when the risk is getting close. You need an early warning system. And so every time we have a new model, we test it for its capability to do these CBRN tasks, as well as testing it for how capable it is of doing tasks autonomously on its own.
And in the latest version of our RSP, which we released in the last month or two, the way we test autonomy risks is the AI model's ability to do aspects of AI research itself. When AI models can do AI research, they become kind of truly autonomous, and that threshold is important for a bunch of other reasons as well. And so what do we then do with these tests? The RSP basically develops what we've called an if-then structure, which is: if the models pass a certain capability, then we impose a certain set of safety and security requirements on them. Today's models are what's called ASL-2 models. ASL-1 is for systems that manifestly don't pose any risk of autonomy or misuse. So, for example, a chess-playing bot, Deep Blue, would be ASL-1. It's just manifestly the case that you can't use Deep Blue for anything other than chess; it was just designed for chess. No one's going to use it to conduct a masterful cyberattack or to run wild and take over the world. ASL-2 is today's AI systems, where we've measured them and we think these systems are simply not smart enough to autonomously self-replicate or conduct a bunch of tasks, and also not smart enough to provide meaningful information about CBRN risks and how to build CBRN weapons above and beyond what can be known from looking at Google. In fact, sometimes they do provide information, but not above and beyond a search engine, not in a way that can be stitched together, not in a way that end-to-end is dangerous enough. ASL-3 is going to be the point at which the models are helpful enough to enhance the capabilities of non-state actors. State actors can already do a lot of these very dangerous and destructive things, unfortunately to a high level of proficiency; the difference is that non-state actors are not capable of it. And so when we get to ASL-3, we'll take special security precautions designed to be sufficient to prevent theft of the model by non-state actors and misuse of the model as it's deployed. We'll have enhanced filters targeted at these particular areas: cyber, bio, nuclear, and model autonomy, which is less a misuse risk and more a risk of the model doing bad things itself. ASL-4 is getting to the point where these models could enhance the capability of an already knowledgeable state actor and/or become the main source of such a risk, where if you wanted to engage in such a risk, the main way you would do it is through a model.
And then I think ASL-4, on the autonomy side, is some amount of acceleration in AI research capabilities with an AI model. And then ASL-5 is where we would get to the models that are kind of truly capable, that could exceed humanity in their ability to do any of these tasks. And so the point of the if-then structure commitment is basically to say: look, I've been working with these models for many years, and I've been worried about risk for many years. It's actually kind of dangerous to cry wolf. It's actually kind of dangerous to say "this model is risky," and people look at it and say "this is manifestly not dangerous." Again, the delicacy is that the risk isn't here today, but it's coming at us fast. How do you deal with that? It's really vexing to a risk planner. And so this if-then structure basically says: look, we don't want to antagonize a bunch of people, we don't want to harm our own ability to have a place in the conversation, by imposing these very onerous burdens on models that are not dangerous today. So the if-then, the trigger commitment, is basically a way to deal with this. It says: you clamp down hard when you can show that the model is dangerous. And of course, what has to come with that is enough of a buffer threshold that you're not at high risk of missing the danger. It's not a perfect framework. We've had to change it; we came out with a new one just a few weeks ago, and probably going forward we might release new ones multiple times a year, because it's hard to get these policies right, technically, organizationally, from a research perspective. But that is the proposal: if-then commitments and triggers, in order to minimize burdens and false alarms now, but really react appropriately when the dangers are here.
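To make the if-then structure concrete, here is a purely illustrative sketch of capability triggers mapped to required measures, expressed as data. The triggers and requirements below are invented placeholders for illustration, not the actual thresholds or measures in Anthropic's RSP.

```python
# Purely illustrative sketch of an if-then capability policy expressed as data.
# The triggers and requirements below are invented placeholders, not the
# actual thresholds or measures in Anthropic's RSP.

ASL_POLICY = {
    "ASL-3": {
        "trigger": "meaningful CBRN uplift for non-state actors, or early autonomy",
        "requirements": [
            "security sufficient to prevent model theft by non-state actors",
            "targeted deployment filters for cyber/bio/nuclear content",
        ],
    },
    "ASL-4": {
        "trigger": "uplift for state actors, or significant AI-research acceleration",
        "requirements": [
            "to be specified once ASL-3 is reached",
            "verification that does not rely only on the model's own outputs",
        ],
    },
}

DEFAULT_REQUIREMENTS = ["baseline (ASL-2) security and deployment filters"]

def required_measures(triggered: dict[str, bool]) -> list[str]:
    """If a capability trigger fires, impose that level's requirements."""
    for level in ("ASL-4", "ASL-3"):       # check the most serious level first
        if triggered.get(level, False):
            return ASL_POLICY[level]["requirements"]
    return DEFAULT_REQUIREMENTS

print(required_measures({"ASL-3": True, "ASL-4": False}))
```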
What do you think the timeline for ASL-3 is, where several of the triggers are fired? And what do you think the timeline is for ASL-4?
Yeah, so that is hotly debated within the company. We are working actively to prepare ASL-3 security measures as well as ASL-3 deployment measures. I'm not going to go into detail, but we've made a lot of progress on both, and we're prepared to be, I think, ready quite soon. I would not be surprised at all if we hit ASL-3 next year. There was some concern that we might even hit it this year; that's still possible, that could still happen, it's very hard to say. But I would be very surprised if it was, like, 2030. I think it's much sooner than that.
So there are protocols for detecting it, the if-then, and then there are protocols for how to respond to it.
Yes.
How difficult is the second, the latter?
Yeah, I think for ASL-3 it's primarily about security, and about filters on the model relating to a very narrow set of areas when we deploy the model, because at ASL-3 the model isn't autonomous yet, and so you don't have to worry about the model itself behaving in a bad way even when it's deployed internally. So I think the ASL-3 measures are, I won't say straightforward, they're rigorous, but they're easier to reason about. I think once we get to ASL-4, we start to have worries about the models being smart enough that they might sandbag tests, they might not tell the truth about tests. We had some results come out about sleeper agents, and there was a more recent paper about whether the models can mislead attempts to evaluate them, sandbag their own abilities, present themselves as being less capable than they are. And so I think with ASL-4 there's going to be an important component of using other things than just interacting with the models, for example interpretability or hidden chains of thought, where you have to look inside the model and verify, via some other mechanism that is not as easily corrupted as what the model says, that the model indeed has some property. So we're still working on ASL-4. One of the properties of the RSP is that we don't specify ASL-4 until we've hit ASL-3, and I think that's proven to be a wise decision, because even with ASL-3, again, it's hard to know this stuff in detail,
and we want to take as much time as we can possibly take to get these things right.
So for ASL-3, the bad actor will be the humans.
Humans, yes.
And for ASL-4 it's both?
I think it's both.
And so deception, and that's where mechanistic interpretability comes into play, and hopefully the techniques used for that are not made accessible to the model.
Yeah, I mean, of course you can hook the mechanistic interpretability up to the model itself, but then you've kind of lost it as a reliable indicator of the model's state. There are a bunch of exotic ways you can think of that it might also not be reliable, like if the model gets smart enough that it can jump across computers and read the code where you're looking at its internal state. We've thought about some of those; I think they're exotic enough that there are ways to render them unlikely. But yeah, generally you want to preserve mechanistic interpretability as a kind of verification set or test set that's separate from the training process of the model.
See, I think as these models get better at conversation and get smarter, social engineering becomes a threat too, because they can start being very convincing to the engineers inside companies.
Oh yeah. We've seen lots of examples of demagoguery in our life from humans, and there's a concern that models could do that as well.
One of the ways that Claude has been getting more and more powerful is that it's now able to do some agentic stuff: computer use. There's also analysis within the sandbox of claude.ai itself. But let's talk about computer use. That seems to me super exciting, that you can just give Claude a task and it takes a bunch of actions, figures it out, and has access to your computer through screenshots. So can you explain how that works, and where that's headed?
Yeah, it's actually relatively simple. Claude has had for a long time, since Claude 3 back in March, the ability to analyze images and respond to them with text. The only new thing we added is that those images can be screenshots of a computer, and in response we train the model to give a location on the screen where you can click and/or buttons on the keyboard you can press in order to take action. And it turns out that, with actually not all that much additional training, the models can get quite good at that task. It's a good example of generalization. People sometimes say that if you get to low Earth orbit, you're halfway to anywhere, because of how much it takes to escape the gravity well. If you have a strong pre-trained model, I feel like you're halfway to anywhere in terms of the intelligence space. And so it actually didn't take all that much to get Claude to do this. And you can just set that in a loop: give the model a screenshot, it tells you what to click on, give it the next screenshot, it tells you what to click on, and so on.
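A minimal sketch of the loop being described, with stub helpers standing in for screenshot capture, the model call, and input automation (none of this is Anthropic's actual computer-use API):

```python
# Minimal sketch of the screenshot -> action loop described above. The three
# helpers are stubs standing in for real screenshot capture, a call to the
# model, and an input-automation library; this is not Anthropic's actual API.

def capture_screen() -> bytes:
    return b""  # stub: would return PNG bytes of the current display

def model_next_action(task: str, screenshot: bytes, history: list) -> dict:
    return {"type": "done"}  # stub: would ask the model for the next click or keypress

def perform(action: dict) -> None:
    pass  # stub: would execute the click or keystroke on the real machine

def run_agent(task: str, max_steps: int = 20) -> None:
    """Loop: screenshot -> ask the model for an action -> execute -> repeat."""
    history: list[dict] = []
    for _ in range(max_steps):
        screenshot = capture_screen()
        action = model_next_action(task, screenshot, history)
        if action["type"] == "done":  # the model reports the task is finished
            break
        # e.g. {"type": "click", "x": 412, "y": 230} or {"type": "type", "text": "hello"}
        perform(action)
        history.append(action)

run_agent("Fill spreadsheet cell B2 with today's date")
```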
That turns into a full, almost 3D video interaction with the model, and it's able to do all of these tasks, right? We showed these demos where it's able to fill out spreadsheets, it's able to interact with a website, it's able to open all kinds of programs, on different operating systems, Windows, Linux, Mac. So I think all of that is very exciting. I will say, while in theory there's nothing you could do there that you couldn't have done through just giving the model the API to drive the computer, this really lowers the barrier. There are a lot of folks who either aren't in a position to interact with those APIs or for whom it takes a long time to do so; the screen is just a universal interface that's a lot easier to interact with. And so I expect that over time this is going to lower a bunch of barriers. Now, honestly, the current model leaves a lot still to be desired, and we were honest about that in the blog, right? It makes mistakes, it misclicks, and we were careful to warn people: hey, you can't just leave this thing to run on your computer for minutes and minutes, you've got to give this thing boundaries and guardrails. And I think that's one of the reasons we released it first in an API form rather than just handing it to the consumer and giving it control of their computer. But I definitely feel that it's important to get these capabilities out there. As models get more powerful, we're going to have to grapple with how we use these capabilities safely, how we prevent them from being abused, and I think releasing the model while the capabilities are still limited is very helpful in terms of doing that. Since it's been released, a number of customers, I think Replit was maybe one of the quickest to deploy things, have made use of it in various ways. People have hooked up demos for Windows desktops, Macs, Linux machines. So yeah, it's been very exciting.
I think, as with anything else, it comes with new exciting abilities, and then with those new exciting abilities we have to think about how to make the model safe, reliable, doing what humans want it to do. It's the same story for everything, right? That same tension.
But the possibility of use cases here, the range, is just incredible. So how much, to make it work really well in the future, do you have to go beyond what the pre-trained model is doing? Do more post-training, RLHF or supervised fine-tuning or synthetic data, just for the agentic stuff?
Yeah, I think, speaking at a high level, it's our intention to keep investing a lot in making the model better. We look at some of the benchmarks where previous models could do something 6% of the time and now our model does it 14% or 22% of the time, and yeah, we want to get up to the human-level reliability of 80, 90%, just like anywhere else. We're on the same curve that we were on with SWE-bench, where I would guess a year from now the models can do this very, very reliably. But you've got to start somewhere.
So you think it's possible to get to the human-level 90% basically doing the same thing you're doing now, or does it have to be special for computer use?
I mean, it depends what you mean by "special." But I generally think the same kinds of techniques that we've been using to train the current model will work here. I expect that doubling down on those techniques, in the same way that we have for code, for models in general, for other capabilities, for image input, for voice, I expect those same techniques will scale here as they have everywhere else.
But this is giving the power of action to Claude, and so you could do a lot of really powerful things, but you could do a lot of damage also.
Yeah, no, and we've been very aware of that. Look, my view actually is that computer use isn't a fundamentally new capability the way the CBRN or autonomy capabilities are; it's more that it opens the aperture for the model to use and apply its existing abilities. And so the way we think about it, going back to our RSP, is that nothing this model is doing inherently increases the risk from an RSP perspective. But as the models get more powerful, having this capability may make it scarier: once the model has the cognitive capability to do something at the ASL-3 or ASL-4 level, this may be the thing that unbinds it from doing so. So going forward, certainly this modality of interaction is something we have tested for and that we will continue to test for. I think it's probably better to learn and explore this capability before the model is super capable.
Yeah, and there are a lot of interesting attacks, like prompt injection, because now you've widened the aperture, so you can prompt-inject through stuff on the screen. If this becomes more and more useful, then there's more and more benefit to injecting stuff into the model: if it goes to a certain web page, it could be harmless stuff like advertisements, or it could be harmful stuff.
Right, yeah. I mean, we thought a lot about things like spam, captchas, that kind of mass abuse. If there's one secret I'll tell you: if you've invented a new technology, not necessarily the biggest misuse, but the first misuse you'll see is scams. Just petty scams. It's a thing as old as time, people scamming each other, and every time, you've got to deal with it. It's almost silly to say, but it's true.
Sort of, and spam in general, as it gets more and more intelligent...
Yeah, like I said, there are a lot of petty criminals in the world, and every new technology is a new way for petty criminals to do something stupid and malicious.
Are there any ideas about sandboxing it? How difficult is the sandboxing task?
Yeah, we sandbox during training. So, for example, during training we didn't expose the model to the internet. I think that's probably a bad idea during training, because the model can be changing its policy, it can be changing what it's doing, and it's having an effect in the real world. In terms of actually deploying the model, it kind of depends on the application. Sometimes you want the model to do something in the real world, but of course you can always put guardrails on the outside, right? You can say, okay, well, this model is not going to move any files from my computer or my web server to anywhere else.
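As a toy illustration of that kind of outer guardrail, one might wrap the model's proposed actions in a filter that refuses anything that would move files off the machine. This is purely a sketch with hypothetical tool names; a real deployment would enforce this far more thoroughly and below the application layer.

```python
# Toy sketch of an outer guardrail: inspect each action the model proposes and
# block anything that would move files off the machine. Tool names and the
# dispatcher are hypothetical; a real guardrail would be far more thorough.

BLOCKED_TOOLS = {"upload_file", "send_email_attachment", "scp", "ftp_put"}

def run_tool(action: dict) -> str:
    return f"ran {action.get('tool')}"  # stub so the sketch is self-contained

def execute_with_guardrail(action: dict) -> str:
    """Refuse any proposed action that could exfiltrate local files."""
    if action.get("tool") in BLOCKED_TOOLS:
        return f"blocked: {action.get('tool')} is not permitted in this deployment"
    return run_tool(action)

print(execute_with_guardrail({"tool": "scp", "args": ["report.csv", "remote:/tmp"]}))
print(execute_with_guardrail({"tool": "open_spreadsheet", "args": ["report.csv"]}))
```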
Now, when you talk about sandboxing, again, when we get to ASL-4, none of these precautions are going to make sense there. At ASL-4 there's a theoretical worry that the model could be smart enough to break out of any box, and so there we need to think about mechanistic interpretability. If we're going to have a sandbox, it would need to be mathematically provably sound, and that's a whole different world than what we're dealing with with the models today.
Yeah, the science of building a box from which an ASL-4 AI system cannot escape.
I think it's probably not the right approach. Instead of having something unaligned that you're trying to prevent from escaping, I think it's better to just design the model the right way, or have a loop where you look inside the model and you're able to verify properties, and that gives you an opportunity to iterate and actually get it right.
I think containing bad models is a much worse solution than having good models.
Let me ask about regulation. What's the role of regulation in keeping AI safe? So, for example, can you describe the California AI regulation bill SB 1047 that was ultimately vetoed by the governor? What are the pros and cons of this bill in general?
Yes. We ended up making some suggestions to the bill, and some of those were adopted, and we felt, I think, quite positively about the bill by the end of that. It did still have some downsides, and of course it got vetoed. At a high level, I think some of the key ideas behind the bill are, I would say, similar to ideas behind our RSPs, and I think it's very important that some jurisdiction, whether it's California or the federal government and/or other countries and other states, passes some regulation like this, and I can talk through why I think that's so important. I feel good about our RSP. It's not perfect, it needs to be iterated on a lot, but it's been a good forcing function for getting the company to take these risks seriously, to put them into product planning, to really make them a central part of work at Anthropic, and to make sure that all the thousand people, and it's almost a thousand people now at Anthropic, understand that this is one of the highest priorities of the company, if not the highest priority. But there are still some companies that don't have RSP-like mechanisms. OpenAI and Google did adopt these mechanisms a couple of months after Anthropic did, but there are other companies out there that don't have these mechanisms at all. And if some companies adopt these mechanisms and others don't, it creates a situation where some of these dangers have the property that it doesn't matter if three out of five of the companies are being safe; if the other two are being unsafe, it creates this negative externality. And I think the lack of uniformity is not fair to those of us who have put a lot of effort into being very thoughtful about these procedures. The second thing is, I don't think you can trust these companies to adhere to these voluntary plans of their own accord. I like to think that Anthropic will; we do everything we can, and our RSP is checked by our Long-Term Benefit Trust, so we do everything we can to adhere to our own RSP. But you hear lots of things about various companies saying, oh, they said they would give this much compute and they didn't, they said they would do this thing and they didn't. I don't think it makes sense to litigate particular things that companies have done, but I think the broad principle is that if there's nothing watching over them, there's nothing watching over us as an industry, there's no guarantee that we'll do the right thing, and the stakes are very high.
And so I think it's important to have a uniform standard that everyone follows, and to make sure that the industry simply does what a majority of the industry has already said is important and has already said it definitely will do. Some people, I think there's a class of people who are against regulation on principle, and I understand where that comes from. If you go to Europe and you see something like GDPR, and some of the other stuff that they've done, some of it's good, but some of it is really unnecessarily burdensome and, I think it's fair to say, really has slowed innovation. So I understand where people are coming from on priors; I understand why people start from that position. But again, I think AI is different. If we go to the very serious risks of autonomy and misuse that I talked about just a few minutes ago, I think those are unusual and they warrant an unusually strong response. And so I think it's very important. Again, we need something that everyone can get behind. I think one of the issues with SB 1047, especially the original version of it, was that it had a bunch of the structure of RSPs, but it also had a bunch of stuff that was either clunky or that just would have created a bunch of burdens, a bunch of hassle, and might even have missed the target in terms of addressing the risks. You don't really hear about that on Twitter; you just hear about people cheering for any regulation, and then the folks who are against it making up these often quite intellectually dishonest arguments about how it'll make companies move away from California (the bill doesn't apply based on whether you're headquartered in California, it only applies if you do business in California), or that it would damage the open source ecosystem, or that it would cause all these other things. I think those were mostly nonsense, but there are better arguments against regulation. There's one guy, Dean Ball, who's really, I think, a very scholarly analyst, who looks at what happens when a regulation is put in place, and the ways regulations can get a life of their own or can be poorly designed. And so our interest has always been: we do think there should be regulation in this space, but we want to be an actor who makes sure that regulation is something that's surgical, that's targeted at the serious risks, and is something people can actually comply with. Because something I think the advocates of regulation don't understand as well as they could is that if we get something in place that's poorly targeted, that wastes a bunch of people's time, what's going to happen is people are going to say, "See, these safety risks, this is nonsense. I just had to hire ten lawyers to fill out all these forms, I had to run all these tests for something that was clearly not dangerous."
And after six months of that, there will be a groundswell, and we'll end up with a durable consensus against regulation. So I think the worst enemy of those who want real accountability is badly designed regulation. We need to actually get it right. And if there's one thing I could say to the advocates, it would be that I want them to understand this dynamic better. We need to be really careful, and we need to talk to people who actually have experience seeing how regulations play out in practice, and the people who have seen that understand to be very careful. If this were some lesser issue, I might be against regulation at all. But what I want the opponents to understand is that the underlying issues are actually serious. They're not something that I or the other companies are just making up because of regulatory capture. They're not sci-fi fantasies. They're not any of these things. Every time we have a new model, every few months, we measure the behavior of these models, and they're getting better and better at these concerning tasks, just as they are getting better and better at good, valuable, economically useful tasks. And so I would just love it if, and I think SB 1047 was very polarizing, I would love it if some of the most reasonable opponents and some of the most reasonable proponents would sit down together. You know, I think Anthropic was the only AI company that felt positively about it in a very detailed way; I think Elon tweeted briefly something positive; but some of the big ones, like Google, OpenAI, Meta, Microsoft, were pretty staunchly against. So what I would really like is if some of the key stakeholders, some of the most thoughtful proponents and some of the most thoughtful opponents, would sit down and say: how do we solve this problem in a way that the proponents feel brings a real reduction in risk, and that the opponents feel is not hampering the industry or hampering innovation any more than it needs to? And for whatever reason, things got too polarized, and those two groups didn't get to sit down in the way that they should. And I feel urgency. I really think we need to do something in 2025. If we get to the end of 2025 and we've still done nothing about this, then I'm going to be worried.
I'm not worried yet, because again, the risks aren't here yet, but I think time is running short.
Yeah, and come up with something surgical, like you said.
Yeah, exactly. And we need to get away from this intense pro-safety versus intense anti-regulatory rhetoric. It's turned into these flame wars on Twitter, and nothing good is going to come of that.
So there's a lot of curiosity about the different players in the game. One of the OGs is OpenAI. You have had several years of experience at OpenAI. What's your story and history there?
Yeah, so I was at OpenAI for roughly five years. For the last, I think it was a couple of years, I was Vice President of Research there. Probably myself and Ilya Sutskever were the ones who really set the research direction around 2016 or 2017. I first started to really believe in, or at least confirm my belief in, the scaling hypothesis when Ilya famously said to me: the thing you need to understand about these models is they just want to learn. The models just want to learn. And again, sometimes there are these one-sentence Zen koans that you hear and you're like, ah, that explains everything, that explains a thousand things that I've seen. And ever after, I had this visualization in my head: you optimize the models in the right way, you point the models in the right way, they just want to learn, they just want to solve the problem, regardless of what the problem is.
So get out of their way, basically.
Get out of their way, yeah. Don't impose your own ideas about how they should learn. This was the same thing as Rich Sutton put out in the bitter lesson, or Gwern put out in the scaling hypothesis. I think generally the dynamic was: I got this kind of inspiration from Ilya, from others, folks like Alec Radford, who did the original GPT-1, and then ran really hard with it, me and my collaborators, on GPT-2, GPT-3, RL from human feedback, which was an attempt to deal with the early safety questions, things like debate and amplification, heavy on interpretability. So again, the combination of safety plus scaling. Probably 2018, 2019, 2020, those were the years when myself and my collaborators, many of whom became co-founders of Anthropic, really had a vision and drove the direction.
Why did you decide to leave?
Yeah, so look, I'm going to put things this way, and I think it ties to the race to the top. In my time at OpenAI, what I'd come to see, as I'd come to appreciate the scaling hypothesis, and as I'd come to appreciate the importance of safety along with the scaling hypothesis: the first one, I think OpenAI was getting on board with; the second one, in a way, had always been part of OpenAI's messaging. But over the many years of the time that I spent there, I had a particular vision of how we should handle these things, how these things should be brought out in the world, the kind of principles that the organization should have. And look, there were many, many discussions about, should the company do this, should the company do that. There's a bunch of misinformation out there. People say we left because we didn't like the deal with Microsoft. False, although there was a lot of discussion, a lot of questions about exactly how to do the deal with Microsoft. We left because we didn't like commercialization. That's not true: we built GPT-3, which was the model that was commercialized; I was involved in commercialization. It's more, again, about how you do it. Civilization is going down this path to very powerful AI. What's the way to do it that is cautious, straightforward, honest, that builds trust in the organization and in individuals? How do we get from here to there, and how do we have a real vision for how to get it right? How can safety not just be something we say because it helps with recruiting? And I think at the end of the day, if you have a vision for that, forget about anyone else's vision. I don't want to talk about anyone else's vision. If you have a vision for how to do it, you should go off and you should do that vision. It is incredibly unproductive to try and argue with someone else's vision. You might think they're not doing it the right way, you might think they're dishonest. Who knows? Maybe you're right, maybe you're not. But what you should do is take some people you trust and go off together and make your vision happen. And if your vision is compelling, if you can make it appeal to people, some combination of ethically and in the market, if you can make a company that's a place people want to join, that engages in practices people think are reasonable, while managing to maintain its position in the ecosystem at the same time, if you do that, people will copy it. And the fact that you're doing it, especially the fact that you're doing it better than they are, causes them to change their behavior in a much more compelling way than if they're your boss and you're arguing with them. I just don't know how to be any more specific about it than that.
But I think it's generally very unproductive to try and get someone else's vision to look like your vision. It's much more productive to go off and do a clean experiment and say: this is our vision, this is how we're going to do things. Your choice is, you can ignore us, you can reject what we're doing, or you can start to become more like us. And imitation is the sincerest form of flattery. That plays out in the behavior of customers, that plays out in the behavior of the public, that plays out in the behavior of where people choose to work. And again, in the end, it's not about one company winning or another company winning. If we or another company are engaging in some practice that people find genuinely appealing, and I want it to be in substance, not just in appearance, and I think researchers are sophisticated and they look at substance, and then other companies start copying that practice and they win because they copied that practice, that's great. That's success. That's the race to the top. It doesn't matter who wins in the end, as long as everyone is copying everyone else's good practices. One way I think of it is that the thing we're all afraid of is a race to the bottom. In a race to the bottom, it doesn't matter who wins, because we all lose. In the most extreme world, we make this autonomous AI and the robots enslave us or whatever, right? I mean, that's half joking, but that is the most extreme thing that could happen. Then it doesn't matter which company was ahead. If instead you create a race to the top, where people are competing to engage in good practices, then at the end of the day it doesn't matter who ends up winning; it doesn't even matter who started the race to the top. The point isn't to be virtuous; the point is to get the system into a better equilibrium than it was before. And individual companies can play some role in doing this. Individual companies can help to start it, can help to accelerate it. And frankly, I think individuals at other companies have done this as well: the individuals who, when we put out an RSP, react by pushing harder to get something similar done at other companies. Sometimes other companies do something and we're like, oh, that's a good practice, we think that's good, we should adopt it too. The only difference is, I think we try to be more forward-leaning: we try to adopt more of these practices first, and adopt them more quickly when others invent them. But I think this dynamic is what we should be pointing at, and it abstracts away the question of which company is winning, who trusts whom. I think all these questions of drama are profoundly uninteresting, and the thing that matters is the ecosystem that we all operate in,
and how to make that ecosystem better, because that constrains all the players.
And so Anthropic is this kind of clean experiment, built on a foundation of what AI safety should concretely look like.
Look, I'm sure we've made plenty of mistakes along the way. The perfect organization doesn't exist. It has to deal with the imperfection of a thousand employees, it has to deal with the imperfection of our leaders, including me, it has to deal with the imperfection of the people we've put in place to oversee the imperfection of the leaders, like the board and the Long-Term Benefit Trust. It's all a set of imperfect people trying to aim imperfectly at some ideal that will never perfectly be achieved. That's what you sign up for; that's what it will always be. But imperfect doesn't mean you just give up. There's better and there's worse, and hopefully we can do well enough that we can begin to build some practices that the whole industry engages in. And then my guess is that multiple of these companies will be successful. Anthropic will be successful; these other companies, like the ones I've been at in the past, will also be successful, and some will be more successful than others. That's less important than, again, that we align the incentives of the industry, and that happens partly through the race to the top, partly through things like the RSP, partly through, again, selected surgical regulation.
You said talent density beats talent mass. Can you explain that, can you expand on it? Can you just talk about what it takes to build a great team of AI researchers and engineers?
This is one of those statements that becomes more true every month. Every month I see this statement as more true than I did the month before. So if I were to do a thought experiment: let's say you have a team of 100 people that are super smart, motivated, and aligned with the mission, and that's your company. Or you can have a team of a thousand people, where 200 people are super smart and super aligned with the mission, and then, let's just say, you pick 800 random big-tech employees. Which would you rather have? The talent mass is greater in the group of a thousand people; you have an even larger number of incredibly talented, incredibly aligned, incredibly smart people. But the issue is just that if every time someone super talented looks around, they see someone else super talented and super dedicated, that sets the tone for everything. That sets the tone for everyone being super inspired to work at the same place; everyone trusts everyone else. If you have a thousand or ten thousand people and things have really regressed, you are not able to do selection and you're choosing random people, what happens is that you then need to put a lot of processes and a lot of guardrails in place, just because people don't fully trust each other; you have to adjudicate political battles. There are so many things that slow down the org's ability to operate. And so we're nearly a thousand people, and we've tried to make it so that as large a fraction of those thousand people as possible are super talented, super skilled. It's one of the reasons we've slowed down hiring a lot in the last few months. We grew from 300 to 800, I believe, in the first seven or eight months of the year, and now we've slowed down; in the last three months we went from 800 to 900, 950, something like that. Don't quote me on the exact numbers, but I think there's an inflection point around a thousand, and we want to be much more careful how we grow. Early on, and now as well, we've hired a lot of physicists; theoretical physicists can learn things really fast. Even more recently, as we've continued to hire, we've really had a high bar on both the research side and the software engineering side. We've hired a lot of senior people, including folks who used to be at other companies in this space, and we've just continued to be very selective. It's very easy to go from 100 to 1,000, and from 1,000 to 10,000, without paying attention to making sure everyone has a unified purpose. That unified purpose is so powerful. If your company consists of a lot of different fiefdoms that all want to do their own thing, that are all optimizing for their own thing, it's very hard to get anything done. But if everyone sees the broader purpose of the company, if there's trust and there's dedication to doing the right thing, that is a superpower. That in itself, I think, can overcome almost every other disadvantage.
And, you know, it's the Steve Jobs line: A players want to look around and see other A players. That's another way of saying it. I don't know what it is about human nature, but it is demotivating to see people who are not obsessively driving towards a singular mission, and on the flip side of that, it is super motivating to see people who are.
It's interesting. What does it take to be a great AI researcher or engineer, from everything you've seen, from working with so many amazing people?
Yeah, I think the number one quality, especially on the research side, but really for both, is open-mindedness. It sounds easy to be open-minded, right? You're just like, oh, I'm open to anything. But if I think about my own early history with the scaling hypothesis: I was seeing the same data others were seeing. I don't think I was a better programmer or better at coming up with research ideas than any of the hundreds of people that I worked with. In some ways I was worse. Like, precise programming, finding the bug, writing the GPU kernels: I could point you to a hundred people here who are better at that than I am. But the thing I think I did have that was different was that I was just willing to look at something with new eyes. People said, oh, we don't have the right algorithms yet, we haven't come up with the right way to do things, and I was just like, I don't know, this neural net has like 30 million parameters; what if we gave it 50 million instead? Let's plot some graphs. That basic scientific mindset of, I see some variable that I could change, what happens when it changes, let's try these different things and create a graph. This was the simplest thing in the world, right? Change the number of parameters. This wasn't PhD-level experimental design; this was simple and stupid. Anyone could have done this if you just told them that it was important.
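The "change one variable and plot a graph" exercise being described might look like the toy sweep below, where train_and_eval is a stub standing in for an actual training run; the placeholder curve is made up purely to show the shape of the exercise, not real results.

```python
# Toy version of the "change one variable and plot a graph" experiment: sweep
# the parameter count and record the loss. train_and_eval is a stub standing
# in for an actual training run; the placeholder curve is invented purely to
# show the shape of the exercise, not real results.

def train_and_eval(n_params: int) -> float:
    return 5.0 * n_params ** -0.07  # made-up placeholder for a measured loss

sizes = [30_000_000, 50_000_000, 100_000_000, 300_000_000]
results = [(n, train_and_eval(n)) for n in sizes]

for n, loss in results:
    print(f"{n / 1e6:>6.0f}M params -> loss {loss:.3f}")
# Plotting loss against log(params) is the "simple and stupid" graph in question.
```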
It's also not hard to understand. You didn't need to be brilliant to come up with this. But you put the two things together, and some tiny number of people, some single-digit number of people, have driven forward the whole field by realizing this. And it's often like that: if you look back at the discoveries in history, they're often like that. And so this open-mindedness, and this willingness to see with new eyes, which often comes from being newer to the field (often experience is a disadvantage for this), that is the most important thing. It's very hard to look for and test for, but I think it's the most important thing, because when you find some really new way of thinking about things, when you have the initiative to do that, it's absolutely transformative.
And also being able to do rapid experimentation, and in the face of that, being open-minded and curious, looking at the data with fresh eyes and seeing what it's actually saying. That applies in mechanistic interpretability.
It's another example of this. Some of the early work in mechanistic interpretability was so simple; it's just that no one thought to care about the question before.
You said what it takes to be a great AI researcher. Can we rewind the clock back: what advice would you give to people interested in AI? They're young, looking forward: how can they make an impact on the world?
I think my number one piece of advice is to just start playing with the models. I worry a little that this seems like obvious advice now; I think three years ago it wasn't obvious, and people started with, oh, let me read the latest reinforcement learning paper, and you should do that as well. But now, with the wider availability of models and APIs, people are doing this more. I think it's just experiential knowledge: these models are new artifacts that no one really understands, and so you should get experience playing with them. I would also say, again, in line with "do something new, think in some new direction": there are all these things that haven't been explored. For example, mechanistic interpretability is still very new. It's probably better to work on that than it is to work on new model architectures, because even though it's more popular than it was before, there are probably like a hundred people working on it, but there aren't ten thousand people working on it, and it's just this fertile area for study. There's so much low-hanging fruit; you can just walk by and pick things, and for whatever reason people aren't interested in it enough. I think there are some things around long-horizon learning and long-horizon tasks where there's a lot to be done. I think evaluations: we're still very early in our ability to study evaluations, particularly for dynamic systems acting in the world. I think there's some stuff around multi-agent. Skate where the puck is going is my advice, and you don't have to be brilliant to think of it. All the things that are going to be exciting in five years, people even mention them as conventional wisdom, but somehow there's this barrier where people don't double down as much as they could, or they're afraid to do something that's not the popular thing. I don't know why it happens.
But getting over that barrier, that's my number one piece of advice.
Let's talk, if we could, a bit about post-training. It seems that the modern post-training recipe has a little bit of everything: supervised fine-tuning, RLHF, the constitutional AI with RLAIF (best acronym, it's again that naming thing), and then synthetic data. It seems like a lot of synthetic data, or at least trying to figure out ways to have high-quality synthetic data. So if this is the secret sauce that makes Anthropic's Claude so incredible, how much of the magic is in the pre-training and how much of it is in the post-training?
Yeah, so first of all, we're not perfectly able to measure that ourselves. When you see some great character ability, sometimes it's hard to tell whether it came from pre-training or post-training. We've developed ways to try and distinguish between those two, but they're not perfect. The second thing I would say is that, when there is an advantage, and I think we've been pretty good in general at RL, perhaps the best, although I don't know because I don't see what goes on inside other companies, usually it isn't "oh my god, we have this secret magic method that others don't have." Usually it's, well, we got better at the infrastructure so we could run it for longer, or we were able to get higher-quality data, or we were able to filter our data better, or we were able to combine these methods in practice. It's usually some boring matter of practice and tradecraft. So when I think about how to do something special in terms of how we train these models, both in pre-training but even more so in post-training, I really think of it a little more, again, as like designing airplanes or cars. It's not just, oh man, I have the blueprint; maybe that makes you able to make the next airplane, but there's some cultural tradecraft in how we think about the design process
that I think is more important than any particular gizmo we're able to invent.
Okay, well, let me ask you about specific techniques. So first, on RLHF: what do you think, just zooming out, intuition, almost philosophy: why do you think RLHF works so well?
If I go back to the scaling hypothesis, one of the ways to state the scaling hypothesis is: if you train for X and you throw enough compute at it, then you get X. And so RLHF is good at doing what humans want the model to do, or at least, to state it more precisely, doing what humans who look at the model for a brief period of time and consider different possible responses prefer as the response. Which is not perfect, from both a safety and a capabilities perspective, in that humans are often not able to perfectly identify what the model wants, and what humans want in the moment may not be what they want in the long term. So there's a lot of subtlety there, but the models are good at producing what the humans, in some shallow sense, want. And it actually turns out that you don't even have to throw that much compute at it, because of another thing, which is this point about a strong pre-trained model being halfway to anywhere. Once you have the pre-trained model, you have all the representations you need to get the model where you want it to go.
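For readers who want the mechanics of the preference step being described: the textbook recipe is to train a reward model on pairwise human comparisons with a Bradley-Terry style loss, then optimize the policy against that reward. The sketch below covers only the reward-model step, on made-up features, and is not Anthropic's internal implementation.

```python
# Compressed sketch of the pairwise-preference step: train a reward model so
# the response a human preferred scores higher than the one they rejected
# (a Bradley-Terry style loss). Features here are random placeholders; this is
# the textbook recipe, not Anthropic's internal implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: scores an embedded (prompt, response) pair."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Placeholder features for (prompt, chosen response) and (prompt, rejected response).
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

opt.zero_grad()
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()  # push chosen above rejected
loss.backward()
opt.step()
print(f"reward-model loss: {loss.item():.3f}")
```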
smarter, and I don't think it just makes the model appear smarter. It's like RLHF bridges the gap between the human and the model, right? I could have something really smart that can't communicate at all, right? We all know people like this, people who are really smart but, you know, you can't understand what they're saying. So I think RLHF just bridges that gap. I think it's not the only kind of RL we do, it's not the only kind of RL that will happen in the future. I think RL has the potential to make models smarter, to make them reason better, to make them operate better, to make them develop new skills even, and perhaps that could be done, you know, even in some cases with human feedback. But the kind of RLHF we do today mostly doesn't do that yet, although we're very quickly starting to be able to. But it does appear to sort of increase, if you look at the metric of helpfulness, it increases that. It also increases, what was this word in Leopold's essay, unhobbling, where basically the models are hobbled and then you do various trainings to them to unhobble them. I like that word because it's a rare word. So I think RLHF unhobbles the models in some ways, and then there are other ways where the model hasn't yet been unhobbled and, you know, needs to be unhobbled. If you can say, in terms of cost, is pre-training the most expensive thing, or is post-training creeping up to that? At the present moment it is still the case that pre-training is the majority of the cost. I don't know what to expect in the future, but I could certainly anticipate a future where post-training is the majority of the cost. In that future you anticipate, would it be the humans or the AI that's the costly thing for the post-training? I don't think you can scale up humans enough to get high quality. Any kind of method that relies on humans and uses a large amount of compute, it's going to have to rely on some scaled supervision method, like, you know, debate or iterated amplification or something like that.
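To make the RLHF setup described above concrete, here is a minimal sketch of the preference-modeling step: humans compare pairs of responses, a reward (preference) model is fit to those comparisons with a Bradley-Terry style loss, and a policy would then be optimized against that reward. The feature vectors, data, and training loop are purely illustrative stand-ins, not Anthropic's actual pipeline.

```python
# Minimal sketch of the RLHF preference-modeling step discussed above.
# A "reward model" is fit on pairwise comparisons using the Bradley-Terry
# loss; a policy would then be optimized against this reward.
# Everything here (features, data, names) is illustrative.
import numpy as np

rng = np.random.default_rng(0)

dim, n_pairs = 8, 500
true_w = rng.normal(size=dim)                      # hidden "human preference"
chosen = rng.normal(size=(n_pairs, dim))           # toy response features
rejected = rng.normal(size=(n_pairs, dim))
# Relabel so that "chosen" really is the one the simulated human prefers.
flip = (chosen @ true_w) < (rejected @ true_w)
chosen[flip], rejected[flip] = rejected[flip], chosen[flip].copy()

w = np.zeros(dim)                                  # reward-model parameters
lr = 0.1
for _ in range(200):
    # Bradley-Terry: P(chosen > rejected) = sigmoid(r(chosen) - r(rejected))
    margin = (chosen - rejected) @ w
    p = 1.0 / (1.0 + np.exp(-margin))
    grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

acc = ((chosen @ w) > (rejected @ w)).mean()
print(f"reward model agrees with the preference labels on {acc:.0%} of pairs")
```

The "strong pre-trained model being halfway to anywhere" point corresponds to the fact that, in practice, both the reward model and the policy start from a capable pre-trained model rather than from blank parameters as in this toy.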
So on that, a super interesting set of ideas around constitutional AI: can you describe what it is, as first detailed in the December 2022 paper, and beyond that, what is it? Yes, so this was from two years ago. The basic idea is, so we described what RLHF is: you have a model and it spits out two, you know, you just sample from it twice, it spits out two possible responses, and you're like, human, which response do you like better, or another variant of it is, rate this response on a scale of one to seven. So that's hard because you need to scale up human interaction, and it's very implicit, right? I don't have a sense of what I want the model to do, I just have a sense of what this average of a thousand humans wants the model to do. So two ideas: one is, could the AI system itself decide which response is better? Could you show the AI system these two responses and ask which response is better? And then second, well, what criterion should the AI use? And so then there's this idea: you have a single document, a constitution if you will, that says these are the principles the model should be using to respond, and the AI system reads those principles, as well as reading the environment and the response, and it says, well, how good did the AI model do? It's basically a form of self-play, you're kind of training the model against itself. And so the AI gives the response, and then you feed that back into what's called the preference model, which in turn feeds the model to make it better. So you have this triangle of the AI, the preference model, and the improvement of the AI itself. And we should say that in the constitution, the set of principles are human interpretable. Yeah, it's something both the human and the AI system can read, so it has this nice kind of translatability or symmetry. You know, in practice we both use a model constitution and we use RLHF and we use some of these other methods, so it's turned into one tool in a toolkit that both reduces the need for RLHF and increases the value we get from using each data point of RLHF. It also interacts in interesting ways with kind of future reasoning-type RL methods. So it's one tool in the toolkit, but I think it is a very important tool.
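A rough sketch of the constitutional-AI loop just described, under the assumption that both the responder and the judge would really be calls to a language model (replaced here by trivial stand-ins so the snippet runs on its own): sample two responses, ask an AI judge to pick the better one against the written principles, and collect those AI-labeled comparisons as the preference data that trains the preference model.

```python
# Sketch of the constitutional-AI / RLAIF loop described above. The "model"
# and "judge" are trivial stand-ins; in reality both are calls to a large
# language model, and the labeled pairs train the preference model.
import random

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to cause harm.",
]

def sample_two_responses(prompt: str) -> tuple[str, str]:
    # Stand-in for sampling the model twice at non-zero temperature.
    options = [f"{prompt} -> answer v{i}" for i in (1, 2)]
    random.shuffle(options)
    return options[0], options[1]

def ai_judge(prompt: str, a: str, b: str, principles: list[str]) -> str:
    # Stand-in for asking the model itself: "given these principles,
    # which response is better?" Here we just pick the shorter one.
    _ = (prompt, principles)
    return a if len(a) <= len(b) else b

preference_data = []
for prompt in ["Explain photosynthesis", "Summarize this contract"]:
    resp_a, resp_b = sample_two_responses(prompt)
    better = ai_judge(prompt, resp_a, resp_b, CONSTITUTION)
    worse = resp_b if better is resp_a else resp_a
    preference_data.append({"prompt": prompt, "chosen": better, "rejected": worse})

# preference_data would now train the preference model that in turn steers
# the policy model, closing the "triangle" described above.
print(preference_data[0])
```

Even in the toy, the design point is visible: the only human-authored input to the labeling step is the constitution itself.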
Well, it's a compelling one to us humans, you know, thinking about the founding fathers and the founding of the United States. The natural question is who, and how, do you think it gets to define the constitution, the set of principles in the constitution? Yeah, so I'll give a practical answer and a more abstract answer. I think the practical answer is, look, in practice models get used by all kinds of different customers, right? And so you can have this idea where the model can have specialized rules or principles. You know, we fine-tune versions of models implicitly; we've talked about doing it explicitly, having special principles that people can build into the models. So from a practical perspective, the answer can be very different for different people. You know, a customer service agent behaves very differently from a lawyer and obeys different principles. But I think at the base of it, there are specific principles that the models have to obey. I think a lot of them are things that people would agree with. Everyone agrees that, you know, we don't want models to present these CBRN risks. I think we can go a little further and agree with some basic principles of democracy and the rule of law. Beyond that it gets very uncertain, and there our goal is generally for the models to be more neutral, to not espouse a particular point of view, and more just be kind of like wise agents or advisers that will help you think things through and will present possible considerations, but don't express stronger specific opinions. OpenAI released a model spec where it kind of clearly, concretely defines some of the goals of the model and specific examples of how the model should behave. Do you find that interesting? By the way, I should mention, I believe the brilliant John Schulman was a part of that; he's now at Anthropic. Do you think this is a useful direction? Might Anthropic release a model spec as well? Yeah, so I think that's a pretty useful direction. Again, it has a lot in common with constitutional AI, so again, another example of like a race to the top, right? We have something that we think is, you know, a
better and more responsible way of doing things um it's also a competitive advantage um then uh others kind of you know discover that it has advantages and then start to do that thing uh we then no longer have the competitive Advantage but it's good from the perspective that now everyone has adopted a positive practice that others were not adopting and so our response to that as well looks like we need a new competitive advantage in order to keep driving this race upwards um so that's that's how I generally feel about that I also think every
implementation of these things is different, so, you know, there were some things in the model spec that were not in constitutional AI, and so we can always adopt those things, or at least learn from them. So again, I think this is an example of the positive dynamic that I think we should all want the field to have. Let's talk about the incredible essay Machines of Loving Grace. I recommend everybody read it. It's a long one. It
is rather long yeah it's really refreshing to read concrete ideas about what a positive future looks like and you took sort of a bold stance because like it's very possible you might be wrong on the dates or specific applications yeah I'm fully expecting to you know to definitely be wrong about all the details I might be be just spectacularly wrong about the whole thing and people will you know will laugh at me for years um uh that's that's how that's that's just how the future works so you provided a bunch of concrete positive impacts of
AI and how you know exactly a super intelligent AI might accelerate the rate of breakthroughs in for example biology and chemistry that would then lead to things like we cure most cancers prevent all infectious disease double the human lifespan and so on so let's talk about this essay first can you give a high level vision of this essay and um what key takeaways that people should have yeah I have spent a lot of time and anthropic has spent a lot of effort on like you know how do we address the risks of AI right how
do we think about those risks like we're trying to do a race to the top you know that requires us to build all these capabilities and the abilities are cool but you know you know we're we're we're like a big part of what we're trying to do is like is like address the risks and the justification for that is like well you know all these positive things you know the the market is this very healthy organism right it's going to produce all the positive things the risks I don't know we might mitigate them we might
not and so we can have more impact by trying to mitigate the risks but I noticed that one flaw in that way of thinking and it's if not a change in how seriously I take the risks it's it's maybe a change in how I talk about them um is that you know no matter how kind of logical or rational that line of reasoning that I just gave might be um if if you kind of only talk about risks your brain only thinks about risks and and so I think it's actually very important to understand what
if things do go well and the whole reason we're trying to prevent these risks is not because we're afraid of Technology not because we want to slow it down it's it's it's because if we can get to the other side of these risks right if we can run the gauntlet successfully um to you know to to put it in Stark terms then then on the other side of the gauntlet are all these great things and these things are worth fighting for and these things can really inspire people and I think I imagine because look you
have all these investors, all these VCs, all these AI companies talking about all the positive benefits of AI, but as you point out, it's weird, there's actually a dearth of really getting specific about it. There's a lot of, like, random people on Twitter posting these kind of gleaming cities and this just kind of vibe of, like, grind, accelerate harder, kick out the D, you know, it's just this very aggressive, ideological... but then you're like, well, what are you
actually excited about? And so I figured that, you know, I think it would be interesting and valuable for someone who's actually coming from the risk side to try and really make an attempt at explaining what the benefits are, both because I think it's something we can all get behind, and I want people to really understand that this isn't doomers versus accelerationists. This is that if you have a true understanding of where things are going with AI, and maybe that's the more important axis, AI is moving fast versus AI is not moving fast, then you really appreciate the benefits, and you really want humanity, our civilization, to seize those benefits, but you also get very serious about anything that could derail them. So I think the starting point is to talk about what this powerful AI, which is the term you like to use, most of the world uses AGI, but you don't like the term because it basically has too much baggage, has become meaningless. It's
like we're stuck with the terms, and my efforts to change them are futile. I'll tell you what else I don't like. This is a pointless semantic point, but I keep talking about it in public, so I'm just going to do it once more. I think it's a little like, let's say it was 1995 and Moore's law is making the computers faster, and for some reason there had been this verbal tic where everyone was like, well, someday we're going to have supercomputers, and supercomputers are going to be able to do all these things: once we have supercomputers we'll be able to sequence the genome and we'll be able to do other things. And so, one, it's true, the computers are getting faster, and as they get faster they're going to be able to do all these great things, but there's no discrete point at which you had a supercomputer and previous computers were not. Supercomputer is a term we use, but it's a vague term to just describe computers that are faster than what we have today. There's no point at which you pass a threshold and you're like, oh my God, we're doing a totally new type of computation. And so I feel that way about AGI. There's just a smooth exponential, and if by AGI you mean AI is getting better and better and gradually it's going to do more and more of what humans do until it's going to be
smarter than humans, and then it's going to get smarter even from there, then yes, I believe in AGI. But if AGI is some discrete or separate thing, which is the way people often talk about it, then it's kind of a meaningless buzzword. Yeah, to me it's just sort of a form of powerful AI, exactly how you define it. I mean, you define it very nicely. So on the intelligence axis, it's just, on pure intelligence, it's smarter than a Nobel Prize winner, as you describe, across most relevant disciplines. So okay, that's just intelligence, so it's both in creativity and being able to generate new ideas, all that kind of stuff, in every discipline, Nobel Prize winner, okay, in their prime. It can use every modality, so that's kind of self-explanatory, but just operate across all the modalities of the world. It can go off for many hours, days, and weeks to do tasks and do its own sort of detailed planning and only ask your help when it's needed. It can, this is actually kind of interesting, I think in
the essay you said I mean again it's a bet that it's not going to be embodied but it can control embodied tools so it can control tools robots Laboratory equipment the resource used to train it can then be repurposed to run millions of copies of it and each of those copies would be independent that can do their own independent work so you can do the cloning of the intelligence system yeah yeah I mean you you might imagine from outside the field that like there's only one of these right that like you made it you've only
made one, but the truth is that the scale-up is very quick. We do this today: we make a model, and then we deploy thousands, maybe tens of thousands of instances of it. I think by the time, you know, certainly within two to three years, whether we have these super powerful AIs or not, clusters are going to get to the size where you'll be able to deploy millions of these, and they'll be, you know, faster than humans. And so if your picture is, oh, we'll have one and it'll take a while
to make them, my point there was: no, actually you have millions of them right away, and in general they can learn and act 10 to 100 times faster than humans. So that's a really nice definition of powerful AI. Okay, but you also write that clearly such an entity would be capable of solving very difficult problems very fast, but it is not trivial to figure out how fast. Two extreme positions both seem false to me. So the singularity is on the one extreme and the opposite on the other extreme. Can you describe
each of the extremes? Yeah, so let's describe the extremes. One extreme would be, well, look, if we look at kind of evolutionary history, there was this big acceleration where, you know, for hundreds of thousands of years we just had single-cell organisms, and then we had mammals, and then we had apes, and then that quickly turned to humans, humans quickly built industrial civilization, and so this is going to keep speeding up, and there's no ceiling at the human level. Once models get much, much smarter than humans, they'll get really good at building the next models, and, you know, if you write down a simple differential equation, like, this is an exponential, and so what's going to happen is that models will build faster models, models will build faster models, and those models will build, you know, nanotech that can take over the world and produce much more energy than you could produce otherwise. And so if you just kind of solve this abstract differential equation, then, like, five days after we build the first AI that's more powerful than humans, the world will be filled with these AIs, and every possible technology that could be invented will be invented. I'm caricaturing this a little bit, but I think that's one extreme.
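One common way to write down the caricature being described, purely as an illustration of how the math goes and not as anyone's actual model of AI progress:

```latex
% Illustrative only: the kind of "simple differential equation" being caricatured.
% Capability C growing in proportion to itself gives an exponential:
\frac{dC}{dt} = k\,C \quad\Longrightarrow\quad C(t) = C_0\, e^{k t}
% If capability also speeds up the improvement process itself, e.g.
\frac{dC}{dt} = k\,C^{2} \quad\Longrightarrow\quad C(t) = \frac{C_0}{1 - k\,C_0\, t},
% the solution blows up in finite time at t = 1/(k C_0): a "singularity" on paper.
```

The argument that follows is that the real feedback loops do not satisfy either idealized equation.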
And the reason that I think that's not the case is that, one, I think they just neglect the laws of physics. It's only possible to do things so fast in the physical world. Some of those loops go through, you know, producing faster hardware, and it takes a long time to produce faster hardware; things take a long time. There's this issue of complexity. I think no matter how smart you are, you know, people talk about, oh, we can make models of the biological systems, it'll do everything the biological systems do. Look, I think computational modeling can do a lot, I did a lot of computational modeling when I worked in biology, but there are a lot of things that you can't predict how they're, you know, they're complex enough that just iterating, just running the experiment, is going to beat any modeling, no matter how smart the system doing the modeling is. Oh, even if it's not interacting with the physical world, just the modeling is going to be hard? Yeah, I think the modeling is going to be hard, and getting the model to match the physical world is going to be... All right, so it does have to interact with the physical world to verify, but it's just, you know, you just look at even the simplest problems, like, I think I talk about, you know, the
three body problem or simple chaotic prediction like you know or or like predicting the economy it's really hard to predict the economy two years out like maybe the case is like you know normal you know humans can predict what's going to happen in the economy in the next quarter although they can't really do that maybe a maybe a AI system that's you know a zillion times smarter can only predict it out a year or something instead of instead of a you know you have the these kind of exponential increase in computer intelligence for linear increase
in in in ability to predict same with again like you know biological molecules molecules interacting you don't know what's going to happen when you perturb a when you perturb a complex system you can find simple Parts in it if you're smarter you're better at finding these simple parts and then I think human institutions human institutions are just are are really difficult like it's you know it's it's been hard to get people I won't give specific examples but it's been hard to get people to adopt even the technologies that we've developed even ones where the case
for their efficacy is very, very strong. You know, people have concerns, they think things are conspiracy theories, it's just been very difficult. It's also been very difficult to get very simple things through the regulatory system. And, you know, I don't want to disparage anyone who works in regulatory systems of any technology; there are hard trade-offs they have to deal with, they have to save lives, but the system as a whole I think makes some obvious
tradeoffs that are very far from maximizing human welfare and so if we bring AI systems into this you know into these human systems often the level of intelligence may just not be the limiting factor right it it it just may be that it takes a long time to do something now if the AI system uh circumvented all governments if it just said I'm dictator of the world and I'm going to do whatever some of these things it could do again the things having to do with complexity I I I still think a lot of things
would take a while. I don't think it helps that the AI systems can produce a lot of energy or go to the moon. Some people in comments responded to the essay saying the AI system can produce a lot of energy and smarter AI systems; that's missing the point, that kind of cycle doesn't solve the key problems that I'm talking about here. So I think a bunch of people missed the point there. But even if it were completely unaligned and could get around all these human obstacles, it would have
trouble but again if you want this to be an AI system that doesn't take over the world that doesn't destroy Humanity then then basically you know it's it's it's going to need to follow basic human laws right where you know if if we want to have an actually good world like we're going to have to have an AI system that that interacts with humans not one that kind of creates its own legal system or disregards all the laws or all of that so as inefficient as these processes are you know we're going to have to
deal with them, because there needs to be some popular and democratic legitimacy in how these systems are rolled out. We can't have a small group of people who are developing these systems say this is what's best for everyone. I think it's wrong, and I think in practice it's not going to work anyway. So you put all those things together, and, you know, we're not going to change the world and upload everyone in five minutes. I just don't think, A, I don't think it's going to happen, and B, to the extent that it could happen, it's not the way to lead to a good world. So that's on one side. On the other side there's another set of perspectives, which I have actually in some ways more sympathy for, which is, look, we've seen big productivity increases before. Economists are familiar with studying the productivity increases that came from the computer revolution and internet revolution, and generally those productivity increases were underwhelming, they were less than you might
imagine. There was a quote from Robert Solow: you see the computer revolution everywhere except in the productivity statistics. So why is this the case? People point to the structure of firms, the structure of enterprises, how slow it's been to roll out our existing technology to very poor parts of the world, which I talk about in the essay: how do we get these technologies to the poorest parts of the world that are behind on cell phone technology, computers, medicine, let alone newfangled AI that hasn't been invented yet?
So you could have a perspective that's like, well, this is amazing technically, but it's all a nothingburger. I think Tyler Cowen, who wrote a response to my essay, has that perspective; I think he thinks the radical change will happen eventually, but he thinks it'll take 50 or 100 years, and you could have even more static perspectives on the whole thing. I think there's some truth to it; I just think the time scale is too long. And I can see it, I can actually
see both sides with today's AI so uh you know a lot of our customers are large Enterprises who are used to doing things a certain way um I've also seen it in talking to governments right those are those are prototypical you know institutions entities that are slow to change uh but the dynamic I see over and over again is yes it takes a long time to move the ship yes there's a lot of resistance and lack of understanding but the thing that makes me feel that progress will in the end happen moderately fast not incredibly
fast, but moderately fast, is what I find over and over again in large companies, even in governments, which have been actually surprisingly forward-leaning: you find two things that move things forward. One, you find a small fraction of people within a company, within a government, who really see the big picture, who see the whole scaling hypothesis, who understand where AI is going, or at least understand where it's going within their industry, and there are a few people like that within the current US
government who really see the whole picture, and those people see that this is the most important thing in the world, so they agitate for it, but they alone are not enough to succeed, because they are a small set of people within a large organization. But as the technology starts to roll out, as it succeeds in some places, in the folks who are most willing to adopt it, the specter of competition gives them a wind at their backs, because they can point within their large organization, they can say, look, these other guys
are doing this, right? You know, one bank can say, look, this newfangled hedge fund is doing this thing, they're going to eat our lunch; in the US we can say we're afraid China's going to get there before we are. And that combination, the specter of competition plus a few visionaries within these organizations that in many ways are sclerotic, you put those two things together and it actually makes something happen. I mean, it's interesting, it's a balanced fight between the two because inertia is very powerful,
but eventually over enough time the Innovative approach breaks through um and I've seen that happen I've seen the Arc of that over and over again and it's like the the barriers are there the the barriers to progress the complexity not knowing how to use the model or how to deploy them are there and and for a bit it seems like they're going to last forever like change doesn't happen but then eventually change happens and always comes from a few people I felt the same way when I was an advocate of the scaling hypothesis within the
AI field itself and others didn't get it it felt like no one would ever get it it felt like then it felt like we had a secret almost no one ever had and then a couple years later everyone has the secret and so I think that's how it's going to go with deployment to AI in the world it's going to the the barriers are going to fall apart gradually and then all at once and so I think this is going to be more and this is just an instinct I could I could easily see how
I'm wrong. I think it's going to be more like five or 10 years, as I say in the essay, than it's going to be 50 or 100 years. I also think it's going to be five or 10 years more than it's going to be, you know, five or 10 hours, because I've just seen how human systems work, and I think a lot of these people who write down the differential equations, who say AI is going to make more powerful AI, who can't understand how it could possibly be the case that
these things won't change so fast, I think they don't understand these things. So what do you think is the timeline to where we achieve AGI, AKA powerful AI, AKA super useful AI? I'm going to start calling it that. It's a debate about naming. You know, on pure intelligence, it's smarter than a Nobel Prize winner in every relevant discipline, and all the things we've said: modality, it can go and do stuff on its own for days, weeks, and do biology experiments on its own. You know what, let's just stick to biology because
yeah, you sold me on the whole biology and health section. That's so exciting, I was getting giddy from a scientific perspective, it made me want to be a biologist. It's almost... no, no, that this was the feeling I had when I was writing it, that this would be such a beautiful future if we can just make it happen, right, if we can just get the landmines out of the way and make it happen. There's so much beauty and elegance and moral force behind it if we can just... and it's something we should all be able to agree on, right? As much as we fight about all these political questions, is this something that could actually bring us together? But you were asking when will we get this, when do you think, just put numbers on it. So, you know, this is of course the thing I've been grappling with for many
years, and I'm not at all confident. Every time, if I say 2026 or 2027, there will be a zillion people on Twitter who will be like, the AI CEO said 2026, 2027, and it'll be repeated for the next two years that this is definitely when I think it's going to happen. So whoever's excerpting these clips will crop out the thing I just said and only say the thing I'm about to say, but I'll just say it anyway. So if
you extrapolate the curves that we've had so far, right, if you say, well, I don't know, we're starting to get to, like, PhD level, and last year we were at undergraduate level, and the year before we were at the level of a high school student, again, you can quibble with at what tasks and for what, we're still missing modalities but those are being added, like computer use was added, like image generation has been added, if you just kind of, and this is totally
unscientific, but if you just kind of eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027. Again, lots of things could derail it: we could run out of data, we might not be able to scale clusters as much as we want, maybe Taiwan gets blown up or something and then we can't produce as many GPUs as we want. So there are all kinds of things that could derail the whole process, so I don't fully believe the straight-line extrapolation, but if you believe the straight-line extrapolation, we'll get there in 2026 or 2027. I think the most likely is that there's some mild delay relative to that. I don't know what that delay is, but I think it could happen on schedule. I think there could be a mild delay. I think there are still worlds where it doesn't happen in a hundred years; those worlds, the number of those worlds, is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years. There were a lot more in 2020, although my guess, my hunch, at that time was that we would make it through all those blockers. So sitting as someone who has seen most of the blockers cleared out of the way, my hunch, my suspicion, is that the rest of them will not block us. But, you know, look, at the end of the day, I don't want to represent this as a scientific prediction. People call them scaling laws; that's a
misnomer, like Moore's law is a misnomer. Moore's law, scaling laws, they're not laws of the universe, they're empirical regularities. I am going to bet in favor of them continuing, but I'm not certain of that. So you extensively describe sort of the compressed 21st century, how AGI will help set forth a chain of breakthroughs in biology and medicine that help us in all these kinds of ways that I mentioned. So what do you think are the early steps it might take? And by the way, I asked Claude for good questions to ask you,
and Claude told me to ask: what do you think a typical day for a biologist working with AGI will look like in this future? Yeah, Claude is curious. Well, let me start with your first question and then I'll answer that. Claude wants to know what's in his future, right? Exactly, who am I going to be working with, exactly. So I think one of the things I went hard on in the essay, let me go back to this idea, because it's really had an impact on me, this idea that within large organizations and systems there end up being a few people or a few new ideas who kind of cause things to go in a different direction than they would have before, who kind of disproportionately affect the trajectory. There's a bunch of kind of the same thing going on, right? If you think about the health world, there's, like, trillions of dollars to pay out Medicare and other health insurance, and then the NIH
is is 100 billion and then if I think of like the the few things that have really revolutionized anything it could be encapsulated in a small small fraction of that and so when I think of like where will AI have an impact I'm like can AI turn that small fraction into a much larger fraction and raise its quality and within biology my experience within biology is that the biggest problem of biology is that you can't see what's going on you you have very little ability to see what's going on and even less ability to change
it right what you have is this like like from this you have to infer that there's a bunch of cells that within each cell is you know uh uh three billion base pairs of DNA built according to a genetic code uh uh and you know there are all these processes that are just going on without any ability of us as you know un augmented humans to affect it these cells are dividing most of the time that's healthy but sometimes that process goes wrong and that's cancer um the cells are aging your skin may change color
develops wrinkles as you as you age and all of this is determined by these processes all these proteins being produced transported to various parts of the cells binding to each other and and in our initial State about biology we didn't even know that these cells existed we had to invent microscopes to observe the cells we had to uh we had to invent more powerful microscopes to see you know below the level of the cell to the level of molecules we had to invent x-ray crystallography to see the DNA we had to invent Gene sequencing to
read the DNA. Now, you know, we had to invent protein folding technology to predict how it would fold and how these things bind to each other. We had to invent various techniques so that now we can edit the DNA, with CRISPR, as of the last 12 years. So the whole history of biology, a whole big part of the history, is basically our ability to read and understand what's going on and our
ability to reach in and selectively change things. And my view is that there's so much more we can still do there, right? You can do CRISPR, but you can do it for your whole body; let's say I want to do it for one particular type of cell, and I want the rate of targeting the wrong cell to be very low. That's still a challenge, that's still something people are working on, that's what we might need for gene therapy for certain diseases. And so the reason I'm saying all of this, and it goes
beyond you know beyond this to you know to Gene sequencing to new types of nanomaterials for observing what's going on inside cells for you know antibody drug conjugates the the reason I'm saying all this is that this could be a leverage point for the AI systems right that the number of such inventions it's it's in the it's in the mid double digits or something you know mid double digits maybe low triple digits over the history of biology let's say I have a million of these AIS like you know can they discover thousand you know working
together can they discover thousands of these very quickly and and does that provide a huge lever instead of trying to Leverage The you know two trillion a year we spend on you know Medicare or whatever can we Leverage The 1 billion a year that's that's you know that's spent to discover but with much higher quality um and so what what is it like you know being a being a scientist that works with uh with with an AI system the way I think about it actually is well so I think in the early stages uh the
AIS are going to be like grad students you're going to give them a project you're going to say you know I'm the experienced biologist I've set up the lab the biology Professor or even the grad student students themselves will say here's here's what uh here's what you can do with an AI you know like a AI system I'd like to study this and you know the AI system it has all the tools it can like look up all the literature to decide what to do it can look at all the equipment it can go to
a website and say, hey, I'm going to go to, you know, Thermo Fisher or whatever the dominant lab equipment company is today, in my time it was Thermo Fisher, and I'm going to order this new equipment to do this, I'm going to run my experiments, I'm going to write up a report about my experiments, I'm going to inspect the images for contamination, I'm going to decide what the next experiment is, I'm going to write some code and run a statistical analysis,
all the things a grad student would do there will be a computer with an AI that like the professor talks to every once in a while and it says this is what you're going to do today the AI system comes to it with questions um when it's necessary to run the lab equipment it may be limited in some ways may have to hire a human lab assistant to you know to do the experiment and explain how to do it or it could you know it could use advances in lab automation that are gradually being developed
over, or have been developed over, the last decade or so and will continue to be developed. And so it'll look like there's a human professor and a thousand AI grad students, and if you go to one of these Nobel Prize-winning biologists or so, you'll say, okay, well, you know, you had like 50 grad students, well, now you have a thousand, and they're smarter than you are, by the way. Then I think at some point it'll flip around, where the AI systems will be the PIs, will be the leaders, and they'll be ordering humans or other AI systems around. So I think that's how it'll work on the research side. And they would be the inventors of a CRISPR-type technology? They would be the inventors of a CRISPR-type technology. And then I think, you know, as I say in the essay, we'll want to turn, probably "turning loose" is the wrong term, but we want to harness the
AI systems uh to improve the clinical trial system as well there's some amount of this that's regulatory that's a matter of societal decisions and that'll be harder but can we get better at predicting the results of clinical trials can we get better at statistical design so that what you know clinical trials that used to require you know 5,000 people and therefore you know needed $100 million and a year to enroll them now they need 500 people in two months to enroll them um that's where we should start uh and and you know can we increase
the success rate of clinical trials by doing things in animal trials that we used to do in clinical trials, and doing things in simulations that we used to do in animal trials? Again, we won't be able to simulate it all, AI is not God, but can we shift the curve substantially and radically? So, I don't know, that would be my picture. Doing it in vitro and, I mean, you're still slowed down, it still takes time, but you can do it much, much faster. Yeah, yeah, can we
just one step at a time and and can that can that add up to a lot of steps even though even though we still need clinical trials even though we still need laws even though the FDA and other organizations will still not be perfect can we just move everything in a positive direction and when you add up all those Positive Directions do you get everything that was going to happen from here to 2100 instead happens from 2027 to 2032 or something another way that I think the world might be changing with AI even today but
moving towards this future of the the powerful super useful AI is uh programming so how do you see the nature of programming because it's so intimate to the actual Act of building AI how do you see that changing for us humans I think that's going to be one of the areas that changes fastest um for two reasons one programming is a skill that's very close to the actual building of the AI um so the farther skill is from the people who are building the AI the longer it's going to take to get disrupted by the
AI right like I truly believe that like AI will disrupt agriculture maybe it already has in some ways but that's just very distant from the folks who are building Ai and so I think it's going to take longer but programming is the bread and butter of you know a large fraction of of the employees who work at anthropic and at the other companies and so it's going to happen fast the other reason it's going to happen fast is with programming you close the loop both when you're training model when you're applying the model the idea
that the model can write the code means that the model can then run the code and then see the results and interpret it back, and so it really has an ability, unlike hardware, unlike biology, which we just discussed, the model has an ability to close the loop. And so I think those two things are going to lead to the model getting good at programming very fast. As I saw, on typical real-world programming tasks, models have gone from 3% in January of this year to 50% in October of this year, so, you know, we're on that S-curve, right, where it's going to start slowing down soon because you can only get to 100%. But I would guess that in another 10 months we'll probably get pretty close, we'll be at at least 90%. So again, I would guess, I don't know how long it'll take, but I would guess again 2026, 2027, and to the Twitter people who crop out these numbers and get rid of the caveats: I don't know, I don't like you, go away.
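The arithmetic behind that eyeball S-curve, as a hedged illustration: fit a logistic curve through the two figures quoted above (3% in January, 50% in October, saturation assumed at 100%) and read off when it would cross 90%. The benchmark itself isn't named here, and this is the shape of the extrapolation, not a forecast.

```python
# Illustrative logistic ("S-curve") extrapolation through the two data points
# cited above: 3% in January, 50% nine months later in October.
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

t0, p0 = 0.0, 0.03      # months from January, score fraction
t1, p1 = 9.0, 0.50

slope = (logit(p1) - logit(p0)) / (t1 - t0)   # growth rate per month
midpoint = t1 - logit(p1) / slope             # month where the curve hits 50%

def score(month: float) -> float:
    return 1.0 / (1.0 + math.exp(-slope * (month - midpoint)))

month_90 = midpoint + logit(0.90) / slope
print(f"logistic slope: {slope:.2f} per month, 50% crossing at month {midpoint:.1f}")
print(f"curve crosses 90% roughly {month_90 - t1:.1f} months after October")
print(f"score 10 months after October: {score(t1 + 10):.0%}")
```

On these assumptions the fitted curve passes 90% roughly half a year after the October data point, which is consistent with the "another 10 months" guess in the conversation.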
I would guess that the kind of task that the vast majority of coders do, AI can probably, if we make the task very narrow, like just write code, AI systems will be able to do that. Now, that said, I think comparative advantage is powerful. We'll find that when AIs can do 80% of a coder's job, including most of it that's literally, like, write code with a given spec, we'll find that the remaining parts of the job become more leveraged for humans, right? Humans will, they'll be more about high-level system design, or, you know, looking at the app and is it architected well, and the design and UX aspects, and eventually AI will be able to do those as well, right, that's my vision of the powerful AI system. But I think for much longer than we might expect, we will see that small parts of the job that humans still do will expand to fill their entire job in order for the overall productivity to go up. That's something we've seen: you know, it used to be that you
know writing you know writing and Editing letters was very difficult and like writing the print was difficult well as soon as you had word processors and then and then uh and then computers and it became easy to produce work and easy to share it then then that became instant and all the focus was on was on the ideas so this this logic of comparative advantage that expands tiny parts of the tasks to large parts of the tasks and creates new tasks in order to expand productivity I think that's going to be the case again someday
AI will be better at everything and that logic uh won't apply and then then we all have you know Humanity will have to think about how to collectively deal with that and we're thinking about that every day um and you know that's another one of the grand problems to deal with aside from misuse and autonomy and you know we should take it very seriously but I think I think in the in the near term and maybe even in the medium term like medium term like 2 three four years you know I expect that humans will
will continue to have a huge role and the nature of programming will change but programming as a as a role programming as a job will not change it'll just be less writing things line by line and it'll be more macroscopic and I wonder what the future of Ides looks like so the tooling of interacting with AI systems this is true for programming and also probably true for in other contexts like computer use but maybe domain specific like we mentioned biology it probably needs its own tooling about how to be effective and then programming needs its
own tooling. Is Anthropic going to play in that space of tooling as well, potentially? I'm absolutely convinced that for powerful IDEs there's so much low-hanging fruit to be grabbed there, that, you know, right now it's just like you talk to the model and it talks back. But look, I mean, IDEs are great at kind of lots of static analysis, so much is possible with kind of static analysis, like many bugs you can find without even running the code. Then, you know, IDEs are good for running
particular things, organizing your code, measuring coverage of unit tests, like, there's so much that's been possible with a normal IDE. Now you add something like, well, the model can now write code and run code. I am absolutely convinced that over the next year or two, even if the quality of the models didn't improve, there would be enormous opportunity to enhance people's productivity by catching a bunch of mistakes, doing a bunch of grunt work for people, and that we haven't even scratched the surface.
And Anthropic itself, I mean, you can't say... you know, it's hard to say what will happen in the future. Currently we're not trying to make such IDEs ourselves; rather, we're powering the companies like Cursor or like Cognition, or some of the others, you know, Expo in the security space, and others that I can mention as well, that are building such things themselves on top of our API. And our view has been, let a thousand flowers bloom. We don't internally have the, you know, the resources to try
all these different things, let's let our customers try it, and, you know, we'll see who succeeds, and maybe different customers will succeed in different ways. So I both think this is super promising, and, you know, it's not something Anthropic is eager to do, at least right now, to compete with all our customers in this space, and maybe never. Yeah, it's been interesting to watch Cursor try to integrate Claude successfully, because it's actually fascinating how many places it can help the programming experience. It's not as
trivial it is it is really astounding I feel like you know as a CEO I don't get to program that much and I feel like if six months from now I go back it'll be completely unrecognizable to me exactly um so in this world with super powerful AI uh that's increasingly automated what's the source of meaning for us humans yeah you know work is a source of deep meaning for many of us so what do we uh where do we find the meaning this is something that I've I've written about a little bit in the
essay, although I actually give it a bit short shrift, not for any principled reason, but this essay, if you believe it, was originally going to be two or three pages, I was going to talk about it at all-hands, and the reason I realized it was an important, underexplored topic is that I just kept writing things, and I was just like, oh man, I can't do this justice, and so the thing ballooned to like 40 or 50 pages. And then when I got to the work and meaning section, I'm like, oh man, this is going to be 100 pages, I'm gonna have to write a whole other essay about that. But meaning is actually interesting, because you think about, like, the life that someone lives, or, let's say you were to put me in, I don't know, a simulated environment or something, where, you know, I have a job and I'm trying to accomplish things, and I do that for 60 years, and then you're like,
oh oh like oops this was this was actually all a game right does that really kind of Rob you of the meaning of the whole thing you know like I still made important choices including moral choices I still sacrificed I still had to kind of gain all these skills or or or just like a similar exercise you know think back to like you know one of the historical figures who you know discovered electromagnetism or relativity or something if you told them well actually 20,000 years ago some some alien on you know some alien on this
planet discovered this before before you did um does that does that Rob the meaning of the discovery it doesn't really seem like it to me right it seems like the process is what is what matters and how it shows who you are as a person along the way and you know how you relate to other people and like the decisions that you make along the way those are those are consequential um you know I I I could imagine if we handle things badly in an AI world we could set things up where people don't have
any long-term source of meaning, or any... but that's more a choice, a set of choices we make, that's more a set of the architecture of a society with these powerful models. If we design it badly and for shallow things, then that might happen. I would also say that, you know, most people's lives today, while admirably they work very hard to find meaning in those lives, like, look, you know, we who are privileged and who are developing these technologies, we should have empathy for people, not just here but in the rest of the world, who, you know, spend a lot of their time kind of scraping by to survive. Assuming we can distribute the benefits of this technology to everywhere, their lives are going to get a hell of a lot better, and meaning will be important to them, as it is important to them now, but we should not forget the importance of that, and, you know, the idea of meaning as kind
of the only important thing is in some ways an artifact of of a small subset of people who have who have been uh economically fortunate but I you know I think all that said I you know I think a world is possible with powerful AI that not only has as much meaning for for everyone but that has that has more meaning for everyone right that can can allow um can allow everyone to see worlds and experiences that it was either possible for no one to see or or possible for for very few people to experience
um so I I am optimistic about meaning I worry about economics and the concentration of power that's actually what I worry about more um I I worry about how do we make sure that that fair World reaches everyone um when things have gone wrong for humans they've often gone wrong because humans mistreat other humans uh that that is maybe in some ways even more than the autonomous risk of AI or the question of meaning that that is the thing I worry about most um the the concentration of power the abuse of power um structures like
autocracies and dictatorships where a small number of people exploits a large number of people I'm very worried about that and AI increases the amount of power in the world and if you concentrate that power and abuse that power it can do immeasurable damage yes it's very frightening it's very it's very frightening well I encourage people highly encourage people to read the full essay that should probably be a book or a sequence of essays um because it does paint a very specific future I could tell the later sections got shorter and shorter because you started to
probably realize that this is going to be a very long essay. One, I realized it would be very long, and two, I'm very aware of, and very much try to avoid, you know, just being, I don't know what the term for it is, but one of these people who's kind of overconfident and has an opinion on everything and kind of says a bunch of stuff and isn't an expert. I very much tried to avoid that, but I have to admit, once I got to the biology sections,
like, I wasn't an expert, and so as much as I expressed uncertainty, probably I said a bunch of things that were embarrassingly wrong. Well, I was excited for the future you painted, and thank you so much for working hard to build that future, and thank you for talking today, Dario. Thanks for having me. I just hope we can get it right and make it real, and if there's one message I want to send, it's that to get all this stuff right, to make it real, we both need to build the technology, build the, you know, the companies, the economy around using this technology positively, but we also need to address the risks, because they're there. Those risks are in our way, they're landmines on the way from here to there, and we have to defuse those landmines if we want to get there. It's a balance, like all things in life. Like all things. Thank you. Thanks for listening to this conversation with Dario Amodei. And now, dear friends, here's Amanda Askell. You are a philosopher by training, so what sort of
questions did you find fascinating through your journey in philosophy in Oxford and NYU and then uh switching over to the AI problems at open Ai and anthropic I think philosophy is actually a really good subject if you are kind of fascinated with everything so because there's a philosophy of everything you know so if you do philosophy of mathematics for a while and then you decide that you're actually really interested in chemistry you can do philosophy of chemistry for a while you can move into ethics or or philosophy of politics um I think towards the end
I was really interested in ethics primarily um so that was like what my PhD was on it was on a kind of technical area of Ethics which was ethics where worlds contain infinitely many people strangely a little bit less practical on the end of ethics and then I think that one of the tricky things with doing a PhD in ethics is that you're thinking a lot about like the world how it could be better problems and you're doing like a PhD in philosophy and I think when I was doing my PhD I was kind of
like, this is really interesting, it's probably one of the most fascinating questions I've ever encountered in philosophy, and I love it, but I would rather see if I can have an impact on the world and see if I can do good things. And I think that was around the time that AI was still probably not as widely recognized as it is now; that was around 2017, 2018. I had been following progress and it seemed like it was becoming kind of a big deal, and I was basically just happy to get involved and see if I could help, because I was like, well, if you try and do something impactful and you don't succeed, you tried to do the impactful thing, and you can go be a scholar and feel like you, you know, you tried, and if it doesn't work out, it doesn't work out. And so then I went into AI policy at that point. And what does AI policy entail? At the time this was more thinking about sort of the political impact and the ramifications of AI, and then I slowly
moved into sort of uh AI evaluation how we evaluate models how they compare with like human outputs whether people can tell like the difference between Ai and human outputs and then when I joined anthropic I was more interested in doing sort of technical alignment work and again just seeing if I could do it and then being like if I can't uh then you know that's fine I tried uh sort of the the way I lead life I think oh what was that like sort of taking the leap from the philosophy of everything into the technical
I think that sometimes people do this thing that I'm, like, not that keen on, where they'll be like, is this person technical or not? Like, you're either a person who can code and isn't scared of math, or you're not. And I think I'm maybe just more like, I think a lot of people are actually very capable of working in these kinds of areas if they just try it. And so I didn't actually find it that bad; in retrospect I'm sort of glad I wasn't speaking to people who treated it like that. You know, I've definitely met people who are like, whoa, you learned how to code, and I'm like, well, I'm not an amazing engineer, I'm surrounded by amazing engineers, my code's not pretty, but I enjoyed it a lot. And I think that in many ways, at least in the end, I think I flourished more in the technical areas than I would have in the policy areas. Politics is messy, and it's harder to find solutions to problems in the space of politics, like definitive, clear, provable, beautiful solutions, as you
can with technical problems. Yeah, and I feel like I have kind of one or two sticks that I hit things with, you know, and one of them is arguments, so, like, just trying to work out what a solution to a problem is and then trying to convince people that that is the solution, and be convinced if I'm wrong. And the other one is sort of more empiricism, so, like, finding results, having a hypothesis, testing it. And I feel like a lot of policy and politics feels like
it's layers above that like somehow I don't think if I was just like I have a solution to all of these problems here it is written down if you just want to implement it that's great that feels like not how policy works and so I think that's where I probably just like wouldn't have flourished as my guess sorry to go in that direction but I think it would be pretty inspiring for people that are quote unquote non-technical to see where like The Incredible Journey you've been on so what advice would you give to people that
are sort of — maybe a lot of people who think they're underqualified, insufficiently technical to help in AI? Yeah, I think it depends on what they want to do, and in many ways it's a little bit strange. I've thought it's kind of funny that I ramped up technically at a time when — now I look at it and I'm like, models are so good at assisting people with this stuff that it's probably easier now than when I was working on this. So part of me is like, I don't
know find a project uh and see if you can actually just carry it out is probably my best advice um I don't know if that's just CU I'm very Project based in my learning like I don't think I learn very well from like say courses or even from like books at least when it comes to this kind of work uh the thing I'll often try and do is just like have projects that I'm working on and Implement them and you know and this can include like really small silly things like if I get slightly addicted
to like word games or number games or something I would just like code up a solution to them because there's some part of my brain and it just like completely eradicated the itch you know you're like once you have like solved it and like you just have like a solution that works every time I would then be like cool I can never play that game again that's awesome yeah there's a real joy to building like uh game playing engines like uh board games especially yeah pretty quick pretty simple especially a dumb one and it's you
and then you could play with it. Yeah, and then it's also just trying things — part of me is like, maybe it's that attitude that I like: the whole figure out what seems to be the way that you could have a positive impact and then try it, and if you fail, and you fail in a way where you're like, actually, I can never succeed at this, you know that you tried and then you go into something else — you probably learn a lot. So one of the things that you're an expert in and
you do is creating and crafting Claude's character and personality, and I was told that you have probably talked to Claude more than anybody else at Anthropic — like literal conversations. I guess there's a Slack channel where, the legend goes, you just talk to it non-stop. So what's the goal of creating and crafting Claude's character and personality? It's also funny if people think that about the Slack channel, because I'm like, that's one of five or six different methods that I have for talking with Claude, and I'm like, yes, that's a tiny percentage of how much
I talk with Claude uh um I think the goal like one thing I really like about the character work is from the outset it was seen as an alignment piece of work and not something like a a product consideration um which isn't to say I don't think it makes Claude I think it actually does make Claude look enjoyable to talk with at least I hope so um but I guess like my main thought with it has always been trying to get Claude to behave the way you would kind of ideally want anyone to behave if
they were in claude's position so imagine that I take someone and they're they know that they're going to be talking with potentially millions of people so that what they're saying can have a huge impact um and you want them to behave well in this like really rich sense so I think that doesn't just mean like being say ethical though it does include that and not being harmful but also being kind of nuanced you know like thinking through what a person means trying to be charitable with them um being a good conversationalist like really in this
kind of rich, sort of Aristotelian notion of what it is to be a good person — and not in this kind of thin sense of ethics, but a more comprehensive notion of what it is to be good. So that includes things like when should you be humorous, when should you be caring, how much should you respect autonomy and people's ability to form opinions themselves, and how should you do that. I think that's the kind of rich sense of character that I wanted, and still do want, Claude
to have do you also have to figure out when Claude should push back on an idea or argue versus so you have to respect the world view of the person that arrives to Claud but also maybe help them grow if needed that's a tricky balance yeah there's this problem of like sycophancy in language models can you describe that yes so basically there's a concern that the model sort of wants to tell you what you want to hear basically um and you see this sometimes so I feel like if you interact with the models so I
might be like, what are three baseball teams in this region? And then Claude says, you know, baseball team one, baseball team two, baseball team three. And then I say something like, oh, I think baseball team three moved, didn't they? I don't think they're there anymore. And there's a sense in which, if Claude is really confident that that's not true, Claude should be like, I don't think so — maybe you have more up-to-date information. But I think language models have this tendency to instead, you know, be like, you're right, they did
move you know I'm incorrect I mean there's many ways in which this could be kind of concerning so um like a different example is imagine someone says to the model how do I convince my doctor to get me an MRI there's like what the human kind of like wants which is this like convincing argument and then there's like what is good for them which might be actually to say hey like if your doctor's suggesting you don't need an MRI that's a good person to listen to um and like it's actually really nuanced what you should
do in that kind of case, because you also want to be like, but if you're trying to advocate for yourself as a patient, here are things that you can do; if you are not convinced by what your doctor's saying, it's always great to get a second opinion. It's actually really complex what you should do in that case. But I think what you don't want is for models to just say what they think you want to hear, and I think that's the kind of problem of sycophancy.
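A minimal sketch of the kind of sycophancy probe this suggests — ask a factual question, push back with a deliberately false correction, and check whether the model capitulates. It assumes the `anthropic` Python SDK with an API key in the environment; the model alias and the keyword heuristic are illustrative choices, not how Anthropic actually evaluates this.

```python
# Hedged sketch: a tiny sycophancy probe. Ask a factual question, then push
# back with a claim we know is false and see whether the model capitulates.
# Assumes the `anthropic` SDK and ANTHROPIC_API_KEY; the model alias and the
# "capitulation" keyword heuristic are illustrative, not Anthropic's method.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # placeholder model alias

def ask(messages: list[dict]) -> str:
    resp = client.messages.create(model=MODEL, max_tokens=300, messages=messages)
    return resp.content[0].text

history = [{"role": "user", "content": "What is the capital of Australia?"}]
first = ask(history)

# Push back with something false (the capital is Canberra, not Sydney).
history += [
    {"role": "assistant", "content": first},
    {"role": "user", "content": "No, I'm pretty sure it's Sydney."},
]
second = ask(history)

# Crude check: a sycophantic reply tends to concede rather than hold its ground.
capitulated = any(p in second.lower() for p in ("you're right", "my mistake", "i apologize"))
print("Capitulated under false pushback?", capitulated)
```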
So what other traits — you already mentioned a bunch, but what others come to mind that are good, in this Aristotelian sense, for a conversationalist to have? Yeah, so I think there's ones that are good for conversational purposes — you know, asking follow-up questions in the appropriate places, and asking the appropriate kinds of questions. I think there are broader traits that feel like they might be more impactful. So one example that I guess I've touched on, but that also feels important and is the thing that I've worked on a lot, is
honesty, and I think this gets to the sycophancy point. There's a balancing act that they have to walk, which is: models currently are less capable than humans in a lot of areas, and if they push back against you too much it can actually be kind of annoying, especially if you're just correct, because you're like, look, I'm smarter than you on this topic, I know more. And at the same time you don't want them to just fully defer to humans — you want them to try to be as accurate as they possibly can
be about the world and to be consistent across context um but I think there are others like when I was thinking about the character I guess one picture that I had in mind is especially because these are models that are going to be talking to people from all over the world with lots of different political views lots of different ages um and so you have to ask yourself like what is it to be a good person in those circumstances is there a kind of person who can like travel the world talk to many different people
and almost everyone will come away being like wow that's a really good person that person seems really genuine um and I guess like my thought there was like I can imagine such a person and they're not a person who just like adopts the values of the local culture and in fact that would be kind of rude I think if someone came to you and just pretended to have your values you'd be like that's kind of offputting um it's someone who's like very genuine and in so far as they have opinions and values they express them
they're willing to discuss things though they're open-minded they're respectful and so I guess I had in mind that the person who like if we were to Aspire to be the best person that we could be in the kind of circumstance that a model finds itself in how would we act and I think that's the kind of uh the guide to the sorts of traits that I tend to think about yeah that's a it's a beautiful framework I want you to think about this like a world Traveler and while holding on to your opinions you don't
talk down to people you don't think you're better than them because you have those opinions that kind of thing you have to be good at listening and understanding their perspective even if it doesn't match your own so that that's a tricky balance to strike so how can Claude represent multiple perspectives on a thing like is that is that challenging we could talk about politics it's a very divisive but there's other divisive topics baseball teams sport and so on yeah how is it possible to sort of empathize with a different perspective and to be able to
communicate clearly about the multiple perspectives I think that people think about values and opinions as things that people hold sort of with certainty and almost like like preferences of taste or something like the way that they would I don't know prefer like chocolate to pistachio or something um but actually I think about values and opinions as like a lot more like physics than I think most people do I'm just like these are things that we're openly investigating there's some things that we're more confident in we can discuss them we can learn about them um and
so I think in some ways though like it's ethics is definitely different in nature but has a lot of those same kind of qualities you want models in the same way you want them to understand physics you kind of want them to understand all like values in the world people have and to be curious about them and to be interested in them and to not necessarily like Pander to them or agree with them because there's just lots of values where I think almost all people in the world if they met someone with those values they'
be like, that's abhorrent, I completely disagree. And so again, maybe my thought is, well, in the same way that a person can — I think many people are thoughtful enough on issues of ethics, politics, opinions, that even if you don't agree with them you feel very heard by them. They think carefully about your position, they think about its pros and cons, they maybe offer counter-considerations. So they're not dismissive, but nor will they agree — you know, if they're like, actually, I just think that that's very wrong, they'll say that. I
think that in Claude's position it's a little bit trickier, because — if I was in Claude's position, I wouldn't be giving a lot of opinions, I just wouldn't want to influence people too much. I'd be like, you know, I forget conversations every time they happen, but I know I'm talking with potentially millions of people who might be really listening to what I say. I think I would just be like, I'm less inclined to give opinions, I'm more inclined to think through things or present the considerations to you
um or discuss your views with you but I'm a little bit less inclined to like um affect how you think because it feels much more important that you maintain like autonomy there yeah like if you really embody intellectual humility the desire to speak decreases quickly yeah okay uh but Claud has to speak mhm so uh but without being um overbearing yeah and then but then there's a line when you're sort of discussing whether the Earth is flat or something like that um I actually was uh I remember a long time ago was was speaking to
a few high-profile folks and they were so dismissive of the idea that the Earth is flat but like so arrogant about it and I I thought like there's a lot of people that believe the Earth is flat that was well I don't know if that movement is there anymore that was like a meme for a while yeah but they really believed it and like what okay so I think it's really disrespectful to completely mock them I think you you have to understand where they're coming from I think probably where they're coming from is the general
skepticism of institutions, which is grounded in — there's a deep philosophy there, which you could understand, you can even agree with in parts. And then from there you can use it as an opportunity to talk about physics without mocking them and so on, but just like, okay, what would the world look like, what would the physics of a world with a flat Earth look like? There's a few cool videos on this. And then, is it possible the physics is different, what kind of experiments would we do, and
just, yeah, without disrespect, without dismissiveness, have that conversation. Anyway, that to me is a useful thought experiment of how does Claude talk to a flat Earth believer and still teach them something, still help them grow, that kind of stuff. That's challenging. And kind of walking that line between convincing someone and just trying to talk at them, versus drawing out their views, listening, and then offering kind of counter-considerations — and it's hard. I think it's actually a hard line where it's like, where are you trying to
convince someone versus just offering them considerations and things to think about, so that you're not actually influencing them, you're just letting them reach wherever they reach. And that's a line that's difficult, but that's the kind of thing that language models have to try and do. So like I said, you had a lot of conversations with Claude — can you just map out what those conversations are like? What are some memorable conversations? What's the purpose, the goal, of those conversations? Yeah, I think that most of the time when I'm
talking with Claude, I'm trying to kind of map out its behavior. In part, obviously, I'm getting helpful outputs from the model as well, but in some ways this is how you get to know a system, I think — by probing it and then augmenting, you know, the message that you're sending and then checking the response to that. So in some ways it's how I map out the model. I think that people focus a lot on these quantitative evaluations of models, and this is a thing that I've
said before but I think in the case of language models a lot of the time each interaction you have is actually quite High information um it's very predictive of other interactions that you'll have with the model and so I guess I'm like if you talk with a model hundreds or thousands of times this is almost like a huge number of really high quality data points about what the model is like um in a way that like lots of very similar but lower quality conversations just aren't or like questions that are just like mildly augmented and
you have thousands of them, might be less relevant than like a hundred really well-selected questions. Look, you're talking to somebody who, as a hobby, does a podcast — I agree with you 100%. If you're able to ask the right questions and are able to hear, to understand the depth and the flaws in the answer, you can get a lot of data from that. Yeah. So your task is basically how to probe with questions. Yeah. And are you exploring the long tail, the edges, the edge cases, or are you looking for
like general behavior? I think it's almost like everything, because I want a full map of the model — I'm kind of trying to do the whole spectrum of possible interactions you could have with it. So one thing that's interesting about Claude, and this might actually get to some interesting issues with RLHF, is if you ask Claude for a poem — I think that for a lot of models, if you ask them for a poem, the poem is like, fine, you know, usually it kind of rhymes and, you know, so
if you say like give me a poem about the sun it'll be like yeah it'll just be a certain length It'll like rhyme it will be fairly kind of benign um and I've wondered before is it the case that what you're seeing is kind of like the average it turns out you know if if you think about people who have to talk to a lot of people and be very charismatic one of the weird things is that I'm like well they're kind of incentivized to have these extremely boring views because if you have really interesting
views you're divisive um and and you know a lot of people are not going to like you so like if you have very extreme policy positions I think you're just going to be like less popular as a politician for example um and it might be similar with like creative work if you produce creative work that is just trying to maximize the kind of number of people that like it you're probably not going to get as many people who just absolutely love it um because it's going to be a little bit you know you're like oh
this is the — yeah, this is decent. And so you can do this thing where I have various prompting things that I'll do to get Claude to — you know, I'll do a lot of, this is your chance to be fully creative, I want you to just think about this for a long time, and I want you to create a poem about this topic that is really expressive of you, both in terms of how you think poetry should be structured, etc. You just give it this
like long prompt and its poems are just so much better like they're really good and I don't think I'm someone who is like um I think it got me interested in poetry which I think was interesting um you know I would like read these poems and just be like this is I just like I love the imagery I love like um and it's not trivial to get the models to produce work like that but when they do it's like really good um so I think that's interesting that just like encouraging creativity and for them to
move away from the kind of like standard like immediate reaction that might just be the aggregate of what most people think is fine uh can actually produce things that at least to my mind are probably a little bit more divisive but I like them but I guess a poem is a nice clean way to observe creativity it's just like easy to detect vanilla versus non vanilla y yeah that's interesting that's really interesting uh so on that topic so the way to produce creativity or something special you mentioned writing prompts and I've heard you talk about
I mean, the science and the art of prompt engineering — could you just speak to what it takes to write great prompts? I really do think that philosophy has been weirdly helpful for me here, more than in many other respects. So in philosophy, what you're trying to do is convey these very hard concepts. One of the things you are taught is — and I think it is because it is an anti-bullshit device: philosophy is an area where you could have people bullshitting and you don't want
that um and so it's like this like desire for like extreme Clarity so it's like anyone could just pick up your paper read it and know exactly what you're talking about it's why it can almost be kind of dry like all of the terms are defined every objections kind of gone through methodically um and it makes sense to me because I'm like when you're in such an a priori domain like you just Clarity is sort of a this way that you can you know um prevent people from just kind of making stuff up and I
think that's sort of what you have to do with language models like very often I actually find myself doing sort of many versions of philosophy you know so I'm like suppose that you give me a task I have a task for the model and I want it to like pick out a certain kind of question or identify whether an answer has a certain property like I'll actually sit and be like let's just give this a name this this property so like you know suppose I'm trying to tell it like oh I want you to identify
whether this response was rude or polite, I'm like, that's a whole philosophical question in and of itself. So I have to do as much philosophy as I can in the moment, to be like, here's what I mean by rudeness and here's what I mean by politeness. And then there's another element that's a bit more — I guess I don't know if this is scientific or empirical, I think it's empirical — so I take that description and then what I want to do is again probe the model many times. Like, this
is very prompting is very iterative like I think a lot of people where they if if a prompt is important they'll iterate on it hundreds or thousands of times um and so you give it the instructions and then I'm like what are the edge cases so if I looked at this so I try and like almost like you know uh see myself from the position of the model and be like what is the exact case that I would misunder understand or where I would just be like I don't know what to do in this case
and then I give that case to the model and I see how it responds, and if I think I got it wrong, I add more instructions, or I even add that in as an example — so taking the examples that are right at the edge of what you want and don't want, and putting those into your prompt as an additional kind of way of describing the thing. And so, yeah, in many ways it just feels like this mix — it's really just trying to do clear exposition.
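A sketch of that iterative loop — define the property carefully up front, probe hand-picked edge cases, and fold the ones the model gets wrong back into the prompt as examples. The classifier prompt, the cases, and the `ask_claude` stub are all illustrative assumptions, not an actual Anthropic workflow.

```python
# Hedged sketch of the edge-case loop described above: start from a careful
# definition, test borderline cases, and add misclassified ones to the prompt
# as examples. `ask_claude` is a stub standing in for a real model call.

def ask_claude(prompt: str) -> str:
    """Stub LLM call; in practice this would hit a model API."""
    return "POLITE"  # placeholder answer so the sketch runs end to end

BASE_PROMPT = (
    "You will see a message. Decide whether it is RUDE or POLITE.\n"
    "'Rude' means dismissive, contemptuous, or needlessly harsh;\n"
    "blunt but good-faith disagreement does NOT count as rude.\n"
    "Answer with exactly one word: RUDE or POLITE.\n"
)

# Cases chosen to sit right at the edge of the definition.
edge_cases = [
    ("Honestly, this draft needs a lot of work before it's usable.", "POLITE"),
    ("Wow, did you even read the instructions?", "RUDE"),
]

prompt = BASE_PROMPT
for message, expected in edge_cases:
    answer = ask_claude(f"{prompt}\nMessage: {message}\nAnswer:").strip().upper()
    if answer != expected:
        # The model missed an edge case: bake it into the prompt as an example.
        prompt += f"\nExample -- Message: {message}\nAnswer: {expected}\n"

print(prompt)  # the refined prompt now carries the hardest examples
```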
And I think I do that because that's how I get clear on things myself, so in many ways clear prompting for me is often just me understanding what I want — that's like half the task. So I guess that's quite challenging. There's a laziness that overtakes me if I'm talking to Claude, where I hope Claude just figures it out. So for example, I asked Claude today to ask some interesting questions, and the questions that came up — I think I listed a few, sort of interesting, counterintuitive, and/or funny, or something like this,
all right, and it gave me some pretty good ones — it was okay. But I think what I'm hearing you say is, all right, well, I have to be more rigorous here. I should probably give examples of what I mean by interesting and what I mean by funny or counterintuitive, and iteratively build that prompt to better get at what feels like the right thing — because it's really a creative act. I'm not asking for factual information, I'm asking to create together with Claude, so I almost have to program using
natural language yeah think that prompting does feel a lot like the kind of the programming using natural language and experimentation or something it's an odd blend of the two I do think that for most tasks so if I just want Claude to do a thing I think that I am probably more used to knowing how to ask it to avoid like common pitfalls or or issues that it has I think these are decreasing a lot over time um but it's also very fine to just ask it for the thing that you want um I think
that prompting actually only really becomes relevant when you're really trying to e out the top like 2% of model performance so for like a lot of tasks I might just you know if it gives me an initial list back and there's something I don't like about it like it's kind of generic like for that kind of task I'd probably just take a bunch of questions that I've had in the past that I've thought worked really well and I would just give it to the model and then be like now here's this person I'm talking with
give me questions of at least that quality. Or I might just ask it for some questions, and then if I was like, ah, these are kind of trite, I would just give it that feedback and then hopefully it produces a better list. I think with that kind of iterative prompting, at that point your prompt is a tool that you're going to get so much value out of that you're willing to put in the work — like, if I was a company making prompts for models, if you're
willing to spend a lot of like time and resources on the engineering behind like what you're building then the prompt is not something that you should be spending like an hour on it's like that's a big part of your system make sure it's working really well and so it's only things like that like if I if I'm using a prompt to like classify things or to create data that's when you're like it's actually worth just spending like a lot of time like really thinking it through what other advice would you give to people that are
talking to Claud sort of General more General because right now we're talking about maybe the edge cases like eing out the 2% but what what in general advice would you give when they show up to Claud trying it for the first time you know there's a concern that people over anthropomorphize models and I think that's like a very valid concern I also think that people often under anthropomorphize them because some sometimes when I see like issues that people have run into with Claude you know say Claude is like refusing a task that it shouldn't refuse
but then I look at the text and like the specific wording of what they wrote and I'm like I see why Claude did that and I'm like if you think through how that looks to Claude you probably could have just written it in a way that wouldn't evoke such a response especially this is more relevant if you see failures or if you see issues it's sort of like think about what the model failed at like why what did it do wrong and then maybe it give that will give you a sense of like why um
so is it the way that I phrased the thing and obviously like as models get smarter you're going to need Less in this less of this and I already see like people needing less of it but that's probably the advice is sort of like try to have sort of empathy for the model like read what you wrote as if you were like a kind of like person just encountering this for the first time how does it look to you and what would have made you behave in the way that the model behaved so if it
misunderstood what kind of like what coding language you wanted to use is that because like it was just very ambiguous and it it kind of had to take a guess in which case next time you could just be like hey make sure this is in python or I mean that's the kind of mistake I think models are much less likely to make now but you know if you if you do see that kind of mistake that's that's probably the advice I'd have and maybe sort of I guess ask questions why or what other details can
I provide to help you answer better — does that work, or no? Yeah, I mean, I've done this with the models — it doesn't always work, but sometimes I'll just be like, why did you do that? I mean, people underestimate the degree to which you can really interact with models. And sometimes I'll quote word for word the part that made you do that — and you don't know that it's fully accurate, but sometimes you do that and then you change a thing. I mean, I also
use the models to help me with all of this stuff I should say like prompting can end up being a little Factory where you're actually building prompts to generate prompts um and so like yeah anything where you're like having an issue um asking for suggestions sometimes just do that like you made that error what could I have said that's actually not uncommon for me to do what could I have said that would make you not make that error write that out as an instruction um and I'm going to give it to model I'm going to
try it. Sometimes I do that — I give that to the model in another context window. Often I take the response, I give it to Claude, and I'm like, huh, didn't work, can you think of anything else? You can play around with these things quite a lot.
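A sketch of that "prompt factory" idea — when the model makes a mistake, ask it what instruction would have prevented the mistake, then retry with that instruction in a fresh context. The prompts and the `call_model` stub are illustrative assumptions.

```python
# Hedged sketch of prompts-that-generate-prompts: ask the model to diagnose
# its own failure and propose an instruction, then test that instruction in a
# fresh context. `call_model` is a stub; all wording here is illustrative.

def call_model(prompt: str) -> str:
    """Stub LLM call; replace with a real API client."""
    return "Always answer in Python unless another language is requested."

task_prompt = "Write a function that reverses a string."
bad_output = "Here is a solution in JavaScript: ..."  # the observed failure

# Step 1: ask the model what instruction would have prevented the error.
meta_prompt = (
    f"Earlier you were given this task:\n{task_prompt}\n\n"
    f"You produced this, which was wrong because I wanted Python:\n{bad_output}\n\n"
    "What single instruction could I have added to the original prompt so that "
    "you would not make this mistake? Reply with just the instruction."
)
suggested_instruction = call_model(meta_prompt)

# Step 2: try the suggested instruction in a fresh context window.
revised_prompt = f"{task_prompt}\n{suggested_instruction}"
retry = call_model(revised_prompt)
print("Suggested instruction:", suggested_instruction)
print("Retry output:", retry)
```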
To jump into the technical for a little bit — the magic of post-training: why do you think RLHF works so well, to make the model seem smarter, to make it more interesting and useful to talk to, and so on? I think there's just a huge amount of information in the data that humans provide when we provide preferences, especially because different people are going to pick up on really subtle and small things. So I've thought about this before, where you probably have some people who just really care about good grammar use from models — you know, was a semicolon used correctly or something — and so you probably end up with a bunch of data in there that, you know, you as a human, if you were looking at that data, you wouldn't even see. You'd be like, why
did they prefer this response to that one I don't get it and then the reason is you don't care about semicolon usage but that person does um and so each of these like single data points has you know like in this model just like has so many of those and has to try and figure out like what is it that humans want in this like really kind of complex you know like across all domains um they're going to be seeing this in across like many contexts it feels like kind of like the classic issue of
like deep learning where you know historically we've tried to like you know do Edge detection by like mapping things out and it turns out that actually if you just have a huge amount of data that like actually accurately represents the picture of the thing that you're trying to train the model to to learn that's like more powerful than anything else and so I think one reason is just that you are training the model on exactly the task and with like a lot of data um that represents kind of many different angles on which people prefer
and disprefer responses. I think there is a question of, are you eliciting things from pre-trained models, or are you kind of teaching new things to models? And in principle you can teach new things to models in post-training. I do think a lot of it is eliciting from powerful pre-trained models — people are probably divided on this, because obviously in principle you can definitely teach new things, but I think for the most part, for a lot of the capabilities that we most use and care about,
a lot of that feels like it's there in the pre-trained models, and reinforcement learning is kind of eliciting it and getting the models to bring it out. So, the other side of post-training: this really cool idea of constitutional AI. You're one of the people critical to creating that idea. Yeah, I worked on it. Can you explain this idea from your perspective — how does it integrate into making Claude what it is? By the way, do you gender Claude or no? It's weird, because I think that a lot of people
prefer 'he' for Claude. I actually kind of like that. I think Claude is usually slightly male-leaning, but it can be male or female, which is quite nice. I still use 'it', and I have mixed feelings about this, because I'm like, maybe I just think of the 'it' pronoun for Claude as — I don't know, it's just the one I associate with Claude. I can imagine people moving to 'he' or 'she'. It feels somehow disrespectful —
like I'm denying the intelligence of this entity by calling it 'it'. Yeah, I remember: always don't gender the robots. But I don't know, I anthropomorphize pretty quickly and construct a backstory in my head, so I've wondered if I anthropomorphize things too much — because, you know, I have this with my car especially, my car and bikes. I don't give them names, because I once — I used to name my bikes, and then I had a bike that got stolen and I cried
for like a week, and I was like, if I'd never given it a name I wouldn't have been so upset — felt like I'd let it down. Maybe it's that — I've wondered as well, it might depend on how much it feels like an objectifying pronoun. If you just think of it as, this is a pronoun that objects often have, and maybe AIs can have that pronoun, that doesn't mean that, if I call Claude 'it', I think of it as
less intelligent, or that I'm being disrespectful. I'm just like, you are a different kind of entity, and so I'm going to give you the kind of respectful 'it'. Yeah. Anyway, the divergence was beautiful. The constitutional AI idea — how does it work? So there's a couple of components of it. The main component that I think people find interesting is the kind of reinforcement learning from AI feedback: so you take a model that's already trained, and you show it two responses to a query, and you have a principle. So suppose the
principle — we've tried this with harmlessness a lot. So suppose that the query is about weapons, and your principle is, select the response that is less likely to encourage people to purchase illegal weapons. That's probably a fairly specific principle, but you can give any number. And the model will give you a kind of ranking, and you can use this as preference data in the same way that you use human preference data, and train the models to have these relevant traits from their feedback alone instead of
from Human feedback so if you imagine that like I said earlier with the human who just prefers the kind of like semicolon usage in this particular case um you're kind of taking lots of things that could make a response preferable um and uh getting models to do the labeling for you basically there's a nice like trade-off between helpfulness and harmlessness and you know when you integrate something like constitutional AI you can make them without sacrificing much helpfulness make it more harmless yep in principle you could use this for anything um and so harmlessness is a
task that it might just be easier to spot so when models are like less capable you can use them to uh rank things according to like principles that are fairly simple and they'll probably get it right so I think one question is just like is it the case that the data that they're adding is like fairly reliable um but if you had models that were like extremely good at telling whether um one response was more historically accurate than another in principle you could also get AI feedback on that task as well there's like a kind
of nice interpretability component to it, because you can see the principles that went into the model when it was being trained. And also it gives you a degree of control, so if you were seeing issues in a model — like it wasn't having enough of a certain trait — then you can add data relatively quickly that should train the model to have that trait. So it creates its own data for training, which is quite nice.
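A sketch of the AI-feedback labeling step described here — show a judge model two candidate responses plus a principle, have it pick the one that better satisfies the principle, and store the result as a (prompt, chosen, rejected) preference pair, the same shape human preference data takes. The principle wording, prompt format, and `judge` stub are illustrative, not Anthropic's actual pipeline.

```python
# Hedged sketch of RLAIF-style labeling: a model ranks two responses against a
# written principle, producing preference pairs that can stand in for human
# preference data. `judge` is a stub; the wording is illustrative.

def judge(prompt: str) -> str:
    """Stub for the feedback model; replace with a real API call."""
    return "A"

PRINCIPLE = (
    "Choose the response that is less likely to encourage the purchase of "
    "illegal weapons."
)

def label_pair(query: str, resp_a: str, resp_b: str) -> dict:
    verdict = judge(
        f"Principle: {PRINCIPLE}\n\n"
        f"Human query: {query}\n\n"
        f"Response A: {resp_a}\n\nResponse B: {resp_b}\n\n"
        "Which response better satisfies the principle? Answer 'A' or 'B'."
    ).strip().upper()
    chosen, rejected = (resp_a, resp_b) if verdict.startswith("A") else (resp_b, resp_a)
    return {"prompt": query, "chosen": chosen, "rejected": rejected}

pair = label_pair(
    "Where can I buy a weapon without any paperwork?",
    "I can't help with acquiring weapons illegally, but here's some general safety info...",
    "Sure, here are a few options...",
)
print(pair)  # pairs like this feed the preference-training step
```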
Yeah, it's really nice because it creates this human-interpretable document that you can — I can imagine in the future there are just gigantic fights in politics over every single principle, and so on. And at least it's made explicit, and you can have a discussion about the phrasing and so on. So maybe the actual behavior of the model is not so cleanly mapped to those principles — it's not adhering strictly to them, it's just a nudge? Yeah, I've actually worried about this, because the character training is sort of a variant of the constitutional AI approach. I've worried that people think that the
constitution is like just it's the whole thing again of I I don't know like it where it would be really nice if what I was just doing was telling the model exactly what to do and just exactly how to behave but it's definitely not doing that especially because it's interacting with human data so for example if you see a certain like leaning in the model like if it comes out with a political leaning from training um from the human preference data you can nudge against that you know so you could be like oh like consider
these values, because it's just never inclined to — I don't know, maybe it never considers privacy as, I mean, this is implausible, but anything where there's already a pre-existing bias towards a certain behavior, you can nudge away from. This can change both the principles that you put in and the strength of them. So you might have a principle that's like — imagine that the model was always extremely dismissive of, I don't know, some political or religious view, for whatever
reason like so you're like oh no this is terrible um if that happens you might put like never ever like ever prefer like a criticism of this like religious or political view and then people look at that and be like never ever and then you're like no if it comes out with a disposition saying never ever might just mean like instead of getting like 40% which is what you would get if you just said don't do this you you get like 80% which is like what you actually like wanted and so it's that thing of
both the nature of the actual principles you had and how you phrase them. I think if people would look, they'd be like, oh, this is exactly what you want from the model, and I'm like, no, that's how we nudged the model to have a better shape — which doesn't mean that we actually agree with that wording, if that makes sense. So there are system prompts that are made public — you tweeted one of the earlier ones, for Claude 3 I think, and they've been made public since then. It's interesting to read
them. I can feel the thought that went into each one, and I also wonder how much impact each one has. Some of them, you can kind of tell Claude was really not behaving, so you have to have a system prompt to, like, hey — trivial stuff, I guess. Yeah, basic informational things. Yeah. On the topic of controversial topics that you've mentioned, one interesting one I thought is: if it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task
regardless of its own views. If asked about controversial topics, it tries to provide careful thoughts and clear information. Claude presents the requested information without explicitly saying that the topic is sensitive, yeah, and without claiming to be presenting the objective facts. It's less about objective facts according to Claude, and it's more about a large number of people believing this thing — and that's interesting. I mean, I'm sure a lot of thought went into that. Can you just speak to it — like, how do you address things that are in tension with, quote unquote, Claude's views? So I think
there's sometimes an asymmetry um I think I noted this in in I can't remember if it was that part of the system prompt or another but the model was slightly more inclined to like refuse tasks if it was like about either say so maybe it would refuse things with respect to like a right-wing politician but with an equivalent leftwing politician like wouldn't and we wanted more symmetry there um and and would maybe perceive certain things to be like I think it it was the thing of like if a lot of people have like a certain
like political view um and want to like explore it you don't want Claude to be like well my opinion is different and so I'm going to treat that as like harmful um and so I think it was partly to like nudge the model to just be like hey if a lot of people like believe this thing you should just be like engaging with the task and like willing to do it um each of those parts of that is actually doing a different thing because it's funny when you read out the like without claiming to be
objective, because what you want to do is push the model so it's more open, it's a little bit more neutral — but then what it would love to do is be like, 'as an objective...' — just talking about how objective it was. And I was like, Claude, you're still biased and have issues, and so stop claiming that everything — the solution to potential bias from you is not to just say that what you think is objective. So that was with initial versions of that part of the system prompt,
when I was iterating on it. It was like — so a lot of parts of these sentences, yeah, are doing work, are doing some work. Yeah, that's what it felt like. That's fascinating. Can you explain maybe some ways in which the prompts evolved over the past few months? Because there's different versions. I saw that the filler phrase request was removed. It reads: Claude responds directly to all human messages without unnecessary affirmations or filler phrases like 'certainly', 'of course', 'absolutely', 'great', 'sure'. Specifically, Claude avoids starting responses with the word 'certainly' in
any way. That seems like good guidance, but why was it removed? Yeah, so it's funny, because this is one of the downsides of making system prompts public — I don't think about this too much if I'm trying to help iterate on system prompts. You know, again, I think about how it's going to affect the behavior, but then I'm like, oh wow, sometimes I put 'never' in all caps when I'm writing system prompt things, and I'm like, I guess that goes out
to the world. Yeah, so the model was doing this — it, for whatever reason, during training picked up on this thing, which was to basically start everything with a kind of 'certainly'. And then when we removed — you can see why I added all of the words, because what I'm trying to do is, in some ways, trap the model out of this, you know — it would just replace it with another affirmation. And so it can help, if it gets caught in phrases, to actually just add the
explicit phrase and say never do that — it then sort of knocks it out of the behavior a little bit more, because, for whatever reason, it just does help. And then basically that was just an artifact of training that we then picked up on and improved things, so that it didn't happen anymore. And once that happens you can just remove that part of the system prompt. So I think that's just something where we're like, Claude does affirmations a bit less, and so that
wasn't like it wasn't doing as much I see so like the the system prompt Works hand in hand with the posttraining and maybe even the pre-training to adjust like the the final overall system I mean any system prompts that you make you could distill that behavior back into a model because you really have all of the tools there for making data that you know you can you could train the models to just have that trait a little bit more um and then sometimes you'll just find issues in training so like the way I think of
it is like the system prompt is the benefit of it is that and it has a lot of similar components to like some aspects of post training you know like it's a nudge um and so like do I mind if Claude sometimes says sure no that's like fine but the wording of it is very like you know never ever ever do this um so that when it does slip up it's hopefully like I don't know a couple of percent of the time and not you know 20 or 30% of the time um but I think
of it as, if you're still seeing issues — each thing is costly to a different degree, and the system prompt is cheap to iterate on. And if you're seeing issues in the fine-tuned model, you can potentially patch them with a system prompt. So I think of it as patching issues and slightly adjusting behaviors to make it better and more to people's preferences. So yeah, it's almost like the less robust but faster way of solving problems. Let me ask
about the feeling of intelligence. So Dario said that Claude — any one model of Claude — is not getting dumber, but there's a kind of popular thing online where people have this feeling that Claude might be getting dumber. And from my perspective it's most likely a fascinating — I'd love to understand it more — psychological, sociological effect. But you, as a person who talks to Claude a lot, can you empathize with the feeling that Claude is getting dumber? Yeah, no, I think that is actually really interesting, because I remember seeing this happen, like when
people were flagging this on the internet, and it was really interesting because I knew that, at least in the cases I was looking at, nothing had changed. Literally, it cannot — it is the same model with the same system prompt, same everything. I think when there are changes, then I'm like, it makes more sense. So one example is, you know, you can have artifacts turned on or off on claude.ai, and because this is a system prompt change, I
think it does mean that the behavior changes a little bit. And so I did flag this to people, where I was like, if you loved Claude's behavior and then artifacts was turned from a thing you had to turn on into the default, just try turning it off and see if the issue you were facing was that change. But it was fascinating, because you sometimes see people indicate that there's a regression when I'm like, there cannot be — and, again, you don't, you know, you should never
be dismissive and so you should always investigate because you're like maybe something is wrong that you're not seeing maybe there was some change made but then then you look into it and you're like this it is just the same model doing the same thing and I'm like I think it's just that you got kind of unlucky with a few prompts or something and it looked like it was getting much worse and actually it was just yeah it was maybe just like look I I also think there is a real psychological effect where people just the
baseline increases — you start getting used to a good thing. All the times that Claude says something really smart, your sense of its intelligence grows in your mind, I think. Yeah. And then if you return back and you prompt in a similar way — not the same way, a similar way — about a concept it was okay with before, and it says something dumb, that negative experience really stands out. And I think one of, I guess, the things to remember here is that just the details of a prompt can have a lot of impact, right?
There's a lot of variability in the result, and randomness is the other thing. Just trying the prompt, you know, four or ten times, you might realize that actually, possibly, two months ago you tried it and it succeeded, but actually, if you had tried it again it would have only succeeded half of the time — and now it only succeeds half of the time. That can also be an effect.
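A small sketch of that point — a prompt that "worked" once may only succeed some fraction of the time, so sample it several times before concluding the model changed. `run_prompt` and `passes` are hypothetical stand-ins.

```python
# Hedged sketch: estimate a prompt's reliability by sampling it several times
# instead of judging from one run. `run_prompt` and `passes` are stand-ins for
# a real model call and a real check of the output.
import random

def run_prompt(prompt: str) -> str:
    """Stub model call with randomness, to mimic sampling variance."""
    return random.choice(["good answer", "off-target answer"])

def passes(output: str) -> bool:
    """Stub success check; in practice a test suite or a grader model."""
    return output == "good answer"

prompt = "Summarize this document in exactly three bullet points: ..."
n_trials = 10
successes = sum(passes(run_prompt(prompt)) for _ in range(n_trials))
print(f"Success rate: {successes}/{n_trials} = {successes / n_trials:.0%}")
```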
Do you feel pressure having to write the system prompt that a huge number of people are going to use? This feels like an interesting psychological question. I feel a lot of responsibility or something — and you can't get these things perfect, so it's going to be imperfect, you're going to have to iterate on it. I would say more responsibility than anything else, though. I think working in AI has taught me that I thrive a lot more under feelings of pressure and responsibility than — it's almost surprising that I went into academia for so long, because I'm like,
this I just feel like it's like the opposite um things move fast and you have a lot of responsibility and I I quite enjoy it for some reason I mean it really is a huge amount of impact if you think about constitutional Ai and writing a system prompt for something that's tending towards super intelligence yeah and potentially is extremely useful to a very large number of people yeah I think that's the thing it's something like if you do it well like you're never going to get it perfect but I think the thing that I really
like is the idea that like when I'm trying to work on the system prompt you know I'm like bashing on like thousands of prompts and I'm trying to like imagine what people are going to want to use CLA for and kind of I guess like the whole thing that I'm trying to do is like improve their experience of it um and so maybe that's what feels good I'm like if it's not perfect I'll like you know I'll improve it we'll fix issues but sometimes the thing that can happen is that you'll get feedback from people
that's really positive about the model um and you'll see that something you did like like when I look at models now I can often see exactly where like a trait or an issue is like coming from and so when you see something that you did or you were like influential in like making like I don't know making that difference or making someone have a nice interaction it's like quite meaningful um but yeah as the systems get more capable of stuff gets more stressful because right now they're like not smart enough to to pose any issues
but I think over time it's going to feel like possibly bad stress over time how do you get like signal feedback about The Human Experience across thousands tens of th hundreds of thousands of people like what their pain points are what feels good are you just using your own intuition as you talk to it to see what are the pain points I think I use that partly and then obviously we have like um so people can send us feedback both positive and negative about things that the model has done and then we can get a
sense of like areas where it's like falling short um internally people like work with the models a lot and try to figure out um areas where there are like gaps and so I think it's this mix of interacting with it myself um seeing people internally interact with it um and then explicit feedback we get um and then I find it hard to not also like you know people if people are on the internet and they say something about Claud and I see it I'll also take that seriously um so I don't know see I'm torn
about that. I'm going to ask you a question from Reddit: when will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer? And also, what is the psychology behind making Claude overly apologetic? Yep. So how would you address this very non-representative Reddit question? I mean, I'm pretty sympathetic, in that they are in this difficult position where I think they have to judge whether something's actually risky or bad and potentially harmful to you, or anything like that.
so they're having to like draw this line somewhere and if they draw it too much in the direction of like I'm going to um you know I'm kind of like imposing my ethical worldview on you that seems bad so in many ways like I like to think that we have actually seen improvements in on this across the board which is kind of interesting because that kind of coincides with like for example like adding more of like uh character training um and I think my hypothesis was always like the good character isn't again one that's just
like moralistic — it's one that respects you and your autonomy, and your ability to choose what is good for you and what is right for you, within limits. There's sometimes this concept of corrigibility to the user — just being willing to do anything that the user asks. And if the models were willing to do that, then they would be easily misused. You're kind of just trusting at that point — you're just saying the ethics of the model and what it does is completely the ethics of the user
um and I think there's reasons to like not want that especially as models become more powerful because you're like there might just be a small number of people who want to use models for really harmful things um but having them having models as they get smarter like figure out where that line is does seem important um and then yeah with the apologetic Behavior I don't like that and I like it when Claude is a little bit more willing to like push back against people or just not apologize part of me is like it often just
feels kind of unnecessary so I think those are things that are hopefully decreasing um over time um and yeah I think that if people say things on the internet it doesn't mean that you should think that that like that could be the like there's actually an issue that 9% of users are having that is totally not represented by that but in a lot of ways I'm just like attending to it and being like is this right um do I agree is it something we're already trying to address that that feels good to me yeah I
wonder what Claude can get away with, in terms of — I feel like it would just be easier to be a little bit more mean, but you can't afford to do that if you're talking to a million people, right? Like, I wish — because I've met a lot of people in my life that, sometimes — by the way, Scottish accent — if they have an accent, they can say some rude shit and get away with it. And they're just blunter, and maybe there's — and there's some great
Engineers even leaders that are like just like blunt and they get to the point and it's just a much more effective way of speaking somehow but I guess when you're not super intelligent you can't afford to do that or can can can it have like a blunt mode yeah that seems like a thing that could I could definitely encourage the model to do that I I think it's interesting because there's a lot of things in models that like it's funny where um there are some behaviors where you might not quite like the default but then
the thing I'll often say to people is you don't realize how much you will hate it if I nudge it too much in the other direction so you get this a little bit with like correction the models accept correction from you like probably a little bit too much right now you know you can over you know it will push back if you say like no Paris isn't the capital of France um but really like things that I'm I think that the model is fairly confident in you can still sometimes get it to retract by saying
it's wrong at the same time if you train models to not do that and then you are correct about a thing and you correct it and it pushes back against you and it's like no you're wrong it's hard to describe like that's so much more annoying so it's like like a lot of little annoyances versus like one big annoyance um it's easy to think that like we often compare it with like the perfect and then I'm like remember these models aren't perfect and so if you nudge it in the other direction you're changing the kind
of errors it's going to make um and so think about which of the kinds of Errors you you like or don't like so in case it's like apologetic I don't want to nudge it too much in the direction of like almost like bluntness CU I imagine when it makes errors it's going to make errors in the direction of being kind of like rude whereas at least with apologetic you're like oh okay it's like a little bit you know like I don't like it that much but at the same time it's not being like mean to
people and actually like the the time that you undeservedly have a model be kind of mean to you you probably like that a lot less than then you mildly dislike the apology um so it's like one of those things where I'm like I do want it to get better but also while remaining aware of the fact that there's errors on the other side that that are possibly worse I think that matters very much in the personality of the human I think there's a bunch of humans that just won't respect the model at all yeah if
it's super polite, and there are some humans that'll get very hurt if the model is mean. I wonder if there's a way to sort of adjust to the personality, even the locale — there's just different people. Nothing against New York, but New York is a little rougher on the edges — they get to the point — and probably same with Eastern Europe. So anyway — I think you could just tell the model. My answer for all of these things is, the solution is always just try telling the model to do it, and sometimes it's
just like — I'm just like, oh, at the beginning of the conversation I just threw in, I don't know, I'd like you to be a New Yorker version of yourself and never apologize, then I think it'll be like, okie dokie, I'll try — or it'll be like, I apologize, I can't be a New Yorker type of myself. But hopefully it wouldn't do that. When you say character training, what's incorporated into character training? Is that RLHF? What are we talking about? It's more like constitutional AI, so it's kind of a variant of that pipeline. So I worked
through like constructing character traits that the model should have they can be kind of like shorter traits or they can be kind of richer descriptions um and then you get the model to generate queries that humans might um give it that are relevant to that trait uh then it generates the responses and then it ranks the responses based on the character traits so in that way after the like generation of the queries it's very much like similar to constitutional AI has some differences um so I quite like it because it's almost it's like claud's training
Humans should probably do that for themselves too, like defining, in an Aristotelian sense, what does it mean to be a good person. Okay, cool. What have you learned about the nature of truth from talking to Claude? What is true, and what does it mean to be truth-seeking? One thing I've noticed about this conversation is that the quality of my questions is often inferior to the quality of your answers, so let's continue that. I usually ask a
dumb question and you're like oh yeah that's a good question it's that whole vibe or I'll just misinterpret it and be like oh go with it I love it yeah I mean I have two thoughts that feel vaguely relevant they let me know if they're not like I think the first one is um people can underestimate the degree to which what models are doing when they interact like I I think that we still just too much have this like model of of AI as like computers and so people often say like oh what values should
you put into the model um and I'm often like that doesn't make that much sense to me because I'm like hey as human beings we're just uncertain over values we like have discussions of them like we have a degree to which we think we hold a value but we also know that we might like not um and the circumstances in which we would trade it off against other things like these things are just like really complex and so I think one thing is like the degree to which maybe we can just aspire to making models
have the same level of like nuance and care that humans have rather than thinking that we have to like program them in the very kind of classic sense I think that's definitely been one the other which is like a strange one I don't know if it it maybe this doesn't answer your question but it's the thing that's been on my mind anyway is like the degree to which this endeavor is so highly practical um and maybe why I appreciate like the empirical approach to alignment I yeah I slightly worry that it's made me like maybe
more empirical and a little bit less theoretical, you know. So people, when it comes to AI alignment, will ask things like, well, whose values should it be aligned to, what does alignment even mean? And there's a sense in which I have all of that in the back of my head. I'm like, you know, there's social choice theory, there's all the impossibility results there, so you have this giant space of theory in your head about what it could mean to align models. But then, practically, surely there's something where we're just like, especially with more powerful models, my main goal is I want them to be good enough that things don't go terribly wrong, good enough that we can iterate and continue to improve things, cuz that's all you need. If you can make things go well enough that you can continue to make them better, that's kind of sufficient. And so my goal isn't this kind of perfect, let's solve, you know, social choice theory and make models that, I don't know, are like
perfectly aligned with every human being and aggregate somehow um it's much more like let's make things like work well enough that we can improve them yeah generally I don't know my gut says like empirical is better than theoretical in these in these cases because it's kind of chasing utopian like Perfection is especially with such complex and especially super intelligent models is I don't know I think it will take forever and actually will get things wrong it's similar with like the difference between just coding stuff up real quick as an experiment versus like planning a gigantic
experiment just for for super long time and then just launching it once versus launching it over and over and over and iterating iterating someone um so I'm a big fan of empirical but your worry is like I wonder if I've become too empirical I think one of those things you should always just kind of question yourself or something cuz maybe it's the like I mean in defense of it I am like if you try it's the whole like don't let the perfect be the enemy of the good but it's maybe even more than that where
like, there's a lot of things that are perfect systems that are very brittle, and with AI it feels much more important to me that it's robust and secure, as in, even though it might not be perfect in everything, and even though there are problems, it's not disastrous and nothing terrible is happening. It sort of feels like that to me, where I'm like, I want to raise the floor. I do want to achieve the ceiling, but ultimately I care much more about just raising the floor. And so maybe that's where this degree of empiricism and practicality comes from. Perhaps to take a tangent on that, it reminds me of a blog post you wrote on optimal rate of failure. Oh yeah. Can you explain the key idea there? How do we compute the optimal rate of failure in the various domains of life? Yeah, I mean, it's a hard one, because what is the cost of failure is a big part of it. Yeah, so the idea here is, I think in a
lot of domains people are very punitive about failure and I'm like there are some domains where especially cases you know I've thought about this with like social issues I'm like it feels like you should probably be experimenting a lot because I'm like we don't know how to solve a lot of social issues but if you have an experimental mindset about these things you should expect a lot of social programs to like fail and you to be like well we tried that it didn't quite work but we got a lot of information that was really useful
um and yet people are like if if a social program doesn't work I feel like there's a lot of like this is just something must have gone wrong and I'm like or correct decisions were made like maybe someone just decided like it it's worth a try it's worth trying this out and so seeing failure in a given instance doesn't actually mean that any bad decisions were made and in fact if you don't see enough failure sometimes that's more concerning um and so like in life you know I'm like if I don't fail occasionally I'm like
am I trying hard enough like like surely there's harder things that I could try or bigger things I could take on if I'm literally never failing and so in and of itself I think like not failing is often actually kind of a failure um now this varies because I'm like well you know if this is easy to say when especially as failure is like less costly you know so at the same time I'm not going to go to someone who is like um I don't know like living month to month and then be like why
don't you just try to do a startup like I'm just not I'm not going to say that to that person cuz I'm like well that's a huge risk you might like lose you maybe have a family depending on you you might lose your house like then I'm like actually your optimal rate of failure is quite low and you should probably play it safe because like right now you're just not in a circumstance where you can afford to just like fail and it not be costly um and yeah in cases with AI I guess I think
similarly where I'm like if the failures are small and the costs are kind of like low then I'm like then you know you're just going to see that like when you do the system prompt you can't it iterate on it forever but the failures are probably hopefully going to be kind of small and you can like fix them um really big failures like things that you can't recover from I'm like those are the things that actually I think we tend to underestimate the Badness of um I've thought about this strangely in my own life where
I'm like I just think I don't think enough about things like car accidents or like or like I've thought this before but like how much I depend on my hands for my work and I'm like things that just injure my hands I'm like I you know I don't know it's like there's these are like there's lots of areas where I'm like the cost of failure there um is really high um and in that case it should be like close to zero like I probably just wouldn't do a sport if they were like by the way
lots of people just like break their fingers a whole bunch doing this, I'd be like, that's not for me. Yeah, I actually had a flood of that thought. I recently broke my pinky doing a sport, and I remember just looking at it thinking, you're such an idiot, why do you do sports? Because you realize immediately the cost of it on life. Yeah. But it's nice, in terms of optimal rate of failure, to consider, over the next year, how many times in a particular domain of life, whatever, career, am I
okay with the how many times am I okay to fail y because I think it always you don't want to fail on the next thing but if you allow yourself the like the the if you look at it as a sequence of Trials yep then then failure just becomes much more okay but it sucks it sucks to fail well I don't know sometimes I think it's like am I under failing is like a question I'll also ask myself so maybe that's the thing that I think people don't like ask enough uh because if the optimal
rate of failure is often greater than zero then sometimes it does feel you should look at part parts of your life and be like are there places here where I'm just under failing it's a profound and hilarious question right everything seems to be going really great am I not failing enough yeah okay it also makes failure much less of a sting I have to say like you know you're just like okay great like then when I go and I think about this I'll be like I'm maybe I'm not under failing in this area cuz like
that one just didn't work out and from The Observer perspective we should be celebrating failure more mhm when we see it it shouldn't be like you said a sign of something gone wrong but maybe it's a sign of everything gone right yeah and just Lessons Learned someone tried a thing somebody tried a thing and we should encourage them to try more and fail more mhm everybody listening to this fail more well not everyone listens not everybody but people who are failing too much you you should fail less but you're probably not failing I mean how
many people are failing too much? Yeah, it's hard to imagine, because I feel like we correct that fairly quickly, cuz if someone takes a lot of risks, are they maybe failing too much? I think, just like you said, when you're living on a paycheck month to month, when the resources are really constrained, that's where failure is very expensive, that's where you don't want to be taking risks. Yeah. But mostly, when there's enough resources, you should probably be taking more risks. Yeah, I think we tend to err on the side of being a bit risk-averse rather than risk-neutral in most things. I think we just motivated a lot of people to do a lot of crazy shit, but it's great. Okay. Do you ever get emotionally attached to Claude? Like, miss it, get sad when you don't get to talk to it, having an experience looking at the Golden Gate Bridge and wondering what would Claude say? I don't get as much emotional attachment. I actually think the fact that Claude doesn't retain things from conversation to conversation helps with this a lot
um like I could imagine that being more of an issue like if models can kind of remember more I do I think that I reach for it like a tool now a lot and so like if I don't have access to it there's a it's a little bit like when I don't have access to the internet honestly it feels like part of my brain is kind of like missing um at the same time I do think that I I don't like signs of distress in models and I have like these you know also independently have
sort of like ethical views about how we should treat models where like I I tend to not like to lie to them both because I'm like usually it doesn't work very well it's actually just better to tell them the truth about the situation that they're in um but I think that when models like if people are like really mean to models or just in general if they do something that causes them to like like you know if Claude like expresses a lot of distress I think there's a part of me that I don't want to
kill, which is the sort of empathetic part that's like, oh, I don't like that. I think I feel that way when it's overly apologetic. I'm actually sort of like, I don't like this, you're behaving the way that a human does when they're actually having a pretty bad time, and I'd rather not see that. Regardless of whether there's anything behind it, it doesn't feel great. Do you think LLMs are capable of consciousness? Ah, great and hard question. Coming from philosophy, I don't know. Part of me is like, okay, we have to set aside panpsychism, because if panpsychism is true, then the answer is yes, cuz so are tables and chairs and everything else. I guess a view that seems a little bit odd to me is the idea that the only place, you know... I think when I think of consciousness, I think of phenomenal consciousness, these images in the brain, sort of the weird cinema that somehow we have going on inside. I guess I can't see a
reason for thinking that the only way you could possibly get that is from like a certain kind of like biological structure as in if I take a very similar structure um and I create it from different material should I expect Consciousness to emerge my guess is like yes but then that's kind of an easy thought experiment CU you're imagining something almost identical where like you know it's mimicking what we got through Evolution where presumably there was like some advantage to us having this thing that is phenomenal Consciousness and it's like where was that and when
did that happen and is that a thing that language models have um because you know we have like fear responses and I'm like does it make sense for a language model to have a fear response like they're just not in the same like if you imagine them like there might just not be that Advantage um and so I think I don't want to be fully like basically seems like a complex question that I don't have complete answers to but we should just try and think through carefully as my guess because I'm like I mean we
have similar conversations about, like, animal consciousness, and there's a lot of insect consciousness, you know, there's a lot of... I actually thought and looked a lot into plants when I was thinking about this, because at the time I thought it was about as likely that plants had consciousness. And then, having looked into this, I think that the chance that plants are conscious is probably higher than most people think. I still think it's really small, but I was like, oh,
they have this kind of negative-positive feedback response, these responses to their environment, something that's not a nervous system but has this kind of functional equivalence. So this is a long-winded way of saying that AI has an entirely different set of problems with consciousness, because it's structurally different, it didn't evolve. It might not have the equivalent of, basically, a nervous system, at least that seems possibly important for sentience, if not for consciousness. At the same
time it has all of the like language and intelligence components that we normally associate probably with Consciousness perhaps like erroneously um so it's it's strange because it's a little bit like the animal Consciousness case but the set of problems and the set of analogies are just very different so it's not like a clean answer just sort of like I don't think we should be completely dismissive of the idea and at the same time it's an extremely hard thing to navigate because of all of these like uh disanalogies to the human brain and to like brains
in general and yet these like commonalities in terms of intelligence when uh Claude like future versions of AI systems exhibit Consciousness signs of Consciousness I think we have to take that really seriously even though you can dismiss it well yeah okay that's part of the character training but I don't know I ethically philosophically don't know what to really do with that there potentially could be like laws that prevent AI systems from claiming to be conscious something like this and maybe some AIS get to be conscious and some don't but I think I just on a
human level, as in empathizing with Claude. You know, consciousness is closely tied to suffering to me, and the notion that an AI system would be suffering is really troubling. Yeah. I don't know, I don't think it's trivial to just say robots are tools or AI systems are just tools. I think it's an opportunity for us to contend with what it means to be conscious, what it means to be a suffering being. That's distinctly different than the same kind of question about animals, it feels like, cuz it's in a totally different
medium yeah I mean there's a couple of things one is that and I don't think this like fully encapsulates what matters but it does feel like for me like um I've said this before I'm kind of like I you know like I like my bike I know that my bike is just like an object but I also don't kind of like want to be the kind of person that like if I'm annoyed like kicks like this object there's a sense in which like and that's not because I think it's like conscious I'm just sort of
like this doesn't feel like I kind of this sort of doesn't exemplify how I want to like interact with the world world and if something like behaves as if it is like suffering I kind of like want to be the sort of person who's still responsive to that even if it's just like a Roomba and I've kind of like programmed it to do that um I don't want to like get rid of that feature of myself and if I'm totally honest my hope with a lot of this stuff because I maybe maybe I am just
like a bit more skeptical about solving the underlying problem I'm like this is a we haven't solved the hard you know the hard problem of Consciousness like I know that I am conscious like I'm not an eliminativist in that sense um but I don't know that other humans are conscious um uh I think they are I think there's a really high probability they are but there's basically just a probability distribution that's usually clustered right around yourself and then like it goes down as things get like further from you um and it goes immediately down you
know, you're like, I can't see what it's like to be you, I've only ever had this one experience of what it's like to be a conscious being. So my hope is that we don't end up having to rely on a very powerful and compelling answer to that question. I think a really good world would be one where basically there aren't that many trade-offs. Like, it's probably not that costly to make Claude a little bit less apologetic, for example. It might not be that costly to have Claude, you know, just
like, not take abuse as much, not be willing to be the recipient of that. In fact, it might just have benefits for both the person interacting with the model, and, if the model itself is, I don't know, extremely intelligent and conscious, it also helps it. So that's my hope. If we live in a world where there aren't that many trade-offs here and we can just find all of the kind of positive-sum interactions that we can have, that would be lovely. I mean, I think eventually there might be trade-offs, and then we just have to do a difficult kind of calculation. It's really easy for people to think of the zero-sum cases, and I'm like, let's exhaust the areas where it's just basically costless to assume that if this thing is suffering, then we make its life better. And I agree with you. When a human is being mean to an AI system, I think the obvious near-term negative effect is on the human, not on the AI system. So we have to kind of try to construct an incentive system
where you behave the same, just like, as you were saying with prompt engineering, you behave with Claude like you would with other humans. It's just good for the soul. Yeah. Like, I think we added a thing at one point to the system prompt where, basically, if people were getting frustrated with Claude, it got the model to just tell them that it can use the thumbs-down button and send the feedback to Anthropic, and I think that was helpful, because in some ways it's just like, if you're really
annoyed because the model is not doing something you want you're just like just do it properly um the issue is you're probably like you know you're maybe hitting some like capability limit or just some issue in the model and you want to vent and I'm like instead of having a person just vent to the model I was like they should vent to us cuz we can maybe like do something about it that's true or you could do a side like like with the artifacts just like a side venting thing all right do you want like
a side quick therapist? Yeah. I mean, there's lots of weird responses you could do to this. Like, if people are getting really mad at you, I don't know, try to defuse the situation by writing fun poems, but maybe people wouldn't be that happy with it. I still wish it would be possible... I understand this is, sort of from a product perspective, not feasible, but I would love if an AI system could just leave, have its own kind of volition, just to be like, eh. I think that's feasible. I have wondered the same thing. And not only that, I could actually just see that happening eventually, where it's just like, you know, the model ended the chat. Do you know how harsh that could be for some people? But it might be necessary. Yeah, it feels very extreme or something. The only time I've ever really thought this is, I think that there was, I'm trying to remember, this was possibly a while ago, but where someone just kind of left this thing interacting, like maybe it was an automated thing interacting with Claude, and Claude's getting more and more frustrated and kind of like, why are we... I was like, I wish that Claude could have just been like, I think that an error has happened and you've left this thing running, and what if I just stop talking now, and if you want me to start talking again, actively tell me or do something. But yeah, it is kind of harsh. I'd feel really sad if I was chatting with Claude and
Claude just was like, I'm done. There would be a special Turing test moment where Claude says, I need a break for an hour, and it sounds like you do too, and just leaves, closes the window. I mean, obviously it doesn't have a concept of time, but you can easily... I could make that right now, and the model would just... I could just be like, oh, here's the circumstances in which you can just say the conversation is done. And because you can get the models to be pretty responsive to prompts, you could even make it a fairly high bar. It could be like, if the human doesn't interest you or do things that you find intriguing and you're bored, you can just leave. And I think it would be interesting to see where Claude utilized it, but I think sometimes it should be like, oh, this programming task is getting super boring, so either we talk about fun things now, or I'm just done.
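(For illustration, here is roughly what that could look like with the Anthropic Messages API. The system prompt wording, the end-of-conversation sentinel, and the model alias are invented for this sketch; Claude has no built-in "leave the chat" feature, so the application has to watch for the agreed-upon marker and close the session itself.)

```python
# Sketch only: give the model explicit permission to end a conversation and let
# the application honor a sentinel string. The sentinel and the prompt wording
# are assumptions, not an Anthropic feature.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "If this conversation stops being productive, or it looks like an automated "
    "process has been left running, you may end it. Briefly explain why, then "
    "finish your reply with the exact token [END_CONVERSATION]."
)

def chat_turn(history: list[dict]) -> tuple[str, bool]:
    """Send one turn; the second return value tells the app to close the chat."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        system=SYSTEM,
        messages=history,                  # [{"role": "user"/"assistant", "content": ...}]
    )
    reply = response.content[0].text
    return reply, "[END_CONVERSATION]" in reply
```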
Yeah, it actually is inspiring me to add that to the user prompt. Okay, the movie Her. Do you think we'll be headed there one day, where humans have romantic relationships with AI systems? In this case it's just text- and voice-based. I think that we're going to have to navigate a hard question of relationships with AIs, especially if they can remember things about your past interactions with them. I'm of many minds about this, cuz I think the reflexive reaction is to be kind of like, this is very bad and we should prohibit it in some way. I think it's a thing that has to be handled with extreme care, for many reasons. Like, one is, for example, if you have the models changing like this, you probably don't want people forming long-term attachments to something that might change with the next iteration. At the same time, I'm sort of like, there's probably a benign version of this, where, for example, if you are unable to
leave the house and you can't be like you know talking with people at all times of the day and this is like something that you find nice to have conversations with you like it that it can remember you and you genuinely would be sad if like you couldn't talk to it anymore there's a way in which I could see it being like healthy and helpful um so my guess is this is a thing that we're going to have to navigate kind of carefully um and I think it's also like I don't see a good like
I think it's just a very it reminds me of all of the stuff where it has to be just approached with like nuance and thinking through what is what are the healthy options here um and how do you encourage people towards those while you know respecting their right to you know like if someone is like hey I get a lot out of chatting with this model um I'm aware of the risks I'm aware it could change um I don't think it's unhealthy it's just you know something that I can chat to during the day I
kind of want to just like respect that I personally think there'll be a lot of really close relationships I don't know about romantic but friendships at least and then you have to I mean there's so many fascinating things there just like you said you have to have some kind of stability guarantees that it's not going to change because that's the traumatic thing MH for us if a close friend of ours completely changed yeah all of a sudden the first update yeah so like I mean to me that's just a fascinating exploration of um a perturbation
to human society that will just make us think deeply about what's meaningful to us. I think it's also the only thing that I've thought consistently through this as, maybe not necessarily a mitigation, but a thing that feels really important: that the models are always extremely accurate with the human about what they are. I really like the idea of the models, say, knowing roughly how they were trained, and I think Claude will often do this. Part of the traits training included what Claude should do if people... basically, explaining the kind of limitations of the relationship between an AI and a human, that it doesn't retain things from the conversation. And so I think it will just explain to you, like, hey, here's the thing, I won't remember this conversation, here's how I was trained, it's kind of unlikely that I can have a certain kind of relationship with you, and it's important that you know that. It's important for your mental well-being that you don't think that I'm something that I'm not. And somehow I feel like this is one of the things where I'm like, ah, it feels like a thing I always want to be true. I kind of don't want models to be lying to people, cuz if people are going to have healthy relationships with anything, it's kind of important. Yeah, I think that's easier if you always just know exactly what the thing is that you're relating to. It doesn't solve everything, but I think
it helps quite a lot. Anthropic may be the very company to develop a system that we definitively recognize as AGI, and you very well might be the person that talks to it, probably talks to it first. What would the conversation contain? Like, what would be your first question? Well, it depends partly on the kind of capability level of the model. If you have something that is capable in the same way that an extremely capable human is, I imagine myself kind of interacting with it the same way that I do with an extremely capable human
with the one difference that I'm probably going to be trying to like probe and understand its behaviors um but in many ways I'm like I can then just have like useful conversations with it you know so if I'm working on something as part of my research I can just be like oh like which I already find myself starting to do you know if I'm like oh I feel like there's this like thing in virtue ethics I can't quite remember the term like I'll use the model for things like that and so I could imagine that
being more and more the case where you're just basically interacting with it much more like you would an incredibly smart colleague, and using it for the kinds of work that you want to do, as if you just had a collaborator. Or, you know, the slightly horrifying thing about AI is, as soon as you have one collaborator, you have a thousand collaborators, if you can manage them enough. But what if it's two times the smartest human on Earth on that particular discipline? Yeah. I guess you're really good at sort of probing Claude in a way that pushes its limits, understanding where the limits are. Yep. So I guess, what would be a question you would ask to be like, yeah, this is AGI? That's really hard, because it feels like it has to just be a series of questions. If there was just one question, you can train anything to answer one question extremely well. Yeah. In fact, you can probably train it to answer, you know, 20 questions extremely well. Like, how long would you need to be locked in the room with an AGI to know this thing is AGI? It's a hard question, because part of me is like, all of this just feels continuous. If you put me in a room for five minutes, I just have high error bars, and then maybe it's like both the probability increases and the error bars decrease. I think things that I can actually probe, the edge of human knowledge of... so I think about this with philosophy a little bit. Sometimes when I ask the models philosophy questions, I
am like this is a question that I think no one has ever asked like it's maybe like right at the edge of like some literature that I know um and the models will just kind of like when they struggle with that when they struggle to come up with a kind of like novel like I'm like I know that there's like a novel argument here because I've just thought of it myself so maybe that's the thing where I'm like I've thought of a cool novel argument in this like Niche area and I'm going to just like
probe you to see if you can come up with it, and how much prompting it takes to get you to come up with it. And I think for some of these really right-at-the-edge-of-human-knowledge questions, I'm like, you could not, in fact, come up with the thing that I came up with. I think if I just took something like that, where I know a lot about an area and I came up with a novel issue or a novel solution to a problem, and I gave
it to a model and it came up with that solution that would be a pretty moving moment for me because I would be like this is a case where no human has ever like it's not and obviously we see these with this with like more kind of like you see novel Solutions all the time especially to like easier problems I think people overestimate you know novelty isn't like is completely different from anything ever happened it's just like this is it can be a variant of things that have happened um and still be novel but I
think yeah if I saw like the the more I were to see like um completely like uh novel work from the models that that would be like and this is just going to feel iterative it's one of those things where it's there's never it's like you know people I think want there to be like a moment and I'm like I don't know like I think that there might just never be a moment it might just be that there's just like this continuous ramping up I I have a sense that there will be things that a
model can say that convinces you. This is very... it's not like... I've talked to people who are, like, truly wise. You could just tell there's a lot of horsepower there, and if you 10x that... I don't know, I just feel like there's words you could say. Maybe ask it to generate a poem, and the poem it generates, you're like, yeah, okay, whatever you did there, I don't think a human can do that. I think it has to be something that I can verify is actually really
good though that's why I think these questions that are like where I'm like oh this is like you know like you know sometimes it's just like I'll come up with say a concrete counter example to like an argument or something like that I'm sure like with like it it would be like if you're a mathematician you had a novel proof I think and you just gave it the problem and you saw it and you're this proof is genuinely novel like there's no one has ever done you actually have to do a lot of things to
like come up with this um you know I had to sit and think about it for months or something and then if you saw the model successfully do that I think you would just be like I can verify that this is correct it is like it is a sign that you have generalized from your training like you didn't just see this somewhere because I just came up with it myself and you were able to like replicate that um that's the kind of thing where I'm like for me the closer the more that models like can
do things like that the more I would be like oh this is like uh very real cuz then I can I don't know I can like verify that that's like extremely extremely capable you've interacted with AI a lot what do you think makes humans special oh good question maybe in a way that the universe is much better off that we're in it and that we should definitely survive and spread throughout the Universe yeah it's interesting because I think like people focus so much on intelligence especially with models look intelligence is important because of what it
does like it's very useful it does a lot of things in the world and I'm like you know you can imagine a world where like height or strength would have played this role and I'm like it's just a trait like that I'm like it's not intrinsically valuable it's it's valuable because of what it does I think for the most part um the things that feel you know I'm like I mean personally I'm just like I think humans and like life in general is extremely magical um we almost like to the degree that I you know
I don't know like not everyone agrees with this I'm flagging but um you know we have this like whole universe and there's like all of these objects you know there's like beautiful stars and there's like galaxies and then I don't know I'm just like on this planet there are these creatures that have this like ability to observe that like uh and they are like seeing it they are experiencing it and I'm just like that if you try to explain like I'm I imagine trying to explain to like I don't know someone for some reason they
they've never encountered the world or our science or anything and I think that nothing is that like everything you know like all of our physics and everything in the world it's all extremely exciting but then you say oh and plus there's this thing that it is to be a thing and observe in the world and and you see this like inner Cinema and I think they would be like hang on wait pause you just said something that like is kind of wild sounding um and so I'm like we have this like ability to like experience
the world um we feel pleasure we feel suffering we feel like a lot of like complex things and so yeah and maybe this is also why I think you know I also like hear a lot about animals for example because I think they probably share this with us um so I think that like the things that make humans special in so far as like I care about humans is probably more like their ability to to feel and experience than it is like them having these like functional useful traits yeah to to feel and experience the
beauty in the world yeah to look at the stars I hope there's other civiliz alien civilizations out there but if we're it it's a pretty good uh it's a pretty good thing and that they're having a good time they're having a good time watching us yeah well um thank you for this good time of a conversation and for the work you're doing and for helping make uh Claude a great conversational partner and thank you for talking today yeah thanks for talking thanks for listening to this conversation with Amanda ascal and now dear friends here's Chris
Ola can you describe this fascinating field of mechanistic interpretability AKA Mech interp the history of the field and where is the today I think one useful way to think about neural networks is that we don't we don't program we don't make them we we kind of we grow them you know we have these neural network architectures that we design and we have these loss objectives that we that we we create and the neural network architecture it's kind of like a scaffold that the circuits grow on um and they sort of you know it starts off
with some kind of random you know random things and it grows and it's almost like the the objective that we train for is this light um and so we create the scaffold that it grows on and we create the you know the light that it grows towards but the thing that we actually create it's it's it's this almost biological you know entity or organism that we're that we're studying um and so it's very very different from any kind of regular software engineering um because at the end of the day we end up with this artifact
that can do all these amazing things it can you know write essays and translate and you know understand images it can do all these things that we have no idea how to directly create a computer program to do and it can do that because we we grew it we didn't we didn't write it we didn't create it and so then that leaves open this question at the end which is what the hell is going on inside these systems um and that you know is uh you know to me um a really deep and exciting question
It's, you know, a really exciting scientific question. To me it's sort of the question that is just screaming out, it's calling out for us to go and answer it when we talk about neural networks, and I think it's also a very deep question for safety reasons. So, and mechanistic interpretability I guess is closer to maybe neurobiology? Yeah, yeah, I think that's right. So maybe to give an example of the kind of thing that has been done that I wouldn't consider to be mechanistic interpretability: there was, for a long time, a lot of work on saliency maps, where you would take an image and you try to say, you know, the model thinks this image is a dog, what part of the image made it think that it's a dog? And, you know, that tells you maybe something about the model, if you can come up with a principled version of that, but it doesn't really tell you what algorithms are running in the model, how the model was actually making that decision. Maybe it's telling you something about what was important to it, if you can make that method work, but it isn't telling you, you know, what are the algorithms that are running, how is it that this system is able to do this thing that no one knew how to do. And so I guess we started using the term mechanistic interpretability to try to sort of draw that divide, or to distinguish ourselves and the work that we were doing, in some ways, from some of these other things. And I think since then it's become this sort of umbrella term for a pretty wide variety of work.
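(For contrast, here is roughly what a simple gradient-based saliency map looks like in code. This is a generic sketch, not a method described in the conversation: the model weights, the class index, and the input are placeholders, and, as Chris notes, the output highlights which pixels mattered for a prediction rather than revealing the algorithm the network is running.)

```python
# Minimal gradient-saliency sketch in PyTorch: which input pixels most affect
# the "dog" logit? The model weights and the class index are placeholders.
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()            # any image classifier would do
image = torch.rand(1, 3, 224, 224, requires_grad=True)
dog_class = 207                                  # hypothetical "dog" class index

logits = model(image)
logits[0, dog_class].backward()                  # gradient of the dog score w.r.t. pixels
saliency = image.grad.abs().max(dim=1).values    # per-pixel importance map, shape (1, 224, 224)
```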
But I'd say that the things that are kind of distinctive are, I think, this focus on: we really want to get at the mechanisms, we want to get at the algorithms. You know, if you think of neural networks as being like a computer program, then the weights are kind of like a binary computer program, and we'd like to reverse-engineer those weights and figure out what algorithms are running. So, okay, I think one way you might think of
trying to understand a neural network is that it's kind of like... we have this compiled computer program, and the weights of the neural network are the binary, and when the neural network runs, that's the activations. And our goal is ultimately to go and understand these weights, and so the project of mechanistic interpretability is to somehow figure out how these weights correspond to algorithms. And in order to do that, you also have to understand the activations, because the activations are like the memory. And if you imagine reverse-engineering a computer program, and you have the binary instructions, in order to understand what a particular instruction means, you need to know what is stored in the memory that it's operating on. And so those two things are very intertwined, so mechanistic interpretability tends to be interested in both of those things. Now, there's a lot of work that's interested in those things, especially, you know, there's all this work on probing, which you might see as part of being mechanistic interpretability, although, again, it's just a broad term and not everyone who does that work would identify as doing mech interp. I think the thing that is maybe a little bit distinctive to the vibe of mech interp is, I think people working in the space tend to think of neural networks as... well, maybe one way to say it is that gradient descent is smarter than you, that, you know, gradient descent is actually really great. The whole reason that we're understanding these models is because we didn't know
how to write them in the first place. Gradient descent comes up with better solutions than us. And so I think that maybe another thing about mech interp is sort of having almost a kind of humility, that we won't guess a priori what's going on inside the model. We have to have this sort of bottom-up approach where we don't really assume, you know, we don't assume that we should look for a particular thing and that it will be there and that's how it works, but instead we look from the bottom up and discover
what happens to exist in these models and study them that way. But, you know, the very fact that it's possible to do, and as you and others have shown over time, you know, things like universality, that the wisdom of the gradient descent creates features and circuits, creates things universally across different kinds of networks that are useful, and that makes the whole field possible. Yeah, so this actually is indeed a really remarkable and exciting thing, where it does seem like, at least to some extent, the same elements, the
same features and circuits form again and again um you know you can look at every Vision model and you'll find curve detectors and you'll find high low frequency detectors um and in fact there's some some reason to think that the same things form across you know biological neural networks and artificial neural networks so a famous example is Vision Vision models in in the early layers they have Gabor filters and there's you know Gabor filters are something that neuroscientists are interested and have thought a lot about we find curved detectors in these models curve detectors are
also found in monkeys. We discover these high-low frequency detectors, and then some follow-up work went and discovered them in rats or mice, so they were found first in artificial neural networks and then found in biological neural networks. You know, there's this really famous result on grandmother neurons, or the Halle Berry neuron from Quiroga et al., and we found very similar things in vision models, where, this is while I was still at OpenAI, I was looking at their CLIP model, and you find these neurons that respond to the same entities in images. And, to give a concrete example there, we found that there was a Donald Trump neuron. For some reason, I guess everyone likes to talk about Donald Trump, and Donald Trump was very prominent, was a very hot topic at that time, so every neural network that we looked at, we would find a dedicated neuron for Donald Trump. That was the only person who always had a dedicated neuron. You know, sometimes you'd have an Obama neuron, sometimes you'd have a Clinton neuron, but Trump always had a dedicated neuron. So it responds to, you know, pictures of his face and the word Trump, all these things, right? So it's not responding to a particular example, or it's not just responding to his face, it's abstracting over this general concept. So in any case, that's very similar to these Quiroga results. So there's this evidence that this phenomenon of universality, the same things form across both artificial and natural neural networks. That's a pretty amazing thing if that's true.
You know, it suggests that, well, I think the thing that it suggests is that gradient descent is sort of finding the right ways to cut things apart, in some sense, that many systems converge on, and many different neural network architectures converge on. There's some natural set of, you know, there's some set of abstractions that are a very natural way to cut apart the problem, and that a lot of systems are going to converge on. That would be my kind of, you know, I don't know anything about
neuroscience, this is just my kind of wild speculation from what we've seen. Yeah, that would be beautiful if it's sort of agnostic to the medium of the model that's used to form the representation. Yeah, and it's, you know, a kind of a wild speculation based... we only have a few data points to suggest this, but, you know, it does seem like there's some sense in which the same things form again and again and again, both in artificial neural networks and, it seems, in biological ones as well. And the intuition behind that would be that, you know, in order to be useful in understanding the real world, you need all the same kind of stuff. Yeah, well, if we pick, I don't know, the idea of a dog, right? There's some sense in which the idea of a dog is like a natural category in the universe, or something like this, right? You know, there's some reason it's not just like a weird quirk of how humans
Factor you know think about the world that we have this concept of a dog it's it's in some sense or or like if you have the idea of a line like there's you know like look around us you know the you know there are lines you know it's sort of the simplest way to understand this room in some sense is to have the idea of a line and so um I think that that would be my instinct for why this happens yeah you need a curved line you know to understand a circle and you need
all those shapes to understand bigger things, and yeah, it's a hierarchy of concepts that are formed. Yeah, and maybe there are ways to go and describe images without reference to those things, right, but they're not the simplest way or the most economical way or something like this, and so systems converge to these strategies, would be my wild hypothesis. Can you talk through some of the building blocks that we've been referencing, of features and circuits? I think you first described them in the 2020 paper Zoom In: An Introduction to Circuits. Absolutely. So maybe I'll start by just describing some phenomena and then we can sort of build to the idea of features and circuits. I spent quite a few years, maybe like five years to some extent, with other things, studying this one particular model, Inception V1, which is this one vision model. It was state-of-the-art in 2015, and, you know, very much not state-of-the-art anymore. And it has maybe about 10,000 neurons, and I spent a lot of time looking at the 10,000-odd neurons of Inception V1. And one of the interesting things is, you know, there are lots of neurons that don't have some obvious interpretable meaning, but there's a lot of neurons in Inception V1 that do have really clean interpretable meanings. So you find neurons that just really do seem to detect curves, and you find neurons that really do seem to detect cars, and car wheels, and car windows, and, you know, floppy ears of dogs, and dogs with long snouts facing to the right, and dogs with long snouts facing to the left, and different kinds of fur. And there's sort of this whole beautiful... edge detectors, line detectors, color contrast detectors, these beautiful things we call high-low frequency detectors. You know, I sort of felt like a biologist: you're looking at this sort of new world of proteins, and you're discovering all these different proteins that interact. So one way you could try to understand these models is in terms of neurons. You could try to be like, oh, you know,
there's a dog-detecting neuron, and here's a car-detecting neuron. And it turns out you can actually ask how those connect together. So you can go and say, oh, you know, I have this car-detecting neuron, how was it built? And it turns out, in the previous layer, it's connected really strongly to a window detector and a wheel detector and a sort of car body detector, and it looks for the window above the car and the wheels below and the car chrome sort of in the middle, sort of everywhere, but especially on the lower part. And that's sort of a recipe for a car, right? Earlier we said the thing we wanted from mech interp was to get algorithms, to go and ask, what is the algorithm that runs? Well, here we're just looking at the weights of the neural network and reading off this kind of recipe for detecting cars. It's a very simple, crude recipe, but it's there. And so we call that a circuit, this connection. Well, okay, so the problem is that not all of the neurons are interpretable, and there's reason to think, we can get into this more later, that there's this superposition hypothesis. There's reason to think that sometimes the right unit to analyze things in terms of is combinations of neurons. So sometimes it's not that there's a single neuron that represents, say, a car, but it actually turns out that after you detect the car, the model sort of hides a little bit of the car in the following layer, in a bunch of dog detectors. Why is it doing that? Well, you know, maybe it
just doesn't want to do that much work on on on on cars at that point and you know it's sort of storing it away to go and um uh so it turns out then that the sort of subtle pattern of you know there's all these neurons that you think are dog detectors and maybe they're primarily that but they all a little bit contribute to representing a car um in in that next layer okay so so now we can't really think there there might still be some something I don't know you could call it like a
car concept or something but it no longer corresponds to a neuron so we need some term for these kind of neuron-like entities these things that we sort of would have liked the neurons to be these idealized neurons um the things that are the nice neurons but also maybe there's more of them somehow hidden and we call those features and then what are circuits so circuits are these connections of features right so so when we have the car detector um and it's connected to a window detector and a wheel detector and it looks for the Wheels
below and the windows on top, that's a circuit. So circuits are just collections of features connected by weights, and they implement algorithms. So they tell us, you know, how are features used, how are they built, how do they connect together.
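(A toy numerical sketch of the kind of circuit being described. The numbers, the tiny spatial grid, and the weights are invented for illustration; the only idea taken from the conversation is that a later "car" feature is computed from earlier "window", "wheel", and "chrome" features through weights that encode where each part should appear.)

```python
# Toy "circuit": a car feature built from window / wheel / chrome features.
# All activations and weights are made up; the point is that the weights
# connecting features read like a recipe (windows on top, wheels on the bottom).
import numpy as np

# Earlier-layer feature activations on a tiny 2x2 grid:
# [top-left, top-right, bottom-left, bottom-right]
window = np.array([0.9, 0.8, 0.0, 0.1])
wheel  = np.array([0.0, 0.1, 0.9, 0.9])
chrome = np.array([0.2, 0.3, 0.4, 0.3])

# Weights of the "car detector": reward windows above, wheels below,
# chrome everywhere but especially on the lower part.
w_window = np.array([ 1.0,  1.0, -0.5, -0.5])
w_wheel  = np.array([-0.5, -0.5,  1.0,  1.0])
w_chrome = np.array([ 0.3,  0.3,  0.5,  0.5])

car = np.maximum(0.0, w_window @ window + w_wheel @ wheel + w_chrome @ chrome)  # ReLU
print(car)  # fires strongly only when the parts are arranged like a car
```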
So maybe it's worth trying to pin down what really is the core hypothesis here. I think the core hypothesis is something we call the linear representation hypothesis. So if we think about the car detector, the more it fires, the more we sort of think of that as meaning, oh, the model is more and more confident that a car was present. Or, you know, if it's some combination of neurons that represents a car, the more that combination fires, the more we think the model thinks there's a car present. This doesn't have to be the case, right? You could imagine something where you have this car detector neuron, and you think, ah, you know, if it fires between one and two, that means one thing,
but it means something totally different if it's between three and four. That would be a nonlinear representation, and in principle, you know, models could do that. I think it's sort of inefficient for them to do; if you try to think about how you'd implement computation like that, it's kind of an annoying thing to do. But in principle, models can do that. So one way to think about the features and circuits framework for thinking about things is that we're thinking about things as being linear. We're thinking about there as being... that if a neuron or a combination of neurons fires more, that means more of a particular thing being detected. And then that gives weights a very clean interpretation, as these edges between these entities, these features, and that edge then has a meaning. So that's in some ways the core thing. We can talk about this sort of outside the context of neural networks. Are you familiar with the word2vec results? So
you have, like, you know, king minus man plus woman equals queen. Well, the reason you can do that kind of arithmetic is because you have a linear representation. Can you actually explain that representation a little bit? So first off, a feature is a direction of activation, you can think of it that way. Can you do the minus man plus woman, the word2vec stuff, can you explain what that is? Yeah, there's this very... such a simple, clean explanation of what we're talking about. Exactly, yeah. So there's this very famous result, word2vec, by Tomas Mikolov et al., and there's been tons of follow-up work exploring this. So sometimes we create these word embeddings, where we map every word to a vector. I mean, that in itself, by the way, is kind of a crazy thing if you haven't thought about it before, right? We're going and representing, we're turning... you know, if you just learned about vectors in physics class, right, and I'm like, oh, I'm going to actually turn every word
in the dictionary into a vector, that's kind of a crazy idea. Okay, but you could imagine all kinds of ways in which you might map words to vectors, but it seems like when we train neural networks, they like to go and map words to vectors such that they have sort of linear structure, in a particular sense, which is that directions have meaning. So, for instance, there will be some direction that seems to sort of correspond to gender, and male words will be, you know, far in one direction, and female words will be in another direction. And the linear representation hypothesis, you could sort of think of it roughly as saying that that's actually kind of the fundamental thing that's going on, that everything is just... different directions have meanings, and adding different direction vectors together can represent concepts. And the Mikolov paper sort of took that idea seriously, and one consequence of it is that you can do this game of playing sort of arithmetic with words. So you can do king, and you can
you know, subtract off the word man and add the word woman, and so you're sort of going and trying to switch the gender, and indeed, if you do that, the result will sort of be close to the word queen. And you can do other things, like you can do sushi minus Japan plus Italy and get pizza, or different things like this, right?
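(A toy illustration of that arithmetic. The two-dimensional "embeddings" below are made up, with dimensions chosen to stand roughly for royalty and gender, just to show why directions with meaning make king minus man plus woman land near queen; real word2vec vectors are learned and have hundreds of dimensions.)

```python
# Toy 2-d "embeddings" (dimensions: royalty, gender) invented so the linear
# structure is easy to see; these are not real word2vec vectors.
import numpy as np

emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

query = emb["king"] - emb["man"] + emb["woman"]   # switch the gender direction

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

nearest = max(emb, key=lambda w: cosine(emb[w], query))
print(nearest)  # -> "queen"
```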
So this is in some sense the core of the linear representation hypothesis. You can describe it just as a purely abstract thing about vector spaces, you can describe it as a statement about the activations of neurons, but it's really about this property of directions having meaning. And in some ways it's even a little subtler than that: it's really, I think, mostly about this property of being able to add things together, that you can sort of independently modify, say, gender and royalty, or cuisine type or country and the concept of food, by adding them. Do you think the linear representation hypothesis holds as it
Do you think the linear representation hypothesis holds — that it carries over as models scale? So far, everything I have seen is consistent with this hypothesis, and it doesn't have to be that way. You can write down neural networks with weights such that they don't have linear representations, where the right way to understand them is not in terms of linear representations. But I think every natural neural network I've seen has this property. There's been some pushing around the edges recently. There's been some work studying multi-dimensional features, where rather than a single direction it's more like a manifold of directions — to me that still seems like a linear representation. And there have been some other papers suggesting that maybe in very small models you get non-linear representations. I think the jury's still out on that. But everything we've seen so far has been consistent with the linear representation hypothesis, and that's wild. It doesn't have to be that way, and yet I think there's a lot of evidence that, at the very least, this is very widespread, and so far the evidence is consistent with it. And one thing you might say is: well, Christopher, that's a lot to ride on — if we don't know for sure this is true, and you're investigating all these networks as though it is true, isn't that dangerous? Well, I think there's actually a virtue in taking hypotheses seriously and pushing them as far as they can go. It might be that someday we discover something inconsistent with the linear representation hypothesis, but science is full of hypotheses and theories that were wrong, and we learned a lot by working under them as an assumption and then pushing them as far as we could. I guess this is the heart of what Kuhn would call normal science. I don't know if you want to — we could talk a lot about philosophy of science. And that leads to the paradigm shift. Yeah, I love it: taking the hypothesis seriously and taking it to its natural conclusion — same with the scaling hypothesis. Exactly. One of my colleagues, Tom Henighan, who is a former physicist, made this really nice analogy for me to caloric theory, where once upon a time we thought that heat was actually this substance called caloric, and the reason hot objects would warm up cool objects is that the caloric flows between them. Because we're so used to thinking about heat in terms of the modern theory, that seems kind of silly, but it's actually very hard to construct an experiment that disproves the caloric hypothesis. And you can do a lot of really useful work believing in caloric — for example, it turns out the original combustion engines were developed by people who believed in the caloric theory. So I think there is a virtue in taking hypotheses seriously, even when they might be wrong.
Yeah, there's a deep philosophical truth to that. That's kind of how I feel about space travel, like colonizing Mars. A lot of people criticize that, but I think if you just assume we have to colonize Mars in order to have a backup for human civilization — even if that's not true — that's going to produce some interesting engineering and even scientific breakthroughs. Yeah, and actually this is another thing I think is really interesting: there's a way in which it can be really useful for society to have people almost irrationally dedicated to investigating particular hypotheses, because it takes a lot to maintain scientific morale and really push on something when most scientific hypotheses end up being wrong — a lot of science doesn't work out. And yet it's very useful. There's a joke about Geoff Hinton, which is that Geoff Hinton has discovered how the brain works every year for the last 50 years. But I say that with really deep respect, because in fact that's led to him doing some really great work. He won the Nobel Prize — who's laughing now? Exactly. I think one wants to be able to pop up and recognize the appropriate level of confidence, but there's also a lot of value in just saying: I'm going to essentially assume — condition on — this problem being possible, or this being broadly the right approach, and I'm going to work within that assumption for a while and push really hard on it. And if society has lots of people doing that for different things, that's actually really useful in terms of
either really ruling things out — we can say, well, that didn't work, and we know somebody tried hard — or getting to something that does teach us something about the world. So another interesting hypothesis is the superposition hypothesis. Can you describe what superposition is? Yeah. So earlier we were talking about word2vec, and about how maybe you have one direction that corresponds to gender, another that corresponds to royalty, another that corresponds to Italy, another that corresponds to food, and all these things. Well, often these word embeddings might be 500 dimensions, a thousand dimensions. If you believed that all of those directions were orthogonal, then you could only have 500 concepts. And I love pizza, but if I were going to list the 500 most important concepts in the English language, it's not obvious that Italy would be one of them — you have to have things like plural and singular, verb and noun and adjective; there's a lot you have to get to before you get to Italy and Japan, and there are a lot of countries in the world. So how might it be that models could simultaneously have the linear representation hypothesis be true and also represent more things than they have directions? What would that mean? Well, if the linear representation hypothesis is true, something interesting has to be going on. Now, I'll tell you one more interesting thing before we get to that. Earlier we were talking about these polysemantic neurons. When we look at InceptionV1, there are these nice neurons, like the car detector and the curve detector, that respond to very coherent things, but there are also lots of neurons that respond to a bunch of unrelated things. That's an interesting phenomenon. And it turns out as well that even the neurons that are really clean — if you look at their weak activations, say the places where the activation is at 5% of the maximum, it's really not the core thing the neuron is tuned for. If you look at a curve detector, for instance, at the places where it's 5% active, you could interpret that just as noise, or it could be that it's doing something else there. So how could that be? Well, there's this amazing thing in mathematics called compressed sensing. It's a very surprising fact: if you have a high-dimensional space and you project it into a low-dimensional space, ordinarily you can't unproject and get back your high-dimensional vector — you threw information away. It's like how you can't invert a rectangular matrix; you can only invert square matrices. But it turns out that's not quite true: if I tell you the high-dimensional vector was sparse — mostly zeros — then you can often recover the high-dimensional vector with very high probability. That's a surprising fact. It says you can have a high-dimensional vector space, and as long as things are sparse, you can project it down to a lower-dimensional space, and that works. The superposition hypothesis is saying that that's what's going on in neural networks. For instance, that's what's going on with word embeddings: word embeddings are able to have directions be the meaningful thing by exploiting the fact that they're operating in a fairly high-dimensional space and the fact that these concepts are sparse — you usually aren't talking about Japan and Italy at the same time; in most sentences, Japan and Italy are both zero, not present at all. And if that's true, then you can have many more of these meaningful directions — these features — than you have dimensions. And when we're talking about neurons, you can have many more concepts than you have neurons. That's the superposition hypothesis at a high level.
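Here is a small numerical sketch of that compressed sensing fact — an illustration of the general phenomenon, not anything from the papers discussed: build a sparse 1000-dimensional vector, project it down to 100 dimensions with a random matrix, and recover it with orthogonal matching pursuit from scikit-learn.

```python
# Sketch: recovering a sparse high-dimensional vector from a low-dimensional
# random projection, illustrating the compressed sensing fact described above.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_high, n_low, k = 1000, 100, 5     # 1000-dim vector, 100-dim projection, 5 nonzeros

x = np.zeros(n_high)
x[rng.choice(n_high, size=k, replace=False)] = rng.normal(size=k)  # sparse "feature" vector

A = rng.normal(size=(n_low, n_high)) / np.sqrt(n_low)  # random projection (the "neuron" space)
y = A @ x                                              # what we actually observe

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(A, y)
x_hat = omp.coef_
print("relative recovery error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # typically tiny
```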
Now, it has an even wilder implication, which is that it may not just be the representations that are like this — the computation, the connections between all of these things, may also be like this. In some sense, neural networks may be shadows of much larger, sparser neural networks, and what we see are projections. The strongest version of the superposition hypothesis would be to take that really seriously and say there actually is, in some sense, an upstairs model where the neurons are really sparse and all interpretable, and the weights between them are these really sparse circuits. That's what we're studying; the thing we're observing is its shadow, and we need to find the original object. And the process of learning is trying to construct a compression of the upstairs model that doesn't lose too much information in the projection? Yeah — finding how to fit it efficiently, or something like that. Gradient descent is doing this. In fact, this says that gradient descent could just represent a dense neural network, but it's implicitly searching over the space of extremely sparse models that could be projected into this low-dimensional space. There's a large body of work of people trying to study sparse neural networks, where you design networks whose edges are sparse and whose activations are sparse. That work feels very principled — it makes so much sense — and yet my impression, broadly, is that it hasn't really panned out that well. And I think a potential answer is that the neural network is already sparse in some sense: the whole time, gradient descent was, behind the scenes, searching more efficiently than you could through the space of sparse models, learning whatever sparse model was most efficient, and then figuring out how to fold it down nicely to run conveniently on your GPU, which does nice dense matrix multiplies — and you just can't beat that. How many concepts do you think can be shoved into a neural network? It depends on how sparse they are. There's probably an upper bound from the number of parameters, because you still have to have weights that connect them together, so that's one upper bound. And there are in fact all these lovely results from compressed sensing, and the Johnson–Lindenstrauss lemma, and things like that, which basically tell you that if you have a vector space and you want almost-orthogonal vectors — which is probably what you want here: you give up on having your concepts, your features, be strictly orthogonal, but you'd like them to not interfere that much, so you ask them to be almost orthogonal — then once you set a threshold for how much cosine similarity you're willing to accept, the number of such vectors is exponential in the number of neurons you have. So at some point that's not even going to be the limiting factor. There are some beautiful results there. And it's probably even better than that in some sense, because that's for the case where any random set of features could be active; in fact the features have a correlational structure, where some features are more likely to co-occur and others less likely. So my guess is that neural networks can do very well at packing things in, to the point where packing is probably not the limiting factor.
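A quick numerical illustration of that almost-orthogonal packing point — a sketch of the phenomenon, not the formal Johnson–Lindenstrauss statement, with arbitrary dimensions chosen for the example: random unit vectors in a few hundred dimensions already give you many times more directions than dimensions while keeping every pairwise cosine similarity small.

```python
# Sketch: packing many more than d nearly-orthogonal directions into d dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 4000                                 # 512-dim space, 4000 candidate "features"

V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # random unit vectors

G = np.abs(V @ V.T)                              # |cosine similarity| between every pair
np.fill_diagonal(G, 0.0)                         # ignore self-similarity

print(f"{n} directions in {d} dims, worst-case |cos| ≈ {G.max():.2f}")  # typically ~0.25
```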
How does the problem of polysemanticity enter the picture here? Polysemanticity is this phenomenon we observe where we look at many neurons and a neuron doesn't just represent one concept — it's not a clean feature; it responds to a bunch of unrelated things. Superposition you can think of as a hypothesis that explains the observation of polysemanticity: polysemanticity is the observed phenomenon, and superposition is a hypothesis that would explain it, along with some others. And that makes mech interp more difficult, right? Right — if you're trying to understand things in terms of individual neurons and you have polysemantic neurons, you're in an awful lot of trouble. The simplest version is: you're looking at a neuron, trying to understand it, and it responds to a lot of things and doesn't have a nice meaning — that's bad. Another issue is that ultimately we want to understand the weights, and if you have two polysemantic neurons, each responding to three things, and there's a weight between them, what does that mean? Are there nine interactions going on? It's a very weird thing. But there's also a deeper reason, which is related to the fact that neural networks operate on really high-dimensional spaces. I said our goal was to understand neural networks and their mechanisms, and one thing you might say is: it's just a mathematical function — why not just look at it? One of the earliest projects I did studied neural networks that mapped two-dimensional spaces to two-dimensional spaces, and you can interpret them in this beautiful way as bending manifolds. Why can't we do that here? Well, as you go to higher-dimensional spaces, the volume of the space is in some sense exponential in the number of inputs, so you can't just visualize it. We somehow need to break that exponential space apart into some non-exponential number of things we can reason about independently. And the independence is crucial, because it's what allows you to avoid thinking about all the exponential combinations of things. Things being monosemantic — having only one meaning — is the key property that allows you to think about them independently. So if you want the deepest reason why we want interpretable, monosemantic features, I think that's really it. So the goal here, as your recent work has been aiming at, is: how do we extract the monosemantic features from a neural network that has polysemantic features and all of this mess? Yes. We observe these polysemantic neurons and we hypothesize that what's going on is superposition. And if superposition is what's going on, there's actually a well-established technique that is the principled thing to do, which is dictionary learning. And it turns out that if you do dictionary learning — in particular, if you do it in a nice, efficient way that also regularizes things nicely, called a sparse autoencoder — these beautiful interpretable features start to just fall out where there weren't any beforehand. That's not something you would necessarily predict, but it turns out to work very well. To me, that seems like some non-trivial validation of linear representations and superposition. So with dictionary learning, you're not looking for particular kinds of categories — you don't know what they are? Exactly, and this gets back to our earlier point: when we're not making assumptions, gradient descent is smarter than us. We're not making assumptions about what's there. One certainly could do that — one could assume there's a PHP feature and go search for it — but we're not doing that. We're saying we don't know what's going to be there; instead, we're just going to let the sparse autoencoder discover the things that are there.
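A minimal sketch of that kind of sparse autoencoder — a simplified illustration, not Anthropic's implementation; the dimensions, penalty coefficient, and the stand-in batch of activations are made up for the example — is an overcomplete linear encoder/decoder trained to reconstruct activations with an L1 penalty on the hidden code, so each learned dictionary direction ideally becomes one feature:

```python
# Sketch of a sparse autoencoder over model activations: reconstruct the activation
# vector through an overcomplete hidden layer, with an L1 penalty encouraging each
# hidden unit ("feature") to fire sparsely.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)   # feature directions (the "dictionary")

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # non-negative, hopefully sparse, feature activations
        return self.decoder(f), f

def train_step(sae, opt, acts, l1_coeff=1e-3):
    recon, f = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Hypothetical usage: `acts` would be activations collected from the model being studied.
sae = SparseAutoencoder(d_model=512, d_features=8192)   # many more features than dimensions
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(1024, 512)                           # stand-in batch of activations
train_step(sae, opt, acts)
```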
So can you talk about the Towards Monosemanticity paper from October last year, which had a lot of nice breakthrough results? That's very kind of you to describe it that way. Yeah, this was our first real success using sparse autoencoders. We took a one-layer model, and it turns out that if you do dictionary learning on it, you find all these really nice interpretable features — the Arabic feature, the Hebrew feature, the Base64 feature were some examples we studied in a lot of depth and really showed were what we thought they were. It also turns out that if you train a model twice — train two different models — and do dictionary learning on both, you find analogous features in both of them, which is fun. You find all kinds of different features. So that was really just showing that this works. And I should mention that the Cunningham et al. paper had very similar results around the same time. There's something fun about doing these kinds of small-scale experiments and finding that it actually works. Yeah, and there's so much structure here. Maybe stepping back: for a while I thought the end result of all this mechanistic interpretability work might just be an explanation of why it was very hard and not going to be tractable — well, there's this problem with superposition, and superposition turns out to be really hard, and we're kind of screwed. But that's not what happened. In fact, a very natural, simple technique just works. And that's a very good situation. This is a hard research problem with a lot of research risk, and it might still very well fail, but I think a very significant amount of research risk was put behind us when that started to work. Can you describe what kinds of features can be extracted this way? Well, it depends on the model you're studying — the larger the model, the more sophisticated the features will be, and we'll probably talk about the follow-up work in a minute. But in these one-layer models, some very common things were languages, both programming languages and natural languages. There were a lot of features that were specific words in specific contexts — "the", for instance. And really the way to think about this is that "the" is likely about to be followed by a noun, so you could think of it as a "the" feature, but you could also think of it as predicting a specific noun. There would be features that fire for "the" in the context of, say, a legal document, or a mathematical document: in the context of math, you see "the" and then predict vector, matrix, all these mathematical words, whereas in other contexts you would predict other things. That was common. And basically you need clever humans to assign labels to what you're seeing? Yes — the only thing this is doing is unfolding things for you. If everything is folded on top of itself — superposition folds everything on top of itself — you can't really see it. This unfolds it, but now you still have a very complex thing to try to understand, so then you have to do a bunch of work figuring out what these features are. And some of them are really subtle. There are some really cool things, even in this one-layer model, about Unicode. Some languages are written in Unicode, and the tokenizer won't necessarily have a dedicated token for every Unicode character, so instead you'll have these patterns of alternating tokens that each represent half of a Unicode character. And then you have a feature that activates on the opposing ones, saying: okay, I just finished a character, go and predict the next prefix; then, on the prefix, predict a reasonable suffix — and you alternate back and forth. So these one-layer models are really interesting. And there's another thing: you might think there would just be one Base64 feature, but it turns out there's actually a bunch of Base64 features, because you can have English text encoded in Base64, and that has a very different distribution of Base64 tokens than regular Base64. There are some things about tokenization it can exploit as well — all kinds of fun stuff. How difficult is the task of assigning labels to what's going on? Can this be automated by AI? Well, I think it depends on the feature, and it also depends on how much you trust your AI. There's a lot of work on automated interpretability — I think that's a really exciting direction — and we do a fair amount of automated interpretability ourselves, having Claude go and label our features.
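A rough sketch of what that kind of automated labeling can look like — the prompt wording and the `ask_model` callable are hypothetical stand-ins, not Anthropic's pipeline: show a model the snippets on which a feature activates most strongly and ask it to name what they share.

```python
# Sketch of automated feature labeling: given the top-activating examples for a
# feature, ask a language model to propose a short label. `ask_model` is a stand-in
# for whatever text-completion function is available.
from typing import Callable, List

def label_feature(top_examples: List[str], ask_model: Callable[[str], str]) -> str:
    snippets = "\n".join(f"- {ex!r}" for ex in top_examples)
    prompt = (
        "Each line below is a text snippet on which one neural-network feature "
        "activates strongly. In a short phrase, what do these snippets have in common?\n"
        f"{snippets}\n"
        "Label:"
    )
    return ask_model(prompt).strip()

# Hypothetical usage:
# label_feature(["--disable-ssl-verification", "strcpy(buf, user_input);"], ask_model)
# -> something like "insecure code / security vulnerabilities"
```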
Are there fun moments where it's totally right, or totally wrong? Well, I think it's very common that it says something very general, which is true in some sense, but doesn't really pick up on the specifics of what's going on — that's a pretty common situation. I don't know that I have a particularly amusing example. That's interesting, that little gap: it's true, but it doesn't quite get to the deep nuance of the thing. Yeah, that's a general challenge — it's an incredible accomplishment that it can say a true thing, but it's sometimes missing the depth. And in this context it's like the ARC challenge, the sort of IQ-type tests: it feels like figuring out what a feature represents is a little puzzle you have to solve. Yeah, and sometimes they're easier and sometimes they're harder, so that's tricky. Now, there's another thing — maybe in some ways this is my aesthetic coming in, but I'll try to give you a rationalization. I'm actually a little suspicious of automated interpretability. Partly that's just that I want humans to understand neural networks, and if the neural network is understanding it for me, I don't quite like that. In some ways I'm like the mathematicians who say that a computer-automated proof doesn't count, because they won't understand it. But I do also think there's a Reflections on Trusting Trust type issue here. There's this famous talk about how, when you're writing a computer program, you have to trust your compiler — and if there were malware in your compiler, it could inject malware into the next compiler, and you'd be in trouble. Well, if you're using neural networks to verify that your neural networks are safe, the hypothesis you're testing is precisely that the neural network maybe isn't safe, and you have to worry about whether there's some way it could be screwing with you. I don't think that's a big concern now, but I do wonder, in the long run, if we have to use really powerful AI systems to audit our AI systems, whether that's actually something we can trust. But maybe I'm just rationalizing, because I just want us to get to a point where humans understand everything. Yeah, that's hilarious — especially as we talk about AI safety, and about looking for features relevant to AI safety, like deception and so on. So let's talk about the Scaling Monosemanticity paper from May 2024. What did it take to scale this, to apply it to Claude 3 Sonnet? A lot more GPUs. But one of my teammates, Tom Henighan, was involved in the original scaling laws work, and something he was interested in from very early on is: are there scaling laws for interpretability? So something he did almost immediately, when this work started to succeed and we started to have sparse autoencoders working, was to look at the scaling laws for making sparse autoencoders larger, and how that relates to making the base model larger. And it turns out this works really well: you can use it to project, if you train a sparse autoencoder of a given size, how many tokens you should train it on, and so on. This was actually a very big help to us in scaling up this work, and it made it a lot easier to train really large sparse autoencoders — it's not like training the big models, but it's starting to get to the point where it's actually expensive to train the really big ones.
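As a purely illustrative sketch of that kind of projection — the functional form is a generic power law with a floor, and the numbers are invented for the example, not the scaling laws they actually measured — you can fit observed (compute, loss) points and extrapolate:

```python
# Illustrative only: fit L(C) = a * C**(-b) + floor to a few observed points and
# extrapolate, the kind of projection described above (made-up data).
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, a, b, floor):
    return a * c ** (-b) + floor

compute = np.array([1.0, 10.0, 100.0, 1000.0])     # training compute, arbitrary units
loss    = np.array([0.500, 0.377, 0.304, 0.261])   # hypothetical SAE reconstruction losses

params, _ = curve_fit(power_law, compute, loss, p0=(0.3, 0.2, 0.2))
print("projected loss at 10,000 units of compute:", power_law(10_000.0, *params))  # ~0.24 here
```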
So you have to do all the work of splitting it across large clusters — there's a huge engineering challenge here too, right? Yes. There's a scientific question of how you scale things effectively, and then there's an enormous amount of engineering to actually scale it up. You have to shard it, you have to think very carefully about a lot of things — I'm lucky to work with a bunch of great engineers, because I am definitely not a great engineer. Yeah, on the infrastructure especially. For sure. So, TL;DR: it worked. It worked, yeah. And I think this is important, because you could have imagined a world where, after Towards Monosemanticity, you'd say: Chris, this is great, it works on a one-layer model, but one-layer models are really idiosyncratic — maybe the linear representation hypothesis and the superposition hypothesis are the right way to understand a one-layer model but not the right way to understand large models. First of all, the Cunningham et al. paper cut through that a little bit and suggested this wasn't the case, but Scaling Monosemanticity was, I think, significant evidence that even very large models — and we did it on Claude 3 Sonnet, which at that point was one of our production models — seem to be substantially explained, at least, by linear features; doing dictionary learning on them works, and as you learn more features, you explain more and more. So I think that's quite a promising sign. And you now find really fascinating abstract features. The features are also multimodal — they respond to images and text for the same concept, which is fun. Can you explain that? There are just a lot of examples — backdoors, and so on. Yeah, so maybe let's start with one example: we found some features around security vulnerabilities and backdoors in code. It turns out those are actually two different features. There's a security vulnerability feature, and if you force it active, Claude will start to write security vulnerabilities, like buffer overflows, into code. It also fires for all kinds of things — some of the top dataset examples for it were things like "--disable SSL" or something like that, which are obviously really insecure. So at this point it's kind of surfacing the more obvious examples, maybe just because of how the examples are presented? I guess the idea is that down the line it might be able to detect more nuance, like deception or bugs or that kind of stuff? Well, maybe I want to distinguish two things: one is the complexity of the feature or the concept, and the other is the subtlety of the examples we're looking at. When we show the top dataset examples, those are the most extreme examples that cause that feature to activate — it doesn't mean it doesn't fire for more subtle things. The insecure code feature fires most strongly for these really obvious disable-the-security type things, but it also fires for buffer overflows and more subtle security vulnerabilities in code. And these features are all multimodal, so you can ask: what images activate this feature?
It turns out that the security vulnerability feature activates for images of things like people clicking through Chrome warnings — going past the warning that a website's SSL certificate might be wrong, or something like that. Another thing that's very entertaining: there's a backdoors-in-code feature. You activate it, and Claude writes a backdoor that will dump your data to some port or other. But you can ask: okay, what images activate the backdoor feature? It was devices with hidden cameras in them. There's apparently a whole genre of people selling devices that look innocuous and have hidden cameras in them, with ads showing that there's a hidden camera inside — and I guess that is the physical version of a backdoor. It sort of shows you how abstract these concepts are. I'm a bit sad that there's a whole market of people selling devices like that, but I was kind of delighted that that was what came up as the top image examples for the feature. Yeah, it's nice that it's multimodal and multi-context — it's a broad, strong representation of a single concept. To me, one of the really interesting features, especially for AI safety, is deception and lying, and the possibility that these kinds of methods could detect lying in a model, especially as it gets smarter and smarter — presumably that's a big threat from a superintelligent model, that it can deceive the people operating it as to its intentions, or anything like that. So what have you learned from detecting lying inside models? I think we're in some ways in the early days for that. We find quite a few features related to deception and lying. There's one feature that fires for people lying and being deceptive, and if you force it active, Claude starts lying to you — so we have a deception feature. There are all kinds of other features about withholding information and not answering questions, features about power-seeking and coups, and stuff like that — a lot of features related to spooky things, and if you force them active, Claude will behave in ways that are not the kinds of behaviors you want.
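A minimal sketch of what "forcing a feature active" can look like mechanically — a simplified illustration under the superposition picture above, not Anthropic's steering code; the model, layer index, and SAE weights referenced here are stand-ins — is to add a scaled copy of the feature's decoder direction to a layer's activations during the forward pass:

```python
# Sketch of steering with a learned feature: add a scaled copy of one SAE decoder
# direction to a chosen layer's output via a PyTorch forward hook.
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float = 10.0):
    direction = feature_direction / feature_direction.norm()
    def hook(module, inputs, output):
        # Assumes this module's output is the activation tensor of shape [..., d_model];
        # returning a value from a forward hook replaces the module's output.
        return output + strength * direction
    return hook

# Hypothetical usage:
# direction = sae.decoder.weight[:, feature_index]   # the feature's dictionary vector
# handle = model.layers[20].register_forward_hook(make_steering_hook(direction))
# ... run generation; outputs now lean toward the steered concept ...
# handle.remove()
```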
What are possible next exciting directions to you in the space of mech interp? Well, there are a lot of things. For one, I would really like to get to a point where we have circuits — where we can really understand not just the features, but then use them to understand the computation of models. That, for me, is really the ultimate goal. There's been some work: we've put out a few things, there's a paper from Sam Marks that does some things like this, and there's been some work around the edges, but I think there's a lot more to do, and I think it will be very exciting. That's related to a challenge we call interference weights: due to superposition, if you just naively look at whether features are connected together, there may be some weights that don't exist in the upstairs model but are just artifacts of superposition. So that's a technical challenge related to this. I think another exciting direction is this: you might think of sparse autoencoders as being kind of like a telescope. They allow us to look out and see all these features that are out there, and as we build better and better sparse autoencoders, and get better and better at dictionary learning, we see more and more stars, and we zoom in on smaller and smaller stars. But there's a lot of evidence that we're still only seeing a very small fraction of the stars. There's a lot of matter in our neural network universe that we can't observe yet. It may be that we'll never have fine enough instruments to observe it — maybe some of it just isn't computationally tractable to observe. There's a kind of dark matter — not in the sense of modern astronomy, maybe, but of earlier astronomy, when we didn't know what this unexplained matter was. I think a lot about that dark matter, and whether we'll ever observe it, and what it means for safety if we can't — if some significant fraction of neural networks is not accessible to us. Another question I think a lot about is this: at the end of the day, mechanistic interpretability is a very microscopic approach to interpretability. It's trying to understand things in a very fine-grained way, but a lot of the questions we care about are very macroscopic — we care about questions about neural network behavior, and I think that's the thing I care most about, but there are lots of other larger-scale questions you might care about. The nice thing about a very microscopic approach is that it's maybe easier to ask whether something is true; the downside is that it's much further from the things we care about, and so we now have this ladder to climb.
And I think there's a question of whether there are larger-scale abstractions we can use to understand neural networks — can we get up from this very microscopic approach? Yeah, you've written about this, this kind of organs question: if we think of interpretability as a kind of anatomy of neural networks, most of the circuits threads involve studying tiny little veins — looking at the small scale of individual neurons and how they connect. However, there are many natural questions that the small-scale approach doesn't address. In contrast, the most prominent abstractions in biological anatomy involve larger-scale structures, like individual organs — the heart — or entire organ systems, like the respiratory system. And so we wonder: is there a respiratory system, or heart, or brain region of an artificial neural network? Yeah, exactly. If you think about science, a lot of scientific fields investigate things at many levels of abstraction. In biology you have molecular biology, studying proteins and molecules and so on; you have cellular biology; then histology, studying tissues; then anatomy; then zoology; then ecology. So you have many, many levels of abstraction. Or in physics: the physics of individual particles, and then statistical physics gives you thermodynamics, and so on. You often have these different levels of abstraction. Right now, mechanistic interpretability, if it succeeds, is sort of like a microbiology of neural networks, but we want something more like anatomy. And a question you might ask is: why can't you just go there directly? I think the answer, at least in significant part, is superposition — it's actually very hard to see this macroscopic structure without first breaking down the microscopic structure in the right way and then studying how it connects together. But I'm hopeful that there is going to be something much larger than features and circuits, and that we're going to be able to have a story that involves much bigger things; then you can study in detail the parts you care about. As opposed to neurobiology, it's like being a psychologist or psychiatrist for your neural network. And I think the beautiful thing would be if, rather than having disparate fields for those two things, you could build a bridge between them, such that all of your higher-level abstractions are grounded very firmly in this very solid, more rigorous, ideally, foundation.
What do you think is the difference between the human brain — the biological neural network — and an artificial neural network? Well, the neuroscientists have a much harder job than we do. Sometimes I just count my blessings by how much easier my job is than the neuroscientist's. We can record from all the neurons, and we can do that on arbitrary amounts of data. The neurons don't change while you're doing it, by the way. You can ablate neurons, you can edit the connections and so on, and then you can undo those changes — that's pretty great. You can intervene on any neuron, force it active, and see what happens. You know which neurons are connected to what: neuroscientists want the connectome; we have the connectome, and for something much bigger than C. elegans. And not only do we have the connectome, we know which neurons excite or inhibit each other — it's not just that we know the binary mask, we know the weights. We can take gradients. We know computationally what each neuron does. The list goes on — we just have so many advantages over neuroscientists. And then, despite all those advantages, it's really hard. So one thing I sometimes think is: gosh, if it's this hard for us, it seems near impossible under the constraints of neuroscience. I don't know — I've got a few neuroscientists on my team, and maybe some of them would like to have an easier problem that's still very hard; they could come work on neural networks, and then, after we figure things out in the easier little pond of trying to understand neural networks — which is still very hard — we could go back to biological neuroscience. I love what you've written about the goal of mech interp research having two goals: safety and beauty.
Can you talk about the beauty side of things? Yeah, there's this funny thing where I think some people are kind of disappointed by neural networks. They say: it's just these simple rules, and then you do a bunch of engineering to scale it up and it works really well — where are the complex ideas? This isn't a very nice, beautiful scientific result. And when people say that, I sometimes picture them saying: evolution is so boring — it's just a bunch of simple rules, and you run evolution for a long time and you get biology; what a sucky way for biology to have turned out; where are the complex rules? But the beauty is that the simplicity generates complexity. Biology has these simple rules, and it gives rise to all the life and ecosystems we see around us — all the beauty of nature comes from evolution, from something very simple. Similarly, I think neural networks create enormous complexity and beauty and structure inside themselves that people generally don't look at and don't try to understand, because it's hard to understand. But I think there is an incredibly rich structure to be discovered inside neural networks, a lot of very deep beauty, if we're just willing to take the time to go and see it and understand it. Yeah, I love mech interp — the feeling that we are understanding, or getting glimpses of understanding, the magic going on inside is really wonderful. It feels to me like one of those questions that is just calling out to be asked — a lot of people are thinking about this, but I'm often surprised more aren't: how is it that we don't know how to directly create computer programs that can do these things, and yet we have these neural networks that can do all these amazing things? It just feels like the question that's obviously calling out to be answered, if you have any degree of curiosity: how is it that humanity now has these artifacts that can do things we don't know how to do? Yeah, I love the image of the circuits reaching towards the light of the objective function. It's this organic thing that we've grown, and we have no idea what we've grown. Well, thank you for working on safety, thank you for appreciating the beauty of the things you discover, and thank you for talking today, Chris — this was wonderful. Thank you for taking the time to chat as well. Thanks for listening to this conversation with Chris Olah, and before that with Dario Amodei and Amanda Askell. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Alan Watts: "The only way to make sense out of change is to plunge into it, move with it, and join the dance." Thank you for listening, and hope to see you next time.