CS50P - Lecture 7 - Regular Expressions

CS50

This is CS50P, CS50's Introduction to Programming with Python. Enroll for free at https://cs50.edx.o...

Video Transcript:

[Music] All right, this is CS50's Introduction to Programming with Python. My name is David Malan, and this is our week on regular expressions. A regular expression, otherwise known as a regex, is really just a pattern, and indeed it's quite common in programming to want to use patterns to match on some kind of data, often user input. For instance, if the user types in an email address, whether to your program or a website or an app on your phone, you might ideally want to be able to validate that they did indeed type in

an email address and not something completely different. So using regular expressions, we're going to have the newfound capability to define patterns in our code to compare them against data that we're receiving from someone else, whether it's just to validate it or, heck, even if we want to clean up a whole lot of data that itself might be messy because it too came from us humans. Before we use these regular expressions, though, let me propose that we solve a few problems using just some simpler syntax and see what kind of limitations we run up against. Let

me propose that I open up VS Code here, and let me create a file called validate.py, the goal at hand being to validate, how about just that, a user's email address. They've come to your app, they've come to your website, they type in their email address, and we want to say yes or no, this email address looks valid. All right, let me go ahead and type code validate.py to create a new tab here, and then within this tab let me go ahead and start writing some code, how about, that keeps things

simple initially. First, let me go ahead and prompt the user for their email address, and I'll store the return value of input in a variable called email, asking them, "What's your email?" I'm going to go ahead and preemptively at least clean up the user's input a little bit by minimally just calling strip at the end of my call to input, because recall that input returns a string, or a str, and strs come with some built-in methods, or functions, one of which is strip, which has the effect of stripping off any leading whitespace to

the left or any trailing whitespace to the right. So that's just going to go ahead and at least avoid the human having accidentally typed in a space character; we're going to throw it away just in case. Now I'm going to do something simple. For a user's input to be an email address, I think we can all agree that it's got to minimally have an at sign somewhere in it. So let's start simple: if the user has typed in something with an at sign, let's very generously just say, OK, valid, looks like an email address,

and if we're missing that at sign, let's say invalid, because clearly it's not an email address. It's not going to be the best version of my code yet, but we'll start simple. So I'm going to ask the question: if there is an at symbol in the user's email address, go ahead and print out, for instance, quote unquote "Valid"; else, if there's not, now I'm pretty confident that the email address is in fact invalid. Now what is this code doing? Well, if "@" in email is a Pythonic way of asking, is this string, quote

unquote "@", in this other string, email, no matter where it is, at the beginning, the middle, or the end? It's going to automatically search through the entire string for you. I could do this more verbosely, and I could use a for loop or a while loop and look at every character in the user's email address, looking to see if it's an at sign. But this is one of the things that's nice about Python: you can do more with less. So just by saying if "@" in email, we're achieving that same result. We're

going to get back True if it's somewhere in there, thus valid, or False if it is not. Well, let me go ahead now and run this program in my terminal window with python validate.py, and I'm going to go ahead and give it my email address, malan@harvard.edu, Enter, and indeed it's valid. It looks valid, it is valid. But of course this program is technically broken; it's buggy. What would be an example input, if someone might like to volunteer an answer here, that would be considered valid but you and I know really isn't valid?
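The first version described above can be sketched roughly as follows. The helper name looks_valid and the sample inputs are assumptions introduced here so the check can be run without typing at a prompt; the program in the lecture reads the address with input instead.

```python
def looks_valid(email):
    """Return True if the text contains an at sign anywhere."""
    # Strip leading and trailing whitespace, as the call to strip does.
    email = email.strip()
    # The in operator searches the whole string for the substring "@".
    return "@" in email


# Hypothetical sample inputs, standing in for what a user might type.
for sample in ["malan@harvard.edu", "  malan@harvard.edu  ", "malan"]:
    print(repr(sample), "Valid" if looks_valid(sample) else "Invalid")
```

Wrapping the check in a function is only a convenience for experimenting; the logic is the single expression "@" in email.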

Yeah, thank you. Well, you can type just two at signs and that's it, and it'd still be valid. Still be valid according to your program. Exactly, we've set a very low bar here. In fact, if I go ahead and rerun python validate.py and I'll just type in one at sign, that's it, no username, no domain name, this doesn't really look like an email address, but unfortunately my code thinks it in fact is, because it's obviously just looking for an at sign alone. Well, how could we improve this? Well,

minimally, an email address, I think, tends to have, though this is not actually a requirement, an at sign and a single dot at least, maybe somewhere in the domain name, so malan@harvard.edu. So let's check for that dot as well, though again, strictly speaking, it doesn't even have to be that case. But I'm going for my own email address, at least for now, as our test case. So let me go ahead and change my code now and say not only if "@" is in email but also "." is in email as

well. So I'm asking now two questions; I have two Boolean expressions, beginning with if "@" in email, and I'm and-ing them together logically. This is a logical and, so to speak. So if it's the case that "@" is in email and "." is in email, OK, now I'm going to go ahead and say valid. All right, this would still seem to work for my email address. Let me go ahead and run python validate.py, malan@harvard.edu, Enter, and that of course is valid, as expected. But here too we can be a little adversarial and type in

something nonsensical like "@.", and unfortunately that too is going to be mistaken as valid, even though there's still no username, domain name, or anything like that. So I think we need to be a little more methodical here. In fact, notice that if I do it like this, the at sign can be anywhere and the dot can be anywhere, but if I'm assuming the user is going to have a traditional domain name like harvard.edu or gmail.com, I really want to look for the dot in the domain name only, not necessarily just the username.
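As a rough sketch of this second version, the two membership tests can be joined with a logical and. The function name looks_valid and the sample inputs are again illustrative assumptions, not the lecture's exact file.

```python
def looks_valid(email):
    """Return True if the text contains both an at sign and a dot."""
    email = email.strip()
    # Two separate Boolean expressions, joined by a logical and.
    return "@" in email and "." in email


# Hypothetical sample inputs; the last one is the adversarial case.
for sample in ["malan@harvard.edu", "@", "@."]:
    print(repr(sample), "Valid" if looks_valid(sample) else "Invalid")
```

The last sample shows the weakness described above: neither test cares where the at sign or the dot appears, so the bare string "@." passes.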

Let me go ahead and do this. Let me go ahead and introduce a bit more logic here and instead do this: let me go ahead and do email.split of quote unquote "@". So email again is a string, or a str; strs come with methods, not just strip but also another one called split that, as the name implies, will split one str into multiple ones if you give it a character or more to split on. So this is hopefully going to return to me two parts from a traditional email address, the username and the

domain name, and it turns out I can unpack that sequence of responses by doing this: username, domain equals this. I could store it in a list or some other structure, but if I already know in advance what kinds of values I'm expecting, a username and hopefully a domain, I'm going to go ahead and do it like this instead and just define two variables at once on one line of code. And now I'm going to be a little more precise: if username, then I'm going to go ahead and say print "Valid", else

I'm going to go ahead and say print "Invalid". Now this isn't good enough, but I'm at least checking for the presence of a username now. And you might not have seen this before, but if you simply ask a question like if username, and username is a string, well, if username is going to give me a true answer if username is anything except None or quote unquote nothing. So there's a truthy value here, whereby if username has at least one character, that's going to be considered true, but if username has no characters, it's going

to be considered a false value, effectively. But this isn't good enough; I don't want to just check for username, I want to also check that it's the case that "." is in the domain name as well. So notice here there's a bit of potential confusion with the English language. I seem to be saying if username and "." in domain, as though I'm asking the question, if the username and the dot are in the domain, but that's not what this means. These are two separate Boolean expressions: if username, and, separately, if "." in domain.
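A minimal sketch of this split-based version might look like the following. The function name looks_valid and the sample inputs are assumptions for illustration. The sketch also assumes the input contains exactly one at sign, because unpacking into two variables raises a ValueError otherwise.

```python
def looks_valid(email):
    """Return True if there is a non-empty username and a dot in the domain."""
    # Assumption: exactly one at sign. With zero or several, this
    # two-variable unpacking raises ValueError instead of returning.
    username, domain = email.strip().split("@")
    # An empty string is falsy, so `username` alone tests for non-emptiness.
    # The two expressions are independent: (username) and ("." in domain).
    return bool(username and "." in domain)


# Hypothetical sample inputs, each with exactly one at sign.
for sample in ["malan@harvard.edu", "@harvard.edu", "malan@harvard"]:
    print(repr(sample), "Valid" if looks_valid(sample) else "Invalid")
```

The call to bool is only there so the function returns True or False rather than the empty string when the username is missing.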

And if I parenthesize this, we could make that even more clear by putting parentheses there, parentheses here. So just to be clear, it's really two Boolean expressions that we're and-ing together, not one long English-like sentence. Now if I go ahead and run this, python validate.py, Enter, I'll do my own email address again, malan@harvard.edu, and that's valid. And it looks like I could tolerate something like this: if I do malan@ just, say, harvard, I think at the moment this is going to be invalid. Now maybe the top-level domain harvard exists, but

at the moment it looks like we're looking for something more; we're looking for a top-level domain too, like edu. So for now we'll just consider this to be invalid. But it's not just that we want to check for the presence of a username and the presence of a dot. Let's be more specific; let's start to now narrow the scope of this program, not just to be about generic emails more generally, but about .edu addresses, so specifically for someone in a US university, for instance, whose email

address tends to end with .edu. I can be a little more precise, and you might recall this function already: instead of just saying is there a dot somewhere in domain, let me instead say and domain.endswith of quote unquote ".edu". So now we're being even more precise. We want there to be minimally a username that's not empty, it's not just quote unquote nothing, and we want the domain name to actually end with .edu. Let me go ahead and run python validate.py, and just to make sure I haven't made things even worse, let

me at least test my own email address, which does seem to be valid. Now it seems that I minimally need to provide a username, because we definitely do have that check in place, so I'm going to go ahead and say malan, and now I'm going to go ahead and say at, and it looks like I could be a little malicious here and just say malan@.edu, as though minimally meeting the requirements of this pattern, and that of course is considered valid. But I'm pretty sure there's no one at malan@.edu. We need to have

some domain name in there, so we're still being a bit too generous. Now we could absolutely continue to iterate on this program, and we could add some more Boolean expressions. We could maybe use some other Python methods for checking more precisely if there's something to the left of the dot, to the right of the dot. We could use split multiple times. But honestly, this just escalates quickly; you end up having to write a lot of code just to express something that's relatively simple in spirit: just format this like an email address. So how can

we go about improving this? Well, it turns out in Python there's a library for regular expressions. It's called, succinctly, re, and in the re library you have a lot of capabilities to define and check for and even replace patterns. Again, a regular expression is a pattern, and this library, the re library in Python, is going to allow us to define some of these patterns, like a pattern for an email address, and then use some built-in functions to actually validate a user's input against that pattern, or even use these patterns to change the user's input

or extract partial information therefrom. We'll see examples of all this and more. So what can and should I do with this library? Well, first and foremost, it comes with a lot of functionality. Here's the URL, for instance, to the official documentation, and let me propose that we focus on using one of the most versatile functions in the library, namely this: search. re.search is the name of the function in the re module that allows you to pass in a few arguments. The first is going to be a pattern that you want to search for

in, for instance, a string that came from a user. The string argument here is going to be the actual string that you want to search for that pattern. And then there's a third argument, optionally, that's a whole bunch of flags. A flag in general is like a parameter you can pass in to modify the behavior of the function, but initially we're not even going to use this; we're just going to pass in a couple of arguments instead. So let me go ahead and employ this re library, this regular expression library, and just improve on

this design incrementally. So we're not going to solve this problem all at once, but we'll take some incremental steps. I'm going to go back to VS Code here, and I'm going to go ahead now and get rid of most of this code, but I'm going to go to the top of my file and first and foremost import this re library. So import re gives me access to that function and more. Now, after I've gotten the user's input in the same way as before, stripping off any leading or trailing whitespace, I'm just going to use this

function super trivially for now, even though this isn't really a big step forward. I'm going to say if re.search finds quote unquote "@" in the email address, then let's go ahead and print "Valid", else let's go ahead and print "Invalid". At the moment this is really no better than my very first version, where I was just asking Python if "@" in the email address, but now I'm at least beginning to use this library by using its own re.search function, which for now you can assume returns a true value, effectively, if indeed

the at sign is in email. Just to make sure that this version does work as I expect, let me go ahead and run python validate.py and Enter. I'll type in my actual email address, and we're back in business. But of course this is not great, because if I similarly run this version of the program and just type in an at sign, that's not an email address, and yet my code of course thinks it is valid. So how can I do better than this? Well, we need a bit more vocabulary in the realm of regular

expressions in order to be able to express ourselves a little more precisely. Really, the pattern I want to ultimately define is going to be something like: I want there to be something to the left, then an at sign, then something to the right, and that something to the right should end with .edu but should also have something before the .edu, like harvard or yale or any other school in the US as well. Well, how can I go about doing this? Well, it turns out that in the world of regular expressions, whether in Python or a

lot of other languages as well, there are certain symbols that you can use to define patterns. At the moment I've just used literal raw text. If I go back to my code here, this technically qualifies as a regular expression: I've passed in a quoted string inside of which is an at sign. Now that's not a very interesting pattern, it's just an at sign, but it turns out that once you have access to regular expressions, or a library that offers that feature, you can more powerfully express yourself as follows. Let me reveal that the pattern that you

pass to re.search can take a whole bunch of special symbols, and here's just some of them. In the examples we're about to see, in the patterns we're about to define, here are the special symbols you can use. A single period, a dot, just represents any character except a newline. So that is to say, if I don't really care what letters of the alphabet are in the user's username, I just want there to be one or more characters in the user's name, dot allows me to express a through z,

uppercase and lowercase, and a bunch of other characters as well. Star, a single asterisk, is going to mean zero or more repetitions, so if I say something star, that means that I'm willing to accept either zero repetitions, that is, nothing at all, or more repetitions: one or two or three or 300. If you see a plus in my patterns, that's going to mean one or more repetitions; that is to say, there's got to be at least one character there, one symbol, and then there's optionally more after that. And then you can say zero or one

repetition: you can use a single question mark after a symbol, and that will say I want zero of this character or one, but that's all I'll expect. And then lastly, there's going to be a way to specify a specific number of symbols. If you use these curly braces and a number, represented here symbolically as m, you can specify that you want m repetitions, be it one or two or three or 300; you can specify the number of repetitions yourself. And if you want a range of repetitions, like you want this few characters or this

many characters, you can use curly braces and two numbers inside, called here m and n, which would be a range of m through n repetitions. Now what does all of this mean? Well, let me go back to VS Code here, and let me propose that we iterate on this solution further. It's not sufficient to just check for the at sign; we know that we minimally want something to the left and to the right. So how can I represent that? I don't really care what the user's username is or what letters of the alphabet are in

it, be it malan or anyone else's. So what I'm going to do to the left of this at sign is I'm going to use a single period, the dot, that again indicates any character except for a newline. But I don't just want a single character, otherwise the person's username could only be a@ such and such or b@ such and such; I want it to be multiple such characters. So I'm going to initially use a star. So dot star means give me something to the left, and I'm going to do another one, dot star,

something to the right. Now this isn't perfect, but it's at least a step forward, because now what I'm going to go ahead and do is this: I'm going to rerun python validate.py, and I'm going to keep testing my own email address just to make sure I haven't made things worse, and that's now OK. I'm now going to go ahead and type in some other input, like how about just malan@ with no domain name whatsoever, and you would think this is going to be invalid, but it's still considered valid. But why

is that? If I go back to this chart, why is malan@ with no domain now considered valid? What's my mistake here, by having used .*@.* as my regular expression, or regex? Because you're using the star instead of the plus sign. Exactly, the star again means zero or more repetitions, so re.search is perfectly happy to accept nothing after the at sign, because that would be zero repetitions. So I think I minimally need to evolve this and go back to my code here, and let me go ahead and change this

from star to plus, and let me change the ending from star to plus, so that now when I run my code here, let me go ahead and run python validate.py, I'm going to test my email address as always, still working. Now let me go ahead and type in that same thing from before that was accidentally considered valid. Now I hit Enter: finally, it's invalid. So now we're making some progress on being a little more precise as to what it is we're doing. Now I'll note here, like with almost everything in programming, Python

included, there's often multiple ways to solve the same problem. Does anyone see a way in my code here that I could make a slight tweak if I forgot that the plus operator exists and go back to using a star? If I allowed you only to use dots and only stars, could you recreate the notion of plus? Yes, use another dot: dot dot star. Yeah, because if a dot means any character, we'll just use a dot, and then when you want to say or more, use another dot and then the star. So equivalent to dot

plus would have been dot dot star, because the first dot means any character, and the second pair of characters, dot star, means zero or more other characters. And to be clear, it does not have to be the same character: just by doing dot star does not mean your whole username needs to be a or aa or aaa or aaaa. It can vary with each symbol; it just means zero or more of any character back to back. So I could do this on both the left and the right. Which one is better? You

know, it depends. I think an argument could be made that this is even more clear, because it's obvious now that there's a dot, which means any character, and then there's the dot star. But if you're in the habit of doing this frequently, one of the reasons things like the plus exist is just to consolidate your code into something a little more succinct, and if you're familiar with seeing the plus now, maybe this is more readable to you. So again, just like with Python more generally, you're going to often see different ways to express the

same patterns, and reasonable people might agree or disagree as to which way is better than another. Well, let me propose too that we can think about both of these models a little more graphically. If this looks a little cryptic to you, let me go ahead and rewind to the previous incarnation of this regular expression, which was just a single dot star. This regular expression, .*@.*, means what again? It means zero or more characters, followed by a literal at sign, followed by zero or more other characters. Now when you pass this pattern in as

an argument to re.search, it's going to read it from left to right and then use it to try to match against the input, email in this case, that the user typed in. Now how is the computer, how is re.search, going to keep track of whether or not the user's email matches this pattern? Well, it turns out that it's going to be using a machine of sorts, implemented in software, known as a finite state machine, or more formally a nondeterministic finite automaton, and the way it works, if we depict this graphically, is as follows.

The re.search function starts over here in a so-called start state; that's the sort of condition in which it begins. And then it's going to read the user's email address from left to right, and it's going to decide whether or not to stay in this first state or transition to the next state. So for instance, in this first state, as the function is reading my email address, malan@harvard.edu, it's going to follow this curved edge up and around to itself, a reflexive edge, and it's labeled dot because dot again just means any character. So as the

function is reading my email address, malan@harvard.edu, from left to right, it's going to follow these transitions as follows: m-a-l-a-n. And then it's hopefully going to follow this transition to the second state, because there's a literal at sign both in this machine as well as in my email address. Then it's going to try to read the rest of my address, h-a-r-v-a-r-d-dot-e-d-u, and that's it. And then the computer's going to check, did it end up in an accept state, a final state? That's actually

depicted here pictorially a little differently, with double circles, one inside of the other, and that just means that if the computer finds itself in that second, accept state after having read all of the user's input, it is indeed a valid email address. If by some chance the machine somehow ended up stuck in that first state, which does not have double circles and is therefore not an accept state, the computer would conclude this is an invalid email address instead. By contrast, if we go back to my other version of the code, where I instead had

plus on both the left and the right, recall that re.search is going to use one of these state machines in order to decide, from left to right, whether or not to accept the user's input, like malan@harvard.edu. Can we get from the start state, so to speak, to an accept state to decide, yep, this was in fact meeting the pattern? Well, let's propose that this nondeterministic finite automaton looks like this instead. We're going to start as before in the leftmost start state, and we're going to necessarily consume one character per this first edge,

which is labeled with a dot to indicate that we can consume any one character, like the m in malan@harvard.edu. Then we can spend some time consuming more characters before the at sign, so the a-l-a-n. Then we can consume the at sign. Then we can consume at least one more character, because recall that the regex has dot plus this time, and then we can consume even more characters if we want. So if we first consume the h in harvard.edu, that then leaves the a-r-v-a-r-d and then dot e-d-u, and

now here too we're at the end of the story, but we're in an accept state, because that circle at the end has two circles total, which means that if the computer, if this function, finds itself in that accept state after reading the entirety of the user's input, it too is in fact a valid email address. If by contrast we had gotten stuck in one of those other states, unable to follow a transition, one of those edges, and therefore unable to make progress in the user's input from left to right, then we would have to conclude

that that email address is in fact invalid. Well, how can we go about improving this code further? Let me propose now that we check not only for a username and also something after the username, like a domain name, but minimally require that the string ends with .edu as well. Well, I think I could do this fairly straightforwardly. Not only do I want there to be something after the at sign, like the domain, like harvard, I want the whole thing to end with .edu. But there's a little bit of danger here. What have I done

wrong by implementing my regular expression now in this way, by using .+@.+.edu? What could go wrong with this version? The dot means something else in this context, where it means zero or more repetitions of a character, which is why it will interpret it differently. Exactly, even though I mean for it to mean literally .edu, a period and then edu, unfortunately, in the world of regular expressions, dot means any character, which means that this string could technically end in aedu or bedu or cedu and so forth, but

that's not in fact what I want. So any instincts now as to how I could fix this problem? And let me demonstrate the problem more clearly. Let me go ahead and run this code here. Let me go ahead and type in malan@harvard.edu, and as always this does in fact work. But watch what happens here: let me go ahead and do malan@harvard?edu, Enter. That too is valid. So I could put any character there and it's still going to be accepted, but I don't want

?edu, I want .edu literally. Any instincts, then, for how we can solve this problem here? How can I get this new function, re.search, and a regular expression more generally, to literally mean a dot? Might you think? You can use the escape character, the backslash. Indeed, the so-called escape character, which we've seen before outside of the context of regular expressions: when we talked about newlines, backslash n was a way of telling the computer I want a newline, but without actually literally hitting Enter and moving the cursor yourself, and you don't

want a literal n on the screen, so backslash n was a way to escape and convey that you want a newline. It turns out regular expressions use a similar technique to solve this problem here. In fact, let me go into my regular expression, and before that final dot, let me put a single backslash. In the world of regular expressions, this is a so-called special sequence, and it indicates, per this backslash and a single dot, that I literally want to match on a dot. It's not that I want to match on any character

and then edu; I want to match on a dot, or a period, then edu. But we don't want Python to misinterpret this backslash as beginning an escape sequence, something special like backslash n, which, even though we as the programmer might type two characters, backslash n, really is interpreted by Python as a single newline. We don't want any kind of misinterpretation like that here. So it turns out there's one other thing we should do for regular expressions like this that have a backslash used in this way. I want to specify to Python

that I want this string, this regular expression in double quotes, to be treated as a raw string, literally putting an r at the beginning of the string to indicate to Python that it should not try to interpret any backslashes in the usual way. I want to literally pass the backslash and the dot and the edu into this particular function, search in this case. So it's similar in spirit to using that f at the beginning of a format string, which of course tells Python to format the string in a certain way, plugging in variables that

might be between curly braces, but in this case r indicates a raw string that I want passed in exactly as is. Now it's only strictly necessary if you are in fact using backslashes to indicate that you want some special sequence like backslash dot, but in general it's probably a good habit to get into to just use raw strings for all of your regular expressions, so that if you eventually go back in, make a change, make an addition, you don't accidentally introduce a backslash and then forget that that might have some special or misinterpreted meaning.
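The progression of patterns up to this point can be compared side by side with a short sketch. The names PATTERNS, SAMPLES, and is_match are assumptions introduced for illustration; only re.search comes from the re library itself, and every pattern is written as a raw string, including the final one with the escaped dot.

```python
import re

# Each pattern is a raw string, so a backslash reaches re.search untouched.
PATTERNS = [
    r".*@.*",       # zero or more characters on each side of the at sign
    r".+@.+",       # one or more characters on each side
    r"..*@..*",     # same meaning as the line above, using only dots and stars
    r".+@.+.edu",   # unescaped dot before edu: any character is accepted there
    r".+@.+\.edu",  # escaped dot: only a literal period is accepted there
]

# Hypothetical inputs: a real address, a missing domain, and a mistyped dot.
SAMPLES = ["malan@harvard.edu", "malan@", "malan@harvard?edu"]


def is_match(pattern, text):
    """Return True if re.search finds the pattern anywhere in the text."""
    return re.search(pattern, text) is not None


for pattern in PATTERNS:
    for sample in SAMPLES:
        verdict = "Valid" if is_match(pattern, sample) else "Invalid"
        print(f"{pattern:<12} {sample:<20} {verdict}")
```

Only the last pattern rejects both of the bad samples. The raw string r"\." holds the same two characters as the ordinary string "\\.", so the r prefix mainly saves having to double each backslash.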

Let me go ahead and try this new regular expression. I'll clear my terminal window, run python validate.py, and then I'll type in my email address correctly, malan@harvard.edu, and that's fortunately still valid. Let me clear my screen and run it one more time, python validate.py, and this time let's mistype it as malan@harvard?edu, whereby there's obviously not a dot there but there is some other single character that last time was misinterpreted as valid, but this time, now that I've improved my regular expression, it's discovered as

indeed invalid. Any questions now on this technique for matching something to the left of the at sign, something to the right, and now ending with .edu explicitly? What happens when the user enters multiple at signs? A good question, and you kind of called me out here. Well, when in doubt, let's try. Let me go ahead and do python validate.py, malan@@@harvard.edu, which also is incorrect. Unfortunately, my code thinks it's valid. So another problem to solve, but a shortcoming for now. Other questions on these regular expressions thus far? Can you use curly brackets, m, instead of

backslash? Can you use curly brackets instead of backslash? Not in this case; if you want a literal dot, backslash dot is the way to do it, literally. How about one other question on regular expressions? Is this the same thing that Google Forms uses in order to categorize data? Let's say you've got multiple people sending in requests about some feedback; do they categorize the data that they get using this particular regular expression thing? Indeed, if you've ever used Google Forms, not just to submit one but to create a Google Form, one

of the menu options is for response validation, in English at least, and what that allows you to do is specify that the user has to input an email address or a URL or a string of some length. But there's an even more powerful feature that some of you may not have ever noticed, and indeed, if you'd like to open up Google Forms, create a new form temporarily, and poke around, you will actually see, in English at least, quote unquote "regular expression" mentioned as one of the mechanisms you can use to validate your users' input

into your Google Form. So in fact, after today you can start avoiding these specific dropdowns, like email address or URL or the like, and you can express your own patterns precisely as well. Regular expressions can even be used in VS Code itself. If you go and find, or do a find and replace, in VS Code, you can of course just type in words, like you could into Microsoft Word or Google Docs. You can also type, if you check the right box, regular expressions, and start searching for patterns, not literally specific values. Well, let

me propose that we now enhance this implementation further by introducing a few other symbols, because right now with my code I keep saying that I want my email address to end with .edu and start with a username, but I'm being a little too generous. This does in fact work as expected for my own email address, malan@harvard.edu, but what if I type in a sentence like "My email address is malan@harvard.edu." and suppose I've typed that into the program, or I've typed that into a Google Form? Is this going to be considered valid or invalid?
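One way to check is to run the current pattern against that sentence directly. This is a small sketch in which the variable names pattern, text, and match are assumptions chosen for illustration; match.group is the standard way to see which part of the text re.search actually matched.

```python
import re

# The pattern as it stands at this point: no anchors at either end.
pattern = r".+@.+\.edu"
text = "My email address is malan@harvard.edu."

# re.search looks for the pattern anywhere inside the text.
match = re.search(pattern, text)
print("Valid" if match else "Invalid")

if match:
    # Show the portion of the sentence that satisfied the pattern.
    print(repr(match.group()))
```

Because nothing in the pattern pins it to the beginning or the end of the input, the words in front of the address are absorbed by the first dot plus, and the trailing period is simply left outside the match.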

well let's consider it's got the at sign so we're good there it's got one or more characters to the left of the at sign it's got one or more characters to the right of the at sign it's got a literal .edu somewhere in there to the right of the at sign and granted there's more stuff to the right there's literally this period at the end of my English sentence but that's okay because at the moment my regular expression is not so precise as to say the pattern must start with the username and end with the

.edu technically it's left unsaid what more can be to the left and what more can be to the right so when I hit enter now you'll see that that whole sentence in English is valid and that's obviously not what you want in fact consider the case of using Google Forms or Office 365 to collect data from users if you don't validate your input your users might very well type in a full sentence or something else with a typographical error not an actual email so if you're just trying to copy all of the results that have

been typed into your form so you can paste them into Gmail or some email program it's going to break because you're going to accidentally paste something like a whole English sentence into the program instead of just an email address which is what your mailer expects so how can I be more precise well let me propose we introduce a few more symbols as well it turns out in the context of a regular expression one of these patterns you can use the caret symbol the little triangular mark to represent that you want this pattern to match

the start of the string specifically not anywhere but the start of the user's string by contrast you can use a dollar sign in your regular expression to say that you want to match the end of the string or technically just before the newline at the end of the string but for all intents and purposes think of caret as meaning start of the string and dollar sign as meaning end of the string it is a weird thing that one is a caret and one is a dollar sign these are not really things that I think

of as opposites like a parenthesis or something like that but those are the symbols the world chose many years ago so let me go back to VS Code now and let me add this feature to my code here let me specify that yes I do want to search for this pattern but I want the user's input to start with this pattern and end with this pattern so even though it's going to start looking even more cryptic I put a caret symbol here at the beginning and I put a dollar sign here at the end
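
As a rough sketch of the change just described, the two versions of the pattern can be compared side by side. The patterns are assumed from the lecture's running example, and the helper name is_valid and the variable names are illustrative rather than taken from validate.py.

```python
import re

# Patterns assumed from the lecture's running example; the names are illustrative.
UNANCHORED = r".+@.+\.edu"
ANCHORED = r"^.+@.+\.edu$"


def is_valid(email, pattern):
    """Return True if re.search finds the pattern in email."""
    return re.search(pattern, email) is not None


sentence = "my email address is malan@harvard.edu."
address = "malan@harvard.edu"

for pattern in (UNANCHORED, ANCHORED):
    for text in (sentence, address):
        verdict = "Valid" if is_valid(text, pattern) else "Invalid"
        print(f"{pattern!r:18} {text!r:42} {verdict}")
```

Only the anchored pattern should reject the full sentence, while both should accept the bare address.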

that does not mean I want the user to type a caret symbol or a dollar sign this is special symbology that indicates to re.search that it should only look for now an exact match against this pattern so if I now go back to my terminal window and I'll leave the previous result on the screen let me type the exact same thing my email address is malan@harvard.edu enter sorry period and now I'm going to go ahead and hit enter now that's considered invalid but let me clear the screen and just to make sure I

didn't break things let me type in just my email address and that is valid any questions now on this version of my regular expression which note goes further to specify even more precisely that I want it to match at the start and the end any questions on this one here okay you have backslash and .edu and then the dollar sign but the dot is like uh one of the regular expression symbols right it normally is but this backslash that I deliberately put before this period here is an escape character it is a way of telling re.search

that I don't want any character there I literally want a period there and it's the only way you can distinguish one from the other if I got rid of that backslash this would mean that the email address just has to end with any character then an e then a d then a u I don't want that I want literally a period then the e then the d then the u this is actually common convention in programming and technology in general if you and I decide on a convention whereby we're using some character on the

keyboard to mean something special invariably we create a future problem for ourselves when we want to literally use that same character and so the solution in general to that problem is to somehow escape the character so that it's clear to the computer that it's not that special symbol it's literally the symbol it sees so we don't need another backslash before the dollar sign no uh because in this case dollar sign means something special per this chart here dollar sign by itself does not mean US dollars or currency it literally means

match the end of the string if however I wanted the user to literally type in a dollar sign at the end of their input the solution would be the same I would put a backslash before the dollar sign which means my email address would have to be something like malan@harvard.edu dollar sign which is obviously not correct too so backslashes just allow you to tell the computer to not treat those symbols specially like meaning something special but to treat them literally instead how about one other question here on regular expressions you said

one represents to make it one plus then you said one was to make it one with nothing sure let me rewind in time I think what you're referring to was one of our earlier versions that initially looked like this which just meant zero or more characters then an at sign then zero or more other characters we then evolved that to be this plus on both sides which means one or more characters on the left then an at sign then one or more characters on the right and if I'm interpreting your question correctly one of

the points I made earlier was that if you didn't use Plus or forgot that it exists you could equivalently achieve the exact same result with two dots and a star because the first dot means any character it's got to be there the second dot star means zero or more other characters and same on the right so it's just another way of expressing the same idea one or more can be represented like this with dot dot star or you can just use the handier syntax of plus which means the same thing all right so I dare

say there's still some problems with the regular expression in this current form because even though now we're starting to look for the username at the beginning of the string from the user and we're looking for the .edu literally at the end of the string from the user those dots are a little too encompassing right now I am allowed to type in more than the single at sign why because at is a character and dot means any character so honestly I can have as many at signs in this thing at the moment as I want

for instance if I run python of validate.py malan@harvard.edu still works as expected but if I also run python of validate.py and incorrectly do malan@@@harvard.edu that should be invalid but it's considered valid instead so I think we need to be a little more restrictive when it comes to that dot and we can't just say oh any old character there is fine we need to be more specific well it turns out that regular expressions also support this syntax you can use square brackets inside of your pattern and inside of those square brackets include

one or more characters that you want to look for specifically alternatively you can inside of those square brackets put a caret symbol which unfortunately in this context means something completely different from match the start of the string but this would be the complement operator inside of these square brackets which means you cannot match any of these characters so things are about to look even more cryptic now but that's why we're focusing on regular expressions on their own here if I don't want to allow any character which is what a dot is let me go ahead

and I could just say well I only want to support A's or B's or C's or D's or E's or F's or G's I could type in the whole alphabet here plus some numbers to actually include all of the letters that I do want to allow but honestly a little simpler would be this I could use a caret symbol and then an at sign which has the effect of saying this is the set of characters that has everything except an at sign and I can do the same thing over here instead of a dot to

the right of the at sign I can do open bracket caret at sign and I admit things are starting to escalate quickly here let's start from the left and go to the right this caret outside of the square brackets at the very start of my string as before means match from the start of the string and let's jump ahead the dollar sign all the way at the end of the regular expression means match at the end of the string so if we can mentally tick those off as straightforward let's now focus on everything else in

the middle well to the left here we have new syntax a square bracket another caret an at sign and a closed square bracket and then a plus the plus means the same thing as always it means one or more of the things to the left what is the thing to the left well this is the new syntax inside of square brackets here I have a caret symbol and then an at sign that just means any character except an at sign it's a weird syntax but this is how we can express that simple idea any character

on the keyboard except for an at sign and heck even other characters that aren't physically on your keyboard but that nonetheless exist then we have a literal at sign then we have another one of these same things square bracket caret at close bracket which means any character except an at sign then one or more of those things followed by literally a period edu so now let me go ahead and do this again let me rerun python of validate.py and test my own email address to make sure I've not made things worse and we're

good now let me go ahead and clear my screen and run python of validate.py again and do malan@@@harvard.edu crossing my fingers this time and finally this now is invalid why I'm allowing myself to have one at sign in the middle of the user's input but everything to the left per this new syntax cannot be an at sign it can be anything but one or more times and everything to the right of the at sign can be anything but an at sign one or more times followed by lastly a literal .edu so again the
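
To make the before and after concrete, here is a minimal sketch comparing the dot-based pattern with the complemented set. Both patterns are assumed from the description above, and the helper name matches is hypothetical.

```python
import re

DOT_PATTERN = r"^.+@.+\.edu$"        # earlier version: "." also matches "@"
SET_PATTERN = r"^[^@]+@[^@]+\.edu$"  # "[^@]" means any character except "@"


def matches(pattern, text):
    """Return True if re.search finds the pattern in text."""
    return re.search(pattern, text) is not None


for email in ("malan@harvard.edu", "malan@@@harvard.edu"):
    print(email, matches(DOT_PATTERN, email), matches(SET_PATTERN, email))
```

The address with three at signs should pass the dot-based pattern but fail the one built from the complemented set.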

new syntax is quite simply this square brackets allow you to specify a set of characters that you literally type out at your keyboard a b c d e f or the complement the opposite the caret symbol which means not and then the one or more symbols you want to exclude questions now on this syntax here so right after at sign can we use the curly brackets m uh one so that we can only have one repetition of the at symbol absolutely so we could do this let me go ahead and pull up VS Code and let me

delete the current form of a regular expression and go back to where we began which was just dot star at and dot star I could absolutely do something like this and require that I want at least one of any character here and then I could do something more to have any more as well so the curly brace syntax which we saw on the slide earlier but didn't yet use absolutely can be used to specify a specific number of characters but honestly this is more verbose than is necessary the best solution arguably or the simplest at least ultimately is just to

say plus but there too another example of how you can solve the same problem multiple ways let me go back to where the regular expression just was and take other questions as well questions on these sets of characters or complementing that set so can you use that same syntax to say that you don't want a certain character throughout the whole string you could absolutely use this syntax to exclude a certain character from the entire string but it would

be a little harder right now because we're still requiring .edu at the end but yes absolutely other questions what happens if the user inputs .edu in the beginning of the string a good question what happens if the user types in .edu at the beginning of the string well let me go back to VS Code here and let's try to solve this in two different ways first let's look at the regular expression and see if we can infer if that's going to be tolerated well according to the current cryptic regular expression I'm saying that you can have

any character except the at sign so that would work I could have the dot for the .edu but then I have to have an at sign so that wouldn't really work because if I'm just typing in .edu we're not going to pass that constraint so now let me try this by running the program let me type in just literally .edu that doesn't work but I could do this .edu@.edu that too is invalid but let me do this .edu@something.edu that passes so it's starting to get a little weird now maybe

it's valid maybe it's not but I think we'll eventually be more precise too how about one more question on this regular expression and this complementing of sets can we use uh another domain name as a string input can you use another domain name absolutely I'm using my own just for the sake of demonstration but you could absolutely use any domain or top level domain and I'm using edu which is very US-centric but this would absolutely work exactly the same for any top level domain all right let me go ahead now and propose that we improve

this regular expression further because if I pull it up again in VS Code here you'll see that I'm being a little too tolerant still it turns out that there are certain requirements for someone's username and domain name in an email address there is an official standard in the world for what an email address can be and what characters can be in it and this is way too accommodating of all the characters in the world except for the at symbol so let's actually narrow the definition of what we're going to tolerate in usernames and companies

like Gmail could certainly do this as well suppose that it's not just that I want to exclude at sign suppose that I only want to allow for say characters that normally appear in words like letters of the alphabet a through z be it uppercase or lowercase maybe some numbers and heck maybe even an underscore could be allowed too well we can use this same square bracket syntax to specify a set of characters as follows I could do a b c d e f g h i j oh my God this is going to take forever

I'm going to have to type out all 26 letters of the alphabet both lowercase and uppercase so let me stop doing that there's a better way already if you want to specify within these square brackets a range of letters you can actually just do a hyphen if you literally do a hyphen z in these square brackets the computer is going to know you mean a through z you do not need to type 26 letters of the alphabet if you want to include uppercase letters as well you just do the same no spaces no commas
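
A small sketch can check the point about ranges: the hyphenated form should accept exactly the same characters as the typed-out alphabet, and any space or comma placed between ranges becomes part of the set. The helper name accepted and the constant names are hypothetical, and only ASCII characters are probed here.

```python
import re
import string

RANGES = r"[a-zA-Z]"
TYPED_OUT = "[" + string.ascii_lowercase + string.ascii_uppercase + "]"
WITH_SEPARATORS = r"[a-z, A-Z]"  # the comma and space are literal members of the set


def accepted(char_class):
    """Return the ASCII characters that the given character class matches."""
    return {chr(code) for code in range(128) if re.fullmatch(char_class, chr(code))}


print(accepted(RANGES) == accepted(TYPED_OUT))
print(sorted(accepted(WITH_SEPARATORS) - accepted(RANGES)))
```

The second line printed lists the extra characters that sneak in when separators are typed between the ranges.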

you literally just keep typing capital A through capital Z so I have little a hyphen little z big A hyphen big Z no spaces no commas no separators you just keep specifying those ranges if I additionally want numbers I could do 0 1 2 3 4 nope you don't need to type in all 10 decimal digits you can just say 0 through 9 using a hyphen as well and if you now want to support underscores as well which is pretty common in usernames for email addresses you can literally just type in underscore at the

end notice that all of these characters are inside of square brackets which just again means here is a set of characters that I want to allow I have not used a caret symbol at the beginning of this whole thing because I don't want to complement it complement with an e not compliment with an i I don't want to complement it by making it the opposite I literally want to accept only these characters I'm going to go ahead and do the same thing on the right if I want to require that the domain

name similarly come from this set of characters which admittedly is a little too narrow but it's familiar for now so we'll keep it simple I'm going to go and paste that exact same set of characters over there to the right and so now it's much more restrictive now I'm going to go ahead and run python of validate.py I'm going to test my own email address and we're still good I'm going to clear my screen and run it once more this time trying to break it let me go ahead and do something like how about

david_malan@harvard.edu enter but that too is going to be valid but if I do something completely wrong again like malan@@@harvard.edu that's still going to be invalid why because my regular expression currently only allows for a single at in the middle because everything to the left must be alphanumeric alphabetical or numeric or an underscore the same thing to the right followed by the .edu now honestly this is a regular expression that you might be in the habit of typing in the real world as in as cryptic as this might look this is

the world of regular expressions so you'll get more comfortable with this syntax over time but thankfully some of these patterns are so common that there are built-in shortcuts for representing some of the same information that is to say you don't have to constantly type out all of the symbols that you want to include because odds are some other programmer has had the same problem so built into regular expressions themselves are some additional patterns you can use and in fact I can go ahead and get rid of this entire set a through z lowercase A

through Z uppercase 0 through 9 and an underscore and just replace it with a single backslash w backslash w in this case represents a word character which is commonly known as an alphanumeric symbol or the underscore as well I'm going to do the same thing over here I'm going to highlight the entire set of square brackets delete it and replace it with a single backslash w and now I feel like we're making progress because even though it's cryptic and even though it would have

looked even more cryptic a little bit ago now it's at least starting to read a little more friendly the caret on the left means start matching at the beginning of the string backslash w means any word character the plus means one or more at symbol literally then another word character one or more then a literal dot then literally edu and then match at the very end of the string and that's it so there's more of these too and we won't use them all here but here is a partial list of the patterns you can use

within a regular expression one you have backslash d for any decimal digit decimal digit meaning 0 through 9 commonly done here too is if you want to do the opposite of that the complement so to speak you can do backslash capital D which is anything that's not a decimal digit so it might be letters and punctuation and other symbols as well meanwhile backslash s means whitespace characters like a single hit of the space bar or maybe hitting tab on the keyboard that's whitespace backslash capital S is the opposite or complement of that

anything that's not a whitespace character backslash w we've seen a word character as well as numbers and the underscore and if you want the complement or opposite of that you can use backslash capital W to give you everything but a word character again these are just common patterns that so many people were presumably using in yesteryear that it's now baked into the regular expression syntax so that you can more succinctly express your same ideas any questions then on this approach here where we're now using backslash w to represent my word character uh so
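
The shorthand classes listed above can be tried on a short string; this is a sketch with a made-up sample value. One caveat worth hedging: for Python str patterns, backslash w also matches non-ASCII letters by default, so it is equivalent to the a-z, A-Z, 0-9, underscore set only when the re.ASCII flag is passed.

```python
import re

sample = "cs50_2024 \t!"  # made-up input mixing letters, digits, whitespace, punctuation

shorthands = {
    "digit": r"\d",
    "not digit": r"\D",
    "whitespace": r"\s",
    "not whitespace": r"\S",
    "word character": r"\w",
    "not word character": r"\W",
}

# Collect every single-character match for each shorthand class.
found = {name: re.findall(pattern, sample) for name, pattern in shorthands.items()}
for name, chars in found.items():
    print(f"{name:20} {chars}")

# With str patterns, \w is Unicode-aware unless re.ASCII is given.
print(bool(re.fullmatch(r"\w", "é")), bool(re.fullmatch(r"\w", "é", re.ASCII)))
```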

what I want to ask about was actually the previous approach like the square bracket approach could we accept like lists in there yes we'll see this before long but suppose you wanted to tolerate not just .edu but maybe .edu or .com you could do this you could introduce parentheses and then you can or those together I could say com or edu I could also add in something like in the US gov or net or anything else or org or the like and each of the vertical bars here means something special it means or

and the parentheses simply group things together formally you have this syntax here A or B A vertical bar B means A has to match or B has to match where A and B can be any other patterns you want in parentheses you can group those things together so just like math you can combine ideas into one phrase and do this thing or the other and there's other syntax as well that we'll soon see other questions on these regular expressions in this syntax here what if we put spaces in the expression sure so
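
The grouping idea can be sketched as follows. The list of top-level domains comes from the answer above, while the constant and helper names and the example.com and example.io addresses are made up for illustration. The second pattern drops the parentheses to show why they matter: the vertical bar then splits the entire pattern in two.

```python
import re

GROUPED = r"^\w+@\w+\.(com|edu|gov|net|org)$"
UNGROUPED = r"^\w+@\w+\.com|edu$"  # reads as: (^\w+@\w+\.com) OR (edu$)


def matches(pattern, text):
    """Return True if re.search finds the pattern in text."""
    return re.search(pattern, text) is not None


for text in ("malan@harvard.edu", "malan@example.com", "malan@example.io", "just edu"):
    print(f"{text!r:22} grouped={matches(GROUPED, text)} ungrouped={matches(UNGROUPED, text)}")
```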

if you want spaces in there you can't use backslash w alone because that is only a word character which is alphabetical numerical or the underscore but you could do this you could go back to this approach whereby you use square brackets and you could say a through z or capital A through capital Z or 0 through 9 or underscore or I'm going to hit the space bar a single space you can put a literal space inside of the square brackets which will allow you then to detect a space alternatively I could still use backslash w but I

could combine it as follows I could say give me a backslash w or a backslash s because recall that backslash s is whitespace so it's even more than a single space it could be a tab but by putting those things in parentheses now you can match either the thing on the left or the thing on the right one or more times how about one other question on these regular expressions perfect so I was going to ask um does the backslash w um include a dot uh because no nope it only includes letters numbers

uh and underscore that is it and I was wondering you gave an example at the beginning that had uh spaces like this is my email so on so um I don't think our current version even quite a long while ago stopped accepting it was that because of the caret uh or because of something no the reason I was handling spaces and other English words when I typed out my email address is malan@harvard.edu was because we were using initially dot star or dot plus which is any character uh and even after that we said anything

except the at sign which includes spaces only once I started using square brackets and a through z and 0 through 9 and underscore did we finally get to the point where we would reject whitespace and in fact I can run this here let me go into the current version of my code in VS Code which is using again the backslash w's for word characters let me run python of validate.py and incorrectly type in something like my email address is malan@harvard.edu period which has spaces to the left of my username and that is now invalid

because space is not a word character yorgo notes too that technically I'm not allowing dots and some of you might be thinking wait a minute my gmail address has a dot in it that's something we're going to still have to fix backslash w is not the end all here it's just allowing us to express our previous solution a little more succinctly now one thing we're still not handling quite properly is uppercase versus lowercase the backslash w technically does handle lowercase letters and uppercase because it's the exact same thing as that set from before which had

little a through little z and big A through big Z but watch this let me go ahead in my current form run python of validate.py and just because my caps lock key is down MALAN@HARVARD.EDU shouting my email address it's going to be okay in terms of the malan it's going to be okay in terms of the harvard because those are matching the backslash w which does include lowercase and uppercase but I'm about to see invalid why why is MALAN@HARVARD.EDU invalid when it's in all caps here even though I'm using backslash w yeah

so you are asking for the domain .edu in lowercase and you're typing in an uppercase exactly I'm typing in my email address in all uppercase but I'm looking for literally edu and as I see you with AirPods and so many of you with headphones I apologize for yelling into my microphone just now to make this point but let's see if we can't fix that well if my pattern on line five is expecting it to be lowercase there's actually a few ways I can solve this one would be something we've seen before I could just

force the user's input to all lowercase and I could put onto the end of my first line .lower and actually force it all to lowercase alternatively I could do that a little later instead of passing in email I could pass in the lowercase version of email because email addresses should in fact be case insensitive so that would work too but there's another mechanism here which is worth seeing it turns out that that function before called re.search supports recall a third argument as well these so-called flags and flags are configuration options typically to a

function that allow you to configure it a little differently and how might I go about configuring this call to re.search a little bit differently insofar as I'm currently only passing in two arguments well it turns out that some of the flags you can pass into this function are these it turns out that the regular expression library in Python aka re comes with a few built-in variables so to speak things that you can think of as constants that have meaning to re.search and they do so as follows if you pass in

as a flag re.IGNORECASE what re.search is going to do is ignore the case of the user's input it can be uppercase lowercase a combination thereof the case is going to be ignored it will be treated case insensitively and you can do other things too that we won't do here but if you want to handle the user's input that maybe spans multiple lines maybe they didn't just type in an email address but an entire paragraph of text and you want to match different lines of that text that is multiple lines another flag

is re.MULTILINE for just that or re.DOTALL whereby you can configure the dot to recognize not just any character except newlines but any character plus newlines as well but for now let me go ahead and just make use of this first one let me pass in a third argument to re.search which is re.IGNORECASE let me now rerun the program without clearing my screen python of validate.py let me type in again in all caps effectively shouting MALAN@HARVARD.EDU enter and now

it's considered valid because I'm telling re.search specifically to ignore the case of the input and that too here is fine and why might I do this approach rather than call .lower in one of those other locations if I don't actually want to change the user's input for whatever reason I can still treat it case insensitively without actually changing the value of that variable itself are any final questions now on this validation of email addresses uh so the pattern is a string right mhm uh can we use an f-string you can
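
The case-handling options discussed above can be sketched side by side. The pattern is assumed from the lecture's running example, and the helper name is_valid and its flags parameter are illustrative rather than the actual contents of validate.py.

```python
import re

PATTERN = r"^\w+@\w+\.edu$"  # assumed current pattern; "edu" is spelled in lowercase


def is_valid(email, flags=0):
    """Return True if email matches PATTERN under the given re flags."""
    return re.search(PATTERN, email, flags) is not None


email = "MALAN@HARVARD.EDU"

print(is_valid(email))                 # case-sensitive: literal "edu" does not match "EDU"
print(is_valid(email, re.IGNORECASE))  # the flag makes the comparison case-insensitive
print(is_valid(email.lower()))         # alternative: lowercase a copy of the input
print(email)                           # the original variable is left unchanged
```

Passing the flag leaves the user's input untouched, which is the design reason given above for preferring it over calling .lower on the stored value.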

yes you can use an f-string so that you could plug in for instance the value of a variable and pass it into the function other questions on this uh backslash w character could we take it as an input from the user technically yes that's not a problem we're trying to solve right now we want the user to provide literal input like their email address not necessarily a regular expression but you could imagine building software that asks the user especially if they're more advanced users to type in a regular expression for some reason to validate something

else against that and in fact that's what Google's doing if you play around with Google Forms and create a form with response validation and select regular expression Google lets you and me type in our own regular expressions which would be a perfect example of that all right well let me propose that we try to solve one other problem here whereby if I go into the same version as before which is now ignoring case but I type in one of my other email addresses let me go ahead and run python of validate.py

and this time let me type in not malan@harvard.edu which I use primarily but another email address of mine malan@cs50.harvard.edu which forwards to the same let me go ahead and hit enter now and huh invalid even though I'm pretty sure that is in fact my email address well let's put our finger on the reason why why at the moment is malan@cs50.harvard.edu being considered invalid even though I'm pretty sure I send and receive email from that address too why might that be because there is a dot that has

come after the at symbol exactly there's a dot after my cs50 and I'm not expecting any dots there I'm expecting only again word characters which is a through z 0 through 9 and underscore so I'm going to have to retool here but how could I go about doing this well it turns out theoretically there could be other email addresses even though they'd be getting a little excessively long for instance malan@something.cs50.harvard.edu which does not technically exist but it could you can have of course multiple dots in a domain name like we see

here wouldn't it be nice if we could handle that as well well let me propose that we modify my regular expression as follows it turns out that you can group ideas together and you can not only ask whether or not this pattern matches or this one using syntax like A vertical bar B which means either A or B you can also group things together and then apply some other operator to them as well in fact let me go back to VS Code here and let me propose that if I want to tolerate a subdomain like

cs50 that may or may not be there let me go ahead and change it as follows I could naively do this if I want to support subdomains I could say well let's allow for other backslash w characters plus and then a literal dot and notice I'll highlight in blue here what I've just added everything else is the same but I'm now adding room for another sequence of one or more word characters and then a literal dot so this now I think if I rerun python of validate.py will work for malan@cs50.harvard.edu enter unfortunately does anyone

see where this is going let me rerun python of validate.py and type in as I keep doing malan@harvard.edu which up until now has kept working despite all of my changes but now finally I've broken my own email address so logically what's the solution here well there's a bunch of ways we could solve this I could maybe start using two regular expressions and support email addresses of the form username at domain.TLD or username at subdomain.domain.TLD where TLD just means top level domain like edu or I could maybe just modify

this one because I'd prefer not to have like two regular expressions or one that's twice as big why don't I just specify to re.search that part of this pattern is optional what was the symbol we saw earlier that allows you to specify that the thing before it is technically optional uh the straight bar we are using the straight bar as an optional to make the argument optional so we could use a vertical bar and some parentheses and say either there's something here or there's nothing we could do that in parentheses
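
As a sketch of the options on the table, the naive pattern with a required subdomain can be compared against the vertical-bar idea just mentioned (a group with an empty alternative) and the question-mark form that the lecture turns to next. All three patterns are reconstructions from the spoken description, and the names are hypothetical.

```python
import re

NAIVE = r"^\w+@\w+\.\w+\.edu$"          # subdomain is required, so plain addresses break
EITHER = r"^\w+@(\w+\.|)\w+\.edu$"      # group matches "something." or nothing at all
OPTIONAL = r"^\w+@(\w+\.)?\w+\.edu$"    # "?" makes the whole group optional


def matches(pattern, text):
    """Return True if re.search finds the pattern in text."""
    return re.search(pattern, text) is not None


emails = ("malan@harvard.edu", "malan@cs50.harvard.edu", "malan@a.b.harvard.edu")
for email in emails:
    results = [matches(p, email) for p in (NAIVE, EITHER, OPTIONAL)]
    print(f"{email:24} naive={results[0]} either={results[1]} optional={results[2]}")
```

On these inputs the last two patterns should agree with each other, and neither accepts more than one subdomain.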

but I think there's actually an even easier way uh actually is a question mark indeed question mark think back to this summary here of our first set of symbols whereby we had not just dot and star and plus but also a question mark which means literally zero or one repetitions which effectively means optional it's either there one or it's not zero now how can I translate that to this code here well let me go ahead and surround this part of my pattern with parentheses which doesn't mean I want literally a parenthesis in the user's

input I just want to group these characters together and in fact this now will still work I've only added parentheses around the new part for the subdomain let me run python of validate.py let me run malan@cs50.harvard.edu enter that's still valid but to be clear if I rerun it again for malan@harvard.edu that is still invalid but not if I go in here and say after the parentheses which now is one logical unit it's one big group of ideas together I add a single question mark there this will now tell

re.search that that whole thing in parentheses can either be there once or be there not at all zero times so what does this translate into when I run it well let me go ahead and rerun it with malan@cs50.harvard.edu so that the subdomain is there that works as before let me clear my screen and run it again python of validate.py with malan@harvard.edu which used to work then broke are we back in business now we are that's now valid again questions now on this approach where we've used not just the question

mark but the parentheses as well okay yeah you said it works for zero or one repetition what if you have more what if you have more that's okay that's where you could do star star is zero or more which gives you all the flexibility in the world yeah so I was just asking that uh with question mark there's only one repetition it means zero or one repetition so it's either not there or it is there and so that's why this pattern now if I go back to my code even though again it admittedly

looks cryptic let me highlight everything after the at sign and before the dollar sign this now represents a domain name like harvard.edu or a subdomain within the domain name why well this part to the right is the same as always \w+ means something like harvard or yale \.edu means literally .edu so the new part is this in parentheses I have another set of \w+\. now but it's all in parentheses I'm now having a question mark right after that which means that whole thing in parentheses either can be there or it can't

be there either of those is acceptable so a question mark effectively makes something optional it would not be correct to remove the parentheses because what would this mean if I remove the parentheses that would mean that only this dot is optional which isn't really what we want to express I want the subdomain like cs50 and the additional dot to be what's there or not there how about one other question on regexes here can they use this for the usernames absolutely we still have other problems right we're not solving all of the problems today just

yet but absolutely right now we are not letting you have a period in your username and again some of you with Gmail accounts or other accounts you probably have not just underscores numbers and letters you might have periods too well we could fix that not using question mark here per se but now that we have these parentheses at our disposal what I could do is this I could use parentheses to surround the \w to say any word character which is the same thing again as a letter or a number or an underscore but I could

also "or" in using a vertical bar something else like a literal dot now a literal dot needs to be escaped otherwise it represents any character which would be a regression a step back but now notice what I've done in parentheses I'm telling re.search that those first few characters in your email address that is your username have to be a word character like a through z uppercase or lowercase or 0 through 9 or an underscore or a literal dot we could do this differently too I could get rid of the parentheses and the or and

I could just use a set of characters I could again manually say a through z A through Z 0 through 9 underscore and then I could do a literal dot with a backslash period and now I technically don't even need the uppercase because I'm already telling the computer to ignore case I can just pick one or the other which one is better is really up to you whichever one you think is more readable would generally be the better design all right let me propose that I rewind this in time to where we left off
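As a rough sketch of the two username options just described, combined with the optional subdomain group: the names `GROUPED`, `BRACKETED` and `is_valid` are made up for illustration, and the `.edu` ending and the use of `re.IGNORECASE` are assumptions about where the pattern stood at this point in the lecture. The two patterns should behave the same for plain ASCII input.

```python
import re

# Option 1: parentheses plus a vertical bar, a word character OR a literal dot
GROUPED = r"^(\w|\.)+@(\w+\.)?\w+\.edu$"

# Option 2: a set of characters in square brackets (case handled by the flag below)
BRACKETED = r"^[a-z0-9_\.]+@(\w+\.)?\w+\.edu$"


def is_valid(email, pattern=GROUPED):
    """Return True if email matches the pattern, ignoring case."""
    return re.search(pattern, email.strip(), re.IGNORECASE) is not None


if __name__ == "__main__":
    for address in ["malan@harvard.edu", "david.malan@cs50.harvard.edu", "malan@harvard"]:
        print(address, is_valid(address), is_valid(address, BRACKETED))
```

Which form to keep is a readability call; the bracketed set avoids the extra capturing group that the parentheses introduce.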

which was here and let me propose that there are indeed still limitations of this solution not just with the username not just with the domain name we're still being a little too restrictive so would you like to see the official regular expression that at least browsers use nowadays whenever you type in an email address to a web form and the browser tells you yes or no your email address is syntactically valid ready ready here it is and this isn't even officially the right regular expression it's a simplified version that browsers use because

it catches most mistakes but not all here we go this is the regular expression for a valid email address at least as browsers nowadays implement them now it's crazy cryptic at first glance and it's wrapping onto many lines but it's just one pattern but just notice the now familiar symbols there is the caret symbol at the very top there is the dollar sign at the very end there is a square bracket over here and then some of these ranges plus other characters turns out you don't normally see these characters in email addresses it

looks like you're swearing at someone in their username but they're valid characters they're valid officially that doesn't mean that Gmail is going to allow you to put dollar signs and other punctuation in your username but officially some servers might allow that so if you really want to validate a user's email address you would actually come up with or copy paste something like this but honestly this looks so cryptic and if you were to type it out manually you are so likely to make a mistake what's the better solution here instead this is where past

week's libraries are your friend surely someone else on the internet a programmer more experienced than you even has come up with code that validates email addresses properly using this regular expression or even something more sophisticated than that so generally if the problem at hand is to validate input that is pretty conventional an email address a URL something where there's an official definition that's independent of you yourself find a popular library that you're comfortable using and use it in your code to validate email addresses this is not a wheel necessarily that you yourself should invent we've

used email addresses though to iteratively start from something simple too simple and build on top of that so you could certainly imagine using regular expressions still to validate things that aren't email addresses but are data that are important to you so we at least now have these building blocks now besides the regular expressions themselves it turns out there's other functions in Python's re library for regular expressions among them is this function here re.match which is actually very similar to re.search except you don't have to specify the caret symbol at the very beginning of

your regex if you want to match from the start of a string re.match by design will automatically start matching from the start of the string for you similar in spirit is re.fullmatch which does the same thing but not only matches at the start of the string but the end of the string so that you too don't need to type in the caret symbol or the dollar sign as well but let's go ahead and transition back now to some actual code whereby we solve a different problem in spirit rather than just validate

the user's input and make sure it looks the way we want let's just assume that the users are not going to type in data exactly as we want and so we're going to have to clean up their input this happens so often when you're using like a Google form or Office 365 form or anything else to collect user input no matter what your form question says your users are not necessarily going to follow those directions they might go ahead and type in something that's a little differently formatted than you might like now you certainly go

through the results and download a CSV or open the Google spreadsheet or equivalent in Excel and just clean up all the data manually but if you've got lots of submissions dozens hundreds thousands of rows in your data set doing things manually might not be very fun it might be much more effective to write code as in Python that can allow you to clean up that data and any future data as well so let me propose that we go ahead here and close validate.py and let's go ahead and create a new program altogether called

format.py the goal of which is to reformat the user's input in the format we expect I'm going to go ahead and run code of format.py and let's suppose that the data we're going to reformat is the user's name so not email address but name this time and we're going to hope that they type in their name properly like David Malan but some users might be in the habit for whatever reason of typing their name backwards if you will with a comma such as Malan comma David instead now it's fine because both are clearly as

readable to the human but if you want to standardize how those names are stored in your system perhaps a database or CSV file or something else it would be nice to at least standardize or canonicalize the format in which you're storing your data so that if you print out the user's name it's always the same format David Malan and there's no commas or backwards to it so let's go ahead and do something familiar let's go ahead and give myself a variable called name and set it equal to the return value of input asking the user

as we've done many times what's your name question mark I'm going to go ahead and proactively at least clean up some messiness as we keep doing here by just stripping off any leading or trailing whitespace just in case the user accidentally hits the space bar we don't want that ultimately in our data set and now let me go ahead and do this as we've done before let me just go ahead quickly and print out just to make sure I'm off to the right start hello and then in curly braces name so making an f

string to format hello comma name now let me go ahead and clear my screen and run python of format.py let me behave and type in my name as I normally would David space Malan enter and I think the output looks pretty good it looks as expected grammatically let me now go ahead though and play this game again but this time maybe because I'm not thinking or I'm just in the habit of doing last name comma first I do Malan comma David and hit enter all right well this now is weird even though the program

is just spitting out exactly what I typed in arguably this is not close to correct at least grammatically it should really say hello David Malan now maybe I could have some if conditions and I could just reject the user's input if they type a comma or get their names backwards somehow but that's going to be uh too little too late if the user has already submitted a form online and I already have the data and now I need to go in and clean it up and it's not going to be fun to go through

manually in Google Sheets or Apple Numbers or Microsoft Excel and manually fix a lot of people's names to get rid of the commas and move the first name before the last as is conventional in the US so let's do this it could be a little fragile but let's start to express ourselves a little programmatically here and ask this if there is a comma in the person's name which is pythonic I'm just asking the question is this shorter string in this longer string then let me go ahead and do this let me go ahead and grab

that name in the variable split on not just the comma but the space after assuming the human typed in a space after the comma and let me go ahead and store the result of that splitting of Malan comma David into two variables let's do last comma first again unpacking the sequence of values that comes back now let me go ahead and reformat the name so I'm going to forcibly change the user's name to be as I expect so name is actually going to be this format string first name then last name both in curly

braces but formatted together with a single space so that I'm overriding the user's input and updating my name variable accordingly for the moment to be clear this program is interactive like the users like me are typing their name into the program but imagine the data already is in a CSV file it came in from some process like a Google form or something else online you could imagine writing code similar to this but that maybe goes and reads that file into memory first maybe it's a CSV via csv.reader or a csv.DictReader and then

iterating over each of those names but we'll keep it simple and just do one name at a time but now what's kind of interesting in here is if I go back to my terminal window and clear it and run python of format.py and hit enter I'm going to type in David space Malan as before and I think we're still good but I'm also going to go ahead and do this python of format.py Malan comma David with a space in between crossing my fingers and hit enter and voila that now has been fixed such

a simple thing to be sure but it is so commonly necessary to clean up users' input here we see at least one way to do so pretty easily now to be fair there's some problems here and in fact can someone imagine a scenario in which this code really doesn't fix the user's input what could still go wrong even with this fix in my code any thoughts if they type in their name comma and then something else oh and then something else yeah so let me let me try this for instance um let me go

ahead and run the program and uh I am the only David Malan that I know but suppose I were uh let's say junior like this and it's common in English at least to sometimes put a comma there you don't necessarily need the comma but I'm one of those people who uses a comma that's now really really broken so I've broken some assumption there and so that could certainly go wrong here what else well let me go ahead and run this again and if I did Malan comma David no space because I'm being a little

sloppy I'm not paying attention which is going to happen when you have lots of users ultimately well this really broke now notice I have a ValueError an actual exception why well because split is supposed to be splitting the string into two strings by looking for the comma and a space but if there is no comma and space it can't split it into two things and the fact that I have two variables on the left but I'm only getting back one thing on the right means that I can't write this code quite like this so
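A minimal sketch of the split-based approach and the ways it breaks, wrapped in a hypothetical function `reformat_with_split` so it can be called repeatedly instead of prompting with `input` as the lecture's format.py does:

```python
def reformat_with_split(name):
    """Turn 'Last, First' into 'First Last' using only str methods."""
    name = name.strip()
    if "," in name:
        # Assumes exactly one ", " separator; anything else fails to unpack
        last, first = name.split(", ")
        name = f"{first} {last}"
    return name


if __name__ == "__main__":
    print(reformat_with_split("Malan, David"))      # reordered as intended
    print(reformat_with_split("David Malan, Jr."))  # reordered, but wrongly
    try:
        reformat_with_split("Malan,David")          # no space after the comma
    except ValueError as error:
        print("ValueError:", error)
```

The unpacking into `last, first` is what raises the exception: `split(", ")` returns a single-element list when the separator never appears.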

it's fragile to be sure but wouldn't it be nice if we could at least improve it for instance we now know some regular expressions syntax what if I at least wanted to make this space optional well I could use my new found regular expression syntax and put a question mark question mark means zero or one of the things to the left what's the thing to the left it's literally a space I don't even need parentheses if there's just one thing there so that would be the start of a pattern that says I must have a

comma and then I may or may not have a space zero or one spaces thereafter unfortunately the version of split that's built into the str variable as in this case doesn't support regular expressions if we want our regular expressions we need to go use that library here so let me go ahead and do this let me go in and leave this code as is but go up to the top now and import re to import the library for regular expressions and now let me go ahead and start changing my approach here I'm going to

go ahead and do this I'm going to use the same function called re.search and I'm going to search for a pattern that I think will be last comma first so let me use my newfound regular expression syntax and represent a pattern for something like Malan comma space David how can I do this well inside of my quotes for re.search I'm going to have something so dot plus then I'm going to have a comma then I'm going to have a space then I'm going to have something dot plus
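The pattern as dictated so far can be sketched like this; the helper name `looks_like_last_first` is an assumption for illustration, and the lecture's code works on a `name` variable read from `input` instead:

```python
import re


def looks_like_last_first(name):
    """Return True if name contains something, a comma, a space, then something."""
    # . is any character (except a newline) and + means one or more of it
    return re.search(r".+, .+", name) is not None


if __name__ == "__main__":
    for candidate in ["Malan, David", "David Malan", "Malan,David"]:
        print(repr(candidate), looks_like_last_first(candidate))
```

`re.search` returns a match object when the pattern is found and `None` otherwise, which is why comparing against `None` yields a plain true or false answer.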

now I'm going to preemptively refine this a little bit I want this whole pattern to start matching at the beginning of the user's input so I'm going to add the caret right away and I want the end of the user's input to be matched as well so I'll add a dollar sign so that I'm literally expecting any character one or more times then a comma then a space then any other character one or more times and then that is it and I'm going to pass in the name variable as before now when we've used re.search in the past we really

used it just to answer a question does the user's input match the following pattern or not true or false effectively but re.search is actually more powerful than that you can actually get back more information and you can do this you can specify a variable and then an assignment operator and get back more precise answers to what has been found when searched for but what is it you want to get back well it turns out there's this other feature of regular expressions which allows you to use parentheses not just to group things together but

to capture them it turns out when you specify parentheses in a regular expression unbeknownst to us up until now everything in the parentheses will be returned to you as a return value from the re. search function it's going to allow you to extract specific amounts of information from the user's own input you can reverse this process too by using the non-capturing version as well you can use parentheses and then literally a question mark and a colon and then some other stuff and that will say don't bother capturing this I just want to group things but

for now we're going to use just the parentheses themselves so how am I going to do this well if I want to get back the user's last name and first name I think what I want to capture is the dot plus here and the dot plus here so I've deliberately surrounded in parentheses the dot plus both to the left and the right of the comma not because I'm grouping them together per se I'm not adding a question mark I'm not adding another plus or a star I'm using parentheses now for capturing purposes why well I'm going

to do this next I'm going to still ask a Boolean question like if there are matches then do this so if matches is not effectively false like None I do expect I've gotten back some matches and watch what I can do now I can do last comma first equals matches.groups() and get back all of the groups of matches then go ahead and update name just like before with a format string and do first and then last in curly braces as well and then at the very bottom just like before print out for

instance hello comma name so the new code now is everything highlighted here I'm using re search to search for whether the user typed their name in last comma first format but I am more powerfully using re. search to capture some of the user input what's going to get captured anything I surround it in parentheses will be returned to me as return values how do you get at those return values you ask the variable to which you assigned them for all of the groups all of the groups of parentheses that were captured so let me go

ahead and do this let me go ahead now and run python of format.py enter and I'm going to type my name as usual in this case nothing happens with this if condition why because I did not type a comma and so this search does not find a comma so there are no matches so we immediately just print out hello name nothing interesting or new there but if I now go ahead and clear my screen and run python of format.py and do Malan comma space David enter we've reformatted my name well how did this

work let me be a little more explicit now it turns out I don't have to just say matches.groups I can get specific groups back that I want so let me change my code a little bit more let me go ahead now and just say this let's update name uh to well actually let's do this let's say that the last name is going to be in the matches but specifically group one the first name is going to be in the matches but specifically group two why one and two because this is the first set

of parentheses to the left of the comma this is the second set of parentheses to the right of the comma and based on the input this would be the user's last name in this scenario Malan this would be the user's first name David in this scenario that's why I'm using group one for the last name and group two for the first name and now I'm going to go ahead and say name equals uh f-string again uh first and then last done and let me refine this one last step before we take questions I don't

really need these variables if I'm immediately using them let's just go ahead and tighten this up further as we've done in the past for design sake if I want to make the name the concatenation of the person's first name and last name let's just do this matches. group two first plus a space plus matches. group one so it's just up to me to know from left to right this is group one this is group two so group one is last group two is first so if I want to flip them around around and update the

value of name I can explicitly get group two first concatenate using plus a single space and then concatenate on group one all right that was a lot let me pause to see if there are questions the key difference here is we're still using re. search the exact same way but now I'm using its return value not just to answer a question true or false but to actually get back specific matches anything I captured so to speak with parentheses why is it here we're using one and two instead of zero and one really good question caping

the first a good observation in almost every other context we've started counting at zero and one instead of one and two it turns out there's something else in location zero when it comes back from re.search related to the string itself so according to the documentation of this function only one is the first set of parentheses and two is the second set and onward from there just a different convention here other questions uh what if we write nothing like whitespace comma whitespace uh how we check um true of condition before I answer directly

let me just run this and make sure I've not broken anything further let me run python of format.py let me type in David space Malan the right way let me run it once more let me type it Malan comma David the wrong way that we're fixing and we're still good but I think it will still break let me run it a third time with Malan comma David with no space and now it's still broken why because I'm still looking for comma space now how can I fix that one way I could do that is to

add a question mark here which again is zero or one of the thing before so if I have a space and then a question mark literally no need for any parentheses then I can literally tolerate both Malan comma space David or Malan comma David so let's try again before this did not work let's do Malan comma David with no space now it does actually work so we can tolerate different amounts of whitespace if I am a little more precise with my formula let me go ahead and try once more let me very weirdly but

possibly hit the space bar a few too many times so now they're really separated this again is uh not going to work quite right because the second dot plus is going to consume all of that extra whitespace so now I might want to strip left and right any of the leading whitespace on the result or what I could do here is say this instead of zero or one I could use a star here so space star and now if I run this once more with Malan comma space space space David enter now we've cleaned up things further so
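Putting the capturing parentheses, `group(1)` / `group(2)`, and the space-star refinement together, here is a sketch under the assumption that the logic is wrapped in a hypothetical `reformat` function instead of reading from `input`:

```python
import re


def reformat(name):
    """Rewrite 'Last,First' with any number of spaces after the comma as 'First Last'."""
    name = name.strip()
    # Two capture groups: (1) everything before the comma, (2) everything after it.
    # " *" tolerates zero or more spaces after the comma.
    matches = re.search(r"^(.+), *(.+)$", name)
    if matches:
        # group(0) would be the whole matched string, so numbering starts at 1
        name = matches.group(2) + " " + matches.group(1)
    return name


if __name__ == "__main__":
    for raw in ["David Malan", "Malan, David", "Malan,David", "Malan,   David"]:
        print(f"hello, {reformat(raw)}")
```

Swapping `" *"` for `" ?"` would accept zero or one space only, which is the intermediate version discussed above.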

you can imagine depending on how messy the data is that you're cleaning up your regular Expressions might need to get more and more sophisticated it really depends on just how many problems we want to solve at once well allow me to propose that we Forge ahead further just to clean this up even more so using a feature that's actually relatively new to python itself it is very common when using regular Expressions to do exactly what I've done here to call a function like re. search with capturing parentheses inside such that you get back a return

value that I'm calling matches you could call it something else but I'm calling it by default matches and then notice on the next line I'm saying if matches wouldn't it be nice if I could just tighten things up further and do these all on the same line well you can sort of let me go ahead and do this let me get rid of this if and let me just try to say something like this if matches equals re search and then colon so combining my if condition into just one line instead of those two um
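For reference, a plain `=` inside an `if` is a syntax error in Python; the form that does combine the assignment and the test is the assignment expression operator `:=`, available from Python 3.8. A sketch, with the hypothetical function name `reformat_one_line` standing in for the lecture's top-level script:

```python
import re


def reformat_one_line(name):
    """Same behavior as the two-line version, using an assignment expression."""
    name = name.strip()
    # := assigns re.search's return value to matches, and the if then tests it
    if matches := re.search(r"^(.+), *(.+)$", name):
        name = matches.group(2) + " " + matches.group(1)
    return name


if __name__ == "__main__":
    print(f"hello, {reformat_one_line('Malan, David')}")
```

The logic is unchanged from the two-line version; only the number of lines differs.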

in C or C++ or Java you would actually do something like this surrounding the whole thing with parentheses sometimes double sets to suppress any warnings if you want to do two things at once if you want to not only assign the return value of re. search to a variable called matches but you want to subsequently ask a Boolean question is this effectively true or false that's what I was doing a moment ago let me undo this a moment ago I was getting back the return value and assigning it to matches and then I was asking

the question well it turns out this need to have two lines of code presumably rubbed people wrong for too long in Python and so you can now combine these two kinds of lines into one but you need a new operator you cannot just say if matches equals re.search and then a colon at the end you instead need to do this you need to do colon equals if and only if you want to assign something from right to left and you want to ask an if or an elif question on the same line this is

affectionately known as you can see here as the walrus operator and it's new to Python in recent years and it both allows you to assign a value as I'm doing from right to left and ask a Boolean question about it like I'm doing with the if or equivalently elif does anyone know why this is called the walrus operator if you kind of look at it like this perhaps if you're familiar with walruses it kind of sort of looks like a walrus so a minor detail but a relatively new feature of Python that honestly

you'll probably continue to see online and in source code and in textbooks and so forth increasingly so now that it does exist it does not change the logic at all if I run python of format.py and type Malan comma space David it still fixes things but it's tightened up my code just a bit more all right let's go ahead and look at one final problem to solve that of extracting information now as well so at this point we've now validated the user's input by checking whether or not it meets a certain pattern we've cleaned

up the user's input by checking against the pattern whether it matches or not and if it does match we kind of reorganize some of the user's information so we can clean up their input and standardize the format in which we're storing or printing it in this case let's do one final example where we're very specifically extracting information in order to answer some question so let me propose this let me go ahead and close format.py and create a new file called twitter.py the goal of which is to prompt users for the URL of their Twitter

profile and extract from it infer from that URL what is the user's username now why might you want to do this well one you might want users to be able to just very easily copy and paste the URL from their own Twitter profile into your form into your app so that you can figure out what their uh username is or you might have a form that asks the user for their Twitter username and because people aren't necessarily paying very close attention some people type their username some people type their whole URL or something else Al

together it would be nice now that you're a programmer to just be more tolerant of different types of input and just take on the burden of canonicalizing standardizing the data but being flexible with the users it's arguably a better user experience if you just let me copy paste or type in what I want you clean it up you're the programmer not me that lends itself to a better experience perhaps well let me go ahead and do this with twitter.py let me first go ahead and prompt the user here for a value for a variable that I'll

call url and just ask them to input the URL of their Twitter profile I'm going to go ahead and strip off any leading or trailing whitespace just in case users accidentally hit the space bar that's like literally the least I can do quite easily but now let's go ahead and do this suppose that the user's address is the following let me print out what they type in and let me clear my screen and run python of twitter.py I'm going to go ahead and type in for instance https://twitter.com/davidjmalan which happens

to be my own Twitter username for now we're just going to print it back onto the screen just to make sure I've not messed up yet okay so I've printed back out the exact same URL but the goal at hand is to extract the username only now let me just ask perhaps a straightforward question logically what do I need to do to get at the user's username well uh we just ignore what's before the username and then just extracts the username perfect yeah I mean it is as simple as that if you know the usernames

at the end well let's just somehow ignore everything at the beginning well what's at the beginning well it's a URL so we're probably going to need to ignore an https a colon slash slash a twitter.com and a slash so we just want to throw all of that away why because if it's a URL we know by how Twitter works that the username comes at the end so let's use that very simple idea to get at the information we want well I'm going to try this a few different ways let me go back into my program here and

instead of just printing it out which was just to see what's going on let me do this let me create a new variable called username and let me call url.replace it turns out that if url is a string or a str in Python it again comes with multiple methods like strip uh and split and others as well one of which is called replace and replace will do just that you pass it two arguments the first of which is what do you want to replace the second argument is what do you want to replace

it with so if I want to get rid of as I've proposed really just everything before the username that is the Twitter URL or the beginning thereof let's just say this go ahead and replace https://twitter.com/ close quote that's what I want to replace and comma second argument what do you want to replace it with nothing so I'm literally going to pass in quote unquote to effectively do a find and replace that's what the replace method does just like you can do it in Microsoft Word or Google Docs this is the programmer's way

of doing find and replace now let me go ahead and print out just the username so I'll use an f-string like this I'll say username colon and then in curly braces username just to format it nicely all right let me go ahead and clear my screen and run python of twitter.py enter URL here we go https://twitter.com/davidjmalan enter okay now we've made some progress done for the day right well what is suboptimal about this can anyone critique or find fault with my program it is working now but it's a little

fragile I bet we could contrive some scenarios where I think it works but it doesn't well I have a few ideas actually well first of all uh if you don't specify https it will be broken secondly if we have a slash at the end it also will be broken if we have like a question mark and something after the question mark it also won't work so a lot of scenarios oh my God I mean here we are I was pretending to think I was done but my God like Alex gave us a whole laundry list

of like problems and just to recap then what if it's not https it's http slightly less secure but I should still be able to tolerate that programmatically uh what if the protocol is not there what if the user just typed twitter.com/davidjmalan it would be nice to tolerate that rather than show an error and make me type in the protocol why it's not good user experience what if it had a slash at the end of the username or a question mark if you think about URLs you've seen on the web there's very commonly more

information especially if it's been shared on social media there might be HTTP parameters so to speak just stuff there that we don't want there could be a www.twitter.com which I'm also not expecting but does work if you go to that URL too so there's just so many things that can go wrong and even if I come back to my contrived example as earlier what if I run this program and say this uh my username is https://twitter.com/davidjmalan enter well that too just didn't really work it got rid of the URL actually
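A sketch of the `str.replace` approach and the inputs that trip it up, using a hypothetical helper `extract_with_replace` in place of the lecture's top-level twitter.py script:

```python
def extract_with_replace(url):
    """Remove every occurrence of the exact https Twitter prefix from url."""
    # replace() is a find-and-replace over the whole string, not just its start
    return url.strip().replace("https://twitter.com/", "")


if __name__ == "__main__":
    samples = [
        "https://twitter.com/davidjmalan",
        "http://twitter.com/davidjmalan",       # different protocol: left untouched
        "https://www.twitter.com/davidjmalan",  # www subdomain: left untouched
        "My username is https://twitter.com/davidjmalan",
    ]
    for sample in samples:
        print(f"Username: {extract_with_replace(sample)}")
```

Only the first sample yields a bare username; the others either pass through unchanged or keep the surrounding English sentence.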

okay actually that kind of worked but the goal here is to actually get the user's username not an English sentence describing the user's username so I would argue that even though I just accidentally created perfectly correct English grammar I did not extract the Twitter username correctly I don't want words like my username is as part of my input so how can we go about improving this and maybe chipping away at some of those problems one by one well let me clear my screen here let me come back up to my code and let me not

just replace it but let me do something else instead I'm going to go ahead and instead of using replace I'm going to use another function called removeprefix a prefix is a string or a substring that comes at the start of another so if I use removeprefix I don't need a second argument for this function I just need one what prefix do you want to remove so this will at least now fix the problem I just described of typing in like a whole sentence where the URL is there but it's not the beginning it's only

at the end so here this still is not correct but we don't create this weird looking output that just removes the URL part of the input uh my username is https://twitter.com/davidjmalan a moment ago it did remove the URL and left only the davidjmalan this is not perfect still but at least now it does not weirdly remove the URL and then leave the English it's just leaving it alone so maybe I could handle this better but at least it's removing it from the part of the string I might anticipate well what

else could we do here well it turns out that regular expressions just let us express patterns much more precisely we could spend all day using a whole bunch of different python functions like removeprefix or remove and strip and others and kind of make our way to the right solution but a regular expression just allows you to more succinctly if admittedly more cryptically express these kinds of patterns and goals and we've seen from parentheses which can be used not just to group symbols together as sets but to capture information as well we have a

very powerful tool now in our toolkit so let me do this let me go ahead and start fresh here and import the re library as before at the very top of my program I'm still going to get the user's URL via the same line of code but I'm now going to use another function as well it turns out that there's not just re.search or re.match or re.fullmatch there's also re.sub in the regular expression library where sub here means substitute and it takes more arguments but they're fairly straightforward the first argument to

re.sub is the pattern the regular expression that you want to look for then you have a replacement string what do you want to replace that pattern with and where do you want to do all that well you pass in the string that you want to do the substitution on then there's some other arguments that I'll wave my hands at for now among them are those same flags and also a count like how many times do you want to do find and replace do you want it to do all or do you want to do it

just once and so forth you can have further control there too just like you would in Google Docs or Microsoft Word well let me go back to my code here and let me do this I'm going to go ahead and call re not search but re.sub for substitute I'm going to pass in the following regular expression https://twitter.com/ and then I'm going to close my quote and now what do I want to replace that with well like before with the simple str replace function I want to replace it with nothing just get

rid of it altogether but what string do I want to pass in to do this to the URL from the user and now let me go ahead and assign the return value of re.sub to a variable called username so re.sub's purpose in life is again to substitute some value for a regular expression some number of times it essentially is find and replace using regular expressions and it returns to you the resulting string once you've done all those substitutions so now the very last line of my code can be the same

as before print and I'll use an f-string username colon and then in curly braces username so I can print out literally just that all right let's try this and see what happens I'll clear my terminal window run python of twitter.py and here we go https://twitter.com/davidjmalan cross my fingers and hit enter okay now we're in business but it is still a little fragile and so let me ask the group what problem should I now further chip away at they've been said before but let's be clear what's one or more

problems that still remain the protocols and the uh domain prefixes good the protocol so HTTP versus https maybe the subdomain www should it be there or not and there's a few other mistakes here too let me actually stay with the group what are some other shortcomings of this current solution um if we use a phrase like you did before we are going to have the same problem because it's not taking into account the first part of the text example good I might still allow for like some words uh some English to the left

of the URL because I didn't use like my caret symbol so I'll fix that and any final observations on shortcomings here uh well it could be HTTP or there could be like less than two slashes okay so it could be HTTP and I think that was mentioned too in terms of protocol there could be fewer than two slashes that I'm not going to worry about if the user gives me one slash instead of two that's really user error and I could be tolerant of it but you know what at that point I'm okay

yelling at them with an error message saying please fix your input otherwise we could be here all day long trying to handle all possible typos for now I think in the interest of usability or user experience UX let's at least be tolerant of all possible valid inputs or reasonable inputs if you will so let me go here and let me start chipping away at these here what are some problems we can solve well let me propose that we first address the issue of matching from the beginning of the string so let me add the caret

to the beginning and let me add not a dollar sign at the end though right because I don't want to match all the way to the end because I want to tolerate a username there so I think we just want the caret symbol there there's a subtle bug that no one yet mentioned and let me just kind of highlight it and see if it jumps out at you now it's a little subtle here on my screen I've highlighted in blue a final bug here maybe some smiles on the screen yeah can we take one hand

here why am I highlighting the dot in twitter.com even though it definitely should be there so the dot without a backslash means any character except a newline yeah exactly it means any character so I could type in something like twitter?com or twitter anything com and that would actually be tolerated it's not really that bad because why would the user do that but if I want to be correct and I want to be able to test my own code properly I should really get this detail right so that's an

easy fix too but it's a common mistake anytime you're writing regular expressions that happen to involve special symbols like dots in a URL or domain name a dollar sign in something involving currency remember you might indeed need to escape it with a backslash like this here all right let me ask the group about the protocol specifically https is a good thing in the world it means secure there is encryption being used so generally you like to see https but you still see people typing or copy pasting HTTP what would be the simplest fix here to

tolerate as has been proposed both HTTP and https I'm going to propose that I could do this I could do HTTP vertical bar https which again means A or B but I think I can be smarter than that I can keep my code a little more succinct any recommendations here for tolerating HTTP or https we could try to put a question mark behind the S perfect just use a question mark right like both of those would be viable solutions if you want to be super explicit in your code fine use parentheses and say HTTP

or https so that you the reader your boss your teacher just know exactly what you're doing but you know if you keep taking the more verbose approach all the time it might actually become less readable certainly once your regular expressions get this big instead of this big so let's save space where we can and I would argue that this is pretty reasonable so long as you're in the habit of reading regular expressions and know that question mark does not mean a literal question mark but it means zero or one of the thing before
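As a rough sketch of where the pattern stands at this point, the three fixes discussed so far (the caret anchor, the escaped dot, and the optional `s`) can be combined in one `re.sub` call. The function name `strip_prefix` is an assumption for illustration; the lecture's own file reads the URL with `input` and prints the result instead.

```python
import re

# Caret: match only at the start of the string.
# s?   : zero or one "s", so both http and https are accepted.
# \.   : a literal dot rather than "any character".
PATTERN = r"^https?://twitter\.com/"


def strip_prefix(url: str) -> str:
    """Replace a leading Twitter URL prefix with nothing (hypothetical helper)."""
    return re.sub(PATTERN, "", url.strip())


if __name__ == "__main__":
    for sample in (
        "https://twitter.com/davidjmalan",
        "http://twitter.com/davidjmalan",
        "My username is https://twitter.com/davidjmalan",
        "https://twitterXcom/davidjmalan",
    ):
        print(f"{sample!r} -> {strip_prefix(sample)!r}")
```

The last two inputs are left alone: the caret refuses a URL that is not at the very beginning, and the escaped dot refuses `twitterXcom`.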

I think we've effectively made the S optional here now what else can I do well suppose we want to tolerate the www dot which may or may not be there uh but it will work if you go to a browser I could do this www dot uh wait I want a backslash there so I don't repeat the same mistake as before but this is no good either because I want to tolerate www being there or not being there and now I've just required that it be there but I think I can take the same approach any

recommendations how do I make the www. optional just to hammer this home we can like group uh make a square and question mark perfect so question mark is the short answer again but we have to be a little smarter this time as Maria's noted we need parentheses now because if I just put a question mark after the dot that just means the dot is optional and that's wrong because we don't want the user to type in wwwtwitter we want the dot to be there or just not at all with

no www so we need to group this whole thing together put a parenthesis there and then a parenthesis not after the third W after the dot so that that whole thing is either there or it's not there and what else could we still do here you know there's going to be one other thing we should tolerate and it's been said before and I'll pluck this one off what about the protocol like what if the user just doesn't type or doesn't copy paste the http:// or an https:// right honestly you

and I are not in the habit generally of even typing the protocol anymore nowadays you just let the browser figure it out for you and automatically add it instead so this one's going to look like more of a mouthful but if I want this whole thing here in blue to be optional it's actually the same solution as Maria offered a moment ago I'm going to go ahead and put a parenthesis over here and a parenthesis after the two slashes and then a question mark so as to make that whole thing optional as well
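A minimal sketch of the pattern after these two refinements, assuming it is still passed to `re.sub` as in the earlier steps. Both the protocol and the `www.` subdomain are wrapped in parentheses and followed by `?`, and the optional `s` sits nested inside the optional protocol group. The helper name `strip_prefix` is hypothetical.

```python
import re

# (https?://)? : the whole protocol is optional, and so is the "s" inside it.
# (www\.)?     : "www." is either fully present or fully absent.
PATTERN = r"^(https?://)?(www\.)?twitter\.com/"


def strip_prefix(url: str) -> str:
    """Remove an optional protocol and optional www. before twitter.com/."""
    return re.sub(PATTERN, "", url.strip())


if __name__ == "__main__":
    for sample in (
        "twitter.com/davidjmalan",
        "www.twitter.com/davidjmalan",
        "http://twitter.com/davidjmalan",
        "https://www.twitter.com/davidjmalan",
        "wwwtwitter.com/davidjmalan",  # missing dot, so this is not accepted
    ):
        print(f"{sample!r} -> {strip_prefix(sample)!r}")
```

Grouping matters here: `www\.?` would make only the dot optional, whereas `(www\.)?` makes the whole `www.` optional as a unit.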

and this is okay it's totally fine to make this whole thing optional or inside of it this little thing just the S optional as well so long as I'm applying the same principles again and again either on a small scale or a bigger scale it's totally fine to nest one of these inside of the other questions now on any of these refinements to this parsing this analyzing of Twitter what if we put a vertical bar beside this www dot what if we use a vertical bar there so we could do something like that too
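The vertical-bar variant raised in this question can be compared against the question-mark version with a small sketch. It assumes the alternative is written as `(www\.|)`, that is, "`www.` or nothing"; the constant names and the `samples` list are invented for illustration.

```python
import re

# Two ways of saying "www. may or may not be there".
WITH_QUESTION_MARK = r"^(https?://)?(www\.)?twitter\.com/"
WITH_EMPTY_BRANCH = r"^(https?://)?(www\.|)twitter\.com/"  # "www." or nothing

samples = [
    "https://twitter.com/davidjmalan",
    "https://www.twitter.com/davidjmalan",
    "twitter.com/davidjmalan",
    "https://www.google.com/",
]

# Apply both patterns to every sample and record the two results side by side.
results = [
    (re.sub(WITH_QUESTION_MARK, "", s), re.sub(WITH_EMPTY_BRANCH, "", s))
    for s in samples
]

if __name__ == "__main__":
    for sample, (first, second) in zip(samples, results):
        print(f"{sample!r}: {first!r} / {second!r} / same={first == second}")
```

For these inputs the two spellings give identical results, so the choice between them is mostly about which one reads more clearly.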

we could do something like this uh instead of the question mark I could do www. or nothing and just leave that in the parentheses that too would be fine I personally tend not to like that because it's a little less obvious to me wait a minute is that deliberate or did I forget to finish my thought by putting something after the vertical bar but that too would be allowed there as well if that's what you mean other questions on where we left things here where we made the protocol optional too what if we have

a parenthesis and then inside we have another parenthesis and another parenthesis interfere with each other if you have parentheses inside of parenthesis that too is totally fine and indeed that should be one of the reassuring lessons today as complicated as each of these regular Expressions has admittedly gotten I'm just applying the exact same principles and the exact same syntax again and again so it's totally fine to have parenthesis inside of parentheses if they're each solving different problems and in fact the lesson I would really emphasize the most today is that you will not be happy

if you try to write out a whole complicated regular expression all at once like if you're anything like me you will fail and you will have trouble finding the mistake because my God look at these things they are even to me all these years later cryptic the better way I would argue whether you're new to programming or as old to it as I am is to just take these baby steps these incremental steps where you do something simple you make sure it works you add one more feature make sure it works add one more feature

make sure it works and hopefully by the end because you've done each of those steps one at a time the whole thing will make sense to you but you'll also have gotten each of those steps correct um at each turn so please do avoid the inclination to try to come up with long sophisticated regular expressions all at once because it's just not a good use of your time if you then stare at it trying to find a mistake that you could have caught if you did things more incrementally instead all right there still remains arguably

at least one problem with this solution in that even though I'm calling re.sub to substitute the URL with nothing quote unquote I then on my final line of code line six am just blindly assuming that it all worked and I'm going to go ahead and print out the username but what if the user if I clear my screen here and run python of twitter.py doesn't even type a Twitter URL what if they do something like https://www.google.com like completely unrelatedly for whatever reason enter that is not their Twitter username so we need to have some

conditional logic I would argue so that for this program's sake we're only printing out or in a backend system we're only saving into our database or a CSV file the username if we actually matched the proper pattern so rather than use re.sub which is useful for cleaning up data as we've done here to get rid of something we don't want there why don't we go back to re.search where we began today and use it to solve the same problem but in a way that's conditional whereby I can confidently say yes or no at the

end of my program here's the username or here it is not so let me go ahead now and I'll clear my terminal window here I'm going to keep the first two lines the same where I import re and I get the URL from the user but this time let's do this let's this time search using re.search instead of re.sub for the following I'm going to start matching at the beginning of the string https question mark to make the S optional colon slash slash

then I'm going to make my uh www uh optional by putting that in parentheses with a question mark there then a twitter.com with a literal dot there so I'll stay ahead of that issue too then a slash and then well this is where davidjmalan is supposed to go how do I detect this well I think I'll just tolerate anything at the end of the URL here all right dollar sign at the very end close quote for the moment I'm going to stipulate that we're not going to worry about question marks at the end or hashes like

for fragment IDs in URLs we're going to assume for simplicity now that the URL just ends with the username alone now what am I going to do well I want to search for this URL specifically and I'm going to ignore case so re.IGNORECASE applying that same lesson learned from before re.search recall will return to you the matches you've captured well what do I want to capture well I want to capture everything to the right of the twitter.com URL here so let me surround what should be the user's username with parentheses

not for making them optional but to say capture this set of characters now re.search recall returns an answer matches will be my variable name again but I could call it anything I want and then I can do this if matches now I know I can do this let's print out the format string username colon and then uh what do I want to print out well I think I want to print out matches.group(1) for my matched username all right so what am I doing just to recap line one I'm importing the library line

two I'm getting the URL from the user so nothing new there line five I'm searching the user's URL as indicated here as the second argument for this regular expression this pattern I have surrounded the dot plus with parentheses so that they are captured ultimately so I can extract in this final scenario the user's username if I indeed got a match and matches is not None it is actually containing some match then and only then print out username in this way let me try this now if I run python of twitter.py and type in https://www.google.com

now nothing gets printed so I've at least solved the mistake we just saw where I was just assuming that my code worked now I'm making sure that I have searched for and found the Twitter URL prefix all right well let's run this for real now python of twitter.py https://twitter.com/davidjmalan but note I could use HTTP I could use www I'm just going to go ahead here and hit enter huh None what has gone wrong this one's a bit more subtle but why does matches.group(1) contain nothing wait a minute let me maybe

maybe I did this wrong maybe do we need the www let me run it again so here we go http:// let me add it www.twitter.com/davidjmalan all right enter ho what is going on we have to say it's group two I have to say group two well wait all right because the subdomain was optional and to make it optional I needed to use parentheses here and so I then said zero or one okay so that means that actually I'm unintentionally but by design capturing the www dot or none of

it if it wasn't there before but I have a second match over here because I have a second set of parentheses so I think yep let me change matches.group(1) to matches.group(2) and let's rerun this python of twitter.py https://www. let's do this uh twitter.com/davidjmalan enter and now we've got access to the username let me go ahead and tighten it up a little bit further uh if you like uh our new friend uh it's hard not to like if we like our old friend the walrus uh operator let's go

ahead and add this just to tighten things up let me go back to VS Code here and let me get rid of the unnecessary condition there and combine it up here if matches equals that but let's change the single assignment operator to the walrus operator now I've tightened things up further but I bet I bet I bet there might be another solution here and indeed it turns out that we can come back to this final set of syntax recall that when we introduced these parentheses we did it so that we could do a or b

for instance with the vertical bar though you can even combine more than just one bar we use the group to combine ideas like the www dot and then there's this admittedly weird syntax at the bottom here up until now not used there is a non-capturing version of parentheses if you want to use parentheses logically because you need to but you don't want to bother capturing the result and this would arguably be a better solution here because yes if I go back to VS Code I do need to surround the www dot with parentheses at least

as I've written my regex here because I wanted to put the question mark after it but I don't need the www dot coming back in fact let's only extract the data we care about just so there's no confusion down the road for me or my colleagues or my teachers so what could I do well the syntax per this slide is to use a question mark and a colon immediately after the open parenthesis it looks weird admittedly those of you who have prior programming experience might recognize the syntax from ternary operators doing an if else

all in one line a question mark colon at the beginning of that parenthetical means yes I'm using parentheses to group these things together but no you do not need to capture them instead so I can change my code back now to matches.group(1) I'll clear my screen here run python of twitter.py I'll again run here uh https://twitter.com/davidjmalan with or without the www and now I indeed get back that username any questions then on these final techniques so first of all could we move the caret right at the beginning of Twitter and then just

start reading from there and then get rid of everything else before that the kind of www issues that we had and then my second question is how would we um use kind of I guess either a list or a dictionary to sort the .com kind of thing because we have .co.uk and that kind of stuff how would we bring that into uh the re function a good question but no if I move the caret before twitter.com and throw away the protocol and the www then the user is going to have to

type in literally twitter.com/username they can't even type in that other stuff so that would be a regression a step back as for the .com and the .org and .edu and so forth the short answer is there's many different solutions here if I wanted to be stringent about .com and suppose that Twitter probably owns multiple domain names even though they tend to use just this one suppose they have uh something like .org as well you could use more parentheses here and do something like this com or org I'd probably want to go in and add a question

mark colon to make it non-capturing because I don't care which it is I just want to tolerate both alternatively we could capture that we could do something like this where we do dot plus so as to actually capture that and then we could do something like this if matches.group(1) now equals equals com then we could support this so you could imagine factoring out the logic just by extracting the top level domain or TLD and then just using python code maybe a list maybe a dictionary to validate elsewhere outside of the regex if it's

in fact what you expect for now though we kept things simple we focused only on the .com in this case let's make one final change to this program so that we're being a little more specific with the definition of a Twitter username it turns out that we're being a little too generous over here whereby we're accepting one or more of any character I checked the documentation for Twitter and Twitter only supports letters of the alphabet A through Z numbers 0 through 9 or underscore so not just dot which is literally anything so let me go

ahead and be more precise here at the end of my string let me go ahead and say this set of symbols in square brackets I'm going to go ahead and say a through z uh 0 through 9 and an underscore because again those are the only valid symbols I don't need to bother with an uppercase A or a lowercase a because we're using re.IGNORECASE over here but I want to make sure now that I tolerate not only one or more of these symbols here but also maybe some other stuff at the end of

the URL I'm now going to be okay with there being a slash or a question mark or a hash at the end of the URL all of which are valid symbols in a URL but I know from Twitter's documentation are not part of the username all right now I'm going to go ahead and run python of twitter.py one final time typing in https://twitter.com/davidjmalan maybe with maybe without a trailing slash but hopefully with my biggest fingers crossed here I'm going to go ahead now and hit enter and thankfully my username is indeed

davidjmalan so what more is there in the world of regular expressions and this library not just re.search and also re.sub there's other functions too there's re.split via which you can split a string not using a specific character or characters like a comma and a space but multiple characters as well and there's even functions like re.findall which can allow you to search for multiple copies of the same pattern in different places in a string so that you can perhaps manipulate more than just one so at the end of the

day now you've really learned a whole other language like that of regular expressions and we've used them in Python but these regular expressions actually exist in so many languages too among them JavaScript and Java and Ruby and more so with this new language even though it's admittedly cryptic when you use it for the first time you have this newfound ability to express these patterns that again you can use to validate data to clean up data or even extract data and from any data set you might have in mind that's it for this week we

will see you next time
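As a recap, here is a sketch of how the final version of the program might look once the pieces from this walkthrough are combined: `re.search` with the walrus operator, a non-capturing `(?:...)` group for `www.`, a character class for the username, and `re.IGNORECASE`. Wrapping the logic in a function named `extract_username` is an assumption made so the example can run without `input`. The `re.split` and `re.findall` calls at the bottom, along with their sample strings, are invented illustrations of the two functions mentioned in closing.

```python
import re
from typing import Optional

# (?:www\.)?   : group "www." so it can be optional, without capturing it,
#                which leaves the username as group 1.
# ([a-z0-9_]+) : one or more letters, digits, or underscores (the username).
# No trailing $, so a slash, "?" or "#" after the username is tolerated.
PATTERN = r"^https?://(?:www\.)?twitter\.com/([a-z0-9_]+)"


def extract_username(url: str) -> Optional[str]:
    """Return the username from a Twitter URL, or None if it does not match."""
    if matches := re.search(PATTERN, url.strip(), re.IGNORECASE):
        return matches.group(1)
    return None


if __name__ == "__main__":
    for sample in (
        "https://twitter.com/davidjmalan",
        "http://www.twitter.com/davidjmalan/",
        "https://twitter.com/davidjmalan?lang=en",
        "https://www.google.com/",
    ):
        print(f"{sample!r} -> {extract_username(sample)!r}")

    # re.split: split on a comma or semicolon followed by optional whitespace.
    print(re.split(r"[,;]\s*", "a, b;c"))

    # re.findall: collect every capture of the pattern found in the string.
    print(re.findall(r"@([a-z0-9_]+)", "thanks @davidjmalan and @cs50", re.IGNORECASE))
```

Returning `None` for a non-match keeps the "yes or no" decision with the caller, in the same spirit as printing the username only when `re.search` finds the pattern.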