Hello, everybody, and welcome to chapter one of Python for Everybody. I'm Charles Severance. I'm your instructor, and I welcome you to this class. The basic goal of this class is to teach everybody how to program, regardless of your background. You don't have to be a math whiz. You don't have to be a computer expert. No matter how old you are or what your background is, we want to teach you how to program. So welcome to the course. Welcome to chapter one. So the first thing to understand is that the purpose to learn to program is
because computers want to do things for us. They are built and created and designed, and their hardware is set up so that they basically ask us, what do you want to do next? If you grab your phone, your phone sort of does nothing until you tell it what to do. It waits for you, and it's just waiting for you. And all the hardware computer technology around you is generally waiting for you. And we can use this for useful things. We could play video games. We could have it help navigate for our cars. Someday we might
even have self-driving cars. And it's really, in a sense, in my mind, silly if you spend your whole life not really understanding this technology. And I think it's important that we learn to tell these computers what to do, rather than just let them increasingly control our lives. And so as we'll see, computers aren't very smart on their own. We humans are the ones that imbue them with knowledge. And we need to learn to speak their language. It is much easier for us to learn to speak their language than it is for them to learn
to speak our language. Although with these cell phones, we're starting to see little bits where they can begin to understand. But you would be amazed at the 40 or 50 years that it has taken us to build programs to begin to understand. So I'm bringing you into something where you are going to learn the ways of programming and the ways of the computer. Because it's easier to teach you how to program than it is to teach this, how to work in your world. Even though, ultimately, the goal is to get this to do work for
you. So part of what I'm trying to do is move you from a user perspective, where you just look at the computer as something that someone else has constructed and you are the user of, to the point where you construct new things. Now the first kinds of things that you're going to construct are actually things to solve your own problems. And it's very popular now to work on data. And Python is an excellent programming language for data mining and data analysis. And that's a lot of what we're going to do in this course. Although really,
it's a gateway to all kinds of things, like artificial intelligence, or gaming, or navigation, or mobile applications, or entertainment, all kinds of things. But first, we have to learn to program. We have to move from using the computer as a tool to using the tools within the computer that allow us to change how the computer sees the world. So there's a couple of reasons that you might want to be a programmer. Some of you are looking to improve your career, to be paid to work on programming. I've been a paid programmer most of my life,
and I like it. It's a good job. You don't have to stand in the mud. You don't have to lift things. You have to use your brain. And I'll just say that it has been nice for my career to not be exposed to the elements, but to be able to work often wherever I want. But that's actually our secondary goal. Our first goal is to get you to write programs that solve problems that you have to solve. Maybe you have a job as an accountant, or a lawyer, or something else. And maybe you run across
some data. Maybe there's some system that logs your time, and it's not quite giving the report that you want to give. And so you say, could I just grab the log data myself and write a program to do some analysis to say, well, what's the average this versus that, or the average of some other thing? And so that's the basic idea, that you'll initially use computers to serve your own ends. That makes it a lot easier to write programs, because you don't have to worry about a million users using your software. If it works for
you, then we're happy. And so it takes a little more training to write software for other people, or for thousands and thousands of other people. And so part of what I want to do is I want to change your perspective. You look at this from the outside, and you see it from the outside, and you click on things. I want to turn this around, and I want you to be the person inside this looking out at the world. And as a programmer, we are making things inside these computers for the world. And so we want
to pull you into being part of this. We want you inside this, or thinking inside this. And what you learn is that if you're inside this computer, and you are taking your instructions to build programs to be used by the human outside the computer, you have things that you need to take advantage of. There's things like the central processing unit, the memory of this system, the network connection of this system, the disk drive, or permanent storage on this system. And as a programmer, you are kind of mediating between all those internal resources that this has
that are not very smart, but highly powerful, and mediating with what that user wants. And so we take the end user, and we programmers, we serve the end user, but the computer serves us. So together, between us and all the computer's resources, we can serve the needs of the end user. And we do this by writing code, or programming. And what is that? Well, programming is a sequence of instructions where we are giving instructions to the resources inside the computer in a way to accomplish the goals of the end user. And remember, sometimes we are
our own end user. You're not always doing a startup. You're not always writing a mobile gaming system. So sometimes you're writing something for yourself. But that's OK. So sometimes you're writing something to solve a problem. You're like crafting. You're doing something that you could do by hand or manually. And you're making some clever little 25- or 100-line program. And you're putting that in. Other times, like when I work on the open source learning management system, Sakai, it is my creativity. I've got an idea. And I want to share it with a
million users. And so I write my code for an external audience. And so code is that sequence of instructions. The computer itself doesn't know how to hand a roster out. But I can write code that will hand a roster out by looking at the data that's inside this computer, inside this application. And so if you think about programs, we have programs for computers and programs for humans. And a number of years ago, now I'm starting. Sooner or later, this will be me showing my age. This is an example of the Macarena. And the Macarena
is a song that effectively is a sequence of instructions. You put your left hand out. You put your right hand out. You put it on the shoulder. You wiggle, wiggle, wiggle. And you spin around. And you do things. And this is a program for people. And so I want you to take a quick look at this and see if you can find anything wrong with this particular program. So look really closely. So I'll show you. It's got some typographical errors in it. And we as humans are really good at reading or hearing typographical errors and
correcting them automatically and instantly. But computers are not. Computers are extremely literal. If it saw this ham instead of hand, it would think, what's a ham? And why am I going to hit someone in the back of the head with a ham? And why would I take my left hand and hit somebody? These are all bad things. But the computer is going to take us very literally. And so we have to be really precise. And the computer just doesn't know the difference between what we mean and what we say. So we have to be very
precise. And this is one of the great frustrations that people have when they first start using computers. And so we have to get this right. We have to get these little bits of text exactly the way they are. Computers will blow up with syntax errors. And they seem to make quite a fuss when you make the tiniest of errors. But you'll get used to that. I mean, that's not because you're bad or you're less than awesome. It just means the computers can't compensate when you make small mistakes. And so you've got to get used to
the fact that the computer is sort of intellectually not as strong as you. And so it gets confused really easy. Even though when it gets confused, it says seemingly mean things to you. So you'll get used to that. OK, so the first thing I want to do is I want to throw up some text. And I want you to, while this text is up, I want you to count the number of each word in this text and tell me what the most common word is in this text. OK, so here we go. OK, so I
kind of made that hard on you on purpose by moving around and distracting you and confusing you. But even if it's not moving at all, it's a little bit tricky to do. You probably stare at it a couple of times. Your brain is going back and forth and back and forth. And so text analysis is one of the great things that computers are very, very good at. And one of the things that they can do is translate text, and that's because they've looked at a lot of information. So looking at text is actually something computers are really
good at. And so if we take a look at the kind of programs that we're going to write to do this kind of thing, this is something that humans are not naturally good at, but computers are super good at. Now, I'm not going to have you look at this code. I'm not going to, this code you will understand in a few weeks. But basically, this is a set of instructions to open a file, read that file, read all the words in the file, create a histogram of all the words in the file, and then search
through that histogram to find the most common word and tell us what the most common word is in the file. And in this clown file, the word "the" is the most common. It happened seven times. And here's another large file called words.txt, and the word "to" is the most common thing. And our goal is to get to the point where you can write this on your own. So you can say, you know what, I got a problem to solve. That is, what's the most common word in this file? I know how to start, and then
I know how to finish. I know how to do the stuff in the middle. And we have to learn this kind of weird language. But when we do, we can count millions of words as easily as we count 20 words. So that's the fun of all of this, is to teach you this language so that you can solve that problem so that you don't have to solve it. Because you could solve it, but it's not something that you're naturally good at, and it's hard work. So up next, we're gonna talk a little bit about the
hardware architecture that you're gonna be experiencing as you write programs.

Hello, and welcome back to hardware architecture. Now, you might ask, why do I tell you about hardware architecture? Probably you're not gonna build any hardware, although it's fun stuff to do. And if you're gonna become a computer scientist, which most of you won't want to be, it's a great thing to study. And those who build our hardware are amazingly talented individuals, and it's a really rewarding job. The reason I like talking to you about hardware is because I want to be able to use
words at some point and say, oh, secondary storage, or central processing unit, or random access memory, or peripherals, input devices. And I wanna be able to say those words, and I want you to be able to understand them. And so I'll start with a little piece of hardware called the Raspberry Pi. And the Raspberry Pi is a cute little single board computer. As we go forward, these things are smaller, and smaller, and smaller. And the interesting thing is that the architecture of these stays the same, but the number of components drops. So I'm gonna start
and give you a block diagram of sort of a generic computer and tell you the major parts of it. Now, I'm gonna show you some really old hardware, some really new hardware, and then some hardware that is of medium age. And the medium age hardware is probably the easiest one to see. The architecture is the same, okay? And so the basic block diagram is that the brains, if there are brains in computers, which there really aren't, the software is the closest thing computers have to brains, but in hardware, the closest brain a computer has is
this, called a microprocessor, or a central processing unit. And this is designed to ask the question, three billion times a second, what do you want me to do next? And these little pins on the back are how instructions get in, like 32 or 64 of these pins, and three billion times a second, we send an instruction into these things. Now, we can't sit there and talk to it, we can't. And so the instructions we store in what's called the main memory. And this memory is really fast, and the memory sort of feeds this. And so every time the
CPU needs a new instruction, it asks the memory where that instruction is. And so the memory feeds the instruction to the CPU, the CPU does it, says give me another instruction, the CPU does it, says give me another instruction, and that is the basic essence of programming. This asks what's next, and this is where your program is stored, or a program you purchased or came with your hardware, where that's all stored, and those are your places. And so you end up inside, your programs end up inside this memory. So then there's a, I mean, and so in software you
tend to program the CPU, and if you had bought a desktop computer a number of years back, it would have this thing called the motherboard. And the motherboard is called this because it kind of connects all the components together. And so if you buy memory by itself, it does nothing, but it has a place to plug into the motherboard, and if you buy a microprocessor, it has a place to plug into the motherboard. And if you buy a hard drive, this is a really old hard drive, it has a place to plug in on the
motherboard, and so the motherboard sort of connects everything together. The hard drive is secondary storage. Now the way, how secondary storage is different than the main memory, which, there it is. I gotta unpile this stuff. So this main memory is really fast, but as soon as you turn the power off of this memory, it sort of vanishes. And so to store files like word processing files or text files or whatever, you gotta store it on something that lasts a little bit longer. And so that's the purpose of the secondary storage. It's permanent. When the power's
off, it stores it. Now this one here is in such bad shape that it probably isn't storing anything, but it's got these little heads, and it spins around and goes in and out. And we'll have a video later that shows you one of these things that's not quite in as bad a shape. If you look, this has four different platters that are all spinning around. And so this is just using magnetic material and electronics that sort of magnetize and demagnetize this stuff. And if you look at a disk, they're often rated, physical disks, are rated in
revolutions per minute. And that's how many times this thing spins around. And if you got an old desktop and you hear it spin up, this is the thing that's spinning. And it's the place that your operating system lives, your files live, your applications live, while they're stored and while the computer's turned off. And then they're loaded into this while they're running. And then, this CPU takes the data from the main memory, and your program runs at three billion operations per second. So, let's talk a little bit about something that this is probably from the 1960s
or 70s. This actually has, if you're an electrical person, it has capacitors, those little silver things are capacitors. These little colored things are resistors, and that's more capacitors. And then there's wires, and wires move everything. And so, when you say like this has millions of transistors, oh wait, that isn't a capacitor, that's a transistor. That's a transistor. When you say that this here has etched, and if you look closely at this, go look at a picture of a microprocessor online, you will see that it has millions of these. And so, the difference between 1960 and
today is that this circuitry of capacitors, resistors, and transistors has been miniaturized and put onto this. It's using a photographic process, and they're making them tinier and tinier and putting more and more on. And if you think going from millions of these to one of these is crazy, the thing that's happening now, and the reason we have whole computers inside our pocket, is that everything, all of this, this whole thing, CPU, memory, everything, all of it connected, and the storage is being made smaller and smaller. And so, this little single board computer called the Raspberry Pi has one
thing in it, and it has the main memory, and it has the CPU, it has connections for things like peripherals, like keyboards and stuff. Now, it doesn't yet have secondary storage on it. The secondary storage gets plugged in right here via USB. And then if you take it one step farther to my phone, it's got the secondary storage built right in. And so, this picture goes from the size of cabinets in the old days all the way down to really tiny. But, at the end of the day, inside it is a highly sophisticated piece of
circuitry that asks for instructions one at a time, and main memory that holds the instructions and feeds them, okay? Central processor does the thinking, take a look here, central processor does the thinking, it runs the program, it's asking what's next, it's not really smart, but it's really fast. And so, we compensate for the lack of intelligence of this thing by us writing really good software that runs really fast. And so, voice recognition on things like phones is possible because computers have so much storage and they run so fast and the algorithms that do voice recognition
are finally starting to work. Input devices like keyboards and mice and pens and whatever, they come in, output devices are like the screens that we see, the main memory is the fast part of the computer that stores all the programs, and the secondary memory is the permanent storage. Increasingly, secondary memory, do I have any USB sticks in here? I don't. Well, increasingly secondary memory is flash RAM or static RAM with no moving parts. And so, in a few years you'll not even be able to see secondary memory with moving parts. But that's okay, it's still
secondary memory, it's still memory that lasts. And so, you and where your place is in here is you live in the main memory. This is you, you are here. And so, in a sense, when the CPU asks the question what next, it is your job to answer that. And you answer that by writing Python code. And so, your Python code, you'll write a file in Python code. Blah, blah, blah, blah, blah, blah, blah. And then that Python code sort of gets loaded into main memory, there's a magic translation process that happens. And then your code
is actually answering this question three billion times a second. Three billion times a second, you're sitting there. But this is you. You're really out here, but you then write a file and the file's loaded in and then the file runs. And that's where things are at. And that's your place in the world. Now, what's actually running is not Python code. There is, as I said, a translation process. You write a Python file and then Python itself translates this into the actual language known by the microprocessor, which is a series of zeros and ones called machine
language. Someday I would love to teach you a class on machine language. But for now, we're gonna teach you Python and we're gonna use Python as a crutch. We don't have to talk machine language, but you could, if you really wanted to, you could know how to write machine language. But I assure you, Python is far easier to learn than machine language. So, Python acts as a translator, translates what you're doing into machine language, and then the machine language is what's sent back and forth. But still, even though it's translated to machine language, it's you.
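As a rough illustration of that translation step, here is a minimal sketch, assuming the standard CPython interpreter. Strictly speaking, CPython translates your text into an intermediate form called bytecode, and the interpreter (itself a machine-language program) then carries it out, so this shows the idea of translation rather than the literal zeros and ones sent to the microprocessor. The variable names are made up for this sketch.

```python
import dis

# A one-line Python "file", held in a string for this sketch.
source = "print('Hello world')"

# compile() performs the translation step: text in, code object out.
code = compile(source, "<sketch>", "exec")

# The translated instructions are stored as raw bytes.
print(type(code.co_code), len(code.co_code) > 0)

# dis prints a human-readable listing of those low-level instructions.
dis.dis(code)

# Running the translated code is what finally answers "what next?".
exec(code)
```

The exact listing that dis prints differs from one Python version to the next, which is one more reason you write Python and let the translator worry about the low-level form.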
It is you answering those questions and that's what a program is, is you pre-storing your response to the what next question over and over again. So, here's a couple of videos that you can look at on YouTube about a CPU. These CPUs, and it looks very much like this CPU that I've got with me, these CPUs run extremely hot. When you put your computer on your lap and it starts to heat up, that means it's thinking really, really hard. And so, this is a small little old video from a long time
ago that shows what happens when you take out the cooling capability of microprocessors and just how hot they can be. And the other video that I have is a hard disk. Something like this hard disk that I have except that it works and they turn the power on. Some of them last for a few seconds, some of them last for a few minutes. It's never a, achoo! Achoo! I must be allergic to this hard drive. Or maybe it's because there's dust in this hard drive and I keep spinning it and I sneeze. But basically, some
of them last for a few seconds, some of them last for a few minutes. It's not a good idea to open them up, but I'm glad somebody opened it up and then did what they did and then recorded it so we can all enjoy what it is that they're capable of doing, okay? So that's a quick introduction to hardware, mostly so that I can use those words going forward. Now, what we're gonna talk about next is communicating in the language Python. That is, writing code and putting it into the computer so that that can execute,
okay?

And welcome to my video that shows how to get started and install Python on Microsoft Windows, okay? So it's not too hard. We're gonna both install Python 3 and we're going to install a text editor. And so I'm just gonna go into Google and I'm gonna type install Python 3. And my top link is downloading Python. And there is my link for downloading Python 3.5.2. This version of my class uses Python 3. I have an earlier class that you may have seen that uses Python 2, but in this class, we're going to do this. Now, it
might take you a while to download this. I've actually already downloaded it. Now, the other thing we need is a programmer text editor. And you can really use any programmer text editor. We've used Notepad++ in the past. We've used jEdit in the past. I like Atom, atom.io, A-T-O-M dot io, mostly because it works the same on Windows and Mac and Linux. But you can really use any text editor that you like. Just don't use Word or TextEdit that comes with the operating system. You need a programmer's editor that doesn't mess with weird characters or weird lines
or strange formats. You must have a real programmer editor. And so I've already downloaded this as well. And so I won't waste the time waiting to download it, but let's go ahead and do the installation. So these things ended up in my downloads folder. So I'm going to downloads. And I'll start installing Python 3.5.2. Now it's gonna ask me some things. Add Python 3.5 to the path. And that's a good idea. Install the launcher for all users. I'm going to add that. Maybe you will, maybe you won't do that. It's gonna tell me where it's
going to install it. Install now. Of course, it's going to ask me for permission to do these things. And now it's running through the installation. Okay, so there we go. You could maybe click on this online tutorial and documentation. But we're just gonna close this. And I'm gonna start and run the Windows command line. Now, you may have all kinds of fancy ways to run Python, but I like running the command line, C-O-M-M-A-N-D. I like running the command line because after a while, it's important to know what folder things are being run in. And so
here's this command line. And I should be able to type python here. And so now I'm in Python 3.5.2. And this is, the chevron prompt here is the Python interpreter, where it's asking for Python commands. And I can say print hello world. Of course, this is what we tend to print all the time. I can make a mistake. I can say, lulululul. Right, and it'll complain to me. Now to get out of this, I can either type control-Z or quit(). In this case, I'm gonna type control-Z. And I'm back to the prompt. A couple of
things, I can do a dir to see what folders and files I have. And that is like my desktop. And then the cd command tells me where I'm at in the folder. That means I'm in the user's directory, Dr. Chuck. So I have now installed Python. I ran the Python interpreter to verify it. I said, print hello world. And so now what I'm gonna do is I'm gonna actually install Atom. And I already had this downloaded. So let's go ahead and install Atom on my computer. Okay, so Atom is now installed and it's kind of
telling us what to do. So I'm gonna actually just close all these windows, close this window, close everything. And I'm gonna create a file. I'm going to say print. In this case, let's see if I can make this bigger. I can make it bigger. So I'm gonna type print hello from a file. Okay, and I'm gonna save this. I'm gonna say file, save as. And what I'm gonna do is I'm gonna go to my desktop. And I'm gonna make a folder on the desktop. I'm gonna call this folder py4e. So I now have a folder
on the desktop. Move this here, I'll move this here. Oops. And I'm gonna go into py4e. And then I'm gonna name this file first.py. And you'll notice that when I save this, when I save this, it syntax highlighted it. That's one of the nice things about a programmer editor. Okay, and so it says, oh, it's got a suffix of .py. So therefore it knows that it's supposed to look pretty with Python and make this one color, make this another color. The other thing that you'll notice is that I now have a folder called py4e. And if
I am in this command line, let me just start that up again. I'll show you how to start the command line again. Command. Now, if I do a dir, I see the folders that I'm in. And one of the folders that you can see here is the desktop folder. So I'm gonna say cd desktop. And then I'm gonna type the dir command to see what folders are in the desktop. These folders are the same as these folders. These things are kind of virtual folders. Py4e is py4e. Now I can type cd, which stands for change
directory, py4e. And I can do a dir, and I see first.py. And that's the same as if I'm diving into this folder. Here's this file, first.py. Windows hides the suffix, which is somewhat annoying and frustrating, but that suffix is there, that file is there. And so for me, one of the things you gotta figure out in Windows is how to make sure that you are in the same folder, users, Dr. Chuck, desktop, py4e, and that's the name of this file, and here as well. And now I'm gonna run this program. I'm gonna type python first.py.
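For reference, the whole of first.py at this point is a single line. The exact text is assumed here from the narration ("print hello from a file"); in Python 3 the message goes inside quotes and parentheses.

```python
# first.py - saved in the py4e folder on the desktop
print('hello from a file')
```

From the command line, once cd has put you in the py4e folder, typing python first.py runs the file and prints the message.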
And you see that it ran the Python code, okay? Another way you can do this is you can just type first.py. And that's because a file association has happened in Windows: all files with .py are expected to be Python, and it knows where the Python interpreter is to run it, okay? This doesn't work on Macintosh. This only works in Windows. And so I've got Python 3 installed, and that gets me started, and so I hope that this little introduction about getting things started and writing your first Python program has been helpful to you.

We're going to actually
download and install Python 3 from python.org on a Macintosh. Macintosh for years has wonderfully come with Python 2. So if I type python --version, I see that I've got Python 2. What we wanna do is, in addition, install Python 3. One of these days, Macintosh might upgrade their distributed version to Python 3, but there's so many things inside Mac that depend on Python 2. I'm gonna expect that it will always be named python3, which is what we're gonna call it in a second. So here I am
at the python.org downloads, and I'm gonna download Python 3. You click here, and I've actually got it sitting here in downloads already, because I always do that. And so I'm gonna install this. There is the installer. I'm gonna say continue, continue, continue. Of course I agree, I read all that really fast, and now I'm going to install it. Okay, so now that means if I run a terminal, so this of course is start run terminal, so Python 2 is still there, but Python 3 is also now there, so we should have Python 3 installed. So
we installed Python 3.6, and so there we go, and that's all it takes to install Python 3 on the Macintosh. So let's write our first little Python program. I'm going to, I like Atom, and so I've got this Atom editor. It's atom.io, right here, atom.io, download and install the Atom editor. I like it because Atom works the same on both Windows, Mac, and Linux, and it has syntax highlighting, and so I really like things like that. So I'm gonna make myself a simple Python program. Hello world, like we always do. Now you'll notice that it's
not syntax highlighting yet, but I'm gonna do a file, save, oopsie daisy, file, save as, and I'm gonna go into my desktop, and I'm gonna make a folder called py4e. I'm gonna save this file as hello.py. Oh crud, gotta rename it, rename it. I ended up with two dots, hello.py, there we are. And so now I'm here, and I'm in my home folder. I can go into my desktop, and I can go into that new folder I made, py4e, and I can see the files. Now there are ways to run this, and I
really don't, I really want you to learn the terminal so that you really know what you're doing. And so here we are, we are in the folder that has the Python file, and then all we do to run it is we say python3 hello.py, and there we go. And of course this is Python 3 because I'm using parentheses there, instead of just double quotes. But Python 2 is still there, and of course if you just run python hello.py, it'll be a syntax error, or not. Must be they added something. Ha ha ha. Yeah, because Python is still version,
still version two, but apparently they allowed print in the latest version of Python 2. So away we go. Okay, so again, thanks for watching. I hope this was helpful to you to get Python 3 installed on your Macintosh.

Hello, and welcome back to Python as a Language. You'll notice that I'm wearing a hat. And part of the story of the hat is that where I work here at the University of Michigan School of Information, my office is in this building called North Quad. And we call it Quadwarts sometimes because it's sort of got a square, it sort
of imitates an Oxford quad. And so it seemed to me to evoke notions of Harry Potter. And when we first moved into the building, I joked in one of my classes that we should have a sorting ceremony for all the students as they come into North Quad for the first time. And so that was cool, and I thought that I would belong in Gryffindor, like everyone wants to be in Gryffindor, right, they're the good guys. And my students told me that I couldn't be in Gryffindor, that I had to be in Slytherin. So you'll see
me drinking tea throughout the course out of this teacup. It's my Slytherin teacup. I picked that up from Harry Potter World. I went down to Florida and visited Harry Potter World. And the reason that I was sorted by my students into Slytherin is also because I teach Python. And Python is like a snake. And so if you think about the people from Slytherin, they are capable of talking to snakes. And the class that we were doing the sorting was a Python class. And so it sort of made perfect sense that you would have to be
in Slytherin if you were the Python teacher. And of course, your name is Charles Severance. And then that sounds kind of like Severus Snape. And so I just accepted that I'm in Slytherin, okay? So you all can be in Gryffindor, but I can't. I'm in Slytherin. So I'm the bad guy or the good guy. Depends on how you look at it, right? And so what I'm going to do now is I'm going to bring you into Slytherin as well. Because I'm going to teach you the Python language. Python is the language that we Pythonistas talk.
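As a first taste of that language, here is a minimal sketch of the word-counting steps described back in the first part of this chapter: read the words, build a histogram, then search the histogram for the most common word. It is not the program shown on the slide. The sample sentence and the variable names are made up for illustration, and a string stands in for the file so the sketch runs on its own.

```python
# Stand-in text; in the lecture the words come from a file,
# whose contents are not reproduced here.
sample = "the clown ran after the car and the car ran into the tent"

# Build the histogram: word -> number of times seen.
counts = dict()
for word in sample.split():
    counts[word] = counts.get(word, 0) + 1

# Search the histogram for the most common word.
bigword = None
bigcount = None
for word, count in counts.items():
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print(bigword, bigcount)
```

With a real file, opening it and looping over its lines would replace the sample string, and the rest stays the same whether there are 20 words or millions.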
It was invented about over 20 years ago by a fellow named Gita Van Rossum. And away we go. Now, even though I'm using this whole snake Slytherin thing, it turns out that Python was not at all named for Harry Potter because Python was invented almost two decades before Harry Potter was created. And it wasn't for the snake. It was actually, Monty Python's flying circus was the inspiration for Python, the name Python. And because Gita Van Rossum really wanted to create a programming language, that while it was powerful underneath it in its very nature was a
very powerful language, he wanted it to be a language that was fun. And he wanted it to be a language that was approachable. And so that's why Python recently has become so absolutely popular. And it's easy to learn. But it's also powerful. And that's sort of the magic of Python, is the ease of learning it, the brevity of the programs, the shortness of the programs, and the power. And so we are going to become Pythonistas. Now, as you learn to be a software developer using the Python programming language, you are going to encounter syntax errors.
And I remember when I used to get syntax errors. And I remember my first programming class. And I would type on cards. And I would upload those cards to the computer. And the computer would say, you're not worthy. And I'm like, wait a sec, those are pretty good cards. How could you be so critical of me? It'd say, syntax error. And I really got sort of a really bad attitude that somehow this computer didn't like me. I would make cards, and it would complain. And I would make changes to the cards. And it would
still complain. And I'd make changes, and it would still complain. I'm like, how can I win in this situation? And you're going to feel the same thing. You're going to absolutely feel the same thing. You're going to be struggling. You're going to be like, how come this computer hates me? Let me assure you right now the computer doesn't hate you. The computer actually loves you. It just is not very good at showing how it loves you or telling you how or why it loves you. And so syntax errors are not so much Python telling you that
you're bad or that you're an inadequate programmer or you should find something else to do. It's really Python's admission that it doesn't understand what you're trying to say. And so you've got to get used to that. And it's frustrating, but you've got to get used to the fact that syntax errors are your friend. Python is saying, hey, I got to line seven. And I was doing fine up to line seven. But boy, in line seven, there's some little thing. I don't know what the word else means in this context. Or you didn't indent it. And
so I'm kind of confused. What did you mean? Please, please, please help me. And so it's so much easier for you to learn Python than it is for Python to figure out what you mean when you're writing code. So we have a number of different ways to sort of encode our instructions when we talk to Python. One is we just run Python interactively on our computer. Hopefully, by now, you've got it installed. And you just type Python at a command prompt. So either a Windows command prompt or a Linux command prompt or a Macintosh command
prompt. And I've got some examples of how to sort of get this all started, get Python installed, and away you go. Now, you'll notice when you run the Python interpreter, with the three-chevron prompt, Python is asking you, what next? This is you. It's saying, I want to talk to you. I want you to tell me some Python to do. If you know the Python language, you know what to say right here. Now, if you know Python, you can type these statements. You can say, oh, x equals 1, which really means, go find a little piece
of memory, label it x, and stick 1 in it. Print x is like, go find that thing where you labeled it x, and bring me back that number and tell me what I stored in there. Now, why you want to do this, that's a different question. These are very simple things. It's going to take you a while to get the big picture of why we're doing this. So just trust me that you want to learn these statements. And then later, we will successfully turn those into a program. So x equals x plus 1, the third
line there. x equals x plus 1 is not what it seems in math; it basically says, hey, go grab the old value of x, add 1 to it, and stick it back in x. That's what that means. So the equal sign really has kind of an arrow to it. And then we say, hey, go look up that x thing that we just did, and print that out, and then we're going to say, quit. So that's us talking to Python. Now, you can type just about any crazy stuff you want in here, and Python will be unhappy
and talk to you. So what we're going to do next is we're going to start talking about the actual language of Python and what it is that we have to say to make Python happy when we're talking to it. So now we're going to start learning the actual Python language. So what do we say? You can think of this as almost like writing, almost like writing a story. We're going to start with a basic vocabulary. We're going to talk a little bit about lines or sentences. And then we're going to start talking about how to
put those sentences together to make a coherent paragraph, as it were. And you just have to accept the fact that when I start teaching you this stuff, it's not going to make sense for about six or seven more chapters. And so just sort of bear with me, accept it. I mean, I remember when I first learned, it went from me confused, confused, confused, confused, confused, holy mackerel, this is awesome. And so I expect many of you will go through that same thing. So just learn the first parts, accept the fact that it doesn't necessarily make sense
in a big picture, and just bear with us, okay? So we'll start with vocabulary, we'll start to make sentences, and then we'll have little short stories and paragraphs, okay? And so this is a short story about how to count the words in Python. It's got a couple of paragraphs, and we are going to look at all of this stuff eventually. So we start with a set of reserved words. And what are reserved words? Well, they're words where, when you use them, Python expects that they're gonna mean exactly what Python wants them to mean. And what
it really means is you're not allowed to use them for any other purpose than the purpose that Python wants. It's sort of part of the contract. It's like when you have a dog, and you go, what did you think of that television program? And the dog has no idea what you're saying, and then you say, do you wanna wait until Saturday to go to the veterinarian? And the dog still doesn't know what you're saying. Then you go like, how would you like to take a walk? And then the dog goes, walk? I know what that
means, and then hits the door, right? And so the way the dog sees you is blah, blah, blah, walk, blah, blah, blah, blah, food, blah, blah, blah, blah, treat, blah, blah, blah, blah, walk. That's kinda how Python looks at these reserved words. When you say class, it goes class. Oh, I know what that means. Now, if I say zap, it's like, oh, zap's something that you get to decide, or it's maybe a variable name. So reserved words are simply words that when you use these words in Python, and there's only a few of them, like
and, or del, or if, maybe, pass, maybe, in. A lot of these, you won't end up using them, it's just these are reserved for Python and part of the Python vocabulary. This is Python vocabulary. Now, when we move from words to sentences, you see that Python is a series of lines. A Python program is a series of statements. They have an order because the computer wants to know what next, what next, what next. So, what next is start at the beginning. So, I already talked about an assignment statement that basically says x equals two, this
is not a mathematical statement, this is a directive to say, take this variable two, this value two, this constant two, and stick it in a location in your memory, and remember that I asked you to name it x. X is a variable, something you made up. You chose that, but it's Python's job to remember it. So, this says go, whatever that x is, there's a two in there, now pull that x back out, add two to it, which makes it four, and stick it back in x, and so that makes this a four. So, x
is a four, and print x says, go look up that thing that was an x and print it out. And so, each line has something to it. I'm using a reserved word; well, actually, print is a function, and in Python 3 it is not a reserved word. And so, there's reserved words and all these things, and you combine these. There are operators, plus is an operator, equals is an operator, these things do things, and we'll learn all this stuff in time. So those are the basic building blocks of lines of Python. Now, as we take these lines of Python
and build them up, we end up making paragraphs, programming in paragraphs. And so, one of the things that's important is, I showed you how to do interactive Python, so you just type python and you type a statement and a statement. That gets really tiring after about three or four lines of Python, because you start making mistakes and you have to start over. So, the better thing to do, as your program gets a little larger, is to write a script: put your Python instructions in a file, and then tell Python to read from the
file, and then run the script as it's entered in that file. We tend to name these files with a .py suffix, and I've got a series of videos that you can watch to figure out how this all works. Like I said, you can type interactively to Python, and it's a great way to experiment with Python, check to see if a statement does what you think it does, but a script is the way to go: once we are past one or two lines of code, we write it in files and then run it separately. So, there are a couple of basic patterns,
and it's really important to understand each of these patterns, and like I said, we'll teach you these patterns separately, and then we'll combine them together. And when you combine them together is when you say, oh, that's what a program is. So, you have to suspend disbelief. We have a couple of different patterns. One is a sequence of steps. Do this, then do this, then do this. Conditional is like skipping something. Repeated does it over and over and over again. Computers are really good at repeating stuff. Much better than people. People get tired going over and
over doing the same thing. And then we have store and retrieve steps as well. And so, if we take a look at this, and we take a look at a Python program, this is a piece of code, this is a little script. If you put this Python code into a file and run it, it starts at the beginning, and then it goes to the next line, and the next line, and the next line. And Python executes the scripts as you write them. So, it says, find a place
in your memory called x, stick two into that, okay. Then go to the next one, print that out. So, the program is producing output. Now, go read x and add two to it, and stick it back in x. So, x is four, then print that. This side over here, this is called a flow chart. I'm not gonna make you draw flow charts. I'm only gonna draw them a few times in ways that I think will help you. But you can think of it as Python, when it finishes something, it goes on to the next one,
unless you tell it otherwise. Finishes this, goes onto the next one. Finishes this, goes onto the next one. Finishes this, and now the program is all done. And so, that's sequential steps. You just type them in, Python runs it. They're important, but sort of uninteresting, because you can only get so far. And you can't really make them intelligent, because it's always gonna do the next one. So, the next thing we do is what are called conditional steps. And this is where it starts to get intelligent. I mean, where you are able to encode your brain
into the computer, like, oh wait a sec, let's only do this step if something is true. And the syntax that we tend to use here is the reserved word if, okay? And so, the if is like a little fork in the road. You can go one way, or you can go another way, and you're asking a question. So, inside the if statement, right here, there is a question, saying, is x less than 10? That resolves to a true or false. If it's less than 10, that's true. If it's 10 or greater, it's
false. And so, then what we do is, if it's less than 10, we have this indented block of code. There's also this colon that tells us we're in the beginning of an indented block of code. And so, what it basically says is, if this is true, run that code. If it's false, skip that code. So, it can either run it or skip it, depending on this question that's being asked. Now, if you look at this code, it's pretty obvious what's going on. It comes down, x is five. If x is less than 10, that's true.
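The slide being pointed at here is not part of this transcript. A minimal sketch of what it probably looks like, reconstructed from the narration; the exact strings 'Smaller', 'Bigger', and 'Finis' are assumptions:

```python
# Conditional steps: each if asks a true/false question, and the
# indented block under it runs only when the answer is true.
x = 5

if x < 10:              # 5 < 10 is true, so the indented line runs
    print('Smaller')

if x > 20:              # 5 > 20 is false, so the indented line is skipped
    print('Bigger')

print('Finis')          # de-indented, so it always runs
```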
So, it runs this code and prints out smaller. And then, it comes back here, deindents. The next basic sequential, this ends up being kind of a block. If x is greater than 20, if x is greater than 20, oh, come back, come back. If x is greater than 20, this turns out to be false, because x is five, and so it skips this. So, the bigger never comes out, and then it continues on and prints fini. Oops, that's a typographical error. Make that a lowercase print, and then prints fini. So, it comes in, runs this,
skips this, and then finishes. Okay, so here is the last one we'll talk about, the repeated steps. We'll get back to store and retrieve later, but for now, we're just gonna talk about three of the four. This is another program, and the key is, is that we're gonna use this same choice where we're gonna go in, but then we're gonna run for a while, and then we'll have an exit condition where we get out. So, this is repeated over and over and over and over again, and this is the essence of how we make computers
do things that are seemingly difficult; well, they're more naturally difficult for people, okay? And so, how do we encode this notion that we wanna do something not forever, but for a while? How do we encode that notion? And so, we do it in this way. So, we have our statement, sequentially go to this while. While is a keyword, and it's asking another question that's a true-false question. Is n greater than zero? I read this as, as long as n remains greater than zero, keep doing this indented block, and you have a colon at
the end, and then you have two lines of code that's indented, so that tells us what the loop is, and then this is now deindented. And so, it comes in, and if this is true, if this is true, if this is true, it runs these two lines. Prints out n, n is five, and then it says, n equals n minus one, which makes n be four, and it goes back up, and it goes up, and it asks this question again. Is n greater than zero? If it is, continue on, and prints four, and then subtracts
it, and it does that, four, three, two, and prints out one, then it comes up, and now, after this, n is now zero, n is now zero, and n is no longer greater than zero, so it takes sort of the exit ramp, and goes down here. So, it takes the exit ramp, and goes to here, and runs the next line. Now, we're gonna cover all this again. So, I'm just trying to give you the big picture, next couple of chapters, we're gonna hit all these things again, and we're gonna hit them in much more detail,
with a lot better information. This is now sort of like combining these, and again, I don't want you to really know this stuff, just, you will know this in a couple of weeks, you will see this program again, but this shows you how we combine those patterns of repeated, sequential, and conditional together. So, this is a bit of sequential code, comes in here, runs this, which happens to ask for a file name, then it opens the file, it creates a data structure called a dictionary, this is all sequential. Now the for is another form of
loop, so this is gonna loop for a while, and then within a loop, we can even have two indents, and that's another loop, so these are like repeated, and then it goes down to the next sequential bit, then it does this, here's another loop, it's gonna run, and then here's a conditional, it's gonna run, and then once all done, we print out the last thing, and this, of course, is the program that figures out the most common word and prints that most common word out, and so this
is a Python short story, it reads some data, it reads the name of a file, it opens that file, it talks about how to make a histogram, and then it looks through for the most common word, so don't worry too much about this, over the next couple weeks, we'll fill in the pieces so that you absolutely understand every single line of this code. So, this is a quick overview, chapter one, stick with us, you realize it will be chapter seven before this makes too much sense, you really have to trust that you are learning important
things, and that it all makes sense when we bring it together like in chapter seven in a few weeks.

Hello, and welcome to chapter two. Now we're gonna continue to talk about the building blocks of Python, variables, constants, statements, expressions, et cetera. The first thing we have to talk about is constants. These are just things we call constants because they don't change: they're numbers, strings, et cetera, and we use them to sort of start calculations, or, you know, if something is greater than 40 hours, we're gonna do something, and so 40 is the constant in
that situation. So, we have 123, we have 98.6, we have hello world, which is a string because we enclose it in quotes. We pass each of these things to the print function, and a side effect of the print function is that we see the output. So, print 123 prints out 123, print 98.6 prints it out. So, this is just really the syntax of constants, and without constants, we can't write really much of anything. The other sort of foundational notion of any programming language is the reserved words, and like I said before, reserved words are these
special words where Python is listening for them, and they have very special meanings, so when Python sees if, it's not just any other word, it's how Python implements conditional execution. Variables are the third building block, and that is a way that you can ask Python to allocate a piece of memory and then give it a name, and you can put stuff in that. Sometimes you just put one value; later, when we do collections in chapters eight and nine, we will see that more than one value can be put into a variable, and
the way we control the variable is through the assignment statement, and as I said before, it's important to think of the assignment statement as having an arrow to it, so this is not saying x for all time is the same as 12.2, what it's saying is take 12.2, find a place, find some memory in your computer there, Mr. Python, give it a label x, we get to choose the x, that's the variable part, we chose it, right? And then stick 12.2 in it, and then the same is true for 14. Go find another spot,
name it Y, and then put a 14 in there, so think of this as an arrow every time you see that equality, the assignment in an assignment statement. Now, these variables hold one value, so now if we have these three statements, these two, and then the third one executes, it says put 100 into X, but that wipes out the old value of 12.2, and it rewrites it with 100, and so we can change the variables, that's another reason that we call them variable. There are some names, some rules for making variable names, you can start
with a letter or an underscore. As normal programmers, we tend not to use a leading underscore; we tend to reserve those for variables that we use to communicate with Python itself, so when we're making up a variable, we tend not to use underscores as a first character. You can have letters and numbers and underscores after the first character, and they're case sensitive, but it's really a bad idea to use case as the only differentiator. So, in this case, spam, eggs, spam23, and _speed are all totally legit; we would probably not use that last one unless we
were actually doing it because Python told us to use that variable. 23spam starts with a number, and the pound sign and the dot are not legitimate variable characters. And spam, Spam, and SPAM are different, but this is not something that you want to sort of depend on too much. Those are just the naming rules. We tend to start them with a letter and then use letters, numbers, and underscores. Underscores other than the first character are generally pretty common, and you'll see those used a lot. Now, when we're choosing variable names, one of
the things about variables is we get to choose the name. We get to choose the name x, choose the name y, and so sometimes we like them short, but sometimes we want them descriptive, and the notion of making variables descriptive is often confusing to beginning students. Sometimes it's really helpful: if you're gonna have a line of text and you name the variable line, that's great because the next person reading your program says, oh, that must be the line of text. But it can also become misleading, suggesting that line, the name of a variable, somehow
has meaning, and so sometimes we'll have even singular variables and plural variables like friend and friends. You know, like, is plural, does Python know about singular and plural? And the answer is no. So sometimes we pick variables that make no sense. Sometimes we pick variables that make a lot of sense. This is just something that you as a beginning programmer are going to have to understand that we can pick anything we want, and so you'll see, I'll try to call attention to this in the first few lectures as we go through. So here's a bit
of code with an assignment statement, two assignment statements, a multiplication, and a print statement, and you can say, what is this doing? Now, Python is perfectly happy with this code because it assigns it in there. You have said, please go give me this as a label, and then we assign two variables, and then we're carefully pulling these two variables back out, multiplying them together and sticking them into yet another variable, and then printing that variable out. That seems like, you know, we can figure out what it is. You just have to look really carefully, and
a single character mistake, and Python is gonna be, you know, pretty unhappy, okay? So that's one way to write this program. It's hard, though, because those are long variable names made of random stuff. It's not very friendly to anyone who might read your program. Now, this looks a little friendlier. It's the same program because Python just wants a correspondence. You pick a, you pick b, and you pick c, and it's really much easier for us to see what's going on, and so this is, in a way, going from here to here is much
friendlier, but we can be even friendlier if we pick mnemonic variable names. So this is not mnemonic. This is short and convenient. This is long and inconvenient. Python is happy with any of these. Here, on the other hand, is another version of the exact same program, and now you think to yourself, oh, yeah, now I get it. 35 is the number of hours. 12 dollars and 50 cents is the rate, and then we're gonna multiply the hours and the rate and come up with a pay, and we're putting out the pay. Now, whoever wrote this
program is helping us greatly understand what's going on, and that's good. Again, all three of these are the same to Python. Choosing variable names in a way that helps your reader understand what's going on is a great thing. The danger is, if you read this and you think that somehow Python understands payroll, that if you name a variable hours that Python knows what hours means. The answer is, Python really doesn't care what you name the variable, as long as whatever you name it, you use it consistently, right?
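To make the comparison concrete, here is a sketch of the three versions being described. The slides are not in the transcript, so the gibberish names in the first version are made up for illustration; the values 35 hours and 12.50 for the rate come from the narration.

```python
# Version 1: legal but unfriendly names (hypothetical, invented for this sketch).
x1q3z9ocd = 35.0
x1q3z9afd = 12.50
x1q3p9afd = x1q3z9ocd * x1q3z9afd
print(x1q3p9afd)

# Version 2: short and convenient, but not mnemonic.
a = 35.0
b = 12.50
c = a * b
print(c)

# Version 3: mnemonic names that help a human reader.
hours = 35.0
rate = 12.50
pay = hours * rate
print(pay)
```

All three versions do the same multiplication and print the same number. Python only checks that each name is used consistently; it attaches no meaning to hours, rate, or pay.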
And so, you gotta be careful, and so you'll see, I will, when I write my code in these first few weeks, first few lectures, I will sometimes write it with gibberish, I'll sometimes write it with extremely short but meaningless variable names, and sometimes I'll use meaningful variable names, and I'll call your attention to it, and it will get you. You'll start, when you look at this third kind, it has meaningful variables or mnemonic variable names, you'll just instinctively want to give Python more intelligence than it sort of deserves, I guess that's probably the best way
to say that. So, we've talked about constants, we've talked about reserved words, we've talked about variables. And so, here we have a sentence, like we've already done some of these things, where we set x equals two, we retrieve the old value of x and add two to it, so that becomes four, and then we print four out, print is a function that's built in, and we pass in whatever we want to print out. So, this parentheses is part of a function call. Okay, so, an assignment statement, you have to really get your head around the
notion that it has this arrow nature, and that it evaluates this entire right-hand side before we change the left-hand side. And so, you can think of this sort of as, at time step one, it does this, and then at time step two, it does the copy. And that's how you can have something like x on both sides of an assignment statement. And so, if, for example, we have x, and x has 0.6 in it, x has 0.6 in it, what happens is that it first, it sort of ignores this part right here, and evaluates the
expression. So, it pulls the 0.6, everywhere x appears, it pulls 0.6 out, then it starts running these calculations, and then it has the new value. After all the calculations are done, then and only then is it going to put that back into x. And so, it sort of takes that and puts it back into x, and then wipes out the old value. At this point, this has all been taken care of, and it's been reduced down to this 0.93, and so that is what's put in as the new value. So, up next, we'll talk a
little bit more about making more complex expressions.

So, welcome back. We're now going to talk about expressions. Expressions are slightly more complex calculations that we can do on the right-hand side of an assignment statement. So, one of the things about expressions is operators. Operators in computer programming are often very much the same as the mathematical operators, but we don't have all the fancy characters that we have in mathematics, and so we have to choose what's on the keyboard. And really, we go back to the 1960s and 1970s:
we used what was on the keyboard in the 1960s and the 1970s to make these operators. So, plus is addition, minus is subtraction. We don't have a times sign or a dot in the middle, so we use the asterisk for multiplication. Division, we can't put two things over top of each other, so we use slash for division. Raising to the power, because we didn't have little raised characters back then, is star, star. And then remainder. When you do integer division, the remainder operator, also called the modulo operator, gives the remainder,
not the quotient. Now, I've got a picture of that coming up. So, here's a whole series of little examples of this, right? So, we've already seen, you know, the plus, x equals x plus one. Keep remembering that these assignments are arrows, basically, arrow, arrow, they have a direction. Multiplication, 440 times 12. Dividing this by, that's division over 1,000, 5.28. Here, we're gonna put 23 into JJ, and then we're gonna do modulo. So, that says, take 23, divide it by five, and give me back the remainder and put it in KK. So, this is the expression
that evaluates like this. Take 23, divide five into 23, four, remainder, three. The three is what comes back up here. Okay, and so that is the remainder. It's also called modulo operator. It turns out that, for things like picking a random number and then taking the modulo of 52 is a way to pick a card randomly. So, this modulo operator is actually, especially in games and other things, super useful. So, that's the various operators. It's important to know which of these operators goes first. It's called operator precedence. Now, normally, we put parentheses in, like, you
know, so if I put the parentheses in here, I'd say this goes first, parentheses, then this goes first. Oh, actually, not that one. Oops, got that one wrong. This happens first, this happens, then this happens. Okay, and so, but it's important for us to be able to know, if there were no parentheses, the order in which these things will happen. So, the way things work in terms of operator precedence is parentheses are the most important thing, followed by raising to the power, all else being equal. Multiplication and division are equal to each other and come next, and then addition and subtraction,
and then within a level, it goes left to right. So, let's see an example of how this works. And so, if we take one plus two raised to the third power, divided by four, times five, and we print out what comes out of this. So, the way I did this when I was taking exams back many, many years ago when I was first in computer science, is I'd write it all down, and I'd look for the highest precedence thing. Now, parentheses would make this easy, but exponentiation is the first one. So, that means we're gonna take
this, and that's gonna be eight, two to the third power, two times two times two, two cubed is eight. Then what I would do is I rewrite the whole thing with the eight there, and now I look across, and I'm looking for multiplications, because the power's been done, the multiplication's what I'm looking for next. And there is both multiplication and division; they're equal, they're at the same level. And so, what happens is they're done left to right. Eight divided by four happens before four times five. And so, the fact that it's not four times five,
but instead eight divided by four that goes first, is because of the left to right rule. So, then this gets rewritten with a two: one plus two times five. And in this one, multiplication is the top one. So, that does this next, two times five becomes 10, I rewrite it again, and then one plus 10, addition is the lowest thing, and that's how we end up with 11. And so, that's how I would do these problems if I ever saw the problem on an exam. And it's a fun problem to put on exams, because there is one and only one
answer, and every programming class has usually at least one slide about this stuff. So, like I said, the rules go top to bottom, parentheses, power, multiplication, addition, and then left to right within it. So, we've talked about variables and computing values to put inside variables, but the one thing you've kind of also, maybe you noticed it as we go by, is we have different kinds of data. We call it type. Is this of type integer? Is this of type floating point number? Is it of type string? What is going on here? And Python is pretty
smart about various kinds of types of data. And so, you know, we're adding one plus four here, and Python knows, as it looks at this, that that's an integer and that's an integer, and we'll add it together and make it an integer. So, that thing is an integer. We can also use this plus to concatenate two strings. This is hello blank plus there, and plus looks here, says, oh, that's a string, and that's a string. So, I know what to do with strings. I will concatenate those two things together, so it becomes another string that
gets assigned into eee, and it's hello space there. The plus doesn't add the space. I added the space by putting it right there. And so, these operators are kind of smart in that they kind of know what they're dealing with, and sometimes they will do one thing or another depending on the kinds of values, variables, or constants that they're working with. And so, sometimes type can get us in trouble. So, here we have eee, which is hello there because we've concatenated these two strings together, and now we're adding one. And the problem now is that
it looks on one side and says, that's a string, and that's a number, and says, I don't know how to do that. This is another one of those annoying errors that you would like, you think that somehow Python doesn't like you, but it just is confused. If you look at these things, traceback, traceback always means I quit. It means I stopped, I ran, I'm quitting now because I don't want to go any farther because I've become confused. So, your program stops running, and you say, here's where I stopped running, because we're typing interactively. It's always
line one here. But if you read carefully and you don't get too stuck on too much stuff, it tells us: line one, something in module, TypeError, can't convert int object to str implicitly. So, that's an integer right there, and that's a string, and that's what it's complaining about, that little bit right there. If Python is so grumpy about types, then we should be able to ask it about type. So, it turns out that there is, inside Python, a built-in function called type, T-Y-P-E. So, we can pass things into type. So, the syntax
is calling a built-in function named type. In the parentheses is the parameter that we're passing to it. We're saying, hey, hello, tell me something about the type of the variable eee. And so, this is a function, the parentheses are part of the function call, and it says, oh, that would be of class string. And then we can pass in a constant and say, hey, what about hello? The string hello, it's like, oh, that's a string too. What about a one? Well, that's an integer. And so, we are asking Python, through the type function, what the type of
either a variable or a constant is. And there are even several types of numbers, and we'll even see Booleans and others later, like one with no decimal, that's an integer number. 98.6 with a decimal, that's a floating point number. And so, constants can be both integer and floating point. And I'm just asking over and over and over again, what is the type of, what's in xxx? What's the type of what's in temp? And what's the type of the constant one? And what's the type of 1.0? You can also use a set of built-in functions, like
float and int, to convert from one to another. And so, this basically says, I wanna convert, oops, let's go back. I wanna convert 99 to a floating point number. So, this is a function, and it's participating in this plus, but before it can finish the plus, it turns this into a 99.0. The difference between 99 as an integer and 99.0 is that it's a floating point number. And that actually turns this computation, as it looks to the left and looks to the right, says, oh, I've got a floating point number on one side, an integer
on the other side, and so I'm gonna make my calculation overall a floating point calculation. I can also pass a variable into the float function. I can say, take this variable i, which has a 42, also an integer, and then give me back a floating point. So, that'll be 42.0, pass that into f, we print it out, and it is indeed 42.0, and it's a float. And so, it knows the type and value in any variable. This is an integer of value 42. This is a float of value 42.0. Integer division in Python 2 was kinda weird,
and it was actually one of the big things that they changed between Python 2 and Python 3. This is a Python 3 course, so we're not worried about that too much. What's nice about integer division in Python 3 is it always produces a floating point result. And that means that Python 3's division is more predictable, and it works more like a calculator. So, in this case, I mean, you can go back and look at my Python 2 lectures and see how crazy it was in Python 2. 10 divided by 2 is 5.0, and the weird
thing here is these are both integers, but the division forces the result of the calculation to be a floating point number. And this, you know, 10 over 2 could be 5, but 9 over 2 is 4.5, and so that is accurate. In old Python 2, that would give us back 4, which is completely unpredictable and weird. The same with 99 over 100. As you would expect if this were a calculator, you get 0.99. Actually, what you get in Python 2 is zero because it would round it down. It doesn't round at all, it truncates it.
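For readers following along without the slides, here is a minimal sketch of the division examples being described, assuming Python 3. The `//` operator and the variable name `half` are not mentioned in the lecture; they are added here only to show how to get the old truncating behavior on purpose.

```python
# Python 3: the / operator always gives back a float,
# even when both sides are integers.
print(10 / 2)      # 5.0
print(9 / 2)       # 4.5
print(99 / 100)    # 0.99
print(10.0 / 2.0)  # 5.0

# Truncating division (what Python 2 did with two integers)
# is still available in Python 3, but you ask for it with //.
print(9 // 2)      # 4
print(99 // 100)   # 0

half = 9 / 2
print(type(half))  # <class 'float'>
```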
So, 99 over 100 is 0.99, and then it truncates it to zero. That's Python 2. We're not talking about Python 2. There's a good reason we're not talking about Python 2. Welcome to Python 3. Of course, if there is a floating point number on either side, the result is still a floating point number. So, integer division produces a floating point result in Python 3, not in Python 2. That is an improvement in Python 3. And that's why we're recording these lectures. I have a whole great set of lectures
about Python 2, and now I'm gonna have a great set of lectures about Python 3. Welcome to Python 3. Okay, so, we've been talking about converting from integer to floating point, but you can also convert from string to integer or string to floating point. And so, here we start out with a little string value. Now, it only works for strings that are made of digits. So, quote one, two, three, quote is not an integer. It is a three-character string that has one, two, three as the characters in that string, which is very different than 123.
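Here is a rough sketch of the string conversion example that this part of the lecture walks through, assuming Python 3. The name `sval` comes from the lecture; `ival` and `nsv` are names assumed for illustration. The try/except wrapper is only there so the sketch keeps running past the two errors; typed at the interactive prompt without it, each failing line would produce a traceback instead.

```python
sval = '123'              # three characters, not a number
print(type(sval))         # <class 'str'>

# Adding an integer to a string fails with a TypeError.
try:
    print(sval + 1)
except TypeError as err:
    print('TypeError:', err)

ival = int(sval)          # convert the digits to an integer
print(type(ival))         # <class 'int'>
print(ival + 1)           # 124

# int() is not magic: letters cannot be converted.
try:
    nsv = int('hello bob')
except ValueError as err:
    print('ValueError:', err)
```

The exact wording of the error messages depends on the Python version, so it may not match what is read out in the lecture.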
We say, what is the type of this? It's a string. We say, let's add one to it. And it says, can't convert int to string, so that blows up, right? Because this is a string. It looks to both sides. String plus an integer, not good, okay? But we can convert this. We can call the int function, which is like the float function, and pass a string in. So, it says, hey, take this and turn it into an integer. So, take the input of sval, which is the string one, two, three, and give me back an
integer representation of that, which is going to be 123. So, we say, what kind of thing do we get back? Well, we got back an integer. We can now add one to it and get 124. And so, you have to manage the type of things and you can convert from one type to another. Now, int is not magic. If you send something into it, a string that doesn't consist of digits, then you're gonna end up with another error. Invalid literal for integer with base 10, blah, blah, blah, blah, blah. So, it's really complaining. It says,
I want these to be numbers here and you just gave me letters. So, that's going to cause this to fail. Another thing that we're gonna do with variables is just like the print function takes something, a list of things, in this case, a string, comma, a variable, and then print some output in the program. The opposite of that is input. Actually, input generally happens before output. Input is a built-in function and we pass to it a prompt, a string of text that's going to be printed out for the user and then it stops and waits.
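A minimal sketch of the input example, assuming Python 3. The prompt text and the variable name `nam` follow the lecture. The function wrapper `ask_and_welcome` is a hypothetical name and a later-chapter idea, used here only so that nothing stops and waits for the keyboard until you call it.

```python
def ask_and_welcome():
    # input() prints the prompt, then stops and waits for Enter.
    # Whatever was typed comes back as a string.
    nam = input('Who are you? ')
    # The comma between the two things makes print add a space.
    print('Welcome', nam)
    return nam
```

Calling `ask_and_welcome()` and typing Chuck followed by Enter prints Welcome Chuck.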
So, it says, who are you? And then right here, it just sits, waiting for us to type something. So, we type, blah, blah, blah, blah, and then hit the enter key, right? We hit the enter key and then this text ends up in this variable. So, this is an assignment statement that chuck is the result of the input call, gets copied into the nam variable. So, let's do that again. It's evaluating assignment statement. Remember, it's kind of this way or you can think of it as do this right side first. It writes this out, writes
that out, then it waits, wait, wait, wait, wait, wait, until we hit the enter and takes this chuck and that becomes the result of this input which is then assigned in to nam. Now, then we go sequentially to the next line. It prints out welcome, comma, n-a, contents of the variable nam. Now, this one, this comma here, actually does put the space in here automatically. So, it says welcome space chuck. So, it pulls the, there's no space in chuck, just the chu-c-k. And so, print can take more than one thing separated by commas. Matter of
fact, print can have, you know, a whole bunch, oops, come back, come back, come back. Print can have comma, comma, comma, parenthesis, as many as you like. Everything you've seen up to now is kind of one thing in the print but that doesn't mean that print only can do one thing. So, I've talked about variables, we've talked about constants, we've talked about input, we've talked about output, and now it is time to write our first meaningful program. And so, this program has to do with those of you who have traveled internationally. If you traveled to
the United States and you've traveled outside the United States, you notice that there is an elevator convention that is different inside the United States. In the United States, when you walk in on the ground floor, in the elevator that's one. And if you walk in on the ground floor in Europe or many other places in the world, then in the elevator it's zero. So, we have written a small app that we're gonna put on the app store and get wealthy with, called Elevator Floor Conversion App. And it's gonna ask us, we're in Europe and we're lost, and you say, well,
what floor would this be if I was in the United States of America? And so, here's, we have to read the floor that we are at in Europe, and then we're going to convert it to a US floor, and then we're gonna print it out. This is very silly, but it is a pure, essential program that has input, does some kind of task on that input, and then produces some output, which is useful for some value of useful, okay? So, let's take a look at how we combine everything that we learned in this lecture, input,
processing, and output. It's a three-line program, but it's sort of the beginning of something that programs do, okay? You're gonna do lots of programs that do this. So, here we go. Program starts, we do the input side effect. It prints out this and then waits. We type in zero, that comes back here, and the zero, which is a string. Input gives you back a string. It doesn't give you back a number. It's a little different in Python 2, but in Python 3, input gives you a string. So, quote zero, quote, which is what we typed
here. We didn't type the quotes, it's a string. It gets stored in the inp variable. Then we move to the next statement, and on this right-hand side, we convert that string variable to an integer, so that becomes the integer zero. We add one to it, and then that becomes one, and then we assign that into usf. I've named this variable United States Floor, right? So, inp is the input, and usf, that's mnemonic. It doesn't know anything about elevators, it's just I picked a variable that was quite friendly. And so, at this point, usf has the
United States Floor that's equivalent to the European Floor, and then I just fall down and I do a print statement. Print out US floor, comma, that's the space right here, and then whatever the contents of the usf variable is. And you could see that I could write this as four, and it would say three. I could write this and say seven, and it would say six. This is an amazing program. It converts floors in a European numbering scheme. Wait, actually, no, I got that wrong. Hang on, let me clear this. I wasn't thinking clearly. I
could type in four, and it would give me back five. I could type in six, and it would give me back seven. See, I'm confused, haven't been in Europe in a couple of months, and so I forgot all about the floors, but that's the idea. Now, this is a super, super, super simple program. Not super useful, but you get the idea that we're gonna pull some data in, we're gonna do some intelligent thing. Soon this will be hundreds of lines of code instead of one line of code, and then we're gonna present the results to
our user. Now, another element of most any programming language is what's called a comment. A comment is a way for you to put in a program file some text that's to be ignored by Python or C or whatever language we happen to be using. In Python, comments start with a pound sign. So what you can do is put a pound sign anywhere in a line, and then after the pound sign, Python ignores everything after that pound sign. It can be the first character. So here's our recurring concept that we talk a lot about. We're not
gonna cover this. Remember what this does. This is counting how many letters, the, the, the. There's 16 thes, and there's, in that file, there was six twos or whatever it was. This is that code. We'll get back to this code. But what we've done here is I've added some comments that are really for human consumption. So this first paragraph is get the name of the file and open it. The second paragraph is count the word frequency. You know, maybe I should have said histogram here. Count the word frequency and assemble a histogram. And then here
I'm putting this pound sign in, find the most common word, and then I'm all done, I print this stuff out, right? And so all I'm saying is comments are for people to read. Your next programmer or the person who's gonna change your program after you're done with it. And they're nice. And you don't have to use any particularly weird syntax or variable naming conventions. You put a pound sign in and you can write anything you want from that point forward. Okay, so we've talked a little bit about variables and types and mnemonics and how we
would choose variable names and how expressions work and the various operators converting between different types, printing, input, output, and comments. So that just kinda gets us sentences. Coming up next we'll talk about conditional execution where we're really starting to move up to paragraphs. So see you in a bit. Hello and welcome to chapter three, conditional execution. In conditional execution we meet the if statement. The if statement is where Python can go one way or another way. And it's the beginning of sort of our way of making Python make decisions for us. Sequential code, we just
do some things. Sometimes that's useful. But now we can have our code check something and then make a decision based on that thing. So the conditional steps in Python are pretty straightforward. The keyword that we're going to use is the if statement. And so if is a reserved word. And the if statement has as part of it a question that it asks. And this is asking if x is less than 10. And the colon is the end of the if statement. And then we begin an indented block of text. And the way this works in
this particular example is that this line is the conditional line. If the question is true, the line executes. And if the question is false, the line is skipped. And you can think of it the way this is, right? x is five, ask a question. Is it less than 10 or not? These questions do not harm the value of x. If it is, then we run this code. And then we sort of rejoin here. And then we test this next if. And if that's true, we do this code. And then we continue there. But in this case, it's going
to be false because x is not greater than 20. And so it just continues down here. So if we look at how this works, it runs. It runs this line. Then it sees this question. It skips that line. So this line does not run. And so Smaller prints out. And Finis prints out. OK? And so that's the basic idea of an if statement. And the indentation, when we are done with an if statement, we deindent back. And there's this little block. This is one sort of if statement. And this is another if statement. And these
are the two conditional lines that either run or they don't run, depending on the answer to that question. So we have a number of different comparison operators that we can use to ask these true-false questions that say, is this true? So again, we're kind of limited to the keys that were on computer keyboards in the 1940s and 1950s. Less than, less than or equal to. So we didn't have fancy math characters. So we just concatenated less than and equal to be less than or equal to. This double equals is the asking, is this equal to?
And so that's a little tricky. The equals sign is that assignment operator. If I was building a language today from scratch, I would probably make assignment be arrow. And the equals question to have an equals. Or I might say somewhere I would say question equals. But I'm not building this language. So that's not up to me. So this is the question. Double equals is asking the question is equal to. Greater than or equal, greater than, and not equal. So this is the exclamation point is sort of like not equal. So that's sort of not equal.
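As a quick sketch of the comparison operators just listed, assuming Python 3 and a value of x chosen so that every question comes out true (the printed text in the two if statements is made up for illustration):

```python
x = 5           # a single = is assignment, not a question

print(x < 6)    # True   less than
print(x <= 5)   # True   less than or equal to
print(x == 5)   # True   equal to (the question version)
print(x >= 5)   # True   greater than or equal to
print(x > 4)    # True   greater than
print(x != 6)   # True   not equal

# Asking a question does not change the value of x.
if x == 5:
    print('Equals 5')
if x != 6:
    print('Not equal 6')
```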
So that's how we do not equal. So if we take a look at some of these in some examples, all of these are going to be true because of the way x is set. If x is equal to 5, that's the question version. That's true or false. It'll execute that. If x is greater than 4, it's going to execute that. If x is greater than or equal to 5, it's going to execute that. Here's kind of a shorthand where if there's only one line in this block, you can kind of pull it up right on
the same line after the colon. If x is less than 6, which it is, true. Execute that. Then if x is less than or equal to 5, do that. And if x is not equal to 6, do that. Now like I said, all these questions have been carefully constructed so that they're true. Just to kind of show you the syntax of those comparison operators. Now you don't just have to have a single line of text in the indented block. And this will be something you're going to get used to. So if we indent more than
one line, then the conditional code is actually these three lines. So the idea is you have an if statement. You come in, you do an indent. And as long as you stay indented, you stay in that if block. If it's false, it just skips all of those. So the way this is going to execute, x is 5. It prints Before 5. Is x equal to 5? That's the question, and that's true. So it's going to run all these. And then come back, and then continue on, and then de-indent. So all this stuff is running.
And then it says if x equals 6. So that was false. So that skips all of them. So none of these lines of code run. So these actually don't run, and it says afterwards 6. So that's a mistake. Those don't run right there, because x is not equal to 6. OK? So indentation is an essential part of Python. We use indentation in lots of programming languages, often to demarcate blocks to show where blocks start and stop. But in Python, it's syntactically significant. You can get an error if your indentation is wrong. After an if, you must
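Here is a sketch of the kind of program being traced at this point, assuming Python 3. The printed strings are guesses at the slide content, which the transcript does not show; the shape (two if statements, each with a three-line indented block) is what matters.

```python
x = 5
print('Before 5')
if x == 5:
    # True, so all three indented lines run.
    print('Is 5')
    print('Is Still 5')
    print('Third 5')
print('Afterwards 5')

print('Before 6')
if x == 6:
    # False, so all three indented lines are skipped.
    print('Is 6')
    print('Is Still 6')
    print('Third 6')
print('Afterwards 6')
```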
indent. And you maintain the indent as long as you want to be in that same if block. And then when you're done with the if block, you reduce the indent. In this rule of indenting, comment lines and blank lines are completely ignored. So we're going to tend to put four spaces. Four spaces ends up being the normal thing that we do. And you'll see all the code that I write has four spaces for each indent. If I go in twice, I use eight spaces. And we have this instinct of wanting to hit the Tab key
to move in four spaces. Now, the problem is that it might look the same on your screen. A Tab and four spaces might line up the same place, depending on how tabs are set. But Python can get confused by that. So we tend to avoid using actual tabs in files. And so most programming text editors, like if you're using Notepad or Text Wrangler, there is a place to set the tabs, to say don't put tabs in this document. But every time you hit Tab, move over four spaces. And so if you hit a Tab, but
it's like space, space, space, space, space. Now, the nice thing about Atom, and this is the text editor we tend to recommend in this class, A, because it works on Windows, Linux, and Mac, but also because it automatically sets this up. As soon as you save your file with a .py extension, you can sort of hit the Tab key with impunity. And everything works perfectly. But the key thing here is that Python insists that you get this right. And if you don't get this right, you're going to get indentation errors. And they're just another syntax error.
So if you're using something like Text Wrangler or Notepad, run around in the Preferences, and you'll find something about expanding tabs, or maybe how many spaces each tab stop is supposed to be. And so you check these. And what this really is doing is telling your text editor, never put an actual tab in the document, but somehow simulate tab stops using spaces. And so here is a bit of code. It's got some nested block. But it gives you the sense that you have to be very explicit when you're reading Python code of whether the indent
is the same between two lines, the same, increased or decreased. And every time you increase it, you mean something. And every time you decrease it, you mean something. And literally, if it stays the same, you mean something as well. And so if we take a look at this, here we have a line. And the next line has the same indent. This is an if with a colon at the end. So we have to increase the indent. And now we're maintaining it. So these two lines are part of that if. But now we have deindented.
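A sketch of the bit of code being read through here, assuming Python 3. The printed strings and the loop variable `i` are assumptions, since the transcript only describes the shape: an if block, then a for loop with an if nested inside it.

```python
x = 5
if x > 2:
    print('Bigger than 2')       # indent increased: inside the if
    print('Still bigger')        # indent maintained: still inside
print('Done with 2')             # deindented: the if block is over

for i in range(5):               # runs the indented block five times
    print(i)
    if i > 2:
        print('Bigger than 2')   # two indents: a block within a block
    print('Done with i', i)      # part of the for, not part of the if
print('All Done')                # same level as the for: runs once
```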
So whether you choose to deindent this word, or this word, or whatever, where you do this deindent affects the scope of how far this if statement lasts. It lasts up to, but not including, the line that's deindented to the same level as the if. So this is a deindent. Now we have a blank line, which doesn't matter. And we maintain it. And we have a for, which we'll learn about in the next chapter, which is a looping structure. Let's do a for. For runs this five times. It has a colon. And it also expects
an indented block. Now we have what's called a nested block, where we have an if and a colon. We go into some more. So this is like two indents. So these are one indent. And these are two indents. And so this is a block within a block. And then we deindent. So that means this print is not part of the if statement, but it's still part of the for statement. And then we deindent again. And then that means this print is on the same level as that for statement. So if you start thinking about this,
you want to be able to start thinking of these blocks as going from the line with the colon up to, but not including, the line that's been deindented. So the for goes this far. The for goes up to, but not including the line that's deindented. The if goes up to, but not including the line that's deindented. So as you do this, you'll sort of mentally start drawing these blocks. And pretty soon, you will start constructing them as blocks. And it takes a while, but doesn't take forever. But in Python, unlike other languages, this
is very important, and it matters. And you can have syntax errors if you get it wrong. Because you're really communicating the shape and structure of your code using these indents and deindents. We already saw a nested indent. This is a nested if. So you can put an if within an if. And you can go as far deep as you want to go, like Russian dolls. And so here we have x equals 42. If x is greater than one, we indent once. And then with this next thing we do, these are at the same level of indent.
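A minimal sketch of the nested if being described, assuming Python 3. The value 42 and the two questions come from the lecture; the printed strings are made up for illustration.

```python
x = 42
if x > 1:
    print('More than one')       # one indent: inside the outer if
    if x < 100:
        print('Less than 100')   # two indents: inside the inner if
print('All done')                # deindented twice: ends both blocks
```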
But now we see an if, and it has to indent further. So this is like two in, eight spaces. And then we deindent back. Actually, we deindent back two. And so if you'll watch this, and you take a look at how this works, it runs to here. Oops, back up. Comes in here. The answer is yes, x is greater than one. Prints this. Is x less than 100? Well, it's 42, so the answer is yes. So it runs this, and then it kind of continues back to there. And you can also think of drawing boxes
around this. This is one if box. And then within that if box, there is another if box. And again, it's the indent block up to, but not including where the deindent happens. And this here is like two backwards deindents. So it ends two blocks. So two blocks are ended by where we place this. We could move this in, or we could move this out. We could have it all the way into here. We could have it to here or here. And where we put that line depends on how the ends of these blocks are going
to work out. So one form that's a one branch if that we just saw, but then you can also have what's called a two branch if. And the basic idea of a two branch if is that you're going to come in, you're going to ask a question, and you're going to go one direction if it's yes, and another direction if it's no. We call this an if then else. It's kind of like a fork in the road. And the way to think about it is depending on the output of this question, we're going to pick
one of these two. But if we pick one, the other one's never going to happen. So it's like an either or. We're either going to go one way, or we're going to go the other way. But there is no path where we somehow go through both of them. That doesn't happen. And the syntax that we use for this is what we call the if then else. And so the first part is normal if with an indent. And then we deindent. And then this is another reserved word else with a colon. And then we
reindent. And so this is really end up being part of a whole block here. And the else is the part. This is the part that runs if it's false. And this is the part that runs if it's true. The first branch of the if, the first indented block is what runs if it's true. And the second indented block is the one that runs if it's false. And so here we go. It's just if x is greater than 2, in this case it's yes. We're going to print bigger. And then we're going to be all done.
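A sketch of the two-branch if, assuming Python 3. The question (x greater than 2) and the word Bigger come from the lecture; the value of x and the other printed strings are assumptions.

```python
x = 4
if x > 2:
    print('Bigger')     # runs when the question is true
else:
    print('Smaller')    # runs when the question is false
print('All done')       # runs either way, after the whole block
```

Exactly one of the two indented blocks runs. Changing x to 1 would print Smaller instead of Bigger.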
And so we do one. And so this one did run. And this one did not run. So basically with an if then else, one of the two branches is going to run. But there's no case in which both branches run. And again, you sort of draw these blocks around these things mentally. And in this one, you sort of take it from the if, and the else is really part of the block, up to but not including that print, which is deindented back to the same level as the if statement. OK? Python is actually one of the
more elegant languages, even though after a while this indenting, and when you get too far in, it gets a little bit complex. But this is a good way to visualize this with these indents. Coming up next, we're going to talk about some more complex conditional structures. So welcome back. Let's talk a little bit more about some more complex conditional statements that sort of build on this concept of if and if then else. The first thing we're going to look at is the multi-way branch. And so the idea is it's kind of like the if then
else where you're going to pick one of two, but now we can pick one of three, or one of four, or one of five. And it introduces a new concept called the elif. The elif is another reserved word inside Python. And the way it works is it's probably best to look at this here, where it checks the first one, and if it's true, then it runs that, and then it's done. It doesn't check them all. It's not like it sees that there are two logical conditions. It actually checks them, the first one, and how
you order these matters, as we'll see in a bit. And so if the first one is true, it runs. If the first one is false and the second one is true, it runs this one, and it's done. And if neither of them are true, it falls through, and there's an else clause that is otherwise, and it runs that. So basically, it's going to run one and then skip the other two, or it is going to skip one, skip two, and then run this one. But it only runs, in this case, one of them. But the
important thing is it checks these questions in order. And it doesn't check the second question until it knows the first question is false. So if the first question is true, you're done. You're done with the whole block at that point. So only one of these three is going to execute in that block. So here's sort of some examples of this. If we, for example, have x equals 0, it's going to come down here. x is less than
2. That's true. So it runs this code, and then it skips, skips, skips, down to that. And so it's like this, runs that code, and then skips to the end. On the other hand, if it's 5, then this is false, and it skips that, and it checks this. This is true. It runs this code, and then it's done, skips to the end. It was like false, true, run, end. And then if x is like 20, for example, it runs, it runs, false, false, run the else clause, and you're done. So skip, skip, else, run that
code, and you're done. So in this case, we ran that, and we didn't run that, and we didn't run that. Again, one of them is going to run. They're checked in order. These questions are checked in order, not out of order. It doesn't look ahead. It just checks in the order that you wrote it. You're the one that wrote that order. And so there's a couple of variations on this multi-way. You can have no else. You can have no else, as in this case. And this just means that it might not run any of them.
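A sketch of the multi-way branch, assuming Python 3. The thresholds 2 and 10 match the values tried in the lecture (0, 5, 20, and 50), but the labels and the variable name `size` are made up. The for loop is only here to repeat the same check for three values of x; loops are covered in a later chapter.

```python
for x in [0, 5, 20]:
    if x < 2:
        size = 'small'      # checked first
    elif x < 10:
        size = 'Medium'     # only checked if the first was false
    else:
        size = 'LARGE'      # catch-all when every question was false
    print(x, size)

# With no else, it is possible that none of the blocks run.
x = 50
if x < 2:
    print('Small')
elif x < 10:
    print('Medium')
print('All done')
```

The loop prints 0 small, then 5 Medium, then 20 LARGE. With x set to 50 and no else, only All done is printed.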
In this case, x is 5, so it's not less than 2, but then it runs this one. But if x was like 50, for example, if x was 50, then this would be false, then it would skip, and this would still be false, and it would skip, and neither of these two would run. So if you don't have an else, you're not guaranteed that one of them is going to run, because else is like the catch-all. If the other ones are all false, then the else is the one that runs. Similarly, you can have many
elifs, but this is where it's really important for you to make sure you know what order they're being taken in. So if this is true, it runs. It goes all the way to the bottom. If it's false, false, false, true, it runs this one, and it's done. If, on the other hand, it looks at it as false, go back, go back. If it runs false, false, false, false, they're all false, then it runs the else. This one has an else. This one didn't have an else. They don't have to have them. The key is you
can have more than one of these elifs. So I've got a couple of little puzzles. I'll let you pause right now and look at them. The question is, looking at these three or four lines of code, where x equals something, are there lines of code that will never execute, regardless of the value for x? And I'll let you pause and think about it, and then I'll explain it to you. OK, hopefully you paused and thought about it as long as you liked, but so let me now explain it to you. So we come in here,
and if x is less than 2, it's going to run this first thing. And if x is greater than or equal to 2, it's going to run this. And if neither of those are true, then it's going to run this. Well, the weird thing is, all numbers are either less than 2 or greater than or equal to 2. I carefully constructed this to the point where it would never run this line of code. It is either going to run this one or run that one, but it's not going to ever run this
one. So that was kind of like a weird dysfunctional one that I constructed. This other one is a little different. If x is less than 2, we do this. If x is less than 20, we do that. If x is less than 10, we do that. And if none of those are true, we do that. Well, the problem here is between these two lines. The problem is, if something's less than 10, like 6, for example, it's also less than 20. So even though there might be values for
which this is true, those also are going to have this true. So for something like 6, it's going to run here. And it's not even going to look at this. That's the point. It doesn't even look at this. And so that's, I mean, I could have made this more sensible if I'd have moved this little block of code up to there. So this is where the order in which you choose your questions, the way you put these elifs together, matters because it doesn't look at all of them. It only looks as long as it can.
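The second puzzle can be sketched like this, assuming Python 3. The thresholds 2, 20, and 10 are the ones read out in the lecture; the printed strings and the variable name `result` are made up. The reordered version at the bottom is one way of moving the less-than-10 block up, as suggested.

```python
x = 6
if x < 2:
    print('Below 2')
elif x < 20:
    print('Below 20')     # runs for 6; checking stops here
elif x < 10:
    print('Below 10')     # never runs: anything below 10 is also below 20
else:
    print('Something else')

# Reordered so the tighter question is asked first.
if x < 2:
    result = 'Below 2'
elif x < 10:
    result = 'Below 10'
elif x < 20:
    result = 'Below 20'
else:
    result = 'Something else'
print(result)             # Below 10
```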
As long as it sees falses, then it keeps on going to the next one. But as soon as it doesn't see a false, it doesn't continue. So the last conditional structure we'll talk about is the try and accept structure. If you know any other languages like C++ or Java or JavaScript, you're like, whoa, that's kind of an advanced concept. But it turns out in Python, because of Python's propensity to throw trace backs in situations where you kind of would like to recover, it turns out you kind of have to use it a little more and
a little earlier as you build your programming skill. So the problem is, what if there is a line of code and you absolutely know it's going to cause a traceback? It's going to blow up. But you don't want to blow up. I mean, I don't want to have code blow up. If you're using my autograder and you see a traceback in my autograder, I consider that a failure. I could put out an error like, hey, you entered blank data, or you didn't enter a number. But a traceback, that just seems like
I'm too lazy as a programmer. So we as programmers are supposed to anticipate parts of our code that are potentially going to blow up, based perhaps on the user's input, and then do something about it. And that's what try and except are for. You take this little dangerous piece of code that might break and might blow up, and you surround it with a try, which says, this might blow up. And if it fails, run this code down here. So that's the try. And the except is kind of like, if
you get an exception. And the problem is this. Here's a little bit of code: we put Hello Bob in and we convert it to an integer, and we know from past experience that this blows up. You can't take Hello Bob and convert it to an integer. It's just going to blow up. And here we are. It says, oh, you blew up on line two. That's great. And I'm not very happy with Hello Bob and whatever. But the important thing is your program stops. These other lines, they don't
exist. It doesn't go any further. Remember, the traceback is Python saying, I'm really confused, and I don't know what to do next. So Python is just going to be conservative and stop. So Python stops, and your program stops. No matter how much error checking you put down here, it doesn't matter because it's gone. It's all gone. And like I said, we take this kind of personally because the code that you write is like you being put into the computer, giving it instructions. And if the code blows up, well, that sort of wipes you out. You're
not in the game anymore. You're not able to do anything. So we want to be able to compensate, especially in these situations where we can anticipate an error that might happen in the normal course of your program's execution. And that's what try and except do. So here's a bit of code for the try and except. We just have two little bits of straight-line code. We put a string in here that's Hello Bob, and then we're going to convert it to an integer.
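The two bits of code being walked through here can be sketched roughly as follows. The name astr and the printed labels are assumptions for illustration. This sketch also catches ValueError specifically rather than using a bare except, which is a slightly narrower choice than the lecture's slide may make.

```python
astr = 'Hello Bob'
try:
    istr = int(astr)      # the dangerous line: this conversion fails
except ValueError:
    istr = -1             # flag value telling us something went wrong
print('First', istr)      # prints: First -1

astr = '123'
try:
    istr = int(astr)      # this conversion works, so the except is skipped
except ValueError:
    istr = -1
print('Second', istr)     # prints: Second 123
```

Without the try and except, the first conversion would stop the program with a traceback and the second half would never run.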
This is the dangerous code. This code, in this case, with Hello Bob, is going to cause a traceback. And so we say try, and then we indent the dangerous code. And then we add this little except bit. If it works, the except is ignored. If this blows up, it runs the except. So in this code, it's going to come in. It's going to try this. This is going to blow up. But instead of giving a traceback, it's going to say, oh, I've got an available except. I'm going to run this except code, and
then I'm going to continue on. And so that prints out First, negative 1, because we set this variable istr to negative 1, like a little flag telling us that something went wrong. And then we keep on going. And now we have put in 123, the digits 1, 2, 3. And now it's going to work, but we still have it in a try block. And then this one works. It does not blow up, and so it ignores the except block. So the except block is only triggered when something goes
wrong in the code. It is ignored if nothing goes wrong. So it's like you bought an insurance policy on this line of code. And when things go wrong, your except block springs into action and does whatever it is that you want it to do in the case of an error. So that's a pretty useful thing. You've got to be a little bit careful that you don't overuse it, because if you put more than one line inside the try part, and one of the lines blows up, it doesn't come back to finish the try block. And
so in this one here, we have kind of a simple, silly one where we set the string, and we're worried about some stuff. Well, the print statement's never going to blow up, so it's a bad idea to put it in the try and except anyway. Then we do this conversion, and that's the dangerous part. And in this one, it's going to blow up. And so then it's going to go to the except block, run the except block, and then continue. What it does not do is somehow go back and finish this. So
these lines are gone. So if you look at it like this: this works, the try starts. It prints Hello, this blows up, it goes to the except, it runs the except, and it continues on. It never runs that code. So it's not like you took out insurance on the whole block. Any of those lines in the block can blow up, but whichever line blows up, that is the last line that executes in that block. So in this particular example, the print statement would go out there, and this print statement would
come down here, and you would only put in your try block the single line of code that you think might blow up, because you kind of know print statements aren't going to blow up. So this is an example that's a more common real-world example, where the user is going to type some data, and it's users that get us in trouble. So our program starts by asking the user to enter a number, and we know that this could be dangerous. So we're going to put the conversion from string to integer in a try block, and we're
going to set negative 1 if that's a failure. And then if it's greater than 0, we'll say nice work, and otherwise, well, not a number. So the first time we run this program, out comes enter a number. We type in 42, which is a string. That 42 goes back into rawstr, and it runs in here. This runs. It's fine. That becomes the number 42, so we skip the except block, and ival is greater than 0. We print out nice work, and we skip the else. So it says nice work. On the other hand, if
we run it again, this time the input says enter a number, and we're silly. We enter 42, but in words, forty, F-O-U-R-T-Y. So that's a string, and that goes into rawstr. And then the execution continues. We run in here, and now this is going to blow up. Normally, we would see a traceback right there. But we're not going to, because we put this calculation in a try and except block. It's going to immediately run the except block, set ival to negative 1,
continue on with the program, and see, we have not blown up at this point. And if ival is greater than 0, well, it's negative 1, so we're going to hit the else clause and print out not a number. So we've done error detection. The user entered something that caused a line of our code to kind of blow up, but we put that line in a try and except block, and so we caught it. And so we dealt with that fact. So in summary, we talked about if statements. We talked about else. We talked about
try and except, how important indentation is to mark where blocks begin and end, and elif, and try and except. So up next, we're going to talk about loops and iteration. Hello, and welcome to chapter four, functions. This is the fourth of our basic patterns. We'll get to iterations next. Functions are the store and reuse pattern. One of the things in programming is that we never like to repeat ourselves. If we have four or five lines of code, and we're going to do the same thing later, we don't like to
put the same four lines of code in again. It even has to do with reliability. If you find something wrong with those four lines of code and you've got them in 12 different places in your program, then you've got to find all 12 places and fix them. So we'd rather collect those in one place and then call them and reuse them, and that's the idea of store and reuse. So this is how functions work inside of Python. The first thing we notice is there is a new keyword, def, that stands for define function, and the def
is like an if statement, or the fors and whiles we'll see, in that it ends in a colon, and then it has an indented block, and then the indented block deindents, and that's the end of the function. And so there are two statements that make up this function. The key thing that you have to understand and get used to is this def part is actually not running any code whatsoever. It's actually remembering the code, and that's what I call the store phase. The def creates a bit of code and records it like a macro, although it's much more complex
than a macro, and it names it whatever you chose. You gave it a name. We named this one thing. And so it has a side effect of Python reading or parsing these three lines. It doesn't do anything, but it remembers. These two lines are what you would like to run when you invoke thing. So this is the definition of a function, and this is the invoking of the function. But so this doesn't do anything. So there's no output here from that stuff right there. But then what happens is you invoke it. And this thing looks
like it's part of Python, but you have in effect extended Python with your def statement. And so when it sees thing, it goes up and runs your code. And so out comes Hello, Fun, and then it comes back and goes to the next line. It does the print, so that output comes out. And then it goes back, and this is the reuse part: we get to reuse it. We define it once and we use it twice. Then it runs this code again, and it goes to the next line and it's all done. So this little bit
came out twice. And of course this is really simple so that I can fit it on a page. But you get the idea that I don't want to repeat. This might be 15 to 100 lines of code, and I don't want to type those over and over again. So I say, hey, store these in a name that I choose, and then when I invoke them, bring them back and then run them again, okay? So that's the basic idea. We actually have already been using functions from the beginning. The print is a function, right? Print is
a function. Every time we see print, P-R-I-N-T, parentheses, and then we have some stuff in here, we are calling the print function. This is the syntax with two little parentheses, is the syntax for functions. And so input's a function, type is a function, float's a function, int's a function. All these things are built-in functions that come with Python at the moment that we started. I mean, we installed Python and these came along. And then there's other functions that we define and use, and that's what the def is for. And in effect we can create new
reserved words of our own making that extend the Python language after we define the function. So it's just this bit of reusable code that takes some arguments. We haven't seen any with arguments. There's a little parentheses and we'll see how that works in a bit. We define using the def keyword and then we invoke it. There's the defining phase, which actually doesn't run the code, it just remembers the code. And then there's the invoking phase. You define it once and then invoke it one or more times. Calling the function or invoking the function, we think
of those two things as the same thing; call and invoke are just the terms we use. Most people just say call the function, but invoking is perhaps a more descriptive way to think about it. So here's an example of a function. It is built into Python. It's called the max function. And we can pass some parameters into the max function. So we pass the Hello world string. Now, like much of Python, max knows what kind of thing is being passed into it. And it knows that it's looking for the largest character, the lexicographically largest character. And
in this case, inside the max code, it scans through and finds the largest character. So apparently lowercase letters are higher than uppercase letters, because here we get back a w. And so this is what's called the return value. So this is an assignment statement. Let me clear this and start over. So this is an assignment statement. So it has to evaluate this right-hand side. And a function call is nothing more than something like x plus one. It's something to evaluate. It runs the function code, passes in this argument, and then
this residual value, this so-called return value, which we'll look at in more detail, becomes the result of this little bit in the expression, and there's nothing else. We could have w plus one or something. And then the w is what's stored into big. Okay, so we print big, and big is a variable that has the letter w inside of it. And then we ask what is the smallest, and that finds the blank. And so we get a blank. There's a min function and a max function. Both of these are built in. These are
built-in functions. They're always there for us. Okay, so here is another example of the max function. And so we can think of this as invoking or calling this function as this right-hand side is being evaluated. We are passing this variable in and there's some code in here and it's gonna do some stuff, yada, yada, yada, and then it's gonna give us back a bit of stuff. And that's its return value and then that goes up into the big, right? And so that's how this works. And so this is actually built-in. Built-in or burnt-in, I guess
I can't draw. And so you can think of this as some time a long time ago when Python was being first formed, somebody wrote some code. And it's got some stuff in it. It's got a little loop that reads through all the letters. It has to figure out if it's a string or a list, et cetera, et cetera, et cetera. But this is store, except you didn't do the storing because it's already built-in. And then this is the reuse, store and reuse. So we build these things into Python. They're already pre-built as if before the
first line of your code executes way up here, someone put all this code in for you into Python and created a thing called max for you. Now we've been using this already, built-in functions. We've got type conversions. We've got like the float that takes an integer and returns a floating point version of that. And again, this is kind of like an expression. So it's like, I wanna divide this by 100, but before I do that, I've gotta convert it to a float. So it has to sort of do these function calls as it's evaluating the
expression, okay? Sometimes, like here, we just print out the return value. That's what this is. This is the return value. If you just type a function and a parameter, the parameter can be a constant or it can be a variable. And as we'll see in a second, you can give it as many of these as you like. So you can either just run it or take the result of this: this passes an integer in, converts it to a float, and then puts the float into that. Type tells us what kind of thing
that is and you can use this inside of an expression. And so it's like, what am I gonna do first? Oh, I've gotta do two times this thing. Oh, wait a sec, pause just briefly for a moment, call out to some float code, pass a three into it and then something comes back, the return value, the residual value comes back and then that participates, in this case it's gonna be 3.0, participates in this two times 3.0, okay? And so two times 3.0 ends up being 6.0, et cetera, et cetera. But you can see as it,
it's like, oh, wait a sec, I gotta figure out what this is, call the function, get the return value, and then continue processing this expression. We've also done this with string conversions, partly because, just as an example, the input function always returns a string. And so, you know, here's this string, which could be coming from input, but we'll just take 123. We know that that's a string; it's not the number 123. And if we try to add one to it, we get a traceback, cannot concatenate string and integer,
traceback, but we can convert that string to an integer. And so int can take a floating point number or an integer or even a string, and it says, oh, I know what I'm supposed to do with a string: I'm supposed to look at this, interpret these as digits, and, you know, multiply by 10 and figure out what the hundreds place is and all that stuff. There's a little bit of work to that, and it does it, and then it gives us back an integer, and we say, oh, what is that? That's now the 123, but
it is of type int. And now we can add one to it and get 124. And as before, from this example that we're kind of reusing from a previous chapter, oops, sad face, sad face, sad face: you don't want to try to convert something that doesn't have digits using int, because it'll say, I don't know what to do, and then your program quits, right? You don't want your program to stop with tracebacks, and you can of course deal with that with try and except, but that was a previous lecture.
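To pull the built-in function calls from this part together, here is a short sketch. The names tiny, sval, and ival are assumptions for illustration; the calls themselves (max, min, float, int, type) are the standard built-ins discussed above.

```python
big = max('Hello world')     # largest character in the string
tiny = min('Hello world')    # smallest character, which is the blank
print(big)                   # prints: w
print(repr(tiny))            # prints: ' '

# A function call is evaluated as part of a larger expression.
print(2 * float(3))          # prints: 6.0

sval = '123'                 # a string, as input() would give us
print(type(sval))            # prints: <class 'str'>
ival = int(sval)             # convert the digits to an integer
print(type(ival), ival + 1)  # prints: <class 'int'> 124

# Converting something without digits fails, so guard it.
try:
    int('hello bob')
except ValueError as err:
    print('Conversion failed:', err)
```

Each call runs the function's code, and its return value takes the place of the call in the surrounding expression.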
Okay, so up next, we're gonna talk about building our own functions, not just using the predefined ones. So welcome back, we're gonna continue and start talking about building our own functions. So again, we use the def keyword to define a function and then later we're gonna invoke this and there's a bit to it. We are defining the name of the function and in effect we're extending Python and creating new predefined things that we can use except it's our code. It starts with a def keyword, has some optional arguments which we'll see in a bit, that's
what the parentheses are, and then the name, and function names follow the same rules as variable names. Then you have an indented block, whatever code you want to do, and then you deindent, and that marks the end of the function. The key thing here is this is not calling, it's not invoking, it's not executing; it's remembering, it's storing, it's figuring things out. So here is the output of a program that defines a function but then doesn't use it. So this is a sort of broken program. So here we go, we start
x equals five, print. You don't have to have all the defs at the beginning. The def can come wherever. So, you know, out comes hello, and then we define a function, and Python says, oh, you wanna make a new thing here. So I'll make a new thing. It's kinda like a variable in a sense, and then it copies this stuff up there and says, later you probably are gonna wanna use this, so I'm gonna remember it. So it doesn't do anything there. No output comes out. Then it says print yo, and
out comes yo, and then it adds two to x, so x is now seven, and then it prints x, and there's seven. These print statements never ran. They never ran. Why? Because we did not invoke them down here. We defined them but didn't invoke them. So let's take a look at how you invoke a function, right? You define it and then you use it. Sometimes you define it once and use it once, but more commonly you define it once and use it more than one time. Again, the store and reuse pattern. The def
is the store and the invoking is the reuse. So here's just a slightly different version of that last program, and now it's gonna actually invoke it. So x equals five, print hello, so out comes hello. The def produces no output, right? But because there's a deindent here, that is the entire blob of code that is part of print_lyrics. So it prints out yo, and now we're gonna invoke. This is the call. We're gonna call the function. Now the function goes up, let's clear this. Somewhere down to here. Now this
like suspends at this place. It's like remember to come back to here when we're done. Go up, run this code and then come back and then continue on. So it like leaves like a breadcrumb of where it's supposed to come back to. And then it runs and then the print lyrics of course produces the two lines of output. And yeah, that should probably not have, that day should be up there. And then x equals x plus two which makes it seven and then prints out seven. Okay, so this is the invoke or call the function.
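A rough sketch of the program just walked through: define the function, then invoke it. The name print_lyrics and the values of x come from the walkthrough; the two lyric strings inside the function are placeholder assumptions, since the transcript does not spell them out.

```python
x = 5
print('Hello')

def print_lyrics():
    # Store phase: these two lines are remembered, not run, by the def.
    print("I'm a lumberjack, and I'm okay.")
    print('I sleep all night and I work all day.')

print('Yo')
print_lyrics()   # reuse phase: run the stored lines, then come back here
x = x + 2
print(x)         # prints: 7
```

Deleting the print_lyrics() line gives the earlier version of the program, where the function is defined but its two print statements never run.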
You defined it and then later you called it. Now, in addition to just calling and returning, we can pass parameters in. An example of a parameter is in the max function, where we have to say, this is the thing I want you to find the maximum of, the largest thing. And part of it is, in the whole store and reuse pattern, we have a few lines of code, but sometimes we wanna do ever so slightly different things in different invocations. And so we use the arguments to subtly adjust: finding the maximum
is a general thing, but specifying what thing to find the maximum of makes a function much more useful and reusable in a lot more situations. So arguments are the things we pass in. And for the functions that we're going to build, we define them on the def statement: we say def, greet, the name of the function, and then these are the arguments, the things that are coming in. Now, this lang variable in a sense only exists during the life of the function, and it represents sort of a placeholder. It's not a real variable in the
same sense; it's a placeholder that refers to how you touch that first parameter that's sitting in there. Okay, so lang is our first parameter, whatever it is. We don't need to see this part down here right now. All we know is we're gonna make a function and we're gonna take a parameter, and this lang is the placeholder that tells us what that parameter is, okay? So within the function, we're gonna check to see if the language is Spanish; if it is, print hola. Else if the language is
French, print bonjour, otherwise print hello. We have a very highly simplified language translation system here. So the def, of course, does nothing, except it remembers that and defines the concept greet. So that comes down and now we're gonna call it, and that says go look up the thing that I defined called greet. If you hadn't put this def in, greet would give you a traceback, but because you extended Python and named it greet, it runs: it suspends the code here, starts up here, and lang is now an alias to en. So
now we can run, if that is es, else if, oop, I'm getting it all wrong now. Right, so en comes in as lang, and we come into the code. It's not es, it's not fr, so in the else, it prints hello, and then it comes back to the next line. And then we call it again, and this time es is lang, and so it runs this code and prints hola, and then the next time we call it with fr, and it prints bonjour. You get the idea. So this is a placeholder so that on successive calls or invokes,
or invocations, of the function, we can get at whatever the programmer put in as that first parameter. And so we are saying in this definition, we are ready to receive a first parameter. Please call us with a parameter, and then we will be able to do something slightly different for the different values. So this is a reusable function that prints hello in three different languages, and we tell it what language at the moment that we're actually invoking it. So that's putting stuff into the function. Now getting stuff back out is the concept
of returning. The return statement is an executable statement that does two basic things. The first thing that it does is it finishes the function. Now this is a one-line function so that's kind of redundant, but when Python hits the return statement, it doesn't continue on to the next line. It just returns. That is the end of the invocation of that particular function. But even more importantly, it takes a parameter. You can say return without a parameter and it will stop the execution of the function kind of like a break does
for a loop. It's kind of a break for a loop. Get out, we're done. Don't run that next line. Get out. But it also allows the specification of what you want as the residual value in an expression. So we're doing a print and then we're saying greet. And what's gonna show up here is whatever this function does in its return statement. And so that prints hello. We call it again and it prints hello again. Okay? And so basically the return statement, I call this the residual value. It's like what shows up here when the function
is all done, and it's the string hello. We call functions that return values fruitful functions because they produce something, but you don't have to. You can just say return. Or you don't even have to have a return statement. It goes to the last line of the function and it does a return automatically after the last line of the function. So here's a little bit of a rewrite of our little language program. We are going to create a greeting program. We're gonna take the language as the first parameter. And instead of just doing a print
statement, which is what we did before, this is now more like a function because it takes some input and produces some output as a return value rather than just printing. It's a little tacky for a function to print. And so here we return hola, bonjour, or hello based on the language. So now we say print greet en. So it runs the code once, lang is en. And then it runs this code and the residual value is hello. So it says hello Glenn. And similarly, when it runs this code, it passes es in as lang, it
runs through and it runs this statement. If there was more statements, it still wouldn't run them. As soon as this return runs, that says that this bit right here is now hola. And the same with French, goes in, runs again, out comes the return statement, and then bonjour, Michael. So you see how we can control as we're writing the application, we can control as we're writing the function what the residual value that we want to see in whatever expression is calling us. Sometimes we have returns and sometimes we don't have returns. So, if you think
of the method as a function, well, so if you think of the max code that we talked about before, we can kind of see that somewhere inside that max code, there's a return. And that's how it communicates the w back to us. So we pass in, as the argument, Hello world. It comes in as a parameter, inp, and it's gonna loop through this inp somewhere. It's gonna loop over and over through inp. And then at some point it's gonna figure something out and tell us what it wants to send back to us with a return statement. And
so the w comes back and gets assigned into big. You can have more than one parameter, and there's just an order. The first one and the second one, three and five. So three becomes a and five becomes b, and away we go. So we just use this to add two numbers, and so three plus five is eight. So you can have as many as you like, and the order matters. And if you do things like tell it you want parameters and then you don't give them to it, then that'll become a traceback and it will
blow up. We'll also talk about optional parameters later. So you don't have to have return values, and that means that you simply don't call return with a value. And return is always implicitly happening after the last line of the function. So that's kind of the basics of how functions operate. But I don't want you to get too excited about writing functions. Some programming classes are like, gotta write a function, gotta write a function. Functions, to be clear, are a very powerful mechanism. And as we write programs of 150 or 200 lines of code, 1000
lines of code, 10,000 lines of code, the concept of a function is really important. We would go crazy if we didn't have functions. But if you're only writing 20 lines of code, forcing yourself to write a function is kind of pointless. So don't worry about maybe the lack of urge to use this. We are calling lots of predefined functions and we will for the next couple of lectures. There will be a time when you go like, oh I'm sick and tired of repeating myself. Oh yeah, time to write a function. So that's why we don't
push functions prematurely. We just want you to know what they are, use them, and at some moment you'll be like, oh I wanna define one. But don't worry; it might take a while before you really wanna define a function. So that kind of summarizes our lecture on functions, and up next we're gonna do iterations. Hello and welcome to chapter five, loops and iteration. Now we're going to work on our fourth basic pattern, after sequential, conditional, and store and reuse: loops and iteration. And this is the one where we teach the computer how to do
things a lot. We can tell it to do something a million times. And so that's where we get the doggedness of computers, or the fact that they're so good at doing work for us, because we can set them off on a task and they'll do it until it's done. So here's a very simple loop. Let's put the coffee over here. The keyword that we're gonna start using is while, for the while loop. We're also gonna use the for later on. And the while loop functions very much like an if statement. The while
starts it and this is just like an if statement. It's a question that leads to a true or a false answer. And then there's a colon and then there's an indented block and then we use the deindent to determine how long the loop is and so this print is deindented so that indicates the end of the loop. And so at some level, what's gonna happen here is it's just gonna run and if this is true, it's gonna run this code and if it's false, it's gonna skip the code and that way it functions like an
if. The place that it doesn't function like an if is after it's run the code once, it goes up and then asks the question again and so you can think of it going back up kind of to the top of the while loop and then re-asking the question like, okay, is this going to run again? And then it's gonna do that some number of times and then it's gonna finish. And so that's the loop, that's the iteration. And we're going to make a variable, we're gonna construct very carefully a variable that we call the iteration
variable and that's n and it's a variable that's gonna change and it's our way of running the loop but not running the loop forever. So let's just run this. We come in, n is five, is n greater than zero? Yes it is, so we're gonna run this code. So we're gonna run this code, we're gonna print out five, then we're gonna subtract one and then we're gonna go back up, go back up and ask the question, is n greater than zero? And the answer is, since it's four, the answer is yes. So n, it runs
again. Then it prints out four, subtracts one again, checks, prints three, subtracts one again, prints two, subtracts one again, prints one, subtracts one again. Now n is zero, and so it comes back up, and this question has now become false. So it's gonna take the exit, so it's gonna come down and run this line right here. Then it prints blast off, and we can kind of print out the residual value of n just to sort of prove to ourselves that it ran until n was no longer greater than zero, and then zero
was the final value for n and we carefully constructed this n, n equals, oops, go back. We carefully constructed n, we set it to five, then we carefully subtracted one each time through the loop and then we're using that to control when to exit the loop. And so you could think of this loop as, for now, running five times, true, true, true, true, true, and then false, finally. So this question was true for a while and as long as it was true, the loop ran and then when it turned false, the loop stopped. And so
this variable that we construct to control the loop was called the iteration variable because it tells how many times this loop is going to run over and over or otherwise known as iterate. So this is a badly constructed loop with an iteration variable that we didn't do very well. And so if we take a look at this, we start it with n five and then this is greater than zero so it's true so it runs it and then it runs it again and then it's still greater than zero. So you can pretty much see because
we're not changing n, this is gonna be true, true, true, true, dot, dot, dot, dot, forever, true, forever. And so this is an infinite loop and it's just gonna run until your computer runs out of battery or you hit the button. This is the kind of thing where you often see your computer spinning like a spinning beach ball or some other indication that your computer's super busy. It's in some kind of a loop, really tight and it's running something and it's using up all of the processing resources of your computer. That's an infinite loop. And
so the problem is we did nothing with the iteration variable. Now here's a different loop. And so this one demonstrates a different idea. So in this case, we start out with n is zero and it comes in here and is n greater than zero? Question mark and the answer is false. So it skips it, it doesn't run these lines of code at all. And so this loop doesn't run at all because it comes in, asks the question, it says no and then it skips right around it. So never run, never run. And so this actually
is, sometimes you write a while loop on purpose like this, not quite as simple as this one. But the idea is this emphasizes that these loops are what we call zero trip. They are not even guaranteed to run once. They're gonna run maybe zero times. And in this respect, it functions exactly like an if statement, right? Meaning the first time through the loop, if it's not true, it's just gonna skip right by it. So there's a couple of ways of getting out of loops. In this case, I'm constructing an infinite loop because remember the kind
of definition of an infinite loop is if this is gonna stay true. Well, true is the constant true. So this is gonna run forever. And what it's gonna do is it's gonna prompt with a little arrow and then let us type and read whatever we type into the variable line. And then if the line is done, we're gonna break. Now break is an executable statement. And if you hit the break, it exits the innermost loop out to the place beyond the end of the loop. So when this runs the first time and we say hello
there, line is not done, so it prints it. So it prints out hello there and then goes up. And then we type in again, we type finished. And so it doesn't, it's not done, so it prints it. So now comes that print statement. Then we type in done and now this becomes true. And it comes out and runs the code beyond the end of the loop. The key is it doesn't go back. It's like once you've done a break, that loop is done. And so you look at basically the block that is the loop. So
here's kind of the loop block. And then the break goes to the line after the end of the loop block. And you can think of this as sort of like just a hyperspace jump. There is nothing really, this could be literally hundreds of lines with if statements. And you could be running and doing all kinds of stuff and running and doing all these things. And these things could run all kinds of ways, right? The point is as soon as you hit a break statement, however much stuff is down here, however much stuff is up here,
it exits to whatever the next line is beyond the end of the loop. Continue is another loop control statement, but it works differently than break. So break says get out of this loop. Continue effectively says stop this iteration. We're done with this iteration. And so continue says go back up to the top of the loop. And so here we read a line. If the first character is a pound sign, line sub zero, if that first character is a pound sign, we're gonna skip it. And
this is a way for us to make like little comments in our typing. And then if the line is done, we get out, and otherwise we print it. And so that's why there is no printout here for the pound-sign line: it comes in, this is true, and it goes back up, but then it comes back and prints out the next one and does another thing. And so continue keeps the loop going, whereas break ends the loop. And so again, it's the same kind of notion, that you could be doing all kinds of complex things. Wherever you're
at in this loop, you hit continue and it doesn't go any further. It goes back up and asks the question again, and it might exit the loop at that point. But this one here is while True, so on its face this is an infinite loop that I've constructed. It is not really an infinite loop, because at some point the break gets us out of the loop. And so it's an infinite loop with a break to escape it. And that's another common way to construct a loop. So these loops
that we've been drawing so far, the ones that use while as their keyword, are what are called indefinite loops. And that's because they keep going until a break hits or until the condition becomes false, that is, they run as long as the condition remains true. So all the ones we've done so far are easy to look at and know that they look pretty good and they're probably gonna finish. But sometimes, if they're long and their exit or termination conditions are a little more complex, it's not clear that they're
really gonna terminate. And so we can use while loops for a lot of things, but for most of our looping, we're gonna use what are called definite loops. And that's what we're gonna talk about next. So definite loops use the for keyword. And the idea of a definite loop is it's going to loop through some set of things. It might be a set of lines in a file, it might be a set of characters in a string, it might be a set of strings in a list of strings. But whatever it is, it's sort of
gonna run a finite number of times depending on the thing that it's looping through. And we like this. And it's an easier way to construct a loop, and we actually don't have to manage the iteration variable ourselves: the for loop includes a mechanism to construct the iteration variable for us. So definite loops iterate through the members of a set. So here's a very simple for loop. And so you see the for keyword, and in is also a keyword. And the iteration variable is something we put right here. This i is like
an assignment statement. And i is going to take on successive values. So i is going to be five the first time through the loop. Then i is gonna be four the second time through the loop. Then three, two, one. So i is gonna be assigned five different times to five different values. And then the loop is going to run. It's gonna run once with five, once with four, once with three, once with two, and once with one. And so for this block of code, we have contracted with Python to say: execute it five times with these values of i. i
is that iteration variable. i is the thing changing through each iteration of the loop. Okay? And so that's why this prints out five, four, three, two, one, and then when it's done, the loop finishes. So this is a much more direct syntax for looping five times and setting the iteration variable. You combine it all into this one thing, right? So it's quite nice. And you don't have to be going through a list of numbers. There's all kinds of things that we can iterate through with for. And by the way, while
I'm sitting here, I named my variables friends, because that's a list of strings, and friend, which is the iteration variable. I'm using singular and plural because it helps you read it. Python doesn't understand singular and plural. So just because you say friends doesn't mean Python knows it's a list. Python does know it's a list, but it doesn't know that from the name of the variable I've chosen. That's your basic mnemonic variable warning. These are cool variable names, but I don't want you to get confused by them. So you can loop through a list stored in a variable. So we're
gonna take this list of three strings and stick it in friends. And so friend is gonna iterate through that. So the first time through, friend is gonna be Joseph. Second time through, it's gonna be Glen. Third time through, it's going to be Sally. And so that just says run this code, the indented code, three times, and each time the variable friend takes on a successive value that's in the friends list. So it says happy birthday Joseph, Glen, Sally, and then we come out of the loop and we print done.
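The loop just described might be sketched like this. The exact greeting text and the spelling of the names are assumptions based on how they sound in the lecture; the variable names friends and friend are the ones mentioned above.

```python
# A definite loop over a list of three strings.
friends = ['Joseph', 'Glen', 'Sally']

# friend takes on each value in friends, one per iteration.
for friend in friends:
    print('Happy birthday', friend)

# This line runs once, after the loop has gone through all three names.
print('Done')
```

Renaming friends and friend to anything else would not change what the loop does, since Python takes no meaning from the singular and plural names.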
So if we try to draw a picture of what this is really doing, the for loop is actually doing a whole bunch of stuff that we would have to do with maybe separate statements in the while loop. First it decides how many times to run the loop. So it's answering the done question, which way do we go? And it is also then moving i ahead. It's managing the iteration variable, and it's initializing it too. If you go back to the while loop, we had n equals five, while n greater than
zero, n equals n minus one. So we had like three lines to control the loop and manage the iteration variable. But with a for loop, we don't have to do that. That's all taken care of. So when you use a for loop, it asks: are we done? No, we have five things to work through. Well, set i to the first one, run it. We're not done, because we've got more. Set it to the second one, third one, fourth one, fifth one, and now we're done. And that is all
handled in a single line of code, and that includes the iteration variable and the set of things through which we are going to iterate. I really like the word in. Mathematically, it reminds me of set theory, where you say this is a member of this set, or the for each. Math isn't important here, but if you do know math, the vertical bar means such that, and there's is a member of this set, and that kind of stuff. I'll erase the math stuff so we don't over math.
But it's like for each of the values in the set, five, four, three, two, one, run this loop, setting the iteration variable i to the members of that set. So in reminds me, for those of us who are math oriented, of a really nice concept in mathematics. Now, you could think of this as a looping structure, and this is pretty much how it actually runs inside the computer, right, where the for loop initializes i, runs this thing five times, and then exits. That's one way
to think about it. You could also think about it in a somewhat more abstract way, and think of it as all we're really doing is we have a contract with Python that says i, we're supposed to run this code five times, and i's supposed to be five, four, three, two, and one. So you could imagine this might be what's going on. The for loop sets i to five, runs our code. The for loop sets i to four, runs our code. The for loop sets i to three, runs our code. The for loop sets i to
two, runs our code. The for loop sets i to one, and runs our code. All we know is our code ran five times, and by contract, each successive time, we're getting a different value for i, and the value for i is taken from this set. And so these are just two ways to think about it: the looping picture is how it really works, but this is also logically the contract that Python is making with us.
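To put the two styles side by side, here is a small sketch of the same countdown written first as a while loop and then as a for loop. The printed word and its spelling are an assumption; the values five through one and the names n and i come from the discussion above.

```python
# Indefinite loop: we initialize, test, and update n ourselves (three lines).
n = 5
while n > 0:
    print(n)
    n = n - 1
print('Blastoff!')

# Definite loop: the for statement initializes i, advances it,
# and decides when we are done, all in one line.
for i in [5, 4, 3, 2, 1]:
    print(i)
print('Blastoff!')
```

Both loops run the indented code five times. In the for version, there is no line that we could forget and accidentally create an infinite loop.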
So up next, we're gonna talk about taking this notion of doing something to a lot of items, but accomplishing something with that, and I call these loop idioms. So now we're gonna talk about loop idioms, and loop idioms are patterns that have to do with how we construct loops. We have the mechanics of fors and whiles, but ultimately we wanna get something done. We wanna solve a problem with a loop, and often what we have to do is if we have a set of things, whether it's lines, or strings, or characters, or numbers, we're looking
for something like the largest, or the smallest, or we wanna add them up, or something like that. And so we can't just say add them up, we have to say go through each one and do something to each one, and somehow achieve adding them up. And the pattern that we're gonna follow is we're gonna have this loop that's gonna do all, run once for each thing in some chunk of data, and then, but we're gonna set something at the beginning, and then we're gonna do something to each one, and then at the end we're gonna
kinda get the payoff, we're gonna get the result. So if we're doing sort of summing things, we're gonna have a running total, and so this'll be like t equals zero, and then this'll be t equals t plus the thing value. And then, but this is not the real total, it's the running total during the loop, but at the end it is the real total. And so we're gonna look at what you do before the loop starts, during the loop, and then what you get after the loop, and how you can use that. So we're gonna
use this loop, which is just gonna loop through a set of six numbers, over and over again in these examples, right? So we're gonna do something before the loop, we're gonna do something after the loop, and then we're gonna run the loop some number of times, and in this case thing is our iteration variable, because I'm using non-mnemonic variables now. So it's gonna run with 9, 41, 12, three, 74, and 15, so it's gonna run and print these things out. So it runs this loop six times, and away we go. Now this loop does nothing except print stuff
out. Of course I like to do that first, is always print things out, to make sure that sort of my brain is functioning. So, to kind of understand how these loops work, I'm gonna ask you to function as a program, and I'm gonna show you some numbers in succession, and I want you to mentally figure out what the largest number is, but more importantly, think about how your brain is solving this problem of what is the largest number, given that I'm only gonna show them to you one at a time for a little while, and
your brain has to do something, and imagine I was gonna show you thousands of numbers, I'm not, but imagine I was. How would you organize yourself in a way, so that for like an hour and a half, you could sit here as I showed you numbers, and you keep track of the largest number that you've seen of all the numbers, okay? So here we go, here's your first number, second number, third number, fourth number, fifth number, sixth and last number. What was the largest number, hmm? What was it? Well, it wasn't too hard, it was
74, but that's not the question. How did your brain arrive at 74? So here's all the numbers, if I was showing you all the numbers, and asked you what's the largest number, your eyes would have sort of gone, zer, zer, zer, zer, zer, and then you got to 74, and you wouldn't do it in any particular order, your eyes would just like see the 74, and it would just throw smaller numbers away, and it would move really quickly to what the answer is. Even if there was several hundred numbers on the screen, your mind would
sort of move fluidly wherever it felt like moving, and then arrive at it. And probably what it would do is, it would do something like, you know, kind of move like this, find this, and then sort of check to make sure that it's okay, and then say like, okay I got 74, I'm done. That's not how computers do it, that is not how computers do it. They do not move fluidly, but they are highly dedicated, they're gonna do something, gee, gee, gee, gee, gee, gee, gee. 74, but how would you construct a loop to achieve
this? So let's take a look. You could create a variable called largest so far, and this holds the largest value that you've seen in the list so far. I haven't shown you any numbers yet, so we'll just set this to negative one to get us started. So now, we see three, and we're like, oh, that's bigger than negative one, it's our first number, so it's the largest we've seen so far, right? Great. 41, oh, that's bigger than the largest we've seen so far, so we'll keep it. 12 is not
bigger than 41, so we're not gonna keep it. Notice this keeping thing. Nine is not bigger than 41, so there's no point to keeping it, 74 is bigger than 41, so we'll keep it. Is this the largest number? We don't know, we don't know until we're done. 15, not better than 74, so now, we're all done, and hooray, hooray, hooray, we have the largest number. And we had this variable that we kept the largest number that we'd seen up to this point, and then when we know that we're done at the end, then that becomes
the largest. So if you look at all the numbers, keeping track of the largest so far, at the end of all the numbers, the largest so far and the largest are the same thing. And so that's how you get this idea of something you're doing during the loop is not really the answer, but by the time the loop is done, you will have the answer. And so here's a bit of code that does this. Use it with our numbers, right? So let's take a look. So I have this variable called largest so far, I set
it to negative one, before the loop. Remember, there's a before, an after, and the loop in the middle, and before it's negative one. So now the_num, remember underscores are okay, that's my iteration variable. If nine is greater than largest so far, well largest so far is negative one, so that's true, so this code's gonna run. So we're gonna remember the new number. So this is nine, and so nine ends up in largest so far, and then we print it out, and so largest so far is nine after we saw the number nine. Then
we do it again. Do it again, so now 41 comes in, and is 41 greater than nine? The answer is yes it is, so we're gonna run this code, copy 41 into largest so far, and then print it out, and largest so far is 41 after we saw the number 41. Now we're gonna run the loop again with 12, okay? And you get the idea, I hope. Is 12 greater than 41, which is the largest we've seen so far, and the answer is no it is not, so we skip. So the largest so far stays
41 even though we saw 12, meaning we're sort of like ratcheting up, but we never ratchet back down. So we run it again with three and 41, and we skip this, and then the largest so far is 41 even though we just saw three, and now we see 74 is 74 greater than 41. See, we never are looking at all the numbers. We're only looking at the window on the numbers of the current number that we're looking at. So is 74 greater than 41? The answer is yes, so we run this code, and then we
capture the 74. So we just saw 74, and it is the largest so far. And then we run it again with 15, but 74 is our largest so far, and so it skips. So 74 remains largest so far after 15, and now we're finished, because we just ran the last thing. The for loop takes care of everything, and jumps to this print statement, and says, afterwards, largest so far is 74, but at this point, it's also the largest, right? So largest so far became largest when our loop finished. So that sort of gives you
this notion of how we construct something at the beginning, some kind of thing that we're gonna do over and over and over again, and then something at the end. And we put some print statements in just so we can watch it and see what's going on. So coming up next, we're gonna talk about some more loop patterns, some counting, totaling, averaging, and finding the smallest number. So now we're gonna look at some more patterns of the different things we can do at the top of the loop, in the middle of the loop, and at the
bottom of the loop. And the first one we're going to do is counting. Now we're gonna count the number of things in our list. Now we could just inspect it and see six, but you'll have for loops where you're reading through a file or scanning through some data. And so for the notion of counting, you have to assume that you don't really know how many there are, dot, dot, dot, that there's gonna be a lot more than just six. But for now, we're just gonna do six, and we're gonna count
how many things we see in this loop. And the pattern is simple. You set a variable, zork, to zero at the beginning. With a mnemonic name, we would often call this variable count. And now we're gonna run this loop six times. One, two, three, four, five, six. And each time through, we're just gonna add one to zork. So zork starts at zero, then it goes one, two, three, four, five, six. And we're gonna print it out. So we see the nine and zork is one. See 41, zork is two. And in the end, zork is six when we
see the 15, and the for stops. And we print out afterwards. And six is then the ultimate count that we got. So that's very, very simple. The pattern is: set it to zero at the beginning, add one to it each time, and if you run that enough times, then this is how many times that happened. And in a sense, it's how many times this line ran, right? Sometimes you put this in an if statement, et cetera, et cetera, et cetera, okay? Now, we can do the same thing to get a total. And the way
the total works is you compute a running total of the items that you've seen so far. And at the end, the running total in effect becomes the total. A better variable name for this would be like sum or total or something. But zork, I'll use zork again. So you set zork to zero, and it starts up. The total we've seen so far is indeed zero. And then we're gonna run this one, two, three, four, five, six times. And thing is gonna be the iteration variable. It's gonna take on the successive values. And
each time through, we're just gonna take our running total and add to it the thing we've seen. So we see nine, and the running total is nine. We see 41, and the running total becomes 50. We see 12, the running total becomes 62. We get a three, it becomes 65, we get 74, running total is 139. How many more, how many more are we gonna see? We don't know, it could be a million, could be one. Oh, it's only one. We get a 15, our running total is 154. And what's true at any moment here
is that the running total is right as of what we've seen so far. Now, when we're done, the for loop quits for us, and afterwards 154 is indeed the total. So we have the running total while we're in the loop, and after the end of the loop, we have the actual total. So it's not very difficult to convert this to the average, because we've calculated the count, and we've calculated the running total, and now we're gonna get the average by simply dividing those, okay? So, now this time I've used mnemonic variables. Don't get
confused by this, mnemonic variables are just friendly names I chose so you can read the code more easily. I am not communicating to Python in any way by naming these count and sum, but count and sum are nice names. Okay, so I set count to zero and sum to zero at the beginning, and then I'm gonna run this loop six times, one, two, three, four, five, six, and each time value is the iteration variable. I
count: every time I run the loop, count equals count plus one, sum equals sum plus value, so I have a running count and a running total, and they show up here, one, two, three, four, five, six, and then the running total. Then at some point we do the last one and the for loop jumps out, six and 154 are the count and the total, and then it computes the average, sum over count, okay? So that's just, again, a pattern of something in the beginning, something in the middle, something in
the end. Another kind of thing we tend to do in loops is we look for things, we hunt for things, and so this is where we have an if statement inside of a loop, and of course, I've created a silly, simple thing. In this code, I am looking for large values that are values that are greater than 20, and again, don't think of this as just six numbers, but I'm looking for all the values, and I'm gonna print them out. So, you know, it says before, it's gonna run this, nine, well, if nine's greater than
20, it's false, so it goes back up. 41, true, so it prints out 41, then goes back up. 12, false, goes back up. Three, false, goes back up. 74, true, so it runs this, so out comes that little print statement, and it goes back up. And then 15 is the last one, and that's false, it goes back up, and the for says we're done, and then we do afterwards. And so this is just the notion of having an if statement inside of a for loop, where we're sort of picking, or choosing, or selecting, or looking for something
in a large set of things that we're looping through. We can also say I wanna know if a particular value is there, and so we're gonna use a Boolean variable. We've talked about integer values like one and 42, floating point values like 98.6, and string values like hello world, that have quotes around them. This is a fourth type, a fourth kind of variable. It's called a Boolean variable, and it only has two values. It has true and false. Matter of fact, the questions in these if statements produce Boolean values: value equal equal three, that
is returning a true or a false based on the value of value. There's a mnemonic confusion there, but I'm gonna make a variable called found, and that's a decent name for a variable, so don't get hung up on that. Found is gonna indicate to me whether or not I found a three in my list. Before the loop starts, I'm gonna set it to false, because we haven't found anything yet, so found equals false, and so at the beginning of the loop, found is false, before the
loop starts, found is false, and now we're gonna run this loop a bunch of times. Nine, is that true? No, skip. 41, is that true? Skip. 12, skip, right, so nine, 41, 12, and found has remained false, because we haven't done anything to it. But now in comes a three, and this becomes true, so it runs this code, so found becomes true, and then we print it, and you'll notice that when we see a three, we get true. Then it runs again, we get 74, the question is false, 15, the question is false, run, run, run,
quit, and the residual afterwards is true, and in fact, if you didn't know any of this, and you don't print that out, all you know is that afterwards, we loop through all those things, and we know that there was a three in there. That's what we're doing, so we searched all of them, we checked for threes when we found a three, and you can see basically that the found remains false until it flips to true, but then there's nothing to set it back to false, there's nothing in this loop that's gonna set it back to
false, so once it sort of catches the three, then it remains true for the rest of the loop, and then it just finds its way out. Now if you wanna think about it for a moment, ask yourself, how might we make this loop more efficient by putting a statement right in here? Think about a way to, once you've found it, and it's true, there is sort of no reason to keep on going, so what would you put there to perhaps make this loop, to look for threes, just to tell you whether or not there was
at least one three in there, how to make that more efficient? Just think about that. Okay, so now let's look back at the largest value loop that we started out with, right? And so if you think about this, let's give it a rough look here. Largest so far is kind of like a running total, but it's our hypothesis of what the largest number is. And we have this if statement that says, if the number we see right now is greater than the largest so far, then capture it, right? Take whatever
number we saw and capture it. So when we see a nine, it's better, we capture it. We see a 41, it's better, we capture it. We don't capture this, we don't capture this, we capture the 74, and we don't capture the 15, and that's how we do it. So you could think of this as better. When the number we're looking at is greater than our working hypothesis of the largest, we grab it because it's better. So this line right here is the grab line, grab it, okay? So then the question is how would you modify
this code to teach it to find the smallest value in this list of numbers? Think of it as: you have a starting number, and you have a sort of what's better in this grabbing notion. How could you do that? Take a look. Okay, so let's do a couple things. If you look at this if statement, what's better is now if the number is less than. But then we should probably change every largest
so far to smallest so far, right? Matter of fact, that's what this is. We've changed the word largest so far to smallest so far, and we've changed the greater than to a less than. Is that gonna fix it? Give you a second to look at it, pause if you need. The answer is, of course, no. It's not gonna find our smallest number. So if we run this code, we set the smallest so far to negative one and it starts out negative one. We run it, and it's nine. Is
nine less than negative one? No, it's not. So after we see a nine, the smallest so far is negative one. Now we're gonna run 41. Is 41 less than negative one? No, it is not. So the smallest so far is still negative one. As a matter of fact, it isn't the smallest so far anymore. Just because we named it smallest so far doesn't mean it is the smallest so far. It didn't work out so well. And so you see that none of these, because they're never less than negative one, do anything, and we claim that
afterwards, the smallest we've seen so far is negative one. And that is because, of course, negative one is smaller than any of the numbers that we saw. So how could we fix this? Well, if we started the smallest so far with some like arbitrary big number, then it'd be better. So if we made this 100, whoops, come back. If we made this be like 100, that'd be good, because the first time through the nine would be less than 100, so we would capture the nine and then the rest of the loop would work just fine.
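Here is a rough sketch of that experiment, so the two starting values can be compared directly. Wrapping the loop in a function named smallest_with_start with a start parameter is purely an illustrative assumption; the lecture writes the loop inline.

```python
def smallest_with_start(numbers, start):
    """Naive smallest-so-far loop; start is the initial guess (hypothetical helper)."""
    smallest_so_far = start
    for the_num in numbers:
        # Only grab the number if it is better (smaller) than our current guess.
        if the_num < smallest_so_far:
            smallest_so_far = the_num
    return smallest_so_far


numbers = [9, 41, 12, 3, 74, 15]

# Starting at -1: nothing in the list is smaller, so the guess is never replaced.
print(smallest_with_start(numbers, -1))

# Starting at 100: every value here is below 100, so the loop finds the real smallest.
print(smallest_with_start(numbers, 100))
```

The second call only works because we happen to know that every number in this list is below 100.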
But then what if we didn't know how big these numbers were? As a matter of fact, the largest so far wouldn't have worked if all the numbers were negative. Think about that. We just assumed they were positive, and so we kind of wrote lazy code that assumed all numbers were positive. That might not be a good assumption depending on the numbers that you're dealing with, right? So maybe 100's a good number to start with, or maybe like 1,000, or 10,000, or like some number with lots of zeros in it. How big should we make this?
And the answer is we're kind of solving this problem the wrong way. And the thing we really want to do to solve the problem is to just accept the fact that if we're looking for the smallest number so far, that the right hypothesis is the first number. And if we just knew what that first number was, the nine, that would either, because it's the first number, we know that it's both the largest so far and the smallest so far, as soon as you see the first number. But we don't know here before the loop starts
what that first number is. I mean, you can look at it, but assume this is just data that's coming from somewhere else, and we don't know it until we start reading it. So we have to construct a loop that deals with the fact that we want to capture the first value as our hypothesis for smallest so far. So how do we do that? Let's take a look. So what we do is we use yet another type. So we have integer, floating point, string, Boolean, and now we have a thing called the none type. None type
is a special marker in that it only has one value. Boolean has true and false. You know, floating point has an infinite number of values and integer has an infinite number of values, but none type has one value, None. None is a constant, written with a capital N. The difference is that we can check to see if we have stored None. None is often used to indicate emptiness. Not non-existence, because smallest doesn't exist until we assign it, but we're gonna assign it a mark, a flag, a marker. Some way to say, oh, this
is not even a number. It's nothing. And you can do this. So this makes a variable called smallest, and then it puts None in it. It sticks it right in. It's not the string none. It's like a special type, okay? So that actually captures the notion that before the loop starts, the smallest number that we've seen so far is None. We haven't seen any numbers, okay? So, then we come in and we have an if statement. And we have a new operator called is. Is is stronger than the double equal sign. And so if
smallest is none, that becomes true. It runs this case. And so then what it does is it copies this first value, which is nine, into smallest. And so we see a nine and a smallest so far is nine, which is the first value. And again, we're assuming we don't know what the first value is before the loop starts. So we use the first iteration through the loop as the moment where we capture that, okay? So smallest is the value, and then we print it and we go back up. And now it runs again with 41.
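Here is a rough sketch of the loop being walked through, not the exact slide. The list of numbers is an assumption: it is just the values mentioned out loud, in the order they come up, and the variable names are illustrative.

```python
# Hypothetical data: the values mentioned in the walkthrough, in the order
# they come up. The exact list on the slide is an assumption.
numbers = [9, 41, 3, 74, 15]

smallest = None  # marker meaning "we have not seen any number yet"
for value in numbers:
    if smallest is None:
        # First iteration only: capture the first value as the hypothesis.
        smallest = value
    elif value < smallest:
        # A later value beat the current hypothesis, so take it.
        smallest = value
    print(smallest, value)

print("After", smallest)
```

Flipping the comparison to greater than, and renaming the variable, would give the same marker technique for the largest so far.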
Smallest is nine at this point, and nine is not None. There's only one thing that's None. So smallest is not None. So this is false, so it skips over here. Then it asks the question, is the value we're looking at, 41, less than smallest? Well, smallest is nine in this case, and this is 41. So that's false, so it skips that and goes on. So we see 41, we don't take it. And then you can see that this will never become true again. This is pretty much false for
the rest of the iterations of the loop. So it just is gonna run down here and ask this question. And at some point, we see a three, and we run this code, we capture it. We see 74, we don't capture it. We see 15, we don't capture it. So then the for loop finishes. At the end, we have the smallest. And actually, this would be a good technique for the largest as well. Because it really is just a technique to put a marker in
this variable so that we snag that first number, or first whatever, as we read and parse through them. So the is and is not operators are very useful in Python. You can think of them as like the double equal sign, they're asking a question. Blank is blank returns a true or a false. But is is stronger. Double equal asks, are these things equal in value? Is asks, are these the very same thing? So just as an example, if I were to ask is zero double equal to 0.0, it would say, yeah, that's
true. But then if I say zero is 0.0, that would be false. So that's because these two are the same value-wise, but they are not the same type-wise, one is an integer and one is a float, so they can't be the same thing. So is is stronger than equals, meaning that the two sides have to be the very same object, which means matching in type as well as in value. And no conversion is done. And so that's just a very strong test. Don't overuse is. If you're dealing with numbers or even strings, use double equals, don't use is, because sometimes it gets a little confusing. So use is sparingly. I tend to only use is
on Booleans and on None. I don't use is on integers, and I don't use is on floats, and I don't use is on strings. Just None or True and False. And is not is also an operator. So you just say blah, blah, blah, is not None, or blah, blah, blah, is not False. Okay, so we've been looping around and doing loops and loops of loops. We looked at the indefinite loops, the while loops that kind of run for a while. And we looked at break and continue as a way to either
escape completely from the loop or go back up and discard the current iteration of the loop. We looked at None and at Boolean variables. We looked at for loops, definite loops, where you've got some kind of a set or a list or some kind of sequence that you're looping through. And then the concept of loop idioms where you do something at the top, something to each item, and then you sort of get a benefit at the bottom. And so that gets us through iterations. Hello and welcome to chapter six. In this chapter we're gonna talk about
strings. And chapter seven is the payoff chapter. So up to this point we're still learning sort of basic building blocks, and actually we're gonna write a real program in chapter seven. So just learn this, and the payoff's in chapter seven. So we've actually been using strings from the very first lecture, because if you print Hello World, well, that's a string. And so we've been doing things. This slide here is all review. We use plus to concatenate strings. We use print to print them out. Print's just a function that takes as a parameter something, strings, integers,
et cetera. We can put digits in strings, but we can't add a number to them. By now you've figured this out, but you can use things like int to convert the strings to integers and then print things out. So we've been doing this for a while, and we've been talking about strings all along. Now today what we're gonna do is just get into strings in more detail. We're reading the input data with the input function. Input returns us a string. And if we want to input a number, we have to run some kind of conversion,
like we have to call int on this data that we read from input before we use it as a number, you know? And so there's things that we've gotta do, and we've been doing all these things in programs so far. But if we look in a little more detail inside strings, we can index each character within a string. So each character has a separate position and a separate index. And basically the letters have positions, and the positions start at zero, and the best way I can explain how to remember this is the elevators. As we used in one
of our examples a long time ago, elevators in Europe start at zero, and so strings start at zero as well. Turns out in the old days there was some efficiency in the notion of lists of things starting with zero. These days the efficiency isn't the issue, but there's a certain elegance to starting at zero, even though intuitively you might think the first character in the string would make most sense as sub one, but it's not, it's sub zero. But just remember that. And so we have this operator called the index operator, and it's
square brackets. So fruit is a variable that contains the string banana, and then fruit sub one is the character that's in position one. Now that actually is the second character. I'll keep reminding you until I get tired of reminding you. So that assigns the letter A into the variable letter. Of course that's either a well chosen variable name or a badly chosen variable name. And the thing that goes inside the brackets can either be a constant or it can be an expression, so this is
x equals three, and then fruit sub x minus one, well that means two, which is position two, which is an n, and so that gives us an n. So the index is an operator, and you can add this bracket syntax to the end of a string variable. You can't index beyond the length of the string, so if I say zot sub five, well there's only three characters, which means zero, one, two, so sub five doesn't work, and of course we get a happy little traceback. So you have to be careful when you're starting to
pull stuff out of strings, although some of the things allow it, some of them don't, and you'll kind of get used to that. We can ask how long a string is, and so we use the len function, we pass the string variable into len as a parameter, and len gives us back the length of the string, not the last position, so the positions are zero through len minus one. So len is just another function, and we've been doing functions now for a while, you pass in a parameter, and
then len does some work, and out comes six, and that goes back into x, because the function has a residual value, it just happens to be a built-in function. And so, you know, somewhere deep inside Python, there is code that takes this, and somebody wrote a loop, or looked something up, and then returned a return value, and sent back a six to go into our x variable. And so a function is there, and like I said, we've been using this for a while. Another thing we tend to do is to loop through strings, and look
at strings, and dig data out of strings. Python is excellent for doing sort of these kinds of lookups. And so we can write a simple loop, we can write a while loop with some kind of iteration variable, like index, and given that we know that these positions are zero through five, we can set this to be zero, and then write the while loop, while the iteration variable is less than the length of fruit, and remember, this is six, so it's gonna be zero through five. Zero through five are the values we wanna generate, and
then we can look up one at a time, pull out fruit sub index, so fruit sub zero, fruit sub one, two, three, four, five, and then print out the position and the letter, and then add one to index, and this'll run six times, zero through five, and out we go to produce this output right here. And so that's one way of looping through strings. That is a basic indefinite loop, but we carefully construct an iteration variable, and work our way through that data. The other way is to
use a definite loop, a for loop, and generally when we are able to use a while loop or a for loop, all else being equal, we generally prefer a for loop. And so here we have the for keyword and the in keyword, and so for letter in fruit, well that just says letter is our iteration variable and it's gonna take on the successive values of each of the characters. So this loop is gonna run six times, and letter's gonna be B-A-N-A-N-A, banana. I'm always terrified when I make these slides that I'm gonna misspell
banana, because somehow I always think that there are two Ns, somewhere, I don't know. It's not one of my favorite words to spell. I actually didn't choose banana as the constant. The authors who I borrowed the textbook from, Allen Downey and Jeff Elkner, they used banana, and so I'm still using banana. So some of the jokes in the book aren't my jokes, they are the jokes of Jeff and Allen. So here are just two equivalent loops, so you can have the while loop or the for loop, they sort of both do the same thing, they both just
print the letters out one time through, each of these loops runs six times, but you can see how the definite loop, the for loop, is a prettier loop, unless you truly somehow need to know this number as you're going through the loop. But if all you're doing is going through and you wanna touch in order each of the characters of the string, you then simply write a for loop because it's more elegant. The less code you write, the less chance there is for you to make a mistake. And so the
fact that these are equivalent, this is three lines, well, two lines of a loop and this is four lines of a loop, that's twice as many places as you could make a mistake because you might misspell index or something. I mean, why even make an iteration variable if you don't need to make an iteration variable? And so we can do things that harken back to our iterations and loops chapter, anything that you can do in those loops, like look for the largest letter, look for the smallest letter, search to see if a letter exists
or say count the number of A's in the word banana. And so that's what this is doing. And so we have a counter. So again, we do something at the top of the loop, we're gonna do something in the middle of the loop and then we're gonna print it out at the bottom. So we start our counter at zero, we're gonna loop through all the letters and then if the letter is A, then count equals count plus one. This is kind of a pattern in a loop where we're noticing something and instead of like we did it
earlier where we said found equals true, well, we're gonna count them this time. So if we have one, we'll get one, if we have zero, we get zero, however many there are, but there should be three because the count line is gonna run three times and there's three A's in banana. And so this is a conditional within a counting loop. We've seen counts, we've seen conditionals in loops in prior chapters. And so again, I love the in keyword in Python. It again reminds me of set notation in algebra. If you're a math whiz, if you're not, don't worry
about it or maybe you will be a math whiz and you'll say, whoa, this set notation reminds me a lot of the in keyword in Python. So again, it's for, iteration variable letter. Again, don't get stuck on letter. I just happen to be using it here, in banana. And that is, for each character in the string banana, run this loop once, changing the variable letter to be the particular character that we're pointing at. And so for is taking care of a lot for us, right? And so this is sort of this
really smart for loop. The for loop is both deciding how many times to run the loop, in this case six, and it's advancing the letter. So advance, print. Decide whether you're done, advance, print. Decide whether you're done, advance, print. Decide whether you're done, advance, print. Decide whether you're done, advance, print. Decide whether you're done, advance, print. Decide whether you're done, I am now done because, whoop, you know, we're done with that particular string. And so, you can think of the for as, you know, magically doing all of this for you, of both deciding how
long to run the loop, when you're done or not, and moving down through all the successive letters in the loop. So up next, we'll talk a little bit about additional things that we can do with strings. So now we're gonna dig into strings a bit, and we've already looked at how you can pull out a single character in a string, and now we're going to look at what we call slicing, and that is pulling chunks of a string out. And again, we're gonna use the square bracket operator, and so S, and the way I
say it is sub, S sub zero through four, that's how I read this. S sub zero through four. So I look at the colon as through, I look at the brackets as sub. And so, S sub zero through four says, start at position zero, and then go up through, but not including four, right? So we don't include four. So that's probably the hardest part of this, up to but not including, up to but not including. This seems counterintuitive, kind of like starting at zero seems counterintuitive, but after a while, you'll kind of get used to
it, and there'll be situations where you're writing code and you're like, oh, that's why that works better. But just for now, remember it, up to but not including. It's just kind of a little thing. We'll come back to when that is useful for us. Six through seven, well that ends up being starting at six, up to but not including seven. So that's why we only get the P out. Now one thing that Python is pretty nice about, is it's not gonna give you a traceback. We might expect that six through 20, well there aren't 20 characters,
but it's like, ah, that's okay, we'll just let you stop at the end, and we'll start at six and go all the way to the end. Oh, no traceback. It's almost disappointing sometimes when Python doesn't trace back when you think, ah, you know, you're so obsessed about everything, I would have traced back in that situation. But hey, I guess if you're allowed, you're allowed. And so there we go. Now you can omit the first or last number. If you omit the first, it assumes the beginning of the string. If you omit the
second, it assumes the end of the string. And why you would do this, I don't know, but that's from beginning to end, so it's the whole string. So whole string, eight through the end is thon, and up to but not including two is mo, all right? So you get that. So just, that's pretty simple. Once you've got the rest of slicing and the rest of string indexing, the notion of eliminating the first or the last of the colon expression, the first or second of the colon expression, I think is actually pretty intuitive, pretty nice. We've
already been concatenating strings together. We overload the plus operator, and there is no space added. Remember when you're doing print, x comma y, this comma does turn into a space, but that's not what's happening here. There is no automatic space being added, and so we see hello and there, and it comes out as hellothere with no space. And so if we want one, we just have to concatenate the space explicitly if we wanna put spaces into strings. You might think it'd be more convenient for concatenation to add a space, but
then you have to think, well, what about if I wanna concatenate things and not put the space in, then I'd need a different operator. So that's kind of why it works that way. We can use in differently as a logical operator, so we're using it as an iteration structure in for loops, but we can also use it as a logical operator in if statements. So it's kind of like the double equals, or not equals, or less than or equals, or something like that. It's like those guys. And so, and it returns a true or a
false. Is n in fruit? So that's a question, and the answer is true. Is m in fruit? No, the answer to that question is false. Is nan in fruit? Doesn't have to be a single character, can be more than one character, and the answer is true. And then you say something like, if a in fruit. And so this is the logical value that returns a true or a false, and yes, we found it. So that becomes true in this particular case, so it runs the little indented bit. So in is an operator in this particular situation. In
a for loop, in means something different. And we'll use in for other things as operators, as logical operators, coming up in a bit. You can compare strings, and this has to do with the character set of your computer, the character set that Python is using. But in general, it is lexicographically less than and lexicographically greater than. Uppercase and lowercase are a little weird. I think when we used the max function earlier, the way my computer was set up, uppercase was less than lowercase. But in general, it's bad to
assume case, but there is a deterministic way to sort strings. You can have something equal to or less than or greater than, and all those operations work naturally, the less than and greater than. You have to kind of be aware of uppercase, lowercase, things like where punctuation sorts less than or greater than letters. That's kind of unpredictable and depends on the character set of your computer, and it's something you just play with and figure out if you're doing sorting stuff by first name and last name. As long as the case is kind of the same, you
know, if you were sorting Chuck and Glen both with uppercase first letters, they'd sort right, and both lowercase would sort right, but if you were to do instead lowercase chuck and uppercase Glen, then that would sort weird, as a matter of fact, the G would come before the c. And so case can mess this up, but in general, other than case and special characters and other things, it technically works. It's just hard to kind of predict it. A lot of what we do is use the string library. And so the strings are objects,
and we'll talk later about what that really means. And objects have these things we call methods. So a string object has some built-in capabilities. And here is a string object, greet. And because greet is a string object, if we said type, we'd see that it was a str. Dot lower says, hey, dear string, make a lowercase version of yourself. It's like calling this function lower and passing greet into it. And then give that back to me. Now it doesn't actually change greet. It gives me a
lowercase copy. So here I have Hello Bob with an H and a B uppercase. And what I get back in zap is hello bob all lowercase. And note that greet is unchanged. So Hello Bob is still there. And you can even call these methods on constants. So this is a string object, quote, Hi There, quote. Dot lower, that says call lower on this bit of string and give me back a lowercase version of it. And so it prints out the residual return value. This is like a function call. A method call is a kind
of special form of a function call. It's a function call where you say the thing dot the function name rather than the function name with the thing passed in as a parameter. Like len, for example, is non-object oriented. You know, len of x, that's non-object oriented. Object oriented would be x dot something, parenthesis. But, so constants are objects as well. And taking the lower gives us back lowercase, hi there. And so that's just one of the things that you can do in the string library. These are built into string variables and constants. They're just always there. As soon as
you make a string, they're part of it. And when you do type and it says it's class str, we'll get to object oriented, don't worry. We'll get to object oriented. Okay, and so you can do things like use the type. This used to say type str but now it's class str, this is kind of more of an OO thing. The word class is an object oriented concept. But it is a string. And you can use dir, and of course there's extra stuff up here. And this is showing all the different methods or
capabilities, things we can do to strings. So, you know, x dot something, parenthesis. Well, what can we do there? This is all of those things that we can do to x's that are built in and come with x's, I mean come with strings when we build them. And Python of course has great documentation online for all of these string methods and what they do and how they work and why they work the way they do. And so here's some of that Python documentation. We'll look at a few of these. But, you know, don't hesitate to
say Python string upper case and then we're like oh yeah, yeah, that is upper, right? And so here's a few things that we can do and use, some of the ones I use a lot. And we'll look at each one of these things. So, the find operation says find me a substring within a string, right? Find me a substring within a string, so find me the first na and give me back the position. So that gives me back two. And then I can say go find a z in there. Well, there's no z and so
it returns me negative one. So that's what find does. So we're gonna use this kind of stuff a lot, and we do a lot of looking in strings. Converting things to upper or lower case, there is an upper method and a lower method. So greet dot upper, that means that nnn is the uppercase HELLO BOB, and greet dot lower, that means that www is the lowercase hello bob, and greet is unchanged. Greet is still Hello Bob with upper and lower because each of these methods basically says I'm going to give
you back an uppercase copy or a lowercase copy of the original thing without changing the original thing. Search and replace is super useful, super duper useful. And it's pretty clean. Here we have a string and we use the replace method. In this case, we're passing in the old and the new, Bob and Jane, replace all Bobs with Janes. And so that takes this Hello Bob and turns it into Hello Jane. Again, greet is unchanged, and it does more than one replacement. So this says go find, well, let's clear that. This says go
find all the o's and replace all the o's with X's. And so it goes and finds two of them and then out come two X's. And so that really is a replace all, it's not just replace the first one but replace all of them. White space, as we'll see, is a big deal. And white space is not just blanks, although those are the most common thing, but it's also sort of non-printing characters like tabs and new lines and other kinds of things. And so we have a number of different ways to strip white space. So here we've got
some spaces at the beginning and spaces at the end. And we print out, we do an lstrip and that throws away the spaces at the beginning. That's the left, so that's the left strip. If there's nothing there, it doesn't harm it. rstrip means throw away all the blanks on the far end. And then strip says go take both sides, and so that pulls out all the spaces on both sides. This will be useful because sometimes when you're tearing stuff apart you'll find yourself getting extra spaces.
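The string methods talked about so far can be tried together in a short sketch like this one. The values follow the spoken examples (banana, Hello Bob), but the variable name padded and its exact contents are assumptions made up for illustration, and this is not the code from the slides.

```python
# Values follow the lecture's spoken examples; names are illustrative.
fruit = "banana"
greet = "Hello Bob"

print(fruit.find("na"))            # position of the first "na"
print(fruit.find("z"))             # -1 means "not found"
print(greet.upper(), greet.lower())
print(greet.replace("Bob", "Jane"))
print(greet.replace("o", "X"))     # every "o" is replaced, not just the first
print(greet)                       # the original string is unchanged

padded = "   Hello Bob  "          # hypothetical string with spaces on both ends
print(repr(padded.lstrip()))       # spaces removed from the left only
print(repr(padded.rstrip()))       # spaces removed from the right only
print(repr(padded.strip()))        # spaces removed from both sides
```

The repr calls are only there so the remaining spaces show up inside quotes when printed.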
Sometimes at the beginning, sometimes at the end. And it can be tab or new line. It's sort of white space. Space that is kind of not visible, clear. That's what white space is. It's like if you were on a piece of paper it's the white space. It's like X, well that's not white space but right here, oh that's white space. It's any character that doesn't cause printing to happen. If that makes any sense. It's any character where nothing would be printed. And there are characters like that. There's like even bell characters but we don't use
them very much. We can ask very conveniently we can say hey, does this line start with a particular string? And so line, this is a question, gonna return a true or false. Does this line start with please? And the answer is true, it does start with please. Does this line start with a lowercase p? No, it does not. And so again you'll use this in the context of if something colon some block of text. It's a block of code. So we can combine these things to tear stuff out. And so let's assume that what we
wanna do in this case is we wanna take a from line. This is from an email format from a mailbox. And this has got the from with a space and the person's email, with the at sign and the school they're from, and a space and then the rest of the stuff like when this mail was sent. And this is a real mail message from this guy Steven from the University of Cape Town in South Africa. It's really Steven and this really is the first line of a file that you'll get to know pretty well by
the rest of this course. Hi Steven, you, we like you. You are the example in my class and have been for a long time. People actually who know Steven have taken this class and they're like Steven, I saw your picture in the class. So if you're ever in Cape Town at the University of Cape Town say hi to Steven and tell him that you saw him in the class. But okay, that's neither here nor there. What I really want to do is I want to extract his school from this email line. Okay, so now eventually
we will do things like the data will come from files but this is still chapter six. So this is the data we're going to search through. And so we can say, hey, let's go find the at sign. Search up through the string and find the at sign. So data.find at sign, and give me back where that's at. That's position 21, counting from position zero. Then what we're going to do is we're going to look for the next space after the at sign. So we're going to start at the at sign, tell find to start here
and look forward until it finds a space. So data.find, look for a space starting at the position of the at sign, and that'll be position 31. So 31 is what we get in the space position. So now what we have is we have in two variables, we have the position of the at sign and the position of the space after the at sign. Now what we really want is this bit right here. So we have to go one beyond the at sign and we don't want the space. So we say we're going to
use slicing here, data sub at position plus one up to but not including the space position. Oh, smiley face, because we didn't have to say space minus one because that is up to but not including. And so we get that little bit right there. The thing that's at the position of the space is not included. So that's already a little benefit of the up to but not including. And so when we print out this variable host, we get exactly just the school
that Steven works at and probably went to as a matter of fact. I don't know if he went there or not. So this is just kind of a note for non-Latin character sets. All programming languages from the 60s on tended to work in what we call the Latin character set, which the United States and England and Europe and lots of places use, this ABC character set and the special characters. But it's really common to want to use different characters. And so if you're going from Python two to Python three and we'll talk about this a
little later when it matters more, luckily we're in Python three and so one of the big things about Python three is that all the internal strings are Unicode. In Python two, there was sort of some confusion as you went between strings, and this is just a little bit of code, and so I'm putting in here some Asian characters, this is Korean actually, into X and I ask what kind of a thing this is and that is a string, and then there's this Unicode constant, and this comes from Python two. If it's a Unicode
constant, it's still a string in Python three, whereas in Python two, if you put international characters into X, then it was a string, and then there was a separate kind of a constant called a Unicode constant and it was a different type, and there were ways that you had to mess with these Unicode variables as you did things like read them from files and put them back into files and did other things. So it was much more difficult in Python two but we're in Python three, and Python three natively understands non-Latin character sets, international
Asian character sets, Spanish, French character sets and so this is a good thing for Python three and this is one of the real benefits of using Python three, and as we start doing stuff where we're exchanging data with the outside world, this will come into play and I'll have to show you how to use it. There were weird things that you had to do in Python two, it just makes a lot more sense in Python three, okay? So we've talked about strings, we learned about the string, we're converting it, we've done a whole bunch of stuff and this
is again, we're not yet doing anything super useful, we're learning sort of how to like slice and dice even though we're sort of not making the meal yet. Up next, we're gonna talk about files, we're gonna read some data and we're gonna slice and dice and use all the things in the next chapter that we've learned up to this point. So see you in a bit. Hello and welcome to chapter seven. This is the chapter where it all really starts to pay off. We have been learning bits and pieces and doing little two lines, three
lines, four lines of code to learn the basic building blocks of Python and learn some of the syntax and define lots of terms but now we're actually going to start doing something. So if you look at what we've been doing so far, you know, we're inside this little computer and, you know, Python says what next and you give it a command and it does something and you do something else and it does something and you do this three or four times unless you write a loop and then it goes
like, you know, 10, 20 times and that's it. And then maybe we write a thing that reads something from our keyboard, gives us something back and then we print a few things out and so we've been pretty much using the keyboard, the screen, the CPU and the memory. That's kind of where we've been living. And while it's important to talk to the keyboard and the screen, the real world is things like databases that live out here, files that live on our systems and, you know, connecting to the network and
reading data from the network. And so that's what we're starting to do right now is we're starting to be able to work outside kind of our code and create things that are permanent. And so initially we're gonna work on files. We'll later talk to databases and the network and other stuff, but for now we are talking about files. And so really, we're stepping out a little bit and reading things that are permanent and creating things that are permanent. The kinds of files that we're going to talk about mostly
are text files and you can think of these as a sequence of lines in a file that are easily read by Python. You've been making text files all along. You're, you know, hello.py. That file's a text file too. You're using a text editor to create that file. You put your Python commands in a file, you run those files and that's what it is. And so a file can be thought of as a bunch of lines, you know, one, two, three, four, five, six, seven, a blank line here. That's possible and, but the reality is, is
that these are actually just lines and we have a special character called the new line that we'll talk about in a second. So to read a file, you have to call the open function. And open returns what we call a file handle. Open doesn't actually read the file. Open makes it possible so that you can read the file. So the parameters to open are, it takes one parameter that's required, which is the name of the file, another parameter that's optional, whether or not to read it or write it. If we're reading the file, it doesn't
harm it. You can read it over and over. If you write it, it actually, if there's already data in that file, it truncates it and writes something. And we're not gonna really write files, we're mostly gonna read them. And so open, sort of, you pass it in a file, it gives you back this file handle and then you have a variable in which you store it. I often call it fhand to be mnemonic. You'll see my code, I use fhand all the time to indicate that that is a file handle. And so if we were
to run this in an interactive mode, we'll open mbox.txt and that is a function built into Python and then it gives us back a handle. It does not give the data. You can kinda see this when we print out the file handle using the print statement. It doesn't print the lines that are in the file. The lines that are in the file are sort of out there. There could be like, you know, 10 million lines for all we know, lines in the file. The handle's like a little opening outside of your program and you can
talk to the file by opening it, then you can read stuff, you could, if you're writing the file, you can write stuff and then you close the file to shut the handle down. But the handle is a thing that allows you to get to the file. It is not the file itself and it's not the data in the file, it's just a wrapper that kind of allows you to get to it. So this, if you print it out, it's like, that's the file we opened, we're reading it and then encoding has to do with the different kinds of character sets,
which we talked about at the end of last lecture, the Unicode character set, et cetera, UTF-8 is a great character set. It's probably the most typical character set that you will run into, although you can have different character sets for files, but most of them are UTF-8. So, of course, this is Python. If you make a mistake and there's a file that doesn't exist, we get a traceback and it blows up. We'll show you in a second how to deal with that. Now, the newline character is an important part of file reading and
in strings, we can put the newline character in by this backslash n character. And the backslash n is the character that indicates that we're supposed to go to another line. Go to a newline, go to a newline. And so we have, what is this? Well, that's a backslash n, that's a backslash n. And so, if we print it out, we print it this way, we see that the backslash n is in there. This is how we type it. We actually type backslash n to Python to indicate that we're supposed to put that there. But if
we do a print statement, it actually interprets the backslash n, so the backslash n causes this movement to the beginning. Now, the print actually, at the end of this, adds another backslash n. So, the backslash n that we put in by putting it into the string is that one. And then print always puts a backslash n at the end. There's actually a way to override that backslash n behavior by putting something on the print statement, which we'll talk about later. Now, it's important to note that the backslash n is one character, right? And so, even
though this x backslash n y prints this, and then print adds another new line to go down to here, if you ask how many characters, what is the length of this, well, it's only three. That's because the x is a character, the backslash n is a character, and the y is a character. So, it's a three character string. So, the backslash n is a character like all the rest of the characters, but it's only, we encode it by typing backslash n. It's called an escape, where the backslash is the escape. Backslash n is a way to say new
line, because we can't see it. It's a way for us to encode in a string this non-printable character, this invisible character. The white space, it's part of white space. So, as we're reading through the file, we can think of it as a sequence of lines, and we can read these a line at a time. We can also read them a character at a time if we want. And so, but it's more common to say read this line, read the next line, read the line after that, et cetera, et cetera, et cetera. But the way to
best think about this, it doesn't really matter. You can think about it as lines, and we will in most of the programs that we write. But realize that the way when we see this, we see it like this, it comes back to the beginning, it comes back to the beginning. There's a character in the file. At each of these points to say go back to the beginning. It's like hitting the enter key on your computer. And that is a new line. So you have to think that in the file, in order for your text editor
and Python and everybody to know where the lines end, you put new lines in the file. And that's another character. So, you know, this looks like an empty line. This line here looks like an empty line, but really it has a single character, and the character is a new line. And it turns out that in a bit, we're gonna need to keep track of the fact that every line is ended by a new line. So up next, I'm gonna talk a little bit about how to read files in Python. So we're gonna find that there's
a number of different ways that we can read through the file. But the most common way that we're gonna read through the file is to treat it as a sequence of lines. And we're gonna use the determinate loop, the for loop, to do this. And so what happens here is we get back this handle, that opens the file and gives us back the handle. That handle xfile is the variable I named, I just named it xfile. That's not the data. But it is a sequence. That file handle represents to Python a sequence that
we can potentially walk through and then get all the lines. And it's the simplest, most beautiful, elegant way to read all the lines in a file. We use the for loop and we have an iteration variable. This is going to take, when we talk about the file, cheese is gonna be the first line, then the second line, then the third line, then the fourth line. So it's like going through a string, but you're going through a file now and you're getting it line by line. So that's each line. I just picked a variable named cheese
so you didn't get confused. Later I'll call this line. But Python doesn't know anything special by naming that variable line. Okay, and so this is, it's the for and the in. And so I read this as for each line in the file handle xfile. So run this loop one time for every line and then print it out. So it's actually really quite simple, okay? Other languages like C or C++ or other languages, they have to write while loops with end of file conditions and all kinds of things that make this very difficult. But this is
one of the prettiest things that Python has. It's a very, very pretty thing. Okay, so let's talk about what we might do. And we're going kind of back to iterations now. What if we wanted to count the number of lines in a file? Well, this is a basic loop counting pattern. So we open the file and then like in all these loops, we do something to sort of prime the loop to get it started, set a variable count to zero. And I'm gonna use the variable line that's gonna go through each of the lines in
the file for line in fhand, down the file. And it's gonna run this loop once for each line in the file and the variable line is gonna change. But all I'm gonna do is add count equals count plus one. And so that's just like from counters, that's just how you count. So every time we see a line, we're just gonna add one to the counter. We're not printing the line, we're not even looking at its data at this point. And then when the loop is done, however many times it has to go, out it comes
and we print out line count equals count. And so if we open mbox.txt, this is gonna do all this work and then print this line out and say line count is 132,045. So this is a little five line program that shows you how to count the lines in a text file using Python. Again, simple and elegant and not too much syntax for you to have to learn. Now it's also possible to read the file as a series of characters all in one go. Read the whole file in. Now you gotta be careful depending on the
size of the file, this is gonna lead to a string variable with a lot of data in it. Now if it's 100,000 characters, that's actually kind of a small thing. But if it was 10 million lines, that would probably not be good. You'd wanna read it one line at a time and process each line and then do something. But mbox-short.txt is a small little file. So we open it and we get back a file object, file handle object, and we call the read method. And that says go through and read all the text and give
it back in one big blob, one big string, and I'll put it in inp. And so that's where you have a line, a new line, a line, a new line, a line, a new line. So not really lines, it's just a sequence of characters with new lines in there to punctuate them. And now you can split that, later we'll see how to split that into separate lines if you want. Now I picked a file that was short, and so this inp variable now has a string in it and I can use the len function, pass
a string into the len function, it says oh 94,626 characters. That's kind of a small little file. And perfectly okay to read it all in one go. And so now I say just print the first 20 characters, that's beginning to up to but not including 20, and so it shows the first 20 characters of that little file is a from line, because this is a mailbox file. Now let's say we're going to do a searching, and we did this loop where you're looking for something. And so we're going to search for lines that have a
prefix of from, okay? That's what we're going to do, and we're going to print those lines out. So there's lots of lines in this file, line, line, line, line, from, line, line, line, line, from, right? On and on and on. And we only want to show these lines, the ones that match, right? That's what we want to do. And so we are going to write an open statement and then we're going to loop through, and we're going to ask the question, if the line starts with from, print it. So sometimes it's going to skip, skip,
skip, skip, and then it's going to run it, and skip, skip, skip, skip, skip, and it's going to run it, skip, skip, skip, and then it's going to run it, okay? So that's the basic idea, and then it'll finish when it's all said and done. And so this is like a criteria, this is like a search. We're looking for lines that match the string, that have the string from as their prefix. Now, when we look at the output of this, it's kind of weird. We see kind of these little blank lines that show up. Blank,
blank, blank, blank, blank, blank, blank. What's going on here? What's going on? So let's take a quick look. The problem is, is new lines. Well, I mentioned that the file has new lines in them. And so when you do the for loop, it doesn't throw the new lines away. As you might expect, it would be kind of nice if it did, but it doesn't. It actually shows you when you read, it reads that first line up to and including the new line and gives you that back as the variable. So that is the first new
line. So that means it's going to go down. And then the print statement actually adds another new line. So that's the second line of the file has a new line at the end of it, and the print statement adds another new line. So if we take a look at the code, there is a new line, oops, come back. If we take a look at the code, this variable line has a new line in it, oops, where am I at? I'm in the wrong slide, there we go. Yeah, this is what I want to do. If
we look at the code, there's a new line in here, and then the print adds another new line. So the print adds a separate new line. And that's how we get two new lines. The print statement's new line and the new line from the file. Here's how we fix it. And you're going to write this code a lot because when you're reading text files, you end up with a new line. And often you don't want the new line. But thankfully, as we saw in the previous chapter, there is a nice little function in Python for
strings called rstrip that allows you to throw away white space. And to review, remember white space is anything that doesn't print. And this new line is a non-printing character. So rstrip gets rid of it. So it's a way to get rid of white space. And rstrip does it from the right end. So it's the right end of the string. And so if we just are going to loop through all the lines in the file, we say line equals line.rstrip(). And then this variable no longer has the new line at the
end of it. We have our little if statement. And if we print it, then this line, the data has no new line in it. So the print only goes down one. And so now we have single spaced output. And so you're going to be doing that a lot. It's really common to read through a file and then just strip the new line or any trailing space off the end of that. Now, there's a couple of ways to do a loop like this. And let's just think of this
as we're looking for a line, a file with lots of different lines in it. And we want to ignore all the lines except some say good lines. And we want to do something with those good lines or the lines we're looking for. Needle in a haystack. This is like searching for a needle in a haystack. So if you look at this code at high level, we're going to loop through everything. And then we're sort of picking which lines are. And these are the good lines down here. Now, often we have a bunch more code that
we want to do. And we're not just printing them, but we're going to do a lot of code. So sometimes you actually structure the loop a little bit differently. And so the way to do it, and this is going to do the exact same thing, it's just a little different way of thinking about this loop. So the top part is the same. We're stripping it. And what we're doing here is everything's the same here except we add this not. If the line does not start with from, that's the translation of that. If the line
does not start with from, continue. So basically we have a skipping pattern. So the lines we're not interested in, we skip. So we come down, we skip a lot of lines. Choo, choo, choo, choo, choo. And then we find a line that's good, and then we fall through. So this is the good code. And then we have all the other good code that we want to do to that line. We have that showing up down here. And so there's just two patterns that are two ways to do the exact same thing. So another way to
select the lines that we're interested in is to use the in operator. So we talked before about the in operator and how that works. So we're basically gonna use the continue skipping method. So we're gonna read all the lines, these first few lines. If uct.ac.za is not in the line, skip it. And so this is gonna print out all the lines that have the string uct.ac.za in them. And so you see this is the output of the program, dot, dot, dot, dot, dot. Sometimes you'll have programs that want to read different files. Often I give
assignments where I say, show me how this program runs on the short file, and then show me again how it runs on the long file, just like this. And so the way we do that is to input the file name, instead of making the file name be a constant to the open call, we make the file name be an input. So we just run an input statement, which gives us a prompt. And then we type mbox.txt, and then that shows up in this variable fname. It's of course a string all the time. And we pass that
into open, and then we open it, and then we do the count operation. So if we enter mbox.txt, it counts 1797 subject lines in mbox. And if we give it mbox-short.txt, it says there are 27 subject lines in mbox-short. And again, this is another one of those ifs, and it's just counting, but only counting lines that match a particular pattern. Okay, so now the user can also type bad file names, and we need to be able to deal with that as well. And so we're making a small change to the code. The dangerous code
is this line right here. This line right here is gonna traceback if that file doesn't exist. So what do we do? Well, we're gonna just expand that. The rest of this program is exactly the same. The only thing different is we've got this line. We took out insurance on it, and we know that it might blow up, and so we have it in a try and except block. So here's how the code runs. So, you know, the input runs. We type in a good file name. It comes in here. This works, and so it
skips the except, so it runs the code and prints out the count. So that's the good pattern. The bad pattern is here, we type in a bad file name. It comes in the try/except. This file name is na na boo boo, and it's gonna blow up, so this line blows up. So it jumps down into the except code, prints out, file cannot be opened. So it prints this out. Now this quit is really important, because if we don't put this quit in here, it's gonna continue down here, and that's gonna blow up here, because file handle is
not defined properly at this point. And so what we have is, we have this quit is a special function where it comes in and never returns. So this is a way to terminate the entire Python program silently with no traceback, right? So we put in our own error message, so we look like we're professionals, say we could not open this file, and then we stop. If you don't, it's gonna come down here, and it's gonna traceback, traceback right there, it's gonna blow up. So the quit is useful when you want to
stop executing, because you've detected some kind of an error. So that's a quick zoom through opening and reading through files and doing some patterns. Most of the rest of the programs in this course are going to say open, for, rstrip, look for something, and then do something interesting. That's going to be our loop that we're gonna do over and over and over again. And now we see how this looping and if and iteration and variables are starting to come together, and you can actually sort of do a program that does something useful. But before
we get to too many more programs, we gotta switch a little bit, switch gears and talk up next about data structures, and that is the shape of data, and how we can use more intricate and complex variables to help solve our problems.

Hello and welcome to chapter eight. We're gonna talk about lists in this chapter. Up to now, we've been talking about algorithms. Algorithms are the concept in computer science of using the programming language to express the steps that you want the computer to go through to solve the problem. Read some data, convert it to
a floating point number, check to see if it's greater than 40, do one thing if it's greater than 40, do another thing if it's not, then print out the result. Or open a file, read everything. If the first line starts with something, do something. If not, skip it and then add all the things up. Those are steps, those are a series of steps, and hopefully by now you're getting to the point where you have a good understanding of steps. But there's a whole other side of computer programming and we call it data structures. And data
structures is not the steps, but instead clever ways that you lay out the data and clever ways that you make sure that the data does what you want it to do. And so that's what we're gonna start talking about now. Lists are the first and simplest data structure. Strings are kind of like data structures, but lists are probably our first real data structure that we're gonna think about and design and make use of effectively. But before we talk about what is a collection, we should talk about what is not a collection. So we're familiar
with what a variable is. We know that a variable is a little piece of memory that's got a label on it. And then an assignment statement, you know, sticks a two into x and then x is, and then two is in this little cupboard. And then it goes to the next line and then four goes into x and so the two goes away and the four is there. A key thing is you can't have more than one value in a variable at any given moment, right? So when we move to
collections, collections are more like suitcases. We can put lots of things in them. We have ways of organizing them. And as we go through lists and dictionaries and tuples, we'll see how there are different ways to organize them. And as a matter of fact, we've been talking about lists for a while. Every time we use one of these square bracket syntaxes in earlier programs, we've been working with lists. And so this is technically a three item list with three strings, got commas here, Joseph is one string, Glenn and Sally are the other strings. And here's another
one that is another thing. And the list is basically, it's a list constant and it's being assigned into a variable. So this friends variable has three things in it. So that's different than what we've been talking about before. So these brackets and bracket structures with square brackets are those lists. And so the print is just a print with parentheses to get the print to work. But 1, 24, 76 is a three item integer list. Red, yellow and blue is a three item string list. But it doesn't all have to be integers or strings. Python can handle
different things and different kinds of data in different positions in the list. So red, 24, 98.6, a three item list with a string, an integer and a floating point number. And while we're not gonna use this too much for now, this outer list is a three item list and the second item is another list. So this is kind of alluding toward what we'll do when we start talking about data structures. And that is we have a structure and then we have another structure inside of it. And sometimes this can get quite complex. And we're doing
this for a reason. This here has no reason other than just to show you that it's possible that lists can be made up of lots of things, including other lists. And of course, there is also the notion of the empty list. And like I said, I have had to be able to tell you about lists all along. We use them in for loops. We can put lots of things here. We can put a file handle here. We can go through the file. We can put a string there. We can go through the characters in the string and then
the list. And the iteration variable then goes through the successive elements of the list. And that's why this prints out five, four, three, two, one. And then the loop is done and it prints out blastoff. So we've been using them and we've been actually iterating through lists with for statements all along. So the for statement has been something we use with lists. And when you just need to go iterate through the list and go through every item in order, the for is a great way to do that. So friend is our iteration variable. Friends is our list
variable. And so that says friend is gonna successively take on the value Joseph, Glenn, and Sally and print out, you know, Happy New Year, Joseph, Glenn, and Sally. It runs three times once for each of the values and the iteration variable advances. Now, I do wanna make it really clear that the choice of friends and friend, singular and plural, is arbitrary and capricious. It happens to be convenient and intuitive that the iteration variable is one and the list variable is more than one. But Python has no idea about singular and plurals. Matter of
fact, Python wouldn't care. It would be totally equivalent for Python to do the same thing, to have the list variable be z and the iteration variable be x. X will take on the successive values of these three things. Now, am I being nice to you by calling this list friends and this iteration variable friend? I am, but I also don't want it to confuse you if you're just a beginning developer. So just like strings, we can sort of look within lists. Part of the thing is when you put more than one thing in a data
structure, you need to get them out. And so lists have positions, they maintain order, and so the first thing in the list is the sub-zero position, sub-one, sub-two. Just like strings, they're zero-based. Just like European elevators, they're zero-based. So if we take a look and we say, oh, friends sub-one, that's how I read that, the little square brackets, when you take a variable here and you say friends sub-one. Remember, singular and plural don't matter. Friends sub-one means Glenn, because this is the zero and that's the one, and then Sally's the sub-two, and so that's what
prints Glenn out in this particular thing. Now, lists are mutable. Mutable is another word for changeable. They can be changed, meaning that a list has three things. You can change this thing right in the middle if you want. To take a look at what's not mutable, strings are not mutable. So if I take a look at assigning banana into fruit, well, fruit sub-zero is a capital letter B. Could we imagine for the moment that we could change fruit sub-zero to lowercase b? Well, the syntax would be how you would do it if you could do
it, but it turns out that strings are not mutable, meaning they're not changeable once you create them. And that's why when we do things like lowercase or uppercase, we take a look at the fruit and we say, give me a lowercase copy of that, and then we take the return value from this and we store that in x, and that's how x becomes a lowercase banana. But fruit is still the original one. So fruit has not changed. Compare and contrast that with a list, though. Here we have a five-item list, two, 14, 26, 41. And
we're gonna do the sub-two position. And the sub-two is zero, one, two. So that's that one right there. And we're going to assign a 28 into it. So that 28 is going in here. Gonna wipe that out and put 28 in. So we can do item assignment in lists by putting a bracket syntax on the left-hand side to say, don't just put it in a variable, put it in this position within the variable. So that's what that's doing. And when you print that out, the 28, everything else is unchanged. Meaning the whole list is there.
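To make the contrast concrete, here is a small sketch of the string and list behavior described above. The variable names `fruit`, `x` and `lotto` are my own choices, and the fifth list value (63) is a placeholder I picked to fill out a five-item list, so treat this as an illustration rather than the exact code on the slide.

```python
# Strings are immutable: item assignment raises a TypeError.
fruit = 'Banana'
try:
    fruit[0] = 'b'
except TypeError as err:
    print('Cannot change a string in place:', err)

# lower() hands back a new lowercase copy; fruit itself is untouched.
x = fruit.lower()
print(x)
print(fruit)

# Lists are mutable: bracket syntax on the left side changes one position.
# The last value (63) is a placeholder chosen for this sketch.
lotto = [2, 14, 26, 41, 63]
lotto[2] = 28          # positions are 0, 1, 2, so this replaces the 26
print(lotto)
```

Only the item at position two changes; the rest of the list stays exactly as it was.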
There could be 1,000 items in the list. And then you're changing just the one at position two. We have a function called len. We've been using this len function all along to take a look at how long strings are. It counts the number of characters in the string. So that's a nine-character string. If we have items in a list, len tells us how many items there are. It's not like how many characters there are. It's the number of things. And each thing doesn't have to be a number. It could be a number, a string, or even another list.
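As a quick sketch of that point, len counts characters for a string and items for a list. The sample values and the names `greet`, `mixed` and `nested` are assumptions I made up for illustration.

```python
# For a string, len counts characters (the space counts too).
greet = 'Hello Bob'
print(len(greet))

# For a list, len counts items, whatever their types are.
mixed = [1, 2, 'joe', 99]
print(len(mixed))

# A nested list counts as a single item of the outer list.
nested = [1, 'two', [3, 4, 5]]
print(len(nested))
```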
And len is the way to say, hey, how many things are in there? There's a function that returns a list of numbers. And we use it, as we'll see in a second, to construct specialized loops to go through lists. So let's take a look at this range function just for a minute. So range takes as its parameter the number of numbers that you want returned. So I'd like a four-item list with the numbers zero, up to, but not including four. And so it just turns out that that is really useful for constructing four loops that
are counted four loops that go to zero, to the one, to the two, as compared to the definite loops that go through each one. And so it's a common thing to say, okay, we know how many things are in this list. There are three friends. And if I put combine, range, and len, so I take len friends, which is three, and then I take range sub three, I get zero, one, and two. And so the interesting thing is this zero corresponds to the first one, one corresponds to the second one, and two corresponds to the
third one, okay? And so we'll use this to construct loops, especially when we need to go through an array and remember what position we're at. And so here's just an example of two different loops. This is a four loop that's just gonna go through whatever's in this list. So friend is just gonna take on the success of values, and so it's gonna print out these three things just as you would expect. And if you don't need to, while you're going through the loop, know the position, your relative position from the top in the loop, that's
okay. But sometimes you want a little more sophisticated loop. And instead, you want to be able to loop through where you know the position. And so what we do instead is, instead of looping through that list itself, we do range lend friends, which gives us zero, one, two. And then I takes on the success of value zero, one, and then two. So this loop is gonna run four times, and I is zero the first time. And we might even just look up the value inside that sub-zero value so we get Joseph the first time. So
prints out Happy New Year Joseph, goes and I becomes one now, and so it gives us Glen, and that prints out. And away you go. So if you look at these two loops, if you look at these two loops, they really do the exact same thing. The only difference is this, we allowed the four to find its way with the iteration variable through. And here we created our own I variable that went through the positions. And they're dense, there's no gaps in here, so it's zero through two that it goes through. So these two are
equivalent. There'll be times when you'll want to use one and the other. I tend to prefer the first one because it's prettier as long as it works for me. So that gets us started with loops. We'll be back in just a bit. Okay, so we've taken a look at loops, and now we're gonna just take a little bit of a look at some of the operations that you can do with loops. Python has this, as we'll soon learn, object-oriented approach to its operators. And the plus can add strings, and it can add numbers. Floating point
numbers, integer numbers, strings. Et cetera. And so the plus similarly works this way with lists. The plus looks to its left and looks to its right and says, what am I adding? And in the case that I'm adding the list one, two, three, and the list four, five, six, it concatenates them together. And this way it sort of functions like a string, and so we get one, two, three, four, five, six. It just concatenate this list to another list. And it doesn't change A or B just like in any kind of assignment statement. Calculations on
the right side don't change the variables and then produce a new variable and then assign that into C. You can also use list slicing, and it's easy to remember. If you remember how strings work, lists work exactly the same way. So of course it's a little tricky. The first number's the starting position. They start at zero. So one is right there. So it's the zero position, the one position. Start at one, right? But go up two, but not including three. There's one, two, three. So this goes up two, but not including three. And that's why
we get 41, 12 out of that. So up two, but not including. I'll just say that over and over and over again. If we do, you can leave the first part out. You can leave the first part out here, and you can say, oh, up two, but not including four. So that starts at the beginning, goes up two, but not including four. And so that's how we get that piece right there. We can say start at the position three, zero, one, two, three, start at position three, and go to the end. Now the fact that
the number three is in here is sort of irrelevant. Three to the end is those three numbers. And then you can do the whole list with slicing as well. Again, these are pretty much the exact same examples I used when I was doing strings. There's a number of different methods, and you can look them all up in the documentation for lists. I often just use the dir command to remind myself of them. Append we'll look at. Count looks for certain values in the list. Extend adds things to the end of
the list. Index looks things up in the list. Insert allows the list to sort of be expanded in the middle. Pop pulls things off the top. Remove removes an item in the middle. Reverse flips the order of them, and sort puts them in sorted order based on the values. So let's look at a couple of these. So if we build a list from scratch, we have a way to ask for an empty list. There are a couple of different ways to ask for an empty list. We could use just two square brackets next to each other.
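As a rough sketch of the concatenation and slicing rules described so far: the two short lists are the ones spoken aloud, while the longer list t is an assumption, reconstructed from the results mentioned (41 and 12 for positions one up to three, and a 3 sitting at position three). The dir call at the end is one way to list the method names.

```python
# Concatenation: plus builds a new list and leaves a and b alone.
a = [1, 2, 3]
b = [4, 5, 6]
c = a + b

# Assumed example list; the slide itself is not in the transcript.
t = [9, 41, 12, 3, 74, 15]
middle = t[1:3]   # start at position 1, up to but not including 3
head = t[:4]      # from the beginning, up to but not including 4
tail = t[3:]      # from position 3 to the end
whole = t[:]      # a copy of the whole list

print(c)
print(middle, head, tail, whole)

# dir lists the names defined on the list type; filter out the
# double-underscore names to see the everyday methods.
methods = [name for name in dir(list) if not name.startswith('_')]
print(methods)

# One way to ask for an empty list: two square brackets.
empty = []
print(len(empty))
```

The slices never modify t; each one produces a new list, in the same way that a + b produces a new list.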
But this is a form we call the constructor form where we say, hey Python, make a list. In this case, the word list is like a reserved word to Python. It's really a built-in class, but list with parentheses says make me an empty list and then assign that list into stuff. So stuff is now a list object, it's of type list, but it has nothing in it. And then we can call the append method, stuff.append, and stick book in. And the stuff knows how long
it is, where the end is and how to add something to it. Then we add a 99 to it, and we print it out. We get book and 99, reminding ourselves that while lists often hold the same types of values in the various positions, it doesn't always have to be that way. Then we say stuff.append cookie, you can keep on going, and we end up with three things, including the cookie. We have an in operator that works pretty much like the in operator in
a string: is nine in my list? And that's pretty simple, and the answer of course is yes, nine is in my list. Is 15 in my list? Looking through, no it's not, 15 is not in my list. And then there's the not in operator, think of that as kind of like one operator. Is 20 not in the list? And the answer, since it's not there, is True. And so that's a way to just, you know, it's kind of like startswith or in for strings, same kind of stuff. Lists are in order, and they're
sortable, and so this is something that we take good advantage of. A lot of what computers want to do is sort stuff, you know, look all these things up, append them, and then get them sorted. And so there is this method inside of list, that's just the sort method. So here we, you know, put three values in positions zero, one, and two, Joseph, Glenn, and Sally, and then we tell the list to sort itself, and then we print it out. Now this is actually sorting the list in place, which is different
than upper and lower, because if you remember, strings are not mutable, but lists are mutable, and so you say, hey, just sort yourself, okay? And then it sorts it, and then it's in alphabetical order, Glenn, Joseph, and Sally. I happen to be clever, I only put strings in there, and I put my upper case and lower case in a very consistent pattern, but the list has changed, and if I look at list sub one, that is the second item, which is Joseph, and that prints out right down there. There's a whole
bunch of built-in functions to help manipulate lists. The other thing I was showing, sort, is a method that's part of list, but there are other functions that take lists as their arguments. We already talked about the len function, which tells you how many items there are. Then there are the pretty obvious ones: max says go through and find the largest, min, go through and find the smallest, sum goes through and adds them all up, and we can say let's do average by taking the sum of all of them and dividing it by the length, and you might think
to yourself, oh wow, I wish we'd have known this a few chapters back when we were having to write all those loops to do max, min, sum, largest, smallest, et cetera. You can kind of think in your mind that inside each one of these functions is a loop that does pretty much what you did in those chapters, and part of the reason we did that back then, even though these things were here, was they're kind of easy loops to understand, and so those are there, and basically that gives you two different ways of building loops to
do the maximum and minimum. Now it's not necessarily all that much easier to do something using these, because you can either do them the old way, or you can make a list and then use these functions. So let's take a look, and I'll just say that these two bits of code are doing the exact same thing, and what they are is they're implementing a program that's gonna repeatedly ask for numbers until we type the word done, and then it's gonna compute the average and tell us what it is, and so using sort of the stuff
from the loop chapter, we start with a total variable and a count variable, set them to zero, and then we read a number, we check for done to break out, but then we convert it to a floating point value, and then we say total equals total plus value, and count equals count plus one, and so this is gonna run over and over and over again, however many times we're gonna do this, and then it's gonna pop out, and when it's done, it's gonna have this value of total, the running total will become the overall total,
divided by count, and it'll print the average out, okay? And so that's kinda how we would have done this before we knew how to do this with lists. Now, let's take a look at the other one. In the other one, we say let's make an empty list, remember this is that constructor syntax that says to Python, make me an empty list, and assign the empty list. It has nothing in it, right, but it is a list, has nothing in it, into the variable num list. Now we're gonna write another loop, this part here is the
same, these three lines, read the number if it's done, quit, and convert it to value. But instead of doing the actual calculation right now, what we're gonna do is just append it to the list. So the list will start out empty, then the three will be in the list, then the nine will be in the list, then the five will be in the list. So we're appending, each time through the loop, we're appending into the list. So we're just growing the list every time I read a value, instead of actually computing something with the value
that we've got. So in either case, we get value, and in one case, we append it to the list. And then finally, it finishes, the break happens, and then we just say, oh, hey, Python, sum up everything in the list, add these three numbers together, and then divide that by the length of the list, and you'll have the average. And so these two things give us exactly the same output. Now there is one difference: if there were like one million or one billion numbers, they would all have to be stored in the memory
simultaneously. Whereas here, it's actually doing the calculation, of the billion numbers, and not using up so much memory. For most of the things that you're gonna be doing, the difference in memory, there is a difference in memory. This uses, this one here uses more memory, but I can't draw very well, more memory. It uses more memory, but it doesn't really matter by the time it's all said and done. And so for you, the difference between these things is not all that significant, but it's important to understand that they're just two techniques to accomplish the same
thing with lists. So now we're gonna wrap up and talk a little bit about how strings and lists are related. They're sort of related in that they both have zero base things and we use the square bracket operator to do various things. But there's a lot of situations where we're looking at our data and we're combining the use of lists and strings. So let me show you the first thing, probably the coolest thing. We're gonna use it a lot the rest of the class, and that is the split function. So let's take a string, we've
got abc here, it's "With three words". What we're interested in is the fact that there's spaces in this string. And what split does is says, you know, I'm gonna look through this thing, I'm gonna find the spaces, and I'm gonna break this into pieces, and I'm gonna return you a list of the separate individual pieces. So look for blanks and break it in pieces and give me back the pieces. So I'll print these out and now you see that it's a list with three items, with three words. The spaces are gone, but it's given it to
us. So it's like, split this into words, please, and give me the individual words, give me a list of individual words, rather than a big long string with spaces in the middle of it. And that is a quick way to go from a line to its pieces, and it's really common; a lot of things we're doing are like, go get the second thing, or the third thing, or whatever. So the split's really nice, because then you can just grab stuff. And so you say, oh, how many things did I get? Well, I got three, the len function tells us
that. And I can print the first word I got, which is stuff sub zero, and that'll be With, the first word, because that's the sub zero position. So I read something, I split it, I can say there's three things, and I can look at the first word, basically, without really knowing much. Now, if you remember earlier, and we'll see this, we used find and slicing to do a similar kind of thing, but people tend to prefer the split. And you can, you know, oops, go back. You can also then loop through
them, so you can split these things into stuff, and then go through with for w in stuff, and then w's gonna take on the successive values With, three, words. And so you can make a loop by reading some data, splitting it, then writing a for loop, and then it's effectively going through the words in that line of data. And so that's a really powerful concept that we'll use in a lot of the programs that we're going to write. Just a couple of bits about this and how it works. Split with no parameters here,
it looks for spaces, but it also treats a bunch of spaces as a single space. And so it's pretty smart about that, and so even though this has a lot of spaces between lot and of, you only see lot of, all the spaces are gone. It does something special about spaces. It's really white space, so tabs or new lines or other characters would also qualify in split, basically. Now, you don't always have to split based on spaces, and a lot of data that you're gonna run into, you're gonna wanna split on something else. And so
here's some data that looks like we're using semicolons to separate the first, second, and third piece. Now, if you just call split, split's looking for spaces. And so split gives you back a list of the things broken apart at spaces, but there's not a single space in that line, and so we get a list, see, it's a list, but there's only one item, and the semicolons are still sitting there. Split doesn't go like, whoa, this looks like it should be semicolons. Split's job is to use spaces and split the string based on spaces, okay? But given
that this is something we like to do, you can tell split what character you'd actually like to split on. Now, it's not quite as clever when splitting on something other than spaces. It doesn't understand that, you know, if there's a bunch of semicolons in a row, it still thinks of those as splitting points to split, but in this particular case where there's no spaces, you know, and it's gonna split that. So it says split this based on the semicolon instead of being based on the space. And so if you take a look at what comes
out of this, we split on semicolon, now we have a three-item list, and we get first, second, and third. And a lot of your data comes out of some logging system or some router status updates, who knows what you're looking at, but the delimiter is often something other than space, and you can do that with split. So this is a useful thing when parsing things like our email address, right? We wanted to get things like the email address, this second piece, off of the line. And so we can use split to take advantage of this.
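The behavior of split described above can be sketched as follows. The exact strings are assumptions chosen to match what is said aloud (a run of spaces between lot and of, and first, second, third separated by semicolons); the last line with two semicolons in a row is an added illustration, not something from the lecture.

```python
# With no argument, split breaks on whitespace and treats a run of
# spaces as a single separator.
line = 'A lot               of spaces'
words = line.split()
print(words)

# The same call on data with no spaces returns a one-item list.
data = 'first;second;third'
unsplit = data.split()
print(unsplit)

# Passing a delimiter tells split what character to break on.
pieces = data.split(';')
print(pieces, len(pieces))

# With an explicit delimiter, repeated delimiters are NOT merged:
# each one is a splitting point, so an empty string appears.
doubled = 'first;;third'.split(';')
print(doubled)
```

The last case is the "not quite as clever" behavior: only the no-argument form collapses repeated separators.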
And so here's a little loop that's just gonna print out not the email addresses, but instead the day of the week. We're gonna print the day of the week out for all these things. How do we do that? Well, we can observe really quickly that if we split based on spaces, it's the zero, one, two, it's the two position. So we can quickly write a bit of code that opens the file, then loops through the lines, we do this all the time now. The strip takes the newline off the end of each line. We can check
to see if it starts with from space, right? From space is our key, so we're ignoring all of the lines that don't start with from space, but then we find a line that starts with from space, and we split it, and then we just print out the word at position two. And so we get the third word of the lines that start with from, and that's how this thing works. Now, sometimes we want to dig into it deeper, and we will take something, split it, and then split another piece of it again with a different
delimiter. So let's just say that the thing that we want to achieve is getting the part after the at sign for email addresses. And we did this with, again, find and pos and stuff like that, but you can use split to do this as well. So the first thing we're gonna do is we're gonna take this line, we're gonna split it based on spaces, right? Chop, chop, chop, chop, chop, chop, and the fact that there's an extra space there doesn't matter, split just happily zooms through that. And then words sub one, zero, one,
words sub one is this email address, so we'll put that in a variable called email, and so email will be a string that's just this. So in two lines, we've pulled out the second word, the address, into a variable. Then what we're going to do is we're going to re-split that. We're gonna take this string we've got and split it based on the at sign, because we know it's an email address. So we get a new set of pieces, the first part is the person's name, and the second part is the host name that their email is hosted
on. And then, we just happen to know that this is the zero item and this is the one item, so we can get at that. So the interesting thing here, if you think back to how we did this before with find and pos and all that stuff, is that this is really a lot cleaner. For me, once you understand it, I can look at this and it's easy to see that it's correct, whereas with that pos stuff, you gotta add one
and start the second find after, just remember that. And this is a lot cleaner way, and this is a more typical way of pulling this kind of information out of a line. So in this chapter, we've talked about lists, we've talked about the concept of collections, that's our first data structure, we're not just doing algorithms, we kinda know algorithms now, but now we're gonna do data structures. And in this chapter and the next two chapters are our foundational data structures and then we'll, like everything, we'll make more complex data structures by composing those data structures
together. We've looked at how strings and lists connect together and how split works and these are all really powerful tools that we're gonna use going forward. Now we're gonna take a look at how we would write some code to do some parsing, read some data. As a matter of fact, we're gonna read through our famous mailbox data, look for lines that begin with from space and extract the third word. As a matter of fact, we already have some of this code already written, we're gonna debug it. We're gonna look at code and we're gonna debug
it. So here we go, here we have it and it's a pretty basic program. It opens a file, loops through the file, throws away the white space, splits it into words and checks to see if the zeroth word, the first word is from and if it's not, we skip and read the next line. And otherwise, if we find a line that starts with from space, then we print the third word, which is word sub two. Okay, so this is what we've got and we carefully saved this file into the same folder that we've got, EX08.
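A minimal sketch of the program as just described, with some assumptions: the real program reads a file, while this version takes a list of lines so it can run on its own; the function name days_of_week and the sample lines are hypothetical; and the variable name wds is borrowed from the debugging session that follows.

```python
def days_of_week(lines):
    """Collect the third word of every line whose first word is From."""
    days = []
    for line in lines:
        line = line.rstrip()      # throw away trailing white space
        wds = line.split()        # split the line into words
        if wds[0] != 'From':      # skip lines that are not From lines
            continue
        days.append(wds[2])       # the third word is wds[2]
    return days

# Hypothetical sample lines in the general shape of mailbox data.
sample = [
    'From alice@example.com Sat Jan  5 09:14:16 2008\n',
    'Return-Path: <postmaster@example.com>\n',
    'From bob@example.com Fri Jan  4 18:10:48 2008\n',
]

result = days_of_week(sample)
print(result)
```

On these three lines the sketch runs cleanly. Whether it holds up on the full data file is what the rest of the walkthrough looks at.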
And so let's go ahead, cd, desktop, Python for everybody, EX underscore 08. And so here are some files, we've got our day of the week Python file, dow.py, and our mbox-short data, so that's sitting there, okay? And so let's run this program. This is the program we've got right here, python3 dow.py, and it doesn't work. Now, by now you've seen a few tracebacks and there you go. So, you know, when you look at a traceback, you think to yourself, well, I made a mistake, and you've gotten pretty good at looking at that line. So
there you are, you're like, this is the line, there must be something wrong on this line and you wanna change it. But that line's not actually the problem in this particular thing. And so you gotta be careful sometimes. And one of the things that you didn't notice in this one right away is that it actually worked. It printed the first line out. So if we take a look at our data set, it found the line started with from space, it split it and printed out the third word and it blew up later. And so part
of the problem is that we don't know what it was doing when it blew up. And so the first thing I like to do in this kind of a situation is find the line and make sure there's a print statement right before it. And so I'm gonna print words colon and then comma wds. I wanna print right before the line that blows up so that I know, when this finally does blow up, what was going on in that line. So I'm gonna run it again. And oop, did I forget to save it? No, I
forgot to save it, look at that. See the little blue dot, forgot to save it. So now we see a whole bunch of output. And we see that it's actually doing a whole lot of work before it's blowing up. And so you see that it prints the words out from that first line and prints out Saturday, which is exactly what we expect. It's the third word in the line. And then reads a whole bunch of stuff and it's actually, what it's doing now is ignoring. Let me just put something here. I'm gonna say print ignore.
So I can keep track of when these lines are being ignored. So let's run it again and have the word ignore pop up. Right, and so it's doing a lot of ignoring. It finds these words, prints out Saturday, reads this line and ignores it, reads this line and ignores it, reads this line and ignores it. So a lot of stuff's going on here that you might not realize. And so we have to take a look at what the problem is. And so it is now blowing up word sub zero. And now we can scroll down
and we can look at exactly what happened right before the trace back. So we really now know exactly what happened before the trace back. And the interesting thing is, is that there is an empty, empty string. I mean empty array. There's an array with zero items. So I'm gonna print the line out too. Print line colon. Now I haven't changed my program at all. I'm just trying to figure out what's going on here. So I'll save that and I'm gonna run it. And we've got a lot of stuff and it's still working. It reads a
line, it reads a line, splits it into words, and then prints out Saturday, which is the third word on the line. Now here it reads a line and this line is a blank line. And it has, because it's a blank line, the split returns no words and that's what blows up. And the problem now is, oh, wait a sec, list index out of range. So word sub zero is not valid, which is the first word, when there are no words. So this is a statement that works most of the time. Now you might think, oh,
I wanna just put a try and except in there. Well, the right thing to do is to say to yourself, oh, wait a second. If I don't have enough words, if the length of the words is less than one, continue. So basically it's gonna come through here, it's gonna split it, and if we don't have any words, meaning it's a blank line, then we're gonna skip it. So let's run that. So now this ran all the way to the end. It did a lot of stuff and it did not blow up; specifically, it didn't have
a trace back. Another way to protect this would be to, we'll take this part out. This is called a guardian pattern. Right, guardian pattern, because this is dangerous. This could blow up, but this, it won't blow up if it makes it past here and it won't come through there under the conditions that are causing it to blow up. Another way to do this might be to protect it as follows. To say, oh, wait a sec. If the line is a blank line, no, continue. So now what we're gonna do is we're gonna skip blank lines.
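The two guardians mentioned so far can be sketched side by side. This is an illustration under assumptions, not the exact program on screen: the function names and sample lines are hypothetical, and the lines come from a list rather than a file so the sketch runs on its own.

```python
def days_guard_on_words(lines):
    """Guardian on the word count: skip lines that split into no words."""
    days = []
    for line in lines:
        wds = line.rstrip().split()
        if len(wds) < 1:          # guardian: nothing to index into
            continue
        if wds[0] != 'From':      # safe now, wds has at least one item
            continue
        days.append(wds[2])
    return days


def days_guard_on_line(lines):
    """Guardian on the line itself: skip blank lines before splitting."""
    days = []
    for line in lines:
        line = line.rstrip()
        if line == '':            # guardian: skip blank lines
            continue
        wds = line.split()
        if wds[0] != 'From':
            continue
        days.append(wds[2])
    return days


# Hypothetical sample data that includes a blank line.
sample = [
    'From alice@example.com Sat Jan  5 09:14:16 2008\n',
    '\n',
    'X-Note: not a From line\n',
    'From bob@example.com Fri Jan  4 18:10:48 2008\n',
]

by_words = days_guard_on_words(sample)
by_line = days_guard_on_line(sample)
print(by_words, by_line)
```

Both versions leave the line that indexes wds[0] unchanged; they only make sure it is never reached with an empty list.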
I even say this, print skip blank. So if it's a blank, we're gonna skip blank and keep going. This will skip blank lines. It'll come through here and this will skip lines that don't have from, but because we're not processing blank lines, words of zero always works. So I can run this code and it works again. So here we have a blank line, we skipped it. Here we have a blank line, we skipped it. Now here we had a non blank lines, we parsed it, but then we ignored it. And then up here, we'll find
it from somewhere. Doo doo doo doo doo. Let's find it from, here it comes. Oh, no, there's ignore, ignore. I got too much debug print, I can't find it. Here, I'll just hunt for from with find. Okay, so there we go. There it's from and we print the thing out. So we're getting a lot of extra stuff. So I'm gonna comment out some of these debugs. And I'm actually just gonna get rid of this whole skipping of the blank line. I'm gonna do it with the words. I'm gonna go back to the guardian we had
before. If the number of words that we got, len of words, is less than one, continue. Okay, so now this is gonna be a working program. Oops, I gotta take another print statement out. Gotta take another print statement out. We sort of know what we're doing here. Okay, so this looks like a pretty safe thing. This guardian is protecting this dangerous line. I'll get rid of that one too. This is the line that gave us the traceback. And nothing else in this thing changed from when we started except we've added this little guardian. Now the interesting thing
is if it comes through here and prints words sub two, what happens if somehow we find a line that has from as its first word and there's only one word on it? This is gonna blow up. So we can make our guardian a little stronger. And we can say, you know what, we're gonna skip this line if it doesn't have three words in it. So it has to have at least three words. And if we see fewer than three words, we're gonna skip it. And that just makes the guardian a bit stronger. And so the
program works safely and you see these things where sometimes you wanna check to see reasonable, that your assumptions about the data are reasonable and skip things where the data is not reasonable. So that's one guardian pattern. Let me show you a slightly different way to do this. And this is with an or statement. So I'm gonna take this code, copy that, and put it here with or. Get rid of all this stuff. This is the guardian in a compound statement. So what we're saying is if there are less than three words on the line or
if the first word is not from, continue. Now we're doing this in order because the way it works is or is true if either that's true or this is true. But if it knows that this is true, then it doesn't bother checking this. And the checking of this is what blows up, what causes the trace back. So if we flip this order, it would fail. If we do it in this order, it will work. So let's do this one right. It works. But if I get this backwards, it's gonna check this before it checks this.
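To make the ordering point concrete, here is a small sketch of the compound guardian in both orders. The helper names safe_day and unsafe_day are hypothetical, and each takes a single line so the difference is easy to see.

```python
def safe_day(line):
    """Guardian first: the length check runs before wds[0] is touched."""
    wds = line.split()
    if len(wds) < 3 or wds[0] != 'From':
        return None
    return wds[2]


def unsafe_day(line):
    """Same tests in the wrong order: wds[0] is evaluated first."""
    wds = line.split()
    if wds[0] != 'From' or len(wds) < 3:
        return None
    return wds[2]


print(safe_day('From alice@example.com Sat Jan  5 09:14:16 2008'))
print(safe_day(''))        # blank line: the length check stops the or early

try:
    unsafe_day('')         # blank line: wds[0] is evaluated on an empty list
    failed = False
except IndexError as err:
    failed = True
    print('traceback would say:', err)
```

Because or stops as soon as its left side is true, the right side only runs when the line is known to have at least three words.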
And we're going to go back to failing again. So you gotta get the order of these things right. The guardian comes before in the or. The guardian comes before. And if this is true, then it doesn't check this. This is called short circuit evaluation where it knows that as long as this part's true, it doesn't evaluate this second part. And so now we have a guardian in a compound statement. You'll see this a lot. Sometimes if it's more complex, you do it in multiple statements, or you fall through, check for sanity, check for sanity, and
only run the code. So I hope that that was useful to you, looking a little bit about how to debug where you don't just start chopping on the line that had the problem. It's not always that line because we never did change that line. Although we did change it a little bit at the end, we added this guardian here. But we also fixed it without it. Sometimes you add some print statements to figure out what's going on before you just start chopping on that line. So again, I hope this helps. Thanks. Hello and welcome to
chapter nine. Now we're gonna talk about Python dictionaries. Python dictionaries are probably the thing that most programmers love the most about Python because they're very powerful. They're like a little in-memory database. It's the second of our kinds of collections and probably the best collection. To review what a collection is, it is a situation where we are going to have a variable, like a list or a dictionary, that we can put multiple pieces of information in rather than a single piece of information. And of course, prior to collections, we would put something into X and then
we would put something else into X and it would be overwritten. And now with lists, we can append things on to the end. And so if we compare lists and dictionaries, the list is sort of the organized version of the collections. Everything stays in order. You add something, it always adds to the end. You take something out, it sort of compacts itself. The positions are zero through n minus one, where n is the number of items. And so it's very organized, kind of like a can of Pringles, where the potato chips are nicely stacked. Dictionaries are messier. You can
put things into dictionaries. There's no real sense of order in dictionaries. Everything has a key. So you sort of throw things in and they kind of mix around in there somehow. And you pull things out based on the key. It's like you sort of stick a label on it, where you say, okay, I'm gonna take this thing and I'm gonna put Chuck on it. And I'm gonna take these sunglasses with the Chuck label and I'm gonna throw it into the dictionary and I'm like, hey, give me back Chuck. I'm like, oh, here's your sunglasses because
you mark everything. This is like the key. This is the value. I took a pair of sunglasses and I threw it in. So it's kind of like a purse or it's sort of like a mess. And so the idea is you have these labels that you put on everything that you're gonna throw in. Like I'm gonna put, so it won't stick to my keys. You know, what else do I got here? I'm gonna stick a label on my pen, a Chuck label, and I'm gonna store a pen in my dictionary with a Chuck label. And
so it's like having a purse or a bag or a backpack where you have things labeled and you can throw things in and label them and you can shout into your bag and say, give me the calculator or give me the candy or whatever it is that you have labeled them. You have to come up with the labels and then you can use the labels to get things back out. And like I said, they're probably the most powerful thing. And they're basically this concept that's generally referred to as associative arrays, which means they're like lists,
but they have these keys. And so the associative means the association between a key and a value. Whereas in a list, there's a position in a value and the position is less powerful and less flexible. Most modern programming languages have this notion of associative arrays. If they don't, they're sort of unpopular because once you get using them, they're like, whoa, they're so powerful. If you ever find yourself in a language that doesn't have them, you'll freak out. They have different names like property maps or hash maps or property bags, depending on the language you're using,
but they all are the same thing. They're key value pairs. So the idea of a dictionary is that, or the idea of any collection is putting more than one thing in. And then the difference is, is that you have ways of indexing it. So this basically line says, let's make ourselves a dictionary, just like we constructed an empty list. And I want to store 12 into this dictionary and I want to label it money. And so on the left-hand side, when we use this money, that's the label that we're going to give it. And so
12 is being placed in the dictionary. That's like taking the 12, throwing it in the dictionary with a label of money. I can't, yeah. Three's going in with a label of candy and 75 is going in with tissues. We say, what's in there? And there's no order to it. And sometimes the order can even change inside of a dictionary. Although there are more advanced versions of dictionaries that maintain some kind of order, but for now let's just not worry about the ordering of them. If we say, what's in there? You say, oh, there's three things
in there. There is 12, 75, and three, and stored under the keys, money, tissues, and candy, respectively. We can ask, using the index operator, what is purse of candy? And that's like saying, hey, give me back candy. And out comes the number three, which is that. We can update stuff. So we can say, go grab the candy version, add two to it, make five, and then store that back into candy. And so now we see that candy has been set up to be five. And so if you look at the difference between lists and dictionaries,
they both can have new items added to them. We haven't talked a lot about deleting, but items can be deleted from them. The difference is the indexing mechanism, how we look things up, how we store things, and how we look things up. So we make an empty list, we make an empty dictionary. We add 21 to the end, and we add 183 to the end, and we ask it, and it says, oh, position zero is 21, and position one is 183. We don't see the positions when we print it out, because it's sort of implicit.
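The purse example and the list it is being compared with can be sketched like this. The name purse and the keys and values come from the lecture; the name lst for the list is an assumption.

```python
# A dictionary: every value goes in under a label (its key).
purse = dict()
purse['money'] = 12
purse['candy'] = 3
purse['tissues'] = 75
print(purse)
print(purse['candy'])                 # look a value up by its key

purse['candy'] = purse['candy'] + 2   # update the value stored under candy
print(purse)

# A list: values go on the end and are looked up by position.
lst = list()
lst.append(21)
lst.append(183)
print(lst)
print(lst[0], lst[1])
```

In current Python versions (3.7 and later) a plain dictionary prints its keys in the order they were inserted, so the printout here follows insertion order; the lookup is still by key, not by position.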
Here we're gonna mark 21 with age, and stick it in, and mark 182 with course, and stick it in, and then we're gonna print it out, and there we've got course and age mapped. And we can take 23 and stick it back in age, and that overwrites, so the 21 becomes 23. We can do the same thing in a list, except we say list sub zero, because in lists, the indexing is by position, and so this 21 becomes 23. And again, you just look at them, and you can think of each of these as
pretty much doing roughly the same thing, except the indexing mechanism. The values are the same, but the keys are different. So in lists, the keys are always the position, and you don't get to assign those other than the fact that the order in which you put them in implicitly assigns a position, and in dictionaries, the key is a string. You can actually use other things. I use strings a lot in this lecture, but that just kinda keeps things simple until you get good at it. You can actually use numbers as the dictionary index, the dictionary
keys if you want, but the values are things you put in and manage in those dictionaries. So, just like lists, we have dictionary literals, and what's nice about dictionary literals is that they use the exact same syntax as the printout, and so it starts with a curly brace, ends with a curly brace, and then has a series of key colon value, key colon value, key colon value, and this is sort of the associative array bit. We are associating one with the key chuck. We are associating 42 with the key fred, and we're associating 100 with the key jan.
Then we print it out, and it kinda looks exactly the same, so the print statements in Python are nice in that you ask what's in a thing, and it shows you the stuff in the syntax that, if you typed it into Python, would be how you write a constant. And you can just write empty curly braces; you've seen me also use dict with parentheses. That's the constructor, where you say make a new empty dictionary. The empty curly braces are an empty dictionary constant. These two things are pretty much the exact same thing. This is a shortcut to
doing this. The empty curly braces is a shortcut to do the construction. So up next, we're gonna talk about sort of one of the really common applications of dictionaries, and that is counting. So now we're gonna talk to you about one of the common applications of dictionaries, and that is making histograms. It's counting the frequency of things. And so if you think of a histogram as, it's a little graph, and there is A, how many A's, how many B's, and how many C's, and there's a histogram that says, oh, there's this many of that, and
this many of that, and these are like buckets, these are frequencies, and this is how many times it happens, so a histogram. But we're gonna do this thing where we're gonna count people's names, and we're gonna kinda count how many that we see. But the interesting thing that we're gonna solve, just like many of the things in the computer, is we can't just sort of look at the data, we gotta look at the data iteratively, one piece of data at a time. So I'm gonna give you a little problem, okay? I'm gonna show you a
series of names, one at a time, and I want you to count for each name, make a little bucket, and then keep counting how many things for each of the different names, okay? You'll notice that you have to start with one, and then you move across, so just watch this, and tell me how many, how many, what's the most common name of the set of names I'm about to show you, and how many do we see? One, two, three, four, five. So how many, what was the most common name and how many times did you
see it? That's the question. Now, here comes the review. So for humans, it's so much easier for you to just look at this and you think, how did my brain look at that? And you're like, okay, what is pretty common? Oh, maybe, maybe Chen is common. Oh, Chen, Chen, Chen, no. Maybe Chen is common, one, two, three, four, yeah. Anybody else? Markov's got three, and csev. And so you'll notice how our minds, without computers, we just sort of like bounce around, branch and bound. We have hypotheses and then we decide, yep, it's Chen, that's it, and
there's four of them. Now, how did your brain think about this as we were going through them one at a time? Well, my guess is if you really had to do this a lot, you would make a little picture like this. And then what you would do is if you saw a new name, XYZ, you'd add it to the list and give it a tick mark of one. And then if you saw csev again, you'd give that a tick mark. And if you saw XYZ again, you'd make a tick mark. And then you'd keep adding
to these tick marks, right? And that's how you would do it. And, like many of the things we do in a loop, you wouldn't really know what the most common one was until the end. And then you'd sort of take a look at these numbers and you'd say, okay, that's the most common one. And then you'd be done. But you have to watch them one at a time. You can't just bounce around. And so that's how we're gonna use dictionaries to achieve that. Again, instinctively as humans, we just look at the stuff. But
if you had a million things, you probably wanna write a Python program and use dictionaries. And so this is the idea. And there's two basic things that can happen each time you see a name. You ask, is this name there already? If it's there already, you really just wanna add one to it, right? That's the adding of a tick. Or you're seeing it for the first time, so you write the name down and give it a one. And so you can use the name as the key. And then one is
the value. And then first time you see Chen, you stick one in there. And so at this point inside the dictionary, sort of dynamically adding as soon as it sees a new name, it adds another slot in here. But then if you see the same name again, like Chen again, then you end up with a one, add one to it, and so it's two. And so at that point, Chen is two. And so you can see how you can both extend the dictionary by encountering a new name or adding when you see a name that
you've already seen before. The problem with dictionaries is like everything in Python, there are rules about what you can and can't do. And one of the, I think, kind of frustrating things about dictionaries is that you can't just look for a key that doesn't exist. So this is a fresh brand new dictionary, we do a constructor there, and we print out sub csev, and boom, it blows up, and that's bad. But we can solve this by the in operator. The in operator we've used in the for loops. We've used it in lists, we've used it
in strings. So that is a question, it's saying, is csev in CCC? Well, this is this empty one, and so the answer is no, it is not. Csev is not in CCC. And so using this in operator, we can avoid the traceback. We can say, if it's not there, put it in. If it is there, add one to it. And that leads us to this bit of code. Okay, and that is the kind of code that we're gonna use to build a histogram, this is gonna be histogram code, okay? And so this is gonna have name as our iteration variable going through
names. Sorry, I made them singular and plural, that's nice, but so name is gonna be csev, cwen, csev, and so on. Now normally, we'll be reading this from a file, but for now, keep it easy. We're gonna go through this. And we're gonna have counts as our dictionary. So that starts out empty. And we're gonna do a simple if then else every time through the loop. If the name we're looking at is not in the dictionary already as a key, then set it to be one. Otherwise, go get the old value, counts sub name, and then add
one to it and stick it back in. So this line right here is adding a new thing. And this line right here is adding one to an existing thing. And you do this long enough, you start with an empty one, and at the very end, it will print out the histogram that you're looking for. And so you say, oh, we've seen csev twice, zqian once, and cwen twice. And so that's the idea. And so this can run a million times if you want. Now, this
notion of checking to see if a key exists and doing one thing if it doesn't exist and doing another thing if it does exist is such a common practice that the dictionary object has this method called get that collapses these four lines into one line. And so the idea is you're going to do one thing if it's in there and you're going to retrieve the current thing. Otherwise, you're going to pick a default value. In this case, we'll pick one. I mean, you pick zero. This is like the default, meaning what is not there. And
if you say counts, now counts is a dictionary, dot get. That's like string dot upper. That's a method. You give it a key and then a default. And if the key exists, you get back what's in the key. If the key doesn't exist, you get the default. And with no trace back, this works. So the best way to think about this is those four lines are equal to that one line. Because x is either going to be whatever was in there before if it exists or it's going to be zero. Now, the nice thing about
zero is the next thing we're going to do is we're going to add one to it. So that that's going to get us to one. So collapsing that loop that we saw before, collapsing that loop, we can make it just a one line loop. And this will become an idiom. This will become something that you will get used to. And you will use over and over and over again. And after a while, right now, you're looking at it, boy, boy, that's a lot of syntax and semicolons and whatever. After a while, you just type this
and not even think about it. It's an idiom. Basically, included in this idiom is how to both create new entries in dictionaries and update existing entries by adding one to them. So everything else in this is the same. Name is going to go through these five values. And we're going to say counts of name equals counts.get name comma zero plus one. And so if, for example, this already has a one in it, then this is going to be one plus one becomes two. If it's not, it's going to be zero plus one equals one.
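To make that concrete, here is a small sketch that puts the four-line if then else version next to the one-line get idiom. The list of names is just sample data, and the variable name counts_long is made up for this comparison, so treat it as an illustration rather than the exact code on the slide.

```python
# Sample data: a short list of names, used only for illustration.
names = ["csev", "cwen", "csev", "zqian", "cwen"]

# Long form: check for the key with "in" before touching it.
counts_long = {}
for name in names:
    if name not in counts_long:
        counts_long[name] = 1                       # first time we see this name
    else:
        counts_long[name] = counts_long[name] + 1   # seen before, add one

# Idiom: get() returns the stored count, or the default 0 if the key is missing.
counts = {}
for name in names:
    counts[name] = counts.get(name, 0) + 1

print(counts_long)
print(counts)
```

Both loops end up with the same dictionary. The only difference is that get folds the "is it there yet?" question and the default of zero into a single expression.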
And so this is the idea of if new set it to one, not zero, set it to one because the first time you see something, the count should be one, not zero. So that's why we make this default. Now the get can be used for anything. It just so happens that zero is a common default because it's really common that we're using this to basically make a histogram, right? Little histogram of a, b, c, right? And so we need to make a d, but then the histogram has to start at one. So that's basically the
simplified counting with get. And there's a lot of things that we're going to do inside of Python that do have to do with frequencies and how many times certain things happened. And this pattern is a really good pattern to absolutely know. So now what we're going to do is we're going to switch from just looping through strings, instead loop through files. And it's going to take a little bit of work because we have to open the file and we'll bring a lot of things together at this point. So here would be another task and that
is here's a bunch of text from the book and you can just split this into words and count and find out what the most common word is and how many times it occurs. So go ahead and try to do this for a second. Feel free to pause. Actually don't bother pausing. This is too hard. We should write a program for this. It's not easy. Humans don't like this. It makes you concentrate. And so here is a counting pattern where we're going to take a line and then later we'll read this in a file. And so
this is just an adaptation improvement of the previous thing. So we're going to start with an empty dictionary. We're going to ask for a line of text and read it in. And then we're going to use split. So remember the list of words? Well, what we're going to get here is a list of words. We'll print it out and we'll run this counting. This is the little loop. For every word in whatever this was, we're going to do this idiom of either adding a new entry or adding one to an existing entry and then printing
that out. So let's take a look at what we get there. So if we run this, we can give it some text and I've got this, this will be all one line. And then it splits it into words and you see that these words here are split, split, split, split. I mean that's strings and splits. Remember strings and lists and split. And so now the counting is gonna go through this list. The clown ran after the, and it's gonna build a histogram. The clown, you know, one clown, the up, up, up of these things are
gonna go up, right? That's this histogram. And then when it's all said and done, we end up with the histogram. And so counts is the dictionary that ends up with a histogram. And we can start by inspection, see, oh, the is the most common word. And there are seven of those, right? So if we sort of take a look at this, we start out, we make a dictionary, we read in a line of text, the text goes in. We, and then we split that and we print the words out. So these are the words, right?
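Here is a rough sketch of that counting pattern for one line of text. It assumes the line is a clown sentence hard-coded in the program; the lecture reads the line in from the user, so the fixed string is only there so the example runs on its own.

```python
# Sample sentence standing in for the line of text the user would type in
# (an assumption made so the sketch is self-contained).
line = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

counts = {}
words = line.split()      # split on whitespace into a list of words
print("Words:", words)

print("Counting...")
for word in words:
    # Either create the entry with a count of 1 or add one to the existing count.
    counts[word] = counts.get(word, 0) + 1

print("Counts", counts)
```

For this particular sentence the word the ends up with a count of seven, which you can confirm by inspecting the printed dictionary.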
Then we have a for loop that's gonna loop through all those things and then produce a dictionary. And when we print the dictionary out, that's what we're gonna get. And the seven, okay? So that's one line of text. That's how you walk across the words in a line of text after you split the line into separate words. So now we're gonna look at ways that you can loop through dictionaries. We just produced a loop that can build a dictionary, but now we're gonna look at a dictionary. And so we'll start with a very, very simple
example and then we'll work to a slightly more complex example. So here's a dictionary, just the constant, Chuck is one, Fred's 42, and Jan's 100. And so we're gonna use a definite loop with a for: for key in counts. Now it doesn't have to be called key, but key is a good name because these are keys and values, K, V, K, V, keys and values. I just mentally think of this as keys and values and keys and values. So this iteration variable is gonna walk the keys. It's not gonna walk the values, it's gonna walk
the keys. Chuck, Fred, Jan, not necessarily in that particular order. As you see, it goes Jan, Chuck, Fred, because just because I typed it in in this order, it's not like a list, it doesn't stay in that order. It might move around a little bit as we add data to it or as we set the data up. And so you can, in the loop, you can get the key, and so that's what prints out the Jan, Chuck, Fred, but then you can also get the corresponding count for each one of these by just pulling it
out of the array. I mean, pulling it out of the dictionary, right? And so we can pull out the corresponding value, and so we print out Jan 100, Chuck 1, Fred 42, and that runs this loop three times. So if you just use the in and you give a dictionary here, remember all the different things we've been able to put there on the end of a for loop, and a dictionary's another thing we can put there, and we get a list of keys. Now there's a couple of methods that allow us to get the keys and
so we have, you know, we can say turn this into a list and we get a list of the keys. So this is a dictionary, the same dictionary. We get a list of the keys. You can also get a list of the keys by using the keys method. So that's take this dictionary, JJJ, and give me all the keys, which gives me a list, which is kind of the same thing. And then we can ask for the values and they give me just then the values extracted out of this dictionary. So that's nice. Now the
one thing is that while I said you can't predict the order, if in two statements you ask for the keys and then the values, they at least come out in the same order as each other, even though you can't necessarily predict what that order will be. And then there is a third thing that we can do and that is ask for the items. We can say give me the items. And that gives us a list. This is our first really kind of composite combined data structure where it is a list, a three
item list, zero, one, two. And inside that there is what are called two tuples. Jan maps to 100, Chuck maps to one, Fred maps to 42. Coming up next we're gonna have a whole chapter on that and so just take a look at that for the moment and we will come back to that in some detail later. This whole items idea that gives us back a list of key value pairs, because it's not just a list of keys or a list of values, it's actually a list of key value pairs, allows us to write in
Python a very clever and elegant loop. What items gives us back is a list where each item has a key and a value, and so we can actually use two iteration variables. For aaa comma bbb: this is two iteration variables, and if you're coming from another programming language, this is super cool. I have never seen another language that does something this simple this elegantly. So what this basically says is we're gonna simultaneously advance these two iteration variables. So this
is gonna be the key and the value, the K and the V. Key and the value is gonna be Chuck one, then they're both gonna advance, Fred 42, Jan 100. And so that means in this simple loop if we just print them out we're gonna get the key value pairs. Of course, in whatever order they come out. And so aaa and bbb sort of simultaneously walk down these key value pairs. And so that's really pretty and it makes for a very succinct loop. The syntax is a little sort of disquieting when you
first see it, but it's a super elegant thing and you just have to say items. If you don't say items, you just get the keys. If you say items, you get the key value pairs and you have to have two iteration variables. If you don't have two iteration variables and use items, it'll complain and say, what are you doing? I'm giving you two things and you don't have two variables to receive them. So two iteration variables and items are basically related. Now we're going to take a look and this is code that I showed you
perhaps many weeks ago about, I said this is a little story about how to read a file and count all the words in the file. And now we're back to it. And at this point you should understand every single character of this program, every single concept of the program. You should literally stare at this and look at it, code it, play with it until you absolutely understand it. So let's take a look. Again, I showed you this weeks ago. So we're going to ask for a file name. Then we're going to open the file name.
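For reference while following the rest of this walkthrough, here is a sketch of the whole program. To keep it runnable without a file on disk, it is wrapped in a function that takes any sequence of lines; the function name most_common_word and the little sample list are made up for this sketch, while the variable names bigword and bigcount are a guess at how the slide spells them.

```python
def most_common_word(lines):
    """Return (word, count) for the most frequent word in an iterable of lines."""
    counts = {}
    for line in lines:                  # outer loop: down the lines
        words = line.split()
        for word in words:              # inner loop: across the words in one line
            counts[word] = counts.get(word, 0) + 1

    # Maximum loop over the key/value pairs, primed with None.
    bigcount = None
    bigword = None
    for word, count in counts.items():
        if bigcount is None or count > bigcount:
            bigword = word
            bigcount = count
    return bigword, bigcount


# Small in-memory stand-in for a file; an open file handle can be passed the same way.
sample = ["the clown ran after the car", "and the car ran into the tent"]
print(most_common_word(sample))
```

With a real file you would prompt for the name, open it, and hand the file handle to the function, since looping over a handle also yields one line at a time.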
Then we're going to make an empty dictionary. Again, this is all stuff you've done before. And then we're going to have an iteration variable that's going to go through the lines in the file, right? So line is going to go line, line, line. Then we are going to split that line, each line into words, chop, chop, chop, chop. So that's words is the list of the words in one line. We're inside of a loop that's going to go through all the lines. And then what we're going to do is we're going to have the word
iteration variable iterate through each word in the line. And then what we're going to do is take each word in the line and do this histogram, right? So this is going to run not only just for every line, but for every word in every line. So we have a nested loop. For every line, we split it and then we go across the line. So it's almost like a typewriter. We go, tch, tch, tch, across the line. And that's what we're doing.
So it's like the outer loop is going down, down, down the lines. And the inner loop is going across, across, across the words. And eventually we are going to see, in this last line, every single word in the file. And we're going to do the counts dot get word comma zero plus one, which is our magic histogram making line. If you don't remember what that is, go back a couple of slides, I just talked about it. At this point in the code, and it's important to be able
to draw these lines, at this point in the code, you have the histogram and it's in the variable counts. Now, we want to find the largest one. Now we have written loops that can find the largest in a list, but now we want to find the largest value in the key value pairs of a dictionary. So we're going to start with two variables: we're going to track what the largest count is and which word has that count. And we're going to set them both to None because we're going to prime our loop. We have
to prime our loop, so we set them to None. And so then we're going to write one of these cool things that says for word comma count. So word and count are going to go through the key value pairs because we've got items here. So it's going to go through the key value pairs, loop through each key, whatever it was. There could be a million words in here. We're going to go through every one. And what we're going to do is we're going to make sure that big count is the current largest count
we've seen so far. And if it's none, well, then we haven't seen anything or the current, the count we just read is greater than the big count so far, we are going to jump in and this is sort of like, oh, this is a new personal best count for this particular dataset. And so we're going to remember the word in big word and we're going to remember the count in big count. So this is just a max loop. It's a maximum loop with the extra thing that we're recording in addition to what count is the
largest, the word that was associated with that count. So again, this is the starting part of the loop. We're going to do some work. And then when we exit the bottom of this, big word is going to be the word that is the most common and big count is the number of times. And so if we run a file, we say, oh, in that file, the word to is the most common word. And it occurs 16 times. If we run the clown file, well, the is the most common word, at seven. And so this
now is, and this could have a very large file and give you the most common word. And so that is sort of a really good application of dictionaries. So dictionaries are the most powerful, well, they're the most powerful collection we've seen so far. It is good to see both lists and dictionaries to understand what collections are. They are things inside of Python that can handle more than one item inside of it. And we'll learn about another collection about tuples in a second. Just understand the get method because that leads to very compact code, understanding their
various ways to iterate through dictionaries. And so we've learned a lot, but in the next section, we will learn even more and put these together and do some sorting and do some other stuff and really start to see the real power of dictionaries. This is, I'm gonna do some coding. It's related to the dictionaries chapter, chapter nine. And we're gonna do some word counting. That's basically right out of the slides for, but I'm gonna just write the code in front of you rather than have you look at it in the book. So what we're gonna
do is I've got my text editor up here and let me start by making a new folder. New folder for my chapter nine exercise. And then I'm gonna go and make an untitled file. That was from the previous one. And I'll do what I always do, print hello and save it. And save it here into exercise 09 and ex09.py. So now I have a folder that's in my py4e folder and that happens to be in my desktop. Py4e is my folder on my desktop. And now I have all of these subfolders, cd ex08. ls is
dir on windows. ls, oops, I gotta go up one. cd ex09 ls, so I've got that file right there. Now I'm gonna wanna read some files and so I'm gonna bring some files down, a couple of files, Python for everybody, code3, intro.txt, so I've got this URL and I'm gonna save it, save page as. And it's really important that I save it in the same folder as I'm gonna write my code just so that when I open this file it knows where it's at. So I've saved that one. And I'm gonna also take this clown
text. I'll use this to make my life simple so I have a real short thing that I can show you how it works and so now if I go back to my terminal, I see I've got exercise 09 Python, intro.txt and clown.txt, okay? So let's go back to my text editor and get started. I will prompt for the file name input enter file colon space. Now I'm gonna do something. If the length of the F name that I just read is less than one, I'm gonna say F name equals clown.txt. I do this so that
I can just hit enter and it defaults to clown.txt. If I want to give it a different name, I can. So if I just hit enter at this prompt, then this will give me a string that's zero length. So if it's less than one, I'll just assume that. So let me open that. Handle equals open F name. And let's read through it for line in handle. We'll strip it, line equals line.rstrip to take the white space off the right hand side and then we're gonna say print line. Again, I'm not just doing this. I really,
when I write code, I just saved it. When I write code, I do these kind of stuff all the time just for my own sanity checking. And so now I'm gonna run python3 ex09.py just to test that. I'm gonna hit enter now and it's gonna assume, hopefully, clown.txt if it all goes well. And yep, it read one line, okay? So that part's working. I'll just leave that print statement in. The next thing I wanna do is kind of a classic thing where we're gonna go read a bunch of lines and then go horizontally across those
lines in words. So I'm gonna split that. WDS equals line.split and print WDS. So I'll print that and I'm gonna save it and test it. I really love to test things over and over. There's the actual line. This file clown.txt only has one line and it breaks it into words and so I have those words. Let's just run it again with intro.txt. So this will have a lot of lines. Line, line, line, line, line, lots of lines. Every line has a prints out the line and then prints out the words that we split it into,
okay? So now I kinda, one of the things that I do here is I wanna believe, now I sort of can believe everything from here up. Like, oh, it's gonna open the file. It's gonna read through the lines and I'm gonna split them into words. And so then I'll just kind of behind it, I'll just say, okay, I'll just comment that out. Now I need another for loop for W in WDS. Now words is a Python list and has some number of words in it, zero or 12 or whatever was on the line. And now
I'm gonna print out the word, okay? And so now it will go through that horizontally. Now I'll just do clown.txt. So that you see, I'm not printing the line out. That's the words that have been parsed from the, split from the line. And now we got this loop. Now, one of the thing that's interesting is just to make sure that you're going through all the words. And I like a print statement here to know that W is going to successfully take on literally all the words of this file. So if I comment this print statement
out and I run it again, clown.txt, that for loop starting from here is every word in that file which happens to only be one line. But now if I do the same thing for intro.txt, it's just gonna go through the words. And in a sense, by nesting these two loops, we're gonna hit all the lines. And that's a lot of stuff, but it hit all of the lines, all the words, and away we go. Okay, so here's where a dictionary comes in. I'm gonna make a variable called DI for dictionary. And I'm gonna say, give
me a dictionary. Now, D-I-C-T is not something you can choose. That's saying make, that's defining the type of dictionary. DI is a variable that I chose. Okay, so the key thing to this dictionary is we're gonna make a counter. And we're gonna use W, the word absorb, elegant, whatever, and we're gonna use that as the index. So the simple thing to do is to say if W is in DI, then we can say W's, I mean, the dictionary sub the word which is our key and the key value store of the dictionary is equal to
the value that we had before in that area, D sub W plus one. And if it's not in there, else D-I sub W equals one. And I'm gonna print, print new. So every time we see a new word, it's gonna say new. And I'm going to also then print W and the current value of the counter for W as it's going through. Now notice how far in I'm indented. This is all part of this inner loop. So this is the loop that's gonna run every single word. Okay, and I'm gonna run this first with clown.
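As a rough sketch of where the code stands at this point, here is the nested loop with the slow if then else counter and the debugging prints. A one-item list stands in for the opened file so the example runs by itself; that list, and the shortened clown sentence in it, are assumptions for illustration.

```python
# Stand-in for the opened file handle: one short line of text.
lines = ["the clown ran after the car\n"]

di = dict()
for line in lines:
    line = line.rstrip()        # take the whitespace off the right-hand side
    wds = line.split()          # the words in this one line
    for w in wds:
        if w in di:
            di[w] = di[w] + 1   # already there: add one to the old value
        else:
            di[w] = 1           # not there yet: insert it with a count of one
            print("**NEW**")
        print(w, di[w])         # the word and its current count

print(di)                       # de-indented twice: runs after both loops finish
```

The prints inside the inner loop run once per word, which is why they are handy for checking that w really does take on every word.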
So it runs slowly, okay, so we saw the was new and the count is one, clown is new, count is one, ran is new, the count is one. After is new, the count is one. Now we saw the again, but now we made the count be two. Let's print here. I'll say existing. So you can kind of see it. Now in the print, I'm printing this, let's make it even a little more verbose. Print W and then I will make it so it prints the, it prints the word before and the count after and then
whether it's existing or new. So we'll put a lot of print statements in. Print statements are cheap, okay? So now we see the word the, it's the first time we see it and we set it to one. We see the clown, it's the first time we see it, we set it to one. We see ran, new, one. Later on, we see the, it's already in. So existing means it was already in the dictionary. W as a key was already in the dictionary, okay? And so that's why we added one to it. So the old value
was one and then we added di sub the equals di sub, di sub the equals di sub the plus one. W is the string, the, T-H-E. That's what that string is, okay? And so we've made it all the way through and you see the in this one line occurred ultimately seven times. So now I want to print out the contents of this dictionary at the very end of both loops. So I got it de-indent twice and so that will give us the counts. Okay? And so this is what we get when it's all said and
done. You know, the happened seven times, but it just worked through its way through, okay? So you got that. Now, this is a pretty verbose way of doing this but I did it sort of the slow way to show that there are two situations. If it's already there, you increment it and if it's not there, you set it to one, effectively inserting it, right? So you insert it and set it to one with this D-I sub the equals one, okay? But let's get a little less verbose here, get rid of some of these print statements
because we kind of covered all that. Get rid of this line and go back to printing W and D-I sub W at the end, we'll leave that one in. So what I want to do is I want to look at this bit of code right here, this if W in D-I, else. We do this so much with dictionaries that there is an easy mechanism to do this that combines these four lines into a single kind of contraction. And so I'm gonna do this, I'm gonna print, let's put two stars out, then the word and D-I dot get of
the word comma negative 99. Okay, and so this D-I dot get of the word is the important part. The way it works is, this is a dictionary, and dot get takes as its first parameter the key to look up, which is a word like the or fell or clown or whatever, and negative 99 is the default value that we get if the key doesn't exist. So this is, in effect, an if then else, right? This little D-I dot get W comma negative 99 is, if it's in there, do one thing, if it's not in there, do something else, okay? So
let me show you how this works and you'll see when the negative 99 will happen. Okay, so the first time we see the, get returns negative 99, right, so let's move it over here. The first time we see the, the is not in the dictionary. So this D-I dot get of the word the in the dictionary gives us back the negative 99, okay? And this still is working and so the is one, clown is whatever, but away we go, okay? Let's do it this way, let me comment this out. Let me comment this one out and
run it again so it's a little clearer what's going on. Okay, so the first time we see the, the is not in the dictionary. The first time we see clown and we know it's negative 99, but here we asked for it and the is one because we've seen it before. And so that's just this get mechanism allows us to get the new value or get a value out if the key exists and specify a default if it's not there. So I'm gonna go old count equals D-I dot get W comma zero. So instead of using
negative 99 here, I'm gonna use zero and just get rid of all this. What I'm saying is, look up in this dictionary. Get is a method that's part of all dictionaries. Look up using the key W, which is the, and if I don't get it, give me back zero. And so I'm gonna say print word comma old comma old count. And now what I can say, whatever the old count is, it's either the value that was in there or zero. And now I can say new count equals old count plus one. And now, let's see, new count, and I can
say dictionary sub word is equal to new count. So instead I'm gonna get rid of this if then else then. This is basically saying, look up the old count that we have. If you don't find one, use a zero. We'll print that out. And then I'm gonna say afterwards, I'm gonna print the new count. Now, and so, we'll print the old count. Here are some of these blanks. Print the old count. And you can see the old count with the, because the doesn't exist, was zero, the new one's one. Clowns old is zero, new is
one. Ran's old is zero. But now we get to the, its old count was one, and now its new count is two, okay? So by using this get and saying if we don't find it, we'll assume the count is zero. That makes a lot of sense, right? If not there, the count is zero. If the key is not there, the count is zero, okay? So that's what this line does. Get the value associated with the key, or give me zero back. And then I can take that old number and just
add one to it and then stick it back in. Now this is ultimately not how we tend to do it, okay? We tend to blend this all into one big long statement. Di sub w equals this part plus one, okay? So that says get the old value from this key or zero and then add one to it, because that really combines all of these lines into a single line, okay? So I'm gonna delete them now. And now we've combined this all into one, what effectively is an idiom. Retrieve, create, update, counter, all in one line.
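Here is a small sketch of that progression, from poking at get with a silly default, to the three-step old count and new count version, to the one-line idiom. The starting dictionary and the names oldcount and newcount are just illustrative choices.

```python
di = {"the": 1}

# get() never produces a traceback for a missing key; it hands back the default.
print(di.get("the", -99))      # key exists, so the stored value comes back
print(di.get("clown", -99))    # key is missing, so the default comes back

# Three-step version: look up (defaulting to 0), add one, store it back.
w = "clown"
oldcount = di.get(w, 0)
newcount = oldcount + 1
di[w] = newcount

# The same three steps folded into one line.
w = "the"
di[w] = di.get(w, 0) + 1

print(di)
```

Zero is the natural default here because a word we have never seen has, so far, a count of zero.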
I'll still print out, in this case I'll just say di sub w and then we'll see the counter, okay? And so now I'll run this. We don't have the new and existing messages anymore, but now we see the the second time and it's two, and so we see car the first time, we see the the third time, we see car the second time, and away we go, okay? And so that's pretty straightforward. There's kind of a typo there. So let's just get rid of that and run it with a
clown stuff and we get the right data there and let's run it with intro dot txt and there we go, okay? And so it's tearing out a bunch of words and giving us a dictionary. And giving us a dictionary, so that was a lot of work to get to this line 16 that has the dictionary in it. Now we wanna find the most common word. And so we're gonna loop through this dictionary and part of it is like once we printed this dictionary out and we verified that it's right, don't worry too much about the
code up here, right? Matter of fact, I can take out some of these print statements and we can kinda trust all this and so now we're gonna work on this, okay? Now we wanna find the most common word. Now this is like a maximum loop. So if you recall, we have a whole set of key value pairs, communicate goes to two, is to two, skills is three. So we have these key value pairs and we're gonna loop through and look for the maximum. Now in a dictionary, we can loop through the key value pairs with
the following syntax: for, you know, I would call these variables k and v for key and value, in the dictionary's name dot items. And items is a method inside of all dictionaries that says give me the key value pairs, and we need two iteration variables. So this is like an assignment statement for k and v: k and v take on the successive values for the key and the value, okay? So if I just now print k comma v, and I'll take this print statement out and then run the code on, oops, what I
forgot, oh, I fell back into my Python two days. Need parentheses for my print. So there's clown and it just prints it out, and it's kind of the same thing except it's prettier, we're putting each one on a line, okay? So the k is the key, the v is the value. So we're looking for the largest value, oops. So the thing is, we know that the values are always numbers that are at least one. So I'm gonna do kind of a quickie maximum loop. Largest equals negative one. Now in previous times, we've seen that this is a
bad assumption, but because we know these are counters that are always positive, it turns out this is not a bad idea. And so I can say if the value is greater than the largest we've seen so far, largest equals the value. Okay, and when that loop is all done, we can print the largest. Okay, and so this is just a max loop and we're using this value. That's the number, the value is the second thing. Oops. Ah, can't type Python. Oh, it's a typo. Yeah, I'm not using value, I'm using V. So largest equals V,
let's try it again. Okay, so we're all done with seven. So these were the things that we were looking for and it was looking for the maximum and it just dutifully found seven was the largest. But we also wanna know what the word is. And so what we can say here is we can say the word is none, meaning it's just like we don't know what the word is. And then whenever we catch this new largest number, we say the word equals W. So I like to think of this as capture, remember the word that
was largest. Right, that's what I'm doing. R, E, M, E, M, R, M, E, M, remember, right, that's tough. R, E, M, E, M, B, E, R, there we go. So we're gonna, this trick here is, not only knowing what the largest number was, but the word that was associated with the largest number. So now I can print out at the end the word and the largest and that's the count. Okay, and so now we know that, oops, did we make a mistake here? Okay, that does not look good because it says car and seven.
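The maximum loop being built here can be sketched roughly as follows. The dictionary contents are made up, and this sketch remembers the word using the loop's own key variable k; theword is the name assumed here for the remembered word.

```python
# Made-up counts standing in for the dictionary built from the file.
di = {"the": 7, "clown": 2, "car": 3, "ran": 2}

largest = -1      # safe starting point only because counts are always positive
theword = None    # we don't know the word yet

for k, v in di.items():
    if v > largest:
        largest = v
        theword = k   # capture/remember the key that goes with the new largest

print(theword, largest)
```

The point of the second variable is that the loop ends up knowing not only the largest count but also which key it belonged to.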
If V is greater than the largest, oh, it's not W. I used a really bad variable. See, that's the whole value there. There we go. It's K, which is the key. Key. I was gonna say that was quite the bug. See what happened there? I had this as W and it just happened to be, it was the last word on the file. Car, the last word in the file because I used a wrong variable. No, little mistakes, little mistakes. The and seven. Okay, so let's get rid of this print statement because we kind of know
what's going on here and away we go and this should now work, if we run it. I can even get rid of the word done here. There we go, "the," seven. Now, the cool thing about this is this code runs just as easily with one line of code or the intro of the book, intro.txt, and not surprisingly, "the" is still the most common word in intro.txt. I seem to like that word, and it's 226 times. Okay, and so that is the basic pattern of reading some data. This is just a word loop; now sometimes, there would
be some, you know, checking to see if the line is the one you're interested in, maybe tearing apart the line, but at the end of the day, it's this idiom of starting a dictionary. Now, it's a common problem to know where to start the dictionary. You want to accumulate the numbers for the whole file, so you don't want to put it in between line six and line seven. Okay, so I hope that particular thing helps a little bit, helps you understand dictionaries.

Hello and welcome to chapter 10. Now, we're gonna talk about our third
kind of collection called tuples, but tuples are really a lot like lists. There's not too much to them; they're really kind of a reductionist version of lists. So they function very much like lists, in that, you know, they have things in them, and the difference is there are no square brackets, there are parentheses, round braces or whatever, and they have positions zero, one, and two, just like a list, and you can look things up, x sub two, so x sub two is actually the third element here, and so that prints out Joseph. You can assign, you
know, make a tuple here, this is the constant syntax for a tuple, and print that out, and the print statement shows you that this is a tuple, not a list, by showing you round parentheses, and a whole bunch of functions that work with lists work the same way with tuples. You can put a tuple after the in of a for statement, as you might expect, and then it iterates through the tuple. Tuples maintain order, so it prints out one, nine, and two. So, literally this bit of code here could be identical, whether
it was a list or a tuple, it really would do the exact same thing. The difference is that tuples are immutable: once you create the tuple, you can assign the whole tuple, but you can't modify it, whereas you can modify a list. So if we take a look at a list here, we make a list that's nine, eight, seven, and we say x sub two equals six, well, that just means this seven becomes a six, and that's just natural, meaning we can reassign slots, we can delete things, we can insert things, we
can mutate them, we can change them, so they're changeable, right, they're changeable. But, if we try to do that same thing with a string, so we say Y equals ABC, and we know that this is position zero, one, and two, but if we try to say, let's change the C to a D, by saying Y sub two equals D, that is not allowed. And it says it doesn't support item assignment, and this little bracket, you know, X sub two, is what they call item assignment inside of Python. And so if we do the same thing
then with a three element tuple, put that in Z, and we try to change this slot to be a zero, it's gonna blow up, because it's the exact same thing. And that has to do with the fact that, once this assignment is made, this is not modifiable. Now, it turns out that the reason it's not modifiable is for efficiency. They take up less storage, they are quicker to access, and they're really designed internally behind the scenes in ways we don't really need to understand. They're just more efficient than lists. If all you wanna do is
store a list, and look at it, and then throw it away, you probably should use a tuple instead. So there's a lot of things that you can do with lists that you can't do with tuples, but they're really just a corollary of this notion of immutability. And so, like, you can sort a list in place, but you can't sort a tuple in place. You can append a five to the end of three, two, one. Can't do that in a tuple, but you can in a list. And reverse, flip the order, dot, dot, dot. So anything that you
can do to a list that modifies the list is not allowed for tuples. And so you can take a look at the methods that are part of each: for a list, append, count, extend, index, insert, pop, many of these are modifying, and count and index are the only ones that work for tuples. And so tuples are limited lists. Now, at some point, there's gonna be a but here to say, why do we like them? And the reason that we like them is that they're just more efficient. Python, in its own internal organization of these objects, doesn't have to build in
the ability to change them. It knows that they'll never be modified, because when you make a tuple, you as the programmer are saying, I'm never gonna modify this, and Python won't let you do it. So it's higher performance, better memory use, and you know, to a beginning programmer, that doesn't really matter, but that's the reason. And so we tend to use tuples in situations where we're gonna make a temporary variable and then temporarily use it just a little bit and then throw it away without really messing with
it. And we tend to use lists to build things up, et cetera, et cetera, et cetera. So the other thing that's interesting about tuples, and we've actually sort of seen this, is that you can put a tuple that includes variables on the left side of the assignment. And this takes a little getting used to, but it's really cool, and no other language that I know of does this. So if we say x comma y, that's a two tuple. Both have two variables. You can't put constants on this side. You know, it's like saying x equals
four, y equals Fred, right? So what happens is, is you can put a tuple on the far side of an assignment statement, and the four goes to x, and the Fred goes to y. And you say, what's in y? Well, y is indeed Fred. And so this is like two assignment statements. Now, the way I've got this syntax, I would probably do two separate statements, just not to show off that I know how to do tuples. And so you can, here's another one, and they just move correspondingly. If you don't have two here, and you
do have two here, well, if you have three here, or two here and three there, and you don't match the number, you get in some trouble. Now, if you just say x equals a tuple, then the whole tuple goes into x. But this is just a simple straight 99 value going into a. So you can put tuples on the left-hand side. And you can even do things like return a tuple from functions. That's a real nice Python feature that I like a lot. Tuples are also related to dictionaries, as we've seen in the previous
chapter. So here we make a little dictionary. We make an empty dictionary by constructing an empty dictionary, stick it in d. So d is sort of like this place that can hold key value pairs. And we put csev, and there's a two in there, and chen1, and there's a four in there. So we have this associative mapping between csev and two, and chen1 and four, all stuff we know. And now we say, hey, we're gonna loop through the key value pairs here, and we've seen this syntax before, k,v. So this is a tuple. So you
can think of this as each one of these things is going to get assigned into this tuple, which means the first one's the key, and the second one's the value. I use the variables k and v all the time in code that I write, just for my own sanity. So k and v are gonna iterate through the successive keys and values. So this is gonna run twice, and k and v are gonna be csev and two, and then chen1 and four. The order just happened to stay the same. And so if you say, what is in
one of these things, you can actually take d dot items, the items method within that dictionary, and say, hey, give that back to me in tups, and then print tups. And this is, it's a special kind of a class, but really ultimately it is a list of tuples. This is the zero, and this is the one, the first and the second, and then within each thing, you have a two tuple. And so in a sense, this k and v are iterating through those things when we're putting d items
here and d items there. One nice thing about tuples is that they're comparable. They're comparable in the same way that strings are comparable, meaning that they're compared from left to right with the leftmost, or zeroth, element being the most significant. And it doesn't compare any further than it has to if it's asking less than. So if it's looking at, say, this first tuple, it starts at the left and says, okay, ask the question, tell me true or false. Is zero less than five? The answer is true. And so the answer to this overall expression is
true, and it doesn't even compare those two numbers, those second and third number, they don't compare them. If, on the other hand, we're asking is this less than that, it only looks at the first one and asks if it can answer the question. The answer is, well, they're both zero, and so I can't answer the question, so I have to go to the second one, second pair, and one is less than three, and so that means this is true. And it does not check this. Even though 20 million is bigger than four, it doesn't matter
because these are the numbers that cause the true to happen. And the same is true if you do this with strings. Again, we start with the first one. So Jones, Jones, well, that's the same, so we don't know the answer yet, and so Sally, Sam, well, okay, S, S, well, they're the same, A, A, they're the same, oh, L and M. L is less than M, so the actual letter that makes the difference here is the L and the M, and that leads to this being true. And so it goes left to right, but then even when
it's doing strings, it's going left to right. That's just how string comparison works. And if we say, is Jones, Sally greater than Adam, Sam? Well, we check the first one, and we check the J and the A. Well, J is greater than A, and so we don't have to look at anything else. We don't have to look at any more of these characters. We don't have to look at the second thing in the tuple. What we've looked at is enough to be true. So it only scans until it has a definitive answer. It
doesn't scan any further. So now what we're going to do is use this comparable capability to sort these lists of tuples and then bring this all back and connect it more to dictionaries. So now we can take advantage of the notion of comparing tuples and use sorting. And so what we're going to produce is a list of tuples, and then we're going to sort them, right? And so we can get a list of tuples from a dictionary and then we can sort that list of tuples, and then we can end up sorting dictionary items by
taking this two-step process. Convert the dictionary to a list, sort the list, and then we have the dictionary's items sorted, okay? And so we'll do this a couple of different times. So if we take a look at this code right here, we have our happy little dictionary, A, B, C, A maps to 10, B maps to one, C maps to 20. Like what are we going to get here? Well, it comes out, the mapping is the right way, but the order is whatever. And now we use this function called sorted, which takes as input a sequence
and then returns us a sorted version of that, a list that's sorted. And so it says sort D items. So it's basically going to take this list and compare the A's and the C's and the B's, and because it's a dictionary and all the keys are unique, there's never going to be equality. So it really is going to just sort this by keys and never get to looking at the values. You could construct a list that had duplicate, you could make a list of tuples that had duplicates in the first like we did before, but
given that this is coming from a dictionary, the first thing is always going to be unique and distinct. And so if we say sorted of d dot items, we're passing this stuff into sorted, sorted is going to move stuff around and then give us back a sorted version, sorted in ascending order based on key without looking at the value. And so the way to see a dictionary sorted by key is to just say sorted of d dot items. And sorted is a function, and so it just picks stuff. And so this is the kind of
loop that you're going to write to do that. You know, we did this before, we took sorted, and we got these sorted by keys. And so you can just make this nice and simple for key value. By the way, you can eliminate the parentheses here, and I think it's prettier if you eliminate the parentheses, but you could put parentheses. This is still a tuple without the parentheses for key and value in sorted, so that says go through D items, but before I go through them, please sort them. So that means K is going to go
through A, B, and C deterministically every single time it's going to go. And of course, value is going to go through the corresponding value, so now we can print this out nicely sorted by key. And that's a real nice succinct little way to say that. I mean, again, these are one of the kind of things that people really like about Python is that you can do pretty powerful things with easy to under, I mean, you know, you might have seen this for the first time, but ultimately you look at that, eventually you'll be like, oh
yeah, I see exactly what that's doing. Easy, not hard at all. So, but let's say we're looking for the most common word, which we have been for weeks and weeks and weeks now. And so we want to sort by values, not key. So this is an example of where we're going to construct a data structure, we're going to imagine a data structure, and then we're going to write code to construct the data structure, and then that's going to make our problem easy. So this is an example of using cleverly constructed data structures to do this.
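Before moving on to values, the two ideas covered so far, tuple comparison and the key-sorted loop, can be sketched together like this. The dictionary follows the a, b, c example as it was read out above; the specific tuples in the comparisons are assumptions chosen to match the description.

```python
# Tuples compare left to right and stop at the first position that differs.
print((0, 1, 2) < (5, 1, 2))                   # decided by 0 < 5
print((0, 1, 2000000) < (0, 3, 4))             # decided by 1 < 3, third items never matter
print(("Jones", "Sally") < ("Jones", "Sam"))   # decided by "l" < "m" in the second string

# Because of that, sorted() on d.items() orders the (key, value) tuples by key;
# keys are unique, so the values are never needed to break a tie.
d = {"b": 1, "c": 20, "a": 10}
by_key = sorted(d.items())
print(by_key)

for k, v in by_key:
    print(k, v)
```

Writing `for k, v in sorted(d.items()):` directly does the same thing without the intermediate by_key variable.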
And the data structure that we're going to create is a list of tuples where the value is first and the key is second. So with items, you just get key, value. I want value, key. So let's take a look at this code. Take your time and get it right. So k, v goes through c dot items. Well, that is unsorted and is going to go through A, B, and C in whatever order. And we're going to make a new list. So this is a data structure that we're creating temporarily. And what we're going to do
is this is a list. And we are going to append to that list a tuple. So this is going to be a list of tuples. Except we're not going to append them in key value order. We're going to flip them and append the first part of the tuple is going to be the value and the second part is going to be the key. So we end up with this. This is sort of our temporary data structure that we have constructed to make our job really easy. So this ends up being 10A, 22C, 1B. Now we
just kind of flipped them. We took this order and then we flipped them around. And so now we have this nice little list sitting in memory in a variable. And that's really simple. We can say, oh, look, we can use sorted. And we can sort by now the values because they're the first thing. The sorted doesn't know how we produce this list. It just looks at that and says, oh, that's a list of tuples. I'm going to always sort by looking at the first item in any tuple. And I'm going to add reverse equals true
so I get a descending sort. So I see that the value that is highest ends up being first. And so that changes this, and I just sort it and then reassign it back into temp. And I'll print this out. And so now you see it's sorted in descending order of value. So it's value key, value key, value key, but it's sorted in descending order, okay? And so that's an example sort of of just like, you know, if I just made a data structure, I flipped those things around, I could use sorted to sort these things.
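The flip-then-sort steps described above might look roughly like this, using the a, b, c dictionary with the values 10, 1, and 22. The variable name tmp is an assumption for the temporary list.

```python
c = {"a": 10, "b": 1, "c": 22}

# Build a temporary list of (value, key) tuples: the pairs from items(), flipped.
tmp = list()
for k, v in c.items():
    tmp.append((v, k))
print(tmp)

# sorted() compares the first item of each tuple first, which is now the value.
# reverse=True gives a descending sort, so the highest value comes first.
tmp = sorted(tmp, reverse=True)
print(tmp)
```

Nothing about sorted() changed here; only the shape of the data handed to it did, which is the whole trick.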
There's many other ways you could do it, but this is sort of the more elegant way of doing it. And the clever bit here is, like, make a new list and make it be a little bit different, okay? So here we're going to print out the top 10 most common words in a file. And most of this code is review. So if we take a look at it, we're gonna open a file. We're gonna start a dictionary for our counting. We're going to, you know, there's gonna be words and lines, right? And so we're gonna
have a for loop. This for loop is gonna go through each line. And then of course we're gonna split them, which is busting them into pieces. And then we have a for loop within that. And this for loop is gonna go through each word. And so that means that by nesting these loops, we're going through each line and then within the line we're going through each word. Then we go to the next line and go through the words. And eventually this line of code, counts sub word equals counts dot get of word comma zero, plus one, is our
idiom for making a histogram, right? This line right here is an idiom. If you don't know already what that is, go back to the previous dictionary lecture and understand it, understand it, because you're just gonna use it over and over again. So now at this point, and I always like drawing horizontal lines in code when we write it. At this point, coming through at this point, counts is right. Counts is the histogram. It's not sorted. So now we wanna sort it. So we're going to make a new list. We're gonna loop through key value. And
then we're gonna make a tuple. I'm making this be two lines to make it a little easier, value key. So I'm flipping it, right? So I'm flipping the order of these things. That's making a tuple. And then I'm appending that tuple to the list, okay? So at the end of this, we have a list of tuples in value key order, vk, vk, right? So at this point, coming through here, I've got in my LST variable, I've got this really useful bit of code, or useful bit of data that I produced. And then I'm like, oh,
now it's ready to be sorted. Poof, sort. So take the list, sort it in descending order, and then stick that back in the list. Now we wanna print it out, but we don't wanna print it out in value key order, which is what it is. So we got a nice sorted list coming down here. It's in parenthesis v, k order, but it's sorted. And we know that the highest value is here, on down. And so we're gonna say, we're gonna run through, and now we're gonna go through
this new list, only the first 10, start at the beginning up to, but not including number 10, which is the first 10, for value key in, and so value is good. So this is the iteration variable that's gonna go through each of these things, on and down, and then we're just gonna print it out flipping it. So we reflip it, flip, flip, we print it out key value, and it's going to work. Okay, so that is one way of doing this. And this slide right here, you absolutely do not need to figure out, but some
of you will look at this slide and you're like, why didn't you show us that in the beginning? And others of you will be like, no, no, no, no, no, keep telling me this stuff here. So I don't know exactly the term for this, but this is very procedural. This is a classic algorithms and data structures approach to solving this problem. This next thing uses what's called a list comprehension, and it kind of creates what I call a closed form, where you kind of do it all in one statement, and there's all this
implicit stuff going on. So if you don't get this right away, don't worry too much about that. But roughly, this single line does everything that bottom half of that program does. I mean, if we go back to here, it's pretty much this line does that in one line, okay? It doesn't create the counts, and it doesn't print out the top 10, but it does everything in that middle bit. So let's take a look at this. So we are gonna collapse this all down. So we have a print, and that paren
is the end of the print. And then we have sorted, and remember that sorted takes as input a list. And so that's not too bad, and returns us a list. And so we'll print the return from sorted. And then this is the funny part, the fun part, funny part. This is called list comprehension. And we have square brackets, and we say to Python, this is a list. But instead of listing the things, or having a constant one comma two comma three, or append, append, append, we are going to create an expression that
will act as a generator for all the elements. And so this basically says, this is a list of two-tuples, v comma k, and then this is sort of implied: for all k, v in c dot items. And so this is like a for loop that is sort of driving this; think of this as like stamp, stamp, stamp, stamp, stamp. However many times it has to make a stamp. And so that's producing a list. Ch-ch-ch-ch-ch, right? It just manufactures this list. And then that list is sort of manufactured in the moment. There's no stock, it's not put
in a variable. Python makes that list according to the stamping pattern that you've told it to use to stamp out this list. And then it passes that stamped out list, without even storing it in a variable, into sorted. Sorted moves the list around, because it is just a list of tuples, and then gives us back the sorted list. And so I didn't put reverse equals true on here, but you see that this is sorted in ascending order now by value. And I did that all in one little statement. So look at this, this is also one of
the beautiful things about Python that you can build these things, and you can build more complex versions of this, and there's a lot of real elegant things that you can do in Python that are really succinct. You should be careful, because in the beginning I think this is easier to understand, even though after a while you're like, wait a sec, why am I putting all these extra lines in? Because this is not so hard to understand, but at some point you will want to master this more powerful and more succinct version of Python that expresses
it in terms of the data you wanna see rather than the steps you wanna take. So this sort of finishes up tuples. We've done a bunch of stuff. I mean, really, they're simple and elegant. Tuples, lists, and dictionaries are all related. They're really three different, kind of three foundational data structures, three foundational collections of Python. And we combine those in a lot of different ways.

And now in this little bit of lesson, we are going to talk about some tuples, and we're going to create a list of the most common words and find out how
to sort a dictionary by the values instead of by the key. We're gonna use the clown.txt file and the intro.txt file. And I'm gonna start with the code from exercise nine that I just did from chapter nine. It's not exactly one of the exercises, but it's very similar to them. And I'm going to make a copy, and I'm gonna keep it in the same folder. I'm gonna keep it in the ex09 folder and just call it ex10 because this code is going to do much of the same stuff, and it's gonna read these same files.
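As a rough sketch of where this is heading, here is the count, flip, sort, and print pattern from the lecture in one place, followed by the one-statement list comprehension form. It uses a few stand-in lines of text instead of opening clown.txt or intro.txt, and the names lines, tmp, and tmp2 are assumptions for illustration.

```python
# Stand-in lines of text; the real program reads these from a file.
lines = [
    "the clown ran after the car",
    "and the car ran into the tent",
    "and the tent fell down on the clown and the car",
]

# Histogram idiom: one counter per word.
counts = dict()
for line in lines:
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1

# Flip each (key, value) pair to (value, key) so the sort goes by count.
tmp = list()
for k, v in counts.items():
    tmp.append((v, k))
tmp = sorted(tmp, reverse=True)

# Flip back while printing: up to, but not including, position five.
for v, k in tmp[:5]:
    print(k, v)

# The same flipped, sorted list built in one statement with a list comprehension.
tmp2 = sorted([(v, k) for k, v in counts.items()], reverse=True)
print(tmp2[:5])
```

Ties in the count fall back to the second item in each tuple, so with reverse=True tied words come out in reverse alphabetical order.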
And so I've got myself exercise 10. Exercise nine is still here. Exercise 10 is now what I'm editing, exercise 10. But I'm in the exercise nine folder. So in exercise nine, we look for the most common word, but we wanna find the five most common words, which is gonna require us to sort. So I'm gonna get rid of that code because it's not really how we're gonna do it. There we manually loop through it and found the maximum. And so I'm gonna just run this. CD, desktop, Python for everybody, ex09. Now if I do an
ls, you see that I've got ex09.py intro.txt. So I'll run python3 ex10.py and run the clown data. And we see that we see the dictionary is properly making it in this code right here. That doesn't change. It reads the file, reads all the lines, goes through and splits it into words, and then goes through the words and does the idiom of using dictionary get to maintain the counters, and we print it out at the very end. So the new code we're going to write is down here. So let's first do a few things. If I
can say x is equal to the dictionary, dot items, and this gives us basically a list, print x. This gives us a list of the key value pairs. This prints out the dictionary, but if we do it this way and use items, it gives us the key value pairs. Okay, and so that's what we got. Key value pairs. Now we can sort this based on the value because tuples can be compared. This can be compared with this. And because d is lower than r, then this one is lower. This whole, this ran tuple comes after
the down tuple. So we can sort this whole thing. And I'll do this by just putting the word sorted here and say, give me a sorted version of that. Now it's going to do it based on the order of the tuples. This is going to be more, higher precedence than this. So if I print it this way, run it again, you'll see that it's sorted. And now is after and car, it's in alphabetical order by key. And so we could actually print the first five up to, but not including five by adding a list on
the slice, a list slice here. And so that will show you only the first five, right? Except that that's not what we're trying to do. We really want to sort by this, okay? So we have this mechanism that can take a list and sort it based on the tuple values. If we could create a list where it was one comma after instead of after comma one and make it exact same thing, then we could actually then sort it and it would be fine. Okay? So let me show you a couple of ways, at least one
way to do that, okay? Get rid of this. We're gonna hand construct a list and just call it temp equals, give me a new list. Temp equals new list. And then for k comma v in the dictionary dot items. And I'll just start by printing k comma v. So we see, and this is where it's really nice to do these with the clown code first and then only do your test on the bigger file later. And so it's pretty much the same thing; we are going through in key value order, which is dictionary order, which is not
sorted at all. Okay? Now, instead of printing this out, we are going to, let me do this in a couple of steps. Make a new tuple and I'll just call it newt equals parenthesis V comma K. Okay, so this is, I'm saying make a new tuple. This is like a new tuple with two items in it and I'm gonna make the value and the key. Okay, so then I'm going to say temp.append newt, newtuple. So I'm gonna end up with a list of tuples. Let me comment this one out and I'm gonna then, when I'm
done here, I'm gonna print temp. So if I run clown.txt, you see what happens in temp. It's still, well, let's print temp twice. I mean, it's not sorted, it's flipped. Let's print it, that's okay. We'll just, that's the flipped one. Okay, so it's flipped and all we did is we made it, instead of car comma three, it's three comma car. But now we have a list. Okay? So now it's flipped and now we can sort that. We can say temp equals sorted temp. So it says, takes temp and sort it and give it back to
me and now I'm gonna say print sorted comma temp. Okay, so here's the first print. When we flipped it, we've got two, tent, but it's not sorted at all. But after we sorted it, it's sorted by tuple and the lowest is one, after. So you'll notice that one is the same as one, so it checked the second item in the tuple. So down comes after after, fell comes after down, into, on: alphabetical order. But now we get the twos. So all the ones sort there and then the twos come here, but then within the twos,
it's sorted in alphabetical order because like a string, if the first character matches, then it looks to the second character. And then we see, oh, here we go, the threes and then the one we actually wanted, the highest one is the seven. And so one of the things we can do is we can say, you'll notice that we want the highest one, not the lowest one. So we can just tell this with this parameter, reverse equals true. And we just say, hey, sorted, do this backwards, do it from highest to lowest rather than lowest to
highest. And now our sorted one says seven, the, et cetera. Okay. And so we want the first five, we can say up to, but not including five. So this is now the top five. So the sorted one is, that's the top five. If there's a tie, it's gonna go in reverse alphabetical order, but let's not worry about that too much for now. So it makes a flipped list, then it sorts the flipped list. Now, if I just wanted to print it out nicer, I could loop through this new list. I could say for
V comma K, remember this is a flipped list. So the sensible thing is what's coming out, I mean, coming out of this list, each tuple is value comma key in temp. And I'm only gonna go up through five up through, but not including five, so the first five. And so I'm pulling them back out as value key because that's what they are. They're value key, see value key, value key, value key. So V is gonna go through these and K is gonna go through these. And then I'm just gonna print K comma V. So this
is kind of my flipping backwards because I wanna see them this way. And thus the most common one, car three. And so it's just going through this up through the fifth one and then printing them out. Okay, so let me comment this out. Let me comment that out. Let me just delete this. So we have a dictionary. Let me comment out the dictionary. We have a dictionary, we make a list and we make these reversed tuples where we have the value first and the key second. We're setting it up so the sort's gonna work. And
then once it's sorted, we have to flip them back. So we flip them for sorting from key value to value key for sorting. We do the sort, then we flip them back with key value and print them out. And it works fine. So let's try our big file, intro.txt and there you go. Those are the five most common words in intro.txt. So you might ask yourself, why did we use tuples? We probably, we could have really used lists for this but tuples are more efficient than lists and you notice that we weren't gonna modify. We
did modify the temp list, it's a list of tuples, but the tuples within the list we weren't gonna modify. And so we tend not to make lists if we can get away with using tuples. And so that's why we made this flipped tuple thing. Okay, so I hope that was useful to you. Hope to see you on the net.

Hello and welcome to chapter 11, regular expressions. The fun thing about this chapter is that it's unlike all the rest of the chapters. You really had to understand every single thing in chapters one through 10, because they built on one another. But you can really get along without using chapter 11. It's not really a required topic but it's a fun topic and an interesting topic. So you can relax a little bit and realize that you may or may not like regular expressions and if you don't like them, that's okay. You don't have to use them. You can go for your whole life without using regular expressions. The idea of a regular expression is that you come up with a language. It's a little character based programming language where
you can do smart searching, basically. Smart searching and, as you'll see in a bit, smart extraction. And it's really almost programmable wildcard expressions. There's no explicit looping, but there is implicit looping, and you say look for patterns that look like this and then it gives back the things that match those patterns. We do searching for everything. We're looking through large blocks of text. Say, go find me everything that has the word Python in it or something like that. So that's just such a common thing to do and regular expressions are
a very structured way to go about searching for information. They're very powerful but they're also very cryptic, and you may not like them, but they're a lot of fun actually once you understand them. Learning how to program them takes a while. Writing good regular expression programs requires some try it, play with it, check it, try it, check it, try it, check it. But once you get them they're really quite cool. It's a very old programming language. It goes back almost to the 1960s. The concept comes from the theory of computing, where they were trying to come
up with a theory of languages, and regular expressions were one form of language that computers could understand. And so it has some fun old words. And one of the advantages of knowing regular expressions is that you're kind of a cool person. You can take a quick look at this XKCD that sort of captures the devil-may-care, awesome power that regular expressions have. And while we're at it, you know, while we're talking about awesome, I do want to take this moment and show you my awesome tattoos. And so you may not know this but I got
a couple tattoos here. Here's the first tattoo. This is where I went to, got my PhD and this is my University of Michigan faculty member position. I got PhD in engineering and I teach in a school of information and library science. And then I have this other tattoo and this tattoo is what I call the ring of compliance. I work on learning management systems and educational technology and standards. And there's this standard called learning tools interoperability, which if you're using this course and doing the auto grader, it uses learning tools interoperability to integrate into whatever
learning management system you happen to be using. And one of those learning management systems is the open source learning management system that I helped write called Sakai. And these are the rest of the major vendors. And the idea of that tattoo was that I would put the tattoo of every vendor that would comply with learning tools interoperability. So you'll notice Coursera, I help Coursera put learning tools interoperability in. And so the auto graders integrate into Coursera, Blackboard or Canvas or Sakai or Moodle or often those other things. So it's just like a cool techno thing,
just like regular expressions. So I've got a URL here for a regular expression quick guide. You might wanna print this out so that you can look at it even while you're watching this lecture, because it's a little programming language except that it's character based, not line based and not keyword based. It has certain active characters, where the character means something special, versus characters that just represent themselves. And the regular expression library is not part of base Python, but it's distributed with Python. So you have to put an import re at the top, and that's
really saying pull in the regular expression library. There are a couple of functions in it. There's re.search, which is kind of like a really smart version of the find method inside of strings. And there's re.findall, which is kind of like stamping your way through a string, finding all of the things that match a particular pattern, and then extracting those. And we'll talk about both of these in this lecture. So here's a really simple piece of code where I'm just gonna show you sort of before and after. So here's
a thing where we're looking for lines that contain from colon. And so we open a file, we loop through the whole file, we strip the whitespace off the end of each line, and then we say if line.find from colon is greater than or equal to zero, then we print it. Find gives you negative one if it's not found. And so it reads all the lines and once in a while it'll print one out. So that's kind of like a needle in a haystack. To use regular expressions to do that, we have
to import the regular expression library. These lines are the same, we're gonna loop through, we're gonna strip. And now we're gonna say if re.search, the way to say this is within the library regular expressions, go find the search function and search for the string from in the string line. So this is the line to search whereas here it was more object-oriented where we say line.find and we say re.search and we pass in line as parameter. These two things are equivalent which means most of the time it's gonna run and once in a while hit a
line and it'll print that out and then it'll finish the whole thing. So that is doing what we would do with the find operation with regular expressions. Now, searching with regular expressions has these special characters and so here we have the same basic code except now we're saying if line starts with from, so we're not using find anymore and that way we're only gonna get that thing in the first position not like blah blah blah blah from colon, we don't want that to match, we only want it to match here at the beginning of the
line. And so that's why we use line.startswith. So it's gonna do the same thing and find lines that have the prefix and print those out and then be done. Now with the string methods, to change the search we have to change the method, and we only have a certain number of things we can do with strings based on what they built in. But in a regular expression search we don't change the function. We actually turn this first parameter into code. And so what's happening here is the caret, if you go back to the little cheat sheet, caret means this is the beginning of
line. It's a virtual character that matches the beginning of the line. So caret from means from that starts at the beginning. From at the beginning does match and from in the middle does not match, just by putting that little caret there. Same thing as before, line is what we're searching, and caret from colon, from colon at the beginning of the line, is what we're looking for. And so again it does the exact same thing. It only prints lines that have from colon as the first characters in the line. So the difference is that in one we look for a method and in the other one
we program the regular expression. So we're gonna run out of methods in the string class long before we run out of things that we can do with regular expressions. And so here are a couple of other special characters. Caret matches the beginning of the line. This capital X matches itself. Dot is a wildcard that matches any character. And then some of the characters in regular expressions modify the immediately preceding character, like the asterisk, which means zero or more times. And so this says look for a line that starts with X and then has any number of characters, that's these
two things. Zero or more characters followed by a colon. And so you can see that it's sort of, it's this sort of like expanding stamp. It's like oh there's an X at the beginning of the line, that line, it looks good. I got some characters here and then I got a colon, that's good. So this is an X, some characters, and a colon, check. X, some characters, and a colon, check. X, and these things, away we go. And so you can, that's what's gonna match. And so you can see how some of these characters are
special. Again, go back to your cheat sheet. Some of them are special and some of them are actual characters. And this colon and X, they're not special, they're just the characters, okay? Now, sometimes you wanna be a little more clear on your match. So, let's take a look at these lines that match that particular thing that we just did. So we have these two, X-Sieve colon and X-DSPAM-Result colon, and these are from mail messages. And then one of the mail messages has a line in it that says
X-Plane is behind schedule, followed by a colon. And this matches. Is that what you really wanted? Because this is an X, this is some number of characters, and that's a colon, it matches. It has to match. This rule applied to this line results in a yes. And so how can you be a little more clear as to what you want to match and what you don't want to match? We can write more code. So now what we're going to say is, we wanna match the beginning of the
line and we want a capital X and we want a dash. So now we're gonna match those first two characters, X dash at the beginning of the line. Caret X dash says the first two characters of the line must be X dash. Now we have another special character. Again, refer to your cheat sheet. Backslash capital S means a non-whitespace character, right? Any character other than whitespace. And then plus means one or more times, so one or more non-whitespace characters. That's what this whole thing says. One or more non-whitespace characters, followed by a colon, which is just a character.
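As a rough sketch of the two patterns discussed here, the snippet below runs the looser pattern and the tighter one over three sample lines. The sample lines are assumed stand-ins for mail headers, not text read from the lecture's file, and the names loose and strict are made up for illustration.

```python
import re

# Assumed sample lines in the style of mail headers (illustrative only).
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X-Plane is behind schedule: two weeks",
]

# ^X.*:    -> starts with X, then any characters, then a colon somewhere.
loose = [ln for ln in lines if re.search(r"^X.*:", ln)]

# ^X-\S+:  -> starts with X-, then one or more non-whitespace characters,
#             then a colon, so no space is allowed before the colon.
strict = [ln for ln in lines if re.search(r"^X-\S+:", ln)]

print("loose pattern matches: ", loose)
print("strict pattern matches:", strict)
```

The looser pattern accepts all three lines, while the tighter one rejects the line that has a space between the dash and the colon.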
So now we have X dash followed by one or more non-whitespace characters followed by a colon. X dash followed by one or more non-whitespace characters followed by a colon. Here we have X dash followed by one or more, whoops, there's a space there. And so this doesn't match. Even though there's a colon there, it means that between the dash and the colon, you can only have some number of non-whitespace characters. So this is a no, it does not match. And so you just can, if you didn't wanna match this, you then sort of create a
more precise, you know, we could even have a thing that said, I want X dash with an uppercase character, uppercase letter, if you wanted to. And so there's all kind of fine tuning if you sort of learn the structure that you've got to do. And so that's kind of the matching where you're taking a whole line and taking this template and deciding if the template anywhere in that line matches. And now what we're gonna do is use this to actually pull data out of strings using the regular expression library. So now we're going to move
from merely matching to matching and extracting. So we're going to say, hey, I would like you to not only take this template, this little regular expression pattern, and run it across the line, I want you to give me all the ones that match, and I want a list of those. And that's what we're going to use re.findall for. So search gives a true or false, and findall gives a list of all the strings that match. So if there's four of them, you'll get four things in the list. If there's nothing that
matches, you'll get an empty list. So let's take a look at what we got going here. So instead of calling search, we call findall. We still pass in the string that we're looking through. And then we have our little template pattern. And this is a new bit of regular expression. A square bracket expression matches one character, but in between the brackets is a set of allowed characters. So zero dash nine means a single digit. Zero, one, two, three, four, five, six, seven, eight, or nine, but that's really
one character. And then we have, so that's one character. And then when the plus applies to that, which means if we look at this whole thing, this whole thing says one or more digits. That's the code we write in a regular expression that says one or more digits. And we're just gonna use that in our regular expression by itself. So we're going to look for any string that's one or more digits and pull it out and give it back to me. So it's gonna look, so that's my little template, stamp, stamp, stamp, stamp, oop, got
it. Stamp, stamp, stamp, stamp, stamp, stamp, stamp, stamp, oop, got it, stamp, stamp, stamp, stamp, got it. So what we get back after we ask findall to find all of the one or more digit strings is two, 19, and 42. So it actually parsed it, it split it, it found all these things and said, I found them all for you and here they are, two, 19, and 42. So it's a list of three strings because that's how many it found. Now it might have found none and we would have got an empty list at
that point. But it found some, okay? So just as an example, we did this thing and we get two, 19, and 42. But if I said this, that basically is an uppercase vowel, A, E, I, O, or U. So that's one letter and that's one or more. So something like AA would match, EI would match, OO would match. But if you look now, it's saying, okay, I'm looking for one or more uppercase vowels, where A, E, I, O, U is the set of characters. And
so it says, look, do you find, oh, there's an uppercase, but it's an M, no, no, no, no uppercase, no uppercase, no uppercase, no uppercase, found nothing, did not find anything and so it gives us back an empty list. And so it's like, find all the things that match this and the answer is, none match, here's your list of nothing. Okay, and so you have to check, that's how you have to check even if you got something because it's not gonna return you false, it returns you a list with no items in it. Now, the
way it works, like I said, it sort of is taking this template and stamping it across the line, stamping across the characters. Now, there's a behavior that might not be intuitive to you at the very beginning, and that's the notion of what we call greedy matching. That is, when it can match more than one possible overlapping string, it chooses the longest of the overlapping strings. And the easiest way to show this is with an example, where we're saying, I want something that starts with an F, followed by one or more characters, and ends with a
colon. So that's my little stamp, that's my template. So starts with an F, good, that's good. One or more characters, da da da da da, have a colon. So that could be from colon, that would match. But look, I've got another colon here, and this is just continuing on with one or more characters and this. So the question is, do we get this or do we get this part, right? And the answer is, with greedy matching, is we get the larger of the two, okay? And so what you get back is somewhat counterintuitive. You get
the whole thing as the match, from colon, using the, colon. We could have got just from colon, but the reason it picks this is that this one's longer. So any time it has a choice, it picks the longer one, and that's what greedy means. It's probably better described as larger, or tending toward the longest string, or something like that. So you can, of course, suppress this behavior. Like everything in programming regular expressions, you simply add another character. And so now, it's going to say, I would like to start with the letter F, any character, one or more times,
and then this question mark, and the plus question mark together is still one little thing, non-greedy, okay? And so that just says, do it not greedy, which just means that it prefers the shorter of the strings. And so now, it could still match this string or this string, but because it's been told to not be greedy, it chooses this string instead, and that's the string that we get. And so that's the not greedy, and you just add the question mark after the plus or the asterisk. So it's usually an asterisk question mark or a plus question mark, and each of those is a
two-character thing. Asterisk question mark is zero or more characters, non-greedy, and plus question mark is one or more characters, non-greedy. Actually, most of the time, it seems to me that non-greedy would be the more reasonable default, but that's not how it is. Greedy is the default, and non-greedy is optional. Now, we can play some more with this stuff, okay? And so let's take a look at this little example where we have a non-blank character, backslash capital S, one or more of those non-blank characters, followed by an at sign, and then again, one or more non-blank characters. So this is looking
for strings that have an at sign with non-blank characters on both sides. This is an example of where it sort of comes to this at, and it goes this way, and it does it in a greedy manner. If you told it to not be greedy, it would give you this, these three characters, but we're telling it to go greedy, so it goes all the way to here, and stops at this blank, and then stops at this blank. And so that's a nice little thing. Find the at signs, go to the first blank, blank, and pull
that stuff out. And so that, with one little match, you've pulled this thing out. Now, of course, we've done that before with other techniques. So that's just another way to pull stuff out. Now, if we, we get this whole thing, but what if that's not exactly what we wanted? We can tell, we can give it a matching string that's different than the extracting string by adding parentheses, and so here's another example where we basically say, this is our string, we wanna match from at the beginning, followed by a space, followed by, ignore the parentheses for
the minute, one or more non-blank characters, followed by an at sign, followed by one or more non-blank characters. So this is also going to, if there's no from, it's not going to be looking for that, right? So it demands the from is here, so it matches that, and the space is demanded as well. And then it says, oh, non-blank characters, great. I got an at sign, great. Non-blank characters, oops, stop there. And so this is what's going to match. Now, the key is that we don't actually want that back in our extraction. What we really
want back in our extraction is this part right here. So what we do is we put parentheses in. Parentheses aren't matched as characters, they're code in the regular expression world. Parentheses say, start your extraction and end your extraction. When we did it without the parentheses in the previous example, the pattern didn't have the from in it, so we got the whole match back. But if you do it with the parentheses, you match the from but you only get this bit to come out as
well. So you can add this to make the matching part more precise without changing what you get returned, and you specify what you want returned with the parentheses. So next I want to show you a couple of different ways to use these newfound skills in some more practical applications of regular expressions. So let's go back to the way we first tore apart strings and look at the situation where, if you recall, we just wanted the host name, right?
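Here is a small sketch of that host-name extraction done three ways: find and slice, a double split, and a regular expression that uses the parentheses just described. The sample line is an assumed mail-style From line, and the variable names (atpos, sppos, host_slice and so on) are illustrative rather than taken from the lecture.

```python
import re

# Assumed sample line in the style of a mail "From " line (not read from a file).
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# 1. find and slice: locate the at sign, then the next space after it.
atpos = data.find("@")
sppos = data.find(" ", atpos)
host_slice = data[atpos + 1:sppos]   # after the @, up to but not including the space

# 2. double split: split into words, then split the address on the at sign.
words = data.split()
email = words[1]
pieces = email.split("@")
host_split = pieces[1]

# 3. regular expression: the line must start with "From ", and only the part
#    inside the parentheses (non-blank characters after the @) is returned.
found = re.findall(r"^From .*@([^ ]*)", data)
host_regex = found[0] if found else None   # findall gives an empty list on no match

print(atpos, sppos)
print(host_slice, host_split, host_regex)
```

All three approaches pull out the same host. The regular expression version also acts as the filter, since a line that does not start with From followed by a space gives back an empty list.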
This is an email address and we're interested in the host name. So we have this string and we go find the at, right? The find looks it up and tells us the at is at position 21. And then what we do is we say, okay, let's look beyond there to the space, and that tells us the space is in position 31. And then we're saying we can extract starting beyond the at sign up to but not including the space by saying atpos plus one, colon, space position. And when we do that, we still have to
have a thing that decides to only look at this on from lines, but then it can print out the host, which is the information we're extracting. So that was one way that we did that, right? The next way we did this was the double split pattern, right? So we said, okay, let's take this line, let's break it into words based on spaces. That's what words is. So that's zero, one, two, three, four, five, six. And then we know that the email address on lines that start with from space is the
second one. So we pull out the email address, which pulls this bit out into email. And then we're gonna split that again based on the at sign. So it splits right there and then this becomes the zero and one in pieces. And then pieces sub one is that host. And if we print that out, we get the host. So that's the double split pattern. The nice thing about that is you don't have to keep track of positions. The little plus one and the space
position in that previous one are kind of annoying, and that's just hard to remember. I've written this code way too many times in my career and I've made mistakes and I have to debug it every single time. And I print all these numbers out. I'm like, did I get it right? Oh, I did it in Python. I did it in Java. I did it in C. Wait a second, did I do it differently? And so this is a lot cleaner. I mean, I can write this every time and I know it's gonna work every time. I barely even need
to test this code because it's so obvious. So double split is another way of extracting stuff. But if we look at this thing with the regular expression, we can say, oh, okay, let's use a regular expression to do this. So we'll start looking through the string. We'll start by saying, hey, let's look until we find an at sign. Then let's start extracting with the parentheses. And then once we have found the at sign, let's look for non-blank characters. This is a set of characters. The caret as the first character inside the brackets means not, so this is not a blank. So that's
another way to do non-blank, a set of characters that is everything but blank. That's what this little bit is saying. Star means zero or more times, which means it's gonna run, run, run, run, run until it finds a blank, which is gonna stop it. The greediness is what keeps pushing it, right? This is a greedy match. That asterisk is greedy because there's no question mark after it. And so that starts right after the at sign with the open parenthesis, goes to the space, and that's the end parenthesis, and that's what prints out. Now,
Y is gonna be a list that's a one item list that has the string in it that we're looking for, but you can just go sub-zero to get that guy right out of there, okay? So that's sort of the regular expression version of it. But we can make this a more fine-tuned thing. So we can say, look, we also wanna pick the line and we wanna know if there are, if we don't get that line, we wanna skip it. If we do get the line, we wanna extract the data. And we can do this all
in a single regular expression. So again, we say start from the beginning of the line. And it's gotta be a from, followed by a space, and then followed by any number of characters, dot star, followed by an at sign. So this has to match, we see a space, then we're gonna have any number of characters, and then we're gonna see an at sign. And then we're going to start extracting, and then we're gonna go non-blank, non-blank, non-blank, non-blank, non-blank, oops, blank, end extracting, and out that comes. And this has the advantage over the previous one
in that that makes it much more precise. If we look at the previous one, while it works on good lines, it might actually trigger on lines that we actually don't wanna see. So this allows us to refine it so it only actually does this to lines that we care about. So it's sort of both an if statement and a splitting, extracting, going on all at the same time by having a bigger string that we're matching than we're extracting. So it's a way to kind of clean up your data. So here is a simple program that
we're going to just put all this together and actually accomplish something. And so we're gonna read through and look for lines in a file that have this form. And we're gonna extract this number, and then we're going to compute the maximum of this. So we're gonna extract this number and then convert it to a float and compute the maximum. So we're gonna open a file, we're gonna write a for loop, we're gonna strip. So we're gonna do this for every line in the file, but the first thing we wanna do is not get line, we
wanna discard all the lines except ones that have this. So our regular expression is look for lines that start with X-DSPAM-Confidence colon. So that's a pretty strong match. If that's not there, we're not gonna get anything. And then there's a space, and then start extracting, and then go as long as there are one or more digits and dots. The bracket set is a single character, and the plus makes it one or more, and then stop extracting. So that says start extracting, da da da da, greedy, greedy, greedy, greedy, stop extracting. And so that's what we're going
to get. Now, if the line doesn't have this, if it's missing in some way, whether it's this prefix or this number, the match fails. If the number's missing, it's gonna fail too. And we're going to get back an empty list. So the first thing you have to do is check to see if you actually got a match. So you say if the number of items in the list, len of stuff, is not equal to one, continue. And so this is how we skip all the lines that don't match. Skip, skip, skip, skip, skip, skip. So there could be
thousands of lines that don't match. But then, when this match hits, it's gonna come down and fall through, right? So most of the lines will skip back up, but when we actually get one, we know instantly that we've got one, and stuff sub zero, because that's what we extracted, is this number. And we can take the floating point of it. We append it to our list. We made a list to store them. That runs. The list grows. And then we just say, what was the largest one? And so you can run this and
see that. We have an escape character. And the whole idea is that sometimes all these little special characters that make a lot of sense to us, we actually want to search for it. So what if we want to search for a dollar sign? Well, we just prefix it with the backslash. And that just means this is a real dollar sign. So backslash dollar is a real dollar sign. So this says, I would like a dollar sign followed by one or more digits or dots. And so that's going to match a dollar sign followed by one
or more digits. Dots are okay. This is a set, remember. Zero dash nine or dot. That's the set of legit characters. Zero dash nine is a range of characters, a shortcut for making the set. You could write it out as zero, one, two, three, four, five, six, seven, eight, nine, dot. Or you can write zero dash nine and it fills in the rest. And that's one or more. So then it stops because this is a space. It's greedy matching. Then it pulls this out. So that's kind of why greedy has to be the default. Because otherwise, if it
wasn't doing greedy matching, it would stop here. Non-greedy would find a dollar sign and one character, and then it would give us dollar one rather than dollar 10. So, in summary, regular expressions are a cryptic but powerful language and they're an acquired taste. I bet eventually you'll find them fun, even though on your first impression you might not think that they're so fun.

Welcome to network programs.
This is chapter 12. Now we're going to learn a little bit about how we talk to resources on the network using Python. Now, this is a really quick introduction to how the network really works. I have a whole book that I wrote. It's also translated into Spanish on how the network works starting at the very lowest layer packets and everything right on up. And it's actually really easy to read. I wrote it for a high school audience. It's a short book and pretty easy to read. So if you read that book, you will understand that
there is this layered architecture, the TCP architecture that sort of runs our network at the lowest layer. On one side here, this is your computer and this is a server computer. And if you sort of want a webpage, it goes across the network, does this like 15 or 20 times, then it goes up into the server, reads the data and then the data comes back 15, 20 hops for the packets and then it's shown to you as what you see. And so, that's how it works. And there's these layers that we're not gonna talk about
in this section but I talk about in that book. There's the link layer, which is about how to get across one hop, and the internet layer, which is about how to string together, say, 15 or so hops to get packets back and forth. Those are the lower level bits. We're gonna start at what we call the transport layer, and that's the layer where your computer sort of assumes that it can make a phone call to another computer. A process running a program on this computer talks to a program on this computer and then
it kind of comes back, okay? And so, we're gonna leave this alone, we're gonna ignore it, we're gonna assume that there's this nice reliable pipe that's going from point A to point B and what are we gonna do with the pipe? But if you're interested, take a look at the book. So, we're just gonna start with a pipe, some kind of a connection, we have two processes, process, process and we have some kind of a connection between them and it is a connection that we can both use to talk and to listen. In nerd terms,
we call these things sockets. That is one process running on one computer and another process running on a second computer, connected through the internet somehow. One computer speaks into that socket and it comes out the other end, and the other computer returns something and it comes back. And so, this is a bi-directional flow of data, which is a series of, in effect, data phone calls between applications. So, on your side, the application might be your browser. Chrome, Firefox, Internet Explorer. On the other side, this is a web server. Might be IIS, Internet something something
from Microsoft or Apache or Java Tomcat. There's another program and you are making phone calls between these programs. Now, in general, these servers here stay up all the time and you sort of just can make a request when you feel like it in your program but that's what we're going to do and this is what we call a socket. So, that little connection, that phone call, that data phone call is what we call a socket. Now, you have to decide which of the systems you're gonna talk to and then which of the services on those
systems or which process. And so, we have this concept called port numbers and they're best thought of like extensions on phones. So, one organization has one phone number and it says, please enter the extension of the party you'd like to talk to. Well, that's kind of what ports are. They're like, here is, I'm a server and I'm connected to the internet. Please enter the extension of the process that you would like to talk to. And so, for example, there might be processes running on various computers and so the email is known to hang out on
port 25 or extension 25. Login, insecure login lives on port 23. Insecure web lives on 80 and secure web lives on 443 and there's a couple of different protocols. Say if you have your mail stored on Gmail and you have a local mail client, say like Thunderbird or Apple Mail, that talks a protocol to pull that mail across and those live on various ports. So, these ports are those extensions and by convention, we have standards that tell us what to roughly expect at those ports. So, when you're talking to port 80, you expect to talk
to a web server or an HTTP server. If you're talking on port 23, you expect to talk to a telnet server and on and on and on and on. And so, these are the extensions, the typical commonly used default extensions for various network application processes that are serving us data. Now, sometimes you'll go to a URL and you'll see in that URL, there's a colon and a number, that means it's a web server that's running on a port other than the official 80 or 443 port. Now, in Python, we can talk to these sockets, right?
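To make the phone extension idea a little more concrete, here is a minimal sketch that is not from the lecture itself. The table name WELL_KNOWN_PORTS and the helper port_for are made up for illustration, the port numbers are the ones mentioned above, and example.com:8080 is just an invented URL that carries an explicit colon and number.

```python
from urllib.parse import urlsplit

# Conventional "extensions" mentioned in the lecture.
# WELL_KNOWN_PORTS is a hypothetical name used only for this sketch.
WELL_KNOWN_PORTS = {
    "smtp": 25,    # email
    "telnet": 23,  # insecure login
    "http": 80,    # insecure web
    "https": 443,  # secure web
}

def port_for(url):
    """Return the port a client would likely connect to for this URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.port is not None:
        # A colon and a number in the URL overrides the default port.
        return parts.port
    return WELL_KNOWN_PORTS[parts.scheme]

print(port_for("http://data.pr4e.org/romeo.txt"))     # no explicit port in the URL
print(port_for("http://example.com:8080/page1.htm"))  # explicit, non-default port
print(port_for("https://example.com/"))               # secure web default
```

The point is only that the host picks the computer and the port picks the process on that computer. A real client would need to handle schemes that are not in the table.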
We can just talk to them and it's really easy, surprisingly easy. We have to import socket because that's a library. It comes with Python, but you can't use it in your program until you import it. And then you basically call the socket function in the socket library, that's what that syntax is saying. You're making a socket. Now, the key to a socket is that it's sort of like an unopened file handle. It's half of a file handle. It's an outward looking thing that's not yet connected. These parameters, you're just gonna type them in.
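As a rough sketch of the "make a socket" step just described, assuming the same variable name mysock that the lecture uses: creating the socket does not touch the network, so this runs even with no connection. The getpeername check is an addition for illustration, to show that nothing is on the other end yet.

```python
import socket

# AF_INET: a socket that goes across the internet (IPv4 addresses).
# SOCK_STREAM: a stream socket, characters that come one after another.
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

is_internet = mysock.family == socket.AF_INET
is_stream = mysock.type == socket.SOCK_STREAM
print(is_internet, is_stream)

# The socket is like an unopened file handle: nobody is on the far end yet,
# so asking for the peer's address raises an OSError.
try:
    mysock.getpeername()
    connected = True
except OSError:
    connected = False
print(connected)

mysock.close()
```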
This says we're gonna make a socket that goes across the internet and it's a stream socket, which means that it's a series of characters that come one after another rather than a series of blocks of text. There's another kind that's harder to deal with, but we're gonna use this one. So don't worry about this line. Just know that this creates a socket, but does not connect it to anything yet. We get back a socket object that I'm storing in the variable mysock. And then, on the very next line, when you wanna make a connection across the
internet to the far end, you say, oh, hey, dear socket, extend yourself across the internet. Make the phone call to this host, data.pr4e.org, on port 80. So that's making the phone call. The host is like the phone number and the port is like the phone extension. So we haven't sent any data yet. We have simply rung the phone of a process, hopefully living on port 80. If it's there, great. This might blow up. The line that creates the socket won't blow up, but the connect line could blow up. If there's nothing listening on that port, it would
come back and say, oh, you tried to call, you got no answer. That's a legitimate thing to happen. Maybe you don't have a network connection, or maybe that service is down on that server, or the whole server is down. But it's kind of amazing that we're sitting here in Python, and behind these three lines there are probably half a million engineers who built this thing called the internet, all these protocols and all this software. And we just made use of it in three lines of Python. And this is one of
the reasons that people absolutely love Python, absolutely love Python. So now that we have a socket, we have to ask ourselves what kind of data are we going to send and then what kind of data are we going to expect to receive across that socket? So now we have a socket. We are going to talk about what we're going to do with it. So the socket basically functions at this level. Your application is saying, make me a socket, which is sort of this end point. And then the connect actually connects to an application on the
far side. And there's a port involved, so that might be port 80. And this is the far host and that could be www.py4e.org or data.py4e.org. Okay, and so the socket is solving this. And the question then is what are we going to send and what are we going to expect to get back? And that's what we call the application protocol. So we know that these two have made a phone call. It's no different than making the phone call and saying, you know, hello, right? And everyone knows that when the phone rings and you pick it
up, you're supposed to say hello. And that's part of our protocol. So who talks first, right? So the dominant protocol, the one that we use in this section, is HTTP. HTTP is the Hypertext Transfer Protocol. It's dominant, it's really easy to use. That's why I use it as an example. But realize that there are many others, like mail and file transfer and remote login and all kinds of other protocols. Each is a different application protocol. They all use sort of sockets at their lower level. But then on top of that, they layer the rules of
the road for retrieving hypertext web pages. And we have used these for all kinds of other things. So the protocol, like I said, is like who answers the phone first? What do they say? What happens if the person doesn't answer right? Can you hear me now? Those kinds of things. And it's a real simple thing. And all you really need to do is so that both sides can agree, you have to write a thing that's like the rules in the middle and say, okay, everybody, as long as we all do this, we'll be fine. It's
as simple as picking which side of the road the cars drive on. It works fine no matter which side. But if each car randomly picked, it would be really kind of a mess. So if you look at the typical URL, this is one of the things that the web innovators in 1990 really invented that was wonderful. It seems second nature today, but in 1990, it was rather revolutionary that these uniform resource locators included in themselves a protocol, the host to connect to, and the document to retrieve. So this is
one of the clever, clever ideas that the web came up with, because we used to have to pick a program like FTP or Telnet or whatever, SMTP. Then we had to go to the right host, and then we had to talk to that host a certain way. So HTTP is a really simple protocol invented in 1989 and 1990 by Tim Berners-Lee and Robert Cailliau at CERN. And they created a protocol that we have grown to know and love and use for way more than retrieving documents, as we'll see in the upcoming chapters. So we're
gonna talk a little bit about what happens when you click a link on a page. Now, there's all kinds of fancy stuff that can go on, but these are the basics. And so let's just imagine for the moment you start out looking at a web page, drchuck.com slash page one, and inside that there is a hyperlink. It is an indication that says when you click on this link, go to a different page. And in that, you see the name of the page that you're supposed to go to. So we click on this link,
and that happens in a browser. This is an application, a process, or an app that's running on your computer. This is the browser, okay? And when the browser sees the click inside your computer, then the browser makes a connection to port 80 on the web server, drchuck.com, and sends the request. This request that it sends is precisely specified by a standard, which we will see in a second. Then the web server does some magic work in here, reads some files, runs some code,
does whatever, constructs an answer to our phone call, and sends it back. And in this case, it sends back a web page in the format of HTML, the Hypertext Markup Language, which is different than HTTP, which is the protocol that we're exchanging. HTML is the format of the document we're getting back. And this has an anchor tag with an href, an end of anchor tag, and some highlighted text. And now your browser gets this back, parses it, and then renders it according to the rules of HTML and CSS and JavaScript, et cetera, and then makes
a pretty web page. And this web page happens to have a link back to the first page, and if you click there, it will do this over and over and over again. And that is the request response cycle. And that's governed by a series of internet standards. These are standards that were built from the 60s, 70s, 80s, and 90s, and continue to this day, by a group called the Internet Engineering Task Force, or IETF. The documents they produce are called RFCs, which stands for Request for Comments. The RFC, the word RFC is kind of like
a sort of joke, as it were. They're trying to be kind of funny, though funny is not the right word. It's ironic in that they're trying to say that even for the protocols of the internet that we've used for several decades, they're always interested in improvements. And that's what the name RFC is getting at. And they're all named RFC-whatever. And if we were gonna cruise around, we could find some various RFCs. And this is RFC 2616. It might have been revised since then. But this is like a document, and this is what they look like. Hypertext
Transfer Protocol, HTTP/1.1. And so say you're reading this document because you're gonna write a browser, and you wanna talk the application protocol that is HTTP. This is one of many documents that help define what HTTP is. So you look down and look down and say, oh, here's what a request looks like. This is how I'm gonna get a document from the server. And you keep reading, and you keep reading, and it says you're supposed to have the request method with a space, with a URI
with a space, the HTTP version, and the carriage return line feed. That's what it's saying. And so it looks kind of like this, right? We say GET followed by a space, then the document. There's gotta be exactly one space. You do two spaces, and it's going to be quite frustrating, okay? And so this is an example that you can run on Linux operating systems and Macintosh operating systems with no changes. If you install Telnet on your Windows box, you should be able to run something like this as well. So Telnet is a program that we used in
the old days. It used to be how we logged into servers, but because it doesn't encrypt your data back and forth, we don't use it anymore, but it basically is a program that can open a socket to a host on a port. And I'm saying Telnet to this host on port 80. And at this point, I am connected, and whatever I type on my keyboard is gonna be sent to that server. Now if you're doing this, you probably wanna cut and paste this really fast, because if you take too long, most web servers will be
like, you're a human. I don't wanna talk to humans. I wanna talk to programs. So remember to type this fast enough, and then you have to hit Enter twice. So you have to have a blank line here. Just type this exactly as it's shown, and then you will get back the response. If you do it right, and the server is properly configured, the server will give you back some headers, and this is metadata about the document you're going to get. For example, it's saying it's got text slash HTML, which means that the remaining
stuff is gonna be in HTML, Hypertext Markup Language. It has a blank line, and then the actual document, and then the connection is closed. And so if you do this, you can set this up in a way that you can run this on your own computer, and in effect, hack through the back door a web server. Now you can't hack the secure web servers, and mail servers used to be easy to hack, but they're harder to hack now because they challenge you for information. But part of the reason I'm so obsessed with the command line
is this is how real hackers work, and they know how to talk some of these protocols more directly. And so we think of this beautiful sophisticated application talking to some other thing, and it's all pretty, and we got wonderful clicky buttons and nice usability, but the reality is, like in the Matrix Reloaded here, the kinds of things that really talented hackers are doing use command lines, and they really know what's going on, and that's how they do it. They understand what's going on better than the developers of the computers that are trying to be resistant
to the hacking. So I come from a long line of using the command line, and that's why I encourage you to use the command line in this course. So the next thing we're going to do is we're going to go up into the application layer, and instead of typing those commands by hand, we're going to actually send them from Python and write a very simple Python web browser. In this section, we're going to write a web browser using Python, so we've already got a socket. We know how to write a socket. In the previous section,
we played with the protocol, and used Telnet to do it by hand, and now we're going to do it in Python. And what you're going to find is it's not that hard. So here we go. So the first three lines of this program: import socket, make the socket, connect. Remember, when you make the socket, it hasn't really got the connection yet. Again, we're going to make a stream-based socket, and it's suitable for going across the internet. The connect, it's like ring, phone call, connect to data.pr4e.org on port 80, and so that basically says extend the
socket across and connect to a web server, and so there's got to be a piece of software running, and this will blow up if the software is not running, okay? So then, now we've got a phone, we've made a phone call. Now, whether or not the remote side says hello or not is up to the application protocol, and in this case, the web servers say nothing, and they wait for you to talk first, so we're the web browser in this case, and so we're going to talk first, and we know what, because we read the
documentation, we know that we're going to send GET, then the URL, blah, blah, blah, then a space, then HTTP/1.0, and then two new lines. Return, return, remember you had to have a blank line. We'll talk a little bit about this encode, it's preparing the data to go across the internet, and then we say send it, and so this basically takes that little string and sends it across the network, and then this piece of software is waiting for it, and then the software goes and reads a file or does some
other stuff, and then it starts sending us data back, which we can then choose to receive. So now we write a real simple loop. We're going to receive the first, we're going to receive these things 512 characters at a time, so we're going to loop through 512 each time, and if we get zero characters, that means it's end of the stream, the stream is closed, and if you look at the little example from the previous one, you saw a connection closed. When the connection is closed, we get an indication that it is because we ask
for some data and we get zero data. Otherwise, if there might be more data, this'll wait. If the network is slow, you'll see, if you do a print statement in here, you will see that this will pause from time to time on a really slow network. If your network is fast, it'll just go blank and it'll be so fast it won't matter. But this is how we go. So this is basically until the entire socket, until the socket is closed, we are going to read this data, and because this data's coming from the outside world,
we have to decode it before we print it, and then when we're all done, we break out of here and we close the socket. So literally, that is an entire web browser written in 10 lines of Python, and again, this is why everybody loves Python. So this is what this program will show if you run. The get is sent, it looks exactly like doing it by hand. You get some headers, again, this is metadata that tells you something about the file. In this case, one of the important things is what kind of thing is coming
next. There's always a blank line, there's a break between the headers and the actual data, the metadata and the data, and then here is the actual text of that romeo.txt file, and then it's gonna run this, gonna print data.decode, all this is coming from the print statement. If you were gonna parse this, you have to know that you're gonna read the headers up to a little blank line. The blank line is your indication as a software developer that the headers have stopped and the actual text begins, and you know the syntax. This actually could be
a JPEG or PNG or some kind of image, right? And this data would here look like, blah, blah, blah. So if you type this and you change that code to actually go retrieve a JPEG URL, gibberish will come out, okay? And so that's exactly what you will see, and so now you have built a very simple web browser. Next, I wanna talk a little bit about what happens when characters transition outside your computer, I mean from inside the computer in strings, out across these sockets to servers and then back. Hello, everybody, and welcome to some
worked sample code. If you are interested in the source code, you go to materials and download this sample code.zip. I have this downloaded. It'll be in a folder called code3 on my computer. This is where I'm at, I'm in the code3 folder, and this has a ton of bits of code here. So if I do an ls, you'll see I got all these files here, and so we'll just leave those there. And so the one I wanna work through right now is this socket1.py code. And basically what we're doing here is we're
simulating what is gonna happen in a web browser. And the cool thing about the HTTP protocol is that we can do this by hand, and I'm actually gonna hack this HTTP protocol. This is gonna go to data.pr4e.org and retrieve a document. And so I'm gonna do telnet (now you can do this on a Mac and Linux, and if you put telnet on a Windows box, you can do it there too) to data.pr4e.org, and I wanna talk to port 80. Port 80 is a different port than Telnet normally uses, it's a non-standard port for Telnet, but what we're doing
here is talking to the HTTP port. And so I'm going to be able to send commands by hand to the web server and retrieve a document. So I've already copied this string, this GET for romeo.txt over HTTP, I'm copying that into my buffer because if I wait too long, this won't work. So here I go, and now I'm gonna type that, and I have to hit enter twice, and that literally was the HTTP protocol. What I typed there was the HTTP protocol, and the web server responds with some metadata about the document, how much data
there is, what kind of data it is. A blank line separates the header information from the body of the document. If I was to go to this in a browser, right there, and if I turned on the developer console and went to the network tab (let's make this a little bit bigger), you would see that it retrieves this file romeo.txt, and it shows us the headers, and it shows us the response. And so these are all different ways of doing the same thing, and that is how
to do the HTTP protocol, okay? But now we're gonna do this in Python, and so here's the code we're gonna write. So we're gonna import the socket library, and we're gonna make a socket. Now this doesn't actually make a connection, think of a socket as a file handle that doesn't have any data associated with it yet. And then what we're going to do is we're going to reach out and connect that socket to a destination across the internet with the domain name of data.pr4e.org, and the second parameter in this tuple, this is a function call
with a single tuple as a parameter, and so tuple sub zero is data.pr4e.org, and tuple sub one is the 80, which says I wanna talk to port 80. That could fail, it will make the connection, and if the port 80 is there, away it goes. And then we're gonna actually send the HTTP command, so get, this is the HTTP rules, followed by an end of line, followed by a blank line. So you saw me do this, this was what I typed here, and then I had to type a blank line. Now if you wanna go
read the RFCs for how to do this, you can figure this out. So the only other thing that's kinda weird here is we have to add this .encode(), and that's because strings inside of Python are in Unicode, and we have to send them out as what's called UTF-8, and encode converts from the internal Unicode to UTF-8. So this command is a set of UTF-8 bytes that we're then going to send. It still has that same set of characters in it, and now we're gonna send it. And that's, after we've made
the connection, we're gonna send these two things, and then we're going to wait. And mysock is like a file handle at that point, because it's been opened and we've sent data. The HTTP protocol told us what we had to send and the fact that we had to send it first. So now I have just a simple while loop, and I'm going to ask for up to 512 characters and get that back. I will know that this is the end of file if I got no data back, so if
the length of the data, the byte array that I got back, is less than one, then I'm gonna quit. Otherwise, I'm gonna print the data, and I'm gonna use this decode, which is kinda the opposite of this encode. What I'm getting is UTF-8 encoded data, most likely, and decode basically converts it to the internal Unicode format that Python uses inside. So this is gonna run a bunch of times, pulling in the blocks, up to 512 characters at a time, printing it out, and then when it's all said and done, we will close
that connection. And so, it's not too exciting. python3 socket1.py. And you'll see that Python is now gonna do what I did by hand. Now, of course, the interesting thing is these are all in strings, right? And so, you know, this way we could write code that does stuff with this. But all we're really trying to do in this particular situation is show how you open a socket, send a command, and then retrieve the data. Okay, so now it's time to teach you a bit of complexity about text processing. Up till
now, we've kind of been ignoring the complexity of text processing. Everything that I have been doing, most of what I've been doing is in ASCII, the Latin character set, the character set that, you know, United States, Europe, lots of Western civilizations use this character set. And if you go back to the 1950s and 1960s, they, we were happy to have one computer and we didn't care what the character set was as long as what you typed on the keyboard came out on the printer, the internal representation didn't matter. And as the 70s and 80s came
along, certainly the 70s, we needed some interoperability. And so they standardized a character set, but they standardized a character set, certainly in the West, that did not represent everything. And so if you look at this sheet, basically what it's telling you is for the various characters, there's some non-printing characters, white space, non-printing characters, and then here's some printing characters like the ampersand, the zero, and then the uppercase characters, and then the lowercase characters. And there's 128 of these possible values. And there is nothing even for Spanish or French in here. And it's also why, by
the way, uppercase letters in Latin sort lower than lowercase letters, and we saw that in some of the string stuff. And what this sheet does is map each character to a number. So a lowercase a maps to the integer 97, which in base 16 is 61, and in octal it's 141. And in binary, these are eight bit numbers, otherwise known as a byte. And they're very efficient. Like when you buy a disk drive, it's megabytes or gigabytes or whatever, that's how many of these kind of characters it can store. But
unfortunately, this doesn't work for more complex characters. You can figure out these numbers inside of Python by using the ord function. And so you say, what is the ordinal or the numeric representation of the uppercase h, lowercase e, and newline is a character as well. And so like 10 is the ordinal position of newline. And this actually has to do with sorting so that lowercase e is higher than uppercase h. And that's just because in the simplest of sorts, we just sort them numerically. So newline, if you go back to the previous little sheet, newline
is this 10 right here, which is a line feed and that's a 10. And that's why when we ask for the ord of newline, we get a 10. And so again, in the early days when strings were simple, we just represented them as one byte per character. But the problem is that as we have gotten more complex, in today's modern world, it's simply unacceptable to say that the only thing computers can understand is ASCII. And so this leads us from the simplest of character sets to a super complex character set called
Unicode, which basically is billions of characters, potentially billions of characters, for every language and every character set. And because there's so much space in Unicode, it's easy to take very small variations of characters and give them a space. It's so large that you can have pretty much any character that you want. So that's Unicode. The problem is that if we sent Unicode across the network, it would be way too large. It'd be this UTF-32, which instead of being eight bits per character would be four bytes per character. And so it would take
all of the data that we build and make it four times larger and it'd be very difficult. And so what they've come up with is ways to compress this. And UTF-16 is this weird thing. UTF-32 is really sort of the full Unicode pretty much. UTF-16 is a subset of Unicode. It's used in some countries. But the best practice for moving data across the internet or in a file that you're gonna move between computers is what's called UTF-8. And so what happens is that UTF-32 is fixed length. ASCII is one byte. UTF-16 is two bytes, UTF-32
is four bytes. And UTF-8 has dynamic length, meaning that it is one to four bytes. And if it's only one byte long, it's perfectly compatible with ASCII, meaning that an ASCII file is also UTF-8. And so here's this little sheet. It's not critical that you understand this graph too much, but basically as time passed, from 2000 the internet's coming, coming, coming, coming, and by 2014, pretty much overwhelmingly the documents on the internet that you might retrieve are UTF-8. So UTF-8 is the recommended practice, and it's sort of a compression: UTF-8 can represent all the things UTF-32
can represent. It's just a compression of it, with an overlap with ASCII, which is awesome. It's what you want. I don't even talk anymore. So in Python, we have always had sort of two ways of representing strings. In Python 2, the normal string was a byte string, was an ASCII string, was a Latin string. And if you wanted to represent Unicode, there was a separate kind of object that we had. And so you would do that. And in Python 3.0 or later, one of the main features of Python 3 was to make Unicode
and string the same. So that means inside of Python, when you have a string variable, it's a Unicode. Whereas inside of Python 2, it was a byte variable. And so now we have this notion, separately in Python 2 and on Python 3, where we have byte variables. And so byte variables are, in effect, an array of bytes. So if there's ABC, that means it's three bytes, it's three bytes long. Whereas a string might be, a three-character string might be anywhere from three to 12 bytes long. So Python 2 had bytes and strings that were the
same. So bytes and strings are the same, and Unicode is weird. And in Python 3, strings and Unicode are the same, and bytes are weird. Okay, and so that's what we've got to deal with. And there'll be times when we get bytes from APIs, when we call things, we have to then figure out what kind of thing those bytes contain. Because the bytes might contain ASCII, they might contain UTF-8, they might contain various things. And so internally, all the strings in Python 3 are Unicode. Most of the time, if you're inside the program, or reading
and writing files, it just works. And that's why we haven't mentioned it. But now that we're talking over sockets, and we're talking to the sort of random world out there, we have to be a little more aware of the data we're dealing with. Now, the good news is 98% of the time, or 95% of the time, it's UTF-8, which might also include ASCII, and so it's quite nice. But we have to be aware of this. And so if we are going to take data that comes off of the network as bytes, then we have
to make sure that we interpret it, or decode it in the right way, so that internally the strings, which are Unicode, are properly represented. And so that's why when we read data in from a network connection like a socket, we have to say, hey, decode it. Now, there's a couple things going on at that moment of decode. And so this is where we're doing it. We see this, we have to manage this in this code, where we, before we send this stuff, we're gonna encode it, which takes a Unicode string and turns it into UTF-8
bytes. There's actually a parameter here that you could do it different than UTF-8, but no one ever does. You might have to for certain situations, but so that says that we're gonna encode this into UTF-8 before we send it, and then when we get something back, before we print it, we're gonna decode it. And that's how this ends up working out. And if you look at the documentation, you will see that sometimes it says it's a string, or it's bytes. And so you take a byte array, and you decode it to get a string, and
you take a string and encode it to get a byte array. And so that's what we're doing. So you can think of the process as this way, and that is the network has these UTF-8, mostly UTF-8 resources, not ASCII. If it's ASCII, it's okay. So you read with the receive. So this receive here pulls data, well, we have a Unicode string, let's start with the send. So up here, we have a Unicode string, that's a Unicode string, even though there's no special characters in it, no Asian characters or French characters, that's a Unicode string. And
before we can send it, we have to convert it to UTF-8. If that had Asian characters, it'd be okay, and that would be set up just right, so that the UTF-8 would be right. So we encode it first, and that's cmd. This is now bytes, okay? cmd is bytes, and then we actually send the bytes. And that goes across the network. We get back our thing, and we receive into data. Well, data is bytes, not string. It's bytes. We can ask how big it is. It functions kinda like a string, and it
has len, except that it is one byte per character, which means some of it might be UTF-8. And then all we have to do is say decode. Again, you could, if you were dealing with a situation where you weren't expecting it to typically be UTF-8 or ASCII, you could tell it UTF-16 or something, and it's more complex, but the simple thing is to just say, I'm gonna clean up my data on the way in, I'm gonna clean it up by running it through decode, and I'm gonna encode stuff on the way out. And so sockets
are the place where this comes into play. And so you'll see, we'll always do this encode and decode every time we're sending data out of Python or bringing it into Python. So now that we've talked a little bit about character sets, we're going to make this even easier so you don't have to use sockets. urllib is a bit of Python code in the library that does all the socket stuff for you. Okay, so now we're going to write a web browser again in Python, but it's going to be even shorter than what
we did before. We did it in 10 lines using sockets. Now we're gonna do it in four lines with urllib. So urllib really exists just because the idea of opening a connection, sending a GET request, sending the new line, retrieving the stuff, breaking the headers out, doing all this stuff, is so common, so why not put it in a library to save ourselves some effort. So here's how we do it. We're gonna import this library. We had to import socket before, but we're gonna import
urllib now. And so this is really quite simple. It's like elegantly simple. You say urllib, that's the library, then request, that's a module within the library, and then urlopen, and this is a function. So let's call urlopen and then give it the URL. Now that's a string which it's gonna encode automatically for us, so it's taking care of all kinds of pretty things for us. It does the GET, it does the encode. Look back at that previous code. That's kind of what urllib is doing for us, okay? Now what urllib also does is
it makes the connection, encodes the GET request, and then it actually retrieves, at this moment, all the headers and keeps them for you for later. You can get the headers, but we're not gonna see the headers. And it returns to you an object that looks pretty much like a file handle, because you can put this in the for clause after the in. Now it's going to run that loop one time for every line of this file. And so the lines we're gonna get back are bytes and so we have to say decode.
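The pattern just described (loop over the handle, decode each line, strip it) can be sketched roughly as below. This is only a sketch under assumptions: the function names decoded_lines and fetch_lines are made up for illustration, and the two lines of sample text are a stand-in so the example runs without a network connection. Only urllib.request.urlopen, decode, and strip come from the lecture.

```python
import io
import urllib.request


def decoded_lines(handle):
    """Return the lines of a bytes-yielding, file-like handle as stripped strings."""
    result = []
    for line in handle:               # each line arrives as bytes
        text = line.decode()          # bytes -> str (UTF-8 by default)
        result.append(text.strip())   # drop the trailing newline
    return result


def fetch_lines(url):
    """Open a URL with urllib and return its decoded lines (needs network access)."""
    with urllib.request.urlopen(url) as fhand:
        return decoded_lines(fhand)


# Offline stand-in for the handle that urlopen returns (hypothetical sample text).
sample = io.BytesIO("But soft what light\nthrough yonder window breaks\n".encode())
lines = decoded_lines(sample)
for text in lines:
    print(text)
```

With a network connection, fetch_lines could be handed the address of any plain text file; the loop body stays the same as for a local file, apart from the decode step.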
It doesn't do that for us automatically. We are gonna have to decode them and that's because we might need to decode them with a particular character set here. And then we're gonna do our strip and we're gonna just print this out. So that's just, that's like open a file, read through it and print it. This is open a URL, read through and print it. And that's as simple as it is. And so that's what happens. This is Romeo.txt and it prints out. Now the thing to notice is that there are no headers here. The headers
have been sort of consumed in the URL open. Again, there is a way to say, hey, give me my headers. But for now, this is just gonna eat the headers and keep them and then you get to read all the data and the loop runs and this loop runs four times and I'll count the four lines. You can go ahead and run this one. It's super easy. I mean literally super easy. And if you, you can do anything you want. I mean treat it like a file. You just have to remember to do the decode
bit when you treat it like a file. And so in this code, we import it. We're gonna open it. We're going to make a dictionary. We're gonna loop through. We're gonna split it. We have to add the decode because that line is bytes, not a string. And then we're gonna go through our words. We're gonna go through the lines, and for each line we're gonna bounce through the words. The inner for loop is bouncing through the words and then we're gonna go to the next line and then we make ourselves a dictionary
and we print that dictionary out. Now this is, this in effect, other than, you know, importing this, opening it differently and doing the decode, this is exactly how we would process a file. And so by using URL lib, you really sort of reduce the complexity of retrieving and reading network resources to the same complexity of reading and dealing with a file locally on your hard drive, which is kind of pretty. So one of the things then we can do is read web pages. That was a text file but you can get HTML and so here's
how you read a web page. And it's the same kind of code. We open a, we open a URL. This one happens to have HTML in it and we read through it and out comes the HTML. Remember that the headers are there but they've been eaten by URL open for us. And now we could write a browser that would parse these less thans and greater thans and make links, et cetera, et cetera, et cetera. So if you can come up with ways to find these links, you could actually write a bit of code that would
then have a loop that would go up and open a new one. Pull out the links, open a new one. Pull out the links, open a new one. And so you could make a program that would retrieve a page, find the links in the page and then retrieve those links. And we'll actually do that before the end of the class. And so Python is a very popular language at Google, and I think it's a pretty safe bet that the first crawler that they wrote
to crawl the web to build the index was Python because literally that's all it takes to read web pages and pull those web pages into your web crawler database. So I don't know. Are those the first four lines ever written at Google? Who knows? So the next thing that we'll talk about is how you handle that HTML. HTML is kind of yucky and nasty and so it's not as simple as regular expressions. Regular expressions might help. String parsing and split might help but it's just too crazy. So we'll talk a little bit about how to
use a library to make HTML parsing a lot easier. We are going to be talking about some code. If you wanna download all the code, it's right here. It's all in a single big zip file. And of all the sample code, the one I'm gonna talk about is urllib1.py. It is not very exciting. It's short. That's what's kinda nice about Python code. And really, if we go and take a look at the code we played with just previously, which is socket, the idea here is urllib is something that Python has produced for us to make socket
communications and HTTP communications a lot better. So this is making socket calls underneath, but there's a library that makes this quite simple. And so we have to do some imports. So instead of importing socket, we'll import these. We are going to create a handle. We call urllib.request.urlopen and just pass in a string. So we're not encoding this. We're not sending the GET command. All the stuff we did in the previous sockets example is gone and then we can just put this in a for loop. And so we're not using this lower
level read and write code. We're just using a for loop. And so that literally is gonna read the text line by line. And the line does come back as an array of bytes so we have to do a decode, but then we've got a string and then we can do a strip on it. So this is like super simple. So there we go. Now the interesting thing is you also don't see the headers. We just read the contents. Now it turns out in urllib, and we'll see this in later more complex
application, you can get the headers if you want. You can get various other things. So that's urllib, a simple urllib tool. Now we can also use this in urlwords.py to show you something quite interesting. And that is if you look at this from right here, other than the decode, this is exactly the code we wrote to compute the words, right? So other than this line.decode, this is just: open something up. In this case, we're gonna open a URL. We're gonna create a dictionary. We're gonna loop through each of the lines in
that thing. We're gonna decode them and then split them. So once you do line.decode, this is now a legitimate internal Python string. We split it, we run through the words, and run the counts. And so this is exactly like code that we did before to run counts. And so python3 urlwords.py. And so that gives us a dictionary which is the word frequency. And we could do all kinds of crazy stuff in here with sorting and all the kinds of things. The important thing is once you've done this and this, the only thing the code needs
to do is decode these lines when you first get them. Other than that, urllib makes URLs function inside Python very much like files. So these are short and to the point and very simple and I hope that they were useful to you. So now we're going to talk about what you would do with a web page once you've retrieved it in a Python program. We call this web scraping. And so web scraping or web spidering is the act of retrieving a web page, extracting the links from that web page, making a queue of unretrieved
links, and then moving on. And eventually the idea is if you had enough time, energy, bandwidth, and storage, you could find your way to most of the web pages on the internet that point to or are pointed to by other web pages. And so you might have all kinds of reasons to scrape data. You might have a blog that you posted. You might have, who knows, maybe you put some data in a system, and maybe the system's being shut down because it's being retired. You can do all kinds of things. You could write a little
thing. I was just talking to somebody who wrote a thing to retrieve something and check it, and then send a text when something changed. All kinds of stuff. Or you might make yourself a search engine. But be careful. Not all websites are happy about you using a robot to retrieve their content. Some of the websites, as we'll see, demand that you log in and they track what you do, and if they think you're doing something bad, they will shut your account off. Other websites will track what you're doing without you logging in, but then shut your address off.
And so you have to be careful. You should read up. You should figure out what sites allow you to scrape them. Now I have some sites that I've set up that you can play with to make it so that it's legit. So parsing HTML is difficult. Some of the simple examples, you could probably write a regular expression, or certainly some splitting and some whatever. And what you would find is you would write that code. And you would retrieve your first five webpages and it would seem to work and then it would encounter some really weird
but legitimate HTML, or maybe even sort of slightly broken HTML. So the web is full of broken HTML, and your browsers just look at it and go like, oh wow, more broken HTML. But they don't put up error messages, and so people just leave broken pages up. But your Python program is gonna see those broken pages. So what you would do is you'd be like, oh, here's a new weird way to do an anchor tag. I'll change my code. And then run for another 100 pages, I'm like, oh no, here's a new weird way to
do an anchor tag. And the problem is that you're gonna find a lot of different ways to mess up an anchor tag. And someone's already done that. There's a piece of software called BeautifulSoup. And we have installation instructions for it. And really what it is is somebody just spent months figuring out all the nasty things that could happen and compensated for them and gave you a nice wrapped interface that just says, look, you give me the HTML and I'll give you back the tags, okay? And so it's called BeautifulSoup. And so you
have to install this. There's a couple of ways that you can install this. If you're good at extending your Python, you can just extend and install BeautifulSoup for all Python programs. If you can't change your computer's configuration because you're on a school computer or you're using a USB stick or something, then there's a way to download this file that I've created called bs4.zip. And so what you do is you end up with your file called urllinks.py and then a little folder called bs4, which is a folder that has a bunch of files in it from
the zip file, and then you can run it. And so it'll pull it in and you'll import from bs4, BeautifulSoup, and that's either gonna pull it in from the folder, or if you have installed it using the Python installer, it will pull it in from there and you don't have to put this folder in. So it's up to you. You can do it one of two ways. So this is a little bit of code. Now BeautifulSoup is a complex library, and so even though this looks easy, to do other things in BeautifulSoup you might have to actually read
a bit more to figure it out. But we're going to just read this. We're going to import BeautifulSoup. We're gonna ask for a URL right here. We're going to take that URL. We're gonna open it. We call urlopen, give it the URL, and read the whole thing. That means we're not writing a loop. We've read the whole thing. That's okay as long as you know that the file's not too large. And then we're going to pass along the data we got back. And this is gonna be bytes, but BeautifulSoup knows all about bytes and all about
UTF-8, and it figures that out. And you just say, hey, take that stuff I just got and tear it apart using HTML. And give me back an object, a soup object. Now the soup object is something that you can run queries against. So it parses it. It deals with all the imperfections and inconsistencies in this HTML byte array. And it fixes that and gives that back. And so there's various things you can do. And you gotta go look at the BeautifulSoup documentation. It could be a whole class on BeautifulSoup. So here's a thing you can
do is this object, you can sort of call it like a function and say, hey, give me back the anchor tags. And anchor tags, of course, are the a tags: a href equals blah blah blah, through slash a. So all of this is an anchor tag. And then we're gonna loop through the tags because there could be more than one of those anchor tags in the file. And then we're going to pull out that href. And that's what this does. We're gonna loop through all the tags and print out the href. So if you tell it to
go to dr-chuck.com, it will tell you the one external link in dr-chuck.com. And so I've got an assignment that sort of goes into that in some more detail. But this chapter has been a whole bunch of interesting stuff. We started with the TCP/IP model and talked about sockets that are phone calls between computers. And then how application protocols are developed to say what we say on those phone calls. And we've explored then the HTTP protocol, which is probably the most likely thing you're going to see. And then we played with all this in Python and
saw that Python is really good at this. You can write extremely simple and small programs to do some extremely complex and powerful things. And again, that's why people like Python: it makes the complex simple. We're gonna do a little bit of sample code. If you're interested in getting the sample code, you can download this zip here at Pythonforeverybody.com, materials.php. And you will download it and you will get all the files, all the files that I'm looking at here. And so the one I'm gonna play with today is the file called urllinks.py. So
the first thing you gotta do before urllinks.py works is you have got to install BeautifulSoup. And I've got some simple instructions at the beginning of the file. And so one way to do it is to use the Python install process to install BeautifulSoup for all Python applications. And if you are the owner of your computer and you're gonna use BeautifulSoup a lot, it's a fine idea to do that. But I wanna show you a simpler way: if you don't own your own computer and you just wanna make it so
that BeautifulSoup works, you can download this file, this file right here, bs4.zip, unzip it and put it in the same folder as here. And so if you look in this folder, I have a subfolder called bs4. And that's the unzipped version of this. And it has these things. I didn't write this code, so I'm sorry if the name is bad, but this is the code to bs4. And this is what's in bs4.zip. And it's in the same folder as urllinks.py. And so what happens is when you do this from bs4 import
BeautifulSoup, that either can go to sort of this global magic place where Python installs stuff and pull in the BeautifulSoup object, or it can go to the folder bs4 and pull it in, okay? And so that's how that works. So you have to do one of these two things. I prefer to keep it simple: download and unzip this file and put it in the same folder as this code and away you go. So from the previous example, we're gonna use urllib, of course, and then we're going to pull in BeautifulSoup.
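As a rough sketch of the idea behind urllinks.py as described earlier (hand the HTML to BeautifulSoup, ask for the anchor tags, pull out each href), something like the following should work once BeautifulSoup is available either way. The HTML here is a made-up page with placeholder example.com addresses so that the sketch runs offline; in the real program the bytes would come from urlopen and read. Passing "html.parser" picks Python's built-in HTML parser.

```python
from bs4 import BeautifulSoup  # the installed package, or the local bs4 folder

# Made-up page standing in for the bytes that urlopen(url).read() would return.
html = b"""<html><body>
<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://example.com/page2.htm">Second Page</a>
or read <a href="http://example.com/about.htm">about us</a>.</p>
</body></html>"""

# Parse the bytes into a soup object using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object with a tag name returns a list of the matching tags.
tags = soup("a")

# Pull the href attribute out of each anchor tag (None if it is missing).
links = [tag.get("href", None) for tag in tags]
for link in links:
    print(link)
```

Each item in tags is a whole tag object, not just text, which is why the href has to be pulled out with get.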
From the BeautifulSoup 4 library, we're gonna get the BeautifulSoup object. Now, if you do this with SSL, if these websites we're gonna play with have SSL, you pretty much have to do this little hack. And these three lines, don't worry too much about them. You can Google it or look on Stack Overflow and figure this out. But this is the way that you ignore errors when you have SSL certificate errors. And so we have to add this parameter context equals ctx, which is this variable that we create. So this part and
this part, sort of just do them. If you don't need HTTPS, you can take them out, actually. Otherwise, you won't be able to do HTTPS sites. So let's take a look at what we're doing other than dealing with the HTTPS problem. We're gonna ask the user for a URL. We are going to retrieve all the HTML. We're gonna do a urlopen, just like we did before. Now, this would return us something we could loop through line by line with a for loop. But instead, we're gonna say, hey, read the whole thing. And that basically returns us the
entire document at that webpage in a single big string with new lines at the end of each line. And this is not a Unicode string; it's bytes, probably UTF-8. But it turns out BeautifulSoup knows how to deal with UTF-8, and it also knows how to deal with Unicode strings. So what we're saying is, BeautifulSoup, read through this and deal with all the nasty bits, right? So HTML is like very, very flexible. So take dr-chuck.com/page1.htm. And so if we take a look at the source of this, view page source, make this bigger, you might
be able to do regular expressions, but it does things like break stuff across lines. There could be a line break here. There could be all kinds of things, right? And so writing regular expressions or splits or whatever is really hard for HTML. And so what we do is use something someone has already written. It's called BeautifulSoup. And it's basically, this is the code, and it's based on a joke from a children's story. Basically, someone has gone through and figured out all the bad things that could possibly happen when you're reading and parsing HTML. So either you
use it or you will slowly but surely discover all the things that don't work. And so when we look at this line right here, this line at a high level is saying, we're giving you ugly, nasty HTML that could make no sense whatsoever. Please read it and use all the brains that you have to figure out all the weird stuff for us and give us back an object. I happen to call it soup. You don't have to call it soup. It's an object that is a proxy for that HTML, but this soup object
is clean. And so what we can do is we can sort of retrieve all the anchor tags. So we can talk to this object and say, ask it, give me the anchor tags. What's an anchor tag? Well, if we take a look at this source, the anchor tag is the A through the slash A. That is the tag. It is the tag, it is attributes that are on the tag, it is the text within the tag, and everything. And so that's what we're gonna get. Now, I call it tags plural, not because plural matters at
all, but because we're gonna get a list of tags. Because even though this web page has lots and lots of tags, if we look at, say, drchuck.com, and view source, whoa, that's kinda small. View page source, right? And we go look for a anchor tags. We got 45 of them, and they all kinda have weird stuff in them, right? So this line will give us back a list of tags. It will give us all the tags in this document. So it goes, the tag goes from there to there. And then what we're gonna do is
we're gonna write a loop to loop through all the tags. So that's basically hopping through the document, sort of like this, that's what it's doing. Hop, hop, hop, hop, hop, hop. And it's pulling out the text of the href attributes. So it's gonna pull out this bit right here. Oh, whoops, oh darn, that was so cool. Cause that's a flaw, look at that. This is my own page. There is no closing quote here, but it's gonna work because BeautifulSoup is like, oh, I know what to do about that. I can
deal with that. So let's check to see if that one works, cause that's like a mistake. But that's one of the things we like about BeautifulSoup. So we're gonna read through, and then we're gonna pull out all the hrefs. So, this is probably thousands of lines of code that you really don't want to write. So python3 urllinks.py. And so let's start with a simple one. http://www.dr-chuck.com. And it reads it. Oh, no, that's actually the hard one, cause we got a whole bunch. So let's see if the Tsugi one, see the
Tsugi one worked. It found that one. It's right after sakaiproject.org. Where is that? Is there another Tsugi? Oh, no, it didn't find that one. That's kind of funky. Look, it found it wrong, but that's okay. So you see it found all these and did a lot of nice stuff for us. If we do python3 urllinks.py and do the easy one, http://www.dr-chuck.com/page1.htm, we will only see one. And there we go. Now, the SSL part matters if you are looking at a page that has SSL. So python3
urllinks.py again, and I'll go to, like, https://www.si.umich.edu, and that will get a bunch of links. And so you'll see all kinds of stuff coming back. And if it wasn't for this bit right here and this bit right here, this HTTPS wouldn't have worked. And it's not that that website had a bad URL. It has a certificate that's not in Python's official list. And so the URL is okay. So that gives you
a quick summary of using the BeautifulSoup library in Python along with urllib. Hello and welcome to chapter 13, web services. So what we've been doing so far is we've been using the request response cycle. We've learned about sockets. We've learned about urllib. And we've actually learned how to pull HTML and even flat text off the internet. But what we're gonna talk about now is using that same request response cycle to retrieve information that is specifically designed for programmatic consumption. Before, we had to have this BeautifulSoup which sort of did
a hack job or solved the hard problem of parsing HTML. Well, why not produce data in a format that makes good sense to a program because programs wanna talk to each other. If you recall, the whole idea of a socket is to have one application process sending data to another application process. And so if we think about this for a moment and we realize that we have all these programs, they could be written in different programming languages and they're all connected. And so they might wanna send data back and forth or through the network. PHP
programs, JavaScript programs, Java programs. And so we have to decide on a protocol that is independent of any programming language. And then we call that the wire protocol because if you were to sort of take some connection and watch the exact characters that go back and forth, that's what you would see if you were monitoring the wire. So that's why we call that the wire protocol. And so the idea is is that we have to agree on a format that is going to represent the data and we can't make it a Python specific format or
a Java format. And when we take the data from the internal representation, maybe a Python dictionary, to send it to the wire, we call that act serialization. And that is going from sort of the internal representation to the serial representation or the wire representation. And then here is an example of a person with a name and phone number with using less thans and greater thans. This is an XML example. And then in the far end, in a different programming language, it receives this and then deserializes it and then turns it into some useful structure inside
that programming language. And so this is an example of a wire protocol that's using XML. And that's one of the formats we're going to talk about. Another format that we're gonna talk about is a format called JSON, JavaScript Object Notation. And it is simpler and easier, but it's not as precise and descriptive as XML is. And so while you'll find that most of the things you run into, especially if you're talking to APIs of one form or another, you'll find that JSON is very common. XML still holds sway in places like documents. So if you
look at the docx at the end of a Microsoft Word file name, docx means that it's an XML version of the representation of a word processing document. So the first thing we'll talk about is XML. So one of the two ways that we mark up data is XML. The other is JSON; first we'll talk about XML. We'll talk about XML for a longer time than we talk about JSON. XML stands for Extensible Markup Language. There were a number of markup languages in the 90s that were out there, ways to send data between computers. And none of
them was like amazingly better than the others, but in the early 1990s, as HTML came out, the idea took hold that we could use less thans and greater thans, or angle brackets, as some people call them. Once HTML made angle brackets popular as a representation format, it was pretty natural that we would find a data representation format that would take a similar approach. And so inside XML, we're gonna talk about tags, we're gonna talk about attributes, we're gonna talk about data, and we've already talked about serialization and deserialization. Serialization is the act of taking data
inside of a computer in one programming language, setting it up for transport, transporting it across, and then taking it back apart and turning it back into the data in, whatever internal data it needs to be in the destination system. So here's some basic XML, so we can take a look at the various things that make up the XML. So it's very much like HTML in that we have tags, less than, greater than. The difference is we get to name the tags anything we want rather than the A tag or the P tag or the H1
tag. And there is a beginning tag and an ending tag, and they're bracketed together. And there's syntax errors in XML. Syntax errors in XML are more severe than syntax errors in HTML. It's supposed to be right. And if you send that XML, it's likely that the far end will not understand it. So we have a beginning tag and ending tag, and so like name and slash name are a beginning and ending pair. Then there is the actual textual content, and that is the material between it. And then here's a phone and slash phone, and we
have this thing called the attribute. Key equals value. The key doesn't have double quotes. The value always has double quotes. And this is like href equals on an anchor tag. And sometimes you have what's called a self-closing tag where you don't actually have a closing tag. You have all the data that you need in the attributes, and so you don't even bother putting in an empty text area and a closing tag. So that is a start tag, an end tag, an attribute, and then a self-closing tag. Those are some basics of XML. In general, XML doesn't
care too much about white space. It does in the text areas, so in here it matters, and in here it matters, but things like we can indent this a little bit differently, and we tend to indent it in a way to make it look reasonable. Although once you have programs sending it back and forth, they tend to send it more compacted just for efficiency purposes. So one of the concepts is that there is a hierarchical structure within an XML document, and there are parent nodes and child nodes, and you can think of these as simple
nodes, each of which is a tag and some data, or complex elements, each of which has a tag that includes other tags, some child tags. And there's a couple of different ways we can take a look at this. The simple and more natural way to think about this is a tree with parent-child relationships. So here we have this A tag on the outside, and that's the top level one. You can only have one outer tag; you can't have another tag down here, so you have to have one tag that's sort of the root tag
for everything in this XML document, and it has two children, so the C tag and the B tag are two children, so the B tag is a child of A, and then C has a D and an E tag that are children there, and then the textual data we model as a child of each of those tags, and you'll see in a bit why it's best to do that. So that is the way to think about this as a tree, to represent that XML as a tree. If we add attributes to it, and this is
where you kind of see why it's nice to take the text area and make that be a child of the node, an attribute is a different kind of child. So the text is a special kind of child, and you can literally have more than one attribute. You could have X equals two, you know, zap equals whatever, and these could have a couple of different attributes. The W attribute has a value of five, and that's the five down there, and so you could have multiple ones. You can only have one text node. Now, in the case of A, you
have a whole bunch of text nodes, but these are because there are child nodes. Within one simple node, you can only have one text element. You can also think of XML as paths, and the easiest way is to sort of look down this tree version and look at the path from the parent. So you go to A, then the child B, and then X. So at position AB, you find X. So AB is the path down from the root to X; ACD, that's this one, is the path to Y, and ACE is the path to
Z, and so you can think of these as paths. Part of what we're doing is we're coming up with ways to walk through and parse trees of XML data. So the next thing we'll talk about is how we determine if a particular XML document is legal or meets the contracts that two applications have set up. We're going to do a little bit of code. If you want to get your hands on the code, go to the materials website, materials.php, and download the sample code. The code that we're going to work on today is
the XML code, and we need to be able to talk XML to work with web services. So here's one of the examples from the book, it's xml1.py. And so later we'll be pulling XML and JSON from the web, but for now we're just going to put it in a triple-quoted string, data, and we're going to use a built-in XML parser in Python called ElementTree, and when we say import xml.etree.ElementTree, this as ET gives us basically a shortcut handle for it. And so the idea is, this is a string, it has less
thans and greater thans, it looks like structured information, and it is, but really at this point it's only a string. Now we have to call this ET.fromstring to read this and give us back a tree object. And this code might blow up right here if there was a mistake in the XML. Matter of fact, I can probably put a mistake in. Let's see if I can delete this and save it and run this code, and we'll see that it will blow up. Right, and so it blew
up, here in line eight. ElementTree blew up, I mean it blew up in line 12 of the code, which is right here. This failed because line eight of the XML string was wrong, so let's put that back in. So now it's properly formed XML. So this tree we get back, I name it tree just because I always name it tree, but you could name it X. So the key is tree.find goes and looks for a tag by name, and tree no longer has less thans and greater thans in it; it went
and turned these into objects within objects within objects. So tree.find('name') says I would like to find the tag name, and that's what this bit is right here, and then .text is going within that and grabbing that text, okay? And if we say tree.find('email'), then that's going to give us this, and that's that object, and then .get asks for the contents of the hide attribute, which is the string yes, okay? And so if we run this, now that it's fixed, python3 xml1.py, it will pull in and get the name and
the attributes. So it pulled the chuck out, and so you get this object and then you kind of dive into that object. And so that's XML1.py. If you've got a tag, you can either get the text out of the tag, or you can get an attribute out of the tag. So now let's take a look at XML2.py. Again, we import element tree, and we have a tag, and XML's always got to have a single outer tag. But this time we're going to have, in effect, a list. Now, let's line this up a little better. There
we go, that looks a little prettier. And so users, the fact that it's users doesn't mean anything, but we often come up with semantically meaningful names for these things. Users is going to have, as children, a list of user tags. Okay, so the children are user, user under users, and each of these is a tag. So we want to parse this, and this is a common thing we want to do. And so, again, the first thing we do is read the string; it's a triple-quoted string going
from here to here. And then, instead of doing find, which gives us one tag, we're going to do findall on the user tag that is a child of users. And we get back a Python list of the tags, not of the text, but of the tags. So there's one tag, and there is another tag. And so we can do len of that, so we can see that we got two. And then we can write a for loop, and this item is going to iterate through the tags that are,
the user tags that are children of users. So the first time item is going to be this tag, a tag, remember, and then the second time it's going to be this tag. And so we can do things like find and get, just like we did in xml1. So running this is not too exciting, python3 xml2.py. You see that there are two users; that comes from this print right here. And the first one, if we go into name, and we go find the text within the name tag,
within user, then we get Chuck, and then we get the ID, which is 001, so we find the id within that item, and then we get the text. And then we grab the x attribute off of that. And so we see Chuck, 001, and 2, and then in the next tag, the for loop continues, and we print that out, okay? And so that's just a basic run-through of the XML from the chapter in the Python book, okay? Thanks. So now we're going to talk a little bit about XML schema. XML
schema is a language that allows you to decide whether or not a particular XML document meets a contract or arrangement. So you have two pieces of software exchanging data using XML. If they're all working, nobody really worries too much about it, but if all of a sudden you change one side and the other one breaks, whose fault was it, right? Was it the side that got changed or the other side? And so you could argue. So what you like to do is, before you set up these arrangements
between these applications, set up a contract. In a way they're kind of like the RFCs are, except that their scope is between pairs of applications. And so the schema itself is XML, and basically what we do is take an XML document and an XML schema contract, and then we either say that's good or that is bad, and that's called validation. A piece of software that validates XML when given a schema is called a validator. And so here we have our little XML document. We're passing it to the validator. And then
we have a schema contract, which is itself XML. It's a particular kind of XML. That xs:complexType, that's just a tag; colon is a legitimate character for the name of a tag. Name equals person, that's just an attribute. And so XML schema is a particular format of XML that renders an opinion about what XML is supposed to look like. So there's a number of different XML schema languages. The one we're going to look at is one that came a little bit later and is very common, called XSD, which is
the World Wide Web Consortium's schema specification. Often you'll find files with a suffix of .xsd that contain XML just like we're going to show you. So if you recall, there are simple elements, which have text children, and then there are complex elements, where other nodes are children of other nodes. And so here we have a little bit of XML, and the XML schema that goes with it. So what we're saying is the outer tag of this legitimate XML is supposed to be a complex tag with a name
of person. And so there we go, that looks good, good, good. Then there is a sequence, and then there is a simple element with a name of last name, looks good. And it's a string, that looks good. Another tag that's named age, that's of type integer, that's good. And then a thing that's called date born, and it looks like a date. So we check all these things, and we can basically say, yup, that is a good XML document according to this schema. And you don't generally have to write this yourself, but there is software that
reads these two things and comes back with a true or a false, and might even have some detail as to what went wrong with this particular schema. Here's some more that you can do with a schema. We can have a complex type, we have a sequence. Here we have a string, full name, and a string, child name. But we have this minOccurs and maxOccurs. So minOccurs is the minimum number of times it can occur, and maxOccurs is the maximum. So minOccurs equals one, maxOccurs equals one means it's
required. And so this is required, and we don't have two of them. Two of them would be an error. One of them is fine, so that's good. Here the child name is minOccurs zero, maxOccurs ten. So we have four here, and so that's good too. And so that is another kind of XML schema constraint that you can have. Here's a few other data types that we can do. We've done the string, we've done the date. The date looks like this. Dates are four-digit year, two-digit month, two-digit day with dashes. Now
there's lots of different ways to represent dates, but the nice thing about this one, and you have to put the zeros in, so zero nine for September, is that these are sortable as strings. So if you do all your dates this way, they sort correctly as strings. So you could argue about what is prettier, but for computers we don't worry about that. We're arguing about what's the most functional. And then the dateTime is that same date format with zeros, followed by the letter T, and then followed by hours, minutes, seconds, zero filled, right? So nine
o'clock is zero nine, and then the time zone, which we'll talk about in a second on the next slide. You can have decimal numbers and you can have integer numbers as well. And so we are able to render an opinion as to what is good and what is bad in the resulting XML. So dates are kind of interesting. Again, we have lots of different formats of dates, you know, nine slash ten slash 2002, right? That's one format of date. There's another format of the date, which is, you know,
12 December, whatever. And so this is how people show dates. Computers don't want to have all those different dates and don't want to figure those out. They have libraries that produce dates and make them look pretty for particular locales. But computers really want dates that work best for them. So we just say, okay, we're going to have year, month, day, then the time as zero-filled hours, minutes, seconds, and then time zone. Now computers even prefer a particular time zone. I don't know if you've used something like your Google calendar and you take
a flight or take a train trip and you have a different time zone, everything switches. And that's because Google Calendar is not storing the dates in your current time zone; it's storing them in what we call universal time or Greenwich Mean Time. Zulu time is another word for that. And Z means this is the time in, you know, London, England, Greenwich Mean Time. And so that means if this data moves between time zones, or crosses the international date line, or standard
time changes to daylight saving time or anything like that, none of that changes. And so we have this internal date and time, which is very common in situations where computers are exchanging data, that then gets shown converted to the time zone or the local format of the user. That's the right way to do that. And there's a standard for how dates and times are supposed to look. So here's another little example of some stuff. Let's see what we got. Now, if you see this little <?xml, that's not a problem. That just is a way of
sort of putting a header on the whole document that says it's an XML document, telling it that it's a UTF-8 document. And that's not really a tag. That's sort of like a marker on the file; you can put it there and it doesn't harm the XML. The outer tag is this tag right here, xs:schema. And then what else we got? We got an address. We got a string, string, string, string, string. We've seen all those. Here we have country, and we're going to have a restriction that basically says this is a
simple string, but we're going to make it so that you have to list one of these four as the country code. And so here we are down here, and that's UK and that's UK, and so that is valid XML. Another couple of examples here. Let's see, string, string, string, string, string. maxOccurs unbounded, that means there's no limit on the number. You can do that. minOccurs of zero. xs:positiveInteger: we've seen integer, but you can also say it's got to be a positive integer. Decimal, we've seen that. And then use equals required
is just another statement that you can make. I'm not trying to get you to the point where you can write XML schema, just to give you a sense of the kinds of statements that we can make when we're talking about what is and is not legitimate XML. So let's talk a little bit about how we might talk XML inside Python. Like most things that are in this extended part of Python, we have to import something. And so this is the name of a library, xml.etree.ElementTree, and then as ET; this ends
up being a shortcut, so we don't have to type these long things. ET is the same as typing that. It's almost like a macro. Now normally this XML is going to come from somewhere on the network, but I'm just going to put this in a string. I'm using a triple-quoted string, and so that means that this string starts here and ends here, and all these newlines that are here are actually part of the string. So this is kind of like I opened a file and read the whole thing in. But
just to keep this totally self-contained, I'm putting it in a string. Normally the XML would come from some server on the other side of the network. Okay? So this is the XML right there. And we parse the string of data by calling ET.fromstring. So we're passing in the less thans, the newlines, the greater thans, all of this stuff. And this could have syntax errors in it. So this might blow up if this had a syntax error, like
we forgot the little slash or something. But this doesn't have a syntax error. So then what we do is we get back an object. I just happen to call it tree because it kind of is the tree version of the XML. That is an object that we can then query to pull data out of it. So we say tree.find and look for a tag named name. So that finds the tag named name, which is this. It's everything. It's the tag and the text. If we want the text, we add dot
text. And that dot text refines it to only the word Chuck. And similarly, if we do tree.find('email'), that finds the email tag, which is this tag. It has an attribute, and you can get any of the attributes. You say .get. There's only one text child, but there can be many attributes, and so you have to tell it which one you want. And so this bit right here, all of that will resolve down to that string yes. That's what you're going to get there. Yes.
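The single-document pattern just described can be sketched as below. This is a minimal sketch and not necessarily the book's exact file: the sample document is reconstructed from the description (a name tag containing Chuck and an email tag with a hide attribute of yes), and the deliberately broken string at the end is an assumption added to show what "blowing up" looks like.

```python
import xml.etree.ElementTree as ET

# Sample XML in a triple-quoted string; normally this would come from the network.
# The document is reconstructed from the lecture's description (an assumption).
data = '''
<person>
  <name>Chuck</name>
  <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)               # parse the string into a tree of objects
name = tree.find('name').text            # the text child of the <name> tag
hide = tree.find('email').get('hide')    # the value of the hide attribute on <email>
print('Name:', name)                     # Name: Chuck
print('Attr:', hide)                     # Attr: yes

# Malformed XML (the <name> tag is never closed) raises ET.ParseError.
try:
    ET.fromstring('<person><name>Chuck</person>')
    parse_failed = False
except ET.ParseError as err:
    parse_failed = True
    print('Parse failed:', err)
```

The names tree, name, and hide are just variables; as the lecture says, tree could as easily be called x.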
And so you kind of build up these little finds and method calls. This is clearly not a full introduction to ElementTree. But you get the idea that you sort of dive down in with these method calls to get little pieces out and parse all of that. Here is a different example. In this one, again, we're using a triple-quoted string. We always have a single tag on the outside. And then I have a complex element of users. And in it, there are two user objects. So this is kind of like
a list. So this is more than one of these things. This user can occur more than one time. And again, we take this, we pass that into fromstring and get back an object. The name stuff does not have to be the same as this outer tag. It's just a variable. This could just as easily be x if I wanted. So now what I'm going to say is, hey, stuff, I want to find the path users slash user. I want to find all tags that match users/user. So
that's going to give me a list of two tags, one tag, two tags in a list. Now I can print out how many I get. That'll be two in this case because I got two tags. And I can actually iterate through the list. So this item is going to iterate first to this tag and then that tag. Now, like in the previous example, we can look for the name tag within there and pull the text out. So we pull that text out, find
the name tag, and then within that find the text. And we can find the id tag and pull the text of that out. So that pulls out this 001, and I've scribbled too much. And then we can say item, which is that whole tag, .get('x'). So that gets the attribute; that gets the 2, and that 2 comes down here. And then item goes to the next one, because item is looping through, so item iterates down to that one and pulls out the name dot text, the ID dot
text, and the x attribute, and pulls all those pieces out. So this is the basic pattern. You saw one where you're tearing into a single thing, and here you're tearing into something that is expected to occur more than one time. So that's a quick summary of how you talk to XML in Python. Up next we're going to talk about the other serialization format, JavaScript Object Notation. So now we're going to talk about the other serialization format, JavaScript Object Notation. Chances are good as you go out there, you will very likely encounter more JSON than
you will XML. Not that XML is bad. XML is better for rich and hierarchical documents, whereas JSON is best for just pulling data out of a system and moving it between two systems with the minimum of fuss. This is Douglas Crockford. I have a great interview with him. He's a funny guy, very, very smart. He claims he didn't invent JSON, he discovered it, because it really is based on the literal notation for JavaScript. And it actually looks a lot like the Python literal notation for dictionaries and for lists. Now Douglas Crockford has quite a sense
of humor. He wrote this book called JavaScript: The Good Parts, that's the little one right there, and next to it is the big comprehensive guide to JavaScript, and the joke is about all the stuff that's in JavaScript that's not too useful. And while this is sort of tongue in cheek, what Crockford is really saying here is that JavaScript is a great language as long as you avoid the tricky bits and keep it very, very simple. And JavaScript is indeed a great language. But JSON comes from JavaScript. You can read
about JSON at JSON.org. JSON did not start out as an international standard. It did not start out like an RFC, although it has since been standardized. Really, Douglas Crockford decided to register JSON.org and typed in some pages, and people started reading it and people started using it. And partly that was because it was truly derived from the JavaScript literal syntax. So we're all ready to code. Here is some Python that's going to process some JSON. Keep it straight: Python processing JSON. So again, I'm using the triple-quoted string here. Now you'll notice the syntax that we are using is not angle brackets, but instead curly braces.
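A minimal sketch of the kind of program on the slide is shown here, not necessarily the exact code. The lecture describes one string value (the name) and two nested objects (one of them the email with a hide key); the phone object and its contents are placeholders assumed for illustration.

```python
import json

# JSON text in a triple-quoted string; normally it would arrive over the network.
# "phone" and its contents are illustrative placeholders (an assumption).
data = '''
{
  "name": "Chuck",
  "phone": {"type": "intl", "number": "555-0100"},
  "email": {"hide": "yes"}
}'''

info = json.loads(data)                 # "load from string": returns a Python dict here
print('Name:', info["name"])            # Name: Chuck
print('Hide:', info["email"]["hide"])   # Hide: yes
```

Because the outermost JSON value is an object in curly braces, json.loads hands back an ordinary dictionary, so plain indexing is all that is needed to reach nested values.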
And so there's the curly brace, and then within the curly braces you have key-value pairs, name colon Chuck, key colon value, and both sides have quotes. You can also have objects within objects: curly brace, key-value pairs, key value, key value. Looks a lot like Python. And so this is a structure that has one key-value pair that's a string, another key-value pair that's an object, another key-value pair that's an object, and then these are key values within those contained objects. So this is a string
that again probably was retrieved across the network from some other place. And we're going to pass that string into the JSON library function called loads; loads stands for load from string. So it reads this, parses it, looks at all the white space. White space doesn't matter too much here unless it's in between double quotes. And so it parses it and then returns us a dictionary. So the thing that's different about JSON is that its structure and representation are simpler than XML. So in Python, everything either comes back as a dictionary
or a list, or a dictionary within a dictionary or a list within a dictionary, but it's all dictionaries and lists. It's not a separate structure that you have to do gets and finds and findalls and lookups on. It's right there. So when we get this back, because this is a curly brace, info is a dictionary. And so we can just use the standard syntax of Python, info["name"]. So info["name"] will go find Chuck. So if you compare that with the XML, that's just a lot easier. Now, when
we have info["email"], that's this thing. And then ["hide"] is this. So that's what comes out here. So it's really nested dictionaries and lists. We haven't seen a list yet, but this is a set of nested dictionaries that it parses. And it's equally simple in other programming languages. This is a little more complex version where the outer element is a square bracket, which means it's going to be a list. And so we have a list of one, comma, two things. So this is a list of two
dictionaries. So there's two dictionaries inside that list. So again, we take this string and use the JSON parser to read the string and give us back a structure. In this case, info is a list. It's got two items. If we print out the len of info, it'll give us two. And we're going to iterate through. And so item is going to first be this, and then it's going to iterate to this. And it's going to print out item["name"], which is going to print out Chuck, item["id"], which
is going to print out 001. Now you'll notice that there are no attributes. And that's because JSON is simpler. But we can have the x just as another key. So we say item["x"], and that's going to print the 2 out. And then it'll iterate to the next one, and it'll print out the same things for that one. And so JSON is simpler because you can't represent as complex a data structure, or you have to compromise and map it into a simpler data structure. But then it is lists and dictionaries. And so
once you've got it parsed, it is easier to understand and to make use of. So that was quick. So that's partly why everyone likes JSON better: once you have come up with the format that you're going to send back and forth, it's easy to make it, and it's easy to read it. Now what we're going to talk about is sort of moving up a level. If you've got all these data formats and URLs that you can hit to pull those data formats down, what approach do you take as you start to construct applications
that increasingly go from a single application to a networked application? We're playing with the web services chapter right now. And if you want to get the materials for this course, you can go here and download the sample code, samplecode.zip. I've got this all sitting already on my computer. I also have the whole thing in GitHub if you want to get it out of GitHub. So the thing we're talking about now is the json1.py example from the book. And so JSON is kind of like XML except a lot simpler. And that's why
a lot of people like it. It's not that JSON is always better, but JSON is better in a lot of situations that don't require the complexity of XML. So we start with import json. JSON is built into Python, but we have to ask to import it. Again, we're using a triple-quoted string to put the JSON in there. And JSON looks a lot like Python dictionaries, key-value pairs. In this case, this is a key, and the value itself is another dictionary, or in JSON terms, an object. But again, key-value pairs within key-value pairs within
key-value pairs. And all these little curly-brace guys have to line up properly. And so, like always, this is a string, which we normally would read and decode from the Internet. But for now, we're just going to have it in there. json.loads says go into the JSON library, pull out load-from-string, and parse this, which turns this set of curly braces, spaces, commas, and perhaps syntax errors into a structured object. And if we'd made a syntax error in here, then this would blow up. But if this
doesn't have a syntax error, if this doesn't blow up, then we have a structured representation. Now, the difference between XML and JSON in Python is that this turns into a Python dictionary with key-value pairs. And so, once we have this, this is a dictionary. And we can say info["name"], and that's the exact syntax that we would use to read from a dictionary. And that's going to extract this value out of there. And if we want to go in deeper, we can say info["email"], and that's what info["email"] is right there, and then
["hide"]. So that's a dictionary within a dictionary. So if we run this, python3 json1.py, it digs in really fast. And so this is why people tend to like JSON: you read the JSON, which is actually a syntax derived from JavaScript, but it looks just like the syntax for a Python dictionary. So a JSON object turns directly into a Python dictionary with nested dictionaries. Now we're going to look at json2.py. And so in json2, we're going to see a list, or an array in JSON terms,
but it turns into a list in Python terms. So this is a list of dictionaries. In JavaScript, that would be an array of objects, but in Python, it's a list of dictionaries. Again, we load the string, parsing and looking for syntax errors. So let's just make a syntax error here and run python3 json2.py, and you'll see where it blows up. It blows up at line 15, which is right here; this loads call blows up. Now you could put a try/except around it to
save it, but we're not going to do that. And it even complains. It says, look, we were expecting something here in line 11. And that's line 11 of the JSON, which starts at line 4. And so I'll put my little square bracket back in so it's not syntactically broken. So let's run it again and make sure that it runs, and yes, it does. So this parses it and converts from the JSON syntax into a Python list in this case, because it's got square brackets instead of curly braces. The previous example had curly braces. And we can
then take a len of it, since it's a list, and we see that there are two things in there. And then we're going to iterate through, and this item is going to iterate through these dictionaries, that dictionary followed by that dictionary. So the first time it's item["name"], which is this value right here, and then item["id"], which is this value. So you can dig right into this, and you're not using get and you're not using the weird extra find or findall or anything. You just are going
at these structures directly. And so you can quickly extract this stuff out, and we read through: name is Chuck. There are no attributes, by the way. x is 2, and so we had to make x a key. If you look at the XML, we had this concept of attributes on the outer tag. These things are also not named. We just have to know what we're looking for. JSON represents simple structures, but it's much simpler to use. So I hope this has been useful to you, and I'll talk to you in a bit
about some more JSON. So the service-oriented approach is a way we approach solving a complex application problem where all the data really isn't present in one computer system. It's somehow spread out, connected via the internet or an internal network. And so the idea is that some applications just can't contain everything. The perfect example is a travel website that can book you a flight, book you a car, buy tickets, book you a hotel, and do all these things. Well, that travel website is neither a hotel nor a rental car company nor an airline, but
what it really does is it talks to all these services somewhere else on the web on your behalf, and it makes reservations for you. And so you have this convenient user interface that says, oh, here's your whole vacation. I'm going to figure all this stuff out. Now you say go, and it goes book, book, book, book, and books on all these other systems. Now it requires a lot of infrastructure, a lot of coordination, and a lot of effort to make sure that your application can talk to them, and that these other services that are out there on the
internet have good contracts, and you know exactly how to send data to them and get data back from them. And so initially, when you're building a service-oriented architecture, often you have one application, and it's all internal, often it's all one language, and then maybe you'll say, oh, wait a sec. We want to take part of what we do and put it in a second system, and then come up with a set of rules between the systems, and then more and more and more. So now that we're solving our problem using a series
of cooperating applications communicating across the network, we're going to talk in a little more detail about the notion of what we call web services. And in this, we're going to take a different perspective. Instead of building our application and breaking it into pieces, we're going to have an application that's going to consume an API from somebody else. So there is some other provider of this API that's not us. And so if you're going to talk to somebody's data, like Google or Amazon or Twitter, they're going to say, you have to use our API. So
what's that? So an API is a contract that says, look, if you do this, and this and this and this, we're going to give you data this way. And they set the rules. They tell you what the URLs are. They'll tell you if it's XML or JSON. And this is called the Application Program Interface. And it's something you read and you understand. And so you go look at the documentation. This is the documentation for the Google Maps API. So it turns out that Google knows a lot about maps. It knows a lot of data. It
knows how to search maps. And it actually provides some of those features to you that your application can take advantage of. I took advantage of this at one point by asking all the students in one section of one of my online courses where they were from. And I just let them type in where it was. And then I said, well, I don't know how to code any of that. So I used this API doing what's called geocoding to look all those places up and get precise latitudes and longitudes for the ones Google could figure out.
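A geocoding lookup of this kind can be sketched without touching the network. This is only a sketch under stated assumptions: the endpoint URL is a hypothetical placeholder, the address is an arbitrary example, and the canned response uses made-up values in the shape this lecture describes (a status, a results list, and geometry, location, lat inside the first result).

```python
import json
import urllib.parse

# Hypothetical endpoint (an assumption), used only to show how the URL is built.
serviceurl = 'http://example.com/geocode/json?'

address = 'Ann Arbor, MI'   # arbitrary example of what a student might type in
url = serviceurl + urllib.parse.urlencode({'address': address})
print('Would retrieve', url)   # spaces become +, the comma becomes %2C

# Canned stand-in for the JSON that would come back; all values are made up.
sample = '''
{
  "status": "OK",
  "results": [
    {
      "formatted_address": "Ann Arbor, MI, USA",
      "geometry": {"location": {"lat": 42.28, "lng": -83.74}}
    }
  ]
}'''

js = json.loads(sample)
first = js["results"][0]                  # results is a list; take the first match
lat = first["geometry"]["location"]["lat"]
lng = first["geometry"]["location"]["lng"]
print('lat', lat, 'lng', lng)
print(first["formatted_address"])
```

In a real program the sample string would be replaced by the decoded body of the HTTP response, and the status would be checked before digging into the results.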
Having Google do that lookup saved me a lot of work. Now, these are expensive resources, but I could be patient and make use of these resources, which, as long as you don't use them too much, can be free. We'll talk a little bit more about rate limiting and what's free and what's not in a bit. But you start by reading documentation. It says, do this, hit this URL, hit that URL. So if you read that documentation, you will find that there is a URL that you can hit. And they tell you where to go. And then you
go to this URL. You add a question mark. And then you say address equals, and then Ann, plus, Arbor. And there's all these rules. These are called URL encoding rules. When you have key-value pairs on URLs, the plus means space and %2C means comma. So these are called URL encoded. But don't worry too much about that, because we're going to have a magic library, like we always do in Python, that takes care of this. And so if you were to hit this URL, and you type it in the exact right way in your browser,
you will get back a JSON document. It's an object that has key-value pairs. One key is the status, then it has these results, and that's a list. And you dive down and eventually you can find the latitude and longitude of the thing that you are looking for. And so the idea is, can we write a program that can read this? And so here's our little program that reads this. And a lot of this is sort of comfortable. You've already seen some of this. You import urllib. We have to parse some
JSON. We grab the URL. And then we're going to write a little while loop that's going to ask for a location. And we can type that location in. And we've got to concatenate the address equals onto this URL. And there is a bit of library code, urllib.parse.urlencode, that takes the key and the value. So address equals, and then whatever this text is that we read in from the user, goes in here. And it does that URL encoding with the pluses and the %2C. And all that
stuff is taken care of. And that is the URL that we're going to pass to urlopen. So we print out that we're going to retrieve it. It prints this out. And if you look at this, it's long; it has all that fancy stuff on it. And then we open it with urlopen. And then we read it and decode it. So these two things: hit this URL, decode it. And then we retrieved 1669 characters, because in this case we've decoded it. It was read as bytes, and
data is a string now. So we read that many characters, 1669 characters. And then we're going to take this data and parse it with JSON. And we might get bad data here. It might blow up, but it might work. We have an error check that basically says, if we got a bad thing, we're going to bail out. But in this case, it doesn't blow up. And so now we're going to sort of dig through. And if you go back, let me just go
back. So the results sub-zero geometry. Let's show you how that works. So results is the first key. So this is a dictionary with a key of results. But then it has a list. And the zero item, this list starts here and goes there. And I'm only going to show part of it, but there's many things here. So the zero item is this. This is the sub-zero. And then geometry within that sub-zero item. So if we look at that, it is the outer dictionary, the first item in the list, sub-geometry. So that grabs one part. That
grabs this part right here. And then we're going to go into location and lat. And those are just keys within keys, a dictionary within a dictionary. And so you see it says sub-location, sub-lat. And so that is literally going to pull out of that complex structure. That will pull the latitude out. And then in the next line, pull the longitude out. So we can pull the latitude and longitude out. And then we print it out. And we can go into results sub-zero formatted address. And that goes into results zero formatted address. And that pulls this
little bit out. Now it takes a little while to write this stuff. And you have to put in a lot of debugging. And you don't necessarily figure out this complex bit here at the end right away. You print it. You don't get what you want. You say, oh, wait a sec. That was an array. So I've got to add a little sub-zero there to get the first one out of the array. But eventually you figure it out. And it's not all that difficult. The first few times you do it, I'm like, what am I
doing? But after a while, you realize, oh, I'm just sort of tearing this apart and digging deeper and deeper into this data structure, which I just retrieved over the internet from Google. And I learned something good from that. So up next, we're going to talk about how sometimes these APIs protect themselves with keys or signatures and why that happens and how to solve those problems. We are doing some code samples here. If you want to follow along, you can download the sample code. It's all in a big zip file. I've got it. We are going
to be working with the Google Maps API. In the old days, this Maps API was free and allowed 2,500 requests per day. But now they've made it so that parts of it are behind API keys, and you start having to use OAuth and stuff. But they haven't put it all behind keys; this one address service that we've been using continues to work. And the basic idea of an API is you go read the documentation. You find a URL. And this is going to Google servers. And you pass in the address. And we have to
pass in the address using what's called URL encoding. So spaces are pluses. That's a comma. And then that's a space. And so we have to pass this in a certain way. But if we do it right, we hit this. We're going to get ourselves some JSON back. And that's really cool. And so deep inside here, we get the real address, a good address. We get a geometry. We have the location. We got the latitude and longitude. And we can extract stuff out of here. And so we're talking. And this one here is still rate limited
to 2,500. But it's one of the few parts of the Google Maps API that is not hidden behind an API key. In a later chapter, we'll show you how to actually talk with the API key in the geodata code. The geoload.py file shows you how to use an API key if you want to jump ahead and take a look at that. But for now, we're just going to take a look at geojson.py, which is going to retrieve one page and tear it apart. So let's take a look. So we're going to grab the urllib stuff and
import json. So now we're going to use JSON. But we're going to actually pull the data off of the internet. And so I just take that service URL for the Google Maps API. I found that somewhere in the documentation. And then I'm going to have a loop that's going to run forever. I'm going to ask for the location. And then if I hit enter, that's what this is saying, get out of the loop. And then what I'm going to do is I'm going to concatenate the service URL, which is this. And this urllib.parse.urlencode
is given a dictionary with a key of address. And this bit right here gives me the string that starts with address= and then encodes these spaces the right way. So if you type a space, that bit of code turns it into the plus. So that's important. And I've got the question mark sitting here at the end of the service URL. Then what we're going to do is we're just going to do a urlopen to get a handle. We're going to read the whole document. And because it's UTF-8 coming from the outside world and we want it
turned into Unicode inside our application, we say .decode(). We can ask how many characters we got. And we call json.loads. Now up till now we've been just doing loads from internal strings. But this is now a string that came from the outside world. So we'll put a try/except in. And we'll set js to be None and that'll be our little trigger. Now if we take a look at the output, they give us this status of OK. And that status can indicate a problem and it can complain
about things. So we have to check to see if we got a good status. So at this point, if you look at the outer bit of this, the outer bit that we get is a curly brace, so it's a dictionary. Then there is, within that dictionary, a key results, which is a list. But then the second thing in the outer dictionary is status. And so we can ask: if we got a false, if we got nothing back from the parse, we will quit. If we don't have a status key in that object, or that dictionary, or
it's not equal to OK, if any one of those things is true, we're going to quit: print a failure to retrieve message and print the data out. And when you're starting to read stuff all over the net, you often have to put debugging in here like this, like oh, something quit, I've got to figure out why. Next thing we're going to do is call json.dumps, which is the opposite of loads. It takes this dictionary that includes arrays, and we're going to pretty print it with
an indent of four. And then we're going to print that out. And so if you look at my code, we'll see that the first thing we do, once we've parsed it, is we print it back out so we can see it. And then we're going to dig into it. So let's go ahead and run this code. python geojson.py. One of these days, I will always type python3. Ann Arbor, Michigan. Okay, so it ran. And so you see that it retrieved this URL. This URL was constructed, and it retrieved 1736 characters. And it's JSON
pretty printed with an indent of four. And this is that json.dumps output all the way down to here. So that's just json.dumps. And then it starts extracting. So it's going to pull things out. Now, when you write this code, it's really easy to look at this and say, oh, great, it's easy. I tend to have to print this stuff out over and over and over as I kind of construct this expression. But if we look at it, the outer dictionary sub results leads to this array. And if you go
look at this array carefully, you find there is only one thing in it. And so the results is an array. Sub zero gets us this dictionary. I keep wanting to say object because that's what it's called. And that goes all the way down to here. So that's what we get there. And then within that, we now have an object. And we look for geometry within that object. Where is geometry? Right there. Geometry. Geometry goes from there to there. There's geometry in there. You've got to get used to it. That's why it's nice to have this
stuff indented. Geometry sub low. Oops, come back. Come back. And then we go to location within that. So location within geometry. And then within location, we have lat and lng. And so this is pulling out this 42 and 83. And then so we print that out. Take a look. And that prints that out. Pulls that right out of the JSON. These are tricky to write, but after a while you win and you get it right and it's just fine. Okay. And so we do the same thing. Results sub zero formatted address gets us this. And
so that's how we print the location out. And so that's a real quick look at how we would do that with the JSON talking to the Google Maps API. Okay. Hope this helps. Now we're going to talk about API rate limiting and security. The key thing is that the Google API and the Google data is super valuable. And you could build a website that did nothing but sort of like asked the person for something and then showed them that place and make them be a map searcher. And you added so little value and Google did
all the hard work. And so they protect these somewhat. Sometimes they'll say you can only do 50 of these a day or 500 a day or whatever. That's called rate limiting. And sometimes they say you've got to log in. You've got to create an account and get a key with us and then present your key. So that means that your account only gets so many. And they keep track of who's using their service and how much they're using it. Google gives you even sort of a dashboard that tells you some of this stuff. It's kind
of nice. And so the other thing is that sometimes an API is free and then it becomes popular and they decide they're going to put a key on it or a rate limit on it. So you've got to kind of play this game with them and the rules kind of change as things progress. So that geocoding API that we're talking about had, at one point in time, a limit of 2,500 requests a day. You can get more requests if you get a key. Now another API that we can talk about is the Twitter API. Now the Twitter API started
out as a free public API but then Twitter realized that people were making more money off of Twitter's data than Twitter was making off of Twitter's data. And so Twitter makes it so that you have to have an account. You can only request data from their API if you use your account key to sign that. And so there's a whole series of getting and issuing keys and then using those keys. And I'll just give you a short summary of the kind of code that it takes to build those requests up that have to be signed.
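Before any signing enters the picture, the plain step of building a request URL from parameters, as in the geocoding example, can be sketched like this. This is a minimal sketch, not the course's exact code: the service URL is a placeholder on example.com, build_url is a hypothetical helper name, and no request is actually sent.

```python
import urllib.parse

# Placeholder endpoint (an assumption); the real service URL comes from the
# API documentation. Note the question mark at the end.
serviceurl = 'http://example.com/geocode/json?'

def build_url(address):
    # urlencode turns the dictionary into "address=..." and encodes the
    # value so that it is safe to put in a URL (spaces become plus signs).
    return serviceurl + urllib.parse.urlencode({'address': address})

print(build_url('Ann Arbor, MI'))
```

Passing a dictionary to urlencode, rather than gluing strings together by hand, means spaces and punctuation in whatever the user types get encoded consistently.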
So you'll look through the Twitter documentation and it'll say, oh, this URL to get the tweets, et cetera, et cetera. And it says do a GET request to this URL and that URL and maybe substitute a little bit of things here for the screen name you're looking for or how many tweets you want. And they tell you how to carefully construct these URLs. And so here's an example bit of code that talks to Twitter. For now, I'll ignore the security bit. That's all hidden in this twurl. So it looks a lot like the
last one. We're going to use json and urllib. And we have found that this is the API name, blah, blah, blah, blah, blah, list.json, getting a friend list for a particular person. And so that is the base URL that we're going to use. And we're going to ask for a Twitter account. If we hit enter, we're going to break out. And with twurl.augment, we're going to say, give me the first five friends of this particular screen name, the one we just read in from input. And this twurl, you'll see in a second,
it adds a bunch of stuff to prove that you are who you are. It's signing that URL. So you're sending a signed URL, which is nothing more than a whole bunch of crazy characters. We'll see that in a second. We retrieve it. And this is pretty straightforward. We can just open the URL, read it, and decode it. Decode solves the UTF-8 thing. It makes it so that data is a real string and it's in Unicode internally. Now we can actually get the headers. Remember I told you earlier that urlopen skips past the headers, but
it has stored them for later. And we can say, hey, give me back those headers. And that gives us back a dictionary of headers. And the headers, if you go all the way back, are a bunch of key value pairs. Key colon value in the headers. And in Twitter, if you read the documentation, there's this x-rate-limit-remaining header. Each time Twitter returns a response to the API call that you made, it says, look, you've got 12 left. You've got 11 left. You've got 10. So you can print that out.
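Reading the remaining-calls header out of that dictionary can be sketched as follows. This is a sketch under assumptions: the header values below are made up for illustration, remaining_calls is a hypothetical helper, and a real dictionary would come from the connection after the request rather than being typed in.

```python
def remaining_calls(headers):
    # Header values arrive as strings, so convert to int for comparisons.
    # Return None when the header is missing instead of raising KeyError.
    value = headers.get('x-rate-limit-remaining')
    if value is None:
        return None
    return int(value)

# Made-up headers for illustration only.
sample_headers = {
    'content-type': 'application/json;charset=utf-8',
    'x-rate-limit-remaining': '12',
}

print('Remaining', remaining_calls(sample_headers))
```

Using .get() here is a design choice: not every API sends this header, and a program that checks its remaining calls should not crash just because the header is absent.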
So this prints out how many you've got left. Then we parse the JSON data. We're going to print it so we can debug it. This is json.dumps, dump to string, and then print it. Indent equals four. This is called pretty printing. And it's indenting things really nicely so that you can make more sense of it. Whereas when programs are talking to each other, they don't really make the output look particularly pretty. And then we're going to go through. We have the outer thing of users. And we're going to print out
the screen name: for each user in users, we're going to print their screen name. We're going to grab their status text and print that out. And so this is what that data looks like. Kind of chopped a bit. So the thing we get is an outer layer. We get users and then we get a list. And here's the first user. Now, if you look at the actual data, it's much larger than this. Here's the second user and then we have status text, status text and the screen name. And so those are
the bits that we're extracting from that. If you look, we're going to grab the screen name. We're going to grab the status text and away you go. So you can start with this, but you realize that once you're looking at this and you're printing this out with pretty printing, you can sort of work your way in knowing that it's either a dictionary or a list. If it's a dictionary, you look up the key. If it's a list, you say which position it is and then you get more dictionaries within dictionaries within dictionaries and away you
go. And so this code actually, when it runs, it prints out the screen name and then that status and the next person. So it's my first five, in that case, my first five friends and their most recent status, the first five people. Now, let's talk a little bit about how this security works. And so you have to go to the website. You have to have a Twitter account. You can't talk to Twitter API without a Twitter account. And then you go to this website and then you set up a key. You say, I'm going to
build an application that is going to consume the Twitter API. And then you go in, you have to work through. There's documentation on how all this stuff works. You set up an API key. You set the application. So I made a key called Python on my laptop. And it gives us some values. It gives us a consumer key, a consumer secret, a token key, and a token secret. And you get to regenerate these. And there's this file called hidden.py. And you edit them and copy and paste all the stuff from those pages, those four values,
into these strings. Now, if you download my code, I don't have my keys in there. I've got some placeholders for this stuff. So you've got to get to this web page that's on Twitter, copy these things in, and then the twurl code will start to work. It uses a technology called OAuth, which is a way to sign a URL in a way that proves that you have the key and the secret and the tokens. And it can't be modified in the middle. So once you send this URL, they can check the key and the secret
to make sure that you truly signed it without actually sending the key and the secret. It's actually kind of cool and fascinating, but we won't go into it in great detail here. And so if you look at the code in twurl.py, this is the code that does it. It actually pulls in an OAuth library and hidden.py, which is that file that you edited. And it's got the consumer key, the consumer secret. Secrets. This is pulling that from hidden.py. This is a lot of stuff that's using this OAuth library. Don't worry too much about that. Eventually
it produces a URL that looks like this. And what happens is this was the base URL you were told to use. Then you have count equals two and screen_name equals drchuck. Those parts are your parameters to that web service call. And then all this OAuth stuff is produced by this OAuth code and the consumer key and the secret. What happens is the key gets sent and the secret does not get sent, but they send the signature, which is based on the secret. The signature is a long string. And then what Twitter does is it rechecks
the signature on the far end by regenerating the signature, because the secret is available both to you to generate the signature and to them to check the signature. So it's kind of like a hash, et cetera, et cetera. You don't have to worry about all this. These URLs get really long. The values that you need go in, the base URL goes in, and you call this routine that's called augment, which takes a URL and then parameters and then augments it by adding all this OAuth stuff. And so
that's why it's called augment, to augment the URL. And once you've got this set up and hidden.py working, then you sort of just augment the URL and then hit it. Now, you know, if you don't have the right keys or secrets or you don't have an account on Twitter, then it's going to blow up. But if you get it set up, you will be able to talk to the Twitter API with this. So this whole web services section, we've done quite a bit of stuff, right? We've looked at how instead of reading HTML or flat
text, we are creating structured data according to contracts, whether it be XML or JSON. We can retrieve and parse that information in a deterministic way. We talked about schemas that define the contracts so that you know if the data you're getting is wrong, you could know who to blame because the schema gets violated. And we've played with APIs where you're talking to someone else who's defining what the rules are and how to read their documentation. And even if they have an API key or need to sign URLs, showed a little bit about how to do
that. We're working through some sample code. And you can get this by downloading it. I've got this whole thing downloaded. And I've got all the files here. And these are the files we're going to play with today. Today what we're going to do is talk about the Twitter API. And the one thing we've got to learn about the Twitter API is we have to authorize ourselves. And so we have to make sure that we have a Twitter account and then we get some keys. And so in this
particular application, if you want to duplicate what I'm doing, you have to go to apps.twitter.com, click this create new application button, and then get some codes. And the codes show up as soon as you hit this button and then one more button, which I'm not going to do on screen. And so what happens is there are four codes that you've got to put in this file hidden.py. The consumer key, the consumer secret, the token key, and token secret. These are just messed up, so I'll show you how this works and how it blows up first,
and then I'll put my keys in here without showing you. But basically, this is a little file you've got to edit or these Twitter programs don't work. You'll see what happens. So the first one I'm going to do is the simplest one of all. I call this thing twtest.py, and it is just going to go ask for the user timeline. And we can take a look at this. And we're going to take the URL and we're going to augment the URL. This is the base. We found this looking at the
Twitter API documentation. We're going to pass a parameter of screen_name, drchuck, and a count of two. So this is just a Python dictionary. And augment comes from this little bit of code called twurl. And this uses a bit of code called oauth, which is not built into Python; it's a file that comes along with the sample code. And it augments the URL and it takes the key, the secret, the token key, and the token secret, and does a thing and signs it and then makes this big, long, ugly URL, which you will soon see, and it's a signature
of the URL. So we pass this data back and forth to Twitter with a signature and then they recheck the signature, and it's a digital signature that proves that this URL came from a program that knows the key, secret, token, and token secret. And so this twurl.augment is something I wrote to make it easier to add all these OAuth parameters. And you feed this code by putting your data into hidden.py. Lots of people get this to work, so don't worry. It's kind of cool when you finally
get it to work. So let's take a look at what it does. Just know that this makes an awesome URL that does all the security. And we'll see one of those URLs. So ignore the certificate errors. This has to do with the fact that we're using HTTPS and Python doesn't have enough certificates put into it by default for a lot of reasons, but our quick and dirty way is to turn them off. Thank you, Python, for reducing security by teaching us that this is the best way to do it. That's a grumpy moment
on my part. So what we're going to do is we're going to do a urlopen. This bit here is to shut off the security checking for the SSL certificate. And then we're going to read all the data. And then we're going to print it out. And we're also going to ask the connection, remember I told you a long time ago that urllib eats the headers, but you can get them back. And now we're going to ask to get a dictionary of the headers back. And so we'll print those out. So this
is really kind of just testing the body and the headers and printing them out sort of in as raw a way as we can. So let's go run this. Now, this is going to fail the first time we do it because we haven't put the hidden variables in there. So if I say python3 twtest.py, it's going to run and blow up. And it's going to give you this 401 authorization required. That's actually a useful sign because it means that you haven't yet updated your values in hidden.py. And so this is that augmented URL. And you can see
the consumer key and the OAuth token and whatever. Okay, so these tokens are like wrong. These aren't, oops, control C. They aren't real. But you'll notice it doesn't have the consumer secret or the token secret. That's all actually encoded in this signature. It turns out that you need to have the secret and the token secret to generate the signature. And where is the signature? Oh, there's the signature, right? There's the signature. And so this signature combined
with the nonce means you can only use it once. This signature has a time and includes all kinds of things. So even if you type this in, well, you'll see these go by. And it's not really breaking my security too much when you see these afterwards. So don't get all excited when you say, oh, you revealed your token and your key. Well, I can reveal my token and key, but I'm not gonna reveal the secret. Okay, so this adds all this OAuth stuff, OAuth nonce, OAuth timestamp. And these timestamps and nonces make it so that you can't
replay my URL even if you see the exact URL. Once I hit it, then you can't hit it again. And so that's what the nonce does. So I'm gonna close hidden.py here. And I'm going to update hidden.py in another window. Okay, so I just, in another window, I updated hidden.py. I'm not gonna show you that. But now I'm gonna run python3 twtest.py. So twurl is going to read hidden. And now these keys and secrets are my real ones that I haven't shown you. So this should work. Fingers crossed. Yay, it worked. Okay, so it worked. So
I'm calling Twitter. Here's the URL. Now don't worry, the token and the consumer key are not enough to break into my account. And neither is the signature because you can't replay this. In about five minutes, you can't replay this anymore, okay? So you can't generate the signature. I've done one. The signature includes the time and date. So you can't, trust me, go read up on OAuth. Don't worry. I haven't really revealed anything. But, so the first thing we see is this. So we see, and we should put like the line of dashes here. This is
the JSON. It ain't very pretty. Okay, and so that's the JSON from there to there. It's just what most APIs give us back. It's really dense JSON, right? And so this is a byte array. Remember how you have to do a .decode()? I didn't do a .decode() here. And so Python is telling us this is a byte array, which is a raw set of bytes that came from the internet, which probably are UTF-8. And if I put a decode here, if I say data.decode() there, then it
would be fine. But we don't care. This was just a dump. Do we get anything? And so then, here, let's do this. Print. I'll just make this code different. Put some equal signs here, a lot of equal signs. So we can easily see where the thing starts and stops. So we'll run that again. If you look at those URLs. So that was all of that stuff. And then, this is the headers. And so the headers, again, are not pretty. If you get the headers, it's a dictionary. You got cache control, no cache, comma. This is
the string, key value. You got to find your commas key value. But the one that's really interesting here is, which one is it? x-rate-limit-remaining, right there. So for this particular API, this header tells me that I've got 898 calls left. And this is when I will get more calls, and yeah, so let's see, yeah. So watch. I'm going to do this again, and you will see that I can only do this 897 more times now. Do, do, do, run it. I can only do this
897. So I am being tracked at this point. I am being tracked by Twitter. Twitter knows that it's Dr. Chuck that's doing this, and Dr. Chuck has done 900. He's done 899, 897. And if I keep running this, eventually Twitter will tell me, you got to wait for a while. And that's because Twitter doesn't want me, under my Dr. Chuck account, pulling out like lots and lots of stuff out of Twitter and making my own website. I do actually have my own Twitter website, using some cool software. www.drchuck.com slash Twitter. And this I have to
run, and it rate limits and causes all kinds of, you know, whatever. So, okay, so Twitter rate limit. So, I'll save that. So that's tweet. This is just to test it, okay? Because we're doing, I want to do something interesting. So we're not parsing the JSON that comes back. We're not doing anything tricky with this. And away we go. So, let's take a look at some more code. I think I don't need this anymore. So now, I am going to parse this. So most of this looks the same. I've got that same user timeline JSON.
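The byte-array point from the test run can be sketched without touching the network. This is an illustration under assumptions: the bytes below are made up and simply stand in for what a read on the connection would return.

```python
# Stand-in for the raw bytes a network read would return (made-up content).
raw = 'caf\u00e9'.encode('utf-8')

# Printing raw shows the b'...' prefix: these are bytes, not a string.
print(type(raw), len(raw), raw)

# decode() turns the UTF-8 bytes into a real Python string.
text = raw.decode('utf-8')
print(type(text), len(text))
```

The two lengths differ because the accented character takes two bytes in UTF-8 but counts as one character once decoded, which is why the number of characters retrieved is only meaningful after the decode.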
I'm going to ignore the SSL certificates. I'm going to write a loop. So I'm going to prompt for a Twitter account and quit if it's a blank line, that is, if I just hit enter. I'm going to use twurl.augment the same way. That's going to do all the signing using the values from hidden.py. I retrieve it. And I'm going to retrieve it, ignoring the SSL errors. And then I'm going to decode it. This time I'm going to decode it so that I get a real Unicode
string. And I'm going to print the first 250 characters of it. I'm going to grab the headers. And I'm going to print the remaining, the rate limit. So this is sort of a very simple version of this same thing. It really is decoding the data and only printing the first 250 characters. So let's run that. drchuck, boom, and it's got 896. So that's just a little simpler version of that with a little less brutal debugging. Okay, so now let's do something even more fun. Let's go to twitter2.py and tear it apart. And so again,
we're going to look at my friends list or someone else's, anybody's friends list. We're going to ask for the friends and ask for the screen name, ask for the first five friends, and then look at their statuses: open it, decode it, get the headers, print the rate limit remaining. All this stuff is the same as in twitter1.py. But now we're going to parse the JSON. I'm not even putting this in a try and except because, hey, I'm talking to Twitter. I'm going to guess that Twitter's going to give me the right stuff. You'll probably want
to put a try and except here. Then I'm going to do a debug print. I'm going to do a JSON pretty print. Let's make that be 2 so it looks a little better. And then, well, I'm going to run it and then you're going to see how we have to parse this and we're going to see that it's a list. So we're done with that. And now we're running twitter2.py. So I'm going to enter drchuck and this is going to ask the question who Dr. Chuck's friends are. Okay, let's go to the top.
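The try and except that is skipped in the demo, followed by the pretty print, could look something like this. It is a sketch rather than the code from the course: parse_json is a hypothetical helper, and both input strings are made up so nothing is fetched over the network.

```python
import json

def parse_json(data):
    # Return the parsed structure, or None as a trigger value when the
    # text is not valid JSON (for example, an HTML error page).
    try:
        return json.loads(data)
    except json.JSONDecodeError:
        return None

good = parse_json('{"users": [{"screen_name": "example_user"}]}')
bad = parse_json('<html>not json</html>')

# Debug print: dump the structure back to a string, indented by 2.
print(json.dumps(good, indent=2))
print(bad)
```

Returning None on failure lets the calling loop decide whether to quit, print the raw data for debugging, or ask for another account.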
So it hit this API and it has screen_name equals drchuck, count equals 5, and all this OAuth stuff. Again, this is not a security breach by showing you all of this because the secrets aren't there. Okay, so if we look at it, it's an outer object or dictionary and then the outer has a users which is a list. And then each user has some stuff in it. So this one's Stephanie Teasley. It's got her screen name. It's got some descriptions. Keep on going. It's got her status, her latest status. For my
friend, her status. Her source, where she's at. I don't know, man, she's got a lot of stuff here. Okay, there we go. That was the first one. And then the next one that I'm following is live EDU, etc. So you'll see that this is an array. So that users thing inside the outer dictionary is an array of users. Now, js here is a dictionary. So I can say for u in js sub users. Well, js sub users is a list. So the first u is gonna be this Stephanie Teasley user and the second u is gonna be live EDU. So that's
all it took to get through all that stuff and figure that out. And then I'm gonna say, get me the screen name of my person. So let's go in here. So that's gonna pull Stephanie Teasley out. Then I'm gonna go find her status. Let's find her somewhere in here. u sub status sub text. Come on. Okay, there's status. Status is all this stuff. More, more, more, more, more, more, more. Right there, that's status. That's u sub status. And then u sub status sub text is this stuff. So it's gonna extract this bit right
here. And so that's u sub status sub text. And I print out the screen name and the first 50 characters of the status. And I do that for the first five because I told it I only wanted five. And then of course I get to see the rate limit. So let's go down to the bottom. So all of this is the debug print of the JSON I got back. Here is the program starting to print. Here is the screen name of my first friend. And here's the first 50 characters of her most recent status. Here is the screen name of
my, and these are in reverse order of who I've been following. So I've been playing with this live coding stuff. So I'm following them. What? KeyError status, that didn't work. Why not? Oh, that's because live coding TV somehow doesn't have a status. So most of these work, so now you'll get to see me fix something. And when you download it, it'll be fixed. And so it says KeyError status. So that means that I've got to do a thing that says, if status is not in u, print no status found, and continue. Since sometimes there's no statuses.
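The fix just described, checking for the status key before digging into it, can be sketched like this. The JSON below is a tiny made-up stand-in for the real friends list, with invented screen names, and summarize is a hypothetical helper so the loop can be shown without calling Twitter.

```python
import json

# Made-up data shaped like the response: the second user has no status.
data = '''{"users": [
  {"screen_name": "friend_one", "status": {"text": "Hello from the first friend"}},
  {"screen_name": "friend_two"}
]}'''

js = json.loads(data)

def summarize(users):
    lines = []
    for u in users:
        lines.append(u['screen_name'])
        # Some accounts have no status, so check before indexing into it.
        if 'status' not in u:
            lines.append('   * No status found')
            continue
        # Dictionary within a dictionary, then the first 50 characters.
        lines.append('   ' + u['status']['text'][:50])
    return lines

for line in summarize(js['users']):
    print(line)
```

The `in` check avoids the KeyError without wrapping the whole loop body in a try and except, which would also hide other mistakes.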
Who would have thought? I did not know that. Yeah, so you. Okay, so let's run this again. Did I get to see my remaining? Actually, let me change the order of this. Let me put this down here. That'll be wrong from the slides, but it'll be prettier now. Let's put the headers after the dump of the data. Okay, so let's run it again. Did I save it? Yeah. Dr. Chuck. Blah. Whole bunch of stuff. So I got 13 remaining calls on this one. So it's not the same as the other one. I don't get to
call this too many more times, so hopefully I'll get the debugging to work. Sort of. I got a bad space here. No status found. And I need to put three spaces there. No status found. I'll add an asterisk. So let's run it again. See, I got 13 remaining. So it's important you write code that's aware of your remaining calls. That's why I made it so obvious. I'll retrieve all that. I got 12 remaining, but my code starts to look... Dang it. I now have another space here. Hang on. Got to fix
that. I need yet another space. Hopefully, I can make this as pretty as I want it to work. Oh, wait a sec. I didn't even do drchuck. I did that wrong. Typed my name wrong. OK. So now it works. Oh, well. So now my five most recent friends are these: Steph Teasley, live edu official, LiveCodingTV, Nancy Gilby, and Greg E. Kruger. And so there are their statuses. And I tore all this JSON apart using twitter2.py. Of course, after fixing hidden.py, which I'm not going to show you, because it actually contains my
real consumer key and consumer secret, you're seeing the consumer key and the token key go by on each of these URLs. But what you're not seeing is these two things, which are the thing I'm protecting, so that it's not a problem. OK. So I will send that up. But there you go. Welcome. I hope you found this useful. The code will be fixed when you take a look at it and download it here from samplecode.zip. Hello, and welcome to Python Objects. I'm Charles Severance, and we're well on our way to getting through all this material
in Python. So this lecture is in a weird place. I even debated where to put it in the book. I don't really want to teach you how to write a lot of object-oriented programming, but we're going to start using objects. And I want to be able to use the terminology. And so as much as anything, this lecture is about terminology and understanding the words, things like methods and method signatures and variables and inheritance. And so think of this as a terminology lecture rather than a learn-how-to-program or learn-how-to-use-this lecture. It's not something you're going
to figure out right away. And there'll come a time when you as a programmer really want to start using object-oriented programming. It's really a powerful and wonderful technique. But I think it's too early as a beginning programmer to really say, oh, let's write a bunch of objects. So just relax and enjoy and learn this material and think of it as sort of a theoretical thing rather than a how-to program thing. And so part of this is we're going to start reading data structures and data on how to use all these libraries, etc. And we're going
to see the word objects, right? And then we're going to start hearing it. And I want you to be able to read the Python documentation so that you understand what's going on. And so the word objects should make sense to you even though you're not going to write a lot of objects or do any object-oriented programming. And so page upon page upon page, database stuff, which we're going to talk about soon, uses objects all over the place. And Beautiful Soup uses objects. We've kind of been using them, and I've been waving my hands and I use
the word method without defining it. But now it's really time to define it and go to it. So I want to review from the very beginning what we think of as a program. So the classic program, my favorite little minimum program, is our little elevator floor converter, which converts from European elevator floors to United States elevator floors. And the key to this is that it's input, processing, and output. And this is a good way to model any program. And in that process, we've got variables, and we've got logic, we've got algorithms, we've got loops that
we write, we've got all kinds of things. And we construct a series of steps to achieve some goal. In object-oriented programming, and frankly, you've been using object-oriented programming all along, the program has lots of objects. And we're sort of putting stuff into these objects, taking stuff out of one object and putting it into another object. And you've actually been doing this all along. As soon as you're looking at dictionaries and lists, you're doing objects. And so an object is quite a little thing. It's sort of its own little space inside of a program that contains code and
data. And so they're working together. All these objects are now working together. It's a bit of self-contained code and data. And it is one way to take a very complex problem and make it easier by breaking it into separate things that can be engineered and developed separately. So you'd be using string objects, or maybe you'd use Beautiful Soup or something. These are powerful capabilities, and instead of having to look inside all of them, it's just, hey, here's a thing, use this object, it'll do these things for you, and there's lots of details inside of it.
Just don't look at it, don't worry about it. And so there's boundaries, the things that you can use, things that you can look at, and things that really you don't bother looking at. You go read the documentation and use it, and away it goes. But then someone had to write that, and so they built an object. So what we're going to do is look a little bit under the covers of what it takes to build some of these objects. And so if we think of this program that originally just sort of did processing, we can
think of it as having some kind of an input, right, coming into our program. And we have a string object, a dictionary object, maybe eventually some objects like a database object or an object that we eventually define. And you can think of us, we're receiving data, it comes in an object, which is a string object, or you start putting the strings in dictionaries and do whatever, we pull out a list of them, and so you can think of data as moving between these objects. And like I say, even strings, in the first week, first lecture,
first week, first everything, we were using objects, and we've been using them all along. And so you can think of every string and every dictionary as a little program all by itself that has a bit of code and a bit of data. And so a string has the data, which includes all the characters that make up the string, but then there is a method called upper that does uppercase, or rstrip, that strips off the white space from the right. And so it's like they're almost little programs that have inputs and outputs themselves, and we
can make lots of them. And there's lots of cooperating objects that make up an application. And one of the nice things about the object-oriented pattern is that they form boundaries, and within the boundary, if you're inside the object, you can say, look, I'm going to build you a string object or a database object or a Beautiful Soup object, and I'm going to build this capability and I'm going to give it to you in the form of an interface, and I'm not really going to care how you use it. And so we have this sort of
visibility wall where I'm going to make an object and I'm going to let you use it, and the maker of the object doesn't necessarily have to know every single thing about the use of that object. But so just like inside the object, they don't have to worry about what you're doing with the object outside of it. When you're outside the object, you don't have to worry about what's going on inside of it. We, as the user of the object, we talk to its interface and we get things from it and give things to it and
use functionality within that object, but we don't have to look inside of this. We can just say, oh, it's a nice little magical thing. We read the documentation, we read a web page, and it told us to do this, this, and this, and away you go. And so it is sort of this isolation boundary that works both for the programmer who's writing the object and the programmer who's using the object. And so it's a very nice pattern, and so you'll see how we're going to build code and we're going to group it together, and then
we're going to be using it sort of as a big blob of stuff. So some definitions in this space, words that I want you to understand. When we're going to create one of these things, one of these objects, instances, that has some data in it and some code in it, we have to be able to define the shape of this object. What code will each object have in it and what data will each object have in it? And that's called a class. The key to a class in this little picture that I've got up here
in all these slides is this: the class is a template. It's not the thing itself, so it's a cookie cutter. It knows a lot about how cookies are made, and if you have cookie dough and you hit the thing, then you make as many cookies as you want. And so this nice little cookie picture is a great, you know, mental model of how it works. The class is the template, and then the objects are all of the cookies that are made from that template. But the template defines the shape and the nature of the
class. So the code that we write is going to be part of each of the objects. The code we write is the class code, and then later we say, oh, let's take that template and make ourselves an object or an instance. Now, as we're defining a class, we have two basic things that we put in the class. And there's a couple of different terminologies for this. One is method, which is code. It's like a function that lives inside of a class. Not a function that lives inside your program, but one that lives inside of a class. And so
this is a scoping thing. A method is really just a function, but it lives inside the class. And then fields or attributes are data items that are in the class. And so they're variables that are defined in the class. You can define variables outside the class that you use in your program, and you've been doing that all along. But if you're saying, I'm going to build this capability and it's going to have data inside of it and code inside of it, the code is the method or message, and the data is the field or attribute. And there are just
two different sets of terminology. Method is what I'll probably use. If you look in some object-oriented patterns like Smalltalk or Apple, they often call these messages. So you can either access a method inside of a class or an object, or you can send a message to the object. The same is true for field and attribute. It's just a chunk of data that's in the object that you may or may not have the right to access. So like I said, a class is a template. It defines the characteristics of the objects that we're going to
make from it. It is the cookie cutter. So dog is sort of the exemplar. Lassie is a particular dog. And so a dog has fur and a dog barks, and dogs do all these things. And so we know something about dogs, but it doesn't mean we have a dog, right? And the class is a more abstract concept, so that when it's time to get a dog, we know certain things about dogs. Instances or objects come in once we say, oh, time to make a cookie from the template. Time to get a dog. We know something about dogs. That's
the creation of an object, and we call them instances, instance of a class. So the class doesn't exist, but we say, make me a new object using this class as its template. Oh, and now make me another one. And so we can have many, many objects from one class. So just like many cookies from one cookie cutter. Method is a bit of code that lives inside of an object. It's like a function, but it's scoped to within the object or within the class. Okay, so that kind of gets us started on some of the terminology,
and we'll come back and we'll take a look at how we write code that's object oriented. Okay, so now that we've gotten through the definitions, let's work into some sample code. But hey, look at this. We've got ourselves a cookie cutter and some cookies. So remember that a class is a template. It's not the actual thing. An object is an instance of a class. So you have to take the class and do something to make the object. And actually you can see here some other classes. Clearly a sort of a snowflake class and a gingerbread
man class. That's an object, object, object. Somewhere out here there is a snowflake class and a gingerbread class. But we've got a snowman object and a snowman object and a snowman class. So class is the template. Object is the instance. So here's a bit of Python code. So let's take a look at what we've got here. Class is a new reserved word, kind of like def. We have the name of the class. That is a name that we choose. That's the name by which we'll refer to this class for the rest of this program. And
it has a colon at the end of it, which means it starts an indented block, which ends when we de-indent. Inside the class there are generally two things. There is some data, and this just looks like an assignment statement in the class, x = 0. And then there is a def. This looks just like a function. It starts with a def, has a colon, indents. That function finishes right there. The difference is this is a method because it lives inside of a class. And so there is no function called party. There's a method
called party within the PartyAnimal class. And we'll talk in a second about this self thing. It is the way that inside this code we refer back to that variable. So this part is not actually executing any code. It's sort of remembering the template, defining the class PartyAnimal. This next line is what we call constructing. Using the PartyAnimal template or class, we are making a PartyAnimal. And once we make that, we stick it in the variable an. And then we're going to call this party method three times: one, two, three.
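The class being described here can be sketched roughly as follows. This is a reconstruction from the narration rather than a copy of the slide; the class name PartyAnimal, the variable an, and the "So far" message come from the walkthrough, and the rest is ordinary Python.

```python
class PartyAnimal:
    # Data: every PartyAnimal starts out with x set to 0
    x = 0

    # A method is a function that lives inside the class
    def party(self):
        self.x = self.x + 1
        print("So far", self.x)


# Construct one object from the template and store it in an
an = PartyAnimal()

# Call the party method three times
an.party()
an.party()
an.party()
```

Nothing runs while the class block is being read; the first line that actually does work is the one that constructs the object.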
Now this self thing, and we'll take a look at the self. The self ends up being an alias of an. And so you can look at this syntax, an.party(). It's just kind of an equivalent of this syntax, PartyAnimal.party(an). It's calling the party method within the PartyAnimal class and passing the instance in as the first parameter. And so self ends up being an alias of an each time these are called. Now if we make a different variable and a second object, which we will eventually, you will see that that works a little bit differently. And so this
syntax is a short version of that syntax. So if we watch how this executes, it starts up here, it just defines it, and then we construct it. And we know how to construct it because we look at the class: we make a variable x, we make some code party, and that's what PartyAnimal() does, and then we assign that into an. And so an is now pointing at that. And then when we call the party method, that basically takes this an and passes it in
as the first parameter, which is used as self. And so self.x, which is what we're doing in this line right here, self.x is a variable, x starts out as zero. x starts out as zero because when it was constructed it was set to zero. So we're in here, self is an alias of an. It looks up self.x, which is zero, adds one to it, and so this becomes one. And then we print so far, so far one. And then the code returns and it goes down and does it again. And x becomes two, prints out
so far two, comes back down, and does the last time, calls it again, self.x is two, add one to it and stick it back in, so this becomes three, and we print out three, and then the program finishes. And so you can think of this as constructing the object, and then associating it with this an variable. Now that we've created this object, we can play around with things we've played around with before, dir and type. We use dir and type to kind of inspect variables and types and objects. So we've been using objects all along.
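The point made in the walkthrough above, that calling a method on an object is shorthand for calling it in the class and passing the object in as self, can be checked with a small sketch. It reuses the PartyAnimal class as reconstructed from the narration, so treat the details as an assumption about what the slide shows.

```python
class PartyAnimal:
    x = 0

    def party(self):
        self.x = self.x + 1
        print("So far", self.x)


an = PartyAnimal()

# Short form: Python passes an in as self automatically
an.party()             # prints: So far 1

# Long form: look the method up in the class and pass the instance yourself
PartyAnimal.party(an)  # prints: So far 2

# Both calls updated the same object
print(an.x)            # prints: 2
```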
This code here says, hey, make me an empty list. Well, it turns out that what we're saying is there is already a list class inside of Python, and we're constructing an empty list. And when we get back this empty list, we're assigning that into x. So x, in a sense, contains or points to an empty list. So then we say, hey, what is in x? What kind of thing is x? Well, it's a list. This is a thing. It's a list type. Lists have lists of things in them. And, you know, use append and all
the things we've been doing before, they're just objects. And then the dir, if you remember the dir, the dir is the capabilities. And there's all these internal capabilities that do things like implement the bracket operator, et cetera, those double underscore ones. We can ignore them, although you can even look them up and figure out what they mean if you feel like it. But the methods that we tend to call are in this class. And so things like x.sort, I've always told you, that is the sort method within the x thing. And the dot operator is
the operator that we use to look something up within an object. And so you've been using the syntax all along. x.sort, dictionary.items, all of those are methods within the corresponding class. If we take a look at this line of code that we've been doing for a very long time, which says, oh, stick hello there into y. If I reword that as more OO or object oriented, what this quoted string says is, make me a string object and put some text in it, and then when that is done being constructed, stick that into y. Right?
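Here is a small sketch of inspecting built-in objects with type and dir, as described above. The filtering of double-underscore names is my own addition to keep the output short; it is not something the lecture does.

```python
# Construct an empty list from the built-in list class
x = list()
print(type(x))   # prints: <class 'list'>

# dir shows the capabilities; hide the double-underscore internals
public = [name for name in dir(x) if not name.startswith("__")]
print(public)

# The dot looks a method up within the object
x.append(3)
x.append(1)
x.sort()
print(x)         # prints: [1, 3]

# A quoted string constructs a string object and stores it in y
y = "Hello there"
print(type(y))   # prints: <class 'str'>
print(y.upper()) # prints: HELLO THERE
```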
And so y now points to a string object that's been preinitialized to the string hello there. Now that's a long way of saying hello there ends up in y. But in OO terms we can talk about it that way. If we do a dir of that, we see a whole bunch of internal methods, which have double underscores. And then we see all kinds of methods that we've been using. We've been using methods like upper. We've been using methods like find. We've been using methods like rstrip, right? We've been using these methods. So when we write something like y.rstrip(),
again, rstrip is a method, and y is an object. Not a class, it's an object, and the dot is the object lookup operator. Now if we do the same thing to code that we've built, or a class that we've built, so now we have a PartyAnimal class. Remember this up to here is just definition. Now we construct it, and we store it in an. So an is a variable that contains an object of type PartyAnimal. We ask it what type it is, and it prints out here. It says this is a class, and it's __main__.PartyAnimal.
And this part here is the __main__. It's scoped to __main__. But you can see that you have made a new type. You built a type by using this class keyword. And then we use the dir. Remember, dir looks for capabilities. And again, you will see a whole bunch of underscore things. They have meaning, you can look them up. But eventually you'll see the two things that you've put in it. One is the method party, and the other is the attribute, or field x. And again, these are the things that you
can say, an.x. Or an.party. Because this dot is the object operator, the object lookup operator that says, look up in the object an, the thing x. Or look up in the object an, the thing party. Okay? So up next we'll talk a little bit about how objects are created and destroyed. We also call that object life cycle. Now I'm going to talk a little bit about object life cycle. And what we mean by object life cycle is the act of creating and destroying these objects. And I've been using this term constructor already. And so when
we declare a variable, whether it's a string or a dictionary or a PartyAnimal, we create these objects and then they're discarded, and there's all this dynamic memory that comes and goes. And we as the writers of objects have the ability to insert ourselves at the moment of object creation and at the moment of object destruction. And we make special functions that we call the constructor, the object constructor, or the class constructor, and the destructor. And we don't actually explicitly call them. They're called automatically by Python on our behalf. And so the constructor
is much more commonly used. It's used to set up any initial values of variables if necessary, etc., etc. Destructors, we'll cover them, but they're used very rarely. So here's a bit of code that we've got. It's our PartyAnimal, and a lot of it is the same as what we've been doing so far. So we have this variable x, and the constructor has a special name, __init__. Again, we pass in the instance of the object, self. And in this one, all we're going to do is print out that you're constructed. And here's
this code that we've had before. And now we have __del__, and then we pass in self. And we'll just print out that we're being destructed and what the current value of x is for that particular instance. So let's go ahead and run this. And so, again, this doesn't really run any code up to here. That just defines PartyAnimal, but this is the constructing of it. And basically that says, oh, and it really kind of creates these variables, and then it also runs the constructor. And so in this case, this line right here
is causing the I am constructed message to come out. Then we do an.party(), an.party(), and that says, you know, one and two. And here's an interesting thing. We're actually going to destroy this object by throwing away an. an no longer points at that object. an is going to point to 42. So we're going to sort of overwrite an and put 42 in it. And at that point, Python's like, oh, this whole little object that I just created, somewhere it's out here, it's vaporizing it and throwing it away. And so before this line completes,
it actually calls our destructor on our behalf. And so that message comes out. So we are allowed as the builder of these objects to add these little chunks of code that says, I want to be involved at the moment this object is created, and I want to be involved at the moment that this object is destroyed. Now, in this last line, an is no longer a PartyAnimal. an is now an integer. It's got a 42 in it. The object is gone. It was created, it was used, and then it was destroyed. So you've got to be
careful if you overwrite something. You can sort of throw the object away. So the constructor is a special block of code that's called when the object is created to set the object up. So we can create lots of instances. Everything we've done so far is we make a class, and then we create one instance, one object. And each of these objects ends up being stored in its own variable. We have a variable an, and we've been using it. But the more interesting thing begins to happen when we have multiple instances of the same class sitting
in different variables. And each one has its own copy of the instance variables. So let's take a look at this. So this code here, I've taken out the destructor, and it shows a little bit more information. So now we're going to put two variables in here. We're going to have a current score or whatever and a name, and we're going to start it out as blank. And this time we're going to add a parameter onto the constructor. And so the self comes in sort of automatically as the object is being constructed. But if we put a
parameter on the constructor call, which is this PartyAnimal call, then this comes in as the z variable. And so self is the object itself, and z, this first parameter, is whatever parameter we put here. Everything we've done so far has no parameter here, but now we have a parameter here. And then that means that when we call this constructor, this line of code runs, and then name is no longer blank, name is going to be Sally in this particular thing. And then it'll say, oh, self.name, which will be Sally, has been constructed. And
so then we have this, and that object is now constructed, and then we put it in the variable s. And then we call the party method on that, and we construct a different one. And so this time it calls, and z is Jim, and we basically have a, oops, another copy of this. And so this is how it's going to look. As it runs down here, when this is called, it makes one instance and stores that in the variable s. And there's a variable x in there, there's a name in there, there's an __init__ method
and party, and that's all in here. All that stuff is in here. And that's going to have Sally in there. All right, Sally in there. And then we're going to do another constructor, and so it's going to make a whole new thing, and it's going to store that in j, and this one's going to have Jim in it. s.party(), then this turns into a one, and then we're going to call j.party(), that turns that into a one, and then s.party() will cause this to be a two.
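The two-instance example walked through above might look like the following sketch. The names PartyAnimal, z, s, j, Sally and Jim come from the narration; the exact text of the printed messages is an assumption on my part.

```python
class PartyAnimal:
    x = 0
    name = ""

    # The constructor: self arrives automatically, z is the value we pass in
    def __init__(self, z):
        self.name = z
        print(self.name, "constructed")

    def party(self):
        self.x = self.x + 1
        print(self.name, "party count", self.x)


s = PartyAnimal("Sally")
s.party()

j = PartyAnimal("Jim")
j.party()
s.party()

# Each object has its own copy of x and name
print(s.name, s.x)   # prints: Sally 2
print(j.name, j.x)   # prints: Jim 1
```

The counts end up different because s.x and j.x are separate variables, one inside each object.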
And so what happens is we have now two objects, one in the variable s and one in the variable j, and they have separate copies of their instance variables. These are the instance variables, or the object fields, or whatever, but they're the variables. But the key is that every time we do a new construction, it duplicates this, and there's another copy of it. So there's an x within s. So s.x is this variable, and j.x is that variable. Okay? So the next thing we'll talk about is inheritance, and that's the idea of taking one class
and extending it to make something new. So the last topic we'll talk about here in object orientation is the notion of inheritance. And this is a form of code reuse, and it's one of the more advanced aspects of object-oriented programming. So just kind of understand what it is at a high level, and then you know where to come back to when you need to learn a bit more about inheritance. So the idea is instead of making a new class from scratch, we actually make a new class by starting with an existing class. We are extending
it, or another word for this is subclassing. And it's sort of a situation where you're like, I've got this code, and I've got this data, and I just need to add a few things to it, and then I'll have a whole new thing. And as you design objects and what we call object hierarchies, you often do this, and it's a form of sort of real clever code reuse. But again, don't necessarily think that you're supposed to know when to use this or why to use this. Right now, it's just terminology, okay? Just terminology. We have
what we call parent-child relationships. The original class is called the parent, and the new class is called the child class. So subclasses are another word for this. You have a class, and then you subclass it. I think extending and inheriting and parent-child are probably better ways of expressing it than subclassing. So here's a bit of code. Let's take a look at this. This code's unchanged. It's the PartyAnimal code that we've been seeing all along. It's the one that we construct and put a name in. And now what we're going to do is extend
it. And so you'll notice that this code down here is the part that's doing the extending. So we're making a new class, FootballFan. And by putting PartyAnimal in parentheses before the colon, that says, FootballFan inherits everything that is in PartyAnimal, meaning the x, the name, the __init__, the party. All those methods and data are sitting there. And now we're going to add a new variable. So FootballFan has, in addition to all those other variables, points, and it has a touchdown method. And self.points is added to; we add seven to
the points, and then we call party. And that does that. So this is calling this method because FootballFan includes x, name, and party, and __init__, and everything. So this FootballFan is really an amalgamation of all these things together. PartyAnimal is just this stuff, right? And so we still have two classes. We don't just have one. We didn't erase the PartyAnimal class. And so we take a look at the code that we can run here. We can say, oh, okay, let's make a PartyAnimal, Sally. And so
that constructs an object like this, and then stores that in s, with an x starting out at zero. And then we call this party, oops, better change that color, starts out at zero. And then we call the party method, and that changes it to one. And so for this bit of code, it's as if this part doesn't matter at all because it is a PartyAnimal. It's not a FootballFan. But now if we take a look at this code down here, we're going to construct a FootballFan and pass in
Jim. But FootballFan has no __init__. So that actually uses the __init__ from PartyAnimal because we extended PartyAnimal to make FootballFan. So we inherited all of the good that was in there. So it's going to make a variable x, which is going to start at zero, a variable name that's going to have Jim in it, and a variable points that's going to have a zero in it. So this j variable has more things in it than the s variable has. And so we can call
j.party(), and if we call j.party(), that goes here and adds one to x, right? So that adds one to x. And then we call j.touchdown(). Well, that comes down in here and adds seven to the points, right? And then it calls party within itself. So self.party is the current object, i.e. self and j are the same thing, right? self.party(), and then it goes up here and passes self in, and it adds one to the x, in this case, of this j variable. So this becomes two. And that's where it prints out seven
and two, and away you go. And so it's a way for you to kind of take all this stuff and stuff it into a class by making a new class and just add the extending bits, the bits that are in addition to the other stuff. So like I said, inheritance is a powerful and wonderful concept. It's an excellent form of reuse, but basically the whole purpose of this lecture was so that I could in the future just use these words and you would understand them. I just want to say method,
and I've been saying method all along, and it's high time that I defined it. So let's just review one last time. Class is a template. It is not actually a thing. It is a shape of a thing. And we define it and say when we make one of these things, it's going to have these variables in it, it's going to have these methods in it. Attributes are variables within a class; a method is a function that's inside of a class. Object is once we construct a class, we get back an object. And so object here is the
snowman cookies. Class is the snowman cookie cutter. And a constructor is a bit of code that sets up our object, our instance, when it first is created. And inheritance is this ability to create a new class but take, in effect import, all the capabilities of an existing class. So object-oriented is awesome. For the rest of this class, we're not going to write any object code. We're not going to use class at all, but we are going to use objects. Literally, you've been using objects from the beginning of this course. As soon as you
said, print, whoops, as soon as you said, you know, x = 'hi', that's an object. And as soon as you said x.upper, you were calling a method, right? You've been calling a method all along. When you're doing something like fh equals open, this thing you're getting back, that's an object. And then you do fh.read or whatever. You're calling a method with the dot operator. So you've been using objects all along. Now I'm just finally explaining to you what I mean when I say call the read method or call the upper method or what's this little dot and why
is that there? So again, it's time for us to understand that, but it will take you a long time before you encounter a problem that's large enough where as part of your solution, you're going to make a new object. But when you do, it's really a powerful thing. I mean, it's a really bad idea for me as a teacher to say, oh, write a bunch of objects. It's premature for that. Later is when you will actually learn how to use objects. And you'll be like, oh, thank heaven that these objects are here. Okay? So
that's all for now. Thanks for listening. See you on the net. Hello and welcome to our chapter on databases. We're going to learn a lot in this chapter, learn a whole new programming language, SQL, and learn how to use that. So you're going to need a new piece of software to run all of the exercises that I'm going to do called SQLite Browser. We're using a database called SQLite. Go ahead and download this. You might have to pause and come back if you like. Go to sqlitebrowser.org and download it and install it. While you're doing
that, we'll talk a little bit about the history. So in the old days, 1960s, 1970s, I started doing computing in 1975, we didn't have a lot of storage. I mean, this is 16 gigabytes right here, and we didn't even have megabytes. I mean, the computer I had had a few megabytes of stuff. Well, so we didn't have a lot of disk drives. And so permanent storage was often sequential in these tapes, these tape drives that we had. Tapes and tape drives were the scalable part of storage because you could just make more tapes and you
could rack them up. And so that was our way of greatly increasing the storage of the computer. The problem was that they were sequential. You read it, it advances, read it, advance, read and advance. Now, interestingly, we've been writing programs that do this. Everything we've written so far pretty much reads the whole file, reads the whole web page, reads everything. We either read in a loop or read the whole thing. And that's because we have plenty of memory. But we're still reading sequentially. And so the way you would do
this when you didn't have enough spinning storage or online storage is you'd use offline storage. But the trick would be that you would sort it. So let's imagine that you're a bank and you have a bunch of accounts, only a few of which are active on any day. And you have a tape that has, in account number order from low to high, the prior balance, last night's balance of every one of your bank accounts. And then you do all the transactions and you record how much money was taken in or out for each account number.
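The one-pass merge that this setup makes possible, which the narration goes on to walk through, might be sketched like this. Lists stand in for tapes, the function name and the account numbers and amounts are made up for illustration, and the sketch assumes every transaction refers to an account that exists on the master list.

```python
def sequential_master_update(old_master, transactions):
    """Merge sorted transactions into a sorted master list in one pass.

    old_master:   list of (account, balance), sorted by account number
    transactions: list of (account, amount), sorted by account number
    Returns a new master list; old_master is left untouched.
    """
    new_master = []
    t = 0  # current position on the transaction "tape"
    for account, balance in old_master:
        # Apply every transaction that belongs to this account
        while t < len(transactions) and transactions[t][0] == account:
            balance += transactions[t][1]
            t += 1
        # Copy the (possibly updated) record to the new tape
        new_master.append((account, balance))
    return new_master


# Hypothetical data: last night's balances, in account-number order
old = [(1, 100), (2, 50), (44, 10), (45, 200), (60, 75)]

# Today's transactions, recorded as they happened, then sorted
txns = sorted([(60, -25), (45, 30)])

new = sequential_master_update(old, txns)
print(new)
```

Because both inputs are in the same order, neither one ever has to be rewound, which is exactly what a tape drive needs.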
And then you sort those transactions. And then what you do is what we call the sequential master update. And that is, you would write a program that would read the first transaction and hold on to it. Say, okay, this is account 45. Then it would read the first account, like one. And it would copy one. And then it would read two and read like seven, eight, 42, 43. Then it would read like 44. And then it would read 45, but now it would change that and write the new 45 and read the next transaction. And
so this might be 60. And it would read a bunch of stuff and copy a bunch of stuff. And then it would finally get to 60 and it would merge in the add or subtract. And so the old balance was here and the new balance ended up there. And you only had to make one pass through the data. So it was super efficient. So we had all these mechanisms to sort. We used to do punch cards and have sorters and all these things. And then these things would run for hours. And if you watch old TV
shows, these tapes are spinning and these things are running back and forth. These are simply reading and writing tapes. And that's how we did a lot of data processing because we could store far more on a tape drive than we could on a disk. And with racks of tape drives, we could scale the storage that our computers had. And so that's the way we did data processing. But it meant that the only balance you knew was the old balance, the balance as of this morning before your bank opened. You don't know
what the balance was during the day. And that led to things like you can never withdraw more than $100 a day or something like that, because the bank didn't know your current balance. Or you might go withdraw $100 at a couple of different branches. And so they weren't able to look your stuff up right away. Now, it didn't take long until the disk drives got better and better and better. And you could store the entire accounts, all the accounts and their current balances, on computers. And then the problem becomes, what happens if sort
of in the middle of the afternoon you want to update a balance? Well, do you want to read all your data and then write a brand new copy? And say that takes like 10 minutes. That means for that 10 minutes, only one person can be updating their bank balance. And so because we could randomly access this data, we didn't have to read it all sequentially. The trick was, how do you spread the data out? And then how do you make it so you can change a balance? This is, of course, second nature today. But
how do you make it so you change the balance here without changing the balance there? And you can have multiple people going simultaneously to these things. And make sure that you can't say withdraw money at two different locations simultaneously and somehow have your bank balance get corrupted by that. So there's a lot of debate on how to do that. And in early days, we just did sequential master update. But increasingly, we wanted to make better use of the random nature of our computers and our storage. And so that's what led to databases. Databases are the
science of how you make use of rotating random access data, permanent data, in a way that allows you to read, modify, and update that simultaneously from many different locations. And yet keep the data completely consistent. And so this led to a study of a thing called relational databases. And relational databases are not the only databases that happened. We had many other kinds of databases. And there was a debate. And I remember in the 70s and the 80s, there were folks who said, oh, no, no, you can do indexed sequential. That's the way to
do it. And relational databases weren't all that popular the first time that I saw them. I didn't like relational databases. Relational databases had an inherent advantage because they were based on some really powerful mathematics. And the interesting thing is, early on, the relational databases were slower. But eventually, they figured out how to sort of bring all the cleverness to bear to make relational databases fast. And so relational databases are a pretty advanced technology. And there are companies like Oracle that are very, very wealthy. And their primary product for many, many years was nothing more than
a clever database product, a clever piece of software that was really good at solving this problem. And that's how important this problem was to computing. If you read about databases, you're going to see two sets of terminology. One set of terminology comes from the mathematical background and has to do with the underlying math, things like relations, tuples, and attributes. That's kind of like the fancy math version of it. And programmers kind of think of them as rows and columns inside of a table. And so if you look at sort of fancy theory, you'll see words
that look like this. And they're just full of this and the connection. Now, all this is important and true. And if you really want to get good, you sort of begin to understand that we model data at connections, at sort of intersection points, rather than just modeling data as a flat file the way we do. But for now, we're going to, as programmers, think of this as just like, oh, it's like a super fast spreadsheet. The super fast part is the math. For us, the rows, columns, and tables are spreadsheets. So
think in a spreadsheet of sheets, sheet, sheet, sheet. And that's like a table, a named thing like tracks or albums, artists or genres. And then there is rows, and each row has a different kind of data. And then there's columns. And we sort of specialize the first column in many spreadsheets to say what's in there. This is not really the data. This is like metadata. It's like the titles in this first column. That's not really the data, and the data starts here. And we have different kinds of data like strings and numbers, et cetera, et
cetera, for each of the rows. And literally, you can get away with this as sort of about 80% of databases. It's just a really super cool spreadsheet. But under the covers, it is far more powerful than that. So one of the early arguments that happened was, again, what the programming model for this was. And a lot of folks wanted a programming model that reflected how the data was actually stored. The notion of structured query language came about in a way to express what you wanted to happen and allow that to be sort of a very
abstract expression. Select all records that meet this criteria. Not read, read, read, read, read, read. And so structured query language is not a procedural language. It is a declarative language where you're simply saying what you want. And then somebody else writes the loop. The database actually does the loop, but it's a way for you to avoid actually writing the loop. Now, that turns out to be the power of databases. Because the most supremely optimal way to write the loop is something you would probably never figure out on your own.
You'll see that toward the end, with joining many tables together and selecting and throwing away and getting down to a count or whatever. Someone has figured out how to do that really, really well. So the idea was, you would express, you know, we're going to create some data, we're going to retrieve some data, we're going to insert and delete it. Create, read, update, and delete: CRUD. C-R-U-D. And so that's what this does. It's a language that does this very simply. Now, the applications that we're going to use
this for are more of a data analysis application. We've been doing data analysis through the whole course. And the kinds of things that we'll see in the remaining chapters is we'll take some raw data file. These might actually come across the network. And we'll write some Python programs to play with that data, parse it, clean it up, make sense of it, you know. And then write it into a database. And this might be a slow process, this might be really nasty, and this might be a way to have very clean data. And then we'll write
another Python program to sort of read this, read through it, and it's all efficient and pretty. And then we can produce files and maybe we'll visualize it or do further analysis in R, Excel, or a JavaScript visualization framework. And so in this situation, you will be the person who is both writing the programs and acting as database administrator, and you can, using SQLite Browser, play and look at the database kind of in a raw way. And the first part of this, we are mostly going to be using SQLite Browser just to talk straight to a database. Later,
we'll write Python programs that read and write data and visualize the data. So this is what we're going to do first. And then second, we're going to do this part right here. That's the second thing we're going to do. Now, another really common use of databases, and something you'll run into if you continue learning more about programming, is that you will want to write an online application, like Amazon or Twitter or some company that's got a website, and it stores dynamic data in databases. And so the picture for that is similar but different than the picture we're
going to start out with. And so the way this usually works is that you, the end user, use a web browser and talk to the application, and the developer writes the application software. And that application software stores its data in a database. And we talk to the database using SQL. And all the data is actually stored here and the magic happens. The database server is that database software that's so precious and valuable. And then there's another person, often called the database administrator, who has direct access to the data. And these
roles in medium and large projects are kept separate, mostly because while production is running and live, the developer leaves the data alone and works on, say, the next version of the software. And then the developer has a test version of the application that they run on their computer where they're doing all that stuff. And so this database administrator is a role in a large project where we have to run production carefully and keep production in good shape. So the database administrator has this responsibility for the production aspects of the data. And you
may be working in a situation where you're not actually controlling the data. The database server is on different computers. You have a little special access and you write programs to sort of read the data. And so the database administrator is the person who is asked by the organization to administer that data. The data that we develop, and we'll do this in the second part of these lectures, conforms to a data model. That's the metadata. Is this an integer? Is this a string? You know, how many columns is this? And the data model turns out to
be very, very important. And there's a lot of science to building an effective data model that leads to really good performance. And it's a collaborative activity between the application developers and the database administrator to make it so it's efficient, runs in production, et cetera, et cetera, et cetera. There's a lot of products out there that you may encounter. We're going to be using SQLite. SQLite's a little tiny database, and it's built into so many things, and that's why we like it. But if you're going to work at a large organization, you can easily run
into Oracle, which is the number one commercial product. Microsoft has a thing called SQL Server, which is a commercial product, and it's also very popular and very effective. On the open source side, the more popular ones are Postgres and MySQL. And MySQL recently was sort of bought by Oracle. And there is a copy of that called MariaDB that doesn't belong to Oracle. And so most of the SQL that we're going to learn is common across these database systems because SQL is a standard. But then there are parts that weren't part of the original standard where each
database vendor has done things a little bit differently. But there is a core common subset that does the basic create, read, update, and delete operations. So SQLite is very popular. You probably have it in your cell phone 10, 12 times. Your web browser has a database engine in it. Your car has a few databases in it. And so SQLite is what's called an embedded database system. Python comes with it built in. You just import sqlite3 and away you go. And so it's very, very popular because it's free, it's open source, and it's such
a tiny little piece of software that you just include it in other pieces of software and use it to solve the data management problems of those pieces of software. Like your browser might use SQLite to store your bookmarks. Now you think, oh, how many bookmarks can you have? But what if you need it to be fast? And what if there's like people that have 10,000 bookmarks? There probably are. Do you still want it fast? Do you want to be able to search? And so you get all that by using a database like
SQLite. And so again, we're going to encourage you to download the SQLite browser so you can follow along with what we're going to do coming up next. And so here is the SQLite browser. Here's what it looks like. And it's just a desktop application. And coming up next, we'll start playing with this desktop application and see how it works. So now we're going to make a database. We're going to use SQLite browser. Hopefully you've downloaded it so you can follow along. And I've got this handout, this basic database handout that saves you from having to
type all these things. So bring that up in your web browser. And so that gives you all of the commands that I'm going to type now. And so you can pull them out of either the web page or the slides. So I'm going to bring up the database browser here. Now the thing that's going to happen, you'll see this happen on my desktop. I'm going to make a new database and you have to store it
somewhere. And so I'm going to put it on my desktop and I'm going to call it py4efund. And so we should see a new file on my desktop right there, py4efund. Now that's a file that you don't want to edit with a text editor or anything like that. This is a file that's to be read by SQLite browser and nothing else. Okay, so we're going to create a table and I'm going to make a table called users with a column called name that's a text and a column called email.
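
The same create table step can be sketched from Python with the built-in sqlite3 module. This is a minimal sketch, not the handout's exact statement: it assumes both columns are declared as TEXT, as described here, and it uses an in-memory database in place of the file on the desktop.

```python
import sqlite3

# Assumption: ":memory:" is a throwaway database standing in for the
# py4efund file created with SQLite Browser.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two columns, both declared as text, as in the walkthrough.
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")

# Ask SQLite which columns it recorded for the table.
# Each table_info row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in cur.execute("PRAGMA table_info(Users)")]
print(columns)

conn.close()
```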
So it's already asking me to make a table. I'm going to call this users and I'm going to add a field that is called name and I'm going to make it text. And I'm going to add another field called email and I'm going to make that be text. Now the key thing here is we are in effect making columns and rendering an opinion as to exactly what the column is supposed to be used for. And we're not allowed to violate that. It's not like, oh, we'll do whatever you want because the database
is optimizing its storage based on our contract, and we're effectively making the contract with ourselves. We could make these columns anything we wanted, but we have to keep the contract we make with ourselves. And you can see it's kind of small here. You can see there's a create table and that's on the slide and that's the SQL way of doing that. This user interface is just helping us write SQL. So now I'm going to just say, okay. And if you take a look, you can see that I now have a table
users and I can look at my database structure and the table users and away we go. And so that is creating it. And like I said, here in the slides is the create statement, or on the web page there's the create statement, that could have done it. Now we can insert some data. Let's add a new record to this users table, and we'll give this one the name Charles and the email csev@umich.edu. So now we have a record. So it's kind of like a database spreadsheet. Now that's not the SQL way to do it.
There's SQL sort of going on in the background, but if we really want to do this using SQL, we're going to use the insert statement. And the insert statement looks like this. The SQL syntax sometimes has extra words. Insert into are actually SQL keywords. Then the name of the table, the columns, and then the word values, and then the values in parentheses, in one to one correspondence with the columns. So it looks kind of like a tuple in Python, but we're nowhere near Python right now. Okay, and so that's what we're going to do. And so I'm
going to grab this one for Kristen, and I'm going to go over here to my SQLite browser and say execute SQL. So now I can paste that in and then hit this little run button and that's going to submit the SQL to SQLite and then update that file. And it says query executed successfully and away we go. So if I go back now and I look at the data, I see that there's two things in here. And now I can actually insert all the rest of these. Let's go back to my little bit of stuff here.
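
The insert just run in the Execute SQL tab can also be issued from Python. This is a hedged sketch: the table layout follows the walkthrough, but the in-memory database and the example.com address are assumptions made for illustration, not values from the handout.

```python
import sqlite3

# Assumption: a throwaway in-memory database with the same Users table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")

# INSERT INTO <table> (<columns>) VALUES (<values>): the values line up
# one to one with the column list. The email address is hypothetical.
cur.execute(
    "INSERT INTO Users (name, email) VALUES ('Kristen', 'kristen@example.com')"
)
conn.commit()

rows = cur.execute("SELECT name, email FROM Users").fetchall()
print(rows)

conn.close()
```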
Let's put all these other rows in. It turns out that if I go into the execute SQL and I want to do more than one command at a time, I can put a semicolon at the end of each one of these things and then I can run them all at the same time. I mean, one after another actually is what's going on here. So boom, boom, boom, and I take a look at the data and look, I've got all those things in there. Now, eventually the thing that's going to generate that SQL is a program,
not us. Right now we're being the database administrator, so we're sort of doing things manually. Once things get going, you write programs that do that insert over and over and over again in Python or a web language like PHP or something like that. And so that is the insert. Now, we can get rid of data. And so I'm going to say delete from, those are the keywords. Users is the name of a table. Where starts a where clause. We'll have lots of where clauses in SQL. It's not like an if. In effect, the delete
is going at the whole table and being turned on and off by this where clause. So delete from users, if you didn't put the where clause on, will actually delete all the rows. But where email equals ted@umich.edu, well, that one is going to make it so it only applies to the rows where that is true. So I'm going to go over here in SQL and I'm going to say delete from users where email equals ted@umich.edu and then I'm going to run it. Because it's only one statement, I don't need a semicolon at the end of it.
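
A small sketch of how the where clause limits a delete, run through Python's sqlite3 module. The rows are illustrative (the example.com address is an assumption), and executescript is used here only to run several semicolon-separated statements in one go.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()

# Several statements separated by semicolons, run one after another.
cur.executescript("""
CREATE TABLE Users (name TEXT, email TEXT);
INSERT INTO Users (name, email) VALUES ('Charles', 'csev@umich.edu');
INSERT INTO Users (name, email) VALUES ('Ted', 'ted@umich.edu');
INSERT INTO Users (name, email) VALUES ('Sally', 'sally@example.com');
""")

# With a WHERE clause, only the matching rows are removed.
cur.execute("DELETE FROM Users WHERE email = 'ted@umich.edu'")
remaining = [row[0] for row in cur.execute("SELECT name FROM Users ORDER BY name")]
print(remaining)

# Without a WHERE clause, every row in the table is removed.
cur.execute("DELETE FROM Users")
count_after = cur.execute("SELECT COUNT(*) FROM Users").fetchone()[0]
print(count_after)

conn.close()
```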
And now if I go back and I look at the data, Ted is gone. Okay. Update. So the update says, update is a keyword, users is the name of the table, set is a keyword, and then this is column equals new value, and then a where clause. Again, this update, if we didn't have a where clause, would change every row in the table. And so where email equals csev@umich.edu. Oh, I've got to change that because I already have the name as Charles. So you see the name is already Charles. So I'll just execute here. Make this
be Chuck. So we see it. And then I run it. Then you take a look at the data and it's changed. That's it. That's an update statement. You're doing great. And so the next thing we're going to do is we're going to take a look at how we retrieve data. Now this is the select statement, select star. You have a list of columns, and star means all columns. From is a keyword, and then the name of a table. So this select star from users is the kind of thing you type
all the time. As a matter of fact, it's what SQLite browser is doing internally to cause this to happen. But we can do it by hand by saying select star from users and then run it. And so then we get a little record set that is those four records that are sitting there. We can also throw a where clause on the end of it. So we say select star from users where email equals csev at umich.edu. And that again, the select star from users goes at the whole table and the where clause goes at the
whole table and then filters out all of the things except one record. So the where clause is: send it at the whole table, but then filter based on whatever. And so it only shows us that. Okay, we're cruising right along here. You can also put an order by clause on there. So we can say select star from users order by email. So that's a column. And so that orders by email. Or we can change it to order by name and we can say descending. So that's by name, in descending order.
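
The update, select, where, and order by statements above can be tried from Python as well. This is a sketch under assumptions: the rows other than csev@umich.edu use made-up example.com addresses, and the database lives in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Users (name TEXT, email TEXT);
INSERT INTO Users (name, email) VALUES ('Charles', 'csev@umich.edu');
INSERT INTO Users (name, email) VALUES ('Sally', 'sally@example.com');
INSERT INTO Users (name, email) VALUES ('Colleen', 'colleen@example.com');
""")

# UPDATE <table> SET <column> = <new value> WHERE <condition>
cur.execute("UPDATE Users SET name = 'Chuck' WHERE email = 'csev@umich.edu'")

# SELECT * goes at the whole table; WHERE filters it down.
one = cur.execute("SELECT * FROM Users WHERE email = 'csev@umich.edu'").fetchall()

# ORDER BY sorts the record set, ascending unless DESC is given.
by_email = cur.execute("SELECT * FROM Users ORDER BY email").fetchall()
by_name_desc = cur.execute("SELECT name FROM Users ORDER BY name DESC").fetchall()

print(one)
print(by_email)
print(by_name_desc)
conn.close()
```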
Sorting and selecting are things that databases are really good at. So this is the summary of what I've told you. So the databases do create, read, update and delete: CRUD. And we've done all those things, except we did them in the order create, delete, update, read. That's what we did. And that's the summary of SQL. And so you might be saying why did I take so long to learn such a simple and elegant and beautiful language because it's not really exciting. It's an extremely simple language that's very predictable and you're like that's pretty easy. And it turns out
that some of you may have been using SQL in situations maybe with Microsoft Access or something, where you actually typed in this stuff and you never realized that you were learning a programming language. That's why I like SQL. It's a very declarative language and it's very straightforward. It's much easier to learn SQL than it is to learn Python. Because in Python you have to figure out how loops work and how iteration variables work and you'll notice there's none of that. But the key is we've only started to understand
the power. That's the simple ability to move around and update data and read data randomly using these simple sets of commands. But up next we're going to look at how you do this with data models and relationships and really multiple tables. Hello and welcome to a code walkthrough. In this bit of code we're talking about emaildb.py. This is a beautiful little example in that it sort of reduces talking to the database to kind of its pure essence. And so we'll start out this code and we import sqlite3 just to get the library
there. We make a connection and in databases we sort of end up with an open that's two steps. There's the connection to the database which checks access to the file and the cursor is kind of like our handle. It's not as simple as you just open it and read it but you open it and then you send SQL commands through the cursor and then you get your responses through that same cursor. So C-U-R here is the variable that we're interested in. And the first thing that we're going to do is we're going to, we've got
this file. It will either open this file or create it, and right now this file doesn't exist. It's going to be in the same directory. There's no emaildb.sqlite. So this is actually going to create the file when it runs. And then the first thing we're going to do is drop the table if it exists. Drop table is a bit of SQL. The if exists just keeps this from blowing up if we start it with a fresh database. And in this case there is no file there so we are starting with a fresh database. So this will accomplish absolutely
nothing which is just fine. Now we're using triple quotes here. I'm just kind of using that to make this a little bit easier to read. I probably could pull those lines up a bit. This one's actually small enough that I could, maybe I'll just do that. Let's do that. Let's bring that baby right up and turn this into a single quote. That's short enough. But triple quote is just, this one here is a little longer so I'll use triple quote. So we're going to drop table. That's going to do nothing first time through. Then we're
going to do a create table. Now sometimes your application will have like a README or something that says go run these commands to set the database up. But we're able to just set this database up in this particular application. We'll see later ones where we're going to leave the database and not start it fresh. In this one we could do the same, but we're just going to start fresh by dropping the table. So we'll create it. We're going to have an email and a count. Basically what we're
doing here is we're really going to pretend that this is a dictionary. If you recall when I said dictionary, a dictionary is like an in-memory database. Well, now we're using a database to do a database. But the first thing we're going to do here is pretend it's a dictionary. So that's a little crazy. So these next lines of code hopefully are pretty familiar to you, right? Get a file name, check to see if it's, you know, empty, grab mbox-short.txt by default so we can just press the enter key, and then loop through
it, right? And so this little part right here, this is our basic loop that we're doing. And so, you know, that is pretty normal. And if we look at this line right here, that line makes sure that we only get the From lines. We've done that a bunch of times, and we're going to split it. We're not going to rstrip it because the split's going to take care of that. And then we're going to grab the email address, which of course in the
From line is the second piece. And then we will have that. So now we're going to do some database work. So the first thing we're going to do, this bit right here is kind of like the dictionary part. So the first thing that we're going to do is we're going to select count from our database, that is an integer, where email equals. And this part right here bears some explaining. This is going to be csev@umich.edu or whatever. Now, it is dangerous to put those strings, especially user-entered data, into your SQL. You technically
could. I could make this be email equals csev@umich.edu. I'd have to escape the quotes and stuff. But this question mark is a placeholder. And this is a way to basically make sure that we don't allow SQL injection. Go Google SQL injection to get a sense of what that is. It's more of an issue in online applications. But in this application, we're just being good. So the way this works is this is a placeholder in this SQL that will ultimately be replaced by this. Now, you could have several question marks. We only have one in
here. And so you give a tuple. And if we just put email, it won't turn into a tuple. This is a one tuple, basically. This little weird parenthesis, email, comma, parenthesis. That is a tuple with only one thing in it. And that's just the weird Python syntax. It's rare that I apologize for Python syntax. But that's a little bit less than pretty. But it's OK. It's a tuple. And normally, if there were two of these, then there would be email, name, dot, dot, dot, dot. OK? So this cur.execute is actually not really retrieving the data.
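
The question mark placeholder and the one-element tuple can be sketched on their own. The Counts table follows the walkthrough; the in-memory database, the seeded row, and the hostile-looking string are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")
# Two placeholders take a two-element tuple, in the same order.
cur.execute("INSERT INTO Counts (email, count) VALUES (?, ?)", ("csev@umich.edu", 4))

email = "csev@umich.edu"
print(type((email)))   # parentheses alone: still a str
print(type((email,)))  # trailing comma: a one-element tuple

# One placeholder, so the parameters are a one-element tuple.
cur.execute("SELECT count FROM Counts WHERE email = ?", (email,))
row = cur.fetchone()

# A value that looks like SQL is passed as plain data, not spliced
# into the statement, so it simply matches no email.
nasty = "x' OR '1'='1"
cur.execute("SELECT count FROM Counts WHERE email = ?", (nasty,))
nasty_row = cur.fetchone()

print(row, nasty_row)
conn.close()
```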
In a way, it's looking at the SQL and making sure that maybe it might verify that the table name is right or if there's any syntax errors, et cetera, et cetera. So this actually is not really reading the data. But we have prepared this cursor. This is kind of like the opening of a file. But what we're opening is a record set. We're opening a set of records that are going to be this wherever it's true. So it's like we're going to read this like a file. Now, later things will loop through this. But we're
only going to say, hey, grab that first one. We could have even put maybe a limit clause on there or something. Grab the first one and give it back in row. And so row is going to be the information that we get from the database. And so if there are no records that meet this, then row is going to be none. So here's kind of, again, like the get. Here's like the get, where if the row wasn't there, because the way we're doing this is we're going to end up with this row in the database.
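
The comparison with a dictionary's get can be made concrete. In this sketch the `lookup` helper is a hypothetical name introduced for illustration, and the seeded row and the example.com address are assumptions; the point is only that fetchone gives back None when no record matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")
cur.execute("INSERT INTO Counts (email, count) VALUES (?, ?)", ("csev@umich.edu", 4))

def lookup(cur, email):
    """Hypothetical helper: return the first matching row, or None."""
    cur.execute("SELECT count FROM Counts WHERE email = ?", (email,))
    return cur.fetchone()

seen = lookup(cur, "csev@umich.edu")     # a row comes back as a tuple
unseen = lookup(cur, "gen@example.com")  # no matching record: None

# The dictionary version of the same question.
counts = {"csev@umich.edu": 4}
print(seen, unseen, counts.get("gen@example.com"))

conn.close()
```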
Here is this database. And there's going to be two columns. And there's a bunch of rows. And then here's going to be csev with 4 and gen with 3 and steven with 6, right? So these are the counts. And so we're grabbing this value out if it's csev that we're grabbing. And that's going to come into here, right? That's going to show up in here. And that row is actually, it turns out that the row is a tuple, but we're only getting one thing. And what we really are doing is if we searched through and we got through and there was
nothing, then row is None means that there was none and we're seeing, like, gen for the first time and we have to insert it. So if row is None, we're going to run an insert statement. Insert into Counts, email, count. Now we've got to set it to one because it's the first time we've seen it. So values, and then again the question mark. The question mark basically says, hey, I'm going to have a value in this tuple and there's an ordering to the tuple. And so there's only one question mark placeholder here
and then one is the initial count. So email, question mark, count, one, away we go. And so then, again, we have a tuple that gives this execute statement, just like in that earlier execute statement, the corresponding strings or integers that are to be placed at each of the question marks. So when this runs, there's going to be a new record and there's going to be a one that's put into that new record. If on the other hand we pull back a row that exists, we're going to get this 4 number.
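
The insert branch just described, together with the update branch the walkthrough turns to next, can be sketched as one small function. This is a sketch under assumptions rather than the code from emaildb.py: the `record` helper, the in-memory database, and the example.com address are introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")

def record(cur, email):
    """Hypothetical helper: count one more sighting of an email address."""
    cur.execute("SELECT count FROM Counts WHERE email = ?", (email,))
    row = cur.fetchone()
    if row is None:
        # First time we see this address: start its count at 1.
        cur.execute("INSERT INTO Counts (email, count) VALUES (?, 1)", (email,))
    else:
        # Seen before: let the database add one in a single statement.
        cur.execute("UPDATE Counts SET count = count + 1 WHERE email = ?", (email,))

for address in ["csev@umich.edu", "gen@example.com", "csev@umich.edu"]:
    record(cur, address)
conn.commit()

result = cur.execute("SELECT email, count FROM Counts ORDER BY count DESC").fetchall()
print(result)
conn.close()
```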
And you might think we want to take this 4 and add one to it, but in databases it's always better to do an update because there might be multiple applications that are talking to this database at the same time. So what update does is, in a single atomic operation, it turns whatever this number is into one higher and we don't have to worry about other pieces of code potentially modifying it. Now in this case we don't have to worry about that because we're the only piece of code, but using update to increment something is way
better than reading the value, adding one inside of Python, and then writing the new value back, which is two SQL statements but is also not atomic. So if the row is not None, if the row exists, we just know that it exists and we just want to add one to the number. We do have the number sitting here in the row variable but we don't need it. And so we're going to say update Counts set count equals count plus one, count being the column name, where email equals and then another placeholder and
then another tuple for the question mark. And so that's what this little bit of code does. That is kind of the read it, parse it, check to see if it's there, if it's not, insert it, if it is, update it. And so then we see this conn.commit(). And this conn.commit(), basically the way it works is that the database is efficiently keeping some of the information in memory and at some point it has to write all that stuff out to disk. So you can choose where you put this commit. Right now we're going
to commit every time through this loop but you might commit every tenth time through the loop because the commit will take some time because it forces everything to be written to disk and these can run really fast and the commit is the slowest part here. So sometimes we do things like commit every tenth record or every hundredth record. If it's an online system which is not what this is, you have to commit at the end of every sort of screenping. But for this kind of a system because we're putting so much in, this is kind
of a bulk insert, we might come up with a thing where we, you know, every tenth time we do a commit. But ultimately what this will do when this is running is it will build up slowly but surely adding new records and then one one and then it will build two and a three and all these things and add another one, that will be one. It will do this thing, right? And then at the end of the day that is what's going to be in the database. Now, so now we're, so let's take a look
what's in the database, and now we can actually read the database. And so in the database we're going to run a SELECT, and we're going to select the email and count from Counts, ORDER BY count, descending. So look at that, isn't that cool? We're getting the top ten, because databases are good at sorting and they're good at all these other things. So we're going to then execute this and then we're going to ask for the rows one at a time and the rows are going to be a tuple and row
sub zero will be email and row sub one will be count. So we run all this stuff and then we close the connection and away we go, okay? So let's go ahead and run this. Let's go ahead and run all this stuff. python3 emaildb.py. It asks for a file name, mbox-short.txt. I can hit enter, right? mbox-short.txt and that's it and it looks just like that and it counts it and away we go. Now the difference is at this point we have a file, emaildb.sqlite and we can run the SQLite browser and
we can then open this database and we can see what's in there. So here we go. It has made an SQLite database. We have a table of counts and then we can take a look at the data and there we go. We've got the data and we can do this. And so let me close this. It's important at times when you don't want necessarily to have, well let's see if we can cause it to lock up. Let me run this again and it's going to drop this table. So I'm going to run the code again
but this time I am going to do the full one, mbox.txt. Now we'll see what happens here, but it ran. And now what we see here, this data is from the previous run, but if we want the most recent one we hit refresh and then away we go and so we can see this stuff. And so this is just a real simple start to see how you can connect some of the stuff that we've been doing but store the data in a database. But the nice thing about the database is
that it can store this stuff from run to run. Even though in this case we're dropping the table every time in later things we will see how we can store data from run to run to give ourselves more restartable processes. Cheers. We're going to do some code walkthrough and if you want to follow through with the code you can download the sample code from Python for Everybody. And so the code that we're going to play with is the Twitter Spider code that is both talking to the Twitter API and talking to the database. And so
what we're going to be doing is we're going to run code that's going to hit the Twitter API much like we did in a previous chapter and we're going to retrieve the data but we're going to remember the data so we don't have to retrieve it again. And so we're going to keep track of people's friends and what we're doing here is sort of illicitly pulling down, slowly but surely, subject to our rate limit, who our friends are. And so let's take a look. We're going to use urllib and urllib.error,
twurl, which is code that augments my URL to do all the OAuth calculation. We're going to get JSON data back. We're going to make a database, and we have to import ssl because of the way Python doesn't trust any certificates no matter how good they are. So this is our URL to talk to the Twitter API. We're going to make a database and again the way SQLite works is if this spider.sqlite doesn't exist, it creates it. And we get ourself a cursor and we're going to do a CREATE TABLE. This IF NOT EXISTS is not supported by some
SQLs, but SQLite 3 does support it. Create the table if it doesn't exist. We want to start this over and over; unlike the tracks example, I want to start this over and over and not lose data. And this is a spidering process and we'll see a lot of these where we want a restartable process where we use a database. So if we're starting with nothing and there's no spider.sqlite file, it creates this table, and it's the name of the person, whether we retrieved it or not, and how many friends this person has that
we know of in our database. Now this little bit is to deal with the SSL certificate errors. The certificates are totally fine but Python doesn't trust any certificates by default which is frustrating but whatever. So here we're going to have a loop. We're going to ask for a Twitter account. We have to type quit to quit. If we hit enter in this case we're going to actually read from the database an unretrieved Twitter person and then grab all that person's friends. And so then we're going to do a fetch one, get one and that's going
to get the name of the first person, the sub zero. If we had more things than name here, sub zero is the first of those. Fetch one means get one row from the database and sub zero means the first column of that first row. And if this fails then we've retrieved all the Twitter accounts. And so we're going to augment this Twitter URL using twurl; you can look at the twurl.py code. This basically requires the hidden.py file which has your keys and secrets in it. You've got to get hidden.py updated. I've got it updated
but I'm not going to show you because it has my keys and secrets in it. And so we're only going to take the first five, which means we're probably not going to find friends of friends of friends. It's only the five most recent ones. We could run this with a much higher number to get to where we have more than one friend. We'll show the URL while we retrieve it. We will do our urlopen. We'll do a read and then we'll do a decode; the read will give us
data in UTF-8 and then decode will give us data in Unicode, which is what we need inside of Python. We will ask for the headers from the connection. We'll say give me the headers, give me a dictionary of the headers, and the x-rate-limit-remaining header from the Twitter API tells us how many calls we have left before we're told we can't use this API anymore, because this is one of those rate-limited things. And then we're going to parse and load the data that we got from Twitter and get a, I think it's a list. Yeah, it's a list.
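To make the read, decode, headers, and parse steps concrete, here is a small sketch that runs without touching the network. The bytes and the header list are made-up stand-ins for what the connection would hand back, the payload shape is invented for illustration, and the header name is taken from the description above rather than from a live response.

```python
import json

# Hypothetical stand-ins for connection.read() and connection.getheaders().
# The values are invented; a real response would come from the API.
raw_bytes = b'[{"screen_name": "friend_one"}, {"screen_name": "friend_two"}]'
raw_headers = [("content-type", "application/json"),
               ("x-rate-limit-remaining", "11")]

text = raw_bytes.decode()      # UTF-8 bytes -> Python str (Unicode)
headers = dict(raw_headers)    # list of (name, value) pairs -> dictionary
remaining = int(headers["x-rate-limit-remaining"])

js = json.loads(text)          # parse the JSON text into Python objects
names = [user["screen_name"] for user in js]
print("Remaining", remaining)
print(names)
```

The point of the sketch is only the order of operations: bytes come in, decode turns them into a string, and json.loads turns the string into lists and dictionaries you can loop over.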
And then we could dump this if you want, and in yours you can uncomment that. And then, we've just retrieved this person's screen name and their friends. And so the first thing we want to do is update the database and change retrieved from zero to one. And that's because we're going to use this to know which accounts are unretrieved. So retrieved being one means we've already retrieved it, and we did retrieve it, so for that account we set it to one. And then what we're going to do is we're going to
parse that. And so this is similar to the Twitter code we did previously in the web services chapter. We're going to go through all the users. We're going to find their screen name. We're going to print the screen name out. And then what we're going to do is see if, let's see. So we're going through all the users who are the friends of this person and we're going to say, oh okay, let's select the friends from Twitter where the name is the friend person. And what we're going to do is we're going to, if we're
going to do a cur.fetchone() of this, from Twitter, the name of the friend, this is the friend's screen name, right? So we're going to say, oh okay, if we get this, we're going to get that friend screen name and we're going to get how many friends this particular screen name has. If we find a row, we find it in there, we're going to do an update statement and add one to their friend count, how many friends they have, and then keep track. This count here is not in the database. It's just so I can
print it out at the end. If there is no record for this particular friend, we're going to insert them as new and we're going to say, here's the new person that we just saw. Here, that's their name. We're going to set retrieved to zero and we're going to say that they have one friend, okay? And then we're going to commit the transaction and then we're going to close this at the end, okay? So let's go ahead and run this. The first time it's going to create an empty database. So I'm going to say python3
twspider. So ls *.sqlite, nothing there. python3, oops, that's because I removed it. python3 twspider.py. Okay, so I'm going to start with a Twitter account, drchuck. And so it's doing its retrieval and don't worry, showing the token and the signature is not dangerous because you don't have the keys or the token, I mean the secrets and the token secrets. So don't get all too worried. So I have 11 calls left, so I got to hope this all works. One of my friends is Stephanie Teasley, and these are in reverse order. So let's
grab Stephanie and ask for Stephanie's friends. So now we just retrieved Stephanie's friends and here are Stephanie's most recent friends. And I can just hit enter and it will randomly pick. Let's see if I can look in the database. Let's open this up, file open database. Hope I don't lock myself. Sometimes it's a little scary when you look at the database and you're just checking. So this is what my database looks like. We retrieved Stephanie and she has, this is how many people. So these are the friends of Stephanie and me and these are how many,
I'm not in there. So we retrieved Stephanie, which was a friend. So let's go grab, oh I don't know. Let's grab Tim McKay and get that one. Remaining 10, I don't have too many of these. Tim McKay, right? So there we go. Remaining nine. And so if I do a refresh on this, then you see I've got some more folks. If I hit enter here, it will retrieve, it will pick one randomly based on retrieved being zero. So it won't pick Stephanie or Tim because they're one, but we have lots of other folks to
pick randomly. And we'll hit enter. So it picked, who did it pick? It picked screen name LiveEduTV, which is ironic because I'm recording this on LiveEduTV right now. And so we can keep hitting refresh and away we go. So I'm gonna stop now because I only have eight remaining. And so I'm gonna type quit. And so we will see how that works. So that's how it works. Now remember that you've got to edit the hidden.py file to make this work because we are talking to the Twitter API. If you don't edit that file, it won't
work for you. Okay, so I hope you find this useful. Cheers. So now we're gonna take a look at how we deal with more than one table, multiple tables. Because the real power of SQL and the power of database performance has to do with when you start connecting tables together. If you go back to that original mathematics, it models data at the intersections between the row and the columns. And these intersections are the magical bits. And so breaking an application to use multiple tables is an art form. It takes a while. There are some simple
basic things that you can learn and we'll teach you here. And so it's not too hard to learn the basics, but then it's much more complex to be super skilled at it. And in general, advanced databases, in my mind, it's hard to teach advanced databases because they're always so contextually grounded. You know, something like Twitter or Google, the databases are so specialized. Everyone can do small to medium-sized databases using the basic techniques, but at some point, once you escape medium-sized databases, you end up in these sort of narrow things and
optimize each database very separately. And so I just tell people, you know, learn the basics really, really well, write programs, and then go do real work. But database design is the act of figuring out the data that your application is going to want to store and spreading that across multiple tables. But we don't just do it randomly. We do it very much cleverly. And if you look at a data model, this is what it looks like. And what we're showing here in this data model is we are showing five tables, and this is kind of
a calendar kind of a system, and we're seeing the columns that are in each of the tables, and then we're seeing the relationships between the tables. And even in these relationships, there's kind of a little bit of code, and when you have an arrow that looks like that, there's many of those to one, and this is a many-to-one relationship. Many-to-one relationship. We'll talk all about that stuff. But if you go into an organization and you have a really large and complex data application, they might have something printed out on the wall that looks about like
this, which shows the database tables and connections, et cetera, et cetera. And they might say, oh, your job is to go down in this little corner, add one column field there, and then do this, and then connect it with this thing over there, and then make a screen that shows all these things that pulls from this table, this table, this table, and that table, and that's your job if you're a programmer on a large software development project. These database models become sort of like the core backbone of the knowledge that applications are managing and using.
So the idea is that you take your application, we're going to start really simple, we're going to take your application, and you have to draw a picture. And the basic rule, and literally you could spend course upon course learning about database normalization, but I'm going to distill it into one basic rule, and that is never put the same string data in twice. So my name, Charles Severance, if I build a database well, you should go into that database and you'd say, okay, the words Charles Severance, which is the name of a person, me, in that
database, only shows up once. And what we do instead is we connect things together and model my name as a connection to the record that has my actual name in it, rather than putting my name all these other places. And so the idea is to pull duplicate data out and make only one copy of it. So there is the users, and in there is the user's name, and the user name shows up only here, and everything else points to the particular user entry. So that's the idea. And so here is our first application. We are
working as a startup. We just quit all of our jobs, and we are going to build a music management application. I mean, what a great idea. Don't you think that'll be quite successful? And so we have mocked it up, and we have figured out that this is what our music management application looks like. We want to track people's tracks, know something about what artists and albums and genre they are, and have ratings and how many times we've played them, and how long they are. Well, that's the data that our application needs to represent. And we've done testing on
this, and wireframes, and everyone loves this. It's a great user interface. And so this is how it's got to look. But we're going to have billions and billions of tracks in these things, and so we want to come up with an efficient database to handle this. And so we're going to take a look at this and look at each of the columns, and we're going to ask ourselves, is this column part of one of our existing objects, our existing tables, or is this a new object, so we have to create a new table? And then once we've defined those
different objects, we connect the tables together and model the connections. Now, a little trick to kind of make it a little easier on ourselves is we can look in these columns, and look for the columns that have duplicate information vertically that's string information. So a rating is just a number like zero through five. So we don't worry too much about integers and numbers and that kind of stuff, or whatever. But we do look for strings. And the problem here is we've got these strings that occur many times, and so these are the problems. And so
we have to put these things where there is replication of string data kind of in the vertical dimension. We have to put those in different tables. And so we'll start up. Now, the first question that you have to ask yourself when you're going to draw this picture of how this data is in multiple tables and connected together is what is the first one that you're going to write down? And this is an interesting debate, and often people are sitting in a conference room, and people who have experience kind of know what to do. Usually if
it's a multi-user system, like a learning management system, the users might be the central concept. Perhaps the courses might be the central concept. This is a single user system, and so you can think, well, what is really this application about? It's not about people. It's one person. But it is about tracks. And so we can say, okay, here we'll take the track is probably the sort of most foundational notion of this application. And then we can take and say, okay, now that we've decided that tracks are the foundational notion, which of these columns are simply
an attribute of the track? And really the cheating way and the easy way in this particular one is these numbers, all these numbers, like this number and these numbers. Not that one. They just go along with track. And so we'll put that in. We've got the track title, rating, length, and count, and we put that in. And then the remaining things are, we've got the artist, we've got the album, and we've got the genre. And so we can say, okay, well, we've got some vertical duplication, so we're
going to say, okay, this track probably belongs to an album. So let's pull out the album into its own table. Oops. Pull the album out into its own table. And so that pulls that out. And then you say, okay, what would be the next thing that we're going to pull out? So we've pulled out the track. We've got this taken care of, this taken care of, that taken care of, now we've got the album. Well, albums belong to artists. So let's take out the artist. And then we'll pick where
the genre belongs, and we'll just say that the genre belongs to the track. And so because there might be albums with more than one different genre. So each album is not necessarily a rock album. It could have a rock track and a country track, et cetera, et cetera, et cetera. And so now what we've got is we've got four tables, right? We've got a track table. We've got an album table, an artist table, and a genre table. And if we sort of double check, all of the columns that had vertical duplication in them now have
their own little table. So the next thing we'll do is to show how we're going to eliminate this vertical data replication by showing how you represent these relationships that we just created inside of the database. Now we're going to represent these relationships in the database. And again, what we're trying to solve here is this notion of database normalization, third normal form. There is so much theory, right? But in this lecture, I'm just going to condense this down to don't replicate string data and use what are called keys, use integer keys to point
at those things. And we're going to use these integers then to point. So assign each row an integer, and then we're going to point from one row to another using those integers. And so we're going to add these special key columns to each of the tables. And the database will even give us help managing those. So we still need to keep track of who is the creator of the album, which album a track belongs to. We've got to create these relationships and we have to come up with ways to store those relationships. And
so the idea is we're going to have a column in a table which is the key column. And we're going to call this the ID column. And so this is a row, it might have many bits of data here, but in this case it's just the name of an artist. So this album is going to belong to an artist. And we're going to assign a number inside the database. And so that Led Zeppelin is one and AC-DC is two. And so we have this key, this is called a primary key. And then later when we
want to say that the Who Made Who album really was done by AC-DC, we put the number two in. And so the difference here is instead of saying AC-DC in this record we just put the number two once we've established this number. So we assign keys and then we have these pointers that point back. And so that's how we model a relationship with these small integer numbers. And so there are three basic kinds of keys that we use. One is the primary key and that is that little ID column that is just a number. But
once we give Led Zeppelin the number one, Led Zeppelin has got the key one for the rest of that database. The logical key is the text area that we use that you might look up. So the title of the band or the title of the album, that's the logical key. And then the foreign key is one of these keys that is really pointing to the primary key of another row. So that's called a foreign key. And you might think that you want to use something like an email address as the primary key for a user
table or something like that. The logical key should always be separate and there should always be a primary key, that integer number. Because things like logical keys do change. People do get new email addresses. And if you've got that email address as a foreign key pointing all over the place, it doesn't work out so well. And so that's why you use these small integer numbers that have no meaning outside. So sometimes if you're on a system and you see a URL and you see some number like 422,016, you're like, oh, that turns out to probably
be my primary key in their database. So sometimes you can look in a URL and you can see these primary keys in the URL, but they don't mean anything outside of that particular system. So like I said, a foreign key is a key that is really pointing at a row in a different table. So the album has a primary key for it, but the artist_id points to a row in the artist table, as we will soon see. I have a naming convention. And in my naming convention, in this lecture, I use ID for
the primary key. I use uppercase for the table names. And then artist_id says this is just a key that points to the ID key of the artist table. And so that's what I do, so you'll see. In all my stuff, I'll use that. It's a convention. It's not something SQL forces you to do. But you will find when you go to organizations and work on their databases, these conventions are very important. So I can do something and you can understand the rules by which I
created. Some of these, you'll find this used by some people. You'll find completely different conventions, and that'll be okay. Whatever convention your organization uses, learn that convention. So now we're going to talk about how we put these keys in and then how we actually make the connections from one row to another row. So now that we know what a primary key, logical key, and foreign key are, we're going to actually start putting these together and creating tables that have these kind of values in them. So when we were done, we drew this picture that was
sort of a logical model of how our data would be spread across four tables and how those tables are connected. Now we have to take this and we have to map it in a way that leads to the needed columns in each of our database tables. And so here's what we do. For each of these, when we're going to build a track table, we add a primary key. So we just added an ID field to every one of these
things. And that's so we have a place to store the sequence number of this particular row. We have logical keys. We've just marked those. Those are strings. And then we have things like, you know, rating, length, and count. They just kind of go in here. And now we have to model a relationship. So what we do is, in the table where the relationship starts, we put one more column in, and this is the one I will name album_id, and that is just an integer column that's going to record the album ID. So this
might be 16, and then 16 goes in here. So there's one of these columns that's a foreign key that points to this. And that's why it's foreign. This is a key that's not in the track table. This is a key in the album table that we're pointing to. And so there's a foreign key. And that's what we have to do. And we just do that over and over and over again. And we quickly convert that picture that was a logical picture to having every table has a primary key. And every time we have a starting
point, we have a foreign key, foreign key, and then foreign key. And then we mark these things as logical key, logical key, logical key, and we'll see how we do that. And so that's the picture. Now we have a picture of exactly how we're going to lay these tables out in the fields that we need in these tables. So we're going to do a create table statement. And I've got this create table statement sitting there. And so this one's going to be a little bit different. We're going to say create table artist. And the ID
field is integer. And we're going to add all of this stuff. This is adding to the column to tell it additional stuff. It's a primary key, which means we're going to use it to look up a lot. It's automatically incremented, which means the database is actually going to provide this number for us as we insert records. It's not allowed to be null. It's not allowed to be empty. And it's supposed to be unique. And then the artist is going to have a name column, a name column that's just text. So let's do that. We already
have our users. And now we're going to do a create table in this SQL. And you can do that. That's okay. That's totally fine. And we have to get this right. And we say away we go. And so now if I take a look at database structure, I've got that users table we were playing with before as well as this artist table. Let me go ahead and delete this users table just to say goodbye. Okay, so now we have the artist table. And we take a look. And it's got an ID.
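If you would rather try that create table from Python than from the SQLite browser, here is a minimal sketch. It uses an in-memory database so nothing is written to disk, and the table and column spellings (Artist, id, name) are assumed from the naming convention described in the lecture.

```python
import sqlite3

# In-memory database: a scratch space that disappears when we close it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# id is the primary key; the database assigns it for us (AUTOINCREMENT),
# it can't be null, and it must be unique. name is just text.
cur.execute("""
    CREATE TABLE Artist (
        id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
        name TEXT
    )
""")

# PRAGMA table_info lists one row per column:
# (cid, name, type, notnull, default, pk)
columns = [(row[1], row[2], row[5])
           for row in cur.execute("PRAGMA table_info(Artist)")]
print(columns)
conn.close()
```

The last element of each tuple is the primary-key flag, which is the same thing the browser's database structure view is showing you.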
And it knows all about this stuff. Okay? So that created the table. We're going to keep doing this. The next thing that we're going to show here is we're going to show the foreign key, right? So artist ID is just an integer. In some database languages like MySQL and Oracle, you would put more stuff here to say this is a foreign key, blah, blah, blah. But in SQLite, we keep it simple and just say that is an integer column. That's a foreign key. The album table has a primary key and a foreign key, and then
the title. So we'll go back and we'll grab that text out of my little page. This create table. Go back to execute SQL. And then run that. And we'll continue with just the genre table has an ID on it. And primary key, you'll just copy and paste these. That whole thing, you do that over and over and over again. So we'll go in here and run that one. And so the last one we're going to do is the track table. And the only thing that's kind of weird about the track table is it's got two
foreign keys, right? It's got an album ID and a genre ID. Once you draw the picture, you just sort of literally translate these things. It's got two foreign keys and a primary key that's pretty much just like all those other primary keys. And rating's an integer, count's an integer, and length's an integer, all that stuff. And now we've got it. So if we take a look at our database structure, we're going to see that our album, genre, and track are all set up. And these are the new columns that we just made with those create statements. Okay? So
now let's insert some data. This first insert statement is kind of important to take a look at. So insert into, by the way, the keywords can be upper or lowercase, table name, columns. Now, this table has two columns. It has ID and name. But we told the database that ID was auto increment. So it's going to actually give us the number. It's going to assign the number rather than make us assign. We could make it be one, two, three. But we say, hey, database, you're good at this. Why don't you make it one, two, three?
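Here is a small sketch of that insert done from Python, assuming the same Artist table and an in-memory database. We only name the name column and let the database pick the ID. The cursor's lastrowid attribute is one way a program can find out which number was assigned, so it doesn't need a cheat sheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Artist (
        id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
        name TEXT
    )
""")

# We specify only the name; the database assigns the id.
cur.execute("INSERT INTO Artist (name) VALUES (?)", ("Led Zeppelin",))
first_id = cur.lastrowid    # the id the database just assigned
cur.execute("INSERT INTO Artist (name) VALUES (?)", ("AC/DC",))
second_id = cur.lastrowid
conn.commit()

rows = cur.execute("SELECT id, name FROM Artist ORDER BY id").fetchall()
print(first_id, second_id)
print(rows)
conn.close()
```

In a fresh table the two artists come back as one and two, which are the numbers you would otherwise have to write down by hand.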
And so there is going to be a record that it adds for Led Zeppelin. So let's take a look at that. So we'll insert Led Zeppelin. Oops. Over to SQL. Insert Led Zeppelin and run it. So now if I look at browse data and look at the artist table, you will see that I put Led Zeppelin in, but this ID field here was auto incremented. And so it was put there by the database. And now when we do the next insert, which is ACDC, and we take
a look at the data, we will see that ACDC is two. Now, if you're writing this in a program, if you're going to write this in a program, you can get these numbers back from the database in your program, but I'm not writing this in a program, so I have to remember that one is Zeppelin and two is ACDC. So I'm going to keep myself a little cheat sheet here to remember that because everywhere else in the program that we're going to say Led Zeppelin, I've got to say one now because the artist, the artist
ID of one means Led Zeppelin in those rows. And so now we're going to go back and we're going to take a look at the next one. And now we're going to put the genre in. If you think about it, we're working from the leaves out. The track will be the last table that we'll update because you have to define the keys for things like rock and metal and Led Zeppelin and all those other things. And again, even though the genre table has two columns, ID and name, we're only going to specify the name and
let the database assign the value. So I'm going to insert both of these and use the semicolon trick. Put a semicolon here and a semicolon there. And run that. And so if I take a look at my browse data and I look at the genre, it's assigned one to rock and two to metal. I'm going to write that down. One rock, two metal. I should have done something like rock and country because I can't even tell the difference between rock and metal, but whatever. My musical skill is not what's at issue in this class. So
now we're going to put an album in. The album is the first thing that has a foreign key. So if you remember the thing, the album points to artist. And so that means it has a foreign key of artist_id. And so we have to explicitly say this because the system doesn't know which artist Who Made Who is by. But we know that Who Made Who is ACDC and that's two. And so we know what to put in artist_id. So we'll say INSERT INTO Album (title, artist_id). And so we have to know what this two
number is. And of course because we have our handy little cheat sheet, we can go over to execute and run that. And I'll put a semicolon there and a semicolon there and run it. And so now in the album table, we now have this. And so this was assigned. And so you still have to write that down. Who Made Who is album one and album two is Led Zeppelin IV. That makes it even more complex because the name of the album is a Roman numeral four. I'm sure I can figure
that out. Okay. So the next thing that we're going to do is we're going to insert the track record. Now if you think about the track record, the track has two foreign keys. And it's got a lot of stuff. It's got the title. It's got the rating, length, and count. But then we've got the two foreign keys. And so we have to know these numbers. So this two, one, this two, one, this one, two, that's the album and the genre. We're specifying the genre and the album that this track is from by those numbers. Now, again, we have to
use this cheat sheet. But if this was a program, the program would know that one was Zeppelin and album one was Who Made Who and two was Led Zeppelin IV. And so this kind of stuff is easier for the program to understand than for us to keep track of and understand. But just so we can get through these few records. And that's why I rely so heavily on my cheat sheet. So here we are with all these numbers. The foreign keys are the tricky part here. Everything else is really quite straightforward. So now I'm
going to insert four records into my track table. And then run that. Okay. So I'll browse data and I look at my track table. This column here, this ID, that's the primary key of the track table. And then here are the two foreign keys. Now, the interesting thing is now there is replication in these columns, but the numbers are what's being replicated and that's okay. We went a long way just to not put Led Zeppelin IV in twice. We could have made this a string, but by making this an integer, it saves tons of storage
and makes it super fast. That turns out to be one of the key things that makes databases super fast is using these integers. So we take a look at all this stuff. We see that in a sense by using these little numbers, we are pointing to rows in other tables. The foreign keys are always pointing. They always point to their ID. So these foreign keys are out here. This is the primary key up here. And they always point to a row in another table. And so we have modeled all those relationships. And you will notice
that in this entire database, Who Made Who only appears once. The word Rock only appears once. The word AC/DC only appears once. What we have is duplication in our data, but we are duplicating the relationships, i.e. these little integer numbers, not duplicating the data itself. And in something this small, it seems irrelevant. But if you have billions of records, or hundreds of millions of records, it is very relevant. Very, very relevant. So the next thing we are going to do is take a look at how you actually reconnect all this stuff together
once we have sort of blown it out using these foreign keys and hand-constructing all these relationships, now how we bring it back together to show the data to the user. So now that we have carefully constructed our relationships in the tables, we need to reconstruct the data to show our users. And you can kind of see how you would go pull this stuff together, but there is a wonderful capability in relational databases called join that brings this all back together. And so we have done this for efficiency of storage, efficiency of scanning, etc. But we
do need to traverse these foreign keys at times. And the database software will do this for us automatically. So the join operation basically is a way to specify in a select statement that you want to pull data out of more than one table, and then to specify, using what is called the on clause, exactly how you want that data pulled out. And so here we go. We already have an album table, the artist table, and the foreign key. And we want to, in effect, pull data from both the album and the artist, the
album title and the artist name. And we want to show that. And so we're going to say select, which is the same select statement. Here's a little different syntax. This is the list of fields. This is table.field. So it's the album.title and the artist.name, comma there, from the album. And I always start with where the little arrow starts from, album joined with. So that is going to walk down this connection from album to artist. Album joined with artist. You don't say with in the SQL, I just say it. On, and then these are the conditions upon which that
join is going to happen. When the album's artist ID, which is this column here, album's artist ID matches, think of that as is equal to or matches the artist's ID. And so it only connects the rows here when there is a match between these two tables. And so if we look at this and we see that this one matches this one and this one matches that one. And so the join connects conditionally and it connects when the on clause is satisfied. And so when this whole join runs, this is what we get. So you select
all this stuff. Now this is an abstraction. Are you writing a loop? Are you doing two nested loops? How are you exactly bringing all this data together? We don't care about that because that's the beauty of SQL. That's the beauty of how we do this in a database. So now if we can just run this command, so let's grab this command. Select track title, genre name, from track, join genre, that exact query. Case of keywords doesn't matter. And we go over here and we run this as SQL. And we run it. We get, oops, I
got too far. Let's do this one. So let's do that one there. Select artist name. I have to add that one to my little cheat sheet. The next time you see the cheat sheet, it'll be right. So the title, so this is coming from one table and that's coming from another table. And so that's one. So here is something we can do that gives us a little more detail on that. We can say, so this is where the connection is. So you can think of the join as sort of spreading one table and connecting it
to the other table. And so what we're going to show here is exactly the same query. The thing we're going to do is add these two columns so you can see where the match happens. And so this is one table. This is another table. And these are kind of the columns in common, even though they're not. They're the columns that match. This is where the on clause is happening, right? We have taken this table joined with this table on these two things connecting with each other. In
some variants of SQL, this would even be a where clause. So you connect these two rows, but only connect them when those two numbers match. So you can see, I mean, if we run this, I'll just run this. And again, you just see this is where it connects. Now, interestingly, we can see what happens and what the purpose of the on clause is if we omit it. So this is exactly the same as that previous query, except there's no on clause. So it's select all four of those fields from the track joined with the genre.
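As a rough sketch of that difference, here is the same kind of query run both ways from Python against a throwaway in-memory SQLite database. The table and column names (Track, Genre, genre_id) are assumptions that follow the naming convention used in the lecture, and the handful of rows is only illustrative. It is not the exact query from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Assumed, simplified tables: Track.genre_id is a foreign key to Genre.id.
cur.executescript("""
CREATE TABLE Genre (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Track (id INTEGER PRIMARY KEY, title TEXT, genre_id INTEGER);
INSERT INTO Genre (id, name) VALUES (1, 'Rock'), (2, 'Metal');
INSERT INTO Track (title, genre_id) VALUES
    ('Black Dog', 1), ('Stairway', 1), ('About to Rock', 2), ('Who Made Who', 2);
""")

# With ON: a Track row is paired with a Genre row only when the keys match.
matched = cur.execute("""
    SELECT Track.title, Track.genre_id, Genre.id, Genre.name
    FROM Track JOIN Genre ON Track.genre_id = Genre.id
""").fetchall()

# Without ON: every Track row is paired with every Genre row.
all_pairs = cur.execute("""
    SELECT Track.title, Track.genre_id, Genre.id, Genre.name
    FROM Track JOIN Genre
""").fetchall()

print(len(matched))    # one row per track
print(len(all_pairs))  # number of tracks times number of genres
conn.close()
```

Printing the two row counts side by side is the quickest way to see what the on clause is filtering out.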
So it's basically taking the track table and the genre with a join, but no on clause. So it's not filtering for matches. This is a match. This is a match. That's a match. That's a match. But we don't have an on clause, so the matchingness doesn't matter. And so you're going to get all possible combinations. And literally, if there were 10 on one side and 30 on the other side, you would get 300 rows in that join. So it'd be all combinations, except the on clause reduces the combinations. And you might think, whoa, this is
really inefficient. And I will say that's what my first reaction was when I first saw this, but it's not inefficient. That's the beauty of abstraction. That's the beauty of SQL. You say, do it, and it just figures that out. So let me grab this, and you will see that we can run this one as well. And that kind of gives you why the on clause is important, because now we have a whole bunch of these things. And the on clause just filters that out. So if we would just add the on clause back in, then
that would only show the ones we showed on the previous slide. So that's why the on clause is important. The join is like all possible combinations of all pairs of rows between these two tables. On is, oh, but only where these two things match. And you might think that it's inefficient, but the on clause turns out to be the way it becomes efficient. So now we're going to do the same thing where we're just going to take the track title and the genre. We're going to connect that together. So we select this. We need to
join from one table, join to the genre table with an on clause. And so we're going to make those connections. And the only thing we're going to look at is the title and the genre name. Oh, oops. And then run that. And so we got the title and genre name. Now the thing you'll notice is for the first time, we now have replication of string data in a vertical dimension. That's okay, because the data is not replicated in the database. The data is now replicated as a result of the join. And so we are going
to reconstruct what the user wants to see. The user originally, all the way back at the beginning, wanted to see the duplicate information in the vertical axis. But now we're reconstructing it. We didn't waste the space or performance in our database, but we still have to show it to them. And so now the next thing we're going to do is a monster. We are going to reconstruct across all four tables. And you might think this is really hard. And sure, it's going to be a little tricky, but as long as you follow the naming convention and
the naming convention makes sense, we're going to do a select from the track's title, the artist's name, the album's title, and the genre name. From the track, join genre, join the album, join artists. And so the joins follow the little arrows, right? And then the on clause qualifies each of those arrows when to follow the arrow. And then this becomes pretty easy. It's a foreign key. The track's genre ID, that's a foreign key, equals genre.id. The primary, that's primary key, that's a foreign key because I name it that way. And I know that this goes
to that genre table because I name it that way. And track's album ID is equal to the album's ID, foreign key, primary key. And album's artist ID is equal to artist's ID. After a while, you can type these pretty fast as long as you follow a naming convention and you know the naming convention. So this looks like it's really hard to do, but after a while, it's really just a pattern. So let's go ahead and run that one. And it will, assuming we've done everything right, replicate all the data. So there's all kinds of vertical
data now being replicated. Every column has vertical data. Again, it's not in the database, the select and the join are reconstructing vertical data as it needs to be shown to the user. And so, if you've been following along, probably a couple hours later now, we started with a picture that was our mock-up of what we wanted our user interface to look like. And it had vertical stuff, and we're like, ah, we can't put that in a database model. And then we carefully built a database model that didn't have the data, and then we're like, ah,
we've got to reconstruct it. So we use join to reconstruct it. And so, after all that, we went here with a clean little model with four tables all beautifully connected together, and then we had to join it all back together. So join reconstructs it. And again, the key is the storage is efficient, the scanning is efficient, and we still use the join to produce the output that we ultimately want with all the vertical replication that our users really want to see. So one more kind of relationship, that was called a one-to-many relationship. That was actually
three one-to-many relationships. And the other major relationship is what's called a many-to-many relationship. We're going to do some code walkthroughs, actually running some code. And if you want to follow along with the code, the sample code is here in the materials of my Python for Everybody website. So you can take a look at that. So the code we're going to look at is from the database chapter. And we're going to look at tracks.py. So a lot of the lectures that I give in this database chapter are just about SQL. And this is really about SQL
and Python. So I'll go through this in some detail. So the code that I'm going through is in tracks. There's also tracks.zip that you can grab that has these two things. It's got this library.xml file, which you can export from your, if you have iTunes, you can export this, or you can just play with my iTunes. And so this is also going to review how to read XML. So we're going to actually pull all this data. And this XML that Apple produces out of iTunes is a little weird in that it's kind of key values.
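To make that key/value layout concrete, here is a small sketch that parses a hand-written fragment in the same style with Python's xml.etree.ElementTree. The fragment is a made-up miniature, not a real export, and the `lookup` helper is an illustrative assumption rather than a copy of anything in tracks.py.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written fragment in the key/value style; not real export data.
data = """
<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>369</key>
    <dict>
      <key>Track ID</key><integer>369</integer>
      <key>Name</key><string>Another One Bites The Dust</string>
      <key>Artist</key><string>Queen</string>
    </dict>
  </dict>
</dict>
</plist>
"""

def lookup(d, key):
    """Return the text of the element that follows <key>key</key>, or None."""
    found = False
    for child in d:
        if found:
            return child.text
        if child.tag == "key" and child.text == key:
            found = True
    return None

root = ET.fromstring(data)
tracks = root.findall("dict/dict/dict")  # the third-level <dict> elements
print(len(tracks))
for entry in tracks:
    print(lookup(entry, "Name"), lookup(entry, "Artist"))
```

The helper has to walk the children in order because each value is a sibling that comes right after its key, rather than being nested inside it.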
And so you see key value pairs. And it even uses the word dictionary. And so it's like, I'm going to make a dictionary that has this, then a dictionary within a dictionary. This, to me, would be so nice if it was JSON, because it's really a list of dictionaries. This is a dictionary, then another dictionary, then another dictionary, and then the key for that dictionary. And it's a weird, weird format. But we'll write some Python to be able to read it. And so you export that from iTunes. And you can use my file, or you
can use your file. It might be more fun to use your file. So here's tracks.py. We're going to do some XML. And so we import that. We're going to import sqlite3 because we want to talk to the database. And then we're going to make a database connection. And once we run this, you'll see that that file will exist. And so right now, if I'm in my tracks data, that file doesn't exist. But what we'll see is this is going to actually create it. Now remember that we have a cursor, which is sort
of like a file handle. It's really a database handle, as it were. And in order to sort of bootstrap this nicely, because this code is going to read all of library.xml every time it runs, we start from scratch. In later things, we won't wipe out the database every time. And so I'm executing a script, which is a series of SQL commands separated by semicolons. So I'm going to throw away the artist table, album table, and track table. Very similar to the stuff we covered in lecture. And then I'm going to
do the create table. And I'm doing this all automatically. And you'll notice this is a triple-quoted string. So this is just one big, long string here. And the editor happens to know that it's SQL. Thank you, Atom, for that. And so it creates all these things. Now it's not quite as rich as the data model we built, because there's no genres in here. And so it's artist, album, track. And then there's a foreign key for album ID and a foreign key for artist ID, which is sort of a subset of what we're doing. And so
when that's done, that actually creates all the tables. And we'll see those in a moment once we run the code. Then it asks for a file name for the XML. And so that's what that is. And I wrote a function that does a lookup. It's really weird, because if you look at these files, like in this dictionary, there is a key. And so the key of this dictionary, this really should have been a key value pair. But so there's this weird thing where the key for an object is inside of the object. And so we're
going to loop through all the children in this outer dictionary and find a child tag that has a particular key. And so you'll see how this works. And this was something I was going to use over and over again. And so the first thing we're going to do is we're going to just parse the string, and this is the string. And then this, of course, is an ElementTree object. And then we're going to do a findall. And so this shows how the findall works: we're going to go to the third-
level dictionaries. We want to see all of the tracks. And so we have a dictionary, and a dictionary, and a dictionary. And so what we want is all of these guys. All those guys right there. Track ID. So we're going to get a list of all those. That'll be the first one. This will be the second one. Because the findall says, find the dict tag, then a dict tag within that, and a dict tag within that. And then we'll tell how many things we got. And then we're going to loop through, and entry
is going to iterate through each of these. And see, we'll get our name, and our artist. Another One Bites the Dust, Queen, and away we go. And then the next time through the loop, we'll hit this one. Okay? So then what we're going to do is we're going to go through all those entries, and if there is no track ID, and that's this track ID field, where are you hiding? Track ID. If we don't have that, we're going to continue. And then we're going to look up the name, artist, album, play count, rating,
and total time. Okay? And so here they are, play count. A lot of those things that we had in the sample lecture that I did. And we're going to look those things up. And we're going to do some sanity checking. If we didn't get a name or an artist or an album, we're going to continue. We're going to print them out. And then we are going to ask for, get, remember how you have to get the primary key of a row so you can use it. So the way we're going to do this is we're
going to do an insert or ignore. And so this or ignore basically says, because I said that the artist's name, go up here, I said the artist's name is unique. Which means if I attempt to insert the same artist twice, it will blow up. Okay, because I put this constraint on that. Except when I say insert or ignore, that basically says, hey, if it's already there, don't insert it again. So what I'm doing here is insert or ignore into artist. So this is putting a new row into the artist table, unless there's already
a row in that artist table. And the syntax right here, you know, the question mark is sort of where this artist variable goes. And this is a tuple. But I have to sort of put this comma in to force it to be a tuple. So this is the way you have a one tuple. And then what I need to know is I need to know the primary key of this particular artist row. Now this line may or may not have actually done the insert. And so I need to know what the ID for that particular
artist is. So I do a select ID from artist where name equals. Now it either was already there or I'm getting it fresh and brand new. So I do an artist ID equals, I fetch one row, and it's going to be the first thing, given that I only selected ID. And so this artist ID is going to be the ID. Now I have the foreign key for the album, right? And so now I'm going to insert into album the title and artist ID. This is the foreign key to the artist table. And I got this value
that I just moments ago retrieved. And I got the album title. But this also is insert or ignore. Because now if you look, I have unique on the album title. Yep, unique's on the album title. So that'll do nothing. It doesn't blow up. Or ignore says don't blow up. Just do nothing. Because this next line is going to select it. And I grab the album's foreign key for either the existing row or the new row. And then I'm going to insert or replace. So what this basically says is if the unique constraint would be violated,
this in effect turns into an update. Now not all SQLs have this, but SQLite has this that basically says insert or replace. Some SQL is totally standard. Something like this select statement is a totally standard part of SQL. The insert is totally standard, but insert or replace and insert or ignore are not totally standard. But that's okay. It works for SQLite, which is what we're doing. And so we have the title, album ID, length, rating, and count. And then we have a tuple that does all that stuff. And of course the
title is unique. The title is unique in the track table as well. And so we've inserted that. So the clever bit here is dealing with new or existing names in these three lines. And we see that pattern twice here where we're doing that. Okay, so there's not much left to do except run this code. Hopefully it runs. python3 tracks.py and library.xml. Whoosh! Okay, so that is my... So we found 404 of those dictionaries, third-level dictionaries. And now it's starting to insert them. Insert them, insert them, insert them. And we can take a look at...
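The "insert or ignore, then select the ID" pattern from the walkthrough can be sketched on its own. This is a minimal sketch against an in-memory database, assuming an Artist table with a unique name column as described above. The `get_artist_id` helper name is hypothetical and is not part of tracks.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Artist (
        id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
        name TEXT UNIQUE
    )
""")

def get_artist_id(cur, name):
    """Hypothetical helper: return the Artist id for name, adding the row if needed."""
    # Does nothing (and raises no error) if the name is already present.
    cur.execute("INSERT OR IGNORE INTO Artist (name) VALUES (?)", (name,))
    # Works whether the row is brand new or was already there.
    cur.execute("SELECT id FROM Artist WHERE name = ?", (name,))
    return cur.fetchone()[0]

first = get_artist_id(cur, "Queen")
again = get_artist_id(cur, "Queen")  # same artist a second time
other = get_artist_id(cur, "AC/DC")
conn.commit()

print(first == again)  # the repeated name maps to the same id
print(cur.execute("SELECT COUNT(*) FROM Artist").fetchone()[0])
```

Note the one-element tuple `(name,)`: the trailing comma is what makes it a tuple rather than a parenthesized string.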
So we can do an ls -l, or dir on Windows. We'll see that we made a track database. We extracted the data from this library and we made a track database. And we have all these foreign keys. So let's go and take a look at the SQLite browser. File, open database, trackdb.sqlite. And come on up. Where'd you hide? I got it minimized, so there you go. Let's look at the database structure. We have an album, this is the structure. Artist and track, we have no genre. And this is all like we did it by hand
except Python did all this work for us. If we take a look at the data and we start from the outside in, we have the artist names and their primary keys. There's the artist names and primary keys. And then we have the albums and we have the artist IDs. See the artist IDs, how nice those are. So we have the primary key here and the foreign key there and then we have the title. And if we get to the track, we have the album ID and away we go. So if I was clever, I should
be able to type some SQL. Oh, great. If I was smart, I'd have had this in a paste buffer. So select track.title, album.title, artist.name, I think. Artist has names and albums have titles, yes. Okay, so I can do that from track, join, album. Oops, album, join. Let me make that a little bigger. Bring that over here. Album, track, join, album, join, artist. I need an on clause and I can say track.album_id equals album.id. Notice how I know the names because I named these things. And album.artist_id, this is so great when you use a naming
convention, equals artist.id. Golly, I think that might work. So let's just see what we get when we type that into the SQL box here. Execute SQL. Run. Yay, I got it right the first time. So that's basically my nice little joined up track list. Oh, I'm so happy that I got that right the first time. Okay, well, so you can play with this yourself. Play with this tracks code, maybe make an export of your own iTunes library and run it with that. And so I hope that you found this particular bit of code useful, okay? Cheers. So
our last major topic is called many-to-many relationships, and up till now everything that we've done is what's called a one-to-many relationship. That is, there are many tracks associated with one album. There are many albums associated with one artist. There are many tracks associated with one genre. And as you look at data models, you'll see they put little labels on each arrow that tell you which end of the arrow is the many and which end of the arrow is the one. And so in this case, the foreign key is pointing to
there are many of these rows over here, many rows that point to one row over here. So it's a many-to-one relationship. There are various ways. Sometimes I'll put two arrows at this end and one arrow at that end. But whatever it is, this kind of thing we've been showing is a many-to-one relationship. And that's probably the most common thing. But there are times when you just can't model things with a one-to-many relationship. So like if you have a mother and children, well that's a many-to-one relationship and it's just fine and that works fine. But sometimes
you have a many-to-many relationship in that there might be many books. One book has many authors and each author has many books. And so you don't have like the one side. There's no one. And so you end up having to build a table that I call a connector table. They call it a junction table on Wikipedia. But we need a little table that allows us to break a many-to-many relationship into, in effect, two many-to-one relationships and a connector table. And so this is a connector table. So you could think of this
as, you know, there are many, many links here but we don't have a way to model the many over here to here. And so what you do is you basically say, oh there's a lot of these things. There's many that go to the one. The many that go to the one. And in here you sort of create that manyness that you want to create. So it's probably just as easy to look at a sample of this. So let's imagine a learning management system where you're taking a class and there are some people that are teachers
and some people that are students and many students are members of many classes. A student can be part of many classes and a class has many students in it. So you can't really find the one end. And so what we do is we make a table called a membership. And in that table of membership we actually often don't put a primary key in at all. We simply put in two foreign keys. And if we're going to put a uniqueness constraint we put a combination of the two foreign keys as the uniqueness constraint. So we say
there can be duplicate user IDs and duplicate course IDs, but there can only be one of each user ID, course ID combination. That has to be unique. So you can make unique be more than one column. And so if you imagine a course table and a user table, the user has an ID, a name, and an email, and the course has a title and an ID. And then we have this little table that is just the connector table that points out to both. And so we can expand this membership. So let's take a look at how that works.
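A minimal sketch of such a connector table, run from Python against an in-memory SQLite database, might look like the following. The table and column names (User, Course, Member, user_id, course_id) are assumptions based on the naming convention used in this lecture, and the example.com addresses are placeholders. The point is only to show that the pair of foreign keys can be declared unique as a combination.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed schema: two ordinary tables plus a connector table with two foreign keys.
cur.executescript("""
CREATE TABLE User   (id INTEGER PRIMARY KEY, name TEXT UNIQUE, email TEXT);
CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    PRIMARY KEY (user_id, course_id)  -- the pair must be unique
);
INSERT INTO User (name, email) VALUES
    ('Jane', 'jane@example.com'), ('Ed', 'ed@example.com');
INSERT INTO Course (title) VALUES ('Python'), ('SQL');
""")

# The same user in two courses, and two users in the same course, are both fine.
cur.executemany("INSERT INTO Member (user_id, course_id) VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 1)])

# Repeating an existing (user_id, course_id) pair violates the composite key.
try:
    cur.execute("INSERT INTO Member (user_id, course_id) VALUES (?, ?)", (1, 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

Either column on its own can repeat freely. Only the combination is constrained.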
So we're going to create some tables, and these are very classic tables because these are the one end of it. So it has a primary key, a title, a logical key, email. There's a primary key for course and then there's text. So we have this unique to kind of indicate that it's a logical key. We're not going to allow ourselves to put any duplicates in here. Now the connector table here is a table called member, and it has two foreign keys, user ID and course ID. And you can
easily model some data here. So I'm going to model role which is going to be zero equals student and one equals instructor. And then I'm going to indicate that the primary key or uniqueness constraint is the combination of the user ID and a course ID. Now when we say the primary key, it both limits our ability to insert duplicates but it also allows the database to optimize its scanning because it knows that that combination is always unique and so it can organize its disk structure and storage structure to understand how to look things up more
efficiently. Knowing that once it's found a user ID, course ID combination, it doesn't have to look any farther because they're unique. And so all of these constraints that we add speed things up, save storage, and make things more efficient, but in ways where we don't always know exactly how it happens. And so let's go ahead and make these guys. I think I will start with a new database. I'm going to call it LMS for Learning Management System. No, I don't really want to do that one. And so I'm going to
not create the table. I'm going to do everything in SQL. And so let me see if it's in my cheat sheet. Nope, that's not in my cheat sheet. So I have to fix the cheat sheet again for you. By the time you see the cheat sheet, all these things will be in there. So I'm going to go in here and I'm going to grab create table user. Actually, I'm going to grab them all. Watch this. Grab them all. Highlight all these. Go over to SQLite Browser. Blast them all in. And then I'll put a semicolon
at the end of each one of the statements. And I want to run them. So does it look good? Yep, yep, yep. So I got a course. I got membership, two foreign keys, and I got user. So that all looks good. So now we're going to have to insert some data in. And we're going to insert from the outside in. And so we're going to just put the name and email. The ID will be automatically assigned for the users. And we're going to do the same thing. And the ID and the courses will be automatically
assigned. So let me just grab all this stuff. Go into SQL. That has the semicolons at the end already for me. Thank you very much. Now I'm going to run it. And if I take a look at my data, now I've got primary keys for the courses. And I've got primary keys for the users. And I've got nothing in the membership table. And I, of course, have to remember what these values are because Jane is one, and Ed is two, and Sue is three, right? And Python is one, SQL is two, PHP is three. And so
when I go into membership, I've got two foreign keys here and a role. And they just have to be for the course person combination. And so it's a little tricky to figure all this stuff out. But again, these are just numbers. And if you look at these numbers, user ID, course ID, role. Well, user ID one is in course one as the teacher. User ID two is in course one as the student, et cetera, et cetera, et cetera. So I'm making these connections by just putting these little numbers in. And
once again, conveniently, I have all my semicolons perfectly in place. So I go to SQL. And then I run that. And then I take and I look at my membership data, and there it is. So two foreign keys and a bit of data modeled at the connection. That's the way we say that. The role is modeled at the connection. So now we build all this stuff up, we can write some queries that take a look at this. And so what we're going to do is we're going to look at who's in what course and what
role are they. And we're going to sort this in a nice way. So let's just take a quick look at the code we're writing. We're going to do a select from three tables, the user name, the member role, the course title. So in effect, we're not showing any of the foreign keys or the primary keys. We're going to go from the user table, join to the member table, join to the course table. This is pretty easy to write. You know there are three tables you want to go across. The on clause is also very easy
to write, right? The on clause models each of these connections, where the member's user ID is equal to the user's ID. And where the member's course ID is equal to the course ID. So we're going to concatenate all three of these tables together, but we're going to only keep rows where it matters. Now this role doesn't participate, but we're going to print that out. And we're going to order it by the course title first, and then the member role second, and the name third. And so let's run that. So we've reconnected it. So Ed's the
teacher of the PHP class. Sue is the student in the PHP class. Jane is the teacher in the Python class. Ed and Sue are students in the Python class. Ed's the teacher in the SQL class, and Jane is the student in the SQL class. And so we have many people, there are many students in many classes there, and so we have modeled that. But we model that with this sort of table. And if you look at a piece of software that I've written called Tsugi, which is a standalone learning management system that's built
with learning tools, you will see a membership table there, where we have a user table, we have a context, which is also the course table, and then we have a membership table, and you look, here are these foreign keys. Like that's the many side, that's the one side, many to one, and so this is now in effect a many-to-many between these two, but then it's modeled as a series of many to one, many to one relationships. And you see this all the time in all kinds of things where membership or other kinds of
things are necessary, many to one, or many to many. So, with all that, there's so much to learn. It's both easy and complex at the same time. It's easy when someone shows you how to do it, but at some point you will learn how to build database models, and you realize, oh, it wasn't so bad. It takes a while to get used to them. This really just is a quick walk. The bottom line is, what we just did seems like it was, wow, that's nice. Do you really have to do that? And the answer is,
if you're going to scale at all, you absolutely have to, because you simply can't read and write data sequentially. You can't read through and update one little piece of data in a file by reading all the way through and then writing a new copy of the file. That could take seconds, and in a system like an online system, you get a hundredth of a second to do something like that, and the databases make it so that happens in a thousandth of a second. So, ultimately, you simply have to take advantage of this. You just can't,
if you're going to modify data, you can read data from flat files, but even if you're going to read a lot of data, if it's big, it slows down terribly. So, it might seem like there's a trade-off that you could debate whether this is worth it, but if you're going to deal with a lot of data, you've got no choice. It's really not as much a trade-off as you think. So, this has been a quick romp through databases. We talked a little bit about indexes. There are constraints. We talked a little bit about the not
null stuff. We've talked about that. The uniqueness, that's a constraint. Another whole area is what's called transactions, and that's the locking of little areas. So, you can read an area, then lock it, and then update it to make sure no one else reads it. And so, they make sure they either get the version before you looked at it or before you change it or after you change it. And so, that's how you make sure that you can't do things having to do with bank account balances and get yourself in trouble. So, these are a lot
of SQL. It's really fascinating. SQL is a fascinating thing to use and learn and performance tune and enjoy. So, relational databases are cool. This gets us started. The big thing is don't allow replication vertically of string data. Pull that out into a separate table, establish a primary key, and then have foreign keys that point to that primary key. It is not just how much data you store. It's sort of a way of compressing data. You might think strings take no data, but they do. Numbers take a lot less data, and it's
both how much data is stored but also how much data has to be scanned. And that way joins work. That's part of the magic of why Oracle is such a successful company. It's a bit of an art form, and it's something that you can work at your whole life and always get better at. Hello, and welcome to our code walkthrough on the roster code. So, the learning objective of this is to do a many-to-many table. And so, the idea is that we're going to, just like we talked about in lecture, we're going to have a set of
users, we're going to have a set of courses, and then we're going to have a connector table or a many-to-many table that basically has two foreign keys. So, we are going to use INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE as the way to get auto-assignment of the primary keys in the user table and the course table. And then we're going to say that the name, which is like a logical key, and then the course title, we're going to mark those as unique. And we're going to take advantage of that in a moment. So, you'll see how
we take advantage of that. So, what unique means is if you try to insert the same string into this column, you know, like Chuck twice, then it's going to fail the second time because it's going to refuse to create a new record. And so, if we just kind of like take a look, we're going to get our roster data from this sample JSON, which is just an array of arrays. And this is the person's name, the class that they're in, and whether they are a teacher or a student. And so, we're going to read that.
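
As a rough sketch of the three tables and the uniqueness behavior just described: the table and column names here are assumptions, not necessarily the ones in the roster code, and an in-memory database stands in for the real file so nothing is left on disk.

```python
import sqlite3

# In-memory database for the sketch (assumption: the walkthrough
# itself works against a file on disk).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Table and column names are illustrative guesses at the schema.
cur.executescript('''
DROP TABLE IF EXISTS User;
DROP TABLE IF EXISTS Course;
DROP TABLE IF EXISTS Member;

CREATE TABLE User (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);

CREATE TABLE Course (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title TEXT UNIQUE
);

CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    role      INTEGER,
    PRIMARY KEY (user_id, course_id)
);
''')

cur.execute('INSERT INTO User (name) VALUES (?)', ('Chuck',))

# Inserting the same logical key again violates UNIQUE and fails.
try:
    cur.execute('INSERT INTO User (name) VALUES (?)', ('Chuck',))
    second_insert_failed = False
except sqlite3.IntegrityError:
    second_insert_failed = True

cur.execute('SELECT COUNT(*) FROM User')
user_count = cur.fetchone()[0]
print(second_insert_failed, user_count)
conn.close()
```

The plain INSERT is used on purpose here so the failure is visible; the second attempt raises an error and the table still holds a single Chuck.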
So, we need the JSON library and the SQLite library. We make a database connection, and we get a cursor. The cursor is kind of more like the file handle. You send SQL commands to the cursor, and then you read the cursor to get the data back. The connection can create more than one cursor, so you can have more than one set of commands. But the cursor is generally like the file handle to the database server. And we are going to execute a big script, and you'll notice this is a triple-quoted string that goes all the
way down to here. And so, some people would just give you this in a text file and have you cut and paste it, and then go run that in your SQLite browser to create the tables. But that's okay, because what we're going to do is set this up in the code. It will either connect to an existing file named rosterdb.sqlite or create it, and if I look where I'm at, I do an ls, we find that that file is not there. So, the first time I run it, it's going to create it. But I want this to start fresh every
time, so I'm going to wipe out the tables if they exist. That way, you can run it over and over and over again, in case you make a mistake here. Now, I don't have a mistake, or hopefully I don't have a mistake on this. So, we're going to drop three tables, and we're going to create three tables. And here, we're going to create the table that has two foreign keys, user ID, course ID, that are sort of going outwards from the member table, and then we're going to model a little bit of data on
the connection itself, the role. And again, this is straight from the lecture. And the primary key is actually a composite primary key, because it's going to force the combination of user ID and course ID to be unique. There can be many rows with the same user ID and many rows with the same course ID, but only one row with a particular combination of user ID and course ID. And so, that's what we're basically saying. You can be a member of a course, but you can only do that once. You can't be like
a member of the course a bunch of times. So, we're going to, oh, that should be roster data sample. Oops, fix a bug. Save that, roster data sample. And so, that's just this file, and it's really just an array, and then each row is an array, and it's a way for us to get this roster data in. And so, once we do json.loads, we're parsing it, and then this is going to be an array of arrays. And so, for entry in the JSON data, entry is going to be one
of these things. So, entry itself is a row. So, entry sub zero is the name, and entry sub one is the title of the particular entry that we're looking at. And we're going to print it out just for yuks as a tuple. That's what the two parentheses are. This inner thing is a two-tuple. And we're then going to take the person, and we're going to do an insert, and this is new, or ignore. So, what the or ignore means is if
this insert would cause an error, please don't blow up, just ignore that I tried to insert it. And so, this is our trick, and it's a beautiful trick. It's like a gorgeously beautiful trick here. If we insert the name Chuck twice, or ignore will just mean that nothing happens the second time, because it's already there. Okay, so if it's already there, nothing happens, and if it's not there, it'll put it in. And the unique will guarantee that it only goes in once. So, we just, in effect, always attempt to insert it. And if it's been there once, then it's all
set. And so, this insert or ignore is a super powerful mechanism. I use it all the time. And we have a placeholder in the form of a question mark, and then we have, so one of these days, we'll have two things that we're asking for. As a matter of fact, here it is. There's a tuple down here. But this is kind of a tuple with one item in it, name, and that name is then going to substitute in for there while avoiding SQL injection. So, this runs. It may or may not insert a new record,
but if Chuck or whomever the name is is not there, it will give us a new record. And then we are going to get back the ID. And so, this is the logical key, and this is the primary key. And that primary key is going to be auto-constructed for us, and so we need to know what it is. So, we say select ID from user where name equals and then that same name. So, that's Chuck, and so that gives us one. And then what we do is we're going to fetch one record from the cursor
because that's a select and it gives us back a cursor. There's only hopefully one record there because it's unique. I could put a limit one in there, but that would be kind of redundant because the name is a unique key. And then the sub-zero just means if there were more than one thing that I was selecting, which we'll see in a bit, the sub-zero is just the first thing. And so, this is going to give us the integer user ID that was assigned, or if we're coming through later for Chuck, you know, Chuck later, Charlie
later, that will be the old one. So, this is inserted if it doesn't exist, and this is get the newly created ID field or the original ID field. And so, part of this works by having both a logical key and a primary key. The primary key is auto-generated, but the name is a logical key and it's unique. And so, that's our trick to get that assigned thing. Before, we just looked at it in the user interface of SQLite browser and wrote it down, but this is how we do it in code. So, we need to
know what that key is, whether it was new or not. And then we do the exact same pattern for the course, except we're inserting the course title. So, that's no big deal. And so, we're going to get the user ID, course ID. And then what we're going to do is we're going to insert or replace. So, remember that this user ID, course ID combination is the primary key for this member table. If there is a duplicate, if this combination is already there, this becomes effectively an update statement. And we have
these two number values. Now, what's missing here is the role is not there. And so, user ID, course ID, this is the SQL bit. And now we have a tuple with two items in it. And that's because we have two question marks. And then we commit it. And as I mentioned before, you don't always want to commit every time through. It turns out that the inserts themselves are less costly, but that's because they're not always writing all the way to disk. Whereas when you hit the commit, it's going to go and write everything to
disk, and your program pauses and doesn't continue until that's complete. So, sometimes we don't run this every single time through. Okay? So, let's just go ahead and run this. The only thing we're going to see is the output of the name and the title as it's running. So, if I do python3 roster.py, hopefully I can hit enter. So, you'll notice, by the way, that this SQLite file now exists, right? And it has no data in it. So, let me see if I can open this database and see it. So, you see that there's no data. So, where's
the code? We've run this code, in effect, up to this point. So, we've done all the create tables and all that stuff. So, the tables are there. So, all this structure is here. It did it. We haven't started putting any data into it yet because if we look at browse data, we're not finding anything in here. Okay? There's no data to browse. Now, hopefully we won't have locked ourselves out because we are sitting right here. And when I hit enter over here, then it's going to go, and it's just going to run really fast. So,
I'll hit enter. It'll read it. And so, it inserted all of those things. And now it's been changed. And if I hit refresh over here, we will see in the user, it just sort of assigned user IDs, right? The column's auto-assigned. We will find in the course that those courses are all auto-assigned. There's the courses. And there's no duplicates because this is unique, right? And so, these are the newly created things. But then membership is user ID, course ID. And so, again, the primary key, as it were, the unique constraint slash primary key is the
combination of these things. And I haven't put anything in role. And so, if you scroll through these, you'll see all of the users who are members of the courses that they're part of, okay? So, there you go. And I'll leave it up to you to come up with a join. I'll leave it up to you to figure out how to put the role in. But I just wanted to kind of give you a bit of a walkthrough of this code base. And in particular, the tricks of the uniqueness keys, the auto-increment keys, the logical key
uniqueness, kind of composite primary key, and then the trick of insert or ignore. And then the quick select that comes right afterwards to get the newly generated ID or to get the old ID. And then insert or replace, which is a combination of an insert and an update. So, I hope you found this example useful and can apply it and basically create many-to-many tables. We are doing some code walkthroughs. If you want to follow along with the code, you can download the source code from the Python for Everybody website, okay? So, the code we're playing
with today is twfriends.py. And this is a step beyond the simple TW Spider. It is a restartable spider. But we're going to data model things a little bit differently. We're going to have two tables, and we're going to have a many-to-many relationship, except that it's sort of a many-to-many relationship between the same table, which is okay. Twitter friends are a directional relationship. And so, we start out here in twfriends.py. Remember the file hidden.py. I'll show it to you, but I'm not going to open it because I've got my keys and secrets
in it. So, this hidden.py file, you've got to edit that, and you've got to go to apps.twitter.com and get your keys and put them in there. Otherwise, these things won't work. But, if you have Twitter and you set your API keys up and you put them in hidden.py, then all these things will work. It's kind of fun, actually, and impressive. Not hard to do, actually. So, twurl, that's my library that reads hidden.py and augments the URL and does all the OAuth stuff. JSON and SSL because Twitter doesn't, I mean, because Python doesn't accept
any certificates, even if they're good certificates, so we kind of crush that. Here's our friends list that we're going to hit. We're going to make a database, friends.sqlite. Now, here we're doing create table if not exists. So, what this really is saying is, I want this to be a restartable process and I don't want to lose the data. We're starting out, we do not have SQLite, any SQLite files, and so this is going to create the database and create these tables, but the second time we run it, we're not going to recreate the tables. We're
going to be able to restart this because we're going to run out of rate limit before we finish this, so we just have to wait. We're going to have a people table, and we're going to have a primary key in the name. The name is going to be unique, and whether or not we've retrieved it, and that's kind of from a previous one, but then there's the who follows who, the from ID to to ID, and so this is a direction, and we're going to put a uniqueness constraint in, just like we do in many
to manys that basically says, the combination of from ID and to ID has got to be unique. We don't allow ourselves to put in duplicates of the combination, so from ID can be one in many records, and to ID can be one in many records, but the pair one, one is only allowed once. And this is the crud we have to do to convince Python to accept the Twitter certificate, and so this is similar to some of the other stuff that we've done. We're going to enter a Twitter account or quit, and if we hit enter by itself,
then we will actually go and retrieve a record that was not yet retrieved, and now we're actually pulling out two values, ID and name, and so fetch one is going to give us a two-tuple basically, and we're going to store that in ID and account. So this is coming back with a two-tuple, the first of which is the ID from the database. Limit one means we're only going to get one of these, or zero of these. If there are zero of these,
that means there are no unretrieved Twitter accounts. Retrieved equals zero: you'll see in a second that all the new accounts we put in are marked as ones we haven't retrieved yet, and again, given that we're rate limited, we want to know which ones we've retrieved, okay? And so what we're going to do next is, if we typed in an account, which means the length of the account is greater than zero, we're going to check to see if they're already there, okay? And we're going to
select ID from people where name equals, so that's the one we just entered, and we're going to fetch one and grab the first thing because we only got one thing in the select statement here. And if this person that we just asked to see is not in the table, that means this is going to fail, and then we're going to do an insert or ignore. This or ignore is kind of redundant because we just checked to see if it was there, but we'll put that in just to be safe, and we're going to put the name in
for the new account that we're looking at, and we're indicating that retrieved is zero, so that we will know that we haven't retrieved it yet. You'll see that we'll update that in a second. We commit it so that later selects will see this, so you've got to do the commit. Otherwise a later select might not see the one we just inserted. And we're going to ask how many rows were affected, and if it's not equal to one, then we're going to complain that the insert failed. And then we are going to do this thing. We're going to
ask, hey, remember there was an ID up there? Doo doo doo. Right here, ID, integer, primary key, and we did not insert this here, but we want to know what that ID is, and every time I was showing you that in lectures, I was saying it's really easy in Python to do this, and that's what we're saying. This cursor did the insert, but one of the things that happens is after the insert, we're going to grab the last row ID, which is the primary key that was assigned by SQL. Okay, and so that means that
one way or another, coming through this code here in line 45, one way or another, we're either going to know the ID of the user that was there before, or we just inserted one, and so we're going to know the primary key of the current user, and you'll see why we need that. So ID is the primary key of the current user that we entered right here. Okay? And now what we're going to do is do the twurl augment with the OAuth and all the keys and the secrets in hidden.py. And we're
going to go through, let's count 1000. Let's go count, what the heck, let's go 200, up to 200 friends. Save. No, let's do 100. We'll keep it that way. And then we're going to retrieve it, and we're retrieving the account. We're not going to print the nasty URL out. We could. Then we're going to open the URL with a connection, and then we're going to read that, and we're going to get the UTF-8 data from this, and then we're going to decode that, and we're going to have the Unicode data, so the data string is
an internal Python string with all that data representing all the wonderful characters. And of course, we're going to ask urlopen to give us back the headers as a dictionary using this call, and we can see what remaining rate limit we have. So then what we're going to do is parse the data with json.loads. Oh wait, I need a continue in here. Continue. Save. If we fail to parse this data, we'll print it out. So that means that this died, which means it's
not syntactically correct JSON, basically. And who knows if we're ever going to see that, but at least when it blows up, it'll print this data out. We'll have to catch it, and then it'll continue. Actually, I'll make this a break. Because if that's blowing up that bad, we should quit. Now, I don't yet know what happens when this rate limit says you can't have it. But I do know that I expect when it's successful that there will be a key of users in this outer dictionary that we're going to get. And if this outer dictionary,
if users is not in the parse dictionary, then I'm going to dump out this data so that at least I can debug what happens when I've got some broken JSON. So the difference between this code, this code is going to fail when the JSON is syntactically bad, meaning a curly brace isn't right or whatever. This code will trigger when I get good JSON, but I don't have a users key in it. So then, once we've retrieved it, we're pretty happy with it. We're going to update for our account that we're retrieving. We're going to set
this as one of our retrieved accounts. And then what we're going to do is write a loop that goes through all the friends of this particular user that we're asking and gets their screen name. Prints it out. And then we're going to check to see if this one is already in our people database because this is a spider. We're grabbing accounts. And so we'll do a friend ID and do a fetch one, grab the sub-zero thing. And if that works, if this person's not in there, this fetch one is going to blow up, which means
we're going to drop down to the except code. But if it does work, we have friend ID, you know, they're in there and they're already in our database. They just weren't retrieved. And so now, if the friend ID wasn't there, we're going to do an insert, setting retrieved to zero, and then we're going to commit. Now, remember rowcount is how many rows were affected by this last statement, cur.rowcount, and we're going to die if that insert doesn't work. This is unlikely, unless somehow we've run out of disk space or something. And
we're going to grab the friend ID as the key, the last row that was inserted. We're only going to insert one row, so it's basically the primary key of the row that we just inserted. So if you look at this code right here, it comes out the bottom one way or another with friend ID successful. Friend ID is either they're already in our database or they're not. And if we insert them, then we have it. And so now, this count new and count old is just so I can print out a nice print out. Now
we are going to insert into the friend table, which is called the follows table in this case, from ID and to ID. Those are the two outward pointing foreign keys. And we have the ID of the account that we are retrieving the friends of and then this particular friend. And so we're inserting the connection from this person to that person. And then we commit it. We want to commit these again so that later selects, when the loop goes back up, later selects get all of that data that's going on. So we do want to commit
from time to time and then we close the cursor at the very end. So let's run this and see what happens. Okay, so python twfriends.py. Oh, of course. I am a refugee from Python 2, so I always forget to type python3. Okay, so we're going to start. If we take a look right now, I'm going to start another tab over here and do ls -l *sqlite. Now that sqlite file is there, right? And it's actually made the tables. If you go up here, it ran all this stuff. Create the tables, yada yada, and we're sitting
right here at this line. As a matter of fact, I think, without causing too much trouble, I can open that database and get into this database right here and there is no data in the follows table and there is no data in the people table. It's completely empty, okay? So we're waiting for the first one. And I'll go with mine, Dr. Chuck. So it's retrieving the 100 friends and they all were brand new. They're all inserted, right? And so now if I hit refresh, we will see that Dr. Chuck is retrieved. Who follows? So these
are all the people I follow. One follows two. So if we look at here, we see that Dr. Chuck follows Stephanie Teasley. Because we grabbed the friends of Dr. Chuck, you know, we're gonna have a record in follows for each of the ones that I follow, right? So these are all the people I follow and we put them in, okay? So we can go back and we can, let's see, grab somebody. Let's go grab Stephanie Teasley. And let's pull out her friends. So we grabbed a hundred of her folks. I got 14 left.
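
To make the bookkeeping in this walkthrough concrete, here is a small offline sketch of the people and follows pattern. It makes no Twitter calls: the friend lists are made-up sample data, the helper name get_or_create_person is hypothetical, and the table and column names are guesses from the description rather than a copy of twfriends.py.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # the walkthrough uses a file on disk
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')
cur.execute('''CREATE TABLE IF NOT EXISTS Follows
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')


def get_or_create_person(cur, name):
    """Hypothetical helper: return the primary key for name, inserting if new."""
    cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1', (name,))
    row = cur.fetchone()
    if row is not None:
        return row[0]
    cur.execute('INSERT OR IGNORE INTO People (name, retrieved) VALUES (?, 0)',
                (name,))
    return cur.lastrowid  # primary key assigned to the row just inserted


# Made-up stand-in for what the API would return for each account.
sample_friends = {
    'drchuck': ['alice', 'bob'],
    'alice': ['bob', 'drchuck'],
}

for account, friends in sample_friends.items():
    from_id = get_or_create_person(cur, account)
    cur.execute('UPDATE People SET retrieved = 1 WHERE id = ?', (from_id,))
    for friend in friends:
        to_id = get_or_create_person(cur, friend)
        # OR IGNORE: the same (from_id, to_id) pair is stored only once.
        cur.execute('INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (?, ?)',
                    (from_id, to_id))
    conn.commit()

# Join People to itself through Follows to read the links back as names.
cur.execute('''SELECT A.name, B.name
    FROM Follows
    JOIN People AS A ON Follows.from_id = A.id
    JOIN People AS B ON Follows.to_id = B.id
    ORDER BY A.name, B.name''')
links = cur.fetchall()

cur.execute('SELECT name FROM People WHERE retrieved = 0 ORDER BY id')
unretrieved = [row[0] for row in cur.fetchall()]
print(links)
print(unretrieved)
conn.close()
```

The sketch checks fetchone() for None instead of letting the subscript blow up inside a try, which is just a different way to express the same lookup-then-insert step. The unretrieved list at the end is what hitting enter by itself would pick from next.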
That's my x-rate limit. So I did Stephanie Teasley, so let's go back here. So you'll notice there's 101. There's probably gonna be, oh, 182. That's interesting. So we've retrieved Dr. Chuck and Stephanie Teasley and let's go take a look in the friends table, the follows table, okay? So we have all the people I follow. Now all the people Stephanie follows. Okay, so there we go. So let's go ahead and do somebody else. Let's see, I think we both follow Tim McKay. Where's Tim McKay? Yeah, let's follow Tim McKay. Let's see who Tim follows. See if
we can get like an overlap. Oh, we revisited some. Let's see if we can see this in the follows. Let's see people. So we've got Dr. Chuck retrieved and Tim McKay's somewhere down here. You know, it might take us a while before we get any really good overlaps. Let's see. Let's do a database call. Let's see, let's do a database SQL. Select. Count. Eh. Okay, so let's just run this some more. It's clearly working. Now one thing I can do here is I can hit enter and it will just pick one randomly. So it grabbed
live EDU TV and I can, and let's see how many I got left. We got 12 left. And now I can hit enter again and it picks another one. That was the next one. I was kind of picking them in order. Is it picking them in order? Let's go to people. Yeah, it's picking these. So we can see that it's going to just do the first unretrieved person, who's Nancy. Let's let it retrieve Nancy. So it grabbed Nancy, new. So we're finding some. And this table's getting really big. And so if we look at the
people table, we now have 455 people. And we have 467 following records. And so there we go. Oops. Hit enter. It does another one. And away we go. So you get the idea. I can type quit to finish. And just to give you a little interesting bit of code to show you how to do selects, I'm going to do this TW join. Now you'll notice that we're not talking. Oh, let's show you one thing. ls -l friends*sqlite. So this database has it. So I can restart this process and run it again. And the
database is still there. And so we just grab Swear Trek. And so we can keep doing this. And so this data, it keeps extending. And so this is a restartable process. I can run it. And then tell it to grab the next unretrieved one. And so away we go, right? And so that's part of it. So if I run out of my, I've got eight left. Oh, how many do I have left, really? Let's keep going. How many do I got left? I got five left. Okay. Wait. Oh, I guess we'll just run it out.
So I got four left. You know what I should do? I can change the code. I can stop the code, I can quit the code, and then change it. So what I'm going to do is I'm going to change this code a little bit really quick. And I'm going to print the rate limit remaining from the headers at the beginning and at the end. So now I can run it again. I changed the code. Hopefully I didn't make a Python error. Tell it to go get another one and a Navarro. And so
I got three left. Oops. We'll see what happens when I run out of rate limit. Run out of rate limit. So we have one left. Hit enter. Hit control K. Open source dot org. So we have zero left. That worked. Now let's see what happens. I don't know what happens next. Oh, we blew up. Too many requests. Oh, we got an HTTP error 429. So that means that, going for Mark Cuban, that was in line 48. So the right thing to do would be in line 48. We should really put this in a try except
block. Try except block because it gives us an error. Print. Oh, fiddlesticks. How do I print the exception message? I always am forgetting. Print failed to retrieve. So we'll put that in. Now if I run it. And then I have to put a break here because that's not a good break. Failed to retrieve. Now I've got to figure out. Oh, see, I never know how to print out the error message. Yeah. So I have to... See, that's the weird thing about stuff is that I don't ever remember enough. I don't remember the syntax, what I
say here, to print the error message out. So I'm going to go to Google. And I'm going to say, print out the exception message in Python. Print out the exception message in Python. Oh, Python 3, hello. Okay, so let's go find it here in the documentation. Except, except. Is this it? Is this what I say? I just want to print out the message. Ah, that's it. Except. Let's try this. So this is part of Python programming, is like, for me at least. Because I'm just not like a genius expert at this stuff. This is one
thing I like about Python, is you can guess stuff. And sometimes you guess right. So there we go. We got the error. We got the nice little error message. And we see error 429, too many requests. So that cleans that up nicely. So we have run out of requests. And on that, it is a good time to say thanks for listening. And I hope that you found this valuable. Hello, and welcome to our final chapter, retrieving and visualizing data. In this chapter, we are going to basically bring this all together. Databases, web services, code loops,
logic. And we're going to solve a problem that is a multi-step data analysis. We're going to find some data on the internet. Might be HTML, might be an API or whatever. And we're going to write a relatively slow process that's going to pull data slowly. Because these are all rate limited. This is a slow and restartable process. So you can start this. And what we're going to do is we're going to have a database that's going to hold the data that we're pulling. And so this might take several days, actually, if you really have to
do it. And then you'll build up your data in your database. And then what you tend to do is you tend to produce two databases. One is kind of a raw database where, you know, all of its data columns are aimed at helping you figure out what you've already retrieved and what you haven't retrieved yet. So that's kind of a crawling, spidering process. And then you find that the data is kind of nasty and ugly. And you find that before you're going to do any analysis, you probably want to clean and process it.
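
Here is a minimal sketch of the slow, restartable gathering step in this pattern. The fetch function is a hypothetical stand-in for a rate-limited web request, the table and column names are made up, and the limit argument just simulates running out of rate limit.

```python
import sqlite3


def fetch(item):
    """Hypothetical stand-in for a slow, rate-limited web request."""
    return 'raw data for ' + item


def gather(conn, queue, limit):
    """Fetch up to `limit` items from queue that are not stored yet."""
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS RawData (item TEXT UNIQUE, data TEXT)')
    fetched = 0
    for item in queue:
        cur.execute('SELECT data FROM RawData WHERE item = ?', (item,))
        if cur.fetchone() is not None:
            continue  # already retrieved on an earlier run, skip it
        if fetched >= limit:
            break  # pretend the rate limit ran out
        cur.execute('INSERT INTO RawData (item, data) VALUES (?, ?)',
                    (item, fetch(item)))
        conn.commit()  # save progress so stopping here loses nothing
        fetched += 1
    return fetched


# In-memory database for the sketch; a real run would use a file on
# disk so the data survives between runs of the program.
conn = sqlite3.connect(':memory:')
queue = ['a', 'b', 'c', 'd', 'e']
first_run = gather(conn, queue, limit=2)
second_run = gather(conn, queue, limit=2)
third_run = gather(conn, queue, limit=2)
print(first_run, second_run, third_run)
```

Each call walks the same queue from the top, skips what is already in the raw table, and stops when the pretend limit is used up, so repeated runs eventually cover everything.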
So in a lot of these, you're going to go from a raw database to a clean one. And this is going to be really large. And this is going to be really small. And you're going to do this sort of once, but slowly. And you'll do this as many times as you need, changing this program, cleaning the data up over and over and over again. And then you'll end up with really clean data. And it's relatively small. And you might run programs that will loop through this to do visualizations or analysis or some things or
whatever. And so you'll actually sort of use this database as a source of information. OK. So that's the basic pattern of what we're going to work with. Now, this is what I call personal data mining. And if you're going to do this seriously, Python is used in lots of data mining activities. But if you're going to do data mining seriously with really, really large data sets, we're doing small to medium-sized data sets as you might do sort of for individual personal research versus like an organization research where you're processing the logs of a web server
or something like that. And there's lots and lots of wonderful technology. And what's really cool is this technology just keeps getting better and better because the whole data mining, data analysis, natural language processing field is just so hot right now. It's so awesome. We're going to keep it simple and do stuff for ourselves for now. And I gave you a bunch of sample code that you can adapt to solve the problems that you need to solve. So like I said, this is more of a programming
exercise. Data mining might be a lot more complex. If you're doing simple research, this might actually model what you do pretty well. So the first thing that we're going to do is use Google's JSON API for geocoding. And there are two versions of this. One version requires a key and one version doesn't require a key. Google used to make all this data available for free with just a rate limit, but now they're increasingly requiring a key. So I give you code in this zip file that kind of does both. If
you really wanted to do something in production, taking user-entered places and names and getting precise latitude and longitude coordinates so you can produce a nice little Google map like this, you'd use Google. But since Google has made a rate limited API, I've actually pre-spidered a copy of the Google data and I have my own sort of fake Google API and so you can do your assignments and test all your code using my fake API which has no rate limits and has no problems. But it's only a limited set of the data. And so this is the basic
process and it's one of those things that follows that basic personal data mining pattern. And so here's this API which is either Google or me. I've got my own dr-chuck.net version of this. And there is an input queue of the locations. So this is the user data where they just put in the name of where they think they live. University of Toobigan or something. And so this is the queue of the things that are to be retrieved. And in my case when I built this map
for the first time, there was like 15,000. And it took me days to get this. And so it would stop. And so what I would do is I would read the first one into this geoload.py, check to see if I already had it in my database. If I didn't already have it in the database, I would go out to the API, pull the data down and I would put it in the database. And then I would go to the next one. The next one, the next one. And so I might get a thousand in my database and then
it blows up or I'm told I can't go any further. So I wait 24 hours. I start it up and it reads the first thousand and says, oh, they're all in the database already. And then it starts at one thousand and one. And then it adds that and adds that. And then until it stops. And so it took me several days of processing to get this data right. Now, I didn't have a separate cleaning process because this data is pretty simple. I was pulling out the JSON and latitude and longitude, etc. And so I didn't
have to do two separate processes to clean this data up. It was clean enough right as I pulled it. Because I was talking to an API. If you're talking to the HTML, sometimes it gets nasty and ugly. And so then I wrote this program that just reads through it. It just does a select and, you know, reads through the stuff and it prints out some summary information and tells you what to do. It also prints out and you'll see this pattern because, you know, I'm visualizing using browsers, HTML, and this happens to be used in
the Google Maps API and putting all the data in a little JavaScript file. So these end up being assignment statements in JavaScript. You can take a look at that file and all the data shows up as assignment statements in the JavaScript. And then when this HTML loads, it reads this file and puts up all those pins as long as you have access to the in-browser JavaScript API. So the next thing we're going to talk about is page rank, which is now spidering HTML. We talked a lot about this: spider HTML, get some links. And
so up next, we're going to actually build a real database full featured search engine using page rank. This is another worked code example. You can download the sample code zip file if you want to follow along. And the code that we're working on today is what I call the geodata code. And that is code that is going to pull some locations from this file. We're simulating or using the Google Places API to look places up and so we can visualize them on a map. And so this is the basic picture. If we take a look
at this where.data file, it's just a flat file that has a list of organizations. And this actually was pulled from one of my MOOC surveys. We just let people type in where they went to school and this is just a sample of them. So this data is read in by this program geoload.py. And if you recall, this Google geodata has rate limits. It also has API keys, which we'll talk about in a bit too. And so the idea is this is a restartable spider-like process. And so we want to be able to run this and
have it blow up and run it and start it and not lose what we've got. So we're now using a database as well as an API. But in order to work around the rate limits of this API, we're going to use the database with a restartable process. And then we'll make some sense of this and then we'll visualize this. But in the short term, let's start with the geoload.py code. Geoload.py, take a look here. So a lot of this hopefully by now is somewhat familiar to you: urllib, JSON, SQLite. And so I mentioned that the
Google APIs, these used to be free and did not require an API key, but increasingly they're making you do API keys for especially new ones. And so what happens, you can go to your Google Places, go to Google APIs and get an API key. And you can put it in here, it'll be this long, big long thing that looks like that. And then if you have an API key, you can use the Places API. And I've got a copy of a subset, not all of it, a subset of it here at this URL. As a
matter of fact, you can just go to this URL in a browser. And it will show you a list of the data that it knows about. And I made it so that it follows the same basic protocol, with the address equals parameter, as the Google Places API. So this will just change where we retrieve the data from, either from Google or from my server. Nice thing about my server, it's got no rate limit. It's really fast and you're not fighting with Google all the time. And it means that perhaps if you're in a country where Google is not
well supported, you can use my API. And that's really strange that somehow my API is more reliable and available than the Google one. But it's true. So we're going to make a database. We're going to do a create table if not exists, and we'll have some address. And we're really just caching the geographical data. We're going to cache the JSON. One of the things we do when we build these processes is we tend to simplify these things and not do all the calculation and parsing the JSON. Just load it and get it in and load
it and get it in and fill the data up in this database. And so that's what we're going to do. Because Python doesn't ship with any legitimate certificates, we have to sort of ignore certificate errors. We're going to open the file. And we're going to loop through it and pull out the address from the file. And we're going to select from the geodata where that address is the address. Let's move this in a bit. And so we're going to do a select and pull out that address. And the idea is if it's already in the
database, we don't want to do it. So we do a fetchone and pull out that first thing, which will be the JSON right there. If we get that, we'll continue back up to the top of the loop. Otherwise, we'll keep going. Pass just means don't blow up. So we except and we just do a pass. That's like a no-op. And we're going to make a dictionary because that's what we do for the key-value pairs. Everything you've seen so far, I've used constants here. But because we may or may not have an API key, query equals and
then that's the address. And then the key equals and then the API key. If you recall, urlencode adds the pluses and question marks and all that nice stuff. We're going to retrieve it. We're going to read it and decode it. Print out how much data we've got. And add to the count. And then we're going to try to parse that JSON data and print it if something goes wrong. And as we've seen, the top level of this JSON data from this geocoding API is an object, which we'll see a little bit of in a bit.
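The check-the-database-first loop described above can be sketched roughly like this. This is a minimal sketch, not the geoload.py from the zip file: the service URL is a made-up placeholder, the `query` and `key` parameter names and the `Locations` table layout are assumptions, and the network call is passed in as a `fetch` function so the example runs without talking to any server.

```python
import json
import sqlite3
import urllib.parse

# Hypothetical endpoint; substitute the real service URL from the sample code.
SERVICE_URL = "http://example.com/geojson?"


def build_url(address, api_key=None):
    """Build the request URL; the key is only added when we have one."""
    params = {"query": address}  # parameter name is an assumption
    if api_key is not None:
        params["key"] = api_key
    # urlencode turns spaces into pluses and escapes other special characters
    return SERVICE_URL + urllib.parse.urlencode(params)


def load_addresses(conn, addresses, fetch):
    """Cache the raw JSON for every address that is not in the database yet.

    `fetch` is any callable that takes a URL and returns the response text.
    Returns how many addresses were actually retrieved on this run.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)"
    )
    retrieved = 0
    for address in addresses:
        address = address.strip()
        cur.execute(
            "SELECT geodata FROM Locations WHERE address = ?", (address,)
        )
        if cur.fetchone() is not None:
            continue  # already cached on an earlier run, skip the network
        data = fetch(build_url(address))
        try:
            json.loads(data)  # only checking that it parses
        except json.JSONDecodeError:
            print("Could not parse:", data[:50])
            continue
        cur.execute(
            "INSERT INTO Locations (address, geodata) VALUES (?, ?)",
            (address, data),
        )
        conn.commit()  # commit each row so a crash loses nothing
        retrieved += 1
    return retrieved


def fake_fetch(url):
    """Stand-in for the real HTTP request."""
    return json.dumps({"status": "OK", "requested": url})


conn = sqlite3.connect(":memory:")
places = ["Ann Arbor, MI", "Monash University"]
print(load_addresses(conn, places, fake_fetch))  # prints 2
print(load_addresses(conn, places, fake_fetch))  # prints 0
```

The second call retrieves nothing because both addresses are already in the table, which is the whole point of making the process restartable.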
And it has a status field in it. And the status is OK if things went well. So if the status is not there, that means our JSON is not well formed or not how we expect it. If the status is not OK and not equal to ZERO_RESULTS, then print out failure to retrieve and then quit. And then we're simply going to insert this new data that we just pulled in. And then we're going to commit it. And every tenth one, this is count mod 10, we're going to pause for five seconds. And we can
hit control C here. And then we're going to play the, do the geodump. Okay. So let's just run this. Geodata. Python. So let's do an LS. So we don't have, we do have, let's get rid of from a previous test, geodata.sqlite. So we'll start with a fresh, fresh set of data and run python geoload.py. Of course, I'm always forever making the mistake of forgetting Python 3. So you can see that it's running. And it's adding the query. And in this case, I don't have the API key. And it's putting the pluses in. And that's this
part here with all the pluses. That's the urlencode. And you notice it's pausing a bit. Now, depending on how fast your net connection is, this may or may not go so fast. But this is not that much data. It's like only 2,000 or 3,000 characters. And so it's working and talking to my server. And the interesting thing here is I can blow this up. I'm going to hit Control-C. In Linux you'd hit Control-C. And in Windows I think you'd hit Control-Z. Depending on what
shell you're working in. But I'm going to hit control C. And you see I sort of blew it up, right? And that causes a traceback, a keyboard interrupt traceback. If I do an LS minus L, you can see that now this geodata is there. Now in the name of restarting, I will restart this. And you will see that it checks and skips. And so it runs this code here where it's right here. It grabs it and finds it in the database. So you'll see it say found in the database really quick. Chop, chop, chop. And
go really fast. And then it'll go back to catching up where it left off. And so all those up there, they did not actually re-retrieve it, because it knew about those things. And so now it's catching up and doing some more, and doing some more, and doing some more. And then I'll hit control C. It has a little counter in here that basically, if it hits 200 it stops and you have to restart it. You could obviously change this code. You could make it so it didn't sleep. It doesn't hurt to sleep for like a
second after every 100 or so if you want. You could change that code. And now let's just hit control C. And blow it up. LS minus L. And there is another bit of code. And with this code, it's always good to write these really simple things. And so now we're going to import SQLite and JSON. We're going to connect ourselves up. We're going to open the output file, except this one is UTF-8, because we're going to write it out as UTF-8. And we're going to read through. And in this case, we are going to decode. We did select star from
locations. And if you recall, locations has a location and a geodata. And so the sub-zero will be the location, and the sub-one will be the geodata. And we're going to convert it to a string and then parse it. If something goes wrong with the JSON, we'll just skip it. We'll check to see if we have the status in our JSON. Let me run the SQLite browser here. File, open database. Let's take a look at what's in this database. Oh, where are we? Code three. Geodata. Geodata SQLite. So this is the data we've
got. So if you make this a little bigger, if I can, can I make that bigger? Yeah, it's not going to show us much. So you can see that these are the addresses in the geodata. That's just the JSON. So that's the JSON that we've got, and it retrieves it. And so this is a really simple database. It's just a sort of spidering process. Run, run, run. But now we're going to run the geodump code, which is going to read this and dump this stuff out and print where.js, so it's going to actually parse this
stuff. And that's code we've seen before. So we're actually reading it. And this line goes into the results. The results is an array. So if we go into results, we're going to go grab the zeroth item in that array. And then we're going to go find geometry. And then location. And then lat and lng for the latitude and longitude. And then we're also going to take the actual address out of the formatted address right here. So in this bit of code, we're actually parsing the JSON. And we're going to clean things
up, get rid of some single quotes. This kind of data cleaning is just stuff after you play with it for a while. You realize, oh, my data is ugly or does this. And I print it out. And then I'm going to write this out. And I'm going to write it into a JavaScript file. And so the JavaScript file is this where.js. And I'll show you what it looks like. It's going to be overwritten. This is the one that came out of the zip file. It'll have the latitude, the longitude. And we're going to use JavaScript
to read this in this where.html file. It's going to actually read this right there and pull that data in. And that's how we're going to visualize. I'm not going to go into great detail on how the visualization happens. But that's what's happening. And so we're going to write that. So we're going to actually write this to a file. So let's go ahead and run this code and say python3 geodump. OK, so it wrote 120 records to where.js. So if we look at where.js, this is now the new data that I just downloaded moments ago. And
it says open where.html in a browser. Now, for this you'll need the Google Maps API. And you might not be able to see this depending on where you're at. But here you go with Google Maps locations. And I think if you hover over this, you can see. And you see, in that particular one, why we had to use UTF-8 when we wrote the file so that we didn't end up with trouble writing the file out. And so there you go. And so that is a simple visualization. And just a
simple visualization. It wrote this where.js. If you are smart with HTML and JavaScript, you can look at this where.html file. It's really just reading through a bunch of data and putting the points. That's all there is. But I'm not going to go through that. So at least not in this. And so I hope that this was useful to you. And thanks for watching. So now we're going to write a search engine. Do some of the things. We're going to do page rank. And we're going to visualize it in a web browser and show the weights.
We're really only going to do page rank on one site, because you want to have more than one page that points to a page so that you can figure out which pages are more or less important. And then visualize it. We'll run the page rank algorithm and we'll separately do all this. So at this point we're going to do pretty much the web crawling, the index building, and the searching. We're not going to really search it. We're going to visualize the index. But you could write a simple program to do searches for keywords
and figure out which page was the most likely page for a keyword. And that would be a fun additional thing to do. So the web crawler is this program that hits a page, pulls down the HTML, parses the page, looks for links, makes a queue of incoming links that are as yet unretrieved. And I'm going to do this in a simple SQLite database. The database basically starts out with one link as the starting point and then it retrieves that page. And then you see the database end up with lots of unretrieved pages.
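The "parse the page, look for links" step can be sketched in a few lines. This is only an illustration: the course code uses Beautiful Soup for this, and here the standard library's `html.parser` is swapped in so the example has nothing to install. The names `LinkCollector` and `extract_links` are made up for the sketch.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all anchor tags, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


page = (
    '<p><a href="page2.htm">Second page</a> and '
    '<a href="http://example.com/">somewhere else</a></p>'
)
print(extract_links(page))  # prints ['page2.htm', 'http://example.com/']
```

Each link that comes back from a page like this is a candidate for the queue of unretrieved pages.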
And then it goes back in and picks a random page and retrieves that one. And then it just expands and expands. This code that I've built that you're going to play with only stays on one website, otherwise it would go crazy. And of course, Google doesn't use an SQLite database running on your hard drive. But you'll get the idea. You'll see this thing exponentially gain links. And you'll run it for a while, pull down 1,000 web pages or whatever. But of course, make sure that you don't violate any terms and conditions. And again, I've got some
data sources that you can use. And they're not rate limited. But you can also use things like Wikipedia, although I think they sort of discourage that. Or dr-chuck.com, which has no rate limit. Or who knows what, right? So just be careful. Don't do this on Facebook. And don't do it on Google. Don't get yourself in trouble. And if you're using, you know, an internet connection where you're paying for bandwidth, be careful. So this is the idea of the web crawler. And this isn't my picture. This is the classic picture of a web crawler. Read a
page, parse it, take all the URLs and stick them in a queue, grab again and again. So for us, the scheduler is going to do it as long as you'd say, oh, do 100 pages or it runs until it blows up. And again, these processes that have the network in the loop, it's really important that they behave well when they blow up. And that's why databases are so useful. Because you can be writing along to the database. And some random thing happens and blows your data up and you start over. So you're reading these things,
you're storing each page, building up your storage, et cetera, et cetera. So you just keep on doing that. And with this program, you'll be able to retrieve some stuff, then run the page rank, then you can retrieve some more, and then you can run some more page rank. And you can kind of see how Google sort of evolves its index over time. Of course, we're so much simpler. And like I said, be careful when you crawl. You're going to run a crawler that just goes as fast as it can. But Google doesn't do that. It's
careful not to overwhelm any websites. It's trying to be smart about the use of your bandwidth on your website. There is a file. Our code won't bother looking at this. But there's a file called robots.txt that real web crawlers look at, and it gives a list of the things you are allowed to look at and not allowed to look at. And so if you go to Google and you see a search that says, we are not allowed to show you the summary text of this page because of the robots.txt, it's there. And you can go
and you can actually see a robots.txt. Just go to any website. It's at the top root, blah, blah, blah, blah, blah, slash robots.txt. It's not a path. It's not slash this, slash that, slash something else, robots. It's at the very, very top of a website. The index building uses the page rank algorithm. And the whole goal of the page rank algorithm is to figure out which pages have the most best links. So having the most links is really easy. You can just say, how many links go to this? But the problem is you've got to
figure out the value of those links. And then you have to, how do you figure the value of those links? By looking at how many good links come to it. So it turns out that it's an infinite problem. It's an infinitely difficult problem to use page rank. But you can approximate it. And what happens is, after a while, it converges to a reasonable value. And so we're going to run the search index. And each time it runs, you're going to see that it says, how much did these numbers change? And what happens is, in the
beginning, they change very wildly. But quickly, they flatten out. And the best way to think about the page rank is think about how water runs, where you have a small little stream going by a house. And sometimes it rains. Sometimes it's dry. And sometimes there's like a little lake. And the stream is always running. And it doesn't go up and it doesn't go down. It might go up a little bit if it rains a lot. But in general, there's sort of a steady state, meaning that whatever water's coming in is about the same as the
water going out. So we think about this in terms of web pages. The value of the links coming in is roughly the same as the value of links going out. So when the in and the out value from each of the nodes starts to balance, then you've got something pretty stable. And so what Google does is they have a relatively stable assessment of the goodness and value of pages. And they use that to compute page rank. And then they throw a few more pages in and it kind of has to adjust for a while, but
it reconverges. And so this is a calculation that generally converges and it doesn't vary wildly. And that's why Google's pretty good at kind of arriving at the true value of something. So let's take a look at what we're going to do in this application. Again, we have a file that is going to spider the web. And we only have one database in this one. We'll have two databases in the next one. And so this spider is the restartable part. And what we actually do is we put one URL in, the starting URL. And
then spider walks in and asks, are there any unretrieved pages? And it does that randomly. It sort of picks among the unretrieved pages and says, okay, great. I'll go retrieve that page. And then I'll parse that page. And then I'll put in a bunch of new unretrieved pages. Okay, as well as the text of that page and then a bunch of unretrieved pages. And then it'll go back up and it'll say, oh, give me one of the randomly non-retrieved pages. And it'll grab the next page and pull that page down and then add to it.
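The pick-a-random-unretrieved-page loop can be sketched as below. It is a simplified sketch rather than the real spider.py: the `Pages` table is cut down to three columns, the tiny `FAKE_SITE` dictionary stands in for real HTTP requests, and the `crawl` and `fake_fetch` names are made up for the example. A page whose `html` column is NULL is one that has not been retrieved yet.

```python
import sqlite3

# A tiny pretend website (hypothetical): each URL maps to the links on it.
FAKE_SITE = {
    "http://example.com/page1.htm": ["http://example.com/page2.htm"],
    "http://example.com/page2.htm": [
        "http://example.com/page1.htm",
        "http://example.com/page3.htm",
    ],
    "http://example.com/page3.htm": [],
}


def fake_fetch(url):
    """Stand-in for retrieving and parsing a page: returns (html, links)."""
    return "<html>" + url + "</html>", FAKE_SITE.get(url, [])


def crawl(conn, start_url, max_pages, fetch):
    """Retrieve up to max_pages pages; return how many were retrieved."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS Pages "
        "(id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT)"
    )
    # Prime the to-do list only if nothing is waiting to be retrieved.
    cur.execute("SELECT id FROM Pages WHERE html IS NULL LIMIT 1")
    if cur.fetchone() is None:
        cur.execute(
            "INSERT OR IGNORE INTO Pages (url, html) VALUES (?, NULL)",
            (start_url,),
        )
        conn.commit()

    done = 0
    while done < max_pages:
        # Randomly pick one page that has not been retrieved yet.
        cur.execute(
            "SELECT id, url FROM Pages WHERE html IS NULL "
            "ORDER BY RANDOM() LIMIT 1"
        )
        row = cur.fetchone()
        if row is None:
            break  # the to-do list is empty
        page_id, url = row
        html, links = fetch(url)
        cur.execute("UPDATE Pages SET html = ? WHERE id = ?", (html, page_id))
        for link in links:
            # New URLs join the to-do list; known ones are left alone.
            cur.execute(
                "INSERT OR IGNORE INTO Pages (url, html) VALUES (?, NULL)",
                (link,),
            )
        conn.commit()
        done += 1
    return done


conn = sqlite3.connect(":memory:")
print(crawl(conn, "http://example.com/page1.htm", 2, fake_fetch))   # prints 2
print(crawl(conn, "http://example.com/page1.htm", 10, fake_fetch))  # prints 1
```

The second call picks up where the first one stopped: it finds the one page still on the to-do list, retrieves it, and then runs out of things to do.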
And so this is like there's a page and then a to-do list. And then this one becomes a page and then adds a few more things to the to-do list. And so the to-do list or the unretrieved URLs grows very rapidly. And the retrieved ones grow sort of as you retrieve them one at a time. But you've always got this long list. If you have a really small site that only has like two links, if you start at dr-chuck.com/page1.htm, it'll go to page two and then go back to page one and it'll be
out of things. It'll have retrieved all of the pages. And so if you have a website that has no external links or has very few pages and they point to each other, this will run out of things to do. But if you go to a page like my blog or the sample stuff that I have up for you to spider for testing on dr-chuck.net, it'll run for a very long time. And you'll have far more pages to retrieve than pages that you've retrieved. But that's okay. At some point, you can stop this. Maybe it stops
because you ran out of bandwidth or maybe your computer went down or who knows what, right? But it's okay. This is a restartable process because it always has some pages that are retrieved and some unretrieved pages. You start it back up. It picks randomly from the unretrieved pages. The database is the sort of persistent state of your spider rather than a bunch of dictionaries or lists inside the Python which go away when the program dies. And so at some point you have, let's just say, a few hundred pages in here and a few thousand unretrieved
pages. You can run the page rank algorithm. And what the page rank algorithm does is it loops through all the pages and figures out which pages are linked to which pages and then reads the numbers and then updates the numbers and then does that some number of times. And so this is where all the pages sort of start out with a goodness of one. I think this printout is showing that goodness of one. And then it changes. And then some of the goodness goes up to two. Some of the goodness
goes to seven and whatever. But then it does this over and over and then it uses these numbers and then they change again. And so there's a number of time steps that this page rank runs. And you will see as the page rank runs, when I show you the code, you'll see the average sort of change in these numbers across all these things. And you'll see that the average goes down very rapidly as you get through. And so usually with a few hundred or even a thousand pages, you run like a hundred-plus steps of this algorithm and
these numbers have converged. And that's when you sort of can begin to trust the numbers. Now there's this one program called spreset, which sets all the pages back to one. So you can start this over. So if you were to spider for a while, run sprank for a while, play around, and then you wanted to spider some more and start it over, you could say, oh, let's start the page rank completely over. Or you could simply take the new pages and watch it adapt. Either way, this is just a way to reset all
the pages to have sort of their initial value of a goodness of 1.0. So at some point you run this. This runs really, this part here runs really slow. This part runs super fast, like in the blink of an eye. This one is pretty fast. And then at some point you've got these pages that have, you know, numbers on them. They have values on the pages. And there's a couple of programs that allow us to visualize that. One is the dump which just reads it and checks to see. It shows the new page rank, the
old page rank, and various other things and shows just a way to dump it. And then there's this thing that reads the whole thing. You say, I'd like to do 25 at the top, the best. It sorts it by page rank and then produces a JavaScript file. It has just the numbers in it. And then there is some HTML and a visualization library called D3.js, which you can read about, that when the HTML starts it reads this and has this nice force-directed layout of the page rank. And you can hover over things and you can
see what page rank you've got. And so that is the page rank algorithm that we're going to do. And up next we'll do the largest and most complex of these things, and that is the email. We're going to spider some email, which is about a gigabyte of data. Okay? We're doing a bit of code walkthrough, and if you want to, you can get to the sample code and download it all so that you can walk through the code yourself. What we're walking through today is the page rank code. And so the page rank code, let
me get the picture of the page rank code up here. Here's the picture of the page rank code. And so the page rank code has five chunks of code that are going to run. The first one we're going to look at is the spidering code. And then we'll do a separate look at these other guys later. So the first one we'll look at is spidering. And again, it's sort of the same pattern of we've got some stuff on the web, in this case web pages. We're going to have a database that sort of just captures
the stuff. It's not really trying to be particularly intelligent, but it is going to parse these with Beautiful Soup and add things to the database. Okay? And so then we'll talk about how we run the page rank algorithm and then how we visualize the page rank algorithm in a bit. Now, the first thing to notice is that I've got to put, I put the Beautiful Soup code in right here. Okay? So this is, you can get this from the bs4.zip file. There might need to be a readme. No, but there's a readme somewhere. But to
get to use Beautiful Soup, you've got to put this bs4.zip, or you have to install Beautiful Soup for your stuff. So I provide this bs4.zip as a quick and dirty way if you can't install something for all of the Python users on your system. So that's what it's supposed to look like. You're supposed to have it unzipped right here in these files. And I don't know what dammit.py means. That came from Beautiful Soup. If you look, it's in their source code. So I'm not swearing. It's Beautiful Soup. People are swearing. I'm sorry. I apologize. Okay.
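Before getting into the spider code, here is a rough sketch of the rank-update loop that was described earlier: every page starts with a goodness of one, each step hands a page's rank out to the pages it links to, and the average change per step shrinks as the numbers settle. This is a simplified illustration, not the course's sprank.py and certainly not what Google runs; the three-page graph and the function names are made up for the example.

```python
def pagerank_step(links, ranks):
    """One step: each page splits its rank evenly among the pages it links to."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        if not targets:
            continue
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    # Rank sitting on pages with no outbound links would otherwise vanish;
    # spread whatever is missing evenly so the total stays the same.
    missing = (sum(ranks.values()) - sum(new_ranks.values())) / len(ranks)
    for page in new_ranks:
        new_ranks[page] += missing
    return new_ranks


def run_pagerank(links, iterations):
    """Start every page at 1.0 and repeat the step; return ranks and last change."""
    ranks = {page: 1.0 for page in links}
    avg_change = 0.0
    for _ in range(iterations):
        new_ranks = pagerank_step(links, ranks)
        avg_change = sum(abs(new_ranks[p] - ranks[p]) for p in ranks) / len(ranks)
        ranks = new_ranks  # the new rank becomes the old rank for the next step
    return ranks, avg_change


# Made-up graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, change = run_pagerank(links, 100)
print({page: round(rank, 3) for page, rank in ranks.items()})
# prints {'A': 1.2, 'B': 0.6, 'C': 1.2}
```

After one step the ranks jump around a lot (A 1.0, B 0.5, C 1.5), but by a hundred steps the average change is essentially zero, which is the "flattening out" behavior.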
So the code we're going to play with the most, and this first one, is called spider.py. And, you know, we're going to do databases. We're going to read URLs. And we're going to parse them with Beautiful Soup. Okay? And so what we're going to do is we're going to make a file. Again, this will make spider.sqlite. And here we are in pagerank and ls minus l. spider.sqlite is not there. So it's going to create the database. We do create table if not exists. We're going to have an integer primary key because we're
going to do foreign keys here. We're going to have a URL, which is unique, the HTML, and whether we got an error. And then for the second half, when we start doing PageRank, we're going to have old rank and new rank. Because the way PageRank works is it takes the old rank, computes the new rank, and then replaces the old rank with the new rank, and then does it over and over again. And then we're going to have a many-to-many table, which really points back at the same table. So I call this from ID
and to ID. We did this with some of the Twitter stuff. And then this webs is just in case I have more than one web, but that really doesn't make much difference. Okay, so what we're going to do is we're going to select ID, URL from pages where HTML is null. This is our indicator that a page has not yet been retrieved. And error is null, ordered by random. And so this is our way, this long bit of stuff. And not all this SQL is completely standard, but this order by random is really quite nice
in SQLite. Limit one just randomly picks a record in this database where this where clause is true. And then we're going to fetch a row. And if that row is none, right, we're going to ask for a new web, a starting URL, and this is going to fire things up, and we're going to insert this new URL. Otherwise, we're going to restart. We have a row to start with. And otherwise, we're going to sort of prime this by inserting the URL we start with. If you just hit enter,
it just goes to dr-chuck.com, which is a fine place to start. And then what this does, since it's page rank, is it uses this webs table to limit the links. It only follows links to the sites that you tell it to. And probably the best for your page rank is to stick with one site. Otherwise, you will just never find the same site again if you let this wander the web aimlessly. And so I generally run with one web, which should probably be called websites. And I
pull in all the data, and I read this in, and I just make myself a list of the legit URLs, and you'll see how we use that. And the web is what are the legit places we're going to go, because we're going to go through a loop, ask for how many pages, and we're going to look for a null page. Again, we're using that random order by random limit one. And then we're going to have a, we're going to grab one. We're going to get the from ID, which is the page we're linking from, and
then the URL. Otherwise, there are no unretrieved pages. And so the from ID is, when we start adding links to our page links, we've got to know the page we started with. And that's the primary key. We'll see how that primary key is set in a second. So otherwise, we have none. And we're going to print the from ID and the URL that we're working with. Just to make sure, we're going to wipe out all of the links. Because it's unretrieved, we're going to wipe out from the links. The links is the
connection table that connects from pages back to pages. And so we're going to wipe out. So we're going to go grab this URL. We're going to read it. We're not decoding it because we're using Beautiful Soup, which compensates for the UTF-8. And so we can ask, this is the HTTP status code. And we check that 200 is a good status. And if we get a bad status, we're going to say error on page. We're going to set that error. We're going to update pages. That way, we don't retrieve it ever again. We basically check to
see if the content type is text/html. Remember, in HTTP, you get the content type. We only want to look for the links on HTML pages. And so we wipe that guy out if we get a JPEG or something like that. We're not going to retrieve a JPEG. And then we commit and continue. So these are kind of like, oh, those are pages we didn't want to mess with. And then we print out how many characters we got and parse it. And we do this whole thing in a try-except block,
because a lot of things can go wrong here. It's a bit of a long try-except block. Keyboard interrupt, that's what happens if I hit CTRL-C at my keyboard or CTRL-Z on Windows. Some other exception probably means Beautiful Soup blew up or something else blew up. And so we indicate with the error equals negative 1 for that URL so we don't retrieve it again. At this point, at line 103, we have got the HTML for that URL. And so we're going to insert it in. And we're going to set the page rank to 1. So the
way page rank works is it gives all the pages some normal value. And then it alters that. We'll see that in a bit. So it sets it in with 1. We're going to insert or ignore. That's just in case the pages is not there. And then we're going to do an update. And that's kind of doing the same thing twice, just sort of doubly making sure if it's already there, this or ignore will cause this to do nothing. And the update will cause us to retain it. And then we commit it so that if we
do selects later, we get that information. Now this code is similar. Remember, we used Beautiful Soup to pull out all the anchor tags. We have a for loop. We pull out the href. And you'll see this code's a little more complex than some of the earlier stuff, because it has to deal with the real nastiness or imperfection of the web. And so we're going to use urlparse, which is actually part of the urllib code. And that's going to break the URL into pieces. Come back. We use urlparse. We have the scheme, which
is HTTP or HTTPS. This resolves relative references by taking the current URL and hooking it up. urljoin knows about slashes and all those other things. We check to see if there's an anchor, the pound sign at the end of a URL, and we throw away everything from the anchor on. If we have a JPEG or a PNG or a GIF, we are going to skip it. We don't want to bother with that. We're looking through links now. We're looking at all the links. And if we have a slash
at the end, we're going to chop off the slash by saying minus one. And so this is just kind of nasty choppage and throwing away the URLs that we're going through a page, and we have a bunch that we don't like or we have to clean them up or whatever. And now, and we've made them absolute by doing this, it's an absolute URL. This is just, you write this slowly but surely when your code blows up and you start it over and start it over and start over. Then what we do is we check to
see through all the webs. Remember, those were the URLs that we're willing to stay within, and usually it's just one. If this would link off of the sites we're interested in, we're going to skip it. We are not interested in links that leave the site. So this is like: link that left the site, skip it. But now, finally, here at line 132, we are ready to put this into Pages, the URL and the HTML, and it's all good, right? And that one's going to be null right there because we haven't retrieved the HTML.
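As a rough sketch of the link cleanup just described, and not the actual spider.py code, something like the following would do it. The function name clean_link and the list of image suffixes are made up for illustration; urlparse and urljoin are the real urllib.parse functions.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical names for this sketch; the suffixes mirror the image types mentioned above.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif")


def clean_link(page_url, href, webs):
    """Return a tidied absolute URL, or None if the link should be skipped."""
    if not href:
        return None
    # No scheme means a relative reference: hook it onto the current page's URL.
    if not urlparse(href).scheme:
        href = urljoin(page_url, href)
    # Throw away the anchor (the pound sign) and everything after it.
    href = href.split("#", 1)[0]
    # Skip images.
    if href.lower().endswith(IMAGE_SUFFIXES):
        return None
    # Chop off a trailing slash.
    if href.endswith("/"):
        href = href[:-1]
    # Stay on the sites we are willing to crawl.
    if not any(href.startswith(web) for web in webs):
        return None
    return href


page = "http://www.dr-chuck.com/index.htm"
webs = ["http://www.dr-chuck.com"]
for raw in ["about/#top", "logo.PNG", "https://example.org/elsewhere"]:
    print(raw, "->", clean_link(page, raw, webs))
```

Keeping all the chopping in one small function makes it easy to add one more special case each time the crawl blows up on a strange link.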
This is null because this is a page we're going to retrieve later. We're giving it a page rank of one, and we're giving it no HTML, and that way it'll be retrieved. And then we commit that, okay? And then we want to get the ID. We could have done this one way or another, but we're going to do a select to say, hey, what was the ID that either was already there or was just created? And we grab that with a fetchone, and that's our to_id, and now we're going to put a link
in: INSERT OR IGNORE into Links, from_id and to_id. The from_id is the primary key of the page that we're going through and looking for links in. The to_id is the page for the link that we just created, and away we run. So it's going to go and go and go and go. Let's go look at the create statement up here: from_id and to_id, right there, okay. So let's run it. python3 spider.py. So it's fresh, and so it wants a URL with which to start, and I'll just start with my favorite website, www.dr-chuck.com. Now
with this first one you put in, it's going to stay on this website for a while, okay. So I'll hit enter, and let's just grab one page just for yucks. Okay, so it grabbed that, and it printed out that it got 8545 characters, and it printed out that it got six links. So if I go to the SQLite browser and open the database, and I go to code3 and I go to pagerank and I look at this, oh, let me get out so it closes. So notice this SQLite journal file: that means
it's not done closing. So I'm going to get out of the spider by pressing enter, and you'll notice now that the journal file went away; otherwise we would not be seeing the final data. There we go. Okay, so let's take a look at the data. Webs has just one URL; those are the URLs that we're allowing ourselves to look at. You can put more than one in here if you want, but most people will just leave this as one. Pages: so we got this first one, and we retrieved this as the HTML of it, and
we found the other URLs in there that are dr-chuck.com URLs. There were lots of other URLs in there, but there were only five other ones on this site that we kept. And what we'll find is, if we go to Links, we'll see that page one links to two, links to three, links to four, links to five, links to six, because Links is just a many-to-many table. So page one points to page two, page one to three, page one to five, okay? So that's what happens when we have the first page. So let's retrieve one more
page. Now, we could have started a new crawl, but it's gonna stay on dr-chuck.com, and I'll just ask for one more page. And so now it went and grabbed one. It randomly picked among these null guys, and I'm gonna hit enter to close it, and then I'll refresh this. And oh, so it looks like we retrieved OBI sample, and we didn't get any new links. And in the Links table, no, we didn't get any new links. So that page, whatever that was, OBI sample, had no outbound links that we kept. So let's do another one.
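Here is a minimal sketch of that restartable bookkeeping, under some assumptions: the table and column names follow what was said above, but the exact schema, the helper names, and the UNIQUE constraint on Links are choices made for this example, not necessarily what spider.py does.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the two tables if needed; the schema is an assumption for this sketch."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS Pages (
            id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
            error INTEGER, old_rank REAL, new_rank REAL);
        CREATE TABLE IF NOT EXISTS Links (
            from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id));
    """)
    return conn


def page_id(conn, url):
    """Queue url with NULL html (if it is new) and return its primary key."""
    conn.execute(
        "INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES (?, NULL, 1.0)",
        (url,))
    return conn.execute("SELECT id FROM Pages WHERE url = ?", (url,)).fetchone()[0]


def save_page(conn, url, html, links):
    """Store the retrieved html and record one Links row per outbound link."""
    from_id = page_id(conn, url)
    conn.execute("UPDATE Pages SET html = ? WHERE id = ?", (html, from_id))
    for link in links:
        conn.execute(
            "INSERT OR IGNORE INTO Links (from_id, to_id) VALUES (?, ?)",
            (from_id, page_id(conn, link)))
    conn.commit()


def pick_unretrieved(conn):
    """Pick a random page that has no html and no error yet, or None if done."""
    return conn.execute(
        "SELECT id, url FROM Pages WHERE html IS NULL AND error IS NULL "
        "ORDER BY RANDOM() LIMIT 1").fetchone()


conn = open_db()
save_page(conn, "http://www.dr-chuck.com", "<html>...</html>",
          ["http://www.dr-chuck.com/a", "http://www.dr-chuck.com/b"])
print(pick_unretrieved(conn))  # one of the two queued pages, chosen at random
```

Because everything the spider knows lives in the database, stopping and restarting loses nothing: the next run simply asks for another page whose html is still NULL.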
Oh, one more page. So that one had 15 links, so let's take a look now. So now we have 15 pages. It picked this one to do, right? And now it added 15 more pages, and then if you look at Links you will see that page four, which is the one it just retrieved, links back to page one. So now we're seeing where the page rank is gonna be cool. Four links to one, four links to whatever, away we go, right? One goes to four, four goes to one. I should have probably put a
uniqueness constraint on that. It's not supposed to duplicate that. Okay, so let's run this a bunch of times now. So let's just run it for 100 pages. It'll take a minute. So you'll see it's like freaking out on certain pages and not parsing them. It's finding its way into my blog. It's finding like 27 links. This table is growing wildly at this point. It's gonna take us a while before we get to 100. It's kind of slow. Now the interesting thing is I can hit control C at any point in time. Right? And
so that blew up. But it's okay, because the data is still there, and if we go back to Pages, for example, and we refresh our data, we see we've got a ton of stuff. And this will restart with all of that in place. So if we sort this, I sorted that by HTML, you see that there are lots of pages that we've got, and it's never gonna retrieve those again, because those have HTML. So then I can run this thing again and start it up. And when I say control C, it's not just that: your computer might go down, your network might
go down. There are all kinds of things that might happen, and it just picks up where it left off, and that's what's nice about this. Okay? So that's pretty much how this works. We've got this part running. We're seeing the data flow into spider.sqlite. We're seeing that we can start this and restart this. And so what I'll do is I will come back in the next video and show you how all these things work together, and then how we actually do the page rank. So thanks again for
listening and see you in the next video. We're picking up in the middle here, where we are running a simple spider that's retrieving data and putting it into a database. I'm running this spider.py file, and it's cruising around and doing things. And the beauty of any of these spider processes is I can stop any time and just hit control C. And so we take a look at the spider.sqlite file and refresh it. And it looks like we've got 302 pages. I don't know how many we've got retrieved. 70. Okay, there we go. We've got about 100. Oh wait,
I'm looking at the wrong thing. No, no, no, no, no. Yeah, we've got about 107 pages. So what we're going to do now with 107 pages is run the page rank algorithm. Okay, so let's take a look at that code. spreset just resets the page rank, and sprank runs as many iterations of page rank as you ask for. So the basic idea is that, if you were to look at the links here, we think of page 1 pointing to page 2
as giving some of page 1's love to page 2. Page 4 has some value that it gives to page 1. You go on, and page 2 gives love to page 46, over and over and over again. But the problem is: how good is page 1, and how much positive karma does it give to page 2? And so what happens is we start by giving every page a rank of 1. We say, look, everybody starts out equal. But then, in one iteration of the page rank algorithm, we divide up
the goodness of a page across its outbound links, and then accumulate that on the receiving pages, and that becomes the next rank. So let's take a look at the code for the page rank algorithm. So this is pretty simple. It only imports sqlite3, because it's really doing everything with the database. It's going to be updating these columns right here in the database. So we're going to do some things here to speed this up. If you're thinking of Google, this ranking runs slowly and is going to run continuously to keep updating these things. So the first
thing I do is I read in all of the from_ids from Links. SELECT DISTINCT throws out any duplicates. And so I have all the from_ids, which are all the pages that have links to other pages: all the pages are in Pages, but to have a from_id in Links, you have to also have a to_id. And so we're also going to look at the pages that receive page rank, and we're kind of precaching this stuff. So we're going to do a SELECT DISTINCT of from_id and to_id and loop
through that group of things. And we're making a links list here. And so we're saying: if the from_id is the same as the to_id, we're not interested. If the from_id is not already in the from_ids that I've got, I'm going to skip it. If the to_id is not in the from_ids, I'm going to skip that too, meaning we don't want links that point off to nowhere or point to pages that we haven't retrieved yet. And that's what this is saying. So this is really going to give
us a filter on the from_ids and the to_ids from the Links table, so that we only keep the links that point to another page we've already retrieved. And then we're going to keep track of the entire set of to_ids, the destination IDs. And I'm just putting these all in lists so that I don't have to hit the database so hard. Okay, so this is roughly getting what's called the strongly connected component, meaning that for any of these IDs, there is a path from every ID to every other ID eventually. So that's called
the strongly connected component in graph theory. Then what we're going to do is select new_rank from Pages for all the from_ids, right? And so we're going to have a dictionary that's keyed on the ID, the primary key, that's what node is, and the value is the rank. And so if we look at our database, that means that for the pages that are part of the strongly connected component in Links, we're going to grab this number and stick it into a dictionary based on the
primary key, this number right here. So we're going to have a dictionary that maps this to that. Again, we want to do this as fast as possible. Now, we're only going to do one iteration at the beginning; it asks how many times you want to run it, okay? And so we just make an integer of that. We check to see if there are any values in there. If there are no values, we can't go on. And now we're going to loop, for i in range(many). So it might run once, or it might run
however many times. And then what it's going to do is compute the new page ranks. And so what it's really going to do is take the previous ranks and loop through them, and the previous ranks is the mapping of primary key to old page rank, okay? And for each node, we're going to have total equals total plus old rank, and then we're going to set the next rank for that node to zero, okay? And then what we're going to do is figure out the number of outbound links for each page rank
item, so for node and old rank in the list of the previous ranks. For this particular node, we're going to find the outbound links: we go through the links, skipping any that aren't from this node, and we make a list called give IDs, which are the IDs that this node is going to share its goodness with. A page doesn't link to itself here, because we made sure earlier that doesn't happen. And now what we're going to do is say how much goodness are
we going to flow outbound, based on the previous rank of this particular node divided by the number of outbound links it has. So that's how much we're going to give to each of our outbound links. And then, for all the IDs we're giving it to, remember we started with the next ranks being zero for these folks. These are the receiving end, and we're going to add that amount of page rank to each one. So we'll go through all of the links, give out fractional bits of our current goodness, and it's accumulated in
each one, and so eventually all the incoming links will have granted each receiving page its new value. Now I'm just going to run through and calculate the new total. And then there's this evaporation: the idea is that in the page rank algorithm there are dysfunctional shapes in which page rank can be trapped, and this evaporation is taking a fraction away from everyone and giving it back to everybody else. And so we add this evaporative factor, and then we're going to do some computations just to show some stuff, and that is we're calculating the average
difference between the page ranks, and you'll see this when I start running it, and this is going to tell us the stability of the page rank. So from one iteration to the next, the more it changes, the less stable it is, and you'll see in a sec that these things stabilize. We say, what's the average difference in the page ranks per node, which is what this is, and that's what we're going to print. And now we're going to take the new ranks and make them the old ranks, and then run the loop again.
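To make one iteration concrete, here is a small in-memory sketch of the loop as described. It is an illustration under assumptions, not the sprank.py source: the function name pagerank_step is made up, and the evaporation is modeled as taking whatever rank failed to flow out (for example from a page with no outbound links) and spreading it evenly over every page.

```python
def pagerank_step(prev_ranks, links):
    """One iteration. prev_ranks maps node -> rank; links is a list of (from_id, to_id)."""
    total = sum(prev_ranks.values())
    next_ranks = {node: 0.0 for node in prev_ranks}

    for node, old_rank in prev_ranks.items():
        # The pages this node shares its goodness with.
        give_ids = [to_id for (from_id, to_id) in links
                    if from_id == node and to_id != node and to_id in prev_ranks]
        if not give_ids:
            continue
        amount = old_rank / len(give_ids)
        for to_id in give_ids:
            next_ranks[to_id] += amount

    # Evaporation (an assumption of this sketch): spread any rank that did
    # not flow anywhere evenly across all pages so the total stays the same.
    evap = (total - sum(next_ranks.values())) / len(next_ranks)
    for node in next_ranks:
        next_ranks[node] += evap

    # Average change per node, a rough measure of how stable the ranks are.
    avg_diff = sum(abs(prev_ranks[n] - next_ranks[n]) for n in prev_ranks) / len(prev_ranks)
    return next_ranks, avg_diff


links = [(1, 2), (2, 1), (1, 3), (3, 1), (3, 2)]
ranks = {1: 1.0, 2: 1.0, 3: 1.0}
for i in range(10):
    ranks, avg_diff = pagerank_step(ranks, links)
    print(i + 1, round(avg_diff, 5))
```

The new ranks returned by one call become the previous ranks for the next call, which is the "make the new ranks the old ranks" step.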
I'm not actually updating the database each time through the page rank iteration, but at the very end I am going to do the update for all of these things and update all of the rankings with the new rank. So I'm doing an in-memory calculation so that this loop here runs screamingly fast. Even if I want to do this loop 100 times or 1000 times, it's really all just in-memory data structures. Okay, so it's probably easier just for me to show you this. The code runs quite simply: python3 sprank.py. And so I'm only going
to run it for one iteration, and that means that this loop here is just going to run one time. And so it's going to start with the new rank of one for every page, and it's going to just run one iteration and put the new rank there. Okay, and then update the old rank as well. So let's go ahead and run that once, for one iteration. Okay, and so it ran one iteration, and the average change between the previous rank and the new rank is about one. So it's actually quite crazy. So I'm going to refresh here, and
you'll see that the old rank was one, and the new rank went way down, way down, way down, way down, down a little bit, down some, up a whole bunch. Down, down, up. So you see that they went down and up. Now the sum of all of these numbers is going to be the same, right? Because all it did was flow the rank out and recalculate it. And so that's what happens with PageRank. And so what will happen is, if I run one more PageRank iteration, these numbers will be used to compute
the new new rank, and then these will be copied over to the old rank. And so you'll see that they will change again. So I'll just run it one more time. So I'm going to run one iteration, and then I'm going to hit refresh. So you see all these numbers got copied over, but now there's a new rank that's computed based on these guys. And so, this one went up. This was 0.13. That's gone up a little bit. This one's gone up some more. This one's gone up. This one went down,
right? So this one went down from 6 to 8. And you can see that the average difference between this number and this number, across all of them, went from 1 point something to 0.41. And you'll see that with these very few pages, this PageRank converges really quickly, okay? So let's run it again. And I'll just run 10, and you will watch how this converges, okay? So there you go. It converges. And you're seeing now after like 12 iterations that the difference between the old rank and the new rank, well, that's because
it's that old rank. I'll run one more iteration so that you can see. So this difference is less than 0.005. And so now you can see that these numbers are sort of stabilizing. This is the average. That 0.005 number is the average difference between these two columns. Now, if we're going to pretend to be Google for a moment, we can say python3 spider.py. So let's just do 10 more pages. Now what's going to happen here is these new pages are going to have page ranks of 1, okay? So let's get out. So if I do
a refresh now, and I look at new rank, there are these guys that have high rank. What you'll see, I hope, yeah, okay. So you see new pages, right? These are the new ones that we just retrieved. I don't know if they're linked or not, and they all got a rank of one. So some old pages are way up, 14. Some pages, if we go downwards, are way down, right? So these are like useless pages. You know, they point to somewhere, but nobody points to them. That's what happens with these PageRanks, okay? So what happens
is the new records get this 1.0. And so if I run the ranking code again, and let's just run five iterations, you'll see that the average delta goes up just briefly as it sort of assimilates these new pages, and then it goes right back down again. And so that's what's happening with Google. It's sort of running the spider to get more pages, then running the PageRank, which gets disturbed a little bit, but then it reconverges very rapidly. And of course, they've got billions of pages, and we've got hundreds of pages, but you get
the idea, okay? And so I can run PageRank like 100 times, and after a while, it's hardly changing at all. So that's 2.7 times 10 to the negative 10th power. So now, you know, let me run it one more time to update the stuff. And if I refresh this, you're going to see, look at how stable these numbers are. 14, 9, 4, 3, 5, 9, 1, 5, 6, 7. The differences are in about the seventh digit. So that's why this whole PageRank is really cool. It seems like it's really chaotic when it first starts out,
but then it settles right down, and away you go, okay? So that was just sprank, right? And spreset, we can look at that code. I won't bother running it. It just sets the rank back to 1. That's it. That's as much code as you've got. It just resets it and lets it rerun. So I'm going to stop now, and I'm going to start a new video, where I will talk about this phase here, where we're actually going to visualize the PageRank data. And what we are in the middle of is we're in the middle of the PageRank code, and
we just got done running the PageRank, and so we have spidered the site. We've run PageRank a bunch of times. spreset allows us to restart the PageRank algorithm if we want, but we're not going to play with that. We're just going to play with spdump and spjson and do the visualization, which is the fun part. So I'll go into spdump. So this is simple code, because it's really just running a SQL query and then printing stuff out, right? So we connect to our database, create a cursor, and then just do a select count, and
we're going to just show the number of links. We're going to order by the number of inbound links, descending, so we see the most linked things, and we'll see the top 50 of them. So this is just a sample. You'll tend to write little helpers like this that make your life easier, just to show you the kinds of things that you want. So I run spdump.py. And you just kind of check to make sure that it's like, oh, this looks right to me. And so here is the number of inbound links. So that's my blog that has the most
inbound links, followed by my uncategorized, whatever that is. And these are the number of inbound links within my own blog somehow. I don't know, because this is not looking at the whole internet at all. So there we go. So that's spdump. Pretty straightforward. And now we're going to go through the visualization process. And so this is going to look at all that data and write a JavaScript file that will then be fed into my visualization using D3. And spjson is going to do a big, long join. It joins
the Links with the Pages, where HTML is not null and error is null, ordered by the number of inbound links. So we're looking at the things that have the highest number of inbound links. We're going to read through all those rows and pull out the page rank for each one. We are looking for the highest and lowest rank, because these numbers can vary quite widely. They go all the way from 0.000 to 20 or 30. And so it asks, how many do you want to do?
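A query along these lines could look like the sketch below. The SQL, the tiny demo database, and the helper names are assumptions made up for illustration; only the idea (join Pages to Links, count inbound links, keep pages that have HTML and no error, then track the lowest and highest rank) comes from the description above.

```python
import sqlite3

# Assumed query shape for this sketch: one row per page that was retrieved
# without error, with its count of inbound links, most linked first.
QUERY = """
SELECT COUNT(Links.from_id) AS inbound, Pages.old_rank, Pages.new_rank,
       Pages.id, Pages.url
FROM Pages JOIN Links ON Pages.id = Links.to_id
WHERE Pages.html IS NOT NULL AND Pages.error IS NULL
GROUP BY Pages.id
ORDER BY inbound DESC, Pages.id
LIMIT ?
"""


def demo_db():
    """Build a three-page database just for this example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
                            error INTEGER, old_rank REAL, new_rank REAL);
        CREATE TABLE Links (from_id INTEGER, to_id INTEGER);
        INSERT INTO Pages VALUES (1, 'http://example.com/',  '<html>', NULL, 1.0, 2.0);
        INSERT INTO Pages VALUES (2, 'http://example.com/a', '<html>', NULL, 1.0, 0.5);
        INSERT INTO Pages VALUES (3, 'http://example.com/b', NULL,     NULL, NULL, 1.0);
        INSERT INTO Links VALUES (2, 1), (3, 1), (1, 2), (1, 3);
    """)
    return conn


def top_pages(conn, how_many):
    """Return the rows plus the lowest and highest new_rank among them."""
    rows = conn.execute(QUERY, (how_many,)).fetchall()
    if not rows:
        return rows, None, None
    ranks = [row[2] for row in rows]
    return rows, min(ranks), max(ranks)


rows, min_rank, max_rank = top_pages(demo_db(), 20)
for row in rows:
    print(row)
print("rank range:", min_rank, max_rank)
```

Page 3 has an inbound link but no HTML yet, so the WHERE clause leaves it out.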
So it only does the top, like, 20 or something. And you'll see why we need that in the visualization. And so this is just checking. And so we're going to write out a file. We'll see what the format of this is. It's just a little JavaScript file. And as we write it out, we're basically normalizing the rank. We're subtracting the minimum rank, and because we're going to turn this into line weight, the thickness of the line, we're dividing to scale it. So we're normalizing the rank to be the thickness of the
line and the size of the ball. You'll see all this. And so this is really just writing some JavaScript with the little strings and stuff like that. So these are the balls that you'll see. And then we're going to finish that part of the JavaScript. And then we're going to write all the links out, and this part is drawing all the lines. And this is again normalizing things for thickness and printing these things out. Now I don't want to go through this in tremendous detail, so I'll do python spjson.py. Let's do the top 20
nodes. And if I take a look at this file, spider.js, you can see that it's some objects that basically hold the page rank and which ID it is, and that's a way for me to be able to link back and forth. Weight is how big the little circle is. And then I have the links, and I only asked for the top 20. And then this is the thickness of the line, where the line starts, and where the line ends. So this is read by this HTML file. And somewhere it's going to read this force.js file. And
my own spider.js, which is the data. And force.js is the visualization code, that's some JavaScript. And this is D3, the visualization library. So I'm using this D3.js, which is a really great visualization library. And this is just drawing the circles and giving the circles colors and making the circles bigger and smaller and then connecting all the lines in between them. So this is just there. This data feeds that thing. And so when we're all done, you simply say open. You don't have to do anything. Open force.html. And so all this beautiful JavaScript stuff
is like, oh, wow, that's really cool, because you can move these things around. Whoa. You can see the circles are bigger. If you hover over one for a while, it shows you which page it is. You know, you can see these things, and it's kind of cool. So I gave you all this, force.js and force.html. And so that kind of visualizes the page rank. And you could use this to visualize quite a bit of stuff. You know, it'll take you a while to pull down enough data from a real website. But after you pull down 400
or 500 pages, if you have some time, then the visualization is quite interesting. But you can see why we had to pull down several hundred pages just to get this much page rank information. Okay, so that gives you a sense of how to run the page rank code in Python for Everybody. So thanks for listening. The last visualization application that we're going to take a look at is mailing lists, and that's kind of ironic. We started with the mailing lists, and we're going to end with the mailing lists. The mailing lists, of course, are from
my open source Sakai project, which I love and am very proud of. And so what we're going to do is we're going to crawl the archive of a mailing list, and then we're going to do two visualizations. One is an activity visualization, and another is a word cloud. So probably the more important thing is when I do the demonstration of how the software works. So this is a large data set, so you've got to be careful. This could spider gmane.org, which is a very free and friendly archive. This data originally came from gmane.org, but I've
got a copy of it. And so gmane.org is not rate limited, but if everyone who is watching this starts spidering gmane.org at the same time, you will crash it. It just doesn't have the horsepower to give you this data that fast. And so I've got something that can give you the data super fast and has no rate limit, on a really good server, and it's cached all around the world using a technology called CloudFlare. So please, please, please don't point this at gmane.org. Point this at the URL here, mbox.dr-chuck.net, et cetera, et cetera. And then
you can run this as fast as you like. Now, another thing to worry about is if you have a metered connection. So don't do this on a cell phone connection, because you'll pay thousands of dollars perhaps. Make sure you're on a no-cost connection before you start running this, because this is going to pull a lot of data down. If you just start this from scratch and you let it run, on a super fast connection, downloading the whole thing is probably about four hours. On my home connection, when I had like about a 10 megabit
connection, it took several days. And so just understand that in this one, it's both fun to deal with a ton of data, and it's scary to deal with a ton of data. So this one is big. You'll see the process in action, because it'll run for a while. Things will take a long time. So here's basically the flow of the data in this particular one. You are going to have the restartable spider that talks to the API, mbox.dr-chuck.net, which has a scalable copy of all this information. And again, it's going
to fill kind of a raw database, not a very clean database. It's sort of a mess. It's just enough columns to keep track of whether or not we've got this message. And so this has the ones we've retrieved so far. And so what gmane.py does is it sort of scans down to see where to retrieve next, gets that, and then keeps scanning and adding things here. So it just adds to it, and then it blows up, and then it comes in again and says, okay, I'll start here, and then it starts retrieving stuff
and fills this in, fills this in, fills this in. And sometimes you put a delay in this so you don't overwhelm networks, you don't overwhelm servers. But basically this is pretty much a raw retrieval of the email messages. And this file can get rather large. This is the one that's greater than a gigabyte. Now this data is actually really nasty. It's email data. The date formats changed. This is data that ran from 2004 to like 2012 or 2013. And so this data has got a lot of things wrong with it. It even has things
where people's email address has changed. And so it has this mapping file. This mapping file comes along with it and says, here's this one person, and here are the six email addresses that they used throughout the life of the project. And so there is a relatively complex cleanup step. So this first part here is super slow, very slow. This next part here is also slow, but it'll take, depending on how fast your computer is, somewhere between two minutes and ten minutes. This first part will take days, perhaps, depending on the speed of your network connection. And so
what gmodel does is it reads through this. It actually wipes out and re-creates index.sqlite every time it runs, so that you can change any number of things, you can respider things, you can do whatever. And this is one of those cleanup processes, and often you have to tweak the cleanup process. You look at your data, and you're like, oh, the cleanup missed something, so I've got to run it again. So this produces index.sqlite every time it runs. So this is like two to ten minutes. gmodel is two to ten minutes. And
it maps names, and when it's all said and done, this is a very small, highly normalized, nice data model. This one here, content.sqlite, has an ugly data model. index.sqlite has a pretty data model. It's got foreign keys, it's got all this stuff, all those things we talked about in the database chapter that make it efficient. And so in your mind, keep track of how slow it is to scan all the data in a database with a bad model, and then watch when you run, like, gbasic, which is a scanner, or gline, which produces
line data, or gword, and watch how fast they run. They run in like a couple of seconds at the most, and this runs in two to ten minutes. And the difference is because the data is efficiently modeled in index.sqlite. So you can take a look at that using SQLite browser and take a look at the data model. And you'll see it looks just like the stuff we talked about in the database chapter. It's got foreign keys and all those things. And so that runs, and you've got this. And then we do our visualizations and
our analysis from this clean version of all the data. And so gbasic just loops through and prints some stuff out. It's a great way to test things. It's a pretty easy to understand program, and you could take a look at it. gline does some bucketing and makes some histograms to produce a line graph. And then gword does a different histogram. It does a histogram of word frequency, and that word frequency ends up in gword.js. And then we have two HTML files that use the d3.js visualization to produce a line and a
word chart. And so in another video, I will show you how this code works, which is probably more useful than this picture. But this is a whole bunch of good stuff in this particular application. And if you really understand everything in here, you can build a pretty sophisticated data retrieval and analysis pipeline. And so that's it. Thank you for watching all these lectures, and I look forward to seeing you on the net. We're doing some code walkthroughs. If you want to get the source code, you can take a look at the sample code and download
it and work through it. And so what we're working on now is doing some retrieval and visualization of email data. It's kind of ironic. We are going to now look at the email data that we started with. It's the same Sakai developer list email data. And so there's this service called Gmane. And Gmane archives developer lists and various email lists. And I've made a copy of their data, because all the students in my class hitting their server with their API would crush it. So in order to be a nice guy, I put up a much
more powerful server with just the data from this one list. And it's about a gigabyte of data, so be real careful if you're paying for network. So the basic process we're going to go through is we're going to have a spidering process that's simple, restartable, and focused on the network problems of pulling data, to fill content.sqlite, and there's going to be a database there. This database is going to get large, about a gigabyte. And then we're going to have a cleanup process that kind of grinds through this data.
It takes a while. And so then it's going to read this mapping, and I'll show you that when it comes, because things like people's names have changed over all these years. And it does a cleanup and makes a really nice, highly relational version of this data. And then we visualize from here. And so this first step could take you several days to finish. This one will take like a few minutes to run, and then this one will just take seconds to run. And so this is a multi-step process, because if you were doing something like running something for
two days to produce a visualization, and it blew up three quarters of the way through, it would do you no good. And so that's why we break this into simple parts. But right now we're just going to focus on this part right here, and take a look at the mail bit, and retrieve the mail, and then we'll have another video to talk about the rest of this stuff. So let's take a look at the code. So here is gmane.py. That is the basic code. And hopefully this stuff is starting to look familiar. The thing that's weird
here is we've got to do some date-time parsing. And there is code that's out there, but you may have to install it. And I had to write my code in a way that didn't assume that you could install the date-time parser. And so if that's not there, it uses my own date-time parser, and that's what this code is. Don't worry too much about that. And of course we have to deal with the lack of certificates inside of Python. And so we start things out. And this is really a simple table. We've
got a Messages table that's got a primary key, the email address, when it was sent, the subject, the headers, and the body. And so what we're going to do is, because we have to pick up where we left off, we're going to select the largest primary key from the Messages table and retrieve that. And then we're going to go to the one after that. And so we know what the ID is, and we're going to pick up where we left off. And so we have a starting point that starts at either 0 or 1.
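The pick-up-where-we-left-off step might look roughly like this. It is a sketch under assumptions rather than the gmane.py source: the column names and the helper names open_messages and last_retrieved are invented for the example.

```python
import sqlite3


def open_messages(path=":memory:"):
    """Open the database and make sure the Messages table exists (assumed columns)."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS Messages (
        id INTEGER PRIMARY KEY, email TEXT, sent_at TEXT,
        subject TEXT, headers TEXT, body TEXT)""")
    return conn


def last_retrieved(conn):
    """Largest message id stored so far, or 0 when the table is empty."""
    row = conn.execute("SELECT max(id) FROM Messages").fetchone()
    return row[0] if row and row[0] is not None else 0


conn = open_messages()
print(last_retrieved(conn) + 1)   # first message to ask for on a fresh database
conn.execute("INSERT INTO Messages (id, subject) VALUES (41, 'example')")
print(last_retrieved(conn) + 1)   # after a restart, carry on after message 41
```

Note that max(id) on an empty table gives back a row containing NULL, which is why the helper checks row[0] and not just the row.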
And we're going to ask how many messages to retrieve. We've got some counters. And so we're going to say, okay, select the ID from messages where the ID equals whatever that starting value is, based on the highest number we've seen so far. And if row is not None, that means we've already retrieved this particular email message. Otherwise we're going to keep on going, and we're in good shape. And this is one that we want to retrieve. And we're subtracting one from the count so we don't retrieve more than we asked for. And so this is the base URL. This is the URL of our API, the
one where I have a nice copy of all this data on a server that's accessible worldwide and won't crash. So the format of this is you can say I would like the messages from 1 to 2, or message 101 to 102. And we can just kind of walk through these things. So that's the message ID. And so if we're going to make the URL, we're going to take the base URL, add the starting address, a slash, and then the start plus one. So we've got the slash at the end of this
starting address. And so that's how we form those. And we're going to retrieve that, and we're going to decode it. We've seen this in some other ones. We're going to check to see if we got legit data. If not, if I got a 404 Not Found or something else, we're going to quit. If someone hits Control-C, or Control-Z, we'll get the program interrupt and we'll stop. If there's some other problem, we're going to complain and keep going. And if we have five failures in a row, we're going to quit, but we'll
just keep on going because these things do have glitchy bits here. And so at this point, if we made it this far, we've retrieved the URL and we've got the number of characters we've retrieved. And then we check for bad data: this is a mail message, and they all start with From space, so if it's right, it starts with From space. We're going to tolerate up to five failures there for bad data because it could be bad. And then we're going to find a blank line
because that's the newline at the end of one line and then a blank line. And then we're going to break this into the headers, the mail headers, which is this stuff right here, up to but not including the blank line, and then the body is everything after that, okay? And so we'll just break that into pieces. Otherwise we'll complain and tolerate up to five failures. And then we're going to use a regular expression, kind of from the regular expressions chapter, to pull out an email address from the
From colon line somewhere in these headers, From colon right there. It's going to go find a less-than and then pull, oops, come on, pull this stuff out up to it. So you got the less-than, you got the parenthesis, you got one or more non-blank characters followed by the at sign, followed by one or more non-blank characters. And we'll get back a list of those. We should only get one. If we find one, we're going to grab the email. We're going to strip it and lowercase it. And if we got some little nasty less-than sign
in there, we'll tolerate that as well. So this is kind of cleanup, and you get used to this where you're like, oh, how come all those email addresses have this other stuff in them? And then we also look for it if there are no less-than signs, and that's a different pattern. Some mail messages have it this way, and others have it that way. Again, you write this code after you watch it for a while, and you're like, oh, it's crapped out and giving me bad stuff. And I make them all lower case
so they match better and I get rid of bad characters. Now I got an email address. Then what I do is I look for the date of this. So I'm going to graph these by date, so I look for this line and use a regular expression to pull that out. So I'm looking for Date colon, followed by a blank, followed by any number of characters, followed by a comma. So I'm not interested in this Wednesday bit, so I'm skipping that bit right there, and going and grabbing everything after that comma space. And so it's really
here to the end of the line. So that's the newline. So it's going all the way. It's going to pull this bit right here. That's the text. And this is where we're going to say, oh, that's kind of a funky-looking date and we want to standardize that date. So we're going to, let's see. Yeah, we're going to chop it off at the 26th character. Apparently, I don't know why we care about the 26th character, but we chop that off at the 26th character. And then we're going to parse it, and that's
going to give us back a nice clean date, sent_at. Otherwise, we're going to complain, and if we can't parse it, then we're going to tolerate five failures in a row before we quit. Then we're looking for the subject line using another regular expression. Subject line, regular expression. That's pretty easy. Up to, but not including, right? There's a blank there. It's the subject. Let me pull that out. We get the subject. Now, at this point, we've parsed it and we've got good stuff, so we reset the fail counter because I kept
saying, if you fail five straight times, you quit. And we're going to print it out, and then we're just going to insert that stuff. We've got the ID of the message, the email address that it came from, the time it was sent, the subject, and then basically the headers and the body, and we're just inserting it. And now we're going to say, every 50th we're going to commit it, so that speeds things up, and every 100th we're going to wait a second. So that's, you know, count is going up, up, up, up,
up, and every 50th you'll see it pause, and then every 100th it'll pause for a second. Mostly that's to let me hit Control-C or to not overload any server. Okay, so that's the simple one. The problem is this data just gets ugly, and so you'll find yourself wanting to reset this and start it over. This one's going to work, of course, but these are hard to build, and that's why it's a good idea. Oops. python3 gmane.py. How many messages? Well, let's just do one. Choo! Okay, so it went and
grabbed, oh, do I have this already running? 51 through 52. Let me start over. That's ls -l *.sqlite. Okay, rm content.sqlite. I must have run it to test it. So, let's run it again. Python gmane.py and ask for one message. Okay, so there we went and got message one from one to two. We got 226 two characters, and we printed out the email address, the time we got it after all that hacking, and the subject line, and that's what we got. So, if we take a look at the database and we go into the
gmane, oh, every time you see the content.sqlite-journal, that means it needed to run a commit, and it hasn't run a commit, but I'll hit enter and that will do the commit, and you see that vanish. So, now I can open it and I take a look at, how come there's no messages? Did that one not get stored in there for some reason? Hit refresh. Huh, let's run it again. Maybe it didn't commit. Maybe it's got a bug in it. Let's make a change to the code. I'm going to say this connection.commit. See that?
Connection.commit. Gonna commit there, and the other thing I'm gonna do is, every time I stop to read, I'm gonna commit right before I read it. So, I think we should, I hope that doesn't blow up. We'll see. So, the idea is, if I wanna stop, I wanna commit it. So, let's do this. Let's do one message, and now I should hit, is it committed? Now that I've put the commits in, I think that it will look better. I can refresh, and so there it is because I committed it, and I don't have, yeah, I don't
have the journal file, so that's good. So, that's a good idea to put those commits there. So, I'll just leave those commits in. When you download it, it'll have those commits in there. So, again, I put a commit here, and a commit at the very, very end, to make sure, and then I, so, I missed that. But now we get one, right? And so, let's just run it again, and you'll see how by selecting the max of the ID, it's gonna select the max of this and then add one to it, so it doesn't do
the next one. So, if I run it again, I say, give me one message, so it goes two to three, and give me two messages, right? So, I hit enter, and I can do refresh, and now you see we've got four messages, okay? And so, let's just fire this baby up. Tell it to get 100. Er, run, run, run, run, run, run, run. All right, it just goes and goes, and it pauses once in a while to do a commit, and if I made a commit every time, oop, it just paused there, now it finished.
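
The commit-every-50th, pause-every-100th rhythm, plus the extra commit at the very end, could look roughly like this. This is only a sketch under assumptions: the function name store_messages, its parameters, and the two-column table are made up for illustration and are not taken from gmane.py.

```python
import sqlite3
import time

def store_messages(conn, messages, commit_every=50, pause_every=100,
                   pause_seconds=1.0):
    """Insert (id, email) pairs, committing and pausing periodically.

    Returns (rows_processed, number_of_commits).
    """
    cur = conn.cursor()
    count = 0
    commits = 0
    for msg_id, email in messages:
        cur.execute("INSERT OR IGNORE INTO Messages (id, email) VALUES (?, ?)",
                    (msg_id, email))
        count += 1
        if count % commit_every == 0:
            conn.commit()              # batching commits keeps inserts fast
            commits += 1
        if count % pause_every == 0:
            time.sleep(pause_seconds)  # be kind to the server; room for Ctrl-C
    conn.commit()                      # final commit so nothing is left pending
    return count, commits + 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER UNIQUE, email TEXT)")
fake = [(i, "user%d@example.com" % i) for i in range(1, 121)]
processed, commit_count = store_messages(conn, fake, pause_seconds=0)
stored = conn.execute("SELECT count(*) FROM Messages").fetchone()[0]
print(processed, commit_count, stored)
```

The final commit is the part that was missing in the demo above: without it, a short run that never reaches the 50th message leaves its rows uncommitted.
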
So, this'll run, and we will get a bunch of data. The problem is, is if I just run this, it'll take about five hours, okay, to run this and get this all, and I've got a really fast connection. So, I have got a file that you can download, let's go find it, let's see if I can, let's see how long it'll take me to download this. I've got a file that you can download and save. Now, I'm gonna use the command line, curl, or wget is another command that we Linux and Mac people can use.
I don't know, you might have to use your browser to do it. Let's see how long this is gonna take. Yum, it's retrieving, minute 30. Okay, well, I'll just wait until this comes back. Okay, so now that's done. I was averaging 10 megabits a second. I downloaded about 600 megabytes, 10 megabits a second. That will probably be slower for you. But, so now if I take a look, you're gonna find that that content.sqlite is 624 megabytes. Now, what happens is I've pre-spidered this, and so now if you run gmane.py and ask for five more messages, it
will pick up where I left that one off. So it's up to message 59,000. And I think that, oh, you saw an error. You saw a bug in that one. I don't know what's wrong with that one. So let's see if, so at this point, we're gonna have most of the data. It might find its way to the very end. Once you get this, it should be not too much more. I don't know, maybe it's like 63,000 or something. So what we'll do is we will let that run, and we will come back when that
one's finished and run the next phase after it's got all of its data, okay? So thanks for listening. The work that we're doing right now is we are in the process of building a writer and visualization tool for email data that came originally from this website Gmane, but I've got my own copy of it. And so what we've done before is we ran gmane.py, and I grabbed a URL. I have a URL that has all this data, and I downloaded that, and then I ran gmane.py again to catch up, and so it took quite a
bit of catching up. But by the time I get to, remember how I said it quits after five failures? Well, it ran out of data at 60,421, and then it started failing, and then it quit. So we pretty much have all of our data now. We have finished this process in content.sqlite, okay? And if I take a look in the database browser, we can see we've got 59,823 email messages. And so if I look at any of these things, you see the headers, you see the subject line, you see the email address, you
see the body of it. So remember I split each message into the headers and the body. And so I made this as raw as I possibly could because as you saw, I had to spend so much time in gmane.py just getting the data successfully retrieved. And so I don't like cleaning the data up too much at that stage. And so what we're gonna look at next is the data cleaning process, okay? And so gmodel.py is the code we're gonna take a look at now. So let's get rid of those guys and look at gmodel.py. I don't
think I need urllib in this code. Do I have any urllib? No, so I don't need that, sorry. Fix that. Okay, so it's gonna read from the database, it's gonna use regular expressions, and zlib is a way to do some compression. And so in this one, I'm gonna compress some of the data so that I have less data; some of the text fields are gonna be compressed. I wanted to keep these fields uncompressed inside of messages. And so we have some code here that just cleans things
up. And it turns out that the email addresses in this particular mail corpus changed over time, and there's certain kinds of things. Sometimes gmane.org is the email address when people wanna hide their address. And I did all kinds of stuff, and I split it and checked to see if it ended with this. And I cleaned up things, just that kind of thing. And so I have all kinds of cleanup stuff going on in here. And I have this mapping and DNS mapping that I'll talk about in a bit where organizations sometimes sent
email with different addresses over time and people sent email from different addresses at different times. And we're gonna do the parsing of the date and that is the code for that. I'm gonna pull out the header information. This is sort of borrowed from the other code. We'll clean up the email addresses and the domain names. And we'll pull the date out, pull the subject out, pull out the message ID, various things. So here's the main body of the code. We're going to go from content.sqlite to index.sqlite. And what I'm gonna do every time is I'm gonna wipe out
index.sqlite and drop the messages, senders, subjects, and replies tables. So this is a normalized database in that it has foreign keys. So there's a messages table here with an integer primary key, the GUID for it (GUID stands for global unique ID), the sender ID, and it's gonna have blobs. These are blobs, binary large objects, for the headers and the body because I'm gonna compress them in this database to make them smaller. And then the senders, each sender has a key and then each subject line is gonna have a key and then replies are a connection from
one message to another. And so this is like a many-to-many. Now, I also have this file called mapping.sqlite and so we can take a look at that one, mapping.sqlite. And so what happened is this has two tables that I deal with by hand. And so sometimes, in the end, this was a domain that mapped to that one, like indiana.edu; that's the part of the email address after the at sign. And then these were a bunch of people that had email addresses changing throughout the project and I sort of kind of mapped them in a way.
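
Here is a small sketch of how two old-to-new mapping tables like these might be loaded into dictionaries and applied to a sender address. The table names, column names, sample rows, and the helper names load_mapping and fix_address are assumptions for illustration; the real mapping.sqlite and gmodel.py may differ.

```python
import sqlite3

# Stand-in for mapping.sqlite with two hand-maintained old -> new tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE DNSMapping (old TEXT, new TEXT)")
cur.execute("CREATE TABLE Mapping (old TEXT, new TEXT)")
cur.execute("INSERT INTO DNSMapping VALUES (?, ?)",
            ("Old-Campus.example.edu ", "example.edu"))
cur.execute("INSERT INTO Mapping VALUES (?, ?)",
            ("pat@old.example.com", "pat@example.edu"))

def load_mapping(cur, table):
    """Read an old -> new table into a dictionary, stripped and lowercased."""
    # The table name comes from our own code, never from user input.
    cur.execute("SELECT old, new FROM " + table)
    return {old.strip().lower(): new.strip().lower() for old, new in cur}

dns_mapping = load_mapping(cur, "DNSMapping")
mapping = load_mapping(cur, "Mapping")

def fix_address(addr):
    """Apply the per-person mapping first, then the per-domain mapping."""
    addr = addr.strip().lower()
    addr = mapping.get(addr, addr)        # unchanged if there is no entry
    if "@" not in addr:
        return addr
    name, _, domain = addr.partition("@")
    domain = dns_mapping.get(domain, domain)
    return name + "@" + domain

print(fix_address("Pat@old.example.com"))
print(fix_address("sam@OLD-campus.example.edu"))
print(fix_address("kim@other.org"))
```

Using get with the key itself as the default means an address or domain with no mapping entry passes through untouched.
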
And so this is just sort of like, I pull this in really quick and I read all this stuff from the DNS mapping and, other than stripping and making this lowercase, et cetera, I am just gonna make a dictionary: the DNS mapping, which is the old name to the new name, and the email address mapping from the old name to the new name. And I'm using fixsender. fixsender is there because the email addresses even within Gmane were kind of funky. So don't worry so much about this. Okay, and so now what I'm gonna do is I
opened up a connection just to read all that stuff in and now I'm going to actually open the main content, and this is a little trickier. I open that read-only. That was so that I could potentially be running the spider and running this at the same time. I get a cursor. And so I'm gonna read through, so in the content file, this is the big one, I'm gonna read through and go through every one and write all of these things in. And I'm gonna take all the email addresses and I'm going to put
those in a list. So I loaded that, I've got the mappings loaded and so now I'm going to go through every single message. I got all the senders, all the subjects and all the global unique IDs. So I read in each message. So now I'm going through content one at a time. I parse the headers. I check to see if the sender's name, email address, after it's been cleaned up, is in my mapping. mapping.get of sender, and the default is I get sender back. That's what that's saying. Look up sender, if it's in there, give me the entry of
that key, otherwise give me sender back. We're gonna print every 250 things we do. We'll complain if this is true. We're gonna go get the mapping between the senders which is a way to look up the primary key. I could have done this with a database thing but I wanted it to be fast. So that's part of the reason I read all these things in so I could have those mappings to be really fast. You'll see this takes a little while even though I got all this stuff cached. And so then if I don't have
a sender ID, meaning that I haven't seen it yet, then I'm gonna do an insert or ignore into senders and then I'm gonna do a select and then you've seen this where I grab the row back and I'm really just trying to look at the recently assigned ID and then I'm going to not only set the sender ID for this iteration loop but I'm also gonna store it in the dictionary and so that builds this dictionary up. And you'll see the same thing is true for subject ID. I'm gonna insert it into the subjects table
and get a primary key if I don't know what it is and then I'm gonna put it into, not only am I going to put it into the database but I'm also gonna put it into my dictionary. And the same thing, I guess I didn't do it for the GUID. Okay. So now what I have is the sender ID and the subject ID which are foreign keys into the sender table and the subject table and I'm gonna insert the message with the sender ID, subject ID, the sent at, headers, and body. And the values here
are the GUID, sender ID, subject ID, sent at. Now this here is zlib.compress. So what I'm taking is the message, the header, and the body and this little bit ends up with a compressed version of this stuff and you'll see it in a second. And this keeps the size of these text things down at the cost of the computation to compress and decompress when we want to read it. Okay. And then I pull the GUIDs out, the ID which is the GUID and I pull
out the primary key for this thing based on the GUID. And I update this dictionary. Okay. So let me run that code. It is doing a lot of cleanup and I'll tell you it took me a long time to make this work. So this code that I'm running now, oh, don't forget to type python3, Chuck. So this is gonna print every 250. So it did all this precaching. So that's how long it takes to do 250. Now there's 60,000 in here. And so this is really busy. The reason it's bouncing back
and forth is that every time, it makes this journal file and then does a commit. So you can kind of see that it's busy making journal files and committing and there's a lot of activity going on here. It just so happens that Atom shows me these files. Okay, so it finished. It took about three minutes to finish that, right? And so if we take a look at the size of the files, we will see that the index is much smaller. It's fully normalized. It's still 263 megabytes. It's all compressed. So let's take a
look at that in the browser. So it's 200 megabytes. But it loads up a lot faster. There we go. So we have a senders table, right? Which is just kind of a many to one table. We have a subjects table, which is a many to one table. And we have messages, which has foreign keys. It takes a little bit to load that up. Okay, and so we see the foreign keys for sender and subject and that saves us. All those foreign keys save us. And so we have, you can kind of see that I
can't see the headers and the body because now they're compressed. That saves me a whole bunch of stuff, right? It saved me a whole bunch of stuff. And so that's what's in that file. And that, we've finished this process, okay? And we've finished modeling the data and making it really clean. And we'll pick back up and the rest of the stuff we will do is actually visualizing, pulling data out of index.sqlite. The idea is this can be restarted. This can be run over and over and over. Even though it takes like three minutes to run
this, that's way better than five hours to run this. So three minutes, five hours. And then you'll see, and we'll see now, reading this is in seconds because we got it all nice and normalized in a quite pretty way. So I hope this has been useful. In the next one, we'll actually do the visualization. We are in the process of retrieving data from this Gmane server, one that I've made a copy of. And we have, so far, spidered it all, ended up with 600 megabytes of spidered information. We have run a rather complex cleanup process
that you probably don't need to fully understand. You can look at it for patterns. But in general, the cleanup process will be very sensitive to the data. And then we have this index.sqlite, which is 260 megabytes right now. And we are going to now do the easy, the fun, easy bits here where we're going to run little queries that just pull data out. And so these are much simpler. So part of what I wrote when I was doing this is I wanted to do some simple, basic calculations on the data to make sure I really
was sort of looking for anomalies, right? What was working, what wasn't working. So I wrote a series of really simple things like this gbasic, the gbasic.py code, just to give me some basic data, right? So I wrote things down and I counted things. And so, do I need urllib.request in this one? I don't think so. Let's fix that bug. It's not there. No reason to put any of that stuff in there. So it reads that index.sqlite, which is our cleaned up data. It reads through and makes a dictionary. This is a pattern you're
going to see a lot where I'm going to make a dictionary of ID to senders, to save myself repeatedly looking things up. I'm going to grab the subjects. I've cached them all. I could have done this all with SQL, but I just wanted to do things faster. And now I'm going to go through each of these messages and make a dictionary of them. I'm going to put a lot of stuff in memory. And then I'm going to do some counts. I'm going to see who has sent the most, right? The organizations. And so now I've got
to go through all the messages. I am not actually, so you'll notice that I'm not selecting the body or the headers here. I am just getting sender ID, subject ID. I probably could have done this with a join. It would have been cleaner. You can do that. You can make that change. Do that with a join so it's cleaner. And so I'm going through all the messages except not the body. So this is going to be really quick. And I'm pulling out the senders ID. I'm breaking the sender into pieces. See, my data is clean
now. I cleaned it all up in the previous processes. And if I don't have two pieces, I continue, and otherwise I get the domain name. So I have the person. I'm doing a basic dictionary histogram for the people and the domains. And then I'm going to sort them with sorted. And we're going to grab the key. We're going to sort it by how many there are, in reverse. And then print out the top few of the organizations and the top few of the people. OK? So we'll just run that code. Python gbasic.py. Let's tell it to
dump out the top 10. So we loaded 59,000 messages, 29,000 subjects, and 1,800 senders, and figured out the top 10 people and the top 10 organizations. And you can write various things like that that just sort of scream through your data and it's good to get sanity checking on your data. OK? So that's gbasic. Now I want to do gword.py because that's kind of fun. gword.py. I don't need urllib. Why do I keep putting urllib in all these things? So we'll get rid of that. So this is really simple because I'm just going to go
for the words in the subject line. And so I go through index.sqlite. I read in all of the subjects. And I make a dictionary of those. And then I go and find all the subjects. And then I'm doing this code right here. I'm pulling out the subject based on the message. And I'm doing this so that when the subjects are used more than once, I count the words more than once. str.maketrans, I talked about that in an earlier chapter. This basically throws away punctuation and numbers so that when I make my words, I don't
end up with words that are like dashes. It compresses them down. Then I strip it. I convert everything to lowercase. This is basically just to keep too many words from showing up. Then I do a split. And then I got counts, a dictionary. So this is a no punctuation, no numbers dictionary count. And then I just take the words and do a dictionary count. And then I sort them in reverse order. And I figure out what the highest and lowest is by running through a loop; I could have probably done this with a max and a min if
I felt like it. And so now I have the highest and the lowest. I should have done a max and a min on that one. Why did I do that? But oh well. And now I've got to spread out the size. And so I'm going to produce this file gword.js, which is needed by the visualization because it's going to use d3.js, a word visualizer, and gword.js. I have to tell it how big the text is. And so I'm doing some text normalization. Took me a little experimentation. So if I run this now, and I say
python gword.py, and I say python3 gword.py, which is a lot better. Oh, not python. Okay, so now I can go look at the gword.js, wherever that is, gword.js. Yep. And so this is basically, it normalized all the frequencies and made them font sizes. These are font sizes now. And so this is just the data that's needed by this gword.htm, which uses this d3 visualization word cloud code. So this pulls in all my data, and then this is just some JavaScript that draws the picture on the page. And so the easy part now is to just
open gword.htm in a browser. It just so happens on a Mac I can do this. And so that gives me a word cloud based on that data. It kind of randomizes it. It shows different stuff. But it's using this data to generate how big those things are, and then using a bit of randomness and simulated annealing to lay it out. That's not stuff that we actually have to worry about, okay? So that's how we get to the point where we're seeing a word cloud from this. Now we're going to do another visualization. And this time
we're going to do a line visualization. And we're going to create a thing called gline.js and produce, with another HTML file, we're going to use d3 and produce that output. So let's say goodbye here, goodbye, goodbye, goodbye, goodbye. So gline.py, get rid of that file. So again, I'm going to preload all of the senders in this case. And again, I could have done this with a join. Probably should have done this with a join. I'm going to preload all the messages, the sender ID, subject ID, etc. I'll load those up. And now I'm going to
read through. I'm going to have the sending organizations and the senders. And I'm going to accumulate and split the senders. And then I'm going to do a simple dictionary as I accumulate the sending organizations by splitting the person's address at the at sign. And then based on the organization, I accumulate it. And then I sort them. And I pull out the top ten organizations. I print those out. And now I'm going to produce, break this down into months. And I'll show you what this looks like in a second.
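
The count-by-organization and top-ten step can be sketched like this. The function name top_organizations and the sample addresses are made up for illustration; this shows the dictionary histogram pattern and is not the code in gline.py.

```python
def top_organizations(addresses, how_many=10):
    """Count messages per organization (the part after the at sign)
    and return the most frequent organizations, largest first."""
    counts = {}
    for addr in addresses:
        pieces = addr.split("@")
        if len(pieces) != 2:
            continue                    # skip anything that is not name@domain
        org = pieces[1]
        counts[org] = counts.get(org, 0) + 1
    # Sort the organization names by their counts, highest first.
    return sorted(counts, key=counts.get, reverse=True)[:how_many]

sample = ["a@umich.edu", "b@umich.edu", "c@gmail.com", "broken",
          "d@umich.edu", "e@gmail.com", "f@example.org"]
top_two = top_organizations(sample, how_many=2)
print(top_two)
```
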
Let's go to the gline.js. So the month looks like this, okay? So the month looks like that. So that's the first seven characters of the date. So if we look at the date, date looks like that. The month is the first seven characters. And this is the data that I've got to give it. We'll clean that up in a second. That data will look better in a moment. Go back to gline.py. And so this is... We're doing a... The key is a tuple, which is the month, and which organization it is that did it. And
it's only in the top ten organizations. And then we're going to do a... We're going to basically do a dictionary where the key is a tuple. And then we're going to sort it. Sort by key in this case, not by value. That's... And the months is going to sort that. And then we're going to write all this data out into gline.js. So let's go ahead and run this. And again, this is just the data that has to be written in a way that the JavaScript can understand it. Python, gline, python3, gline.py. Okay, so top ten
organizations. So let's take a look at that JavaScript. So this is what it looks like. So it just so happens that you got to tell it the... These are the data points, these are the lines. So this is the year, the line for University of Michigan, gmail.com, swinsburg.com. So this first column is that line points and the next line points. So all this code was to get the data in such a way that I could produce this JavaScript file. Because if I look at gline.htm, I need that data in that particular format. And I've got
all this stuff. I make a line chart. And I draw it with this data, that data. I had to go read all the documentation on how to figure this stuff out. And that's the data that I'm going to use. And I had to figure this out. And I had to transform it and make it pretty. It took me quite a while to get this to work. And this is not a JavaScript class nor a how to visualize in D3. But basically, we pulled all that stuff in. And here's the gline that came from the JavaScript.
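
The month-and-organization counting described above, where the dictionary key is a tuple and the month is the first seven characters of the date, might be sketched as follows. The function name monthly_rows, the "Month" header label, and the sample data are assumptions for illustration; the exact layout that gline.js needs may differ.

```python
def monthly_rows(messages, orgs):
    """Build one row per month with a message count for each organization.

    messages: iterable of (sent_at, sender_address) pairs
    orgs: organizations to keep, for example the top ten
    """
    counts = {}
    months = set()
    for sent_at, addr in messages:
        pieces = addr.split("@")
        if len(pieces) != 2:
            continue
        org = pieces[1]
        if org not in orgs:
            continue
        month = sent_at[:7]             # "2005-12-09T..." becomes "2005-12"
        months.add(month)
        key = (month, org)              # tuple key: which month, which org
        counts[key] = counts.get(key, 0) + 1

    rows = [["Month"] + list(orgs)]     # header row: one column per org
    for month in sorted(months):
        rows.append([month] + [counts.get((month, org), 0) for org in orgs])
    return rows

sample = [("2005-12-09T13:32:29", "a@umich.edu"),
          ("2005-12-20T08:00:00", "b@umich.edu"),
          ("2006-01-03T10:15:00", "c@gmail.com"),
          ("2006-01-04T11:00:00", "d@other.org")]
rows = monthly_rows(sample, ["umich.edu", "gmail.com"])
for row in rows:
    print(row)
```

Because the months are strings in year-month order, sorting them as text also sorts them chronologically.
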
And then it makes an array to data table. And then that data table is what gline draws. So with no further ado, let's open gline.htm to show that data. So there you go. That's the Sakai developer participation from 2005 through 2015, based on which organizations did the most commits in Sakai. And so I know that I haven't done all this code full justice. There's a lot of code here. The fun is just to kind of run it and see it. And then when the time comes, come back and see the techniques that
are used when you're trying to build your own visualization pipeline. So I hope that you found this useful. You know, this is a lot of code. Hard to explain in 15, 20 minutes. But I hope you take some time and look it over. And I hope you found all these videos. This is kind of the last walk-through video for chapter 16 of the book. And so I hope that I will see you on the net. Thank you.