Welcome to R: An Introduction. I'm Barton Poulson, and my goal in this course is to introduce you to R: what it is, how it works, and why it's arguably the language of data science. And just so you don't think I'm making things up off the top of my head, I have some actual data. This is a ranking from a survey of data mining experts on the software they use most often in their work, and take a look here at the top: R is first. In fact, its use is about 50% higher than Python's, which is the other major tool in data science. Both of them are important, but you can see why I personally am fond of R, and why it's the one I want to start with in introducing you to data science.

Now, there are a few reasons R is especially important. Number one, it's free and open source, compared to other software packages that can cost thousands of dollars per year. Also, R is optimized for vector operations, which means you can go through an entire variable, or an entire table of data, without having to explicitly write for loops; if you've ever had to do that, you know it's a pain, so this is a nice thing. Also, R has an amazing community behind it, where you can find supportive people, get examples of whatever it is you need to do, and get new developments all the time. Plus, R has over 9,000 contributed, or third-party, packages available, which make it possible to do basically anything. Or, if you want to put it in the words of Yoda, you can say: "This is R. There is no if, only how." (In this case, I'm quoting R user Simon Blomberg.)

So, very briefly, in sum, here's why I want to introduce you to R: because R is the language of data science, because it's free and open source, and because the free packages you can download and install make it possible to do nearly anything when you're working with data. I'm really glad you're here, and that I'll have this chance to show you how you can use R to do your own work with data in a more productive, more interesting, and more effective way. Thanks for joining me.
The first thing we need to do for R: An Introduction is to get set up. More specifically, we need to talk about installing R. To download it, you just need to go to the home page for The R Project for Statistical Computing, and that's at r-project.org. When you get there, you can click the link in the first paragraph that says "download R," and that brings you to a page listing all the places you can download it from. I find the easiest is simply the top one, labeled "cloud," because it automatically directs you to whichever of the mirrors below is best for your location. When you click on that, you end up at the Comprehensive R Archive Network, or CRAN, which we'll see again in this course. You come here and click on your operating system. If you're on a Mac, it takes you to a page where the version you want is right here: it's a package file, a zipped application installer. Click on it, download it, and follow the standard installation directions. If you're on a Windows PC, you probably want this one, "base"; again, click it, download it, and go through the standard installation procedure. And if you're on a Linux computer, you're probably already familiar with what you need to do, so I won't run through that.

Now, before we look at what R is actually like when you open it, there's one other thing to do, and that is to get the files we'll be using in this course. On the page where you found this video, there's a link that says "download files." If you click on it, you'll download a zipped folder called R01_Intro_Files. Download it, unzip it, and, if you want, put it on your desktop. When you open it, you'll see something like this: a single folder, and when you click on it, it opens up a collection of scripts, where the .R extension marks an R source, or script, file. There's also a folder with a few data files that we'll be using in one of these videos.
If you simply double-click this first file, it opens up in R, and let me show you what that looks like. When you open the R application, you'll probably get a setup of windows like this. On the left is the source window, or script window, where you actually do your programming. On the right is the console window, which shows you the output; right now it's got a bunch of boilerplate text. Coming back over here on the left: any line that begins with a pound sign, or hashtag (aka octothorpe), is a commented line, and it's not run. The other lines are code that can be run. By the way, you may notice a red warning just popped up on the right side; it's just telling us about something that has to do with changes in R, and it doesn't affect us.

What I'm going to do right here is put the cursor in this line and then hit Command (or Control) and Enter, which runs that line. You can see it's now shown up over here. What I've done is make available to the program a collection of datasets. Now I'll pick one of those datasets: the iris dataset, which is very well known; it's a set of measurements of three species of the iris flower. We'll do head() to see the first six lines, and there we have the sepal length, sepal width, petal length, and petal width; in this case, the rows shown are all setosa. If you want a summary of the variables, some quick descriptive statistics, we can run this next line, and now I get the quartiles and the mean, as well as the frequency of the three different species of iris. On the other hand, it's really nice to see things visually, so I'll run this basic plot command on the entire dataset. It opens a small window, which I'll make bigger, and it's a matrix of scatterplots of the measurements for the three kinds of irises, along with a funny one that includes the three categories. I'll close that window. And that is basically what R looks like, and how R works, in its simplest possible version.

Now, before we leave, I'll take a moment to clean up the application and the memory. I'm going to detach, or remove, the datasets package that I added. I already closed the plot, so I don't need to do that separately. But what I can do is clear the console: I come up to Edit and down to Clear Console, and that cleans it out.
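Pulled together, that little session amounts to just a few lines. Here's a minimal sketch of it as a script; the comment wording is mine, and the actual course file may differ slightly:

```r
# Load the built-in sample datasets that ship with R
library(datasets)

head(iris)     # First six rows of the iris data
summary(iris)  # Quartiles and means, plus counts for Species
plot(iris)     # Scatterplot matrix of the whole data frame

# Clean up: unload the datasets package when you're done
detach("package:datasets", unload = TRUE)
```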
And that was a very quick run-through of what R looks like in its native environment. But in the next movie, I'm going to show you another application we can install, called RStudio, that lays on top of this and makes interacting with R a lot easier, a lot more organized, and really a lot more fun to work with.

The next step in R: An Introduction and setting up is about something called RStudio. RStudio is a piece of software you can download in addition to the R you've already installed, and its purpose is really simple: it makes working with R easier. There are a few different ways it does this. Number one, it has consistent commands. What's funny is that the different operating systems have slightly different keyboard commands for the same operations in R, and RStudio fixes that: it makes them the same whether you're on Mac, Windows, or Linux. Also, there's a unified interface: instead of having two, three, or seventeen windows open, you have one window with the information organized. It also makes it really easy to navigate with the keyboard and to manage the information you have in R. Let me show you how to do this.

But first we have to install it. What you need to do is go to RStudio's website, which is at rstudio.com, and from there click "Download RStudio." That brings you to this page, or something like it, and you'll want to choose the desktop version. When you get there, you'll want to download the free, community version, as opposed to the $1,000-a-year version, so click here on the left. Then you come to the list of installers for supported platforms, down here on the left, and this is where you choose your operating system: the top one if you have Windows, the next one if you have a Mac, and then lots of different versions of Linux. Whichever one you need, click on it, download it, and go through the standard installation process. Then open it up, and let me show you what it's like working in RStudio.
To do this, open up this file, and we'll see what it's like in RStudio. When you open RStudio, you get one window that has several different panes in it. At the top we have the script, or source, window, and this is where you do your actual programming; you'll see it looks really similar to what we had when I opened the plain R application. The color is a little different, but that's something you can change in the preferences or options. The console is down here at the bottom, and that's where you get the text output. Over here is the Environment pane, which shows the variables you've saved, if you're using any, and then plots and other information show up here in the bottom right. You have the option of rearranging things and changing what's there as much as you want; RStudio is a flexible environment, and you can resize things by simply dragging the divider between the areas.

So let me show you a quick example, using the exact same code as in my previous example, so you can see how it works in RStudio as opposed to the regular R app we used the first time. First, I'm going to load some data by using the datasets package; I do Command (or Control) and Enter to run that line, and you can see right here that it's run the command. Then I want a quick look at the data, so I do head(iris), which shows the first six lines, and here it is down here; I can make that a little bigger if I want. Then I can do a summary by just coming back here and pressing Command or Control and Enter. I'll use a keyboard command to make the console bigger now, and we can see all of it: the same basic descriptive statistics and the same frequencies as before. I'll go back to how it was and bring this one down a little, and now we can do the plot. This time, you see, it shows up in this pane here on the side, which is nice; it's not a standalone window. Let me make that one bigger; it takes a moment to adjust, and there we have the same information we had in the R app, except here it's more organized, in a cohesive environment. And you see I'm using keyboard shortcuts to move around, which makes life really easy for dealing with the information I have in R.

I'm going to do the same cleanup: I'll detach the package that I loaded, then run a little command that clears the plots, and then, here in RStudio, a funny little command that does the same thing as pressing Ctrl+L to clear the console.
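The video doesn't spell those last commands out here, so treat this as a plausible sketch of the usual RStudio cleanup idioms rather than the course's exact lines:

```r
# Unload the datasets package
detach("package:datasets", unload = TRUE)

# Close the current plot device (only works if a plot is open)
dev.off()

# Print a form feed, which RStudio's console treats as "clear";
# same effect as pressing Ctrl+L
cat("\014")
```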
And that is a quick run-through of how you can do some very basic coding in RStudio, which, again, makes working with R more organized, more efficient, and easier overall.

In our very basic introduction to R and setting up, there's one more thing I want to mention that makes working with R really amazing, and that's the packages you can download and install. Basically, you can think of them as giving you superpowers when you're doing your analysis, because with the packages that are available you can do basically anything. Specifically, packages are bundles of code, more software, that add new functions to R so we can do new things. Now, there are two general categories of packages. There are base packages: these are installed with R, so they're already there, but they're not loaded by default; that way R doesn't use as much memory as it otherwise might. But more significant than those are the contributed, or third-party, packages. These are packages that need to be downloaded, installed, and then loaded separately, and when you get them, they make things extraordinary.

And so you may ask yourself where to get these marvelous packages that make things so super-duper. Well, you have a few choices. Number one, you can go to CRAN, the Comprehensive R Archive Network; that's an official R site that has things listed along with the official documentation. Two, you can go to a site called Crantastic, which really is just a way of listing these things; when you click on its links, it redirects you back to CRAN. And then third, you can also get R packages from GitHub, which is an entirely different process.
If you're familiar with GitHub, it's not a big deal; otherwise, you don't usually need to deal with it. But let's start with the first one, the Comprehensive R Archive Network, or CRAN. We saw this previously, when we were downloading R; this time we're going to cran.r-project.org, and we're specifically looking for the CRAN packages. That's right here on the left: click on "Packages." When you open that, you'll have an interesting option, which is to go to Task Views, and that breaks the packages down by topic. So we have here, for example, packages that deal with Bayesian inference, packages that deal with chemometrics and computational physics, and so on and so forth. If you click on any one of those, it gives you a short description of the packages that are available and what they're designed to do. Now, another place to get packages, as I said, is Crantastic, at crantastic.org. This is one that lists the most recently updated and the most popular packages, and it's a nice way of getting a sense of what people use most frequently, although it does redirect you back to CRAN to do the actual downloading. And then finally, at github.com, if you go to /trending/r, you'll see the most popular R packages on GitHub.

Now, regardless of how you get them, let me show you the ones I use most often; I find these make working with R really a lot more effective and a whole lot easier. They have kind of cryptic names. The first one is dplyr, which is for manipulating data frames. Then there's tidyr, for cleaning up information; stringr, for working with strings, or text information; lubridate, for manipulating date information; and httr, for working with website data. ggvis, where the gg stands for "grammar of graphics," is for interactive visualizations, while ggplot2 is probably the most common package for creating graphics, or data visualizations, in R. shiny is another one, which lets you create interactive applications you can install on websites. rio, for "R input/output," is for importing and exporting data. And then rmarkdown allows you to create what are called interactive notebooks, or rich documents, for sharing your information.
Now, there are others, but there's one in particular I find especially useful. I call it "the one package to load them all," and it's pacman, which, not surprisingly, stands for "package manager." I'm going to demonstrate all of these packages in other courses we have here, but let me show you very quickly how to get them working; we'll just try it in R. If you open up this file from the course files, what we have here in RStudio is the script for this particular video. I said that I use pacman; if you don't have it installed already, run this one installation line, which uses the standard installation command in R, and that adds pacman, so it shows up here under Packages. Now, I already have it installed, so you can see it right there, but it's not currently loaded. See, installing means making a package available on your hard drive, but loading means actually making it accessible to your current routines. So next I need to load it, or import it, and I can do that one of two ways. I can use require(), which gives a confirmation message; you can see it's got that little sentence there. Or I can do library(), which simply loads it without saying anything. Either way, you can see it's now checked off, so we know it's there.

Now, if you have pacman installed, even if it's not loaded, you can actually use pacman to install other packages. So what I actually do, because I have pacman installed, is go straight to this line: you write pacman and then two colons, which says "use this command, even though this package isn't loaded." And then I load an entire collection, all the things I showed you, starting with pacman itself. So now I'll run this command. What's nice about pacman is that if you don't have one of the packages, it will actually install it, make it available, and load it, and I've got to tell you, this is a much easier way to do it than the standard R routine. For base packages, meaning the ones that come with R natively, like the datasets package, you still do it the standard way: you load and unload them separately. So now I've got that one available too, and I can do the work I want to do. I'm actually not going to do that right now, because I'm going to show it to you in future videos, but I now have a whole collection of packages available that will give me a lot more functionality and make my work more effective.

I'll finish by simply unloading what I have here. If you want, pacman can unload specific packages, but the easiest way is to do p_unload(all), which unloads all of the add-on, or contributed third-party, packages; you can see I've got the full list here of what was unloaded. However, for the base packages like datasets, you need to use the standard R command detach(), which I'll use right here. And then I'll clear my console. And that's a very quick run-through of how packages can be found online, installed into R, and loaded to make more functionality available.
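Here's that whole pacman workflow in one place, as a short sketch; the package list is the one named in this video, and the comments are mine:

```r
# One-time setup with base R, in case pacman isn't installed yet
if (!require("pacman")) install.packages("pacman")

# pacman:: lets us call p_load() even though pacman isn't loaded;
# p_load() installs anything missing, then loads everything listed
pacman::p_load(pacman, dplyr, ggplot2, ggvis, httr, lubridate,
               rio, rmarkdown, shiny, stringr, tidyr)

# Base packages still load the standard way
library(datasets)

# When finished: unload all add-on packages, then the base package
p_unload(all)
detach("package:datasets", unload = TRUE)
```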
And I'll demonstrate how those work in basically every video from here on out, so you'll be able to see how to exploit their functionality to make your work a lot faster and a lot easier.

Probably the best place to start when you're working with any statistics program is basic graphics, so you can get a quick visual impression of what you're dealing with. And the simplest command of all in R is the default plot command, also known as basic X-Y plotting, for the x- and y-axes on a graph. What's neat about R's plot command is that it adapts to the data types and to the number of variables you're dealing with. Now, it's going to be a lot easier for me to simply show you how this works, so let's try it in R: just open up the script file, and we'll see how we can do some basic visualizations.

The first thing we're going to do is load some data from the datasets package that comes with R; we simply do library(datasets), and that loads it up. We're going to use the iris data, which I've shown you before and which you'll get to see many more times. Let's look at the first few lines; I'll zoom in on that. What this is, is the measurement of the sepal and petal length and width for three species of iris. It's a very famous dataset, about 100 years old, and it's a great way of getting a quick feel for what we're able to do in R. I'll come back to the full window here.

What we're going to do first is get a little information about the plot command. To get help on something in R, just type the question mark and the thing you want help on. We're in RStudio, so this opens up right here in the Help pane, and you see we've got the whole set of information: all the parameters, additional links you can click on, and then examples here at the bottom. I'm going to come over here and use the command on a categorical variable first, since that's the most basic kind of data we have, and Species, which holds the three different species, is what I want to use right here. So I type plot, and then in the parentheses you put what it is you want to plot. What I'm doing here is saying it's in the dataset iris (that's our data frame, actually), and then the dollar sign says "use this variable that's in that data." That's how you specify the whole thing. And then we get an extremely simple three-bar chart; I'll zoom in on it. What it tells you is that we have three species of iris, setosa, versicolor, and virginica, and we have 50 of each. It's nice to know that we have balanced groups, and that we have three groups, because that might affect some of the analyses you do. It's an extremely quick and easy way to begin looking at the data. I'll zoom back out.
Now let's look at a quantitative variable, one that's on an interval or ratio level of measurement. For this one, I'll do petal length, and you see I do the same thing: plot, then iris, then Petal.Length. Please note I'm not telling R that this is now a quantitative variable; on the other hand, it's able to figure that out by itself. Now, this one's a little bit funny, because it's a scatterplot; I'm going to zoom in on it. The x-axis is the index number, the row number in the dataset, so that one's really not helpful. It's the variable on the y-axis, the petal length, where you get to see the distribution. On the other hand, we know that we have 50 of each species, first the setosas, then the versicolors, and then the virginicas, so you can see that there are group differences on this measurement.

Now, what I'm going to do is ask for a specific kind of plot that breaks it down more explicitly between the categories. That is, I'm going to put in two variables: my categorical Species, then a comma, then Petal.Length, which is my quantitative measurement. I'm going to run that; again, you just hit Control (or Command) and Enter. And this is the one I'm looking for here; let's zoom in on it. Again, you see that it's adapted: it knows, for instance, that the first variable I gave it is categorical and the second is quantitative, and the most common chart for that is a box plot, so that's what it automatically chooses to do. And you can see it's a good plot here: we can see very strong separation between the groups on this particular measurement. I'll zoom back out.

Then let's try a quantitative pair. Now I'll do petal length and petal width, so it's going to be a little bit different. I'll run that command, and this one is a proper scatterplot, where we have one measurement across the bottom and one up the side. You can see there's a really strong positive association between these two: not surprisingly, as a petal gets longer, it generally also gets wider, so it just gets bigger overall. And then finally, if I want to run the plot command on the entire dataset, the entire data frame, this is what happens: we do plot and then iris. Now, we've seen this one in previous examples, but let me zoom in on it. What it is, is an entire matrix of scatterplots of the four quantitative variables, and then we have Species, which is kind of funny because it's not labeling the categories, but it shows us a dot plot of the measurements for each species. If you don't have too many variables, this is a really nice way of getting a very quick, holistic impression of what's going on in your data. And so the point of this is that the default plot command is able to adapt to the number of variables you give it and to the kind of variables you give it, and it makes life really easy.
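Here are those five calls side by side, as a minimal sketch, so you can see the adaptation at a glance:

```r
library(datasets)

plot(iris$Species)        # Factor: a simple bar chart
plot(iris$Petal.Length)   # One numeric variable: values by row index
plot(iris$Species, iris$Petal.Length)      # Factor + numeric: box plots
plot(iris$Petal.Length, iris$Petal.Width)  # Two numerics: scatterplot
plot(iris)                # Whole data frame: scatterplot matrix
```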
Now, I want you to know that it's possible to change the way these plots look by specifying some options. I'm going to do the scatterplot again, where I say plot and then, in parentheses, give my two arguments saying what I want in it: the petal length and the petal width. Then I go to another line; I'm just separating at a comma. If you want to, you can write this all as one really long line; I break it up because I think it makes it a little more readable. I'm going to specify the color with col, using a hex code, and that code is actually the red used on the datalab homepage. Then pch is for "point character," and 19 is a solid circle. Then I put a main title on it, a label on the x-axis, and a label on the y-axis. I'm going to run those now, doing Command or Control and Enter for each line, and you can see it builds up. When we've finished, we've got the whole thing; I'll zoom in on it again. This is the kind of plot you could actually use in a presentation, or possibly in a publication. So even with the base command, we're able to get really good-looking, informative, clean graphs.
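That built-up command looks roughly like the sketch below. The hex code and titles are stand-ins (I don't have the exact datalab red on hand), so swap in whatever suits you:

```r
plot(iris$Petal.Length, iris$Petal.Width,
     col  = "#CC0000",   # Placeholder red; any hex code or name works
     pch  = 19,          # Point character 19: a solid circle
     main = "Iris: Petal Length vs. Petal Width",  # Example title
     xlab = "Petal Length",
     ylab = "Petal Width")
```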
Now, what's interesting is that the plot command can do more than just show data: we can actually feed it functions. If you want, for instance, to get a cosine, I do plot and then cos, for cosine, and then I give the limits, going from zero to two times pi, because that's a relevant range for cosine. I run that, and you can see the graph there, doing our little cosine curve. I can do the exponential function from one to five, and there it is, curving up. And I can do dnorm, which is the density of a normal distribution, from minus three to plus three, and there's the good old bell curve in the bottom right. Then we can use the same kinds of options we used earlier for our scatterplot. Here I say: do a plot of dnorm, so the bell curve, from minus three to plus three on the x-axis, and now change the color to red; lwd is for "line width," and makes it thicker; then give it a title on the top, a label on the x-axis, and a label on the y-axis. We'll zoom in on that, and there is my new, improved, prettier, presentation-ready bell curve, which I got with the default plot command in R.
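As a sketch, the function-plotting calls from this segment look like this (the title and labels on the last one are example text):

```r
plot(cos, 0, 2*pi)    # Cosine curve from 0 to 2*pi
plot(exp, 1, 5)       # Exponential function from 1 to 5
plot(dnorm, -3, +3)   # Standard normal density: the bell curve

# The same options as before work here too
plot(dnorm, -3, +3,
     col  = "red",
     lwd  = 5,         # Thicker line
     main = "Standard Normal Distribution",
     xlab = "z-scores",
     ylab = "Density")
```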
And so plot is a really flexible and powerful command. Also, it's in the base package, and you'll see that we have a lot of other commands that can do even more elaborate things, but this is a great way to start: get a quick impression of your data, see what you're dealing with, and shape the analyses you do subsequently.

The next step in our introduction, and our discussion of basic graphics, is bar charts. The reason I like to talk about bar charts is this: simple is good, and when it comes to bar charts, they are the most basic graphic for the most basic data, so they're a wonderful place to start in your analysis. Let me show you how this works; just try it in R. Open up this script, and let's run through it and see how it works.

When you open the file in RStudio, the first thing we want to do is come down here and load the datasets package. Then we'll scroll down a little bit and use a dataset called mtcars. Let's get a little information about it by typing the question mark and the name of the dataset. This is Motor Trend (that's a magazine) car road-test data from 1974, so you know these cars are over 40 years old. Let's take a look at the first few rows of what's in mtcars by doing head(). I'm going to zoom in on this. What you can see is that we have a list of cars, the Mazda RX4 and the RX4 wagon, the Datsun 710, the AMC Hornet (and I actually remember these cars), and we have several variables on each of them. We have mpg, the miles per gallon; the number of cylinders; the displacement in cubic inches; the horsepower; and the final drive ratio, which has to do with the axle. Then we have the weight, in units of 1,000 pounds, and the quarter-mile time in seconds (these are a bunch of really, really slow cars). vs is for whether the cylinders are in a V or in a straight line, and am is for automatic or manual transmission. Then, going to the next line, we have gear, which is the number of gears in the transmission, and carb, for how many carburetor barrels they have (and we don't even use carburetors anymore). Anyhow, that's what's in the dataset; I'll zoom back out.

Now, if we want a really basic bar chart, you might think the most obvious thing would be to use R's barplot command (it's named for the bar chart) and then specify the dataset, mtcars, then the dollar sign, then the variable we want, cyl. So you'd think that would work, but unfortunately it doesn't.
Instead, what we get is this, which just goes through all the cases, one by one by one, and tells us how many cylinders are in each case. That's not what we want. So what we need to do is reformat the data a little bit. By the way, you would have to do the exact same thing to make a bar chart in a spreadsheet like Excel or Google Sheets: you can't do it with the raw data; you first need to create a summary table. So what we're going to do here is use the command table. We say "take this variable from this dataset and make a table of it," and we feed it into an object, a data container, called cylinders. I'm going to run that one, and you see it just showed up in the top left; let me zoom in on it. So now I have, in my environment, a data object called cylinders: it's a table, it's got a length of three and a size of 1,000 bytes, and it gives us a little more information. Let's go back to where we were. Now that I've saved that information into cylinders, which holds just the counts for each number of cylinders, I can run the barplot command, and I get the kind of plot I expected to see. From this, we see that we have a fair number of cars with four cylinders, a smaller number with six, and, because this is 1974, a lot of eight-cylinder cars in this particular dataset.
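Here's the whole sequence as a minimal sketch: the naive call, then the summarize-first version that works:

```r
library(datasets)

barplot(mtcars$cyl)  # Not what we want: one bar per car, height = cyl

cylinders <- table(mtcars$cyl)  # Summary table: counts of 4, 6, and 8
barplot(cylinders)              # A proper bar chart of those counts
```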
Now, we can also use the default plot command, which I showed you previously, on the same data. It does something a little different: it makes a line chart, where the lines are the same height as the bars. I'd probably use the barplot instead, because it's easier to tell what's going on, but this is another way of making a default chart that gives you the information you need for a categorical variable. Remember, simple is good, and that's a great way to start.

In our last video on basic graphics, we talked about bar charts. If you have a quantitative variable, then the most basic kind of chart is a histogram. This is for data that is quantitative, or scaled, or measured, or at the interval or ratio level; all of those are referring to basically the same thing. In all of those cases, you want to get an idea of what you have, and a histogram allows you to see it. Now, there are a few things you're going to be looking for with a histogram. Number one, you're going to look at the shape of the distribution: is it symmetrical, is it skewed, is it unimodal or bimodal? You're going to look for gaps, big empty spaces in the distribution. You're also going to look for outliers, unusual scores, because those can distort any of your subsequent analyses. And you'll look for symmetry, to see whether you have the same number of high and low scores or whether you have to do some sort of adjustment to the distribution. But this is going to be easier if we just try it in R, so open up this R script file, and let's take a look at how we can do histograms.
When you open the file, the first thing we need to do is come down here and load the datasets package. We'll do this by running the library command; I just do Control (or Command) and Enter. Then we can use the iris dataset again. We've looked at it before, but let's get a little information by asking for help on iris. There we have Edgar Anderson's iris data, also known as Fisher's iris data, because Fisher published an article on it, and here's the full set of information available on it. It's from 1936, so it's 80 years old. Let's take a look at the first few rows: again, sepal and petal length and width for three species of iris.

We're going to do a basic histogram on each of the four quantitative variables in here. I'm going to use the hist command: hist, then the dataset iris, then the dollar sign to say which variable, and then Sepal.Length. I run that and get my first histogram; let's zoom in on it a little. What happens here is, of course, it's a basic sort of black-line-on-white-background chart, which is fine for exploratory graphics. It gives us a default title, "Histogram of" plus the somewhat clunky variable name, which also appears on the x-axis at the bottom. It automatically adjusts the x-axis, and it chooses about seven to nine bars, which is usually a good choice for a histogram. Then, on the left, it gives us the frequency, the count of how many observations are in each group. So, for instance, we have only five irises whose sepal length is between 4 and 4.5 centimeters, I think it is. Let's zoom back out, and let's do another one, this time for sepal width; you can see that's almost a perfect bell curve. When we do petal length, we get something different. Let me zoom in on that one. This is where we see a big gap: we've got a really strong bar there at the low end (in fact, it goes above the frequency axis), and then we have a gap, and then sort of a bell curve, which lets us know there's something interesting going on with this data that we're going to want to explore a little more fully. And then we'll do one more, for petal width; I'll just run this command. You can see the same kind of pattern here: a big clump at the low end, a gap, and then sort of a bell curve beyond that.
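For reference, those four histograms are just four one-line calls:

```r
library(datasets)

hist(iris$Sepal.Length)  # Roughly bell-shaped
hist(iris$Sepal.Width)   # Almost a perfect bell curve
hist(iris$Petal.Length)  # Strong bar at the low end, then a gap
hist(iris$Petal.Width)   # Same clump-gap-bell pattern
```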
Now, another way to do this is histograms by groups, and that's an obvious thing to do here, because we have three different species of iris. What we're going to do is put the graphs into three rows, one above another, in one column. I'm going to do this by changing a graphical parameter: par is for "parameter," and I give it the number of rows I want in my output. I need to give it a combination of numbers, so I use c, which is for "concatenate"; it means "treat these numbers as one unit," where 3 is the number of rows and 1 is the number of columns. I run that, and it doesn't show anything just yet. Then I come down and do this more elaborate command. It's hist, the histogram we've been doing, on petal width, except this time, in square brackets, I put a selector, which means "use only these rows." The way I do this is by saying I want only the setosa irises. So I say iris, that's the dataset, then the dollar sign, then Species is the variable, and then two equals signs, because in computing that means "is equivalent to," and then, in quotes, spelled and capitalized exactly as it appears in the data, setosa. So that's the variable and the row selection. I'm also going to put in limits for the x-axis, because I want to manually make sure all three histograms share the same x scale. Then I specify breaks, which is for how many bars I want in the histogram, and, what's funny about this, it's really only a suggestion you give to the computer. Then I put a title above this one, give it no x label, and make it red. I'll do all of that right now, running each line, and you see I have a very skinny chart; let's zoom in on it. It's very short, but that's because I'm going to have multiple charts; it'll make more sense when we look at them all together. You can see, by the way, that the petal width for the setosa irises is on the low end.

Now let's do the same thing for versicolor. I'm going to run through all of that; it's all the same, except we make it purple. There's versicolor. Then let's do virginica last, and we'll make those blue. And now I can zoom in, and we have our three histograms: it's the same variable, petal width, but now done separately for each of the three species, and it's really easy to see what's going on. Setosa is really low; versicolor and virginica overlap, but they're still distinct distributions. This approach, by the way, is referred to as "small multiples": making many versions of the same chart on the same scale, so it's really easy to compare across groups or across conditions, which is what we're able to do right here. Now, by the way, anytime you change the graphical parameters, you want to make sure to change them back to what they were before, so here I'm doing par again, going back to one row and one column. And that's a good way of doing histograms for examining quantitative variables, and even for exploring some of the complications that can arise when you have different categories with different scores on those variables.
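Here's a sketch of that small-multiples sequence; the exact titles, breaks, and colors in the course file may differ a little from mine:

```r
par(mfrow = c(3, 1))  # Graphical parameter: 3 rows, 1 column

hist(iris$Petal.Width[iris$Species == "setosa"],
     xlim = c(0, 3),   # Shared x scale across all three charts
     breaks = 9,       # A suggestion, not a guarantee
     main = "Petal Width for Setosa", xlab = "", col = "red")

hist(iris$Petal.Width[iris$Species == "versicolor"],
     xlim = c(0, 3), breaks = 9,
     main = "Petal Width for Versicolor", xlab = "", col = "purple")

hist(iris$Petal.Width[iris$Species == "virginica"],
     xlim = c(0, 3), breaks = 9,
     main = "Petal Width for Virginica", xlab = "", col = "blue")

par(mfrow = c(1, 1))  # Restore the default layout
```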
In our two previous videos, we looked at some basic graphics for one variable at a time: bar charts for categorical variables and histograms for quantitative variables. While there's a lot more you can do with univariate distributions, you'll also want to look at bivariate distributions, and we're going to look at scatterplots as the most common version of that. You do a scatterplot when what you want to do is visualize the association between two quantitative variables. Now, it's actually more flexible than that, but this is the canonical case for a scatterplot.

And when you do one, what sorts of things do you want to look for? I mean, there's a purpose to it. Well, number one, you want to see whether the association between your two variables is linear, whether it can be described by a straight line, because most of the procedures we use assume linearity. You also want to check whether you have consistent spread in the scores as you go from one end of the x-axis to the other, because if things fan out considerably, then you have what's called heteroscedasticity, and it can really complicate some of the other analyses. As always, you want to look for outliers, because an unusual score, or especially an unusual combination of scores, can drastically throw off some of your other interpretations. And then you want to look for the correlation: is there an association between these two variables? So that's what we're looking for; let's try it in R. Simply open up this file, and let's see how it works.
The first thing we need to do in R is come down and load the datasets package, just Command (or Control) and Enter. We're going to use mtcars; we looked at it before. Here's a little bit of information on it: it's road-test data from 1974. And let's look at the first few cases; I'll zoom in on that. Again, we have miles per gallon, cylinders, and so on and so forth.

Now, anytime you're going to look at an association, it's a really good idea to look at the univariate, or one-variable-at-a-time, distributions as well. We're going to look at the association between weight and miles per gallon, so let's look at the distribution of each of those separately. I'll do that with a histogram: hist, and then in parentheses I specify the dataset, mtcars in this case, and then a dollar sign to say which variable in that dataset. So there's the histogram for weight, and, you know, it's not horrible; it looks like we've got a few on the high end there. And here's the histogram for miles per gallon: again, mostly kind of normal, but a few on the high end.

But let's look at the plot of the two of them together. Now, what's interesting is that I just use the generic plot command, feed them in, and R is able to tell that I'm giving it two quantitative variables and that a scatterplot is the best kind of plot for that. So we're going to do weight and mpg; let me zoom in on that. What you see here is one circle for each car, at the joint position of its weight and its MPG, and it's a strong downhill pattern. Not surprisingly, the more a car weighs (and we have some in this dataset over 5,000 pounds), the lower its mileage; we get down to about 10 miles per gallon here, while the smallest cars, which weigh well under 2,000 pounds, get about 30 miles per gallon.

Now, this is probably adequate for most purposes, but there are a few other things we can do. So, for instance, I'm going to add some options here: I take the same plot and then add additional arguments. I say use a solid circle: pch is for "point character," and 19 is a solid circle. cex has to do with the size of things, and 1.5 means make them 150% larger. col is for color, and I'm specifying a particular red, the one for datalab, in hex code. Then I give it a title, an x label, and a y label. We'll zoom in on that, and now we have a more polished chart, which, because of the solid red circles, also makes it easier to see the pattern that's going on in there: some really heavy cars with really bad gas mileage, and then an almost perfectly linear association up to the lighter cars with much better gas mileage.
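In code, the whole progression looks something like this; the hex color and titles are placeholders rather than the exact ones on screen:

```r
library(datasets)

hist(mtcars$wt)   # Univariate check: weight (in 1,000-lb units)
hist(mtcars$mpg)  # Univariate check: miles per gallon

plot(mtcars$wt, mtcars$mpg)  # Basic scatterplot

plot(mtcars$wt, mtcars$mpg,
     pch  = 19,         # Solid circles
     cex  = 1.5,        # Points at 150% size
     col  = "#CC0000",  # Placeholder red
     main = "MPG as a Function of Weight",  # Example title
     xlab = "Weight (1,000 lbs)",
     ylab = "MPG")
```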
And so a scatterplot is the easiest way of looking at the association between two variables, especially when those two variables are quantitative, on a scaled or measured outcome. It's something you want to do anytime you're doing an analysis: first visualize it, and then use that as the introduction to any numerical or statistical work you do after that.

As we go through these necessarily very short presentations on basic graphics, I want to finish by saying one more thing, and that is that you have the possibility of overlaying plots. That means putting one plot directly on top of, or superimposing it on, another. Now, you may ask yourself why you'd want to do this. Well, I can give you an artistic take on it. This, of course, is Pablo Picasso's Les Demoiselles d'Avignon, one of the early masterpieces of Cubism, and the idea of Cubism is that it gives you many views: several different perspectives on the same thing, simultaneously. We're going to try to do a similar thing with data. And so we can say, very quickly: thanks, Pablo.

Now, why would you overlay plots, really? The technical explanation is that you get increased information density: more information, and hopefully more insight, in the same amount of space, and hopefully the same amount of time. There is a potential risk here, though. You might be saying to yourself at this point, "You want dense? Guess what, I can do dense," and then we end up with something vaguely like this, The Garden of Earthly Delights. It's completely overwhelming, and it just makes you kind of shut down cognitively. No thank you, Hieronymus Bosch (although I do like his work). When it comes to data graphics, use restraint: just because you can do something doesn't mean you should do that thing. With overlaid plots, the general rule is this: use views that complement and support one another, that don't compete, but that give greater information in a coherent and consistent way. This is going to make a lot more sense if we just take a look at how it works in R, so open up this script, and we'll see how we can overlay plots for greater information density and greater insight.
The first thing we need to do is open up the datasets package. We're going to be using a dataset we haven't used before, about lynxes (that's the animal): Canadian lynx trappings from 1821 to 1934. If you want the actual documentation on the dataset, there it is. Now let's take a look at the first few lines of data. This one is a time series, so what's unusual about it is that it's just one line of numbers; you have to know that it starts at 1821 and goes on from there.

So let's make a default chart with a histogram, as a way of seeing whether lynx trappings were consistent or how much variability there was. We'll do hist, the default histogram, and simply put lynx in; we don't have to specify a variable, because there's only one variable in the dataset. When we do that (I'll zoom in), we get a really skewed distribution: most of the observations are down at the low end, and then it tapers off (it's actually measured in thousands). So we can tell that there's a very common range of values at the low end. On the other hand, we don't know which years those were; we're ignoring that for just a moment and taking a look at the overall distribution of trappings, regardless of year. Let's zoom back out.

We can add some options to make this one a little more intricate. We do a histogram, and then in the parentheses I specify the data. I can also tell it how many bins I want, and again, that's sort of only a suggestion, because R is going to do what it wants anyhow. I can say make it a density instead of a frequency, so it gives proportions of the total distribution. We'll change the color with col to "thistle1," because you can use color names in R. We'll give it a title here; by the way, I'm using the paste command because it's a long title, and I want it to show up as one piece even though I need to spread my command across two lines (you can go longer; I keep my command lines short so you can actually see what we're doing when we're zoomed in). So there's that one, and then we give it an x label, "Number of lynx trapped." And now we have a more elaborate chart; I'll zoom in on it. It's in a kind of little thistle, purple-lilac, color, and we've divided the bins differently: previously it was one bar for every thousand; now it's one bar for every five hundred.

But that's just one chart, and we're here to see how we can overlay charts. A really good overlay, anytime you're dealing with a histogram, is a normal distribution: you want to see whether the data are distributed normally. Now, we can already tell they're skewed here, but let's get an idea of how far they are from normal. To do this, we use the command curve, and dnorm is for the density of the normal distribution. Here I tell it x, just a generic variable name, but I tell it to use the mean of the lynx data and the standard deviation of the lynx data. We'll make it a slightly different thistle color, we'll make the line width two pixels, and then add says "stick it on the previous graph." So now I'll zoom in on that, and you can see that if we had a normal distribution with the same mean and standard deviation as these data, it would look like that. Obviously, that's not what we have, because we have this great big spike here at the low end.

Then I can do a couple of other things. I can put in what are called kernel density estimators. Those are sort of like a bell curve, except they're not parametric; instead, they follow the distribution of the data, which means they can have a lot more curves in them, though they still add up to one, like a normal distribution. So let's see what those would look like here. We're going to do lines, that's what we use for this one, and then we say density; that's the standard kernel density estimator. We'll make it blue, and there it is, on top. I'm going to do one more before we zoom in: I can change a parameter of the kernel density estimator. Here I'm using adjust, which says "average across a little more," sort of like a moving average. And now let me zoom in on that. You can see, for instance, that the blue line follows the spike at the low end a lot more closely before it dips down, while the purple line is a lot slower to change, because of the instructions I gave it with adjust = 3. And then I'm going to add one more thing, something called a rug plot: little vertical lines underneath the plot, one for each individual data point. I do that with rug, and I say just use lynx, give it a line width, or pixel width, of two, and make it gray. Zooming in, you can see that we now have the individual observations marked, and you can see why each bar is as tall as it is and why the kernel density estimator follows the distribution that it does.
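Put together, the layered chart is built like this; the bin count, colors, and title wording approximate what's shown on screen:

```r
library(datasets)

hist(lynx,
     breaks = 14,          # Suggest ~14 bins (bars of about 500)
     freq   = FALSE,       # Densities rather than raw counts
     col    = "thistle1",  # Color names work alongside hex codes
     main   = paste("Histogram of Annual Canadian Lynx",
                    "Trappings, 1821-1934"),
     xlab   = "Number of Lynx Trapped")

# Normal curve with the same mean and SD as the data, overlaid
curve(dnorm(x, mean = mean(lynx), sd = sd(lynx)),
      col = "thistle4", lwd = 2, add = TRUE)

# Kernel density estimators: follow the data, not a formula
lines(density(lynx), col = "blue", lwd = 2)
lines(density(lynx, adjust = 3), col = "purple", lwd = 2)  # Smoother

# Rug plot: one tick under the axis for each observation
rug(lynx, lwd = 2, col = "gray")
```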
This final histogram gives several different views of the same data. It's not Cubism, but it's a great way of getting a richer view of even a single variable, which can then inform the subsequent analyses you do to get more meaning and more utility out of your data.

Continuing in R: An Introduction, the next thing we need to talk about is basic statistics, and we'll begin by discussing the basic summary function in R. The idea here is that once you have done the pictures, the basic visualizations, you're going to want to get some precision, in the form of numerical or statistical information. Depending on the kinds of variables you have, you're going to want different things: counts, or frequencies, for categories, and things like quartiles and the mean for quantitative variables. We can try this in R, and you'll see that it's a very, very simple thing to do; just open up this script and follow along.

What we're going to do is load the datasets package, Control (or Command) and Enter, and we're going to look at some data and do an analysis that we've seen several times already: we'll load the iris data. Let's take a look at the first few lines; again, this is four quantitative measurements, the sepal and petal length and width, for three species of iris flowers. And what we're going to do is get a summary in three different ways. First, we'll do a summary of a categorical variable. The way we do this is we use the summary function, and then we say iris, because that's the dataset, then a dollar sign, then the name of the variable we want, in this case Species. We'll run that command, and you can see it just says setosa 50, versicolor 50, and virginica 50; those are the frequencies, or counts, for each of the three categories in the Species variable.

Now we'll get something more elaborate for a quantitative variable; we'll use sepal length for that one, and I'll just run the next line. You can see it lays things out horizontally: we have the minimum value of 4.3, then the first quartile of 5.1, the median, then the mean, then the third quartile, and then the maximum score of 7.9. This is a really nice way of getting a quick impression of the spread of scores, and by comparing the median and the mean, you can sometimes tell whether the distribution is symmetrical or whether there's skewness going on.

And then you have one more option, and that is getting a summary of the entire data frame, or dataset, at once. What I do is simply type summary and then, in the parentheses, for the argument, just the name of the dataset, iris. For this one I need to zoom in a little bit, because now it arranges things vertically. We get Sepal.Length, that's our first variable, with its quartiles and median; then Sepal.Width, Petal.Length, and Petal.Width; and then it switches over for the last one, Species, where it gives us the counts, or frequencies, of each of the three categories. So that's the most basic version of what you're able to do with the default summary function in R: it gives you quick descriptives, gives you the precision to follow up on some of the graphics we did previously, and gets you ready for your further analyses.
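All three variations are one-liners:

```r
library(datasets)

summary(iris$Species)       # Categorical: counts per category
summary(iris$Sepal.Length)  # Quantitative: min, quartiles, mean, max
summary(iris)               # Entire data frame, variable by variable
```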
As you're starting to work with R, and you're getting basic statistics, you may find you want a little more information than the base summary function gives you. In that case, you can use something called describe, and its purpose is really simple: it gets more detail. Now, this is not included in R's basic functionality. Instead, it comes from a contributed package, the psych package, and when you run describe from psych, here's what you're going to get: n, that's the sample size; the mean; the standard deviation; the median; the 10% trimmed mean; the median absolute deviation; the minimum and maximum values; the range; skewness and kurtosis; and the standard error. Now, don't forget, you still want to do this after your graphical summaries: pictures first, numbers later. But let's see how this works in R; simply open up this script, and we'll run through it step by step.

When you open R, the first thing we need to do is install the package. I'm actually going to go through my default installation of packages, because I'm going to use pacman for this; it just makes things a little bit easier. So we're going to load all these packages (this assumes, of course, that you have pacman installed already), we're going to get the datasets package, and then we'll load our iris data. We've done that lots of times before: sepal and petal length and width, plus the species. But now we're going to do something a little different: we're going to load the psych package, using p_load from the pacman package. That's why I loaded pacman already. This will download psych if you don't have it already, which might take a moment, and it downloads a few dependencies, generally other packages that need to come along with it.

Now, if you want to get some help on it, you can do p_help; anytime you see p and an underscore, that's something from pacman: p_help(psych). When you do that, it opens up a web browser and gets the PDF help. I've got it open already, because it's really big: in fact, it's 367 pages of documentation about the functions inside. Obviously, we're not going through the whole thing here. What we can do instead is look at some of it in the RStudio viewer: if you simply add this argument here, web = F, for FALSE (you can spell out the word FALSE, as long as you do it in all caps), it opens up here on the right. What we're looking at here is actually a web page, and you can click on each of these entries to get information about the individual bits and pieces.

Now, let's use describe, which comes from this package. It's for quantitative variables only, so you don't want to use it for categories. What we're going to do here is pick one quantitative variable right now, and that is iris and then Sepal.Length. When we run that one, here's what we get. The first number in the line, the 1, simply indicates the row number; we only have one row here, so that's all we get. Then it gives us the n of 150, the mean of 5.84, the standard deviation, the median, and so on and so forth, out to the standard error there at the end. Now, that's for one quantitative variable. If you want to do more than that, or especially if you want to do an entire data frame, just give the name of the data frame to describe. So here we go: describe(iris). I'm going to zoom in on that one, because now we have a lot of stuff. It lists all the variables down the side, Sepal.Length and so on, numbers them 1 through 5, and gives us the information for each one. Please note that it gives us numerical information for Species, but it shouldn't really be doing that, because that's a categorical variable; you can ignore that last line, which is why it's flagged with an asterisk right there. Otherwise, this gives you more detailed information, including things like the standard deviation and the skewness, that you might need to get a more complete picture of what you have in your data. I use describe a lot; it's a great way to complement histograms and other charts, like box plots, to give you a more precise image of your data and prepare you for your other analyses.
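As a sketch, and assuming you have pacman installed, the describe workflow runs like this:

```r
pacman::p_load(psych)  # Install (if needed) and load the psych package

p_help(psych)               # Opens the PDF documentation in a browser
p_help(psych, web = FALSE)  # Same docs in RStudio's viewer instead

describe(iris$Sepal.Length)  # One quantitative variable
describe(iris)               # Whole data frame; ignore the Species
                             # row, flagged *, since it's categorical

p_unload(psych)  # Clean up when you're done
```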
charts like box plots to give you a more precise image of your data and prepare you for your other analyses. To finish up our section in our an introduction on basic statistics, let's take a short look at selecting cases. What this does is it allows you to focus your analysis, choose particular cases and look at them more closely. Now in art, you can do this a couple of different ways. You can select by category if you have the name of a category, or you can select by value on a scaled variable. Or you can select
Let me show you how this works in R; just open up this script and we'll take a look. As with most of our other examples, we'll begin by loading the datasets package using library; just Ctrl or Command + Enter to run that command, and it's now loaded. We'll use the iris data set. We'll look at the first few cases; head iris is how we do that. Zoom in on it for a second: there's the iris data, which we've already seen several times. We'll come down and make a histogram of the petal length for all of the irises in the data set, so I say hist with the name of the data set and then Petal.Length. There's our histogram off to the right; I'll zoom in on it for a second. You see, of course, that we've got this group stuck way at the left, then we have a gap right here, and then a pretty much normal distribution for the rest of it. I'll zoom back out. We can also get some summary statistics; I'll do that right here for petal length, and there we have the minimum value, the quartiles, and the mean. Now let's do one more thing and get the name of the species (that's going to be our categorical variable) and the number of cases of each species. So I run summary, and it knows that this is a categorical variable, so we run it through and we have 50 of each; that's good. The first thing we're going to do is select cases by their category, in this case by the species of iris. We'll do this three times; we'll do it once for versicolor.
So I'm going to do a histogram where I say: use the iris data, and the dollar sign means use this variable, Petal.Length. Then in square brackets I put something to indicate select these rows, or select these cases: I say select where the variable Species equals (you've got to use the two equal signs) "versicolor". Make sure you spell and capitalize it exactly as it appears in the data. Then we'll put a title on it that says Petal Length: Versicolor. So here we go, and there are our selected cases; this is just 50 cases going into the histogram. Now, on the bottom right, we'll do a similar thing for virginica, where we simply change our selection criterion from versicolor to virginica, and we get a new title there. And then finally, we can do setosa also. So, great: that's three different histograms made by selecting values on a categorical variable, where you just type them in quotes exactly as they appear in the data. Now, another way to do this is to select by value on a quantitative or scaled variable. If you want to do that, then in the square brackets that indicate you're selecting rows, you put the variable (I'm specifying that it's in the iris data set), and then say what value you're selecting. I'm looking for values less than two, and I have the title changed to reflect that. Now, what's interesting is that this selects the setosas; it's the exact same group, so the diagram doesn't change, but the title and the method of selecting the cases did. Probably the more interesting one is when you want to use multiple selectors. Let's look for virginica, which will be our species, and we want short petals only. So this says what variable we're using, Petal.Length, and this is how we select: iris dollar sign Species, which tells it which variable, is equal to (with the two equal signs) virginica, and then I just put an ampersand and say iris Petal.Length is less than 5.5. Then I run that, I get my new title, and I'll zoom in on it. What we have here are just the virginica, but the shorter ones. And so this is a pair of selectors used simultaneously.
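Here's roughly what those selections look like as code:

```r
# Select by category, by value, and by both at once
hist(iris$Petal.Length[iris$Species == "versicolor"],
     main = "Petal Length: Versicolor")
hist(iris$Petal.Length[iris$Petal.Length < 2],
     main = "Petal Length < 2")
hist(iris$Petal.Length[iris$Species == "virginica" & iris$Petal.Length < 5.5],
     main = "Petal Length: Short Virginica")
```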
Now, another way to do this, by the way, is that if you know you're going to be using the same subsample many times, you might as well create a new data set that has just those cases. The way you do that is you specify the data that you're selecting from, then in square brackets the rows and the columns, and you use the assignment operator; that's the less-than sign and a dash, which you can read as "gets." So I'm going to create one called i.setosa, for iris setosa, and I'm going to do it by going to the iris data and, in Species, reading just the setosa rows. I then put a comma, because that first part selects the rows and I need to tell it which columns; if I want all of them, I just leave that blank. So I'm going to do that, and now you see up here in the top right (I'll zoom in on it) I have a new data object in the environment: a data frame called i.setosa. And we can look at the subsample that I've just created; we'll get the head of just those cases. You see it looks just the same as the other ones, except it only has 50 cases, as opposed to 150. We can get a summary for those cases (this time I'm doing just the petal length), and I can also get a histogram for the petal length, and it's going to be just those setosa cases. And so that's several ways of dealing with subsamples. Again, saving the selection, if you're going to be using it multiple times, allows you to drill down on the data, get a more focused picture of what's going on, and helps inform the analyses that you carry on from this point.
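A sketch of that workflow:

```r
# Save the subsample as its own data frame
i.setosa <- iris[iris$Species == "setosa", ]  # rows before the comma; blank = all columns
head(i.setosa)                  # same structure, but only 50 cases
summary(i.setosa$Petal.Length)
hist(i.setosa$Petal.Length)
```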
The next step in R: An Introduction is to talk about accessing data, and to get that started, we need to say a little bit about data formats. The reason is that sometimes, with your data, it's like talking about apples and oranges: you have fundamentally different kinds of things. There are two ways in particular that this can happen. First, you can have data of different types, and then, regardless of the type, you can have your data in different structures, and it's important to understand each of these. We'll start by talking about data types; this is like the level of measurement of a variable. You can have numeric variables, which usually come in integer (whole number), single-precision, or double-precision form. You can have character variables with text in them; we don't have string variables in R, they're all character. You can have logical variables, which are true/false, otherwise called Boolean. You can have complex numbers, and there's a data type called raw. But regardless of which kind you have, you can arrange your data into different data structures. The most common structures are vector, matrix or array, data frame, and list; we'll take a look at each of these. A vector is one or more numbers in a one-dimensional array; imagine them all in a straight line. What's interesting here is that in other situations a single number would be called a scalar, but in R it's still a vector, just a vector of length one. The important thing about vectors is that the data are all of the same type, so for instance all character or all integer. You can think of the vector as R's basic data object; most other things are variations on the vector. Going one step up from this is the matrix. A matrix has rows and columns; it's two-dimensional data. The columns all need to be the same length, and all the data needs to be of the same class. Interestingly, the columns are not named; they're referred to by index numbers, which can make matrices a little weird to work with. Then you can step up from that into an array, which is just like a matrix but for three or more dimensions. Probably the most common form, though, is the data frame. This is a two-dimensional collection that can have vectors of multiple types: you can have character variables in one column, integer variables in another, and logical in a third. The trick is, they all need to be the same length. You can think of this as the closest thing R has to a spreadsheet, and in fact, if you import a spreadsheet, it's typically going to go into a data frame. The neat thing is that R has special functions for working with data frames, things you can do with data frames that you can't do with the other structures, and we'll see how those work as we go through this course and through others. And then finally, there's the list, R's most flexible data format. You can put basically anything in a list: it's an ordered collection of elements of any class, any length, any structure. And interestingly, lists can include lists, which include lists, and so on and so forth, so it gets like the Russian nesting dolls: one inside the other, inside the other. Now, the trick is that while that may sound very flexible and very good, it's actually kind of hard to work with lists, and so a data frame is really sort of the optimal level of complexity for a data structure. And then let me mention something else here: the idea of coercion. In the world of ethics, coercion is a bad thing; in the world of data science, coercion is good. What it means here is changing a data object from one type to another: changing the level of measurement or the nature of the variable you're dealing with. So, for example, you can change a character to a logical, you can change a matrix to a data frame, you can change double precision to integer; you can do any of these. It's going to be easiest to see how this works if we go to R and give it a whirl, so open up this script and let's see how it works in RStudio.
For this demonstration of data types, you don't need to load any packages; we're just going to run through things on their own. We'll start with numeric data. What I'm going to do is create a data object, a variable called n1, my first numeric variable, and then I use the assignment operator (that's this little left arrow), and this reads as "n1 gets 15." Now, R does double precision by default. Let me run this; n1 then shows up here on the top right. If I call the name of that object, it'll show its contents in the console, so I just type n1 and run that, and there in the console at the bottom left it brought up a 1 in square brackets; that's the index number for the first element, and this is a vector of one number, but there it is, and we get the value of 15. We can also use the R command typeof to get confirmation of what type of variable this is, and it's double precision by default. We can do another one with 1.5: we get its contents, 1.5, and we see that it also is double precision. Then we come down and do a character. I'm calling it c1, for my first character variable: I write c1, the name of the object I want to create, then the assignment operator (the less-than sign and dash, which reads as "gets"), and then I have, in double quotes, the lowercase letter c; that's just something I chose. In other languages, you would use single quotes for a single character and double quotes for strings; they're the same thing in R. So I feed that in, you can see it shows up in the global environment there on the right, we can call it forward, and you see it shows up with the double quotes on it. We get the typeof, and it's character; that's good. If we want an entire string of text, I can feed that into c2, just by having it all in the double quotes. We pull it out, and we see that it also is listed as character, even though in other languages it would be called a string. We can do a logical: this is l1, for logical first, and I feed in TRUE. When you write TRUE or FALSE, they have to be in all caps, or you can use just the capital T or the capital F. Then I call that one out and it says TRUE. Notice, by the way, there are no quotes around it; that's one way you can tell it's a logical and not a character. If we put quotes around it, it would be a character variable. We get the typeof, and there we go: it's logical. I said you can also use the abbreviation, so for my second logical variable, l2, I'll just use F. I feed that in, and now when I ask it to tell me what it is, it prints out the whole word FALSE; and the typeof, again, is logical. Then we come down to data structures. I'm going to create a vector, which is a one-dimensional collection, and I'm doing it by creating v1, for vector one. I use c here, which stands for concatenate (you can also think of it as combine or collect), and I'm going to put five numbers in there; you need to use a comma between the values. Then I call out the object, and there are my five numbers. Notice it shows them without the commas, but I had to have the commas going in. Then I ask R, is it a vector? That's is.vector, and it's just going to say TRUE. Yes, it is.
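Those first examples look like this in code; the five values in v1 are stand-ins for whatever is typed on screen:

```r
# Basic data types and a first vector
n1 <- 15                # numeric; "n1 gets 15"
typeof(n1)              # "double" (the default)
c1 <- "c"               # character; single or double quotes both work in R
typeof(c1)              # "character"
l1 <- TRUE              # logical; TRUE/FALSE or T/F, always capitalized
typeof(l1)              # "logical"
v1 <- c(1, 2, 3, 4, 5)  # c() concatenates values into a vector
is.vector(v1)           # TRUE
```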
I can also make a vector of characters; I do that right here, I get the characters, and it's also a vector. And we can make a vector of logical values, TRUE and FALSE; call that, and it's a vector also. Now, a matrix, you may remember, goes in more than one dimension. In this case, I'm going to call it m1, for matrix one, and I'm using the matrix function. So I'm saying matrix, and then combine these values, a set of T's and F's, and then I say how many rows I want in it; it can figure out the number of columns by doing some math. So I'm going to put that into m1, and then I'll call it to see. It displays the values in rows and columns, and it writes out the full TRUE or FALSE. Now I can do another one, a second matrix, where I explicitly shape the values into rows and columns as I type them. That's for my convenience; R doesn't care that I broke the values up to make rows and columns, but it's a nicer way of working with it. And if I want to tell it to organize the values by rows, I can specify that with byrow = T, for true. I do that, and now I have the a, b, c, d. And you see, by the way, the index numbers: on the left are the row index numbers, that's row one and row two, and on the top are the column index numbers, which come second; that's why it's blank and then 1 for the first column, and blank and then 2 for the second column. Then we can make an array. What I'm going to do here is create data using the colon operator, which says give me the numbers one through 24 (I still have to use concatenate to combine them), and then give the dimensions of my array, which go rows, columns, and then tables, because I'm using three dimensions here. I'm going to feed that into an object called a1, and there's my array right there; you can see that I have two tables. In fact, let me zoom in on that one. It starts at the last level, which is the table, and then the rows and columns are listed separately for each table.
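Here's a sketch of those structures; the particular T/F and letter values are illustrative stand-ins for the ones typed on screen:

```r
# Matrices and a three-dimensional array
m1 <- matrix(c(T, T, F, F, T, F), nrow = 2)               # fills column by column
m2 <- matrix(c("a", "b", "c", "d"), nrow = 2, byrow = T)  # fill row by row instead
a1 <- array(c(1:24), c(4, 3, 2))   # dimensions go rows, columns, tables
```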
A data frame allows me to combine vectors of the same length but of different types. What I'm doing here is creating a vector of numeric values, one of character values, and one of logical values; these are three different vectors. Then I'm going to combine them into a single data frame, called dfa, for data frame a. Now, the trick here is that we get some unintentional coercion by just using cbind, the column bind: it coerces everything to the most general format. I had numeric, character, and logical variables, and the most general is character, so it turned everything into a character variable. That's a problem; it's not what I wanted. I have to add another function and tell it specifically to make a data frame. When I do that, I can combine them, and now you see it's maintained the data types of each of the variables; that's the way I want it. And then finally, I can do a list. I'm going to create three objects here: object one, which is numeric with three values; object two, which is character with four; and object three, which is logical with five. Then I combine them into a list using the list function, put them into list1, and we can see the contents of list1. You can see it's kind of a funky structure, and it can be hard to read, but all the information is there. And then we're going to do something that's kind of hard to get your head around, because I'm going to create a new list that has list1 in it. So list2 has the same three objects, plus list1 added on. I'm going to zoom in on that one, and you can see it's a lot longer, and we get a lot of index numbers there in the brackets. There are the three numeric values, the four character values, and the five logical values, and then here they are repeated; but that's because they're all part of list1, which I included in this list.
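As code, that might look like the following; note that I use cbind.data.frame for the type-preserving step, since plain cbind collapses everything to character:

```r
# Data frame: vectors of equal length but different types
vNumeric   <- c(1, 2, 3)
vCharacter <- c("a", "b", "c")
vLogical   <- c(T, F, T)
cbind(vNumeric, vCharacter, vLogical)   # unintended coercion: all character
dfa <- cbind.data.frame(vNumeric, vCharacter, vLogical)  # columns keep their types

# List: objects of any class and length, even other lists
o1 <- c(1, 2, 3)
o2 <- c("a", "b", "c", "d")
o3 <- c(T, F, T, T, F)
list1 <- list(o1, o2, o3)
list2 <- list(o1, o2, o3, list1)        # a list that contains a list
```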
And so those are some of the different ways that you can structure data of different types. But you'll also want to know that we can coerce data into different types to serve our different purposes. The next thing we need to talk about is coercing types. There's automatic coercion, which we've seen a little bit of, where the data automatically goes to the least restrictive data type. So, for instance, if we take a 1, which is numeric, a "b" in quotes, which is character, and a logical value, and feed them all into this object, coerce1 (and by the way, by putting parentheses around the whole assignment, it both saves the object and shows us the result), you can see that what it's done is take all of them and make them all character, because that's the least specific, most general format. That will happen, so you've got to watch out: you don't want things getting coerced when you're not paying attention. On the other hand, you can coerce things explicitly, if you want them to go a particular way. So I can take this variable right here, coerce2, and put a 5 into it; we get its type, and we see that it's double. Okay, that's fine. What if I want to make it an integer? Then what I do is use the command as.integer: I run that, feed it into coerce3, and it looks the same when we see the output, but now it is an integer; that's how it's represented in memory. I can also take character variables. Here I have "1", "2", and "3" in quotes, which makes them characters, and you can see that they're all character. But now I can feed them in with as.numeric, and it's able to see that there are numbers in there and coerce them to numeric. Now you see it has lost the quotes, and it goes to the default double precision. Probably the coercion you'll do most often is with a matrix; let's take a look. I'll make a matrix of nine numbers in three rows and three columns; there they are. And what we're going to do is coerce it to a data frame. Now, that doesn't change the way it looks; it's going to look nearly the same, but there are a lot of functions you can only use with data frames and not with matrices. This one, by the way, asks, is it a matrix? And the answer is TRUE. But now let's do the same thing and just add on as.data.frame. Now we've told it to make a data frame, and you see it looks basically the same; it's just listed a little differently. The matrix had its index numbers for the rows and the columns; this one has a row index and then variable names across the top, and it's just automatically named them V1, V2, and V3, but the numbers in it look exactly the same. On the other hand, if we come back here and ask, is it a data frame, we get TRUE. So that was a very long discussion, but the point is this: data comes in different types and in different structures, and you're able to manipulate those, so you can get your data in the format and arrangement that you need for doing your analyses in R.
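A compact version of those coercion examples:

```r
# Automatic coercion: mixed types collapse to the most general (character)
(coerce1 <- c(1, "b", TRUE))    # all three become character

# Explicit coercion with as.* functions
coerce2 <- 5
typeof(coerce2)                 # "double"
coerce3 <- as.integer(5)
typeof(coerce3)                 # "integer"
coerce4 <- as.numeric(c("1", "2", "3"))
typeof(coerce4)                 # "double"; the quotes are gone

# The common one: matrix to data frame
m3 <- matrix(1:9, nrow = 3)
is.matrix(m3)                   # TRUE
coerce5 <- as.data.frame(m3)    # columns become V1, V2, V3
is.data.frame(coerce5)          # TRUE
```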
To continue our introduction to accessing data in R, we want to talk about factors, and depending on the kind of work you do, this may be a really important topic. Factors have to do with categories and the names of those categories. Specifically, a factor is an attribute of a vector that specifies the possible values and their order. It's going to be a lot easier to see if we just try it in R, so let me demonstrate some of the variations; just open up this script and we can run through it together. What we're going to do here is create a bunch of artificial data, and then we're going to see how it works. First, I'm going to create a variable x1 with the numbers one through three, and by putting the assignment in parentheses here, it'll both store the object in the environment and display it in the console. So there we have three numbers: one, two, and three. I'm going to create another variable, y, that's the numbers one through nine; so there that is. Now what I want to do is combine these two, and I'm going to use cbind.data.frame, a column bind that produces a data frame. So it's going to put them together, make them a data frame, and save them into a new object I'm creating called df1, for data frame one. We get to see the results of that; let me zoom in on it a little bit. There you can see we have nine rows of data: we have one variable, x1, from the one that I created, then we have y, and then we have the nine indexes, or row IDs, there down the side. Please note that the first one, x1, only had three values, so what R did is repeat it; you see it happening three times: 1, 2, 3, 1, 2, 3, 1, 2, 3. And what we want to find out now is what kind of variable x1 is in this data frame. Well, it's an integer, and when we get the structure, it shows that it's still an integer, if we're looking at this line right here. Okay, but we can change it to a factor by using as.factor, and R is going to treat it differently then. So I'm going to create a new one called x2 that, again, is just the numbers one, two, and three, but now I'm telling R that those specifically represent a factor. Then I'll create a new data frame using this x2 that I saved as a factor and the one-through-nine y that we had. Now, at this point, it looks the same. But if we come back to where we were and get the typeof, it's still an integer; that's fine. But then we get the structure of df2, and now it tells us that x2, instead of being an integer, is a factor with three levels. It gives us the three levels in quotes, "1", "2", and "3", and then it lists the data. Now, if we want to take an existing variable and define it as a factor, we can do that too. Here I'll create yet another variable with three values in it, and then we'll bind it to y in a data frame. Then I'm going to use the function factor right here: I tell it to reclassify the variable x3 as a factor, feed it back into the same place, and say that these are the levels of the factor. And because I put it in parentheses, it'll show it to us in the console; there we have it. Let's get the type: it's an integer, but the structure shows it, again, as a factor. So that's one way we can take an existing variable and turn it into a factor.
If you want to do labels, we can do it this way. We'll do x4 (again, the one through three) and bind it to y to make a data frame. And here I'm going to take the existing variable in df4, x4, and give it labels, text labels: I'm going to say that they are macOS, Windows, and Linux, three operating systems. And please note, I need to put those in the same order that I want them to line up with the numbers, so one will be macOS, two will be Windows, and three will be Linux. I run that through, we can pull it up here, and now you can see how it goes through and changes that factor to the text values, even though I entered it numerically. I run typeof to see what it is: it's still called integer, even though it's showing me words. And the structure (this is an important one, let's zoom in on it for a second): the structure here at the bottom says it's a factor with three levels and starts giving me the labels, but then it shows that those are actually the numbers one, two, and three underneath. If you're used to working with a program like SPSS, where you can have values and then add value labels on top of them, it's the same kind of concept here. Then I want to show you how we can switch the order of things, and this gets a little confusing, so try it a couple of times and see if you can follow the logic. We'll create another variable, x5, that's just the one, two, and three, and we'll bind it to y; there's our data frame, just like in the other examples. Now what I'm going to do is take that new variable x5 in the data frame df5, and notice that here I'm listing the levels, but in a different order; I'm changing the order that I put them in, and then I'm lining up the labels with them. When I run that through, you can see the labels (maybe, yes, no), and it's showing us the nine values. And this is an interesting one, because they're ordered: R puts them with the less-than sign at each point, to indicate which one comes first and which one comes later. We can take a look at the actual data frame that I made; I'll zoom in on that. We know the first case is a one, because when I created this, it was 1, 2, 3, and you can see which label that one now lines up with. By putting the levels in this order, that value falls in the middle of the ordering. There may be situations in which you want to do that; I just want you to know that you have this flexibility in creating your factor labels in R. And finally, we can check the typeof on that: it's still an integer, because it's still coded numerically underneath, but we can get the structure and see how that works. So factors give you the opportunity to assign labels to your variables and then use them as factors in various analyses. If you do experimental research, this sort of thing becomes really important, and it gives you an additional possibility for your analyses in R as you define your numerical variables as factors.
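Here's a sketch of those factor manipulations; the ordered-factor levels and labels at the end are illustrative stand-ins for the exact ones used on screen:

```r
x1  <- c(1:3)
y   <- c(1:9)
df1 <- cbind.data.frame(x1, y)   # x1 recycles to fill nine rows
str(df1)                          # x1 is an integer

x2  <- as.factor(c(1:3))          # declare the factor up front
df2 <- cbind.data.frame(x2, y)
str(df2)                          # x2: factor w/ 3 levels "1","2","3"

df4 <- cbind.data.frame(x4 = c(1:3), y)
df4$x4 <- factor(df4$x4,
                 labels = c("macOS", "Windows", "Linux"))  # labels in level order

df5 <- cbind.data.frame(x5 = c(1:3), y)
df5$x5 <- ordered(df5$x5,
                  levels = c(3, 1, 2),               # reorder the levels
                  labels = c("No", "Maybe", "Yes"))  # No < Maybe < Yes
```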
The next step in R: An Introduction, on accessing data, is entering data. This is where you're typing it in manually, and I like to think of it as a version of ad hoc data, because under most circumstances you would import a data set; but there are situations in which you need just a small amount of data right away, and you can type it in this way. Now, there are several methods available for this: there's the colon operator; there's seq, which is for sequence; there's c, which is short for concatenate; there's scan; and there's rep. I'm going to show you how each of these works. I'll also mention this little one, the less-than sign and a dash; that is the assignment operator in R. Let's take a look in R, and I'll explain how all of it works; just open up this script and we'll give it a whirl. We'll begin with a little discussion of the assignment operator. The less-than-dash is used to assign values to a variable, so it's called the assignment operator. Now, a lot of other programs would use an equals sign, but we use this one that looks like an arrow, and you read it as "gets": so "x gets 5." It can go in the other direction, pointing to the right, though that would be very unusual. And you can use an equals sign; R knows what you mean, but those are generally considered poor form. And that's not just arbitrary: if you look at the Google style guide for R, it's specific about that. In RStudio, you have a shortcut for this: if you press Option and dash, it inserts the assignment operator and a space. So I'll come down here right now, do Option-dash, and there you see it. That's a nice little shortcut you can use in RStudio when you're doing your ad hoc data entry. Let's start by looking at the colon operator, and most of this you'll have seen already. What it means is that you simply stick a colon between two numbers, and it goes through them sequentially.
So I'm creating a variable x1, then I have the assignment operator, and it gets 0:10, which means it gets the numbers zero through 10, and there they all are. (I'm going to delete this stray colon in the console that's waiting for me to do something here.) Now, if we want to go in descending order, just put the higher number first: I'll put 10:0, and there it goes the other way. Next, seq, which is short for sequence, is a way of being a little more specific about what you want. If you want, we can call the help on seq; it's right over here, for sequence generation, and there's the information. We can do ascending values: seq(10) gives one through 10; it doesn't start at zero, it starts at one. But you can also specify how much you want things to jump by. So if you want to count down in threes, I do 30 to 0, by negative 3, which means step down by threes. We'll run that one, and because it's in parentheses, it'll both save it to the environment and show it on the console right away. So those are ways of doing sequential numbers, and that can be really helpful. Now, if you want to enter an arbitrary collection of numbers in any order, you can use c, which stands for concatenate; you can also think of it as combine or collect. We can call the help on that one; there it is. And let's just take these numbers and use c to combine them into the data object x5; we can pull it up, and you see it just went right through. An interesting one is scan, and this is where we're entering data live. We'll do scan here; get some help on that one, and you can see it reads data values. This one takes a little bit of explanation. I'm going to create an object x6, and I'm feeding into it scan with opening and closing parentheses, because I'm running that command. So here's what happens: I run that one, and then down here in the console you see that it now shows a 1 and a colon, and I can just start typing numbers, hitting Enter after each one. I can type in however many I want, and when I'm done, I just hit Enter twice, and it reads them all. If you want to see what's in there, come back up here and just call the name of that object: there are the numbers I entered. And so there may be situations in which that makes it a lot easier to enter data, especially if you're using a ten-key pad. Now, rep, you can guess, is for repetition. We'll call the help on that one (replicate elements), and here's what we're going to do: for x7, we're going to repeat, or replicate, TRUE, and we're going to do it five times. Then we call x7, and there are our five TRUEs, all in a row. If you want to repeat more than one value, it depends how you set things up. Here, I'm going to repeat TRUE and FALSE, but by doing it as a set, using c, the concatenate, to collect the set, what it's going to do is repeat that set, in order, five times: TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, and so on. That's fine. But what if you want the first one five times and then the second one five times? Think of it like collating on a photocopier: if you don't want it collated, you use each, and that's going to give TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE. And so these are various ways you can set up data and get it in, really, for an ad hoc or as-needed analysis; it's also a way of checking how functions work, as I've done in a lot of examples here.
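All of those entry methods in one place; the values fed to c are examples standing in for whatever you type:

```r
x1 <- 0:10                    # colon operator: 0 through 10
x2 <- 10:0                    # descending
x3 <- seq(10)                 # 1 through 10
x4 <- seq(30, 0, by = -3)     # 30 down to 0, stepping by 3
x5 <- c(5, 4, 1, 6, 7, 2, 3)  # arbitrary values via concatenate
# x6 <- scan()                # interactive: type values, hit Enter twice to finish
x7 <- rep(TRUE, 5)            # TRUE five times
x8 <- rep(c(TRUE, FALSE), 5)        # the set, collated: T F T F ...
x9 <- rep(c(TRUE, FALSE), each = 5) # uncollated: T T T T T F F F F F
```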
You can explore the possibilities and see how you can use these in your own work. The next step in R: An Introduction, on accessing data, is importing data, which will probably be the most common way of getting data into R. Now, the goal here is to make it easy: get the data in there, get a large amount, get it in quickly, and get processing as soon as you can. There are a few kinds of data files you might want to import. There are CSV files; CSV stands for comma-separated values, and they're sort of the plain-text version of a spreadsheet. Any spreadsheet program can export data as a CSV, and nearly any data program at all can read them. There are also straight text files, .txt; those can be opened in text editors and word processors. Then there are .xlsx files, which are Excel spreadsheets, as well as the older .xls version. And finally, if you're going to get fancy, you have the opportunity to import JSON, JavaScript Object Notation; if you're using web data, you might be dealing with that kind of format. Now, R has built-in functions for importing data in many formats, including the ones I just mentioned. But if you really want to make your life easy, you can use just one: a package that I load every time I use R is rio, which is short for R input/output. What rio does is combine all of R's import functions into one simple utility with consistent syntax and functionality. It makes life so much easier. Let's see how this all works in R; just open up this script
and we'll run through the examples all the way through. But there is one thing you're going to want to do first, and that is to go to the course files that we downloaded at the beginning of this course, the individual R scripts, because this folder right here is significant: it's a collection of three data sets, and I'm going to click on it. They're all called mbb, and the reason is that they contain Google Trends information on searches for Mozart, Beethoven, and Bach, three major classical composers. It's all about the relative popularity of these three search terms over a period of several years, and I have it here in CSV (comma-separated value) format, as a text file (.txt), and even as an Excel spreadsheet. Now let's go to R, and we'll open up each one of these. The first thing we need to do is make sure that you have rio. Now, I've said this before: rio is one of the packages I load every time, so I'm going to use pacman and do my standard loading of packages. So rio's available now. I do want to tell you one significant thing about Excel files, and for this we're going to go to the official R documentation. If you click on this link, it'll open up your web browser at a shortcut page into the R documentation, and here's what it says; I'm actually going to read this verbatim. "Reading Excel spreadsheets. The most common R data import/export question seems to be: how do I read an Excel spreadsheet? This chapter collects together advice and options given earlier. Note that most of the advice is for pre-Excel 2007 spreadsheets and not the later xlsx format. The first piece of advice is to avoid doing so if possible. If you have access to Excel, export the data you want from Excel in tab-delimited or comma-separated form, and use read.delim or read.csv to import it into R. You may need to use read.delim2 or read.csv2 in a locale that uses comma as the decimal point. Exporting a DIF file and reading it using read.DIF is another possibility."
Okay, so really what they're saying is: don't do it. Well, let's go back to R; it even says right here, "you have been warned." But let's make life easy by using rio. Now, if you've saved these three files to your desktop, then it's really easy to import them this way. We'll start with the CSV. rio_csv is the name of the object that I'm going to import into, and all we need is the command import. We don't have to specify that it's a CSV, or that it has headers, or anything; we just use import, and then, in quotes and in parentheses, we put the name and location of the file. So on a Mac, the desktop path shows up this way. I'm going to run that, and you can see it just showed up in my environment on the top right; I'll expand that a little bit. I now have a data frame; I'll come back out. Let's take a look at the first few rows of that data frame. I'll zoom up, and you can see we have months listed, and then the relative popularity of searches for Mozart, Beethoven, and Bach during those months. Now, if I want to read the text file, what's really nice is that I can use the exact same command, import; I just give the location and the name of the file, changing the extension to .txt. I run that, we look at the head, and you'll see it's exactly the same. No difference; piece of cake. What's nice about rio is that I can even do the .xlsx file. It helps that there's only one sheet in that file and that it's set up to look exactly the same as the others. I run that, and you see that, once again, it's the same thing. rio was able to read all of these automatically; it makes life very easy.
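The rio calls just described would look something like this; the desktop paths are Mac-style examples, and yours will differ:

```r
pacman::p_load(rio)                        # load rio via pacman
rio_csv  <- import("~/Desktop/mbb.csv")    # the same command works for every format
rio_txt  <- import("~/Desktop/mbb.txt")
rio_xlsx <- import("~/Desktop/mbb.xlsx")
head(rio_csv)                              # months plus Mozart, Beethoven, Bach
```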
Another neat thing is that R has something called a Data Viewer; we can get a little bit of information on it in the help. You invoke the Data Viewer with View, with a capital V, and then say what it is you want to see; we'll do rio_csv. When we run that command, it opens up a new tab here, and it's like a spreadsheet. In fact, it's sortable: we can click on a column header to go from the lowest to the highest, and vice versa, and you see that Mozart is actually setting the range here. That's one way to do it. You can also come over here and just click on this little icon; it looks like a calendar, but it is, in fact, the same thing. We can click on that, and now you see we get a viewer of that file as well. I'm going to close both of those, and I'm going to show you the built-in R commands for reading files. Now, these are the ones rio uses under the hood, and we don't have to go through all of this, but you may encounter these commands in a lot of existing code, because not everybody uses rio, and I want you to see how they work. If you have a text file saved in tab-delimited format, you need the complete address, and you might try something like read.table, which is normally the command, saying that you have a header, that there are variable names across the top. But when you read it this way, you're going to get an error message, and, you know, it's frustrating. That's because there are missing values in there, in the top-left corner. So what we need to do is be a little more specific about what the separator is. I do the same thing, read.table with the name of the file in this location and a header, but this is where I say the separator is a tab; the backslash-t indicates a tab. So if I run that one, it shows up; it reads the file properly. We can also do CSV, and the nice thing there is that you don't have to specify the delimiter, because CSV means comma-separated, so R knows what it is. I can read that one in the exact same way, and if I want to, I can come over here, click on the viewer, and see the data that way also.
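Those built-in readers, roughly as described (again, example paths):

```r
# Tab-delimited text: specify the header and the tab separator
df_txt <- read.table("~/Desktop/mbb.txt", header = TRUE, sep = "\t")

# CSV: the delimiter is implied by the format
df_csv <- read.csv("~/Desktop/mbb.csv", header = TRUE)

View(df_csv)   # open the result in the Data Viewer
```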
And so it's really easy to import data, especially if you use the package rio, which is able to read the format automatically, get the data in properly, and get you started on your analyses as soon as possible. Now, the part of R: An Introduction that maybe most of you were waiting for is modeling data. On the other hand, because this is a very short introductory course, I'm really just giving a tiny overview of a handful of common procedures; other courses here at datalab.cc will have much more thorough investigations of common statistical modeling and machine learning algorithms. Right now, I just want to give you a flavor of what can be done in R, and we'll start by looking at a common procedure: hierarchical clustering. Clustering is a way of finding which cases, or observations, in your data belong with each other. More specifically, you can think of it as the idea of "like with like": which cases are like other ones. Now, the thing is, of course, this depends on your criteria: how you measure similarity, how you measure distance. There are a few decisions you have to make. You can take, for instance, a hierarchical approach, which is what we're going to do, or you can try to get a set number of groups, often called k, the number of groups. You also have many choices for measures of distance. And you have a choice between what's called divisive clustering, where you start with everything in one group and then split them apart, and agglomerative clustering, where they all start separately and you selectively put them together. But we're going to try to make our life simple here: we're going to do the single most common kind of clustering, we're going to use a measure of Euclidean distance, and we're going to use hierarchical clustering, so we don't have to set the number of groups in advance, reading the tree from all of the cases together down through the successive splits. Let me show you how this works in R. And what you'll find is that even though this may sound like a very sophisticated technique, and a lot of the mathematics is sophisticated, it's really not hard to do in practice.
What we're going to do here is use a data set we've used frequently. I'm going to load my default packages to get some of this ready, and then I'll bring in the datasets package. We're going to use mtcars, which, if you recall, is Motor Trend car road test data from 1974; there are 32 cars in there, and we're going to see how they group, which cars are similar to which other ones. Now let's take a look at the first few rows of data to see what variables we have in here: you see we have mpg, cylinders, displacement, and so on and so forth. Not all of these are going to be really influential or useful variables, so I'm going to drop a few of them and create a new data set that includes just the ones I want. If you want to see how I do that: I'm going to come back here and create a new object, a new data frame called cars, and this says it gets the data from mtcars. By putting the blank in the space before the comma, I mean use all of the rows; after the comma I'm selecting the columns, and c, for concatenate, says I want columns one through four, skip five, then six and seven, skip eight, and then nine through eleven. That's a way of selecting my variables. So I'm going to do that, and you see cars is now showing up in my environment, there at the top right. Let's take a look at the head of that data set; we'll zoom in on that one, and you can see it's a little bit smaller: we have mpg, cylinders, displacement, horsepower, weight, quarter-mile seconds, and so on. Now we're going to do the cluster analysis, and what we'll find is that if we use the defaults, it's super, super easy. In fact, I'm going to be using something called pipes, which come from the package dplyr (which is why I loaded it); it's this thing right here. What a pipe allows you to do is take the results of one step and feed them directly in as the input data for the next step; otherwise, this would be several different steps, but I can run it really quickly. I'm going to create an object called hc, for hierarchical clusters: we're going to read the cars data that I just created, we're going to get the distance, or dissimilarity, matrix, which says how far each observation is in Euclidean space from each of the others, and then we feed that through the hierarchical clustering routine, hclust. So that saves everything into an object, and now all we need to do is plot the results: plot hc, my hierarchical cluster object, and then we get this very busy chart over here.
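The whole pipeline, as a sketch:

```r
pacman::p_load(dplyr)                  # dplyr supplies the %>% pipe
library(datasets)
cars <- mtcars[, c(1:4, 6:7, 9:11)]    # all rows; drop columns 5 and 8
hc <- cars %>%   # take the reduced data...
  dist %>%       # ...compute the Euclidean distance matrix...
  hclust         # ...and run hierarchical clustering on it
plot(hc)         # draw the dendrogram
```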
If I zoom in on it and wait a second, you can see that it's this nice little thing called a dendrogram, because it branches like a tree (it looks more like roots here): they all start together at the top, and then they split, and split, and split again. Now, if you know your cars from 1974, you can see that some of these groupings make sense. So, for instance, here we have the Honda Civic and the Toyota Corolla, which are still in production, right next to each other. The Fiat 128 and the Fiat X1-9 are as well; they were both small Italian cars, different in many ways, but you can see that they're right next to each other. The Ferrari Dino and the Lotus Europa make sense next to each other. If we come over here, the Lincoln Continental, the Cadillac Fleetwood, and the Chrysler Imperial: it's no surprise that they're next to each other. What is interesting is this one here, the Maserati Bora; it's totally separate from everything else, because it was a very unusual, different kind of car at the time.
Now, one really important thing to remember is that the clustering is valid only for these data points, based on the data that I gave it. I only gave it a handful of variables, so it has to use those to make the clusters; if I gave it different variables or different observations, we could end up with a very different clustering. But I want to show you one more thing we can do here with these clusters, to make the chart even easier to read. Let me zoom back out. What we're going to do is draw some boxes around the clusters. We're going to start by drawing two boxes that have gray borders; I'm going to run that one, and you can see that it showed up. And then we're going to make three blue ones, four green ones, and five dark red ones. Then let me come and zoom in on this again, and now it's easier to see what the groups are in this particular data set. So we have here, for instance, the Hornet 4 Drive, the Valiant, the Mercedes 450SLC, the Dodge Challenger, and the AMC Javelin all clumping together in one general group, and then we have these other really big American cars.
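Drawing the boxes uses rect.hclust on top of the existing plot; the colors follow the ones described:

```r
rect.hclust(hc, k = 2, border = "gray")
rect.hclust(hc, k = 3, border = "blue")
rect.hclust(hc, k = 4, border = "green")
rect.hclust(hc, k = 5, border = "darkred")
```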
What's interesting, again, is that the Maserati Bora is off by itself almost immediately. It's kind of surprising, because the Ford Pantera L has a lot in common with it. But this is a way of seeing, based on the information that I gave it, how things cluster. And if you're doing market analysis, if you're trying to find out who's in your audience, if you're trying to find out which groups of people think in similar ways, this is an approach you're probably going to use. You can see that it's really simple to set up, at least using the defaults in R, as a way of finding the regularities and consistencies in the groupings in your data. As we continue our very brief introduction to modeling data in R, another common procedure we might want to look at briefly is principal components. The idea here is that in certain situations, less is more: less noise and fewer unhelpful variables in your data can translate to more meaning, and that's what we're after in any case. This approach is also known as dimensionality reduction.
I like to think of it by analogy. You look at this photo, and what you see are big black outlines of people: you can tell basically how tall they are, what they're wearing, where they're going. It takes a moment to realize that you're actually looking at a photograph taken straight down; you can see the people there at the bottom, and you're looking at their shadows. We're trying to do a similar thing: even though these are shadows, you can still tell a lot about the people. People are three-dimensional, shadows are two-dimensional, but we've retained almost all of the important information. If you want to do this with data, the most common method is called principal component analysis, or PCA. Let me give you an example of the steps, metaphorically. You begin with two variables, so here's a scatterplot (x across the bottom, y up the side; this is just artificial data), and you can see that there's a strong linear association between the two. What we're going to do is draw a regression line through the data set; you know, it's there at about 45 degrees. Then we measure the perpendicular distance of each data point to the regression line. Not the vertical distance, which is what we would use if we were looking for regression residuals, but the perpendicular distance; that's what those red lines are. Then we collapse the data by sliding each point down its red line onto the regression line, and that's what we have there. And finally, we have the option of rotating the whole thing, so it's not on the diagonal anymore but flat, and that there is the PC, the principal component. Now, let's recap what we've accomplished: we went from a two-dimensional data set to a one-dimensional data set, but maintained some of the information in the data. I like to think that we've maintained most of the information, and hopefully the most important information, in our data set. And the reason we're doing this is that we've made the analysis and interpretation easier and more reliable, by going from something more complex (two-dimensional, or higher dimensions) down to something simpler to deal with; fewer dimensions generally means easier to make sense of.
Let me show you how this works in R; open up this script and we'll go through an example in RStudio. To do this, we'll first need to load our packages, because I'm going to use a few of them, and along with those we'll load the datasets package. Now, I'm going to use the mtcars data set (we've seen that a lot), and I'm going to create a little subset of variables. Let's look at the entire list of variables. I don't want all of those in my particular data set, so, the same way I did with hierarchical clustering, I'm going to create a subset by dropping a few of them. We'll take a look at that subset; let's zoom in on it. There are the first six cases of my slightly reduced data set, and we're going to use it to see which dimensions we can get to: whether we can use fewer than the nine variables we have here and still maintain the important information in this data set.
Now what we're going to do is start by computing the PCA, the principal component analysis. We'll use the entire data frame here, and I'm going to feed it into an object called pc, for principal components. There's more than one way to do this in R, but I want to use prcomp. This specifies the data set that I'm going to use, and I add two optional arguments. One is centering the data, which means shifting each variable so its mean is zero. The second is scaling the data, which compresses or expands the range of each variable so it has unit variance, a variance of one. That puts all of the variables on the same scale and keeps any one variable from overwhelming the analysis. So let me run through that, and now we have a new object that showed up on the right. If you want, you can also specify the variables by including them explicitly in a formula: the tilde here means that I'm making my prediction based on all the variables that follow, and I can give the variable names all the way through. Then I say which data set they come from (data equals mtcars), and I can do the centering and the scaling there also. It produces exactly the same thing; it's just two different ways of writing the same command.
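Roughly, the two equivalent calls look like this; the variable list in the formula assumes the same nine columns kept for the clustering example:

```r
# PCA on the reduced data, centered and standardized
pc <- prcomp(cars, center = TRUE, scale = TRUE)

# The same analysis via the formula interface
pc <- prcomp(~ mpg + cyl + disp + hp + wt + qsec + am + gear + carb,
             data = mtcars, center = TRUE, scale = TRUE)

summary(pc)   # importance of the nine components
plot(pc)      # scree plot of the component variances
```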
To examine the results, we can come down and get a summary of the object pc that I created; I'll run that, and then we'll zoom in on it. Here's the summary: it shows nine components, PC1, for principal component one, through PC9, for principal component nine. You get the same number of components as you had original variables; the question is how the variation gets divvied up among them. Now, take a look here at principal component one: it has a standard deviation of 2.3391. What that means is that, where each standardized variable begins with a standard deviation of one, this component carries as much variation as about 2.4 of the original variables. The second one has 1.59, and the others have less than one unit of standard deviation, which means they're probably not very important in the analysis. We can get a scree plot of the components to get an idea of how much of the original variance each one explains, and we see right here (I'll zoom in on that) that our first component seems to be really big and important; our second one is smaller, but it still seems to be, you know, clearly above zero; and then we kind of grind down from there. Now, there are several different criteria for choosing how many components are important and what you want to do with them; right now, we're just eyeballing it, and we see that number one is really big and number two is sort of a minor axis in our data. If you want, you can get the standard deviations and something called the rotation; here I simply call pc, and then we'll zoom in on that in the console. I'll scroll back up a little bit, because it's a lot of numbers. The standard deviations here are the same as what we got from the first row of the summary, so that just repeats it: the first one's really big, the second one's smaller. And then what the rotation shows is the association between each of the individual variables and each of the nine components, so you can read these like correlations. I'm going to come back, and let's see how individual cases load on the PCs. To do that, I use predict, running it through pc, and then I feed those results, using the pipe, into round, so they're a little more readable. I'll zoom in on that. Here we've got nine components listed and all of our cars, but the first two are probably the ones that are most important. So we have here PC1 and PC2, and we've got a giant value there, 2.49273354, and so on. But probably the easiest way to deal with all of this is to make a plot, and we're going to use something with the funny name of biplot. What that means is a two-dimensional plot; really, all it says is that it's going to chart the first two components. But that's good, because based on our analysis, it's really only the first two that seem to matter anyhow. So let's do the biplot.
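Those last two steps, sketched in code:

```r
pc %>% predict() %>% round(2)   # component scores for each car, rounded
biplot(pc)                      # cases plus variable loadings on PC1 and PC2
```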
It's a very busy chart, but if we zoom in on it, we might be able to see a little better what's going on. What we have is the first principal component across the bottom and the second one up the side. The red lines indicate, approximately, the direction of each individual variable's contribution to these components, and each case has its name shown about where it falls. Now, if you remember from the hierarchical clustering, the Maserati Bora was really unusual, and you can see it's up there all by itself. And really, what we seem to have here is displacement and weight and cylinders and horsepower: this appears to be big, heavy cars going in this direction. Then we have the Honda Civic, the Porsche, the Lotus Europa; these are small cars with smaller, more efficient engines. These are fast cars up here, and these are slow cars down here, and so it's pretty easy to see what's going on with each of these. Where hierarchical clustering clustered cases, now we're looking at clusters of variables.
And we see that it might work to talk about big versus small and slow versus fast as the important dimensions in our data, as a way of getting insight into what's happening and directing us in our subsequent analyses. Let's finish our very short introduction to modeling data in R with a brief discussion of regression, probably one of the most common and powerful methods for analyzing data. I like to think of it as the analytical version of e pluribus unum, that is, out of many, one; or, in the data science sense, out of many variables, one variable. To put it one more way: out of many scores, one score. The idea with regression is that you use many different variables simultaneously to predict scores on one particular outcome variable. And there's so much going on here, I like to think there's something for everyone: there are many versions and many adaptations of regression that make it flexible and powerful almost no matter what you're trying to do. So let's try it in R; just open up this script, and let's see how you can adapt regression to a number of different tasks using different versions of it.
different tasks and use different versions of it. When we come here to our script, we're going to scroll down here a little bit and install some packages, we're going to be using several packages in this one, I'll load those ones as well as the datasets package. Because we're going to use a data set from that called us judge radians. Let's get some information on it. It is lawyers ratings of state judges in the US Superior Court. And let's take a look at the first few cases with head I'll zoom in on that. And what we
have here are six judges listed by name. And we have scores on a number of different variables like diligence and demeanor. And whether it finishes with whether they're worthy of retention, that's the RTN retention. Let's scroll back out. And what we might want to do is use all these different judgments to predict whether lawyers think that these judges should be retained on the bench. Now, we're going to use a couple of shortcuts that can actually make working with regression situations kind of nice. First, we're going to take our data set, and we're going to feed
it into an object called data. So that shows up now in our environment on the top right. And then we're going to define variable reps, you don't have to do this, but it makes the code really, really easy to use. Plus, you find if you do this, then you can actually just use the same code without having to redo it every time you do an analysis. So what we're going to do is we're going to create an object called x, it's actually going to be a matrix, and it's going to consist of all of our
predictor variables simultaneously. And the way I'm going to do this is I'm going to use as matrix and then I'm gonna say read data, which is what we defined right here, and read all of the columns except number 12. That's one called retention, that's our outcome. So the minus means don't include that, but do all the others. So I do that, and now I have an object called x. And then the second one, I say, go to data. And then this blank means use all of the rows, but only read the 12th column, that's the
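As a sketch, that setup looks something like this; the object names data, x, and y follow the video:

library(datasets)            # for USJudgeRatings
?USJudgeRatings              # information on the dataset
data <- USJudgeRatings       # feed the dataset into an object called data
head(data)                   # the first six judges and their ratings
x <- as.matrix(data[-12])    # every column except 12 (RTEN), as a matrix of predictors
y <- data[, 12]              # all rows, column 12 only: RTEN, the outcome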
So, following the standard notation, x is all of our predictor variables and y is our single outcome variable. Now, the easiest version of regression is called simultaneous entry: you use all of the x variables at once, throwing them into one big equation to try to predict your single outcome. In R we use lm, which stands for linear model. What we have here is y, that's our outcome variable; then the tilde, which means "is predicted by" or "as a function of"; and then x, which is all of our variables together being used as predictors. So this is the simplest possible version, and we'll save it into an object called reg1, for regression one. Now, if you want to be a little more explicit, you can give the individual variables: you can say that RTEN, retention, is a function of, or is predicted by, all of these other variables, and then say that they come from the dataset USJudgeRatings, so that we don't have to put data and a dollar sign in front of each of them. That'll give me the exact same thing, so I don't need to run that one explicitly. If you want to see the results, we just call the object that we created from the linear model, and I'll zoom in on that. What we have are the coefficients. This is the intercept, starting at about minus two, and then each of these is the step up associated with that variable: 0.1 on this one, 0.36 on that one, and so on. You'll notice, by the way, that it's changed the name of each of the variables to add an x at the front, because they're coming from the matrix x now; that's fine.
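Here is a minimal sketch of both versions of that fit, assuming the x and y objects defined above; reg1 follows the video's naming:

reg1 <- lm(y ~ x)   # simultaneous entry: y as a function of all columns of x
reg1                # print the coefficients

# The explicit, equivalent form, naming each predictor from the dataset:
lm(RTEN ~ CONT + INTG + DMNR + DILG + CFMG + DECI +
     PREP + FAMI + ORAL + WRIT + PHYS,
   data = USJudgeRatings)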
We can do inferential tests on these individual coefficients by asking for a summary. We click on that and zoom in, and now you can see there's the value we had previously, but now there's a standard error, then the t-test, and over here the probability value. The asterisks indicate values that are below the standard probability cutoff of .05. Now, we expect the intercept to be below that. But I see, for instance, that this one, integrity, has a lot to do with people's judgment of whether a person should be retained, and this one, physical ability (really, are they sick?), and we have some others that are kind of on their way. And this is a nice one overall: if you come down here, you can see the multiple R-squared. It's super high, and what it means is that these variables collectively predict very, very well whether the lawyers felt that the judge should be retained. Let's go back now to our script. You can get some more summary data here if you want. We can get the analysis of variance table, the ANOVA table; we click on that and zoom in, and you can see that we have a row for x and one for the residuals. Come back out, and we can do the coefficients. Here are the regression coefficients; we saw those previously, and this is just a different way of getting at the same information. We can also get confidence intervals. Let's zoom in on that, and now we have a 95% confidence interval: the 2.5% on the low end and the 97.5% on the top end, in terms of what each of the coefficients could be. We can get the residuals on a case-by-case basis; let's do this one. When we zoom in on that, it's a little hard to read in and of itself, because they're just numbers. But an easier way to deal with that is to get a histogram of the residuals from the model. To do that, we just run this command, and I'll zoom in. You can see that it's a little bit skewed, mostly around zero; we've got one case out on the high end, but mostly these are pretty good predictions. Let's come back out.
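Those follow-up commands are all base R; a sketch, run on the reg1 object from above:

summary(reg1)          # coefficients with standard errors, t-tests, p-values, and R-squared
anova(reg1)            # the analysis of variance table
coef(reg1)             # the regression coefficients on their own
confint(reg1)          # 95% confidence intervals for the coefficients
resid(reg1)            # residuals on a case-by-case basis
hist(residuals(reg1))  # histogram of the residuals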
Now I want to show you something a little more complicated: different kinds of regression. I'm going to use two additional libraries for this. One is called lars, which stands for least angle regression, and the other is caret, which stands for classification and regression training. We'll start by loading those two. Then we're going to do a conventional stepwise regression. A lot of people say there are problems with stepwise, so I'm just going to show it and do it really fast. There's our stepwise regression. Then we're going to do something from lars called stagewise; it's similar to stepwise, but it has better generalizability. We run that through, and we can also do least angle regression. And then, really, one of my favorites is the lasso, the least absolute shrinkage and selection operator. Now, I'm running through just the absolute bare-minimum versions of these; there's a lot more we would want to do to explore them. But what I'm going to do is compare the predictive ability of each of them, feeding the results into an object that compares the R-squared values. For each model I specify where the R-squared value lives, giving a little index number, then we round off the values, and I give them names: stepwise, forward, LAR, and lasso. And we can see the values. What this shows us at the bottom is that all of them were able to predict super well. But we knew that, because when we did just the standard simultaneous entry, there was amazingly high predictive ability within this dataset. Still, you will find situations in which each of these varies a little bit, and maybe sometimes they vary a lot.
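Here is a hedged sketch of that comparison. The fit and comparison object names are mine rather than the course's, and I pull each model's final R-squared with tail() instead of the hard-coded index numbers used in the video:

library(lars)    # least angle regression, lasso, and friends
library(caret)   # loaded in the course; not strictly needed for this sketch

fit.step  <- lars(x, y, type = "stepwise")           # conventional stepwise
fit.stage <- lars(x, y, type = "forward.stagewise")  # stagewise
fit.lar   <- lars(x, y, type = "lar")                # least angle regression
fit.lasso <- lars(x, y, type = "lasso")              # the lasso

# Each lars fit stores an R2 vector, one value per step; take the last one
r2comp <- round(c(stepwise = tail(fit.step$R2, 1),
                  forward  = tail(fit.stage$R2, 1),
                  lar      = tail(fit.lar$R2, 1),
                  lasso    = tail(fit.lasso$R2, 1)), 2)
r2comp   # on this dataset, all four predict RTEN about equally well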
But the point here is that there are many different ways of doing regression, and R makes them all available for whatever you want to do. So explore your possibilities and see what seems to fit. In other courses we'll talk much more about what each of these means, how they can be applied, and how they can be interpreted. But right now, I simply want you to note that these exist and that they can be run, at least in principle, in a very simple way in R. And so that brings us to the end of R: An Introduction. I want to make a brief conclusion, primarily to give you some next steps, other things you can do as you learn to work more with R. We have a lot of resources available here. Number one, we have additional courses on R at datalab.cc, and I encourage you to explore each of them. If you like R, you might also like working with Python, another very popular language for data science, which has the advantage of also being a general-purpose programming language. Almost all of the things we do in R we can also do in Python, and it's nice to compare and contrast the two with the courses we have at datalab.cc. I'd also recommend you spend some time simply on the concepts and practice of data visualization. R has fabulous packages for data visualization, but understanding what you're trying to get, and designing quality visualizations, is sort of a separate issue, so I encourage you to get the design training from our other courses on visualization. And then, finally, a major topic is machine learning, or methods for processing large amounts of data and getting predictions from one set of data that can be applied usefully to others. We cover that for both R and Python, and other tools, here at datalab.cc; take a look at all of them and see how you might use them in your own work. Another thing you can do is look into the annual R user conference, useR!, which is "use" with a capital R and an exclamation point. There are also local R user groups, or RUGs. And I have to say, unfortunately, there is not yet an official R Day. But consider September 19: it's International Talk Like a Pirate Day, and we like to think that pirates say "Arrr," so that can be our unofficial day for celebrating the statistical programming language R. In any case, I'd like to thank you for joining me, and I wish you happy computing.