Taylor series | Chapter 11, Essence of calculus

3Blue1Brown
Taylor polynomials are incredibly powerful for approximations and analysis.
Video Transcript:
When I first learned about Taylor series, I definitely didn't appreciate just how important they are. But time and time again they come up in math, physics, and many fields of engineering because they're one of the most powerful tools that math has to offer for approximating functions. I think one of the first times this clicked for me as a student was not in a calculus class but a physics class.
We were studying a certain problem that had to do with the potential energy of a pendulum, and for that you need an expression for how high the weight of the pendulum is above its lowest point, and when you work that out it comes out to be proportional to 1 minus the cosine of the angle between the pendulum and the vertical. The specifics of the problem we were trying to solve are beside the point here, but what I'll say is that this cosine function made the problem awkward and unwieldy, and made it less clear how pendulums relate to other oscillating phenomena. But if you approximate cosine of theta as 1 minus theta squared over 2, everything falls into place much more easily.
If you've never seen anything like this before, an approximation like that might seem completely out of left field. If you graph cosine of theta along with this function, 1 minus theta squared over 2, they do seem rather close to each other, at least for small angles near 0, but how would you even think to make this approximation, and how would you find that particular quadratic? The study of Taylor series is largely about taking non-polynomial functions and finding polynomials that approximate them near some input.
The motive here is that polynomials tend to be much easier to deal with than other functions, they're easier to compute, easier to take derivatives, easier to integrate, just all around more friendly. So let's take a look at that function, cosine of x, and really take a moment to think about how you might construct a quadratic approximation near x equals 0. That is, among all of the possible polynomials that look like c0 plus c1 times x plus c2 times x squared, for some choice of these constants, c0, c1, and c2, find the one that most resembles cosine of x near x equals 0, whose graph kind of spoons with the graph of cosine x at that point.
Well, first of all, at the input 0, the value of cosine of x is 1, so if our approximation is going to be any good at all, it should also equal 1 at the input x equals 0. Plugging in 0 just results in whatever c0 is, so we can set that equal to 1. This leaves us free to choose constants c1 and c2 to make this approximation as good as we can, but nothing we do with them is going to change the fact that the polynomial equals 1 at x equals 0.
It would also be good if our approximation had the same tangent slope as cosine x at this point of interest. Otherwise the approximation drifts away from the cosine graph much faster than it needs to. The derivative of cosine is negative sine, and at x equals 0, that equals 0, meaning the tangent line is perfectly flat.
On the other hand, when you work out the derivative of our quadratic, you get c1 plus 2 times c2 times x. At x equals 0, this just equals whatever we choose for c1. So this constant c1 has complete control over the derivative of our approximation around x equals 0.
Setting it equal to 0 ensures that our approximation also has a flat tangent line at this point. This leaves us free to change c2, but the value and the slope of our polynomial at x equals 0 are locked in place to match that of cosine. The final thing to take advantage of is the fact that the cosine graph curves downward above x equals 0, it has a negative second derivative.
Or in other words, even though the rate of change is 0 at that point, the rate of change itself is decreasing around that point. Specifically, since its derivative is negative sine of x, its second derivative is negative cosine of x, and at x equals 0, that equals negative 1. Now in the same way that we wanted the derivative of our approximation to match that of the cosine so that their values wouldn't drift apart needlessly quickly, making sure that their second derivatives match will ensure that they curve at the same rate, that the slope of our polynomial doesn't drift away from the slope of cosine x any more quickly than it needs to.
Pulling up the same derivative we had before, and then taking its derivative, we see that the second derivative of this polynomial is exactly 2 times c2. So to make sure that this second derivative also equals negative 1 at x equals 0, 2 times c2 has to be negative 1, meaning c2 itself should be negative 1 half. This gives us the approximation 1 plus 0x minus 1 half x squared.
To get a feel for how good it is, if you estimate cosine of 0.1 using this polynomial, you'd estimate it to be 0.995, and that is extremely close to the true value of cosine of 0.1. It's a really good approximation! Take a moment to reflect on what just happened.
You had 3 degrees of freedom with this quadratic approximation, the constants c0, c1, and c2. c0 was responsible for making sure that the output of the approximation matches that of cosine x at x equals 0, c1 was in charge of making sure that the derivatives match at that point, and c2 was responsible for making sure that the second derivatives match up. This ensures that the way your approximation changes as you move away from x equals 0, and the way that the rate of change itself changes, is as similar as possible to the behaviour of cosine x, given the amount of control you have.
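Here is a minimal numerical sketch in Python of that quadratic estimate, just to make the comparison above concrete; the function name is purely illustrative.

```python
import math

def cos_quadratic(x):
    # The quadratic matched to cosine at x = 0: value 1, slope 0, second derivative -1
    return 1 + 0 * x - 0.5 * x**2

print(cos_quadratic(0.1))  # 0.995
print(math.cos(0.1))       # 0.9950041652780258
```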
You could give yourself more control by allowing more terms in your polynomial and matching higher order derivatives. For example, let's say you added on the term c3 times x cubed for some constant c3. In that case, if you take the third derivative of a cubic polynomial, anything quadratic or smaller goes to 0.
As for that last term, after 3 iterations of the power rule, it looks like 1 times 2 times 3 times c3. On the other hand, the third derivative of cosine x comes out to sine x, which equals 0 at x equals 0. So to make sure that the third derivatives match, the constant c3 should be 0.
Or in other words, not only is 1 minus ½ x² the best possible quadratic approximation of cosine, it's also the best possible cubic approximation. You can make an improvement by adding on a fourth order term, c4 times x to the fourth. The fourth derivative of cosine is itself, which equals 1 at x equals 0.
And what's the fourth derivative of our polynomial with this new term? Well, when you keep applying the power rule over and over, with those exponents all hopping down in front, you end up with 1 times 2 times 3 times 4 times c4, which is 24 times c4. So if we want this to match the fourth derivative of cosine x, which is 1, c4 has to be 1 over 24.
And indeed, the polynomial 1 minus ½ x² plus 1/24 times x to the fourth, which looks like this, is a very close approximation for cosine x around x equals 0. In any physics problem involving the cosine of a small angle, for example, predictions would be almost unnoticeably different if you substituted this polynomial for cosine of x. Take a step back and notice a few things happening with this process.
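To put a rough number on "almost unnoticeably different," here is a small Python check of that fourth-order polynomial against cosine at a few small angles; the particular angles are arbitrary.

```python
import math

def cos_quartic(x):
    # Fourth-order Taylor polynomial of cosine around x = 0
    return 1 - x**2 / 2 + x**4 / 24

for angle in (0.1, 0.5, 1.0):
    approx, exact = cos_quartic(angle), math.cos(angle)
    print(f"x={angle}: approx={approx:.7f}  cos(x)={exact:.7f}  error={abs(approx - exact):.2e}")
```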
First of all, factorial terms come up very naturally in this process. When you take n successive derivatives of the function x to the n, letting the power rule keep cascading on down, what you'll be left with is 1 times 2 times 3 on and on up to whatever n is. So you don't simply set the coefficients of the polynomial equal to whatever derivative you want, you have to divide by the appropriate factorial to cancel out this effect.
For example, that x to the fourth coefficient was the fourth derivative of cosine, 1, but divided by 4 factorial, 24. The second thing to notice is that adding on new terms, like this c4 times x to the fourth, doesn't mess up what the old terms should be, and that's really important. For example, the second derivative of this polynomial at x equals 0 is still equal to 2 times the second coefficient, even after you introduce higher order terms.
And it's because we're plugging in x equals 0, so the second derivative of any higher order term, which all include an x, will just wash away. And the same goes for any other derivative, which is why each derivative of a polynomial at x equals 0 is controlled by one and only one of the coefficients. If instead you were approximating near an input other than 0, like x equals pi, in order to get the same effect you would have to write your polynomial in terms of powers of x minus pi, or whatever input you're looking at.
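This claim, that each derivative at x equals 0 is controlled by exactly one coefficient, is easy to verify symbolically. Here is a minimal sketch assuming the sympy library is available; the coefficient names mirror the ones used above.

```python
import sympy as sp

x, c0, c1, c2, c4 = sp.symbols('x c0 c1 c2 c4')
p = c0 + c1*x + c2*x**2 + c4*x**4  # the quadratic plus an extra fourth-order term

print(sp.diff(p, x, 2).subs(x, 0))  # 2*c2  -- unchanged by the x**4 term
print(sp.diff(p, x, 4).subs(x, 0))  # 24*c4 -- controlled only by c4
```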
Writing the polynomial in powers of x minus pi makes it look noticeably more complicated, but all we're doing is making sure that the point pi looks and behaves like 0, so that plugging in x equals pi will result in a lot of nice cancellation that leaves only one constant. And finally, on a more philosophical level, notice how what we're doing here is basically taking information about higher order derivatives of a function at a single point, and translating that into information about the value of the function near that point. You can take as many derivatives of cosine as you want.
It follows this nice cyclic pattern, cosine of x, negative sine of x, negative cosine, sine, and then repeat. And the value of each one of these is easy to compute at x equals 0. It gives this cyclic pattern 1, 0, negative 1, 0, and then repeat.
And knowing the values of all those higher order derivatives is a lot of information about cosine of x, even though it only involves plugging in a single number, x equals 0. So what we're doing is leveraging that information to get an approximation around this input, and you do it by creating a polynomial whose higher order derivatives are designed to match up with those of cosine, following this same 1, 0, negative 1, 0, cyclic pattern. And to do that, you just make each coefficient of the polynomial follow that same pattern, but you have to divide each one by the appropriate factorial.
Like I mentioned before, this is what cancels out the cascading effect of many power rule applications. The polynomials you get by stopping this process at any point are called Taylor polynomials for cosine of x. More generally, and hence more abstractly, if we were dealing with some other function other than cosine, you would compute its derivative, its second derivative, and so on, getting as many terms as you'd like, and you would evaluate each one of them at x equals 0.
Then for the polynomial approximation, the coefficient of each x to the n term should be the value of the nth derivative of the function evaluated at 0, but divided by n factorial. This whole rather abstract formula is something you'll likely see in any text or course that touches on Taylor polynomials. And when you see it, think to yourself that the constant term ensures that the value of the polynomial matches with the value of f, the next term ensures that the slope of the polynomial matches the slope of the function at x equals 0, the next term ensures that the rate at which the slope changes is the same at that point, and so on, depending on how many terms you want.
And the more terms you choose, the closer the approximation, but the tradeoff is that the polynomial you'd get would be more complicated. And to make things even more general, if you wanted to approximate near some input other than 0, which we'll call a, you would write this polynomial in terms of powers of x minus a, and you would evaluate all the derivatives of f at that input, a. This is what Taylor polynomials look like in their fullest generality.
Changing the value of a changes where this approximation is hugging the original function, where its higher order derivatives will be equal to those of the original function. One of the simplest meaningful examples of this is the function e to the x around the input x equals 0. Computing the derivatives is super nice, as nice as it gets, because the derivative of e to the x is itself, so the second derivative is also e to the x, as is its third, and so on.
So at the point x equals 0, all of these are equal to 1. And what that means is our polynomial approximation should look like 1 plus 1 times x plus 1 over 2 times x squared plus 1 over 3 factorial times x cubed, and so on, depending on how many terms you want. These are the Taylor polynomials for e to the x.
Ok, so with that as a foundation, in the spirit of showing you just how connected all the topics of calculus are, let me turn to something kind of fun, a completely different way to understand this second order term of the Taylor polynomials, but geometrically. It's related to the fundamental theorem of calculus, which I talked about in chapters 1 and 8 if you need a quick refresher. Like we did in those videos, consider a function that gives the area under some graph between a fixed left point and a variable right point.
What we're going to do here is think about how to approximate this area function, not the function for the graph itself, like we've been doing before. Focusing on that area is what's going to make the second order term pop out. Remember, the fundamental theorem of calculus is that this graph itself represents the derivative of the area function, and it's because a slight nudge dx to the right bound of the area gives a new bit of area approximately equal to the height of the graph times dx.
And that approximation is increasingly accurate for smaller and smaller choices of dx. But if you wanted to be more accurate about this change in area, given some change in x that isn't meant to approach 0, you would have to take into account this portion right here, which is approximately a triangle. Let's name the starting input a, and the nudged input above it x, so that change is x-a.
The base of that little triangle is that change, x-a, and its height is the slope of the graph times x-a. Since this graph is the derivative of the area function, its slope is the second derivative of the area function, evaluated at the input a. So the area of this triangle, 1 half base times height, is 1 half times the second derivative of this area function, evaluated at a, multiplied by (x - a) squared.
And this is exactly what you would see with a Taylor polynomial. If you knew the various derivative information about this area function at the point a, how would you approximate the area at the point x? Well you have to include all that area up to a, f of a, plus the area of this rectangle here, which is the first derivative, times x-a, plus the area of that little triangle, which is 1 half times the second derivative, times (x - a) squared.
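Here is a minimal numeric sketch of that picture, assuming purely for illustration that the graph being integrated is y equals t squared, so the true area function is x cubed over 3:

```python
a, x = 1.0, 1.1

area_exact = x**3 / 3   # true area up to x
A_a   = a**3 / 3        # area up to a
slope = a**2            # first derivative of the area function at a (height of the graph)
curve = 2 * a           # second derivative of the area function at a (slope of the graph)

area_approx = A_a + slope * (x - a) + 0.5 * curve * (x - a)**2
print(area_exact, area_approx)  # 0.44366..., 0.44333...
```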
I really like this, because even though it looks a bit messy all written out, each one of the terms has a very clear meaning that you can just point to on the diagram. If you wanted, we could call it an end here, and you would have a phenomenally useful tool for approximations with these Taylor polynomials. But if you're thinking like a mathematician, one question you might ask is whether or not it makes sense to never stop and just add infinitely many terms.
In math, an infinite sum is called a series, so even though one of these approximations with finitely many terms is called a Taylor polynomial, adding all infinitely many terms gives what's called a Taylor series. You have to be really careful with the idea of an infinite series, because it doesn't actually make sense to add infinitely many things, you can only hit the plus button on the calculator so many times. But if you have a series where adding more and more of the terms, which makes sense at each step, gets you increasingly close to some specific value, what you say is that the series converges to that value.
Or, if you're comfortable extending the definition of equality to include this kind of series convergence, you'd say that the series as a whole, this infinite sum, equals the value it's converging to. For example, look at the Taylor polynomial for e to the x, and plug in some input, like x equals 1. As you add more and more polynomial terms, the total sum gets closer and closer to the value e, so you say that this infinite series converges to the number e, or what's saying the same thing, that it equals the number e.
In fact, it turns out that if you plug in any other value of x, like x equals 2, and look at the value of the higher and higher order Taylor polynomials at this value, they will converge towards e to the x, which is e squared. This is true for any input, no matter how far away from 0 it is, even though these Taylor polynomials are constructed only from derivative information gathered at the input 0. In a case like this, we say that e to the x equals its own Taylor series at all inputs x, which is kind of a magical thing to have happen.
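A quick way to see this convergence numerically is to evaluate partial sums of the series at x equals 1 and x equals 2 and compare them with e and e squared; a minimal sketch:

```python
import math

def exp_partial_sum(x, n_terms):
    # 1 + x + x**2/2! + x**3/3! + ... with n_terms terms
    return sum(x**n / math.factorial(n) for n in range(n_terms))

for x in (1, 2):
    for n in (3, 6, 12):
        print(f"x={x}, {n} terms: {exp_partial_sum(x, n):.8f}")
    print(f"math.exp({x})  = {math.exp(x):.8f}")
```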
Even though equaling its own Taylor series at every input is also true for a couple other important functions, like sine and cosine, sometimes these series only converge within a certain range around the input whose derivative information you're using. If you work out the Taylor series for the natural log of x around the input x equals 1, which is built by evaluating the higher order derivatives of the natural log of x at x equals 1, it comes out to (x minus 1), minus (x minus 1) squared over 2, plus (x minus 1) cubed over 3, and so on, with alternating signs. When you plug in an input between 0 and 2, adding more and more terms of this series will indeed get you closer and closer to the natural log of that input.
But outside of that range, even by just a little bit, the series fails to approach anything. As you add on more and more terms, the sum bounces back and forth wildly. It does not, as you might expect, approach the natural log of that value, even though the natural log of x is perfectly well defined for inputs above 2.
In some sense, the derivative information of ln of x at x equals 1 doesn't propagate out that far. In a case like this, where adding more terms of the series doesn't approach anything, you say that the series diverges. And that maximum distance between the input you're approximating near and points where the outputs of these polynomials actually converge is called the radius of convergence for the Taylor series.
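The same kind of numerical experiment shows the radius of convergence in action. Here is a minimal sketch of the partial sums of that natural log series, evaluated once inside and once outside the radius; the specific inputs are arbitrary.

```python
import math

def log_partial_sum(x, n_terms):
    # (x-1) - (x-1)**2/2 + (x-1)**3/3 - ... with n_terms terms
    return sum((-1)**(n + 1) * (x - 1)**n / n for n in range(1, n_terms + 1))

for x in (1.5, 2.5):
    print(f"x={x}, true ln(x) = {math.log(x):.5f}")
    for n in (5, 20, 50):
        # Settles down toward ln(x) for x = 1.5, bounces wildly for x = 2.5
        print(f"  {n} terms: {log_partial_sum(x, n):.5f}")
```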
There remains more to learn about Taylor series. There are many use cases, tactics for placing bounds on the error of these approximations, tests for understanding when series do and don't converge, and for that matter, there remains more to learn about calculus as a whole, and the countless topics not touched by this series. The goal with these videos is to give you the fundamental intuitions that make you feel confident and efficient in learning more on your own, and potentially even rediscovering more of the topic for yourself.
In the case of Taylor series, the fundamental intuition to keep in mind as you explore more of what there is, is that they translate derivative information at a single point to approximation information around that point. Thank you once again to everybody who supported this series. The next series like it will be on probability, and if you want early access as those videos are made, you know where to go.
Thank you.