# Exploring exponentiation

*The following is an edited and prettified transcript of 3Blue1Brown's latest video, "How (and why) to raise e to the power of a matrix." This edit is part of my application to intern with 3Blue1Brown. For context, see my previous post: making math magical. *

"Since Newton, mankind has come to realize that the laws of physics are always expressed in the language of differential equations." - Steven Strogatz

There's a funny little exercise in an old differential equations textbook that I learned from in college which asks the reader to compute $e^{\textbf{A}t}$ (where $\textbf{A}$, we're told, is going to be a matrix, and the insinuation seems to be that the result will also be a matrix). It then offers several examples for what you might plug in for $\textbf{A}$:

Now, taken out of context, putting a matrix in an exponent like this probably seems like total nonsense. But what it refers to is an extremely beautiful operation, and the reason it shows up in this book is that it’s useful. It's used to solve a very important class of differential equations.

In turn, given that the universe is often written in the language of differential equations, you see matrix exponents pop up in physics all the time too. Especially in quantum mechanics, where they are littered all over the place. They play a particularly prominent role. This has a lot to do with Schrödinger’s equation, which we’ll touch on a bit later. (And it also may help in understanding your romantic relationships, but again, all in due time.)

A big part of the reason I want to cover this topic is that there’s an extremely nice way to visualize what matrix exponents are actually doing using *flow *that not a lot of people seem to talk about. But for the bulk of this chapter, let’s start by laying out what exactly the operation actually is, and then see if we can get a feel for what kinds of problems it helps us to solve.

## Exponentiation $\neq$ repeated multiplication

The first thing you should know is that raising $e$ to the power of a matrix is *not* some bizarre way to multiply the constant $e$ by itself multiple times. You would be right to call that nonsense. The actual definition is related to a certain infinite polynomial for describing real number powers of $e$–what we call its Taylor series.

For example, if I took the number $2$ and plugged it into this polynomial, then as you add more and more terms (each of which looks like some power of $2$ divided by some factorial) the sum approaches a number near $7.389$, and this number is precisely $e \times e$.

If you increment this input by $1$, then somewhat miraculously (no matter where you started from) the effect on the output is always to multiply it by another factor of $e$.

For reasons that you're going to see a bit, mathematicians became interested in plugging all kinds of things into this polynomial–things like complex numbers, and for our purposes today, matrices–even when those objects do not immediately make sense as exponents.

What some authors do is give this infinite polynomial the name “$\exp$” when you plug in more exotic inputs. It's a gentle nod to the connection that this has to exponential functions in the case of real numbers, even though obviously these inputs don't make sense as exponents. However, an equally common convention is to give a much less gentle nod to the connection and just abbreviate the whole thing as $e$ to the power of whatever object you’re plugging in (whether that’s a complex number, or a matrix, or all sorts of more exotic objects).

So while the Taylor series of $e^x$ is a theorem for real number inputs, it’s a definition for more exotic inputs. Cynically, you could call this a blatant abuse of notation. More charitably, you might view it as an example of the beautiful cycle between discovery and invention in math.

In either case, plugging in a *matrix–*even to a polynomial–might seem a little strange, so let’s be clear on what we mean here.

## Exponentiating matrices

In order to multiply a matrix by itself, the matrix must have the same number of rows and columns (in other words, it must be *square*). That way you can multiply it by itself according to the usual rules of matrix multiplication.

This is what we mean by "squaring" it:

Similarly, if you were to take that result, and then multiply it by the original matrix again, this is what we mean by "cubing" the matrix. If you carry on like this, you can take any whole power of a matrix. It's perfectly sensible:

In this context, powers still mean exactly what you’d expect: repeated multiplication. Each term of this polynomial is scaled by $1$ and divided by some factorial. With matrices, all that means is that you multiply each component by that number ($\frac{1}{n!}$). Likewise, it makes sense to add two matrices together by simply adding their respective components together, term-by-term.

The astute among you might ask how sensible it is to take this out to infinity, which would be a great question, and one that I'm largely going to postpone the answer to. But I can show you one pretty fun example here and now:

Take this $2 \times 2$ matrix that has $-\pi$ and $\pi$ sitting off its diagonal entries. Let's see what the sum gives.

The first term is the identity matrix. This is actually what we mean *by definition* when we raise a matrix to the $0$th power (any $\text{matrix}^0 = I_n$). Then we add in the matrix itself, which gives us the pi-off-the-diagonal terms. And then add half of the matrix squared, and continuing on I’ll have the computer keep adding more and more terms, each of which requires taking one more matrix product to get a new power and adding it to a running tally:

As it keeps going, it seems to be approaching a stable value, which is around $-1$ times the identity matrix. In this sense, we say the infinite sum *equals* that negative identity. By the end of this video, my hope is that this particular fact comes to make total sense to you. For any of you familiar with Euler’s famous identity, this is essentially the matrix version of that.

It turns out that in general, no matter what matrix you start with, as you add more and more terms into this Taylor series you eventually approach some stable value. Though sometimes it can take quite a while before you get there.

Just seeing the definition like this, in isolation, raises all kinds of questions. Most notably, *why* would mathematicians and physicists be interested in torturing their poor matrices this way? What problems are they trying to solve?

And if you’re anything like me, a new operation is only satisfying when you have a clear view of what it’s *trying to do–*some sense of how to predict the output based on the input before you actually crunch the numbers. How on earth could you have predicted that the matrix with $\pi$ off the diagonals results in the negative identity matrix like this?

*aside: math textbooks $\neq$ math discovery*

Often in math, you should view the definition not as a starting point, but as a target. Contrary to the structure of textbooks, mathematicians do not start by making definitions, and then listing a lot of theorems and proving them, and then showing some examples. The process of discovering math typically goes the other way around. They start by chewing on specific problems and then generalizing those problems, then coming up with constructs that might be helpful in those general cases. And only then do you write down a new definition, or extend an old one.

As to what sorts of specific examples might motivate matrix exponents, two come to mind. One involving relationships, and the other quantum mechanics. Let’s start with relationships.

## Romeo and Juliet

Suppose we have two lovers–let's call them Romeo and Juliet. And let's let $x$ represent Juliet’s love for Romeo, and $y$ represent his love for her, both of which are going to be values that change with time $t$ (so $x$ is really shorthand for $x(t)$, and $y$ for $y(t)$).

*Note: This is an example that we actually touched on in chapter $1$ of the Differential Equations series. It's based on a Steven Strogatz article, but it’s okay if you didn’t see that.*

The way their relationship works is that the rate at which Juliet’s love for Romeo changes (the *derivative* of her love with respect to time) is equal to $-1$ times Romeo’s love for her. So in other words, when Romeo is expressing cool disinterest, that's when Juliet’s feelings actually increase, whereas if he becomes too infatuated, her interest will start to fade. Romeo, on the other hand, is the opposite. The rate of change of his love is equal to Juliet’s love. So while Juliet is mad at him, his affections tend to decrease, whereas if she loves him, that's when his feelings grow.

Of course, neither one of these numbers is holding still; as Romeo’s love increases in response to Juliet, her equation continues to apply and drives her love down. Both of these equations always apply, from each infinitesimal point in time to the next, so every slight change to one value immediately influences the rate of change of the other:

This is a system of differential equations. It’s a puzzle, where your challenge is to find explicit functions for $x(t)$ and $y(t)$ that make both these expressions true.

Now, as systems of differential equations go, this one is on the simpler side. Enough so that many calculus students could probably just guess at an answer ($x(t) = \cos(t)$ and $y(t) = \sin(t)$), for example.

Keep in mind, though, it’s not enough to find just *some* pair of functions that makes this true; if you want to actually predict where Romeo and Juliet end up after some starting point, you have to make sure that your functions match the *initial set of conditions* at time $t=0$.

More to the point, our actual goal is to systematically solve more general versions of this equation, without guessing and checking, and it's that question that leads us to matrix exponents. Very often when you have multiple changing values like this, it’s helpful to package them together as coordinates of a single point in a higher-dimensional space.

So for Romeo and Juliet, think of their relationship as a point in a $2$-D space, the $x$-coordinate capturing Juliet’s feelings, and the $y$-coordinate capturing Romeo’s. Sometimes it’s helpful to picture this state as an arrow from the origin, other times just as a point; all that really matters is that it encodes two numbers. And moving forward we'll be writing that as a column vector:

And of course, remember that this vector (which captures the state of their love) is a function of time.

You might picture the rate of change of the state, the thing that packages together the derivative of $x$ and the derivative of $y$, as a kind of velocity vector in this state space. Something that tugs on our vector in some direction, and with some magnitude indicating how quickly it’s changing.

Remember, the rule here is that the rate of change of $x$ (that is, $\frac{dx}{dt}$) is $-y$, and the rate of change of $y$ (or $\frac{dy}{dt}$) is $x$. Set up as vectors like this, we could rewrite the right-hand side of this equation as a product of this matrix with the original vector $\begin{bmatrix} x\\y \end{bmatrix}$.

The top row encodes Juliet’s rule, and the bottom row encodes Romeo’s rule. So what we have here is a differential equation telling us that the rate of change of some vector is equal to a certain matrix times that vector itself.

### Recap: visualizing matrices as transformations

In a moment we’ll talk about how matrix exponentiation solves this kind of equation, but before that let me show you a simpler way that we can solve this particular system, one that uses pure geometry, and it helps set the stage for visualizing matrix exponents a bit later.

This matrix from our system is a $90$-degree rotation matrix. For any of you a bit rusty on how to think of matrices as transformations, there’s a video all about it on this channel (a series, really).

The basic idea is that when you multiply a matrix by the vector $\begin{bmatrix} 1\\0 \end{bmatrix}$, it pulls out the first column. And similarly, if you multiply it by $\begin{bmatrix} 0\\1 \end{bmatrix}$ that pulls out the second column.

What this means is that when you look at a matrix, you can read its columns as telling you what it does to these two vectors, known as the basis vectors. The way it acts on any other vector is a result of scaling and adding these two basis results by that vector’s coordinates:

So, looking back at the matrix from our system, notice how from its columns we can tell it takes the first basis vector to $\begin{bmatrix} 0\\1 \end{bmatrix}$, and the second to $\begin{bmatrix} -1\\0 \end{bmatrix}$, hence why I’m calling it the $90$-degree rotation matrix:

What it means for our equation is that it's saying wherever Romeo and Juliet are in this state space, their rate of change has to look like a $90$-degree rotation of this position vector. The only way velocity can be permanently perpendicular to position like this is when you rotate about the origin in circular motion. Never growing nor shrinking because the rate of change has no component in the direction of position.

More specifically, since the length of this velocity vector equals the length of the position vector, then for each unit of time, the distance that this covers is equal to one radius’s worth of arc length along that circle. In other words, it rotates at one radian per unit time; so in particular, it would take $2\pi$ units of time to make a full revolution.

If you want to describe this kind of rotation with a formula, we can use a more general rotation matrix, which looks like this:

Again, we can read it in terms of the columns.

Notice how the first column tells us that it takes the first basis vector to $\begin{bmatrix} \cos(t) \\ \sin(t) \end{bmatrix}$, and the second column tells us that it takes the second basis vector to $\begin{bmatrix} -\sin(t) \\ \cos(t) \end{bmatrix}$, both of which are consistent with rotating by $t$ radians.

So to solve the system, if you want to predict where Romeo and Juilet end up after $t$ units of time, you can multiply this matrix by their initial state. The active viewers among you may also enjoy taking a moment to pause and confirm that explicit formulas you get for $x(t)$ and $y(t)$ really do satisfy the system of differential equations that we started with.

The mathematician in you might wonder if it’s possible to solve not just this specific system, but equations like it for any matrix. To ask this question is to set yourself up to rediscover matrix exponents.

## $e^{\textrm{matrix}}$ and Schrödinger’s equation

The main goal for today is for you to understand how this equation lets you intuitively picture this operation which we write as $e$ raised to a matrix, and on the flip side how being able to compute matrix exponents lets you explicitly solve this equation.

A much less whimsical example is Schrödinger’s famous equation, which is the fundamental equation describing how systems in quantum mechanics change over time.

It looks pretty intimidating, and I mean, it’s quantum mechanics so of course it will, but it’s actually not that different from the Romeo-Juliet setup.

The symbol $\ket{\psi}$ refers to a certain vector. It's a vector that packages together all the information you might care about in a system, like the various particles’ positions and momenta.

It's analogous to our simpler $2$-D vector that encoded all the information about Romeo and Juliet. Schrödinger’s equation says that the rate at which this state vector changes looks like a certain matrix times the current state itself.

There are a number of things making Schrödinger’s equation notably more complicated. But in the back of your mind you might think of this as a target point that you and I can build up to, with simpler examples like Romeo and Juliet offering more friendly stepping stones along the way.

Actually, the simplest example, which is tied to ordinary real-number powers of $e$, is the one-dimensional case. This is when you have a single changing value, and its rate of change equals some constant times itself. So the bigger the value, the faster it grows. Most people are more comfortable visualizing this with a graph, where the higher the value of the graph, the steeper its slope, resulting in this ever-steepening upward curve. Just keep in mind that when we get to higher dimensional variants, graphs are a lot less helpful.

This is a highly important equation in its own right. It's a very powerful concept when the rate of change of a value is proportional to the value itself. This is the equation governing things like compound interest, or the early stages of population growth before the effects of limited resources kicks in, or the early stages of an epidemic while most of the population is susceptible.

Calculus students all learn about how the derivative of $e^{rt}$ is $r$ times itself. In other words, this self-reinforcing growth is the same thing as exponential growth, and $e^{rt}$ solves this equation.

Actually, a better way to think about it is that there are many different solutions to this equation, one for each initial condition, something like an initial investment size or initial population, which we'll just call $x_0$.

Notice, by the way, how the higher the value for $x_0$, the higher the initial slope of the resulting solution, which should make complete sense given the equation.

The function $e^{rt}$ is just a solution when the initial condition is $1$. But! If you multiply by any other initial condition, you get a new function that still satisfies the property, it still has a derivative which is r times itself. But this time it starts at $x_0$, since $e^0$ is $1$.

This is worth highlighting before we generalize to more dimensions: Do not think of the exponential part as a solution in and of itself. Think of it as something which *acts on an initial condition* in order to give a solution.

You see, up in the $2$-dimensional case, when we have a changing vector whose rate of change is constrained to be some matrix times itself, what the solution looks like is also an exponential term acting on a given initial condition. But the exponential part, in that case, will produce a matrix that changes with time, and the initial condition is a vector.

In fact, you should think of the definition of matrix exponentiation as being heavily motivated by making sure this is true. For example, if we look back at the system that popped up with Romeo and Juliet, the claim now is that solutions look like $e$ raised to this $\begin{bmatrix} 0 & -1 \\ 1&0 \end{bmatrix}$ matrix all times time, multiplied by some initial condition.

But we’ve already seen the solution in this case, we know it looks like a rotation matrix times the initial condition. So let’s take a moment to roll up our sleeves and compute the exponential term using the definition I mentioned at the start, and see if it lines up.

Remember, writing $e$ to the power of a matrix is a shorthand–a shorthand for plugging it into this long infinite polynomial, the Taylor series of $e^x$. I know it might seem pretty complicated to do this, but trust me, it’s very satisfying how this particular one turns out. If you actually sit down and you compute successive powers of this matrix, what you’d notice is that they fall into a cycling pattern every four iterations.

This should make sense, given that we know it’s a $90$-degree rotation matrix. So when you add together all infinitely many matrices term-by-term, each term in the result looks like a polynomial in $t$ with some nice cycling pattern in the coefficients, all of them scaled by the relevant factorial term.

Those of you who are savvy with Taylor series might be able to recognize that each one of these components is the Taylor series for either $\sin$ or $\cos$, though in that top right corner's case it’s actually $-\sin(t)$. So what we get from the computation is exactly the rotation matrix we had from before!

To me, this is extremely beautiful. We have two completely different ways of reasoning about the same system and they give us the same answer. I mean it’s reassuring that they do, but it is wild just how different the mode of thought is when you're chugging through the polynomial vs. when you're geometrically reasoning about what a velocity perpendicular to a position must imply.

Hopefully the fact that these line up inspires a little confidence in the claim that matrix exponents really do solve systems like this. This explains the computation we saw at the start, by the way, with the matrix that had -$\pi$ and $\pi$ off the diagonal, producing the negative identity. This expression is exponentiating a $90$-degree rotation matrix times $\pi$, which is another way to describe what the Romeo-Juliet setup does after $\pi$ units of time. As we now know, that has the effect of rotating everything by $180$-degrees in this state space, which is the same as multiplying everything by $-1$.

Also, for any of you familiar with imaginary number exponents, this example is probably ringing a ton of bells. It is 100% analogous. In fact, we could have framed the entire example where Romeo and Juliet’s feelings were packaged into a complex number, and the rate of change of that complex number would have been $i$ times itself since multiplication by $i$ also acts like a $90$-degree rotation.

The same exact line of reasoning, both analytic and geometric, would have led to this whole idea that $e^{it}$ describes rotations. These are actually two of many different examples throughout math and physics when you find yourself exponentiating some object which acts as a $90$-degree rotation, times time. It shows up with quaternions or many of the matrices that pop up in quantum mechanics.

In all of these cases, we have this really neat general idea that if you take some operation that rotates $90$-degrees in some plane (often it's a plane in some high-dimensional space we can’t visualize) then what we get by exponentiating that operation times time is something that generates all other rotations in that same plane.

One of the more complicated variations on this same theme is Schrödinger’s equation. It’s not just that it has the derivative-of-a-state-equals-some-matrix-times-that-state form. The nature of the relevant matrix here is such that this equation also describes a kind of rotation, though in many applications of Schrödinger's equation it will be a rotation inside a kind of function space.

It’s a little more involved, though, because typically there's a combination of many different rotations. It takes time to really dig into this equation, and I’d love to do that in a later chapter. But right now I cannot help but at least allude to the fact that this imaginary unit $i$ that sits so impishly in such a fundamental equation for all of the universe, is playing basically the same role as the matrix from our Romeo-Juliet example.

What this $i$ communicates is that the rate of change of a certain state is, in a sense, perpendicular to that state, and hence that the way things have to evolve over time will involve a kind of oscillation. But matrix exponentiation can do so much more than just rotation. You can always visualize these sorts of differential equations using a vector field.

## Visualizing with flow

The idea is that this equation $\frac{d}{dt}\vec{v}(t) = M\vec{v}(t)$ tells us that the velocity of a state is entirely determined by its position. So what we do is go to every point in this space, and draw a little vector indicating what the velocity of a state must be if it passes through that point. For our type of equation, this means that we go to each point $v$ in space, and we attach the vector $M$ times $v$. To intuitively understand how any given initial condition will evolve, you let it flow along this field, with a velocity always matching whatever vector it’s sitting on at any given point in time.

So if the claim is that solutions to this equation look like $e^{Mt}$ times some initial condition, it means you can visualize what the matrix $e^{Mt}$ does by letting every possible initial condition flow along this field for $t$ units of time.

The transition from start to finish is described by whatever matrix pops out from the computation for $e^{Mt}$. In our main example with the $90$-degree rotation matrix, the vector field looks like this, and as we saw $e^{Mt}$ describes rotation in that case, which lines up with flow along this field.

As another example, the more Shakespearean Romeo and Juliet might have equations that look a little more like this, where Juliet’s rule is symmetric with Romeo’s, and both of them are inclined to get carried away in response to one and another’s feelings:

Again, the way the vector field you’re looking at has been defined is to go to each point $v$ in space and attach the vector $Mv$.

This is the pictorial way of saying that the rate of change of a state must always equal $M$ times itself. But for this example, flow along the vector field looks a lot different from how it did before. If Romeo and Juliet start anywhere in this upper-right half of the plane, their feelings will feed off of each other and they both tend towards infinity. If they're in the other half of the plane, well, let’s just say they stay more true to their Capulet and Montague family traditions.

So even before you try calculating the exponential of this particular matrix, you can already have an intuitive sense of what the answer should look like. The resulting matrix should describe the transition from time $0$ to time $t$, which, if you look at the field, seems to indicate that it will squish along one diagonal while stretching along another, getting more extreme as $t$ gets larger.

Of course, all of this is presuming that $e^{Mt}$ times an initial condition actually solves these systems. This is one of those facts that’s easiest to believe when you just work it out yourself. But I’ll run through a quick rough sketch.

Write out the full polynomial that defines $e^{Mt}$, and multiply by some initial condition vector on the right. And then take the derivative of this with respect to $t$.

Because $M$ is a constant, this just means applying the power rule to each one of the terms. And that power rule really nicely cancels out with the factorial terms.

So what we're left with is an expression that looks almost identical to what we had before, except that every term has an extra $M$ hanging on to it. But this can be factored out to the left. So the derivative of this expression is $M$ times the original expression, and hence it solves the equation.

This actually sweeps under the rug some details required for rigor, mostly centered around the question of whether or not this thing actually converges, but it does give the main idea.

In the next chapter, I would like to talk more about the properties that this operation has, most notably its relationship with eigenvectors and eigenvalues, which leads us to more concrete ways of thinking about how you actually carry out the computation, which otherwise seems insane. Also, time permitting, it might be fun to talk about what it means to raise $e$ to the power of the derivative operator $\frac{d}{dt}$.