Eli Rose
How Do You Generalize Derivatives to Higher Dimensions?

The analogue of the derivative for functions whose inputs and outputs are vectors is called the total derivative. The total derivative of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is an object that gives you a function for each point in $\mathbb{R}^n$. In other words, it is a function $\mathbb{R}^n \rightarrow \mathbb{R}^n \rightarrow \mathbb{R}^m$, read as $\mathbb{R}^n \rightarrow (\mathbb{R}^n \rightarrow \mathbb{R}^m)$: put in a point, get back a function from $\mathbb{R}^n$ to $\mathbb{R}^m$.

I'll write $D(f)$ for the total derivative of $f$, that is, the function $\mathbb{R}^n \rightarrow \mathbb{R}^n \rightarrow \mathbb{R}^m$. $D(f)(\mathbf{x})$ is what you get when you put in $\mathbf{x}$: a particular function $\mathbb{R}^n \rightarrow \mathbb{R}^m$.

(I wanted to use the more familiar $\frac{d}{d\text{...}}$ notation, but unfortunately it seems to obscure some analogies more than it aids others, because of the need to put in a symbol at the bottom.)

$D(f)(\mathbf{x})$ maps vectors to vectors. I'm going to claim that it "tells you what $f$ does to a very small vector whose tail is placed at $\mathbf{x}$." What does this mean, and how is this a generalization of the derivative?

Think about our old friend $f(x) = x^2$, with $\frac{d}{dx} f = 2x$. This means that if you make a very small increase $\epsilon$ to your input at point $x$, your output will increase by approximately $2x\epsilon$. (The exact increase is $2x\epsilon + \epsilon^2$, and the $\epsilon^2$ term is negligible when $\epsilon$ is very small.) You can view this as a function $\mathbb{R} \rightarrow \mathbb{R} \rightarrow \mathbb{R}$. Given a point $x \in \mathbb{R}$, it gives you a function which maps $\epsilon \rightarrow 2x\epsilon$. It tells you what $f$ does to a very small increase at point $x$.
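To make the one-dimensional picture concrete, here is a minimal Python sketch. The helper name `derivative_at` is mine, introduced only for illustration. It compares the actual change in $x^2$ with what the linear map $\epsilon \rightarrow 2x\epsilon$ predicts.

```python
def f(x):
    return x ** 2

def derivative_at(x):
    """Return the linear map eps -> 2*x*eps (illustrative helper, not a library call)."""
    return lambda eps: 2 * x * eps

x = 3.0
eps = 1e-4

actual_change = f(x + eps) - f(x)          # what f really does to the small increase
predicted_change = derivative_at(x)(eps)   # what the derivative says it does

# The leftover is the second-order term eps**2, tiny compared to eps itself.
error = actual_change - predicted_change
print(actual_change, predicted_change, error)
```

Shrinking `eps` by a factor of 10 shrinks the leftover by roughly a factor of 100, which is the sense in which the linear map captures what $f$ does to a very small increase.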

Now back to the multidimensional case. Adding a very small vector $\mathbf{\epsilon}$ to $\mathbf{x}$ is like making a very small increase to your input, but in multiple dimensions at once. So the total derivative $\mathbb{R}^n \rightarrow \mathbb{R}^n \rightarrow \mathbb{R}^m$ is the analogue of the derivative in the sense that given a point $\mathbf{x} \in \mathbb{R}^n$, it gives you a function which tells you how to map $\mathbf{\epsilon} \rightarrow \text{???}$ for vectors $\mathbf{\epsilon} \in \mathbb{R}^n$ whose tails are placed at $\mathbf{x}$. That's why I said it tells you what $f$ does to a very small vector whose tail is placed at $\mathbf{x}$.
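The "function that returns a function" shape can be sketched directly in Python. What follows is only a numerical approximation using central differences, not a definition. The function `example` (from $\mathbb{R}^2$ to $\mathbb{R}^3$) and the step size `h` are assumptions I picked for illustration.

```python
def D(f, h=1e-6):
    """Numerical stand-in for the total derivative, shaped as D(f)(x)(eps)."""
    def at_point(x):
        def what_f_does_to(eps):
            # Place the tail of a tiny copy of eps (scaled by h) at x, see how
            # far f moves its tip in each direction, then scale back up by h.
            forward = f([xi + h * ei for xi, ei in zip(x, eps)])
            backward = f([xi - h * ei for xi, ei in zip(x, eps)])
            return [(a - b) / (2 * h) for a, b in zip(forward, backward)]
        return what_f_does_to
    return at_point

def example(x):
    """A hypothetical function from R^2 to R^3, chosen only for illustration."""
    x1, x2 = x
    return [x1 * x2, x1 + x2, x1 ** 2]

at_x = D(example)([1.0, 2.0])   # a particular function R^2 -> R^3
image_e1 = at_x([1.0, 0.0])
image_e2 = at_x([0.0, 1.0])
image_sum = at_x([1.0, 1.0])
print(image_e1, image_e2, image_sum)
```

Up to floating-point noise, `image_sum` matches `image_e1` plus `image_e2` entry by entry, which is a first hint that the inner function is linear.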

(This post really ought to have some illustrations, but alas I couldn't find any on the Internet, so I'm stuck trying to paint pictures with words.)

An important fact about the total derivative is that it's a linear transformation. With our old friend the derivative, this was so obvious that perhaps it slipped beneath notice. $\epsilon \rightarrow 2x\epsilon$ is linear because it just multiplies the input by some number. All derivatives are linear in this sense.

(Don't get confused: $\frac{d}{dx} x^3 = 3x^2$ isn't a linear function of $x$, but that describes how the outer function $\mathbb{R} \rightarrow \mathbb{R} \rightarrow \mathbb{R}$ depends on the point $x$. The thing which we are saying is linear is the inner function $\mathbb{R} \rightarrow \mathbb{R}$, which is $\epsilon \rightarrow 3x^2 \epsilon$. In introductory calculus classes I've been in, we "squished down" the derivative of a function into $\mathbb{R} \rightarrow \mathbb{R}$ by identifying $\epsilon \rightarrow a\epsilon$ with $a \in \mathbb{R}$. We did this, for example, whenever we graphed the derivative of a function. That is all well and good for some things, but I think it is hard to understand the total derivative from that perspective.)

Because there are more dimensions, the total derivative of a function $\mathbb{R}^n \rightarrow \mathbb{R}^m$ can be more complex than "multiply by a number." More things can happen to a very small vector than just scaling; it can be rotated, skewed, etc. One coordinate can be mixed with another. But it is still a linear function of $\mathbf{\epsilon}$. And since it is a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$, it can be represented as an $m \times n$ matrix. This is called the Jacobian matrix of the function. We build it by filling out a matrix with every possible partial derivative you could take, organized so that the entry in row $i$, column $j$ is $\frac{\partial f_i}{\partial x_j}$.

For example, take the function $f(\mathbf{x}) = f(x_1, x_2) = \langle x_1^2 + \sin(x_2), x_1x_2 + 10 \rangle$ from $\mathbb{R}^2 \rightarrow \mathbb{R}^2$. We can write down its Jacobian matrix.

$$D(f)(\mathbf{x}) = \begin{bmatrix}\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2}\\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2}\end{bmatrix} = \begin{bmatrix}2x_1 & \cos(x_2)\\ x_2 & x_1\end{bmatrix}$$

Fill in a concrete value for $\mathbf{x} = \langle x_1, x_2 \rangle$ and the matrix represents a function $\mathbb{R}^2 \rightarrow \mathbb{R}^2$. It acts on vectors to transform them the way that a very small vector whose tail is at $\mathbf{x}$ would be transformed.
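Here is a small Python sketch of that claim for the example function above. The point `x`, the small vector `eps`, and the helper `apply` are arbitrary choices of mine for illustration.

```python
import math

def f(x1, x2):
    return (x1 ** 2 + math.sin(x2), x1 * x2 + 10)

def jacobian(x1, x2):
    """The Jacobian matrix of f, with the partial derivatives filled in by hand."""
    return [[2 * x1, math.cos(x2)],
            [x2,     x1]]

def apply(matrix, vec):
    """Matrix-vector product, i.e. the linear map the matrix represents."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

x = (1.0, 2.0)          # where the tail of the small vector sits
eps = (1e-4, -2e-4)     # the small vector itself

predicted = apply(jacobian(*x), eps)

# What f actually does: compare the images of the tip and the tail.
tail_image = f(*x)
tip_image = f(x[0] + eps[0], x[1] + eps[1])
actual = [tip - tail for tip, tail in zip(tip_image, tail_image)]

print(predicted)
print(actual)
```

The two printed vectors agree up to terms of second order in `eps`, so the smaller you make `eps`, the better the matrix describes what $f$ does near $\mathbf{x}$.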

Before I learned about the total derivative, the answer I would give to "how do you generalize derivatives to higher dimensions?" was "with the gradient." The gradient $\nabla f$ is a special case of the total derivative; when the function is $\mathbb{R}^n \rightarrow \mathbb{R}$, the total derivative is a matrix of dimension $1 \times n$ whose transpose can be seen as a column vector in $\mathbb{R}^n$. We take the function $\mathbb{R}^n \rightarrow \mathbb{R}^n \rightarrow \mathbb{R}$ and select just the first part ($\mathbb{R}^n \rightarrow \mathbb{R}^n$) to serve as a vector field. So the gradient is really just the total derivative with some tricks and flips.
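As a sketch of the "tricks and flips," take the hypothetical scalar function $g(x_1, x_2) = x_1^2 + \sin(x_2)$ (my choice for illustration; it happens to be the first component of the earlier example). Its total derivative at a point is a $1 \times 2$ matrix, and the gradient is the same pair of numbers read as a vector.

```python
import math

def g(x1, x2):
    """A hypothetical function from R^2 to R, chosen only for illustration."""
    return x1 ** 2 + math.sin(x2)

def total_derivative_matrix(x1, x2):
    """The 1 x 2 Jacobian of g at (x1, x2)."""
    return [[2 * x1, math.cos(x2)]]

def gradient(x1, x2):
    """The transpose of the 1 x 2 Jacobian, read as a vector in R^2."""
    (row,) = total_derivative_matrix(x1, x2)
    return list(row)

point = (1.0, 2.0)
eps = (1e-4, -2e-4)

grad = gradient(*point)
# Applying the 1 x 2 matrix to eps is the same as a dot product with the gradient.
predicted = sum(gi * ei for gi, ei in zip(grad, eps))
actual = g(point[0] + eps[0], point[1] + eps[1]) - g(*point)

print(grad, predicted, actual)
```

Evaluating `gradient` at every point gives the vector field, while `total_derivative_matrix` keeps the "function at each point" view.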

The total derivative is the One True Way to think about multi-dimensional derivatives. You can do everything you want with it. It's got a chain rule: $$D(f \circ g)(\mathbf{x}) = D(f)(g(\mathbf{x})) \circ D(g)(\mathbf{x})$$

Note that if you replace $D(f)$ with $f'$, this is almost the same thing as the familiar one-dimensional chain rule $$(f \circ g)'(x) = f'(g(x))g'(x)$$

But mere multiplication in the one-dimensional case has been replaced with composition in the multi-dimensional case. This makes sense because, for one-dimensional linear functions, composition is multiplication. ($(x \rightarrow 5x) \circ (x \rightarrow 2x) = x \rightarrow 10x$.) But when our linear functions get more dimensions, composition becomes more complicated. However, it's still a kind of multiplication — matrix multiplication of the Jacobian matrices!
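The chain rule can be checked numerically. In this sketch, `f` is the example from earlier in the post, while the inner function `g`, the point `u`, and the central-difference helper `numerical_jacobian` are assumptions I introduced for illustration.

```python
import math

def f(x):
    x1, x2 = x
    return [x1 ** 2 + math.sin(x2), x1 * x2 + 10]

def jac_f(x):
    x1, x2 = x
    return [[2 * x1, math.cos(x2)], [x2, x1]]

def g(u):
    """A hypothetical inner function from R^2 to R^2."""
    u1, u2 = u
    return [u1 * u2, u1 + u2]

def jac_g(u):
    u1, u2 = u
    return [[u2, u1], [1.0, 1.0]]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def numerical_jacobian(func, x, h=1e-6):
    """Central-difference Jacobian; column j is the response to nudging x[j]."""
    columns = []
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += h
        minus[j] -= h
        columns.append([(a - b) / (2 * h) for a, b in zip(func(plus), func(minus))])
    return [list(row) for row in zip(*columns)]  # turn columns into rows

u = [0.5, 1.5]
chain_rule = matmul(jac_f(g(u)), jac_g(u))            # D(f)(g(u)) composed with D(g)(u)
direct = numerical_jacobian(lambda v: f(g(v)), u)     # D(f o g)(u), estimated directly

print(chain_rule)
print(direct)
```

Note that `jac_f` is evaluated at `g(u)`, not at `u`, just as $f'$ is evaluated at $g(x)$ in the one-dimensional rule.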

Here's a table of analogies between the one-dimensional derivative and the total derivative.

d/dx                   |  D
-----------------------|--------------------------
small increase         |  add small vector
multiplication         |  linear transformation
R -> R -> R            |  R^n -> R^n -> R^m
real-valued function   |  vector-valued function
f'(g(x))g'(x)          |  D(f)(g(x)) o D(g)(x)