Matrix Calculus for Machine Learning and Beyond

This is the course page for an 18.S096 Special Subject in Mathematics at MIT taught in January 2022 (IAP) by Professors Alan Edelman and Steven G. Johnson.

Lectures: MWF 11am–1pm, Jan 10–28, virtually via Zoom. 3 units, 2 problem sets due Jan 19 and Jan 26, no exams. Piazza discussions. TA/grader: Gaurav Arya.

Description:

We all know that calculus courses such as 18.01 and 18.02 are univariate and vector calculus, respectively. Modern applications such as machine learning require the next big step, matrix calculus.

This class covers a coherent approach to matrix calculus showing techniques that allow you to think of a matrix holistically (not just as an array of scalars), compute derivatives of important matrix factorizations, and really understand forward and reverse modes of differentiation. We will discuss adjoint methods, custom Jacobian matrix vector products, and how modern automatic differentiation is more computer science than mathematics in that it is neither symbolic nor finite differences.

Prerequisites: Linear Algebra such as 18.06 and multivariate calculus such as 18.02.

The course will involve simple numerical computations using the Julia language. Ideally, install it on your own computer by following these instructions, but as a fallback you can run it in the cloud here: Binder

Topics:

Here are some of the topics covered:

  • Derivatives as linear operators and linear approximation on arbitrary vector spaces: beyond gradients and Jacobians.
  • Derivatives of functions with matrix inputs and/or outputs (e.g. matrix inverses and determinants). Kronecker products and matrix "vectorization".
  • Derivatives of matrix factorizations (e.g. eigenvalues/SVD) and derivatives with constraints (e.g. orthogonal matrices).
  • Multidimensional chain rules, and the significance of right-to-left ("forward") vs. left-to-right ("reverse") composition. Chain rules on computational graphs (e.g. neural networks).
  • Forward- and reverse-mode manual and automatic multivariate differentiation.
  • Adjoint methods (vJp/pullback rules) for derivatives of solutions of linear, nonlinear, and differential equations.
  • Application to nonlinear root-finding and optimization. Multidimensional Newton and steepest-descent methods.
  • Applications in engineering/scientific optimization and machine learning.
  • Second derivatives, Hessian matrices, quadratic approximations, and quasi-Newton methods.

Lecture 1 (Jan 10)

Further reading: matrixcalculus.org (linked in the slides) is a fun site to play with derivatives of matrix and vector functions. The Matrix Cookbook has a lot of formulas for these derivatives, but no derivations. Some notes on vector and matrix differentiation were posted for 6.S087 from IAP 2021.

Further reading (fancier math): the perspective of derivatives as linear operators is sometimes called a Fréchet derivative and you can find lots of very abstract (what I'm calling "fancy") presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces. The "little-o notation" o(δx) we're using here for "infinitesimal asymptotics" is closely related to the asymptotic notation used in computer science, but in computer science people are typically taking the limit as the argument (often called "n") becomes very large instead of very small. A fancy name for a row vector is a "covector" or linear form, and the fancy version of the relationship between row and column vectors is the Riesz representation theorem, but until you get to non-Euclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.
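As a quick numerical illustration of the linear-operator viewpoint (just a sketch, with a made-up example function and random matrices): for f(A) = A², the derivative is the linear operator dA ↦ A dA + dA A, and the leftover terms are o(‖δA‖):

```julia
using LinearAlgebra

# f(A) = A^2 has derivative f'(A)[dA] = A*dA + dA*A, a linear operator on matrices.
A, dA = randn(4, 4), randn(4, 4)
for ε in (1e-2, 1e-4, 1e-6)
    remainder = (A + ε*dA)^2 - A^2 - (A*(ε*dA) + (ε*dA)*A)
    println("ε = $ε:  norm(remainder)/ε = ", norm(remainder) / ε)   # → 0 as ε → 0, i.e. o(δA)
end
```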

Lecture 2 (Jan 12)

Continued discussing Jacobian matrices (for vector-valued functions of vectors), with some examples. Switched to a streamlined "infinitesimal" notation df=f'(x)dx, where we now simply omit higher-order terms instead of writing o(δx), and f'(x) is taken to be a linear operator acting to the right on dx (≠ dx f'(x)!). Sum, product, and chain rules for derivatives as linear operators.
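Here is a small Julia sketch of the df = f'(x)dx viewpoint for a vector-valued function (the function f and its hand-computed Jacobian J below are just illustrative examples, not from the lecture notes):

```julia
using LinearAlgebra

# f: R^3 → R^2, with Jacobian J(x) worked out by hand; check df = J(x)*dx ≈ f(x+dx) - f(x).
f(x) = [x[1] * x[2], x[2] * sin(x[3])]
J(x) = [x[2] x[1] 0.0; 0.0 sin(x[3]) x[2]*cos(x[3])]

x, dx = randn(3), 1e-6 * randn(3)
println(norm(f(x + dx) - f(x) - J(x) * dx))   # higher order in the size of dx
```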

The chain rule and associativity: forward vs. reverse differentiation. The chain rule leads to a product of Jacobian matrices, and while you can't rearrange this product (matrix multiplication is not commutative), you can change the order in which you do the multiplications from left-to-right or right-to-left (matrix multiplication is associative). It turns out that this ordering can have a huge impact on the practical speed at which you can compute derivatives in large problems. Explained why, if you have 1 (or few) outputs, then you want to do Jacobian products from left-to-right so that you do vector–matrix products and not matrix–matrix products: this is called reverse-mode differentiation (also "adjoint" differentiation, or "backpropagation" in machine learning). Conversely, if there is only 1 (or a few) inputs, then you want to do the chain rule from right-to-left, called forward-mode differentiation. Gave an example of training neural networks, where there are zillions of inputs (the "fitting" parameters of the NN) but only one output (the "loss" function measuring the error compared to training data), and this leads to reverse-mode differentiation or "backpropagation". (Unfortunately somewhat garbled during lecture, but written notes are cleaned up.)
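To see why the ordering matters in practice, here is a rough Julia timing sketch with random dense "Jacobians" (the sizes are arbitrary, and real AD tools never form these matrices explicitly, but the associativity point is the same):

```julia
using LinearAlgebra

n = 2000
g  = randn(1, n)                   # row vector: gradient of a single scalar output
J1, J2, J3 = randn(n, n), randn(n, n), randn(n, n)

@time ((g * J1) * J2) * J3         # left-to-right (reverse mode): vector–matrix products, O(n²) each
@time g * (J1 * (J2 * J3))         # right-to-left: matrix–matrix products, O(n³) each
```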

In part 2 (last few minutes), began setting up some example problems involving matrix functions of matrices, and "vectorization" of matrices to vectors and linear operators to Kronecker products. To be continued.
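For example, the vectorization identity vec(AXB) = (Bᵀ ⊗ A) vec(X) can be checked directly in Julia (a quick sketch with random matrices of arbitrary compatible sizes):

```julia
using LinearAlgebra

# The linear operator X ↦ A*X*B, "vectorized": vec(A*X*B) = kron(transpose(B), A) * vec(X).
A, B, X = randn(3, 3), randn(4, 4), randn(3, 4)
println(norm(vec(A * X * B) - kron(transpose(B), A) * vec(X)))   # ≈ 0 up to roundoff
```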

Further reading: The terms "forward-mode" and "reverse-mode" differentiation are most prevalent in automatic differentiation (AD), which we will cover later in this course. You can find many, many articles online about backpropagation in neural networks. There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx right-to-left) is called a pushforward, whereas multiplying a gradient row vector (covector) by a Jacobian left-to-right is called a pullback. This video on the principles of AD in Julia by Dr. Mohamed Tarek also starts with a similar left-to-right (reverse) vs right-to-left (forward) viewpoint and goes into how it translates to Julia code, and how you define custom chain-rule steps for Julia AD.

Lecture 3 (Jan 14)

Revisiting the gradient ∇f. Scalar functions of matrices, matrix dot products, and the trace. Matrix Jacobians, continued from lecture 2. Finite-difference approximation.
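As a small Julia sketch of a gradient with respect to a matrix via the matrix dot product (the function here is just an illustrative example): for f(A) = tr(AᵀA), the gradient is ∇f = 2A, so df ≈ tr((2A)ᵀ dA):

```julia
using LinearAlgebra

# f(A) = tr(A'A) = ‖A‖²_F has gradient ∇f = 2A under the matrix dot product ⟨X,Y⟩ = tr(X'Y).
A, dA = randn(4, 4), 1e-6 * randn(4, 4)
df_exact  = tr((A + dA)' * (A + dA)) - tr(A' * A)
df_linear = tr((2A)' * dA)
println(abs(df_exact - df_linear))   # higher order in the size of dA
```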

Further reading: Wikipedia has a useful list of properties of the matrix trace. The "matrix dot product" introduced today is also called the Frobenius inner product, and the corresponding norm ("length" of the matrix viewed as a vector) is the Frobenius norm. When you "flatten" a matrix A by stacking its columns into a single vector, the result is called vec(A), and many important linear operations on matrices can be expressed as Kronecker products. The Matrix Cookbook has lots of formulas for derivatives of matrix functions. There is a lot of information online on finite-difference approximations, including these 18.303 notes and Section 5.7 of Numerical Recipes. The Julia FiniteDifferences.jl package provides lots of algorithms to compute finite-difference approximations; a particularly robust and powerful way to obtain high accuracy is to employ Richardson extrapolation to smaller and smaller δx. If you make δx too small, the finite precision (#digits) of floating-point arithmetic leads to catastrophic cancellation errors.
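The tradeoff between truncation error and cancellation error is easy to see numerically; a minimal sketch, using a forward difference for the derivative of sin(x) at x = 1:

```julia
# Forward-difference error for sin'(1) = cos(1): the error shrinks with δx until
# floating-point cancellation in sin(x+δx) - sin(x) makes it grow again.
x = 1.0
for δx in 10.0 .^ (-1:-2:-15)
    approx = (sin(x + δx) - sin(x)) / δx
    println("δx = $δx   error = ", abs(approx - cos(x)))
end
```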

Lecture 4 (Jan 19)

Further reading (part 1): There are lots of discussions of the derivative of a determinant online, involving the "adjugate" matrix det(A)A⁻¹. Not as well documented is that the gradient of the determinant is the cofactor matrix widely used for the Laplace expansion of a determinant. The formula for the derivative of log(det A) is also nice, and logs of determinants appear in surprisingly many applications (from statistics to quantum field theory). The Matrix Cookbook contains many of these formulas, but no derivations. A nice application of d(det(A)) is solving for eigenvalues λ by applying Newton's method to det(A-λI)=0, and more generally one can solve det(M(λ))=0 for any matrix-valued function M(λ) — the resulting roots λ are called nonlinear eigenvalues (if M is nonlinear in λ), and one can apply Newton's method using the determinant-derivative formula here.
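For instance, the formula d(det A) = det(A) tr(A⁻¹ dA) is easy to verify numerically in Julia (a quick sketch with a random matrix and a small random perturbation):

```julia
using LinearAlgebra

# Check d(det A) = det(A) * tr(A⁻¹ dA) against a small random perturbation dA.
A, dA = randn(5, 5), 1e-6 * randn(5, 5)
lhs = det(A + dA) - det(A)
rhs = det(A) * tr(A \ dA)
println(abs(lhs - rhs) / abs(rhs))   # small relative error, higher order in the size of dA
```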

Further reading (part 2): SGJ gave another overview of optimization in 18.335 (video). There are many textbooks on nonlinear optimization algorithms of various sorts, including specialized books on convex optimization, derivative-free optimization, etcetera. A useful review of topology-optimization methods can be found in Sigmund and Maute (2013). See the notes on adjoint methods and slides from 18.335 (video).

Lecture 5 (Jan 21)

Automatic differentiation, guest lecture by Dr. Chris Rackauckas.

Further reading: Googling "automatic differentiation" will turn up many, many resources — this is a huge practical field these days. ForwardDiff.jl (described in detail by this paper) in Julia uses dual-number arithmetic, similar to lecture, to compute derivatives; see also this AMS review article, or google "dual number automatic differentiation" for many other reviews. Implementing automatic reverse-mode AD is much more complicated than defining a new number type, unfortunately, and involves a lot more intricacies of compiler technology. See also Chris's blog post on tradeoffs in AD, and Chris's discussion post on AD limitations.
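If you have ForwardDiff.jl installed, a minimal usage sketch looks like this (the functions f and g below are arbitrary examples):

```julia
using ForwardDiff

# Forward-mode AD via dual numbers: derivatives exact to roundoff, no finite-difference step size.
f(x) = sin(x^2) / x
println(ForwardDiff.derivative(f, 1.5))        # scalar derivative

g(v) = [v[1]^2 + v[2], v[1] * exp(v[2])]
println(ForwardDiff.jacobian(g, [1.0, 2.0]))   # 2×2 Jacobian
```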

Lecture 6 (Jan 24)

Further reading (part 1): Computing derivatives on curved surfaces ("manifolds") is closely related to tangent spaces in differential geometry. The effect of constraints can also be expressed in terms of Lagrange multipliers, which are useful in expressing optimization problems with constraints (see also chapter 5 of Convex Optimization by Boyd and Vandenberghe). In physics, first and second derivatives of eigenvalues and first derivatives of eigenvectors are often presented as part of "time-independent" perturbation theory in quantum mechanics, or as the Hellmann–Feynman theorem for the case of dλ. The derivative of an eigenvector involves all of the other eigenvectors, but a much simpler "vector–Jacobian product" (involving only a single eigenvector and eigenvalue) can be obtained from left-to-right differentiation of a scalar function of an eigenvector, as reviewed in the 18.335 notes on adjoint methods.
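For a real-symmetric A, the Hellmann–Feynman formula dλ = qᵀ dA q (with q the corresponding unit eigenvector) can be checked numerically; a quick sketch, assuming the tiny perturbation doesn't change the eigenvalue ordering:

```julia
using LinearAlgebra

# Hellmann–Feynman: for symmetric A with unit eigenvector q and eigenvalue λ, dλ ≈ q' * dA * q.
A  = Symmetric(randn(5, 5))
dA = Symmetric(1e-6 * randn(5, 5))
λ, Q  = eigen(A)
λ2, _ = eigen(A + dA)
i = 1                                                     # track the smallest eigenvalue
println(abs((λ2[i] - λ[i]) - Q[:, i]' * dA * Q[:, i]))    # higher order in the size of dA
```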

Further reading (part 2): Bilinear forms are an important generalization of quadratic operations to arbitrary vector spaces, and we saw that the second derivative can be viewed as a symmetric bilinear form. This is closely related to a quadratic form, which is just what we get by plugging in the same vector twice, e.g. the f''(x)[δx,δx]/2 that appears in quadratic approximations for f(x+δx) is a quadratic form. The most familiar multivariate version of f''(x) is the Hessian matrix; Khan academy has an elementary [introduction to quadratic approximation](https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/quadratic-approxi
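Here is a concrete Julia sketch of the quadratic approximation, with a hand-computed gradient and Hessian for an example function (the function is made up purely for illustration):

```julia
using LinearAlgebra

# f(x,y) = x³ + x*y², with ∇f and Hessian H worked out by hand; quadratic-approximation check.
f(x)  = x[1]^3 + x[1] * x[2]^2
∇f(x) = [3*x[1]^2 + x[2]^2, 2*x[1]*x[2]]
H(x)  = [6*x[1] 2*x[2]; 2*x[2] 2*x[1]]

x, δx = [1.0, 2.0], 1e-3 * randn(2)
quad = f(x) + ∇f(x)' * δx + δx' * H(x) * δx / 2
println(abs(f(x + δx) - quad))   # O(‖δx‖³), far smaller than the error of the linear approximation
```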

Lecture 7 (Jan 26)

  • part 1: continued Hessian notes from previous lecture: minima/maxima and f″ definiteness, Hessian conditioning and steepest-descent convergence
  • part 2: derivatives and backpropagation on graphs and linear operators: poster
  • video
  • pset 2 solutions and computational notebook

Further reading (part 1): Positive-definite Hessian matrices, or more generally definite quadratic forms f″, appear at extrema (f′=0) of scalar-valued functions f(x) that are local minima; there are more formal treatments of the same idea, and conversely Khan academy has the simple 2-variable version where you can check the signs of the eigenvalues of the 2×2 Hessian just by looking at the determinant and a single entry (or the trace). There's a nice stackexchange discussion on why an ill-conditioned Hessian tends to make steepest descent converge slowly; some Toronto course notes on the topic may also be helpful.
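The slow convergence on an ill-conditioned quadratic is easy to reproduce; a minimal sketch, using steepest descent with exact line search on f(x) = ½xᵀHx with H = diag(1, κ):

```julia
using LinearAlgebra

# Steepest descent with exact line search on f(x) = x'Hx/2; iterations grow with cond(H) = κ.
function steepest_descent_iters(κ; tol = 1e-8)
    H = Diagonal([1.0, κ])
    x = [1.0, 1.0]
    iters = 0
    while norm(H * x) > tol              # ‖∇f(x)‖ = ‖H x‖
        g = H * x
        α = dot(g, g) / dot(g, H * g)    # exact line-search step for a quadratic
        x -= α * g
        iters += 1
    end
    return iters
end

for κ in (10, 100, 1000)
    println("κ = $κ:  iterations = ", steepest_descent_iters(κ))
end
```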

Further reading (part 2): See this blog post on calculus on computational graphs for a gentle review, and these Columbia course notes for a more formal approach.

Lecture 8 (Jan 28)

  • part 1: continued Hessian notes: using Hessians for optimization (Newton & quasi-Newton & Newton–Krylov methods), cost of Hessians
  • part 2: adjoint differentiation of ODEs (guest lecture by Dr. Chris Rackauckas): notes from 18.337
  • video

Further reading (part 1): See e.g. these Stanford notes on sequential quadratic optimization using trust regions (sec. 2.2). See 18.335 notes on BFGS quasi-Newton methods (also video). The fact that a quadratic optimization problem in a sphere has strong duality and hence is efficiently solvable is discussed in section 5.2.4 of the Convex Optimization book. There has been a lot of work on automatic Hessian computation, but for large-scale problems you can ultimately only compute Hessian–vector products efficiently in general, which are equivalent to a directional derivative of the gradient, and can be used e.g. for Newton–Krylov methods.
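As a small Julia sketch of the last point (assuming ForwardDiff.jl is installed, and using an arbitrary example function), a Hessian–vector product is just a directional derivative of the gradient and never requires forming the full Hessian:

```julia
using ForwardDiff, LinearAlgebra

# H(x)*v as the directional derivative of the gradient: (∇f(x + εv) - ∇f(x))/ε ≈ H(x)*v.
f(x)  = sum(x .^ 4) + x[1] * x[2]
∇f(x) = ForwardDiff.gradient(f, x)

x, v, ε = randn(3), randn(3), 1e-6
hvp_fd = (∇f(x + ε * v) - ∇f(x)) / ε        # no Hessian matrix ever formed
hvp_H  = ForwardDiff.hessian(f, x) * v      # explicit Hessian, fine only for small problems
println(norm(hvp_fd - hvp_H))               # small difference
```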

Further reading (part 2): A very general reference on adjoint-method (reverse-mode/backpropagation) differentiation of ODEs (and generalizations thereof), using notation similar to that of Chris R. today, is Cao et al (2003) (pdf). See also the adjoint sensitivity analysis and automatic sensitivity analysis sections of Chris's amazing DifferentialEquations.jl software suite for numerical solution of ODEs in Julia. There is a nice YouTube lecture on adjoint sensitivity of ODEs, again using a similar notation.
