Currently, I am taking a class called "Convex Optimization for Machine Learning & Computer Vision". Even though the lectures are quite theoretical, the programming homework is not, and we get to work on some interesting projects like this one: how would you separate the background and the foreground of a video using optimization?
Most likely, you would somehow make the assumption that the background takes up most of the space and stays still, while the foreground consists of all the smaller parts that are moving. But how would we describe that mathematically, and how would we "implement" this assumption? That's what we are going to cover in this blog post.
We are given a video and we want to separate the background from the foreground. We can think of each original frame as a matrix in $\mathbb{R}^{n_1 \times n_2}$. In order to work with the data, we vectorize each frame and then stack everything together to form a matrix $Z \in \mathbb{R}^{n_1 n_2 \times f}$ that contains the whole video at once, where $n_1 \times n_2$ is the frame size and $f$ is the number of frames. Thus, we try to find the background matrix $A$ and the foreground matrix $B$ such that $Z = A + B$.
Let's suppose we have a function $g$ that measures how much a matrix (i.e. the video) "changes" and a second function $h$ that measures how many non-zero elements a matrix has (i.e. how much "space" an animation might take in comparison to the whole video). Then, we can formulate the following constrained optimization problem:

$$\min_{A, B} \; g(A) + \lambda \cdot h(B) \quad \text{s.t.} \quad A + B = Z$$
The parameter $\lambda$ sort of weights how much emphasis we want to put on each of the two objectives. As it turns out, it also makes sense to weaken our constraint a bit. Instead of reconstructing the exact video $Z$, we could also reconstruct a similar video $M$ such that the difference between the two stays below a certain threshold: $\|M - Z\|_F \leq \varepsilon$.
Before we take a closer look at the two mysterious functions $g$ and $h$, we move the reconstruction constraint into the objective using the indicator function $\delta(\cdot)$ that is 0 if its argument is true and $\infty$ otherwise:

$$\min_{A, B, M} \; g(A) + \lambda \cdot h(B) + \delta(\|M - Z\|_F \leq \varepsilon) \quad \text{s.t.} \quad A + B = M$$

This enables us to use methods from convex optimization.
Most of the norms you usually encounter are vector $p$-norms. For $p \geq 1$ they are defined like this:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$
For example, if you take $p = 2$ you get the Euclidean norm that measures the "direct" distance between two vectors. Or, you take $p = 1$ and you get the Manhattan norm that measures the distance when you are only allowed to move on a grid (like it is in Manhattan).
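As a quick numeric illustration (this snippet is mine, not part of the course code; it only assumes NumPy), the two norms of the vector $(3, 4)$ come out exactly as the geometry suggests:

```python
import numpy as np

# Vector from the origin to the point (3, 4).
x = np.array([3.0, 4.0])

# p = 2: Euclidean ("direct") distance.
l2 = np.linalg.norm(x, ord=2)

# p = 1: Manhattan distance (moving only along grid axes).
l1 = np.linalg.norm(x, ord=1)

print(l2)  # 5.0
print(l1)  # 7.0
```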
For any $p \geq 1$ we have a true norm, and any norm turns out to be convex. Proof: Let $x, y \in \mathbb{R}^n$ and $\lambda \in [0, 1]$. A function $f$ is convex iff $f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$. Using the triangle inequality and the absolute homogeneity of a norm, we get

$$\|\lambda x + (1 - \lambda) y\| \leq \|\lambda x\| + \|(1 - \lambda) y\| = \lambda \|x\| + (1 - \lambda) \|y\|$$
However, for $0 < p < 1$ the $p$-norm is not a norm but a quasinorm, and therefore the function is not necessarily convex anymore. You can also see in the visualization that the sublevel sets (the areas enclosed by the level curves) are not convex sets.
Notice how the edges at the axes become sharper and sharper. Taking $p = 0$ breaks the original $p$-norm formula (what's the 0-th root of a number?), so people agreed to define the 0-norm as the number of non-zero elements in a vector:

$$\|x\|_0 = |\{i : x_i \neq 0\}|$$
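In NumPy, this count is exactly what `np.count_nonzero` gives you (a small check of my own, not from the post's code):

```python
import numpy as np

# A sparse "foreground"-like vector: most entries are zero.
x = np.array([0.0, 3.0, 0.0, 0.0, -1.0, 0.0])

# The 0-"norm": the number of non-zero elements.
l0 = np.count_nonzero(x)
print(l0)  # 2
```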
The 0-norm looks like a suitable candidate for our function $h$. We assume that the foreground takes up only a small part of the video, i.e. each frame will have many zero elements in it and our foreground matrix $B$ will be sparse. Thus, we want to minimize the 0-norm. Sadly, the 0-norm is not only non-convex but also NP-hard to optimize. We do know how to optimize over convex functions, so we need to find a suitable convex alternative to the 0-norm that behaves not exactly the same, but close enough.
Let's suppose we are in $\mathbb{R}^2$ only. The blue line below shows all optimal $(x_1, x_2)$-pair solutions for a given problem. So which solution should we pick? Clearly, there are infinitely many. We could add the requirement that the solution be as sparse as possible, i.e. either $x_1$ or $x_2$ should be zero. It should be obvious that such a solution is the intersection of the blue line with the $x_1$- or $x_2$-axis. But how would we get the intersection? Instead of optimizing over the original energy function (the blue line) only, we will also optimize over the 1-norm (red line), i.e. we will make the area of the red square as small as possible such that the red edges still hit some part of the blue line.
Because the 1-norm has "edges" at the axes, it is likely that we end up with a sparse solution. This is also known as "l1-norm regularization", and you can read a nice blog post here that explains this trick in more detail. For $p > 1$ we lose the edges at the axes, and for $p < 1$ our "norm" wouldn't be convex anymore, so the 1-norm seems to be a reasonable candidate for enforcing sparsity on the foreground matrix $B$.
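You can verify the sparsity-inducing effect numerically with a brute-force toy check (assuming NumPy; the line $x_1 + 2 x_2 = 2$ is a made-up stand-in for the blue line of solutions): among all feasible points, the one with the smallest 1-norm sits exactly on an axis.

```python
import numpy as np

# Feasible set: all (x1, x2) on the line x1 + 2*x2 = 2, sampled densely.
x1 = np.linspace(-3.0, 3.0, 6001)
x2 = (2.0 - x1) / 2.0

# Pick the feasible point with the smallest 1-norm.
l1_norms = np.abs(x1) + np.abs(x2)
best = np.argmin(l1_norms)

# The minimizer is (0, 1): the intersection of the line with the x2-axis.
print(x1[best], x2[best])
```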
One tiny part is missing: $B$ is a matrix and the 1-norm is defined for vectors. Thus, we need to vectorize $B$ first before we apply the 1-norm. There is sometimes an abuse of notation, and the 1-norm of a matrix refers to

$$\|B\|_1 = \|\text{vec}(B)\|_1 = \sum_{i,j} |b_{ij}|$$
Matrix Schatten $p$-Norm
There are several ways to define a matrix norm, and things can become confusing. One way is by vectorizing the matrix and actually referring to a vector norm, like it was the case above. A general formula would be

$$\|A\|_p = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^p \right)^{1/p}$$
One famous special case is the Frobenius norm ($p = 2$) that matches the definition of the Euclidean norm:

$$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$$
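A one-liner confirms this correspondence (my own sanity check, assuming NumPy): the Frobenius norm of a matrix equals the Euclidean norm of its vectorization.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

# Frobenius norm of A vs. Euclidean norm of the flattened A.
fro = np.linalg.norm(A, ord='fro')
vec2 = np.linalg.norm(A.flatten(), ord=2)
print(np.isclose(fro, vec2))  # True
```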
Sometimes, a matrix norm refers to the operator norm, which "measures the 'size' of certain linear operators". Let $A$ be a linear mapping from $\mathbb{R}^n$ to $\mathbb{R}^m$. Then we could define $\|A\|$ as

$$\|A\| = \sup_{x \neq 0} \frac{\|Ax\|}{\|x\|}$$
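For the Euclidean vector norm, this operator norm is the spectral norm, which equals the largest singular value, a fact that already hints at the SVD-based norms coming next. A quick check (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

# Spectral norm (operator norm induced by the Euclidean norm) ...
op2 = np.linalg.norm(A, ord=2)
# ... equals the largest singular value of A.
sigma_max = np.linalg.svd(A, compute_uv=False)[0]
print(np.isclose(op2, sigma_max))  # True
```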
Or, we could apply a singular value decomposition (SVD) to $A$ and then define a norm using the singular values $\sigma_i$. That is the idea of the Schatten $p$-norm, and that is what we are going to look at in more detail now.
To understand the SVD we first need to understand the rank of a matrix. Mathematically, it tells you how many rows (or columns) are linearly independent, but I like to think of it as "how interesting" your matrix is. Suppose you take $m$ measurements and your data is $n$-dimensional. Then, you could represent your data as an $m \times n$ matrix. Now, if your measurements are all similar, or you could find some simple (linear) "rule" that explains all the rows (or columns) your matrix has, then the matrix is probably low-rank and you could throw away a lot of unnecessary data. In contrast, if it turns out that each measurement carries some interesting information which is not present in the other measurements, then you had better keep everything in order not to lose information. In some sense, the rank also indicates how much you could compress your data without losing meaningful insights.
One technique to come up with "rules" that explain your data in a more compressed way is to calculate the singular value decomposition

$$A = U \Sigma V^T$$
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices ($U^T U = I$, $V^T V = I$) and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix consisting of $r$ singular values unequal to 0 and $\min(m, n) - r$ equal to zero. $U$ and $V$ consist of the necessary orthonormal bases, and $\Sigma$ tells you how much you have to scale them to reconstruct the entries in $A$.
The higher a specific singular value is, the more important it is for explaining the data. For example, if your data follows a linear trend in $\mathbb{R}^2$, then you will end up with a very high first singular value and a quite small second singular value. And if you can explain your data perfectly using a simple line, then the second singular value will be zero and your rank is 1, because the rank equals the number of non-zero singular values:

$$\text{rank}(A) = |\{i : \sigma_i \neq 0\}|$$
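Here is a small experiment of my own (assuming NumPy) that mirrors this example: 50 two-dimensional measurements lying exactly on the line $y = 2x$ yield one large singular value and one that is numerically zero, so the rank is 1.

```python
import numpy as np

# 50 two-dimensional measurements that lie exactly on the line y = 2x.
x = np.linspace(0.0, 1.0, 50)
D = np.stack([x, 2.0 * x], axis=1)  # one measurement per row, shape (50, 2)

S = np.linalg.svd(D, compute_uv=False)
# First singular value is large; the second is numerically zero.
rank = np.count_nonzero(S > 1e-10)
print(rank)  # 1
```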
Now, remember that the Schatten $p$-norm is defined over the singular values for $p \geq 1$:

$$\|A\|_{S_p} = \left( \sum_i \sigma_i^p \right)^{1/p}$$

Interestingly enough, taking $p = 2$ also yields the Frobenius norm.
Setting $p = 0$ will not give us a real norm, just as was the case with the vector 0-norm, and it won't be convex anymore, but it turns out that it counts the non-zero elements among your singular values, i.e. the rank. Do you see the similarity between the vector 0-norm and the matrix Schatten 0-norm?
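All three special cases can be read off a single vector of singular values. The snippet below (my own check, assuming NumPy) builds a rank-2 matrix and verifies that the Schatten 1-norm matches NumPy's nuclear norm, the Schatten 2-norm matches the Frobenius norm, and the Schatten "0-norm" recovers the rank:

```python
import numpy as np

rng = np.random.default_rng(2)
# A rank-2 matrix, built as a product of two thin random factors.
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

S = np.linalg.svd(A, compute_uv=False)
nuclear = S.sum()                   # Schatten 1-norm (nuclear norm)
frob = np.sqrt((S ** 2).sum())      # Schatten 2-norm (Frobenius norm)
rank = np.count_nonzero(S > 1e-10)  # Schatten "0-norm" = rank

print(np.isclose(nuclear, np.linalg.norm(A, ord='nuc')))  # True
print(np.isclose(frob, np.linalg.norm(A, ord='fro')))     # True
print(rank)  # 2
```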
Going back to our video separation problem, we assumed that the background of the video does not change and is probably the same frame for the whole video, i.e. all columns of $A$ (the frames) should be more or less the same image vector. If we minimize the rank of $A$, we achieve exactly this. But again, the rank is not convex and we need to find a convex substitute.
Like we did in the previous section, we will replace the Schatten 0-norm with the Schatten 1-norm, which is also known as the nuclear norm

$$\|A\|_* = \sum_i \sigma_i$$

This is a good approximation to the rank that we can work with.
So, our final optimization problem looks like this:

$$\min_{A, B, M} \; \|A\|_* + \lambda \|\text{vec}(B)\|_1 + \delta(\|M - Z\|_F \leq \varepsilon) \quad \text{s.t.} \quad A + B = M$$
Notice that none of the terms is differentiable, so plain gradient descent or forward-backward splitting does not work here.
Alternating Direction Method of Multipliers (ADMM)
Instead, we use ADMM. For now, I won't go into the details of how to derive the update rules, but the idea is to introduce a Lagrange multiplier $Y$ to enforce $A + B = M$ and to optimize over $A$, $B$, $M$ and finally $Y$ in an alternating way. Each sub-problem can be reformulated using the proximal operator, and the proximal operators of the nuclear norm, the l1-norm and the indicator function are well known. Please have a look at the Python code to see the update iteration. We reformulate the problem as the augmented Lagrangian

$$L_\rho(A, B, M, Y) = \|A\|_* + \lambda \|\text{vec}(B)\|_1 + \delta(\|M - Z\|_F \leq \varepsilon) + \langle Y, A + B - M \rangle + \frac{\rho}{2} \|A + B - M\|_F^2$$
Optimization over A: Low Rank Matrix

$$A^{k+1} = \text{prox}_{\frac{1}{\rho} \|\cdot\|_*}\left(M - B - \tfrac{1}{\rho} Y\right)$$

This proximal operator is evaluated by soft-thresholding the singular values, also known as singular value thresholding.
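As a minimal standalone sketch of singular value thresholding (my own toy version of the operation that `calcA` performs in the full implementation, assuming NumPy):

```python
import numpy as np

def svt(X, tau):
    # Prox of tau * nuclear norm: shrink each singular value by tau, clip at 0.
    U, S, VH = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(S - tau, 0.0)) @ VH

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
A = svt(X, 1.0)

# Shrinking the spectrum can only lower (or keep) the rank.
r_before = np.linalg.matrix_rank(X)
r_after = np.linalg.matrix_rank(A)
print(r_before, r_after)
```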
Optimization over B: Sparse Matrix

$$B^{k+1} = \text{prox}_{\frac{\lambda}{\rho} \|\cdot\|_1}\left(M - A - \tfrac{1}{\rho} Y\right)$$

This proximal operator is evaluated by entry-wise soft-thresholding with threshold $\lambda / \rho$.
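Soft-thresholding itself is a one-liner; here is a small self-contained sketch of mine (assuming NumPy) of what `calcB` does to each entry:

```python
import numpy as np

def soft_threshold(x, tau):
    # Prox of tau * l1-norm: zero out small entries, shrink the rest by tau.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.5, 1.5])
# Entries with |x| <= 0.5 are set to 0; the others move 0.5 towards 0.
print(soft_threshold(x, 0.5))
```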
Optimization over M: Reconstruction

$$M^{k+1} = \text{proj}_{\{\|M - Z\|_F \leq \varepsilon\}}\left(A + B + \tfrac{1}{\rho} Y\right)$$

i.e. we project the current reconstruction onto the $\varepsilon$-ball around the original video $Z$.
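The projection onto a Euclidean ball has a simple closed form; this is a toy version of mine (assuming NumPy) of the formula used inside `calcM`:

```python
import numpy as np

def project_ball(x, z, eps):
    # Project x onto the Euclidean ball of radius eps centered at z:
    # points inside stay put, points outside are pulled onto the sphere.
    d = x - z
    return z + eps / np.maximum(np.linalg.norm(d), eps) * d

z = np.zeros(3)
outside = project_ball(np.array([3.0, 0.0, 0.0]), z, 1.0)  # pulled to radius 1
inside = project_ball(np.array([0.2, 0.0, 0.0]), z, 1.0)   # already feasible
print(outside, inside)
```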
Dual Ascent Step for Y
In ADMM, we perform a gradient ascent step on the dual variable $Y$ to enforce the equality constraint:

$$Y^{k+1} = Y^k + \rho \, (A + B - M)$$
Implementation using NumPy
```python
import numpy as np

# Helpers: flatten a matrix to a vector and back.
def mat2vec(X):
    return X.flatten()

def vec2mat(x, shape):
    return x.reshape(shape)

def calcA(B, M, Y, rho):
    # Prox of (1/rho) * nuclear norm: singular value thresholding.
    X = M - B - (1/rho)*Y
    U, S, VH = np.linalg.svd(X, full_matrices=False)
    diag_plus = np.diag(np.maximum(np.zeros(S.shape), S - 1/rho))
    A = U @ diag_plus @ VH
    return A

def calcB(A, M, Y, rho, lamb):
    # Prox of (lamb/rho) * l1-norm: entry-wise soft-thresholding.
    X = M - A - (1/rho)*Y
    x = mat2vec(X)
    lamb_rho = lamb / rho
    b = x.copy()
    b[x > lamb_rho] -= lamb_rho
    b[x < -lamb_rho] += lamb_rho
    b[np.abs(x) <= lamb_rho] = 0
    B = vec2mat(b, X.shape)
    return B

def calcM(A, B, Y, Z, rho, eps):
    # Project A + B + (1/rho)*Y onto the eps-ball around the data Z.
    X = A + B + (1/rho)*Y
    x = mat2vec(X)
    z = mat2vec(Z)
    m = z + eps / np.maximum(np.linalg.norm(x - z), eps) * (x - z)
    M = vec2mat(m, X.shape)
    return M

def calcY(A, B, M, Y, rho):
    # Dual ascent step on the Lagrange multiplier.
    Y = Y + rho * (A + B - M)
    return Y

def calcEnergy(A, B, M, Y, rho, lamb):
    # Evaluate the augmented Lagrangian.
    A_nuc = np.linalg.norm(A, ord='nuc')
    B_1 = lamb * np.linalg.norm(mat2vec(B), ord=1)
    inner = np.trace(Y.T @ (A + B - M))
    fro = rho/2 * np.linalg.norm(A + B - M, ord='fro')**2
    energy = A_nuc + B_1 + inner + fro
    return energy
```
```python
# rho, lamb, eps and max_it are the hyperparameters; Z is the vectorized video.
# A, B, M and Y are initialized before the loop (e.g. with zeros and M = Z).
energies = []
for it in range(max_it + 1):
    # update A: low-rank background
    A = calcA(B, M, Y, rho)
    # update B: sparse foreground
    B = calcB(A, M, Y, rho, lamb)
    # update M: reconstruction
    M = calcM(A, B, Y, Z, rho, eps)
    # update Y: dual ascent
    Y = calcY(A, B, M, Y, rho)
    # track the augmented Lagrangian energy
    energy = calcEnergy(A, B, M, Y, rho, lamb)
    energies.append(energy)
    # output status
    print("Iteration:", it, "/", max_it, ", Energy:", energy)
```