Paper: FaceNet - A Unified Embedding for Face Recognition and Clustering

Introduction

核心思想

Face Image -> 128-D Embedding (End to End)
Euclidean distance between Embeddings = Measure of face similarity
Triplet Loss = minimize sum(||A - P|| - ||A - N|| + α); how P and N are chosen matters a great deal

With embeddings, face recognition, verification, and clustering all become routine tasks: comparisons of distances between embeddings.
The input images are tight crops of the face area, with no 2D or 3D alignment.

Triplet Loss

Why not use softmax?

  • Usually in supervised learning we have a fixed number of classes and train the network using the softmax cross entropy loss. However in some cases we need to be able to have a variable number of classes. In face recognition for instance, we need to be able to compare two unknown faces and say whether they are from the same person or not.

Triplet loss tries to enforce a margin between each pair of faces from one person to all other faces. Somewhat similar to the margin in SVMs.

triplets of embeddings:

  • an anchor
  • a positive of the same class as the anchor
  • a negative of a different class

For some distance d on the embedding space, the loss of a triplet (a, p, n) is:

$$L = \max(d(a,p) - d(a,n) + margin, 0)$$

Triplet Loss Function
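As a sanity check, the formula above can be sketched in numpy. The function name and the default margin are my own choices; the paper uses squared L2 distance with α = 0.2:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for a single (a, p, n) triple of embeddings.

    Uses squared Euclidean distance, following the paper; margin=0.2
    matches the alpha reported there.
    """
    d_ap = np.sum((anchor - positive) ** 2)   # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```

When the negative is far enough away (d_an > d_ap + margin) the loss is exactly 0, which is why easy triplets contribute nothing to training.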

Triplets should be chosen as hard triplets, i.e. examples that violate the constraint above, since only these contribute to training and lead to fast convergence.

Triplet Selection and Training Procedure

Three categories of triplets:

  • Easy Triplets
    triplets which have a loss of 0, because d(a,p) + margin < d(a,n)
  • Hard Triplets
    triplets where the negative is closer to the anchor than the positive, i.e. d(a,n) < d(a,p)
  • Semi-hard Triplets
    triplets where the negative is not closer to the anchor than the positive, but which still have positive loss: d(a,p) < d(a,n) < d(a,p) + margin

Categories of Negatives
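The three categories depend only on the two distances, so they can be told apart with a small helper (illustrative, not from the paper; the boundary cases with equality are lumped into semi-hard):

```python
def triplet_category(d_ap, d_an, margin=0.2):
    """Classify a triplet from its anchor-positive and anchor-negative
    distances into easy / hard / semi-hard."""
    if d_ap + margin < d_an:
        return "easy"        # loss is 0
    if d_an < d_ap:
        return "hard"        # negative closer to anchor than positive
    return "semi-hard"       # d(a,p) < d(a,n) < d(a,p) + margin
```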

When selecting triplets, we want the hard positive $$argmax_{x_i^p} \parallel f(x_i^a) - f(x_i^p) \parallel_2^2$$ and the hard negative $$argmin_{x_i^n} \parallel f(x_i^a) - f(x_i^n) \parallel_2^2$$.
However, computing these over the whole training set is infeasible, and outliers and mislabelled images would dominate the selection.

The paper picks a random semi-hard negative for every (anchor, positive) pair and trains on these triplets.

There are two ways out:

  • Offline Triplet Mining
    Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
    Not efficient enough.
  • Online Triplet Mining
    Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

Online Generation

In online mining, we have computed a batch of B embeddings from a batch of B inputs.
A valid triplet (i, j, k) requires that i and j belong to the same person while k does not.
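This validity condition can be expressed as a boolean mask over all index triples of a batch; a numpy sketch (the function name is my own):

```python
import numpy as np

def valid_triplet_mask(labels):
    """Boolean mask of shape (B, B, B) where mask[i, j, k] is True when
    (i, j, k) is a valid triplet: i != j, labels[i] == labels[j], and
    labels[k] != labels[i]."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]        # same identity?
    distinct = ~np.eye(len(labels), dtype=bool)      # i != j
    pos = same & distinct                            # valid (a, p) pairs
    neg = ~same                                      # valid (a, n) pairs
    return pos[:, :, None] & neg[:, None, :]
```

For a batch of P persons with K images each, `mask.sum()` equals the PK(K-1)(PK-K) count given below.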

Suppose that you have a batch of faces as input of size B = PK, composed of P different persons with K images each. A typical value is K = 4. There are two strategies:

  • Batch All
    select all the valid triplets, and average the loss on the hard and semi-hard triplets.
    • a crucial point here is to not take into account the easy triplets (those with loss
      0), as averaging on them would make the overall loss very small.
    • this produces a total of PK(K-1)(PK-K) triplets (PK anchors, K-1 possible positives per anchor, PK-K possible negatives).
  • Batch Hard (better)
    for each anchor, select the hardest positive (biggest distance d(a,p)) and the hardest negative (smallest distance d(a,n)) among the batch.
    • this produces PKPK triplets.
    • the selected triplets are the hardest among the batch.
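The batch-hard strategy can be sketched end to end in numpy: build the pairwise distance matrix, mask it by identity, and take a max/min per anchor. This is a minimal sketch (names are my own) that assumes every identity in the batch has at least one other image and at least one negative:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-hard triplet loss: for each anchor, take the hardest positive
    (largest d(a,p)) and hardest negative (smallest d(a,n)) in the batch,
    then average max(d_ap - d_an + margin, 0) over the PK anchors."""
    x = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # squared Euclidean pairwise distances, shape (B, B)
    sq = np.sum(x ** 2, axis=1)
    dist = np.maximum(sq[:, None] - 2.0 * x @ x.T + sq[None, :], 0.0)
    same = labels[:, None] == labels[None, :]
    # hardest positive: largest same-identity distance per anchor
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    # hardest negative: smallest different-identity distance per anchor
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

In a real training loop the same computation would be done on the framework's tensors so gradients flow through the selected distances.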

Model Architecture

Model Structure
Train the CNN using Stochastic Gradient Descent (SGD) with standard backprop and AdaGrad.

Two architectures are evaluated: a Zeiler&Fergus based model and a GoogLeNet style Inception model.

Their practical differences lie in the number of parameters and FLOPS. Which model to use depends on the application.

  • Model running in a datacenter can have many parameters and afford a large number of FLOPS.
  • Model running on a mobile phone needs to have few parameters, so that it can fit into memory.

Zeiler&Fergus based Model

Per image

  • 140 million parameters
  • 1.6 billion FLOPS

Zeiler&Fergus based Model

GoogLeNet style Inception Model

Per image

  • 6.6M - 7.5M parameters
  • 500M - 1.6B FLOPS

GoogLeNet style Inception Model

Experiments

FLOPS vs. Accuracy Trade-off

Note: the number of model parameters shows no obvious correlation with accuracy.
Network Architectures
FLOPS vs. Accuracy trade-off

Sensitivity to Image Quality

Image Quality

Embedding Dimensionality

Embedding Dimensionality

Amount of Training Data

Training Data Size

Summary

Strengths:

  • Directly learns an embedding into a Euclidean space for face verification.
  • Requires little alignment; only a tight crop around the face area is needed.

Future work:

  • Better understanding of the error cases;
  • Further improving the model;
  • Reducing the model size and CPU requirements;
  • Reduce the currently extremely long training time.

Code

I implemented it by following the paper:
FaceNet Face Recognition

Ref:
[1] https://arxiv.org/abs/1503.03832
[2] https://omoindrot.github.io/triplet-loss