Face Recognition and Neural Style Transfer

About

These are my notes for the Convolutional Neural Networks course taught by Andrew Ng on Coursera. Lesson notes and assignments are collected here to deepen my understanding of neural networks. You can view my GitHub for the programming assignments.

Content

There are two main parts this week, as listed in the title:

  • Face Recognition
  • Neural Style Transfer

Both of them are quite interesting, and by the end of this week's course we'll have built a face recognition system and a neural style transfer system. Let's go through the basic techniques behind these two applications.

Face Recognition

This technique has two different categories:

  • Face Verification
  • Face Recognition

When you implement the assignment, you'll find that face recognition is built directly on top of face verification; there is not much difference between them.

To make recognition work from a single reference image per person, we use one-shot learning. The idea is to encode each image as a 128-dimensional vector and then compare the distance between the vector generated from the input image and the vectors of the images stored in the database. The convolutional network that produces these encodings, run on two images in parallel, is called a Siamese network; see the DeepFace paper by Taigman et al. for a detailed description. The goal of learning is an encoding $f(x)$ such that $\|f(x^{(i)}) - f(x^{(j)})\|^2$ is small when $x^{(i)}$ and $x^{(j)}$ are the same person, and large when they are different people.
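As a concrete illustration, here is a minimal sketch of the comparison step, assuming a hypothetical encode() function that maps an image to its 128-dimensional vector (in the assignment this role is played by img_to_encoding together with the pretrained model):

import numpy as np

def is_same_person(image_a, image_b, encode, threshold=0.7):
    # encode() is a hypothetical image -> 128-d vector mapping (e.g. the Siamese CNN)
    dist = np.linalg.norm(encode(image_a) - encode(image_b))
    # Below the threshold we treat the two images as the same person
    return dist < threshold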

In order to train our neural network, we use the triplet loss as our cost function. Define the anchor image (the reference picture of a person, saved in the database) as A, the positive image (another picture of the same person) as P, and the negative image (a picture of a different person) as N. Then the triplet constraint can be expressed as below:

$\|f(A)-f(P)\|^2 - \|f(A)-f(N)\|^2 + \alpha \le 0$ (where $\alpha$ is the margin)

Thus, the cost function can be written like this:

Given three images $A, P, N$:
$\mathcal{L}(A,P,N) = \max\big(\|f(A)-f(P)\|^2 - \|f(A)-f(N)\|^2 + \alpha,\ 0\big)$
$J = \sum_{i=1}^{m} \mathcal{L}\big(A^{(i)}, P^{(i)}, N^{(i)}\big)$ over all training triplets.

Attention: since the triplet constraint is easily satisfied for randomly chosen triplets, you need to mine triplets that are hard to train on. Only then do the parameter updates produce a well-performing model. A rough illustration of what "hard" means is sketched below.
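Here is a minimal numpy sketch (not from the assignment) that, given the anchor encoding and a set of candidate negative encodings, picks the negative closest to the anchor, i.e. the one most likely to violate the margin:

import numpy as np

def pick_hard_negative(f_anchor, f_negatives):
    # f_anchor: (128,) encoding of the anchor image
    # f_negatives: (num_candidates, 128) encodings of candidate negative images
    dists = np.linalg.norm(f_negatives - f_anchor, axis=1)
    # The closest negative gives the largest (hardest) triplet loss for this anchor
    return int(np.argmin(dists))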

Just to note that face verification can also be posed as a binary classification problem: the two encodings are combined and fed into a logistic (sigmoid) output unit, since the final output is just 0/1 (same person or not).
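In that formulation the prediction can be written, for example, as
$\hat{y} = \sigma\Big(\sum_{k=1}^{128} w_k \,\big|f(x^{(i)})_k - f(x^{(j)})_k\big| + b\Big)$,
where the element-wise differences of the two encodings feed a single logistic unit.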

Neural Style Transfer

This technique was introduced in the paper A Neural Algorithm of Artistic Style (Gatys et al.). You can read the paper yourself if you like; here I only cover how to define the cost function we need in order to build our network.

Before implementing this technique, you need to know what the neural network is actually doing; a good visualization is given in Visualizing and Understanding Convolutional Networks. As before, we'll define three abbreviations: the content image as C, the style image as S, and the generated image as G. Next, we will define the content cost function and the style cost function respectively, and then add them together with different weights to get our final cost function.

Content Cost Function

Here we map the image into activations, just like in face recognition (using a pre-trained VGG net as the mapping). We then use the squared L2 distance between the activations of C and G to measure their similarity: the smaller the content cost, the more similar the content of the two pictures.

Style Cost Function

Again, we use a VGG net as our mapping. But instead of taking the final output, we choose a middle layer in order to capture the general style of the style image. Suppose the activation of your chosen layer $l$ has shape $(n_H, n_W, n_C)$; we unroll it and use it to build a Gram matrix, which captures the correlations between the different filter channels. Here are the concrete steps for getting the style matrix:
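Written out (consistent with the assignment code further below):

  1. Unroll the activation of shape $(n_H, n_W, n_C)$ into a matrix $A^{[l]}$ of shape $(n_C, n_H n_W)$.
  2. Compute the Gram matrix $G^{[l]} = A^{[l]} \big(A^{[l]}\big)^T$, of shape $(n_C, n_C)$.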

Then, the cost function can be defined as this:
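$J_{style}^{[l]}(S,G) = \frac{1}{4\, n_C^2 (n_H n_W)^2} \sum_{i=1}^{n_C}\sum_{j=1}^{n_C} \big(G^{(S)}_{ij} - G^{(G)}_{ij}\big)^2$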

Finally, we can define the total cost function as:
$J(G) = \alpha\, J_{content}(C,G) + \beta\, J_{style}(S,G)$

Using this formula, you can build your own neural style transfer machine!

Assignment 1 Face Recognition

The Triplet Loss

We've covered the triplet loss function in detail above. Here is the code:

import tensorflow as tf  # (provided by the assignment notebook)

def triplet_loss(y_true, y_pred, alpha = 0.2):
    """
    Implementation of the triplet loss as defined by formula (3)

    Arguments:
    y_true -- true labels, required when you define a loss in Keras; you don't need it in this function.
    y_pred -- python list containing three objects:
        anchor -- the encodings for the anchor images, of shape (None, 128)
        positive -- the encodings for the positive images, of shape (None, 128)
        negative -- the encodings for the negative images, of shape (None, 128)

    Returns:
    loss -- real number, value of the loss
    """

    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]

    ### START CODE HERE ### (≈ 4 lines)
    # Step 1: Compute the (encoding) distance between the anchor and the positive, summing over axis=-1
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), axis=-1)
    # Step 2: Compute the (encoding) distance between the anchor and the negative
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), axis=-1)
    # Step 3: Subtract the two previous distances and add alpha.
    basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
    # Step 4: Take the maximum of basic_loss and 0.0. Sum over the training examples.
    loss = tf.reduce_sum(tf.maximum(0.0, basic_loss))
    ### END CODE HERE ###

    return loss

verify function

Three steps are needed in our given function:

  1. Compute the encoding of the image from image_path.
  2. Compute the distance between this encoding and the encoding of the identity's image stored in the database.
  3. Open the door (return True) if the distance is less than 0.7; otherwise return False.

Code:

import numpy as np  # (provided by the assignment notebook, along with img_to_encoding)

# GRADED FUNCTION: verify

def verify(image_path, identity, database, model):
    """
    Function that verifies if the person on the "image_path" image is "identity".

    Arguments:
    image_path -- path to an image
    identity -- string, name of the person you'd like to verify the identity. Has to be a resident of the Happy house.
    database -- python dictionary mapping names of allowed people's names (strings) to their encodings (vectors).
    model -- your Inception model instance in Keras

    Returns:
    dist -- distance between the image_path and the image of "identity" in the database.
    door_open -- True, if the door should open. False otherwise.
    """

    ### START CODE HERE ###

    # Step 1: Compute the encoding for the image. Use img_to_encoding(), see example above. (≈ 1 line)
    encoding = img_to_encoding(image_path, model)

    # Step 2: Compute distance with identity's image (≈ 1 line)
    dist = np.linalg.norm(encoding - database[identity])

    # Step 3: Open the door if dist < 0.7, else don't open (≈ 3 lines)
    if dist < 0.7:
        print("It's " + str(identity) + ", welcome home!")
        door_open = True
    else:
        print("It's not " + str(identity) + ", please go away")
        door_open = False

    ### END CODE HERE ###

    return dist, door_open

face recognition

You will find there's little difference between this and face verification: just loop over the database dictionary and compare the encodings one by one.

# GRADED FUNCTION: who_is_it

def who_is_it(image_path, database, model):
    """
    Implements face recognition for the happy house by finding who is the person on the image_path image.

    Arguments:
    image_path -- path to an image
    database -- database containing image encodings along with the name of the person on the image
    model -- your Inception model instance in Keras

    Returns:
    min_dist -- the minimum distance between image_path encoding and the encodings from the database
    identity -- string, the name prediction for the person on image_path
    """

    ### START CODE HERE ###

    ## Step 1: Compute the target "encoding" for the image. Use img_to_encoding(), see example above. ## (≈ 1 line)
    encoding = img_to_encoding(image_path, model)

    ## Step 2: Find the closest encoding ##

    # Initialize "min_dist" to a large value, say 100 (≈ 1 line)
    min_dist = 100

    # Loop over the database dictionary's names and encodings.
    for (name, db_enc) in database.items():

        # Compute L2 distance between the target "encoding" and the current "db_enc" from the database. (≈ 1 line)
        dist = np.linalg.norm(encoding - db_enc)

        # If this distance is less than the min_dist, then set min_dist to dist, and identity to name. (≈ 3 lines)
        if dist < min_dist:
            min_dist = dist
            identity = name

    ### END CODE HERE ###

    if min_dist > 0.7:
        print("Not in the database.")
    else:
        print("it's " + str(identity) + ", the distance is " + str(min_dist))

    return min_dist, identity

Assignment 2 Deep Learning & Art: Neural Style Transfer

One thing to mention: we use transfer learning to avoid training a new CNN from scratch. Here we use a pre-trained VGG-19 for our neural style transfer.

Computing the content cost

We would like the “generated” image G to have similar content as the input image C. Suppose you have chosen some layer’s activations to represent the content of an image. In practice, you’ll get the most visually pleasing results if you choose a layer in the middle of the network–neither too shallow nor too deep. (After you have finished this exercise, feel free to come back and experiment with using different layers, to see how the results vary.)

So, suppose you have picked one particular hidden layer to use. Now, set the image C as the input to the pretrained VGG network, and run forward propagation. Let $a^{(C)}$ be the hidden layer activations in the layer you had chosen. (In lecture, we had written this as $a^{(C)[l]}$, but here we'll drop the superscript $[l]$ to simplify the notation.) This will be an $n_H \times n_W \times n_C$ tensor. Repeat this process with the image G: set G as the input, and run forward propagation. Let $a^{(G)}$ be the corresponding hidden layer activation. We will define the content cost function as:
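$J_{content}(C,G) = \frac{1}{4\, n_H n_W n_C} \sum_{\text{all entries}} \big(a^{(C)} - a^{(G)}\big)^2$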

import tensorflow as tf  # (provided by the assignment notebook)

# GRADED FUNCTION: compute_content_cost

def compute_content_cost(a_C, a_G):
    """
    Computes the content cost

    Arguments:
    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G

    Returns:
    J_content -- scalar that you compute using equation 1 above.
    """

    ### START CODE HERE ###
    # Retrieve dimensions from a_G (≈ 1 line)
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape a_C and a_G (≈ 2 lines)
    a_C_unrolled = tf.transpose(tf.reshape(a_C, (n_H * n_W, n_C)))
    a_G_unrolled = tf.transpose(tf.reshape(a_G, (n_H * n_W, n_C)))

    # Compute the cost with tensorflow (≈ 1 line)
    J_content = tf.reduce_sum(tf.square(tf.subtract(a_C_unrolled, a_G_unrolled))) / (4 * n_H * n_W * n_C)
    ### END CODE HERE ###

    return J_content

Style Cost Function

The style matrix is also called a “Gram matrix.” In linear algebra, the Gram matrix $G$ of a set of vectors $(v_1, \dots, v_n)$ is the matrix of dot products, whose entries are $G_{ij} = v_i^T v_j = \text{np.dot}(v_i, v_j)$. In other words, $G_{ij}$ compares how similar $v_i$ is to $v_j$: if they are highly similar, you would expect them to have a large dot product, and thus $G_{ij}$ to be large.

Note that there is an unfortunate collision in the variable names used here. We are following common terminology used in the literature, but $G$ is used to denote the Style matrix (or Gram matrix) as well as to denote the generated image $G$. We will try to make sure which $G$ we are referring to is always clear from the context.

In NST, you can compute the Style matrix by multiplying the “unrolled” filter matrix with their transpose:
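$G^{[l]} = A^{[l]} \big(A^{[l]}\big)^T$, where $A^{[l]}$ is the unrolled activation matrix of shape $(n_C, n_H n_W)$.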

The result is a matrix of dimension $(n_C, n_C)$, where $n_C$ is the number of filters. The value $G_{ij}$ measures how similar the activations of filter $i$ are to the activations of filter $j$.

# GRADED FUNCTION: gram_matrix

def gram_matrix(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_H*n_W)

    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """

    ### START CODE HERE ### (≈ 1 line)
    GA = tf.matmul(A, tf.transpose(A))
    ### END CODE HERE ###

    return GA

After generating the Style matrix (Gram matrix), your goal will be to minimize the distance between the Gram matrix of the “style” image S and that of the “generated” image G. For now, we are using only a single hidden layer $a^{[l]}$, and the corresponding style cost for this layer is defined as:
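$J_{style}^{[l]}(S,G) = \frac{1}{4\, n_C^2 (n_H n_W)^2} \sum_{i=1}^{n_C}\sum_{j=1}^{n_C} \big(G^{(S)}_{ij} - G^{(G)}_{ij}\big)^2$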

where $G^{(S)}$ and $G^{(G)}$ are respectively the Gram matrices of the “style” image and the “generated” image, computed using the hidden layer activations for a particular hidden layer in the network.

# GRADED FUNCTION: compute_layer_style_cost

def compute_layer_style_cost(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G

    Returns:
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """

    ### START CODE HERE ###
    # Retrieve dimensions from a_G (≈ 1 line)
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape the images to have them of shape (n_C, n_H*n_W) (≈ 2 lines)
    a_S = tf.transpose(tf.reshape(a_S, (n_H * n_W, n_C)))
    a_G = tf.transpose(tf.reshape(a_G, (n_H * n_W, n_C)))

    # Computing gram_matrices for both images S and G (≈ 2 lines)
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)

    # Computing the loss (≈ 1 line)
    J_style_layer = tf.reduce_sum(tf.square(tf.subtract(GS, GG))) / (4 * (n_C * n_W * n_H) ** 2)

    ### END CODE HERE ###

    return J_style_layer

You can combine the style costs for different layers as follows:
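$J_{style}(S,G) = \sum_{l} \lambda^{[l]} \, J_{style}^{[l]}(S,G)$, where the $\lambda^{[l]}$ are the weights given to each chosen layer (the coefficients in STYLE_LAYERS in the assignment).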

Total Cost Function

Finally, let’s create a cost function that minimizes both the style and the content cost. The formula is:
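$J(G) = \alpha\, J_{content}(C,G) + \beta\, J_{style}(S,G)$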

Exercise: Implement the total cost function which includes both the content cost and the style cost.

# GRADED FUNCTION: total_cost

def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Computes the total cost function

    Arguments:
    J_content -- content cost coded above
    J_style -- style cost coded above
    alpha -- hyperparameter weighting the importance of the content cost
    beta -- hyperparameter weighting the importance of the style cost

    Returns:
    J -- total cost as defined by the formula above.
    """

    ### START CODE HERE ### (≈ 1 line)
    J = alpha * J_content + beta * J_style
    ### END CODE HERE ###

    return J

Integrate all

Finally, let’s put everything together to implement Neural Style Transfer!

Here’s what the program will have to do:

  1. Create an Interactive Session
  2. Load the content image
  3. Load the style image
  4. Randomly initialize the image to be generated
  5. Load the VGG-19 model
  6. Build the TensorFlow graph:
    • Run the content image through the VGG-19 model and compute the content cost
    • Run the style image through the VGG-19 model and compute the style cost
    • Compute the total cost
    • Define the optimizer and the learning rate
  7. Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step (a minimal sketch of this loop follows below).
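For reference, here is a minimal sketch of step 7, assuming TF1-style sessions and names used in the assignment notebook (the model dictionary returned by the VGG loader, a train_step built from an Adam optimizer on the total cost J, and an initial noisy generated image). These names are assumptions based on the notebook, and the real notebook additionally prints costs and saves intermediate images:

def model_nn(sess, input_image, num_iterations=200):
    # Initialize global variables, then feed the noisy generated image as the model input
    sess.run(tf.global_variables_initializer())
    sess.run(model["input"].assign(input_image))

    for i in range(num_iterations):
        # One optimizer step on the total cost J, then read back the current generated image
        sess.run(train_step)
        generated_image = sess.run(model["input"])

    return generated_image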

Summary

That's all for the CNN part. I've learnt to read some primary papers and picked up the basics of Convolutional Neural Networks. Next I'll go through the RNN, the sequence model!