These are my notes for the Convolutional Neural Networks course taught by Andrew Ng on Coursera. Lesson notes and assignments are provided here to deepen my understanding of neural networks. You can view my GitHub for the programming assignments.
This week we learned about a major CNN application: object detection. There are several strategies and building blocks for this task:
- Object Localization
- Landmark Detection
- Sliding Windows Detection
- Bounding Box Prediction
- Non-Maximum Suppression
- Anchor Boxes
- YOLO Algorithm
Now, let’s talk about them one by one~
All of the tasks we've done before are pure classification, not localization: given a picture, we just output 0/1 to say whether the image contains a cat. In this chapter we want not only classification but also object localization, that is, outputting the position of the detected object. To accomplish that, the network outputs four more values: bx, by, bh, bw, where (bx, by) is the center of the box and (bh, bw) are its height and width, all describing the precise position of the object in the image. Thus we define a new output format y = [pc, bx, by, bh, bw, c1, c2, c3], where pc indicates whether any object is present and c1, c2, c3 are the class labels.
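To make the format concrete, here is a small sketch of how such a label vector could be built. The helper name `make_target` and the three-class setup are my own, just for illustration:

```python
import numpy as np

def make_target(object_present, box=None, class_id=None, num_classes=3):
    """Build the 8-dim localization label y = [pc, bx, by, bh, bw, c1, c2, c3].
    When no object is present, pc = 0 and the remaining entries are
    "don't care" (set to 0 here)."""
    y = np.zeros(1 + 4 + num_classes)
    if object_present:
        y[0] = 1.0              # pc: an object is present
        y[1:5] = box            # bx, by, bh, bw
        y[5 + class_id] = 1.0   # one-hot class label
    return y

# A class-1 object centered at (0.5, 0.7) with height 0.3 and width 0.4:
y = make_target(True, box=[0.5, 0.7, 0.3, 0.4], class_id=1)
```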
I'm not sure why this landmark business matters. As Andrew says, landmark detection helps us know how someone is currently moving, by outputting the coordinates of key points. But he didn't cover the details, so let's skip it.
In the Machine Learning course, Andrew already covered some details about sliding windows detection (in the Photo OCR chapter, I think — it came up in the autonomous driving example anyway). Suppose you have a window for the object you want to detect. You simply slide this window across the image and check whether the object exists at each position. To detect objects of different sizes, windows of various sizes are needed. When implementing sliding windows, you need to change the original output from a single number into a vector. Furthermore, iterating over every window position is computationally expensive. To address this shortcoming, we use the convolutional implementation of sliding windows, which convolves the whole image down to the final output in one pass, removing much of the repeated computation.
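As a toy illustration of why the naive version is expensive, here is a minimal sliding-window enumerator (my own sketch, not the course code). Every crop it yields would get its own classifier pass; overlapping crops repeat most of that work, which is exactly what the convolutional implementation shares:

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Enumerate all (row, col) crops of size `window` with the given stride.
    A classifier would be run on every single crop."""
    h, w = image.shape[:2]
    wh, ww = window
    crops = []
    for r in range(0, h - wh + 1, stride):
        for c in range(0, w - ww + 1, stride):
            crops.append(((r, c), image[r:r + wh, c:c + ww]))
    return crops

img = np.zeros((28, 28))
crops = sliding_windows(img, window=(14, 14), stride=2)
# (28 - 14) / 2 + 1 = 8 positions per axis -> 64 classifier evaluations
```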
In the previous part, we talked about sliding window detection. Here we introduce another method, used in the YOLO algorithm: the bounding box. The main idea is to split the whole image into 3×3 grid cells (or 19×19 — 3×3 is just for explanation). Then we run object detection on each grid cell, and each cell is only responsible for objects whose center falls inside it.
To build an evaluation metric for object localization, we use Intersection over Union. Generally speaking, IoU measures the overlap between two bounding boxes: the area of their intersection divided by the area of their union.
Say we have detected that there are two cars in the picture, but several boxes point to the same car. The duplicate detections should be discarded, and non-max suppression does exactly that: among heavily overlapping boxes, only the one with the highest score survives.
If we want to detect multiple objects at one position — say a woman standing in front of a car, and we want to detect both — the bounding box detection above cannot work well, because each grid cell outputs only one box. Hence anchor boxes were proposed. With two anchor boxes, each object in the training image is assigned to the grid cell that contains the object's midpoint, and within that cell, to the anchor box with the highest IoU. Correspondingly, the shape of the output y changes: each grid cell now outputs one (pc, bx, by, bh, bw, c1, c2, c3) block per anchor box.
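A sketch of the anchor assignment by shape. The helper names and the two anchor shapes are illustrative, not from the assignment; the shapes are compared as if centered at the same point, so only width and height matter:

```python
def shape_iou(wh1, wh2):
    """IoU of two boxes compared by shape only (aligned at the same center),
    as used when matching an object to its best anchor box."""
    w1, h1 = wh1
    w2, h2 = wh2
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

def best_anchor(object_wh, anchors):
    """Index of the anchor whose shape best matches the object."""
    return max(range(len(anchors)), key=lambda i: shape_iou(object_wh, anchors[i]))

# Illustrative anchors: one tall (person-like), one wide (car-like).
anchors = [(0.3, 0.8), (0.8, 0.3)]
```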
We will go through this algorithm in this assignment, so let’s just skip it :)
This assignment is to implement the YOLO model to detect cars on the road. By the end of the assignment, you'll have a well-trained model for car detection! Let's begin.
First, we feed the preprocessed image through a deep CNN to get an encoding tensor.
Since we are using 5 anchor boxes, each of the 19×19 cells encodes information about 5 boxes. Anchor boxes are defined only by their width and height.
For simplicity, we will flatten the last two dimensions of the (19, 19, 5, 85) encoding, so the output of the deep CNN is (19, 19, 425).
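In NumPy terms, that flattening is just a reshape: the 5 anchor predictions of 85 numbers each are laid out side by side in the last axis.

```python
import numpy as np

# The (19, 19, 5, 85) encoding flattened into (19, 19, 425).
encoding = np.zeros((19, 19, 5, 85))
flat = encoding.reshape(19, 19, 5 * 85)
```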
Now, for each box (of each cell) we will compute the following elementwise product and extract a probability that the box contains a certain class.
Following the steps above, you can write the code as below:
def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
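The assignment implements this with Keras/TensorFlow ops; here is an equivalent NumPy sketch of the filtering logic, assuming the tensor shapes from the encoding above:

```python
import numpy as np

def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold=0.6):
    """NumPy sketch of YOLO score-thresholding.
    box_confidence:  (19, 19, 5, 1)  -- pc for each box
    boxes:           (19, 19, 5, 4)  -- box coordinates
    box_class_probs: (19, 19, 5, 80) -- class probabilities per box
    """
    # Elementwise product: per-box, per-class scores.
    box_scores = box_confidence * box_class_probs      # (19, 19, 5, 80)
    box_classes = np.argmax(box_scores, axis=-1)       # best class per box
    box_class_scores = np.max(box_scores, axis=-1)     # that class's score
    mask = box_class_scores >= threshold               # keep confident boxes only
    return box_class_scores[mask], boxes[mask], box_classes[mask]
```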
Non-max suppression relies on a very important function called “Intersection over Union”, or IoU; here we implement it first.
def iou(box1, box2):
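A plain-Python sketch of `iou`, assuming boxes in corner format (x1, y1, x2, y2) as in the assignment:

```python
def iou(box1, box2):
    """IoU of two boxes in corner format (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    # Clamp to 0 so non-overlapping boxes give zero intersection area.
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
    # Union = sum of the two areas minus the intersection.
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area
    return inter_area / union_area
```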
You are now ready to implement non-max suppression. The key steps are:
- Select the box that has the highest score.
- Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
- Go back to step 1 and iterate until no boxes remain with a lower score than the currently selected box.
This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.
def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
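A NumPy sketch of the greedy procedure described above (the actual assignment delegates this to `tf.image.non_max_suppression`; `iou` is inlined here so the sketch is self-contained):

```python
import numpy as np

def yolo_non_max_suppression(scores, boxes, classes, max_boxes=10, iou_threshold=0.5):
    """Greedy NMS over boxes in corner format (x1, y1, x2, y2)."""
    def iou(b1, b2):
        xi1, yi1 = max(b1[0], b2[0]), max(b1[1], b2[1])
        xi2, yi2 = min(b1[2], b2[2]), min(b1[3], b2[3])
        inter = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
        union = ((b1[2] - b1[0]) * (b1[3] - b1[1])
                 + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
        return inter / union

    order = list(np.argsort(scores)[::-1])   # indices, highest score first
    keep = []
    while order and len(keep) < max_boxes:
        best = order.pop(0)                  # step 1: take the top-scoring box
        keep.append(best)
        # step 2: drop every remaining box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return scores[keep], boxes[keep], classes[keep]
```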
def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
After running your model on a test image, you can see the predicted bounding boxes drawn on it.
In the next lesson, we'll learn about face recognition and neural style transfer.