What is Object Detection?
Classification with localization
Defining the target label y
- Pedestrian, 0 or 1;
- Car, 0 or 1;
- Motorcycle, 0 or 1;
- Image background, 0 or 1;
Target Label Y
- Apply a logistic regression loss to Pc, or a squared prediction error.
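As a concrete sketch of the 8-component label described above (the values and helper name are hypothetical, not from the course code):

```python
import numpy as np

# Target label y = [Pc, bx, by, bh, bw, c1, c2, c3]
# Pc = 1: an object is present; bx, by, bh, bw locate it; c1..c3 is one-hot class.
y_car = np.array([1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0], dtype=float)

# Pc = 0: background; the remaining components are "don't care" (here np.nan).
y_background = np.array([0] + [np.nan] * 7, dtype=float)

def squared_error_loss(y, y_hat):
    """Squared prediction error; when Pc == 0 only the Pc component counts."""
    if y[0] == 1:
        return float(np.sum((y - y_hat) ** 2))
    return float((y[0] - y_hat[0]) ** 2)
```

In practice the Pc component is often trained with a logistic loss instead; the squared error above is just the simpler of the two options mentioned.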
Sliding Windows Detection Algorithm
Weakness : computational cost
Improvement : Convolutional Implementation of Sliding Windows
Turning FC layer into convolutional layers
Convolutional Implementation of Sliding Windows
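The key idea can be sketched in plain numpy (function name hypothetical): an FC layer over a 5x5x16 feature map is the same computation as convolving with filters of size 5x5x16, and once written as a convolution it slides for free over a larger feature map:

```python
import numpy as np

def fc_as_conv(feature_map, filters):
    """Apply an FC layer as a convolution.
    feature_map: (H, W, C); filters: (fh, fw, C, n), one filter per FC unit.
    On a feature map the same size as the filter, the output is 1x1xn,
    identical to a dense layer; on a larger map, the FC layer is evaluated
    at every window position in one pass (convolutional sliding windows)."""
    fh, fw, C, n = filters.shape
    H, W, _ = feature_map.shape
    out = np.zeros((H - fh + 1, W - fw + 1, n))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i:i + fh, j:j + fw, :]
            out[i, j] = np.tensordot(patch, filters, axes=3)
    return out
```

A 7x7x16 input yields a 3x3xn output: nine sliding-window evaluations, sharing all the convolutional computation instead of rerunning the network per window.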
IoU (Intersection over union)
More generally, IoU is a measure of the overlap between two bounding boxes.
Anchor Box Algorithm
With two anchor boxes: each object in the training image is assigned to the grid cell that contains the object's midpoint, and to the anchor box for that grid cell with the highest IoU.
- If we use two anchor boxes but a single grid cell contains three objects, the algorithm cannot handle this directly and some additional mechanism is needed;
- Likewise, if the same grid cell contains two objects whose anchor-box shapes are the same, a dedicated mechanism is needed to handle that case.
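The anchor assignment rule above can be sketched with a shape-only IoU: since anchors are described just by (width, height), imagine both boxes centered at the same point, so the intersection is simply min-width times min-height (names and anchor values hypothetical):

```python
import numpy as np

def shape_iou(wh1, wh2):
    """IoU of two boxes given only by (width, height), assumed concentric."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

def best_anchor(object_wh, anchors):
    """Index of the anchor whose shape has the highest IoU with the object."""
    ious = [shape_iou(object_wh, a) for a in anchors]
    return int(np.argmax(ious))

anchors = [(1.0, 2.0), (2.0, 1.0)]   # a tall anchor and a wide anchor
```

A tall object (e.g. a pedestrian) matches the tall anchor; a wide object (e.g. a car) matches the wide one.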
How to choose anchor boxes?
Anchor box shapes are usually specified by hand: choose 5-10 shapes that cover the variety of object shapes we want to detect;
A more advanced method (the k-means algorithm): cluster the shapes of the different objects, and use the clustering result to select a set of the most representative anchor boxes that stand for the shapes of the objects we want to detect.
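A minimal k-means sketch over (width, height) pairs, with plain Euclidean distance (real anchor-clustering pipelines often use 1 - IoU as the distance instead; function name hypothetical):

```python
import numpy as np

def kmeans_anchors(box_wh, k, iters=20):
    """Cluster (w, h) pairs with plain k-means; centroids become anchors.
    box_wh: iterable of (width, height) from the training-set boxes."""
    boxes = np.asarray(box_wh, dtype=float)
    # Deterministic init: spread the initial centroids across the data.
    idx = np.linspace(0, len(boxes) - 1, k).astype(int)
    centroids = boxes[idx].copy()
    for _ in range(iters):
        # Assign each box to its nearest centroid, then recompute centroids.
        d = np.linalg.norm(boxes[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids
```

On a training set with two clearly different object shapes (wide cars, tall pedestrians), the two centroids recover roughly those two shapes.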
For each grid cell, output a predefined label: yi = [Pc bx by bh bw c1 c2 c3]; different grid cells i have different label vectors yi.
- Target Y: with a 3×3 grid, the output has size 3×3×16 (assuming 2 anchor boxes);
Assuming 2 anchor boxes are used, each grid cell produces 2 predicted bounding boxes, one of which has a higher Pc.
YOLO Practice for Car Detection
Here’s an example of what your bounding boxes look like:
If you have 80 classes that you want YOLO to recognize, you can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. The video lectures had used the latter representation; in this notebook, we will use both representations, depending on which is more convenient for a particular step.
Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for you to use.
YOLO Model Details
YOLO (“you only look once”) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.
First things to know:
- The input is a batch of images of shape (m, 608, 608, 3).
- The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc, bx, by, bh, bw, c) as explained above. If you expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.
We will use 5 anchor boxes. So you can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).
Let's look in greater detail at what this encoding represents.
If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Since we are using 5 anchor boxes, each of the 19x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.
For simplicity, we will flatten the last two dimensions of the (19, 19, 5, 85) encoding, so the output of the deep CNN is (19, 19, 425).
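The flattened and unflattened encodings are just reshapes of one another; a small numpy sketch (variable names hypothetical) shows how the 425 numbers per cell split into 5 anchors of 85 numbers each:

```python
import numpy as np

m = 1
flat = np.random.rand(m, 19, 19, 425)        # raw output of the deep CNN
encoding = flat.reshape(m, 19, 19, 5, 85)    # 5 anchors x 85 numbers per cell

# Per anchor: [pc, bx, by, bh, bw] followed by 80 class scores.
pc = encoding[..., 0]          # (m, 19, 19, 5)
box_xywh = encoding[..., 1:5]  # (m, 19, 19, 5, 4)
classes = encoding[..., 5:]    # (m, 19, 19, 5, 80)
```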
Now, for each box (of each cell) we will compute the following elementwise product and extract a probability that the box contains a certain class.
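That elementwise product can be sketched as follows (function name hypothetical): the score of class c for a box is pc times the conditional class probability, and the box's predicted class is the argmax over those scores:

```python
import numpy as np

def box_class_scores(pc, class_probs):
    """Elementwise product: score_c = pc * P(class c | object).
    pc: e.g. (19, 19, 5); class_probs: e.g. (19, 19, 5, 80).
    Returns the best score and best class index per box."""
    scores = pc[..., None] * class_probs    # broadcast pc over the class axis
    best_class = scores.argmax(axis=-1)
    best_score = scores.max(axis=-1)
    return best_score, best_class
```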
Here’s one way to visualize what YOLO is predicting on an image:
- For each of the 19x19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
- Color that grid cell according to what object that grid cell considers the most likely.
Doing this results in this picture:
Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.
Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:
In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, you’ll use non-max suppression. Specifically, you’ll carry out these steps:
- Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class)
- Select only one box when several boxes overlap with each other and detect the same object.
Filtering with a threshold on class scores
# GRADED FUNCTION: yolo_filter_boxes
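A minimal numpy sketch of the idea behind this step (the graded notebook version operates on TensorFlow tensors; the function name and shapes here are illustrative):

```python
import numpy as np

def filter_boxes(box_confidence, boxes, box_class_probs, threshold=0.6):
    """Keep only boxes whose best class score exceeds `threshold`.
    box_confidence: (19, 19, 5, 1) pc values;
    boxes: (19, 19, 5, 4) box coordinates;
    box_class_probs: (19, 19, 5, 80) conditional class probabilities."""
    box_scores = box_confidence * box_class_probs   # (19, 19, 5, 80)
    box_classes = box_scores.argmax(axis=-1)        # (19, 19, 5)
    box_class_scores = box_scores.max(axis=-1)      # (19, 19, 5)
    mask = box_class_scores >= threshold            # boolean filter
    return box_class_scores[mask], boxes[mask], box_classes[mask]
```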
Even after filtering by thresholding over the class scores, you still end up with a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).
Non-max suppression relies on a very important function called “Intersection over Union”, or IoU.
- In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
- To calculate the area of a rectangle you need to multiply its height (y2 - y1) by its width (x2 - x1)
- You’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
- xi1 = maximum of the x1 coordinates of the two boxes
- yi1 = maximum of the y1 coordinates of the two boxes
- xi2 = minimum of the x2 coordinates of the two boxes
- yi2 = minimum of the y2 coordinates of the two boxes
# GRADED FUNCTION: iou
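A sketch of the function following the corner-based recipe above (clamping the intersection width and height at zero handles non-overlapping boxes):

```python
def iou(box1, box2):
    """IoU of two boxes given as corners (x1, y1, x2, y2)."""
    # Intersection corners: max of the upper-left, min of the lower-right.
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    # Union = area1 + area2 - intersection.
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)
```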
You are now ready to implement non-max suppression. The key steps are:
- Select the box that has the highest score.
- Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
- Go back to step 1 and iterate until there are no more boxes with a lower score than the currently selected box.
This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.
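The steps above can be sketched as a greedy loop in plain Python (function name hypothetical; the notebook itself uses TensorFlow's built-in, as noted below):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop every
    remaining box that overlaps it by more than iou_threshold.
    boxes: list of (x1, y1, x2, y2) corners; returns kept indices."""
    def iou(a, b):
        xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
        xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    order = list(np.argsort(scores)[::-1])  # indices, highest score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```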
TensorFlow has two built-in functions that are used to implement non-max suppression (so you don’t actually need to use your iou() implementation):
# GRADED FUNCTION: yolo_non_max_suppression
Wrapping up the filtering
It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions you’ve just implemented.
Exercise: Implement yolo_eval(), which takes the output of the YOLO encoding and filters the boxes using score thresholding and NMS. There's just one last implementation detail you have to know. There are a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which we have provided):
boxes = yolo_boxes_to_corners(box_xy, box_wh)
which converts the YOLO box coordinates (x, y, w, h) to box-corner coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes.
boxes = scale_boxes(boxes, image_shape)
YOLO’s network was trained to run on 608x608 images. If you are testing on images of a different size (for example, the car detection dataset has 720x1280 images), this step rescales the boxes so that they can be plotted on top of the original 720x1280 image.
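One simple way such a rescaling can work is sketched below, under the assumption that the boxes hold normalized corner coordinates in [0, 1] (the provided scale_boxes may use a different convention; the function name here is hypothetical):

```python
import numpy as np

def rescale_boxes(boxes, image_shape):
    """Scale normalized (x1, y1, x2, y2) corners in [0, 1] up to pixel
    coordinates of the target image, e.g. (720, 1280) as (height, width),
    so the boxes can be drawn on the original image."""
    h, w = image_shape
    scale = np.array([w, h, w, h], dtype=float)
    return np.asarray(boxes, dtype=float) * scale
```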
# GRADED FUNCTION: yolo_eval