## What is Object Detection?

### Classification with localization

### Defining the target label y

Output: the objects present in the image and a bounding box locating each one:

- pedestrian, 0 or 1;
- car, 0 or 1;
- motorcycle, 0 or 1;
- background (no object), 0 or 1;
- bounding box: bx, by, bh, bw

Here bx and by give the midpoint of the object (e.g. the car), while bh and bw give the height and width of its bounding box. With the top-left corner of the image at (0, 0) and the bottom-right corner at (1, 1), all four numbers are expressed as fractions of the image's dimensions.
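The 8-dimensional label described above can be sketched as follows (all values are made-up illustrations of the convention, not data from any dataset):

```python
import numpy as np

# Sketch of the target label y = [Pc, bx, by, bh, bw, c1, c2, c3].
# Suppose the image contains a car whose bounding box is centered at
# (0.5, 0.7), with height 0.3 and width 0.4 (all relative to the image,
# origin at the top-left corner).
pc = 1                      # an object is present
bx, by, bh, bw = 0.5, 0.7, 0.3, 0.4
c1, c2, c3 = 0, 1, 0        # pedestrian, car, motorcycle (one-hot)

y = np.array([pc, bx, by, bh, bw, c1, c2, c3])

# A pure-background image only constrains Pc; the remaining components
# are "don't care" values (shown here as zeros).
y_background = np.zeros(8)
```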

#### Target Label Y

#### Loss Function

If we use a squared-error loss:

- When Pc = 1, the network is penalized on the accuracy of all output values.
- When Pc = 0, only the accuracy of the Pc (background) prediction matters.

In practice, a better approach for object localization is to mix loss functions:

- a softmax cross-entropy loss for c1, c2, c3;
- squared error (or something similar) for the four bounding-box values;
- a logistic-regression loss for Pc (squared error also works).
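The mixed loss above can be sketched numerically. This is an illustrative composite, not the exact loss of any particular implementation; the class components of the prediction are treated as logits here:

```python
import numpy as np

def detection_loss(y_true, y_pred, eps=1e-7):
    """Composite loss for one label [Pc, bx, by, bh, bw, c1, c2, c3]."""
    pc_true, pc_pred = y_true[0], y_pred[0]
    # Logistic-regression (binary cross-entropy) loss on Pc.
    loss_pc = -(pc_true * np.log(pc_pred + eps)
                + (1 - pc_true) * np.log(1 - pc_pred + eps))
    if pc_true == 0:
        return loss_pc          # background: only Pc is penalized
    # Squared error on the four bounding-box values.
    loss_box = np.sum((y_true[1:5] - y_pred[1:5]) ** 2)
    # Softmax cross-entropy on the class scores c1, c2, c3 (as logits).
    logits = y_pred[5:8]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    loss_cls = -np.sum(y_true[5:8] * np.log(probs + eps))
    return loss_pc + loss_box + loss_cls
```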

## Prerequisites

### Sliding Windows Detection Algorithm

**Weakness:** computational cost

### Improvement: Convolutional Implementation of Sliding Windows

#### Turning FC layer into convolutional layers

#### Convolutional Implementation of Sliding Windows

### IoU (Intersection over Union)

More generally, IoU is a measure of the overlap between two bounding boxes.

### Non-max Suppression

### Anchor Box

#### Overlapping Objects

#### Anchor Box Algorithm

With two anchor boxes: each object in a training image is assigned to the grid cell that contains the object's midpoint, and, within that cell, to the anchor box with the highest IoU with the object's shape.
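The assignment rule can be sketched as follows. Anchor shapes and box values are made up; since anchors are defined only by width and height, the shape IoU aligns both boxes at a common center:

```python
# Sketch of assigning an object to a (grid cell, anchor box) pair.

def shape_iou(wh1, wh2):
    """IoU of two (width, height) shapes sharing the same center."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

anchors = [(0.1, 0.3), (0.3, 0.1)]   # one tall anchor, one wide anchor
obj_box = (0.45, 0.55, 0.12, 0.28)   # bx, by, bw, bh (relative units)
n_grid = 3

# Grid cell (row, col) containing the object's midpoint.
cell = (int(obj_box[1] * n_grid), int(obj_box[0] * n_grid))
# Anchor with the highest shape IoU against the object's (bw, bh).
best_anchor = max(range(len(anchors)),
                  key=lambda i: shape_iou(anchors[i], obj_box[2:4]))
```

The tall object is matched with the tall anchor, so both it and a wide object in the same cell could be labeled without conflict.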

#### Demo

#### Weakness

- If we use two anchor boxes but a single grid cell contains three objects, extra ad-hoc handling is needed;
- If two objects in the same grid cell match anchor boxes of the same shape, a dedicated tie-breaking mechanism is also required.

Both situations are unlikely in practice, however, and do not significantly affect the detection algorithm.

#### How to choose anchor box ?

Anchor box shapes are usually chosen by hand; picking 5-10 of them covers a wide variety of shapes and spans the objects we want to detect.

A more advanced method (the K-means algorithm) clusters the object shapes in the training set and uses the clustering result to select a set of the most representative anchor boxes for the objects we want to detect.

## YOLO

Divide the image into an n×n grid of cells.

Apply the classification-with-localization algorithm to each of the n×n cells.

For each cell, output a label of the predefined form yi = [Pc bx by bh bw c1 c2 c3]; each grid cell i has its own label vector yi.

Stacking the n×n cell labels together, the final target output Y has shape n×n×8 (8 because the target vector in this example has 8 components).

YOLO notes:

- An object is assigned to a cell by its midpoint: the object belongs to the cell containing its midpoint. Even if the object spans several cells, only the midpoint cell owns it; the other cells are labeled as containing no object, i.e. marked "0".
- YOLO outputs bounding boxes explicitly, so they can have any aspect ratio and precise coordinates, unconstrained by the stride of a sliding window.
- YOLO is a single convolutional pass: rather than running n^2 separate computations over the n×n grid, one forward pass covers everything, which makes the algorithm efficient and fast enough for real-time detection.
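The labeling scheme above can be sketched by building the n×n×8 target volume for one image (illustrative values; 8 = [Pc, bx, by, bh, bw, c1, c2, c3]):

```python
import numpy as np

# Sketch: the n x n x 8 target volume for one training image.
n = 3
Y = np.zeros((n, n, 8))

# One car (class c2) with midpoint (0.4, 0.7) and size bh=0.3, bw=0.4.
bx, by, bh, bw = 0.4, 0.7, 0.3, 0.4
row, col = int(by * n), int(bx * n)          # cell containing the midpoint
Y[row, col] = [1, bx, by, bh, bw, 0, 1, 0]   # only this cell "owns" the object
# Every other cell keeps Pc = 0 (no object).
```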

### Training Set

- Input X: complete images of the same size;
- Target Y: using a 3×3 grid, the output has shape 3×3×16 (assuming 2 anchor boxes);
- For the sub-image in each grid cell, define the target output vector y.

### Prediction

Feed in an image of the same size as in the training set; the network outputs a 3×3×16 volume, i.e. a separate prediction for every grid cell.

### Non-max Suppression

Assume 2 anchor boxes are used: for every grid cell we get 2 predicted bounding boxes, one of which typically has a higher Pc;

Discard the predicted bounding boxes with low Pc values;

For each object class (pedestrian, car, motorcycle), run the NMS algorithm separately to obtain the final predicted bounding boxes.

## YOLO Practice for Car Detection

### Problem Statement

Here’s an example of what your bounding boxes look like:

If you have 80 classes that you want YOLO to recognize, you can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. The video lectures had used the latter representation; in this notebook, we will use both representations, depending on which is more convenient for a particular step.

Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for you to use.

### YOLO Model Details

YOLO (“you only look once”) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

First things to know:

- The input is a batch of images of shape `(m, 608, 608, 3)`.
- The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers `(pc, bx, by, bh, bw, c)` as explained above. If you expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So you can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let's look in greater detail at what this encoding represents.

If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two dimensions of the `(19, 19, 5, 85)` encoding. So the output of the deep CNN is `(19, 19, 425)`.
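The flattening is just a reshape of the last two dimensions: 5 anchors × 85 numbers = 425 channels per grid cell.

```python
import numpy as np

# Flatten the per-cell (anchors, values) dimensions into one axis.
encoding = np.zeros((19, 19, 5, 85))
flat = encoding.reshape(19, 19, 5 * 85)   # -> (19, 19, 425)
```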

Now, for each box (of each cell) we will compute the following elementwise product and extract a probability that the box contains a certain class.
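For a single box this product looks as follows (3 classes instead of 80 for readability; the numbers are made up):

```python
import numpy as np

# The score of class i for one box is p_c (box confidence) times the
# conditional class probability c_i.
p_c = 0.6
class_probs = np.array([0.1, 0.7, 0.2])   # conditional class probabilities
scores = p_c * class_probs                # elementwise product
best_class = int(np.argmax(scores))       # most likely class for this box
best_score = scores[best_class]           # its probability score
```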

Here’s one way to visualize what YOLO is predicting on an image:

- For each of the 19x19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
- Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, you’ll use non-max suppression. Specifically, you’ll carry out these steps:

- Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class)
- Select only one box when several boxes overlap with each other and detect the same object.

### Filtering with a threshold on class scores

```python
# GRADED FUNCTION: yolo_filter_boxes
```
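As a sketch of what this graded function does (the notebook version operates on TensorFlow tensors; this numpy reimplementation with the same illustrative shapes conveys the logic):

```python
import numpy as np

def filter_boxes(box_confidence, boxes, box_class_probs, threshold=0.6):
    """Keep only boxes whose best class score clears the threshold.

    box_confidence: (19, 19, 5, 1), boxes: (19, 19, 5, 4),
    box_class_probs: (19, 19, 5, 80) -- shapes are illustrative.
    """
    box_scores = box_confidence * box_class_probs       # elementwise product
    box_classes = np.argmax(box_scores, axis=-1)        # best class per box
    box_class_scores = np.max(box_scores, axis=-1)      # score of that class
    mask = box_class_scores >= threshold                # keep confident boxes
    return box_class_scores[mask], boxes[mask], box_classes[mask]
```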

### Non-max suppression

Even after filtering by thresholding the class scores, you still end up with a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).

Non-max suppression relies on a very important function called “Intersection over Union”, or IoU.

Hints :

- In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.

- To calculate the area of a rectangle you need to multiply its height (y2 - y1) by its width (x2 - x1)
- You’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
  - xi1 = maximum of the x1 coordinates of the two boxes
  - yi1 = maximum of the y1 coordinates of the two boxes
  - xi2 = minimum of the x2 coordinates of the two boxes
  - yi2 = minimum of the y2 coordinates of the two boxes
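Following the hints above, IoU for corner-format boxes can be sketched as (an illustrative reimplementation, not the graded solution):

```python
def iou(box1, box2):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle.
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)  # 0 if no overlap
    # Union = area1 + area2 - intersection.
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter_area / (area1 + area2 - inter_area)
```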

```python
# GRADED FUNCTION: iou
```

You are now ready to implement non-max suppression. The key steps are:

- Select the box that has the highest score.
- Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
- Go back to step 1 and iterate until no boxes remain with a lower score than the currently selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.
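The three key steps above can be sketched in plain Python (an illustrative reimplementation with a corner-format IoU helper inlined for self-containment; the notebook's graded function uses TensorFlow's built-in NMS instead):

```python
def iou(b1, b2):
    """IoU of two (x1, y1, x2, y2) corner-format boxes."""
    xi1, yi1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    xi2, yi2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)              # highest-scoring remaining box
        keep.append(best)
        # Drop boxes that overlap the selected box too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```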

TensorFlow has two built-in functions that are used to implement non-max suppression (so you don’t actually need to use your iou() implementation):

```python
# GRADED FUNCTION: yolo_non_max_suppression
```

### Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions you’ve just implemented.

Exercise: Implement yolo_eval() which takes the output of the YOLO encoding and filters the boxes using score threshold and NMS. There’s just one last implementation detail you have to know. There are a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which we have provided):

```python
boxes = yolo_boxes_to_corners(box_xy, box_wh)
```

which converts the YOLO box coordinates (x, y, w, h) to box-corner coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes.
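The conversion itself is simple geometry; a sketch (the function name is from the notebook, but this reimplementation and its exact argument/return layout are illustrative):

```python
def boxes_to_corners(bx, by, bw, bh):
    """Convert a midpoint/size box to (x1, y1, x2, y2) corners."""
    x1, y1 = bx - bw / 2.0, by - bh / 2.0   # upper-left corner
    x2, y2 = bx + bw / 2.0, by + bh / 2.0   # lower-right corner
    return x1, y1, x2, y2
```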

```python
boxes = scale_boxes(boxes, image_shape)
```

YOLO’s network was trained to run on 608x608 images. If you are testing on images of a different size (for example, the car detection dataset has 720x1280 images), this step rescales the boxes so that they can be plotted on top of the original 720x1280 image.
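The rescaling step can be sketched as follows, assuming (this is an assumption for the sketch, not necessarily the notebook's convention) that box coordinates are normalized to [0, 1]:

```python
def scale_box(box, image_height, image_width):
    """Scale a normalized (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * image_width, y1 * image_height,
            x2 * image_width, y2 * image_height)
```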

```python
# GRADED FUNCTION: yolo_eval
```