# YOLO Object Detection

## What is Object Detection?

### Defining the target label y

• Pedestrian: 0 or 1
• Car: 0 or 1
• Motorcycle: 0 or 1
• Background: 0 or 1
• Bounding box: bx, by, bh, bw

#### Loss Function

• Pc = 1:
Here we care about the network's accuracy on all of the output values.

• Pc = 0:
Here we only care about the network's accuracy on the background value Pc itself.

• Apply a softmax cross-entropy loss to c1, c2, c3;
• Apply squared error (or something similar) to the four bounding-box values;
• Apply a logistic regression loss, or squared prediction error, to Pc.
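As a sketch, the per-cell loss described above might look like the following, assuming the 8-value label y = [Pc, bx, by, bh, bw, c1, c2, c3]; the function name and the treatment of the class values as softmax logits are illustrative assumptions, not the course's exact implementation:

```python
import numpy as np

def detection_loss(y_true, y_pred):
    """Per-cell loss sketch for y = [Pc, bx, by, bh, bw, c1, c2, c3]:
    squared error on Pc; when an object is present (Pc = 1), add
    squared error on the box and softmax cross-entropy on the classes."""
    pc_true, pc_pred = y_true[0], y_pred[0]
    loss = (pc_true - pc_pred) ** 2    # Pc term (a logistic loss also works)
    if pc_true == 1:
        loss += np.sum((y_true[1:5] - y_pred[1:5]) ** 2)   # box term
        # class term: treat the predicted c values as softmax logits
        probs = np.exp(y_pred[5:8]) / np.sum(np.exp(y_pred[5:8]))
        loss += -np.sum(y_true[5:8] * np.log(probs + 1e-9))
    return loss
```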

## Prerequisites

### Sliding Windows Detection Algorithm

Weakness: computational cost.

### IoU (Intersection over union)

More generally, IoU is a measure of the overlap between two bounding boxes.

### Anchor Box

#### Anchor Box Algorithm

With two anchor boxes: each object in a training image is assigned to the grid cell that contains the object’s midpoint, and to the anchor box for that grid cell with the highest IoU.

#### Weakness

• If we use two anchor boxes but a single cell contains three objects, we have to fall back on some extra tie-breaking mechanism;
• Likewise, if two objects in the same cell match anchor boxes of the same shape, some special handling is needed.

#### How to choose anchor box ?

• Anchor box shapes are usually chosen by hand: pick 5~10 shapes that cover the variety of shapes of the objects we want to detect;

• A more advanced method (K-means): cluster the shapes of the different objects, and use the clustering result to pick a set of the most representative anchor boxes that stand in for the shapes of the objects we want to detect.
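A minimal sketch of the K-means approach, clustering (width, height) pairs with plain Euclidean k-means; note that YOLOv2 actually clusters with a 1 − IoU distance, and the function name and deterministic initialization here are assumptions for illustration:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=10):
    """Cluster (width, height) pairs into k anchor shapes with plain
    k-means. YOLOv2 uses a 1 - IoU distance instead of Euclidean."""
    centers = wh[:k].astype(float).copy()   # simple deterministic init
    for _ in range(iters):
        # assign every box shape to its nearest center
        d = np.linalg.norm(wh[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers
```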

## YOLO

• Divide the image into an n×n grid of cells;

• Apply the image classification and localization algorithm to each of the n×n cells;

• For each cell, output a predefined label: yi = [Pc bx by bh bw c1 c2 c3]; different cells i have different label vectors yi;

• Combine the n×n cell labels: the final target output Y has size n×n×8 (8 because the label in this example has 8 target values).

Notes on YOLO:

• An object is assigned to a cell by looking at the object’s midpoint: the object goes to the cell containing its midpoint (even if the object spans several cells, it is assigned only to the midpoint’s cell; the other cells record no object, i.e. are marked “0”);
• YOLO outputs bounding boxes explicitly, so the boxes can have any aspect ratio and the coordinates can be more precise, unconstrained by the stride of a sliding-window algorithm;
• YOLO is a single convolutional pass: rather than running n^2 separate computations over the n×n grid, one convolutional forward pass covers all cells, so the algorithm is efficient, runs fast, and can do real-time detection.

### Training Set

• Input X: complete images of the same size;
• Target Y: using a 3×3 grid, the output has size 3×3×16 (assuming 2 anchor boxes, each carrying 8 values);
• For the sub-image in each grid cell, define the target output vector y.
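Putting the assignment rules together, building the 3×3×16 target for one training image might be sketched like this; the helper names are hypothetical, and the anchor matching uses a shape-only IoU (both boxes centered on the same point):

```python
import numpy as np

def shape_iou(wh1, wh2):
    """IoU of two boxes centered on the same point (shape-only IoU)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

def build_target(objects, anchors, n=3):
    """objects: list of (bx, by, bh, bw, class_id), coordinates in [0, 1).
    anchors: list of (width, height). Returns Y of shape (n, n, len(anchors)*8)."""
    Y = np.zeros((n, n, len(anchors) * 8))
    for bx, by, bh, bw, cls in objects:
        row, col = int(by * n), int(bx * n)   # grid cell holding the midpoint
        best = max(range(len(anchors)),
                   key=lambda a: shape_iou((bw, bh), anchors[a]))  # best anchor
        s = best * 8
        Y[row, col, s] = 1                          # Pc
        Y[row, col, s + 1:s + 5] = [bx, by, bh, bw] # box values
        Y[row, col, s + 5 + cls] = 1                # one-hot class (c1..c3)
    return Y
```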

## YOLO Practice for Car Detection

### Problem Statement

Here’s an example of what your bounding boxes look like:

If you have 80 classes that you want YOLO to recognize, you can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. The video lectures had used the latter representation; in this notebook, we will use both representations, depending on which is more convenient for a particular step.

Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for you to use.

### YOLO Model Details

YOLO (“you only look once”) is a popular algorithm because it achieves high accuracy while also being able to run in real time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

First things to know:

• The input is a batch of images of shape (m, 608, 608, 3)
• The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If you expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So you can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19×19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two dimensions of the shape (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

Now, for each box (of each cell) we will compute the following elementwise product and extract a probability that the box contains a certain class.
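For a single box, that elementwise product can be sketched like this (the numbers are made up for illustration):

```python
import numpy as np

pc = 0.6                      # box confidence: P(some object in this box)
class_probs = np.zeros(80)    # conditional class probabilities c1..c80
class_probs[2] = 0.7          # e.g. P(class 3 | object) = 0.7
class_probs[4] = 0.3

scores = pc * class_probs             # elementwise product -> class scores
best_class = int(np.argmax(scores))   # index of the most likely class
best_score = scores[best_class]       # 0.6 * 0.7 = 0.42
```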

Here’s one way to visualize what YOLO is predicting on an image:

• For each of the 19x19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
• Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, you’ll use non-max suppression. Specifically, you’ll carry out these steps:

• Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class)
• Select only one box when several boxes overlap with each other and detect the same object.
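The first of these steps, thresholding on the class scores, might be sketched as follows; this is a NumPy stand-in for the notebook's yolo_filter_boxes, with shapes following the (19, 19, 5, …) encoding, though the code works for any grid:

```python
import numpy as np

def filter_boxes(box_confidence, boxes, box_class_probs, threshold=0.6):
    """box_confidence: (..., 1), boxes: (..., 4), box_class_probs: (..., C).
    Keeps only boxes whose best class score exceeds the threshold."""
    scores = box_confidence * box_class_probs      # (..., C) class scores
    classes = scores.argmax(axis=-1)               # best class per box
    class_scores = scores.max(axis=-1)             # score of that class
    mask = class_scores > threshold                # boolean filter
    return class_scores[mask], boxes[mask], classes[mask]
```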

### Non-max suppression

Even after filtering by thresholding over the class scores, you still end up with a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

Hints:

• In this exercise only, we define a box by its two corners (upper left and lower right): (x1, y1, x2, y2), rather than by its midpoint and height/width.
• To calculate the area of a rectangle, multiply its height (y2 - y1) by its width (x2 - x1).
• You’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
  • xi1 = maximum of the x1 coordinates of the two boxes
  • yi1 = maximum of the y1 coordinates of the two boxes
  • xi2 = minimum of the x2 coordinates of the two boxes
  • yi2 = minimum of the y2 coordinates of the two boxes
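Following the hints above, a straightforward iou() sketch:

```python
def iou(box1, box2):
    """IoU of two boxes given as corners (x1, y1, x2, y2)."""
    # corners of the intersection rectangle, per the hints above
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)   # 0 if boxes don't overlap
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)          # union = a1 + a2 - inter
```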

You are now ready to implement non-max suppression. The key steps are:

1. Select the box that has the highest score.
2. Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
3. Go back to step 1 and iterate until there are no more boxes with a lower score than the currently selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.
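The three steps above can be sketched in plain Python, with boxes as (x1, y1, x2, y2) corners; the function name is illustrative:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression; returns the indices of kept boxes."""
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # 1. highest-scoring remaining box
        keep.append(best)
        # 2. drop boxes that overlap it more than iou_threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # 3. repeat with what is left
    return keep
```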

TensorFlow has two built-in functions that are used to implement non-max suppression (so you don’t actually need to use your iou() implementation):
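Those functions are tf.image.non_max_suppression, which returns the indices of the boxes to keep, and tf.gather, which selects them. A minimal usage sketch, assuming TensorFlow 2.x eager execution; note that tf.image.non_max_suppression expects corners ordered (y1, x1, y2, x2):

```python
import tensorflow as tf

# Boxes as corners (y1, x1, y2, x2), with matching per-box scores.
boxes = tf.constant([[0.0, 0.0, 2.0, 2.0],
                     [0.1, 0.1, 2.1, 2.1],   # heavy overlap with the first box
                     [5.0, 5.0, 6.0, 6.0]])
scores = tf.constant([0.9, 0.8, 0.7])

# Indices of the boxes that survive non-max suppression
idx = tf.image.non_max_suppression(boxes, scores,
                                   max_output_size=10, iou_threshold=0.5)
kept_boxes = tf.gather(boxes, idx)    # gather the surviving boxes
kept_scores = tf.gather(scores, idx)
```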

### Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions you’ve just implemented.

Exercise: Implement yolo_eval(), which takes the output of the YOLO encoding and filters the boxes using the score threshold and NMS. There’s just one last implementation detail you have to know. There are a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which we have provided):

One of these functions converts the YOLO box coordinates (x, y, w, h) to the box corners’ coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes.

YOLO’s network was trained to run on 608x608 images. If you are testing this data on an image of a different size (for example, the car detection dataset had 720x1280 images), this step rescales the boxes so that they can be plotted on top of the original 720x1280 image.
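That rescaling step can be sketched as follows, assuming box corners normalized to [0, 1]; this is a simplified stand-in for the provided scale_boxes helper:

```python
import numpy as np

def scale_boxes(boxes, image_h, image_w):
    """Rescale corner boxes (y1, x1, y2, x2) from [0, 1] coordinates
    to the pixel frame of an image_h x image_w image."""
    return boxes * np.array([image_h, image_w, image_h, image_w])
```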