YOLOv3: An Incremental Improvement

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [10] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little.

Actually, that’s what brings us here today. We have a camera-ready deadline and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT!

The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.

2. The Deal

So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.

2.1 Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [13]. The network predicts 4 coordinates for each bounding box: $t_x$, $t_y$, $t_w$, $t_h$. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the predictions correspond to:

$$b_x = \sigma(t_x) + c_x$$

$$b_y = \sigma(t_y) + c_y$$

$$b_w = p_w e^{t_w}$$

$$b_h = p_h e^{t_h}$$

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$ our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* - t_*$. This ground truth value can be easily computed by inverting the equations above.
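
To make the mapping concrete, here is a minimal Python sketch of the decode step and its inverse (the helper names and per-grid-cell coordinate convention are ours, not the paper’s):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw network outputs to a box, per the equations above."""
    bx = sigmoid(tx) + cx      # center x, in grid-cell units
    by = sigmoid(ty) + cy      # center y, in grid-cell units
    bw = pw * math.exp(tw)     # width, scaled from the prior
    bh = ph * math.exp(th)     # height, scaled from the prior
    return bx, by, bw, bh

def encode_box(bx, by, bw, bh, cx, cy, pw, ph):
    """Invert the equations to get the regression targets from a ground-truth box."""
    logit = lambda p: math.log(p / (1.0 - p))
    return (logit(bx - cx), logit(by - cy),
            math.log(bw / pw), math.log(bh / ph))
```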

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [15]. We use the threshold of .5. Unlike [15] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.
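
A sketch of that assignment rule, assuming an IOU matrix between priors and ground-truth boxes has already been computed (the function and label encoding are our own):

```python
import numpy as np

def assign_objectness(iou, ignore_thresh=0.5):
    """iou: [num_priors, num_gt] IOU of each box prior with each ground truth.
    Returns per-prior labels: 1 = positive, -1 = ignored, 0 = negative."""
    labels = np.zeros(iou.shape[0], dtype=int)
    labels[iou.max(axis=1) > ignore_thresh] = -1  # overlaps a gt above threshold: ignore
    labels[iou.argmax(axis=0)] = 1                # best prior for each gt: positive
    return labels
```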

2.2 Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.
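
A minimal sketch of the class loss, assuming raw per-class logits and 0/1 multilabel targets (names are ours):

```python
import numpy as np

def multilabel_class_loss(logits, targets):
    """Independent logistic outputs + binary cross-entropy: one sigmoid per
    class instead of a softmax across classes. targets is 0/1 per class."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    return -np.sum(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
```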

This formulation helps when we move to more complex domains like the Open Images Dataset [5]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.

2.3 Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [6]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [8] we predict 3 boxes at each scale so the tensor is $N \times N \times [3 \cdot (4 + 1 + 80)]$ for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
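
The channel arithmetic for COCO, with an illustrative 13×13 grid (the grid size is our example, not fixed by the paper):

```python
num_anchors, num_box, num_obj, num_classes = 3, 4, 1, 80
channels = num_anchors * (num_box + num_obj + num_classes)  # 3 * 85 = 255
# e.g. a 13x13 output map has shape [13, 13, 255], which can be viewed as
# [13, 13, 3, 85]: per anchor, 4 box offsets + 1 objectness + 80 class scores.
```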

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.
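
A rough sketch of the merge, assuming concatenation along the channel axis and made-up feature-map shapes:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsample of an [H, W, C] feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

coarse = np.zeros((13, 13, 512))    # hypothetical deep feature map
earlier = np.zeros((26, 26, 256))   # hypothetical earlier, finer feature map
merged = np.concatenate([upsample2x(coarse), earlier], axis=-1)  # (26, 26, 768)
```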

We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
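
Dividing those clusters evenly across the 3 scales might look like this; assigning the smallest priors to the finest scale is our assumption about the ordering:

```python
priors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
          (59, 119), (116, 90), (156, 198), (373, 326)]
# Three priors per scale, smallest on the finest grid (assumed ordering):
per_scale = {"fine": priors[0:3], "medium": priors[3:6], "coarse": priors[6:9]}
```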

2.4 Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!
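
As a rough illustration of that pattern, here is a PyTorch-style sketch of one residual block with a 1×1 then 3×3 convolution and a shortcut connection; the actual filter counts, activation (we assume leaky ReLU as in earlier Darknet models), and stage layout come from the Darknet-53 config, not this snippet:

```python
import torch.nn as nn

class DarkBlock(nn.Module):
    """One Darknet-53-style residual block: 1x1 bottleneck, 3x3 conv, shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels // 2)
        self.conv2 = nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        y = self.act(self.bn1(self.conv1(x)))
        y = self.act(self.bn2(self.conv2(y)))
        return x + y  # shortcut connection
```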

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

Each network is trained with identical settings and tested at 256×256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.

Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.

2.5 Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [12].
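
A sketch of the multi-scale training piece, assuming the YOLOv2-style recipe of re-sampling the input resolution every few batches from multiples of the 32-pixel network stride (the range and interval are assumptions):

```python
import random

def multiscale_sizes(num_batches, every=10, lo=320, hi=608, stride=32):
    """Yield one square input resolution per batch, re-picked every
    `every` batches from multiples of the network stride."""
    size = 416
    for step in range(num_batches):
        if step % every == 0:
            size = random.randrange(lo, hi + stride, stride)
        yield size
```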

3. How We Do

YOLOv3 is pretty good! See table 3. In terms of COCO’s weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.

However, when we look at the “old” detection metric of mAP at IOU= .5 (or AP50 in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases indicating YOLOv3 struggles to get the boxes perfectly aligned with the object.

In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high $AP_S$ performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.

When we plot accuracy vs speed on the AP50 metric (see figure 3) we see YOLOv3 has significant benefits over other detection systems. Namely, it’s faster and better.

4. Things We Tried That Didn't Work

We tried lots of stuff while we were working on YOLOv3. A lot of it didn’t work. Here’s the stuff we can remember.

Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn’t work very well.
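
For contrast with the logistic formulation in section 2.1, the rejected variant looked roughly like this (a sketch of the Faster R-CNN-style parameterization):

```python
def decode_unconstrained(tx, ty, cx, cy, pw, ph):
    """Rejected variant (sketch): linear x/y offsets scaled by the prior's
    size, so the predicted center is not confined to its grid cell."""
    return cx + tx * pw, cy + ty * ph
```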

Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offset instead of the logistic activation. This led to a couple point drop in mAP.

Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren’t totally sure.
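
For reference, a per-example sketch of the standard focal loss we tried (γ = 2 is our assumed setting; it is not stated here):

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss on one logistic output p for binary label y:
    down-weights easy examples by (1 - p_t)^gamma."""
    pt = p if y == 1 else 1.0 - p
    return -((1.0 - pt) ** gamma) * math.log(max(pt, 1e-7))
```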

Dual IOU thresholds and truth assignment. Faster RCNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is a positive example; in [.3–.7] it is ignored; less than .3 with all ground truth objects makes it a negative example. We tried a similar strategy but couldn’t get good results.
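
The dual-threshold rule, as a one-box sketch:

```python
def dual_threshold_label(iou, lo=0.3, hi=0.7):
    """Faster R-CNN-style assignment: >= .7 positive, [.3, .7) ignored, < .3 negative."""
    if iou >= hi:
        return 1    # positive example
    if iou >= lo:
        return -1   # ignored
    return 0        # negative example
```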

We quite like our current formulation, it seems to be at a local optima at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.

5. What This All Means

YOLOv3 is a good detector. It’s fast, it’s accurate. It’s not as great on the COCO average AP between .5 and .95 IOU metric. But it’s very good on the old detection metric of .5 IOU.

Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: “A full discussion of evaluation metrics will be added once the evaluation server is complete”. Russakovsky et al. report that humans have a hard time distinguishing an IOU of .3 from .5! “Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult.” [16] If humans have a hard time telling the difference, how much does it matter?

為什么我們交換了域?原來的COCO說了句有含義的話:一旦評估服務完成,就會增加評估域的完全討論。Russakovsky在很難區分IOU.3到.5的報告說。訓練人類這樣做就很難。如果人都很難區分,又有什么意義呢?

But maybe a better question is: “What are we going to do with these detectors now that we have them?” A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to.... wait, you’re saying that’s exactly what it will be used for?? Oh.

Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait.....

I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [11], or tracking their cat as it wanders around their house [17]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much. In closing, do not @ me. (Because I finally quit Twitter).
