YOLO is all you need in object detection

5 minute read

The title is obviously inspired by a famous paper, "Attention Is All You Need". I'll talk about how YOLO was used in a real project.

The description of the task

In a medical research project, real X-ray images need to be processed to help judge whether hydrarthrosis of the knee has occurred.

Here we focus on only one part of the project: detecting and cropping the knees in the images.

The mission seemed quite simple, because state-of-the-art object detection methods perform well even on videos, not to mention still images. What's more, the features of the knee are so distinctive that even traditional image processing methods (no deep neural network) might work. At the very beginning, we thought so too.

Talk is cheap, show me the code. The difficulties that came along were beyond expectation.

Difficulties ahead

  1. The number of images was too small. In a Chinese hospital, images are organized at the person level, which means each person has a directory containing all of his or her images, such as X-ray images, MRI scans, and so on. Therefore, selecting X-ray images takes the doctors a long time, and as you know, Chinese doctors are busy with all kinds of work. So only 20 or so images were available at first. As we all know, a deep learning model needs to be fed large quantities of images for training.

  2. The quality of the images is bad. The X-ray shots were taken casually, so few of them are standard. The images below show some real images from a hospital.

  3. An image may include more than one leg😤. An experienced doctor is competent to make a diagnosis whatever the image looks like. For a machine, however, it matters a lot.

The first method I tried

I tried traditional methods while waiting for more images. Motivated by an article [1], I first tried a pixel-level template matching method.

The idea is intuitive: given a template, say, a knee from a standard image, find the best-matching part in a test image. This method is meant for finding a part that is exactly the same as the template. The knees in our images, however, differ a lot.
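To make the idea concrete, below is a minimal sketch of pixel-level template matching with OpenCV. The file names are placeholders, and the normalized cross-correlation score is my assumption; the article [1] may have used a different variant.

```python
import cv2

# Load the test X-ray and the knee template in grayscale.
# (File names are illustrative, not from the original project.)
image = cv2.imread("xray_test.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("knee_template.png", cv2.IMREAD_GRAYSCALE)
h, w = template.shape

# Slide the template over the image and score every position;
# normalized cross-correlation is fairly robust to brightness changes.
result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

# Take the best-scoring region as the knee's bounding box and crop it.
x, y = max_loc
knee = image[y:y + h, x:x + w]
cv2.imwrite("knee_crop.png", knee)
print(f"match score: {max_val:.3f}, box: ({x}, {y}, {w}, {h})")
```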

Anyway, since no more images were available, I gave this method a try.

The result was fairly good: nearly half of the images were successfully detected. But this method failed on images with multiple legs. In addition, pixel-level features depend heavily on the template image, which means an image with a similar pixel distribution will get a good result while the others may be disappointing. To sum up, it is not a general method.

The second method I tried

With no more images coming, I came up with another traditional method. I noticed that the joint gap (the "crack") of a knee is a distinctive feature that makes the knee stand out. If we can find the crack, we can approximately draw a box over it, which gives the knee's position. Finding features like the crack is called edge detection, which is broadly used in applications like finding the barcode on a product.

My processing steps (a code sketch follows the list):

  1. Gaussian blur
  2. Canny edge detection
  3. Dilation
  4. Close operation with a cross-shaped structuring element
  5. Find the biggest connected region, which is supposed to be the knee
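Here is a sketch of how these steps might look with OpenCV; the thresholds, kernel sizes, and iteration counts are illustrative guesses, not the fine-grained configuration the project ended up with.

```python
import cv2
import numpy as np

image = cv2.imread("xray_test.png", cv2.IMREAD_GRAYSCALE)

# 1. Gaussian blur suppresses noise before edge detection.
blurred = cv2.GaussianBlur(image, (5, 5), 0)

# 2. Canny extracts edges; the joint gap shows up as strong edges.
edges = cv2.Canny(blurred, 50, 150)

# 3. Dilation thickens edges so broken segments join up.
dilated = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=2)

# 4. Morphological closing with a cross-shaped element fills small gaps.
cross = cv2.getStructuringElement(cv2.MORPH_CROSS, (7, 7))
closed = cv2.morphologyEx(dilated, cv2.MORPH_CLOSE, cross)

# 5. Keep the largest connected region, assumed to be the knee.
contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
largest = max(contours, key=cv2.contourArea)
x, y, w, h = cv2.boundingRect(largest)
cv2.imwrite("knee_crop.png", image[y:y + h, x:x + w])
```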

Tuning the parameters is an art. Is Gaussian blur needed or not? Canny or Sobel? Dilate or erode? How many iterations when dilating or eroding? Close or open operation? What about the shape of the structuring element? Trying different combinations eventually leads to a fine-grained configuration.

The final results were also pretty good, but not good enough for my project.

The final try: YOLO

Thank goodness, 280 more images were provided this month. Although the number is below our expectations, 280 is better than none. We then followed a guide [2] and tried YOLO.

YOLO (You Only Look Once) is a state-of-the-art object detection model. I was not sure whether 280 images were enough for training.

The process is summarized as follows:

  1. Label the data.
  2. Create a virtual environment using Conda or pyenv, and install packages like tensorflow-gpu with pip.
  3. Connect to the remote machine over SSH.
  4. Use SFTP to upload files and keep them in sync.
  5. Train the model on a GPU.
  6. Test.

LabelImg was used to label the new class. The training took two hours on a 1080 Ti GPU. Many bugs were encountered and fixed. The result was amazing.
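As a rough illustration of the test step, here is a hedged sketch of running a trained YOLO model through OpenCV's DNN module. The .cfg/.weights file names are placeholders, and the project itself used a TensorFlow-based implementation from the guide [2], so this only shows the shape of the inference logic.

```python
import cv2
import numpy as np

# Placeholder file names; the real project used a TensorFlow port.
net = cv2.dnn.readNetFromDarknet("yolo-knee.cfg", "yolo-knee.weights")
layer_names = net.getUnconnectedOutLayersNames()

image = cv2.imread("xray_test.png")
h, w = image.shape[:2]

# YOLO expects a square, normalized input blob.
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True)
net.setInput(blob)
outputs = net.forward(layer_names)

# Each detection row is [cx, cy, bw, bh, objectness, class scores...],
# with coordinates normalized to the image size.
for output in outputs:
    for det in output:
        if det[4] > 0.5:  # objectness threshold
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            x, y = int(cx - bw / 2), int(cy - bh / 2)
            print(f"knee at ({x}, {y}, {int(bw)}, {int(bh)}), "
                  f"confidence {det[4]:.2f}")
```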

Some hyper-parameters are listed below.

| Hyper-parameter | Value |
| --------------- | ----- |
| Batch size      | 5     |
| Learning rate   | 0.001 |
| Epochs          | 100   |
| Momentum        | 0.9   |
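For reference, here is a minimal, hedged sketch of how these hyper-parameters map onto a Keras training call. The tiny model and the random data are stand-ins, not the project's YOLO network; only the optimizer settings, batch size, and epoch count mirror the table.

```python
import numpy as np
import tensorflow as tf

# A throwaway model standing in for the real YOLO network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Learning rate 0.001 and momentum 0.9, as in the table above.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="binary_crossentropy",
)

# Random stand-in data; batch size 5 and 100 epochs match the table.
x = np.random.rand(20, 4).astype("float32")
y = np.random.randint(0, 2, size=(20, 1)).astype("float32")
model.fit(x, y, batch_size=5, epochs=100, verbose=0)
```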

Lovely bugs:

  1. The path separator on Linux is '/' while on Windows it is '\'. One of my local machines runs Windows, so a conflict occurred (see the pathlib sketch after this list).

  2. After training, the trained weights file has to be put in the right directory before testing.

  3. Out-of-memory errors. Just reduce the batch size.

  4. The guide is not detailed; a deep understanding of how the code works is needed.
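For bug 1, a minimal sketch of the usual fix: build paths with pathlib (or os.path.join) instead of hard-coding a separator, so the same code runs on both Linux and Windows. The file name here is illustrative.

```python
from pathlib import Path

# pathlib picks the right separator for the current OS.
weights_file = Path("model") / "weights" / "yolo_final.h5"
print(weights_file)
# -> model/weights/yolo_final.h5 on Linux
# -> model\weights\yolo_final.h5 on Windows
```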

References:

[1] https://club.6parker.com/enter1/index.php?app=forum&act=threadview&tid=14301264

[2] https://www.e-learn.cn/content/python/724590