## I whale do better next time ;-)

In a recent Kaggle competition, the goal was to develop face recognition software for right whales. Right whales are critically endangered and only a few hundred individuals exist. The task was to tell which individual is seen on aerial images like the one below. Such automatic whale identification helps marine biologists in their conservation efforts.

In this post, I describe the methods I developed in Python for the competition. I placed 72nd out of 364 teams (top 20%) and I was quite happy with my achievement, but see the title! This was my first real-world machine learning project (i.e., not a lecture assignment with a clean, already prepared data set). I had very little prior experience in image processing and machine learning, and I worked mostly alone. This post is my way to bring the competition to closure and to reflect on what I learnt and what I would do differently next time in a similar competition or project.

### Some background and motivation

Right whales are critically endangered. They are tame and friendly creatures that do not swim away from ships, which made them easy targets for whale hunters in the past. Conservation efforts aim to track and monitor the population, assess their health, and rescue them if they get entangled in fishing nets.

Whales are manually identified by marine biologists in aerial images, which is time consuming, tedious, and cannot be done during the aerial surveys. If marine biologists could identify whales quickly and automatically during the aerial surveys, they would have immediate, on-site access to the whale’s historical migration, health, and entanglement record, which can be life-saving. Also, it would free up a significant amount of research time to be spent on higher level, more important tasks.

### The data

We were given ~11,000 whale images, of which ~4,000 were labeled and served as the training set (check out the data here). There are ~450 different individual right whales in the training set, so on average we got about 10 pictures per whale (some whales had more than 10 pictures, some had only one). Each image contains exactly one whale face. The images vary in quality, resolution, and illumination. The viewing angle is random: whales can be seen from above, from the side, from the front, or from the back. The whale’s head can be partially obscured for several reasons: the whale can be partially or fully under water, so its face is seen through wavy water; the head can be above water but partly hidden by the surrounding waves; the cloud of spray from the whale’s blow can partially hide the head; and bright speckles of light can make the whale difficult to see.

### Evaluation

The evaluation score is determined based on the multi-class logarithmic loss metric:

$\displaystyle logloss = -\frac{1}{N} \sum\limits_{i=1}^{N} \sum\limits_{j=1}^{M}y_{ij}\log (p_{ij})$,

where $\displaystyle N$ is the number of images in the test set (~7,000), $\displaystyle M$ is the number of whales (~450), $\displaystyle y_{ij}$ is unity if image $\displaystyle i$ belongs to whale $\displaystyle j$ and zero otherwise, $\displaystyle \log$ is the natural logarithm, and $\displaystyle p_{ij}$ is my predicted probability that image $\displaystyle i$ belongs to whale $\displaystyle j$. To avoid the extremes of the log function, the predicted probabilities are replaced with $\displaystyle \max(\min(p,1-10^{-15}),10^{-15})$.  Let’s spend some time to get to understand the formula.
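To make the formula concrete, here is a minimal pure-Python sketch of the metric. Because $\displaystyle y_{ij}$ is one only for the true whale, the inner sum reduces to the log of the probability assigned to the true class. The function name and the toy inputs are illustrative assumptions, not part of the competition code.

```python
import math

EPS = 1e-15  # clipping bound quoted in the text


def multiclass_logloss(y_true, probs, eps=EPS):
    """Mean negative log-probability assigned to the true class.

    y_true: list of true class indices, one per image.
    probs:  list of per-image probability lists (one entry per class).
    """
    total = 0.0
    for true_class, row in zip(y_true, probs):
        # Clip to [eps, 1 - eps] so that log(0) never occurs.
        p = max(min(row[true_class], 1.0 - eps), eps)
        total += math.log(p)
    return -total / len(y_true)


# Toy example with 2 classes and 2 images (made-up numbers).
# Confidently wrong on every image: the true class always gets p = 0.
worst = multiclass_logloss([0, 1], [[0.0, 1.0], [1.0, 0.0]])
# Uniform guess over 2 classes: the score is log(2).
uniform = multiclass_logloss([0, 1], [[0.5, 0.5], [0.5, 0.5]])
print(round(worst, 2), round(uniform, 3))
```

The first value equals $\displaystyle -\log(10^{-15})$, which is the penalty paid for every image whose true class was given zero probability.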

The worst-case scenario is that your classifier chooses one class for each image, assigns $\displaystyle p=1$ for that class, $\displaystyle p=0$ for the rest, and you incorrectly classify all the images. In that case, $\displaystyle p=0$ gets replaced by $\displaystyle 10^{-15}$. As $\displaystyle \log(10^{-15}) \approx -34.5$, $\displaystyle logloss \approx 34.5$, which is the sample submission benchmark on the leaderboard; roughly 100 teams out of the 360 have this score. Even if you correctly classify some images, your score will be heavily penalized for each misclassified image with $\displaystyle p=0$.

In the next scenario, I assign equal probabilities to each class on all the images: $\displaystyle p_{ij} = \frac{1}{M}$. Then my score becomes 6.10 (because $\displaystyle \log(1/M) \approx -6.1$ for $\displaystyle M \approx 450$), which is a significant improvement and puts me in the top 50%.

Finally, let’s assume that $\displaystyle p_{ij} = \frac{n_j}{\sum_{k=1}^{M} n_k}$, where $\displaystyle n_j$ is the number of times whale $\displaystyle j$ appears in the training set and the denominator is the total number of training images, so that the probabilities sum to one. In other words, we assume that whale $\displaystyle j$ appears as frequently in the test set as it does in the training set. Kaggle did not suggest that this assumption is valid. However, when I submitted such a solution, my score was 5.98, which landed me in the top 35% without any machine learning!
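The two baselines can be compared on a toy example. The sketch below uses made-up image counts for three hypothetical whales and normalizes the training counts so that they sum to one. It assumes the test labels follow the training frequencies, which is exactly the assumption discussed above.

```python
import math

# Hypothetical image counts per whale in the training set (made-up numbers).
train_counts = {"whale_a": 6, "whale_b": 3, "whale_c": 1}
n_train = sum(train_counts.values())
n_classes = len(train_counts)

# Baseline 1: the same probability for every whale.
uniform = {w: 1.0 / n_classes for w in train_counts}
# Baseline 2: probability proportional to the training-set count.
prior = {w: n / n_train for w, n in train_counts.items()}


def expected_logloss(pred, true_freq):
    """Expected log loss when every image gets the same prediction `pred`
    and the true labels occur with frequencies `true_freq`."""
    return -sum(true_freq[w] * math.log(pred[w]) for w in pred)


uniform_loss = expected_logloss(uniform, prior)  # equals log(n_classes)
prior_loss = expected_logloss(prior, prior)      # equals the entropy of the prior
print(uniform_loss, prior_loss)
```

Whenever the counts are unequal, the frequency-based prediction scores better than the uniform one under this assumption, which is consistent with the improvement from 6.10 to 5.98.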

The take-away message is that it is important to understand how your solution is evaluated on Kaggle. The rest of this post describes my effort to improve the score. My final score was 5.556 (a relatively small improvement over 5.98), which landed me in the top 20%.

### General considerations

Humans identify the whales based on the callosity pattern: white patches of hardened skin on the whale’s head (see the image below). This pattern is unique to each whale and it remains more or less unchanged during the whale’s life.

Different callosity patterns that can be seen on a whale’s head.

My approach (and I think the approach of most teams as well) was to perform detection to locate the whale’s head on the image, followed by classification to identify which face is seen on the image. In the next four sections, I describe failed and successful attempts at detection and classification.

### Failed attempts in detection

I think it is important to talk about failed attempts because one can learn a lot by understanding why an approach fails and another succeeds. I had two failed approaches for face detection: brute force template matching of an averaged whale face, and a tile-based approach to identify the most likely location of the whale’s head on the image.

In brute force template matching, I slide a template through the image to find the most similar region. To generate the template, I labeled a few hundred images by hand by clicking on the locations of the blow hole and the top of the nose. Then, I bounded the head with a square along the nose-blow hole axis, aligned and resized the squares, normalized the intensities to reduce differences in illumination, and finally calculated the average of all labeled whale faces. This is what I got:

The image shows that I was quite accurate with my clicks. The blow hole is clearly visible at the bottom of the picture, and the central white region is the average callosity pattern.

It’s important to consider that the orientation and the size of the whale’s head are random on the image. Therefore, I need to slide different rotated versions of the template through the image (I used a 5º step size for rotation), and I need to consider different size ratios between the image and the template (I used 100 different size scales). To speed the computations up, it is best to keep the template’s size fixed and downscale the image.
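The sliding step can be sketched with NumPy as follows. This is a simplified illustration, not the competition code: the similarity measure (normalized cross-correlation), the function names, and the crude downscaling are assumptions, the data are random numbers with a planted template, and the loop over rotation angles is omitted.

```python
import numpy as np


def match_template(image, template):
    """Slide `template` over `image`; return (best_score, (row, col)).

    The score is the normalized cross-correlation, in [-1, 1].
    """
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = t_norm * np.sqrt((p ** 2).sum())
            if denom == 0:
                continue  # flat patch: correlation undefined, skip it
            score = float((p * t).sum() / denom)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos


def downscale(image, factor):
    """Crude downscaling by keeping every `factor`-th pixel (illustration only)."""
    return image[::factor, ::factor]


rng = np.random.default_rng(0)
template = rng.random((8, 8))
image = rng.random((40, 40))
image[10:18, 22:30] = template  # plant the template at a known spot

# Keep the template fixed and try a few image scales.
results = {f: match_template(downscale(image, f), template) for f in (1, 2)}
best_factor = max(results, key=lambda f: results[f][0])
print(best_factor, results[best_factor])
```

The double loop makes the cost obvious: every extra rotation angle or scale multiplies the number of patch comparisons, which is why downscaling the image rather than upscaling the template pays off.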

The problem with template matching is that the whales can look quite different from the template, especially if there are waves around the head. Often waves, spray, or speckles looked most similar to the template. Template matching in general works great if you have a good template (for example, you need to find a logo on the image). But it can perform poorly if the template is blurry or ill-defined, as is the case of the whale’s face.

The tile-based approach was my idea for whale detection. It consists of two steps: first, the whale’s body is distinguished from the ocean and waves, then in a second step, the whale’s head is distinguished from its body.

In the first step (distinguishing whale from ocean), I select random regions/tiles of 300×300 pixels from random images and label them based on whether they show a whale or water. Tiles of 300×300 pixels were necessary for me to accurately classify them. If the tile was smaller, I had quite a few cases where I couldn’t tell what was on the tile. Once the tiles were collected, I used principal component analysis (PCA) to reduce the number of dimensions from $300^2$ pixels to a few hundred principal components by retaining 90% of the variance. Then I gave this training set to a support vector machine (SVM).
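The dimensionality reduction step can be illustrated with a short NumPy sketch that keeps the fewest principal components reaching a target fraction of the variance. The function name `pca_reduce` and the small random "tiles" are assumptions made for illustration; the reduced features would then be passed to a classifier such as an SVM.

```python
import numpy as np


def pca_reduce(X, variance_to_keep=0.90):
    """Project rows of X onto the fewest principal components whose
    cumulative explained variance reaches `variance_to_keep`."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions; squared singular values
    # are proportional to the variance explained by each direction.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    n_components = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    n_components = min(n_components, len(S))
    components = Vt[:n_components]
    return Xc @ components.T, components, mean


rng = np.random.default_rng(0)
# 50 made-up "tiles" of 16x16 pixels, flattened; the real tiles were 300x300.
latent = rng.normal(size=(50, 5))
X = latent @ rng.normal(size=(5, 256)) + 0.01 * rng.normal(size=(50, 256))

reduced, components, mean = pca_reduce(X, variance_to_keep=0.90)
print(X.shape, "->", reduced.shape)
```

New tiles must be projected with the same `mean` and `components` that were computed on the training tiles, otherwise the classifier sees features in a different coordinate system.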

Once the SVM is trained, I produce a heat map of the image to find the most likely location of the whale:

I use the SVM to predict whether partially overlapping tiles on the image contain a whale (1) or water (0). I sum the predictions of the SVM to produce the heat map seen on the right, and then fit a rectangle around the blob to cut the whale.
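A minimal sketch of the heat map construction is given below. The trained SVM is replaced by a stand-in `classify_tile` function (an assumption for illustration), and the tile size, stride, and toy image are made-up values chosen to keep the example small.

```python
import numpy as np


def heat_map(image, classify_tile, tile=300, stride=100):
    """Sum binary tile predictions over partially overlapping tiles."""
    heat = np.zeros(image.shape[:2], dtype=int)
    for r in range(0, image.shape[0] - tile + 1, stride):
        for c in range(0, image.shape[1] - tile + 1, stride):
            # classify_tile returns True (whale) or False (water).
            if classify_tile(image[r:r + tile, c:c + tile]):
                heat[r:r + tile, c:c + tile] += 1
    return heat


def bounding_box(heat, threshold=1):
    """Smallest rectangle (top, bottom, left, right) with heat >= threshold."""
    rows, cols = np.nonzero(heat >= threshold)
    if rows.size == 0:
        return None
    return int(rows.min()), int(rows.max()) + 1, int(cols.min()), int(cols.max()) + 1


# Toy image: dark "water" with one bright block standing in for the whale.
image = np.zeros((60, 60))
image[20:40, 30:50] = 1.0

# Stand-in for the trained SVM: call a tile "whale" if it is mostly bright.
heat = heat_map(image, lambda t: t.mean() > 0.5, tile=10, stride=5)
box = bounding_box(heat, threshold=1)
print(box, heat.max())
```

Raising `threshold` keeps only the pixels on which several overlapping tiles agree, which trades a tighter box against the risk of cutting off part of the whale.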

The whale detection is very accurate: the whale’s body is successfully found on more than 95% of the images.

In the second step (distinguishing the whale’s head from its body), I follow a similar approach. I take the rectangles that contain the body, label random tiles on random images based on whether the tile shows the whale’s head or the rest of the body, train an SVM, and produce heat maps.

Unfortunately, the second step was unsuccessful; I couldn’t distinguish the head from the body:

The most likely reason is that the whale’s head is quite similar to the whale’s body, so the computer didn’t find any good features to distinguish the two.

### What worked – LBP cascades

The face detection method I finally settled on is the Viola-Jones object detection framework. This method is used in smartphones to detect and track faces in real time. It is fast, invariant to scale and location, and generic: it can be trained to detect any object. The disadvantages are that the method is not rotation invariant, and it can return multiple overlapping detections on the same face. I explain below how I address these two disadvantages.

For more details on this object detection method, check out this page on Wikipedia, and these two pages of the Open Source Computer Vision (OpenCV) documentation on how you can train a classifier, and how object detection works. I use OpenCV and its Python bindings for the whale face detection.

Next, I describe how I trained the classifier.

The classifier needs a training set: positive and negative samples, images of whale heads and non-whale head areas, respectively. I already marked a few hundred whale heads for template matching, so I used those as positive samples. I generated more positive samples by introducing slight random rotations to the whale’s head (within the range of ±5°). Then, I looped through the images that contain the whale heads and marked areas that contain ocean, spray, the whale’s body, occasional dolphins and birds, anything but the whale’s head. These marked areas are my negative samples. I rotated the negative images by 90°, 180°, and 270° to increase the sample size.

Once the training set is prepared, the classifier is trained. One needs to set some parameters for the training. You can choose to train for Haar features or for local binary patterns (LBP) in OpenCV. Haar features perform slightly better in human face recognition, but it takes a significant amount of time (days or weeks) to train them. LBPs are trained quickly (within a matter of minutes or hours), so I train for LBP features. To train the classifier, you need to specify:

• the size of the positive images: if the images are large, more features can be extracted (although that doesn’t always help accuracy), but the training slows down;
• the number of cascade stages: the more cascades you have, the more accurate your classification can be, but the training slows down;
• the minimum hit rate, which specifies the fraction of positive samples that needs to be detected correctly before the classifier moves on to the next stage.

I mapped the parameter space of these three parameters to find the combination that performs best at detecting the whale’s face. In other words, I specified a few values for each parameter and trained a separate cascade for each parameter combination. The image sizes were [40, 60, 80] pixels, the numbers of cascade stages were [15, 20, 25, 30], and the minimum hit rates were [0.99, 0.995, 0.999]. I trained 36 cascades overall, evaluated each on the same 100 randomly selected images, and chose the cascade that performed best (for details on detection, see the next paragraphs). The best cascade, with a size of 40 pixels, 30 stages, and a minimum hit rate of 0.999, correctly detected the whale’s head on 90% of the images.
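The parameter grid is small enough to enumerate explicitly. The sketch below lists the 36 combinations and shows the selection step. `evaluate` is a hypothetical callable standing in for "train a cascade with these parameters and measure the detection rate on the 100 validation images".

```python
from itertools import product

image_sizes = [40, 60, 80]            # positive sample size in pixels
num_stages = [15, 20, 25, 30]         # number of cascade stages
min_hit_rates = [0.99, 0.995, 0.999]  # minimum hit rate per stage

# Every combination of the three parameters: 3 * 4 * 3 = 36 cascades.
grid = list(product(image_sizes, num_stages, min_hit_rates))
print(len(grid))


def pick_best(grid, evaluate):
    """Return the parameter combination with the highest detection rate.

    `evaluate` is a hypothetical callable that trains a cascade for one
    combination and returns the fraction of validation images on which
    the head was found.
    """
    return max(grid, key=evaluate)
```

Evaluating every cascade on the same fixed set of images matters here: with only 100 images, differences between random subsets could otherwise be larger than the differences between parameter combinations.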

In the rest of this section, I describe how detection with the cascade works and how I addressed the two shortcomings of the Viola-Jones framework: no rotation invariance, and multiple overlapping detections on the same face.

The solutions rely on the minNeighbors parameter of OpenCV’s detectMultiScale method. This parameter helps to eliminate false positive detections if tuned properly (see this Stack Overflow entry for more explanation and illustration). If minNeighbors is too small, false positives are found. If it is too large, true positives are eliminated. So, the trick is to find just the right range of minNeighbors that detects one region on the image. The right range needs to be determined for each image separately.

The schematic code snippet here describes the main steps:

    ... load image ...
    max_minNeighbors_old = 0

    for r in rotation_angles:
        ... rotate image by r° ...
        collect_regions = []
        max_minNeighbors = 0

        for m in minNeighbors:
            ... count number of regions detected ...

            if nr. of regions == 1:
                max_minNeighbors = m
                collect_regions.append(region detected)

        if max_minNeighbors > max_minNeighbors_old:
            ... cut the last detected region out ...
            ... save this part of the image ...
            max_minNeighbors_old = max_minNeighbors

    ... found the most likely region of the whale's face ...


The two main ideas are:

• The image needs to be rotated (lines 4-5) because the cascade can only detect the face if it is vertical.
• The most likely location of the whale’s face maximizes the value of minNeighbors while keeping the number of detected regions at 1.

The code finds the largest minNeighbors value for each rotation angle, and the most likely location of the whale’s face is the one that maximizes minNeighbors amongst all rotation angles.
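The same search can be written as a small runnable function. This is a sketch under assumptions: `detect` is a hypothetical callable that rotates the image, runs the cascade, and returns the detected regions, and `fake_detect` is a made-up stand-in so the example runs without OpenCV or a trained cascade.

```python
def find_face(image, rotation_angles, min_neighbors_values, detect):
    """Return (best_region, best_angle, best_min_neighbors).

    `detect(image, angle, min_neighbors)` is a hypothetical callable that
    rotates the image by `angle` degrees, runs the cascade, and returns a
    list of detected regions. With OpenCV it could wrap
    cv2.CascadeClassifier.detectMultiScale(..., minNeighbors=min_neighbors).
    """
    best_region, best_angle, best_m = None, None, 0
    for angle in rotation_angles:
        for m in sorted(min_neighbors_values):
            regions = detect(image, angle, m)
            # Keep the largest minNeighbors that yields exactly one region.
            if len(regions) == 1 and m > best_m:
                best_region, best_angle, best_m = regions[0], angle, m
    return best_region, best_angle, best_m


def fake_detect(image, angle, m):
    """Made-up detector: the face is upright at 90 degrees, so detections
    survive larger minNeighbors values there than at other angles."""
    support = {0: 2, 90: 7, 180: 3}[angle]
    if m <= 1:
        # Too permissive: a false positive slips through.
        return [("face", angle), ("spurious", angle)]
    return [("face", angle)] if m <= support else []


result = find_face(None, [0, 90, 180], range(1, 11), fake_detect)
print(result)
```

If no angle ever yields exactly one region, the function returns `None` for the region, which is a case the calling code would need to handle (for example by falling back to the whole image).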

### Failed attempts in classification

Once the faces are detected, we need to identify which of the ~450 whales is on each image. I had quite a few failed attempts and I will not describe all of them in detail here. In most cases, I first standardized the images and used PCA to reduce the number of features such that 80%, 85%, 90%, 95%, or 99% of the variance is retained. I tried each of these values with each method.

I tried several methods: an SVM, neural networks with one hidden layer, contours drawn around the callosity pattern and compared using Hu’s set of invariants, and linear discriminant analysis to reduce the number of features such that the separation between classes is maximized. In each case, only 1-2% of the images were correctly classified. I discuss what I believe is the most likely reason why the methods failed in the ‘Lessons learnt’ section.

### What worked (sort of) – nearest neighbors

Nearest neighbors is a relatively simple machine learning algorithm that estimates the class of a test sample by finding the most similar sample(s) of the training set using some distance metric. Nearest neighbors could correctly classify roughly 10% of the whale faces. The output of the method is the estimated class of a test sample, and the distance(s) to the nearest neighbor(s).

The task here is to determine how to translate the distances to the $\displaystyle p_{ij}$ probabilities such that the logloss score is minimized. I divided the 4,000 labeled images into training and test sets (with a 75%-25% ratio), trained the nearest neighbors model on the training set, and checked its performance on the test set. I printed the minimum, median, and maximum distances of the correctly and incorrectly classified images, and I found that the correctly classified images have significantly smaller distances than the incorrectly classified ones. This inspired me to try the following power law formula:

$\displaystyle p_{ij} = \frac{c}{d^k}$,

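where $d$ is the distance to the nearest neighbor, $c$ is a normalizing constant, and $k$ is a small integer (between 1 and 6). I found that my score is minimized if $k=4$ and the scaling constant is chosen such that $\displaystyle p_{ij}$ equals the training-set frequency of whale $\displaystyle j$ ($\displaystyle n_j$ divided by the total number of training images) for the smallest $\displaystyle d^{-k}$ value.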
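To illustrate how a power law turns distances into probabilities, here is a minimal sketch. Note the assumption: it simply normalizes the $\displaystyle d^{-k}$ weights so that they sum to one for each image, which is a simpler choice of $c$ than the scaling described above. The function name, the `floor` guard against zero distances, and the distances themselves are made up for illustration.

```python
def distances_to_probabilities(class_distances, k=4, floor=1e-6):
    """Map each class's nearest-neighbor distance d to a probability
    proportional to d**-k, normalized to sum to one."""
    # `floor` avoids division by zero for an exact match (d = 0).
    weights = {c: 1.0 / max(d, floor) ** k for c, d in class_distances.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Made-up distances from one test image to the nearest training image
# of each of three hypothetical whales.
probs = distances_to_probabilities({"whale_a": 1.0, "whale_b": 2.0, "whale_c": 2.0})
print(probs)
```

A larger $k$ concentrates more probability on the nearest whale, which helps when the nearest neighbor is right and is costly under logloss when it is wrong. That trade-off is why $k$ is worth tuning on held-out data.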

### Lessons learnt

I discuss two things in this closing section. Why did the classification not perform better? And, based on what I learnt, what would I do differently in a similar project in the future? Let’s start with the former.

There are a few potential reasons why the classification performed poorly:

• Maybe the images were not aligned accurately enough. Viola-Jones object detection is fast, but it only gives a rough location of the object. Eight snapshots of the same whale are included below to illustrate the point (these are not standardized images). Observe how e.g., the location of the blowhole and the middle peninsula differs on the images. My guess is that a different method (maybe a neural network trained to find the blow hole and the nose) could better align the faces and this would improve the performance of the classifier.
• Maybe the problem is that I tried to use the same classifier for all viewing angles. In human face recognition, a classifier trained on frontal faces performs poorly on profile views. I considered labeling the images based on whether the whale’s face is seen from above or from either of the sides, and training different classifiers for the three sub-classes. Eventually, I decided not to do this because a significant fraction of the whales have only a small number of images (1-2), so I might not get one image per viewing angle.

I will also do a few things differently if I encounter a similar project:

• I’d find teammates. 🙂
• I would do more ensemble learning. My impression from looking at this and other Kaggle competitions is that most winning groups use ensemble learning: they combine the results of a large number of methods to boost the overall performance.
• I’d try deep learning tools. I didn’t try deep learning because I thought a large number of training images (on the order of $10^5$ to $10^6$) is required, which we clearly didn’t have. I was surprised to find that the winning teams used deep networks (1st place, 2nd place, 3rd place). So I need to learn more about deep learning. If you know an article or post that explains why my intuition about deep learning is wrong, and how and why deep learning works with only a few thousand images, please let me know and leave a comment below.

### Final word

All in all, I learnt an awful lot about image processing and machine learning during the competition. Thanks kaggle.com for providing the platform for this competition. It was a beautiful and exciting project. I really hope that the NOAA and the New England Aquarium will succeed in conserving this species. If you can, please consider sponsoring a right whale. I’d hate to live in a world without gorgeous animals like the right whales.

Some IPython notebooks, coordinates of clicks on images, the Viola-Jones cascade, and mugshots of whale faces are available at this GitHub repository.

I thank Brown University for providing computational resources for this project.