Kaggle for the paws

In a recent Kaggle competition, the goal was to use a dataset on shelter animals to do two things: gain insights that can potentially improve the animals' outcomes, and develop a classification model that predicts the outcome of each animal (adoption, died, euthanasia, return to owner, transfer). For more information on the competition, see here.

I got lots of support from my group, the Data Science Practice team at Computing and Information Services, Brown University, Providence RI: Mark Howison (Director of Data Science Practice), Dave Berenbaum, Paul Stey, and Erik Knoll. Towards the end of the competition I also teamed up with Gerard Toonstra, a Competitions Master at Kaggle. Thank you all for your help and support!

pickles

Pickles is a beagle puppy with a silly personality who was up for adoption at RISPCA. Photo by RISPCA, the Rhode Island Society for the Prevention of Cruelty to Animals.

I published a few scripts during the competition and their aim was to gain useful insights about the outcome trends by exploring the data. These scripts are available on Kaggle’s site:

The rest of this post describes how we developed our classifier. First, I introduce what we believe is a useful model for the shelter. Then I describe how we generated our features, how we trained the classifier, and how we evaluated its properties. If you are not familiar with classification, please read this first.

What does a useful model do for the shelter?

We believe that the shelter would use the model when a new animal arrives. This is the time when the shelter wants to know the likely outcome of the new animal. A classifier that predicts the outcome of animals at the time of their intake can help in several ways:

  • If the adoption probability of an animal is low, the shelter employees can focus on that animal and give it a little extra help finding a new home. Based on the three scripts published above, that extra help could be neutering/spaying the animal. The shelter could also lower the adoption fee of such animals.
  • With a slight modification, the model can help with logistics planning. If the model predicts how long the animals stay in the shelter (instead of predicting the likely outcome), one could better plan and allocate resources for animals that are likely to stay long.

Collecting features

Upon the arrival of a new animal, the shelter has three types of available information:

  • the properties of the new animal at intake, such as animal type (cat or dog), age, gender, breed, color, and the time and date of the intake. These animal properties have a strong impact on the outcome. For example, most people like to adopt young puppies and kittens, and the adoption probability of older animals is lower. People also have preferences for certain breeds. Not many people like dog breeds that are perceived as aggressive (e.g. pitbull, rottweiler, bulldog).
  • the properties of the animals which are currently in the shelter. If there are many young animals currently in the shelter, it probably lowers the adoption probability of old animals.
  • the properties of all animals that arrived prior to the intake. This is our training data, which is used to forecast the outcome of the new animal.

We note that not all of this data was readily provided by Kaggle. Kaggle gave us the animal properties and the datetime of the outcome. We felt that the datetime of the outcome is a feature which should not be used. If the goal is to predict the outcome, one should not know when that outcome will occur down to the minute. Many fellow Kagglers were of the same opinion, and they described on the forum that using the outcome datetime introduces a “data leak”.

To collect the necessary features, we augmented the data provided by Kaggle. This was not against the rules, and the full dataset is available at the Austin Animal Center’s site. We matched the animals in Kaggle’s training and test sets to the shelter’s database and collected the intake time of each animal. The intake database also describes under what circumstances the animal arrived at the shelter (the intake type, e.g. owner-surrendered, stray) and what the animal’s condition was upon arrival (the intake condition, e.g. normal, injured, sick). We added these features to our feature set too.

The augmented database allowed us to collect the properties of the in-shelter animals. Here is our train of thought: if an animal arrives at the shelter at time t, then the in-shelter animals have an intake time before t and an outcome time after t. We used this double condition to collect the in-shelter animals on each day at noon within the time frame of the dataset (each day between October 2013 and February 2016). If a new animal arrives on June 10th 2014, for example, we look up which animals were present in the shelter on that day at noon. We tried a few ways to describe the properties of the in-shelter animals. First, we counted how many dogs and cats were present in three age groups: neonatal animals between 0 and 6 weeks old (which cannot be adopted), young animals between 6 weeks and one year old, and adult animals older than one year (see Fig. 1).
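The double condition above can be sketched in a few lines of Python. This is only an illustration, not our actual script: the records, the tuple layout, and the function names are made up, and the age thresholds simply follow the three age groups described above.

```python
from datetime import datetime, timedelta

# Hypothetical records: (animal type, birth date, intake time, outcome time).
animals = [
    ("Cat", datetime(2014, 5, 20), datetime(2014, 6, 1, 9, 0), datetime(2014, 6, 20, 15, 0)),
    ("Dog", datetime(2013, 12, 1), datetime(2014, 6, 5, 11, 0), datetime(2014, 6, 12, 10, 0)),
    ("Dog", datetime(2010, 3, 15), datetime(2014, 5, 1, 14, 0), datetime(2014, 6, 10, 16, 30)),
    ("Cat", datetime(2012, 1, 1), datetime(2014, 6, 11, 8, 0), datetime(2014, 6, 30, 12, 0)),
]


def age_group(birth, t):
    """Return 'neonatal' (under 6 weeks), 'young' (under 1 year) or 'adult'."""
    age = t - birth
    if age < timedelta(weeks=6):
        return "neonatal"
    if age < timedelta(days=365):
        return "young"
    return "adult"


def shelter_counts(records, t):
    """Count the animals with intake before t and outcome after t, by type and age group."""
    counts = {}
    for animal_type, birth, intake, outcome in records:
        if intake < t < outcome:  # the double condition
            key = (animal_type, age_group(birth, t))
            counts[key] = counts.get(key, 0) + 1
    return counts


# Which animals were in the shelter on June 10th 2014 at noon?
noon = datetime(2014, 6, 10, 12, 0)
counts = shelter_counts(animals, noon)
print(counts)
```

The last record is excluded because that animal arrives only on June 11th, after the snapshot is taken.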

shelter_dist

Fig. 1: The number of neonatal, young and adult animals in the shelter on each day between 1st of October 2013 and 21st of February 2016. Interestingly, the number of neonatal cats spikes around June in 2014 and 2015. A similar spike is not seen for neonatal dogs. The number of animals is low at the beginning and at the end of the time frame (the general arch of the figure). This is an artifact, as we don’t have data prior to October 2013 or after February 2016.

We also described the in-shelter population by counting the number of animals in each animal type, age group, breed group, and gender group. We found no improvement upon adding more features, so our final solution describes the in-shelter population using the animal type and the age groups.

The animal breed and color features proved especially difficult to work with because both features have more than 100 unique entries. Therefore, we came up with ways to group the breeds and colors in order to reduce the number of categories. The dog breeds were grouped into dog groups as described in one of the Kaggle scripts above. The colors were grouped in a similar way. Colors can have a light, medium, or dark shade. Based on the color string of the animal, we figure out whether the animal is unicolor (either light, medium, or dark), has two shades (e.g. light/dark, medium/dark), or is tricolor (light, medium, and dark shades are all present). These color shade groups lower the number of categories to 10, which is a manageable number.
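To make the shade grouping more concrete, here is a rough sketch of the idea. The shade lookup table and the function name are invented for illustration, so the sketch does not reproduce the 10 groups mentioned above; it only shows how a color string can be reduced to a combination of shades.

```python
# Hypothetical shade lookup, made up for this example.
SHADE = {
    "white": "light", "cream": "light", "yellow": "light",
    "tan": "medium", "brown": "medium", "red": "medium", "gray": "medium",
    "black": "dark", "chocolate": "dark",
}


def shade_group(color):
    """Map a color string such as 'Black/White' to a shade-combination label."""
    shades = set()
    for word in color.lower().replace("/", " ").split():
        if word in SHADE:
            shades.add(SHADE[word])
    if not shades:
        return "unknown"
    # Always list the shades in the same order so 'Black/White' and
    # 'White/Black' end up in the same group.
    return "/".join(s for s in ("light", "medium", "dark") if s in shades)


for color in ["Black/White", "Brown Tabby", "Black/Brown/White", "Tricolor"]:
    print(color, "->", shade_group(color))
```

Words that are not in the lookup (such as 'Tabby') are ignored, and a string with no recognized color falls into a separate 'unknown' group.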

We also added a few features that describe the name of the animal. The first feature describes whether the animal has a name (1) or not (0). The second feature is the number of characters in the name. Finally, the third feature describes how frequently that name occurred in the training set.
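The three name features can be sketched as follows. The function name and the sample names are made up, and the sketch assumes that "how frequently" means the relative frequency of the name among all training animals; a raw count would work just as well.

```python
from collections import Counter


def name_features(names):
    """Return (has_name, name_length, name_frequency) for each entry in names."""
    cleaned = [(name or "").strip() for name in names]
    counts = Counter(name for name in cleaned if name)
    total = len(cleaned)
    features = []
    for name in cleaned:
        has_name = 1 if name else 0
        # Assumption: frequency is the share of training animals with this name.
        frequency = counts[name] / total if name else 0.0
        features.append((has_name, len(name), frequency))
    return features


train_names = ["Max", "Bella", None, "Max", ""]
features = name_features(train_names)
for name, row in zip(train_names, features):
    print(repr(name), row)
```

For animals in the test set, the frequency would be looked up in the counts collected from the training set, so a name that never occurred in training gets a frequency of zero.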

The intake time features describe the year, month, day, and hour of the intake, the weekday of the intake (e.g. Monday or Tuesday), and the quarter of the year in which the intake occurred.
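As a small illustration, these features can be derived from a single intake timestamp with Python's standard library. The function name and the dictionary keys are made up for this example.

```python
from datetime import datetime


def intake_time_features(intake):
    """Split an intake timestamp into the calendar features described above."""
    return {
        "year": intake.year,
        "month": intake.month,
        "day": intake.day,
        "hour": intake.hour,
        "weekday": intake.weekday(),  # Monday is 0, Sunday is 6
        "quarter": (intake.month - 1) // 3 + 1,
    }


time_features = intake_time_features(datetime(2014, 6, 10, 14, 30))
print(time_features)
```

June 10th 2014 was a Tuesday, so the weekday feature is 1 and the quarter is 2.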

The age of the animal was converted to days.

We used one-hot encoding on categorical features. These features are the animal type, gender, color and breed groups, intake type, and intake condition.

All features were normalized to be between 0 and 1 before we fed the data to the classifier.
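The last two steps, one-hot encoding and scaling to the [0, 1] range, can be sketched in plain Python. The helper names and the sample values are made up, and the sketch assumes simple min-max scaling, which is one common way to map a feature onto [0, 1].

```python
def one_hot(values):
    """One-hot encode a list of categories; returns (sorted categories, rows of 0/1)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


def min_max(values):
    """Rescale numeric values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # a constant feature carries no information
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


intake_type = ["Stray", "Owner Surrender", "Stray"]
age_days = [30, 365, 1825]

categories, encoded = one_hot(intake_type)
scaled_age = min_max(age_days)
print(categories, encoded)
print(scaled_age)
```

One-hot columns are already 0 or 1, so the scaling only changes the numeric features such as the age in days.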

Classification

We tried several different classifiers including (but not limited to) random forests, support vector machines, and nearest neighbors. In the end, we decided to use xgboost because it gave the best single-model score.

We combined xgboost with sklearn’s RandomizedSearchCV to train the model. The problem is unbalanced because only a small number of animals have an outcome of death or euthanasia. To deal with the unbalanced dataset, we used stratified k-folds when splitting the data. A stratified fold has the same percentage of samples for each class as the whole dataset. In other words, the points in each class are split equally between the folds.

Here is the outline of our code:

import pickle

import numpy as np
import pandas as pd
import xgboost as xgb
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder

# load the training set augmented with the intake features
X = pd.read_csv('test-train_files/train_intake.csv')

# encode the outcome types as integer labels, then drop the non-feature columns
Y = LabelEncoder().fit_transform(X['OutcomeType'])
X = X.drop(['AnimalID', 'OutcomeType'], axis=1)

n_iter = 1000
k_fold = 5

# stratified folds keep the class proportions of the whole dataset
cv = StratifiedKFold(n_splits=k_fold, shuffle=True)

# initialize the classifier
GB = xgb.XGBClassifier()

# distributions (or lists) from which each hyperparameter is sampled
param_grid = {'max_depth': sp_randint(1, 90),
              'learning_rate': sp_uniform(loc=0e0, scale=1e0),
              'objective': ['multi:softprob'],
              'n_jobs': [8],
              'missing': [np.nan],
              'reg_alpha': [0.01, 0.017782794, 0.031622777, 0.056234133,
                            0.1, 0.17782794, 0.31622777, 0.56234133, 1., 1.77827941,
                            3.16227766, 5.62341325, 10.,
                            17.7827941, 31.6227766, 56.2341325, 100.],
              'colsample_bytree': sp_uniform(loc=0.2e0, scale=0.8e0),
              'subsample': sp_uniform(loc=0.2e0, scale=0.8e0),
              'n_estimators': sp_randint(50, 200)}

# randomized search scored by the (negated) multiclass log loss
search_GB = RandomizedSearchCV(GB, param_grid, scoring='neg_log_loss',
                               n_iter=n_iter, cv=cv, verbose=True).fit(X, Y)
print(' ', search_GB.best_score_)
print(' ', search_GB.best_params_)

# save the results
with open('xgboost_RSCV.dat', 'wb') as f_name:
    pickle.dump([search_GB.cv_results_], f_name)

We generated 5 stratified folds of the data and performed a randomized search over 6 xgboost parameters (maximum depth of the trees, learning rate, the regularization parameter alpha, subsample and column subsample, and the number of trees). We typically ran 1000 simulations.
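To illustrate what stratification does, here is a minimal, library-free sketch that deals the samples of each class out to the folds in turn. It is not how sklearn implements StratifiedKFold (for one thing, it does not shuffle); the function name and the toy labels are made up.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold mirrors the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the samples of this class out to the folds one by one.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy labels with a 2:1:1 class ratio.
labels = ["Adoption"] * 10 + ["Transfer"] * 5 + ["Died"] * 5
folds = stratified_folds(labels, 5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Every fold ends up with the same 2:1:1 ratio as the whole toy dataset, which is the property that keeps the rare classes present in each validation split.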

We found that the average of the best 20 models gave the best score on the Leaderboard. Averaging the best n models is a simple way to ensemble models and reduce the variance of a single model, and it gave a 0.01 improvement in our Leaderboard score compared to the best single model. Our final Leaderboard score is 0.794, which is not great in comparison to other models. We speculate that most other models use the outcome datetime as a feature. The outcome time is especially helpful for predicting the outcome type ‘transfer’ because multiple animals are transferred at the same time, so the outcome time pinpoints those events. We believe that it is in the best interest of the shelter not to use the outcome time because (as we discussed above) in a real-world scenario one does not know when the outcome will occur if the task is to predict the outcome. The model needs to be practical first and foremost.
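The averaging step can be sketched as follows, assuming that "the average of the best n models" means averaging the predicted class probabilities of the models and scoring the result with the multiclass log loss (the metric used in the search above). The function names and the numbers are made up for this example.

```python
import math


def average_probabilities(model_probs):
    """Average class probabilities over models; model_probs[m][i][c] is the
    probability that model m assigns to class c for sample i."""
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [[sum(model_probs[m][i][c] for m in range(n_models)) / n_models
             for c in range(n_classes)]
            for i in range(n_samples)]


def log_loss(true_classes, probs, eps=1e-15):
    """Multiclass logarithmic loss; true_classes holds the true class indices."""
    total = 0.0
    for y, p in zip(true_classes, probs):
        # Clip the probability to avoid taking the log of zero.
        total -= math.log(min(max(p[y], eps), 1 - eps))
    return total / len(true_classes)


# Made-up predictions of two models for three samples and two classes.
model_a = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
model_b = [[0.7, 0.3], [0.6, 0.4], [0.4, 0.6]]
y_true = [0, 1, 1]

ensemble = average_probabilities([model_a, model_b])
print(log_loss(y_true, model_a), log_loss(y_true, model_b), log_loss(y_true, ensemble))
```

Because the negative logarithm is convex, the log loss of the averaged prediction is never larger than the average of the individual log losses. It is not guaranteed to beat the best single model, though, which is why the number of models to average is worth checking on held-out data.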

We show the importances of our features in Fig. 2. The feature importances were generated with xgboost’s get_fscore function from our best single model (for more details, see this post). The figure shows that features related to the intake datetime, the animal’s age and name, and the properties of the in-shelter animals are most predictive of the outcome. We were surprised to see that features related to the animal’s breed, color, and gender are less important than we anticipated.

feature_importance_sd_xgb

Fig. 2: The intake hour, the age of the animal in days, and the occurrence rate of the animal’s name are the top 3 most important features. The features that describe the animal type and age of the in-shelter animals are named ‘shelter_dist_x’. Other important features are the intake’s weekday and day of the month, and the length of the name. Our model shows that the breed, color, and gender of the animals are less important than we guessed.

We also show the confusion matrix of the best single model. Traditionally the diagonal elements of the confusion matrix show how many points are classified correctly, and the off-diagonal elements show how many points are misclassified (often the counts are normalized). Such a confusion matrix characterizes models with hard predictions that return the most likely class. This is not suitable for us because our model returns prediction probabilities for each class (soft predictions). So we show two modified confusion matrix figures that are suitable to visualize the predicted probabilities.

The first confusion matrix figure is a heatmap and it shows the mean predicted probabilities (see Fig. 3). The diagonal blocks show the average predicted probabilities of the correct classes, and the off-diagonal elements show the average predicted probabilities of the incorrect classes. Ideally, the diagonal elements should be close to unity and the off-diagonal elements should be close to zero.

conf_mat_probs_heatmap

Fig. 3: Confusion matrix with predicted probabilities as a heatmap. The colors represent the mean predicted probabilities (see the colorbar for the values).

 

This figure shows that adoption and transfer are predicted most accurately because their corresponding diagonal elements are the highest (with a mean above 50%). Death is incorrectly predicted because the corresponding diagonal element is close to zero. Interestingly, animals that died often have high predicted adoption or transfer probabilities.
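The numbers behind such a heatmap can be computed as sketched below. This is an illustration under assumptions: the function name and the toy predictions are made up, and the layout (rows for the predicted class, columns for the true class) is a choice made for this example.

```python
def mean_probability_matrix(true_classes, probs, n_classes):
    """Entry [p][t] is the mean predicted probability of class p
    over all samples whose true class is t."""
    sums = [[0.0] * n_classes for _ in range(n_classes)]
    counts = [0] * n_classes
    for t, p in zip(true_classes, probs):
        counts[t] += 1
        for c in range(n_classes):
            sums[c][t] += p[c]
    return [[sums[c][t] / counts[t] if counts[t] else float("nan")
             for t in range(n_classes)]
            for c in range(n_classes)]


# Made-up soft predictions for three samples and two classes.
y_true = [0, 0, 1]
probs = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]

matrix = mean_probability_matrix(y_true, probs, 2)
for row in matrix:
    print(row)
```

With this layout each column sums to one, since the predicted probabilities of every sample sum to one. A class with no samples gives a column of NaN values.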

Our second visualization is a slightly modified version of this kernel. The x axis and the color of the points show the true classes (e.g. animals that were transferred are shown as blue points). The y axis shows the predicted probabilities of our best single model. The height of a point within a block shows the predicted probability for that animal. For example, the upper left green block shows the predicted transfer probability of animals that were adopted. Ideally, the points in the diagonal blocks should cluster at the upper region of the block (close to unity), and the points in the off-diagonal blocks should cluster at the lower part of the block (close to zero). This figure might be a bit more difficult to interpret but it carries more information: we can see the scatter or distribution of the predictions in each block.

conf_mat_probs_equalized

Fig. 4: The confusion matrix with prediction probabilities. The x axis and the color of the points show the true classes (e.g. green points are animals that were adopted). The y axis shows the predicted probabilities that a point belongs to each of the classes. For example, the upper left green corner shows the predicted transfer probability of animals that were adopted, and the lower left green corner shows the predicted adoption probability of animals that were adopted. See the paragraphs above and below for more explanation and interpretation of this figure.

The confusion matrix figure shows a few interesting properties of our model:

  • Adoption is usually accurately predicted – the points in the diagonal green box cluster at the upper regions of the block, and the points in the off-diagonal green blocks cluster around the lower part of the block.
  • Death is usually misclassified because the points in the diagonal black block are close to zero and the predicted death probability (the second row from the bottom) is always low. This is likely because there are only a few animals that died in the dataset.
  • Euthanasia is often misclassified – the points in the orange diagonal block show a two-banded structure. The points cluster both at the lower and upper regions of the block.
  • Return to owner is difficult to predict and it is often confused with adoption. The points in the diagonal ‘return to owner’ block are more or less uniformly distributed (random guess). And the predicted adoption probabilities of animals that were returned to their owner are large and do not cluster around 0.
  • Transfer is often confused with adoption.

Final words

This competition was a lot of fun and we got a peek into the lives of these animals. Our group became especially fond of Butch, a 5-year-old neutered male English bulldog who appeared several times in the Austin shelter’s database. Apparently, Butch is a very bad puppy who likes to run away. 🙂

If you work at a shelter (or know someone who does), you keep a database of your animal intakes and outcomes, and you feel you don’t utilize that database well enough, please feel free to get in touch with me (andras.zsom at gmail dot com). I would be happy to take a look at the dataset. Hopefully I will be able to provide some new insights and new ways to help the paws.

Our code, figures, and csv files are available on github here.

If you like the post, please leave a comment below, follow the blog via email, and share this post on social media. I’m also happy to answer any questions or concerns. Please post your question in the comment section below or send me an email (andras.zsom at gmail dot com).

 
