Skip to content

Dog Shelter Problem


You operate a dog shelter 🐶, and you're building a model to predict the probability of a dog getting adopted based on its age, weight, and breed. You have the following data.

import numpy as np
import pandas as pd

# dog breeds list
breeds = [
    'english pointer', 'english setter', 'kerry blue terrier', 'cairn terrier', 'english cocker spaniel',
    'gordon setter', 'airedale terrier', 'australian terrier', 'bedlington terrier', 'border terrier',
    'bull terrier', 'fox terrier (smooth)', 'english toy terrier (black &tan)', 'swedish vallhund',
    'belgian shepherd dog', 'old english sheepdog', 'griffon nivernais', 'briquet griffon vendeen',
    'ariegeois', 'gascon saintongeois', 'great gascony blue', 'poitevin', 'billy', 'artois hound',
    'porcelaine', 'small blue gascony', 'blue gascony griffon', 'grand basset griffon vendeen',
    'norman artesien basset', 'blue gascony basset'
]

# random generator
rng = np.random.default_rng(1)

# training DataFrame
Ndogs = 10
dogs = pd.DataFrame({
    'id': rng.choice(1000000, size=Ndogs, replace=False),
    'age': rng.uniform(low=0, high=16, size=Ndogs).round(0),
    'weight': rng.uniform(low=10, high=115, size=Ndogs).round(1),
    'breed': pd.Categorical(rng.choice(breeds[:25], size=Ndogs, replace=True), categories=breeds),
    'adopted': rng.choice([True, False], size=Ndogs, replace=True)
})

# Insert some NaNs
dogs.iloc[rng.choice(Ndogs, size=int(Ndogs * 0.25), replace=False), 1] = np.nan

print(dogs)
#        id   age  weight                             breed  adopted
# 0  311831   NaN    88.8                   english pointer    False
# 1  473184   9.0    39.4                      bull terrier    False
# 2  822941   5.0    60.9  english toy terrier (black &tan)    False
# 3   34852  13.0   113.0                australian terrier     True
# 4  948647   NaN   111.0                kerry blue terrier     True
# 5  511817   7.0    86.1                      bull terrier    False
# 6  144159   2.0    66.8              old english sheepdog    False
# 7  755162   6.0    39.1              fox terrier (smooth)    False
# 8  950457   3.0    26.9               gascon saintongeois     True
# 9  249228   4.0   111.8                    border terrier     True

Build a compressed sparse column matrix to represent the training features. Be sure to one-hot-encode the dog breeds into indicator columns, one for each possible breed (not just the observed breeds 😉).

The output matrix should look something like this:

    age  weight  is_english_pointer  is_english_setter ...
0   NaN    88.8                 1.0                0.0
1   9.0    39.4                 0.0                0.0
2   5.0    60.9                 0.0                0.0
3  13.0   113.0                 0.0                0.0
4   NaN   111.0                 0.0                0.0
...

Try with Google Colab