
This is the first in a series of posts about constructing a predictive model, based on physical measurements, of the age of abalone, a group of sea snails. In this post, we go over the dataset and provide some descriptive statistics. The dataset and related information can be accessed from the UCI Machine Learning Repository (Dua and Graff, 2019). Note that this dataset was originally collected for a non-machine-learning study (Nash et al., 1994).

Before we start our analysis, let us import all the required Python libraries and load the dataset as a pandas DataFrame.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math

column_names = ["sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight",
                "shell_weight", "rings"]
dataset = pd.read_csv("abalone.data", names=column_names)

dataset.head()

As a first step, we should look at the description of the dataset to get a rough idea of what we are dealing with. According to the metadata, the dataset contains 4,177 labeled observations. There are 8 predictor (independent) variables, all either real-valued or integer. The target (dependent) variable is integer-valued and gives the number of rings a given abalone has. Adding 1.5 to the number of rings yields the age of the abalone in years.
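The rings-to-age rule is a one-liner in pandas. Here is a minimal sketch on a toy frame; the `age` column name is our own choice, the original dataset has no such column:

```python
import pandas as pd

# toy stand-in for the abalone data; only the rings column matters here
df = pd.DataFrame({"rings": [7, 9, 15]})
df["age"] = df["rings"] + 1.5  # age in years, per the dataset description
print(df)
```

We will keep predicting rings directly, so this derived column is for illustration only.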

It is worth mentioning that there are no missing values in this dataset (observations with missing data points were excluded before the dataset was published). This significantly reduces the workload and minimizes room for error, as fewer assumptions need to be made about how to treat observations with missing values. This matters all the more because we have no domain experts to consult for a professional opinion.
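It is still good practice to verify this claim directly rather than trust the metadata. A quick check, sketched here on a small stand-in frame (on the real data, one would call the same methods on `dataset`):

```python
import pandas as pd

# small stand-in frame; the real check would run on the full dataset
df = pd.DataFrame({"length": [0.45, 0.35, 0.53], "rings": [9, 7, 12]})

missing_per_column = df.isnull().sum()  # count of NaNs in each column
print(missing_per_column)
```

A sum of zero across all columns confirms there is nothing to impute.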

The numeric predictor variables have been standardized beforehand to facilitate analysis and are fairly self-explanatory, so we will not spend too much time going over what each of them means. However, we can observe that there are 4 distinct weight measures. According to the dataset description, whole weight is the weight of the whole abalone, shucked weight is the weight of the meat, viscera weight is the gut weight (after bleeding), and shell weight is the weight after drying.

We further observe that the sex variable is nominal. It has 3 distinct values: M for male, F for female and I for infant. Due to its type, this variable cannot be directly used by most predictive models, like regression, neural networks, etc. For that reason, we need to transform it to a numeric type. There are a number of ways to accomplish this. For example, we could introduce dummy variables. This approach would allow us to better separate instances belonging to different sexes, at the price of increased degrees of freedom. Alternatively, we could use a numeric encoding scheme. Numeric encoding leaves the number of attributes unchanged, but we run the risk that certain models may incorrectly interpret classes with closer encodings as being similar. For that reason, let us check if there is a clear difference in the number of rings between the sexes.
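Before we do, here is the trade-off in code, sketched on a toy series. Note that pandas orders the categories alphabetically, so the resulting codes (F=0, I=1, M=2) are entirely arbitrary:

```python
import pandas as pd

sex = pd.Series(["M", "F", "I", "M"])

# numeric encoding: a single column, but it imposes an arbitrary order F < I < M
codes = sex.astype("category").cat.codes
print(codes.tolist())

# dummy encoding: three 0/1 columns, no implied order, more degrees of freedom
dummies = pd.get_dummies(sex, prefix="sex")
print(dummies)
```

A linear model fed the `codes` column would treat I as "between" F and M, which is exactly the kind of spurious structure the dummy encoding avoids.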

pd.pivot_table(dataset, values="rings", columns="sex", aggfunc="mean")

It comes as no surprise that the average number of rings for infant abalone is smaller than that for grown abalone. For grown female and male abalone, the average number of rings is about equal. Thus, we could transform the sex variable into a variable with just two values: “infant” and “grown”. However, doing so might cause trouble if the sex variable correlates with other predictor variables. To avoid any of the mentioned problems, we will introduce dummy variables and keep all 3 distinct values.
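For reference, the two-value alternative we just decided against would have been a simple mapping (sketched on a toy series; the value names `infant` and `grown` are our own):

```python
import pandas as pd

sex = pd.Series(["M", "F", "I"])

# collapse male and female into a single "grown" category
maturity = sex.map(lambda s: "infant" if s == "I" else "grown")
print(maturity.tolist())
```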

dataset = pd.concat([pd.get_dummies(dataset.sex, prefix='sex'), dataset.drop(['sex'], axis=1)], axis=1)
dataset.head()

Next we are going to have a look at descriptive statistics for the variables in the dataset.

dataset.describe()

As per the above table, female, male and infant abalone are roughly equally represented in the dataset. The average number of rings is about 10, while the median is 9. This suggests that the distribution of our target variable is skewed to the right: the longer right tail “pulls” the center of probability mass towards itself.
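The mean-versus-median comparison is itself a quick skewness check. On a small right-skewed sample (hypothetical numbers, chosen only to illustrate the effect):

```python
import pandas as pd

rings = pd.Series([5, 8, 9, 9, 10, 12, 20])  # one large value drags the mean up
mean, median = rings.mean(), rings.median()
print(mean, median)  # mean > median indicates right skew
```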

We can also observe that the smallest value for height variable is 0. Let us investigate these observations.

dataset[dataset.height == 0]

There are 2 observations with 0 height. While it is possible that the true values were rounded down during data collection, we will treat them as erroneous and remove them from subsequent analysis.

dataset = dataset[dataset.height != 0]

Having spotted some possible inaccuracies, we should investigate further and make sure that the remaining observations make sense to us. For example, we would expect the whole weight of abalone to be larger than constituent parts.

dataset[(dataset.whole_weight <= dataset.shucked_weight) | (dataset.whole_weight <= dataset.viscera_weight) | (dataset.whole_weight <= dataset.shell_weight)]

There are 4 observations for which this is not the case. In particular, for all 4 observations, shucked weight exceeds whole weight. We will be consistent in our treatment and assume that these observations are erroneous.

dataset = dataset[(dataset.shucked_weight < dataset.whole_weight)]

Without expert knowledge, it is hard to make educated decisions about validity of “size” variables of abalone, i.e. length, diameter and height. Thus, for the purposes of this analysis, we will assume that all remaining observations are valid.

While descriptive statistics provides us with valuable information about our data, it is always important to have a look at visualizations as they can reveal patterns and connections that cannot be expressed in pure numeric fashion. We will start off by looking at distributions of each of the variables separately.

dataset.hist(figsize=(15, 15), bins=40)

[Figure: histograms of each variable in the dataset]

As we previously suspected, the distribution of the target variable, rings, is slightly skewed to the right, i.e. the right tail is longer. At the same time, it roughly resembles a normal, bell-shaped distribution, which means we can treat this problem as a regression problem. All “weight” variables are also skewed to the right, with distribution shapes similar to that of a lognormally distributed random variable. This further hints at the possibility of using some kind of variable transformation to “normalize” the variables.
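A log transform is one such option. Here is a sketch on hypothetical weight values, using `np.log1p` (i.e. log(1 + x)) so that values near zero stay finite; the sample skewness drops noticeably:

```python
import numpy as np
import pandas as pd

weights = pd.Series([0.05, 0.2, 0.5, 0.8, 1.5, 2.8])  # right-skewed, like the weight columns
log_weights = np.log1p(weights)  # log(1 + x)

# the transformed series is closer to symmetric
print(weights.skew(), log_weights.skew())
```

Whether such a transformation actually helps would need to be checked against model performance later in the series.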

The diameter and length variables, on the other hand, are slightly left skewed, with distribution shapes closely resembling a normal distribution. These variables need no special preprocessing.

Correlation and bivariate analyses are a natural continuation of a univariate analysis. We are going to create a correlation matrix in what follows.

corr = dataset.corr()
corr

Let us investigate the resulting correlation matrix in more detail and see whether we can find some interesting patterns. We are especially interested in correlations between the dependent variable and predictor variables. First of all, we observe that there is a positive correlation between the number of rings and variables that represent “size” of abalone, i.e. length, diameter and height. The same applies to variables representing weight measures, whole weight, shucked weight, viscera weight and shell weight. This makes absolute sense as we would expect older abalone to be bigger, in terms of both weight and size, than younger abalone.
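A convenient way to read off these relationships is to sort the target's column of the correlation matrix. Sketched on a tiny stand-in frame (on the real data, one would take `corr["rings"]` from the matrix above):

```python
import pandas as pd

# toy frame: both predictors grow with rings, mimicking the real data
df = pd.DataFrame({
    "length": [0.3, 0.4, 0.5, 0.6],
    "whole_weight": [0.2, 0.5, 0.9, 1.4],
    "rings": [5, 8, 11, 14],
})

# correlations of every variable with the target, strongest first
rings_corr = df.corr()["rings"].sort_values(ascending=False)
print(rings_corr)
```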

If we look at the correlations between the number of rings and the dummy variables that we created for the sex variable, we observe a negative correlation between the number of rings and the dummy variable representing whether a given abalone is an infant. At the same time, the correlations between the number of rings and the other two dummy variables are very similar. Thus, in retrospect, we could have created a single variable signifying a grown abalone. However, we will keep the 3 dummy variables and see how our model performs regardless.

While we have considered how the dependent variable relates to the predictor variables, it is also important to check whether any two of the independent variables are correlated. When building predictive models, we usually try to avoid highly correlated predictors. Going over each pair of variables in the above table is tedious, however, so we will create a simple heat map instead.

mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(corr, ax=ax, mask=mask, cmap=sns.dark_palette("muted purple", input="xkcd"))

[Figure: lower-triangle correlation heat map of all variables]

According to the heat map, all the variables that concern the size or weight of abalone are highly correlated. We can also observe that most of the variables are negatively correlated with the dummy variable representing infant abalone. Thus, we may speculate that this dummy variable will be important in the predictive model that we are going to construct. The remaining sex dummy variables are only moderately correlated with the other predictor variables.
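The highly correlated pairs can also be pulled out of the matrix programmatically with a threshold scan. A sketch on a toy frame; the 0.9 threshold is an arbitrary choice for illustration:

```python
import pandas as pd

# toy frame: length and diameter move in lockstep, as in the real data
df = pd.DataFrame({
    "length": [0.30, 0.40, 0.50, 0.60],
    "diameter": [0.25, 0.33, 0.41, 0.49],
    "rings": [5, 9, 10, 14],
})
corr = df.corr()

threshold = 0.9
# upper-triangle scan: each unordered pair is checked exactly once
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```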

It looks like our data is pretty tidy, which means we can start building some predictive models. We will cover some of the algorithms in the next post.

References

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994). “The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait”, Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).