ju-hache
5/22/2018 - 7:40 AM

Truncate data

Some outliers are clearly skewing the distribution, and the plot doesn't give much information in this form. We need to truncate the data. How do we do that?

We'll use a basic yet really powerful rule: the 68–95–99.7 rule. This rule states that, for a normal distribution:

- 68.27% of the values $\in [\mu - \sigma, \mu + \sigma]$
- 95.45% of the values $\in [\mu - 2\sigma, \mu + 2\sigma]$
- 99.73% of the values $\in [\mu - 3\sigma, \mu + 3\sigma]$

where $\mu$ and $\sigma$ are the mean and standard deviation of the normal distribution. Our distribution isn't necessarily normal, but for a shape like the one we've got, we'll see that applying the third filter (keeping only the values within $3\sigma$ of the mean) will improve our results radically.
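To make the rule concrete, here is a quick sanity check on synthetic data. This is only a sketch: the sample is drawn from a standard normal distribution (it is not the Kiva data), and the `coverage` helper is a name made up for this illustration.

```python
import numpy as np

# Synthetic standard-normal sample (an assumption for illustration,
# unrelated to the Kiva loans dataset).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

mu, sigma = sample.mean(), sample.std()

def coverage(values, mu, sigma, k):
    """Fraction of values lying within k standard deviations of the mean."""
    return np.mean(np.abs(values - mu) <= k * sigma)

# Empirical coverage for 1, 2 and 3 standard deviations.
coverages = {k: coverage(sample, mu, sigma, k) for k in (1, 2, 3)}
for k, frac in coverages.items():
    print(f"within {k} sigma: {frac:.2%}")
```

On a sample this large, the three fractions should land very close to 68.27%, 95.45% and 99.73%. On a skewed distribution like the loan amounts, the percentages will differ, which is why the rule is used here as a rough cutoff rather than an exact guarantee.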

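import matplotlib.pyplot as plt
import seaborn as sns

temp = df_kiva_loans['loan_amount']
# True for loans that are NOT more than 3 standard deviations away from the mean
within_3_sigma = ~((temp - temp.mean()).abs() > 3 * temp.std())

plt.figure(figsize=(12, 8))
# Note: distplot is deprecated since seaborn 0.11; newer releases recommend histplot or displot
sns.distplot(temp[within_3_sigma])
plt.ylabel("density estimate", fontsize=16)
plt.xlabel('loan amount', fontsize=16)
plt.title("KDE of loan amount (outliers removed)", fontsize=16)
plt.show()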
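If the same filter is needed for other columns, it can be wrapped in a small helper. The sketch below is one possible way to do it: `truncate_outliers` and the toy `amounts` series are hypothetical names introduced for illustration, not part of the original kernel. Note that, unlike the negated mask used above, the `<=` comparison also drops missing values.

```python
import pandas as pd

def truncate_outliers(series: pd.Series, n_sigma: float = 3.0) -> pd.Series:
    """Keep only the values within n_sigma standard deviations of the mean."""
    deviation = (series - series.mean()).abs()
    return series[deviation <= n_sigma * series.std()]

# Toy data (hypothetical): 50 small loans and one extreme amount.
amounts = pd.Series([100.0] * 50 + [10_000.0])
kept = truncate_outliers(amounts)
print(len(amounts), "->", len(kept))
```

With the real data, the call would look like `truncate_outliers(df_kiva_loans['loan_amount'])`, and `n_sigma` can be lowered to 2 for a more aggressive cut.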