Banking marketing data set — Exploratory Data Analysis in Python
Before diving into the code, let's look at what exploratory data analysis is and why it is fundamental to the two main goals of working with data: data-driven decision making and data products.
Most machine learning algorithms do not accept text data directly, the main exception being tree-based models. Beyond that, many of the techniques applied when cleaning and preprocessing data benefit from exploratory data analysis. Exploratory data analysis is the stage where one investigates the data to discover patterns and anomalies, using statistical summaries and graphical representations.
In terms of tools, Pandas comes in handy: its many methods make it easy to summarize, group, merge and sort the data. Combined with the plotting capabilities of Matplotlib and Seaborn, it lets us extract insights and act on the findings.
The data set used for this analysis comes from a marketing campaign conducted by a Portuguese banking institution to determine whether or not clients would subscribe to a term deposit.
The data was read with Pandas, after importing the other relevant libraries: NumPy, Matplotlib and Seaborn.
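As a rough illustration of this loading step, here is a minimal sketch. The semicolon separator, the column names and the sample rows are all assumptions made up for the example (they are not taken from the real file), and `load_bank_data` is a hypothetical helper name.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file; NOT actual campaign data.
# The ";" separator and the column names are assumptions for illustration.
SAMPLE_CSV = """age;job;month;duration;subscription
30;management;may;120;no
45;technician;may;300;yes
22;student;mar;45;no
51;management;jun;610;yes
38;unknown;may;90;no
"""


def load_bank_data(source):
    """Read the campaign data from a path or file-like object."""
    return pd.read_csv(source, sep=";")


# Swap the StringIO object for the path to the downloaded CSV file.
df = load_bank_data(io.StringIO(SAMPLE_CSV))

print(df.shape)   # number of rows and columns
print(df.dtypes)  # which columns are numeric and which are text
print(df.head())  # first few records
```

Checking `shape`, `dtypes` and `head()` right after loading is a quick way to confirm that the separator was guessed correctly and that the columns match the data set description.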
Exploratory data analysis began by confirming that the structure of the data matched the description in the repository. Beyond that, the overall framework covered checks for missing values (there are none explicitly, but there are implicit ones, since many observations are recorded as “unknown”), duplicated rows, descriptive statistics, and univariate and bivariate analysis combined with visualization techniques.
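The checks described above could look roughly like the sketch below. The small DataFrame is made up for illustration and its column names are assumptions; only the idea of counting explicit missing values, “unknown” placeholders and duplicated rows comes from the text.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "age": [30, 45, 45, 22, 51],
    "job": ["management", "unknown", "unknown", "student", "technician"],
    "education": ["tertiary", "secondary", "secondary", "unknown", "primary"],
    "subscription": ["no", "yes", "yes", "yes", "no"],
})

# Explicit missing values: NaN / None entries per column.
explicit_missing = df.isna().sum()

# Implicit missing values: cells holding the placeholder string "unknown".
implicit_missing = (df == "unknown").sum()

# Fully duplicated rows (every column identical to an earlier row).
n_duplicates = int(df.duplicated().sum())

# Descriptive statistics for the numeric columns.
summary = df.describe()

print(explicit_missing)
print(implicit_missing)
print(n_duplicates)
print(summary)
```

Counting the placeholder separately matters because `isna()` alone would report a clean data set even when many cells carry no real information.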
The univariate analysis is divided into two parts: numeric features first, then categorical features. For the numeric variables, it looks at the general distribution, uses box plots to spot outliers, comparing clients who subscribed (‘yes’) with those who did not (‘no’), and uses a point plot to compare the median of the two subscription categories.
There was a need to know when more subscriptions occurred, both in absolute and in relative values; in other words, what proportion of the total contacts in a month ended in a subscription. Looking at professions in absolute values, managers subscribed the most. On the other hand, students subscribed the most when each profession is analyzed in isolation, that is, relative to its own number of contacts.
When it comes to months, we can conclude that, in general, more contacts generate more subscriptions in absolute values. However, this does not seem to be an effective approach: when we look at subscriptions as a share of contacts per month, the months with fewer contacts show higher subscription rates.
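The absolute versus relative comparison can be sketched with a single grouped aggregation. The helper name `subscription_summary`, the column names and the rows below are assumptions invented for the example; the toy data is built to mimic the pattern described, not to reproduce the real figures.

```python
import pandas as pd


def subscription_summary(df, by, target="subscription"):
    """Contacts, subscriptions and subscription rate per category of `by`."""
    subscribed = df[target].eq("yes")
    summary = (
        df.assign(subscribed=subscribed)
          .groupby(by)["subscribed"]
          .agg(contacts="size", subscriptions="sum", rate="mean")
    )
    # Highest relative subscription rate first.
    return summary.sort_values("rate", ascending=False)


# Made-up rows for illustration only; NOT actual campaign data.
rows = [
    ("may", "management", "yes"),
    ("may", "management", "yes"),
    ("may", "management", "no"),
    ("may", "management", "no"),
    ("may", "management", "no"),
    ("may", "management", "no"),
    ("may", "student", "no"),
    ("mar", "management", "yes"),
    ("mar", "student", "yes"),
]
df = pd.DataFrame(rows, columns=["month", "job", "subscription"])

by_month = subscription_summary(df, "month")
by_job = subscription_summary(df, "job")

print(by_month)
print(by_job)
```

Keeping `contacts`, `subscriptions` and `rate` side by side makes it easy to see that a category can lead in absolute subscriptions while trailing in conversion rate.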
From the correlation matrix, there is almost no correlation among the numeric variables. The only exception is the correlation between ‘duration’ and ‘subscription’, as expected.
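One way to build such a matrix is sketched below, under the assumption that the target is stored as ‘yes’/‘no’ text and is mapped to 1/0 before computing Pearson correlations. The rows are made up, and the toy data was deliberately constructed so that `duration` tracks the outcome; it is not a result from the real data set.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "age": [30, 45, 22, 51, 38, 60],
    "duration": [60, 300, 90, 420, 120, 500],
    "subscription": ["no", "yes", "no", "yes", "no", "yes"],
})

# Encode the target as 0/1 so it can take part in the correlation matrix.
encoded = df.assign(subscription=df["subscription"].map({"no": 0, "yes": 1}))

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = encoded.select_dtypes(include="number").corr()

print(corr.round(2))
```

The resulting `corr` DataFrame can be passed to `sns.heatmap(corr, annot=True)` for the usual visual form of the matrix.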