Banking marketing data set — Exploratory Data Analysis in Python

Leide Manuel Soares
4 min read · Dec 8, 2020
Design with Random image from Internet

Before we dive into code, let's look at what exploratory data analysis is and why it is fundamental to the two goals of working with data: being data-driven and building data products.

Most machine learning algorithms do not accept text data directly; the main exceptions are some tree-based models. Beyond this, many of the techniques applied when cleaning and preprocessing data benefit from exploratory data analysis. In exploratory data analysis, one investigates the data to discover patterns and anomalies, using statistical summaries and graphical representations.

In terms of tools, Pandas and its numerous methods come in handy for summarizing, grouping, merging and sorting the data. Combined with graphical representations from Matplotlib and Seaborn, it lets us gain insight and act on the findings.

The data set used for this analysis belongs to a marketing campaign conducted by a Portuguese banking institution to determine whether or not clients would make a subscription to a term deposit.

The data was read with Pandas, as shown below, after importing the relevant libraries: NumPy, Matplotlib and Seaborn.

Initial code for the data analysis.
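As a minimal sketch of this loading step: the snippet below reads a few toy rows from an in-memory string instead of the real file. The semicolon separator, the column names and the values are all assumptions made for illustration, not taken from the repository.

```python
import io

import pandas as pd

# Toy rows standing in for the real file; the values are made up.
# The ";" separator and the column names are assumptions about the layout.
raw = io.StringIO(
    "age;job;balance;duration;subscription\n"
    "58;management;2143;261;no\n"
    "44;technician;29;151;no\n"
    "33;unknown;2;76;yes\n"
)

# With the real data, pass the path of the downloaded CSV instead of `raw`.
df = pd.read_csv(raw, sep=";")

# First look at the structure: size, column types and a few rows.
print(df.shape)
print(df.dtypes)
print(df.head())
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the separator was parsed correctly and that numeric columns were not read as text.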

Exploratory data analysis started by confirming that the structure of the data matches the description in the repository. Besides that, the overall framework covered checks for missing values (explicitly there are none; implicitly there are, since many observations are recorded as “unknown”), duplicates, descriptive statistics (below), and univariate and bivariate analysis combined with visualization techniques.
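The checks described above could look roughly like this. It is only a sketch on a toy DataFrame: the column names and values are invented for illustration, and treating the string "unknown" as an implicit missing value is the assumption stated in the text.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; column names are assumptions.
df = pd.DataFrame({
    "age": [30, 45, 45, 52],
    "job": ["student", "unknown", "unknown", "management"],
    "education": ["secondary", "tertiary", "tertiary", "unknown"],
    "subscription": ["yes", "no", "no", "no"],
})

# Explicit missing values (NaN) per column.
explicit_missing = df.isna().sum()

# Implicit missing values: the literal string "unknown" in non-numeric columns.
text_cols = df.select_dtypes(exclude=np.number)
implicit_missing = (text_cols == "unknown").sum()

# Number of fully duplicated rows.
n_duplicates = int(df.duplicated().sum())

# Descriptive statistics, numeric and categorical columns separately.
numeric_summary = df.describe()
categorical_summary = df.describe(exclude=np.number)

print(explicit_missing, implicit_missing, n_duplicates, sep="\n\n")
print(numeric_summary)
print(categorical_summary)
```

Counting "unknown" separately from `isna()` matters because Pandas only flags real NaN values; a placeholder string passes silently through the explicit check.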

Descriptive Statistics of the numerical variables
Descriptive Statistics of the categorical variables

The univariate analysis is divided into two parts: numeric features first, then categorical features. For the numeric variables, it checks the general distribution; box plots for outliers, comparing clients who subscribed (‘yes’) with those who did not (‘no’); and a point plot to see the median of both subscription categories.

Code to plot graphs below
Global distribution (left); outliers visualization by subscription (middle); median by subscription categories (right)
Code to plot graphs below
Global distribution of categories (left); distribution by subscription (right)

We needed to know where subscriptions were highest in absolute terms as well as in relative terms, in other words, what proportion of the total contacts in a given month or profession subscribed. In absolute terms, managers were the profession with the most subscriptions. On the other hand, students subscribed the most when each profession is analyzed in isolation.
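The difference between the two views can be sketched with `pd.crosstab`. The toy data below is invented so that it mirrors the pattern described (one profession leading in absolute counts, another in rate); the column names `job` and `subscription` are assumptions.

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumptions.
df = pd.DataFrame({
    "job": ["management"] * 6 + ["student"] * 2 + ["technician"] * 4,
    "subscription": (
        ["yes", "yes", "no", "no", "no", "no"]   # management: 2 of 6
        + ["yes", "no"]                          # student: 1 of 2
        + ["yes", "no", "no", "no"]              # technician: 1 of 4
    ),
})

# Absolute view: number of "yes" and "no" outcomes per profession.
counts = pd.crosstab(df["job"], df["subscription"])

# Share of all subscriptions that each profession contributes.
share_of_subscriptions = counts["yes"] / counts["yes"].sum()

# Relative view: subscription rate inside each profession (rows sum to 1).
rate_within_job = pd.crosstab(
    df["job"], df["subscription"], normalize="index"
)["yes"]

print(counts)
print(share_of_subscriptions)
print(rate_within_job)
```

In this toy example management contributes the most subscriptions overall, while students have the highest rate within their own group, which is why both views are worth plotting side by side.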

Code to plot graphs below

Occupation contacts / subscription: percentage of subscription (left); percentage for each profession (right)

When it comes to months, more contacts generally produce more subscriptions in absolute terms. However, this does not seem to be an effective approach: when we look at subscriptions as a share of the contacts made in each month, the months with fewer contacts show a higher proportion of subscriptions.
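One way to put contacts, subscriptions and conversion rate per month in a single table is a grouped aggregation. This is a sketch on invented data, with `month` and `subscription` as assumed column names.

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumptions.
df = pd.DataFrame({
    "month": ["may"] * 10 + ["mar"] * 4,
    "subscription": ["yes"] * 2 + ["no"] * 8 + ["yes"] * 2 + ["no"] * 2,
})

# Boolean flag: True when the contact ended in a subscription.
df["subscribed"] = df["subscription"].eq("yes")

# Per month: number of contacts, number of subscriptions, and their ratio.
monthly = df.groupby("month")["subscribed"].agg(
    contacts="count",
    subscriptions="sum",
    rate="mean",
)

print(monthly.sort_values("rate", ascending=False))
```

Here both months produce the same number of subscriptions, but the month with fewer contacts converts at a much higher rate, which is the kind of contrast the absolute numbers alone would hide.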

Monthly contacts / subscription: absolute percentage of subscription in a year (left); percentage in a month (right)

The correlation matrix shows almost no correlation among the numeric variables. The only exception is the correlation between ‘duration’ and ‘subscription’, as expected.
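For the target to appear in a correlation matrix it has to be numeric, so a yes/no column needs to be encoded first. The sketch below assumes a 1/0 encoding and uses synthetic data in which the outcome is deliberately built to depend on `duration`; it illustrates the mechanics only and says nothing about the real values.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names are assumptions.
rng = np.random.default_rng(0)
n = 500
duration = rng.exponential(scale=250, size=n)
age = rng.integers(18, 90, size=n)

# The outcome is constructed from duration plus noise, on purpose.
noisy_duration = duration + rng.normal(0, 100, size=n)
subscription = np.where(noisy_duration > 400, "yes", "no")

df = pd.DataFrame({"age": age, "duration": duration, "subscription": subscription})

# Encode the yes/no target as 1/0 so it can enter the correlation matrix.
df["subscribed"] = df["subscription"].map({"yes": 1, "no": 0})

# Pearson correlation between the numeric columns.
corr = df[["age", "duration", "subscribed"]].corr()
print(corr.round(2))
```

Passing `corr` to Seaborn's `sns.heatmap(corr, annot=True)` gives the usual colored matrix view.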

Correlation Matrix
