Problem Statement

Build a model that predicts sales based on the money spent on different marketing platforms.

Use the advertising dataset given in ISLR and analyse the relationship between 'TV advertising' and 'sales' using a simple linear regression model.

Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Import the numpy and pandas package

import numpy as np
import pandas as pd

# Data Visualisation
import matplotlib.pyplot as plt 
import seaborn as sns

Datasets

In [2]:
advertising = pd.read_csv(r"data\datasets_advertising.csv")  # read_csv already returns a DataFrame
advertising.head()
Out[2]:
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 12.0
3 151.5 41.3 58.5 16.5
4 180.8 10.8 58.4 17.9
In [3]:
advertising.sample(frac = 1)  # shuffle the rows for a quick look at a random cross-section
Out[3]:
TV Radio Newspaper Sales
67 139.3 14.5 10.2 13.4
32 97.2 1.5 30.0 13.2
33 265.6 20.0 0.3 17.4
199 232.1 8.6 8.7 18.4
179 165.6 10.0 17.6 17.6
... ... ... ... ...
21 237.4 5.1 23.5 17.5
10 66.1 5.8 24.2 12.6
136 25.6 39.0 9.3 9.5
5 8.7 48.9 75.0 7.2
131 265.2 2.9 43.0 17.7

200 rows × 4 columns

Based on your knowledge of machine learning, please perform the following tasks:

Data Inspection

Data Cleaning

Exploratory Data Analysis

Performing Simple Linear Regression Model

Model Evaluation

**Notice:** Please document your code blocks, add a description to each task, and explain your methodology and results in detail.

In [4]:
#Importing my libraries

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn import model_selection
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
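
Before the summary statistics, a quick inspection and cleaning pass covers the first two tasks; the checks below are a minimal sketch (outputs not shown here):

In [ ]:
# Basic inspection: dtypes, non-null counts, missing values, duplicates
advertising.info()
print(advertising.isnull().sum())
print("duplicate rows:", advertising.duplicated().sum())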

Summary statistics to gain a first impression:

In [5]:
advertising.describe()
Out[5]:
TV Radio Newspaper Sales
count 200.000000 200.000000 200.000000 200.000000
mean 147.042500 23.264000 30.554000 15.130500
std 85.854236 14.846809 21.778621 5.283892
min 0.700000 0.000000 0.300000 1.600000
25% 74.375000 9.975000 12.750000 11.000000
50% 149.750000 22.900000 25.750000 16.000000
75% 218.825000 36.525000 45.100000 19.050000
max 296.400000 49.600000 114.000000 27.000000
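
To complement the summary table, pairwise correlations show how strongly each channel tracks Sales (a quick sketch; values not reproduced here):

In [ ]:
# Correlation of each column with Sales, strongest first
advertising.corr()['Sales'].sort_values(ascending=False)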

A visual overview of the scale and spread of the features (as well as the target variable, Sales):

In [6]:
# numpy, pandas, matplotlib and seaborn are already imported above

# Melt the DataFrame into long form so each column gets its own box
sns.boxplot(x="variable", y="value", data=pd.melt(advertising))
plt.show()

By far the largest spending occurs on TV advertisements, which is also the feature with the largest spread. For an accurate linear regression the explanatory variable may need to be scaled, and a log transformation may also help; I will investigate both below.

In [22]:
sns.lmplot(y="Sales", x="TV", height=6, palette="rainbow_r", data=advertising)
In [7]:
sns.relplot(x="TV", y="Sales", palette="rainbow_r", data = advertising)
Out[7]:
<seaborn.axisgrid.FacetGrid at 0x1ff2d6f2208>

The relationship does not look entirely linear; taking the log of both variables should help straighten it.

In [23]:
# Log-transform TV and Sales; both are strictly positive (see describe() above), so log is safe
advertising['log_tv'] = np.log(advertising['TV'])
advertising['log_sales'] = np.log(advertising['Sales'])
In [24]:
sns.relplot(x="log_tv", y="log_sales", palette="rainbow_r", data=advertising)
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x1ff35b8a948>

Better: on the log-log scale the relationship looks much closer to linear.
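
To quantify the improvement rather than eyeball it, the Pearson correlations on the raw and the log-log scale can be compared (a sketch; values not reproduced here):

In [ ]:
# Correlation on the raw scale vs the log-log scale
print(advertising['TV'].corr(advertising['Sales']))
print(advertising['log_tv'].corr(advertising['log_sales']))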

I will run the regression on both the original and the log-transformed data and compare the results.

In [25]:
# Start with the untransformed feature and target
X = advertising['TV']
y = advertising['Sales']
In [26]:
# sklearn estimators expect 2-D arrays
X = np.asarray(X).reshape(-1, 1)
y = np.asarray(y).reshape(-1, 1)
In [27]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, 
                    test_size = 0.3, random_state=0) 
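
The train/test split above is not actually consumed by the cross-validation cells that follow; as a quick holdout check (a sketch, assuming the split above; outputs not shown), the simple model can be fitted and scored directly:

In [ ]:
from sklearn.metrics import r2_score

# Fit simple linear regression on the training split, score on the holdout
lr = LinearRegression()
lr.fit(X_train, y_train)
print("slope:", lr.coef_[0][0], "intercept:", lr.intercept_[0])
print("holdout R^2:", r2_score(y_test, lr.predict(X_test)))
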
5-fold cross-validation on the unscaled, untransformed data leaves room for improvement:

I also tried a few other regressors for comparison, but will not tune their hyperparameters, as simple linear regression was requested.

In [28]:
# 5-fold cross-validation (the sklearn default for cross_val_score)

# prepare models
models = []
models.append(('LR', LinearRegression()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('XGB', xgb.XGBRegressor(verbosity=0)))
models.append(('LassoR', Lasso()))

# evaluate each model in turn; the default scoring for regressors is R^2
results = []
names = []
for name, model in models:
    cv_results = model_selection.cross_val_score(model, X, y, cv=5)
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
LR: 0.806124 (0.032621)
KNN: 0.801715 (0.031889)
XGB: 0.714438 (0.057179)
LassoR: 0.806133 (0.032375)
In [35]:
from sklearn import preprocessing 
  
""" MIN MAX SCALER """
  
min_max_scaler = preprocessing.MinMaxScaler(feature_range =(0, 1)) 
  
# Scaled feature 
x_after_min_max_scaler = min_max_scaler.fit_transform(X) 
  
#print ("\nAfter min max Scaling : \n", x_after_min_max_scaler) 

X = x_after_min_max_scaler
  
""" Standardisation """
  
Standardisation = preprocessing.StandardScaler() 
  
# Scaled feature 
x_after_Standardisation = Standardisation.fit_transform(X) 
  
#print ("\nAfter Standardisation : \n", x_after_Standardisation) 

X = x_after_Standardisation
In [36]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, 
                    test_size = 0.3, random_state=0) 
In [37]:
# 5-fold cross-validation (the sklearn default for cross_val_score)

# prepare models
models = []
models.append(('LR', LinearRegression()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('XGB', xgb.XGBRegressor(verbosity=0)))
models.append(('LassoR', Lasso()))

# evaluate each model in turn; the default scoring for regressors is R^2
results = []
names = []
for name, model in models:
    cv_results = model_selection.cross_val_score(model, X, y, cv=5)
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
LR: 0.806124 (0.032621)
KNN: 0.801715 (0.031889)
XGB: 0.714438 (0.057179)
LassoR: 0.770638 (0.015585)

Rescaling cannot fix the not-quite-linear relationship: the R² of linear regression is invariant to affine rescaling of the feature (note that only the scale-sensitive Lasso score changed above), so I will use the log transformation instead.
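
A quick way to verify this invariance (a sketch; outputs not shown): the in-sample R² of a linear fit is identical on the raw and on the standardised feature, because the slope and intercept absorb any affine rescaling.

In [ ]:
from sklearn.preprocessing import StandardScaler

X_raw = advertising[['TV']].values
y_raw = advertising[['Sales']].values
X_std = StandardScaler().fit_transform(X_raw)

# In-sample R^2 is the same for raw and standardised features
print(LinearRegression().fit(X_raw, y_raw).score(X_raw, y_raw))
print(LinearRegression().fit(X_std, y_raw).score(X_std, y_raw))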

In [38]:
# Switch to the log-transformed feature and target
X = advertising['log_tv']
y = advertising['log_sales']

X = np.asarray(X).reshape(-1, 1)
y = np.asarray(y).reshape(-1, 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, 
                    test_size = 0.3, random_state=0) 
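
As before, the split itself is not used by the cross-validation cell below; a holdout check on the log-log model (a sketch; outputs not shown), including back-transforming predictions to the original Sales scale, could look like:

In [ ]:
from sklearn.metrics import r2_score, mean_squared_error

log_lr = LinearRegression().fit(X_train, y_train)
print("holdout R^2 (log scale):", r2_score(y_test, log_lr.predict(X_test)))

# Back-transform to the original Sales scale for an interpretable error
sales_pred = np.exp(log_lr.predict(X_test))
sales_true = np.exp(y_test)
print("holdout RMSE (units of Sales):",
      np.sqrt(mean_squared_error(sales_true, sales_pred)))
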
In [39]:
# 5-fold cross-validation (the sklearn default for cross_val_score)

# prepare models
models = []
models.append(('LR', LinearRegression()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('XGB', xgb.XGBRegressor(verbosity=0)))

# evaluate each model in turn; the default scoring for regressors is R^2
results = []
names = []
for name, model in models:
    cv_results = model_selection.cross_val_score(model, X, y, cv=5)
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
LR: 0.856337 (0.040541)
KNN: 0.818649 (0.011925)
XGB: 0.756443 (0.041011)

As expected, the log transformation improved the cross-validated R² of simple linear regression from 0.806 to 0.856 under 5-fold cross-validation, an absolute gain of about 0.05.
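
One useful property of the log-log model: the fitted slope reads as an elasticity, i.e. the approximate percentage change in Sales per 1% change in TV spend. A sketch to extract the coefficients (values not reproduced here):

In [ ]:
# X and y are the log-transformed arrays from above
final_lr = LinearRegression().fit(X, y)
print("elasticity (slope):", final_lr.coef_[0][0])
print("intercept:", final_lr.intercept_[0])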
