Lasso Regression on Income per Person
Introduction
In the Gapminder data, income per person is analyzed in relation to oil consumption, CO2 emissions, internet use rate, and other indicators.
Because these features are related to social welfare, they may be correlated with income per person.
Getting and Preparing Data
oilperperson: oil consumption per capita
co2emissions: CO2 emissions
internetuserate: Internet users (per 100 people)
lifeexpectancy: life expectancy at birth (years)
polityscore: democracy score minus autocracy score (Polity IV)
relectricperperson: residential electricity consumption per person
urbanrate: urban population (% of total)
employrate: percentage of the population aged 15 and above that is employed
The target variable, incomeperperson, is binned into ordinal income brackets (12 bin edges, 11 categories) with pandas.cut and converted to category codes, as sketched below.
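A minimal sketch of the binning step on toy values, using the same bin edges as the full script below:
import pandas
incomes = pandas.Series([450, 1200, 5500, 42000, 150000])
bins = [300, 800, 1300, 2500, 4000, 6000, 10000, 15000, 30000, 50000, 100000, 200000]
# pandas.cut assigns each value to an interval; cat.codes turns the
# intervals into ordinal codes 0-10 (values outside the bins become -1)
codes = pandas.cut(incomes, bins).cat.codes
print(codes.tolist())  # [0, 1, 4, 8, 10]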
Data Modelling
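The model is estimated on a 70/30 train/test split, with scikit-learn's LassoLarsCV choosing the penalty parameter alpha by 10-fold cross-validation. The key lines from the full script below are:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)
# fit the lasso model, selecting alpha by 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)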
Results
List of the Regression Coefficients
dict(zip(predictors.columns, model.coef_))
- 'oilperperson': 0.10345142697286504,
- 'co2emissions': 0.0,
- 'internetuserate': 0.05437449227627304,
- 'lifeexpectancy': -0.004915740617536609,
- 'polityscore': 0.0,
- 'relectricperperson': 8.150909699388816e-05,
- 'urbanrate': 0.04452277720162526,
- 'employrate': -0.013997857746155112
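Note that lasso shrank the coefficients for co2emissions and polityscore to exactly zero, effectively dropping those predictors from the model. A minimal sketch for listing only the retained predictors, using the fitted model above:
# keep only the predictors lasso retained (nonzero coefficients)
retained = {name: coef for name, coef in zip(predictors.columns, model.coef_) if coef != 0}
print(retained)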
Coefficient Plot
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
This plot shows the relative importance of each predictor selected during the selection process, that is, how the regression coefficients changed as a new predictor entered the model at each step.
The blue-green line, oilperperson, had the largest regression coefficient and was therefore entered into the model first, followed by internetuserate (the purple line) at step two, urbanrate (the grey line) at step three, and so on.
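The entry order can also be recovered programmatically from coef_path_; a minimal sketch, assuming the fitted model from the script below:
import numpy as np
# path index at which each predictor first becomes nonzero;
# index 0 is the all-zero start of the LARS path, so 0 means "never entered"
first_nonzero = (model.coef_path_ != 0).argmax(axis=1)
for name, step in sorted(zip(predictors.columns, first_nonzero), key=lambda t: t[1]):
    print(step, name)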
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
training data MSE = 1.79
test data MSE = 2.60
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
training data R-square = 0.79
test data R-square = 0.71
The R-square values were 0.79 and 0.71, indicating that the selected model explained 79% and 71% of the variance in income per person for the training and test sets, respectively.
The drop from training to test suggests mild overfitting: prediction accuracy declined somewhat when the model was applied to data it had not seen.
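One way to probe how stable this accuracy estimate is would be to cross-validate R-square over the full data set; a sketch using scikit-learn's cross_val_score (an addition, not part of the original analysis):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LassoLarsCV
# 10-fold cross-validated R-square; the spread across folds indicates stability
scores = cross_val_score(LassoLarsCV(cv=10, precompute=False),
                         predictors, target, cv=10, scoring='r2')
print(scores.mean(), scores.std())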
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 14 16:26:46 2015
@author: jrose01
"""
#from pandas import Series, DataFrame
import pandas
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
data = pandas.read_csv('data/gapminder.csv', low_memory=False)
data = data[["incomeperperson", "oilperperson", "co2emissions", "internetuserate",
"lifeexpectancy", "polityscore", "relectricperperson", "urbanrate", "employrate"]]
def SetColumns(cols):
    # coerce each listed column to numeric, treating blank strings as 0
    for col in cols:
        data[col] = data[col].replace(' ', 0)
        data[col] = data[col].astype(float, errors='raise')
Columns = ["incomeperperson", "oilperperson", "co2emissions", "internetuserate",
"lifeexpectancy", "polityscore", "relectricperperson", "urbanrate", "employrate"]
# keep only rows with a valid income value
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data = data.dropna(subset=['incomeperperson'])
# bin income into ordinal categories (values outside the bin edges get code -1)
data['incomeperperson'] = pandas.cut(data.incomeperperson,
    [300, 800, 1300, 2500, 4000, 6000, 10000, 15000, 30000, 50000, 100000, 200000]).cat.codes
SetColumns(Columns)
data_clean=data.dropna()
Columns.remove('incomeperperson')  # list.remove works in place and returns None
predictors = data_clean[["oilperperson", "co2emissions", "internetuserate",
"lifeexpectancy", "polityscore", "relectricperperson", "urbanrate", "employrate"]]
target = data_clean.incomeperperson
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
test_size=.3, random_state=123)
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)
# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)
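One caveat: lasso penalizes coefficients by absolute size, so predictors measured on very different scales (e.g., relectricperperson versus urbanrate) are penalized unevenly, and the script above does not standardize them. A sketch of how standardization could be added before the train/test split, using sklearn's preprocessing.scale (an addition, not part of the original analysis):
from sklearn import preprocessing
# rescale each predictor to mean 0 and standard deviation 1
predictors = pandas.DataFrame(preprocessing.scale(predictors.astype('float64')),
                              columns=predictors.columns, index=predictors.index)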


