Lasso Regression on Income per Person
Introduction
In the Gapminder data, income per person is analyzed in relation to oil consumption, CO2 emissions, internet use rate, and other indicators.
Because these features are related to social welfare, they may be correlated with income per person.
Getting and Preparing Data
oilperperson: oil consumption per capita
co2emissions: CO2 emissions
internetuserate: Internet users (per 100 people)
lifeexpectancy: life expectancy at birth (years)
polityscore: democracy score minus autocracy score (Polity IV)
relectricperperson: residential electricity consumption per person
urbanrate: urban population (% of total)
employrate: percentage of the population aged 15 and above that is employed
The target variable, incomeperperson, is binned into ordinal income brackets (12 bin edges, 11 categories) with pandas.cut and converted to category codes, as sketched below.
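A minimal sketch of the binning step on toy values, using the same bin edges as the full script below:
import pandas
incomes = pandas.Series([450, 1200, 5500, 42000, 150000])
bins = [300, 800, 1300, 2500, 4000, 6000, 10000, 15000, 30000, 50000, 100000, 200000]
# pandas.cut assigns each value to an interval; cat.codes turns the
# intervals into ordinal codes 0-10 (values outside the bins become -1)
codes = pandas.cut(incomes, bins).cat.codes
print(codes.tolist())  # [0, 1, 4, 8, 10]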
Data Modelling
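The model is estimated on a 70/30 train/test split, with scikit-learn's LassoLarsCV choosing the penalty parameter alpha by 10-fold cross-validation. The key lines from the full script below are:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)
# fit the lasso model, selecting alpha by 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)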
Results
List of the Regression Coefficients
dict(zip(predictors.columns, model.coef_))
- 'oilperperson': 0.10345142697286504,
- 'co2emissions': 0.0,
- 'internetuserate': 0.05437449227627304,
- 'lifeexpectancy': -0.004915740617536609,
- 'polityscore': 0.0,
- 'relectricperperson': 8.150909699388816e-05,
- 'urbanrate': 0.04452277720162526,
- 'employrate': -0.013997857746155112
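Note that lasso shrank the coefficients for co2emissions and polityscore to exactly zero, effectively dropping those predictors from the model. A minimal sketch for listing only the retained predictors, using the fitted model above:
# keep only the predictors lasso retained (nonzero coefficients)
retained = {name: coef for name, coef in zip(predictors.columns, model.coef_) if coef != 0}
print(retained)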
Coefficient Plot
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
This plot shows the relative importance of each predictor selected during the selection process, that is, how the regression coefficients changed as a new predictor entered the model at each step.
The blue-green line, oilperperson, had the largest regression coefficient and was therefore entered into the model first, followed by internetuserate (the purple line) at step two, urbanrate (the grey line) at step three, and so on.
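The entry order can also be recovered programmatically from coef_path_; a minimal sketch, assuming the fitted model from the script below:
import numpy as np
# path index at which each predictor first becomes nonzero;
# index 0 is the all-zero start of the LARS path, so 0 means "never entered"
first_nonzero = (model.coef_path_ != 0).argmax(axis=1)
for name, step in sorted(zip(predictors.columns, first_nonzero), key=lambda t: t[1]):
    print(step, name)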
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
training data MSE = 1.79
test data MSE = 2.60
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
training data R-square = 0.79
test data R-square = 0.71
The R-square values were 0.79 and 0.71, indicating that the selected model explained 79% and 71% of the variance in income per person for the training and test sets, respectively.
The drop from training to test suggests mild overfitting: prediction accuracy declined somewhat when the model was applied to data it had not seen.
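One way to probe how stable this accuracy estimate is would be to cross-validate R-square over the full data set; a sketch using scikit-learn's cross_val_score (an addition, not part of the original analysis):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LassoLarsCV
# 10-fold cross-validated R-square; the spread across folds indicates stability
scores = cross_val_score(LassoLarsCV(cv=10, precompute=False),
                         predictors, target, cv=10, scoring='r2')
print(scores.mean(), scores.std())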
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 14 16:26:46 2015
@author: jrose01
"""
#from pandas import Series, DataFrame
import pandas
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
data = pandas.read_csv('data/gapminder.csv', low_memory=False)
data = data[["incomeperperson", "oilperperson", "co2emissions", "internetuserate",
"lifeexpectancy", "polityscore", "relectricperperson", "urbanrate", "employrate"]]
def SetColumns(cols):
    # coerce each listed column to numeric, treating blank strings as 0
    for col in cols:
        data[col] = data[col].replace(' ', 0)
        data[col] = data[col].astype(float, errors='raise')
Columns = ["incomeperperson", "oilperperson", "co2emissions", "internetuserate",
"lifeexpectancy", "polityscore", "relectricperperson", "urbanrate", "employrate"]
# keep only rows with a valid income value
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data = data.dropna(subset=['incomeperperson'])
# bin income into ordinal categories (values outside the bin edges get code -1)
data['incomeperperson'] = pandas.cut(data.incomeperperson,
    [300, 800, 1300, 2500, 4000, 6000, 10000, 15000, 30000, 50000, 100000, 200000]).cat.codes
SetColumns(Columns)
data_clean=data.dropna()
Columns.remove('incomeperperson')  # list.remove works in place and returns None
predictors = data_clean[["oilperperson", "co2emissions", "internetuserate",
"lifeexpectancy", "polityscore", "relectricperperson", "urbanrate", "employrate"]]
target = data_clean.incomeperperson
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
test_size=.3, random_state=123)
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)
# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)
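One caveat: lasso penalizes coefficients by absolute size, so predictors measured on very different scales (e.g., relectricperperson versus urbanrate) are penalized unevenly, and the script above does not standardize them. A sketch of how standardization could be added before the train/test split, using sklearn's preprocessing.scale (an addition, not part of the original analysis):
from sklearn import preprocessing
# rescale each predictor to mean 0 and standard deviation 1
predictors = pandas.DataFrame(preprocessing.scale(predictors.astype('float64')),
                              columns=predictors.columns, index=predictors.index)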


