Linear Regression Sacramento



Simple Linear Regression with Sacramento Real Estate Data

Authors: Matt Brems, Sam Stack, Justin Pounders


In this lab you will hone your exploratory data analysis (EDA) skills and practice constructing simple linear regressions using a data set on Sacramento real estate sales. The data set contains information on qualities of the property, location of the property, and time of sale.

1. Read in the Sacramento housing data set.

sac_csv = './datasets/sacramento_real_estate_transactions.csv'
type(sac_csv)
str
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model
import statsmodels.api as sm


% matplotlib inline 
#shows plotting in notebook
# A: 
sc = pd.read_csv(sac_csv)
type(sc)
pandas.core.frame.DataFrame

2. Conduct exploratory data analysis on this data set.

Report any notable findings here and any steps you take to clean/process data.

Note: These EDA checks should be done on every data set you handle. If you find yourself checking repeatedly for missing/corrupted data, it might be beneficial to have a function that you can reuse every time you’re given new data.

# A:
sc.head(5)
street city zip state beds baths sq__ft type sale_date price latitude longitude
0 3526 HIGH ST SACRAMENTO 95838 CA 2 1 836 Residential Wed May 21 00:00:00 EDT 2008 59222 38.631913 -121.434879
1 51 OMAHA CT SACRAMENTO 95823 CA 3 1 1167 Residential Wed May 21 00:00:00 EDT 2008 68212 38.478902 -121.431028
2 2796 BRANCH ST SACRAMENTO 95815 CA 2 1 796 Residential Wed May 21 00:00:00 EDT 2008 68880 38.618305 -121.443839
3 2805 JANETTE WAY SACRAMENTO 95815 CA 2 1 852 Residential Wed May 21 00:00:00 EDT 2008 69307 38.616835 -121.439146
4 6001 MCMAHON DR SACRAMENTO 95824 CA 2 1 797 Residential Wed May 21 00:00:00 EDT 2008 81900 38.519470 -121.435768
# A:
type(sc.head(10))
pandas.core.frame.DataFrame
# check null values
sc.isnull().sum()
street       0
city         0
zip          0
state        0
beds         0
baths        0
sq__ft       0
type         0
sale_date    0
price        0
latitude     0
longitude    0
dtype: int64
# checking if the states match to CA
sc['state'].unique()
array(['CA', 'AC'], dtype=object)
# checking the types that are available 
sc['type'].unique()
array(['Residential', 'Condo', 'Multi-Family', 'Unkown'], dtype=object)
# checking datatypes, null values, columns and if all entries are there 
sc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 985 entries, 0 to 984
Data columns (total 12 columns):
street       985 non-null object
city         985 non-null object
zip          985 non-null int64
state        985 non-null object
beds         985 non-null int64
baths        985 non-null int64
sq__ft       985 non-null int64
type         985 non-null object
sale_date    985 non-null object
price        985 non-null int64
latitude     985 non-null float64
longitude    985 non-null float64
dtypes: float64(2), int64(5), object(5)
memory usage: 92.4+ KB
# checking statistics for columns 
# notice that there is negative values for price, sq_ft
#seen that there are houses that have no bedrooms, or baths
sc.describe()
zip beds baths sq__ft price latitude longitude
count 985.000000 985.000000 985.000000 985.000000 985.000000 985.000000 985.000000
mean 95750.697462 2.911675 1.776650 1312.918782 233715.951269 38.445121 -121.193371
std 85.176072 1.307932 0.895371 856.123224 139088.818896 5.103637 5.100670
min 95603.000000 0.000000 0.000000 -984.000000 -210944.000000 -121.503471 -121.551704
25% 95660.000000 2.000000 1.000000 950.000000 145000.000000 38.482704 -121.446119
50% 95762.000000 3.000000 2.000000 1304.000000 213750.000000 38.625932 -121.375799
75% 95828.000000 4.000000 2.000000 1718.000000 300000.000000 38.695589 -121.294893
max 95864.000000 8.000000 5.000000 5822.000000 884790.000000 39.020808 38.668433
# found row that have negative price, which also inclued negaitive Sq_ft and state is wrong 
sc[sc['price'] <0]
street city zip state beds baths sq__ft type sale_date price latitude longitude
703 1900 DANBROOK DR SACRAMENTO 95835 AC 1 1 -984 Condo Fri May 16 00:00:00 EDT 2008 -210944 -121.503471 38.668433
# seeing how many values are negative 
(sc['price'] <0).value_counts()
False    984
True       1
Name: price, dtype: int64
# Decided to correct state, and take the negaive off te sq-ft and price 
# keeping the row
sc.loc[703,'state'] = 'CA'
sc.loc[703,'sq__ft'] = 984
sc.loc[703,'price'] = 210944
# checking value to make sure changes was made
sc.loc[703]
street                   1900 DANBROOK DR
city                           SACRAMENTO
zip                                 95835
state                                  CA
beds                                    1
baths                                   1
sq__ft                                984
type                                Condo
sale_date    Fri May 16 00:00:00 EDT 2008
price                              210944
latitude                         -121.503
longitude                         38.6684
Name: 703, dtype: object
# checked to see how mand houses with no bedrooms
#looks like the houses with no bedrooms also have no baths and no sq_ft
#kept rows because they must be land available to build 
sc[sc['beds'] == 0]
street city zip state beds baths sq__ft type sale_date price latitude longitude
73 17 SERASPI CT SACRAMENTO 95834 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 206000 38.631481 -121.501880
89 2866 KARITSA AVE SACRAMENTO 95833 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 244500 38.626671 -121.525970
100 12209 CONSERVANCY WAY RANCHO CORDOVA 95742 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 263500 38.553867 -121.219141
121 5337 DUSTY ROSE WAY RANCHO CORDOVA 95742 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 320000 38.528575 -121.228600
126 2115 SMOKESTACK WAY SACRAMENTO 95833 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 339500 38.602416 -121.542965
133 8082 LINDA ISLE LN SACRAMENTO 95831 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 370000 38.477200 -121.521500
147 9278 DAIRY CT ELK GROVE 95624 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 445000 38.420338 -121.363757
153 868 HILDEBRAND CIR FOLSOM 95630 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 585000 38.670947 -121.097727
169 14788 NATCHEZ CT RANCHO MURIETA 95683 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 97750 38.492287 -121.100032
192 5201 LAGUNA OAKS DR Unit 126 ELK GROVE 95758 CA 0 0 0 Condo Tue May 20 00:00:00 EDT 2008 145000 38.423251 -121.444489
234 3139 SPOONWOOD WAY Unit 1 SACRAMENTO 95833 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 215500 38.626582 -121.521510
236 2340 HURLEY WAY SACRAMENTO 95825 CA 0 0 0 Condo Tue May 20 00:00:00 EDT 2008 225000 38.588816 -121.408549
248 611 BLOSSOM ROCK LN FOLSOM 95630 CA 0 0 0 Condo Tue May 20 00:00:00 EDT 2008 240000 38.645700 -121.119700
249 8830 ADUR RD ELK GROVE 95624 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 242000 38.437420 -121.372876
253 221 PICASSO CIR SACRAMENTO 95835 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 250000 38.676658 -121.528128
265 230 BANKSIDE WAY SACRAMENTO 95835 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 270000 38.676937 -121.529244
268 4236 ADRIATIC SEA WAY SACRAMENTO 95834 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 270000 38.647961 -121.543162
279 11281 STANFORD COURT LN Unit 604 GOLD RIVER 95670 CA 0 0 0 Condo Tue May 20 00:00:00 EDT 2008 300000 38.625289 -121.260286
285 3224 PARKHAM DR ROSEVILLE 95747 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 306500 38.772771 -121.364877
286 15 VANESSA PL SACRAMENTO 95835 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 312500 38.668692 -121.545490
308 5404 ALMOND FALLS WAY RANCHO CORDOVA 95742 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 425000 38.527502 -121.233492
310 14 CASA VATONI PL SACRAMENTO 95834 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 433500 38.650221 -121.551704
324 201 FIRESTONE DR ROSEVILLE 95678 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 500500 38.770153 -121.300039
326 2733 DANA LOOP EL DORADO HILLS 95762 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 541000 38.628459 -121.055078
327 9741 SADDLEBRED CT WILTON 95693 CA 0 0 0 Residential Tue May 20 00:00:00 EDT 2008 560000 38.408841 -121.198039
469 8491 CRYSTAL WALK CIR ELK GROVE 95758 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 261000 38.416916 -121.407554
477 6286 LONETREE BLVD ROCKLIN 95765 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 274500 38.805036 -121.293608
494 3072 VILLAGE PLAZA DR ROSEVILLE 95747 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 307000 38.773094 -121.365905
503 12241 CANYONLANDS DR RANCHO CORDOVA 95742 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 331500 38.557293 -121.217611
505 907 RIO ROBLES AVE SACRAMENTO 95838 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 344755 38.664765 -121.445006
... ... ... ... ... ... ... ... ... ... ... ... ...
600 7 CRYSTALWOOD CIR LINCOLN 95648 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 4897 38.885962 -121.289436
601 7 CRYSTALWOOD CIR LINCOLN 95648 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 4897 38.885962 -121.289436
602 3 CRYSTALWOOD CIR LINCOLN 95648 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 4897 38.886093 -121.289584
604 113 RINETTI WAY RIO LINDA 95673 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 30000 38.687172 -121.463933
686 5890 TT TRAK FORESTHILL 95631 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 194818 39.020808 -120.821518
718 9967 HATHERTON WAY ELK GROVE 95757 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 222500 38.305200 -121.403300
737 3569 SODA WAY SACRAMENTO 95834 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 247000 38.631139 -121.501879
743 6288 LONETREE BLVD ROCKLIN 95765 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 250000 38.804993 -121.293609
754 6001 SHOO FLY RD PLACERVILLE 95667 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 270000 38.813546 -120.809254
755 3040 PARKHAM DR ROSEVILLE 95747 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 271000 38.770835 -121.366996
757 6007 MARYBELLE LN SHINGLE SPRINGS 95682 CA 0 0 0 Unkown Fri May 16 00:00:00 EDT 2008 275000 38.643470 -120.888183
774 8253 KEEGAN WAY ELK GROVE 95624 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 298000 38.446286 -121.400817
789 5222 COPPER SUNSET WAY RANCHO CORDOVA 95742 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 313000 38.529181 -121.224755
798 3232 PARKHAM DR ROSEVILLE 95747 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 325500 38.772821 -121.364821
819 2274 IVY BRIDGE DR ROSEVILLE 95747 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 375000 38.778561 -121.362008
823 201 KIRKLAND CT LINCOLN 95648 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 389000 38.867125 -121.319085
824 12075 APPLESBURY CT RANCHO CORDOVA 95742 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 390000 38.535700 -121.224900
826 5420 ALMOND FALLS WAY RANCHO CORDOVA 95742 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 396000 38.527384 -121.233531
828 1515 EL CAMINO VERDE DR LINCOLN 95648 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 400000 38.904869 -121.320750
836 1536 STONEY CROSS LN LINCOLN 95648 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 433500 38.860007 -121.310946
848 200 HILLSFORD CT ROSEVILLE 95747 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 511000 38.780051 -121.378718
859 4478 GREENBRAE RD ROCKLIN 95677 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 600000 38.781134 -121.222801
861 200 CRADLE MOUNTAIN CT EL DORADO HILLS 95762 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 622500 38.647800 -121.030900
862 2065 IMPRESSIONIST WAY EL DORADO HILLS 95762 CA 0 0 0 Residential Fri May 16 00:00:00 EDT 2008 680000 38.682961 -121.033253
888 3035 ESTEPA DR Unit 5C CAMERON PARK 95682 CA 0 0 0 Condo Thu May 15 00:00:00 EDT 2008 119000 38.681393 -120.996713
901 1530 TOPANGA LN Unit 204 LINCOLN 95648 CA 0 0 0 Condo Thu May 15 00:00:00 EDT 2008 138000 38.884150 -121.270277
917 501 POPLAR AVE WEST SACRAMENTO 95691 CA 0 0 0 Residential Thu May 15 00:00:00 EDT 2008 165000 38.584526 -121.534609
934 1550 TOPANGA LN Unit 207 LINCOLN 95648 CA 0 0 0 Condo Thu May 15 00:00:00 EDT 2008 188000 38.884170 -121.270222
947 1525 PENNSYLVANIA AVE WEST SACRAMENTO 95691 CA 0 0 0 Residential Thu May 15 00:00:00 EDT 2008 200100 38.569943 -121.527539
970 3557 SODA WAY SACRAMENTO 95834 CA 0 0 0 Residential Thu May 15 00:00:00 EDT 2008 224000 38.631026 -121.501879

108 rows × 12 columns

The data set is complete with correct datatypes with no null values. Use unique to see if all state in the state column are the same. There was one AC. I just corrected AC to CA which will complete the state dataset. When I looked at the data through the describe. I noticed that there was a min value of a negative price in price column and a negative square feet in the sq_ft column which I decided. I would just correct the format of the values. So, it would make sense.

_Fun Fact: Zip codes often have leading zeros — e.g., 02215 = Boston, MA — which will often get knocked off automatically by many software programs like Python or Excel. You can imagine that this could create some issues. _

3. Our goal will be to predict price. List variables that you think qualify as predictors of price in an SLR model.

For each of the variables you believe to be a valid potential predictor in an SLR model, generate a plot showing the relationship between the independent and dependent variables.

type(sc.corr())
pandas.core.frame.DataFrame
sns.heatmap(sc.corr(), annot=True, center =0)
<matplotlib.axes._subplots.AxesSubplot at 0x102184198>

png

# A: Using the heat map shows that the indpendent variable x Price
# is Strongly correlated with beds, baths, sq_ft.
#Seems like beds, baths, and sq_ft would be for predictors 

When you’ve finished cleaning or have made a good deal of progress cleaning, it’s always a good idea to save your work.

shd.to_csv('./datasets/sacramento_real_estate_transactions_Clean.csv')

4. Which variable would be the best predictor of Y in an SLR model? Why?

# A: Baths have the strongest correlation compared to beds and sq_ft with price with beds being at 0.42. 

5. Build a function that will take in two lists, Y and X, and return the intercept and slope coefficients that minimize SSE.

Y is the target variable and X is the predictor variable.

  • Test your function on price and the variable you determined was the best predictor in Problem 4.
  • Report the slope and intercept.
# x = np.linspace(-5,50,100)
x = sc['baths']
# y = 50 + 2 * x + np.random.normal(0, 20, size=len(x))
y = sc['price']
sc['Mean_Yhat'] = y.mean()
intercept_slope = np.sum(np.square(sc['price'] - sc['Mean_Yhat']))
# A: I know I did this wrong
def si(table):
    intercept_slope = np.sum(np.square(sc['price'] - sc['Mean_Yhat']))
    return intercept_slope
print('Intercept is ', intercept_slope)
Intercept is  18838783738865.37
si('baths')
18838783738865.37

6. Interpret the intercept. Interpret the slope.

# A: X works with the y value. As the X value increases so should the y value
# The slope will increase by the price variable

7. Give an example of how this model could be used for prediction and how it could be used for inference.

Be sure to make it clear which example is associated with prediction and which is associated with inference.

# Prediction
# A: As a real estate agent, I may want to predict the pricing of housing
# in different areas along with the amount of space available and bedrooms
# Considering the cusotmer price range. I will best be able to determine
# where to start looking. 

8: [Bonus] Using the model you came up with in Problem 5, calculate and plot the residuals.

y_bar = sc['price'].mean()
x_bar = sc['baths'].mean()
std_y = np.std(sc['price'], ddof=1)
std_x = np.std(sc['baths'], ddof=1)
r_xy = sc.corr().loc['baths','price']

beta_1_hat = r_xy * std_y / std_x
beta_0_hat = y_bar - beta_1_hat *x_bar

print(beta_1_hat,beta_0_hat)
64318.53523673409 119872.75465554858
sc['Linear_Yhat'] = beta_0_hat + beta_1_hat * sc['baths']
# A:

fig = plt.figure(figsize=(15,7))
fig.set_figheight(8)
fig.set_figwidth(15)


ax = fig.gca()


ax.scatter(x=sc['baths'], y=sc['price'], c='k')
ax.plot(sc['baths'], sc['Linear_Yhat'], color='k');

for _, row in sc.iterrows():
    plt.plot((row['baths'], row['baths']), (row['price'], row['Linear_Yhat']), 'r-')

png


The material following this point can be completed after the second lesson on Monday.


Dummy Variables


It is important to be cautious with categorical variables, which represent distict groups or categories, when building a regression. If put in a regression “as-is,” categorical variables represented as integers will be treated like continuous variables.

That is to say, instead of group “3” having a different effect on the estimation than group “1” it will estimate literally 3 times more than group 1.

For example, if occupation category “1” represents “analyst” and occupation category “3” represents “barista”, and our target variable is salary, if we leave this as a column of integers then barista will always have beta*3 the effect of analyst.

This will almost certainly force the beta coefficient to be something strange and incorrect. Instead, we can re-represent the categories as multiple “dummy coded” columns.

9. Use the pd.get_dummies function to convert the type column into dummy-coded variables.

Print out the header of the dummy-coded variable output.

# A:
sc_new = pd.get_dummies(sc[['type']])

sc_new.head()
type_Condo type_Multi-Family type_Residential type_Unkown
0 0 0 1 0
1 0 0 1 0
2 0 0 1 0
3 0 0 1 0
4 0 0 1 0

A Word of Caution When Creating Dummies

Let’s touch on precautions we should take when dummy coding.

If you convert a qualitative variable to dummy variables, you want to turn a variable with N categories into N-1 variables.

Scenario 1: Suppose we’re working with the variable “sex” or “gender” with values “M” and “F”.

You should include in your model only one variable for “sex = F” which takes on 1 if sex is female and 0 if sex is not female! Rather than saying “a one unit change in X,” the coefficient associated with “sex = F” is interpreted as the average change in Y when sex = F relative to when sex = M.

| Female | Male | |——-|——| | 0 | 1 | | 1 | 0 | | 0 | 1 | | 1 | 0 | | 1 | 0 | As we can see a 1 in the female column indicates a 0 in the male column. And so, we have two columns stating the same information in different ways.

Scenario 2: Suppose we’re modeling revenue at a bar for each of the days of the week. We have a column with strings identifying which day of the week this observation occured in.

We might include six of the days as their own variables: “Monday”, “Tuesday”, “Wednesday”, “Thursday”, “Friday”, “Saturday”. But not all 7 days.

Monday Tuesday Wednesday Thursday Friday Saturday
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1
0 0 0 0 0 0

As humans we can infer from the last row that if its is not Monday, Tusday, Wednesday, Thursday, Friday or Saturday than it must be Sunday. Models work the same way.

The coefficient for Monday is then interpreted as the average change in revenue when “day = Monday” relative to “day = Sunday.” The coefficient for Tuesday is interpreted in the average change in revenue when “day = Tuesday” relative to “day = Sunday” and so on.

The category you leave out, which the other columns are relative to is often referred to as the reference category.

10. Remove “Unkown” from four dummy coded variable dataframe and append the rest to the original data.

# A: 
sc_new.drop('type_Unkown', axis = 1, inplace=True)
sc_new.head()
type_Condo type_Multi-Family type_Residential
0 0 0 1
1 0 0 1
2 0 0 1
3 0 0 1
4 0 0 1
sc.head(1)
street city zip state beds baths sq__ft type sale_date price latitude longitude Mean_Yhat Linear_Yhat
0 3526 HIGH ST SACRAMENTO 95838 CA 2 1 836 Residential Wed May 21 00:00:00 EDT 2008 59222 38.631913 -121.434879 234144.263959 184191.289892
sc = pd.concat([sc, sc_new], axis=1)
sc.head(1)
street city zip state beds baths sq__ft type sale_date price latitude longitude Mean_Yhat Linear_Yhat type_Condo type_Multi-Family type_Residential
0 3526 HIGH ST SACRAMENTO 95838 CA 2 1 836 Residential Wed May 21 00:00:00 EDT 2008 59222 38.631913 -121.434879 234144.263959 184191.289892 0 0 1

11. Build what you think may be the best MLR model predicting price.

The independent variables are your choice, but include at least three variables. At least one of which should be a dummy-coded variable (either one we created before or a new one).

To construct your model don’t forget to load in the statsmodels api:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

I’m going to engineer a new dummy variable for ‘HUGE houses’. Those whose square footage is 3 (positive) standard deviations away from the mean.

Mean = 1315
STD = 853
Huge Houses > 3775 sq ft
sc['Huge_homes'] = (sc['sq__ft'] > 3775).astype(int)
sc['Huge_homes'].value_counts()
0    975
1     10
Name: Huge_homes, dtype: int64
from sklearn.linear_model import LinearRegression

X = sc[['sq__ft', 'beds', 'baths','Huge_homes']].values
y = sc['price'].values

model = LinearRegression()
model.fit(X,y)

y_pred = model.predict(X)

12. Plot the true price vs the predicted price to evaluate your MLR visually.

Tip: with seaborn’s sns.lmplot you can set x, y, and even a hue (which will plot regression lines by category in different colors) to easily plot a regression line.

sc.head()
street city zip state beds baths sq__ft type sale_date price latitude longitude Mean_Yhat Linear_Yhat type_Condo type_Multi-Family type_Residential
0 3526 HIGH ST SACRAMENTO 95838 CA 2 1 836 Residential Wed May 21 00:00:00 EDT 2008 59222 38.631913 -121.434879 234144.263959 184191.289892 0 0 1
1 51 OMAHA CT SACRAMENTO 95823 CA 3 1 1167 Residential Wed May 21 00:00:00 EDT 2008 68212 38.478902 -121.431028 234144.263959 184191.289892 0 0 1
2 2796 BRANCH ST SACRAMENTO 95815 CA 2 1 796 Residential Wed May 21 00:00:00 EDT 2008 68880 38.618305 -121.443839 234144.263959 184191.289892 0 0 1
3 2805 JANETTE WAY SACRAMENTO 95815 CA 2 1 852 Residential Wed May 21 00:00:00 EDT 2008 69307 38.616835 -121.439146 234144.263959 184191.289892 0 0 1
4 6001 MCMAHON DR SACRAMENTO 95824 CA 2 1 797 Residential Wed May 21 00:00:00 EDT 2008 81900 38.519470 -121.435768 234144.263959 184191.289892 0 0 1
sc['y_pred'] = y_pred
sns.lmplot(x='price', y='y_pred', data=sc, hue='Huge_homes')
<seaborn.axisgrid.FacetGrid at 0x1c16640f28>

png

13. List the five assumptions for an MLR model.

Indicate which ones are the same as the assumptions for an SLR model.

SLR AND MLR:

  • Linearity: Y must have an approximately linear relationship with each independent X_i.
  • Independence: Errors (residuals) e_i and e_j must be independent of one another for any i != j.
  • Normality: The errors (residuals) follow a Normal distribution.
  • Equality of Variances: The errors (residuals) should have a roughly consistent pattern, regardless of the value of the X_i. (There should be no discernable relationship between X_1 and the residuals.)

MLR ONLY:

  • Independence Part 2: The independent variables X_i and X_j must be independent of one another for any i != j

14. Pick at least two assumptions and articulate whether or not you believe them to be met for your model and why.

# A: With the errors looking skewed right it does not show normality 
sc['Residuals'] = sc['price'] - sc['y_pred']
sns.distplot(sc['Residuals'])
<matplotlib.axes._subplots.AxesSubplot at 0x1c16531630>

png

#A Looks like it does show linearity because the y and x have an approximate
# linear relationship
# Plot

x='Residuals'
y='price'

plt.scatter(x, y, s=area, data=sc, alpha=0.5)
plt.title('Scatter plot pythonspot.com')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

png

15. [Bonus] Generate a table showing the point estimates, standard errors, t-scores, p-values, and 95% confidence intervals for the model you built.

Write a few sentences interpreting some of the output.

Hint: scikit-learn does not have this functionality built in, but statsmodels does in the summary method. To fit the statsmodels model use something like the following. There is one big caveat here, however! statsmodels.OLS does not add an intercept to your model, so you will need to do this explicitly by adding a column filled with the number 1 to your X matrix

import statsmodels.api as sm

# The Default here is Linear Regression (ordinary least squares regression OLS)
model = sm.OLS(y,X).fit()

The material following this point can be completed after the first lesson on Tuesday.


# Standard Errors assume that the covariance matrix of the errors is correctly specified.
# The condition number is large, 1.7e+04. This might indicate that there are
# strong multicollinearity or other numerical problems.
# A "unit" increase in sq_ft is associated with a 9.4538 "unit" increase in prince.

# Importing the stats model API
import statsmodels.api as sm


# Setting my X and y for modeling
sc['intercept'] = 1
X = sc[['intercept','sq__ft','beds','baths','Huge_homes']]
y = sc['price']

model = sm.OLS(y,X).fit()
model.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.194
Model: OLS Adj. R-squared: 0.191
Method: Least Squares F-statistic: 59.05
Date: Wed, 04 Jul 2018 Prob (F-statistic): 1.07e-44
Time: 13:37:52 Log-Likelihood: -12951.
No. Observations: 985 AIC: 2.591e+04
Df Residuals: 980 BIC: 2.594e+04
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
intercept 1.252e+05 9748.440 12.844 0.000 1.06e+05 1.44e+05
sq__ft 9.4538 6.985 1.353 0.176 -4.254 23.161
beds -3947.2178 5955.987 -0.663 0.508 -1.56e+04 7740.738
baths 5.979e+04 8400.448 7.117 0.000 4.33e+04 7.63e+04
Huge_homes 1.747e+05 4.29e+04 4.075 0.000 9.06e+04 2.59e+05
Omnibus: 231.835 Durbin-Watson: 0.432
Prob(Omnibus): 0.000 Jarque-Bera (JB): 549.528
Skew: 1.257 Prob(JB): 4.69e-120
Kurtosis: 5.658 Cond. No. 1.70e+04



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.7e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

16. Regression Metrics

Implement a function called r2_adj() that will calculate $R^2_{adj}$ for a model.

def r2_adj(y_true, y_preds, p):
    n = len(y_true)
    y_mean = np.mean(y_true)
    numerator = np.sum(np.square(y_true - y_preds)) / (n - p - 1)
    denominator = np.sum(np.square(y_true - y_mean)) / (n - 1)
    return 1 - numerator / denominator

17. Metrics, metrics, everywhere…

Write a function to calculate and print or return six regression metrics. Use other functions liberally, including those found in sklearn.metrics.

# A:
from sklearn.metrics import *
from sklearn.metrics import mean_squared_error, mean_absolute_error,r2_score
y = sc['price']
x = sc['baths']
y = sc['price']
X = sc.drop(['street', 'city', 'zip', 'state', 'type', 'sale_date', 'latitude', 'longitude','sq__ft'], axis="columns")


regression = sklearn.linear_model.LinearRegression(
    fit_intercept = True, 
    normalize = False,
    copy_X = True,
    n_jobs = 1
)

model = regression.fit(X, y)
y_hat = model.predict(X)

y_hat
def regression_metrics(y, y_hat, p):
    r2 = sklearn.metrics.r2_score(y, y_hat),
    mse = sklearn.metrics.mean_squared_error(y, y_hat),
    #r2_adj = r2_adj(y, y_hat,p),
    msle = sklearn.metrics.mean_squared_log_error(y, y_hat),
    mae = sklearn.metrics.mean_absolute_error(y,y_hat),
    rmse = np.sqrt(mse)
    
    print('r2 = ', r2)
    print('mse = ', mse)
    #print(r2_adj)
    print('msle = ', msle)
    print('mae = ', mae)
    print('rmse = ', rmse)   

18. Model Iteration

Evaluate your current home price prediction model by calculating all six regression metrics. Now adjust your model (e.g. add or take away features) and see how to metrics change.

# A:
regression_metrics(sc['price'], y_pred, X.shape[1])

r2 =  (0.19421379933530336,)
mse =  (15411199973.689539,)
msle =  (0.8478591994535334,)
mae =  (93077.08723188947,)
rmse =  [124141.85423816]
sc.columns
Index(['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type',
       'sale_date', 'price', 'latitude', 'longitude', 'Mean_Yhat',
       'Linear_Yhat', 'type_Condo', 'type_Multi-Family', 'type_Residential',
       'y_pred', 'Huge_homes', 'Residuals', 'intercept'],
      dtype='object')
features = ['beds', 'baths', 'sq__ft','type_Condo', 'type_Multi-Family', 'type_Residential']
X = sc[features].values
y = sc['price'].values

model = LinearRegression()
model.fit(X, y)

y_hat = model.predict(X)
regression_metrics(sc['price'], y_pred, X.shape[1])
r2 =  (0.19421379933530336,)
mse =  (15411199973.689539,)
msle =  (0.8478591994535334,)
mae =  (93077.08723188947,)
rmse =  [124141.85423816]

19. Bias vs. Variance

At this point, do you think your model is high bias, high variance or in the sweet spot? If you are doing this after Wednesday, can you provide evidence to support your belief?

# A it seems like it will be in the sweet spot. I don't see signs for high bias or high variance 
from sklearn.model_selection import cross_val_score


cv_scores = cross_val_score(model, X, y)
print(cv_scores)
print(np.mean(cv_scores))

[ 0.09621108  0.10519894 -0.14432181]
0.01902940197070409
Written on September 30, 2018