{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning Exercises Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1\n",
    "\n",
    "You've just been hired at a real estate investment firm and they would like you to build a model for pricing houses. You are given a dataset that contains data for house prices and a few features like number of bedrooms, size in square feet and age of the house. Let's see if you can build a model that is able to predict the price. In this exercise we extend what we have learned about linear regression to a dataset with more than one feature. Here are the steps to complete it:\n",
    "\n",
    "1. load the dataset ../data/housing-data.csv\n",
    "2. plot the histograms for each feature\n",
    "3. create 2 variables called X and y: X shall be a matrix with 3 columns (sqft, bdrms, age) and y shall be a vector with 1 column (price)\n",
    "4. create a linear regression model in Keras with the appropriate number of inputs and outputs\n",
    "5. split the data into train and test with a 20% test size\n",
    "6. train the model on the training set and check its accuracy on the training and test sets\n",
    "7. how's your model doing? Is the loss growing smaller?\n",
    "8. try to improve your model with these experiments:\n",
    "    - normalize the input features with one of the rescaling techniques mentioned above\n",
    "    - use a different value for the learning rate of your model\n",
    "    - use a different optimizer\n",
    "9. once you're satisfied with training, check the R2 score on the test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset ../data/housing-data.csv\n",
    "df = pd.read_csv('../data/housing-data.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the histograms for each feature\n",
    "plt.figure(figsize=(15, 5))\n",
    "for i, feature in enumerate(df.columns):\n",
    "    plt.subplot(1, 4, i+1)\n",
    "    df[feature].plot(kind='hist', title=feature)\n",
    "    plt.xlabel(feature)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create 2 variables called X and y:\n",
    "# X shall be a matrix with 3 columns (sqft,bdrms,age)\n",
    "# and y shall be a vector with 1 column (price)\n",
    "X = df[['sqft', 'bdrms', 'age']].values\n",
    "y = df['price'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.models import Sequential\n",
    "from tensorflow.keras.layers import Dense\n",
    "from tensorflow.keras.optimizers import Adam"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a linear regression model in Keras\n",
    "# with the appropriate number of inputs and output\n",
    "model = Sequential()\n",
    "model.add(Dense(1, input_shape=(3,)))\n",
    "model.compile(Adam(learning_rate=0.8), 'mean_squared_error')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split the data into train and test with a 20% test size\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# train the model on the training set and check its accuracy on training and test set\n",
    "# how's your model doing? Is the loss growing smaller?\n",
    "model.fit(X_train, y_train, epochs=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import r2_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the R2 score on the training and test sets (probably very bad)\n",
    "\n",
    "y_train_pred = model.predict(X_train)\n",
    "y_test_pred = model.predict(X_test)\n",
    "\n",
    "print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
    "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try to improve your model with these experiments:\n",
    "# - normalize the input features with one of the rescaling techniques mentioned above\n",
    "# - use a different value for the learning rate of your model\n",
    "# - use a different optimizer\n",
    "df['sqft1000'] = df['sqft']/1000.0\n",
    "df['age10'] = df['age']/10.0\n",
    "df['price100k'] = df['price']/1e5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df[['sqft1000', 'bdrms', 'age10']].values\n",
    "y = df['price100k'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = Sequential()\n",
    "model.add(Dense(1, input_dim=3))\n",
    "model.compile(Adam(learning_rate=0.1), 'mean_squared_error')\n",
    "model.fit(X_train, y_train, epochs=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# once you're satisfied with training, check the R2 score on the test set\n",
    "\n",
    "y_train_pred = model.predict(X_train)\n",
    "y_test_pred = model.predict(X_test)\n",
    "\n",
    "print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
    "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(X_train, y_train, epochs=40, verbose=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the R2 score again after the additional 40 epochs of training\n",
    "\n",
    "y_train_pred = model.predict(X_train)\n",
    "y_test_pred = model.predict(X_test)\n",
    "\n",
    "print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
    "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2\n",
    "\n",
    "Your boss was extremely happy with your work on the housing price prediction model and decided to entrust you with a more challenging task. They've seen a lot of people leave the company recently and they would like to understand why that's happening. They have collected historical data on employees and they would like you to build a model that is able to predict which employee will leave next. They would like a model that is better than random guessing. They also prefer false negatives to false positives, in this first phase. Fields in the dataset include:\n",
    "\n",
    "- Employee satisfaction level\n",
    "- Last evaluation\n",
    "- Number of projects\n",
    "- Average monthly hours\n",
    "- Time spent at the company\n",
    "- Whether they have had a work accident\n",
    "- Whether they have had a promotion in the last 5 years\n",
    "- Department\n",
    "- Salary\n",
    "- Whether the employee has left\n",
    "\n",
    "Your goal is to predict the binary outcome variable `left` using the rest of the data. Since the outcome is binary, this is a classification problem. Here are some things you may want to try out:\n",
    "\n",
    "1. load the dataset at ../data/HR_comma_sep.csv, inspect it with `.head()`, `.info()` and `.describe()`.\n",
    "2. establish a benchmark: what would be your accuracy score if you predicted that everyone stays?\n",
    "3. check if any feature needs rescaling. You may plot a histogram of the feature to decide which rescaling method is more appropriate.\n",
    "4. convert the categorical features into binary dummy columns. You will then have to combine them with the numerical features using `pd.concat`.\n",
    "5. do the usual train/test split with a 20% test size\n",
    "6. play around with learning rate and optimizer\n",
    "7. check the confusion matrix, precision and recall\n",
    "8. check if you still get the same results if you use a 5-Fold cross validation on all the data\n",
    "9. is the model good enough for your boss?\n",
    "\n",
    "As you will see in this exercise, a logistic regression model is not good enough to help your boss. In the next chapter we will learn how to go beyond linear models.\n",
    "\n",
    "This dataset comes from https://www.kaggle.com/ludobenistant/hr-analytics/ and is released under [CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the dataset at ../data/HR_comma_sep.csv, inspect it with `.head()`, `.info()` and `.describe()`.\n",
    "\n",
    "df = pd.read_csv('../data/HR_comma_sep.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Establish a benchmark: what would be your accuracy score if you predicted that everyone stays?\n",
    "\n",
    "df.left.value_counts() / len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting 0 (the employee stays) all the time would yield an accuracy of 76%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check if any feature needs rescaling.\n",
    "# You may plot a histogram of the feature to decide which rescaling method is more appropriate.\n",
    "df['average_montly_hours'].plot(kind='hist');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['average_montly_hours_100'] = df['average_montly_hours']/100.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['average_montly_hours_100'].plot(kind='hist');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['time_spend_company'].plot(kind='hist');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert the categorical features into binary dummy columns.\n",
    "# You will then have to combine them with\n",
    "# the numerical features using `pd.concat`.\n",
    "# dtype=float keeps the dummies numeric: recent pandas versions default to bool,\n",
    "# which would turn the concatenated feature matrix into an object array.\n",
    "df_dummies = pd.get_dummies(df[['sales', 'salary']], dtype=float)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_dummies.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = pd.concat([df[['satisfaction_level', 'last_evaluation', 'number_project',\n",
    "                   'time_spend_company', 'Work_accident',\n",
    "                   'promotion_last_5years', 'average_montly_hours_100']],\n",
    "               df_dummies], axis=1).values\n",
    "y = df['left'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# do the usual train/test split with a 20% test size\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# play around with learning rate and optimizer\n",
    "\n",
    "model = Sequential()\n",
    "model.add(Dense(1, input_dim=20, activation='sigmoid'))\n",
    "model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(X_train, y_train, epochs=10)"
   ]
  },
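  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# `predict_classes` is no longer available in recent TensorFlow releases,\n",
    "# so threshold the sigmoid output at 0.5 to get the predicted class\n",
    "y_test_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()"
   ]
  },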
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pretty_confusion_matrix(y_true, y_pred, labels=[\"False\", \"True\"]):\n",
    "    cm = confusion_matrix(y_true, y_pred)\n",
    "    pred_labels = ['Predicted '+ l for l in labels]\n",
    "    df = pd.DataFrame(cm, index=labels, columns=pred_labels)\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the confusion matrix, precision and recall\n",
    "\n",
    "pretty_confusion_matrix(y_test, y_test_pred, labels=['Stay', 'Leave'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(y_test, y_test_pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.wrappers.scikit_learn import KerasClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check if you still get the same results if you use a 5-Fold cross validation on all the data\n",
    "\n",
    "def build_logistic_regression_model():\n",
    "    model = Sequential()\n",
    "    model.add(Dense(1, input_dim=20, activation='sigmoid'))\n",
    "    model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])\n",
    "    return model\n",
    "\n",
    "model = KerasClassifier(build_fn=build_logistic_regression_model,\n",
    "                        epochs=10, verbose=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import KFold, cross_val_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(5, shuffle=True)\n",
    "scores = cross_val_score(model, X, y, cv=cv)\n",
    "\n",
    "print(\"The cross validation accuracy is {:0.4f} ± {:0.4f}\".format(scores.mean(), scores.std()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Is the model good enough for your boss?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No, the model is not good enough for my boss, since it performs no better than the benchmark."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}