CVML-MachineLearning/solutions/3 Machine Learning Exercises Solution.ipynb
Sem van der Hoeven d979ca38f5 add all files
2021-05-26 15:12:05 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning Exercises Solution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1\n",
"\n",
"You've just been hired at a real estate investment firm and they would like you to build a model for pricing houses. You are given a dataset that contains data for house prices and a few features like number of bedrooms, size in square feet and age of the house. Let's see if you can build a model that is able to predict the price. In this exercise we extend what we have learned about linear regression to a dataset with more than one feature. Here are the steps to complete it:\n",
"\n",
"1. Load the dataset ../data/housing-data.csv\n",
"- plot the histograms for each feature\n",
"- create 2 variables called X and y: X shall be a matrix with 3 columns (sqft, bdrms, age) and y shall be a vector with 1 column (price)\n",
"- create a linear regression model in Keras with the appropriate number of inputs and outputs\n",
"- split the data into train and test with a 20% test size\n",
"- train the model on the training set and check its performance on the training and test sets\n",
"- how's your model doing? Is the loss growing smaller?\n",
"- try to improve your model with these experiments:\n",
"    - normalize the input features with one of the rescaling techniques mentioned above\n",
"    - use a different value for the learning rate of your model\n",
"    - use a different optimizer\n",
"- once you're satisfied with training, check the R2 score on the test set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the dataset ../data/housing-data.csv\n",
"df = pd.read_csv('../data/housing-data.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot the histograms for each feature\n",
"plt.figure(figsize=(15, 5))\n",
"for i, feature in enumerate(df.columns):\n",
"    plt.subplot(1, 4, i+1)\n",
"    df[feature].plot(kind='hist', title=feature)\n",
"    plt.xlabel(feature)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create 2 variables called X and y:\n",
"# X shall be a matrix with 3 columns (sqft,bdrms,age)\n",
"# and y shall be a vector with 1 column (price)\n",
"X = df[['sqft', 'bdrms', 'age']].values\n",
"y = df['price'].values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.keras.models import Sequential\n",
"from tensorflow.keras.layers import Dense\n",
"from tensorflow.keras.optimizers import Adam"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create a linear regression model in Keras\n",
"# with the appropriate number of inputs and output\n",
"model = Sequential()\n",
"model.add(Dense(1, input_shape=(3,)))\n",
"model.compile(Adam(learning_rate=0.8), 'mean_squared_error')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# split the data into train and test with a 20% test size\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(X_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# train the model on the training set and check its accuracy on training and test set\n",
"# how's your model doing? Is the loss growing smaller?\n",
"model.fit(X_train, y_train, epochs=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import r2_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# check the R2score on training and test set (probably very bad)\n",
"\n",
"y_train_pred = model.predict(X_train)\n",
"y_test_pred = model.predict(X_test)\n",
"\n",
"print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
"print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# try to improve your model with these experiments:\n",
"# - normalize the input features with one of the rescaling techniques mentioned above\n",
"# - use a different value for the learning rate of your model\n",
"# - use a different optimizer\n",
"df['sqft1000'] = df['sqft']/1000.0\n",
"df['age10'] = df['age']/10.0\n",
"df['price100k'] = df['price']/1e5"
]
},
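{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rescaling above divides each column by a hand-picked constant. As an alternative, the cell below is a minimal sketch of standardization with scikit-learn's `StandardScaler`, assuming scikit-learn is installed. The `toy_X` values are made up for illustration and are not taken from `housing-data.csv`. In a real run you would fit the scaler on the training split only and reuse it to transform the test split."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Made-up rows in the order (sqft, bdrms, age); illustrative values only\n",
"toy_X = np.array([[2104.0, 3.0, 70.0],\n",
"                  [1600.0, 3.0, 28.0],\n",
"                  [2400.0, 4.0, 44.0]])\n",
"\n",
"# Subtract each column's mean and divide by its standard deviation\n",
"toy_scaler = StandardScaler()\n",
"toy_X_scaled = toy_scaler.fit_transform(toy_X)\n",
"\n",
"# After standardization every column has mean close to 0 and std close to 1\n",
"print(toy_X_scaled.mean(axis=0))\n",
"print(toy_X_scaled.std(axis=0))"
]
},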
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = df[['sqft1000', 'bdrms', 'age10']].values\n",
"y = df['price100k'].values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = Sequential()\n",
"model.add(Dense(1, input_dim=3))\n",
"model.compile(Adam(learning_rate=0.1), 'mean_squared_error')\n",
"model.fit(X_train, y_train, epochs=20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# once you're satisfied with training, check the R2score on the test set\n",
"\n",
"y_train_pred = model.predict(X_train)\n",
"y_test_pred = model.predict(X_test)\n",
"\n",
"print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
"print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.fit(X_train, y_train, epochs=40, verbose=0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# once you're satisfied with training, check the R2score on the test set\n",
"\n",
"y_train_pred = model.predict(X_train)\n",
"y_test_pred = model.predict(X_test)\n",
"\n",
"print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
"print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
]
},
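{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the R2 score less of a black box, the cell below is a small sketch that recomputes it by hand as `1 - SS_res / SS_tot` and compares the result with `r2_score`. The arrays `toy_y_true` and `toy_y_pred` are made-up values chosen for illustration, not output of the model above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import r2_score\n",
"\n",
"# Made-up targets and predictions; illustrative values only\n",
"toy_y_true = np.array([3.0, 2.5, 4.0, 5.5])\n",
"toy_y_pred = np.array([2.8, 2.9, 4.2, 5.0])\n",
"\n",
"# Residual sum of squares: error left over after the model's predictions\n",
"ss_res = np.sum((toy_y_true - toy_y_pred) ** 2)\n",
"# Total sum of squares: error of always predicting the mean of the targets\n",
"ss_tot = np.sum((toy_y_true - toy_y_true.mean()) ** 2)\n",
"\n",
"r2_manual = 1.0 - ss_res / ss_tot\n",
"\n",
"# The manual value should agree with scikit-learn's implementation\n",
"assert np.isclose(r2_manual, r2_score(toy_y_true, toy_y_pred))\n",
"print(r2_manual)"
]
},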
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2\n",
"\n",
"Your boss was extremely happy with your work on the housing price prediction model and decided to entrust you with a more challenging task. They've seen a lot of people leave the company recently and they would like to understand why that's happening. They have collected historical data on employees and they would like you to build a model that is able to predict which employee will leave next. They would like a model that is better than random guessing. They also prefer false negatives to false positives, in this first phase. Fields in the dataset include:\n",
"\n",
"- Employee satisfaction level\n",
"- Last evaluation\n",
"- Number of projects\n",
"- Average monthly hours\n",
"- Time spent at the company\n",
"- Whether they have had a work accident\n",
"- Whether they have had a promotion in the last 5 years\n",
"- Department\n",
"- Salary\n",
"- Whether the employee has left\n",
"\n",
"Your goal is to predict the binary outcome variable `left` using the rest of the data. Since the outcome is binary, this is a classification problem. Here are some things you may want to try out:\n",
"\n",
"1. load the dataset at ../data/HR_comma_sep.csv, inspect it with `.head()`, `.info()` and `.describe()`.\n",
"- Establish a benchmark: what would your accuracy score be if you predicted that everyone stays?\n",
"- Check if any feature needs rescaling. You may plot a histogram of the feature to decide which rescaling method is more appropriate.\n",
"- convert the categorical features into binary dummy columns. You will then have to combine them with the numerical features using `pd.concat`.\n",
"- do the usual train/test split with a 20% test size\n",
"- play around with learning rate and optimizer\n",
"- check the confusion matrix, precision and recall\n",
"- check if you still get the same results if you use a 5-Fold cross validation on all the data\n",
"- Is the model good enough for your boss?\n",
"\n",
"As you will see in this exercise, a logistic regression model is not good enough to help your boss. In the next chapter we will learn how to go beyond linear models.\n",
"\n",
"This dataset comes from https://www.kaggle.com/ludobenistant/hr-analytics/ and is released under [CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load the dataset at ../data/HR_comma_sep.csv, inspect it with `.head()`, `.info()` and `.describe()`.\n",
"\n",
"df = pd.read_csv('../data/HR_comma_sep.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Establish a benchmark: what would your accuracy score be if you predicted that everyone stays?\n",
"\n",
"df.left.value_counts() / len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Always predicting 0 (the employee stays) would yield an accuracy of about 76%."
]
},
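{
"cell_type": "markdown",
"metadata": {},
"source": [
"The benchmark above is the majority-class baseline: its accuracy equals the share of the most frequent label. The cell below is a minimal sketch of that calculation on a made-up label vector `toy_left` (illustrative values only, not rows from `HR_comma_sep.csv`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Made-up labels: 0 = stayed, 1 = left (6 zeros and 2 ones)\n",
"toy_left = np.array([0, 0, 0, 1, 0, 1, 0, 0])\n",
"\n",
"# A model that always predicts 0 is right whenever the true label is 0\n",
"always_stay_pred = np.zeros_like(toy_left)\n",
"baseline_accuracy = np.mean(always_stay_pred == toy_left)\n",
"\n",
"# 6 of the 8 toy labels are 0, so the baseline accuracy is 0.75\n",
"print(baseline_accuracy)"
]
},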
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check if any feature needs rescaling.\n",
"# You may plot a histogram of the feature to decide which rescaling method is more appropriate.\n",
"df['average_montly_hours'].plot(kind='hist');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['average_montly_hours_100'] = df['average_montly_hours']/100.0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['average_montly_hours_100'].plot(kind='hist');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['time_spend_company'].plot(kind='hist');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# convert the categorical features into binary dummy columns.\n",
"# You will then have to combine them with\n",
"# the numerical features using `pd.concat`.\n",
"df_dummies = pd.get_dummies(df[['sales', 'salary']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_dummies.head()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cast to float32 so that boolean dummy columns (the default in newer pandas\n",
"# versions) do not turn the combined array into dtype=object.\n",
"X = pd.concat([df[['satisfaction_level', 'last_evaluation', 'number_project',\n",
"                   'time_spend_company', 'Work_accident',\n",
"                   'promotion_last_5years', 'average_montly_hours_100']],\n",
"               df_dummies], axis=1).values.astype('float32')\n",
"y = df['left'].values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# do the usual train/test split with a 20% test size\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# play around with learning rate and optimizer\n",
"\n",
"model = Sequential()\n",
"model.add(Dense(1, input_dim=20, activation='sigmoid'))\n",
"model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.fit(X_train, y_train, epochs=10)"
]
},
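{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# `predict_classes` is no longer available in newer TensorFlow releases,\n",
"# so threshold the sigmoid output at 0.5 to get class labels instead.\n",
"y_test_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()"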
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix, classification_report"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pretty_confusion_matrix(y_true, y_pred, labels=[\"False\", \"True\"]):\n",
"    cm = confusion_matrix(y_true, y_pred)\n",
"    pred_labels = ['Predicted '+ l for l in labels]\n",
"    df = pd.DataFrame(cm, index=labels, columns=pred_labels)\n",
"    return df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# check the confusion matrix, precision and recall\n",
"\n",
"pretty_confusion_matrix(y_test, y_test_pred, labels=['Stay', 'Leave'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(classification_report(y_test, y_test_pred))"
]
},
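{
"cell_type": "markdown",
"metadata": {},
"source": [
"Precision and recall in the report above can be read directly off the confusion matrix. The cell below is a small sketch that derives both for the positive class and checks them against scikit-learn. The arrays `toy_true` and `toy_pred` are made-up labels for illustration, not predictions from the model above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import confusion_matrix, precision_score, recall_score\n",
"\n",
"# Made-up labels: 0 = stay, 1 = leave\n",
"toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])\n",
"toy_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])\n",
"\n",
"# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp\n",
"tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()\n",
"\n",
"# Precision: of the employees predicted to leave, how many actually left\n",
"precision = tp / (tp + fp)\n",
"# Recall: of the employees who actually left, how many were caught\n",
"recall = tp / (tp + fn)\n",
"\n",
"# Both values should match scikit-learn's own metrics\n",
"assert np.isclose(precision, precision_score(toy_true, toy_pred))\n",
"assert np.isclose(recall, recall_score(toy_true, toy_pred))\n",
"print(precision, recall)"
]
},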
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.keras.wrappers.scikit_learn import KerasClassifier"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# check if you still get the same results if you use a 5-Fold cross validation on all the data\n",
"\n",
"def build_logistic_regression_model():\n",
"    model = Sequential()\n",
"    model.add(Dense(1, input_dim=20, activation='sigmoid'))\n",
"    model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])\n",
"    return model\n",
"\n",
"model = KerasClassifier(build_fn=build_logistic_regression_model,\n",
"                        epochs=10, verbose=0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold, cross_val_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cv = KFold(5, shuffle=True)\n",
"scores = cross_val_score(model, X, y, cv=cv)\n",
"\n",
"print(\"The cross validation accuracy is {:0.4f} ± {:0.4f}\".format(scores.mean(), scores.std()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Is the model good enough for your boss?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No, the model is not good enough for my boss, since it performs no better than the benchmark."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}