{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Exercises Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1\n", "\n", "You've just been hired at a real estate investment firm and they would like you to build a model for pricing houses. You are given a dataset that contains data for house prices and a few features like number of bedrooms, size in square feet and age of the house. Let's see if you can build a model that is able to predict the price. In this exercise we extend what we have learned about linear regression to a dataset with more than one feature. Here are the steps to complete it:\n", "\n", "1. Load the dataset ../data/housing-data.csv\n", "- plot the histograms for each feature\n", "- create 2 variables called X and y: X shall be a matrix with 3 columns (sqft,bdrms,age) and y shall be a vector with 1 column (price)\n", "- create a linear regression model in Keras with the appropriate number of inputs and output\n", "- split the data into train and test with a 20% test size\n", "- train the model on the training set and check its accuracy on training and test set\n", "- how's your model doing? Is the loss growing smaller?\n", "- try to improve your model with these experiments:\n", " - normalize the input features with one of the rescaling techniques mentioned above\n", " - use a different value for the learning rate of your model\n", " - use a different optimizer\n", "- once you're satisfied with training, check the R2score on the test set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the dataset ../data/housing-data.csv\n", "df = pd.read_csv('../data/housing-data.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the histograms for each feature\n", "plt.figure(figsize=(15, 5))\n", "for i, feature in enumerate(df.columns):\n", " plt.subplot(1, 4, i+1)\n", " df[feature].plot(kind='hist', title=feature)\n", " plt.xlabel(feature)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create 2 variables called X and y:\n", "# X shall be a matrix with 3 columns (sqft,bdrms,age)\n", "# and y shall be a vector with 1 column (price)\n", "X = df[['sqft', 'bdrms', 'age']].values\n", "y = df['price'].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Dense\n", "from tensorflow.keras.optimizers import Adam" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a linear regression model in Keras\n", "# with the appropriate number of inputs and output\n", "model = Sequential()\n", "model.add(Dense(1, input_shape=(3,)))\n", "model.compile(Adam(learning_rate=0.8), 'mean_squared_error')" ] }, { "cell_type": "code", "execution_count": null, "metadata": 
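{}, "outputs": [], "source": [ "# Not part of the original exercise: a quick sanity check, assuming it is\n", "# useful to compare against a closed-form fit. Ordinary least squares via\n", "# numpy's lstsq gives coefficients that a well-trained linear regression\n", "# in Keras should roughly match.\n", "X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # add an intercept column\n", "w, _, _, _ = np.linalg.lstsq(X_b, y, rcond=None)\n", "print('OLS coefficients (intercept first):', w)" ] }, { "cell_type": "code", "execution_count": null, "metadata": 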
{}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# split the data into train and test with a 20% test size\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(X_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# train the model on the training set and check its accuracy on training and test set\n", "# how's your model doing? Is the loss growing smaller?\n", "model.fit(X_train, y_train, epochs=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import r2_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the R2score on training and test set (probably very bad)\n", "\n", "y_train_pred = model.predict(X_train)\n", "y_test_pred = model.predict(X_test)\n", "\n", "print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n", "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# try to improve your model with these experiments:\n", "# - normalize the input features with one of the rescaling techniques mentioned above\n", "# - use a different value for the learning rate of your model\n", "# - use a different optimizer\n", "df['sqft1000'] = df['sqft']/1000.0\n", "df['age10'] = df['age']/10.0\n", "df['price100k'] = df['price']/1e5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = df[['sqft1000', 'bdrms', 'age10']].values\n", "y = df['price100k'].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = Sequential()\n", "model.add(Dense(1, input_dim=3))\n", "model.compile(Adam(learning_rate=0.1), 'mean_squared_error')\n", "model.fit(X_train, y_train, epochs=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# once you're satisfied with training, check the R2score on the test set\n", "\n", "y_train_pred = model.predict(X_train)\n", "y_test_pred = model.predict(X_test)\n", "\n", "print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n", "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.fit(X_train, y_train, epochs=40, verbose=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# once you're satisfied with training, check the R2score on the test set\n", "\n", "y_train_pred = model.predict(X_train)\n", "y_test_pred = model.predict(X_test)\n", "\n", "print(\"The R2 score on the Train set 
is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n", "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2\n", "\n", "Your boss was extremely happy with your work on the housing price prediction model and decided to entrust you with a more challenging task. They've seen a lot of people leave the company recently and they would like to understand why that's happening. They have collected historical data on employees and they would like you to build a model that is able to predict which employee will leave next. They would like a model that is better than random guessing. They also prefer false negatives to false positives in this first phase. Fields in the dataset include:\n", "\n", "- Employee satisfaction level\n", "- Last evaluation\n", "- Number of projects\n", "- Average monthly hours\n", "- Time spent at the company\n", "- Whether they have had a work accident\n", "- Whether they have had a promotion in the last 5 years\n", "- Department\n", "- Salary\n", "- Whether the employee has left\n", "\n", "Your goal is to predict the binary outcome variable `left` using the rest of the data. Since the outcome is binary, this is a classification problem. Here are some things you may want to try out:\n", "\n", "1. load the dataset at ../data/HR_comma_sep.csv, inspect it with `.head()`, `.info()` and `.describe()`.\n", "- Establish a benchmark: what would your accuracy score be if you predicted that everyone stays?\n", "- Check if any feature needs rescaling. You may plot a histogram of the feature to decide which rescaling method is more appropriate.\n", "- convert the categorical features into binary dummy columns. You will then have to combine them with the numerical features using `pd.concat`.\n", "- do the usual train/test split with a 20% test size\n", "- play around with learning rate and optimizer\n", "- check the confusion matrix, precision and recall\n", "- check if you still get the same results if you use a 5-Fold cross validation on all the data\n", "- Is the model good enough for your boss?\n", "\n", "As you will see in this exercise, a logistic regression model is not good enough to help your boss. In the next chapter we will learn how to go beyond linear models.\n", "\n", "This dataset comes from https://www.kaggle.com/ludobenistant/hr-analytics/ and is released under [CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/)."
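 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the boss prefers false negatives to false positives in this first phase, it helps to keep the classification threshold in mind. The sketch below is not part of the original solution and uses made-up probabilities: raising the threshold produces fewer positive predictions, which means fewer false positives at the cost of more false negatives." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hypothetical illustration with made-up probabilities (not part of the\n", "# original solution): raising the decision threshold trades recall for\n", "# precision, i.e. fewer false positives but more false negatives.\n", "probs = np.array([0.2, 0.45, 0.55, 0.9])\n", "print((probs > 0.5).astype(int))  # default threshold\n", "print((probs > 0.7).astype(int))  # stricter threshold -> fewer positives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by loading and inspecting the data."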
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the dataset at ../data/HR_comma_sep.csv, inspect it with `.head()`, `.info()` and `.describe()`.\n", "\n", "df = pd.read_csv('../data/HR_comma_sep.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Establish a benchmark: what would be your accuracy score if you predicted everyone stay?\n", "\n", "df.left.value_counts() / len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting 0 all the time would yield an accuracy of 76%" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check if any feature needs rescaling.\n", "# You may plot a histogram of the feature to decide which rescaling method is more appropriate.\n", "df['average_montly_hours'].plot(kind='hist');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['average_montly_hours_100'] = df['average_montly_hours']/100.0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['average_montly_hours_100'].plot(kind='hist');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['time_spend_company'].plot(kind='hist');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert the categorical features into binary dummy columns.\n", "# You will then have to combine them with\n", "# the numerical features using `pd.concat`.\n", "df_dummies = pd.get_dummies(df[['sales', 'salary']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_dummies.head()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = pd.concat([df[['satisfaction_level', 'last_evaluation', 'number_project',\n", " 'time_spend_company', 'Work_accident',\n", " 'promotion_last_5years', 'average_montly_hours_100']],\n", " df_dummies], axis=1).values\n", "y = df['left'].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# do the usual train/test split with a 20% test size\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# play around with learning rate and optimizer\n", "\n", "model = Sequential()\n", "model.add(Dense(1, input_dim=20, activation='sigmoid'))\n", "model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.fit(X_train, y_train, epochs=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
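"outputs": [], "source": [ "# Not part of the original solution: a quick check of accuracy on the\n", "# training and test sets. Since the model was compiled with\n", "# metrics=['accuracy'], evaluate returns [loss, accuracy].\n", "train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)\n", "test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)\n", "print(\"Train accuracy: {:0.3f}, Test accuracy: {:0.3f}\".format(train_acc, test_acc))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 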
"outputs": [], "source": [ "y_test_pred = model.predict_classes(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix, classification_report" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def pretty_confusion_matrix(y_true, y_pred, labels=[\"False\", \"True\"]):\n", " cm = confusion_matrix(y_true, y_pred)\n", " pred_labels = ['Predicted '+ l for l in labels]\n", " df = pd.DataFrame(cm, index=labels, columns=pred_labels)\n", " return df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the confusion matrix, precision and recall\n", "\n", "pretty_confusion_matrix(y_test, y_test_pred, labels=['Stay', 'Leave'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(classification_report(y_test, y_test_pred))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.wrappers.scikit_learn import KerasClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check if you still get the same results if you use a 5-Fold cross validation on all the data\n", "\n", "def build_logistic_regression_model():\n", " model = Sequential()\n", " model.add(Dense(1, input_dim=20, activation='sigmoid'))\n", " model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])\n", " return model\n", "\n", "model = KerasClassifier(build_fn=build_logistic_regression_model,\n", " epochs=10, verbose=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import KFold, cross_val_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv = KFold(5, shuffle=True)\n", "scores = cross_val_score(model, X, y, cv=cv)\n", "\n", "print(\"The cross validation accuracy is {:0.4f} ± {:0.4f}\".format(scores.mean(), scores.std()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scores" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Is the model good enough for your boss?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No, the model is not good enough for my boss, since it performs no better than the benchmark." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 2 }