CVML-MachineLearning/solutions/3 Machine Learning Exercises Solution.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning Exercises Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1\n",
    "\n",
    "You've just been hired at a real estate investment firm and they would like you to build a model for pricing houses. You are given a dataset that contains data for house prices and a few features like number of bedrooms, size in square feet and age of the house. Let's see if you can build a model that is able to predict the price. In this exercise we extend what we have learned about linear regression to a dataset with more than one feature. Here are the steps to complete it:\n",
    "\n",
    "1. Load the dataset ../data/housing-data.csv\n",
    "- plot the histograms for each feature\n",
    "- create 2 variables called X and y: X shall be a matrix with 3 columns (sqft,bdrms,age) and y shall be a vector with 1 column (price)\n",
    "- create a linear regression model in Keras with the appropriate number of inputs and output\n",
    "- split the data into train and test with a 20% test size\n",
    "- train the model on the training set and check its accuracy on training and test set\n",
    "- how's your model doing? Is the loss growing smaller?\n",
    "- try to improve your model with these experiments:\n",
    "    - normalize the input features with one of the rescaling techniques mentioned above\n",
    "    - use a different value for the learning rate of your model\n",
    "    - use a different optimizer\n",
    "- once you're satisfied with training, check the R2score on the test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset ../data/housing-data.csv\n",
    "df = pd.read_csv('../data/housing-data.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the histograms for each feature\n",
    "plt.figure(figsize=(15, 5))\n",
    "for i, feature in enumerate(df.columns):\n",
    "    plt.subplot(1, 4, i+1)\n",
    "    df[feature].plot(kind='hist', title=feature)\n",
    "    plt.xlabel(feature)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create 2 variables called X and y:\n",
    "# X shall be a matrix with 3 columns (sqft,bdrms,age)\n",
    "# and y shall be a vector with 1 column (price)\n",
    "X = df[['sqft', 'bdrms', 'age']].values\n",
    "y = df['price'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.models import Sequential\n",
    "from tensorflow.keras.layers import Dense\n",
    "from tensorflow.keras.optimizers import Adam"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a linear regression model in Keras\n",
    "# with the appropriate number of inputs and output\n",
    "model = Sequential()\n",
    "model.add(Dense(1, input_shape=(3,)))\n",
    "model.compile(Adam(learning_rate=0.8), 'mean_squared_error')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split the data into train and test with a 20% test size\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# train the model on the training set and check its accuracy on training and test set\n",
    "# how's your model doing? Is the loss growing smaller?\n",
    "model.fit(X_train, y_train, epochs=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import r2_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the R2score on training and test set (probably very bad)\n",
    "\n",
    "y_train_pred = model.predict(X_train)\n",
    "y_test_pred = model.predict(X_test)\n",
    "\n",
    "print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
    "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try to improve your model with these experiments:\n",
    "#     - normalize the input features with one of the rescaling techniques mentioned above\n",
    "#     - use a different value for the learning rate of your model\n",
    "#     - use a different optimizer\n",
    "df['sqft1000'] = df['sqft']/1000.0\n",
    "df['age10'] = df['age']/10.0\n",
    "df['price100k'] = df['price']/1e5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df[['sqft1000', 'bdrms', 'age10']].values\n",
    "y = df['price100k'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = Sequential()\n",
    "model.add(Dense(1, input_dim=3))\n",
    "model.compile(Adam(learning_rate=0.1), 'mean_squared_error')\n",
    "model.fit(X_train, y_train, epochs=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# once you're satisfied with training, check the R2score on the test set\n",
    "\n",
    "y_train_pred = model.predict(X_train)\n",
    "y_test_pred = model.predict(X_test)\n",
    "\n",
    "print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
    "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(X_train, y_train, epochs=40, verbose=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# once you're satisfied with training, check the R2score on the test set\n",
    "\n",
    "y_train_pred = model.predict(X_train)\n",
    "y_test_pred = model.predict(X_test)\n",
    "\n",
    "print(\"The R2 score on the Train set is:\\t{:0.3f}\".format(r2_score(y_train, y_train_pred)))\n",
    "print(\"The R2 score on the Test set is:\\t{:0.3f}\".format(r2_score(y_test, y_test_pred)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2\n",
    "\n",
    "Your boss was extremely happy with your work on the housing price prediction model and decided to entrust you with a more challenging task. They've seen a lot of people leave the company recently and they would like to understand why that's happening. They have collected historical data on employees and they would like you to build a model that is able to predict which employee will leave next. The would like a model that is better than random guessing. They also prefer false negatives than false positives, in this first phase. Fields in the dataset include:\n",
    "\n",
    "- Employee satisfaction level\n",
    "- Last evaluation\n",
    "- Number of projects\n",
    "- Average monthly hours\n",
    "- Time spent at the company\n",
    "- Whether they have had a work accident\n",
    "- Whether they have had a promotion in the last 5 years\n",
    "- Department\n",
    "- Salary\n",
    "- Whether the employee has left\n",
    "\n",
    "Your goal is to predict the binary outcome variable `left` using the rest of the data. Since the outcome is binary, this is a classification problem. Here are some things you may want to try out:\n",
    "\n",
    "1. load the dataset at ../data/HR_comma_sep.csv, inspect it with `.head()`, `.info()` and `.describe()`.\n",
    "- Establish a benchmark: what would be your accuracy score if you predicted everyone stay?\n",
    "- Check if any feature needs rescaling. You may plot a histogram of the feature to decide which rescaling method is more appropriate.\n",
    "- convert the categorical features into binary dummy columns. You will then have to combine them with the numerical features using `pd.concat`.\n",
    "- do the usual train/test split with a 20% test size\n",
    "- play around with learning rate and optimizer\n",
    "- check the confusion matrix, precision and recall\n",
    "- check if you still get the same results if you use a 5-Fold cross validation on all the data\n",
    "- Is the model good enough for your boss?\n",
    "\n",
    "As you will see in this exercise, the a logistic regression model is not good enough to help your boss. In the next chapter we will learn how to go beyond linear models.\n",
    "\n",
    "This dataset comes from https://www.kaggle.com/ludobenistant/hr-analytics/ and is released under [CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the dataset at ../data/HR_comma_sep.csv, inspect it with `.head()`, `.info()` and `.describe()`.\n",
    "\n",
    "df = pd.read_csv('../data/HR_comma_sep.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Establish a benchmark: what would be your accuracy score if you predicted everyone stay?\n",
    "\n",
    "df.left.value_counts() / len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting 0 all the time would yield an accuracy of 76%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check if any feature needs rescaling.\n",
    "# You may plot a histogram of the feature to decide which rescaling method is more appropriate.\n",
    "df['average_montly_hours'].plot(kind='hist');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['average_montly_hours_100'] = df['average_montly_hours']/100.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['average_montly_hours_100'].plot(kind='hist');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['time_spend_company'].plot(kind='hist');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert the categorical features into binary dummy columns.\n",
    "# You will then have to combine them with\n",
    "# the numerical features using `pd.concat`.\n",
    "df_dummies = pd.get_dummies(df[['sales', 'salary']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_dummies.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = pd.concat([df[['satisfaction_level', 'last_evaluation', 'number_project',\n",
    "                   'time_spend_company', 'Work_accident',\n",
    "                   'promotion_last_5years', 'average_montly_hours_100']],\n",
    "               df_dummies], axis=1).values\n",
    "y = df['left'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# do the usual train/test split with a 20% test size\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# play around with learning rate and optimizer\n",
    "\n",
    "model = Sequential()\n",
    "model.add(Dense(1, input_dim=20, activation='sigmoid'))\n",
    "model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(X_train, y_train, epochs=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_test_pred = model.predict_classes(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pretty_confusion_matrix(y_true, y_pred, labels=[\"False\", \"True\"]):\n",
    "    cm = confusion_matrix(y_true, y_pred)\n",
    "    pred_labels = ['Predicted '+ l for l in labels]\n",
    "    df = pd.DataFrame(cm, index=labels, columns=pred_labels)\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the confusion matrix, precision and recall\n",
    "\n",
    "pretty_confusion_matrix(y_test, y_test_pred, labels=['Stay', 'Leave'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(y_test, y_test_pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.wrappers.scikit_learn import KerasClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check if you still get the same results if you use a 5-Fold cross validation on all the data\n",
    "\n",
    "def build_logistic_regression_model():\n",
    "    model = Sequential()\n",
    "    model.add(Dense(1, input_dim=20, activation='sigmoid'))\n",
    "    model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])\n",
    "    return model\n",
    "\n",
    "model = KerasClassifier(build_fn=build_logistic_regression_model,\n",
    "                        epochs=10, verbose=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import KFold, cross_val_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(5, shuffle=True)\n",
    "scores = cross_val_score(model, X, y, cv=cv)\n",
    "\n",
    "print(\"The cross validation accuracy is {:0.4f} ± {:0.4f}\".format(scores.mean(), scores.std()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Is the model good enough for your boss?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No, the model is not good enough for my boss, since it performs no better than the benchmark."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}