{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this practical session is to introduce Python and Scikit-Learn, a library for running different machine learning models and use them for solving AI/data science problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression and k-NN: doing it ourselves\n", "\n", "We are going to write a first program to compute the coefficients of a regression line and a second program to compute the k-nearest neighbours (k-NN) of a given point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear regression\n", "\n", "We'll start with a simple data set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "rng = np.random.RandomState(42)\n", "x = 10 * rng.rand(50)\n", "y = 2 * x - 1 + rng.randn(50)\n", "plt.scatter(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\rightarrow$ Print the values of x and y. What's the type of x and y? What's their relation? What's the level of noise in this relation?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "We now want to find the relation between x and y solely from the observed points. We can postulate different possible relations (models): linear, polynomial of degree 2, spline, ... and learn the parameters of the models from the data. We'll start here with the most simple model, the linear one, which assumes that the relation takes the form:\n", "\n", "\\begin{equation*}\n", " y = ax + b,\n", "\\end{equation*}\n", "\n", "where $a$ and $b$ are real numbers, respectively called the slope and the intercept (or bias). Learning the model thus amounts to learn the values of $a$ and $b$ from the observed data (we will denote by $\\hat{a}$ and $\\hat{b}$ the values of $a$ and $b$ learned from the observed data).\n", "\n", "We nevertheless need to guide the learning by stating some properties on the desired line. In linear regression, the property we require is that the line learned should be as close as possible to the obersved data. By \"as close as\", we mean that the sum of squared residuals should be as small as possible (this method is sometimes referred to as OLS for Ordinary Least Square). The residual is defined as the error in approximating y by a linear relation. If we have observed $n$ points (xi,yi)$1 \\le i \\le n$, the sum of residuals amounts to:\n", "\n", "\\begin{equation*}\n", " \\sum_{i=1}^{n}(y_i - a x_i + b)^2.\n", "\\end{equation*}\n", "\n", "The above optimization problem leads to the following solution (we'll see that later in the course) where $\\bar{x}$ and $\\bar{y}$ denote the means of x and y:\n", "\n", "\\begin{align}\n", " \\hat{a} = & \\frac{\\sum_{i=1}^{n}(x_i - \\bar{x}) (y_i - \\bar{y})}{\\sum_{i=1}^{n}(x_i - \\bar{x})^2}, \\\\\n", " \\hat{b} = & \\bar{y} - \\hat{a} \\bar{x}.\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\rightarrow$ Write a program that computes the coefficients of the line (slope and intercept) and draw the line together with the orginal data points.\n", "\n", "$\\rightarrow$ How to use the above results to predict the y value of new points for which we only know x?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### k-Nearest Neighbours (k-NN) on Iris dataset\n", "\n", "We are now going to implement a code that can predict a class label of a new example given known class labels of several examples. The examples with the known labels constitute the training set.\n", "\n", "To do so we are going to consider the Iris dataset, which consists of 4-dimensional examples (the dimensions correspond to sepal length, sepal width, petal length and petal width) associated with a class label from three possible iris species. \n", "\n", "For each new example with unknown class label, the $k$-NN algorithm consists in retrieving the $k$ nearest neighbours in the training set (using in our case their 4-dimensional representation) and in assigning to the new example the majority class in the set of $k$ nearest neigbours.\n", "\n", "$\\rightarrow$ You are asked to write a code that can take that can predict the class label of examples which are not part of the training set (this set of examples will be referred to as the test set in the remainder). The value of $k$ should be an argument of your program, which you can play with. You'll use the standard Eucliden distance to measure the distance between point and select the $k$ closest ones.\n", "\n", "The Iris dataset can be loaded and split in train/test parts using the following commands:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "dataset = load_iris()\n", "X_iris = dataset.data\n", "y_iris = dataset.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,\n", " random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The same (and more) using Scikit-Learn\n", "\n", "The Scikit-Learn API is designed with the following guiding principles in mind, as outlined in the Scikit-Learn API paper:\n", "\n", "* Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.\n", "\n", "* Inspection: All specified parameter values are exposed as public attributes.\n", "\n", "* Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.\n", "\n", "* Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.\n", "\n", "* Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.\n", "\n", "In practice, these principles make Scikit-Learn very easy to use, once the basic principles are understood. 
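{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on to the estimator API, here is a minimal from-scratch sketch for the k-NN exercise above (knn_predict and its argument names are our own; it assumes the Xtrain, ytrain and Xtest arrays produced by the split above, and the np import from the first cell). Try writing your own version before reading it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "def knn_predict(Xtrain, ytrain, Xtest, k=3):\n", "    ypred = []\n", "    for point in Xtest:\n", "        # Euclidean distances from this point to every training example\n", "        dists = np.sqrt(((Xtrain - point) ** 2).sum(axis=1))\n", "        # indices of the k closest training examples\n", "        nearest = np.argsort(dists)[:k]\n", "        # majority vote among their labels\n", "        ypred.append(Counter(ytrain[nearest]).most_common(1)[0][0])\n", "    return np.array(ypred)\n", "\n", "print(knn_predict(Xtrain, ytrain, Xtest, k=3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "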
Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.\n", "\n", "Most commonly, the steps in using the Scikit-Learn estimator API are as follows (we will step through several detailed examples in the sections that follow):\n", "\n", "* Choose a class of model by importing the appropriate estimator class from Scikit-Learn.\n", "* Choose model hyperparameters by instantiating this class with desired values.\n", "* Arrange data into a features matrix and target vector (see below).\n", "* Fit the model to your data by calling the fit() method of the model instance.\n", "* Apply the model to new data:\n", "    * For supervised learning, we often predict labels for unknown data using the predict() method.\n", "    * For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.\n", "\n", "We will now step through several simple examples of applying supervised and unsupervised learning methods.\n", "\n", "### Linear regression\n", "\n", "As an example of this process, let's consider a simple linear regression, that is, the common case of fitting a line to $(x, y)$ data. We will use the same simple data as before for our regression example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "rng = np.random.RandomState(42)\n", "x = 10 * rng.rand(50)\n", "y = 2 * x - 1 + rng.randn(50)\n", "plt.scatter(x, y)\n", "print(\"x:\", x)\n", "print(\"y:\", y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now walk through the process of building a linear regression model (the process is the same for all ML models).\n", "\n", "1. Choose a class of model\n", "\n", "In Scikit-Learn, every class of model is represented by a Python class. So, for example, if we would like to compute a simple linear regression model, we can import the linear regression class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Choose model hyperparameters\n", "\n", "An important point is that a class of model is not the same as an instance of a model.\n", "\n", "Once we have decided on our model class, there are still some options open to us. Depending on the model class we are working with, we might need to answer one or more questions like the following:\n", "\n", "* Would we like to fit for the offset (i.e., y-intercept)?\n", "* Would we like the model to be normalized?\n", "* Would we like to preprocess our features to add model flexibility?\n", "* What degree of regularization would we like to use in our model?\n", "* How many model components would we like to use?\n", "\n", "These are examples of the important choices that must be made once the model class is selected. These choices are often represented as hyperparameters, or parameters that must be set before the model is fit to data. In Scikit-Learn, hyperparameters are chosen by passing values at model instantiation." ] },
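{ "cell_type": "markdown", "metadata": {}, "source": [ "As a small illustration of this convention, every model class exposes its hyperparameters as constructor arguments (the classes and values below are just examples for illustration, not ones we will use in this session):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hyperparameters are passed when the model class is instantiated\n", "from sklearn.linear_model import Ridge\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "ridge = Ridge(alpha=0.5)                    # regularization strength\n", "knn = KNeighborsClassifier(n_neighbors=5)   # number of neighbours\n", "print(ridge)\n", "print(knn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "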
We will explore how you can quantitatively motivate the choice of hyperparameters later.\n", "\n", "For our linear regression example, we can instantiate the LinearRegression class and specify that we would like to fit the intercept using the fit_intercept hyperparameter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = LinearRegression(fit_intercept=True)\n", "model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\rightarrow$ Explain the fit_intercept hyperparameter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that when the model is instantiated, the only action is the storing of these hyperparameter values. In particular, we have not yet applied the model to any data: the Scikit-Learn API makes the distinction between the choice of a model and its application to data very clear." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Arrange data into a features matrix and target vector\n", "\n", "Scikit-Learn requires a two-dimensional features matrix and a one-dimensional target array. Here our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x to make it a matrix of size [n_samples, n_features]. In this case, this amounts to a simple reshaping of the one-dimensional array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = x[:, np.newaxis]\n", "X.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Fit the model to your data\n", "\n", "Now it is time to apply our model to data. This can be done with the fit() method of the model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# learned parameters (note the trailing underscores)\n", "print(model.coef_)\n", "print(model.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These two parameters represent the slope and intercept of the simple linear fit to the data. Comparing to the data definition, we see that they are very close to the input slope of 2 and intercept of -1.\n", "\n", "5. Predict labels for unknown data\n", "\n", "Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set. In Scikit-Learn, this can be done using the predict() method. For the sake of this example, our \"new data\" will be a grid of x values, and we will ask what y values the model predicts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xfit = np.linspace(-1, 11)\n", "print(\"xfit:\", xfit)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\rightarrow$ What's the effect of linspace?"
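] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a hint, here is a minimal illustration of np.linspace with an explicit num argument (the cell above relies on the default, num=50):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# np.linspace(start, stop, num) returns num evenly spaced values,\n", "# with both endpoints included\n", "print(np.linspace(-1, 11, num=5))  # [-1.  2.  5.  8. 11.]"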
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we need to coerce these x values into a [n_samples, n_features] features matrix, after which we can feed it to the model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Xfit = xfit[:, np.newaxis]\n", "yfit = model.predict(Xfit)\n", "print(\"Xfit: \",Xfit)\n", "print(\"yfit: \",yfit)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\rightarrow$ What's the effect of xfit[:, np.newaxis]? Of model.predict(Xfit)? What's the ype of Xfit ? Of yfit ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's visualize the results by plotting first the raw data, and then this model fit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(x, y)\n", "plt.plot(xfit, yfit);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### k-Nearest Neighbours on Iris dataset\n", "\n", "Let's take a look at another example of this process, using the Iris dataset. Our question will be this: given a model trained on a portion of the Iris data, how well can we predict the remaining labels?\n", "\n", "For this task, we will use an extremely simple algorithm known as the $k$-NN algorithm (see above). We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set. This could be done by hand, but it is more convenient to use the train_test_split utility function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "dataset = load_iris()\n", "X_iris = dataset.data\n", "y_iris = dataset.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,\n", " random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\rightarrow$ What's the use of random_state?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\rightarrow$ With the data arranged, we can follow the above recipe to predict the labels using the k-NN algorithm. Write the corresponding code:\n", "\n", "from sklearn.??? import ??? # 1. choose model class\n", "\n", "model = ??? # 2. instantiate model\n", "\n", "model.fit(Xtrain, ytrain) # 3. fit model to data\n", "\n", "??? = model.predict(Xtest) # 4. predict on new data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can use the accuracy_score utility to see the fraction of predicted labels that match their true value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "accuracy_score(ytest, y_model)\n", "\n", "#compute the confusion matrix\n", "from sklearn.metrics import confusion_matrix\n", "confusion_matrix(ytest, y_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\rightarrow$ What is accuracy? Give its formula and explain it.\n", "\n", "$\\rightarrow$ What's the confusion matrix of the k-NN algorithm on the Iris dataset? Write a code to compute and visualize this matrix." 
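] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, here is one possible completion of the recipe above, together with a simple matplotlib heat-map visualization of the confusion matrix (a sketch; the choice k=3 is arbitrary, so try other values):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier  # 1. choose model class\n", "\n", "model = KNeighborsClassifier(n_neighbors=3)  # 2. instantiate model\n", "model.fit(Xtrain, ytrain)                    # 3. fit model to data\n", "y_model = model.predict(Xtest)               # 4. predict on new data\n", "\n", "print(\"accuracy:\", accuracy_score(ytest, y_model))\n", "\n", "# visualize the confusion matrix as a heat map\n", "mat = confusion_matrix(ytest, y_model)\n", "plt.imshow(mat, cmap='Blues')\n", "plt.xticks(range(3), dataset.target_names)\n", "plt.yticks(range(3), dataset.target_names)\n", "plt.xlabel('predicted label')\n", "plt.ylabel('true label')\n", "plt.colorbar()"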
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }