Logistic Regression
Logistic regression is a statistical technique used to model and predict the probability of a binary outcome based on one or more independent variables.
Goal
By the end of this lesson, you should be able to:
- Write the cost function of logistic regression
- Use logistic regression to calculate the probabilities of a binary classification
- Train a logistic regression model
- Split data into training, validation, and testing sets
Keywords: classification, hypothesis, probability, logistic function, coefficients, binary classification, cost function, gradient descent, update function, matrix notation
Introduction
The problem of classification deals with categorical data. In this problem, we wish to identify whether a given data point belongs to a particular class or category. For example, given a text message from an email, we would like to classify whether it is spam or not spam. Another example: given some measurements of cancer cells, we wish to classify whether a cell is benign or malignant. In this section we will learn how logistic regression can be used to solve this classification problem.
Hypothesis Function
Let's take the example of a breast cancer classification problem. Let's say that, depending on the cell size, an expert can identify whether a cell is benign or malignant. We can plot the data as something like the following figure.
On the y-axis, a value of 1 means the cell is malignant, while a value of 0 means it is benign. The x-axis can be considered as the normalized size of the cell, with mean 0 and standard deviation of 1 (recall z-normalization).
If we can model this plot as a function $p(x)$, we can set a criterion to classify the cells. For example, we will predict that a cell is malignant if $p(x) \geq 0.5$, and otherwise that it is benign. This means we need a function that models the data in a step-wise manner and fulfills the following:

$$0 \leq p(x) \leq 1$$

where $p(x)$ is the probability that a cell with feature $x$ is a malignant cell.
One of the functions that has this step-wise shape and the above properties is the logistic function. A logistic function can be written as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The plot of a logistic function looks like the following.
import numpy as np
import matplotlib.pyplot as plt

# values of z from -10 to 10
z = np.array(range(-10, 11))
# logistic function of z
y = 1 / (1 + np.exp(-z))
plt.plot(z, y)
plt.show()
We can write our hypothesis as follows.

$$p(x) = \frac{1}{1 + e^{-z(x)}}$$
where $z$ is a function of $x$. What should this function be? We can use our linear model of a straight line and transform it into a logistic function if we use the following transformation.

$$z(x) = \ln\left(\frac{p(x)}{1 - p(x)}\right)$$
When $z(x) = \beta_0 + \beta_1 x$, the above equation is simply the straight line equation of linear regression.

$$\ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x$$
This is the case when we only have one feature $x$. If we have more than one feature, we write it as follows.

$$\ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$$
Note that in these notes we tend to omit the hat symbol that indicates estimated parameters, which we used in the previous notes. We will simply write the estimated parameters as $\beta$ instead of $\hat{\beta}$.
The above relationship shows that we can map the output of linear regression into a new function with a value between 0 and 1. This new function $p(x)$ can be considered as the estimated probability that $y = 1$ on input $x$. For example, if $p(x) = 0.7$, this means there is a 70% chance that the cell is malignant. We can then use the following boundary conditions:
- $y = 1$ (malignant) if $p(x) \geq 0.5$
- $y = 0$ (benign) if $p(x) < 0.5$
The above conditions also mean that we can classify $y = 1$ when $\beta^T x \geq 0$ and $y = 0$ when $\beta^T x < 0$, since $p(x) \geq 0.5$ exactly when $z = \beta^T x \geq 0$. We can draw these boundary conditions.
In the figure above, we indicated the predicted label with the orange colour. We see that when $\beta^T x \geq 0$, the data are marked as $y = 1$ (orange). On the other hand, when $\beta^T x < 0$, the data are marked as $y = 0$ (orange). The thick black line shows the decision boundary for this particular example.
How do we get this decision boundary? Once we have found the estimated values of $\beta$, we can find the value of $x$ which gives $\beta^T x = 0$. You will work on computing the parameters in the problem set. For now, let's assume that you have managed to find the values of $\beta_0$ and $\beta_1$. The equation $\beta^T x = 0$ can be written as follows.

$$\beta_0 + \beta_1 x = 0$$
We can then substitute the values of $\beta_0$ and $\beta_1$ into the equation and solve for $x$.

$$x = -\frac{\beta_0}{\beta_1}$$
From the figure above, this fits where the thick line is, which is at around 0.3.
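As a quick illustration, we can compute this boundary in code. The coefficient values below are made up for illustration only; the actual fitted values come from training on the data.

# hypothetical coefficients (illustration only, not the fitted values)
beta_0 = -0.6
beta_1 = 2.0

# decision boundary: beta_0 + beta_1 * x = 0  =>  x = -beta_0 / beta_1
x_boundary = -beta_0 / beta_1
print(x_boundary)

0.3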
Cost Function
Similar to linear regression, our purpose here is to find the parameters $\beta$. To do so, we will have to minimize some cost function using an optimization algorithm.
For logistic regression, we will choose the following cost function.

$$J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \begin{cases} -\log(p(x^i)) & \text{if } y^i = 1 \\ -\log\left(1 - p(x^i)\right) & \text{if } y^i = 0 \end{cases}$$
We can try to understand the term inside the bracket intuitively. Let's see the case when $y = 1$. In this case, the cost term is given by:

$$-\log(p(x))$$

The cost is 0 if $p(x) = 1$, because $-\log(p)$ is 0 when $p = 1$. Moreover, as $p(x) \to 0$, the cost approaches $\infty$. See the plot from Wolfram Alpha.
On the other hand, when $y = 0$, the cost term is given by:

$$-\log(1 - p(x))$$

In this case, the cost is 0 when $p(x) = 0$, but it approaches $\infty$ as $p(x) \to 1$. See the plot from Wolfram Alpha.
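As a quick numerical illustration (using the natural logarithm), if the true label is $y = 1$ and the model predicts $p(x) = 0.9$, the cost term is $-\log(0.9) \approx 0.105$, which is small. If instead it predicts $p(x) = 0.1$, the cost term is $-\log(0.1) \approx 2.303$, which is large. Confidently wrong predictions are therefore penalized heavily.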
We can write the overall cost function for all the data points from $i = 1$ to $m$ as follows.

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log(p(x^i)) + (1 - y^i) \log\left(1 - p(x^i)\right) \right]$$
Notice that when $y^i = 1$, the function reduces to

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \log(p(x^i))$$

and when $y^i = 0$, the function reduces to

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \log\left(1 - p(x^i)\right)$$
Gradient Descent
We can find the parameters $\beta$ by again using the gradient descent algorithm to perform:

$$\min_{\beta} J(\beta)$$
The update function for the parameters is given by

$$\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)$$
The derivative of the cost function is given by

$$\frac{\partial}{\partial \beta_j} J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( p(x^i) - y^i \right) x_j^i$$
See the appendix for the derivation. We can substitute this in to get the following update function.

$$\beta_j = \beta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( p(x^i) - y^i \right) x_j^i$$
Matrix Notation
The above equations can be written in matrix notation so that we can perform a vectorized computation.
Hypothesis Function
Recall that our hypothesis can be written as:

$$p(x) = \frac{1}{1 + e^{-z}}$$
where

$$z = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n$$

We can write this equation as a vector multiplication as follows.

$$z = \mathbf{b}^T \mathbf{x}$$

and

$$p(x) = \frac{1}{1 + e^{-\mathbf{b}^T \mathbf{x}}}$$

where

$$\mathbf{b} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}$$

and

$$\mathbf{x} = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$$
Recall that this is for a single data point with $n$ features. Note that a constant 1 is prepended to the feature vector so that $\beta_0$ acts as the intercept term. The result of this vector multiplication is a single number for that one data point with $n$ features.
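To make this concrete, the cell below sketches this computation for a single data point with two features. The coefficient and feature values are made up for illustration only.

import numpy as np

# hypothetical coefficients b = [beta_0, beta_1, beta_2] (illustration only)
b = np.array([-0.5, 1.2, 0.8])

# one data point with two features; a 1 is prepended for the intercept term
x = np.array([1.0, 0.3, -1.1])

# z is a single number, and p is the estimated probability that y = 1
z = b @ x
p = 1 / (1 + np.exp(-z))
print(z, p)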
Cost Function
Recall that the cost function for all the data points from $i = 1$ to $m$ is as follows.

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log(p(x^i)) + (1 - y^i) \log\left(1 - p(x^i)\right) \right]$$
Notice that when $y^i = 1$, the function reduces to

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \log(p(x^i))$$

and when $y^i = 0$, the function reduces to

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \log\left(1 - p(x^i)\right)$$
How can we vectorize this computation in Python? NumPy provides the function np.where(), which we can use when the computation to perform depends on certain conditions. For example, if we have an input x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], we can compute the value of y depending on whether each element is even or odd. Let's say we will square the value if it is even and leave the value as it is if it is odd. The cell below shows how the code can be written.
import numpy as np
# create a list from 0 to 9
x = list(range(10))
# using np.where()
y = np.where(np.mod(x,2) == 0, np.power(x,2), x)
print(y)
[ 0 1 4 3 16 5 36 7 64 9]
We can thus use np.where() to calculate the cost function depending on whether $y$ is 1 or 0, using the two equations above. The summation in the overall cost function can be computed using np.sum().
An example of using np.sum() can be seen in the cell below.
# create a list from 0 to 9
x = list(range(10))
# using np.sum() to sum up all the numbers in the vectors
y = np.sum(x)
print(y)
45
If you are dealing with a matrix, you can specify the axis over which np.sum() operates, i.e. whether you want to sum over the rows or over the columns. By default it sums over the rows, which is axis=0 in NumPy.
x = [[1, 2, 3], [4, 5, 6]]
print(np.sum(x, axis=0))
[5 7 9]
In the above code we sum over the rows, so we get three values, one for each column. If we wish to sum over the columns instead, we do it as shown below.
x = [[1, 2, 3], [4, 5, 6]]
print(np.sum(x, axis=1))
[ 6 15]
In the above output, we see that 6 is the sum of [1, 2, 3] and 15 is the sum of [4, 5, 6].
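Putting np.where() and np.sum() together, the cell below sketches one possible way to compute the cost function in a vectorized manner. The function name and variable names are our own choices for illustration, and the toy data at the end are made up.

import numpy as np

def compute_cost(beta, X, y):
    # X is an m x (n+1) feature matrix (first column all ones),
    # y is an m x 1 target column vector, beta is an (n+1) x 1 column vector
    m = X.shape[0]
    # hypothesis p(x) for all m data points
    p = 1 / (1 + np.exp(-(X @ beta)))
    # use -log(p) where y == 1 and -log(1 - p) where y == 0
    cost_terms = np.where(y == 1, -np.log(p), -np.log(1 - p))
    return np.sum(cost_terms) / m

# toy example: two data points, one feature plus the intercept column
X = np.array([[1.0, 0.5], [1.0, -0.2]])
y = np.array([[1], [0]])
beta = np.array([[0.1], [0.3]])
print(compute_cost(beta, X, y))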
Gradient Descent Update
Recall that the update function in our gradient descent calculation was the following.

$$\beta_j = \beta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( p(x^i) - y^i \right) x_j^i$$
We can write this as a vectorized calculation, particularly because we have a summation of multiplication terms. This sounds like a good candidate for matrix multiplication. Recall that our hypothesis for the $m$ data points can be written as a column vector.

$$\mathbf{p} = \begin{bmatrix} p(x^1) \\ p(x^2) \\ \vdots \\ p(x^m) \end{bmatrix}$$
Similarly, $\mathbf{y}$, which contains the actual target values from the training set, can be written as a column vector of size $m \times 1$. Therefore, we can do the calculation element-wise for the following term.

$$\mathbf{p} - \mathbf{y}$$
The result is a column vector too.
The features can be arranged as a matrix as shown below.

$$\mathbf{X} = \begin{bmatrix} 1 & x_1^1 & \ldots & x_n^1 \\ 1 & x_1^2 & \ldots & x_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^m & \ldots & x_n^m \end{bmatrix}$$
We can do the multiplication and the summation as the matrix multiplication in the following expression.

$$\mathbf{X}^T (\mathbf{p} - \mathbf{y})$$
Note that we transpose the matrix $\mathbf{X}$ so that it has the shape $(n+1) \times m$. In this way, we can do matrix multiplication with $(\mathbf{p} - \mathbf{y})$, which has the shape $m \times 1$.
The rest of the computation is just multiplication by some constants. So we can write our update function as follows.

$$\mathbf{b} = \mathbf{b} - \frac{\alpha}{m} \mathbf{X}^T (\mathbf{p} - \mathbf{y})$$
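The cell below sketches how this vectorized update can be used inside a gradient descent loop. The function name, learning rate, number of iterations, and toy data are illustrative choices rather than values prescribed in these notes.

import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    # X is m x (n+1) with a first column of ones, y is m x 1
    m, n_features = X.shape
    beta = np.zeros((n_features, 1))
    for _ in range(num_iters):
        # hypothesis for all m data points, an m x 1 column vector
        p = 1 / (1 + np.exp(-(X @ beta)))
        # vectorized update: b = b - (alpha / m) * X^T (p - y)
        beta = beta - (alpha / m) * (X.T @ (p - y))
    return beta

# toy example: one feature plus an intercept column of ones
X = np.array([[1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0]])
y = np.array([[0], [0], [1], [1]])
print(gradient_descent(X, y))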
Multi-Class
Since the logistic function's output ranges only from 0 to 1, does this mean it can only handle binary classification, i.e. classification problems involving only two classes? The answer is no. We can extend the technique to multi-class classification by using a technique called one-versus-all.
The idea of the one-versus-all technique is to reduce the multi-class classification problem to a set of binary classification problems. Let's say we have three classes and we would like to predict between cat, dog, and fish images. We can treat this problem as binary classification by predicting whether an image is a cat or not a cat. In this first instance, we treat both dog and fish images as no-cat images. We then repeat the same procedure and try to predict whether an image is a dog or a no-dog image. Similarly, we do the same with fish and no-fish images.
To facilitate this kind of prediction, instead of having one target column in the training set, we prepare three target columns, one column for each class. We need to prepare something like the following data set.
feature_1 | feature_2 | cat | dog | fish |
---|---|---|---|---|
x | x | 1 | 0 | 0 |
x | x | 1 | 0 | 0 |
x | x | 0 | 1 | 0 |
x | x | 0 | 0 | 1 |
x | x | 0 | 1 | 0 |
We can then train the model three times and obtain the coefficients for each class. In this example, we would have three sets of beta coefficients: one for cat versus no-cat, another for dog versus no-dog, and the last one for fish versus no-fish. We can then use these coefficients to calculate the probability for each class.
Recall that our hypothesis function returns a probability between 0 and 1.
We can then construct three columns where each column contains the probability for the particular binary classification relevant to the column target. For example, we can have something like the following table.
feature_1 | feature_2 | cat | dog | fish | predicted class |
---|---|---|---|---|---|
x | x | 0.8 | 0.2 | 0.3 | cat |
x | x | 0.9 | 0.1 | 0.2 | cat |
x | x | 0.5 | 0.9 | 0.4 | dog |
x | x | 0.3 | 0.2 | 0.8 | fish |
x | x | 0.1 | 0.7 | 0.5 | dog |
In the above example, the first two rows have the cat class as their highest probability. Therefore, we set "cat" as the predicted class in the last column. On the other hand, the third and last rows have "dog" as their highest probability and, therefore, they are predicted as "dog". Similarly with "fish" in the second last row.
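This last step of picking the class with the highest probability can be done with np.argmax(). The cell below sketches this using the probabilities from the table above; the variable names are our own.

import numpy as np

classes = np.array(["cat", "dog", "fish"])

# probabilities for each class, one row per image (values from the table above)
probs = np.array([
    [0.8, 0.2, 0.3],
    [0.9, 0.1, 0.2],
    [0.5, 0.9, 0.4],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.5],
])

# pick the class with the highest probability in each row
predicted = classes[np.argmax(probs, axis=1)]
print(predicted)

['cat' 'cat' 'dog' 'fish' 'dog']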
Appendix
Derivation of Logistic Regression Derivative
We want to find $\frac{\partial}{\partial \beta_j} J(\beta)$, where

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log(p(x^i)) + (1 - y^i) \log\left(1 - p(x^i)\right) \right]$$
To simplify our derivation, we will consider separately the case when $y = 1$ and the case when $y = 0$. When $y = 1$, the cost function is given by

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \log(p(x^i))$$
Differentiating this with respect to $\beta_j$ gives

$$\frac{\partial}{\partial \beta_j} J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \frac{1}{p(x^i)} \frac{\partial p(x^i)}{\partial \beta_j}$$
Recall that the expression for the hypothesis is

$$p(x) = \frac{1}{1 + e^{-\beta^T x}}$$
The derivative of this with respect to $\beta_j$ is given by

$$\frac{\partial p(x)}{\partial \beta_j} = \frac{e^{-\beta^T x}}{\left(1 + e^{-\beta^T x}\right)^2} x_j$$
or

$$\frac{\partial p(x)}{\partial \beta_j} = p(x)\left(1 - p(x)\right) x_j$$
We can now substitute this back into the derivative of the cost function.

$$\frac{\partial}{\partial \beta_j} J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \frac{1}{p(x^i)} \, p(x^i)\left(1 - p(x^i)\right) x_j^i$$
This can be written as

$$\frac{\partial}{\partial \beta_j} J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left(1 - p(x^i)\right) x_j^i$$
This is for the case of $y = 1$.
Now let's do the same for $y = 0$. In this case the cost function is given by

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \log\left(1 - p(x^i)\right)$$
Differentiating this with respect to $\beta_j$ gives

$$\frac{\partial}{\partial \beta_j} J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{1 - p(x^i)} \frac{\partial p(x^i)}{\partial \beta_j}$$
Substituting the expression for the hypothesis function and its derivative gives us

$$\frac{\partial}{\partial \beta_j} J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{1 - p(x^i)} \, p(x^i)\left(1 - p(x^i)\right) x_j^i = \frac{1}{m} \sum_{i=1}^{m} p(x^i) \, x_j^i$$
This is for the case of $y = 0$.
Combining the two cases $y = 1$ and $y = 0$, and noting that $\left(p(x^i) - y^i\right)$ equals $-\left(1 - p(x^i)\right)$ when $y^i = 1$ and $p(x^i)$ when $y^i = 0$, we have

$$\frac{\partial}{\partial \beta_j} J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( p(x^i) - y^i \right) x_j^i$$
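As a quick sanity check of this result, we can compare the analytic gradient against a numerical finite-difference approximation of the cost. The sketch below uses made-up toy data and our own function names; it is not part of the derivation itself.

import numpy as np

def cost(beta, X, y):
    p = 1 / (1 + np.exp(-(X @ beta)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(beta, X, y):
    # (1/m) X^T (p - y), the derivative derived above
    p = 1 / (1 + np.exp(-(X @ beta)))
    return (X.T @ (p - y)) / X.shape[0]

# toy data: two features plus an intercept column of ones
X = np.array([[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1]])
y = np.array([[1], [0], [1]])
beta = np.array([[0.1], [-0.2], [0.3]])

# numerical gradient using central differences
eps = 1e-6
num_grad = np.zeros_like(beta)
for j in range(beta.shape[0]):
    d = np.zeros_like(beta)
    d[j] = eps
    num_grad[j] = (cost(beta + d, X, y) - cost(beta - d, X, y)) / (2 * eps)

# the two gradients should agree closely (prints True)
print(np.allclose(num_grad, analytic_grad(beta, X, y), atol=1e-6))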