
Logistic Regression

Logistic regression is a statistical technique used to model and predict the probability of a binary outcome based on one or more independent variables.

Goal

By the end of this lesson, you should be able to:

  • Write the cost function of logistic regression
  • Use logistic regression to calculate probabilities for binary classification
  • Train a logistic regression model
  • Split data into training, validation, and test sets
Keywords

classification, hypothesis, probability, logistic function, coefficients, binary classification, cost function, gradient descent, update function, matrix notation

Introduction

The problem of classification deals with categorical data. Here, we wish to identify whether a given data point belongs to a particular class or category. For example, given a text message from an email, we would like to classify whether it is spam or not spam. Another example: given some measurements of cancer cells, we wish to classify whether the cells are benign or malignant. In this section we will learn how logistic regression solves this classification problem.

Hypothesis Function

Let's take the example of a breast cancer classification problem. Let's say that, depending on the cell size, an expert can identify whether a cell is benign or malignant. We can plot something like the following figure.

On the y-axis, a value of 1 means the cell is malignant while a value of 0 means it is benign. The x-axis can be considered the normalized size of the cell, with mean 0 and standard deviation of 1 (recall z-normalization).

If we can model this plot as a function $p(x)$, we can set a criterion to classify the cells. For example, we will predict that a cell is malignant if $p(x) \geq 0.5$; otherwise, it is benign. This means we need a function that models the data in a step-wise manner and fulfills the following:

$$0 \leq p(x) \leq 1$$

where $p(x)$ is the probability that a cell with feature $x$ is a malignant cell.

One function that has this step-wise shape and satisfies the above properties is the logistic function. A logistic function can be written as

$$y = \frac{1}{1 + e^{-z}}$$

The plot of a logistic function looks like the following.

import numpy as np
import matplotlib.pyplot as plt

# evaluate the logistic function for integer z from -10 to 10
z = np.array(range(-10, 11))
y = 1 / (1 + np.exp(-z))
plt.plot(z, y)
plt.show()

We can write our hypothesis as follows.

$$p(x) = \frac{1}{1 + e^{-z(x)}}$$

where $z$ is a function of $x$. What should this $z$ function be? We can take our linear model of a straight line and transform it into a logistic function using the following transformation.

$$z(x) = \beta_0 x_0 + \beta_1 x_1$$

When $x_0 = 1$, the above equation is simply the straight-line equation of linear regression:

$$\beta_0 + \beta_1 x_1$$

This is the case when we only have one feature $x_1$. If we have more than one feature, we write it as follows.

$$z(x) = \beta_0 x_0 + \beta_1 x_1 + \ldots + \beta_n x_n$$

Note that in these notes we tend to omit the hat symbol used in the previous notes to indicate estimated parameters. We will write the estimated parameters simply as $\beta$ instead of $\hat{\beta}$.

The above relationship shows that we can map the value of linear regression into a new function with a value from 0 to 1. This new function $p(x)$ can be considered as the estimated probability that $y = 1$ for input $x$. For example, if $p(x) = 0.7$, there is a 70% chance that the cell is malignant. We can then use the following boundary conditions (a short sketch applying this rule follows the list):

  • $y = 1$ (malignant) if $p(x) \geq 0.5$
  • $y = 0$ (benign) if $p(x) < 0.5$
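As a quick illustration, this decision rule can be applied to a whole vector of probabilities with np.where(). This is a minimal sketch; the probability values below are made up for illustration.

import numpy as np

# hypothetical probabilities p(x) for a few cells
p = np.array([0.1, 0.45, 0.5, 0.87])

# decision rule: y = 1 (malignant) if p(x) >= 0.5, otherwise y = 0 (benign)
y_pred = np.where(p >= 0.5, 1, 0)
print(y_pred)
[0 0 1 1]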

The above conditions also mean that we classify $y=1$ when $\beta^T x \geq 0$ and $y = 0$ when $\beta^T x < 0$. We can draw these boundary conditions.

In the figure above, we indicate the predicted label $y$ with the orange colour. We see that when $p(x) \geq 0.5$, the data is marked as $y=1$ (orange). On the other hand, when $p(x) < 0.5$, the data is marked as $y=0$ (orange). The thick black line shows the decision boundary for this particular example.

How do we get this decision boundary? Once we have found the estimated values for $\beta$, we can find the value of $x$ which gives $\beta^T x = 0$. You will work on computing the parameters $\beta$ in the problem set. For now, let's assume that you have managed to find the values $\beta_0 = -0.56$ and $\beta_1 = 1.94$. The equation $\beta^T x = 0$ can then be written as follows.

$$\beta_0 + \beta_1 x = 0$$

We can then substitute the values for $\beta$ into the equation.

$$-0.56 + 1.94 x = 0$$
$$x = 0.29 \approx 0.3$$

From the figure above, this fits where the thick line is, which is at around 0.3.
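As a sanity check, the boundary can be computed directly from the estimated coefficients. A minimal sketch, assuming the values of $\beta$ quoted above:

# decision boundary: solve beta_0 + beta_1 * x = 0 for x
beta_0, beta_1 = -0.56, 1.94
x_boundary = -beta_0 / beta_1
print(round(x_boundary, 2))
0.29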

Cost Function

Similar to linear regression, our purpose here is to find the parameters $\beta$. To do so, we have to minimize some cost function using an optimization algorithm.

For logistic regression, we will choose the following cost function.

$$J(\beta) = \frac{1}{m} \Sigma_{i=1}^m \begin{cases} -\log(p(x)) & \text{ if } y = 1\\ -\log(1 - p(x)) & \text{ if } y = 0 \end{cases}$$

We can try to understand the term inside the bracket intuitively. Let's see the case when $y=1$. In this case, the cost term is given by:

$$-\log(p(x))$$

The cost is 0 if $y = 1$ and $p(x) = 1$ because $-\log(z)$ is 0 when $z=1$. Moreover, as $p(x) \rightarrow 0$, the cost approaches $\infty$. See the plot by Wolfram Alpha.

On the other hand, when $y = 0$, the cost term is given by:

$$-\log(1-p(x))$$

In this case, the cost is 0 when $p(x) = 0$, but it approaches $\infty$ when $p(x) \rightarrow 1$. See the plot by Wolfram Alpha.
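Instead of relying on Wolfram Alpha, we can also plot the two cost terms ourselves with matplotlib. This is a small sketch for intuition only; the range of $p(x)$ is clipped away from 0 and 1 to avoid taking the logarithm of 0.

import numpy as np
import matplotlib.pyplot as plt

# values of p(x) strictly between 0 and 1
p = np.linspace(0.01, 0.99, 99)

plt.plot(p, -np.log(p), label='-log(p(x)), cost when y = 1')
plt.plot(p, -np.log(1 - p), label='-log(1 - p(x)), cost when y = 0')
plt.xlabel('p(x)')
plt.ylabel('cost')
plt.legend()
plt.show()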

We can write the overall cost function for all the data points from $i=1$ to $m$ as follows.

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m y^i \log(p(x^i)) + (1 - y^i) \log(1 - p(x^i))\right]$$

Notice that when $y^i = 1$, the function reduces to

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m \log(p(x^i)) \right]$$

and when $y^i = 0$, the function reduces to

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m \log(1 - p(x^i))\right]$$

Gradient Descent

We can find the parameters $\beta$ again by using the gradient descent algorithm to perform:

$$\min_{\beta} J(\beta)$$

The update functions for the parameters are given by

$$\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)$$

The derivative of the cost function is given by

$$\frac{\partial}{\partial \beta_j}J(\beta) = \frac{1}{m}\Sigma_{i=1}^m \left(p(x)-y^i \right)x_j^i$$

See the appendix for the derivation. We can substitute this in to get the following update function.

$$\beta_j = \beta_j - \alpha \frac{1}{m}\Sigma_{i=1}^m \left(p(x)-y^i \right)x_j^i$$
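Before moving to the matrix form in the next section, here is a minimal, loop-based sketch of this update rule. The function names (logistic, gradient_descent_step) are our own, not from any library, and X is assumed to already contain a first column of ones.

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_step(X, y, beta, alpha):
    # one update of every beta_j using explicit loops over j and i
    m, n = X.shape
    new_beta = beta.copy()
    for j in range(n):
        grad_j = 0.0
        for i in range(m):
            p_i = logistic(X[i] @ beta)        # p(x) for data point i
            grad_j += (p_i - y[i]) * X[i, j]   # (p(x) - y^i) x_j^i
        new_beta[j] = beta[j] - alpha * grad_j / m
    return new_beta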

Matrix Notation

The above equations can be written in matrix notation so that we can perform a vectorized computation.

Hypothesis Function

Recall that our hypothesis can be written as:

$$p(x) = \frac{1}{1 + e^{-z(x)}}$$

where

$$z(x) = \beta_0 x_0 + \beta_1 x_1 + \ldots + \beta_n x_n$$

We can write this equation as a vector multiplication as follows.

$$z = \mathbf{b}^T \mathbf{x}$$

and

$$p(x) = \frac{1}{1 + e^{-\mathbf{b}^T \mathbf{x}}}$$

where

$$\mathbf{b} = \begin{bmatrix} \hat{\beta}_0\\ \hat{\beta}_1 \\ \vdots\\ \hat{\beta}_n \end{bmatrix}$$

and

$$\mathbf{x} = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

Recall that this is for a single data point with $n$ features. The result of this vector multiplication, $z$, is a single number for that one data point.
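A minimal sketch of this computation for one data point, using the example coefficients from earlier in these notes (the variable names are ours):

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

# column of coefficients b (intercept and one feature) and one data point x
b = np.array([-0.56, 1.94])
x = np.array([1.0, 0.3])   # x_0 = 1 for the intercept, x_1 = 0.3

z = b.T @ x                # a single number
p = logistic(z)            # estimated probability that y = 1
print(z, p)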

Cost Function

Recall that the cost function for all the data points from $i=1$ to $m$ is as follows.

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m y^i \log(p(x^i)) + (1 - y^i) \log(1 - p(x^i))\right]$$

Notice that when $y^i = 1$, the function reduces to

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m \log(p(x^i)) \right]$$

and when $y^i = 0$, the function reduces to

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m \log(1 - p(x^i))\right]$$

How can we vectorize this computation in Python? Numpy provides the function np.where(), which we can use when the computation to perform depends on a condition.

For example, if we have an input x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], we can compute the value of y depending on whether each element of x is even or odd. Let's say we square the value if it is even and leave the value as it is if it is odd. The cell below shows how the code can be written.

import numpy as np
# create a list from 0 to 9
x = list(range(10))

# using np.where()
y = np.where(np.mod(x,2) == 0, np.power(x,2), x)
print(y)
[ 0  1  4  3 16  5 36  7 64  9]

We can, thus, use np.where() to calculate the cost function depending on whether $y^i$ is 1 or 0, using the two equations above. The summation in the above equation can be computed using np.sum(). A sketch combining the two appears after the examples below.

An example of using np.sum() can be seen in the below cell.

# create a list from 0 to 9
x = list(range(10))

# using np.sum() to sum up all the numbers in the vectors
y = np.sum(x)
print(y)
45

If you are dealing with a matrix, you can specify the axis argument of np.sum() to choose whether to sum over the rows or over the columns. Using axis=0 sums over the rows (giving one value per column), while axis=1 sums over the columns (giving one value per row); if no axis is given, np.sum() adds up all the elements.

x = [[1, 2, 3], [4, 5, 6]]
print(np.sum(x, axis=0))

[5 7 9]

In the above code we sum over the rows, so we get three values, one for each column. If we wish to sum over the columns instead, we do as shown below.

x = [[1, 2, 3], [4, 5, 6]]
print(np.sum(x, axis=1))
[ 6 15]

In the above output, we see that 6 is the sum of [1, 2, 3] and 15 is the sum of [4, 5, 6].
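Putting np.where() and np.sum() together, a minimal sketch of the cost function might look like the following. The helper names (logistic, compute_cost) are ours, and X is assumed to contain a first column of ones.

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(beta, X, y):
    # p(x^i) for every data point, computed in one go
    m = X.shape[0]
    p = logistic(X @ beta)
    # pick -log(p) when y = 1 and -log(1 - p) when y = 0
    cost_per_point = np.where(y == 1, -np.log(p), -np.log(1 - p))
    return np.sum(cost_per_point) / m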


Gradient Descent Update

Recall that the update function in our gradient descent calculation was the following.

$$\beta_j = \beta_j - \alpha \frac{1}{m}\Sigma_{i=1}^m \left(p(x)-y^i \right)x_j^i$$

We can write this as a vectorized calculation, particularly because we have a summation of multiplication terms. This is a good candidate for a matrix multiplication. Recall that our hypothesis for $m$ data points is a column vector.

$$\mathbf{p}(x) = \frac{1}{1 + e^{-\mathbf{X}\mathbf{b}}}$$

Similarly, $\mathbf{y}$, which holds the actual target values from the training set, can be written as a column vector of size $m\times 1$. Therefore, we can compute the following term element-wise.

$$\mathbf{p} - \mathbf{y}$$

The result is a column vector too.

The features $x_j^i$ can be arranged as a matrix as shown below.

$$\mathbf{X} = \begin{bmatrix} 1 & x_1^1 & x_2^1 & \ldots & x_n^1 \\ 1 & x_1^2 & x_2^2 & \ldots & x_n^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^m & x_2^m & \ldots & x_n^m \end{bmatrix}$$

We can perform the multiplication and the summation together as the matrix multiplication in the following equation.

$$\mathbf{X}^T(\mathbf{p} - \mathbf{y})$$

Note that we transpose the matrix $\mathbf{X}$ so that it has the shape $(1+n) \times m$. In this way, we can do matrix multiplication with $(\mathbf{p} - \mathbf{y})$, which has the shape $m \times 1$.

The rest of the computation is just a multiplication of some constants. So we can write our update function as follows.

$$\mathbf{b} = \mathbf{b} - \alpha\frac{1}{m}\mathbf{X}^T(\mathbf{p} - \mathbf{y})$$
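A minimal vectorized sketch of the whole training loop, under the assumption that X already contains a first column of ones and y is an m by 1 column vector (the function names are ours, not from any library):

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    m, n = X.shape
    b = np.zeros((n, 1))
    for _ in range(num_iters):
        p = logistic(X @ b)                    # m x 1 column of probabilities
        b = b - alpha / m * (X.T @ (p - y))    # vectorized update of all coefficients
    return b

# toy example: one feature plus the column of ones for the intercept
X = np.array([[1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0]])
y = np.array([[0], [0], [1], [1]])
print(gradient_descent(X, y))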

Multi-Class

Since the logistic function's output ranges only from 0 to 1, does this mean that it can only handle binary classification, i.e. classification problems involving only two classes? The answer is no. We can extend the technique to multi-class classification by using a technique called one-versus-all.

The idea of the one-versus-all technique is to reduce the multi-class classification problem to several binary classification problems. Let's say we have three classes and would like to distinguish between cat, dog, and fish images. We can treat this as a binary classification problem by predicting whether an image is a cat or not a cat. In this first instance, we treat both dog and fish images as no-cat images. We then repeat the same procedure and try to predict whether an image is a dog or a no-dog image. Similarly, we do the same with the fish and no-fish images.

To facilitate this kind of prediction, instead of having one target column in the training set, we prepare three target columns, one column for each class. We need to prepare something like the following data set.

feature_1   feature_2   cat   dog   fish
x           x           1     0     0
x           x           1     0     0
x           x           0     1     0
x           x           0     0     1
x           x           0     1     0

We can then train the model three times and obtain the coefficients for each class. In this example, we would have three sets of beta coefficients: one for cat versus no-cat, another for dog versus no-dog, and the last one for fish versus no-fish. We can then use these coefficients to calculate the probability for each class, as sketched below.
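A minimal sketch of preparing the three binary target columns with np.where(); the labels below are made up, and the commented line assumes the gradient_descent() sketch from the previous section together with some feature matrix X.

import numpy as np

# hypothetical class labels for five images
labels = np.array(['cat', 'cat', 'dog', 'fish', 'dog'])
classes = ['cat', 'dog', 'fish']

# one binary target column per class: 1 if the row belongs to that class, 0 otherwise
targets = {c: np.where(labels == c, 1, 0).reshape(-1, 1) for c in classes}

# train one binary logistic regression per class
# (assumes the gradient_descent() sketch above and a feature matrix X)
# betas = {c: gradient_descent(X, targets[c]) for c in classes}

print(targets['cat'].ravel())
[1 1 0 0 0]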

Recall that our hypothesis function returns a probability between 0 and 1.

$$\mathbf{p}(x) = \frac{1}{1 + e^{-\mathbf{Xb}}}$$

We can then construct three columns where each column contains the probability for the particular binary classification relevant to the column target. For example, we can have something like the following table.

feature_1   feature_2   cat   dog   fish   predicted class
x           x           0.8   0.2   0.3    cat
x           x           0.9   0.1   0.2    cat
x           x           0.5   0.9   0.4    dog
x           x           0.3   0.2   0.8    fish
x           x           0.1   0.7   0.5    dog

In the above example, the first two rows have the cat class as their highest probability, so we set "cat" as the predicted class in the last column. The third and the last rows have "dog" as their highest probability and are therefore predicted as "dog". The same applies to "fish" in the second last row.
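Picking the class with the highest probability in each row can be done with np.argmax(). A minimal sketch using the probabilities from the table above:

import numpy as np

classes = np.array(['cat', 'dog', 'fish'])

# one row per image, one column of probabilities per class
probs = np.array([[0.8, 0.2, 0.3],
                  [0.9, 0.1, 0.2],
                  [0.5, 0.9, 0.4],
                  [0.3, 0.2, 0.8],
                  [0.1, 0.7, 0.5]])

# index of the highest probability in each row, mapped back to the class name
predicted = classes[np.argmax(probs, axis=1)]
print(predicted)
['cat' 'cat' 'dog' 'fish' 'dog']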

Appendix

Derivation of Logistic Regression Derivative

We want to find $\frac{\partial}{\partial \beta_j}J(\beta)$, where

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m y^i \log(p(x^i)) + (1 - y^i) \log(1 - p(x^i))\right]$$

To simplify our derivation, we will consider each case: when $y=1$ and when $y=0$. When $y=1$, the cost function is given by

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m \log(p(x^i)) \right]$$

Differentiating this with respect to $\beta_j$ gives

$$\frac{\partial}{\partial \beta_j}J(\beta) = -\frac{1}{m}\Sigma \frac{1}{p(x)}\frac{\partial}{\partial \beta_j}p(x)$$

Recall that the expression for the hypothesis is

$$p(x) = \frac{1}{1 + e^{-\beta^T x}}$$

The derivative of this is given by

$$\frac{\partial}{\partial \beta_j} p(x) = - \frac{1}{(1 + e^{-\beta^T x})^2} \times (-x_j) \times e^{-\beta^T x}$$

or

$$\frac{\partial}{\partial \beta_j} p(x) = \frac{x_j e^{-\beta^T x}}{(1 + e^{-\beta^T x})^2}$$

We can now substitute this back:

$$\frac{\partial}{\partial \beta_j}J(\beta) = -\frac{1}{m}\Sigma (1 + e^{-\beta^T x}) \frac{x_j e^{-\beta^T x}}{(1 + e^{-\beta^T x})^2}$$
$$\frac{\partial}{\partial \beta_j}J(\beta) = -\frac{1}{m}\Sigma \frac{x_j e^{-\beta^T x}}{1 + e^{-\beta^T x}}$$

This can be written as

$$\frac{\partial}{\partial \beta_j}J(\beta) = -\frac{1}{m}\Sigma\, p(x)\, x_j e^{-\beta^T x}$$

This is for the case $y = 1$.

Now let's do the same for $y = 0$. In this case the cost function is given by

$$J(\beta) = -\frac{1}{m}\left[\Sigma_{i=1}^m \log(1 - p(x^i))\right]$$

Differentiating this with respect to $\beta_j$ gives

$$\frac{\partial}{\partial \beta_j}J(\beta) = \frac{1}{m}\Sigma \frac{1}{1 - p(x)}\frac{\partial}{\partial \beta_j}p(x)$$

Substituting expression for the hypothesis function and its derivative gives us

$$\frac{\partial}{\partial \beta_j}J(\beta) = \frac{1}{m}\Sigma \frac{1}{1 - \frac{1}{1 + e^{-\beta^T x}}} \frac{x_j e^{-\beta^T x}}{(1 + e^{-\beta^T x})^2}$$
$$\frac{\partial}{\partial \beta_j}J(\beta) = \frac{1}{m}\Sigma \frac{1 + e^{-\beta^T x}}{e^{-\beta^T x}} \frac{x_j e^{-\beta^T x}}{(1 + e^{-\beta^T x})^2}$$
$$\frac{\partial}{\partial \beta_j}J(\beta) = \frac{1}{m}\Sigma \frac{x_j}{1+e^{-\beta^T x}}$$
$$\frac{\partial}{\partial \beta_j}J(\beta) = \frac{1}{m}\Sigma\, p(x)\, x_j$$

This is for the case $y = 0$.

Combining both cases $y=0$ and $y=1$, we have

$$\frac{\partial}{\partial \beta_j}J(\beta) = -\frac{1}{m}\Sigma_{i=1}^m \left[ y^i p(x) x_j^i e^{-\beta^T x} + (y^i - 1) p(x) x_j^i \right]$$
$$\frac{\partial}{\partial \beta_j}J(\beta) = -\frac{1}{m}\Sigma_{i=1}^m \left[ y^i p(x) x_j^i e^{-\beta^T x} + y^i p(x) x_j^i - p(x) x_j^i \right]$$
$$\frac{\partial}{\partial \beta_j}J(\beta) = -\frac{1}{m}\Sigma_{i=1}^m \left(y^i p(x)(1 + e^{-\beta^T x}) - p(x) \right)x_j^i$$
$$\frac{\partial}{\partial \beta_j}J(\beta) = -\frac{1}{m}\Sigma_{i=1}^m \left(y^i - p(x) \right)x_j^i$$
$$\frac{\partial}{\partial \beta_j}J(\beta) = \frac{1}{m}\Sigma_{i=1}^m \left(p(x)-y^i \right)x_j^i$$
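As a sanity check on this result, the analytic gradient can be compared against a finite-difference approximation of the cost on a small made-up data set. This is a minimal sketch; all names and numbers here are our own.

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def cost(beta, X, y):
    p = logistic(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(beta, X, y):
    # (1/m) * sum_i (p(x) - y^i) x_j^i for every j, written as a matrix product
    p = logistic(X @ beta)
    return X.T @ (p - y) / len(y)

# tiny made-up data set: a column of ones plus one feature
X = np.array([[1.0, -1.0], [1.0, 0.2], [1.0, 1.5]])
y = np.array([0.0, 0.0, 1.0])
beta = np.array([0.1, -0.3])

# central finite differences, one coefficient at a time
eps = 1e-6
numeric_grad = np.array([
    (cost(beta + eps * np.eye(2)[j], X, y) - cost(beta - eps * np.eye(2)[j], X, y)) / (2 * eps)
    for j in range(2)
])

print(analytic_grad(beta, X, y))
print(numeric_grad)   # should closely match the analytic gradient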