By the end of this lesson, you should be able to:
- Create a Pandas DataFrame and select data from a DataFrame
- Use a library to read a CSV or Excel file
- Give examples of linear regression and classification
- Split data randomly into a training set and a testing set
- Normalize data using min-max normalization
Important Words:
- Data Frame
- Series
- index
- column
- vectorized operation
- normalization
- z-norm
- min-max
- dataset
- test, training, validation set
Short Introduction to Machine Learning
In the previous weeks we discussed how various computations can be done. We began with computational complexity and the divide-and-conquer strategy through recursion. We then discussed how data can have computation associated with it through object-oriented programming. We covered several data structures and the algorithms associated with various problems. We ended by offering a different perspective: looking at computation as a state machine.
In this second half of the course, we will look into how computation can learn from data in order to make a new computation. This new computation is often called a prediction. In these lessons, we focus on what is called supervised machine learning. The term indicates that the computer learns from some existing data how to compute the prediction. For example, given some images labelled as “cat” and “not a cat”, the computer can learn to predict (or to compute) whether a new image given to it is a cat or not.
[Images: an example of a “cat” image alongside a “not a cat” image]
Another example: given some data of housing prices in Singapore with the year of sale, area, number of rooms, and floor height, the computer can predict the price of another house. One more example: given data of breast cancer cells and their measurements, one can predict whether a cell is malignant or benign. Predicting a continuous number such as a price is a regression task, while predicting a category such as malignant or benign is a classification task. Supervised machine learning assumes that we have some existing data labelled with the categories “malignant” and “benign”. Using this labelled data (hence, supervised), the computer can predict the category given some new data.
Reading Data
The first step in machine learning is to understand the data itself. In order to do that, we need to be able to read data from some source. One common source is a text file in CSV (comma-separated values) format. Another common format is an Excel spreadsheet. The data can also come from a database or a server. Different data sources require different ways of handling them, but in all cases we need to know how to read the data.
For this purpose we will use the Pandas library to read our data. We import Pandas into our Python code by typing the following.
import pandas as pd
Now we can use Pandas functions to read the data. For example, if we want to read a CSV file, we can simply type:
df = pd.read_csv('mydata.csv')
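Pandas can also read Excel spreadsheets. Below is a minimal sketch, assuming a hypothetical file `mydata.xlsx` and sheet name; reading `.xlsx` files typically requires the `openpyxl` package to be installed.

```python
import pandas as pd

# Hypothetical Excel file and sheet name, shown for illustration only.
df = pd.read_excel('mydata.xlsx', sheet_name='Sheet1')
```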
Let’s take the example of Singapore housing prices. We can get this data from Data.gov.sg. We have downloaded the CSV file so that you can access it from the following Dropbox link. We use the URL of the raw file to open the CSV in our Python code.
import pandas as pd
file_url = 'https://www.dropbox.com/s/jz8ck0obu9u1rng/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv?raw=1'
df = pd.read_csv(file_url)
df
month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01 | ANG MO KIO | 2 ROOM | 406 | ANG MO KIO AVE 10 | 10 TO 12 | 44.0 | Improved | 1979 | 61 years 04 months | 232000.0 |
1 | 2017-01 | ANG MO KIO | 3 ROOM | 108 | ANG MO KIO AVE 4 | 01 TO 03 | 67.0 | New Generation | 1978 | 60 years 07 months | 250000.0 |
2 | 2017-01 | ANG MO KIO | 3 ROOM | 602 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months | 262000.0 |
3 | 2017-01 | ANG MO KIO | 3 ROOM | 465 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1980 | 62 years 01 month | 265000.0 |
4 | 2017-01 | ANG MO KIO | 3 ROOM | 601 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months | 265000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95853 | 2021-04 | YISHUN | EXECUTIVE | 326 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 04 months | 650000.0 |
95854 | 2021-04 | YISHUN | EXECUTIVE | 360 | YISHUN RING RD | 04 TO 06 | 146.0 | Maisonette | 1988 | 66 years 04 months | 645000.0 |
95855 | 2021-04 | YISHUN | EXECUTIVE | 326 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 04 months | 585000.0 |
95856 | 2021-04 | YISHUN | EXECUTIVE | 355 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 08 months | 675000.0 |
95857 | 2021-04 | YISHUN | EXECUTIVE | 277 | YISHUN ST 22 | 04 TO 06 | 146.0 | Maisonette | 1985 | 63 years 05 months | 625000.0 |
95858 rows × 11 columns
The output of the `read_csv()` function is of Pandas’ `DataFrame` type.
type(df)
pandas.core.frame.DataFrame
`DataFrame` is Pandas’ class that contains attributes and methods to work with tabular data as shown above. Recall that we can create our own custom data type using the keyword `class` and define its attributes and methods. We can even override some of Python’s operators to work with our new data type. This is what the Pandas library does with `DataFrame`. The `DataFrame` class provides properties and methods that allow us to work with tabular data.
For example, we can get all the names of the columns in our data frame using the `df.columns` property.
df.columns
Index(['month', 'town', 'flat_type', 'block', 'street_name', 'storey_range',
'floor_area_sqm', 'flat_model', 'lease_commence_date',
'remaining_lease', 'resale_price'],
dtype='object')
We can also get the index (the names of the rows) using `df.index`.
df.index
RangeIndex(start=0, stop=95858, step=1)
We can also treat this data frame as a kind of matrix and find its shape using `df.shape`.
df.shape
(95858, 11)
As we can see, the data contains 95858 rows and 11 columns. One of the columns is called `resale_price`. Since our aim is to predict the house price, this column is usually called the target. The rest of the columns are called the features. This means that we have 10 feature columns.
The idea of supervised machine learning is that, using data such as that shown above, the computer predicts the target given a new set of features. The computer does this by learning from the existing labelled data. The label in this case is the resale price from the historical sales data.
In order to understand the data, it is important to be able to manipulate and work on the data frame.
Data Frame Operations
It is important to know how to manipulate the data. Pandas has two data structures:
- Series
- DataFrame
You can think of `Series` data as a one-dimensional labelled array, while `DataFrame` data is a two-dimensional labelled data structure. For example, the table that we saw previously, which is the output of the `read_csv()` function, is a `DataFrame` because it has both rows and columns and is therefore two-dimensional. On the other hand, we can access just one of the columns of the data frame to get `Series` data.
Getting a Column or a Row as a Series
You can access a column’s data as a Series using the square bracket operator.
df[column_name]
print(df['resale_price'])
print(type(df['resale_price']))
0 232000.0
1 250000.0
2 262000.0
3 265000.0
4 265000.0
...
95853 650000.0
95854 645000.0
95855 585000.0
95856 675000.0
95857 625000.0
Name: resale_price, Length: 95858, dtype: float64
<class 'pandas.core.series.Series'>
The code above prints the column `resale_price` and its type. As can be seen, the type of the output is `Series`.
You can also get a particular row by specifying its index.
You can also access a column using the `.loc[index, column]` syntax. Here, you need to specify the labels of the index and the columns. For example, to access all the rows of the column `resale_price`, we can do as follows. Notice that we use `:` to access all the rows, and we specify the name of the column in the code below.
print(df.loc[:, 'resale_price'])
print(type(df.loc[:, 'resale_price']))
0 232000.0
1 250000.0
2 262000.0
3 265000.0
4 265000.0
...
95853 650000.0
95854 645000.0
95855 585000.0
95856 675000.0
95857 625000.0
Name: resale_price, Length: 95858, dtype: float64
<class 'pandas.core.series.Series'>
In the above code, we set the index to `:` to access all rows. Recall that in Python’s list slicing, we also use `:` to access all the elements. Similarly, here we use `:` to access all the rows. In the same way, we can use `:` to access all the columns, e.g. `df.loc[:, :]` will copy the whole data frame.
This also gives you a hint on how to access a particular row. Let’s say you only want to access the first row; you can type the following.
print(df.loc[0, :])
print(type(df.loc[0, :]))
month 2017-01
town ANG MO KIO
flat_type 2 ROOM
block 406
street_name ANG MO KIO AVE 10
storey_range 10 TO 12
floor_area_sqm 44
flat_model Improved
lease_commence_date 1979
remaining_lease 61 years 04 months
resale_price 232000
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
In the above code, we access the first row, which is at index 0, and all the columns.
Recall that all these data are of type `Series`. You can create a data frame from an existing series just as you create any other object: by instantiating a `DataFrame` object and passing in an argument as shown below.
df_row0 = pd.DataFrame(df.loc[0, :])
df_row0
0 | |
---|---|
month | 2017-01 |
town | ANG MO KIO |
flat_type | 2 ROOM |
block | 406 |
street_name | ANG MO KIO AVE 10 |
storey_range | 10 TO 12 |
floor_area_sqm | 44 |
flat_model | Improved |
lease_commence_date | 1979 |
remaining_lease | 61 years 04 months |
resale_price | 232000 |
Getting Rows and Columns as DataFrame
The `:` operator works similarly to Python’s slicing. This means that you can get some rows by slicing them. For example, you can access the rows with index 0 to 10 as follows.
print(df.loc[0:10, :])
print(type(df.loc[0:10, :]))
month town flat_type block street_name storey_range \
0 2017-01 ANG MO KIO 2 ROOM 406 ANG MO KIO AVE 10 10 TO 12
1 2017-01 ANG MO KIO 3 ROOM 108 ANG MO KIO AVE 4 01 TO 03
2 2017-01 ANG MO KIO 3 ROOM 602 ANG MO KIO AVE 5 01 TO 03
3 2017-01 ANG MO KIO 3 ROOM 465 ANG MO KIO AVE 10 04 TO 06
4 2017-01 ANG MO KIO 3 ROOM 601 ANG MO KIO AVE 5 01 TO 03
5 2017-01 ANG MO KIO 3 ROOM 150 ANG MO KIO AVE 5 01 TO 03
6 2017-01 ANG MO KIO 3 ROOM 447 ANG MO KIO AVE 10 04 TO 06
7 2017-01 ANG MO KIO 3 ROOM 218 ANG MO KIO AVE 1 04 TO 06
8 2017-01 ANG MO KIO 3 ROOM 447 ANG MO KIO AVE 10 04 TO 06
9 2017-01 ANG MO KIO 3 ROOM 571 ANG MO KIO AVE 3 01 TO 03
10 2017-01 ANG MO KIO 3 ROOM 534 ANG MO KIO AVE 10 01 TO 03
floor_area_sqm flat_model lease_commence_date remaining_lease \
0 44.0 Improved 1979 61 years 04 months
1 67.0 New Generation 1978 60 years 07 months
2 67.0 New Generation 1980 62 years 05 months
3 68.0 New Generation 1980 62 years 01 month
4 67.0 New Generation 1980 62 years 05 months
5 68.0 New Generation 1981 63 years
6 68.0 New Generation 1979 61 years 06 months
7 67.0 New Generation 1976 58 years 04 months
8 68.0 New Generation 1979 61 years 06 months
9 67.0 New Generation 1979 61 years 04 months
10 68.0 New Generation 1980 62 years 01 month
resale_price
0 232000.0
1 250000.0
2 262000.0
3 265000.0
4 265000.0
5 275000.0
6 280000.0
7 285000.0
8 285000.0
9 285000.0
10 288500.0
<class 'pandas.core.frame.DataFrame'>
Notice, however, that slicing in a Pandas data frame is inclusive of the ending index, unlike Python’s slicing. The other thing to note is that the output data type is no longer a Series but rather a `DataFrame`. The reason is that the data is now two-dimensional.
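To make the contrast concrete, here is a small sketch comparing Python’s list slicing with label-based slicing on the same data frame (assuming `df` from above):

```python
# Python slicing excludes the end index: 10 elements (indices 0 to 9).
print(len(list(range(20))[0:10]))

# Pandas .loc slicing includes the end label: 11 rows (labels 0 to 10).
print(len(df.loc[0:10, :]))
```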
You can specify both the rows and the columns you want as shown below.
df.loc[0:10,'month':'remaining_lease']
month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01 | ANG MO KIO | 2 ROOM | 406 | ANG MO KIO AVE 10 | 10 TO 12 | 44.0 | Improved | 1979 | 61 years 04 months |
1 | 2017-01 | ANG MO KIO | 3 ROOM | 108 | ANG MO KIO AVE 4 | 01 TO 03 | 67.0 | New Generation | 1978 | 60 years 07 months |
2 | 2017-01 | ANG MO KIO | 3 ROOM | 602 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months |
3 | 2017-01 | ANG MO KIO | 3 ROOM | 465 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1980 | 62 years 01 month |
4 | 2017-01 | ANG MO KIO | 3 ROOM | 601 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months |
5 | 2017-01 | ANG MO KIO | 3 ROOM | 150 | ANG MO KIO AVE 5 | 01 TO 03 | 68.0 | New Generation | 1981 | 63 years |
6 | 2017-01 | ANG MO KIO | 3 ROOM | 447 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1979 | 61 years 06 months |
7 | 2017-01 | ANG MO KIO | 3 ROOM | 218 | ANG MO KIO AVE 1 | 04 TO 06 | 67.0 | New Generation | 1976 | 58 years 04 months |
8 | 2017-01 | ANG MO KIO | 3 ROOM | 447 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1979 | 61 years 06 months |
9 | 2017-01 | ANG MO KIO | 3 ROOM | 571 | ANG MO KIO AVE 3 | 01 TO 03 | 67.0 | New Generation | 1979 | 61 years 04 months |
10 | 2017-01 | ANG MO KIO | 3 ROOM | 534 | ANG MO KIO AVE 10 | 01 TO 03 | 68.0 | New Generation | 1980 | 62 years 01 month |
If you want to select particular columns, you can pass in a list of column names as shown in the example below.
columns = ['town', 'block', 'resale_price']
df.loc[:, columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
... | ... | ... | ... |
95853 | YISHUN | 326 | 650000.0 |
95854 | YISHUN | 360 | 645000.0 |
95855 | YISHUN | 326 | 585000.0 |
95856 | YISHUN | 355 | 675000.0 |
95857 | YISHUN | 277 | 625000.0 |
95858 rows × 3 columns
A similar output can be obtained without `.loc` by using the square bracket operator directly.
df[columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
... | ... | ... | ... |
95853 | YISHUN | 326 | 650000.0 |
95854 | YISHUN | 360 | 645000.0 |
95855 | YISHUN | 326 | 585000.0 |
95856 | YISHUN | 355 | 675000.0 |
95857 | YISHUN | 277 | 625000.0 |
95858 rows × 3 columns
You can also combine specifying the rows and the columns as usual using `.loc`.
df.loc[0:10, columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
5 | ANG MO KIO | 150 | 275000.0 |
6 | ANG MO KIO | 447 | 280000.0 |
7 | ANG MO KIO | 218 | 285000.0 |
8 | ANG MO KIO | 447 | 285000.0 |
9 | ANG MO KIO | 571 | 285000.0 |
10 | ANG MO KIO | 534 | 288500.0 |
The index does not always have to be an integer. Pandas can take strings as the index of a data frame. But there are times, even when the index is not an integer, when we still prefer to select rows by their position. In this case, we can use `.iloc[position_index, position_column]`.
columns = [1, 3, -1]
df.iloc[0:10, columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
5 | ANG MO KIO | 150 | 275000.0 |
6 | ANG MO KIO | 447 | 280000.0 |
7 | ANG MO KIO | 218 | 285000.0 |
8 | ANG MO KIO | 447 | 285000.0 |
9 | ANG MO KIO | 571 | 285000.0 |
The above code gives the same data frame but uses a different way of specifying it. With `.iloc[]`, we specify the positions of the index and the columns instead of their labels. It happens that for this index, the position numbering is exactly the same as the labels.
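As a quick check that label-based and position-based access refer to the same cells here, a small sketch (the column positions follow the column order shown earlier):

```python
# 'town' is the second column, i.e. position 1.
print(df.loc[0, 'town'])   # label-based access
print(df.iloc[0, 1])       # position-based access to the same cell
```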
Selecting Data Using Conditions
We can use conditions with a Pandas data frame to select particular rows and columns using either `.loc[]` or `.iloc[]`. The reason is that these methods can take in boolean arrays.
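To see what such a boolean array looks like, here is a small sketch that builds a mask from a condition before using it with `.loc[]` (assuming `df` from above):

```python
# A boolean Series aligned with df's index: True where the condition holds.
mask = df['resale_price'] > 500_000
print(mask.head())   # the first few True/False values
print(mask.sum())    # how many rows satisfy the condition
```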
Let’s see some examples below. First, let’s list the resale price together with the town and block.
columns = ['town', 'block', 'resale_price']
df.loc[:, columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
... | ... | ... | ... |
95853 | YISHUN | 326 | 650000.0 |
95854 | YISHUN | 360 | 645000.0 |
95855 | YISHUN | 326 | 585000.0 |
95856 | YISHUN | 355 | 675000.0 |
95857 | YISHUN | 277 | 625000.0 |
95858 rows × 3 columns
Let’s say we want to see those sales where the price is greater than $500k. We can use this condition to filter the rows.
df.loc[df['resale_price'] > 500_000, columns]
town | block | resale_price | |
---|---|---|---|
43 | ANG MO KIO | 304 | 518000.0 |
44 | ANG MO KIO | 646 | 518000.0 |
45 | ANG MO KIO | 328 | 560000.0 |
46 | ANG MO KIO | 588C | 688000.0 |
47 | ANG MO KIO | 588D | 730000.0 |
... | ... | ... | ... |
95853 | YISHUN | 326 | 650000.0 |
95854 | YISHUN | 360 | 645000.0 |
95855 | YISHUN | 326 | 585000.0 |
95856 | YISHUN | 355 | 675000.0 |
95857 | YISHUN | 277 | 625000.0 |
27233 rows × 3 columns
Note: Python ignores underscores inside numeric literals, so you can use them to make large numbers easier to read.
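A one-line sketch to confirm that the underscores are purely visual:

```python
print(500_000 == 500000)   # True: both literals have the same value
```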
If we want to find only those sales between \$500k and \$600k, we can use the AND operator `&` to combine more than one condition.
df.loc[(df['resale_price'] >= 500_000) & (df['resale_price'] <= 600_000), columns]
town | block | resale_price | |
---|---|---|---|
43 | ANG MO KIO | 304 | 518000.0 |
44 | ANG MO KIO | 646 | 518000.0 |
45 | ANG MO KIO | 328 | 560000.0 |
49 | ANG MO KIO | 101 | 500000.0 |
110 | BEDOK | 185 | 580000.0 |
... | ... | ... | ... |
95849 | YISHUN | 504C | 550000.0 |
95850 | YISHUN | 511B | 600000.0 |
95851 | YISHUN | 504C | 590000.0 |
95852 | YISHUN | 838 | 571888.0 |
95855 | YISHUN | 326 | 585000.0 |
13478 rows × 3 columns
Note: the parentheses around each of the AND conditions are compulsory.
We can also specify more conditions. For example, if we are only interested in the ANG MO KIO area, we can write the following code.
df.loc[(df['resale_price'] >= 500_000) & (df['resale_price'] <= 600_000) &
(df['town'] == 'ANG MO KIO'), columns]
town | block | resale_price | |
---|---|---|---|
43 | ANG MO KIO | 304 | 518000.0 |
44 | ANG MO KIO | 646 | 518000.0 |
45 | ANG MO KIO | 328 | 560000.0 |
49 | ANG MO KIO | 101 | 500000.0 |
1219 | ANG MO KIO | 351 | 530000.0 |
... | ... | ... | ... |
94741 | ANG MO KIO | 545 | 590000.0 |
94742 | ANG MO KIO | 545 | 600000.0 |
94743 | ANG MO KIO | 551 | 520000.0 |
94746 | ANG MO KIO | 642 | 545000.0 |
94749 | ANG MO KIO | 353 | 588000.0 |
329 rows × 3 columns
If you are only interested in blocks in the 300s and 400s, you can add these conditions as well.
df.loc[(df['resale_price'] >= 500_000) & (df['resale_price'] <= 600_000) &
(df['town'] == 'ANG MO KIO') &
(df['block'] >= '300') & (df['block'] < '500'), columns]
town | block | resale_price | |
---|---|---|---|
43 | ANG MO KIO | 304 | 518000.0 |
45 | ANG MO KIO | 328 | 560000.0 |
1219 | ANG MO KIO | 351 | 530000.0 |
2337 | ANG MO KIO | 344 | 538000.0 |
2338 | ANG MO KIO | 329 | 548000.0 |
... | ... | ... | ... |
92327 | ANG MO KIO | 353 | 520000.0 |
92332 | ANG MO KIO | 459 | 600000.0 |
92335 | ANG MO KIO | 459 | 500000.0 |
94740 | ANG MO KIO | 305 | 537000.0 |
94749 | ANG MO KIO | 353 | 588000.0 |
174 rows × 3 columns
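As an aside, the price-range condition can also be written with `Series.between()`, which is inclusive on both ends by default; a small sketch equivalent to the \$500k to \$600k filter above:

```python
in_range = df['resale_price'].between(500_000, 600_000)   # inclusive on both ends
df.loc[in_range & (df['town'] == 'ANG MO KIO'), columns]
```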
Series and DataFrame Functions
Pandas also provides several functions that can be useful in understanding the data. In this section, we will explore some of these.
Creating DataFrame and Series
We can create a new `DataFrame` from other data types such as dictionaries, list-like objects, or Series. For example, given a `Series`, you can convert it into a `DataFrame` as shown below.
price = df['resale_price']
print(isinstance(price, pd.Series))
price_df = pd.DataFrame(price)
print(isinstance(price_df, pd.DataFrame))
price_df
True
True
resale_price | |
---|---|
0 | 232000.0 |
1 | 250000.0 |
2 | 262000.0 |
3 | 265000.0 |
4 | 265000.0 |
... | ... |
95853 | 650000.0 |
95854 | 645000.0 |
95855 | 585000.0 |
95856 | 675000.0 |
95857 | 625000.0 |
95858 rows × 1 columns
Similarly, you can convert other data to a `Series` by using its constructor. In the example below, we create a new series from a list of the integers from 2 to 100.
new_series = pd.Series(list(range(2,101)))
print(isinstance(new_series, pd.Series))
new_series
True
0 2
1 3
2 4
3 5
4 6
...
94 96
95 97
96 98
97 99
98 100
Length: 99, dtype: int64
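The `DataFrame` constructor also accepts a dictionary mapping column names to values. A small sketch with two illustrative rows:

```python
# Keys become column names; the lists become the column values (illustrative data).
toy_df = pd.DataFrame({
    'town': ['ANG MO KIO', 'YISHUN'],
    'resale_price': [232000.0, 650000.0],
})
print(toy_df)
```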
Copying
One useful method copies a data frame into another data frame: `df.copy()`. This method has an argument `deep`, which is `True` by default. If it is true, it performs a deep copy of the data frame; otherwise, it performs a shallow copy. See the documentation.
df2 = df.copy()
df2
month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01 | ANG MO KIO | 2 ROOM | 406 | ANG MO KIO AVE 10 | 10 TO 12 | 44.0 | Improved | 1979 | 61 years 04 months | 232000.0 |
1 | 2017-01 | ANG MO KIO | 3 ROOM | 108 | ANG MO KIO AVE 4 | 01 TO 03 | 67.0 | New Generation | 1978 | 60 years 07 months | 250000.0 |
2 | 2017-01 | ANG MO KIO | 3 ROOM | 602 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months | 262000.0 |
3 | 2017-01 | ANG MO KIO | 3 ROOM | 465 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1980 | 62 years 01 month | 265000.0 |
4 | 2017-01 | ANG MO KIO | 3 ROOM | 601 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months | 265000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95853 | 2021-04 | YISHUN | EXECUTIVE | 326 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 04 months | 650000.0 |
95854 | 2021-04 | YISHUN | EXECUTIVE | 360 | YISHUN RING RD | 04 TO 06 | 146.0 | Maisonette | 1988 | 66 years 04 months | 645000.0 |
95855 | 2021-04 | YISHUN | EXECUTIVE | 326 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 04 months | 585000.0 |
95856 | 2021-04 | YISHUN | EXECUTIVE | 355 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 08 months | 675000.0 |
95857 | 2021-04 | YISHUN | EXECUTIVE | 277 | YISHUN ST 22 | 04 TO 06 | 146.0 | Maisonette | 1985 | 63 years 05 months | 625000.0 |
95858 rows × 11 columns
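For completeness, a shallow copy is also possible; a minimal sketch (exactly how data is shared between the two frames depends on the pandas version):

```python
df_deep = df.copy()                 # deep copy (the default): duplicates the data
df_shallow = df.copy(deep=False)    # shallow copy: shares the underlying data with df
```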
Statistical Functions
We can get some descriptive statistics about the data using Pandas functions. For example, we can get the five-number summary (along with the count, mean, and standard deviation) using the `.describe()` method.
df.describe()
floor_area_sqm | lease_commence_date | resale_price | |
---|---|---|---|
count | 95858.000000 | 95858.000000 | 9.585800e+04 |
mean | 97.772234 | 1994.553934 | 4.467242e+05 |
std | 24.238799 | 13.128913 | 1.552974e+05 |
min | 31.000000 | 1966.000000 | 1.400000e+05 |
25% | 82.000000 | 1984.000000 | 3.350000e+05 |
50% | 95.000000 | 1995.000000 | 4.160000e+05 |
75% | 113.000000 | 2004.000000 | 5.250000e+05 |
max | 249.000000 | 2019.000000 | 1.258000e+06 |
The output above only shows a few columns because the other columns are not numeric. Pandas only computes statistics for columns containing numeric values. We can also call the individual statistical functions as shown below.
print(df['resale_price'].mean())
446724.22886801313
print(df['resale_price'].std())
155297.43748684428
print(df['resale_price'].min())
140000.0
print(df['resale_price'].max())
1258000.0
print(df['resale_price'].quantile(q=0.75))
525000.0
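For the non-numeric columns that `.describe()` skips, frequency counts are often more informative. A small sketch using `value_counts()`:

```python
print(df['town'].value_counts().head())   # the most frequent towns in the data
print(df['flat_type'].value_counts())     # how many sales of each flat type
```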
You can change the way the statistics are computed. Currently, the statistics are calculated over all the rows, i.e. down the vertical dimension; this is what Pandas calls `axis=0`. You can change it to compute across all the columns by specifying `axis=1`.
df.mean(axis=1)
0 78007.666667
1 84015.000000
2 88015.666667
3 89016.000000
4 89015.666667
...
95853 217378.000000
95854 215711.333333
95855 195711.333333
95856 225711.333333
95857 209043.666667
Length: 95858, dtype: float64
Again, Pandas only computes the mean from the numeric data across the columns.
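Note that recent pandas versions no longer silently drop the non-numeric columns when computing a row-wise mean, so the call above may raise an error depending on your version; a hedged sketch of the explicit form:

```python
# Explicitly restrict the computation to numeric columns (needed in newer pandas versions).
df.mean(axis=1, numeric_only=True)
```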
Transposing Data Frame
You can also turn the rows into columns and the columns into rows. For example, previously we created this data frame from a `Series` when extracting row 0.
df_row0
0 | |
---|---|
month | 2017-01 |
town | ANG MO KIO |
flat_type | 2 ROOM |
block | 406 |
street_name | ANG MO KIO AVE 10 |
storey_range | 10 TO 12 |
floor_area_sqm | 44 |
flat_model | Improved |
lease_commence_date | 1979 |
remaining_lease | 61 years 04 months |
resale_price | 232000 |
In the data frame above, the column is row 0 and the rows are the different column names. You can transpose the data using the `.T` property.
df_row0_transposed = df_row0.T
df_row0_transposed
month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01 | ANG MO KIO | 2 ROOM | 406 | ANG MO KIO AVE 10 | 10 TO 12 | 44 | Improved | 1979 | 61 years 04 months | 232000 |
Vector Operations
One useful method in Pandas is `.apply()` (see the documentation), which applies a function to all the data in a column, a row, or a Series in a vectorized manner. This way, we need not loop over the data one element at a time to perform the computation.
For example, if we want to create a column for the resale price in units of \$1000, we can use the `.apply()` method, dividing the `resale_price` column by 1000.
def divide_by_1000(data):
return data / 1000
df['resale_price_in1000'] = df['resale_price'].apply(divide_by_1000)
df['resale_price_in1000']
0 232.0
1 250.0
2 262.0
3 265.0
4 265.0
...
95853 650.0
95854 645.0
95855 585.0
95856 675.0
95857 625.0
Name: resale_price_in1000, Length: 95858, dtype: float64
The `.apply()` method takes in a function that is applied to every element in that Series. Instead of creating a named function, we can make use of Python’s lambda functions to do the same.
df['resale_price_in1000'] = df['resale_price'].apply(lambda data: data/1000)
df['resale_price_in1000']
0 232.0
1 250.0
2 262.0
3 265.0
4 265.0
...
95853 650.0
95854 645.0
95855 585.0
95856 675.0
95857 625.0
Name: resale_price_in1000, Length: 95858, dtype: float64
Notice that the argument of `divide_by_1000()` becomes the first token after the keyword `lambda`. The return value of the function is the expression after the colon `:`.
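Since arithmetic on a `Series` is itself vectorized, the same column can also be computed without `.apply()` at all; a short sketch:

```python
# Element-wise division is broadcast over the whole column.
df['resale_price_in1000'] = df['resale_price'] / 1000
```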
You can use this to process and create any other kind of data. For example, we can create a new categorical column called `pricey`: any sale above \$500k is considered pricey, otherwise it is not. If it is pricey, we label it as 1; otherwise, as 0.
df['pricey'] = df['resale_price_in1000'].apply(lambda price: 1 if price > 500 else 0 )
df[['resale_price_in1000', 'pricey']]
resale_price_in1000 | pricey | |
---|---|---|
0 | 232.0 | 0 |
1 | 250.0 | 0 |
2 | 262.0 | 0 |
3 | 265.0 | 0 |
4 | 265.0 | 0 |
... | ... | ... |
95853 | 650.0 | 1 |
95854 | 645.0 | 1 |
95855 | 585.0 | 1 |
95856 | 675.0 | 1 |
95857 | 625.0 | 1 |
95858 rows × 2 columns
In the above function, we use a conditional expression (an if expression) to specify the return value of the lambda function. It follows this format:
expression_if_true if condition else expression_if_false
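A tiny sketch of the same conditional expression outside a lambda:

```python
price = 650
label = 1 if price > 500 else 0   # evaluates to 1 because the condition is true
print(label)
```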
There are many other Pandas functions and methods. It is recommended that you look into the documentation for further references.
Normalization
We often need to normalize the data, both the features and the target. The reason is that each column in the dataset may have a different scale. For example, the column `floor_area_sqm` ranges between 31 and 249 while `lease_commence_date` ranges between 1966 and 2019. See the statistics for these two columns below.
display(df['floor_area_sqm'].describe())
display(df['lease_commence_date'].describe())
count 95858.000000
mean 97.772234
std 24.238799
min 31.000000
25% 82.000000
50% 95.000000
75% 113.000000
max 249.000000
Name: floor_area_sqm, dtype: float64
count 95858.000000
mean 1994.553934
std 13.128913
min 1966.000000
25% 1984.000000
50% 1995.000000
75% 2004.000000
max 2019.000000
Name: lease_commence_date, dtype: float64
As we will see in subsequent weeks, we usually need to normalize the data before training our machine learning model. There are two common normalizations:
- z normalization
- min-max normalization
You will work on the functions to calculate these normalizations in your problem sets.
Z Normalization
This is also called standardization. In this transformation, we shift the mean of the data distribution to 0 and scale its standard deviation to 1. The equation is given as follows.
\[\text{normalized} = \frac{\text{data} - \mu}{\sigma}\]

Min-Max Normalization
In this transformation, we scale the data in such a way that the maximum value in the distribution is 1 and the minimum value is 0. We can scale it using the following equation.

\[\text{normalized} = \frac{\text{data} - \text{min}}{\text{max} - \text{min}}\]
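As a preview (the full implementations are left for your problem sets), here is a minimal sketch of both normalizations applied to one column, using only Pandas operations introduced above:

```python
col = df['floor_area_sqm']

# Z normalization: subtract the mean, divide by the standard deviation.
z_norm = (col - col.mean()) / col.std()

# Min-max normalization: shift to start at 0, divide by the range.
minmax_norm = (col - col.min()) / (col.max() - col.min())

print(z_norm.describe())
print(minmax_norm.describe())
```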
Splitting Dataset

One common pre-processing operation in machine learning is to split the data into:
- training dataset
- test dataset
The idea of splitting the dataset is simply that we should NOT use the same data to verify the model that we train. Let’s illustrate this using our HDB resale price dataset, which has 95858 entries. In our machine learning process, we would like to use this data to do the following:
- Train the model using the dataset
- Verify the accuracy of the model
If we only have one dataset, we cannot verify the model’s accuracy using the same data we used to train it. This bias would obviously inflate the accuracy. The analogy is a teacher giving exactly the same questions in the exam as in the practice session.
To overcome this, we should split the data into two. One set is used to train the model while the other one is used to verify the model. Coming back to the analogy of the teacher giving the exam, if the teacher has a bank of questions, he or she should separate the questions into two. Some questions can be used for practice before the exam, while the rest can be used as exam questions.
If we illustrate this using our HDB resale price dataset, this means that we have to split the table into two. Out of the 95858 entries, we take some as our training dataset and leave the rest for our test dataset. A common proportion is 70% or 80% for the training dataset, leaving the other 30% or 20% for the test dataset.
One important note is that the split must be done randomly. This avoids systematic bias in the split of the dataset. For example, the data may have been entered according to flat type, so flats with fewer rooms and smaller floor areas appear in the top rows while the larger flats appear near the end. If we do not split the data randomly, we may end up with no larger flats in our training set and train only on the smaller flats. The same can happen with any other column, such as the block or the town. A sketch of such a random split is shown below.
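A minimal sketch of a random 70/30 split using only Pandas; the fraction and the seed value are arbitrary choices for illustration:

```python
# Randomly sample 70% of the rows for training; random_state makes the split reproducible.
df_train = df.sample(frac=0.7, random_state=100)

# The remaining 30% of the rows form the test set.
df_test = df.drop(index=df_train.index)

print(df_train.shape, df_test.shape)
```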
There are times in machine learning when we need to experiment with different parameters and find the optimal ones. In these cases, the dataset is usually split into three:
- training dataset, which is used to build the model
- validation dataset, which is used to evaluate the model for various parameters and to choose the optimum parameter
- test dataset, which is used to evaluate the model built with the optimum parameter found previously
You will see some of these applications in the subsequent weeks.