By the end of this lesson, you should be able to:
- Create a Pandas DataFrame and select data from a DataFrame
- Use a library to read a CSV or Excel file
- Give examples of linear regression and classification
- Split data randomly into a training set and a testing set
- Normalize data using min-max normalization
Important Words:
- Data Frame
- Series
- index
- column
- vectorized operation
- normalization
- z-norm
- min-max
- dataset
- test, training, validation set
Short Introduction to Machine Learning
In the previous weeks we discussed how various computations can be done. We began with computational complexity and the divide-and-conquer strategy through recursion. We then discussed how data can have computation associated with it through object-oriented programming. We covered several data structures and the algorithms associated with various problems. We ended by offering a different perspective: looking at computation as a state machine.
In this second half of the course, we will look into how computation can learn from data in order to make a new computation. This new computation is often called a prediction. In these lessons, we focus on what is called supervised machine learning. The term indicates that the computer learns from some existing data how to compute the prediction. For example, given some images labelled as “cat” and “not a cat”, the computer can learn to predict (or to compute) whether a new image given to it is a cat or not.
[Images: an example of a “cat” image alongside a “not a cat” image]
Another example: given some data of housing prices in Singapore with the year of sale, area, number of rooms, and floor height, the computer can predict the price of another house. One more example: given data of breast cancer cells and their measurements, one can predict whether a cell is malignant or benign. Predicting a continuous number such as a price is a regression task, while predicting a category such as malignant or benign is a classification task. Supervised machine learning assumes that we have some existing data labelled with the categories “malignant” and “benign”. Using this labelled data (hence, supervised), the computer can predict the category given some new data.
Reading Data
The first step in machine learning is to understand the data itself. In order to do that, we need to be able to read data from some source. One common source is a text file in CSV (comma-separated values) format. Another common format is an Excel spreadsheet. The data can also come from a database or a server. Different data sources require different ways of handling them, but in all cases we need to know how to read the data.
For this purpose we will use the Pandas library to read our data. We import Pandas into our Python code by typing the following.
import pandas as pd
Now we can use Pandas functions to read the data. For example, if we want to read a CSV file, we can simply type:
df = pd.read_csv('mydata.csv')
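Pandas can also read Excel spreadsheets. Below is a minimal sketch, assuming a hypothetical file `mydata.xlsx` and sheet name; reading `.xlsx` files typically requires the `openpyxl` package to be installed.

```python
import pandas as pd

# Hypothetical Excel file and sheet name, shown for illustration only.
df = pd.read_excel('mydata.xlsx', sheet_name='Sheet1')
```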
Let’s take the example of Singapore housing prices. We can get this data from Data.gov.sg. We have downloaded the CSV file so that you can access it from the following Dropbox link. We use the URL of the raw file to open the CSV in our Python code.
import pandas as pd
file_url = 'https://www.dropbox.com/s/jz8ck0obu9u1rng/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv?raw=1'
df = pd.read_csv(file_url)
df
month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01 | ANG MO KIO | 2 ROOM | 406 | ANG MO KIO AVE 10 | 10 TO 12 | 44.0 | Improved | 1979 | 61 years 04 months | 232000.0 |
1 | 2017-01 | ANG MO KIO | 3 ROOM | 108 | ANG MO KIO AVE 4 | 01 TO 03 | 67.0 | New Generation | 1978 | 60 years 07 months | 250000.0 |
2 | 2017-01 | ANG MO KIO | 3 ROOM | 602 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months | 262000.0 |
3 | 2017-01 | ANG MO KIO | 3 ROOM | 465 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1980 | 62 years 01 month | 265000.0 |
4 | 2017-01 | ANG MO KIO | 3 ROOM | 601 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months | 265000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95853 | 2021-04 | YISHUN | EXECUTIVE | 326 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 04 months | 650000.0 |
95854 | 2021-04 | YISHUN | EXECUTIVE | 360 | YISHUN RING RD | 04 TO 06 | 146.0 | Maisonette | 1988 | 66 years 04 months | 645000.0 |
95855 | 2021-04 | YISHUN | EXECUTIVE | 326 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 04 months | 585000.0 |
95856 | 2021-04 | YISHUN | EXECUTIVE | 355 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 08 months | 675000.0 |
95857 | 2021-04 | YISHUN | EXECUTIVE | 277 | YISHUN ST 22 | 04 TO 06 | 146.0 | Maisonette | 1985 | 63 years 05 months | 625000.0 |
95858 rows × 11 columns
The output of the `read_csv()` function is of Pandas’ `DataFrame` type.
type(df)
pandas.core.frame.DataFrame
`DataFrame` is Pandas’ class that contains attributes and methods to work with tabular data as shown above. Recall that we can create our own custom data type using the keyword `class` and define its attributes and methods. We can even override some of Python’s operators to work with our new data type. This is what the Pandas library does with `DataFrame`. The `DataFrame` class provides properties and methods that allow us to work with tabular data.
For example, we can get all the names of the columns in our data frame using the `df.columns` property.
df.columns
Index(['month', 'town', 'flat_type', 'block', 'street_name', 'storey_range',
'floor_area_sqm', 'flat_model', 'lease_commence_date',
'remaining_lease', 'resale_price'],
dtype='object')
We can also get the index (the names of the rows) using `df.index`.
df.index
RangeIndex(start=0, stop=95858, step=1)
We can also treat this data frame as a kind of matrix and find its shape using `df.shape`.
df.shape
(95858, 11)
As we can see, the data contains 95858 rows and 11 columns. One of the columns is called `resale_price`. Since our aim is to predict the house price, this column is usually called the target. The rest of the columns are called the features. This means that we have 10 feature columns.
The idea of supervised machine learning is that, using data such as that shown above, the computer predicts the target given a new set of features. The computer does this by learning from the existing labelled data. The label in this case is the resale price from the historical sales data.
In order to understand the data, it is important to be able to manipulate and work on the data frame.
Data Frame Operations
It is important to know how to manipulate the data. Pandas has two data structures:
- Series
- DataFrame
You can think of `Series` data as a one-dimensional labelled array, while `DataFrame` data is a two-dimensional labelled data structure. For example, the table that we saw previously, which is the output of the `read_csv()` function, is a `DataFrame` because it has both rows and columns and is therefore two-dimensional. On the other hand, we can access just one of the columns of the data frame to get `Series` data.
Getting a Column or a Row as a Series
You can access a column’s data as a Series using the square bracket operator.
df[column_name]
print(df['resale_price'])
print(type(df['resale_price']))
0 232000.0
1 250000.0
2 262000.0
3 265000.0
4 265000.0
...
95853 650000.0
95854 645000.0
95855 585000.0
95856 675000.0
95857 625000.0
Name: resale_price, Length: 95858, dtype: float64
<class 'pandas.core.series.Series'>
The code above prints the column `resale_price` and its type. As can be seen, the type of the output is `Series`.
You can also get a particular row by specifying its index.
You can also access a column using the `.loc[index, column]` syntax. Here, you need to specify the labels of the index and the columns. For example, to access all the rows of the column `resale_price`, we can do as follows. Notice that we use `:` to access all the rows, and we specify the name of the column in the code below.
print(df.loc[:, 'resale_price'])
print(type(df.loc[:, 'resale_price']))
0 232000.0
1 250000.0
2 262000.0
3 265000.0
4 265000.0
...
95853 650000.0
95854 645000.0
95855 585000.0
95856 675000.0
95857 625000.0
Name: resale_price, Length: 95858, dtype: float64
<class 'pandas.core.series.Series'>
In the above code, we set the index to `:` to access all rows. Recall that in Python’s list slicing, we also use `:` to access all the elements. Similarly, here we use `:` to access all the rows. In the same way, we can use `:` to access all the columns, e.g. `df.loc[:, :]` will copy the whole data frame.
This also gives you a hint on how to access a particular row. Let’s say you only want to access the first row; you can type the following.
print(df.loc[0, :])
print(type(df.loc[0, :]))
month 2017-01
town ANG MO KIO
flat_type 2 ROOM
block 406
street_name ANG MO KIO AVE 10
storey_range 10 TO 12
floor_area_sqm 44
flat_model Improved
lease_commence_date 1979
remaining_lease 61 years 04 months
resale_price 232000
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
In the above code, we access the first row, which is at index 0, and all the columns.
Recall that all these data are of type `Series`. You can create a data frame from an existing series just as you create any other object: by instantiating a `DataFrame` object and passing in an argument as shown below.
df_row0 = pd.DataFrame(df.loc[0, :])
df_row0
0 | |
---|---|
month | 2017-01 |
town | ANG MO KIO |
flat_type | 2 ROOM |
block | 406 |
street_name | ANG MO KIO AVE 10 |
storey_range | 10 TO 12 |
floor_area_sqm | 44 |
flat_model | Improved |
lease_commence_date | 1979 |
remaining_lease | 61 years 04 months |
resale_price | 232000 |
Getting Rows and Columns as DataFrame
The `:` operator works similarly to Python’s slicing. This means that you can get some rows by slicing them. For example, you can access the rows with index 0 to 10 as follows.
print(df.loc[0:10, :])
print(type(df.loc[0:10, :]))
month town flat_type block street_name storey_range \
0 2017-01 ANG MO KIO 2 ROOM 406 ANG MO KIO AVE 10 10 TO 12
1 2017-01 ANG MO KIO 3 ROOM 108 ANG MO KIO AVE 4 01 TO 03
2 2017-01 ANG MO KIO 3 ROOM 602 ANG MO KIO AVE 5 01 TO 03
3 2017-01 ANG MO KIO 3 ROOM 465 ANG MO KIO AVE 10 04 TO 06
4 2017-01 ANG MO KIO 3 ROOM 601 ANG MO KIO AVE 5 01 TO 03
5 2017-01 ANG MO KIO 3 ROOM 150 ANG MO KIO AVE 5 01 TO 03
6 2017-01 ANG MO KIO 3 ROOM 447 ANG MO KIO AVE 10 04 TO 06
7 2017-01 ANG MO KIO 3 ROOM 218 ANG MO KIO AVE 1 04 TO 06
8 2017-01 ANG MO KIO 3 ROOM 447 ANG MO KIO AVE 10 04 TO 06
9 2017-01 ANG MO KIO 3 ROOM 571 ANG MO KIO AVE 3 01 TO 03
10 2017-01 ANG MO KIO 3 ROOM 534 ANG MO KIO AVE 10 01 TO 03
floor_area_sqm flat_model lease_commence_date remaining_lease \
0 44.0 Improved 1979 61 years 04 months
1 67.0 New Generation 1978 60 years 07 months
2 67.0 New Generation 1980 62 years 05 months
3 68.0 New Generation 1980 62 years 01 month
4 67.0 New Generation 1980 62 years 05 months
5 68.0 New Generation 1981 63 years
6 68.0 New Generation 1979 61 years 06 months
7 67.0 New Generation 1976 58 years 04 months
8 68.0 New Generation 1979 61 years 06 months
9 67.0 New Generation 1979 61 years 04 months
10 68.0 New Generation 1980 62 years 01 month
resale_price
0 232000.0
1 250000.0
2 262000.0
3 265000.0
4 265000.0
5 275000.0
6 280000.0
7 285000.0
8 285000.0
9 285000.0
10 288500.0
<class 'pandas.core.frame.DataFrame'>
Notice, however, that slicing in a Pandas data frame is inclusive of the ending index, unlike Python’s slicing. The other thing to note is that the output data type is no longer a Series but rather a `DataFrame`. The reason is that the data is now two-dimensional.
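To make the contrast concrete, here is a small sketch comparing Python’s list slicing with label-based slicing on the same data frame (assuming `df` from above):

```python
# Python slicing excludes the end index: 10 elements (indices 0 to 9).
print(len(list(range(20))[0:10]))

# Pandas .loc slicing includes the end label: 11 rows (labels 0 to 10).
print(len(df.loc[0:10, :]))
```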
You can specify both the rows and the columns you want as shown below.
df.loc[0:10,'month':'remaining_lease']
month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01 | ANG MO KIO | 2 ROOM | 406 | ANG MO KIO AVE 10 | 10 TO 12 | 44.0 | Improved | 1979 | 61 years 04 months |
1 | 2017-01 | ANG MO KIO | 3 ROOM | 108 | ANG MO KIO AVE 4 | 01 TO 03 | 67.0 | New Generation | 1978 | 60 years 07 months |
2 | 2017-01 | ANG MO KIO | 3 ROOM | 602 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months |
3 | 2017-01 | ANG MO KIO | 3 ROOM | 465 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1980 | 62 years 01 month |
4 | 2017-01 | ANG MO KIO | 3 ROOM | 601 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months |
5 | 2017-01 | ANG MO KIO | 3 ROOM | 150 | ANG MO KIO AVE 5 | 01 TO 03 | 68.0 | New Generation | 1981 | 63 years |
6 | 2017-01 | ANG MO KIO | 3 ROOM | 447 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1979 | 61 years 06 months |
7 | 2017-01 | ANG MO KIO | 3 ROOM | 218 | ANG MO KIO AVE 1 | 04 TO 06 | 67.0 | New Generation | 1976 | 58 years 04 months |
8 | 2017-01 | ANG MO KIO | 3 ROOM | 447 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1979 | 61 years 06 months |
9 | 2017-01 | ANG MO KIO | 3 ROOM | 571 | ANG MO KIO AVE 3 | 01 TO 03 | 67.0 | New Generation | 1979 | 61 years 04 months |
10 | 2017-01 | ANG MO KIO | 3 ROOM | 534 | ANG MO KIO AVE 10 | 01 TO 03 | 68.0 | New Generation | 1980 | 62 years 01 month |
If you want to select particular columns, you can pass in a list of column names as shown in the example below.
columns = ['town', 'block', 'resale_price']
df.loc[:, columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
... | ... | ... | ... |
95853 | YISHUN | 326 | 650000.0 |
95854 | YISHUN | 360 | 645000.0 |
95855 | YISHUN | 326 | 585000.0 |
95856 | YISHUN | 355 | 675000.0 |
95857 | YISHUN | 277 | 625000.0 |
95858 rows × 3 columns
A similar output can be obtained without `.loc` by using the square bracket operator directly.
df[columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
... | ... | ... | ... |
95853 | YISHUN | 326 | 650000.0 |
95854 | YISHUN | 360 | 645000.0 |
95855 | YISHUN | 326 | 585000.0 |
95856 | YISHUN | 355 | 675000.0 |
95857 | YISHUN | 277 | 625000.0 |
95858 rows × 3 columns
You can also combine specifying the rows and the columns as usual using `.loc`.
df.loc[0:10, columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
5 | ANG MO KIO | 150 | 275000.0 |
6 | ANG MO KIO | 447 | 280000.0 |
7 | ANG MO KIO | 218 | 285000.0 |
8 | ANG MO KIO | 447 | 285000.0 |
9 | ANG MO KIO | 571 | 285000.0 |
10 | ANG MO KIO | 534 | 288500.0 |
The index does not always have to be an integer. Pandas can take strings as the index of a data frame. But there are times, even when the index is not an integer, when we still prefer to select rows by their position. In this case, we can use `.iloc[position_index, position_column]`.
columns = [1, 3, -1]
df.iloc[0:10, columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
5 | ANG MO KIO | 150 | 275000.0 |
6 | ANG MO KIO | 447 | 280000.0 |
7 | ANG MO KIO | 218 | 285000.0 |
8 | ANG MO KIO | 447 | 285000.0 |
9 | ANG MO KIO | 571 | 285000.0 |
The above code gives the same data frame but uses a different way of specifying it. With `.iloc[]`, we specify the positions of the index and the columns instead of their labels. It happens that for this index, the position numbering is exactly the same as the labels.
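As a quick check that label-based and position-based access refer to the same cells here, a small sketch (the column positions follow the column order shown earlier):

```python
# 'town' is the second column, i.e. position 1.
print(df.loc[0, 'town'])   # label-based access
print(df.iloc[0, 1])       # position-based access to the same cell
```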
Selecting Data Using Conditions
We can use conditions with a Pandas data frame to select particular rows and columns using either `.loc[]` or `.iloc[]`. The reason is that these methods can take in boolean arrays.
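To see what such a boolean array looks like, here is a small sketch that builds a mask from a condition before using it with `.loc[]` (assuming `df` from above):

```python
# A boolean Series aligned with df's index: True where the condition holds.
mask = df['resale_price'] > 500_000
print(mask.head())   # the first few True/False values
print(mask.sum())    # how many rows satisfy the condition
```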
Let’s see some examples below. First, let’s list the resale price together with the town and block.
columns = ['town', 'block', 'resale_price']
df.loc[:, columns]
town | block | resale_price | |
---|---|---|---|
0 | ANG MO KIO | 406 | 232000.0 |
1 | ANG MO KIO | 108 | 250000.0 |
2 | ANG MO KIO | 602 | 262000.0 |
3 | ANG MO KIO | 465 | 265000.0 |
4 | ANG MO KIO | 601 | 265000.0 |
... | ... | ... | ... |
95853 | YISHUN | 326 | 650000.0 |
95854 | YISHUN | 360 | 645000.0 |
95855 | YISHUN | 326 | 585000.0 |
95856 | YISHUN | 355 | 675000.0 |
95857 | YISHUN | 277 | 625000.0 |
95858 rows × 3 columns
Let’s say we want to see those sales where the price is greater than $500k. We can use this condition to filter the rows.
df.loc[df['resale_price'] > 500_000, columns]
town | block | resale_price | |
---|---|---|---|
43 | ANG MO KIO | 304 | 518000.0 |
44 | ANG MO KIO | 646 | 518000.0 |
45 | ANG MO KIO | 328 | 560000.0 |
46 | ANG MO KIO | 588C | 688000.0 |
47 | ANG MO KIO | 588D | 730000.0 |
... | ... | ... | ... |
95853 | YISHUN | 326 | 650000.0 |
95854 | YISHUN | 360 | 645000.0 |
95855 | YISHUN | 326 | 585000.0 |
95856 | YISHUN | 355 | 675000.0 |
95857 | YISHUN | 277 | 625000.0 |
27233 rows × 3 columns
Note: Python ignores underscores inside numeric literals, so you can use them to make large numbers easier to read.
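A one-line sketch to confirm that the underscores are purely visual:

```python
print(500_000 == 500000)   # True: both literals have the same value
```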
If we want to find only those sales between \$500k and \$600k, we can use the AND operator `&` to combine more than one condition.
df.loc[(df['resale_price'] >= 500_000) & (df['resale_price'] <= 600_000), columns]
town | block | resale_price | |
---|---|---|---|
43 | ANG MO KIO | 304 | 518000.0 |
44 | ANG MO KIO | 646 | 518000.0 |
45 | ANG MO KIO | 328 | 560000.0 |
49 | ANG MO KIO | 101 | 500000.0 |
110 | BEDOK | 185 | 580000.0 |
... | ... | ... | ... |
95849 | YISHUN | 504C | 550000.0 |
95850 | YISHUN | 511B | 600000.0 |
95851 | YISHUN | 504C | 590000.0 |
95852 | YISHUN | 838 | 571888.0 |
95855 | YISHUN | 326 | 585000.0 |
13478 rows × 3 columns
Note: the parentheses around each of the AND conditions are compulsory.
We can also specify more conditions. For example, if we are only interested in the ANG MO KIO area, we can write the following code.
df.loc[(df['resale_price'] >= 500_000) & (df['resale_price'] <= 600_000) &
(df['town'] == 'ANG MO KIO'), columns]
town | block | resale_price | |
---|---|---|---|
43 | ANG MO KIO | 304 | 518000.0 |
44 | ANG MO KIO | 646 | 518000.0 |
45 | ANG MO KIO | 328 | 560000.0 |
49 | ANG MO KIO | 101 | 500000.0 |
1219 | ANG MO KIO | 351 | 530000.0 |
... | ... | ... | ... |
94741 | ANG MO KIO | 545 | 590000.0 |
94742 | ANG MO KIO | 545 | 600000.0 |
94743 | ANG MO KIO | 551 | 520000.0 |
94746 | ANG MO KIO | 642 | 545000.0 |
94749 | ANG MO KIO | 353 | 588000.0 |
329 rows × 3 columns
If you are only interested in blocks in the 300s and 400s, you can add these conditions as well.
df.loc[(df['resale_price'] >= 500_000) & (df['resale_price'] <= 600_000) &
(df['town'] == 'ANG MO KIO') &
(df['block'] >= '300') & (df['block'] < '500'), columns]
town | block | resale_price | |
---|---|---|---|
43 | ANG MO KIO | 304 | 518000.0 |
45 | ANG MO KIO | 328 | 560000.0 |
1219 | ANG MO KIO | 351 | 530000.0 |
2337 | ANG MO KIO | 344 | 538000.0 |
2338 | ANG MO KIO | 329 | 548000.0 |
... | ... | ... | ... |
92327 | ANG MO KIO | 353 | 520000.0 |
92332 | ANG MO KIO | 459 | 600000.0 |
92335 | ANG MO KIO | 459 | 500000.0 |
94740 | ANG MO KIO | 305 | 537000.0 |
94749 | ANG MO KIO | 353 | 588000.0 |
174 rows × 3 columns
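As an aside, the price-range condition can also be written with `Series.between()`, which is inclusive on both ends by default; a small sketch equivalent to the \$500k to \$600k filter above:

```python
in_range = df['resale_price'].between(500_000, 600_000)   # inclusive on both ends
df.loc[in_range & (df['town'] == 'ANG MO KIO'), columns]
```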
Series and DataFrame Functions
Pandas also provides several functions that can be useful in understanding the data. In this section, we will explore some of these.
Creating DataFrame and Series
We can create a new `DataFrame` from other data types such as dictionaries, list-like objects, or Series. For example, given a `Series`, you can convert it into a `DataFrame` as shown below.
price = df['resale_price']
print(isinstance(price, pd.Series))
price_df = pd.DataFrame(price)
print(isinstance(price_df, pd.DataFrame))
price_df
True
True
resale_price | |
---|---|
0 | 232000.0 |
1 | 250000.0 |
2 | 262000.0 |
3 | 265000.0 |
4 | 265000.0 |
... | ... |
95853 | 650000.0 |
95854 | 645000.0 |
95855 | 585000.0 |
95856 | 675000.0 |
95857 | 625000.0 |
95858 rows × 1 columns
Similarly, you can convert other data to a `Series` by using its constructor. In the example below, we create a new series from a list of the integers from 2 to 100.
new_series = pd.Series(list(range(2,101)))
print(isinstance(new_series, pd.Series))
new_series
True
0 2
1 3
2 4
3 5
4 6
...
94 96
95 97
96 98
97 99
98 100
Length: 99, dtype: int64
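The `DataFrame` constructor also accepts a dictionary mapping column names to values. A small sketch with two illustrative rows:

```python
# Keys become column names; the lists become the column values (illustrative data).
toy_df = pd.DataFrame({
    'town': ['ANG MO KIO', 'YISHUN'],
    'resale_price': [232000.0, 650000.0],
})
print(toy_df)
```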
Copying
One useful method copies a data frame into another data frame: `df.copy()`. This method has an argument `deep`, which is `True` by default. If it is true, it performs a deep copy of the data frame; otherwise, it performs a shallow copy. See the documentation.
df2 = df.copy()
df2
month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01 | ANG MO KIO | 2 ROOM | 406 | ANG MO KIO AVE 10 | 10 TO 12 | 44.0 | Improved | 1979 | 61 years 04 months | 232000.0 |
1 | 2017-01 | ANG MO KIO | 3 ROOM | 108 | ANG MO KIO AVE 4 | 01 TO 03 | 67.0 | New Generation | 1978 | 60 years 07 months | 250000.0 |
2 | 2017-01 | ANG MO KIO | 3 ROOM | 602 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months | 262000.0 |
3 | 2017-01 | ANG MO KIO | 3 ROOM | 465 | ANG MO KIO AVE 10 | 04 TO 06 | 68.0 | New Generation | 1980 | 62 years 01 month | 265000.0 |
4 | 2017-01 | ANG MO KIO | 3 ROOM | 601 | ANG MO KIO AVE 5 | 01 TO 03 | 67.0 | New Generation | 1980 | 62 years 05 months | 265000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95853 | 2021-04 | YISHUN | EXECUTIVE | 326 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 04 months | 650000.0 |
95854 | 2021-04 | YISHUN | EXECUTIVE | 360 | YISHUN RING RD | 04 TO 06 | 146.0 | Maisonette | 1988 | 66 years 04 months | 645000.0 |
95855 | 2021-04 | YISHUN | EXECUTIVE | 326 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 04 months | 585000.0 |
95856 | 2021-04 | YISHUN | EXECUTIVE | 355 | YISHUN RING RD | 10 TO 12 | 146.0 | Maisonette | 1988 | 66 years 08 months | 675000.0 |
95857 | 2021-04 | YISHUN | EXECUTIVE | 277 | YISHUN ST 22 | 04 TO 06 | 146.0 | Maisonette | 1985 | 63 years 05 months | 625000.0 |
95858 rows × 11 columns
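For completeness, a shallow copy is also possible; a minimal sketch (exactly how data is shared between the two frames depends on the pandas version):

```python
df_deep = df.copy()                 # deep copy (the default): duplicates the data
df_shallow = df.copy(deep=False)    # shallow copy: shares the underlying data with df
```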
Statistical Functions
We can get some descriptive statistics about the data using Pandas functions. For example, we can get the five-number summary (along with the count, mean, and standard deviation) using the `.describe()` method.
df.describe()
floor_area_sqm | lease_commence_date | resale_price | |
---|---|---|---|
count | 95858.000000 | 95858.000000 | 9.585800e+04 |
mean | 97.772234 | 1994.553934 | 4.467242e+05 |
std | 24.238799 | 13.128913 | 1.552974e+05 |
min | 31.000000 | 1966.000000 | 1.400000e+05 |
25% | 82.000000 | 1984.000000 | 3.350000e+05 |
50% | 95.000000 | 1995.000000 | 4.160000e+05 |
75% | 113.000000 | 2004.000000 | 5.250000e+05 |
max | 249.000000 | 2019.000000 | 1.258000e+06 |
The output above only shows a few columns because the other columns are not numeric. Pandas only computes statistics for columns containing numeric values. We can also call the individual statistical functions as shown below.
print(df['resale_price'].mean())
446724.22886801313
print(df['resale_price'].std())
155297.43748684428
print(df['resale_price'].min())
140000.0
print(df['resale_price'].max())
1258000.0
print(df['resale_price'].quantile(q=0.75))
525000.0
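For the non-numeric columns that `.describe()` skips, frequency counts are often more informative. A small sketch using `value_counts()`:

```python
print(df['town'].value_counts().head())   # the most frequent towns in the data
print(df['flat_type'].value_counts())     # how many sales of each flat type
```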
You can change the way the statistics are computed. Currently, the statistics are calculated over all the rows, i.e. down the vertical dimension; this is what Pandas calls `axis=0`. You can change it to compute across all the columns by specifying `axis=1`.
df.mean(axis=1)
0 78007.666667
1 84015.000000
2 88015.666667
3 89016.000000
4 89015.666667
...
95853 217378.000000
95854 215711.333333
95855 195711.333333
95856 225711.333333
95857 209043.666667
Length: 95858, dtype: float64
Again, Pandas only computes the mean from the numeric data across the columns.
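Note that recent pandas versions no longer silently drop the non-numeric columns when computing a row-wise mean, so the call above may raise an error depending on your version; a hedged sketch of the explicit form:

```python
# Explicitly restrict the computation to numeric columns (needed in newer pandas versions).
df.mean(axis=1, numeric_only=True)
```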
Transposing Data Frame
You can also turn the rows into columns and the columns into rows. For example, previously we created this data frame from a `Series` when extracting row 0.
df_row0
0 | |
---|---|
month | 2017-01 |
town | ANG MO KIO |
flat_type | 2 ROOM |
block | 406 |
street_name | ANG MO KIO AVE 10 |
storey_range | 10 TO 12 |
floor_area_sqm | 44 |
flat_model | Improved |
lease_commence_date | 1979 |
remaining_lease | 61 years 04 months |
resale_price | 232000 |
In the data frame above, the column is row 0 and the rows are the different column names. You can transpose the data using the `.T` property.
df_row0_transposed = df_row0.T
df_row0_transposed
month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01 | ANG MO KIO | 2 ROOM | 406 | ANG MO KIO AVE 10 | 10 TO 12 | 44 | Improved | 1979 | 61 years 04 months | 232000 |
Vector Operations
One useful method in Pandas is `.apply()` (see the documentation), which applies a function to all the data in a column, a row, or a Series in a vectorized manner. This way, we need not loop over the data one element at a time to perform the computation.
For example, if we want to create a column for the resale price in units of \$1000, we can use the `.apply()` method, dividing the `resale_price` column by 1000.
def divide_by_1000(data):
return data / 1000
df['resale_price_in1000'] = df['resale_price'].apply(divide_by_1000)
df['resale_price_in1000']
0 232.0
1 250.0
2 262.0
3 265.0
4 265.0
...
95853 650.0
95854 645.0
95855 585.0
95856 675.0
95857 625.0
Name: resale_price_in1000, Length: 95858, dtype: float64
The `.apply()` method takes in a function that is applied to every element in that Series. Instead of creating a named function, we can make use of Python’s lambda functions to do the same.
df['resale_price_in1000'] = df['resale_price'].apply(lambda data: data/1000)
df['resale_price_in1000']
0 232.0
1 250.0
2 262.0
3 265.0
4 265.0
...
95853 650.0
95854 645.0
95855 585.0
95856 675.0
95857 625.0
Name: resale_price_in1000, Length: 95858, dtype: float64
Notice that the argument of `divide_by_1000()` becomes the first token after the keyword `lambda`. The return value of the function is the expression after the colon `:`.
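Since arithmetic on a `Series` is itself vectorized, the same column can also be computed without `.apply()` at all; a short sketch:

```python
# Element-wise division is broadcast over the whole column.
df['resale_price_in1000'] = df['resale_price'] / 1000
```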
You can use this to process and create any other kind of data. For example, we can create a new categorical column called `pricey`: any sale above \$500k is considered pricey, otherwise it is not. If it is pricey, we label it as 1; otherwise, as 0.
df['pricey'] = df['resale_price_in1000'].apply(lambda price: 1 if price > 500 else 0 )
df[['resale_price_in1000', 'pricey']]
resale_price_in1000 | pricey | |
---|---|---|
0 | 232.0 | 0 |
1 | 250.0 | 0 |
2 | 262.0 | 0 |
3 | 265.0 | 0 |
4 | 265.0 | 0 |
... | ... | ... |
95853 | 650.0 | 1 |
95854 | 645.0 | 1 |
95855 | 585.0 | 1 |
95856 | 675.0 | 1 |
95857 | 625.0 | 1 |
95858 rows × 2 columns
In the above function, we use a conditional expression (an if expression) to specify the return value of the lambda function. It follows this format:
expression_if_true if condition else expression_if_false
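A tiny sketch of the same conditional expression outside a lambda:

```python
price = 650
label = 1 if price > 500 else 0   # evaluates to 1 because the condition is true
print(label)
```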
There are many other Pandas functions and methods. It is recommended that you look into the documentation for further references.
Normalization
We often need to normalize the data, both the features and the target. The reason is that each column in the dataset may have a different scale. For example, the column `floor_area_sqm` ranges between 31 and 249 while `lease_commence_date` ranges between 1966 and 2019. See the statistics for these two columns below.
display(df['floor_area_sqm'].describe())
display(df['lease_commence_date'].describe())
count 95858.000000
mean 97.772234
std 24.238799
min 31.000000
25% 82.000000
50% 95.000000
75% 113.000000
max 249.000000
Name: floor_area_sqm, dtype: float64
count 95858.000000
mean 1994.553934
std 13.128913
min 1966.000000
25% 1984.000000
50% 1995.000000
75% 2004.000000
max 2019.000000
Name: lease_commence_date, dtype: float64
As we will see in subsequent weeks, we usually need to normalize the data before training our machine learning model. There are two common normalizations:
- z normalization
- min-max normalization
You will work on the functions to calculate these normalizations in your problem sets.
Z Normalization
This is also called standardization. In this transformation, we shift the mean of the data distribution to 0 and scale its standard deviation to 1. The equation is given as follows.
\[\text{normalized} = \frac{\text{data} - \mu}{\sigma}\]

Min-Max Normalization
In this transformation, we scale the data in such a way that the maximum value in the distribution is 1 and the minimum value is 0. We can scale it using the following equation.

\[\text{normalized} = \frac{\text{data} - \text{min}}{\text{max} - \text{min}}\]
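As a preview (the full implementations are left for your problem sets), here is a minimal sketch of both normalizations applied to one column, using only Pandas operations introduced above:

```python
col = df['floor_area_sqm']

# Z normalization: subtract the mean, divide by the standard deviation.
z_norm = (col - col.mean()) / col.std()

# Min-max normalization: shift to start at 0, divide by the range.
minmax_norm = (col - col.min()) / (col.max() - col.min())

print(z_norm.describe())
print(minmax_norm.describe())
```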
Splitting Dataset

One common pre-processing operation in machine learning is to split the data into:
- training dataset
- test dataset
The idea of splitting the dataset is simply that we should NOT use the same data to verify the model that we train. Let’s illustrate this using our HDB resale price dataset, which has 95858 entries. In our machine learning process, we would like to use this data to do the following:
- Train the model using the dataset
- Verify the accuracy of the model
If we only have one dataset, we cannot verify the model’s accuracy using the same data we used to train it. This bias would obviously inflate the accuracy. The analogy is a teacher giving exactly the same questions in the exam as in the practice session.
To overcome this, we should split the data into two. One set is used to train the model while the other one is used to verify the model. Coming back to the analogy of the teacher giving the exam, if the teacher has a bank of questions, he or she should separate the questions into two. Some questions can be used for practice before the exam, while the rest can be used as exam questions.
If we illustrate this using our HDB resale price dataset, this means that we have to split the table into two. Out of the 95858 entries, we take some as our training dataset and leave the rest for our test dataset. A common proportion is 70% or 80% for the training dataset, leaving the other 30% or 20% for the test dataset.
One important note is that the split must be done randomly. This avoids systematic bias in the split of the dataset. For example, the data may have been entered according to flat type, so flats with fewer rooms and smaller floor areas appear in the top rows while the larger flats appear near the end. If we do not split the data randomly, we may end up with no larger flats in our training set and train only on the smaller flats. The same can happen with any other column, such as the block or the town. A sketch of such a random split is shown below.
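A minimal sketch of a random 70/30 split using only Pandas; the fraction and the seed value are arbitrary choices for illustration:

```python
# Randomly sample 70% of the rows for training; random_state makes the split reproducible.
df_train = df.sample(frac=0.7, random_state=100)

# The remaining 30% of the rows form the test set.
df_test = df.drop(index=df_train.index)

print(df_train.shape, df_test.shape)
```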
There are times in machine learning when we need to experiment with different parameters and find the optimal ones. In these cases, the dataset is usually split into three:
- training dataset, which is used to build the model
- validation dataset, which is used to evaluate the model for various parameters and to choose the optimum parameter
- test dataset, which is used to evaluate the model built with the optimum parameter found previously
You will see some of these applications in the subsequent weeks.