Visualization

By the end of this lesson, you should be able to:

  • Create scatter plots and statistical plots such as box plots, histograms, and bar plots

Important words:

  • scatter plot
  • line plot
  • pair plot
  • bar plot
  • box plot
  • histogram

In this lesson, we will discuss common plots for visualizing data with Matplotlib and Seaborn. Seaborn is built on top of Matplotlib, so in most cases you will need to import both packages.

First, let’s import the necessary packages in this notebook.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In this notebook, we will continue to work with the HDB resale price dataset to illustrate some of the visualizations we can use. So let’s import the dataset.

file_url = 'https://www.dropbox.com/s/jz8ck0obu9u1rng/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv?raw=1'
df = pd.read_csv(file_url)
df
       month    town        flat_type  block  street_name        storey_range  floor_area_sqm  flat_model      lease_commence_date  remaining_lease     resale_price
0      2017-01  ANG MO KIO  2 ROOM     406    ANG MO KIO AVE 10  10 TO 12      44.0            Improved        1979                 61 years 04 months  232000.0
1      2017-01  ANG MO KIO  3 ROOM     108    ANG MO KIO AVE 4   01 TO 03      67.0            New Generation  1978                 60 years 07 months  250000.0
2      2017-01  ANG MO KIO  3 ROOM     602    ANG MO KIO AVE 5   01 TO 03      67.0            New Generation  1980                 62 years 05 months  262000.0
3      2017-01  ANG MO KIO  3 ROOM     465    ANG MO KIO AVE 10  04 TO 06      68.0            New Generation  1980                 62 years 01 month   265000.0
4      2017-01  ANG MO KIO  3 ROOM     601    ANG MO KIO AVE 5   01 TO 03      67.0            New Generation  1980                 62 years 05 months  265000.0
...    ...      ...         ...        ...    ...                ...           ...             ...             ...                  ...                 ...
95853  2021-04  YISHUN      EXECUTIVE  326    YISHUN RING RD     10 TO 12      146.0           Maisonette      1988                 66 years 04 months  650000.0
95854  2021-04  YISHUN      EXECUTIVE  360    YISHUN RING RD     04 TO 06      146.0           Maisonette      1988                 66 years 04 months  645000.0
95855  2021-04  YISHUN      EXECUTIVE  326    YISHUN RING RD     10 TO 12      146.0           Maisonette      1988                 66 years 04 months  585000.0
95856  2021-04  YISHUN      EXECUTIVE  355    YISHUN RING RD     10 TO 12      146.0           Maisonette      1988                 66 years 08 months  675000.0
95857  2021-04  YISHUN      EXECUTIVE  277    YISHUN ST 22       04 TO 06      146.0           Maisonette      1985                 63 years 05 months  625000.0

95858 rows × 11 columns

Categories of Plots

Seaborn groups its plots into different categories, as shown in the Seaborn documentation.

We can use either scatterplot or lineplot to see the relationship between two or more variables. To see the distribution of data, we have a few options; the most common one is a histogram. The last category is the categorical plot. For example, we can use a box plot to compare the statistics of different categories. We will illustrate these in the following sections.

Histogram and Boxplot

One of the first things we may want to do to understand the data is to look at its distribution and its descriptive statistics. To do this, we can use histplot to show the histogram of the data and boxplot to show its five-number summary.

Let’s look at the resale prices in the area around Tampines. First, let’s check which towns are listed in this data set.

np.unique(df['town'])
array(['ANG MO KIO', 'BEDOK', 'BISHAN', 'BUKIT BATOK', 'BUKIT MERAH',
       'BUKIT PANJANG', 'BUKIT TIMAH', 'CENTRAL AREA', 'CHOA CHU KANG',
       'CLEMENTI', 'GEYLANG', 'HOUGANG', 'JURONG EAST', 'JURONG WEST',
       'KALLANG/WHAMPOA', 'MARINE PARADE', 'PASIR RIS', 'PUNGGOL',
       'QUEENSTOWN', 'SEMBAWANG', 'SENGKANG', 'SERANGOON', 'TAMPINES',
       'TOA PAYOH', 'WOODLANDS', 'YISHUN'], dtype=object)
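
The same check can also be done with pandas alone. The sketch below uses a tiny made-up frame (not the real dataset) to show unique() for the distinct names and value_counts() for the number of rows per town.

```python
import pandas as pd

# Toy frame invented for illustration; the real dataset lists 26 towns
toy = pd.DataFrame({"town": ["TAMPINES", "BEDOK", "TAMPINES",
                             "YISHUN", "BEDOK", "TAMPINES"]})

towns = sorted(toy["town"].unique())   # distinct town names, sorted like np.unique
counts = toy["town"].value_counts()    # number of rows (transactions) per town

print(towns)
print(counts)
```

value_counts() is handy here because it also tells us how many transactions each town has, which helps when choosing a town with enough data to plot.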

Now, let’s get the data for resale in Tampines only.

df_tampines = df.loc[df['town'] == 'TAMPINES',:]
df_tampines
       month    town      flat_type         block  street_name     storey_range  floor_area_sqm  flat_model         lease_commence_date  remaining_lease     resale_price
917    2017-01  TAMPINES  2 ROOM            299A   TAMPINES ST 22  01 TO 03      45.0            Model A            2012                 94 years 02 months  250000.0
918    2017-01  TAMPINES  3 ROOM            403    TAMPINES ST 41  01 TO 03      60.0            Improved           1985                 67 years 09 months  270000.0
919    2017-01  TAMPINES  3 ROOM            802    TAMPINES AVE 4  04 TO 06      68.0            New Generation     1984                 66 years 05 months  295000.0
920    2017-01  TAMPINES  3 ROOM            410    TAMPINES ST 41  01 TO 03      69.0            Improved           1985                 67 years 08 months  300000.0
921    2017-01  TAMPINES  3 ROOM            462    TAMPINES ST 44  07 TO 09      64.0            Simplified         1987                 69 years 06 months  305000.0
...    ...      ...       ...               ...    ...             ...           ...             ...                ...                  ...                 ...
95671  2021-04  TAMPINES  EXECUTIVE         495E   TAMPINES ST 43  04 TO 06      147.0           Apartment          1994                 71 years 10 months  630000.0
95672  2021-04  TAMPINES  EXECUTIVE         477    TAMPINES ST 43  04 TO 06      153.0           Apartment          1993                 71 years 04 months  780000.0
95673  2021-04  TAMPINES  EXECUTIVE         497J   TAMPINES ST 45  10 TO 12      139.0           Premium Apartment  1996                 74 years 03 months  695000.0
95674  2021-04  TAMPINES  EXECUTIVE         857    TAMPINES ST 83  01 TO 03      154.0           Maisonette         1988                 66 years            735000.0
95675  2021-04  TAMPINES  MULTI-GENERATION  454    TAMPINES ST 42  01 TO 03      132.0           Multi Generation   1987                 65 years 04 months  600000.0

6392 rows × 11 columns
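
To make the filtering step more explicit, here is a minimal sketch on a made-up frame. It builds the boolean mask separately and takes an explicit .copy(), which is one common way to keep later column assignments on the filtered data from triggering pandas’ SettingWithCopyWarning. The names toy, is_tampines, and toy_tampines are hypothetical.

```python
import pandas as pd

# Toy frame invented for illustration; column names follow the dataset above
toy = pd.DataFrame({
    "town": ["TAMPINES", "BEDOK", "TAMPINES", "YISHUN"],
    "resale_price": [250000.0, 310000.0, 455000.0, 625000.0],
})

is_tampines = toy["town"] == "TAMPINES"          # boolean mask, one value per row
toy_tampines = toy.loc[is_tampines, :].copy()    # explicit copy, independent of toy

# Adding a column to the copy leaves the original frame untouched
toy_tampines["resale_price_1000"] = toy_tampines["resale_price"] / 1000
print(toy_tampines)
```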

Now, we can plot its resale price distribution using histplot.

See documentation for histplot

sns.histplot(x='resale_price', data=df_tampines)
<AxesSubplot:xlabel='resale_price', ylabel='Count'>

In the above plot, we use df_tampines as our data source and the resale_price column as our x-axis. We can instead put resale_price on the y-axis, which draws the bars horizontally.

sns.set()
sns.histplot(y='resale_price', data=df_tampines)
<AxesSubplot:xlabel='Count', ylabel='resale_price'>

Notice that the background changes. This is because we have called sns.set(), which applies Seaborn’s default settings instead of Matplotlib’s. For example, Matplotlib uses a white background with no grid, whereas Seaborn by default displays a white grid on a gray background.

By default, the bins argument is 'auto' and Seaborn works out how many bins to use. We can also specify the number of bins manually.

sns.histplot(y='resale_price', data=df_tampines, bins=10)
<AxesSubplot:xlabel='Count', ylabel='resale_price'>

We can see that the most common resale price for HDB flats in Tampines is about $400k to $500k.
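
To see roughly what an integer bins value does numerically, the sketch below applies NumPy’s equal-width binning to a handful of made-up prices. It is an illustration of the equal-width rule only, under the assumption that an integer bins splits the range from the smallest to the largest value evenly; the prices are invented.

```python
import numpy as np

# Made-up prices in dollars, for illustration only
prices = np.array([238000, 300000, 390000, 410000, 455000,
                   470000, 550000, 610000, 780000, 990000])

# bins=10 splits the range [min, max] into 10 equal-width intervals
counts, edges = np.histogram(prices, bins=10)

bin_width = edges[1] - edges[0]   # (max - min) / 10
print(bin_width)
print(edges)    # 11 edges delimit the 10 bins
print(counts)   # number of prices falling in each bin
```

Fewer bins give a smoother but coarser picture, while more bins show more detail at the cost of noisier counts.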

We can also use the boxplot to see some descriptive statistics of the data.

See documentation on boxplot

sns.boxplot(x='resale_price', data=df_tampines)
<AxesSubplot:xlabel='resale_price'>

See Understanding Boxplot for more detail; the figure on that website summarizes the different information given in a box plot.

The box marks the 25th percentile (Q1) and the 75th percentile (Q3). The line inside the box is the median of the data, which here lies between $400k and $500k. The difference between Q3 and Q1 is called the Interquartile Range (IQR), and it is needed to define outliers. The “minimum” and “maximum” whiskers are not necessarily the minimum and maximum values in the data; they are capped at

\(min = Q1 - 1.5\times IQR\)

\(max = Q3 + 1.5\times IQR\)

Any point below this “minimum” or above this “maximum” is considered an outlier in the box plot. In the figure above, for example, there are quite a number of outliers at the high end of the resale price.
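
The outlier rule can be worked through by hand. The sketch below computes Q1, Q3, the IQR, and the two limits for a small set of made-up prices (in $1000), then picks out the values that fall outside the limits. The variable names lower_fence and upper_fence are hypothetical labels for the “minimum” and “maximum” above.

```python
import numpy as np

# Hypothetical resale prices in $1000 (made-up values, not from the dataset)
prices = np.array([250, 300, 390, 420, 455, 480, 550, 600, 990])

q1, median, q3 = np.percentile(prices, [25, 50, 75])   # 390, 455, 550
iqr = q3 - q1                                          # 550 - 390 = 160

lower_fence = q1 - 1.5 * iqr   # 390 - 240 = 150
upper_fence = q3 + 1.5 * iqr   # 550 + 240 = 790

# Values outside the fences are the points a box plot draws as outliers
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(outliers)
```

Here only 990 lies above the upper limit of 790, so it is the single outlier; nothing falls below the lower limit of 150.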

Modifying Labels and Titles

Since Seaborn is built on top of Matplotlib, we can use Matplotlib functions to change the figure’s labels and title. For example, we can change the histogram’s x and y labels and its title using plt.xlabel(), plt.ylabel(), and plt.title(). You can also access these Matplotlib functions through the object returned by your Seaborn plot, so first store that output.

myplot = sns.histplot(y='resale_price', data=df_tampines, bins=10)

Once you have this handle, which is a Matplotlib Axes object, you can call the corresponding method by adding set_ in front of the function name. For example, if the Matplotlib function is plt.xlabel(), you call it as myplot.set_xlabel(). See the code below.

myplot = sns.histplot(y='resale_price', data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)
Text(0.5, 1.0, 'HDB Resale Price in Tampines')

Notice that the plot now has a title and that both the x and y labels have changed.

The above plot will be much easier to read if we plot the price in thousands of dollars. So let’s create a new column for the resale price in $1000.

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_tampines = df_tampines.assign(resale_price_1000=df_tampines['resale_price'] / 1000)
df_tampines['resale_price_1000'].describe()
count    6392.000000
mean      479.670371
std       125.569977
min       238.000000
25%       390.000000
50%       455.000000
75%       550.000000
max       990.000000
Name: resale_price_1000, dtype: float64

Now, let’s plot it one more time.

myplot = sns.histplot(y='resale_price_1000', data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price in $1000', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)
Text(0.5, 1.0, 'HDB Resale Price in Tampines')

Using Hue

Seaborn makes it easy to plot the same data and colour it according to another column. For example, we can see the distribution of the resale price by flat type (number of rooms) or by storey range. Seaborn’s plotting functions take an argument called hue to specify which column to use for the colouring.

myplot = sns.histplot(y='resale_price_1000', hue='flat_type', data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price in $1000', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)
Text(0.5, 1.0, 'HDB Resale Price in Tampines')

From the distribution, we can see that 4-room flats account for roughly the largest share of sales in Tampines. We can also see that the resale price of 4-room flats is around the median of all the resale flats in this area.
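
A claim like “which flat type sells the most” can also be checked directly with a count per category and a bar plot. The sketch below does this on a made-up list of flat types, using value_counts() for the numbers and sns.countplot for the bars; the data and the y-axis label are invented for illustration.

```python
import pandas as pd
import seaborn as sns

# Toy flat types invented for illustration
toy = pd.DataFrame({"flat_type": ["3 ROOM", "4 ROOM", "4 ROOM", "5 ROOM",
                                  "4 ROOM", "EXECUTIVE", "3 ROOM"]})

counts = toy["flat_type"].value_counts()   # rows per flat type, largest first
print(counts)

# Bar plot of the same counts: one bar per category
ax = sns.countplot(x="flat_type", data=toy)
ax.set_ylabel("Number of sales")
```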

myplot = sns.histplot(y='resale_price_1000', hue='storey_range', data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price in $1000', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)
Text(0.5, 1.0, 'HDB Resale Price in Tampines')

The colouring above is hard to read because the histograms are drawn on top of one another. One way to fix this is to stack them. We can do this by setting the multiple argument, which controls what happens when several categories fall in the same bin.

myplot = sns.histplot(y='resale_price_1000', hue='storey_range', 
                      multiple='stack',
                      data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price in $1000', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)
Text(0.5, 1.0, 'HDB Resale Price in Tampines')
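
The numbers behind a stacked histogram are counts per price bin and per category. The sketch below reproduces such a table with pandas on made-up data, using pd.cut for equal-width bins and pd.crosstab for the counts. It illustrates the idea only and does not claim to match Seaborn’s internal computation.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "resale_price_1000": [250, 300, 390, 420, 455, 480, 550, 600, 990],
    "storey_range": ["01 TO 03", "01 TO 03", "04 TO 06", "01 TO 03", "04 TO 06",
                     "07 TO 09", "04 TO 06", "07 TO 09", "07 TO 09"],
})

# Cut the price range into 3 equal-width bins, then count rows per (bin, category)
price_bin = pd.cut(toy["resale_price_1000"], bins=3)
table = pd.crosstab(price_bin, toy["storey_range"])
print(table)

# Row totals correspond to the full bar lengths in a stacked histogram
print(table.sum(axis=1))
```

Each row of the table is one bar, and each column is one coloured segment of that bar.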

Scatter Plot and Line Plot

We use scatter plots and line plots to visualize the relationship between two or more variables. For example, we can plot the floor area against the resale price to see if there is any relationship.

myplot = sns.scatterplot(x='floor_area_sqm', y='resale_price_1000', data=df_tampines)
myplot.set_xlabel('Floor Area ($m^2$)')
myplot.set_ylabel('Resale Price in $1000')
Text(0, 0.5, 'Resale Price in $1000')

As we can see from the plot above, the price tends to increase as the floor area increases. You can again use the hue argument to distinguish categories in the plot.

myplot = sns.scatterplot(x='floor_area_sqm', y='resale_price_1000', 
                         hue='flat_type',
                         data=df_tampines)
myplot.set_xlabel('Floor Area ($m^2$)')
myplot.set_ylabel('Resale Price in $1000')
Text(0, 0.5, 'Resale Price in $1000')

We can see that the flat type is also related to the floor area.

Pair Plot

Another useful plot in Seaborn is the pair plot, which plots the relationships between multiple columns of the data at once.

myplot = sns.pairplot(data=df_tampines)

The above call plots several scatter plots and histograms in matrix form, using only the numeric columns. The diagonal shows the histogram of each column. Each of the remaining cells shows the scatter plot of two columns in the data frame. From these, we can quickly see the relationships between the different columns in the data frame.
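
A numeric companion to the pair plot is the correlation matrix, which gives one number per pair of columns. The sketch below computes it on a made-up frame after keeping only the numeric columns. The values are invented for illustration, and Pearson correlation (the pandas default) only captures linear relationships.

```python
import pandas as pd

# Toy data invented for illustration; column names follow the dataset above
toy = pd.DataFrame({
    "floor_area_sqm": [45, 60, 68, 90, 104, 121, 130, 146],
    "lease_commence_date": [2012, 1985, 1984, 1990, 1994, 1993, 1996, 1988],
    "resale_price_1000": [250, 270, 300, 420, 455, 520, 600, 650],
    "town": ["TAMPINES"] * 8,
})

# Keep only numeric columns and compute pairwise Pearson correlations
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```

Values close to 1 or -1 point to a strong linear relationship worth inspecting in the corresponding scatter plot, while values near 0 suggest little linear relationship.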