Visualization

By the end of this lesson, you should be able to:

Create scatter plots and statistical plots such as box plots, histograms, and bar plots

Important words:

scatter plot
line plot
pair plot
bar plot
box plot
histogram

In this lesson, we will discuss common plots for visualizing data using Matplotlib and Seaborn. Seaborn works on top of Matplotlib, and in most cases you will need to import both packages.

First, let’s import the necessary packages in this notebook.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In this notebook, we will continue to work with the HDB resale price dataset to illustrate some of the visualizations we can use. So let’s import the dataset.

file_url = 'https://www.dropbox.com/s/jz8ck0obu9u1rng/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv?raw=1'
df = pd.read_csv(file_url)
df

	month	town	flat_type	block	street_name	storey_range	floor_area_sqm	flat_model	lease_commence_date	remaining_lease	resale_price
0	2017-01	ANG MO KIO	2 ROOM	406	ANG MO KIO AVE 10	10 TO 12	44.0	Improved	1979	61 years 04 months	232000.0
1	2017-01	ANG MO KIO	3 ROOM	108	ANG MO KIO AVE 4	01 TO 03	67.0	New Generation	1978	60 years 07 months	250000.0
2	2017-01	ANG MO KIO	3 ROOM	602	ANG MO KIO AVE 5	01 TO 03	67.0	New Generation	1980	62 years 05 months	262000.0
3	2017-01	ANG MO KIO	3 ROOM	465	ANG MO KIO AVE 10	04 TO 06	68.0	New Generation	1980	62 years 01 month	265000.0
4	2017-01	ANG MO KIO	3 ROOM	601	ANG MO KIO AVE 5	01 TO 03	67.0	New Generation	1980	62 years 05 months	265000.0
...	...	...	...	...	...	...	...	...	...	...	...
95853	2021-04	YISHUN	EXECUTIVE	326	YISHUN RING RD	10 TO 12	146.0	Maisonette	1988	66 years 04 months	650000.0
95854	2021-04	YISHUN	EXECUTIVE	360	YISHUN RING RD	04 TO 06	146.0	Maisonette	1988	66 years 04 months	645000.0
95855	2021-04	YISHUN	EXECUTIVE	326	YISHUN RING RD	10 TO 12	146.0	Maisonette	1988	66 years 04 months	585000.0
95856	2021-04	YISHUN	EXECUTIVE	355	YISHUN RING RD	10 TO 12	146.0	Maisonette	1988	66 years 08 months	675000.0
95857	2021-04	YISHUN	EXECUTIVE	277	YISHUN ST 22	04 TO 06	146.0	Maisonette	1985	63 years 05 months	625000.0

95858 rows × 11 columns

Categories of Plots

Seaborn groups its plots into different categories, as shown in the Seaborn documentation.

We can use either scatterplot or lineplot if we want to see the relationship between two or more variables. On the other hand, we have a few options for seeing the distribution of data; the most common one is a histogram. The last category is the categorical plot. We can use a box plot, for example, to compare the statistics of different categories. We will illustrate these in the following sections.

Histogram and Boxplot

One of the first things we may want to do to understand the data is to look at its distribution and its descriptive statistics. To do this, we can use histplot to show the histogram of the data and boxplot to show the five-number summary of the data.

Let’s look at the resale price in the area around Tampines. First, let’s check which towns are listed in this data set.

np.unique(df['town'])

array(['ANG MO KIO', 'BEDOK', 'BISHAN', 'BUKIT BATOK', 'BUKIT MERAH',
       'BUKIT PANJANG', 'BUKIT TIMAH', 'CENTRAL AREA', 'CHOA CHU KANG',
       'CLEMENTI', 'GEYLANG', 'HOUGANG', 'JURONG EAST', 'JURONG WEST',
       'KALLANG/WHAMPOA', 'MARINE PARADE', 'PASIR RIS', 'PUNGGOL',
       'QUEENSTOWN', 'SEMBAWANG', 'SENGKANG', 'SERANGOON', 'TAMPINES',
       'TOA PAYOH', 'WOODLANDS', 'YISHUN'], dtype=object)

Now, let’s get the data for resale in Tampines only.

df_tampines = df.loc[df['town'] == 'TAMPINES',:]
df_tampines

	month	town	flat_type	block	street_name	storey_range	floor_area_sqm	flat_model	lease_commence_date	remaining_lease	resale_price
917	2017-01	TAMPINES	2 ROOM	299A	TAMPINES ST 22	01 TO 03	45.0	Model A	2012	94 years 02 months	250000.0
918	2017-01	TAMPINES	3 ROOM	403	TAMPINES ST 41	01 TO 03	60.0	Improved	1985	67 years 09 months	270000.0
919	2017-01	TAMPINES	3 ROOM	802	TAMPINES AVE 4	04 TO 06	68.0	New Generation	1984	66 years 05 months	295000.0
920	2017-01	TAMPINES	3 ROOM	410	TAMPINES ST 41	01 TO 03	69.0	Improved	1985	67 years 08 months	300000.0
921	2017-01	TAMPINES	3 ROOM	462	TAMPINES ST 44	07 TO 09	64.0	Simplified	1987	69 years 06 months	305000.0
...	...	...	...	...	...	...	...	...	...	...	...
95671	2021-04	TAMPINES	EXECUTIVE	495E	TAMPINES ST 43	04 TO 06	147.0	Apartment	1994	71 years 10 months	630000.0
95672	2021-04	TAMPINES	EXECUTIVE	477	TAMPINES ST 43	04 TO 06	153.0	Apartment	1993	71 years 04 months	780000.0
95673	2021-04	TAMPINES	EXECUTIVE	497J	TAMPINES ST 45	10 TO 12	139.0	Premium Apartment	1996	74 years 03 months	695000.0
95674	2021-04	TAMPINES	EXECUTIVE	857	TAMPINES ST 83	01 TO 03	154.0	Maisonette	1988	66 years	735000.0
95675	2021-04	TAMPINES	MULTI-GENERATION	454	TAMPINES ST 42	01 TO 03	132.0	Multi Generation	1987	65 years 04 months	600000.0

6392 rows × 11 columns

Now, we can plot its resale price distribution using histplot.

See documentation for histplot

sns.histplot(x='resale_price', data=df_tampines)

<AxesSubplot:xlabel='resale_price', ylabel='Count'>

[Figure: histogram of resale_price for Tampines]

In the above plot, we use df_tampines as our data source and the resale_price column as our x-axis. We can also put the price on the vertical axis by passing the column as y instead of x.

sns.set()
sns.histplot(y='resale_price', data=df_tampines)

<AxesSubplot:xlabel='Count', ylabel='resale_price'>

[Figure: horizontal histogram of resale_price for Tampines, drawn with the Seaborn default theme]

Notice that the background changes. This is because we called sns.set(), which applies Seaborn’s default settings instead of Matplotlib’s. For example, Matplotlib uses a white background with no grid, whereas Seaborn by default displays a white grid on a gray background.

By default, the bins argument is auto and Seaborn will try to calculate how many bins should be used. But we can specify this manually.

sns.histplot(y='resale_price', data=df_tampines, bins=10)

<AxesSubplot:xlabel='Count', ylabel='resale_price'>

[Figure: horizontal histogram of resale_price for Tampines, 10 bins]

We can see that resale flats in Tampines most commonly sell for about $400k to $500k.
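To see what the bins argument does to the bin edges, the sketch below uses NumPy directly. According to the Seaborn documentation, histplot passes bins to numpy.histogram_bin_edges. The prices are made-up values in $1000, not the HDB data, and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical prices in $1000, made up for illustration
prices = np.array([238, 300, 390, 420, 455, 480, 550, 620, 700, 990])

# Ten equal-width bins between the smallest and the largest value
edges_10 = np.histogram_bin_edges(prices, bins=10)

# Let NumPy choose the number of bins from the data
edges_auto = np.histogram_bin_edges(prices, bins='auto')

print(len(edges_10) - 1)          # number of bins requested
print(edges_10[1] - edges_10[0])  # width of each bin
print(len(edges_auto) - 1)        # number of bins chosen automatically
```

With bins=10, the range from the smallest to the largest price is simply divided into ten equal parts, so a wider price range gives wider bins.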

We can also use the boxplot to see some descriptive statistics of the data.

See documentation on boxplot

sns.boxplot(x='resale_price', data=df_tampines)

<AxesSubplot:xlabel='resale_price'>

[Figure: box plot of resale_price for Tampines]

See Understanding Boxplot for more detail; the figure on that website summarizes the information given in a box plot.

The box marks the 25th percentile and the 75th percentile. The line inside the box marks the median of the data. As we can see, the median lies between about $400k and $500k. The difference between the 75th percentile (Q3) and the 25th percentile (Q1) is called the Interquartile Range (IQR). We need this definition to understand what counts as an outlier. The “minimum” and the “maximum” shown by the whiskers are not the minimum and the maximum values in the data; instead, they are capped at

$min = Q1 - 1.5 \times IQR$

$max = Q3 + 1.5 \times IQR$

Anything below the “minimum” or above the “maximum” is considered an outlier in the box plot. In the figure above, for example, there are quite a number of outliers at the high end of the resale price.
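The two limits can be checked by hand. The sketch below computes Q1, Q3, the IQR, and both limits with pandas for a small made-up list of prices in $1000 (not the HDB data), then picks out the values that a box plot would draw as outliers. The variable names are illustrative.

```python
import pandas as pd

# Hypothetical prices in $1000, made up for illustration (not from the HDB dataset)
prices = pd.Series([300, 350, 400, 450, 500, 550, 600, 1000])

q1 = prices.quantile(0.25)   # 25th percentile
q3 = prices.quantile(0.75)   # 75th percentile
iqr = q3 - q1                # interquartile range

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the two limits are the ones drawn as outliers
outliers = prices[(prices < lower_limit) | (prices > upper_limit)]

print(q1, q3, iqr)               # 387.5 562.5 175.0
print(lower_limit, upper_limit)  # 125.0 825.0
print(outliers.tolist())         # [1000]
```

Here only the value 1000 lies above the upper limit of 825, so it is the single outlier, and it sits on the high end just like the outliers in the Tampines box plot.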

Modifying Labels and Titles

Since Seaborn is built on top of Matplotlib, we can use Matplotlib functions to change the figure’s labels and title. For example, we can change the histogram’s x and y labels and its title using plt.xlabel(), plt.ylabel(), and plt.title(). You can also reach the equivalent methods by first storing the object that the Seaborn plotting function returns.

myplot = sns.histplot(y='resale_price', data=df_tampines, bins=10)

[Figure: horizontal histogram of resale_price for Tampines, 10 bins]

Once you have the handle, you can call the equivalent of a Matplotlib function by adding set_ in front of its name. For example, the equivalent of plt.xlabel() is myplot.set_xlabel(). See the code below.

myplot = sns.histplot(y='resale_price', data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)

Text(0.5, 1.0, 'HDB Resale Price in Tampines')

[Figure: histogram titled “HDB Resale Price in Tampines”, x-axis “Count”, y-axis “Resale Price”]

Notice that the plot now has a title and that both the x and y labels have changed.

The above plot will be much easier to read if we plot the price in thousands of dollars. So let’s create a new column with the resale price in $1000.

# Work on an explicit copy so that adding a column does not raise a SettingWithCopyWarning
df_tampines = df_tampines.copy()
df_tampines['resale_price_1000'] = df_tampines['resale_price'] / 1000
df_tampines['resale_price_1000'].describe()

count    6392.000000
mean      479.670371
std       125.569977
min       238.000000
25%       390.000000
50%       455.000000
75%       550.000000
max       990.000000
Name: resale_price_1000, dtype: float64

Now, let’s plot it one more time.

myplot = sns.histplot(y='resale_price_1000', data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price in $1000', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)

Text(0.5, 1.0, 'HDB Resale Price in Tampines')

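[Figure: histogram titled “HDB Resale Price in Tampines”, x-axis “Count”, y-axis “Resale Price in $1000”]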

Using Hue

Seaborn makes it easy to plot the same data and colour it according to another column. For example, we can see the distribution of the resale price by flat type (number of rooms) or by storey range. Seaborn has an argument called hue to specify which data column to use for the colouring.

myplot = sns.histplot(y='resale_price_1000', hue='flat_type', data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price in $1000', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)

Text(0.5, 1.0, 'HDB Resale Price in Tampines')

[Figure: histogram of resale price in $1000 for Tampines, coloured by flat_type]

So we can see from the distribution that 4-room flats account for the largest share of resale transactions in Tampines. We can also see that the resale price of 4-room flats is around the median of all the resale flats in this area.
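A reading like this can be cross-checked with numbers. The sketch below counts the transactions and takes the median price for each flat type using a pandas groupby. It runs on a few made-up rows (df_demo is a hypothetical name), so the printed values say nothing about the real Tampines data.

```python
import pandas as pd

# Hypothetical rows, made up for illustration (not the real Tampines data)
df_demo = pd.DataFrame({
    'flat_type': ['3 ROOM', '4 ROOM', '4 ROOM', '4 ROOM', '5 ROOM', '5 ROOM'],
    'resale_price_1000': [300.0, 430.0, 450.0, 470.0, 560.0, 600.0],
})

# Number of transactions and median price for each flat type
summary = df_demo.groupby('flat_type')['resale_price_1000'].agg(['count', 'median'])
print(summary)

# Median over all transactions, for comparison
overall_median = df_demo['resale_price_1000'].median()
print(overall_median)
```

Applying the same two steps to df_tampines would give the counts and medians behind the coloured histogram.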

myplot = sns.histplot(y='resale_price_1000', hue='storey_range', data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price in $1000', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)

Text(0.5, 1.0, 'HDB Resale Price in Tampines')

[Figure: histogram of resale price in $1000 for Tampines, coloured by storey_range]

The colours above are hard to tell apart because the histograms are drawn on top of one another. One way to fix this is to stack them. We can do this by setting the multiple argument, which controls how Seaborn draws several categories that fall in the same bin.

myplot = sns.histplot(y='resale_price_1000', hue='storey_range', 
                      multiple='stack',
                      data=df_tampines, bins=10)
myplot.set_xlabel('Count', fontsize=16)
myplot.set_ylabel('Resale Price in $1000', fontsize=16)
myplot.set_title('HDB Resale Price in Tampines', fontsize=16)

Text(0.5, 1.0, 'HDB Resale Price in Tampines')

[Figure: stacked histogram of resale price in $1000 for Tampines, coloured by storey_range]

Scatter Plot and Line Plot

We use scatter plots and line plots to visualize the relationship between two or more variables. For example, we can plot the resale price against the floor area to see if there is any relationship.

myplot = sns.scatterplot(x='floor_area_sqm', y='resale_price_1000', data=df_tampines)
myplot.set_xlabel('Floor Area ($m^2$)')
myplot.set_ylabel('Resale Price in $1000')

Text(0, 0.5, 'Resale Price in $1000')

[Figure: scatter plot of resale price in $1000 against floor area for Tampines]

As we can see from the plot above, the price tends to increase with the floor area. You can again use the hue argument to distinguish categories in the plot.

myplot = sns.scatterplot(x='floor_area_sqm', y='resale_price_1000', 
                         hue='flat_type',
                         data=df_tampines)
myplot.set_xlabel('Floor Area ($m^2$)')
myplot.set_ylabel('Resale Price in $1000')

Text(0, 0.5, 'Resale Price in $1000')

[Figure: scatter plot of resale price in $1000 against floor area for Tampines, coloured by flat_type]

We can see that the flat type is also related to the floor area.

Pair Plot

One useful plot in Seaborn is the pair plot, which plots the relationships between multiple data columns.

myplot = sns.pairplot(data=df_tampines)

[Figure: pair plot of the numeric columns of df_tampines]

The call above immediately draws a matrix of scatter plots and histograms. Each cell on the diagonal shows the histogram of one column. Every other cell shows the scatter plot of two columns in the data frame. From these, we can quickly see the relationships between different columns in the data frame.
