Blog Post 0 - Visualization for the Penguins Dataset

In this blog post, we will create a visualization of the Palmer Penguins data set. The tutorial below walks through each step, from loading the data to drawing the final plot.

1. Imports

First, we load the packages we may use in this post.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

2. Pre-Processing & Data Cleaning

Next, we load the penguin data from palmer_penguins.csv, which contains measurements of penguins collected across several studies. The data set has 344 rows (one per penguin) and 17 columns, some of which contain NaN values.

url = 'https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv'
penguins = pd.read_csv(url)
# Check the shape and the first 5 rows of our data
print(penguins.shape)
penguins.head()

(344, 17)

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/07	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/07	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	11/16/07	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	11/16/07	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	11/16/07	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN

In this post, we are going to create a visualization using the columns “Species”, “Culmen Length (mm)”, and “Culmen Depth (mm)”. Thus, we select only these three columns from the penguins data set and drop any rows with missing values.

# Select only these three columns from the penguins data set
df = penguins[["Species", "Culmen Length (mm)", "Culmen Depth (mm)"]]

# Drop the rows that contain NA values
df = df.dropna()
# Preview the new dataframe
df.head()

	Species	Culmen Length (mm)	Culmen Depth (mm)
0	Adelie Penguin (Pygoscelis adeliae)	39.1	18.7
1	Adelie Penguin (Pygoscelis adeliae)	39.5	17.4
2	Adelie Penguin (Pygoscelis adeliae)	40.3	18.0
4	Adelie Penguin (Pygoscelis adeliae)	36.7	19.3
5	Adelie Penguin (Pygoscelis adeliae)	39.3	20.6
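As a quick sanity check, it can help to count the missing values before dropping them. The sketch below rebuilds the first five rows of the three selected columns from the output shown earlier (the names `sample`, `missing_per_column`, `cleaned`, and `rows_dropped` are our own and are not part of the original code) and shows what `dropna()` does to them.

```python
import numpy as np
import pandas as pd

# First five rows of the three selected columns, copied from the
# penguins.head() output above (row 3 has no measurements).
sample = pd.DataFrame({
    "Species": ["Adelie Penguin (Pygoscelis adeliae)"] * 5,
    "Culmen Length (mm)": [39.1, 39.5, 40.3, np.nan, 36.7],
    "Culmen Depth (mm)": [18.7, 17.4, 18.0, np.nan, 19.3],
})

# Count the missing values in each column before dropping anything.
missing_per_column = sample.isna().sum()

# dropna() removes every row with at least one NaN and keeps
# the original index labels of the remaining rows.
cleaned = sample.dropna()
rows_dropped = len(sample) - len(cleaned)

print(missing_per_column)
print(f"Rows dropped: {rows_dropped}")
print(cleaned.index.tolist())
```

Because `dropna()` keeps the original index labels, the cleaned dataframe skips label 3, which is why `df.head()` above jumps from row 2 to row 4. Running the same check on the full `df` should be consistent with the count of 342 reported by `df.describe()`.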

# Examine the dataframe
df.describe()

	Culmen Length (mm)	Culmen Depth (mm)
count	342.000000	342.000000
mean	43.921930	17.151170
std	5.459584	1.974793
min	32.100000	13.100000
25%	39.225000	15.600000
50%	44.450000	17.300000
75%	48.500000	18.700000
max	59.600000	21.500000

# Examine the counts of species
df["Species"].value_counts()

Adelie Penguin (Pygoscelis adeliae)          151
Gentoo penguin (Pygoscelis papua)            123
Chinstrap penguin (Pygoscelis antarctica)     68
Name: Species, dtype: int64
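The full species labels are long and would make for a wide plot legend. One optional step, which the rest of this post does not rely on, is to keep only the first word of each label. The sketch below uses the three labels from the counts above; the names `species` and `short_species` are our own.

```python
import pandas as pd

# The three species labels exactly as they appear in the value counts above.
species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
    "Chinstrap penguin (Pygoscelis antarctica)",
], name="Species")

# Split each label on whitespace and keep only the first word.
short_species = species.str.split().str.get(0)

print(short_species.tolist())
```

To apply the same idea to the cleaned dataframe, one could write `df = df.assign(Species=df["Species"].str.split().str.get(0))` before plotting.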

3. Create a Scatterplot

For our visualization, we make a scatterplot with culmen length on the x-axis and culmen depth on the y-axis, with points color-coded by species. We use seaborn's relplot function to draw this plot.

fgrid = sns.relplot(x = "Culmen Length (mm)",
                    y = "Culmen Depth (mm)",
                    hue = "Species",
                    data = df,
                    palette = ["b", "r", "g"])
fgrid.set(title = "Figure 1: Culmen Length vs Culmen Depth by Species")

Analysis: This scatterplot shows the relationship between culmen length and culmen depth for each species. Within each species, culmen length and culmen depth appear to be positively correlated. These two features may be useful for classifying penguins by species because the points form three fairly distinct clusters. Gentoo penguins tend to have the shallowest culmens, Adelie penguins tend to have the shortest culmens, and Chinstrap penguins tend to have among the longest and deepest culmens.
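The within-species correlation described in the analysis can also be checked numerically rather than by eye. Below is a minimal sketch, assuming a dataframe with the same three column names as `df`. The helper `within_species_corr` and the `toy` data are hypothetical: the toy values are made up for illustration and are not taken from the penguins data set.

```python
import pandas as pd

def within_species_corr(data,
                        x="Culmen Length (mm)",
                        y="Culmen Depth (mm)",
                        by="Species"):
    """Pearson correlation between x and y, computed separately per group."""
    return pd.Series({name: group[x].corr(group[y])
                      for name, group in data.groupby(by)})

# Made-up toy data with two hypothetical groups "A" and "B".
toy = pd.DataFrame({
    "Species": ["A", "A", "A", "B", "B", "B"],
    "Culmen Length (mm)": [38.0, 40.0, 42.0, 45.0, 47.0, 49.0],
    "Culmen Depth (mm)": [17.0, 18.0, 19.0, 14.0, 15.0, 16.0],
})

# Correlation within each group, and correlation over all rows pooled together.
per_group = within_species_corr(toy)
pooled = toy["Culmen Length (mm)"].corr(toy["Culmen Depth (mm)"])

print(per_group)
print(f"Pooled correlation: {pooled:.3f}")
```

In the toy data, each group has a positive correlation while the pooled correlation is negative, which is why it is worth computing the correlation per species instead of over the whole dataframe. Calling `within_species_corr(df)` would give the per-species values for the actual penguins data.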

Written on January 13, 2022