Blog Post 0 - Visualization for the Penguins Dataset

In this blog post, we are going to create a visualization using the Palmer Penguins data set. The following tutorial will guide you through creating this interesting visualization!

1. Imports

First, we load the packages we may use in this post.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

2. Pre-Processing & Data Cleaning

Then, we get the penguin data. We are using the data from palmer_penguins.csv, which contains observations of penguin features from various studies. The data set has 344 penguins (rows) and 17 columns, some of which contain NaN values.

url = 'https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv'
penguins = pd.read_csv(url)
# check the dimensions and the first 5 rows of our data
print(penguins.shape)
penguins.head()
(344, 17)
  | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments
0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes.
1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN
2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN
3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled.
4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN

In this post, we are going to create a visualization using the columns “Species”, “Culmen Length (mm)”, and “Culmen Depth (mm)”. Thus, we select only these three columns from the penguins data set and drop every row that has a missing value (NaN) in any of them.

# Select only these three columns from the penguins data set
df = penguins[["Species", "Culmen Length (mm)", "Culmen Depth (mm)"]]

# drop the NA values
df = df.dropna()
# Preview the first 5 rows of the new dataframe
df.head()
                               Species  Culmen Length (mm)  Culmen Depth (mm)
0  Adelie Penguin (Pygoscelis adeliae)                39.1               18.7
1  Adelie Penguin (Pygoscelis adeliae)                39.5               17.4
2  Adelie Penguin (Pygoscelis adeliae)                40.3               18.0
4  Adelie Penguin (Pygoscelis adeliae)                36.7               19.3
5  Adelie Penguin (Pygoscelis adeliae)                39.3               20.6
# Examine the dataframe
df.describe()
       Culmen Length (mm)  Culmen Depth (mm)
count          342.000000         342.000000
mean            43.921930          17.151170
std              5.459584           1.974793
min             32.100000          13.100000
25%             39.225000          15.600000
50%             44.450000          17.300000
75%             48.500000          18.700000
max             59.600000          21.500000
# Examine the counts of species
df["Species"].value_counts()
Adelie Penguin (Pygoscelis adeliae)          151
Gentoo penguin (Pygoscelis papua)            123
Chinstrap penguin (Pygoscelis antarctica)     68
Name: Species, dtype: int64
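
The count of 342 above, compared with the 344 rows we started with, means that dropna() removed two rows. As a minimal sketch of how to check this kind of thing before dropping anything, the snippet below counts missing values per column and the number of affected rows. The small `toy` dataframe and its values are made up for illustration and are not taken from the penguins data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three selected columns (values are made up for illustration).
toy = pd.DataFrame({
    "Species": ["A", "A", "B", "B"],
    "Culmen Length (mm)": [39.1, np.nan, 46.5, 50.0],
    "Culmen Depth (mm)": [18.7, np.nan, 17.9, 15.2],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows with at least one NaN, i.e. the rows dropna() will remove.
rows_with_missing = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna()

print(missing_per_column)
print(rows_with_missing, len(toy) - len(cleaned))
```

Running the same two checks on the three-column selection from penguins, before calling dropna(), should show which columns the missing values come from.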

3. Create a Scatterplot

For our visualization, we make a scatterplot with culmen length on the x-axis and culmen depth on the y-axis, color-coded by species. We use seaborn to draw this plot.

fgrid = sns.relplot(x="Culmen Length (mm)",
                    y="Culmen Depth (mm)",
                    hue="Species",
                    data=df,
                    palette=["b", "r", "g"])
# add a title to the plot
fgrid.set(title="Figure 1: Culmen Length vs Culmen Depth by Species")
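
To make it clearer what the hue argument does, here is a rough matplotlib-only sketch that draws one scatter layer per group. The helper name `scatter_by_group` and the toy values are hypothetical, introduced only for this illustration. It is not how seaborn is implemented, just a similar picture built by hand.

```python
import pandas as pd
from matplotlib import pyplot as plt

def scatter_by_group(frame, x, y, group):
    """Draw one scatter layer per value of `group`, so each gets its own color."""
    fig, ax = plt.subplots()
    # groupby iterates over the groups in sorted key order
    for name, sub in frame.groupby(group):
        ax.scatter(sub[x], sub[y], label=name)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title=group)
    return ax

# Toy data (made-up values, for illustration only)
toy = pd.DataFrame({
    "Species": ["A", "A", "B", "B"],
    "Culmen Length (mm)": [39.1, 40.3, 46.5, 50.0],
    "Culmen Depth (mm)": [18.7, 18.0, 14.9, 15.2],
})

ax = scatter_by_group(toy, "Culmen Length (mm)", "Culmen Depth (mm)", "Species")
```

With the dataframe from this post, the call would be scatter_by_group(df, "Culmen Length (mm)", "Culmen Depth (mm)", "Species"). Outside a notebook, plt.show() is needed to display the figure.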

Analysis: Figure 1 shows the relationship between culmen length and culmen depth for each species. Within each species, culmen length and culmen depth are positively correlated. These two features may be useful for classifying penguins by species because the points form three fairly distinct clusters: Gentoo penguins have the smallest culmen depth, Adelie penguins have the shortest culmen length, and Chinstrap penguins tend to have the greatest culmen length and depth.
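
The statements in the analysis can also be checked numerically instead of by eye. The sketch below computes, for each species, the mean of both features and the correlation between them. The helper name `summarize_by_species` and the added column name are hypothetical, and the toy values are made up so that the snippet runs on its own.

```python
import pandas as pd

LENGTH = "Culmen Length (mm)"
DEPTH = "Culmen Depth (mm)"

def summarize_by_species(frame, species_col="Species"):
    """Per-species means of both features plus their within-species correlation."""
    summary = frame.groupby(species_col)[[LENGTH, DEPTH]].mean()
    # Pearson correlation between length and depth, computed separately per species
    corr = {name: sub[LENGTH].corr(sub[DEPTH])
            for name, sub in frame.groupby(species_col)}
    summary["Length-Depth Correlation"] = pd.Series(corr)
    return summary

# Toy data (made-up values, for illustration only)
toy = pd.DataFrame({
    "Species": ["X", "X", "X", "Y", "Y", "Y"],
    LENGTH: [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    DEPTH: [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})

summary = summarize_by_species(toy)
print(summary)
```

Calling summarize_by_species(df) on the cleaned dataframe from section 2 gives one row per species, which can be compared against the clusters in Figure 1.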

Written on January 13, 2022