19 Example: Summary Statistics in Python

Both R and Python (pandas) provide functions or methods to summarise data.
R applies functions to the column(s) of a data frame.
pandas applies methods to the column(s) of a DataFrame.
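
To make the function-versus-method distinction concrete, here is a minimal sketch. It uses a small made-up column (the name `length` and its values are illustrative assumptions, not part of the iris data) and compares calling a method on a column with passing the column's values to a NumPy function, which is closer in spirit to R's `mean(DF$length)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the iris dataset used in this chapter.
df = pd.DataFrame({"length": [1.0, 2.0, 3.0, 6.0]})

# pandas style: the statistic is a method called on the column (a Series).
mean_method = df["length"].mean()

# Function style: pass the column's values to a NumPy function.
mean_function = np.mean(df["length"].to_numpy())

# Caveat: the defaults can differ. pandas .std() uses the sample formula
# (ddof=1), while np.std on an array uses the population formula (ddof=0).
std_sample = df["length"].std()
std_population = np.std(df["length"].to_numpy())

print(mean_method, mean_function)
print(std_sample, std_population)
```

Both means agree, but the two standard deviations do not, so it is worth checking the `ddof` default when mixing pandas methods and NumPy functions.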

19.1 Python

Set the working directory to the data folder and read the iris dataset into a pandas DataFrame named DF.

import pandas as pd

DF = pd.read_csv('iris.csv')


import pandas as pd

print('Minimum value of', DF[['SepalLength']].min())

print('Maximum value of', DF[['SepalLength']].max())

print('Mean of', DF[['SepalLength']].mean())

print('Median of', DF[['SepalLength']].median())

print('Standard deviation of', DF[['SepalLength']].std())

print('Summary Statistics of', DF[['SepalLength']].describe())

print('Summary Statistics: All')
DF.describe()

print('Mean Sepal Length by Species:')
DF.groupby('Species')[['SepalLength']].mean()

print('Mean Sepal & Petal Length by Species:')
DF.groupby('Species')[['SepalLength', 'PetalLength']].mean()

print('Summary Statistics of Sepal Length by Species:')
DF.groupby('Species')[['SepalLength']].describe()

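# Count missing values with the vectorised pandas .sum() instead of the Python built-in sum()
print('Number of missing values of Sepal Length = ', DF['SepalLength'].isnull().sum())

print('Number of non-missing values of Sepal Length = ', DF['SepalLength'].notnull().sum())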

print('Counts for different Species:')
DF.Species.value_counts()

print('Calculate a new column with log-transformed SepalLength')
import numpy as np
DF['log_SepalLength'] = np.log(DF['SepalLength'])
DF.head()

print('Sort SepalLength in descending order')
DF.sort_values(by = "SepalLength", ascending = False)

Minimum value of SepalLength    4.3
dtype: float64

Maximum value of SepalLength    7.9
dtype: float64

Mean of SepalLength    5.843333
dtype: float64

Median of SepalLength    5.8
dtype: float64

Standard deviation of SepalLength    0.828066
dtype: float64

Summary Statistics of        SepalLength
count   150.000000
mean      5.843333
std       0.828066
min       4.300000
25%       5.100000
50%       5.800000
75%       6.400000
max       7.900000

Summary Statistics: All

       SepalLength  SepalWidth  PetalLength  PetalWidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.057333     3.758000    1.199333
std       0.828066    0.435866     1.765298    0.762238
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000

Mean Sepal Length by Species:

            SepalLength
Species                
setosa            5.006
versicolor        5.936
virginica         6.588

Mean Sepal & Petal Length by Species:

            SepalLength  PetalLength
Species                             
setosa            5.006        1.462
versicolor        5.936        4.260
virginica         6.588        5.552

Summary Statistics of Sepal Length by Species:

           SepalLength                                            
                 count   mean       std  min    25%  50%  75%  max
Species                                                           
setosa            50.0  5.006  0.352490  4.3  4.800  5.0  5.2  5.8
versicolor        50.0  5.936  0.516171  4.9  5.600  5.9  6.3  7.0
virginica         50.0  6.588  0.635880  4.9  6.225  6.5  6.9  7.9

Number of missing values of Sepal Length =  0

Number of non-missing values of Sepal Length =  150

Counts for different Species:

setosa        50
versicolor    50
virginica     50
Name: Species, dtype: int64

Calculate a new column with log-transformed SepalLength

   SepalLength  SepalWidth  PetalLength  PetalWidth Species  log_SepalLength
0          5.1         3.5          1.4         0.2  setosa         1.629241
1          4.9         3.0          1.4         0.2  setosa         1.589235
2          4.7         3.2          1.3         0.2  setosa         1.547563
3          4.6         3.1          1.5         0.2  setosa         1.526056
4          5.0         3.6          1.4         0.2  setosa         1.609438

Sort SepalLength in descending order

     SepalLength  SepalWidth  ...    Species  log_SepalLength
131          7.9         3.8  ...  virginica         2.066863
135          7.7         3.0  ...  virginica         2.041220
122          7.7         2.8  ...  virginica         2.041220
117          7.7         3.8  ...  virginica         2.041220
118          7.7         2.6  ...  virginica         2.041220
..           ...         ...  ...        ...              ...
41           4.5         2.3  ...     setosa         1.504077
42           4.4         3.2  ...     setosa         1.481605
38           4.4         3.0  ...     setosa         1.481605
8            4.4         2.9  ...     setosa         1.481605
13           4.3         3.0  ...     setosa         1.458615

[150 rows x 6 columns]
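
The grouped summaries and missing-value counts above can also be combined in a single call. The following is a sketch only, run on a tiny made-up table (the values and the deliberately missing entry are assumptions for illustration, since the iris data has no missing values). It uses `agg` to request several statistics per group and shows that `count` skips missing values.

```python
import pandas as pd

# Hypothetical toy data with one missing SepalLength; not the real iris values.
toy = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica"],
    "SepalLength": [5.0, 5.2, 6.4, None],
})

# Several summary statistics per group in one call.
summary = toy.groupby("Species")["SepalLength"].agg(["count", "mean", "min", "max"])

# Number of missing values in the column.
n_missing = toy["SepalLength"].isnull().sum()

print(summary)
print("Missing values:", n_missing)
```

Here `count` reports one value for virginica rather than two, because the missing entry is excluded from the statistics.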

19.2 Note

Essential Descriptive Statistics of pandas
pandas follows NumPy-style broadcasting behaviour.
Calculations and transformations on a pandas DataFrame often use NumPy functions directly.
pandas has a quick reference guide that compares common R operations (dplyr or tidyverse) with their pandas equivalents.
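
The notes on broadcasting and on using NumPy functions directly can be illustrated with a short sketch. The DataFrame below is made up for illustration (the column names `a` and `b` are assumptions); it shows a scalar broadcast over every cell, a Series of column means broadcast across rows, and a NumPy function applied element-wise to a whole DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [10.0, 20.0, 40.0]})

# A scalar is broadcast to every cell.
doubled = df * 2

# df.mean() is a Series indexed by column name; subtracting it aligns on
# the column labels and broadcasts across all rows (column centring).
centered = df - df.mean()

# NumPy functions work element-wise and return a DataFrame.
logged = np.log(df)

print(doubled)
print(centered)
print(logged)
```

One difference from plain NumPy is that pandas aligns on labels before broadcasting, so the column means are matched to columns by name rather than by position.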
