19 Example: Summary Statistics in Python
Both R and Python (pandas) include functions or methods to summarise the data:
- R applies functions to the column(s)
- Python pandas applies methods to the column(s) of the dataframe
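To make the contrast concrete, here is a minimal sketch using a small hand-made dataframe (the four values are illustrative stand-ins, not the full iris dataset, and the name `df` is hypothetical). In pandas the summary is a method called on the column; the R equivalent, shown only as a comment, is a function applied to the column.

```python
import pandas as pd

# Hypothetical four-row stand-in for the iris data
df = pd.DataFrame({"SepalLength": [5.1, 4.9, 6.3, 5.8]})

# pandas: the statistic is a method of the column (a Series)
mean_length = df["SepalLength"].mean()
max_length = df["SepalLength"].max()

# R equivalent (for comparison): mean(DF$SepalLength), max(DF$SepalLength)
print(mean_length, max_length)
```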
19.1 Python
Set the working directory to the data folder and read the iris dataset as a pandas dataframe DF.
import pandas as pd
DF = pd.read_csv('iris.csv')
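After reading the file, it can help to check that the columns and types are what the later code expects. The sketch below uses a few rows of inline CSV text as a stand-in for iris.csv so that it runs without the file; the name `DF_small` and the inline rows are assumptions for illustration. With the real data, pass the file path to `pd.read_csv` as above.

```python
import io
import pandas as pd

# Inline stand-in for iris.csv; column names follow those used in this chapter
csv_text = """SepalLength,SepalWidth,PetalLength,PetalWidth,Species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

DF_small = pd.read_csv(io.StringIO(csv_text))

print(DF_small.shape)   # (number of rows, number of columns)
print(DF_small.dtypes)  # the four measurements should be numeric, Species text
print(DF_small.head())  # first rows, to eyeball the values
```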
import pandas as pd
print('Minimum value of', DF[['SepalLength']].min())
print('Maximum value of', DF[['SepalLength']].max())
print('Mean of', DF[['SepalLength']].mean())
print('Median of', DF[['SepalLength']].median())
print('Standard deviation of', DF[['SepalLength']].std())
print('Summary Statistics of', DF[['SepalLength']].describe())
print('Summary Statistics: All')
DF.describe()
print('Mean Sepal Length by Species:')
DF.groupby('Species')[['SepalLength']].mean()
print('Mean Sepal & Petal Length by Species:')
DF.groupby('Species')[['SepalLength', 'PetalLength']].mean()
print('Summary Statistics of Sepal Length by Species:')
DF.groupby('Species')[['SepalLength']].describe()
print('Number of missing values of Sepal Length = ', sum(DF.SepalLength.isnull()))
print('Number of non-missing values of Sepal Length = ', sum(DF.SepalLength.notnull()))
print('Counts for different Species:')
DF.Species.value_counts()
print('Calculate a new column with log-transformed SepalLength')
import numpy as np
DF['log_SepalLength'] = np.log(DF['SepalLength'])
DF.head()
print('Sort SepalLength in descending order')
DF.sort_values(by="SepalLength", ascending=False)
Minimum value of SepalLength 4.3
dtype: float64
Maximum value of SepalLength 7.9
dtype: float64
Mean of SepalLength 5.843333
dtype: float64
Median of SepalLength 5.8
dtype: float64
Standard deviation of SepalLength 0.828066
dtype: float64
Summary Statistics of SepalLength
count 150.000000
mean 5.843333
std 0.828066
min 4.300000
25% 5.100000
50% 5.800000
75% 6.400000
max 7.900000
Summary Statistics: All
       SepalLength  SepalWidth  PetalLength  PetalWidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.057333     3.758000    1.199333
std       0.828066    0.435866     1.765298    0.762238
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000
Mean Sepal Length by Species:
            SepalLength
Species
setosa            5.006
versicolor        5.936
virginica         6.588
Mean Sepal & Petal Length by Species:
            SepalLength  PetalLength
Species
setosa            5.006        1.462
versicolor        5.936        4.260
virginica         6.588        5.552
Summary Statistics of Sepal Length by Species:
           SepalLength
                 count   mean       std  min    25%  50%  75%  max
Species
setosa            50.0  5.006  0.352490  4.3  4.800  5.0  5.2  5.8
versicolor        50.0  5.936  0.516171  4.9  5.600  5.9  6.3  7.0
virginica         50.0  6.588  0.635880  4.9  6.225  6.5  6.9  7.9
Number of missing values of Sepal Length = 0
Number of non-missing values of Sepal Length = 150
Counts for different Species:
setosa 50
versicolor 50
virginica 50
Name: Species, dtype: int64
Calculate a new column with log-transformed SepalLength
   SepalLength  SepalWidth  PetalLength  PetalWidth Species  log_SepalLength
0          5.1         3.5          1.4         0.2  setosa         1.629241
1          4.9         3.0          1.4         0.2  setosa         1.589235
2          4.7         3.2          1.3         0.2  setosa         1.547563
3          4.6         3.1          1.5         0.2  setosa         1.526056
4          5.0         3.6          1.4         0.2  setosa         1.609438
Sort SepalLength in descending order
     SepalLength  SepalWidth  ...    Species  log_SepalLength
131          7.9         3.8  ...  virginica         2.066863
135          7.7         3.0  ...  virginica         2.041220
122          7.7         2.8  ...  virginica         2.041220
117          7.7         3.8  ...  virginica         2.041220
118          7.7         2.6  ...  virginica         2.041220
..           ...         ...  ...        ...              ...
41           4.5         2.3  ...     setosa         1.504077
42           4.4         3.2  ...     setosa         1.481605
38           4.4         3.0  ...     setosa         1.481605
8            4.4         2.9  ...     setosa         1.481605
13           4.3         3.0  ...     setosa         1.458615
[150 rows x 6 columns]
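The "dtype: float64" lines in the first few outputs above appear because DF[['SepalLength']] (double brackets) selects a one-column dataframe, so a method such as .min() returns a Series rather than a single number. A minimal sketch of the difference, using a small hypothetical dataframe `df` rather than the iris file:

```python
import pandas as pd

# Hypothetical stand-in data, not the iris file
df = pd.DataFrame({"SepalLength": [5.1, 4.9, 6.3, 5.8]})

# Double brackets select a DataFrame, so the result is a Series (one entry per column)
as_series = df[["SepalLength"]].min()

# Single brackets select a Series, so the result is a plain number
as_scalar = df["SepalLength"].min()

print(type(as_series).__name__)  # Series
print(as_scalar)                 # 4.9
```

Using single brackets inside the print statements would give one number per line without the trailing dtype line.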
19.2 Note
- Python pandas follows the same broadcasting behaviour as numpy.
- Often, calculations or transformations on a pandas DataFrame use numpy package functions directly.
- Python pandas has a quick reference guide that compares common R operations (with dplyr or the tidyverse) with their equivalents in pandas.
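The first two points can be illustrated with a short sketch. The dataframe `df` and its values are hypothetical stand-ins; the calls used (`DataFrame.mean`, arithmetic between a dataframe and a Series, and `np.log`) are standard pandas and numpy.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with two numeric columns
df = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.3],
    "PetalLength": [1.4, 1.4, 6.0],
})

# Broadcasting: the Series of column means is aligned on the column labels
# and subtracted from every row of the dataframe
centred = df - df.mean()

# A numpy function applied to a DataFrame is computed element-wise and
# returns a DataFrame with the same index and columns
logged = np.log(df)

print(centred)
print(logged)
```

Unlike plain numpy arrays, pandas aligns on labels before broadcasting, so the subtraction matches columns by name rather than by position.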