17 Data: Summary Statistics

  • Both R and Python (pandas) include functions or methods to summarise the data
  • R applies functions to the column(s)
  • Python pandas applies methods to the column(s) of the dataframe

17.1 R

Set the working directory to the data folder and read the iris dataset as an R object DF.

DF = read.csv('iris.csv')

| Function | Explanation | Example |
|---|---|---|
| min, max | Minimum, maximum value of a numeric column | min(DF$SepalLength) |
| mean, median | Mean, median of a numeric column (base R has no function for the statistical mode; mode() returns the storage type of an object) | mean(DF$SepalLength) |
| quantile | Quantiles of a numeric column | quantile(DF$SepalLength, probs = c(0.25, 0.50, 0.75)) |
| sd, var | Standard deviation, variance of a numeric column | sd(DF$SepalLength) |
| summary | Summary statistics of a column | summary(DF$SepalLength) |
| aggregate | Summary statistics of a numeric column grouped by a categorical column | aggregate(SepalLength ~ Species, data = DF, FUN = mean, na.rm = TRUE) |
| table | Counts of categorical variable(s) | table(DF$Species) |
| order, sort | Sort the values in a column | sort(DF$SepalLength, decreasing = TRUE) |
| as.logical, as.integer, as.numeric, as.character | Convert one data type to another | as.integer(DF$SepalLength) |
| NA | Missing value | |
| is.na | Count the missing values in an object | sum(is.na(DF$Species)) |
| is.na with ! | Count the non-missing values in an object (! is the NOT operator) | sum(!is.na(DF$Species)) |

17.2 Python

Set the working directory to the data folder and read the iris dataset as a pandas DataFrame DF.

import pandas as pd

DF = pd.read_csv('iris.csv')

| Method | Explanation | Example |
|---|---|---|
| min, max | Minimum, maximum value of a numeric column | DF.SepalLength.min() |
| mean, median, mode | Mean, median, mode of a numeric column | DF.SepalLength.mean() |
| quantile | Quantiles of a numeric column | DF.SepalLength.quantile(q = [0.25, 0.50, 0.75]) |
| std, var | Standard deviation, variance of a numeric column | DF.SepalLength.std() |
| describe | Summary statistics of the numeric columns | DF.describe() |
| groupby | Summary statistics of a numeric column grouped by a categorical column | DF.groupby('Species')['SepalLength'].mean() |
| value_counts | Counts of categorical variable(s) | DF.Species.value_counts() |
| sort_index, sort_values | Sort the data by index or by the values in a column | DF.sort_values(by = 'SepalLength', ascending = False) |
| astype | Convert one data type to another | DF[['SepalLength']].astype(int); DF.astype({'SepalLength' : int}) |
| NaN (np.nan) | Missing value | |
| isnull, isna | Count the missing values in an object | DF.Species.isnull().sum() |
| notnull, notna | Count the non-missing values in an object | DF.Species.notnull().sum() |
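
The methods in the table can be tried without the data file. The following is a minimal sketch, assuming a small hand-made DataFrame in place of iris.csv; the six rows are illustrative values, not the real dataset, and only the column names SepalLength and Species are taken from the examples above.

```python
import pandas as pd

# Hypothetical stand-in for iris.csv: six illustrative rows.
DF = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

col = DF["SepalLength"]

# Statistics of a single numeric column
col_min, col_max = col.min(), col.max()
col_mean, col_median = col.mean(), col.median()
col_quartiles = col.quantile(q=[0.25, 0.50, 0.75])
col_std, col_var = col.std(), col.var()  # sample statistics (ddof=1) by default

# Mean of the numeric column within each category
group_means = DF.groupby("Species")["SepalLength"].mean()

# Number of rows per category
species_counts = DF["Species"].value_counts()

# Rows ordered from the largest to the smallest SepalLength
sorted_DF = DF.sort_values(by="SepalLength", ascending=False)

print(DF.describe())
print(group_means)
```

Selecting the column first (DF["SepalLength"]) keeps the non-numeric Species column out of the calculation, which mirrors the DF$SepalLength form used in the R examples.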

Note

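Missing values are displayed as NaN and are written in code as np.nan. The aliases np.NaN and np.NAN were removed in NumPy 2.0. pandas also treats None and pd.NA as missing.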
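
A short sketch of how missing values behave in practice, assuming a hypothetical four-element Series rather than the iris data:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries, written in two different ways.
s = pd.Series([5.1, np.nan, None, 4.7], name="SepalLength")

# None is converted to NaN when the Series is numeric.
n_missing = int(s.isna().sum())    # isnull() is an alias of isna()
n_present = int(s.notna().sum())   # notnull() is an alias of notna()

# NaN does not compare equal to itself, so == cannot be used to find it.
nan_equals_itself = (np.nan == np.nan)

# Summary methods skip missing values by default (skipna=True).
mean_without_missing = s.mean()

print(n_missing, n_present, nan_equals_itself, mean_without_missing)
```

This is why the table uses isnull/isna and notnull rather than a comparison with NaN.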

Method: to_numpy()

Explanation: Convert the DataFrame to a NumPy array.

Example: DF.to_numpy()
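
As an illustration, the sketch below converts a small hypothetical DataFrame. The column name SepalWidth and all values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical three-row DataFrame; SepalWidth is an assumed column name.
DF = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 7.0],
    "SepalWidth": [3.5, 3.0, 3.2],
    "Species": ["setosa", "setosa", "versicolor"],
})

# Numeric columns only: the result is a 2-D float array.
numeric = DF[["SepalLength", "SepalWidth"]].to_numpy()

# Mixed numeric and text columns: the array falls back to dtype object.
mixed = DF.to_numpy()

print(numeric.shape, numeric.dtype)
print(mixed.dtype)
```

Selecting the numeric columns before converting usually gives an array that is more convenient for numerical work than the object array produced from the whole DataFrame.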

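NumPy can create an array of a given shape without initialising its elements (np.empty): the shape is defined, but the values are arbitrary until they are assigned. An array can also have a dimension of length zero, in which case it has a shape but contains no elements.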
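
Both cases can be sketched as follows, assuming np.empty is the function intended here. The shapes are arbitrary choices for illustration.

```python
import numpy as np

# Shape is fixed, but the contents are whatever was already in memory.
uninitialised = np.empty((2, 3))

# A zero-length dimension: the array has a shape but holds no elements.
no_rows = np.empty((0, 3))

# Rows can then be stacked onto the zero-row array.
grown = np.vstack([no_rows, [1.0, 2.0, 3.0]])

# np.zeros is the safer choice when defined starting values are needed.
zeros = np.zeros((2, 3))

print(uninitialised.shape, no_rows.shape, no_rows.size, grown.shape)
```

The values in uninitialised should not be read before they are assigned, since np.empty makes no guarantee about them.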