17 Data: Summary Statistics
- Both R and Python (pandas) include functions or methods to summarise the data
- R applies functions to the column(s)
- Python pandas applies methods to the column(s) of the dataframe
17.1 R
Set the working directory to the data folder and read the iris dataset as an R object DF.
DF = read.csv('iris.csv')
Function | Explanation | Example |
---|---|---|
min, max | Minimum, Maximum value of a numeric column | min(DF$SepalLength) |
mean, median, mode | Mean, Median of a numeric column; note that base R's mode() returns the storage mode of an object, not the statistical mode | mean(DF$SepalLength) |
quantile | Quantile of a numeric column | quantile(DF$SepalLength, probs = c(0.25, 0.50, 0.75)) |
sd, var | Standard deviation, Variance of a numeric column | sd(DF$SepalLength) |
summary | Summary statistics of a column | summary(DF$SepalLength) |
aggregate | Summary statistics of a numeric column grouped by a categorical column | aggregate(SepalLength ~ Species, data = DF, FUN = mean, na.rm = TRUE) |
table | Counts of categorical variable(s) | table(DF$Species) |
order, sort | Sort data in a column | sort(DF$SepalLength, decreasing = TRUE) |
as.logical, as.integer, as.numeric, as.character | Convert one data type to another | as.integer(DF$SepalLength) |
NA | Missing values | |
is.na | Test each element for a missing value (the example counts the missing values) | sum(is.na(DF$Species)) |
is.na with ! | Test each element for a non-missing value (! is the NOT operator; the example counts the non-missing values) | sum(!is.na(DF$Species)) |
17.2 Python
Set the working directory to the data folder and read the iris dataset as a pandas DataFrame DF.
import pandas as pd
DF = pd.read_csv('iris.csv')
Methods | Explanation | Example |
---|---|---|
min, max | Minimum, Maximum value of a numeric column | DF.SepalLength.min() |
mean, median, mode | Mean, Median, Mode of a numeric column | DF.SepalLength.mean() |
quantile | Quantile of a numeric column | DF.SepalLength.quantile(q = [0.25, 0.50, 0.75]) |
std, var | Standard deviation, Variance of a numeric column | DF.SepalLength.std() |
describe | Summary statistics of a column | DF.SepalLength.describe() |
groupby | Summary statistics of numeric columns grouped by a categorical column | DF.groupby('Species').mean() |
value_counts | Counts of categorical variable(s) | DF.Species.value_counts() |
sort_index, sort_values | Sort data in a column | DF.sort_values(by = 'SepalLength', ascending = False) |
astype | Convert one type to another | DF[['SepalLength']].astype(int); DF.astype({'SepalLength' : int}) |
NaN, NAN, nan | Missing values | |
isnull, isna | Test each element for a missing value (the example counts the missing values) | sum(DF.Species.isnull()) |
notnull | Test each element for a non-missing value (the example counts the non-missing values) | sum(DF.Species.notnull()) |
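As a minimal sketch of the methods in the table, the block below builds a small hand-made DataFrame instead of reading iris.csv, so it runs without the data file. The six rows and their values are made up for illustration; only the column names SepalLength and Species follow the examples above.

```python
import pandas as pd

# Hypothetical stand-in for iris.csv: six made-up rows, two columns.
DF = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Summary statistics of one numeric column.
col_min = DF.SepalLength.min()
col_max = DF.SepalLength.max()
col_mean = DF.SepalLength.mean()
col_median = DF.SepalLength.median()
quartiles = DF.SepalLength.quantile(q=[0.25, 0.50, 0.75])
col_std = DF.SepalLength.std()  # sample standard deviation (ddof=1)

# Mean of the numeric column within each species.
group_means = DF.groupby("Species")["SepalLength"].mean()

# Number of rows per species.
counts = DF.Species.value_counts()

# Rows ordered from the longest to the shortest sepal.
sorted_DF = DF.sort_values(by="SepalLength", ascending=False)

# Type conversion: casting float to int drops the decimal part.
as_int = DF.astype({"SepalLength": int})

print(DF.SepalLength.describe())
print(group_means)
print(counts)
```

Selecting the column first (DF.SepalLength) keeps each statistic restricted to numeric data; calling methods such as mean() on the whole DataFrame can fail in recent pandas versions when a text column like Species is present.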
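Note
Missing values can appear as NaN, NAN, nan, and np.NaN. NumPy 2.0 removed the np.NaN and np.NAN aliases, so np.nan is the spelling that works across versions.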
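The following sketch shows how missing values are detected and counted. The Series and its contents are hypothetical; it uses np.nan and None as the two missing markers.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries (np.nan and None).
species = pd.Series(["setosa", np.nan, "virginica", None])

# isna() and isnull() are aliases; so are notna() and notnull().
n_missing = species.isna().sum()
n_present = species.notna().sum()

# NaN does not compare equal to itself, so == cannot be used to find it.
nan_equals_itself = (np.nan == np.nan)

print(n_missing, n_present, nan_equals_itself)
```

Calling .sum() on the boolean result counts the True values, which is the pandas equivalent of sum(is.na(...)) in R.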
Method: to_numpy()
Explanation: Convert the DataFrame to a NumPy array.
Example: DF.to_numpy()
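A short sketch of the conversion, assuming a small made-up DataFrame with two numeric columns (the SepalWidth column name and all values are illustrative):

```python
import pandas as pd

# Hypothetical numeric-only DataFrame: three rows, two columns.
DF = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 7.0],
    "SepalWidth": [3.5, 3.0, 3.2],
})

# Convert to a 2-D NumPy array; column labels and the index are dropped.
arr = DF.to_numpy()

print(arr.shape)
print(arr.dtype)
```

The resulting array keeps only the values, so any column names needed later have to be taken from DF.columns before converting.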
When using NumPy, a matrix or an array can be initialised as empty (for example with np.empty), so it has a defined shape before any elements are assigned.
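A minimal sketch of both readings of "empty", assuming np.empty is the intended function: an array whose shape is fixed but whose values are not yet set, and an array whose shape contains a zero-length dimension and therefore holds no elements at all.

```python
import numpy as np

# Allocate a 2 x 3 array without setting its values.
# The contents are arbitrary until they are assigned.
a = np.empty((2, 3))
a[:] = 0.0  # assign values before reading them

# A shape with a zero-length dimension: defined shape, zero elements.
b = np.empty((0, 3))

print(a.shape, a.size)
print(b.shape, b.size)
```

Because np.empty skips initialisation, reading from the array before assigning to it gives meaningless numbers; np.zeros is the safer choice when a known starting value matters.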