Section 17 The Final Problem
In the following example, we will create a sample dataset of students with marks in three subjects (Math, Science and Literature). Each student has a sex (Male or Female) and belongs to one of 5 regions and one of 4 social classes.
All data will be sampled from a uniform distribution. Different descriptive statistics will be presented for the sampled data. We will finally conduct simple statistical tests to investigate specific hypotheses.
Create a dataset with student information
- 100 students with ID as ID1, ID2,.., ID100
- These students are from 5 different regions: R1,..,R5 with 20 from each region (in order)
- They belong to 4 social classes in each region: S1,..,S4
- Generate factor: Sex (M, F) with equal proportion from each region
- Set seed value as 123456
- Generate Math marks from a uniform distribution for all 100 individuals: range: 10, 100
- Generate Science marks from a uniform distribution for all 100 individuals: range: 10, 100
- Generate Literature marks from a uniform distribution for all 100 individuals: range: 20, 95
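The steps above can be sketched as follows. The list does not say how classes and sexes are arranged within a region, so this sketch assumes 5 consecutive students per class in each region and alternating M/F (which gives 10 of each sex per region). Marks are left as continuous values; rounding them is optional.

```r
n <- 100

# Identifiers ID1, ..., ID100
ID <- paste0("ID", seq_len(n))

# 5 regions, 20 consecutive students each
Region <- rep(paste0("R", 1:5), each = 20)

# Assumption: within each region, 5 consecutive students per social class
Class <- rep(rep(paste0("S", 1:4), each = 5), times = 5)

# Assumption: alternating sexes, so each region has 10 M and 10 F
Sex <- rep(c("M", "F"), times = n / 2)

# Fix the seed so the sampled marks are reproducible
set.seed(123456)
Math <- runif(n, min = 10, max = 100)
Sci  <- runif(n, min = 10, max = 100)
Lit  <- runif(n, min = 20, max = 95)
```

`table(Region, Sex)` and `table(Region, Class)` are quick ways to confirm the layout.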
Create a dataframe with variables ID, Sex, Region, Class, Math, Sci, Lit
- Do we need to convert the character vectors to factors?
- Check the structure of the data
- Check the levels of the factors
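A minimal sketch of the dataframe step, using the same assumed layout of classes and sexes (5 consecutive students per class, alternating M/F). The name `students` is illustrative. Since R 4.0.0, `data.frame()` keeps character columns as character by default, so the grouping variables are converted explicitly here; `ID` is left as character because it only labels rows.

```r
set.seed(123456)
students <- data.frame(
  ID     = paste0("ID", 1:100),
  Sex    = rep(c("M", "F"), times = 50),                   # assumed ordering
  Region = rep(paste0("R", 1:5), each = 20),
  Class  = rep(rep(paste0("S", 1:4), each = 5), times = 5), # assumed ordering
  Math   = runif(100, min = 10, max = 100),
  Sci    = runif(100, min = 10, max = 100),
  Lit    = runif(100, min = 20, max = 95),
  stringsAsFactors = FALSE
)

# Convert the grouping variables to factors
students$Sex    <- factor(students$Sex, levels = c("M", "F"))
students$Region <- factor(students$Region)
students$Class  <- factor(students$Class)

# Structure of the data and levels of each factor
str(students)
lapply(students[c("Sex", "Region", "Class")], levels)
```

Functions such as `table()` also accept character vectors, but `levels()` returns `NULL` for them, which is one reason to convert.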
Descriptive Statistics
- Number of individuals in each level of factors: Sex, Region, Social class
- Calculate sample descriptive statistics for Math, Science and Literature: mean, standard deviation (sd), coefficient of variation (cv), standard error (se)
- Calculate the estimate of correlation between Math and Science
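One possible way to obtain these summaries is sketched below. The helper `desc()` is a hypothetical name; it reports the cv as a percentage (100 * sd / mean) and the se as sd / sqrt(n). The data are rebuilt first, under the same assumed layout, so that the block runs on its own.

```r
# Rebuild the sample data (assumed layout of Class and Sex)
set.seed(123456)
students <- data.frame(
  ID     = paste0("ID", 1:100),
  Sex    = factor(rep(c("M", "F"), times = 50)),
  Region = factor(rep(paste0("R", 1:5), each = 20)),
  Class  = factor(rep(rep(paste0("S", 1:4), each = 5), times = 5)),
  Math   = runif(100, 10, 100),
  Sci    = runif(100, 10, 100),
  Lit    = runif(100, 20, 95)
)

# Number of individuals in each level of each factor
table(students$Sex)
table(students$Region)
table(students$Class)

# Hypothetical helper: mean, sd, cv (in percent) and se of a numeric vector
desc <- function(x) {
  c(mean = mean(x),
    sd   = sd(x),
    cv   = 100 * sd(x) / mean(x),
    se   = sd(x) / sqrt(length(x)))
}

# One column per subject, one row per statistic
stats <- sapply(students[c("Math", "Sci", "Lit")], desc)
stats

# Pearson correlation between Math and Science
r <- cor(students$Math, students$Sci)
r
```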
Modify the data
- Sample marks between 10 and 20, with size equal to the number of individuals in social class 1
- Add them to the Math marks of all individuals in social class 1.
- Check whether any marks exceed 100. If so, reset them to 100.
- Sample marks between 5 and 20, with size equal to the number of individuals in social classes 2 and 4
- Add them to the Science marks of all individuals in social classes 2 and 4.
- Check whether any marks exceed 100. If so, reset them to 100.
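The modification can be sketched with logical indexing. The text does not name the distribution for the extra marks, so a uniform distribution is assumed here, in line with the rest of the example. The data are rebuilt first so that the block runs on its own.

```r
# Rebuild the sample data (assumed layout of Class and Sex)
set.seed(123456)
students <- data.frame(
  ID     = paste0("ID", 1:100),
  Sex    = factor(rep(c("M", "F"), times = 50)),
  Region = factor(rep(paste0("R", 1:5), each = 20)),
  Class  = factor(rep(rep(paste0("S", 1:4), each = 5), times = 5)),
  Math   = runif(100, 10, 100),
  Sci    = runif(100, 10, 100),
  Lit    = runif(100, 20, 95)
)

# Social class 1: add extra Math marks sampled between 10 and 20
is_s1 <- students$Class == "S1"
students$Math[is_s1] <- students$Math[is_s1] + runif(sum(is_s1), 10, 20)

# Cap Math marks at 100
students$Math[students$Math > 100] <- 100

# Social classes 2 and 4: add extra Science marks sampled between 5 and 20
is_s24 <- students$Class %in% c("S2", "S4")
students$Sci[is_s24] <- students$Sci[is_s24] + runif(sum(is_s24), 5, 20)

# Equivalent capping in one step with pmin()
students$Sci <- pmin(students$Sci, 100)
```

Both capping idioms give the same result; `pmin()` avoids a separate comparison step.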
Incorporate missing information
- The Region information is not available for the following individuals: ID20, ID40, ID60, ID80, ID100
- The Math marks data are not available for the following individuals: ID5, ID25, ID45, ID65, ID85
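Missing information is represented in R by `NA`. A short sketch, matching rows by ID rather than by position so that it still works if the rows are reordered (data rebuilt first under the same assumed layout):

```r
# Rebuild the sample data (assumed layout of Class and Sex)
set.seed(123456)
students <- data.frame(
  ID     = paste0("ID", 1:100),
  Sex    = factor(rep(c("M", "F"), times = 50)),
  Region = factor(rep(paste0("R", 1:5), each = 20)),
  Class  = factor(rep(rep(paste0("S", 1:4), each = 5), times = 5)),
  Math   = runif(100, 10, 100),
  Sci    = runif(100, 10, 100),
  Lit    = runif(100, 20, 95)
)

# IDs with missing Region and missing Math marks
no_region <- paste0("ID", seq(20, 100, by = 20))  # ID20, ID40, ..., ID100
no_math   <- paste0("ID", seq(5, 85, by = 20))    # ID5, ID25, ..., ID85

students$Region[students$ID %in% no_region] <- NA
students$Math[students$ID %in% no_math]     <- NA

# Count missing values per column and list the incomplete rows
colSums(is.na(students))
students[!complete.cases(students), ]
```

Setting entries of a factor to `NA` leaves its levels unchanged.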
Statistical test
- Check the help page to conduct an unpaired t-test
- ?t.test
- Assess whether the mean Math marks differ between the two sexes
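A minimal sketch of the test using the formula interface of `t.test()`. By default the test is unpaired and does not assume equal variances (Welch's test); pass `var.equal = TRUE` for the pooled-variance version. Rows with a missing Math mark are dropped by the default `na.action`. The data are rebuilt first under the same assumed layout.

```r
# Rebuild the sample data (assumed layout of Class and Sex)
set.seed(123456)
students <- data.frame(
  ID   = paste0("ID", 1:100),
  Sex  = factor(rep(c("M", "F"), times = 50), levels = c("M", "F")),
  Math = runif(100, 10, 100)
)

# Missing Math marks for ID5, ID25, ..., ID85
students$Math[students$ID %in% paste0("ID", seq(5, 85, by = 20))] <- NA

# Unpaired two-sample t-test of Math marks by Sex (Welch by default)
res <- t.test(Math ~ Sex, data = students)
res

# Individual components of the result
res$estimate   # group means
res$statistic  # t statistic
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for the mean difference
```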
Descriptive Statistics by Group
- Number of individuals in each combination of levels of the factors: Sex, Region
- Calculate mean, sd, cv, se of Math for each level of the factors: Sex, Region, Social class
- Calculate the correlation between Math, Sci and Lit
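Grouped summaries can be sketched with `table()` and `split()`. The helper `desc()` is a hypothetical name and drops missing values before computing each statistic; the data are rebuilt first under the same assumed layout.

```r
# Rebuild the sample data (assumed layout of Class and Sex)
set.seed(123456)
students <- data.frame(
  ID     = paste0("ID", 1:100),
  Sex    = factor(rep(c("M", "F"), times = 50)),
  Region = factor(rep(paste0("R", 1:5), each = 20)),
  Class  = factor(rep(rep(paste0("S", 1:4), each = 5), times = 5)),
  Math   = runif(100, 10, 100),
  Sci    = runif(100, 10, 100),
  Lit    = runif(100, 20, 95)
)

# Two-way table of counts: Sex by Region
counts <- table(students$Sex, students$Region)
counts

# Hypothetical helper: mean, sd, cv (percent), se, ignoring missing values
desc <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x), sd = sd(x),
    cv = 100 * sd(x) / mean(x), se = sd(x) / sqrt(length(x)))
}

# Math statistics for each level of Sex, Region and Class
by_factor <- lapply(
  c(Sex = "Sex", Region = "Region", Class = "Class"),
  function(f) sapply(split(students$Math, students[[f]]), desc)
)
by_factor$Region

# Correlation matrix of the three subjects, using complete rows only
cors <- cor(students[c("Math", "Sci", "Lit")], use = "complete.obs")
cors
```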
Conduct an unpaired t-test to test whether the Math marks differ between the two sexes (see ?t.test)
Manual computation
- Write a function that will compute the following and return the result as a list
- Obtain mean, sd, cv, se in each group
- Calculate the t statistic, p-value and confidence interval (CI)
- Add a message: mean difference is statistically significant / not significant
- Make a list with mean, sd of each group, t statistic, message
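One way to write such a function is sketched below. The text does not say which unpaired t-test to compute, so this sketch assumes Welch's unequal-variance version, which is the default of `t.test()`. The function name `manual_t_test` and its arguments are hypothetical.

```r
# Hypothetical helper: manual Welch two-sample t-test on two numeric vectors
manual_t_test <- function(x, y, alpha = 0.05) {
  # Drop missing values
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]

  # Descriptive statistics for one group
  desc <- function(v) {
    c(mean = mean(v), sd = sd(v),
      cv = 100 * sd(v) / mean(v), se = sd(v) / sqrt(length(v)))
  }
  g1 <- desc(x)
  g2 <- desc(y)

  # Standard error of the difference and t statistic
  diff    <- g1[["mean"]] - g2[["mean"]]
  se_diff <- sqrt(g1[["se"]]^2 + g2[["se"]]^2)
  t_stat  <- diff / se_diff

  # Welch-Satterthwaite degrees of freedom
  df <- se_diff^4 /
    (g1[["se"]]^4 / (length(x) - 1) + g2[["se"]]^4 / (length(y) - 1))

  # Two-sided p-value and (1 - alpha) confidence interval
  p_value <- 2 * pt(-abs(t_stat), df)
  ci <- diff + c(-1, 1) * qt(1 - alpha / 2, df) * se_diff

  msg <- if (p_value < alpha) {
    "Mean difference is statistically significant"
  } else {
    "Mean difference is not statistically significant"
  }

  list(group1 = g1, group2 = g2, t = t_stat, df = df,
       p.value = p_value, conf.int = ci, message = msg)
}

# Usage on simulated marks for two groups of 50
set.seed(123456)
math_m <- runif(50, 10, 100)
math_f <- runif(50, 10, 100)
out <- manual_t_test(math_m, math_f)
out$message
```

Comparing `out` with `t.test(math_m, math_f)` is a simple check that the manual computation agrees with the built-in function.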