Section 17 The Final Problem

In the following example, we will create a sample data set of students with marks in three subjects (Math, Science and Literature). Each student is also classified by sex (Male or Female), by region (5 regions) and by social class (4 classes).

All marks will be sampled from uniform distributions. We will then present descriptive statistics for the sampled data and, finally, conduct simple statistical tests to investigate specific hypotheses.

  • Create a data set with student information

    • 100 students with IDs ID1, ID2, ..., ID100
    • These students come from 5 different regions, R1, ..., R5, with 20 students from each region (in order)
    • They belong to 4 social classes in each region: S1, ..., S4
    • Generate a factor Sex (M, F) with equal proportions within each region
    • Set seed value as 123456
    • Generate Math marks from a uniform distribution for all 100 individuals: range: 10, 100
    • Generate Science marks from a uniform distribution for all 100 individuals: range: 10, 100
    • Generate Literature marks from a uniform distribution for all 100 individuals: range: 20, 95
  • Create a dataframe with variables ID, Sex, Region, Class, Math, Sci, Lit

    • Do we need to convert the character vectors to factors?
    • Check the structure of the data
    • Check the levels of the factors
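The data-creation steps above can be sketched as follows. This is one possible layout, not the only valid one: the object name `students`, the cycling of social classes within each region, and the "10 M then 10 F" ordering of Sex are assumptions made here for illustration, since the exercise only fixes the group sizes.

```r
# Sketch only: the object name `students` and the within-region layout of
# Sex and Class are assumptions, not something the exercise prescribes.
n <- 100

ID     <- paste0("ID", 1:n)                            # "ID1", ..., "ID100"
Region <- rep(paste0("R", 1:5), each = 20)             # 20 students per region, in order
Class  <- rep(paste0("S", 1:4), times = 25)            # S1..S4 cycling: 5 per class per region
Sex    <- rep(rep(c("M", "F"), each = 10), times = 5)  # 10 M then 10 F within each region

set.seed(123456)
Math <- runif(n, min = 10, max = 100)
Sci  <- runif(n, min = 10, max = 100)
Lit  <- runif(n, min = 20, max = 95)

# Since R 4.0.0, data.frame() keeps character vectors as character,
# so the grouping variables are converted to factors explicitly.
students <- data.frame(ID, Sex, Region, Class, Math, Sci, Lit)
grp <- c("Sex", "Region", "Class")
students[grp] <- lapply(students[grp], factor)

str(students)                   # structure of the data
lapply(students[grp], levels)   # levels of each factor
```

Keeping ID as a plain character vector is a deliberate choice in this sketch: it is a label rather than a grouping variable, so a factor with 100 levels would add nothing.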
  • Descriptive Statistics

    • Number of individuals in each level of factors: Sex, Region, Social class
    • Calculate sample descriptive statistics for Math, Science and Literature: mean, sd, cv, se
    • Calculate the estimate of correlation between Math and Science
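A minimal sketch of these descriptive statistics is given below. It rebuilds the data so that it runs on its own; the helper name `desc` and the data layout are assumptions made here, and the coefficient of variation is reported as a percentage (100 × sd / mean), which is one common convention.

```r
# Rebuild the data (layout of Sex and Class within regions is an assumption).
set.seed(123456)
students <- data.frame(
  ID     = paste0("ID", 1:100),
  Sex    = factor(rep(rep(c("M", "F"), each = 10), times = 5)),
  Region = factor(rep(paste0("R", 1:5), each = 20)),
  Class  = factor(rep(paste0("S", 1:4), times = 25)),
  Math   = runif(100, 10, 100),
  Sci    = runif(100, 10, 100),
  Lit    = runif(100, 20, 95)
)

# Number of individuals in each level of each factor
counts <- lapply(students[c("Sex", "Region", "Class")], table)
counts

# Hypothetical helper: mean, sd, cv (in percent) and standard error of the mean
desc <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  c(mean = m, sd = s, cv = 100 * s / m, se = s / sqrt(length(x)))
}

stats <- sapply(students[c("Math", "Sci", "Lit")], desc)  # rows: statistics, columns: subjects
stats

# Pearson correlation between Math and Science
r_math_sci <- cor(students$Math, students$Sci)
r_math_sci
```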
  • Modify the data

    • Sample marks between 10 and 20, with sample size equal to the number of individuals in social class 1
    • Add them to the Math marks of all individuals in social class 1
    • Check whether any marks exceed 100. If so, reset them to 100
    • Sample marks between 5 and 20, with sample size equal to the number of individuals in social classes 2 and 4
    • Add them to the Science marks of all individuals in social classes 2 and 4
    • Check whether any marks exceed 100. If so, reset them to 100
  • Incorporate missing information

    • The Region information is not available for the following individuals: ID20, ID40, ID60, ID80, ID100
    • The Math marks data are not available for the following individuals: ID5, ID25, ID45, ID65, ID85
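The modification and missing-data steps might look like the sketch below. It assumes the extra marks are also drawn from a continuous uniform distribution (the exercise only gives the ranges) and rebuilds the data with an assumed layout so that the block runs on its own.

```r
# Rebuild the data (layout of Sex and Class within regions is an assumption).
set.seed(123456)
students <- data.frame(
  ID     = paste0("ID", 1:100),
  Sex    = factor(rep(rep(c("M", "F"), each = 10), times = 5)),
  Region = factor(rep(paste0("R", 1:5), each = 20)),
  Class  = factor(rep(paste0("S", 1:4), times = 25)),
  Math   = runif(100, 10, 100),
  Sci    = runif(100, 10, 100),
  Lit    = runif(100, 20, 95)
)

# Extra Math marks (10 to 20) for social class 1, capped at 100 with pmin()
in_s1 <- students$Class == "S1"
students$Math[in_s1] <- pmin(students$Math[in_s1] + runif(sum(in_s1), 10, 20), 100)

# Extra Science marks (5 to 20) for social classes 2 and 4, capped at 100
in_s24 <- students$Class %in% c("S2", "S4")
students$Sci[in_s24] <- pmin(students$Sci[in_s24] + runif(sum(in_s24), 5, 20), 100)

# Missing Region for ID20, ID40, ..., ID100
no_region <- paste0("ID", seq(20, 100, by = 20))
students$Region[students$ID %in% no_region] <- NA

# Missing Math marks for ID5, ID25, ..., ID85
no_math <- paste0("ID", seq(5, 85, by = 20))
students$Math[students$ID %in% no_math] <- NA

summary(students[c("Region", "Math", "Sci")])
```

Using `pmin()` caps the marks in one vectorised step; an explicit check such as `students$Math[which(students$Math > 100)] <- 100` would give the same result.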
  • Statistical test

    • Check the help page to conduct an unpaired t-test
    • ?t.test
    • Assess if the mean Math marks are different between two sexes
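As a hedged illustration of the test above, the sketch below runs `t.test()` with the formula interface on a reduced version of the data (only Sex and Math, with the five missing Math marks). The data layout and the object names `res` and `res_pooled` are assumptions made here. By default `t.test()` is unpaired and uses the Welch correction for unequal variances; `var.equal = TRUE` gives the classical pooled-variance test.

```r
# Reduced data: Sex and Math only (within-region ordering of Sex is an assumption).
set.seed(123456)
students <- data.frame(
  ID   = paste0("ID", 1:100),
  Sex  = factor(rep(rep(c("M", "F"), each = 10), times = 5)),
  Math = runif(100, 10, 100)
)
students$Math[students$ID %in% paste0("ID", seq(5, 85, by = 20))] <- NA

# Unpaired (two-sample) t-test; rows with missing Math are dropped.
# Factor levels are sorted alphabetically, so the difference is F minus M.
res <- t.test(Math ~ Sex, data = students)                           # Welch test
res_pooled <- t.test(Math ~ Sex, data = students, var.equal = TRUE)  # pooled variance

res$statistic   # t statistic
res$p.value     # two-sided p-value
res$conf.int    # 95% confidence interval for the difference in means
```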
  • Descriptive Statistics

    • Number of individuals in each combination of levels of the factors Sex and Region
    • Calculate mean, sd, cv, se of Math for each level of the factors Sex, Region and Social class
    • Calculate the correlation between Math, Sci and Lit
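The grouped summaries can be sketched as below, on data that already include the extra marks and the missing values. The helper names `desc` and `by_factor`, the data layout, and the choice of pairwise-complete observations for the correlations are assumptions made for this illustration.

```r
# Rebuild, modify and blank out the data (layout of Sex and Class is an assumption).
set.seed(123456)
students <- data.frame(
  ID     = paste0("ID", 1:100),
  Sex    = factor(rep(rep(c("M", "F"), each = 10), times = 5)),
  Region = factor(rep(paste0("R", 1:5), each = 20)),
  Class  = factor(rep(paste0("S", 1:4), times = 25)),
  Math   = runif(100, 10, 100),
  Sci    = runif(100, 10, 100),
  Lit    = runif(100, 20, 95)
)
s1  <- students$Class == "S1"
s24 <- students$Class %in% c("S2", "S4")
students$Math[s1] <- pmin(students$Math[s1] + runif(sum(s1), 10, 20), 100)
students$Sci[s24] <- pmin(students$Sci[s24] + runif(sum(s24), 5, 20), 100)
students$Region[seq(20, 100, by = 20)] <- NA   # rows 20, 40, ... are ID20, ID40, ...
students$Math[seq(5, 85, by = 20)]     <- NA   # rows 5, 25, ... are ID5, ID25, ...

# Counts for each Sex x Region combination, showing missing regions as well
sex_by_region <- table(students$Sex, students$Region, useNA = "ifany")
sex_by_region

# Hypothetical helpers: summary of one vector, and of Math within one factor
desc <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x), sd = sd(x), cv = 100 * sd(x) / mean(x), se = sd(x) / sqrt(length(x)))
}
by_factor <- function(f) t(sapply(split(students$Math, students[[f]]), desc))

math_by <- lapply(c(Sex = "Sex", Region = "Region", Class = "Class"), by_factor)
math_by

# Correlation matrix, using all complete pairs for each pair of subjects
cors <- cor(students[c("Math", "Sci", "Lit")], use = "pairwise.complete.obs")
cors
```

Note that `split()` drops individuals whose Region is missing, so they do not contribute to the per-region summaries.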
  • Conduct an unpaired t-test to test whether the mean Math marks differ between the two sexes (see ?t.test)

  • Manual computation

    • Write a function that will compute the following and return the result as a list
    • Obtain mean, sd, cv, se in each group
    • Calculate the t statistic, p-value and confidence interval (CI)
    • Add a message: mean difference is statistically significant / not significant
    • Make a list with mean, sd of each group, t statistic, message
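One way to approach the manual computation is sketched below. The function name `manual_t_test` and its arguments are hypothetical, and the sketch assumes the Welch (unequal-variance) form of the test so that its output can be checked against the default behaviour of `t.test()`; a pooled-variance version would use a different standard error and degrees of freedom.

```r
# Hypothetical helper: Welch two-sample t-test computed by hand.
manual_t_test <- function(x, y, alpha = 0.05) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  desc <- function(v) {
    c(mean = mean(v), sd = sd(v), cv = 100 * sd(v) / mean(v), se = sd(v) / sqrt(length(v)))
  }
  dx <- desc(x)
  dy <- desc(y)

  diff    <- dx[["mean"]] - dy[["mean"]]
  se_diff <- sqrt(dx[["se"]]^2 + dy[["se"]]^2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- se_diff^4 / (dx[["se"]]^4 / (length(x) - 1) + dy[["se"]]^4 / (length(y) - 1))

  t_stat <- diff / se_diff
  p      <- 2 * pt(-abs(t_stat), df)                            # two-sided p-value
  ci     <- diff + c(-1, 1) * qt(1 - alpha / 2, df) * se_diff   # CI for the mean difference

  msg <- if (p < alpha) {
    "mean difference is statistically significant"
  } else {
    "mean difference is not statistically significant"
  }
  list(group1 = dx, group2 = dy, t = t_stat, df = df,
       p.value = p, conf.int = ci, message = msg)
}

# Usage on a reduced version of the data (layout of Sex is an assumption)
set.seed(123456)
students <- data.frame(
  Sex  = factor(rep(rep(c("M", "F"), each = 10), times = 5)),
  Math = runif(100, 10, 100)
)
students$Math[seq(5, 85, by = 20)] <- NA

out <- manual_t_test(students$Math[students$Sex == "F"],
                     students$Math[students$Sex == "M"])
out$message
```

Passing the F group first mirrors the alphabetical ordering of factor levels used by `t.test(Math ~ Sex, data = students)`, so the sign of the t statistic and the confidence interval are directly comparable.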