Chapter 16 Stats exercises

Please use your "Exercises.R" script for this exercise and the main workshop directory as the working directory. Ensure you are using annotations and code sections to keep the contents clear and separated.

16.1 Data

For this exercise we will be looking at the World Happiness Report for 2016. This report is a survey of the state of global happiness.

Download world_happiness_report_2016.csv to the directory "Chapter_15-16".

Source: https://www.kaggle.com/unsdsn/world-happiness?select=2016.csv

The columns for the data are:

  • Country: Name of the country.
  • Region: Region the country belongs to.
  • Happiness Rank: Rank of the country based on the Happiness Score.
  • Happiness Score: A metric measured in 2016 by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest".
  • Lower Confidence Interval: Lower Confidence Interval of the Happiness Score.
  • Upper Confidence Interval: Upper Confidence Interval of the Happiness Score.
  • Economy (GDP per Capita): The extent to which GDP contributes to the calculation of the Happiness Score.
  • Family: The extent to which Family contributes to the calculation of the Happiness Score.
  • Health (Life Expectancy): The extent to which Life expectancy contributed to the calculation of the Happiness Score.
  • Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.
  • Trust (Government Corruption): The extent to which Perception of Corruption contributes to Happiness Score.
  • Generosity: The extent to which Generosity contributed to the calculation of the Happiness Score.
  • Dystopia Residual: The extent to which Dystopia Residual contributed to the calculation of the Happiness Score. More info in the "What is Dystopia?" section at https://worldhappiness.report/faq/.

16.2 World happiness challenges

With that information carry out the following challenges and answer the questions.

Solutions are in the expandable boxes. Try your best to solve each challenge but use the solutions for help if you would like. Even if your method works it can be good to check the solution as there are many ways to do the same thing in R.

World happiness challenge 1

Read in the file "world_happiness_report_2016" as data frame called "happy_df". Ensure strings are read in as factors and that none of the columns are read in as row names.

happy_df <- read.csv("Chapter_15-16/world_happiness_report_2016.csv", 
                      check.names = FALSE,
                      stringsAsFactors = TRUE,
                      sep = ","
                      )

World happiness challenge 2

Answer the following questions using the output from one function:

How many countries are in the region "Western Europe"?
What is the maximum number in the "Happiness Score" column?
What is the minimum value in the "Dystopia Residual" column?
From the columns "Economy (GDP per Capital)", "Freedom" , and "Dystopia Residual", which has the highest mean value?
From the columns "Economy (GDP per Capital)", "Freedom" , and "Dystopia Residual", which has the lowest mean value?
summary(happy_df)
##         Country                                Region   Happiness Rank   Happiness Score Lower Confidence Interval Upper Confidence Interval
##  Afghanistan:  1   Sub-Saharan Africa             :38   Min.   :  1.00   Min.   :2.905   Min.   :2.732             Min.   :3.078            
##  Albania    :  1   Central and Eastern Europe     :29   1st Qu.: 40.00   1st Qu.:4.404   1st Qu.:4.327             1st Qu.:4.465            
##  Algeria    :  1   Latin America and Caribbean    :24   Median : 79.00   Median :5.314   Median :5.237             Median :5.419            
##  Angola     :  1   Western Europe                 :21   Mean   : 78.98   Mean   :5.382   Mean   :5.282             Mean   :5.482            
##  Argentina  :  1   Middle East and Northern Africa:19   3rd Qu.:118.00   3rd Qu.:6.269   3rd Qu.:6.154             3rd Qu.:6.434            
##  Armenia    :  1   Southeastern Asia              : 9   Max.   :157.00   Max.   :7.526   Max.   :7.460             Max.   :7.669            
##  (Other)    :151   (Other)                        :17                                                                                       
##  Economy (GDP per Capita)     Family       Health (Life Expectancy)    Freedom       Trust (Government Corruption)   Generosity    
##  Min.   :0.0000           Min.   :0.0000   Min.   :0.0000           Min.   :0.0000   Min.   :0.00000               Min.   :0.0000  
##  1st Qu.:0.6702           1st Qu.:0.6418   1st Qu.:0.3829           1st Qu.:0.2575   1st Qu.:0.06126               1st Qu.:0.1546  
##  Median :1.0278           Median :0.8414   Median :0.5966           Median :0.3975   Median :0.10547               Median :0.2225  
##  Mean   :0.9539           Mean   :0.7936   Mean   :0.5576           Mean   :0.3710   Mean   :0.13762               Mean   :0.2426  
##  3rd Qu.:1.2796           3rd Qu.:1.0215   3rd Qu.:0.7299           3rd Qu.:0.4845   3rd Qu.:0.17554               3rd Qu.:0.3119  
##  Max.   :1.8243           Max.   :1.1833   Max.   :0.9528           Max.   :0.6085   Max.   :0.50521               Max.   :0.8197  
##                                                                                                                                    
##  Dystopia Residual
##  Min.   :0.8179   
##  1st Qu.:2.0317   
##  Median :2.2907   
##  Mean   :2.3258   
##  3rd Qu.:2.6646   
##  Max.   :3.8377   
## 

World happiness challenge 3

Create the below plot.

  • The equation does not have to be in exactly the same position as the below. Just make sure it is all clearly visible, it contains the same text, and it is not on top of any points.
  • The colour for the line is "mediumslateblue"
  • Note you can put "\n" into a label/title to make a new line.
  • e.g "This is the first line \n and this is the second line"

#Fit linear model of Economy (x) against Family (y)
fit_economy_family <- 
  lm(Family~`Economy (GDP per Capita)`, data = happy_df)
#Create string for linear equation
c <- round(fit_economy_family$coefficients[1], digits = 4)
m <- round(fit_economy_family$coefficients[2], digits = 4)
lm_equation <- paste0("y = ", m , "x + ", c)
#Produce plot
plot(x = happy_df$`Economy (GDP per Capita)`,
     y = happy_df$Family,
     main = "Comparison of the effect of \n Economy and Family on happiness",
     xlab = "Economy (GDP per Capita)",
     ylab = "Family",
     pch = as.numeric(happy_df$Region),
     xlim = c(0,2), ylim = c(0,1.5),
     col = 1)
#Add abline
abline(fit_economy_family, col = "mediumslateblue")
#Add equation to top right
text(x = 0.4, y = 1.1, labels = lm_equation)
#Add the legend
legend(x = "bottomright",
       pch = 1:nlevels(happy_df$Region),
       legend = levels(happy_df$Region))

World happiness challenge 4

It may be a bit awkward to get the right size for the plot in the "Plots" pane. Therefore save it as a png file with the following options:

  • File name of "Economy_vs_family.png"
  • Dimensions of file is 8 inches for height and width
  • Resolution of 200dpi
#PNG command
png(filename = "Chapter_15-16/Economy_vs_family.png", 
    units = "in", height = 8, width = 8, res = 200)
#Produce plot
plot(x = happy_df$`Economy (GDP per Capita)`,
     y = happy_df$Family,
     main = "Comparison of the effect of \n Economy and Family on happiness",
     xlab = "Economy (GDP per Capita)",
     ylab = "Family",
     pch = as.numeric(happy_df$Region),
     xlim = c(0,2), ylim = c(0,1.5),
     col = 1)
#Add abline
abline(fit_economy_family, col = "mediumslateblue")
#Add equation to top right
text(x = 0.4, y = 1.1, labels = lm_equation)
#Add the legend
legend(x = "bottomright",
       pch = 1:nlevels(happy_df$Region),
       legend = levels(happy_df$Region))
#dev.off
dev.off()

World happiness challenge 5

Answer the following questions using information on the line graph and the linear model.

Does the linear model have a positive or negative gradient?
Which variable (Economy or Family) has higher values?
If the value of Economy was 2.1 what would be the predictive value of Family according to the linear model equation?
Which region appears to have the highest values for Economy and for Family?

World happiness challenge 6

Create the following boxplot.

  • You can change the the levels with the following code
short_region_names <- 
  c("OZ_n_NZ", "C_n_E_Europe", "E_Asia", "L_America_n_Caribbean",
    "Middle_E_n_N_Africa", "N_America", "SE_Asia", "S_Asia",
    "Sub_Saharan", "W_Europe")
levels(happy_df$Region) <- short_region_names

#Change level names
short_region_names <- 
  c("OZ_n_NZ", "C_n_E_Europe", "E_Asia", "Latin_A_n_Carib",
    "ME_n_N_Africa", "N_America", "SE_Asia", "S_Asia",
    "Sub_Saharan", "W_Europe")
levels(happy_df$Region) <- short_region_names
#Create boxplot
boxplot(`Happiness Score`~Region, data = happy_df, 
        ylim = c(0,10),
        col = 1:nlevels(happy_df$Region),
        main = "Happiness scores across regions"
)

World happiness challenge 7

Save the above as a jpeg called "Region_happiness_boxplots.jpg" with a width of 1600 pixels and a height of 750 pixels.

#jpeg
jpeg(filename = "Chapter_15-16/Region_happiness_boxplots.jpg", units = "px",
     width = 1600, height = 750 )
#Create boxplot
boxplot(`Happiness Score`~Region, data = happy_df, 
        ylim = c(0,10),
        col = 1:nlevels(happy_df$Region),
        main = "Happiness scores across regions"
)
#dev.off
dev.off()

World happiness challenge 8

Last challenge! Carry out t-tests on the following comparisons for Happiness Score and determine if the means are significantly different.

  • West Europe and North America
  • West Europe and South Asia
  • South Asia and South Eastern Asia
#Subset the data frames to get vectors of our regions of interest
WE_happiness <- happy_df[happy_df$Region == "W_Europe","Happiness Score"]
NA_happiness <- happy_df[happy_df$Region == "N_America","Happiness Score"] 
SA_happiness <- happy_df[happy_df$Region == "S_Asia","Happiness Score"] 
SEA_happiness <- happy_df[happy_df$Region == "SE_Asia","Happiness Score"]
#Carry out t-tests
WE_NA_ttest <- t.test(WE_happiness, NA_happiness)
WE_SA_ttest <- t.test(WE_happiness, SA_happiness)
SA_SEA_ttest <- t.test(SA_happiness, SEA_happiness)
#Extract p values into a new vector with element names
region_happiness_pvalues <- c(WE_NA = WE_NA_ttest$p.value,
                              WE_SA = WE_SA_ttest$p.value,
                              SA_SEA = SA_SEA_ttest$p.value)
#Logical to determine if the p-value is less than 0.05
#I.e. the means are significantly different
region_happiness_pvalues < 0.05
##  WE_NA  WE_SA SA_SEA 
##  FALSE   TRUE  FALSE