Indoor Air Analysis

Author

Trevor Harrington

Hypothesis:

The final data set, which contains the additional variables used for this analysis, halves the temporal scale from three decades to fifteen years sampled at five-year intervals (1992, 1997, 2002, 2007). This reduction suggests that year, the predictor in the initial regression analysis (Table C), may prove a weak enough predictive factor that it can largely be set aside, apart from some exploration of its statistical significance.

Layer One: Countries and Resources

Countries with a higher GDP and a larger population have lower percentages of premature deaths caused by indoor pollution from biofuels. Significant differences in the percentage of premature deaths caused by indoor pollution from biofuels are also expected across regions, even after controlling for GDP, population size, and other relevant factors.

  • Higher GDP may be associated with greater access to alternative fuels and cleaner cooking technologies, which can reduce indoor pollution levels. Although population size may correlate positively with premature deaths from indoor pollution, this study aims to show that a GDP above the global average (13,000 $USD) represents economic growth and is accompanied by a reduction in the percentage of premature deaths from indoor air pollution (IAP).
  • This hypothesis assumes that countries with higher GDP and larger population size may have more resources and infrastructure to invest in clean cooking and heating technologies, leading to lower levels of indoor pollution and fewer related premature deaths.
    • To reduce the impact of neglecting the year variable, it may be useful to run an analysis that specifically tests the significance of the relationship between percent deaths and year, or alternatively to reduce the number of years being considered.
  • This hypothesis assumes that regional factors such as cultural norms, access to alternative fuels, and air quality regulations may affect indoor pollution levels and related premature deaths, and that these factors may differ between regions even after controlling for other variables.
  • If this hypothesis shows a strong correlation between countries in different climate regions for a single year, it may be useful to extend the comparison over the 30-year time span. If no correlation is found, it may not be worth further consideration.
  • Temperate regions are more likely to invest in indoor heating, which often involves burning biofuels in appliances such as wood stoves. Poor ventilation is also likely to contribute to temperate regions being associated with a greater percentage of deaths from IAP. Further analysis could examine exposure levels by region and even determine whether years following a cold winter show higher rates of exposure-related premature deaths.

Null Hypothesis: There is no significant association between a country’s GDP or population size and the percentage of premature deaths caused by indoor pollution from biofuels. Furthermore, there are no significant regional differences in the percentage of premature deaths caused by indoor pollution from biofuels after controlling for GDP, population size, and other relevant factors.
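
The second half of the null hypothesis (no regional differences once GDP and population are controlled for) can be tested by comparing two nested linear models. The following is a minimal sketch in base R on simulated data; the data frame `toy`, its column values, and the effect sizes are assumptions made up for illustration and are not taken from the study.

```r
# Simulated stand-in data; every value below is an illustrative assumption.
set.seed(1)
n <- 300
toy <- data.frame(
  region     = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gdp_percap = runif(n, 500, 40000),
  pop        = runif(n, 1e5, 1e8)
)

# Build a response with a GDP effect and a genuine regional offset.
region_shift <- unname(c(A = 0, B = 3, C = -2)[as.character(toy$region)])
toy$percent_iap <- 10 - 0.0002 * toy$gdp_percap + region_shift + rnorm(n)

# Reduced model: GDP and population only.
fit_reduced <- lm(percent_iap ~ gdp_percap + pop, data = toy)
# Full model: adds region on top of the same controls.
fit_full <- lm(percent_iap ~ gdp_percap + pop + region, data = toy)

# F-test of the null "region adds nothing once GDP and population are controlled for".
region_test <- anova(fit_reduced, fit_full)
print(region_test)
```

A small p-value in the second row of the printed table would argue against the null hypothesis for region; the same comparison could be run on the merged data once it is built.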


Hypothesis Testing

Code
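library(tidyverse)   # dplyr, tidyr, ggplot2
library(janitor)     # clean_names()
library(gapminder)   # gapminder data set
library(countrycode) # countrycode()
library(tidymodels)  # linear_reg(), recipe(), workflow()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# gapminder provides life expectancy, population, and GDP per capita every five years from 1952 to 2007; can be used to generate unique 
gm_df <- gapminder %>%
  clean_names() %>%
  rename("entity" = "country")

# Generate a new df by joining the gapminder and indoor pollution data frames
merged_df <- indoor_pollution %>%
  left_join(gm_df) %>%
  drop_na()  # keep only rows that have a match in both data frames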
Joining with `by = join_by(entity, year)`

Single Variable Visualization and Linear Regression

1.1) Percent Deaths and Year

A Bland-Altman plot was previously run for exploratory graph B which, despite drawing on a limited number of variables, showed an interesting distribution of results that suggested some value in revisiting it with the full data set. This type of graph compares two measurements by plotting the difference between the two values against their mean (the code below places the 2015 value, rather than the mean, on the x-axis). Here it compares the percentage of deaths attributed to indoor air pollution in each country in 2015 with the percentage in 1995 to see how much variation exists. A large cluster of data points is likely around (0,0), representing countries that had a very low percentage of deaths from indoor air pollution in 1995 and did not change dramatically by 2015. Values above 0 on the y-axis represent countries where the percentage of deaths caused by indoor air pollution increased between 1995 and 2015, while values below 0 represent decreases. Considering the exploratory test, most data points are likely to fall below the line, which corresponds to the negative slope generated in Table C (-0.113).
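
The quantities behind a classic Bland-Altman plot can be computed in a few lines. This is a minimal sketch in base R; the two vectors `y1995` and `y2015` hold invented values for six hypothetical countries and are not drawn from the indoor pollution data.

```r
# Toy paired measurements (illustrative values, not from the data set).
y1995 <- c(0.1, 0.5, 4.0, 9.0, 15.0, 20.0)
y2015 <- c(0.1, 0.3, 2.5, 6.0, 11.0, 16.0)

ba <- data.frame(
  mean_value = (y1995 + y2015) / 2,  # x-axis of a classic Bland-Altman plot
  difference = y2015 - y1995         # y-axis: negative values mean a decline
)

# Bias is the average change across all pairs.
bias <- mean(ba$difference)
# Conventional 95% limits of agreement around the bias.
limits <- bias + c(-1.96, 1.96) * sd(ba$difference)

print(ba)
print(round(c(bias = bias, lower = limits[1], upper = limits[2]), 2))
```

Mapping `mean_value` to x and `difference` to y, with horizontal lines at the bias and the two limits, gives the usual plot; points near (0,0) are pairs that were low in both years.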

Running a second Bland-Altman plot that separates this result by region may be valuable for understanding whether some regions are improving more than others. These results will likely show the developed regions with very few values away from the (0,0) coordinates, and a wide spread for the South Asia, Sub-Saharan Africa, and East Asia & Pacific regions.

A regression analysis may be beneficial to see how this estimate differs from the one obtained in Table C. Assuming the value does not change significantly, year would likely not have been a strong predictive variable for the scope of this analysis even if adding the GapMinder data had not reduced the number of observational years.

Code
# Bland-Altman-style data: pair each country's 1995 and 2015 values to measure how much they changed.
BA_plot <- indoor_pollution %>%
  rename("percent_iap" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  mutate(region = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region)) %>%  # drop aggregates (e.g. "World") that have no region match
  pivot_wider(names_from = year, values_from = percent_iap, names_prefix = "Y") %>%  # one column per year: Y1990, Y1991, ...
  mutate(current = Y2015,
         change = Y2015 - Y1995)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `region = countrycode(entity, origin = "country.name",
  destination = "region")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: Africa, African Region, African Union, America, Andean Latin America, Asia, Australasia, Caribbean, Central Asia, Central Europe, Central Europe, Eastern Europe, and Central Asia, Central Latin America, Central sub-Saharan Africa, Commonwealth, Commonwealth High Income, Commonwealth Low Income, Commonwealth Middle Income, East Asia, East Asia & Pacific - World Bank region, Eastern Europe, Eastern Mediterranean Region, Eastern sub-Saharan Africa, England, Europe, Europe & Central Asia - World Bank region, European Region, European Union, G20, High-income, High-income Asia Pacific, High-income North America, High-middle SDI, High SDI, Latin America & Caribbean - World Bank region, Low-middle SDI, Low SDI, Micronesia (country), Middle East & North Africa, Middle SDI, Nordic Region, North Africa and Middle East, North America, Northern Ireland, Oceania, OECD Countries, Region of the Americas, Scotland, South-East Asia Region, South Asia - World Bank region, Southeast Asia, Southeast Asia, East Asia, and Oceania, Southern Latin America, Southern sub-Saharan Africa, Sub-Saharan Africa - World Bank region, Timor, Tropical Latin America, Wales, Western Europe, Western Pacific Region, Western sub-Saharan Africa, World, World Bank High Income, World Bank Low Income, World Bank Lower Middle Income, World Bank Upper Middle Income
Code
BA_plot %>%
  
  ggplot(aes(current, change)) +
    geom_point()+
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Code
indoor_pollution %>%
  
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region))
# A tibble: 6,060 × 5
   entity      code   year percent_IAP region    
   <chr>       <chr> <dbl>       <dbl> <chr>     
 1 Afghanistan AFG    1990        19.6 South Asia
 2 Afghanistan AFG    1991        19.3 South Asia
 3 Afghanistan AFG    1992        19.5 South Asia
 4 Afghanistan AFG    1993        19.7 South Asia
 5 Afghanistan AFG    1994        19.4 South Asia
 6 Afghanistan AFG    1995        19.6 South Asia
 7 Afghanistan AFG    1996        19.8 South Asia
 8 Afghanistan AFG    1997        19.7 South Asia
 9 Afghanistan AFG    1998        19.0 South Asia
10 Afghanistan AFG    1999        19.9 South Asia
# … with 6,050 more rows
Code
model_region_temp <- linear_reg() %>%
  set_engine("lm")  # construct model instance

model_region_recipe <- recipe(deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent~year, data = indoor_pollution)

model_region <-
  workflow() %>%
  add_model(model_region_temp) %>%
  add_recipe(model_region_recipe)  # combine the model and recipe to generate a regression analysis

model_region_fit <- model_region %>% fit(indoor_pollution)

model_region_fit %>%
  glance() %>%
  kable(digits = c(4, 4, 2, 4, 0, 0, 2, 2, 2, 2, 0, 0)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
r.squared  adj.r.squared  sigma  statistic  p.value  df     logLik       AIC       BIC  deviance  df.residual  nobs
   0.0285         0.0284   5.47   235.1636        0   1   -24978.2  49962.39  49983.36  239742.8         8008  8010
Code
model_region_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
term            estimate   std.error  statistic  p.value
(Intercept)  222.3923682  14.1584405   15.70741        0
year          -0.1083154   0.0070633  -15.33505        0

1.2) Percent Deaths and Region

Code
model_data <- indoor_pollution %>%
  
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region))
Code
model_region_temp <- linear_reg() %>%
  set_engine("lm")  # construct model instance

model_region_recipe <- recipe(percent_IAP~region, data = model_data)

model_region <-
  workflow() %>%
  add_model(model_region_temp) %>%
  add_recipe(model_region_recipe)  # combine the model and recipe to generate a regression analysis

model_region_fit <- model_region %>% fit(model_data)

model_region_fit %>%
  glance() %>%
  kable(digits = c(4, 4, 2, 4, 0, 0, 2, 2, 2, 2, 0, 0)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
r.squared  adj.r.squared  sigma  statistic  p.value  df     logLik       AIC       BIC  deviance  df.residual  nobs
   0.4601         0.4595   4.19    859.604        0   6  -17272.38  34560.77  34614.44  106090.1         6053  6060
Code
model_region_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable(digits= 4) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
term                              estimate  std.error  statistic  p.value
(Intercept)                         6.7929     0.1292    52.5772        0
regionEurope & Central Asia        -5.2048     0.1671   -31.1450        0
regionLatin America & Caribbean    -3.7370     0.1827   -20.4525        0
regionMiddle East & North Africa   -5.2642     0.2110   -24.9514        0
regionNorth America                -6.6191     0.4598   -14.3951        0
regionSouth Asia                    5.8436     0.2995    19.5088        0
regionSub-Saharan Africa            3.5797     0.1699    21.0700        0

1.3) Percent Deaths and Population

Code
merged_df %>%
  rename(percent_iap = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region)) %>%
  # group by population range; one label per interval between consecutive powers of ten
  group_by(pop = cut(pop, breaks = c(10^(0:9), Inf),
                     labels = c("1-10", "10-100", "100-1K", "1K-10K", "10K-100K", "100K-1M",
                                "1M-10M", "10M-100M", "100M-1B", ">1B"))) %>%
  summarize(avg_percent_iap = mean(percent_iap)) %>%  # average percentage for each population range
  ggplot() +
  geom_col(mapping = aes(x = pop, y = avg_percent_iap, fill = pop)) +  # bar chart, one bar per range
  labs(x = "Population",
       y = "Average Percent Deaths from Indoor Air Pollution",
       title = "Average Percent Deaths from Indoor Air Pollution by Population Size")

Code
model_data <- merged_df %>%
  
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region))

model_region_temp <- linear_reg() %>%
  set_engine("lm")  # construct model instance

# Run a regression analysis, determine if population size is a factor contributing to indoor air pollution deaths. 
## This analysis will be valuable for determining if population has any compounding effect with other variables in further analysis. 

model_region_recipe <- recipe(percent_IAP~pop, data = model_data)

 # combine the recipe with the model to generate a regression analysis
model_region <-
  workflow() %>%
  add_model(model_region_temp) %>% 
  add_recipe(model_region_recipe) 

model_region_fit <- model_region %>% fit(model_data)

model_region_fit %>%
  glance() %>%
  kable(digits = c(4, 4, 2, 4, 0, 0, 2, 2, 2, 2, 0, 0)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
r.squared  adj.r.squared  sigma  statistic  p.value  df    logLik      AIC      BIC  deviance  df.residual  nobs
    0.012         0.0101   5.74     6.3233        0   1  -1658.44  3322.88  3335.66     17215          522   524
Code
model_region_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable(digits = c(4,4,4,4)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
term         estimate  std.error  statistic  p.value
(Intercept)    6.0303     0.2617    23.0401   0.0000
pop            0.0000     0.0000     2.5146   0.0122

1.4) Percent Deaths and GDP

Code
merged_df %>%
  
  rename(percent_IAP = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  ggplot(aes(gdp_percap, percent_IAP)) +  # GDP on x and percent deaths on y, matching the axis labels
  geom_point() +
  geom_smooth() +
  scale_x_log10() +  # log10 axis for GDP per capita
  labs(title = "Indoor Air Pollution vs. GDP",
       x = "GDP per capita ($USD, log10 scale)",
       y = "Percent Premature Deaths")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Code
model_data <- merged_df %>%
  
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region))

model_region_temp <- linear_reg() %>%
  set_engine("lm")  # construct model instance

model_region_recipe <- recipe(percent_IAP~gdp_percap, data = model_data)

model_region <-
  workflow() %>%
  add_model(model_region_temp) %>%
  add_recipe(model_region_recipe)  # combine the model and recipe to generate a regression analysis

model_region_fit <- model_region %>% fit(model_data)

model_region_fit %>%
  glance() %>%
  kable(digits = c(4, 4, 2, 4, 0, 0, 2, 2, 2, 2, 0, 0)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
r.squared  adj.r.squared  sigma  statistic  p.value  df    logLik      AIC      BIC  deviance  df.residual  nobs
   0.5079          0.507   4.05   538.7472        0   1  -1475.82  2957.64  2970.42   8574.23          522   524
Code
model_region_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
term           estimate  std.error  statistic  p.value
(Intercept)   9.8365378  0.2359062   41.69681        0
gdp_percap   -0.0003717  0.0000160  -23.21093        0

1.5) Percent Deaths effect on life expectancy

Code
merged_df %>%
summarize(mean(life_exp))%>%
  print()
# A tibble: 1 × 1
  `mean(life_exp)`
             <dbl>
1             65.4
Code
merged_df %>%
  
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region)) %>%
  
  ggplot(aes(percent_IAP,
            life_exp)) +
  geom_point() +  # one point per country-year observation
  geom_smooth() +
  labs(title = "Life Expectancy vs. Indoor Air Pollution",
       x = "Percent Premature Deaths",
       y = "Life Expectancy (years)")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Code
model_data <- merged_df %>%
  
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region))

model_region_temp <- linear_reg() %>%
  set_engine("lm")  # construct model instance

model_region_recipe <- recipe(life_exp~percent_IAP, data = model_data)

model_region <-
  workflow() %>%
  add_model(model_region_temp) %>%
  add_recipe(model_region_recipe)  # combine the model and recipe to generate a regression analysis

model_region_fit <- model_region %>% fit(model_data)

model_region_fit %>%
  glance() %>%
  kable(digits = c(4, 4, 2, 4, 0, 0, 2, 2, 2, 2, 0, 0)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
r.squared  adj.r.squared  sigma  statistic  p.value  df    logLik      AIC      BIC  deviance  df.residual  nobs
   0.5767         0.5759   7.69   711.1291        0   1  -1811.22  3628.44  3641.22  30842.82          522   524
Code
model_region_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable(digits =c(4,4,4,4)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
term         estimate  std.error  statistic  p.value
(Intercept)   75.0811     0.4938   152.0376        0
percent_IAP   -1.5529     0.0582   -26.6670        0

Discussion

The analyses conducted provide further understanding of the observations made during the exploratory analysis phase. Every single-variable regression was statistically significant (p < 0.001 for year, region, GDP, and life expectancy; p = 0.012 for population), although the strength of correlation varied considerably. Examining the implications of these findings will be valuable for interpreting the multiple-variable analysis, which can provide more insight into the importance and cumulative impact of these connections. In this study, the association between the percentage of deaths caused by indoor air pollution (IAP) and three independent variables, namely GDP, year, and region, was examined separately to understand their distinct contributions. The hypothesis under examination posited that GDP had the greatest influence on the proportion of deaths caused by IAP. In the regression output above, GDP per capita explains 50.8% of the variation in IAP deaths (R^2 = 0.5079, n = 524), slightly more than region (46.0%, R^2 = 0.4601, n = 6,060), although the two models were fit to different samples and are not directly comparable.

Sections 1.3, 1.4, and 1.5 use the merged data set, which combines the original indoor pollution data with an additional data set provided by the “gapminder” package. After the join, the merged data contains observations from 1992, 1997, 2002, and 2007 and adds several variables; GDP per capita, population size, and average life expectancy are the ones this analysis uses to draw additional inferences from the original data frame. While this data is valuable, it reduces the total observations from 8010 to 524. Whenever these additional variables are not needed for a specific test, the indoor pollution data frame is used.
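
The drop in observations follows from how few years the two sources share. The sketch below, in base R, lists the overlapping years and runs a toy join. The gapminder schedule (every five years from 1952 to 2007) is a property of that package's data; the 1990 to 2019 range for the pollution data and the toy tables `pollution` and `gm` are assumptions for illustration.

```r
# gapminder reports every five years from 1952 to 2007.
gapminder_years <- seq(1952, 2007, by = 5)
# Assumption: the indoor pollution data is annual from 1990 to 2019.
pollution_years <- 1990:2019

# Only years present in both sources can survive the join.
shared_years <- intersect(gapminder_years, pollution_years)
print(shared_years)

# Toy join: rows are kept only when the (entity, year) key appears in both tables.
pollution <- data.frame(entity = "Afghanistan", year = 1990:1999)
gm <- data.frame(entity = "Afghanistan", year = gapminder_years, pop = 1)
joined <- merge(pollution, gm, by = c("entity", "year"))
print(joined)
```

In the toy join, ten annual rows shrink to the two years that both tables contain, which is the same mechanism that reduces the full data set.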

Section 1.1 According to Table C in the exploratory analysis section, the year variable explained only 3.0% of the variation in premature deaths related to indoor air pollutants. When the analysis was conducted again using the entire data set, the r^2 value decreased slightly to 2.85% (p < 0.001). This could be because the oldest data included in the analysis dates back only to 1990, a period when modern medicine and the availability of drugs were already having a significant impact on reducing deaths from treatable conditions and infections worldwide. To further investigate the correlation between indoor air pollution deaths and year, a multiple-variable regression that includes region could be conducted. This could reveal that the regions with the highest percentage of deaths (South Asia, Sub-Saharan Africa, East Asia & Pacific) are improving annually at a faster rate than regions where the percentage of deaths was already relatively low in 1990.

Section 1.2 One of the main hypotheses tested in this analysis concerns the correlation between indoor air pollution (IAP) deaths and two key factors: region and GDP. Exploratory analysis graphs C and D highlighted a significant disparity in the average percentage of deaths caused by IAP in countries located in regions such as Sub-Saharan Africa, South Asia, and East Asia & Pacific. The regression analysis conducted in this study found that region alone predicted 46.0% (R^2 = 0.4601, p < 0.001) of the variation in the percentage of premature deaths caused by IAP. This result is consistent with the hypothesis of the analysis, which proposed that region and GDP would be the two most important factors in understanding the global distribution of IAP deaths.

Section 1.3 Population was found to have the smallest effect on the percentage of deaths caused by indoor air pollution (r^2 = 0.012, p = 0.012). This is likely because deaths are expressed as a percentage of all deaths, so unless a growing population reduces the quality of life shared by the whole population, the percentage should be relatively unaffected. The estimate rounds to 0.0000 at four decimal places because population is measured in individual people, so the change per additional person is tiny. The relationship is statistically detectable but explains only about 1% of the variation, which suggests that population on its own has little practical value as a linear predictor of percent premature deaths.
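
An estimate that displays as 0.0000 can simply reflect the units of the predictor. The sketch below, in base R on simulated data, fits the same relationship with population in people and in millions of people; the simulated effect size and the object names are illustrative assumptions, not results from this study.

```r
# Simulated data; the effect size is an illustrative assumption.
set.seed(42)
pop <- round(10^runif(200, 5, 9))           # populations from 100 thousand to 1 billion
percent_iap <- 6 + 2e-9 * pop + rnorm(200)  # tiny per-person slope

fit_raw      <- lm(percent_iap ~ pop)           # slope per additional person
fit_millions <- lm(percent_iap ~ I(pop / 1e6))  # slope per additional million people

slope_raw      <- unname(coef(fit_raw)[2])
slope_millions <- unname(coef(fit_millions)[2])

print(round(slope_raw, 4))        # displays as 0 at four decimal places
print(signif(slope_millions, 3))  # same relationship on a readable scale

# Rescaling changes the units of the slope, not the quality of the fit.
print(all.equal(slope_millions, slope_raw * 1e6))
print(all.equal(summary(fit_raw)$r.squared, summary(fit_millions)$r.squared))
```

Reporting the slope per million people (or regressing on `log10(pop)`) would make the size of the effect visible in the table without changing r^2 or the p-value in the rescaled case.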

Section 1.4 GDP is the second of the main variables investigated in this analysis and is hypothesized to be the most significant predictor of a given country or region’s percentage of premature deaths from IAP. The regression estimate indicates that, for every $100 increase in GDP per capita (the per-dollar estimate multiplied by 100), the percentage of premature deaths caused by indoor pollution decreases by 0.0372 percentage points (p < 0.001, R^2 = 0.5079). The intercept suggests that when GDP per capita is 0, the mean percentage of premature deaths from indoor pollution would be 9.84%. These results provide strong evidence for the hypothesis of this analysis, and the effect is expected to be compounded when controlling for region.
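
The arithmetic behind these figures can be reproduced directly from the two coefficients in the Section 1.4 table. This is a small worked sketch in base R; the helper `predict_iap` is a hypothetical name introduced here, and the 13,000 USD input is the threshold mentioned in the hypothesis.

```r
# Coefficients as reported in the Section 1.4 regression table.
intercept <- 9.8365378
slope     <- -0.0003717  # change in percent deaths per 1 USD of GDP per capita

# Hypothetical helper: fitted percent of premature deaths at a given GDP per capita.
predict_iap <- function(gdp_percap) intercept + slope * gdp_percap

per_100_usd <- slope * 100         # change per 100 USD of GDP per capita
at_13000    <- predict_iap(13000)  # fitted value at the 13,000 USD threshold
zero_point  <- -intercept / slope  # GDP per capita where the fitted line reaches 0

print(per_100_usd)
print(round(at_13000, 2))
print(round(zero_point))
```

The fitted line predicts about 5% at 13,000 USD and crosses zero at roughly 26,500 USD, beyond which a straight line would predict negative percentages; this is one reason a log-transformed GDP term may describe the relationship better at high incomes.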

Section 1.5 was created to consider the role that deaths from indoor air pollution play in life expectancy. The graph generated in this section appears to show that countries suffering from high rates of indoor air pollution deaths also tend to have lower life expectancy, with the fitted trend falling from approximately 75 to 60 years. This graph is supported by the regression analysis, which predicts a decrease in life expectancy of 1.55 years for every one-percentage-point increase in premature deaths from indoor air pollution (p < 0.001, r^2 = 0.577).

This analysis also provided an intercept estimating that, if percent IAP were 0, average life expectancy would rise from the mean (65.43 years) to 75.08 years. This is an estimated increase of nearly 10 years in global lifespans that the model associates with improving the quality of the air inside households and public spaces.
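
The gap between the intercept and the mean can be checked with a short calculation. An ordinary least squares line passes through the point (mean of x, mean of y), so the reported coefficients imply a mean percent IAP for the sample. This sketch in base R assumes the mean life expectancy of 65.43 years was computed on the same rows used to fit the model.

```r
# Values reported in Section 1.5.
intercept <- 75.0811  # fitted life expectancy when percent IAP is 0
slope     <- -1.5529  # years of life expectancy per percentage point of IAP deaths
mean_life <- 65.43    # mean life expectancy (assumed to be from the same sample)

# The OLS line passes through (mean x, mean y), so solve for mean x.
implied_mean_iap <- (mean_life - intercept) / slope
print(round(implied_mean_iap, 2))

# The intercept-to-mean gap equals |slope| times that implied mean.
gap_years <- intercept - mean_life
print(round(gap_years, 2))
```

Under that assumption the implied average is a little over 6% of deaths attributed to indoor air pollution, and the roughly 9.65-year gap is that average multiplied by the slope.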


Multiple Variable Visualization and Linear Regression

1.7) Year and Region

This analysis will determine whether, despite the very small r-squared value for year in both Table C (r^2 = 3.0%, p < 0.001) and Section 1.1 (r^2 = 2.85%, p < 0.001), a stronger correlation emerges when the analysis is run for each region rather than on the global average. Because the time variable was not included in the hypothesis, this is unlikely to be of value for the overall conclusion, but it may provide additional insight into how far improvement over time helps describe the regional trends in premature deaths related to indoor air pollution.
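
Region-specific trends of this kind can be estimated with a year-by-region interaction. The following is a minimal sketch in base R on simulated data; the two region labels, their slopes, and the noise level are assumptions made up for illustration. Year is centred at 1990 so that each intercept is a fitted 1990 value rather than an extrapolation to year 0.

```r
# Simulated panel; regions, slopes, and noise are illustrative assumptions.
set.seed(7)
toy <- expand.grid(year = 1990:2019,
                   region = c("High burden", "Low burden"),
                   stringsAsFactors = TRUE)
true_slope <- ifelse(toy$region == "High burden", -0.30, -0.02)
start      <- ifelse(toy$region == "High burden", 18, 1.5)
toy$percent_iap <- start + true_slope * (toy$year - 1990) + rnorm(nrow(toy), sd = 0.2)

# Centre year at 1990 so the intercepts are the fitted 1990 values.
toy$year_c <- toy$year - 1990

# year_c * region expands to both main effects plus their interaction.
fit <- lm(percent_iap ~ year_c * region, data = toy)

# Region-specific slopes: the baseline slope, and the baseline plus the interaction term.
b <- coef(fit)
slopes <- c("High burden" = unname(b["year_c"]),
            "Low burden"  = unname(b["year_c"] + b["year_c:regionLow burden"]))
print(round(slopes, 2))
```

In this toy example the high-burden region declines much faster per year than the low-burden one, which is the pattern the per-region analysis is meant to detect or rule out.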

Code
indoor_pollution %>%
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  mutate(region = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region)) %>%  # drop aggregates (e.g. "World") that have no region match
  group_by(year, region) %>%
  summarize(mean_deaths = mean(percent_IAP), .groups = "drop") %>%
  # line plot of mean percent deaths per year, colored by region, with a linear trend per region
  ggplot(aes(x = year, y = mean_deaths, color = region)) +
  geom_line(linewidth = 1) +
  labs(title = "Percent Deaths from IAP vs Year, by Region",
       x = "Year",
       y = "Percent deaths caused by indoor air pollution",
       color = "Region") +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Code
# Bland-Altman-style data: pair each country's 1995 and 2015 values to measure how much they changed.
BA_plot <- indoor_pollution %>%
  rename("percent_iap" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  mutate(region = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region)) %>%  # drop aggregates (e.g. "World") that have no region match
  pivot_wider(names_from = year, values_from = percent_iap, names_prefix = "Y") %>%  # one column per year: Y1990, Y1991, ...
  mutate(current = Y2015,
         change = Y2015 - Y1995)
Code
BA_plot %>%
  
  ggplot(aes(current, change)) +
    geom_point()+
  geom_smooth(method = "lm",
              se = FALSE)+
  labs(x = "Current (2015)",
       y = "Change (2015 - 1995)",
       title = "Bland-Altman plot for Change in Percentage Premature Deaths (1995-2015)")
`geom_smooth()` using formula = 'y ~ x'

Code
model_data <- merged_df %>%
  
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region))

model_region_temp <- linear_reg() %>%
  set_engine("lm")  # construct model instance

model_region_recipe <- recipe(percent_IAP~region+year, data = model_data) %>%
  step_interact(~year:starts_with("region"))

model_region <-
  workflow() %>%
  add_model(model_region_temp) %>%
  add_recipe(model_region_recipe)  # combine the model and recipe to generate a regression analysis

model_region_fit <- model_region %>% fit(model_data)

model_region_fit %>%
  glance() %>%
  kable(digits = c(4, 4, 2, 4, 0, 0, 2, 2, 2, 2, 0, 0)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik    AIC      BIC      deviance  df.residual  nobs
0.6067     0.5967         3.67   60.5125    0        13  -1417.11  2864.22  2928.15  6852.98   510          524
Code
model_region_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable(digits =c(4,4,3,3)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
term                                       estimate   std.error  statistic  p.value
(Intercept)                                314.0192   175.210    1.792      0.0737
regionEurope & Central Asia                -182.9784  214.587    -0.853     0.3942
regionLatin America & Caribbean            -0.4776    222.226    -0.002     0.9983
regionMiddle East & North Africa           -10.5247   239.916    -0.044     0.9650
regionNorth America                        -311.0244  495.569    -0.628     0.5305
regionSouth Asia                           -61.8841   319.888    -0.193     0.8467
regionSub-Saharan Africa                   -57.0697   202.315    -0.282     0.7780
year                                       -0.1537    0.088      -1.754     0.0800
`year_x_regionEurope & Central Asia`       0.0889     0.107      0.828      0.4079
`year_x_regionLatin America & Caribbean`   -0.0008    0.111      -0.007     0.9941
`year_x_regionMiddle East & North Africa`  0.0027     0.120      0.023      0.9820
`year_x_regionNorth America`               0.1522     0.248      0.614      0.5393
`year_x_regionSouth Asia`                  0.0350     0.160      0.219      0.8269
`year_x_regionSub-Saharan Africa`          0.0308     0.101      0.304      0.7613

1.8) Percent Deaths and GDP + Region

This section tests how strongly GDP is associated with a lower percentage of premature deaths related to indoor pollution on a region-by-region basis, to see what impact region has on the effectiveness of that wealth. If both assumptions in the hypothesis hold, this graph should show a range of slope values across the regions, which can help describe region-specific indicators of how effectively GDP reduces % IAP deaths.

Code
merged_df %>%
  
   rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate(region= countrycode(entity, origin = "country.name", destination = "region")) %>%
  
  group_by(entity,region) %>%
  summarize(mean_gdp = mean(gdp_percap),
            mean_deaths = mean(percent_IAP)) %>%
  
# create a scatter plot with GDP per capita on the x-axis and deaths caused by air pollution on the y-axis, colored by region
  ggplot(aes(x = mean_gdp, y = mean_deaths, color = region)) +
  geom_point(size = 2) +
  labs(title = "GDP ($USD) per capita vs. Percent Deaths from IAP, by Region",
       x = "GDP per capita",
       y = "Percent deaths caused by air pollution",
       color = "Region") +
  geom_smooth(method = "lm",
              size = .75)
`summarise()` has grouped output by 'entity'. You can override using the
`.groups` argument.
`geom_smooth()` using formula = 'y ~ x'
Warning in qt((1 - level)/2, df): NaNs produced
Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Code
model_data <- merged_df %>%
  
  rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate("region" = countrycode(entity, origin = "country.name", destination = "region")) %>%
  filter(!is.na(region))

model_region_temp <- linear_reg() %>%
  set_engine("lm")  # construct model instance

model_region_recipe <- recipe(percent_IAP~region+gdp_percap, data = model_data)

model_region <-
  workflow() %>%
  add_model(model_region_temp) %>%
  add_recipe(model_region_recipe)  # combine the model and recipe to generate a regression analysis

model_region_fit <- model_region %>% fit(model_data)

model_region_fit %>%
  glance() %>%
  kable(digits = c(4, 4, 2, 4, 0, 0, 2, 2, 2, 2, 0, 0)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik    AIC      BIC      deviance  df.residual  nobs
0.7187     0.7149         3.08   188.342    0        7   -1329.28  2676.57  2714.92  4901.1    516          524
Code
model_region_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable(digits = c(3,3,3,4)) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
term                              estimate  std.error  statistic  p.value
(Intercept)                       9.770     0.460      21.2449    0.000
regionEurope & Central Asia       -2.862    0.528      -5.4186    0.000
regionLatin America & Caribbean   -3.383    0.529      -6.3980    0.000
regionMiddle East & North Africa  -5.177    0.564      -9.1790    0.000
regionNorth America               -0.763    1.227      -0.6217    0.534
regionSouth Asia                  5.422     0.772      7.0191     0.000
regionSub-Saharan Africa          1.861     0.504      3.6891     0.000
gdp_percap                        0.000     0.000      -15.2612   0.000

1.9) Percent IAP vs. GDP by country

Part of the hypothesis for this analysis is that the percentage of premature deaths caused by indoor air pollution correlates significantly with factors that describe an observation, such as region, country, and GDP. A graph of mean GDP by entity (averaged across the four years represented) is expected to show an overall negative slope, with % deaths decreasing as GDP increases. If this graph shows a strong correlation, a regression analysis will help describe the significance of this relationship in statistical terms.

Code
merged_df %>%
  
   rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate(region= countrycode(entity, origin = "country.name", destination = "region")) %>%
  
  group_by(entity) %>%
  summarize(mean_gdp = mean(gdp_percap),
            mean_deaths = mean(percent_IAP)) %>%
  
# create a scatter plot with log mean GDP per capita on the x-axis and mean percent deaths from indoor air pollution on the y-axis, one point per country
  ggplot(aes(x = log(mean_gdp), y = mean_deaths, group = entity)) +
  geom_point(size = 1) +
  geom_smooth() +
  labs(title = "Premature Deaths by Country from Air Pollution vs. GDP",
       y = "Avg Percentage of deaths from Indoor Air Pollution",
       x = "Average GDP per capita ($USD), natural-log scale")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Code
model_data <- merged_df %>%
  
  rename("percent_deaths_by_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
   mutate("region" = countrycode(entity,origin = "country.name",
   destination= "region")) %>%
  filter(!is.na(region))
  
model_region_temp <- linear_reg() %>% 
  set_engine("lm")  #construct model instance

model_region_reg<-
  recipe(percent_deaths_by_IAP~gdp_percap+entity,
         data = model_data)
  #generate a recipe -- what variables do we have in y = mx+b

model_region<-
  workflow() %>%
  add_model(model_region_temp) %>%
  add_recipe(model_region_reg) #combine the model and recipe to generate a regression analysis

model_region_fit <- model_region %>% fit(model_data)

 model_region_fit %>%
  glance() %>% 
  kable(digits=c(4,4,2,4,0,0,2,2,2,2,0,0)) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"))
r.squared  adj.r.squared  sigma  statistic  p.value  df   logLik   AIC      BIC      deviance  df.residual  nobs
0.9709     0.9612         1.14   99.9394    0        131  -734.64  1735.28  2302.05  506.53    392          524

1.10) GDP and Population size, by Country

Code
merged_df %>%
  
   rename("percent_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
  mutate(region = countrycode(entity, origin = "country.name", destination = "region")) %>%
  
  group_by(entity) %>%
  summarize(mean_population = mean(pop),
            mean_deaths = mean(percent_IAP)) %>%
  
# create a scatter plot with log mean population on the x-axis and mean percent deaths from indoor air pollution on the y-axis, one point per country
  ggplot(aes(x = log(mean_population), y = mean_deaths)) +
  geom_point(size = 1) +
  geom_smooth(method = "lm",
              se=FALSE) +
  labs(title = " ",
       x = "Average Population by Country (natural-log scale)",
       y = "Percent Deaths by IAP")
`geom_smooth()` using formula = 'y ~ x'

Code
model_data <- merged_df %>%
  
  rename("percent_deaths_by_IAP" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
   mutate("region" = countrycode(entity,origin = "country.name",
   destination= "region")) %>%
  filter(!is.na(region))
  
model_region_temp <- linear_reg() %>% 
  set_engine("lm")  #construct model instance

model_region_reg<-
  recipe(percent_deaths_by_IAP~gdp_percap+pop,
         data = model_data)
  #generate a recipe -- what variables do we have in y = mx+b

model_region<-
  workflow() %>%
  add_model(model_region_temp) %>%
  add_recipe(model_region_reg) #combine the model and recipe to generate a regression analysis

1.11) Two-Sample t-test: Correlating GDP and Percent IAP by income.

In this analysis, several t-tests are performed to compare the mean percentage of premature deaths caused by indoor air pollution between countries in different ranges of GDP per capita. The ranges use the worldbank.org 2021 thresholds for low, middle, and high income, plus an additional high(er)-income t-test to observe what happens to the t-value and degrees of freedom when all countries included are more likely to have access to better technologies and health care. Although the GDP values are not adjusted for inflation, and likely do not accurately represent purchasing power from 1992-2007, the tests still provide valuable information on how much the percentage changes with increasing GDP.

The t-test determines whether there is a statistically significant difference in the mean percentage of premature deaths caused by indoor air pollution between countries in different GDP per capita ranges. More specifically, the goal of these t-tests is to show that increasing GDP per capita has its largest impact in the lowest-income countries, and that this impact shrinks as the percentage of deaths approaches that of the most developed high-income countries. If this hypothesis is correct, the difference in means between groups should decrease substantially as the lower GDP bound gets closer to that of high-income countries.
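The four comparisons in this section share one pattern: drop rows at or below a lower GDP bound, split the rest at a cutoff, and run a Welch t-test. A minimal sketch of that pattern as a single function is shown here. The helper name `income_ttest` and the simulated `toy` data frame are assumptions made for illustration only; they are not part of the original analysis and the toy numbers are not results.

```r
# Hypothetical helper: `df` must contain the columns gdp_percap and percent_iap.
income_ttest <- function(df, lower, cutoff) {
  # keep only rows above the lower GDP bound
  sub <- df[df$gdp_percap > lower, ]
  # label each row relative to the cutoff, as in the blocks in this section
  sub$gdp_cat <- ifelse(sub$gdp_percap < cutoff, "below", "above")
  # Welch two-sample t-test (t.test default: var.equal = FALSE)
  t.test(percent_iap ~ gdp_cat, data = sub)
}

# Simulated data for demonstration only (not the merged_df dataset)
set.seed(42)
toy <- data.frame(gdp_percap = runif(200, 200, 30000))
toy$percent_iap <- pmax(0, 15 - 1.4 * log(toy$gdp_percap) + rnorm(200, sd = 1))

res <- income_ttest(toy, lower = 1086, cutoff = 4255)
res$estimate
```

With the real data, the same call would take the renamed and region-filtered data frame in place of `toy`, for example `income_ttest(tt_merged_df, lower = 1086, cutoff = 4255)`.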

Low-Income Countries ($0 - $1085 USD)

Code
# Create a new column to indicate whether the GDP per capita is below or above 1085 $USD
tt_merged_df <- merged_df %>%
  mutate(gdp_cat = ifelse(gdp_percap < 1085, "below", "above")) %>%
    
  rename("percent_iap" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
   mutate("region" = countrycode(entity,origin = "country.name",
   destination= "region")) %>%
  filter(!is.na(region))

# Perform t-test
ttest <- t.test(percent_iap ~ gdp_cat, data = tt_merged_df)

# Print the t-test results
print(ttest)

    Welch Two Sample t-test

data:  percent_iap by gdp_cat
t = -23.628, df = 281.64, p-value < 2.2e-16
alternative hypothesis: true difference in means between group above and group below is not equal to 0
95 percent confidence interval:
 -9.465416 -8.009596
sample estimates:
mean in group above mean in group below 
           4.417084           13.154589 

Middle-Income Countries ($1086 - $4255 USD)

Code
# Create a new column to indicate whether the GDP per capita is below or above 4255 $USD (rows at or below 1086 $USD are dropped first)
tt_merged_df <- merged_df %>%
  filter(gdp_percap > 1086) %>%
  mutate(gdp_cat = ifelse(gdp_percap < 4255, "below", "above")) %>%
    
  rename("percent_iap" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
   mutate("region" = countrycode(entity,origin = "country.name",
   destination= "region")) %>%
  filter(!is.na(region))

# Perform t-test
ttest <- t.test(percent_iap ~ gdp_cat, data = tt_merged_df)

# Print the t-test results
print(ttest)

    Welch Two Sample t-test

data:  percent_iap by gdp_cat
t = -21.997, df = 185.07, p-value < 2.2e-16
alternative hypothesis: true difference in means between group above and group below is not equal to 0
95 percent confidence interval:
 -8.938122 -7.466801
sample estimates:
mean in group above mean in group below 
           1.755227            9.957688 

High-Income Countries ($4256 - $13205 USD)

Code
# Create a new column to indicate whether the GDP per capita is below or above 13205 $USD (rows at or below 4256 $USD are dropped first)
tt_merged_df <- merged_df %>%
  filter(gdp_percap > 4256) %>%
  mutate(gdp_cat = ifelse(gdp_percap < 13205, "below", "above")) %>%
    
  rename("percent_iap" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
   mutate("region" = countrycode(entity,origin = "country.name",
   destination= "region")) %>%
  filter(!is.na(region))

# Perform t-test
ttest <- t.test(percent_iap ~ gdp_cat, data = tt_merged_df)

# Print the t-test results
print(ttest)

    Welch Two Sample t-test

data:  percent_iap by gdp_cat
t = -10.262, df = 206.63, p-value < 2.2e-16
alternative hypothesis: true difference in means between group above and group below is not equal to 0
95 percent confidence interval:
 -2.988349 -2.025190
sample estimates:
mean in group above mean in group below 
          0.4260148           2.9327845 

high(er)-Income Countries ($13206 - $18,000 USD)

Code
# Create a new column to indicate whether the GDP per capita is below or above 18000 $USD (rows at or below 13206 $USD are dropped first)
tt_merged_df <- merged_df %>%
   filter(gdp_percap > 13206) %>%
  mutate(gdp_cat = ifelse(gdp_percap < 18000, "below", "above")) %>%
    
  rename("percent_iap" = deaths_cause_all_causes_risk_household_air_pollution_from_solid_fuels_sex_both_age_age_standardized_percent) %>%
  
   mutate("region" = countrycode(entity,origin = "country.name",
   destination= "region")) %>%
  filter(!is.na(region))

# Perform t-test
ttest <- t.test(percent_iap ~ gdp_cat, data = tt_merged_df)

# Print the t-test results
print(ttest)

    Welch Two Sample t-test

data:  percent_iap by gdp_cat
t = -2.4598, df = 14.477, p-value = 0.02703
alternative hypothesis: true difference in means between group above and group below is not equal to 0
95 percent confidence interval:
 -2.7220691 -0.1903813
sample estimates:
mean in group above mean in group below 
          0.2605347           1.7167599 

Discussion

These multiple-variable tests give a stronger idea of the kinds of relationships this data can support. Several of these observations were particularly interesting, while others may be less valuable.

Section 1.7 provides an additional graph of the change in indoor air pollution deaths grouped by region. It shows that, despite the regression analysis in Table C and Section 1.1, year has some ability to describe the differences that are present on a regional basis. The Bland-Altman chart separating the current vs. change plots by region shows clearly how large the difference is between a region like Sub-Saharan Africa, which shows growing percentages of premature deaths, and Europe & Central Asia, which has consistently very few deaths from IAP. With both region and year, the r^2 value increased by less than 1.0% over the single-variable regression with region alone, meaning that including year was not significantly informative.

In total, the year coefficient in the Section 1.7 model (-0.1537, p = 0.0800) indicates that the percentage of premature deaths from indoor air pollution decreased by about 0.15 percentage points per year for the reference region, East Asia & Pacific, and none of the year-by-region interaction terms differs significantly from that slope. This suggests that, despite some countries facing new challenges due to socioeconomic changes, the global rate continues to decrease. This analysis proved more beneficial than previously anticipated, and suggests that tracking progress over time may provide important evidence for understanding how the effects of other factors like GDP and population size change over time.
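Because the Section 1.7 model includes year-by-region interaction terms, the `year` coefficient alone is the slope for the reference region only. The implied slope for any other region is the `year` coefficient plus that region's interaction term. The sketch below works this out from the coefficients printed in the Section 1.7 table; the variable names are illustrative, and since none of the interaction terms is statistically significant, the per-region slopes should be read as rough indications only.

```r
# Coefficients copied from the Section 1.7 region-and-year model table.
year_ref <- -0.1537  # year slope for the reference region (East Asia & Pacific)

interaction <- c(
  "Europe & Central Asia"      =  0.0889,
  "Latin America & Caribbean"  = -0.0008,
  "Middle East & North Africa" =  0.0027,
  "North America"              =  0.1522,
  "South Asia"                 =  0.0350,
  "Sub-Saharan Africa"         =  0.0308
)

# Implied change in percent deaths per year for each region
region_slopes <- c("East Asia & Pacific" = year_ref, year_ref + interaction)
round(region_slopes, 4)
```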

Section 1.8 combines the major test questions for this analysis, GDP and region, into a single graph and multiple-variable analysis aimed at understanding how regional averages of GDP compound the relationships seen in Sections 1.2 and 1.4.

The regression analysis correlates the percentage of deaths with GDP, separated by region. The adjusted R^2 value (0.7149) suggests that about 71% of the variation in premature deaths can be predicted from GDP and region. The model uses one region, East Asia and Pacific, as the intercept (reference level), and all subsequent region coefficients add to or subtract from it.

(p<0.001 for all results except North America)

  • The intercept, which corresponds to East Asia and Pacific, gives an expected average percentage of premature deaths caused by indoor air pollution of 9.770% for a GDP of 0.

  • For countries in Europe & Central Asia, the average expected percentage of premature deaths caused by indoor air pollution is 6.908% (9.770 - 2.862) for a GDP of 0.

  • For countries in Latin America & Caribbean, the average expected percentage of premature deaths caused by indoor air pollution is 6.387% (9.770 - 3.383) for a GDP of 0.

  • For countries in Middle East & North Africa, the average expected percentage of premature deaths caused by indoor air pollution is 4.593% (9.770 - 5.177) for a GDP of 0.

  • For countries in North America, the average expected percentage of premature deaths caused by indoor air pollution is 9.007% (9.770 - 0.763) for a GDP of 0. (p = 0.534)

    • This estimate is not statistically distinguishable from the East Asia and Pacific baseline and is likely unreliable. It may be the result of an exceedingly small sample size that does not accurately capture the relationship between GDP and premature deaths from IAP in North America.
  • For countries in South Asia, the average expected percentage of premature deaths caused by indoor air pollution is 15.192% (9.770 + 5.422) for a GDP of 0.

  • For countries in Sub-Saharan Africa, the average expected percentage of premature deaths caused by indoor air pollution is 11.631% (9.770 + 1.861) for a GDP of 0.

    In summary, this regression analysis suggests that there is a negative relationship between GDP and the percentage of premature deaths caused by indoor air pollution across regions (gdp_percap statistic = -15.2612, p < 0.001), and that GDP is a significant predictor of this percentage.

Section 1.9 aims to determine how well GDP and country together predict the percentage of premature deaths. This model received the highest r-squared value of any test run in this analysis (adjusted r^2 = 0.9612, p < 0.001), suggesting that most of the variation can be explained by a given country and its GDP. For this model fit the intercept is Afghanistan, which has the highest percentage of deaths when GDP is set to 0 at 19.45, ~10% higher than the global average recorded in the merged_df dataset.

Considering that region and GDP had an r-squared value of 0.7149, the second-highest value in this analysis, it is clear that these three factors (country, region, and GDP) are the most significant predictive characteristics for the percentage of indoor air pollution-related premature deaths. This analysis, however, will not focus on this table in the conclusion, since comparison of countries is largely outside the scope of this study. Nonetheless, it is valuable to understand the amount of predictive power in these independent variables.

Section 1.10 returns for a final interpretation of the single-variable test on population size from Section 1.3, which found a statistically significant correlation (p = 0.0122) that explains very little of the variation in the dataset (r^2 = 0.012). While these results were statistically significant, the multi-variable analysis controlling for GDP found a small t-statistic (2.45), suggesting that the correlation does not represent a large portion of the sample.

Considering these results, a hypothesis that population size correlates with the percentage of indoor air pollution-related deaths would likely not be supported.

Section 1.11 provided four t-tests comparing the mean percentage of premature deaths in low-, middle-, high-, and high(er)-income countries. These tests found an exceedingly strong relationship between GDP and percent deaths from IAP, which was expected considering the results in Sections 1.4, 1.8, and 1.9. However, these results go much further in explaining how this breaks down across economically grouped countries. Every comparison showed a strong difference (p < 0.001) except the high(er)-income comparison, which was still significant (p = 0.02703).

  • Low-income countries ($0 - $1085 USD) had an average percentage of premature deaths of 13.15%, while the average for all GDPs above this group was 4.42% (t = -23.628, df = 281.64, p-value < 2.2e-16).

  • Middle-income countries ($1086 - $4255 USD) had an average percentage of premature deaths of 9.96%, with all greater GDPs averaging 1.76% (t = -21.997, df = 185.07, p-value < 2.2e-16). Middle-income countries had, on average, 3.20 percentage points fewer premature deaths than low-income countries, which is a significant decrease considering the small increase in GDP, as well as the fact that excluding these groups from the population brings the average below 2% premature deaths attributed to IAP.

  • High-income countries ($4256 - $13205 USD) had an even smaller average, as expected, with a drop even larger than that between low- and middle-income countries. The average percent deaths for high-income countries was 2.93%, while all GDPs above this range averaged 0.43% (t = -10.262, df = 206.63, p-value < 2.2e-16). The difference in average percentage of deaths between middle- and high-income countries was 7.02 percentage points, over twice the reduction seen in moving from low to middle income.
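The tier-to-tier reductions quoted in this list can be reproduced directly from the "mean in group below" values printed by the four Welch t-tests in Section 1.11. The short sketch below does that arithmetic; the vector names are illustrative, and the fourth value comes from the high(er)-income test output.

```r
# "mean in group below" values copied from the four t-test outputs in Section 1.11.
below_means <- c(
  low    = 13.154589,  # GDP per capita below 1085
  middle = 9.957688,   # 1086 to 4255
  high   = 2.9327845,  # 4256 to 13205
  higher = 1.7167599   # 13206 to 18000
)

# Drop in mean percent deaths when moving up one tier (percentage points).
# Each element is named after the tier being moved into.
step_reduction <- -diff(below_means)
round(step_reduction, 2)

# How much larger the middle-to-high drop is than the low-to-middle drop
step_reduction[["high"]] / step_reduction[["middle"]]
```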


Conclusions

This analysis aimed to test the hypothesis that countries with higher GDP and larger population sizes have lower percentages of premature deaths caused by indoor pollution from biofuels, and that significant differences in the percentage of premature deaths caused by indoor pollution from biofuels exist across regions even after controlling for GDP, population size, and other relevant factors.

In Section 1.1, the low R^2 value for the correlation between indoor air pollution deaths and year suggests that health care and disease prevention improvements have played a more prominent role in reducing premature deaths than changes in indoor air quality over time. However, as the section notes, analyzing the relationship between year and indoor air pollution deaths on a regional basis could provide more insight into trends and disparities.

The high R^2 value for the correlation between the region and indoor air pollution deaths in Section 1.2 supports the hypothesis that regional factors play a significant role in determining the prevalence of indoor air pollution and its associated health risks. These regional discrepancies are likely related to several factors, such as differences in household fuel sources, housing quality, and cultural practices. Understanding these regional differences can help policymakers and public health officials effectively target interventions and resources. While population size did not strongly correlate with indoor air pollution deaths in Section 1.3, it is still an essential factor to consider in the context of public health. For example, even if the percentage of premature deaths caused by indoor air pollution is low in a country with a large population, the absolute number of deaths could still be significant. Additionally, rapid population growth can exacerbate existing environmental and health challenges.

The relationship between GDP and indoor air pollution deaths in Section 1.4 suggests that economic development can reduce indoor air pollution and its associated health risks. However, it is essential to note that the relationship may not be linear or straightforward. For example, while higher GDP can enable households to switch to cleaner fuel sources, it can also lead to increased industrialization and urbanization, which can create new sources of pollution.

The relationship between indoor air pollution deaths and life expectancy in Section 1.5 highlights the broader impacts of indoor air pollution on population health. Improving indoor air quality could have significant benefits beyond reducing premature deaths, such as increasing overall life expectancy and reducing disease burden. The regional differences in indoor air pollution deaths shown in Section 1.7 further emphasize the importance of understanding and prioritizing different regions’ unique challenges.

Overall, these observations suggest that indoor air pollution is a complex and multifaceted issue that requires a nuanced understanding of regional and socioeconomic factors. While GDP and, to a far lesser degree, population size are statistically significant predictors of indoor air pollution deaths, they cannot fully explain the variation across regions and countries. Further research and analysis are needed to develop targeted interventions and policies to address this critical public health issue.


Future Questions

  • Studies suggest that in developed countries, people spend 80-90% of their lives indoors. The US, being the wealthiest nation, has the resources to reduce indoor air pollution through regulation, investment, and adequate medical care. However, countries with growing GDP and capitalistic societies may face increased time indoors without the resources to regulate indoor and outdoor environments, leading to greater exposure to pollutants and inadequate healthcare. As countries create economic opportunities and urbanize, indoor air pollution may increase. Identifying the point at which GDP and premature deaths caused by indoor air pollution become correlated could inform targeted prevention efforts.
  • ANOVA tests would have been very beneficial for testing the hypothesis in this analysis. An F-statistic comparing the variation in mean deaths from IAP within regions to the variation between regions would have shown how statistically significant the regional differences are.
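As a rough illustration of the ANOVA idea in the last point, a one-way ANOVA in base R compares between-region variation to within-region variation and reports an F-statistic. The sketch below uses a small simulated data frame; the region labels and values in `toy` are made up for demonstration and are not drawn from the dataset used in this analysis.

```r
# Simulated data for demonstration only (not the merged_df dataset).
set.seed(1)
toy <- data.frame(
  region = rep(c("Region A", "Region B", "Region C"), each = 20),
  percent_iap = c(
    rnorm(20, mean = 12, sd = 2),
    rnorm(20, mean = 5,  sd = 2),
    rnorm(20, mean = 2,  sd = 0.5)
  )
)

# One-way ANOVA: does mean percent_iap differ between regions?
fit <- aov(percent_iap ~ region, data = toy)
anova_table <- summary(fit)[[1]]
anova_table

# F-statistic for the region term (between-region vs. within-region variation)
f_stat <- anova_table[["F value"]][1]
f_stat
```

Applied to the real data, the formula would use the renamed outcome and the `region` column produced by `countrycode()`, for example `aov(percent_IAP ~ region, data = model_data)`.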

References

Ezzati, Majid, and Daniel M. Kammen. 2002. The health impacts of exposure to indoor air pollution from solid fuels in developing countries: knowledge, gaps, and data needs. Environmental Health Perspectives 110 (11): 1057–68. https://doi.org/10.1289/ehp.021101057.
González-Martín, Javier, Norbertus Johannes Richardus Kraakman, Cristina Pérez, Raquel Lebrero, and Raúl Muñoz. 2021. A state-of-the-art review on indoor air pollution and strategies for indoor air pollution control. Chemosphere 262 (January): 128376. https://doi.org/10.1016/j.chemosphere.2020.128376.
Gordon, Stephen B., Nigel Bruce, Jonathan Grigg, Patricia L. Hibberd, Om P Kurmi, Kin-Man Lam, Kevin Mortimer, et al. 2014. Respiratory risks from household air pollution in low and middle income countries. The Lancet Respiratory Medicine 2 (10): 823–60. https://doi.org/10.1016/s2213-2600(14)70168-7.
World Health Organization. 2014. WHO Guidelines for indoor air quality: Household fuel combustion. https://www.who.int/publications/i/item/9789241548885.