Section 5.3

March 4, 2021

Questions on Investigation #4?

Review: Yawning Study

\[ H_0: \pi_1 - \pi_2 = 0 \\ H_a: \pi_1 - \pi_2 > 0 \]

                  Seed observed   Seed not observed   Total
Subject yawned               11                   3      14
Did not yawn                 23                  13      36
Total                        34                  16      50

\[ \begin{align} \hat{p}_1 &= 11/34 \approx 0.32 \\ \hat{p}_2 &= 3/16 \approx 0.19 \\ \hat{p}_1-\hat{p}_2 &= 11/34-3/16 \approx 0.136 = \mbox{test statistic} \\ \end{align} \]

2SD confidence interval for \(\pi_1 - \pi_2\)

  • \(\text{SD}_\text{null} = 0.137\)
  • Estimate: \(\quad\hat{p}_1 - \hat{p}_2 \approx 0.136\)
  • 2SD CI: \(\quad 0.136 \pm 2(0.137) \approx (-0.14, 0.41)\)
    • Is it plausible that \(\pi_1 - \pi_2 = 0\)?
    • What would this mean in the context of the yawn study?
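
The 2SD interval above can be reproduced with a few lines of R. This is only a sketch: `sd_null` is the value 0.137 taken from the slide (it comes from the simulated null distribution and is not computed here), and the variable names are illustrative.

```r
# Counts from the yawning two-way table
yawned <- c(seed = 11, no_seed = 3)
n      <- c(seed = 34, no_seed = 16)

p_hat    <- yawned / n                                  # proportion who yawned in each group
obs_diff <- unname(p_hat["seed"] - p_hat["no_seed"])    # observed statistic

sd_null <- 0.137                        # assumed: SD of the simulated null distribution
ci_2sd  <- obs_diff + c(-2, 2) * sd_null

round(obs_diff, 3)   # about 0.136
round(ci_2sd, 2)     # about -0.14 and 0.41
```

Because the interval contains zero, a difference of zero between the two groups remains plausible.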

5.3: Comparing Two Proportions: Theory-Based Approach

Overview: Simulation vs. Theory

  • Simulation always works.
  • Under certain conditions, we can predict the result of simulation using theory.

Smoking Parents

Smoking and Babies’ Sex

How does parents’ behavior affect the sex of their children? Fukuda et al., 2002 (Japan) found the following:

  • 255 of 565 births (45.1%) where both parents smoked more than a pack a day were boys.
  • 1975 of 3602 births (54.8%) where neither parent smoked were boys.

Other studies have shown a reduced male-to-female birth ratio where high concentrations of other environmental chemicals are present (e.g., industrial pollution, pesticides).

Smoking/Baby data

##      Nonsmokers Smokers  Sum
## Boy        1975     255 2230
## Girl       1627     310 1937
## Sum        3602     565 4167
##      Nonsmokers   Smokers
## Boy   0.5483065 0.4513274
## Girl  0.4516935 0.5486726
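
One way to build these two tables in R is sketched below. The object name `births` is an assumption for illustration; the slides do not show the code that produced the output.

```r
# Two-way table of counts (rows: baby's sex, columns: parents' smoking status)
births <- as.table(matrix(c(1975, 1627, 255, 310), nrow = 2,
                          dimnames = list(c("Boy", "Girl"),
                                          c("Nonsmokers", "Smokers"))))

counts_with_totals <- addmargins(births)            # adds the "Sum" row and column
col_props          <- prop.table(births, margin = 2) # proportions within each column

counts_with_totals
col_props
```

Using `margin = 2` conditions on the explanatory variable (smoking status), so each column of proportions adds to 1.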

Segmented Bar Graph
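
A segmented bar graph of these data could be drawn with base R roughly as follows. This is a sketch, not necessarily how the original figure was made; the axis label and title are illustrative choices.

```r
# Column proportions: share of boys and girls within each smoking group
births <- matrix(c(1975, 1627, 255, 310), nrow = 2,
                 dimnames = list(c("Boy", "Girl"), c("Nonsmokers", "Smokers")))
col_props <- prop.table(births, margin = 2)

# barplot() stacks the rows of a matrix by default, giving one segmented bar per column
barplot(col_props,
        legend.text = TRUE,
        ylab = "Proportion of births",
        main = "Sex of baby by parents' smoking status")
```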

Simulation-based approach

##      Nonsmokers Smokers
## Boy        1975     255
## Girl       1627     310
##      Nonsmokers   Smokers
## Boy   0.5483065 0.4513274
## Girl  0.4516935 0.5486726

\[ \mbox{test statistic} = \hat{p}_1 - \hat{p}_2 \approx 0.5483 - 0.4513 \approx 0.097 \]

\[ H_0: \pi_1 - \pi_2 = 0 \\ H_a: \pi_1 - \pi_2 \neq 0 \]

Let’s enter the data into the Two Proportion Applet.
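
The shuffling idea behind the simulation can also be sketched in R. The applet's exact algorithm is not described here, so this is an assumed version: it shuffles the smoking labels 1000 times (an arbitrary choice) and recomputes the difference in proportions each time. The helper name `diff_in_props` is hypothetical.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# One entry per birth, consistent with the two-way table
sex   <- rep(c("Boy", "Girl"), times = c(2230, 1937))
group <- rep(c("Nonsmokers", "Smokers", "Nonsmokers", "Smokers"),
             times = c(1975, 255, 1627, 310))

# Difference in the proportion of boys: nonsmokers minus smokers
diff_in_props <- function(sex, group) {
  mean(sex[group == "Nonsmokers"] == "Boy") - mean(sex[group == "Smokers"] == "Boy")
}

observed <- diff_in_props(sex, group)

# Shuffling the group labels mimics a world with no association
null_diffs <- replicate(1000, diff_in_props(sex, sample(group)))

mean(null_diffs)                          # should be close to 0
sd(null_diffs)                            # SD of the simulated null distribution
mean(abs(null_diffs) >= abs(observed))    # two-sided simulated p-value
```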

Theory-based approach

  • The null distribution is approximately normal.
  • Validity conditions: At least 10 observations in each cell of the two-way table.
  • Need standard error of difference of proportions. Assuming \(\pi_1 = \pi_2\),

\[ \mbox{standard error} \approx \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2} \right)} \]

where \(n_1\) and \(n_2\) are the sizes of the two groups, and \(\hat{p}\) is the pooled proportion of “successes” in both groups.
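
For the smoking data, the validity check and the pooled standard error can be computed directly. The following is a minimal sketch with illustrative variable names; "success" is taken to mean a boy.

```r
x <- c(nonsmokers = 1975, smokers = 255)   # number of boys in each group
n <- c(nonsmokers = 3602, smokers = 565)   # group sizes

# Validity condition: at least 10 observations in every cell of the two-way table
valid <- all(c(x, n - x) >= 10)

# Pooled proportion of boys across both groups
p_pooled <- sum(x) / sum(n)

# Standard error of the difference under the null hypothesis pi_1 = pi_2
se_null <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))

valid                 # TRUE
round(p_pooled, 3)    # about 0.535
round(se_null, 4)     # about 0.0226
```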

Theory-based \(z\)-Score

When the validity conditions are met, the standardized statistic is

\[ \begin{align} z &= \frac{\mbox{observed statistic} - \mbox{null value}}{\mbox{standard error of statistic}} \\ &\approx \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2} \right)}} \end{align} \]

Theory-based \(z\)-Score

For our data:

\[ \begin{align} z &= \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2} \right)}} \\ &\approx \frac{(0.548 - 0.451) - 0}{\sqrt{0.535(1-0.535)\left(\frac{1}{3602} + \frac{1}{565} \right)}} \\ &\approx 4.30 \end{align} \]
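
The same arithmetic can be checked in R, and the z-score can be turned into a two-sided p-value using the normal distribution. This is a sketch with illustrative variable names; `pnorm()` is the standard normal CDF in base R.

```r
x <- c(1975, 255)    # boys among nonsmokers, smokers
n <- c(3602, 565)    # group sizes

p_hat    <- x / n
p_pooled <- sum(x) / sum(n)
se_null  <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))

# Standardized statistic: (observed statistic - null value) / standard error
z <- (p_hat[1] - p_hat[2] - 0) / se_null

# Two-sided p-value: area in both tails beyond |z|
p_value <- 2 * pnorm(-abs(z))

round(z, 2)   # about 4.30
p_value       # very small, on the order of 1e-05
```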

Theory-based Applet

Luckily, there’s an applet: Theory-Based Inference

##      Nonsmokers Smokers  Sum
## Boy        1975     255 2230
## Girl       1627     310 1937
## Sum        3602     565 4167

Confidence intervals

When computing a confidence interval, there is no \(H_0\), so we can’t assume that \(\pi_1 = \pi_2\). So the standard error is:

\[ \mbox{standard error} \approx \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]

So a confidence interval for \(\pi_1 - \pi_2\) has the form

\[ (\hat{p}_1-\hat{p}_2) \pm \mbox{multiplier} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]

where the multiplier depends on the confidence level (e.g., 1.96 for 95% confidence).
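
Applied to the smoking data, the formula gives the interval below. This is a sketch with illustrative variable names; it uses no continuity correction, so it will differ slightly from software that applies one.

```r
x <- c(1975, 255)    # boys among nonsmokers, smokers
n <- c(3602, 565)    # group sizes
p_hat <- x / n

# Unpooled standard error: no null hypothesis, so each group keeps its own proportion
se_ci <- sqrt(sum(p_hat * (1 - p_hat) / n))

multiplier <- qnorm(0.975)    # about 1.96 for 95% confidence
ci <- (p_hat[1] - p_hat[2]) + c(-1, 1) * multiplier * se_ci

round(ci, 3)   # about 0.053 and 0.141
```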

Use software to get confidence intervals

For example, use the Theory-Based Inference applet.

Using R for theory-based approach

prop.test(x = c(1975, 255), n = c(3602, 565))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(1975, 255) out of c(3602, 565)
## X-squared = 18.077, df = 1, p-value = 2.122e-05
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.05182158 0.14213654
## sample estimates:
##    prop 1    prop 2 
## 0.5483065 0.4513274
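
By default `prop.test()` applies a continuity correction, which is why its X-squared value (18.077) is a little smaller than the square of the hand-computed z-score. A sketch of the uncorrected version, using the function's `correct = FALSE` argument, looks like this:

```r
# Same test without the continuity correction
res <- prop.test(x = c(1975, 255), n = c(3602, 565), correct = FALSE)

z_from_test <- sqrt(unname(res$statistic))   # square root of X-squared

round(z_from_test, 2)     # about 4.30, matching the theory-based z-score
round(res$conf.int, 3)    # about 0.053 and 0.141, matching the formula-based interval
```

Either version leads to the same conclusion here because the samples are large.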

Interpret the results in context

  • We have very strong evidence (p-value ≈ 0.00002) that couples in which both parents smoke heavily are less likely to have a male child than couples in which neither parent smokes.
  • With 95% confidence, the probability of having a male child is 0.052 to 0.142 lower for smokers than for nonsmokers.

What you should know

  • Identify when a theory-based approach would be valid to find the p-value or a confidence interval when evaluating the relationship between two categorical variables.
  • Use the Theory-Based Inference applet or R to find theory-based p-values and confidence intervals.
  • Understand the impacts of confidence level and sample size on the width of a confidence interval for the difference in two proportions.
  • Understand the impacts of the difference in the sample proportions and sample size on the p-value.
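
The effect of confidence level and sample size on interval width can be explored numerically. The helper `ci_width()` below is hypothetical (not from the slides), and the scenario of quadrupling both sample sizes is made up for illustration.

```r
# Width of the theory-based confidence interval for pi_1 - pi_2
ci_width <- function(p1, p2, n1, n2, level = 0.95) {
  multiplier <- qnorm(1 - (1 - level) / 2)
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  2 * multiplier * se
}

p1 <- 1975 / 3602   # sample proportions from the smoking data
p2 <- 255 / 565

w95    <- ci_width(p1, p2, 3602, 565, level = 0.95)
w99    <- ci_width(p1, p2, 3602, 565, level = 0.99)
w95_4x <- ci_width(p1, p2, 4 * 3602, 4 * 565, level = 0.95)  # hypothetical larger study

c(w95 = w95, w99 = w99, w95_4x = w95_4x)
```

Raising the confidence level widens the interval, while quadrupling both sample sizes (with the same sample proportions) cuts the width in half.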

Exploration 5.3

Blood Donation

We are going to look at data from the General Social Survey, which is a national survey conducted every two years on a nationwide random sample of adult Americans.

Sampling and/or Assignment?

Are Americans any more or less generous about donating blood in some years than others? Are women any more or less willing to give blood than men? To investigate these questions we can analyze data from the General Social Survey, which is a national survey conducted every two years on a nationwide random sample of adult Americans.

  1. Did this study make use of random sampling, random assignment, both, or neither? Can conclusions be generalized to a larger population? Can cause-effect conclusions be made? Everyone: Enter your responses.

Hypotheses

We are going to look at data from the years 2002 and 2004.

  2. Record the appropriate null and alternative hypotheses, in symbols, to address the research question of whether Americans were more generous with blood donations in one of these years than the other. (Hint: Remember that hypotheses are always about population parameters, and think about whether the alternative should be one- or two-sided before you see the data.)

Simulation-based analysis

##                Y2002 Y2004  Sum
## Donated blood    210   230  440
## Did not donate  1152  1106 2258
## Sum             1362  1336 2698
  3. Use the Two Proportion Applet to record the test statistic \(\hat{p}_{2002} - \hat{p}_{2004}\), and record a p-value for the test in #2. Also record the mean and SD of the null distribution.

  4. Record a 95% 2SD confidence interval for \(\pi_{2002} - \pi_{2004}\). Is zero a plausible value? What does this fact mean in the context of blood donation?

Theory-based approach

  5. Explain why the validity conditions for the theory-based test are met.

  6. Check the box for Overlay normal distribution. Is the normal distribution a good approximation for the null distribution? Record a theory-based p-value, and compare it to #3.

Theory-Based Inference applet: confidence interval

  7. Now use the Two Proportion scenario in the Theory-Based Inference applet to record a 95% confidence interval. How close is your interval to the 2SD interval in #4?

  8. Find the theory-based p-value, and compare it to what you got in #6 (it should be the same).

Men vs. Women

Consider the combined data for 2002 and 2004, classified by sex:

##                Male Female  Sum
## Donated blood   239    201  440
## Did not donate 1032   1226 2258
## Sum            1271   1427 2698
  9. Analyze these data to address the question of whether American men and women differ with regard to donating blood. Record hypotheses, a p-value, and a confidence interval using theory-based methods, if valid. If you decide that men and women differ significantly, be sure to estimate by how much they differ. Everyone: Give a summary of your findings in paragraph form.
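
If you prefer R to the applet, one possible setup for this analysis is sketched below. The object names are illustrative, and `correct = FALSE` is chosen here so the output lines up with the formulas from this section; interpreting the output is left to you.

```r
# Two-way table from the slide (rows: response, columns: sex)
donation <- matrix(c(239, 1032, 201, 1226), nrow = 2,
                   dimnames = list(c("Donated blood", "Did not donate"),
                                   c("Male", "Female")))

# Validity condition: at least 10 observations in every cell
valid <- all(donation >= 10)

# prop.test() accepts a matrix with one row per group and columns (successes, failures),
# so transpose the table to put Male and Female in the rows
by_sex <- t(donation)
result <- prop.test(by_sex, correct = FALSE)

valid
result$estimate    # sample proportions who donated: Male, then Female
result$p.value     # two-sided theory-based p-value
result$conf.int    # 95% confidence interval for pi_male - pi_female
```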