Section 4.1-4.2

February 25, 2021

Announcements

Student Hours Today

Instead of the usual time (1:30-4), Student Hours today will be from 2:30-5pm. (Today only!)

Chapter 4

Big idea of Chapter 4

  • Previously, research questions focused on a single statistic.
    • What proportion of the time did Harley choose the right bowl?
    • What is the average number of hours slept by Westmont students?
  • We will now start to focus on research questions comparing two groups.
    • Are smokers more likely than nonsmokers to have lung cancer?
    • Are children who used night lights as infants more likely to need glasses than those who didn’t use night lights?

4.1: Association and Confounding

Types of variables

When two variables are involved in a study, they are often classified as explanatory and response.

  • Explanatory variable (Independent, Predictor)
    • The variable we think is “explaining” the change in the response variable. (Many times, this is the variable the researchers are manipulating.)
  • Response variable (Dependent)
    • The variable we think is being impacted or changed by the explanatory variable.

Examples

  • Choose the explanatory and response variable:
    • Smoking and lung cancer
    • Heart disease and diet
    • Height and weight
  • Sometimes there is a clear distinction between explanatory and response variables and sometimes there isn’t.

Night Lights and Nearsightedness

Near-sightedness often develops in childhood

  • Recent studies looked to see if there is an association between near-sightedness and night light use with infants
  • Researchers interviewed parents of 479 children who were outpatients in a pediatric ophthalmology clinic
  • Asked whether the child slept with the room light on, with a night light on, or in darkness before age 2
  • Children were also separated into two groups: near-sighted or not near-sighted based on the child’s recent eye examination

Night Light Data

                  Darkness  Night Light  Room Light  Total
Near-sighted            18           78          41    137
Not near-sighted       154          154          34    342
Total                  172          232          75    479


  • This type of table is called a two-way table (or “contingency table” or “cross tabulation”, or just “crosstab”).

Conditional Column Proportions
                  Darkness  Night Light  Room Light  Total
Near-sighted         0.105        0.336       0.547  0.286
Not near-sighted     0.895        0.664       0.453  0.714
Total                1.000        1.000       1.000  1.000
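
The conditional column proportions can be reproduced in R from the counts. This is a minimal sketch; the object names `nightlight` and `colprops` are chosen here for illustration and do not come from the study materials.

```r
# Counts from the Night Light Data table, entered column by column
nightlight <- as.table(matrix(c(18, 154, 78, 154, 41, 34), nrow = 2,
  dimnames = list(c("Near-sighted", "Not near-sighted"),
                  c("Darkness", "Night Light", "Room Light"))))

# Add row and column totals to match the two-way table
addmargins(nightlight)

# Conditional column proportions: each column sums to 1
colprops <- prop.table(nightlight, margin = 2)
round(colprops, 3)

# Overall proportion near-sighted (the "Total" column)
overall <- sum(nightlight["Near-sighted", ]) / sum(nightlight)
round(overall, 3)
```

Using `margin = 2` divides each count by its column total, which is the same choice made with `prop.table(okctable, 2)` later in these slides.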

Association

  • Notice that as the light level increases, the percentage of near-sighted children also increases.
  • We say there is an association between near-sightedness and night lights.
  • Two variables are associated if the values of one variable provide information about (help you predict) the values of the other variable.

Stacked Bar Plots

Near-sighted is dark gray; not near-sighted is light gray.

Mosaic Plots
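
One possible way to draw a stacked bar plot and a mosaic plot of the night light data in base R is sketched below. The object names, gray shades, and plot title are assumptions made for this example, not the settings used for the original figures.

```r
# Counts from the Night Light Data table (rows: eyesight, columns: lighting)
nightlight <- as.table(matrix(c(18, 154, 78, 154, 41, 34), nrow = 2,
  dimnames = list(Eyesight = c("Near-sighted", "Not near-sighted"),
                  Lighting = c("Darkness", "Night Light", "Room Light"))))

colprops <- prop.table(nightlight, margin = 2)

# Stacked bar plot: one bar per lighting condition, each of height 1
barplot(colprops, col = c("gray30", "gray80"),
        ylab = "Proportion", legend.text = rownames(colprops))

# Mosaic plot: like the stacked bar plot, but bar widths reflect group sizes
mosaicplot(t(nightlight), color = c("gray30", "gray80"),
           main = "Near-sightedness by lighting")
```

The table is transposed with `t()` before calling `mosaicplot()` so that the lighting conditions run across the horizontal axis, matching the bar plot.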

Association \(\neq\) Causation

  • A confounding variable is associated with both the explanatory variable and the response variable.
  • We say it is confounding because its effects on the response cannot be separated from those of the explanatory variable.
  • Because of this, we can’t draw cause and effect conclusions when confounding variables are present.
  • Possession of lighters is associated with lung cancer. (Confounding variable: smoker/nonsmoker)
  • Night light use is associated with near-sightedness. (Possible confounding variable: eyesight of parents. Nearsighted children tend to have nearsighted parents, and nearsighted parents may need lit rooms to function.)

Exploration 4.1: Home Court Disadvantage?

Home Court advantage

OKC Thunder, 2014

  • Crowd distraction (0:15)
  • Referee Bias (3:24)

Home Court Disadvantage?

The 2008-9 Oklahoma City Thunder had a win-loss record that was actually worse for home games with a sell-out crowd (3 wins and 15 losses) than for home games without a sell-out crowd (12 wins and 11 losses).

  1. Identify the observational units and variables in this study.
  2. When did the Thunder have a higher winning percentage: in front of a sell-out crowd or a smaller crowd?
  3. Do the two variables appear to be associated?
  4. Which would you consider the explanatory variable in this study? Which is the response?

Data in R

okcmatrix <- matrix(c(3,15,12,11), byrow=FALSE, nrow=2)
rownames(okcmatrix) <- c("Wins", "Losses")
colnames(okcmatrix) <- c("Sell-out Crowd", "Smaller Crowd")
okctable <- as.table(okcmatrix)
okctable
##        Sell-out Crowd Smaller Crowd
## Wins                3            12
## Losses             15            11
prop.table(okctable, 2)
##        Sell-out Crowd Smaller Crowd
## Wins        0.1666667     0.5217391
## Losses      0.8333333     0.4782609

Mosaic Plot of OKC data

Confounding Variables

A confounding variable is a variable that is related both to the explanatory and to the response variable in such a way that its effects on the response variable cannot be separated from the effects of the explanatory variable.

Everyone: The strength of the opposing team is a confounding variable. Explain how this variable is confounding: what is the link between this third variable and the response variable, and what is the link between this third variable and the explanatory variable? Use this form to type up a complete sentence of explanation.

Crowd size and Opponent records

Of the Thunder’s 41 home games, 22 were against teams that won more than half of their games (“strong opponents”). Of these 22 games, 13 were sell-outs. Of the 19 games against teams that won fewer than half of their games that season (“weak opponents”), only 5 were sell-outs.

  1. Record this data in a two-way table. The rows should be labeled Sell-out Crowd and Smaller Crowd, and the columns should be labeled Strong opponent and Weak opponent.

  2. Compute and record the conditional column proportions for your table.

Crowd size and Opponent records

  1. Does there appear to be an association between crowd size and opponent records? How are they associated? Does it make sense that these variables would be associated?

Win/Loss and Opponent records

When the Thunder played a strong opponent, they won only 4 of 22 games. When they played a weak opponent, the Thunder won 11 of 19 games.

  1. Record this data in a two-way table. Now the columns should be labeled Strong opponent and Weak opponent, and the rows should be labeled Wins and Losses.

  2. Compute and record the conditional column proportions for your table.

  3. Does there appear to be an association between wins/losses and opponent records?
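
If you want to check your hand calculations, both two-way tables in this exploration can be entered in R using the same pattern as `okctable`. This is only a sketch: the object names `crowd` and `results` are made up for this example, and the counts not stated directly (such as 22 - 13 smaller crowds against strong opponents) are computed from the totals given in the text.

```r
# Sell-outs by opponent strength: 13 of 22 strong, 5 of 19 weak
crowd <- as.table(matrix(c(13, 22 - 13, 5, 19 - 5), nrow = 2,
  dimnames = list(c("Sell-out Crowd", "Smaller Crowd"),
                  c("Strong opponent", "Weak opponent"))))

# Wins by opponent strength: 4 of 22 strong, 11 of 19 weak
results <- as.table(matrix(c(4, 22 - 4, 11, 19 - 11), nrow = 2,
  dimnames = list(c("Wins", "Losses"),
                  c("Strong opponent", "Weak opponent"))))

# Conditional column proportions for each table
prop.table(crowd, margin = 2)
prop.table(results, margin = 2)
```

As a consistency check, the row totals should agree with the original data: 18 sell-outs and 15 wins across the 41 home games.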

Explain and summarize

  1. Fill in the blanks: When the Thunder played a strong opponent (instead of a weak opponent), they were (more/less) _____ likely to win, and they were (more/less) _____ likely to have a large crowd. Therefore __________ is a confounding variable that explains the association between __________ and __________.

4.2: Observational Studies vs. Experiments

Observational Studies

  • In an observational study, the groups you compare are “just there.” They are defined by what you see rather than by what you do.
  • For example, the researchers didn’t control which children slept with a night light on or not.
  • Observational studies may have confounding variables present that prevent us from determining a cause and effect.

Experiment

  • In an experiment, you actively create the groups and then assign the conditions to be compared.
  • These conditions may be one or more treatments or a control.
  • Well-designed experiments can control for confounding variables by making the treatment groups very similar except for what the experimenter manipulates.
  • Well-designed experiments use randomization to assign treatment and control groups.
    • Usually when we say “experiment” we mean “randomized experiment.”

Experiment: Physicians’ Health Study

Randomized Experiment: Physicians’ Health Study

Studied aspirin’s effect on reducing heart attacks.

  • Started in 1982 with 22,071 male physicians.
  • The physicians were randomly assigned into one of two groups.
  • Half took a 325 mg aspirin every other day and half took a placebo.

Aspirin Study Results

  • 189 (1.7%) heart attacks occurred in the placebo group and 104 (0.9%) in the aspirin group. (45% reduction in heart attacks for the aspirin group.)
  • What about confounding variables?
    • Did they have a better diet?
    • Did they exercise more?
    • Were they genetically less likely to have heart attacks?
    • Were they younger?
  • Confounding variables are controlled in experiments due to random assignment of subjects into treatment groups.
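
The percentages in the results can be checked with a few lines of arithmetic. This sketch assumes the two groups were exactly equal in size (22,071 / 2 each), since the slides only say "half"; the variable names are chosen for this example.

```r
n_total <- 22071
n_group <- n_total / 2   # assumption: equal-sized groups ("half" and "half")

attacks_placebo <- 189
attacks_aspirin <- 104

# Heart attack rate in each group
rate_placebo <- attacks_placebo / n_group
rate_aspirin <- attacks_aspirin / n_group
round(100 * c(placebo = rate_placebo, aspirin = rate_aspirin), 1)

# Relative reduction in the heart attack rate for the aspirin group
reduction <- 1 - rate_aspirin / rate_placebo
round(100 * reduction)
```

With equal group sizes the group size cancels, so the relative reduction is simply 1 - 104/189, or about 45%.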

Random Assignment

  • Confounding variables are controlled in experiments due to the random assignment of subjects to treatment groups.
  • Randomly assigning people to groups tends to balance out all other variables between the groups.
  • So variables that could have an effect on the response should be equalized between the two groups and therefore should not be confounding.
  • Thus, cause and effect conclusions are possible in experiments through random assignment. (It must be a well-run experiment.)
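
The balancing effect of random assignment can be illustrated with a small simulation. Everything here is hypothetical: the 24 subjects, their simulated heights, and the group labels "A" and "B" are invented for this sketch and are not data from any study in these slides.

```r
set.seed(4)  # arbitrary seed so the sketch is reproducible

# Hypothetical subjects: 24 people with simulated heights in inches
heights <- rnorm(24, mean = 67, sd = 4)

# One random assignment: shuffle 12 "A" and 12 "B" labels, then
# return the difference in mean height between the two groups
diff_in_means <- function(x) {
  group <- sample(rep(c("A", "B"), each = length(x) / 2))
  mean(x[group == "A"]) - mean(x[group == "B"])
}

# Repeat the random assignment many times
diffs <- replicate(5000, diff_in_means(heights))

mean(diffs)   # close to 0: no systematic difference between groups
hist(diffs, main = "Difference in mean height (A - B)")
```

Any single random assignment can leave one group somewhat taller, but the differences are centered at zero, so height does not systematically favor either group. The same reasoning applies to variables nobody measured.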

Random vs. Random

  • With observational studies, random sampling is often done. This possibly allows us to make inferences from the sample to the population from which the sample was drawn.
  • With experiments, random assignment is done. This allows us to possibly conclude causation.
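
In R, both kinds of randomness can be carried out with `sample()`, which makes the distinction concrete. This is a sketch with a made-up population of 1,000 ID numbers and a made-up study size of 20.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 1,000 ID numbers
population <- 1:1000

# Random sampling: chooses WHO is in the study
# (supports generalizing from the sample to the population)
study_sample <- sample(population, size = 20)

# Random assignment: chooses WHICH condition each subject gets
# (supports cause-and-effect conclusions)
condition <- sample(rep(c("treatment", "control"), each = 10))

data.frame(id = study_sample, condition = condition)
table(condition)
```

A study can use either step, both, or neither, and the conclusions it supports depend on which were used.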

Exploration 4.2

Tripping study

The goal of the tripping study is to compare two recovery strategies for tripping (elevating or lowering).

Tripping study variables

  • Recovery: recover from trip / don’t recover
  • Technique: elevating / lowering

Possible confounding variables:

  • Sex (male/female)
  • height (quantitative)
  • “balance gene” (have gene / don’t have gene)
  • x-variables (things we didn’t/couldn’t measure)

Randomizing Subjects

Open the Randomizing Subjects applet. This applet will simulate how the two groups were formed for the lifting/lowering tripping experiment.

  1. Press Randomize once. What does pressing Randomize do? Notice that a dot appears on the dotplot. Where is it located? Record the location of the dot. How is the location of the dot computed? (Hint: look at the label under the dotplot.) Write down a numeric expression showing how the location of the dot is computed.

Run 500 Replications

  1. Uncheck Animate. Set Replications to 500 and Randomize to create a dotplot.
  2. Select Reveal Both.
  3. Compare the different drop-down options (sex, height, gene, x-var). What do the dotplots all have in common? Name at least two things that they have in common.

Differences in randomized groups

The dotplots are illustrating how two randomized groups will be different, in the long run.

  1. Regardless of the variable we measure (sex, height, “balance gene”, or something we haven’t even considered like “x-var”), on average, in the long run, what is the difference in the observed statistics for each group?

  2. What does your answer to the previous question imply about the differences between two randomly assigned groups?

What is the best explanation?

  1. Suppose one of the randomly assigned groups used the lifting strategy, while the other used the lowering strategy. Suppose that the lifting strategy group fell much less often. What is the best explanation for why they fell less often?
    • They were shorter.
    • There were more females in the lifting group.
    • There were more people with the “balance gene” in the lifting group.
    • The lifting group had more people with some other “x-variable” that made them fall less often.
    • Lifting is a better strategy than lowering.

Cursive vs. Block Letters

Among students who took the essay portion of the SAT, those who wrote in cursive style scored significantly higher on the essay, on average, than students who used printed block letters.

  1. What are the explanatory and response variables?

  2. Is this an observational study or an experiment?

  3. Identify a possible confounding variable.

  4. Based on this data, does cursive writing cause an essay score to be higher?

Bonus Video: Dutton and Aron, 1974

Investigation 4 deals with the study in this video. Watch it and decide whether it is an experiment or an observational study, and what confounding variables might be present.