Section 10.1-2

April 13, 2021

Chapter 10: Two Quantitative Variables

In-Person Student Hours start today

Same time as usual: T/Th from 1:30-4pm
Place: patio outside the Math/CS learning lounge
- This is the small patio on the side of Winter Hall that overlooks the Gym.
- GPS coordinates: 34.44882425194405,-119.66235682106694

Last Week

Chapter 8: Categorical explanatory \(\rightarrow\) Categorical response
Chapter 9: Categorical explanatory \(\rightarrow\) Quantitative response

Investigation 8-9

Clip from This American Life
(Start at 38:54, End at 42:02)
Clip Editor

Was there a trend?

drivers <- read.table("http://math.westmont.edu/ma5/carstatus.txt", header = TRUE, stringsAsFactors = TRUE)
plot(Behavior ~ Status, data = drivers)

Chi-Square test limitations

It seemed like there was a trend: higher-status people yielded less.
However, the chi-square test was not statistically significant (p-value > 0.2).
The chi-square test treats the 5 status levels as unordered categories (a categorical explanatory variable), so it ignores their order.
We need other tests that treat the explanatory variable as quantitative.
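
To illustrate why the chi-square test cannot pick up an ordered trend, here is a minimal sketch in R. The counts are made up for illustration (they are not the carstatus data). Scrambling the order of the status columns leaves the chi-square statistic unchanged, so the test cannot be using the ordering.

```r
# Hypothetical counts (NOT the carstatus data): rows = behavior, columns = status 1-5
counts <- matrix(c(30, 28, 25, 20, 18,
                   10, 12, 15, 20, 22),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Behavior = c("yield", "no_yield"),
                                 Status = as.character(1:5)))

# Same counts with the status columns in a scrambled order
scrambled <- counts[, c(3, 1, 5, 2, 4)]

stat_ordered   <- unname(chisq.test(counts)$statistic)
stat_scrambled <- unname(chisq.test(scrambled)$statistic)

# The two statistics agree: the test never uses the order of the levels
print(c(ordered = stat_ordered, scrambled = stat_scrambled))
```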

Chapter 10 Overview

Quantitative explanatory \(\rightarrow\) Quantitative response
How strong is the relationship?
How can we model the relationship?
How significant is the evidence for the relationship?

Section 10.1: Scatterplots and Correlation

How is the time a student spends on an exam related to the student’s score on the exam?

##    time score
## 1    30   100
## 2    41    84
## 3    41    94
## 4    43    90
## 5    47    88
## 6    48    99
## 7    51    85
## 8    54    84
## 9    54    94
## 10   56   100
## 11   56    65
## 12   56    64
## 13   57    65
## 14   58    89
## 15   58    83
## 16   60    85
## 17   61    86
## 18   61    92
## 19   62    74
## 20   63    73
## 21   64    75
## 22   66    53
## 23   66    91
## 24   69    85
## 25   72    62
## 26   78    68
## 27   79    72
## 28   93    93
## 29   96    93
## 30  100    97

Scatterplot

Explanatory variable on horizontal axis
Response variable on vertical axis
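
As a sketch of how this looks in R, the block below rebuilds the 30 rows printed in the table and plots them with the same formula interface used earlier in these slides (response ~ explanatory). The data frame name `examtimes` matches the slides; the axis labels are my own wording.

```r
# The 30 rows of the exam table, typed in by hand
examtimes <- data.frame(
  time  = c(30, 41, 41, 43, 47, 48, 51, 54, 54, 56, 56, 56, 57, 58, 58,
            60, 61, 61, 62, 63, 64, 66, 66, 69, 72, 78, 79, 93, 96, 100),
  score = c(100, 84, 94, 90, 88, 99, 85, 84, 94, 100, 65, 64, 65, 89, 83,
            85, 86, 92, 74, 73, 75, 53, 91, 85, 62, 68, 72, 93, 93, 97)
)

# score ~ time puts the explanatory variable (time) on the horizontal axis
# and the response variable (score) on the vertical axis
plot(score ~ time, data = examtimes,
     xlab = "Time spent on exam", ylab = "Exam score")
```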

Describing Scatterplots

When we describe data in a scatterplot, we describe the following:

Direction (positive or negative)
Form (linear or not)
Strength (strong, moderate, or weak; we will let correlation help us decide)
Unusual Observations

How would you describe the time and test scatterplot?

Correlation coefficient \(r\)

Correlation measures the strength and direction of a linear association between two quantitative variables.
Correlation is a number \(r\) between -1 and 1.
With positive correlation, one variable increases, on average, as the other increases.
With negative correlation, one variable decreases, on average, as the other increases.
The closer \(r\) is to -1 or 1, the more closely the points follow a line.
The correlation for the exam data is -0.56 (computed from rows 1-27 of the table; rows 28-30 are the three unusual points added later).

Guidelines for Correlation

\[ r= \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i -\bar{y}}{s_y} \right) \]

range of \(r\)	Strength	Meaning
\(0.7 \leq \lvert r \rvert \leq 1\)	Strong	Points almost form a line.
\(0.3 \leq \lvert r \rvert < 0.7\)	Moderate	Clear pattern, but bloblike.
\(0.1 \leq \lvert r \rvert < 0.3\)	Weak	Slight pattern.
\(0 \leq \lvert r \rvert < 0.1\)	None	No discernible trend.

Note: \(-1\leq r \leq 1\) always.
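
The formula above can be checked step by step in R. This is a minimal sketch using rows 1-27 of the exam table (the original data, before the three unusual points); the variable names `z_time`, `z_score`, and `r_by_hand` are mine.

```r
# Exam data, rows 1-27 of the table
time  <- c(30, 41, 41, 43, 47, 48, 51, 54, 54, 56, 56, 56, 57, 58, 58,
           60, 61, 61, 62, 63, 64, 66, 66, 69, 72, 78, 79)
score <- c(100, 84, 94, 90, 88, 99, 85, 84, 94, 100, 65, 64, 65, 89, 83,
           85, 86, 92, 74, 73, 75, 53, 91, 85, 62, 68, 72)
n <- length(time)

# Standardize each variable (sd() uses the n - 1 denominator)
z_time  <- (time  - mean(time))  / sd(time)
z_score <- (score - mean(score)) / sd(score)

# r is the sum of the products of the z-scores, divided by n - 1
r_by_hand <- sum(z_time * z_score) / (n - 1)

print(r_by_hand)         # about -0.56
print(cor(time, score))  # the built-in function gives the same value
```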

Sensitivity

Original data (rows 1-27 of the table): \(r =\) -0.5636557

Sensitivity

Add 3 “unusual” points (rows 28-30 of the table): \(r =\) -0.124997

Influential Observations

The correlation changed from -0.56 (a moderate negative correlation) to -0.12 (a weak negative correlation).
Points that are far to the left or right of the rest and do not follow the overall direction of the scatterplot can change the correlation greatly. Such points are called influential observations.
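
The change in correlation can be reproduced in R. In this sketch, rows 1-27 of the exam table are the original data and rows 28-30 are the three unusual points; the names `time_all`, `score_all`, `r_original`, and `r_with_points` are mine.

```r
# Original exam data (rows 1-27)
time  <- c(30, 41, 41, 43, 47, 48, 51, 54, 54, 56, 56, 56, 57, 58, 58,
           60, 61, 61, 62, 63, 64, 66, 66, 69, 72, 78, 79)
score <- c(100, 84, 94, 90, 88, 99, 85, 84, 94, 100, 65, 64, 65, 89, 83,
           85, 86, 92, 74, 73, 75, 53, 91, 85, 62, 68, 72)

# Add the three unusual points (rows 28-30): long times with high scores
time_all  <- c(time, 93, 96, 100)
score_all <- c(score, 93, 93, 97)

r_original    <- cor(time, score)
r_with_points <- cor(time_all, score_all)

# Three points far to the right pull r from about -0.56 to about -0.12
print(round(c(original = r_original, with_points = r_with_points), 2))
```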

Correlation in R

cor(examtimes)

##            time     score
## time   1.000000 -0.124997
## score -0.124997  1.000000

examtimes

##    time score
## 1    30   100
## 2    41    84
## 3    41    94
## 4    43    90
## 5    47    88
## 6    48    99
## 7    51    85
## 8    54    84
## 9    54    94
## 10   56   100
## 11   56    65
## 12   56    64
## 13   57    65
## 14   58    89
## 15   58    83
## 16   60    85
## 17   61    86
## 18   61    92
## 19   62    74
## 20   63    73
## 21   64    75
## 22   66    53
## 23   66    91
## 24   69    85
## 25   72    62
## 26   78    68
## 27   79    72
## 28   93    93
## 29   96    93
## 30  100    97

Basic Facts about Correlation

\(r= \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i -\bar{y}}{s_y} \right)\)
Correlation measures the strength and direction of a linear association between two quantitative variables.
\(-1 \leq r \leq 1\)
Correlation makes no distinction between explanatory and response variables.
Correlation has no unit.
Correlation is not a resistant measure.
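
Two of these facts are easy to check in R. The sketch below uses only the first six rows of the exam table to keep it short. Swapping the roles of the two variables, or rescaling them (here multiplying time by 60 and dividing score by 100, both arbitrary choices of mine), leaves \(r\) unchanged.

```r
# First six rows of the exam table
time  <- c(30, 41, 41, 43, 47, 48)
score <- c(100, 84, 94, 90, 88, 99)

r_xy       <- cor(time, score)
r_yx       <- cor(score, time)              # swap explanatory and response
r_rescaled <- cor(time * 60, score / 100)   # change the units of both variables

# All three values agree
print(c(r_xy = r_xy, r_yx = r_yx, r_rescaled = r_rescaled))
```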

Practice with \(r\)

Enter the Height and Finger Length data into the Correlation/Regression Applet.
Experiment with the Correlation Guessing Game Applet.

Another Application: Strike Zone Data

Paper in JQAS on assessing umpire performance

Exploration 10.1

The Delboeuf Illusion

Preview

Research Question: Have dinner plates gotten bigger over recent years?
Quantitative Explanatory: Year plate was produced.
Quantitative Response: Size of plate

Correlation/Regression Applet

Paste the PlateSize data set into the Correlation/Regression applet. Which variable is on the x (horizontal) axis? Which variable is on the y (vertical) axis?

Scatterplots

A scatterplot is a graph showing a dot for each observational unit, where the location of the dot indicates the values of the observational unit for both the explanatory and response variables. Typically, the explanatory variable is placed on the x-axis and the response variable is placed on the y-axis.

Three aspects of association

Direction: Positive (uphill from left to right) or Negative (downhill)
Form: Approximately linear (roughly along a straight line) or would a curve describe the data better?
Strength: How closely do the points follow a line (clear pattern or fuzzy)?

Direction, Form, Strength

Is the association between year and plate size positive or negative? Record a complete sentence explaining what this means in context.
Does the association between year and size appear to be linear or nonlinear?
In your opinion, would you say that the association between plate size and year appears to be strong, moderate, or weak?

Complete this form with your answers.

Unusual Observations

Are there any observational units (dots on the scatterplot, representing individual plates) that seem to fall outside of the overall pattern? Record the \((x,y)\) coordinates of the observational unit you think is most unusual, and give a reason why you chose it.

Influential vs. Outliers

Two types of unusual observations:

Influential observation: Removing it from the data set dramatically changes our perception of the association (usually extreme in the explanatory (x) direction).
Outlier: Does not fit the overall pattern of the relationship (may also be influential).

Is the observational unit you chose in #5 an influential observation or an outlier, or both?

Correlation Coefficient

\[ r= \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i -\bar{y}}{s_y} \right) \]

range of \(r\)	Strength	Meaning
\(0.7 \leq \lvert r \rvert \leq 1\)	Strong	Points almost form a line.
\(0.3 \leq \lvert r \rvert < 0.7\)	Moderate	Clear pattern, but bloblike.
\(0.1 \leq \lvert r \rvert < 0.3\)	Weak	Slight pattern.
\(0 \leq \lvert r \rvert < 0.1\)	None	No discernible trend.

Note: \(-1\leq r \leq 1\) always.

Guess \(r\) and check

Will the value of the correlation coefficient for the year-plate size data be negative or positive? Why?
Without using the applet, give an estimated range for the value of the correlation coefficient \(r\) between plate size and year based on the scatterplot.
Now, check the Correlation coefficient box in the applet to reveal the actual value of the correlation coefficient \(r\).

Join a Jamboard group

Join one of groups 1-6 based on who you are sitting near and work on the corresponding page of the Jamboard. Answer questions 10 and 11 with your group.

Add an observation

In the Add/remove observations section of the applet, enter a year (x) of 1950 and plate size (y) of 11.5 and press Add. Note how this is an unusual observation on the scatterplot. Record the following:

How did the correlation coefficient \(r\) change as a result? (What was the old value? What was the new value?)
Was this observation influential?
Is this observation an outlier?

Delete some observations

You can delete observations from the data set by clicking on the dot in the scatterplot (so it turns red) and clicking the Delete button. Delete the observation you added in #10. Then try deleting some other observations to make the correlation coefficient \(r\) greater than 0.90. Once you get a scatterplot with \(r > 0.9\), paste a screenshot into the Jamboard. Write a sentence explaining how you decided which observations to delete to increase the value of \(r\).

Section 10.2: Inference for the Correlation Coefficient: Simulation-based approach

Body Temperature and Heart Rate

Is there a correlation between body temperature and heart rate?

Null Hypothesis: There is no association between body temperature and heart rate.
- \(H_0 : \rho = 0\)
Alternative Hypothesis: There is a positive linear correlation between heart rate and body temperature.
- \(H_a: \rho > 0\)

Randomness?

If there were no association between heart rate and body temperature, what is the probability that we would get a correlation of 0.378 or higher just by chance?
If there is no association, we can break apart the temperatures and their corresponding heart rates. We will do this by shuffling one of the variables. (The applet shuffles the response.)
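
The shuffling idea can be sketched in a few lines of R. The data below are simulated stand-ins, not the class data set, and the names `body_temp`, `heart_rate`, and `shuffled_r` are mine. This is only meant to mirror what the applet does as described here, not to reproduce its exact output.

```r
set.seed(2021)  # for reproducibility only

# Hypothetical data (NOT the class data set): 50 simulated people
body_temp  <- rnorm(50, mean = 98.2, sd = 0.7)
heart_rate <- 74 + 3 * (body_temp - 98.2) + rnorm(50, mean = 0, sd = 7)

observed_r <- cor(body_temp, heart_rate)

# Shuffle the response many times; each shuffle breaks any real association
shuffled_r <- replicate(1000, cor(body_temp, sample(heart_rate)))

# One-sided p-value for H_a: rho > 0
# (proportion of shuffles with a correlation at least as large as the observed one)
p_value <- mean(shuffled_r >= observed_r)
print(p_value)
```

Each call to `sample(heart_rate)` returns the same heart rates in a random order, which is the computer version of shuffling the response cards.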

Shuffling Simulations

With two proportions, we colored the response on the cards, shuffled the cards and placed them into two piles corresponding to the two categories of the explanatory variable.
With two means, we did the same thing, except this time the responses were numbers instead of colors.

Shuffle Using applet

Paste the data into the Correlation/Regression applet.

Larger example

Let’s look at a different (and larger) data set comparing temperature and heart rate (Example 10.5A) using the Correlation/Regression applet.

Exploration 10.2

Vietnam War Draft Lottery

During the Vietnam War, young men in the US were drafted into the Army in an order determined by a random lottery.

Randomness or Pattern?

cor(draft$sequential_date, draft$draft_number)

## [1] -0.2260414

Could this correlation coefficient just be a product of randomness?

library(ggplot2)
ggplot(draft, aes(x = sequential_date, y = draft_number)) + geom_point()

Hypotheses

Was the draft order truly random? A “fair” draft should produce a correlation coefficient close to zero.

Null Hypothesis: There is no association between sequential date and draft number.
- \(H_0 : \rho = 0\)
Alternative Hypothesis: There is some linear correlation between sequential date and draft number.
- \(H_a : \rho \neq 0\)

Simulation using Cards

Each of the 366 birthdays in a year (including February 29) was assigned a draft number.

Work together with your proximity group to answer questions 1 and 4 on the Jamboard.

Describe how you would conduct a simulation of a null distribution using cards.

How many cards would you need?
What would you write on the cards?
After you shuffle the cards, what would you do to create a point on the null distribution? What calculations would you need to make?

Simulation using Applet

Paste the DraftLottery data set into the Correlation/Regression applet. Check the Correlation Coefficient box and confirm that the correlation coefficient is -0.226.

Check the Show Shuffle Options box and select the Correlation radio button. Then press Shuffle Y-values to simulate one fair, random lottery. Record the value of the correlation coefficient.
Press Shuffle Y-values four more times to generate results of four more fair, random lotteries.
Compare your shuffled statistics with the observed one.

Build a Null Distribution

Generate a null distribution with at least 1000 simulated statistics. Where is this distribution centered? Why does this make sense?
Use the null distribution to obtain a p-value for this test. Paste a screenshot of the null distribution into the Jamboard, and record the p-value, along with a sentence explaining what it means in the context of this study.
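
For comparison with the applet, the same null distribution can be sketched in R. This assumes a fair lottery is simply a random ordering of the draft numbers 1 to 366 across the 366 sequential dates; the helper `one_fair_lottery` and the other names are mine.

```r
set.seed(1970)  # for reproducibility only

days <- 1:366   # sequential dates, including February 29

# One fair lottery: a random ordering of the draft numbers 1 to 366,
# summarized by its correlation with the sequential date
one_fair_lottery <- function() cor(days, sample(366))

# Null distribution from 1000 simulated fair lotteries
null_r <- replicate(1000, one_fair_lottery())

print(mean(null_r))  # centered near 0

# Two-sided p-value: proportion of fair lotteries at least as extreme as -0.226
p_value <- mean(abs(null_r) >= 0.226)
print(p_value)
```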

What happened?

The irregularity can be attributed to improper mixing of the balls used in the lottery drawing process. (Balls with birthdays early in the year were placed in the bin first, and balls with birthdays late in the year were placed in the bin last. Without thorough mixing, balls with birthdays late in the year settled near the top of the bin and so tended to be selected earlier.)

1971

The following year, in 1971, the mixing process was improved. The correlation coefficient turned out to be \(r = 0.014\).

Use your simulation results to approximate the p-value for the 1971 draft lottery. Is there any reason to suspect that this 1971 draft lottery was not conducted with a fair, random process?

Explain the reasoning behind your conclusion.
Also explain why you don’t need to paste in the data from the 1971 lottery first.

Enter your explanations into this form.