










How to find confidence intervals for the difference of means when the variances are unknown. It covers the use of the Student's t distribution and the Welch-Satterthwaite equation. An example using data from two groups of mosquitoes is provided.
Our approach to estimation thus far has been to use a method to find an estimator, e.g., method of moments or maximum likelihood, and to evaluate the quality of the estimator through its bias and variance. Often, we know more about the distribution of the estimator, and this allows us to make a more comprehensive statement about the estimation procedure. Interval estimation is an alternative to the variety of techniques we have examined. Given data $x$, we replace the point estimate $\hat\theta(x)$ for the parameter $\theta$ by a statistic that is a subset $\hat C(x)$ of the parameter space. We will consider both the classical and Bayesian approaches to choosing $\hat C(x)$. As we shall learn, the two approaches have very different interpretations.
In this case, the random set $\hat C(X)$ is chosen to have a prescribed high probability, $\gamma$, of containing the true parameter value $\theta$. In symbols,
$$P_\theta\{\theta \in \hat C(X)\} = \gamma.$$
Figure 16.1: Upper tail critical values. $\alpha$ is the area under the standard normal density to the right of the vertical line at the critical value $z_\alpha$.
In this case, the set $\hat C(x)$ is called a $\gamma$-level confidence set. In the case of a one-dimensional parameter set, the typical choice of confidence set is a confidence interval
$$\hat C(x) = (\hat\theta_\ell(x), \hat\theta_u(x)).$$
Often this interval takes the form
$$\hat C(x) = (\hat\theta(x) - m(x),\ \hat\theta(x) + m(x)) = \hat\theta(x) \pm m(x)$$
where the two statistics are the point estimate $\hat\theta(x)$ and the margin of error $m(x)$.
Example 16.1 (1-sample z interval). Let $X_1, X_2, \ldots, X_n$ be independent normal random variables with unknown mean $\mu$ but known variance $\sigma_0^2$. Then,
$$\frac{\bar X - \mu}{\sigma_0/\sqrt{n}}$$
is a standard normal random variable. For any $\alpha$ between 0 and 1, let $z_\alpha$ satisfy
$$P\{Z > z_\alpha\} = \alpha \quad\text{or equivalently}\quad P\{Z \le z_\alpha\} = 1 - \alpha.$$
The value $\alpha$ is known as the upper tail probability with critical value $z_\alpha$. We can compute this in R using, for example
> qnorm(0.975)
[1] 1.959964
for $\alpha = 0.025$. If $\gamma = 1 - 2\alpha$, then $\alpha = (1-\gamma)/2$. In this case, we have that
$$P\{-z_\alpha < Z < z_\alpha\} = \gamma.$$
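The relation between the confidence level $\gamma$, the tail probability $\alpha$ and the critical value $z_\alpha$ can be checked numerically. The following is a minimal sketch; the helper name `critical_z` is hypothetical and not part of the text.

```r
# Upper-tail critical value z_alpha for a gamma-level two-sided interval.
# `critical_z` is an illustrative helper name, not from the text.
critical_z <- function(gamma) {
  alpha <- (1 - gamma) / 2   # upper tail probability
  qnorm(1 - alpha)           # z_alpha satisfies P{Z <= z_alpha} = 1 - alpha
}

z <- critical_z(0.95)               # same value as qnorm(0.975)
coverage <- pnorm(z) - pnorm(-z)    # P{-z_alpha < Z < z_alpha}
print(c(z = z, coverage = coverage))
```

The printed coverage recovers the chosen $\gamma$, which is the identity $\gamma = 1 - 2\alpha$ read in reverse.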
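Let $\mu_0$ be the state of nature. Taking in turn each of the two inequalities in the line above and isolating $\mu_0$, we find that
$$\frac{\bar X - \mu_0}{\sigma_0/\sqrt{n}} = Z < z_\alpha \quad\text{implies}\quad \bar X - \mu_0 < z_\alpha \frac{\sigma_0}{\sqrt{n}} \quad\text{and so}\quad \bar X - z_\alpha \frac{\sigma_0}{\sqrt{n}} < \mu_0.$$
Similarly,
$$\frac{\bar X - \mu_0}{\sigma_0/\sqrt{n}} = Z > -z_\alpha \quad\text{implies}\quad \mu_0 < \bar X + z_\alpha \frac{\sigma_0}{\sqrt{n}}.$$
Thus
$$\bar X - z_\alpha \frac{\sigma_0}{\sqrt{n}} < \mu_0 < \bar X + z_\alpha \frac{\sigma_0}{\sqrt{n}}$$
has probability $\gamma$. Consequently, for data $x$,
$$\bar x \pm z_{(1-\gamma)/2} \frac{\sigma_0}{\sqrt{n}}$$
is a confidence interval with confidence level $\gamma$. In this case, $\hat\mu(x) = \bar x$ is the estimate for the mean and $m(x) = z_{(1-\gamma)/2}\,\sigma_0/\sqrt{n}$ is the margin of error.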
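The z interval and its probability statement can be illustrated with a short simulation. This is a sketch under stated assumptions: the function name `z_interval`, the seed, and the values of `mu0`, `sigma0` and `n` are all chosen for illustration and do not come from the text.

```r
# gamma-level z interval for a mean when the standard deviation sigma0 is known.
# `z_interval` is an illustrative name, not from the text.
z_interval <- function(x, sigma0, gamma = 0.95) {
  n <- length(x)
  m <- qnorm(1 - (1 - gamma) / 2) * sigma0 / sqrt(n)   # margin of error
  c(lower = mean(x) - m, upper = mean(x) + m)
}

# Simulation: how often does the random interval contain the true mean mu0?
set.seed(1)                         # arbitrary seed
mu0 <- 5; sigma0 <- 2; n <- 25      # illustrative values
covered <- replicate(10000, {
  ci <- z_interval(rnorm(n, mean = mu0, sd = sigma0), sigma0)
  ci[["lower"]] < mu0 && mu0 < ci[["upper"]]
})
coverage_rate <- mean(covered)      # should be close to gamma = 0.95
print(coverage_rate)
```

The interval is random while $\mu_0$ is fixed, so the simulated fraction estimates the probability that the interval captures $\mu_0$.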
We can use the z-interval above as a confidence interval for $\mu$ for data that are not necessarily normally distributed, as long as the central limit theorem applies. For one-population procedures for means, $n > 30$ with data that are not strongly skewed is a good rule of thumb.
Generally, the standard deviation is not known and must be estimated. So, let $X_1, X_2, \ldots, X_n$ be normal random variables with unknown mean and unknown standard deviation. Let $S^2$ be the unbiased sample variance. If we are forced to replace the unknown variance $\sigma^2$ with its unbiased estimate $s^2$, then the statistic is known as $t$:
$$t = \frac{\bar x - \mu}{s/\sqrt{n}}.$$
The term $s/\sqrt{n}$, which estimates the standard deviation of the sample mean, is called the standard error. The remarkable discovery by William Gosset is that the distribution of the $t$ statistic can be determined exactly. Write
$$T_{n-1} = \frac{\sqrt{n}(\bar X - \mu)}{S}.$$
Then, Gosset was able to establish the following three facts:
Figure 16.3: Upper critical values for the t confidence interval with $\gamma = 0.90$ (black), 0.95 (red), 0.98 (magenta) and 0.99 (blue) as a function of df, the number of degrees of freedom. Note that these critical values decrease with df toward the critical value for the z confidence interval, and increase with $\gamma$.
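The behavior described in the caption can be tabulated directly with `qt` and `qnorm`. The following sketch uses an illustrative grid of degrees of freedom; the variable names are not from the text.

```r
# Upper critical values of the t distribution for several confidence levels.
df <- c(2, 5, 10, 30, 100, 1000)        # illustrative degrees of freedom
gammas <- c(0.90, 0.95, 0.98, 0.99)     # confidence levels from the figure

# Rows index df, columns index the confidence level gamma.
crit <- sapply(gammas, function(g) qt(1 - (1 - g) / 2, df))
dimnames(crit) <- list(df = df, gamma = gammas)
print(round(crit, 3))

# Limiting z critical values for the same confidence levels.
z_limit <- qnorm(1 - (1 - gammas) / 2)
print(round(z_limit, 3))
```

Reading down a column shows the critical value shrinking toward the corresponding z value; reading across a row shows it growing with $\gamma$.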
Thus, the interval is
$$\bar x \pm t_{199,(1-\gamma)/2}\,\frac{s}{\sqrt{200}} = 2.490 \pm 0.099 \quad\text{or}\quad (2.391, 2.589).$$
Example 16.3. We can obtain the data for the Michelson-Morley experiment using R by typing
data(morley)
The data have 100 rows - 5 experiments (column 1) of 20 runs (column 2). The Speed is in column 3. The values for speed are the amounts over 299,000 km/sec. Thus, a t-confidence interval will have 99 degrees of freedom. We can see a histogram by writing hist(morley$Speed). To determine a 95% confidence interval, we find
> mean(morley$Speed)
[1] 852.4
> sd(morley$Speed)
[1] 79.01055
> qt(0.975,99)
[1] 1.984217
Thus, our confidence interval for the speed of light is
$$299{,}852.4 \pm 1.9842 \cdot \frac{79.01}{\sqrt{100}} = 299{,}852.4 \pm 15.7$$
or the interval $(299{,}836.7,\ 299{,}868.1)$.
This confidence interval does not include the presently determined value of 299,792.458 km/sec for the speed of light. The confidence interval can also be found by typing t.test(morley$Speed). We will study this command in more detail when we describe the t-test.
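As a check, the interval can be computed by hand and compared with the output of `t.test`. This is a minimal sketch; the variable names are illustrative, and the offset of 299,000 km/sec is the one described for the data.

```r
# 95% t interval for the Michelson-Morley speed measurements.
data(morley)
x <- morley$Speed                     # speeds in km/sec above 299,000
n <- length(x)

# By hand: mean plus or minus critical value times standard error.
m <- qt(0.975, n - 1) * sd(x) / sqrt(n)
by_hand <- 299000 + mean(x) + c(-m, m)

# Same interval taken from t.test (default conf.level = 0.95).
from_t_test <- 299000 + as.numeric(t.test(x)$conf.int)

print(round(by_hand, 1))
print(round(from_t_test, 1))
```

Changing `0.975` in `qt`, or the `conf.level` argument of `t.test`, gives intervals at other confidence levels.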
Figure 16.4: Histogram of morley$Speed (horizontal axis: morley$Speed, vertical axis: Frequency). Measurements of the speed of light. Actual values are 299,000 kilometers per second plus the value shown.
Exercise 16.4. Give a 90% and a 98% confidence interval for the example above.
We often wish to determine a sample size that will guarantee a desired margin of error. For a $\gamma$-level t-interval, this is
$$m = t_{n-1,(1-\gamma)/2}\,\frac{s}{\sqrt{n}}.$$
Solving this for $n$ yields
$$n = \left(\frac{t_{n-1,(1-\gamma)/2}\; s}{m}\right)^2.$$
Because the number of degrees of freedom, $n-1$, for the t distribution is unknown, the quantity $n$ appears on both sides of the equation, and the value of $s$ is unknown as well. We search for a conservative value for $n$, i.e., one giving a margin of error that will be no greater than the desired length. This can be achieved by overestimating $t_{n-1,(1-\gamma)/2}$ and $s$. For the speed of light example above, if we desire a margin of error of $m = 10$ km/sec for a 95% confidence interval, then we set $t_{n-1,(1-\gamma)/2} = 2$ and $s = 80$ to obtain
$$n \approx \left(\frac{2 \cdot 80}{10}\right)^2 = 256$$
measurements as the number necessary to obtain the desired margin of error.
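The conservative calculation can be reproduced and then checked against the exact critical value. The helper `sample_size` below is a hypothetical name used for illustration; the inputs $m = 10$, $s = 80$ and the overestimate $t = 2$ are those of the example.

```r
# Conservative sample size from n = (t * s / m)^2, rounded up.
# `sample_size` is an illustrative helper name, not from the text.
sample_size <- function(m, s, t_crit = 2) {
  ceiling((t_crit * s / m)^2)
}

n_conservative <- sample_size(m = 10, s = 80)

# Check: with this n, the margin computed from the exact t critical value
# (still assuming s = 80) should not exceed the target of 10 km/sec.
margin_check <- qt(0.975, n_conservative - 1) * 80 / sqrt(n_conservative)
print(c(n = n_conservative, margin = margin_check))
```

Because the exact critical value for this many degrees of freedom is below 2, the achieved margin falls slightly under the target, which is what "conservative" means here.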
The next set of confidence intervals are determined, in the case in which the distributional variance is known, by finding the standardized score and using the normal approximation as given via the central limit theorem. In the cases in which the variance is unknown, we replace the distribution variance with a variance that is estimated from the observations. In this case, the procedure that is analogous to the standardized score is called the studentized score.
Example 16.5 (matched pair t interval). We begin with two quantitative measurements
$$(X_{1,1}, \ldots, X_{1,n}) \quad\text{and}\quad (X_{2,1}, \ldots, X_{2,n}),$$
on the same $n$ individuals. Assume that the first set of measurements has mean $\mu_1$ and the second set has mean $\mu_2$.
no longer has a t-distribution. Welch and Satterthwaite have provided an approximation to the t distribution with effective degrees of freedom $\nu$ given by the Welch-Satterthwaite equation
$$\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{s_1^4}{n_1^2 (n_1 - 1)} + \dfrac{s_2^4}{n_2^2 (n_2 - 1)}}.$$
This gives a $\gamma$-level confidence interval
$$\bar x_1 - \bar x_2 \pm t_{\nu,(1-\gamma)/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.$$
For two sample tests, the number of observations per group may need to be at least 40 for a good approximation to the normal distribution.
Exercise 16.8. Show that the effective degrees of freedom lie between the worst case of the minimum choice from a one-sample t-interval and the best case of equal variances:
$$\min\{n_1, n_2\} - 1 \le \nu \le n_1 + n_2 - 2.$$
For data on the life span in days of 88 wildtype and 99 transgenic mosquitoes, we have the summary

              observations    mean      standard deviation
wildtype           88        20.784          12.99
transgenic         99        16.546          10.78
Using the conservative 95% confidence interval based on $\min\{n_1, n_2\} - 1 = 87$ degrees of freedom, we use
> qt(0.975,87)
[1] 1.987608
to obtain the interval
$$(20.78 - 16.55) \pm 1.9876 \sqrt{\frac{12.99^2}{88} + \frac{10.78^2}{99}}.$$
Using the Welch-Satterthwaite equation, we obtain $\nu = 169.665$. The increase in the number of degrees of freedom gives a slightly narrower interval $(0.768, 7.710)$.
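The two intervals for the mosquito data can be sketched in R from the summary statistics. The function names `welch_df` and `two_sample_interval` are hypothetical. The standard deviations 12.99 and 10.78 are assumed two-decimal values, chosen because they are consistent with the stated $\nu$; since the inputs are rounded, the results agree with the text only approximately.

```r
# Effective degrees of freedom from the Welch-Satterthwaite equation.
# `welch_df` and `two_sample_interval` are illustrative names, not from the text.
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  # v1^2 / (n1 - 1) equals s1^4 / (n1^2 * (n1 - 1)), and likewise for v2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

# Two-sample t interval for mu1 - mu2 with a given number of degrees of freedom.
two_sample_interval <- function(xbar1, s1, n1, xbar2, s2, n2, df, gamma = 0.95) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  m <- qt(1 - (1 - gamma) / 2, df) * se
  c(lower = xbar1 - xbar2 - m, upper = xbar1 - xbar2 + m)
}

# Summary statistics; the standard deviations are assumed rounded values.
n1 <- 88; xbar1 <- 20.784; s1 <- 12.99    # wildtype
n2 <- 99; xbar2 <- 16.546; s2 <- 10.78    # transgenic

nu <- welch_df(s1, n1, s2, n2)
conservative <- two_sample_interval(xbar1, s1, n1, xbar2, s2, n2, df = min(n1, n2) - 1)
welch <- two_sample_interval(xbar1, s1, n1, xbar2, s2, n2, df = nu)

print(nu)
print(round(rbind(conservative, welch), 3))
```

With the raw data available, `t.test(x, y)` performs the Welch procedure by default, so this hand computation is mainly useful when only summary statistics are at hand.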
For ordinary linear regression, we have given least squares estimates for the slope $\beta$ and the intercept $\alpha$. For data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, our model is
$$y_i = \alpha + \beta x_i + \epsilon_i$$
where $\epsilon_i$ are independent $N(0, \sigma)$ random variables. Recall that the estimator for the slope