Research Skills: One-way independent-measures ANOVA, Graham Hole, March 2009

One-way Independent-measures Analysis of Variance (ANOVA):

What is "Analysis of Variance"? Analysis of Variance, or ANOVA for short, is a whole family of statistical tests that are widely used by psychologists. This handout will (a) explain the advantages of using ANOVA; (b) describe the rationale behind how ANOVA works; (c) explain, step-by-step, how to do a simple ANOVA.

Why use ANOVA?

ANOVA is most often used when you have an experiment in which there are a number of groups or conditions, and you want to see if there are any statistically significant differences between them. Suppose we were interested in the effects of caffeine on memory. We could look at this experimentally as follows. We could have four different groups of participants, and give each group a different dosage of caffeine. Group A might get no caffeine (and hence act as a control group against which to compare the others); group B might get one milligram of caffeine; group C five milligrams; and group D ten milligrams. We could then give each participant a memory test, and thus get a score for each participant. Here's the data you might obtain:

Group A (0 mg)   Group B (1 mg)   Group C (5 mg)   Group D (10 mg)
      4                7               11               14
      3                9               15               12
      5               10               13               10
      6               11               11               15
      2                8               10               14
  mean = 4         mean = 9        mean = 12        mean = 13

How would we analyse these data? Looking at the means, it appears that caffeine has affected memory test scores. It looks as if the more caffeine that's consumed, the higher the memory score (although this trend tails off with higher doses of caffeine). What statistical test could we use to see if our groups truly differed in terms of their performance on our memory test? We could perform lots of independent-measures t-tests, in order to compare each group with every other. So, we could do a t-test to compare group A with group B; another t-test to compare group A with group C; yet another to compare group A with group D; and so on. The problem with this is that you would end up doing a lot of t-tests on the same data. With four groups, you would have to do six t-tests to compare each group with every other one:

A with B, A with C, A with D; B with C, B with D; C with D.

With five groups you would have to do ten tests, and with six groups, fifteen tests! The problem with doing lots of tests on the same data like this, is that you run an increased risk of getting a "significant" result purely by chance: a so-called "Type 1" error.
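To double-check these counts, here is a minimal sketch in Python (used purely for illustration; the handout itself relies on SPSS) of the number of pairwise comparisons, k(k − 1)/2, needed for k groups:

    # Number of pairwise t-tests needed to compare every group with every other
    from math import comb

    for k in (4, 5, 6):
        print(f"{k} groups -> {comb(k, 2)} pairwise t-tests")
    # 4 groups -> 6 pairwise t-tests
    # 5 groups -> 10 pairwise t-tests
    # 6 groups -> 15 pairwise t-tests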

Revision of Type 1 and Type 2 Errors:

Remember that every time you do a statistical test, you run the risk of making one of two kinds of error:

(a) "Type 1" error: deciding there is a real difference between your experimental conditions when in fact the difference has arisen merely by chance. In statistical jargon, this is known as rejecting the null hypothesis (that there's no difference between your groups) when in fact it is true. (You might also see this referred to as an "alpha" error).

(b) "Type 2" error: deciding that the difference between conditions is merely due to chance, when in fact it's a real difference. In the jargon, this is known as accepting the null hypothesis (that there's no difference between your groups) when in fact it is false. (You might see this referred to as a "beta" error).

The chances of making one or other of these errors are always with us, every time we run an experiment. If you try to reduce the risks of making one type of error, you increase the risk of making the other. For example, if you decide to be very cautious, and only accept a difference between groups as "real" when it is a very large difference, you will reduce your risk of making a type 1 error (accepting a difference as real when it's really just due to random fluctuations in performance). However, because you are being so cautious, you will increase your chances of making a type 2 error (dismissing a difference between groups as being due to random variation in performance, when in fact it is a genuine difference). Similarly, if you decide to be incautious, and decide that you will regard even very small differences between groups as being "real" ones, then you will reduce your chances of making a type 2 error (i.e., you won't often discount a real difference as being due to chance), but you will probably make lots of type 1 errors (lots of the differences you accept as "real" will have arisen merely by chance).

The conventional significance level of 0.05 represents a generally-accepted trade-off between the chances of making these two kinds of errors. If we do a statistical test, and the results are significant at the 0.05 level, what we are really saying is this: we are prepared to regard the difference between groups that has given rise to this result as being a real difference, even though, roughly five times in a hundred, such a result could arise merely by chance. The 0.05 refers to our chances of making a type 1 error.

ANOVA and the Type 1 error:

Hopefully, you should be able to see why doing lots of tests on the same data is a bad idea. Every time you do a test, you run the risk of making a type 1 error. The more tests you do on the same data, the more chance you have of obtaining a spuriously "significant" result. If you do a hundred tests, five of them are likely to give you "significant" results that are actually due to chance fluctuations in performance between the groups in your experiment. It's a bit like playing Russian Roulette: pull the trigger once, and you are quite likely to get away with it, but the more times you pull the trigger, the more likely you are to end up blowing your head off! (The results of making a type 1 error in a psychology experiment are a little less messy, admittedly). One of the main advantages of ANOVA is that it enables us to compare lots of groups all at once, without inflating our chances of making a type 1 error. Doing an ANOVA is rather like doing lots of t-tests all at once, but without the statistical disadvantages of doing so. (In fact, ANOVA and the t-test are closely related tests in many ways).
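To get a feel for how fast the risk grows, here is a rough sketch of the familywise Type 1 error rate. It assumes, for simplicity, that the tests are independent (repeated t-tests on the same data are not strictly independent, so treat the numbers as approximations rather than exact figures):

    # Chance of at least one spurious "significant" result across m tests at alpha = .05
    alpha = 0.05
    for m in (1, 3, 6, 10, 100):
        familywise = 1 - (1 - alpha) ** m
        print(f"{m:3d} tests -> P(at least one Type 1 error) = {familywise:.3f}")
    # 1 test -> 0.050;  6 tests -> 0.265;  10 tests -> 0.401;  100 tests -> 0.994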

Other advantages of ANOVA:

(a) ANOVA enables us to test for trends in the data: Looking at the mean scores in our caffeine data, it looks as if there is a trend: the more caffeine consumed, the better the memory performance. We can use ANOVA to see if trends in the data like this are "real" (i.e., unlikely to have arisen by chance). We don't have to confine ourselves to seeing if there is a simple "linear" trend in the data, either: we can test for more complicated trends (such as performance increasing and then decreasing, or performance increasing and then flattening off, etc.). This is second-year stuff, however...

(b) ANOVA enables us to look at the effects of more than one independent variable at a time: So far in your statistics education, you have only looked at the effects of one independent variable at a time. Moreover, you have largely been limited to making comparisons between just two levels of one IV. For example, you might use a t-test or a Mann-Whitney test to compare the memory performance of males and females (two levels of a single independent variable, "sex"). The only tests you have covered that enable you to compare more than two groups at a time are the Friedman and Kruskal-Wallis tests, but even these only enable you to deal with one IV at a time. The real power of ANOVA is that it enables you to look at the effects of more than one IV in a single experiment. So, for example, instead of just looking at "the effects on memory of caffeine dosage" (one IV, one DV), you could look at "sex differences in the effects on memory of caffeine dosage" (two IVs, "sex" and "caffeine dosage", but still one DV).

Systematic versus random variation:

How can we distinguish the variation in the set of scores that's due to our experimental manipulation from the variation in the scores that's produced by random differences between participants? In principle, the answer is simple: individual differences in performance are by their nature fairly random, and are therefore not likely to vary consistently between different groups in the experiment. However, the effects of our experimental manipulation should be consistently different between one group and another, because that is how we have administered them: everyone within a single group gets the same treatment from us.

Consider the scores in the table. They all vary, both within a particular group and also between groups. Variation in scores within a group can't be due to what we did to the participants, as we did exactly the same thing to all participants within a group. If there is variation within a group, it must be due to random factors outside our control. Variation between groups can, in principle, occur for two reasons: because of what we did to the participants (our experimental manipulations) and/or because of random variation between participants (by chance, we might happen to have more people with good memories in group D than we do in group A). However, as long as we take care to ensure that the only systematic difference between groups is our experimental manipulation, the variation between participants, being due to random factors, is unlikely to produce systematic effects on performance: it is unlikely to make one group perform consistently better or worse than another. Consistent (systematic) variation between the groups is more likely to be due to what we did to the participants: i.e., due to our experimental manipulation.

In short, variation within the groups of an experiment is due to random factors. Variation between the groups of an experiment can occur because of both random factors and the influence of our experimental manipulations; however, it is only likely to be large if it is due to the latter, as this is the only thing which varies systematically between the groups. Therefore, all we have to do is work out how much variation there is in a set of scores; find out how much of it comes from differences within groups; and find out how much of it comes from differences between groups. If the between-groups variation is large compared to the within-groups variation, we can be reasonably sure that our experimental manipulation has affected participants' performance.

Total variation amongst a set of scores = between-groups variation + within-groups variation

To compare the size of the between-groups variation to the within-groups variation, we simply divide one by the other: the larger the between-groups variation compared to the within-groups variation, the larger the number that will result. We will then look up this number (called an F-ratio) in a table to see how likely it is to have occurred by chance - in much the same way as you look up the results of a t-test, for example.

How we do all this in practice:

How we do this in practice is to assess variation within and between groups by using a statistical measure based on the variance of the scores. You have encountered the variance before, as an intermediate step in working out the standard deviation of a set of scores. (Remember that the standard deviation is a measure of the average amount of variation amongst a set of scores - it tells you how much scores are spread out around their mean. The variance is the standard deviation squared). The main reason we use the variance rather than the standard deviation is that it makes the arithmetic easier. (Variances can be added together, whereas standard deviations can't, because of the square-rooting that is added into the s.d. formula in order to return the s.d. to the same units as the original scores and their mean).

Here is the formula for the variance:

    variance = Σ(X − X̄)² / N

In English, this means do the following:

1. Take a set of scores (e.g. one of the groups from the table), and find their mean.
2. Find the difference between each of the scores and the mean.
3. Square each of these differences (because otherwise they will add up to zero).
4. Add up the squared differences.

Normally, you would then divide this sum by the number of scores, N, in order to get an average deviation of the scores from the group mean - i.e., the variance. However, in ANOVA, we will want to take into account the number of participants and number of groups we have. Therefore, in practice we will only use the top line of the variance formula (called the "Sum of Squares", or "SS" for short):

    SS = Σ(X − X̄)²

We will divide this not by the number of scores, but by the appropriate "degrees of freedom" (which is usually the number of groups or participants minus 1). More details on this below.

Earlier, I said that the total variation amongst a set of scores consisted of between- groups variation plus within-groups variation. Another way of expressing this is to say that the total sums of squares can be broken down into the between-groups sums of squares, and the within-groups sums of squares. What we have to do is to work these out, and then see how large the between-groups sums of squares is in relation to the within-groups sums of squares, once we've taken the number of participants and number of groups into account by using the appropriate degrees of freedom.
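As an illustration of this partition, here is a minimal sketch (Python, purely for illustration; the handout itself uses SPSS) that works out the total, within-groups and between-groups sums of squares for the caffeine scores in the table above. The group labels and variable names are just for readability:

    import numpy as np

    # Caffeine scores from the table (groups A-D: 0, 1, 5 and 10 mg)
    groups = {
        "A (0 mg)":  [4, 3, 5, 6, 2],
        "B (1 mg)":  [7, 9, 10, 11, 8],
        "C (5 mg)":  [11, 15, 13, 11, 10],
        "D (10 mg)": [14, 12, 10, 15, 14],
    }

    all_scores = np.concatenate([np.array(g, dtype=float) for g in groups.values()])
    grand_mean = all_scores.mean()                                    # 9.5

    # Total SS: squared deviations of every score from the grand mean
    ss_total = ((all_scores - grand_mean) ** 2).sum()                 # 297.0

    # Within-groups SS: squared deviations of each score from its own group mean
    ss_within = sum(((np.array(g) - np.mean(g)) ** 2).sum()
                    for g in groups.values())                         # 52.0

    # Between-groups SS: squared deviations of each group mean from the grand mean,
    # weighted by the number of scores in that group
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2
                     for g in groups.values())                        # 245.0

    print(ss_total, ss_between + ss_within)   # the partition adds up: 297.0 = 245.0 + 52.0

Dividing each of these sums of squares by its appropriate degrees of freedom (see below) gives the mean squares from which the F-ratio is formed.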

Step-by-step example of a One-way Independent-Measures ANOVA:

As mentioned earlier, there are lots of different types of ANOVA. The following example will show you how to perform a one-way independent-measures ANOVA. You use this where you have the following: (a) one independent variable (which is why it's called "one-way"); (b) one dependent variable (you get only one score from each participant); (c) each participant participates in only one condition in the experiment (i.e., they are used as a participant only once). A one-way independent-measures ANOVA is equivalent to an independent-measures t-test, except that you have more than two groups of participants. (You can have as many groups of participants as you like in theory: the term "one-way" refers to the fact that you have only one independent variable, and not to the number of levels of that IV). Another way of looking at it is to say that it is a parametric equivalent of the Kruskal-Wallis test.

Although some statistics books manage to make hand-calculation of ANOVA look scary, it's actually quite simple. However, since it is so quick and easy to use SPSS to do the work, I'm just going to give you an overview of what SPSS works out and why.

SPSS summarises all of this in an ANOVA summary table. For the caffeine data it looks like this:

Source            SS        d.f.     MS        F
Between groups    245.00      3      81.67     25.13
Within groups      52.00     16       3.25
Total             297.00     19
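If you want to check these figures without SPSS, scipy's one-way ANOVA routine reproduces the same F-ratio directly from the raw scores (offered here as an illustrative alternative, not as the handout's own procedure):

    from scipy import stats

    a = [4, 3, 5, 6, 2]          # 0 mg
    b = [7, 9, 10, 11, 8]        # 1 mg
    c = [11, 15, 13, 11, 10]     # 5 mg
    d = [14, 12, 10, 15, 14]     # 10 mg

    f_ratio, p_value = stats.f_oneway(a, b, c, d)
    print(round(f_ratio, 2), p_value)   # F is approximately 25.13; p is far below 0.05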

Different statistics packages may display the results in a different way, but most of these principal details will be there somewhere. The really important bit is the following:

Assessing the significance of the F-ratio:

The bigger the value of the F-ratio, the less likely it is to have arisen merely by chance. How do you decide whether it's "big"? You consult a table of "critical values of F". (There's one on my website). If your value of F is equal to or larger than the value in the table, it is unlikely to have arisen by chance. To find the correct table value against which to compare your obtained F-ratio, you use the between-groups and within-groups d.f. In the present example, we need to look up the critical F-value for 3 and 16 d.f. Here is an extract from a table of critical F-values, for a significance level of 0.05:

Treat the between-groups d.f. and the within-groups d.f. as coordinates: we have 3 between-groups d.f. and 16 within-groups d.f., so we go along 3 columns in the table, and down 16 rows. At the intersection of these coordinates is the critical value of F that we seek: with our particular combination of d.f., values of F as large as this one or larger are likely to occur by chance with a probability of 0.05 - i.e., less than 5 times in a hundred. Therefore, if our obtained value of F is equal to or larger than this critical value of F in the table, our obtained value must have a similarly low probability of having occurred by chance.
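If you don't have a printed table to hand, the same critical value can be read off the F distribution itself. The sketch below uses scipy and the 3 and 16 degrees of freedom from our example:

    from scipy import stats

    df_between, df_within = 3, 16
    critical_f = stats.f.ppf(1 - 0.05, df_between, df_within)
    print(round(critical_f, 2))   # approximately 3.24: F-ratios this large or larger
                                  # arise by chance less than 5 times in a hundred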

In short, we compare our obtained F to the appropriate critical value of F obtained from the table: if our obtained value is equal to or larger than the critical value, there is a significant difference between the conditions in our experiment. On the other hand, if our obtained value is smaller than the critical value, then there is no significant difference between the conditions in our experiment.

(Different tables may display these critical values of F in different ways: the table above shows only critical values for a 0.05 significance level, but often tables will show critical values for a 0.01 significance level as well (or even higher levels of significance, such as 0.001). If you are using SPSS or Excel to do the dirty work, then you won't need to use this table: the statistical package gives you the exact probability of obtaining your particular value of F by chance. Thus, for example, SPSS might say that the probability was "0.036" or somesuch. Some people report these exact probabilities; others merely round them to the nearest conventional figure, so that instead of reporting 0.036, they would simply say: p<0.05).
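The "exact probability" that a package reports can likewise be recovered from the F distribution. Using the F-ratio of 25.13 from our summary table (scipy again, purely for illustration):

    from scipy import stats

    p_exact = stats.f.sf(25.13, 3, 16)   # survival function: P(F >= 25.13) for 3 and 16 d.f.
    print(p_exact)                       # a very small probability, far below 0.05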

Interpreting the Results:

If we obtain a significant F-ratio, all it tells us is that there is a statistically-significant difference between our experimental conditions. In our example, it tells us that caffeine dosage does make a difference to memory performance. However, this is all that the ANOVA does: it doesn't say where the difference comes from. For example, in our caffeine example, it might be that group A (0 mg caffeine) was different from all the other groups; or it might be that all the groups are different from each other; or groups A and B might be similar, but different from groups C and D; and so on. Usually, looking at the means for the different conditions can help you to work out what is going on. For example, in our example, it seems fairly clear that increased levels of caffeine lead to increased memory performance (although we wouldn't be too confident in saying that groups C and D differed from each other). In many cases, a significant ANOVA would be followed by further statistical analysis, using "planned comparisons" or "post hoc tests", in order to determine which differences between groups have given rise to the overall result.
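The handout does not prescribe a particular follow-up test; one common choice, shown here purely as an illustration, is Tukey's HSD test from the statsmodels package, which compares every pair of groups while keeping the familywise Type 1 error rate under control:

    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    scores = [4, 3, 5, 6, 2,            # 0 mg
              7, 9, 10, 11, 8,          # 1 mg
              11, 15, 13, 11, 10,       # 5 mg
              14, 12, 10, 15, 14]       # 10 mg
    dose = ["0 mg"] * 5 + ["1 mg"] * 5 + ["5 mg"] * 5 + ["10 mg"] * 5

    # Prints a table showing, for each pair of groups, the mean difference and
    # whether it is significant once all six comparisons are taken into account
    print(pairwise_tukeyhsd(scores, dose, alpha=0.05).summary())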

SPSS output and what it means:

Here is the output from SPSS for the current example. (Click on "analyze data", then "compare means" and finally "one-way ANOVA").

Under "options", I selected "descriptive statistics" and "homogeneity of variance". The former gives me a mean and standard deviation for each condition, essential for interpreting the results of the ANOVA. The latter performs a "Levene's test" on the data, to test whether or not the conditions show homogeneity of variance (i.e. whether the spread of scores is roughly similar in all conditions, one of the requirements for performing a parametric test like ANOVA). If Levene's test is NOT significant, then you are OK: you can assume that the data show homogeneity of variance. If Levene's test is statistically significant (i.e. its significance is 0.05 or less), then the spread of scores is NOT similar across the different conditions: the data thus violate one of the requirements for performing ANOVA, and you should perhaps consider using a non-parametric test instead, such as the Kruskal-Wallis test. Here, Levene's test is not significant, so we are OK to use ANOVA.

Descriptives: SCORE

           N    Mean     Std. Deviation   Std. Error   95% CI for Mean (Lower)   95% CI for Mean (Upper)   Minimum   Maximum
0 mg       5    4.0000      1.58114         .70711            2.0368                    5.9632              2.00      6.00
1 mg       5    9.0000      1.58114         .70711            7.0368                   10.9632              7.00     11.00
5 mg       5   12.0000      2.00000         .89443            9.5167                   14.4833             10.00     15.00
10 mg      5   13.0000      2.00000         .89443           10.5167                   15.4833             10.00     15.00
Total     20    9.5000      3.95368         .88407            7.6496                   11.3504              2.00     15.00