For all problems, unless otherwise indicated, assume a Type 1 error rate = .05 and that comparisons are two-tailed.
A researcher is interested in comparing the efficacy of 3 alternative treatments for speech anxiety. Treatment #1 is behavior therapy (BT), treatment #2 is the beta-adrenergic blocker Atenolol (AT), and treatment #3 is placebo (PL). Out of a total sample size of 30, 12 patients are randomly assigned to the BT group, 12 patients are randomly assigned to the AT group, and 6 patients are randomly assigned to the PL group. The primary dependent measure is an overall rating of public speaking ability conducted by trained raters who are experts in non-verbal aspects of public speaking. These ratings were made on a 0 to 10 scale, with higher scores indicating better public speaking. The ratings for each group are shown below:
BT group | 6 | 7 | 6 | 5 | 5 | 5 | 7 | 8 | 9 | 6 | 6 | 7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AT group | 4 | 8 | 10 | 3 | 6 | 5 | 6 | 3 | 7 | 5 | 4 | 6 |
PL group | 0 | 7 | 0 | 7 | 0 | 7 |
Conduct a one-way analysis of variance to test the null hypothesis using SAS. After presenting the results, indicate whether or not the results of this test lead you to reject the null hypothesis. (10)
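A minimal sketch of R commands that would produce this analysis (the data-frame and variable names, such as speech, are illustrative):

```r
# Stack the ratings into one response vector with a grouping factor.
Speaking_Ability <- c(6, 7, 6, 5, 5, 5, 7, 8, 9, 6, 6, 7,   # BT
                      4, 8, 10, 3, 6, 5, 6, 3, 7, 5, 4, 6,  # AT
                      0, 7, 0, 7, 0, 7)                      # PL
Treatment <- factor(rep(c("BT", "AT", "PL"), times = c(12, 12, 6)))
speech <- data.frame(Speaking_Ability, Treatment)

# One-way ANOVA of rated speaking ability on treatment group.
anova(aov(Speaking_Ability ~ Treatment, data = speech))
```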
R produces:
```
Analysis of Variance Table

Response: Speaking_Ability
          Df  Sum Sq Mean Sq F value  Pr(>F)
Treatment  2  34.167  17.083  3.3586 0.04982 *
Residuals 27 137.333   5.086
```
Under the null hypothesis, an F-value at least this large would occur by chance with probability .0498. This is less than our accepted 5% Type I error rate, so H0 is rejected.
Devise a scenario that would lead to a violation of the assumption of independence of errors (i.e., observations). That is, make up a story about features of the study that I have not told you about that would plausibly result in a violation of the independence of errors assumption. Is it likely that your violation scenario would result in an increase in Type I errors or an increase in Type II errors (i.e., decreased power) if a one-way ANOVA were still performed on these data? (15)
The assumption of independence applies both within and between groups. Correlations within the data cause it to stratify into clusters in a way the ANOVA is not designed to account for. A within-group correlation, ρw, and a between-group correlation, ρb, affect the expected values of the MSW and MSB in different ways.
The effect of violating the independence assumption depends on whether the violation is within or between groups. In the previous example, imagine that the researcher recruited through the mental health facility associated with his university. Unbeknownst to the researcher, that particular facility has a reputation among the world's classical Shakespearean actors as the premier establishment for dealing with stage fright.
Subjects were collected through responses to fliers posted in the facility. All the respondents are actors, and there are two classes of people who responded: aspiring actors too nervous to get on stage and established actors looking to iron out their craft. This division can be seen to some extent in the placebo group where individuals were either much better than average or completely inept.
It turns out that some of the characteristics that influence a person's decision to perform Shakespeare on the public stage also affect their response to different types of therapy. The response to drug therapy matches the general population, but there is a positive correlation between being a Shakespearean actor and responding well to behavioral therapy. This means that for this study there is a positive ρw for the behavioral therapy group, but ρb = 0.
A positive ρw drives up the expected value of the MSB and drives down the MSW. This inflates their ratio, producing an inflated F-value and an increased likelihood of rejection. The probability of incorrect rejection (Type I error) is then no longer governed solely by sampling error, and exceeds α.
Between-group correlations, ρb, affect only the MSB, lowering its expected value; this reduces the F-value, and with it the probability of rejecting and the power of the test.
Conduct a statistical test assessing whether the assumption of homogeneity of variance is violated. (10)
A common test for homogeneity of variance that is relatively robust to non-normality is Levene's test.
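A minimal sketch of the call, assuming the stacked speech data frame from part a. and that the car package (which provides leveneTest()) is installed:

```r
library(car)

# Levene's test of equal variances across the three treatment groups.
leveneTest(Speaking_Ability ~ Treatment, data = speech)
```

R produces: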
```
Levene's Test for Homogeneity of Variance
      Df F value   Pr(>F)
group  2  14.092 6.44e-05 ***
      27
```
Under the null hypothesis of equal variances, an F-value this large would occur with probability 6.44e-05 (0.00644%), so the assumption of homogeneity is rejected.
Based on the results of this test (in part c.), what do you conclude? What would be the consequences of violating this assumption on the Type I error rates and/or power of the one-way ANOVA? (10)
Heterogeneity of variances is particularly detrimental to the ANOVA when paired with unequal sample sizes. The exact effects depend on the variances:
Speech anxiety data, sample variances by group (note: these follow from the raw data above, and their weighted sum reproduces the residual SS of 137.333 from the ANOVA):

Group | Sample Variance |
---|---|
Behavior Therapy (BT) | 1.538 |
Atenolol (AT) | 4.265 |
Placebo (PL) | 14.700 |
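These values can be reproduced from the stacked data frame assumed in part a.:

```r
# Per-group sample variances of the speaking-ability ratings.
tapply(speech$Speaking_Ability, speech$Treatment, var)
```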
In this particular situation there is an "inverse pairing": the smallest group, Placebo, has the largest variance. For an inverse pairing the ANOVA is said to be overly liberal, meaning that it will reject too frequently (an increased Type I error rate), favoring the experimenter's belief.
For a direct pairing (the largest group has the largest variance), the Type II error rate will increase and power will decrease.
Based on the results of part c., conduct an alternative test of the null hypothesis that you tested in part a. Unlike the ANOVA, this test would not assume homogeneity of the population variances. Does this test yield conclusions that are similar or different to that of the standard ANOVA? Why? (15)
The R function oneway.test will perform a Welch test if the parameter var.equal is set to FALSE (its default).
The Welch test is an ANOVA on the squared distances of the group means from a grand mean, with each group weighted by its own sample size and variance so that no pooled variance estimate is required, and with the denominator degrees of freedom adjusted downward to account for the unequal variances.
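A sketch of the call, reusing the stacked data frame assumed in part a.:

```r
# Welch one-way test: the group variances are not pooled.
oneway.test(Speaking_Ability ~ Treatment, data = speech, var.equal = FALSE)
```

R produces: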
```
	One-way analysis of means (not assuming equal variances)

data:  Speaking_Ability and Treatment
F = 2.023, num df = 2.000, denom df = 10.944, p-value = 0.1788
```
Under the null hypothesis, the probability of an F-value this large has more than tripled relative to the standard ANOVA, to 17.88%, so the test fails to reject.
A study was conducted on the effect of magnetic stimulation of the frontal lobes of the brain on feelings of positive affect. 40 adults were randomly assigned to the following four experimental conditions, with 10 participants per condition:
Three hours after stimulation, participants completed a self-report scale assessing positive affect (potential range = 10 to 50). The following data were obtained:
Treatment | Sample Mean, $\bar{Y}_j$ | Sample Variance, $s_j^2$ |
---|---|---|
LF | 35 | 13 |
RF | 28 | 10 |
BILAT | 33 | 12 |
CONTROL | 30 | 11 |
You are interested in testing three contrasts:
Using the table below, fill in appropriate contrast coefficients (ci,j) for each contrast: (12)
Recall that for acceptable contrasts the coefficients must sum to zero within each contrast:

$$\sum_{j} c_{i,j} = 0$$
Contrast | Left Frontal (LF) | Right Frontal (RF) | Bilateral (BOTH) | Sham (CONTROL) |
---|---|---|---|---|
Ψ1 | 1 | -1 | 0 | 0 |
Ψ2 | 1 | -2 | 1 | 0 |
Ψ3 | 1 | 1 | 1 | -3 |
Conduct a two-tailed test of each of these three contrasts (you can assume that the variances for all four populations are equal). Conduct each test at a per comparison Type I error rate of .05 (in other contexts this might not be an optimal decision, but for now let's assume it's ok). (18)
Besides showing your work, fill out the table below:
A contrast is a weighted combination of the population means whose weights sum to zero:

$$\Psi = \sum_{j=1}^{J} c_j \mu_j, \qquad \sum_{j=1}^{J} c_j = 0$$

Because the actual means are not generally known, an unbiased estimate of the contrast is:

$$\hat{\Psi} = \sum_{j=1}^{J} c_j \bar{Y}_j$$

A scaled version of the contrast is a form of a z-score and, if the null hypothesis H0: Ψ = 0 is true, will have a t-distribution. The MSW is generally used as the estimate for the error variance $\sigma_e^2$, and so the degrees of freedom are N - J (the df for the MSW). The MSW uses the entire sample for its variance estimate rather than only the specific groups involved in the contrast; this is how a two-group contrast differs from a simple t-test, which uses only the two groups to estimate the variance.

For this particular example, however, we do not have access to the raw sample data; it has already been aggregated into means and variances. A pooled estimate of the variance is then:

$$s_p^2 = \frac{\sum_j (n_j - 1)s_j^2}{N - J} = \frac{9(13 + 10 + 12 + 11)}{36} = 11.5$$

This quantity is equivalent to the MSW (with equal group sizes it is simply the mean of the four sample variances).

Recall that the variance of a mean is:

$$\mathrm{Var}(\bar{Y}_j) = \frac{\sigma^2}{n_j}$$

The best estimate of the variance of a contrast for equal-sized groups is therefore:

$$\widehat{\mathrm{Var}}(\hat{\Psi}) = \frac{MSW}{n}\sum_j c_j^2, \qquad t = \frac{\hat{\Psi}}{\sqrt{\widehat{\mathrm{Var}}(\hat{\Psi})}}$$
Changing the signs of all the cj's changes the sign of Ψ, so H0 can be rejected by a deviation in either direction; the test is thus two-tailed.
Ψi | Ψ1 | Ψ2 | Ψ3 |
---|---|---|---|
$\hat{\Psi}$ | 1(35) + (-1)(28) = 7 | 1(35) + (-2)(28) + 1(33) = 12 | 1(35) + 1(28) + 1(33) + (-3)(30) = 6 |
$\widehat{\mathrm{Var}}(\hat{\Psi})$ | (2)(11.5)/10 = 2.3 | (6)(11.5)/10 = 6.9 | (12)(11.5)/10 = 13.8 |
t | 7/√2.3 ≈ 4.62 | 12/√6.9 ≈ 4.57 | 6/√13.8 ≈ 1.62 |
df | 40 - 4 = 36 | 36 | 36 |
two-tailed p | 2(1 - pt(7/sqrt(2.3), 36)) ≈ 4.8e-05 | 2(1 - pt(12/sqrt(6.9), 36)) ≈ 5.6e-05 | 2(1 - pt(6/sqrt(13.8), 36)) ≈ 0.115 |
critical t, qt(.975, 36) | 2.028 | 2.028 | 2.028 |
Reject H0? | Yes | Yes | No |
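A minimal sketch of these computations in R (names are illustrative; the means and variances are taken from the table above):

```r
means <- c(LF = 35, RF = 28, BILAT = 33, CONTROL = 30)
vars  <- c(13, 10, 12, 11)
n <- 10; J <- 4; N <- n * J
MSW <- mean(vars)   # pooled variance with equal group sizes: 11.5

C <- rbind(psi1 = c(1, -1, 0,  0),
           psi2 = c(1, -2, 1,  0),
           psi3 = c(1,  1, 1, -3))

psi.hat <- C %*% means                       # contrast estimates: 7, 12, 6
se.psi  <- sqrt(MSW * rowSums(C^2) / n)      # standard error of each contrast
t.stat  <- psi.hat / se.psi
p.two   <- 2 * (1 - pt(abs(t.stat), N - J))  # two-tailed p-values
cbind(psi.hat, t.stat, p.two)
```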
Do the three contrasts constitute an orthogonal set? Justify your answer. (10)
No. Orthogonality for a set of comparisons may be conceptualized in terms of vectors in an n-dimensional space, where each dimension corresponds to one of the J groups. A set of vectors is an orthogonal set if the dot product of every pairwise combination is zero.
A set of contrasts being orthogonal means that the contrasts are linearly independent.
This set of contrasts is not orthogonal because the first two contrasts have a nonzero dot product:

$$\Psi_1 \cdot \Psi_2 = (1)(1) + (-1)(-2) + (0)(1) + (0)(0) = 3 \neq 0$$

(Ψ3 is orthogonal to both Ψ1 and Ψ2, but a single non-orthogonal pair disqualifies the set.)
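A quick sketch of this check in R:

```r
# Rows are the contrasts; off-diagonal entries of C %*% t(C) are the
# pairwise dot products, so any nonzero off-diagonal entry marks a
# non-orthogonal pair (here psi1 . psi2 = 3).
C <- rbind(psi1 = c(1, -1, 0,  0),
           psi2 = c(1, -2, 1,  0),
           psi3 = c(1,  1, 1, -3))
C %*% t(C)
```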
A social psychologist is interested in assessing the effects of five different types of feedback on nonverbal expressiveness in a social interaction task. Five experimental conditions are used (denoted below as A-E). Subjects are randomly assigned to groups with n = 20 per group. In the questions below, try to give a precise probability value for the familywise Type 1 error rate. If you cannot give a precise value, give a range estimate (i.e., if X is the familywise Type 1 error rate, something like L ≤ X ≤ U, where L and U are specific probability values). Assume that all comparisons are two-tailed.
The population means for all 5 groups are identical. The researcher is interested in conducting the following comparisons:
If the researcher were to conduct each of these four comparisons at a per comparison Type 1 error rate of α = .08, what would be the familywise Type 1 error rate for this set of comparisons? (10)
To help visualize the contrasts, consider them as a matrix: (the order has been changed to more clearly show orthogonality)
Contrast | A | B | C | D | E |
---|---|---|---|---|---|
Ψ2 | 4 | -1 | -1 | -1 | -1 |
Ψ4 | 0 | 3 | -1 | -1 | -1 |
Ψ3 | 0 | 0 | 2 | -1 | -1 |
Ψ1 | 0 | 0 | 0 | 1 | -1 |
This group of contrasts is an orthogonal set, meaning the contrasts are linearly independent and knowing the value of one contrast provides no predictive power for any other contrast. This means that the errors are uncorrelated, so an exact value for the familywise error rate can be found. The probability that any one test, Ti, will falsely reject is α = 0.08. The probability that at least one of a set of k independent tests falsely rejects is then:

$$\alpha_{FW} = 1 - (1 - \alpha)^k$$

For this particular experiment:

$$\alpha_{FW} = 1 - (1 - .08)^4 = 1 - .92^4 \approx .284$$
The population means for all 5 groups are identical. The researcher decides to conduct the following three comparisons:
If the researcher conducts each of these comparisons at a per comparison Type 1 error rate of α = .10, what would be the familywise Type 1 error rate for this set of comparisons? (10)
Contrast | A | B | C | D | E |
---|---|---|---|---|---|
Ψ1 | 0 | 0 | 0 | 1 | -1 |
Ψ2 | 1 | -1 | 0 | 0 | 0 |
Ψ3 | 2 | -1 | 0 | 0 | -1 |
An exact value for the familywise error cannot be specified for these contrasts because Ψ3 is non-orthogonal to both Ψ1 and Ψ2. This means that a sampling error that would cause a Type I error for Ψ3 will also affect Ψ1 and Ψ2. Bounds may be placed on the error, however.

The upper bound comes from treating the errors for all three contrasts as if they were independent:

$$\alpha_{FW} \le 1 - (1 - .10)^3 = .271$$

A simple value for the lower bound is the per-comparison error rate, α. That bound may be set a bit higher, however, because two of the contrasts (Ψ1 and Ψ2) are orthogonal, so the lower bound must be at least the familywise error for those two:

$$\alpha_{FW} \ge 1 - (1 - .10)^2 = .19$$

So the final bounds are:

$$.19 \le \alpha_{FW} \le .271$$
Let’s assume that the researcher wants to do the three comparisons listed in part b. above. Let’s say that the population means for A-E are actually as follows: A = 5, B = 5, C = 5, D = 10, E = 5. The experimenter uses a per comparison type 1 error rate of α = .10. What would be the familywise type 1 error rate for this set of contrasts? (10)
A Type I error is an incorrect rejection of the null hypothesis. Since the mean of D really is different from the other means, rejecting H0 for Ψ1 is a correct decision, not an error. Any error on Ψ1 would be a Type II error (a failure to reject when the test should have); the familywise Type I error rate is not concerned with Type II errors, however.

So there are only two contrasts that can incorrectly reject because of sampling error: Ψ2 and Ψ3. These two comparisons are not orthogonal, so their familywise error rate can only be bounded:

$$.10 \le \alpha_{FW} \le 1 - (1 - .10)^2 = .19$$
A researcher is interested in assessing the effects of environmental stimulation (high/low) and predictability of diet (predictable/unpredictable) on levels of the hormone prolactin (PRL) in rat pups. 32 rat pups are randomly assigned to the 4 conditions, with 8 pups per condition. The 4 conditions are as follows:
PRL scores for each of the four groups are as follows:
S+/P+ | 25 | 23 | 18 | 16 | 12 | 19 | 20 | 21 |
---|---|---|---|---|---|---|---|---|
S+/P- | 18 | 17 | 16 | 11 | 14 | 15 | 21 | 12 |
S-/P+ | 20 | 12 | 15 | 13 | 8 | 17 | 17 | 18 |
S-/P- | 12 | 15 | 17 | 10 | 18 | 10 | 9 | 14 |
For all questions below, assume that all contrasts are two-tailed and that the researcher uses a familywise Type 1 error rate equal to .05.
Consider the set of all pairwise comparisons. What would be the minimum difference between means necessary to find at least one significant effect if the researcher were to conduct:
Note: You can use SAS to generate the MSW but, beyond that, use hand calculations to arrive at these minimum mean difference values (you can, however, check your work using SAS). Turn in your work as well as your minimum difference values. (15)
Conducting all pairwise tests involves comparing all possible pairs of groups. The number of such possible pairings is:

$$\binom{J}{2} = \frac{J(J-1)}{2} = \frac{4 \times 3}{2} = 6$$
Bonferroni Test — Bonferroni corrects the per-comparison error rate, αPC, to create the desired familywise error rate, αFW.
There are several methods for performing the correction. The original from Bonferroni assumes the worst case, that all errors on the k comparisons are uncorrelated, and relies on the inequality $\alpha_{FW} \le k\,\alpha_{PC}$, giving:

$$\alpha_{PC} = \frac{\alpha_{FW}}{k} = \frac{.05}{6} \approx .0083$$
The researcher then specifies αFW and performs the tests at the computed αPC significance level.
In general, for a pairwise contrast with equal sample sizes, there is a specific difference in the two means being compared that must be exceeded for the test to reject. A comparison will reject, generally, if:

$$|\hat{\Psi}| \ge t_{crit}\sqrt{\frac{MSW}{n}\sum_j c_j^2}$$

When there are only two groups, this simplifies. (Recall that the sum of the weights in the comparison must be 0, so in the two-group situation the weights are +1 and -1.) The minimum significant difference is then:

$$|\bar{Y}_i - \bar{Y}_j| \ge t_{crit}\sqrt{\frac{2\,MSW}{n}}$$

For these data, MSW ≈ 13.37 on N - J = 28 df, and the Bonferroni two-tailed critical value is qt(1 - .05/12, 28) ≈ 2.84, giving a minimum difference of approximately 2.84 × 1.83 ≈ 5.19.
Tukey HSD Test — Controls αFW for the "maximal contrast," which is simply the pairwise contrast involving the maximum difference of means. It relies on the fact that under the complete null, μi = μj ∀ i,j, the scaled range of the J group means follows a "Studentized Range" distribution, the distribution of the range of J independent z-scores divided by an independent estimate of the standard deviation:

$$q = \frac{\bar{Y}_{max} - \bar{Y}_{min}}{\sqrt{MSW/n}}$$

The minimum difference is based on a critical value from the Studentized Range:

$$|\bar{Y}_i - \bar{Y}_j| \ge q_{.95}(J, N-J)\sqrt{\frac{MSW}{n}} = q_{.95}(4, 28)\sqrt{\frac{13.37}{8}} \approx 3.86 \times 1.29 \approx 4.99$$
Scheffe Test — Also controls for the maximal contrast; however, it is not limited to pairwise contrasts. In general, the maximal contrast for a given set of means takes weights proportional to the deviations of the group means from the grand mean:

$$c_j \propto \mu_j - \bar{\mu}$$

An unbiased estimate of this quantity substitutes the sample means:

$$\hat{c}_j = \bar{Y}_j - \bar{Y}$$

Consider a quantity known as the "Sum of Squares Contrast:"

$$SS_{\Psi} = \frac{\hat{\Psi}^2}{\sum_j c_j^2 / n_j}$$

The maximum value of $SS_{\Psi}$ over all possible contrasts is the SSB from the ANOVA. Recall from the ANOVA that, under the complete null:

$$\frac{SSB/(J-1)}{MSW} \sim F_{J-1,\,N-J}$$

This gives a criterion for the maximum sum of squares contrast: a contrast is significant only if

$$\frac{SS_{\Psi}}{(J-1)\,MSW} \ge F_{crit}(J-1,\,N-J)$$

For a two-element contrast with equal n's, $SS_{\Psi}$ simplifies:

$$SS_{\Psi} = \frac{n(\bar{Y}_i - \bar{Y}_j)^2}{2}$$

The minimum significant difference of means then is:

$$|\bar{Y}_i - \bar{Y}_j| \ge \sqrt{(J-1)\,F_{crit}}\sqrt{\frac{2\,MSW}{n}} = \sqrt{3 \times 2.95}\sqrt{\frac{2 \times 13.37}{8}} \approx 5.44$$
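A minimal sketch collecting all three minimum differences in R (the MSW value comes from the ANOVA on the raw PRL scores; names are illustrative):

```r
J <- 4; n <- 8; N <- J * n; df.w <- N - J
MSW <- 13.37                  # mean square within from the ANOVA on the PRL data
se.diff <- sqrt(2 * MSW / n)  # standard error of a pairwise mean difference

k <- choose(J, 2)  # 6 pairwise comparisons
c(Bonferroni = qt(1 - 0.05 / (2 * k), df.w) * se.diff,
  Tukey      = qtukey(0.95, J, df.w) * sqrt(MSW / n),
  Scheffe    = sqrt((J - 1) * qf(0.95, J - 1, df.w)) * se.diff)
```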
Using SAS, conduct all possible pairwise comparisons via the Tukey HSD method. Turn in the SAS output and a brief summary of what comparisons are statistically significant and what comparisons are not. (15)
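A sketch of the call, assuming the PRL scores are stacked into a data frame named rat.data.stack with columns Response and Effect (the names that appear in the output below):

```r
# Fit the one-way model, then apply Tukey's HSD to all pairwise comparisons.
fit <- aov(Response ~ Effect, data = rat.data.stack)
TukeyHSD(fit, conf.level = 0.95)
```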
```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Response ~ Effect, data = rat.data.stack, projections = T)

$Effect
             diff        lwr       upr     p adj
S-/P+-S-/P- 1.875 -3.1167936  6.866794 0.7360701
S+/P--S-/P- 2.375 -2.6167936  7.366794 0.5711672
S+/P+-S-/P- 6.125  1.1332064 11.116794 0.0117370
S+/P--S-/P+ 0.500 -4.4917936  5.491794 0.9926883
S+/P+-S-/P+ 4.250 -0.7417936  9.241794 0.1165103
S+/P+-S+/P- 3.750 -1.2417936  8.741794 0.1940084
```
The only comparison that rejects is S+/P+ vs S-/P-, with an adjusted p-value of approximately .0117 (1.17%). This is also apparent in the graphical representation of the Tukey 95% intervals, where S+/P+ vs S-/P- is the only interval that does not contain 0.
Using SAS, conduct all possible pairwise comparisons via the Scheffe method. Turn in the SAS output and a brief summary of what comparisons are statistically significant and what comparisons are not. (15)
Unfortunately, R does not include a built-in method for performing the Scheffe test, so it was necessary to implement one, adapted largely from the code for the Tukey test. While the process was certainly educational, the result is off, by what I'm fairly certain is a constant factor (probably of a half). Class started 5 minutes ago though, so I'll have to track down this problem at some later time.
```
               diff       lwr       upr     p adj
S+/P- v S+/P+ 1.875 -5.330914  9.080914 0.6037222
S-/P+ v S+/P+ 2.375 -4.830914  9.580914 0.7057328
S-/P- v S+/P+ 6.125 -1.080914 13.330914 0.9669701
S-/P+ v S+/P- 0.500 -6.705914  7.705914 0.1560044
S-/P- v S+/P- 4.250 -2.955914 11.455914 0.9035778
S-/P- v S-/P+ 3.750 -3.455914 10.955914 0.8705437
```
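For reference, a minimal alternative sketch (a hand-rolled, illustrative function, not a built-in; it treats each pair as a contrast and refers SSΨ/((J - 1)·MSW) to the Scheffe F criterion, using the group means and MSW from these data):

```r
scheffe.pairwise <- function(means, MSW, n, df.w) {
  J <- length(means)
  pairs <- combn(names(means), 2)
  t(apply(pairs, 2, function(p) {
    d <- means[[p[1]]] - means[[p[2]]]
    # Pairwise contrast: SS_psi = d^2 * n / 2; the Scheffe criterion refers
    # SS_psi / ((J - 1) * MSW) to the F(J - 1, df.w) distribution.
    F.psi <- (d^2 * n / 2) / ((J - 1) * MSW)
    crit  <- sqrt((J - 1) * qf(0.95, J - 1, df.w)) * sqrt(2 * MSW / n)
    c(diff = d, lwr = d - crit, upr = d + crit,
      p.adj = 1 - pf(F.psi, J - 1, df.w))
  }))
}

group.means <- c("S+/P+" = 19.250, "S+/P-" = 15.500,
                 "S-/P+" = 15.000, "S-/P-" = 13.125)
scheffe.pairwise(group.means, MSW = 13.37, n = 8, df.w = 28)
```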
If there are any differences between the results of the Tukey and Scheffe comparisons conducted in steps b and c, explain why they exist. (5)
From the previously computed minimum mean differences for the Tukey and Scheffe tests, pairs with a mean difference of more than 4.99 and 5.44, respectively, should reject. S-/P- vs S+/P+ has a mean difference of 6.125, so both tests reject it. No mean difference falls in the range (4.99, 5.44), so there should be no differences in the rejections.
Below are several multiple comparison dilemmas. Assume that the researcher can choose one of four multiple comparison tests: Bonferroni, Fisher LSD, Tukey HSD, and Scheffe. Indicate which procedure you consider to be the optimal one for the specific situation described.
A researcher conducts a study with four experimental conditions. She plans in advance to conduct three pairwise comparisons and the complex comparison comparing the means of conditions B and C to condition D. These are the only comparisons that she plans on conducting and does in fact conduct. (5)
Given the small number of a priori comparisons, Bonferroni is an acceptable multiplicity correction. For a small set of planned comparisons, one of which is complex, Bonferroni controls the familywise error rate with more power than Tukey, which would pay the price for the full set of pairwise comparisons.
A researcher conducts a study with six experimental conditions. She plans in advance to conduct all pairwise comparisons. These are the only comparisons that she plans on conducting and does in fact conduct. (5)
For the complete set of pairwise comparisons, Tukey is a permissible correction and will give more power than Bonferroni or Scheffe.
A researcher conducts a study with four experimental conditions. Initially, she intended to conduct all pairwise comparisons. After looking at the data, she decides that only one of the comparisons is likely to be significant, so she conducts that single comparison. (5)
Because the decision to test this single comparison was made after looking at the data, the implicit family is still all pairwise comparisons, so the researcher is limited to Tukey, even though the post hoc comparison would have had more power under Bonferroni had it been planned. Because only pairwise comparisons were considered, it is not necessary to employ Scheffe.
A researcher conducts a study with three experimental conditions. She plans in advance to conduct all pairwise comparisons. These are the only comparisons that she plans on conducting and does in fact conduct. (5)
Because there are only three conditions, the most powerful correction method will be the Fisher LSD.
Fisher LSD is a qualification-based test. An ANOVA is first performed on the data at the α Type I error rate. If the ANOVA rejects, then any number of comparisons may be performed, also at the α Type I error rate. The idea is that the ANOVA bounds the error for the entire data set at α: data with genuinely equivalent means will get past the ANOVA gate only 100α% of the time.
The issue with the Fisher LSD: consider an experiment with k conditions whose means are actually identical, on which the experimenter would like to run the set of k - 1 orthogonal contrasts. The uncorrected familywise error rate for this set is $1 - (1 - \alpha)^{k-1}$. Using the Fisher LSD bounds this error at α, but only because all the means are equivalent.
Consider the situation where the null hypothesis is only partially true: add a new condition whose mean genuinely differs by a large magnitude from the previous k equivalent conditions. This is called a "partial null," as opposed to a "complete null" where all the means are equal. The new condition will cause the ANOVA to reject nearly every time, and the comparisons among the previous k conditions are then essentially run without any error correction.
The Fisher LSD is a permissible correction for a three-condition experiment because, under a partial null, at most one pair of truly equal means can slip past the ANOVA gate, and that single comparison is then tested at the per comparison α, which is exactly the Type I error rate we want.