CP 6691 - Week 3 (Part 1)
Using and Interpreting Statistical Tools and Evaluating Descriptive Research Designs
Table of Contents
Part 1 of 2
Types of Data
Descriptive Statistics
Inferential Statistics
Appropriate and Inappropriate Uses of Inferential Statistical Tests
What is Statistical Significance?
What is Practical Significance?
Part 2 of 2
Purpose of Descriptive Research
Traps to Avoid When Identifying the Type of Research Design
External and Internal Validity Threats to Consider
Evaluate Sample Study #12 (Human Sexuality Instruction ...)
Evaluate Sample Study #13 (Natural Rates of Teacher Approval ...)
Assignment for Week 4
Part 1 of 2
Types of Data
Statistical tools deal with data. So, in order to address the appropriate use of statistics, we should begin with a discussion of data types. There are four basic data types:
- Ratio Data
- Interval Data
- Ordinal Data
- Nominal Data
They are shown in descending order of power. Power is a measure of the amount of information contained in the data.
Nominal data have the lowest power because they contain the least amount of information. The information in nominal data consists of names, categories, or frequencies of occurrence. The numbers on football players' jerseys, for example, are nominal data. They indicate only the category of where a player lines up on the field; otherwise, the numbers have no meaning. They don't relate to a player's age, weight, or anything else.
The problem with nominal data is that they do not have the mathematical properties necessary to permit the meaningful computation of means (averages). For example, if you add all the football jersey numbers on the offensive team and divide by 11, you will have a number that "should" represent the average (mean) of those numbers. You could do the same for the defensive team. But the means you compute would be useless, because the original numbers are themselves meaningless in a numerical sense.
Surveys that use Likert-type data scales (Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree) to collect attitudinal data also generate nominal data. The following example will illustrate the problem of computing (and interpreting) means with nominal data. Suppose I ask 1000 people a single question: "How do you feel about the President's economic policy?" In this example, I am using the five-point Likert scale above. In this hypothetical example, let's say all 1000 people return their surveys, and that 500 respondents Strongly Agree with his policy while the other 500 Strongly Disagree with it. Some people erroneously try to convert these Likert scale data into numerical data by assigning weights to each category like this:
Strongly Agree = 5, Agree = 4, Neutral = 3, Disagree = 2, and Strongly Disagree = 1
Then, they would attempt to compute a mean response value by multiplying the number responding in each category by the weight of that category, adding up all these products, and dividing by the total number of surveys. When we do that with our survey, we get:
mean = (5 * 500 + 1 * 500) / 1000 = (2500 + 500) / 1000 = 3000 / 1000 = 3
So, what we have is a mean response value of 3 (Neutral). What this says is that "on average, those who responded to the survey are neutral toward the President's economic policy." Obviously, this is an erroneous conclusion. That's the essence of the problem you can run into when trying to compute means with nominal data -- the results are almost always uninterpretable. The most accurate way to report nominal data is to use percentages.
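To see the pitfall concretely, here is a minimal sketch of the calculation in Python, using the hypothetical survey counts and category weights above:

```python
# The hypothetical survey from the text: 500 Strongly Agree, 500 Strongly Disagree.
counts = {
    "Strongly Agree": 500,
    "Agree": 0,
    "Neutral": 0,
    "Disagree": 0,
    "Strongly Disagree": 500,
}
weights = {
    "Strongly Agree": 5,
    "Agree": 4,
    "Neutral": 3,
    "Disagree": 2,
    "Strongly Disagree": 1,
}

total = sum(counts.values())

# The (inappropriate) weighted mean lands on 3.0 ("Neutral"),
# even though not a single respondent chose Neutral.
mean = sum(weights[c] * n for c, n in counts.items()) / total
print(f"Misleading mean response: {mean:.1f}")

# The appropriate report for nominal data: percentages per category.
for category, n in counts.items():
    print(f"{category}: {100 * n / total:.0f}%")
```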
Ordinal Data contain both name and position information. That is, not only do we know what category a particular piece of data is in, we also know what position it occupies relative to all other data. An example is the order of finishing in a race. First, second, third, etc., are categories that racers occupy at the end of the race. But, by knowing the particular category, we know exactly where they finished relative to all other racers. One limitation of ordinal data is that it is impossible to say whether the distance between, say, first and second is the same as the distance between second and third. In other words, the intervals between the points on the ordinal scale are not equal. Because of that, the ordinal scale is also not useful for computing means.
We call Nominal and Ordinal data scales discrete data scales because the intervals (spacing) between values on these scales are not necessarily equal from one value to the next, and because there are no intermediate points between values (for example, there can be no position between first and second in a race, and no position between Strongly Agree and Agree on a Likert scale). These facts make it impossible to use discrete data scales to reliably compute mean values.
Interval Data contain all the attributes of Nominal and Ordinal data while possessing one additional attribute -- there are equal intervals between the points on the interval scale. This attribute allows us to reliably compute means. The zero point on an interval data scale, however, does not indicate the absence of the attribute being measured -- it's just another value on the scale. An example is the Celsius (or Fahrenheit) temperature scale: a temperature of 0 does not mean the absence of all heat; it's just another temperature value.
Ratio Data contain all the attributes of Nominal, Ordinal, and Interval data while also possessing an "absolute zero." This means that there is a point on the ratio scale that indicates the absence of the attribute being measured. Some examples of ratio scales are weight, age, wealth, and velocity.
Both the Interval and Ratio data scales have equal intervals between points on their scales and there is a continuous range of values between any two values on the scales (for example, between the numbers 0 and 1 on a continuous scale, there are an infinite number of fractional values). Because of these attributes, we call these continuous data scales.
Continuous data scales can be reliably used to compute means.
Why have we been talking so much about the ability to reliably compute means? Because some of the most powerful statistics available for analyzing data require the computation of mean values in the data (for example, if you are looking for statistically significant differences between the means of two or more groups). If these very powerful statistics are applied to nominal or ordinal data sets, their results will be unreliable because of the problems associated with computing means of these types of data, as illustrated above. So, in order to know if the researcher is using an appropriate statistical test on the data he/she has collected, determine the type of data being collected to see if it is discrete or continuous. Knowing the data type is not the only thing needed to assess the appropriateness of a statistical test, but it is one of the most important.
For purposes of this course, you only need to consider the type of data when trying to determine if a researcher is using an appropriate statistical test. (More about this a little later in the lesson.)
Descriptive Statistics
There are two classes of statistics: descriptive and inferential. Descriptive statistics allow you to describe attributes of a distribution of data. Two ways of doing that are with measures of central tendency and measures of dispersion. Measures of central tendency include the Mean, Median, and Mode. Each of these describes the center point of the distribution and how the data points cluster about it. Measures of dispersion include the Range, Variance, and Standard Deviation, among others (but these three are the most common). These statistics describe how spread out the distribution is. So, with one measure of each type (actually only two values), anyone could describe the shape of an entire distribution of data. The most commonly reported statistics are the mean and standard deviation. If you take a statistics course or look at any basic statistics book, you'll learn more about how to read and interpret these statistics. For this course, it's enough to know that these statistics are important for describing the kind of data distributions the researcher is dealing with.
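Here is a minimal sketch of these descriptive statistics in Python, using the standard library's statistics module on an invented set of test scores:

```python
import statistics

# Invented test scores for illustration.
scores = [72, 85, 85, 90, 78, 88, 95, 70, 85, 82]

# Measures of central tendency.
print("Mean:              ", statistics.mean(scores))
print("Median:            ", statistics.median(scores))
print("Mode:              ", statistics.mode(scores))

# Measures of dispersion.
print("Range:             ", max(scores) - min(scores))
print("Variance:          ", statistics.variance(scores))  # sample variance
print("Standard deviation:", statistics.stdev(scores))     # sample std dev
```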
Inferential Statistics
A limitation of descriptive statistics is that they cannot tell us anything about distributions other than the one we have collected data for. This might not seem like a limitation, but consider the fact that researchers do not study whole populations directly. They sample from those populations and collect data from those samples. Descriptive statistics do a fine job of describing those sample distributions, but the researcher really wants to know if the data collected on the sample are representative of the population from which the sample was drawn. In other words, the researcher wants to be able to infer whether or not the results he/she found in the sample would occur in the population at large. That's where inferential statistics come in.
There are two classes of inferential statistics: parametric and nonparametric. The more powerful class is parametric. Let's discuss this class first. What makes parametric statistics so powerful is their ability to estimate and cancel out random sampling error. Parametric statistics can do this because they rely on certain assumptions about the population containing the attribute(s) being measured:
- the attribute must be normally distributed (or very nearly so) throughout the population
- the variability in the attribute must be evenly distributed throughout the population
- the sample mean and standard deviation must be computable (meaning the data measuring the attribute must be continuous)
As we said earlier, this third assumption is the only one you need to be concerned about when deciding if the researcher is using the appropriate statistical test in his/her study.
Researchers also study problems involving variables that cannot be measured on continuous scales. Research questions concerning attitudes and preferences, for instance, deal with discrete variables. It is not appropriate to use parametric statistics with these variables because the assumptions listed above are often violated, especially assumption number 3. To enable researchers to study these sorts of questions, a different class of statistics was developed that does not rely on any population parameters or assumptions. They are called nonparametric statistics.
Because they do not rely on assumptions about the population, these statistics cannot estimate or cancel random sampling error. They are considerably weaker than their parametric "cousins." What this means is that a difference between two or more groups or a relationship between two variables must be considerably larger to register as statistically significant with a nonparametric statistic than with a comparable parametric statistic.
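To see this power difference in action, here is a minimal sketch comparing a t-test with its nonparametric counterpart, the Mann-Whitney U test, on the same invented data (this assumes the third-party SciPy library is installed):

```python
from scipy import stats

# Invented samples from two groups.
group_a = [23, 25, 27, 29, 31, 33, 35]
group_b = [26, 28, 30, 32, 34, 36, 38]

# Parametric test of the group difference.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Nonparametric counterpart on the same data.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# The nonparametric test typically reports the larger (weaker) p-value,
# i.e., the same difference is harder to call statistically significant.
print(f"t-test p-value:       {t_p:.3f}")
print(f"Mann-Whitney p-value: {u_p:.3f}")
```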
Despite their restricted statistical power, nonparametric statistics are ideal for use with variables that generate nominal or ordinal (discrete) data. So, they permit researchers to answer a wider range of questions than would be allowed with parametric statistics alone. The following table shows some of the more typical parametric and nonparametric statistical tests used in social science research today. They are categorized by parametric/nonparametric and by the two primary types of inferential studies: group difference and relationship (association) studies. You will also find these statistical tests described in your text on pg. 249 (Table 11.4) and pg. 273 (Table 12.3).
| Inferential Statistics | Parametric Tests | Nonparametric Tests |
| --- | --- | --- |
| Group Difference Studies | t-test; ANOVA; ANCOVA | Chi-square; Mann-Whitney U test; Wilcoxon signed-rank test; Kruskal-Wallis test |
| Relationship (Association) Studies | Pearson correlation coefficient; Correlation ratio, eta | Contingency coefficient; Rank-difference correlation, rho; Kendall's tau; Biserial correlation; Widespread biserial correlation; Point biserial correlation; Tetrachoric correlation; Phi coefficient |
Appropriate and Inappropriate Uses of Inferential Statistical Tests
Very simply stated, an inferential statistical test is appropriately used if the statistical test (parametric or nonparametric) matches the type of data being analyzed. Recall from above that parametric statistical tests require the use of continuous data. Therefore, if a researcher is collecting data on a variable (attribute) of interest and the data are continuous, then the researcher may appropriately use a parametric statistic. If the data are not continuous on the variable being analyzed, then only nonparametric statistical tests are appropriate. Obviously, to determine if a particular statistical test is appropriate, you must determine what variable(s) is(are) being analyzed and what type of data are being collected. Sometimes this is easy to do ... sometimes it isn't. Let's try an example.
Let's say a researcher wants to analyze subjects' attitudes about the perceived value of graduate education. The researcher develops a Likert-type questionnaire to collect the data, and uses a Pearson product moment correlation to determine if there is a statistically significant relationship between gender and perceived educational value. Is this statistical test appropriate?
This is a relationship (association) study; in such studies, two variables are correlated at once. So, we must look at both variables. The two variables being correlated here are gender and perceived educational value. Since the researcher-developed questionnaire uses a Likert-type scale, it is likely collecting nominal (categorical) data. Meanwhile, gender is also a categorical variable (a special type called a dichotomy because it has only two categories--male and female). Because both of these types of data are discrete, the most appropriate statistical test should be nonparametric. Looking at the table above, the Pearson correlation coefficient is a parametric test. Therefore, in this example, the researcher is using an inappropriate statistical test.
What is likely to occur is that the researcher may find a statistically significant correlation (because of the high power of the Pearson test) that really doesn't exist. It's kind of like looking through a microscope and seeing what you think is a microscopic organism, but it turns out to be a speck of dust on the lens. The trouble with very powerful tools is that if they are used indiscriminately, they sometimes amplify random occurrences to make them appear real.
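For a nonparametric alternative in this example, one option from the table above is a chi-square test of independence on a gender-by-response contingency table. Here is a minimal sketch, assuming SciPy is available; the counts are invented for illustration:

```python
from scipy import stats

# Rows: male, female. Columns: counts of SA, A, N, D, SD responses.
table = [
    [40, 30, 20, 25, 35],
    [30, 35, 25, 30, 30],
]

# Tests whether response category is independent of gender.
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"Chi-square = {chi2:.2f}, p = {p:.3f}")
```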
Try another example. A researcher is studying the effects of room color on academic test performance. He divides students among 3 different room colors (bright colors, pastel colors, and neutral colors). He measures students' performance (percentage scores) on an academic test and looks for statistically significant differences between the various rooms. He uses an ANOVA (Analysis of Variance) test to analyze the data. Is this an appropriate statistical test?
This is an example of a group difference study (you can tell because the researcher is creating groups and treating each of them differently). In this type of study, the variable that undergoes the statistical test is called the dependent variable. In this example, the dependent variable is the students' scores on the test. Scores on academic tests are continuous. So, by our rule, the most appropriate statistic the researcher can use is a parametric statistic. When we check the table above, we see that the ANOVA test is, indeed, a parametric test. Therefore, the researcher is using the most appropriate statistical test. That doesn't guarantee a perfectly accurate result, but he can be confident that a statistically significant difference of the size he found would occur by chance no more than 5 percent of the time in the population from which he drew the sample.
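Here is a minimal sketch of the room-color analysis as a one-way ANOVA, assuming SciPy; the percentage scores are invented for illustration:

```python
from scipy import stats

# Invented percentage test scores from the three room-color groups.
bright  = [72, 68, 75, 70, 74]
pastel  = [80, 78, 83, 79, 81]
neutral = [76, 74, 77, 73, 75]

# One-way ANOVA: is there a significant difference somewhere among the groups?
f_stat, p = stats.f_oneway(bright, pastel, neutral)
print(f"F = {f_stat:.2f}, p = {p:.4f}")  # p < .05 would indicate a significant difference
```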
Once again, the way to determine whether or not a researcher is using an appropriate inferential statistic is to examine the data being analyzed and apply the following rule: If the data being analyzed are discrete in nature, then the most appropriate inferential statistic a researcher can use is a nonparametric statistic. If the data being analyzed are continuous, then the most appropriate inferential statistic a researcher can use is a parametric statistic.
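As a minimal sketch, the rule can even be written as a small Python function (the function name and category strings are illustrative, not a standard API):

```python
def appropriate_statistic_class(data_type: str) -> str:
    """Map a variable's data type to the appropriate class of inferential test."""
    if data_type in ("nominal", "ordinal"):   # discrete scales
        return "nonparametric"
    if data_type in ("interval", "ratio"):    # continuous scales
        return "parametric"
    raise ValueError(f"Unknown data type: {data_type}")

print(appropriate_statistic_class("ordinal"))  # -> nonparametric
print(appropriate_statistic_class("ratio"))    # -> parametric
```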
What is Statistical Significance?
Consider a situation where a researcher is trying to determine if a novel counseling program is more effective at curbing delinquency than a standard counseling program. She sets up an experiment using two groups of delinquent subjects. Both groups will experience some change in delinquent behavior over the course of the experiment. What the researcher must determine is whether the difference that occurred between the two groups was a real occurrence (caused by the novel counseling program) or a random (chance) occurrence.
In research, when we want to say that a difference between groups (or a relationship between variables, for that matter) is a real occurrence, we say that the difference (or relationship) is statistically significant, which means that a difference (or relationship) as large as the one the researcher found in his/her study could not occur by chance more than a certain percentage of the time.
OK. What does that mean??? It's pretty simple, really. First, let's agree that we can never be 100 percent certain of the results of any study done with samples. That's because of sampling error (we talked about that in Week 1). So, if we can never be 100 percent certain, we'll have to settle on some percentage of confidence. Traditionally, researchers have used two levels of confidence -- 95 percent and 99 percent. Each level of confidence is associated with a certain amount of error. This error (known as alpha error) is 1.0 minus the confidence level, expressed as a decimal. So, for the 95 percent confidence level, the alpha error is .05 (1.0 - .95), and for the 99 percent confidence level, the alpha error is .01. Unless otherwise specified by the researcher, the alpha level for a research study is assumed to be .05. That means the researcher is willing to accept the possibility that a statistically significant result from his study could occur by chance 5 percent of the time, or five times out of 100.
So, to put it all together, the researcher in our example above would be able to say that her novel counseling program was effective if the difference in delinquency between the two groups was so large that it could not occur by chance more than 5 percent of the time. She would be able to say, in other words, that the difference between the two counseling programs was statistically significant.
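Here is a minimal sketch of how that decision plays out in practice, using a t-test on two invented sets of delinquency scores (this assumes the third-party SciPy library; the data are made up for illustration):

```python
from scipy import stats

# Invented delinquency scores (lower = less delinquent behavior).
standard_program = [14, 12, 15, 13, 16, 14, 15, 13]
novel_program    = [10,  9, 11,  8, 12, 10,  9, 11]

t_stat, p = stats.ttest_ind(standard_program, novel_program)

# The conventional alpha level when none is specified.
ALPHA = 0.05
if p < ALPHA:
    print(f"p = {p:.4f} < {ALPHA}: the difference is statistically significant")
else:
    print(f"p = {p:.4f} >= {ALPHA}: the difference could plausibly be chance")
```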
What is Practical Significance?
Just because a difference (or relationship) is found to be statistically significant doesn't mean that it's practically significant. Whereas statistical significance is determined "by the numbers," so to speak, practical significance is, quite literally, a "judgment call." Consider the following example. Let's say that you just finished reviewing a research article that showed a statistically significant improvement in 3rd grade reading achievement using a new reading program. The improvement was small (reading achievement was accelerated by only 2 months over 3rd graders not using the new program). But, because it was statistically significant, the results were not chance occurrences and would very likely repeat themselves with different 3rd grade students. So, you go to your school superintendent (did I say you were a teacher in this example? Well, you are.), tell him about the study and its results, and suggest that he implement this new program in all 3rd grade classes throughout the district. He is, of course, impressed with your concern for the educational quality of the district and tells you so. But then he asks how much the new program costs. You, of course, have done your homework and contacted the researcher in advance to ask her that very question. You tell the superintendent what the researcher told you: the new program would cost approximately $1000 per student to implement throughout the district, plus teacher retraining time and cost. The superintendent thanks you for your hard work but adds that the budget just won't support such an expenditure at this time. End of scene one.
Four months later, you are invited to the superintendent's office to discuss your proposal again. This time, the superintendent seems to be "all ears" and asks if you'll coordinate the implementation and teacher-training effort needed to get the program off the ground throughout the district. Flattered, you say yes and leave his office, wondering why he changed his mind so dramatically. The statistical results had not changed in four months, and the program had not gotten any cheaper to implement. So why the change of heart? What you, the teacher, did not know was that the week before your second meeting, the superintendent had come from a school board meeting where he had been accused of not caring enough about the reading ability of 3rd grade children (prompted by several angry parents' calls and letters to school board members the week before that). The superintendent was also being threatened by the school board with dismissal if he did not soon show some improvement in reading ability among 3rd grade students.
From the superintendent's viewpoint, the first time you presented your proposal to him, he perceived no problems with the current reading programs in the school district, and the meager (2 month) increase in reading ability was not cost effective given the cost of the program. There was no practical significance to the results (no practical reason to implement them in his district). But, when he learned that there was a serious reading problem in the 3rd grade classrooms (not to mention the fact that his own future was in jeopardy), he concluded that any improvement in reading (even a paltry 2 months) would be better than nothing. So, suddenly the findings of the research have practical significance for him. We can say that practical significance is a judgment call based on economic and/or political considerations.
Proceed to Part 2 of 2 of the Week 3 lesson.