CP 6691 - Week 2 (Part 1)

Evaluating Measures Used in
Social Science Research


Table of Contents

Part 1 of 2
Characteristics of Social Science Measurement (Data Collection) Instruments
Validity
Reliability
Relationship Between Validity and Reliability
Appropriateness
Objectivity

Kinds of Measures Used in Social Science Research
Paper-and-Pencil Tests
Questionnaires
Interviews
Direct Observations

Part 2 of 2
Evaluate a Sample Study That Uses Paper-and-Pencil Measures
Evaluate a Sample Study That Uses Questionnaire Measures
Evaluate a Sample Study That Uses Interview and Observation Measures
Assignment for Week 3

Part 1 of 2

Characteristics of Social Science Measures

All measurement instruments designed to collect data must possess certain characteristics. The degree to which an instrument displays these characteristics determines the strength or weakness of the instrument. The characteristics we're talking about are:

  • Validity
  • Reliability
  • Appropriateness
  • Objectivity

It is the researcher's responsibility to establish an instrument's validity, reliability, appropriateness, and objectivity -- its fitness for the job at hand. Generally speaking, measurement instruments are either standardized or researcher-made for the specific study being conducted.

If the measure a researcher is using for a study is standardized, then, of course, the researcher must ensure the instrument is appropriate for his/her sample of subjects, and he/she must also ensure the instrument will yield clean, objective data. He/she must also be prepared to assess the reliability of the instrument (because reliability is a function of the group being tested -- it changes with every test administration). The researcher does get a break, however, if he/she decides to use a standardized data gathering instrument, because standardized instruments already have a validation (norming) base. The validation studies on any standardized instrument can be found in measurement reference books such as Buros' Mental Measurements Yearbook and Tests In Print.

However, if the researcher creates his/her own measurement devices for use in the study, then it is the researcher's responsibility to assess both the validity and the reliability of the instrument. The evaluator (that's you) can usually infer the appropriateness and objectivity of the instrument from what you're told in the article.


Validity

A measure is considered valid if it measures what it is supposed to measure. A science test that asks questions about scientific principles may be valid, but a science test composed of questions about historical scientific events probably is not valid as a science test (although it may be valid as a history test). There are several different types of validity which are discussed in some detail in your course textbook on pages 136-138.


Reliability

A measure is considered reliable to the extent that it is free of measurement errors. The less error in a measurement instrument, the more consistently it will measure the same attribute time after time. There are several different types of reliability which are discussed in some detail in your course textbook on pages 138-140.
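
If you like to see the idea in concrete form, here is a minimal sketch (my own illustration, not from your textbook; the scores are hypothetical, and Python 3.10 or later is assumed) of one common way to estimate reliability: correlate the scores from two administrations of the same test to the same subjects (test-retest reliability).

    # Test-retest reliability, estimated as the Pearson correlation between
    # two administrations of the same instrument (hypothetical scores).
    from statistics import correlation  # available in Python 3.10+

    first_administration = [72, 85, 90, 64, 78, 88, 70]
    second_administration = [74, 83, 92, 61, 80, 86, 73]  # same 7 subjects, retested

    r = correlation(first_administration, second_administration)
    print(f"test-retest reliability estimate: r = {r:.2f}")  # near 1.0 = consistent

The closer r is to 1.0, the less measurement error the instrument appears to carry from one administration to the next.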


Relationship Between Validity and Reliability

Reliability and validity are related in an interesting way. A measurement instrument that is not reliable cannot be valid; reliability is a prerequisite for validity. But the reverse does not hold: just because an instrument is reliable does not guarantee that it is valid. The instrument must be reliable before we can even consider whether it is valid. This relationship between validity and reliability is easier to see with a picture.
[Figure: three targets, A, B, and C, each showing a different pattern of shots.]

Imagine you used three different guns to fire at targets A, B, and C (also imagine you're a very good shot, or this example won't work). What could we conclude from the patterns? The pattern on target A shows that this gun does not shoot (or aim) very reliably (it doesn't hit the same place on the target consistently). In fact, this gun is so unreliable that we can consider it to be an invalid instrument (you wouldn't want to use this particular gun if your goal was to win a prize at target shooting).

The pattern on target B shows the shots from this gun to be focused in a small area on the target. So, we could say this gun has good reliability (because it consistently hits the same place on the target). But it still doesn't have very good validity because it can't seem to hit the bullseye. Only target C displays a pattern that shows both good reliability and good validity.
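
To tie the analogy to numbers, here is a minimal simulation (my own illustration; the bias and spread values are made up). Each gun's aim bias plays the role of (in)validity, and its shot-to-shot scatter plays the role of (un)reliability.

    # Simulate the three targets: bias ~ validity, scatter ~ reliability.
    import math
    import random

    def shoot(bias, spread, shots=20):
        """Return mean distance from the bullseye and mean scatter around the cluster."""
        points = [(random.gauss(bias, spread), random.gauss(0, spread))
                  for _ in range(shots)]
        cx = sum(x for x, _ in points) / shots
        cy = sum(y for _, y in points) / shots
        avg_error = sum(math.hypot(x, y) for x, y in points) / shots          # validity
        scatter = sum(math.hypot(x - cx, y - cy) for x, y in points) / shots  # reliability
        return avg_error, scatter

    for name, bias, spread in [("A", 0.0, 5.0),   # unreliable (and therefore invalid)
                               ("B", 4.0, 0.5),   # reliable, but off the bullseye
                               ("C", 0.0, 0.5)]:  # reliable and valid
        err, sct = shoot(bias, spread)
        print(f"target {name}: mean error from bullseye = {err:.1f}, scatter = {sct:.1f}")

Gun A produces a large scatter (unreliable), gun B a tight cluster centered away from the bullseye (reliable but not valid), and gun C a tight cluster on the bullseye (reliable and valid).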


Appropriateness

Appropriateness is important to consider when evaluating research because a test that is used incorrectly or inappropriately will bias the results of the test and cast serious doubt on the findings of the research.

A simple example of this is the use of IQ tests. It would be completely inappropriate to test the IQ of pre-school children with the same test used to measure the IQ of high school students. Since, in general, pre-schoolers can't read, an appropriate IQ test for them would be one comprised only of pictures. When evaluating a study that uses a paper-and-pencil test, try to determine if the subjects are able to read.

Another aspect of appropriateness is to ask if the subjects being tested should be expected to have the information requested by the measure. For example, it would not be appropriate to give a calculus test to second-grade students since they would not be expected to know anything about calculus. But it would be appropriate to give the same test to college freshmen enrolled in a calculus class.


Objectivity

A measure is said to be objective if there is an absence of subjective judgment in recording or interpreting data. Some measures are more likely to be objective than others. For instance, a measure that asks for factual data is likely to be objective, since very little judgment is needed to provide facts (only knowledge of the facts is necessary). However, if the data to be measured relate to attitudes, emotional characteristics, and the like, the degree of objectivity in the data is much lower. There is another term you will encounter a little later in this lesson that relates to objectivity, called inference. The degree of inference in the data relates to the degree of objectivity. Low inference data are associated with a low degree of judgment and high objectivity (e.g., factual data). High inference data are associated with a high degree of subjective judgment and low objectivity (e.g., measuring attitudes or scales of performance).


Kinds of Measures

In educational research alone, there are several thousand research studies published every year. And, among all of them there are four primary types of instruments (or measures) used to collect research data. These measures are:

  • Pencil-and-paper tests
  • Questionnaires (or surveys)
  • Interviews
  • Observations

The use of these measures is not restricted to educational research. They are frequently used in counseling, psychology, and virtually all other types of social science research. This means that by knowing how to evaluate these four types of measures, you will be able to evaluate their use in a very large number of research studies across a number of disciplines. Each of these measures, as you might expect, has advantages and disadvantages associated with it.

Some measures work better in one type of situation than another. For instance, if you want to measure attitudes, a questionnaire is usually most appropriate, while an observation instrument would be more appropriate for measuring behaviors. Knowing what those circumstances are will help you to determine if they're being used appropriately in a given research study. You'll also learn the proper steps and precautions in using each measure. All quantitative research studies work with data; so they all must use some measurement device to collect those data. Once you learn the fundamentals of evaluating these four types of measures, you will have learned a skill that will serve you well no matter what kind of quantitative research study you review.

Let's begin our study of evaluating measures by looking at pencil-and-paper tests.


Pencil-and-Paper Tests

Probably the major advantages of this measuring instrument are its ease of use and flexibility. Pencil-and-paper tests are used for measuring a wide variety of cognitive and general intellectual human attributes. Typically though, such tests are comprised of words and, therefore, require the subject being tested to be able to read.

Some types of tests can suffer from the threat of self-reported data. This is where the subject answers the questions on the test the way he/she thinks the researcher wants them answered. Or, self-report can lead a subject to knowingly provide false information on the test. This frequently occurs when the test is composed of questions that are threatening in some way or ask for possibly incriminating information from the subject. The self-report threat is most often associated with questionnaires, but it can occur in any of the four measurement types we are studying.


Questionnaires

Questionnaires are similar to paper-and-pencil tests in that they generate written responses. However, questionnaires are used when it is desirable to measure several variables with a single instrument. The authors of your text identify several questions you should consider when evaluating questionnaires as measurement instruments.

1. Was the questionnaire pretested?

It's important to pretest a questionnaire to ensure it will produce the information the researcher intends it to produce. Here's a silly example. Let's say I develop a questionnaire and send it out to 10,000 people. Among the questions is one that asks for the subject's gender. The question looks like this:  Sex ______________

Of course, I'm expecting to get an answer of either "M" or "F," "Male" or "Female." But, when the responses come back, nearly half of them have "Yes" or "No" or "As often as possible" or "Once a month if I'm lucky" or something else equally silly written in the space instead!!!! That's certainly not what I expected to receive. Without pretesting, these kinds of problems won't be known until it's too late.

2. Did the questionnaire include any leading or psychologically threatening questions?

If the questionnaire includes any leading or threatening questions, then its reliability and validity are both threatened. It would be ideal if you could review the actual questionnaire items to make these determinations. Unfortunately, very few studies reproduce the measurement devices in the journal article, which really isn't too surprising given the strict space limitations imposed by many journals. However, even though the questionnaire may not be available, look in the Methodology (or Procedures) section of the study for a description of what kinds of information were requested in the questionnaire. Sometimes these descriptions give an idea of the types of questions that might have been asked. Although such descriptions will not give you absolute evidence of leading or threatening questions, you may be able to infer the possibility of such questions.

For example, consider a study where the researcher is trying to determine the level of integrity in a group of subjects with the use of a questionnaire. Suppose you knew, for instance, that subjects were asked about their degree of honesty in their business dealings, their honesty in personal relationships, and so on. Even though the actual questions are not available, it is reasonable to infer from this information that some very personal questions were probably asked regarding an individual's personal and professional integrity. And, although we cannot tell whether any leading questions are in the instrument, there are almost certainly several psychologically threatening questions that might prompt some respondents to falsify their responses (self-report bias). This, alone, raises serious questions about the validity and reliability of the data provided by respondents. If you were reading this study, you should be on the lookout for any methods the researcher used to identify and try to correct for such bias. If you didn't find anything of that nature in the study, you should view the findings with a great deal of skepticism.

3. Were the subjects who received the questionnaire likely to have the information requested?

This is a question of appropriateness. If the answer is no, then either the instrument or the sample is inappropriate. You should assume that the sample is the one the researcher intended to include in the study. Therefore, you should conclude that the researcher used an inappropriate instrument (questionnaire).

4. What percentage of subjects responded to the questionnaire?

Obviously, the larger the percentage of respondents the better. But, the eternal question is: "how large is large enough?" The answer varies, depending on the type of survey it is and the experience of the researcher administering the instrument. Return rates of 70 percent are typically considered good for social science research surveys. Telephone surveys (interviews) may yield slightly higher response rates than mail-out type surveys. But, even a 70 percent return rate means that you lack information on 30 percent of the original sample. If 10,000 people were in the original sample, you might consider 7,000 returns to be sufficient. But, if only 10 people were in the original sample, would you consider 7 returns enough? Ultimately, the decision is yours to make (as the evaluator) about whether you are satisfied with the return rate the researcher received.
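
Here is that return-rate arithmetic as a minimal sketch (the sample sizes are the hypothetical ones from the paragraph above):

    # Return-rate arithmetic: the same 70 percent can mean very different
    # things depending on the size of the original sample.
    def return_rate(sample_size, returns):
        rate = returns / sample_size * 100
        missing = sample_size - returns
        return rate, missing

    for sample, returned in [(10_000, 7_000), (10, 7)]:
        rate, missing = return_rate(sample, returned)
        print(f"sample of {sample}: {rate:.0f}% returned; no data on {missing} subjects")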

5. What steps were taken to contact nonrespondents, and how many ultimately responded?

Every research study that uses questionnaires should have a plan for contacting nonrespondents. Researchers are usually proud of their follow-up plans and often elaborate on them. If you don't see any evidence in the report of whether or how nonrespondents were contacted (followed up), then that's really all you can say in your evaluation. You cannot assume no follow-up was done just because you can't find it written anywhere. All you can say is that there is no evidence of a follow-up plan for nonrespondents.


Interviews

Interviews are essentially verbal questionnaires with an additional advantage. Within an interview, if the interviewee gives an answer that the interviewer wants to know more about, the interviewer is able to probe for additional information, or to go off on a tangential line of questioning. This is something that cannot be done with a questionnaire. When you encounter a study that uses interviews, you will, no doubt, encounter the term interview protocol. This is simply the set of questions and instructions given to the interviewer. There are three basic types of protocols: structured, semi-structured, and unstructured.

In a structured protocol, the interviewer is given a specific set of questions to ask and cannot deviate from them, nor can he/she elaborate on the questions, even if asked by the interviewee. The questions must be asked exactly as they are written on the protocol sheet. The questions are typically multiple-choice or true-false (in other words, they contain structured answer selections).

With a semi-structured protocol, some questions have probe questions added to them which the interviewer may ask to probe for additional information, usually if the interviewee's response to the main question falls into particular categories identified on the protocol sheet. Semi-structured protocols also often have open-ended questions in addition to the typical multiple-choice or true-false structured questions.
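
To make the distinction concrete, here is a minimal sketch (the questions, answer choices, and probe rules are all hypothetical, not from any actual protocol) of how a structured item and a semi-structured item might be represented:

    # Hypothetical interview protocol entries.
    protocol = [
        {   # structured: fixed wording, fixed answer choices, no probes
            "question": "How often do you attend department meetings?",
            "choices": ["never", "monthly", "weekly"],
            "probes": {},
        },
        {   # semi-structured: a probe is triggered by particular answers
            "question": "Are you satisfied with your current workload?",
            "choices": ["yes", "no"],
            "probes": {"no": "What would you change about it?"},
        },
    ]

    for item in protocol:
        kind = "semi-structured" if item["probes"] else "structured"
        print(f"{kind}: {item['question']}")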

An unstructured protocol is composed almost exclusively of open-ended questions. The interviewer may even make up questions during the interview depending on the responses of the interviewee. Unstructured protocols are seldom used in quantitative research because they require a great deal of manual data coding. This, in turn, allows for the introduction of bias (error) in the coding process, which usually isn't a problem with the structured protocol. Some research texts call the unstructured protocol an informal interview protocol because it resembles casual conversation rather than a formal structured interview. These so-called informal interviews are usually used in qualitative research studies. In your textbook, the authors identify several questions you should consider when evaluating interviews as measurement instruments. Because interviews and questionnaires are so similar, don't be surprised that some of the questions look similar to those presented in the previous section.

1. How well were the interviewers trained?

It's very important to train interviewers to ensure they ask questions properly (without leading the interviewee to provide a certain answer). It's also important that the interviewer be aware that his/her body language can provide subtle clues that could lead the interviewee to answer questions one way or another. Usually, such bias is unintentional on the part of the interviewer. Training will make the interviewer aware of these possibilities and should reduce, or eliminate, the possibility of bias induced by the interviewer.

2. How was the information recorded?

Unobtrusive recording methods are best, because they do not intimidate the interviewee. However, unobtrusive methods aren't always available. It's important that the interviewer (or researcher) use a recording mechanism that is not subject to judgment or bias induced by the interviewer.

3. How much judgment was called for?

This is the interviewer-induced bias I talked about in #2 above. Suppose the interviewer is asked to assess the physical well-being of the interviewee. The least amount of judgment would be required by asking the following question (with the choices):

How would you categorize your physical condition at this moment? Would you say you are:

A. in excellent health
B. suffering from a cold, flu, or sinus condition
C. suffering from a physical illness other than a cold, flu, or sinus condition

However, it would be much worse if the interviewer tried to ascertain the interviewee's physical well-being by merely looking at him/her and making a personal judgment.

4. Were the interview procedures tried out before the study began?

The same as the "pretesting" question for questionnaires.

5. Were leading questions asked?

Again, the same question we asked regarding questionnaires.

6. How much did the interviewer know about the research?

This is important to know because the more the interviewer knows, the more likely it is that the interviewer will ask questions in ways (leading) that he/she thinks would help the research. This sometimes happens with novice interviewers who believe they are actually helping the researcher, not realizing the bias they are introducing into the data. The best situation is one in which the interviewer knows nothing about the research. If you encounter a study where you believe the interviewer knows a lot about the research study, pay special attention to any methods used by the researcher to counteract the possible effects of interviewer-induced bias.


Observations

Observations (direct observations) are used when the researcher is interested in measuring behaviors. Behaviors are more accurately measured through observations than through questionnaires or interviews. The self-report bias problem associated with questionnaires and interviews is not a great concern in direct observations because the researcher can vary the observation periods and their lengths to reduce the possibility that those being observed will falsify their behaviors. The authors of your text identify several questions you should consider when evaluating studies that used direct observations for collecting data.

1. Were high-inference or low-inference behaviors observed?

High-inference behaviors are behaviors that require a great deal of subjective judgment to interpret. Low-inference behaviors, on the other hand, do not require much interpretive judgment. For example, suppose a teacher were being observed on his questioning style with students. A high-inference item would be worded something like this:

Teacher asks good questions. Teacher asks questions frequently throughout the lesson.

How is the observer defining "good" and "frequently" in measuring these behaviors? The observer must apply considerable judgment when determining that the teacher either asks good or poor questions, or whether the teacher asks questions frequently or infrequently. A low-inference observation of a teacher's questioning style, however, might look something like this:

How many overhead questions does the teacher ask during the lesson? How many direct questions does the teacher ask during the lesson? ... etc.

As long as "overhead" and "direct" questions are defined (overhead questions are asked to the entire class, whereas direct questions are asked to specific students), there is no judgment involved in making these observations. The observer simply counts the number of each type of question asked during the lesson.
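
Here is a minimal sketch (the codes and the sequence are hypothetical) of how such a low-inference count might be recorded: the observer simply tags each question as it occurs, and the totals require no judgment beyond the two definitions.

    # Low-inference observation: tally coded question types.
    from collections import Counter

    # "O" = overhead question (asked of the whole class)
    # "D" = direct question (asked of a specific student)
    observed_questions = ["O", "D", "O", "O", "D", "O", "D", "D", "O"]

    tally = Counter(observed_questions)
    print(f"overhead: {tally['O']}, direct: {tally['D']}")  # overhead: 5, direct: 4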

2. Were observers trained to identify the variables to be observed?

Using multiple observers increases the likelihood that important behaviors will be observed (what if there were only one observer and she blinked or sneezed while an important behavior occurred?). However, when multiple observers are used in a study, it is very important that they all see the same behaviors at the same time. That requires training. Usually, the more behaviors being observed, the more important training becomes.

3. What was the interobserver reliability?

Interobserver reliability is your authors' term. Actually, it is more correct to refer to it as interobserver agreement rather than reliability. Regardless of what it's called, it is a measure of the consistency between the different observers. Ideally, all observers will see (and record) the same behavior at the same time -- that would be 100 percent agreement. Realistically, though, observer agreement indices seldom exceed 90 percent (we are humans, after all). Note that even if you are not told in the study that observers were trained, a very high interobserver agreement value would let you reasonably assume that the observers had been trained very well.
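
Percent agreement is simple enough to compute by hand, but here is a minimal sketch (the interval codes are hypothetical) of the calculation for two observers who coded the same five observation intervals:

    # Simple percent agreement between two observers.
    observer_1 = ["on-task", "off-task", "on-task", "on-task", "off-task"]
    observer_2 = ["on-task", "off-task", "on-task", "off-task", "off-task"]

    matches = sum(a == b for a, b in zip(observer_1, observer_2))
    agreement = matches / len(observer_1) * 100
    print(f"interobserver agreement: {agreement:.0f}%")  # 4 of 5 intervals = 80%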

4. How long was the observation period?

The only thing I want to say about this is that longer is not necessarily better when it comes to observation periods. What's more important is overall observation time. You may be tempted to conclude that a 30-second observation period is not enough time to adequately observe a particular set of behaviors. But if you are told that there are 60 separate observation periods spread out over a number of days, then you really have 30 minutes (60 periods x 30 seconds = 1,800 seconds / 60 seconds per minute = 30 minutes) of overall observation time, which would be quite enough time to make the necessary observations. You may well ask: "Why not just do a single 30-minute observation period and be done with it?" The answer really depends on the particulars of the research study and the behaviors being observed. But, in general, a single, long observation period may not reveal "typical" behaviors. And it certainly won't reveal a large variety of them. Short, focused observations spread out over a relatively long period tend to reveal consistent patterns of behaviors that are more representative of the subject's behaviors.
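
For completeness, the total-time arithmetic as a quick sketch (same numbers as above):

    # Overall observation time from many short periods.
    periods, seconds_each = 60, 30
    total_minutes = periods * seconds_each / 60
    print(f"{periods} periods x {seconds_each}s = {total_minutes:.0f} minutes total")  # 30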

5. How conspicuous were the observers?

Everyone who went to grade school in the United States no doubt remembers parents' day -- when parents came into the classroom and stayed for a few hours during the day with their children while the teacher struggled to keep the class in line and teach a meaningful lesson. There's something about being watched that changes most people's behavior patterns. Some people act up and act out in front of an audience, while others shy away and, though normally talkative, become almost reclusive. The same thing happens when observers are used in research studies. This is one form of observer bias, and it is completely unintentional on the part of the observer -- it occurs only because the observer is conspicuous to the subject being observed.

So, ideally, observers should be inconspicuous. But, that's not always possible. Your task, as an evaluator, is to determine how, if at all, the researcher addresses this possible bias. There may be ways of compensating for it even if the observers are in full view of the subjects being observed. Pay close attention to how the researcher deals with it.

Next you'll have a chance to practice what you've learned about evaluating measures. Your assignment now is to review Studies 8, 9, and 10 in the Supplemental Book (SB), and evaluate the measures used in each one. In Study 8, the measures used are pencil-and-paper tests; in Study 9, the measure is a questionnaire; and in Study 10, the measures are interviews and observations (notice that a researcher can use any or all of the measures in a single study -- his/her goal is to use the right measure(s) to collect the data needed). Refer to Chapter 6 in your text and Part 1 (above) of this lesson to help you evaluate the measures used in each study. Then, when you are ready, go on to Part 2 and read my evaluation of each study. If you find points where we disagree, note them and see if you can find why we differ. Bring them up in class.


Proceed to Part 2 of 2 of the Week 2 lesson.