Activity 6 Revisited
Activities for Exploring Randomness
Activity 12: Random Babies
Activity 13: AIDS Testing
Activity 14: Reese’s Pieces
Activities for Drawing Inferences
Activity 15: Which Tire?
Activity 16: Kissing the Right Way
Activity 17: Reese’s Pieces (cont.)
Activity 18: Dolphin Therapy
Activity 19: Sleep Deprivation
Activity 6 Revisited
Activity 20: Cat Households
Activity 21: Female Senators
Activity 22: Game Show Prizes
Activity 23: Government Spending
See the end of the handout for activity-specific “morals” given in the slides, as well as references and general advice, including advice on implementing active learning, writing exams, and administering data collection projects.
Activity 1: Naughty or Nice?
This is a stand-alone activity that can be used very early in a course to introduce the concepts and reasoning process of statistical inference.
We all recognize the difference between naughty and nice, right? What about children less than a year old: do they recognize the difference and show a preference for nice over naughty? In a study reported in the November 2007 issue of Nature, researchers investigated whether infants take into account an individual’s actions towards others in evaluating that individual as appealing or aversive, perhaps laying the foundation for social interaction (Hamlin, Wynn, and Bloom, 2007). In one component of the study, 10-month-old infants were shown a “climber” character (a piece of wood with “google” eyes glued onto it) that could not make it up a hill in two tries. Then they were alternately shown two scenarios for the climber’s next try, one where the climber was pushed to the top of the hill by another character (“helper”) and one where the climber was pushed back down the hill by another character (“hinderer”). The infant was alternately shown these two scenarios several times. Then the child was presented with both pieces of wood (the helper and the hinderer) and asked to pick one to play with. The researchers found that 14 of the 16 infants chose the helper over the hinderer. Researchers varied the colors and shapes that were used for the two toys. Videos demonstrating this component of the study can be found at http://www.yale.edu/infantlab/socialevaluation/Helper-Hinderer.html
(a) What proportion of these infants chose the helper toy? Is this more than half (a majority)?
Suppose for the moment that the researchers’ conjecture is wrong, and infants do not really show any preference for either type of toy. In other words, these infants just blindly pick one toy or the other, without any regard for whether it was the helper toy or the hinderer. Put another way, the infants’ selections are just like flipping a coin: Choose the helper if the coin lands heads and the hinderer if it lands tails.
(b) If this is really the case (that no infants have a preference between the helper and hinderer), is it possible that 14 out of 16 infants would have chosen the helper toy just by chance? (Note, this is essentially asking, is it possible that in 16 tosses of a fair coin, you might get 14 heads?)
Well, sure, it’s definitely possible that the infants have no real preference and pure random chance alone led to 14 of 16 choosing the helper toy. But is this a remote possibility, or not so remote? In other words, would the observed result (14 of 16 choosing the helper) be very surprising when infants have no real preference, or somewhat surprising, or not so surprising? If the answer is that the result observed by the researchers would be very surprising for infants who had no real preference, then we would have strong evidence to conclude that infants really do prefer the helper. Why? Because otherwise, we would have to believe that the researchers were very unlucky and a very rare event just happened to occur in this study. It could be just a coincidence, but if we decide that tossing a coin rarely leads to results as extreme as we saw, we can use this as evidence that the infants were acting not as if they were flipping a coin but instead have a genuine preference for the helper toy (that infants in general have a higher than .5 probability of choosing the helper toy).
So, the key question now is how to determine whether the observed result is surprising under the assumption that infants have no real preference. (We will call this assumption of no genuine preference the null model.) To answer this question, we will assume that infants have no genuine preference and were essentially flipping a coin in making their choices (i.e., knowing the null model to be true), and then replicate the selection process for 16 infants over and over. In other words, we’ll simulate the process of 16 hypothetical infants making their selections by random chance (coin flip), and we’ll see how many of them choose the helper toy. Then we’ll do this again and again, over and over. Every time we’ll see the distribution of toy selections of the 16 infants (the “could have been” distribution), and we’ll count how many infants choose the helper toy. Once we’ve repeated this process a large number of times, we’ll have a pretty good sense for whether 14 of 16 is very surprising, or somewhat surprising, or not so surprising under the null model.
Just to see if you’re following this reasoning, answer the following:
(c) If it turns out that we very rarely see 14 of 16 choosing the helper in our simulated studies, explain why this would mean that the actual study provides strong evidence that infants really do favor the helper toy.
(d) What if it turns out that it’s not very uncommon to see 14 of 16 choosing the helper in our simulated studies? Explain why this would mean that the actual study does not provide much evidence that infants really do favor the helper toy.
Now the practical question is, how do we simulate this selection at random (with no genuine preference)? One answer is to go back to the coin flipping analogy. Let’s literally flip a coin for each of the 16 hypothetical infants: heads will mean to choose the helper, tails to choose the hinderer.
(e) What do you expect to be the most likely outcome: how many of the 16 choosing the helper?
(f) Do you think this simulation process will always result in 8 choosing the helper and 8 the hinderer? Explain.
(g) Flip a coin 16 times, representing the 16 infants in the study. Let a result of heads mean that the infant chose the helper toy, tails for the hinderer toy. How many of the 16 chose the helper toy?
(h) Repeat this three more times. Keep track of how many infants, out of the 16, choose the helper. Record this number for all four of your repetitions (including the one from the previous question):
Number of (simulated) infants who chose helper
(i) How many of these four repetitions produced a result at least as extreme (i.e., as far or farther from expected) as what the researchers actually found (14 of 16 choosing the helper)?
(j) Combine your simulation results for each repetition with those of your classmates. Produce a well-labeled dotplot.
(k) How’s it looking so far? Does it seem like the results actually obtained by these researchers would be very surprising under the null model that infants do not have a genuine preference for either toy? Explain.
We really need to simulate this random assignment process hundreds, preferably thousands of times. This would be very tedious and time-consuming with coins, so let’s turn to technology.
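Before turning to the applet, it may help to see what such a simulation looks like in code. The following Python sketch (not part of the original activity) repeats the 16 coin flips many times and counts how often chance alone produces a result as extreme as the researchers’ observed result:

```python
import random

random.seed(42)  # fix the seed so results are reproducible

# Null model: each of 16 infants "flips a coin," choosing the helper toy
# with probability 1/2. Repeat the whole 16-infant study many times.
reps = 10000
helper_counts = [sum(random.random() < 0.5 for _ in range(16))
                 for _ in range(reps)]

# How often does blind choice alone produce a result as extreme as the
# actual study (14 or more of the 16 choosing the helper)?
extreme = sum(count >= 14 for count in helper_counts)
print(extreme / reps)  # a small proportion, well under 1%
```

The applet automates exactly this kind of loop; the proportion printed at the end is the quantity you will estimate in question (o) below.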
(l) Use the Coin Tossing applet at www.rossmanchance.com/applets/ to simulate these 16 infants making this helper/hinderer choice, still assuming the null model that infants have no real preference and so are equally likely to choose either toy. (Keep the number of repetitions at 1 for now.) Report the number of heads (i.e., the number of infants who choose the helper toy).
(m) Repeat (l) four more times, each time recording the number of the 16 infants who choose the helper toy. Did you get the same number all five times?
(n) Now change the number of repetitions to 995, to produce a total of 1000 repetitions of this process. Comment on the distribution of the number of infants who choose the helper toy, across these 1000 repetitions. In particular, comment on where this distribution is centered (does this make sense to you?) and on how spread out it is and on the distribution’s general shape.
We’ll call the distribution in (n) the “what if?” distribution because it displays how the outcomes (for number of infants who choose the helper toy) would vary if in fact there were no preference for either toy.
(o) Report how many of these 1000 repetitions produced 14 or more infants choosing the helper toy. (Enter 14 in the “as extreme as” box and click on “count.”) Also determine the proportion of these 1000 repetitions that produced such an extreme result.
(p) Is this proportion small enough to consider the actual result obtained by the researchers surprising, assuming the null model that infants have no preference and so choose blindly?
(q) In light of your answers to the previous two questions, would you say that the experimental data obtained by the researchers provide strong evidence that infants in general have a genuine preference for the helper toy over the hinderer toy? Explain.
What bottom line does our analysis lead to? Do infants in general show a genuine preference for the “nice” toy over the “naughty” one? Well, there are rarely definitive answers when working with real data, but our analysis reveals that the study provides strong evidence that these infants are not behaving as if they were tossing coins, in other words that these infants do show a genuine preference for the helper over the hinderer. Why? Because our simulation analysis shows that we would rarely get data like the actual study results if infants really had no preference. The researchers’ result is not consistent with the outcomes we would expect if the infants’ choices follow the coin-tossing process specified by the null model, so instead we will conclude that these infants’ choices are actually governed by a different process where there is a genuine preference for the helper toy. Of course, the researchers really care about whether infants in general (not just the 16 in this study) have such a preference. Extending the results to a larger group (population) of infants depends on whether it’s reasonable to believe that the infants in this study are representative of a larger group of infants.
Let’s take a step back and consider the reasoning process and analysis strategy that we have employed here. Our reasoning process has been to start by supposing that infants in general have no genuine preference between the two toys (our null model), and then ask whether the results observed by the researchers would be unlikely to have occurred just by random chance assuming this null model. We can summarize our analysis strategy as the 3 S’s.
Statistic: Calculate the value of the statistic from the observed data.
Simulation: Assume the null model is true, and simulate the random process under this model, producing data that “could have been” produced in the study if the null model were true. Calculate the value of the statistic from these “could have been” data. Then repeat this many times, generating the “what if” distribution of the values of the statistic under the null model.
Strength of evidence: Evaluate the strength of evidence against the null model by considering how extreme the observed value of the statistic is in the “what if” distribution. If the original statistic is in the tail of the “what if” distribution, then the null model is rejected as not plausible. Otherwise, the null model is considered to be plausible (but not necessarily true, because other models might also not be rejected).
In this study, our statistic is the number of the 16 infants who choose the helper toy. We assume that infants do not prefer either toy (the null model) and simulate the random selection process a large number of times under this assumption. We started out with hands-on simulations using coins, but then we moved on to using technology for speed and efficiency. We noted that our actual statistic (14 of 16 choosing the helper toy) is in the tail of the simulated “what if” distribution. Such a “tail result” indicates that the data observed by the researchers would be very surprising if the null model were true, giving us strong evidence against the null model. So instead of thinking the researchers just got that lucky that day, a more reasonable conclusion would be to reject that null model. Therefore, this study provides strong evidence to conclude that these infants really do prefer the helper toy and were not essentially flipping a coin in making their selections.
Terminology: The long-run proportion of times that an event happens when its random process is repeated indefinitely is called the probability of the event. We can approximate a probability empirically by simulating the random process a large number of times and determining the proportion of times that the event happens.
More specifically, the probability that a random process alone would produce data as extreme as (or more extreme than) the actual study is called a p-value. Our analysis above approximated this p-value by simulating the infants’ random selection process a large number of times and finding how often we obtained results as extreme as the actual data. You can obtain better and better approximations of this p-value by using more and more repetitions in your simulation.
A small p-value indicates that the observed data would be surprising to occur through the random process alone, if the null model were true. Such a result is said to be statistically significant, providing evidence against the null model (that we don’t believe the discrepancy arose just by chance but instead reflects a genuine tendency). The smaller the p-value, the stronger the evidence against the null model. There are no hard-and-fast cut-off values for gauging the smallness of a p-value, but generally speaking:
A p-value above .10 constitutes little or no evidence against the null model.
A p-value below .10 but above .05 constitutes moderately strong evidence against the null model.
A p-value below .05 but above .01 constitutes reasonably strong evidence against the null model.
A p-value below .01 constitutes very strong evidence against the null model.
Just to make sure you’re following this terminology, answer:
(r) What is the approximate p-value for the helper/hinderer study?
(s) What if the study had found that 10 of the 16 infants chose the helper toy? How would this have affected your analysis, p-value, and conclusion? [Hint: Use your earlier simulation results but explain what you are doing differently now to find the approximate p-value.] Explain why your answers make intuitive sense.
Mathematical note: You can also determine this probability (p-value) exactly using what are called binomial probabilities. The probability of obtaining k successes in a sequence of n trials, with success probability π on each trial, is: P(X = k) = C(n, k) π^k (1 − π)^(n−k), where C(n, k) = n!/[k!(n − k)!] counts the number of ways to choose which k of the n trials are successes.
(t) Use this expression to determine the exact probability of obtaining 14 or more successes (infants who choose the helper toy) in a sequence of 16 trials, under the null model that the underlying success probability on each trial is .5.
The exact p-value (to four decimal places) turns out to be .0021. We can interpret this by saying that if infants really had no preference and so were randomly choosing between the two toys, there’s only about a 0.21% chance that 14 or more of the 16 infants would have chosen the helper toy. Because this probability is quite small, the researchers’ data provide very strong evidence that infants in general really do have a preference for the nice (helper) toy.
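For readers who want to verify this sum with software, a minimal Python sketch using the standard library’s `math.comb` for the binomial coefficients:

```python
from math import comb

# Exact binomial p-value: the probability of 14 or more "helper" choices
# among 16 infants when each chooses the helper with probability 0.5.
p_value = sum(comb(16, k) for k in range(14, 17)) * 0.5 ** 16
print(round(p_value, 4))  # 0.0021
```

This agrees with the simulation-based approximation, as it should: the simulation converges to this exact value as the number of repetitions grows.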
Activity 2: Sampling Words
The following five activities (2-6) focus on issues of data collection. They also focus on how introducing randomness into the design of a study has several important benefits and on how the scope of conclusions to be drawn from a study depends on how the data were collected.
One of the most important ideas in statistics is that we can learn a lot about a large group (called a population) by studying a small piece of it (called a sample). Consider the population of 268 words in the following passage:
Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war.
We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead, who struggled here have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember, what we say here, but it can never forget what they did here.
It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us, that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion, that we here highly resolve that these dead shall not have died in vain, that this nation, under God, shall have a new birth of freedom, and that government of the people, by the people, for the people, shall not perish from the earth.
(a) Select a sample of ten representative words from this population by circling them in the passage above.
The above passage is, of course, Lincoln’s Gettysburg Address. For this activity we are considering this passage a population of words, and the 10 words you selected are considered a sample from this population. In most studies, we do not have access to the entire population and can only consider results for a sample from that population. The goal is to learn something about a very large population (e.g., all American adults, all American registered voters) by studying a sample. The key is in carefully selecting the sample so that the results in the sample are representative of the larger population (i.e., have the same characteristics).
(b) Record the word and the number of letters in each of the ten words in your sample:
(c) Do you think the ten words in your sample are representative of the lengths of the 268 words in the population? Explain briefly.
(d) Create a dotplot of your sample results (number of letters in each word). Also indicate what the observational units and variable are in this dotplot. Is the variable categorical or quantitative?
Observational units: Variable: Type:
(e) Determine the average (mean) number of letters in your ten words.
(f) Combine your sample average with the rest of the class to produce a well-labeled dotplot.
(g) Indicate what the observational units and variable are in this dotplot. [Hint: To identify what the observational units are, ask yourself what each dot on the plot represents. The answer is different from above.]
One conceptual challenge here is realizing that the observational units are no longer the individual words but rather the samples of ten words. Each dot in this plot comes from a sample of ten words, not from an individual word.
(h) The average number of letters per word in the population of all 268 words is 4.295. Mark this value on the dotplot in (f). How many students produced a sample average greater than the actual population average? What proportion of the students is this?
When the sampling method produces characteristics of the sample that systematically differ from those characteristics of the population, we say that the sampling method is biased.
(i) Would you say that this sampling method (asking people to simply circle ten representative words) is biased? If so, in which direction? Explain how you can tell this from the dotplot.
(j) Suggest some reasons why this sampling method turned out to be biased as it did.
(k) Consider a different sampling method: close your eyes and point to the page ten times in order to select the words for your sample. Would this sampling method also be biased? Explain.
Most people tend to choose larger words, perhaps because they are more interesting or convey more information. Therefore, the first sampling method is biased toward overestimating the average number of letters per word in the population. Some samples may not overestimate this population value, but samples chosen with this method tend to overestimate the population mean. Closing your eyes does not eliminate the bias, because you are still more likely to select larger words, because they take up more space on the page. In general, human judgment is not very good at selecting representative samples from populations.
(l) Would using this same sampling method but with a larger sample size (say, 20 words) eliminate the sampling bias? Explain.
(m) Suggest how you might employ a different sampling method that would be unbiased.
A simple random sample (SRS) gives every observational unit in the population the same chance of being selected. In fact, it gives every sample of size n the same chance of being selected. In this example we want every set of ten words to be equally likely to be the sample selected.
While the principle of simple random sampling is probably clear, it is by no means simple to implement. One approach is to use a computer-generated table of random digits. Such a table is constructed so that each position is equally likely to be occupied by any one of the digits 0-9, and so that the value of any one position has no impact on the value of any other position.
The first step is to obtain a sampling frame where each member of the population can be assigned a number. Here we just need to number the words in the above passage. This sampling frame appears on the next page, and a table of random digits appears on the page after that.
You will now use the table of random digits to select a simple random sample of five words from the Gettysburg Address. Do this by entering the table at any point (it does not have to be at the beginning of a line) and reading off three-digit numbers between 001 and 268. (Disregard any numbers not in this range. If you happen to get repeats, keep going until you have five different three-digit numbers. If you finish a line without obtaining five words, just continue on to the next line.) Continue until you have five numbers corresponding to words in this population.
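The same selection procedure can be mimicked with software. In this hypothetical Python sketch, `random.randint` plays the role of each three-digit read from the table:

```python
import random

random.seed(2024)  # reproducible stand-in for the random digit table

# Read three-digit numbers until five distinct word IDs between 001 and 268
# have been found, skipping out-of-range values and repeats as described.
selected = []
while len(selected) < 5:
    number = random.randint(0, 999)  # one three-digit read: 000-999
    if 1 <= number <= 268 and number not in selected:
        selected.append(number)

print(selected)  # five distinct IDs, each between 1 and 268
```

The resulting ID numbers would then be looked up in the sampling frame to find the corresponding words.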
(n) Record the ID numbers that you selected, the corresponding words, and the lengths of the words:
(o) Determine the average length in your sample of five words.
(p) Again combine your sample mean with those of your classmates to produce a dotplot below. Be sure to label the horizontal axis appropriately.
(q) Comment on how the distribution of sample averages from these random samples compares to that from your “circle ten words” samples.
(r) Do the sample averages from the random samples tend to over- or under-estimate the population average, or are they roughly split evenly on both sides?
To really examine the long-term patterns of this sampling method, we will use technology to take many, many samples.
From the webpage http://www.rossmanchance.com/applets/, select the “Sampling Words” applet. The top right panels show you the population distributions (including the proportion of long words and the proportion of nouns) and tell you the average number of letters per word in the population, the population proportion of “long words,” and the population proportion of nouns. Uncheck the boxes next to “Show Long” and “Show Noun” so we can continue to focus on the lengths of words for now.
Specify 5 as the sample size and click Draw Samples. Note the lengths of the words and the average for the sample of 5 words. Then click Draw Samples again. Then change the number of samples (Num samples) from 1 to 98. Click the Draw Samples button. The applet now takes 98 more simple random samples from the population (for a total of 100 so far) and adds the sample averages to the graph in the lower right panel. The red arrow indicates the average of the 100 sample averages.
(s) What does this dotplot reveal?
(t) Now change the sample size from 5 to 10. Click off the Animate button and click on Draw Samples. Does the sampling method still appear to be unbiased? What has changed about the type of sample averages that we obtain? Why does this make sense?
Once we have a representative sampling method, we can improve the precision by increasing the sample size. With larger random samples, the results will tend to fall even closer to the population results.
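The applet’s behavior can be sketched in a few lines of Python. This sketch uses only the opening sentence of the passage as a stand-in population (the full 268 words would be handled identically), draws many simple random samples of sizes 5 and 10, and compares the two distributions of sample means:

```python
import random
import statistics

random.seed(1)

# Stand-in population: word lengths from the opening sentence only.
text = ("Four score and seven years ago our fathers brought forth upon this "
        "continent a new nation conceived in liberty and dedicated to the "
        "proposition that all men are created equal")
lengths = [len(word) for word in text.split()]
pop_mean = statistics.mean(lengths)

def sample_means(n, reps=5000):
    # Means of many simple random samples of size n (without replacement).
    return [statistics.mean(random.sample(lengths, n)) for _ in range(reps)]

means5, means10 = sample_means(5), sample_means(10)

# Unbiased: both sampling distributions center near the population mean.
print(round(pop_mean, 2), round(statistics.mean(means5), 2),
      round(statistics.mean(means10), 2))

# More precise: the larger samples vary less around the population mean.
print(round(statistics.stdev(means5), 2), round(statistics.stdev(means10), 2))
```

The first line of output illustrates unbiasedness; the second illustrates the gain in precision from a larger sample size.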
Three caveats about random sampling are in order:
First, one still gets the occasional “unlucky” sample whose results are not close to the population, even with large sample sizes.
Second, the sample size means little if the sampling method is not random. In 1936 the Literary Digest magazine had a huge sample of 2.4 million people, yet their predictions for the Presidential election did not come close to the truth about the population.
Third, while the role of sample size is crucial in assessing how close the sample results will be to the population results, the size of the population does not affect this. As long as the population is large relative to the sample size (at least 10 times as large), the precision of a sample statistic depends on the sample size but not on the population size! (You can explore this a bit in the applet by using the “address” pull-down menu to select “four addresses.” This makes the population four times as large, but if you conduct the simulation again you should find a very similar sampling distribution.)
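This third caveat can also be checked in code. The sketch below uses a small hypothetical population of word lengths (invented values, not the actual passage) and compares the spread of the sampling distribution before and after quadrupling the population, mimicking the applet’s “four addresses” option:

```python
import random
import statistics

random.seed(3)

# Hypothetical population of word lengths (invented for illustration);
# replicating it four times mimics the "four addresses" option.
population = [2, 3, 3, 4, 4, 5, 5, 6, 7, 9] * 27   # 270 units
quadrupled = population * 4                         # 1080 units

def spread(pop, n=10, reps=4000):
    # Standard deviation of sample means across many simple random samples.
    return statistics.stdev(
        statistics.mean(random.sample(pop, n)) for _ in range(reps))

sd_small, sd_big = spread(population), spread(quadrupled)
print(round(sd_small, 3), round(sd_big, 3))  # nearly identical spreads
```

Because both populations are large relative to the sample size, the two sampling distributions have essentially the same spread.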
Activity 3: Night Lights and Near-Sightedness
Near-sightedness typically develops during the childhood years. Recent studies have explored whether there is an association between the development of myopia and the use of night lights with infants. Quinn, Shin, Maguire, and Stone (1999) examined the type of light children aged 2-16 were exposed to. Between January and June 1998, the parents of 479 children who were seen as outpatients in a university pediatric ophthalmology clinic completed a questionnaire. One of the questions asked under which lighting condition the child slept at night before the age of 2 years; the parents chose between “room lighting,” “a night light,” and “darkness.” Based on the child’s most recent eye examination, the children were classified into three groups: near-sighted, normal refraction, or far-sighted.
(a) Identify the observational units and the two variables in this study. For each variable, specify whether it is quantitative or categorical.
Observational units = children
Variable 1 = lighting condition (categorical)
Variable 2 = refraction (categorical)
(b) Which variable is being considered the explanatory variable and which is being considered the response variable?
(c) Is this an observational study or an experiment? Explain how you can tell.
The following table and graph display the sample data:
(d) What does the bar graph reveal about whether myopia increases with higher levels of light exposure? Explain.
When another variable has a potential influence on the response, but its effects cannot be separated from those of the explanatory variable, we say the two variables are confounded. When we classify subjects into different groups based on existing conditions (i.e., in an observational study), there is always the possibility that there are other differences between the groups apart from the explanatory variable that we are focusing on. Therefore, we cannot draw cause/effect conclusions between the explanatory and response variables from an observational study.
(e) Is it valid to conclude that sleeping in a lit room, or with a night light, causes an increase in a child’s risk of near-sightedness? If so, explain why. If not, identify a confounding variable that offers an alternative explanation for the observed association between the variables revealed by the table, and explain why it is confounding. (Be sure to indicate how the confounding variable is related to both the explanatory and response variable. Keep in mind that the association revealed in the table and graph is real; we are just saying there could be an alternative explanation besides cause-and-effect.)
Activity 4: Have a Nice Trip
Researchers wanted to study whether individuals could be taught techniques that would help them more reliably recover from a loss of balance (e.g., www.uic.edu/ahs/biomechanics/videos/edited_lowering.avi).
(a) Suppose you had 12 subjects to participate in an experiment to compare the “elevating” strategy to the “lowering” strategy. How would you design the study?
(b) Consider the Randomizing Subjects applet at http://www.rossmanchance.com/applets/
- Explore the distribution of the difference in the proportion of males in the two treatment groups under random assignment. What is the most common outcome? Is this what you would expect?
- Explore the distribution of the difference in the mean heights in the two treatment groups under random assignment. What is the mean? How often is there more than a 2-inch difference in the mean heights between the two groups?
- Explore the distribution of the group differences on the hidden gene factor and the hidden x-factor. Are the groups usually balanced?
The goal of random assignment is to create groups that can be considered equivalent on all lurking variables. If we believe the groups are equivalent prior to the start of the study, this allows us to eliminate all potential confounding variables as a plausible explanation for any significant differences in the response variable after the treatments are imposed.
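This balancing act is easy to simulate. The sketch below invents 12 hypothetical subjects (the sexes and heights are made up for illustration, not taken from the applet), repeatedly splits them at random into two groups of six, and records the difference in mean heights:

```python
import random
import statistics

random.seed(7)

# Hypothetical subjects: (sex, height in inches), invented for illustration.
subjects = [("M", 70), ("M", 72), ("M", 68), ("M", 69), ("M", 71), ("M", 73),
            ("F", 64), ("F", 66), ("F", 63), ("F", 65), ("F", 62), ("F", 67)]

diffs = []
for _ in range(5000):
    random.shuffle(subjects)                    # one random assignment
    group1, group2 = subjects[:6], subjects[6:]
    mean1 = statistics.mean(h for _, h in group1)
    mean2 = statistics.mean(h for _, h in group2)
    diffs.append(mean1 - mean2)

# Random assignment balances the groups on average: the distribution of
# differences centers at 0, though any single assignment may be lopsided.
print(round(statistics.mean(diffs), 2))
```

Note that balance holds on average and approximately in any one assignment; occasional lopsided splits still occur, which is why the applet shows a distribution of differences rather than a single value.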
Activity 5: Cursive Writing
An article about handwriting appeared in the October 11, 2006 issue of the Washington Post. The article mentioned that among students who took the essay portion of the SAT exam in 2005-06, those who wrote in cursive style scored significantly higher on the essay, on average, than students who used printed block letters.
(a) Identify the explanatory and response variables in this study. Classify each as categorical or quantitative.
(b) Is this an observational study or an experiment? Explain briefly.
(c) Would you conclude from this study that using cursive style causes students to score better on the essay? If so, explain why. If not, identify a potential confounding variable, and explain how it provides an alternative explanation for why the cursive writing group would have a significantly higher average essay score.
The article also mentioned a different study in which the same exact essay was given to many graders. But some graders were shown a cursive version of the essay and the other graders were shown a version with printed block letters. Researchers randomly decided which version the grader would receive. The average score assigned to the essay with the cursive style was significantly higher than the average score assigned to the essay with the printed block letters.
(d) What conclusion would you draw from this second study? Be clear about how this conclusion would differ from that of the first study, and why that conclusion is justified.
Scope of conclusions permitted depending on study design (adapted from Ramsey and Schafer’s The Statistical Sleuth):
- Random sampling, with random assignment: a random sample is selected from one population; units are then randomly assigned to different treatment groups. Both inferences to the population and cause-and-effect conclusions can be drawn.
- Random sampling, no random assignment: random samples are selected from existing distinct populations. Inferences to populations can be drawn, but not cause-and-effect conclusions.
- Not random sampling, with random assignment: a group of study units is found; units are then randomly assigned to treatment groups. Cause-and-effect conclusions can be drawn, but only for units like those studied.
- Not random sampling, no random assignment: collections of available units from distinct groups are examined. Neither inferences to populations nor cause-and-effect conclusions are justified.
Activity 6: Memorizing Letters
You will be asked to memorize as many letters as you can in 20 seconds, in order, from a sequence of 30 letters.
(a) Identify the explanatory variable and the response variable.
(b) What kind of study is this (observational or experimental)? Explain how you know.
(c) Did this study make use of comparison? Why is this important?
(d) Did this study make use of random assignment? Why is that important?
(e) Did this study make use of blindness? Why is that important?
(f) Did this study make use of random sampling? Why is this important?
The above can be used as a very quick in-class data collection assignment. The data can then be utilized at multiple points in the course.
Activity 7: Matching Variables to Graphs
The following five activities (7-11) focus on issues in describing data. Students need practice in considering how variables behave, as well as what graphs do and do not reveal.
Match the following variables with the histograms and bar graphs given below. Hint: Think about how each variable should behave.