Rutgers University School of Communication and Information
www.davidkarpf.com

Submitted for consideration to the special issue of Information, Communication, and Society

Abstract
This paper discusses three interrelated challenges related to conducting social science research in “Internet Time.” (1) The rate at which the Internet is both diffusing through society and developing new capacities is unprecedented. It creates some novel challenges for scholarly research. (2) Many of our most robust research methods are based upon ceteris paribus assumptions that do not hold in the online environment. The rate of change online narrows the range of questions that can be answered using traditional tools. Meanwhile, (3) new research methods are untested and often rely upon data sources that are incomplete and systematically flawed. The paper details these challenges, then proposes that scholars embrace the values of transparency and kludginess in order to answer important research questions in a rapidly-changing communications environment.
Keywords: research methods, web 2.0, Internet Time, Internet research
Introduction

The internet has dramatically changed over the past decade. Consider: in 2002, streaming video was rare, short, and choppy. Wireless “hotspots” were a novelty. Mobile phones were primarily used for (gasp!) phone calls. Commercial GPS applications were still in the early stages of development. Bloggers could be counted by the handful. Social networking sites like Friendster, Myspace, and Facebook were still confined to Bay Area networks and technologists’ imaginations (boyd and Ellison 2007).1 A “tumbler” was a type of drinking glass; a “tweet” was a type of bird call. Simply put, the internet of 2012 is different from the internet of 2002. What’s more, there is little reason to suppose this rapid evolution is finished: the internet of 2022 will likely be different from the internet of 2012.
In a 2001 article, Barry Wellman playfully suggested that “an internet year is like a dog year, changing approximately seven times faster than normal human time” (Wellman 2001). Alongside Wellman’s claim, technologists routinely make reference to “Moore’s Law,” the 1965 prediction by Intel co-founder Gordon Moore that transistor capacity would grow exponentially, doubling once every 18-to-24 months (Moore 1965). Moore’s Law has proven surprisingly (albeit approximately) resilient over the past 45 years, and has become synonymous in popular technology literature with the rise of abundant and cheap computational capacity. It is an oversimplified concept, usually associated with naïve strains of techno-optimism, which tends to overlook the multi-layered “hourglass architecture” associated with the internet (Zittrain 2008, p. 67-71).
But if technology writers have relied too much on Moore’s Law as a concept, social scientists have all but ignored it. Tour through the indexes of the leading political science and sociology texts and you will find nary a mention of Moore’s Law, or of the continuing evolution of the medium (examples from my bookshelves and syllabi include Hindman 2008; Bimber 2003; Bimber and Davis 2003; Davis 2009; Perlmutter 2007; Howard 2006; Sunstein 2001; Sunstein 2006; Sunstein 2007; Kerbel 2009). “Internet Time” is a subject grudgingly acknowledged in our research designs, rather than incorporated within them. Members of the interdisciplinary research community are aware that the suite of technologies collectively referred to as “the internet” continues to develop. But that awareness rarely becomes a feature of our research methods.
This feature of “internet time” creates a substantial hurdle. The internet is unique among Information and Communications Technologies (ICTs) specifically because the internet of 2002 has important differences from the internet of 2005, or 2009, or 2012. It is a suite of overlapping, interrelated technologies. The medium is simultaneously undergoing a social diffusion process and an ongoing series of code-based modifications. Social diffusion brings in new actors with diverse interests. Code-based modifications alter the technological affordances of the media environment itself (Lessig 1999; Goldsmith and Wu 2006). What was costly and difficult in 2004 is cheap and ubiquitous in 2008. That leads, in turn, to different practices.2 The internet’s effect on media, social, and political institutions will be different at time X than at time X+1, because the suite of technologies we think of as the internet will itself change within that interval.
Standard practices within the social sciences are not well suited to such a rapidly changing medium. Traditionally, major research endeavors move at a glacial pace. Between grant applications, data collection, writing, peer review, and publication, it is not uncommon for a research project to consume five or more years between conception and publication. Several books and articles have been published in 2011 on the basis of 2006 data. Scholars openly acknowledge the limitations imposed by the system. It is a common topic for commiseration at conference hotel bars. In the time it takes to formulate, fund, conduct, revise, and publish a significant research question, we all are left to worry that changes in the media environment will render our work obsolete.
The last two stages of the process – peer-review and publication – have become topics of robust conversation. There has been ample innovation already, with online journals, automated components of peer-review, and academic blogging being three obvious examples. The conversation surely must continue, but I for one am satisfied with how it has begun. When it comes to the “tools of the trade,” collective introspection among social scientists has been far less forthcoming. Everyone can agree that it would be nice to see publication speeds improve. Exploring the limitations of panel and survey data, when applied during the early lifecycle of a changing media environment, is a steeper climb.
This paper is an effort to broach that methodological conversation. I offer a methodological prescription for conducting social science research that takes “Internet Time” seriously. It is rooted in methodological pluralism – the belief that social science researchers ought to be question-driven, letting the nature of their inquiry determine their methods, rather than the other way around (Smith 1997). To be clear, I do not argue that traditional methods have been rendered obsolete by the internet. That is too strong a claim, and there are several counter-examples readily at hand. Surveys of mass behavior have led to key findings regarding political participation (Schlozman, Verba, and Brady 2010) and the digital divide (Hargittai 2010), to name only two. Rather, I argue that an expanding range of interesting and important questions cannot be answered using the most refined tools in our toolbox. The new media environment demands new techniques. Those techniques carry risks – they haven’t undergone the years of seasoning and sophistication that dominant methods have. But they also carry the promise of expanding the scope of our inquiry and applying intellectual rigor to topics of broad social significance.
This methodological prescription is based heavily upon my own experiences dealing with novel topic areas, facing the hurdles imposed by messy data, and creating novel datasets for assessing the questions that interest me most. Different branches of the internet research community define their research questions and appropriate designs differently. I offer this paper as a political scientist who has migrated into the interdisciplinary field of internet research. Researchers with alternate academic lineages will surely define a few key concepts differently (what constitutes a dominant method, or a relevant theory, or an acceptable form of data?). Yet I suspect the challenges and solutions I present will be broadly familiar to all readers.
The argument proceeds in the form of four brief essays. I try my very best to be provocative throughout. The first essay highlights changes in the online environment itself. Key areas of inquiry – from political blogs to websites and mobile applications – are in a state of maturation. In the course of a few years, Twitter has moved from the lead-adopter stage to the late-majority stage of diffusion. The social practices and technical architecture of social networking sites like Facebook have substantively changed as they have moved through the diffusion process. The boundaries that once clearly defined the blogosphere have become irredeemably fuzzy – today, I will argue, there is no such thing as “the blogosphere.” The rate at which these sociotechnical systems evolve is something new for social scientists to deal with.
The second essay turns to Internet Time and social science research. Our most robust research techniques are based upon ceteris paribus assumptions that are routinely violated in this fast-developing media environment. Online behavior at Time X only predicts online behavior at Time X+1 if (1) the underlying population from which we are sampling remains the same and (2) the medium itself remains the same. A changing media environment, under early adopter conditions, violates both (1) and (2). Ceteris is not paribus. All else cannot be assumed equal. One obvious consequence is that research findings are rendered obsolete by the time they have been published. The terrain we can explore with traditional social science techniques has narrowed in scope.
The third essay focuses upon endemic problems associated with online data quality. The online environment is lauded for its abundant data. But anyone who has worked closely with this data is well aware of its limitations. Spambots, commercial incentives, proprietary firewalls, and noisy indicators all create serious challenges for the researcher. We can rarely be sure whether our findings are an artifact of a flawed dataset, particularly when working with public data. Online data abundance offers the promise of computational social science, but only to the extent that we can bridge the data divide.
The fourth and final essay offers some hopeful examples of how social science research can be augmented by the dynamic, changing nature of the online environment. By treating Moore’s Law and Internet Time as a feature of our research designs, rather than a footnote to be acknowledged and dismissed, well-theorized, rigorously-designed studies can gain a foothold on an expanded array of research questions. Doing so requires some new habits among the research community, however – or at least a renewed commitment to some old habits that have been lost. I argue for embracing the messiness, promoting design transparency, and supporting “kludginess,” or the use of hacks and workarounds that allow for messy-but-productive solutions where more elegant choices would prove unworkable.
Internet Time and Sociotechnical Systems, or, There Is No Such Thing as the Blogosphere

My own perspective on “Internet Time” has been driven by longtime observation of the political blogosphere. Writing in 2003, Clay Shirky offered the following observation: “At some point (probably one we’ve already passed), weblog technology will be seen as a platform for so many forms of publishing, filtering, aggregation, and syndication that blogging will stop referring to any particularly coherent activity” (Shirky 2003). Notice that Shirky is still referring to blogs as “weblog technology.” In 2003, blogging was still new enough that writers had to explain that blog is short for weblog.
Today, Shirky’s statement appears prescient.
There used to be a single “blogosphere.” A few lead adopters had taken up the blogger.com software produced by Pyra Labs, or had latched on to similar platforms. They shared a common style. They relied on common software architectures, and those architectures offered the same technological affordances. Bloggers in 2003 were mostly pseudonymous. They engaged in an opinionated “citizen journalism.” They made use of in-text hyperlinks and passive (blogroll) hyperlinks, networking together with peer blogs. They offered comment threads, allowing for greater interactivity than traditional online journalism. They were, by and large, critics of existing institutions of power – be they journalistic, political, or commercial. A “blogger” was, simply, “one who blogged.” And that was a small-but-growing segment of the population.3
The blogosphere grew. In so doing, it underwent a common pattern of simultaneous adoption and adaptation. We see this pattern repeated in other online channels as well. It is as predictable as the tides. First, consider adoption. As the “blogosphere” grew from under a thousand blogs to over a million, the new adopters used it toward different ends. The uses and gratifications driving a lead adopter are not the same as those driving a late adopter. The one is experimenting, often for the sheer enjoyment of experimenting with new media. The other is applying a well-known technology toward existing goals. Different users, with different motivations and interests, applied the technology of online self-publication to different ends.
This adoption pattern holds true for all ICTs. It is not unique to the suite of technologies that make up the internet. Early adopters of the radio, the telegraph, and the automobile all made use of those technologies in different ways than the early- and late-majority adopters. Robust literatures in science and technology studies, history of technology, and diffusion-of-innovation studies all provide valuable insights into the development of communications media, as well as the formative role that government policy can play in their development (Mueller 2004; Mueller 2010; Gillespie 2009; John 2010; Johns 2010; Wu 2010). The sheer speed of adoption, and the numerous technologies that are all considered to be “the Internet” or “the Web,” have hastened confusion on this matter, however. We still tend too easily to assume that the interests and motivations of early “bloggers” will be replicated among later “bloggers,” even though we know that this population is far from static.
The adaptation pattern is even more pernicious. As the blogosphere grew, the underlying software platform underwent key modifications. The community blogging platform offered by Scoop, for instance, added user diaries to certain blogs, allowing them to operate as gathering spaces for online communities-of-interest. Sites like DailyKos developed interest group-like qualities, using the blog to engage not in modified citizen journalism, but in modified citizen activism (Karpf 2008b; Benkler and Shaw 2010). DailyKos endorses candidates, engages in issue mobilization, and even holds an annual in-person convention. It has more in common with traditional advocacy groups than it does with the solo-author political blogs of 2003. Likewise, existing institutions of authority added blogging features to their web offerings. The New York Times now hosts blogs. So does the Sierra Club. Unsurprisingly, the blogging activity on these sites advances their existing goals. A blog post at NYTimes.com has more in common with an article on NYTimes.com than it does with a blog post at mywackythoughts.blogspot.com. Counting Krugman.blogs.nytimes.com, dailykos.com/blog/sierradave, and (more problematically) Huffingtonpost.com as part of a single, overarching “blogosphere” is clearly problematic. They are not members of a singular underlying population any longer. The inclusion of new software code means that there is no longer a categorical distinction to be drawn between “bloggers” and other web readers. To rephrase Shirky, at this point, blogging has stopped referring to any particularly coherent activity. Blogging is simply writing content online.
Today, it is no longer analytically useful to conduct research on “The Blogosphere.” I say this as the proprietor of an open data resource called the Blogosphere Authority Index (BAI). Maintained since 2008, the index offers a ranked tracking system for researchers interested in elite political blogs in America. The BAI tracks two blog clusters, and its methodology can be used for ranking other blog clusters (Karpf 2008a). But it does not actually track or measure the entire blogosphere, because there is no such thing anymore. Speaking of the blogosphere as though it is a single, overarching component of the world wide web only encourages faulty generalizations – leading researchers and public observers alike toward inaccurate claims about the quality of “citizen journalism” and the trustworthiness, goals, and effectiveness of “bloggers.” All of these generalizations misinform rather than enlighten. They create fractured dialogues and regrettable cul-de-sacs in the research literature. “Blogging,” today, is a boundary object, with different meanings and implications among different research communities.
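The general logic of a cluster-ranking system like the BAI can be sketched as a rank aggregation: each site receives several component rankings, which are averaged into a composite. The sketch below is purely schematic – the blog names, component metrics, and scores are invented placeholders, not actual BAI data or the BAI’s exact measures (see Karpf 2008a for those).

```python
# Schematic rank aggregation in the style of a blog authority index.
# Each blog gets a rank (1 = best) on several component metrics;
# the composite is the mean of its component ranks.
component_ranks = {
    #          (links, traffic, activity) -- hypothetical component ranks
    "blog_a": (1, 2, 3),
    "blog_b": (2, 1, 1),
    "blog_c": (3, 3, 2),
}

composite = {
    blog: sum(ranks) / len(ranks) for blog, ranks in component_ranks.items()
}

# Lower composite score = higher aggregate authority.
ordered = sorted(composite, key=composite.get)
print(ordered)
```

Averaging ranks (rather than raw scores) keeps incommensurable metrics, such as hyperlink counts and traffic estimates, on a common scale.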
From a methodological perspective, it is of little specific importance that there is no longer any such thing as the blogosphere. Internet-related social science research covers a much larger terrain than the blogosphere. What is important is that this was a predictable sequence of events. As the technology diffuses, it attracts different adopters with different interests. Many of these adopters will come from existing institutions of authority, creating a process of “political normalization” (Margolis and Resnick 2000). Also as the technology diffuses, it changes. Blog software today includes key modifications and variations on the basic 1999 Pyra Labs platform. The software modifications provide variant technological affordances (Kelty 2008; Reagle 2010). The same is true for social networking, peer-to-peer file sharing, micro-blogging, and online video creation. It is a feature that makes Internet innovations a bit trickier than previous periods of ICT innovation. A television set in 1940 was largely the same as a television set of 1950. The intersection of new adopters, government regulators, and commercial interests drove the development of television, but the ICT itself remained basically the same thing until the development of cable in the 1970s and 1980s (as such, cable television is often treated as a separate innovation). Blogging and other developments at the social layer of the Internet diffuse very fast, and also acquire key code-based adaptations along the way.
This makes for a messy research landscape. The most robust traditional techniques available to social scientists – sophisticated quantitative methods that isolate and explore causal claims – were not designed with such a landscape in mind. As such, a few common problems routinely crop up.
Internet Time and Social Science Research

Nearly every US election since 1996 has been labeled “the Year of the Internet.” Important milestones have indeed been reached in each of these elections: 1996 marked the first campaign website, followed by Jesse Ventura’s internet-supported 1998 victory in the Minnesota Governor’s race, John McCain’s online fundraising in the 2000 Presidential Primary, Howard Dean’s landmark 2004 primary campaign, the netroots fundraising and Senator George Allen’s YouTube “Macaca Moment” in 2006, and Barack Obama’s historic 2008 campaign mobilization (Foot and Schneider 2006; Lentz 2002; Trippi 2005; Kreiss Forthcoming; Karpf 2010; Bimber and Davis 2003; Cornfield 2004). Were claims that 2000 was the “Year of the Internet” premature? Were claims that 2008 was the “Year of the Internet” lacking in historical nuance? I would suggest an alternate path: both were accurate, but the internet itself changed in the interim. The internet of 2008 is different from the internet of 1996, 2000, or 2004, and this is a recurrent, ongoing pattern.
Consider the following trivia question: “What was John Kerry’s YouTube strategy in the 2004 election?”
YouTube is a major component of the internet today. The video-sharing site is the 3rd most popular destination on the internet, as recorded by Alexa.com. Political campaigns now develop special “web advertisements” with no intention of buying airtime on television, simply placing the ads on YouTube in the hopes of attracting commentary from the blogosphere and resultant media coverage (Wallsten 2010; Barzilai-Nahon et al 2011). The medium is viewed as so influential that an entire political science conference and a special issue of the Journal of Information Technology and Politics were devoted to “YouTube and the 2008 election.” Yet no social scientist has ever looked at John Kerry’s use of the site in the prior election cycle. The absence should be utterly baffling (if it weren’t so easy to explain). How could we focus so much attention on YouTube in 2006 and 2008 while ignoring it completely in earlier cycles?
The answer, of course, is that John Kerry had no YouTube strategy. YouTube, founded in 2005, did not exist yet. The internet of the 1990s and early 2000s featured smaller bandwidth, slower upload times, and less-abundant storage. The technical conditions necessary for YouTube to exist were not present until approximately 2005. To the extent that the sociotechnical practice of video-sharing, and the capacity of individuals to record, remix, and react to video content without relying on traditional broadcast organizations, impacts American politics, it is an impact that makes the internet of 2004 different from the internet of 2008.
Social science observational techniques were not developed with such a rapidly-changing information environment in mind. Bruce Bimber and Richard Davis earned the McGannon Center Communication Policy Research Award for Campaigning Online, a rigorously detailed study of the internet in the 2000 election. After clearly demonstrating that political websites were predominantly visited by existing supporters, they concluded that the new medium would prove relatively ineffective for persuading undecided voters. As such, they came to the conclusion that the internet would have a relatively minimal impact on American politics, offering only “reinforcement” of existing beliefs, rather than persuasion. Their book is a standard feature of the internet politics canon today.
Bimber and Davis’s finding about candidate websites remains nominally accurate today. In 2012, we can safely believe that most of the visitors to the Republican candidate’s website will be existing supporters. Low-information, undecided voters (by definition) aren’t seeking out such political information. But the implications of their findings are overshadowed by further development in the online landscape. Candidate website engagement is no longer synonymous with online political activity.
As it happens, Bimber and Davis’s book was published in November 2003, just as the Internet-infused Howard Dean primary campaign had reached its zenith. The Dean campaign was using the internet to mobilize supporters with overwhelming effectiveness, drawing large crowds and setting online donation records. To this day, the Dean campaign is synonymous with the birth of internet campaigning; the contrast between scholarly wisdom and current events could not have been much more stark. The “reinforcement” of 2000 had morphed into resource mobilization. Yet it would be patently absurd to criticize Bimber and Davis for not foreseeing the Dean phenomenon. The internet of 2004 had features not present in the internet of 2000 (such as large-scale adoption of online credit card payments4). Those features leveraged different social practices. A network of political actors was simultaneously learning to use the new media environment and also helping to change the boundaries of the environment itself. New software code and participatory sites, supported by increasingly cheap bits and bytes, afforded new practices with alternate results.
Zysman and Newman (2007) have helpfully described the internet as a “sequence of revolutions.” While I try to avoid the loaded term “revolution” in my own work, it has a certain utility in this setting. The medium keeps changing, at the pace of Internet Time. The internet is in a state of ongoing transformation. As a result, academic pronouncements about the internet’s impact on politics in any given year face a limited “shelf life.” As the medium changes, the uses it supports change as well. Ceteris paribus is violated at the outset – we can assume from the start that all other things will not be equal.
This feature of Internet Time is particularly problematic for some of our most robust, traditional research methods, which seek to infer the political behavior of an entire population from a sample. Randomly sampling the US population in 2000 is of limited use in determining online political behavior in 2004 or 2012. Changes in the media environment produce new technological affordances. Social actors then leverage these affordances to alternate ends (Earl and Kimport 2011). Meanwhile, the “gold standard” in behavioral research is the biennial American National Election Study, with questions that can be traced back 60 years. The power of such time-series studies is diluted when sociotechnical systems are undergoing rapid, ongoing change. We can look for statistically significant findings in the large datasets, but only while embracing ceteris paribus assumptions that are clearly unsupportable.
This is not a call for abandoning survey instruments and behavioral studies. I am suggesting that we broaden our methodological toolboxes, not discard tools. And indeed, behavioral studies from sources like the Pew Internet and American Life Project have been a tremendous resource for the internet research community. But Pew is an exceptional example – a large, well-funded research shop that continually modifies its poll questions in light of changes in the new media environment. For university-based researchers attempting to construct multi-year panels or comparable survey questions while navigating Institutional Review Boards, grant timelines, and peer-reviewed publication cycles, it becomes increasingly likely that our best methods will yield research that is systematically behind-the-times. Large-N quantitative studies are, as a rule, slow-moving affairs. If we limit ourselves to the set of research questions that can be profitably explored through our most robust techniques, we substantially narrow the research landscape. We make it more likely that studies of the blogosphere or online campaigning will accurately explain a phenomenon only after it has substantially changed in an unexpected manner.
The internet has also provided a wealth of new data, however. The digital traces of political behavior are now evident wherever we look, and these have spurred new efforts to collect and analyze novel datasets. Some of these new methods – in particular the trend toward “computational social science,” which combines social science and computer science to leverage large datasets in near-realtime – hold great promise (Lazer et al 2009). Massive datasets, rapidly compiled, appear at first blush to offer a solution to the problems described above. But they also face the lurking threat of GIGO (Garbage In, Garbage Out). Publicly available online data often has deep, systematic flaws that too often go ignored.
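A minimal sketch of the GIGO problem: raw counts scraped from public pages can silently include spambot activity, so the same dataset yields different answers depending on whether anyone cleaned it. The comment records and spam patterns below are invented for illustration; real spam detection is far harder, which is precisely why undisclosed cleaning decisions matter.

```python
import re

# Hypothetical raw comments scraped from a public blog thread.
# Authors, texts, and spam patterns are invented for illustration.
comments = [
    {"author": "policy_wonk", "text": "The turnout numbers look off to me."},
    {"author": "cheap-pills-now", "text": "Buy discount meds at http://spam.example"},
    {"author": "jane_d", "text": "Great point about the primary schedule."},
    {"author": "cheap-pills-now", "text": "Buy discount meds at http://spam.example"},
]

SPAM_PATTERNS = [
    re.compile(p, re.I) for p in [r"buy .* (meds|pills)", r"http://spam\."]
]

def is_spam(comment):
    """Flag a comment if it matches any known spam pattern."""
    return any(p.search(comment["text"]) for p in SPAM_PATTERNS)

raw_count = len(comments)
clean = [c for c in comments if not is_spam(c)]

print(f"raw comment count:    {raw_count}")
print(f"after spam filtering: {len(clean)}")
```

A study reporting "participation" from the raw count would double the figure produced by the filtered count – and a reader can only judge which is right if the filtering rules are disclosed.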