What’s at stake
“The sexiest job in the next 10 years will be statistician,” says Hal Varian, “and I’m not kidding.” With increased connectivity, Internet usage and data availability, a new world of statistical analysis has indeed opened up. Whether the data are very high frequency or simply vast in volume (big data), there is a growing appetite for forecasting (or nowcasting) social and economic behaviors live. The possibilities are nearly infinite and the field is truly in its infancy, but a few firms and research groups are paving the way for work that could be extremely relevant to investors, policymakers and researchers alike.
Making policy in the dark light
In an article for the New Yorker, James Surowiecki explains how essential this kind of granular data is to the public as well as to economic policymakers. In the early years of the Great Depression, the government had few good figures on economic activity or unemployment to go on. As a result, policymakers persistently underestimated the severity of the crisis. In June of 1930, relying on some anecdotal evidence of an upturn, Herbert Hoover even announced that “the Depression was over”. Today, our picture of the economy is more detailed and sophisticated than ever, and that makes it easier for businesses and the government to react quickly to changes in the economy. And yet our picture of what’s going on is far from perfect. The government continues to track inflation, for instance, by gathering price data much as it did in the nineteen-fifties: it surveys consumers by phone to see where they buy, surveys businesses to see how much they charge, and checks out shopping malls to price goods.
Harvard professor Gary King describes the contours and implications of the “Social Science Data Revolution”, characterized by the transition from a relatively small volume of analog data collected through surveys and official channels to massive amounts of digital data flowing from various streams (mobile-phone logs, E-ZPass transponders, IP addresses, real estate purchases, credit card transactions, electronic medical records, online news and searches, satellite imagery, and “social everything”: blogs, comments, networking sites, etc.).
In a recent paper, Dirk Helbing and Stefano Balietti discussed the power of massive mining of high-frequency socio-economic data (or “reality mining”) and its application to forecasting socio-economic crises with the help of new analytical tools, including pattern recognition algorithms and machine learning approaches for studying complex social systems.
McKinsey Global Institute, a shop recently under heavy criticism to say the least (see here or here), also published an interesting report last month that examines the state of digital data. The use of data has long been part of the impact of information and communication technology, but the scale and scope of the changes that big data is bringing about are today at an inflection point as a number of trends converge.
Measuring inflation on a daily basis
Alberto Cavallo, the founder of InflacionVerdadera, a website that provides daily inflation statistics for Argentina, and Roberto Rigobon from the MIT Sloan School have founded the Billion Prices Project at MIT. It is a far-reaching academic initiative that uses prices collected daily from hundreds of online retailers to conduct economic research. This high-frequency, item-level data is an extremely powerful tool for studying pricing behavior, inflation, asset prices, and pass-through. In the U.S., it collects more than half a million prices daily, five times the number that the government looks at. After Lehman Brothers went under, in September 2008, the project’s data showed that businesses started cutting prices almost immediately, which suggested that demand had collapsed, while the government’s numbers only started to show this deflationary pressure in November.
Back in October 2010, Hal Varian presented the Google Price Index (GPI) at a business economists’ conference. Varian emphasized that the GPI is not a direct replacement for the CPI because the mix of goods sold on the web differs from the mix in the wider economy. Housing accounts for about 40 percent of the US CPI, for example, but only 18 percent of the GPI. The GPI shows a “pretty good correlation” with the CPI for goods such as cameras and watches that are often sold on the web, but less so for others, such as car parts, that are infrequently traded online.
High frequency data for economic analysis
Hyunyoung Choi and Hal Varian argue in a 2009 paper that, by pooling searches in categories, Google Trends data can help predict initial claims for unemployment benefits in the United States. Askitas and Zimmermann (2009), Tanya Suhoy (2009) and Francesco D’Amuri and Juri Marcucci have examined similar unemployment data for Germany, Israel, and Italy respectively, and also found significant improvements in forecasting accuracy by using Google Trends.
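To make the idea of “improving forecasting accuracy” concrete, here is a minimal sketch of the kind of comparison these papers run: a simple autoregressive baseline against the same model augmented with a contemporaneous search index, both scored on one-step-ahead out-of-sample errors. Everything in it is an assumption for illustration: the data are simulated (the search index is informative by construction), and the function names simulate_claims and out_of_sample_mae are hypothetical. It shows the mechanics of the evaluation, not the authors’ actual specification or results.

```python
import numpy as np

def simulate_claims(seed=0, n=200):
    """Simulated 'claims' series driven by its own lag and a search index.

    Purely synthetic: the coefficients 0.6 and 0.8 are invented for illustration.
    """
    rng = np.random.default_rng(seed)
    search = rng.normal(size=n)             # hypothetical search-volume index
    noise = rng.normal(scale=0.5, size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.6 * y[t - 1] + 0.8 * search[t] + noise[t]
    return y, search

def out_of_sample_mae(y, search, n_test, use_search):
    """One-step-ahead mean absolute error over the last n_test periods.

    Baseline:  y[t] ~ const + y[t-1]
    Augmented: adds search[t], assumed observable before y[t] is published.
    The model is re-estimated each period using only data available before t.
    """
    errors = []
    for t in range(len(y) - n_test, len(y)):
        cols = [np.ones(t - 1), y[:t - 1]]  # regressors for targets y[1..t-1]
        x_new = [1.0, y[t - 1]]
        if use_search:
            cols.append(search[1:t])
            x_new.append(search[t])
        coef, *_ = np.linalg.lstsq(np.column_stack(cols), y[1:t], rcond=None)
        errors.append(abs(y[t] - np.dot(coef, x_new)))
    return float(np.mean(errors))

y, search = simulate_claims()
mae_base = out_of_sample_mae(y, search, n_test=50, use_search=False)
mae_aug = out_of_sample_mae(y, search, n_test=50, use_search=True)
print(f"baseline MAE: {mae_base:.3f}, with search index: {mae_aug:.3f}")
```

The re-estimation inside the loop matters: each forecast uses only information that would have been available at the time, which is what distinguishes a genuine nowcasting test from an in-sample fit.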
Nick McLaren and Rachana Shanbhogue from the Bank of England are amongst the few central bank researchers openly interested in the possibilities offered by online searches (in particular by the Google Insights for Search function, http://www.google.com/insights/search/#) and their predictive power for housing markets and unemployment. Initial results suggest that Internet search data can help predict changes in unemployment in the United Kingdom. For house prices, the results are stronger: search term variables outperform existing indicators over the period since 2004. There is also evidence that these data may be used to provide additional insight on a wider range of issues which traditional business surveys might not cover.
Tanya Suhoy from the Bank of Israel warns of important shortcomings: the dynamics of query indices may be non-stationary, and the predictive ability of query indices may vary as agents turn to alternative social searches that Google cannot track.
Google has come up with a new tool, Google Correlate, which finds correlations between whatever data you plug into it and whatever people are searching for on Google. Real Time Economics gave it a try and the results are somewhat surprising. The Fed’s balance sheet has, for example, an extremely high correlation of 0.9605 with searches for “nausea remedies.” Maybe quantitative easing hasn’t been good for America, but it’s been good for Dramamine sales! (Stats geeks, we hear your complaint: “Correlation doesn’t imply causation.” Did you know that there’s a nearly perfect correlation between people who say that and people who are party poopers?). But wait: the correlation between the size of the balance sheet and searches for “how to get over a guy” is even higher, at 0.9726.
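One plausible reason such correlations come out so high is that both series simply trend over the same period. The sketch below, which uses made-up series rather than the Fed or Google data, illustrates the point: two series that share nothing but an upward drift are strongly correlated in levels, and the correlation largely disappears once each series is differenced. The function name levels_vs_changes and all parameters are hypothetical.

```python
import numpy as np

def levels_vs_changes(seed=1, n=150):
    """Correlation of two unrelated trending series, in levels and in changes.

    Both series are synthetic: a linear trend plus independent noise.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    balance_sheet = 1.0 * t + rng.normal(scale=5.0, size=n)   # hypothetical series A
    search_volume = 0.5 * t + rng.normal(scale=5.0, size=n)   # hypothetical series B
    corr_levels = np.corrcoef(balance_sheet, search_volume)[0, 1]
    corr_changes = np.corrcoef(np.diff(balance_sheet), np.diff(search_volume))[0, 1]
    return float(corr_levels), float(corr_changes)

corr_levels, corr_changes = levels_vs_changes()
print(f"correlation in levels:  {corr_levels:.2f}")
print(f"correlation in changes: {corr_changes:.2f}")
```

Checking the correlation of period-to-period changes is a cheap first filter before reading anything into a headline number.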
Fighting influenza and quantifying human movements
Jeremy Ginsberg and his coauthors have studied influenza epidemics, for which early prevention is key to saving lives. They conclude that health-seeking behavior in the form of queries to online search engines, which are submitted by millions of users around the world each day, is a useful guide to health concerns. By their account, the current level of weekly influenza activity in each region of the United States can be accurately estimated with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users. The Google Flu Trends approach has now been extended to the detection of dengue outbreaks via the eponymous Google Dengue Trends.
A team of Boston-based network physicists found that they were able to predict an individual’s whereabouts with over 93% accuracy based on cell-phone-generated information on past movements. Nathan Eagle notes that mobile phones give researchers the ability to quantify human movement on a scale that wasn’t possible before.
Detecting digital smoke signals in developing countries
A particularly interesting avenue is the application of these tools and approaches in developing countries. According to Gavin Krugel, director of mobile banking strategy at GSM Association, an industry group of 800 wireless operators, one billion consumers in the world have a mobile phone but no access to a bank account. About 40 million people worldwide use mobile money, and the industry is growing, with 18,000 new mobile banking users per day in Uganda, 15,000 in Tanzania and 11,000 in Kenya. Mobile phones are routinely used for banking services, including payments, money transfers and savings services, but also for the transmission of other pieces of information such as grades and other test results, stocks and prices of various goods in different markets, medical appointments, etc. A similar phenomenon is also visible and expected to persist for Internet traffic: according to a study, while Internet traffic growth is expected to be in the range of 25-35% in North America, Western Europe and Europe, it may reach or surpass 50% in Latin America and the Middle East and Africa.
Mobile phones are also used to purposefully generate high-frequency data in crisis settings through a technique known as “crowdsourcing” or “crowdvoicing”. The technique received widespread attention in the aftermath of the 2010 earthquake in Haiti with the work of Ushahidi, a non-profit tech company that set up a text messaging system allowing cell-phone owners to report on collapsed buildings in real time. Indeed, a high correlation was found between reports of damaged buildings via text messages and actual damage on the ground, which prompted Ushahidi’s Patrick Meier to conclude that data collected using unbounded crowdsourcing (non-representative sampling), largely in the form of SMS from the disaster-affected population in Port-au-Prince, can predict, with surprisingly high accuracy and statistical significance, the location and extent of structural damage post-earthquake.
Global Pulse is a United Nations initiative set up in 2009 to detect “digital smoke signals” in new high-frequency data that may be signs of incipient harm at the community level. The rationale behind the project is that individuals and communities in developing countries change their collective behaviors in response to shocks in ways that leave trails in digital data such as mobile-phone banking transactions, access to programs and services, fertilizer use, etc.
Conceptual, analytical and operational challenges
Analyzing high-frequency data for research and policy purposes also runs into a number of challenges: conceptual, analytical and operational. Gary King, in another contribution, “Ensuring the Data-Rich Future of the Social Sciences”, discussed the need to find a balance between the unprecedented increases in data production and availability about individuals and the privacy rights of human beings worldwide. In their paper, Dirk Helbing and Stefano Balietti also call for “privacy-preserving analyses”, including through the use of adequately anonymized data, deliberate participation and strong privacy-preserving technical systems and legal standards.
The complex and chaotic nature of high-frequency data brings about specific analytical challenges for both researchers and policymakers. Jim Fruchterman discussed the potential problems with using text-message-based systems in crisis contexts. The correlation found in Haiti is an example of the effect of a "confounding factor". A correlation was found between building damage and SMS streams but, as pointed out by Kristian Lum, only because both were correlated with the simple existence of buildings. Thus the correlation between the SMS feed and the building damage is an artifact, or spurious correlation. In addition, Fruchterman reported that once you control for the presence of any buildings (damaged or undamaged), the text message stream seems to have a weak negative correlation with the presence of damaged buildings. That is, the presence of text messages suggests there are fewer (not more) damaged buildings in a particular area.
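“Controlling for the presence of buildings” can be read as computing a partial correlation: regress both the SMS counts and the damage counts on building density, then correlate what is left over. The following is a minimal sketch on simulated grid cells, not the Haiti data; the coefficients are invented solely to reproduce the qualitative pattern described above (a strong raw correlation that turns weakly negative once the confounder is removed), and the helper names are hypothetical.

```python
import numpy as np

def residuals(y, x):
    """Residuals of y after an OLS regression on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def partial_corr(a, b, control):
    """Correlation between a and b after removing the linear effect of control."""
    return float(np.corrcoef(residuals(a, control), residuals(b, control))[0, 1])

def simulate_cells(seed=0, n=2000):
    """Synthetic grid cells: building density drives both damage and SMS volume."""
    rng = np.random.default_rng(seed)
    buildings = rng.uniform(0, 100, size=n)                      # confounder
    damaged = 0.3 * buildings + rng.normal(scale=5.0, size=n)
    # Assumed for illustration: more buildings mean more messages, while
    # damage slightly reduces them.
    sms = 0.5 * buildings - 0.2 * damaged + rng.normal(scale=5.0, size=n)
    return buildings, damaged, sms

buildings, damaged, sms = simulate_cells()
raw = float(np.corrcoef(sms, damaged)[0, 1])
controlled = partial_corr(sms, damaged, buildings)
print(f"raw correlation:     {raw:.2f}")
print(f"partial correlation: {controlled:.2f}")
```

The raw figure is driven almost entirely by the shared dependence on building density; only the partial correlation says anything about whether messages carry information about damage beyond “there are buildings here”.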
Justin Ortiz and a team of medical experts compared data from Google Flu Trends from 2003 to 2008 with data from two distinct surveillance networks and found that Google Flu Trends did a very good job at predicting nonspecific respiratory illnesses (bad colds and other infections, like SARS, that seemed like the flu) but did not predict the flu itself very well. The mismatch stemmed from the fact that other infections can cause symptoms that resemble those of influenza, while influenza is not always associated with influenza-like symptoms. According to him, up to 40 percent of people with pandemic flu did not have "influenza-like illness" because they did not have a fever. Influenza-like illness is neither sensitive nor specific for influenza virus activity: it is a good proxy and a very useful public-health surveillance system, but it is not as accurate as actual nationwide specimens positive for influenza virus.
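The terms “sensitive” and “specific” have precise meanings that a small worked example makes clearer. In the sketch below, only the 40 percent figure comes from the text above; the population sizes and the share of non-flu patients showing influenza-like illness (ILI) are hypothetical numbers chosen for round arithmetic.

```python
def sensitivity(true_pos, false_neg):
    """Share of actual flu cases that the proxy (ILI) picks up."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of non-flu cases that the proxy correctly leaves out."""
    return true_neg / (true_neg + false_pos)

def positive_predictive_value(true_pos, false_pos):
    """Share of proxy-positive cases that really are flu."""
    return true_pos / (true_pos + false_pos)

# Hypothetical cohort: 1,000 confirmed flu cases, 40% of them without ILI.
flu_with_ili, flu_without_ili = 600, 400
# Hypothetical: 5,000 patients without flu, 1,000 of whom still show ILI.
other_with_ili, other_without_ili = 1000, 4000

sens = sensitivity(flu_with_ili, flu_without_ili)
spec = specificity(other_without_ili, other_with_ili)
ppv = positive_predictive_value(flu_with_ili, other_with_ili)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} PPV={ppv:.3f}")
```

Under these assumed counts, most ILI cases are not influenza at all, which is the sense in which a signal tuned to flu-like symptoms can track respiratory illness well while tracking the virus itself poorly.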
Bruegel Economic Blogs Review is an information service that surveys external blogs. It does not survey Bruegel’s own publications, nor does it include comments by Bruegel authors.