SSIR@PND
Through an agreement with the Stanford Social Innovation Review, PND is pleased to be able to offer a series of articles and profiles related to the "business" of improving society.
Big Data for Social Innovation
According to IBM, about 2.5 quintillion bytes of data are created every day, enough to fill roughly 78 million 32GB iPads daily. Some of these data are gathered by scientific instruments measuring winds, temperatures, and currents around the world. Other data are captured by computers tracking bond sales, stock trades, and bank deposits. And other data are input by police officers, probation officers, and welfare administrators. All of the data, however, are just data until they are analyzed and used to inform decision-making. What will the weather be like next week? What are the most lucrative investment opportunities? Which neighborhoods should be receiving more social services?
The term "big data" is used to describe the growing proliferation of data and our increasing ability to make productive use of it. A myriad of big data projects have been undertaken in scientific domains. For instance, in 2012 pharmaceutical company Merck found through data analysis that allergens would probably lie dormant throughout March and April 2013 because of unseasonably cold weather, followed by a sudden May warmup that would cause pollen to be released at a higher-than-average rate, thus driving the potential need for Merck's allergy medication Claritin. Merck then modified its marketing strategy to capitalize on the high demand for allergy relief. Through partnerships with Walmart, the company created personalized promotions based on zip code data to market Claritin to heavily hit areas, resulting in increased revenue.
The business community has also been a heavy user of big data. Each month Netflix collects billions of hours of user data related to titles, genres, time spent viewing, and video color schemes to gauge customer preferences, and it uses those data to continually update its recommendation algorithms and programming to give the customer the best possible experience. In 2013, Netflix launched its first original series, House of Cards, largely using a mix of customer behavior data and analytics to help shape the story. Netflix invested $100 million into the series without testing a pilot or conducting focus groups, instead banking on the success of an earlier BBC production of the same name about UK politics, along with what it had learned about the preferences of its 44 million customers. House of Cards has been a great success, bringing in 2 million new subscribers.
While data-driven intelligence has been used successfully in technical and business endeavors, a very different situation prevails in the social arena. There, a large chasm exists between the potential of data-driven information and its actual use in helping solve social problems. Some social problems can be readily solved using big data, such as using traffic data to help ease the flow of highway traffic or using weather data to predict the next hurricane. But what if we want to use data to help us solve our most human and critical social problems, such as homelessness, human trafficking, and education? And what if we not only want to solve these problems but do so in a way that the solutions are sustainable for the future?
Social problems are often what are called "wicked" problems. Not only are they messier than their technical counterparts, they are also more dynamic and complex because of the number of stakeholders involved and the numerous feedback loops among inter-related components. Numerous government agencies and nonprofits are involved in tackling these problems, with limited cooperation and data sharing among them. Compared to their counterparts in the hard sciences who work on technical problems or in business who have ready access to financial, product, and customer information, most of these organizations have inadequate information technology resources.
Beyond the infrastructural impediments that social sector users of big data face, data itself can be a problem. Often, data are missing or incomplete, or stored in silos or in forms that are inaccessible to automated processing. Then there are policy and regulatory challenges that need to be faced, such as building data-sharing agreements, ensuring privacy and confidentiality of user data, and creating collaboration protocols among various stakeholders tackling the same type of problem.
Whereas there is no doubt that nonprofits, government, and other organizations will continue to invest in big data technologies and programs, questions still remain about how beneficial those investments will turn out to be. The value proposition of big data is clear for tackling complex technical and business problems, but the jury is still out on how well big data can tackle complex social problems.
Why Data Is Big
Data, or individual pieces of information, have been gathered and used throughout history. What's changed recently is that advances in digital technology have significantly increased our ability to collect, store, and analyze data. Consider the United States Census Bureau. In 1880, the U.S. conducted a national census of fifty million people that collected a range of demographic information, including age, gender, number of people in a household, ethnicity, birth date, marital status, occupation, health status, literacy, and place of origin. All of this information was logged by hand, microfilmed, and sent to be stored in state archives, libraries, and universities. It took seven to eight years to properly tabulate census data after the initial collection.
In 1890, the Census Bureau streamlined its data collection methods by adopting machine-readable punch cards that could be tabulated in a single calendar year. In the most recent U.S. census, conducted in 2010, the bureau used a range of emerging technologies to survey the populace, including geographic information systems, social media, videos, intelligent character-recognition systems, and sophisticated data-processing software.
Today, big data is used to refer to data sets that extend beyond single data repositories (databases or data warehouses) and are too large and complex to be processed by traditional database management and processing tools. Big data can encompass information from sources such as transactions, social media, enterprise content, sensors, and mobile devices.
There are multiple dimensions to big data, which are encapsulated in the handy set of seven "V"s that follow.
Volume: considers the amount of data generated and collected.
Velocity: refers to the speed at which data are analyzed.
Variety: indicates the diversity of the types of data that are collected.
Viscosity: measures the resistance to flow of data.
Variability: measures the unpredictable rate of flow and types.
Veracity: measures the biases, noise, abnormality, and reliability in datasets.
Volatility: indicates how long data are valid and should be stored.
Although all seven Vs are increasing, they are not equal. Consider volume. The world's collections of data are doubling every eighteen months, presenting the public and private sectors with new opportunities to transform information into insight. As the volume of data increases along with the tendency to store multiple instances of the same data across varied devices, the science of information search and retrieval will have to advance.
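To make the doubling claim concrete, here is a minimal sketch in Python. It assumes the eighteen-month doubling period holds steady, which is an idealization, and the function name growth_factor is purely illustrative rather than drawn from any source cited here.

```python
def growth_factor(months: float, doubling_period_months: float = 18.0) -> float:
    """Return how many times larger a data collection becomes after `months`,
    assuming it doubles every `doubling_period_months` (an idealized model)."""
    if doubling_period_months <= 0:
        raise ValueError("doubling period must be positive")
    return 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Print the multiplier after several horizons, given in years.
    for years in (1.5, 3, 5, 10):
        print(f"after {years} years: x{growth_factor(years * 12):.1f}")
```

Under this assumption, a collection grows roughly tenfold in five years and about a hundredfold in ten, which suggests why search and retrieval methods would need to keep pace.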
The most challenging V for organizations is variety. Organizations have built information systems to tackle data elements in specific categories. The challenge for many organizations is to find economical ways of integrating heterogeneous datasets while allowing for newer sources of data (in origin and type) to be integrated within existing systems. Ensuring that the data collected are of sufficient veracity is also critical. Today, because of the proliferation of social networks and social media, much of the data being collected needs to be thoroughly analyzed before decision-making, as the data can be easily manipulated.
Failing to Use Big Data
When considering big data in the context of social problems, we arrive at a humbling conclusion: For the most part, there is no big data! When it comes to social problems, data are still highly unstructured and largely limited to numbers, rather than other types of data. Take human trafficking, a $32 billion global industry that ensnares an estimated thirty million people annually. Although considerable momentum exists to combat the problem, few initiatives have attempted to use big data.
Increasingly, traffickers make use of mobile phones, social media, online classifieds, and other Internet platforms. Data from these technologies could be collected and used to identify, track, and prosecute traffickers, but a few daunting truths remain: The illicit nature of human trafficking makes it difficult to collect primary data; primary data collected from some organizations may be unreliable; and we lack reliable indicators to measure anti-trafficking program and policy success. Furthermore, most information collected on human trafficking is stored in a manner that meets organizational but not global needs. Because of data privacy and security issues, data held by various organizations are seldom shared in raw form, limiting the creation of global, or big, datasets.
Making matters worse, agencies combating trafficking often compete with each other for scarce resources, whether grants and gifts or recognition from the press and the community. Because of this competition, data sharing between agencies, and even between agencies and the public, is rare. The Polaris Project, for example, has been working to combat human trafficking using a comprehensive approach combining advocacy, client services, technical training and assistance, global programs, and a national resource hotline. Between 2003 and 2006, Polaris provided hotlines for human trafficking survivors to call. In 2007, the U.S. Department of Health and Human Services selected Polaris to operate the country's first national human trafficking resource hotline. Over the years, Polaris is believed to have logged more than seventy-five thousand calls; nevertheless, access to the data is limited and little is known about its reliability and its sources.
Think what might be done if the Polaris information were opened to the public and integrated with other data sources, such as economic indicators, transportation routes, education statistics, and victim services. Only when the data are aggregated with other data, analyzed, visualized, and made accessible to a multitude of stakeholders will the collection be truly valuable. Only then will the small data have a chance to grow into big data and help us effectively combat human trafficking.
One hopeful sign is that in 2012 Google Giving awarded Polaris and two other international anti-human trafficking organizations $3 million to fund the aggregation of the data collected from their three hotlines and to scale their hotlines into an international hotline. Together, all three organizations have coalesced under the Global Human Trafficking Hotline Network. This is a positive sign, but it is yet to be seen what the fruits of this collaboration will be.
Barriers to Creating and Using Big Data
There are four principal reasons for the relative lack of structured big data for social problems: data are buried in administrative systems; data governance standards are lacking; data are often unreliable; and data can cause unintended consequences.
Data are buried in administrative systems. Most organizations collect data to meet operational needs, and those data are often buried in an organization's administrative systems. To overcome this problem, organizations are trying to find ways to build large datasets that can be more widely used. This obstacle needs to be overcome before we begin to think of connecting datasets across organizations. Take the U.S. healthcare industry. Inefficient management of big data costs the industry between $100 billion and $150 billion a year in administrative costs. The biggest problem in the healthcare industry is the sheer number of health and insurance plans that providers contract and negotiate with to be paid for their services. Each health or insurance plan supports its own system of underwriting, claims administration, provider network contracting, and broker network management, leaving data stored in multiple formats in multiple places. The McKinsey Global Institute estimated that if the U.S. healthcare industry were to transform its use of big data for more efficiency and quality, the sector could create more than $300 billion in value every year.
Data governance standards are lacking. A second challenge in our ability to use big data for social problems is the lack of adequate data governance standards that define how data are captured, stored, and curated for accountability. As a result, large inconsistencies exist and the data being captured are often not readily suitable for analysis. In many cases, data need to be transformed before they can be used, and transformation is costly. Analysts often struggle with integrating different datasets because they lack good metadata (data that describe data) and the quality of data is poor. An example of this is the U.S. government's 2009 initiative, data.gov, to make its vast amounts of data readily available to the public so that nonprofits, businesses, and other organizations can use the data for innovative purposes. The initiative has been hampered by the difficulty of ensuring that the data are in a usable format. Data quality differs heavily from agency to agency, with some agencies, such as the Environmental Protection Agency, releasing data regularly and in machine-readable formats, whereas other agencies publish data in difficult-to-manipulate forms such as PDFs or older file formats. The number of government datasets being made publicly available has exploded, but only a handful of these datasets are ever used. The ones that are being used are, not surprisingly, cases where there is good metadata, ease of accessibility, and manipulability.
Data are often unreliable. The abundance of data provides great opportunities to researchers trying to understand and solve social problems, but unfortunately much of the data is unreliable. Simply having a lot of data does not necessarily mean that the data are representative and reliable. For example, in 2011 the Obama administration was weighing the proposed Keystone XL pipeline project, which would carry tar sands oil from Alberta, Canada, to Texas. The proposal raised concerns among landowners, farmers, ranchers, and environmentalists who were living in the vicinity of the proposed pipeline. Despite the concerns, the American Petroleum Institute and its oil lobby allies were able to manipulate social media sentiment to show support for the project. They did so by flooding Twitter with tweets supporting the project, a volume that did not accurately represent overall public sentiment. The Rainforest Action Network discovered the subterfuge and criticized the oil companies for using fake Twitter accounts to show support for the pipeline project. Specifically, RAN pointed out a sudden spike of tweets favoring the pipeline (posted within three minutes from fifteen accounts) and gathered evidence that fourteen of the fifteen accounts were phony and that the tweets were generated by an automated process.
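A spike like the one RAN described, many accounts posting within a few minutes, is the kind of pattern a simple check can surface. The following Python sketch is only an illustration under stated assumptions and is not RAN's method: the function name largest_burst, the (timestamp, account) input format, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def largest_burst(tweets, window=timedelta(minutes=3)):
    """Find the span of length `window`, anchored at some tweet, that
    contains posts from the most distinct accounts.

    `tweets` is an iterable of (timestamp, account_name) pairs.
    Returns (window_start, set_of_accounts), or (None, set()) if empty.
    """
    ordered = sorted(tweets, key=lambda t: t[0])
    best_start, best_accounts = None, set()
    for i, (start, _) in enumerate(ordered):
        accounts = set()
        # Collect every account that posted within `window` of this tweet.
        for ts, account in ordered[i:]:
            if ts - start > window:
                break
            accounts.add(account)
        if len(accounts) > len(best_accounts):
            best_start, best_accounts = start, accounts
    return best_start, best_accounts


if __name__ == "__main__":
    base = datetime(2011, 8, 1, 12, 0, 0)  # made-up date for illustration
    # Made-up background traffic: one tweet per hour from different users.
    background = [(base + timedelta(hours=h), f"user{h}") for h in range(1, 6)]
    # Made-up burst: fifteen accounts posting ten seconds apart.
    burst = [(base + timedelta(seconds=10 * k), f"acct{k}") for k in range(15)]
    start, accounts = largest_burst(background + burst)
    print(start, len(accounts))
```

A burst flagged this way is only a lead. Showing that the accounts are fake or automated, as RAN set out to do, takes further evidence.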
Data can cause unintended consequences. Big data users can find themselves facing the unintended consequences of exploiting big data with no regard for data quality, legality, disparate data meanings, and process quality. This was the case when public agencies and a newspaper in New York came under scrutiny for releasing information about gun owners. In the wake of the mass shooting at an elementary school in Newtown, Connecticut, a group of journalists from The Journal News in White Plains, New York, used the Freedom of Information Act to obtain information about gun owners living in suburban Westchester, Rockland, and Putnam counties and published an article that included an interactive visual map complete with individual gun owners' names and addresses. The object of the effort was to inform the public about who legally owns firearms in their neighborhoods, but outraged critics of the effort also noted that the information could be used by criminals to target vulnerable homeowners who do not own guns or to target homeowners who have guns in order to steal them.
The Promise of Mobile Phones
There is one area where nonprofits have begun to make good use of big data: mobile phones. In 2010, more than five billion mobile phones were in use, over 80 percent of them in developing countries. Indeed, the percentage of people owning mobile phones in sub-Saharan Africa increased from 32.1 percent in 2008 to 57.1 percent in 2012, and is expected to rise to 75.4 percent by 2016.
The growth in mobile phone use has offered people in developing countries more and better opportunities to improve their quality of life. For example, Cell Life, a South African organization, created a mass messaging mobile service called Communicate that reminds patients to take their medications, links patients to clinics, and offers peer-to-peer support services such as counseling and monitoring. Cell Life also developed Capture, a service that makes it possible for healthcare workers in the field to collect and save information in digital form using their mobile phones.
The rapid proliferation of mobile and Internet usage allows for the collection of unprecedented amounts of information. Most modern mobile phones contain global positioning system technology that identifies the geographic location of the phone. In addition to location data, mobile phones contain a treasure trove of information, including call logs, SMS messages, and social media postings. A mobile phone acts as an individual sensor collecting relevant information from its environment that, when aggregated and analyzed with information from millions of other mobile phones, can lead to the discovery of important information, which can then be disseminated back to people on the ground via the same mobile phones.
For example, Harvard University epidemiologist Caroline Buckee and her team have used location data from mobile phones to better understand the migration patterns of people in Kenya and connect those patterns to the spread of malaria and other infectious diseases. They were able to do this because Kenya's western highlands are equipped with thousands of cell-phone towers that transmit data about individual phone calls and text messages. Buckee's researchers found that people making calls and sending text messages via a specific tower were making sixteen times more trips out of that area than the regional average, with significant activity in the malarial hot spot of Lake Victoria. That information will be used to develop predictive models to further combat malaria in the region.
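The kind of aggregation involved can be sketched in a few lines of Python. This is a hypothetical illustration, not the researchers' actual method: the record format (user, timestamp, tower), the notion of a fixed "home" tower per user, and both function names are assumptions made for the example.

```python
from collections import defaultdict


def trips_per_user(records, home_tower):
    """Average number of outbound trips per user, grouped by home tower.

    `records` is an iterable of (user, timestamp, tower) tuples and
    `home_tower` maps each user to an assumed home tower. A trip is counted
    whenever a user's consecutive records move from the home tower to any
    other tower.
    """
    by_user = defaultdict(list)
    for user, ts, tower in records:
        by_user[user].append((ts, tower))

    trips = defaultdict(int)
    users = defaultdict(int)
    for user, events in by_user.items():
        home = home_tower[user]
        users[home] += 1
        events.sort()  # order each user's records by timestamp
        for (_, prev), (_, cur) in zip(events, events[1:]):
            if prev == home and cur != home:
                trips[home] += 1
    return {tower: trips[tower] / users[tower] for tower in users}


def ratio_to_average(rates):
    """Divide each tower's rate by the unweighted mean rate across towers."""
    if not rates:
        return {}
    average = sum(rates.values()) / len(rates)
    if average == 0:
        return {tower: 0.0 for tower in rates}
    return {tower: rate / average for tower, rate in rates.items()}


if __name__ == "__main__":
    # Made-up records; timestamps are plain integers for simplicity.
    records = [
        ("a", 1, "A"), ("a", 2, "B"), ("a", 3, "A"), ("a", 4, "C"),
        ("b", 1, "A"), ("b", 2, "A"),
        ("c", 1, "B"), ("c", 2, "B"),
    ]
    homes = {"a": "A", "b": "A", "c": "B"}
    print(ratio_to_average(trips_per_user(records, homes)))
```

A tower whose ratio sits far above 1 marks an area with unusually high outbound travel, which is the sort of signal that could then be compared against disease maps.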
Steps to Increase Use of Big Data
Big data has enormous potential to inform decision-making to help solve the world's toughest social problems. But for that to happen, issues relating to data collection, organization, and analysis must first be resolved. The following four recommendations have the potential to create datasets useful for evidence-based decision-making:
Building global data banks on critical issues. The global community needs to create large data banks on complex issues such as human trafficking, global hunger, and poverty. These data banks would have the capacity to hold multiple different data types along with metadata that describes the datasets. For this to happen, multisector alliances that promote data sharing on thematic issues need to be created. At the 2012 G-8 Summit, leaders of the world's largest economies and four African heads of state met to discuss and commit to a new phase of efforts to fight hunger and food insecurity. Out of that discussion grew the New Alliance for Food Security and Nutrition, which has set its sights on helping fifty million people out of poverty over the next ten years through the promotion of sustained agricultural growth. As part of its plan, New Alliance launched a number of technology- and data-based initiatives. One was the Scaling Seeds and Other Technologies Partnership, which was developed to promote the commercialization, distribution, and adoption of technologies that improve seed varieties. The U.S. government's contribution to the effort has been chronicled through the Feed the Future initiative and website, and it has stayed true to the alliance's stance on data sharing by establishing Agrilinks.org, a data-sharing platform that is updated on a regular basis. Farmers can tap into the site to learn about new agricultural practices or live tweet from their mobile phones to ask questions of an agriculture expert. In addition, USAID is offering open data pulled from the Bangladesh Integrated Household Survey dataset, which itself is derived from baseline surveys of nearly five thousand households in Ghana that captured indicators outlined by Feed the Future and the Women's Empowerment in Agriculture Index.
Engaging citizens and citizen science. Big data is not the sole province of professionals. Citizens also can be enlisted to help create and analyze these datasets. With the proliferation of data through open data platforms, more and more citizens are creating new ideas and products through what has become known as "citizen science." In 2010, the City of London made government data available to the public by opening the London Datastore. Managed by the Greater London Authority, the London Datastore offers citizens the opportunity to view and use raw data released from city agencies and civil servants. Information distributed through the site includes data on crime, economics, and the performance of transit systems in the city.
Building a cadre of data curators and analysts. Today, not only do we have a shortage of data curators and analysts who can tackle social problems, we have limited avenues for existing personnel to receive the necessary training and build competence in data analysis. In other words, we have left data science to science and business and equipped students in the social sciences with, at best, a basic understanding of statistics. This is unacceptable if we are to truly take advantage of big data. We need to equip students and analysts with the necessary skills to curate and create large datasets. These skills are often found in programs in informatics and traditional degree programs in information and library science, where students learn about the organization, preservation, visualization, retrieval, and use of data. In addition to these skills, increasing the capacity of data analysts to think about what is possible with data is critical. Thinking about networked relationships and spotting latent patterns in datasets are competencies that need to be developed.
Promoting virtual experimentation platforms. To increase our understanding of how to use big data for tackling social problems, we need to promote more experimentation. Virtual experimentation platforms, which allow individuals to share ideas, interact with the ideas of others, and work collaboratively to find solutions to problems or take advantage of opportunities, can bring interested parties together to create large datasets, develop innovative algorithms to analyze and visualize the data, and develop new knowledge. One example is Kaggle, a website where competitions on data analysis are run. Unfortunately, organizations that are tackling social issues seldom participate on these platforms. Virtual experimentation platforms are essential if we are going to move the needle on using big data to tackle social challenges. Initially, these platforms should stimulate competitions to create large datasets on various issues. Competitions that generate large datasets are critical if the community hopes to overcome the challenges associated with the way the social sector currently operates. Once a couple of these large datasets have been created, we can launch competitions that focus on predictive analytics and the discovery of novel patterns. The use of open forums such as wikis and discussion groups can help the community share lessons learned, collaborate, and advance new solutions.
The Future of Big Data
Business and science have shown that big data's merits are undeniable. Social sector organizations must now figure out how they, too, can incorporate this type of decision-making capability into their operations. The potential for growth and innovation exists, but there are serious obstacles to overcome. The issues that are being tackled in the social sector are in many ways more complex than they are in business or science, making the use of big data that much more difficult. In addition, greater attention must be paid to the rights, privacy, and dignity of their constituents.
In spite of these obstacles, progress is being made. Public sector agencies have made it clear that data are an important element of social innovation. Stakeholders such as the U.S. government and the World Bank have made their data available to the public for mining and further use. Individuals are using the data to create innovations (mainly in the form of apps) to address particular problems.
Organizations have been created to help make better use of big data for social problems. DataKind matches scientists and statisticians with nonprofits on a pro bono basis in an effort to help overcome the shortage of technology personnel capable of conceiving and managing big data projects. Globally, too, the world's actors are making efforts to use open data and big data to develop solutions to social problems in innovative and collaborative ways. Progress is being made, but a chasm between theory and practice still exists and must be bridged. It is a challenge worth overcoming.
Kevin C. Desouza is associate dean for research in the College of Public Programs, an associate professor in the School of Public Affairs, and the interim director for the Decision Theater in the Office of Knowledge Enterprise Development at Arizona State University.
Kendra L. Smith is a doctoral candidate in the School of Community Resources and Development within the College of Public Programs at Arizona State University.
