NSF Project: Using social media to solve social problems
Heng Xu, co-director of the Robust Analytics Lab, is a co-PI of a recently awarded National Science Foundation project, 1823633: “RR: The Generalizability and Replicability of Twitter Data for Population Research”. See news release for the award here.
Project Summary
Social media data have the potential to track phenomena in real time, such as the percentage of the population that is fearful in the minutes after a disaster or terrorist event, or the degree of anger immediately after the announcement of a jury verdict in a highly publicized case. In each of these examples, it would be difficult to conduct a field survey in real time, and respondents may not be able to reconstruct how they felt or behaved at the time of the event, even if interviewed just a few days later. Social media data have the potential to overcome these limitations. This project will analyze how survey weighting can rebalance samples of Twitter data and assess how well this rebalancing allows valid generalizations about population behaviors. The project will provide a foundation for future advances in the use of social media data for scientific, health, and applied research, thus permitting a wide variety of inferences useful in social policy formulation. A key aspect of the project will provide new evidence on how accurately migration flows can be estimated in real time, which can inform social policy on providing assistance in response to natural disasters.
This project will evaluate the extent to which Twitter users represent or misrepresent the population across different demographic groups and test the feasibility of developing weights that, when applied to Twitter data, make the results more representative of the underlying population. The research is conducted at the county level in the United States for the period from January 2014 to December 2017, using 96% of geotagged tweets in the study period and 100% of tweets in one month. The project will: (1) extend and refine existing methods for imputing the gender, age, race/ethnicity, and county of residence of each Twitter user; (2) use these imputed values to assess the representativeness of Twitter samples at the county level and explain the determinants of biases; (3) adapt five methods developed for probability or non-probability surveys to reweight Twitter samples and compare their performance in producing model estimates that can be used to infer characteristics of the general population; and (4) test the feasibility of using Twitter data to estimate migration at the county level by comparing the estimates with Internal Revenue Service migration data, and estimate migration from Puerto Rico to the continental United States after Hurricane Maria. Analysis of these migration data will provide a new source of information with which to estimate migration flows in real time and at unprecedentedly detailed geographic scales.
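To make the idea of reweighting concrete, the following is a minimal sketch of post-stratification, one standard survey-weighting technique. The summary does not name the five methods the project adapts, so this is an illustration only and not necessarily one of them. The age groups, population shares, and outcome values are hypothetical toy data, and the function names are invented for this example.

```python
from collections import Counter


def poststratification_weights(sample_cells, population_shares):
    """Return one weight per sampled user so that the weighted share of each
    demographic cell matches that cell's share in the population."""
    n = len(sample_cells)
    counts = Counter(sample_cells)
    weights = []
    for cell in sample_cells:
        sample_share = counts[cell] / n
        # Over-represented cells get weights below 1, under-represented above 1.
        weights.append(population_shares[cell] / sample_share)
    return weights


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy sample: imputed age group of five Twitter users in a county,
# and a binary outcome (for example, 1 = posted about the event).
sample_cells = ["18-29", "18-29", "18-29", "30-64", "65+"]
outcome = [1, 1, 0, 0, 0]

# Hypothetical population shares for the same county (for example, from census data).
population_shares = {"18-29": 0.2, "30-64": 0.5, "65+": 0.3}

weights = poststratification_weights(sample_cells, population_shares)
raw_estimate = sum(outcome) / len(outcome)
adjusted_estimate = weighted_mean(outcome, weights)

print(f"raw estimate:      {raw_estimate:.3f}")
print(f"adjusted estimate: {adjusted_estimate:.3f}")
```

In this toy sample, younger users are over-represented and account for all of the positive outcomes, so the weighted estimate is lower than the raw one. Weighting of this kind corrects only for the characteristics used to define the cells, which is why the accuracy of the imputed demographics in step (1) matters for the later steps.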