We gratefully acknowledge the National Science Foundation for its support of our project 1850605, “RR: Establishing and Boosting Confidence Levels for Empirical Research Using Twitter Data”.

Project Summary

Concerns about a reproducibility crisis in scientific research have become increasingly prevalent within the academic community and among the public at large. The field of meta-science, the scientific study of science itself, is thriving and has examined the existence and prevalence of threats to reproducible and robust research. Most existing replication efforts in the social sciences, however, have focused on studies using data from statistically rigorous, designed surveys or experiments. Largely missing are replication efforts devoted to studies that use organic data, including data generated organically by ubiquitous sensors or mobile applications, Twitter feeds, click streams, etc. This project examines the inconsistent handling of organic data across scholarly publications in the social sciences, in order to establish the confidence (or lack thereof) in the conclusions drawn from analyses of such data. Since findings in the social and behavioral sciences inform policy makers on a wide variety of issues, from homeland security to the national economy, establishing confidence in these findings is critical to their proper use, and therefore has broader impacts on all these application areas of national priority.

More specifically, this project starts by determining the extent of, causes of, and remedies for empirical research using organic data that are neither reproducible nor generalizable. The findings from this step raise awareness about the standards and tools for collecting, cleaning, and processing organic data sets across many fields of the social sciences. In addition, this project develops new analytical frameworks and methodologies for evaluating the replicability and robustness of empirical studies with organic data. The vision is for such frameworks to be broadly used in many application domains, thereby fostering cultural change across different fields in the social sciences and bringing the value of reproducibility and robustness to the forefront of data-intensive research.