Guidelines for Reproducibility
The APSR requires authors of conditionally accepted manuscripts to submit a reproducibility package to the APSR Dataverse. We review these packages to verify that we can use the submitted materials to reproduce the manuscript’s tables and figures, and to check that the authors have documented the research process well enough that future researchers will be able to benefit from it. This document provides requirements, advice, and instructions for authors to satisfy our pre-publication requirements.
Objectives
Preparing reproducibility packages requires valuable time. There are three reasons why the APSR requires authors of conditionally accepted manuscripts to deposit reproducibility packages prior to publication:
- Quality control: Papers published in the APSR should carry the assurance that the results are sound. When possible, we want to help authors catch errors before a paper is published.
- Comprehensibility: Readers should be able to understand exactly how a paper’s results were produced through some combination of reading the paper, reading the appendices, and working with the reproducibility package.
- Extensibility: Other scholars should be able to expand, in future work, on the published article.
Requirements for Reproducibility Packages
Overview of what makes a successful reproducibility package
A reproducibility package is successful if someone attempting to reproduce the results in the paper can do the following:
- Open a README in the root folder of the package and find a summary overview of all materials in the reproducibility package.
- Follow a clear set of instructions in the README to run the code required to produce all tables and figures in the paper.
- For every table and figure in the paper:
  - Locate the table/figure in the output produced by running the code as instructed in the README.
  - Locate the place in the code that produces the table/figure.
Scope of reproducibility
The reproducibility package must produce, starting from data in as raw a form as possible, all computations reported in the manuscript’s tables and figures.
When a manuscript uses a secondary dataset (i.e., a dataset made available by others), the reproducibility package must include the raw dataset and include any code that was used to transform it (and/or describe in detail any manual transformations that were made). This way, other researchers can understand and assess the author’s transformations.
When a paper includes original data collection, the reproducibility package must also include the instruments used to collect the data, e.g., the survey questionnaire (including experimental treatments, if any) or the webscraping/API code used to retrieve online data. These allow future scholars to reproduce the data collection process.
For each dataset, authors must provide documentation that allows others to use the data for purposes other than simply reproducing the paper’s tables and figures. This means including:
- Codebook with a clear description of each variable, or
- A reference (in the README) to publicly available documentation for the dataset
The reproducibility package should also produce any important computations that appear outside of tables and figures in the manuscript.
Data to be included
Whenever possible, reproducibility packages should provide datasets in “raw” form (i.e., before the data has been cleaned or transformed by the author’s code). We recognize that “raw data” may be tricky to define in some cases.
When the data pipeline is time consuming, uses uncommon software, or relies on very large files, authors must include (in addition to the raw data files) an “analysis dataset” that can be used to produce the paper’s tables and figures without running the whole data pipeline. This allows other researchers to assess the robustness of the main results without having to take time to generate the analysis dataset themselves.
Authors are responsible for ensuring that they have permission to share the data they include in their reproducibility package, and that by sharing the data they do not violate legal or ethical rights of research subjects and/or the dataset’s creators.
Data that authors are unable to share
If analysis relies on data that cannot be shared for ethical, legal, or other reasons, authors must provide instructions in the README on exactly how others can obtain the data.
Data citation
Authors must cite the datasets they use in their manuscript. The Social Science Data Editors website provides guidance on how to do this. When using multiple related datasets (e.g., several years of the American National Election Study), create a single composite citation for the bibliography and list individual datasets in an appendix.
README file
All reproducibility packages should include a file in the root directory named README in an open file format (e.g. TXT, PDF, markdown, HTML). Authors should assume this is the first file we will examine. The README file should include, at a minimum:
- Table of contents: a brief description of every file in the replication folder
  - Documentation files (codebook etc.)
  - Code files: What does the file do?
  - Data files: What data is contained in the file? How/where was the data acquired?
- Instructions for running the code
- Notes for each table and figure: a short list of where replicators will find the code needed to reproduce all parts of the publication.
- Software dependencies: Instructions that will allow our team to reproduce your software environment and run the submitted code. To make it easier to diagnose issues that arise, these instructions should include the operating system (e.g. Windows 10, macOS 12.1) and version of the computing software (e.g. R 4.1.2, Stata 17) used to conduct the paper’s analysis, as well as a list of installed packages (with version numbers/dates).
  - R: Information on the R version and loaded libraries can be found by typing sessionInfo().
  - Stata: A list of all add-ons installed on a system can be found by typing ado dir.
- Estimated runtime for long-running computations: If any part of the data pipeline or analysis requires more than a few minutes to run on a typical laptop, include this information in the README.
- Seed locations: If any of the analysis relies on (pseudo-)randomness (e.g. Monte Carlo simulations, bootstrapped standard errors), then authors should set seeds in their code and note in the README where seeds are set.
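As an illustration of the seed-setting item above, here is a minimal sketch in Python (chosen only for illustration; the same pattern applies in R or Stata). The seed value, the function name `bootstrap_se`, and the data are all hypothetical. The point is that a single, clearly named seed at the top of a script makes it easy to tell replicators where randomness is controlled.

```python
import random
import statistics

SEED = 20240101  # hypothetical seed value; note this location in the README


def bootstrap_se(values, n_draws=1000, seed=SEED):
    """Bootstrap standard error of the mean, using a locally seeded generator."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    n = len(values)
    means = []
    for _ in range(n_draws):
        # Resample the data with replacement and record the resample mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    return statistics.stdev(means)


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]  # made-up illustrative values
first = bootstrap_se(data)
second = bootstrap_se(data)  # same seed, so the same result as `first`
```

A matching README entry might then read something like "Seeds: set once at the top of the bootstrap script (SEED)."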
For more detailed guidance, authors are encouraged to follow the README template provided by Social Science Data Editors: https://social-science-data-editors.github.io/template_README/. A README that follows those guidelines will satisfy our requirements.
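The software-dependency information requested for the README can usually be gathered programmatically rather than by hand. For projects written in Python (an assumption made only for this example; R and Stata users can rely on sessionInfo() and ado dir), a sketch along these lines collects the operating system, interpreter version, and installed packages. The function name `describe_environment` is hypothetical.

```python
import platform
import sys
from importlib import metadata


def describe_environment():
    """Return lines describing the OS, Python version, and installed packages."""
    lines = [
        f"Operating system: {platform.system()} {platform.release()}",
        f"Python version: {sys.version.split()[0]}",
        "Installed packages:",
    ]
    # Collect (name, version) pairs, skipping entries with missing metadata.
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            packages.add((name, str(dist.version)))
    for name, version in sorted(packages, key=lambda pair: pair[0].lower()):
        lines.append(f"  {name}=={version}")
    return lines
```

Printing the returned lines and pasting them into the README's software dependencies section is one way to use it.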
Output
To assist the verification process, make sure that your code saves each table and figure in the paper to an output file. Do not simply print a regression table to the console.
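A minimal sketch of saving a table to disk, written in Python for illustration only. The helper name `save_table`, the file name, and the numbers in the usage note are hypothetical; in R or Stata the equivalent would be whichever export routine your table package provides.

```python
import csv
from pathlib import Path


def save_table(rows, header, path):
    """Write a results table to disk as CSV, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)   # column names
        writer.writerows(rows)    # one row per coefficient or statistic
    return path
```

For example, `save_table([["treatment", 0.42, 0.10]], ["term", "estimate", "std_error"], "results/table1.csv")` would write a file named after the table it reproduces, so that a replicator can match output files to the paper at a glance.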
Suggestions to authors preparing reproducibility packages
Read our instructions carefully and use the checklist below, which is also the checklist we will use in assessing whether to accept your package or send it back for revision.
We strongly encourage authors to use CodeOcean to produce their reproducibility packages. For a detailed explanation of CodeOcean, how to use it to produce a reproducibility package, and how to export your package to Dataverse, see this link. In brief, CodeOcean is a web application that makes it easy to containerize your code and data using Docker. If you submit a CodeOcean capsule as your Dataverse submission, be sure to provide a link to the published capsule (or arrange to share the capsule with us separately) and we will skip the step of independently verifying your results. This should cause your paper to be published sooner.
If a reproducibility package includes multiple scripts, include a master script that runs each of these scripts in the appropriate order.
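One possible shape for such a master script is sketched below in Python, assuming (hypothetically) that the scripts live in one folder and carry two-digit numeric prefixes that encode their order. The function name `run_pipeline` and the naming pattern are illustrative assumptions; an R master script would typically call source() on each file in turn.

```python
import subprocess
import sys
from pathlib import Path


def run_pipeline(code_dir, interpreter=sys.executable):
    """Run every numbered script in code_dir in filename order; stop on failure."""
    # Sorting by name gives the intended order for 01_..., 02_..., 03_... files.
    scripts = sorted(Path(code_dir).glob("[0-9][0-9]_*.py"))
    for script in scripts:
        print(f"Running {script.name}")
        # check=True raises an error if a script fails, so later steps do not
        # run on incomplete inputs.
        subprocess.run([interpreter, str(script)], check=True)
    return [script.name for script in scripts]
```

Stopping at the first failure is a deliberate choice here: it keeps a broken early step from silently producing stale or partial results downstream.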
Give names to all files (code, data, and output) that will be easy for a replicator to understand, e.g. 01_data_cleaning.R, figure2.pdf.
Use comments in your code to make it easy for future scholars (including yourself!) to understand what the code is doing.
Unless the reproducibility package is very simple (with e.g. one script and one dataset, few outputs), we encourage you to use a directory structure that separates code, data, and results. A common pattern is:
- README.txt
- master.R
- data/
  - raw/
    - CCES.csv
    - County_level_covariates.csv
  - analysis/
    - For_regressions.csv
- code/
  - 01_data_processing.R
  - 02_simulation.R
  - 03_analysis.R
- results/
  - table1.tex
  - table2.tex
  - figure1.pdf
  - figure2.pdf
  - figure3.pdf
To upload a package with a directory structure to the APSR Dataverse, select all files and subfolders in your directory on your computer and add them to a .zip file. Upload the zipped file to your dataset. Dataverse will automatically unzip the uploaded files.
After uploading your file(s) to the APSR Dataverse and saving the result (but before submitting), download the replication package from Dataverse and make sure it runs and produces the expected output. (To download the package, click the “Access Dataset” button then “Download ZIP”.) Ideally, you should save the package to a location in your computer’s directory structure that is different from the location where you developed the replication package; that way, you will catch any absolute paths in your code. Better yet, download it to a different computer entirely and run it there.
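Absolute paths can also be caught before uploading by scanning the code for them. The sketch below, in Python, uses a heuristic regular expression; the pattern, the function name `find_absolute_paths`, and the example script text are all assumptions for illustration, and the pattern will not catch every possible absolute path.

```python
import re

# Heuristic: a quote followed by a Windows drive letter, a common Unix/macOS
# root folder, or a home-directory shortcut. Not exhaustive.
ABSOLUTE_PATH = re.compile(
    r"""["'](?:[A-Za-z]:[\\/]|/(?:Users|home|Volumes|mnt)/|~/)"""
)


def find_absolute_paths(source_text):
    """Return (line_number, line) pairs whose quoted strings look absolute."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if ABSOLUTE_PATH.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical script contents used only to demonstrate the check.
example = '''
setwd("C:/Users/me/project")
df <- read.csv("data/raw/CCES.csv")
source("/home/me/project/code/01_data_processing.R")
'''
flagged = find_absolute_paths(example)  # flags the setwd() and source() lines
```

Relative paths such as "data/raw/CCES.csv" pass the check, which is the form that will keep working when the package is unpacked on another machine.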
We recognize that with some computational approaches the same code can produce slightly different results in successive runs, even when random seeds and software are perfectly harmonized. If this is the case for your project, please explain this in the README so that we do not try to investigate small discrepancies.
If you have questions about how our requirements apply to your study, please contact us.
Reproducibility package checklist
- README describes each file in the package
- README contains instructions for running the code
- README indicates where each table and figure can be found in the output
- README lists base dependencies and additional dependencies
- README contains estimated runtime for any long-running computations
- If analysis requires randomness, README indicates where in the code seeds are set
- Assuming data is shareable and included in package:
  - Code runs and saves output files to disk
  - Using instructions in README, every table and figure in the paper can be found in output
  - Content of every table and figure matches what is in the paper
- Every secondary dataset in package is cited in the paper (or appendix for multiple related datasets)
- Every secondary dataset in package is included in its raw form, i.e. without author transformations
- Every dataset has a codebook or a reference in the README to publicly available documentation
- For any original dataset, data collection instruments are included in the replication package.
- If the data pipeline takes a long time, relies on large datasets, or requires downloading of unusual software, an analysis dataset is included
Instructions for submitting reproducibility packages
- Sign in (after signing up, if necessary) to Dataverse
- Go to the APSR Dataverse. Click the “Add Data” button and select “New Dataset” in the dropdown menu. Important: Please make sure to add your dataset to the APSR Dataverse and not anywhere else in the Harvard Dataverse repository.
- Fill in the form to describe your data file(s), such as title, author name(s), abstract, year, citation to article, etc. The minimum information should include (a) title (“Replication Data for: [paper title]”), (b) author name(s) and (c) contact information, (d) description (abstract of the paper and/or description of the replication package), (e) subject (Social Sciences), (f) related publication (“Forthcoming, American Political Science Review” for the initial submission).
- Scroll down to the “Files” section and click on “Select Files to Add” to upload your replication package, either as separate files or (preferred) a zip file containing the directory substructure. Click “Save Dataset” when upload is complete. This creates your “dataset” on Dataverse, but the result is not yet published.
- Recommended: Download your package and test that the code produces the desired output, as described in the “Suggestions” section above.
- When the replication package is ready, click “Submit for review” to submit the draft version of the dataset for replication.
- Once the package has been submitted, we will review it. We will contact you if revisions are required. When the package has been approved, we can proceed with publication.
