Avoid temporal bias in MTurk samples: Publish microbatches on a fixed interval

MTurk samples differ by time of day and day of week on “characteristics known to impact political attitudes.” This is the conclusion of a recent in-preparation article, “Intertemporal Differences Among MTurk Worker Demographics” (a preprint can be found at https://osf.io/preprints/psyarxiv/8352x). The reason for this finding is simple: different kinds of people tend to use MTurk at different times of the day and on different days of the week.

In light of these findings, researchers should take precautions to avoid temporal bias in their data. In this tutorial, I’ll share an approach to doing just that.


Temporal bias can be minimized by breaking up a sample into many smaller sub-samples and publishing them on a fixed interval spread across time. Splitting a sample this way is sometimes called “micro-batching,” and it is often used to avoid Amazon’s 20% mark-up for HITs with more than 9 assignments. For example, in a recent study I planned to collect 3000+ participants. Instead of publishing all of the assignments at once, I posted 1 HIT with 9 assignments once every hour until the total sample was reached (it took about 2 weeks). In doing this, I was able to recruit a roughly equal number of participants at each hour of the day and day of the week.
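
As a rough sketch of the arithmetic behind this schedule, the following R snippet works out how many 9-assignment batches a 3,000-person sample needs, how long hourly posting would take, and how evenly the postings fall across hours of the day. The start time and variable names are hypothetical, chosen only for illustration.

```r
# Illustrative values taken from the example above
target_n       <- 3000   # planned sample size
batch_size     <- 9      # assignments per HIT
interval_hours <- 1      # one batch per hour

n_batches  <- ceiling(target_n / batch_size)    # number of HITs to post
total_days <- n_batches * interval_hours / 24   # length of the collection period

# Hypothetical start time; UTC avoids daylight-saving jumps
start_time <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")
schedule   <- start_time + (seq_len(n_batches) - 1) * interval_hours * 3600

# Number of batches posted at each hour of the day and on each weekday
batches_per_hour    <- table(format(schedule, "%H", tz = "UTC"))
batches_per_weekday <- table(format(schedule, "%u", tz = "UTC"))  # 1 = Monday

print(n_batches)
print(round(total_days, 1))
print(range(batches_per_hour))
```

With these numbers, 334 hourly batches take a little under 14 days, which matches the roughly two weeks reported above, and every hour of the day receives either 13 or 14 batches.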

Below, I share my code for implementing this in R (https://www.r-project.org), which is free open-source software with capabilities similar to MATLAB. R is typically used for statistical analyses, but crafty researchers have written their own libraries to perform a variety of other functions. One of these is MTurkR (https://cran.r-project.org/web/packages/MTurkR/index.html), which is what we’ll be using today to interface with MTurk!

1. Download and install an R environment. If you’re a Windows user, I recommend RStudio (https://www.rstudio.com). In subsequent steps I’ll be referring to this, but the steps should be identical for other environments since they’re performed using console commands.

2. Launch RStudio, and then run the following command in the console:

    install.packages("MTurkR")

3. Get a Unique Turker code for your study, and save it for later. This is important because we will be publishing multiple batches for the same study, and we don’t want workers taking the study more than once.

Go to http://uniqueturker.myleott.com and copy the “Unique Identifier,” then click the “Get Script” button to activate the identifier.

4. Get the HIT Layout ID for the HIT that you want to post.

To obtain the HIT Layout ID, open the web-based Requester interface, go to the Create tab, and click on the title of the project. A popup like the following will open:

http://i.imgur.com/OIKH2W1.jpg

Copy/write down the Layout ID and save it for later.

5. Get your Amazon Access Key ID and Secret Access Key, and copy/write them down for later.

See here for instructions: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html

6. Modify and then run the following code in R:

    ##### Notes:
    # 1) Change sandbox_val from TRUE to FALSE to run live (make sure to test in sandbox first!!)

    ##### Step 1: Load library, set parameters

    #### Load MTurkR library
    library(MTurkR)

    #### HIT Layout ID
    my_hitlayoutid <- "CHANGEME"

    #### Set MTurk credentials
    Sys.setenv(
        AWS_ACCESS_KEY_ID = "CHANGEME",
        AWS_SECRET_ACCESS_KEY = "CHANGEME"
    )

    #### HIT parameters

    ## Run in sandbox? (a logical TRUE/FALSE, not a quoted string)
    sandbox_val <- TRUE

    ## Set the name of your project here (used to retrieve HITs later)
    myannotation <- "myannotation"

    ## Enter other HIT aspects
    newhittype <- RegisterHITType(
        title = "10 Question Survey",
        description = "Complete a 10-question survey about news coverage and your opinions",
        reward = "0.20",
        duration = seconds(hours = 1),
        keywords = "survey, questionnaire, politics",
        sandbox = sandbox_val
    )

    ##### Step 2: Define functions

    ## Define a function that will create a HIT using the information above
    createhit <- function() {
        CreateHIT(
            hit.type = newhittype$HITTypeId,
            question = GenerateHTMLQuestion(file = system.file("templates/surveylink.xml", package = "MTurkR")),
            assignments = 9,
            expiration = seconds(days = 30),
            annotation = myannotation,
            verbose = TRUE,
            sandbox = sandbox_val,
            hitlayoutid = my_hitlayoutid
        )
    }

    ## Define a function that will expire all running HITs
    ## This keeps HITs from "piling up" on a slow day
    ## It ensures that A) the newest HIT appears at the top of the list,
    ## and B) workers won't accidentally accept the HIT twice
    expirehits <- function() {
        ExpireHIT(
            annotation = myannotation,
            sandbox = sandbox_val
        )
    }

    ##### Step 3: Execute a loop that runs the createhit/expirehits functions every hour and logs the output to a file

    ## Define number of times to post the HIT (totalruns)
    totalruns <- 10
    counter <- 0

    ## Define log file (change the location as appropriate)
    logfile <- file("C:/Users/Tyler/Documents/logfile.txt", open = "a")
    sink(logfile, append = TRUE, type = "message")

    ## Run loop (note: interval is hourly, but can be changed in Sys.sleep)
    repeat {
        message(Sys.time())
        createhit()
        Sys.sleep(3600)
        expirehits()
        counter <- counter + 1
        if (counter >= totalruns) {
            break
        }
    }

    ## To stop the loop before it finishes, click the "STOP" button
    ## To stop logging, run sink(type = "message") and then close(logfile)


This code will do the following:

  1. Load MTurkR
  2. Load your Amazon MTurk credentials into R’s memory
  3. Register a new HIT with MTurk using parameters that are specified both in this code (e.g., HIT description) and in surveylink.xml
  4. Define a function for creating a HIT on MTurk, and another for expiring HITs
  5. Execute a loop that will publish a new HIT repeatedly on an interval for a set number of times, and log the results to a file
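
Before running the loop live, it may help to rehearse its logic without contacting MTurk at all. The sketch below is not part of the original script: run_batches() and the two stand-in functions are hypothetical names, and a short interval replaces the hourly wait. Wrapping each call in tryCatch() is also an assumption about how one might keep a single failed request from ending the whole run.

```r
# Hypothetical helper: post, wait, expire, repeated total_runs times.
# post_fun and expire_fun are any zero-argument functions.
run_batches <- function(post_fun, expire_fun, total_runs, interval_secs) {
  for (i in seq_len(total_runs)) {
    message(format(Sys.time()), " - batch ", i, " of ", total_runs)
    # Log an error and carry on instead of stopping the loop
    tryCatch(post_fun(),
             error = function(e) message("Post failed: ", conditionMessage(e)))
    Sys.sleep(interval_secs)
    tryCatch(expire_fun(),
             error = function(e) message("Expire failed: ", conditionMessage(e)))
  }
  invisible(total_runs)
}

# Stand-in functions that only count calls (no MTurk requests are made)
calls <- new.env()
calls$posted  <- 0
calls$expired <- 0
fake_post <- function() {
  calls$posted <- calls$posted + 1
}
fake_expire <- function() {
  calls$expired <- calls$expired + 1
}

# Dry run: 3 batches, 0.1 seconds apart
run_batches(fake_post, fake_expire, total_runs = 3, interval_secs = 0.1)
print(c(posted = calls$posted, expired = calls$expired))
```

Once the dry run behaves as expected, the same helper could be called with createhit, expirehits, and interval_secs = 3600.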

You will need to change a few parts of this code; the relevant areas are noted in the code comments.
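
One way to catch a forgotten edit is to check for leftover placeholders before anything is posted. This is a minimal sketch, assuming the placeholders are the literal string “CHANGEME” as in the code above; unset_settings() is a hypothetical helper, not an MTurkR function, and the example values are made up.

```r
# Hypothetical helper: names of settings that are empty or still "CHANGEME"
unset_settings <- function(settings) {
  is_unset <- !nzchar(settings) | settings == "CHANGEME"
  names(settings)[is_unset]
}

# Made-up example values; in practice one might pass my_hitlayoutid,
# Sys.getenv("AWS_ACCESS_KEY_ID"), and Sys.getenv("AWS_SECRET_ACCESS_KEY")
example_settings <- c(
  hitlayoutid = "CHANGEME",
  access_key  = "EXAMPLEKEYID",
  secret_key  = ""
)

still_unset <- unset_settings(example_settings)
if (length(still_unset) > 0) {
  message("Still need to set: ", paste(still_unset, collapse = ", "))
}
```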

Keep in mind that once you execute the loop, you will need to keep your computer running (it will stop if your computer stops). Your computer needs to be connected to the Internet, and RStudio needs to remain open (it can run in the background).

As always, make sure to test this in the Worker/Requester Sandbox. For example, you could post a HIT to the Sandbox using the createhit() function in R, and then log in to the Worker Sandbox (http://workersandbox.mturk.com) to see what it looks like. Note that the Sandbox will require a different HIT Layout ID (obtained through the Sandbox Requester interface).
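
Because the Sandbox and the live site use different Layout IDs, one option is to keep both in the script and choose between them with the same sandbox flag. The sketch below assumes two placeholder IDs and a hypothetical helper named pick_layout_id(); neither comes from the original script.

```r
# Placeholder IDs: replace with the values from each Requester interface
layout_ids <- c(sandbox = "CHANGEME_SANDBOX", live = "CHANGEME_LIVE")

# Hypothetical helper: return the Layout ID that matches the sandbox flag
pick_layout_id <- function(sandbox, ids = layout_ids) {
  stopifnot(is.logical(sandbox), length(sandbox) == 1, !is.na(sandbox))
  if (sandbox) ids[["sandbox"]] else ids[["live"]]
}

sandbox_val    <- TRUE
my_hitlayoutid <- pick_layout_id(sandbox_val)
print(my_hitlayoutid)
```

Switching sandbox_val to FALSE would then select the live Layout ID without any other edits.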

Let me know if you have any questions!

P.S. I’d like to thank Marc A. Ragin for emailing me about a bug in the loop code and suggesting the use of “hitlayoutid” (which is much easier!).
