Introduction

In 2017, political scientists Casey et al. (2017) published a paper titled Intertemporal Differences Among MTurk Workers: Time-Based Sample Variations and Implications for Online Data Collection. In this paper, they showed how the demographic characteristics of MTurk samples vary by time (time of day, day of week). For example, one finding they reported was that you’re more likely to get participants from Eastern time zones if you recruit at 10 a.m. EST, but you’ll get more workers from Western time zones if you recruit at 10 p.m. EST. Of course, the differences went well beyond time zones and into various other demographic characteristics like age, sex, race, and even some personality traits.

After reading this paper, it occurred to me that this issue of time-based sample variation might be solved using time-based sample stratification. The logic goes like this: If the issue is sample bias resulting from posting large batches of HITs at specific times of day, then posting many smaller “micro-batches” at equal intervals spaced out across time should lead to better representation.

Here I present the results from a secondary analysis of a study (N = 197) in which I applied this technique.

Stratification Method

Sampling Code

To perform time-based sample stratification, I used MTurkR to interface with the MTurk API. I wrote a repeat loop that would post a HIT with 2 assignments once every hour, and at the end of each hour it would also expire any assignments remaining from previous HIT and assign a qualification to anyone who just participated (so they couldn’t participate again).

## Run in sandbox?
sandbox_val <- "FALSE"

## Annotation
my_annotation <- "MY_ANNOTATION"

# Qualification reqirement (hasn't done the task yet)
my_qual <- GenerateQualificationRequirement(
                        qual = "MY_QUALIFICATION_ID", 
                        comparator = "DoesNotExist", 
                        value = "", 
                        preview = "TRUE")

# Create HIT type
newhittype <- RegisterHITType(
  title = "15-minute academic survey",
  description = "Complete a 15-minute research survey",
  reward = "1.50", 
  duration = seconds(hours = 1), 
  keywords = "academic, research, study, survey",
  qual.req = my_qual,
  sandbox = sandbox_val
)

# The Loop
repeat {
  
  # Create HIT
  currentHIT <-  CreateHIT(    
    hit.type = newhittype$HITTypeId,
    assignments = 9,
    expiration = 3600,
    annotation = my_annotation,
    sandbox = sandbox_val,
    hitlayoutid = my_hitlayoutid
  )
  
  # Wait some duration (1 hour)
  Sys.sleep(3600)
  
  # Expire HIT
  ExpireHIT(hit = currentHIT$HITId, sandbox = sandbox_val )
  
  # Get assignments
  assns <- GetAssignments(hit = currentHIT$HITId, sandbox = sandbox_val)
  
  # (De)Qualify workers from next HIT of same type
  AssignQualification(qual = "MY_QUALIFICATION_ID", workers = assns$WorkerId)
}

Sampling Performance

During the period of 2018-05-20 15:59:11 EDT to 2018-05-25 12:41:26 EDT I posted 118 HITs to MTurk, each spaced 1 hour apart with ~2 Assignments. This interval was a fair distribution of time, representing almost a full week (Sunday to Friday), and might therefore be expected to account for demographic differences related to lifestyle or circadian rhythm.

The graphs below show the actual HITs posted and Assignments completed as a function of time. As we see, there was pretty good coverage. The HITs were deployed consistently across time (With one spike around May 23 indicating a point where manual intervention was necessary.)

times_df <- data.frame(times = times)
wtimes_df <- data.frame(times = weighted_times)

ggplot() +
  geom_histogram(data=wtimes_df, aes(weighted_times, fill="#34495e"), alpha=0.5, bins=30) +
    ylab("Count") + xlab("Date / Time") +
  geom_histogram(data=times_df, aes(times, fill="#d35400", alpha=0.5), bins=30, alpha=0.5) + 
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(), axis.text.x=element_text(angle = 30, hjust = 1)) + 
    scale_fill_manual(name = '', 
         values =c("#d35400","#34495e"), labels = c('Assignments\nCompleted', "HITs Created"))

Results

Now that I’ve shown the stratification process works to get pretty good representation across time, we’ll next look at the demographics of the sample. For these analyses, 15 participants were dropped due to data quality issues on non-demographic questions (e.g., missing attention check questions), leaving a sample of n = 182 for the analyses. Although more demographics were collected than are reported here, we’ll focus on four variables: timezone, age, sex, and race.

Timezone Distribution

The plot below shows the percentages of time-zones reported by participants, as compared to percentages from the Census. The distribution looks pretty good, with some appreciable degree of underrepresentation in the Mountain timezone.

#########################################
# TIMEZONE

# Source for US estimates: https://metricmaps.files.wordpress.com/2017/05/time-zone-pop.png

total_mturk <- NROW(d[which(d$TimeZone > 1 & d$TimeZone < 6),])

tz_arr <- as.data.frame(matrix(0, ncol = 3, nrow = 0))
tz_arr <- rbind(tz_arr, 
              c("census", "Eastern", 48),
              c("census", "Central", 29),
              c("census", "Pacific", 17),
              c("census", "Mountain", 7),
             
              c("mturk", "Eastern", round(NROW(d[which(d$TimeZone == 2),]) / total_mturk * 100)),
              c("mturk", "Central", round(NROW(d[which(d$TimeZone == 3),]) / total_mturk * 100)),
              c("mturk", "Pacific", round(NROW(d[which(d$TimeZone == 5),]) / total_mturk * 100)),
              c("mturk", "Mountain", round(NROW(d[which(d$TimeZone == 4),]) / total_mturk * 100))
)

colnames(tz_arr) <- c("source", "timezone", "pct")
tz_arr$pct <- as.numeric(as.character(tz_arr$pct))

ggplot(tz_arr, aes(x=timezone, y=pct, fill=source)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) + 
  geom_bar(stat="identity", position = "dodge", width=0.7) +
  geom_text(aes(label = round(pct), y= round(pct)),  vjust = -0.5, size = 3, position=position_dodge(width=0.7)) + 
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank(), axis.text.x=element_text(angle = 30, hjust = 1)) +
  xlab("Time Zone") + ylab("Percentage")

Effect of Access Time on Reported Timezone

Arguably, the effect of time-based sample stratification should be most apparent in the number of participants recruited from different timezones. Using timestamps recorded when participants first accessed the study (which are all set to the researcher’s timezone, thereby providing a common basis for comparison) we can see if there is any effect of distributing the survey at different times of the day.

We’ll use an Analysis of Variance, with the hour-of-day as our outcome variable and the timezone reported by participants as our predictor variable. As the analysis shows, the only significant difference was between individuals in the Mountain and Central timezones. This effect is not very reliable considering the Mountain timezone group was so small (n = 6) it is likely driven by extreme values. However, what is surprising is the lack of difference between the other timezones. Although the means were (sort of)* in the expected direction, the differences were not robust.

*(I say “sort of” because it is unclear why the mean time for the Central group was lower than the Pacific group, given their respective offsets)

#### Prepare data.frame()
test <- d[which(d$TimeZone >1 & d$TimeZone < 6),]
test$TimeZone <- as.factor(test$TimeZone)
levels(test$TimeZone) = c("Eastern","Central","Mountain","Pacific")
test$HourAccessed <- hour(test$AccessTime)

aov <- aov(test$HourAccessed ~ test$TimeZone)
print(anova(aov))
## Analysis of Variance Table
## 
## Response: test$HourAccessed
##                Df Sum Sq Mean Sq F value  Pr(>F)  
## test$TimeZone   3  553.7 184.577  3.8213 0.01114 *
## Residuals     162 7825.0  48.303                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(pairwise.t.test(test$HourAccessed, test$TimeZone, p.adj = "bonf"))
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  test$HourAccessed and test$TimeZone 
## 
##          Eastern Central Mountain
## Central  0.272   -       -       
## Mountain 0.136   0.014   -       
## Pacific  1.000   1.000   0.057   
## 
## P value adjustment method: bonferroni
tgc <- summarySE(test, measurevar="HourAccessed", groupvars=c("TimeZone"))

ggplot(tgc, aes(x=TimeZone, y=HourAccessed, fill=TimeZone)) + 
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=HourAccessed-ci, ymax=HourAccessed+ci),
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9)) +
  xlab("Time Zone") + ylab("Hour Accessed\n(Researcher Time)")

Age Distribution

The plot below shows the percentages of age groups reported by participants, as compared to percentages from the Census. Compared to the general US population, the MTurk sample overrepresented the 35 to 44 age group.

#########################################
# AGE

# ages <- getCensus(name = "pep/charage",
#           vintage = 2017,
#           vars = c("POP", "AGE"),
#           region = "us:*")

load("ages.R")

total_c <- sum(as.numeric(ages[which(ages$AGE > 17 & ages$AGE < 999),]$POP))
total_mturk <- NROW(d)

age_arr <- as.data.frame(matrix(0, ncol = 3, nrow = 0))
age_arr <- rbind(age_arr, 
              c("census", "18 to 24", round(sum(as.numeric(ages[which(ages$AGE >= 18 & ages$AGE <= 24),]$POP)) / total_c * 100)),
              c("census", "25 to 34", round(sum(as.numeric(ages[which(ages$AGE >= 25 & ages$AGE <= 34),]$POP)) / total_c * 100)),
              c("census", "35 to 44", round(sum(as.numeric(ages[which(ages$AGE >= 35 & ages$AGE <= 44),]$POP)) / total_c * 100)),
              c("census", "45 to 54", round(sum(as.numeric(ages[which(ages$AGE >= 45 & ages$AGE <= 54),]$POP)) / total_c * 100)),
              c("census", "55 to 64", round(sum(as.numeric(ages[which(ages$AGE >= 55 & ages$AGE <= 64),]$POP)) / total_c * 100)),
              c("census", "65 to 74", round(sum(as.numeric(ages[which(ages$AGE >= 65 & ages$AGE <= 74),]$POP)) / total_c * 100)),
              c("census", "75 to 84", round(sum(as.numeric(ages[which(ages$AGE >= 75 & ages$AGE <= 84),]$POP)) / total_c * 100)),
              c("census", "85+", round(sum(as.numeric(ages[which(ages$AGE >= 85 & ages$AGE < 999),]$POP)) / total_c * 100)),
             
              c("mturk", "18 to 24", round(NROW(d[which(d$Age == 1),]) / total_mturk * 100)),
              c("mturk", "25 to 34", round(NROW(d[which(d$Age == 2),]) / total_mturk * 100)),
              c("mturk", "35 to 44", round(NROW(d[which(d$Age == 3),]) / total_mturk * 100)),
              c("mturk", "45 to 54", round(NROW(d[which(d$Age == 4),]) / total_mturk * 100)),
              c("mturk", "55 to 64", round(NROW(d[which(d$Age == 5),]) / total_mturk * 100)),
              c("mturk", "65 to 74", round(NROW(d[which(d$Age == 6),]) / total_mturk * 100)),
              c("mturk", "75 to 84", round(NROW(d[which(d$Age == 7),]) / total_mturk * 100)),
              c("mturk", "85+", round(NROW(d[which(d$Age == 8),]) / total_mturk * 100))
)

colnames(age_arr) <- c("source", "age", "pct")
age_arr$pct <- as.numeric(as.character(age_arr$pct))
age_arr$pct <- age_arr$pct + 0.2

ggplot(age_arr, aes(x=age, y=pct, fill=source)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) + 
  geom_bar(stat="identity", position = "dodge", width=0.7) +
  geom_text(aes(label = round(pct), y= round(pct)),  vjust = -0.75, size = 3, position=position_dodge(width=0.7)) + 
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank(), axis.text.x=element_text(angle = 30, hjust = 1)) +
  xlab("Age") + ylab("Percentage")