Is MTurk having a “data quality crisis”?
In August 2018 there was an event that I like to call “MTurk botgate”: researchers on Twitter started to panic after several of them reported a sudden uptick in low-quality data. Some researchers were seeing the same kinds of nonsense response patterns to open-ended survey questions from large numbers of participants, and were also seeing repeated geo-locations appear in their datasets (more on this later).
MTurk may be contaminated! Bots are filling out psychology surveys. Apparently you can tell by repeating GPS coordinates. These researchers are looking into it, and you can report anything fishy at: https://t.co/YYtwAjKIyc— Kurt Gray (@kurtjgray) August 9, 2018
New study of MTurk data suggests many participants (some estimates as high as 48%) are bots with spoof accounts. Read more and learn how to detect suspicious participants here: https://t.co/KwZlzzjmRu— Matt Motyl (@MattMotyl) August 9, 2018
News outlets also jumped on the bandwagon with hits like these:
- Wired: A Bot Panic Hits Amazon’s Mechanical Turk
- NewScientist: Bots on Amazon’s Mechanical Turk are ruining psychology studies
So why were researchers finding low quality data? I’d like to lay out a few of the possibilities and explain why I think the dominant narrative of “bots” is an unlikely candidate.
As you can probably guess, the story that really caught people’s attention was that MTurk was overrun with bots! This came from researchers who asserted that bad actors had started using scripts that could autonomously or semi-autonomously complete tasks for them. And maybe those bad actors had acquired lots and lots of accounts through which they could enact their nefarious bot behavior.
Yes, there are some scripts that MTurk “power users” are known to use to enhance their experience and increase their efficiency while working on the platform. A lot of these can be found at greasyfork.org – a site where users share custom browser scripts that change the behavior of certain websites. For example, there’s one script called “MTurk Captcha Alert” that supposedly alerts users when a captcha is found in a task. And another called “mTurk survey highlight words” that will highlight words used in attention check questions like “ignore”, “reading”, and “attention”. But there’s nothing there on the order of automating entire MTurk tasks (let alone automating lots of different kinds of tasks).
There’s also clearly an appetite among some users for scripts that would automate some of the drudgery: one reddit user asks “Is there any script to auto check all the radio buttons?” and another asks, similarly, “How do I find a script that automatically fills radio buttons?”. Of course, these users didn’t get the answers they were looking for, and the threads were unpopular with their communities. But scripts like these do in fact exist (I won’t link them here so as not to make them easier to find), and I wouldn’t be the least bit surprised if some users have figured out how to make them work to a relatively successful degree.
And yes, the buying and selling of MTurk accounts does take place, though I don’t know how successful those transactions are or how long the accounts remain active post-transaction.
Yet I’m skeptical of the idea that either A) automation scripts had suddenly become widespread among users, or B) some users had suddenly acquired lots and lots of accounts to use as bots. There just isn’t enough evidence to support this. Even the evidence of repeated geo-coordinates was largely a misunderstanding of the fact that Qualtrics geo-locations are only accurate to the city level. In other words, users who were thought to be coming from the same location could simply have been located in the same, densely-populated city.
I think the real story was academic psychologists collectively panicking when several anecdotes w/ bad data surfaced. IMO, plumbing those depths would have made a better story than the (exaggerated, possibly false) "bots are destroying science" hottake. #botpocalypse #meta https://t.co/5fj4xEC0Ub— Tyler Burleigh (@tylerburleigh) August 11, 2018
Blame the user, not the tool
Another possibility is that researchers were new to the platform and not following established best practices for screening out “low-quality” workers.
I think there's 2 parts to this:
1) Are there bots on MTurk? Almost certainly yes.
2) Are they smart enough to pass strict reputation criteria (% Approval, # of HITs Completed)? Doubtful. I suspect many (most) reports of bots are b/c researchers are not using these correctly. https://t.co/ptkVCrlQr3
— Tyler Burleigh (@tylerburleigh) August 10, 2018
Typically, researchers will restrict access to their MTurk tasks by only allowing workers who have a history of at least 100 tasks, an approval rate of at least 95% over those tasks (although lately I’ve seen more researchers using >= 98% approval), and who are located in the US.
This practice dates back to a blog post from 2012 by one of the admins of the now defunct TurkerNation: Tips for Academic Requesters on Mturk, and the practice was later validated by a peer-reviewed research article: Reputation as a sufficient condition for data quality on Amazon Mechanical Turk
In speaking with some of the researchers who were reporting data quality issues, I learned that at least a couple of them had apparently not followed these best practices. In one case, for instance, a user had set an approval % criterion, but had not also set a minimum number of HITs completed. This matters because any worker with fewer than 100 HITs completed is automatically assigned a 100% approval rate, as stated in the MTurk documentation (“Note that a Worker’s approval rate is statistically meaningless for small numbers of assignments, since a single rejection can reduce the approval rate by many percentage points. So to ensure that a new Worker’s approval rate is unaffected by these statistically meaningless changes, if a Worker has submitted less than 100 assignments, the Worker’s approval rate in the system is 100%.”)
If you are using vanilla MTurk and you don't set a # of HITs completed for your task in addition to % Approval, then anyone with fewer than 100 HITs completed can participate even if they have a garbage track record. It's even more important to set this than I thought... https://t.co/8UexvUg0FK— Tyler Burleigh (@tylerburleigh) August 13, 2018
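For the programmatically inclined, here is a minimal sketch of how those reputation criteria look when attached to a HIT through the MTurk API. The qualification type IDs are MTurk's documented system qualifications; the helper function and its parameters are just illustrative.

```python
# Reputation requirements matching the best practices described above:
# Location = US, HIT Approval Rate >= 98%, Number of HITs Approved > 100.
# The GreaterThan comparator on NumberHITsApproved is the key piece --
# workers with fewer than 100 submitted assignments show a 100% approval
# rate, so an approval-rate filter alone does not screen them.

REQUIREMENTS = [
    {   # HIT Approval Rate >= 98%
        "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [98],
    },
    {   # Number of HITs Approved > 100
        "QualificationTypeId": "00000000000000000040",  # NumberHITsApproved
        "Comparator": "GreaterThan",
        "IntegerValues": [100],
    },
    {   # Location = US
        "QualificationTypeId": "00000000000000000071",  # Locale
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    },
]

def create_screened_hit(client, **hit_kwargs):
    """Create a HIT with the reputation requirements attached.

    `client` would be a boto3 MTurk client, e.g. boto3.client("mturk").
    """
    return client.create_hit(QualificationRequirements=REQUIREMENTS, **hit_kwargs)
```

Platforms that sit on top of MTurk expose the same three settings through their UI; the point is that all three need to be set together.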
One weakness of the “blame the user” explanation is that researchers did report using a US location restriction, and furthermore, the repeated geo-locations were showing US locations (one location that showed up for a lot of users was Buffalo, NY).
This eventually gave rise to the idea, which now seems to be the most plausible in my estimation, that it was foreign users who were pretending to be from the US by using VPN (Virtual Private Network, a.k.a. “proxy”) services to route their traffic through US server locations. This is what TurkPrime found in their analysis, which referred to these users as “server farmers”, and the analysis is pretty compelling.
"Our evidence suggests recent data quality problems [on MTurk] are tied to foreign workers, not bots" https://t.co/XBstUYTJNL— Tyler Burleigh (@tylerburleigh) September 19, 2018
When I saw this, I jumped at the opportunity to build a tool that would let researchers screen these users out by identifying where their traffic was coming from. I eventually collaborated with some folks to create a suite of tools for this purpose, and to write up a document describing in detail how to actually do the screening in Qualtrics.
Kennedy, R., Clifford, S., & Burleigh, T. (October 24, 2018). The Shape of and Solutions to the MTurk Quality Crisis. Available at SSRN: https://ssrn.com/abstract=3272468
Winter, N., Burleigh, T., Kennedy, R. & Clifford, S. (February 1, 2019). A Simplified Protocol to Screen Out VPS and International Respondents Using Qualtrics. Available at SSRN: https://ssrn.com/abstract=3327274 [PDF]
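The core of the screening logic is simple: look up each respondent's IP address against an IP-intelligence service and flag traffic that is non-US or comes from commercial (non-residential) address space. The sketch below illustrates the idea; the endpoint, header, and response fields (`countryCode`, `block`) are assumptions for illustration, so consult the protocol write-up for the actual service and setup.

```python
# Sketch of server-farm screening: flag respondents whose IP is outside
# the US or belongs to a non-residential ISP (a likely VPN/proxy host).
# The lookup endpoint and response schema below are hypothetical.

import json
import urllib.request

def lookup_ip(ip, api_key):
    """Query a hypothetical IP-intelligence endpoint that returns JSON."""
    req = urllib.request.Request(
        f"https://ip-intel.example.com/v1/{ip}",  # placeholder endpoint
        headers={"X-Key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def screen(ip_info):
    """Return a reason to exclude the respondent, or None to keep them.

    Assumes the lookup response has a 'countryCode' field and a 'block'
    flag where a nonzero value marks commercial, non-residential space.
    """
    if ip_info.get("countryCode") != "US":
        return "outside the US despite a Location = US qualification"
    if ip_info.get("block", 0) != 0:
        return "non-residential ISP (possible server farm / proxy)"
    return None
```

In the Qualtrics version of the protocol, a check like this runs at the start of the survey and flagged respondents are branched away before they see any questions.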
I’ve started using this protocol as standard practice. Since the panic I’ve collected data for a few larger-scale studies and by all accounts the quality of that data was very good.
1/10 If you use MTurk + Qualtrics and want to improve data quality, this is a reminder that @njgwinter, @ScottClif, @RyanKennedy7, and I put out a simple protocol to screen users behind "server farms".https://t.co/VCwYRQ6AcK— Tyler Burleigh (@tylerburleigh) April 20, 2019
2/10 Months ago there was a fear that MTurk was having a quality crisis & a "bot" crisis. Some observed low quality data, repeated geo-locations & suspicious responses. Some of us hypothesized that this wasn't due to bots, per se, but rather users masquerading as US residents.— Tyler Burleigh (@tylerburleigh) April 20, 2019
3/10 There were some indications this was the case, like this report by @TurkPrime where low-quality was associated with users behind "server farms" (i.e., commercial, non-residential Internet Service Providers -- often used as hosts for proxies).https://t.co/dF7jOKBvyZ— Tyler Burleigh (@tylerburleigh) April 20, 2019
4/10 I've used the protocol & have some results now. Is it worth it? I just finished a N>1000 MTurk study. Using the protocol, I found and filtered out 5.6% of users who were behind a server farm, and a further 1.3% who were outside the US (despite a Location = US qualification)— Tyler Burleigh (@tylerburleigh) April 20, 2019
5/10 I should mention this was using the standard best-practice quality control settings:
"Location = US"
"HIT Approval Rate >= 98%"
"Number of HITs Approved > 100"
— Tyler Burleigh (@tylerburleigh) April 20, 2019
6/10 On the issue of cost, my client was paying ~$1 for the pre-test and ~$5 for the post-test, so the screening protocol saved them ~$600 in payments to potentially suspicious participants.— Tyler Burleigh (@tylerburleigh) April 20, 2019
7/10 What about data quality? We used attention checks. The pre-test survey had 4 (I know, it's a lot), and out of the ~1000 workers who passed the screener, 99% got 3 out of 4 attention checks correct and 95% got all 4 correct.— Tyler Burleigh (@tylerburleigh) April 20, 2019
8/10 Another measure related to quality was retention. The pre- and post-tests were separated by a 2-week "washout period" (this was meant to obscure the relationship between the two surveys). 73% of participants were retained.— Tyler Burleigh (@tylerburleigh) April 20, 2019
9/10 There were open-ended Qs in the post-test. In the past, low-quality workers have been found using phrases like "NICE" and "GOOD" in unusual places. The post-test had Qs like "name 1 thing you liked about X" where such responses would be expected, but I didn't find any.— Tyler Burleigh (@tylerburleigh) April 20, 2019
10/10 So from my POV, MTurk isn't having a data quality crisis or a bot crisis. With the right precautions high-quality data can be had. Many MTurkers use services to hide their origins, and as they seem to produce bad data, screening them out might be another precaution to take.— Tyler Burleigh (@tylerburleigh) April 20, 2019
So to answer the question I posed at the beginning: Is MTurk having a data quality crisis? I think the answer is pretty clearly no – at least, not if you’re following some of the established best practices.