2019 DS Bowl - Part 5 - Scoring & “ground truth” calibration

➡ Return to tylerburleigh.com

In Part 4 I did a deeper dive into the structure of game sessions, trying to understand levels, level progression, and basic game events.

In this session, I’ll look at Assessments and scoring of assessments. I’ll focus on creating an Assessment scoring algorithm that is calibrated and produces scores consistent with the “ground truth” of scores given with the dataset.

load_quietly <- function(package) { # Quietly load package
  suppressWarnings(suppressMessages(library(deparse(substitute(package)), character.only=TRUE))) 
}
load_quietly(tidyverse)
load_quietly(data.table)
load_quietly(jsonlite)
load_quietly(sparklyr)
load_quietly(DT)

source("R/extract_json_2.R")

Start the Spark container.

sc <- spark_connect(master = "local")
train_s <- spark_read_parquet(sc, 'data/train.parquet', memory = FALSE)

What are the Assessments?

train_s %>%
  filter(type == 'Assessment') %>%
  distinct(title)
## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
## Please use `eval_tidy()` with a data mask instead.
## This warning is displayed once per session.
## Warning: `overscope_clean()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
## # Source: spark<?> [?? x 1]
##   title                       
##   <chr>                       
## 1 Cauldron Filler (Assessment)
## 2 Chest Sorter (Assessment)   
## 3 Cart Balancer (Assessment)  
## 4 Mushroom Sorter (Assessment)
## 5 Bird Measurer (Assessment)

There are 5 Assessments in all, and appear to be different from the Games, at least in name.

Let’s turn to some of the documentation. In the competition docs it says the train_labels.csv file contains “ground truth for the assessments in the training set”. Let’s see what that means, because it might help us understand the Assessments more quickly.

train_labels <- fread("data/train_labels.csv")
train_labels %>% head %>%
  datatable(., rownames = FALSE, options = list(pageLength = 5, scrollX = '400px'))

Let’s try scoring one of the assessments from the training set. We’re going to need a faster process than full JSON extraction, so I’ll use grepl to search for correct within the JSON strong.

The competition docs say “Assessment attempts are captured in event_code 4100 for all assessments except for Bird Measurer, which uses event_code 4110.” So we can also filter on those events.

train_s %>% 
  filter(game_session == '1fb6cb9bd71c6a6e',
         event_code == 4100) %>%
  collect(.) %>%
  mutate(correct = grepl('""""correct"""":true', event_data)) %>%
  count(correct)
## # A tibble: 1 x 2
##   correct     n
##   <lgl>   <int>
## 1 FALSE       3
train_labels %>% 
  filter(game_session == '1fb6cb9bd71c6a6e') %>%
  select(game_session, num_correct, num_incorrect)
##       game_session num_correct num_incorrect
## 1 1fb6cb9bd71c6a6e           0             3

I get 3 incorrect responses here, and zero correct responses, which matches the ground truth.

In the documentation it says accuracy_group was scored as follows:

This seems like a simple algorithm to reproduce. Let’s give it a shot for this session we’ve been looking at.

train_s %>% 
  filter(game_session == '1fb6cb9bd71c6a6e' &
         event_code == 4100) %>%
  spark_apply(function(df) {
    library(dplyr)
    df %>% mutate(correct = as.numeric(grepl('""""correct"""":true', event_data))) 
  }) %>%
  select(game_session, event_id, event_code, correct) %>%
  group_by(game_session) %>%
  summarize(num_correct = sum(correct),
            num_incorrect = n() - sum(correct),
            attempts = n(),
            accuracy = sum(correct) / n()) %>%
  mutate(accuracy_group = case_when(
    num_correct == 0 ~ 0,
    num_correct == 1 & attempts >= 3 ~ 1,
    num_correct == 1 & attempts == 2 ~ 2,
    num_correct == 1 & attempts == 1 ~ 3
  )) %>%
  collect(.) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 5, scrollX = '400px'))
## Warning: Missing values are always removed in SQL.
## Use `SUM(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.

Looks good.

Now I’ll try it with the first 5 game sessions of this type in the ground truth file.

train_labels %>% 
  filter(title == 'Chest Sorter (Assessment)') %>%
  head(5) %>%
  arrange(game_session) %>%
  collect(.) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 5, scrollX = '400px'))
train_labels %>% 
  filter(title == 'Chest Sorter (Assessment)') %>%
  select(game_session) %>%
  collect(.) %>%
  head(5) -> game_session_ids
train_s %>% 
  filter(title == 'Chest Sorter (Assessment)' &
         event_code == 4100 &
         game_session %in% local(game_session_ids$game_session)) %>%
  spark_apply(function(df) {
    library(dplyr)
    df %>% mutate(correct = as.numeric(grepl('""""correct"""":true', event_data))) 
  }) %>%
  select(game_session, event_id, event_code, correct) %>%
  group_by(game_session) %>%
  summarize(num_correct = sum(correct),
            num_incorrect = n() - sum(correct),
            attempts = n(),
            accuracy = sum(correct) / n()) %>%
  mutate(accuracy_group = case_when(
    num_correct == 0 ~ 0,
    num_correct == 1 & attempts >= 3 ~ 1,
    num_correct == 1 & attempts == 2 ~ 2,
    num_correct == 1 & attempts == 1 ~ 3
  )) %>%
  arrange(game_session) %>%
  collect(.) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 5, scrollX = '400px'))

These all match, so that’s good.

The last step is to write a generalizable function that would also work for the Bird Measurer Assessment, which uses a different event_code.

score_assessments <- function(spark_df){
  spark_df %>% 
    filter(
      (title != "Bird Measurer (Assessment)" && event_code == 4100) |
      (title == "Bird Measurer (Assessment)" && event_code == 4110)
      ) %>%
    spark_apply(function(df) {
      library(dplyr)
      df %>% mutate(correct = as.numeric(grepl('""""correct"""":true', event_data))) 
    }) %>%
    select(game_session, event_id, event_code, correct) %>%
    group_by(game_session) %>%
    summarize(num_correct = sum(correct),
              num_incorrect = n() - sum(correct),
              attempts = n(),
              accuracy = sum(correct) / n()) %>%
    mutate(accuracy_group = case_when(
      num_correct == 0 ~ 0,
      num_correct == 1 & attempts >= 3 ~ 1,
      num_correct == 1 & attempts == 2 ~ 2,
      num_correct == 1 & attempts == 1 ~ 3
    ))
}

Next, just to be sure that my algorithm is correct, I need to run it over more game sessions to compare with the ground truth. I’ll try with 25 of each accuracy group.

train_labels %>% 
  filter(accuracy_group == 1) %>%
  head(25) -> acc_grp_1

train_labels %>% 
  filter(accuracy_group == 2) %>%
  head(25) -> acc_grp_2

train_labels %>% 
  filter(accuracy_group == 3) %>%
  head(25) -> acc_grp_3

acc_grp_1 %>%
  bind_rows(acc_grp_2, acc_grp_3) %>%
  select(game_session, num_correct, num_incorrect, accuracy, accuracy_group) -> ground_truth

ground_truth %>% 
  arrange(game_session) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 5, scrollX = '400px'))

Now score those same sessions.

train_s %>%
  filter(game_session %in% local(ground_truth$game_session)) %>%
  score_assessments(.) %>%
  select(-attempts) %>%
  collect(.) -> my_scoring

my_scoring %>%
  arrange(game_session) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 5, scrollX = '400px'))

Are they equal? Here I can use the dplor::all_equal() function, which compares two dataframes.

all_equal(ground_truth, my_scoring, convert = TRUE)
## [1] TRUE

OK so these are all equal. But it’s just a subset and I don’t want to take any chances, so I’ll make sure I’ve got them all correct.

First I’ll generate a dataframe containing all of the same sessions scored.

nrow(train_labels) / 250

scoring_results <- data.frame()
for(i in 1:70 ){
  if(i == 1){
    min <- i
  } else{
    min <- ((i-1)*250)+1
  }
    max <- i*250
  
  sessions <- train_labels$game_session[min:max]
  train_s %>%
    filter(game_session %in% local(sessions)) %>%
    score_assessments(.) %>%
    collect(.) -> result
  
  cat('\f')
  cat(i)
  scoring_results <- bind_rows(scoring_results, result)
}

min <- (70*250)+1
max <- nrow(train_labels)

sessions <- train_labels$game_session[min:max]
train_s %>%
  filter(game_session %in% local(sessions)) %>%
  score_assessments(.) %>%
  collect(.) -> result
scoring_results <- bind_rows(scoring_results, result)

A bit of wrangling to make sure the two dataframes contain the same columns.

scoring_results %>%
  select(-attempts) -> scoring_results

train_labels %>%
  select(game_session, num_correct, 
       num_incorrect, accuracy, accuracy_group) -> ground_truth

And then a check for equivalency.

all_equal(ground_truth, scoring_results, convert = TRUE)
## [1] TRUE

It worked! Now I’ll “ship” this function by storing it with my other R function files.