2019 Data Science Bowl - Part 4 - More on Game Sessions

➡ Return to tylerburleigh.com

Let’s pick up where I left off in Part 3. There I was exploring the data contained within a Game session, and I fixed a bug in my JSON extraction function.

In this part I go much deeper into the exploration of game structure. I’ll try to answer key questions about levels, level progression (are higher levels more difficult, and does difficulty climb indefinitely?), basic game events (e.g., what determines when a game session is “finished”?), and if highest level reached can be used as a positive predictor of game skill.

load_quietly <- function(package) { # Quietly load package
  suppressWarnings(suppressMessages(library(deparse(substitute(package)), character.only=TRUE))) 
}
load_quietly(tidyverse)
load_quietly(data.table)
load_quietly(jsonlite)
load_quietly(sparklyr)
load_quietly(DT)

source("R/extract_json_2.R")

Start the Spark container.

sc <- spark_connect(master = "local")
train_s <- spark_read_parquet(sc, 'data/train.parquet', memory = FALSE)

Games

I’ll load the session I was looking at last time and extract the JSON data.

train_s %>% filter(game_session == 'f3bd0e3634993d85') -> session1
extract_json_2(session1, nrows = sdf_nrow(session1)) -> session1
## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
## Please use `eval_tidy()` with a data mask instead.
## This warning is displayed once per session.
## Warning: `overscope_clean()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.

Levels and rounds

session1 %>% 
  head %>%
  collect(.) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 3, scrollX = '400px'))

This game is in the “MAGMAPEAK” world, which according to the competition documentation is a world that teaches / assesses “capacity/displacement”.

I see now that the game session has levels and rounds and lots of questions come to mind about this.

  • Does every game session have the same number of levels and rounds?
  • Is level/round progression based on performance?
  • Are higher levels more difficult?
  • Does the game run on a timer?
  • Does the game finish when a user makes too many mistakes?

Some of these questions are more important than others. Level progression is important, because if a player only gets to the next level when they perform well on the current one, then the highest level reached could be a proxy for overall performance. For example, if one player reached level 10 because they performed well, while another player reached 5 because they didn’t perform as well, then 10 and 5 are quantities that would represent each players’ performance.

If the dataset contains incomplete game sessions, then this is important because abandoning a game session could indicate the user was bored or frustrated because the game was too easy or too hard, respectively. It would be really interesting if I could identify cases of game abandonment due to frustration. Alternatively, if the game runs on a timer then some game sessions will be shorter than others for that reason.

I can answer some of these questions by looking at the levels and rounds.

session1 %>% 
  arrange(event_count) %>%
  distinct(level, round)
## # A tibble: 103 x 2
##    level round
##    <int> <int>
##  1     0     0
##  2     1     0
##  3     1     1
##  4     1     2
##  5     2     2
##  6     2     3
##  7     2     4
##  8     3     4
##  9     3     5
## 10     3     6
## # ... with 93 more rows
session1 %>% 
  arrange(event_count) %>%
  distinct(level, round) %>%
  count(level)
## # A tibble: 21 x 2
##    level     n
##    <int> <int>
##  1     0     1
##  2     1     3
##  3     2     3
##  4     3     3
##  5     4     4
##  6     5     4
##  7     6     4
##  8     7     5
##  9     8     5
## 10     9     5
## # ... with 11 more rows
max(session1$level)
## [1] 20
max(session1$round)
## [1] 82
max(session1$game_time)
## [1] 605833

This game session had 20 levels (21 if you count level “0”) and 82 rounds, and the total duration was ~606 seconds.

Next, I can pull up a different session for the same game and see how it compares.

train_s %>% 
  filter(title == 'Scrub-A-Dub') %>% 
  distinct(game_session) %>% 
  head(2)
## # Source: spark<?> [?? x 1]
##   game_session    
##   <chr>           
## 1 b4189588a23b20e4
## 2 32119a4604236fcf
train_s %>% filter(game_session == 'b4189588a23b20e4') -> session2
extract_json_2(session2, nrows = sdf_nrow(session2)) -> session2
session2 %>% head
## # A tibble: 6 x 37
##   event_id game_session timestamp           installation_id event_count
##   <chr>    <chr>        <dttm>              <chr>                 <int>
## 1 6d90d394 b4189588a23~ 2019-08-09 17:01:24 ecba8eee                  1
## 2 7040c096 b4189588a23~ 2019-08-09 17:01:38 ecba8eee                  2
## 3 5a848010 b4189588a23~ 2019-08-09 17:01:40 ecba8eee                  3
## 4 4a09ace1 b4189588a23~ 2019-08-09 17:01:54 ecba8eee                  4
## 5 dcaede90 b4189588a23~ 2019-08-09 17:01:56 ecba8eee                  5
## 6 26fd2d99 b4189588a23~ 2019-08-09 17:01:56 ecba8eee                  6
## # ... with 32 more variables: event_code <int>, game_time <int>,
## #   title <chr>, type <chr>, world <chr>, version <fct>, level <int>,
## #   round <int>, event_count1 <int>, game_time1 <int>, event_code1 <int>,
## #   coordinates.x <int>, coordinates.y <int>,
## #   coordinates.stage_width <int>, coordinates.stage_height <int>,
## #   movie_id <chr>, duration <int>, description <chr>,
## #   round_target.size <int>, round_target.type <chr>,
## #   round_target.animal <chr>, options <chr>, animals <chr>,
## #   identifier <chr>, media_type <fct>, total_duration <int>, size <int>,
## #   item_type <chr>, position <int>, animal <chr>, correct <lgl>,
## #   misses <int>
max(session2$level)
## [1] 13
max(session2$round)
## [1] 43
max(session2$game_time)
## [1] 478786

Interesting. This game session had only 13 levels and 43 rounds, and the total duration was ~479 seconds. So not all game sessions have the same number of levels and rounds, and not all games have the same duration. But why the difference?

session2 %>% 
  arrange(event_count) %>%
  distinct(level, round) %>%
  collect(.) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 10, scrollX = '400px'))
session2 %>% 
  arrange(event_count) %>%
  distinct(level, round) %>%
  count(level) %>%
  collect(.) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 10, scrollX = '400px'))

Let’s overlay the level sequence for the two game sessions on the same plot so we can see if there are any differences in level progression.

session1 %>% 
  arrange(event_count) %>%
  distinct(level, round) %>%
  mutate(session = 'f3bd0e3634993d85') %>%
  rowid_to_column() -> session1_levels_rounds

session2 %>% 
  arrange(event_count) %>%
  distinct(level, round) %>%
  mutate(session = 'b4189588a23b20e4') %>%
  rowid_to_column() -> session2_levels_rounds

session1_levels_rounds %>%
  bind_rows(session2_levels_rounds) -> combined

ggplot(combined, aes(x = rowid, y = level, group = session)) + 
  geom_line(aes(linetype = session, color = session)) +
  geom_point(aes(color = session))

This plot tells me that the game sessions follow the same level/round trajectory, and that some game sessions are longer than others. But given the fact that session b4189588a23b20e4 ran shorter despite having fewer levels, it doesn’t seem likely that the game is played on a timer (in which case the player didn’t “run out of time”).

I wonder if there’s an event that identifies when the game is finished? This would allow me to determine if a game session was abandoned or not.

Let’s pull up the last event in the 20-level game and see if it was unique.

# First grab the last event
session1 %>%
  arrange(event_count) %>%
  tail(1)
## # A tibble: 1 x 38
##   event_id game_session timestamp           installation_id event_count
##   <chr>    <chr>        <dttm>              <chr>                 <int>
## 1 73757a5e f3bd0e36349~ 2019-08-04 15:28:41 eca4e454                635
## # ... with 33 more variables: event_code <int>, game_time <int>,
## #   title <chr>, type <chr>, world <chr>, version <fct>, level <int>,
## #   round <int>, event_count1 <int>, game_time1 <int>, event_code1 <int>,
## #   coordinates.x <int>, coordinates.y <int>,
## #   coordinates.stage_width <int>, coordinates.stage_height <int>,
## #   movie_id <chr>, duration <int>, description <chr>, time_played <int>,
## #   round_target.size <int>, round_target.type <chr>,
## #   round_target.animal <chr>, options <chr>, animals <chr>,
## #   identifier <chr>, media_type <fct>, total_duration <int>, size <int>,
## #   item_type <chr>, position <int>, animal <chr>, correct <lgl>,
## #   misses <int>
# Then check if the event_id was unique in the session
session1 %>%
  filter(event_id == '73757a5e') %>%
  nrow
## [1] 20

The very last event in the game session occurred 20 times, so it’s not an “end-game” event, but it could be an “end-round” event.

At this point it might be helpful to cross-reference game events to the description of game event contained in the specs.csv file. Per the competition instructions “This file gives the specification of the various event types.”

specs <- fread('data/specs.csv')
specs %>% select(event_id, info) -> event_specs

Next, we can join the unique events for a game session to this new event_specs dataframe to see what the events mean.

session1 %>%
  distinct(event_id) %>%
  left_join(event_specs, by = "event_id") -> session1_event_info

session1_event_info
## # A tibble: 17 x 2
##    event_id info                                                           
##    <chr>    <chr>                                                          
##  1 6d90d394 The start game event is triggered at the very beginning of the~
##  2 7040c096 This event occurs when the player clicks to start the game fro~
##  3 5a848010 The movie started event triggers when an intro or outro movie ~
##  4 c1cac9a2 The movie skipped event triggers when the player presses the s~
##  5 dcaede90 The start level event is triggered when a new level begins (at~
##  6 26fd2d99 The start round event is triggered at the start of a round. Th~
##  7 f71c4741 "The system-initiated instruction event occurs when the game d~
##  8 f7e47413 "The end of system-initiated instruction event occurs when the~
##  9 5c3d2b2f This event occurs when the player clicks on an item on the she~
## 10 d88e8f25 "The system-initiated feedback (Incorrect) event occurs when t~
## 11 ac92046e "The end of system-initiated feedback (Incorrect) event occurs~
## 12 08fd73f3 The beat round event is triggered when the player finishes a r~
## 13 73757a5e "The system-initiated feedback (Correct) event occurs when the~
## 14 2b9272f4 "The end of system-initiated feedback (Correct) event occurs w~
## 15 37c53127 The beat level event is triggered when a level has been comple~
## 16 cf82af56 This event occurs when the player clicks on something that isn~
## 17 4a09ace1 The movie ended event triggers when an intro or outro movie en~

In all, there were 17 distinct events.

Interestingly, although I see an event that represents the start of the game (6d90d394), I don’t see one that represents the end of a game. But such events do exist, if I keyword search the entire event list I can find several that are “triggered when the game is quit”.

event_specs %>% filter(grepl("quit", info)) -> quit_events
quit_events %>% head(2)
##   event_id
## 1 a8cc6fec
## 2 1b54d27f
##                                                                                                                                                                                                                           info
## 1 The exit game event is triggered when the game is quit. This is used to compute things like time spent in game. Depending on platform this may / may not be possible. NOTE: â\200œquitâ\200\235 also means navigating away from game.
## 2 The exit game event is triggered when the game is quit. This is used to compute things like time spent in game. Depending on platform this may / may not be possible. NOTE: â\200œquitâ\200\235 also means navigating away from game.

Some of these event descriptions are quite informative. For example:

event_specs %>%
  filter(event_id == '73757a5e') %>%
  select(info) %>%
  as.character()
## [1] "The system-initiated feedback (Correct) event occurs when the game starts delivering feedback to the player in response to a correct round attempt (pressing the go button with the correct answer). It contains information that describes the content of the instruction, and will likely occur in conjunction with a beat round event. These events are used to determine the effectiveness of the feedback. We can answer questions like, \"\"did players who received feedback X do better than those who did not?\"\""

Before starting the competition, I had a hunch that one of the factors would be performance feedback. This event description confirms it: “These events are used to determine the effectiveness of the feedback. We can answer questions like, ‘did players who received feedback X do better than those who did not?’”.

Anyway I digress, let’s see if we can find a session of ‘Scrub-A-Dub’ that has one of these quit events.

train_s %>% 
  filter(title == 'Scrub-A-Dub',
         event_id %in% local(quit_events$event_id)) %>% 
  distinct(game_session)
## # Source: spark<?> [?? x 1]
## # ... with 1 variable: game_session <chr>

Huh. So maybe the quit event resides outside of the game session?

Let’s pull some of the events that occurred just after the first game session started. We can do this by filtering on installation_id and timestamp

session1$installation_id[[1]] -> this_install_id
max(session1$timestamp) -> this_timestamp

train_s %>%
  filter(installation_id == local(this_install_id) &
         timestamp > local(this_timestamp)) %>%
  arrange(timestamp) %>%
  head() %>%
  select(event_id, game_session, timestamp)
## # Source:     spark<?> [?? x 3]
## # Ordered by: timestamp
##   event_id game_session     timestamp          
##   <chr>    <chr>            <dttm>             
## 1 27253bdc de4ac084bd49e3ed 2019-08-05 14:56:19
## 2 27253bdc e77771d9231ebd71 2019-08-05 14:57:25
## 3 6d90d394 50c69b9abe8b67d0 2019-08-05 14:58:36
## 4 7040c096 50c69b9abe8b67d0 2019-08-05 14:58:40
## 5 5a848010 50c69b9abe8b67d0 2019-08-05 14:58:42
## 6 cf82af56 50c69b9abe8b67d0 2019-08-05 14:58:42

Huh. So there wasn’t a game quit event that followed this game session.

Can I find any game quit events for this user & game title?

session1$title[[1]] -> this_game_title

train_s %>%
  filter(installation_id == local(this_install_id) &
         title == local(this_game_title) &
         event_id %in% local(quit_events$event_id)) %>%
  head()
## # Source: spark<?> [?? x 11]
## # ... with 11 variables: event_id <chr>, game_session <chr>,
## #   timestamp <dbl>, event_data <chr>, installation_id <chr>,
## #   event_count <int>, event_code <int>, game_time <int>, title <chr>,
## #   type <chr>, world <chr>

Nothing. Are there any game quit events for any users for this title?

train_s %>%
  filter(title == local(this_game_title) &
         event_id %in% local(quit_events$event_id)) %>%
  head()
## # Source: spark<?> [?? x 11]
## # ... with 11 variables: event_id <chr>, game_session <chr>,
## #   timestamp <dbl>, event_data <chr>, installation_id <chr>,
## #   event_count <int>, event_code <int>, game_time <int>, title <chr>,
## #   type <chr>, world <chr>

None. Are there any quit events at all?

train_s %>%
  filter(event_id %in% local(quit_events$event_id)) %>%
  head() %>%
  collect(.) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 5, scrollX = '400px'))

Ahhhh. Yes, there are quit events. But the ones I see here are all for Assessment types. I wonder if I find any for any of the games.

train_s %>%
  filter(event_id %in% local(quit_events$event_id) &
         type == 'Game') %>%
  head()
## # Source: spark<?> [?? x 11]
##   event_id game_session timestamp           event_data installation_id
##   <chr>    <chr>        <dttm>              <chr>      <chr>          
## 1 17ca3959 6a672238d39~ 2019-08-20 16:50:39 "\"{\"\"\~ 90c8e087       
## 2 4074bac2 173608de942~ 2019-09-16 13:08:12 "\"{\"\"\~ 33faef73       
## # ... with 6 more variables: event_count <int>, event_code <int>,
## #   game_time <int>, title <chr>, type <chr>, world <chr>

Only 2 in the entire training set? That seems strange. Maybe there was a bug or a critical UX/UI problem that caused the quit events not to get logged.

I’d still like to know if level 20 was the last level.

This will be a bit tricky, because the level variable is stored in the JSON and I don’t really want to process all the JSONs for all of the events. Instead, I’ll take the events with the largest timestamps for each game session for this game title, then I’ll convert those JSONs and examine the level number.

train_s %>%
  filter(title == local(this_game_title)) %>%
  group_by(game_session) %>%
  filter(timestamp == max(timestamp)) -> last_game_events

extract_json_2(last_game_events, nrows = sdf_nrow(last_game_events)) -> last_game_events

Now that we’ve collected the max level reached for all of the sessions played for this game, we can look at a distribution of levels.

ggplot(last_game_events, aes(x = level)) + 
  geom_histogram(bins = 25)

max(last_game_events$level)
## [1] 47

Wow! There’s quite a long tail and the highest level obtained by any player was 47.

Now I want to look at that game session!

last_game_events %>% filter(level == 47) -> highest_level_session
train_s %>% filter(game_session == 'bb1f09ec062b6660') -> session3
extract_json_2(session3, nrows = sdf_nrow(session3)) -> session3
load("part-4-session3.Rds")

Let’s stack it against the other two game sessions. This time I’ll dodge the lines.

session3 %>% 
  arrange(event_count) %>%
  distinct(level, round) %>%
  mutate(session = 'bb1f09ec062b6660') %>%
  rowid_to_column() -> session3_levels_rounds

session1_levels_rounds %>%
  bind_rows(session2_levels_rounds) %>%
  bind_rows(session3_levels_rounds) -> combined

ggplot(combined, aes(x = rowid, y = level, group = session)) + 
  geom_line(aes(linetype = session, color = session), position = position_dodge(20))
## Warning: position_dodge requires non-overlapping x intervals

When I was looking at the first game session, I noticed that levels seemed to have an increasing number of rounds. Does this pattern persist past 20 rounds?

session3 %>% 
  arrange(event_count) %>%
  distinct(level, round) %>%
  count(level)
## # A tibble: 48 x 2
##    level     n
##    <int> <int>
##  1     0     1
##  2     1     3
##  3     2     3
##  4     3     3
##  5     4     4
##  6     5     4
##  7     6     4
##  8     7     5
##  9     8     5
## 10     9     5
## # ... with 38 more rows

No it seems to reach a plateau at level 10. Maybe level 10 is also when the game difficulty peaks? What is the difficulty based on? If this is like experimental cognitive tasks, then there must be some kind of quantity that changes that I can see in the metadata. I noticed looking at the data just now that the number of options seems to change from one level to another. This is specifically for the 26fd2d99 events.

session3 %>%
  filter(event_id == '26fd2d99') %>%
  select(event_id, level, options, animals, round_target.animal) %>%
  collect(.) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 10, scrollX = '400px'))

Bingo! The number of rounds is proportional to the number of animals (number of rounds = number of animals + 1), which also represents the difficulty of the level (more animals = more difficulty). And indeed, difficulty does peak at level 10.

This changes my interpretation of max level quite a bit, because if difficulty peaks at level 10, then that means the game might start to seem repetitive to the player after level 10. Why continue playing if you beat level 10 and level 11 isn’t any different? I’d probably quit at that point unless the game was really fun.

The last question that I want to answer is whether the game ends when a user makes too many mistakes. This could explain why some sessions are longer than others, besides boredom / frustration.

To answer this, I’ll aggregate the correct and misses values by level.

session3 %>% 
  arrange(event_count) %>%
  group_by(level) %>%
  filter(!is.na(correct)) %>%
  summarize(n = n(),
            pct_correct = round(sum(correct) / n, 2)) -> session3_pct_correct

session3 %>% 
  arrange(event_count) %>%
  group_by(level) %>%
  filter(!is.na(misses)) %>%
  summarize(misses = sum(misses)) -> session3_misses

session3_pct_correct %>%
  left_join(session3_misses, by = 'level') %>%
  select(level, pct_correct, misses) -> session3_perf

melt(session3_perf, id.vars = c("level")) -> session3_perf_long

ggplot(session3_perf_long, aes(x = level, y = value, group = variable)) + 
  geom_line() + 
  facet_grid(rows = vars(variable), scales = "free_y")

Interesting. There is a strong congruence between correct and number of misses.

Let’s look at the other two game sessions.

session1 %>% 
  arrange(event_count) %>%
  group_by(level) %>%
  filter(!is.na(correct)) %>%
  summarize(n = n(),
            pct_correct = round(sum(correct) / n, 2)) -> session1_pct_correct

session1 %>% 
  arrange(event_count) %>%
  group_by(level) %>%
  filter(!is.na(misses)) %>%
  summarize(misses = sum(misses)) -> session1_misses

session1_pct_correct %>%
  left_join(session1_misses, by = 'level') %>%
  select(level, pct_correct, misses) -> session1_perf

melt(session1_perf, id.vars = c("level")) -> session1_perf_long

ggplot(session1_perf_long, aes(x = level, y = value, group = variable)) + 
  geom_line() + 
  facet_grid(rows = vars(variable), scales = "free_y")

session2 %>% 
  arrange(event_count) %>%
  group_by(level) %>%
  filter(!is.na(correct)) %>%
  summarize(n = n(),
            pct_correct = round(sum(correct) / n, 2)) -> session2_pct_correct

session2 %>% 
  arrange(event_count) %>%
  group_by(level) %>%
  filter(!is.na(misses)) %>%
  summarize(misses = sum(misses)) -> session2_misses

session2_pct_correct %>%
  left_join(session2_misses, by = 'level') %>%
  select(level, pct_correct, misses) -> session2_perf

melt(session2_perf, id.vars = c("level")) -> session2_perf_long

ggplot(session2_perf_long, aes(x = level, y = value, group = variable)) + 
  geom_line() + 
  facet_grid(rows = vars(variable), scales = "free_y")