Today I was given a task that sounded pretty straight-forward: What t-shirt size would you send to someone if you don’t know their shirt size, but instead you know their height, weight, and gender?

In fact, it seemed so straight-forward that I was sure there must be prior art out there that I could re-use. A StackOverflow, a mathematical formula, a GitHub repo, a blog post – there had to be something! To my surprise, there wasn’t any, not that I could find anyway.

I guess this is a problem that hasn’t received a lot of attention. Or at least, it’s not the sort of problem that someone in the open source / open science community has tackled.

So I set out to build my own predictive algorithm.

The model and the data

First I would need to model the problem and identify data to build the model. What are the inputs and the outputs? How do the inputs map onto the outputs?

I knew that the only information I’d have to predict shirt sizes from was: gender, height and weight. Sometimes I’d have body mass index (BMI) instead of height and weight.

I also knew that I’d be sending “unisex” t-shirts. I’ve done my share of online t-shirt shopping, so it occurred to me to look at how online clothing stores solve this problem – enter the “size chart”. Okay, so now I’d found a measurement that could be used to classify shirt sizes: chest size.

So now I just had to find a way to get from gender, height, and weight to chest size.

I know from experience that there are several large public datasets that contain health and health-related measurements. One of these is the The National Health and Nutrition Examination Survey (NHANES), which is run by the CDC. I decided to poke around there and see what I could find. I found the variable search tool and started keying in some of these variables: “height”, “weight”, “chest”, “bmi”, “gender”.

Of course they would have gender, height, weight and BMI. It would be a pretty odd health and nutrition study if they didn’t have basic demographics and measurements. But chest measurement? The context seemed a bit odd. What does chest circumference have to do with health and nutrition?

Maybe chest circumference is important if we’re talking about breathing? Eureka! NHANES does have chest measurements for inhalation and exhalation as part of the arthritis datasets!

The final piece of the puzzle was identifying a single year cohort that contained all of these measures together, because I would want the data to be at the level of individual persons. I found all of these measurements were contained in the 2009-2010 cohort, so I downloaded the dataset files for that cohort.

Now my model was complete. I would run a linear regression that predicts chest circumference from gender, height and weight (or bmi). I would train this model on the NHANES data, and I would then use the predicted chest circumference to determine shirt size.

Data wrangling

Read in the data

Of course the data had to be in a SAS format. 🙄

Luckily there’s an R package for reading SAS data files.

library(foreign)
library(tidyverse)
library(MASS)

a <- read.xport("ARX_F.XPT") # Arthritis data
b <- read.xport("BMX_F.XPT") # Body measurements
d <- read.xport("DEMO_F.XPT") # Demographics

These are the variables I’ll pull out from the different dataframes.

SEQN - Participant ID
ARXCCIN - Inhale chest circumference in CM
BMXWT - Weight in KG
BMXHT - Height in CM
BMXBMI - Body Mass Index (BMI)
DMDHRGND - Gender of participant (1 = Male, 2 = Female)

I’m using the inhale chest circumference because I found in some exploratory analyses (not reported here) that it has a stronger correlation to the other measurements. I guess the exhale circumference was more noisy for some reason?

a %>% dplyr::select(id = SEQN, chest_in_cm = ARXCCIN) -> chest_measures
b %>% dplyr::select(id = SEQN, weight_kg = BMXWT, height_cm = BMXHT, bmi = BMXBMI) -> height_weight
d %>% dplyr::select(id = SEQN, gender = DMDHRGND) %>%
  mutate(gender = case_when(gender == 1 ~ "M", TRUE ~ "F")) -> gender

# Join datasets and select only the rows that have all measurements
chest_measures %>%
  left_join(., height_weight, by = c('id')) %>%
  left_join(., gender, by = c('id')) %>%
  filter(!is.na(chest_in_cm),
         !is.na(height_cm), 
         !is.na(weight_kg),
         !is.na(gender), 
         !is.na(bmi)) -> df

Correlations

I’ll check some basic correlations and descriptives. I see that men have larger chests than women, on average. And importantly, BMI does not correlate with chest size as well as weight.

cor(df$weight_kg, df$chest_in_cm)

[1] 0.9006851

cor(df$height_cm, df$chest_in_cm)

[1] 0.4174028

cor(df$bmi, df$chest_in_cm)

[1] 0.7850526

df %>%
  group_by(gender) %>%
  summarize(mean = mean(chest_in_cm))

# A tibble: 2 × 2
  gender  mean
  <chr>  <dbl>
1 F       95.1
2 M       98.0

Model building

I decided to build two models because sometimes I know a person’s height and weight but not their BMI, and other times I know their BMI but not their height and weight.

Although I could take all of the heights and weights and convert them into BMIs, I’m guessing this could lead a loss of information. The correlation above suggests that weight is the best predictor of chest size (better than BMI), so that tells me it might be better to use height and weight when it’s available, but use BMI as a fallback when height and weight are not available.

Height and weight model

To build a model, I’ll run a stepwise linear regression to determine the model of best fit. I’ll enter height, weight, and gender as predictors and allow interaction terms. I’ll use AIC (Akaike’s Information Criterion) for model selection, since it penalizes models with added complexity and I want a parsimonious model that doesn’t overfit the data.

I see that the best model includes all terms, including weight*height and weight*gender interaction terms.

lm(data = subset(df, select= c(chest_in_cm, height_cm, weight_kg, gender)), 
   chest_in_cm ~ .) -> mod
step.model <- stepAIC(mod, direction = "both", trace = TRUE, scope = . ~ .^2)

Start:  AIC=14857.43
chest_in_cm ~ height_cm + weight_kg + gender

                      Df Sum of Sq    RSS   AIC
+ height_cm:weight_kg  1       220 115291 14851
- height_cm            1         2 115513 14856
+ weight_kg:gender     1        89 115421 14856
<none>                             115510 14857
+ height_cm:gender     1         7 115503 14859
- gender               1      2052 117563 14937
- weight_kg            1    397116 512627 21725

Step:  AIC=14850.64
chest_in_cm ~ height_cm + weight_kg + gender + height_cm:weight_kg

                      Df Sum of Sq    RSS   AIC
+ weight_kg:gender     1    144.02 115147 14847
<none>                             115291 14851
+ height_cm:gender     1     13.37 115277 14852
- height_cm:weight_kg  1    219.87 115510 14857
- gender               1   2064.63 117355 14930

Step:  AIC=14846.88
chest_in_cm ~ height_cm + weight_kg + gender + height_cm:weight_kg + 
    weight_kg:gender

                      Df Sum of Sq    RSS   AIC
<none>                             115147 14847
+ height_cm:gender     1     3.499 115143 14849
- weight_kg:gender     1   144.020 115291 14851
- height_cm:weight_kg  1   274.743 115421 14856

summary(step.model)


Call:
lm(formula = chest_in_cm ~ height_cm + weight_kg + gender + height_cm:weight_kg + 
    weight_kg:gender, data = subset(df, select = c(chest_in_cm, 
    height_cm, weight_kg, gender)))

Residuals:
     Min       1Q   Median       3Q      Max 
-21.3179  -3.2347   0.2087   3.4462  15.4911 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         41.491415   4.708485   8.812  < 2e-16 ***
height_cm            0.086786   0.028263   3.071 0.002148 ** 
weight_kg            0.666925   0.055507  12.015  < 2e-16 ***
genderM             -0.030639   0.599462  -0.051 0.959239    
height_cm:weight_kg -0.001087   0.000328  -3.314 0.000925 ***
weight_kg:genderM    0.016955   0.007065   2.400 0.016449 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.001 on 4604 degrees of freedom
Multiple R-squared:  0.8152,    Adjusted R-squared:  0.815 
F-statistic:  4061 on 5 and 4604 DF,  p-value: < 2.2e-16

# Save the model for later
height_weight_model <- step.model

Body Mass Index (BMI) model

Now I’ll do the same with BMI and gender.

lm(data = subset(df, select= c(chest_in_cm, bmi, gender)), chest_in_cm ~ .) -> mod
step.model <- stepAIC(mod, direction = "both", trace = TRUE, scope = . ~ .^2)

Start:  AIC=17986.13
chest_in_cm ~ bmi + gender

             Df Sum of Sq    RSS   AIC
+ bmi:gender  1       725 227076 17973
<none>                    227801 17986
- gender      1     11229 239030 18206
- bmi         1    385144 612946 22547

Step:  AIC=17973.44
chest_in_cm ~ bmi + gender + bmi:gender

             Df Sum of Sq    RSS   AIC
<none>                    227076 17973
- bmi:gender  1    724.73 227801 17986

summary(step.model)


Call:
lm(formula = chest_in_cm ~ bmi + gender + bmi:gender, data = subset(df, 
    select = c(chest_in_cm, bmi, gender)))

Residuals:
     Min       1Q   Median       3Q      Max 
-26.7682  -5.0630   0.2613   5.2463  23.2969 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 56.77733    0.66042  85.971  < 2e-16 ***
bmi          1.31235    0.02203  59.573  < 2e-16 ***
genderM     -0.34122    0.92789  -0.368 0.713084    
bmi:genderM  0.11906    0.03105   3.834 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.021 on 4606 degrees of freedom
Multiple R-squared:  0.6355,    Adjusted R-squared:  0.6353 
F-statistic:  2677 on 3 and 4606 DF,  p-value: < 2.2e-16

# Save the model for later
bmi_model <- step.model

Predicting chest size from height, weight, BMI, and gender

Now that I have the models, I can use them to generate chest size predictions.

Predict chest size given height, weight, and gender

Next we’ll take the final model selected from the above procedure and use it to predict chest circumference, in inches, given height, weight, and gender.

To test the model, I’ll use average values.

mean(df$weight_kg)

[1] 82.48026

mean(df$height_cm)

[1] 168.0299

mean(df$bmi)

[1] 29.12334

Height, weight, and gender

input <- data.frame(height_cm = 168, weight_kg = 83, gender = "F")

# 1 cm = 0.393701 inches
predict(height_weight_model, input) * 0.393701

       1 
37.90109

BMI and gender

input <- data.frame(bmi = 29, gender = "F")
predict(bmi_model, input) * 0.393701

       1 
37.33684

Chest size to shirt size

Finally, I’ll need to take the chest size prediction and convert it to a shirt size. Remember the size chart?

For some reason the chart leaves out “XS” and the ranges lso don’t provide full coverage of the possible chest size values (34-36” and then 38-40”???). So I’ll extend the upper bound of each range to meet the lower bound of each size above it. This will cause the predictions to err on the size of larger sizes rather than smaller sizes, which I think is better because if a shirt is too big, at least you can still wear it!

input <- data.frame(height_cm = 168, weight_kg = 83, gender = "F")

data.frame(chest = predict(height_weight_model,input)[[1]] * 0.393701) %>%
  mutate(shirt_size = case_when(
          chest < 32 ~ "XS",
          between(chest, 32, 36) ~ "S",
          between(chest, 36, 40) ~ "M",
          between(chest, 40, 44) ~ "L",
          between(chest, 44, 48) ~ "XL",
          between(chest, 48, 52) ~ "2XL",
          between(chest, 52, 56) ~ "3XL",
          between(chest, 56, 64) ~ "4XL",
          chest > 64 ~ "5XL"
    )
)

     chest shirt_size
1 37.90109          M

Wrapping it up in a function

And finally, I can put it in a function that will select the model to use for me, based on the information it’s been given.

predict_shirt_size <- function(height_cm, weight_kg, bmi, gender){
  
  if(!is.na(height_cm) & !is.na(weight_kg)){
    input <- data.frame(height_cm = height_cm, weight_kg = weight_kg, gender = gender)
    chest <- predict(height_weight_model, input)[[1]] * 0.393701
  } else if(!is.na(bmi)){
    input <- data.frame(bmi = bmi, gender = gender)
    chest <- predict(bmi_model, input)[[1]] * 0.393701
  } else {
    # Just in case height, weight, and BMI are all missing
    chest <- -99 # A value that gets ignored below
  }
  
  shirt_size = case_when(
              between(chest, 0, 32) ~ "XS",
              between(chest, 32, 36) ~ "S",
              between(chest, 36, 40) ~ "M",
              between(chest, 40, 44) ~ "L",
              between(chest, 44, 48) ~ "XL",
              between(chest, 48, 52) ~ "2XL",
              between(chest, 52, 56) ~ "3XL",
              between(chest, 56, 64) ~ "4XL",
              chest > 64 ~ "5XL",
              
              # Defaults in case height, weight, BMI are missing
              # but gender is not
              gender == "F" ~ "M",
              gender == "M" ~ "L",
              
              # Last resort
              TRUE ~ "L"
            )
  
  return(shirt_size)
}

predict_shirt_size(height_cm = 168, weight_kg = 83, gender = "F")

[1] "M"