Scraping the Mural Arts Philadelphia website
September 22, 2019
Tags: R, scraping

When I visited Philadelphia recently I was left with a lasting impression of the street art. I really liked how the murals added beauty to the neighborhoods, and how each mural was, it seemed, a statement about human virtue and resilience and empowerment. I was also struck by what it meant about the community of Philadelphia organizing around and celebrating this public good, and creating more of it. Philadelphia prides itself on being the “City of Murals”, and I can see why!

I was curious about the history of the murals. When did they start? How are the murals distributed around the city and how does that distribution look across time? In other words, I had some questions that could only be answered by quantitative data.

I started looking through the website, Mural Arts Philadelphia, and to my surprise many murals had been carefully catalogued with locations and the dates when they were completed. Just the data I was looking for, I thought. I’d recently learned how to scrape websites, and this site seemed like a perfect specimen for web scraping.

So I set about scraping the data.

Dealing with pagination

Going back to the Artworks page in my browser, I can see that the pagination follows a simple scheme where the page is given as a URL parameter, like this: https://www.muralarts.org/artworks/?sf_paged=2.

How will I know when I’ve reached the end? By jumping to a page that doesn’t exist (like https://www.muralarts.org/artworks/?sf_paged=200), I can see that the site signals the end with a big “No results found.” message. That element also has a distinct class, searchandfilter-no-results. This means I can just iterate over pages until I see that element, and then stop. Easy!
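Here’s a minimal sketch of that end-of-results check, assuming the httr and rvest packages used for fetching and parsing earlier in the post are loaded:

library(httr)   # GET() and content()
library(rvest)  # html_nodes()

# Fetch a page number well past the end of the listings
site <- GET("https://www.muralarts.org/artworks/?sf_paged=200")

# The "No results found." message lives in an element with this class,
# so a non-empty node set means we've run past the last page
length(html_nodes(content(site), ".searchandfilter-no-results")) > 0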

Function and loop: Putting it all together

The function

So let’s take the procedure I followed above and abstract it into a reusable function, using what we’ve learned about how to fetch, scrape, and deal with pagination. The function will take a page number as input and return a deduplicated vector of links.

arts_on_page <- function(page){
  
  # Set URL to fetch
  url <- paste0('https://www.muralarts.org/artworks?sf_paged=', page)
  
  # Fetch URL
  site <- GET(url)
  
  # Check that we're not at the end (look for the "no results" message)
  if(length(html_nodes(content(site), ".searchandfilter-no-results")) == 0){
    
    # Get the artwork link nodes on the page just fetched
    arts <- html_nodes(content(site), ".click-whole-area")
    
    # Extract the href attribute from each link node
    art_links <- sapply(arts, function(x){
      xml2::xml_attr(x, 'href')
    })
    
    # Dedupe the list
    art_links <- unique(art_links)
    
    # Return the results
    return(art_links)
    
  } else {
    
    # Otherwise, return a vector of length 0
    return(c())
    
  }
  
  
}
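As a quick sanity check before looping (output not shown here), we can try the function on the first page:

# Peek at the artwork links scraped from the first page
head(arts_on_page(1))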

The loop

Now let’s loop through all of the pages until we get a vector of length 0 returned, meaning we’ve reached the end.

# Start with page 1 and keep fetching pages until one returns no links
i <- 1
results <- arts_on_page(i)
all_links <- results
while(length(results) > 0){
  i <- i + 1
  results <- arts_on_page(i)
  all_links <- c(all_links, results)
}

Now we should have all of the links in one vector. How many links are there?

length(unique(all_links))
## [1] 209

Scraping the mural page content

Great! Now we need to figure out how to scrape what we want from each one. I want to know the location, neighborhood, and completion date of each mural. I’ll just pull up a mural page and find the HTML elements that contain these details.

It looks like these are contained in the following elements:

  • Location: .icon-location > .f-primary
  • Neighborhood: .icon-neighborhood > .f-primary
  • Completion date: .icon-calendar > .f-primary

Let’s write another function and loop to extract these details from all of the links.

Function

get_details <- function(link){
  
  # Get link
  site <- GET(link)
  
  # Only proceed if the page has location data; pages without it are skipped
  if(length(html_nodes(content(site), ".icon-location > .f-primary")) > 0){
    
    # Extract details into a data frame
    details <- data.frame(
      url = link,
      location = html_nodes(content(site), ".icon-location > .f-primary") %>%
        xml2::xml_text(),
      neighborhood = html_nodes(content(site), ".icon-neighborhood > .f-primary") %>%
        xml2::xml_text(),
      date = html_nodes(content(site), ".icon-calendar > .f-primary") %>%
        xml2::xml_text() %>%
        as.Date(., "%B %d, %Y") %>%
        as.character()
    )
    
    # Return
    return(details) 
  }
}
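One detail worth noting: the %B %d, %Y format string parses the long-form dates used on the mural pages. For example (an illustrative date, assuming an English locale):

# Parse a long-form date into R's Date class
as.Date("June 15, 2018", "%B %d, %Y")
## [1] "2018-06-15"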

Loop

details <- data.frame()
for(link in unique(all_links)){
  details <- bind_rows(details, get_details(link))
}
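An equivalent, more compact alternative is to let purrr do the looping and row-binding for us (a sketch, assuming the purrr package is available; it is not used elsewhere in this post):

library(purrr)

# Apply get_details() to each link and row-bind the resulting data frames
details <- map_dfr(unique(all_links), get_details)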

Now we have our data!

You’ll notice that we have fewer rows of data than we expect. This is because some of the artworks pages did not have the information we were looking for. For example, this page: https://www.muralarts.org/artworks/spring-arts-district does not have a location, neighborhood, or completion date. Only the pages that had the information we were looking for were saved to the dataframe. This is what we want.

library(DT)
datatable(details, rownames = FALSE,
          options = list(pageLength = 5))

Fin

Although I scraped the data because I wanted to explore and visualize it, a bit more research revealed that only about 200 murals have been catalogued on the Mural Arts website, while Philadelphia has over 3,000 murals. According to the Wikipedia page for Mural Arts, more than 600 murals were painted between 2001 and 2004, but I don’t see that in the data. When I looked at the data, I saw more murals in recent years than in earlier years, so there’s a pretty heavy recency bias. Ah well.
