The Priceonomics Data Puzzle: TreefortBnb
Which Are the Most Expensive Cities in America to Book a Tree Fort?
It’s the year 2020 and the sharing-economy is in full gear. Everything from hosting city dinners to ride-sharing the great outdoors is fair game. Capitalizing on their own interests and experience, eight-year-old startup prodigees, Sally Hershfield and Felicia Alvarez, have taken the sharing-economy by storm with their ever-growing TreefortBnB enterprise.
Their nascent business venture has become such a hit that industry powerhouse AirBnb has not only taken notice but also action. They’ve identified markets where TreefortBnb has expanded most successfully and it’s been rumored that AirBnb aims to undercut their marketshare by providing incentives that push their users to advertise more competitive prices.
Using data obtained from Priceonomics, I’ve identified the median price of tree fort rentals for the top 100 cities with the most units on the market. Indianapolis IN, Malibu CA, and Park City UT are the most expensive cities in the US to rent a tree fort by median pricing, something AirBnb hopes to use to its own advantage by provding lower cost classical home alternatives.
##Load dependencies
library(pipeR)
library(dplyr)
library(formattable)
##URL to download data
url<-"https://s3.amazonaws.com/pix-media/Data+for+TreefortBnB+Puzzle.csv"
#Read in data and aggregate # of units by city and subset to top 100
df1 <- url %>%
read.csv(col.names=c("ID","City","State","Price","Reviews")) %>%
group_by(State, City) %>%
summarize(Unit_Sum=n()) %>%
arrange(desc(Unit_Sum)) %>%
head(100)
#Read in data again and obtain median rental price by city
df2 <- url %>%
read.csv(col.names=c("ID","City","State","Price","Reviews")) %>%
group_by(State, City) %>%
summarize(Price_Median=median(Price)) %>%
arrange(desc(Price_Median))
#Merge two data frames
df3 <-merge(df1, df2, by=c("City", "State"), all.x=TRUE, all.y=FALSE)
df3<-arrange(df3, desc(Price_Median))
#Rename columns
colnames(df3)[3]<-"Unit Total"
colnames(df3)[4]<-"Median Price"
#Produce output table in html
formattable(df3, list(
'Median Price' = color_bar("orange")))| City | State | Unit Total | Median Price |
|---|---|---|---|
| Indianapolis | IN | 251 | 650.0 |
| Malibu | CA | 92 | 304.0 |
| Park City | UT | 229 | 299.0 |
| Healdsburg | CA | 49 | 275.0 |
| Truckee | NV | 67 | 275.0 |
| Laguna Beach | CA | 68 | 268.5 |
| Incline Village | NV | 118 | 259.0 |
| Manhattan Beach | CA | 55 | 209.0 |
| Charlotte | NC | 225 | 200.0 |
| Napa | CA | 68 | 200.0 |
| Sonoma | CA | 115 | 200.0 |
| Austin | TX | 2836 | 199.0 |
| La Jolla | CA | 53 | 195.0 |
| Sausalito | CA | 71 | 190.0 |
| Hermosa Beach | CA | 48 | 189.5 |
| Sunny Isles Beach | FL | 161 | 180.0 |
| New York | NY | 8043 | 170.0 |
| Beverly Hills | CA | 74 | 160.0 |
| Boston | MA | 613 | 160.0 |
| Newport Beach | CA | 84 | 160.0 |
| Marina del Rey | CA | 80 | 155.0 |
| Venice | CA | 242 | 155.0 |
| Louisville | KY | 86 | 150.0 |
| Miami Beach | FL | 1345 | 150.0 |
| Mill Valley | CA | 80 | 150.0 |
| New Orleans | LA | 833 | 150.0 |
| San Francisco | CA | 3622 | 150.0 |
| Santa Monica | CA | 500 | 150.0 |
| Miami | FL | 640 | 149.5 |
| Taos | NM | 52 | 148.5 |
| Sebastopol | CA | 49 | 145.0 |
| San Diego | CA | 986 | 130.0 |
| Scottsdale | AZ | 121 | 129.0 |
| Washington | DC | 1402 | 129.0 |
| West Hollywood | CA | 229 | 129.0 |
| Cambridge | MA | 300 | 125.0 |
| Charleston | SC | 84 | 125.0 |
| Las Vegas | NV | 291 | 125.0 |
| Naples | FL | 50 | 125.0 |
| San Rafael | CA | 61 | 125.0 |
| Santa Cruz | CA | 127 | 125.0 |
| Anchorage | AK | 50 | 123.5 |
| Brookline | MA | 47 | 120.0 |
| Palo Alto | CA | 100 | 120.0 |
| Santa Rosa | CA | 71 | 120.0 |
| Hollywood | FL | 150 | 115.5 |
| Colorado Springs | CO | 54 | 113.0 |
| Davenport | FL | 108 | 113.0 |
| Mountain View | CA | 72 | 113.0 |
| Brooklyn | NY | 4368 | 110.0 |
| Kissimmee | FL | 266 | 110.0 |
| Los Angeles | CA | 3236 | 110.0 |
| Savannah | GA | 91 | 107.0 |
| Arlington | VA | 214 | 106.5 |
| Paris | RI | 48 | 105.5 |
| Honolulu | HI | 363 | 105.0 |
| Alexandria | VA | 66 | 104.5 |
| Berkeley | CA | 328 | 100.0 |
| Boulder | CO | 265 | 100.0 |
| Carlsbad | CA | 54 | 100.0 |
| Chicago | IL | 1255 | 100.0 |
| Fort Lauderdale | FL | 151 | 100.0 |
| Houston | TX | 175 | 100.0 |
| Long Beach | CA | 113 | 100.0 |
| Philadelphia | PA | 457 | 100.0 |
| San Jose | CA | 104 | 100.0 |
| Seattle | WA | 824 | 100.0 |
| Somerville | MA | 99 | 100.0 |
| Tampa | FL | 78 | 100.0 |
| Pasadena | CA | 86 | 99.5 |
| Queens | NY | 457 | 99.0 |
| Ann Arbor | MI | 63 | 98.0 |
| Minneapolis | MN | 138 | 98.0 |
| Long Island City | NY | 190 | 95.5 |
| Glendale | CA | 70 | 90.0 |
| Oakland | CA | 434 | 90.0 |
| Tucson | AZ | 195 | 90.0 |
| Salt Lake City | UT | 140 | 89.5 |
| Atlanta | GA | 302 | 89.0 |
| Dallas | TX | 114 | 87.0 |
| Denver | CO | 283 | 86.0 |
| Eugene | OR | 78 | 85.0 |
| Jersey City | NJ | 82 | 85.0 |
| Madison | WI | 59 | 85.0 |
| Phoenix | AZ | 167 | 85.0 |
| Providence | RI | 72 | 85.0 |
| Baltimore | MD | 137 | 80.0 |
| Nashville | TN | 185 | 80.0 |
| Pittsburgh | PA | 99 | 80.0 |
| San Antonio | TX | 77 | 80.0 |
| Portland | OR | 819 | 79.0 |
| Albuquerque | NM | 95 | 75.0 |
| Bronx | NY | 113 | 75.0 |
| New Haven | CT | 52 | 75.0 |
| Orlando | FL | 116 | 75.0 |
| Silver Spring | MD | 53 | 75.0 |
| Cincinnati | OH | 47 | 70.0 |
| Sacramento | CA | 57 | 65.0 |
| Durham | NC | 63 | 60.0 |
| Raleigh | NC | 57 | 60.0 |
Basic listed prices, however, may not reflect actual rental costs because not all advertised listings are actually rented out. One way to identify median rental prices for listings that have actually been rented is to subset the data to include only those listings with reviews. The working assumption here is that listings with no reviews may never have actually been rented. For listings with at least one review, Carmel CA, Malibu CA, and Incline Village NV are the most expensive by median rental price.
#Read in data and aggregate # of units by city and subset to top 100
df1.b <- url %>%
read.csv(col.names=c("ID","City","State","Price","Reviews")) %>%
subset(Reviews > 0) %>% #subset to only include reviewed listings
group_by(State, City) %>%
summarize(Unit_Sum=n()) %>%
arrange(desc(Unit_Sum)) %>%
head(100)
#Read in data again and obtain median rental price by city
df2.b <- url %>%
read.csv(col.names=c("ID","City","State","Price","Reviews")) %>%
subset(Reviews > 0) %>% #subset to only include reviewed listings
group_by(State, City) %>%
summarize(Price_Median=median(Price)) %>%
arrange(desc(Price_Median))
#Merge two data frames
df3.b <-merge(df1.b, df2.b, by=c("City", "State"), all.x=TRUE, all.y=FALSE)
df3.b<-arrange(df3.b, desc(Price_Median))
#Rename columns
colnames(df3.b)[3]<-"Unit Total"
colnames(df3.b)[4]<-"Median Price"
#Produce output table in html
df3.b<-head(df3.b, 10)
formattable(df3.b, list(
'Median Price' = color_bar("orange")))| City | State | Unit Total | Median Price |
|---|---|---|---|
| Carmel | CA | 21 | 300.0 |
| Malibu | CA | 39 | 225.0 |
| Incline Village | NV | 21 | 200.0 |
| Laguna Beach | CA | 25 | 200.0 |
| Truckee | NV | 30 | 200.0 |
| Napa | CA | 44 | 184.5 |
| Park City | UT | 101 | 179.0 |
| Sunny Isles Beach | FL | 98 | 177.5 |
| Sonoma | CA | 92 | 175.0 |
| New York | NY | 5597 | 165.0 |
Let’s Map it!
Let’s see if these rental prices cluster by location across the lower 48 states. Using a function I wrote based on Microsoft’s Bing’s API, I’ve managed to batch geocode the locations provided in the TreefortBnb data file.
#Load Dependencies
library(ggplot2)
library(cowplot)
#Batch geocode with Bing
options(BingMapsKey="AtquMkrNaB7ME7krIpwQgrTwEqwB0HbUEpRKb9wfpBW-xCbgBzrGabEyUGkdpO0G")
df3$Location<-paste(df3$City, df3$State, sep = ", ")
df30<-t(geocodeVect(df3$Location, service="bing", returntype="coordinates"))
df10<-as.data.frame(df30)
colnames(df10)<-c("lat","lon")
df3<-cbind(df3, df10)
df3<-as.data.frame(df3)
#Subset to only include lower 48
df3<- df3 %>%
subset(State!="HI") %>%
subset(State!="AK")
row.names(df3) <- NULL
states <- map_data("state")
states <- states[order(states$order), ]
ggplot() +
geom_polygon(data=states, aes(long, lat, group=group),
size=0.1,fill="black", color="white", alpha=0.75) +
geom_point(data=df3, aes(lon, lat, size= `Median Price`),
color="blue", alpha=0.5) +
geom_point(data=df3, shape = 1,
aes(lon, lat, size=`Median Price`),
colour = "white", alpha=0.7) +
theme(axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="bottom",
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
plot.background=element_blank())
As a final exercise, let’s compare median AirBnb prices to median TreefortBnb prices. To accomplish that, I’ve scraped data from Priceonomics’s online AirBnb to hotel comparison table. Unfortunately, the data on AirBnb rental prices is not as exhaustive as the data for TreefortBnb (less populated locations are missing for AirBnb), thus, I’ve restricted my analysis to the top 25 most rented locations for TreefortBnb in an effort to increase the overlap between the two samples. Data on AirBnb rental prices for cities on the TreeforBnb list but not on Priceonomics’s list of AirBnb prices I’ve imputed using data drawn from current averages for each location as shown on AirBnb’s own website (I used this method for 5 cities: Brooklyn, Queens, Kissimmee, Miami Beach, and Santa Monica).
The graph below shows the first difference between TreefortBnb and AirBnb median rental prices across the 25 busiest TreefortBnb cities across the US. Negative values indicate locations where median AirBnb rental prices are more expensive than their TreefortBnb counterparts. Queens, NY (highlighted in red) is the only location where median prices are equivalent, although, median prices in Washington DC, Boulder, CO, and Denver, CO are very close.
# Load dependencies
library(jsonlite)
library(data.table)
# Scrape data from site
appData <- fromJSON("http://priceonomics.com/static/js/hotels/all_data.json")
# replicate table
data2 <- data.frame(City = names(appData), Price = sapply(appData, function(x) x$air$apt$p),
stringsAsFactors = FALSE)
# Arrange data
setDT(data2)
data2 <- data2[order(Price, decreasing = TRUE)]
data2 <- as.data.frame(data2)
colnames(data2) <- c("Location", "AirPrice")
# Subset to only include 25 busiest tree fort markets
dddd <- df3
colnames(dddd)[3] <- "UnitTotal"
data3 <- arrange(dddd, desc(UnitTotal))
data3 <- head(data3, 25)
# Imput missing AirBnb values
data4 <- merge(data3, data2, by = "Location", all.x = TRUE, all.y = FALSE)
data4$AirPrice <- as.character(data4$AirPrice)
data4[6, 8] <- 104
data4[10, 8] <- 172
data4[13, 8] <- 158
data4[20, 8] <- 99
data4[23, 8] <- 141
colnames(data4)[5] <- c("PriceMedian")
data4$PriceMedian <- as.numeric(data4$PriceMedian)
data4$AirPrice <- as.numeric(data4$AirPrice)
# Generate first difference
data4$first_diff <- data4$PriceMedian - data4$AirPrice #Negative value
# mean AirBnb more expensive - positive means TreefortBnb is more
# expensive
# Create variable distinguishing neg from pos values of 1st difference
data4$abs_first_diff <- abs(data4$first_diff)
data4$Air_pricier <- ifelse(data4$first_diff < 0, 0, 1)
data4$Air_pricier <- as.factor(data4$Air_pricier)
# Plot
library(cowplot) #theme
a <- ifelse(data4$first_diff < 1, ifelse(data4$Air_pricier != 0, "red",
"darkgrey"), "blue")
ggplot(data4, aes(x = Location, y = first_diff)) + geom_bar(stat = "identity",
aes(fill = Air_pricier)) + ylab("First Difference") + xlab("") + theme(axis.text.x = element_text(angle = 35,
hjust = 1)) + theme(axis.text.x = element_text(colour = a, size = 10)) +
scale_fill_manual(values = c("darkgrey", "darkblue")) + theme(legend.position = "none",
panel.background = element_blank())