The purpose of this project is to recreate a graph that appeared in the Economist. The graph we want describes yearly cost of attending college versus college selectivity. The R libraries need are dplyr, ggplot2, and tidyr.

First we will read in the ScorecardSmall.Rda file. This is a file that contains admission rate and cost information for colleges. We are interested in just a few columns, namely CONTROL (1 for public, 2 for private), ADM_RATE (admissions rate), NPT4i_PUB (the average cost for families in the ith quintile for a public institution), and NPT4i_PRIV (similar only for private institutions). To similify things for us, we will convert the data frame into a narrow format where a column named long_name will denote the type and net will be the value of NPT4i_.

load("ScorecardSmall.Rda")
 ScorecardSmall <- ScorecardSmall %>%
    filter(CONTROL != 3) %>%
    gather(long_name, net, NPT41_PUB, NPT43_PUB, NPT45_PUB, NPT41_PRIV, NPT43_PRIV, NPT45_PRIV) %>%
    select(CONTROL, INSTNM, ADM_RATE, long_name, net) %>%
    filter(complete.cases(.))

Now to fix the problem of the different suffixes of PUB and PRIV, we read in a csv file and perform a left join to normalize them.

NPT4names <- read.csv("NPT4-names.csv")

ScorecardSmall <-ScorecardSmall %>%
  left_join(NPT4names)
## Joining, by = "long_name"
## Warning: Column `long_name` joining character vector and factor, coercing
## into character vector

Now observe the head of the dataframe is as follows.

CONTROL INSTNM ADM_RATE long_name net short_name
1 Alabama A & M University 0.8989 NPT41_PUB 12683 Q1
1 University of Alabama at Birmingham 0.8673 NPT41_PUB 12361 Q1
1 University of Alabama in Huntsville 0.8062 NPT41_PUB 14652 Q1
1 Alabama State University 0.5125 NPT41_PUB 12342 Q1
1 The University of Alabama 0.5655 NPT41_PUB 17206 Q1
1 Auburn University at Montgomery 0.8371 NPT41_PUB 9044 Q1

Finally we construct the desired graphs. facet_grid greatly simplifies dividing the graph into cases based on income quintile and type of school.

ScorecardSmall %>%
    ggplot(aes(x=ADM_RATE*100, y = net)) + 
      geom_point(color = "blue", alpha = 0.2 ) +
      facet_grid(CONTROL ~ short_name) +
      geom_smooth(color = "black") +
      labs(x = "Admission rate (percentage)", y = "Net cost per year (in dollars)")
## `geom_smooth()` using method = 'gam'