The purpose of this project is apply clustering techniques to genetic data. First we read in the required libraries.

```
library(dplyr)
library(knitr)
library(ggplot2)
library(DataComputing)
library(tidyr)
library(ggdendro)
```

First we read in the table *NCI60* from the *DataComputing* library. Each row of this time is a genetic probe with the first column identifing the probe and the next 60 correspond to cell lines and their reported expression in each probe. We first convert the table to a narrow format where each row is an individual observation about a cell line in a probe.

```
Narrow <-
NCI60 %>%
gather(value=expression, key=cellLine, -Probe) %>%
group_by(Probe, cellLine) %>%
summarise(expression = mean(expression)) %>%
ungroup()
```

Next we will analyze the standard deviation of the *expression* numbers for each probe. We also test the Null Hypothesis that there is no relationship between probes and the standard deviations of the *expression* numbers for each probe.

```
keep <- 500
Best <-
Narrow %>%
group_by(Probe) %>%
summarise(spread = sd(expression)) %>%
arrange(desc(spread)) %>%
mutate(i = row_number()) %>%
head(keep)
Randomized <-
Narrow %>%
mutate(Probe = sample(Probe)) %>%
group_by(Probe) %>%
summarise(spread = sd(expression)) %>%
arrange(desc(spread)) %>%
mutate(i = row_number()) %>%
head(keep)
Best %>%
ggplot(aes(x=i, y=spread)) +
geom_line() +
geom_line(data=Randomized, color="red", size=1, alpha=.5)
```

As we can see from there graph, there is clearly a relationship so the Null Hypothesis fails.

Now we filter out those probes with standard deviations of expression numbers that are above a threshold of 4.5.

```
Keepers <-
Narrow %>% group_by(Probe) %>%
filter(sd(expression) > 4.5)
```

Now we convert the data to a wide format where each row is a cell line and the columns are expression numbers reported by different probes.

```
Keepers_wide <-
Keepers %>%
spread(key=Probe, value=expression)
rownames(Keepers_wide) <-Keepers_wide$cellLine
Keepers_wide <- Keepers_wide %>% select(-cellLine)
```

Now we finally apply a clustering algorithm to make a dendrogram.

```
Dists <- dist(Keepers_wide)
Dendrogram <- hclust(Dists)
ddata <- dendro_data(Dendrogram)
ggdendrogram(Dendrogram)
```