Read Chapter 7 (Exploratory Data Analysis) of R for Data Science and do exercises 7.3.4, 7.4.1, 7.5.1.1, 7.5.2.1, and 7.5.3.1.
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

Exploring the distribution of x:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = x), binwidth = .2) +
labs(title = "Distribution of x")
diamonds %>% count(cut_width(x, .2))
ggplot(data = diamonds, mapping = aes(x = x)) +
geom_histogram(binwidth = .2) +
coord_cartesian(xlim = c(3.5, 9)) +
labs(title = "Distribution of x, zoomed into (3.5, 9)")
We find that there are some outliers and that the distribution of x has a high peak around 4.4mm.
Exploring the distribution of y:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = .2) +
labs(title = "Distribution of y")
diamonds %>% count(cut_width(y, .2))
ggplot(data = diamonds, mapping = aes(x = y)) +
geom_histogram(binwidth = .2) +
coord_cartesian(xlim = c(3.5, 9)) +
labs(title = "Distribution of y, zoomed into (3.5, 9)")
The distribution of y has an extreme outlier at 58.90mm, but when we zoom in, we see a high peak at 4.4mm and a right-skewed distribution.
Exploring the distribution of z:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = z), binwidth = .2) +
labs(title = "Distribution of z")
diamonds %>% count(cut_width(z, .2))
ggplot(data = diamonds, mapping = aes(x = z)) +
geom_histogram(binwidth = .1) +
coord_cartesian(xlim = c(2, 6)) +
labs(title = "Distribution of z, zoomed into (2, 6)")
The measurement z has an outlier at 31.80mm, with a peak around 2.7mm and a right skew.
Honestly, with no real knowledge of diamonds, the only thing I can assume is that z is the depth of the diamond, simply because of how we use the z-axis in math. Also, since the distributions of x and y look almost identical, I assume these measurements are the length and width, and those two could be interchangeable.
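As a quick numeric check (not part of the exercise, but cheap to run), the summaries below should show x and y nearly identical and z systematically smaller, consistent with z being the depth:

diamonds %>%
  select(x, y, z) %>%
  summary()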
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 50) +
labs(title = "Distribution of Price", x = "Price in dollars")
It is not surprising that the distribution of price is highly right skewed, but there seems to be another peak at $4,000, which is most likely a common price for a wedding ring.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 50) +
coord_cartesian(xlim = c(1000, 2000)) +
labs(title = "Distribution of Price, zoomed into (1000, 2000)",
x = "Price in dollars")
It’s also strange that there appears to be a gap at $1,500. I’m not sure why this happens.
unusual <- diamonds %>%
filter(y < 3 | y > 20 | x < 3 | z > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
It’s unusual that there are diamonds with measurements of 0 that are still worth thousands of dollars; these are most likely measurement errors. There are also diamonds that are enormous, with y values of 31.80 and 58.90mm, but they aren’t worth a great deal of money.
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>%
filter(carat >= 0.99, carat <= 1) %>%
count(carat)
There are most likely a lot more 1 carat diamonds than 0.99 carat diamonds because it’s worthwhile to “round up” the number of carats in order to charge more for the diamond.
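To make that spike visible (a quick check beyond what the exercise asks), widen the window around 1 carat:

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)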
Compare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

The default histogram is:
ggplot(data = diamonds, mapping = aes(x = x)) +
geom_histogram() +
labs(title = "Histogram of x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If we add a coord_cartesian() layer with a set of x and y limits, ggplot simply zooms into the desired area:
ggplot(data = diamonds, mapping = aes(x = x)) +
geom_histogram() +
coord_cartesian(xlim = c(3.5, 9), ylim = c(0, 5000)) +
labs(title = "Histogram of x, zoomed with coord_cartesian")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, the xlim() and ylim() layers first remove the data outside of the limits and then create the histogram:
ggplot(data = diamonds, mapping = aes(x = x)) +
geom_histogram() +
xlim(3.5, 9) +
ylim(0, 5000) +
labs(title = "Histogram of x, zoomed with xlim and ylim")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 50 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Note that this removed the bar that had a height over 5,000.
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

First we need to create some missing data, using the bit of code from Chapter 7.4:
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
This sets all of the extreme values to missing. Then, following the text, we create the histogram:
ggplot(data = diamonds2, mapping = aes(x = y)) +
geom_histogram() +
labs(title = "Histogram of y, with random missing values")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (stat_bin).
Note that the warning states “Removed 9 rows containing non-finite values (stat_bin).” With continuous data, the values are split into numeric bins, and there is no bin for NA.
Now for the bar chart, we want to use categorical data. For example, we can use the mpg data.
ggplot(data = mpg, mapping = aes(x = class)) +
geom_bar() +
labs(title = "Bar chart of Class")
However, let’s randomly set 5% of the classes to missing, using a random uniform draw:
mpg2 <- mpg %>%
mutate(class = ifelse(runif(n()) < .05, NA, class))
ggplot(data = mpg2, mapping = aes(x = class)) +
geom_bar() +
labs(title = "Bar chart of Class, with random missing values")
Now we see that there is a bar created for NA, unlike in the histogram.
What does na.rm = TRUE do in mean() and sum()?

The argument na.rm = TRUE removes the missing values before calculating the mean or sum (as it does in any function that accepts na.rm). If you leave it out and there are missing values, you will get back NA.
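A quick illustration in the console:

mean(c(1, 2, NA))               # returns NA
mean(c(1, 2, NA), na.rm = TRUE) # returns 1.5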
Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

We cannot leave the scheduled departure times as they are: in the HHMM encoding, the minutes only run from 00 to 59, so there are gaps from 60 to 99 in every hour, and a value like 576 is not a possible time.
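For example, a scheduled departure of 1530 converts to 15 + 30/60 = 15.5 hours on a continuous scale:

1530 %/% 100 + (1530 %% 100) / 60
## [1] 15.5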
flights %>%
mutate(
sched_dep_hour = sched_dep_time %/% 100,
sched_dep_min = sched_dep_time %% 100,
sched_dep_time = sched_dep_hour + sched_dep_min/60,
cancelled = is.na(dep_time)
) %>%
ggplot(., mapping = aes(x = cancelled, y = sched_dep_time)) +
geom_boxplot() +
labs(x = "Cancelled", y = "Scheduled Departure Time, 24-Hour Clock",
title = "Visualizing the Scheduled Departure Times of Cancelled vs. \n
Non-Cancelled Flights")
What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
ggplot(data = diamonds, mapping = aes(x = color, y = price)) +
geom_boxplot()
ggplot(data = diamonds, mapping = aes(x = clarity, y = price)) +
geom_boxplot()
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_point(mapping = aes(y = x))
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_point(mapping = aes(y = y))
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_point(mapping = aes(y = z))
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_point(mapping = aes(y = carat))
From plotting each of these variables against price, I found that color seems to be important for predicting the price of a diamond.
ggplot(data = diamonds, mapping = aes(x = color, y = price)) +
geom_boxplot() +
labs(title = "Boxplot of Price by Color")
I’m not sure how to answer the correlation part using only the techniques taught so far, but the next section of the book shows how to examine the relationship between two categorical variables:
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color)) +
labs(title = "Correlation of Cut and Color")
diamonds %>%
count(color, cut)
This relationship implies that lower quality cuts can still have a higher quality color, which leads to higher-priced diamonds.
Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot() +
coord_flip() +
labs(title = "Boxplot of Price by Cut")
ggplot(data = diamonds, mapping = aes(x = price, y = cut)) +
geom_boxploth() +
labs(title = "Boxplot of Price by Cut")
These two bits of code produce the same horizontal boxplot; however, the x and y arguments are flipped. I would prefer the ggstance function just because it is less typing.
Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_lv() +
labs(title = "Letter Value Plot of Price by Cut")
The letter value plot shows more of the quantiles, and the outliers are less prominent, so we can focus on the shape of the distribution.
Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

p <- ggplot(data = diamonds)
p + geom_violin(mapping = aes(x = cut, y = price)) +
labs(title = "Violin Plot of Price by Cut")
p + geom_histogram(mapping = aes(price)) +
facet_wrap(~ cut) +
labs(title = "Faceted Histogram of Price by Cut")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_histogram(mapping = aes(price)) +
facet_wrap(~ cut, scales = "free_y") +
labs(title = "Faceted Histogram of Price by Cut, free count scale")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_freqpoly(mapping = aes(x = price, colour = cut)) +
labs(title = "Colored Frequency Polygon, count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_freqpoly(mapping = aes(x = price, y = ..density.., colour = cut)) +
labs(title = "Colored Frequency Polygon, density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It’s obvious that the worst of these plots is the facetted histogram: there are so few fair and good cut diamonds compared to ideal cuts that the fixed scales make it difficult to see what’s happening. The colored frequency polygon has the same problem, though less dramatically, and it does show how the counts differ in general. The violin plot lets you see how price is concentrated with respect to cut, but it does not convey how the counts differ between the cuts. From it we learn that the price of ideal cuts is strangely concentrated at the low end, while the other cuts are more spread out.
If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

p <- ggplot(data = mpg, mapping = aes(x = reorder(class, hwy, FUN = median),
                                      y = hwy)) +
  labs(title = "Class vs Highway MPG", x = "class, reordered by median hwy")
p + geom_jitter() +
labs(subtitle = "Using Jitter")
p + geom_beeswarm() +
labs(subtitle = "Using Beeswarm")
p + geom_quasirandom() +
labs(subtitle = "Using Quasirandom")
p + geom_quasirandom(method = "pseudorandom") +
labs(subtitle = "Using Quasirandom, method pseudorandom")
p + geom_quasirandom(method = "smiley") +
labs(subtitle = "Using Quasirandom, method smiley")
p + geom_quasirandom(method = "frowney") +
labs(subtitle = "Using Quasirandom, method frowney")
geom_jitter(): adds a small amount of random variation to the location of each point.
geom_beeswarm(): offsets overlapping points so that the result has a shape similar to a violin plot.
geom_quasirandom(): somewhere between violin and jitter, with several jittering methods to choose from.

How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

The example in Chapter 7 shows this graph:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n)) +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Cut and Color")
However, we can’t see the relationship of cut within color. This means we can really only see where all of the diamonds are concentrated. Instead, borrowing an idea from GitHub user jrnold, we want to group the diamonds by color, then find the proportion of each cut within each color so that each count is rescaled.
colorprops <- diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(prop = n/sum(n)) # once grouped, sum(n) is only for that group
ggplot(data = colorprops, mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Cut within Color")
Now if you look down the columns of squares, you can see how the cuts are distributed within the colors. For example, color D seems to be skewed heavily towards the ideal cut but color J seems to be spread out across the cuts a bit more.
We can switch this around to see the distribution of color within cut in the same way.
cutprops <- diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n/sum(n))
ggplot(data = cutprops, mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Color within Cut")
Now if we look across the rows, we can find which colors the different cuts are concentrating on. For example, the darkest shade for ideal cuts is at color G, and for very good cuts it’s color E.
Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

meandelay <- flights %>%
group_by(dest, month) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
ggplot(data = meandelay, mapping = aes(x = factor(month), y = dest)) +
geom_tile(mapping = aes(fill = avg_dep_delay), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Average Departure Delay from New York \n
by Destination and Month")
This plot is hard to read: there are far too many destinations, so the labels are illegible and no trends stand out. Let’s remove the destinations that New York flights did not fly to every month. It also helps to reorder the destinations; in the code below they are sorted by their average departure delay.
flights %>%
group_by(dest, month) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
group_by(dest) %>%
filter(n() == 12) %>% # selects only destinations with flights every month
ungroup() %>%
mutate(dest = fct_reorder(dest, avg_dep_delay)) %>% # reorder the destination
ggplot(mapping = aes(x = factor(month), y = dest)) +
geom_tile(mapping = aes(fill = avg_dep_delay), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Average Departure Delay from New York \n
by Destination and Month")
Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n)) +
labs(title = "Tile Plot of Color and Cut")
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = cut, y = color)) +
geom_tile(mapping = aes(fill = n)) +
labs(title = "Tile Plot of Color and Cut")
Normally we would want the variable with the longer labels on the vertical axis so they don’t get squished together; however, neither ordering causes that problem with these labels.
Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

Following the text, we modify the boxplots to frequency polygons:
p <- ggplot(data = diamonds, mapping = aes(x = price)) +
labs(title = "Distribution of Price, by Carat")
q <- ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
labs(title = "Density of Price, by Carat")
p + geom_freqpoly(mapping = aes(color = cut_width(carat, 0.5))) +
labs(subtitle = "Using Cut Width = 0.5")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
q + geom_freqpoly(mapping = aes(color = cut_width(carat, 0.5))) +
labs(subtitle = "Using Cut Width = 0.5")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that using the same number of bins gives drastically different plots. The function cut_width() creates equally spaced bins of carat, but there are very few diamonds at the largest carat sizes, so their lines are barely visible on the count scale. This means we definitely need to set y to density instead of the default count.
p + geom_freqpoly(mapping = aes(color = cut_number(carat, 11))) +
labs(subtitle = "Using 11 Bins of Equal Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
q + geom_freqpoly(mapping = aes(color = cut_number(carat, 11))) +
labs(subtitle = "Using 11 Bins of Equal Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
When using cut_number(), the number of diamonds is the same in each bin, but the bins have different widths. Because each group then contains the same number of diamonds, switching from count to density barely affects the shape of the plot; only the y scale changes.
Visualise the distribution of carat, partitioned by price.

We can use boxplots as in the text:
ggplot(data = diamonds, mapping = aes(y = carat)) +
geom_boxplot(mapping = aes(x = cut_number(price, 10))) +
coord_flip() +
labs(title = "Distribution of Carat, by Price", x = "Price")
or using frequency polygons as in the previous exercise:
ggplot(data = diamonds, mapping = aes(x = carat, y = ..density..)) +
geom_freqpoly(mapping = aes(color = cut_number(price, 10))) +
labs(title = "Distribution of Carat, by Price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Both of these plots show that most of the diamonds, regardless of price, are concentrated below 3 carats.
How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

Based on the plots, unsurprisingly, the smaller diamonds are worth less than the larger ones. However, there is a large overlap in their distributions, suggesting that the other factors have a large influence on price.
Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat and price.

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(color = cut_number(carat, 10))) +
facet_wrap(~ cut)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, mapping = aes(x = cut_number(carat, 5), y = price,
color = cut)) +
geom_boxplot()
Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
Since x and y have a very strong relationship with each other, a given value of x or y may not seem like an outlier on its own but will be an outlier when you look at the two together. A binned plot aggregates many points into each cell, so a handful of unusual points gets absorbed into sparsely populated bins; a scatterplot displays every individual observation, so these joint outliers stand out.
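For comparison, here is a sketch of the same region as a binned plot, where the unusual points are much harder to spot:

ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))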
Read Chapter 23 (Model Basics) and Chapter 24 (Model Building) of R for Data Science and do exercises 24.2.3 and 24.3.5.
In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?
If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?
Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?
Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond?
Use your Google sleuthing skills to brainstorm why there were fewer than expected flights on Jan 20, May 26, and Sep 1. (Hint: they all have the same explanation.) How would these days generalise to another year?
What do the three days with high positive residuals represent? How would these days generalise to another year?
# daily %>%
# top_n(3, resid)
Create a new variable that splits the wday variable into terms, but only for Saturdays, i.e. it should have Thurs, Fri, but Sat-summer, Sat-spring, Sat-fall. How does this model compare with the model with every combination of wday and term?
Create a new wday variable that combines the day of week, term (for Saturdays), and public holidays. What do the residuals of that model look like?
What happens if you fit a day of week effect that varies by month (i.e. n ~ wday * month)? Why is this not very helpful?
What would you expect the model n ~ wday + ns(date, 5) to look like? Knowing what you know about the data, why would you expect it to be not particularly effective?
We hypothesised that people leaving on Sundays are more likely to be business travellers who need to be somewhere on Monday. Explore that hypothesis by seeing how it breaks down based on distance and time: if it’s true, you’d expect to see more Sunday evening flights to places that are far away.
It’s a little frustrating that Sunday and Saturday are on separate ends of the plot. Write a small function to set the levels of the factor so that the week starts on Monday.
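A minimal sketch of such a function, assuming wday is a factor whose levels start with Sunday (as produced by lubridate::wday(date, label = TRUE)):

monday_first <- function(x) {
  # move the first level (Sunday) to the end so the week starts on Monday
  forcats::fct_relevel(x, levels(x)[-1])
}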
Redo HW1 Q2 using functions in tidyverse.
The 35.227.165.60:/home/m280-data/hw1 folder contains a typical genetic data set in plink format. If interested, you can read the plink documentation at http://zzz.bwh.harvard.edu/plink/, but it’s definitely not necessary for this homework.
merge-geno.bim contains information on each genetic marker (SNP). Each line is a SNP and has 6 fields: Chromosome, SNP ID, Genetic Distance (morgan), Base Pair Position (bp), Allele 1, Allele 2.

head /home/m280-data/hw1/merge-geno.bim
## 1 1-54490 0 54490 A G
## 1 1-55550 0 55550 T A
## 1 1-57033 0 57033 C T
## 1 1-57064 0 57064 A G
## 1 1-57818 0 57818 A C
## 1 1-58432 0 58432 C T
## 1 1-58448 0 58448 A G
## 1 1-58814 0 58814 A G
## 1 1-59492 0 59492 G A
## 1 1-60829 0 60829 T C
merge-geno.fam contains individual information. Each line is one individual and has 6 fields: Family ID, Person ID, Father ID, Mother ID, Sex coded as 1 (male) or 2 (female), Affection Status. Father ID = 0 means that person’s father is not in this data set; similarly, Mother ID = 0 means that person’s mother is not in this data set.

head -20 /home/m280-data/hw1/merge-geno.fam
## 2 T2DG0200001 0 0 1 0
## 2 T2DG0200002 0 0 2 0
## 2 T2DG0200003 0 0 2 0
## 2 T2DG0200004 0 0 2 0
## 2 T2DG0200005 0 0 1 0
## 2 T2DG0200006 0 0 1 0
## 2 T2DG0200007 0 0 2 0
## 2 T2DG0200008 0 0 2 0
## 2 T2DG0200009 0 0 2 0
## 2 T2DG0200012 0 0 1 0
## 2 T2DG0200013 0 0 1 0
## 2 T2DG0200018 0 0 1 0
## 2 T2DG0200023 0 0 2 0
## 2 T2DG0200024 0 0 1 0
## 2 T2DG0200027 0 0 2 0
## 2 T2DG0200031 T2DG0200001 T2DG0200015 1 0
## 2 T2DG0200032 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200033 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200034 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200035 T2DG0200001 T2DG0200015 2 0
merge-geno.bed contains genotypes of each individual in binary format. We don’t need this file for this homework.

Please do not put these data files into Git; they are huge. You don’t even need to copy them into your directory. Just read from the data folder /home/m280-data/hw1 directly.
merge.geno.bim <- read_tsv("/home/m280-data/hw1/merge-geno.bim",
col_names = FALSE)
merge.geno.fam <- read_delim("/home/m280-data/hw1/merge-geno.fam", delim = " ",
col_names = FALSE)
colnames(merge.geno.bim) <- c("chromosome", "snpid", "gendist", "bp",
                              "allele1", "allele2")
colnames(merge.geno.fam) <- c("famid", "personid", "fid", "mid", "sex",
"affstat")
How many people are in the data set (statisticians call this n)? How many SNPs are in the data set (statisticians call this p)?

There are 959 people in the data set and 8348674 SNPs.
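These counts are presumably just the number of rows in each tibble:

nrow(merge.geno.fam) # number of people, n
nrow(merge.geno.bim) # number of SNPs, p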
Counting the SNPs on each chromosome:

group_by(merge.geno.bim, chromosome) %>% summarise(num.snps = n())
The number of SNPs on chromosome 3 between base pairs 47892180 and 48130769:

filter(merge.geno.bim, chromosome == 3 & (bp >= 47892180 & bp <= 48130769)) %>%
nrow()
## [1] 894
A Mendel SNP definition file is similar to the plink bim file but has format SNP ID, Chromosome, Base Pair Position, with each field separated by a comma. Write a Linux shell command to convert merge-geno.bim to a Mendel SNP definition file. The first few lines of the Mendel SNP definition file should look like the output shown below.

Since we are redoing this question in tidyverse, we build the file in R rather than in the shell:

myvar <- nrow(merge.geno.bim)
out_string <- paste0(" 2.40 = FILE FORMAT VERSION NUMBER.\n",
myvar, " = NUMBER OF SNPS LISTED HERE.\n")
df <- merge.geno.bim %>%
select(snpid, chromosome, bp)
my_file <- file("./mendel_snpdef.txt", "w")
writeLines(out_string, my_file, sep = '')
write_delim(df, my_file, delim = ",", col_names = FALSE)
close(my_file)
readLines("./mendel_snpdef.txt", n = 10)
## [1] " 2.40 = FILE FORMAT VERSION NUMBER."
## [2] "8348674 = NUMBER OF SNPS LISTED HERE."
## [3] "1-54490,1,54490"
## [4] "1-55550,1,55550"
## [5] "1-57033,1,57033"
## [6] "1-57064,1,57064"
## [7] "1-57818,1,57818"
## [8] "1-58432,1,58432"
## [9] "1-58448,1,58448"
## [10] "1-58814,1,58814"
A Mendel pedigree file is similar to the plink fam file but has format Family ID, Person ID, Father ID, Mother ID, Sex coded as M or F, Twin Status, with each field separated by a comma. Write a Linux shell command to convert merge-geno.fam to a Mendel pedigree file. Since twin status is not available in plink format, we put nothing for that field. Also, Mendel limits Person ID to a length of at most 8 characters, so we have to strip the string T2DG from the IDs. The first few lines of the Mendel pedigree file should look like the output shown below.

Again in tidyverse rather than the shell:

merge.geno.fam %>%
  mutate(twin = NA,
         sex = ifelse(sex == 1, "M", "F"), # Mendel codes sex as M or F
         personid = str_sub(personid, 5, 11),
         fid = str_sub(fid, 5, 11),
         mid = str_sub(mid, 5, 11)) %>%
  select(famid, personid, fid, mid, sex, twin) %>%
  write_delim(., path = "./mendel_ped.txt", delim = ",", na = "",
              col_names = FALSE)
readLines("./mendel_ped.txt", n = 20)
## [1] "2,0200001,,,M," "2,0200002,,,F,"
## [3] "2,0200003,,,F," "2,0200004,,,F,"
## [5] "2,0200005,,,M," "2,0200006,,,M,"
## [7] "2,0200007,,,F," "2,0200008,,,F,"
## [9] "2,0200009,,,F," "2,0200012,,,M,"
## [11] "2,0200013,,,M," "2,0200018,,,M,"
## [13] "2,0200023,,,F," "2,0200024,,,M,"
## [15] "2,0200027,,,F," "2,0200031,0200001,0200015,M,"
## [17] "2,0200032,0200001,0200015,F," "2,0200033,0200001,0200015,F,"
## [19] "2,0200034,0200001,0200015,F," "2,0200035,0200001,0200015,F,"
Redo HW1 Q3 on Hoffman2, except now we want to submit each runSum.R job to a different node in the cluster.
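One possible sketch, assuming Hoffman2’s SGE qsub is available and that runSum.R takes a single command-line argument n (the argument values below are hypothetical, not from HW1): each submission becomes a separate job that the scheduler places on its own node.

# submit one job per value of n
for (n in c(100, 1000, 10000)) { # hypothetical argument values
  system(paste0("echo 'Rscript runSum.R ", n, "' | qsub -cwd -N runSum", n))
}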