Read Chapter 7 (Exploratory Data Analysis) of R for Data Science and do exercises 7.3.4, 7.4.1, 7.5.1.1, 7.5.2.1, and 7.5.3.1.
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

Exploring the distribution of x:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = x), binwidth = .2) +
labs(title = "Distribution of x")
diamonds %>% count(cut_width(x, .2))
ggplot(data = diamonds, mapping = aes(x = x)) +
geom_histogram(binwidth = .2) +
coord_cartesian(xlim = c(3.5, 9)) +
labs(title = "Distribution of x, zoomed into (3.5, 9)")
We find that there are some outliers and that the distribution of x has a high peak around 4.4mm.
Exploring the distribution of y:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = .2) +
labs(title = "Distribution of y")
diamonds %>% count(cut_width(y, .2))
ggplot(data = diamonds, mapping = aes(x = y)) +
geom_histogram(binwidth = .2) +
coord_cartesian(xlim = c(3.5, 9)) +
labs(title = "Distribution of y, zoomed into (3.5, 9)")
The distribution of y has an extreme outlier at 58.90mm, but when we zoom in, we see a high peak at 4.4mm and a right-skewed distribution.
Exploring the distribution of z:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = z), binwidth = .2) +
labs(title = "Distribution of z")
diamonds %>% count(cut_width(z, .2))
ggplot(data = diamonds, mapping = aes(x = z)) +
geom_histogram(binwidth = .1) +
coord_cartesian(xlim = c(2, 6)) +
labs(title = "Distribution of z, zoomed into (2, 6)")
The measurement z has an outlier at 31.80mm, with a peak around 2.7mm and a right skew.
Honestly, with no real knowledge of diamonds, the only thing I can assume is that z is the depth of the diamond, simply because of how we use the z-axis in math. Also, since the distributions of x and y look almost identical, I assume these measurements are the length and width, and those two could be interchangeable.
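As a quick numeric check (not part of the exercise, but cheap to run), the summaries below should show x and y nearly identical and z systematically smaller, consistent with z being the depth:

diamonds %>%
  select(x, y, z) %>%
  summary()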
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 50) +
labs(title = "Distribution of Price", x = "Price in dollars")
It is not surprising that the distribution of price is highly right skewed, but there seems to be another peak at $4,000, which is most likely a common price for a wedding ring.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 50) +
coord_cartesian(xlim = c(1000, 2000)) +
labs(title = "Distribution of Price, zoomed into (1000, 2000)",
x = "Price in dollars")
It’s also strange that there appears to be a gap at $1,500. I’m not sure why this happens.
unusual <- diamonds %>%
filter(y < 3 | y > 20 | x < 3 | z > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
It’s unusual that there are diamonds with measurements of 0 that are still worth thousands of dollars; these are most likely measurement errors. There are also diamonds that are enormous, with y values of 31.80 and 58.90mm, but they aren’t worth a great deal of money.
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>%
filter(carat >= 0.99, carat <= 1) %>%
count(carat)
There are most likely a lot more 1 carat diamonds than 0.99 carat diamonds because it’s worthwhile to “round up” the number of carats in order to charge more for the diamond.
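To make that spike visible (a quick check beyond what the exercise asks), widen the window around 1 carat:

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)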
Compare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

The default histogram is:
ggplot(data = diamonds, mapping = aes(x = x)) +
geom_histogram() +
labs(title = "Histogram of x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If we add a coord_cartesian() layer with a set of x and y limits, ggplot simply zooms into the desired area:
ggplot(data = diamonds, mapping = aes(x = x)) +
geom_histogram() +
coord_cartesian(xlim = c(3.5, 9), ylim = c(0, 5000)) +
labs(title = "Histogram of x, zoomed with coord_cartesian")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, the xlim() and ylim() layers first remove the data outside of the limits and then create the histogram:
ggplot(data = diamonds, mapping = aes(x = x)) +
geom_histogram() +
xlim(3.5, 9) +
ylim(0, 5000) +
labs(title = "Histogram of x, zoomed with xlim and ylim")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 50 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Note that this removed the bar that had a height over 5,000.
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

First we need to create some missing data, using the bit of code from Chapter 7.4:
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
This sets all of the extreme values to missing. Then, following the text, we create the histogram:
ggplot(data = diamonds2, mapping = aes(x = y)) +
geom_histogram() +
labs(title = "Histogram of y, with random missing values")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (stat_bin).
Note that the warning states “Removed 9 rows containing non-finite values (stat_bin).” With continuous data, the values are split into numeric bins, and there is no bin for NA.
Now for the bar chart, we want to use categorical data. For example, we can use the mpg data.
ggplot(data = mpg, mapping = aes(x = class)) +
geom_bar() +
labs(title = "Bar chart of Class")
However, let’s randomly set 5% of the classes to missing, using a random uniform draw:
mpg2 <- mpg %>%
mutate(class = ifelse(runif(n()) < .05, NA, class))
ggplot(data = mpg2, mapping = aes(x = class)) +
geom_bar() +
labs(title = "Bar chart of Class, with random missing values")
Now we see that there is a bar created for NA, unlike in the histogram.
What does na.rm = TRUE do in mean() and sum()?

The argument na.rm = TRUE removes the missing values before calculating the mean or sum (as it does in any function that accepts na.rm). If you leave it out and there are missing values, you will get back NA.
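A quick illustration in the console:

mean(c(1, 2, NA))               # returns NA
mean(c(1, 2, NA), na.rm = TRUE) # returns 1.5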
Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

We cannot leave the scheduled departure times as they are: in the HHMM encoding, the minutes only run from 00 to 59, so there are gaps from 60 to 99 in every hour, and a value like 576 is not a possible time.
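For example, a scheduled departure of 1530 converts to 15 + 30/60 = 15.5 hours on a continuous scale:

1530 %/% 100 + (1530 %% 100) / 60
## [1] 15.5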
flights %>%
mutate(
sched_dep_hour = sched_dep_time %/% 100,
sched_dep_min = sched_dep_time %% 100,
sched_dep_time = sched_dep_hour + sched_dep_min/60,
cancelled = is.na(dep_time)
) %>%
ggplot(., mapping = aes(x = cancelled, y = sched_dep_time)) +
geom_boxplot() +
labs(x = "Cancelled", y = "Scheduled Departure Time, 24-Hour Clock",
title = "Visualizing the Scheduled Departure Times of Cancelled vs. \n
Non-Cancelled Flights")
What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
ggplot(data = diamonds, mapping = aes(x = color, y = price)) +
geom_boxplot()
ggplot(data = diamonds, mapping = aes(x = clarity, y = price)) +
geom_boxplot()
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_point(mapping = aes(y = x))
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_point(mapping = aes(y = y))
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_point(mapping = aes(y = z))
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_point(mapping = aes(y = carat))
From plotting each of these variables against price, I found that color seems to be important for predicting the price of a diamond.
ggplot(data = diamonds, mapping = aes(x = color, y = price)) +
geom_boxplot() +
labs(title = "Boxplot of Price by Color")
I’m not sure how to answer the correlation part using only the techniques taught so far, but the next section of the book shows how to examine the relationship between two categorical variables:
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color)) +
labs(title = "Correlation of Cut and Color")
diamonds %>%
count(color, cut)
This relationship implies that lower quality cuts can still have a higher quality color, which leads to higher-priced diamonds.
Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot() +
coord_flip() +
labs(title = "Boxplot of Price by Cut")
ggplot(data = diamonds, mapping = aes(x = price, y = cut)) +
geom_boxploth() +
labs(title = "Boxplot of Price by Cut")
These two bits of code produce the same horizontal boxplot; however, the x and y arguments are flipped. I would prefer the ggstance function just because it is less typing.
Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_lv() +
labs(title = "Letter Value Plot of Price by Cut")
The letter value plot shows more of the quantiles, and the outliers are less prominent, so we can focus on the shape of the distribution.
Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

p <- ggplot(data = diamonds)
p + geom_violin(mapping = aes(x = cut, y = price)) +
labs(title = "Violin Plot of Price by Cut")
p + geom_histogram(mapping = aes(price)) +
facet_wrap(~ cut) +
labs(title = "Faceted Histogram of Price by Cut")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_histogram(mapping = aes(price)) +
facet_wrap(~ cut, scales = "free_y") +
labs(title = "Faceted Histogram of Price by Cut, free count scale")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_freqpoly(mapping = aes(x = price, colour = cut)) +
labs(title = "Colored Frequency Polygon, count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_freqpoly(mapping = aes(x = price, y = ..density.., colour = cut)) +
labs(title = "Colored Frequency Polygon, density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It’s obvious that the worst of these plots is the facetted histogram: there are so few fair and good cut diamonds compared to ideal cuts that the fixed scales make it difficult to see what’s happening. The colored frequency polygon has the same problem, though less dramatically, and it does show how the counts differ in general. The violin plot lets you see how price is concentrated with respect to cut, but it does not convey how the counts differ between the cuts. From it we learn that the price of ideal cuts is strangely concentrated at the low end, while the other cuts are more spread out.
If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

p <- ggplot(data = mpg, mapping = aes(x = reorder(class, hwy, FUN = median),
                                      y = hwy)) +
  labs(title = "Class vs Highway MPG", x = "class, reordered by median hwy")
p + geom_jitter() +
labs(subtitle = "Using Jitter")
p + geom_beeswarm() +
labs(subtitle = "Using Beeswarm")
p + geom_quasirandom() +
labs(subtitle = "Using Quasirandom")
p + geom_quasirandom(method = "pseudorandom") +
labs(subtitle = "Using Quasirandom, method pseudorandom")
p + geom_quasirandom(method = "smiley") +
labs(subtitle = "Using Quasirandom, method smiley")
p + geom_quasirandom(method = "frowney") +
labs(subtitle = "Using Quasirandom, method frowney")
geom_jitter(): adds a small amount of random variation to the location of each point.
geom_beeswarm(): offsets overlapping points so that the result has a shape similar to a violin plot.
geom_quasirandom(): somewhere between violin and jitter, with several jittering methods to choose from.

How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

The example in Chapter 7 shows this graph:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n)) +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Cut and Color")
However, we can’t see the relationship of cut within color. This means we can really only see where all of the diamonds are concentrated. Instead, borrowing an idea from GitHub user jrnold, we want to group the diamonds by color, then find the proportion of each cut within each color so that each count is rescaled.
colorprops <- diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(prop = n/sum(n)) # once grouped, sum(n) is only for that group
ggplot(data = colorprops, mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Cut within Color")
Now if you look down the columns of squares, you can see how the cuts are distributed within the colors. For example, color D seems to be skewed heavily towards the ideal cut but color J seems to be spread out across the cuts a bit more.
We can switch this around to see the distribution of color within cut in the same way.
cutprops <- diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n/sum(n))
ggplot(data = cutprops, mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Color within Cut")
Now if we look across the rows, we can find which colors the different cuts are concentrating on. For example, the darkest shade for ideal cuts is at color G, and for very good cuts it’s color E.
Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

meandelay <- flights %>%
group_by(dest, month) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
ggplot(data = meandelay, mapping = aes(x = factor(month), y = dest)) +
geom_tile(mapping = aes(fill = avg_dep_delay), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Average Departure Delay from New York \n
by Destination and Month")
This plot is hard to read: there are far too many destinations, so the labels are illegible and no trends stand out. Let’s remove the destinations that New York flights did not fly to every month. It also helps to reorder the destinations; in the code below they are sorted by their average departure delay.
flights %>%
group_by(dest, month) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
group_by(dest) %>%
filter(n() == 12) %>% # selects only destinations with flights every month
ungroup() %>%
mutate(dest = fct_reorder(dest, avg_dep_delay)) %>% # reorder the destination
ggplot(mapping = aes(x = factor(month), y = dest)) +
geom_tile(mapping = aes(fill = avg_dep_delay), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Tile Plot of Average Departure Delay from New York \n
by Destination and Month")
Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n)) +
labs(title = "Tile Plot of Color and Cut")
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = cut, y = color)) +
geom_tile(mapping = aes(fill = n)) +
labs(title = "Tile Plot of Color and Cut")
Normally we would want the variable with the longer labels on the vertical axis so they don’t get squished together; however, neither ordering causes that problem with these labels.
Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

Following the text, we modify the boxplots to frequency polygons:
p <- ggplot(data = diamonds, mapping = aes(x = price)) +
labs(title = "Distribution of Price, by Carat")
q <- ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
labs(title = "Density of Price, by Carat")
p + geom_freqpoly(mapping = aes(color = cut_width(carat, 0.5))) +
labs(subtitle = "Using Cut Width = 0.5")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
q + geom_freqpoly(mapping = aes(color = cut_width(carat, 0.5))) +
labs(subtitle = "Using Cut Width = 0.5")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that using the same number of bins gives drastically different plots. The function cut_width() creates equally spaced bins of carat, but there are very few diamonds at the largest carat sizes, so their lines are barely visible on the count scale. This means we definitely need to set y to density instead of the default count.
p + geom_freqpoly(mapping = aes(color = cut_number(carat, 11))) +
labs(subtitle = "Using 11 Bins of Equal Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
q + geom_freqpoly(mapping = aes(color = cut_number(carat, 11))) +
labs(subtitle = "Using 11 Bins of Equal Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
When using cut_number(), the number of diamonds is the same in each bin, but the bins have different widths. Because each group then contains the same number of diamonds, switching from count to density barely affects the shape of the plot; only the y scale changes.
Visualise the distribution of carat, partitioned by price.

We can use boxplots as in the text:
ggplot(data = diamonds, mapping = aes(y = carat)) +
geom_boxplot(mapping = aes(x = cut_number(price, 10))) +
coord_flip() +
labs(title = "Distribution of Carat, by Price", x = "Price")
or using frequency polygons as in the previous exercise:
ggplot(data = diamonds, mapping = aes(x = carat, y = ..density..)) +
geom_freqpoly(mapping = aes(color = cut_number(price, 10))) +
labs(title = "Distribution of Carat, by Price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Both of these plots show that most of the diamonds, regardless of price, are concentrated below 3 carats.
How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

Based on the plots, unsurprisingly, the smaller diamonds are worth less than the larger ones. However, there is a large overlap in their distributions, suggesting that the other factors have a large influence on price.
Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat and price.

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(color = cut_number(carat, 10))) +
facet_wrap(~ cut)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, mapping = aes(x = cut_number(carat, 5), y = price,
color = cut)) +
geom_boxplot()
Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
Since x and y have a very strong relationship with each other, a given value of x or y may not seem like an outlier on its own but will be an outlier when you look at the two together. A binned plot aggregates many points into each cell, so a handful of unusual points gets absorbed into sparsely populated bins; a scatterplot displays every individual observation, so these joint outliers stand out.
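For comparison, here is a sketch of the same region as a binned plot, where the unusual points are much harder to spot:

ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))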
Read Chapter 23 (Model Basics) and Chapter 24 (Model Building) of R for Data Science and do exercises 24.2.3 and 24.3.5.
In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?
If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?
Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?
Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond?
Use your Google sleuthing skills to brainstorm why there were fewer than expected flights on Jan 20, May 26, and Sep 1. (Hint: they all have the same explanation.) How would these days generalise to another year?
What do the three days with high positive residuals represent? How would these days generalise to another year?
# daily %>%
# top_n(3, resid)
Create a new variable that splits the wday variable into terms, but only for Saturdays, i.e. it should have Thurs, Fri, but Sat-summer, Sat-spring, Sat-fall. How does this model compare with the model with every combination of wday and term?
Create a new wday variable that combines the day of week, term (for Saturdays), and public holidays. What do the residuals of that model look like?
What happens if you fit a day of week effect that varies by month (i.e. n ~ wday * month)? Why is this not very helpful?
What would you expect the model n ~ wday + ns(date, 5) to look like? Knowing what you know about the data, why would you expect it to be not particularly effective?
We hypothesised that people leaving on Sundays are more likely to be business travellers who need to be somewhere on Monday. Explore that hypothesis by seeing how it breaks down based on distance and time: if it’s true, you’d expect to see more Sunday evening flights to places that are far away.
It’s a little frustrating that Sunday and Saturday are on separate ends of the plot. Write a small function to set the levels of the factor so that the week starts on Monday.
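A minimal sketch of such a function, assuming wday is a factor whose levels start with Sunday (as produced by lubridate::wday(date, label = TRUE)):

monday_first <- function(x) {
  # move the first level (Sunday) to the end so the week starts on Monday
  forcats::fct_relevel(x, levels(x)[-1])
}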
Redo HW1 Q2 using functions in tidyverse.
The 35.227.165.60:/home/m280-data/hw1 folder contains a typical genetic data set in plink format. If interested, you can read the plink documentation at http://zzz.bwh.harvard.edu/plink/, but it’s definitely not necessary for this homework.
merge-geno.bim contains information on each genetic marker (SNP). Each line is a SNP and has 6 fields: Chromosome, SNP ID, Genetic Distance (morgan), Base Pair Position (bp), Allele 1, Allele 2.

head /home/m280-data/hw1/merge-geno.bim
## 1 1-54490 0 54490 A G
## 1 1-55550 0 55550 T A
## 1 1-57033 0 57033 C T
## 1 1-57064 0 57064 A G
## 1 1-57818 0 57818 A C
## 1 1-58432 0 58432 C T
## 1 1-58448 0 58448 A G
## 1 1-58814 0 58814 A G
## 1 1-59492 0 59492 G A
## 1 1-60829 0 60829 T C
merge-geno.fam contains individual information. Each line is one individual and has 6 fields: Family ID, Person ID, Father ID, Mother ID, Sex coded as 1 (male) or 2 (female), Affection Status. Father ID = 0 means that person’s father is not in this data set; similarly, Mother ID = 0 means that person’s mother is not in this data set.

head -20 /home/m280-data/hw1/merge-geno.fam
## 2 T2DG0200001 0 0 1 0
## 2 T2DG0200002 0 0 2 0
## 2 T2DG0200003 0 0 2 0
## 2 T2DG0200004 0 0 2 0
## 2 T2DG0200005 0 0 1 0
## 2 T2DG0200006 0 0 1 0
## 2 T2DG0200007 0 0 2 0
## 2 T2DG0200008 0 0 2 0
## 2 T2DG0200009 0 0 2 0
## 2 T2DG0200012 0 0 1 0
## 2 T2DG0200013 0 0 1 0
## 2 T2DG0200018 0 0 1 0
## 2 T2DG0200023 0 0 2 0
## 2 T2DG0200024 0 0 1 0
## 2 T2DG0200027 0 0 2 0
## 2 T2DG0200031 T2DG0200001 T2DG0200015 1 0
## 2 T2DG0200032 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200033 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200034 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200035 T2DG0200001 T2DG0200015 2 0
merge-geno.bed contains genotypes of each individual in binary format. We don’t need this file for this homework.

Please do not put these data files into Git; they are huge. You don’t even need to copy them into your directory. Just read from the data folder /home/m280-data/hw1 directly.
merge.geno.bim <- read_tsv("/home/m280-data/hw1/merge-geno.bim",
col_names = FALSE)
merge.geno.fam <- read_delim("/home/m280-data/hw1/merge-geno.fam", delim = " ",
col_names = FALSE)
colnames(merge.geno.bim) <- c("chromosome", "snpid", "gendist", "bp",
                              "allele1", "allele2")
colnames(merge.geno.fam) <- c("famid", "personid", "fid", "mid", "sex",
"affstat")
How many people are in the data set (statisticians call this n)? How many SNPs are in the data set (statisticians call this p)?

There are 959 people in the data set and 8348674 SNPs.
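These counts are presumably just the number of rows in each tibble:

nrow(merge.geno.fam) # number of people, n
nrow(merge.geno.bim) # number of SNPs, p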
Counting the SNPs on each chromosome:

group_by(merge.geno.bim, chromosome) %>% summarise(num.snps = n())
The number of SNPs on chromosome 3 between base pairs 47892180 and 48130769:

filter(merge.geno.bim, chromosome == 3 & (bp >= 47892180 & bp <= 48130769)) %>%
nrow()
## [1] 894
A Mendel SNP definition file is similar to the plink bim file but has format SNP ID, Chromosome, Base Pair Position, with each field separated by a comma. Write a Linux shell command to convert merge-geno.bim to a Mendel SNP definition file. The first few lines of the Mendel SNP definition file should look like the output shown below.

Since we are redoing this question in tidyverse, we build the file in R rather than in the shell:

myvar <- nrow(merge.geno.bim)
out_string <- paste0(" 2.40 = FILE FORMAT VERSION NUMBER.\n",
myvar, " = NUMBER OF SNPS LISTED HERE.\n")
df <- merge.geno.bim %>%
select(snpid, chromosome, bp)
my_file <- file("./mendel_snpdef.txt", "w")
writeLines(out_string, my_file, sep = '')
write_delim(df, my_file, delim = ",", col_names = FALSE)
close(my_file)
readLines("./mendel_snpdef.txt", n = 10)
## [1] " 2.40 = FILE FORMAT VERSION NUMBER."
## [2] "8348674 = NUMBER OF SNPS LISTED HERE."
## [3] "1-54490,1,54490"
## [4] "1-55550,1,55550"
## [5] "1-57033,1,57033"
## [6] "1-57064,1,57064"
## [7] "1-57818,1,57818"
## [8] "1-58432,1,58432"
## [9] "1-58448,1,58448"
## [10] "1-58814,1,58814"
A Mendel pedigree file is similar to the plink fam file but has format Family ID, Person ID, Father ID, Mother ID, Sex coded as M or F, Twin Status, with each field separated by a comma. Write a Linux shell command to convert merge-geno.fam to a Mendel pedigree file. Since twin status is not available in plink format, we put nothing for that field. Also, Mendel limits Person ID to a length of at most 8 characters, so we have to strip the string T2DG from the IDs. The first few lines of the Mendel pedigree file should look like the output shown below.

Again in tidyverse rather than the shell:

merge.geno.fam %>%
  mutate(twin = NA,
         sex = ifelse(sex == 1, "M", "F"), # Mendel codes sex as M or F
         personid = str_sub(personid, 5, 11),
         fid = str_sub(fid, 5, 11),
         mid = str_sub(mid, 5, 11)) %>%
  select(famid, personid, fid, mid, sex, twin) %>%
  write_delim(., path = "./mendel_ped.txt", delim = ",", na = "",
              col_names = FALSE)
readLines("./mendel_ped.txt", n = 20)
## [1] "2,0200001,,,M," "2,0200002,,,F,"
## [3] "2,0200003,,,F," "2,0200004,,,F,"
## [5] "2,0200005,,,M," "2,0200006,,,M,"
## [7] "2,0200007,,,F," "2,0200008,,,F,"
## [9] "2,0200009,,,F," "2,0200012,,,M,"
## [11] "2,0200013,,,M," "2,0200018,,,M,"
## [13] "2,0200023,,,F," "2,0200024,,,M,"
## [15] "2,0200027,,,F," "2,0200031,0200001,0200015,M,"
## [17] "2,0200032,0200001,0200015,F," "2,0200033,0200001,0200015,F,"
## [19] "2,0200034,0200001,0200015,F," "2,0200035,0200001,0200015,F,"
Redo HW1 Q3 on Hoffman2, except now we want to submit each runSum.R job to a different node in the cluster.
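One possible sketch, assuming Hoffman2’s SGE qsub is available and that runSum.R takes a single command-line argument n (the argument values below are hypothetical, not from HW1): each submission becomes a separate job that the scheduler places on its own node.

# submit one job per value of n
for (n in c(100, 1000, 10000)) { # hypothetical argument values
  system(paste0("echo 'Rscript runSum.R ", n, "' | qsub -cwd -N runSum", n))
}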