Q1. Git/GitHub

No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.

  1. Apply for the Student Developer Pack at GitHub using your UCLA email.

Yes, I did. My Github handle is @emjcampos.

  1. Create a private repository biostat-m280-2018-winter and add Hua-Zhou and juhkim111 as your collaborators with write permission.

Done.

  1. Top directories of the repository should be hw1, hw2, … Maintain two branches master and develop. The develop branch will be your main playground, the place where you develop solution (code) to homework problems and write up report. The master branch will be your presentation area. Submit your homework files (R markdown file Rmd, html file converted from R markdown, all code and data sets to reproduce results) in master branch.

Done.

  1. After each homework due date, teaching assistant and instructor will check out your master branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after deadline, penalty points will be deducted for late submission.

Q2. Linux Shell Commands

The 35.227.165.60:/home/m280-data/hw1 folder contains a typical genetic data set in plink format. If interested, you can read plink documentation at http://zzz.bwh.harvard.edu/plink/. But it’s definitely not necessary for this homework.

Please, do not put these data files into Git; they are huge. You even don’t need to copy them into your directory. Just read from the data folder /home/m280-data/hw1 directly.

Use Linux shell commands to answer following questions.

  1. How many persons are in the data set (statisticians call this n)? How many SNPs are in the data set (statisticians call this p)?

Number of persons in this data set:

wc -l < /home/m280-data/hw1/merge-geno.fam
## 959

Number of SNPs in the data set

wc -l < /home/m280-data/hw1/merge-geno.bim
## 8348674
  1. Which chromosomes does this data set contain? How many SNPs are in each chromosome?

The counts and chromosome numbers are:

cut -f1 /home/m280-data/hw1/merge-geno.bim | sort | uniq -c
## 1309299 1
##  815860 11
##  602809 13
##  491208 15
##  477990 17
##  393615 19
##  239352 21
## 1215399 3
## 1090185 5
##  980944 7
##  732013 9
  1. MAP4 (microtubule-associated protein 4) is a gene on chromosome 3 spanning positions 47,892,180 bp – 48,130,769 bp. How many SNPs are located within MAP4 gene?

There are this many SNPs within the MAP4 gene:

awk '{if ($1 == 3 && $4 >= 47892180 && $4 <= 48130769) print}' /home/m280-data/hw1/merge-geno.bim | wc -l
## 894
  1. Statistical geneticists often have to reformat a data set to feed into various analysis programs. For example, to use the Mendel software http://www.genetics.ucla.edu/software/mendel, we have to reformat the data set to be read by Mendel.
echo "     2.40 = FILE FORMAT VERSION NUMBER." > mendel_snpdef.txt
echo "$(wc -l < /home/m280-data/hw1/merge-geno.bim)  = NUMBER OF SNPS
LISTED HERE." >> mendel_snpdef.txt
awk '{OFS = ","} {print $2, $1, $4}' /home/m280-data/hw1/merge-geno.bim >>  \
mendel_snpdef.txt
head mendel_snpdef.txt
##      2.40 = FILE FORMAT VERSION NUMBER.
## 8348674  = NUMBER OF SNPS
## LISTED HERE.
## 1-54490,1,54490
## 1-55550,1,55550
## 1-57033,1,57033
## 1-57064,1,57064
## 1-57818,1,57818
## 1-58432,1,58432
## 1-58448,1,58448
awk -v sex="" -v personID="" -v fatherID="" -v motherID="" -v twin=""  \
'{OFS = ","} {if ($5 == 1) sex="M"; else sex="F"}
{if ($2 == 0) personID=personID ; else personID=substr($2,5,8)}
{if ($3 == 0) fatherID=fatherID ; else fatherID=substr($3,5,8)}
{if ($4 == 0) motherID=motherID ; else motherID=substr($4,5,8)}
{print $1, personID, fatherID, motherID, sex, twin}'  \
/home/m280-data/hw1/merge-geno.fam > mendel_ped.txt
head -20 mendel_ped.txt
## 2,0200001,,,M,
## 2,0200002,,,F,
## 2,0200003,,,F,
## 2,0200004,,,F,
## 2,0200005,,,M,
## 2,0200006,,,M,
## 2,0200007,,,F,
## 2,0200008,,,F,
## 2,0200009,,,F,
## 2,0200012,,,M,
## 2,0200013,,,M,
## 2,0200018,,,M,
## 2,0200023,,,F,
## 2,0200024,,,M,
## 2,0200027,,,F,
## 2,0200031,0200001,0200015,M,
## 2,0200032,0200001,0200015,F,
## 2,0200033,0200001,0200015,F,
## 2,0200034,0200001,0200015,F,
## 2,0200035,0200001,0200015,F,

Q3. R Batch Run

In class we discussed using R to organize simulation studies.

  1. Expand the runSim.R script to include arguments seed (random seed), n (sample size), dist (distribution) and rep (number of simulation replicates). When dist="gaussian", generate data from standard normal; when dist="t1", generate data from t-distribution with degree of freedom 1 (same as Cauchy distribution); when dist="t5", generate data from t-distribution with degree of freedom 5. Calling runSim.R will (1) set random seed according to argument seed, (2) generate data according to argument dist, (3) compute the primed-indexed average estimator in class and the classical sample average estimator for each simulation replicate, (4) report the average mean squared error (MSE) \[ \frac{\sum_{r=1}^{\text{rep}} (\widehat \mu_r - \mu_{\text{true}})^2}{\text{rep}} \] for both methods.

An example run of runSim.R is

Rscript runSim.R seed=280 n=100 dist=\"gaussian\" rep=50
## [1] 0.04146737 0.00948339
  1. Modify the autoSim.R script to run simulations with combinations of sample sizes nVals = seq(100, 500, by=100) and distributions distTypes = c("gaussian", "t1", "t5") and write output to appropriately named files. Use rep = 50, and seed = 280.

We can run autoSim.R to write output to the files:

Rscript autoSim.R 
  1. Write an R script to collect simulation results from output files and print average MSEs in a table of format

This R script will extract the results from the simulation results and create a table that can easily communicate the results:

Rscript tableSim.R
n Method Gaussian t1 t5
100 PrimeAvg 0.0414674 2808.42570 0.0664284
100 SampAvg 0.0094834 230.23720 0.0167508
200 PrimeAvg 0.0187544 799.31393 0.0418479
200 SampAvg 0.0071637 95.87976 0.0076665
300 PrimeAvg 0.0200254 446.40235 0.0241224
300 SampAvg 0.0035344 53.57742 0.0054711
400 PrimeAvg 0.0112689 19.45133 0.0243314
400 SampAvg 0.0029707 20.53370 0.0035170
500 PrimeAvg 0.0111769 217.26890 0.0116763
500 SampAvg 0.0025920 20.78780 0.0033550