No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.
Yes, I did. My Github handle is @emjcampos.
biostat-m280-2018-winter
and add Hua-Zhou
and juhkim111
as your collaborators with write permission.Done.
hw1
, hw2
, … Maintain two branches master
and develop
. The develop
branch will be your main playground, the place where you develop solution (code) to homework problems and write up report. The master
branch will be your presentation area. Submit your homework files (R markdown file Rmd
, html
file converted from R markdown, all code and data sets to reproduce results) in master
branch.Done.
hw1
, hw2
, … Tagging time will be used as your submission time. That means if you tag your hw1
submission after deadline, penalty points will be deducted for late submission.The 35.227.165.60:/home/m280-data/hw1
folder contains a typical genetic data set in plink format. If interested, you can read plink documentation at http://zzz.bwh.harvard.edu/plink/. But it’s definitely not necessary for this homework.
merge-geno.bim
contains information of each genetic marker (SNP). Each line is a SNP and has 6 fields:
Chromosome
, SNP ID
, Genetic Distance (morgan)
, Base Pair Position (bp)
, Allele 1
, Allele 2
.
head /home/m280-data/hw1/merge-geno.bim
## 1 1-54490 0 54490 A G
## 1 1-55550 0 55550 T A
## 1 1-57033 0 57033 C T
## 1 1-57064 0 57064 A G
## 1 1-57818 0 57818 A C
## 1 1-58432 0 58432 C T
## 1 1-58448 0 58448 A G
## 1 1-58814 0 58814 A G
## 1 1-59492 0 59492 G A
## 1 1-60829 0 60829 T C
merge-geno.fam
contains individual information. Each line is one individual and has 6 fields:
Family ID
, Person ID
, Father ID
, Mother ID
, Sex
coded as 1 (male) or 2 (female), Affection Status
Father ID = 0
means that person’s father is not in this data set. Similarly Mother ID
= 0 means that person’s mother is not in this data set.
head -20 /home/m280-data/hw1/merge-geno.fam
## 2 T2DG0200001 0 0 1 0
## 2 T2DG0200002 0 0 2 0
## 2 T2DG0200003 0 0 2 0
## 2 T2DG0200004 0 0 2 0
## 2 T2DG0200005 0 0 1 0
## 2 T2DG0200006 0 0 1 0
## 2 T2DG0200007 0 0 2 0
## 2 T2DG0200008 0 0 2 0
## 2 T2DG0200009 0 0 2 0
## 2 T2DG0200012 0 0 1 0
## 2 T2DG0200013 0 0 1 0
## 2 T2DG0200018 0 0 1 0
## 2 T2DG0200023 0 0 2 0
## 2 T2DG0200024 0 0 1 0
## 2 T2DG0200027 0 0 2 0
## 2 T2DG0200031 T2DG0200001 T2DG0200015 1 0
## 2 T2DG0200032 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200033 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200034 T2DG0200001 T2DG0200015 2 0
## 2 T2DG0200035 T2DG0200001 T2DG0200015 2 0
merge-geno.bed
contains genotypes of each individual in binary format. We don’t need this file for this homework.
Please, do not put these data files into Git; they are huge. You even don’t need to copy them into your directory. Just read from the data folder /home/m280-data/hw1
directly.
Use Linux shell commands to answer following questions.
n
)? How many SNPs are in the data set (statisticians call this p
)?Number of persons in this data set:
wc -l < /home/m280-data/hw1/merge-geno.fam
## 959
Number of SNPs in the data set
wc -l < /home/m280-data/hw1/merge-geno.bim
## 8348674
The counts and chromosome numbers are:
cut -f1 /home/m280-data/hw1/merge-geno.bim | sort | uniq -c
## 1309299 1
## 815860 11
## 602809 13
## 491208 15
## 477990 17
## 393615 19
## 239352 21
## 1215399 3
## 1090185 5
## 980944 7
## 732013 9
There are this many SNPs within the MAP4 gene:
awk '{if ($1 == 3 && $4 >= 47892180 && $4 <= 48130769) print}' /home/m280-data/hw1/merge-geno.bim | wc -l
## 894
bim
file but has format SNP ID
, Chromosome
, Base Pair Position
with each field separated by a comma. Write a Linux shell command to convert merge-geno.bim
to Mendel SNP definition file. The first few lines of the Mendel SNP definition file should look likeecho " 2.40 = FILE FORMAT VERSION NUMBER." > mendel_snpdef.txt
echo "$(wc -l < /home/m280-data/hw1/merge-geno.bim) = NUMBER OF SNPS
LISTED HERE." >> mendel_snpdef.txt
awk '{OFS = ","} {print $2, $1, $4}' /home/m280-data/hw1/merge-geno.bim >> \
mendel_snpdef.txt
head mendel_snpdef.txt
## 2.40 = FILE FORMAT VERSION NUMBER.
## 8348674 = NUMBER OF SNPS
## LISTED HERE.
## 1-54490,1,54490
## 1-55550,1,55550
## 1-57033,1,57033
## 1-57064,1,57064
## 1-57818,1,57818
## 1-58432,1,58432
## 1-58448,1,58448
fam
file but has format Family ID
, Person ID
, Father ID
, Mother ID
, Sex
coded as M or F, Twin Status
with each field separated by a comma. Write a Linux shell command to convert merge-geno.fam
to Mendel pedigree file. Since twin status is not available in plink format, we put nothing for that field. Also Mendel limits Person ID to have length less than or equal to 8 characters, so we have to strip the string T2DG
from the IDs. First few lines of the Mendel pedigree should look likeawk -v sex="" -v personID="" -v fatherID="" -v motherID="" -v twin="" \
'{OFS = ","} {if ($5 == 1) sex="M"; else sex="F"}
{if ($2 == 0) personID=personID ; else personID=substr($2,5,8)}
{if ($3 == 0) fatherID=fatherID ; else fatherID=substr($3,5,8)}
{if ($4 == 0) motherID=motherID ; else motherID=substr($4,5,8)}
{print $1, personID, fatherID, motherID, sex, twin}' \
/home/m280-data/hw1/merge-geno.fam > mendel_ped.txt
head -20 mendel_ped.txt
## 2,0200001,,,M,
## 2,0200002,,,F,
## 2,0200003,,,F,
## 2,0200004,,,F,
## 2,0200005,,,M,
## 2,0200006,,,M,
## 2,0200007,,,F,
## 2,0200008,,,F,
## 2,0200009,,,F,
## 2,0200012,,,M,
## 2,0200013,,,M,
## 2,0200018,,,M,
## 2,0200023,,,F,
## 2,0200024,,,M,
## 2,0200027,,,F,
## 2,0200031,0200001,0200015,M,
## 2,0200032,0200001,0200015,F,
## 2,0200033,0200001,0200015,F,
## 2,0200034,0200001,0200015,F,
## 2,0200035,0200001,0200015,F,
In class we discussed using R to organize simulation studies.
runSim.R
script to include arguments seed
(random seed), n
(sample size), dist
(distribution) and rep
(number of simulation replicates). When dist="gaussian"
, generate data from standard normal; when dist="t1"
, generate data from t-distribution with degree of freedom 1 (same as Cauchy distribution); when dist="t5"
, generate data from t-distribution with degree of freedom 5. Calling runSim.R
will (1) set random seed according to argument seed
, (2) generate data according to argument dist
, (3) compute the primed-indexed average estimator in class and the classical sample average estimator for each simulation replicate, (4) report the average mean squared error (MSE) \[
\frac{\sum_{r=1}^{\text{rep}} (\widehat \mu_r - \mu_{\text{true}})^2}{\text{rep}}
\] for both methods.An example run of runSim.R
is
Rscript runSim.R seed=280 n=100 dist=\"gaussian\" rep=50
## [1] 0.04146737 0.00948339
autoSim.R
script to run simulations with combinations of sample sizes nVals = seq(100, 500, by=100)
and distributions distTypes = c("gaussian", "t1", "t5")
and write output to appropriately named files. Use rep = 50
, and seed = 280
.We can run autoSim.R
to write output to the files:
Rscript autoSim.R
This R script will extract the results from the simulation results and create a table that can easily communicate the results:
Rscript tableSim.R
n | Method | Gaussian | t1 | t5 |
---|---|---|---|---|
100 | PrimeAvg | 0.0414674 | 2808.42570 | 0.0664284 |
100 | SampAvg | 0.0094834 | 230.23720 | 0.0167508 |
200 | PrimeAvg | 0.0187544 | 799.31393 | 0.0418479 |
200 | SampAvg | 0.0071637 | 95.87976 | 0.0076665 |
300 | PrimeAvg | 0.0200254 | 446.40235 | 0.0241224 |
300 | SampAvg | 0.0035344 | 53.57742 | 0.0054711 |
400 | PrimeAvg | 0.0112689 | 19.45133 | 0.0243314 |
400 | SampAvg | 0.0029707 | 20.53370 | 0.0035170 |
500 | PrimeAvg | 0.0111769 | 217.26890 | 0.0116763 |
500 | SampAvg | 0.0025920 | 20.78780 | 0.0033550 |