The pilot study

The data in this experiment are a sample of 400 controls from the European Prospective Investigation of Cancer (EPIC) Norfolk study (http://www.srl.cam.ac.uk/epic/), each genotyped for approximately 250k Perlegen (http://www.perlegen.com/) SNPs (the EPIC 400 study). This is part of a screening sample from a multistage study of breast cancer in EPIC.

 

The flowchart of our implementation is shown in Figure 1. The input data come from three sources: genotype, map and phenotype. The genotype data contain the actual genotypes for all individuals in the so-called long format (individual ID, SNP ID and genotype). The map information gives the position of each SNP by chromosome. The phenotypic information is in the usual tabular format (individual ID, sex, body mass index and other measurements). These three sources are merged into a combined dataset for analysis. For screening purposes, only single-point analysis was conducted. Call rates were obtained, as were Hardy-Weinberg equilibrium (HWE) tests for all SNPs, and the results were used as an inclusion/exclusion filter for SNPs in the regression analysis. The code for obtaining SNP information, including the HWE tests, is as follows:

 

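/* Sort so that each BY group (chr, pos) corresponds to one SNP */
proc sort data=epic;
     by chr pos;
run;

/* Suppress printed output while the ODS output datasets are created */
ods select none;
proc allele data=epic genocol;
     ods output markersumm=ms allelefreq=af genotypefreq=gf;
     by chr pos;
     var a1a2;
run;
ods select all;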

 

The input data are sorted by chromosome (chr) and position (pos) and passed to PROC ALLELE, which takes genotypes rather than separate alleles as input (genocol) and outputs summary information on SNPs (markersumm), allele frequencies (allelefreq) and genotype frequencies (genotypefreq). The outputs are stored as ODS (Output Delivery System) datasets (ms, af and gf) by chromosome and SNP position, and the printed output for individual SNPs is suppressed (ods select none). The code is thus very compact. The raw genotype data and map information can be used to construct input files for HAPLOVIEW (http://www.broad.mit.edu/mpg/haploview/) for visualisation. The SNPs involved can be submitted to ENSEMBL (http://www.ensembl.org/index.html) to obtain gene annotation.
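The use of call rates and HWE tests as an inclusion/exclusion filter could be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the column names NIndiv and ProbChiSq in the marker summary dataset, the call-rate definition (individuals genotyped divided by 400) and both thresholds are assumptions, and the column names should be checked with PROC CONTENTS before use.

```sas
/* Assumed column names in the marker summary dataset ms;
   verify with: proc contents data=ms; run;                       */
%let n_col     = NIndiv;     /* individuals genotyped (assumed name)   */
%let p_col     = ProbChiSq;  /* HWE chi-square p-value (assumed name)  */
%let n_total   = 400;        /* sample size of the pilot study         */
%let min_call  = 0.95;       /* hypothetical call-rate threshold       */
%let min_hwe_p = 1e-4;       /* hypothetical HWE p-value threshold     */

/* Flag SNPs passing both filters; a missing p-value compares as
   smaller than any number, so such SNPs are flagged 0            */
data snp_keep;
    set ms;
    call_rate = &n_col / &n_total;
    keep_snp  = (call_rate >= &min_call and &p_col >= &min_hwe_p);
run;

/* Restrict the long-format data to SNPs that pass the filters */
proc sql;
    create table epic_qc as
    select e.*
    from epic as e inner join snp_keep as k
      on e.chr = k.chr and e.pos = k.pos
    where k.keep_snp = 1;
quit;
```

Keeping the flag in a separate dataset (snp_keep, a hypothetical name) rather than deleting rows means the excluded SNPs and the reason for their exclusion remain available for inspection.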

 

Several useful features of this analysis are notable. First, we do not require any other software to manage the data. In a traditional statistical analysis, the data usually take the so-called wide format, where rows represent samples and columns represent variables. Since the number of SNPs is very large, it is more sensible to organise the genotype data into the so-called long format, with only a few columns holding individual IDs, SNP IDs and genotypes. Although this requires more storage, the analysis is considerably simpler, for one can analyse each SNP in sequence and store the results in a systematic fashion. Second, we were able to take advantage of the SAS/GENETICS module for HWE and haplotype analysis. Third, all the outputs are available as datasets for re-use, and it is possible to generate data for external software programs such as HAPLOVIEW. This result database is made possible by ODS. Lastly, the SAS programs we developed can run without change under Windows. We have created an MS-DOS batch file to call SAS from the MS-DOS prompt.
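To make the long-format workflow concrete, the following sketch merges the three sources and then analyses each SNP in turn with BY-group processing. The dataset names geno, map and pheno, their column names and the output name assoc are hypothetical. PROC GLM with the genotype as a class variable is used here only as a convenient stand-in for the regression step, which the text does not specify in detail.

```sas
/* Merge genotype (long format), map and phenotype data.
   geno(id, snp, a1a2), map(snp, chr, pos) and pheno(id, sex, bmi)
   are hypothetical dataset layouts.                               */
proc sql;
    create table epic as
    select g.id, g.snp, m.chr, m.pos, g.a1a2, p.sex, p.bmi
    from geno as g
         inner join map   as m on g.snp = m.snp
         inner join pheno as p on g.id  = p.id
    order by m.chr, m.pos, g.id;
quit;

/* One model per SNP: each BY group (chr, pos) is a single SNP.
   Results for all SNPs accumulate in one ODS output dataset.     */
ods select none;
proc glm data=epic;
    by chr pos;
    class a1a2;
    model bmi = a1a2;
    ods output ModelANOVA=assoc;
run;
quit;
ods select all;
```

Because the BY variables are carried into the ODS output dataset, the results can be joined back to the map or to the marker summaries by chr and pos.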

 

We have kept the comma-separated data in compressed format, to be read directly through the pipe mechanism in SAS. We noted that the long format repeats individual IDs and SNP names, and uses substantially more disk space than the wide format if merged with phenotypic data. In this case, the SAS datasets are approximately 30GB if some intermediate results are included; the running time is about a day or two on our Intel Linux systems with 2GB RAM. Based on this, it is estimated that the full obesity project is approximately 30 times larger. However, if we spread the task across chromosomes on a cluster of 30 nodes, it can be finished in a similar time.
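Reading compressed data through a pipe, and splitting the work by chromosome, might look like the sketch below. The file name, the column layout, the presence of a header line and the program name analysis.sas are all assumptions made for illustration, as is the availability of gunzip on the system path.

```sas
/* Decompress on the fly; genotypes.csv.gz is a hypothetical file */
filename gz pipe 'gunzip -c genotypes.csv.gz';

data geno;
    /* firstobs=2 assumes the first line is a header */
    infile gz dlm=',' dsd firstobs=2 truncover;
    length id $12 snp $20 a1a2 $3;
    input id snp a1a2;
run;

/* One job per chromosome, started for example as
       sas analysis.sas -sysparm 7
   Assumes chr is numeric in the combined dataset epic.  */
%let chr = &sysparm;

data epic_chr&chr;
    set epic;
    where chr = &chr;
run;
```

With this arrangement the uncompressed file never has to be written to disk, and each node of a cluster only has to hold the data for its own chromosome.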

 

Figure 1. A flowchart of the EPIC 400 analysis. The raw data consist of genotypes, phenotypes and map information, which are merged. Descriptive statistics are then obtained, followed by calculation of call rates and HWE tests, the results of which are fed into the regression, which assesses statistical significance; the results are compared with the theoretical distribution by Q-Q plot. The raw data can also be reformatted into HAPLOVIEW input files so that specific regions of the genome can be visualised.
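The Q-Q comparison in the last step of the flowchart can be sketched as below. The input dataset pvals with a single p-value column p is hypothetical, the expected quantiles use the common (rank - 0.5)/n convention, and PROC SGPLOT is assumed to be available; this is an illustration rather than the plotting code used in the study.

```sas
/* Keep valid p-values only and sort them in ascending order */
proc sort data=pvals(where=(p > 0)) out=qq;
    by p;
run;

/* Under the null hypothesis, p-values are uniform on (0, 1), so the
   i-th smallest of n is expected to be close to (i - 0.5) / n       */
data qq;
    set qq nobs=n;
    expected = -log10((_n_ - 0.5) / n);
    observed = -log10(p);
    drop n;
run;

/* Points above the reference line indicate an excess of small p-values */
proc sgplot data=qq;
    scatter x=expected y=observed;
    lineparm x=0 y=0 slope=1;
    xaxis label="Expected -log10(p)";
    yaxis label="Observed -log10(p)";
run;
```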