Supplementary Notes - Download Word file - Nature

SUPPLEMENTARY INFORMATION

A Haplotype Map of the Human Genome
The International HapMap Consortium

Figures and tables are numbered consecutively as mentioned in the main
text. If not mentioned in the main text they continue the numbering in the
SI.

CONTENTS

Glossary

Project Organisation and DNA samples
1. Project organisation
2. DNA samples

SNP Discovery, SNP Selection and Genotyping
1. Genome-wide SNP discovery
2. SNP selection for inclusion in Phase I
3. SNP genotyping protocols and methods

Phase I Data Set
1. Phase I data set description
2. Data coordination and distribution
3. Quality control and quality assessment analysis

Population Genetic Data Analysis
1. SNP ascertainment features
2. Constructing a simulated Phase I HapMap for the ENCODE regions
3. Comparison of pairwise summaries of LD in ENCODE, HapMap, and previous studies
4. Selection of tag SNPs
5. Detecting cryptic relatedness of samples
6. Estimating recombination rates and detecting recombination hotspots
7. Nearest-neighbour analyses of haplotype structure
8. Estimation of FST
9. Identification of regions of unusual genetic variation
10. Tests of natural selection
11. Tests of transmission distortion

Supplementary Tables
Supplementary Figure Legends
References
Figures (provided as individual files)

GLOSSARY

Allele: One of several forms of a gene; at the DNA sequence level it
refers to one of several (usually, 2) nucleotide sequences at a particular
position in the genome.
Genotype: The two specific alleles present in an individual; called a
homozygote or heterozygote depending on whether the two alleles are
identical or different.
Polymorphism: The occurrence of multiple alleles at a specific site in the
DNA sequence. Classically, a site has been called polymorphic if the rarer
of the two alleles, called the minor allele, has a frequency above 1% in
the population.
SNP (single nucleotide polymorphism): Polymorphism where multiple
(usually, 2) bases (alleles) exist at a specific genomic sequence
site within a population, such as A and G. In individuals, the possible
combinations (genotypes) may be homozygous (AA or GG) or heterozygous (AG).
Heterozygosity: The frequency of heterozygotes in the population.
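As an illustration of the definitions of allele, genotype, polymorphism and heterozygosity given above, the following minimal sketch computes allele frequencies, the minor allele frequency and the observed heterozygosity for a single biallelic SNP. The function name allele_summary and the ten example genotypes are hypothetical and are not taken from the project data.

```python
from collections import Counter


def allele_summary(genotypes):
    """Summarise one biallelic SNP from unphased genotypes such as 'AG'.

    Returns (allele frequencies, minor allele frequency, observed heterozygosity).
    """
    # Each diploid genotype contributes two alleles to the count.
    allele_counts = Counter(allele for g in genotypes for allele in g)
    total = sum(allele_counts.values())
    freqs = {allele: n / total for allele, n in allele_counts.items()}
    # A monomorphic site has no minor allele; report 0.0 in that case.
    maf = min(freqs.values()) if len(freqs) > 1 else 0.0
    # Observed heterozygosity: the fraction of individuals with two different alleles.
    het = sum(1 for g in genotypes if g[0] != g[1]) / len(genotypes)
    return freqs, maf, het


# Hypothetical sample of ten individuals typed at an A/G SNP.
genotypes = ["AA", "AG", "AG", "GG", "AA", "AA", "AG", "AA", "AA", "AA"]
freqs, maf, het = allele_summary(genotypes)
is_polymorphic = maf > 0.01  # the classical 1% criterion from the glossary
print(freqs, maf, het, is_polymorphic)
```

In this example the G allele is the minor allele (5 of 20 chromosomes), and three of the ten individuals are heterozygotes.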
Haplotype: A combination of polymorphic alleles on a chromosome
delineating a specific pattern that occurs in a population. The term is
short for haploid genotype and has been used classically to describe the
patterns of variation in a small segment of the genome where genetic
recombination is rare, such as the HLA locus. However, when described as a
haploid genotype it can refer to the specific arrangement of alleles along
an entire chromosome observed in an individual, or in a specific region of
a chromosome. For two SNPs with alleles A and G, and C and G, the
possible haplotypes are AC, AG, GC and GG.
Linkage phase: The specific arrangement of alleles in the haplotypes. For
an individual who is heterozygous at two SNPs, AG and CG (see above), the
two haplotypes are either AC and GG, or AG and GC. These arrangements are
referred to as the phases of the genotypes.
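The two-SNP example used in the haplotype and linkage phase entries can be enumerated directly. The sketch below is illustrative only; the function names possible_haplotypes and possible_phases are hypothetical.

```python
from itertools import product


def possible_haplotypes(snp1_alleles, snp2_alleles):
    """List every haplotype that can be formed from the alleles of two SNPs."""
    return [a + b for a, b in product(snp1_alleles, snp2_alleles)]


def possible_phases(genotype1, genotype2):
    """Return the distinct haplotype pairs consistent with two unphased genotypes."""
    a1, a2 = genotype1
    b1, b2 = genotype2
    # The two chromosomes either carry (a1 with b1, a2 with b2) or (a1 with b2, a2 with b1).
    # Sorting each pair and using a set removes duplicates when a genotype is homozygous.
    phases = {
        tuple(sorted((a1 + b1, a2 + b2))),
        tuple(sorted((a1 + b2, a2 + b1))),
    }
    return sorted(phases)


print(possible_haplotypes("AG", "CG"))  # the four haplotypes AC, AG, GC, GG
print(possible_phases("AG", "CG"))      # double heterozygote: two possible phases
print(possible_phases("AA", "CG"))      # one homozygous site: phase is unambiguous
```

Only an individual who is heterozygous at both SNPs has ambiguous phase; if either genotype is homozygous, the two arrangements coincide.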
Linkage disequilibrium (LD): The statistical association between alleles
at two or more sites (SNPs) along the genome in a population. Irrespective
of the starting genetic composition of a population, over time, the
frequencies of the four possible haplotypes AC, AG, GC and GG are expected
to become the numerical products of the constituent allele frequencies,
that is, reach an equilibrium state. Any departure from this state is
called disequilibrium and defined as D = P(AC)P(GG) - P(AG)P(GC) (using the
above example) where P(.) refers to the frequency of that haplotype. LD is
commonly measured by the statistic D', which is the absolute value of D
divided by the maximum value that D could take given the allele
frequencies; D' ranges between 0 (no LD) and 1 (complete LD). LD decays
depending on the rate of recombination between the SNPs. Thus, the
patterns of genomic recombination, and the occurrence of recombination
hotspots and coldspots, affect the decay of LD and its local patterns.
When two SNPs are in strong linkage disequilibrium, one or two of the four
possible haplotypes may be missing. Another way of measuring LD is by the
coefficient of determination between the two alleles of the two SNPs, a
statistic called r². The value of r² (the square of the correlation
coefficient) lies between 0 and 1 and its maximum possible value depends on
the minor allele frequencies (MAFs) of the two SNPs. It has been used
because its theoretical
properties have been well studied and, most importantly, because it
measures how well one SNP can act as a surrogate (proxy) for another.
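The LD measures defined above can be computed from the four haplotype frequencies. The following sketch follows the definition of D given in this entry and uses the standard expressions for the maximum value of D and for r² in terms of allele frequencies; the function name ld_stats and the example frequencies are hypothetical.

```python
def ld_stats(p_ac, p_ag, p_gc, p_gg):
    """Return (D, D', r^2) for two biallelic SNPs with alleles A/G and C/G.

    Arguments are the population frequencies of haplotypes AC, AG, GC and GG.
    """
    total = p_ac + p_ag + p_gc + p_gg
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    # D as defined in the glossary entry.
    d = p_ac * p_gg - p_ag * p_gc

    # Allele frequencies: A at the first SNP, C at the second SNP.
    p_a = p_ac + p_ag
    p_c = p_ac + p_gc
    q_a = 1.0 - p_a  # frequency of G at the first SNP
    q_c = 1.0 - p_c  # frequency of G at the second SNP

    denom = p_a * q_a * p_c * q_c
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")

    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * q_c, q_a * p_c)
    else:
        d_max = min(p_a * p_c, q_a * q_c)

    d_prime = abs(d) / d_max
    r2 = d * d / denom
    return d, d_prime, r2


# Hypothetical population in which haplotype AG is absent.
d, d_prime, r2 = ld_stats(p_ac=0.6, p_ag=0.0, p_gc=0.1, p_gg=0.3)
print(d, d_prime, r2)
```

With these frequencies one of the four haplotypes is missing, so D' is 1 (up to floating-point rounding), whereas r² is approximately 0.64 because the allele frequencies at the two SNPs differ. This illustrates why r², rather than D', indicates how well one SNP can act as a proxy for another.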
Tag SNPs (or tags): The set of SNPs selected for genotyping in a disease
study. Given the considerable extent of LD in local genomic regions, the
choice of these SNPs for genotyping in a disease association study is
critical, as long as the cost of genotyping is still substantial. The
extensive correlation among neighbouring SNPs implies that not all of them
need to be genotyped since they provide (to some degree) redundant
information. Tag SNP selection can be performed using a variety of
methods, with a common goal to capture efficiently the variation in the
genomic region of interest.
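One simple way to exploit the redundancy described above is greedy pairwise tagging: repeatedly choose the SNP that captures the largest number of not-yet-captured SNPs at or above an r² threshold. The sketch below is one illustrative method among the variety mentioned in this entry and is not presented as the procedure used by the project; the function name greedy_tag_snps, the threshold of 0.8, the SNP labels and the r² values are all assumptions made for the example.

```python
def greedy_tag_snps(snps, r2, threshold=0.8):
    """Greedy pairwise tag SNP selection.

    snps: list of SNP identifiers.
    r2: dict mapping frozenset({snp_x, snp_y}) to their pairwise r^2;
        pairs that are absent are treated as r^2 = 0.
    threshold: illustrative r^2 cut-off for one SNP to capture another.
    """
    def captures(tag, other):
        # A SNP always captures itself.
        if tag == other:
            return True
        return r2.get(frozenset((tag, other)), 0.0) >= threshold

    untagged = set(snps)
    tags = []
    while untagged:
        candidates = sorted(s for s in snps if s not in tags)
        # Pick the candidate capturing the most untagged SNPs;
        # ties are broken by sorted order so the result is deterministic.
        best = max(candidates, key=lambda s: sum(captures(s, o) for o in untagged))
        tags.append(best)
        untagged = {o for o in untagged if not captures(best, o)}
    return tags


# Hypothetical region with five SNPs and invented pairwise r^2 values.
snps = ["snp1", "snp2", "snp3", "snp4", "snp5"]
r2 = {
    frozenset(("snp1", "snp2")): 0.95,
    frozenset(("snp2", "snp3")): 0.85,
    frozenset(("snp1", "snp3")): 0.70,
    frozenset(("snp4", "snp5")): 0.90,
}
tags = greedy_tag_snps(snps, r2)
print(tags)
```

In this example two tags suffice to capture all five SNPs: snp2 captures snp1 and snp3, and snp4 captures snp5.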
Demographic history: Extant human groups have populated the world after a
founding group emerged 'Out of Africa' ~150,000 years ago. The changes in
the demography (population size, mating behaviour, migration, etc.) of this
ancestral population, and the descendant ones, have shaped the quantity and
patterns of genetic variation in the human genome. Demographic history is
important for understanding the patterns of both benign and disease-related
variation.

PROJECT ORGANISATION AND DNA SAMPLES

Because the project was international in scope and technically challenging,
we describe several of its details, both for completeness and for the
benefit of future genetic projects: overall organisation of the project;
collection of DNA samples; discovery of SNPs genome-wide; SNP genotyping
and quality control; and data coordination and distribution.

1. Project organisation
The project was undertaken by a diverse team of investigators from multiple
countries - Canada, China, Japan, Nigeria, the United Kingdom, and the
United States - and multiple disciplines: community engagement and sample
collection, genomics, bioinformatics, population and statistical genetics,
and the ethical, legal, and social implications of genetic research. The
specific contributions from each participating group and their funding
sources are provided in Supplementary Table 10. These distributed locations
and diverse perspectives made coordination critical to maximize uniformity
of approach and data quality across the genome.

The project was led by a Steering Committee that met monthly by phone, and
twice a year in person, with subgroups responsible for: (1) community
engagement and collection of DNA samples, (2) SNP discovery, (3) genotyping
data production, (4) data flow and distribution, (5) data quality, (6) data
analysis, (7) ethical and social issues, (8) data release and intellectual
property, (9) communications and writing, and (10) coordination and
administration.

2. DNA samples
The populations studied were chosen based on known global patterns of
ancestral human geography and allele frequency differentiation, such that
the resulting resource would be broadly applicable to medical genetic
studies throughout the world (refs 1, 2). A practical and efficient solution for
sampling human genetic variation in a manner useful for disease association
studies was to sample individuals from populations that represent the major
demographic histories of extant humans. Since many populations would be
equally relevant from a given continental region, preference was given to
those to which investigators from the HapMap Project belonged. The
project decided to report the geographic locations where the samples were
collected so that researchers could decide which HapMap tag SNPs may be
most relevant to their disease studies.

The size of each population sample was limited by the number of genotypes
that could be obtained. Thus, decisions about sample size were intertwined
with the minor allele frequencies targeted for study, the number of SNPs
required to span the genome, and the cost of genotyping. The project chose
to target alleles present at minor allele frequency greater than or equal
to 0.05 in each analysis panel, recognizing that such alleles explain 90%
or more of human heterozygosity, are reasonably well represented in public
SNP databases, and can be well characterised in a modest number of
samples.

Given the goal of studying alleles with MAF ≥ 0.05, 90 samples were to be
included from each continental region, constituting an analysis panel (270
samples in total). For each analysis panel, 5 different duplicate samples
were also included. Based on this sample size, and at the original
estimated genotyping costs, the project had the resources to genotype about
1 to 1.5 million SNPs across the genome. This constituted Phase I of the
HapMap Project in which a SNP density of 1 per 5 kilobases (kb) with MAF ≥
0.05 was to be achieved. Due to decreases in genotyping costs, the final
HapMap will include a Phase II component, currently underway and to be
completed in October of 2005, in which genotyping will be attempted in an
additional 4.6 million SNPs, for a final density of 1 SNP per kb. A Phase
III component will assess the adequacy of the tag SNPs in samples from
additional populations in the ENCODE regions.

A complete accounting of SNPs genotyped for the Phase I data set by the