Post provided by: Salvador Herrando-Pérez
If you use genetics to differentiate populations, the new package smartsnp might be your new best friend. Written in R language and available from GitHub and CRAN, this package performs principal component analysis with control for genetic drift, projects ancient samples onto modern genetic space, and tests for population differences in genotypes. The package can load big datasets and run complex stats in the blink of an eye.
In this post, Salvador Herrando-Pérez discusses the features of this new package which is fully described in the new paper “smartsnp, an R package for fast multivariate analyses of big genomic data” recently published in Methods in Ecology & Evolution.
In the bioinformatics era, sequencing a genome has never been so straightforward. No surprise that >20 petabytes of genomic data are expected to be generated every year by the end of this decade. For context, if 1 byte of information was 1 mm long, we could make 29,000 round trips to the moon with 20 petabytes! Data size in genetics keeps outpacing the computer power available to handle it at any given time. You might be familiar with a computer freezing when unable to load or run an analysis on a huge dataset, and how many coffees or teas we might have drunk, or computer screens might have been broken, during the wait. The bottom line is that software advances that speed up data processing and genetic analysis are always good news.
With that idea in mind, I have just published a paper presenting the new R package smartsnp to run multivariate analysis of big genotype data, with application to studies of ancestry, evolution, forensics, lineages and overall population genetics. I am proud to say that the development of the package has been one of the most gratifying short-term collaborations in my entire career, with my colleagues Christian Huber and Ray Tobler: a truly team effort!
The package is available on GitHub and the Comprehensive R Archive Network CRAN. See downloading options here, and vignettes here with step-by-step instructions to run different functionalities of our package (summarized below).
In this blog, I use “genotype” meaning the combination of gene variants (alleles) across a predefined set of positions (loci) in the genome of a given individual of animal, human, microbe or plant. One type of those variants is single nucleotide polymorphisms (SNP), a DNA locus at which two or more alternative nucleotides occur, sometimes conditioning protein translation or gene expression. SNPs are relatively stable over time and routinely used to identify individuals and ancestors in humans and wildlife.
What the package does
The package smartsnp is partly based on the field-standard software EIGENSOFT which is only available for Unix command-line environments. In fact, our driving motivation was (i) to broaden the use of EIGENSOFT tools by making them available to the rocketing community of professionals, not only academics, who employ R for their work, and (ii) to optimize our package to handle big datasets and complex stats efficiently. Our package mimics EIGENSOFT’s Principal Component Analysis (SMARTPCA), and also runs multivariate tests for population differences in genotypes as follows:
- Showing genetic relationships in a simple graph: If you have sequenced multiple SNPs from a range of study individuals (belonging to different populations), PCA can display the genetic resemblance among individuals in a scatter diagram (known as ordination). The points in the ordination are the individuals, and the closer the individuals are in ordination space, the more similar their genotypes are, and vice versa. The package smartsnp implements arguably the two most popular tools of SMARTPCA: (i) prior to analysis, SNP variation across individuals can be scaled to control for the expected allele-frequency dispersion caused by genetic drift; and (ii) posterior to analysis, individuals for which some SNPs could not be sequenced (missing nucleotides) can be projected onto the PCA space. The latter projection will be useful to compare genotypes for ancient versus modern individuals (see one of our vignettes explaining how to do it with real ancient-human data), or as an alternative to imputation for individuals resulting in low-coverage DNA with lots of missing bases.
- Quantifying statistical support for genetic relationships: Our package also implements two statistical tests based on permutations (normality not required). Each study individual must be assigned to one of at least two groups, and those groups might reflect, say, different populations, geographical regions or times. PERMANOVA tests for differences in ‘location’ (populations clustered in different locations of the PCA space because each population has a unique genotype), and PERMDISP tests for differences in ‘dispersion’ (populations show different spread in PCA space because the individuals of some populations have more variable [technically: heteroscedastic] genotypes than those of other populations).
How the package does it
The package smartsnp has four functions, one computes SMARTPCA, PERMANOVA and PERMDISP in one single go (smart_mva), and the three standalone functions compute only one analyses separately (smart_pca, smart_permanova, smart_permdisp).
For instance, once the package has been downloaded in your computer, if you set the path to your working directory in the R environment, store your genotype data (my_data) and group vector (my_groups) in the working directory, and run the next line of code:
smart_mva(snp_data = mydata, sample_group = my_groups)
Then you are computing SMARTPCA, PERMANOVA and PERMDISP at once for the genotypes in mydata given the assignment of individuals to groups specified in my_groups.
The function has multiple arguments but the above code executes the default options (e.g., missing values coded 9 in mydata, SNPs with missing values imputed with means, all SNPS scaled to control for genetic drift, PCA calculation truncated to principal axes one and two to accelerate computations, 9,999 permutations and Holm’s correction for multiple pairwise testing).
The function smart_mva will spit out (among other outputs) the position of individuals in PCA space (which you can then plot in R or elsewhere), and the F statistic and p values for PERMANOVA and PERMDISP.
In our paper, we show that smart_pca is 2 to 4 times faster than EIGENSOFT’s SMARTPCA across a range of data sizes. All four functions complete analyses on datasets with up to 100 samples and 1 million SNPs in < 30 seconds, and accurately recreate previously published SMARTPCA on genotypes of wolves (Canis lupus) and modern and ancient humans (Homo sapiens).
Genotype data formats
The package smartsnp has been conceived for the analysis of biallelic SNPs with genotypes [0|1|2] from diploid organisms. Thus, a SNP with reference allele G and variant allele T will have genotypes g(GG)= 0 (homozygous reference), g(GT) = 1 (heterozygous) and g(TT) = 2 (homozygous non-reference). Genotypes from haploid or polyploid organisms can be similarly defined and used in our package.
We tailored our package to detect automatically the format of the input data. smartsnp accepts a general flat (text) file or a compressed or uncompressed EIGENSOFT file. Those formats are stable and should preserve our package’s functionality over time (so not requiring constant updates). In contrast, genetic data formats such as VCF and PLINK are complex and change frequently, implying a range of conceptual and programmatic subtleties. Nevertheless, we have created a vignette (based on the software plink2) guiding users on transforming the VCF and PLINK/bed formats into a genotype text format (*.traw) accepted by smartsnp. Alternatively, users can resort to other R tools for data conversion into formats accepted by our package (e.g., genomic_converter, glactools, SNPRelate).
You might be familiar with the following issues but a reminder seems appropriate:
- a priori hypothesis: the assignment of individuals to groups must be done according to a hypothesis drawn from theory before doing any analysis. Our package smartsnp allows the combination of PCA with PERMANOVA and PERMDISP based on those a priori expectations. Otherwise, assigning individuals to groups based on a visual inspection of a PCA would be flawed, like testing for differences in the colour between two groups of birds by eyeballing their plumage rather than by formulating some type of prediction (e.g., feather colour might be expected to be driven by different diets or habitats).
- p values: ANOVA hypothesis testing (as in PERMANOVA, PERMDISP) invariably results in a probability (p) of the observed differences between groups if the null hypothesis (zero differences) was true. The probability is informative but does not prove anything and depends on sample size (the higher the sample size, the higher the likelihood that a p value will be low). Accordingly, avoid statements like “…there were significant differences between groups” as they have no statistical or biological meaning, but do state why the magnitude of those differences is biologically relevant.
- mean versus variance:ANOVA tests might result in a low p value because, for any given measured variable, the mean or the variance or both differ between groups. To tell any of the latter scenarios, an ANOVA test must be combined with a homoscedasticity (variance) test. The terms ‘mean’ and ‘variance’ in the univariate case (1 variable as in a one-way ANOVA) equate to ‘location’ and ‘dispersion’ in the multivariate case (>1 variable as in PCA). Notably, PERMANOVA is rather robust to dispersion effects for balanced designs.
Across the board, homoscedasticity should not be seen as nuisance that makes simple ANOVA complicated (this impression mostly results from widespread narratives presenting homogeneity of variance as an assumption of ANOVA tests) but as a statistical property of the data that can have genuine biological meaning. Thus, in ecology, multivariate dispersion can be interpreted as a measure of (i) species turnover across species assemblages and (ii) stress in species assemblages exposed to environmental perturbations. Akin to those interpretations, increased genetic heterogeneity in PCA space for human data can signal (i) increased genetic diversity when the most variable loci have the strongest effects on the phenotype of certain ethnic groups, or (ii) indicate disease profiles and their ancestral origin.
Example with real data
Kraus et al (2012) examined the global flyways of mallards (Anas platyrhynchos) using 364 SNPs (all SNPs had missing values) from 695 individuals belonging to 55 populations and 10 flyways. This study supports panmixia (random mating) or lack of genetic differentiation among flyways except for the Greenland flyway, suggesting “… the world’s mallards, perhaps with minor exceptions, form a single large, mainly interbreeding population”.
Plotting smart_pca outputs clearly captures that Greenland individuals cluster in the lower-left corner of the PCA 1 × PCA 2 ordination while individuals from the other flyways are mixed (see plot below). We have written a vignette showing how to run PCA, PERMANOVA and PERMDISP on this mallard dataset using our package’s four functions.
Our package smartsnp should make SMARTPCA available to a much wider audience than currently, and also enhance the exposition of the genetic concepts behind this numerical method.
The functionality of our package should have attractive properties for the growing community of scientists investigating modern and ancient population structure in humans and other taxa and across a range of disciplines (particularly archaeology, conservation biology, (palaeo)ecology, human/wildlife forensics, genetics/genomics, palaeontology, and population and evolutionary biology). Christian Huber is the maintainer of the package. We are happy to address any issues and questions that smartsnp users might raise, with the bonus that those interactions are bound to improve and expand future versions of the package.