KAML

Kinship Adjusted Multiple Loci Best Linear Unbiased Prediction

OVERVIEW
CITATION
GETTING STARTED
- Installation
- Test Datasets
INPUT
- Phenotype file/Covariate file/Kinship file
- Genotype file
  - Hapmap/VCF/PLINK Binary/Numeric
USAGE
- Basic
- Advanced
OUTPUT
FAQ AND HINTS

OVERVIEW

KAML is designed to predict genetic values using genome-wide or chromosome-wide SNPs for either simple traits that controlled by a limited number of major genes or complex traits that influenced by many polygenes with minor effects. In brief, KAML provides a flexible assumption to accommodate traits of various genetic architectures and incorporates pseudo-QTNs as fixed effect terms and a trait-specific random effect term under the LMM framework. The model parameters are optimized using the information of bootstrap strategy based GWAS results in a parallel accelerated machine learning procedure combing cross-validation, grid search and bisection algorithms. For more statistical methods under the BLUP Framework please see our developed package HIBLUP (https://www.hiblup.com).

KAML is developed by Lilin Yin with the help of Haohao Zhang, and Xiaolei Liu* at the Huazhong(Central China) Agricultural University.

If you have any bug reports or questions, please feed back 👉here👈, or send email to [email protected]

CITATION

Yin, L., Zhang, H., Zhou, X. et al. KAML: improving genomic prediction accuracy of complex traits using machine learning determined parameters. Genome Biol 21, 146 (2020). https://doi.org/10.1186/s13059-020-02052-w

🧰 RELEVANT SOFTWARE TOOLS FOR GENETIC ANALYSES AND GENOMIC BREEDING

📫 HIBLUP: Versatile and easy-to-use GS toolbox.	🍀 SIMER: data simulation for life science and breeding.
🏊 hibayes: A Bayesian-based GWAS and GS tool.	🏔️ IAnimal: an omics knowledgebase for animals.
📮 rMVP: Efficient and easy-to-use GWAS tool.	📊 CMplot: A drawing tool for genetic analyses.

GETTING STARTED

KAML is compatible with both R and Microsoft R Open, WE STRONGLY RECOMMEND TO INSTALL KAML ON Microsoft R Open(https://mran.microsoft.com/download/). MRO is the enhanced distribution of R from Microsoft Corporation, and it includes the state-of-the-art parallel accelerated math libraries. Those libraries would reduce the time consumption significantly as many matrix operations could be computed in parallel by using all the available processing power.

Installation

(1) KAML can be installed with "devtools" by using the following R codes:

#if "devtools" isn't installed, please "install.packages(devtools)" first.
devtools::install_github("YinLiLin/KAML")

(2) If you get trouble in installing "devtools", try to install it locally as following:

# install required packages first
R> pkg <- c("RcppArmadillo", "RcppEigen", "RcppProgress", "bigmemory", "gaston")
R> new.pkg <- pkg[!(pkg %in% installed.packages()[,"Package"])]
R> if(length(new.pkg)) install.packages(new.pkg)

shell$ git clone https://github.com/YinLiLin/KAML
shell$ R CMD INSTALL KAML

After installed successfully, the KAML package can be loaded by typing

library(KAML)

Typing ?KAML could get the details of all parameters. Some related functions (Data preparation） for KAML could be refered to our developed GWAS package rMVP (https://github.com/XiaoleiLiuBio/rMVP).

Test Datasets

The example data can be downloaded by typing:

wget https://github.com/YinLiLin/KAML/releases/download/1.0.1/example.zip
unzip example.zip

Or by clicking the example.zip in your browser. After downloaded, unzip the file and change the workplace by setwd("").

INPUT

Phenotype file

The file must contain a header row, which may represents the trait names. The missing values should be denoted by NA, which will be treated as candidates. Notice that only the numeric values are allowed and the characters will not be recognized. However, if a phenotype takes only values of 0, 1 (or only two levels), KAML would consider it to be a case-control trait, and the predicted value could be directly interpreted as the probability of being a case.

mouse.Pheno.txt

Trait1	Trait2	Trait3	Trait4	Trait5	Trait6
0.224992	0.224991	NA	1	NA	-0.285427
-0.974543	-0.974542	NA	0	NA	-2.333531
0.195909	0.195909	NA	1	NA	0.046818
NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA
...	...	...	...	...	...
NA	NA	NA	NA	NA	0.720009

Covariate file

The Covariate file is optional, in order to fit the model for raw phenotype prediction, users can provide the covariates. Please attention that NAs are not allowed in the Covariate file, and all individuals should be in the same order with phenotype file. In KAML, there are two types of covariates: dcovfile and qcovfile.

dcovfile(optional): discrete covariates, e.g. dcov.txt. Each discrete covariate is recognized as a categorical factor with several levels. Each level can be denoted by a single character, a word, or a numerical number.

qcovfile(optional): quantitative covariates, e.g. qcov.txt. Each quantitative covariate is recognized as a continuous variable.

NOTE: the design matrix of the mean, which is a vector of all ones, in the model is always a linear combination of the design matrix of a discrete covariate so that not all the effects of the levels (or classes, e.g. male and female) of a discrete covariate are estimable. KAML will always constrain the effect of the first level to be zero and the effect of any other level represents its difference in effect compared to the first level.

dcov.txt qcov.txt

F	H1	1	pop1
F	H3	0	pop2
M	H2	0	pop2
F	H3	1	pop4
M	H3	1	pop4
...	...	...	...
M	H5	0	pop5

12	0.01	0.13
5	-0.05	0.25
7	0.05	-0.36
13	0.16	0.28
2	0.07	0.95
...	...	...
10	-0.12	0.35

Kinship file

KAML requires a n×n matrix that represents the relationship among individuals. It could be also supplied by the users, however, in this case, the order of individuals in either row or column should be the same as phenotype file, the column and row names are not needed.

mouse.Kin.txt (optional)

0.3032	-0.0193	0.0094	0.0024	0.0381	...	-0.0072
-0.0193	0.274	-0.0243	0.0032	-0.0081	...	0.0056
0.0094	-0.0243	0.3207	-0.0071	-0.0045	...	-0.0407
0.0024	0.0032	-0.0071	0.321	-0.008	...	-0.0093
0.0381	-0.0081	-0.0045	-0.008	0.3498	...	-0.0238
...	...	...	...	...	...	...
-0.0072	0.0056	-0.0407	-0.0093	-0.0238	...	0.3436

Genotype file

With the increasing number of SNPs, the genotype file becomes very big and it is not possible to read it into the memory directly with a memory-limited machine. Hence, KAML is integrated with bigmemory package, which designed a specific data format named big.matrix for saving the memory usage.
A total of two files should be provided: *.geno.bin and *.geno.desc, both files should use the same prefix. *.geno.bin is the numeric m(number of markers) by n(number of individuals) genotype file in big.matrix format, and *.geno.desc is the description file of *.geno.bin. Actually, users could manually make those files, but time-consuming and error prone, so KAML provides a function KAML.Data() for genotype format transformation. In the released version, KAML could accept the genotype file in four formats, including the Hapmap format, the VCF format, the PLINK Binary format, and the Numeric format. When transforming, missing genotypes are replaced by the selected methods (Left, Middle, "Right") of a given marker. After transformed, KAML can read it on-the-fly without a memory attenuation.
NOTE: No matter what type of genotype format, the order of individuals in columns should be the same as phenotype file.

Hapmap

Hapmap is one of the commonly used data format for storing genotype. As the example shown below, the SNP information is stored in rows while the individual information is stored in columns. The first 11 columns showed attributes of the SNPs and the remaining columns show the nucleotides information that genotyped at each SNP for all individuals.

mouse.hmp.txt

rs#	alleles	chrom	pos	strand	assembly#	center	protLSID	assayLSID	panelLSID	QCcode	A048005080	A048006063	A048006555	A048007096	A048010273	...	A084292044
rs3683945	G/A	1	3197400	+	NA	NA	NA	NA	NA	NA	AG	AG	GG	AG	GG	...	AA
rs3707673	A/G	1	3407393	+	NA	NA	NA	NA	NA	NA	GA	GA	AA	GA	AA	...	GG
rs6269442	G/A	1	3492195	+	NA	NA	NA	NA	NA	NA	AG	GG	GG	AG	GG	...	AA
rs6336442	G/A	1	3580634	+	NA	NA	NA	NA	NA	NA	AG	AG	GG	AG	GG	...	AA
rs13475699	G	1	3860406	+	NA	NA	NA	NA	NA	NA	GG	GG	GG	GG	GG	...	GG

The genotype in Hapmap format can be transformed to big.matrix format by the following codes:

KAML.Data(hfile="mouse.hmp.txt", out="mouse")

If the genotype in Hapmap format is stored in multiple files for many chromosomes, it can be transformed to big.matrix format by the following codes:

KAML.Data(hfile=c("mouse.chr1.hmp.txt", "mouse.chr2.hmp.txt",...), out="mouse")

VCF

VCF (Variant Call Format) file has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project, it is one of the most widely used genotype format. An VCF file example is shown below:

##fileformat=VCFv4.2
##fileDate=20171105
##source=PLINKv1.90
##contig=<ID=1,length=2>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	-9_CZTB0004	-9_CZTB0006	-9_CZTB0008	-9_CZTB0010	-9_CZTB0011	-9_CZTB0012
1	1	10000235	A	C	.	.	PR	GT	0/1	0/0	0/0	0/0	0/0	0/1
1	1	10000345	A	G	.	.	PR	GT	0/0	0/0	0/0	0/0	1/1	1/1
1	1	10004575	G	.	.	.	PR	GT	0/0	0/0	0/0	0/0	0/0	0/0
1	1	10006974	C	T	.	.	PR	GT	0/0	0/0	0/1	1/1	0/1	1/1
1	1	10006986	A	G	.	.	PR	GT	0/0	0/0	0/1	./.	1/1	1/1

The genotype in VCF format can be transformed to big.matrix format by the following codes:

KAML.Data(vfile="mouse.vcf", out="mouse")

If the genotype in VCF format is stored in multiple files for many chromosomes, it can be transformed to big.matrix format by the following codes:

KAML.Data(vfile=c("mouse1.vcf", "mouse2.vcf",...), out="mouse")

PLINK Binary

The PLINK Banary format is derived from Plink software. This format requires three files: *.bed, *.bim and *.fam, all with the same prefix. KAML only use the *.bed and the *.bim file.
NOTE: the SNP ID must be unique.

mouse.fam, mouse.bim, mouse.bed

The genotype in PLINK Banary format can be transformed to big.matrix format by the following codes:

KAML.Data(bfile="mouse", out="mouse")

Numeric

KAML also accepts the Numeric format. The homozygote should be coded as 0, 2 while the heterozygote is coded as 1. The SNP information is stored in the rows and individual information is stored in the columns, it means that the dimension of the numeric matrix is m by n, the order of individuals in columns must correspond to the phenotype file in rows. Additionally, this format does not contain the chromosome and position of the SNPs. Therefore, two separate files should be provided, including one file contains the numeric genotype data, and the other file contains the position of each SNP.
NOTE: Row names and column names are not allowed, the number of row and the order of SNPs must be same in the two files.

mouse.Numeric.txt mouse.map

1	1	2	1	2	…	0
1	1	0	1	0	…	2
1	2	2	1	2	…	0
1	1	2	1	2	…	0
0	0	0	0	0	…	0

rs3683945	1	3197400
rs3707673	1	3407393
rs6269442	1	3492195
rs6336442	1	3580634
rs13475699	1	3860406

This type of file can be transformed by the following codes:

KAML.Data(numfile="mouse.Numeric.txt", mapfile="mouse.map", out="mouse")

If there are missing values in your genotype, please impute as following:

KAML.Impute("mouse")

After transformed from one of four types of format above, two needed files will be generated, all with the same prefix which is assigned by users in the KAML.Data function.

The example mouse datasets: mouse.geno.desc, mouse.geno.bin

USAGE

Basic

To run KAML, you should provide two basic files: the phenotype file (values for training, NAs for predictors) and the genotype file. By default, the first column of phenotype will be analyzed, if there are more than one trait, please specify which column of trait is to be analyzed with the parameter "pheno=". For example: KAML(..., pheno=3) means the trait in third column would be analyzed. For the genotype, only the prefix need to be assigned, KAML could automatically attach the *.geno.bin and *.geno.desc files.
Note again: KAML has no function for adjusting the order of individuals. So please make sure the same order of individuals between phenotype and genotype.

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse")
# pfile: phenotype file
# pheno: the column number of predicted phenotype
# gfile: transformed genotype file

# multiple threads computation
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", cpu=30)
# cpu: number pf threads

Run KAML with the provided covariate file cfile and kinship file kfile:

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", cfile="CV.txt", kfile="mouse.Kin.txt")
# cfile: covariates file
# kfile: kinship file

Set the sample number sample.num and validation number crv.num for cross_validation:

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", sample.num=2, crv.num=5)
# sample.num: the number of replicates on cross-validation 
# crv.num: fold of cross-validation

Change the top selected number of SNPs Top.num and GWAS model (the options are "MLM", "GLM", "RR"):

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", Top.num=15, GWAS.model="MLM")
# Top.num: max number of top LD-filtered SNPs before pseudo QTNs optimization
# GWAS.model: select the model of Genome-Wide Association Study

Change the methods of variance components estimation vc.method (the options are "brent", "emma", "he"):

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", GWAS.model="MLM", vc.method="brent")
# vc.method: select the method of variance components estimation

Change the start value of grid search procedure of Kinship optimization Top.perc & Logx:

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", GWAS.model="MLM", vc.method="brent",
            Top.perc=c(0.0001,0.001,0.01), Logx=c(0.01,0.05,0.1,0.5,1,5,10,15))
# Top.perc: prior value of the percentage of SNPs which would be given more weights
# Logx: prior value of the base of log function
# Note: More levels of start values will lead to much more calculation burden.

Change the maximum iteration number of bisection algorithm bisection.loop:

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", GWAS.model="MLM", vc.method="brent",
            Top.perc=c(0.0001,0.001,0.01), Logx=c(0.01,0.05,0.1,0.5,1,5,10,15), bisection.loop=8)
# bisection.loop: the max number of iteration for bisection algorithm
# Note: if bisection.loop=0, the bisection procedure will not run.

Advanced

Only to optimize a trait-specific (weighted) kinship matrix:

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", Top.num=NULL)

Only to add the selected pseudo QTNs with big effects as covariates:

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", Top.perc=NULL)

Switch KAML to LMM (GBLUP)

> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", Top.num=NULL, Top.perc=NULL)

Integrate some previously validated causal SNPs of the trait as covariates directly:

# directly predict by LM (QTN)
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", prior.QTN=c(9358, 9375), prior.model="QTN")

# predict by MLM with QTNs and a Kinship optimization procedure (QTN + weighted Kinship)
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", prior.QTN=c(9358, 9375), prior.model="QTN+K")

# predict by MLM with QTNs and a Kinship (QTN + standard Kinship)
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", prior.QTN=c(9358, 9375), 
            prior.model="QTN+K", Top.perc=NULL)

In realistic analysis, we don't know its actual genetic architecture of unknow traits, which could be obtained from a machine learning strategy of KAML. Although those procedures could be speeded up by parallel computation, it's still time-consuming with limited computation resources. So it would be a better choice to run KAML within a smaller population to obtain the parameters, and then apply the optimized parameters to greater populations, which has been proved to be more efficient but generate similar prediction performance in our numbers of tests.

mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", prior.QTN=c(9358, 9375), prior.model="QTN+K",
          Top.perc=0.0276, Logx=3.1094)

Integrate KAML to SSBLUP (SSKAML)

The weighted Kinship matrix can be directly applied into SSBLUP model to improve the prediction accuracy for both genotyped and non-genotyped individuals. To run SSBLUP model, please install our developed tool HIBLUP.

library(hiblup)
library(KAML)
# extract the weighted kinship from KAML
G <- mykaml$K
G.ind <- read.table("mouse.Pheno.txt", head=F)[,1]
# fit SSBLUP
fit <- hiblup(pheno=pheno, pedigree=pedigree, geno=NULL, G=G, geno.ind=G.ind)

OUTPUT

KAML returns total 11 lists of results including: predicted phenotype, coefficients of all fixed effects, predicted GEBVs, genetic variance, residual variance, pseudo QTNs, the used model, the weights of all SNPs when constructing GRM, the optimized top percentage, the optimized base value of Log function and the optimized Kinship matrix.

> str(mykaml)
List of 11
 $ y       : num [1:1631, 1] 1541 1632 1568 1726 1749 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "t1"
 $ beta    : num 1631
 $ gebv    : num [1:1631, 1] -90.019 0.997 -63.775 94.4 117.305 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "t1"
 $ vg      : num 20665
 $ ve      : num 1773
 $ qtn     : NULL
 $ model   : chr "K"
 $ weight  : num [1:45426] 0.984 0.984 0.984 0.984 0.984 ...
 $ top.perc: num 0.00775
 $ logx    : num 15.5
 $ K       : num [1:1631, 1:1631] 0.94824 -0.00781 -0.00558 -0.02665 -0.01755 ...

FAQ and Hints

🆘 Question1: Failing to install "devtools":

ERROR: configuration failed for package ‘git2r’

removing ‘/Users/acer/R/3.4/library/git2r’

ERROR: dependency ‘git2r’ is not available for package ‘devtools’

removing ‘/Users/acer/R/3.4/library/devtools’

😋 Answer: Please type the following codes in terminal.

apt-get install libssl-dev/unstable

🆘 Question2: When installing packages from Github with "devtools", there is a error:

Error in curl::curl_fetch_disk(url, x$path, handle = handle): Problem with the SSL CA cert (path? access rights?)

😋 Answer: Please type the following codes and then try again.

library(httr)
set_config(config(ssl_verifypeer = 0L))

Questions, suggestions, and bug reports are welcome and appreciated. ➡️

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

YinLiLin / KAML

Programming Languages

Labels