R/ipcaps.R
ipcaps.Rd
This version supports ordinal data which can be applied directly to SNP data to identify fine-scale population structure. It was built on the iterative pruning Principal Component Analysis (ipPCA) algorithm (Intarapanich et al., 2009; Limpiti et al., 2011). The IPCAPS involves an iterative process using multiple splits based on multivariate Gaussian mixture modeling of principal components and Clustering EM estimation (Lebret et al., 2015). In each iteration, rough clusters and outliers are also identified using our own method called rubikclust (R package KRIS).
ipcaps( bed = NA, rdata = NA, files = NA, label.file = NA, lab.col = 1, out, plot.as.pdf = FALSE, method = "mix", missing = NA, covariate = NA, cov.col.first = NA, cov.col.last = NA, threshold = 0.18, min.fst = 8e-04, min.in.group = 20, no.plot = FALSE )
bed | A PLINK binary format consists of 3 files; bed, bim, and fam. To generate these files from PLINK, use option –make-bed. See more details at: harvard.edu. |
---|---|
rdata | In case of re-analysis, it is convenient to run IPCAPS using the file rawdata.RData generated by IPCAPS. This file contains a matrix of SNPs (raw.data) and a vector of labels (label). |
files | IPCAPS supports SNPs encoded as 0, 1 and 2 (dosage encoding). Rows represent SNPs and columns represent subjects. Each column needs to be separated by a space or a tab. A big text file should be divided into smaller files to load faster. For instance, to input 3 files, use as: files=c( 'input1.txt', 'input2.txt', 'input3.txt'). |
label.file | An additional useful information (called "labels" in IPCAPS) related subject, for example, geographic location or disease phenotype. These labels (one at a time) are used in displaying the clustering outcome of IPCAPS. A label file must contain at least one column. However, it may contain more than one column in which case each column need to be separated by a space or a tab. |
lab.col | The label in the label file to be used in the tree-like display of IPCAPS clustering results. |
out | To set an absolute path for IPCAPS output. If the specified output directory already exists, result files are saved in sub-directories cluster_out, cluster_out1, cluster_out2, etc. |
plot.as.pdf | To export plots as PDF. When omitted, plots are saved as PNG. |
method | The internal clustering method. It can be set to "mix" (rubikclust & mixmod), "mixmod" (Lebret et al., 2015), "clara" (R: Clustering Large Applications), "pam" (R: Partitioning Around Medoids (PAM) Object), "meanshift" (Wang, 2016), "apcluster" (Bodenhofer et al., 2016), and "hclust" (R: Hierarchical Clustering). Default = "mix". |
missing | Symbol used for missing genotypes. Default = NA. |
covariate | A file of covariates; one covariate per column. SNPs can be adjusted for these covariates via regression modeling and residual computation. |
cov.col.first | Refer to a covariate file, the first covariate to be considered as confounding variable. |
cov.col.last | Refer to a covariate file, the last covariate to be considered as confounding variable. All the variables in between the cov.col.first and cov.col.last will be considered in the adjustment process. |
threshold | Cutoff value for EigenFit. Possible values range from 0.03 to 0.18. The smaller, the potentially finer the substructure can be. Default = 0.18. |
min.fst | Minimum Fst between a pair of subgroups. Default = 0.0008. |
min.in.group | Minimum number of individuals to constitute a cluster or subgroup. Default = 20. |
no.plot | No plot is generated if this option is TRUE. This option is useful when the system does not support X Windows in the unix based system. Default = FALSE. |
Returns the list object containing 2 internal objects; output.dir as class character and cluster as class data.frame. The object output.dir stores a result directory. The object cluster contains 4 columns, group, node, label, and row.number. The column group contains the assigned groups from IPCAPS. The column node contains node numbers in a tree as illustrated in the HTML result files. The column label contains the given labels that is set to the parameter label. The column row.number contains indices to an input data, which is matched to row numbers of input matrix. All clustering result files are saved in one directory (as specified by out) containing assigned groups, hierarchical trees of group membership, PC plots, and binary data for further analysis.
$snp
is a SNP matrix from BED file.
$snp.info
is a data.frame of SNP information from BIM file.
$ind.info
is a data.frame of individual information from FAM file.
If function return NULL
, it means the input files are not in proper
format.
The computational time depends on the number of individuals. Consequentially, if large data sets are analyzed, it may be necessary first to split data into smaller files, and then load into computer memory (with parameter 'files'). Internally, the split files are merged to construct a com-prehensive matrix.
Bodenhofer, U., Palme, J., Melkonian, C., and Kothmeier, A. (2016). apcluster : Affinity Propagation Clustering. Available at CRAN (Accessed March 7, 2017).
Intarapanich, A., Shaw, P. J., Assawamakin, A., Wangkumhang, P., Ngamphiw, C. , Chaichoompu, K., et al. (2009). Iterative pruning PCA improves resolution of highly structured populations. BMC Bioinformatics 10, 382. doi:10.1186/ 1471-2105-10-382.
Lebret, R., Iovleff, S., Langrognet, F., Biernacki, C., Celeux, G., and Govaert, G. (2015). Rmixmod: TheRPackage of the Model-Based Unsupervised, Supervised, and Semi-Supervised ClassificationMixmodLibrary. J. Stat. Softw. 67. doi:10.18637/jss.v067.i06.
Limpiti, T., Intarapanich, A., Assawamakin, A., Shaw, P. J., Wangkumhang, P., Piriyapongsa, J., et al. (2011). Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure. BMC Bioinformatics 12, 255. doi:10.1186/1471-2105-12- 255.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2017). cluster: Cluster Analysis Basics and Extensions.
R: Clustering Large Applications Available at: ethz.ch (Accessed March 7, 2017).
R Core Team (2017). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing Available at: r-project.org.
R: Hierarchical Clustering Available at: ethz.ch (Accessed March 7, 2017).
R: Partitioning Around Medoids (PAM) Object Available at: ethz.ch (Accessed March 7, 2017).
Wang, M. C. and D. (2016). MeanShift: Clustering via the Mean Shift Algorithm. Available at: CRAN (Accessed March 7, 2017).
#Use the BED format as input #Importantly, bed file, bim file, and fam file are required #Use the example files embedded in the package BED.file <- system.file("extdata", "ipcaps_example.bed", package = "IPCAPS") LABEL.file <- system.file("extdata", "ipcaps_example_individuals.txt.gz", package = "IPCAPS") my.cluster1 <- ipcaps(bed = BED.file, label.file = LABEL.file, lab.col = 2, out = tempdir())#> Running ... IPCAPS #> output: /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk #> label file: /private/var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T/RtmpYTlCbr/temp_libpath1351142c69ef5/IPCAPS/extdata/ipcaps_example_individuals.txt.gz #> label column: 2 #> threshold: 0.18 #> minimum Fst: 8e-04 #> maximum thread: 1 #> method: mix #> bed: /private/var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T/RtmpYTlCbr/temp_libpath1351142c69ef5/IPCAPS/extdata/ipcaps_example.bed #> minimum in group: 20 #> data type: snp #> missing: NA #> In preprocessing step #> Note: the directory '/var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk' is existed. #> The result files will be saved to this directory: /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2 #> Input data: 103 individuals, 2000 markers #> Start calculating #> Node 1: Start the process #> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/condition.RData #> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/rawdata.RData #> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node1.RData #> Node 1: Check for splitting #> Node 1: Reducing matrix #> Node 1: EigenFit = 0.652486925995028, Threshold = 0.18, no. significant PCs = 3 #> Node 1: Start clustering #> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node1.RData #> Node 1: done for clustering #> Time difference of 0.08552504 secs #> Node 1: Checking Fst of group 1 and group 2 #> Node 1: Checking Fst of group 1 and group 3 #> Node 1: Checking Fst of group 1 and group 4 #> Node 1: Checking Fst of group 2 and group 3 #> Node 1: Checking Fst of group 2 and group 4 #> Node 1: Checking Fst of group 3 and group 4 #> Node 1 times took 0.279081106185913 secs #> Node 1: 103 individuals were splitted into: 4 groups #> Node 1: Return status 0 #> Node 1: Split to sub-nodes #> Node 1: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node2.RData #> Node 1: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node3.RData #> Node 1: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node4.RData #> Node 1: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node5.RData #> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 1: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 1: Done! #> Time difference of 0.4096 secs #> Node 2: Start the process #> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/condition.RData #> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/rawdata.RData #> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node2.RData #> Node 2: Check for splitting #> Node 2: A number of node is lower than the minimum number (20), therefore split was not performed #> Node 2: Return status 1 #> Node 2: No split was performed, Status=1 #> Node 2: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/leafnode.RData #> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 2: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 2: Done! #> Time difference of 0.02913404 secs #> Node 3: Start the process #> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/condition.RData #> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/rawdata.RData #> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node3.RData #> Node 3: Check for splitting #> Node 3: A number of node is lower than the minimum number (20), therefore split was not performed #> Node 3: Return status 1 #> Node 3: No split was performed, Status=1 #> Node 3: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/leafnode.RData #> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 3: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 3: Done! #> Time difference of 0.03057694 secs #> Node 4: Start the process #> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/condition.RData #> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/rawdata.RData #> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node4.RData #> Node 4: Check for splitting #> Node 4: A number of node is lower than the minimum number (20), therefore split was not performed #> Node 4: Return status 1 #> Node 4: No split was performed, Status=1 #> Node 4: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/leafnode.RData #> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 4: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 4: Done! #> Time difference of 0.03049421 secs #> Node 5: Start the process #> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/condition.RData #> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/rawdata.RData #> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node5.RData #> Node 5: Check for splitting #> Node 5: Reducing matrix #> Node 5: EigenFit = 0.343228072167861, Threshold = 0.18, no. significant PCs = 3 #> Node 5: Start clustering #> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node5.RData #> Node 5: done for clustering #> Time difference of 0.112169 secs #> Node 5: Checking Fst of group 1 and group 2 #> Node 5 times took 0.143810987472534 secs #> Node 5: 100 individuals were splitted into: 2 groups #> Node 5: Return status 0 #> Node 5: Split to sub-nodes #> Node 5: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node6.RData #> Node 5: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node7.RData #> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 5: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 5: Done! #> Time difference of 0.2912269 secs #> Node 6: Start the process #> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/condition.RData #> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/rawdata.RData #> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node6.RData #> Node 6: Check for splitting #> Node 6: Reducing matrix #> Node 6: EigenFit = 0.0221170820186458, Threshold = 0.18, no. significant PCs = #> Node 6: No split was performed because EigenFit is lower than threshold #> Node 6: Return status 1 #> Node 6: No split was performed, Status=1 #> Node 6: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/leafnode.RData #> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 6: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 6: Done! #> Time difference of 0.1155739 secs #> Node 7: Start the process #> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/condition.RData #> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/rawdata.RData #> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/node7.RData #> Node 7: Check for splitting #> Node 7: Reducing matrix #> Node 7: EigenFit = 0.0367738545252401, Threshold = 0.18, no. significant PCs = #> Node 7: No split was performed because EigenFit is lower than threshold #> Node 7: Return status 1 #> Node 7: No split was performed, Status=1 #> Node 7: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/leafnode.RData #> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 7: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/RData/tree.RData #> Node 7: Done! #> Time difference of 0.106158 secs #> In post process step #> Exporting node 2 as group 1 #> Exporting node 3 as group 2 #> Exporting node 4 as group 3 #> Exporting node 6 as group 4 #> Exporting node 7 as group 5 #> Note: save as /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2/groups.txt#> The result files were saved at: /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//Rtmpvv9uwk/cluster_output2 #> Total runtime is 2.00170683860779 sec#> #> 1 2 3 4 5 #> outlier3 1 1 1 0 0 #> pop1 0 0 0 50 0 #> pop2 0 0 0 0 50# Alternatively, use a text file as input # Use the example files embedded in the package #text.file <- system.file("extdata", "ipcaps_example_rowVar_colInd.txt.gz", # package="IPCAPS") #LABEL.file <- system.file("extdata", "ipcaps_example_individuals.txt.gz", # package="IPCAPS") #my.cluster2 <- ipcaps(files = c(text.file), label.file = LABEL.file, lab.col = 2, # out=tempdir()) #table(my.cluster2$cluster$label, my.cluster2$cluster$group) # The other alternative way, use an R Data file as input # Use the example file embedded in the package #rdata.file <- system.file("extdata", "ipcaps_example.rda", package = "IPCAPS") #my.cluster3 <- ipcaps(rdata = rdata.file, out = tempdir()) #table(my.cluster3$cluster$label, my.cluster3$cluster$group)