Title: k-Prototypes Clustering for Mixed Variable-Type Data
Description: Functions to perform k-prototypes partitioning clustering for mixed variable-type data according to Z. Huang (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and Knowledge Discovery 2, 283-304.
Authors: Gero Szepannek [aut, cre], Rabea Aschenbruck [aut]
Maintainer: Gero Szepannek <[email protected]>
License: GPL (>= 2)
Version: 0.4-2
Built: 2024-10-29 05:01:45 UTC
Source: https://github.com/g-rho/clustmixtype
Visualization of a k-prototypes clustering result for cluster interpretation.
clprofiles(object, x, vars = NULL, col = NULL)
object: Object resulting from a call of kproto.
x: Original data.
vars: Optional vector of either column indices or variable names.
col: Palette of cluster colours to be used for the plots. By default a palette from RColorBrewer is used.
For numerical variables boxplots and for factor variables barplots of each cluster are generated.
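These per-cluster plots can be imitated for a single variable in base R. The helper name profile_var below is made up for illustration; clprofiles itself handles colour palettes and loops over all (selected) variables:

```r
# Sketch: per-cluster boxplot for a numeric variable and
# per-cluster barplot for a factor, in the spirit of clprofiles().
profile_var <- function(clusters, v) {
  if (is.numeric(v)) {
    boxplot(v ~ clusters, xlab = "cluster")
  } else {
    barplot(table(v, clusters), beside = TRUE, xlab = "cluster")
  }
}

set.seed(3)
cl <- rep(1:2, each = 20)
profile_var(cl, c(rnorm(20, 0), rnorm(20, 3)))          # boxplots
profile_var(cl, factor(sample(c("A","B"), 40, TRUE)))   # barplots
```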
# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1, x2, x3, x4)

# apply k-prototypes
kpres <- kproto(x, 4)
clprofiles(kpres, x)

# in real-world data clusters are often not as clear-cut;
# varying lambda shifts the emphasis towards factor / numeric variables
kpres <- kproto(x, 2)
clprofiles(kpres, x)
kpres <- kproto(x, 2, lambda = 0.1)
clprofiles(kpres, x)
kpres <- kproto(x, 2, lambda = 25)
clprofiles(kpres, x)
Computes k-prototypes clustering for mixed-type data.
kproto(x, ...)

## Default S3 method:
kproto(
  x,
  k,
  lambda = NULL,
  type = "huang",
  iter.max = 100,
  nstart = 1,
  na.rm = "yes",
  keep.data = TRUE,
  verbose = TRUE,
  init = NULL,
  p_nstart.m = 0.9,
  ...
)
x: Data frame with both numerics and factors (ordered factors are possible as well).
...: Currently not used.
k: Either the number of clusters, a vector specifying indices of initial prototypes, or a data frame of prototypes with the same columns as x.
lambda: Parameter > 0 to trade off between the Euclidean distance of numeric variables and the simple matching coefficient between categorical variables (if type = "huang").
type: Character, to specify the distance used for clustering. Either "huang" or "gower".
iter.max: Numeric; maximum number of iterations if there is no convergence beforehand.
nstart: Numeric; if > 1, repeated computations with random initializations are performed and the result with the minimum total within-cluster distance (tot.withinss) is returned.
na.rm: Character, either "yes" to remove observations with missing values before clustering, "no" to keep them and ignore the missing entries during distance computation (see Details), or an imputation strategy ("imp.internal" or "imp.onestep"; cf. Aschenbruck et al., 2022).
keep.data: Logical, whether the original data should be included in the returned object.
verbose: Logical, whether additional information about the process should be printed. Caution: for verbose = FALSE no information is given if the number of clusters is reduced during the algorithm.
init: Character, to specify the initialization strategy. Either "nbh.dens", "sel.cen" or "nstart.m".
p_nstart.m: Numeric, probability (default 0.9) used to determine the number of randomly chosen sets of initial prototypes for init = "nstart.m".
Like k-means, the k-prototypes algorithm iteratively recomputes cluster prototypes and reassigns clusters. With type = "huang" observations are assigned to clusters using the distance d(x, y) = sum_{j numeric} (x_j - y_j)^2 + lambda * sum_{j categorical} I(x_j != y_j), i.e. the squared Euclidean distance over the numeric variables plus lambda times the simple matching distance over the factors. Cluster prototypes are computed as cluster means for numeric variables and modes for factors (cf. Huang, 1998). Ordered factors are treated as categorical variables.

For type = "gower" range-normalized absolute distances from the cluster median are computed for the numeric variables (and for the ranks of the ordered factors, respectively). For factors the simple matching distance is used, as in the original k-prototypes algorithm. The prototypes are given by the median for numeric variables, the mode for factors, and the level with the closest rank to the median rank of the corresponding cluster (cf. Szepannek et al., 2024).
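The Huang distance (squared Euclidean part plus lambda times simple matching) can be written down in a few lines of plain R. This is an illustrative sketch only; the helper name dist_huang is made up, and kproto's internal implementation is vectorized and differs:

```r
# Minimal sketch of the Huang (1998) mixed-type distance between one
# observation and one prototype (both single-row data frames):
# squared Euclidean distance over numerics plus lambda times
# simple matching distance over factors.
dist_huang <- function(obs, proto, lambda = 1) {
  is_num <- sapply(obs, is.numeric)
  d_num <- sum((unlist(obs[is_num]) - unlist(proto[is_num]))^2)
  d_cat <- sum(mapply(function(a, b) a != b, obs[!is_num], proto[!is_num]))
  d_num + lambda * d_cat
}

obs   <- data.frame(x1 = factor("A", levels = c("A","B")), x3 = 1.0)
proto <- data.frame(x1 = factor("B", levels = c("A","B")), x3 = 0.0)
dist_huang(obs, proto, lambda = 2)  # 1^2 + 2 * 1 = 3
```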
In case of na.rm = FALSE, for each observation the variables with missings are ignored (i.e. only the remaining variables are considered for distance computation). As a consequence, for observations with missings this may change the variables' weighting compared to the one specified by lambda. For these observations distances to the prototypes will typically be smaller, as they are based on fewer variables.
The type argument also accepts the input "standard", but this naming is deprecated and has been renamed to "huang". Please use "huang" instead.
A kmeans-like object of class kproto:
cluster: Vector of cluster memberships.
centers: Data frame of cluster prototypes.
lambda: Distance parameter lambda.
size: Vector of cluster sizes.
withinss: Vector of within-cluster distances for each cluster, i.e. summed distances of all observations belonging to a cluster to their respective prototype.
tot.withinss: Target function: sum of all observations' distances to their corresponding cluster prototype.
dists: Matrix with distances of observations to all cluster prototypes.
iter: Prespecified maximum number of iterations.
trace: List with two elements (vectors) tracing the iteration process.
inits: Initial prototypes determined by the specified initialization strategy; only returned if init is either "nbh.dens" or "sel.cen".
nstart.m: Only for init = "nstart.m": the determined number of randomly chosen sets.
data: If keep.data = TRUE the original data is added to the output list.
type: Type argument of the function call.
stdization: Only returned for type = "gower": standardization information (ranks/ranges) required by predict.kproto.
Szepannek, G. (2018): clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal 10/2, 200-208, doi:10.32614/RJ-2018-048.
Aschenbruck, R., Szepannek, G., Wilhelm, A. (2022): Imputation Strategies for Clustering Mixed-Type Data with Missing Values, Journal of Classification, doi:10.1007/s00357-022-09422-y.
Szepannek, G., Aschenbruck, R., Wilhelm, A. (2024): Clustering Large Mixed-Type Data with Ordinal Variables, Advances in Data Analysis and Classification, doi:10.1007/s11634-024-00595-5.
Huang, Z. (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and Knowledge Discovery 2, 283-304.
# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1, x2, x3, x4)

# apply k-prototypes
kpres <- kproto(x, 4)
clprofiles(kpres, x)

# in real-world data clusters are often not as clear-cut;
# varying lambda shifts the emphasis towards factor / numeric variables
kpres <- kproto(x, 2)
clprofiles(kpres, x)
kpres <- kproto(x, 2, lambda = 0.1)
clprofiles(kpres, x)
kpres <- kproto(x, 2, lambda = 25)
clprofiles(kpres, x)
Investigation of the variables' variances/concentrations to support specification of lambda for k-prototypes clustering.
lambdaest(
  x,
  num.method = 1,
  fac.method = 1,
  outtype = "numeric",
  verbose = TRUE
)
x: Data frame with both numerics and factors.
num.method: Integer, 1 or 2. Specifies the heuristic used for numeric variables.
fac.method: Integer, 1 or 2. Specifies the heuristic used for factor variables.
outtype: Specifies the desired output: either "numeric", "vector" or "variation".
verbose: Logical, whether additional information about the process should be printed.
The variance (num.method = 1) or standard deviation (num.method = 2) of the numeric variables is computed, and for the factors the concentration 1 - sum_i p_i^2 (fac.method = 1) or 1 - max_i p_i (fac.method = 2), where the p_i are the relative frequencies of the factor levels.
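This default heuristic can be sketched by hand. The helper lambda_by_hand below is hypothetical (assuming num.method = 1 and fac.method = 1) and only approximates what lambdaest computes:

```r
# Sketch of the default lambdaest() heuristic: average variance of the
# numeric variables divided by the average concentration 1 - sum(p_i^2)
# of the factor variables.
lambda_by_hand <- function(x) {
  is_num <- sapply(x, is.numeric)
  v_num <- sapply(x[is_num], var)
  v_fac <- sapply(x[!is_num], function(f) {
    p <- table(f) / length(f)   # relative level frequencies
    1 - sum(p^2)
  })
  mean(v_num) / mean(v_fac)
}

set.seed(1)
x <- data.frame(a = rnorm(50), f = factor(sample(c("A","B"), 50, TRUE)))
lambda_by_hand(x)
```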
lambda: Ratio of the averages over all numeric/factor variables is returned. In case of outtype != "numeric" a vector of the variable-specific values is returned instead.
# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1, x2, x3, x4)

lambdaest(x)
res <- kproto(x, 4, lambda = lambdaest(x))
Plot distributions of the clusters across the variables.
## S3 method for class 'kproto'
plot(x, ...)
x: Object resulting from a call of kproto.
...: Additional arguments to be passed to clprofiles.
Wrapper around clprofiles. Only works for kproto objects created with keep.data = TRUE.
# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1, x2, x3, x4)

# apply k-prototypes
kpres <- kproto(x, 4)
plot(kpres, vars = c("x1","x3"))
Predicts k-prototypes cluster memberships and distances for new data.
## S3 method for class 'kproto'
predict(object, newdata, ...)
object: Object resulting from a call of kproto.
newdata: New data frame (of the same structure) where cluster memberships are to be predicted.
...: Currently not used.
A kmeans-like object of class kproto:
cluster: Vector of cluster memberships.
dists: Matrix with distances of observations to all cluster prototypes.
# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1, x2, x3, x4)

# apply k-prototypes
kpres <- kproto(x, 4)
predicted.clusters <- predict(kpres, x)
Calculates the stability of a k-prototypes clustering with k clusters, or computes the stability-based optimal number of clusters for k-prototypes clustering. Possible stability indices are: Jaccard, Rand, Fowlkes & Mallows and Luxburg.
stability_kproto(
  object,
  method = c("rand", "jaccard", "luxburg", "fowlkesmallows"),
  B = 100,
  verbose = FALSE,
  ...
)
object: Object of class kproto resulting from a call of kproto(..., keep.data = TRUE).
method: Character specifying the stability index, either one or more of "rand", "jaccard", "luxburg" and/or "fowlkesmallows".
B: Numeric, number of bootstrap samples.
verbose: Logical, whether information about the bootstrap procedure should be given.
...: Further arguments passed to kproto.
The output contains the stability for a given k-prototypes clustering in a list with two elements:
kp_stab: Stability values for the given clustering.
kp_bts_stab: Stability values for each bootstrap sample.
Rabea Aschenbruck
Aschenbruck, R., Szepannek, G., Wilhelm, A.F.X (2023): Stability of mixed-type cluster partitions for determination of the number of clusters. Submitted.
von Luxburg, U. (2010): Clustering stability: an overview. Foundations and Trends in Machine Learning, Vol 2, Issue 3. doi:10.1561/2200000008.
Ben-Hur, A., Elisseeff, A., Guyon, I. (2002): A stability based method for discovering structure in clustered data. Pacific Symposium on Biocomputing. doi:10/bhfxmf.
## Not run:
# generate toy data with factors and numerics
n <- 10
prb <- 0.99
muk <- 2.5
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1, x2, x3, x4)

# apply k-prototypes
kpres <- kproto(x, 4, keep.data = TRUE)

# calculate cluster stability
stab <- stability_kproto(method = c("luxburg","fowlkesmallows"), object = kpres)
## End(Not run)
Summary of a k-prototypes clustering result: investigation of the variables' distributions within the clusters.
## S3 method for class 'kproto'
summary(object, data = NULL, pct.dig = 3, ...)
object: Object of class kproto.
data: Optional data set to be analyzed. If NULL the data stored within the kproto object is used (requires keep.data = TRUE).
pct.dig: Number of digits for rounding percentages of factor variables.
...: Further arguments to be passed to the internal call of summary() for the numeric variables.
For numeric variables statistics are computed for each cluster using summary(). For categorical variables distribution percentages are computed.
List where each element corresponds to one variable. Each row of any element corresponds to one cluster.
# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1, x2, x3, x4)

res <- kproto(x, 4)
summary(res)
Calculates the preferred validation index for a k-prototypes clustering with k clusters, or computes the optimal number of clusters based on the chosen index for k-prototypes clustering. Possible validation indices are: cindex, dunn, gamma, gplus, mcclain, ptbiserial, silhouette and tau.
validation_kproto(
  method = "silhouette",
  object = NULL,
  data = NULL,
  type = "huang",
  k = NULL,
  lambda = NULL,
  kp_obj = "optimal",
  verbose = FALSE,
  ...
)
method: Character specifying the validation index: "cindex", "dunn", "gamma", "gplus", "mcclain", "ptbiserial", "silhouette" (default) or "tau".
object: Object of class kproto resulting from a call of kproto(..., keep.data = TRUE).
data: Original data; only required if no object of class kproto is given.
type: Character, to specify the distance used for clustering; either "huang" or "gower".
k: Vector specifying the search range for the optimal number of clusters; if NULL a default search range is used.
lambda: Factor to trade off between the Euclidean distance of numeric variables and the simple matching coefficient between categorical variables.
kp_obj: Character, either "optimal" or "all": return the index-optimal clustering (kp_obj = "optimal") or all computed cluster partitions (kp_obj = "all"); only relevant if the optimal number of clusters is determined.
verbose: Logical, whether additional information about the process should be printed.
...: Further arguments passed to kproto, e.g. nstart = 5.
More information about the implemented validation indices:
cindex
C-index = (S_w - S_min) / (S_max - S_min).
For S_min and S_max it is necessary to calculate the distances between all pairs of points in the entire data set (n(n-1)/2 pairs). S_min is the sum of the "total number of pairs of objects belonging to the same cluster" (N_w) smallest distances and S_max is the sum of the N_w largest distances. S_w is the sum of the within-cluster distances.
The minimum value of the index is used to indicate the optimal number of clusters.
dunn
Dunn index = min_{i != j} d(C_i, C_j) / max_k diam(C_k).
The dissimilarity between two clusters C_i and C_j is defined as d(C_i, C_j) = min_{x in C_i, y in C_j} d(x, y) and the diameter of a cluster is defined as diam(C_k) = max_{x, y in C_k} d(x, y).
The maximum value of the index is used to indicate the optimal number of clusters.
gamma
Gamma index = (s(+) - s(-)) / (s(+) + s(-)).
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. s(+) is the number of concordant comparisons and s(-) is the number of discordant comparisons. A comparison is called concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity.
The maximum value of the index is used to indicate the optimal number of clusters.
gplus
G(+) index = 2 * s(-) / (N_t * (N_t - 1)), where N_t is the total number of pairs of objects in the data.
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. s(-) is the number of discordant comparisons; a comparison is called discordant if a within-cluster dissimilarity is strictly greater than a between-cluster dissimilarity.
The minimum value of the index is used to indicate the optimal number of clusters.
mcclain
McClain index = mean(S_w) / mean(S_b), where mean(S_w) is the sum of within-cluster distances divided by the number of within-cluster distances and mean(S_b) is the sum of between-cluster distances divided by the number of between-cluster distances.
The minimum value of the index is used to indicate the optimal number of clusters.
ptbiserial
Ptbiserial index = (mean(S_b) - mean(S_w)) * sqrt(N_w * N_b / N_t^2) / s_d.
mean(S_w) is the sum of within-cluster distances divided by the number of within-cluster distances and mean(S_b) is the sum of between-cluster distances divided by the number of between-cluster distances. N_t is the total number of pairs of objects in the data, N_w is the total number of pairs of objects belonging to the same cluster and N_b is the total number of pairs of objects belonging to different clusters. s_d is the standard deviation of all distances.
The maximum value of the index is used to indicate the optimal number of clusters.
silhouette
s(i) = (b(i) - a(i)) / max(a(i), b(i)).
a(i) is the average dissimilarity of the i-th object to all other objects of its same/own cluster. b(i) = min(d(i, C)), where d(i, C) is the average dissimilarity of the i-th object to all objects of a cluster C other than its own.
The maximum value of the index is used to indicate the optimal number of clusters.
tau
Tau index = (s(+) - s(-)) / sqrt((N_t * (N_t - 1) / 2 - t) * (N_t * (N_t - 1) / 2)).
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. s(+) is the number of concordant comparisons and s(-) is the number of discordant comparisons. A comparison is called concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity. N_t is the total number of distances and t is the number of comparisons of two pairs of objects where both pairs represent within-cluster comparisons or both pairs are between-cluster comparisons.
The maximum value of the index is used to indicate the optimal number of clusters.
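As an illustration, the McClain index described above can be computed by hand from a distance matrix and a vector of cluster labels. The helper mcclain_index is made up for this sketch and is not the package's implementation:

```r
# Sketch of the McClain index: mean within-cluster distance divided
# by mean between-cluster distance (smaller values indicate a better
# partition).
mcclain_index <- function(d, cl) {
  d <- as.matrix(d)
  same <- outer(cl, cl, "==")   # TRUE for pairs in the same cluster
  ut <- upper.tri(d)            # count each pair of objects once
  w <- d[ut & same]             # within-cluster distances
  b <- d[ut & !same]            # between-cluster distances
  mean(w) / mean(b)
}

set.seed(2)
pts <- c(rnorm(5, 0), rnorm(5, 10))
cl  <- rep(1:2, each = 5)
mcclain_index(dist(pts), cl)    # well-separated clusters -> value well below 1
```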
For computing the optimal number of clusters based on the chosen validation index for k-prototypes clustering the output contains:
k_opt: Optimal number of clusters (sampled in case of ambiguity).
index_opt: Index value of the index-optimal clustering.
indices: Calculated index values for all considered numbers of clusters.
kp_obj: If kp_obj == "optimal" the kproto object of the index-optimal clustering; if kp_obj == "all" all kproto objects which were calculated.
For computing the index value for a given k-prototypes clustering the output contains:
index: Calculated index value.
Rabea Aschenbruck
Aschenbruck, R., Szepannek, G. (2020): Cluster Validation for Mixed-Type Data. Archives of Data Science, Series A, Vol 6, Issue 1. doi:10.5445/KSP/1000098011/02.
Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A. (2014): NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, Vol 61, Issue 6. doi:10.18637/jss.v061.i06.
## Not run:
# generate toy data with factors and numerics
n <- 10
prb <- 0.99
muk <- 2.5
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1, x2, x3, x4)

# calculate the optimal number of clusters, index values and cluster partition with the silhouette index
val <- validation_kproto(method = "silhouette", data = x, k = 3:5, nstart = 5)

# apply k-prototypes
kpres <- kproto(x, 4, keep.data = TRUE)

# calculate the cindex value for the given cluster partition
cindex_value <- validation_kproto(method = "cindex", object = kpres)
## End(Not run)