Cluster analysis of mixed data types: Case study of segmenting ICT Skills among youths and adults

Molo Muli

2022-07-19

Context

The data context falls under United Nations SDG 4, which aims to ensure inclusive and equitable education and promote lifelong learning opportunities for all. Observations in the data are percentage values of youths/adults who used 9 selected ICT skills. An individual is considered to have ICT skills if s/he used at least one of the 9 ICT skills in the past 3 months. The data set was collected using the Multiple Indicator Cluster Survey (MICS) and is hosted on UNICEF's open data portal.

Disclaimer: All reasonable precautions have been taken to verify the information in this database. In no event shall UNICEF be liable for damages arising from its use or interpretation.

Goals/Objectives

The objective of this analysis is to segment the sampled countries by the uptake of ICT skills among their youths and adults, using a clustering algorithm suited to mixed (numeric and categorical) data.

Data

The data set contains 15 columns and 285 observations; 73% of the columns are numerical and 27% are categorical. The sample covers 35 countries across 8 UNICEF regions.
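
As a quick structural check, the counts above can be verified after loading the data (the file path below is an assumption, not the original source location):

# Assumed file name/path for the MICS ICT-skills extract (illustrative only)
ICTSkills <- read.csv("Data/ICTSkills.csv")

dim(ICTSkills)                        # expect 285 rows, 15 columns
table(sapply(ICTSkills, is.numeric))  # FALSE = categorical, TRUE = numeric columns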

Descriptive Analytics

From the descriptive statistics table (fig 1.0), the ICT skill with the highest variation is writing a computer program using any programming language. This makes sense, as programming is a very specific skill set compared to the other, more general ICT skills.

fig 1.0. Table of descriptive statistics

# Descriptive statistics of the numerical variables using dlookr::diagnose_numeric()
library(dlookr)
library(dplyr)

NumericDescriptives <- diagnose_numeric(ICTSkills)
NumericDescriptives <- NumericDescriptives %>%
  select(-c(min, minus)) %>%   # drop columns not needed for the summary
  arrange(desc(outlier))       # rank variables by number of outliers

# Load the rendered output table
knitr::include_graphics("Outputs/DescriptiveStats.png")

From the histograms, all numeric variables have a right-skewed distribution. A probable reason is that the uptake of these skills was high initially but slowed down over the three-month period. This is called the startup effect (https://www.itl.nist.gov/div898/handbook/eda/section3/histogr6.htm).

fig 2.0. Histogram of all numeric variables

# Histogram of all numerical variables
library(tidyr)
library(ggplot2)

NumericalDistribution <- ICTSkills %>%
  gather(variable, value, -c(1:4, 14)) %>%   # reshape to long format, excluding non-numeric columns
  ggplot(aes(x = value)) +
  geom_histogram(fill = "lightblue2", color = "black") +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Distribution of numeric variables", x = "Values", y = "Frequency") +
  theme_minimal()

NumericalDistribution

Cluster analysis

Clustering is an unsupervised machine learning technique that groups observations such that those within a cluster are similar to each other while the clusters themselves are dissimilar from one another. Clustering is mostly used for data mining and/or as a preliminary step for further analysis and inference.

Clustering algorithms are data-type specific. In this case the data has both numerical and categorical observations, hence the appropriate algorithm is K-prototypes, an extension of both the K-means and K-modes algorithms.

K-prototypes belongs to the family of partitional clustering algorithms. It measures distance between numerical features using squared Euclidean distance (like K-means) and distance between categorical features using the number of mismatching categories (like K-modes). It was first published by Huang in 1998 (https://link.springer.com/article/10.1023/A:1009769707641).
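
Concretely, for an observation x and a cluster prototype p with q numeric features out of m features in total, the K-prototypes dissimilarity combines a squared Euclidean part and a simple-matching part weighted by λ:

$$d(x, p) = \sum_{j=1}^{q} (x_j - p_j)^2 + \lambda \sum_{j=q+1}^{m} \delta(x_j, p_j), \qquad \delta(a, b) = \begin{cases} 0 & \text{if } a = b \\ 1 & \text{if } a \neq b \end{cases}$$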

Prior to clustering, one needs to derive the optimal number of clusters. Since there's no established procedure for this with K-prototypes, the approach below iterates over candidate numbers of clusters and plots a scree plot, applying the elbow method (https://stats.stackexchange.com/questions/293877/optimal-number-of-clusters-using-k-prototypes-method-in-r).
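
The scree plot below references wss and k.max, which need to be computed first. A minimal sketch of that step, following the linked approach and assuming the clustMixType package (k.max = 10 is an assumed upper bound):

# Total within-cluster sum of squares for each candidate number of clusters
library(clustMixType)

k.max <- 10  # assumed upper bound on the number of clusters to try
wss <- sapply(1:k.max, function(k) {
  kproto(ICTSkills, k = k, na.rm = TRUE, verbose = FALSE)$tot.withinss
})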

fig 3.0 Scree plot showing the optimum number of clusters

# Scree plot: the elbow indicates the optimal number of clusters
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

From figure 3.0, the elbow occurs at k = 3, so I'll cluster the data using 3 partitions.

# Fit the K-prototypes cluster model with k = 3
ict_clusters <- kproto(ICTSkills, k = 3, lambda = NULL, iter.max = 100,
                       nstart = 1, na.rm = TRUE, verbose = TRUE)

Visualisation of the clusters

# Visualise the clusters
clprofiles(ict_clusters, ICTSkills)

Knowledge discovery

Some of the clusters derived include:

Extension to the algorithm

As an extension to the algorithm, one can run lambdaest() to investigate the variables' variances (numeric) and concentrations (categorical) in order to specify lambda for K-prototypes clustering more precisely. The function signature is

lambdaest(x, num.method = 1, fac.method = 1, outtype = "numeric")

where x is the original data frame.
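
A minimal usage sketch on this data (refitting with the estimated λ is illustrative, not a step in the original analysis):

# Estimate lambda from the variables' variances (numeric) and concentrations (categorical)
lambda_hat <- lambdaest(ICTSkills)
lambda_hat  # ~0.0295 for this data frame

# Illustrative: refit K-prototypes with the explicitly estimated lambda
ict_clusters_l <- kproto(ICTSkills, k = 3, lambda = lambda_hat, na.rm = TRUE)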

The λ estimated from the data is 0.0295. A small λ signals that the data frame is dominated by numeric variables and will yield results similar to applying k-means to the data frame, whereas a large λ means there's a heavy influence of categorical variables in the data. In this case λ leans towards the numeric variables (https://journal.r-project.org/archive/2018/RJ-2018-048/RJ-2018-048.pdf).

Performance comparison

A performance measure is needed for the resulting clusters because the previous step effectively clusters the categorical variables using K-modes and the numeric variables using K-means. The comparison used here is the Rand index (Rand measure), which quantifies the similarity between the results of two clustering methods. In this case we want to compare the similarity of the K-means and K-modes results.
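
The original computation isn't shown here; below is a minimal sketch of how such a comparison could be made, assuming the klaR package for K-modes and the fossil package for the Rand index (the numeric/categorical split, complete-case filtering and seed are assumptions):

library(klaR)    # kmodes() for clustering the categorical variables
library(fossil)  # rand.index() for comparing two partitions

# Keep complete cases so both partitions cover the same observations
complete <- na.omit(ICTSkills)
num_vars <- complete[sapply(complete, is.numeric)]
cat_vars <- complete[!sapply(complete, is.numeric)]

set.seed(42)
km_fit  <- kmeans(num_vars, centers = 3)  # K-means on numeric variables
kmo_fit <- kmodes(cat_vars, modes = 3)    # K-modes on categorical variables

# Proportion of observation pairs on which the two partitions agree
rand.index(km_fit$cluster, kmo_fit$cluster)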

The Rand measure between the two partitions is 0.515913.

An index of 0 indicates that the two clustering methods do not agree on the clustering of any pair of elements, whereas an index of 1 indicates that they agree perfectly on every pair of elements.
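
Formally, for n observations, let a be the number of pairs placed in the same cluster by both methods and b the number of pairs placed in different clusters by both; then

$$R = \frac{a + b}{\binom{n}{2}}$$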

Our index of roughly 0.52 therefore indicates partial agreement between the two partitions.

Full Code