The data context falls under United Nations SDG 4, which aims to ensure inclusive and equitable education and promote lifelong learning opportunities for all. Observations in the data are percentage values of youths/adults who used 9 selected ICT skills. An individual is considered to have ICT skills if s/he used at least one of the 9 ICT skills in the past 3 months. The data set was collected using the Multiple Indicator Cluster Survey (MICS) and is hosted on UNICEF's open data portal.
Disclaimer: All reasonable precautions have been taken to verify the information in this database. In no event shall UNICEF be liable for damages arising from its use or interpretation.
The data set contains 15 columns and 285 observations; 73% of the variables are numerical and 27% are discrete. The sample covers 35 countries across 8 UNICEF regions.
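As a quick sanity check, the shape and column types can be inspected directly (a minimal sketch, assuming the data is loaded as ICTSkills):
# Quick structural check of the data set
dim(ICTSkills)                    # expect 285 rows and 15 columns
table(sapply(ICTSkills, class))   # counts of numeric vs categorical columns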
From the data table above, the ICT skill with the highest variation is how to write a computer program using any programming language. Personally, I find this a very specific skill set compared to the other, more general ICT skills.
fig 1.0. Table of descriptive statistics
# Descriptive statistics of the numerical variables using dlookr::diagnose_numeric()
library(dlookr)
library(dplyr)
NumericDescriptives <- diagnose_numeric(ICTSkills)
NumericDescriptives <- NumericDescriptives %>%
  select(-c(min, minus)) %>%   # drop columns not needed for the summary
  arrange(desc(outlier))       # sort by number of outliers
# Load the saved output table
knitr::include_graphics("Outputs/DescriptiveStats.png")
From the histograms, all numeric variables have a right-skewed distribution. A probable reason is that the uptake of these skills was high initially but slowed down over the three-month period. This is called the startup effect (https://www.itl.nist.gov/div898/handbook/eda/section3/histogr6.htm).
fig 2.0. Histogram of all numeric variables
# Histogram of all numerical variables
library(tidyr)
library(ggplot2)
NumericalDistribution <- ICTSkills %>%
  gather(variable, value, -c(1:4, 14)) %>%   # reshape to long format, excluding non-numeric columns
  ggplot(aes(x = value)) +
  geom_histogram(fill = "lightblue2", color = "black") +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Distribution of numeric variables", x = "Values", y = "Frequency") +
  theme_minimal()
NumericalDistribution
Clustering is an unsupervised machine learning technique that groups observations so that those within a cluster are similar to each other while the clusters themselves are dissimilar from one another. Clustering is mostly used for data mining and/or further analysis for future inferences.
Clustering algorithms are data-type specific. In this case the data has both numerical and categorical observations, hence the appropriate algorithm is K-prototypes, an extension of both the K-means and K-modes algorithms.
K-prototypes belongs to the family of partitional clustering algorithms. It measures distance between numerical features using Euclidean distance (like K-means) and distance between categorical features using the number of matching categories (like K-modes). It was first published by Huang in 1998 (https://link.springer.com/article/10.1023/A:1009769707641).
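Concretely, Huang's combined distance between an observation $x$ and a cluster prototype $\mu_j$, with the first $p$ variables numeric and the remaining ones categorical, is:

$$d(x, \mu_j) = \sum_{m=1}^{p} (x_m - \mu_{jm})^2 + \lambda \sum_{m=p+1}^{P} \mathbb{1}(x_m \neq \mu_{jm})$$

where $\lambda$ weights the categorical mismatches against the squared Euclidean part.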
Prior to clustering, one needs to derive the optimal number of clusters. Since there's no established procedure for this with K-prototypes, the code below iterates over probable numbers of clusters and plots a scree plot using the elbow method (https://stats.stackexchange.com/questions/293877/optimal-number-of-clusters-using-k-prototypes-method-in-r).
fig 3.0. Scree plot showing the optimum number of clusters
# Elbow method: total within-cluster distance for k = 1..k.max (k.max = 10 is an assumed search range)
library(clustMixType)
k.max <- 10
wss <- sapply(1:k.max, function(k) kproto(ICTSkills, k, verbose = FALSE)$tot.withinss)
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
From figure 3.0, I'll partition the data into 3 clusters.
# The cluster model
ict_clusters <- kproto(ICTSkills, k = 3, lambda = NULL, iter.max = 100,
                       nstart = 1, na.rm = TRUE, verbose = TRUE)
# Visualise the clusters
clprofiles(ict_clusters, ICTSkills)
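Beyond the profile plots, the fitted kproto object itself can be inspected; a brief sketch using components documented in clustMixType:
# Inspect the fitted model
ict_clusters$size             # observations per cluster
ict_clusters$lambda           # lambda used during the fit
head(ict_clusters$centers)    # cluster prototypes (numeric means and categorical modes)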
Some of the clusters derived include:
Further to the algorithm, one has to run lambdaest to investigate the variables' variance/concentration in order to accurately specify lambda for K-prototypes clustering. The function is lambdaest(x, num.method = 1, fac.method = 1, outtype = "numeric"), where x is the original data frame.
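As a minimal sketch (assuming the data frame is ICTSkills), the estimated lambda can then be passed back into the model instead of the default:
# Estimate lambda from the numeric variables' variance and the categorical variables' concentration
lambda <- lambdaest(ICTSkills)
# Re-fit K-prototypes with the estimated lambda
ict_clusters <- kproto(ICTSkills, k = 3, lambda = lambda, iter.max = 100, na.rm = TRUE)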
Estimated λ from the clusters is 0.0295. A small λ emphasises that the data frame has more numeric variables, and the result will be similar to what K-means would produce on the same data frame, whereas a large λ means there's a heavy influence of categorical variables in the data. In this case λ inclines more towards the numeric variables (https://journal.r-project.org/archive/2018/RJ-2018-048/RJ-2018-048.pdf).
A performance measure needs to be applied to the underlying clusters because the previous output clustered the categorical variables using K-modes and the numeric variables using K-means. This comparison is called the Rand index (Rand measure). The Rand index is a way of comparing the similarity of results between two clustering methods; in this case we want to compare the similarity of the K-means and K-modes results.
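For reference, with $a$ the number of pairs of observations placed in the same cluster by both methods and $b$ the number of pairs placed in different clusters by both, the Rand index over $n$ observations is:

$$R = \frac{a + b}{\binom{n}{2}}$$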
The Rand measure from the two clusterings is 0.515913.
An index of 0 indicates that the two clustering methods do not agree on the clustering of any pair of elements whereas an index of 1 indicates that the two clustering methods perfectly agree on the clustering of every pair of elements.
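A minimal base-R sketch of this computation, assuming kmeans_labels and kmodes_labels are hypothetical vectors of cluster assignments from the two methods:
# Rand index: share of observation pairs on which two clusterings agree
rand_index <- function(cl1, cl2) {
  pairs <- combn(length(cl1), 2)                 # all pairs of observation indices
  same1 <- cl1[pairs[1, ]] == cl1[pairs[2, ]]    # pair in same cluster under method 1?
  same2 <- cl2[pairs[1, ]] == cl2[pairs[2, ]]    # pair in same cluster under method 2?
  mean(same1 == same2)                           # agreements / total pairs
}
rand_index(kmeans_labels, kmodes_labels)         # hypothetical label vectors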
Our index indicates partial agreement between the two clusterings.