Autoencoders are primarily a dimensionality-reduction (or compression) technique with a couple of important properties:
Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but degraded representation.
Unsupervised: Autoencoders are considered an unsupervised learning technique since they don't need explicit labels to train on. To be more precise, however, they are self-supervised, because they generate their own labels from the training data.
Let us look at the autoencoder structure in a more detailed visualization. First the input passes through the encoder, which is a fully-connected ANN, to produce the code.
The decoder, which has a similar ANN structure, then produces the output using only the code.
The goal is to get an output identical to the input. The only requirement is that the dimensionality of the input and output be the same; anything in between can be played with.
There are 4 hyperparameters that we need to set before training an autoencoder: the code size (number of nodes in the middle layer), the number of layers, the number of nodes per layer, and the loss function (typically mean squared error or binary cross-entropy).
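To make the structure concrete, here is a minimal sketch using the keras R package; the 784-dimensional input and the layer sizes are illustrative assumptions, not taken from the analysis below.

library(keras)

input  <- layer_input(shape = c(784))                                 # input layer
code   <- input %>% layer_dense(units = 32, activation = "relu")      # the code (bottleneck)
output <- code %>% layer_dense(units = 784, activation = "sigmoid")   # reconstruction

autoencoder <- keras_model(inputs = input, outputs = output)
autoencoder %>% compile(optimizer = "adam", loss = "binary_crossentropy")

# Self-supervised: the input serves as its own target
# autoencoder %>% fit(x_train, x_train, epochs = 50, batch_size = 256)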
Traditional centroid-based clustering algorithms applied to heterogeneous data with both numerical and non-numerical features produce varying degrees of inaccurate clustering.
This is because the Hamming distance used to measure the dissimilarity of non-numerical values does not provide optimal distances between different values.
Further problems arise from attempts to combine the Euclidean distance and the Hamming distance.
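A toy illustration of the first issue (the data here is made up): the Hamming distance can only register whether two categorical values differ, so every differing pair of values is equally dissimilar, with no notion of how different the values actually are.

hamming <- function(a, b) sum(a != b)

x <- c("apple", "red",   "small")
y <- c("apple", "green", "small")  # differs from x in one attribute
z <- c("pear",  "green", "large")  # differs from x in all three

hamming(x, y)  # 1
hamming(x, z)  # 3 -- each differing attribute contributes a flat 1,
               # regardless of how close the two values really are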
Deep clustering is a recent trend in the machine learning community that aims to employ a deep neural network in an unsupervised learning setting. One of the main families of deep clustering is Deep Embedded Clustering (DEC) (Xie, Girshick, and Farhadi, 2016). The fundamental idea of DEC is to learn a latent space that preserves the properties of the data.
DEC has two phases: (1) pretraining, in which an autoencoder is trained to reconstruct the data and its encoder is kept to produce the embedding; and (2) clustering optimization, in which the embedding and cluster centres are refined by minimizing the KL divergence between the soft cluster assignments and an auxiliary target distribution.
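The core of phase 2 can be written down compactly. A minimal sketch, assuming the embedding z and the cluster centres mu are plain matrices (the function names soft_assign and target_dist are hypothetical, not from the original analysis); it implements the Student's t soft assignment q and the sharpened target distribution p from the DEC paper.

# Soft assignment: q_ij = similarity of embedded point z_i to centre mu_j
# under a Student's t kernel (alpha = 1, as in Xie et al. 2016)
soft_assign <- function(z, mu, alpha = 1) {
  d2 <- outer(rowSums(z^2), rowSums(mu^2), "+") - 2 * z %*% t(mu)  # squared distances
  q  <- (1 + d2 / alpha)^(-(alpha + 1) / 2)
  q / rowSums(q)  # normalise each row to sum to 1
}

# Target distribution: p_ij = (q_ij^2 / f_j) renormalised, where f_j = sum_i q_ij;
# training minimises KL(P || Q) to sharpen the assignments
target_dist <- function(q) {
  w <- sweep(q^2, 2, colSums(q), "/")
  w / rowSums(w)
}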
# Get the data
library(Rtsne)  # t-SNE implementation
library(dplyr)  # pipe operator %>%
library(tidyr)  # replace_na()

DEC_Embedding <- read.csv('/Users/xbasra/Documents/Data/Clustering_Food_Alergies/Intermediate/CsvData_Output/DEC_Embedding.csv')
resultft_DEL_all <- read.csv('/Users/xbasra/Documents/Data/Clustering_Food_Alergies/Intermediate/CsvData_Output/resultft_DEL_all.csv')
# Replace empty-string ("") farmlive values with 'no', as done in the main analysis file
resultft_DEL_all$farmlive[resultft_DEL_all$farmlive == ""] <- NA
resultft_DEL_all <- resultft_DEL_all %>% replace_na(list(farmlive = 'no'))
#tsne_converted_food$cl_DEL <- factor(resultft_DEL_all$cluster)
#ggplot(tsne_converted_food, aes(x=X, y=Y, color=cl_DEL)) + geom_point()
resultft_DEL_all$cluster <- as.factor(resultft_DEL_all$cluster)
2D t-SNE plot
# t-SNE plot 2D
set.seed(10)
#tsne_converted_food_DEL <- Rtsne(X = EDL_Embedding, perplexity = 200, is_distance = FALSE, check_duplicates = FALSE)
tsne_converted_food_DEC <- Rtsne(X = DEC_Embedding, perplexity = 150, is_distance = FALSE, check_duplicates = FALSE)
tsne_converted_food_DEC <- tsne_converted_food_DEC$Y %>%
  data.frame() %>%
  setNames(c("X", "Y"))
tsne_converted_food_DEC$cl <- factor(resultft_DEL_all$cluster)
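The chunk above computes the 2D embedding but never draws it; a minimal plotting sketch, mirroring the commented-out ggplot call earlier:

library(ggplot2)
ggplot(tsne_converted_food_DEC, aes(x = X, y = Y, color = cl)) +
  geom_point()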
# 3D plot
tsne_converted_food_DEC_3d <- Rtsne(X = DEC_Embedding, perplexity = 150, dims = 3, is_distance = FALSE, check_duplicates = FALSE)
tsne_converted_food_DEC_3d <- tsne_converted_food_DEC_3d$Y %>%
  data.frame() %>%
  setNames(c("X", "Y", "Z"))
tsne_converted_food_DEC_3d$cl <- factor(resultft_DEL_all$cluster)
3D t-SNE plot
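One way to render the 3D embedding (the plotly package is an assumption here; the original file does not show which 3D plotting library was used):

library(plotly)
plot_ly(tsne_converted_food_DEC_3d,
        x = ~X, y = ~Y, z = ~Z, color = ~cl,
        type = "scatter3d", mode = "markers", marker = list(size = 2))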