As defined on Wikipedia, a Gaussian Mixture Model is a specific kind of mixture model:
“In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with “mixture distributions” relate to deriving the properties of the overall population from those of the sub-populations, “mixture models” are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.”
In the same article, Wikipedia gives an example of handwriting recognition based on Christopher M. Bishop, Pattern Recognition and Machine Learning:
“Imagine that we are given an N×N black-and-white image that is known to be a scan of a hand-written digit between 0 and 9, but we don’t know which digit is written. We can create a mixture model with K=10 different components, where each component is a vector of size N² of Bernoulli distributions (one per pixel). Such a model can be trained with the expectation-maximization algorithm on an unlabeled set of hand-written digits, and will effectively cluster the images according to the digit being written. The same model could then be used to recognize the digit of another image simply by holding the parameters constant, computing the probability of the new image for each possible digit (a trivial calculation), and returning the digit that generated the highest probability.”
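The recognition step described in the quote can be sketched in a few lines of NumPy. This is only an illustration: the function names, the tiny 4-pixel "images" and the parameter values are hypothetical, and the parameters are set by hand rather than trained with expectation-maximization.

```python
import numpy as np

def bernoulli_log_prob(x, mu, eps=1e-9):
    """Log-probability of binary images x (n, D) under each component mu (K, D)."""
    mu = np.clip(mu, eps, 1 - eps)  # avoid log(0)
    return x @ np.log(mu).T + (1 - x) @ np.log(1 - mu).T

def classify(x, mu, weights):
    """Return, for each image, the index of the most probable component."""
    scores = np.log(weights) + bernoulli_log_prob(x, mu)
    return np.argmax(scores, axis=-1)

# Hypothetical parameters: two components over four pixels.
mu = np.array([[0.9, 0.9, 0.1, 0.1],
               [0.1, 0.1, 0.9, 0.9]])
weights = np.array([0.5, 0.5])

images = np.array([[1, 1, 0, 0],
                   [0, 0, 1, 1]])
predicted = classify(images, mu, weights)
print(predicted)
```

In a real model each component would still have to be associated with a digit, for instance by inspecting the images assigned to it.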
Our Gaussian Mixture Model implementation is adapted from the one written by Valentin Iovene for MNIST digit classification.
The input images are 64 pixels wide and 64 pixels high, so each feature vector holds 4096 values. We reduce the dimensionality with Principal Component Analysis (PCA).
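As a rough sketch of this reduction step, scikit-learn's `PCA` accepts a fraction between 0 and 1 and keeps just enough components to explain that share of the variance. The data below is synthetic (an assumption made for illustration, since the real images are not reproduced here), so the resulting dimension says nothing about the actual experiments.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for flattened 64x64 images (assumption: not the real data).
rng = np.random.default_rng(0)
n_samples, side = 300, 64
latent = rng.normal(size=(n_samples, 20))            # 20 hidden factors
mixing = rng.normal(size=(20, side * side))
X = latent @ mixing + 0.1 * rng.normal(size=(n_samples, side * side))

# A float in (0, 1) keeps the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.99, svd_solver="full")
X_reduced = pca.fit_transform(X)

kept = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape)
print("explained variance kept:", kept)
```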
We start by keeping 99% of the variance (the relevant information) and look for the best Gaussian Mixture Model using the Akaike Information Criterion (AIC). We then repeat the computation with the Bayesian Information Criterion (BIC).
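The search can be sketched with scikit-learn's `GaussianMixture`, which exposes `aic` and `bic` methods. This is not the implementation used in the experiments; the toy data (three 2-D blobs) and the candidate range are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: three well-separated 2-D blobs (assumption, not the character data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X = np.vstack([c + rng.normal(size=(200, 2)) for c in centers])

candidates = list(range(1, 7))
aic, bic = [], []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X)
    aic.append(gmm.aic(X))   # lower is better
    bic.append(gmm.bic(X))   # lower is better

best_aic = candidates[int(np.argmin(aic))]
best_bic = candidates[int(np.argmin(bic))]
print("best by AIC:", best_aic, "best by BIC:", best_bic)
```

Plotting `aic` and `bic` against `candidates` gives the kind of curve discussed in the rest of the article; the minimum of each curve marks the preferred number of components.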
The reduction achieved by PCA depends on the samples fitted and on their variability. In one of our experiments, with 12,000 samples and 99% of the information kept, the feature vector shrinks from 4096 to 651 values. The memory needed to run a complete experiment, with all 62 letters and digits, exceeds the capacity of our computer (16 GB). Even with only 4,000 or 3,000 samples, and feature vectors of 515 and 470 values respectively, we do not get satisfactory AIC and BIC curves.
As Ivan Svetunkov explains, we cannot compare these results with one another:
“All the information criteria are based on likelihood function, which in its turn depends on sample size. The larger sample size is, the smaller likelihood becomes and as a result the greater IC becomes. So, you should expect that with the increase of sample size, IC will increase as well (whatever criteria you use). This means that you cannot compare different models fitted on different sample sizes using any information criteria.”
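The dependence on sample size is visible in the standard definitions of the two criteria. The notation is introduced here and is not taken from the article: k is the number of free parameters, n the number of samples, and L̂ the maximized likelihood.

```latex
\begin{align}
\ln \hat{L} &= \sum_{i=1}^{n} \ln p(x_i \mid \hat{\theta}), \\
\mathrm{AIC} &= 2k - 2 \ln \hat{L}, \\
\mathrm{BIC} &= k \ln n - 2 \ln \hat{L}.
\end{align}
```

The log-likelihood is a sum of n terms, so its magnitude grows with the number of samples, and the BIC penalty contains n explicitly. Values computed on different sample sizes are therefore not on the same scale.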
We therefore run separate experiments for upper case (26 characters), lower case (26 characters) and digits (10 characters), with 12,000 samples and PCA reduction set to 95%. These turn out to be more satisfying: we get U-shaped or nearly U-shaped curves, as shown below.
The curve obtained for the capital letters is characteristic and matches the expected result. The curves for the lower-case letters and the digits, on the other hand, need to be completed by a larger experiment going up to 200 components.
Increasing the number of samples changes the results. For example, with 16,000 samples we get the following curves. The two curves are placed side by side for easier reading.
We note immediately that the Akaike Information Criterion now selects a model with 70 components rather than 60. This seems perfectly logical. The curve then bends at 120 components, which highlights the pseudo-plateau we previously observed at the same value. In our opinion, it would be useful to measure the accuracy of the models more precisely around these two zones.
A Gaussian mixture model also makes it possible to generate new samples. We can recognize a number of characters in the generated samples; let us then look at the categorization performed by the model.
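Sample generation can be sketched as follows with scikit-learn: draw points from the fitted mixture in the reduced space, then map them back to pixel space with the inverse PCA transform. The data is synthetic and the 8×8 image size is an assumption made to keep the example small; the article works with 64×64 images.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Synthetic low-rank "images" (assumption: stand-in for the character data).
rng = np.random.default_rng(0)
side = 8
group = rng.integers(0, 3, size=(500, 1))              # three loose groups
latent = rng.normal(size=(500, 4)) + 5.0 * group
X = latent @ rng.normal(size=(4, side * side)) + 0.05 * rng.normal(size=(500, side * side))

pca = PCA(n_components=0.95, svd_solver="full").fit(X)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(pca.transform(X))

# sample() returns the drawn points and the component each one came from.
samples, component_ids = gmm.sample(n_samples=10)
images = pca.inverse_transform(samples).reshape(-1, side, side)
print(images.shape, component_ids)
```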
The Bayesian Information Criterion gives the result below if we set the covariance type to tied instead of full.
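One way to see why the covariance type matters for BIC is to count free parameters: a full model has one covariance matrix per component, while a tied model shares a single matrix. The helper below is a hypothetical illustration using the usual parameter counts, and the dimension of 100 is an arbitrary assumption, not the dimension obtained in the experiments.

```python
def n_free_parameters(n_components, n_features, covariance_type):
    """Count free parameters of a Gaussian mixture (hypothetical helper)."""
    cov_size = n_features * (n_features + 1) // 2      # one symmetric matrix
    if covariance_type == "full":
        cov_params = n_components * cov_size           # one matrix per component
    elif covariance_type == "tied":
        cov_params = cov_size                          # a single shared matrix
    else:
        raise ValueError("only 'full' and 'tied' are handled in this sketch")
    mean_params = n_components * n_features
    weight_params = n_components - 1                   # weights sum to 1
    return cov_params + mean_params + weight_params

full_count = n_free_parameters(60, 100, "full")
tied_count = n_free_parameters(60, 100, "tied")
print(full_count, tied_count)
```

Since the BIC penalty is proportional to the number of free parameters, a tied model is penalized far less for each extra component, which may partly explain why its curve differs from the full-covariance one.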
As with the Akaike Information Criterion, the experiment for lower-case letters and digits needs to be extended beyond 210 components to check whether the curve goes back up.
Afterwards, we fit a Gaussian mixture model on the 26 upper-case symbols with PCA reduction set to 95%, then compute the accuracy of different models whose number of components is close to the values suggested by AIC (60) and BIC (120). In the first case, the Gaussian mixture model with 50 components and full covariance gives the best accuracy, 76.5%.
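The article does not detail how accuracy is derived from an unsupervised model. One common approach, assumed here purely for illustration, is to label each mixture component with the majority class of the training points assigned to it, then score held-out data. The function name and the toy data are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_accuracy(gmm, X_train, y_train, X_test, y_test):
    """Label each component by majority vote on training data, then score test data."""
    train_comp = gmm.predict(X_train)
    n_classes = int(y_train.max()) + 1
    mapping = np.full(gmm.n_components, -1)   # -1 marks components with no training points
    for comp in range(gmm.n_components):
        labels = y_train[train_comp == comp]
        if labels.size:
            mapping[comp] = np.bincount(labels, minlength=n_classes).argmax()
    y_pred = mapping[gmm.predict(X_test)]
    return float(np.mean(y_pred == y_test))

# Toy data: three classes, more components than classes (as in the article).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
y = np.repeat(np.arange(3), 200)
X = centers[y] + rng.normal(size=(600, 2))
idx = rng.permutation(600)
train, test = idx[:450], idx[450:]

gmm = GaussianMixture(n_components=6, covariance_type="full", random_state=0).fit(X[train])
accuracy = cluster_accuracy(gmm, X[train], y[train], X[test], y[test])
print("accuracy:", accuracy)
```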
We run the same experiment with lower case (accuracy 74.15%) and digits (accuracy 87.1%).
The analysis of the confusion matrix, compared with the images of the characters, leads us to think that character recognition with Gaussian mixture models and PCA reduction cannot reach very high accuracy without additional features and methods.
These experiments give a good understanding of Gaussian Mixture Models and of the parameters to tune to obtain better classification accuracy.
We now have a better knowledge of Gaussian mixture models. Still, we will continue our experiments with the GaussianMixture implementation of sklearn and the GMM implementation of TensorFlow before moving on to hidden Markov models.
 – Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. New York: Springer. ISBN 978-0-387-31073-2.
 – VanderPlas, Jake (2016). Python Data Science Handbook. O’Reilly. ISBN 978-1-491-91205-8.
 – Iovene, Valentin (Tgy). MNIST handwritten digits clustering using a Bernoulli Mixture Model and a Gaussian Mixture Model. GitHub.