Comparative study of statistical data modelling with machine learning techniques

Mgcima, Phumzile

dc.contributor.advisor	Olaomi, J. O.
dc.contributor.author	Mgcima, Phumzile
dc.date.accessioned	2022-05-30T07:56:55Z
dc.date.available	2022-05-30T07:56:55Z
dc.date.issued	2022-01
dc.identifier.uri	https://hdl.handle.net/10500/28905
dc.description	Summary in English	en
dc.description.abstract	Human activities now produce massive amounts of data every day. It is estimated that 2.5 quintillion bytes of data are produced daily. The ability to analyse and interpret such data, usually referred to as ‘big data’, is a precondition for success in the 4th Industrial Revolution (4IR). Statistical data modelling has been the de facto data analysis paradigm for many decades, but it is slowly being overshadowed by machine learning algorithms in industry and in research funding. In this research, the two modelling paradigms were compared with the aim of establishing which one is better in terms of rationale, accuracy and model parsimony. Unlike many studies on this subject, which concentrate mainly on comparing accuracy, this research did not treat accuracy as the only metric of comparison. Both modelling paradigms were applied to prediction (continuous value prediction), classification (categorical class label prediction) and clustering problems in three separate case studies. In the prediction case study, a Realised GARCH (RealGARCH) model was compared to an artificial neural network (ANN) algorithm. In the classification case study, a linear discriminant analysis (LDA) model was compared to a support vector machine (SVM) algorithm. Lastly, in the clustering case study, a Gaussian mixture model (GMM) was compared to a K-means algorithm. For prediction and classification, the data was divided into training and testing sets: the training sets were used to fit the models and the testing sets were used to measure prediction and classification accuracy. For clustering, model validation was based on bootstrapping, visualisation and distance measures. The ANN model outperformed the RealGARCH model, a variant of the generalised autoregressive conditional heteroscedasticity (GARCH) model, on both accuracy measures, root mean square error (RMSE) and mean absolute error (MAE), while RealGARCH gave more insight into the data.
SVM had marginally better classification accuracy in both the two-class and the three-class scenarios but had a poorer F-measure for the minority classes in the three-class scenario. The statistical models were more interpretable than their machine learning counterparts in both of these case studies. Both clustering models performed poorly in partitioning the data in the third case study, but K-means did better than the GMM. Understanding the domain problem was found to be essential to data analysis regardless of the modelling paradigm.	en
dc.format.extent	1 online resource (xii, 135 leaves) : illustrations, graphs	en
dc.language.iso	en	en
dc.subject	Statistical Data modelling	en
dc.subject	Machine Learning Techniques	en
dc.subject	Artificial neural networks (ANNs)	en
dc.subject	GARCH models	en
dc.subject	Internet of things	en
dc.subject	Smart home	en
dc.subject.ddc	519.5
dc.subject.lcsh	Statistics	en
dc.subject.lcsh	Machine learning	en
dc.subject.lcsh	GARCH model	en
dc.subject.lcsh	Neural networks (Computer science)	en
dc.subject.lcsh	Home automation	en
dc.subject.lcsh	Big data	en
dc.title	Comparative study of statistical data modelling with machine learning techniques	en
dc.type	Dissertation	en
dc.description.department	Statistics	en
dc.description.degree	M.Sc. (Statistics)	en