Institutional Repository

Comparative study of statistical data modelling with machine learning techniques


dc.contributor.advisor Olaomi, J. O.
dc.contributor.author Mgcima, Phumzile
dc.date.accessioned 2022-05-30T07:56:55Z
dc.date.available 2022-05-30T07:56:55Z
dc.date.issued 2022-01
dc.identifier.uri https://hdl.handle.net/10500/28905
dc.description Summary in English en
dc.description.abstract Human activities now produce massive amounts of data every day; an estimated 2.5 quintillion bytes are generated daily. The ability to analyse and interpret such data, usually referred to as ‘big data’, is a precondition for success in the 4th Industrial Revolution (4IR). Statistical data modelling has been the de facto data analysis paradigm for many decades, but it is slowly being overshadowed by machine learning algorithms in industry and in research funding. In this research, the two modelling paradigms were compared with the aim of establishing which one is better in terms of rationale, accuracy and model parsimony. Unlike many studies on this subject, which concentrate mainly on comparing accuracy, this research did not treat accuracy as the only metric of comparison. Both modelling paradigms were applied to prediction (continuous value prediction), classification (categorical class label prediction) and clustering problems in three separate case studies. In the prediction case study, a Realised GARCH (RealGARCH) model, a variant of the generalised autoregressive conditional heteroscedasticity (GARCH) model, was compared to an artificial neural network (ANN) algorithm. In the classification case study, a linear discriminant analysis (LDA) model was compared to a support vector machine (SVM) algorithm. Lastly, in the clustering case study, a Gaussian mixture model (GMM) was compared to a K-means algorithm. For prediction and classification, the data was divided into training and testing sets: the training sets were used to fit the models and the testing sets were used to measure prediction and classification accuracy. For clustering, model validation was based on bootstrapping, visualisation and distance measures. The ANN model outperformed the RealGARCH model on both accuracy measures, root mean square error (RMSE) and mean absolute error (MAE), while RealGARCH gave more insight into the data. SVM had marginally better classification accuracy in both the two-class and the three-class scenarios but had a poorer F-measure for the minority classes in the three-class scenario. The statistical models were more interpretable than their machine learning counterparts in both case studies. Both clustering models performed poorly in partitioning the data in the third case study, but K-means did better than the GMM. Understanding the domain problem was found to be essential to data analysis regardless of the modelling paradigm. en
dc.format.extent 1 online resource (xii, 135 leaves) : illustrations, graphs en
dc.language.iso en en
dc.subject Statistical Data modelling en
dc.subject Machine Learning Techniques en
dc.subject Artificial neural networks (ANNs) en
dc.subject GARCH models en
dc.subject Internet of things en
dc.subject Smart home en
dc.subject.ddc 519.5
dc.subject.lcsh Statistics en
dc.subject.lcsh Machine learning en
dc.subject.lcsh GARCH model en
dc.subject.lcsh Neural networks (Computer science) en
dc.subject.lcsh Home automation en
dc.subject.lcsh Big data en
dc.title Comparative study of statistical data modelling with machine learning techniques en
dc.type Dissertation en
dc.description.department Statistics en
dc.description.degree M.Sc. (Statistics) en

