dc.description.abstract |
Human activities now produce massive amounts of data every day. It is estimated that 2.5
quintillion bytes of data are produced daily. The ability to analyse and interpret such data, usually
referred to as ‘big data’, is a precondition for success in the 4th Industrial Revolution (4IR). Statistical
data modelling has been the de facto data analysis paradigm for many decades, but it is slowly being
overshadowed by machine learning algorithms in industry and in research funding. In this research,
the two modelling paradigms were compared with the aim of establishing which one is better in terms of
rationale, accuracy and model parsimony. Unlike many studies on this subject, which mainly concentrate
on comparing accuracy, this research did not treat accuracy as the only metric of comparison.
Both modelling paradigms were applied in prediction (continuous value prediction), classification
(categorical class label prediction) and clustering problems in three separate case studies. In the
prediction case study, a Realised GARCH (RealGARCH) model was compared to an artificial neural
network (ANN) algorithm. In the classification case study, a linear discriminant analysis (LDA)
model was compared to a support vector machine (SVM) algorithm. Lastly, in the clustering case study, a Gaussian mixture
model (GMM) was compared to a K-means algorithm. For prediction and classification, the data was
divided into training and testing sets: the training sets were used to fit the models and the testing
sets were used to measure prediction and classification accuracy. For clustering, model validation was
based on bootstrapping, visualisation and distance measures.
The ANN outperformed RealGARCH, a variant of the generalised autoregressive conditional heteroscedasticity (GARCH)
model, on both accuracy measures, root mean square error (RMSE)
and mean absolute error (MAE), while RealGARCH gave more insight into the data. SVM had
marginally better classification accuracy in both the two-class and the three-class scenarios but had
a poorer F-measure for the minority classes in the three-class scenario. The statistical models were more
interpretable than their machine learning counterparts in both of these case studies. Both clustering
models performed poorly in partitioning the data in the third case study, but K-means did better
than the GMM. Understanding the domain problem was found to be essential to data analysis
regardless of the modelling paradigm. |
en |