Masjienleerbenadering tot woordafbreking in Afrikaans

Fick, Machteld

dc.contributor.advisor	Swanepoel, C. J.
dc.contributor.author	Fick, Machteld
dc.date.accessioned	2014-04-07T09:24:05Z
dc.date.available	2014-04-07T09:24:05Z
dc.date.issued	2013-06
dc.identifier.citation	Fick, Machteld (2013) 'n Masjienleerbenadering tot woordafbreking in Afrikaans, University of South Africa, Pretoria, <http://hdl.handle.net/10500/13326>	en
dc.identifier.uri	http://hdl.handle.net/10500/13326
dc.description	Text in Afrikaans
dc.description.abstract	Die doel van hierdie studie was om te bepaal tot watter mate ’n suiwer patroongebaseerde benadering tot woordafbreking bevredigende resultate lewer. Die masjienleertegnieke kunsmatige neurale netwerke, beslissingsbome en die TeX-algoritme is ondersoek aangesien dit met letterpatrone uit woordelyste afgerig kan word om lettergreep- en saamgesteldewoordverdeling te doen. ’n Leksikon van Afrikaanse woorde is uit ’n korpus van elektroniese teks genereer. Om lyste vir lettergreep- en saamgesteldewoordverdeling te kry, is woorde in die leksikon in lettergrepe verdeel en saamgestelde woorde is in hul samestellende dele verdeel. Uit elkeen van hierdie lyste van ±183 000 woorde is ±10 000 woorde as toetsdata gereserveer terwyl die res as afrigtingsdata gebruik is. ’n Rekursiewe algoritme is vir saamgesteldewoordverdeling ontwikkel. In hierdie algoritme word alle ooreenstemmende woorde uit ’n verwysingslys (die leksikon) onttrek deur stringpassing van die begin en einde van woorde af. Verdelingspunte word dan op grond van woordlengte uit die samestelling van begin- en eindwoorde bepaal. Die algoritme is uitgebrei deur die tekortkominge van hierdie basiese prosedure aan te spreek. Neurale netwerke en beslissingsbome is afgerig en variasies van beide tegnieke is ondersoek om die optimale modelle te kry. Patrone vir die TeX-algoritme is met die OPatGen-program gegenereer. Tydens toetsing het die TeX-algoritme die beste op beide lettergreep- en saamgesteldewoordverdeling presteer met 99,56% en 99,12% akkuraatheid, respektiewelik. Dit kan dus vir woordafbreking gebruik word met min risiko vir afbrekingsfoute in gedrukte teks. Die neurale netwerk met 98,82% en 98,42% akkuraatheid op lettergreep- en saamgesteldewoordverdeling, respektiewelik, is ook bruikbaar vir lettergreepverdeling, maar dis meer riskant. Ons het bevind dat beslissingsbome te riskant is om vir lettergreepverdeling en veral vir woordverdeling te gebruik, met 97,91% en 90,71% akkuraatheid, respektiewelik. ’n Gekombineerde algoritme is ontwerp waarin saamgesteldewoordverdeling eers met die TeX-algoritme gedoen word, waarna die resultate van lettergreepverdeling deur beide die TeX-algoritme en die neurale netwerk gekombineer word. Die algoritme het 1,3% minder foute as die TeX-algoritme gemaak. ’n Toets op gepubliseerde Afrikaanse teks het getoon dat die risiko vir woordafbrekingsfoute in teks met gemiddeld tien woorde per reël ±0,02% is.	af
dc.description.abstract	The aim of this study was to determine the level of success achievable with a purely pattern-based approach to hyphenation in Afrikaans. The machine learning techniques of artificial neural networks, decision trees and the TeX algorithm were investigated, since they can be trained with patterns of letters from word lists to perform syllabification and decompounding. A lexicon of Afrikaans words was extracted from a corpus of electronic text. To obtain lists for syllabification and decompounding, the words in the lexicon were syllabified and the compound words were decomposed into their constituent parts. From each list of ±183 000 words, ±10 000 words were reserved as testing data and the rest were used as training data. A recursive algorithm for decompounding was developed. In this algorithm, all words that match entries in a reference list (the lexicon) are extracted by string fitting from the beginning and end of words. Splitting points are then determined based on the length of the reassembled words. The algorithm was expanded by addressing the shortcomings of this basic procedure. Artificial neural networks and decision trees were trained, and variations of both were examined to find optimal syllabification and decompounding models. Patterns for the TeX algorithm were generated with the program OPatGen. Testing showed that the TeX algorithm performed best on both the syllabification and decompounding tasks, with 99,56% and 99,12% accuracy, respectively. It can therefore be used for hyphenation in Afrikaans with little risk of hyphenation errors in printed text. The performance of the artificial neural network was lower, but still acceptable, with 98,82% and 98,42% accuracy for syllabification and decompounding, respectively. The decision tree, with an accuracy of 97,91% on syllabification and 90,71% on decompounding, was found to be too risky to use for either task. A combined algorithm was developed in which words are first decompounded with the TeX algorithm and then syllabified with both the TeX algorithm and the neural network, after which the results are combined. This algorithm reduced the number of errors made by the TeX algorithm by 1,3% but missed more hyphens. Testing the algorithm on Afrikaans publications showed the risk of hyphenation errors to be ±0,02% for text assumed to have an average of ten words per line.	en
dc.format.extent	1 online resource (x, 173 leaves) : tables
dc.language.iso	Afrikaans
dc.subject	Woordafbreking	af
dc.subject	Lettergreepverdeling	af
dc.subject	Saamgesteldewoordverdeling	af
dc.subject	Stringpassing	af
dc.subject	Woordvlakakkuraatheid	af
dc.subject	Verdelingsgeleentheidsvlakakkuraatheid	af
dc.subject	Masjienleertegnieke	af
dc.subject	Neurale netwerke	af
dc.subject	Beslissingsbome	af
dc.subject	Algoritme	af
dc.subject	Hyphenation	en
dc.subject	Syllabification	en
dc.subject	Decompounding	en
dc.subject	String fitting	en
dc.subject	Word level accuracy	en
dc.subject	Splitting opportunity level accuracy	en
dc.subject	Machine learning	en
dc.subject	Neural networks	en
dc.subject	Decision trees	en
dc.subject	Algorithm	en
dc.subject.ddc	410.285
dc.subject.lcsh	Hyphen	en
dc.subject.lcsh	Afrikaans language -- Orthography and spelling	en
dc.subject.lcsh	Afrikaans language -- Syllabication	en
dc.subject.lcsh	Afrikaans language -- Data processing	en
dc.subject.lcsh	Syllabication -- Data processing	en
dc.subject.lcsh	Neural networks (Computer science)	en
dc.subject.lcsh	Data compression (Computer science)	en
dc.subject.lcsh	Back propagation (Artificial intelligence)	en
dc.subject.lcsh	Decision trees	en
dc.subject.lcsh	Algorithms	en
dc.title	Masjienleerbenadering tot woordafbreking in Afrikaans	af
dc.type	Thesis	en
dc.description.department	Decision Sciences	en
dc.description.degree	D. Phil. (Operational Research)
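
The recursive decompounding procedure outlined in the abstract (matching parts of a word against a lexicon and choosing splitting points from the matched words) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the toy lexicon, the `MIN_PART` threshold, the function name `decompound` and the "fewest parts wins" rule are all assumptions made for illustration, and the sketch matches from the beginning of the word and recurses on the remainder rather than fitting from both ends.

```python
from functools import lru_cache

# Hypothetical toy lexicon (assumption); the thesis used roughly 183 000 words.
LEXICON = frozenset(
    {"woord", "af", "breking", "afbreking", "letter", "greep", "verdeling"}
)
MIN_PART = 2  # assumed minimum length of a compound part


@lru_cache(maxsize=None)
def decompound(word):
    """Split `word` into lexicon words, preferring the fewest (longest) parts.

    Returns a tuple of parts, or None if no full split into lexicon words exists.
    """
    # A word that is itself in the lexicon is the shortest possible split.
    best = (word,) if word in LEXICON else None
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        head, tail = word[:i], word[i:]
        if head not in LEXICON:
            continue
        rest = decompound(tail)  # recurse on the unmatched remainder
        if rest is None:
            continue
        candidate = (head,) + rest
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


if __name__ == "__main__":
    for w in ("woordafbreking", "lettergreepverdeling", "xyz"):
        parts = decompound(w)
        print(w, "->", "-".join(parts) if parts else None)
```

Preferring the split with the fewest parts is one simple way to stop a long lexicon word such as "afbreking" from being broken into shorter lexicon words ("af" + "breking"); the thesis's own length-based rule may differ.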