Knowledge-based word sense disambiguation for Setswana-English machine translation

Moape, Tebatso Gorgina

Please use this identifier to cite or link to this item: https://hdl.handle.net/10321/5570

DC Field	Value	Language
dc.contributor.advisor	Ojo, Sunday O.	-
dc.contributor.advisor	Olugbara, Oludayo O.	-
dc.contributor.author	Moape, Tebatso Gorgina	en_US
dc.date.accessioned	2024-10-07T12:55:12Z	-
dc.date.available	2024-10-07T12:55:12Z	-
dc.date.issued	2024	-
dc.identifier.uri	https://hdl.handle.net/10321/5570	-
dc.description	Submitted in fulfilment of the requirements of the degree of Doctor of Philosophy in Information Technology, Durban University of Technology, Durban, South Africa, 2024.	en_US
dc.description.abstract	Several challenges hinder the development of Setswana-to-English machine translation systems. A key obstacle is the absence of machine-readable knowledge resources. This has prompted the use of the only accessible data, which originates from the government domain. While training machine-translation systems on government-domain data can offer specialized language knowledge, it also introduces obstacles such as limited vocabulary, style variation, bias, and domain specificity. Furthermore, the literature notes that the persistent problem of polysemy reduces the overall accuracy of machine-translation systems. Polysemy is a linguistic phenomenon in which a single word or phrase has multiple senses, resulting in ambiguity. The task of resolving this ambiguity in natural language processing (NLP) is known as word sense disambiguation (WSD). WSD serves as an intermediate task for enhancing text understanding in NLP applications, including machine translation, information retrieval, and text summarization. Its central role is to improve these applications by selecting the appropriate sense of a polysemous word in each context. This study addresses these challenges by proposing three components: a diversity-aware machine-readable knowledge resource for Setswana-English, the Setswana universal knowledge core (SUKC); a WSD approach to resolving lexical ambiguity; and a corresponding machine-translation model with an embedded WSD capability. To this end, Setswana-English data was first collected from existing paper-based bilingual dictionaries. Second, the study employed professional translators to translate space domain concepts from English to Setswana. The collected lexicon was integrated into the universal knowledge core (UKC).
The Lesk algorithm, which researchers have adapted for different languages over the years, was employed to address the inherent polysemy challenges. This study used a simplified Lesk-based algorithm to resolve polysemy for Setswana: a bidirectional encoder representations from transformers (BERT) model for Setswana embedded the Setswana glosses, and a cosine similarity measure scored their semantic similarity to determine the correct sense. The study employed a rule-based method embedded with the WSD algorithm for machine translation. The translation accuracy of the machine-readable dictionary was assessed using the developed machine-translation model and evaluated with the BLEU score. The proposed model was tested on a combination of sentences with and without ambiguous words, and achieved a BLEU score of 34.89.	en_US
dc.format.extent	265 p	en_US
dc.language.iso	en	en_US
dc.subject	Setswana-to-English	en_US
dc.subject	Machine-readable knowledge resources	en_US
dc.subject	Machine-translation system	en_US
dc.subject.lcsh	Tswana language	en_US
dc.subject.lcsh	Translating and interpreting	en_US
dc.subject.lcsh	Machine translating	en_US
dc.subject.lcsh	English language	en_US
dc.subject.lcsh	Tswana language--Translating into English	en_US
dc.title	Knowledge-based word sense disambiguation for Setswana-English machine translation	en_US
dc.type	Thesis	en_US
dc.description.level	D	en_US
dc.identifier.doi	https://doi.org/10.51415/10321/5570	-
local.sdg	SDG04	en_US
local.sdg	SDG10	en_US
local.sdg	SDG16	en_US
item.grantfulltext	open	-
item.cerifentitytype	Publications	-
item.fulltext	With Fulltext	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.openairetype	Thesis	-
item.languageiso639-1	en	-
Appears in Collections:	Theses and dissertations (Accounting and Informatics)
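
The abstract describes choosing a sense by embedding each candidate gloss and comparing it with the context by cosine similarity. The sketch below illustrates that general idea only and is not the thesis implementation. The function names (embed, cosine, disambiguate) and the English "bank" glosses are hypothetical, and embed is a toy bag-of-words stand-in for the Setswana BERT encoder mentioned in the abstract.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption: the thesis uses a
    BERT model here); returns a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(context, glosses):
    """Lesk-style selection: return (sense_id, score) for the gloss whose
    embedding is most similar to the embedding of the context."""
    ctx_vec = embed(context)
    scored = [(cosine(ctx_vec, embed(gloss)), sense_id)
              for sense_id, gloss in glosses.items()]
    best_score, best_sense = max(scored)
    return best_sense, best_score


# Hypothetical English example data, for illustration only.
glosses = {
    "bank_finance": "an institution that accepts deposits and lends money",
    "bank_river": "the sloping land beside a river or lake",
}
context = "she sat on the bank of the river and watched the water"

sense, score = disambiguate(context, glosses)
print(sense)
```

Replacing embed with a contextual encoder would leave the selection logic unchanged, since disambiguate only needs a vector for each text and a similarity function over those vectors.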

Files in This Item:

File	Description	Size	Format
Moape_TG_2024.pdf		4.6 MB	Adobe PDF