
Word sense disambiguation pipeline framework for low resourced morphologically rich languages

dc.contributor.author: Masethe, Mosima Anna [en_US]
dc.contributor.author: Masethe, Hlaudi Daniel [en_US]
dc.contributor.author: Ojo, Sunday Olusegun [en_US]
dc.contributor.author: Owolawi, Pius A. [en_US]
dc.date.accessioned: 2023-03-22T07:02:44Z
dc.date.available: 2023-03-22T07:02:44Z
dc.date.issued: 2022
dc.date.updated: 2023-03-16T14:42:41Z
dc.description.abstract: Resolving ambiguity is a long-standing theoretical research challenge in natural language processing. Sesotho sa Leboa is the official name for Sepedi, or Northern Sotho, one of South Africa's 11 official languages, spoken by 4.7 million people. Sesotho sa Leboa is an indigenous, morphologically rich, low-resourced South African language that is highly polysemous, with words that have numerous contexts. Disambiguating polysemous words remains a challenging problem for computational linguistics research. Deficiencies in several polysemy assessments suggest that the problem of sense distinctiveness versus polysemy remains an unresolved academic issue. Word Sense Disambiguation is a practical problem in natural language processing applications, and it suffers badly when working with ambiguous polysemous words. Therefore, Word Sense Disambiguation seeks both academic and practical results. Many Word Sense Disambiguation applications give high accuracy for English but poor accuracy for Sesotho sa Leboa. In this research, a Word Sense Disambiguation pipeline framework is developed for Sesotho sa Leboa, a low-resourced, morphologically rich language, addressing both the academic and the practical side of the polysemy problem. The proposed framework includes pre-processing modules that reduce ambiguity in the unstructured text corpus that supplies the input sentences. The researchers then compute the probability of Word Sense Disambiguation where polysemy and homonymy are observed, using cosine similarity measures with a sentence transformer (SBERT) and the Word2Vec algorithms (Skip-Gram and Continuous Bag of Words).
The cosine similarity results show that SBERT outperforms the other algorithms with a threshold of 87%, indicating strong similarity between context and sense definition. Continuous Bag of Words gives a cosine similarity threshold of 51%, outperforming the Skip-Gram algorithm, whose threshold is below 50%, with the two vectors approaching a perpendicular (90-degree) angle, which indicates that the orientations of the vectors do not match. [en_US]
dc.format.extent: 16 p [en_US]
dc.identifier.citation: Masethe, M.A et al. 2022. Word sense disambiguation pipeline framework for low resourced morphologically rich languages. SSRN Journal. doi:10.2139/ssrn.4332896 [en_US]
dc.identifier.doi: 10.2139/ssrn.4332896
dc.identifier.issn: 1556-5068 (Online)
dc.identifier.uri: https://hdl.handle.net/10321/4677
dc.language.iso: en [en_US]
dc.publisher: Elsevier BV [en_US]
dc.publisher.uri: https://ssrn.com/abstract=4332896 [en_US]
dc.relation.ispartof: SSRN [en_US]
dc.subject: Corpus [en_US]
dc.subject: Continuous bag of words [en_US]
dc.subject: Natural Language Processing [en_US]
dc.subject: SBERT [en_US]
dc.subject: SkipGram [en_US]
dc.subject: Word Sense Disambiguation [en_US]
dc.title: Word sense disambiguation pipeline framework for low resourced morphologically rich languages [en_US]
dc.type: Article [en_US]
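
Illustrative note on the abstract: the comparison it describes rests on the cosine similarity between a context vector and a sense-definition vector. The sketch below is not the authors' code. It is a minimal pure-Python illustration that assumes the embeddings (from SBERT, Skip-Gram, or Continuous Bag of Words) have already been computed. The function names `cosine_similarity` and `pick_sense`, the sense labels, and the three-dimensional toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pick_sense(context_vec, sense_vecs):
    """Return the sense whose definition vector is closest to the context.

    sense_vecs maps a (hypothetical) sense label to its embedding.
    """
    scores = {
        sense: cosine_similarity(context_vec, vec)
        for sense, vec in sense_vecs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for real embeddings (assumption, not paper data).
context = [0.9, 0.1, 0.0]
senses = {
    "sense_a": [1.0, 0.0, 0.0],  # nearly parallel to the context
    "sense_b": [0.0, 0.0, 1.0],  # orthogonal to the context
}
best, scores = pick_sense(context, senses)
print(best, scores)
```

A score near 1 means the two vectors point in almost the same direction. A score near 0 corresponds to the roughly 90-degree angle the abstract mentions, where the orientations of the vectors do not match.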

Files

Original bundle

Name: SSRN Copyright Clearance.docx
Size: 189.3 KB
Format: Microsoft Word XML
Description: Copyright clearance

Name: Masethe_Ojo_et al_2022.pdf
Size: 508.87 KB
Format: Adobe Portable Document Format
Description: Article