A data science analysis of the South African COVID-19 infodemic on Twitter

Khan, Yaseen

doi:https://doi.org/10.51415/10321/6053

dc.contributor.advisor	Thakur, Surendra C.
dc.contributor.author	Khan, Yaseen
dc.date.accessioned	2025-06-26T06:26:38Z
dc.date.available	2025-06-26T06:26:38Z
dc.date.issued	2024
dc.description	Submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy in Information Technology, Durban University of Technology, Durban, South Africa, 2025.
dc.description.abstract	The rapid dissemination of information on Twitter (X), particularly during COVID-19, has exacerbated infodemics, marked by the proliferation of both accurate and false information. Users were inundated with Fake News, encompassing misinformation, disinformation and malinformation. Misinformation entails the unintentional dissemination of false information, whereas disinformation involves deliberate deception. Malinformation, although grounded in truth, is manipulated to inflict harm. Traditional human-led verification systems have proven inadequate, as seen in prior infodemics such as the 2016 U.S. elections and South Africa’s #FeesMustFall. Legislative measures in South Africa aimed at curbing Fake News during the pandemic were insufficient because of the vast volume of tweets, necessitating computational approaches. Despite global focus on COVID-19 Fake News, South African research on Twitter infodemic analysis remains limited, particularly in the areas of Big Data, longitudinal analysis, fake news detection, sentiment analysis and social bot detection. This study addresses these gaps through advanced data science methods, including Natural Language Processing (NLP), Machine Learning (ML) and Change Point Analysis (CPA), to analyse the South African COVID-19 Twitter infodemic. A longitudinal South African COVID-19 dataset (SAcovid19dataset), comprising 976 086 tweets from 8 November 2019 to 19 July 2021, was curated. Additionally, a labelled dataset (C19MLdataset) of 30 193 tweets was created for Fake News detection, containing 17 069 ‘Fake News’ and 13 124 ‘Not Fake News’ tweets. The study focused on textual analysis, as audio, video and image analytics were beyond its scope. This research uniquely employs an exhaustive approach to compare a wide range of models for COVID-19 Fake News detection, prioritising balanced accuracy and execution time as performance metrics.
Using the C19MLdataset, twenty-seven (27) shallow, five (5) deep learning (DL) and seven (7) transformer models were systematically evaluated. ExtraTreesClassifier, RandomForestClassifier and LightGBM emerged as the top-performing shallow models. RoBERTa was the top-performing transformer model and BiLSTM outperformed other DL models. LightGBM was identified as the most efficient model because of its speed and low computational demands. After optimization, it achieved a balanced accuracy of 88.76% and detected 262 508 (26.89%) Fake News tweets from the full SAcovid19dataset. Sentiment analysis, performed using VADER and CPA, revealed 16 significant shifts in sentiment due to real-world events. Approximately 56% were related to lockdown announcements and restrictions. For instance, the South African national state of emergency on 15 March 2020 led to a shift from neutral to positive sentiment. Contrastingly, the 26 February 2021 South African state of the nation address saw sentiment shift from positive to negative. Social bot activity was examined using three novel algorithms that analysed tweet timestamps, content duplication and sources. Results showed that three of the Top 10 Fake News accounts exhibited bot-like behaviour, confirming the presence of automated accounts in the spread of false information. This study significantly contributes to the understanding of South Africa’s COVID-19 infodemic by providing a robust Fake News detection model and linking real-world events to shifts in public sentiment. The development of social bot detection algorithms further illuminates the role of automated accounts in the dissemination of Fake News. These findings have practical implications for policymakers and researchers aiming to combat infodemics using computational tools.
dc.description.level	D
dc.format.extent	229 p
dc.identifier.doi	https://doi.org/10.51415/10321/6053
dc.identifier.uri	https://hdl.handle.net/10321/6053
dc.language.iso	en
dc.subject	COVID-19
dc.subject	Opinion Mining
dc.subject	Fake News
dc.subject	Natural Language Processing
dc.subject	Infodemic
dc.subject	Data Science
dc.subject	Machine Learning
dc.subject	Sentiment Analysis
dc.subject	Social Bot Detection
dc.subject.lcsh	Information overload
dc.subject.lcsh	Social media--Influence
dc.subject.lcsh	COVID-19 Pandemic, 2020-2023
dc.subject.lcsh	Fake news
dc.title	A data science analysis of the South African COVID-19 infodemic on Twitter
dc.type	Thesis
local.sdg	SDG03
local.sdg	SDG04
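
Note on the abstract's headline metric: balanced accuracy is the mean of the recall on each class, which makes it less sensitive to class imbalance than plain accuracy (the C19MLdataset has 17 069 'Fake News' and 13 124 'Not Fake News' tweets). The following is a minimal sketch of that standard definition, not the thesis's own code; the function name and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of per-class recall for a binary classifier.

    tp/fn: 'Fake News' tweets predicted correctly/incorrectly.
    tn/fp: 'Not Fake News' tweets predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)   # recall on the 'Fake News' class
    specificity = tn / (tn + fp)   # recall on the 'Not Fake News' class
    return (sensitivity + specificity) / 2


# Hypothetical counts (NOT results from the thesis): 100 fake, 50 not fake.
score = balanced_accuracy(tp=90, fn=10, tn=40, fp=10)
plain = (90 + 40) / 150  # ordinary accuracy on the same counts
print(f"balanced accuracy: {score:.4f}, plain accuracy: {plain:.4f}")
```

With these toy counts the two metrics differ because the classifier does better on the larger class; balanced accuracy weights both classes equally.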

Files

Original bundle

Name: Khan_Y_2025.pdf
Size: 6.97 MB
Format: Adobe Portable Document Format

Collections

Theses and dissertations (Accounting and Informatics)