A data science analysis of the South African COVID-19 infodemic on Twitter
| dc.contributor.advisor | Thakur, Surendra C. | |
| dc.contributor.author | Khan, Yaseen | |
| dc.date.accessioned | 2025-06-26T06:26:38Z | |
| dc.date.available | 2025-06-26T06:26:38Z | |
| dc.date.issued | 2024 | |
| dc.description | Submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy in Information Technology, Durban University of Technology, Durban, South Africa, 2025. | |
| dc.description.abstract | The rapid dissemination of information on Twitter (X), particularly during COVID-19, has exacerbated infodemics, marked by the proliferation of both accurate and false information. Users were inundated with Fake News, encompassing misinformation, disinformation and malinformation. Misinformation entails the unintentional dissemination of false information, whereas disinformation involves deliberate deception. Malinformation, although grounded in truth, is manipulated to inflict harm. Traditional human-led verification systems have proven inadequate, as seen in prior infodemics such as the 2016 U.S. elections and South Africa’s #FeesMustFall. Legislative measures in South Africa aimed at curbing Fake News during the pandemic were insufficient because of the vast volume of tweets, necessitating computational approaches. Despite the global focus on COVID-19 Fake News, South African research on Twitter infodemic analysis remains limited, particularly in the areas of Big Data, longitudinal analysis, fake news detection, sentiment analysis and social bot detection. This study addresses these gaps through advanced data science methods, including Natural Language Processing (NLP), Machine Learning (ML) and Change Point Analysis (CPA), to analyse the South African COVID-19 Twitter infodemic. A longitudinal South African COVID-19 dataset (SAcovid19dataset), comprising 976 086 tweets from 8 November 2019 to 19 July 2021, was curated. Additionally, a labelled dataset (C19MLdataset) of 30 193 tweets was created for Fake News detection, containing 17 069 ‘Fake News’ and 13 124 ‘Not Fake News’ tweets. The study focused on textual analysis, as audio, video and image analytics were beyond its scope. This research uniquely employs an exhaustive approach to compare a wide range of models for COVID-19 Fake News detection, prioritising balanced accuracy and execution time as performance metrics.
Using the C19MLdataset, twenty-seven (27) shallow, five (5) deep learning (DL) and seven (7) transformer models were systematically evaluated. ExtraTreesClassifier, RandomForestClassifier and LightGBM emerged as the top-performing shallow models. RoBERTa was the top-performing transformer model and BiLSTM outperformed other DL models. LightGBM was identified as the most efficient model because of its speed and low computational demands. After optimisation, it achieved a balanced accuracy of 88.76% and detected 262 508 (26.89%) Fake News tweets from the full SAcovid19dataset. Sentiment analysis, performed using VADER and CPA, revealed 16 significant shifts in sentiment due to real-world events. Approximately 56% were related to lockdown announcements and restrictions. For instance, the South African national state of emergency on 15 March 2020 led to a shift from neutral to positive sentiment. In contrast, the 26 February 2021 South African state of the nation address saw sentiment shift from positive to negative. Social bot activity was examined using three novel algorithms that analysed tweet timestamps, content duplication and sources. Results showed that three of the Top 10 Fake News accounts exhibited bot-like behaviour, confirming the presence of automated accounts in the spread of false information. This study significantly contributes to the understanding of South Africa’s COVID-19 infodemic by providing a robust Fake News detection model and linking real-world events to shifts in public sentiment. The development of social bot detection algorithms further illuminates the role of automated accounts in the dissemination of Fake News. These findings have practical implications for policymakers and researchers aiming to combat infodemics using computational tools. | |
| dc.description.level | D | |
| dc.format.extent | 229 p | |
| dc.identifier.doi | https://doi.org/10.51415/10321/6053 | |
| dc.identifier.uri | https://hdl.handle.net/10321/6053 | |
| dc.language.iso | en | |
| dc.subject | COVID-19 | |
| dc.subject | Opinion Mining | |
| dc.subject | Fake News | |
| dc.subject | Natural Language Processing | |
| dc.subject | Infodemic | |
| dc.subject | Data Science | |
| dc.subject | Machine Learning | |
| dc.subject | Sentiment Analysis | |
| dc.subject | Social Bot Detection | |
| dc.subject.lcsh | Information overload | |
| dc.subject.lcsh | Social media--Influence | |
| dc.subject.lcsh | COVID-19 Pandemic, 2020-2023 | |
| dc.subject.lcsh | Fake news | |
| dc.title | A data science analysis of the South African COVID-19 infodemic on Twitter | |
| dc.type | Thesis | |
| local.sdg | SDG03 | |
| local.sdg | SDG04 | |
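
The abstract reports dataset counts, a detected share of 26.89% and a balanced accuracy score. As a minimal sketch, the snippet below recomputes the quoted shares from the quoted counts and shows the standard definition of balanced accuracy (the mean of per-class recall). The function name `balanced_accuracy` and the toy labels are hypothetical illustrations; the record does not include the thesis's predictions or confusion matrix, so the 88.76% figure is not reproduced here.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (the standard definition; assumed, not
    confirmed, to match the computation used in the thesis)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Counts quoted in the abstract (C19MLdataset).
FAKE, NOT_FAKE = 17_069, 13_124
LABELLED_TOTAL = FAKE + NOT_FAKE          # 30 193 labelled tweets
fake_share = FAKE / LABELLED_TOTAL        # class imbalance in the labels

# Counts quoted in the abstract (SAcovid19dataset).
DETECTED_FAKE, FULL_TOTAL = 262_508, 976_086
detected_share = DETECTED_FAKE / FULL_TOTAL

# Hypothetical toy labels, only to show how the metric behaves
# when classes are imbalanced.
y_true = ["fake"] * 4 + ["not_fake"] * 2
y_pred = ["fake", "fake", "fake", "not_fake", "not_fake", "fake"]
toy_score = balanced_accuracy(y_true, y_pred)

if __name__ == "__main__":
    print(f"labelled total: {LABELLED_TOTAL}")
    print(f"fake share of labelled set: {fake_share:.2%}")
    print(f"detected share of full set: {detected_share:.2%}")
    print(f"toy balanced accuracy: {toy_score:.3f}")
```

Because the labelled set is not evenly split between the two classes, balanced accuracy weights each class equally, whereas plain accuracy would favour the larger class.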
