Project Overview
This project focuses on cleaning and preparing global internet usage data. The dataset contains information on internet usage across various countries from 2000 to 2023. The cleaning process removes incomplete records, fills the remaining gaps, and drops a sparsely populated column so that the data is ready for further analysis.
Project Details
Key Objectives:
- Handling rows with excessive missing data.
- Imputing missing values in specific columns.
- Deleting columns that are too sparse to be useful.
Data Sources:
The original dataset, "Global Internet Usage by Country (2000-2023)," was downloaded from Kaggle.
A log of the data cleaning process is available here.
The raw and cleaned datasets are available here.
Data Cleaning Steps:
- Rows with missing data in more than 49% of the columns were deleted to remove incomplete records.
- Missing values in each of the year columns from "2000" through "2022" were replaced with the median of that column's non-missing values.
- The "2023" column was deleted due to a high percentage of missing values (72%), which could introduce bias.
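The three steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual cleaning log. It assumes a wide table with one row per country and year columns named as strings ("2000" to "2023"). The function name `clean_internet_usage`, the "Country" column, and the sample values are hypothetical. It also assumes that the missing-data share per row is computed across all columns, including "2023", and that medians are taken after incomplete rows are removed. The source does not specify either detail.

```python
import numpy as np
import pandas as pd


def clean_internet_usage(df, row_missing_threshold=0.49, dropped_column="2023"):
    """Sketch of the three cleaning steps on a wide country-by-year table."""
    # Step 1: drop rows whose share of missing cells exceeds the threshold.
    # Assumption: the share is computed over every column in the table.
    row_missing_share = df.isna().mean(axis=1)
    cleaned = df[row_missing_share <= row_missing_threshold].copy()

    # Step 2: fill gaps in each remaining year column with that column's median.
    # Assumption: year columns are the all-digit string labels.
    year_columns = [
        c for c in cleaned.columns if c.isdigit() and c != dropped_column
    ]
    medians = cleaned[year_columns].median()  # NaN values are skipped by default
    cleaned[year_columns] = cleaned[year_columns].fillna(medians)

    # Step 3: drop the sparsely populated column.
    cleaned = cleaned.drop(columns=[dropped_column], errors="ignore")
    return cleaned.reset_index(drop=True)


# Made-up sample data, for illustration only.
raw = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        "2000": [1.0, np.nan, 3.0, np.nan],
        "2001": [2.0, 4.0, np.nan, np.nan],
        "2023": [np.nan, 5.0, 9.0, np.nan],
    }
)
cleaned = clean_internet_usage(raw)
print(cleaned)
```

In this sample, country D is missing three of its four cells and is removed. The gaps for B and C are filled with the medians of the remaining values, and the "2023" column is dropped. Median imputation is more robust to outliers than the mean, but it flattens any trend over time within a country, which is worth keeping in mind during later analysis.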
Key Insights
- After cleaning, the year columns from 2000 to 2022 contain no missing values, so the dataset can be analyzed without further gap handling.
- Handling missing data appropriately is crucial for avoiding bias and drawing meaningful conclusions.
- The resulting cleaned dataset provides a solid foundation for exploring trends in global internet usage.
Conclusion
This project highlights the importance of data cleaning in the data analysis workflow. By addressing missing values and removing irrelevant data, the project ensures the quality and reliability of the internet usage dataset for subsequent analysis and visualization.