ABSTRACT
When we imagine the work of a data analyst, we often picture meaningful data analysis and beautiful data visualizations. Although that is an exciting part of the job, data analysts actually spend the majority of their time acquiring, cleaning, and preparing data for analysis. This teaching case guides students through some of the most common data cleaning challenges, including handling missing data, reshaping datasets, splitting columns, and profiling data to anticipate data quality concerns. Students will practice these skills in Microsoft Power BI, a current market leader in data analytics, using real-world, publicly available data from the popular United States real-estate platform Zillow. This case would be a good addition to data analytics, data management, or data visualization classes, or to general information systems courses looking to introduce students to the vital activity of data cleaning.
Keywords: Teaching case, Data cleansing, Data literacy, Data acquisition, Data analytics
1. CASE SUMMARY
This case utilizes a real-world dataset provided free to the public by the popular housing and real-estate platform Zillow. Students are led through the steps of acquiring the data and a variety of data cleansing activities: loading data; promoting headers; filtering rows; splitting, renaming, and removing columns; profiling columns and assessing data quality; handling missing data; reshaping the data through unpivoting; changing data types; identifying and dealing with duplicate data; creating hierarchies; and documenting data provenance. This case utilizes the leading data analytics tool Microsoft Power BI, licenses for which are included for free in Microsoft Office 365 subscriptions and are thereby freely available for many university programs. By working through this case, students will learn the basics of data cleansing while also engaging in critical thought processes and class discussions on various data cleansing strategies.
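For readers unfamiliar with unpivoting, one of the reshaping activities listed above, the following is a minimal sketch in plain Python rather than Power BI, which the case itself uses. It assumes a wide table with one column per month. The function name `unpivot`, the column names `RegionName`, `Date`, and `Value`, and all of the values are hypothetical and invented for illustration; they are not taken from the Zillow dataset.

```python
# Toy wide-format rows: one row per region, one column per month.
# Region names and values are invented for illustration only.
wide_rows = [
    {"RegionName": "Region A", "2020-01": 100.0, "2020-02": 101.5},
    {"RegionName": "Region B", "2020-01": 200.0, "2020-02": None},
]


def unpivot(rows, id_columns, var_name="Date", value_name="Value",
            drop_missing=True):
    """Reshape wide rows into long rows: one row per (id, column) pair."""
    long_rows = []
    for row in rows:
        # Identifier columns are copied unchanged onto every output row.
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col in id_columns:
                continue
            # Skipping missing values is a choice made in this sketch.
            if drop_missing and value is None:
                continue
            long_rows.append({**ids, var_name: col, value_name: value})
    return long_rows


long_rows = unpivot(wide_rows, id_columns=["RegionName"])
for r in long_rows:
    print(r)
```

In this sketch each month column becomes its own row, so the long table has a single date column and a single value column. Whether to drop or keep missing values during the reshape is a decision worth discussing alongside the handling of missing data.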
2. CASE TEXT
2.1 Introduction
Data rarely exists in a format that is ready to be analyzed and visualized. Data analysts are estimated to spend up to 80% of their time discovering and preparing data (DalleMule & Davenport, 2017), and "dirty" or bad data is estimated to cost the U.S. $3 trillion per year (Redman, 2016). This is true of real-world housing data, which you are tasked with obtaining, preparing, cleaning, wrangling, and analyzing in this project.
For this project, you will utilize data from the popular housing...