Assignment 2: Cleaning and Visualizing Online Learning Data

BUS220 — Business Intelligence and Analytics

Author

Oleh Omelchenko

Published

January 1, 2026

Overview

You have a dataset of student enrollments from the first year of online courses on the edX platform (HarvardX and MITx). Each row represents one student’s activity in one course. The data is messy — it came from a real system with inconsistent missing values, broken formulas, and mixed data types.

Your task: clean the data, load it into Tableau, and build a set of worksheets that reveal how students engage with these courses.

Total: 8 points + 2 bonus points

Deadline: See Moodle. Late policy: 1–7 days = 50% credit, >7 days = 0.

Deliverables

Submit in Moodle (file upload + submission text box):

Cleaning artifact — the script or pipeline you used to clean the data:
- Syto: export the pipeline as JSON (File > Export pipeline)
- Python: .py or .ipynb notebook
- R: .R script
- Other tool: document your steps so they could be reproduced
Tableau workbook (.twbx — File > Export As Packaged Workbook) with 4 worksheets
Moodle submission text — your written answers for Parts 1 and 3 (see details below)

Submission rules

Text or file uploads only. Do not submit links to Google Docs, Google Sheets, or any external service. Submissions containing external links instead of actual files will receive a −1 point deduction.
Scripts must be reproducible. If you submit a notebook, it should run top-to-bottom on the original CSV without manual intervention. If you submit a Syto pipeline, it should be importable. Submissions that cannot be reproduced will receive a −1 point deduction. (These two deductions do not compound — maximum −1 for submission format issues.)

File naming: LastName_FirstName_assignment_2.twbx (and same pattern for your cleaning artifact, e.g. .ipynb, .py, .json)

Dataset

Download edx-courses-subset.csv from Moodle (~15,000 rows, 23 columns).

The data covers 6 courses — 3 from Harvard, 3 from MIT:

Institution	Course	Subject
HarvardX	ER22x	Justice (philosophy)
HarvardX	PH207x	Health in Numbers (quantitative methods)
MITx	14.73x	Challenges of Global Poverty
MITx	6.00x	Introduction to Computer Science
MITx	6.002x	Circuits and Electronics
MITx	7.00x	Introduction to Biology

Key columns

Identifiers: course_id (includes institution, course code, and semester), userid_DI (de-identified student ID)

Engagement flags (0/1): registered (always 1), viewed (accessed course content), explored (accessed 50%+ of chapters), certified (earned a certificate)

Demographics: Country (aggregated to 7 groups), LoE_DI (education level), YoB (year of birth), gender (m/f/o)

Activity metrics: nevents (total interactions), ndays_act (active days), nplay_video (video plays), nchapters (chapters accessed), nforum_posts (forum posts)

Other: grade (0–1 scale), start_time_DI / last_event_DI (dates), Age (derived from YoB), roles, incomplete_flag

Part 1: Data Cleaning (3 points)

Clean the dataset using whichever tool is most convenient to you: Syto, Python, R, or something else. If you’re unsure, Syto is a good default — it’s what we used in practice. Whatever tool you use, submit the artifact of your work (script, notebook, or exported pipeline) so we can see your steps.

1.1 Inspect and document data quality issues

Before changing anything, look through the data and identify problems. There are at least 4 distinct types of data quality issues in this dataset — find and document them in your Moodle submission text. A few sentences per issue is enough: what’s wrong, which column(s), and what you plan to do about it.

By “type” we mean the category of problem, not individual occurrences. For example, if several columns use different strings to represent missing data, that’s one type (inconsistent missing-value representations) even though it appears in many places. A column stored as text when it should be numeric is a different type.

1.2 Clean

Standardize missing values: replace all the different missing representations with proper nulls/blanks. (Hint: if using Syto, use the column’s “Find and Replace” option in the dropdown menu to quickly find and replace values in-place without creating new columns)
Fix data types: make sure numbers are numbers and dates are dates
Remove useless columns: drop any column that has no information

How to think about cleaning

Not every “missing” value should become a null in the same way. For each column, ask yourself:

What should this null mean? If a numerical column like nevents is null for a student who never viewed the course, does null mean “zero events” or “we don’t know”? The answer depends on the column’s meaning — think about it before deciding whether to keep nulls, replace them with 0, or handle them differently.
Is text or null easier to work with? For categorical columns like gender or LoE_DI, a value like "(unknown)" can be more practical than a null — it shows up in Tableau filters and legends instead of silently disappearing from charts.
Could the null itself be a signal? Sometimes a missing value tells you something about the row. For example, if activity metrics are null for certain students, that pattern might correlate with their engagement level. Consider whether it’s worth deriving an additional column that captures this information.

1.bonus: What’s wrong with Age? (+2 bonus points)

The YoB and Age columns have a relationship, but it’s not a clean one. Some values don’t add up, and the problems are not all the same kind. You may find it easier to investigate this after completing the cleaning steps above.

For bonus credit:

Describe what’s wrong — explain the relationship between YoB and Age, identify the different types of problems, and explain how they likely happened
Fix it — clean or recompute the values so Age is reliable

Write your findings in the Moodle submission text.

1.3 Derive new columns

institution: extract "HarvardX" or "MITx" from the course_id (everything before the first /)
course_code: extract the course code from course_id (the part between the first and second /)

1.4 Export

Export the cleaned data as a CSV file. This is what you will load into Tableau. You don’t need to submit the CSV separately — it will be packaged inside your Tableau workbook (.twbx). The only file you submit for Part 1 is your cleaning script or pipeline.

Checkpoint

Your cleaned CSV should have:

~15,000 rows (same as the original — do not filter rows out)
No non-standard missing values — every missing value should be an empty cell, not a string like "NA" or a formula error
institution and course_code columns extracted from course_id
Useless columns removed

Part 2: Tableau Worksheets (3 points)

Connect your cleaned CSV to Tableau Public and build the following 3 worksheets. Each worksheet starts with a question — your job is to build the calculation that answers it and choose a chart type that makes the answer clear. Every worksheet should have a descriptive title and clean tooltips.

Worksheet 1: How many students make it through each engagement stage?

The engagement flags are cumulative: every student is registered, some viewed the course, fewer explored it, and fewer still earned a certificate. Show how many students reach each stage.

What to calculate: The total number of students at each engagement level — registered, viewed, explored, certified.

Tip

Hint: Since you need to show four different measures in one chart, look into Measure Names / Measure Values.

Worksheet 2: Do courses attract different types of students?

Not every course draws the same audience. Compare the education-level profile (LoE_DI) across the 6 courses. The question isn’t “how many Bachelor’s holders took each course” (raw counts) — it’s “what share of each course’s students hold a Bachelor’s degree?”

What to calculate: The percentage of each education level within each course — not raw counts. This is a job for a table calculation (Percent of Total). Pay attention to how you configure the “Compute Using” setting — percentages should add up to 100% within each course.

Tip

Sanity check: One course has a notably higher share of Doctorate-level students than any other. If your chart doesn’t show a clear outlier, check your table calculation direction.

Worksheet 3: Which courses have high engagement but low completion?

A course’s overall certification rate can hide an important distinction. Some courses fail to attract engagement in the first place — students register but never show up. Other courses attract deeply engaged students who still don’t finish. These are different problems.

What to calculate: Two rates that decompose the certification pipeline:

Explore rate — what share of registrants engaged deeply (explored)?
Finish rate — of those who explored, what share earned a certificate?

Show both rates for each course in one chart so the mismatch is visible. A course with a high explore rate but a low finish rate has a different problem than one where students never engage at all.

Tip

Hint: You can place two calculated fields on the same axis using Measure Names / Measure Values, or use a dual axis.

Sanity check: You should see courses where these two rates diverge sharply. If all courses look similar on both rates, revisit your formulas.

Note

Why this matters: If a course has a low overall certification rate, these two rates tell you where the problem is. A low finish rate among explorers suggests the final stretch is the barrier (hard exams? time commitment?). A high finish rate with low exploration suggests the course struggles to hook students in the first place. This kind of diagnostic thinking is central to business intelligence — the first number tells you that something is happening, the second tells you why.

Part 3: Open-Ended Worksheet (2 points)

Worksheet 4: A Question That Needs Multiple Dimensions

Some questions can’t be answered by a single number or a single comparison. “Which course has the highest certification rate?” is one-dimensional — you need one metric across one category. But “Does the relationship between activity and grades differ by institution, and does course size affect this pattern?” needs you to see four things at once: activity, grades, institution, and enrollment size. Splitting this into separate charts would lose the comparison.

Find a question like that — one where at least 4 data dimensions need to be visible simultaneously for the answer to emerge. Then build a single chart that encodes all of them using different visual channels (position, color, size, shape, labels, etc.).

Before you start, read FDV Chapter 2: Visualizing Data — Mapping Data onto Aesthetics (free online) to learn (or refresh your knowledge) how data dimensions map to visual channels.

Tip

Some chart types that naturally support many dimensions:

Chart type	Encodings it uses
Highlight table (color-coded grid)	Row, column, color intensity, text
Bubble chart	X, Y, size, color
Scatter plot with marks	X, Y, color, size, shape

In your Moodle submission, write:

The question your chart answers — what do you need multiple dimensions to see?
What dimensions you encoded and which visual channel you used for each
Why one chart — why does combining these dimensions in a single view reveal something that separate charts would miss?

Submission text

In the Moodle submission text box, write:

Part 1: Brief summary of data quality issues you found (a few sentences per issue)
Part 1 bonus (if attempting): what’s wrong with YoB/Age and how the problems likely happened
Part 3, Worksheet 4: the question, dimensions encoded with their visual channels, and why one chart works better than separate charts

Make sure your Tableau workbook has descriptive worksheet titles (not “Sheet 1”) and clean tooltips (remove default “SUM(certified)” clutter).

Grading

Part	Points
1. Data Cleaning	3
2. Tableau Worksheets	3
3. Open-Ended Worksheet	2
Bonus: What’s wrong with Age?	+2

What loses points:

Submission format (−1, non-compounding): External links instead of files, or a script that can’t be reproduced on the original CSV.
Wrong metric: Chart shows counts when the question asks about rates, or uses the wrong aggregation.
No calculation: Worksheet uses only raw columns with no calculated field where one is needed.
Missing context: Generic titles (“Sheet 1”), default tooltips (“SUM(certified)”), unsorted bars when order matters.
Weak written answers: Documentation is missing, too vague to show understanding, or contradicts what the chart actually shows.
Bonus: Description is superficial (“some values are missing”) without explaining the mechanism. Fix doesn’t address all types of problems.

AI Usage

You may use AI tools to help with cleaning syntax (regex patterns, Syto expressions, pandas functions), troubleshooting Tableau issues, or understanding chart types. Mention it briefly in your submission. The cleaning decisions and chart design choices must be your own.