Assignment 2: Cleaning and Visualizing Online Learning Data
BUS220 — Business Intelligence and Analytics
Overview
You have a dataset of student enrollments from the first year of online courses on the edX platform (HarvardX and MITx). Each row represents one student’s activity in one course. The data is messy — it came from a real system with inconsistent missing values, broken formulas, and mixed data types.
Your task: clean the data, load it into Tableau, and build a set of worksheets that reveal how students engage with these courses.
Total: 8 points + 2 bonus points
Deadline: See Moodle. Late policy: 1–7 days = 50% credit, >7 days = 0.
Deliverables
Submit in Moodle (file upload + submission text box):
- Cleaning artifact — the script or pipeline you used to clean the data:
- Syto: export the pipeline as JSON (File > Export pipeline)
- Python:
.pyor.ipynbnotebook - R:
.Rscript - Other tool: document your steps so they could be reproduced
- Tableau workbook (
.twbx— File > Export As Packaged Workbook) with 4 worksheets - Moodle submission text — your written answers for Parts 1 and 3 (see details below)
- Text or file uploads only. Do not submit links to Google Docs, Google Sheets, or any external service. Submissions containing external links instead of actual files will receive a −1 point deduction.
- Scripts must be reproducible. If you submit a notebook, it should run top-to-bottom on the original CSV without manual intervention. If you submit a Syto pipeline, it should be importable. Submissions that cannot be reproduced will receive a −1 point deduction. (These two deductions do not compound — maximum −1 for submission format issues.)
File naming: LastName_FirstName_assignment_2.twbx (and same pattern for your cleaning artifact, e.g. .ipynb, .py, .json)
Dataset
Download edx-courses-subset.csv from Moodle (~15,000 rows, 23 columns).
The data covers 6 courses — 3 from Harvard, 3 from MIT:
| Institution | Course | Subject |
|---|---|---|
| HarvardX | ER22x | Justice (philosophy) |
| HarvardX | PH207x | Health in Numbers (quantitative methods) |
| MITx | 14.73x | Challenges of Global Poverty |
| MITx | 6.00x | Introduction to Computer Science |
| MITx | 6.002x | Circuits and Electronics |
| MITx | 7.00x | Introduction to Biology |
Key columns
Identifiers: course_id (includes institution, course code, and semester), userid_DI (de-identified student ID)
Engagement flags (0/1): registered (always 1), viewed (accessed course content), explored (accessed 50%+ of chapters), certified (earned a certificate)
Demographics: Country (aggregated to 7 groups), LoE_DI (education level), YoB (year of birth), gender (m/f/o)
Activity metrics: nevents (total interactions), ndays_act (active days), nplay_video (video plays), nchapters (chapters accessed), nforum_posts (forum posts)
Other: grade (0–1 scale), start_time_DI / last_event_DI (dates), Age (derived from YoB), roles, incomplete_flag
Part 1: Data Cleaning (3 points)
Clean the dataset using whichever tool is most convenient to you: Syto, Python, R, or something else. If you’re unsure, Syto is a good default — it’s what we used in practice. Whatever tool you use, submit the artifact of your work (script, notebook, or exported pipeline) so we can see your steps.
1.1 Inspect and document data quality issues
Before changing anything, look through the data and identify problems. There are at least 4 distinct types of data quality issues in this dataset — find and document them in your Moodle submission text. A few sentences per issue is enough: what’s wrong, which column(s), and what you plan to do about it.
By “type” we mean the category of problem, not individual occurrences. For example, if several columns use different strings to represent missing data, that’s one type (inconsistent missing-value representations) even though it appears in many places. A column stored as text when it should be numeric is a different type.
1.2 Clean
- Standardize missing values: replace all the different missing representations with proper nulls/blanks. (Hint: if using Syto, use the column’s “Find and Replace” option in the dropdown menu to quickly find and replace values in-place without creating new columns)
- Fix data types: make sure numbers are numbers and dates are dates
- Remove useless columns: drop any column that has no information
Not every “missing” value should become a null in the same way. For each column, ask yourself:
- What should this null mean? If a numerical column like
neventsis null for a student who never viewed the course, does null mean “zero events” or “we don’t know”? The answer depends on the column’s meaning — think about it before deciding whether to keep nulls, replace them with 0, or handle them differently. - Is text or null easier to work with? For categorical columns like
genderorLoE_DI, a value like"(unknown)"can be more practical than a null — it shows up in Tableau filters and legends instead of silently disappearing from charts. - Could the null itself be a signal? Sometimes a missing value tells you something about the row. For example, if activity metrics are null for certain students, that pattern might correlate with their engagement level. Consider whether it’s worth deriving an additional column that captures this information.
1.bonus: What’s wrong with Age? (+2 bonus points)
The YoB and Age columns have a relationship, but it’s not a clean one. Some values don’t add up, and the problems are not all the same kind. You may find it easier to investigate this after completing the cleaning steps above.
For bonus credit:
- Describe what’s wrong — explain the relationship between
YoBandAge, identify the different types of problems, and explain how they likely happened - Fix it — clean or recompute the values so
Ageis reliable
Write your findings in the Moodle submission text.
1.3 Derive new columns
institution: extract"HarvardX"or"MITx"from thecourse_id(everything before the first/)course_code: extract the course code fromcourse_id(the part between the first and second/)
1.4 Export
Export the cleaned data as a CSV file. This is what you will load into Tableau. You don’t need to submit the CSV separately — it will be packaged inside your Tableau workbook (.twbx). The only file you submit for Part 1 is your cleaning script or pipeline.
Your cleaned CSV should have:
- ~15,000 rows (same as the original — do not filter rows out)
- No non-standard missing values — every missing value should be an empty cell, not a string like
"NA"or a formula error institutionandcourse_codecolumns extracted fromcourse_id- Useless columns removed
Part 2: Tableau Worksheets (3 points)
Connect your cleaned CSV to Tableau Public and build the following 3 worksheets. Each worksheet starts with a question — your job is to build the calculation that answers it and choose a chart type that makes the answer clear. Every worksheet should have a descriptive title and clean tooltips.
Worksheet 1: How many students make it through each engagement stage?
The engagement flags are cumulative: every student is registered, some viewed the course, fewer explored it, and fewer still earned a certificate. Show how many students reach each stage.
What to calculate: The total number of students at each engagement level — registered, viewed, explored, certified.
Hint: Since you need to show four different measures in one chart, look into Measure Names / Measure Values.
Worksheet 2: Do courses attract different types of students?
Not every course draws the same audience. Compare the education-level profile (LoE_DI) across the 6 courses. The question isn’t “how many Bachelor’s holders took each course” (raw counts) — it’s “what share of each course’s students hold a Bachelor’s degree?”
What to calculate: The percentage of each education level within each course — not raw counts. This is a job for a table calculation (Percent of Total). Pay attention to how you configure the “Compute Using” setting — percentages should add up to 100% within each course.
Sanity check: One course has a notably higher share of Doctorate-level students than any other. If your chart doesn’t show a clear outlier, check your table calculation direction.
Worksheet 3: Which courses have high engagement but low completion?
A course’s overall certification rate can hide an important distinction. Some courses fail to attract engagement in the first place — students register but never show up. Other courses attract deeply engaged students who still don’t finish. These are different problems.
What to calculate: Two rates that decompose the certification pipeline:
- Explore rate — what share of registrants engaged deeply (explored)?
- Finish rate — of those who explored, what share earned a certificate?
Show both rates for each course in one chart so the mismatch is visible. A course with a high explore rate but a low finish rate has a different problem than one where students never engage at all.
Hint: You can place two calculated fields on the same axis using Measure Names / Measure Values, or use a dual axis.
Sanity check: You should see courses where these two rates diverge sharply. If all courses look similar on both rates, revisit your formulas.
Why this matters: If a course has a low overall certification rate, these two rates tell you where the problem is. A low finish rate among explorers suggests the final stretch is the barrier (hard exams? time commitment?). A high finish rate with low exploration suggests the course struggles to hook students in the first place. This kind of diagnostic thinking is central to business intelligence — the first number tells you that something is happening, the second tells you why.
Part 3: Open-Ended Worksheet (2 points)
Worksheet 4: A Question That Needs Multiple Dimensions
Some questions can’t be answered by a single number or a single comparison. “Which course has the highest certification rate?” is one-dimensional — you need one metric across one category. But “Does the relationship between activity and grades differ by institution, and does course size affect this pattern?” needs you to see four things at once: activity, grades, institution, and enrollment size. Splitting this into separate charts would lose the comparison.
Find a question like that — one where at least 4 data dimensions need to be visible simultaneously for the answer to emerge. Then build a single chart that encodes all of them using different visual channels (position, color, size, shape, labels, etc.).
Before you start, read FDV Chapter 2: Visualizing Data — Mapping Data onto Aesthetics (free online) to learn (or refresh your knowledge) how data dimensions map to visual channels.
Some chart types that naturally support many dimensions:
| Chart type | Encodings it uses |
|---|---|
| Highlight table (color-coded grid) | Row, column, color intensity, text |
| Bubble chart | X, Y, size, color |
| Scatter plot with marks | X, Y, color, size, shape |
In your Moodle submission, write:
- The question your chart answers — what do you need multiple dimensions to see?
- What dimensions you encoded and which visual channel you used for each
- Why one chart — why does combining these dimensions in a single view reveal something that separate charts would miss?
Submission text
In the Moodle submission text box, write:
- Part 1: Brief summary of data quality issues you found (a few sentences per issue)
- Part 1 bonus (if attempting): what’s wrong with YoB/Age and how the problems likely happened
- Part 3, Worksheet 4: the question, dimensions encoded with their visual channels, and why one chart works better than separate charts
Make sure your Tableau workbook has descriptive worksheet titles (not “Sheet 1”) and clean tooltips (remove default “SUM(certified)” clutter).
Grading
| Part | Points |
|---|---|
| 1. Data Cleaning | 3 |
| 2. Tableau Worksheets | 3 |
| 3. Open-Ended Worksheet | 2 |
| Bonus: What’s wrong with Age? | +2 |
What loses points:
- Submission format (−1, non-compounding): External links instead of files, or a script that can’t be reproduced on the original CSV.
- Wrong metric: Chart shows counts when the question asks about rates, or uses the wrong aggregation.
- No calculation: Worksheet uses only raw columns with no calculated field where one is needed.
- Missing context: Generic titles (“Sheet 1”), default tooltips (“SUM(certified)”), unsorted bars when order matters.
- Weak written answers: Documentation is missing, too vague to show understanding, or contradicts what the chart actually shows.
- Bonus: Description is superficial (“some values are missing”) without explaining the mechanism. Fix doesn’t address all types of problems.
AI Usage
You may use AI tools to help with cleaning syntax (regex patterns, Syto expressions, pandas functions), troubleshooting Tableau issues, or understanding chart types. Mention it briefly in your submission. The cleaning decisions and chart design choices must be your own.