From Pivot Tables to Tidy Data

BUS220 - Business Intelligence and Analytics | Week 3

Oleh Omelchenko

2026-03-25

Data has shape

What did the pivot table do?

You had 10,000 order rows. To make a multi-series line chart, you built a pivot table.

Before (long):

Quarter Region Sales
2014-Q1 Central $8,601
2014-Q1 East $12,340
2014-Q1 South $6,209
2014-Q1 West $9,734
2014-Q2 Central $11,082

After (wide):

Quarter Central East South West
2014-Q1 $8,601 $12,340 $6,209 $9,734
2014-Q2 $11,082

The pivot table is a data reshaping operation

The pivot table changed your data’s shape.


Values from the Region column became column headers.


This shape is called wide format: compact, human-readable, one row per time period.


The original shape is called long format: one observation per row, each variable in its own column.
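The same reshape can be written in pandas (one of the prep options listed later). A minimal sketch with the slide's table hard-coded: pivot is the long-to-wide direction.

```python
import pandas as pd

# The long table from the slide: one row per (quarter, region) observation
long_df = pd.DataFrame({
    "Quarter": ["2014-Q1", "2014-Q1", "2014-Q1", "2014-Q1", "2014-Q2"],
    "Region":  ["Central", "East", "South", "West", "Central"],
    "Sales":   [8601, 12340, 6209, 9734, 11082],
})

# Pivot: values from the Region column become column headers -> wide format
wide_df = long_df.pivot(index="Quarter", columns="Region", values="Sales")
print(wide_df)
```

Missing combinations (2014-Q2 has only a Central value here) come out as NaN, which is the wide format's way of saying "no observation".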

Where does this idea come from?

In 2014, Hadley Wickham, a statistician from New Zealand then working at RStudio, published a paper formalizing something practitioners had been doing informally.

He noticed every analysis project started with the same struggle: reshaping data before the actual analysis could begin. The target shape was always the same.

He called it tidy data. Three rules:

  1. Each variable is a column
  2. Each observation is a row
  3. Each type of observational unit is a table

Wickham’s “Tidy Data” paper (2014)

Wickham translated a 1970 idea


In 1970, Edgar Codd published the relational model of data; within a year he had formalized its normalization rules, including third normal form: store each fact once, in the right place.


Wickham references Codd directly in his paper. Tidy data is third normal form, repackaged for people who work in R and spreadsheets instead of databases.


Community They call it
Database engineers Third normal form
Statisticians Tidy data
BI practitioners Star schema


Wickham’s contribution was making the idea accessible outside the database world.

You’ve already seen a star schema

The Superstore spreadsheet had three sheets, connected by lookups.

erDiagram
    People ||--o{ Orders : "Region"
    Orders ||--o| Returns : "Order ID"

    Orders {
        string Order_ID
        date Order_Date
        string Region
        string Product
        float Sales
        float Profit
        float Discount
        int Quantity
    }
    People {
        string Region
        string Manager
    }
    Returns {
        string Order_ID
        string Returned
    }

Star schema = one central fact table (the transactions, the measurements) surrounded by smaller dimension tables (the context: who, where, what category).


The XLOOKUP you wrote in week 1 was a manual join between a fact table and a dimension table.
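The same join, sketched in pandas with toy data (the order IDs and manager names below are made up for illustration): a left join pulls the dimension-table column onto every fact-table row, exactly what the XLOOKUP did cell by cell.

```python
import pandas as pd

# Fact table (transactions) and dimension table (context), as in Superstore
orders = pd.DataFrame({
    "Order ID": ["A1", "A2", "A3"],
    "Region":   ["East", "West", "East"],
    "Sales":    [120.0, 85.5, 42.0],
})
people = pd.DataFrame({
    "Region":  ["East", "West"],
    "Manager": ["Ana", "Bo"],
})

# Left join on the shared key: every order keeps its row, gains its manager
joined = orders.merge(people, on="Region", how="left")
print(joined)
```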

Exercise: which shape is this?

A grade book

Student HW1 HW2 Midterm Final
Anna 85 92 78 88
Boris 91 87 94 90
Clara 73 80 85 82

Wide or long?

Wide. “Assignment” is a variable stuffed into column headers. To use this in Tableau, you’d unpivot HW1/HW2/Midterm/Final into an “Assignment” column and a “Score” column.
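The unpivot, sketched in pandas on the grade book above: melt moves the assignment names out of the headers into a column of their own.

```python
import pandas as pd

grades = pd.DataFrame({
    "Student": ["Anna", "Boris", "Clara"],
    "HW1":     [85, 91, 73],
    "HW2":     [92, 87, 80],
    "Midterm": [78, 94, 85],
    "Final":   [88, 90, 82],
})

# Unpivot: 3 wide rows become 12 long rows, one per (student, assignment)
tidy = grades.melt(id_vars="Student", var_name="Assignment", value_name="Score")
print(tidy.head())
```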

A bank transaction log

Date Account Amount Type
Mar 1 Checking -45.00 Purchase
Mar 1 Checking 2,500.00 Salary
Mar 2 Savings 500.00 Transfer

Wide or long?

Long (tidy). Each transaction is one row. Each variable is one column.

Why tools matter

What happens when spreadsheets aren’t enough.

Excel renames your genes

Genomics researchers work with gene names like SEPT1 (Septin 1), MARCH1 (Membrane Associated Ring-CH-Type Finger 1), DEC1 (Deleted in Esophageal Cancer 1).

Open a CSV with these names in Excel:

  • SEPT1 → September 1
  • MARCH1 → March 1
  • DEC1 → December 1

No warning. No error message. Silent data corruption.

Excel silently converts gene names to dates

20% of genomics papers had corrupted gene names

A 2016 study by Ziemann, Eren, and El-Osta screened 3,597 published papers with supplementary Excel files.


704 papers — roughly one in five — contained gene names that had been silently converted to dates or floating-point numbers by Excel.


In 2020, the HUGO Gene Nomenclature Committee (HGNC) gave up on fixing the software. They renamed 27 human genes instead.


SEPT1 → SEPTIN1. MARCH1 → MARCHF1.


They renamed parts of the human genome because they couldn’t get scientists to stop opening CSVs in Excel.

Invisible formulas, real consequences

In 2010, Harvard economists Reinhart and Rogoff published a paper arguing that countries with debt above 90% of GDP experience negative growth.

The paper was cited in Congressional hearings, European Parliament debates, and IMF policy documents to justify austerity measures.

Three years later, a graduate student found an averaging formula whose cell range excluded five rows: Australia, Austria, Belgium, Canada, and Denmark were missing from the average.

With all countries included, the result flipped: slow but positive growth.

Reinhart-Rogoff spreadsheet error

The corrected results

Reinhart-Rogoff: original vs. corrected GDP growth by debt level

The “above 90%” group changed the most: from −0.1% to +2.2% average growth. The formula error didn’t just nudge the number — it flipped the sign.

But tools aren’t the only problem

Even if your tool is perfect, the data itself arrives broken.

What problems can you spot?

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Duplicated rows

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Rows 1 and 3 match on every field except id. Is this a real duplicate, or two separate transactions that happen to match?

If you sum the amount column, you get an extra 1,200 — or you lose a real sale.
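A pipeline step can surface the suspects instead of hiding them. A sketch in pandas on a slice of the table above: comparing every field except id finds the match, but whether to drop it is a business decision, not a code decision.

```python
import pandas as pd

sales = pd.DataFrame({
    "id":     [1, 2, 3],
    "date":   ["2025-03-01", "2025-03-04", "2025-03-01"],
    "city":   ["Kyiv", "Kiev", "Kyiv"],
    "amount": [1200, 800, 1200],
})

# Ignore the id column: rows 1 and 3 agree on every substantive field
dupes = sales.duplicated(subset=["date", "city", "amount"])
print(sales[dupes])  # the suspect row(s), for a human to review

deduped = sales.drop_duplicates(subset=["date", "city", "amount"])
```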

Inconsistent dates

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Two formats in one column. Rows 1, 3, 4, 6 use YYYY-MM-DD. Rows 2 and 5 use MM/DD/YYYY — or is it DD/MM/YYYY?

Is row 5 January 4th or April 1st? Without knowing who entered it, you can’t tell.
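A parser makes that assumption explicit rather than silent. A sketch in pandas: the same string yields two different dates depending on the dayfirst flag you choose.

```python
import pandas as pd

ambiguous = "03/04/2025"

# The same string parses two ways; the flag is an assumption you must document
us_style = pd.to_datetime(ambiguous, dayfirst=False)  # month first: March 4
eu_style = pd.to_datetime(ambiguous, dayfirst=True)   # day first: April 3
print(us_style.date(), eu_style.date())
```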

Inconsistent spelling

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Four spellings of the same city: Kyiv, Kiev, kyiv, Київ.

A GROUP BY city gives you four groups instead of one. Every aggregate — total, average, count — is silently wrong.
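The standard fix is to normalize and then map every known variant to one canonical spelling. A sketch in pandas; the mapping table itself is a judgment call you must document, not something the code can infer.

```python
import pandas as pd

city = pd.Series(["Kyiv", "Kiev", "kyiv", "Київ"])

# Lowercase and trim first, so the mapping only needs one key per variant
canonical = {"kyiv": "Kyiv", "kiev": "Kyiv", "київ": "Kyiv"}
clean = city.str.strip().str.lower().map(canonical)

print(clean.nunique())  # one group instead of four
```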

Missing and disguised values

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 (blank) Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Row 4: city is blank — was it not recorded, or is this an online order with no city?

Row 4: amount is 0 — is this a free item, or does 0 mean “not yet invoiced”?

Mixed types and formatting

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

$800 and 2,100 are text strings, not numbers. A SUM formula skips them silently.

Electronics vs. electronics — same category or different? Other — what does that even mean?
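Converting in a pipeline step fails loudly where a SUM fails silently. A sketch in pandas on the amount column above: strip the currency symbols and separators, then convert; pd.to_numeric raises on any leftover text by default instead of skipping it.

```python
import pandas as pd

amount = pd.Series(["1200", "$800", "1200", "0", "350", "2,100"])

# Remove "$" and thousands separators, then convert text to numbers
numeric = pd.to_numeric(amount.str.replace(r"[$,]", "", regex=True))
print(numeric.sum())
```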

Six rows, six problems

Problem Rows Risk
Duplicated rows 1, 3 Inflated totals
Inconsistent dates 2, 5 Wrong time series
Inconsistent spelling 1, 2, 5, 6 Broken grouping
Missing values 4 Wrong aggregates
Numbers as text 2, 6 Silent calculation errors
Meaningless categories 5 Lost information

This was 6 rows. Real datasets have thousands.

Fixing these problems one cell at a time doesn’t scale. You need a process that’s systematic and reproducible.

Pipelines instead of spreadsheets

A data preparation tool separates data from logic. Every transformation is an explicit, named step.

import
filter
rename
convert types
derive
export

If step 3 is wrong, you fix step 3; every step after it reruns automatically.


If a column is typed as text, it stays text. No silent date conversions.
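Pipeline thinking looks the same in code as in a visual tool. A minimal sketch in pandas (the column names and values are made up): each step is a named function, and fixing one step means rerunning the chain, not re-editing cells.

```python
import pandas as pd

def rename_columns(df):
    # Step: rename -- lowercase every column header
    return df.rename(columns=str.lower)

def convert_types(df):
    # Step: convert types -- text amounts become numbers, loudly if they can't
    return df.assign(amount=pd.to_numeric(df["amount"]))

def derive_total(df):
    # Step: derive -- a new column computed from existing ones
    return df.assign(total=df["amount"] * df["qty"])

raw = pd.DataFrame({"Amount": ["10", "20"], "Qty": [2, 3]})

# The pipeline: every transformation is an explicit, named, reorderable step
clean = (raw
         .pipe(rename_columns)
         .pipe(convert_types)
         .pipe(derive_total))
print(clean)
```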

Every BI platform has a data prep tool


Platform Data prep tool Cost
Power BI Power Query Included with Power BI
Tableau Tableau Prep Included with Tableau Creator (~$70/month)
Python pandas Free, but you write code
Browser Syto Free


The concept is the same in all of them: a visual pipeline where each step is an explicit transformation you can inspect and reorder.

Syto

  • Browser-based — no installation, works on any OS
  • Visual pipeline — each transformation is a named step you can reorder, edit, or remove
  • Exports clean CSV — which you then connect to Tableau


It’s free, it’s visual, and it teaches the same concepts you’d use in Power Query or Tableau Prep.

The skill transfers: once you understand pipeline thinking, switching tools is learning new syntax for the same operations.

Syto interface with pipeline steps

Live demo: cleaning data in Syto

Tableau’s data model

How Tableau understands your data — the layer underneath the charts.

Live demo: connecting data in Tableau

Dimensions vs. measures

[Screenshot: Tableau Data pane for billboard-long.csv — Dimensions section (artist, genre, track, date_entered) above the line, Measures section (rank, week) below the line]

Dimensions answer what

  • What artist?
  • What genre?
  • What date?

Drag to Rows → one row per unique value

Measures answer how much

  • What rank?
  • How many weeks?
  • What sales amount?

Drag to Columns → Tableau aggregates (sums, averages, counts)
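What Tableau does on that drag is, in effect, a group-and-aggregate. A sketch of the same operation in pandas with toy data: the dimension becomes the grouping key, the measure gets aggregated.

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["East", "West", "East"],  # dimension: answers "what"
    "sales":  [120.0, 85.5, 42.0],       # measure: answers "how much"
})

# Dimension on Rows, measure on Columns: one aggregated value per group
summary = orders.groupby("region")["sales"].sum()
print(summary)
```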

Discrete vs. continuous

Blue pill
Discrete — separate buckets
Tableau draws a header for each value
Good for: categories, individual years, distinct groups

Green pill
Continuous — an unbroken range
Tableau draws an axis
Good for: time series, numeric ranges, quantities

[Screenshot: Same date field shown two ways — top: discrete YEAR(date_entered) with separate column headers “1999” “2000”; bottom: continuous date_entered as a time axis flowing left to right]


The shortcut: blue = headers, green = axis. The same field can be either — you choose.

The Marks card: pre-attentive attributes in Tableau

[Screenshot: Tableau Marks card showing Color, Size, Shape, Detail, Tooltip]


Pre-attentive attributes — color, size, shape — are the channels the brain processes fastest.


In Tableau, these live on the Marks card. Drag a field to Color, Size, or Shape to encode data through that visual channel.

Live demo: the automatic pivot in action

What’s ahead in practice

Practice this week

P1 — Data wrangling in Syto

Take the Billboard Hot 100 dataset: 317 songs, 76 weeks of rankings, crammed into 79 columns. Unpivot the 76 week columns into rows, clean the results, export two files at two granularity levels.

This is literally Table 5 from Wickham’s tidy data paper.
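The same unpivot you will do in Syto, sketched in pandas on a two-column toy version of the file (the tracks and ranks below are invented; only the x…week column naming follows the real dataset):

```python
import pandas as pd

# Toy slice of billboard-raw.csv: week columns named like the real file
raw = pd.DataFrame({
    "track":     ["Song A", "Song B"],
    "x1st.week": [87, 91],
    "x2nd.week": [82, 87],
})

# Unpivot every week column into (track, week, rank) rows
week_cols = [c for c in raw.columns if c.endswith(".week")]
tidy = raw.melt(id_vars="track", value_vars=week_cols,
                var_name="week", value_name="rank")

# Pull the week number out of names like "x1st.week"
tidy["week"] = tidy["week"].str.extract(r"(\d+)", expand=False).astype(int)
print(tidy)
```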


P2 — First steps in Tableau

Connect both cleaned files. Build worksheets from each — line charts from the song-week data, bar charts and histograms from the song-level data.

[Screenshot: The raw billboard-raw.csv opened in a spreadsheet — showing the 79-column structure with x1st.week through x76th.week columns stretching off-screen to the right. The “this is untidy” moment.]

Three ideas to take with you


Data has shape. Wide and long are two layouts for the same information. Pivot goes long → wide. Unpivot goes wide → long.


Tableau pivots for you. Give it tidy data and it handles grouping, aggregation, and visual separation automatically.


Granularity determines meaning. The same drag-and-drop produces insight or garbage depending on whether the data level matches the question.