From Pivot Tables to Tidy Data

BUS220 - Business Intelligence and Analytics | Week 3

Oleh Omelchenko

2026-03-25

Data has shape

What did the pivot table do?

You had 10,000 order rows. To make a multi-series line chart, you built a pivot table.

Before (long):

Quarter Region Sales
2014-Q1 Central $8,601
2014-Q1 East $12,340
2014-Q1 South $6,209
2014-Q1 West $9,734
2014-Q2 Central $11,082

After (wide):

Quarter Central East South West
2014-Q1 $8,601 $12,340 $6,209 $9,734
2014-Q2 $11,082

The pivot table is a data reshaping operation

The pivot table changed your data’s shape.


Values from the Region column became column headers.


This shape is called wide format: compact, human-readable, one row per time period.


The original shape is called long format: one observation per row, each variable in its own column.
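The same reshape can be written in pandas (one of the prep options listed later). A minimal sketch with the slide's table hard-coded: pivot is the long-to-wide direction.

```python
import pandas as pd

# The long table from the slide: one row per (quarter, region) observation
long_df = pd.DataFrame({
    "Quarter": ["2014-Q1", "2014-Q1", "2014-Q1", "2014-Q1", "2014-Q2"],
    "Region":  ["Central", "East", "South", "West", "Central"],
    "Sales":   [8601, 12340, 6209, 9734, 11082],
})

# Pivot: values from the Region column become column headers -> wide format
wide_df = long_df.pivot(index="Quarter", columns="Region", values="Sales")
print(wide_df)
```

Missing combinations (2014-Q2 has only a Central value here) come out as NaN, which is the wide format's way of saying "no observation".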

Where does this idea come from?

In 2014, Hadley Wickham, a statistician from New Zealand then working at RStudio, published a paper formalizing something practitioners had been doing informally.

He noticed every analysis project started with the same struggle: reshaping data before the actual analysis could begin. The target shape was always the same.

He called it tidy data. Three rules:

  1. Each variable is a column
  2. Each observation is a row
  3. Each type of observational unit is a table

Wickham’s “Tidy Data” paper (2014)

Wickham translated a 1970 idea


In 1970, Edgar Codd published the relational model of data; within a year he had formalized its normalization rules, including third normal form: store each fact once, in the right place.


Wickham references Codd directly in his paper. Tidy data is third normal form, repackaged for people who work in R and spreadsheets instead of databases.


Community They call it
Database engineers Third normal form
Statisticians Tidy data
BI practitioners Star schema


Wickham’s contribution was making the idea accessible outside the database world.

You’ve already seen a star schema

The Superstore spreadsheet had three sheets, connected by lookups.

erDiagram
    People ||--o{ Orders : "Region"
    Orders ||--o| Returns : "Order ID"

    Orders {
        string Order_ID
        date Order_Date
        string Region
        string Product
        float Sales
        float Profit
        float Discount
        int Quantity
    }
    People {
        string Region
        string Manager
    }
    Returns {
        string Order_ID
        string Returned
    }

Star schema = one central fact table (the transactions, the measurements) surrounded by smaller dimension tables (the context: who, where, what category).


The XLOOKUP you wrote in week 1 was a manual join between a fact table and a dimension table.
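The same join, sketched in pandas with toy data (the order IDs and manager names below are made up for illustration): a left join pulls the dimension-table column onto every fact-table row, exactly what the XLOOKUP did cell by cell.

```python
import pandas as pd

# Fact table (transactions) and dimension table (context), as in Superstore
orders = pd.DataFrame({
    "Order ID": ["A1", "A2", "A3"],
    "Region":   ["East", "West", "East"],
    "Sales":    [120.0, 85.5, 42.0],
})
people = pd.DataFrame({
    "Region":  ["East", "West"],
    "Manager": ["Ana", "Bo"],
})

# Left join on the shared key: every order keeps its row, gains its manager
joined = orders.merge(people, on="Region", how="left")
print(joined)
```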

Exercise: which shape is this?

A grade book

Student HW1 HW2 Midterm Final
Anna 85 92 78 88
Boris 91 87 94 90
Clara 73 80 85 82

Wide or long?

Wide. “Assignment” is a variable stuffed into column headers. To use this in Tableau, you’d unpivot HW1/HW2/Midterm/Final into an “Assignment” column and a “Score” column.
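The unpivot, sketched in pandas on the grade book above: melt moves the assignment names out of the headers into a column of their own.

```python
import pandas as pd

grades = pd.DataFrame({
    "Student": ["Anna", "Boris", "Clara"],
    "HW1":     [85, 91, 73],
    "HW2":     [92, 87, 80],
    "Midterm": [78, 94, 85],
    "Final":   [88, 90, 82],
})

# Unpivot: 3 wide rows become 12 long rows, one per (student, assignment)
tidy = grades.melt(id_vars="Student", var_name="Assignment", value_name="Score")
print(tidy.head())
```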

A bank transaction log

Date Account Amount Type
Mar 1 Checking -45.00 Purchase
Mar 1 Checking 2,500.00 Salary
Mar 2 Savings 500.00 Transfer

Wide or long?

Long (tidy). Each transaction is one row. Each variable is one column.

Why tools matter

What happens when spreadsheets aren’t enough.

Excel renames your genes

Genomics researchers work with gene names like SEPT1 (Septin 1), MARCH1 (Membrane Associated Ring-CH-Type Finger 1), DEC1 (Deleted in Esophageal Cancer 1).

Open a CSV with these names in Excel:

  • SEPT1 → September 1
  • MARCH1 → March 1
  • DEC1 → December 1

No warning. No error message. Silent data corruption.

Excel silently converts gene names to dates

20% of genomics papers had corrupted gene names

A 2016 study by Ziemann, Eren, and El-Osta screened 3,597 published papers with supplementary Excel files.


704 papers — roughly one in five — contained gene names that had been silently converted to dates or floating-point numbers by Excel.


In 2020, the HUGO Gene Nomenclature Committee (HGNC) gave up on fixing the software. They renamed 27 human genes instead.


SEPT1 → SEPTIN1. MARCH1 → MARCHF1.


They renamed parts of the human genome because they couldn’t get scientists to stop opening CSVs in Excel.

Invisible formulas, real consequences

In 2010, Harvard economists Reinhart and Rogoff published a paper arguing that countries with debt above 90% of GDP experience negative growth.

The paper was cited in Congressional hearings, European Parliament debates, and IMF policy documents to justify austerity measures.

Three years later, a graduate student found an averaging formula whose cell range excluded five rows: Australia, Austria, Belgium, Canada, and Denmark were missing from the average.

With all countries included, the result flipped: slow but positive growth.

Reinhart-Rogoff spreadsheet error

The corrected results

Reinhart-Rogoff: original vs. corrected GDP growth by debt level

The “above 90%” group changed the most: from −0.1% to +2.2% average growth. The formula error didn’t just nudge the number — it flipped the sign.

But tools aren’t the only problem

Even if your tool is perfect, the data itself arrives broken.

What problems can you spot?

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Duplicated rows

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Rows 1 and 3 match on every field except id. Is this a real duplicate, or two separate transactions that happen to match?

If you sum the amount column, you get an extra 1,200 — or you lose a real sale.
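A pipeline step can surface the suspects instead of hiding them. A sketch in pandas on a slice of the table above: comparing every field except id finds the match, but whether to drop it is a business decision, not a code decision.

```python
import pandas as pd

sales = pd.DataFrame({
    "id":     [1, 2, 3],
    "date":   ["2025-03-01", "2025-03-04", "2025-03-01"],
    "city":   ["Kyiv", "Kiev", "Kyiv"],
    "amount": [1200, 800, 1200],
})

# Ignore the id column: rows 1 and 3 agree on every substantive field
dupes = sales.duplicated(subset=["date", "city", "amount"])
print(sales[dupes])  # the suspect row(s), for a human to review

deduped = sales.drop_duplicates(subset=["date", "city", "amount"])
```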

Inconsistent dates

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Two formats in one column. Rows 1, 3, 4, 6 use YYYY-MM-DD. Rows 2 and 5 use MM/DD/YYYY — or is it DD/MM/YYYY?

Is row 5 January 4th or April 1st? Without knowing who entered it, you can’t tell.
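A parser makes that assumption explicit rather than silent. A sketch in pandas: the same string yields two different dates depending on the dayfirst flag you choose.

```python
import pandas as pd

ambiguous = "03/04/2025"

# The same string parses two ways; the flag is an assumption you must document
us_style = pd.to_datetime(ambiguous, dayfirst=False)  # month first: March 4
eu_style = pd.to_datetime(ambiguous, dayfirst=True)   # day first: April 3
print(us_style.date(), eu_style.date())
```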

Inconsistent spelling

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Four spellings of the same city: Kyiv, Kiev, kyiv, Київ.

A GROUP BY city gives you four groups instead of one. Every aggregate — total, average, count — is silently wrong.
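The standard fix is to normalize and then map every known variant to one canonical spelling. A sketch in pandas; the mapping table itself is a judgment call you must document, not something the code can infer.

```python
import pandas as pd

city = pd.Series(["Kyiv", "Kiev", "kyiv", "Київ"])

# Lowercase and trim first, so the mapping only needs one key per variant
canonical = {"kyiv": "Kyiv", "kiev": "Kyiv", "київ": "Kyiv"}
clean = city.str.strip().str.lower().map(canonical)

print(clean.nunique())  # one group instead of four
```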

Missing and disguised values

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 (blank) Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

Row 4: city is blank — was it not recorded, or is this an online order with no city?

Row 4: amount is 0 — is this a free item, or does 0 mean “not yet invoiced”?

Mixed types and formatting

id date city category amount notes
1 2025-03-01 Kyiv Electronics 1200
2 03/04/2025 Kiev electronics $800 returned
3 2025-03-01 Kyiv Electronics 1200
4 2025-03-12 Clothing 0
5 01/04/2025 kyiv Other 350
6 2025-03-20 Київ Electronics 2,100

$800 and 2,100 are text strings, not numbers. A SUM formula skips them silently.

Electronics vs. electronics — same category or different? Other — what does that even mean?
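Converting in a pipeline step fails loudly where a SUM fails silently. A sketch in pandas on the amount column above: strip the currency symbols and separators, then convert; pd.to_numeric raises on any leftover text by default instead of skipping it.

```python
import pandas as pd

amount = pd.Series(["1200", "$800", "1200", "0", "350", "2,100"])

# Remove "$" and thousands separators, then convert text to numbers
numeric = pd.to_numeric(amount.str.replace(r"[$,]", "", regex=True))
print(numeric.sum())
```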

Six rows, six problems

Problem Rows Risk
Duplicated rows 1, 3 Inflated totals
Inconsistent dates 2, 5 Wrong time series
Inconsistent spelling 1, 2, 5, 6 Broken grouping
Missing values 4 Wrong aggregates
Numbers as text 2, 6 Silent calculation errors
Meaningless categories 5 Lost information

This was 6 rows. Real datasets have thousands.

Fixing these problems one cell at a time doesn’t scale. You need a process that’s systematic and reproducible.

Pipelines instead of spreadsheets

A data preparation tool separates data from logic. Every transformation is an explicit, named step.

import
filter
rename
convert types
derive
export

If step 3 is wrong, you fix step 3; every step after it reruns automatically.


If a column is typed as text, it stays text. No silent date conversions.
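Pipeline thinking looks the same in code as in a visual tool. A minimal sketch in pandas (the column names and values are made up): each step is a named function, and fixing one step means rerunning the chain, not re-editing cells.

```python
import pandas as pd

def rename_columns(df):
    # Step: rename -- lowercase every column header
    return df.rename(columns=str.lower)

def convert_types(df):
    # Step: convert types -- text amounts become numbers, loudly if they can't
    return df.assign(amount=pd.to_numeric(df["amount"]))

def derive_total(df):
    # Step: derive -- a new column computed from existing ones
    return df.assign(total=df["amount"] * df["qty"])

raw = pd.DataFrame({"Amount": ["10", "20"], "Qty": [2, 3]})

# The pipeline: every transformation is an explicit, named, reorderable step
clean = (raw
         .pipe(rename_columns)
         .pipe(convert_types)
         .pipe(derive_total))
print(clean)
```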

Every BI platform has a data prep tool


Platform Data prep tool Cost
Power BI Power Query Included with Power BI
Tableau Tableau Prep Included with Tableau Creator (~$70/month)
Python pandas Free, but you write code
Browser Syto Free


The concept is the same in all of them: a visual pipeline where each step is an explicit transformation you can inspect and reorder.

Syto

  • Browser-based — no installation, works on any OS
  • Visual pipeline — each transformation is a named step you can reorder, edit, or remove
  • Exports clean CSV — which you then connect to Tableau


It’s free, it’s visual, and it teaches the same concepts you’d use in Power Query or Tableau Prep.

The skill transfers: once you understand pipeline thinking, switching tools is learning new syntax for the same operations.

Syto interface with pipeline steps

Live demo: cleaning data in Syto

Tableau’s data model

How Tableau understands your data — the layer underneath the charts.

Live demo: connecting data in Tableau

Dimensions vs. measures

[Screenshot: Tableau Data pane for billboard-long.csv — Dimensions section (artist, genre, track, date_entered) above the line, Measures section (rank, week) below the line]

Dimensions answer what

  • What artist?
  • What genre?
  • What date?

Drag to Rows → one row per unique value

Measures answer how much

  • What rank?
  • How many weeks?
  • What sales amount?

Drag to Columns → Tableau aggregates (sums, averages, counts)
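What Tableau does on that drag is, in effect, a group-and-aggregate. A sketch of the same operation in pandas with toy data: the dimension becomes the grouping key, the measure gets aggregated.

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["East", "West", "East"],  # dimension: answers "what"
    "sales":  [120.0, 85.5, 42.0],       # measure: answers "how much"
})

# Dimension on Rows, measure on Columns: one aggregated value per group
summary = orders.groupby("region")["sales"].sum()
print(summary)
```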

Discrete vs. continuous

Blue pill
Discrete — separate buckets
Tableau draws a header for each value
Good for: categories, individual years, distinct groups

Green pill
Continuous — an unbroken range
Tableau draws an axis
Good for: time series, numeric ranges, quantities

[Screenshot: Same date field shown two ways — top: discrete YEAR(date_entered) with separate column headers “1999” “2000”; bottom: continuous date_entered as a time axis flowing left to right]


The shortcut: blue = headers, green = axis. The same field can be either — you choose.

The Marks card: pre-attentive attributes in Tableau

[Screenshot: Tableau Marks card showing Color, Size, Shape, Detail, Tooltip]


Pre-attentive attributes — color, size, shape — are the channels the brain processes fastest.


In Tableau, these live on the Marks card. Drag a field to Color, Size, or Shape to encode data through that visual channel.

Live demo: the automatic pivot in action

What’s ahead in practice

Practice this week

P1 — Data wrangling in Syto

Take the Billboard Hot 100 dataset: 317 songs, 76 weeks of rankings, crammed into 79 columns. Unpivot the 76 week columns into rows, clean the results, export two files at two granularity levels.

This is literally Table 5 from Wickham’s tidy data paper.
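The same unpivot you will do in Syto, sketched in pandas on a two-column toy version of the file (the tracks and ranks below are invented; only the x…week column naming follows the real dataset):

```python
import pandas as pd

# Toy slice of billboard-raw.csv: week columns named like the real file
raw = pd.DataFrame({
    "track":     ["Song A", "Song B"],
    "x1st.week": [87, 91],
    "x2nd.week": [82, 87],
})

# Unpivot every week column into (track, week, rank) rows
week_cols = [c for c in raw.columns if c.endswith(".week")]
tidy = raw.melt(id_vars="track", value_vars=week_cols,
                var_name="week", value_name="rank")

# Pull the week number out of names like "x1st.week"
tidy["week"] = tidy["week"].str.extract(r"(\d+)", expand=False).astype(int)
print(tidy)
```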


P2 — First steps in Tableau

Connect both cleaned files. Build worksheets from each — line charts from the song-week data, bar charts and histograms from the song-level data.

[Screenshot: The raw billboard-raw.csv opened in a spreadsheet — showing the 79-column structure with x1st.week through x76th.week columns stretching off-screen to the right. The “this is untidy” moment.]

Three ideas to take with you


Data has shape. Wide and long are two layouts for the same information. Pivot goes long → wide. Unpivot goes wide → long.


Tableau pivots for you. Give it tidy data and it handles grouping, aggregation, and visual separation automatically.


Granularity determines meaning. The same drag-and-drop produces insight or garbage depending on whether the data level matches the question.