Skip to content

Long Format vs Wide Format in Data Analysis

Notes

In data analysis, the structure of your dataset significantly impacts its usability for different tasks. Two common formats

  • Long Format (also called "tidy" or "normalized")

  • Wide Format (also called "cross-tab" or "spread")

Let’s explore both using the same example and explain how to switch between them.

1. Example Dataset

We have daily mobility data at the census tract level for a few metrics:

Raw Table (Long Format)

date census_tract median_non_home_dwell_time distance_traveled_from_home non_home_ratio
2020-01-01 6073000100 191.5 15319 0.788
2020-01-01 6073000201 9.0 7931 0.510
2020-01-02 6073000100 180.0 14500 0.760
2020-01-02 6073000201 10.0 8000 0.520

2. Format Comparison

2.1 Long Format

Structure:

date census_tract indicator value
2020-01-01 6073000100 median_non_home_dwell_time 191.5
2020-01-01 6073000100 distance_traveled_from_home 15319
2020-01-01 6073000100 non_home_ratio 0.788

2.2 Advantages:

More flexible for grouping, filtering, and time series analysis.

Compatible with most statistical modeling and plotting tools (e.g., seaborn, ggplot, plotly).

Ideal for faceting and small multiples in visualization.

2.3 Disadvantages:

Can become long and hard to read manually.

Needs transformation for certain types of reports or machine learning inputs.

2.4 Wide Format

Structure:

census_tract median_dwell_2020-01-01 median_dwell_2020-01-02 dist_2020-01-01 dist_2020-01-02 ratio_2020-01-01 ratio_2020-01-02
6073000100 191.5 180.0 15319 14500 0.788 0.760
6073000201 9.0 10.0 7931 8000 0.510 0.520

2.5 Advantages:

Easier to read for reporting or spreadsheet-style inspection.

Suitable for machine learning algorithms that require one feature per column.

Facilitates quick correlation or matrix-style analyses.

2.6 Disadvantages:

Becomes unmanageable if there are many dates or variables (e.g., 1000+ columns).

Difficult to reshape or filter by variable type or time.

May exceed column limits in GIS software or spreadsheets.

3. How to Convert Between Formats

➤ Long → Wide Using pivot():

df_long.pivot(index='census_tract', columns=['indicator', 'date'], values='value')
Flatten the multi-index:
df_wide.columns = [f'{var}_{date.date()}' for var, date in df_wide.columns]
df_wide.reset_index(inplace=True)
➤ Wide → Long Using melt():
df_long = df_wide.melt(
    id_vars=['census_tract'],
    var_name='variable_date',
    value_name='value'
)

# Optionally split the variable and date
df_long[['indicator', 'date']] = df_long['variable_date'].str.extract(r'(.+)_(\d{4}-\d{2}-\d{2})')
df_long['date'] = pd.to_datetime(df_long['date'])
df_long = df_long.drop(columns=['variable_date'])

4. When to Use Which?

Task Type Recommended Format
Time series analysis Long
Plotting with time axis Long
Machine learning input Wide
Manual inspection/report Wide
GIS symbolization (1 date) Wide
Interactive dashboards Long (recommended for filtering)