Long Format vs Wide Format in Data Analysis¶

Notes

In data analysis, the structure of your dataset significantly impacts its usability for different tasks. Two common formats

Long Format (also called "tidy" or "normalized")
Wide Format (also called "cross-tab" or "spread")

Let’s explore both using the same example and explain how to switch between them.

1. Example Dataset¶

We have daily mobility data at the census tract level for a few metrics:

Raw Table (Long Format)

date	census_tract	median_non_home_dwell_time	distance_traveled_from_home	non_home_ratio
2020-01-01	6073000100	191.5	15319	0.788
2020-01-01	6073000201	9.0	7931	0.510
2020-01-02	6073000100	180.0	14500	0.760
2020-01-02	6073000201	10.0	8000	0.520

2. Format Comparison¶

2.1 Long Format¶

Structure:

date	census_tract	indicator	value
2020-01-01	6073000100	median_non_home_dwell_time	191.5
2020-01-01	6073000100	distance_traveled_from_home	15319
2020-01-01	6073000100	non_home_ratio	0.788

2.2 Advantages:¶

More flexible for grouping, filtering, and time series analysis.

Compatible with most statistical modeling and plotting tools (e.g., seaborn, ggplot, plotly).

Ideal for faceting and small multiples in visualization.

2.3 Disadvantages:¶

Can become long and hard to read manually.

Needs transformation for certain types of reports or machine learning inputs.

2.4 Wide Format¶

Structure:

census_tract	median_dwell_2020-01-01	median_dwell_2020-01-02	dist_2020-01-01	dist_2020-01-02	ratio_2020-01-01	ratio_2020-01-02
6073000100	191.5	180.0	15319	14500	0.788	0.760
6073000201	9.0	10.0	7931	8000	0.510	0.520

2.5 Advantages:¶

Easier to read for reporting or spreadsheet-style inspection.

Suitable for machine learning algorithms that require one feature per column.

Facilitates quick correlation or matrix-style analyses.

2.6 Disadvantages:¶

Becomes unmanageable if there are many dates or variables (e.g., 1000+ columns).

Difficult to reshape or filter by variable type or time.

May exceed column limits in GIS software or spreadsheets.

3. How to Convert Between Formats¶

➤ Long → Wide Using pivot():

df_long.pivot(index='census_tract', columns=['indicator', 'date'], values='value')

Flatten the multi-index:

df_wide.columns = [f'{var}_{date.date()}' for var, date in df_wide.columns]
df_wide.reset_index(inplace=True)

➤ Wide → Long Using melt():

df_long = df_wide.melt(
    id_vars=['census_tract'],
    var_name='variable_date',
    value_name='value'
)

# Optionally split the variable and date
df_long[['indicator', 'date']] = df_long['variable_date'].str.extract(r'(.+)_(\d{4}-\d{2}-\d{2})')
df_long['date'] = pd.to_datetime(df_long['date'])
df_long = df_long.drop(columns=['variable_date'])

4. When to Use Which?¶

Task Type	Recommended Format
Time series analysis	Long
Plotting with time axis	Long
Machine learning input	Wide
Manual inspection/report	Wide
GIS symbolization (1 date)	Wide
Interactive dashboards	Long (recommended for filtering)