Real-World Column Name Variations Database
A reference dataset that maps real-world column name variations to standardized canonical names across multiple business domains. Use it to resolve naming mismatches when integrating data from different sources.
Dataset Overview
This dataset serves as a "translation dictionary" for column names, enabling seamless data integration across different systems and platforms.
Data Integration
Harmonize disparate data sources by recognizing different column names that represent the same data.
Schema Normalization
Standardize column naming when merging datasets from multiple systems.
Data Governance
Provide an authoritative reference for data stewards and engineers.
ETL/ELT Development
Simplify data transformation pipelines by mapping variations to canonical names.
Business Domains Covered
Our dataset covers column naming variations in five business domains.
E-commerce
Online retail, marketplaces, inventory systems
CRM
Customer management, sales pipelines, marketing
Financial
Accounting, payments, banking, transactions
Logistics
Shipping, delivery, supply chain, fulfillment
Healthcare
EHR systems, medical records, clinical data
Usage Examples
Get started with our dataset using these examples in Python and command-line tools.
Basic Python Usage
import json
from pathlib import Path

# Load the dataset
dataset_path = Path("master_column_clusters.v1.0.0.json")
with open(dataset_path, "r", encoding="utf-8") as f:
    column_clusters = json.load(f)

# Quick lookup: find all variations for a canonical concept
def get_variations(canonical_name: str, domain: str = "ecommerce") -> list:
    """Get all known variations for a canonical column name."""
    return column_clusters.get(domain, {}).get(canonical_name, [])

# Example usage
sku_variations = get_variations("product_id", "ecommerce")
print(f"All names that mean 'product_id': {sku_variations}")
# Output: All names that mean 'product_id': ['sku', 'item_id', 'product_code', 'prod_id', ...]
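The lookup above goes from a canonical name to its variations. Integration work usually needs the opposite direction, so here is a minimal sketch of a reverse index. It assumes the file has the shape implied by the lookup above (domain, then canonical name, then a list of variations). The names `sample_clusters`, `build_reverse_index`, and `to_canonical` are hypothetical helpers, not part of the dataset, and the case-insensitive matching is an assumption of this sketch.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory sample using the assumed shape:
# domain -> canonical name -> list of known variations.
sample_clusters: Dict[str, Dict[str, List[str]]] = {
    "ecommerce": {
        "product_id": ["sku", "item_id", "product_code", "prod_id"],
    }
}


def build_reverse_index(
    clusters: Dict[str, Dict[str, List[str]]], domain: str
) -> Dict[str, str]:
    """Map every lowercased variation (and each canonical name) to its canonical name."""
    index: Dict[str, str] = {}
    for canonical, variations in clusters.get(domain, {}).items():
        index[canonical.lower()] = canonical
        for variation in variations:
            # If a variation appears under two canonical names, the last one wins.
            index[variation.lower()] = canonical
    return index


def to_canonical(column: str, index: Dict[str, str]) -> Optional[str]:
    """Return the canonical name for a column, or None if it is unknown."""
    return index.get(column.strip().lower())


reverse_index = build_reverse_index(sample_clusters, "ecommerce")
print(to_canonical("SKU", reverse_index))          # product_id
print(to_canonical("unknown_col", reverse_index))  # None
```

Building the index once keeps each lookup to a single dictionary access, which matters when normalizing many columns across many files.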
Command Line Interface (CLI) Example
# Quick query with jq
jq '.ecommerce.product_id' master_column_clusters.v1.0.0.json
# Returns: ["sku", "item_id", "product_code", ...]
# Count all variations
jq '[.[] | select(type=="object") | .[] | length] | add' master_column_clusters.v1.0.0.json
# Returns: 309
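For readers without jq, the same count can be sketched in Python. This is an illustration only: it assumes every domain maps canonical names to lists, and it skips non-object top-level entries the way the `select(type=="object")` filter does. The `sample` data, including the `version` key and the `crm` entries, is hypothetical and not taken from the dataset.

```python
from typing import Any, Dict

# Hypothetical sample; the real file is assumed to follow the same shape.
sample: Dict[str, Any] = {
    "version": "1.0.0",  # hypothetical non-object entry, skipped by the count
    "ecommerce": {"product_id": ["sku", "item_id", "product_code", "prod_id"]},
    "crm": {"customer_id": ["cust_id", "client_id"]},
}


def count_variations(clusters: Dict[str, Any]) -> int:
    """Sum the lengths of all variation lists across all domain objects."""
    return sum(
        len(variations)
        for domain in clusters.values()
        if isinstance(domain, dict)  # mirrors select(type=="object")
        for variations in domain.values()
    )


print(count_variations(sample))  # 6
```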
Data Integration Example
# Before: Mixed column names from different systems
df_shopify.columns = ['id', 'title', 'price', 'inventory']
df_magento.columns = ['entity_id', 'name', 'cost', 'stock_qty']
# After: Unified canonical naming
df_shopify.columns = ['product_id', 'product_name', 'price', 'quantity']
df_magento.columns = ['product_id', 'product_name', 'price', 'quantity']
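One way to get from the "before" names to the "after" names is to apply a variation-to-canonical lookup to each column list. The sketch below uses plain Python lists so it runs without extra dependencies. The `variation_to_canonical` dictionary is a hand-written, hypothetical lookup that mirrors the example above; whether the dataset contains these exact mappings is an assumption.

```python
from typing import Dict, List

# Hypothetical lookup mirroring the before/after example above.
variation_to_canonical: Dict[str, str] = {
    "id": "product_id",
    "entity_id": "product_id",
    "title": "product_name",
    "name": "product_name",
    "cost": "price",
    "inventory": "quantity",
    "stock_qty": "quantity",
}


def normalize_columns(columns: List[str], lookup: Dict[str, str]) -> List[str]:
    """Replace known variations with canonical names; keep unknown names unchanged."""
    return [lookup.get(col.lower(), col) for col in columns]


shopify_columns = ["id", "title", "price", "inventory"]
magento_columns = ["entity_id", "name", "cost", "stock_qty"]

print(normalize_columns(shopify_columns, variation_to_canonical))
# ['product_id', 'product_name', 'price', 'quantity']
print(normalize_columns(magento_columns, variation_to_canonical))
# ['product_id', 'product_name', 'price', 'quantity']
```

With pandas, the same dictionary can be passed to `DataFrame.rename(columns=...)`, which leaves columns that are not in the mapping unchanged.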
License & Usage
This dataset is freely available under a permissive open license for both commercial and non-commercial use.
Creative Commons Attribution 4.0 International (CC BY 4.0)
Copyright (c) 2026 Jason "Soo Ji" Dano | Peper Cruz
You are free to share (copy and redistribute) and adapt (remix, transform, and build upon) the material for any purpose, even commercially, as long as you give appropriate credit.