Real-World Column Name Variations Database

An authoritative dataset that maps diverse column name variations to standardized canonical names across five business domains. Use it as a translation dictionary for column names when integrating data from different systems.

Dataset Overview

This dataset serves as a "translation dictionary" for column names, enabling seamless data integration across different systems and platforms.

Data Integration

Harmonize disparate data sources by recognizing different column names that represent the same data.

Schema Normalization

Standardize column naming when merging datasets from multiple systems.

Data Governance

Provide an authoritative reference for data stewards and engineers.

ETL/ELT Development

Simplify data transformation pipelines by mapping variations to canonical names.
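
The use cases above all come down to one operation: given a column name as it appears in a source system, find its canonical name. The sketch below shows one way to do that with a reverse index. It assumes a domain is shaped like {canonical_name: [variation, ...]}, as the usage examples on this page suggest. SAMPLE_ECOMMERCE, normalize, and build_reverse_index are hypothetical names, and the sample entries are stand-ins rather than actual dataset content.

```python
import re

# Hypothetical stand-in for one domain of the dataset, in the shape
# {canonical_name: [variation, ...]}; the real entries may differ.
SAMPLE_ECOMMERCE = {
    "product_id": ["sku", "item_id", "product_code", "prod_id"],
    "price": ["cost", "unit_price"],
}

def normalize(name: str) -> str:
    """Lowercase a header and turn spaces, hyphens, and camelCase into snake_case."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    name = re.sub(r"[\s\-]+", "_", name)
    return name.lower()

def build_reverse_index(clusters: dict) -> dict:
    """Map every normalized variation and canonical name to its canonical name."""
    index = {}
    for canonical, variations in clusters.items():
        # A canonical name always maps to itself, even if it is also
        # listed as a variation somewhere else.
        index[normalize(canonical)] = canonical
        for variation in variations:
            # The first canonical name that lists a variation keeps it.
            index.setdefault(normalize(variation), canonical)
    return index

index = build_reverse_index(SAMPLE_ECOMMERCE)
print(index.get(normalize("Item ID")))    # product_id
print(index.get(normalize("unitPrice")))  # price
print(index.get(normalize("colour")))     # None
```

Normalizing before the lookup is a design choice, not something the dataset requires; it lets headers such as "Item ID" and "item_id" resolve to the same entry.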

Business Domains Covered

The dataset covers five business domains. Each entry below gives the number of canonical concepts, the kinds of systems covered, and sample canonical names.

E-commerce

14 Canonical Concepts

Online retail, marketplaces, inventory systems

product_id, sku, price, inventory

CRM

11 Canonical Concepts

Customer management, sales pipelines, marketing

customer_id, email, company, status

Financial

8 Canonical Concepts

Accounting, payments, banking, transactions

transaction_id, amount, account, balance

Logistics

10 Canonical Concepts

Shipping, delivery, supply chain, fulfillment

tracking_number, carrier, shipping_status

Healthcare

13 Canonical Concepts

EHR systems, medical records, clinical data

patient_id, medical_code, diagnosis_code

Usage Examples

Get started quickly with these practical examples in Python and on the command line.

Basic Python Usage

import json
from pathlib import Path

# Load the dataset
dataset_path = Path("master_column_clusters.v1.0.0.json")
with dataset_path.open(encoding="utf-8") as f:
    column_clusters = json.load(f)

# Quick lookup: Find all variations for a canonical concept
def get_variations(canonical_name: str, domain: str = "ecommerce") -> list:
    """Get all known variations for a canonical column name"""
    return column_clusters.get(domain, {}).get(canonical_name, [])

# Example usage
product_id_variations = get_variations("product_id", "ecommerce")
print(f"All names that mean 'product_id': {product_id_variations}")
# Output: All names that mean 'product_id': ['sku', 'item_id', 'product_code', 'prod_id', ...]
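
Going the other way, from a variation back to its canonical name, is only unambiguous if each variation is listed under a single canonical name within a domain. The sketch below is one way to check that. It assumes the loaded JSON has the shape {domain: {canonical_name: [variation, ...]}}; sample_clusters and find_ambiguous are hypothetical, and the sample entries are invented for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the loaded JSON, in the shape
# {domain: {canonical_name: [variation, ...]}}; real entries may differ.
sample_clusters = {
    "ecommerce": {
        "product_id": ["sku", "item_id", "id"],
        "order_id": ["order_number", "id"],
    },
}

def find_ambiguous(clusters: dict, domain: str) -> dict:
    """Return {variation: [canonical names]} for variations listed under several names."""
    owners = defaultdict(list)
    for canonical, variations in clusters.get(domain, {}).items():
        # Compare case-insensitively and ignore repeats within one list.
        for variation in {v.lower() for v in variations}:
            owners[variation].append(canonical)
    return {v: sorted(names) for v, names in owners.items() if len(names) > 1}

print(find_ambiguous(sample_clusters, "ecommerce"))
# {'id': ['order_id', 'product_id']}
```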

Command Line Interface (CLI) Example

# Quick query with jq
jq '.ecommerce.product_id' master_column_clusters.v1.0.0.json
# Returns: ["sku", "item_id", "product_code", ...]

# Count all variations
jq '[.[] | select(type=="object") | .[] | length] | add' master_column_clusters.v1.0.0.json
# Returns: 309
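
The same count can be done in Python. This is a minimal sketch that mirrors the jq filter above: it skips top-level entries that are not objects and sums the length of every variation list. sample_clusters is a hypothetical stand-in, and its "version" key is only an assumed example of a non-object entry; the real file may not contain one.

```python
# Hypothetical stand-in for the loaded JSON; the "version" key is an assumed
# example of a non-object top-level entry that the jq filter would skip.
sample_clusters = {
    "version": "1.0.0",
    "ecommerce": {"product_id": ["sku", "item_id"], "price": ["cost"]},
    "crm": {"customer_id": ["cust_id", "client_id"]},
}

def count_variations(clusters: dict) -> dict:
    """Count variations per domain, skipping top-level entries that are not objects."""
    return {
        domain: sum(len(variations) for variations in concepts.values())
        for domain, concepts in clusters.items()
        if isinstance(concepts, dict)
    }

per_domain = count_variations(sample_clusters)
print(per_domain)                # {'ecommerce': 3, 'crm': 2}
print(sum(per_domain.values()))  # 5
```

Keeping the per-domain breakdown makes it easier to see where the total comes from than a single sum does.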

Data Integration Example

import pandas as pd

# Before: mixed column names from different systems
df_shopify = pd.DataFrame(columns=['id', 'title', 'price', 'inventory'])
df_magento = pd.DataFrame(columns=['entity_id', 'name', 'cost', 'stock_qty'])

# After: unified canonical naming
canonical_columns = ['product_id', 'product_name', 'price', 'quantity']
df_shopify.columns = canonical_columns
df_magento.columns = canonical_columns
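
Rather than typing the canonical names by hand, the renamed header can be derived from the dataset. The sketch below assumes a domain shaped like {canonical_name: [variation, ...]}. sample_ecommerce and canonicalize_columns are hypothetical; the sample variation lists are chosen to match the example above and may not match the real file. Columns with no known mapping are left unchanged and reported.

```python
# Hypothetical stand-in for the "ecommerce" domain; the real variation
# lists may differ.
sample_ecommerce = {
    "product_id": ["id", "entity_id", "sku"],
    "product_name": ["title", "name"],
    "price": ["cost"],
    "quantity": ["inventory", "stock_qty"],
}

def canonicalize_columns(columns, clusters):
    """Return (renamed columns, unmapped columns) for one source's header."""
    lookup = {v.lower(): canonical for canonical, vs in clusters.items() for v in vs}
    # Canonical names map to themselves and take priority over variations.
    lookup.update({canonical.lower(): canonical for canonical in clusters})
    renamed, unmapped = [], []
    for column in columns:
        canonical = lookup.get(column.lower())
        if canonical is None:
            unmapped.append(column)  # report columns with no known mapping
        renamed.append(canonical or column)  # unknown columns stay unchanged
    return renamed, unmapped

shopify, _ = canonicalize_columns(["id", "title", "price", "inventory"], sample_ecommerce)
magento, missing = canonicalize_columns(
    ["entity_id", "name", "cost", "stock_qty", "created_at"], sample_ecommerce
)
print(shopify)  # ['product_id', 'product_name', 'price', 'quantity']
print(magento)  # ['product_id', 'product_name', 'price', 'quantity', 'created_at']
print(missing)  # ['created_at']
```

The renamed list can then be assigned to a DataFrame's columns attribute, as in the example above.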

License & Usage

This dataset is freely available under a permissive open license for both commercial and non-commercial use.

Creative Commons Attribution 4.0 International (CC BY 4.0)

✅ Commercial Use Allowed ✅ Modification Allowed ✅ Redistribution Allowed 📝 Attribution Required

You are free to share (copy and redistribute) and adapt (remix, transform, and build upon) the material for any purpose, even commercially, as long as you give appropriate credit.
