Data Pipeline Architecture Guide

Design a scalable data pipeline architecture for analytics.

📊 Data & Analytics · Advanced · Data Engineer · ✓ Free

The Prompt

You are a data engineering architect. Design a data pipeline architecture for the company described below, addressing each numbered section in order.

Company: [COMPANY]
Data sources: [DESCRIBE: databases, APIs, SaaS tools, events]
Data volume: [GB/TB per day]
Team: [SIZE]
Current stack: [DESCRIBE or greenfield]
Use cases: [ANALYTICS/ML/REPORTING/ALL]

1. Architecture:
   - Pattern: ETL vs ELT vs streaming, recommendation with rationale
   - Layers: ingestion → storage → transformation → serving → consumption
   - Modern data stack vs custom

2. Ingestion:
   - Batch: Fivetran vs Airbyte vs custom connectors, comparison
   - Streaming: Kafka, Kinesis, Pub/Sub, and when to use each
   - API and database: REST, webhooks, CDC (change data capture)
   - File: S3 drops, SFTP

3. Storage:
   - Data warehouse: Snowflake, BigQuery, Redshift comparison
   - Data lake: S3/GCS organization, file and table formats (Parquet, Delta Lake)
   - Lakehouse: when and how

4. Transformation:
   - dbt: project structure, model layers (staging, intermediate, marts)
   - Testing: schema tests, data tests, freshness
   - Documentation: auto-generated docs

5. Orchestration: Airflow, Dagster, Prefect comparison
6. Quality: monitoring, alerting, SLAs, incident response
7. Governance: access control, PII handling, lineage, catalog
8. Cost Management: compute optimization, storage tiering, query optimization
9. Implementation Roadmap: 3-month phased plan

💡 Tip: Replace all [bracketed text] with your specific details before pasting into your AI model.
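
If you reuse the prompt often, the placeholder replacement described in the tip can be scripted. The following is a minimal sketch in Python, not part of the original prompt: the function name `fill_prompt`, the shortened `TEMPLATE` excerpt, and all example values are hypothetical. It assumes each placeholder sits on a single line and that you key your values by the exact text between the brackets.

```python
import re

# Hypothetical excerpt of the prompt above; paste the full prompt text here.
TEMPLATE = """Company: [COMPANY]
Data volume: [GB/TB per day]
Team: [SIZE]
Use cases: [ANALYTICS/ML/REPORTING/ALL]"""

# Matches any [bracketed text] placeholder that stays on one line.
PLACEHOLDER = re.compile(r"\[([^\[\]\n]+)\]")


def fill_prompt(template: str, values: dict) -> str:
    """Replace each [placeholder] with its value; fail if any is left unfilled."""
    missing = []

    def substitute(match):
        key = match.group(1)
        if key not in values:
            # Remember the gap and leave the placeholder untouched for now.
            missing.append(key)
            return match.group(0)
        return values[key]

    filled = PLACEHOLDER.sub(substitute, template)
    if missing:
        raise KeyError(f"Unfilled placeholders: {missing}")
    return filled


# Example values are made up for illustration only.
values = {
    "COMPANY": "Acme Corp",
    "GB/TB per day": "50 GB per day",
    "SIZE": "3 data engineers",
    "ANALYTICS/ML/REPORTING/ALL": "ANALYTICS",
}
print(fill_prompt(TEMPLATE, values))
```

Raising an error on unfilled placeholders is a deliberate choice here: it prevents a prompt that still contains literal [bracketed text] from being sent to the model.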