1. Full Load – Truncate & Reload
2. Incremental Load – Append or Upsert
3. Differential Load – Snapshot Comparison
4. Change Data Capture (CDC) – Log or Trigger Based

1. Full Load – Truncate & Reload
Definition:
- In this approach, the target table is completely emptied (usually via
TRUNCATE) and then fully reloaded with the current source data.
How It Works:
- Delete all records in the data warehouse table.
- Read the full dataset from the source.
- Load it into the warehouse table from scratch.
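The three steps above can be sketched with Python's built-in `sqlite3` module; the in-memory databases and the `customers` table are hypothetical stand-ins for a real source system and warehouse:

```python
import sqlite3

# In-memory databases standing in for a hypothetical source and warehouse.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
dwh.execute("INSERT INTO customers VALUES (9, 'Stale')")  # leftover from last load

# Step 1: empty the warehouse table (SQLite has no TRUNCATE; DELETE is equivalent).
dwh.execute("DELETE FROM customers")
# Step 2: read the full dataset from the source.
rows = src.execute("SELECT id, name FROM customers").fetchall()
# Step 3: load it into the warehouse table from scratch.
dwh.executemany("INSERT INTO customers VALUES (?, ?)", rows)
dwh.commit()
```

In a real warehouse you would use `TRUNCATE TABLE` instead of `DELETE`, since it is typically faster and avoids per-row logging.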
When to Use:
- Data volume is relatively small.
- No reliable method exists to detect changes.
- Business requires full refresh due to compliance or data integrity.
Pros:
- Simplicity: Easy to implement.
- Consistency: After each run, the target exactly mirrors the source.
Cons:
- Inefficient for large datasets: Wastes resources by reloading unchanged data.
- Downtime: Table is unavailable during reload.
- Heavy I/O and compute usage.
2. Incremental Load – Append or Upsert
Definition:
- Only the new or modified records since the last load are processed and written to the warehouse.
Types:
- Append: Only new records are added.
- Upsert: Existing records are updated, and new ones are inserted.
How It Works:
- Track changes using mechanisms such as:
- A last_updated timestamp column.
- Surrogate keys (monotonically increasing IDs).
- Audit columns in the source system.
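A minimal upsert sketch using a last_updated watermark, again with `sqlite3` standing in for both systems (the `orders` table, its columns, and the watermark value are all illustrative assumptions):

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, last_updated TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "shipped", "2024-01-02"),  # modified after the last load
    (2, "new",     "2024-01-03"),  # created after the last load
    (3, "old",     "2023-12-01"),  # unchanged; should be skipped
])

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, last_updated TEXT)")
dwh.execute("INSERT INTO orders VALUES (1, 'pending', '2023-12-31')")
dwh.execute("INSERT INTO orders VALUES (3, 'old', '2023-12-01')")

watermark = "2024-01-01"  # high-water mark persisted by the previous run

# Pull only the delta: rows modified after the watermark.
delta = src.execute(
    "SELECT id, status, last_updated FROM orders WHERE last_updated > ?",
    (watermark,),
).fetchall()

# Upsert: insert new rows, update rows whose key already exists.
dwh.executemany(
    """INSERT INTO orders (id, status, last_updated) VALUES (?, ?, ?)
       ON CONFLICT(id) DO UPDATE SET
         status = excluded.status, last_updated = excluded.last_updated""",
    delta,
)
dwh.commit()
```

After the run, the new watermark would be the maximum last_updated seen in the delta, saved for the next load. Note that a pure append pipeline would drop the `ON CONFLICT` clause entirely.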
When to Use:
- Large datasets with few changes per load.
- High frequency data refresh required.
Pros:
- Efficient and fast: Only processes deltas.
- Scalable: Supports large, growing datasets.
Cons:
- Change detection must be accurate.
- Complex logic for deduplication and conflict resolution.
3. Differential Load – Delta Based on Snapshot Comparison
Definition:
- Compares current source snapshot with a previously stored snapshot to detect inserts, updates, and deletes.
How It Works:
- Extract a full snapshot from the source system.
- Store it and compare it row-by-row with the previous snapshot.
- Identify:
- Inserts: Rows in current snapshot but not in the previous one.
- Updates: Rows with matching keys but different values.
- Deletes: Rows in previous snapshot but not in the current one.
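The comparison logic reduces to set operations on the primary keys of the two snapshots. A small sketch with made-up snapshot data, keyed by primary key:

```python
# Snapshots keyed by primary key; values are the remaining columns.
previous = {1: ("Ada", "UK"), 2: ("Grace", "US"), 3: ("Alan", "UK")}
current  = {1: ("Ada", "FR"), 2: ("Grace", "US"), 4: ("Edsger", "NL")}

# Inserts: keys in the current snapshot but not in the previous one.
inserts = {k: current[k] for k in current.keys() - previous.keys()}
# Deletes: keys in the previous snapshot but not in the current one.
deletes = {k: previous[k] for k in previous.keys() - current.keys()}
# Updates: keys present in both snapshots whose values differ.
updates = {k: current[k] for k in current.keys() & previous.keys()
           if current[k] != previous[k]}

print(inserts)  # {4: ('Edsger', 'NL')}
print(deletes)  # {3: ('Alan', 'UK')}
print(updates)  # {1: ('Ada', 'FR')}
```

At scale this row-by-row comparison is usually pushed into the database as a `FULL OUTER JOIN` on the key columns, often with row hashes to cheapen the value comparison.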
When to Use:
- No last_updated column or CDC mechanism is available.
- Source system doesn't support incremental logic.
Pros:
- Independent of source system features.
- Flexible: Works even with legacy systems.
Cons:
- High compute due to comparison logic.
- Additional storage needed for snapshots.
- Latency: Cannot support real-time loads.
4. Change Data Capture (CDC) – Log-based or Trigger-based
Definition:
- CDC captures row-level changes (INSERT, UPDATE, DELETE) from database logs or triggers, and replicates them to the target.
Types:
- Log-based CDC: Reads changes from database transaction logs (e.g., MySQL binlog, SQL Server CDC, Oracle redo logs).
- Trigger-based CDC: Uses database triggers to capture changes and write them to tracking tables.
How It Works:
- The CDC process continuously reads the change log.
- Translates changes into actionable operations.
- Applies changes to the data warehouse with minimal latency.
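The apply step can be sketched as a small replay loop. The event shape below is a simplified assumption of what a log reader (e.g. a binlog consumer) might emit; real CDC tools each define their own change-event format:

```python
import sqlite3

# Hypothetical change events, in the order they appeared in the transaction log.
change_log = [
    {"op": "INSERT", "id": 1, "name": "Ada"},
    {"op": "INSERT", "id": 2, "name": "Grace"},
    {"op": "UPDATE", "id": 1, "name": "Ada L."},
    {"op": "DELETE", "id": 2},
]

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")

# Translate each change event into the matching warehouse operation,
# replaying them in log order so the target converges on the source state.
for ev in change_log:
    if ev["op"] == "INSERT":
        dwh.execute("INSERT INTO people VALUES (?, ?)", (ev["id"], ev["name"]))
    elif ev["op"] == "UPDATE":
        dwh.execute("UPDATE people SET name = ? WHERE id = ?", (ev["name"], ev["id"]))
    elif ev["op"] == "DELETE":
        dwh.execute("DELETE FROM people WHERE id = ?", (ev["id"],))
dwh.commit()

print(dwh.execute("SELECT id, name FROM people").fetchall())
```

Applying events strictly in log order matters: replaying the UPDATE before its INSERT, or out of commit order, would leave the target inconsistent with the source.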
When to Use:
- Real-time or near real-time use cases.
- Modern data pipelines (streaming ETL/ELT).
- Systems with high data change velocity.
Pros:
- Real-time updates: Keeps target highly synchronized.
- Accurate: Captures precise changes.
Cons:
- Complex setup: Requires deep integration with source DB.
- Permissions and performance: May require elevated privileges and impact source DB performance.
- Not universally supported: Some databases don’t expose log access.
Comparison Table:
| Model | Data Volume | Complexity | Performance | Real-Time | Change Tracking Needed? |
|---|---|---|---|---|---|
| Full Load | Small | Low | Low | No | No |
| Incremental Load | Medium–High | Medium | High | No | Yes (e.g., timestamps) |
| Differential Load | Medium–High | High | Medium | No | No (needs snapshots) |
| CDC | High | High | Very High | Yes | Yes (via DB logs) |