Projects

Data ingestion modernization

AWS
Airflow
Meltano
Python
SQL
Snowflake

Best practices for extract-load pipelines and orchestration.

Project illustration.

The problem

The data platform had a small number of extractions implemented directly in Python, with no shared standard for execution, versioning, or reuse. There was no orchestrator (for example, Apache Airflow), so it was hard to schedule runs, handle retries, track dependencies between steps, or observe failures. Everything ran manually. Maintenance was expensive: every change meant navigating loose scripts, which cost time and slowed onboarding.

In addition, credentials and sensitive parameters were often exposed in repositories or local configuration, increasing leakage risk.


The solution

The solution was split across two complementary repositories and covered three areas:

  1. Declarative EL (extract-load) layer with Meltano: pipelines are centralized in project files with environments (dev/prod) and Singer connectors, including custom implementations where no ready-made tap existed for a source. Loads land in Snowflake consistently.

  2. Orchestration with Apache Airflow in the cloud: DAGs trigger jobs (for example, Kubernetes pods with a versioned container image), with retries, email alerts, and execution visibility in the Airflow UI.

  3. Secret hygiene: integration with AWS Secrets Manager as the Airflow secrets/variables backend (prefixes per environment), replacing the previous model where secrets were stored in plain text in the code.
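
To make the per-environment prefixes in item 3 concrete, here is a minimal sketch of how secret names could be laid out. The backend class path, the `connections_prefix`, `variables_prefix`, and `sep` options, and the two `AIRFLOW__SECRETS__*` environment variables belong to Airflow's Amazon provider. The `airflow/<env>/...` layout and the variable name `snowflake_password` are assumptions made for illustration, not the project's actual naming.

```python
import json

# Class path of the Secrets Manager backend in Airflow's Amazon provider.
BACKEND = "airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend"


def backend_kwargs(environment: str) -> dict:
    """Per-environment prefixes (the 'airflow/<env>/...' layout is an assumption)."""
    return {
        "connections_prefix": f"airflow/{environment}/connections",
        "variables_prefix": f"airflow/{environment}/variables",
        "sep": "/",
    }


def secret_id(prefix: str, key: str, sep: str = "/") -> str:
    """Name the backend looks up for a key: prefix, separator, then key."""
    return f"{prefix}{sep}{key}"


def airflow_env(environment: str) -> dict:
    """Environment variables that point Airflow at the backend."""
    return {
        "AIRFLOW__SECRETS__BACKEND": BACKEND,
        "AIRFLOW__SECRETS__BACKEND_KWARGS": json.dumps(backend_kwargs(environment)),
    }


kwargs = backend_kwargs("prod")
# "snowflake_password" is a hypothetical variable name.
print(secret_id(kwargs["variables_prefix"], "snowflake_password", kwargs["sep"]))
```

With a layout like this, a DAG asks for a variable by its short key and the backend resolves it under the prefix for the current environment, so no credential value needs to appear in the repository.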

Order of magnitude (code structure): dozens of Python DAG files, ten Meltano pipeline definitions in YAML, and several custom taps/plugins.
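
For readers unfamiliar with custom taps, the sketch below shows the Singer message flow a tap produces: a SCHEMA message, one RECORD per row, and a STATE bookmark, each written as one JSON line on stdout. It uses only the standard library to show the protocol itself. The project's taps may well be built differently (for example with an SDK), and the stream name, rows, and bookmark layout here are hypothetical.

```python
import json
import sys

# Hypothetical stream and rows; a real tap would read them from the source API.
STREAM = "orders"
ROWS = [{"id": 1, "status": "paid"}, {"id": 2, "status": "open"}]

SCHEMA = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "status": {"type": "string"}},
}


def singer_messages(stream, schema, rows):
    """Yield SCHEMA, then one RECORD per row, then a STATE bookmark."""
    yield {"type": "SCHEMA", "stream": stream, "schema": schema, "key_properties": ["id"]}
    last_id = None
    for row in rows:
        yield {"type": "RECORD", "stream": stream, "record": row}
        last_id = row["id"]
    # The bookmark layout inside "value" is up to the tap; this one is illustrative.
    yield {"type": "STATE", "value": {"bookmarks": {stream: {"last_id": last_id}}}}


def write_messages(messages, out=sys.stdout):
    """Write one JSON message per line, as Singer targets expect on stdin."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


write_messages(singer_messages(STREAM, SCHEMA, ROWS))
```

A loader such as a Snowflake target reads these lines from stdin, which is what lets a custom tap plug into the same declarative pipelines as off-the-shelf connectors.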


Project structure

Folder                     Role
DAGs / Airflow repository  Orchestration, scheduling, Secrets Manager integration, and Kubernetes execution.
Meltano repository         Declarative EL pipelines, plugins, and custom taps.
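
As a rough illustration of how the two repositories meet, the sketch below builds the description of one pod task: a versioned image from the Meltano side, run with a retry and alert policy from the Airflow side. It is plain Python rather than a real DAG so it runs without Airflow installed. The image name, tap and loader names, email address, and retry settings are all hypothetical. The keys `retries`, `retry_delay`, `email`, and `email_on_failure` are standard Airflow operator arguments, and `image`, `cmds`, and `arguments` mirror parameters of Airflow's KubernetesPodOperator.

```python
from datetime import timedelta

# Hypothetical values throughout: the image, tap/loader names, address, and
# retry settings are illustrative assumptions, not taken from the project.
IMAGE = "registry.example.com/meltano-el:1.0.0"


def meltano_run_args(environment: str, extractor: str, loader: str) -> list:
    """Build the container arguments for one extract-load run."""
    if environment not in {"dev", "prod"}:
        raise ValueError(f"unknown environment: {environment!r}")
    return [f"--environment={environment}", "run", extractor, loader]


# Retry and alert policy, in the shape of a DAG's default_args.
DEFAULT_ARGS = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email": ["data-alerts@example.com"],
    "email_on_failure": True,
}


def pod_task_spec(task_id: str, environment: str, extractor: str, loader: str) -> dict:
    """Describe one pod task: what to run, in which image, with which policy."""
    return {
        "task_id": task_id,
        "image": IMAGE,
        "cmds": ["meltano"],
        "arguments": meltano_run_args(environment, extractor, loader),
        **DEFAULT_ARGS,
    }


spec = pod_task_spec("el_orders", "prod", "tap-orders", "target-snowflake")
print(" ".join(spec["cmds"] + spec["arguments"]))
```

Keeping the command construction in one small function means each DAG differs only in the extractor, loader, and environment it names, which is one way to keep many DAG files uniform.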

Results

After the migration, scheduled runs reached a success rate of roughly 100% in the monitoring windows. The platform now runs dozens of DAGs and ten declarative EL pipelines, plus several custom connectors for sources without off-the-shelf integration. Maintenance became simpler and more predictable: changes are concentrated in project definitions and orchestration, and scheduling, retries, and notifications are automated. On the security side, sensitive variables used by DAGs now come from AWS Secrets Manager, which removed the credentials that previously lived in the repository.

Metric                      Description
Scheduled run success rate  Roughly 100% after stabilization.
Automation volume           Example: 15+ orchestration flows in production, 10+ declarative EL jobs, 5+ custom connectors.
Secret exposure risk        100% of credentials used by DAGs sourced from a managed vault (AWS Secrets Manager), versus plain-text variables before.
Maintenance                 Changes are localized in one place, reducing time spent per change.
Observability               Alerts on task failure, history, and retries centralized in Airflow.