Design, build, and maintain high-performance data pipelines that power analytics and machine-learning products. Collaborate with data scientists, product, and infrastructure teams to turn raw data into scalable, reliable data assets.
Key Responsibilities
- Architect end-to-end batch and Spark Streaming pipelines on cloud platforms (AWS/GCP/Azure); see the sketch after this list.
- Implement ML feature pipelines and real-time inference services.
- Optimize petabyte-scale processing with Spark, Kafka, and Flink.
- Build and maintain data warehouses/lakes (Redshift, BigQuery, Snowflake).
- Enforce data quality, governance, and security.
- Develop CI/CD, monitoring, and alerting for pipelines.
- Mentor engineers and drive best-practice documentation.
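
To illustrate the kind of streaming pipeline this role builds, below is a minimal sketch of a Kafka-to-Parquet job using Spark Structured Streaming. The broker address, topic name, and storage paths are hypothetical placeholders, and the job assumes the Spark Kafka connector package is available.

```python
# Minimal Spark Structured Streaming sketch: Kafka source -> partitioned
# Parquet sink. Requires the spark-sql-kafka connector on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-ingest").getOrCreate()

# Read raw events from Kafka (hypothetical broker and topic).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Decode the message payload and derive a partition date.
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .withColumn("ingest_date", F.to_date("timestamp"))
)

# Write to a partitioned Parquet sink; the checkpoint location gives the
# file sink exactly-once output semantics (paths are placeholders).
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/events/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .partitionBy("ingest_date")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```

The same skeleton extends to the responsibilities above: swap the Parquet sink for a warehouse or lake table, and layer data-quality checks and alerting around the query's progress metrics.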
Required Experience
- 5+ years of production-grade data engineering.
- Deep expertise with Apache Spark (batch & streaming), Kafka, and distributed processing.
- Hands-on ML pipeline experience (feature engineering, model training, deployment).
- Experience with cloud data platforms and warehousing.
- Strong SQL plus Python/Scala; familiarity with Airflow, dbt, or similar tooling (a minimal DAG sketch follows this list).
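
As a concrete reference point for the orchestration tooling above, here is a minimal Airflow DAG sketch using the TaskFlow API. The DAG name, schedule, task bodies, and paths are hypothetical.

```python
# Minimal Airflow TaskFlow sketch: a daily extract -> transform pipeline.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_events_pipeline():
    @task
    def extract() -> str:
        # In a real pipeline this would land raw data and return its path.
        return "s3://example-bucket/raw/events/"

    @task
    def transform(raw_path: str) -> None:
        # Placeholder for a Spark job or dbt run keyed off raw_path.
        print(f"transforming {raw_path}")

    transform(extract())


example_events_pipeline()
```

In practice the transform step would trigger the Spark or dbt work described under Key Responsibilities, with retries, SLAs, and alerting configured on the DAG.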