For a successful POC, the candidate should ideally be a mid-to-senior level Data Engineer (3-5+ years) with the following must-haves:
Technical Core
- Databricks Mastery: Expert-level knowledge of Delta Lake and the Medallion Architecture (Bronze/Silver/Gold layers).
- Apache Spark (PySpark/SQL): Ability to write optimized Spark code. For the upcoming POC, Python is usually preferred over Scala/R for its flexibility and ecosystem.
- AWS Infrastructure: Deep understanding of S3 (bucket policies/storage classes) and IAM (roles/policies) for secure Databricks access; VPC/networking knowledge is good to have.
- Data Ingestion: Experience with Databricks Auto Loader for incremental ingestion, plus Unity Catalog for managed data governance.
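To make the Auto Loader expectation concrete, here is a minimal sketch of incremental bronze-layer ingestion from S3. All paths, table names, and the JSON source format are illustrative assumptions, not requirements from this document:

```python
# Sketch: incremental bronze ingestion with Databricks Auto Loader (cloudFiles).
# Paths and table names are hypothetical placeholders.

def autoloader_options(source_format: str, schema_location: str) -> dict:
    """Build the Auto Loader reader options."""
    return {
        "cloudFiles.format": source_format,
        # Auto Loader persists the inferred schema (and its evolution) here.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
    }

def ingest_bronze(spark, source_path: str, checkpoint: str, target_table: str):
    """Stream newly arrived files from S3 into a bronze Delta table."""
    opts = autoloader_options("json", f"{checkpoint}/_schema")
    return (
        spark.readStream.format("cloudFiles")
        .options(**opts)
        .load(source_path)                       # e.g. s3://landing-bucket/events/
        .writeStream.option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)              # process the backlog, then stop: POC-friendly
        .toTable(target_table)                   # e.g. bronze.events_raw
    )
```

The `availableNow` trigger runs the stream as a self-terminating batch, which keeps POC costs predictable while preserving Auto Loader's exactly-once file tracking.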
POC-Specific Skills
- Prototyping Speed: The ability to set up a working end-to-end pipeline (Source → S3 → Databricks → OOTB BI Tool) in weeks, not months.
- Cost Management: Knowledge of how to configure Databricks Clusters (Autoscaling, Spot Instances) to prevent the POC from blowing your AWS budget.
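As a sketch of the cost-management point, the cluster spec below targets the Databricks Clusters API (`clusters/create`) with autoscaling, auto-termination, and spot workers. Instance type, runtime version, and sizing are assumed values for illustration:

```python
# Sketch of a cost-conscious POC cluster spec for the Databricks Clusters API.
# All sizing values are illustrative assumptions.

def poc_cluster_spec() -> dict:
    return {
        "cluster_name": "poc-etl",
        "spark_version": "15.4.x-scala2.12",            # assumed LTS runtime
        "node_type_id": "m5.xlarge",                    # assumed instance type
        "autoscale": {"min_workers": 1, "max_workers": 4},  # shrink when idle
        "autotermination_minutes": 30,                  # auto-kill idle clusters
        "aws_attributes": {
            "first_on_demand": 1,                       # keep the driver on-demand
            "availability": "SPOT_WITH_FALLBACK",       # spot workers, fall back if reclaimed
            "spot_bid_price_percent": 100,
        },
    }
```

`first_on_demand: 1` protects the driver from spot reclamation, while `SPOT_WITH_FALLBACK` lets workers fail over to on-demand instances rather than stalling the pipeline.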
Job Description
Focus: Hands-on ETL/ELT, connecting various data sources, and setting up the platform with technical leadership.
- Role Summary:
- We are seeking a hands-on Data Engineer to spearhead our Databricks POC on AWS. You will be responsible for the initial environment setup, security configuration, and designing the framework for our future data platform.
- You will connect diverse AWS and external data sources into a unified Databricks environment.
- Key Responsibilities:
- Configure Databricks workspace integration with AWS (S3, IAM, VPC).
- Cleanse and transform raw data from S3, RDS, and APIs into Delta tables.
- Design and implement a scalable Medallion Architecture using Delta Lake.
- Build automated ingestion pipelines using Databricks Autoloader.
- Optimize Spark jobs for performance and reliability.
- Establish data governance standards using Unity Catalog. (Good to have)
- Evaluate POC success metrics (performance, cost, ease of use).
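The Medallion and governance responsibilities above can be sketched as a bronze-to-silver promotion plus a Unity Catalog grant. The catalog name, `event_id` business key, and `analysts` group are hypothetical, assuming a standard three-level namespace (catalog.schema.table):

```python
# Sketch: bronze -> silver promotion in the Medallion flow, with a Unity
# Catalog read grant. Catalog, key column, and group names are assumptions.

def uc_name(catalog: str, layer: str, table: str) -> str:
    """Fully qualified Unity Catalog table name, e.g. poc.silver.events."""
    return f"{catalog}.{layer}.{table}"

def promote_to_silver(spark, catalog: str = "poc"):
    # Lazy import so the pure helper above is usable without a Spark install.
    from pyspark.sql import functions as F

    bronze = spark.read.table(uc_name(catalog, "bronze", "events_raw"))
    silver = (
        bronze.dropDuplicates(["event_id"])              # assumed business key
              .filter(F.col("event_id").isNotNull())     # basic quality gate
              .withColumn("processed_at", F.current_timestamp())
    )
    silver.write.format("delta").mode("overwrite") \
          .saveAsTable(uc_name(catalog, "silver", "events"))

    # Governance: read-only access for a consumer group via Unity Catalog.
    spark.sql(
        f"GRANT SELECT ON TABLE {uc_name(catalog, 'silver', 'events')} TO `analysts`"
    )
```

Keeping the cleansing rules in one function per layer makes the POC's transformations easy to review against the success metrics (performance, cost, ease of use).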
- Requirements: 3-5+ years in Data Engineering with strong PySpark/SQL experience; AWS Glue or EMR experience is a plus. Databricks Certified Data Engineer Professional preferred.