Description

Results-oriented Data Engineer with 6+ years of experience building scalable ETL/ELT pipelines, real-time streaming systems, and cloud data platforms. Strong in Python, Spark (PySpark/Scala), Kafka, dbt, and Airflow, with expertise across AWS, Azure, Databricks, and Microsoft Fabric. Experienced in designing data lakehouse architectures, optimizing large-scale datasets, and implementing data modeling and performance tuning. Skilled in processing datasets of millions of records, CI/CD, and Infrastructure as Code (Terraform, CloudFormation), with strong experience in healthcare and enterprise data environments focused on reliable, high-quality analytics.

TECHNICAL SKILLS

____________________________________

Programming Languages: Python, SQL, Scala, Java, Bash, Shell Scripting.

Data Integration & ETL: dbt, Apache Airflow, Informatica, Azure Data Factory, AWS Glue, Fivetran, Great Expectations.

Cloud Platforms: AWS (S3, Glue, Redshift, Lambda, EMR, RDS, MSK, Lake Formation), Azure (Data Factory, Synapse Analytics, Data Lake, Key Vault, AKS, Purview), Databricks.

Big Data & Streaming: Apache Spark (Scala/PySpark), Spark Streaming, Hadoop, Kafka, Flink.

Databases & Storage: PostgreSQL, MySQL, MongoDB, Cassandra, Oracle, Snowflake, Redshift, Synapse Analytics.

File Formats: Parquet, JSON, Avro, ORC, CSV, XML.

DevOps & CI/CD: Jenkins, GitHub Actions, GitLab, AWS CloudFormation, Docker, Kubernetes (EKS/AKS), Prometheus, Datadog.

BI & Visualization: Power BI, Tableau, Looker, Mode Analytics.

Machine Learning Integration: Spark ML, Scikit-learn, TensorFlow (light), Feature Engineering Collaboration.

Metadata & APIs: RESTful APIs, Azure Purview, AtScale, Data Catalogs.

Methodologies: Agile (Scrum), Jira, Confluence.

Languages

English
Native or bilingual

Workplace preferences

Remote only

Primarily works remotely

DHIS2
Data Engineer
SOFTWARE PUBLISHING
July 2020 - Today (5 years and 11 months)
● Designed and implemented scalable ETL/ELT pipelines using Apache Spark (PySpark/Scala), Databricks, Microsoft Fabric, and Azure Data Factory, processing millions of public health records including HIV/AIDS surveillance, immunization, maternal health, and disease reporting used in WHO and Save the Children-supported programs.
● Architected a Medallion Data Lake (Bronze, Silver, Gold layers) with strong data modeling practices (dimensional and analytical modeling) to standardize healthcare data ingestion, transformation, and analytics for disease monitoring, outbreak tracking, and national health reporting.
● Engineered multi-source data ingestion pipelines from relational databases (SQL Server, PostgreSQL), APIs (DHIS2 and external health systems), CSV, Excel, JSON, and flat files, enabling unified processing of heterogeneous public healthcare data.
● Built and tuned high-performance Spark pipelines using caching, in-memory computation, and distributed processing optimizations, enabling efficient processing of large-scale public health datasets in Databricks and Microsoft Fabric environments.
● Collaborated with public health stakeholders, including WHO-aligned initiatives and Save the Children programs, to define data models, ETL rules, validation frameworks, and standardized health indicators, ensuring accurate reporting for HIV/AIDS and other critical health programs.
● Built and maintained cloud-based data lakes on AWS S3 and Azure Data Lake, implementing scalable transformations using AWS Glue, Microsoft Fabric, and Spark with optimized partitioning and caching strategies for large datasets.
● Developed Power BI dashboards integrated with SQL, APIs, DHIS2 systems, and Microsoft Fabric datasets, enabling visualization of health program performance, treatment outcomes, and key public health indicators at scale.
Skills: ETL (Extract, Transform, Load) Processes, Database Management (e.g., SQL, NoSQL), Data Cleaning and Preprocessing, Databricks, Microsoft Fabric.
OpenMRS
Jr. Data Engineer
SOFTWARE PUBLISHING
July 2019 - July 2020 (1 year)
Dallas, United States
● Designed and developed scalable ETL pipelines using AWS Glue and PySpark to ingest and transform large-scale healthcare data (millions of patient and medicine records) from S3 and external systems into Amazon Redshift for analytics and reporting.
● Automated data discovery and querying using AWS Glue Crawlers, Data Catalog, and Amazon Athena, enabling efficient access to high-volume patient and clinical datasets for analytics teams.
● Built a reusable and scalable ETL framework using Spark (Python/Scala) to standardize ingestion, transformation, and loading of millions of healthcare records, including patient history, prescriptions, and treatment data, into Hive and HBase.
● Designed and optimized data models for healthcare analytics, structuring raw, staging, and curated layers (Medallion-style modeling) to support efficient querying and reporting on patient and medicine datasets.
● Optimized Hive table design with partitioning and bucketing strategies, significantly improving performance for millions of patient-level and pharmaceutical records.
● Implemented event-driven data pipelines using AWS Lambda and S3 triggers, enabling automated and near real-time processing of incoming patient and medicine data at scale.
● Orchestrated end-to-end workflows using Apache Airflow DAGs, ensuring reliable scheduling, dependency management, and monitoring of large-scale healthcare data pipelines.
● Developed and optimized distributed processing jobs using PySpark and Spark SQL, efficiently handling millions of records across patient demographics, prescriptions, and clinical events.
● Built real-time streaming pipelines using Apache Kafka and Apache Flink, and containerized workloads using Docker and Kubernetes, enabling scalable processing of high-volume healthcare data streams.
Skills: Data Engineer, Databricks, ETL (Extract, Transform, Load) Processes, SQL Server, Apache Kafka.
Solulab Inc
Software Developer Intern
DIGITAL AND IT
January 2019 - July 2019 (6 months)
Ahmedabad, India
● Developed and integrated RESTful APIs using FastAPI and PostgreSQL into the Bevvi application, enabling seamless data exchange and functionality with third-party services.
● Designed a responsive user interface (UI) using React and Material UI for the Bevvi application, increasing mobile traffic by 25% and improving user satisfaction.
● Utilized Jenkins for Continuous Integration and Continuous Deployment (CI/CD), reducing deployment times by 40% and improving release consistency and reliability.
● Designed and implemented scalable AWS cloud infrastructure using services like EC2, S3, and DynamoDB, ensuring optimal performance and cost efficiency.
● Automated serverless workflows using AWS Lambda and API Gateway, reducing operational overhead and enabling event-driven
Skills: Apache Kafka, SQL, Python, REST API, SQL Server.


Master of Science
University of the Cumberlands
2025
Computer Science
Bachelor in Science & Technology
Maharaja Ranjit Singh Punjab Technical University
2019
Computer Science & Engineering


Algorithmic Toolbox
UC San Diego
2024
https://www.coursera.org/account/accomplishments/verify/PRFAW9OY5LNN
Skills: Data Science, Algorithms, Data Structures, Python.

Cloud Engineer & Architect

Gyan Bahadur Tamang

Data Engineer | ETL | PySpark | Fabric | Databricks
