With 7 years of industry experience in data engineering, analytics, data warehousing, ETL pipelines & machine learning,
along with numerous projects demonstrating my command of data engineering through SQL, Python, PySpark & R, combined with strong interpersonal skills, I am confident that I can design, develop and
deliver efficient analytical solutions to your business problems.
● Redesigned & deployed a retail & marketing end-to-end data pipeline, improving efficiency and consistency.
● Assisting the Enterprise Service team with ad-hoc data modeling & data pipeline redesign efforts.
● Spearheading the dbt data modeling team to improve data integrity & semantics, helping stakeholders better understand the structure & relationships in the data lineage and thereby enabling the executive team to make informed decisions.
● Led the data engineering team in redesigning a PySpark pipeline that loads shipment & merchandise data into the Redshift warehouse’s one-big-table via a multi-node EMR cluster.
● Collaborated with the AI team on a data pipeline for real-time shipment tracking and analysis, ingesting data extracted from third-party shipment contracts.
● Reduced shipment tracking dashboard latency from 48 minutes to 17 minutes by identifying and resolving bottlenecks and inefficient code practices across the entire data pipeline.
● Designed & developed a HubSpot API pipeline in Python to extract customer interaction data such as contacts, emails, calls, and notes, harnessing new avenues of insight and increasing customer retention by 18%.
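A pipeline like the HubSpot one above typically follows a cursor-based pagination loop. The sketch below is a hedged illustration of that loop: `fetch_page` is a stub standing in for the real HTTP call (HubSpot's CRM v3 endpoints return a `paging.next.after` cursor, but the field names, page size, and sample data here are illustrative assumptions, not the original implementation).

```python
# Hedged sketch of a paginated CRM extraction loop. fetch_page is a stub for a
# real HubSpot-style HTTP call; the payload shape and data are assumptions.

def fetch_page(object_type, after=None):
    """Stub: pretend HTTP call returning one page of records plus a cursor."""
    data = {
        "contacts": [{"id": i, "email": f"user{i}@example.com"} for i in range(6)],
    }
    records = data[object_type]
    start = int(after or 0)
    page = records[start:start + 3]  # page size of 3, purely for illustration
    next_after = str(start + 3) if start + 3 < len(records) else None
    return {
        "results": page,
        "paging": {"next": {"after": next_after}} if next_after else {},
    }

def extract_all(object_type):
    """Follow the paging.next.after cursor until the API stops returning one."""
    after, out = None, []
    while True:
        resp = fetch_page(object_type, after)
        out.extend(resp["results"])
        after = resp.get("paging", {}).get("next", {}).get("after")
        if after is None:
            break
    return out

contacts = extract_all("contacts")
print(len(contacts))  # → 6 (all pages combined)
```

In a real run, `fetch_page` would issue an authenticated GET per object type (contacts, emails, calls, notes) and the combined records would land in the warehouse staging area.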
● Reduced CPU utilization on the data warehouse by 22% by introducing materialized views and validating data before writing it into the respective schemas. This also fixed a multitude of data quality issues such as duplication & data inaccuracy.
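The validate-before-write step above can be sketched as a small pure-Python gate that drops duplicate keys and rejects incomplete records before anything reaches the warehouse. The column names and rules below are illustrative assumptions, not the original schema.

```python
# Minimal sketch of a pre-write validation gate: deduplicate on a key column and
# reject rows with missing required fields. Column names are assumptions.

REQUIRED = ("order_id", "amount")

def validate_rows(rows):
    """Return (clean, rejected): drop duplicate keys and rows missing fields."""
    seen, clean, rejected = set(), [], []
    for row in rows:
        if any(row.get(col) is None for col in REQUIRED):
            rejected.append(row)            # fails completeness check
        elif row["order_id"] in seen:
            rejected.append(row)            # duplicate primary key
        else:
            seen.add(row["order_id"])
            clean.append(row)
    return clean, rejected

rows = [
    {"order_id": 1, "amount": 9.5},
    {"order_id": 1, "amount": 9.5},   # duplicate
    {"order_id": 2, "amount": None},  # incomplete
    {"order_id": 3, "amount": 4.0},
]
clean, rejected = validate_rows(rows)
print(len(clean), len(rejected))  # → 2 2
```

Only the `clean` list would be written to the target schema; rejected rows would go to a quarantine table for review, which is what removes the duplication and inaccuracy issues downstream.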
● Engineered a data pipeline ingesting global rare-disease clinical trial data that enhanced the R&D team’s drug development, keeping data accuracy as the primary business goal while avoiding data redundancy.
● Developed a pipeline that migrated study & patient data from multiple sources into our Snowflake warehouse and onward to our Qlik dashboards via AWS EC2 and S3 buckets, opening the clinical trials to a larger population of patients based on insights generated by the symptomology dashboards.
● Spearheaded a cross-functional data wrangling & integrity effort to predict deviations from other similar clinical trials by extracting data from various sources.
● Designed and containerized a scalable learning platform; 800+ students registered in the first semester of launch.
● Designed and implemented data pipelines with a Python-Selenium web scraper for revenue-driving teams to improve data governance; the pipelines fed Power BI dashboards with data extracted from sources such as National Grid and exported it to our local servers, improving the data refresh rate by 94%.
● Surveyed stakeholders and conceived ideas as part of the data science team that predicts quarterly customer conversion rates and proposes strategies to improve them.
● Established in-house methods to extract results from end-to-end descriptive analyses that guided ad placement on the customer’s website, increasing sales and customer retention by 12%.
● Led a cross-functional team that utilized international customer transaction and global monetary data to enhance data accuracy & compliance through data warehousing techniques & complex queries.
● Reduced service downtime by 85 minutes per day by implementing a caching system that warms up databases with daily foreign exchange data on service startup.
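The startup warm-up above can be illustrated with a small in-memory cache that is preloaded before the service accepts traffic. `load_daily_rates` is a stub for the real database or FX-feed query, and the currency pairs and values are made up for the sketch.

```python
# Hedged sketch of a startup cache warm-up: preload the day's foreign exchange
# rates so the first requests never hit a cold database. The loader is a stub.

from datetime import date

def load_daily_rates(day):
    """Stub for the expensive source query (DB or external FX feed)."""
    return {"EUR/USD": 1.08, "GBP/USD": 1.27}

class FxCache:
    def __init__(self, loader):
        self._loader = loader
        self._store = {}

    def warm_up(self, day):
        """Called once on service startup, before traffic is accepted."""
        self._store[day] = self._loader(day)

    def get(self, day, pair):
        if day not in self._store:          # fallback for cache misses
            self._store[day] = self._loader(day)
        return self._store[day][pair]

cache = FxCache(load_daily_rates)
cache.warm_up(date(2024, 1, 2))
print(cache.get(date(2024, 1, 2), "EUR/USD"))  # → 1.08
```

Because the expensive load happens once at startup rather than on the first user request, the cold-start window that previously counted as downtime is eliminated; the miss-path fallback keeps the service correct if a new day rolls over mid-run.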
SQL, Python, R, PySpark
Power BI, Tableau, AWS (EC2, S3, EMR, RDS, Redshift), Google Analytics, Docker, Hive, Snowflake
Pandas, NumPy, Scikit-Learn, Matplotlib, Shiny, ggplot2, Flask
SSIS, OLTP, OLAP, Snapshot, KPI, dbt
Git, Jenkins, JIRA, Confluence