The Kubernetes (a container orchestrator) logo

Data Infrastructure

January 2022 - June 2023

When I joined CC Data, I observed several challenges in our data infrastructure. The data team, which worked in Python, was heavily reliant on a central GitHub monorepo. Code duplication was widespread, often in the form of copied scripts with minor modifications. Client deliveries were run from these scripts, resulting in client complaints about inconsistent service, which in turn led to hesitation about deploying revenue-generating Python scripts. The lack of robust package management, the absence of application tests, and an intricate data storage architecture posed further problems, and our data quality monitoring was not up to par.

To address these issues, I set the following objectives:

  1. Minimize code duplication
  2. Implement an infrastructure where data could be instantaneously accessed, and computations could be conducted where the data was stored
  3. Reduce client complaints by incorporating logging and event warnings
  4. Create a reliable deployment system to win the CTO's trust

I began by restructuring our monorepo by product. I then developed shared libraries to standardize our solutions to recurring problems. In collaboration with the DevOps team, we moved our deployment process to Docker containers running on Kubernetes, and we established a Continuous Integration/Continuous Deployment (CI/CD) pipeline using GitHub Actions for CI and ArgoCD for CD.
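A minimal sketch of what such a pipeline can look like (the repository layout, image name, and registry here are hypothetical, not our actual configuration; ArgoCD watches the Kubernetes manifests separately and is not shown):

```yaml
# .github/workflows/ci.yml — illustrative example only
name: ci
on:
  push:
    branches: [main]
jobs:
  test-and-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Install dependencies and run the test suite before anything ships.
      - run: pip install poetry && poetry install
      - run: poetry run pytest
      # Build and push an image tagged with the commit SHA;
      # ArgoCD then syncs the manifests that reference this tag.
      - run: |
          docker build -t registry.example.com/data/service:${{ github.sha }} .
          docker push registry.example.com/data/service:${{ github.sha }}
```

Tagging images by commit SHA keeps deployments traceable: every running pod maps back to an exact commit.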

Next, we incorporated Grafana Loki for logging and Robusta for event management. We set up data quality monitoring services running at different frequencies and trained our 24/7 support team to oversee them. I introduced Poetry for package management and ran seminars on building Docker containers, followed by one-on-one pair programming sessions to ease the transition to Kubernetes, backed by thorough documentation and training resources in Notion.
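One concrete piece of this setup: logs were easiest to work with in Loki when each record was a single JSON line. A minimal sketch of such a formatter using only the standard library (the field names are illustrative, not our exact schema):

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line.

    The Loki agent can then parse fields like level and logger
    into labels without fragile regex pipelines.
    """

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def make_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Writing to stdout rather than files fits the Kubernetes model, where the container runtime and log agent handle collection.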

To abstract the complexity of our data storage, I wrapped data retrieval from Azure blobs so that analysts could access data without understanding the underlying infrastructure. I also began shifting regular data operations to Google BigQuery for SQL access and introduced standardized testing for production operations.
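The core idea of the wrapper can be sketched like this (the dataset names, paths, and `fetch_bytes` backend are all hypothetical; in production the backend wrapped the Azure blob SDK, and it is injected here so the sketch runs without credentials):

```python
import json
from typing import Callable

# Hypothetical catalogue mapping friendly dataset names to blob locations.
# Analysts only ever see the names on the left.
_DATASETS = {
    "exchange_trades_daily": ("prod-container", "trades/daily/latest.json"),
}


def load_dataset(name: str, fetch_bytes: Callable[[str, str], bytes]) -> dict:
    """Load a dataset by name, hiding container and path layout.

    fetch_bytes(container, blob_path) is the storage backend; in
    production it would call the Azure blob SDK, while tests can
    pass a fake that returns canned bytes.
    """
    if name not in _DATASETS:
        raise KeyError(f"unknown dataset {name!r}; known: {sorted(_DATASETS)}")
    container, path = _DATASETS[name]
    return json.loads(fetch_bytes(container, path))
```

Because the storage layout lives in one catalogue, moving a dataset (for example, into BigQuery) changes one entry rather than every analyst's script.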

As a result, we launched about 20 data quality microservices, which fed a monthly data quality report and showed a decrease in data-related issues. The CTO gave us the green light to run two production microservices on the Kubernetes infrastructure, which generated new datasets for clients. The analysts also found it easier to work with data in Google BigQuery and advocated for moving more datasets over. Together, these changes streamlined our data processes, improved the quality of our client service, and increased the efficiency of our data team.