A January 2019 DataOps posting discussing Data Operations and its support of modern data platforms.
Data enablement is a pressing need for many organizations. Legacy Business Intelligence (BI) solutions leverage expensive infrastructure and centralized functions that routinely struggle to deliver business capabilities at the “speed of need”.
Modern cloud service-based offerings, such as those built on Azure, facilitate the delivery of robust data platforms in a fraction of the time and at a fraction of the cost of legacy BI solutions. In many organizations these forms of “Advanced Analytics” are creating a resurgence of cognitive and predictive analysis. Success with these technologies, however, requires the adoption of new operating models and methodologies.
Advanced Analytics solutions bring significant velocity to data transformation activities by leveraging scalable parallel processing that occurs after data loading (ELT versus ETL). They are also foundational to any IoT, Artificial Intelligence, and Machine Learning initiatives your organization may pursue.
Critical to the success and empowerment of a modern data platform is how it is deployed and maintained. Unstructured methods will bring ruin to your data platform. Conversely, overly legalistic approaches will stifle innovation and hamper your ability to embrace the power of your modern data platform. An agile method leveraging proven concepts such as continuous integration and delivery (CI/CD), where testing and delivery are fully automated, must be pursued.
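The core CI/CD idea described above, that deployment only proceeds when every automated test passes, can be sketched in a few lines. This is a minimal illustration, not a specific tool's API; `run_tests` and `ci_pipeline` are hypothetical names chosen for this example.

```python
# Minimal sketch of an automated test-and-deploy gate (the heart of CI/CD).
# The function names here are illustrative assumptions, not a real tool's API.

def run_tests(tests):
    """Run each named test callable; return the names of any that fail."""
    failures = []
    for name, test in tests.items():
        try:
            test()
        except AssertionError:
            failures.append(name)
    return failures

def ci_pipeline(tests, deploy):
    """Deploy only if the full automated test suite passes."""
    failures = run_tests(tests)
    if failures:
        return f"blocked: {failures}"
    deploy()
    return "deployed"
```

A real pipeline would run in a build service rather than in-process, but the gate-keeping logic is the same: a failing test blocks the release automatically, with no human judgment call in the loop.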
Enter the concept of Data Operations (DataOps). Akin to DevOps, DataOps is based on automating the pipelines used to test and deploy data and capabilities to your platform. With a modern data platform, the infrastructure is cloud service based and requires little to no worry: Azure services are backed by service level agreements (SLAs) of 99.9%, and more often 99.95%. The services are maintained in a resilient fashion by the cloud provider, ensuring zero downtime for patching and service enhancements, or, in the legacy realm…upgrades. Many of us recall sleepless nights and weekends upgrading mission critical BI platforms!
Three of the critical differences between DevOps and DataOps are the need for a platform sandbox, iteration of orchestration, and robust monitoring.
In a traditional DevOps scenario, the developer maintains an isolated environment (Sandbox) where they write and test features without impacting other developers. In a Data Platform, it is not economically feasible for individual developers to maintain their own sandbox environments. A centralized sandbox environment needs to exist for the development or data engineering team(s).
Orchestration needs to occur at each step within the pipeline. This includes data and logical tests that ensure the quality of production and of feature enhancements to the platform. A representative subset of data should also be routinely run through these tests to maintain confidence in the platform. Azure Data Factory is a robust orchestration service that integrates with over 70 data sources.
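The data and logical tests described above can be thought of as a quality gate that the orchestrator invokes at each step. The sketch below illustrates the pattern; the column name, row-count threshold, and function names are assumptions made for this example, not part of any Azure Data Factory API.

```python
# Hypothetical data-quality gate invoked as one step of an orchestrated
# pipeline. Column names and thresholds are illustrative assumptions.

def check_not_null(rows, column):
    """Logical test: no row may be missing a value for the given column."""
    return all(row.get(column) is not None for row in rows)

def check_row_count(rows, minimum):
    """Data test: fail if the batch is suspiciously small."""
    return len(rows) >= minimum

def run_quality_gate(rows):
    """Run every check; raise so the orchestrator marks the step failed."""
    checks = {
        "customer_id not null": check_not_null(rows, "customer_id"),
        "row count >= 100": check_row_count(rows, 100),
    }
    failures = [name for name, passed in checks.items() if not passed]
    if failures:
        raise ValueError(f"Quality gate failed: {failures}")
    return True
```

Raising an exception (rather than logging and continuing) is deliberate: it lets the orchestration service halt the pipeline and surface the failure before bad data reaches downstream consumers.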
Unlike DevOps, where builds either pass or fail, metrics need to be captured and analyzed regarding the performance of the data pipelines. Statistical process controls (SPC) should be established to ensure quality. These controls require a larger collection of performance data than what is typically stored in the orchestration tool. Azure Data Factory combined with Azure Monitor addresses this need and provides enterprise-class monitoring of your pipelines.
In a data platform, a pipeline is composed of logical activities that complete tasks such as ingesting, cleaning, and transforming data. The pipeline can also enable external platform activities and should be leveraged as a business information process actuator.
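Conceptually, such a pipeline is an ordered chain of activities, each consuming the state left by the previous one. The toy sketch below mirrors that shape; the activity names and the in-memory `state` dictionary are assumptions for illustration, not how an orchestration service like Azure Data Factory actually represents activities.

```python
# Illustrative sketch: a pipeline as an ordered list of activities
# (ingest -> clean -> transform). All names and sample data are invented
# for this example; real orchestrators pass state via datasets, not dicts.

def ingest(state):
    """Pull raw records into the pipeline (hard-coded sample data here)."""
    state["raw"] = [" Alice ", "BOB", None]
    return state

def clean(state):
    """Drop missing values and normalize casing/whitespace."""
    state["clean"] = [v.strip().lower() for v in state["raw"] if v is not None]
    return state

def transform(state):
    """Produce the final ordered output set."""
    state["output"] = sorted(state["clean"])
    return state

def run_pipeline(activities):
    """Execute each activity in order, threading state between them."""
    state = {}
    for activity in activities:
        state = activity(state)
    return state

result = run_pipeline([ingest, clean, transform])
```

Because each activity is an isolated unit with explicit inputs and outputs, individual steps can be tested, reordered, or rerun, which is what makes the pipeline usable as a process actuator rather than a monolithic batch job.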
The code and configurations contained within these pipelines need to be stored outside of the data platform to ensure availability and repeatability in both non-production and production environments. Maintaining code and configuration outside of the platform is critical to their integrity.
A normal software development lifecycle that segregates development, testing, and production should be followed. Consideration should also be given to provisioning your data platform in alternative cloud regions to meet recovery time objectives or to address data residency requirements.