Git Integration

Manage your data observation strategies as code

Git Repository and Workflow Integration

Tightly couple your data pipeline code and data observation strategies by storing your Lumadata catalog in Git

Git Overview

Lumadata's catalog can be stored outside of Lumadata as YAML. This gives your engineering team full flexibility in managing the data observation catalog: by editing YAML files, you can create, modify, and delete data observation strategies in Lumadata. Because the observation strategies live in a Git repository, you can bundle your observation strategy deployments with your data pipeline code deployments, so the strategies that monitor the data warehouse stay aligned with the code driving the data loads.

Catalog Setup

To set up a catalog for Lumadata, create an empty folder in a new branch. You can use this folder to import an existing catalog or to create new YAML files. All Lumadata data observation strategies can be stored in the same directory, or you can create subdirectories to organize your files. Once the setup is complete, a Git workflow runs whenever these YAML files change and automatically deploys the changes to Lumadata, giving you a fully automated CI/CD pipeline into Lumadata.
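As a rough illustration of the layout described above, the following Python sketch lists the YAML files found under a catalog directory, including any subdirectories. The directory name `lumadata-observers` is borrowed from the sample workflow later on this page; the helper name `find_strategy_files` is hypothetical, and the script is not part of any Lumadata tooling.

```python
from pathlib import Path


def find_strategy_files(catalog_dir):
    """Return every .yml/.yaml file under catalog_dir as a sorted list
    of POSIX-style paths relative to catalog_dir."""
    root = Path(catalog_dir)
    files = [
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix in {".yml", ".yaml"}
    ]
    return sorted(files)


if __name__ == "__main__":
    # Hypothetical catalog directory; adjust to your repository layout.
    for name in find_strategy_files("lumadata-observers"):
        print(name)
```

Running a check like this before a push is one way to confirm which files sit inside the catalog directory, and therefore which files a deployment workflow would be expected to see.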

The following are sample Lumadata data observation strategies in YAML. The easiest way to create this YAML is to build the initial version of a data observation strategy in Lumadata, then click the download button to generate and download the YAML.


	lumadata_schema:
		version: 1.1
	lumadata_test:
		name: Sales - All history
		type: Snapshot
		tags:
			- Sales
			- Orders
			- Financials
		connection_source_name: Data Warehouse
		source_definition: |-
			select 
				to_char(order_date::date, 'yyyy-mm-dd') as dt,
				sum(orders) as orders,
				sum(units_sold)*1.1 as units_sold, 
				sum(unit_price) as unit_price, 
				sum(unit_cost) as unit_cost, 
				sum(total_revenue) as total_revenue, 
				sum(total_profit) as total_profit
			from test_dw.orders 
			where order_date::date between '2011-01-01' and '2016-12-01'
			group by 1
			order by 1 desc
		specification:
			mapping:
			- operator: =
				order: 0
				source_column: dt
				target_column: dt
				function: key
				value: ""
				percentage: 0
			- operator: =
				order: 1
				source_column: orders
				target_column: orders
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 2
				source_column: units_sold
				target_column: units_sold
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 3
				source_column: unit_price
				target_column: unit_price
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 4
				source_column: unit_cost
				target_column: unit_cost
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 5
				source_column: total_revenue
				target_column: total_revenue
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 6
				source_column: total_profit
				target_column: total_profit
				function: value
				value: ""
				percentage: 0
		deleted_flag: false

	lumadata_schema:
		version: 1.1
	lumadata_test:
		name: Sales - incremental trailing 30 days
		type: Live
		tags:
			- Sales
			- Orders
		connection_source_name: Data Warehouse
		source_definition: >-
			select 
				order_date::date, 
				sum(orders) as orders, 
				sum(units_sold) as units_sold, 
				sum(unit_price) as unit_price, 
				sum(unit_cost) as unit_cost, 
				sum(total_revenue) as total_revenue, 
				sum(total_profit) as total_profit
			from orders 
			where order_date::date between '2017-07-28'::date - interval '30 days' and '2017-07-28'::date
			group by 1 
			order by 1 desc
		connection_target_name: ERP
		target_definition: >-
			select 
				to_char(order_date, 'yyyy-mm-dd')::date as order_date, 
				count(order_id) as order_cnt, 
				sum(units_sold) as units, 
				sum(unit_price) as unit_price, 
				sum(unit_cost) as cost, 
				sum(total_revenue) as total_sales, 
				sum(total_profit) as total_profit
			from sales_data 
			where order_date between '2022-07-28'::date - interval '30 days' and '2022-07-28'::date
			group by 1
			order by 1 desc
		specification:
			mapping:
			- operator: =
				order: 0
				source_column: order_date
				target_column: order_date
				function: key
				value: ""
				percentage: 0
			- operator: =
				order: 1
				source_column: orders
				target_column: order_cnt
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 2
				source_column: units_sold
				target_column: units
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 3
				source_column: unit_price
				target_column: unit_price
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 4
				source_column: unit_cost
				target_column: cost
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 5
				source_column: total_revenue
				target_column: total_sales
				function: value
				value: ""
				percentage: 0
			- operator: =
				order: 6
				source_column: total_profit
				target_column: total_profit
				function: value
				value: ""
				percentage: 0
		deleted_flag: false

	lumadata_schema:
		version: 1.1
	lumadata_test:
		name: Sales - Profile of 1 year of recent history
		type: Profile
		tags:
			- Sales
			- Orders
		connection_source_name: Data Warehouse
		source_definition: >-
			select
			order_date::timestamp,
			sum(orders) as orders,
			sum(units_sold) as units_sold,
			sum(unit_price) as unit_price,
			sum(unit_cost)
				as unit_cost,
			sum(total_revenue) as total_revenue,
			sum(total_profit) as
				total_profit
			from
			test_dw.orders
			where
			order_date::date between
				'2022-07-28'::date - interval '30 days' and '2022-07-28'::date
			group by
			1
			order by
			1 desc
		connection_target_name: Data Warehouse
		target_definition: >-
			select
			order_date::timestamp,
			sum(orders) as orders,
			sum(units_sold) as units_sold,
			sum(unit_price) as unit_price,
			sum(unit_cost)
				as unit_cost,
			sum(total_revenue) as total_revenue,
			sum(total_profit) as
				total_profit
			from
			test_dw.orders
			where
			order_date::date between
				'2021-01-01'::date and '2021-12-31'::date
			group by
			1
			order by
			random()
			limit
				50000
		specification:
			profile_test:
			- column_name: order_date
				column_type: timestamp
				column_tests:
				- test_type: volume
					metric: column_per_day
					tolerance: 35
					tolerance_units: percent
				- test_type: recency
					metric: recency
					tolerance: 1
					tolerance_units: day
			- column_name: orders
				column_type: number
				column_tests:
				- test_type: blank
					metric: column_null_ratio
					tolerance: 50
					tolerance_units: percent
				- test_type: variance
					metric: column_average
					tolerance: 2
					tolerance_units: deviation
				- test_type: duplication
					metric: column_is_unique
					tolerance: 0
					tolerance_units: percent
			- column_name: units_sold
				column_type: number
				column_tests:
				- test_type: blank
					metric: column_null_ratio
					tolerance: 50
					tolerance_units: percent
				- test_type: variance
					metric: column_average
					tolerance: 2
					tolerance_units: deviation
				- test_type: duplication
					metric: column_is_unique
					tolerance: 0
					tolerance_units: percent
			- column_name: unit_price
				column_type: number
				column_tests:
				- test_type: blank
					metric: column_null_ratio
					tolerance: 50
					tolerance_units: percent
				- test_type: variance
					metric: column_average
					tolerance: 2
					tolerance_units: deviation
				- test_type: duplication
					metric: column_is_unique
					tolerance: 0
					tolerance_units: percent
			- column_name: unit_cost
				column_type: number
				column_tests:
				- test_type: blank
					metric: column_null_ratio
					tolerance: 50
					tolerance_units: percent
				- test_type: variance
					metric: column_average
					tolerance: 2
					tolerance_units: deviation
				- test_type: duplication
					metric: column_is_unique
					tolerance: 0
					tolerance_units: percent
			- column_name: total_revenue
				column_type: number
				column_tests:
				- test_type: blank
					metric: column_null_ratio
					tolerance: 50
					tolerance_units: percent
				- test_type: variance
					metric: column_average
					tolerance: 2
					tolerance_units: deviation
				- test_type: duplication
					metric: column_is_unique
					tolerance: 0
					tolerance_units: percent
			- column_name: total_profit
				column_type: number
				column_tests:
				- test_type: blank
					metric: column_null_ratio
					tolerance: 50
					tolerance_units: percent
				- test_type: variance
					metric: column_average
					tolerance: 2
					tolerance_units: deviation
				- test_type: duplication
					metric: column_is_unique
					tolerance: 0
					tolerance_units: percent
		deleted_flag: false

Git Workflows

Git workflows implement CI/CD pipelines by defining pipeline operations in YAML. Lumadata provides a workflow module that automates deployment of your YAML-based catalog to Lumadata. To set one up, start by creating a workflow similar to the YAML below.


	name: deploy-lumadata-observers
	on:
	  # Allow the workflow to be started manually
	  workflow_dispatch:
	  # Run on pushes to the production branch that touch the catalog
	  push:
	    branches:
	      - production
	    paths:
	      - '.github/workflows/deploy-lumadata-observers.yml'
	      - 'lumadata-observers/**'

	jobs:
	  build:
	    environment: production
	    runs-on: ubuntu-latest
	    steps:
	      - name: Checkout source code
	        uses: actions/checkout@v2

	      - name: Deploy to Lumadata
	        uses: lumadata/actions@production
	        with:
	          # Keys are read from the repository's secrets, not hard-coded
	          access_key: ${{ secrets.REPOSITORY_ACCESS_KEY }}
	          secret_key: ${{ secrets.REPOSITORY_SECRET_KEY }}
	          directory: "./lumadata-observers/**"

For pipelines defined in GitLab CI/CD-style YAML, a comparable job looks like this:

	stages:
	  - yaml_test

	yaml_test:
	  only:
	    refs:
	      - main
	  stage: yaml_test
	  image:
	    name: lumadata/actions:0.1.0
	    entrypoint: ['']
	  script:
	    - /app/start.sh yaml-to-lumadata $CI_PROJECT_DIR/tests

In the first sample workflow above:

  • The workflow triggers on pushes to the branch named "production", and can also be started manually (workflow_dispatch).
  • Lumadata files are stored in the "lumadata-observers" directory of this repository. A change to any file or subdirectory in this directory triggers the workflow.
  • The build environment is production, so the workflow uses production secrets when it runs.
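The trigger conditions listed above can be approximated in a few lines of Python. This is only a simplified model for reasoning about which pushes lead to a deployment: the function name `should_deploy` is hypothetical, the workflow file's own entry in the path filter is left out for brevity, and real workflow path filters apply their own glob matching rather than the prefix check used here.

```python
WORKFLOW_BRANCH = "production"
CATALOG_DIR = "lumadata-observers/"


def should_deploy(branch, changed_paths):
    """Approximate the sample workflow's push trigger.

    Returns True when the push targets the production branch and at
    least one changed file lives under the catalog directory.
    """
    if branch != WORKFLOW_BRANCH:
        return False
    return any(path.startswith(CATALOG_DIR) for path in changed_paths)


print(should_deploy("production", ["lumadata-observers/sales/history.yml"]))  # True
print(should_deploy("production", ["src/load_orders.py"]))                    # False
print(should_deploy("main", ["lumadata-observers/sales/history.yml"]))        # False
```

The second call shows why pipeline-only changes do not redeploy the catalog, and the third shows that catalog changes on other branches wait until they reach production.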

We recommend storing your repository access key and secret key in the secrets manager so that they are injected into the workflow at run time rather than committed to the repository. This keeps your keys safe.
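The same principle applies to any helper script you run alongside the workflow: read the keys from the environment instead of hard-coding them. The sketch below assumes the secrets are exposed as environment variables named after the secrets in the sample workflow (`REPOSITORY_ACCESS_KEY`, `REPOSITORY_SECRET_KEY`); that mapping and the function name `load_credentials` are assumptions, not documented Lumadata behavior.

```python
import os


def load_credentials(env=os.environ):
    """Read the access and secret keys from environment variables.

    Raises RuntimeError naming any variable that is missing or empty,
    so a misconfigured run fails before anything is deployed.
    """
    names = ("REPOSITORY_ACCESS_KEY", "REPOSITORY_SECRET_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

Failing fast with the name of the missing variable, and never its value, keeps the keys out of logs while still making a misconfiguration easy to diagnose.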

© 2023 Lumadata. All rights reserved.