Git Integration
Manage your data observation strategies as code
Git Repository and Workflow Integration
Tightly couple your data pipeline code and data observation strategies by storing your Lumadata catalog in Git
Git Overview
Lumadata's catalog can be stored outside of Lumadata as YAML. This gives your engineering team maximum flexibility in managing the data observation catalog: from YAML you can create, modify, and delete data observation strategies in Lumadata. And because the observation strategies live in a Git repository, you can bundle your data pipeline code deployments with your observation strategy deployments, ensuring the strategies monitoring the data warehouse stay aligned with the code driving the data loads.
Catalog Setup
To set up a catalog for Lumadata, start by creating an empty folder in a new branch; you can then import an existing catalog into it or create new YAML files. All of the Lumadata data observation strategies can live in that single directory, or you can create subdirectories to organize your files. Once this is set up, whenever these YAML files change, a workflow in your Git repository runs and automatically deploys the changes to Lumadata, fully automating the CI/CD pipeline into Lumadata. A sample layout is shown below.
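For example, a repository that bundles pipeline code with its observation catalog might be laid out as follows. The directory and file names are illustrative, and the workflow example later on this page assumes the catalog lives in a lumadata-observers directory:

.github/workflows/deploy-lumadata-observers.yml
pipelines/                          # your existing data pipeline code
lumadata-observers/
  sales/
    sales-all-history.yaml
    sales-trailing-30-days.yaml
  finance/
    revenue-profile.yaml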
Below are sample Lumadata data observation strategies in YAML: a Snapshot test, a Live source-to-target comparison, and a Profile test. The easiest way to create this YAML is to build the initial version of a data observation strategy in Lumadata and then click the download button to generate and download the YAML directly.
lumadata_schema:
  version: 1.1
lumadata_test:
  name: Sales - All history
  type: Snapshot
  tags:
    - Sales
    - Orders
    - Financials
  connection_source_name: Data Warehouse
  source_definition: |-
    select
      to_char(order_date::date, 'yyyy-mm-dd') as dt,
      sum(orders) as orders,
      sum(units_sold)*1.1 as units_sold,
      sum(unit_price) as unit_price,
      sum(unit_cost) as unit_cost,
      sum(total_revenue) as total_revenue,
      sum(total_profit) as total_profit
    from test_dw.orders
    where order_date::date between '2011-01-01' and '2016-12-01'
    group by 1
    order by 1 desc
  specification:
    mapping:
      - operator: =
        order: 0
        source_column: dt
        target_column: dt
        function: key
        value: ""
        percentage: 0
      - operator: =
        order: 1
        source_column: orders
        target_column: orders
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 2
        source_column: units_sold
        target_column: units_sold
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 3
        source_column: unit_price
        target_column: unit_price
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 4
        source_column: unit_cost
        target_column: unit_cost
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 5
        source_column: total_revenue
        target_column: total_revenue
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 6
        source_column: total_profit
        target_column: total_profit
        function: value
        value: ""
        percentage: 0
  deleted_flag: false
lumadata_schema:
  version: 1.1
lumadata_test:
  name: Sales - incremental trailing 30 days
  type: Live
  tags:
    - Sales
    - Orders
  connection_source_name: Data Warehouse
  source_definition: >-
    select
      order_date::date,
      sum(orders) as orders,
      sum(units_sold) as units_sold,
      sum(unit_price) as unit_price,
      sum(unit_cost) as unit_cost,
      sum(total_revenue) as total_revenue,
      sum(total_profit) as total_profit
    from orders
    where order_date::date between '2017-07-28'::date - interval '30 days' and '2017-07-28'::date
    group by 1
    order by 1 desc
  connection_target_name: ERP
  target_definition: >-
    select
      to_char(order_date, 'yyyy-mm-dd')::date as order_date,
      count(order_id) as order_cnt,
      sum(units_sold) as units,
      sum(unit_price) as unit_price,
      sum(unit_cost) as cost,
      sum(total_revenue) as total_sales,
      sum(total_profit) as total_profit
    from sales_data
    where order_date between '2022-07-28'::date - interval '30 days' and '2022-07-28'::date
    group by 1
    order by 1 desc
  specification:
    mapping:
      - operator: =
        order: 0
        source_column: order_date
        target_column: order_date
        function: key
        value: ""
        percentage: 0
      - operator: =
        order: 1
        source_column: orders
        target_column: order_cnt
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 2
        source_column: units_sold
        target_column: units
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 3
        source_column: unit_price
        target_column: unit_price
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 4
        source_column: unit_cost
        target_column: cost
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 5
        source_column: total_revenue
        target_column: total_sales
        function: value
        value: ""
        percentage: 0
      - operator: =
        order: 6
        source_column: total_profit
        target_column: total_profit
        function: value
        value: ""
        percentage: 0
  deleted_flag: false
lumadata_schema:
  version: 1.1
lumadata_test:
  name: Sales - Profile of 1 year of recent history
  type: Profile
  tags:
    - Sales
    - Orders
  connection_source_name: Data Warehouse
  source_definition: >-
    select
      order_date::timestamp,
      sum(orders) as orders,
      sum(units_sold) as units_sold,
      sum(unit_price) as unit_price,
      sum(unit_cost) as unit_cost,
      sum(total_revenue) as total_revenue,
      sum(total_profit) as total_profit
    from test_dw.orders
    where order_date::date between '2022-07-28'::date - interval '30 days' and '2022-07-28'::date
    group by 1
    order by 1 desc
  connection_target_name: Data Warehouse
  target_definition: >-
    select
      order_date::timestamp,
      sum(orders) as orders,
      sum(units_sold) as units_sold,
      sum(unit_price) as unit_price,
      sum(unit_cost) as unit_cost,
      sum(total_revenue) as total_revenue,
      sum(total_profit) as total_profit
    from test_dw.orders
    where order_date::date between '2021-01-01'::date and '2021-12-31'::date
    group by 1
    order by random()
    limit 50000
  specification:
    profile_test:
      - column_name: order_date
        column_type: timestamp
        column_tests:
          - test_type: volume
            metric: column_per_day
            tolerance: 35
            tolerance_units: percent
          - test_type: recency
            metric: recency
            tolerance: 1
            tolerance_units: day
      - column_name: orders
        column_type: number
        column_tests:
          - test_type: blank
            metric: column_null_ratio
            tolerance: 50
            tolerance_units: percent
          - test_type: variance
            metric: column_average
            tolerance: 2
            tolerance_units: deviation
          - test_type: duplication
            metric: column_is_unique
            tolerance: 0
            tolerance_units: percent
      - column_name: units_sold
        column_type: number
        column_tests:
          - test_type: blank
            metric: column_null_ratio
            tolerance: 50
            tolerance_units: percent
          - test_type: variance
            metric: column_average
            tolerance: 2
            tolerance_units: deviation
          - test_type: duplication
            metric: column_is_unique
            tolerance: 0
            tolerance_units: percent
      - column_name: unit_price
        column_type: number
        column_tests:
          - test_type: blank
            metric: column_null_ratio
            tolerance: 50
            tolerance_units: percent
          - test_type: variance
            metric: column_average
            tolerance: 2
            tolerance_units: deviation
          - test_type: duplication
            metric: column_is_unique
            tolerance: 0
            tolerance_units: percent
      - column_name: unit_cost
        column_type: number
        column_tests:
          - test_type: blank
            metric: column_null_ratio
            tolerance: 50
            tolerance_units: percent
          - test_type: variance
            metric: column_average
            tolerance: 2
            tolerance_units: deviation
          - test_type: duplication
            metric: column_is_unique
            tolerance: 0
            tolerance_units: percent
      - column_name: total_revenue
        column_type: number
        column_tests:
          - test_type: blank
            metric: column_null_ratio
            tolerance: 50
            tolerance_units: percent
          - test_type: variance
            metric: column_average
            tolerance: 2
            tolerance_units: deviation
          - test_type: duplication
            metric: column_is_unique
            tolerance: 0
            tolerance_units: percent
      - column_name: total_profit
        column_type: number
        column_tests:
          - test_type: blank
            metric: column_null_ratio
            tolerance: 50
            tolerance_units: percent
          - test_type: variance
            metric: column_average
            tolerance: 2
            tolerance_units: deviation
          - test_type: duplication
            metric: column_is_unique
            tolerance: 0
            tolerance_units: percent
  deleted_flag: false
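Once the YAML files are in place, commit and push them like any other code change; merging into the branch that triggers your deployment workflow (production in the example below) deploys the strategies to Lumadata. The branch and file names here are illustrative:

git checkout -b add-sales-observers
git add lumadata-observers/
git commit -m "Add sales observation strategies"
git push origin add-sales-observers
# open a pull request and merge into the production branch to trigger the deployment workflow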
Git Workflows
Workflows in your Git hosting platform (such as GitHub Actions or GitLab CI) implement CI/CD pipelines by defining pipeline operations in YAML. Lumadata provides a workflow module that you can use to automate deployment of your YAML-based catalog to Lumadata. To set this up, create a workflow file similar to the GitHub Actions example below.
name: deploy-lumadata-observers
on:
  workflow_dispatch:
  push:
    branches:
      - production
    paths:
      - '.github/workflows/deploy-lumadata-observers.yml'
      - 'lumadata-observers/**'
jobs:
  build:
    environment: production
    runs-on: ubuntu-latest
    steps:
      - name: Checkout source code
        uses: actions/checkout@v2
      - name: Deploy to Lumadata
        uses: lumadata/actions@production
        with:
          access_key: ${{ secrets.REPOSITORY_ACCESS_KEY }}
          secret_key: ${{ secrets.REPOSITORY_SECRET_KEY }}
          directory: "./lumadata-observers/**"
If your repository is hosted on GitLab rather than GitHub, a comparable deployment can run as a GitLab CI job using the same Lumadata image, for example:

stages:
  - yaml_test

yaml_test:
  only:
    refs:
      - main
  stage: yaml_test
  image:
    name: lumadata/actions:0.1.0
    entrypoint: ['']
  script:
    - /app/start.sh yaml-to-lumadata $CI_PROJECT_DIR/tests
In the GitHub Actions sample above, the workflow is defined as follows:
- The workflow triggers on push events to the branch named "production", and it can also be run manually via workflow_dispatch.
- The Lumadata files in this repository are stored under "lumadata-observers". Any change to the workflow file itself or to files in this directory, including its subdirectories, triggers the workflow.
- The job runs in the production environment and uses that environment's secrets.
We recommend storing your repository access key and secret key in your platform's secrets store (for example, GitHub repository or environment secrets) so that they're injected into the workflow at run time. This keeps your keys out of your code and commit history.
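If your repository is on GitHub, one way to add these values is with the GitHub CLI, which prompts you for each value; the secret names must match those referenced in the workflow:

gh secret set REPOSITORY_ACCESS_KEY
gh secret set REPOSITORY_SECRET_KEY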