Implementing Slowly Changing Dimensions using GCP Services

Pooja Kelgaonkar
Google Cloud - Community
4 min readJan 25, 2022

--

Thanks for reading out my first blog on Change data capture in DWBI. If you haven't read it earlier, you can read @ https://poojakelgaonkar.medium.com/change-data-capture-in-dwbi-etl-elt-implementations-af08b54d8fb5

We have seen different types of change data capture and ways to implement change data capture. Slowly Changing Dimensions(here after referred as SCD) can be implemented in many ways out of which SCD Type II is widely used enterprise solution. Let's quickly recap on SCD Type II — This is Row versioning , Track changes as version records with current flag & active dates and other metadata. Type II implementations carry few of audit columns with each of its implementation to represent audit trail and active records.

We can implement SCD using some of below well-known ways of implementation at on premise/legacy systems -

Implementing using DB utilities — (Batch Process)

o Running Upserts — using sql script to update existing records for changes and inserting new records received.

o Running Upserts using Merge statements — Use Merge to identify and update existing records and insert unmatched records.

Implementation using ETL Tools –(Batch Process & Real Time)

o Using Informatica, Datastage, Ab initio etc. transformations to identify changes and implement on Database/tables

Implementation using Open-Source tools/services — (Real Time)

o Debezium — This open source can be used to capture changes at source and reflect them to Target DB. Mysql, PostgreSQL,DB2,Oracle, SQL Server,MongoDB, Cassandra connectors are available with Debezium.

We can implement CDC with cloud datawarehouse implementations as well. We will see implementing DB/DW/Data Lake using GCP services in my upcoming blogs. For this blog, lets consider DB setup on GCP using Bigquery or Cloud SQL. Assume a scenario of telecom or retail domain where you have source system which captures user profiles and you need to maintain profile data using GCP DB services. Implementing CDC becomes prime need using SCD Type II here.

Now, lets see how can we implement SCD Type II using GCP native, managed as well as open-source services.

GCP Managed Services –

o Datastream — This is a serverless and easy-to-use change data capture (CDC) and replication service. We will see Change data capture with GCP datastream in upcoming blogs at length.

o Pub/Sub — This is used to capture changes at real time from source systems.

o DataFusion - This is ETL service of GCP. We can create job/pipeline to capture changes and implement using data fusion.

GCP DB Services — we can use below GCP services to implement SCD Type II.

o BigQuery — GCP BQ is managed service which is widely used for DW implementation on GCP. We have also learnt usage of BQ for serverless datwarehouse implementation on GCP. BQ supports/compatible with ANSI SQL standards. Here, we can implement SCD Type II same way as DB utilities.

§ Use MERGE statement to implement UPSERTS.

§ Use SQL statements to Update and Inserts with bulk operations than row level updates/inserts.

o Cloud SQL — This is DB service of GCP which supports ANSI SQL standard. We can implement SCD Type II using DB utilities, native SQLs on DB

o Cloud Spanner — This is scalable DB service of GCP. This support ANSI SQL standard. We can implement SCD Type II using SQL — UPSERTS to identify change and implement it

Open-source Services –

o Debezium — We can use Debezium to capture changes done at source system. Route them to Kafka topics and write to GCS buckets. Once changes are written to GCS file , these changes can be applied to GCP DB services using one of below steps –

§ Create Dataflow job to push changes to DB from GCS

§ Create DB service utility job to apply changes to DB/table (BQ)

· Create bqload job to push changes to DB/table from GCS bucket

· Create a script to apply changes to target db/table using MERGE

§ Create dokers of services required for Debezium and capture changes. We will see Debezium and its implementation in coming blog.

We can choose one of this implementation method to implement SCD Type II as part of Change data Capture. Challenges come along with implementations, we will see Bigquery , Datastream and Debezium implementations in detail and their challenges in upcoming blogs.

Thanks for reading! Sharing is more learning 😊

  • About Me

I am DWBI and Cloud Architect! I have been working with various Legacy data warehouses, Bigdata Implementations, Cloud platforms/Migrations. I am Google Certified Professional Cloud Architect .You can reach out to me @ LinkedIn if you need any further help on certification, GCP Implementations!

--

--

Pooja Kelgaonkar
Google Cloud - Community

My words keep me going, Keep me motivating to reach out to more and more!