Data Warehouse Design recommendations with Google BigQuery
Thank you all for reading out my previous blogs on CDC implementations, Google services , GCP Bigquery and DataStream service of GCP. With today’s topic , would like to share some of design recommendations based on my previous experiences of GCP Data migrations , DW implementation with Google Bigquery.
With typical DWBI implementation, we are already aware of some of basic design patterns or tips to remember like –
1. OLTP vs OLAP — What kind of transactions are there ?
2. Real time vs Batch Processing — What kind of processing there?
3. Structured vs Semi-structured data — What kind of Data stored, processed in the DW?
4. ETL vs ELT — What kind of operations , transformations on top of Sourced data in DW?
5. Volume vs Velocity vs Veracity — What kind of volume, frequency and variety of data in DW? What is daily/monthly/quarterly/annual growth of the data?
6. Prescriptive vs Predictive analytics — What kind of analytics on top of data present in DW?
7. End users/ Business users — What are the different types of users accessing data from DW?
8. Concurrency — What are concurrent processes running within DW?
9. SLOs vs SLAs — What are the SLAs, SLOs of applications running on top of DW?
10. Performance — What is current performance of system vs system upgrade/availability ?
With the help of above questions and answers to them help us to design a Data Warehouse. We can easily calculate resources required like — compute and storage, Users required — application users vs end users , concurrent processes running and required SLA. With type of transformations, we can easily evaluate available ETL tools as well as evaluate DW native utilities to implement push down mechanism to utilize power of DW for better performance — this is typically used in ELT kind of implementation.
Now, consider the same set of questions and how they can help us to derive design patterns in GCP DW implementations with GCP BigQuery. Let’s check how we can use these questions –
1. OLTP vs OLAP — As we already know that BQ supports OLAP kind of implementation
2. Real time vs Batch Processing — BQ supports real time as well as batch processing. There is difference in terms of their implementation and data availability.
3. Structured vs Semi-structured data — BQ supports structured, semi-structured as well as Bigdata ecosystem file formats like — Parquet, Avro etc.
4. ETL vs ELT — BQ connectors and extensions are integrated with most of ETL tools. We can use ETL tools as well as GCP ETL service — Data Fusion to load data into DW. We have already seen how can we implement ELT with BQ.
5. Volume vs Velocity vs Veracity — BQ is one of the managed service of GCP which supports growing needs of data volume, supports variety of data and processing.
6. Prescriptive vs Predictive analytics — BQ connectors are available and integrated with most of BI tools. We can use BI tools like Tableau, PowerBI , Looker or GCP native Data Studio for reporting and prescriptive analytics. BQ also have BQ ML which is powered extension of BQ to support Data scientists who are good with SQL skills than Programming. We can run predictive models with BQ ML and schedule them , use them to get predictions.
7. End users/ Business users — BQ allows us to create Federated Views or Authorized views to setup end users access to DW/data.
8. Concurrency — With legacy implementations and even some of cloud services have limitations on concurrent users accessing data but with BQ we have broader limitation widow where we can setup concurrent accesses through processes or users with BQ project. You can refer to my next blog on Data security with BQ to understand access setup at length.
9. SLOs vs SLAs — BQ is well known for its power of performance and query execution time. With BQ architecture and data access path , we can scan huge data in few seconds. The end users as well as application SLAs are dependent on how processes and applications, transformations being run on BQ . You can refer to my next blog for Performance optimization with BQ to design and develop workloads on BQ
10. Performance — This is important factor of any DW implementation where design of processes/workloads impact performance of DW. Though BQ is well known for its performance , this might be impacted due to incorrect SQLs, coding or development or migration of legacy code as is to BQ. You can refer to my next blog Performance optimization with BQ to design and develop workloads on BQ
As we know by now, that though BQ is performance efficient , managed service or serverless service of GCP we need to be careful while implementing ETL or ELT or processes or workloads or transformations with BQ. Performance of BQ is dependent on the way queries designed , data scans etc. I would like to share few tips or patterns useful for BQ DW design –
i. Segregate your Data — Identify Hot , Warm and Cold data present in your current system and separate out into different storages.
a. Hot Data— Data frequently accessed, processed within DW. Store this data as part of BQ tables leveraging BQ storage
b. Warm Data — Data that is less frequently accessed or referenced or modified. Based on the no of reads or reference to this data either store it as BQ table or store on GCS bucket and define an external table on top of it ( this is purely dependent on accessibility and usage of data)
c. Cold Data — Data that is rarely used or archived data or historical data which will be hardly referenced by any reports or processes. I recommend to store them on GCS buckets instead of moving them to BQ. This will save BQ storage cost as GCS coldline storage or nearline storage cost is cheaper than BQ storage cost. External table can be created on top of it if at all it is required to be accessed.
ii. Define Layers of your DW — Identify the different layers of data transformations / processing and create layered DW.
a. Raw Layer/Stage Layer — This layer is used to store raw data or source data in original format/form . Store source data as is. Tables in this layer are truncate and load. Refresh the data as its frequency of sourcing.
b. Transform Layer / Intermediate Layer — This layer is used to store transformed data. Write processes/jobs/pipelines to read data from Raw layer apply transformations and load to transform layer tables. Tables in this layer are mostly carry UPSERT logic and store batch data for few days.
c. Aggregate Layer / Target Layer — This layer is final layer where data transformed from transform layer pushed to target tables. Tables in this layer are mostly carry APPEND or UPSERT operations to combine daily data modifications.
d. Reporting Layer / BI Layer / Analytical Layer — This layer is provisioned for end users, business users, data scientists , analysts working on AI/ML use cases. This layer consist of federated views or authorized views or restricted access to the data as per business requirements.
iii. Define Roles and Users of DW — We know that a DW is a data platform used by various users within Organization as well as business/end users. Defining roles of these users is equally important as defining layers. Define Service accounts (application users) , Dev users , business users/end users , BA , QA teams, DA teams etc.
iv. Define Access Control Policies — We have seen different layers of data processing ,usage and users of these layers. Access control is one the most important design aspect for Data Security. There are different ways we can implement Data Security, you can refer my next blog for the same. Defining users and access restrictions, policies for each layer is to be done with DW designing. RBAC is one of method to implement access control.
v. Define Processes/workloads/pipelines — Based on DW source integrations, identify and define different types of pipelines to be developed. This is important phase as it incurs cost of maintenance if its not designed by applying some general rules or automated processes. You can refer to my blog for processes integration with BQ.
vi. Define monitoring, operations and Maintenance — Based on pipelines designed and orchestration implemented, setup a monitoring process or automate it. Refer to my blog on implementing DataOps with GCP services.
vii. Define benchmarking for BQ expenses — As you know , BQ is managed service of GCP which takes care of auto scaling and resource management. If your DW is high with data volume and processing then you need to be careful while designing pipelines to avoid unnecessary scans and $ bill for the same. You need to restrict down the unnecessary scans by application teams. Refer to my blog on BQ costing and optimization for more details. Setup billing alerts for BQ usages.
These are some of the important design tips while designing Data Warehouse implementation with GCP BQ. Hope this helps to design your DW with GCP BigQuery.
About Me :
I am DWBI and Cloud Architect! I have been working with various Legacy data warehouses, Bigdata Implementations, Cloud platforms/Migrations. I am Google Certified Professional Cloud Architect .You can reach out to me @ LinkedIn if you need any further help on certification, GCP Implementations!