Performance Optimization with BigQuery
So far, we have seen what BigQuery (BQ) is, what the design recommendations for BQ are, and the ways to implement ELT, SCD, and security with GCP BQ. Now, let's look at recommendations to improve the performance of BQ processes. We will cover BQ cost optimization in the next blog.
BQ is a managed GCP service that lets users create and build a DW, integrating with various GCP services, open source services, and ETL/BI tools. Let's assume a typical DW implementation using BQ on top of GCP data pipelines. Because this is a managed Google service, it does not support indexes, constraints, statistics, and some of the other typical DW features used for performance optimization.
BQ performance optimization is therefore done mostly through SQL optimization. There are no additional knobs to add or remove resources, auto-scale them, or tune memory and storage to improve performance. Since this is a fully managed service, we can use the techniques and checkpoints below to validate the performance of DW queries, and follow the recommendations to improve the performance of BQ queries and SQL. Queries that perform better consume fewer resources, which can result in lower costs and fewer failures.
When evaluating query performance in BigQuery, the amount of work required depends on the following factors:
Input data and data sources (I/O): How many bytes does the query read?
Communication between nodes (shuffling): How many bytes does the query pass to the next stage? How many bytes does the query pass to each slot?
Computation: How much CPU work does the query require?
Outputs (materialization): How many bytes does the query write?
Query anti-patterns: Does the query follow SQL best practices?
Many of these factors can be observed through the query plan. A query plan is generated for each query submitted to BigQuery, and it exposes execution statistics such as bytes read and slot time consumed. The query plan also shows the different stages of execution, which can help identify ways to improve query performance.
1. Manage Input/Output Data — Read only the columns required to view or process data. Instead of running SELECT *, run SELECT with an explicit column list.
a. Best practice: Control projection — Query only the columns that you need. Projection refers to the number of columns that are read by a query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table. When experimenting with data or exploring data, use one of the data preview options instead of SELECT *.
Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota. Instead, query only the columns needed. For example, use SELECT * EXCEPT to exclude one or more columns from the results.
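As a quick illustration, here is a minimal sketch of both options, assuming a hypothetical sales.orders table:

```sql
-- Read only the columns you need.
SELECT order_id, order_date, total_amount
FROM sales.orders;

-- Or exclude a few wide columns instead of listing every column you keep:
SELECT * EXCEPT (raw_payload, debug_notes)
FROM sales.orders;
```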
b. Partition data to limit the number of bytes scanned — BQ supports partitioning on date and integer columns. Based on the data, we can pick a partition column to split the table's data into partitions. When there is no suitable date or integer column, we can partition by ingestion time instead. This works well for daily batch loads, where we want to separate each day's batch into its own partition and run queries against only that day's or batch's data. It reduces the table scan to reading a single partition.
We can also require users to filter on the partition column by setting the table's partition filter option. Always use the partition column when reading from partitioned tables, so BigQuery can prune partitions instead of scanning the whole table.
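A minimal sketch of both steps, again assuming the hypothetical sales.orders table:

```sql
-- Create a date-partitioned copy and force callers to filter on the partition column.
CREATE TABLE sales.orders_partitioned
PARTITION BY order_date
OPTIONS (require_partition_filter = TRUE)
AS SELECT * FROM sales.orders;

-- Filtering on the partition column prunes the scan to a single partition.
SELECT order_id, total_amount
FROM sales.orders_partitioned
WHERE order_date = '2024-01-15';
```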
c. Denormalizing data — Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child relationships. BQ does not need data to be fully flattened or normalized; we can use ARRAYs or STRUCTs to store and read data in nested or repeated form.
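For example, a sketch of a nested parent-child schema (the table and field names are hypothetical):

```sql
-- Order lines stored as a nested, repeated field on the parent order.
CREATE TABLE sales.orders_nested (
  order_id   STRING,
  order_date DATE,
  line_items ARRAY<STRUCT<sku STRING, quantity INT64, price NUMERIC>>
);

-- Read the repeated field with UNNEST; no join is needed.
SELECT o.order_id, item.sku, item.quantity
FROM sales.orders_nested AS o, UNNEST(o.line_items) AS item;
```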
d. Use external data sources wisely — If query performance is a top priority, do not use an external data source. Querying tables in BigQuery managed storage is typically much faster than querying external tables in Cloud Storage, Drive, or Cloud Bigtable.
Use an external data source for these use cases:
a. Performing extract, transform, and load (ETL) operations when loading data
b. Frequently changing data
c. Periodic loads such as recurring ingestion of data from Cloud Bigtable
e. Avoid broad wildcards when querying sets of tables — BQ supports querying a set of tables with a wildcard expression, which creates a union of all matching tables and their data. For better performance, narrow the search by using the longest prefix possible.
More granular prefixes perform better than shorter prefixes. For example, FROM bigquery-public-data.noaa_gsod.gsod194* performs better than FROM bigquery-public-data.noaa_gsod.* because fewer tables match the wildcard.
2. Optimize communication between slots — When evaluating communication throughput, consider the amount of shuffling that is required by a query. For example, a GROUP BY clause passes like values to the same slot for processing. The amount of data that is shuffled directly impacts communication throughput and, as a result, query performance. The following best practices provide guidance on controlling communication between slots.
a. Reduce data before Join — Trim the data as early in the query as possible, before the query performs a JOIN. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need.
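A minimal sketch of trimming before the join, assuming hypothetical sales.orders and sales.customers tables:

```sql
-- Filter and aggregate before the JOIN so the shuffle only moves the rows that matter.
WITH recent_orders AS (
  SELECT customer_id, SUM(total_amount) AS total_spend
  FROM sales.orders
  WHERE order_date >= '2024-01-01'   -- trim early
  GROUP BY customer_id               -- aggregate early
)
SELECT c.customer_name, r.total_spend
FROM recent_orders AS r
JOIN sales.customers AS c USING (customer_id);
```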
b. Avoid using WITH clauses as prepared statements — WITH clauses are meant primarily for readability; they are not materialized. For example, placing all queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.
c. Avoid tables sharded by date — When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice adds to query overhead and impacts query performance.
d. Avoid oversharding of tables — Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. If you are sharding tables by date, use time-partitioned tables instead. Because of the low cost of BigQuery storage, you do not need to optimize tables for cost as you would in a relational database system. Creating a large number of table shards has performance impacts that outweigh any cost benefits. Sharded tables require BigQuery to maintain schema, metadata, and permissions for each shard. Because of the added overhead required to maintain information on each shard, oversharding tables can impact query performance.
3. Optimize Query Computation — When evaluating the computation required by a query, consider the amount of work involved.
a. Avoid repeatedly transforming data through SQL queries — If you are using SQL to perform ETL operations, avoid situations where you are repeatedly transforming the same data.
For example, if you are using SQL to trim strings or extract data by using regular expressions, it is more performant to materialize the transformed results in a destination table. Functions like regular expressions require additional computation. Querying the destination table without the added transformation overhead is much more efficient.
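A minimal sketch of materializing a transformation once, assuming a hypothetical staging.raw_events table:

```sql
-- Run the expensive regex once and persist the result, then query
-- the clean table without re-transforming on every read.
CREATE TABLE staging.clean_events AS
SELECT
  event_id,
  REGEXP_EXTRACT(raw_url, r'https?://([^/]+)/') AS host,  -- costly transform, done once
  TRIM(user_agent) AS user_agent
FROM staging.raw_events;
```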
b. Use approximate aggregation functions — If your use case supports it, use an approximate aggregation function. If the SQL aggregation function you're using has an equivalent approximation function, the approximation function yields faster query performance. For example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT().
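A minimal sketch of the trade-off, assuming a hypothetical analytics.page_views table:

```sql
SELECT
  APPROX_COUNT_DISTINCT(user_id) AS approx_users,  -- fast, with a small error margin
  COUNT(DISTINCT user_id)        AS exact_users    -- exact, but more expensive
FROM analytics.page_views;
```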
c. Use an aggregate analytic function to obtain the latest record — To obtain the latest record, use the ARRAY_AGG() aggregate analytic function. Using ARRAY_AGG() instead of numbering functions, such as RANK() or ROW_NUMBER(), allows a query to run more efficiently because the ORDER BY clause is allowed to drop everything except the top record on each GROUP BY clause.
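A minimal sketch, assuming the hypothetical sales.orders table:

```sql
-- Latest order per customer, without RANK() or ROW_NUMBER().
SELECT
  customer_id,
  ARRAY_AGG(o ORDER BY o.order_date DESC LIMIT 1)[OFFSET(0)] AS latest_order
FROM sales.orders AS o
GROUP BY customer_id;
```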
d. Order query operations to maximize performance — Use ORDER BY only in the outermost query or within window clauses (analytic functions), and push complex operations to the end of the query. If you need to sort data, filter first to reduce the number of values you have to sort; if you sort first, you sort far more data than necessary. It is preferable to sort a subset of the data than to sort all the data and apply a LIMIT clause. Placing an ORDER BY clause in the middle of a query greatly impacts performance unless it is used in a window (analytic) function.
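A minimal sketch of filter-then-sort, with the ORDER BY in the outermost query (hypothetical table):

```sql
SELECT order_id, total_amount
FROM sales.orders
WHERE order_date >= '2024-01-01'   -- reduce the rows before sorting
ORDER BY total_amount DESC         -- sort only the filtered subset
LIMIT 100;
```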
e. Optimize join patterns — For queries that join data from multiple tables, optimize join patterns, starting with the largest table. When you create a query by using a JOIN, consider the order in which you are merging the data. The standard SQL query optimizer can determine which table should be on which side of the join, but it is still recommended to order joined tables appropriately. As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.
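For example, a sketch of the recommended ordering with three hypothetical tables:

```sql
SELECT e.event_id, c.country_name, p.product_name
FROM analytics.clickstream_events AS e   -- largest table first
JOIN analytics.countries AS c            -- smallest table second
  ON e.country_code = c.country_code
JOIN analytics.products AS p             -- remaining tables by decreasing size
  ON e.product_id = p.product_id;
```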
f. Use INT64 data types in joins to reduce cost and improve comparison performance — If your use case supports it, use INT64 data types in joins instead of STRING data types.
BigQuery does not index primary keys like traditional databases, so the wider the join column is, the longer the comparison takes. Therefore, using INT64 data types in joins is cheaper and more efficient than using STRING data types.
g. Prune partitioned queries — When querying a partitioned table, filter on the following columns so that BigQuery can prune partitions (see the sketch after this list):
i. For ingestion-time partitioned tables, use the pseudo-column _PARTITIONTIME
ii. For partitioned tables such as the time-unit column-based and integer-range, use the partitioning column.
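A minimal sketch of both cases (hypothetical tables):

```sql
-- Ingestion-time partitioned table: filter on the _PARTITIONTIME pseudo-column.
SELECT event_id
FROM logs.raw_events
WHERE _PARTITIONTIME = TIMESTAMP('2024-01-15');

-- Time-unit column-partitioned table: filter on the partitioning column itself.
SELECT order_id
FROM sales.orders_partitioned
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
```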
h. Avoid multiple evaluations of the same Common Table Expressions (CTEs) — Use scripting, variables, temporary tables, and automatically expiring tables to persist calculations and reuse them later in the query. When a query contains CTEs that are referenced in multiple places, they are evaluated each time they are referenced, which can increase internal query complexity and resource consumption. You can store the result of a CTE in a scalar variable or a temporary table, depending on the data that the CTE returns. You are not charged for storage of temporary tables.
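A minimal sketch using a temporary table inside a script (hypothetical tables):

```sql
-- Compute the expensive result once in a temp table...
CREATE TEMP TABLE top_customers AS
SELECT customer_id, SUM(total_amount) AS total_spend
FROM sales.orders
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 100;

-- ...then reference it as many times as needed without re-computation.
SELECT * FROM top_customers WHERE total_spend > 10000;
SELECT COUNT(*) FROM top_customers;
```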
i. Split complex queries into multiple smaller ones — Leverage BigQuery's rich scripting capabilities and stored procedures to break a computation designed as one complex query into multiple smaller, simpler queries. Complex queries, for example queries with layers of subqueries or joins, can be slow to run and resource intensive. Trying to fit all computations into one huge SELECT statement, for example to make it a view, is often an anti-pattern and can result in a slow, resource-intensive query. In extreme cases the internal query plan becomes so complex that BigQuery is unable to execute it.
4. Manage Query Outputs — When evaluating output data, consider the number of bytes written by a query. The following best practices provide guidance on controlling output data.
a. Avoid repeated joins and subqueries — Avoid repeatedly joining the same tables and using the same subqueries. If you are repeatedly joining the same tables, consider revisiting your schema. Instead of repeatedly joining the data, it might be more performant to use nested repeated data to represent the relationships. Nested repeated data saves you the performance impact of the communication bandwidth that a join requires, as well as the I/O costs incurred by repeatedly reading and writing the same data.
b. Materializing output data — BigQuery limits cached results to approximately 10 GB compressed. Queries that return larger results exceed this limit and frequently result in the following error: Response too large. This error often occurs when you select a large number of fields from a table with a considerable amount of data. Issues writing cached results can also occur in ETL-style queries that normalize data without reduction or aggregation. We can overcome the limitation on cached result size by the following (a sketch of the last option follows the list):
i. Using filters to limit the result set
ii. Using a LIMIT clause to reduce the result set, especially if you are using an ORDER BY clause
iii. Writing the output data to a destination table
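A minimal sketch of writing output to a destination table (hypothetical names):

```sql
-- Persist a large result instead of returning it to the client.
CREATE OR REPLACE TABLE reports.daily_extract AS
SELECT order_id, customer_id, total_amount
FROM sales.orders
WHERE order_date = '2024-01-15';
```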
c. Use a LIMIT clause with large sorts — When sorting a very large number of values, use a LIMIT clause. Writing results for a query with an ORDER BY clause can result in Resources exceeded errors because the final sorting is done on a single slot. If you attempt to order a very large result set, the final sorting can overwhelm the slot that is processing the data. If you are using an ORDER BY clause, also use a LIMIT clause.
5. Avoid SQL Anti-Patterns
a. Self-joins — Avoid self-joins. Use a window (analytic) function instead. Typically, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially squares the number of output rows. This increase in output data can cause poor performance. Instead of using a self-join, use a window (analytic) function to reduce the number of additional bytes that are generated by the query.
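A minimal sketch of replacing a self-join with a window function (hypothetical table):

```sql
-- Compare each day with the previous one using LAG() instead of a self-join.
SELECT
  sale_date,
  daily_total,
  LAG(daily_total) OVER (ORDER BY sale_date) AS prev_day_total
FROM sales.daily_totals;
```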
b. Data skew — If a query processes keys that are heavily skewed to a few values, filter the data as early as possible. Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can't share partitions between slots, so if one partition is especially large, it can slow down, or even crash, the slot that processes the oversized partition. To avoid performance issues that result from data skew (a detection sketch follows the list):
i. Use an approximate aggregate function such as APPROX_TOP_COUNT to determine if the data is skewed.
ii. Filter data as early as possible.
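A minimal sketch of the detection step (hypothetical table):

```sql
-- Inspect the heaviest keys to see whether the data is skewed.
SELECT APPROX_TOP_COUNT(customer_id, 10) AS heaviest_keys
FROM sales.orders;
```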
c. Unbalanced joins — Data skew can also appear when you use JOIN clauses. Because BigQuery shuffles data on each side of the join, all data with the same join key goes to the same shard. This shuffling can overload the slot. To avoid performance issues that are associated with unbalanced joins:
i. Pre-filter rows from the table with the unbalanced key.
ii. If possible, split the query into two queries.
d. Cross joins (Cartesian product) — Avoid joins that generate more outputs than inputs. When a CROSS JOIN is required, pre-aggregate data. Cross joins are queries where each row from the first table is joined to every row in the second table (there are non-unique keys on both sides). The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish.
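A minimal sketch of pre-aggregating before a CROSS JOIN (hypothetical tables):

```sql
-- Aggregate first so the Cartesian product stays small.
WITH daily AS (
  SELECT sale_date, SUM(total_amount) AS day_total
  FROM sales.orders
  GROUP BY sale_date             -- one row per day instead of one per order
)
SELECT d.sale_date, r.region_name, d.day_total
FROM daily AS d
CROSS JOIN sales.regions AS r;   -- small x small keeps the output bounded
```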
e. DML statements that update or insert single rows — Avoid point-specific DML statements (updating or inserting 1 row at a time). Batch updates and inserts.
Using point-specific DML statements is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system. BigQuery focuses on Online Analytical Processing (OLAP) by using table scans, not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases, such as Cloud SQL. BigQuery DML statements are intended for bulk updates; UPDATE and DELETE statements in BigQuery are oriented toward periodic rewrites of data, not single-row mutations.
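A minimal sketch of batching mutations into one statement (hypothetical tables):

```sql
-- One batched MERGE instead of many single-row UPDATE/INSERT statements.
MERGE sales.orders AS t
USING staging.order_updates AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET total_amount = s.total_amount
WHEN NOT MATCHED THEN
  INSERT (order_id, order_date, total_amount)
  VALUES (s.order_id, s.order_date, s.total_amount);
```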
Follow my upcoming story on BQ Cost optimization.
About Me:
I am a DWBI and Cloud Architect! I have been working with various legacy data warehouses, big data implementations, and cloud platforms/migrations. I am a Google Certified Professional Cloud Architect. You can reach out to me on LinkedIn if you need any further help with certification or GCP implementations!