Mastering BigQuery: Optimisation and Best Practices
- Katherine Lim
- Apr 11
- 6 min read
Updated: Apr 15
Best Practices for Optimising Queries in BigQuery
BigQuery is a powerful cloud-based data warehousing service from Google Cloud. However, without optimised queries, organisations can experience significant time and monetary losses. Inefficient queries consume excessive processing resources, leading to longer execution times and increased costs associated with BigQuery's on-demand pricing model. Moreover, poorly structured queries can hinder data analysis workflows, delaying critical business decisions and impacting operational efficiency. Consequently, optimising your queries is crucial for both performance and cost-effectiveness, ensuring that you extract maximum value from your data while minimising unnecessary expenditure. Here are some key best practices to consider.

1. Data Selection
Avoid SELECT *:
Only select the necessary columns.
This minimises the data transferred and processed, which directly reduces query cost.
Filter Data Early:
Apply filters in the WHERE clause as early as possible to reduce the amount of data scanned.
Create a Search Index:
Query performance can be greatly improved with an index because less data is scanned.
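As a rough sketch (the table, column, and index names below are placeholders, not from a real dataset), a search index is created once and then used through the SEARCH function:
-- Create a search index over all string columns of a hypothetical logs table.
CREATE SEARCH INDEX IF NOT EXISTS logs_search_idx
ON `project.dataset.application_logs` (ALL COLUMNS);

-- Queries that use SEARCH can then probe the index instead of scanning every row.
SELECT timestamp, message
FROM `project.dataset.application_logs`
WHERE SEARCH(message, 'timeout');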
2. Partitioning and Clustering Tables
Partitioning: Dividing Data by Time or Range
Partitioning divides tables into smaller, manageable chunks based on a column (usually a DATE, TIMESTAMP, or INTEGER range).
Enables faster queries by allowing BigQuery to only scan relevant partitions.
Important: Always filter queries on the partitioning column to effectively leverage partitioning.
Set the Require partition filter option when you create a partitioned table. This requires every query on the table to include a WHERE clause that filters on the partitioning column, which can improve performance and reduce costs because BigQuery can use the filter to prune partitions that don't match the predicate.
Use Case: Time-Series Data Analysis (e.g., Website Logs, Financial Transactions)
Scenario: A company stores daily website traffic logs in BigQuery, with a timestamp column indicating when each event occurred. They frequently analyze website activity for specific date ranges (e.g., daily, weekly, monthly reports).
Implementation: Partition the table by the timestamp column (using DATE partitioning).
Benefit: When querying for a specific date range, BigQuery only scans the relevant partitions, leading to faster query execution and lower costs.
The example query below shows a list of records between two dates.
SELECT *
FROM `project.dataset.website_logs`
WHERE DATE(timestamp) BETWEEN '2023-10-20' AND '2023-10-27';
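For illustration, a simplified version of such a table could be created as follows (the schema is assumed, and the Require partition filter option mentioned earlier is enabled):
-- Partition the logs table by the calendar date of the event timestamp.
CREATE TABLE `project.dataset.website_logs`
(
  timestamp TIMESTAMP,
  page STRING,
  user_id STRING
)
PARTITION BY DATE(timestamp)
OPTIONS (require_partition_filter = TRUE);  -- every query must filter on the partitioning column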
Use Case: Range-Based Data Segmentation (e.g., Customer IDs, Product Categories)
Scenario: An e-commerce platform stores customer order data, including a customer_id column. They need to analyse order patterns for specific customer ID ranges (e.g., high-value customers, new customers).
Implementation: Partition the table by the customer_id column (using INTEGER range partitioning).
Benefit: Queries targeting specific customer ID ranges will only scan the relevant partitions, improving performance.
The example query below lists records in a specific range.
SELECT *
FROM `project.dataset.customer_orders`
WHERE customer_id BETWEEN 10000 AND 20000;
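A sketch of the corresponding table definition, assuming for illustration that customer IDs fall between 0 and 1,000,000 and are bucketed in steps of 10,000:
-- Integer-range partitioning: customer_id values are grouped into buckets of 10,000.
CREATE TABLE `project.dataset.customer_orders`
(
  customer_id INT64,
  order_id STRING,
  order_total NUMERIC
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000));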
Use Case: Data Archiving
Scenario: A company wants to keep historical data in BigQuery but needs to optimise storage costs. Older data is accessed far less frequently.
Implementation: Partition the table by a date column. Then, you can set partition expiration to delete older partitions automatically.
Benefit: Reduced storage costs for older, less accessed data.
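For example, reusing the hypothetical website_logs table from above, partitions older than roughly two years could be expired automatically:
-- Automatically delete partitions older than 730 days.
ALTER TABLE `project.dataset.website_logs`
SET OPTIONS (partition_expiration_days = 730);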
Clustering: Sorting Data within Partitions
Clustering sorts the data in each partition based on one or more columns.
Clustering improves performance for frequently run queries because BigQuery scans only the blocks that match the filter.
Cluster on columns frequently used in WHERE, GROUP BY, and JOIN clauses.
Use Case: Filtering and Aggregation on Specific Columns (e.g., Product Category, User Region)
Scenario: An online retail company stores sales data with columns for product_category, user_region, and sale_amount. They frequently analyse sales by product category and user region.
Implementation: Cluster the table (or partitions) by product_category and user_region.
Benefit: Queries that filter or aggregate by these columns will perform significantly faster, as BigQuery can efficiently locate the relevant data blocks within each partition.
The example query below filters on user region and aggregates by product category.
SELECT product_category, SUM(sale_amount)
FROM `project.dataset.sales_data`
WHERE user_region = 'AU'
GROUP BY product_category;
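A possible table definition for this scenario (the schema and the sale_date partitioning column are assumptions added for illustration):
-- Partition by sale date and cluster by the columns used most often in filters and GROUP BY.
CREATE TABLE `project.dataset.sales_data`
(
  sale_date DATE,
  product_category STRING,
  user_region STRING,
  sale_amount NUMERIC
)
PARTITION BY sale_date
CLUSTER BY product_category, user_region;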
Use Case: Improving Join Performance
Scenario: A data warehouse contains two tables: orders and customers. They are often joined on the customer_id column.
Implementation: Cluster both tables (or partitions) by customer_id.
Benefit: Joining the tables on the clustered column will be faster, as BigQuery can efficiently locate matching rows.
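A minimal sketch, with simplified schemas assumed for both tables:
-- Cluster both sides of the join on the join key.
CREATE TABLE `project.dataset.customers`
(
  customer_id INT64,
  customer_name STRING
)
CLUSTER BY customer_id;

CREATE TABLE `project.dataset.orders`
(
  order_id STRING,
  customer_id INT64,
  order_total NUMERIC
)
CLUSTER BY customer_id;

-- Joins on customer_id can now locate matching blocks more efficiently.
SELECT c.customer_name, SUM(o.order_total) AS total_spend
FROM `project.dataset.orders` AS o
JOIN `project.dataset.customers` AS c USING (customer_id)
GROUP BY c.customer_name;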
Use Case: Data Locality
Scenario: A table contains certain values that are queried far more often than others.
Implementation: Cluster the table by the frequently queried columns.
Benefit: BigQuery will store like values together, reducing the amount of data that needs to be scanned.
3. Cost Estimation and Control
Estimate Query Costs:
Use the dry run feature in the BigQuery web UI (at the top right of the query editor, the green tick shows a "This query will process … bytes when run" message) or the --dry_run flag in the bq command-line tool to estimate query costs before execution. For a quick estimate, multiply the data volume reported by the on-demand price: for example, at $8.13 AUD per TiB, a query that will process 2 TiB costs roughly 2 × 8.13 ≈ $16.26 AUD, which you can check against your budget. On-demand compute pricing can be found on the BigQuery Pricing page.
To create a monthly cost estimate, use the Google Cloud pricing calculator.
"LIMIT" Clause:
While LIMIT restricts the number of rows returned, it doesn't always reduce cost because it does not change the amount of data that is read.
It only reduces the bytes scanned when the table is clustered, so that BigQuery can stop reading blocks once enough matching rows have been found.
Quotas and Limits:
To stop ad hoc user queries from exhausting a project-level quota, set up a separate project with its own quotas for ad hoc or interactive queries. Remove user access from the primary project and allow only the service account to use BigQuery there.
Apply quotas and limits to projects to prevent unexpected costs.
Custom daily query quotas can be set to limit the amount of data processed per day per project or per user.
This helps you avoid unexpected costs and optimise resource allocation.
Billing Budgets and Alerts:
Use Cloud Billing Budgets to monitor all your BigQuery charges in the same place.
Set a budget amount to track actual costs against planned costs.
Budget alert threshold rules can trigger email notifications to catch any cost spikes.
Track query costs over time to identify trends and areas for optimisation. One way is to query the INFORMATION_SCHEMA.JOBS view. The example query below lists queries billed for more than a terabyte.
SELECT
creation_time,
project_id,
user_email,
job_type,
statement_type,
start_time,
end_time,
query,
total_bytes_processed,
total_slot_ms,
total_bytes_billed
FROM `[project_id].region-australia-southeast1.INFORMATION_SCHEMA.JOBS`
WHERE creation_time > "{from_date}"
AND total_bytes_billed > 1000000000000 -- more than 1 TB billed
ORDER BY total_bytes_billed DESC
LIMIT {limit}
Automate alerts that email the individual users responsible, informing them of the cost issue.
Change the storage billing model:
Switch from Logical Bytes Storage Billing (LBSB) to Physical Bytes Storage Billing (PBSB) so that datasets which compress well are billed on their compressed (physical) size, which can significantly reduce storage costs without compromising data accessibility or query performance.
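Assuming a dataset named project.dataset, the billing model can be switched with a single DDL statement:
-- Switch an existing dataset to the physical (compressed) storage billing model.
ALTER SCHEMA `project.dataset`
SET OPTIONS (storage_billing_model = 'PHYSICAL');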
4. Query Optimisation Techniques
Use Views:
Views can reduce the need to rewrite complex queries, leading to cleaner code and potentially faster execution, especially for repeated complex query patterns.
By pre-processing data, views can filter and transform it, reducing the amount of data that subsequent queries need to process, resulting in faster execution times.
Pre-compute and store frequently used queries as materialised views. This can significantly accelerate queries that repeatedly access the same aggregated data, as BigQuery can retrieve the results directly from the stored view instead of re-running the underlying query, improving performance and reducing execution time (see the sketch below).
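As an illustrative sketch, reusing the hypothetical sales_data table from the clustering example, a daily sales aggregate could be materialised like this:
-- Pre-compute a frequently requested aggregation as a materialised view.
CREATE MATERIALIZED VIEW `project.dataset.daily_sales_by_category` AS
SELECT
  sale_date,
  product_category,
  SUM(sale_amount) AS total_sales
FROM `project.dataset.sales_data`
GROUP BY sale_date, product_category;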
Consider Table Sharding:
Distribute data across multiple tables for improved scalability and query performance.
However, sharded tables can add metadata access overhead that reduces performance, so partitioning is recommended over sharding.
Existing date-sharded tables can be converted into an ingestion-time partitioned table for better query performance.
Leverage Query Caching:
Cache frequently used query results to improve performance and reduce query latency.
By default BigQuery caches results for 24 hours.
Queries that return cached results are not charged.
5. Performance Insights
Analyse Query Performance:
Use BigQuery's built-in query profiling tools to identify performance bottlenecks.
View the Execution Graph to see a graphical representation of each stage of the query.
Analyse query plans to understand how BigQuery is executing your queries.
Review query performance insights callouts for tips on how to improve query performance.
Things to Avoid
Scanning Entire Partitioned Tables: Avoid scanning entire partitioned tables without filtering on the partitioning column. This can lead to significant cost increases.
Over-partitioning: Avoid over-partitioning tables, as it can increase storage costs and overhead.
Unnecessary Joins: Avoid large joins, especially cross joins, which can significantly impact query performance.
Conclusion
By following these best practices, you can significantly improve the performance and cost-effectiveness of your BigQuery queries, enabling you to gain valuable insights from your data more efficiently. Remember that continuous monitoring and optimisation are key to maintaining efficient and cost-effective data analysis.
About Innablr
Discover the future of cloud technology with Innablr, your premier consultancy specialising in delivering cutting-edge solutions for businesses. With a rich track record of successful migrations and transformational projects across various industries, we stand at the forefront of the digital revolution. Our team of seasoned professionals combines deep technical knowledge with strategic insights to guide our clients through every step of their cloud journey seamlessly.
Katherine Lim, Lead Engineer @ Innablr