
Useful tips on optimising your code

Often, we’re eager to tackle complex challenges in programming, but laying a solid foundation in the fundamentals ensures our more advanced code functions seamlessly. No matter how long you’ve been writing SQL, it’s easy to overlook the basics.

With that in mind, in this series of 10 blogs I would like to share my favourite tips for mastering BigQuery like a pro. These insights aim to reconnect you with essential practices that might have slipped your mind, ensuring your projects are not just ambitious but also impeccably executed.

1: BigQuery Dremel

BigQuery’s exceptional speed in processing massive queries is largely due to Dremel, its foundational technology. Dremel enables BigQuery to process data by columns rather than by rows, which is a departure from the row-based data storage you might be accustomed to with SQL Server and other databases. This columnar processing means that data retrieval is optimised for speed and efficiency, particularly beneficial for analytical queries.

However, to fully leverage BigQuery's capabilities, it's crucial to adapt your approach to data management. This includes minimising the number of columns your queries read, since each additional column adds to the data scanned, and reconsidering habits such as relying on LIMIT, which does not reduce cost in a columnar storage context. By aligning your practices with BigQuery's strengths, you can achieve more efficient data processing and quicker insights.

This is still one of my favourite articles on what's under the hood of BigQuery, and I highly recommend giving it a read.

2: LIMIT does not reduce the amount of data processed

In many SQL dialects, using LIMIT can reduce the amount of data your query processes, saving both time and cost. However, this isn't the case with BigQuery. Because of its columnar structure, BigQuery scans all of the data your query references before applying the LIMIT clause, so the bytes processed, and therefore the cost, remain the same regardless.

This makes it crucial to rethink when and why you use LIMIT.


Personally, I find LIMIT most useful for scenarios where I need to sort data and retrieve only the top X records. It’s a great tool for these specific cases, but it’s important to understand that it won’t help reduce your query’s cost in BigQuery. 
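For example, pairing ORDER BY with LIMIT is a clean way to fetch a top-N result. This is a sketch; the table and column names are hypothetical:

```sql
-- LIMIT trims the result set, not the bytes scanned:
-- BigQuery still reads every referenced column in full.
SELECT
  customer_id,
  order_total
FROM `my-project.sales.orders`
ORDER BY order_total DESC
LIMIT 10;  -- the 10 largest orders, billed as a full scan of both columns
```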

3: Do not use SELECT *

While it may be tempting to use SELECT * to quickly grab all columns from a table, this approach can significantly increase the cost of your queries in BigQuery. Remember, BigQuery charges based on the amount of data processed, and selecting unnecessary columns means processing more data than needed. Instead, it’s wise to carefully consider which columns you actually need for your analysis. Developing a clear analysis plan can help with this, guiding you to identify the specific data and columns that will answer your key questions.
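As a quick sketch of the difference, using the small `bigquery-public-data.samples.shakespeare` sample table that Google provides:

```sql
-- Scans every column in the table; you are billed for all of them.
SELECT *
FROM `bigquery-public-data.samples.shakespeare`;

-- Scans only the two columns the analysis actually needs.
SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
WHERE corpus = 'hamlet';
```

The filter in WHERE does not reduce what you pay for; only the list of columns you select does.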

Fortunately, BigQuery offers robust preview and schema viewing capabilities for all native tables, making it straightforward and cost-free to explore the contents of your tables before running your queries. By leveraging these features, you can gain a deep understanding of your data’s structure and ensure that your queries are both cost-effective and tailored to your analysis needs.

4: Use dry run

Utilising BigQuery’s dry run functionality before executing your queries is a simple yet crucial step in managing your data analysis costs effectively. A dry run allows you to estimate the amount of data your query will process, without incurring any charges or actually running the query. If you notice that a dry run estimates more than 10GB of data processing, it’s a strong signal to reevaluate and optimise your code.
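Outside the console's built-in validator, one way to run a dry run is the `bq` command-line tool, whose `--dry_run` flag reports the bytes a query would process without executing it. The table name below is hypothetical:

```shell
# Estimates bytes processed without running the query or incurring charges.
bq query --use_legacy_sql=false --dry_run \
  'SELECT order_id, order_total FROM `my-project.sales.orders`'
```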

This practice is especially important when integrating BigQuery with external tools like Python or Looker Studio, as inefficient queries can lead to unnecessary spending regardless of the interface used.

Optimising a query may involve refining your selection criteria, limiting the scope of data processed, or leveraging more efficient functions and joins. Beyond cost savings, these optimisations can also improve query performance and execution speed, enhancing your overall data analysis workflow.

Remember, the goal is not just to reduce costs but to ensure that your queries are as efficient and effective as possible, aligning with your analytical objectives while minimising resource consumption.

5: DISTINCT is not your friend

The DISTINCT keyword is a powerful tool in SQL for removing duplicate rows from your results, making it invaluable for de-duplicating data. However, it’s essential to understand why duplicates exist in your dataset in the first place. Duplicates might not always be true duplicates; they could be updated records, entries with minor misspellings, or records that differ only in timestamps and might be better handled through aggregation.

Before resorting to DISTINCT, thoroughly examine your data to understand the nature of these duplicates. For instance, what seems like duplicate entries might represent data points collected at different times, indicating a need for a more nuanced approach to data cleaning.

An effective alternative to blanket de-duplication is the QUALIFY clause, which filters rows on the result of a window function and gives you precise control over which duplicates to remove based on specific criteria. This can be particularly useful when you want to keep only the most recent record, or when apparent duplicates differ in meaningful ways.
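A minimal sketch of keeping only the latest row per key with QUALIFY; the table and column names are hypothetical:

```sql
-- Keep only the most recent record for each customer_id.
-- ROW_NUMBER() numbers the rows within each partition, newest first;
-- QUALIFY then filters directly on that window result.
SELECT *
FROM `my-project.crm.customers`
WHERE TRUE  -- BigQuery expects a WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY updated_at DESC
) = 1;
```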

When using DISTINCT or QUALIFY, it’s a best practice to include comments in your code explaining why you’re removing duplicates in this manner. This not only aids in your understanding when revisiting the code but also helps others who may work with your queries to grasp the logic behind your data-cleaning process.

In summary, while DISTINCT can clean your data effectively, a deeper understanding of your data’s unique characteristics and strategic use of functions like QUALIFY can lead to more accurate and insightful data analysis.
