A curated selection of new BigQuery features, offering a high-level overview of what they are, when to use them, and when to avoid them—in easy-to-digest, bite-sized chunks with helpful links.
What is it?
Fine-grained DML allows for targeted updates and deletes within specific rows of a BigQuery table based on defined conditions.
Why is it useful?
It provides precise control over data manipulation, enabling efficient changes without modifying unaffected data, which is crucial for maintaining data integrity in large datasets.
When to use it?
Use fine-grained DML when you need to update or delete a subset of records, such as marking inactive users based on a "last login" date condition.
When not to use it?
Avoid using fine-grained DML when working with massive updates or deletes on entire tables, as it may be less efficient than using bulk operations or partitioning strategies.
October 2024
What is it?
BigQuery allows you to load geospatial data stored in Parquet format from Cloud Storage directly into tables, enabling the storage and analysis of spatial data using BigQuery’s native GEOGRAPHY data type.
Geospatial data refers to information related to locations on the Earth's surface, such as coordinates, polygons, and areas, often used in mapping and location-based services.
Parquet format is a columnar storage format optimized for performance, compression, and efficient querying of large datasets.
Why is it useful?
It simplifies the process of handling large-scale geospatial datasets by directly supporting the Parquet format, which is designed for high efficiency in both storage and querying. This allows users to perform advanced spatial analyses, such as distance calculations or area comparisons, within BigQuery while leveraging the efficiency of Parquet.
When to use it?
Use this when you need to load large geospatial datasets, like city maps or geographic features (e.g., roads, buildings), from Cloud Storage to BigQuery for analysis, such as calculating distances between two locations or identifying regions within a boundary.
When not to use it?
Avoid using this method if your data is already in a simpler format better suited to your analysis needs or if your geospatial data doesn't require complex analysis and can be managed with simpler data storage and querying methods. Additionally, if your dataset is very small, using Parquet may add unnecessary complexity.
What is it?
The pipe syntax (|>), available in preview mode, is a feature in BigQuery that allows chaining of SQL queries and transformations in a more streamlined way, similar to how pipelines work in functional programming. It passes the result of one query as input to the next without needing intermediate steps.
Why is it useful?
It makes queries more modular, readable, and easier to manage by breaking down complex operations into simpler, step-by-step transformations. For R and Python programmers, the pipe syntax feels familiar because it's similar to the pipe operators (%>% in R or |> in Python), making it easier for them to adopt the syntax and approach within BigQuery.
When to use it?
Use pipe syntax when performing a series of transformations or operations that can be logically broken into stages, such as filtering rows, applying aggregations, and sorting the results, all while keeping the query readable and concise. For example, use it to first filter a dataset, then aggregate results, and finally sort them.
When not to use it?
Avoid using pipe syntax if your query is simple and doesn't involve multiple transformations, or if you are not familiar with chaining operations in your workflow. For very basic queries, the pipe syntax might add unnecessary complexity.
This feature is currently in preview mode, and you must apply to test it. To use this feature, you need to submit a request to gain access and ensure it's activated in your BigQuery environment.
What is it?
BigQuery supports querying external datasets that reside in Cloud Spanner, allowing you to run SQL queries across both BigQuery and Spanner without needing to replicate the data. Spanner is a fully managed, scalable, globally-distributed database service, and this integration enables direct access to its data from BigQuery.
Why is it useful?
It reduces the need for data duplication, allowing you to work with large datasets stored in Spanner while leveraging BigQuery’s analytics capabilities. This integration simplifies workflows by enabling seamless querying of data across different systems, reducing overhead and improving performance for cross-database analysis.
Why store data in Spanner?
Cloud Spanner is ideal for storing transactional or operational data that requires global consistency, high availability, and horizontal scalability. It’s particularly useful for applications that require strong consistency across regions, such as financial systems, inventory management, or critical business operations.
When to use it?
Use Spanner external datasets when you have transactional or operational data in Spanner that you want to analyse using BigQuery’s powerful querying and analytics functions without having to move or copy the data. For example, analyzing real-time inventory data stored in Spanner alongside historical data in BigQuery.
When not to use it?
Avoid using Spanner external datasets if your data resides entirely within BigQuery or if the performance impact of querying across systems outweighs the benefits of integration. Also, consider avoiding it when latency is a critical concern, as querying across external datasets might introduce additional delays.
