Google BigQuery – Retrieving large datasets

Create a Cycle to request a large set of data

This article explains how to retrieve large amounts of data from Google BigQuery.

Attempting to retrieve it in a single Cycle run can result in more data than can be processed in one Transaction so it can be necessary to retrieve it in chunks. See Cyclr’s Payload Size Limitations for more details.

To retrieve large amounts of data, you can create a Cycle to retrieve the data, process it as you need by sending it to another system, then have the Cycle re-trigger itself either immediately if there’s more data to retrieve, or after a pause by using a Delay Step. Re-triggering is handled by the Cycle starting with a Webhook Step and ending with a POST Step which is configured to make a call back to the initial Webhook Step, creating a looping integration:

When starting the above Cycle running, it won’t actually do anything until its Webhook Step receives a request. You can trigger that by clicking the Test Step button on the final POST Step, causing it to send a request to the Webhook Step and the first run to start shortly after.

The following Connector Methods in the Google BigQuery Connector’s Tables Method Category can be used to request large sets of data from the Google BigQuery API by requesting the table data across multiple Cycle runs.

Connector Methods:

Note: To ensure all data is retrieved, the List Table Data Incrementally (Per Page) Method should be used. The other Methods can be affected by the order of data within Google BigQuery changing, resulting in data potentially being missed or retrieved more than once.

The prerequisites below must be completed before following the specific instructions for your chosen method.

Finally, you can configure a Delay Step to avoid the Cycle re-triggering itself very quickly when there’s no further data to be retrieved, resulting in unnecessary work that can affect the speed of your other Cycles.

Prerequisites

  • Install and authenticate a Google BigQuery Connector.
  • Create a Table Custom Object Category that targets the required BigQuery table.
  • Install and setup a Generic Webhook Connector to allow your Cycle to re-trigger itself.

“List Table Data Incrementally (Per Page)” Method

This Method uses a “Next Page Token” property to continue paging across multiple Cycle runs.

Note: It is recommended that you use this Method as it will reliably retrieve all data held in a Google BigQuery table. The other Methods described on this page can be affected by the order of the data changing, resulting in some being missed.

Add a request field to the “POST” method

From the Edit Connector page of a Generic Webhook Connector:

  1. Under the Methods and Fields heading, select HTTP Methods.
  2. Expand the POST Method.
  3. Under the Request Fields heading, select the Add Field icon.
  4. Set the Field Location to nextPageToken.
  5. Set the Display Name to “Next Page Token”.
  6. Set the Data Type to Text.
  7. Select Create.

Add a request field to the “Webhook” method

From the Edit Connector page of a Generic Webhook Connector:

  1. Under the Methods and Fields heading, select Webhooks.
  2. Expand the Webhook Method.
  3. Under the Request Fields heading, select the Add Field icon.
  4. Set the Field Location to nextPageToken.
  5. Set the Display Name to “Next Page Token”.
  6. Set the Data Type to Text.
  7. Select Create.

Build a Cycle

Add the following methods to a Cycle:

  • Generic Webhook > Webhooks > Webhook
  • Generic Webhook > HTTP Methods > POST
  • Google BigQuery > Tables > List Table Data Incrementally (Per Page)
  • Tools > Delay: You can use a Delay Step to set how long the Cycle waits between executions once all current table data has been retrieved.

Setup the “POST” to “Webhook” loop

The POST Step must target the Webhook Step to allow the Cycle to send a request to itself to trigger another run each time it completes so set the Webhook’s URL as the URL field on the POST Step.

Pass on the Next Page Token returned

To enable the Cycle to continue from where a previous run reached:

  • on the List Table Data Incrementally (Per Page) Step: map the Webhook’s Next Page Token field to the Next Page Token field.
  • on the POST Step: map the Next Page Token value returned by the List Table Data Incrementally (Per Page) Step to the Next Page Token field.

Configure the Pages Per Transaction setting

The “Pages Per Transaction” field determines how many table pages are returned per Cycle execution. You need to manually set Pages Per Transaction, which is dependent on the “width” of your table data (so the number of columns it contains, which then has an impact on the size of the data being retrieved). An example value to use could be 10.

“List Partial Table Data” Method

This Method requires a “last index” value to be provided to continue paging across multiple Cycle runs.

Add a request field to the “POST” method

From the Edit Connector page of a Generic Webhook Connector:

  1. Under the Methods and Fields heading, select HTTP Methods.
  2. Expand the POST Method.
  3. Under the Request Fields heading, select the Add Field icon.
  4. Set the Field Location to lastIndex.
  5. Set the Display Name to “Last Index”.
  6. Set the Data Type to Integer.
  7. Select Create.

Add a request field to the “Webhook” method

From the Edit Connector page of a Generic Webhook Connector:

  1. Under the Methods and Fields heading, select Webhooks.
  2. Expand the Webhook Method.
  3. Under the Request Fields heading, select the Add Field icon.
  4. Set the Field Location to lastIndex.
  5. Set the Display Name to “Last Index”.
  6. Set the Data Type to Integer.
  7. Select Create.

Build a Cycle

Add the following methods to a Cycle:

  • Generic Webhook > Webhooks > Webhook
  • Generic Webhook > HTTP Methods > POST
  • Google BigQuery > Tables > List Partial Table Data
  • Tools > Delay: You can use a Delay Step to set how long the Cycle waits between executions once all current table data has been retrieved.

Setup the “POST” to “Webhook” loop

The POST Step must target the Webhook Step to allow the Cycle to send a request to itself to trigger another run each time it completes so set the Webhook’s URL as the URL field on the POST Step.

Pass on the Last Index returned

To enable the Cycle to continue from where a previous run reached:

  • on the List Partial Table Data Step: map the Webhook’s Last Index field to the Start Index field.
  • on the POST Step: map the Last Row Index value returned by the List Partial Table Data Step to the Last Index field.

Configure the Rows Per Execution setting

The “Rows Per Execution” field determines how many table rows are returned per Cycle execution. You need to manually set Rows Per Execution, which is dependent on the “width” of your table data (so the number of columns it contains, which then has an impact on the size of the data being retrieved). An example value to use could be 10000.

  1. Select the Step Setup of the List Partial Table Data Step.
  2. Set Rows Per Execution to the number of rows to retrieve on each run, e.g. 10000.

“List Table Data Incrementally” Method

This Method automatically tracks the last index property to continue paging across multiple Cycle runs.

Build a Cycle

Add the following methods to a Cycle:

  • Generic Webhook > Webhooks > Webhook
  • Generic Webhook > HTTP Methods > POST
  • Google BigQuery > Tables > List Table Data Incrementally
  • Tools > Delay: You can use the Delay tool to set the time between requests once all current table data has been retrieved.

Setup the “POST” to “Webhook” loop

The POST Step must target the Webhook Step to allow the Cycle to send a request to itself to trigger another run each time it completes so set the Webhook’s URL as the URL field on the POST Step.

Configure the Rows Per Execution setting

The “Rows Per Execution” field determines how many table rows are returned per Cycle execution. You need to manually set Rows Per Execution, which is dependent on the “width” of your table data (so the number of columns it contains, which then has an impact on the size of the data being retrieved). An example value to use could be 10000.

  1. Select the Step Setup of the List Table Data Incrementally Step.
  2. Set Rows Per Execution to the number of rows to retrieve on each run, e.g. 10000.

“List New Table Data Incrementally” Method

This Method automatically tracks the last index property to continue paging across multiple Cycle run.

Build a Cycle

Add the following methods to a Cycle:

  • Generic Webhook > Webhooks > Webhook
  • Generic Webhook > HTTP Methods > POST
  • Google BigQuery > Tables > List New Table Data Incrementally
  • Tools > Delay: You can use a Delay Step to set how long the Cycle waits between executions once all current table data has been retrieved.

Setup the “POST” to “Webhook” loop

The POST Step must target the Webhook Step to allow the Cycle to send a request to itself to trigger another run each time it completes so set the Webhook’s URL as the URL field on the POST Step.

Configure the Rows Per Execution setting

The “Rows Per Execution” field determines how many table rows are returned per Cycle execution. You need to manually set Rows Per Execution, which is dependent on the “width” of your table data (so the number of columns it contains, which then has an impact on the size of the data being retrieved). An example value to use could be 10000.

  1. Select the Step Setup of the List New Table Data Incrementally Step.
  2. Set Rows Per Execution to the number of rows to retrieve on each run, e.g. 10000.

Configure a Delay Step

Use a Delay Step to set the time between cycle executions once the cycle retrieves all of the table data. When new table data is found, the cycle continues to run until it retrieves all table data and then reverts back to the delay.

For example, a delay duration of 12 hours makes the Cycle check for new table data once every 12 hours once it’s retrieved all of the current table data.