Scenario:
When running a query like the one below, it takes quite a long time to get results from BigQuery, even though we know that querying Bigtable directly with a given rowkey is fast.
Query:
SELECT
  bt.rowkey,
  bt.column1,
  ...,
  bt.column14
FROM
  `bq_dataset_mine.my_bq_small_table_with_rowkeys` sm
JOIN
  `bq_dataset_mine.my_bq_external_table_on_bt_table1` bt
ON
  bt.rowkey = sm.rowkey
Is there a way to speed up this BigQuery query, which joins a small BigQuery table of rowkeys against the larger BigQuery external table over Bigtable (BT)?
Here is what my query plan looks like in BigQuery:
Step details
READ
$1:rowkey, $2:my_cf1.cq1.cell.timestamp, $3:my_cf1.cq2.cell.value, $4:my_cf2.cq3.cell.timestamp, $5:my_cf1.cq4.cell.value, ..., $11:my_cf1.cq10.cell.value, ...
FROM my-proj.my-dset.my_bt_ext_table
READ
$30
FROM __stage00_output
JOIN
$40 := $1, $41 := $2, $42 := $3, $43 := $4, $44 := $5, $45 := $6, $46 := $7, $47 := $8, $48 := $9, $49 := $10, $50 := $11, $51 := $12, $52 := $13, $53 := $14, $54 := $15, $55 := $16
SEMI HASH JOIN EACH WITH ALL ON $1 = $30
WRITE
$40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55
TO __stage01_output
Shuffle
Shuffle output bytes 2.2 MiB
Looking at the plan, the explanation might be that we are selecting a lot of columns from the BT table rather than just testing for the existence of the rowkey, so all of those columns have to be stored in the hash table for the semi hash join, which would drive up both time and cost.
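A cheap way to test that hypothesis would be to time a variant of the query that projects only the rowkey, so the hash table carries no BT value columns (a sketch using the same table names as above):

SELECT
  bt.rowkey  -- existence check only: no BT value columns are selected
FROM
  `bq_dataset_mine.my_bq_small_table_with_rowkeys` sm
JOIN
  `bq_dataset_mine.my_bq_external_table_on_bt_table1` bt
ON
  bt.rowkey = sm.rowkey

If this variant runs noticeably faster, that would support the column-materialization explanation.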
If someone from the Google Bigtable or BigQuery team could look at the plan and see whether there is a possibility of speeding this up, that would be great.
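One more experiment I am considering, assuming BigQuery can push a literal row-key filter down to Bigtable the way it does for equality predicates on the rowkey (I have not verified this for IN lists): inline the rowkeys as constants instead of joining. The rowkey values below are placeholders.

SELECT
  bt.rowkey,
  bt.column1,
  bt.column14  -- remaining columns elided, as in the query above
FROM
  `bq_dataset_mine.my_bq_external_table_on_bt_table1` bt
WHERE
  bt.rowkey IN ('key1', 'key2')  -- literals copied from my_bq_small_table_with_rowkeys

This would only be practical while the small table stays small enough to inline its rowkeys.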