Class: OnlineMigrations::DataMigrations::ScrubNullClickIdsFromVisitMarketingMeta

Inherits:
OnlineMigrations::DataMigration
  • Object
show all
Defined in:
lib/online_migrations/data_migrations/scrub_null_click_ids_from_visit_marketing_meta.rb

Overview

Strips literal JSON-null keys (gclid, gbraid, wbraid) from
visits.marketing_meta. Pairs with the tracker fix in PR #878
(Tracking::Tracker was routing nil click-ids through
data[:marketing_meta_*], which the jsonb_attributes accessor
persisted as {gclid: null} rather than dropping the key).

Scope

~46.8K rows at write time on a ~10.1M-row, ~7.9GB visits table.
Way under the "must be background" threshold by row count, but
visits is a high-write table — the single-statement UPDATE
variant of this migration would hold its transaction open long
enough to interfere with autovacuum and produce contention on the
marketing_meta GIN index. Splitting into batched UPDATE … WHERE id IN (…) statements keeps each transaction sub-second and lets
autovacuum/lock-retries breathe.

Mirrors the BackfillVisitMetaJsonb shape so the monitoring +
operational pattern is the same one ops/dev are already familiar
with from the Phase 0 backfill.

Idempotency

Each batch's UPDATE uses PG's - operator (marketing_meta - 'gclid' - 'gbraid' - 'wbraid') which is a no-op on rows where the
named keys aren't present. The WHERE clause filters batches to
the still-polluted rows; re-running after a partial drain only
touches the remainder.

Monitor:
OnlineMigrations::BackgroundDataMigrations::Migration
.find_by(migration_name: 'ScrubNullClickIdsFromVisitMarketingMeta')

See Also:

Constant Summary collapse

BATCH_SIZE =

Rows per batched SQL UPDATE. Tuned the same as
BackfillVisitMetaJsonb — a 5K-row batch finishes well under
the lock-timeout/autovacuum windows on the primary and stays
below PostgreSQL's prepared-statement parameter limits.

5_000
POLLUTED_PREDICATE =

Filter for rows where the JSONB accessor wrote a literal JSON
null for any of the three click-id keys. Selected as a module
constant so #collection and #count share the exact predicate.

<<~SQL.squish
  marketing_meta -> 'gclid'  = 'null'::jsonb
  OR marketing_meta -> 'gbraid' = 'null'::jsonb
  OR marketing_meta -> 'wbraid' = 'null'::jsonb
SQL

Instance Method Summary collapse

Instance Method Details

#collectionArray<Array<Integer>>

Yields one ID batch per BATCH_SIZE rows that still carry at
least one JSON-null click-id key. Eagerly materialized to an
Array because OnlineMigrations::MigrationJob#build_enumerator
only handles ActiveRecord::Relation, BatchEnumerator, and
Array from #collection — a raw Enumerator would wedge the
migration at status='enqueued'.

Returns:

  • (Array<Array<Integer>>)

    one array of IDs per batch



64
65
66
67
68
69
70
71
72
# File 'lib/online_migrations/data_migrations/scrub_null_click_ids_from_visit_marketing_meta.rb', line 64

def collection
  batches = []
  ::Visit.where(POLLUTED_PREDICATE)
         .in_batches(of: BATCH_SIZE) do |relation|
    ids = relation.pluck(:id)
    batches << ids if ids.any?
  end
  batches
end

#countInteger

Drives the online_migrations progress bar — counts rows still
carrying a JSON-null click-id key.

Returns:

  • (Integer)


105
106
107
# File 'lib/online_migrations/data_migrations/scrub_null_click_ids_from_visit_marketing_meta.rb', line 105

def count
  ::Visit.where(POLLUTED_PREDICATE).count
end

#process(ids) ⇒ void

This method returns an undefined value.

Strips the three click-id keys from one batch of visits in a
single SQL UPDATE. PG's jsonb - text operator removes the
named key, no-op when the key isn't present — so the statement
is safe on rows where the tracker had only stamped one or two of
the three keys as null.

Parameters:

  • ids (Array<Integer>)

    visit IDs to scrub in this batch.



82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
# File 'lib/online_migrations/data_migrations/scrub_null_click_ids_from_visit_marketing_meta.rb', line 82

def process(ids)
  return if ids.empty?

  # `ids` are integers from `Visit#pluck(:id)` in `#collection`, so
  # the `IN (…)` interpolation is safe by construction — no
  # untrusted input can reach the SQL string. We interpolate (not
  # bind) because the batch contains up to BATCH_SIZE (5_000) IDs,
  # which pushes against PG's prepared-statement parameter limits
  # at the higher end of the per-iteration fan-out.
  sql = <<~SQL.squish
    UPDATE visits
    SET marketing_meta = marketing_meta - 'gclid' - 'gbraid' - 'wbraid'
    WHERE id IN (#{ids.join(',')})
      AND (#{POLLUTED_PREDICATE})
  SQL

  ApplicationRecord.with_connection { |conn| conn.execute(sql) }
end