Applications – Bigtable paper

1. Production Snapshot (from Table 2)

Project	Table size (TB, pre-compression)	Compression ratio	# Cells (billions)	# Families	# Locality groups	% in memory	Latency-sensitive?
Google Analytics	20	29%	10	1	1	0%	Yes
Google Analytics	200	14%	80	1	1	0%	Yes
Google Earth	0.5	64%	8	7	2	33%	Yes
Google Earth	70	–	9	8	3	0%	No
Personalized Search	4	47%	6	93	11	5%	Yes

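Extracted from the paper’s Table 2 (a subset of its rows). Compression ratio is the compressed size as a percentage of the original size, so a lower value means better compression. “–” indicates compression disabled.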
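As a worked example of reading the table, the sketch below derives the approximate compressed size and the average pre-compression bytes per cell for each row. The helper names are hypothetical, the table-to-role labels follow the descriptions in the sections below, and decimal units (1 TB = 10^12 bytes) are an assumption, since the paper does not state which convention it uses.

```python
# Rows copied from the table above:
# (project, size_tb, compression ratio or None if disabled, cells in billions)
ROWS = [
    ("Google Analytics (summary)", 20.0, 0.29, 10),
    ("Google Analytics (raw click)", 200.0, 0.14, 80),
    ("Google Earth (serving index)", 0.5, 0.64, 8),
    ("Google Earth (imagery)", 70.0, None, 9),
    ("Personalized Search", 4.0, 0.47, 6),
]


def compressed_tb(size_tb, ratio):
    """Approximate size after compression; unchanged when compression is disabled."""
    return size_tb if ratio is None else size_tb * ratio


def bytes_per_cell(size_tb, cells_billions):
    """Average pre-compression bytes per cell (assumes 1 TB = 10**12 bytes)."""
    return size_tb * 1e12 / (cells_billions * 1e9)


for name, size_tb, ratio, cells in ROWS:
    print(
        f"{name}: {compressed_tb(size_tb, ratio):.2f} TB compressed, "
        f"{bytes_per_cell(size_tb, cells):.1f} bytes/cell"
    )
```

Under these assumptions the 200 TB raw click table occupies roughly 28 TB after compression and averages about 2,500 bytes per cell.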

2. Google Analytics

Raw → Aggregations → Summary

A small JavaScript program embedded in each page records information about every visit. The system stores raw sessions and periodically produces per-site summaries, which power dashboards and reporting. Overall system throughput is limited by the throughput of GFS.
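The raw-to-summary step can be pictured as a small map/reduce over session records. This is only an in-memory sketch: the record layout (site, session start, page views) and the function names are hypothetical, and the paper does not list which summaries are actually computed.

```python
from collections import defaultdict

# Hypothetical raw session records: (site, session_start, page_views).
sessions = [
    ("example.com", 1700000050, 3),
    ("example.com", 1700000200, 5),
    ("example.org", 1700000100, 2),
]


def map_session(record):
    """Map step: emit (site, (session_count, page_views)) for one raw session."""
    site, _start, page_views = record
    return site, (1, page_views)


def reduce_site(values):
    """Reduce step: sum session counts and page views for one site."""
    return {
        "sessions": sum(v[0] for v in values),
        "page_views": sum(v[1] for v in values),
    }


def summarize(records):
    """Group mapped values by site, then reduce each group to one summary row."""
    grouped = defaultdict(list)
    for record in records:
        site, value = map_session(record)
        grouped[site].append(value)
    return {site: reduce_site(values) for site, values in grouped.items()}


summary = summarize(sessions)
print(summary["example.com"])
```

Each entry of `summary` corresponds to one row that a real job would write to the summary table.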

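Raw click table (~200 TB): one row per end-user session; the row key is the tuple (site name, session creation time), which keeps each site's sessions contiguous and in chronological order.
Summary table (~20 TB): periodically scheduled MapReduce jobs extract recent session data from the raw click table and generate predefined summaries for each site.
Compression: the raw click table compresses to 14% of its original size, the summary table to 29%.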
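To illustrate why a (site, session time) key gives both site contiguity and chronological order, here is a minimal sketch over a sorted list of keys. The string encoding (a `#` separator and a zero-padded 10-digit Unix timestamp) is an assumption made for this example, not the format used by Google Analytics, and it presumes site names never contain `#`.

```python
import bisect


def row_key(site: str, session_time: int) -> str:
    """Hypothetical encoding: site, '#', then a zero-padded timestamp.

    Zero-padding makes lexicographic order agree with numeric order.
    """
    return f"{site}#{session_time:010d}"


def scan_site(sorted_keys, site):
    """Return every key for one site using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, f"{site}#")
    # '$' is the character immediately after '#', so it bounds the prefix range.
    hi = bisect.bisect_left(sorted_keys, f"{site}$")
    return sorted_keys[lo:hi]


# A sorted list stands in for the lexicographically ordered row space.
keys = sorted(
    [
        row_key("example.org", 1700000300),
        row_key("example.com", 1700000200),
        row_key("example.org", 1700000100),
        row_key("example.com", 1700000050),
    ]
)

print(scan_site(keys, "example.org"))
```

Because all of a site's rows form one contiguous, time-ordered run, a per-site report reduces to a single range scan.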

Access: range scans + targeted lookups

Writes: heavy ingest with group commit

Latency: user-facing reports; both tables are marked latency-sensitive in Table 2

3. Google Earth

Spatial Locality for Tiled Imagery

Raw imagery is preprocessed into geographic tiles in one table; serving uses a compact index in another table. The index uses in-memory column families and is spread across hundreds of tablet servers so that it can serve tens of thousands of queries per second per datacenter with low latency during panning and zooming.

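Imagery table (~70 TB): each row is one geographic segment, and rows are named so that adjacent segments are stored near each other; served from disk; compression disabled (source imagery is already efficiently compressed).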
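Serving index (~0.5 TB): memory-resident column families, high QPS; indexes data stored in GFS. A key built from (zoom, lat, lon) is a plausible layout, but the paper does not specify the key format.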
ETL: MapReduce pipelines for cleaning, tiling, and loading.
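One common way to make nearby tiles sort near each other is a quadtree key that interleaves the bits of the tile coordinates. The sketch below uses that technique purely as an illustration; the paper does not say which naming scheme Google Earth uses, and the function name and the (x, y, zoom) tile addressing are assumptions.

```python
def quadkey(x: int, y: int, zoom: int) -> str:
    """Interleave the bits of tile coordinates (x, y) into a base-4 string.

    Hypothetical key scheme: tiles inside the same parent tile share a prefix,
    so they sort next to each other in a lexicographically ordered table.
    """
    if not (0 <= x < 2**zoom and 0 <= y < 2**zoom):
        raise ValueError("tile coordinates out of range for this zoom level")
    digits = []
    # Walk from the most significant bit to the least significant bit.
    for bit in range(zoom - 1, -1, -1):
        x_bit = (x >> bit) & 1
        y_bit = (y >> bit) & 1
        digits.append(str((y_bit << 1) | x_bit))
    return "".join(digits)


# The four children of tile (1, 2) at zoom 2 all share the parent's key as a prefix.
parent = quadkey(1, 2, 2)
children = [quadkey(x, y, 3) for y in (4, 5) for x in (2, 3)]
print(parent, children)
```

This keeps the tiles under one parent contiguous in key order, although two tiles that are adjacent across a parent boundary can still land far apart, which is a known limitation of this kind of ordering.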

Access: point reads by tile key

Writes: batch loads during imagery updates

Latency: the serving index is highly latency-sensitive (map UX); the imagery table is not (Table 2)

4. Personalized Search

Per-User Activity with Versioned Cells

Opt-in histories capture queries, clicks, and preferences. Each user is a row keyed by userid; column families separate action types, and each cell's timestamp is the time the action occurred. Profiles are built by MapReduce jobs over Bigtable and used to personalize live search results.

Schema: families like queries, clicks, prefs; timestamps = action time.
Consistency: single-row atomic updates; data is replicated across several Bigtable clusters for availability and lower latency to distant clients.
Ops: quotas on shared tables to bound per-client usage.
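The row-per-user layout with timestamped versions can be sketched with a toy in-memory class. This is not the Bigtable client API: the class, its methods, and the `web` column qualifier are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class UserRow:
    """Toy in-memory stand-in for one user's row; not the Bigtable client API."""

    def __init__(self, userid: str):
        self.userid = userid
        # Maps "family:qualifier" to a list of (timestamp, value) versions.
        self._cells = defaultdict(list)

    def record(self, family: str, qualifier: str, timestamp: int, value: str) -> None:
        """Store one action as a new version stamped with the action time."""
        self._cells[f"{family}:{qualifier}"].append((timestamp, value))

    def latest(self, family: str, qualifier: str, n: int = 1):
        """Return up to n versions of a column, newest first."""
        versions = self._cells.get(f"{family}:{qualifier}", [])
        return sorted(versions, key=lambda v: v[0], reverse=True)[:n]


row = UserRow("user123")
row.record("queries", "web", 1700000000, "bigtable paper")
row.record("queries", "web", 1700000500, "gfs throughput")
row.record("clicks", "web", 1700000510, "https://example.com/")

print(row.latest("queries", "web"))
print(row.latest("queries", "web", n=5))
```

Using the action time as the version timestamp means a profile-building job can read a user's recent history per action type without a separate index.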

Access: per-user lookups, short range scans

Writes: steady stream of small updates

Latency: on the user-visible path

Notes: Sizes, compression ratios, locality group counts, and in-memory percentages reflect the paper’s Table 2 and the “Real Applications” section descriptions.