How This Works - CA Media Monitor

Data Collection

Sources

Every 6 hours, the system automatically collects articles from three primary sources:

Google News — searches for each candidate by name plus broad governor's race queries, covering hundreds of outlets
GDELT — the Global Database of Events, Language, and Tone, which indexes millions of news articles (rolling ~3 month window)
Politico California Playbook — daily newsletters via RSS feed, with full text extraction

We also search YouTube for video coverage and extract transcripts where available (auto-captions or manual).

We currently track 150+ named media outlets spanning California's regions and the national press, from the LA Times and Sacramento Bee to local outlets like the Chico News & Review and Lost Coast Outpost.

Text Extraction

For each article URL found, we attempt to extract the full article text using newspaper4k. Some articles behind paywalls (LA Times, NY Times, SF Chronicle) may only have headline data. Politico newsletters use a specialized extractor that handles Cloudflare protection.
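
As a rough illustration of the headline-only fallback described above, here is a minimal sketch. The ExtractedArticle record, the extract_article helper, and the min_chars threshold are all hypothetical names and values chosen for this example, not the system's actual code. The only library call used is newspaper4k's Article download/parse sequence.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractedArticle:
    url: str
    headline: str
    body: Optional[str]  # None when only headline data is available

    @property
    def headline_only(self) -> bool:
        return not self.body


def fetch_with_newspaper(url: str) -> str:
    """Download and parse one article with newspaper4k (imported lazily)."""
    from newspaper import Article  # installed via: pip install newspaper4k

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def extract_article(
    url: str,
    headline: str,
    fetch: Callable[[str], str] = fetch_with_newspaper,
    min_chars: int = 200,  # assumed cutoff for "this is a paywall stub, not an article"
) -> ExtractedArticle:
    """Try full-text extraction; fall back to headline-only on failure."""
    try:
        text = fetch(url).strip()
    except Exception:
        # Network errors, blocked requests, and parse failures all degrade
        # to headline-only rather than dropping the article.
        text = ""
    body = text if len(text) >= min_chars else None
    return ExtractedArticle(url=url, headline=headline, body=body)
```

Passing the fetcher in as a parameter keeps the fallback logic testable without network access, and would let a different extractor (for example, one for newsletters) be swapped in.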

Mention Counting

Name Matching

Each article is scanned for candidate names in both the headline and body text. Candidates with unique last names (Becerra, Swalwell, Thurmond, Villaraigosa, Bianco) are matched by last name alone. Candidates with common last names (Mahan, Porter, Steyer, Yee, Hilton) require either their full name or their last name appearing near political context keywords (governor, candidate, election, etc.) to avoid false positives.
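
The two matching rules can be sketched as follows. This is an illustration under stated assumptions rather than the production matcher: the 100-character window used for "near" is a guess, the keyword tuple contains only the three keywords named above, and the full-name mapping is an illustrative subset supplied for the example.

```python
import re

UNIQUE_LAST_NAMES = {"Becerra", "Swalwell", "Thurmond", "Villaraigosa", "Bianco"}

# Illustrative subset; the real system would hold a full name for every
# common-last-name candidate.
COMMON_FULL_NAMES = {"Porter": "Katie Porter", "Yee": "Betty Yee"}

# Only the keywords named in the text; the actual list is longer.
CONTEXT_KEYWORDS = ("governor", "candidate", "election")

WINDOW = 100  # assumed: characters on either side that count as "near"


def mentions(text: str, last_name: str) -> bool:
    """Return True if the text plausibly refers to the candidate."""
    name_re = re.compile(rf"\b{re.escape(last_name)}\b")

    # Unique last names: the last name alone is enough.
    if last_name in UNIQUE_LAST_NAMES:
        return bool(name_re.search(text))

    # Common last names: accept the full name outright...
    full_name = COMMON_FULL_NAMES.get(last_name)
    if full_name and re.search(rf"\b{re.escape(full_name)}\b", text):
        return True

    # ...or the last name with a political keyword nearby.
    for match in name_re.finditer(text):
        start = max(0, match.start() - WINDOW)
        nearby = text[start : match.end() + WINDOW].lower()
        if any(keyword in nearby for keyword in CONTEXT_KEYWORDS):
            return True
    return False
```

In this sketch the name match is case-sensitive, so a lowercase "porter" (as in a hotel porter) never reaches the keyword check.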

Focus Scoring

Three-Tier Classification

Not all mentions are equal. Each candidate's appearance in an article is classified using a points-based system that examines four signals:

Headline position (0-40 pts) — Is the candidate named in the headline? Are they the first or only candidate mentioned?
Mention dominance (0-30 pts) — What share of total candidate mentions does this person have?
First mention position (0-15 pts) — How early in the article does the candidate first appear?
Mention concentration (0-15 pts) — What fraction of all mentions belong to this candidate?

These points are summed (max 100) and classified:

SUBJECT (55+ pts) — weighted at 1.0x — the article is primarily about this candidate
SIGNIFICANT (25-54 pts) — weighted at 0.3x — the candidate plays a meaningful role
PASSING (<25 pts) — weighted at 0.05x — the candidate is briefly mentioned

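With these weights, one SUBJECT article (1.0) counts as much as 20 PASSING mentions (20 × 0.05), so being the subject of a single article generates far more "coverage score" than being name-dropped in six (6 × 0.05 = 0.3).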
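
The thresholds and weights above translate directly into a small classifier. The sketch below takes the four signal scores as given, since how points are awarded within each signal is not specified here. The FocusSignals type and classify function are hypothetical names for this example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FocusSignals:
    headline: int       # 0-40 pts
    dominance: int      # 0-30 pts
    first_mention: int  # 0-15 pts
    concentration: int  # 0-15 pts


LIMITS = {"headline": 40, "dominance": 30, "first_mention": 15, "concentration": 15}

# (minimum points, tier name, weight), checked from highest to lowest.
TIERS = [(55, "SUBJECT", 1.0), (25, "SIGNIFICANT", 0.3), (0, "PASSING", 0.05)]


def classify(signals: FocusSignals) -> tuple[int, str, float]:
    """Sum the four signals and map the total to a tier and weight."""
    parts = asdict(signals)
    for name, value in parts.items():
        if not 0 <= value <= LIMITS[name]:
            raise ValueError(f"{name} must be between 0 and {LIMITS[name]}")
    total = sum(parts.values())
    for threshold, tier, weight in TIERS:
        if total >= threshold:
            return total, tier, weight
    raise AssertionError("unreachable: total is never negative")
```

A candidate's coverage score for a period is then the sum of the returned weights across all articles that mention them.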

Share of Voice

Each candidate's total weighted coverage score is divided by the sum of all candidates' scores, producing a percentage "share of voice." This is compared to polling averages to identify coverage gaps — candidates getting more or less media attention than their polling position would suggest.
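
As a sketch of the calculation: share of voice is each score divided by the total, times 100. The gap calculation below assumes the comparison with polling is a simple difference in percentage points, which is one plausible reading rather than a confirmed detail. Function names are hypothetical.

```python
def share_of_voice(scores: dict[str, float]) -> dict[str, float]:
    """Convert weighted coverage scores to percentages that sum to 100."""
    total = sum(scores.values())
    if total == 0:
        # No coverage at all in the period: avoid dividing by zero.
        return {name: 0.0 for name in scores}
    return {name: 100.0 * score / total for name, score in scores.items()}


def coverage_gaps(
    scores: dict[str, float], polling: dict[str, float]
) -> dict[str, float]:
    """Share of voice minus polling average, in percentage points.

    Positive means more media attention than polling would suggest;
    negative means less. Candidates without polling data are skipped.
    """
    shares = share_of_voice(scores)
    return {name: shares[name] - polling[name] for name in shares if name in polling}
```

For example, with made-up scores of 6.0, 3.0, and 1.0 for three candidates, the shares come out to 60%, 30%, and 10%. If the first candidate polls at 40%, their gap is 20 points of extra coverage.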

Word Associations

AI-Powered Analysis

For each article with full text, we use AI (Claude by Anthropic) to identify the specific words, phrases, and topics associated with each candidate in that article. This is much more accurate than simple word proximity — the AI understands context and only returns descriptions that actually apply to the candidate.

These candidate-specific word buckets are aggregated across all articles in the selected date range to produce the word clouds you see on the dashboard.
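
The aggregation step can be illustrated with a counter over per-article buckets. The record layout here (a date plus a mapping from candidate to a list of words) is a hypothetical shape chosen for the example. The AI extraction step itself is not shown.

```python
from collections import Counter
from datetime import date


def aggregate_words(
    records: list[dict], candidate: str, start: date, end: date
) -> Counter:
    """Count a candidate's associated words across articles in a date range.

    Each record is assumed to look like:
        {"date": date(...), "words": {"Candidate": ["word", "a phrase", ...]}}
    """
    counts: Counter = Counter()
    for record in records:
        if start <= record["date"] <= end:
            words = record["words"].get(candidate, [])
            # Lowercase so "Housing" and "housing" land in the same bucket.
            counts.update(word.lower() for word in words)
    return counts
```

Calling most_common(n) on the result gives the top terms and their frequencies, which is the kind of input a word cloud needs.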

Most Media of the Day

Every morning at 6:00 AM Pacific, the system calculates who generated the most weighted coverage the previous day and posts a scorecard image to @CA120Media on X, showing the top three candidates with their scores and key stats.
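
A minimal sketch of the daily ranking, assuming weighted scores are already stored per calendar day and that run_date is the Pacific calendar date of the 6:00 AM run. The function name, the data layout, and the alphabetical tie-break are assumptions for illustration. Image rendering and posting are not shown.

```python
from datetime import date, timedelta


def most_media_of_day(
    scores_by_date: dict[date, dict[str, float]],
    run_date: date,
    top_n: int = 3,
) -> tuple[date, list[tuple[str, float]]]:
    """Rank candidates by weighted coverage for the day before run_date."""
    previous = run_date - timedelta(days=1)
    day_scores = scores_by_date.get(previous, {})
    # Highest score first; ties broken alphabetically so output is stable.
    ranked = sorted(day_scores.items(), key=lambda item: (-item[1], item[0]))
    return previous, ranked[:top_n]
```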

Pundit Tracker

As a bonus feature, we also track mentions of 30+ political pundits, strategists, pollsters, and commentators who appear in governor's race coverage. Mentions are classified as Quoted (direct attribution), Discussed (referenced substantively), or Mentioned (name-dropped). This is accessible via a hidden feature on the dashboard.

Limitations

Things to keep in mind:

This tool captures a lot, but not everything. Paywalled articles may only have headline data. The focus scoring algorithm is rule-based and can misclassify edge cases. Google News redirect URLs don't always resolve. GDELT's API has rate limits and occasional downtime. YouTube transcript extraction is blocked from cloud server IPs.

The data is directionally useful, not comprehensive. Use it to spot trends, not to make definitive claims about who's "winning" the media race.
