
Archiving Google Interests Articles to PDF with Playwright on Kubernetes

Introduction

Google Interests (google.com/interests/saved) lets you flag articles from your feed for later. What it does not give you is any way to export, archive, or reliably revisit those saves. Articles vanish as Google’s feed rotates, with no API, no RSS, and no bulk-download option. If you want a durable copy, you have to take it yourself.

This post covers web-to-pdf-job: a Kubernetes CronJob that scrapes the saved articles feed using Playwright (Chromium headless), bypasses lazy loading via keyboard scroll simulation, resolves Google redirect URLs, and writes A4 PDFs to an NFS share. Those PDFs feed directly into a morning-brief CronJob that processes them overnight. The two jobs are loosely coupled through a shared NFS directory — no message queue, no service dependencies.

The hardest parts are authentication (no API means simulating a real browser session with live cookies) and pagination (Google renders the feed lazily and paginates at 200 items). Both are covered below.


The Authentication Problem

Google has no API for Interests. The page requires an authenticated session, and any automation that tries to log in programmatically will hit CAPTCHA or 2FA within seconds. The practical solution is to capture a real session from a Chrome instance the user controls, then replay it in Playwright.

Playwright supports this natively through storage state — a JSON dump of cookies and localStorage that can be loaded into a new browser context.

Cookie extraction (runs locally on macOS):

A Python script reads Chrome’s SQLite cookie store, decrypts the values using the macOS Keychain-backed AES-128-CBC key, and serializes the relevant Google cookies into Playwright’s storage state format:

python3 refresh_google_cookies.py --profile "Default" --output google_storage_state.json

The output file is a standard Playwright storage state:

{
  "cookies": [
    {
      "name": "SID",
      "value": "...",
      "domain": ".google.com",
      "path": "/",
      "expires": 1760000000,
      "httpOnly": true,
      "secure": true,
      "sameSite": "None"
    }
  ],
  "origins": []
}
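
One conversion the extraction script has to perform along the way: Chrome stores cookie expiry (expires_utc) as microseconds since 1601-01-01, while Playwright's storage state expects Unix seconds, as in the expires field above. A minimal sketch of that conversion (the function name is illustrative):

```python
# Chrome stores expires_utc as microseconds since 1601-01-01 (the Windows epoch);
# Playwright wants Unix seconds. Fixed offset between the two epochs:
CHROME_EPOCH_OFFSET_S = 11_644_473_600

def chrome_expiry_to_unix(expires_utc: int) -> int:
    if expires_utc == 0:
        return -1  # session cookie; Playwright uses -1 for "no expiry"
    return expires_utc // 1_000_000 - CHROME_EPOCH_OFFSET_S
```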

Once refreshed, the file is copied onto the cluster NFS via kubectl cp:

kubectl cp google_storage_state.json \
  <pod-name>:/mnt/nfs_share/google_storage_state.json

The CronJob mounts the same NFS PVC, so the file is available at job start without rebuilding the image or rotating a Secret.

Expired session detection:

When cookies expire, Google redirects to accounts.google.com. The scraper checks the final URL after navigation and exits with a clear message rather than silently producing empty output:

if "accounts.google.com" in page.url:
    print("ERROR: Google session expired. Refresh cookies with refresh_google_cookies.py and re-copy to NFS.")
    sys.exit(1)

Browser Context Setup

The context is initialized with the storage state and a Linux desktop user-agent. The webdriver property is patched out via add_init_script to reduce bot detection fingerprinting:

context = browser.new_context(
    storage_state=STORAGE_STATE,
    user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
    viewport={'width': 1280, 'height': 900},
)
context.add_init_script(
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

Running Chromium inside Kubernetes requires a few container-level concessions. Chromium runs with --no-sandbox and --disable-dev-shm-usage (a container's default /dev/shm is too small for Chromium), and the pod spec sets securityContext.runAsNonRoot: false so that the sandbox-disabled browser can run as root without needing a fully privileged container.


Pagination and Lazy Loading

The Interests feed paginates at 200 items per page using a ?pageNumber=N query parameter. Within each page, articles are rendered lazily as the user scrolls — a straightforward page.content() grab after initial load will miss most of them.
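
Covering every saved item therefore means walking pageNumber until the feed runs out. A small sketch of the URL generation (the base URL and item cap here are assumptions; the real loop stops as soon as a page yields no links):

```python
SAVED_FEED = "https://www.google.com/interests/saved"  # assumed base URL
PAGE_SIZE = 200  # Google paginates the feed at 200 items per page

def page_urls(max_items: int = 1000) -> list[str]:
    """Feed page URLs covering up to max_items saved articles."""
    pages = (max_items + PAGE_SIZE - 1) // PAGE_SIZE  # ceiling division
    return [f"{SAVED_FEED}?pageNumber={n}" for n in range(1, pages + 1)]
```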

The scroll strategy uses End key presses in a loop, collecting article links after each press until two consecutive passes return the same count:

def scroll_and_collect(page) -> list[str]:
    """Press End repeatedly until no new article links appear."""
    prev_count = 0
    links: list[str] = []
    while True:
        page.keyboard.press("End")           # jump to the bottom, triggering lazy load
        page.wait_for_timeout(1200)          # give the throttled feed time to respond
        links = page.eval_on_selector_all(
            'a[href*="google.com/url?q="]',  # only the wrapped article links
            "els => els.map(e => e.href)"
        )
        if len(links) == prev_count:         # two consecutive passes, same count: done
            break
        prev_count = len(links)
    return links

The 1200 ms wait is intentional — Google’s feed throttles lazy-load responses. Dropping it below ~800 ms causes partial page loads and missed articles.


URL Resolution

Links on the Interests page are wrapped in google.com/url?q= redirects. The real URL is extracted from the q parameter before deduplication and PDF generation:

from urllib.parse import urlparse, parse_qs

def resolve_google_redirect(href: str) -> str | None:
    parsed = urlparse(href)
    qs = parse_qs(parsed.query)
    targets = qs.get("q", [])
    if not targets:
        return None
    return targets[0]

Google-owned domains are filtered after resolution to avoid archiving YouTube videos, Play Store pages, and other non-article destinations:

GOOGLE_DOMAINS = {
    "youtube.com", "play.google.com", "google.com",
    "googleapis.com", "goo.gl",
}

def is_google_domain(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in GOOGLE_DOMAINS)
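
Put together, the per-link pipeline reduces to a single pass. A self-contained sketch restating the logic of the two functions above:

```python
from urllib.parse import urlparse, parse_qs

GOOGLE_DOMAINS = {
    "youtube.com", "play.google.com", "google.com",
    "googleapis.com", "goo.gl",
}

def extract_targets(hrefs: list[str]) -> list[str]:
    """Resolve google.com/url?q= wrappers and drop Google-owned destinations."""
    out: list[str] = []
    for href in hrefs:
        targets = parse_qs(urlparse(href).query).get("q", [])
        if not targets:
            continue  # not a wrapped article link
        host = urlparse(targets[0]).hostname or ""
        if any(host == d or host.endswith("." + d) for d in GOOGLE_DOMAINS):
            continue  # YouTube, Play Store, etc.
        out.append(targets[0])
    return out
```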

PDF Generation

Each article is opened in a new page within the same context (preserving auth cookies), then printed to A4 PDF:

article_page.pdf(
    path=str(OUTPUT_DIR / filename),
    format='A4',
    margin={'top': '20mm', 'bottom': '20mm', 'left': '15mm', 'right': '15mm'},
)

Filenames are derived from the article’s domain, a per-run index, and a Unix timestamp to avoid collisions across job runs:

techcrunch-com_3_1713456789.pdf
arstechnica-com_11_1713456812.pdf

Deduplication is handled by a plain text file on the NFS share. URLs are appended after successful PDF generation and checked at the start of each run:

PROCESSED_URLS = OUTPUT_DIR / "processed_urls.txt"

def load_processed() -> set[str]:
    if not PROCESSED_URLS.exists():
        return set()
    return set(PROCESSED_URLS.read_text().splitlines())

def mark_processed(url: str) -> None:
    with PROCESSED_URLS.open("a") as f:
        f.write(url + "\n")

Kubernetes CronJob

The job runs on nodes labeled node-availability: 24x7 to avoid scheduling on nodes that may be powered down overnight:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: web-to-pdf-job
  namespace: automation
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          securityContext:
            runAsNonRoot: false   # lets Chromium run as root with --no-sandbox
          nodeSelector:
            node-availability: "24x7"
          restartPolicy: OnFailure
          containers:
            - name: web-to-pdf
              image: <your-registry>/web-to-pdf:latest
              args:
                - python3
                - /app/scrape_interests.py
              env:
                - name: OUTPUT_DIR
                  value: /mnt/nfs_share/saved_articles
                - name: STORAGE_STATE
                  value: /mnt/nfs_share/google_storage_state.json
              volumeMounts:
                - name: nfs-pdf
                  mountPath: /mnt/nfs_share
          volumes:
            - name: nfs-pdf
              persistentVolumeClaim:
                claimName: nfs-pdf-pvc

The NFS PVC (nfs-pdf-pvc) is a ReadWriteMany volume backed by the homelab NAS. Both web-to-pdf-job and morning-brief mount it. PDFs land in /mnt/nfs_share/saved_articles/ and are picked up by morning-brief on its next run without any coordination between the two jobs.
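
For reference, the PVC looks roughly like this. The access mode is the load-bearing part; the size and storage class are assumptions that depend on the NAS provisioner in use:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pdf-pvc
  namespace: automation
spec:
  accessModes:
    - ReadWriteMany    # both CronJobs mount it simultaneously
  resources:
    requests:
      storage: 20Gi    # assumed
  storageClassName: nfs-client   # assumed; depends on the NFS provisioner
```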


On-Demand Triggering

The CronJob can be triggered manually from OpenClaw (the homelab dashboard) without requiring kubectl access on the client side. A Node.js script calls the Kubernetes API using the pod’s own service account token to create a one-off Job from the CronJob template:

node /scripts/trigger_articles.js

The script POSTs to /apis/batch/v1/namespaces/automation/jobs with a body derived from the CronJob spec. The service account needs get on the CronJob and create on Jobs in the automation namespace — no cluster-wide permissions required:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cronjob-trigger
  namespace: automation
rules:
  - apiGroups: ["batch"]
    resources: ["cronjobs"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create"]
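
The core of the trigger script is the manifest transformation: fetch the CronJob, take its jobTemplate, and wrap it in a one-off Job with a unique name. Sketched here in Python rather than Node.js for consistency with the rest of the post (the helper name is illustrative):

```python
import time

def job_from_cronjob(cronjob: dict) -> dict:
    """Build a one-off Job manifest from a fetched CronJob object,
    mirroring what `kubectl create job --from=cronjob/...` does."""
    name = cronjob["metadata"]["name"]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"{name}-manual-{int(time.time())}",  # unique per trigger
            "namespace": cronjob["metadata"]["namespace"],
        },
        "spec": cronjob["spec"]["jobTemplate"]["spec"],   # reuse the template verbatim
    }
```

The resulting dict is what gets POSTed to /apis/batch/v1/namespaces/automation/jobs.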

The Morning-Brief Connection

The PDF output directory is the intake queue for morning-brief, a second CronJob that runs at 05:30 and processes new PDFs into a daily digest. The coupling is intentional and minimal: web-to-pdf-job writes files, morning-brief reads them. Neither job knows about the other’s schedule or state. If web-to-pdf-job is delayed or fails, morning-brief processes whatever is already in the directory and moves on.

This makes the pipeline resilient to the one guaranteed failure mode: expired Google cookies. When cookies expire, web-to-pdf-job fails fast and logs the error. No corrupt PDFs are written, morning-brief is unaffected, and the fix is a two-step local operation: re-run the cookie extraction script and kubectl cp the updated state file to NFS.


Conclusion

The core engineering challenge here is not scraping — it’s maintaining a usable Google session without an API. Playwright’s storage state mechanism plus a local cookie extraction script gives a workable solution, with the tradeoff that cookies expire and need manual refresh every few weeks. The failure mode is clean and recoverable, which matters more than eliminating the maintenance burden entirely.

The lazy-load scroll strategy and google.com/url?q= resolution are straightforward once identified, but both require careful handling: scrolling too fast causes missed articles, and skipping URL resolution would deduplicate on the redirect wrappers, filling processed_urls.txt with google.com/url entries instead of article URLs.

The NFS-based coupling between web-to-pdf-job and morning-brief keeps both jobs independently deployable and debuggable. Adding a new source to the morning brief pipeline means pointing another job at the same directory — no changes to the consumer.

This post is licensed under CC BY 4.0 by the author.