Introduction
Google Interests (google.com/interests/saved) lets you flag articles from your feed for later. What it does not give you is any way to export, archive, or reliably revisit those saves. Articles vanish as Google’s feed rotates, with no API, no RSS, and no bulk-download option. If you want a durable copy, you have to take it yourself.
This post covers web-to-pdf-job: a Kubernetes CronJob that scrapes the saved articles feed using Playwright (Chromium headless), bypasses lazy loading via keyboard scroll simulation, resolves Google redirect URLs, and writes A4 PDFs to an NFS share. Those PDFs feed directly into a morning-brief CronJob that processes them overnight. The two jobs are loosely coupled through a shared NFS directory — no message queue, no service dependencies.
The hardest parts are authentication (no API means simulating a real browser session with live cookies) and pagination (Google renders the feed lazily and paginates at 200 items). Both are covered below.
The Authentication Problem
Google has no API for Interests. The page requires an authenticated session, and any automation that tries to log in programmatically will hit CAPTCHA or 2FA within seconds. The practical solution is to capture a real session from a Chrome instance the user controls, then replay it in Playwright.
Playwright supports this natively through storage state — a JSON dump of cookies and localStorage that can be loaded into a new browser context.
Cookie extraction (runs locally on Mac):
A Python script reads Chrome’s SQLite cookie store, decrypts the values using the macOS Keychain-backed AES-128-CBC key, and serializes the relevant Google cookies into Playwright’s storage state format:
python3 refresh_google_cookies.py --profile "Default" --output google_storage_state.json
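The derivation Chrome uses on macOS is well documented: cookie values are encrypted with AES-128-CBC under a key derived from the "Chrome Safe Storage" Keychain password via PBKDF2-HMAC-SHA1 (salt `saltysalt`, 1003 iterations, 16-byte key, IV of 16 spaces). A minimal sketch of the key-derivation step the script relies on — the SQLite read and the CBC decryption itself are omitted here:

```python
import hashlib
import subprocess

def chrome_aes_key(keychain_password: bytes) -> bytes:
    """Derive Chrome's 16-byte AES key from the Keychain password.

    Chrome on macOS uses PBKDF2-HMAC-SHA1 with the fixed salt
    b"saltysalt" and 1003 iterations; the IV for the subsequent
    AES-128-CBC decryption of each cookie value is 16 spaces.
    """
    return hashlib.pbkdf2_hmac("sha1", keychain_password, b"saltysalt", 1003, dklen=16)

def keychain_password() -> bytes:
    # Reads the "Chrome Safe Storage" entry; macOS will prompt for
    # Keychain access the first time this runs.
    out = subprocess.check_output(
        ["security", "find-generic-password", "-w", "-s", "Chrome Safe Storage"]
    )
    return out.strip()
```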
The output file is a standard Playwright storage state:
{
  "cookies": [
    {
      "name": "SID",
      "value": "...",
      "domain": ".google.com",
      "path": "/",
      "expires": 1760000000,
      "httpOnly": true,
      "secure": true,
      "sameSite": "None"
    }
  ],
  "origins": []
}
Once refreshed, the file is copied onto the cluster NFS via kubectl cp:
kubectl cp google_storage_state.json \
    <pod-name>:/mnt/nfs_share/google_storage_state.json
The CronJob mounts the same NFS PVC, so the file is available at job start without rebuilding the image or rotating a Secret.
Expired session detection:
When cookies expire, Google redirects to accounts.google.com. The scraper checks the final URL after navigation and exits with a clear message rather than silently producing empty output:
if "accounts.google.com" in page.url:
    print("ERROR: Google session expired. Refresh cookies with refresh_google_cookies.py and re-copy to NFS.")
    sys.exit(1)
Browser Context Setup
The context is initialized with the storage state and a Linux desktop user-agent. The navigator.webdriver property is patched out via add_init_script to remove the most obvious automation fingerprint:
context = browser.new_context(
    storage_state=STORAGE_STATE,
    user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
    viewport={'width': 1280, 'height': 900},
)
context.add_init_script(
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)
Running Chromium inside Kubernetes requires a few container-level flags. The job’s container args include --no-sandbox and --disable-dev-shm-usage (disabling Chromium’s user-namespace sandbox and its /dev/shm dependency, neither of which works reliably in a restricted container), and the pod spec sets securityContext.runAsNonRoot: false so Chromium can run without a fully privileged container.
Pagination and Lazy Loading
The Interests feed paginates at 200 items per page using a ?pageNumber=N query parameter. Within each page, articles are rendered lazily as the user scrolls — a straightforward page.content() grab after initial load will miss most of them.
The scroll strategy uses End key presses in a loop, collecting article links after each press until two consecutive passes return the same count:
def scroll_and_collect(page) -> list[str]:
    prev_count = 0
    while True:
        page.keyboard.press("End")
        page.wait_for_timeout(1200)
        links = page.eval_on_selector_all(
            'a[href*="google.com/url?q="]',
            "els => els.map(e => e.href)"
        )
        if len(links) == prev_count:
            break
        prev_count = len(links)
    return links
The 1200 ms wait is intentional — Google’s feed throttles lazy-load responses. Dropping it below ~800 ms causes partial page loads and missed articles.
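Putting pagination and lazy loading together, the outer loop walks pageNumber values until a page comes back empty. This is a sketch, assuming the feed URL shape described above and that an empty page marks the end of the feed — both worth verifying against the live site:

```python
def interests_page_url(page_number: int) -> str:
    # ?pageNumber=N selects one 200-item slice of the saved feed
    return f"https://google.com/interests/saved?pageNumber={page_number}"

def collect_all_pages(page) -> list[str]:
    all_links: list[str] = []
    n = 1
    while True:
        page.goto(interests_page_url(n))
        links = scroll_and_collect(page)  # defined above
        if not links:  # assumption: an empty page means the feed is exhausted
            break
        all_links.extend(links)
        n += 1
    return all_links
```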
URL Resolution
Links on the Interests page are wrapped in google.com/url?q= redirects. The real URL is extracted from the q parameter before deduplication and PDF generation:
from urllib.parse import urlparse, parse_qs

def resolve_google_redirect(href: str) -> str | None:
    parsed = urlparse(href)
    qs = parse_qs(parsed.query)
    targets = qs.get("q", [])
    if not targets:
        return None
    return targets[0]
Google-owned domains are filtered after resolution to avoid archiving YouTube videos, Play Store pages, and other non-article destinations:
GOOGLE_DOMAINS = {
    "youtube.com", "play.google.com", "google.com",
    "googleapis.com", "goo.gl",
}

def is_google_domain(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in GOOGLE_DOMAINS)
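The resolution and filtering steps combine into one pass that unwraps each redirect, drops Google-owned hosts, and dedupes while preserving feed order. A self-contained sketch, with the helpers restated inline so it runs standalone:

```python
from urllib.parse import urlparse, parse_qs

GOOGLE_DOMAINS = {
    "youtube.com", "play.google.com", "google.com",
    "googleapis.com", "goo.gl",
}

def clean_targets(hrefs: list[str]) -> list[str]:
    """Resolve google.com/url?q= wrappers, drop Google hosts, dedupe in order."""
    seen: set[str] = set()
    out: list[str] = []
    for href in hrefs:
        targets = parse_qs(urlparse(href).query).get("q", [])
        if not targets:
            continue  # not a wrapped redirect link
        target = targets[0]
        host = urlparse(target).hostname or ""
        if any(host == d or host.endswith("." + d) for d in GOOGLE_DOMAINS):
            continue  # YouTube, Play Store, etc. -- not an article
        if target in seen:
            continue
        seen.add(target)
        out.append(target)
    return out
```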
PDF Generation
Each article is opened in a new page within the same context (preserving auth cookies), then printed to A4 PDF:
article_page.pdf(
    path=str(OUTPUT_DIR / filename),
    format='A4',
    margin={'top': '20mm', 'bottom': '20mm', 'left': '15mm', 'right': '15mm'},
)
Filenames are derived from the article’s domain, a per-run index, and a Unix timestamp to avoid collisions across job runs:
techcrunch-com_3_1713456789.pdf
arstechnica-com_11_1713456812.pdf
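A hypothetical helper matching that naming scheme (make_filename is an illustration, not code from the job itself):

```python
import time
from urllib.parse import urlparse

def make_filename(url: str, index: int) -> str:
    # "https://techcrunch.com/..." -> "techcrunch-com_3_<unix-ts>.pdf"
    host = (urlparse(url).hostname or "unknown").removeprefix("www.")
    return f"{host.replace('.', '-')}_{index}_{int(time.time())}.pdf"
```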
Deduplication is handled by a plain text file on the NFS share. URLs are appended after successful PDF generation and checked at the start of each run:
PROCESSED_URLS = OUTPUT_DIR / "processed_urls.txt"

def load_processed() -> set[str]:
    if not PROCESSED_URLS.exists():
        return set()
    return set(PROCESSED_URLS.read_text().splitlines())

def mark_processed(url: str) -> None:
    with PROCESSED_URLS.open("a") as f:
        f.write(url + "\n")
Kubernetes CronJob
The job runs on nodes labeled node-availability: 24x7 to avoid scheduling on nodes that may be powered down overnight:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: web-to-pdf-job
  namespace: automation
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-availability: "24x7"
          restartPolicy: OnFailure
          containers:
            - name: web-to-pdf
              image: <your-registry>/web-to-pdf:latest
              args:
                - python3
                - /app/scrape_interests.py
              env:
                - name: OUTPUT_DIR
                  value: /mnt/nfs_share/saved_articles
                - name: STORAGE_STATE
                  value: /mnt/nfs_share/google_storage_state.json
              volumeMounts:
                - name: nfs-pdf
                  mountPath: /mnt/nfs_share
          volumes:
            - name: nfs-pdf
              persistentVolumeClaim:
                claimName: nfs-pdf-pvc
The NFS PVC (nfs-pdf-pvc) is a ReadWriteMany volume backed by the homelab NAS. Both web-to-pdf-job and morning-brief mount it. PDFs land in /mnt/nfs_share/saved_articles/ and are picked up by morning-brief on its next run without any coordination between the two jobs.
On-Demand Triggering
The CronJob can be triggered manually from OpenClaw (the homelab dashboard) without requiring kubectl access on the client side. A Node.js script calls the Kubernetes API using the pod’s own service account token to create a one-off Job from the CronJob template:
node /scripts/trigger_articles.js
The script POSTs to /apis/batch/v1/namespaces/automation/jobs with a body derived from the CronJob spec. The service account needs get on the CronJob and create on Jobs in the automation namespace — no cluster-wide permissions required:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cronjob-trigger
  namespace: automation
rules:
  - apiGroups: ["batch"]
    resources: ["cronjobs"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create"]
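The trigger script itself is Node.js, but the manifest transformation it performs is small enough to sketch — shown here in Python for consistency with the rest of the post. It mirrors what kubectl create job --from=cronjob/... does: reuse the CronJob's jobTemplate spec under a unique name in the same namespace:

```python
import time

def job_from_cronjob(cronjob: dict) -> dict:
    """Build a one-off Job manifest from a CronJob API object.

    The resulting dict is what gets POSTed to
    /apis/batch/v1/namespaces/<ns>/jobs.
    """
    name = f"{cronjob['metadata']['name']}-manual-{int(time.time())}"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "namespace": cronjob["metadata"]["namespace"],
        },
        "spec": cronjob["spec"]["jobTemplate"]["spec"],
    }
```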
The Morning-Brief Connection
The PDF output directory is the intake queue for morning-brief, a second CronJob that runs at 05:30 and processes new PDFs into a daily digest. The coupling is intentional and minimal: web-to-pdf-job writes files, morning-brief reads them. Neither job knows about the other’s schedule or state. If web-to-pdf-job is delayed or fails, morning-brief processes whatever is already in the directory and moves on.
This makes the pipeline resilient to the one guaranteed failure mode: expired Google cookies. When cookies expire, web-to-pdf-job fails fast and logs the error. No corrupt PDFs are written, morning-brief is unaffected, and the fix is a two-step local operation: re-run the cookie extraction script and kubectl cp the updated state file to NFS.
Conclusion
The core engineering challenge here is not scraping — it’s maintaining a usable Google session without an API. Playwright’s storage state mechanism plus a local cookie extraction script gives a workable solution, with the tradeoff that cookies expire and need manual refresh every few weeks. The failure mode is clean and recoverable, which matters more than eliminating the maintenance burden entirely.
The lazy-load scroll strategy and google.com/url?q= resolution are straightforward once identified, but both require careful handling: too-fast scrolling misses articles, and skipping URL resolution would dedupe on the wrapper URLs, filling processed_urls.txt with redirect links instead of the real article URLs.
The NFS-based coupling between web-to-pdf-job and morning-brief keeps both jobs independently deployable and debuggable. Adding a new source to the morning brief pipeline means pointing another job at the same directory — no changes to the consumer.