Introduction
The read-later problem is a trap. You save an article, feel briefly virtuous, and then never open it again. Over the course of a week you accumulate dozens of links from your Google Interests saved feed, and the pile grows faster than you can consume it.
The fix isn’t better discipline — it’s removing yourself from the loop entirely. This post describes a two-stage Kubernetes pipeline that scrapes your saved articles into PDFs overnight, then every morning batches all of them into a single Gemini 2.5 Flash call and delivers a structured digest to Telegram before you’ve poured your first coffee.
The key design constraint: stay within Gemini’s free tier by making one API call per run, regardless of how many articles were collected.
Architecture Overview
The system is split into two independent CronJobs that share an NFS volume:
- web-to-pdf-job — runs on a schedule throughout the day, scraping saved articles from https://www.google.com/interests/saved and writing them as PDFs to /mnt/nfs_share/saved_articles/
- morning-brief — runs at 0 15 * * * UTC (7am PST / 8am PDT), reads up to 20 of those PDFs, calls Gemini once with all content combined, and delivers the summary to Telegram
Both jobs run on nodes labeled node-availability: "24x7" — the always-on worker nodes in the cluster — ensuring they are never evicted to a node that may be powered off.
Stage 1: web-to-pdf-job
Authentication
Google’s saved interests feed requires a valid session. Rather than managing OAuth flows inside a container, the job uses a Playwright cookie dump (google_storage_state.json) stored on the NFS share. Playwright’s storage_state captures the full browser session — cookies, localStorage, and sessionStorage — so the job can authenticate as a real browser session.
When the job starts, it loads this file directly:
```python
context = browser.new_context(storage_state="/mnt/nfs_share/google_storage_state.json")
```
When the session expires, the fix is manual: re-authenticate on a desktop browser, export the storage state, and overwrite the file on NFS. It is a known operational step, not a flaw in the design.
Scraping and PDF Export
The scraper navigates to https://www.google.com/interests/saved, scrolls the page in increments to force lazy-loaded content into the DOM, then collects article links. Google wraps outbound links in google.com/url?q= redirects, so the job follows each redirect to resolve the canonical URL before processing.
Google-owned domains (google.com, youtube.com, googleapis.com, etc.) are filtered out — the brief is for third-party articles, not Google’s own properties.
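A minimal sketch of the redirect unwrapping and domain filter (function names are illustrative, not the job's actual code — the real job follows each redirect over HTTP, but the q parameter of a google.com/url wrapper often carries the target directly, so parsing it is shown here as a lighter-weight alternative):

```python
from urllib.parse import urlparse, parse_qs

# Google-owned domains excluded from the brief
BLOCKED_DOMAINS = ("google.com", "youtube.com", "googleapis.com")

def resolve_wrapped_url(href):
    """Unwrap a google.com/url?q=... redirect to its target URL."""
    parsed = urlparse(href)
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return href

def is_third_party(url):
    """True for non-Google domains; the brief is for external articles."""
    host = urlparse(url).netloc.lower()
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
```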
Each resolved URL is checked against processed_urls.txt on the NFS share before fetching. Articles that have already been converted are skipped. After a successful PDF export, the URL is appended to the tracking file.
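The tracking file is a plain newline-delimited list, so the dedup logic stays trivial. A sketch (function names are illustrative):

```python
import os

TRACKING_FILE = "/mnt/nfs_share/processed_urls.txt"  # path from the post

def load_processed(path=TRACKING_FILE):
    """Read the set of already-converted URLs; empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def mark_processed(url, path=TRACKING_FILE):
    """Append a URL only after its PDF export has succeeded."""
    with open(path, "a") as f:
        f.write(url + "\n")
```

Appending only after a successful export means a crashed run re-attempts the article rather than silently dropping it.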
PDFs are generated using Chromium’s built-in print-to-PDF capability via Playwright:
```python
page.pdf(path=f"/mnt/nfs_share/saved_articles/{safe_filename}.pdf", format="A4")
```
This produces clean, single-file documents with no external dependencies.
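The derivation of safe_filename isn't shown in the post; one plausible way to reduce an article title to a filesystem-safe stem (a hypothetical helper, not the job's actual code) is:

```python
import re

def safe_filename(title, max_len=80):
    """Collapse anything outside [A-Za-z0-9._-] into underscores
    and truncate, so the title becomes a safe PDF filename stem."""
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")
    return stem[:max_len] or "article"
```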
CronJob Spec (Stage 1 excerpt)
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: web-to-pdf-job
  namespace: automation
spec:
  schedule: "0 */4 * * *"
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 600
      template:
        spec:
          nodeSelector:
            node-availability: "24x7"
          containers:
            - name: scraper
              image: z2tone/web-to-pdf:v1.0.0
              volumeMounts:
                - name: nfs-pdf
                  mountPath: /mnt/nfs_share
          volumes:
            - name: nfs-pdf
              persistentVolumeClaim:
                claimName: nfs-pdf-pvc
          restartPolicy: Never
```
Stage 2: morning-brief CronJob
PDF Ingestion and Text Extraction
The job reads up to 20 PDFs from /mnt/nfs_share/saved_articles/ — a practical ceiling that keeps the Gemini prompt within a reasonable token budget while covering a typical week’s worth of saved content.
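The post doesn't specify the selection order when more than 20 PDFs are waiting; oldest-first is one sensible choice, since it drains a backlog deterministically. A sketch under that assumption:

```python
import glob
import os

def select_pdfs(directory, limit=20):
    """Pick up to `limit` PDFs, oldest first by modification time."""
    pdfs = glob.glob(os.path.join(directory, "*.pdf"))
    return sorted(pdfs, key=os.path.getmtime)[:limit]
```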
Text is extracted with pdftotext (from the poppler-utils package), which handles the A4 PDFs produced by Stage 1 cleanly:
```python
import subprocess

def extract_text(pdf_path):
    # "-" sends pdftotext output to stdout instead of a file
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip()
```
All article texts are concatenated into a single string, with a separator between each article to preserve boundaries in the prompt.
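The exact separator format is an implementation detail the post doesn't show; a labeled marker per article (illustrative sketch) keeps the boundaries unambiguous for the model:

```python
SEPARATOR = "\n\n===== ARTICLE {i}: {name} =====\n\n"  # illustrative marker

def build_payload(articles):
    """Join (name, text) pairs with labeled separators so the model
    can tell where one article ends and the next begins."""
    parts = []
    for i, (name, text) in enumerate(articles, 1):
        parts.append(SEPARATOR.format(i=i, name=name) + text)
    return "".join(parts)
```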
The Gemini Call
The entire payload — all articles combined — goes to Gemini 2.5 Flash in a single request. The prompt requests three structured sections:
- OVERVIEW: 2–3 sentences on the dominant themes across all articles
- ARTICLES: a short per-article summary
- CONNECTIONS: cross-article patterns, recurring names, or contradicting viewpoints
One call per run. Not one call per article. This is the constraint that makes the free tier viable at any reasonable article volume.
The call_gemini function handles transient API errors with exponential backoff:
```python
import json
import time
import urllib.error
import urllib.request

def call_gemini(prompt, retries=3):
    payload = {'contents': [{'role': 'user', 'parts': [{'text': prompt}]}]}
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                GEMINI_URL,
                data=json.dumps(payload).encode(),
                headers={'Content-Type': 'application/json'})
            with urllib.request.urlopen(req, timeout=120) as r:
                return json.loads(r.read())['candidates'][0]['content']['parts'][0]['text']
        except urllib.error.HTTPError as e:
            # Retry only on rate limits and transient server errors
            if e.code in (429, 500, 503) and attempt < retries - 1:
                wait = 10 * (2 ** attempt)  # 10s, 20s, 40s
                print(f"Gemini error {e.code}, retrying in {wait}s...")
                time.sleep(wait)
            else:
                raise
```
The 120-second timeout on urlopen is intentional — a batch of 20 dense articles can take Gemini 60–90 seconds to process.
Telegram Delivery
The Telegram Bot API has a 4096-character message limit. The digest is split at that boundary and each chunk is sent as a separate message to the configured chat:
```python
import json
import urllib.request

def send_telegram(text, bot_token, chat_id):
    # Split at 4000 to keep a margin under Telegram's 4096-character limit
    chunks = [text[i:i+4000] for i in range(0, len(text), 4000)]
    for chunk in chunks:
        payload = {'chat_id': chat_id, 'text': chunk}
        req = urllib.request.Request(
            f"https://api.telegram.org/bot{bot_token}/sendMessage",
            data=json.dumps(payload).encode(),
            headers={'Content-Type': 'application/json'})
        urllib.request.urlopen(req, timeout=30)
```
Archival
After a successful Telegram delivery, processed PDFs are moved to /mnt/nfs_share/saved_articles/archived/YYYY-MM-DD/ using the current date. This keeps the working directory clean for the next run and provides a browsable archive if you want to revisit what was covered on a specific day.
```python
import os
import shutil
from datetime import date

archive_dir = f"/mnt/nfs_share/saved_articles/archived/{date.today().isoformat()}"
os.makedirs(archive_dir, exist_ok=True)
for pdf in processed_pdfs:
    shutil.move(pdf, archive_dir)
```
CronJob Spec (Stage 2 — full)
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: morning-brief
  namespace: automation
spec:
  schedule: "0 15 * * *"
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 600
      template:
        spec:
          nodeSelector:
            node-availability: "24x7"
          containers:
            - name: brief
              image: z2tone/morning-brief:v1.0.0
              env:
                - name: GEMINI_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: openclaw-gemini-secret
                      key: GEMINI_API_KEY
                - name: TELEGRAM_BOT_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: telegram-secret
                      key: bot_token
                - name: TELEGRAM_CHAT_ID
                  valueFrom:
                    secretKeyRef:
                      name: telegram-secret
                      key: morning_brief_chat_id
              volumeMounts:
                - name: nfs-pdf
                  mountPath: /mnt/nfs_share
          volumes:
            - name: nfs-pdf
              persistentVolumeClaim:
                claimName: nfs-pdf-pvc
          restartPolicy: Never
```
Secrets
Two Kubernetes Secrets back the job:
- telegram-secret: keys bot_token and morning_brief_chat_id
- openclaw-gemini-secret: key GEMINI_API_KEY
Both are sealed with Sealed Secrets before being committed to the GitOps repo. Neither value appears in plaintext in version control.
Operational Notes
When the scraper stops finding articles: The Google session in google_storage_state.json has expired. Re-authenticate in a browser with Playwright’s codegen or a manual context.storage_state() export and overwrite the file on NFS. This is the only manual step in the entire pipeline.
Adjusting the article cap: The 20-PDF limit is a soft ceiling set in the brief script. If your reading habits generate significantly more content, increase it — but monitor token usage. Gemini 2.5 Flash’s context window is large, but very long prompts will increase latency and may push total tokens into billable territory.
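A rough pre-flight check can flag oversized runs before the API call. The 4-characters-per-token ratio below is a common rule of thumb for English prose, not an exact tokenizer, and the budget value is illustrative:

```python
def estimate_tokens(text):
    """Crude token estimate: roughly 4 characters per token."""
    return len(text) // 4

def check_budget(texts, budget=200_000):
    """Sum estimates across all article texts and warn if over budget."""
    total = sum(estimate_tokens(t) for t in texts)
    if total > budget:
        print(f"Warning: ~{total} tokens exceeds budget of {budget}")
    return total
```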
backoffLimit: 1 and activeDeadlineSeconds: 600: A single retry is enough for transient failures; the exponential backoff inside call_gemini handles API rate limits before they reach the pod level. The 10-minute deadline prevents a hung Playwright session from blocking node resources indefinitely.
Conclusion
The read-later pile doesn’t get smaller by reading faster — it gets smaller by changing what “reading” means. A batched Gemini digest converts passive article accumulation into an active, structured morning briefing, and the entire operational cost is one Kubernetes CronJob that runs for under two minutes a day.
The single-call batching strategy is worth emphasizing: per-article Gemini calls would hit free tier limits within a few days. Concatenating all content into one request with a structured output prompt keeps the monthly API cost at zero while delivering a richer analysis — cross-article connections are only visible when the model sees everything at once.
The full pipeline — scraper, brief, NFS volume, and sealed secrets — is managed as GitOps resources in ArgoCD and requires no ongoing maintenance beyond occasional session token refreshes.