
Automated Morning Tech Brief with Gemini and Kubernetes CronJobs

Introduction

The read-later problem is a trap. You save an article, feel briefly virtuous, and then never open it again. Over the course of a week you accumulate dozens of links from your Google Interests saved feed, and the pile grows faster than you can consume it.

The fix isn’t better discipline — it’s removing yourself from the loop entirely. This post describes a two-stage Kubernetes pipeline that scrapes your saved articles into PDFs overnight, then every morning batches all of them into a single Gemini 2.5 Flash call and delivers a structured digest to Telegram before you’ve poured your first coffee.

The key design constraint: stay within Gemini’s free tier by making one API call per run, regardless of how many articles were collected.


Architecture Overview

The system is split into two independent CronJobs that share an NFS volume:

  1. web-to-pdf-job — runs on a schedule throughout the day, scraping saved articles from https://www.google.com/interests/saved and writing them as PDFs to /mnt/nfs_share/saved_articles/
  2. morning-brief — runs at 0 15 * * * UTC (7am PST / 8am PDT), reads up to 20 of those PDFs, calls Gemini once with all content combined, and delivers the summary to Telegram

Both jobs run on nodes labeled node-availability: "24x7" — the always-on worker nodes in the cluster — ensuring they are never scheduled onto a node that may be powered off.


Stage 1: web-to-pdf-job

Authentication

Google’s saved interests feed requires a valid session. Rather than managing OAuth flows inside a container, the job uses a Playwright cookie dump (google_storage_state.json) stored on the NFS share. Playwright’s storage_state captures the full browser session — cookies, localStorage, and sessionStorage — so the job can authenticate as a real browser session.

When the job starts, it loads this file directly:

context = browser.new_context(storage_state="/mnt/nfs_share/google_storage_state.json")

When the session expires, the fix is manual: re-authenticate on a desktop browser, export the storage state, and overwrite the file on NFS. It is a known operational step, not a flaw in the design.
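One way to produce that file locally is Playwright's codegen with its --save-storage flag (the output path here is illustrative — the only requirement is that the exported file ends up at the path the job reads):

```shell
# Open a browser, log in to Google manually, then close the window;
# Playwright writes cookies, localStorage, and sessionStorage to the file.
playwright codegen --save-storage=google_storage_state.json https://www.google.com/interests/saved

# Overwrite the stale session file on the NFS share
cp google_storage_state.json /mnt/nfs_share/google_storage_state.json
```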

Scraping and PDF Export

The scraper navigates to https://www.google.com/interests/saved, scrolls the page in increments to force lazy-loaded content into the DOM, then collects article links. Google wraps outbound links in google.com/url?q= redirects, so the job follows each redirect to resolve the canonical URL before processing.

Google-owned domains (google.com, youtube.com, googleapis.com, etc.) are filtered out — the brief is for third-party articles, not Google’s own properties.
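The two steps above — unwrapping the google.com/url?q= wrapper and dropping Google-owned hosts — can be sketched as follows. The job itself follows the HTTP redirect; this sketch takes the shortcut of reading the q parameter directly, and the function names are illustrative, not the job's actual code:

```python
from urllib.parse import parse_qs, urlparse

# Hosts (and their subdomains) excluded from the brief
BLOCKED_SUFFIXES = ("google.com", "youtube.com", "googleapis.com")

def resolve_redirect(url):
    """Unwrap a google.com/url?q=... wrapper; return the target URL."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return url

def is_third_party(url):
    """True if the host is not a Google-owned property."""
    host = urlparse(url).netloc.lower()
    return not any(host == s or host.endswith("." + s) for s in BLOCKED_SUFFIXES)
```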

Each resolved URL is checked against processed_urls.txt on the NFS share before fetching. Articles that have already been converted are skipped. After a successful PDF export, the URL is appended to the tracking file.
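The dedup check reduces to a read-then-append on the tracking file; a minimal sketch (helper names are illustrative):

```python
import os

def load_processed(path):
    """Return the set of URLs already converted to PDF."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def mark_processed(path, url):
    """Append a URL to the tracking file after a successful PDF export."""
    with open(path, "a") as f:
        f.write(url + "\n")
```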

PDFs are generated using Chromium’s built-in print-to-PDF capability via Playwright:

page.pdf(path=f"/mnt/nfs_share/saved_articles/{safe_filename}.pdf", format="A4")

This produces clean, single-file documents with no external dependencies.

CronJob Spec (Stage 1 excerpt)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: web-to-pdf-job
  namespace: automation
spec:
  schedule: "0 */4 * * *"
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 600
      template:
        spec:
          nodeSelector:
            node-availability: "24x7"
          containers:
            - name: scraper
              image: z2tone/web-to-pdf:v1.0.0
              volumeMounts:
                - name: nfs-pdf
                  mountPath: /mnt/nfs_share
          volumes:
            - name: nfs-pdf
              persistentVolumeClaim:
                claimName: nfs-pdf-pvc
          restartPolicy: Never

Stage 2: morning-brief CronJob

PDF Ingestion and Text Extraction

The job reads up to 20 PDFs from /mnt/nfs_share/saved_articles/ — a practical ceiling that keeps the Gemini prompt within a reasonable token budget while covering a typical week’s worth of saved content.
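Selecting the batch is just a capped directory listing; a sketch, with the oldest-first ordering being an assumption (the post does not state how ties between more than 20 PDFs are broken):

```python
import glob
import os

MAX_PDFS = 20  # the post's cap on articles per run

def select_batch(directory, cap=MAX_PDFS):
    """Return up to `cap` PDFs from the working directory, oldest first."""
    pdfs = glob.glob(os.path.join(directory, "*.pdf"))
    pdfs.sort(key=os.path.getmtime)  # oldest saved articles first (assumed)
    return pdfs[:cap]
```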

Text is extracted with pdftotext (from the poppler-utils package), which handles the A4 PDFs produced by Stage 1 cleanly:

import subprocess

def extract_text(pdf_path):
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, timeout=30
    )
    return result.stdout.strip()

All article texts are concatenated into a single string, with a separator between each article to preserve boundaries in the prompt.
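A sketch of that concatenation step — the separator format is illustrative, since the post only says a separator preserves article boundaries:

```python
def build_corpus(articles):
    """Join (title, text) pairs with explicit boundaries for the prompt."""
    parts = []
    for i, (title, text) in enumerate(articles, 1):
        parts.append(f"=== ARTICLE {i}: {title} ===\n{text}")
    return "\n\n".join(parts)
```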

The Gemini Call

The entire payload — all articles combined — goes to Gemini 2.5 Flash in a single request. The prompt requests three structured sections:

  • OVERVIEW: 2–3 sentences on the dominant themes across all articles
  • ARTICLES: a short per-article summary
  • CONNECTIONS: cross-article patterns, recurring names, or contradicting viewpoints

One call per run. Not one call per article. This is the constraint that makes the free tier viable at any reasonable article volume.
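Assembling that single prompt might look like this — the wording is illustrative, as the post specifies only the three section names:

```python
def build_prompt(corpus):
    """Wrap the combined article text with the three-section instruction."""
    return (
        "You will receive several articles separated by headers.\n"
        "Produce exactly three sections:\n"
        "OVERVIEW: 2-3 sentences on the dominant themes across all articles.\n"
        "ARTICLES: a short summary of each article.\n"
        "CONNECTIONS: cross-article patterns, recurring names, "
        "or contradicting viewpoints.\n\n"
        + corpus
    )
```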

The call_gemini function handles transient API errors with exponential backoff:

import json
import time
import urllib.error
import urllib.request

def call_gemini(prompt, retries=3):
    payload = {'contents': [{'role': 'user', 'parts': [{'text': prompt}]}]}
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                GEMINI_URL,
                data=json.dumps(payload).encode(),
                headers={'Content-Type': 'application/json'})
            with urllib.request.urlopen(req, timeout=120) as r:
                return json.loads(r.read())['candidates'][0]['content']['parts'][0]['text']
        except urllib.error.HTTPError as e:
            if e.code in (429, 500, 503) and attempt < retries - 1:
                wait = 10 * (2 ** attempt)
                print(f"Gemini error {e.code}, retrying in {wait}s...")
                time.sleep(wait)
            else:
                raise

The 120-second timeout on urlopen is intentional — a batch of 20 dense articles can take Gemini 60–90 seconds to process.

Telegram Delivery

The Telegram Bot API has a 4096-character message limit. The digest is split at that boundary and each chunk is sent as a separate message to the configured chat:

import json
import urllib.request

def send_telegram(text, bot_token, chat_id):
    chunks = [text[i:i+4000] for i in range(0, len(text), 4000)]
    for chunk in chunks:
        payload = {'chat_id': chat_id, 'text': chunk}
        req = urllib.request.Request(
            f"https://api.telegram.org/bot{bot_token}/sendMessage",
            data=json.dumps(payload).encode(),
            headers={'Content-Type': 'application/json'})
        urllib.request.urlopen(req, timeout=30)

Archival

After a successful Telegram delivery, processed PDFs are moved to /mnt/nfs_share/saved_articles/archived/YYYY-MM-DD/ using the current date. This keeps the working directory clean for the next run and provides a browsable archive if you want to revisit what was covered on a specific day.

import os
import shutil
from datetime import date

archive_dir = f"/mnt/nfs_share/saved_articles/archived/{date.today().isoformat()}"
os.makedirs(archive_dir, exist_ok=True)
for pdf in processed_pdfs:
    shutil.move(pdf, archive_dir)

CronJob Spec (Stage 2 — full)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: morning-brief
  namespace: automation
spec:
  schedule: "0 15 * * *"
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 600
      template:
        spec:
          nodeSelector:
            node-availability: "24x7"
          containers:
            - name: brief
              image: z2tone/morning-brief:v1.0.0
              env:
                - name: GEMINI_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: openclaw-gemini-secret
                      key: GEMINI_API_KEY
                - name: TELEGRAM_BOT_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: telegram-secret
                      key: bot_token
                - name: TELEGRAM_CHAT_ID
                  valueFrom:
                    secretKeyRef:
                      name: telegram-secret
                      key: morning_brief_chat_id
              volumeMounts:
                - name: nfs-pdf
                  mountPath: /mnt/nfs_share
          volumes:
            - name: nfs-pdf
              persistentVolumeClaim:
                claimName: nfs-pdf-pvc
          restartPolicy: Never

Secrets

Two Kubernetes Secrets back the job:

  • telegram-secret: keys bot_token and morning_brief_chat_id
  • openclaw-gemini-secret: key GEMINI_API_KEY

Both are sealed with Sealed Secrets before being committed to the GitOps repo. Neither value appears in plaintext in version control.


Operational Notes

When the scraper stops finding articles: The Google session in google_storage_state.json has expired. Re-authenticate in a browser with Playwright’s codegen or a manual context.storage_state() export and overwrite the file on NFS. This is the only manual step in the entire pipeline.

Adjusting the article cap: The 20-PDF limit is a soft ceiling set in the brief script. If your reading habits generate significantly more content, increase it — but monitor token usage. Gemini 2.5 Flash’s context window is large, but very long prompts will increase latency and may push total tokens into billable territory.
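A rough pre-flight check before raising the cap — the ~4 characters-per-token figure is a common heuristic for English prose, not a Gemini guarantee, and the budget here is an illustrative ceiling:

```python
def estimate_tokens(text):
    """Very rough token estimate: ~4 characters per token for English prose."""
    return len(text) // 4

def within_budget(texts, budget=250_000):
    # 250k is an illustrative ceiling: well under Gemini 2.5 Flash's
    # 1M-token context window, but generous for a daily digest
    return estimate_tokens("\n\n".join(texts)) <= budget
```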

backoffLimit: 1 and activeDeadlineSeconds: 600: A single retry is enough for transient failures; the exponential backoff inside call_gemini handles API rate limits before they reach the pod level. The 10-minute deadline prevents a hung Playwright session from blocking node resources indefinitely.


Conclusion

The read-later pile doesn’t get smaller by reading faster — it gets smaller by changing what “reading” means. A batched Gemini digest converts passive article accumulation into an active, structured morning briefing, and the entire operational cost is one Kubernetes CronJob that runs for under two minutes a day.

The single-call batching strategy is worth emphasizing: per-article Gemini calls would hit free tier limits within a few days. Concatenating all content into one request with a structured output prompt keeps the monthly API cost at zero while delivering a richer analysis — cross-article connections are only visible when the model sees everything at once.

The full pipeline — scraper, brief, NFS volume, and sealed secrets — is managed as GitOps resources in ArgoCD and requires no ongoing maintenance beyond occasional session token refreshes.

This post is licensed under CC BY 4.0 by the author.