
Optimizing Large-Scale ISO Extraction on Kubernetes with Indexed Jobs

The Challenge: The 1Gbps Bottleneck

Managing a vast library of retro game archives (PS2, GameCube, etc.) requires efficient extraction. In my setup, both the source compressed archives and the destination ROM folders live on a shared NFS server connected via a 1Gbps link.

The traditional extraction process (pulling a 2GB archive to a node, extracting it into a 4GB ISO, and writing it back) moves roughly 6GB across the network for every archive. If too many workers run in parallel, the NFS server’s read/write heads thrash and the network link saturates.

The Solution: Tuned Indexed Jobs

I implemented a Kubernetes Indexed Job to handle the parallel extraction. By using indices, I can ensure each worker knows exactly which slice of the file list it is responsible for.
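The index-to-slice mapping can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the script used for this job: Kubernetes does set JOB_COMPLETION_INDEX in each pod of an Indexed Job, but shard_for_index, shard_from_env, and the TOTAL_SHARDS variable are hypothetical names, with TOTAL_SHARDS assumed to be set in the Job manifest to match completions.

```python
import os


def shard_for_index(files, index, total):
    """Return the files that worker `index` (0-based) out of `total` should process."""
    if not 0 <= index < total:
        raise ValueError("index must be in range(total)")
    # Sort so every pod derives its slice from the same ordering,
    # then take every `total`-th file starting at `index`.
    return sorted(files)[index::total]


def shard_from_env(files):
    """Pick this pod's slice using environment variables."""
    # JOB_COMPLETION_INDEX is set by Kubernetes for Indexed Jobs.
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    # TOTAL_SHARDS is a hypothetical variable, assumed to be set in the manifest.
    total = int(os.environ.get("TOTAL_SHARDS", "1"))
    return shard_for_index(files, index, total)
```

A strided slice like this spreads the sorted file list evenly across workers, and because every pod sorts the same list, no two indices receive the same file.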

1. High-Performance Node Selection

I pinned the job to my most capable worker nodes (<worker-node-1> and <worker-node-2>) using a nodeSelector that matches a label both nodes carry. This ensures that the CPU-intensive decompression (handled by 7z) doesn’t compete with more sensitive workloads on smaller nodes.

nodeSelector:
  gpuinstalled: "true"

2. Finding the Parallelism Sweet Spot

Through testing, I found that parallelism: 4 is the “Goldilocks” zone for a 1Gbps link. With four concurrent workers, I can consistently keep the link close to its ceiling (1Gbps is 125MB/s before protocol overhead) without causing the latency spikes associated with deeper queues.
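The arithmetic behind that ceiling can be worked through in a few lines of Python. This is a back-of-the-envelope sketch, assuming decimal units (1GB = 1000MB), the 2GB and 4GB sizes mentioned earlier, and no protocol overhead; the function names are illustrative only.

```python
def link_capacity_mb_per_s(gbps):
    """Convert a link speed in gigabits per second to megabytes per second."""
    return gbps * 1000 / 8  # 8 bits per byte


def transfer_seconds(size_gb, mb_per_s):
    """Ideal time to move `size_gb` gigabytes at `mb_per_s` megabytes per second."""
    return size_gb * 1000 / mb_per_s


capacity = link_capacity_mb_per_s(1)        # 125.0 MB/s for a 1Gbps link
read_time = transfer_seconds(2, capacity)   # 16.0 s to pull a 2GB archive
write_time = transfer_seconds(4, capacity)  # 32.0 s to write back a 4GB ISO
per_worker = capacity / 4                   # 31.25 MB/s each with four workers
```

Real throughput will be somewhat lower than these figures, so they are best read as a lower bound on per-archive transfer time rather than a prediction.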

3. Recursive and Robust Scripting

The extraction script uses find to recursively locate archives, so archives in nested directories are processed too. Each archive is extracted to the node’s local /tmp storage first and then copied to the final destination with rsync, which minimizes the duration of active NFS locks.
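The extract-locally-then-sync flow described above might look like the following in Python. This is a hedged sketch rather than the actual script, which uses find: the function names, the list of archive suffixes, and the scratch directory layout are all assumptions. It relies on the 7z x command with its -o and -y switches and on rsync -a, and it expects both tools to be present in the container image.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Assumed set of archive types; adjust to match the library.
ARCHIVE_SUFFIXES = {".7z", ".zip", ".rar"}


def find_archives(root):
    """Recursively list archives under `root`, sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in ARCHIVE_SUFFIXES
    )


def extract_cmd(archive, workdir):
    """Build the 7z command: extract with full paths, answer yes to prompts."""
    return ["7z", "x", str(archive), f"-o{workdir}", "-y"]


def sync_cmd(workdir, dest):
    """Build the rsync command; the trailing slash copies the directory's contents."""
    return ["rsync", "-a", f"{workdir}/", f"{dest}/"]


def process(archive, dest, scratch="/tmp"):
    """Extract one archive to local scratch space, then copy the result to `dest`."""
    workdir = tempfile.mkdtemp(prefix="extract-", dir=scratch)
    try:
        subprocess.run(extract_cmd(archive, workdir), check=True)
        subprocess.run(sync_cmd(workdir, dest), check=True)
    finally:
        # Always free local disk, even if extraction or sync fails.
        shutil.rmtree(workdir, ignore_errors=True)
```

Keeping the command construction in small functions makes it possible to check the exact arguments without running 7z or rsync, and the finally block keeps a failed extraction from filling the node’s local disk.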

Conclusion

By treating file extraction as a first-class Kubernetes workload and tuning it to match the physical constraints of my network, I’ve transformed a manual, slow process into a fast, set-and-forget automation.

This post is licensed under CC BY 4.0 by the author.