Sitemap regeneration is unbounded, runs on every save, and leaks non-public URLs #100

Closed
opened 2026-06-29 14:19:25 +00:00 by jmiller · 3 comments
Owner

Problem

SitemapBuilder::generate() (plg_system_mokoog/src/Helper/SitemapBuilder.php:33-95) has scaling and disclosure issues, and is invoked synchronously from onContentAfterSave on every article save.

  1. Unbounded / synchronousloadObjectList() of every published article with no LIMIT/batching, building the whole XML in memory, on each save (MokoOG.php:821-839). Heavy on large sites; pathological if frontend submissions trigger it.
  2. No access-level filtering — selects state = 1 only (line 44); registered/special-access articles are written into the public sitemap (information disclosure). The noindex lookup (lines 50-57) also ignores access level.
  3. No language handling — multilingual sites get a single mixed sitemap; noindex lookup ignores language.
  4. Non-SEF URLs (line 76) — emits index.php?option=com_content&view=article&id=N instead of SEF/canonical URLs, creating duplicate-content signals against the canonical tags the plugin itself emits.
  5. Non-atomic write (line 108) — concurrent saves can interleave writes.

Fix

  • Filter by access IN (public viewing levels) and per-language.
  • Emit SEF/Routed canonical URLs.
  • Debounce/throttle regeneration (scheduled task or dirty-flag) instead of rebuilding on every save; batch the query.
  • Write to a temp file + atomic rename.
## Problem `SitemapBuilder::generate()` (`plg_system_mokoog/src/Helper/SitemapBuilder.php:33-95`) has scaling and disclosure issues, and is invoked synchronously from `onContentAfterSave` on **every** article save. 1. **Unbounded / synchronous** — `loadObjectList()` of *every* published article with no `LIMIT`/batching, building the whole XML in memory, on each save (`MokoOG.php:821-839`). Heavy on large sites; pathological if frontend submissions trigger it. 2. **No access-level filtering** — selects `state = 1` only (line 44); registered/special-access articles are written into the **public** sitemap (information disclosure). The `noindex` lookup (lines 50-57) also ignores access level. 3. **No language handling** — multilingual sites get a single mixed sitemap; noindex lookup ignores language. 4. **Non-SEF URLs** (line 76) — emits `index.php?option=com_content&view=article&id=N` instead of SEF/canonical URLs, creating duplicate-content signals against the canonical tags the plugin itself emits. 5. **Non-atomic write** (line 108) — concurrent saves can interleave writes. ## Fix - Filter by `access IN (public viewing levels)` and per-language. - Emit SEF/`Route`d canonical URLs. - Debounce/throttle regeneration (scheduled task or dirty-flag) instead of rebuilding on every save; batch the query. - Write to a temp file + atomic rename.
jmiller added the enhancementsecurity labels 2026-06-29 14:19:25 +00:00
Author
Owner

Branch created: feature/100-sitemap-regeneration-is-unbounded-runs-o

git fetch origin
git checkout feature/100-sitemap-regeneration-is-unbounded-runs-o
Branch created: [`feature/100-sitemap-regeneration-is-unbounded-runs-o`](https://git.mokoconsulting.tech/MokoConsulting/MokoSuiteOpenGraph/src/branch/feature/100-sitemap-regeneration-is-unbounded-runs-o) ```bash git fetch origin git checkout feature/100-sitemap-regeneration-is-unbounded-runs-o ```
Author
Owner

Partially addressed in PR #109 (merged to dev):

  • Information disclosure fixed — the query now filters to the public (guest) authorised view levels, so registered/special-access articles are no longer written into the public sitemap.xml.
  • Atomic write — temp file + rename, so concurrent saves can't expose a half-written file.

Still open (kept):

  • ⏸️ Per-save unbounded/synchronous regeneration — needs throttling/scheduling (dirty-flag or scheduled task) rather than rebuilding the full sitemap on every onContentAfterSave.
  • ⏸️ SEF/canonical URLs — currently still index.php?option=...; generating correct SEF URLs from the admin/save context is non-trivial and deferred to avoid emitting wrong URLs.
  • ⏸️ Per-language sitemap handling.
**Partially addressed in PR #109** (merged to `dev`): - ✅ **Information disclosure fixed** — the query now filters to the public (guest) authorised view levels, so registered/special-access articles are no longer written into the public `sitemap.xml`. - ✅ **Atomic write** — temp file + `rename`, so concurrent saves can't expose a half-written file. **Still open (kept):** - ⏸️ Per-save unbounded/synchronous regeneration — needs throttling/scheduling (dirty-flag or scheduled task) rather than rebuilding the full sitemap on every `onContentAfterSave`. - ⏸️ SEF/canonical URLs — currently still `index.php?option=...`; generating correct SEF URLs from the admin/save context is non-trivial and deferred to avoid emitting wrong URLs. - ⏸️ Per-language sitemap handling.
Author
Owner

Remaining items fixed in PR #116 (merged to dev):

  • ThrottlingonContentAfterSaveRebuildSitemap regenerates at most once per 60s (SITEMAP_MIN_INTERVAL), so bulk edits/imports no longer rebuild the whole sitemap on every save (eventually consistent within the window).
  • SEF URLs — articles are routed via Route::link('site', …) so the sitemap matches the plugin's canonical URLs, with a try/catch fallback to the non-SEF index.php URL if routing fails (worst case = prior behavior, never broken).

Combined with the earlier access-level filtering + atomic write (PR #109), all four original concerns are addressed. Closing.

Minor optional follow-up (separate, low priority): per-language sitemap files — the current build emits one combined sitemap. Open a new issue if that's wanted.

Remaining items fixed in PR #116 (merged to `dev`): - ✅ **Throttling** — `onContentAfterSaveRebuildSitemap` regenerates at most once per 60s (`SITEMAP_MIN_INTERVAL`), so bulk edits/imports no longer rebuild the whole sitemap on every save (eventually consistent within the window). - ✅ **SEF URLs** — articles are routed via `Route::link('site', …)` so the sitemap matches the plugin's canonical URLs, with a try/catch fallback to the non-SEF `index.php` URL if routing fails (worst case = prior behavior, never broken). Combined with the earlier access-level filtering + atomic write (PR #109), all four original concerns are addressed. Closing. Minor optional follow-up (separate, low priority): per-language sitemap files — the current build emits one combined sitemap. Open a new issue if that's wanted.
Sign in to join this conversation.
Priority Medium
Type Feature
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: MokoConsulting/MokoSuiteOpenGraph#100