fix: return successful scrapes for empty 2xx pages #3131

Open
tsubasakong wants to merge 2 commits into firecrawl:main from tsubasakong:fix/2316-empty-page-success

Conversation

@tsubasakong tsubasakong commented Mar 12, 2026

Summary

  • treat 2xx scrape results with no page error and no extractable body text as successful instead of cascading into SCRAPE_ALL_ENGINES_FAILED
  • keep the existing waterfall behavior for genuinely unsuccessful scrapes, but include the empty-page factor in the quality logs
  • add a focused regression test for the empty-page text detection helper
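The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `EngineResult` shape, the `isSuccessful` function, and the `hasUsableContent` flag are hypothetical names, and the body of `hasNoExtractableText` past its first line is a simplified stand-in, since only that line is visible in the review context.

```typescript
// Hypothetical result shape; the real engine result type in the repo may differ.
interface EngineResult {
  statusCode: number;
  pageError?: string;
  html: string;
}

// Simplified stand-in for the PR's helper. Only the first line mirrors the
// diff shown in the review; the stripping steps are assumptions.
function hasNoExtractableText(html: string): boolean {
  const body = html.match(/<body\b[^>]*>([\s\S]*?)<\/body>/i)?.[1] ?? html;
  const text = body
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "") // drop script/style blocks
    .replace(/<[^>]+>/g, "") // drop remaining tags
    .replace(/&nbsp;/gi, " ") // treat &nbsp; as whitespace
    .trim();
  return text.length === 0;
}

// A page counts as "explicitly empty" only when all three conditions hold:
// a 2xx status, no page error, and no extractable text.
function isExplicitlyEmptyPage(r: EngineResult): boolean {
  const is2xx = r.statusCode >= 200 && r.statusCode < 300;
  return is2xx && !r.pageError && hasNoExtractableText(r.html);
}

// hasUsableContent stands for whatever quality check the waterfall already
// applies (assumed here). Empty 2xx pages short-circuit to success; anything
// else falls through to the existing behavior.
function isSuccessful(r: EngineResult, hasUsableContent: boolean): boolean {
  return hasUsableContent || isExplicitlyEmptyPage(r);
}
```

With this shape, an empty `200` response is reported as a successful scrape, while an empty `500` response or one carrying a page error still continues down the waterfall.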

Testing

  • pnpm exec jest src/scraper/scrapeURL/empty-page.test.ts --runInBand
  • git diff --check

Notes

  • no repo PR template was present in .github/ at checkout time

Summary by cubic

Treat 2xx scrapes with no page error and no extractable text as successful to prevent false SCRAPE_ALL_ENGINES_FAILED and avoid unnecessary fallbacks. Adds hasNoExtractableText and updates quality logs and tests.

  • Bug Fixes
    • Mark explicitly empty pages (2xx, no error, no text) as successful; keep waterfall for real failures or bad status codes.
    • Include isExplicitlyEmptyPage in success/failure and proxy-adequacy logs.
    • Add a focused regression test for hasNoExtractableText.

Written for commit a6c4229. Summary will update on new commits.

@tsubasakong tsubasakong requested a review from mogery as a code owner March 12, 2026 21:21
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/api/src/scraper/scrapeURL/emptyPage.ts">

<violation number="1" location="apps/api/src/scraper/scrapeURL/emptyPage.ts:2">
P2: Empty-page detection can produce false negatives (head-only/bodyless docs and non-`&nbsp;` invisible entities), causing valid empty 2xx scrapes to be marked unsuccessful.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@@ -0,0 +1,16 @@
export function hasNoExtractableText(html: string): boolean {
const body = html.match(/<body\b[^>]*>([\s\S]*?)<\/body>/i)?.[1] ?? html;
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 12, 2026

P2: Empty-page detection can produce false negatives (head-only/bodyless docs and non-&nbsp; invisible entities), causing valid empty 2xx scrapes to be marked unsuccessful.
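One possible way to address both cases is sketched below. It is an illustration under stated assumptions, not a patch to `emptyPage.ts`: the function name `hasNoExtractableTextLenient` and the two constants are hypothetical, and the entity and character lists are examples, not exhaustive. The idea is to strip `<head>`, `<script>`, and `<style>` blocks and comments before looking for a body, so bodyless documents fall back to cleaned markup, and to drop a wider set of invisible entities and characters than `&nbsp;` alone.

```typescript
// Entities that render as nothing visible (illustrative list, an assumption).
const INVISIBLE_ENTITY =
  /&(?:nbsp|ensp|emsp|thinsp|zwnj|zwj|lrm|rlm|shy|#160|#8203|#x0*a0|#x200b);/gi;

// Whitespace plus common invisible code points: NBSP, soft hyphen,
// zero-width characters, directional marks, word joiner, BOM.
const INVISIBLE_CHARS = /[\s\u00a0\u00ad\u200b-\u200f\u2060\ufeff]/g;

function hasNoExtractableTextLenient(html: string): boolean {
  // Remove comments and non-content containers first, so a head-only
  // document does not leak its <title> text into the fallback path.
  const withoutNonContent = html
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<(head|script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "");

  // Prefer the body contents; fall back to the cleaned document if absent.
  const scope =
    withoutNonContent.match(/<body\b[^>]*>([\s\S]*?)<\/body>/i)?.[1] ??
    withoutNonContent;

  const text = scope
    .replace(/<[^>]+>/g, "") // drop remaining tags
    .replace(INVISIBLE_ENTITY, "") // drop invisible entities
    .replace(INVISIBLE_CHARS, ""); // drop whitespace and invisible characters

  return text.length === 0;
}
```

The `\b` after the tag name keeps `<header>` from being treated as `<head>`, so visible header text still counts as content. A regex-based approach remains approximate; an HTML parser would be more robust if one is already available in the scrape path.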

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/scraper/scrapeURL/emptyPage.ts, line 2:

<comment>Empty-page detection can produce false negatives (head-only/bodyless docs and non-`&nbsp;` invisible entities), causing valid empty 2xx scrapes to be marked unsuccessful.</comment>

<file context>
@@ -0,0 +1,16 @@
+export function hasNoExtractableText(html: string): boolean {
+  const body = html.match(/<body\b[^>]*>([\s\S]*?)<\/body>/i)?.[1] ?? html;
+
+  const text = body
</file context>
