fix: handle empty webpage content without 500 error (#3213)
withsivram wants to merge 1 commit into firecrawl:main
Conversation
When scraping a valid but empty webpage (good HTTP status, no page errors, but no extractable content), the engine loop previously treated the empty result as an engine failure. All engines would fail this check, leading to a NoEnginesLeftError and a misleading 500 response with SCRAPE_ALL_ENGINES_FAILED.

The fix adds a third condition to the success check: if the status code is good (2xx/304) and there is no page-level error, the scrape is treated as successful even when the content is empty. This returns a 200 with empty content instead of a 500 error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
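The success check described above can be sketched as a standalone predicate. This is an illustrative simplification: the factor names (`isLongEnough`, `isGoodStatusCode`, `hasNoPageError`) match the diff, but the `EngineResult` shape and the helper function are assumptions, not the real module code.

```typescript
// Simplified sketch of the success check described in the commit message.
// The three factor names come from the diff; everything else is illustrative.
interface EngineResult {
  markdown: string;
  statusCode: number;
  error?: string;
}

function isSuccessfulScrape(result: EngineResult): boolean {
  const isLongEnough = result.markdown.trim().length > 0;
  const isGoodStatusCode =
    (result.statusCode >= 200 && result.statusCode < 300) ||
    result.statusCode === 304;
  const hasNoPageError = result.error === undefined;

  // Old condition: isLongEnough || !isGoodStatusCode
  // The fix adds the third clause: a good status with no page-level error
  // is accepted even when the extracted content is empty.
  return isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError);
}
```

With this predicate, an empty page with a 200 status passes, while an empty page whose engine reported a page-level error still fails and lets the loop try the next engine.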
Pull request overview
Updates the scrape engine loop success criteria so that valid, content-empty pages no longer surface as SCRAPE_ALL_ENGINES_FAILED (500), aligning the API response with the “empty but successful” outcome described in issue #2316.
Changes:
- Treats 2xx/304 + no page-level error as a successful scrape even when extracted content is empty.
- Adds targeted logging to distinguish “successful but empty” scrapes from typical successes.
```ts
// When the page returns a good status code but has no content, this is a
// valid empty page — not an engine failure. We accept the result so the
// caller gets a 200 with empty content instead of a misleading 500
// SCRAPE_ALL_ENGINES_FAILED error. (fixes #2316)
if (isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError)) {
  if (!isLongEnough && isGoodStatusCode && hasNoPageError) {
    meta.logger.info(
      "Scrape via " + engine + " returned empty content with a good status code. Treating as successful empty page.",
      {
        factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
        statusCode: engineResult.statusCode,
      },
    );
  } else {
    meta.logger.info("Scrape via " + engine + " deemed successful.", {
      factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
    });
  }
```
The new success condition ... || (isGoodStatusCode && hasNoPageError) will treat any 2xx/304 response without engineResult.error as successful even when extracted markdown is empty. Since engines like fetch/index typically leave error undefined, this effectively disables the “try fallbacks when content is empty” behavior and can return empty content for pages where a later engine (e.g. Fire Engine / Playwright) would have produced text. Consider narrowing the empty-content acceptance to cases where you’ve exhausted fallbacks (or returning an “empty-but-acceptable” candidate only if all engines yield empty) so you still waterfall to other engines when the first result has no extractable content.
Suggested change:

```diff
- // When the page returns a good status code but has no content, this is a
- // valid empty page — not an engine failure. We accept the result so the
- // caller gets a 200 with empty content instead of a misleading 500
- // SCRAPE_ALL_ENGINES_FAILED error. (fixes #2316)
- if (isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError)) {
-   if (!isLongEnough && isGoodStatusCode && hasNoPageError) {
-     meta.logger.info(
-       "Scrape via " + engine + " returned empty content with a good status code. Treating as successful empty page.",
-       {
-         factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
-         statusCode: engineResult.statusCode,
-       },
-     );
-   } else {
-     meta.logger.info("Scrape via " + engine + " deemed successful.", {
-       factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
-     });
-   }
+ // When the page returns a good status code but has no content, we do not
+ // treat this as a successful scrape here, so that the caller can continue
+ // to waterfall to other engines that may be able to extract content. The
+ // decision to ultimately accept an empty result should be made only after
+ // fallbacks are exhausted.
+ if (isLongEnough || !isGoodStatusCode) {
+   meta.logger.info("Scrape via " + engine + " deemed successful.", {
+     factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
+   });
```
```ts
// When the page returns a good status code but has no content, this is a
// valid empty page — not an engine failure. We accept the result so the
// caller gets a 200 with empty content instead of a misleading 500
// SCRAPE_ALL_ENGINES_FAILED error. (fixes #2316)
if (isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError)) {
  if (!isLongEnough && isGoodStatusCode && hasNoPageError) {
    meta.logger.info(
      "Scrape via " + engine + " returned empty content with a good status code. Treating as successful empty page.",
      {
        factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
        statusCode: engineResult.statusCode,
      },
    );
```
This change introduces a new “empty content but successful” path for good status codes with no engineResult.error, but there doesn’t appear to be a regression test asserting that an empty (but valid) HTML page returns success: true with empty markdown (and that we don’t raise SCRAPE_ALL_ENGINES_FAILED). Adding a unit test around this branch would help prevent future changes from reintroducing the 500 behavior.
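A regression test along these lines could cover the branch. This is a sketch under stated assumptions: it exercises a standalone copy of the success predicate rather than the real `scrapeURLLoopIter`, whose module layout is not shown in this diff, so wiring it into the actual test suite would replace the local copy with the real function.

```typescript
// Sketch of a regression test for the "valid empty page" branch.
// deemedSuccessful is a local stand-in mirroring the condition in the diff.
function deemedSuccessful(markdown: string, statusCode: number, error?: string): boolean {
  const isLongEnough = markdown.trim().length > 0;
  const isGoodStatusCode =
    (statusCode >= 200 && statusCode < 300) || statusCode === 304;
  const hasNoPageError = error === undefined;
  return isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError);
}

// Regression: an empty page with a 200 must be accepted, not escalated to
// SCRAPE_ALL_ENGINES_FAILED.
console.assert(deemedSuccessful("", 200) === true, "empty 200 page must succeed");

// A page-level error with empty content must still count as a failure, so
// the loop can waterfall to the next engine.
console.assert(deemedSuccessful("", 200, "render timeout") === false, "errored empty page must fail");
```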
```ts
if (isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError)) {
  if (!isLongEnough && isGoodStatusCode && hasNoPageError) {
```
PR description/test plan says 4xx/5xx status codes should still attempt fallback engines, but this success check (and the prior behavior) returns immediately when !isGoodStatusCode is true, so the loop will typically not try additional engines for 4xx/5xx responses (except the special proxy-handling case above). Please align the PR description/test plan with the actual behavior, or adjust the logic if fallback-on-bad-status is intended.
Summary
- Valid but empty pages previously surfaced as a 500 SCRAPE_ALL_ENGINES_FAILED error.
- In scrapeURLLoopIter: if the status code is good (2xx/304) and there is no page-level error, treat the scrape as successful even when content is empty.

Fixes #2316
Test plan
- Scrape a valid page with no extractable content (e.g. `<html><body></body></html>`): should return 200 with empty markdown, not 500.

🤖 Generated with Claude Code
Summary by cubic
Return 200 with empty content for valid webpages that have no extractable text, instead of a 500 SCRAPE_ALL_ENGINES_FAILED.

Adds a success condition in scrapeURLLoopIter for good status (2xx/304) with no page-level error; fallback behavior for bad status or page-level errors is unchanged (fixes #2316).

Written for commit 1aadaa8. Summary will update on new commits.