Skip to content

fix: handle empty webpage content without 500 error#3213

Open
withsivram wants to merge 1 commit intofirecrawl:mainfrom
withsivram:fix/empty-page-500-error-issue-2316
Open

fix: handle empty webpage content without 500 error#3213
withsivram wants to merge 1 commit intofirecrawl:mainfrom
withsivram:fix/empty-page-500-error-issue-2316

Conversation

@withsivram
Copy link

@withsivram withsivram commented Mar 22, 2026

Summary

  • When scraping a valid but content-empty webpage (HTTP 200, no page errors, but no extractable text), the scraper now returns a successful response with empty content instead of throwing a misleading 500 SCRAPE_ALL_ENGINES_FAILED error.
  • The fix adds a third condition to the engine success check in scrapeURLLoopIter: if the status code is good (2xx/304) and there is no page-level error, treat the scrape as successful even when content is empty.
  • Existing behavior for pages with content, bad status codes, or page-level errors is unchanged.

Fixes #2316

Test plan

  • Scrape a valid but empty HTML page (e.g., <html><body></body></html>) — should return 200 with empty markdown, not 500
  • Scrape a normal page with content — should still return 200 with content (no regression)
  • Scrape a page that returns a 4xx/5xx status code — should still attempt fallback engines as before
  • Scrape a page where the engine reports a page-level error but status is 200 — should still try fallback engines

🤖 Generated with Claude Code


Summary by cubic

Return 200 with empty content for valid webpages that have no extractable text, instead of a 500 SCRAPE_ALL_ENGINES_FAILED.
Adds a success condition in scrapeURLLoopIter for good status (2xx/304) with no page-level error; fallback behavior for bad status or page-level errors is unchanged (fixes #2316).

Written for commit 1aadaa8. Summary will update on new commits.

…2316)

When scraping a valid but empty webpage (good HTTP status, no page
errors, but no extractable content), the engine loop previously treated
the empty result as an engine failure. All engines would fail this check,
leading to a NoEnginesLeftError and a misleading 500 response with
SCRAPE_ALL_ENGINES_FAILED.

The fix adds a third condition to the success check: if the status code
is good (2xx/304) and there is no page-level error, the scrape is
treated as successful even when the content is empty. This returns a 200
with empty content instead of a 500 error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@withsivram withsivram requested a review from mogery as a code owner March 22, 2026 06:41
Copilot AI review requested due to automatic review settings March 22, 2026 06:41
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 1 file

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the scrape engine loop success criteria so that valid, content-empty pages no longer surface as SCRAPE_ALL_ENGINES_FAILED (500), aligning the API response with the “empty but successful” outcome described in issue #2316.

Changes:

  • Treats 2xx/304 + no page-level error as a successful scrape even when extracted content is empty.
  • Adds targeted logging to distinguish “successful but empty” scrapes from typical successes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +440 to +457
// When the page returns a good status code but has no content, this is a
// valid empty page — not an engine failure. We accept the result so the
// caller gets a 200 with empty content instead of a misleading 500
// SCRAPE_ALL_ENGINES_FAILED error. (fixes #2316)
if (isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError)) {
if (!isLongEnough && isGoodStatusCode && hasNoPageError) {
meta.logger.info(
"Scrape via " + engine + " returned empty content with a good status code. Treating as successful empty page.",
{
factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
statusCode: engineResult.statusCode,
},
);
} else {
meta.logger.info("Scrape via " + engine + " deemed successful.", {
factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
});
}
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new success condition ... || (isGoodStatusCode && hasNoPageError) will treat any 2xx/304 response without engineResult.error as successful even when extracted markdown is empty. Since engines like fetch/index typically leave error undefined, this effectively disables the “try fallbacks when content is empty” behavior and can return empty content for pages where a later engine (e.g. Fire Engine / Playwright) would have produced text. Consider narrowing the empty-content acceptance to cases where you’ve exhausted fallbacks (or returning an “empty-but-acceptable” candidate only if all engines yield empty) so you still waterfall to other engines when the first result has no extractable content.

Suggested change
// When the page returns a good status code but has no content, this is a
// valid empty page — not an engine failure. We accept the result so the
// caller gets a 200 with empty content instead of a misleading 500
// SCRAPE_ALL_ENGINES_FAILED error. (fixes #2316)
if (isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError)) {
if (!isLongEnough && isGoodStatusCode && hasNoPageError) {
meta.logger.info(
"Scrape via " + engine + " returned empty content with a good status code. Treating as successful empty page.",
{
factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
statusCode: engineResult.statusCode,
},
);
} else {
meta.logger.info("Scrape via " + engine + " deemed successful.", {
factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
});
}
// When the page returns a good status code but has no content, we do not
// treat this as a successful scrape here, so that the caller can continue
// to waterfall to other engines that may be able to extract content. The
// decision to ultimately accept an empty result should be made only after
// fallbacks are exhausted.
if (isLongEnough || !isGoodStatusCode) {
meta.logger.info("Scrape via " + engine + " deemed successful.", {
factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
});

Copilot uses AI. Check for mistakes.
Comment on lines +440 to +452
// When the page returns a good status code but has no content, this is a
// valid empty page — not an engine failure. We accept the result so the
// caller gets a 200 with empty content instead of a misleading 500
// SCRAPE_ALL_ENGINES_FAILED error. (fixes #2316)
if (isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError)) {
if (!isLongEnough && isGoodStatusCode && hasNoPageError) {
meta.logger.info(
"Scrape via " + engine + " returned empty content with a good status code. Treating as successful empty page.",
{
factors: { isLongEnough, isGoodStatusCode, hasNoPageError },
statusCode: engineResult.statusCode,
},
);
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces a new “empty content but successful” path for good status codes with no engineResult.error, but there doesn’t appear to be a regression test asserting that an empty (but valid) HTML page returns success: true with empty markdown (and that we don’t raise SCRAPE_ALL_ENGINES_FAILED). Adding a unit test around this branch would help prevent future changes from reintroducing the 500 behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +444 to +445
if (isLongEnough || !isGoodStatusCode || (isGoodStatusCode && hasNoPageError)) {
if (!isLongEnough && isGoodStatusCode && hasNoPageError) {
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description/test plan says 4xx/5xx status codes should still attempt fallback engines, but this success check (and the prior behavior) returns immediately when !isGoodStatusCode is true, so the loop will typically not try additional engines for 4xx/5xx responses (except the special proxy-handling case above). Please align the PR description/test plan with the actual behavior, or adjust the logic if fallback-on-bad-status is intended.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Incorrect 500 Error When Scraping Empty Webpage Content

2 participants