Skip to content

feat: /parse API (ENG-4608)#3137

Draft
mogery wants to merge 5 commits intomainfrom
mog/parse-api
Draft

feat: /parse API (ENG-4608)#3137
mogery wants to merge 5 commits intomainfrom
mog/parse-api

Conversation

@mogery
Copy link
Member

@mogery mogery commented Mar 13, 2026

Summary by cubic

Add a new /v2/parse API to upload and parse local files (HTML, PDF, DOCX, etc.) through the existing scraping pipeline, returning a Document in requested formats. Implements ENG-4608 with full SDK support, correct parse billing, dedicated logging, and explicit rejection of browser-only options.

  • New Features

    • API: POST /v2/parse (multipart/form-data). Fields: file (binary) and options (JSON, scrape‑shaped subset). 50 MB limit. Validates input and returns a standard Document. New error: PARSE_UNSUPPORTED_OPTIONS.
    • Pipeline: In‑memory uploads flow through scrape-worker via uploadedFile; forces fetch pipeline (skips index engine), adds HTML charset detection from headers/meta, disables caching and search indexing, logs to the parses table, avoids GCS uploads, bills under the parse endpoint, and honors zero data retention.
    • SDKs/Tests/Docs: Added parse to JS, Python (sync/async), Java, and Rust with multipart support and parse‑specific validation (disallows change tracking, screenshot, branding, actions, waitFor, location, mobile). Unit/e2e tests, including billing; README examples across SDKs.
  • Dependencies

    • API: added multer and @types/multer.
    • JS SDK: HTTP client supports FormData and no longer forces JSON Content‑Type.
    • Python SDK: multipart helpers and post_multipart for sync/async clients.
    • Java SDK: multipart POST support plus ParseFile and ParseOptions.
    • Rust SDK: enabled reqwest "multipart" and added mime_guess.

Written for commit 4888b93. Summary will update on new commits.

@blacksmith-sh
Copy link
Contributor

blacksmith-sh bot commented Mar 13, 2026

Found 4 test failures on Blacksmith runners:

Failures

Test View Logs
src/tests/snips/v2/parse.test.ts//v2/
parse rejects changeTracking format for parse uploads
View Logs
src/tests/snips/v2/parse.test.ts//v2/
parse rejects changeTracking format for parse uploads
View Logs
src/tests/snips/v2/parse.test.ts//v2/
parse rejects changeTracking format for parse uploads
View Logs
src/tests/snips/v2/parse.test.ts//v2/
parse rejects changeTracking format for parse uploads
View Logs

Fix in Cursor

@mogery mogery changed the title feat: /parse API feat: /parse API (ENG-4608) Mar 13, 2026
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

18 issues found across 45 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/java-sdk/src/main/java/com/firecrawl/models/ParseFile.java">

<violation number="1" location="apps/java-sdk/src/main/java/com/firecrawl/models/ParseFile.java:21">
P2: Return a defensive copy from `getContent()` so callers cannot mutate the stored file bytes.</violation>
</file>

<file name="apps/api/src/routes/v2.ts">

<violation number="1" location="apps/api/src/routes/v2.ts:229">
P2: Check credits before buffering the multipart file. As written, exhausted accounts can still force up to 50 MB into memory before the request is rejected.</violation>
</file>

<file name="apps/js-sdk/firecrawl/src/v2/client.ts">

<violation number="1" location="apps/js-sdk/firecrawl/src/v2/client.ts:143">
P2: Add the same schema-aware overloads as `scrape`; otherwise `parse(..., { formats: [{ type: 'json', schema }] })` always returns `json` as `unknown` to TypeScript callers.</violation>
</file>

<file name="apps/python-sdk/firecrawl/__tests__/e2e/v2/aio/test_aio_parse.py">

<violation number="1" location="apps/python-sdk/firecrawl/__tests__/e2e/v2/aio/test_aio_parse.py:13">
P2: Treat whitespace-only `API_KEY`/`API_URL` as missing in the skip condition so this e2e test does not run with invalid environment values.

(Based on your team's feedback about rejecting empty and whitespace-only API keys.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/python-sdk/firecrawl/__tests__/unit/v2/methods/test_parse_request_preparation.py">

<violation number="1" location="apps/python-sdk/firecrawl/__tests__/unit/v2/methods/test_parse_request_preparation.py:51">
P2: Avoid a hard-coded “missing” file path here; it makes the test environment-dependent and potentially flaky if that path ever exists.</violation>
</file>

<file name="apps/python-sdk/firecrawl/__tests__/e2e/v2/test_parse.py">

<violation number="1" location="apps/python-sdk/firecrawl/__tests__/e2e/v2/test_parse.py:13">
P2: Trim the environment variables in the skip condition so whitespace-only `API_KEY`/`API_URL` values do not run this e2e test.

(Based on your team's feedback about rejecting whitespace-only API keys during validation.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/java-sdk/src/test/java/com/firecrawl/FirecrawlClientTest.java">

<violation number="1" location="apps/java-sdk/src/test/java/com/firecrawl/FirecrawlClientTest.java:235">
P2: Assert the uploaded HTML content appears in the parsed markdown; a non-empty check alone can let this E2E test pass on unrelated output.</violation>
</file>

<file name="apps/python-sdk/firecrawl/v2/client.py">

<violation number="1" location="apps/python-sdk/firecrawl/v2/client.py:191">
P3: Include `pathlib.Path` in the new `parse()` file type so the public SDK signature matches the implementation.</violation>
</file>

<file name="apps/python-sdk/firecrawl/v2/methods/aio/parse.py">

<violation number="1" location="apps/python-sdk/firecrawl/v2/methods/aio/parse.py:18">
P1: Duplicate function `_prepare_file_payload` exists in both `parse.py` and `aio/parse.py` with identical implementation. The two helper functions share the same name, same logic, same docstring structure, and same return type. This violates DRY principles and creates a maintenance burden where any bug fix or enhancement to file payload preparation must be manually synchronized in both files.</violation>

<violation number="2" location="apps/python-sdk/firecrawl/v2/methods/aio/parse.py:27">
P2: Avoid synchronous file reads in the async `parse` path; large uploads will block the event loop and reduce concurrency.</violation>
</file>

<file name="apps/python-sdk/firecrawl/v2/utils/http_client.py">

<violation number="1" location="apps/python-sdk/firecrawl/v2/utils/http_client.py:134">
P2: Merge caller headers into multipart-safe defaults; otherwise passing `_prepare_headers(...)` into `post_multipart` sends `Content-Type: application/json` on a multipart body.</violation>
</file>

<file name="apps/api/src/services/logging/log_job.ts">

<violation number="1" location="apps/api/src/services/logging/log_job.ts:236">
P2: Parse jobs are still persisted to GCS as `scrape`, so the new `parses` path produces inconsistent metadata.</violation>
</file>

<file name="apps/api/src/__tests__/snips/v2/parse.test.ts">

<violation number="1" location="apps/api/src/__tests__/snips/v2/parse.test.ts:256">
P3: Do not swallow Supabase query errors inside the polling callback; fail immediately so the real DB error is visible.</violation>
</file>

<file name="apps/js-sdk/firecrawl/src/v2/methods/parse.ts">

<violation number="1" location="apps/js-sdk/firecrawl/src/v2/methods/parse.ts:63">
P2: `ParseOptions.timeout` is ignored here because multipart requests bypass the HTTP client's timeout override path.</violation>
</file>

<file name="apps/api/src/scraper/scrapeURL/engines/fetch/index.ts">

<violation number="1" location="apps/api/src/scraper/scrapeURL/engines/fetch/index.ts:36">
P2: This bypasses the engine's charset re-decoding logic, so non-UTF-8 HTML uploads will be mis-decoded.</violation>
</file>

<file name="apps/rust-sdk/src/v2/parse.rs">

<violation number="1" location="apps/rust-sdk/src/v2/parse.rs:78">
P2: `parse` accepts `ScrapeOptions`, but that type serializes several advanced parse settings in a wire format the `/v2/parse` schema rejects.</violation>

<violation number="2" location="apps/rust-sdk/src/v2/parse.rs:90">
P2: Trim the filename before attaching it to the multipart part; otherwise filenames with surrounding whitespace can fail server-side file-type detection.</violation>
</file>

<file name="apps/python-sdk/firecrawl/v2/methods/parse.py">

<violation number="1" location="apps/python-sdk/firecrawl/v2/methods/parse.py:17">
P3: Path support was added internally, but the public parse methods still omit `Path` from their type hints.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 20 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/python-sdk/firecrawl/v2/methods/aio/parse.py">

<violation number="1" location="apps/python-sdk/firecrawl/v2/methods/aio/parse.py:33">
P1: `asyncio.to_thread()` drops Python 3.8 compatibility, but the SDK still advertises `>=3.8` support.</violation>
</file>

<file name="apps/rust-sdk/src/v2/parse.rs">

<violation number="1" location="apps/rust-sdk/src/v2/parse.rs:82">
P2: This new `ParseOptions` type omits supported parse fields, so Rust callers lose JSON/change-tracking/attribute configuration that `/v2/parse` still accepts.</violation>

<violation number="2" location="apps/rust-sdk/src/v2/parse.rs:106">
P2: `ParseOptions.proxy` still permits `stealth` and `enhanced`, but `/v2/parse` only accepts `basic` or `auto`. Narrow or validate this field before sending the request.</violation>
</file>

<file name="apps/api/src/scraper/scrapeURL/engines/fetch/index.ts">

<violation number="1" location="apps/api/src/scraper/scrapeURL/engines/fetch/index.ts:32">
P2: Fall back to the document's `<meta charset>` when the header charset cannot be decoded.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 5 files (changes from recent commits).

Requires human review: Auto-approval blocked by 1 unresolved issue from previous reviews.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 1 file (changes from recent commits).

Requires human review: Auto-approval blocked by 1 unresolved issue from previous reviews.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 28 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/js-sdk/firecrawl/src/v2/utils/validation.ts">

<violation number="1" location="apps/js-sdk/firecrawl/src/v2/utils/validation.ts:81">
P2: Reject unsupported parse format string shorthands before continuing; `"screenshot"`, `"changeTracking"`, and `"branding"` currently bypass validation in JS callers.</violation>
</file>

<file name="apps/java-sdk/src/main/java/com/firecrawl/models/ParseOptions.java">

<violation number="1" location="apps/java-sdk/src/main/java/com/firecrawl/models/ParseOptions.java:74">
P2: Guard null format entries before reflection; a null element currently crashes `build()` with `NullPointerException`.</violation>

<violation number="2" location="apps/java-sdk/src/main/java/com/firecrawl/models/ParseOptions.java:125">
P2: Blank `proxy` values bypass validation and are still serialized as request data.</violation>
</file>

<file name="apps/js-sdk/firecrawl/src/v2/types.ts">

<violation number="1" location="apps/js-sdk/firecrawl/src/v2/types.ts:70">
P2: Exclude bare `json` from `ParseFormatString`; the current type accepts `formats: ['json']` even though parse validation rejects it.</violation>
</file>

<file name="apps/rust-sdk/src/v2/parse.rs">

<violation number="1" location="apps/rust-sdk/src/v2/parse.rs:105">
P2: `formats` is now too narrow for `/v2/parse`: it can no longer represent the supported `query` format object, so Rust clients cannot request query answers from parse.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

if (fmt === "json") {
throw new Error("json format must be an object with { type: 'json', prompt, schema }");
}
continue;
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Reject unsupported parse format string shorthands before continuing; "screenshot", "changeTracking", and "branding" currently bypass validation in JS callers.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/js-sdk/firecrawl/src/v2/utils/validation.ts, line 81:

<comment>Reject unsupported parse format string shorthands before continuing; `"screenshot"`, `"changeTracking"`, and `"branding"` currently bypass validation in JS callers.</comment>

<file context>
@@ -62,3 +70,73 @@ export function ensureValidScrapeOptions(options?: ScrapeOptions): void {
+      if (fmt === "json") {
+        throw new Error("json format must be an object with { type: 'json', prompt, schema }");
+      }
+      continue;
+    }
+
</file context>
Suggested change
continue;
if (fmt === "changeTracking" || fmt === "screenshot" || fmt === "branding") {
throw new Error(`parse does not support ${fmt} format`);
}
continue;
Fix with Cubic

Comment on lines +125 to +128
if (proxy != null && !proxy.isBlank()) {
if (!proxy.equals("basic") && !proxy.equals("auto")) {
throw new IllegalArgumentException("parse only supports proxy values 'basic' or 'auto'");
}
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Blank proxy values bypass validation and are still serialized as request data.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/java-sdk/src/main/java/com/firecrawl/models/ParseOptions.java, line 125:

<comment>Blank `proxy` values bypass validation and are still serialized as request data.</comment>

<file context>
@@ -0,0 +1,155 @@
+            if (timeout != null && timeout <= 0) {
+                throw new IllegalArgumentException("timeout must be positive");
+            }
+            if (proxy != null && !proxy.isBlank()) {
+                if (!proxy.equals("basic") && !proxy.equals("auto")) {
+                    throw new IllegalArgumentException("parse only supports proxy values 'basic' or 'auto'");
</file context>
Suggested change
if (proxy != null && !proxy.isBlank()) {
if (!proxy.equals("basic") && !proxy.equals("auto")) {
throw new IllegalArgumentException("parse only supports proxy values 'basic' or 'auto'");
}
if (proxy != null) {
if (proxy.isBlank()) {
proxy = null;
} else if (!proxy.equals("basic") && !proxy.equals("auto")) {
throw new IllegalArgumentException("parse only supports proxy values 'basic' or 'auto'");
}
}
Fix with Cubic

if (type instanceof String) return (String) type;
}
try {
Object type = fmt.getClass().getMethod("getType").invoke(fmt);
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Guard null format entries before reflection; a null element currently crashes build() with NullPointerException.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/java-sdk/src/main/java/com/firecrawl/models/ParseOptions.java, line 74:

<comment>Guard null format entries before reflection; a null element currently crashes `build()` with `NullPointerException`.</comment>

<file context>
@@ -0,0 +1,155 @@
+            if (type instanceof String) return (String) type;
+        }
+        try {
+            Object type = fmt.getClass().getMethod("getType").invoke(fmt);
+            if (type instanceof String) return (String) type;
+        } catch (ReflectiveOperationException ignored) {
</file context>
Fix with Cubic


export type ParseFormatString = Exclude<
FormatString,
'screenshot' | 'changeTracking' | 'branding'
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Exclude bare json from ParseFormatString; the current type accepts formats: ['json'] even though parse validation rejects it.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/js-sdk/firecrawl/src/v2/types.ts, line 70:

<comment>Exclude bare `json` from `ParseFormatString`; the current type accepts `formats: ['json']` even though parse validation rejects it.</comment>

<file context>
@@ -65,6 +65,22 @@ export type FormatOption =
 
+export type ParseFormatString = Exclude<
+  FormatString,
+  'screenshot' | 'changeTracking' | 'branding'
+>;
+
</file context>
Suggested change
'screenshot' | 'changeTracking' | 'branding'
'screenshot' | 'changeTracking' | 'branding' | 'json'
Fix with Cubic

#[serde(rename_all = "camelCase")]
pub struct ParseOptions {
/// Output formats to include in the response.
pub formats: Option<Vec<ParseFormat>>,
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: formats is now too narrow for /v2/parse: it can no longer represent the supported query format object, so Rust clients cannot request query answers from parse.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/rust-sdk/src/v2/parse.rs, line 105:

<comment>`formats` is now too narrow for `/v2/parse`: it can no longer represent the supported `query` format object, so Rust clients cannot request query answers from parse.</comment>

<file context>
@@ -79,16 +79,30 @@ pub enum ParseProxyType {
 pub struct ParseOptions {
     /// Output formats to include in the response.
-    pub formats: Option<Vec<Format>>,
+    pub formats: Option<Vec<ParseFormat>>,
     /// Additional HTTP headers.
     pub headers: Option<HashMap<String, String>>,
</file context>
Fix with Cubic

@mogery mogery marked this pull request as draft March 13, 2026 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant