-
Notifications
You must be signed in to change notification settings - Fork 1.5k
refactor!: Centralize KVS value semantics in @crawlee/core #3774
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
janbuchar
wants to merge
9
commits into
v4
Choose a base branch
from
centralize-kvs-value-semantics
base: v4
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
2a002f1
refactor!: Centralize KVS value semantics
janbuchar d0e3433
Missing awaits
janbuchar de6c16b
Add failing tests according to review feedback
janbuchar 76923d1
Fix serializeValue logic
janbuchar c5707e2
Add KeyValueStore.getRecord method
janbuchar cd4e933
Avoid empty string sentinels
janbuchar 441e794
Tighten down test assertions
janbuchar 12ff6e8
Extract shared buffer/stream detection to utils, handle web streams c…
janbuchar 22a4475
Fix dependencies
janbuchar File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,131 @@ | ||
| import type { Dictionary } from '@crawlee/types'; | ||
| import contentTypeParser from 'content-type'; | ||
| import { isBuffer, isStream } from '@crawlee/utils'; | ||
| import JSON5 from 'json5'; | ||
|
|
||
| import { jsonStringifyExtended } from '@apify/utilities'; | ||
|
|
||
| const CONTENT_TYPE_JSON = 'application/json'; | ||
| const STRINGIFIABLE_CONTENT_TYPE_RXS = [new RegExp(`^${CONTENT_TYPE_JSON}$`, 'i'), /^application\/.*xml$/i, /^text\//i]; | ||
|
|
||
| /** | ||
| * Canonical write path for key-value store records. | ||
| * | ||
| * When a content type is provided, the value passes through unchanged — it is the caller's | ||
| * responsibility to supply a String/Buffer/Stream (the frontend validates this). | ||
| * | ||
| * When no content type is provided, it is inferred from the value's shape: | ||
| * - Buffer / typed array / ArrayBuffer / stream → `application/octet-stream` (passthrough) | ||
| * - `string` → `text/plain; charset=utf-8` (passthrough) | ||
| * - anything else → `application/json; charset=utf-8` (serialized via `jsonStringifyExtended`) | ||
| * | ||
| * Does NOT drain streams — that is storage mechanics and stays in the storage client. | ||
| * | ||
| * Backend-independent. | ||
| */ | ||
| export function serializeValue( | ||
| value: unknown, | ||
| contentType?: string, | ||
| ): { | ||
| value: Buffer | ArrayBuffer | ArrayBufferView | string | NodeJS.ReadableStream | ReadableStream; | ||
| contentType: string; | ||
| } { | ||
| if (contentType !== null && contentType !== undefined) { | ||
| return { value: value as Buffer | string | NodeJS.ReadableStream | ReadableStream, contentType }; | ||
| } | ||
|
|
||
| if (isStream(value) || isBuffer(value)) { | ||
| return { | ||
| value, | ||
| contentType: 'application/octet-stream', | ||
| }; | ||
| } | ||
|
|
||
| if (typeof value === 'string') { | ||
| return { value, contentType: 'text/plain; charset=utf-8' }; | ||
| } | ||
|
|
||
| let serialized: string; | ||
| try { | ||
| // Format JSON to simplify debugging, the overheads with compression is negligible | ||
| serialized = jsonStringifyExtended(value as Dictionary, null, 2); | ||
| } catch (e) { | ||
| const error = e as Error; | ||
| // Give more meaningful error message | ||
| if (error.message?.includes('Invalid string length')) { | ||
| error.message = 'Object is too large'; | ||
| } | ||
| throw new Error(`The "value" parameter cannot be stringified to JSON: ${error.message}`); | ||
| } | ||
|
|
||
| if (serialized === undefined) { | ||
| throw new Error( | ||
| 'The "value" parameter was stringified to JSON and returned undefined. ' + | ||
| "Make sure you're not trying to stringify an undefined value.", | ||
| ); | ||
| } | ||
|
|
||
| return { value: serialized, contentType: 'application/json; charset=utf-8' }; | ||
| } | ||
|
|
||
| /** | ||
| * Parses a Buffer or ArrayBuffer using the provided content type header. | ||
| * | ||
| * - application/json is returned as a parsed object. | ||
| * - application/*xml and text/* are returned as strings. | ||
| * - everything else is returned as original body. | ||
| * | ||
| * If the header includes a charset, the body will be stringified only | ||
| * if the charset represents a known encoding to Node.js or Browser. | ||
| * | ||
| * Backend-independent — this is the canonical read path for the {@apilink KeyValueStore} frontend. | ||
| */ | ||
| export function parseValue( | ||
| body: Buffer | ArrayBuffer, | ||
| contentTypeHeader: string | null, | ||
| ): string | Buffer | ArrayBuffer | Record<string, unknown> { | ||
| // No content type at all → we have no basis for interpretation; hand back the raw bytes. | ||
| if (contentTypeHeader === null) return body; | ||
|
|
||
| let contentType: string; | ||
| let charset: BufferEncoding; | ||
| try { | ||
| const result = contentTypeParser.parse(contentTypeHeader); | ||
| contentType = result.type; | ||
| charset = result.parameters.charset as BufferEncoding; | ||
| } catch { | ||
| // Unparseable header → keep the original buffer rather than a mangled string. | ||
| return body; | ||
| } | ||
|
|
||
| // If we can't successfully parse it, we return | ||
| // the original buffer rather than a mangled string. | ||
| if (!areDataStringifiable(contentType, charset)) return body; | ||
| const dataString = isomorphicBufferToString(body, charset); | ||
|
|
||
| return contentType === CONTENT_TYPE_JSON ? JSON5.parse(dataString) : dataString; | ||
| } | ||
|
|
||
| function isomorphicBufferToString(buffer: Buffer | ArrayBuffer, encoding: BufferEncoding): string { | ||
| if (buffer.constructor.name !== ArrayBuffer.name) { | ||
| return buffer.toString(encoding); | ||
| } | ||
|
|
||
| // Browser decoding only works with UTF-8. | ||
| const utf8decoder = new TextDecoder(); | ||
| return utf8decoder.decode(new Uint8Array(buffer)); | ||
| } | ||
|
|
||
| function isCharsetStringifiable(charset: string): charset is BufferEncoding { | ||
| if (!charset) return true; // hope that it's utf-8 | ||
| return Buffer.isEncoding(charset); | ||
| } | ||
|
|
||
| function isContentTypeStringifiable(contentType: string): boolean { | ||
| if (!contentType) return false; // keep buffer | ||
| return STRINGIFIABLE_CONTENT_TYPE_RXS.some((rx) => rx.test(contentType)); | ||
| } | ||
|
|
||
| function areDataStringifiable(contentType: string, charset: string): boolean { | ||
| return isContentTypeStringifiable(contentType) && isCharsetStringifiable(charset); | ||
| } | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.