Skip to content
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
### Changed

- The UPLC/PLC/PIR textual parser now rejects unquoted identifiers that
contain a `-` anywhere other than as the terminal numeric unique-suffix
separator (e.g. `pubKeyHash-305478r71`, `foo-bar`, `foo-123-456`) with
a dedicated `InvalidIdentifier` diagnostic that points directly at the
offending name and shows the full bad text. Previously the same inputs
silently mis-parsed — the prefix was taken as a name plus unique-suffix
and the remainder was picked up as an adjacent term — which surfaced as
a confusing "unexpected '(' expecting ')'" message far from the real
site (see #7742). To use such a string as a name verbatim, wrap it in
backticks: `` `pubKeyHash-305478r71` ``.
Comment thread
Unisay marked this conversation as resolved.
Outdated
18 changes: 18 additions & 0 deletions plutus-core/plutus-core/src/PlutusCore/Error.hs
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,12 @@ data ParserError
= BuiltinTypeNotAStar !T.Text !SourcePos
| UnknownBuiltinFunction !T.Text !SourcePos ![T.Text]
| InvalidBuiltinConstant !T.Text !T.Text !SourcePos
| {-| An unquoted identifier that violates the grammar: a '-' appeared
anywhere other than as the separator of a terminal numeric unique-suffix
(e.g. @pubKeyHash-305478r71@, @foo-bar@, @foo-123-456@). The 'Text'
carries the full offending text as it appeared in the source, so the
user sees their own name back in the diagnostic. -}
InvalidIdentifier !T.Text !SourcePos

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the current error thrown if there's an invalid character (like ^ or whatever) in the identifier? If you are adding an InvalidIdentifier error, it shouldn't just apply to -.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the crux: there's an asymmetry in how identifier-related errors are handled.

  1. Characters that are invalid for an identifier (not isIdentifierChar and not -), like ^ or &: the parser stops at the valid prefix, returns it as a Name, and the bad character produces a generic Megaparsec "unexpected" error at its own position. So foo^bar reports ^ at column N, with no link back to the identifier the user actually meant.
  2. Characters that belong to the identifier-char set, or another - (i.e. isNameExtensionChar = isIdentifierChar || '-'): when they appear right after a parsed name + optional -NNN and would extend it, the parser has already over-committed. The -NNN it consumed as a unique-suffix was actually part of the intended name. It eats the remainder and raises InvalidIdentifier with the full offending text.

The new error only fires for case 2, but the constructor name InvalidIdentifier is misleading as it sounds to cover both cases. It's really the hyphen-misparse case, not "any invalid identifier".

That said I see two possible ways forward:

  1. [Local solution] Keep this error specific to the second case and rename it accordingly, i.e. MalformedIdentifierExtension
  2. [Global solution] Rework the way identifiers are parsed to emit identifier-related errors in a uniform, consistent way.

I chose local fix: its an improvement that doesn't stop us from reworking the whole identifier parser later if we want (much bigger scope)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end I have re-implemented the middle-ground solution:
I didn't rework the way whole identifiers are parsed, only touched the unique suffix part -123
The change turned out not that big and we have a more consistent error handling for suffixes now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the crux: there's an asymmetry in how identifier-related errors are handled.

Do you mean this is how they are handled currently on master, or how you are handling it in your PR?

I didn't rework the way whole identifiers are parsed, only touched the unique suffix part -123

I don't know what this means. "Touched" in what sense?

And to step back a bit - why do we need to allow parsing both names-with-unique and names-without-unique? In any given textual UPLC, don't all names either have uniques or don't?

Btw, as a general rule, avoid posting Claude answers verbatim. It comes across to me as not being respectful of other people's time. They are often quite verbose and of questionable quality - inaccuracy is common. You are of course free to use Claude, but make sure to quality-check, shorten and rephrase, make it suitable for human conversation before asking others to read it. This applies to PR descriptions, code reviews, and other communication.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean this is how they are handled currently on master

yes, I meant master.

I don't know what this means. "Touched" in what sense?

by "touched" I meant "changed" the suffix (numeric extension) parser.

And to step back a bit - why do we need to allow parsing both names-with-unique and names-without-unique?

Idk, this decision was made before me.

In any given textual UPLC, don't all names either have uniques or don't?

I agree that most likely its either all with suffixes or all without. Depends on the machinery that produced the UPLC.

Btw, as a general rule, avoid posting Claude answers verbatim. It comes across to me as not being respectful of other people's time.

So far when producing public texts (GH descriptions, comments, etc.) I tried to come close to how I want to receive them (Golden rule: treat others same way you want them to treat you).

I didn't mean to disrespect anyone, and since you criticized the original Claude produced PR description - I simplified it significantly and now its in the shape in which I'd want others to create PRs for my review. I think it also serves your request to provide the Before and After examples.

Other than the original PR description - I typed all comments on keyboard myself without asking LLM to write them.

Anyway, I got your point:

They are often quite verbose and of questionable quality - inaccuracy is common. You are of course free to use Claude, but make sure to quality-check, shorten and rephrase, make it suitable for human conversation before asking others to read it. This applies to PR descriptions, code reviews, and other communication.

deriving stock (Eq, Ord, Generic)
deriving anyclass (NFData)

Expand Down Expand Up @@ -192,6 +198,18 @@ instance Pretty ParserError where
<+> squotes (pretty s)
<+> "at"
<+> pretty loc
pretty (InvalidIdentifier txt loc) =
"Invalid identifier"
<+> squotes (pretty txt)
<+> "at"
<+> pretty loc
<> "."
<> hardline
<> "A '-' inside a name is the numeric unique-suffix separator and must be"
<+> "followed only by digits and a word boundary."
<> hardline
<> "To use this text as a name verbatim, quote it with backticks:"
<+> pretty ("`" <> txt <> "`")

instance ShowErrorComponent ParserError where
showErrorComponent = show . pretty
Expand Down
27 changes: 26 additions & 1 deletion plutus-core/plutus-core/src/PlutusCore/Parser/ParserCommon.hs
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ import Control.Monad.Except
import Control.Monad.Reader (ReaderT, ask, local, runReaderT)
import Control.Monad.State (StateT, evalStateT)
import Data.Map qualified as M
import Data.Set qualified as Set
import Data.Text (Text)
import Data.Text qualified as Text
import Text.Megaparsec hiding (ParseError, State, parse, some)
Expand Down Expand Up @@ -217,9 +218,33 @@ name = try $ parseUnquoted <|> parseQuoted
where
parseUnquoted :: Parser Name
parseUnquoted = do
startOffset <- getOffset
startPos <- getSourcePos'
_ <- lookAhead (satisfy isIdentifierStartingChar)
inputBefore <- getInput
str <- takeWhileP (Just "identifier-unquoted") isIdentifierChar
Name str <$> uniqueSuffix str
u <- uniqueSuffix str
{- The parsed prefix is only a valid identifier if the next character is
a real word-boundary. If instead we see more identifier chars or another
'-', the user wrote something like `foo-bar` or `pubKeyHash-305478r71` —
the '-NNN' run we just treated as a unique-suffix was actually part of
their intended name (or they have a stray '-' at all). Fail with a
custom diagnostic that points at the whole offending identifier. -}
mBad <- optional (lookAhead (satisfy isNameExtensionChar))
case mBad of
Nothing -> pure (Name str u)
Just _ -> do
-- Consume the remainder so the reported text covers the full name.
_ <- takeWhileP Nothing isNameExtensionChar
inputAfter <- getInput
let consumed = Text.length inputBefore - Text.length inputAfter
fullText = Text.take consumed inputBefore
parseError $
FancyError startOffset $
Set.singleton (ErrorCustom (InvalidIdentifier fullText startPos))

isNameExtensionChar :: Char -> Bool
isNameExtensionChar c = isIdentifierChar c || c == '-'

parseQuoted :: Parser Name
parseQuoted = do
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
test:1:21:
|
1 | (program 1.1.0 (lam foo-123-456 foo-123-456))
| ^
Invalid identifier 'foo-123-456' at test:1:21.
A '-' inside a name is the numeric unique-suffix separator and must be followed only by digits and a word boundary.
To use this text as a name verbatim, quote it with backticks: `foo-123-456`
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
test:1:21:
|
1 | (program 1.1.0 (lam pubKeyHash-305478r71 (lam x x)))
| ^
Invalid identifier 'pubKeyHash-305478r71' at test:1:21.
A '-' inside a name is the numeric unique-suffix separator and must be followed only by digits and a word boundary.
To use this text as a name verbatim, quote it with backticks: `pubKeyHash-305478r71`
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
test:1:21:
|
1 | (program 1.1.0 (lam foo-bar foo-bar))
| ^
Invalid identifier 'foo-bar' at test:1:21.
A '-' inside a name is the numeric unique-suffix separator and must be followed only by digits and a word boundary.
To use this text as a name verbatim, quote it with backticks: `foo-bar`
46 changes: 46 additions & 0 deletions plutus-core/untyped-plutus-core/testlib/Generators/Spec.hs
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,9 @@ test_parsing =
, propMissingConOperands
, propInvalidKeyword
, propBracketMismatch
, propInvalidIdentifierHyphenLetters
, propInvalidIdentifierHyphenWord
, propInvalidIdentifierDoubleUnique
]
]

Expand Down Expand Up @@ -241,6 +244,49 @@ propBracketMismatch =
"bracket-mismatch"
"(program 1.1.0 [(var x))"

{- Note [Negative identifier-grammar tests]
The parser's name grammar treats '-NNN' purely as the numeric unique-suffix:
'foo-123' → Name "foo" (Unique 123). A '-' anywhere else in an identifier is
not allowed by the unquoted grammar (see 'isIdentifierChar' in
'PlutusCore.Name.Unique'). Several tools in the wild (e.g. Scalus 0.16.0's
'toUplcOptimized') emit names like 'pubKeyHash-305478r71' that violate this,
and today the parser mis-parses them in a way that surfaces as a confusing
error hundreds of lines away from the offending name — see issue #7742.

The goldens below freeze the *current* (unhelpful) error output so that a
future diagnostic improvement shows up as an explicit golden-file diff.
When the parser is taught to point at the bad name itself, accept the new
goldens with 'scripts/regen-goldens.sh' (or '--accept'). -}
Comment thread
Unisay marked this conversation as resolved.
Outdated

{-| @pubKeyHash-305478r71@ — the exact shape Scalus 0.16.0 produces, inside a
binder. Current behaviour: the parser eats @pubKeyHash-305478@ as name+unique,
Comment thread
Unisay marked this conversation as resolved.
Outdated
picks up @r71@ as the lam body, then fails far away on the next paren. -}
propInvalidIdentifierHyphenLetters :: TestTree
propInvalidIdentifierHyphenLetters =
Comment thread
Unisay marked this conversation as resolved.
Outdated
testParseErrorGolden
"Invalid identifier: hyphen followed by digits then letters"
"invalid-identifier-hyphen-letters"
"(program 1.1.0 (lam pubKeyHash-305478r71 (lam x x)))"

{-| @foo-bar@ — hyphen followed by non-digits. Current behaviour: the parser
stops at '-' (it is not in 'isIdentifierChar'), takes @foo@ as the name, and
then explodes on @-bar@ which is not a valid continuation anywhere. -}
propInvalidIdentifierHyphenWord :: TestTree
propInvalidIdentifierHyphenWord =
Comment thread
Unisay marked this conversation as resolved.
Outdated
testParseErrorGolden
"Invalid identifier: hyphen followed by non-digits"
"invalid-identifier-hyphen-word"
"(program 1.1.0 (lam foo-bar foo-bar))"

{-| @foo-123-456@ — ambiguous double '-NNN' run. Current behaviour: the first
@-123@ wins as the unique, @-456@ is left over and fails the next check. -}
propInvalidIdentifierDoubleUnique :: TestTree
propInvalidIdentifierDoubleUnique =
testParseErrorGolden
"Invalid identifier: double unique-suffix"
"invalid-identifier-double-unique"
"(program 1.1.0 (lam foo-123-456 foo-123-456))"

--------------------------------------------------------------------------------
-- Helper Functions ------------------------------------------------------------

Expand Down
Loading