Handling unescaped text in Reference Resolution inside Injected Language Fragments

Hello everyone,

We have a custom language plugin using MultiHostInjector to inject our language expressions into the interior of double-quoted string literal hosts (e.g., string interpolation like "%expression%").

Our host strings can contain escape sequences (e.g., %\101%, which represents %A%).

We have fully implemented a custom LiteralTextEscaper to properly decode these escape sequences:

  1. When the target language’s AST/PSI tree is built, the lexer receives the fully decoded, unescaped character buffer (\101 becomes A). As a result, tokenization and parsing are completely correct.
  2. However, when performing reference resolution on the identifier node inside the injected overlay, calling refElement.getText() returns the substring from the raw text of the physical containing file (which yields \101 instead of the decoded character A).
  3. Because the text comparison expects A, the reference resolution fails to match the name and reports the element as unresolved.
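
To make the decode step in point 1 concrete, here is a minimal standalone sketch of the offset bookkeeping an escaper has to do. It is not our actual LiteralTextEscaper and uses no platform classes: OctalEscapeDecoder, sourceOffsets and hostOffset are hypothetical names, and it assumes only backslash escapes of up to three octal digits such as \101.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an escaper: decodes \NNN octal escapes and records host offsets. */
final class OctalEscapeDecoder {
    /** The text the injected lexer would see. */
    final String decoded;
    /** sourceOffsets[i] is the host offset where decoded char i starts; the last entry is the host length. */
    final int[] sourceOffsets;

    private OctalEscapeDecoder(String decoded, int[] sourceOffsets) {
        this.decoded = decoded;
        this.sourceOffsets = sourceOffsets;
    }

    static OctalEscapeDecoder decode(String host) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < host.length()) {
            offsets.add(i);
            char c = host.charAt(i);
            if (c == '\\') {
                // Consume up to three octal digits after the backslash.
                int j = i + 1;
                int value = 0;
                while (j < host.length() && j < i + 4
                        && host.charAt(j) >= '0' && host.charAt(j) <= '7') {
                    value = value * 8 + (host.charAt(j) - '0');
                    j++;
                }
                if (j > i + 1) {
                    out.append((char) value);
                    i = j;
                    continue;
                }
            }
            // Not an octal escape: copy the character unchanged.
            out.append(c);
            i++;
        }
        offsets.add(host.length());
        int[] table = new int[offsets.size()];
        for (int k = 0; k < table.length; k++) {
            table[k] = offsets.get(k);
        }
        return new OctalEscapeDecoder(out.toString(), table);
    }

    /** Maps an offset in the decoded text back to the host text. */
    int hostOffset(int decodedOffset) {
        return sourceOffsets[decodedOffset];
    }
}
```

For the host text %\101% this yields the decoded text %A% and the offset table [0, 1, 5, 6], so the single decoded character A covers host offsets 1 to 5.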

Our Questions:

  1. What is the recommended platform pattern for Reference implementations inside injected language fragments to resolve identifier names when the underlying host element contains escape sequences?
  2. Should all custom References be instructed to fetch the decoded string buffer from the injection DocumentWindow instead of calling .getText(), or is there a standard PsiElement API to retrieve the unescaped text directly regardless of the injection source?

Thank you!

Follow-up Details: The Underlying Implementation Mismatch

When tracing through InjectionRegistrarImpl.java, the core discrepancy becomes clear in how the injected file overlays are constructed compared to how standard PsiElement.getText() fetches its content:

  1. Decoded Buffer Construction:
    In createAndRegisterInjected, the platform accumulates the fully unescaped strings using the myEscaper.decode(...) loop into a dedicated StringBuilder decodedChars.

  2. Virtual AST Creation:
    This decodedChars buffer is then passed as the backing character source when instantiating the VirtualFileWindowImpl and invoking LanguageParserDefinitions.parseFile(...). This guarantees that the lexer runs on the decoded characters, so \101 is tokenized as A and parsed into a valid AST in memory.

  3. The getText() Bypass:
    When calling .getText() on an injected PsiElement (such as our identifier node), the platform avoids storing a separate text string for each node to save memory. Instead, it delegates the call to fetch a substring from the raw host text sequence from the physical .gcl file using the mapped offsets. As a result, the reference resolution logic receives the raw string characters \101 rather than the unescaped value A stored inside the decodedChars buffer of the VirtualFileWindow.

Follow-up Question:

Since the VirtualFileWindowImpl inherently encapsulates the decodedChars buffer, is the officially supported pattern to retrieve the decoded text via the InjectedFileViewProvider’s DocumentWindow.getText(textRange), or should identifiers in injected languages override .getText() to pull directly from the unescaped character sequence to ensure correct Name Matching across all standard IDE reference queries?

Additional Context: LeafPatcher, Self-Injection, and Potential Workarounds

After further investigation, I wanted to add some details that may help pinpoint the right solution.

The exact mechanism: LeafPatcher.patch()

The decoded text is lost at a very specific point in the pipeline. After parsing succeeds on the decodedChars buffer, InjectionRegistrarImpl.parseFile() calls patchLeaves(), which invokes LeafPatcher.patch() to rewrite every leaf node’s text range so that the injected PSI tree’s text matches the DocumentWindow text (which mirrors the host document, escapes included). Before patching, identNode.getText() returns A; after patching, it returns \101. The decoded buffer used during parsing is effectively discarded from the PSI layer at this point.
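
As an illustration of the before/after behaviour described above, here is a toy model of a single leaf. It is not the platform's LeafPatcher: PatchedLeafModel and both method names are hypothetical, and it assumes an offset table where entry i is the host offset at which decoded character i starts, with a final entry equal to the host length.

```java
/** Toy model of one leaf's text before and after patching; not the platform's LeafPatcher. */
final class PatchedLeafModel {
    /** Text the lexer saw: a slice of the decoded buffer. */
    static String textBeforePatch(String decoded, int start, int end) {
        return decoded.substring(start, end);
    }

    /** Text after patching: the host slice that the same decoded range maps back to. */
    static String textAfterPatch(String host, int[] sourceOffsets, int start, int end) {
        return host.substring(sourceOffsets[start], sourceOffsets[end]);
    }
}
```

With host text %\101%, decoded text %A% and offset table {0, 1, 5, 6}, the identifier at decoded range [1, 2) reads A before patching and \101 after it, which matches what we observe from identNode.getText().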

This is a self-injection scenario

It’s worth noting that our use case is somewhat unusual: we inject our language back into itself (the host and injected language are the same). The interpolated expressions inside "%expr%" reference symbols defined in the host file’s scope.

In typical injection scenarios (e.g., SQL into Java strings, regex into Python), the injected language resolves references against its own symbol space, so the host’s escape encoding never interferes with name matching. In our case, the injected PSI’s GclFieldRef.getText() must match field names from the host GCL file — which is where the patched (escaped) text causes resolution failures.

This likely explains why this issue hasn’t surfaced more broadly — most plugins never encounter it.

Kotlin’s approach avoids the problem structurally

KotlinStringLiteralTextEscaper (in psi-api) builds a sourceOffsets array during decode() for correct offset mapping. However, Kotlin sidesteps the text identity problem entirely because template expressions (${expr}) are separate PSI children of KtStringTemplateExpression — they are never decoded from escape sequences. The identifiers inside ${} exist as first-class PSI nodes with their own literal text, so LeafPatcher never rewrites them.

Workaround limitations

We investigated whether the decoded text is recoverable after injection:

  • element.getText() → returns host text (\101) — leaf nodes are patched to the DocumentWindow
  • element.getContainingFile().getViewProvider().getDocument().getText(element.getTextRange()) → also returns \101; getTextRange() is remapped to the DocumentWindow, which mirrors the host document
  • The original decodedChars buffer is passed to VirtualFileWindowImpl for parsing but is not accessible through any public API after LeafPatcher rewrites the tree

As far as we can tell, the decoded text is effectively lost once patchLeaves() completes. There is no PsiElement or Document API that returns the pre-patched (decoded) content.

This means our only current option is to re-decode manually using the LiteralTextEscaper from the host element during reference resolution, which duplicates work and feels fragile.

Concrete proposal

Would any of the following be considered appropriate platform-level solutions?

  1. A PsiElement API for decoded text — e.g., getDecodedText() or an InjectedLanguageManager utility that retrieves text from the VirtualFileWindowImpl’s decoded buffer rather than from the patched leaf ranges.

  2. A user-data key on patched leaves: LeafPatcher could store the original decoded text as UserData on each leaf it modifies, making it retrievable without going through the DocumentWindow.

  3. An opt-out from LeafPatcher — A flag on LiteralTextEscaper (e.g., preserveDecodedText()) that tells the injection framework to skip leaf patching for fragments where the plugin handles offset mapping itself.

Any guidance on the recommended pattern would be greatly appreciated. Thank you!

To reproduce the problem with other languages, write the following string in Java:
String s = "<html><\u0062ody></body></html>";
Then inject HTML into the string literal.
As a result, the HTML tree is parsed correctly, but the <body> tag reference (<\u0062ody>) won’t resolve (Ctrl-click doesn’t work).
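
The mismatch in this repro can be shown without the IDE. The sketch below is only an illustration: UnicodeEscapeRepro is a hypothetical helper that decodes backslash-u escapes with exactly four hex digits, which is a simplification of Java's real escape rules.

```java
/** Toy decoder used to compare the raw literal text with its decoded form. */
final class UnicodeEscapeRepro {
    /** Decodes backslash-u escapes with exactly four hex digits; everything else is copied as is. */
    static String decode(String host) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < host.length()) {
            if (host.startsWith("\\u", i) && i + 6 <= host.length() && isHex4(host, i + 2)) {
                out.append((char) Integer.parseInt(host.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(host.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** True if the four characters starting at start are all hex digits. */
    private static boolean isHex4(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Given the raw literal content from the repro, the raw text does not contain <body> while the decoded text does. A name comparison against the raw text therefore misses the opening tag even though the parser saw a normal body element.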

Update: Found the solution — InjectedLanguageManager.getUnescapedText()

After further investigation, I discovered that the platform does preserve the decoded text after LeafPatcher patches the injected PSI tree. The decoded text is stored as user-data on patched leaves and is accessible via:

InjectedLanguageManager.getInstance(project).getUnescapedText(injectedElement)

This returns the original decoded text (e.g., A instead of \101), which is exactly what’s needed for reference resolution in self-injection scenarios.

There are also offset mapping utilities for navigating between the two coordinate spaces:

// patched PSI offset → decoded offset
manager.mapInjectedOffsetToUnescaped(injectedFile, offset)

// decoded offset → patched PSI offset
manager.mapUnescapedOffsetToInjected(injectedFile, offset)

The implementation lives in InjectedLanguageManagerImpl and relies on InjectedLanguageUtilBase.getUnescapedLeafText() to retrieve the stored decoded text from each leaf node.

This solves the original problem: in reference resolution, instead of calling element.getText() (which returns the host-escaped text), calling getUnescapedText() returns the decoded identifier needed for name matching.
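
Here is a rough sketch of what that change means for name matching. It deliberately avoids platform classes so it can run on its own: InjectedIdent and FieldNameResolver are hypothetical stand-ins, and in a real plugin the name function would be the getUnescapedText() call shown above rather than a field read.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical stand-in for an injected identifier leaf; not a platform class. */
final class InjectedIdent {
    final String patchedText;   // what getText() returns after leaf patching, e.g. \101
    final String unescapedText; // what getUnescapedText(...) would return, e.g. A

    InjectedIdent(String patchedText, String unescapedText) {
        this.patchedText = patchedText;
        this.unescapedText = unescapedText;
    }
}

/** Resolves an identifier against declared field names using a pluggable name source. */
final class FieldNameResolver {
    private final Map<String, String> declarations; // field name -> declaration description
    private final Function<InjectedIdent, String> nameOf;

    FieldNameResolver(Map<String, String> declarations, Function<InjectedIdent, String> nameOf) {
        this.declarations = declarations;
        this.nameOf = nameOf;
    }

    /** Returns the matching declaration, or empty if the name is unknown. */
    Optional<String> resolve(InjectedIdent ident) {
        return Optional.ofNullable(declarations.get(nameOf.apply(ident)));
    }
}
```

With a single declared field A and an identifier whose patched text is \101, a resolver built on the patched text finds nothing, while one built on the unescaped text finds the declaration. Keeping the name source pluggable also means the reference code does not care whether the element came from an injected fragment or a physical file.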

I’m leaving this update here in case others encounter the same issue with self-injection (injecting a language back into itself) where escape sequences in the host string produce different identifier text in the decoded PSI.

And now we have run into yet another problem: it seems extremely hard to handle cases where a language is injected into an already injected language. Resolving everything in the context of the top-level injection host is a very fragile approach.

@yuriy.artamonov, it would be great to get some insights.
What’s the right way to implement recursive language injection with escaping?

I think we never do this reliably ourselves because of correctness and performance complexity.

Why do you need it really? Can this language be modeled normally without this recursive injection?

Injected language fragments are not really designed to be used this way.

You may consider alternative ways:

  • rework parser to support such nesting
  • use chameleon nodes (HTML does this)
  • template language

I’ve explored the proposed options, and none of them seems to address the problem from all angles.

To use a chameleon node, I would have to implement a lexer that finds interpolation boundaries inside the escaped string, which would be extremely hard because in our language the same character has multiple escaped variants: % \% \x45 \u45 \u045 \u0045 \45 \045 and so on.

Here is the AI analysis of the proposed methods:

Architectural Summary: Why None of the Options are Perfect

Handling GCL double-quoted string literal interpolations featuring arbitrary octal/hex identifier escape sequences exposes limitations in all standard IntelliJ Platform paradigms:


1. Standard MultiHostInjector

  • How it works: Slices the host string into dynamic, virtual embedded PSI file overlays mapped in memory.
  • Why it falls short: PsiElement.getText() inherently fetches substrings directly from the raw physical file buffer. As a result, reference name comparisons receive the raw backslash sequence (\101) instead of the decoded value (A), breaking Go-to-Declaration.

2. Chameleon Nodes & Embedded Lexers

  • How it works: Defers the parsing of specific syntax blocks to lazy AST expansion passes integrated natively in the primary PSI tree.
  • Why it falls short: Chameleon sub-lexers operate directly on the raw host character stream. Detecting interpolation bounds (%) when they are written as escaped octal/hex equivalents (\045) becomes extremely difficult without massive lexer state regex rules.

3. Template Language Frameworks

  • How it works: Creates dual-layer base vs. template PSI trees for document-level mixing (like PHP and HTML).
  • Why it falls short: It is designed for mixing file structures rather than identifier unescaping. Like Chameleon nodes, it provides no automated buffer pre-processing or character unescaping before tokenization.

That is why I think the main language parser must actually take this recursive nature into account.

If your language is really recursive, and not just two layers, then injected language is probably too limited an abstraction.

It may help to separate two different concerns here.

Reference contributors for the injected language should normally run inside the injected PSI, exactly as they would for a regular file of that language. In that setup, the reference provider should not need to know whether the PSI came from a physical file or from an injected fragment.

So I would not normally expect InjectedLanguageManager to be involved in reference resolution itself (except maybe for reaching the host document, if you need it as a resolve scope).

Inside the injected language, references should be based on the injected PSI model and its tokens. If the provider needs a semantic value, it should get it from the injected element/model, not by manually going back to the host and decoding text there.

Have you considered whether the reference contributor might actually be running on the host PSI rather than on the injected PSI? If so, maybe the contributor is registered for the wrong language/place.

That’s the problem: the lexer runs on the injected tree, but then it’s replaced with the host tree, and resolution happens inside the host PSI tree.

Could you please clarify what exactly you mean by “then it’s replaced”? Where does it happen: while resolve() is being called, or somewhere else?

I’m trying to understand whether this is mainly about PSI text representation, or about how references/search interact with host offsets.

Could you clarify which part is failing here?

Is the problem that PsiReference.resolve() cannot resolve the reference once the reference object is already created?

Or is the problem specifically with Find Usages / ReferencesSearch, where the platform does not find the escaped occurrence in the host text as a candidate usage?

Those are slightly different cases. For resolve(), I would expect the reference implementation to work with the injected PSI/token model and compare against the semantic, unescaped value. But for usage search, the platform may need to search in the host text, where the same semantic value can be represented in escaped form.
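
To illustrate the usage-search side of that distinction, here is a small standalone sketch. It is not how ReferencesSearch works internally: EscapedUsageFinder is a hypothetical helper that does a plain substring search over the decoded text, and it assumes an offset table where entry i is the host offset at which decoded character i starts, with a final entry equal to the host length.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy usage finder: searches the decoded text and reports ranges in host coordinates. */
final class EscapedUsageFinder {
    /**
     * Finds every occurrence of name in the decoded text and maps it back to a host range.
     * Each result is a two-element array: {hostStart, hostEnd}.
     */
    static List<int[]> findInHost(String decoded, int[] sourceOffsets, String name) {
        List<int[]> hostRanges = new ArrayList<>();
        if (name.isEmpty()) {
            return hostRanges;
        }
        int from = 0;
        while (true) {
            int at = decoded.indexOf(name, from);
            if (at < 0) {
                break;
            }
            hostRanges.add(new int[] {sourceOffsets[at], sourceOffsets[at + name.length()]});
            from = at + 1;
        }
        return hostRanges;
    }
}
```

For the host text %\101% and %A%, a plain search for A in the host text finds only the second occurrence at offset 12. Searching the decoded text %A% and %A% and mapping back gives two host ranges, [1, 5) for the escaped form and [12, 13) for the literal one.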

This is the place where the magic happens.
It maps unescaped content into the host PSI.