PSI manipulation merges consecutive whitespace tokens

The language my plugin handles, Clojure, treats commas as whitespace, so a collection can be written as either [a, b, c] or [a b c]; the two are essentially identical. However, I parse the commas as separate tokens, because it’s important to preserve them, e.g. when formatting. This means that my PSI can have multiple contiguous whitespace tokens (e.g. in the case above, the tokens are [ a , b , c ]). This has worked for years with no problem. However, I’ve noticed an edge case: if I copy some of these tokens and move them in the PSI, e.g. I copy the a , tokens and move them out of the vector, then when I use addBefore to add the two whitespace tokens they get merged into one in CodeEditUtil#makePlaceHolderBetweenTokens. I’m assuming the basic approach of having multiple different whitespace tokens is valid, because Greg Shrago does the same in Clojure Kit. However, while debugging I can’t see any way to prevent those tokens from being merged. Can anyone help with this?


If it’s merged by your formatter, have you checked if WhiteSpaceFormattingStrategy is involved?
For a language, I’m using it to handle non-space whitespace characters, similar to your case.

In this case, it’s not merged by the formatter, it’s when inserting two consecutive PSI elements which have IElementTypes of TokenType.WHITE_SPACE. What is odd is that one of the tokens has a different IElementType, at least when lexing, so I’m wondering if I need to handle it in ASTFactory to give it a custom PSI type which returns that element type, or something like that. I’m guessing it currently gets a standard element TokenType.WHITE_SPACE because its token type is in the set returned from getWhitespaceTokens in my ParserDefinition. But I haven’t debugged all the way through all this yet.

That said, that does look like a very useful EP, and one I could potentially use if I want to just treat all whitespace as a single token type, so thanks for that!

Whitespace is tricky…

I haven’t had this case yet. Glancing at the method, it seems to merge whitespace if the left and right ASTNodes both have type TokenType.WHITE_SPACE.

Unfortunately, a PsiWhiteSpaceImpl is AFAIK the only PSI whitespace element and always has type WHITE_SPACE, even if your lexer returns a different type for tokens. You can use custom whitespace types in the parser, but as soon as PSI is created such nodes are turned into WHITE_SPACE as far as I remember. You could detect such PsiElement whitespace by checking text or textContains(',').
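
To make the text-based check concrete, here is a minimal sketch in plain Java with no IntelliJ dependencies. The class and method names (CommaWhitespace, containsComma, splitAtCommas) are made up for illustration; in a plugin the argument would presumably be the text of a whitespace PSI element, and PsiElement#textContains(',') would replace the first helper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers, not part of any IntelliJ API.
final class CommaWhitespace {

    // True if a whitespace run contains at least one comma.
    static boolean containsComma(CharSequence whitespaceText) {
        for (int i = 0; i < whitespaceText.length(); i++) {
            if (whitespaceText.charAt(i) == ',') {
                return true;
            }
        }
        return false;
    }

    // Splits a whitespace run into comma pieces and non-comma pieces,
    // preserving order, so the commas can be recovered from merged text.
    static List<String> splitAtCommas(CharSequence whitespaceText) {
        List<String> parts = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < whitespaceText.length(); i++) {
            char c = whitespaceText.charAt(i);
            if (c == ',') {
                if (run.length() > 0) {
                    parts.add(run.toString());
                    run.setLength(0);
                }
                parts.add(",");
            } else {
                run.append(c);
            }
        }
        if (run.length() > 0) {
            parts.add(run.toString());
        }
        return parts;
    }
}
```

This only recovers the commas from the text of a whitespace element; it does not stop the merge from happening.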

It may be possible to lex the ", " in "a, b" as a single node of a custom whitespace token type (e.g. COMMA_WHITESPACE) instead of as two tokens WHITE_SPACE WHITE_SPACE. That would avoid the merging. The WhiteSpaceFormattingStrategy could then reduce excess whitespace around a comma to a smaller chunk.
But this is just a thought, it may not work at all.
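
As a rough illustration of the single-token idea, here is a toy lexer in plain Java. It is not IntelliJ code and does not implement the platform's Lexer interface; the class name, the Kind enum and the Token class are all assumptions made for this sketch. The point is only that a run of commas and ordinary whitespace comes out as one token, so two whitespace tokens never sit next to each other.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a real plugin would do this in its lexer or flex grammar.
final class CommaWhitespaceLexer {
    enum Kind { LBRACKET, RBRACKET, SYMBOL, WHITESPACE }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Clojure treats commas as whitespace, so they join the same run.
    static boolean isWhitespaceChar(char c) {
        return c == ',' || Character.isWhitespace(c);
    }

    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int start = i;
            if (c == '[') {
                tokens.add(new Token(Kind.LBRACKET, "["));
                i++;
            } else if (c == ']') {
                tokens.add(new Token(Kind.RBRACKET, "]"));
                i++;
            } else if (isWhitespaceChar(c)) {
                // Consume the whole run of commas and whitespace as ONE token.
                while (i < input.length() && isWhitespaceChar(input.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(Kind.WHITESPACE, input.substring(start, i)));
            } else {
                // Anything else up to the next delimiter is a symbol.
                while (i < input.length()
                        && !isWhitespaceChar(input.charAt(i))
                        && input.charAt(i) != '['
                        && input.charAt(i) != ']') {
                    i++;
                }
                tokens.add(new Token(Kind.SYMBOL, input.substring(start, i)));
            }
        }
        return tokens;
    }
}
```

With this approach, lexing [a, b c] gives a single WHITESPACE token with text ", " between a and b. The commas survive in the token text, so the formatting side would still need to decide how to keep them.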

Yes, I’m starting to think that my best option is to lex the commas along with the rest of the whitespace into a single token, and then use WhiteSpaceFormattingStrategy to control how they appear in the resulting whitespace tokens after formatting. I’ll try that and report back.