How to create more PSI elements than lexed/parsed?

Hi!
I’m developing a plugin that implements support for a new language. I tried writing a grammar for it a few times but failed, so I took an existing ANTLR4 grammar and used the antlr4-intellij-adaptor to integrate it into the plugin.
This works quite well, but that grammar was written for a compiler, not for IDE use. The lexer only creates tokens for the parts the compiler needs, which results in larger, less fine-grained tokens: whitespace is folded into other token types, no tokens are created for parentheses, brackets, etc.
I would like to improve the PSI tree built on top of that grammar by splitting tokens into smaller ones. This would not modify the file content, only improve its PSI structure.
E.g. having PsiWhiteSpace PsiElement('abc') instead of PsiElement(' abc').

I’m trying to find the right place to do that:

  • Modify the token stream in the ANTLRLexerAdaptor: this seems straightforward, but wouldn’t it then confuse the ANTLR parser, which would not accept the new tokens in those places? (A rough sketch of this idea follows the list.)
  • Modify ANTLRParserAdaptor to create the right PsiBuilder.Markers: this amounts to rewriting the whole class, because the parser, the lexer and the PsiBuilder are tightly coupled.
  • Modify the AST built by PsiParser.parse(IElementType, PsiBuilder): this seems doable, but modifying the tree while walking it looks error-prone, given all the links between nodes.
  • Modify the PSI structure by hooking into ParserDefinition.createElement(ASTNode): I remember reading in the documentation that each PSI element should be backed by an AST node, so this probably isn’t the right place?
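
To make the first option concrete, here is a rough sketch of what I mean by splitting tokens at the lexer level. WhitespaceSplittingLexer is a name I made up; the delegate would be the existing ANTLRLexerAdaptor. If I understand PsiBuilder correctly, tokens listed in ParserDefinition.getWhitespaceTokens() are skipped before the parser ever sees them, so whitespace-only splits should not disturb the ANTLR parser; splitting out non-whitespace tokens such as parentheses would still require the parser to accept them.

```java
import com.intellij.lexer.Lexer;
import com.intellij.lexer.LexerBase;
import com.intellij.psi.TokenType;
import com.intellij.psi.tree.IElementType;

// Sketch only: wraps the existing lexer and, whenever a token starts with
// whitespace, first emits a TokenType.WHITE_SPACE token for the leading
// whitespace and then the remainder with the original token type, so the
// PSI shows PsiWhiteSpace + PsiElement('abc') instead of PsiElement(' abc').
public class WhitespaceSplittingLexer extends LexerBase {
  private final Lexer delegate;
  private int splitPoint = -1;      // end offset of the leading-whitespace part, or -1 if no split
  private boolean inWhitespacePart; // true while the synthetic WHITE_SPACE token is current

  public WhitespaceSplittingLexer(Lexer delegate) { this.delegate = delegate; }

  @Override public void start(CharSequence buffer, int startOffset, int endOffset, int initialState) {
    delegate.start(buffer, startOffset, endOffset, initialState);
    computeSplit();
  }

  private void computeSplit() {
    splitPoint = -1;
    inWhitespacePart = false;
    if (delegate.getTokenType() == null) return;
    CharSequence buf = delegate.getBufferSequence();
    int start = delegate.getTokenStart(), end = delegate.getTokenEnd(), i = start;
    while (i < end && Character.isWhitespace(buf.charAt(i))) i++;
    if (i > start && i < end) {     // split only if there is both leading whitespace and content
      splitPoint = i;
      inWhitespacePart = true;
    }
  }

  @Override public IElementType getTokenType() {
    if (delegate.getTokenType() == null) return null;
    return inWhitespacePart ? TokenType.WHITE_SPACE : delegate.getTokenType();
  }

  @Override public int getTokenStart() {
    if (inWhitespacePart) return delegate.getTokenStart();
    return splitPoint >= 0 ? splitPoint : delegate.getTokenStart();
  }

  @Override public int getTokenEnd() {
    return inWhitespacePart ? splitPoint : delegate.getTokenEnd();
  }

  @Override public void advance() {
    if (inWhitespacePart) {
      inWhitespacePart = false;     // next token is the content part of the same delegate token
    } else {
      delegate.advance();
      computeSplit();
    }
  }

  // Simplified state handling: incremental relexing may not restart exactly at a split point.
  @Override public int getState() { return delegate.getState(); }
  @Override public CharSequence getBufferSequence() { return delegate.getBufferSequence(); }
  @Override public int getBufferEnd() { return delegate.getBufferEnd(); }
}
```

The wrapper would then be returned from ParserDefinition.createLexer(), and TokenType.WHITE_SPACE included in ParserDefinition.getWhitespaceTokens().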

Any advice on the best way to do this? From what I’ve researched, the third option (modifying the AST) seems like the best approach. Is there any tooling that makes it easier/safer?

Thank you!

I don’t have experience with the ANTLR4 integration.

I suggest modifying and updating the grammar itself instead of trying to avoid that by changing the parts of the language integration that rely on it.

I’d recommend switching to standard BNF and GrammarKit now rather than burying yourself in such bridges. You will complicate your life a lot down the road and won’t really save anything.

That was my initial plan, but writing the grammar myself turned out to be far more complex than I expected. The lexer seems simple, but there are too many details to get right. The separation between the lexer and the parser is far from obvious, and I haven’t even started on recovery rules…

Splitting tokens/nodes in code (for example, parsing and splitting function(parameter)) seemed like an easier solution than modifying the existing grammar (I tried, and broke it…). Adding support for parentheses in the lexer is easy for that case, but it breaks in many other places: technically, parentheses can appear in the parameter name and elsewhere, and you don’t want to lex them there, so you suddenly end up with a lot of lexer states, and the lexer starts looking a lot like the parser :confused:
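
To show what I meant by splitting with code, here is a rough sketch limited to a single token type, so it does not need extra lexer states. FUNCTION_CALL, CALL_NAME, LPAREN, PARAMETER and RPAREN are made-up element types standing in for whatever the real grammar defines. The catch, as discussed above, is that unlike whitespace these sub-tokens would not be skipped by PsiBuilder, so the ANTLR parser would still have to accept them.

```java
import com.intellij.lang.Language;
import com.intellij.openapi.util.Pair;
import com.intellij.openapi.util.TextRange;
import com.intellij.psi.tree.IElementType;
import java.util.ArrayList;
import java.util.List;

public class CallTokenSplitter {
  // Made-up element types; a real plugin would use its own Language and token types.
  static final IElementType FUNCTION_CALL = new IElementType("FUNCTION_CALL", Language.ANY);
  static final IElementType CALL_NAME = new IElementType("CALL_NAME", Language.ANY);
  static final IElementType LPAREN = new IElementType("LPAREN", Language.ANY);
  static final IElementType PARAMETER = new IElementType("PARAMETER", Language.ANY);
  static final IElementType RPAREN = new IElementType("RPAREN", Language.ANY);

  /** Splits the text of one FUNCTION_CALL token, e.g. "function(parameter)",
   *  into (type, range-within-the-token) pairs. Falls back to the original
   *  token if the text does not have the expected shape. */
  static List<Pair<IElementType, TextRange>> splitCall(CharSequence tokenText) {
    String text = tokenText.toString();
    int lparen = text.indexOf('(');
    int rparen = text.lastIndexOf(')');
    List<Pair<IElementType, TextRange>> parts = new ArrayList<>();
    if (lparen <= 0 || rparen <= lparen) {
      parts.add(Pair.create(FUNCTION_CALL, new TextRange(0, text.length())));
      return parts;
    }
    parts.add(Pair.create(CALL_NAME, new TextRange(0, lparen)));
    parts.add(Pair.create(LPAREN, new TextRange(lparen, lparen + 1)));
    parts.add(Pair.create(PARAMETER, new TextRange(lparen + 1, rparen)));
    parts.add(Pair.create(RPAREN, new TextRange(rparen, rparen + 1)));
    return parts;
  }
}
```

Restricting the split to text the ANTLR lexer has already classified as a call is what avoids the lexer-state explosion, but it only moves the parser-compatibility problem around instead of solving it.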

I develop my plugin in my spare time, and release it for free, so hiring a consultant to work on that task is probably much more expensive than I can afford; that’s not a solution either.

So I’m a bit stuck; there’s no good solution that I can see right now.