Proper rehighlighting with overlapping lexer states and matches >= 5 letters

I’m creating a plugin for a custom language, but I’ve stumbled across a very nasty highlighting problem. My simplified flex file looks like this:

<YYINITIAL> {
    "G"                                                       { return CustomLanguageTypes.VARIABLE; }
    "Goto "                                                   { return CustomLanguageTypes.GOTO; }
}

Highlighting normally works fine in the editor, i.e. the VARIABLE is highlighted appropriately and so is the GOTO. Normally. The main problem is: if I start typing “Goto”, only the “G” is highlighted as a variable, and once I type the space, the PSI is updated accordingly but the highlighting is not. It still highlights only the “G” instead of the entire “Goto ” word with the GOTO highlighting.

To make it even more fun: this doesn’t happen if “Goto ” is one letter shorter, i.e. “Got ”. Then “Got ” gets highlighted properly. During my testing this always seems to happen with tokens of 5 characters or more: any match of 4 letters or fewer gets highlighted properly, but 5 letters or more does not. How can I fix that? Should I restart the DaemonCodeAnalyzer after typing any letter? Should my generated lexer implement RestartableLexer? Should I do something with the LexerEditorHighlighter? I’m not sure how to proceed from here, so thanks in advance!

Here is my flex file: TI-Basic-IntelliJ-plugin/src/main/java/nl/petertillema/tibasic/syntax/TIBasic.flex at master · PeterTillema/TI-Basic-IntelliJ-plugin · GitHub

After Googling more, it seems someone else had kinda the same problem: https://intellij-support.jetbrains.com/hc/en-us/community/posts/4402747055250-Syntax-highlighing-stops-after-5-symbols. However, that repository doesn’t exist anymore, so I don’t know what the actual fix is.

There are a variety of issues with your language implementation, mostly with the lexer. It’s far too simplistic, with many overlapping and likely incorrectly prioritized tokens. It also tokenizes whitespace, which is generally bad practice unless the whitespace itself has some kind of semantic meaning beyond “whitespace”; whitespace handling should be left to Grammar-Kit’s native whitespace handling. The last point I’ll make here is that including parens and other punctuation inside larger tokens, instead of handling them as their own tokens, is at minimum going to cause issues for highlighting, and is also another smell of poor lexer design.

Ideally, your lexer should be focused on the code structure/syntax, not details like command names. Those should be validated at a higher layer if possible, perhaps at the annotation layer.
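For example, command validation at the annotation layer could look roughly like the sketch below. This is not based on your actual PSI: the package name, TIBasicTypes.IDENTIFIER, and the command set are placeholders for whatever your generated parser and real command list actually provide.

import com.intellij.lang.annotation.AnnotationHolder;
import com.intellij.lang.annotation.Annotator;
import com.intellij.lang.annotation.HighlightSeverity;
import com.intellij.psi.PsiElement;
import org.jetbrains.annotations.NotNull;

import java.util.Set;

import nl.petertillema.tibasic.psi.TIBasicTypes; // assumed location of the generated token types

// Sketch: the lexer only emits a generic identifier/word token, and this
// annotator decides whether that word is actually a known command.
public class TIBasicCommandAnnotator implements Annotator {

    // Placeholder list; the real plugin would use its full command table.
    private static final Set<String> KNOWN_COMMANDS = Set.of("Goto", "Lbl", "Disp", "If", "Then");

    @Override
    public void annotate(@NotNull PsiElement element, @NotNull AnnotationHolder holder) {
        // TIBasicTypes.IDENTIFIER stands in for whatever token type your
        // generated lexer assigns to bare words.
        if (element.getNode().getElementType() != TIBasicTypes.IDENTIFIER) {
            return;
        }
        String text = element.getText();
        if (!KNOWN_COMMANDS.contains(text)) {
            holder.newAnnotation(HighlightSeverity.ERROR, "Unknown command: " + text)
                    .range(element)
                    .create();
        }
    }
}

An annotator like this still has to be registered for your language under the annotator extension point in plugin.xml.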

Check out this change, which implements some of my suggestions (reprioritizing lexing rules, whitespace handling). I have no personal experience with TI-Basic, so I can’t say that my change doesn’t cause any issues, but it definitely resolves the issue you’re describing here.

[Screen recording attachment: Kapture 2025-10-16 at 20.23.042]

I recommend looking at lexers for other similar languages for inspiration.

Good luck!

Awesome, many thanks for the time you’ve put into improving my lexer! I indeed see where the issue arises, and it is indeed fixed. I will first look into my command list to see which ones are kind of “internal tokens”, i.e. not meant to be put into a script, and remove them. Regarding whitespace: normally the language doesn’t allow whitespace at all, but for the sake of sanity, I allow indentation and whitespace after a line, which gets removed if I tokenize the entire source file. But it also seems valid to allow it everywhere, and strip all whitespace before going into tokenization. Then I will also add a formatter to remove spaces around operators (for example).

Regarding your comment that my lexer should be focused on the code structure/syntax, do you mean that I should replace the big command list with a simple regex that matches all the current commands, and then explicitly mention the commands in my BNF file? In that case I’d have finer control over what kinds of arguments get passed and how many, for example.

Anyway, I have merged the changes, and will look into it even more!

Regarding whitespace: normally the language doesn’t allow whitespace at all, but for the sake of sanity, I allow indentation and whitespace after a line, which gets removed if I tokenize the entire source file. But it also seems valid to allow it everywhere, and strip all whitespace before going into tokenization. Then I will also add a formatter to remove spaces around operators (for example).

I see, so whitespace is syntactically important. In this case, you have a couple of options:

  1. You can define your own WHITE_SPACE token to be used in your lexer and parser definitions instead of lexing the WHITE_SPACE pattern as TokenType.WHITE_SPACE. You’d have to update all of your parser rules to decide where whitespace is explicitly allowed (roughly something like this example: private simple_command ::= COMMAND_NO_PARENS WHITE_SPACE+ [(expr (COMMA expr)*)]).

  2. You can let Grammar-Kit continue to handle whitespace and flag invalid whitespace at the annotator layer. This is probably the better approach, since it would be trivial to include a quick-fix with the error annotation to remove the offending space. This approach should also play better with a formatter, since the invalid spaces won’t break your PSI tree. You’d basically be looking for PsiWhiteSpace elements in relevant locations and annotating them like in the linked example (see the sketch after this list for the rough shape).
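If you go with option 2, the annotator could be shaped roughly like this sketch. The isAllowedHere check and the quick-fix are placeholders; the actual rule depends on where your language tolerates whitespace.

import com.intellij.lang.annotation.AnnotationHolder;
import com.intellij.lang.annotation.Annotator;
import com.intellij.lang.annotation.HighlightSeverity;
import com.intellij.psi.PsiElement;
import com.intellij.psi.PsiWhiteSpace;
import org.jetbrains.annotations.NotNull;

// Sketch for option 2: whitespace stays invisible to the parser, and an
// annotator flags the places where the language itself forbids it.
public class TIBasicWhitespaceAnnotator implements Annotator {

    @Override
    public void annotate(@NotNull PsiElement element, @NotNull AnnotationHolder holder) {
        if (!(element instanceof PsiWhiteSpace)) {
            return;
        }
        if (!isAllowedHere(element)) {
            holder.newAnnotation(HighlightSeverity.ERROR, "Whitespace is not allowed here")
                    .range(element)
                    // .withFix(new RemoveWhitespaceFix(element)) would attach a
                    // quick-fix (an IntentionAction you'd write) that simply
                    // deletes the offending element.
                    .create();
        }
    }

    // Placeholder policy: tolerate whitespace containing a newline (indentation
    // and end-of-line whitespace), reject everything else. Replace with your
    // real rule.
    private static boolean isAllowedHere(@NotNull PsiElement ws) {
        return ws.getText().contains("\n")
                || ws.getPrevSibling() == null
                || ws.getNextSibling() == null;
    }
}

The nice side effect of this approach is that the PSI tree stays exactly what Grammar-Kit produced, so the parser and formatter never have to reason about stray spaces.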

Regarding your comment that my lexer should be focused on the code structure/syntax, do you mean that I should replace the big command list with a simple regex that matches all the current commands, and then explicitly mention the commands in my BNF file? In that case I’d have finer control over what kinds of arguments get passed and how many, for example.

I think this is where my lack of full knowledge of TI-Basic makes it hard to give useful advice. It’s entirely possible that what you’re doing with the big list of commands is fine for this language, given its limited scope. It also doesn’t help that my prior experience is with configuration languages rather than programming languages. If it’s working as intended, I probably wouldn’t recommend messing with it.