New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: list item bullet without whitespace #2431
fix: list item bullet without whitespace #2431
Conversation
This makes marked render correctly with the following text: ```markdown - item1 - item2 -Not a list item(without space) - item3 ``` Currently: ```html <ul> <li>item1</li> <li>item2</li> </ul> <p>-Not a list item(without space)</p> <ul> <li>item3</li> </ul> ``` Which should be: ```html <ul> <li>item1</li> <li>item2 -Not a list item(without space)</li> <li>item3</li> </ul> ``` This is the rendered result for commonmark as well. Changes: - `nextBulletRegex` checks for whitespace after the bullet - separately checks for hr or lheading
This pull request is being automatically deployed with Vercel (learn more). markedjs – ./🔍 Inspect: https://vercel.com/markedjs-legacy/markedjs/DNmBC1uatY3aB7N4YXYVbiJSNbum marked-website – ./🔍 Inspect: https://vercel.com/markedjs/marked-website/9mWBpQw1ejGMk7opcnyCD7qEaQLB |
Linting issues need to be fixed. |
I have applied the changes -- fixed the pedantic option, removed comments and restored the run-spec file, and ran eslint. During the process I've noticed there was a typo in the test file which originated from the original thread, which was shadowed by the pedantic option. Fixed that as well, and along the way added another list (with I've noticed that Here's a list of lint warnings, just in case if there are further discussions to be made:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking this PR over, this does solve the issue, however I would propose alternative approach (inspired from this PR), as I will try to explain here:
As you mentioned in #1980:
Maybe the reason
nextBulletRegex
did not have any whitespace in the first place was to make it compatible with these other syntaxes. However, it does not solve the problem in this issue, and at minimum is a misnomer. I think it should be fixed.
Essentially, yes, you might say it is a misnomer since we designed it to catch any line that starts with a bullet symbol, since it gets passed up to the outer loop where we already have a handler in case it is a ---
HR token:
Lines 199 to 201 in 3795476
if (this.rules.block.hr.test(src)) { // End list if bullet was actually HR (possibly move into itemRegex?) | |
break; | |
} |
By changing the regex to only detect list items with a space, we lose the chance to catch HRs, which meant adding a separate check of the HR rules if (this.rules.block.hr.test(line)) {
. However, this rules check is now missing the indentation logic that was already included in nextBulletRegex
with ^ {0,${Math.min(3, indent - 1)}}
, so you had to add further logic to detect Setext headers.
You could perhaps more cleanly just build a second regex rule for HRs but with the same starting indent check. Something like:
const HrRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}((?:- *){3,}|(?:_ *){3,}|(?:\\* *){3,})(?:\\n+|$);
And you could go further. Instead of two separate regexes, simply update the regex in nextBulletRegex
so it catches both "bullets with a space" and "HR" combinations. Although this starts to get hard to read:
//Match any bullet //Must have space //or be all same symbol (HR)
const nextBulletRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}([*+-]|\\d{1,9}[.)])(?:(?: [^\\n]*)?| *(?:\\1 *)*)(?:\\n|$)`);
It's not as explicit as this PR, but avoids some code reuse and extra steps. We could potentially rename nextBulletRegex
to something more appropriate. (nextBreakingSymbol?)
@calculuschild Before I dive even deeper, let's briefly discuss about the directions. I am new to the source code and am not aware of the maintaining atmosphere. In fact I did think about that route. Repeating code for same logic seems troublesome, and maybe we are adding technical debt. In that perspective, having Maybe. There were a couple of reasons I implemented this way.
I'll think about whether we could come up with a clever regex. Meanwhile, can you elaborate your thoughts on this issue? |
@jangxyz I can appreciate where you're coming from. Let's discuss:
Granted, RegEx is a bit hard to read, particularly parsing someone else's code. Hence it's often handy to drop some explanatory comments around similar to how you have outlined above. We've done that in a few places already (and we welcome any contributions to that effect). On the other hand, RegEx is very often the best tool for sifting through strings, and in particular, finding patterns of characters. A 50-character RegEx might look daunting, but the same logic could take 50+ lines of code to do the same thing, and much more slowly at that. Which one is simpler now? (To be fair, sometimes we do overuse RegEx where standard Javascript logic would make more sense, but RegEx can be blazing fast if you do it right, and that's one of the main goals of Marked.)
Correct. And we don't need to add any more nested loops using my suggested approach either. This shouldn't be a concern.
I already provided the regex we would need in my comment above. It's not that complicated. Especially since your logic:
would simplify down to:
This is getting at one of my main concerns. Why try to maintain two different systems (a RegEx rule and a separate nested if statement?) to achieve the same logic? For future maintenance, we can put both regex right next to each other to make it obvious they share the same logic and should be equivalent.
|
I tried your idea, and I have to say the code looks way more simpler. 👍 I was impressed to see it passes the specs as well 😎 The only downside I could think of is, that it feels too cryptic to me. const hrRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}((?:- *){3,}|(?:_ *){3,}|(?:\\* *){3,})(?:\\n+|$);
/*
* <^ {0,${Math.min(3, indent - 1)}}((?:- *){3,}|(?:_ *){3,}|(?:\\* *){3,})(?:\\n+|$)>
* 1. <^> : start of sentence
* 2. < {0,${Math.min(3, indent - 1)}}>: upto 3 spaces
* 3. <( : either one of
* (- *){3,} : three or more `-` marks, optionally followed by any amount of spaces
* |(_ *){3,} : same, in `_`
* |(\\* *){3,} : same, in `*`
* )>
* 4. <(\\n+|$)> : end of sentence, or a new line
*/ Maybe all the considerations about setext headings are 'baked in' to the bullets as well -- I will probably need to try out more test samples to figure that out. And in that case the name I still don't understand why there is a '+' after the newline though ( Interestingly, I think this reveals the issues after applying a well-crafted regular expression.
In the end, I think it's a matter of design principles. I won't mind if the maintainers decide to use cleaner regular expressions to maintain the code. |
as discussed in markedjs#2431 (comment) . all spec tests pass.
While waiting for @calculuschild 's opinions and explanations, I went ahead and pushed the change. It has the regex version now, and all specs pass. |
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
## [4.0.15](v4.0.14...v4.0.15) (2022-05-02) ### Bug Fixes * list item bullet without whitespace ([#2431](#2431)) ([9c10b4d](9c10b4d))
🎉 This PR is included in version 4.0.15 🎉 The release is available on: Your semantic-release bot 📦🚀 |
Marked version:
Markdown flavor: Markdown.pl|CommonMark|GitHub Flavored Markdown|n/a
Description
Fixes #1980
as described in the thread.
This makes marked behave like commonmark. Test is added to ensure it properly works, and nothing else fails.
Contributor
Committer
In most cases, this should be a different person than the contributor.