Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: list item bullet without whitespace #2431

Merged

Conversation

jangxyz
Copy link
Contributor

@jangxyz jangxyz commented Apr 3, 2022

Marked version:

Markdown flavor: Markdown.pl|CommonMark|GitHub Flavored Markdown|n/a

Description

Fixes #1980

as described in the thread.

This makes marked behave like commonmark. Test is added to ensure it properly works, and nothing else fails.

---
pedantic: true
headerIds: false
---
## +list

+ list1
+Not list(without space)
- list2

## -list

- list1
-Not list(without space)
- list2

## number(1.)list 
1. list
1.Notlist(without space)
1. list

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

This makes marked render correctly with the following text:

```markdown
- item1
- item2
-Not a list item(without space)
- item3
```

Currently:
```html
<ul>
<li>item1</li>
<li>item2</li>
</ul>
<p>-Not a list item(without space)</p>
<ul>
<li>item3</li>
</ul>
```

Which should be:
```html
<ul>
<li>item1</li>
<li>item2
-Not a list item(without space)</li>
<li>item3</li>
</ul>
```

This is the rendered result for commonmark as well.

Changes:
- `nextBulletRegex` checks for whitespace after the bullet
- separately checks for hr or lheading
@vercel
Copy link

vercel bot commented Apr 3, 2022

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployments, click below or on the icon next to each commit.

markedjs – ./

🔍 Inspect: https://vercel.com/markedjs-legacy/markedjs/DNmBC1uatY3aB7N4YXYVbiJSNbum
✅ Preview: https://markedjs-git-fork-jangxyz-fix-listitem-b-8c742f-markedjs-legacy.vercel.app

marked-website – ./

🔍 Inspect: https://vercel.com/markedjs/marked-website/9mWBpQw1ejGMk7opcnyCD7qEaQLB
✅ Preview: https://marked-website-git-fork-jangxyz-fix-listitem-bu-61980c-markedjs.vercel.app

test/specs/new/list_item_unindented_asterisk.md Outdated Show resolved Hide resolved
test/specs/run-spec.js Outdated Show resolved Hide resolved
test/specs/run-spec.js Outdated Show resolved Hide resolved
@UziTech
Copy link
Member

UziTech commented Apr 3, 2022

Linting issues need to be fixed.

- remove pedantic option (should not be true)
- change bullet mark of first list to '+' like other items in the list.
- add another list with '*' bullet
@jangxyz
Copy link
Contributor Author

jangxyz commented Apr 4, 2022

I have applied the changes -- fixed the pedantic option, removed comments and restored the run-spec file, and ran eslint.

During the process I've noticed there was a typo in the test file which originated from the original thread, which was shadowed by the pedantic option. Fixed that as well, and along the way added another list (with * bullet) for completeness.

I've noticed that yarn lint actually fixed other files that I have not changed at all. Lot of them are max-len errors, but there are some others such as no-else-return or object-shorthand warnings too. I'm not sure whether there was a recent change in eslint configs, whether it has been a while since there was an overall eslint fix, or I just messed up with the configs somehow, though I don't remember doing anything special with eslint after yarn install. I've just commited the fix only for the files I have touched with -- the Tokenizer.js and run-spec.js file.

Here's a list of lint warnings, just in case if there are further discussions to be made:
jangxyz@hail ~/play/reveal-md-editor/packages/markdowns/repos/marked (fix/listitem-bullet-without-whitespace2 *)
04:10 PM $ ./node_modules/.bin/eslint .

/Users/jangxyz/play/_by_others/markdowns/marked/bin/marked.js
   77:42  warning  Expected line break before `.map`     newline-per-chained-call
   79:11  warning  Expected line break before `.concat`  newline-per-chained-call
  212:3   warning  Expected line break before `.catch`   newline-per-chained-call

/Users/jangxyz/play/_by_others/markdowns/marked/docs/demo/demo.js
  125:1   error    This line has a length of 114. Maximum allowed is 100  max-len
  136:16  warning  Unnecessary 'else' after 'return'                      no-else-return
  265:1   error    This line has a length of 101. Maximum allowed is 100  max-len
  279:7   warning  Expected line break before `.catch`                    newline-per-chained-call
  290:10  warning  Unnecessary 'else' after 'return'                      no-else-return
  361:10  warning  Unnecessary 'else' after 'return'                      no-else-return
  424:1   error    This line has a length of 141. Maximum allowed is 100  max-len
  448:11  warning  Expected property shorthand                            object-shorthand
  449:11  warning  Expected property shorthand                            object-shorthand
  450:11  warning  Expected property shorthand                            object-shorthand
  573:1   error    This line has a length of 126. Maximum allowed is 100  max-len

/Users/jangxyz/play/_by_others/markdowns/marked/docs/demo/worker.js
  47:9   warning  Expected property shorthand        object-shorthand
  60:9   warning  Expected property shorthand        object-shorthand
  99:10  warning  Unnecessary 'else' after 'return'  no-else-return

/Users/jangxyz/play/_by_others/markdowns/marked/run-my-spec.js
  31:3   error  'describe' is not defined     no-undef
  34:7   error  'describe' is not defined     no-undef
  50:39  error  'fit' is not defined          no-undef
  50:58  error  'xit' is not defined          no-undef
  50:64  error  'it' is not defined           no-undef
  55:21  error  'expectAsync' is not defined  no-undef
  57:21  error  'expectAsync' is not defined  no-undef
  59:21  error  'expectAsync' is not defined  no-undef
  64:15  error  'fail' is not defined         no-undef

/Users/jangxyz/play/_by_others/markdowns/marked/src/Lexer.js
  262:1  error  This line has a length of 112. Maximum allowed is 100  max-len
  334:1  error  This line has a length of 180. Maximum allowed is 100  max-len
  341:1  error  This line has a length of 170. Maximum allowed is 100  max-len
  346:1  error  This line has a length of 126. Maximum allowed is 100  max-len
  460:1  error  This line has a length of 112. Maximum allowed is 100  max-len

/Users/jangxyz/play/_by_others/markdowns/marked/src/Parser.js
   68:1   error    This line has a length of 122. Maximum allowed is 100  max-len
   70:1   error    This line has a length of 148. Maximum allowed is 100  max-len
  154:1   error    This line has a length of 126. Maximum allowed is 100  max-len
  199:18  warning  Unnecessary 'else' after 'return'                      no-else-return
  224:1   error    This line has a length of 122. Maximum allowed is 100  max-len
  226:1   error    This line has a length of 138. Maximum allowed is 100  max-len
  278:18  warning  Unnecessary 'else' after 'return'                      no-else-return

/Users/jangxyz/play/_by_others/markdowns/marked/src/Tokenizer.js
  407:1  error  This line has a length of 104. Maximum allowed is 100  max-len
  603:1  error  This line has a length of 103. Maximum allowed is 100  max-len
  607:1  error  This line has a length of 119. Maximum allowed is 100  max-len
  770:1  error  This line has a length of 123. Maximum allowed is 100  max-len

/Users/jangxyz/play/_by_others/markdowns/marked/src/helpers.js
   57:21  warning  Unexpected block statement surrounding arrow body; move the returned value immediately after the `=>`  arrow-body-style
   76:1   error    This line has a length of 110. Maximum allowed is 100                                                  max-len
  120:10  warning  Unnecessary 'else' after 'return'                                                                      no-else-return
  155:14  warning  Unnecessary 'else' after 'return'                                                                      no-else-return

/Users/jangxyz/play/_by_others/markdowns/marked/src/marked.js
  278:1  error  This line has a length of 137. Maximum allowed is 100  max-len

/Users/jangxyz/play/_by_others/markdowns/marked/src/rules.js
   13:1   error    This line has a length of 102. Maximum allowed is 100  max-len
   25:1   error    This line has a length of 122. Maximum allowed is 100  max-len
   26:1   error    This line has a length of 106. Maximum allowed is 100  max-len
   33:1   error    This line has a length of 101. Maximum allowed is 100  max-len
  172:1   error    This line has a length of 226. Maximum allowed is 100  max-len
  173:1   error    This line has a length of 191. Maximum allowed is 100  max-len
  185:1   error    This line has a length of 102. Maximum allowed is 100  max-len
  185:91  warning  Expected line break before `.getRegex`                 newline-per-chained-call
  191:67  warning  Expected line break before `.getRegex`                 newline-per-chained-call
  208:1   error    This line has a length of 159. Maximum allowed is 100  max-len
  281:52  warning  Expected line break before `.getRegex`                 newline-per-chained-call
  286:1   error    This line has a length of 228. Maximum allowed is 100  max-len
  297:43  warning  Expected line break before `.getRegex`                 newline-per-chained-call

/Users/jangxyz/play/_by_others/markdowns/marked/test/bench.js
  189:42  warning  Expected line break before `.map`                                                                      newline-per-chained-call
  189:53  warning  Unexpected block statement surrounding arrow body; move the returned value immediately after the `=>`  arrow-body-style
  191:11  warning  Expected line break before `.concat`                                                                   newline-per-chained-call

/Users/jangxyz/play/_by_others/markdowns/marked/test/helpers/helpers.js
  42:27  warning  Expected block statement surrounding arrow body  arrow-body-style

/Users/jangxyz/play/_by_others/markdowns/marked/test/helpers/load.js
  18:14  warning  Unnecessary 'else' after 'return'  no-else-return
  37:14  warning  Unnecessary 'else' after 'return'  no-else-return

/Users/jangxyz/play/_by_others/markdowns/marked/test/unit/Lexer-spec.js
  795:1  error  This line has a length of 123. Maximum allowed is 100  max-len

/Users/jangxyz/play/_by_others/markdowns/marked/test/unit/marked-spec.js
    1:1   error    This line has a length of 129. Maximum allowed is 100                                                  max-len
  191:1   error    This line has a length of 116. Maximum allowed is 100                                                  max-len
  387:1   error    This line has a length of 106. Maximum allowed is 100                                                  max-len
  396:1   error    This line has a length of 104. Maximum allowed is 100                                                  max-len
  420:1   error    This line has a length of 124. Maximum allowed is 100                                                  max-len
  461:1   error    This line has a length of 123. Maximum allowed is 100                                                  max-len
  834:24  warning  Unexpected block statement surrounding arrow body; move the returned value immediately after the `=>`  arrow-body-style
  837:9   warning  Expected method shorthand                                                                              object-shorthand

/Users/jangxyz/play/_by_others/markdowns/marked/test/update-specs.js
  16:1   error    This line has a length of 110. Maximum allowed is 100  max-len
  39:41  warning  Expected line break before `.match`                    newline-per-chained-call
  45:1   error    This line has a length of 112. Maximum allowed is 100  max-len
  45:51  warning  Expected line break before `.trim`                     newline-per-chained-call
  45:58  warning  Expected line break before `.replace`                  newline-per-chained-call
  47:43  warning  Expected line break before `.replace`                  newline-per-chained-call
  48:61  warning  Expected line break before `.trim`                     newline-per-chained-call
  49:53  warning  Expected line break before `.trim`                     newline-per-chained-call

✖ 83 problems (47 errors, 36 warnings)
  0 errors and 36 warnings potentially fixable with the `--fix` option.

src/Tokenizer.js Show resolved Hide resolved
src/Tokenizer.js Show resolved Hide resolved
Copy link
Contributor

@calculuschild calculuschild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking this PR over, this does solve the issue, however I would propose alternative approach (inspired from this PR), as I will try to explain here:

As you mentioned in #1980:

Maybe the reason nextBulletRegex did not have any whitespace in the first place was to make it compatible with these other syntaxes. However, it does not solve the problem in this issue, and at minimum is a misnomer. I think it should be fixed.

Essentially, yes, you might say it is a misnomer since we designed it to catch any line that starts with a bullet symbol, since it gets passed up to the outer loop where we already have a handler in case it is a --- HR token:

marked/src/Tokenizer.js

Lines 199 to 201 in 3795476

if (this.rules.block.hr.test(src)) { // End list if bullet was actually HR (possibly move into itemRegex?)
break;
}

By changing the regex to only detect list items with a space, we lose the chance to catch HRs, which meant adding a separate check of the HR rules if (this.rules.block.hr.test(line)) {. However, this rules check is now missing the indentation logic that was already included in nextBulletRegex with ^ {0,${Math.min(3, indent - 1)}}, so you had to add further logic to detect Setext headers.

You could perhaps more cleanly just build a second regex rule for HRs but with the same starting indent check. Something like:

 const HrRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}((?:- *){3,}|(?:_ *){3,}|(?:\\* *){3,})(?:\\n+|$); 

And you could go further. Instead of two separate regexes, simply update the regex in nextBulletRegex so it catches both "bullets with a space" and "HR" combinations. Although this starts to get hard to read:

                                                                   //Match any bullet   //Must have space  //or be all same symbol (HR)
const nextBulletRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}([*+-]|\\d{1,9}[.)])(?:(?: [^\\n]*)?| *(?:\\1 *)*)(?:\\n|$)`);

It's not as explicit as this PR, but avoids some code reuse and extra steps. We could potentially rename nextBulletRegex to something more appropriate. (nextBreakingSymbol?)

@jangxyz
Copy link
Contributor Author

jangxyz commented Apr 4, 2022

@calculuschild Before I dive even deeper, let's briefly discuss about the directions. I am new to the source code and am not aware of the maintaining atmosphere.

In fact I did think about that route. Repeating code for same logic seems troublesome, and maybe we are adding technical debt. In that perspective, having nextBulletRegex -- though with a bit misleading name -- seems like a nice catch, to have two birds in one stone. Maybe it could've been better to just rename the variable?

Maybe.

There were a couple of reasons I implemented this way.

  • You can't say keeping the logic inside regex is indeed a simpler solution, especially if you think about the complexity of maintaining the code. Just think about the headache future contributors will have.
    I've already done some verbose analysis on the thread, something like:

    const nextBulletRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}(?:[*+-]|\\d{1,9}[.)])((?: [^\\n]*)?(?:\\n|$))`);
    //  |^ {0,${Math.min(3, indent - 1)}}(?:[*+-]|\\d{1,9}[.)])((?: [^\\n]*)?(?:\\n|$))|
    //    1. |^ |                            : start of sentence, followed by a space
    //    2. |{0,${Math.min(3, indent - 1)}}|: zero to some predefined indent number of spaces (string evaluation)
    //    3. |([*+-]|\\d{1,9}[.)])|          : any item bullet, either ordered or unordered
    //    4. |( [^\\n]*)?(?:\\n|$)|          : any letter until end of the sentence (or a line), preceding with a space.

    (While adding No. 4 chunk here, frankly speaking, I had to sit down and do the interpretation by hand again. And I was the one who wrote it!)

  • There already was a nested while loop checking for the next line. Maybe it's the markdown logic itself is complicated -- dealing list item bullets against thematic breaks and setext headings -- and at some point you need a stronger machine than the regular language (maybe).

  • To be honest, I was new to the code, and was afraid of modifying the regex. Even though I checked with the tests, I couldn't be 100% sure I wasn't missing anything. Perhaps there should be commented documents explaining the expression, or even stronger tests?

I'll think about whether we could come up with a clever regex. Meanwhile, can you elaborate your thoughts on this issue?

@calculuschild
Copy link
Contributor

calculuschild commented Apr 5, 2022

@jangxyz I can appreciate where you're coming from. Let's discuss:

You can't say keeping the logic inside regex is indeed a simpler solution...

Granted, RegEx is a bit hard to read, particularly parsing someone else's code. Hence it's often handy to drop some explanatory comments around similar to how you have outlined above. We've done that in a few places already (and we welcome any contributions to that effect). On the other hand, RegEx is very often the best tool for sifting through strings, and in particular, finding patterns of characters. A 50-character RegEx might look daunting, but the same logic could take 50+ lines of code to do the same thing, and much more slowly at that. Which one is simpler now? (To be fair, sometimes we do overuse RegEx where standard Javascript logic would make more sense, but RegEx can be blazing fast if you do it right, and that's one of the main goals of Marked.)

There already was a nested while loop checking for the next line.

Correct. And we don't need to add any more nested loops using my suggested approach either. This shouldn't be a concern.

I'll think about whether we could come up with a clever regex.

I already provided the regex we would need in my comment above. It's not that complicated. Especially since your logic:

            if (this.rules.block.hr.test(line)) {
              // End list if bullet was actually HR (possibly move into itemRegex?)
              // make sure it is not a setext heading, which takes precedence
              const twoLines = [prevLine, line].join('\n');
              const matchLheadingPattern = this.rules.block.lheading.test(twoLines);
              const lineIdentIsLargerThanPrev = line.search(/[^ ]/) >= indent;
              const isNextLineSetextHeading = matchLheadingPattern && lineIdentIsLargerThanPrev;
              if (!isNextLineSetextHeading) {
                break;
              }
            }

would simplify down to:

                                                 //Adjust indentation                 //Last half of HR rule copied from rules.js
const HrRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}((?:- *){3,}|(?:_ *){3,}|(?:\\* *){3,})(?:\\n+|$); 
 if (HrRegex.test(src)) {
   break; 
 } 

Repeating code for same logic seems troublesome, and maybe we are adding technical debt.

This is getting at one of my main concerns. Why try to maintain two different systems (a RegEx rule and a separate nested if statement?) to achieve the same logic? For future maintenance, we can put both regex right next to each other to make it obvious they share the same logic and should be equivalent.

                                    // Adjust for list indentation 
const nextBulletRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}(?:[*+-]|\\d{1,9}[.)])((?: [^\\n]*)?(?:\\n|$))`);
const HrRegex         = new RegExp(`^ {0,${Math.min(3, indent - 1)}}((?:- *){3,}|(?:_ *){3,}|(?:\\* *){3,})(?:\\n+|$); 

@jangxyz
Copy link
Contributor Author

jangxyz commented Apr 6, 2022

I tried your idea, and I have to say the code looks way more simpler. 👍
There are only two spots of change, and it's not bulky and too verbose as the original PR.

Diff screenshot

image

I was impressed to see it passes the specs as well 😎

The only downside I could think of is, that it feels too cryptic to me.
I've figured out (the hard way) that the markdown list item syntax can collide with setext headings and horizontal rules, concerned with indentations and precedences, so I added verbose checks for it. Looking at hrRegex, it seems to deal with all of that with a single check.

const hrRegex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}((?:- *){3,}|(?:_ *){3,}|(?:\\* *){3,})(?:\\n+|$); 
  /*
   * <^ {0,${Math.min(3, indent - 1)}}((?:- *){3,}|(?:_ *){3,}|(?:\\* *){3,})(?:\\n+|$)>
   *   1. <^>                              : start of sentence
   *   2. < {0,${Math.min(3, indent - 1)}}>: upto 3 spaces
   *   3. <(                               : either one of
   *         (- *){3,}                     :   three or more `-` marks, optionally followed by any amount of spaces
   *        |(_ *){3,}                     :   same, in `_`
   *        |(\\* *){3,}                   :   same, in `*`
   *      )>
   *   4. <(\\n+|$)>                       : end of sentence, or a new line
   */

Maybe all the considerations about setext headings are 'baked in' to the bullets as well -- I will probably need to try out more test samples to figure that out. And in that case the name hrRegex is perhaps misleading, since it does not mention the headings. (Maybe we can add a more descriptive comment for that.)

I still don't understand why there is a '+' after the newline though (\\n+). I did find it inside rules.js, but why do you need to check for more than one newline characters? I tried running tests without it, and it ran fine. Perhaps there's a reason for it for another case, or there's some legacy. (Or maybe it comes down to everyone's regex skills...)

Interestingly, I think this reveals the issues after applying a well-crafted regular expression.

  • Simple code, easier to read the overall logic.
  • Difficult to figure out what the specific regex is doing (and what not).

In the end, I think it's a matter of design principles. I won't mind if the maintainers decide to use cleaner regular expressions to maintain the code.

@jangxyz
Copy link
Contributor Author

jangxyz commented Apr 14, 2022

While waiting for @calculuschild 's opinions and explanations, I went ahead and pushed the change. It has the regex version now, and all specs pass.

@vercel
Copy link

vercel bot commented Apr 29, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
marked-website ✅ Ready (Inspect) Visit Preview Apr 29, 2022 at 2:47PM (UTC)

@UziTech UziTech changed the title Fix listitem bullet without whitespace2 fix: list item bullet without whitespace May 2, 2022
@UziTech UziTech merged commit 9c10b4d into markedjs:master May 2, 2022
github-actions bot pushed a commit that referenced this pull request May 2, 2022
## [4.0.15](v4.0.14...v4.0.15) (2022-05-02)

### Bug Fixes

* list item bullet without whitespace ([#2431](#2431)) ([9c10b4d](9c10b4d))
@github-actions
Copy link

github-actions bot commented May 2, 2022

🎉 This PR is included in version 4.0.15 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Italic/bold is treated as a list, When Italic/bold is written on the next line of the list
3 participants