Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add preprocess and postprocess hooks #2730

Merged
merged 13 commits into from Mar 22, 2023

Conversation

UziTech
Copy link
Member

@UziTech UziTech commented Feb 11, 2023

Marked version: 4.2.12

Description

This is the start of implementing #1878

Add preprocess and postprocess hooks so extensions can process the markdown and options before sending it to marked and they can process the html after marked is done with it.

A couple use cases come to mind. We can have a sanitize extension that works on the final output html. We can have an extension that parses front-matter for the options marked should use during the parsing.

Docs examples:

Example: Set options based on front-matter

import { marked } from 'marked';
import fm from 'front-matter';

// Override function
const hooks = {
  preprocess(markdown) {
    const { attributes, body } = fm(markdown);
    for (const prop in attributes) {
      if (prop in this.options) {
        this.options[prop] = attributes[prop];
      }
    }
    return body;
  }
};

marked.use({ hooks });

// Run marked
console.log(marked.parse(`
---
headerIds: false
---

## test
`.trim()));

Output:

<h2>test</h2>

Example: Sanitize HTML with isomorphic-dompurify

import { marked } from 'marked';
import DOMPurify from 'isomorphic-dompurify';

// Override function
const hooks = {
  postprocess(html) {
    return DOMPurify.sanitize(html);
  }
};

marked.use({ hooks });

// Run marked
console.log(marked.parse(`
<img src=x onerror=alert(1)//>
`));

Output:

<img src="x">

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@vercel
Copy link

vercel bot commented Feb 11, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated
marked-website ✅ Ready (Inspect) Visit Preview 💬 Add your feedback Mar 22, 2023 at 5:44AM (UTC)

Comment on lines +2 to +11
static passThroughHooks = new Set([
'preprocess',
'postprocess'
]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently both preprocess and postprocess are "passthrough" hooks, which means the return value of one function gets passed to the next hook.

In the future I could see us adding some hooks that would not pass the return to the next hook. For example the way we call renderer or tokenizer methods is not pass through. The first tokenizer is called and if it returns something other than false we don't call the next tokenizer of the same name at all.

const hooks = marked.defaults.hooks || new Hooks();
for (const prop in pack.hooks) {
const prevHook = hooks[prop];
if (Hooks.passThroughHooks.has(prop)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where we determine what to do for the next hook based on whether the hook is a "passthrough" hook. Either we pass the return value to the next function or we stop on false like tokenizers and renderers.

src/marked.js Outdated
Comment on lines 36 to 37
opt = merge({}, marked.defaults, opt || {});
const origOpt = { ...opt };
opt = { ...marked.defaults, ...origOpt };
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the merge helper in favor of the spread operator. This actually sped up marked's benchmarks by about 300ms (~5%).

@@ -255,7 +255,7 @@ smartypants('"this ... string"')

<h2 id="walk-tokens">Walk Tokens : <code>walkTokens</code></h2>

The walkTokens function gets called with every token. Child tokens are called before moving on to sibling tokens. Each token is passed by reference so updates are persisted when passed to the parser. The return value of the function is ignored.
The walkTokens function gets called with every token. Child tokens are called before moving on to sibling tokens. Each token is passed by reference so updates are persisted when passed to the parser. When [`async`](#async) mode is enabled, the return value is awaited. Otherwise the return value is ignored.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also merged the changes from #2728 in this PR because I am touching all of the same code.

fixes #2721

@UziTech
Copy link
Member Author

UziTech commented Feb 11, 2023

It might not be a bad idea to create multiple PRs for this change:

I added them all here to get the complete picture and make sure my idea actually worked.

src/marked.js Outdated Show resolved Hide resolved
@UziTech
Copy link
Member Author

UziTech commented Feb 25, 2023

@calculuschild @styfle thoughts?

Are these type of extension hooks something we want to maintain?

Pros

  • Enables more types of extensions that need to do something to the markdown/options before parsing or the output HTML.

Cons

  • More code to maintain. I could see extension creators wanting even more hooks (i.e. between lexing and parsing)

The alternative is for these types of extensions just wrapping marked and including it as a dependency. But there is no easy way to compose multiple extensions that way.

@styfle
Copy link
Member

styfle commented Feb 26, 2023

I read the PR description but I don’t understand the purpose of this feature.

Since the input of preprocess/postprocess is just the markdown/html, it seems like it would be equivalent to the current usage:

const md = preprocess(markdown)
const html = marked(md)
const output = postprocess(html)

Is there a use case that hooks cover that isn’t covered by this snippet?

@UziTech
Copy link
Member Author

UziTech commented Feb 26, 2023

No, that is exactly what this does.

The benefit of this PR is that extensions can do that so they can be composable.

The examples that I give are frontmatter and dompurify. If someone creates an npm package that wraps marked to add front matter parsing and another package that wraps marked to do sanitization with dompurify a user wouldn't be able to use them together.

With this PR the user would be able to include both or just one with marked.use.

@calculuschild
Copy link
Contributor

calculuschild commented Mar 2, 2023

My initial impression is this is moving a bit outside the scope of being a Markdown parser, and my inclination is that we should keep Marked.js that way. If I were a developer who wanted to have some kind of frontmatter parsing and a post-sanitizer, I would probably opt to just do it the way @styfle shows, passing the result from process to process rather than embedding everything inside of Marked.js.

If someone creates an npm package that wraps marked

I may be misunderstanding, but why would a package designer want to wrap marked.js like this anyway? Why not just use the base DOMPurify package as an external postprocessing step as we have always done? If a user wants to have both a pre- and post-processing step it seems more practical to just use the packages directly rather than having to build a specialized Marked.js-extension plugin that does the same thing?

A lot of time has passed since #1878.... Now that custom extensions exist I wonder if its worth re-evaluating whether/which hooks are still necessary?

@UziTech
Copy link
Member Author

UziTech commented Mar 3, 2023

The dompurify and front-matter example are fairly trivial examples, but if someone wants to create an extension like Table of Contents or footnotes that have multiple parts (a tokenizer/renderer and something that modifies the output html) then anyone who wants to use that extension would need to implement multiple parts instead of just being able to write marked.use(extension).

A lot of time has passed since #1878.... Now that custom extensions exist I wonder if its worth re-evaluating whether/which hooks are still necessary?

Ya I don't think all of those hooks are necessary which is why I started a new PR instead of building on that one. That was more of a POC to define which hooks would be needed to get rid of the current options so we could just have extensions. I would still like to deprecate a lot of the current options and find ways to move them to extensions.

I agree this is somewhat out of scope for marked.

@huaichaow do you have any input on this? I know you are trying to write an extension that would take advantage of these hooks. Is there any use cases we should consider that can't be worked around without these hooks?

@huaichaow
Copy link

Hi @UziTech @styfle @calculuschild

the idea is to have a place to clean up the side effects created in extensions, e.g., the Table-of-Contents extension I've been working on https://github.com/huaichaow/marked-toc-extension/blob/main/src/index.ts.

all the headings found are saved in an array in the walkTokens callback and consumed in the renderer function of the extension. it's ok if marked is fired only once, but in most cases, e.g., in a live editor, it will be executed many times leading to duplicated headings pushed into the array over and over again.


why not pass a new marked?

currently, it's not possible as the settings are cumulative, and there's no method to reset the marked.defaults, please check:

marked.options =
marked.setOptions = function(opt) {
  merge(marked.defaults, opt);
  changeDefaults(marked.defaults);
  return marked;
};

and most importantly, lifecycle hooks will make it more flexible and easier to create Extensions - more stages are available for use in extensions

@UziTech
Copy link
Member Author

UziTech commented Mar 10, 2023

@styfle @calculuschild what is the final decision? Are we saying these type of hooks are out of scope and should be handled separate from the custom extension?

For example in @huaichaow's example he could export a function that users of his extension should call after each marked run instead of being able to handle cleanup in the extension, like:

import { marked } from 'marked';
import { tocExtension, tocCleanup } from 'marked-toc-extension';

marked.use(tocExtension);

marked.parse(someMarkdown);
tocCleanup();

marked.parse(someOtherMarkdown);
tocCleanup();

I would say I am still on the fence. On the one hand this is slightly out of scope. On the other hand this does seem to complement the custom extensions really well. I can think of many extensions that would need these hooks to use marked.use, but they don't really need to be extensions (like the front-matter and dompurify examples above).

Copy link
Member

@styfle styfle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +0 on this one.

It seems it might be useful for some of those cases mentioned above.

@calculuschild
Copy link
Contributor

calculuschild commented Mar 20, 2023

Eh, I guess if this isn't adding a heavy maintenance load then I can allow it.

I'm still not convinced the original user in question really needs this feature for their extension (seems the cleanup step could be avoided by reworking the extension), but maybe there will be some use case in the future that I'm just not seeing yet.

I probably wouldn't encourage extensions that are only preprocessors or post processors, (i.e., a sanitizer extension that is literally just wrapping Marked.js around Dompurify) because that seems like it would overly-complicate a simple task and give the impression that Marked.js isn't compatible with existing pre/post-processing packages.

Perhaps it would be helpful to add a small note to the docs for this that you can pre- and post-process things without having to make an extension; this just lets you bundle it with a Tokenizer or Renderer extension if it relies on some unique HTML handling at the same time.

@UziTech UziTech merged commit 9b452bc into markedjs:master Mar 22, 2023
9 checks passed
@UziTech UziTech deleted the prepostprocess branch March 22, 2023 05:49
github-actions bot pushed a commit that referenced this pull request Mar 22, 2023
# [4.3.0](v4.2.12...v4.3.0) (2023-03-22)

### Bug Fixes

* always return promise if async ([#2728](#2728)) ([042dcc5](042dcc5))
* fenced code doesn't need a trailing newline ([#2756](#2756)) ([3acbb7f](3acbb7f))

### Features

* add preprocess and postprocess hooks ([#2730](#2730)) ([9b452bc](9b452bc))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants