Public API for accessing end position and raw match in parser actions #131

Closed
kevinmehall opened this issue Dec 5, 2012 · 5 comments

Comments

@kevinmehall

CoffeeScriptRedux currently obtains the raw match string by tediously concatenating the raw text of all the subexpressions, but this could easily be handled by PEG.js itself.

It's already possible by accessing the pos and input variables that happen to be in lexical scope:

main = expr
expr = "(" fn:[abcd] subexpr:((" " e:expr){return e})? ")"
  { return {start:offset, end:pos,
            raw:input.substring(offset, pos),
            fn:fn, subexpr:subexpr};
  }

With this grammar, the input (a (b (c))) parses to:

{ start: 0,
  end: 11,
  raw: '(a (b (c)))',
  fn: 'a',
  subexpr: 
   { start: 3,
     end: 10,
     raw: '(b (c))',
     fn: 'b',
     subexpr: 
      { start: 6,
        end: 9,
        raw: '(c)',
        fn: 'c',
        subexpr: '' } } }

It would be really easy to wrap this in a function like the new offset() for future-compatibility and consistency. I see there's a code generator rewrite coming, otherwise this would be a pull request...
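
As a rough illustration of the idea, here is a minimal sketch in plain JavaScript of what such wrappers could look like. It is not PEG.js's generated code: the names makeActionHelpers, end() and raw() are hypothetical, and only input, offset and pos are taken from the example above.

```javascript
// Hypothetical sketch, not PEG.js internals. It mimics the three values an
// action can see in the example above: input, offset and pos.
function makeActionHelpers(input, offset, pos) {
  return {
    // Start position of the current match.
    offset: function () { return offset; },
    // End position of the current match (assumed name).
    end: function () { return pos; },
    // Text consumed by the current match (assumed name).
    raw: function () { return input.substring(offset, pos); }
  };
}

// The innermost "(c)" match from the example: offset 6, pos 9.
var helpers = makeActionHelpers("(a (b (c)))", 6, 9);
console.log(helpers.offset(), helpers.end(), helpers.raw()); // 6 9 (c)
```

Wrapping the variables in functions would let the generator rename or restructure input and pos later without breaking grammars that rely on them.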

@dmajda
Contributor

dmajda commented Dec 5, 2012

Do I understand correctly that you need both structured values and the raw text? Meaning the new $ operator isn't enough for your purposes?

@kevinmehall
Author

Correct, although the end position is probably more important than the raw text (especially since substring can produce the raw string given the [start,end] position and input).

CoffeeScriptRedux keeps the start and end position of each AST node to generate source maps. Right now, each rule's action concatenates together all of the subexpressions' raw text (even ignored whitespace!) to obtain the raw text. The lengths of the raw strings and start offsets are used to calculate the end positions. I'm not sure the raw text is actually used for anything besides that and making the parse tree more human-readable for debugging.
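
To make that bookkeeping concrete, here is a sketch in plain JavaScript of the concatenation approach. It is not taken from the CoffeeScriptRedux grammar: the function name combineRaw and the { raw: ... } node shape are assumptions made for illustration.

```javascript
// Hypothetical illustration of rebuilding raw text by concatenation.
// The node shape and function name are assumptions, not CoffeeScriptRedux code.
function combineRaw(startOffset, parts) {
  // Every matched piece, including ignored whitespace, has to contribute its
  // raw text; otherwise the length, and so the end position, comes out wrong.
  var raw = parts.map(function (p) { return p.raw; }).join("");
  return { start: startOffset, end: startOffset + raw.length, raw: raw };
}

// Pieces of the "(b (c))" subexpression from the earlier example, starting at 3.
var node = combineRaw(3, [
  { raw: "(" }, { raw: "b" }, { raw: " " }, { raw: "(c)" }, { raw: ")" }
]);
console.log(node.start, node.end, node.raw); // 3 10 (b (c))
```

Each rule action has to repeat this for all of its subexpressions, which is the overhead a built-in end position or raw-text accessor would remove.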

Grammar source is here, and you can see that a large portion of it is dealing with raw values, which I find a little silly. (@michaelficarra is the main developer; I've been contributing to other parts and looking on at the parser with confusion)

@michaelficarra

What an amazing coincidence. I just recently forked PEGjs and was working on this exact issue. This would be extremely useful for cleaning up my grammar, as you can see. +1.

edit: Not that amazing. I just noticed I had opened an issue regarding this.

@curvedmark

> Correct, although the end position is probably more important than the raw text (especially since substring can produce the raw string given the [start,end] position and input).

If the raw text is exposed, the end position can be easily obtained by offset + raw.length.

I personally feel exposing the matched text makes more sense than exposing the end position. Probably raw() or string() or whatever to the action.
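
A quick check, using values from the example earlier in the thread, that either piece of information recovers the other once the input and start offset are known. The variable names are illustrative only.

```javascript
var input = "(a (b (c)))";
var offset = 3; // start of the "(b (c))" subexpression

// Given the end position, recover the raw text...
var end = 10;
var raw = input.substring(offset, end);

// ...and given the raw text, recover the end position.
var endFromRaw = offset + raw.length;

console.log(raw, endFromRaw); // (b (c)) 10
```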

@dmajda
Contributor

dmajda commented Dec 10, 2012

> I personally feel exposing the matched text makes more sense than exposing the end position. Probably raw() or string() or whatever to the action.

I agree. I'll push a patch for that in a minute (this was a quick fix).

@dmajda dmajda closed this as completed in bea6b1f Dec 10, 2012
@ghost ghost assigned dmajda Dec 10, 2012