How to use the sacremoses.util.xml_unescape function in sacremoses

To help you get started, we've selected a few examples of sacremoses.util.xml_unescape, based on how it is commonly used in public projects.

GitHub: yannvgn / laserembeddings / laserembeddings /
def tokenize(self, text: str) -> str:
        """Tokenizes a text and returns the tokens as a string"""

        # not implemented

        # NORM_PUNC
        text = self.normalizer.normalize(text)

        # DESCAPE
        if self.descape:
            text = xml_unescape(text)

        # see:
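        # The remaining arguments are truncated in this excerpt; return_str=True
        # is assumed here because the result is handled as a string below.
        text = self.tokenizer.tokenize(text, return_str=True)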

        # jieba
        if self.lang == 'zh':
            text = ' '.join(jieba.cut(text.rstrip('\r\n')))

        # MECAB
        if self.lang == 'ja':
            text = self.mecab_tokenizer.parse(text).rstrip('\r\n')
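
The DESCAPE step in the excerpt above can be tried in isolation. The following is a minimal sketch, not the laserembeddings implementation: `descape` is a hypothetical helper that mirrors the `if self.descape:` branch, and the fallback `xml_unescape` is a stand-in built on the standard library (its entity table is an assumption about what sacremoses handles) so the example still runs when sacremoses is not installed.

```python
# Prefer the real helper; fall back to a stdlib stand-in if sacremoses is missing.
try:
    from sacremoses.util import xml_unescape
except ImportError:
    from xml.sax.saxutils import unescape

    def xml_unescape(text: str) -> str:
        """Stand-in (assumption): approximates the sacremoses entity table."""
        return unescape(
            text,
            entities={
                "&apos;": "'",
                "&quot;": '"',
                "&#124;": "|",
                "&#91;": "[",
                "&#93;": "]",
            },
        )


def descape(text: str, enabled: bool = True) -> str:
    """Hypothetical helper mirroring the `if self.descape:` step."""
    return xml_unescape(text) if enabled else text


escaped = "Tom &amp; Jerry &lt;3 &quot;cheese&quot;"
print(descape(escaped))                 # entities converted back to characters
print(descape(escaped, enabled=False))  # text passed through unchanged
```

Unescaping before tokenization means the tokenizer sees literal characters such as `&` and `"` rather than entity strings, which is presumably why the excerpt runs this step ahead of `self.tokenizer.tokenize`.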