How to use the konoha.WordTokenizer function in konoha

To help you get started, we've selected a few konoha examples, based on popular ways it is used in public projects.


Source: himkt/tiny_tokenizer — konoha/api/tokenizers.py (view on GitHub)
    # (excerpt from a FastAPI request handler; `params` and `texts` are
    # parsed earlier in the function)
    else:
        raise HTTPException(status_code=400, detail="text or texts is required.")

    mode = params.mode.lower()
    model_path = (
        "data/model.spm" if params.tokenizer.lower() == "sentencepiece" else None
    )  # NOQA

    signature = f"{params.tokenizer}.{model_path}.{mode}"
    if signature in request.app.tokenizers:
        logging.info(f"Hit cache: {signature}")
        tokenizer = request.app.tokenizers[signature]
    else:
        logging.info(f"Create tokenizer: {signature}")
        try:
            tokenizer = WordTokenizer(
                tokenizer=params.tokenizer,
                with_postag=True,
                model_path=model_path,
                mode=mode,
            )
            request.app.tokenizers[signature] = tokenizer
        except Exception:
            raise HTTPException(status_code=400, detail="failed to initialize tokenizer")

    results = [
        [
            {"surface": t.surface, "part_of_speech": t.postag}
            for t in tokenizer.tokenize(text)
        ]
        for text in texts
    ]
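The core idea in the excerpt is caching tokenizers by a settings "signature" so that repeated requests with the same configuration reuse one instance instead of re-initializing it. The pattern can be sketched in dependency-free Python as follows; the names `make_signature` and `get_tokenizer` are illustrative, not part of konoha's API, and `factory` stands in for the `WordTokenizer(...)` constructor so the sketch runs without konoha installed.

```python
# A hedged, dependency-free sketch of the tokenizer-caching pattern above:
# instances are keyed by a signature string built from their settings.
_tokenizer_cache = {}

def make_signature(tokenizer_name, model_path, mode):
    # Mirrors the f"{params.tokenizer}.{model_path}.{mode}" key in the excerpt.
    return f"{tokenizer_name}.{model_path}.{mode}"

def get_tokenizer(tokenizer_name, model_path=None, mode="A", factory=dict):
    # `factory` stands in for WordTokenizer(...); any callable taking the
    # same keyword arguments works here.
    signature = make_signature(tokenizer_name, model_path, mode)
    if signature not in _tokenizer_cache:
        _tokenizer_cache[signature] = factory(
            tokenizer=tokenizer_name, model_path=model_path, mode=mode
        )
    return _tokenizer_cache[signature]

first = get_tokenizer("mecab")
second = get_tokenizer("mecab")
print(first is second)  # the same cached instance is returned
```

Because initializing some backends (e.g. MeCab or sentencepiece with a model file) is relatively expensive, caching per signature keeps the API handler fast after the first request for each configuration.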