How to use the wordfreq.language_info.get_language_info function in wordfreq

To help you get started, we've selected a few wordfreq examples showing how get_language_info is used in practice, drawn from the wordfreq source itself.

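Before looking at the excerpts, here is a minimal sketch of calling get_language_info directly. The keys read here ('normal_form', 'tokenizer', 'dotless_i') are among the ones the excerpts below rely on; the full set of keys may differ between wordfreq versions.

from wordfreq.language_info import get_language_info

# Look up the per-language preprocessing settings for a language code.
# The excerpts below call get_language_info with plain language-code strings
# as well as with langcodes.Language objects; 'tr' or 'zh-Hant' both work.
info = get_language_info('tr')

# The result is a dict of settings that drive preprocessing and tokenization.
print(info['normal_form'])   # Unicode normalization form ('NFC' or 'NFKC')
print(info['tokenizer'])     # which tokenizer this language needs
print(info['dotless_i'])     # whether Turkish-style 'i' case folding applies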

github LuminosoInsight / wordfreq / wordfreq / preprocess.py

    >>> preprocess_text('культуры', 'sr')
    "kul'tury"

    Azerbaijani (Azeri) has a similar transliteration step to Serbian,
    and then the Latin-alphabet text is handled similarly to Turkish.

    >>> preprocess_text('бағырты', 'az')
    'bağırtı'

    We don't transliterate Traditional to Simplified Chinese in this step.
    There are some steps where we unify them internally: see chinese.py
    for more information.
    """
    # NFC or NFKC normalization, as needed for the language
    info = get_language_info(language)
    text = unicodedata.normalize(info['normal_form'], text)

    # Transliteration of multi-script languages
    if info['transliteration'] is not None:
        text = transliterate(info['transliteration'], text)

    # Abjad mark removal
    if info['remove_marks']:
        text = remove_marks(text)

    # Case folding
    if info['dotless_i']:
        text = casefold_with_i_dots(text)
    else:
        text = text.casefold()
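
A short usage sketch based on the excerpt above: preprocess_text applies normalization, transliteration, mark removal and case folding, each step driven by the get_language_info dict. The import path follows the file shown (wordfreq/preprocess.py); the sample inputs are illustrative.

from wordfreq.preprocess import preprocess_text
from wordfreq.language_info import get_language_info

# Turkish has info['dotless_i'] set, so case folding maps 'İ' to 'i'
# instead of leaving a stray combining dot behind.
print(preprocess_text('İstanbul', 'tr'))

# Serbian Cyrillic is transliterated to the Latin alphabet, as in the
# doctest above.
print(preprocess_text('культуры', 'sr'))

# The settings that select those code paths:
print(get_language_info('tr')['dotless_i'])
print(get_language_info('sr')['transliteration'])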

github LuminosoInsight / wordfreq / wordfreq / __init__.py

    # Frequencies for multiple tokens are combined using the formula
    #     1 / f = 1 / f1 + 1 / f2 + ...
    # Thus the resulting frequency is less than any individual frequency, and
    # the smallest frequency dominates the sum.
    freqs = get_frequency_dict(lang, wordlist)
    one_over_result = 0.0
    for token in tokens:
        if token not in freqs:
            # If any word is missing, just return the default value
            return minimum
        one_over_result += 1.0 / freqs[token]

    freq = 1.0 / one_over_result

    if get_language_info(lang)['tokenizer'] == 'jieba':
        # If we used the Jieba tokenizer, we could tokenize anything to match
        # our wordlist, even nonsense. To counteract this, we multiply by a
        # probability for each word break that was inferred.
        freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)

    # All our frequency data is only precise to within 1% anyway, so round
    # it to 3 significant digits
    unrounded = max(freq, minimum)
    if unrounded == 0.:
        return 0.
    else:
        leading_zeroes = math.floor(-math.log(unrounded, 10))
        return round(unrounded, leading_zeroes + 3)
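
The comment above describes a harmonic combination: the phrase frequency is 1 / (1/f1 + 1/f2 + ...), so it is always below the frequency of the rarest token. Here is a standalone sketch of that arithmetic and of the 3-significant-digit rounding, using made-up frequencies rather than real wordfreq data.

import math

# Hypothetical per-token frequencies (illustrative values only).
token_freqs = [1e-4, 1e-6]

# Combine with 1 / f = 1 / f1 + 1 / f2 + ...; the rarest token dominates.
one_over_result = sum(1.0 / f for f in token_freqs)
freq = 1.0 / one_over_result          # about 9.9e-7, just under 1e-6

# Round to 3 significant digits, as in the excerpt above.
leading_zeroes = math.floor(-math.log(freq, 10))
print(round(freq, leading_zeroes + 3))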

github LuminosoInsight / wordfreq / wordfreq / tokens.py

    The `external_wordlist` option only affects Chinese tokenization. If it's
    True, then wordfreq will not use its own Chinese wordlist for tokenization.
    Instead, it will use the large wordlist packaged with the Jieba tokenizer,
    and it will leave Traditional Chinese characters as is. This will probably
    give more accurate tokenization, but the resulting tokens won't necessarily
    have word frequencies that can be looked up.

    If you end up seeing tokens that are entire phrases or sentences glued
    together, that probably means you passed in CJK text with the wrong
    language code.
    """
    # Use globals to load CJK tokenizers on demand, so that we can still run
    # in environments that lack the CJK dependencies
    global _mecab_tokenize, _jieba_tokenize

    language = langcodes.get(lang)
    info = get_language_info(language)
    text = preprocess_text(text, language)

    if info['tokenizer'] == 'mecab':
        from wordfreq.mecab import mecab_tokenize as _mecab_tokenize

        # Get just the language code out of the Language object, so we can
        # use it to select a MeCab dictionary
        tokens = _mecab_tokenize(text, language.language)
        if not include_punctuation:
            tokens = [token for token in tokens if not PUNCT_RE.match(token)]
    elif info['tokenizer'] == 'jieba':
        from wordfreq.chinese import jieba_tokenize as _jieba_tokenize

        tokens = _jieba_tokenize(text, external_wordlist=external_wordlist)
        if not include_punctuation:
            tokens = [token for token in tokens if not PUNCT_RE.match(token)]
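
A hedged sketch of calling tokenize for languages that take different branches of the excerpt above. It assumes tokenize is importable from the top-level wordfreq package; the MeCab and Jieba branches also require wordfreq's optional CJK dependencies.

from wordfreq import tokenize
from wordfreq.language_info import get_language_info

# The 'tokenizer' entry tells you which branch a language will take:
# 'jieba' for Chinese and 'mecab' for Japanese, per the excerpt above.
print(get_language_info('zh')['tokenizer'])
print(get_language_info('ja')['tokenizer'])

# A space-delimited language does not need the CJK tokenizers.
print(tokenize('Hello, world!', 'en'))

# Chinese text is segmented with Jieba (requires the CJK extras).
print(tokenize('谢谢你', 'zh'))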

github LuminosoInsight / wordfreq / wordfreq / tokens.py

    In particular:

    - Any sequence of 2 or more adjacent digits, possibly with intervening
      punctuation such as a decimal point, will replace each digit with '0'
      so that frequencies for numbers don't have to be counted separately.

      This is similar to but not quite identical to the word2vec Google News
      data, which replaces digits with '#' in tokens with more than one digit.

    - In Chinese, unless Traditional Chinese is specifically requested using
      'zh-Hant', all characters will be converted to Simplified Chinese.
    """
    global _simplify_chinese

    info = get_language_info(lang)
    tokens = tokenize(text, lang, include_punctuation, external_wordlist)

    if info['lookup_transliteration'] == 'zh-Hans':
        from wordfreq.chinese import simplify_chinese as _simplify_chinese

        tokens = [_simplify_chinese(token) for token in tokens]

    return [smash_numbers(token) for token in tokens]
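
A sketch of the behaviour described above: runs of digits are smashed to '0', and Chinese tokens are converted to Simplified unless 'zh-Hant' is requested. The import path follows the file shown (wordfreq/tokens.py); the Chinese examples need the optional CJK dependencies.

from wordfreq.tokens import lossy_tokenize

# '2021' and '1999' both become '0000', so numbers share frequency entries.
print(lossy_tokenize('It happened in 2021, not 1999.', 'en'))

# Traditional Chinese input is simplified for lookup...
print(lossy_tokenize('漢字', 'zh'))

# ...unless Traditional Chinese is requested explicitly with 'zh-Hant'.
print(lossy_tokenize('漢字', 'zh-Hant'))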