How to use the textacy.preprocessing.normalize_whitespace function in textacy

To help you get started, we’ve selected a few textacy examples drawn from popular ways the library is used in public projects.
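As the tests below show, `normalize_whitespace` collapses runs of spaces and tabs to a single space, collapses runs of line breaks to a single newline, and strips leading and trailing whitespace. A rough stdlib approximation of that behavior (a sketch for illustration, not textacy's actual implementation) looks like:

```python
import re

def normalize_whitespace(text: str) -> str:
    # Collapse any run containing a line break (plus surrounding spaces/tabs)
    # down to a single newline, then collapse remaining horizontal
    # whitespace to a single space, then trim the ends.
    text = re.sub(r"[ \t]*(?:\r\n|[\r\n])\s*", "\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

In practice you would just call `textacy.preprocessing.normalize_whitespace`; the sketch only makes the contract explicit.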


From chartbeat-labs/textacy: tests/preprocessing/test_normalize.py
def test_normalize_whitespace():
    text = "Hello, world!  Hello...\t \tworld?\n\nHello:\r\n\n\nWorld. "
    proc_text = "Hello, world! Hello... world?\nHello:\nWorld."
    assert preprocessing.normalize_whitespace(text) == proc_text
From chartbeat-labs/textacy: tests/test_keyterms.py
def spacy_doc():
    spacy_lang = cache.load_spacy_lang("en")
    text = """
    Friedman joined the London bureau of United Press International after completing his master's degree. He was dispatched a year later to Beirut, where he lived from June 1979 to May 1981 while covering the Lebanon Civil War. He was hired by The New York Times as a reporter in 1981 and re-dispatched to Beirut at the start of the 1982 Israeli invasion of Lebanon. His coverage of the war, particularly the Sabra and Shatila massacre, won him the Pulitzer Prize for International Reporting (shared with Loren Jenkins of The Washington Post). Alongside David K. Shipler he also won the George Polk Award for foreign reporting.

    In June 1984, Friedman was transferred to Jerusalem, where he served as the New York Times Jerusalem Bureau Chief until February 1988. That year he received a second Pulitzer Prize for International Reporting, which cited his coverage of the First Palestinian Intifada. He wrote a book, From Beirut to Jerusalem, describing his experiences in the Middle East, which won the 1989 U.S. National Book Award for Nonfiction.

    Friedman covered Secretary of State James Baker during the administration of President George H. W. Bush. Following the election of Bill Clinton in 1992, Friedman became the White House correspondent for the New York Times. In 1994, he began to write more about foreign policy and economics, and moved to the op-ed page of The New York Times the following year as a foreign affairs columnist. In 2002, Friedman won the Pulitzer Prize for Commentary for his "clarity of vision, based on extensive reporting, in commenting on the worldwide impact of the terrorist threat."

    In February 2002, Friedman met Saudi Crown Prince Abdullah and encouraged him to make a comprehensive attempt to end the Arab-Israeli conflict by normalizing Arab relations with Israel in exchange for the return of refugees alongside an end to the Israel territorial occupations. Abdullah proposed the Arab Peace Initiative at the Beirut Summit that March, which Friedman has since strongly supported.

    Friedman received the 2004 Overseas Press Club Award for lifetime achievement and was named to the Order of the British Empire by Queen Elizabeth II.

    In May 2011, The New York Times reported that President Barack Obama "has sounded out" Friedman concerning Middle East issues.
    """
    spacy_doc = spacy_lang(preprocessing.normalize_whitespace(text), disable=["parser"])
    return spacy_doc
From chartbeat-labs/textacy: tests/test_readme.py
def test_plaintext_functionality(text):
    preprocessed_text = preprocessing.normalize_whitespace(text)
    # Chain the steps: pass the normalized text on, rather than discarding it.
    preprocessed_text = preprocessing.remove_punctuation(preprocessed_text)
    preprocessed_text = preprocessed_text.lower()
    assert all(char.islower() for char in preprocessed_text if char.isalpha())
    assert all(char.isalnum() or char.isspace() for char in preprocessed_text)
    keyword = "America"
    kwics = text_utils.keyword_in_context(
        text, keyword, window_width=35, print_only=False
    )
    for pre, kw, post in kwics:
        assert kw == keyword
        assert isinstance(pre, compat.unicode_)
        assert isinstance(post, compat.unicode_)
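With `print_only=False`, `text_utils.keyword_in_context` yields `(pre, keyword, post)` tuples, which is what the loop above unpacks. A hypothetical stand-in (not textacy's code) showing that shape:

```python
def keyword_in_context(text, keyword, window_width=50):
    # Yield (pre, keyword, post) windows around each occurrence of keyword,
    # where pre/post are up to window_width characters of surrounding text.
    start = text.find(keyword)
    while start != -1:
        pre = text[max(0, start - window_width):start]
        post = text[start + len(keyword):start + len(keyword) + window_width]
        yield pre, keyword, post
        start = text.find(keyword, start + 1)
```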
From chartbeat-labs/textacy: scripts/train_lang_identifier.py
    ds = []
    for filepath in sorted(textacy.io.get_filepaths(dirpath, match_regex=r"\.tar\.gz$")):
        fname = os.path.basename(filepath)
        lang = fname.split("_")[0]
        try:
            lang = iso_lang_map[lang]
        except KeyError:
            continue
        with tarfile.open(filepath, mode="r") as tf:
            for member in tf:
                if re.search(r".*?-sentences\.txt$", member.name):
                    with tf.extractfile(member) as f:
                        for line in f:
                            idx, text = line.decode("utf-8").split(sep="\t", maxsplit=1)
                            text = textacy.preprocessing.normalize_whitespace(text)
                            if len(text) >= min_len:
                                ds.append((text.strip(), lang))
    return ds
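The inner loop above boils down to: split each tab-separated line into an index and a sentence, clean up the whitespace, and keep only sentences long enough to be useful training data. A hypothetical helper distilling that step (the name `parse_sentences` and the `min_len` default are illustrative, not part of textacy):

```python
def parse_sentences(lines, min_len=25):
    # Each Leipzig-style sentences-file line is "<index>\t<sentence>".
    out = []
    for line in lines:
        _idx, text = line.split("\t", maxsplit=1)
        text = " ".join(text.split())  # crude whitespace normalization
        if len(text) >= min_len:
            out.append(text)
    return out
```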
From chartbeat-labs/textacy: textacy/datasets/udhr.py
def _load_and_parse_text_file(self, filepath):
        with io.open(filepath, mode="rt", encoding="utf-8") as f:
            text_lines = [line.strip() for line in f.readlines()]
        # chop off the header, if it exists
        try:
            header_idx = text_lines.index("---")
            text_lines = text_lines[header_idx + 1:]
        except ValueError:
            pass
        return preprocessing.normalize_whitespace("\n".join(text_lines))
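The header-chopping pattern above is worth noting on its own: look for a `"---"` separator line and, if found, keep only what follows it. Extracted as a small standalone function (a sketch based on the snippet, not udhr.py's API):

```python
def strip_header(text_lines):
    # Drop everything up to and including the "---" separator, if present;
    # otherwise return the lines unchanged.
    try:
        header_idx = text_lines.index("---")
        return text_lines[header_idx + 1:]
    except ValueError:
        return text_lines
```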
From ebursztein/sitefab: sitefab/nlp.py
    text = preprocessing.replace_emails(text, replace_with='')
    text = preprocessing.replace_urls(text, replace_with='')
    text = preprocessing.replace_hashtags(text, replace_with='')
    text = preprocessing.replace_phone_numbers(text, replace_with='')
    text = preprocessing.replace_numbers(text, replace_with='')

    text = preprocessing.remove_accents(text)
    text = preprocessing.remove_punctuation(text)

    text = preprocessing.normalize_quotation_marks(text)
    text = preprocessing.normalize_hyphenated_words(text)
    text = text.replace('\n', ' ').replace('\t', ' ')
    text = text.lower()

    text = preprocessing.normalize_whitespace(text)
    return text
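The snippet applies its cleanup steps in a deliberate order: replacements first, then removals, with `normalize_whitespace` last so the gaps left by earlier steps are collapsed. One way to package such an ordering is a small composition helper (a sketch; recent textacy releases ship a similar `preprocessing.make_pipeline`):

```python
from functools import reduce

def make_pipeline(*steps):
    # Compose text-cleaning callables left to right: the output of each
    # step feeds the next, mirroring the reassignment chain above.
    return lambda text: reduce(lambda t, step: step(t), steps, text)

# Illustrative steps only; swap in textacy's preprocessing functions as needed.
clean = make_pipeline(str.lower, lambda t: " ".join(t.split()))
```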