How to use the gne.utils.get_longest_common_sub_string function in gne

To help you get started, we’ve selected a few gne examples, based on popular ways it is used in public projects.


kingname / GeneralNewsExtractor / gne / extractor / TitleExtractor.py
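GNE becomes the world's best news extraction module - Toutiao
        Xinhuanet: GNE becomes the world's best news extraction module

        In addition, one of the page's h tags will also contain this news title.

        Therefore, the text of the h tags is matched against the title text to find
        the string best suited to be the news title. However, both the title and the
        h tag text may contain special symbols, so we cannot simply check whether the
        h tag text occurs inside the title; the longest common substring is needed here.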
        :param element:
        :return:
        """
        # Collect every text node found under the h1 to h5 headings.
        h_tag_texts_list = element.xpath('(//h1//text() | //h2//text() | //h3//text() | //h4//text() | //h5//text())')
        title_text = ''.join(element.xpath('//title/text()'))
        news_title = ''
        # Keep the longest substring shared by the <title> text and any heading text.
        for h_tag_text in h_tag_texts_list:
            lcs = get_longest_common_sub_string(title_text, h_tag_text)
            if len(lcs) > len(news_title):
                news_title = lcs
        return news_title
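
The excerpt above calls get_longest_common_sub_string but does not show its body. The following is a minimal sketch of how a longest-common-substring helper could work, using standard dynamic programming; it is not necessarily how gne.utils implements it. The names longest_common_substring and pick_title are hypothetical and exist only for this illustration.

```python
from typing import Iterable


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters found in both a and b.

    Hypothetical stand-in for gne.utils.get_longest_common_sub_string;
    the real helper may differ in details such as tie-breaking.
    """
    if not a or not b:
        return ''
    # prev[j] is the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    best_len = 0
    best_end = 0  # index in a just past the end of the best match so far
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def pick_title(title_text: str, h_tag_texts: Iterable[str]) -> str:
    """Mirror the loop in the excerpt: keep the longest shared substring."""
    news_title = ''
    for h_tag_text in h_tag_texts:
        lcs = longest_common_substring(title_text, h_tag_text)
        if len(lcs) > len(news_title):
            news_title = lcs
    return news_title


if __name__ == '__main__':
    # Made-up sample data: a <title> with a site suffix, plus heading texts.
    title = 'GNE: a news extractor - Example News'
    headings = ['Example News', 'GNE: a news extractor', 'Related']
    print(pick_title(title, headings))
```

The table-based approach takes time proportional to len(a) * len(b), which is usually acceptable for a page title compared against a handful of heading strings.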