searxng/searx/engines/bing.py

"""
 Bing (Web)

 @website     https://www.bing.com
 @provide-api yes (http://datamarket.azure.com/dataset/bing/search),
              max. 5000 query/month

 @using-api   no (because of query limit)
 @results     HTML (using search portal)
 @stable      no (HTML can change)
 @parse       url, title, content

 @todo        publishedDate
"""

import re
from urllib.parse import urlencode
from lxml import html
from searx import logger
from searx.engines.xpath import extract_text
from searx.utils import match_language, eval_xpath

logger = logger.getChild('bing engine')

# engine dependent config
categories = ['general']
paging = True
language_support = True
supported_languages_url = 'https://www.bing.com/account/general'
language_aliases = {'zh-CN': 'zh-CHS', 'zh-TW': 'zh-CHT', 'zh-HK': 'zh-CHT'}

# search-url
base_url = 'https://www.bing.com/'
search_string = 'search?{query}&first={offset}'


def _get_offset_from_pageno(pageno):
    # 'first' is the 1-based index of the first result; each page holds 10 results
    return (pageno - 1) * 10 + 1


# do search-request
def request(query, params):
    # default to the first page; a default of 0 would yield the invalid offset -9
    offset = _get_offset_from_pageno(params.get('pageno', 1))

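    if params['language'] == 'all':
        # no language selected: fall back to English, otherwise Bing mixes languages
        lang = 'EN'
    else:
        # supported_languages is not defined in this module; the engine loader sets it
        lang = match_language(params['language'], supported_languages, language_aliases)

    # query is bytes here: decode it, prepend the "language:XX" operator, re-encode
    query = 'language:{} {}'.format(lang.split('-')[0].upper(), query.decode()).encode()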

    search_path = search_string.format(
        query=urlencode({'q': query}),
        offset=offset)

    params['url'] = base_url + search_path

    return params


# get response from search-request
def response(resp):
    results = []
    result_len = 0

    dom = html.fromstring(resp.text)
    # parse results
    for result in eval_xpath(dom, '//div[@class="sa_cc"]'):
        link = eval_xpath(result, './/h3/a')[0]
        url = link.attrib.get('href')
        title = extract_text(link)
        content = extract_text(eval_xpath(result, './/p'))

        # append result
        results.append({'url': url,
                        'title': title,
                        'content': content})

    # parse the alternative "b_algo" markup as well; this loop always runs,
    # whether or not the loop above found anything
    for result in eval_xpath(dom, '//li[@class="b_algo"]'):
        link = eval_xpath(result, './/h2/a')[0]
        url = link.attrib.get('href')
        title = extract_text(link)
        content = extract_text(eval_xpath(result, './/p'))

        # append result
        results.append({'url': url,
                        'title': title,
                        'content': content})

    try:
        result_len_container = "".join(eval_xpath(dom, '//span[@class="sb_count"]//text()'))
        if "-" in result_len_container:
            # Remove the leading "from-to" range of a paginated response. The slice
            # assumes "to" is at most one digit longer than "from".
            result_len_container = result_len_container[result_len_container.find("-") * 2 + 2:]

        result_len_container = re.sub('[^0-9]', '', result_len_container)
        if len(result_len_container) > 0:
            result_len = int(result_len_container)
    except Exception as e:
        # a missing or malformed counter is not fatal; keep result_len = 0
        logger.debug('result error :\n%s', e)

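    # Bing answers an out-of-range page with the first page again, so drop the
    # results when the requested offset lies beyond the reported total
    if result_len and _get_offset_from_pageno(resp.search_params.get("pageno", 1)) > result_len:
        return []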

    results.append({'number_of_results': result_len})
    return results


# get supported languages from their site
def _fetch_supported_languages(resp):
    lang_tags = set()

    setmkt = re.compile('setmkt=([^&]*)')
    dom = html.fromstring(resp.text)
    lang_links = eval_xpath(dom, "//li/a[contains(@href, 'setmkt')]")

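    for a in lang_links:
        href = eval_xpath(a, './@href')[0]
        match = setmkt.search(href)
        # skip links without a "setmkt=<lang>-<nation>" value instead of raising
        if match is None or '-' not in match.group(1):
            continue
        l_tag = match.group(1)
        _lang, _nation = l_tag.split('-', 1)
        # normalize to the form "xx-YY"
        l_tag = _lang.lower() + '-' + _nation.upper()
        lang_tags.add(l_tag)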

    return list(lang_tags)
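
The result-count handling in `response()` can be tried in isolation. The following is a minimal sketch, not part of the engine: the helper names `offset_from_pageno`, `parse_result_count` and `is_past_last_result` are hypothetical, and the counter strings in the usage note are assumed examples rather than text captured from Bing.

```python
import re


def offset_from_pageno(pageno):
    # Same arithmetic as _get_offset_from_pageno: page 1 -> 1, page 2 -> 11, ...
    return (pageno - 1) * 10 + 1


def parse_result_count(text):
    """Return the total found in a counter string, or 0 if there is none."""
    if "-" in text:
        # Drop the leading "from-to" range with the same slice the engine uses.
        text = text[text.find("-") * 2 + 2:]
    digits = re.sub('[^0-9]', '', text)
    return int(digits) if digits else 0


def is_past_last_result(pageno, result_len):
    # A count of 0 means "unknown", so the guard does not apply in that case.
    return bool(result_len) and offset_from_pageno(pageno) > result_len
```

With an assumed counter such as `"11-20 of 100 results"`, `parse_result_count` yields 100, and `is_past_last_result(13, 100)` is true because page 13 starts at offset 121. This is the case in which the engine returns an empty list rather than the repeated first page.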