searxng/searx/engines/bing.py

"""
 Bing (Web)

 @website     https://www.bing.com
 @provide-api yes (http://datamarket.azure.com/dataset/bing/search),
              max. 5000 queries/month

 @using-api   no (because of query limit)
 @results     HTML (using search portal)
 @stable      no (HTML can change)
 @parse       url, title, content

 @todo        publishedDate
"""

from lxml import html
from searx.engines.xpath import extract_text
from searx.url_utils import urlencode
from searx.utils import match_language, gen_useragent

# engine dependent config
categories = ['general']
paging = True
language_support = True
supported_languages_url = 'https://www.bing.com/account/general'
language_aliases = {'zh-CN': 'zh-CHS', 'zh-TW': 'zh-CHT', 'zh-HK': 'zh-CHT'}

# search-url
base_url = 'https://www.bing.com/'
search_string = 'search?{query}&first={offset}'


# do search-request
def request(query, params):
    offset = (params['pageno'] - 1) * 10 + 1

    if params['language'] == 'all':
        lang = 'EN'
    else:
        lang = match_language(params['language'], supported_languages, language_aliases)

    query = u'language:{} {}'.format(lang.split('-')[0].upper(), query.decode('utf-8')).encode('utf-8')

    search_path = search_string.format(
        query=urlencode({'q': query}),
        offset=offset)

    params['url'] = base_url + search_path

    params['headers']['User-Agent'] = gen_useragent('Windows NT 6.3; WOW64')

    return params


# get response from search-request
def response(resp):
    results = []

    dom = html.fromstring(resp.text)

    try:
        results.append({'number_of_results': int(dom.xpath('//span[@class="sb_count"]/text()')[0]
                                                 .split()[0].replace(',', ''))})
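    except (IndexError, ValueError):
        # the result count is optional: the span may be missing or not start with a number
        pass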

    # parse results
    for result in dom.xpath('//div[@class="sa_cc"]'):
        link = result.xpath('.//h3/a')[0]
        url = link.attrib.get('href')
        title = extract_text(link)
        content = extract_text(result.xpath('.//p'))

        # append result
        results.append({'url': url,
                        'title': title,
                        'content': content})

    # parse results from the alternative markup (li.b_algo); this loop always runs
    for result in dom.xpath('//li[@class="b_algo"]'):
        link = result.xpath('.//h2/a')[0]
        url = link.attrib.get('href')
        title = extract_text(link)
        content = extract_text(result.xpath('.//p'))

        # append result
        results.append({'url': url,
                        'title': title,
                        'content': content})

    # return results
    return results


# get supported languages from their site
def _fetch_supported_languages(resp):
    supported_languages = []
    dom = html.fromstring(resp.text)
    options = dom.xpath('//div[@id="limit-languages"]//input')
    for option in options:
        code = option.xpath('./@id')[0].replace('_', '-')
        if code == 'nb':
            code = 'no'
        supported_languages.append(code)

    return supported_languages
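
The URL construction in request() can be tried outside searx with a small standalone sketch. The helper name build_bing_url and its keyword arguments are hypothetical and not part of the engine; the sketch assumes Python 3 and the standard-library urlencode, whereas the engine imports urlencode from searx.url_utils and works on a byte-string query.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.bing.com/'
SEARCH_STRING = 'search?{query}&first={offset}'


def build_bing_url(query, pageno=1, lang='EN'):
    """Hypothetical helper mirroring request() without the searx params dict."""
    # Bing's `first` parameter is 1-based: page 1 -> 1, page 2 -> 11, ...
    offset = (pageno - 1) * 10 + 1
    # Only the primary language subtag is kept, upper-cased: 'de-DE' -> 'DE'.
    lang_prefix = lang.split('-')[0].upper()
    full_query = 'language:{} {}'.format(lang_prefix, query)
    return BASE_URL + SEARCH_STRING.format(
        query=urlencode({'q': full_query}), offset=offset)


print(build_bing_url('searx', pageno=2, lang='de-DE'))
# -> https://www.bing.com/search?q=language%3ADE+searx&first=11
```

The language filter travels inside the q parameter as a `language:XX` prefix, so urlencode escapes the colon and the space along with the rest of the query.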