searxng/searx/engines/duckduckgo.py

# SPDX-License-Identifier: AGPL-3.0-or-later
"""
 DuckDuckGo (Web)
"""

from json import loads
from lxml.html import fromstring
from searx.utils import extract_text, match_language, eval_xpath, dict_subset
from searx.network import get

# about
about = {
    "website": 'https://duckduckgo.com/',
    "wikidata_id": 'Q12805',
    "official_api_documentation": 'https://duckduckgo.com/api',
    "use_official_api": False,
    "require_api_key": False,
    "results": 'HTML',
}

# engine dependent config
categories = ['general']
# no paging: DuckDuckGo requires a unique "vqd" parameter between requests
paging = False
supported_languages_url = 'https://duckduckgo.com/util/u172.js'
time_range_support = True

language_aliases = {
    'ar-SA': 'ar-XA',
    'es-419': 'es-XL',
    'ja': 'jp-JP',
    'ko': 'kr-KR',
    'sl-SI': 'sl-SL',
    'zh-TW': 'tzh-TW',
    'zh-HK': 'tzh-HK'
}

# search-url
url = 'https://html.duckduckgo.com/html'
url_ping = 'https://duckduckgo.com/t/sl_h'
time_range_dict = {'day': 'd',
                   'week': 'w',
                   'month': 'm',
                   'year': 'y'}

# specific xpath variables
result_xpath = '//div[@class="result results_links results_links_deep web-result "]'  # noqa
url_xpath = './/a[@class="result__a"]/@href'
title_xpath = './/a[@class="result__a"]'
content_xpath = './/a[@class="result__snippet"]'
correction_xpath = '//div[@id="did_you_mean"]//a'


def get_region_code(lang, lang_list=None):
    """Map the query language to a region code that DuckDuckGo accepts.

    Returns None for 'all'; otherwise a lowercase 'country-language' code such as 'us-en'.
    """
    if lang == 'all':
        return None

    lang_code = match_language(lang, lang_list or [], language_aliases, 'wt-WT')
    lang_parts = lang_code.split('-')

    # country code goes first
    return lang_parts[1].lower() + '-' + lang_parts[0].lower()


def request(query, params):
    # unsupported time range: return params without a URL so that no request is sent
    if params['time_range'] is not None and params['time_range'] not in time_range_dict:
        return params

    params['url'] = url
    params['method'] = 'POST'
    params['data']['q'] = query
    params['data']['b'] = ''

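    # supported_languages is not defined in this file; searx sets it on the module when the engine is loaded
    region_code = get_region_code(params['language'], supported_languages)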
    if region_code:
        params['data']['kl'] = region_code
        params['cookies']['kl'] = region_code

    if params['time_range'] in time_range_dict:
        params['data']['df'] = time_range_dict[params['time_range']]

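    # do not follow redirects: a DuckDuckGo bang such as "!g" is answered with a redirect to another site
    params['allow_redirects'] = False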
    return params


# get response from search-request
def response(resp):
    # a 303 means DuckDuckGo tried to redirect (e.g. for a bang query): return no results
    if resp.status_code == 303:
        return []

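    # ping: a second request after the main one, so that the language preference is honoured
    # see https://github.com/searx/searx/issues/2259
    headers_ping = dict_subset(resp.request.headers, ['User-Agent', 'Accept-Encoding', 'Accept', 'Cookie'])
    get(url_ping, headers=headers_ping)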

    # parse the response
    results = []
    doc = fromstring(resp.text)
    for i, r in enumerate(eval_xpath(doc, result_xpath)):
        if i >= 30:
            break
        try:
            res_url = eval_xpath(r, url_xpath)[-1]
        except Exception:  # skip result blocks without a usable link
            continue

        if not res_url:
            continue

        title = extract_text(eval_xpath(r, title_xpath))
        content = extract_text(eval_xpath(r, content_xpath))

        # append result
        results.append({'title': title,
                        'content': content,
                        'url': res_url})

    # parse correction
    for correction in eval_xpath(doc, correction_xpath):
        # append correction
        results.append({'correction': extract_text(correction)})

    # return results
    return results


# get supported languages from their site
def _fetch_supported_languages(resp):

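    # the response is a JS file that embeds the regions as an object literal;
    # 'regions:{' is 9 characters long, so + 8 keeps the opening brace
    response_page = resp.text
    response_page = response_page[response_page.find('regions:{') + 8:]
    # cut after the first closing brace (assumes the regions object is not nested)
    response_page = response_page[:response_page.find('}') + 1]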

    regions_json = loads(response_page)
    # region keys look like 'us-en' (country-language); convert them to 'en-US'
    return [key[3:] + '-' + key[:2].upper() for key in regions_json]