searxng/searx/engines/wikipedia.py

# SPDX-License-Identifier: AGPL-3.0-or-later
"""
 Wikipedia (Web)
"""

from urllib.parse import quote
from json import loads
from lxml.html import fromstring
from searx.utils import match_language, searx_useragent
from searx.network import raise_for_httperror

# about
about = {
    "website": 'https://www.wikipedia.org/',
    "wikidata_id": 'Q52',
    "official_api_documentation": 'https://en.wikipedia.org/api/',
    "use_official_api": True,
    "require_api_key": False,
    "results": 'JSON',
}

# search-url
search_url = 'https://{language}.wikipedia.org/api/rest_v1/page/summary/{title}'
supported_languages_url = 'https://meta.wikimedia.org/wiki/List_of_Wikipedias'
language_variants = {"zh": ("zh-cn", "zh-hk", "zh-mo", "zh-my", "zh-sg", "zh-tw")}


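# set language in base_url
# NOTE: supported_languages and language_aliases are not defined in this file;
# they are expected to be set on this module by the engine loader.
def url_lang(lang):
    lang_pre = lang.split('-')[0]
    # fall back to English for 'all' or for languages that are neither supported nor aliased
    if lang_pre == 'all' or (lang_pre not in supported_languages and lang_pre not in language_aliases):
        return 'en'
    return match_language(lang, supported_languages, language_aliases).split('-')[0]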


# do search-request
def request(query, params):
    if query.islower():
        query = query.title()

    language = url_lang(params['language'])
    params['url'] = search_url.format(title=quote(query), language=language)

    if params['language'].lower() in language_variants.get(language, []):
        params['headers']['Accept-Language'] = params['language'].lower()

    params['headers']['User-Agent'] = searx_useragent()
    params['raise_for_httperror'] = False
    params['soft_max_redirects'] = 2

    return params


# get response from search-request
def response(resp):
    if resp.status_code == 404:
        return []

    if resp.status_code == 400:
        try:
            api_result = loads(resp.text)
        except ValueError:
            # body is not valid JSON: fall through to raise_for_httperror below
            pass
        else:
            if (
                api_result.get('type') == 'https://mediawiki.org/wiki/HyperSwitch/errors/bad_request'
                and api_result.get('detail') == 'title-invalid-characters'
            ):
                return []

    raise_for_httperror(resp)

    results = []
    api_result = loads(resp.text)

    # skip anything that is not a standard article (e.g. disambiguation pages)
    if api_result.get('type') != 'standard':
        return []

    title = api_result['title']
    wikipedia_link = api_result['content_urls']['desktop']['page']

    results.append({'url': wikipedia_link, 'title': title})

    results.append(
        {
            'infobox': title,
            'id': wikipedia_link,
            'content': api_result.get('extract', ''),
            'img_src': api_result.get('thumbnail', {}).get('source'),
            'urls': [{'title': 'Wikipedia', 'url': wikipedia_link}],
        }
    )

    return results


# get supported languages from their site
def _fetch_supported_languages(resp):
    supported_languages = {}
    dom = fromstring(resp.text)
    tables = dom.xpath('//table[contains(@class,"sortable")]')
    for table in tables:
        # exclude header row
        trs = table.xpath('.//tr')[1:]
        for tr in trs:
            td = tr.xpath('./td')
            code = td[3].xpath('./a')[0].text
            name = td[2].xpath('./a')[0].text
            english_name = td[1].xpath('./a')[0].text
            articles = int(td[4].xpath('./a/b')[0].text.replace(',', ''))
            # exclude languages with too few articles
            if articles >= 100:
                supported_languages[code] = {"name": name, "english_name": english_name}

    return supported_languages
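
The request and response steps above can be tried without the searx runtime. The following is a minimal sketch, not part of the engine: `build_summary_url`, `parse_summary`, and the `sample` payload are hypothetical names and data introduced only for illustration, and the payload merely imitates the fields that `response()` reads.

```python
from json import loads
from urllib.parse import quote

SEARCH_URL = 'https://{language}.wikipedia.org/api/rest_v1/page/summary/{title}'


def build_summary_url(query, language='en'):
    """Mirror request(): title-case an all-lowercase query, then percent-encode it."""
    if query.islower():
        query = query.title()
    return SEARCH_URL.format(title=quote(query), language=language)


def parse_summary(text):
    """Mirror response() for an already-fetched JSON body (no HTTP status handling)."""
    api_result = loads(text)
    # anything other than a standard article yields no result
    if api_result.get('type') != 'standard':
        return []
    title = api_result['title']
    link = api_result['content_urls']['desktop']['page']
    return [
        {'url': link, 'title': title},
        {
            'infobox': title,
            'id': link,
            'content': api_result.get('extract', ''),
            'img_src': api_result.get('thumbnail', {}).get('source'),
            'urls': [{'title': 'Wikipedia', 'url': link}],
        },
    ]


# Hypothetical payload shaped like the fields the engine reads; not a real API response.
sample = (
    '{"type": "standard", "title": "Alan Turing", '
    '"content_urls": {"desktop": {"page": "https://en.wikipedia.org/wiki/Alan_Turing"}}, '
    '"extract": "English mathematician."}'
)

url = build_summary_url('alan turing')
results = parse_summary(sample)
```

Here `results` holds one plain link result and one infobox entry; because the sample has no `thumbnail` key, the infobox's `img_src` is `None`.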