Partial reverse engineering of the Google engines including a improved language
and region handling based on the engine.traits_v1 data.
When ever possible the implementations of the Google engines try to make use of
the async REST APIs. The get_lang_info() has been generalized to a
get_google_info() function / especially the region handling has been improved by
adding the cr parameter.
searx/data/engine_traits.json
Add data type "traits_v1" generated by the fetch_traits() functions from:
- Google (WEB),
- Google images,
- Google news,
- Google scholar and
- Google videos
and remove data from obsolete data type "supported_languages".
A traits.custom type that maps region codes to *supported_domains* is fetched
from https://www.google.com/supported_domains
searx/autocomplete.py:
Reversed engineered autocomplete from Google WEB. Supports Google's languages and
subdomains. The old API suggestqueries.google.com/complete has been replaced
by the async REST API: https://{subdomain}/complete/search?{args}
searx/engines/google.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
- always use the async REST API (formally known as 'use_mobile_ui')
- use *supported_domains* from traits
- improved the result list by fetching './/div[@data-content-feature]'
and parsing the type of the various *content features* --> thumbnails are
added
searx/engines/google_images.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- if exists, freshness_date is added to the result
- issue 1864: result list has been improved a lot (due to the new cr parameter)
searx/engines/google_news.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
*supported_domains* is not needed but a ceid list has been added.
- different region handling compared to Google WEB
- fixed for various languages & regions (due to the new ceid parameter) /
avoid CONSENT page
- Google News do no longer support time range
- result list has been fixed: XPath of pub_date and pub_origin
searx/engines/google_videos.py
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- add paging support
- implement a async request ('asearch': 'arc' & 'async':
'use_ac:true,_fmt:html')
- simplified code (thanks to '_fmt:html' request)
- issue 1359: fixed xpath of video length data
searx/engines/google_scholar.py
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- request(): include patents & citations
- response(): fixed CAPTCHA detection (Scholar has its own CATCHA manager)
- hardening XPath to iterate over results
- fixed XPath of pub_type (has been change from gs_ct1 to gs_cgt2 class)
- issue 1769 fixed: new request implementation is no longer incompatible
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Partial reverse engineering of the DuckDuckGo (DDG) engines including a
improved language and region handling based on the enigne.traits_v1 data.
- DDG Lite
- DDG Instant Answer API
- DDG Images
- DDG Weather
docs/src/searx.engine.duckduckgo.rst:
Online documentation of the DDG engines (make docs.live)
searx/data/engine_traits.json
Add data type "traits_v1" generated by the fetch_traits() functions from:
- "duckduckgo" (WEB),
- "duckduckgo images" and
- "duckduckgo weather"
and remove data from obsolete data type "supported_languages".
searx/autocomplete.py:
Reversed engineered Autocomplete from DDG. Supports DDG's languages.
searx/engines/duckduckgo.py:
- fetch_traits(): Fetch languages & regions from DDG.
- get_ddg_lang(): Get DDG's language identifier from SearXNG's locale. DDG
defines its languages by region codes. DDG-Lite does not offer a language
selection to the user, only a region can be selected by the user.
- Cache ``vqd`` value: The vqd value depends on the query string and is needed
for the follow up pages or the images loaded by a XMLHttpRequest (DDG
images). The ``vqd`` value of a search term is stored for 10min in the
redis DB.
- DDG Lite engine: reversed engineered request method with improved Language
and region support and better ``vqd`` handling.
searx/engines/duckduckgo_definitions.py: DDG Instant Answer API
The *instant answers* API does not support languages, or at least we could not
find out how language support should work. It seems that most of the features
are based on English terms.
searx/engines/duckduckgo_images.py: DDG Images
Reversed engineered request method. Improved language and region handling
based on cookies and the enigne.traits_v1 data. Response: add image format to
the result list
searx/engines/duckduckgo_weather.py: DDG Weather
Improved language and region handling based on cookies and the
enigne.traits_v1 data.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
One reason for the often seen CAPTCHA of the Startpage requests are the
incomplete requests SearXNG sends to startpage.com: this patch is a complete new
implementation of the ``request()`` function, reversed engineered from the
Startpage's search form. The new implementation:
- use traits of data_type: traits_v1 and drop deprecated data_type: supported_languages
- adds time-range support
- adds save-search support
- fix searxng/searxng/issues 1884
- fix searxng/searxng/issues 1081 --> improvements to avoid CAPTCHA
In preparation for more categories (News, Images, Videos ..) from Startpage, the
variable ``startpage_categ`` was set up. The default value is ``web`` and other
categories from Startpage are not yet implemented.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
BTW this fix an issue in wikipedia: SearXNG's locales zh-TW and zh-HK are now
using language `zh-classical` from wikipedia (and not `zh`).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
With the language and region tags from the EngineTraitsMap the handling of
SearXNG's tags of languages and regions has been normalized and is no longer
a *mystery*. The "languages" became "locales" that are supported by babel and
by this, the update_engine_traits.py can be simplified a lot.
Other code places can be simplified as well, but these simplifications
should (respectively can) only be done when none of the engines work with the
deprecated EngineTraits.supported_languages interface anymore.
This commit replaces searx.languages by searx.sxng_locales and fix the naming of
some names from "language" to "locale" (e.g. language_codes --> sxng_locales).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Implements a fetch_traits function for the Startpage engine.
.. note::
Does not include migration of the request methode from 'supported_languages'
to 'traits' (EngineTraits) object!
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
- fetch_traits(): Fetch languages from peertube's search-index source code.
[mod] Include migration of the request methode from 'supported_languages'
to 'traits' (EngineTraits) object.
[fix] old supported_languages_url is no longer valid since the sources
has been moved to a different path.
- fixed code to pass pylint
- request(): complete re-implementation based on the API docs [1]
- response(): complete re-implementation, adds serveral fields missed before
- add source code documentation
[1] https://docs.joinpeertube.org/api-rest-reference.html#tag/Search/operation/searchVideos
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Implementations of the *traits* of the engines.
Engine's traits are fetched from the origin engine and stored in a JSON file in
the *data folder*. Most often traits are languages and region codes and their
mapping from SearXNG's representation to the representation in the origin search
engine.
To load traits from the persistence::
searx.enginelib.traits.EngineTraitsMap.from_data()
For new traits new properties can be added to the class::
searx.enginelib.traits.EngineTraits
.. hint::
Implementation is downward compatible to the deprecated *supported_languages
method* from the vintage implementation.
The vintage code is tagged as *deprecated* an can be removed when all engines
has been ported to the *traits method*.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
When the user choose "Auto-detected", the choice remains on the following queries.
The detected language is displayed.
For example "Auto-detected (en)":
* the next query language is going to be auto detected
* for the current query, the detected language is English.
This replace the autodetect_search_language plugin.
- Add documentation to the plugin
- Harmonize FastText language model with SearXNG's language model
Reosurces::
import fasttext # --> +10 MB
fasttext.load_model(str(data_dir / 'lid.176.ftz')) # --> +4MB
Suggested-by: @dalf
- To speed up and simplify the deployment use fasttext-wheel instead of fasttext
- Building numpy on the Alpine Linux of docker-images takes ages --> install
py3-numpy from Alpines package manager (apk)
- Alpine Linux on docker-images (musl libc) do not support fasttext-wheel (gnu
libc) --> patch Dockerfile and build from fastetxt:
sed -i s/fasttext-wheel/fasttext/ requirements.txt
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
settings.yml:
* The default URL was unix:///usr/local/searxng-redis/run/redis.sock?db=0
* The default URL is now "false"
The default URL makes the log difficult to deal with:
if the admin didn't install a Redis instance, the logs record a false error.
It worked before because SearXNG initialized the Redis connection when the limiter started.
In this commit, SearXNG initializes Redis in searx/webapp.py
so various components can use Redis without taking care of the initialization step.
Most engines that support languages (and regions) use the Accept-Language from
the WEB browser to build a response that fits to the language (and region).
- add new engine option: send_accept_language_header
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This patch fixes a leftover from [#1548], the list of the plugins was not
up-to-date:
- HTTPS_rewrite has been removed (247c46c6b)
- DOAI_rewrite is named 'Open_Access_DOI_rewrite' (575159b)
[#1548] https://github.com/searxng/searxng/pull/1548
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This patch fixes the WARNING messages that pops up since Sphinx 5.x:
WARNING: extlinks: Sphinx-6.0 will require a caption string to contain
exactly one '%s' and all other '%' need to be escaped as '%%'.
[1] https://www.sphinx-doc.org/en/master/usage/extensions/extlinks.html
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
There are several reasons why we should prefer markdown-it-py over mistletoe:
- Get identical rendering results in SearXNG's `/info` pages and the SearXNG's
project documentation which is build by Sphinx-doc.
In the Sphinx-doc we use the MyST parser to render Markdown and the MyST
parser itself is built on top of the markdown-it-py package.
- markdown-it-py has a typographer that supports *replacements*
and *smartquotes* (e.g. em-dash, copyright, ellipsis, ...) [1]
- markdown-it-py is much more flexible compared to mistletoe [2]
- markdown-it-py is the fastest CommonMark compliant parser in python [3]
[1] https://markdown-it-py.readthedocs.io/en/latest/using.html#typographic-components
[2] https://markdown-it-py.readthedocs.io/en/latest/plugins.html
[3] https://markdown-it-py.readthedocs.io/en/latest/other.html#performance
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
With this patch ``searxng.msg`` files can be added to SearXNG. In
``searxng.msg`` files messages can be defined which are not captured by babel's
gettext, like the generic names of the categories or messages that are stored in
constants.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This patch implements a bolierplate to share content from info-pages of the
SearXNG instance (URL /info) with the project documentation (path /docs/user).
The info pages are using Markdown (CommonMark), to include them in the project
documentation (reST) the myst-parser [1] is used in the Sphinx-doc build chain.
If base_url is known (defined in settings.yml) links to the instance are also
inserted into the project documentation::
searxng_extra/docs_prebuild
[1] https://www.sphinx-doc.org/en/master/usage/markdown.html
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
can replace filtron:
* rate limite the number of request per IP and per (IP, User-Agent)
* block some bots
use Redis
data stored in Redis never contains the IP addresses, only HMAC using the secret_key
Co-authored-by: Markus Heiser <markus.heiser@darmarit.de>
The PluginStore is already initalized when the application is initalized
searx.plugins.initialize(application)
BTW: remove unneeded Flask import
Closes: https://github.com/searxng/searxng/issues/828
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
With Sphinx-doc update 4.4.0 we get some warnings about links that can be
replaced by already defined 'sphinx.ext.extlinks':
admin/engines/sql-engines.rst:144: WARNING: hardcoded link 'https://pypi.org/project/mysql-connector-python' could be replaced by an extlink (try using ':pypi:`mysql-connector-python`' instead)
docs/admin/installation-switch2ng.rst:10: WARNING: hardcoded link 'https://github.com/searxng/searxng/pull/446#issuecomment-954730358' could be replaced by an extlink (try using ':pull:`446#issuecomment-954730358`' instead)
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The ? search operator has been broken for some time and
currently only raises the question why it's still there.
## Context ##
The query "Paris !images" searches for "Paris" in the "images" category.
Once upon a time Searx supported "Paris ?images" to search for "Paris"
in the currently enabled categories and the "images" category.
The feature makes sense ... the ? syntax does not.
We will hopefully introduce a +!images syntax in the future.
Fixes#702.
* allow not to record metrics (response time, etc...)
* this commit doesn't change the UI. If the metrics are disabled
/stats and /stats/errors will return empty response.
in /preferences, the columns response time and reliability will be empty.
The tab icon names are currently hard coded in the templates.
This commit lets us introduce an icon property in the future, e.g:
categories_as_tabs:
general:
icon: search-outline
Add a redis connector, the default DB connector is a socket at::
unix:///usr/local/searxng-redis/run/redis.sock?db=0
To set up a redis instance simply use::
$ ./manage redis.build
$ sudo -H ./manage redis.install
A hint for developers:
To get access rights to this instance, your developer account needs to be added
to the *searxng-redis* group::
$ sudo -H ./manage redis.addgrp "${USER}"
# don't forget to logout & login to get member of group
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Stuff in folder searxng_extra/ is not suitable for normal users and should only
be used by developers.
The script searxng_extra/standalone_searx.py must not give the impression that
it improves privacy. [1]
[1] https://github.com/searxng/searxng/pull/651#issuecomment-1001389726
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Previously all categories were displayed as search engine tabs.
This commit changes that so that only the categories listed under
categories_as_tabs in settings.yml are displayed.
This lets us introduce more categories without cluttering up the UI.
Categories not displayed as tabs can still be searched with !bangs.
Previously the documentation grouped the engines by their first
category so e.g. YouTube and Invidious were only shown in the
in the videos section but not in the music section.
This commit fixes this by iterating over searx.engines.categories,
which also has the added benefit that the sections are now in the
same order as the tabs in the user interface.