Policy talk:User-Agent policy/Archive 4

From Wikimedia Foundation Governance Wiki

There is a proposal by the Analytics team to require that 'bots' add 'WikimediaBot' to their user-agent even for reads. See https://lists.wikimedia.org/pipermail/analytics/2016-January/004858.html John Vandenberg (talk) 10:32, 28 January 2016 (UTC)

Perhaps they should propose it on wikitech-l where people are likely to actually see it. Anomie (talk) 14:30, 28 January 2016 (UTC)
This is due to phab:T108599 titled "Communicate the WikimediaBot convention {hawk}". John Vandenberg (talk) 10:09, 21 March 2016 (UTC)


The following was added[1]:

"Also, if you run a bot or any automated system that could generate non-human traffic, please consider including the word "bot" (in any combination of lowercase or uppercase letters) in the User-Agent string. This will help Wikimedia's systems to better isolate human traffic and provide more accurate statistics."

I don't believe this has been discussed on wikitech-l yet. I think it has only been discussed on the analytics mailing list.

You're right, it was only discussed on the analytics mailing list. And thanks to your critiques and suggestions we could improve the amendment and make it a lot more flexible and innocuous. Marcel Ruiz Forns (talk) 12:02, 21 March 2016 (UTC)

How does adding 'bot' help over and above including email addresses and URLs in the User-Agent? Are there significant cases of human traffic browsers including email addresses and URLs in the User-Agent?

No, I don't think that there are cases of humans with such user-agents. Now looking at:
"If you run a bot, please send a User-Agent header identifying the bot with an identifier that isn't going to be confused with many other bots, and supplying some way of contacting you (e.g. a userpage on the local wiki, a userpage on a related wiki using interwiki linking syntax, a URI for a relevant external website, or an email address)..."
I understand that the policy asks the bot maintainers to add "some way" of contacting them, and some examples are given. I assume (given the use of: e.g.) that the example list is not exclusive, meaning they may also use other ways of contact info. Also, parsing the word bot is less error prone and cheaper than parsing long heterogeneous strings. Marcel Ruiz Forns (talk) 12:02, 21 March 2016 (UTC)
(I've responded below in a new section #More strict contact information). John Vandenberg (talk) 15:04, 21 March 2016 (UTC)
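To make the two signals in this exchange concrete, here is a minimal sketch comparing a case-insensitive "bot" check with a search for contact information. The regular expressions, function names, and sample User-Agent strings are illustrative assumptions, not the expressions used by Wikimedia's systems, and the email and URL patterns are deliberately simplistic.

```python
import re
from typing import List

# Illustrative patterns only (assumptions, far looser than the real standards).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s;)]+", re.IGNORECASE)
USERPAGE_RE = re.compile(r"\bUser:[^\s;)]+")


def has_bot_marker(user_agent: str) -> bool:
    """Return True if the string contains "bot" in any letter case."""
    return "bot" in user_agent.lower()


def find_contact_info(user_agent: str) -> List[str]:
    """Collect anything that looks like an email, URL, or userpage link."""
    found: List[str] = []
    for pattern in (EMAIL_RE, URL_RE, USERPAGE_RE):
        found.extend(pattern.findall(user_agent))
    return found


# Hypothetical User-Agent strings for illustration.
with_contact = "CoolTool/1.2 (https://example.org/cooltool; cooltool@example.org)"
only_marker = "SomeBot/0.1"

print(has_bot_marker(with_contact), find_contact_info(with_contact))
print(has_bot_marker(only_marker), find_contact_info(only_marker))
```

In this sketch the two checks are independent: the first sample carries contact details but no "bot" marker, and the second carries the marker but no contact details.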

Or, is adding 'bot' an alternative to including email addresses and URLs? John Vandenberg (talk) 21:34, 12 February 2016 (UTC)

Seconded. I didn't immediately revert the addition because the beginning of the page says "This page is purely informative" and "As of 2015, no user agent requirement is technically enforced in general". But this page is supposed to be very stable, hence things should be discussed first. Nemo 08:04, 15 February 2016 (UTC)
Adding bot is not intended to be an alternative or to replace the current policy at all. It is only intended to add an optional way bot maintainers can help us. BTW, thanks for not reverting the addition. As I commented above, this theme was discussed in the analytics mailing list [1][2]. You both participated in the thread and gave very useful points of view, thanks again. Marcel Ruiz Forns (talk) 12:02, 21 March 2016 (UTC)
Can you appreciate that the addition of "bot" can be seen as an alternative? If a client adds "bot" instead of providing contact information, they make "analytics" happy but do not make the client compliant with the pre-existing policy. If a client adds contact information, as is good behaviour for bots, then both the analytics and ops folks can be happy. I would hope analytics is striving to support compliance with the existing policy provisions, rather than creating their own provisions.
I understand that it can be interpreted that way, but it doesn't need to be like this. The amendment currently starts with 'Also, ...' which conveys addition to what came before. But we can change that to make it more explicit, like 'In addition to that, ...' if you like. On the other hand, the analytical jobs that process pageview stats already implement the User-Agent policy here. This code is in production. Marcel Ruiz Forns (talk) 17:50, 21 March 2016 (UTC)
If the analytics team thought the current policy wasn't good enough, they should be spear-heading improvements. Instead, analytics has this "bot" solution that they refuse to let go of. (It seems that the analytics team came up with the "bot" solution with effectively no awareness of this pre-existing policy.)
I'm sure nobody in Analytics thinks the current policy isn't good enough. And I also think we actually proposed an improvement. Regarding the awareness of the policy, you're right as far as it concerns me: I didn't know of it. I got to know it when I assigned that task to myself and started researching. No team or person is all-knowing. And regarding refusing to let go of the "bot" solution, we're just trying to implement what we think is a good option. You are obviously against it, but other people have seen this as valuable. Anyway I will present the team with your strong objections. Marcel Ruiz Forns (talk) 17:50, 21 March 2016 (UTC)
The suggestion that "Also, parsing the word bot is less error prone and cheaper than parsing long heterogeneous strings." is, as I understand it, irrelevant and frankly feels disingenuous. The analytics team has already agreed that it should implement the existing policy, which means an email or URL should be parsed whenever it is present, and I assume that the analytics implementation is of high quality, at least correctly detecting any email address or URL following the relevant standards. (The most complicated part of this is the encoding of the User-Agent field, which is effectively w:ISO 8859, whereas email and URL syntax have different encoding standards.) So if analytics needs to build reliable parsing of emails and URLs from the User-Agent, having to also parse "bot" is more work and more code complexity.
John Vandenberg (talk) 15:04, 21 March 2016 (UTC)
As mentioned in the wikitech-l thread, the code that parses bot has been in production for a long time, because it detects lots of external bots (this convention is widely used). Also, the regular expression that tags bots starts with the word 'bot' because it is so common: when it matches, no further evaluation of the expression is needed, which makes the parsing more efficient. Marcel Ruiz Forns (talk) 17:50, 21 March 2016 (UTC)
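The ordering argument can be sketched with a small alternation. The pattern below is a hypothetical stand-in, not the production expression; it only shows that Python's `re` module tries alternatives from left to right at each position, so listing the most common marker first lets a match finish without testing the others. The 'User:' alternative is included because it is mentioned later in this thread as another string the tagging looks for.

```python
import re

# Hypothetical tagging pattern (an assumption for illustration).
# "bot" comes first, so it is the first alternative tried at each position.
NON_HUMAN_RE = re.compile(r"bot|spider|crawler|User:", re.IGNORECASE)


def tag_non_human(user_agent: str):
    """Return the substring that triggered the tag, or None if nothing matched."""
    match = NON_HUMAN_RE.search(user_agent)
    return match.group(0) if match else None


# Sample strings are made up for this sketch.
print(tag_non_human("ExampleBot/2.0 (https://example.org)"))
print(tag_non_human("WikiReader/1.0 (User:Example)"))
print(tag_non_human("Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"))
```

Whether this ordering gives a measurable speed-up depends on the regex engine and the traffic mix, so the efficiency claim should be read as the commenter's, not as something this sketch demonstrates.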
The Analytics team initially decided and implemented "WikimediaBot". You thought that was a good option too, but it was conceptually wrong. Your phrasing suggests this is a vote, where all opinions are counted, no matter how silly. I am providing you with technical facts that your team has somehow ignored in their "research". I am one of the people who needs to be involved in implementing this "improvement", otherwise Pywikibot will be given a bad client evaluation.
The current decision to use "bot" also appears to be problematic if you are trying to distinguish between human and non-human consumption. As can be seen at https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/test/resources/isSpider_test_data.csv (and the Java code you refer to), all user-agents containing 'bot' or 'User:' are considered to be a 'spider' (web crawler). The convention of using 'w:bot' in the first product component of the w:user-agent is a high scoring heuristic that the user-agent is not a general purpose Mozilla compatible web-browser, and is a spider/w:web crawler. However mw:Pywikibot is not a spider/web crawler. So stopping parsing on the string 'bot' is currently incorrect. While it might be uncomfortable for Pywikibot devs and users, if Pywikibot is the only significant anomaly to this heuristic, maybe we should rename Pywikibot to Pywiki. That name is available https://pypi.python.org/pypi/pywiki \o/ , and luckily a new major release needs to be pushed out soonish phab:T130565. But before making that large breaking change, it would be nice to know whether there are other API clients that also use 'bot' in the product name and are not web crawlers. Looking at https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/test/resources/isSpider_test_data.csv, my limited familiarisation with some of those API client frameworks is that they are also not web crawlers (esp. the very recognisable DotNetWikiBot). In which case, using the word 'bot' as a 100% hit for "non-human" will almost certainly never be right as it requires too many external products to be modified. At best you will need to have special treatment for many API clients (and the list of 'bot' will grow over time), and then the question remains: how do we distinguish between human and non-human consumption, and the answer is not 'bot' in the user-agent. (p.s. could you reply below rather than interleaved within my comment; it is getting hard to follow the nesting.) John Vandenberg (talk) 01:20, 22 March 2016 (UTC)
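The point about the first product component can be illustrated with a rough sketch. The pattern below is a simplified assumption about the "name/version" product syntax of the User-Agent header, and the version numbers and userpage in the sample strings are made up; only the framework names come from the comment above.

```python
import re

# Simplified product token: a name, optionally followed by "/version".
# This is an approximation for illustration, not a full header grammar.
PRODUCT_RE = re.compile(r"^\s*([^\s/()]+)(?:/([^\s()]+))?")


def first_product(user_agent: str):
    """Return the name of the first product token, or None if there is none."""
    match = PRODUCT_RE.match(user_agent)
    return match.group(1) if match else None


def bot_in_first_product(user_agent: str) -> bool:
    """Heuristic: does the first product name contain "bot" in any case?"""
    product = first_product(user_agent)
    return product is not None and "bot" in product.lower()


# Hypothetical User-Agent strings; versions and userpage are invented.
samples = [
    "Pywikibot/3.0 (User:Example)",
    "DotNetWikiBot/3.15",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
]
for ua in samples:
    print(first_product(ua), bot_in_first_product(ua))
```

Both framework strings trip the heuristic because of their product names alone, which is the concern raised here: the check says nothing about whether a person is driving the client interactively.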
Agree: Initially, the Analytics team had a proposal that was proven sub-optimal in the prior discussion. And it was changed to be less intrusive and totally optional because of that. I believe you were involved in the design and helped a lot. Regarding Pywikibot's client evaluation: I think no bot should be given a bad client evaluation for not following the amendment, because it is optional.
The jobs that process pageview stats do not distinguish between bots and spiders/crawlers yet. We tried to do this in the past, but it is a very difficult distinction to make. For now, they just try to tag everything that is non-human. So adding bot to the user-agent wouldn't mean the gadget is a spider. Marcel Ruiz Forns (talk) 08:16, 22 March 2016 (UTC)
The bot contact details are also optional, however they were part of the 'gold standard'. All 'optional' parts of this policy would quite rightly be necessary for a bot framework to pass something called a 'gold standard', don't you agree?
You've not addressed the issue that Pywikibot (and other frameworks) always includes the string 'bot' in the user-agent, so if 'bot' is deemed a 100% match for 'non-human', these frameworks have no way to indicate to Analytics that the API usage is for human consumption.
You've also not given any reason why someone who can modify the user-agent should add 'bot' instead of adding an email, URL, or some other form of parse-able contact information, which you've indicated above is highly indicative of automated user-agents.
As far as I can see 'bot' adds nothing to this policy for the simple case of 100% automated agents, where parsable contact details would be better, and creates a new problem in the policy for any attempt to actually solve the hard problem of allowing accurate classification of non-100% automated agents. John Vandenberg (talk) 23:41, 22 March 2016 (UTC)

Alternative phrasing

Earlier in the above conversation, Marcel indicated that their team was open to improved phrasing, such as starting it with 'In addition to [bot contact details], ...'.

IMO this does not go far enough, as no reason has been put forward (yet) to justify using the string 'bot' in addition to providing bot contact details. However, as Marcel points out, the string 'bot' is being used in user-agents that do not provide contact details. If the intention is now only to indicate that the addition of 'bot' is an existing Internet-wide convention recognised and used by Wikimedia, but is a poor substitute for parse-able contact information, then perhaps we can find common ground by changing how it is phrased accordingly. So perhaps the Analytics team might find something like the following acceptable.

If you run an automated agent, please follow the Internet-wide convention of including the string "bot" in the User-Agent string, in any combination of lowercase or uppercase letters. This is recognised by Wikimedia's systems, and used to classify traffic and provide more accurate statistics.

John Vandenberg (talk) 23:41, 22 March 2016 (UTC)

This alternative phrasing would be perfectly acceptable for Analytics. Thanks for proposing this. It could even be more optional, like: "..., please consider following the Internet-wide ...". Marcel Ruiz Forns (talk) 08:08, 23 March 2016 (UTC)
I modified the amendment following your suggestion. Please, feel free to modify it further. Marcel Ruiz Forns (talk) 05:41, 28 March 2016 (UTC)

Client evaluation

Worth noting, compliance with this policy was part of the mw:Evaluating and Improving MediaWiki web API client libraries by User:Fhocutt in 2014.

The client evaluation caused Pywikibot to add support for a customised user-agent, with the default user-agent automatically adding the username; see mw:Manual:Pywikibot/User-agent. John Vandenberg (talk) 10:24, 21 March 2016 (UTC)

Python code example

Can a Python code example be added? I'm violating the policy, but I don't know what I need to add to stop violating it. 08:23, 8 April 2022 (UTC)

I added a Python example using Requests. Hope this helps! APaskulin (WMF) (talk) 00:46, 9 April 2022 (UTC)
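The example added to the policy page is not reproduced in this archive. As a rough standard-library sketch of the same idea, the code below builds a descriptive User-Agent string with contact details and attaches it to a request without sending it. The tool name, version, contact URL, email address, and helper function names are all placeholder assumptions.

```python
from urllib.request import Request


def build_user_agent(tool: str, version: str, contact_url: str, contact_email: str) -> str:
    """Compose "<tool>/<version> (<contact URL>; <contact email>)"."""
    return f"{tool}/{version} ({contact_url}; {contact_email})"


def build_api_request(url: str, user_agent: str) -> Request:
    """Create a request object carrying the custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder identity; replace with your own tool name and contact details.
ua = build_user_agent("CoolBot", "0.0", "https://example.org/coolbot/", "coolbot@example.org")
req = build_api_request(
    "https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json", ua
)

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
# To actually send it: urllib.request.urlopen(req)
```

With the Requests library the same header can be passed as `headers={"User-Agent": ua}` to `requests.get`.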

API access through browser fetch()

How should API access through browser fetch() calls comply with the user-agent policy? It seems this can only be done for applications that call the API from a server. I am writing an alternative interface for Wikipedia through the API calls (with all the other performance items listed in mind of course, such as setting fetch limits). Thanks! - Preceding unsigned comment added by (talk) 09:49, 4 April 2019 (UTC)