Data retention guidelines

Added exception for page views investigation

The Privacy team has temporarily extended the retention period for two datasets so that the Data Engineering team can investigate the impact of a technical issue with data collection. Between June 4, 2021 and January 27, 2022, some of the Foundation’s caching nodes stopped collecting web traffic data (see the Phabricator task for more details). This resulted in data loss for web requests and the derived pageviews, which affects the Foundation’s ability to correctly report on Wikimedia pageviews and fundraising banner impressions.

The Data Engineering team required a temporary extension to the usual 90-day retention period in order to better estimate what data was not collected and which projects and geographies were most affected. The wmf.pageview_actor dataset is being used to estimate the data loss for pageviews, and the wmf.webrequest dataset is being used to estimate the data loss for fundraising banners. Information from both datasets is required because webrequest data for visited banners is not reported as pageviews. Deletion of these datasets was paused on February 16, 2022 and will resume by March 18, 2022.

If you have questions or concerns, please reach out. If you are interested in a meeting to discuss this exception and investigation, please sign up below and we will contact you with details. MMoss (WMF) (talk) 19:51, 11 March 2022 (UTC)

Definition of "public information": the IP address, really?


Some examples of "public information" would include: (a) your IP address, if you edit without logging in;

This is insufficient (and it was a problem for all Wikimedia projects, one that is currently being solved, because many users unfortunately made edits without being properly connected and did not notice it immediately; such disconnections have often occurred for various technical reasons and were not always prominently displayed, revealing the IP address behind a permanent account that was protecting the user's privacy), and in fact this statement may now be false (as it is now illegal in various jurisdictions to make IP addresses visible to the public, and the WMF could have been liable for violations, or subject to termination orders or to banning on some networks that must respect privacy laws, especially in the EU and the EEA, or in California where the WMF is located). You should add this clarification:

(if your access has still not been anonymized from public view).

The IP accounts visible in the public list of users or in page histories should soon be replaced by anonymized temporary accounts. IP addresses will then be visible only to specific users who have CheckUser access rights and who contractually agree to strictly follow its usage policy (in addition to the general Privacy policy).

We have to accept the fact that this was not the case in the past and that there exist archives elsewhere (including in old database dumps published by the WMF) where such anonymization will not be possible, as they are now out of control. But "IP user" accounts should never be used now and should disappear from all categories, and former links to their user pages or talk pages should be processed by some admin tool that will associate them with as many anonymized accounts as needed (respecting the "temporary period"), to avoid capturing and keeping information over overly long periods of time. Such a bot will then need the permission to create temporary accounts and to backdate them to match the dates found in the edit histories with which these IP user accounts were associated.

You should inform users that the WMF will make every effort to disable existing public views of such data that it currently hosts, but that past public records are now out of its control and the WMF is unable to guarantee that other parties won't use the revealed IP addresses (they could be able to do that legally in various jurisdictions that do not protect privacy, especially for users who are not located in or connected from that jurisdiction and who don't benefit from its local legal protection). However, the WMF should be clear that these IP addresses were hidden from the public in order to comply with privacy laws, and that if other parties use such data, they do so at their own liability, because the WMF does not endorse or approve such use of private data (including all those listed in these guidelines) by third parties, even if these data were published under a Free licence (which mostly covers copyright and correct user attribution, but does not void any law related to the protection of privacy, independently of what the licence grants). -- Verdy p (talk) 13:10, 17 June 2023 (UTC)