Legal:Wikimedia Foundation EU Compliance/DSA Article 40(4) Data Catalogue
About
This page is the Data Catalogue relating to Wikipedia, for the purposes of EU DSA Article 40(4) and Article 6(4) of the EU Delegated Act on Research Access, C(2025)4340 (“DARA”). These provisions govern access by vetted researchers to nonpublic data for the sole purpose of conducting research that contributes to the detection, identification and understanding of systemic risks in the EU/EEA, and to the assessment of the adequacy, efficiency and impacts of those risks’ mitigations, for Wikipedia.
The Wikimedia Foundation’s designated point of contact for these purposes can be reached at eu-dsa-art-40-4-contact (at) wikimedia (dot) org. The EU’s Data Access Portal and help pages can be found at https://data-access.dsa.ec.europa.eu/home .
The Wikimedia Foundation prides itself on the very high degree of open data and tooling already available for researchers, and we also welcome voluntary collaborations with our own Research team for more advanced projects.
Before you go through the process set up by the DSA and DARA, we therefore strongly encourage you to examine the publicly available data, or to contact the Wikimedia Foundation’s Research team. If your research concerns artificial intelligence or machine learning models, we are pleased to say that, at the time of writing, WMF has made all training datasets that are available to the organization publicly available. The datasets are linked from the corresponding model cards.
If you do make a data access request through the DSA Art. 40(4)/DARA procedure, please ensure that it is Wikipedia- and EU-specific. We further ask that you limit your requests to those that support research questions into systemic risks, as envisaged by DSA Art. 40(4).
We further encourage the following:
- Create a Meta:Research page about your project, as described here: https://meta.wikimedia.org/wiki/Research:Projects
- Upon completion of your research project, publish your output as described in https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_Open_Access_Policy
- Familiarize yourselves with our guidance around Research and Privacy on Wikipedia: https://osf.io/preprints/osf/uyxnf_v1
- Join the public Wikimedia research mailing list: https://lists.wikimedia.org/postorius/lists/wiki-research-l.lists.wikimedia.org/
Data Catalogue
MediaWiki Content History
When a MediaWiki page is edited (for example, on Wikipedia), the previous version(s) of that page (its “revisions”) can normally still be viewed, allowing the page’s evolution (e.g., its moderation) to be publicly audited over time. Under special circumstances, according to policies often documented for the specific Wikipedia language, Wikipedia communities may disable access to (i.e., “remove”; depending on the circumstances, this is sometimes also called “revdel”, “oversight” or “suppress”) specific revisions in the page history. This ensures that neither the current version of the page nor the Page History function is publicly disseminating inappropriate content. For example, if a young person recklessly posts their own telephone number to a page, it would be conventional for other users not only to remove it from the page, but also to remove the older page revisions that contain it. The Wikimedia Foundation, as the website host/platform operator, may also do this, pursuant to our Office Actions policy.
The functionality allowing this is called RevisionDelete. Once a revision is removed, it is neither generally visible on Wikipedia, nor included in subsequent (post-removal) public Wikipedia content history XML dumps.
However, the private copy of MediaWiki Content History includes the majority of the removed information. This includes the removed revision, together with the corresponding editor username and the edit summary they provided when originally posting the edit which created that revision.
Data structure and metadata
Please refer to the public documentation for the schema.
Suggested access modalities
For privacy and confidentiality reasons, direct researcher access might not be possible. We request researchers, so far as possible, to supply the exact query to be run.
Additional remarks
- MediaWiki Content History data is not guaranteed to contain all removed revisions. It is generated from an event-based, eventually consistent system that retrieves revision information from the public Wikipedia APIs after a revision has been created. Depending on nondeterministic factors such as event ordering or transient infrastructure errors, a revision may be processed before or after it has been removed, and in the latter case, its contents will not be preserved in the MediaWiki Content History private data.
- MediaWiki Content History includes a large amount of data that is already publicly available. Researchers are invited to rely as far as possible on that public data, and should not request access to it using the DSA Article 40(4) mechanism.
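The ordering effect described in the first remark above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the actual pipeline: the function `replay`, the event kinds (`created`, `removed`, `processed`) and the rule that a removed revision yields no content are all hypothetical and chosen only to show why processing order matters.

```python
from typing import Optional


def replay(events: list[tuple[str, int]], contents: dict[int, str]) -> dict[int, Optional[str]]:
    """Replay hypothetical events in the order they are handled.

    events   -- ordered (kind, revision_id) pairs; kind is "created",
                "removed" or "processed" (all names are illustrative).
    contents -- revision_id -> text as originally posted.
    Returns the toy "private copy": revision_id -> preserved text or None.
    """
    publicly_visible: set[int] = set()
    private_copy: dict[int, Optional[str]] = {}
    for kind, rev_id in events:
        if kind == "created":
            publicly_visible.add(rev_id)
        elif kind == "removed":
            publicly_visible.discard(rev_id)
        elif kind == "processed":
            # Assumption: the consumer reads through the public view, so it
            # can only preserve content that is still visible at this moment.
            if rev_id in publicly_visible:
                private_copy[rev_id] = contents[rev_id]
            else:
                private_copy[rev_id] = None
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return private_copy


contents = {1: "text of revision 1"}
# Same three events, two different processing orders.
processed_first = replay([("created", 1), ("processed", 1), ("removed", 1)], contents)
removed_first = replay([("created", 1), ("removed", 1), ("processed", 1)], contents)
print(processed_first)  # content preserved
print(removed_first)    # content missing
```

In this model the only difference between the two runs is whether the revision is processed before or after its removal, which is why researchers should not assume that every removed revision is present in the private data.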
Editors Daily
The location from which an edit to Wikipedia was made, whether exact or approximate, is not publicly available. Editors Daily is an internal dataset, updated monthly, that contains this more sensitive individual information. The location in this dataset is estimated via IP-based geolocation.
Data structure and metadata
Please see the public documentation for the schema. In line with our data retention policy, Editors Daily may only contain information for the past two months, as historical information is continuously removed.
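When planning a request, it may help to check whether the period of interest can still be covered given the retention noted above. The following is a minimal sketch, assuming that "the past two months" is counted in whole calendar months back from the current date; that interpretation, the constant `RETENTION_MONTHS` and both function names are assumptions made for illustration, not a statement of how the retention policy is implemented.

```python
from datetime import date
from typing import Optional

RETENTION_MONTHS = 2  # assumption: derived from "the past two months" in the text


def first_of_month_before(day: date, months: int) -> date:
    """Return the first day of the month `months` calendar months before `day`."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def clip_to_retention(start: date, end: date, today: date) -> Optional[tuple[date, date]]:
    """Clip a requested date range to the assumed retention window.

    Returns the overlapping (start, end) range, or None if no part of the
    requested range falls inside the window.
    """
    earliest = first_of_month_before(today, RETENTION_MONTHS)
    clipped_start = max(start, earliest)
    clipped_end = min(end, today)
    if clipped_start > clipped_end:
        return None
    return clipped_start, clipped_end


today = date(2025, 8, 15)
print(clip_to_retention(date(2025, 1, 1), date(2025, 12, 31), today))
print(clip_to_retention(date(2024, 1, 1), date(2024, 12, 31), today))
```

A result of `None` suggests the requested period has probably aged out of Editors Daily, in which case the public datasets listed under Additional remarks may be the only option.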
Suggested access modalities
For privacy and confidentiality reasons, direct researcher access might not be possible. We request researchers, so far as possible, to supply the exact query to be run.
Additional remarks
- The Geoeditors public data dump offers monthly aggregate editor activity per country.
- Historical editor activity information, not including geolocation, is available in the MediaWiki History public dataset.
Zendesk Support Ticket data
People contact us for help (e.g. to report illegal content under DSA Article 16), and authorities contact us (for DSA Article 9 and 10 purposes), by emailing us. We use the Zendesk Support ticketing system to handle these emails; each new email chain is created as a “ticket” on our system where it can be triaged, handled and tagged. Data about these tickets is used to compile our periodic Transparency Reports.
Data structure and metadata
Please refer to https://support.zendesk.com/hc/en-us/articles/4408827693594-Metrics-and-attributes-for-Zendesk-Support
As noted above, we have different reporting channels for different types of matters; see in particular those mentioned on this page.
Tickets may also have custom fields (which vary according to reporting channel/ticket purpose, and may evolve over time); please enquire in order to confirm what relevant ticket fields may be available for your research.
Suggested access modalities
It is possible to run queries and export aggregate statistics using the Zendesk Explore tool, through the creation of custom “reports”.
The documentation here explains what the Explore tool can do: https://support.zendesk.com/hc/en-us/search?content_tags=01H41B6Y9VDNEGDFSDQZGESE9F&%3Butf8=%E2%9C%93 .
For privacy/confidentiality and licensing limitation reasons, direct researcher access to the Zendesk Explore tool might not be possible. Where reasonable, and subject to confirmation, the Wikimedia Foundation may instead be able to generate Zendesk Explore reports (i.e., run Explore queries) on researchers’ behalf. Researchers should so far as possible supply the exact query to be run; see https://support.zendesk.com/hc/en-us/articles/4408845804314-Formula-writing-resources
Additional remarks
- The Wikimedia Foundation already produces detailed 6-monthly reports containing these sorts of statistics: see https://wikimediafoundation.org/who-we-are/transparency/ . Their production includes a manual review/cleanup exercise to ensure tickets handled during the relevant reporting period (the previous 6 months) are appropriately tagged/classified and therefore reported accordingly. Tickets that have not yet been through that exercise may have less reliable tagging. Researchers are encouraged to confine their analysis to tickets that have been through this review process. For example, if at the time of your research, our most recent published Transparency Report was for the January-June 2025 period, we advise you to limit your research to tickets that were handled up to June 30, 2025.
- Please do not request access to sensitive data, e.g., (i) the content of freetext fields (including the content of tickets themselves), or (ii) data identifying the correspondents and handlers of a ticket (e.g., who submitted it, and which WMF staff members handled it).
- To achieve EU specificity, we recommend limiting analysis to tickets that have an EU Member State listed as the relevant country (if we have been able to determine/estimate this).
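The three remarks above amount to a simple selection rule: keep only tickets handled up to the end of the latest reported period, keep only tickets attributed to an EU Member State, and output aggregates only. The sketch below expresses that rule in Python rather than as a Zendesk Explore query. The ticket records and the field names `handled_on` and `country` are hypothetical; the actual ticket fields should be confirmed with the Wikimedia Foundation.

```python
from collections import Counter
from datetime import date

# ISO 3166-1 alpha-2 codes of the 27 EU Member States.
EU_MEMBER_STATES = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE",
})


def eu_ticket_counts(tickets: list[dict], cutoff: date) -> dict[str, int]:
    """Count tickets per EU Member State, handled on or before `cutoff`.

    Only aggregate counts are returned; no free text or identities are read.
    """
    counts: Counter = Counter()
    for ticket in tickets:
        country = ticket.get("country")  # may be None if it could not be determined
        if country in EU_MEMBER_STATES and ticket["handled_on"] <= cutoff:
            counts[country] += 1
    return dict(counts)


# Hypothetical records, for illustration only.
sample = [
    {"handled_on": date(2025, 3, 1), "country": "DE"},
    {"handled_on": date(2025, 7, 9), "country": "DE"},   # after the cutoff
    {"handled_on": date(2025, 5, 20), "country": "FR"},
    {"handled_on": date(2025, 2, 2), "country": "US"},   # not an EU Member State
    {"handled_on": date(2025, 4, 4), "country": None},   # country undetermined
]
print(eu_ticket_counts(sample, cutoff=date(2025, 6, 30)))
```

The cutoff of June 30, 2025 mirrors the example given in the first remark; in practice it would be the end date of the most recent published Transparency Report.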
Disclaimer
The Wikimedia Foundation does not warrant the constant availability or correctness of the data listed on this page, or the good functioning of its access modalities, nor does the Foundation make any representations concerning the lawfulness of third parties’ access to it as a matter of applicable laws around the world. The Wikimedia Foundation Terms of Use apply. Please note, in particular (but without limitation), the ToU’s “Disclaimers” and “Limitation of Liability” sections. The Wikimedia Foundation is a non-profit organization. All costs of access (such as the purchasing of additional licenses to use non-free tools) are to be borne by those demanding the access.