
Legal:Data Collection Guidelines

From Wikimedia Foundation Governance Wiki

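Privacy is central to how communities contribute to Wikimedia projects, and upholding this right is a central part of the Wikimedia Foundation's human rights commitment. These Data Collection Guidelines outline best practices for the Wikimedia Foundation to manage privacy risk in data collection. They complement the Wikimedia Foundation's data retention and data publication guidelines, and guide the Foundation's handling of potentially sensitive data throughout its lifecycle. Together, these guidelines help us fulfill the commitment to protecting user data set out in the privacy policy.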

The breadth of what constitutes data collection can vary widely as many teams at the Foundation engage in some kind of data collection behavior. To provide guidance in meaningfully evaluating a potential data collection activity, we primarily look to understand information pertaining to five general categories:

  • Data subjects (e.g. readers, editors, app users, donors)
  • Data senders (e.g. WMF tools like a browser, app, or extension; or third-party software providers)
  • Data recipients (e.g. WMF, WME, affiliates, third-party software providers, the public)
  • Type of data (e.g. user account information, page information, telemetry data, demographic information, attitudinal or behavioral information, geographic information, event information)
  • Data usage and changes to data usage (e.g. published in raw format, published anonymously, not published; de-identified, aggregated, and kept in perpetuity)

The following Data Collection Risk Tiering Grid presents those categories as criteria to help staff assess the risk tier of their data collection activity.

Data collection risk tiering grid

Low risk criteria
  • The data subject is subject to an applicable WMF Privacy Policy;
  • The data sender is subject to an applicable WMF Privacy Policy;
  • The recipient of the data is WMF, or a WMF-approved third-party software provider that does not use cookies;
    • Note: if the third-party software provider uses cookies or other client-side storage, the activity immediately becomes medium or high risk
  • The data will be kept for a typical retention period and then deleted, aggregated, or de-identified and sanitized;
  • The data collected does not include:
    • multiple items of unhashed personal information[1]
    • personal information + username/user ID or app ID
    • long-term viewing history[2] + unique ID[3]
    • granular geographic data[4] + unique ID[3]
    • sensitive data[5]
Tier 1: High risk
  • Risk level: Data that could certainly expose data subjects or recipients to risk of harm.
  • Criteria: The data collected is ongoing[6] and fails TWO OR MORE of the low risk criteria, OR the data collected is one-off[7] and fails THREE OR MORE of the low risk criteria.
  • Response time goal: 3 work weeks
  • Expected % of requests (internal metric): 15%

Tier 2: Medium risk
  • Risk level: Data that could likely or possibly expose data subjects or recipients to risk of harm.
  • Criteria: The data collected is ongoing[6] and fails ONE of the low risk criteria, OR the data collected is one-off[7] and fails TWO of the low risk criteria.
  • Response time goal: 5 work days
  • Expected % of requests (internal metric): 35%

Tier 3: Low risk
  • Risk level: Data that is unlikely to expose data subjects or recipients to risk of harm.
  • Criteria: The data collected is ongoing[6] and fails ZERO of the low risk criteria, OR the data collected is one-off[7] and fails ONE OR ZERO of the low risk criteria. The single criterion failed cannot be collecting sensitive data.
  • Response time goal: N/A
  • Expected % of requests (internal metric): 50%
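
The tiering rules in the grid can be sketched as a small function. This is an illustrative sketch only, not an official WMF tool: the names `Tier`, `assess_tier`, `failed_criteria`, and `fails_on_sensitive_data` are hypothetical. It assumes failures are counted against the five top-level low risk criteria. The grid does not say which tier applies to a one-off collection whose only failed criterion is sensitive data; the sketch assumes medium risk, because the grid rules out low risk and the high risk threshold is not reached.

```python
from enum import Enum


class Tier(Enum):
    HIGH = "Tier 1: High risk"
    MEDIUM = "Tier 2: Medium risk"
    LOW = "Tier 3: Low risk"


def assess_tier(ongoing: bool, failed_criteria: int,
                fails_on_sensitive_data: bool = False) -> Tier:
    """Map a data collection activity to a risk tier.

    ongoing: True for ongoing collection, False for one-off collection.
    failed_criteria: number of low risk criteria the activity fails.
    fails_on_sensitive_data: True if one of the failures is collecting
        sensitive data (only matters for one-off collection).
    """
    if failed_criteria < 0:
        raise ValueError("failed_criteria cannot be negative")

    if ongoing:
        if failed_criteria >= 2:
            return Tier.HIGH
        if failed_criteria == 1:
            return Tier.MEDIUM
        return Tier.LOW

    # One-off collection tolerates one more failed criterion per tier.
    if failed_criteria >= 3:
        return Tier.HIGH
    if failed_criteria == 2:
        return Tier.MEDIUM
    if failed_criteria == 1 and fails_on_sensitive_data:
        # Assumption: the grid only says this case cannot be low risk.
        return Tier.MEDIUM
    return Tier.LOW
```

For example, `assess_tier(ongoing=True, failed_criteria=1)` yields `Tier.MEDIUM`, while the same single failure on a one-off survey yields `Tier.LOW` unless the failure is sensitive data.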
What should WMF teams do next?
Things to do for all risk tiers
  • Once you have assessed your tier of risk using this tiering grid, log data collection activity in the data collection activity log form.
  • If you decide later to use the data obtained for a new purpose, please reassess your tier of risk using the tiering grid and submit a new data collection activity log form.
Additional things to do depending on your data collection activity and risk tier
  • For surveys: Fill out the survey privacy statement to supplement your data collection activity log form.
  • For all other data collection activities, Tier 1 (high risk): Submit the data collection activity through the L3SC request form to supplement your data collection activity log form, for review by Privacy Engineering and Privacy Legal (+ other teams if needed). Reviewers will suggest mitigation measures to make it low or medium risk. During the L3SC process, the reviewers will request approval of the data collection activity from a director or higher of the team that owns the data collection activity in order to proceed with high-risk collection activities.
  • For all other data collection activities, Tier 2 (medium risk): Submit the data collection activity through the L3SC request form for review by Privacy Engineering and Privacy Legal (+ other teams if needed). Reviewers will suggest mitigation measures to make it low risk. During the L3SC process, reviewers will request approval of the data collection activity from the engineering manager of the team that owns the data collection activity in order to proceed with medium-risk collection activities.
  • For all other data collection activities, Tier 3 (low risk): No additional review by Privacy Engineering or Privacy Legal is necessary.

Recurring or changes to existing data collection activities

If a data collection activity is recurring,[8] subsequent reviews will be of a known risk and will require less stringent review standards. For example:

  • A high risk one-off survey in the first quarter would be deemed a known high risk (faster response and decision cadence) in later quarters if the information collected is the same.
  • A medium risk ongoing data collection activity on iOS would be deemed a known medium risk (only requiring entry into the log form) if an identical schema had already been reviewed for Android.

Proposed changes to an existing ongoing data collection activity should be treated as a change in the type of data collected, and therefore as a new entry in the data collection activity log form and a new data collection activity to review.

Mitigations

Here is a list of example mitigation measures you can take to lower the risk of your data collection activity:

  • Because it is trivially easy for a bad actor to derive granular geographic data from a full IP address, for the purposes of these guidelines a complete IP address is considered both to be a unique identifier[3] and to leak granular geographic data. Collecting IP addresses is therefore a medium risk data collection activity. Relevant mitigations include:
    • dropping the last two octets of IP addresses (e.g. 192.168.xxx.xxx)
    • hashing IP address + user-agent (similarly to actor signature)
  • For circumstances in which granular geographic data is critical, consider collecting sub-national geographic data and then dropping all unique IDs.
  • To collect riskier unique IDs (like IP address) and maintain a low-risk status, it may be necessary to hash them.
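
The two IP address mitigations above can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not the Foundation's implementation: the function names `truncate_ipv4` and `hash_ip_and_user_agent` are hypothetical, the keyed SHA-256 (HMAC) construction and the `secret_key` parameter are assumptions chosen for illustration, and the truncation handles IPv4 only because the guideline speaks of octets.

```python
import hashlib
import hmac
import ipaddress


def truncate_ipv4(ip: str) -> str:
    """Drop the last two octets of an IPv4 address, e.g. 192.168.xxx.xxx."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    first, second, _, _ = str(addr).split(".")
    return f"{first}.{second}.xxx.xxx"


def hash_ip_and_user_agent(ip: str, user_agent: str, secret_key: bytes) -> str:
    """Return a keyed hash of IP address + user-agent as a hex string.

    Assumption: HMAC-SHA-256 with a secret key; the guidelines do not
    name a specific hash function.
    """
    message = f"{ip}|{user_agent}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

Note that, under the definition of unique identifier in these guidelines, the hashed value is still a unique ID, so hashing lowers risk but does not make the data anonymous on its own.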

Definitions

  1. Personal information: (from the Wikimedia Foundation Privacy Policy): Information you provide us or information we collect that could be used to personally identify you. To be clear, while we do not necessarily collect all of the following types of information, we consider at least the following to be "personal information" if it is otherwise nonpublic and can be used to identify you:
    1. your real name, address, phone number, email address, password, identification number on government-issued ID, IP address, user-agent information, payment account number;
    2. when associated with one of the items in subsection (1), any sensitive data such as date of birth, gender, sexual orientation, racial or ethnic origins, marital or familial status, medical conditions or disabilities, political affiliation, and religion.
  2. Long-term viewing history data: Data that logs pageview histories >90 days for logged-out users or >1 pageview for logged-in users.
  3. Unique identifier (ID): An expansion of "Personal Information" as defined in the WMF Privacy Policy. To this list we add username/user ID, and app install ID. Hashed versions of plaintext unique IDs are still considered to be unique IDs, since they may still uniquely identify a user.
  4. Granular geographic data: Data that identifies the location of a user at a sub-national resolution.
  5. Sensitive data: (from the Wikimedia Foundation Privacy Policy): date of birth, gender, sexual orientation, racial or ethnic origins, marital or familial status, medical conditions or disabilities, political affiliation, and religion.
  6. Ongoing data collection: Data collected in an ongoing manner, typically through automated means. This covers telemetry data from app/web interactions. Importantly, it is data collected through implicit consent just by using WMF projects. It can be long term (for monitoring usage over an indefinite amount of time) or short term (for conducting experiments that have a definite end).
  7. One-off data collection: Data collected in a single instance, typically through a survey. Data subjects in this context may explicitly consent to sharing data by acknowledging a privacy statement, filling out a survey, and clicking a "Submit" button.
  8. Recurring data collection: Instances of data collection that either:
    • recur after some time period (e.g. each month, quarter, or year) or
    • have equivalent data collection schemas across some set of contexts (e.g. iOS and Android).
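
As an illustration of how definitions 1 to 5 feed the low risk criterion "The data collected does not include", the following sketch checks a list of tagged fields against the five prohibited items. Everything here is hypothetical: the `Kind` categories, the `Field` type, and the `content_findings` function are invented for this example, and tagging each field correctly remains a human judgment. It assumes, per definition 3, that personal information, usernames/user IDs, and app install IDs all count as unique IDs even when hashed.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    PERSONAL_INFO = auto()    # definition 1, e.g. email or IP address
    ACCOUNT_ID = auto()       # username/user ID or app install ID
    LONG_TERM_VIEWS = auto()  # definition 2
    GRANULAR_GEO = auto()     # definition 4
    SENSITIVE = auto()        # definition 5
    OTHER = auto()


@dataclass(frozen=True)
class Field:
    name: str
    kind: Kind
    hashed: bool = False


def content_findings(fields: list[Field]) -> list[str]:
    """Return the prohibited items (if any) that the fields contain."""
    kinds = {f.kind for f in fields}
    unhashed_pi = [f for f in fields
                   if f.kind is Kind.PERSONAL_INFO and not f.hashed]
    # Hashed unique IDs still count as unique IDs (definition 3).
    has_unique_id = bool(kinds & {Kind.PERSONAL_INFO, Kind.ACCOUNT_ID})

    findings = []
    if len(unhashed_pi) >= 2:
        findings.append("multiple items of unhashed personal information")
    if Kind.PERSONAL_INFO in kinds and Kind.ACCOUNT_ID in kinds:
        findings.append("personal information + username/user ID or app ID")
    if Kind.LONG_TERM_VIEWS in kinds and has_unique_id:
        findings.append("long-term viewing history + unique ID")
    if Kind.GRANULAR_GEO in kinds and has_unique_id:
        findings.append("granular geographic data + unique ID")
    if Kind.SENSITIVE in kinds:
        findings.append("sensitive data")
    return findings
```

An empty result means the schema passes this one criterion; any finding means the criterion is failed once, however many items are listed.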