Digital Privacy – When is PHI De-identified

De-identification of Data

Social networking sites, efficient search tools (Google, Bing, Yahoo), blogs, cookies, mailing lists, message boards, ActiveX controls and embedded JavaScript on websites, and other databases make it easy to identify a new business prospect or to cross-reference materials from multiple sources and gain unique insights into a matter of interest. However, these online repositories of data make it much more difficult to maintain the anonymity of those whose confidential information has been de-identified. De-identified data has many useful purposes; in aggregate it can be used for tracking disease and flu outbreaks, for tax purposes, and so on. There is also a darker use of these data sources, where the ethically challenged in our society put them to socially unproductive purposes. For example, cyber-stalking and cyber-harassment are now serious problems for both companies and individuals; if you have ever tried to stop such individuals, you will have noticed the absence of a well-developed body of law in these areas.

De-identified information is information that does not allow an individual to be identified because specified identifiers have been removed. Scientists have demonstrated, however, that they can often "reidentify" or "de-anonymize" individuals hidden in anonymized data (note 1).

The fundamental flaw in data anonymization methodologies is that an adversary can find a unique data fingerprint (e.g., date of birth, ZIP code, and gender) and link it to auxiliary or outside information. A potential adversary can use resources such as web search (Google), public records, blogs, and social networks such as Facebook; the issue is particularly troublesome when multiple organizations independently release anonymized data about the same or similar populations. The ultimate balance lies in de-identifying data sufficiently to withstand inspection by a potential adversary while keeping it useful for public health or similar needs.
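The linkage described above can be illustrated with a minimal sketch. All records, names, and field names here are hypothetical; the code only shows how an exact match on date of birth, ZIP code, and gender can attach a name to a record that carries none.

```python
# Hypothetical "de-identified" release: names removed, quasi-identifiers kept.
released = [
    {"dob": "1961-07-04", "zip": "80302", "sex": "F", "diagnosis": "asthma"},
    {"dob": "1975-02-11", "zip": "80302", "sex": "M", "diagnosis": "diabetes"},
    {"dob": "1975-02-11", "zip": "80302", "sex": "M", "diagnosis": "hypertension"},
]

# Hypothetical auxiliary data (for example a public record) that includes names.
auxiliary = [
    {"name": "Alice Example", "dob": "1961-07-04", "zip": "80302", "sex": "F"},
    {"name": "Bob Example", "dob": "1975-02-11", "zip": "80302", "sex": "M"},
]

QUASI_IDENTIFIERS = ("dob", "zip", "sex")


def fingerprint(record):
    """Return the combination of quasi-identifier values for one record."""
    return tuple(record[key] for key in QUASI_IDENTIFIERS)


def link(released_records, auxiliary_records):
    """Map each named person to a diagnosis when exactly one released record matches."""
    by_fingerprint = {}
    for record in released_records:
        by_fingerprint.setdefault(fingerprint(record), []).append(record)

    matches = {}
    for person in auxiliary_records:
        candidates = by_fingerprint.get(fingerprint(person), [])
        if len(candidates) == 1:  # a unique fingerprint means a confident link
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


reidentified = link(released, auxiliary)
```

In this toy data the first person is linked to a single record, while the second matches two records and stays ambiguous. That ambiguity is what generalizing or removing quasi-identifiers tries to create.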

De-identification of health information is essential, because health information that is disclosed can be used to embarrass, extort, or otherwise annoy the person it concerns. With respect to Protected Health Information (PHI), the HIPAA Privacy Rule permits covered entities to release data that have been de-identified without obtaining an authorization and without further restrictions upon use or disclosure, because de-identified data are not PHI and are therefore not subject to the Privacy Rule. Generally, a covered entity can de-identify PHI in one of two ways. The first way, the "safe-harbor" method, is to remove all 18 identifiers enumerated at section 164.514(b)(2) of the regulations. Data that are stripped of these 18 identifiers are regarded as de-identified, unless the covered entity has actual knowledge that it would be possible to use the remaining information, alone or in combination with other information, to identify the subject. However, the copious amounts of auxiliary information publicly available on the Internet may make it impossible for safe-harbor data to remain truly anonymous. On the other hand, the "actual knowledge" requirement may allow the release of data that a hacker (super user) could readily re-identify (i.e., associate a person with the medical or other confidential data), while the covered entity "reasonably" believes the data are de-identified.

The 18 identifiers are:


  • a) Names;
  • b) Geographic subdivisions smaller than a state;
  • c) All elements of dates (except year) related to an individual, including dates of admission, discharge, birth, and death; for individuals over 89 years old, the year of birth must not be used either;
  • d) Telephone numbers;
  • e) FAX numbers;
  • f) Electronic mail addresses;
  • g) Social Security numbers;
  • h) Medical record numbers;
  • i) Health plan beneficiary numbers;
  • j) Account numbers;
  • k) Certificate/license numbers;
  • l) Vehicle identifiers and serial numbers including license plates;
  • m) Device identifiers and serial numbers;
  • n) Web URLs;
  • o) Internet protocol addresses (IP);
  • p) Biometric identifiers (including finger and voice prints);
  • q) Full face photos and comparable images; and
  • r) Any other unique identifying number, characteristic, or code.
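As a rough illustration of the safe-harbor approach, the sketch below drops fields that correspond to the listed identifiers and reduces dates to the year. The field names (`name`, `mrn`, `birth_date`, and so on) are hypothetical and a real schema will differ. The sketch does not attempt the last category (any other unique identifier) or free-text fields, which need case-by-case review, so it should not be read as a complete or compliant implementation.

```python
from datetime import date

# Hypothetical field names standing in for the enumerated identifiers.
DROP_FIELDS = {
    "name", "street", "city", "zip", "phone", "fax", "email", "ssn", "mrn",
    "plan_id", "account", "license", "vehicle", "device_serial", "url", "ip",
    "biometric", "photo",
}


def safe_harbor_sketch(record, today):
    """Return a copy of record with listed identifiers dropped and dates reduced."""
    out = {k: v for k, v in record.items() if k not in DROP_FIELDS}

    birth = out.pop("birth_date", None)
    if birth is not None:
        # Age in whole years as of `today`.
        age = today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))
        if age > 89:
            out["age_group"] = "90+"  # year of birth is withheld for the very old
        else:
            out["birth_year"] = birth.year

    admitted = out.pop("admission_date", None)
    if admitted is not None:
        out["admission_year"] = admitted.year  # keep only the year
    return out


patient = {
    "name": "Alice Example",
    "mrn": "MRN-001",
    "zip": "80302",
    "state": "CO",
    "birth_date": date(1980, 5, 17),
    "admission_date": date(2009, 3, 2),
    "diagnosis": "asthma",
}
cleaned = safe_harbor_sketch(patient, today=date(2010, 1, 1))
```

Here `cleaned` retains only the state, the diagnosis, the birth year, and the admission year. Even so, the remaining fields can still act as a fingerprint when combined with outside data.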

The second method of de-identifying data is to have a qualified statistician determine, using generally accepted statistical and scientific principles and methods, that the risk is very small that the information could be used, alone or in combination with other reasonably available information, to identify the subject of the information. The qualified statistician must document the methods and results of the analysis that justify such a determination. (See 67 Fed. Reg. 53233 (August 14, 2002).)
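The rule does not prescribe a particular statistical test. One measure an analyst might look at, offered here only as an assumption and not as the required method, is how many records share each combination of quasi-identifier values: a record that is alone in its group is the easiest to single out. The function and field names below are hypothetical.

```python
from collections import Counter


def class_sizes(records, keys):
    """Count how many records share each combination of values for `keys`."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def fraction_unique(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    sizes = class_sizes(records, keys)
    unique = sum(1 for r in records if sizes[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


def smallest_class(records, keys):
    """Size of the smallest group of records sharing the same `keys` values."""
    return min(class_sizes(records, keys).values())


# Hypothetical records after direct identifiers have been removed.
records = [
    {"birth_year": 1975, "sex": "M"},
    {"birth_year": 1975, "sex": "M"},
    {"birth_year": 1961, "sex": "F"},
    {"birth_year": 1980, "sex": "F"},
]

risk_detailed = fraction_unique(records, ("birth_year", "sex"))
risk_coarse = fraction_unique(records, ("sex",))
```

With both fields, half of these toy records are unique; with sex alone, none are. This is the trade-off between re-identification risk and usefulness that the determination has to document.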

As is typically the case, if some method is built into the system to allow for re-identification, then the covered entity may not (1) use or disclose the code or other means of record identification for any purpose other than as a re-identification code for the de-identified data, or (2) disclose its method of re-identifying the information. In essence, the method and key (the code) almost become an encryption scheme and, as with encryption, when the key is compromised the data are compromised.
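A minimal sketch of such a re-identification code follows. It assumes the code is a random value with no relationship to the patient's data, and that the lookup table is kept private by the covered entity. The class and identifier names are hypothetical.

```python
import secrets


class ReidentificationKey:
    """Private table linking random codes to internal patient identifiers."""

    def __init__(self):
        self._code_to_id = {}
        self._id_to_code = {}

    def code_for(self, patient_id):
        """Return the stable random code for patient_id, creating it if needed."""
        if patient_id not in self._id_to_code:
            code = secrets.token_hex(8)  # 16 hex characters, not derived from the ID
            while code in self._code_to_id:  # avoid the unlikely collision
                code = secrets.token_hex(8)
            self._id_to_code[patient_id] = code
            self._code_to_id[code] = patient_id
        return self._id_to_code[patient_id]

    def reidentify(self, code):
        """Look up the internal identifier; only the table holder can do this."""
        return self._code_to_id[code]


key = ReidentificationKey()
code = key.code_for("MRN-001")          # "MRN-001" is a made-up identifier
released_row = {"record_code": code, "diagnosis": "asthma"}
```

Only `released_row` would leave the organization. As with an encryption key, anyone who obtains the table behind `key` can reverse every code at once.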

Interestingly, the older the population, the more likely it is that an individual can be uniquely identified. Accordingly, greater care must be taken with the medical data of elderly populations (note 2). Additional research has found that when multiple de-identified data sets are drawn from overlapping source data, re-identification becomes progressively easier. Accordingly, even where extremely large geographical areas are used to aggregate data for population studies, this information may still be re-identified.
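The effect of overlapping releases can be sketched with toy data. The example assumes an adversary who knows the target appears in both releases and knows the target's age band and region. All values and field names are hypothetical.

```python
def candidate_values(release, attribute, known_value, sensitive="diagnosis"):
    """Sensitive values held by records whose `attribute` equals `known_value`."""
    return {r[sensitive] for r in release if r[attribute] == known_value}


# Two hypothetical releases about the same population, coarsened differently.
release_a = [
    {"age_band": "40-49", "diagnosis": "asthma"},
    {"age_band": "40-49", "diagnosis": "diabetes"},
    {"age_band": "50-59", "diagnosis": "influenza"},
]
release_b = [
    {"region": "West", "diagnosis": "diabetes"},
    {"region": "West", "diagnosis": "influenza"},
    {"region": "East", "diagnosis": "asthma"},
]

# The target is known to be aged 40-49 and to live in the West region.
from_a = candidate_values(release_a, "age_band", "40-49")
from_b = candidate_values(release_b, "region", "West")
combined = from_a & from_b  # values consistent with both releases
```

Each release alone leaves two possible diagnoses for the target, but only one value is consistent with both, even though the second release uses a very large geographic unit.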

Compared with de-identified data, a limited data set is even easier to re-identify (although there are significant legal restrictions on the use of this information). A limited data set is one that excludes the direct identifiers listed in section 164.514(e)(2). Unlike a de-identified data set, a limited data set is PHI, because it may include dates, city, state, and ZIP codes, and other unique identifying codes or characteristics not listed as direct identifiers. A limited data set may be used or disclosed, without authorization, for research, public health, or health care operations purposes, in accordance with section 164.514(e), only if the covered entity and the limited data set recipient enter into a data use agreement. However, if the use or disclosure could be made under another provision of the Privacy Rule, such as for public health purposes in accordance with section 164.512(b), such an agreement is not required.


1  See Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization (August 13, 2009), University of Colorado Law Legal Studies Research Paper No. 09-12 (available at SSRN: http://ssrn.com/abstract=1450006).

2  See Philippe Golle, Revisiting the Uniqueness of Simple Demographics in the US Population (Palo Alto Research Center, October 30, 2006) (available at http://www.truststc.org/wise/articles2009/articleM3.pdf).