Hubert H. Humphrey Building
200 Independence Avenue, SW
Washington, DC 20201
Robert M. Gellman, J.D., Chair
Simon P. Cohn, M.D., M.P.H., FACP
Kathleen A. Frawley, J.D., M.S., RRA
Richard K. Harding, M.D.
M. Elizabeth Ward, M.N.
John P. Fanning, J.D.
Harvey Schwartz, Ph.D.
A.G. Breitenstein, J.D., Health Law Institute
Gray Friend, IMS America
David Korn, M.D., AAMC
Deirdre K. Mulligan, Center for Democracy and Technology
Latanya Sweeney, Laboratory for Computer Science, MIT
Easley Hoy, US Bureau of Census
Janlori Goldman, J.D., Institute for Healthcare Research and Policy, Georgetown University Medical Center
Michael Lundberg, National Association of Health Data Organizations
Ben Steffen, Health Care Access and Cost Commission, State of Maryland
Alvan O. Zarate, Ph.D., NCHS
Convene Roundtable Discussion, Introduction of Participants - Robert Gellman, Chair
Roundtable Discussion: Identifiability of Data
MR. GELLMAN: Good morning. This is the Subcommittee on Privacy and Confidentiality of the National Committee on Vital and Health Statistics. I'm Bob Gellman. I'm chair of the subcommittee.
This is the first of two days of subcommittee roundtables. These aren't hearings. What we are doing at these two days is we are going to take a more in depth look at two relatively narrowly focused issues that normally don't get much attention. When health privacy is discussed, particularly in a legislative context at hearings on Capitol Hill, most of the focus tends to be on broader issues -- law enforcement access, preemption, informed consent, the sort of major issues -- and a lot of the details and nuts and bolts tend to get ignored.
There is just not enough time in other forums to pay attention to these issues. So one of the purposes today is to shine light on these issues; to collect facts; to have some discussions, and to see what comes out.
We don't have any express plans to have a specific product from these workshops, or an outcome. We will be exploring the issues. Maybe we will reach a conclusion; maybe we won't. Maybe we will all reach different conclusions, but hopefully we will all come away with a better understanding of the facts, of the needs of various players, of some of the non-health background issues that are involved here, some of the constraints that exist, be they legislative, operational, technical or whatever. We'll talk some about the technology, and hopefully a lot about policy.
If there is a specific goal today, I might call it the search for the free lunch. This doesn't really apply so much tomorrow. Tomorrow's subject is data registries, and today's is identifiability of records. If we can find a way to make more records available for more socially beneficial purposes without impinging on anyone's privacy interest, that is the free lunch. I don't know whether it is possible at all, but at least we can define a goal that I think everyone can agree on. I think you find elements of this in all of the legislation.
The process today will be relatively loose. I hope to have more of a freewheeling discussion here. If that doesn't work, we may get a little bit more structured. The outline of events is that I'm hoping to focus somewhat more in the morning on the facts. We're going to have some of the people, committee invited participants who do data things, to talk about what they do, and to get some of the basics out on the table. I don't think we can assume that everybody knows what everybody else is doing, and the facts are important to try and make a decision about what should or shouldn't be done.
I hope this will be relatively informal. As long as it works, people can just speak up without necessarily having a list of people who want to be recognized, and we'll see if that works. There will be an opportunity for people in the audience to participate. There is a microphone there. I think people in the audience can wait to be recognized, and I will try and work them in. We will allow this to go on if it doesn't become too disruptive. If people have something screaming that they just absolutely have to say because it is urgent, you can wave your hand and see if that works.
We're going to go until we run out of steam sometime this afternoon or until 5:00 p.m., whichever comes first.
I think we're going to begin the introductions a little bit differently. What I'm going to do is we are going to introduce the audience first. Then I'm going to provide a little bit more background, and then we're going to go around the people at the table, and let the participants introduce themselves.
Would you like to begin?
[Introductions of the audience were made.]
MR. GELLMAN: Thank you all for coming. I would like to just say a word of thanks first of all to everyone who came. I am most grateful to all the participants for coming. We obviously couldn't have done this without everybody. I want to thank all of the staff for their work, and particularly Judy Galloway, who did much of the work in getting everybody here, and getting this thing organized. So we are grateful for that.
Let me just say this. I tend to focus on legislation, because that is my background, and I don't think that this day is to focus on the bills that are floating around, but I want to give everybody a sense from the legislation of what is up there on this issue, and why I think it's something that really needs a lot more attention.
I found four bills without looking very hard, that all deal with the issue of identifiability. One bill is Sen. Leahy's bill, S-1368, which basically defined non-identifiable health information as information that does not reveal the identity of the individual, and there is no reasonable basis to believe that the information could be used to identify the individual. That is one standard.
The Condit bill, HR-52, defined protected health information as information that identifies the individual, or for which there is a reasonable basis to believe that the information can be used to identify the individual. So the Leahy bill talks about no reasonable basis, and the Condit bill talks about a reasonable basis.
The McDermott bill, HR-1851, defines non-identifiable information as information from which it is impossible to ascertain, based on the information codes or identifiers, the identity of the subject, and that the information cannot be linked or matched by a foreseeable method to any other information about the individual; an extraordinarily high standard.
Finally, there is another bill. It is not a medical privacy bill. It is Sen. Brownback's Federal Statistical System Act, S-1404, trying to deal with statistical agency access and restructuring. It defines identifiable form as any representation of information that permits information concerning an individual to be reasonably inferred by either direct or indirect means.
So we've got four bills with four significantly divergent standards, none of which has any particularly clear meaning, at least to me, in an operational sense. These issues are embedded in legislation, and our goal here is not to rewrite any of these provisions, but if we can shed some light on all of this, I think that would be useful.
What I would like to do is we are now going to go around and do the introductions. For the invited participants, you are welcome to spend a minute or two describing just where you come from, how you deal with data. There will be an opportunity later on to talk about these things in a little more detail, but to just give everyone a better sense of what you do.
[Introductions were made.]
MR. GELLMAN: I think the way we are going to start is I am going to ask some of the people from government to spend a few minutes talking about their data -- data activities; what do you collect; what do you do with it; how do you put it out.
Al, do you think you could begin? Take five or ten minutes and give us a good sense of what you do with data, what your statutory constraints may be, and how you function.
DR. ZARATE: I handed out a couple of pages in which I lead off by describing the twin mandates that we always have to balance in assessing what we can release, and what we ought to release. Our agency is charged with disseminating statistics on as wide a basis as is practicable, from our creating legislation, the Public Health Service Act.
In the following provision of that act, and I paraphrase it here -- I had to paraphrase it, because it has been described as some of the most tortuous language that was ever put forth in legislation. It basically says that no identifiable -- and you can see why we're very interested to be here -- information may be used for any purpose other than what respondents were told when it was collected, or prior to its collection actually. Nor may it be shared with anyone that the respondent was not aware of or not made aware of prior to its collection.
The term "consent" is vital to this. We are limited by what the respondent has consented to use it for, and to share it with. We have no discretion in this law. It is not like some other provisions which would allow us to release data for certain purposes. We may not release identifiable information.
MR. GELLMAN: Is there a statutory definition of identifiable here?
DR. ZARATE: No, unfortunately there is not, but what we have -- you can look at these notes, and peruse them at your leisure. What we normally do is we make of course a distinction, and this appears over and over again, and it is subject to a little interpretation. When data come in, when data are gathered, of course our field representatives are given strict procedures to follow for the maintenance of confidentiality at the initial point of collection, and to be careful to be able to explain to individuals when we are getting their consent, what the limitations are, what our statute says. So they are aware before they give the data to us.
What is also of course not specified and never has been in our statute is the exact meaning of "consent," because that phrase is to be defined under regulations to be developed by the secretary, and they have never been developed.
So we use a level of consent that is basically consistent with other survey research. It is the implied consent or constructed consent, as I understand it. When an individual, knowing how we intend to use it, who we intend to share it with, gives us the information anyway, that is construed as consent.
In other portions of our surveys, particularly which are more invasive you might say, in our health examination surveys we do obtain written consent before an individual undergoes any blood tests, or any kind of measurements.
MR. GELLMAN: You collect all of your data, most of your data, some of your data directly from individuals, or do you get it from intermediaries who collect data?
DR. ZARATE: All of our survey data are directly from -- no, sorry, I take that back. We have the Health Examination Survey from individuals. The better known ones, the National Survey of Family Growth, directly from individuals; the National Health Interview Survey directly from individuals.
The Hospital Discharge Survey, we obtain samples of medical records from the institution which maintains those records. So that is an example of indirect.
MR. GELLMAN: Are those records identifiable?
DR. ZARATE: No, they are not identifiable.
MR. GELLMAN: We don't know what that means yet.
DR. ZARATE: The thing is that the only way back to them would be through the provider, and we don't want to know. I'm trying to think now in the other cases where we -- the other major case is vital statistics, where this is information on births, deaths, marriages, et cetera, where the consent refers to the state office of vital statistics, which provides us the data, not to the individual themselves. So it does vary.
We observe standard procedures. Now I don't know if you are all aware of them, and one interesting new feature of our data collection efforts is that where previously we used paper questionnaires, now those data are collected electronically. In the case of one survey, it is collected for us by the Census Bureau as our contractors, and they move the data around electronically from region to region.
Finally, it gets to our office in electronic form. It is edited there. The first thing that is done there is that the direct identifiers are stripped, and an analytical file is made. That analytical file, which contains identifiable data, but not the so-called direct ones, explicit ones, then that is used. We keep a master copy of the file with the identifiers in a secure place, and then what our analysts use is what we call in-house files. Those files are still regarded as confidential.
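The stripping step Dr. Zarate describes can be sketched in code. This is a minimal illustration, not the agency's actual procedure: the field names in `DIRECT_IDENTIFIERS` and the random `link_key` that ties the two files together are assumptions made for the example.

```python
import secrets

# Hypothetical field names; the transcript does not list the real identifiers.
DIRECT_IDENTIFIERS = {"name", "street_address", "ssn", "phone"}


def split_record(record, direct_identifiers=DIRECT_IDENTIFIERS):
    """Split one record into a master row and an analytical row.

    The master row keeps every field. The analytical row drops the direct
    identifiers. A random key (an assumption of this sketch) links the two.
    """
    link_key = secrets.token_hex(8)
    master = dict(record, link_key=link_key)
    analytic = {k: v for k, v in record.items() if k not in direct_identifiers}
    analytic["link_key"] = link_key
    return master, analytic


def build_files(records):
    """Return (master_file, analytic_file) as two parallel lists of dicts."""
    pairs = [split_record(r) for r in records]
    return [m for m, _ in pairs], [a for _, a in pairs]
```

As the testimony notes, the analytical file produced this way is still confidential: removing explicit identifiers does not by itself make the remaining fields non-identifiable.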
At a later point when we decide that this is something that we want to release on a public use basis, it then comes to me, but not until after we have had the people in the program go through the file. We use a checklist first developed by the Census Bureau, and I guess Easley Hoy will tell you more about that, but which we have now further developed with more health examples in it.
What we do is we get information on the file. We want to know how old it is. We want to know how big it is. We want to know the information that is contained, but more than that, we want the program people to give us some advance information that we'll use to decide the level of identifiability with what they are proposing.
We ask them a lot about geography. We ask them not just the level of geography that is identified, but whether or not there are any implicit geographic measures there that they think about. For instance, it is common practice to embed in an internal identification number, some geographic detail; the block on which the respondent was found, the county or the primary sampling unit, which may be a single county, et cetera, et cetera.
So that the knowledgeable intruder could look at that -- probably go there first -- and figure out what county, what state, and they are off and running. So we ask them about those kinds of details. Is there any other kind of information that might be used? We ask about detail.
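One way to address the embedded-geography problem just described is to replace the internal identification number before release. The sketch below is an illustration only; the `case_id` field name and the state-county-sequence ID format in the usage example are hypothetical.

```python
import random


def rekey(records, id_field="case_id", seed=None):
    """Return copies of the records in shuffled order with new sequential IDs.

    Shuffling matters: if the file stays sorted by the old ID, the new
    sequence numbers would still follow the geographic ordering.
    """
    rng = random.Random(seed)
    shuffled = [dict(r) for r in records]
    rng.shuffle(shuffled)
    for new_id, rec in enumerate(shuffled, start=1):
        rec[id_field] = new_id
    return shuffled


# Hypothetical IDs that encode state and county in their leading digits.
source = [
    {"case_id": "24-031-0007", "age": 41},
    {"case_id": "24-031-0008", "age": 67},
    {"case_id": "06-037-0001", "age": 29},
]
released = rekey(source, seed=1)
```

The original records are left untouched, so the crosswalk between old and new IDs can be kept with the secured master file if it is needed.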
With regard to ordinary demographic variables, which are either commonly available elsewhere, or which might result in a rare and highly visible case. So for instance, recently I had to ask one survey to reduce the amount of detail it provided on height and weight, because some very large and very tall people, and small people -- anyway, rare and visible kinds of people even on a regional basis might have been identified. So we ask that kind of information.
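Reducing detail on height and weight is commonly done by top- and bottom-coding, that is, collapsing extreme values into the nearest allowed bound. The following is a minimal sketch of that general technique; the transcript does not say which method or cutoffs were actually used, so the values here are hypothetical.

```python
def top_bottom_code(values, lower, upper):
    """Clamp each value to the closed range [lower, upper]."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]


# Hypothetical heights in centimeters and hypothetical cutoffs.
heights_cm = [150, 172, 210, 198]
coded = top_bottom_code(heights_cm, lower=155, upper=195)
print(coded)  # [155, 172, 195, 195]
```

After coding, the very tall and very short respondents are indistinguishable from everyone else at the bound, at the cost of losing the tails of the distribution for analysis.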
We ask whether or not they know of any other files which may be in general or that are matchable into what they have done. Sometimes we tend to forget that where we got the sampling frame originally is going to have this information.
In one of our cases where we had a sample of institutions from Dun and Bradstreet, someone pointed out that well, they've got the sample. So we've got to completely block the information that would describe that sample from them unless the respondents have been told they could have it, and they want it. So that kind of information. So we go through this excruciating detail.
MR. GELLMAN: When you do this, do you have a formal checklist or a formal procedure?
DR. ZARATE: Yes, it's in a checklist which actually -- I'm getting ahead of myself, but recently, about three years ago was formulated in the government; a group called the Interagency Confidentiality and Data Access Group. There are people like myself, a lot of them are mathematical statisticians or Easley's colleague Laura Ziads(?) is now the chair. I'm the vice chair. Jenny Dewilk(?) for the Bureau of Labor Statistics, got this thing going. We are now an interest group; a subgroup if you will of the Federal Committee on Statistical Methodology.
What we have done is we have put our experience together in this checklist, and we have designed it so that -- it is in an electronic version. You can adapt it to whatever use you want to use it for. You can take things out, add some examples. It is intended to educate the people in the programs. In my case, I want people to know why we are so concerned about certain things.
It tells them for instance some of the general ways in which the information could be used to match in other files. We explain to them why we want certain information. It is a standardized checklist. I require all of the people who want to release a file for public use, to fill that out.
I don't regard it, nor does anybody else regard it as the complete answer to the definition of identifiability. It is a start. It means that all of the surveys are treated in the same way. We are sure to ask the same questions that we have agreed are pertinent, and to educate the people who are asking it. It is designed to provide information on sample surveys, complete counts, tabulations, that is the tables, as well as micro data tapes, and it is flexible.
So I use a version at NCHS, and we ask people to fill this out. After that is done, I still look at the code to see if they have missed something. They do from time to time, miss some things. If there is a particularly difficult case, I have the discretion -- we have not instituted this on a regular basis like the Bureau of the Census has -- we also have a disclosure review board, where we have some mathematical statisticians, the chairman of the Confidentiality Committee, and people from the program sitting, who review the files, and are available to me to discuss any real problem areas.
From time to time we have even had people come in from the outside to help us look at a particular, really problematic case where I have said, you can't release it. They say, gee, we've got to get this out. Is there any way we could do it? We bring people in to look at it, and to see if there is any way in which it can be recoded.
Again, this was alluded to before, the issue of if you put in all of the tools in the disclosure limitation arsenal, then you generally make a file unusable, and the researchers are unsatisfied. You may satisfy yourself that you guarded the individual's privacy well, but then researchers can't use it. You have to say, well, is this one of those cases where I simply can't release this file? We are not going to go anywhere with this, and there have been cases like that, or is there something more we could do to release it.
In that situation, after we have gone through all of these processes, we make a judgment. We do not have an absolute way of defining identifiability. When it comes down to it, what we are saying is that we use that reasonableness test quite frequently. Where I know that there is another database out there, it is a no brainer. You can't release it. We are constrained.
Then there are situations where we say, well, you know these are common characteristics, demographics which are generally available at almost every level, and we know that states and municipalities are constantly putting together databases which could be matched with ours and we err on the side of precaution.
There are other cases where we have to say, who knows? Then we think of well, what would it take to get in there? Who would it take to get in there? How much effort? Who would be interested? I don't think there have been any cases in which we have said, well, it doesn't look -- because what I really try to get people to do is say, well, if you are really unsure, let's do some more tabulations. Let's look at what a potential intruder would be presented with. Usually people will just say, yes, you know, we can't let that out.
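The "more tabulations" mentioned here can be illustrated by counting how many records share each combination of commonly available variables. Combinations held by a single record are the ones a potential intruder could most easily match. This is a sketch of that general idea, not the agency's checklist; the field names and the default threshold are assumptions.

```python
from collections import Counter


def rare_combinations(records, key_fields, max_count=1):
    """Return {combination: count} for combinations seen at most max_count times."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return {combo: n for combo, n in counts.items() if n <= max_count}


# Hypothetical records with three commonly available variables.
sample = [
    {"sex": "F", "age_group": "30-39", "region": "North"},
    {"sex": "F", "age_group": "30-39", "region": "North"},
    {"sex": "M", "age_group": "80+", "region": "North"},
    {"sex": "F", "age_group": "30-39", "region": "South"},
]
unique = rare_combinations(sample, ["sex", "age_group", "region"])
```

A record that is unique in the sample is not necessarily unique in the population, so a tabulation like this flags cases for judgment rather than settling the question.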
So in the face of this, I must say that we have tried to turn to other means of providing researchers -- and I indicated some of those -- the kind of detail that they need. This is very difficult to do, because we do not have the capacity to allow researchers in general -- we very rarely, if ever, have gotten consent from individuals to let a generalized class of people look at information, like all researchers. We always said that until now, we have never done that.
What we have tried to do is to look at ways in which information in the amount of detail that researchers normally want it, is available to them without compromising confidentiality that we have set up.
There are some normal ways in which we do it. There is collaboration with researchers in our agency, where the researcher is the one who accesses, or our NCHS staff is the only one who accesses, and his or her colleague then looks at cleaned up tables.
We have recently gotten and implemented a visiting scholar program, the American Statistical Association Visiting Scholar Program. We have the administrative means to allow that person to access confidential data, but very few people can take advantage of that kind of thing.
So we have turned to other methods of permitting more people to use the information, three in particular. One, we have an analytical programming service which has been set up to examine tabulations from the National Health Interview Survey and the National Survey of Family Growth. This is an extension of cooperation really.
We get tabulations. We give people a dummy file, and they can use that dummy file to get their programs ready. The data don't represent anybody in our samples. They are fictitious cases. They can use that to debug their programs.
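A dummy file of this kind can be sketched as records drawn at random from a file layout rather than from any respondent. The layout below is hypothetical, and the transcript does not say how the agency actually builds its dummy files.

```python
import random

# Hypothetical layout: field name -> list of allowed codes.
LAYOUT = {
    "sex": ["M", "F"],
    "age_group": ["0-17", "18-44", "45-64", "65+"],
    "region": ["Northeast", "Midwest", "South", "West"],
}


def make_dummy_file(layout, n_rows, seed=0):
    """Generate fictitious records that follow the layout but describe no one."""
    rng = random.Random(seed)
    return [
        {field: rng.choice(codes) for field, codes in layout.items()}
        for _ in range(n_rows)
    ]


dummy = make_dummy_file(LAYOUT, n_rows=5)
```

Because every value is drawn independently from the code list, the file is suitable for debugging a program's syntax and logic, not for producing meaningful estimates.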
They then submit those programs to us, and we run them for them. After we have run them -- and of course none of their programs can have instructions to list individual cases, or to ask for geographic detail. We pre-program that so it would not permit them to get key items of information for individuals.
We then look at their tabulations, and we do tabular disclosure limitation practices, and then give it back to them. So basically, they have never looked at identifiable data, but they have the use of it.
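One common tabular disclosure limitation practice is primary cell suppression, where small counts are withheld before a table is returned. The sketch below illustrates that general technique under assumed rules: the minimum count of 5, the `*` marker, and the choice to leave zero cells visible are all assumptions, not the agency's stated policy.

```python
def suppress_small_cells(table, min_count=5, marker="*"):
    """Replace counts between 1 and min_count - 1 with a suppression marker."""
    return {
        cell: (marker if 0 < n < min_count else n)
        for cell, n in table.items()
    }


# Hypothetical tabulation keyed by (region, condition).
tabulation = {
    ("North", "asthma"): 42,
    ("North", "rare condition"): 2,
    ("South", "rare condition"): 0,
}
released_table = suppress_small_cells(tabulation)
```

This handles only primary suppression. If row or column totals are also published, a suppressed cell can be recovered by subtraction, so additional (complementary) suppression is needed in that case and is not shown here.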
We are working on extending that to what we call a remote access system, where there would be fewer people involved, and this would be automated.
Finally, we are trying to develop something that the Center for Economic Studies in the Census has done in several places, and that is to develop a research data center, where individual scholars or researchers can come, and in the capacity of a sworn employee -- which we don't have yet, but we are working on it, and we are hoping to have that, and other security measures -- can access confidential data on our site in such a way that they can't bring in any data with them to match up with it and achieve an identity.
Of course, these are files which have been stripped of the obvious identifiers, so that any identification would require some analysis. Effectively what we do is block them from doing that analysis, and making sure that they don't bring anything with them, or they don't take anything out. What they do take out is reviewed by us before they take it out.
So we are working on setting up those kinds of systems where people can come, and once again, have the use of confidential data without being able to identify any of our respondents.
MR. GELLMAN: Let me ask you to talk for a minute or two about what you publish. The kinds of information that you put out on data tapes; who uses them; what they use them for. Give us a better substantive sense of why this data is important.
DR. ZARATE: I can tell you that we put out -- I think the figure is more than 400 public use data tapes in the last five or six years. It's an enormous number. Some of these are just reiterations. Like the National Health Interview Survey does a survey every year. Where it hasn't varied, it's the same survey every year, but it is more recent data.
Then there are data for vital statistics. Data on our Health Examination Survey is done in a cyclical basis, roughly once every four or five years.
MS. GREENBERG: It's going to continuous.
DR. ZARATE: It's going to continuous, that's right. So that information, with all of its supplements -- I mean it's not just one file, but it's many different files focusing on particular aspects of the data that they gather.
As I said, vital statistics before, and hospital discharge statistics, but our Health Care Statistics Branch is a family, so that there is information on nursing homes, on doctors' visits, ambulatory care. So there is a whole host of information of that variety, where we get information on physicians, and information on the providers, as well as samples of their respondents.
MR. GELLMAN: Describe the users.
DR. ZARATE: Users, wow. It's everybody, isn't it? I can't think of any person -- periodically we have people reminding us that our principal user is Congress. Then we have the American taxpayer. Then there are the researchers, of every description. There are pharmaceutical people who are interested in especially our cause of death statistics.
I think offhand of cases of people who want to use our information for commercial purposes. Some of the standards, the physical dimension data that are put out by the Health Examination Survey are used by commercial interests in establishing seat size on airliners and things like that.
This is what I hear around the agency. Other people could give you more detail. I don't think there is any limit or end to the kinds of people who would use our data.
MR. GELLMAN: Has it come to your attention from time to time that the information you have been releasing has been used in a way you didn't intend, that is, has been used to identify people? Is this something that comes up occasionally, rarely? Would you know?
DR. ZARATE: I guess you have to be realistic that there could be ways in which that has happened, and that we don't know. We know of none. None has come to our attention. There have been cases where we say, whoops, let us have that back, but we have always managed to make sure that the circle is unbroken.
We get I think challenges in the form of Freedom of Information requests. Our agency has an exemption from those on two counts; one, that it is an unwarranted invasion of personal privacy, and it is also covered by another statute. So although we get them periodically, we have only once had to worry about what we were going to give out.
So that from time to time we get worried, because one of our significant activities is that we maintain what is called a National Death Index. This is a system for the convenience of researchers to be able to use the information in our mortality data sets, so that they can really follow people. They are usually people who are following cases and want to know, are they dead, and if so, what did they die of?
They may have moved out of the study area. In that case, it would be nice to go to a central spot and locate the death certificate and get that information. What we do is we tell them where they can find it. We don't give them the data. We say, okay, here is where it is likely to be. We don't tell them for sure. So they put in a request, we know this person, and such and such description.
We put that description in and we do a match, and we say here are the ones that are likely to be positive. There may be more than one, and here is where they are. The researcher's task is to go to that state and ask for them, and it's the state's decision to let them have that information, whatever information they want to let them have.
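The lookup described here, in which a researcher's description is matched against the index and only the likely locations come back, can be sketched as follows. The matching rule (counting exact agreements on the supplied fields) and all field names are assumptions for illustration; the transcript does not describe the National Death Index's actual matching criteria.

```python
def candidate_matches(query, index, min_agreement=2):
    """Return likely matches as (state, agreement count), best first.

    Only the state holding the certificate is returned, never the record
    itself; the researcher must then apply to that state for the data.
    """
    results = []
    for entry in index:
        agreement = sum(
            1 for field, value in query.items() if entry.get(field) == value
        )
        if agreement >= min_agreement:
            results.append({"state": entry["state"], "agreement": agreement})
    results.sort(key=lambda r: r["agreement"], reverse=True)
    return results


# Hypothetical index entries and researcher query.
index = [
    {"surname": "Doe", "birth_year": 1920, "sex": "F", "state": "MN"},
    {"surname": "Doe", "birth_year": 1921, "sex": "F", "state": "OH"},
    {"surname": "Roe", "birth_year": 1920, "sex": "M", "state": "TX"},
]
query = {"surname": "Doe", "birth_year": 1920, "sex": "F"}
likely = candidate_matches(query, index)
```

More than one candidate can come back, which matches the testimony: the index points to where a certificate is likely to be, and the decision to release anything stays with the state.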
Sometimes we get worried that this information is out. There are legal interests who would like to have that information. There has been one case where a lawyer successfully sued to get the information, but basically what had to be done was that the holder of that information had to go back to the states and get permission. So once again, we did not allow any information out that wasn't supposed to be out. It was adjudicated and taken care of. That's as close as we have come to releasing any identifiable information.
MS. SWEENEY: I have a couple of questions. One is on the 400 public use files, what is the typical aggregation level? Like how much of micro data is at the individual level?
DR. ZARATE: The micro data tapes are all at the individual level.
MS. SWEENEY: No, but I meant of the 400 public use files, how many of those were micro data files?
DR. ZARATE: I was referring to micro data files, all of them.
MS. SWEENEY: All 400 of them are micro data files?
DR. ZARATE: Yes, micro data files. They are not aggregated.
MS. SWEENEY: But what is the typical record size?
DR. ZARATE: I really have no way of telling you.
MS. SWEENEY: Is it in millions or thousands?
DR. ZARATE: It is not that high. I don't think so. I think it's safe to say that it is in the thousands. There was one case where we had a really large one, but we didn't allow it as a public use file. We're talking about huge numbers, where any one of them would have been a unique -- the odds are very, very high you would say that they would constitute a unique identifier. We just said, no, the odds are much too high.
MS. SWEENEY: So of the 400 micro data public use files, how many aggregate public use files do you do, where you sort of clump people together in categories?
DR. ZARATE: Well, I guess there are files which permit a person -- we put it out in a form that permits people to aggregate on it, but the files that I'm talking about right now, the micro data files, are all individual records files. If you want to ask to what level are they aggregatable?
MS. SWEENEY: No, I'm interested -- so say for example, if I don't give out all the data on all the people in my micro data file, I may give it out on 80 percent of the people or 50 percent of the people. Then if I also produce a file that gives summary information on all the 100 percent of the people, it gives me a notion of how much could be inferred about the people that are and are not there.
So the question is if you have 400 micro data public use files, how many complete, aggregate files are there released?
DR. ZARATE: I really couldn't answer that question. I think that it's very common in our surveys -- we used to ask people how much of the file has been edited. When you sit down with the people who do these surveys, when you ask, what does it look like in the field, and then what does it look like when you finally put it on a data tape and release that, when you account for non-response to begin with, and when you account for item non-response, when you account for the editing that was required to make the response logical, when you account for the response error, which is different from item non-response, that is if you ask a person's income and they round off, age is rounded, things like that, when you account for all of these, then a significant amount of noise, as they say, is already built into the file.
As a matter of fact, I heard from my Census colleagues that sometimes you don't want to change the information -- and this is a Census perspective, and I'm sure that Hoy would want to comment on it -- there are some people who feel, don't change anything they tell you, because what they tell you is what you have collected. When you change it, you're modifying it from what you have collected, and that is not right.
Any file, to be useable for the researcher as you know, has to be crafted so that it is put into useable form. All that crafting means that it takes one additional step, step after step after step away from what was said in the field. Now some of it actually brings it closer to the truth of course, and with some level of probability. You say, it must have been this. It had to be, so we'll take it as that, so that's okay, but some of it may take you away.
When I first entered into this business I assumed that everything was true on the data file, and that the case of matching in was going to be a piece of cake, because everything else was true on the outside, but it's not. So I think it's important to recognize that you start off in the matching game with some built in limitations.
Now having said that, that's why we ask, what can be done to this file, and what are the consequences of doing it that way?
DR. KORN: I have to apologize; I'm not an information expert or maven, so let me ask a simple question. The NCHS is probably one of, if not the largest repository of population data, at least on things that I know about, that have to do with well being, health in a broad sense, death, whatever, in the country. I think it may be.
It is used by an enormous variety of agencies, people, researchers for a whole bunch of purposes that I would argue for the moment are in general very useful, and even important.
The question I'm asking is, does anybody around the table or in the room know of a single instance in which the use of that data has been demonstrated to be injurious to anybody? Are we talking about problems here, or are we talking about hypotheticals? That's the question.
MS. SWEENEY: I would pose two things to consider in the question that you posed. The first thing I would consider is something that has been said by a lot of hospitals with respect to disclosure control, who by the kinds of standards set by the federal agencies, is a little too loose and uncomfortable. They often say, yes, but, we could suffer five lawsuits before we have to consider changing our practice.
The reason for that justification is that it is a very hard thing to prove, because the kind of people who are taking advantage -- if the Census Bureau released financial information that my credit card company uses, and they are not apt to go back and tell the Census Bureau that they were able to take advantage of that, but they are in fact using that information. So that is one aspect that I would have you consider.
The second is perhaps one of the reasons that I think I'm here today, if I have only one message to give you -- times have changed. Privacy, the protection for privacy that we had in the past, by our past practices, by the kind of statistical modeling, statistical disclosure techniques that have been used are simply ineffective in the face of today's technology.
Technology changed everything. The proliferation of the Internet, the information initiative to put more and more public use information available on the Internet. Before, if I wanted to find out, to re-identify some of the information from his file, let's say some of it came from the Mayo Clinic, I would have to fly to Minnesota, during the hours that they would let me in their file room. Do whatever I needed to do, to write by hand.
Today I take my little hand held computer. I click on. I connect to Mayo's database. I collect the whole database, and do whatever I want with it. So technology has changed everything in ways that our old practices really aren't equipped to deal with.
So in one way of answering your question of have there been any explicit abuses, I don't know. I suspect there have. Whether or not we know of them so publicly may be one, because people aren't quick to tell what advantages they are getting from the data, and two, I argue that in the next five years, whatever that number is, it will be magnified tremendously.
MS. MULLIGAN: I would like to add to that actually, that I feel like maybe I don't have to be here, because you are here, and it's quite nice. In addition to the usefulness of the technology itself, we have also seen an enormous growth in the collection of data by the private sector. So that before while you may have been the primary repository of that information, there are many, many organizations, state governments, private companies that are now collecting enormously useful, manipulatable, digitized data elements.
So our ability to manipulate that data and identify people -- I mean we are entering an entirely new era. So I think that it is both the technology, but also just the growing wealth of data in the private and non-federal sector government collection. So it is dealing with both the increased quantity, and the real usefulness in this technology in manipulating it and in matching it.
DR. KORN: I understand that, and I'm not going to argue that with you, but I really was asking whether the NCHS data, as it is collected, managed and used, has been a source of injury. I know that there are credit card records and DMV records and all kinds of things that have nothing to do with the NCHS, that may be very sloppy, and very easy to get into. Probably you can, and I'm not arguing that point.
I just want to make a comment. I think that if one takes the issue of information as a single huge cosmic thing, you can drive yourself crazy trying to worry about it, and you will never get anywhere.
Unless you think in much more specificity about uses of information, and then try to understand what those uses are, how they are being managed, and what needs may exist for strengthening their protection, or doing other things to make them more secure or whatever, I think you get into sweeping generalities, semantic impossibilities with words and definitions. I think that that really confounds discussion, and it certainly confounds legislative debate badly.
MS. BREITENSTEIN: I just wanted to add one thing, and then I had a question. I think, at least in the realm that I work in, which is primarily HIV and AIDS we had an example, I think it was last year, of the HIV registry in Florida being sort of grossly abused by virtue of data being pretty much taken right out in disk form, and then shared with people in a bar, as far as I knew by the story.
So I mean I think that that's the first maybe not a typical example, but definitely an example. I think the question I think that Latanya and Deirdre were trying to get to was are we going to wait for something to really happen that is bad, or are we going to take responsibility for the data that we're collecting and how we are going to use that before we have a problem?
I think waiting until a problem occurs is sort of a finger in the dike sort of idea. That was my comment.
My question to you was is there any sort of evaluation about the data elements that are requested? In other words, is a tape a tape a tape, or is there any sort of request for certain data elements which is then answered, or is it like here you go, you can look at whatever you want to look at.
DR. ZARATE: We get these requests all of the time principally for geographical detail. People want to know about their own area or their own region. Because of this, one of the things that we have tried to do is the National Survey of Family Growth I mentioned before is a file. It was specially created with lots of geographic detail. That is, with information about the local area in which women live. Information about the family planning services, and what we call contextual data.
You put this all together -- and this is that case, Latanya, where you said we can't make this a public use tape, because there is just too much information on it. It is huge. It is a huge record such that if you took the value of any one individual, there would be none other like that anywhere in the country. So we decided that that would be a good candidate for a remote access system, where people could use that information, but never really see it.
I got so interested in what I was saying, that I lost track of your question.
MS. BREITENSTEIN: I'm sure my question wasn't half as interesting.
DR. ZARATE: I'm sorry.
MS. BREITENSTEIN: No, it's okay. I was just wondering if it's like here is the data, or if it's you are requesting these certain data elements; here are these certain data elements.
DR. ZARATE: We have to evaluate them all, and part of my job is to make sure our staff are aware that they just can't grant an interesting request. It's something that you have to be on top of all of the time.
I think that one of the things that we try to do is I took seriously, there was a section in Private Lives and Public Policies, which basically said that confidentiality is not just a question of assuring someone that you are going to take good care of the information they provide, but you've got to actually do it.
We go to, I think -- our colleagues in the Census Bureau and other agencies go to extraordinary lengths not only making sure that informants understand what we are going to do with their information, but making sure that that is exactly what we do with it, and that we don't provide any information to anyone -- we've shot ourselves in the foot sometimes where another agency has come in and funded something, and then I had to turn around and tell them, sorry, you can't have the data in the form you wanted it, because we didn't tell the respondents we were going to share it with you.
We told them it would never leave the building, and so that's what is going to happen. It's not going to leave the building. There is a lot of pressure on me from time to time. Fortunately, it's not brought by my superiors, but by a lot of people who want to release this information, and by researchers who will say, isn't it enough for us to tell you that we are nationally recognized, and have been for many years, a professional organization, and we have all the means to take good care of your data. Isn't it enough?
I have to tell them no, it's not enough, because we have told the respondents that we wouldn't give it to people like you in the form that would identify them. We can't do it.
DR. HARDING: Just a question and a request. The Interagency Confidentiality and Data Access Group, you are the vice chair?
DR. ZARATE: Yes.
DR. HARDING: Would it be possible for us as a committee to get some of the thinking and minutes and so forth from that group?
DR. ZARATE: We can give you the thinking. The way we have set it up is that --
DR. HARDING: I don't mean all right now.
DR. ZARATE: Okay, but in order for us to really discuss issues like this, one of our features of our charter is that our minutes are not public documents; legitimately so. If a contractor is there, we ask the contractor to leave if it might be a problem. We can certainly share with you our products and our thinking. That would be no problem.
DR. HARDING: I think that would be very helpful to us, and even if we could give you some feedback on that.
DR. ZARATE: Oh, yes.
MR. GELLMAN: A question from the audience, and then we are going to move on to the Census Bureau.
MS. MOLE(?): Deanna Mole(?), Illinois Department of Public Health. In response to the question, just so you don't leave with the impression that there are no cases of actual harm, the Illinois Cancer Registry was subjected to a lawsuit where there was a cluster of neuroblastoma cases, and our Supreme Court has refused to hear it, but there was a rural southern Illinois Appellate Court who determined that cancer registry data by zip code, date of diagnosis, and the type of cancer was non-identifiable, which is also public information.
We are currently getting a FOIA request for the exact same kind of data for other cancers; some of the press association attorneys in Illinois. So to think that this is not a problem, is not true. I think our data probably dumps to CDC or NCHS. I'm sure it does do that somehow.
Imagine that as a different cancer. Neuroblastoma happens to hit children. Kaposi's sarcoma -- there is a lot of information you can get by type of cancer and date of diagnosis and zip code, because for one thing, small towns in Illinois, they have one zip code, 500 people. It is not going to take a rocket scientist to figure out who it is in town.
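The point about zip code, date of diagnosis, and type of cancer can be illustrated with a small count of how many records share each combination of those fields. The following is a minimal sketch, assuming an invented extract. The zip codes, dates, and the function name `unique_combinations` are hypothetical and do not come from any registry.

```python
from collections import Counter

# Hypothetical registry extract: (zip_code, diagnosis_month, cancer_type).
# Every value here is invented for illustration.
records = [
    ("62900", "1997-03", "neuroblastoma"),
    ("62900", "1997-03", "lung"),
    ("62900", "1997-03", "lung"),
    ("60601", "1997-03", "lung"),
    ("60601", "1997-03", "lung"),
    ("60601", "1997-05", "breast"),
]

def unique_combinations(rows):
    """Return the field combinations that occur exactly once in rows."""
    counts = Counter(rows)
    return sorted(combo for combo, n in counts.items() if n == 1)

# Records that are the only one of their kind are the ones a neighbor
# could plausibly attach a name to.
singletons = unique_combinations(records)
for combo in singletons:
    print(combo)
```

In a town with one zip code and 500 people, a combination that appears once in the file is likely to correspond to one recognizable person, even though no name or address is present.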
DR. ZARATE: To my knowledge, we do not have that data, but I would be glad to talk to you to make sure, because that concerns me.
There is another case of direct concern to us that just occurred very, very recently. I happen to sit on the New York State Disclosure Review Board that oversees releases of their SPARCS data system. The Supreme Court -- I think it's the Supreme Court of New York; it wasn't the Court of Appeals, because they are going to appeal the case -- in response to an FOIA request by The New York Times, has told the SPARCS system that information which we would refer to as identifiable, though not direct identifiers, must be released.
So a question of harm is something else again, Dr. Korn, but they are very concerned. We have been trying to supply them with information that we have at our disposal to buttress the idea that the information is still identifiable, but the court did not accept their argument that it was identifiable.
MR. GELLMAN: This is a perfectly legitimate subject for debate, but I think we can postpone some of it until this afternoon.
Janlori, would you like to identify yourself, and say whatever you like.
MS. GOLDMAN: If we are going to save it for this afternoon, then I will, but there is another case --
MR. GELLMAN: Why don't you identify yourself, so everyone knows who you are?
MS. GOLDMAN: I'm Janlori Goldman. I have started a health privacy project at Georgetown University Medical Center.
There is a case that I have just recently become aware of in Mississippi which sounds very similar to the one that is ongoing in New York, but may have a slightly different result, where the media has been engaged in a lawsuit with the state of Mississippi's Public Health Department to try to get access to data from the public hospital system there, going back decades.
Their goal is to look at race-based discrimination in the public health system and the hospital system in terms of how treatment was delivered. They want access to identifiable data, not just to look at the demographics, but also to be able to contact people and interview them.
This is an ongoing debate, obviously, within the privacy and the Freedom of Information Act community about what is considered confidential, and when an individual's privacy is outweighed by the larger public good in disclosure.
This lawsuit is going, but so far the courts in Mississippi have ruled that identifiable data, because it is held by a public agency, should be released under these circumstances. We are talking about the medical record. We're not just talking about name, address, and maybe diagnosis, but the record.
There are a number of people who are involved in this lawsuit. The Mississippi American Civil Liberties Union has decided to come in on the side of the media, asking for disclosure. So this is a tough one. I don't know, Dr. Korn, if you would consider that an abuse of public health data. I certainly would, but I think that this is the kind of thing that we're going to be struggling with. It is not yet fully resolved, but it is not looking very good.
MR. GELLMAN: Just a word, not all state FOIA laws have privacy exemptions. Some of them do not have them at all, and that can lead to this kind of result.
Let's go to the Census Bureau. Easley, can you talk about what you do and how you deal with some of these problems?
MR. HOY: Yes, thank you. I want to thank Alvan for mentioning a lot of things. Some of the things he has mentioned are at the Census Bureau.
Let me kind of back up a little bit about the Census Bureau. We collect, I would say, most of our data. It is not all, because we also contract out to other agencies such as NCHS and so forth, and there are different confidentiality statutes. Most of our surveys we collect under what we call Title 13, for specifically, the Bureau of the Census.
It has quite a few provisions in it, but the one that most people would be interested in here is the confidentiality. There are two sections in there that talk about confidentiality. So basically on our questionnaires we have a pledge of the confidentiality of the data. So that is our statute.
Now on the other side of the coin, as you well know, there are always data users who are always asking for detailed data. There are different forms of detailed data that go out. Let me just hold that for a moment, and describe to you our Disclosure Review Board that Alvan referred to.
Our predecessor to the Disclosure Review Board was what we called a Micro Data Review Panel, because at that time, years ago we would -- I guess in the 1960s we started putting out public use micro data files, first with the decennial census, and then later with respect to our demographic surveys.
So there was a Micro Data Review Panel that was made up of employees of the Bureau. They were charged to review the content of these micro data files to determine whether or not information would be a breach of disclosure.
Now talking about disclosure a little bit, there is a reason why we are the Disclosure Review Board, and why you hear the terms "disclosure limitation" and "disclosure avoidance"; that is done on purpose. You cannot 100 percent prevent disclosure.
The big thing is how can we minimize the risk -- and we talk about risk here -- so that we can still provide information to the data users, but at a minimum risk to our data suppliers, if you will? I guess the one thing I would like to say about that is in addition to the lawsuits, which are obviously a very serious thing, as a data collecting agency we are also concerned about public cooperation with our surveys and censuses.
I don't know if you have heard, but there have been incidences where in certain countries there is a revolt about providing data, because of concerns about confidentiality. If you ever get people to decide I'm not giving you that information, we are all in trouble. So that's in addition to your lawsuits.
Let me continue on with the Disclosure Review Board. Basically, we have now nine members, and we have three permanent members, myself as a chairperson; I have a person on my staff who is responsible for disclosure avoidance research techniques; we have a person from the policy office of the Bureau, and these are the three permanent members. We also have members from the decennial area -- they put out a lot of data of course -- two members from the demographic surveys area, and three members from our economic survey area.
We also conduct an economic census, as well as economic surveys. A big distinction there, I might add, is that in the economic area we very rarely put out micro data. It is almost all tabular data. Certainly one of the reasons is because of the skewness of the data. If you pick up our economic data, it is very easy to identify. Even with tabular data, it can be easy, and we go through excruciating automated computer review for both primary and what we call complementary disclosure on tabulations, not to mention micro data.
Basically, we meet once a week at a regularly scheduled time and room, and we usually get at least two requests a week, because we are putting out a lot of data products. Some of them are just updates of prior requests; some of them are new requests; some of them are major revisions of a request; and some of them even involve research. Some of the more unusual requests that we look at involve they want to do research, and they want to give it to a contractor or another agency or somebody like that to test something. Sometimes we have to look at that.
I guess our philosophy on anything that is a public file is that it's a public file. Once it leaves our doors, anybody can do whatever they want with it. So we are very cognizant of that fact, and are very conservative about that. It's a public file, and we treat it like that, even with, as Alvan mentioned, some of our agencies who contract to us. If certain things weren't discussed or mentioned in the early negotiations of launching the survey, sometimes you can get surprised, and have that detail across tabulation, so that's important.
You mentioned the checklist. Yes, we have a checklist that we ask the person who is doing the processing to generate the data files, we ask them certain key questions based on our experience over the years as to things they could give us information of what's on that file, and what kinds of possible tabulations could come about because of that. Certainly our main concern is the sparsity of data. If you can get a table full of ones, you've got a problem; a potential problem anyway. So we have that.
I need to describe a little bit about our PUMS files, our public use files. First of all, the PUMS file is based on the decennial. We have a different term for our demographic survey. We call them public use files, which are PUFs. The PUMS file is a sample. It is a 5 percent sample. We have a restricted geographic size. These are pre-defined geographic areas of at least 100,000 population, and there is no record on there that has any geographic identification on there that would identify an area below 100,000 population.
The PUMS files also had in 1990, a certain amount of traditional noise. I think in 1990, there was some blanking and imputation; a certain amount of that.
I want to mention something about the research data centers, because Alvan mentioned that. In our research data centers, we currently have two. This was instituted I guess after the 1980s. The reason why they were initially instituted was we were looking for consultation with some researchers on how to do some things in the decennial census, and we established the Boston regional office as a site.
The research proposals were deemed to be of benefit to the Census Bureau in carrying out its function. It is a physically secure site. It stands alone, and nothing goes in and out without a special Bureau employee overseeing it. The researchers become special sworn employees. No detailed tables are permitted out. It is not meant to generate detailed tables. It is meant for just strictly research, maybe derived measures, which are reviewed by the Bureau employee before they are taken out.
Also any reports that are written up based on this research are also looked at by the employee, so that we are looking to make sure really no detailed data goes out of those centers.
We have another one that we recently opened at Carnegie Mellon University in Pittsburgh. That has just been opened in the last year. There are plans to increase that number. We are working with the National Science Foundation to set up criteria for accepting new candidates for the research data center. So that's that.
In addition, we have a new challenge coming ahead. For the year 2000, we have what we call the DADS system, which is the Data Access and Dissemination System, which is the idea of having users making queries about defining their own tables. Up until now, most of our tabulations are all pre-defined. The staff looks at the tables. They are aware of the possible disclosure issues. So we have that.
The new challenge will be if you are going to let the user define the tables without human intervention, then we have to put some other constraints up. We have to consider other kinds of constraints.
We still have to look at the sparsity of tables. We still have to worry about the lowest level of geography, and so forth.
I'm sure Alvan is right, we always get a lot of pressure about putting out something, but so far we have maintained pretty much what our rules are.
MR. GELLMAN: Let me ask both you and Al the same question. Are there any micro data tapes that you used to publish in the fifties, in the sixties, in the seventies that you have made a recent decision in light of all of the other data that is out there that has become available, the stuff that Deirdre talked about and others, that you said, we no longer can publish this data tape at all, or in the same format that we used to. We have to make a change. Is that a circumstance that has come up?
MR. HOY: It hasn't come up, but we are starting to talk about it internally. The fact of the matter is, once something goes out the door, you can't pull it back, because people make copies. It's just when you open the door, it's gone.
MR. GELLMAN: But you don't have to put out the same data again the following year?
MR. HOY: Right. We could review the situation given the technology that is present, and we could decide not to put out that type of file again. As I said, it's not to say that just because something happened several years ago, doesn't mean it has to happen now or in the future. Basically, we could change.
MR. GELLMAN: Al?
DR. ZARATE: That's a very good question. I must say that that's one of the items we added to the checklist was whether or not this is a stand alone file, or whether it is part of a series of releases, and whether what has already been released would in an additive way, amount to identifiability. So that is certainly the criterion that we have now.
We have, on a couple of occasions, had to make the hard decision not to release certain items of information, because of the already public availability of details. So it is restricted in the amount of detail we have been able to release in recent years.
MR. HOY: I must say that it's a very relevant question, because your users -- that's one of the arguments they will make, is that why did you let us have it five years ago; why can't we have it now? You can fall into that trap very easily, and I think you have to be aware of the technology that says, well, now we can't.
Yes, there is going to be a lot of pressure. If you put something out there the first time, and you give a lot of detail, and later on you decide you want to reduce the detail, it is a lot of pressure to say no, because they'll use the argument you gave it before. Did you violate your pledge of confidentiality before? So therefore it must not have been a violation, because you did it.
MS. MULLIGAN: Both of you talked about allowing queries to come in, which deals with that front end. We're not disclosing the data; rather we are asking for requests and providing back tabulated data. It seems that, as you pointed out, this DADS system and other systems like that really do raise a lot of questions, because rather than 400 public use tapes, you have I would imagine many times more than that of different types of tabulated data sets that people have generated.
Just from a management perspective, trying to keep a rein on what is available, what's been released, and how those things might be combined just gets inordinately complex. I'm wondering, you talked a little bit about what types of limitations you need to put on the front end, and I was wondering if you could talk a little bit about what you have been considering, or what might be considered to deal with that issue?
MR. HOY: We are still in the middle of a discussion about that, but one of the things that we are talking about in terms of sparsity of data is talking about what the mean cell size is, or what the median cell size is. For instance, if a person asks for different cross-tabs and it comes out to be 1,000 cells, and you only have 1,000 observations, you know that you have got a very sparse table there. I think whatever rule that you come up with, you basically have to say no, I have to deny that request.
We had the median, because we also thought of the idea that someone could jerry-rig the cross-tab such that you would have all your observations in one cell, or 90 percent of your observations in one cell, and nothing but ones in the rest of the cells. So you want to look at that as well.
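The two checks described here, mean cell size and median cell size, can be sketched in a few lines. This is a minimal illustration only. The thresholds `min_mean` and `min_median` and the function name `sparsity_check` are assumptions made for the example and are not Census Bureau rules.

```python
from statistics import mean, median

def sparsity_check(cell_counts, min_mean=5.0, min_median=5.0):
    """Summarize a requested cross-tab and flag it if it looks too sparse.

    The thresholds are illustrative assumptions, not agency rules.
    """
    m = mean(cell_counts)
    md = median(cell_counts)
    return {"mean": m, "median": md, "deny": m < min_mean or md < min_median}

# 1,000 observations spread over 1,000 cells: the mean cell size is 1.
even_table = [1] * 1000

# A rigged request: 10 cells, with 991 of the 1,000 observations in one
# cell and a single observation in each of the other nine.
rigged_table = [991] + [1] * 9

even_result = sparsity_check(even_table)
rigged_result = sparsity_check(rigged_table)
print(even_result)
print(rigged_result)
```

The rigged table has a mean cell size of 100, so a rule based on the mean alone would let it through. Its median cell size is 1, which is why looking at the median as well catches it.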
You have to be a little paranoid to think of these things. That's not to say that we know all the possibilities, but that's why it helps to have more than one person look at it, and say well, they could do this, they could do that. That is how we have to operate.
MS. MULLIGAN: Can I ask a follow-up question to that? Unlike the public use tapes, are these also considered public use, or are these given out in a more limited set of circumstances, with different rules?
MR. HOY: Now the DADS thing is just now beginning of course. I'm just mentioning this. It's under development. So we haven't done the custom query thing yet, but it is being tested now.
I'm sorry, could you repeat your question?
MS. MULLIGAN: Well, where it's a narrow request coming in, I'm wondering like the public use tapes, are they then available to anyone, or are they only made available to the individual that is requesting them?
MR. HOY: As it is being planned right now, it is being viewed like a very public request. Any person in the general public could make this query as it's being planned right now. There are other discussions about the possibilities of registering people or not registering people. That sort of thing is also an option being discussed.
DR. ZARATE: We plan to vet the people before they have access. They would have to submit a proposal, and we would have to know exactly how they want to use this information, what uses they propose to put it to.
MS. MULLIGAN: Does that raise any conflicts between FOIA or the statute that governs you other than FOIA?
DR. ZARATE: No, because as Easley says, once we let them have it, we feel that anybody could have that kind of information. We do try to keep track of these requests, because the other thing which I think was said explicitly is that it is possible for people to call up and customize it, but successive custom jobs will lead to a very rich data set. That can be the very thing that if you had seen it in toto, you would have said, no way, but you only see little bits and pieces.
One of the things that we try to do is to make sure that the people who are in a position to answer these requests recognize that. We have got a confidential manual which we have revised several times, and are currently revising now just for some of the reasons we are talking about today, and Latanya is sensitizing us to, in which we remind people what are the rules when you release tabulated data?
They can't all be in one cell. The row totals or the column totals, that has got to be changed in the ways in which it is done, and tables can't be linked in order to put them together and get something greater than the sum of one table.
So we try to make sure that people are aware of this. I, for instance, have gotten into the habit -- I've done it twice now -- of putting out a little question and answer to the staff in general. I give them questions to ask, and the answers, I just tell them what the situation is. One of them was on identifiability, is it true or false that if you just take a name and social security off it, it is no longer identifiable?
I don't know that I'm really raising the knot. I'm certainly sensitizing them to when they've got an issue or a problem, come see Al, or go see someone else and ask about this. The problem with any statistical agency is that not all of these requests come through one central spot. I guess you are raising that issue. So you have to make sure that all those spots are covered, and it is not easy.
MR. GELLMAN: Let's take one more comment, and then we are going to have to break. Janlori?
MS. GOLDMAN: About a decade ago when the Census Bureau was preparing for the 1990 census I participated in a series of meetings, the purpose of which was to look at how the private sector's information uses had become so sophisticated and so powerful that the Census Bureau was trying to re-evaluate how information was released.
So it wasn't a question of should we withhold information that we have previously released, but should we somehow try to re-examine how we release it. There was talk about masking data, skewing data when it looked like there was a greater potential for identifying in small data sets.
I realize that, given that that might mean that some information that was released was either inaccurate or incomplete, there was an understanding that we needed to balance that against the risk of identifying individuals.
I'm wondering how, given I think we know much more now a decade later about how powerful those technologies are, and when you mention other countries, and people being reluctant to participate in the census, we know that that is happening here. The 1990 census had the lowest participation rate of any census, and at least at that time there was some belief that that might be linked to concerns about confidentiality, given the long form and the questions that were asked.
What are the standards, and are they being re-evaluated to look at, at what point do you mask or skew data? Do you try to withhold information, or actually change information? When do those particular kinds of reactions take place?
MR. HOY: I think generally the tools that we have for the most part are to recode data. When it gets to too much detail, you recode data. You top code some continuous variables. I think those are still being used. It's a two-edged sword, because basically if data is analytically valid, then there is a higher risk associated with disclosure. If you obviously put too much noise in there, then the data are less valid analytically. So it's a very fine line that we walk.
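As an illustration of the recoding and top-coding Mr. Hoy describes, a minimal sketch might look like the following. The field names, the five-year band width, and the 150,000 income cap are assumptions chosen for the example, not Census Bureau rules.

```python
def top_code(value, cap):
    """Return the value unchanged, or the cap if the value exceeds it."""
    return min(value, cap)


def recode_age(age, width=5):
    """Collapse an exact age into a band such as '40-44'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical microdata records (not real data).
records = [
    {"age": 42, "income": 38_000},
    {"age": 67, "income": 410_000},
]

# Released version: ages are banded, very high incomes are capped.
released = [
    {"age_band": recode_age(r["age"]), "income": top_code(r["income"], 150_000)}
    for r in records
]
```

Both steps trade detail for protection: the released file still supports most tabulations, but an unusually high income or an exact age no longer singles anyone out.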
I think what we are doing for the year 2000, or plan to do for the year 2000 versus 1990 is mostly in the long form data, where before we just blanked and imputed, I think we are swapping data now. We are matching records that might be unique or semi-unique, and swapping some data. That is one of the things we're doing that is a little different this time, at least plan to do different this time than last time. It's just a constant struggle.
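The swapping idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Census Bureau's actual procedure: the function `swap_unique`, the field names, and the rule of pairing unique records in file order are all hypothetical.

```python
from collections import Counter


def swap_unique(records, keys, swap_field):
    """Swap `swap_field` between successive pairs of records that are
    unique on `keys`. Returns new records; the input is not modified."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    unique_idx = [
        i for i, r in enumerate(records)
        if counts[tuple(r[k] for k in keys)] == 1
    ]
    out = [dict(r) for r in records]
    # Pair unique records in order; an odd one out is left unchanged.
    for a, b in zip(unique_idx[0::2], unique_idx[1::2]):
        out[a][swap_field], out[b][swap_field] = (
            out[b][swap_field],
            out[a][swap_field],
        )
    return out


# Hypothetical records (not real data).
records = [
    {"tract": "A", "age_band": "85+", "occupation": "judge"},
    {"tract": "A", "age_band": "30-34", "occupation": "clerk"},
    {"tract": "A", "age_band": "30-34", "occupation": "clerk"},
    {"tract": "B", "age_band": "20-24", "occupation": "pilot"},
]

swapped = swap_unique(records, keys=("age_band", "occupation"), swap_field="tract")
```

In this sketch the number of records in each tract is unchanged, but the two records that were unique on age band and occupation now appear in each other's tract, so a match on those characteristics no longer reliably points to the right location.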
MR. GELLMAN: Let's stop at this point. We are going to come back in 10 minutes and hear from the state people.
MR. GELLMAN: We're going to hear from some of the people who are doing data activities at the state level. Michael, do you want to begin? Tell us what you do. Talk about your data activities sort of along the same lines as what we have heard.
MR. LUNDBERG: All right. Virginia Health Information is not a state agency. We do operate under contract to the Commonwealth of Virginia. Our board of directors is made up of consumers, business, hospitals, health insurance companies, nursing homes, physicians, and state representatives, of which the commissioner of health is on our board of directors, and is also with the Joint Commission on Health Care, which is a permanently funded legislative group.
We were formed in order to maintain Virginia's patient-level data system in 1993, which is a hospital discharge data set. We were the 38th state to have such a mandate to collect hospital discharge data. So this has actually been going on in one form or level in a number of states. For about 15-20 years, states have been engaged in this process.
There are approximately 40 states now that do have such mandates. We were the 38th state. Because of that, we did have the opportunity, as we were crafting legislation in Virginia, to be cognizant of the structures and the successes and failures of other states, to take into account patient privacy issues and try to craft legislation that would give us the ability to create information for people to use, but also take into account issues to the extent possible on access to information, as well as privacy issues.
The information we get is primarily information from hospital billing claims forms. It's in administrative data sets that they have there. There is information on diagnoses and procedures and charges, the physician that is providing care, the hospital that is providing care. We do not collect the patient's street address. We don't collect the county residence or the city. We do get patient zip code.
MR. GELLMAN: Do you get names?
MR. LUNDBERG: No, we do not.
MR. GELLMAN: Social security numbers, case file numbers?
MR. LUNDBERG: We do get the patient social security number. Now I'll talk about restrictions. Essentially, we never release that. We cannot release any of that information. I'll talk about the reason we do collect that information in a moment.
From that information, again, we don't have most identifiable information that you think of strictly as far as the addresses and things. Some states do collect another level, an entire level, the street address, but we do get patient social security numbers.
We do create something called a public use file and a research file, but again, I would like to clarify, we are not a state agency. When I say public use file, that doesn't mean once it is out of the door, you can't get it back and there aren't restrictions on it. We don't sell the data. We don't give it away. We do license the data.
We are working with the state attorney general and with our own lawyers about the process. Licensing the data afforded an additional level of protection of which you could restrict anyone from making any attempt, at least in writing in their hand on paper, that would ever try to identify individual patients, if there was anything there to do.
We cannot by statute, release any information that reasonably could be expected to identify a patient. Once said, the information we do collect is by statute, publicly available with the restriction that we cannot release any social security information under any conditions.
We have a research file derivative that is identical to what I call the public use file -- and again, I use that term lightly -- that does have an encrypted social security number. It is a similar process to what the Health Care Financing Administration has traditionally followed for their research files, and it's fairly heavily restricted.
We do not release any information for linkage by secondary sources. We do have examples in which we have worked with the Division of Motor Vehicles, and currently are in the process with the Department of Rehabilitative Services where we are linking data that they give us in order to provide summary statistics back, but we do not return any information that would give them additional information on individual patients that they had, or individual clients.
For example, with the Division of Motor Vehicles we were able to use linkage of social security number to help them quantify the overall cost of driving accidents; drunk driving versus age group driving versus other things to help support public policy. I don't have to spend much time with this group to let you know that one of the reasons we have seat belt laws in this country is because of linkage of these types of data sets, so there can be value.
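The pattern described here, linking on social security number internally and returning only summary statistics, might be sketched as below. The function name, record layout, categories, and the obviously fake identifiers are assumptions for illustration; they are not Virginia Health Information's actual process.

```python
from collections import defaultdict


def linked_cost_summary(crash_records, discharge_records):
    """Link crash records to hospital discharges on SSN and return only
    per-category totals. No SSN or person-level row is returned."""
    charges_by_ssn = defaultdict(float)
    for d in discharge_records:
        charges_by_ssn[d["ssn"]] += d["charges"]

    summary = defaultdict(lambda: {"linked_people": 0, "total_charges": 0.0})
    for c in crash_records:
        if c["ssn"] in charges_by_ssn:
            row = summary[c["category"]]
            row["linked_people"] += 1
            row["total_charges"] += charges_by_ssn[c["ssn"]]
    return dict(summary)


# Hypothetical inputs with made-up identifiers (not real data).
crashes = [
    {"ssn": "000-00-0001", "category": "alcohol-related"},
    {"ssn": "000-00-0002", "category": "alcohol-related"},
    {"ssn": "000-00-0003", "category": "other"},
    {"ssn": "000-00-0004", "category": "other"},
]
discharges = [
    {"ssn": "000-00-0001", "charges": 12000.0},
    {"ssn": "000-00-0001", "charges": 3000.0},
    {"ssn": "000-00-0002", "charges": 15000.0},
    {"ssn": "000-00-0003", "charges": 4000.0},
]

summary = linked_cost_summary(crashes, discharges)
```

The requesting agency receives only the category totals. A summary cell with very few people in it could still be revealing, so a real release would presumably also apply a minimum cell size.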
Another project we are involved with is spinal cord injury. The Department of Rehabilitative Services is linking the information on spinal cord injuries in Virginia over a several year period with the inpatient hospital discharge records, then charting over time, the number of rehospitalizations.
There has been concern that in today's health care environment, that people are not getting the same level of access at home and other services as they did before, and that people who have suffered spinal cord injuries might be having additional admissions, or might be suffering untoward effects of lack of condition. So that's one of the purposes for that. When we finish this project, they get summary information, but again, not linkable information back to that. So those are some of the linkage projects that we have.
From this information -- I spoke about the data files. We do a lot of custom reports for people. Those are at a much higher level. We generate summary information by geographic region or sometimes by hospital. We don't have anything personally identifiable from those.
We have public reports that are created. We had a 1996 study on obstetrical care that was public, that had information at the hospital level. We will have a 1998 study, a repeat of that, that will go down to the physician level and some health plan groupings. We'll have other studies in the future on cardiology and others that come there, as well as other things we do with other derivatives of financial data that we have.
MR. GELLMAN: Ben, why don't we go right on with you, and then we can do a compare and contrast between the states.
MR. STEFFEN: Before I begin my description of the data we collect, I thought I would step back and give you something of a quick and rough ride over the health care landscape in Maryland, and tell you what our responsibilities are. I think it is important to not uncouple data collection from the responsibilities of an organization, and look at the purposes behind it.
We were formed in 1993, as part of the Health Care Reform Act that passed in Maryland, and not many other places in the country. Our commission was set up and was given basically six major responsibilities: the development of HMO report cards; the reform of the health insurance small group market in Maryland; the creation of standard protocols, that is, practice parameters for physicians; the regulation of electronic data interchange; and the development of health care information systems on health care, professional, and physician charges.
The last provision was really seen as the tool by which you would look at these other reform activities. Certainly for the development of a physician payment system, you need to have information available on charges, information available on services to devise such a system.
For practice parameters you need to know something about the variation in services based on diagnoses. Likewise, if you are talking about health care reform and expansion of insurance, you certainly want to know something about what is happening to insurance as a result of that initiative. What types of services are being provided? How does small group insurance compare? How does a small group market compare to the large group market? How does the private compare to government payers?
We began an incremental process of moving towards the data collection effort. We had a number of public sessions that devised a strategy. The approach was to move incrementally, with a voluntary data collection effort that went on for two years beginning in 1995, with the submission on a voluntary basis from HMOs and private payers.
This culminated in 1996 with the passage of regulations that required all major payers, both private insurance companies and HMOs, to submit a limited amount of information on physician charges to the commission.
It is noteworthy at this point to emphasize that in Maryland -- Michael mentioned Virginia being one of the last states to have data collection on hospital information -- in Maryland we were one of the first states to collect information on hospital discharges. That effort had gone on for more than nearly 20 years before the physician data collection effort got underway.
So the view in terms of the two data systems is that they would augment each other. That information on institutional services would support analyses on physician charges and vice versa.
The data we collect is similar to a UB-92 form. It is based on the HCFA 1500 form. It includes, however, not the complete set of information that a provider would submit to an insurance company, but a limited subset of that. It includes basically three pieces of demographic information: year and month of birth; zip code; and gender. In addition, an encrypted identifier from the insurance company is required for editing purposes. We collect information on the service --
MR. GELLMAN: Wait a minute, the encrypted identifier -- can you be more specific about what that is?
MR. STEFFEN: The encrypted identifier is a derivative of the identifier an insurance company or HMO would use for identification of the patients in the insurance, however, it is encrypted before it is passed along to our commission.
MR. GELLMAN: Would different providers have different numbers that you use?
MR. STEFFEN: We definitely know that different insurance companies have different formats for their patient identifier. There are no specifications that we provide to an insurance company on how they encrypt it. The instructions in our regulations are that they must be encrypted.
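Since no encryption method is specified, the following is only one way a payer might derive such an identifier: a keyed hash (HMAC-SHA256, which is strictly a keyed one-way function rather than reversible encryption) of its internal member number. The function name, keys, and member number are hypothetical.

```python
import hashlib
import hmac


def encrypt_member_id(member_id: str, payer_key: bytes) -> str:
    """Derive a stable pseudonym from a payer's internal member ID
    using a secret key known only to that payer."""
    return hmac.new(payer_key, member_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secrets held separately by two payers.
key_a = b"payer-a-secret"
key_b = b"payer-b-secret"

token_a1 = encrypt_member_id("123456", key_a)  # first claim, payer A
token_a2 = encrypt_member_id("123456", key_a)  # later claim, payer A
token_b = encrypt_member_id("123456", key_b)   # same number at payer B
```

Under this assumption the same patient receives the same token on every claim from one payer, which is enough for editing and de-duplication, while tokens from different payers do not match, so records cannot be linked across insurers.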
The information on services relates to diagnosis, CPT code, allowed and billed charges, amount reimbursed, and the patient's liability.
We are currently, in terms of the full data collection, in the first year of it. The 1996 data was provided in June 1997. We are devising a report that will be released next month on expenditures in the state. Previous reports on the voluntary data have been released for the last two years.
We do not currently release any detailed, that is, person-level information. We have convened a privacy panel that came up with some recommendations on a limited release of the information. Because we do not have full data collection, we have chosen not to act on those recommendations up to this point.
I think it is also important to recognize that we have additional hurdles that we will have to meet before we consider any data release. Number one, is the quality of the information provided by the insurance companies and HMOs in the first year of the data submission -- as you might expect, it was not perfect. We will be resolving those issues before we would consider a release of the information even at a mildly aggregated level.
That basically concluded what I had to say about the data collection. I think it is in its infancy. I think we have learned some things on the process of going through this effort. We use the hospital model of data collection, that is information collected for insurance purposes was used as the basis for developing our collection format.
Information that had for many years, been acceptable for hospital discharges, generated considerable controversy when it was applied to physician information. In particular, information such as date of birth, which is routinely collected on hospital data sets, created considerable outcry. Information on ethnicity was also a variable that exists on many hospital discharge data sets, but created some discussion and debate in Maryland. We ultimately decided not to collect that information.
I think that as you move from data source to data source, you have to not only look at what is collected, but the scope of the data collection. Certainly, when you are considering physician information, both inpatient and outpatient services, you are encompassing a much larger portion of the population.
MR. GELLMAN: Can you talk a little more about the use of the data? How has this affected public policy, state decision-making, private decision-making, what have you?
MR. STEFFEN: What I would say is that it has affected it on two levels. One, the annual report that we have published gives state policymakers a better understanding of where health care dollars go in a very general way. It also identifies high cost procedures; how health care differs by age category. This information I don't think would be available. It wouldn't be possible to publish information such as this if it was not submitted to us in a detailed format.
We are using this data set to also augment information that we gather from HMOs using the HEDIS information set to compare and contrast how HMO services are different in the state. I think that will go a long way towards providing much more complete information on the quality of services provided by HMOs that operate in Maryland.
MR. GELLMAN: Now in Virginia you collect the social security number, which means that you can link records across time and space, but you can't, is that right?
MR. STEFFEN: The identifier that we collect is not common across insurance companies, so we could not link information.
MR. GELLMAN: So your data is much less identifiable than the Virginia data?
MR. STEFFEN: Yes, I would agree that that's a correct characterization, that we are limited to services provided. In terms of looking at per capita expenditures for instance, we are limited to services provided by a single payer.
MR. GELLMAN: Let me ask you about the structure. You say you are not a state agency.
MR. LUNDBERG: That's right.
MR. GELLMAN: What are you? Are you a private company?
MR. LUNDBERG: We're a not-for-profit organization. As I mentioned, our board of directors comes from consumer groups that nominate representatives, as well as business groups, hospital-nursing home-insurer groups, state representatives, and others. In the statute in Virginia it requires that the organization that collects this information shall represent a public/private partnership and represent these different groups.
The concept involved in Virginia is to have all the different health care stakeholders involved in the process of designing information that is to be collected, as well as information projects that go through there. So we are under contract with the Virginia Department of Health.
MR. GELLMAN: Other comments, questions?
MS. MULLIGAN: I was wondering, I think I probably was aware of kind of the formation of your entity, and I was wondering if you could talk about some of the decisions that were made that you be a not-for-profit organization. I think there were FOIA issues that came up. I think there were contractual issues about release of data, and whether or not thinking proactively about how you limit access to data, and how that came into play on the front end, rather than on the back end.
MR. LUNDBERG: Much of this had to do with a desire of all the health care stakeholders, particularly business and providers, to remain at arm's length from the state on this. They wanted the protection of the state as far as the statutory authority to collect this information. They wanted to be put in the position to have an additional measure of autonomy and be removed from the concept of FOIA. We are not subject to FOIA. We are a private organization, and we don't meet those characteristics. So those were some other things that were done on the front end.
It was felt that the state needed to have access to some of this information, but it was felt that they did not want the state to control that. So that's the 30 second gist of the decisions that went on there.
MS. SWEENEY: I wanted to make two observations and have your comment on them. The first observation I wanted to make in general, because I think David Korn's original question certainly always stands as a beacon over these questions of where is the risk, where is the harm, that kind of thing.
The first observation I want to point out, and I also want to say that I have had the opportunity to work with Laura and Bill Winkler(?) at the Census Bureau, and their work is absolutely excellent. They do a fantastic job, but I want to underscore something. There is a dramatic difference in the kinds of databases that were talked about by Al and Easley, versus the kinds of databases that you guys are talking about.
For example, the kinds of statistical disclosure techniques that Al and Easley talked about employing distort the data. They talk in terms of adding noise. They talk in terms of swapping. In a gross way we could say, gee, if I swapped around gender, I turned boys into girls, or if I swap ages, I may have a 10 year old boy giving birth to a 50 year old woman.
So the collection of these kinds of databases, the demand for them is also coming from new places; places that Al and Easley never had to address. The demand for this data is coming from economists who say give me all the data on all the people, and I'll save you some money. It is coming from computer scientists who say, gee, give me all the data on all the people, and I can build better decision-making tools for doctors.
These are all legitimate good things for society, but it is important for us to understand that the nature of the databases are totally different, and the nature of the kinds of requests that are made are totally different. So to underscore my point, if I were to ask you even in the conversation about the type of disclosure, where is there protection, they are saying that their protection is in the data elements.
The other guys who had far fewer data elements -- we didn't talk about the number of data elements in Al and Easley's databases, but they are tiny in comparison to a medical record, which is typically 300 fields or more.
DR. KORN: Okay, and therefore?
MS. SWEENEY: Therefore, the vulnerability of this medical data versus the Census data or the statistical data is much higher.
DR. KORN: Because there are more elements in it?
MS. SWEENEY: Right. The more I know about you, the easier it is to point you out. If I say it's a guy in this room; if I say he's sitting at the corner at a table; the more characteristics I add, the easier it is to point any particular individual out. It is not sufficient to limit myself to the same type of requirements that Al and Easley used.
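The point that each added characteristic makes an individual easier to single out can be checked on a toy file by counting how many records are unique on a growing set of fields. The records, field names, and the helper `fraction_unique` below are invented for illustration.

```python
from collections import Counter


def fraction_unique(records, fields):
    """Share of records whose combination of `fields` occurs exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


# Hypothetical people (not real data).
people = [
    {"sex": "M", "zip": "20201", "birth_year": 1950},
    {"sex": "M", "zip": "20201", "birth_year": 1962},
    {"sex": "M", "zip": "20002", "birth_year": 1950},
    {"sex": "F", "zip": "20201", "birth_year": 1950},
    {"sex": "F", "zip": "20201", "birth_year": 1950},
]

by_sex = fraction_unique(people, ["sex"])
by_sex_zip = fraction_unique(people, ["sex", "zip"])
by_all = fraction_unique(people, ["sex", "zip", "birth_year"])
```

In this toy file nobody is unique on sex alone, one in five is unique on sex and zip code, and three in five are unique once birth year is added. How fast that share climbs in a real data set depends on the population and on which fields an outsider can actually know.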
MR. STEFFEN: I don't know how many data elements are collected in Virginia. In Maryland on the physician information that I'm most knowledgeable about we collect 32. My experience with hospital discharge data sets is they are somewhat more extensive, but not in the range of 300 perhaps.
MS. SWEENEY: Well, let me just clarify. The NAHDO list is a small set of what the recommended core set was. That is the minimum that the 37 states who collect hospital level data use. So we certainly know that your collections are in there. I actually have your data schema, so I have a sense of its size, but they do go up from there, and some of the ones I have seen have been as large as 300. A full information system in a hospital is around 300 elements.
DR. KORN: I think Latanya Sweeney is raising a very central issue, and since I have already pleaded my ignorance of what is the sophistication of this stuff, is it absolutely clear that a 300 element data set derived from a medical record is in some quantitative way, more probable to enable you to identify the source of that data than one that has 150 elements or 100 elements?
MS. SWEENEY: Well, I wouldn't want to make sweeping statements, because --
DR. KORN: Well, you did, but that's what I'm asking you.
MS. SWEENEY: So let me qualify the statement. When I look at the full medical record, and it has 300 elements, and it has clinical measurements, which is blood pressure taken every 15 minutes, which is not the data that you necessarily get, or if it has those kinds of things, it might not be the case that I can recognize anyone whose blood pressure was taken every 15 minutes. On the other hand, if someone's blood pressure was very erratic, and someone worked in the hospital, they may very well be able to detect them.
So in the general sense the answer is always yes, if the data had meaningful content, and someone may have knowledge of it. The problem is you never know what other people know. The more they know, the more they bring to bear on your data.
So when Al and Easley keep looking at the demographics, and they talked a lot about age and gender and geographical location, what they are focusing in on are things that they know people would link on. That's what they know that someone going from a state release of information could link on.
In the days of computers being so cheap, and large storage spaces being so large, the point that Deirdre made is look how much more information everybody is keeping on everything. If I'm an HMO and I'm the largest HMO in Ben's database, I know my people, and now I can do inferences on others as well.
DR. KORN: Okay, we should come back to this later, because I think this is really a terribly central point to this whole discussion. I would just make one comment though, which I'm going to make again later. That is, for me as a scientist thinking about the sweep of biomedical and medical science, the integrity and accuracy of the input data is the most central quality that determines whether the effort is worth a damn, and whether the results are going to mean anything.
When I hear people talking about swapping gender or ages or addresses or this or that, I get to the point where I ask myself, well, what in the heck are you collecting the data for, and what's the use of it? Now I understand that there are things that you can mask that aren't going to affect particular kinds of uses of data. I understand that.
You get to the point where you can swap and change and mask so much that you essentially have a corrupted data set that isn't worth the effort of collecting it in the first place.
MR. GELLMAN: Let's come back to that.
MS. BREITENSTEIN: One of the things I wanted to raise, and actually we had had a discussion about this in the break, is the extent to which the question of how much something is identifiable in my mind, has to be the second question. The first question is how do you protect the people who are trying to make it dis-identified or de-identified?
The analogy for me would be if you require that a bank has locks or security, you also make the stealing of money illegal. What I mean by that is we have not looked at in any of the legislation or any of the policy, is saying that the re-identification of data is going to be a prohibited activity.
When someone like the Census Bureau releases data which is intended not to be re-identified, that there be any accountability on the part of the people who take that data, who access that data, not to go to the extraordinary efforts, and I think actually after all the work that is obviously being done, that has to be fairly extensive efforts, that there be some prohibition.
There should be some legislative prohibition that says that if you take data from the census, you take it on the understanding that they have made an attempt to make it safe, and that you, as a user, are not allowed to then make those efforts for naught.
DR. ZARATE: That is in fact what we do with our public use tapes. In the micro data tapes you can't access the data without seeing a screen that says that NCHS is dependent upon statistically people having the confidence in us to give us accurate information. We are also legally bound not to identify.
If you should inadvertently, in spite of everything that we have done, come across an identity, we ask you to please notify us immediately, and not make any use of that information. We also advise them that they are subject to legal sanctions, which are not strong enough in my mind, but they are subject to whatever the law can do to them if they do that.
So our previous attorney used to drum it into our head the requirements of our statute travel with the data, and they never really leave it. Having said that, once you give it to someone, you cannot insure that that is kind of tattooed into each and every data element in that file.
So that whoever uses it or any derivative part of it -- we are always worried about this -- that someone who sees the file given to them by a professor at a university, who has completely conformed with what we asked them to do, then asks the students to look at it. Now it is beyond our aegis.
MS. GOLDMAN: Are you saying that if the information is released, that you think that there is a legal prohibition on that information being matched with information from other databases? Let's say people use the information that you give them, and they want to match it with Census data, with DMV data, with information that they have, are you saying that it is prohibited that they do that if they end up identifying?
DR. ZARATE: No, we ask them if they should inadvertently identify a person in whatever use to which they put those data, that they tell us immediately so that we can take steps.
MS. GOLDMAN: What is the legal prohibition on identifying somebody? Not whether it is inadvertent in the data that you have given them, but whether they then identify somebody by matching and linking the data?
DR. ZARATE: The legal prohibition is that the data which we release should not be used to identify an individual. We ask them not to use the data for that purpose.
DR. KORN: And there is statute behind that?
DR. ZARATE: There is Section 308(d) of the Public Health Service Act.
MS. GOLDMAN: Has there ever been any action taken against a recipient where there was a concern that they did use the data to identify an individual?
DR. ZARATE: No, not that I know of. We have our two attorneys here though.
DR. DETMER: I would like to follow-up on a question that Bob Gellman asked Alvan and Easley. That is, have you had in your two states or are aware of other state databases either instances of problems or actual harms to the extent that we talked about on these others as the use and availability of these data?
MR. STEFFEN: The physician data is very new, and there have been no instances of harm being done as a result of use of the information. Of course, it is important to note that we have not released any detailed data as of this point.
We also went back and looked at the 20 year history on the hospital data set, and we were not able to identify any instances where harm was done from the public release of a limited set of information from that database.
Now one thing that has happened in Maryland was there was this point bouncing around that in Maryland there had been a release of information from a cancer registry by a banker. Some effort was spent on trying to track that down, and what had actually happened was that at a risk analysis conference in Rockville, Maryland in I believe it was the summer of 1992, a risk management analyst had reported that there was an incident of release of information by a state health care commissioner -- never mentioned the state.
That trickled into the public record. It was identified as Maryland being the state. The only thing we were ever able to track down, and we went all the way back and talked to the person, was the conference had been held in Rockville, Maryland, and that is the way it was identified with the state.
MR. GELLMAN: That has actually become quite a common horror story that is passed around. I have always had my doubts about it, because it didn't make any sense.
MR. STEFFEN: The story would seem to point to release of information from a cancer registry or SEER data set, and there was no basis that we have been able to find.
MR. GELLMAN: Have you documented your inability to document that? Do you have a report or a document on that?
MR. STEFFEN: We can supply information on who we spoke with and how we tried to track that.
MR. GELLMAN: Scott, did you have something?
DR. WETTERHALL: Scott Wetterhall at the Centers for Disease Control. Most of today's discussion thus far has really focused on inferential identification of persons in data sets, and I think that's an important area to focus on. It hasn't dealt with the protection of privacy and confidentiality when the data is collected, which would be the example of the leaking of AIDS data in Florida.
The point I want to emphasize is that any inferential identification is always going to be very context driven. What we are talking about is probability of identification. That is the thing we are concerned about. A 300 field record is worthless if those 300 fields merely measure the length of random hairs plucked off of a person's head, but other types of information within a certain context, a 300 field record would be highly identifiable.
Similarly, Al deals with a data set for which, under certain statutory limitations, he can only release certain data. He may not have any sort of control over the person who is able to get a data source from another particular supplier, and then thereby link them.
So my comment is that I think as we look at these questions, we really need to think of them in terms of the sort of context which is always going to drive how identifiable a particular record is from an inferential standpoint. It is why we spend so much effort. As I think Easley said, or Al said, we always make sure more than one person looks at this, because it is an inferential process that is occurring within a complex context.
MR. FRIEND: In our efforts to address the issue of non-identifiability in terms of privacy, a question for all the four, both federal and state, to what extent are you protecting privacy, addressing non-identifiability through technical solutions versus what I will call in the business, contractual solutions. Technical meaning there are statistical ground rules as far as minimum cell sizes; that there are ground rules in terms of what is collected; aggregation rules versus the contractual side as far as Al, you just said, that there is a legal provision that prohibits an improper use.
Each of you to some extent have touched on it, but I think this may bring some focus to this balance of addressing non-identifiability either through a technical measure, or addressing it through contractual ground rules or legal ground rules.
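One of the technical ground rules mentioned, a minimum cell size, might be sketched as below. The threshold of 5, the field names, and the function `tabulate_with_suppression` are assumptions for illustration, not a rule any of the participants stated.

```python
from collections import Counter


def tabulate_with_suppression(records, fields, min_cell=5):
    """Count records by `fields`; cells below `min_cell` are reported
    as None (suppressed) rather than as a small number."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {cell: (n if n >= min_cell else None) for cell, n in counts.items()}


# Hypothetical discharge records (not real data).
records = (
    [{"zip": "22901", "sex": "F"}] * 6
    + [{"zip": "22901", "sex": "M"}] * 2
)

table = tabulate_with_suppression(records, ["zip", "sex"])
```

This sketch applies primary suppression only. If row or column totals are published alongside the table, a suppressed cell can sometimes be recovered by subtraction, which is part of why a contractual or legal prohibition is discussed as a complement to the technical rule.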
MR. GELLMAN: Anybody, Al?
DR. ZARATE: I'll be brief. I think we have tried to use every measure at our disposal, in that we do advise the people who develop our surveys that I would like to start talking with them as soon as they start thinking about their survey, about who they want to share data with, how they intend to gather the data, and the security procedures. All of that stuff we try to get in on the ground floor with them so that they are not in a position of having to force an issue, of saying well, what's the level of risk? We want to let them know right off the bat what the restrictions are.
So the contractual understanding of our limitations comes in very early when we start planning a survey. Of course this all has to get cleared by the Office of Management and Budget, and I review the entire survey when it comes through my office at that point, and we go over that.
I might say too that our surveys are also subject to review by our institutional review board. There is very, very careful attention paid to the issues of informed consent and possible harm that could be done by the process of studying our respondents. That process has become even tighter in recent years. So it gets that kind of a review before they can even go into the field.
Once they are in the field then, I think the point that Latanya made is very important, that the people who develop our locks, also have keys if they are going to be the lock makers. I think that Dr. Wetterhall's point is absolutely important too. There are those who are convinced that they can measure the level of disclosure of risk in a perfectly quantifiable and objective way, and I have one such article here.
Prof. Diane Lambert and other people have published articles saying that you can assess that with a figure, but even with that, there is artistry that comes in. You have to ask the question that Scott was asking. So what kind of data are we talking about? If we are talking about values that are easily understood, and that are just excruciating detail on one theme, rather than truly rich in terms of re-identification, those two can be miles apart, and that is where you need people who are familiar with the issues, familiar with the other databases.
I wouldn't think of releasing a public use tape if the field experts, if the subject experts -- if I couldn't consult them and say, what do we know? What's in the field? What could be matched in with this? So it can't be done by simply a person sitting in a remote office. They have to be in contact with the people, and the context in which it is going to be released. So those kinds of questions have to be asked.
Software is being made available. There is one which we are thinking of testing. I understand it is being tested at the Census Bureau. The Bureau of Statistics in The Netherlands has made available some -- I remember that's when I met Latanya, at their demonstration -- where they were talking about a system called ARGUS.
It permits you to identify what we call sample uniques, that is where there is only one of them in the sample. You have to make the assessment of whether there is only one in the whole population. So if those two match, then you've got a real problem.
ARGUS will tell you that first of all are there sample uniques. Then it asks you for some decision points, and you feed the decision points in like the minimum number of cases in a cell, and it will do all that for you. So there are some of these technological -- and I'm sure there are others like this, and even perhaps more sophisticated.
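The two checks described here, finding sample uniques and applying a minimum cell size, can be sketched in a few lines. This is only an illustration of the idea and not the ARGUS software or its interface; the function names, the field names, and the threshold of 3 are all assumptions made for the example.

```python
from collections import Counter

def cell_counts(records, keys):
    """Count how many records fall into each combination of the key fields."""
    return Counter(tuple(r[k] for k in keys) for r in records)

def sample_uniques(records, keys):
    """Cells that contain exactly one record in the sample."""
    return [cell for cell, n in cell_counts(records, keys).items() if n == 1]

def small_cells(records, keys, min_cell_size):
    """Cells with fewer records than the chosen minimum (a decision point)."""
    return {cell: n for cell, n in cell_counts(records, keys).items()
            if n < min_cell_size}

# Hypothetical sample with three potentially identifying fields.
sample = [
    {"age_group": "30-39", "sex": "F", "county": "A"},
    {"age_group": "30-39", "sex": "F", "county": "A"},
    {"age_group": "30-39", "sex": "F", "county": "A"},
    {"age_group": "80-89", "sex": "M", "county": "B"},
    {"age_group": "40-49", "sex": "M", "county": "A"},
    {"age_group": "40-49", "sex": "M", "county": "A"},
]
keys = ["age_group", "sex", "county"]

uniques = sample_uniques(sample, keys)
flagged = small_cells(sample, keys, min_cell_size=3)
```

The code can only report uniqueness within the sample. Whether a sample unique is also unique in the whole population is the separate judgment Dr. Zarate describes, and it cannot be read off the sample alone.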
Well, now the question is, what do you do with people who do that kind of work for you? Well, some of it I wouldn't tell you if I did have that information, because the more information I let you know about what we do to protect our data, the more keys I throw out available to anybody who could pick them up.
There are administrative ways of bringing people in. After all, the contractual arrangements that we could make establishes an identity between the agency and the contractor. We normally go beyond that to make the contractors sign a contract, and to actually go about and see, to make sure that they fulfilled the terms of the contract. If you take that as a given, you can still bring people not aegis so much, but the very same restrictions you have to employ, and have them do work for you.
I guess this is a personal opinion. When things do go wrong, you know where to look when things happen, because it is a rather tight fraternity or sorority or whatever, so once again, the context tells you a lot.
MR. GELLMAN: We have a lot more ground to cover this morning before lunch, so I'm going to ask Michael, Ben, or Easley if they want to respond to Gary's question very quickly?
MR. LUNDBERG: Essentially on three fronts is how we address this. Legislatively, I described that there are certain things that we cannot release under any conditions. Administratively, we have certain things that we do to restrict things. We can technically release employer name. We don't do that, because it's too easy to tie back to patient. We could release certain other fields, but we mask them so they don't go out. So those are some of the technical things we have.
Contractually, the restrictions in Virginia follow the data, so that if people violate confidentiality provisions, if we license data to them, then they are held civilly liable for that, and then contractually they agree not to make any attempt to re-release the data, and then under no circumstances provide access to others on a person-level basis.
Then there are the technical things we were talking about, other things that we do. So it is basically on three fronts that we try to address those issues.
MR. GELLMAN: Ben?
MR. STEFFEN: I will be very brief. The first point I would like to emphasize is that in the enabling statute that created our commission, we are directed to collect data in a manner that does not identify the patient. Because we don't release detailed information of any sort publicly, what we are mostly concerned with currently, in terms of the administrative processes and procedures that we have in place, is managing contractors that assist us with processing, and individual staff members that have access to the information.
We have a set of procedures internally that guide how information can be used. Limitations on access to the data are placed on a study-specific basis. We also, in the methods that we use to organize the data, we keep in mind not only some data processing principles, but try to segment data so that it is not organized and stored in a single place.
For management of the contractor, we have previously had independent verification of the activities of the contractor independent from our own staff management of that to confirm that they are complying with the requirements in the contract.
I think that basically you are talking about technological controls, as well as administrative procedures on how this information should be controlled and limited.
MR. GELLMAN: Easley, do you want to answer?
MR. HOY: Mainly our contractual obligation is the law, Title XIII, which applies to the Census Bureau and its collection of the data. Of course the technical measures as we mentioned before is the issue of swapping, recoding, top coding, and that sort of thing.
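Two of the technical measures named here, top coding and recoding, can be shown with a small sketch. The ceiling of 150,000 and the ten-year age bands are invented for the example and are not Census Bureau parameters. Swapping is left out because it involves exchanging values between records under rules that are not described in this discussion.

```python
def top_code(value, ceiling):
    """Top coding: report any value above the ceiling as the ceiling itself."""
    return min(value, ceiling)

def recode_age(age, width=10):
    """Recoding: collapse an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical microdata with one unusually old, high-income respondent.
records = [
    {"age": 34, "income": 52000},
    {"age": 91, "income": 480000},
]

protected = [
    {"age_band": recode_age(r["age"]), "income": top_code(r["income"], 150000)}
    for r in records
]
```

Both measures trade detail for protection: the extreme income becomes indistinguishable from any other value at or above the ceiling, and the exact age is no longer available.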
I wanted to answer one other question that was raised about why do we collect this detailed data if we are not allowed to use it in such a way? I think you have to realize that there are a lot of different users out there. There are some people who want this detail, maybe at a national level.
At least in our surveys and censuses, the users we have -- one of the questions in the decennial has to do with ancestry and there could be 100 different categories. Obviously if you are looking at something like that at a national level, there is less risk of disclosure. If you were to say, I want to see that down at the block level, you've got two problems. One is the accuracy of that information, but also if it is accurate, then you do have a very high disclosure risk.
So it kind of depends again on all these different users, and they want different detail at different levels. So when you are trying to satisfy that demand, you collect that detail, because that is what it is. That doesn't mean, on the other hand, that you can just avoid the disclosure issue by putting it down at the lowest geographic level where your disclosure risk is highest.
So that is the reason why. You can collect detailed data, but depending on how people use it. That's what makes it so difficult, because you are making compromises all the time. If you want detail in this area, then I've got to sacrifice detail in another area. This is what our day-to-day business on the Disclosure Review Board is. It is making compromises all the time.
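The compromise between subject detail and geographic detail can be sketched as a search for the finest geographic level at which every published cell still meets a minimum size. This is a simplified illustration and not the Disclosure Review Board's procedure; the level names, the `ancestry` field, and the thresholds are assumptions.

```python
from collections import Counter

def finest_safe_level(records, levels, category, min_cell_size):
    """Return the first level (listed finest to coarsest) at which every
    (area, category) cell holds at least `min_cell_size` records."""
    for level in levels:
        counts = Counter((r[level], r[category]) for r in records)
        if counts and min(counts.values()) >= min_cell_size:
            return level
    return None  # no level satisfies the rule, even the coarsest

# Hypothetical records carrying nested geography and one detailed category.
records = [
    {"nation": "US", "state": "MD", "block": "b1", "ancestry": "X"},
    {"nation": "US", "state": "MD", "block": "b2", "ancestry": "X"},
    {"nation": "US", "state": "VA", "block": "b3", "ancestry": "X"},
    {"nation": "US", "state": "VA", "block": "b4", "ancestry": "X"},
]
levels = ["block", "state", "nation"]

level_for_two = finest_safe_level(records, levels, "ancestry", 2)
level_for_three = finest_safe_level(records, levels, "ancestry", 3)
```

Raising the minimum cell size pushes the release to a coarser geography. That is the same trade Mr. Hoy describes, where detail in one area is paid for with detail in another.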
MR. GELLMAN: Let's move on. Can you hold it? There will be more of a free-for-all this afternoon.
Deirdre, can you do a couple of minutes? I want to throw some more facts on the table. Can you talk about there is a ton of data out there in the commercial world and the public sector about people? Can you fill in some of the details, and just talk about that a little bit?
MS. MULLIGAN: Well, I'll try to fill in some of the details, but a place to certainly look to see some of the details put forward in a very factual way is a recent report by the Federal Trade Commission to Congress, looking at the Individual Reference Services Group, which I think represents 13 companies or so that came together in the face of very large public concern and congressional concern over the use of data from public record sources and private record sources to create large profiles on individuals that are used for a variety of purposes.
I would say the top line purpose being law enforcement and other government users who are interested in looking at information on individuals, but also background checks by corporations, or when you are hiring a nanny; a variety of purposes, some of which are very publicly beneficial, some of which may be questionable, and all of which raise many, many privacy concerns.
If you look at the kind of breadth of data that is accessible from these different sources, it is, as Bob pointed out -- and I think Bob is actually the largest expert on this at the table -- it ranges from detailed information on people's health provided both from public record sources such as vital statistics, but also from private sources. It ranges from surveys about health conducted at different levels. People calling in and ordering things over the phone.
Certainly the public record sources that you are probably just as familiar with as I am; but Department of Motor Vehicle records, land records, property ownership. It is amazing and growing. I think what is happening today is technology is really increasing the ability to collect information more and more in the private sector.
If you look at the Federal Trade Commission, which has held a variety of hearings looking at the data collection that is going on, focusing a lot on the Internet, but I think a lot of it is applicable to other digital technologies whether it is our telephone system or others. Just the wealth of information on people's day-to-day transactions -- what you purchase; where you purchase it; where you are flying; where you are traveling; how fast you got from one toll to another on a highway.
The amount of data in both the quantity, but also quality is really increasing, and it is raising privacy concerns I think to a magnitude, both from the public perspective, but also from the perspective of policymakers in a way that it hasn't been raised before.
I think it is the juncture of the two issues that Latanya and I were discussing, that it is both the increased quantity, but it is also the increased ability to use and manipulate that data in ways that really weren't foreseeable, and I think that's why we are all here.
MR. GELLMAN: I think that is a very good summary, and there is more data on more people from more sources, and there are more institutions that are actively collecting and compiling and exploiting that data. That is a trend that has accelerated in the last 10 years probably, and it is going to get worse. So that the background here is that there is more behind -- there is more data out there to match other data with for good or evil.
I want to move on, and I want to ask David Korn if he would like to chat for a few minutes. Identifiers are very useful things. They also create a lot of problems. I think it is helpful to get out on the table some of the good things that you can do with identifiers, and how identifiable information is used in research, and how we all benefit from it.
DR. KORN: Thanks. As I said earlier, I have spent my life in academic medicine, and at least before I became a dean, was a very active researcher, with the usual grants and all those other good things.
The provision of medical care, and the payment for medical care, and the oversight of medical care, and fraud checks, and compliance reviews, and accreditations, and audits all run as a highly integrated system on a free flow of identifiable, often -- frequently, usually -- identified medical records.
Medical records are the database. They are the fuel. They are the lubricant. They are the material that makes the whole system work. And I would say that much medical research requires that the data being used in the research be at least linkable, if not identifiable at the level of an individual's identity. I want to use the words very carefully, and I'll explain a bit in a minute what I mean by that.
I would also like to make a point that I thought that one of the very important statements that Sec. Shalala made in her report to the Congress in early September, which probably not everybody will agree with, is that in our society today there is no absolute right to individual privacy. There can't be, or society won't function.
So the tasks are always to balance between privacy rights, which we do think are very important in this country, and society benefits or public goods as the economists like to call them. We are constantly making trade offs between private rights and public benefits.
When you get into the issue of medicine, the practice, the care, and all of the features of the health care delivery system, and of the research that is interlocked in a very intimate way with that system, you have to understand that it runs on medical information and it runs on identified, frequently, usually, and almost always identifiable medical information.
We have created in this country, for better or worse, a system in which not everybody is guaranteed access to care, which I personally think is abhorrent, but that's the system we have. We have a system where almost nobody wants to pay for their own medical care, so we have third party payers -- federal, state, and private.
There can be privacy in medical transactions. I know for a fact that you can go with a satchel of cash to a health care provider, and paying fully for the costs of what needs to be done for you, ensure that there will be no written record of the transaction. It is possible to do that. Most people don't do that. Most people can't afford it. Most people don't even think about it.
So for most people there are records, and because we have other people paying for our care, those records are of interest to more people than you and your physician, or you and your physicians. People who pay for that care are not issuing blank checks. They want to know about whether the care was given, whether it was given correctly, whether there is fraud in the system, this, that, and the other thing.
So you get into a situation where there are an awful lot of entities and people in those entities that are handling medical records. It is very, very difficult to think of ways of de-identifying those records in an effective way that would still allow the health care system to work. I just want to leave you with that set of observations, because they are very, very important given the system that we all have.
Now medical research -- I'll make a flat statement. The history of medicine from the very beginning has been based on the observation of disease in diseased people. That is what medicine is all about. We are in an age now in the last decade or two where we term it molecular medicine. That's all very exciting and it's very powerful, and it's going to transform a lot of medicine.
The point is we are still pretty much dependent on observing people with illnesses; how they respond to those illnesses; how the illnesses behave, the natural history of those diseases; how those diseases respond to therapeutic interventions, whether they may be surgical or radiotherapy or pharmacological or combinations of those.
One learns how these diseases behave and where they act, and what may be effective ways of dealing with those diseases from this long population-based record of experience, which is accessible for research and publication. Publication never identifies individuals -- never -- and it doesn't have to, ever.
The experiences have to be available for research and publication, so that everybody learns that in one institution that has treated 400 cases of Hodgkin's disease they have found these factors are determinants of how that disease is going to behave, and these procedures have proved to be very effective in effecting cures of that disease or others ineffective in accomplishing cures of those diseases.
Now in my field, which is pathology, what you should know is that from the beginning of pathology, and in fact from the beginning of autopsies, which were probably brought into medical practice in the 1300s or 1400s, why? By physicians who began to get curious about why their patients died. They wanted to know what has happened? What can we learn from the failure that we have had in whatever primitive armamentarium they might have had at that time.
Then since the middle of the nineteenth century the whole idea of looking at tissue sections under a microscope, cell theory of diseases and all of that, which is the foundation of modern medicine today, came into play. Over these years, these decades, this more than a century and a half, tissues have been taken from biopsies and surgical and autopsy cases, and have been stored after the diagnostic work has been done, usually in a fixed form, in little bitty things that are fixed in a paraffin bedding or whatever.
Those tissues have been used in uncountable, literally, studies for 150 years, and have essentially provided the vocabulary and the syntax of modern medicine. We know what diseases are, and we know where they are and how they behave because of the studies of these tissues from millions of people all over the world, and correlating these studies with their clinical experience.
Now that has to go on. One way or another, we have to allow that to go on, even in an era of molecules and genes. Let me give you a practical example of what I mean by that. One of the big medical problems and one of the big societal problems that we have right now is our inability as physicians, whether we are surgeons or pathologists or whatever we are, to be absolutely sure of how certain abnormalities, lesions we call them in the trade, abnormal things are going to behave.
Women are able now through mammography to have identified really microscopic foci of lesions in their breasts that look like cancers when they are removed. Men, through other kinds of screening methods, are finding very early, tiny, microscopic, often foci of lesions that look like cancer in their prostate glands.
The consequences to the woman with a breast lesion like that, or a man with a prostate lesion are quite high and quite profound, whether it is big surgery, radiation, chemotherapy, all of the above, consequences of the surgery that may not be very pleasant.
We also know that probably some large fraction of those little tiny lesions are not going to do any more damage once that little tiny excision has been taken care of, because they are going to behave in a very benign fashion, and we don't know which ones are which. So we can't tell you right now, you have had this little microscopic focus of breast cancer. It has come out in this little biopsy, and you are cured. You will not have any more trouble from that lesion. It is finished.
Or unfortunately, your lesion is a very bad actor, and the chances are you are going to have bad problems, and we are going to have to give you the full dose of adjuvant therapy. The same thing with men with prostate. Autopsies reveal -- and this is well known for a long, long time -- that in elderly men as you get up in the seventies, eighties, and even in the nineties, a huge fraction of all men who come to autopsy in those last late decades of life have microscopic foci of cancer in the prostate glands.
God knows how long they have been there. Certainly it hasn't affected these men. It has nothing to do with their health or illness, and it didn't kill them. If those foci had been there 30 years before, which they may have been, and had been picked up in some kind of sensitive detection, they would have had their prostate gland whacked out probably.
Now suppose I, using new knowledge which is pouring forth, take 100 of these little tiny prostate samples or breast samples -- I don't want to be chauvinistic here -- and do some genetic research. That's a buzz word, right? Real big buzz word, genetic research. Suppose I find out that in a set of 25 out of my 100 or 150 samples there seems to be a very specific pattern of genetic abnormality which is not present in any of the other samples that I have amassed, maybe from many different hospitals across state lines, from many different pathology departments that have accumulations of these cases, so I have my necessary sample.
Suppose I make that observation, and I don't know anything else about those samples, and I can't get any more information about those samples. What do I do with that? I can write a paper, which would be kind of boring, and say, hey, I have discovered that these three genetic points are behaving badly in 25 out of these 150 microscopic prostate cancers. Everybody would read it, if they published it, and would say, so what? What does it mean? Is it important? Do we care?
If I can go back through some kind of linkability, and find out what happened to these people whose 150 prostate biopsy specimens I have, because some of them may be 20 years old, or 15 years, or 10 years old, or 5 years old, what would have happened to them from their prostate lesion has happened. We don't have to wait 20 years now to find out what is going to happen. It has happened.
So if I can find out let's say, and this is very romantic of course, that all 25 of those people with that particular set of genetic changes were alive and well 10, 15, 20 years later, or conversely, that the set with those changes all died within three, four years, or two years. I would have an interesting observation that could be used now, if confirmed, as a predictor.
So when that next biopsy comes along and you do that genetic test, you could tell the physician, who could tell the patient, what I hypothesized earlier, that either you are cured. You have a little bit of cancer. It has come out, but we know because of these features that you are cured. Go home, live your life and don't worry about it. Or that unfortunately the opposite is the case, and therefore I am recommending that a kind of heavy schedule of further procedures and nasty stuff be done in an effort to try to save your life.
Now you could say to me, well, you could have done the whole study prospectively. Those people whose prostates were removed in the past, didn't give you permission to do genetic tests on them. Well, we didn't even know there were genetic tests when that happened, so why not do it all prospectively?
Fine, we could do it prospectively, but then we've got to wait with diseases that may not really show their hand for very long periods of time. I mean there are some cancers for example that are quick. Certain lung cancers, the life expectancy from diagnosis is very short.
Breast cancer and prostate cancer, and there are many other kinds of these things, can have very indolent courses, and it can be years later before the unfortunate victim finds a metastasis. Suddenly after 3 years, 5 years, 7, 9 years of thinking they are cured, bang, they've got a metastasis somewhere.
So we are going to have to wait a generation or so in order to know whether those observations that I have described really mean anything biologically that can be applied to the benefit of patients with these diseases.
Now that is just an example. It is a bit romanticized. I wish I could do it, as a matter of fact. It can be done. The material is there, and the technologies are coming. The reason I'm making the example is that the linkability -- that's the thing I want to emphasize. Unless there is a way to get the correlative data to find out what really did happen to Mr. X and Mr. T and Mr. Z from their lesions years ago, you can't put your observations into a clinically useful, clinically meaningful context so that they can be taken further, and determined whether or not they are now a really good predictive test.
There are tens and tens of thousands of examples like that, that I could give you. So, what I'm trying to argue is that I like to think of it this way. That within the health care delivery system, with all of its pieces and the biomedical and medical research that link in and out of that, what you really want to do is have the least impedance to the freest flow of information, and much of it, if not most of it is going to be identified.
Some of it can be encrypted, and as computer capabilities get more and more powerful, I'm sure that more and more of it can be encrypted, but even with the encryption it is going to be linkable. It can't be unlinkable or it is useless for many things; not all things, but many things.
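One common way to keep records linkable without carrying the patient's identity is to replace the identifier with a keyed hash, so that the same patient always receives the same pseudonym. The sketch below uses Python's standard `hmac` and `hashlib` modules. The key, the identifier format, and the record fields are hypothetical, and the sketch is not a description of any system discussed at this roundtable.

```python
import hashlib
import hmac

def pseudonym(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using HMAC-SHA256.
    Without the key, the pseudonym cannot be recomputed from the identifier."""
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# Hypothetical key held by a trusted custodian, never by the researcher.
key = b"example-key-held-by-a-trusted-custodian"

# A stored tissue finding and a later outcome record for the same patient.
biopsy = {"pid": pseudonym("patient-0001", key), "marker_pattern": True}
followup = {"pid": pseudonym("patient-0001", key), "alive_at_10_years": True}
other_pid = pseudonym("patient-0002", key)

# The researcher can join the two records without seeing who the patient is.
linked = biopsy["pid"] == followup["pid"]
```

This keeps the correlation between a specimen and its outcome available for research. The remaining risk is with whoever holds the key, and with inference from the other fields in the record.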
The real issue then is to build protections that prevent that information either from leaking inappropriately out of the health care system, where it could be used to disadvantage or injure people, or protect it from being forcibly extracted from within that system by whatever.
For example, we don't think that law enforcement authorities ought to have carte blanche access to medical records. We think there ought to be a process, like a court order perhaps as a minimum, that would enable a law enforcement officer to rummage through a hospital's or a physician's records. That is a very controversial point. I know it came up in the secretary's report to the Congress.
We are not for carte blanche access to records, but I think that when you think about the nature of the problem, the use of the information, and the mechanisms that you all think would be appropriate to recommend for buttressing the security of the information, and minimizing the damage that might come from inappropriate use of that information, please be cognizant of the system that depends minute-by-minute on the free flow of that information for its minute-by-minute functions, and don't, please, screw it up.
MR. GELLMAN: Thank you very much. Can I plug your paper?
DR. KORN: If you want to.
MR. GELLMAN: Dr. Korn has written a paper called, "Contribution of the Human Tissue Archive to the Advancement of Medical Knowledge and the Public Health." That was a paper he did for the NBAC, the National Bioethics Advisory Commission, and I think it makes the same points that he has made here very eloquently, in a lot more detail and with a lot more examples. I found it extremely instructive.
I would like to turn to one more presentation before lunch. We have had a very good argument presented for the need for identifiers in some lines of activity, but there are those who can live without them. I want to let Gary talk about what IMS does with non-identifiable information.
MS. MULLIGAN: I just want to excuse myself. Sorry.
MR. GELLMAN: Thank you.
MR. FRIEND: I'm not sure what to do with that set up.
MR. GELLMAN: Whatever you like.
MR. FRIEND: Let me begin by saying something that either is going to seem controversial, or is becoming obvious. Anonymization, I think, in the truest sense is becoming the holy grail; however, protecting privacy interests need not be. I think that is what an obsession of our company is, and I think that's what we are seeking to do today with this debate over how we define non-identifiable versus identifiable. It's the issue of how we protect privacy interests while, as Dr. Korn said, balancing the necessary uses of information of the individual.
IMS in the United States alone processes 72 billion records of data a month. To paraphrase Everett Dirksen, a billion here, a billion there, pretty soon we're talking real data. It is from some 250,000 different sources. As an aside, we operate in 90 countries around the world that include the member states of Europe, and therefore we have had some reasonably good experience in operating in omnibus privacy frameworks.
I think Dr. Korn and others have talked to the value of health information in areas such as disease, treatments, and outcome, and we exist -- our vision statement is to be an essential partner in the advancement of health that is using information in those ways.
How we do it is -- if you put it down in writing, we do, and it is a summary of our information practices. Essentially, our business model is that we do not need, and do not want to know the individual's identity. In those very unique instances in our business where we do, we do it with full consent as to why it's being collected, what it is being used for, et cetera.
I would say that there are four components to how we preserve privacy. That is largely why I asked the question about balancing technical solutions with contractual or legal. One is attempting to do it through tools of anonymization. Second is through aggregation; using statistical rigors to preserve privacy by creating standards as far as minimum cell sizes. Third is addressing the who can use. Then the fourth is under what circumstances it can be disclosed, and with contractual measures applied across the board.
Every employee of our company signs a confidentiality agreement, the provisions of which specifically address the uses of information that they are going to handle.
I think the issue of privacy -- what I was saying is that it's balancing two things. It is balancing privacy interests with the important uses of information. It is balancing technical solutions with legal or contractual measures.
Historically, we believed at IMS that we had this non-identifiable issue nailed down. What we have grown to realize is that even when we, for example, explicitly state to a data supplier what information we want and what information we do not want, and we are explicit in saying that we do not want a name, address, phone number, or social security number, there are other data that theoretically, if used improperly, could over time be integrated, merged, et cetera, and that you could theoretically reverse engineer to identify someone.
That's not why we collect it, and that's not what it is intended to be used for. So our issue is therefore, how do we protect privacy interests without seeking the holy grail, which is non-identifiable. I think that is really where I would hope this discussion focuses.
In the area of anonymization, for example, where we are explicit about what we do and do not want: some of the argument over privacy has been, well, the individual owns their medical record, and therefore should control its use regardless of whether or not their name follows it. That in some ways defeats the goal of anonymity.
If you create a ground rule that says to anonymize, you have to get permission, and thereafter all uses have to go back and get permission, you are in effect forcing the model of reverse engineering back to identity, instead of accepting the fact that both reasonable measures of anonymity, and in turn, responsible uses in disclosure preserve privacy interests of the individual.
MR. GELLMAN: Could you give us a little bit more context for what kind of data you collect, and what kind of products you produce, just so everybody knows?
MR. FRIEND: Sure. I'll focus on a couple of areas, one of which is what we are most known for, and that is we track the full activity of the pharmaceutical industry. What drugs are manufactured, right down to the pill, the packaging. We track it through the actual dispensing of the drugs.
We capture patient-level prescription records from 32,000 of the 54,000 pharmacies in the United States, anonymized, meaning we capture what was prescribed, but what is masked prior to it being provided to us is anything that would identify the individual, but it's what drug is prescribed.
On a sample basis we survey physicians and their prescribing patterns, and the characteristics of the patients for which drugs are prescribed. As an example of the value of that, with the analeptic drugs such as Ritalin that deal with ADHD, we can show that there has been a steady decline during the past several years in the prescribing of Ritalin to children under the age of six.
Why is that important? Because research has shown that it is not accurately diagnosable until a child gets into the age of seven or eight, so there are fairly good, objective, aggregated measurements that will show that the education is succeeding. The word is getting out, and prescribing patterns are following contemporary research.
In other areas we look at what we call the categories of disease, treatments, and outcome, and we have data that looks at diseases, what treatments are applied, and their outcome. In the case of what we call patient-level data, non-identifiable, what we will do is in the example of the physicians, in one case we have a sample of just under 3,000 physicians.
Instead of their providing a patient identity, a randomized, non-identifiable number is substituted for records management purposes. So we get a number, and we have absolutely no way of reverse engineering the number, because it is totally a meaningless, nonintelligent number. Obviously the physician has that linkage, but for that matter, the physician always is possessing the linkage between some information and their patient, and that's where it is germane.
The value of having that number is in the case where we are looking at something where there are users of information, such as FDA or NIH, looking at someone on a longitudinal basis. It is possible to track diseases and their therapies for an individual over time, but there is a flaw in that.
If the individual, if the patient changes practitioners, and coincidentally it is another practitioner that is part of our sample, it is a different patient as far as we see. We have researchers saying to us it would be tremendously more powerful if you could link information across practitioners, but we have a culture in the company, and that is identifiable data is the third rail. If you step on it, it is a very uncomfortable experience.
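The substitution of a meaningless number, and the cross-practitioner limitation just described, can be sketched as follows. This is only an illustration under stated assumptions: the class name, the use of Python's `secrets` module, and the patient labels are hypothetical, and the transcript does not describe how the physicians actually generate the numbers.

```python
import secrets


class PractitionerPseudonymizer:
    """Hypothetical per-practitioner mapping from patient identity to a random number.

    The mapping table stays with the practitioner; only the random
    number would be passed on to the data collector.
    """

    def __init__(self):
        self._table = {}  # patient identity -> random pseudonym (kept locally)

    def pseudonym(self, patient_identity):
        if patient_identity not in self._table:
            # A random token carries no information about the identity.
            self._table[patient_identity] = secrets.token_hex(8)
        return self._table[patient_identity]


dr_a = PractitionerPseudonymizer()
dr_b = PractitionerPseudonymizer()

# Same patient, same practitioner: records can be followed over time.
first_visit = dr_a.pseudonym("patient-001")
second_visit = dr_a.pseudonym("patient-001")

# Same patient, different practitioner: the collector sees an unrelated number.
other_office = dr_b.pseudonym("patient-001")
```

Because each practitioner draws numbers independently, the collector has nothing with which to connect `first_visit` to `other_office`, which is both the protection and the research limitation.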
Part of that is a CEO, who is a native of Sweden. He was involved in the Swedish data protection law in its origins. He ran our operations out of Germany for 14 years. I think people in privacy know Germany's reputation for privacy. So it is a culture in the company.
It is a dilemma. Are we providing as much value to society as we could if we had identifiable information? Perhaps not, but we have struck what we feel is the happy medium in terms of what we feel is possible against what individuals are comfortable with.
MR. GELLMAN: Thank you. Any questions, comments?
MS. SWEENEY: I have a quick question for Gary. In the context of the kinds of identifiable issues that Al and Easley were bringing up, and in the light of the kinds of requirements that David Korn was bringing up, how do you define identifiability?
Because in the examples you gave, you have to have age. You have to have practitioner I.D. You have to have date. You have to have geographical specification or it won't be useful to David's type of needs. At the same time, Easley and Al pointed out very heavily that these are very sensitive fields.
MR. FRIEND: Correct. It's a very good question, and it goes back to my point about true anonymity being the holy grail. Aggregation procedures then come into play in terms of what we will disclose to any user or customer. So we do not release any individual records. So where we're capturing 72 billion records of data a month, they never leave our door in their raw, individual form.
That is why I said the four components of protecting privacy interests are the anonymity, the aggregation, the use, and the disclosure. Aggregation is the next line of defense, if you will, in preserving privacy.
MS. SWEENEY: So if David Korn, as a researcher, wanted to come to you and do an epidemiological study, you would not give him the data, you would just run the study for him?
MR. FRIEND: Correct.
MS. SWEENEY: So that puts you in a different position than the others.
MR. FRIEND: Correct. We have researchers, Ph.D.s, registered pharmacists, that part of the business is actually doing research against a set of hypotheses, testing a research question, as opposed to releasing the raw data to a researcher.
MS. SWEENEY: One more quick question. So in the spirit of doing that research by your in-house researchers who have this valuable data, do you then turn and use the public use files that Easley and Al, or Ben and Michael make available so that you can actually get more texture to the type of research that you are doing?
MR. FRIEND: Yes, in some of the areas we do use the public data as an additive to the data we otherwise collect from private sector sources.
MS. SWEENEY: That makes the example of what Deirdre was saying about how when a private entity has a huge collection and brings it to bear on public use information, it is not exactly what someone on the outside might think of.
DR. ZARATE: Excuse me, I think you have to ask the additional question of when you do this kind of data set enhancement, are you adding to an individual record, or are you adding to the cases? You would count a case as a new prescription record, not necessarily a new individual.
MR. FRIEND: Correct. Your point is a very important one.
DR. ZARATE: We have this kind of differentiation in a lot of our surveys too. So you may see a record of a hospital discharge. That individual might appear multiple times in a survey. So that the ability to link records on an individual basis it seems to me is the most important question you have to ask. Does that in fact add to the identifiability of the individual?
You could add the information and increase the research potential or richness of the database, while not necessarily adding to the identifiability of an individual.
MR. FRIEND: Correct. You are on a very sage point. It drives me towards I think a more precise answer to Latanya's question. That is, we are not in a position to expand the record of an individual by integrating different data sources, because we don't know who that individual is. The nonintelligent, numeric identifier may be attached to a record from one source, and, as I had mentioned, if another physician is seeing the same person, it is a different identifier, and therefore we have no way of knowing that.
Where there is integration, it is at aggregate levels, where you might be integrating data in geographic measures, for example looking at population characteristics, if you will, for counties or states against drug patterns. For example, is there a higher incidence of prescribing antidepressants in Washington, D.C. (I suspect the past week there may have been) versus in other MSAs around the country?
So there you are integrating census data about the population versus prescribing data, but it is not enabling us to drill down to identifying an individual.
MS. SWEENEY: I have never seen your database, but from the example of what you described I could take a release from Ben, and I could take what David Korn said, I take the breast cancer patients who are prescribed a particular drug. Ben is going to tell me what date those people got those drugs. You know what date they got those drugs. You know their doctor I.D., because Ben tells you their physician I.D. number.
MR. FRIEND: But I don't know who the individuals are.
MS. SWEENEY: I would know their birth date. I would know all the information that Ben made available. I then go to the public information, say voter registration lists, which is public information. It gives me their birth date, zip code, gender, name, address, and other information, and I do the third link.
Now I have identified I would say off the top in the United States, in a given kind of database -- I'm not picking on you, Ben. I'm not saying you released the data, I'm just saying in states that do release the data on occasion, that that number would be quite high. It could be as high as 80 percent.
MR. FRIEND: I'm not disagreeing. The point I made is that technology makes true anonymity the holy grail. I don't think there is disagreement on that point. There is enough out there, and it always has been out there. The values haven't changed. Ethics haven't changed. Technology has changed. It has moved the goal post.
So I don't think anyone is disagreeing. I think therefore, what is the solution, because we don't want to simply say, well, sorry, but either we shut down the computer industry or we shut down the free flow of information. I think it has got to move towards the issue of contractual measures, uses, instead of saying that we are going to hang on this way of finding purity and anonymity.
MR. GELLMAN: Ben, do you want to clarify that scenario that Latanya put out?
MR. STEFFEN: At the present time, it would not be possible to link the two databases, because we don't make that information available.
I wanted to ask Gary another question.
MS. SWEENEY: Well, let me rephrase. I can cite -- well NAHDO is here, and we could look at the data schema of many states who do so. While I pointed you out, and while I picked on Gary, the point that I'm really trying to make is not about Gary's company, it's not about Maryland, it is not about Virginia, it is about what the availability of information really means.
This is a real availability. I'm not saying in the specifics of any of you, because I can replace Gary with hundreds of other people, and I can replace you with hundreds of other sources as well.
MR. STEFFEN: Your point is taken.
I wanted to ask Gary a question about if any detailed data was available stripped of identifiers? Is any micro level data available through IMS that you are aware of? By micro I mean service level, individual patient level information.
MR. FRIEND: No, never down to the human being. It may be down to the provider, or down to as I would call it, an aggregation of human beings, but again, it's part of the line of defense. There are very, very strict contractual provisions over everything that we release. Nothing is dispensed without a contractual provision that tightly controls appropriate uses of that information.
MR. STEFFEN: The point I was going to make is that as a for-profit corporation, there may be motivations other than simply protecting privacy in those requirements. One of the big issues when health care reform on a national level was discussed four years ago was that there were no accurate national sources of information for many of the questions that policymakers were trying to address. The private sector information sources were limited in one way or another.
I think a point to consider is that it is dangerous to leave data collection simply to the private sector, and assume that it is going to be complete, accurate, representative, and affordable to every interest group that needs this type of information. I would argue that perhaps -- not picking on the pharmaceutical industry necessarily -- but the pharmaceutical industry might be better equipped to buy your information than advocates for children's health care.
That I think, is something that we also have to keep in mind, and why the public sector has a very important role to play in the collection of health care information.
MR. GELLMAN: Nelson?
MR. BERRY: That exchange brought something to light that, Ben, I think this morning when you were opening up the focus on identifiability, is what is an identifier? An identifier to my friend Ben here, who happens to be a client of ours by the way, and to Gary, may be two different things.
Here again, identifier is different from identification. What creates an identifier in one person's mind may be an encrypted health insurance claim number. To the way HCFA looks at it you've got date of birth, zip code, service dates, and so forth and so on.
MR. GELLMAN: Okay, we are actually right on schedule. It is 12:30 p.m. What we are going to do is we will break for lunch. This afternoon we will start up at 1:30 p.m. with a presentation from Latanya on her research, which is right on all of these points, and I found it very eye opening. Then we are going to have much more general discussions of what is an identifier, what is a record, and fish around to see if we can talk about solutions.
[Whereupon the meeting was recessed for lunch at 12:31 p.m., to reconvene at 1:30 p.m.]
MR. GELLMAN: This afternoon we are going to begin with a presentation from Latanya Sweeney, who has been doing research that is very much on point here, and highly relevant to all the questions that we talked about this morning.
Then we're going to have somewhat more of a free-for-all discussion around some of these issues, talking about what is an identifiable record. What's an identifier? How do we assess and evaluate risk? How do we make all the trade offs here? Is there such a thing as anonymous data? Some of the other elements here.
The floor is yours.
MS. SWEENEY: My name is Latanya Sweeney, and I am very tickled and excited about being here, because for one thing this is really the heart of what I have committed myself to doing and working on, and I think it is a very special place in time. The people gathered here are from some very different and diverse communities, and bring to bear different perspectives and different needs on the problem.
My goal isn't to produce a particular perspective here. In some sense, I see myself as a technician whose job it is to really provide solutions to satisfy the needs that are being expressed here.
So what you should do is you should stop me at any point I say anything, because in some sense I see myself as trying to convey information about what the technology means. If there is only one thing that you leave here with, it is the notion that things aren't the same any more. The technology is making a big difference, and we need to think about how it is affecting us in our different perspectives that come to bear on this problem.
It is becoming increasingly difficult. The Census Bureau has some of the world's experts. I mean these people are great at statistical databases. I also worked with George Duncan and Steve Fienberg at Carnegie Mellon, and I've been to the research centers, for example. I am very impressed with the work that they have done.
They continue to lead the way on the international community as well, but even in statistical databases they are finding it increasingly difficult to provide anonymous information in a globally networked environment. In some sense, this is what Dr. Wetterhall said. It is about the context. It's true, you can't provide anonymity across the board, just in the way I can't tell you, go protect your house.
If I say go protect your house, you said fine, I go home and I lock the doors and the windows. Then you lock the doors and the windows, and then I break through one of the windows. You say, darn, I should have thought about that. I'll put bars on the windows. Then I drill through the ceiling. Now no one would have anticipated that, but in a sense we sit in a job where you can't protect against everything, so I can never guarantee that any release of data is anonymous, even though for a particular user it may very well be anonymous.
So what I have to do is take my best guess at protecting against linking. It's the notion of how much information can be linked from different sources of information that is our biggest problem in today's society.
One of the other problems is that a lot of the people who hold the data, a lot of the people who make decisions about the data don't really understand the jeopardy that they are putting the data in. I used the example earlier that basically technology has eroded the kinds of protections that we had before. When I look at the protection that society had before, it wasn't our policies or even some of our techniques and practices.
No matter what the Census Bureau did, how many people could take their micro data file and do anything with it? Who could own a computer that could hold such a file? Nowadays you can hold ten years worth of their information on a machine that is less than $1,000, and for a couple of hundred dollars I can buy linking software to find all kinds of correlations and so forth.
Technology has changed and it is making a difference. That is without the availability of information on the Internet. More and more public information -- marriage licenses, motor vehicle, birth certificate information, real estate holdings -- anything that was publicly available, putting it on the Internet changes everything drastically. It changes everything.
This problem isn't about medical data. It is about all data. It is a national security problem. The Department of Defense and the Department of Energy are trying to declassify documents. They are having an incredibly difficult time. In fact, they are having to suffer with these same issues.
If the Department of Defense releases information thinking that it is now declassified, going by the practice of declassification they used before, now so much more information can be brought to bear on the declassified document to re-identify the things that were marked out. It has changed everything, and we all are having to look at solutions on all kinds of fronts.
It is not the Census Bureau's problem. It's not a medical data problem. It is just a whole data issue. It is not sufficient to say it is about privacy. It's not only about privacy. It's about privacy. It's about confidentiality. It's about sensitive information.
We all might draw the line of what we consider private at a different place, but the problem still remains that the issue of sensitive information and the ability to protect it has to be addressed.
On the other hand, there are tremendous uses. I'm a computer scientist. This is a very exciting time. Having so much data around ushers in new uses of computers that people can't even think about. Decision support systems for doctors based on profiling case histories of heart attacks, say. It brings forth new economic opportunities. You give an economist all the data on all the people, and they too can find ways to save us more money.
It is just a tremendous, exciting time as well, but what I find is it is a tug of war here between these two forces, with usually me in the middle, being kicked, saying you're with them, you're with them. Really what I'm trying to do is say there is a continuum, and technology offers solutions; not all the solutions. I don't have an ironclad answer to say, here, this is the magic pill.
I do have some solutions, some things that have been adopted from the statistical community and things that I have put together, and some things that others are also working on that I want to share with you. The talk will be short, despite its long introduction, and really changed a lot, basically by the conversations that happened earlier, so I'll make a lot of reference to them.
The first part of the talk will pretty much focus on the old practices. What does it mean in the light of new technology? Sort of where are the threats, and what does it mean?
We'll look at two computational systems that do attempt to provide some anonymity to some degree, and we'll look at where they may fail. We'll also look at Mu-Argus, because that came up in conversation. We'll look at them quite quickly.
Then we'll talk a little bit about the fact that these aren't complete solutions, and what do they mean? They do change policy decisions. They do offer us alternatives to the kinds of either I get access or I don't, either I cross out this data element or I don't. I want to show the view that that is not sufficient.
Even that is not the level of granularity that we need to entertain; however, that is the level of granularity that all of our legislation, all of our policies, all of our practices tend to employ, with the exception of the techniques that are used by statistical databases.
Now I have to say, and I just want to reiterate there are two tremendous advantages that Al and Easley had over Ben and Michael. I say this not about their particular collections, but about a federal agency releasing data versus a state agency, and one is not even an agency. So I don't mean in particular; I'm not pointing to Maryland and Virginia saying, boy, they are bad guys. I'm just using them as examples.
First of all, in the federal case, if I know everything there is to know about my data, if I know the medical records of everyone in the world, or everyone in the country, then in fact I know what is sensitive and what isn't. I know that you are the only person that gave birth to seven children, and I'm not going to release certain information, because clearly anybody could identify you.
It is the situation that IRS finds itself in. They can't say, gee, I'm thinking about a billionaire who lives in the northwest without us all thinking of Bill Gates. So it is the same kind of issue. So if I have all of the data on all of the people, in some sense it is easier to make decisions.
Even then, we still have Al and Easley saying, it's not so easy. Even though I've got all this data, it's still not so easy. They come up with a nice protection scheme, this notion of statistical databases. They tweak the numbers, they play around with it. David Korn says, yes, but I can't use that, and he is right. For the new kinds of uses that we see with computers, those techniques don't work.
What they are good for is that those tweaks and those techniques and those manipulations maintain statistical invariance. So when it is the statistical properties that I'm after, then in fact those techniques are wonderful, but the kinds of uses that we are finding with the data, the kinds of uses that Gary talked about from his data set require integrity to be maintained across the record, and this brings us to a new level of difficulty.
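The trade-off described above can be illustrated with data swapping, one well-known disclosure-limitation technique; the transcript does not say which specific techniques the agencies use, so this is an assumption chosen for illustration. The field names and values are made up.

```python
import random
from statistics import mean


def swap_column(records, field, seed=0):
    """Return copies of the records with the values of `field` shuffled across records."""
    values = [r[field] for r in records]
    random.Random(seed).shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


# Made-up records: each row ties an age to a charge.
records = [
    {"age": 34, "charge": 1200},
    {"age": 51, "charge": 300},
    {"age": 67, "charge": 8900},
    {"age": 29, "charge": 150},
    {"age": 73, "charge": 4100},
]
masked = swap_column(records, "charge")

# Column-level statistics survive the swap unchanged.
mean_before = mean(r["charge"] for r in records)
mean_after = mean(r["charge"] for r in masked)
```

The column keeps its mean and its full set of values, but a given age is no longer reliably paired with its true charge, which is the loss of integrity across the record that record-level research cannot tolerate.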
One of the biggest problems is that anonymity is basically in the eye of the beholder. What looks anonymous, isn't really anonymous. What if I told you that these three records were part of a huge and large and vast database? Then in fact you might say, gee, these are definitely anonymous.
Then what if I subsequently tell you that 33171 is primarily a retirement community? Then there are very few people of such a young age living there; 02675 is in fact the zip code for Provincetown, Massachusetts, and I have found five black women who live there year round. Likewise, 20612 may have only one Asian family. Now in all of those cases, it was information outside the data that helped to identify those individuals. It was information that I brought to bear on the data.
This is the nature of the context. The context isn't the specifics of a particular release or a particular data set or a particular agency. The context is the issue of linking what is publicly available, what I think the recipient can bring to bear on the information that I release.
I'm going to use an example. Now I live in Massachusetts, and I love this country, so I don't want to pick on Maryland or Virginia or anybody else, but I'm going to use this example from Massachusetts because it really gets to the heart of many issues, and having it as a concrete example I think can serve us well. There came a point in time -- for many reasons, for both sides it serves us well.
The Group Insurance Commission in Massachusetts is responsible for purchasing the insurance for state employees. Some of the HMOs and providers came in and said, look, your people are very expensive. We lost $1 million. Next year we are charging you $2 million more.
Now I'm proud to say that the person representing the state said, gee, I feel like my hands are being tied. I want to do something about it. So what he chose to do was he said, fine, I will pay you the $2 million, but now every time there is a claim made by one of my people -- so these would be state employees, retirees, and their families -- I want a copy to come to me.
The copy that comes to me I don't want to have any explicit identifiers, because it is illegal. I don't want them to think that I'm tracking my employees and so forth. I don't want the name, the address, the phone number. Well, what did it include? It included about 95 fields of information. These 95 fields had the birth date, the zip code, and other quite interesting information.
I know NAHDO is here. In 1996, NAHDO reported, and I don't know what the current number is -- they can tell us -- but reported that 37 states were collecting hospital-level data. That means that every time you go to the hospital, this is the kind of information that is coming back. These are the kind of databases that Ben and Michael were talking about. Some of the collections, as I was saying, go on much longer. In the case of GIC, theirs was 95 fields and so forth.
Let's just focus in on this. Seventeen states, by the way, collect ambulatory care data. That means even when you go to your physician's office, that data is reported. The patient number is usually an encrypted social security number. The zip code technically is usually a nine digit zip code, but a five digit zip code in some collections. It includes an ethnicity or racial background, the full birth date -- that would be the month, day, and year -- the patient's gender, visit date.
Then they say, but this is a billing record, not a medical record, but in some sense if I tell you the diagnosis or any procedures that were done, or any medications and so forth, many of us feel that is a medical record. In some sense it is a trace of a medical record that many people feel is identifying.
So this is the kind of data that GIC collected. Now I was minding my own business at MIT, sitting behind my desk. Someone said, gee, they collected this data. How much do you think is identifiable? I looked at the database schema and said, oh, about 80 percent. The next thing I know, I am being called down before the legislature and asked to explain this 80 percent.
So I, with that in mind, said how do I go about proving my allegations? How do I build a statistical model to show the disclosure, without actually touching the data and so forth? So I went to the city of Cambridge and for $20 I purchased a Cambridge voter list on two diskettes. Anybody can do it. In fact, I think now you can get it off the Web. That was last year, so this year you can get it off the Web. In fact, I know you can get it off the Web now.
The Cambridge voter list included: birth date, gender, zip code. It also included name and address, the day you registered to vote. The party you registered for. The day you last voted, and other information. The local Census information is even more identifying: how many children you have, how much money you make, what's your occupation, and so forth, and that too is available on diskette. I don't know if it is available on the Web.
I found some very interesting things out about it. Birth date alone -- this would be the full month, day, and year -- could uniquely identify 12 percent of the population of voters; birth date and gender, 29 percent; birth date and a five digit zip code, 69 percent; and birth date and the full postal code, 97 percent. Notice I only used one and two way combinations. I didn't even use three way combinations or other things that were inside the GIC database.
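The kind of count behind those percentages can be sketched as follows. The rows are made up and are not the Cambridge voter list, and the function name and field names are hypothetical; the sketch only shows how one would measure the share of records that are unique on a given combination of fields.

```python
from collections import Counter


def fraction_unique(records, fields):
    """Share of records whose values on `fields` are shared by no other record."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Made-up voter-list rows for illustration only.
voters = [
    {"birth_date": "1950-01-02", "gender": "M", "zip5": "02138"},
    {"birth_date": "1950-01-02", "gender": "F", "zip5": "02138"},
    {"birth_date": "1950-01-02", "gender": "M", "zip5": "02139"},
    {"birth_date": "1962-07-15", "gender": "F", "zip5": "02139"},
    {"birth_date": "1962-07-15", "gender": "F", "zip5": "02140"},
    {"birth_date": "1971-03-30", "gender": "M", "zip5": "02140"},
]

by_birth = fraction_unique(voters, ["birth_date"])
by_birth_gender = fraction_unique(voters, ["birth_date", "gender"])
by_birth_zip = fraction_unique(voters, ["birth_date", "zip5"])
```

On this toy list the unique share rises from one in six on birth date alone to two in six with gender added and four in six with zip code added, the same direction of effect as in the figures quoted above.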
Now Gov. Weld -- he was the governor of Massachusetts at that time -- lives in Cambridge. There were six people with his birth date. Only three of them were men, and he was the only one in his five digit zip code. So with this $20 thing, I could then re-identify the full medical record of Weld, and because his family's records were also linked by an identifier in the record, I could also identify the family's records as well.
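The linking step in that example can be sketched as a join on the fields the two files share. Everything here is hypothetical: the names, dates, diagnoses, and field names are invented, and the sketch is not the procedure actually used, only the general idea of matching a de-identified record to a named list and keeping the unique matches.

```python
from collections import defaultdict

# Fields assumed to appear in both the de-identified file and the named list.
QUASI_IDENTIFIERS = ("birth_date", "gender", "zip5")


def link(deidentified, named, fields=QUASI_IDENTIFIERS):
    """Attach a name to each de-identified record that matches exactly one named record."""
    index = defaultdict(list)
    for person in named:
        index[tuple(person[f] for f in fields)].append(person["name"])
    linked = []
    for rec in deidentified:
        candidates = index.get(tuple(rec[f] for f in fields), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            linked.append({**rec, "name": candidates[0]})
    return linked


# Entirely made-up rows for illustration.
voter_list = [
    {"name": "Pat Example", "birth_date": "1950-05-05", "gender": "M", "zip5": "02138"},
    {"name": "Sam Sample", "birth_date": "1950-05-05", "gender": "M", "zip5": "02139"},
    {"name": "Lee Placeholder", "birth_date": "1950-05-05", "gender": "F", "zip5": "02138"},
    {"name": "Kim One", "birth_date": "1960-02-11", "gender": "F", "zip5": "02140"},
    {"name": "Kim Two", "birth_date": "1960-02-11", "gender": "F", "zip5": "02140"},
]
claims = [
    {"birth_date": "1950-05-05", "gender": "M", "zip5": "02138", "diagnosis": "dx-1"},
    {"birth_date": "1960-02-11", "gender": "F", "zip5": "02140", "diagnosis": "dx-2"},
]
reidentified = link(claims, voter_list)
```

The first claim matches exactly one voter and so acquires a name; the second matches two voters and stays ambiguous.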
Well, this was pretty serious you say, but then who really got the data? Well, this is the other thing that I have observed, and I'm sure all of you can attest to, or at least almost everyone here can attest to. Data is collected for one purpose; used for another. That is okay in some sense, but it is an honest trend.
When we start by collecting data for one purpose, really good uses for society -- epidemiology research, economic research -- lots of other people come to the data and say, but we could do this, we could do this, we could do this. Eventually, in the case of the GIC data, the same thing happened. They gave data away to researchers, and they sold a copy to industry.
Now that included state employees, and the records of judges, the law enforcement officers at the state level and so forth, as well as the legislature. I think you get the idea of the gravity of the situation. Clearly, another copy of these records existed in the state collection of records as well. Of the 37 states, almost all, as I understand it, have gone through the same trend. They give the data away and they sell it. So I have seen that trend being repeated even at that level, without an understanding of how identifying information can be.
Notice it's not just at a field level. It's not about just chopping out a field, because in the ones that we saw earlier, with the five black women in Provincetown, it was counting across combinations of fields that was revealing. So the problem is quite a bit trickier.
The other thing that I'm noticing is that the nature of research, because of the data, is changing drastically. The kind of data that is being released is also changing. Often hospitals before -- when I first got involved in this -- hospitals were concerned with releasing database data; the field structured, row-column kinds of data. More and more they are being pressed to release the doctor's notes or letters between referring physicians.
The reason is the following. The insurance companies and other financial institutions come to look at the data and they want to corroborate the integrity of the record. One of the ways to do that is to look at the doctor's notes, because that is the place where they describe their findings and so forth.
So this is a tricky problem. I'll spend just a very little time on it, because it is not of great use to everyone here, but I think that we'll see, even at the state level, the nature of data changing there too eventually.
This is a letter to a referring physician. It has the patient's name -- this is all made up of course -- the I.D. number for the hospital, to the physician. There is a typo here, instead of long, it says lang. Then what you often see are references to the patient's nickname, that isn't necessarily a derivative of the patient's first name, references to caretakers at other institutions, reference to the mother, whose last name doesn't agree with the child's last name, and references to outside activities or the mother's employment.
Text like this is almost impossible to go and find all of the inferences that could be drawn, all of the keys that could give it away.
So my first experience in this area was actually to solve that problem. So this is the scrub system. In some sense many people say, well, how hard could it be? I have global search and replace. I know how that works in my text editor or my word processor. So in some sense, if I know the database, I know the patient's name and I.D. number, I could go through and I could just replace known information with all the unknown.
The things that are underscored in blue are the things that you find. The things that are underscored in red are the things you wouldn't find. So you would still leave a tremendous amount of identifying information in the document.
The scrub system actually does this quite well. So the straight search and replace method we talked about only identified 37 percent. You could also use the fact that those were letters between physicians. I could use the format of the letters and get it better, but the scrub system did quite well.
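[Illustration added to the transcript, not part of the testimony. A minimal Python sketch of the straight search-and-replace approach described above, under the assumption that the only values known in advance are the patient's name and I.D. number. The letter text and all names in it are invented, and this is not the scrub system itself, whose techniques are not detailed here.]

```python
# Made-up letter text and made-up database fields, loosely echoing the kind of
# letter described above. None of this is real patient data.
letter = (
    "Re: Kathel Wallams, ID 6214. "
    "Kathel, who goes by Missy, was seen today with her mother, Ms. Stone, "
    "who works at the Westside diner. Missy plays soccer at Oak School."
)

known_fields = {"Kathel Wallams": "[NAME]", "Kathel": "[NAME]", "6214": "[ID]"}

def naive_scrub(text, replacements):
    """Replace only the values already known from the database record."""
    # Longest keys first so the full name is replaced before the first name.
    for value in sorted(replacements, key=len, reverse=True):
        text = text.replace(value, replacements[value])
    return text

scrubbed = naive_scrub(letter, known_fields)
# Identifying details the database did not know about are still present.
leftovers = [w for w in ("Missy", "Stone", "Westside", "Oak School") if w in scrubbed]
```

[The nickname, the mother's surname, her employer and the school all survive, which is the kind of residue the underscored-in-red items refer to.]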
So we have to think about some of the techniques that the scrub system used. Before we worry too much about the scrub system, since that doesn't apply to everyone, let's think about what does it mean to de-identify, because certainly the letter in the end was de-identified. That means that I removed all the explicit identifiers.
So many of us by our practices say this letter is then anonymous, but is it really? What if I tell you at the age of two she burned down their house? At the age of three, she stabbed her sister with scissors. At the age of four she was sexually molested. At the age of five she beat up her teacher. The more information I tell you about this person, eventually you may know who she is, because of simply looking up -- she may be in the news actually, if she keeps going on that route.
The point is nothing there required scrubbing. This goes back to the number of data elements, because the more I told you about her, the easier she was to identify. We started narrowing it down so that there could only be one such person who had that identity. This is exactly the nature of discharge notes and other parts of the database that we often find. It is certainly true of the aggregate, especially when the fields in the database get quite large.
So what are the kinds of techniques we can use in the database community? Well, one of the notions that is quite effective comes from the Social Security Administration and the Census Bureau guidelines that in fact Easley talked about earlier.
The Social Security Administration is almost identical to what was described earlier, uses a sampling fraction. When they want to produce a public use file, they use a sampling fraction of 1 in 1,000. They do so with no geographical locators. They make sure that at least five people could correspond to any file that they actually do release.
Now this notion of a bin size, that is how many people could this record possibly be, is quite nice. Now the Social Security Administration puts that number at five. It is totally inappropriate to use that number in the setting of the state kinds of database releases or the local kinds of database releases. Certainly in a hospital setting, it is totally unreasonable.
Why? Because first of all, the kinds of releases that, say, David is talking about in those examples are not a sampling. He doesn't want 1 in 1,000, he wants all the data. He wants all the people represented there. So we lose the sampling fraction right away.
Even if I remove zip code and any other type of geographical locator, unlike at the federal level where you could be anywhere in the country, if Cambridge Hospital released this data, and I know so by the hospital I.D. number, or I know so because in fact I got the data from Cambridge Hospital, then I could pretty much draw a circle around the zip codes in which these people could be represented. So I could bring this to bear on the data.
So we find that the guidelines set at the federal level are totally inappropriate at the state level. Ironically, the state level has adopted all the federal guidelines, and the hospitals too have adopted the federal guidelines. So many times I will see a tabular report for example, where people will say, I'm suppressing this cell because it is less than five people. It's a totally inappropriate thing. It doesn't mean anything. It is out of context. The other conditions changed.
This is actually in a broader sense, the message I'm trying to tell you; that's inappropriate and so are many of the other practices, and we need to rethink them, and we need to think what they are.
On the other hand, I don't want to let you think that I have concrete, hard solutions. I have techniques that are effective within certain contexts, and we should know what those contexts are. So for example, you might say, well, then how big should this bin size be? Well, a couple of months ago I surfed the Internet.
You know how when you have those discussion groups, the e-mail lists, those news groups? Two women were having a discussion. The one was talking about this guy she met who was an MIT student. It wasn't important about what she was saying about him other than he wasn't that germane to the main gist of the conversation other than in the process she described him, and there were a few messages.
So I made a profile of the student. He was in computer science or electrical engineering. He lived near the river. I figured his eyes were dark, because I sort of stereotyped, because in the write up it said his parents were from Greece. I put him about 5'8, I don't know why. He liked to play soccer. I sent out one e-mail message. I sent this e-mail message out to everyone in the computer science and electrical engineering departments.
Several messages came back, but all of them only had one name, and it was the guy. Then he comes, and he was very upset. He said, well, that's the kind of thing you tell a girl. I didn't give her my phone number. I don't think he plays soccer very well, looking at him, but it was quite telling, because this was a case where this was a conversation that was recorded. I think it goes to the heart of the issue.
The issue isn't even about databases. It could be Web-based information in a text file. It could be conversations that are recorded. Technology has a way of recording and disseminating the information in ways we have just never seen before.
So what are the kinds of techniques that are effective? The second system I did, which does focus on databases, was called Data Fly, around the time that Alvan and I first saw each other, around the time that Mu-Argus came out. It also does one-way hashing or encryption, so given a particular identifier, I can change it. This guarantees a kind of anonymity, but there are other issues with this. This is not a great protection. Encryption doesn't hold lots of answers. It is not the magic pill.
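[Illustration added to the transcript, not part of the testimony. A minimal Python sketch of one-way hashing of an identifier, assuming a keyed hash (HMAC-SHA256 from the standard library) with a secret held by the data holder. The key, the function name `pseudonym` and the identifiers are all hypothetical; the testimony does not say which hashing scheme the system used.]

```python
import hashlib
import hmac

# Hypothetical secret held only by the data holder; an assumption of this sketch.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonym(identifier: str) -> str:
    """Map an identifier to a stable, non-reversible token using HMAC-SHA256."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Made-up identifiers appearing on several visit records.
visits = [("078-05-1120", "visit 1"), ("078-05-1120", "visit 2"), ("219-09-9999", "visit 1")]
released = [(pseudonym(ssn), note) for ssn, note in visits]
```

[The same person always receives the same token, so records can still be linked, but the token only replaces the explicit identifier. The remaining fields can still give a person away, which is consistent with the remark that this is not the magic pill.]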
Generalization is the kind of thing we have seen before, sort of a recoding, as they might say in statistics, where instead of the full birth date, I'll just give the birth year. Suppression is where if it is a really sensitive cell, perhaps I'll just blank it out.
These are the techniques that I can go from the original data, produce a file, that can map to however many people I decide up front. This technique is called K-anonymity. The modeling means that when I release the data, everybody who is in the data matches at least K people. In the protection of how large K is, is the protection that I provide you of whether or not you can identify those K people.
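[Illustration added to the transcript, not part of the testimony. A minimal Python sketch of generalization and of the K-anonymity check, assuming that "matches at least K people" is tested by counting how often each released combination of values occurs. The helper names and the four made-up records are assumptions for illustration.]

```python
from collections import Counter

def generalize_birth_date(date_str):
    """Keep only the year of an ISO date such as '1964-09-20'."""
    return date_str[:4]

def generalize_zip(zip_code, digits_dropped=1):
    """Replace the last digits of a ZIP code with zeros, e.g. 02139 -> 02130."""
    keep = len(zip_code) - digits_dropped
    return zip_code[:keep] + "0" * digits_dropped

def is_k_anonymous(rows, k):
    """True when every released combination of values occurs at least k times."""
    return all(count >= k for count in Counter(rows).values())

# Made-up (birth date, ZIP) pairs.
original = [("1964-09-20", "02139"), ("1964-02-14", "02138"),
            ("1965-10-23", "02141"), ("1965-03-18", "02142")]
generalized = [(generalize_birth_date(d), generalize_zip(z)) for d, z in original]
```

[Here every original record is unique, while the generalized file reaches K of two and no more; a larger K would need further generalization or suppression.]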
The system performs quite well. This is just the results on 300 patient records from Children's Hospital in Boston. These are their complete medical records, with 7,000 visits; 6,000 diagnoses; and 285 files. These are pretty sick children, yet to produce the files at different levels of K.
So what is happening here is the anonymity level is increasing. You can see that we define completeness as the inverse of how much suppression I did, and precision is the inverse of how much generalization did I have to do. It basically retains much of its context. Obviously, the more disparate the data, the worse generalization would be, and so you use suppression and generalization together, to offset each other.
MS. GOLDMAN: With this table, are you saying that you can track those patients longitudinally even with?
MS. SWEENEY: No. What I'm addressing in this table -- I was trying to go quickly, so I went so quick that nobody understood a thing I said. What this chart that I threw up here is saying is that there are those two forces that are always pulling. Certainly I can make anonymous data by scrambling all the values, but then David will say, but that's totally useless. They'll say you didn't even maintain the statistics if you just randomly do things.
So then the question is anything I do to protect anonymity is actually going to be some distortion on the data. The question is, is it still useful? Do I have a measure of how accurate or how much precision remains in the data? How complete is the data?
So this is answering a different question than what you posed. Let me just say one thing, however, there are ways to do anonymous linking. Linking has come up as a serious issue. There are ways, and we can talk about them quickly if you want.
DR. SCHWARTZ: The source records were automated or on paper?
MS. SWEENEY: The source records here?
DR. SCHWARTZ: That's right.
MS. SWEENEY: Were all automated.
DR. SCHWARTZ: That's why it was easier perhaps to apply the scrubbing et cetera.
MS. SWEENEY: I don't even know what to do if you gave me paper records. If you asked me, how do I protect my paper records? I say leave them as paper, because nobody is going to mess with them. Nobody is going to put them on the Internet.
DR. KORN: Did you have to get IRB approval for this work?
MS. SWEENEY: What a wonderful question. The question was, did I have IRB approval? I had the wonderful pleasure of meeting Ellen Wright Clayton(?) recently, and actually she is a wonderful person. She reminded me that I'm a National Library of Medicine fellow, and she said, "You know, as a National Library of Medicine fellow, you have to view these records as human subjects. That's the federal requirement, so therefore, you need IRB approval every time you think about using it."
It just so happens that I did, however, I have to be honest, I learned a trick very early on. If I go to a hospital and I go to the IRB and say, look, I want to protect confidentiality and privacy, and give me this data, and I'll do this great work. A month or two later, a report, a presentation, I get the data.
I go to the same hospital. I go to the administrator and say, look, I'm going do some privacy so you can give the data to an economist. I get the data that day. I learned that very early on, however, with this particular study it did go through IRB.
DR. ZARATE: Latanya, can I just make sure I understand what that indicator of completeness means. It varies, what, between 0 and 1, I take it?
MS. SWEENEY: Yes.
DR. ZARATE: The number of suppressions, that's a U form of ratio of the number of items suppressed?
MS. SWEENEY: The number of cells suppressed, divided by the total number of cells, and I take the inverse of it, and I define that as completeness.
DR. ZARATE: So this characterizes this one table?
MS. SWEENEY: Yes.
DR. ZARATE: Level and height are what?
MS. SWEENEY: When I did generalization, like if I take a birth date, and I generalize to the birth year or a five year window or something like that, or if I took zip code, and instead of giving 02139, I will say 02130, so that all the ones that begin with 0213 have zero as the last digit, and I can generalize zip codes this way as well.
Then I find I can think of it as a hierarchy tree. I can think of it as a generalization tree, because the values are getting more and more general. So I can talk in terms of how far did I move up the tree, divided by the total height of the tree. The height of the tree is where all of the data would merge as one. So there is an inherent maximum height on any instantiation of the data.
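[Illustration added to the transcript, not part of the testimony. A minimal Python sketch of the two measures as described in this exchange. It assumes that "inverse" means one minus the ratio, since the measure is said to lie between 0 and 1, and it assumes precision averages the fraction of the generalization tree climbed across attributes. The function names and example numbers are hypothetical.]

```python
def completeness(cells_suppressed, total_cells):
    """1 minus the fraction of cells blanked out (assumed reading of 'inverse')."""
    return 1 - cells_suppressed / total_cells

def precision(levels_moved, tree_heights):
    """1 minus the average fraction of each generalization tree that was climbed.

    levels_moved[i] is how far attribute i was generalized; tree_heights[i] is the
    height at which all of its values merge into one.
    """
    ratios = [moved / height for moved, height in zip(levels_moved, tree_heights)]
    return 1 - sum(ratios) / len(ratios)

# Hypothetical numbers: 5 of 100 cells suppressed; birth date moved 1 of 4 levels,
# ZIP code moved 1 of 5 levels (one digit dropped).
c = completeness(5, 100)
p = precision([1, 1], [4, 5])
```

[With these made-up inputs completeness is about 0.95 and precision about 0.775. Untouched data scores 1 on both, and data generalized all the way to the top of every tree scores 0 on precision.]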
DR. ZARATE: Just on that one point, the scientific question you would ask -- one thing is that you made a statement before, and I hope you didn't mean this, that anything that makes data less precise is distortion. I hope you didn't mean that, because I couldn't accept that. The reason I couldn't accept that is because this precision applies to statistical manipulation of this one table.
As I understand it, the way we address a scientific question, could we first state the level of the precision, the evidence of which we will accept or not accept as generally -- and we go at it backwards. We say there is no difference, and this is what it is going to take to show me that there is a difference.
That level of precision is independent of the data. It has to be independent, because we don't want to bias the kind of information that we will accept or not accept. So that's one level of precision that is different from what you are talking about.
MS. SWEENEY: Yes.
DR. ZARATE: Make sure that these kinds of questions are really important, because if you present -- for instance, we never allow exact date of birth in any of our data sets, nor do we allow it to be inferred from any other information. So the level of precision our data are scientifically subject are of a different order, but they are nonetheless as statistical generalizations go.
MS. SWEENEY: Let me say three things. First of all, yes, the definition you have of precision is not the same as this definition of precision, so we should understand that straight up.
Second of all, computationally, the way I have framed the question is slightly different than the way the statistical community has framed it. One of the important differences in our conversation is that when I go to measure these data, there are lots of measures. For example, there are objective entropy measures that I'll make. These are entropy measures.
I'm going more towards entropy every time I change that data. So in that sense, anything I do to the data, I'm losing something. So when I use metrics that are based on entropy, any distortion is going to change me that way.
On the other hand, there are also two other sets of metrics, and these are the metrics that you are capturing. One set of metrics might be the ones that Janlori would want, and those are metrics that says how much protection is there in this data? Prove it to me. Give me some measurement of how much anonymity is attained.
Then there might be the protections that David Korn wants that says how practically useful is this data? So what I have come up with are actually another set of metrics for each of them. So now I can take a technique and I can say what it does in each of these metrics. So I can travel this space. It's a slightly different thing, but maybe we should talk about it off line.
DR. ZARATE: I was wondering if you were going towards some kind of objective measure of identifiability.
MS. SWEENEY: Yes, that's very important.
DR. SCHWARTZ: Do you have any sense of what guided the IRB's decision to permit you to perform this study, and might that help this work here?
MS. SWEENEY: Why the IRB -- because Children's Hospital, like many other places, wanted to show that Web technology could be quite useful. It's an inexpensive alternative to hospital information systems. So they had a vested interest of being able to produce anonymous data for their demonstration. So they wanted someone to show them how to make it anonymous. So, yes, they weren't doing it out of the goodness of their hearts. They didn't have to pay me, so it worked out well.
In concluding, what I would like to do is show you a simple example. I know that most of you work with data. Most of you are quite savvy with data, but there are some of you who work from a slightly different standpoint. So it is nice to put the techniques into the context of a real example.
So if you will assume that this is all there is to this database, there are lots of blacks, there are lots of Caucasians, there are lots of females, and lots of males, but there is only one female Caucasian. So in some sense she is unique, or an outlier.
So if you Data Fly, and here K would be set to two, because of the size of the database, and no other reason, we would get this result. The social security number has been one-way hashed or encrypted to give me another number, but encryption is important here, because it meant that it was a unique number, so every time the original number showed, every time Jim shows up, I make it John, so that therefore, I can link across tables and so forth, and all of Jim's became John's, and so I can maintain the identity of the person.
We see that the female Caucasian full record was suppressed, because it recognized it as an outlier. Still to maintain that every record appeared at least twice, the birth date was generalized to the birth year, and the zip code, the last digit was lost. Now notice I didn't necessarily want to say that this was part of what made an identifier.
I have a term called quasi-identifier. A quasi-identifier is a combination of fields that uniquely identify people. In this particular case, then I don't have to worry about, but my quasi-identifier is this. It is the set of the ethnicity, birth, sex, zip, because in fact this is what could be linked to the local census data. This is what could be linked to the voting lists and so forth to re-identify these people.
So I make this the quasi-identifier. This is a nice feature from David Korn's standpoint, because it means that the distortion only needs to be limited onto the quasi-identifier. That meant the clinical information doesn't have to be touched.
So these types of techniques can be effective when the fields on which one links aren't the same as the fields on which I'm doing my research. We find that that is often true.
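[Illustration added to the transcript, not part of the testimony. A minimal Python sketch of the worked example: generalize only the quasi-identifier (birth date to year, last ZIP digit dropped), then suppress any record that still fails K of two, leaving the clinical field untouched. The table is invented, and the fixed generalization steps are an assumption of this sketch; the testimony does not describe how Data Fly itself chooses what to generalize.]

```python
from collections import Counter

# Made-up table: (race, birth date, sex, ZIP, diagnosis). Only the first four
# fields form the quasi-identifier; the diagnosis is never altered.
rows = [
    ("black", "1965-09-20", "male",   "02141", "short of breath"),
    ("black", "1965-02-14", "male",   "02141", "chest pain"),
    ("black", "1965-10-23", "female", "02138", "hypertension"),
    ("black", "1965-08-24", "female", "02138", "hypertension"),
    ("black", "1964-11-07", "female", "02138", "obesity"),
    ("black", "1964-12-01", "female", "02138", "chest pain"),
    ("white", "1964-10-23", "male",   "02138", "chest pain"),
    ("white", "1964-03-15", "female", "02139", "hypertension"),
    ("white", "1964-08-13", "male",   "02139", "obesity"),
    ("white", "1964-05-05", "male",   "02139", "short of breath"),
    ("white", "1967-02-13", "male",   "02138", "chest pain"),
    ("white", "1967-03-21", "male",   "02138", "chest pain"),
]

def anonymize(table, k):
    """Generalize the quasi-identifier, then drop records whose bin is below k."""
    generalized = [
        (race, birth[:4], sex, zip_code[:4] + "0", diagnosis)
        for race, birth, sex, zip_code, diagnosis in table
    ]
    bins = Counter(row[:4] for row in generalized)
    kept = [row for row in generalized if bins[row[:4]] >= k]
    suppressed = [row for row in generalized if bins[row[:4]] < k]
    return kept, suppressed

kept, suppressed = anonymize(rows, k=2)
```

[In this made-up table the only record suppressed is the single female Caucasian, and every remaining quasi-identifier combination appears at least twice.]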
DR. KORN: What would you do to the quasi-identifier?
MS. SWEENEY: I generalized them and suppressed them, and I tell you what I did.
DR. KORN: Oh, that's what the result is.
MS. SWEENEY: This is the result of doing that.
As Al pointed out, around the same time I did Data Fly, in fact we met at that very time, Mu-Argus came out. Mu-Argus is an offering from Statistics Netherlands that attempts to do the same thing. It was very interesting. We didn't know of each other.
They didn't know about me, and I didn't know about them, but the techniques turned out to be almost similar. They had the same notion of suppression. They had the same notion of generalization, but they called it something different. They also had a notion of what I termed K-anonymity; that notion of a bin size.
Mu-Argus takes an approach, a walk through the data that is a little buggy, because it took a probabilistic walk in a sense through the data. So for example, in identifying the female Caucasian, it identified these as being sensitive, and that's okay, because it didn't know about encryption.
It had found the female Caucasian and blocked it out, but notice how there is only one occurrence of the Caucasian male in the 02138 zip code. In other words, because they don't look at the full set of quasi-identifiers, and they don't exhaust that set or deal with it in any way to cut down the computation that is effectively other than a probabilistic walk, they left sensitive information in the data. This could be as sensitive as the female black who lives in Provincetown.
So when I reported this actually to Bill Winkler in the Census Bureau, he ran a study. He had 1,500 records. He used Mu-Argus with lots of various settings, and then gave the data to IRS, who was able to re-identify 100 percent of the records.
So while this first version of Mu-Argus is not effective for the database kinds of uses that we're talking about in medical journal releases, that is different than the statistical types of needs of Alvan and Easley, because if I'm not giving all the data, then it's a different answer as to how effective this technique might be.
I learned a lot from Mu-Argus. I like that notion of cell suppression, because I think Data Fly overgeneralized, over-destroyed some data. As you remember, it deleted one record completely, and so the recent work that I have done is trying to find only the cells that are needed, trying to maintain much of the specificity in the data.
The point of these techniques is not to say, oh, well, she finally did something. The point is to say that I think this all stands as a challenge to policymakers even now, even though it's not the smoking gun solution in the sense. It says that the answers aren't I either give you this access to this field or I don't, or I give access to these records or I don't.
The answer is that I ask you which fields do you need and which fields can you do the distortion? I ask myself which fields might you use to link my data to other people? In the two descriptions, I can find optimal solutions. Who does the compromising here? Well, it might be the agency or the data holder themselves that does the compromise here. It might be some back and forth mediation.
It should be bounded by contractual arrangements. In all of the data that I have acquired, almost very few people have ever asked me for a contractual arrangement. There has been absolutely no requirement that I destroy the data after I use it. There has been no requirement that I not give it away. The only protection that people find is, but if you screw me, you're dead in the sense that we know where you work and this is a small community, and basically you're dead.
In medicine that's a legitimate answer to an extent, but I don't think it's a substitute for having the proper type of review process through IRB. I think if it is released administratively, it should go through IRB too. I should have had to wait and stand in line and so forth.
I have other thoughts on it. I'm only throwing them out, not that I think you guys should write this as practice, but as to try to jar your thinking a bit away from the traditional. I think that actually releases are a continuum of releases, and how do I decide where a particular release goes?
In doing this work, actually, it's only been a couple of years, but the Department of Defense, like I said in the beginning, is very concerned over the issue in general. I have been working with a lot of the statistical community, George Duncan and Steve Fineberg at Carnegie Mellon, and a lot of the database community at SRI, Stanford Research.
Any questions, comments? Yes, ma'am.
MS. BREITENSTEIN: In what you are calling quasi-identifiers, is that a bounded set that you sort of determined, or is that set due to expand?
MS. SWEENEY: Well, in some sense the protection scheme of K-anonymity works by saying the only protection I can give is I'm going to stop you from linking to databases I think you could link to, known databases. So I'm sitting here with my database. I can figure what is publicly available, and I know what else I have given out or where else I got this data. So I can make good predictions about which fields may actually be used to link on. Every time I make such a prediction, that determines a quasi-identifier.
MS. BREITENSTEIN: So the quasi-identifier is contextual in each case? It's not sort of generalized to something that you made baseline experience?
MS. SWEENEY: Right, but since we know publicly available information like the local census and the voting lists and so forth exists, there are in fact the basic sets, which Easley and Al always kept pointing out, and that's the birth date, the zip, the gender and so forth.
That's certainly the basic set, but it's not an exhaustive set, and you still have to keep your counts within the fields, like to find the female black who lived in Provincetown or the young person who lived in a retirement community. You know that that's true, because they are unique in your database.
If there is only one woman who gave birth to triplets in my database or gave birth to quintuplets, lots of people know that information. The school records know it. The census data has it, and so forth. So if I release her information and the clinical complications, then any release that includes her, and includes some basic things like the fact that she gave birth to these kids, will in fact also give away that clinical information about her.
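[Illustration added to the transcript, not part of the testimony. A minimal Python sketch of linking a release without names to a public list on a quasi-identifier, assuming both carry birth date, sex and ZIP code. All records, names and the helper `link` are invented for illustration.]

```python
from collections import defaultdict

# Made-up "de-identified" release: no names, but birth date, sex and ZIP remain.
medical = [
    {"birth": "1964-09-20", "sex": "F", "zip": "02139", "diagnosis": "hypertension"},
    {"birth": "1965-02-14", "sex": "M", "zip": "02138", "diagnosis": "chest pain"},
    {"birth": "1965-02-14", "sex": "M", "zip": "02138", "diagnosis": "obesity"},
]
# Made-up public list (for example a voting list) with the same fields plus names.
public = [
    {"name": "A. Example", "birth": "1964-09-20", "sex": "F", "zip": "02139"},
    {"name": "B. Sample", "birth": "1965-02-14", "sex": "M", "zip": "02138"},
    {"name": "C. Placeholder", "birth": "1965-02-14", "sex": "M", "zip": "02138"},
]
QUASI_IDENTIFIER = ("birth", "sex", "zip")

def link(released, external, fields):
    """Return diagnoses that join to exactly one named person on the given fields."""
    names = defaultdict(list)
    for person in external:
        names[tuple(person[f] for f in fields)].append(person["name"])
    hits = []
    for record in released:
        candidates = names[tuple(record[f] for f in fields)]
        if len(candidates) == 1:
            hits.append((candidates[0], record["diagnosis"]))
    return hits

reidentified = link(medical, public, QUASI_IDENTIFIER)
```

[Only the record that is unique on the quasi-identifier is re-identified; the two records that share a combination are protected by each other, which is the protection a bin size of two provides against this particular list.]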
MS. PAYNE: Susan Payne, State Health Department in Maryland. Just looking at the manipulations you did to that data set, if it wasn't my data and I was asking to have that data because I wanted to do an epidemiological study, the epidemiological usefulness of that data set was destroyed by the manipulations that you did to it, and I'm sure Dr. Korn will tell you the same thing.
So if I wanted to look at the epidemiology of hypertension, and you suppressed the only female Caucasian with hypertension, then you have destroyed the epidemiological usefulness of that data set.
Now I understand that some of these data sets are so big that that wouldn't happen, or that perhaps we would tell you in advance which fields you couldn't screw around with, and that you then would not screw around with them; you would change some of the other ones. The problem is that then you need to know the answer as to what is important epidemiologically before you start the study, rather than finding out from your data set what in fact the important correlates are.
MS. SWEENEY: Let me start to answer your question by saying first of all, I have spent a tremendous amount of time over the last couple of months studying epidemiological studies. It started actually with the article from The New England Journal of Medicine that Melton(?) wrote. Melton is in epidemiology at Mayo Clinic, and for those of you who don't know the article, I think Dr. Korn made a better argument.
In his case he argues that look, Mayo has done wonderful work for society. We have always had a very open door for our researchers to have access for epidemiological studies. He goes on and on about all the great successes. Then he says that he needs all the access to all this data to do his hip fracture study. Actually, it turns out he didn't need any identifiable data to do his hip fracture study. He didn't need any.
So I took the last seven years of The American Journal of Epidemiology and started reading from page 1, going backwards in time, and looking at every study, and looking at the identifiability of the data that they used, and looking at the identifiability of the data they could have used, and looking at techniques that would allow them to bridge the two.
Now that is a study that is still ongoing. What I can say about it are the following three things. First of all, it is not true that epidemiology needs everything on everybody in such a magnanimous way. In that sense, Melton was true to form.
The second thing that I would say is that I do think that epidemiology does wonderful work that is incredibly beneficial to society, so that there has to be a case of exceptions. Truly identifiable data, truly anonymous data -- these are public use files -- are not going to be acceptable to every epidemiology study. We recognize this. We have to have a mechanism. This is what I mean about the continuum.
So you say to me, but I need all of these fields. I say, but I can't give you any of those fields. So we say, well, look, you have to go to the IRB and you have to basically swear your first born, and then you can get the data. We need some mechanisms like that.
Clayton believes that mechanism should be the IRB. Other people think it should be a contract. Other people don't know, but those are the things I would argue. I do think it still stands as a challenge to epidemiology, because I don't think that they can just have data the way they used to. There is going to have to be some accountability.
DR. KORN: One quick point. I was swept away by your presentation, and commend you for the trouble that you are causing, but one point about Melton, I think, Latanya that what he was talking about was the consequence of a state law that became effective in Minnesota on January 1, 1997, that mandated that every patient had to give consent for any use of his or her record, and record means anything about that patient -- tissues, the paper, the computer record, blood samples --
MS. SWEENEY: But it was of identifiable nature. The law did require that --
DR. SCHWARTZ: I think in that article he mentioned that each patient would have to give two consents. One of the consents was for research uses, and then the other consent that most hospitals use as well.
MS. SWEENEY: He was particularly focused on the consent required for research.
DR. SCHWARTZ: Yes, but he also said the mechanism was the IRB. He also came out with saying near the end of the article, that we have a mechanism in place to make these type of decisions.
MS. SWEENEY: Yes, but he's not saying what I'm saying. I'm saying that the full position might be anonymous data. As Easley said, if it's out there, it's out there. I just produce this file. If you need something better than that, then the more invasive you are with respect to their privacy, the more commitment or the more hurdles in a sense, you are going to have to go through.
That is not the same as what Melton was saying. Melton is saying there shouldn't be consent, because in fact we have an IRB, and all we need is an IRB. Then we can choose to get all the data or not, and that's the decision IRB makes.
DR. SCHWARTZ: That essentially was underlying my question earlier as to what really informed the IRB to approve -- what guided their decision in approving your study. This was like the chicken and egg type of thing.
DR. KORN: The point I wanted to make was that he was arguing from the perspective both of the Mayo's very long record of using -- what Mayo is known for is not basic research, but research that has been done on patient experience. It has been a very rich contribution to the medical literature for a very, very long time.
There is a Rochester County epidemiology database which is considered in the epidemiological community -- and I'm not an epidemiologist -- a very, very elegant, very full, and very useful database, the Rochester database. The problem he was addressing was that under the most vigorous circumstances of repeated harassment -- my words -- of patients to get them to say yes, they always wound up with a core of denials.
This was not in the paper, but I heard it presented at one of the meetings last summer. The denials are not random. The denials are very, very highly skewed by age, gender, ethnicity, and where they contacted the health care system. The argument in the oral presentation was that with a 5 percent highly skewed denial, it was a terribly severe -- I don't know the right epidemiological term -- but it was severe blow to the usefulness of that database, if that would be the way that base would go from now on. That's what he was arguing.
MS. SWEENEY: But he actually made the opposite point. He actually said, that gee, we always get consent, so why are we wasting money getting consent?
DR. KORN: No, they don't always get consent.
MS. SWEENEY: Okay, but let me be clear about one thing, and that is that my view of my research is that in general I believe from some of the studies that have been done, the Equifax(?) studies for example on research, that the public would say you can have all my data as long as you can't identify me. That is a very different position than what Melton is talking about.
He is arguing the issue of consent. I'm not arguing the issue of consent. I'm saying that I want to give all the data on all the people to whoever wants it, but by golly I'm going to do whatever it takes to make sure no one can be identified.
MR. GELLMAN: Let me ask a couple of questions. The work that you described looking through the epidemiology journal -- you don't have to answer this other than in a hypothetical way -- but I gather what you are doing is you are looking at studies and saying, this study didn't need identifiable data.
MS. SWEENEY: Epidemiology is actually a high standard for research, because they are not computer scientists. They don't just take the first answer that comes along. They actually look at it from two and sometimes three different ways to conclusively get the most generalizability out of their results, and so forth.
This is quite impressive, but this poses a tremendous challenge, because it means that they got the data from one source, but now they've got to go get data from other sources, and have some way of linking it off that. Now sometimes the linking doesn't have to be at the person-level. There are other techniques that can be employed, such as anonymous linking in many cases.
In general, I would think that it is probably the highest standard of research that I could ever try to --
MR. GELLMAN: Where I'm headed with this is presumably all of this published research was cleared through an IRB. Are your conclusions, when you have conclusions, going to offer comments on the quality of decisions made by IRBs in allowing these people to get access to records?
MS. SWEENEY: No, I think that IRBs, like our policies, like our practices, have always had the answer of you get it or you don't. Our laws are you get it or you don't. What I'm arguing is you can get it, but what form can you get it in. You can get all the data you want, but what form can you get it in? It is going to be generalized to this level. It's going to be this many is suppressed. Is that okay with you in these attributes? That's the level of negotiation.
The IRB needs to perhaps think about shifting its way of making decisions, and the way they want to make decisions is that I'm going to give the least invasive data to privacy that you can practically use, and that's how we're going to finally get balance.
DR. WETTERHALL: Sometimes I'm a card carrying epidemiologist. I very much enjoy what you are doing, and I think it's incredibly unique in terms of developing new methods for making data sets available for certain types of analysis. I think that is one potential application for the sorts of things you do.
I feel incredibly uncomfortable though reducing the whole field of epidemiology to one journal, which happens to be The American Journal of Epidemiology. Even within that journal there are clearly many different types of studies that are undertaken. There are original longitudinal studies --
MS. SWEENEY: I actually separate them across the classes.
DR. WETTERHALL: At least on the basis of your statements, you are tending to lump them. They range from longitudinal studies, such as the Framingham Study, which is basically a clinically oriented longitudinal study, to outbreak investigations, which are truly public health emergencies that require immediate access to personally identifiable information to perhaps save lives, to secondary data analyses in which people may have used tapes from the National Center for Health Statistics.
I'm just offering that as a caveat, that I feel uncomfortable that we are reducing epidemiology, although I do agree it is probably one of the most rigorous of the social sciences, to those sorts of terms. I think it is far more complex than that.
I really do come back to my earlier comment today. It is the context. It is the reason for which the study is done. That to me, very much dictates the type of data that is needed, the means of collection, and even the level of identifiability.
MS. SWEENEY: Let me just say that I was answering in a nutshell. In a nutshell, I just used epidemiology, but let me assure you that I am actually dividing them for the type of study that they are. I don't make any judgment about the quality of their study. I assume that whatever data they asked for, they needed. I just then look at what they did with the data to say, what did they really need.
So for example, a study by a particular researcher where it started with all the birth dates, but really only needed the age, well, given the population that that particular researcher had, the number of people who could have been identified because that researcher got the full birth date at the most was every group of people could have been three.
If the researcher only needed the age, and that was the only thing the researcher used, and that's the only thing their methodology pointed to, then if they had only gotten the age to start with, that bin size would have been 1,125. This is what I'm talking about.
This is why I'm saying I don't question the methodology. I don't question what you are asking. I only say that when I give you the data, I'm going to give you within the range that you say is practically useful. I'm going to give it to you. If I know that I'm giving you sensitive information, then I have now identified my risk, and I'm going to cover my risk by contracting with you, or what have you. That's all I'm saying. I'm not saying you don't get the data.
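The bin-size comparison described above can be illustrated with a small sketch. It is not Ms. Sweeney's system, and the figures 3 and 1,125 come from her study population, which is not reproduced here. The six-person population, the reference date, and the helper names `smallest_bin` and `age_on` are all hypothetical, chosen only to show how generalizing a full birth date to an age enlarges the smallest group of people who share a value.

```python
from collections import Counter
from datetime import date


def smallest_bin(records, key):
    """Size of the smallest group of records that share the same key value."""
    counts = Counter(key(r) for r in records)
    return min(counts.values())


def age_on(birth, ref):
    """Age in whole years on the reference date."""
    had_birthday = (ref.month, ref.day) >= (birth.month, birth.day)
    return ref.year - birth.year - (0 if had_birthday else 1)


# Hypothetical toy population; every birth date here is invented.
REF = date(1998, 1, 1)
people = [
    {"birth": date(1960, 3, 1)},
    {"birth": date(1960, 3, 1)},
    {"birth": date(1960, 7, 15)},
    {"birth": date(1960, 11, 30)},
    {"birth": date(1961, 2, 2)},
    {"birth": date(1961, 5, 9)},
]

# Released with full birth date: some people are alone in their bin.
bin_birth = smallest_bin(people, lambda r: r["birth"])

# Released with age only: every person shares a bin with at least one other.
bin_age = smallest_bin(people, lambda r: age_on(r["birth"], REF))

print("smallest bin with full birth date:", bin_birth)
print("smallest bin with age only:", bin_age)
```

In this toy data the smallest bin grows from one person to two when only age is released. The same calculation on a real release would show how much identifiability is removed by handing over no more detail than the study used.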
DR. WETTERHALL: I certainly agree with the stratification that you talk about. It does raise another issue that I don't think we should spend time on, but that's the burden of data requests. I think that does come into play.
MS. SWEENEY: I have a profile system.
DR. ANDREWS: I'm Elizabeth Andrews, another card carrying epidemiologist. I really appreciate your comments about the high standards of epidemiologists, and I fully agree. This is very interesting research.
I guess my question -- and forgive me, I missed some of the morning discussion, so this may be rehashing old ground -- but if I understand what you are saying, it's that all such studies, even using observational databases, not experimental studies, should be run through IRBs. If that is the case, and there are adequate protections against redisclosure, or using the data to identify patients, then I guess I'm missing the point for going to the next step of further de-identifying the data.
MS. SWEENEY: I'm saying that if I produce data for public use, then I have to hold it to a high standard. I have to provide data for which I know that to a reasonable degree, people cannot be identified. If you can use public use data in your system, you don't have to even think about an IRB, however, Ellen Wright Clayton says you should, because any legitimate researcher would. That's a different issue. You can take that up with her.
In my view, I'm with Easley; if it's out there, it's out there. You're going to knock yourself out. When you say that you need something more than that, you will often say I realize now that there is a little risk. Maybe even just the opposite. I gave age, and now you want birth date. Well, I know there is a little risk. Then I only care a little bit about it. I know who you are. It's a personal disclosure.
You need the full names and addresses. As is often the case, you need to contact them. That's a much higher standard of identifiability, so that there is a full scale.
DR. ANDREWS: But you're saying that even the IRB protections you don't think are adequate even for research that has been approved by IRB, you need to go another level?
MS. SWEENEY: No. I'm saying that if IRB approved research, that's fine, but what I would like to see IRB doing is instead of saying you get it or don't, I would like to see them saying you always get the data, but the question is what form do you get it, and I'm going to keep it to as general a form that you find practically useful.
I'm just asking for a little bit more accountability, because there is no reason to put privacy at risk unnecessarily. That's why I often call it the least invasive approach.
DR. ANDREWS: Your previous comments really related to the public access files?
MS. SWEENEY: Yes.
MR. GELLMAN: I want to answer your question. This happens all the time with disclosures. Everybody says, well, the data is confidential. We're only disclosing it to X, Y, Z and whatever, but it's still confidential, meaning we haven't published it in The New York Times. The answer is, giving it to anybody means it is not confidential. So if there is a way to mask that in some fashion, if you don't need it, and you can accomplish what you are doing -- I mean, you may know one of the data subjects. That is possible.
So there is a value in doing that. There is a cost in doing it too, and you have to weigh the two, but you cannot say that I'm only disclosing it to somebody, but it's still confidential.
DR. HARDING: Something that we talked about in previous meetings that you brought up and it got a rise out of several people I know is the issue that as a scientist or an epidemiologist you go in and ask for permission to use that data for scientific purposes, medical purposes, or whatever, and you have to go through an IRB and it takes a long time and so forth. You have to sign away your first child and so forth.
Then you go to the CEO of the hospital, and without an IRB board and without anything, you are given the data immediately. Why? Now that is a difficult subject, but is it involved with who owns that data? Does the CEO own that data, and have the ability then to do as he or she wishes with it, as long as it is economic research and not medical research, or how does that quite work? I don't want to get into that all afternoon, but it's that kind of a topic, but do you have any comments on it?
MS. SWEENEY: I'll give some brief comments about it. It is a very hot issue. First of all, your reason is absolutely right. Traditionally -- and Tom can tell me if I'm wrong about this -- as best I can tell, computer systems basically first came into hospitals as billing records. So they were there to fill out claims faster and so forth. So far as from the hospital viewpoint, this was just part of them doing their administrative duty.
As a business, they have a right to have administrative records. So if they hired consultants to come in to help protect the security of the data, or help find economic patterns so that they can save money and so forth, this is part of them doing their administrative duty.
When the hospital data looks like the fields that we just saw from NAHDO, when it grows to over 100 fields or more, or it just simply has diagnosis, procedures, medications in it, the public feels in general that gee, that's medical data. That is not a billing record. So a lot of the state databases for example, for years you would hear, oh no, it's not clinical data, it is not medical records, it's billing information, but I don't think that that is legitimate to hide behind.
The reality is, it is medical information. It is equally as damaging if I link it because it came from claims than if I link because you gave it to an epidemiologist. It is equally a problem, and so there has to be some even accountability. This is the world according to Sweeney.
MR. GELLMAN: Let me come at this problem from another angle, because it is something that came up constantly in the hearings that we had on privacy, and that is we have all these different people who use data. Some of them are called researchers, some of them are called public health authorities, some of them are hospital managers, some of them are law enforcement agencies, and everyone has a different gateway to the data.
At some level of generality, they are all doing the same thing. They are all sort of generic data users. Most of them are not going after individual patients. They are trying to make generic conclusions about care or payment or what have you, but they are generic.
This is a question that has come up in the context of the legislation, and no one has figured out how to deal with this. Do we treat them all the same way? Do we make everybody go through an IRB? Do we ignore all the historical precedents for all of this? This is a very difficult problem. It is very hard to regulate people in different ways when we can't define them differently. We can't distinguish one from the other, and this remains as a problem that no one has been able to solve.
DR. SUSSMAN: Richard Sussman, Aging Institute at NIH. We fund lots of epidemiologists. Let me first correct the sort of warm glow you have given epidemiologists. They have a terrible reputation for not sharing and not archiving data, and replication is a very important component of science, and they fall down very badly on that compared to many other disciplines.
That said, I think there is something very difficult about archiving and making public use files from epidemiological studies, most of them. Most of them are given a name -- Framingham, Rochester, Mayo, Whitehall, Westinghouse. They are locations. They are narrow, specific, geographic locations.
One of the features of most of the tables you put up is that you have narrow geography. In terms of confidentiality and anonymity, having detailed geography is not the only, but it's one of the fastest routes to reidentification. So I agree, it is fairly difficult to have a true public use file of an epidemiological data set that is useful.
That would then argue for your point that one really needs to have specialized mechanisms for release of those data to other researchers that in some way protects the data. I would address a question perhaps to John, what are the retributions for misuse of data?
First, before saying that, let me say that from what I have heard and have seen, the greatest danger comes not from the individual researcher, but from other government officials. IRS for example, people who have access to full files of the population against which to match this. If you don't know somebody is in a study, unless they tell, it is much more difficult to know who that person is. That is a key point when you are dealing with a sample.
The question is there is human subjects protection. If somebody has a grant, they'll never get another grant, or you can close down their study. Their professional reputation can be ruined, but there is very little legal redress against a researcher who misuses the data. I have seen very few examples of this.
MR. GELLMAN: It is about a quarter to three. I think this is as good a time as any to take a break. First I'd like to thank Latanya. You lived up to your billing. You lived up to my billing of you, and I think that was terrific. We will come back at three o'clock and solve all these problems.
MR. GELLMAN: This morning I kind of suggested that we had an objective today that was sort of the notion of a free lunch of more data available to more people without actually threatening anyone's privacy. I think it is pretty clear that I didn't expect to find a free lunch. I was hoping sort of for a free snack, but I'm not even sure we are going to get a free hors d'oeuvre out of all of this by the time we are done.
I would like to open the floor for discussion, but let me try and make some suggestions about things for people to talk about if it appeals to them. Among the questions that can be discussed is how do we deal with this? We don't seem to have made any progress in terms of -- I read some legislative definitions this morning -- I'm not sure we made any progress in devising alternative legislative definitions of anything. Everything has gotten fuzzier, rather than clearer, which is okay up to a point.
Is there a process that helps us wade through this? We have talked about IRBs. Are IRBs effective in doing this? Do they have the expertise? Do they work? Do we need something besides IRBs? Will contracts work? Latanya talked about that. Do we need statutes as back-up for some of this? If a statute is an answer, what does it say? Does it say use more contracts, use more IRBs, have stiffer penalties?
Who has the burden or who should have the burden of all this? We are dealing with data that at some level, is likely to be identified by somebody if they are in the mood to do so, and we don't want that to happen. Who bears the burden here? Should it be the person with the data, who is giving it out? Should the burden fall on them? Should the burden fall on the person who wants to use the data? Should this be some kind of joint responsibility?
We've got at some level, a lot of technical concepts here. How do we approach this thing? How do we deal with it?
The floor is open, but I'm going to begin with a question for Latanya. I want you to follow-up on a comment you made. You said encryption isn't the magic pill. Would you like to expand upon that?
MS. SWEENEY: Sure, I'll just plant a couple of things. First of all, encryption in practice is far from RSA-type high level encryption in general. In fact, most people will say it is scrambling, or it is just a look up. So that is really not encryption. All it is, is that the person who gives the data can reidentify it.
Sometimes it can be quite weak. If I take the social security numbers and I scramble them, I may be able to reidentify many of the people. Actually, the social security number is not a random selection of numbers. There is structure to it, and if I know the structure, I can begin to exploit it.
The second problem with encryption is encryption continues to keep us in an access or not mode. It doesn't have the granularity of I can get it in a more general form or something like that. So for example, if I encrypted, but there were only -- five times the person came to the hospital or five times they had a particular prescription in Gary's data. So I have one database. I know how many times they were given prescriptions. I can use this how many times to break the encryption.
So it doesn't provide us a good technique. In many senses it becomes just another de-identifier. So it loses what we think we are trying to get to.
When encryption is good is when the issue is disclosure or not. So in other words, these notions of I might have a private part of the file, or I might split the file so my psychiatrist can see one part, and my medical doctor can see another and so forth. In these ways a system of keys with encryption can be quite good at access control. So it has goods and bads, but it is definitely not the panacea for disclosure.
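The count-based attack described above can be sketched as follows. This is an illustration under stated assumptions, not a description of any real release: the names, the visit counts, the key, and the helpers `pseudonym` and `link_by_count` are all hypothetical. The identifier is scrambled with a keyed hash that the attacker cannot reverse, yet anyone who knows from another source how many times each person visited can still link the records whose counts are unique.

```python
import hashlib
import hmac
from collections import Counter


def pseudonym(identifier, key):
    """Scramble an identifier with a keyed hash (HMAC-SHA256, truncated)."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()[:12]


def link_by_count(released, known_counts):
    """Link pseudonyms to names using only the number of records per person.

    released: one pseudonym per visit, as it appears in the released file.
    known_counts: name -> number of visits, learned from some other source.
    A link is made only when a count is unique on both sides.
    """
    released_counts = Counter(released)
    pseudonyms_per_count = Counter(released_counts.values())
    names_per_count = Counter(known_counts.values())
    name_for_count = {
        count: name
        for name, count in known_counts.items()
        if names_per_count[count] == 1
    }
    return {
        p: name_for_count[count]
        for p, count in released_counts.items()
        if pseudonyms_per_count[count] == 1 and count in name_for_count
    }


# Hypothetical data: the key is known only to the data holder.
KEY = b"held-by-the-data-provider"
visits = ["alice"] * 5 + ["bob"] * 2 + ["carol"] * 2 + ["dave"]
released = [pseudonym(name, KEY) for name in visits]

# What the attacker knows from a second database: visit counts per person.
known_counts = {"alice": 5, "bob": 2, "carol": 2, "dave": 1}

links = link_by_count(released, known_counts)
for p, name in sorted(links.items(), key=lambda item: item[1]):
    print(p, "->", name)
```

In this toy data the two people with unique visit counts are re-identified without the key ever being touched, while the two who share a count remain ambiguous. That is the sense in which scrambling an identifier acts as just another de-identifier rather than as protection against disclosure.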
MR. GELLMAN: Let me just spin off of that and make one more point. That is, we can encrypt a claim or a file or a stream of data and send it over the Internet or however we communicate it. One question is, okay, maybe someone can, with an appropriate set of skills and the right computer, unencrypt this stuff.
One of the questions is, who wants it? It has to be factored into the process. It's the element of risk. Who is likely to want to actually try and do all of this? It doesn't mean that you don't have to be concerned about it at all, but it's another factor, because for a lot of this data, it's of absolutely no utility to anybody that they would be willing to go to the time and effort to do this. That may help get out of this. Sure, in theory somebody could get access to this and identify it, but it may not be a realistic possibility all the time.
MS. SWEENEY: The street market price for a medical record I hear is $150.
DR. KORN: In Boston. It's much more in Washington.
MS. SWEENEY: So as Peter Sullivan says, you stand on the corner, and he's not sure who you talk to or what you say, but somehow this record happens.
MR. GELLMAN: Your point is a fair one, but it also underscores my own point, which is if I want to buy Kathleen's medical record, I'm not likely to find it in any stream of data running across the Internet, unless I can provoke somebody to send her data in a way that I could capture it. The odds are I could sit there and look forever and I might never find it. So that $150 isn't going to anybody. I'm going to have to find another way. Walking into a record room and bribing somebody to give me a record always works.
MS. SWEENEY: But that's the targeted disclosure. There are many motives. For example, I only need to identify one person in the public use file from either Al or Easley to discredit their agency. To simply do it one time to one person is sufficient. If I can pinpoint any individual, I don't think there is any way to protect any single individual's records, because people can be compromised and so forth. So that's the $150 option.
DR. ZARATE: You don't even have to be right. You can make a mistake. A lot of agencies will say it's the publicity that damages in the long run. For perfectly legitimate reasons, it damages your ability to do your business, so you don't even have to be right.
MR. GELLMAN: Do you want to finish your point?
MS. SWEENEY: I was just going to say, but what we are noticing though in general is that it might be the case -- there is a lot of value in finding out more information, sometimes for the information's sake, and sometimes to hurt the individual. In this way, this is the new trend that we are establishing. Along that new trend, I think to understand where this new value is coming from, the nature -- IRBs have problems. I don't want you to think that, gee, I'm totally in the IRB line.
In a brief study of IRBs that I sort of conducted, one of the things that I noticed was that the nature of the requests coming to IRBs has changed over time. It used to be that a researcher would come to an IRB and say, I want this data for this purpose, and this is my study, and this is my hypothesis, and this is what I expect to prove or disprove. Nowadays more and more requests are taking the form of give me all the data on all the people and I'll see what I can find.
DR. KORN: I challenge that one, Latanya.
MS. SWEENEY: Not by epidemiologists, but by computer scientists. Like I'm saying, give me all the data on all the people, and I'll see what security I can find for you.
DR. KORN: That's not typical of medical center environments that I've been in, for sure. You wouldn't get anywhere. You would be thrown out of the room.
MS. SWEENEY: I think if you ask Ben and Michael or Easley and Al, that that will be characteristic of what they have seen.
MR. GELLMAN: Do you want to pursue this point, David?
DR. KORN: I do, but in a somewhat more general context. I would like to just offer a couple of observations that bear on this point and the others. I don't think we have a homogeneous problem. I think as the cosmologists say, it is very lumpy.
I don't think that the issues of protecting massive national aggregate databases, whether they are in Census or Center for Health Statistics or CDC or wherever are the same problems as presenting the mass of medical records, tissues, and blood samples that exist in all the academic medical centers primarily, but not exclusively. I think the shape of those problems is not going to be amenable to a single solution, true.
I think that the issue of identifiability -- I worry about that, because the more I listen to people from the information sciences profession like Latanya and many others, you get to the point where anything you offer -- I guess proffer is a nice word in Washington these days -- anything you proffer as a mechanism of reasonably de-identifying a data set, somebody like Latanya is going to say, give me an hour or give me a day or give me a second, and I'm going to be able to break that. So you're in a kind of an endless debate about identifiability.
I would make the point that for the kinds of research that I was talking about, that are the kind of conventional, clinically-oriented, medical, biomedical research, anonymous records won't do it. Most of the time, if not all the time, identities are unnecessary, but linkability is essential.
Now what would be fine for me as linkable, you might not like, because lots of things are going to be unidentifiable to me that you are going to be able to identify, and that's why I don't know how to deal with that issue. There is no absolute on identifiability. What would be reasonably unidentifiable to me would not be necessarily reasonably unidentifiable to Latanya.
MR. GELLMAN: Let me say, I think I agree with both of your points. I'm willing to settle here for consensus on small things, if it's a start. One is that we don't have a homogeneous problem, and that's an important observation, and if anyone wants to take issue with it, they are welcome to.
Also, on the second point, I agree, and that's kind of why I tried to frame this discussion in terms of process, because I don't think it's just all we have to do is put the right words in the right order in some definition, and we're going to solve the problem. We're going to need a process to address this. I agree with both those.
DR. KORN: Okay, well, then just to finish the generalizations, I think that because I do not think that a perfect solution around the issue of identifiability is possible, I think the approach needs to be partitioned a little bit according to the clusters of problems that we are trying to deal with.
I have been arguing, and I will continue to argue that for the kind of stuff that I'm talking about, which is different in many ways from what the big data managers are talking about, I think more emphasis needs to be put on confidentiality assurance of the database, and protection of the confidentiality of the database, and behavioral expectations, and penalties for violating those expectations than messing around with eternal debates about what is reasonable identifiability.
I think that the way to solve this is to go after a set of rules that recognize that the data that is going to be flowing within those rules is identifiable, not to me necessarily, but to you, and say fine, we can't help that. It's going to be identifiable. Let's now do the best we can.
Like the bank -- we can't guarantee the bank isn't going to be broken into, but you can do a lot to make it difficult to rob money from the bank. That is an approach for some of this problem that I think would be a heck of a lot more fruitful than sitting here for 25 years arguing about identifiability.
Finally, I find -- and I don't want to make enemies with my very good friends in the room who are outside of academia -- but I never understood why if A.G. and I are sitting side-by-side in two offices in one academic medical center, and I want to review the angioplasty procedures of the last year for some intellectual purpose, I go through an IRB.
If she wants to have someone in her staff as a manager let's say, or a hospital executive, review the very same data set for cost effectiveness studies or whatever, she can just do it ad libitum without anybody's oversight at all.
I'm not interested in making life more difficult for the management of hospitals and health care systems or industry, but from a point of logic, there is no sense to that kind of discontinuity, none at all, unless you believe that I and all academic researchers are inherently less trustworthy than all these other folks that have access to the same database.
Finally, be careful, I would urge, about the IRBs. IRBs are volunteers in the academic setting. Now there are some for-profit IRB mechanisms developing, and I don't know enough about them to say a word. In an academic setting you've got people whom you wish to be active clinical and basic researchers, so they know what the issues really are that they are going to be reviewing.
You don't pay them. You impose heavily on their time to do a kind of burden of drudgery, and every discussion in the last couple of years about these issues, and a lot of the congressional discussions too have always come down on adding mega tons of obligations on IRBs. I just don't think the IRB system will tolerate it.
Now we may want to go to a totally new paradigm of oversight. I'm not suggesting it. But loading up on these volunteers -- good people won't serve. They've got better things to do. The people that will want to serve are people who don't have good things to do, and you don't really want them. It is going to break the back of the system.
MR. GELLMAN: A.G., do you want to make a point?
MS. BREITENSTEIN: Yes. David and I and Latanya, actually by derivation, just because she and I talk so much, have been having some great discussions down here, and I think actually you hit on it.
I think the history of this debate, at least as I have traced it, and not for very long, but for as long as I have, has been sort of the tension between confidentiality and security as two different issues, and trying to figure out which of those two issues is really going to help us in determining what questions need to be asked, and how they need to be handled.
I think Latanya's research, which again, I have always had the benefit of being in Boston with her, so that has always been a great thing, is to realize that the confidentiality and security policies not only have to interlock, but have to interlock along a sliding scale. The extent to which you are dealing with data which is not handled with the utmost security or with technical protections requires that you have that much more additional confidentiality restrictions.
So the challenge, at least legislatively is to craft language which is as sort of nuanced as we have seen is available in terms of data manipulation. It can really slide along the same scale. I think that from the perspective of someone who works with legislative language, that is a real challenge. That's a real challenge to figure out how to make language that is that flexible, and understandable at the same time.
I think, just in terms of some of the parameters that you set out, in terms of limiting data fields, figuring out what is available publicly, what your sort of quasi-identifiers are, we can do that. We can figure out how to draft language that reflects that.
I want to say again, and I think at least in terms of the work that I do, this is a huge issue, is the prohibition of re-identification. Actually, Janlori and I talked quickly about this, and she reminded me of the Civil Libertarian concern about public records, and I think that is another thing that needs to be balanced.
I am also talking primarily in the private realm, with private information that would not fall under some of those requirements. I think we have to attend again, to balancing the burdens. We can't put all the burdens on those who are disclosing the data.
We have to put some burdens on those who are using the data, and saying that if you are using data under the notion that it is unidentifiable or de-identified, that you have to stick to that. You can get in trouble for doing all the fancy stuff that Latanya does, because that is a real problem. We can't guarantee against that.
The other thing I want to say is that I operate from a perspective of a privacy advocate. I think that the other thing that needs to get built in as we consider the issue of identifiability is to recognize that the enforcement of this system, regardless of how we create it, is going to require a lot of attention and diligence on the part of individuals.
Now most individuals don't have time for that. We have a Credit Reporting Act that is, for better or for worse, you can judge it yourself, but a lot don't avail themselves of the rights that they have under the Fair Credit Reporting Act.
So I think the other thing we have to deal into this debate is figuring out how to put systemic checks and balances into the system. We have to figure out how to get the entities that are involved to really sort of be looking over each other's shoulders in a way, and to be again, sharing that burden such that we don't put it all on individuals to enforce the system for us.
I think the formulation of that was contractual arrangements between the parties, and some retained liability on the part of those in that contract to be keeping an eye on each other. So if Ben shares this data with me under some contract, that Ben and I have to keep an eye on each other that we are actually adhering to that contract.
If I am an individual who is aggrieved by a violation of that, that I can sue both of them or one of them, depending on who I choose, and how I choose to do that. So that those systems actually are keeping each other balanced, and we're not putting the responsibility on the individual.
So I guess what I'm saying is the accountability in this sliding scale that we have here also has to involve all of us as we try to figure out how to craft those things. Without that, we are really going to be creating a system which just isn't going to have a lot in terms of real effect.
MR. GELLMAN: Well, the language we want is a piece of cake to write. The language says there shall be reasonable and effective administrative, physical, technical, and statistical safeguards, and the secretary shall write regulations.
You say you want more institutional involvement in this, but he says IRBs can't do it. Are you for IRBs? Can they do it? Can they be enhanced? Do we need somebody else? Do we need something else?
MS. BREITENSTEIN: I think he is absolutely right. The formulation of IRBs right now is such that they are voluntary. You can pretty much make up your own IRB. I can go out and formulate an IRB tomorrow if I wanted to, and in fact I did frame the IRB in my organization, and it was not difficult. If I can do it, probably anyone can do it.
I think what it is, is that we have to be creating -- and I think these are some of the discussions that came out of the NCQA-JCAHO joint session was to actually have more -- and you guys work with Census and NCHS -- is to actually have people whose responsibility it is to actually look at this stuff. Look at who is asking for the data; what data fields are they asking for; what, in combination, can be used to identify individuals.
It is not necessarily IRB per se. It is the function that we are detailing as being important that has historically, at least in that context, existed as a concern to the IRB. This is some of what I think Latanya was alluding to, is that we are not asking those questions.
We need some mechanism, and I don't know whether it is IRBs or some other formulation, but we need some mechanism which is going to be the choke point for that, and which is going to retain accountability for enforcing those decisions or those evaluations on the person who is coming to that organization, board, whatever you want to call it.
MR. GELLMAN: Just to argue with you for a while, if I say, okay, you are the data holder, and you have to do everything you just said, and you can't use an IRB, because we don't think they work, am I creating some new barrier that is going to say, okay, my answer is you want data? We're not giving it out. It's too much of a burden on us.
MS. BREITENSTEIN: I'm not saying it can't be an IRB. I'm just saying that we have to be specific, and we can't sort of throw it all on IRBs either. The IRBs right now don't have any of that expertise. None of them have that expertise.
MS. GOLDMAN: But they are charged with applying that standard. That's where the conflict comes in. The IRBs, at least under the federal regulations, are charged with requiring informed consent before identifiable data can be released, and if the researcher argues, well, I can't get informed consent, it is not practicable, and they can show that it's either not practicable to get consent, or that the public interest in the disclosure outweighs whatever privacy interest there is that would necessitate the informed consent, then it can be released in identifiable form.
So they are already required to apply that standard. What we found is that it's a tough standard to apply, and it is tough to weigh. When you are looking at it from a public health perspective, the privacy interest doesn't always seem so salient.
So I understand that this might not be the right place to put it, but it is there now. That responsibility is there now. The question is how was it applied? We don't really know much about it either.
I'm hoping in the next couple of years that we can see some kind of research done on how the IRBs do apply that standard, and how they weigh them. What expertise is brought to bear? I don't think we know much. We hear sometimes as David was saying, you would be thrown out if you had such a broad and sweeping request, but I have heard from other people different stories. I think there is probably a pretty wide spectrum of behavior in terms of how the standard is applied.
MS. BREITENSTEIN: If we were at a point now where we could say that if we looked at the IRB standards, and we raised them -- whatever we did to them -- and that would be the choke point for all information, we would be miles ahead of where we are now, which is that the vast majority of data usage never has to have any involvement with an IRB.
Which is what I'm saying. I'm not saying IRB, IRB. I think the IRB may be far ahead, or be a model, or be some mechanism that we can look to with some usefulness. Until we corral all of the uses, to have to go through some choke point and some evaluation, we can't even get to the question of how good is that evaluation, because we're still dealing with 80 percent of the data.
MR. GELLMAN: Elizabeth.
MS. WARD: Thank you. It seems to me that part of what -- I don't know whether my experience in the state of Washington is unique, but there is a communication between the IRB and the data steward. The IRB does not have all the responsibility for determining exactly how many pieces of that data. So that the data steward takes hours of time dealing with the researchers' request about why do you want it, why do you want it in this form?
The level of precision that Latanya was describing I think is something -- I don't think it is something you put into law, but it certainly is a best practice policy issue. I think everybody could get better at it, but that has to get linked with the IRB process.
I think some of the proposed legislation we have seen talking about designating data stewards and somehow raising that -- but we certainly don't give data out until there are both pieces of that going on, about knowing exactly what kind of data and what form you are going to give it, and hopefully doing some of the stuff that Latanya is talking about before that person then goes to the IRB and says, this is generally the purpose of my research. They are complementary activities.
MR. APWELDIA(?): Gabe Apweldia with Envoy(?). I would like to make a comment about this IRB discussion and then a question to the panel. We can put some technical barriers like encryption, things like that. We can put some procedural barriers, like an IRB.
I think that we also need to have some disciplinary methods for trespassing and enforcement of whatever statute applies. A few years ago carbon copies became possible with a copy machine. The Department of Treasury made the $50 bill and $100 with a little metal strip so you can't copy it in the copy machine. Still, if you go to Kinko's and make a copy of them, it is illegal, and you go to jail. So it is not just the technical barrier, it is also that it is illegal. So I think that those three -- technical, procedural, and a legal framework -- complement each other.
The question I have for the panel, and I would like to see this addressed, is what happens with those entities that do not have an IRB, that are not under the control of such a board? For instance prescription benefit managers, PPOs, HMOs, the clearinghouse industry, which I'm working in. What kind of regulations, what kind of restrictions, limitations -- how does it work for us that are not in an IRB-controlled environment?
MR. GELLMAN: Latanya?
MS. SWEENEY: I would say that the point is one of an overall review process. It doesn't necessarily have to be IRBs, but as Janlori pointed out, some people point to IRBs, because they are in fact already empowered with that responsibility.
Easley and Al both talked about review processes. In fact, most statistical offices throughout the world have such panels that you have to go through before such data can be released. The point is that I think that has to be required.
It is not that I'm saying it has to go to the IRBs or to the review process -- we can talk about the way those are formed, and those are within their agencies as well, for example -- it's just that you don't simply turn a person loose and have them kind of figure this out on their own.
The techniques that we saw today, on the one hand they distort the data, but more importantly they give us measurements to tell us how sensitive is the data, and where that sensitivity lies. This is nice from the viewpoint of a panel, because when someone asks me for the data, and for me to give the data exactly the way they want, I can at least tell you how much risk I have taken on, what was vulnerable, and where I have assumed that risk.
That is a very nice metric for a review process to use. It is similar to the kinds of metrics that Easley and Al, through different mechanisms, have also been employing for years. They have good notions of what is sensitive and so forth, so they get a good measurement. I'm just trying to make the process bigger, in some sense just modeling what statistical offices throughout the world do already, but in other ways.
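[Editorial illustration, not part of the spoken record.] The kind of measurement Ms. Sweeney describes, one that says how much risk a release carries and which fields carry it, can be sketched very roughly as a count of how many records share each combination of fields. This is a minimal sketch and not the technique she demonstrated; the records, the field names, and the helpers `group_sizes` and `risk_report` are all hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical toy records; field names and values are invented for illustration.
RECORDS = [
    {"zip": "02138", "birth_year": 1960, "sex": "F"},
    {"zip": "02138", "birth_year": 1960, "sex": "F"},
    {"zip": "02138", "birth_year": 1945, "sex": "M"},
    {"zip": "02139", "birth_year": 1972, "sex": "F"},
    {"zip": "02139", "birth_year": 1972, "sex": "F"},
    {"zip": "02139", "birth_year": 1972, "sex": "M"},
]


def group_sizes(records, fields):
    """Count how many records share each combination of the given fields."""
    return Counter(tuple(r[f] for f in fields) for r in records)


def risk_report(records, fields):
    """Return the smallest group size and the share of groups-of-one among all records."""
    sizes = group_sizes(records, fields)
    unique = sum(1 for n in sizes.values() if n == 1)
    return {
        "fields": tuple(fields),
        "smallest_group": min(sizes.values()),
        "unique_fraction": unique / len(records),
    }


if __name__ == "__main__":
    # Adding fields one at a time shows where the sensitivity comes from.
    for fields in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "sex"]):
        print(risk_report(RECORDS, fields))
```

A review panel could read such a report as a statement of what was vulnerable in a requested release. As the discussion that follows makes clear, it says what is sensitive, not what level is acceptable.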
MR. GELLMAN: Let me follow-up on that for a second. Can you statistical people come up with a measure that we could apply? Say everybody's got to get to this point on a scale before they can get data? Is there something out there that is universally recognized that you use internally, and we could rely on?
MS. SWEENEY: Let me just say I didn't mean it that way. I mean the opposite. I can say what is sensitive, not what is the right level.
MR. GELLMAN: I'm just fishing here. Anybody have any ideas?
DR. ZARATE: Not as far as I have seen. I think that Dr. Korn has made the most important point here. It is going to vary by the situation.
MR. GELLMAN: I'll take more than one standard. We'll have three or four.
DR. ZARATE: Yes, and clearly there is a need for a tremendous amount of education on the part of various people. I don't know what is so unreasonable about requiring that every time you have a data set, you have a review mechanism if you intend to release that into the public. That review mechanism has to meet certain minimal requirements.
We can develop packages in which we give them what the requirements are, the minimum requirements. If they want to go to the maximum, that's fine. They can recruit experts to help them with this. They can review different kinds of packages. It may be that in some cases the rudimentaries are all that is necessary and you've got it. In other cases you may need something very, very complicated, so it is going to vary with the data set as well.
One of the things that we just started on in this interagency committee that I talked about before was a brochure that we could circulate to people in more or less conversational terms, because Bergan's statistical agency has already got one of these. They tell people what a survey is and what survey practices are to maintain people's confidentiality, and then what they do with the data, and what they are entitled to do with the data, and what laws are, and what responsible agencies do.
It is meant to educate the public, but I don't see why something like that couldn't be done. We are trying to take off on that, at least within our own realm, because we want to concentrate on what we know about, not what we don't know about, and do the same kind of thing for statistical agencies elsewhere in the federal government.
These are the things you have to be concerned with. This is the law. These are the general techniques. If you want to know more, here are the resources. It seems to me that this education is something that we can do now without legislation.
MR. GELLMAN: Ben, what do you think of the idea of this review panel? Does that bother you?
MR. STEFFEN: Well, a review panel is something that we think will be a critical part of the data release, so we are very comfortable with this. My boss's admonition, as I left for this, was, be sure to talk to people about participating in review panels.
MR. GELLMAN: Do you think other states will agree with you?
MR. STEFFEN: I think that in terms of Maryland's current data release, the hospital discharge data set, which by the way, I think on average we are talking about 100 releases of the public use data set on an annual basis. There is not a flood of demand for these things.
I think in the private sector, when I think of the organizations that were involved in the data business ten years ago, most of them have fallen on hard times. Maybe our IMS friend can tell us about that, but it's not an enormously profitable business in my judgment. Nonetheless, I think we should be very vigilant about how this is released.
I don't think there is a flood of demand for these data files in the private sector. I think we have somewhat misplaced some of those concerns about how many releases there will be. Nonetheless, for those that request them, I think a process like a review panel is critical if you are moving much beyond a very limited set of data items.
MR. GELLMAN: I wonder if private sector demand won't increase when they all read Latanya's articles, and figure out that I can take your anonymous data and start making mailing lists out of it? A rhetorical comment.
Gary, I want to drag you into this. I don't want to focus entirely on IRBs. I have heard -- I think we all have -- the problems with IRBs discussed for a long time. I want to talk more about contracts. You said you use contracts. Can you talk more about them? What do you tell people? What are the dynamics of this process?
MR. FRIEND: I guess I'll take two as an example, and I don't have the language memorized, but conceptually I can tell you what is in them. Every employee at the start of employment, signs a confidentiality agreement. It's a one page agreement, and there is a specific section that draws their attention to the fact that they are going to be potentially handling information.
It is the usual things -- protecting intellectual property rights, but it also sensitizes them to the fact that it may be sensitive information, and that under no circumstances is it to be released, disclosed, et cetera, et cetera. There are legal ramifications if they violate it. So if you look at the different stakeholders that handle information, employees are treated in that way.
In a customer contract, it is fairly explicit in how it is binding in terms of how the information can be used. It is explicit in not allowing for public disclosure of it or release outside of the specific purpose for which it is being obtained. So when a customer -- and this happens all the time -- we get a call.
Somebody is saying, we're doing a research paper, or potentially it could be Glaxo where they are going to be giving a speech, and they want to cite some of the information, we have to see what it is and approve it to ensure that again, there is not any potential risk of compromising either: (a) from just purely an intellectual standpoint, or (b) if it does head down the road to where it starts to touch what we call the third rail.
MR. GELLMAN: Anybody else have any thoughts about contracts? Do you like them, not like them? You've thought about this probably more than anyone, Latanya. What do you think?
MS. SWEENEY: Actually, in my original thinking about contracts with respect to the data is where I think you are referring to, it actually came from the statistical community working paper 22. When the federal government statistical offices were trying to set federal guidelines of their internal practices, they made strong use of contractual arrangements.
It doesn't say it, but in talking to some of the people who were around at the time -- in fact some of them are here, and they can just -- but in talking to some of them who were around during the construction of those documents and so forth, and when those decisions were made, they were saying that that is because they were taking their best guess at what was the right release point and what were the right release strategies and so forth. So these were ways to help protect them.
MR. GELLMAN: Suppose we set up a statutory recognition of these kinds of contracts, and we told the secretary to write model contracts for various purposes for people to rely on. We said this was a prerequisite for some of these uses. What do you think about that A.G.?
MS. BREITENSTEIN: I think you've got to make individuals who are the subjects of information explicitly recognized third-party beneficiaries in the contract, with at least rights to enforce or to bring some sort of case for redress of a grievance under the contract.
I think contracts are effective to the extent that they give again, systemic weight to what you are trying to effect. You really do have to give the individual some identity within that contractual system.
MR. GELLMAN: Okay, that's fair enough.
DR. KORN: I'm getting confused on to whom these contracts apply and in which circumstances would they apply? You are only talking about these massive population bases that the Census Bureau --
MR. GELLMAN: No.
DR. KORN: Well, to whom would it apply?
DR. ZARATE: There are a variety of agencies that have these. The National Center for Education Statistics has legislated a basis for developing licenses with users. The Bureau of Labor Statistics has got it.
DR. KORN: These are large, aggregated databases, and the public use thereof.
DR. ZARATE: As far as I know, yes.
MS. SWEENEY: No, my argument about the contractual arrangement is that it is actually for the more sensitive. It's national data, because it's from a federal agency, but --
DR. KORN: I don't know what "more sensitive" means.
MS. SWEENEY: It means that it is not public use.
MS. BREITENSTEIN: It's like the GIC data.
MR. HOY: That was why I was making the distinction from our research data centers, the access that you have in those data centers versus the public use file. In our public use file we do recodes by collapsing categories. We do top coding of sensitive things like income and other kinds of financial information. That is all public use file. We really swap data and we add noise, and we mask it.
In our research data centers, it is as if the people inside the Census Bureau itself were looking at it. It's like a contract, because when we bring people in because of their research proposal, and we accept it, then we basically swear them in as a Census Bureau employee. So they are subject to the penalties just like anybody else who is employed by the Census Bureau.
There is a penalty stated in law, Title 13, that says if you reveal the data, you are subject to a maximum fine of X dollars, and imprisonment of so many years. So in a way we are entering into our contract with the researcher. It is a temporary thing.
What happens is a year or two after their research is over with, then we strike them off the rolls as being eligible as a special sworn employee. It is a temporary thing. It's not a permanent thing.
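[Editorial illustration, not part of the spoken record.] Two of the public use file techniques Mr. Hoy mentions, collapsing categories and top coding, can be sketched in a few lines. This is a minimal sketch under stated assumptions: the income cap, the ten-year age bands, and the function names `top_code`, `collapse_age`, and `mask_record` are hypothetical and are not the Census Bureau's actual parameters or procedures. Data swapping and added noise, which he also mentions, are not shown.

```python
def top_code(value, cap):
    """Top coding: report any value above the cap as the cap itself."""
    return min(value, cap)


def collapse_age(age, width=10):
    """Recode an exact age into a coarser band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def mask_record(record, income_cap=100_000):
    """Return a copy of the record with income top coded and age collapsed.

    The cap and the band width are invented values for illustration only.
    """
    masked = dict(record)
    masked["income"] = top_code(record["income"], income_cap)
    masked["age"] = collapse_age(record["age"])
    return masked


if __name__ == "__main__":
    # Hypothetical respondent; the original dictionary is left unchanged.
    print(mask_record({"age": 47, "income": 250_000}))
```

Both steps trade detail for protection: very high incomes and exact ages are the values most likely to single a person out, so they are the ones coarsened before public release.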
MR. BERRY: At HCFA, anybody who wants identifiable data must sign what we call a data use agreement. It has, under the Social Security Act, some penalties associated with it. You will not get identifiable data unless you come through with a formal request; you have a research protocol; you sign a DUA, data use agreement; and you pay for the data. Then it is shipped out, and we keep track of everything that goes out of there.
MR. GELLMAN: What are the terms of the data use agreement?
MR. BERRY: It's an eight page document. I can get you a copy, but I don't have one with me. It is a dynamic document. Helen Mayhugh back here, her shop has a responsibility in HCFA for the development of the data use agreement. She is in our FOIA and Privacy Act office. The document is updated on a pretty regular basis. It has just undergone a major overhaul.
It has conditions for receiving that data, and the conditions of re-release or re-use of that data. It is laid out in each paragraph. It has to have the signature of the person who requested it, the custodian of the data, and anybody else who touches it. It requires in essence that, if you wanted to re-use or re-release this data, it has to be with the express, and only with the express, permission of HCFA.
MR. GELLMAN: Are you on this point of contracts? Do you want to talk about that? Hang on then.
MS. SWEENEY: I have one thing to say. With respect to the contracts, this is where I always get amazed. I just have to share with you, because we heard about the stiff requirements from the Census Bureau. We heard about the stiff requirements from Social Security and so forth, but when it comes to medical data in the hospital setting, you talk about going to an IRB, but the reality is that most of the people in IRBs hold two hats. The researcher holds two hats.
As a doctor in a hospital they are allowed to see any record they want, whenever they want, in any way they want in many institutions. That kind of open access, it is only under certain circumstances that they may choose to go -- a lot of research is done without an IRB, because in fact they just simply have access to the data.
Then even if they do go to an IRB, like I said -- I know very few IRBs that actually require any type of contract. I think when you look at the medical data, and you look at the harm that can be caused, and you see how much more is required for me to get Census data, and how much is required for me to get HCFA data, it is kind of amazing.
DR. KORN: Okay, that's just exactly where I thought this was going, and here is where you and I really part company. I do not accept as a general proposition, and I would hope that this committee would not accept as a general proposition that in -- I have to talk about academic centers, but that's where most of this kind of research gets done, although it is not exclusive.
The standards of enforcing the common rule on institutions that are dependent on federal funds for research are generally maintained and upheld with a great deal of stringency. Now you know there is no activity in life where someone may not violate a regulation or a law, whether it is going through a stop sign or God knows what. The fact is, as a rule, I would make the argument that academic institutions take extremely seriously, their responsibilities to enforce the provisions of the common rule with regard to research involving human subjects.
Now as an entirely separate issue, it is true that historically within these institutions, there has been a lot of ad libitum access to medical records by the physician and related health professional staff in the course of dealing with the care of patients. I think that that also is being tightened up, particularly as institutions move to electronic medical records, where there are capabilities both of identifications for access, and limitations of what piece of the database can be accessed by some standard of need to know.
We are nowhere near uniformity on this. The investment financially and otherwise in converting the entire record system to a state-of-the-art electronic system is not going to happen overnight, but that is certainly moving in that direction.
MR. GELLMAN: Let me pursue the first point, that academic centers are very careful about this. Before you said we can't really distinguish between academic research and management use, and all these other uses. So now you are pleading a special argument -- or are you going to say the same thing about all the other users, that they are all trustworthy as well?
DR. KORN: No, I'm not talking about trustworthiness. I'm saying that the rules that apply are adhered to pretty seriously in academic medical centers. The rules, as they apply, do not require that there be oversight through an IRB mechanism of management study or management research of clinical records. That's a federal legislative issue or federal regulatory issue.
I know that in most places that have federal research funding, which is all of them that do research, they will actually require that non-federally funded human subjects research will go through that same IRB process. They will not make the exception on the basis of source of funding, even though they don't have to make that requirement, they do make the requirement.
I do not believe that they put management under the same requirement. That is the distinction that I'm making. As I said before, although I don't want to create internal warfare within the community, I don't understand the logic of why the common rule of protections ought not to extend to all human subjects research done in the United States, no matter who does it, and no matter who pays for it. I don't understand why that is not something we ought to support.
MR. BERRY: What common rules?
MS. BREITENSTEIN: The common rule.
DR. DETMER: Or as an interesting alternative, if in fact there are problems with IRBs, and all the issues that you raise if you add whole additional constraints and considerations, plus the quality constraints that you try to operate on in terms of doing quality management, put everything under a contract. Just put everybody under a contract on these things. That's another option.
In fact, the researcher knows that they must meet certain considerations, and if they do not, they have certain constraints. I am just asking, because the idea of trying to do all of the QA kind of work through an IRB mechanism strikes me as shutting down health care. I just don't see how practically you could even gear up for that, and you acknowledge that they might not be ideal if you did.
DR. KORN: That's fine. I'm not pushing to create more impediments to an efficient, cost effective health care system, I can assure you of that.
DR. DETMER: I didn't hear it that way. I'm just thinking. I'm saying I think IRBs have had some merit. It's also fascinating to me that a number of researchers who are very capable, have to go to the IRB every single time for every study they have done, when they have never had anybody ask for a semi-colon or an apostrophe in five years of multi-million dollar work.
Now at one level you think, well, okay, maybe that's an administrative burden that is worth it, because there is a distribution, and there is a learning curve, and everybody ought to do this, because it's something we believe in. By the same point though, I think those folks would understand a contract strategy very, very well.
I'm just trying to get to this issue: how can we aim the process? There is a lot of discussion going on. One is what do we want to assign, as a society merging into an information age, to privacy not as an instrumental value, but as an intrinsic value. Not as an instrumental value, where in fact it's a helpful thing for a variety of purposes, but in and of itself.
That is in fact, I agree with you, a debate that will go on for some time. I don't think we're going to cut that one in the time in which Congress makes legislation, although I must admit, the time it's taking them to get around to this and passing something is not trivial.
By the same token, I think there are serious exposures, and there are probably processes that can and should be put in place. Some already exist, with their faults and their strengths. I think the question is to the extent that we could put into place, mechanisms that would have broader understanding and broader applicability, and still have validity and have rigor, I think that would be a real good thing.
Instead of having one group living with contracts and another group living with IRBs or some mixture of this and that, if you could strike some processes that had sufficient clarity. This is the other issue, are you after privacy or after confidentiality or after security? Those are not identical terms. Again, how do you in fact try to strike something that is workable?
I would argue that to try to strike something that strikes at the philosophical dimension of privacy as policy is going to be very tough. To try to get down closer to security and to confidentiality kinds of issues, I think we have made progress, and can continue to make more.
MS. ALEXANDER: I'm Lois Alexander, formerly with Social Security. A contract can be useful during the terms of the contract, but when the contract ends, and the database continues to exist, you need to have some mechanism for the disposition or for the return.
The other thing I wanted to say about contracts, we found them there, and also when I was on New York's DPRB, the review board of the SPARCS system, we found that the contract usually required supervision. It required some level of analysis to determine who should get the data, and what they could do with it, and then also supervision to make sure that they followed the rules, and that the identifiable file, if it was released in identifiable form, was then returned and not re-released to other users. So there is a fair amount of oversight that is required regardless of the contract.
MR. GELLMAN: Who did the supervision?
MS. ALEXANDER: The board didn't, but the administrative people associated with the DPRB actually did a fair amount of supervision, and did some on-site monitoring. When you get big files with a lot of very sensitive information that can be identified to individuals, why usually some level of follow-up is needed.
Also, I think the point that I started out with, a contract usually has a term. You want to be sure that you know what is going to happen at the end.
MR. GELLMAN: Gary, are you still talking about contracts?
MR. FRIEND: Yes. I guess a point that I made earlier, which is the search to answer today's cosmic question which is identifiability versus non-identifiability. A straw person theme that I have been hearing from today's discussions is that you start from the premise of two components: a record relieved of the individual's name, address, phone number, social security number as step A, and a commitment or obligation, be it contractually or legislatively, that the net information may not be used, merged, combined with something else that will allow for reverse engineering unless there is consent.
Again, we have been talking about contracts. We have been talking about how much do you have to delete. Latanya showed that technology is going to keep pushing the envelope. We keep stripping and stripping and stripping, but there will still be ways with computers and databases to reverse engineer it back, and we either say well, we're going to shut down -- at some point research that uses information in responsible ways will choke to a halt, or to say we've got to find a compromise.
The compromise is to take a white-out approach that worked in a paper-based world, combined with a legal mechanism that says that if what you are looking at is harmless on its face, you cannot in turn do something with it to make it harmful, which means you can't reverse engineer.
MR. GELLMAN: Let me phrase this more precisely in legislative terms. The search in all of the bills is to say we are controlling some kinds of information, because it is identifiable, because it has a degree of sensitivity attached to it, and that is subject to regulation, but there is other data that doesn't reach that threshold that is not subject to regulation.
So what you are saying is that we are no longer going to talk in terms of identifiable or non-identifiable data. Maybe it is controlled and uncontrolled data. Maybe that's not the right term, but now we are going to say health data is all subject to regulation unless, then you sort of go through this process that it doesn't have certain overt identifiers on it, and it is released through a process of control, a contract or what have you.
So it's not absolutely a non-identifiable, but in this context it's not overtly identifiable, and the person who has the data, has agreed not to engage in Latanya's transformations if you will.
MR. FRIEND: Right. I think what we have been saying is that theoretically everything that we can now do today with the technologies Latanya demonstrated, theoretically it was possible 30 years ago if somebody took enough manual records. So it's not a question of it was never possible, it's just easier today than it was.
I think all we are trying to do is say all right, that it is easier. Is there some other method to put some friction in it? It is clear that technology has taken away the friction.
MR. GELLMAN: This is in effect, a different concept. We are not dividing the world up into identifiable and non-identifiable records anymore. We are basically creating an exception for some kinds of health records that have certain characteristics and certain controls attached to them. That may be a workable solution.
DR. ZARATE: In shifting the locus of responsibility, it seems to me what you are doing is saying that now the original data holder will not do everything that is required to ensure that data disclosures do not occur in releasing this data set, to saying that we're going to develop a contract, whatever standing this contract has, and hope for the best.
That sounds weak to me. It sounds as though we are saying that we are going to presume upon the security measures and the understanding, the statistical sophistication, and the goodwill of whoever we can get to do a contract.
MR. GELLMAN: Let me rephrase it, and see if you feel better. It's not, this data is regulated and that data isn't. This data is regulated very carefully and fully and all of its uses, and the other data is regulated at a much lower level. We still have penalties. We still have security, but it doesn't have, for example -- if it is health data, if you look at any bill, it says there is patient access, notice of information practices, right of correction, all of these rules that apply to anything that is identified as protected health data.
That doesn't apply to this other stuff, because it is essentially -- I don't know what word to use. I'm afraid to use de-identified. It is not as identified as the other data. It is not only protected by a set of security rules and penalties, but it is also protected under a contractual arrangement. Does that make you feel better or worse?
DR. ZARATE: Not much.
DR. DETMER: What I hear you saying though is that we are not going to consider privacy either intrinsically or instrumentally valuable. In fact I would say that society is clearly moving to where it sees both of those things as good. So I guess from my perspective, I don't see us sort of just totally throwing it out there, because I think as the methodologies come along to allow us to in fact do this, I think people want that.
DR. ZARATE: One of the things that I was going to mention before is that both the Bureau of Labor Statistics and the National Center for Education Statistics do have audits of the files that they release out under contract. They periodically go out and scare the bejesus out of these people by showing up at their door, and saying I'm here to check this data out, where is it and who's got it?
Sometimes they find things that they wish they hadn't found. There are practices that we always sit around in rooms at NCHS and say, what if? Some of them are things that we would worry about -- people having blanket access to data. You say, well, it's all well and good to say that only authorized people can get at this data. So what do they do? They authorize everybody in sight, and that's not what we intended.
Now contracts can be interpreted. They can be misread. We may be light years away from the on-site -- we would like to say we can jump in and look at everything and at any time, but we don't have the resources to do that. It takes a considerable amount of oversight and resources to do that kind of thing. It sounds good on paper, Bob, but in terms of implementing it, I don't see it as working.
MR. GELLMAN: What if we put the burden on the data recipient, not on you? They have to have an independent audit. They pay for it.
MS. BREITENSTEIN: Then you're letting a third person into the data.
MR. GELLMAN: Well, maybe, but you can't have everything. We have independent auditors.
MR. HOY: They can't be independent if one person is paying them.
MR. GELLMAN: There are ways of controlling that.
MR. HOY: There is an infrastructure involved in this. When we set up these data centers, you notice we're not jumping up and down and going from two up to 100 right away. We're going from two to four, maybe six or something like that, because there is an infrastructure cost involved in this. It isn't free.
MR. GELLMAN: It also occurs to me of course, that there is a ton of health data being moved all over the place. It is nice to talk about this contract or what have you in the abstract, but if you all of a sudden begin to see all the connections and all the requirements, it could be an overwhelming sort of administrative structure, with everybody signing pieces of paper with everybody else, and end up having no meaning.
Gary, in your case, you have a very express business interest in protecting your data. You may be doing this on behalf of data subjects as well, but your business interest is we don't want anyone giving our data away, because then we don't have any customers anymore. That is fine. That structure doesn't always exist in other cases, because the people who get the data may not care.
MR. BERRY: Bob, believe it or not, we've been talking about this interagency committee and trying to get active on it. We've had people up talking to us. That may surprise some of the people in both the public and the private sectors that we talk.
The problem is if somebody hacks in, they don't get one record, they get potentially millions of records. If they come in the stream and get the access code and pick it off, bang, it is just an incredible amount of data going over those wires.
MR. GELLMAN: Well, I mean that's true, but at some level there is nothing we can do about it.
I would like to go back to the people in the audience. There was someone who stood at the microphone for a while. Did he leave? Well, then it's your turn.
MS. JACKSON: Dawn Jackson, Blue Cross/Blue Shield Association. I have a point on contracts. Just rest assured that this subcommittee isn't the only entity looking at this issue. The National Association of Insurance Commissioners is having conversations on just this issue.
That leads me to my question. I would like to know if the subcommittee is looking at applying the contract principle to federal agencies and researchers within that context, or is the committee also looking at suggesting, recommending the use of contracts with the health plan vendor, in that environment, in that relationship? If you are, that would be exceptionally problematic to health plans.
MR. GELLMAN: The answer to your question is we don't know what we are thinking about. We are just sort of kicking around ideas here. We are not anywhere that close to -- tell me why it would be difficult.
MS. JACKSON: Well, I'm not an attorney, but I talked to our plans, and one of the major issues for them is when you have a contract apply to a vendor, that potentially sets up liability for the health plan if the vendor doesn't hold to the contract. We don't see that as being very useful for subscribers, patients or the health plan.
MR. GELLMAN: That is an issue raised in all of the legislation that has been proposed. I don't think it's all that clearly defined in any of those things, but one answer is you pick your vendors very carefully. If you haven't been negligent in doing so, then you won't be liable. No one has proposed strict liability for health plans, and the vendors would have an independent obligation both to the plan and to the data subjects. If they violated it, they would be subject to criminal penalties and to civil lawsuits. If you negligently picked a vendor, you might be as well.
MS. JACKSON: I don't know, in a perfect world, you are absolutely right, but the relationship between a health plan and a vendor is not the same as the relationship between a health plan and its employees. Clearly, in the scenario that you laid out, a health plan can have recourse with its employees, but it's pushing the envelope a little bit to hold health plans liable for the actions or inactions of their vendors.
MR. GELLMAN: Well, that's a fair point. I think it's something to keep in mind.
MS. BREITENSTEIN: Can I ask a question? When we are talking about contracts, we are, I assume or I hope, still talking about data which has been rendered in the form that we have been talking about or Latanya has been talking about, and the folks from NCHS have been talking about, that is, in as de-identified, if you can use that term colloquially today, a format as possible? As we discuss contracts, are we still talking about that, or are we talking about identified?
MR. FRIEND: I saw it as two parts. You did say it better than I did, but the two parts are not saying let's forget about trying to anonymize, but say that there is a base level anonymization done, and whether it be a legislative or contractual commitment, that it will not be undone without consent.
MS. BREITENSTEIN: So is that sort of a standard format, name, address, phone number, birth date, social security number, or is it just tell me the fields you want, and I'll give it to you?
MR. GELLMAN: I don't know that we're going to get into that level of detail. This is an idea. This is not an idea that we have all agreed upon by any means. It's just this is the kind of thing that in order to develop this, you have to sit down and work through the details, and see when you're done doing that, if you come up with anything that makes sense.
It's sort of broad brush. It seems like a new approach. It is somewhat of a new concept, and at this level it may work. It may not work at the detail level, and it may have a lot of other problems that we haven't even thought of. In fact, the problem that the woman from Blue Cross just brought up really actually applies very much in this context.
Here you are agreeing to give data to somebody. At least with a vendor you are paying them money. You have some control over them. You have some relationship. If some researcher came along and said, can I have your data? Sure, I'll sign your contract. Do I have an obligation to investigate? You can raise all these kinds of questions there. I don't know where that leads.
Do you end up setting up a structure that is such that the incentive is don't give your data to anybody ever unless they have a subpoena, because you are exposing yourself to legal liability? We may be in that situation today, it's just that the lawsuits haven't been filed.
So you end up setting up a structure that is this complex, and the result is all these functions shut down, because nobody can get the data they need for something. That may not be a good result either. You may protect privacy, and ignore all the other objectives that we have in mind.
MR. FANNING: I want to ask the question again. I think we are still faced with making the distinction between in some way identified and non-identified with reference to what are traditionally public use data tapes that are given out to anyone as a matter of policy, without any type of promise or assurance or undertaking or whatever. I assume we still want to do that?
MR. GELLMAN: I would assume that. Maybe one answer is no. Maybe one answer is that we have to have the statistical people all sit down and raise the barrier here. It's not cells of X, it's cells of 3X.
DR. ZARATE: Who said you have always got to say yes?
MR. GELLMAN: Another answer is no, but another answer is the same thing we're doing, but with a higher barrier.
MR. FANNING: You don't always have to say yes to a request, but there is a general public policy in favor of public use data tapes, and we want to keep that business going.
MR. GELLMAN: But a public use data tape may no longer be a public use data tape. It may come with a contract attached to it in all cases. In fact, you could, under the structure either have higher standards, or even lower standards, because you give out data to people under a contract that you wouldn't give to the public.
It may be that you come in and you tell me what you want to do, and I'll decide whether you get data at this level of anonymity or higher or lower or whatever. We cut the data to suit your research project.
DR. KORN: Who is "we?"
MR. GELLMAN: I don't know; whoever has the data.
MR. BERRY: The owner of the data.
MR. GELLMAN: Well, I hate to use the word "owner," but the possessor. It's a question.
DR. KORN: I make a point about this, and it's the same point that I really was trying to make, but I guess I didn't, about the IRBs. I think the IRB system is a great system, because I can't think of a better one to do what they have done. So I'm not hitting against IRBs.
I think though that standards need to be agreed to and codified. I don't think that these kinds of problems that we have sat here all day, and still don't agree on very much, can be thrown up to 3,000 IRBs and 15 different government agencies and say, you guys decide what the rules are going to be, and who is going to get your data, and what they have to do to get it, and this and that.
I think it is the responsibility that you all have, or whoever is going to do this, to set some standards, and make those standards applicable broadly across all of these data sets.
MR. GELLMAN: But that's what we are looking for. We are looking for a high level of abstraction kind of approach and standard for this. Then it's going to say the secretary will write rules ultimately, to try and provide some degree --
DR. KORN: You can give some guidance. When I said who decides, you said whoever owns the data. That is not, by me, having a set of standards that gives some definition to their degrees of freedom.
MR. GELLMAN: That is fair enough.
MR. BERRY: We make that call at HCFA for HCFA data now.
MR. GELLMAN: But if we gave you a statutory standard or the secretary gave you a standard, you would apply that standard.
MR. BERRY: And have some flexibility within that standard.
DR. WETTERHALL: I have real problems in terms of suddenly this thought that we are not going to make public use data available, or we are going to suddenly start attaching various conditions to it. It is somewhat philosophical, but we basically, in conducting what we call public health surveillance, see the dissemination as a critical component in that. There is collection, analysis, interpretation, and dissemination of information, and we would be socially remiss if we did not do that.
The second thing is that the tenor of the conversation seems to imply that we are somehow providing something to these users of these public use data tapes that will be of benefit to them. I think that loses sight of the tremendous social good that has come from public use data tapes, one of which is documenting that taking the lead out of gasoline resulted in lower blood lead levels for children in the United States, and a greater likelihood of developing normally.
So I think there is tremendous benefit in making these public use data sets available to the broader public, simply because I feel it is our moral and philosophical obligation, and we will derive social benefit from that.
MR. GELLMAN: I don't think it is fair to characterize anybody's comments here as saying public use data tapes should no longer be available.
DR. WETTERHALL: But you made that as one of the options.
MR. GELLMAN: The questions are the terms under which the tapes are available. They may change at some level. I don't know what. I think your point about the value of the tapes, I'm not sure anybody is going to disagree with that. Good things have come from all of this stuff, and good things have come from giving people identifiable records as well. That's the struggle here.
DR. ZARATE: One of the ways I was reacting to the contracts, and it is also my reaction to IRBs for the kind of things we have been talking about, is that on the scale on which we have been talking, I see it as very, very impractical. Having said that, it occurred to me that we also have a system in place for releasing vital statistics data that has to do with our relationship with the states who have granted us the consent to use those data to begin with.
One of the facets of our agreement with the states is that we would set up a three tiered system in which we would let out progressive levels of data useful for identifiability, depending on the kind of requests we get, and our ability to determine what the use would be, and who was going to use it, and would they sign an agreement.
This already goes all the way up to the point where we allow data for small areas. I made the statement before that we never really release exact date of birth or death, and I have to contradict myself. There is one situation in which we do, and that is the last tier, where if an individual has the bona fides, and they want to sign a document, we would release data for small areas, things that we wouldn't release generally in public health.
The basis of that is that we carefully review the requests, that we already have consent of the people that we have to get consent from to release, and they have also given us consent to put our mechanism in place for releasing that. One critical thing is that it is done very sparingly. It is not a mechanism for making releases to whoever would call up, and for whatever use.
So I think that if it is done that way, and I naturally have to go along with it, because that's our agency policy. Again, as an across the board method, I think it would have to be highly qualified.
MR. GELLMAN: Fair enough. No one has explored that.
If anyone has anything else on this -- I'm going to just throw out another idea for brief discussion. Let me phrase this in terms of a question. Latanya, you have done the leg work in this area. One of the problems here is the broad availability of public record data. There are a ton of public records available. Deirdre talked about some of them this morning. This is an incredibly long list of them, and most of this is at the state level.
The federal government years ago passed a Driver's Privacy Protection Act that told the states not to withhold drivers' records, but to say you've got to give people a choice before you make at least some disclosures of drivers' records. We see in the state of Maryland at least -- I don't know what the experience is in other states -- that as soon as this capability became available, people ran to opt out.
Opt out rates are not really known overall, but people who were not renewing licenses, but had to call up or communicate through the Internet, they were getting 50,000 opt outs a week from people for a while. In at least one office the opt out rate of people renewing their licenses was 85 percent. So there seemed to be a public demand for this.
What happens to all the capabilities that you have illustrated if some or all of these public records dry up and are not available any more? Does that diminish the problem significantly?
MS. SWEENEY: Yes, actually if I turn the question inside out, I think I can graphically show it to you. Ben and Michael at the state level are faced with the fact that I can get lots of state level data on all the people in the state. Easley and Al don't have that. So if they did, in other words, if I could get the motor vehicle information that I can get at the state level, if I could get it at the national level, if I could get the voter list at the national level, then in fact I don't know what they would be able to release.
They certainly would not be able to release it in anywhere near the detail that they currently do. When I came to this realization -- and I'm saying this ad hoc -- it's a scary thought, because in some sense you say then, well what the heck can a state do? If you bring it down to a hospital or a smaller data holder, what can they do?
But with the proliferation of data, there is sort of a point of no return. There is a point where any disclosure you do is just simply -- whatever is private in that information, you can't keep it private if you disclose anything at all. So this really is a point of no return.
I don't know exactly when that is. I have some ideas about how to figure out what that limit is. If the proliferation of public information continues at the rate it is going, we are going to get there pretty fast, and I'm not sure you can pull it back then.
MR. GELLMAN: Well, John.
MR. FANNING: When you speak of this information and drivers licenses and the like, it's not the same problem presented by large data companies like R.H. Donnelley and so on. Has anybody addressed that?
MR. GELLMAN: A lot of their data, but not all of it by any means -- and that's somewhat of an open question -- a lot of the basic source data for all of these look-up services and databases is public records. It's not the only source, but that's a lot of the source. To the extent that people disclose their own data to marketers and information companies, they have put themselves at risk. Now whether they understand what they are doing, and have been told of the consequences is an entirely different story.
You could also at least make the distinction that the state is compelling people to disclose information in order to get a pet license, in order to register a car, in order to own land, in order to do lots of other things. They are putting them at risk, if you will, for a variety of direct and indirect consequences.
The practices, by the way, vary tremendously from state to state. Not all state voter registration records would be available to you. In some states they're really only available to political parties; in other states they are just totally public to anyone who wants them. In other states they have standards that say some people can have them, and some people can't. Actually, you might qualify, because some of the states say researchers can get them, and you might be able to do it.
So I think it's all sort of an open question, but if we have a problem here that is I won't say caused, but at least significantly exacerbated by the availability of other data, which does not necessarily -- and this is a point of view -- have to be in the public domain, maybe that is another way to try and limit the problem here. This is well beyond the scope of what we are really looking at, but it is something that is at least worth having on the table and discussing.
MR. BERRY: Back into the scope of what we are looking at are the public use files that the agencies release. I think Latanya raises the level of awareness for those of us who have public use files, and HCFA has many; some are on the Internet. We do a very close scrutiny of those files, but I know from experience just recently that we may have some vulnerability -- I guess that's the way to speak -- with regard to a couple of data elements on a couple of files.
Now what she showed me today is a technique to raise that bar, to try to reduce that vulnerability. That's how I see the public use environment looking. I think we better look at ourselves here in retrospect, of what we are really releasing, but a public use file is a public use file. Some of them have been produced for 10 years, and I know people haven't looked at them.
MR. GELLMAN: That seems to me the very clear conclusion, that the risks in the disclosure of public use tapes are dynamic. You have to look at them regularly.
MR. BERRY: That's an expensive process, very expensive process. That's a high level analyst, taking a lot of time. Just thought I'd mention that. You can't get an FTE. Maybe I can hire you or something.
MR. GELLMAN: It's a lot cheaper to do that than get sued by everybody on your database.
MR. BERRY: I don't want to be on "60 Minutes," I assure you I don't. I am making every effort not to be there.
MR. GELLMAN: The floor is open. I said we would go until 5:00 p.m. or until we ran out of steam, and I sort of think it's the latter point. If there are no other comments, we will adjourn until 9:00 a.m. tomorrow in this room, where we will talk about data registries.
I thank everyone for participating. It was very helpful, everybody.
[Whereupon the meeting was recessed at 4:30 p.m., to reconvene the following morning, Thursday, January 29, 1998, at 9:00 a.m.]