Harold S. Luft, PhD


Dr. Detmer, Members of the Committee, Guests:

My name is Harold S. Luft. I am Professor of Health Policy and Health Economics and Director of the Institute for Health Policy Studies at the University of California, San Francisco. The views I will offer on issues associated with the Health Insurance Portability and Accountability Act (HIPAA) of 1996 (P.L.104-191) represent my own perspective and are not necessarily those of the Institute, the University, or any of my funding sources.

I am trained as an economist and have been undertaking research in health economics and medical care for over a quarter century. Much of my work focuses not on the traditional issues of health economics, but on the quality of patient care and clinical outcomes. To explore these questions, I have used a wide variety of administrative records, such as hospital discharge abstracts, and billing or encounter data. Thus, the proposed changes stemming from the HIPAA are especially important to me.

Although I am not a clinician, I frequently work with clinicians, and I have learned from them to be very concerned about the quality of the data we use, as well as the potential burdens that data collection imposes upon organizations and the potential risks to patients if confidentiality is breached. I also trained at Harvard with Paul Densen, a pioneering statistician who demonstrated the importance of good quality, routinely collected information for assessing the performance of health care organizations. One of Dr. Densen's most important lessons was that the people responsible for the collection and maintenance of data need to understand how and why it is useful and thus care enough to assure its validity and reliability. I am repeatedly reminded of this lesson when I approach a new data set and examine the quality of the data. The validity and reliability of the information is usually directly proportional to its value to those responsible for collecting it, rather than its potential value to me as a researcher.

I also recognize that, while more and better research may help improve patient outcomes and medical care delivery, researchers cannot assume that data collection is costless, either in money or time. Thus, while more data may help provide better and more complete information, we need to be aware that more data is costly, and requests for additional data not needed for other purposes should take into account the likely benefits.

Given this perspective, I will not attempt to offer many recommendations with respect to specific issues of coding and classification. Many of the problems of existing systems are well known, such as the inability to distinguish right from left hip replacement procedures, and I hope these will be addressed by others. However, I will offer some general recommendations and suggestions that I hope will be helpful in your deliberations.

Given current data system capabilities and those likely to be available in the foreseeable future, as well as the more highly differentiated sites of care and delivery systems, it is crucial that data for individuals be linkable across settings. Furthermore, claims or encounter data should be collected from all sites, even if the sites or organizations are not paid on a fee-for-service basis. Thus, a provider group capitated for its patient care needs to keep track of patient care encounters even though payments are not made based upon such information. This means that we might expect such data to be less accurately coded with respect to detailed procedures than would be the case for claims-based data.

On the other hand, if the medical group or health plan is to be responsible for the treatment provided its enrollees, encounter data must include information on the major interventions and treatments provided. Claims data rarely need to have precise diagnostic information, and the diagnosis data included is usually just enough to justify the procedures billed. Clinically-oriented encounter data, however, is usually more accurately coded with respect to diagnoses, but may be less precise with respect to the specific services rendered.

If data are to be combined across systems of care, coding should represent the expected degree of accuracy. While the most precise codes would be best in all situations, pseudo accuracy is worse than recognition that some data are only approximate. Thus, I would recommend that if the collecting organization is not requiring very detailed coding and checking its accuracy, it should use less detailed codes, e.g., 3-digit rather than 5-digit codes, so all users understand the appropriate level of "resolution." For example, a provider group that is capitated for its services might use less detailed procedure codes, but more detailed diagnostic codes, than a claims-based payer. This is not a proposal to encourage "sloppy coding" or minimalist reporting. Various groups may well ask plans and delivery systems why they are not coding at a level of detail attained by their competitors. It would, however, make the data reported more useful and less misleading.
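As a rough illustration of reporting at a declared "resolution," the sketch below truncates a dotted numeric code to a stated number of digits. The function name `truncate_code`, the sample code value, and the assumption that codes have three digits before the decimal point are all illustrative choices, not part of any coding standard's own tooling.

```python
def truncate_code(code: str, digits: int) -> str:
    """Keep only the first `digits` digits of a dotted numeric code.

    Assumes (for illustration only) a code with three digits before
    the decimal point, such as '820.21'.
    """
    compact = code.replace(".", "")
    kept = compact[:digits]
    # Re-insert the decimal point only when detail beyond 3 digits remains.
    return kept if len(kept) <= 3 else kept[:3] + "." + kept[3:]


# A group that does not verify detailed coding reports at 3 digits;
# one that does verify it reports the full 5 digits.
print(truncate_code("820.21", 3))
print(truncate_code("820.21", 5))
```

A data set built this way would carry its level of detail in the codes themselves, so a user combining files can see which sources are only approximate.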

A second level of simplification would recognize that we are currently in a transition phase between very limited data collection and reporting capabilities and the universal availability of complete, electronic medical records. During this period, which is likely to last at least a decade, required data submittals could be designed to be at two levels of detail. For example, drawing upon over a decade of experience I have had using the hospital discharge abstract data from California, it is clear that having a uniformly collected discharge abstract for all patients in the state (missing only those from federal hospitals) is enormously valuable. It allows the identification of readmissions, for whatever reason, and allows a wide variety of valuable research at relatively little cost to the hospitals. On the other hand, there are many questions we cannot ask of these data because variables such as physiological status, and even blood pressure, are not included as procedures or diagnoses. In some states, much more detailed data, capturing a wide array of clinical findings, is collected, at a more substantial burden to the hospitals. Because it would not make sense to combine the results of patients admitted for a normal delivery with those admitted for coronary artery bypass surgery, a more targeted approach may be appropriate.

There are probably a few dozen conditions and procedures that represent a large fraction of patients and that can be used to explore questions of differential outcomes, practice patterns, and resource use. The literally thousands of other conditions and procedures occur too infrequently for routine monitoring at so great a level of detail. In many instances, one would want to know the results of a specialized blood test or X-ray which may not be routinely coded, and because of the infrequent occurrence of such patients, coders are less likely to process the information accurately.

My recommendation would be for the routine collection of a relatively small set of data, including all procedures and diagnoses, along with common physiological measures such as blood pressure, pulse, and temperature on admission. For patients falling into one of the select categories of interest, such as delivery, heart attack, pneumonia, stroke, and coronary artery bypass surgery, additional condition- or procedure-specific information would be collected. This additional data would be keyed to the patient's problem, e.g., timing and nature of prenatal care for delivery, or left ventricular ejection fraction for coronary artery bypass graft (CABG) patients. The list of conditions and procedures, along with the appropriate related measures, should be allowed to be changed over time by a national body, such as the NCVHS.

Patient confidentiality is a crucial concern as the potential harm from the disclosure of medical information is apparent. However, it may be much easier to protect patient-specific information that is stored electronically, with suitable encryption and password protection, than in paper files in doctors' offices or teaching hospitals. The principal concern is that some individuals may inadvertently, or by chance, come upon large data bases of sensitive information (Markoff, John. "Patient files turn up in used computer." New York Times, April 4, 1997, p. A9).

It is quite appropriate to require that researchers prepare a research protocol and specific request to use data that includes potentially sensitive patient information. Such requests should be reviewed and be subject to approval by the appropriate Institutional Review Board at the investigator's institution, and should also be approved by the agency with responsibility for the data. It is certainly reasonable that the data be given under agreements that it not be passed on to other parties, although such restrictions should not apply to aggregated data files developed from the original that do not contain any patient-identifiable data. For example, I may require sensitive information such as birthdate, admission date, and a social security number on a hospital discharge abstract file to link records either within the file or across multiple files. Such information is clearly sensitive and needs to be handled with the utmost care. Once the linkage is accomplished, the social security numbers are eliminated, and the birthdates and admission dates are converted to approximate values, the data are no longer identifiable and should be made available for use by others. It is certainly reasonable for the initiating source to require notification of transfer of any derivative data sets to other parties.

The ability to undertake linkages across data sets is crucial for many research purposes. The availability of the social security number has been invaluable for many research projects, but other unique identifiers could be used. In fact, in some data sets the social security number is inadequate as an identifier, if only because it cannot be collected at the time of admission due to the patient's condition, or because newborns do not have such identifiers. Furthermore, there is little redundancy built into the SSN, so a transposition of digits can wreak havoc in linkage. Adding information via the CDC's SOUNDEX system, which was developed for the derivation of confidential identifiers for patients with HIV infection and AIDS, might be a major step forward.
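To make the idea of a name-based supplementary key concrete, here is a minimal sketch of the widely documented American Soundex rules (first letter plus three digits). Whether the CDC variant mentioned above follows exactly these rules is an assumption; the function name `soundex` and the sample surnames are illustrative only.

```python
# Consonant groups used by standard American Soundex.
SOUNDEX_DIGITS = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for a surname ('' if no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if digit and digit != prev:
            code += digit
        # H and W do not separate repeated codes; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


# Spelling variants of the same surname collapse to one key.
print(soundex("Smith"), soundex("Smyth"))
```

Because the key is derived from the name rather than from the SSN, it supplies the kind of redundancy that lets a linkage step catch a record whose SSN was mistyped.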

If there are a few highly secure locations that can do the linkage on a routine basis, there is little need for researchers to have the unencrypted identifiers. (Linkage sometimes depends on original identifiers if the encryption algorithm is sensitive to coding errors. For example, two records may have SSNs that are identical with the exception of a single transposition, and obviously belong to the same individual based on other information, yet some encryption algorithms may result in markedly different values for the two SSNs, precluding a reasonable match.) Given these problems, those who link records need highly accurate and redundant information. On the other hand, once the records are linked, the true identifiers can be easily replaced by a set of arbitrary identifiers that are unique for an individual, but need not match any other set of identifiers. Thus, as a researcher, I may need data that are well matched, and have a unique code for each person, no matter how often he or she appears in the data set.
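The transposition problem described in the parenthetical can be sketched as follows. The testimony does not name an encryption algorithm, so a SHA-256 hash from Python's standard `hashlib` stands in for it here as an assumption; the nine-digit values are made up, and `is_adjacent_transposition` is a hypothetical helper showing the kind of check that is only possible on the original identifiers.

```python
import hashlib


def digest(identifier: str) -> str:
    """One-way digest standing in for the unspecified 'encryption'."""
    return hashlib.sha256(identifier.encode("ascii")).hexdigest()


def is_adjacent_transposition(a: str, b: str) -> bool:
    """True if b equals a with exactly one pair of neighboring characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


recorded = "123456789"   # made-up identifier as first entered
mistyped = "123546789"   # same identifier with two digits transposed

# After hashing, the two values share no usable resemblance.
print(digest(recorded) == digest(mistyped))
# On the raw values, the near-match is easy to detect.
print(is_adjacent_transposition(recorded, mistyped))
```

This is why the linking site needs the raw, redundant identifiers, while downstream researchers do not.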

There is no reason, however, for that unique code for that person to be the same as the code for the same person in a data set provided to another researcher. The security of data sets can be enhanced in yet other ways. For example, the exact date on which a patient was admitted, or the exact birthdate of a patient, may not be crucial to a researcher (although the latter may be important in determining the accuracy of the match by whoever is doing the record linkage). Thus, records could easily have all the dates shifted forward or backward by a random amount, say 30 days, for each patient. As long as the shift value for the same person was constant across all records for that individual, it would not affect most research. Again, the random shift factor could be altered for each linked data file. This second level of encryption could be handled at the creation site, or could be required in the protocols approved by the Institutional Review Board. That is, it could be the responsibility of the investigator, as the first step in the preparation of the data, to undertake a secondary encryption of dates and unique identifiers. In this way the data would be further protected from accidental discovery by hackers or others.
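A minimal sketch of this secondary step, under stated assumptions: each person receives a fresh arbitrary identifier and a single random date shift that is applied to all of that person's records, and both are regenerated every time a file is prepared. The function name `pseudonymize`, the record layout of (person identifier, date) pairs, and the reading of "30 days" as a maximum shift in either direction are illustrative choices, not a prescribed procedure.

```python
import random
import secrets
from datetime import date, timedelta


def pseudonymize(records, max_shift_days=30):
    """Replace identifiers and shift dates, consistently within each person.

    `records` is an iterable of (person_id, event_date) pairs.
    Calling this again yields different identifiers and shifts, so two
    released files cannot be joined on the arbitrary identifier.
    """
    rng = random.SystemRandom()
    new_ids, shifts, used = {}, {}, set()
    released = []
    for person_id, event_date in records:
        if person_id not in new_ids:
            token = secrets.token_hex(8)
            while token in used:  # guarantee uniqueness within the file
                token = secrets.token_hex(8)
            used.add(token)
            new_ids[person_id] = token
            # One shift per person, constant across all of their records.
            shifts[person_id] = timedelta(
                days=rng.randint(-max_shift_days, max_shift_days)
            )
        released.append((new_ids[person_id], event_date + shifts[person_id]))
    return released


linked = [
    ("person-1", date(1996, 1, 1)),
    ("person-2", date(1996, 2, 1)),
    ("person-1", date(1996, 1, 15)),  # readmission 14 days later
]
for row in pseudonymize(linked):
    print(row)
```

Because the shift is constant per person, the 14-day interval between the two admissions for the first person is preserved, which is what most analyses of readmission or length of stay depend on.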

In summary, I believe that the development of linkable data sets, with well-coded and valid information, is essential to the assessment of medical care performance and the improvement of patient care. These data can be collected and linked at relatively low cost, particularly if the rules for the development and collection of the data take into account the primary uses of the data. The availability of unique identifiers, with sufficient redundant checks to assure accurate linkage, is crucial. High-level routine encryption of codes, along with site-specific secondary encryption and modification of the data, will assure a far greater degree of confidentiality than currently exists with medical records. We should move ahead with adopting uniform standards and protocols to begin to reap the benefits of improved data while minimizing the chances for harm.