[This Transcript is Unedited]
Renaissance Hotel
999 9th Street, NW
Washington, D.C.
DR. STEINWACHS: I'd like to welcome everyone to the second day of our Workshop on Data Linkages to Improve Health Outcomes. This workshop has been put together by the Subcommittee on Population Health of the National Committee on Vital and Health Statistics, and this workshop is being broadcast live on the Internet. So I would like to welcome those people who are listening on the internet.
And I thought before we got started officially, we might just go around the room and introduce ourselves, for those who are on the Internet.
And I am Don Steinwachs. I chair the Subcommittee on Population Health, and I am from Johns Hopkins University.
MR. IAMS: I am Howard Iams from the Social Security Administration Research Office.
MS. MADANS: Jennifer Madans from the National Center for Health Statistics.
MR. HARRIS-KOJETIN: Brian Harris-Kojetin from the Office of Management and Budget.
MR. BJORKLUND: Rick Bjorklund, Office of the Assistant Deputy Under Secretary for Health, the Veterans Health Administration.
MR. PETSKA: I am Tom Petska. I am Director of the Statistics of Income Division of the Internal Revenue Service.
MR. CHAPMAN: Chris Chapman, U.S. Department of Education, National Center for Education Statistics.
MS. OBENSKI: Sally Obenski, Data Integration Division, U.S. Census Bureau.
MR. PREVOST: Ron Prevost, Data Integration Division, U.S. Census Bureau.
DR. DAVERN: Michael Davern, University of Minnesota.
DR. STEUERLE: Gene Steuerle from the Urban Institute and a member of the committee.
MR. LOCALIO: Russell Localio, University of Pennsylvania, School of Medicine, a member of the committee.
DR. SCANLON: Bill Scanlon, Health Policy R&D and a member of the committee.
(Additional intros around the room.)
DR. STEINWACHS: It is a pleasure to welcome everyone today.
Our first session has speakers from Internal Revenue Service, Department of Education, Department of Veterans Affairs, and Russell Localio, who is a member of the committee, is going to serve as facilitator.
Russ.
Agenda Item: IRS, Education and Veterans Administration
MR. LOCALIO: Good morning, everyone.
I just want to introduce our first speaker, Tom Petska, our friend from the Internal Revenue Service.
PARTICIPANT: Is that true? Oh, of course that is true. Of course.
Agenda Item: Tom Petska, Director, Statistics of Income Division, IRS
MR. PETSKA: I have no comment on that.
Actually, I am very pleased to be here to speak on this topic, Workshop on Data Linkages to Improve Health Outcomes. A lot of people might think what is the IRS role in that? I hope I can say a few things in the next 10 or 15 minutes about that.
In a way I feel a little bit inadequate about speaking to this group for a few reasons.
One is that my division is centrally located between IRS and Treasury, and that gives me two bosses, the Director of Research Analysis and Statistics of IRS and the Director of Tax Analysis of Treasury, and they can - they sometimes agree - put it that way - on what we should be doing and how we should be doing things. So that is a little awkward.
Also, because I am Director of SOI, I get occasional questions as to, Can you tell me about Section 8267 of the Code and how that affects my small oil and gas operation?
And the bottom line is the Internal Revenue Code is 2,000 pages long. The supporting regs are over 10,000 pages, at last count, and I really don't have that encyclopedic knowledge. I am sorry.
But I do get questions like, What is the average amount of charitable contributions for my level of income? And that is something I can look up, even though I can't look up what is the status of your refund.
So, that said, hopefully, I can tell you a little bit about IRS tax data administrative statistics as a potential source for shedding some light on health outcomes and so on.
Before I say anything further, I would like to add my disclaimer that these are my personal views and not necessarily those of the Internal Revenue Service or the Department of Treasury.
A little bit about my organization. IRS is a large organization, 100,000 employees, an $11-billion budget.
The SOI program is about .4 percent of that, $40 million, which sounds like a lot of money, but, relatively speaking, it is small, and under 500 employees.
Our primary customers - We have two very intensive customers and those being the Office of Tax Analysis of the Treasury Department and the Congressional Joint Committee on Taxation.
I'll be talking a little bit about data access and disclosure, and let me just say that those organizations do have full access to all of our data, and they have a heavy role in directing our priorities and studies.
However, we do have many other customers, many in the federal statistics community, including the Bureau of Economic Analysis and the Census Bureau.
Okay. What kind of data do we have?
Well, I probably should have given a little bit more background to this slide. I am going to be talking a little bit about two types of data. One are the sample data from my organization. We sample tax returns and we have scientifically-designed samples. We edit these samples very carefully, and we weight these to national totals.
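(Editor's illustration, not part of the spoken remarks: the idea of weighting a sample to national totals can be sketched roughly as below. This is a minimal sketch assuming a simple stratified design in which each sampled return carries a weight equal to its stratum's population count divided by its stratum's sample count. The stratum names, counts, and amounts are hypothetical and do not describe SOI's actual sample design.)

```python
from collections import Counter

def stratum_weights(population_counts, sample_strata):
    """Return one weight per stratum: population count / sample count."""
    sample_counts = Counter(sample_strata)
    return {s: population_counts[s] / n for s, n in sample_counts.items()}

def weighted_total(records, weights):
    """Estimate a population total from (stratum, amount) sample records."""
    return sum(weights[stratum] * amount for stratum, amount in records)

# Hypothetical figures: 1,000 returns in the "low" stratum, 10 in "high".
population_counts = {"low": 1000, "high": 10}
sample = [("low", 50.0), ("low", 70.0), ("high", 900.0), ("high", 1100.0)]

weights = stratum_weights(population_counts, [s for s, _ in sample])
estimate = weighted_total(sample, weights)
print(weights, estimate)
```

(In this toy case each "low" record stands for 500 returns and each "high" record for 5, which is why rare, high-value strata are typically sampled at much higher rates.)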
The content is based on our user needs, and so if the Treasury Department comes back and says, We want multiple schedules on depreciation by different types of class lives, we can add that. It is a resource issue, but it is not a policy issue or program issue.
Now, separate from that is the relatively content-poor and less-edited data in the IRS Master File system, and we don't produce that data, but we are kind of a gateway to other federal statistical agencies and researchers who do have access to that.
It is also the main source of statistics at the sub-national level, because our SOI samples are not robust for most states and certainly not below the state level.
Okay. So those are the types of data we have.
Now, in each of these program areas - individual, corporate, partnership, estate and gift tax, tax exempt or non-profit organizations - we have pretty much these two sources, and, for the most part, the content-rich SOI samples are the preferred source, just because the data are of higher quality as well as the content much greater.
And just as a footnote, I should add that all the data that is filed on all tax returns and schedules is not transcribed, and that is one reason for our program is that I think there are something like nearly 100 schedules that an individual can append to their 1040 return, and only a limited amount of that data is transcribed. So, as far as content, we have that flexibility, although, as I said, it is a resource issue, and, again, all of our data are preaudit.
Well, two viewpoints on SOI. Well, one is that we try to be a cooperative, collaborative and efficient producer and user of data based on administrative records, and I think we do a pretty good job on that, for the most part, but, then, on the other hand, we are the - quote - tax collectors in disguise, as survey statisticians, and I'll talk about what that means a little bit further down the road, but, first and foremost, we are employees of the Internal Revenue Service, and that has certain legal issues in terms of what we do and in terms of our relationships with others as well.
Now, we have kind of generically three opportunities for data linkages, first being linkages involving solely IRS tax and information returns. Others are linkages involving tax data to survey records, and that is a very short topic, which I'll tell you why, and, then, lastly, I suspected that the kind of tone or focus of the conference would be what about microdata access by researchers and other agencies and so on, and so I thought I would spend some time on that, and I have brought my disclosure expert, Nick Greenia, who is sitting in the back there, who works with me and my boss, Mark Mazor(?), on a lot of interagency issues involving data access.
Okay. If you gave me the time, I could go on and on about this first one, linkages involving tax and information returns, for which we have control over the data. We have access to the data and so on, and first might be individual returns linking in 1099s and W2s.
For instance, you see an individual return. It's got a certain income, $75,000. It is a joint return. There's a husband and wife, apparently. Is it from one income or two incomes? Well, you can't determine that without looking at such things as the W2s and so on. So we do linkages like that to look at the whole picture of family economic income.
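(Editor's illustration, not part of the spoken remarks: the one-income-or-two question above can be sketched as a lookup of W-2 wage records by the identifiers listed on each return. This is a minimal sketch only. The record layouts, the function name `earners_per_return`, and all identifiers and amounts are hypothetical.)

```python
from collections import defaultdict

def earners_per_return(returns, w2s):
    """Count how many filers on each return have any linked W-2 wages.

    returns: dict mapping return id -> list of filer identifiers
    w2s: iterable of (filer identifier, wages) records
    """
    wages_by_id = defaultdict(float)
    for filer_id, wages in w2s:
        wages_by_id[filer_id] += wages  # a filer may have several W-2s
    return {
        return_id: sum(1 for f in filers if wages_by_id.get(f, 0.0) > 0)
        for return_id, filers in returns.items()
    }

# Two hypothetical joint returns, each reporting $75,000 in total.
returns = {"R1": ["A", "B"], "R2": ["C", "D"]}
w2s = [("A", 40000.0), ("B", 35000.0), ("C", 75000.0)]

counts = earners_per_return(returns, w2s)
print(counts)
```

(Both returns show the same total, but only the linked wage records reveal that one is a two-earner household and the other a one-earner household.)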
Partnerships. Partnerships are a major type of business, but they are untaxed. People or organizations, corporations, non-profits form partnerships. They report their financial activities on an information return, and then they distribute their taxable shares to the partners, and those could be expenses. They could be income and so on.
So to ascertain the total effect of partnership taxation, you have to link in partner tax returns, and one partnership could have 20,000 partners. So it is not a trivial matter.
Small business corporations are similar. They have the flow-through nature of a partnership.
The estate tax has become a hot political topic once again, and among the questions are, What happens to these bequests from large estates? The estate return shows who gets them, but it doesn't show the income and what happens to those individuals over time. We can link them up by matching their 1040 returns and following them over time on a panel basis.
There are a lot of issues in regard to consolidated corporations. Subsidiaries may file separately. To get a combined picture of corporations, you have to link these together, and we use the employer identification number to do that.
And then another focus of our primary customers, Treasury and the Joint Committee, are for panel files linking the same entity - corporation, individual, et cetera - over time or linking within a year, and we do quite a bit of that kind of work, which we can talk about as time allows.
Okay. What do these files have?
Well, first of all, high-quality linking variable. If you don't have that - We don't do a lot of research on using other things like names and addresses to link. If we don't have a high-quality variable, we probably don't have the resources to do a high-quality study.
Fortunately, most of our studies, most of our files, the variable, the employer identification number or Social Security number is very accurate.
Obviously, if you don't have overlapping samples and you try to link records, you are going to get few hits because of that. So we need at least one population file or samples that are substantially overlapping.
And, lastly, what about accounting periods? In a lot of our linked studies, we want to align accounting periods. We want to take the partner and the partnership or the corporation and its employment return and we want to be very specific about aligning those accounting periods, or, in other cases, we want to show a panel-like study with different periods.
Okay. Well what have we learned from all these studies over time?
Well, a few things. First of all - and I think that has been the tone of day one of the conference - matched files can be very rich analytically and so on, but, on the other hand, linking data files is probably never as easy as it seems to be. Data quality is never optimal. Even in our SOI files, where the linking variables, our Social Security numbers and EINs, are high quality, there are times when they are not as high quality as we would like, and we have non-matches and so on.
Resolving these discrepancies, if they are important - and I think they are - often is a very labor-intensive effort, and so we try to avoid that to the extent possible, but we don't want to produce a file that is based solely on matches and ignore the non-matches. I think that would be a mistake, and linked files sometimes don't answer all the questions, and we could talk about that also. Okay?
But I think the one key thing is that you take this data, if you develop high-quality linking variables and you put it in a relational database, I think you are way ahead of the game. Okay?
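(Editor's illustration, not part of the spoken remarks: the two points above, using a high-quality linking variable in a relational database and not discarding the non-matches, can be sketched with an outer join that flags each record's match status. This is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and values are hypothetical.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE partnership (ein TEXT PRIMARY KEY, income REAL);
CREATE TABLE partner (partner_id TEXT, ein TEXT, share REAL);
""")
conn.executemany(
    "INSERT INTO partnership VALUES (?, ?)",
    [("E1", 1000.0), ("E2", 500.0)],
)
conn.executemany(
    "INSERT INTO partner VALUES (?, ?, ?)",
    [("P1", "E1", 0.5), ("P2", "E1", 0.5), ("P3", "E9", 1.0)],
)

# LEFT JOIN keeps every partner record, matched or not, so that
# non-matches stay visible for review instead of silently dropping out.
rows = conn.execute("""
SELECT p.partner_id,
       p.ein,
       s.income * p.share AS allocated_income,
       CASE WHEN s.ein IS NULL THEN 'unmatched' ELSE 'matched' END AS status
FROM partner AS p
LEFT JOIN partnership AS s ON s.ein = p.ein
ORDER BY p.partner_id
""").fetchall()

for row in rows:
    print(row)
```

(An inner join would have returned only P1 and P2. The outer join leaves P3 in the file with a status flag, so the discrepancy can be resolved or at least counted.)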
Well, this, I said, was going to be a short topic, linkages of tax data to survey records, and it is short for two reasons. One, we do very few surveys ourselves. The only surveys we do are for corporate returns, particularly multinationals.
As you know, just as individuals can request filing extensions, corporations can, too, and most of them routinely do, and so if we have an early data cutoff and we need to get that corporate return in, that major U.S. corporation, that sample at a weight of one, we often do a survey and request preliminary data from them, and so - or, in some cases with multinational corporations, they're just not completed that accurately, the presumption being that everybody provides IRS accurate and complete data, but that is not always the case, and these multinational returns are very, very complex and sometimes, even despite their best efforts, they are not complete showing all their international activities and so on.
But then the last point is we don't do matches to other survey data, because, again, getting back to my point earlier, we are the tax guys, so people don't provide us microdata. I mean, that is the other hat we wear. We are the IRS guys and the perception would be that could these data be brought into a compliance type situation.
Okay. Provision of microdata to other agencies. Who can get what? Well, this is a very, very short summary and so on, but, first of all, tax administration. By this I mean, for the most part, members of the IRS, and sometimes Treasury as well and so on, who have a tax administration motivation. Taxpayer account processing, audit, compliance and research functions are all internal and they all can get access to most of these. Although we have to be very careful in cases where there are sample data involved, where it could destroy the accuracy and representativeness of these samples.
Tax analysis. Treasury's Office of Tax Analysis, the Congressional Joint Committee, CBO and GAO all have roles in tax analysis and can get data, not 100 percent, but, for the most part, can get some identifiable tax data for specific purposes as articulated in the Internal Revenue Code.
And, then, statistical use. The Bureau of Economic Analysis can get corporate data. Census gets population data for individuals and for businesses. When the Ag census was moved to the Department of Agriculture, NASS, a few years ago, that data access was also enabled with them, and the CBO as well. Okay?
We talked yesterday - and I think the Census people presented this very well - that their goal, their mandate is to use existing data systems to the maximum extent possible, such as administrative records.
Our mandate, unfortunately, is the opposite, though. It is to provide only the federal tax information for authorized purposes and to the minimum extent necessary. So this is just a naturally conflicting mandate that we have.
Constraints on using IRS tax data. Well, first of all, it has to be for a use for authorized purposes, and this is defined in the Internal Revenue Code and in supporting regulations, the 10,000 or 12,000 pages that I mentioned earlier, and can be also further defined in separate documents, MOUs, such as the Census-IRS criteria agreement.
Again, our mandate is to disclose only the minimum confidential federal tax information necessary, and there are substantial penalties for unauthorized disclosure or inspection, and publicly released data must be anonymous. Although, we do have a public-use file. As time allows, we can talk about that, but we do remove identifiers and sanitize records in a subsample of our individual program, and it is used by a number of high-profile policy analyst groups.
Okay. Briefly, the authorization process for access to tax data.
Well, first of all statistical recipients - and this came up yesterday - need to be cited in Section 6103(j) of Title 6 (sic) of the Internal Revenue Code.
To change that, Congress must enact legislation.
The statute authorizes access purpose and may stipulate supporting regulations and so on. The regs - regulation detail may restrict uses as well.
And, as I said before, policy agreements can provide additional enumeration.
In summary, access to tax data is very restricted. Some possibilities include - and these are very limited - working as a contractor for tax administration purposes. We have had a few of these, but they are very limited. Working at an agency with current access, like the Treasury or like the CBO or the Joint Committee, or accessing limited business data via Census' Center for Economic Studies.
And to find out more, we can talk later or drop me an email or give me a call.
Thank you.
MR. LOCALIO: Thank you, Tom.
Do we have any questions?
MS. TUREK: (Off mike).
MR. PETSKA: Survey of Consumer Finances.
MS. TUREK: Yes. They use you as a sampling frame. They don't get any data items.
MR. PETSKA: I worked on that study several years ago, and where it is right now in terms of the firewalls, I am not clear exactly, but, basically, the high-wealth portion of their study is a list frame developed from our 1045. That is correct, and we have done this with all years of the Survey of Consumer Finances, going back at least to 1983, and we did have some involvement in the first, the Survey of Financial Characteristics of Consumers.
Nick, do you want to say something about that?
MR. GREENIA: It is true that the Federal Reserve does receive some data, but the Federal Reserve is perceived as a 6103(n) contractor, which means that the purpose of that tax-data receipt is seen as fulfilling a Treasury tax-administration purpose.
As you may recall, when CIPSEA was submitted to Congress in 2002, there was a companion bill, and the companion bill had the Federal Reserve in there, so that they could receive tax data unrestricted for Survey of Consumer Finances purposes.
DR. STEUERLE: Tom, you might clarify a little the fact that it is possible to apply to your agency to have data run by outsiders.
I am wondering if you might also comment on the extent to which that is really restricted by resource constraints, because I am sure everyone who comes to you to have something run basically impinges upon your resources to some extent, because they always need some helping hand, but that is a possibility -
MR. PETSKA: Yes, that is a very good point.
DR. STEUERLE: - with respect to health data that may not be even with respect to any tax question, right?
MR. PETSKA: Yes, that is right. I mean, again, I have talked about restrictions on access to microdata, but for table-level data, we have disclosure rules to suppress cells that have fewer than three observations at the national level and so on.
But, for the most part, if we have an existing file that will meet your needs, we can enter into a small reimbursable contract to produce tables from those files and so on.
The problem gets in when there's matching required or content that we do not currently have.
For instance, a few years ago, we talked about the idea of non-cash charitable contributions. Could we produce some aggregate statistics to be published on that?
We didn't pick it up in the program back in those days, and so for us to edit those data, build it into the sample, weight it and everything else was a very expensive task.
Since then, we've gotten a push from Treasury and from Joint Committee to include that as part of the program. So, now, we have that, and we could produce additional statistics from it and so on.
So, again, we do have restrictions on staff, time and so on, but, for the most part, for a tabulation from an existing file, we really try to service those kinds of requests.
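(Editor's illustration, not part of the spoken remarks: the table-level disclosure rule mentioned in this answer, suppressing cells with fewer than three observations, can be sketched as a simple threshold check. This is a minimal sketch only. The function name `suppress_small_cells`, the cell labels, and the amounts are hypothetical, and real disclosure review generally involves more than a count threshold, for example complementary suppression so that a hidden cell cannot be recovered from row or column totals.)

```python
def suppress_small_cells(table, min_count=3):
    """Blank out any cell built from fewer than min_count observations.

    table: dict mapping cell label -> (observation count, cell total)
    Returns a dict mapping cell label -> total, or None if suppressed.
    """
    return {
        cell: (total if count >= min_count else None)
        for cell, (count, total) in table.items()
    }

# Hypothetical tabulation: (group, income class) -> (count, total amount)
table = {
    ("A", "low"): (12, 3400.0),
    ("A", "high"): (2, 9100.0),   # only two returns: must be suppressed
    ("B", "low"): (3, 150.0),     # exactly three: publishable
}

published = suppress_small_cells(table)
print(published)
```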
DR. DAVERN: Hi. Michael Davern from the University of Minnesota. I have a question.
We heard from Census yesterday about matching Medicaid records to the Current Population Survey. Something that might be interesting as well would be to take a look at some of the W2 information - I don't know if they have access to it - which would say deferred compensation for health insurance coverage, for example, that a person paid pre-tax dollars into an account.
I was wondering if that was some kind of matching study that may be possible to verify how well the CPS not only measures Medicaid, but how well it measures private insurance coverage.
MR. PETSKA: That is a good question.
Nick, can you help me out there in terms of does Census get the W2 records now?
MR. GREENIA: As a result of a regulation amendment two or three years ago, they now get some limited data that includes deferred compensation from the W2. So that information is in scope at the Census Bureau.
Where we get into some issues is when Social Security data might be involved, and Social Security, as you know, has a very unique arrangement with the Census Bureau. They essentially access tax data and Census data as special sworn status employees.
So, for purposes of the criteria agreement, the policy agreement that Tom was talking about that enables access to new projects by special sworn status, we treat them as, if you will, employees for purposes of this agreement.
The sticking point is that that kind of access, especially if we are talking about any tax data that Social Security does not have access to, even when they are matched to Census Bureau and they do have access as far as we are concerned, Census Bureau views them as special sworn status employees, which means, for purposes of our agreement, they are viewed as Census employees.
So once that enters the equation, the work has to be done under the criteria agreement, which means that it has to be predominantly for Title 13, Chapter 5.
MR. LOCALIO: Howard, did you want to comment on that at all?
MR. IAMS: Yes, the process for doing that would be to go to the Census Bureau and apply for permission to use these data at your Census restricted data center and they have an application process that formally requests a purpose, and what Nick has emphasized about the Title 13 is - has to be done, what you have articulated would have a clear Title 13 purpose.
I think the problem you will encounter is that I do not think the code that isolates health purpose for the deferred compensation is available. It may be available for the last earnings year, but it was not available before 2005. So you won't be able to identify how the purpose of this deferred compensation differs from, say, a 401(k).
I don't know if Ron - do you remember? We recently started coding for our matched data the reason for the deferred compensation - what kind of account it was - whether it was 401(k), 403(b), 457, and I do not recall if the health account was separately coded.
MR. PREVOST: Yes, I don't believe it was, but we could check into that, that is for sure. Certainly, the project that we are potentially discussing here would have a clear Title 13 benefit. I mean, it is -
MR. IAMS: Oh, it would. The question is whether the data are there.
MR. PREVOST: Whether the data are available is the main question.
MR. IAMS: The deferred compensation is identified, but there is a box that identifies what the deferred compensation reflects, and I don't recall if health account was one of the codes. It has only been available in the last earnings year.
MR. LOCALIO: Well, I think we need to go on.
I want to thank you, Tom, very much for your presentation. We gotta go on or we are going to run out of time. I am sorry.
Next, we are going to hear from Richard Bjorklund from the Veterans Administration. While he is getting his presentation set up, I just want to say that, yesterday, we heard several people comment about their potential dealings with the Veterans Administration and how those dealings were cut short by that announcement of a potential breach which never happened.
I do want to say that I got a letter from the Veterans Administration, as a veteran, saying that there had been a potential breach, and then a letter saying that it did not happen, and I think we would all be interested to find out what have been the - if you have any comments about what have been the repercussions of that from your perspective.
Thank you.
Agenda Item: Richard Bjorklund, Director, Veterans Administration
MR. BJORKLUND: Well, just to answer that question quickly, the repercussions have been a very real tightening of all of the policies and procedures for distributing data both within the organization and also to other federal agencies or contractors or researchers.
In fact, researchers outside of VA who once had access with stipulations to VA data have all but been restricted from accessing that data now.
So the security and privacy procedures have been tightened extraordinarily, and I know some of you from CMS are here, and we have been working diligently for months to get specific agreements in place and protocols for transferring VHA data to CMS. That is happening shortly, but it has taken a great deal of time, too.
Well, let me launch into the presentation. My presentation is more general in nature than what Tom was talking about.
I am going to be talking about more strategic direction that our organization is taking regarding linked data and talking specifically about a project that is underway.
First, we link with our internal data a number of independent activities. We have an annual survey of enrollees where we try to identify perceptions, interests, preferences, behaviors of veterans, and we link that to their inpatient and outpatient clinical records.
We also do customer-satisfaction surveys of fairly excruciating detail, and we link that also to our administrative data both clinical and cost data. So we can get a comprehensive assessment of the performance of veterans organizations.
Today, I want to talk specifically about a very large project that we have been undertaking for the last 18 months, and it regards the integration of VHA, Medicare and Medicaid data in the production of a user-friendly system, and I am going to be talking about the opportunities that we envision for improving healthcare outcomes, some of the barriers to implementation that we have observed going through this 18-month process, more about what the process was and some of the challenges going forward.
First, to put this in context, I want to spend a little bit of time talking about the VHA organization.
First of all, VHA is a component of the Veterans Administration. It is one of three major components. The other two being cemeteries and veterans benefits.
We have approximately 156 hospitals, 876 outpatient clinics, nursing homes and domiciliaries.
The VHA budget is about $35 billion and would rank it amongst the Fortune 50 organizations if we were a private-sector organization. So we are a very, very large organization and we are a very big player in the U.S. healthcare system.
Recently, VHA has been mentioned as providing some of the best healthcare in the country, and it has been management's objective to be the world-class healthcare provider for some time, but to continue to be the world-class provider, we need to continue to be on top of our game, and that is to identify opportunities to improve quality, cost and access, and, in addition, we think, by identifying these opportunities, it facilitates what we refer to as a learning organization.
When these opportunities are identified, it challenges our employees to think about the best solution to the opportunities, and, hence, old cliches like "not invented here" are quickly becoming disassociated from the culture of the organization. In their place is a constant search for better ways to do things, better ways to achieve superior outcomes, and looking beyond our organization to the outside world to identify ways to improve that and essentially becoming globally smart. We think that integrated data provides opportunities to facilitate our overall objectives of maintaining our world-class status and identifying opportunities. Specifically, the areas that we think have the greatest potential here are for best practices, and that is comparing VHA with the private sector along both outcomes and cost dimensions.
So for any of our 156 hospitals, we would be able to identify where the biggest opportunity, the biggest clinical opportunity for improvement is or where the biggest cost opportunity is and what the tradeoffs and the metrics linking cost and quality are.
So we, essentially, are beginning to use - we have internal resources devoted to developing risk-adjusted outcomes models, severity-adjusted cost models. We also have resources dedicated to identifying how veterans make decisions when they select a VA facility versus a private-sector facility, and we can compare things like quality, cost, access, benefits and service characteristics of our facilities versus those in the private sector and look at the impact of decision making.
As part of this particular effort, we were also able to identify fraudulent billing practices that occurred where healthcare plans, physicians' offices, et cetera were billing, double billing both VHA and CMS for the same set of services.
We think that when more timely data are available or integrated into our plans that physicians will be able to utilize this data online for treating patients.
And, finally, strategic opportunity identification where we can look at - from the corporate level, we can identify where we think the biggest opportunities are, whether they be in cost, quality or access, and identify corporate-level strategies that would be part of the corporate-level strategic plan for the coming year.
In terms of barriers to implementation, we have talked about the number of opportunities, I think, by the description that I have given. The size of these opportunities is potentially huge. So why haven't we taken advantage of these opportunities in the past?
These are some of the barriers that we have identified as we went through this project. First, there were few people that have knowledge about these three data sets and the ability to access the data.
Secondly, integration is very difficult and time consuming. Medicare and Medicaid and VHA data are three separate databases that were developed independently with different purposes, different data and different data definitions, and so if you can imagine every time in the past when we have tried to do an analysis of data where we had to integrate this, it was each time the data had to be integrated for that single purpose.
Data sets are very large and generally require a higher level of programming skill, and, generally, that is SAS.
Investment in hardware and storage media can be an important consideration, depending on the number of users.
Potential users of this data have different needs, and it is - those needs have to be carefully considered in designing a system. One system will not satisfy the needs of all users.
The size of the potential demand for this integrated data is unknown within our organization, and, hence, there is the risk of investing in a large system that is very expensive and may not produce a payback.
And, of course, privacy and security laws and regulations add in a very large dimension to managing this set of data and is something that is becoming increasingly more important, being elevated in terms of its priority in making investment decisions.
Next, the decision makers, by and large - and I guess I am talking in general - do not have the experience of using data that is outside of the organization. Historically - and I think this is true within most organizations - the focus has been on internal data, and, quite frankly, I would suspect that - just estimating - 70 percent of the information for solving most problems comes from internal information.
And so our managers and executives are not - do not have the extensive experience in requesting the kinds of information that comes from the integrated data.
And, finally, and another important consideration, is the economics of such a database, and when we talk about the economics, we are thinking more broadly, not specifically, at the costs, the dollars and cents numbers, but more broadly in terms of the fixed cost and the variable costs of these activities and whether it makes any sense to outsource those variable costs, those costs that could be converted from fixed to variable.
So the process in this project, first, we undertook a survey of users in our organization to try and identify both the size of demand and the timing of demand and also customer uses of integrated data, and those are potential uses, and we learned that demand would grow slowly, but would, over time, begin to increase rather dramatically as people learned how to use the data and how to access the data.
We hired a contractor who had experience in integrating Medicare and Medicaid data, but no experience with VHA data, and asked them to integrate the data and develop a user-friendly system.
The user-friendly system we considered key, because it would expand, exponentially, the number of users, and, specifically, historically, our user base for this integrated data have been researchers and data analysts.
We wanted to expand that to what we refer to as the casual users, the directors of hospitals, the directors of our VISNs or regions, our chief medical officers, et cetera, but they are not sophisticated users, and so the user-friendly system had to be simple enough for them to access the data, and we thought, as we expanded the user base, that we would be increasing the value that the organization received from the data.
The pilot test that we put together consisted of data integration of the three data sets that I have spoken about and systems design.
We had three white papers written which were basically analytical, short analytical papers by the contractor. He worked on three issues that were top priority issues for the organization at the time and presented some white papers.
We also did tutorials to researchers and data analysts about the system. We asked them to come up with a short research topic and to use this integrated system to address the quick research questions that they came up with.
After the tutorials and research projects were completed, we conducted a customer-satisfaction survey amongst the users to identify strengths and weaknesses of the system and of the integrated data. We also did a data validation study, where we validated the data that came from this integrated system with the raw CMS data that we have in our files.
Generally, systems design, we have in mind a multiphased project. Each phase would consist of design, use and assess. So we would be coming up with a first phase having our users assess it, going back to the design table with the contractor, redesigning it again and going out, having users use it and assessing it until we felt comfortable that the new system could be rolled out to the entire organization, and I have talked about the three customer groups, the researchers, data analysts and the casual users we were trying to look at.
In terms of Phase One, we used the five-percent sample of Medicare data for one year and 100 percent sample of VHA and Medicaid data for the same year and merged those three data sets.
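(Illustrative sketch, not part of the spoken remarks. The talk does not say how the three files were actually keyed or merged. The sketch below only assumes that each source has at most one record per person and a shared identifier; the field name `person_id` and the function `link_by_id` are hypothetical. It shows the general shape of merging VHA records with whatever Medicare and Medicaid records exist for the same person.)

```python
def link_by_id(vha, medicare, medicaid, id_field="person_id"):
    """Attach Medicare and Medicaid records to each VHA record by a shared ID.

    Assumption: one record per person per source. A person absent from a
    source (for example, not in a 5 percent sample) gets None for it.
    """
    medicare_by_id = {rec[id_field]: rec for rec in medicare}
    medicaid_by_id = {rec[id_field]: rec for rec in medicaid}
    linked = []
    for rec in vha:
        pid = rec[id_field]
        linked.append({
            id_field: pid,
            "vha": rec,
            "medicare": medicare_by_id.get(pid),  # None if no match
            "medicaid": medicaid_by_id.get(pid),  # None if no match
        })
    return linked


# Made-up toy records, purely for illustration.
vha = [{"person_id": 1, "visits": 4}, {"person_id": 2, "visits": 1}]
medicare = [{"person_id": 1, "claims": 7}]
medicaid = [{"person_id": 2, "claims": 3}]
linked = link_by_id(vha, medicare, medicaid)
```

(Keeping unmatched people with an explicit None, rather than dropping them, makes it visible how much of the VHA population a sampled file fails to cover.)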
As I mentioned, we did a customer-satisfaction survey, and it pointed to areas that were strengths of the system, but also some shortcomings in the system. So we are prepared to make improvements should we decide to go forward.
VHA, Medicare, Medicaid data were integrated using contractor assumptions; that is to say that VHA staff were not intimately involved at this point.
Intermediate data products, and these were basically SAS data sets that were produced from the raw data, were compared to VHA Medicare files and no significant differences were found.
Issues that came from the satisfaction survey, one was spending more time learning the system and/or making the system more user friendly were raised. There were technical questions raised about how to make the system faster, and, for example, with bigger machines, more memory, processing one year's worth of data at a time.
It was thought that managing risk associated with HIPAA, the Privacy Act and security regulations might be reduced via using a contractor and contractor's customized software, and, in the future, more involvement of VHA staff was needed.
Questions remain about whether there is sufficient demand to justify the investment, technical and other challenges - whether technical and other challenges can be overcome.
And some of those challenges are the user-friendly nature of the system. Can it be made more intuitive. Reducing the learning curve of researchers, data analysts and clearly for casual users.
We think that a reporting system linked to the output of the user-friendly system would satisfy the needs, for the most part, of casual users.
And addressing processing time issues is also another one. Processing time was mentioned by our technical folks. Another side of the story was mentioned by one of our researchers who said that his time from the point where the project was initiated and where integration of the data was called for to the time that he received analytical results was cut almost by a fifth.
Now, at the same time, what we were hearing from some of our technical folks was that it was taking up to 24 hours to run requests for large data sets.
So there is some benchmarking that is required as we go forward and some internal agreement as to what we are going to measure and how we are going to benchmark.
MR. LOCALIO: Richard, we have to move on to the next speaker, if you could conclude as quickly as possible.
MR. BJORKLUND: Okay. There are cultural differences issues that we need to overcome. I have mentioned the economics and the importance of outsourcing, and, finally, some organizational issues.
This has been a green-house project. We are not entirely sure whether it is part of a - the next phase should be part of a planning and policy office.
MR. LOCALIO: Thank you very much.
I want to introduce our next speaker, and, then, while you are setting up, maybe entertain a question.
Our next speaker is going to be Christopher Chapman of the National Center for Education Statistics.
Do we have a quick question for Richard on his presentation?
DR. SCANLON: This is maybe a comment as much as a question.
Since you raised sort of the issue of Medicaid data again, we heard about it before, it creates for me sort of a bigger issue, which is the quality of administrative data, and while we are interested in terms of linkages to be able to expand our capacity, there is a question of did we move too far in terms of reliance upon sort of administrative data, and while averages may sort of turn out for populations to be the same, when we do validation studies, when we get down and we start to slice things more and more, we may be on relatively thin ice, because the data are not good.
I raise this because of sort of prior work that I did at GAO. Medicaid data is always suspect, and there were efforts that we had to do where we had to go out and collect new data because the kind of information that comes to CMS is not necessarily sort of accurate.
It becomes even more problematic as Medicaid moves more and more towards use of managed care and the variability, in terms of what the managed-care plans report to the state, is increasing and leaves you big gaps, and so I guess this is - maybe there's not an easy answer to this, but I think it is an issue that we should be thinking about.
It is as much as sort of how much - in terms of trying to protect sort of privacy, whether people can be identified, it should be a concern about sort of linked-data sets should be - what is the extent of their strength? I mean, what can they be used for and what would be pushing their limits too far in terms of reliability and accuracy?
MR. LOCALIO: Did you want to respond quickly?
MR. BJORKLUND: Yes, in terms of the Medicaid data, yes, we agree. We have concerns about that.
Our primary focus in our best practices is with the Medicare and the VHA data.
At the corporate level, we try to what we say - what we refer to as dumb down the data. So we downgrade it from ratio and interval-scaled data to nominal and ordinal data, so as to eliminate some of that error, but, clearly, these are concerns.
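(Illustrative sketch, not part of the spoken remarks. The idea of downgrading ratio or interval data to ordinal categories can be pictured as simple binning, so that small measurement errors usually stay within the same category. The cut points, labels and the function name `to_ordinal` below are assumptions made up for illustration, not VHA's actual categories.)

```python
from bisect import bisect_right


def to_ordinal(value, cut_points, labels):
    """Map a numeric value to an ordered category.

    cut_points must be sorted ascending; labels needs one more entry than
    cut_points. A value equal to a cut point falls in the higher category.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly len(cut_points) + 1 labels")
    return labels[bisect_right(cut_points, value)]


# Hypothetical annual cost bands.
cuts = [1000, 10000]
bands = ["low", "medium", "high"]

# Two somewhat different reported amounts land in the same band.
band_a = to_ordinal(4200, cuts, bands)
band_b = to_ordinal(4350, cuts, bands)
```

(The trade-off is loss of detail: values close to a cut point can still flip categories, and differences within a band are no longer visible.)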
MR. LOCALIO: Thank you.
Chris, why don't you proceed. Thank you.
MR. CHAPMAN: Sure.
Hi. My name is Chris Chapman. I am from the National Center for Education Statistics, which is part of the U.S. Department of Education.
My presentation is really going to focus more sort of on our experiences at the center with using administrative record data that have been collected already through National Center for Health Statistics.
I guess before I get too far into this I should also make a disclaimer. I am speaking here not for the department or for my organization, but more as a data user.
That said, let me sort of jump in and discuss a little bit about the kinds of data that we typically get at the center regarding health.
Not too surprisingly, we focus most of our data collection on trying to get information about students and other individuals like teachers that are key to the education system, and apart from individual-level data, we also collect information directly from institutions themselves, in particular schools.
Most of our experience there, in terms of gathering health information, has been at the elementary and secondary level, trying to determine which students actually have individualized education programs which are specifically designed to help students with disabilities get the kind of education that they are going to need in order to function in society later on after school.
The data sources that we normally get our information from are parents, students and school records. Okay. These are not health-system type data collections. They are relatively general, and we rely on parents to have relatively good information about medical evaluations of their children, and we rely on students to be relatively knowledgeable about their health conditions, and for school records, as I mentioned before, we really focus in on the IEP data that schools gather and keep for their students.
However, it would be good and would be useful for us to be able to get more information about student health linked into the school record systems.
This next couple of slides, I am going to briefly go over the types of data collections that we've got in place, so that you'll have a better understanding of what we have available and what we usually work with.
This first slide focuses on the early-childhood longitudinal studies. These studies are actually in my program office.
As you can see here, we've got two cohorts. There is a birth cohort and a kindergarten cohort. The birth cohort focuses on a group of children who were born in 2001 and the kindergarten cohort focuses on a cohort of students who were in kindergarten during the 1998-1999 school year.
The reason why I want to start off with this slide is the ECLS-B, the birth-cohort study, really is, I think, the center's most extensive experience using health-record systems - okay? - in particular the sample for the study was drawn directly from the birth-certificate record systems that are available through the National Center for Health Statistics.
And apart from using the birth-certificate data, that data set also involves some direct health assessments. Our field interviewers actually did data collections on birth weight, height, cognitive growth and motor-skill development as the child progressed from birth through at least into kindergarten now.
And then we also had, as I mentioned before, many of our data sets, parent reports of diagnosed disabilities and overall health of the child.
The kindergarten cohort data collection had much of the same kind of health information, except for the birth-certificate data, and those data would have been useful to get.
We were not as experienced, I don't think, as some of the other organizations here with actually taking an existing data set and trying to cross link it with administrative record systems. So we did not undertake that.
And apart from that information, the kindergarten cohort also collected data directly from schools about IEPs for the sample children.
This next slide has some information about a high-school cohort that is comparable to the ECLS studies in that we are tracking a group of students over time. In this case, it is tenth graders, and we are tracking them through early adulthood.
Again, we - the health-related information we have in this collection were reports from the students' schools about their IEP status and health-related programs that the students were in.
We have also asked the parents to provide us information about diagnosed disabilities that might not have impacted their IEPs, and then we also asked the students themselves about their health status.
The National Household Education Survey collects data about populations from preschool through adulthood. Here, our only data source is information that we get directly from the parents and the students themselves.
The type of information that are on these data sets that would allow us to cross link to the administrative record systems is relatively limited. The sample draw for this particular data collection are telephone numbers, and we have not, to date, found a good way to even cross link the telephone numbers with strong, address-matching records, which prohibits our ability to link it into some of the more detailed administrative record-data systems that are out there.
This next slide summarizes our post-secondary data collections that we have collected to date and that we continue to field.
The biggest one here is the National Postsecondary Student Aid Study or NPSAS.
All these studies, again, rely on us getting reports directly from the students about their health status.
The NPSAS collects a lot more detailed information than many of our other studies, but, nonetheless, is still a self-reporting system.
Okay. I wanted to get back to the ECLS-B for a moment here, because, as I mentioned before, that is really our primary experience with this sort of activity, where we are trying to actually link in existing administrative record systems with our survey-data systems.
The birth cohort did this quite efficiently by starting out with the birth-certificate data system that is available. So, as a result, we have a very rich database available for these children right from their birth, and the data set is actually rich enough or the birth-certificate data is actually rich enough that we treat that initial birth-certificate collection point as actually an initial data set, even though we didn't do any surveys. We just basically took the data off of the birth-certificate data and we linked it into the student record, and we have been tracking it ever since, but that is really our only experience to date using the statistics with our survey data.
Staying with the ECLS-B, I don't want to minimize the health-record systems that are out there. Without them we could not have even done this study. We wanted to make sure that we had a representative sample of children at birth, and the most efficient way to do that was to use the birth-certificate-record system.
In the next few slides, I am going to go through some of the ramifications of that, one of which is that because the administrative record data are so rich and it is relatively easy to identify individuals using them, even in a relatively small-N sample study, with the ECLS-B, in particular, we have gone to a model where we do not have a public-use data set. Okay?
If researchers in the room are interested in using the data, they need to apply to the center for a license, and, then, we'll grant you the license, and we have a relatively stringent - procedure to make sure that the data are not inadvertently released and that no individually-identifiable information is ever published.
That said, much like the VA, we have done a lot of work to try to figure out ways to make the data more user friendly to the public. You know, you can get a restricted-use license if you are a researcher and do a lot of very interesting analyses, but a lot of the times the types of people who want to get access to the data are school administrators or child-care providers, who just want to get a general snapshot of what the population looks like. So we have been working with an online data tool that will allow people to get access to the underlying micro data, and we have done similar studies that other agencies have done to make sure that those reports are accurate and also to make sure that the types of data you can get out of the systems cannot drill down below groups that are smaller than 50 in number. Okay? That is our primary -
Yes, 50. I know. Some people think that's pretty big. Some people think that's way too small. We have gone back and forth over the years. To date, we haven't been able to - it with 50. We haven't really tried to drop it down any further, but right now, that is where we are at.
In - reports, however, we will and do produce tables that have cells that are based on Ns as low as three. If it gets below three, we go on data-suppression mode on the cells and collapse them, so people can't even figure out that we only have three or fewer cases in a cell.
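(Illustrative sketch, not part of the spoken remarks. The rule described here, collapsing table cells whose counts fall below a minimum, can be pictured roughly as follows. The function name `tabulate_with_suppression`, the label for the collapsed cell and the default threshold of three are assumptions for illustration; the center's actual disclosure-review procedures are not described in enough detail to reproduce.)

```python
from collections import Counter


def tabulate_with_suppression(records, key, min_n=3,
                              collapsed_label="Other (collapsed)"):
    """Count records per category, folding small cells together.

    Categories with fewer than min_n cases are merged into one collapsed
    cell. If even the collapsed cell has fewer than min_n cases, its count
    is reported as None (suppressed).
    """
    counts = Counter(rec[key] for rec in records)
    table = {}
    collapsed = 0
    for category, n in counts.items():
        if n < min_n:
            collapsed += n
        else:
            table[category] = n
    if collapsed:
        table[collapsed_label] = collapsed if collapsed >= min_n else None
    return table


# Made-up example: five cases in group "a", two each in "b" and "c".
rows = [{"g": "a"}] * 5 + [{"g": "b"}] * 2 + [{"g": "c"}] * 2
report = tabulate_with_suppression(rows, "g")
```

(The same function with `min_n=50` would correspond to the larger minimum group size mentioned for the online tool. A real system would also have to guard against recovering a suppressed count from row or column totals, which this sketch does not attempt.)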
At the center, most of our data collections are actually done through sample surveys, and, as I mentioned, with like the in-house - the National Household Education Survey, the sample frames themselves are often limited to the extent that you have identifiers that you can easily use to link into existing administrative records systems.
So, to some extent, some of these crosswalk activities might be of relative utility to us, but I started to try to think through, well, how could we access these rich data sets that are out there now and help them inform our studies, and we have some experimental articles that have been put out that look at self reports and crosswalk them with the administrative record data on like health records or medical records, and those studies have been relatively useful, in terms of us improving our survey items.
One way that we can use the administrative record system is to do relatively small-N studies whereby we actually can sort of cross link and purposefully design our study to cross link the survey data with the administrative record data to see just how accurate self reports are and to try to figure out ways to improve those self reports.
And another area of research that we probably should consider is taking a look at linking the administrative record data that are out there on health statistics with our own school-based administrative records systems.
We have relatively extensive administrative records systems that we have for both elementary and secondary schools, and also our postsecondary schools, and, right now, the type of data that we have really focus on disabilities, and that has to do with some legal requirements that the department has to help service students with disabilities, but thinking beyond that, we could, if the data were available, use health statistics and health data to try to figure out, well, are there students who are not necessarily of disabled status who could benefit from additional services? And, right now, we don't have any way to really get those data, and I think we could use the administrative records systems to do that.
I have a feeling that this first bullet was pretty extensively covered yesterday, but one of the key issues that we have and would have with trying to do some more crosswalks with records systems is actually getting the correct identifiers in our surveys.
In order to do that, we have to have a really good understanding of what information is available in the different records systems that are out there that we might link into, so that we are not collecting data to crosswalk that only will crosswalk with one record system and not another. We don't want to waste resources there.
And then we also need to have some help interpreting the data that we do get from the health databases.
I just heard a little bit of a back and forth here about what exactly is in the Medicaid data sets. We'd have to get a better understanding of the strengths and weaknesses of the administrative records systems to use them properly. We don't have that kind of expertise in house right now. I mean, we really focus on education types of issues.
I think that is it. So not too far over.
MR. LOCALIO: Well, thank you, Chris, and we have time for a couple of quick questions before we have to take a break.
Don.
DR. STEINWACHS: It would help me to get a little bit better idea, on your longitudinal studies you are picking up information on individual children.
MR. CHAPMAN: Um-hum.
DR. STEINWACHS: I guess linking in some of those information on the schools and those resources.
MR. CHAPMAN: That is right.
DR. STEINWACHS: Are there other national data collections that the Department of Education does that tracks children or are they all these - are they special studies or is there sort of a statistical system that -
MR. CHAPMAN: There is an ongoing collection effort, basically, where we start a new cohort every so many years of a particular population, so - Then that ranges all the way from high school up to college.
DR. STEINWACHS: And those would be nationally representative and -
MR. CHAPMAN: They are nationally representative data. That's right. So drilling down below the national level really isn't an option. The cost gets prohibitive very quickly with these types of collections.
DR. STEINWACHS: Because, on the health side, there are some interesting issues these days as people are very concerned about the amount of drugs and medications being given to children, whether it is for Attention Deficit Disorder or other problems, antidepressants. There are anti-psychotics now being given to small kids, so on, and, in concept, you might think creatively about or we might think creatively, I guess, about are there ways in which you could take information like out of Medicaid or other sources that could tell you geographically populations that are getting high rates of these and link them to school districts or things that would tell you something, and I was just wondering whether or not it was possible with the kind of national data, and I guess maybe it's probably a longer discussion -
MR. CHAPMAN: Right. The first answer is maybe. The second answer is we start those - Especially with the studies that we do of children who are already in elementary and secondary school system or who are already in college, we start that sample design with basically school frames. So if there is some way that we could link a student ID, especially once they are in college, when we start getting Social Security numbers, then, the linking process becomes relatively straightforward.
But for the younger children, as you were talking about, we might be able to do some linking activities through the addresses that the schools would have for the students and then crosswalk those into the databases that are out there on health statistics. I mean, we do think about that stuff.
DR. STEINWACHS: Thank you.
MR. CHAPMAN: Yes.
DR. STEUERLE: This question is a question I am going to ask later, when we get to our final session. So you might just answer briefly, if you have some answer, but I am involved in so many projects within particular silos of government organizations, but within the research community itself. So I am involved with one group that is trying to study children's outcomes that pretty much is now focusing on early childhood education and even beyond, and has sort of at least taken up, whether correctly or not, this model that the earlier we intervene with children the more the return on the investment, and then I am involved with groups like this, which is interested in healthcare, and then, at Urban Institute, we have another group that is helping to work with some of these longitudinal studies you are creating at Education, and sometimes I wonder how much do the health, the economic and the education researchers really get together when they design some of these samples.
I guess it is probably unfair to throw this all on you, except that it's so many cases where we are talking about outcomes and opportunities and mobility and issues like that. People always seem to come back both to early intervention and to education, and sometimes I don't know how to link them.
Give you a common example. For instance, some people now think we really should start at minus nine months, starting to measure what is happening to the well being of a child, because it could be that drugs and alcohol, depression, whatever other illnesses within pregnancy could have strong educational outcomes down the road.
So the question for you is to the extent you get together, you start designing these models, how easy is it to bring in somebody from HHS and how easy is it for them to come, and how easy is it to bring in people from some of these very different worlds and try to really design the models and the longitudinal studies you have?
MR. CHAPMAN: Okay. I think at the beginning I made a disclaimer that I am speaking for myself. I am going to do that again here. Not speaking for the agency.
The ease that I have experienced so far has been great. It is relatively straightforward to contact Health and Human Services or the National Center for Vital Health Statistics and say, We are developing this study of children from birth, and we are going to track them through the first couple of grades or at least through kindergarten. Can we start to have some meetings with staff in your agency that might be interested in related topics?
And for those early childhood longitudinal studies, we have had a lot of input from Health and Human Services, and, obviously, we had to get the birth certificate data for the - cohort, but we have also done some work with - in our own Office on Special Education - to get better measures and to think through measures in health.
I think what we run into isn't always necessarily a coordination problem. Although, those certainly do exist. We also run into just response-burden problems, and it is good to see OMB in the room, because we can only spend so much time with the student in a school setting or so much time with a child in their house or with the parent in their house without running up to really serious burden issues.
And they are good issues. I mean, we need to consider them, because, from our perspective, we want to get as much data as we can on education. So we focus on assessments and we focus on educational resources in the household and in the school, and, then, we know, from a lot of research, that there's health issues that relate strongly to educational development. So then say, Okay. Well, we better make sure we get some of those health statistics in there, but it is rarely the case that we can make it the primary focus of the study. So we are limited to the types of data we can collect, even with really good collaboration.
Does that answer your question?
MS. GENSER: I am Jenny Genser from Food and Nutrition Service.
I wanted to ask if your surveys contain information on receipt of school lunch and breakfast, and also if you have obesity data, because that is a big, big health-related issue that wouldn't show up in an educational plan.
MR. CHAPMAN: Right. It varies across our studies. I like to keep going back to the longitudinal study that we have for the early-childhood populations, because that is the one I work on day to day.
In that particular study, we actually work with USDA to make sure that we had proper items in there to ask about weight, but apart from that, we also have some direct assessments where we actually weigh the child and we do a body-mass index measurement that is part of the data collection.
That said, I don't want to say that that happens regularly and across the board in all of our collections. I mean, it is actually relatively unique to our early-childhood studies.
MS. GENSER: (Off mike)?
MR. CHAPMAN: We don't ask the parents regularly whether or not children are in free and reduced-price lunch programs for similar reasons that are causing problems for the CPS item on lunch receipt.
In order to really get at that, you need a good 10 questions, and we don't always have time to nail that down, but, at the school level, we do, in our school surveys, ask what - how many students in the school are actually getting free or reduced-price lunch and whether or not the program exists in the school.
DR. STEINWACHS: I want to thank the panel speakers very much and hope you'll stay with us and continue this dialogue. We are at the point where we promised you a break. We will deliver a break at this time, but what I always say is five minutes and figure that it probably stretches a little longer than that. So please take a break and come back in about 10 minutes or so.
(Break).
Agenda Item: Maximizing the Benefits from Linked Data: Access for Research and Related Issues
DR. STEINWACHS: We are sorry that Dr. Citro is not going to be with us today. She is ill and sent her regrets, very much, that she couldn't be here.
So we have just about an hour-and-a-half and split it among three speakers, and very happy to have Joan Turek, who deserves a very large part of the credit for bringing together this group, and we did ascertain that Joan is the one who knows everyone, and so certainly the right person to know, and so over the next hour-and-a-half, we'll be hearing from three key speakers on areas of Maximizing the Benefits from Linked Data: Access for Research and Related Issues.
Joan.
MS. TUREK: Thank you.
When they were setting it up, I asked to facilitate this section, because I am a major data user, and I am probably very obnoxious about wanting access to my data and not wanting anything to happen that could limit that.
It is unfortunate that Connie couldn't be here. There is a disc available either from Richard Suzman at NIA or from her, that has all of the reports of the studies that they have done on data sharing.
The first one, in 1985, was called Sharing Research Data. The last one, in 2005, was Expanding Access to Research Data. So it looks like, over the last 20 years, we haven't solved all the problems.
But we have three very good speakers. We are going to start with Brian Harris-Kojetin from OMB, who is going to tell us about their activities to improve access to data, and then we are going to talk to two major users about what they really want to get.
I think it is important that we have the data available and we have high quality data, but it is equally important that it is available to the people who wanted to use it, and I think that the value of the data is really dependent on our ability to get access. If you just collect it and stick it in a box somewhere, you may as well save the money.
Brian.
Agenda Item: Brian Harris-Kojetin, OMB
MR. HARRIS-KOJETIN: Good morning.
As the other speakers said, I have a similar kind of disclaimer. In fact, if I say anything inappropriate, the Director of OMB may well disavow that I even work there.
But I have been listening here for the past day and - well, yesterday and this morning, and I am not sure what I have to contribute to your discussion. I don't have any data sets. I don't link data. I don't directly use any of the data sets, but I'll share with you some work our office does in terms of - kind of related to this in terms of the confidentiality issues and some of the legal issues.
Another disclaimer is, of course, I am not a lawyer. I do play one on TV every once in a while, but - and I fake it in my job a fair amount, and, as you'll see, but, again, I have to disavow any of those kinds of things that I might say that sound like I actually can make such a policy.
There's a few laws - there's a couple of issues on the charge questions here that I thought I could say something about and see if it is at all helpful to the committee.
There are several laws that have some impact on data sharing. Here are the ones that I am familiar with.
One of the things that came out in the presentations by the folks from Census yesterday, you heard Title 13 mentioned quite a number of times. Certainly, a primary consideration is whatever statute - whatever authority - legal authority the agency has that is originally collecting the information, whether that is being gathered for statistical purposes or whether it is being gathered for administrative purposes, the agency has to have some defining authority to gather that information, and, oftentimes, their statutes will specify what are appropriate uses for the information and if there are any confidentiality provisions for that, and sometimes these are very vague and, you know, Go out and gather data on health or on the economy and do good works and disseminate it. Other times, it is very specific.
One administrative data set that many of you may be well aware of is the National Directory of New Hires, and those of you familiar with it know that there are - Well, with the exception, I guess, of SSA, every other agency that has access to the data has access to it for a specific purpose that is very carefully specified, and this is - So that is one example of what you are allowed to access the data for, how it is allowed to be used, and it can be specified by each agency that may be allowed access to it, and if you are not explicitly allowed access to it, then, even though you have the grandest intentions, you can't have access to it.
Folks yesterday also mentioned a couple of broader laws that apply across government agencies. The Privacy Act and some routine uses of information came up. The Paperwork Reduction Act was also mentioned by a couple of folks.
Those of you not as familiar with it may be interested to know that under the functions of the statistical policy and coordination functions that are codified in the Paperwork Reduction Act, the director actually specifically authorizes the chief statistician to promote sharing of information collected for statistical purposes, but consistent with the privacy rights and confidentiality pledges.
What I am mostly going to talk about this morning, the thing that I mostly know about, so this is why I assume I got invited, is CIPSEA, which many of you are, I believe, aware of, and maybe you know everything I am going to say.
CIPSEA is the Confidential Information Protection and Statistical Efficiency Act of 2002, which is why we call it CIPSEA instead of all of that.
It is composed of two major titles. The first one is confidential information protection, and you can see the purposes here: really, to strengthen public trust in pledges of confidentiality, prohibit disclosure in identifiable form, control access to and uses made of statistical information, and ensure that information is used exclusively for statistical purposes.
CIPSEA provides a nice statutory floor for the use of information collected exclusively for statistical purposes.
The second part of CIPSEA is the statistical efficiency part that applies only to three designated statistical agencies - the Bureau of Labor Statistics, the Census Bureau and the Bureau of Economic Analysis.
The goal of this subtitle was to reduce paperwork burden on businesses, improve the comparability of economic statistics, specifically mentioning BLS and Census, comparing their establishment - and increasing the understanding of the economy.
Quite a number of you, I am sure, are intimately familiar with the history behind this, why CIPSEA was sought after for literally decades. This went through many evolutions. There were many bills that came so close, some that passed the House and then died, and we finally got CIPSEA in 2002.
As Nick noted earlier, if he is still here (I thought I saw him, but maybe he stepped out), there was a companion bill - a Treasury companion bill for CIPSEA - for amending 6103(j). I don't know if it was ever even introduced, but it has not been passed. It is really key to some of the data-sharing provisions within CIPSEA, but it has not gone forward yet.
But one of the reasons why this has been a very important law is that there is a real patchwork among many, many different agencies that do some kinds of statistical activities. Now, there are 10 agencies that are often referred to as principal statistical agencies, whose sole mission really is to do statistics, but there are many others represented in this room and outside this room that do some kinds of statistical activities as part of their other mission, which may be regulatory or providing services.
And there have been many attempts over the past number of years to strengthen and try and standardize the statutory protections for the confidentiality of individually-identifiable data.
As I was saying before, every agency has their own specific statutes and has variations in confidentiality protection. Some, like Title 13, are extraordinarily strong. Other agencies, prior to CIPSEA, like BLS, had practically no authority whatsoever - legal statutory authority - to base a promise of confidentiality upon, and so CIPSEA really provides a ground-level kind of foundation of protection for information gathered for exclusively statistical purposes under a pledge of confidentiality. So I have been saying this, now, uniform protection. It covers all data that an agency collects for statistical purposes under a pledge of confidentiality. There's very strong penalties. This is similar to penalties that some agencies already had, like Census under Title 13, and NCES, under their statute: a $250,000 fine and/or five years in prison.
It also specifically says that the information is exempt from FOIA requests, because it defines those as a non-statistical purpose.
When we talk about - there's a few key distinctions that are important in CIPSEA. One is between statistical and non-statistical agencies. CIPSEA provides a definition of statistical agencies: those whose activities are predominantly the collection, compilation, processing or analysis of information for statistical purposes.
Statistical agencies have - are given some special privileges, and also some requirements, extra requirements under CIPSEA. Specifically, one area that I think many of you have been interested in is this ability to designate agents, which may be a contractor or an external researcher or - this is similar to Census' authority to have special-status employees. CIPSEA specifically provides this authority only for statistical agencies, not all federal agencies.
The other key distinction that we talked a fair amount about in the discussion yesterday is this statistical versus non-statistical purposes. CIPSEA really puts in statute this functional separation between statistical purpose, defined here, and a non-statistical purpose and draws a bright line between these two uses; that is, any information collected for statistical purposes cannot be used for a non-statistical purpose, a non-statistical purpose being using the information in identifiable form in a way that affects the rights, privileges or benefits of a respondent, which we talked a little bit about yesterday afternoon.
So the Census Bureau can get administrative data that were used for a program - that were originally gathered for non-statistical purposes - and take that across the firewall there and say, Now, we will use it for exclusively statistical purposes. However, it does not go back. CIPSEA is drawing that same bright line there. Even if your intention was to give it to Census, get better race codes and say, then, Can we have it back in our administrative records? Sorry.
So what requirements does CIPSEA impose on agencies? Inform the public, basically, that CIPSEA has to - can only really take effect here is - You are only collecting something under CIPSEA if you are adequately informing the respondents that you are going to use the information for exclusively statistical purposes and keep that information confidential, and we've got some forthcoming guidance that talks about this specifically as a CIPSEA pledge, and, of course, safeguard the information and protect it. CIPSEA is law for protecting the information. Honor that pledge that you make to respondents.
In terms of data sharing, as I said, and as many of you are well aware, the provisions are very specific. Only business data are covered. Only three designated statistical agencies are authorized for this business data sharing. So that is all that CIPSEA is itself authorizing. It is important to point out that CIPSEA is not altering existing laws that may permit other data sharing among federal agencies, but CIPSEA did not itself authorize any further data sharing than this business data sharing and - between BLS, BEA and Census.
So some implications here, which I think is - which you are most interested in that you may, again, be well aware of already.
For federal agencies that are acquiring and protecting confidential statistical information, CIPSEA may offer some new protections for those agencies that didn't have strong legislative protection already.
It does not - It specifically does not restrict or diminish any existing protections. So if the Census Bureau is gathering information under Title 13, for example, and they can only use that information for a Title 13 purpose, CIPSEA does not restrict or diminish that at all. CIPSEA does not say, Oh, but why can't you do something that - CIPSEA would let me do any kind of statistical purpose. It does not affect that; that is how the lawyers, I understand, are interpreting it now.
For federal agencies providing access to confidential statistical information, CIPSEA does permit statistical agencies - remember, only statistical agencies - to designate agents, to perform exclusively statistical activities. This is not a requirement for statistical agencies to do this. It is a may. They may, if they so choose, do this.
This will require from them policies and procedures for access and control, responsibilities for providing security and employee training and these things take resources, as everyone in the room is well aware. RDCs are not cheap, any other means of - Chris Chapman, if he is still here, can tell you licenses are not free, and -
Implications for researchers, just to kind of wrap up here. One of the things I wanted to make clear was that some people had viewed the language in the law regarding agents as opening up researcher access to all data collected by statistical agencies, which it really does not do. It does provide a means for statistical agencies to designate agents, but imposes stringent requirements on those agencies and the agents to protect the confidentiality of the data.
It is important to remember, as I was saying, that CIPSEA doesn't diminish any of these existing protections, and so it is not going to remove some of the barriers that currently exist, and it also doesn't provide a right of access to federal statistical data. Researchers who obtain authorization to access the confidential data, for exclusively statistical purposes, have to share that responsibility to maintain and uphold the confidentiality of any data they access.
And, as you all are well aware, different agencies vary in the sensitivity of the information that they have, and they may not be able to provide access to their data or all of their data, or may have to do so under varying circumstances or in different ways. So researchers seeking access to those data will have to conform to the agency requirements and respect those confidentiality provisions, even if those are more limiting and restrictive than those that they may have in their own institutions or may encounter with some other data sets. I think that was what I had to share with you this morning.
MR. LOCALIO: Thank you for your presentation.
I just have to say that one of the problems that I mentioned yesterday, if you were here, is that people refer to, The lawyers have done this, and the lawyers have done that, but the lawyers are not here, and they do not understand the problems that we are discussing. In fact, I am not sure that they care about the problems that we are discussing.
MR. HARRIS-KOJETIN: I disagree with you. Some of the lawyers we work with are very, very well informed on this issue.
MR. LOCALIO: What I find, and I have reviewed some of the legislation, I have a copy of CIPSEA here that I have been carrying around with me for the last three years, and I have a copy of the statute that NCHS uses, and they are in conflict. They are vague. Some of the things are vague, and it doesn't seem there has been an effort to reconcile them.
Is there any further effort, post-CIPSEA, to say how these statutes are going to work in practice? Is there any effort to evaluate the implications of these statutes, in terms of who gets what, when, where or do they just pass it and then they say, Well, this is going to work?
What is the evaluation component of - Does OMB have an evaluation component to figure out whether this is working? And I am not talking about the second provision, the data sharing among the three agencies. I am talking about essentially the first one.
Thank you.
MR. HARRIS-KOJETIN: Very pertinent question and I don't know that the evaluation is strictly in the law, but I think that is something we do care deeply about or I do, since I am speaking for myself.
We have got guidance forthcoming on CIPSEA. Some of you may have heard me say this for the past two or three years. Nick Greenia, in the room, I think has just given up asking me when it is coming out.
It is actually coming out very soon, but you can't really trust whatever I say on that issue, since I have said it has been coming out soon now for two or three years.
So we do have some fairly lengthy, in terms of relative to the statute, like about 30 pages of guidance on implementing CIPSEA that we will be issuing that will help agencies that are using that.
We have gone through quite an intensive interagency process to help develop and inform this. Again, a few folks in the room were on an interagency team that has helped give OMB input into doing that, and so as that gets out there, as agencies are going forward, we have had a number of questions that have come up from agencies in terms of what impact it has on their programs, how it will affect their operations. We have been dealing with those on an ongoing basis, and, again, have used that to help inform the broader guidance.
So it is an evolving process and it is something that every agency is going to struggle with a little bit in terms of what kind of changes and what kinds of things does this do for us, and we have some reporting requirements for agencies to get back to us on how they are using CIPSEA, specifically how the statistical agencies are using the agent provisions, and so we can evaluate and monitor this and see where things are working.
Many folks are, of course, very interested in the data-sharing portion of that and hoping that we could go back to Congress after some time, after we prove how good of a job that the three designated agencies have done sharing business data, to see if we could explicitly expand that authority, which is where I thought you might go, even though you pulled back from that.
DR. STEINWACHS: I think, just to clarify that. There was a reference in there to business data, and so that was the sharing among BLS and the others of information on employers and business activities in the U.S.?
MR. HARRIS-KOJETIN: Exactly. Yes, economic data, and so the business lists between Census and any of the economic surveys that Census and BLS did.
DR. STEINWACHS: Thank you.
MR. PETSKA: Can I make a comment also?
Again, speaking from my own personal view, I wish that CIPSEA would have gone further and kept in the tax component of that, because when the early discussions of CIPSEA were unfolding with Brian's boss, Kathy Wallman, at OMB and other representatives of the federal statistical agencies, there were a lot of valid research purposes that were articulated for sharing of data, including tax data and so on, and once CIPSEA started to be formulated, it became very clear that one of the more controversial aspects was tax-data sharing, and the question is - I don't know if it is because of congressional committees. Clearly, there was not strong support on the Hill.
Randy Kroszner, in the Council of Economic Advisers office, was pushing this and so on, but when he left the administration, it seemed like that piece had no possibilities at all.
I spoke to him a couple of months ago at a conference in Cambridge and his comment was, CIPSEA, as it was, was the best deal we could get, that we would have liked to have expanded data sharing to more agencies, including the tax component, but it was clear that that bill would be dead on arrival, which is unfortunate.
MR. GREENIA: I am Nick Greenia from IRS and I just wanted to add a couple of things.
First of all, I wanted to address the previous question in terms of the evaluation, because the guidance that is, we hope, going to be coming out next month - Brian, is that right? - in the Federal Register -
MR. HARRIS-KOJETIN: Absolutely.
MR. GREENIA: The guidance is actually - I was on the interagency committee for that. So I can speak to that a little bit.
I think the outlook is unknown. I think the answer to your question is it is not clear, and one of the reasons I say that is because there's a lot of flexibility in terms of how agencies can safeguard the data, how they make the data accessible to researchers, and I think what may come about is, if you will, a de-facto evaluation, which is that researchers and Congress, if there is another data-sharing effort, are going to look at the experience and they are going to see that, You know what, it depends on the sensitivity of the data, in terms of what sort of safeguards are prescribed, and procedures, and - are on, and there is going to be a lot of flexibility, and there is going to be a lot of variability in terms of how the data are accessible and protected.
And I just wanted to add something to what Tom Petska said, since Connie Citro is not here. As you may know, that report on the data-sharing workshops was released last Friday - I am doing a little plug here for CNSTAT, of course - and if you want to get an idea of some of the difficulties, including the recommendations, facing tax data for purposes of data sharing, I would highly recommend you read the article coauthored by Mark Mazur and myself on tax data and some of the many, many issues that have to go into that.
And picking up on what Tom said regarding why the tax-amendment bill did not go anywhere, as you know, the tax-amendment bill accompanied CIPSEA in July of 2002 to Congress, and CIPSEA proceeded to the floor of Congress, and the J-bill, the amendment to the tax bill, went to Joint Tax Committee, and there were a number of reasons, we think, that the tax bill foundered. Tom has put his finger on one of them, which is the leadership vacuum, but there are some other lessons that we think are valuable as well, including freezing the items in the statute itself, as opposed to allowing infinite regulations to stipulate item content in the future.
So I highly recommend you take a look at that article for tax data.
MS. TUREK: Thank you.
I have one question also.
We have been talking here about new forms of data, that would be the survey and administrative data linked together.
Has OMB begun to look at the implications for users and to look at whether or not we would need any kind of new statutory language to permit this kind of data to be shared? Because, I mean, it would not be identifiable, presumably, but there are the disclosure-risk issues.
And will OMB take a position or will it look at what should be done to help users get access to this data once it is available?
If you do a SIPP that's got a match to administrative records, will you look at whether or not we could have a public-use tape?
MR. HARRIS-KOJETIN: I thought David was going to look at whether you can have a public-use tape or not.
You'll have a public-use tape.
MS. TUREK: Thank you.
I mean, I just wondered if OMB had a role in this or -
MR. HARRIS-KOJETIN: We could have a role if we need to have a role, if it seems that that would be helpful.
Obviously, when you are bringing in the linked data, there are other things that go along with it. The example I was giving before, an agency certainly can promise confidentiality to a statistical - Well, a statistical agency can promise confidentiality to another agency, and so BLS, for example, does this all the time when they gather information from states. They take data that states may not consider confidential, but say I will use this for exclusively statistical purposes and I will keep it confidential, and once it goes back there, then BLS intermingles it with their other information, and, in essence, they have elevated the level of protection required. Just like whenever anything gets intermingled with IRS tax data, it gets elevated to that status of protection.
MR. IAMS: Could I make a comment?
I am Howard Iams from Social Security.
I really think Tom is correct and Nick is correct. You have to have legislative authority to permit a broader sharing than currently exists. The agencies do not have the authority to pass data to other agencies for use at those other agencies for statistical purposes, and this is separate from disclosure and confidentiality. The agencies just cannot pass it and cannot use it without some sort of legislative authorization, and Brian can - I don't know - perhaps disagree on some instances, but I don't think that the agencies that are interested in this can go further than what they are doing now, and a lot of the limitations that you are hearing about are created by this legislation.
Now, the disclosure raises a - Well, let me finish - My train of thought would be that if you are a CIPSEA-authorized or a CIPSEA-compatible agency, which I think I am in - We are a statistical outfit in a big administrative organization, but we just do statistics and policy research.
There ought to be, ideally, permission to share confidential agency data with such a group for them to use it however they wish for whatever purpose they want, not just Title 13, not just Title XYZ, but that it is a statistical-analysis function that might have policy implications, might not, but it is for a statistical purpose. It is not going to go and administer somebody's benefits or affect some individual's rights and whatever, as CIPSEA defines it. That would open up a whole lot of sharing that outfits could do and a whole lot further analysis with this type of information than is currently possible.
Once you bring in the disclosure, confidentiality issues, you raise a whole lot of other things, and my only comment would be, the thing that I think undermines almost any public-user file is geography.
We put a copy of our New Beneficiary Survey - It is sitting on the web. It's got two surveys. It's got earnings from our tax records. It has hospital records from the Medars(?) file. It has our benefit information.
We cleared this - with Pete Sailer's help at IRS - through their requirements. It meets all their confidentiality requirements. It meets Medicare's, CMS's - at that point it was HCFA. It meets SSA's. The key is there is no geography. It is a big country out there. There are a whole lot of characteristics that you could say are unique, but you really can't tell, because it is a big country out there.
If you know what state someone is in, it is over. It is not a big country. It is a small state.
Now, we have a national program. For Social Security, it doesn't matter what state or what locality you are in, but if you are dealing with TANF, you want to know the state.
I, being selfish, think that they ought to put out all this administrative data on a national file with no geography, and if you want to use geography, you should have to go to these research data centers. That is what the University of Michigan does with their Health and Retirement Survey. There are restrictions. Users can use it in University X in Alaska, Hawaii, whatever. The only place you can do geography is in Ann Arbor, Michigan. That is the price you pay to have geography with those data. If you've gotta have geography, you will never have a public file with confidential data. It will not be possible in today's age. My judgment.
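(Illustration of the geography point. This is a minimal sketch under stated assumptions: the records, the field names, and the count_unique helper are hypothetical and are not drawn from any SSA, IRS, or CMS file. It only counts how many records become one of a kind once a state code is added to the released fields.)

```python
from collections import Counter

def count_unique(records, keys):
    """Count records whose combination of values on `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)

# Made-up respondents, for illustration only.
records = [
    {"age": 67, "sex": "F", "state": "AK"},
    {"age": 67, "sex": "F", "state": "NY"},
    {"age": 67, "sex": "F", "state": "NY"},
    {"age": 70, "sex": "M", "state": "HI"},
    {"age": 70, "sex": "M", "state": "NY"},
]

# National file: no geography among the released fields.
unique_national = count_unique(records, ["age", "sex"])

# Same file with state added to the released fields.
unique_with_state = count_unique(records, ["age", "sex", "state"])

print(unique_national, unique_with_state)
```

In this toy file nobody is unique on age and sex alone, but several records stand alone once state is included, which is the sense in which a big country turns into a small state.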
MS. TUREK: I find that fascinating.
I think we ought to go to our last two speakers who are both users, and actually will be from a very different perspective, I think.
Heather Boushey is an economist with the Center for Economic and Policy Research and Dr. Deb Schrag is - I guess you are on the staff of the Memorial Sloan-Kettering Cancer Center, and so we have here an economist and a medical doctor who are both heavy data users. So I think it'll be really interesting.
Heather.
Agenda Item: Heather Boushey, Economist, Center for Economic and Policy Research
MS. BOUSHEY: Great. Thank you. Thank you so much, Joan. Thank you for inviting me to speak here today. It is a pleasure to have the opportunity to talk to you about the way that we use data.
Before I talk about the main points that I want to make, I want to tell you just a little bit about myself and my organization and what we do, because my understanding is that that is why I have been invited here to speak today, to talk about how we use the data that agencies make available.
I am an economist. I work at a think tank here in town called the Center for Economic and Policy Research. We are a very small shop. We have four economists, about a staff of 15, and we do research on economic issues facing people here in the United States.
We are very heavy users of the CPS - the Current Population Survey - and the SIPP, the Survey of Income and Program Participation - although, I don't know, in this audience, if I need to spell out those acronyms, but I am so used to doing it.
We make use of this data in a very timely manner, to affect both media debates and policy debates around pressing policy issues.
We work on very short time frames, and, because of that, we have spent the past five years taking both the SIPP data and the CPS data - we are working on the March, but we have done this for the ORG - and creating what we call Uniform Data Files.
As many of you know, if you use survey data, they do these fabulous things at Census and BLS where they'll change the name from year to year or they'll do little things that mean that you can't just write one piece of code and pull out the data from every year.
So you have to invest a lot of time if you want to know what has gone on between 1973 and today or 79 and today. If you want to have a time series, you have to sort of make this huge up-front investment.
So we do that, and we have made all of this publicly available on our website, all of our code and our uniform extracts, but we have made this investment so that when there is a debate on the Hill or when the media says something that is inaccurate about what is going on in the economy, we have a data set that is up and running, that is on our desktops, and we can then comment on it in days or weeks, rather than months or years, and I know all of you know just how complex this work is, and being able to do the work up front and have it available for timely analysis is a critical part of our mission and how we use the data.
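(Illustration of what a uniform extract involves. This is a minimal sketch under stated assumptions, not CEPR's actual code: the raw variable names, the years, and the helper functions are hypothetical. The idea is one lookup table per survey year that maps that year's raw names onto a single uniform name, so that one piece of analysis code runs across every year.)

```python
# Hypothetical raw-to-uniform name maps; real CPS or SIPP variable names
# are not given in the remarks and will differ.
RENAME_BY_YEAR = {
    1979: {"WAGE": "hourly_wage", "AGE": "age"},
    2005: {"ERNHR": "hourly_wage", "A_AGE": "age"},
}

def harmonize(year, record):
    """Return one record with that year's raw names replaced by uniform names."""
    uniform = {"year": year}
    for raw_name, uniform_name in RENAME_BY_YEAR[year].items():
        uniform[uniform_name] = record[raw_name]
    return uniform

def build_uniform_file(raw_by_year):
    """Stack every year's records into a single list with identical field names."""
    rows = []
    for year in sorted(raw_by_year):
        for record in raw_by_year[year]:
            rows.append(harmonize(year, record))
    return rows

# Made-up input, for illustration only.
raw_by_year = {
    2005: [{"ERNHR": 15.0, "A_AGE": 41}],
    1979: [{"WAGE": 4.5, "AGE": 30}, {"WAGE": 6.0, "AGE": 52}],
}

uniform_rows = build_uniform_file(raw_by_year)

# One piece of code now works for every year in the series.
wages_by_year = {}
for row in uniform_rows:
    wages_by_year.setdefault(row["year"], []).append(row["hourly_wage"])
print(wages_by_year)
```

The up-front cost sits in building and checking the lookup tables; once they exist, a time-series question becomes a single pass over the stacked file.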
Which is not to say that we don't have longer-term research projects. We do projects that take years, but it builds on this uniform data file and it is always with this goal of policy work.
We do not do any linking with administrative data, because it is certainly beyond the kinds of timely work that we could do.
I do have experience working with administrative data in another life, when I was at the New York City Housing Authority. So I do have some understanding of just how complex some of those issues are, but I have not matched it, just so that - I don't want any questions about that. I have never done it. Don't want to. Sounds complicated.
So being able to have data on our desktops that we can use quickly and accurately and that we have confidence in because we have already done all the background work has been critical to our success, and it is that major point that I want to relate to the main points I want to make to you here today.
We are concerned about timeliness of the data, and we are concerned about our access to it. The question that Joan said about public-use files is critical to what we need to know.
And we are also concerned about - and I don't know how germane this is to this topic, but we are concerned about maintaining access to survey data. I think - and I'll talk about that just for a few seconds at the end.
So my first two concerns about timeliness and accuracy of the privacy issues, of course, they are linked, and so the major questions that we have are will administrative data that has - survey data that has administrative data matches lead to delays in releasing the data? Will it be less timely than it is now?
And, second, will it require new security measures, kind of like the things that have already been talked about, requiring us to go to special locations or special sites to use the data? That would significantly delay our ability to use it, and not just delay it by days or months, but could delay it by years, because we wouldn't be able to have it sort of up and running and ready to go when we have an issue.
Now, I have not been doing this kind of work for maybe as long as many of you have, but I have heard tales from people who got their Ph.D.s in the 80s and before that it used to be the case that if you used survey data, you had to go to special computers, because you couldn't do it on your laptop, and I hear they had these things called cards and that it was very time consuming, and what I find - the point here is that I think that the way that we are able to use data now has transformed the way that we are able to engage in policy debates, both at the national level and the state level.
The fact that we have access to this data, not just on our desktops, but I have access to it on my laptop at home, I can do it on the train, that we are able to do these very complex kinds of work in a much faster way than we had been able to in the past.
I think you can - There is some correlation between that and the rise of think tanks like mine and private research organizations that are affecting policy debates at the state and local and national level, and I think that that is an important accomplishment that technology has given us, and we don't want to sort of move backward in any way. So I think that that is just a very critical, critical point.
To give you a couple of examples, when I have been asked to testify in front of Congress, both in the House and the Senate side, I had no more than two weeks' notice, and, in one case, I had just five days' notice.
These are very short time lines, where, if you want a specific number from your data, you need to just be able to go to your computer. You don't have time to go and to sign up and to wait.
But the second kind of concern we have about timeliness is that we often only have a few months lead time to know what the issues actually are. We spend a lot of time thinking about what kinds of policy issues are going to come up, what do we need to prepare for over the next year or two, but we may not know whether or not Congress is going to vote on minimum wage this session or next session until a few months out.
So having to go through application processes is simply just not viable for those of us engaged in policy debates. It is perfect, it is fabulous for academics, and we build on their work and use it, but we need things that are, of course, shorter.
And on this issue, I might add that, although our organization is on sort of the more progressive end of the political spectrum, we have been working very closely with the Heritage Foundation on these issues about access and timeliness, because they are just as concerned as we are. It is certainly a point that transcends, I think, political boundaries and sort of goes beyond left and right, which I think is a very important point, and especially this issue about having independent organizations be able to access government data to discuss pressing policy issues is one that we all, both on left and right, agree on.
I basically am making the same point over and over again. So I won't be that much longer here. We need access to timely data. So, hopefully, that has gotten through here.
So the final question in the set of questions that we were given ahead of time to focus on was what are the potential costs to the public from failure to take advantage of these opportunities, and I have a couple of comments on that.
First of all, one of the largest projects I am engaged in right now, we are looking at take-up or effective coverage of benefit programs in 10 states. We are doing this for advocacy purposes, for public policy, and we are doing it using the SIPP and the CPS and the National Survey of American Families.
Now, we know that each of these data sets has significant problems with how people report their benefits, that there is under-reporting of benefits. It would be absolutely fabulous to be able to have data that is matched, so that you can look at eligibility for public programs and then get the numerator to be an actual estimate of coverage. That would be a significant improvement, and, right now, we are working on this project, because, in the states, many of the state groups that we work with are very concerned about take-up of public programs, and because there is no real place that makes a lot of this accessible across a wide range of programs, because the eligibility rules, on the one hand, are so complicated, if you want to look at eligibility, you have to use survey data, and, quite frankly, the SIPP is the only survey that I have used that has enough questions to really get at the complexity of eligibility, which, of course, as Mr. Iams said, this is all at the state level, this game is all at the state level, but we really don't have a good numerator, because we don't have administrative matching. Being able to have that matched data does have - I mean, significant policy implications that we could be using right now. So that would be wonderful.
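[Editor's illustration, not part of the spoken remarks.] The numerator problem described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `Person` record, its field names, and the four sample records are hypothetical and invented only to show the arithmetic. They are not drawn from the SIPP, the CPS, or any real matched file.

```python
from dataclasses import dataclass


@dataclass
class Person:
    person_id: str
    eligible: bool          # hypothetical: eligibility simulated from survey questions
    reports_benefit: bool   # hypothetical: what the respondent told the survey
    admin_enrolled: bool    # hypothetical: what a matched administrative record shows


def take_up_rate(people, use_admin):
    """Share of eligible people who are covered.

    The denominator always comes from survey-based eligibility.
    The numerator comes from self-reports or, if use_admin is True,
    from the matched administrative enrollment flag.
    """
    eligible = [p for p in people if p.eligible]
    if not eligible:
        return None
    if use_admin:
        covered = sum(p.admin_enrolled for p in eligible)
    else:
        covered = sum(p.reports_benefit for p in eligible)
    return covered / len(eligible)


# Invented records: person "b" is enrolled but does not report it.
sample = [
    Person("a", True, True, True),
    Person("b", True, False, True),
    Person("c", True, False, False),
    Person("d", False, False, False),
]

survey_rate = take_up_rate(sample, use_admin=False)   # 1 of 3 eligible
matched_rate = take_up_rate(sample, use_admin=True)   # 2 of 3 eligible
```

In this toy case the self-reported numerator understates take-up because one enrolled person does not report the benefit. That is the kind of gap a matched file would be expected to close, assuming the administrative record is itself accurate.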
Of course, all these issues about privacy, I leave it to you all to sort that out, but we would love to have access to it.
But I do have a couple of concerns about sort of the move to the matching - particularly, and, of course, my perspective is thinking about this matching either SIPP or CPS or ACS, one of the surveys that I have used, with administrative data.
My first concern is, with my limited experience using administrative data, I think that there are concerns about accuracy and how high the - what the gold standard is.
It seems to me that we need three kinds of data. We need administrative data that tells us one thing, but there are biases and there are - obviously, there's problems and there's errors in that data as well as there is with survey data. The biases run in different directions.
We need survey data to tell us about the wide populations, the full scope, and I think we also need qualitative data to tell us some of the why questions, but that is a whole different group of people.
This question about whether or not the administrative data is always perfect, and what we are going to gain from matching and how we talk about that, matters especially with the public and especially with the people that we are trying to convince how important this is. I think it is important to just note some of the caveats and some of the potential problems with that, in terms of accuracy.
Second, administrative data is clearly no substitute for survey data, and this cannot come at the expense of these surveys.
Again, looking at the issue that I look at, eligibility for public benefits, this is something - and we need to know how many folks are eligible for Medicaid and who aren't receiving it or who are receiving it. The only way we can do this is through surveys that ask a ton of questions of people at a subannual, a monthly level, because this is the way that people access these programs.
I mean, and just to go off on that just for a second, one of the things we learned from our work with the SIPP - and looking at take-up - is that people are - People move up - their incomes move up and down month to month, and when they access the system is not necessarily the month that they don't have any income, because it takes months - weeks or months for them to even make it to the office or get on line or get all their papers together to receive Food Stamps or another benefit program. You need access to the survey data that provides you with those dynamics. They are not substitutes.
And then my final point, which is going back to my theme here, if the cost of matching is that we lose in terms of timeliness or the ability for the public to access public-use files, then I think that is a serious concern and one that we should spend a lot of time focusing on.
And I think, having said that message about 12 times here, hopefully, it has come through, and I think I will stop there. So thank you very much for allowing me to speak to you.
Agenda Item: Deb Schrag, Memorial Sloan-Kettering Cancer Center
DR. SCHRAG: So I am Deb Schrag. I am a physician and health-services researcher at a big cancer center in New York City, Memorial Sloan-Kettering Cancer Center, and I also appreciate the opportunity to speak to this group.
Unlike Heather, I don't do anything in days to weeks. I am more representing the perspective of academics, and we do everything on a months to years time frame.
I am going to - I guess, I have to say that I had a completely different talk when I came here yesterday, and perhaps it is being towards the end of the workshop, I revised my slides and got rid of most of the data examples and slides that showed the results of various linkage projects I have been involved in, and put in what I'll call more philosophical slides, because I think some sort of more conceptual framework - Maybe it is just - at the end of a workshop, it feels like a conceptual framework tying together all these different enormously complex issues that we have heard about for the past two days is sort of in order and maybe there'll be some discussion in that regard at the end.
Again, I represent an end user, not my institution or any agency.
So types of research questions, examples of linkage attempts, challenges that we have encountered and, of course, a wish list to add to Heather's.
So I guess I am here representing academic health services researchers, and we examine, obviously, relationships between need, demand, supply, delivery and outcomes of healthcare.
The big topics for us, I would say, over the - since - in this decade - and I think that these are going to remain front and central on people's research agendas - are disparities in healthcare, access and barriers, technology dissemination. Quality measurement is a big one, and, ultimately, efficiency of healthcare delivery. So that includes all the cost issues.
We talk about data - and I think that this is an underlying theme of many of the presentations that we have heard in this workshop - is that these data are layered.
We start out with source populations, basically, United States citizens, who have IRS data. They work. They don't work. They do save for retirement. They don't. They exist in specific geographic regions of the country, and on top of the source populations are diseased populations. I happen to work in cancer. Other people work in mental illness or psychiatric disease or malnutrition, all kinds of examples of what one would consider a diseased population or a population with a health concern of interest.
On top of that are providers. Typically, these are physicians, but other - nurses, other types of healthcare providers as well, and, on top of that, are healthcare delivery units, facilities, whether they are clinics, hospitals.
The issue with federal data is that federal data is better at the bottom of the pyramid. Federal data has a lot of information about source populations and some - For example, Dr. Breen is here from the National Cancer Institute. A lot of information about populations who get cancer. It tends to be a lot less rich as you go up to the top of that pyramid and have a lot less information about providers and facilities.
So when we try to link, very often, in my experience, what health services researchers are trying to link is the rich, rich government data at the bottom of the pyramid with more granular detailed data about providers and facilities at the top of the pyramid that often resides outside the public domain, if you will, and I think we have heard allusions - you have heard references to some of these data sources from the speakers yesterday. AHA data was mentioned, AMA data, and I'll give you some examples.
We talked about evaluating the quality of healthcare. Really, we are interested in health outcomes, and the main ones - main health outcome we get out of big federal databases are typically just very basic things like who lives, who dies and who gets particular diseases. So mortality and incidence, basically.
And the inputs that we want to go - that we want to relate to health outcomes are community attributes; person attributes; health risks and behaviors, which come from the big surveys; the structure of delivery systems; and the processes of care. Processes of care, I would probably put Medicare data in that bucket. Medicare data is what exactly are we doing to these people - all of us - that leads to these health outcomes.
As we think about linking data, I think it is helpful to think about what the frameworks are for putting data in different buckets, and, now, obviously, some data belong in multiple buckets. Medicare data also has mortality, which is an outcome, but I think that sort of as we conceptualize these linkage exercises, it is helpful to think about in what domain the data sets belong.
The other thing that I think is helpful to think about, and I always think about when I am contemplating any sort of linkage project is where it lies along the spectrum of pure population-based data.
So federal-agency data are best for big, broad population-based analyses. So a cancer example is I want to know something about everyone in New York State with lung cancer, and I can go to the registry and Census data, but, very often, increasingly, we are interested in quasi-population-based data, where we want everyone in New York State with lung cancer who is covered by a particular private insurance provider - Oxford Insurance Plan - and so I think it is really also important when we talk about linkages to be clear. Is this true, pure population-based data? Are we trying to link federal agency or state agency data with some external data source that resides elsewhere? We need some kind of nomenclature for where those boundaries are.
And, then, of course, there is non-population-based data. It may be that - You know, my research institution is always coming to me and saying, Well, you link these data and you work with these large population data sets. We want to know why cancer patients in New York State are not all coming to get their treatment from us, since we are the best center, and you have all these data. Why can't you do that?
And I say, guys, that is marketing. That is not health-services research. That is not an appropriate way to use these data, but explaining sort of analyses at the population, the quasi-population and the non-population data, we need some kind of taxonomy, and I think simply having that taxonomy or the government help develop standard taxonomy for these types of activities would be a really helpful place to start and would educate the end-user community.
Obviously, the health services research strategy - I mean, we just all want to get our hands on as much data as we possibly can as quickly as possible, and we want to juxtapose all these various data sources.
So the kinds of things that we work on, really, I would say the focus and theme of our research is to look at what we call the implementation gap, which is the difference between clinical efficacy and effectiveness.
So most healthcare - What works in healthcare is discovered and described in these very neatly, nice-packaged little clinical trials where we take 100 people and give them blue pills and 100 people and give them red pills, and we decide that the red pills are better.
What we get out of that is efficacy. These red pills work, but that really doesn't tell us anything about what happens when we unleash red pills on the population.
When we unleash red pills on the population, we are trying to measure effectiveness. That gap between efficacy and effectiveness, I and others call the Implementation Gap, and we are really trying to get at what the reasons are for those gaps and trying to identify important sources of variation and particularly those that we can do something about, and we want to know whether the reasons for the gap are endogenous to patients, doctors, healthcare systems, background population. So that is really the unifying theme of the research and why access to these data is so incredibly important.
We told you a little bit about this data source yesterday. Very simple example, a kind of chemotherapy given after an operation for a particular stage of colon cancer, and this is a big deal. Fifty-thousand Americans get this condition a year.
Do patients in the Medicare population, who are insured and for whom this kind of chemotherapy is covered, actually receive these treatments?
Well, we went to SEER-Medicare, and, very quickly, within a week, were able to answer the question.
Now, of course, it took a while to get the data. Once we had the data, the analyses took a week. So, again, I spend 90 percent of my time trying to get data, manage permissions and so on, and actually analyzing it is a lot less time consuming.
But we identified a very simple finding, which is that there is a very steep gradient, and that, although we treat most young Medicare beneficiaries with this kind of chemotherapy, we really don't treat the older folks.
Well, this very simple finding, made possible by linked data, really sparked a whole set of subsequent more detailed analyses to go back to physicians and patients and to conduct interview studies, to really hone in on what the reasons are that underlie this important healthcare-delivery pattern. Okay?
Now, one of the problems here is that the older patients were never included in the randomized trials, those little efficacy studies. So doctors don't know what to do. So there is uncertainty, and this is what actually happens.
Okay. So there is a real circularity where we do these population-based data analyses with linked data, and that, basically, catalyzes subsequent studies to really get at the underlying reasons.
So we have this nice linked-data set, but we really wanted to know - We said, Look, these people are not getting a kind of chemotherapy that they really ought to. What is going on? Are they just not being referred to medical oncologists who have the therapy? Are they refusing the therapy, even after they go to medical oncologists and are old people just saying, Thanks, but no thanks?
Well, to do that - Sorry. This is Census data that just shows that we also - Census data can be helpful because we see that if you are married, you are much more likely to get the right kind of treatment, chemotherapy, than if you are widowed or single, and these are all adjusted for age, and these are very, very strong findings. So having Census data can be very, very important, and we can figure out who is at risk for getting inappropriate medical care.
So why doesn't everybody get chemotherapy? Do people refuse? Do they see a medical oncologist? We could use the UPINs on CMS claims, and CMS claims have access to some specialty-code information about the types of doctors patients receive, but that data is not particularly complete, updated or accurate. Maybe Gerry can comment on it. There are other better data sources for figuring out provider characteristics.
So when we use just the specialty data, we could see that among the patients who got chemotherapy - that is represented by the green bar - most people saw an oncologist. The top 20 percent there in the chemo bar, those people got chemotherapy, but, apparently, did not see an oncologist. All those people saw internists. Well, that is just because CMS doesn't know the difference between an oncologist and an internist, because the data is not coded well, but we wanted to get our hands on better data sources.
The people who didn't get chemotherapy, most of them made those decisions without seeing an oncologist.
So we are able to basically do these kinds of analysis to say, We, in the healthcare system, have a healthcare delivery problem. Patients are not making informed decisions not to get a treatment. They are making uninformed decisions, because they are not even going to see the relevant providers.
People look at this data and it wasn't all that compelling because they say, You can't even figure out who is the medical oncologist. So, then, we wanted to get AMA data.
It took essentially 18 months to get the AMA data, which is much more complete, to be able to do the linkage to really prove that to a higher level of satisfaction. Very complicated to do.
Ultimately, successful, and, then, the green bar went all the way up to 99 percent, and the green bar in the no-chemo group went up to about 40 percent, and our conclusion was, essentially, that the mechanism was people were not appropriately being referred.
So a wish list from an end user would be linkage of UPINs on claims data to files that describe physician characteristics.
AMA data is better than CMS data. Data from the American Board of Internal Medicine, the American College of Surgeons, all the specialty societies, is still better than AMA data, and the untapped resource is state-level data. That is the most complete and the most difficult to obtain.
So I have a license to practice medicine in the State of New York. They know lots about me. They know whether I have ever committed a felony, been in jail, all the tests I have taken. They maintain it. I pay them $250 every two years to update that license.
So we don't have that data to link to federal. So with respect to health, it is really critical to know physicians, physicians' characteristics, distribution of physicians.
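[Editor's illustration, not part of the spoken remarks.] The kind of linkage on this wish list, joining claims to a physician file on the UPIN, can be sketched as follows. This is a rough sketch only: the field names (`upin`, `patient_id`, `specialty`), the specialty labels, and every record are hypothetical and do not reflect the actual layout of CMS claims, the AMA file, or any state licensure file.

```python
def link_claims_to_physicians(claims, physician_file):
    """Attach a specialty to each claim by matching on UPIN.

    Returns (linked, unmatched) so the analyst can see how many
    claims could not be matched to the physician file.
    """
    by_upin = {row["upin"]: row for row in physician_file}
    linked, unmatched = [], []
    for claim in claims:
        doc = by_upin.get(claim["upin"])
        if doc is None:
            unmatched.append(claim)
        else:
            linked.append({**claim, "specialty": doc["specialty"]})
    return linked, unmatched


def share_seeing_oncologist(linked):
    """Share of patients with at least one linked claim from a medical oncologist."""
    saw = {}
    for row in linked:
        is_onc = row["specialty"] == "medical oncology"
        saw[row["patient_id"]] = saw.get(row["patient_id"], False) or is_onc
    if not saw:
        return None
    return sum(saw.values()) / len(saw)


# Invented records for illustration only.
claims = [
    {"patient_id": "p1", "upin": "U1"},
    {"patient_id": "p1", "upin": "U2"},
    {"patient_id": "p2", "upin": "U1"},
    {"patient_id": "p3", "upin": "U9"},   # UPIN absent from the physician file
]
physician_file = [
    {"upin": "U1", "specialty": "internal medicine"},
    {"upin": "U2", "specialty": "medical oncology"},
]

linked, unmatched = link_claims_to_physicians(claims, physician_file)
oncologist_share = share_seeing_oncologist(linked)
```

Keeping the unmatched claims separate matters here, because a patient who saw an oncologist will look as if they did not when the physician file is incomplete or the specialty is miscoded.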
Next on the wish list is pharmacy claims. The analysis I showed you, we want to know were people not getting intravenous chemotherapy because they are getting oral chemotherapy? Are people not getting supportive medications? Are they sticking to their therapies? Are they getting appropriate pain control?
So the wish list would be Part D data. We heard about that yesterday. Medicaid data for pharmacy claims, and private claims data sets. There are enormous pharmacy-data clearinghouses that are very - that have not been widely linked, but are very important for health-services research.
So, again, the example I just gave you involved taking federal data set and trying to link it to external data sets that are not federal. So I think developing some kind of taxonomy or framework for what these linkages are - Are you linking federal to federal? Are you linking federal data to state data? Are you linking federal data to private data with a broad public relevance?
And I put AMA data, AHA data. Those are private data maintained by non-profit - Well, yes, AMA is a for-profit organization, but they are big organizations that have - really control, monopolies on important data sets that pertain to health that have broad relevance for many researchers in and outside of government.
And, then, there are custom data, where you have your own personal data set about the patients in a particular region or with a very specific set of disease that you have that you then want to link to.
And I think developing some kind of taxonomy to understand what the activities are would help frame development of coherent policies and rules that researchers could understand.
So an example of a study I am working on - and Gerry Riley has been extremely helpful to us here - is to look at capacity to deliver mammography in the United States.
Women in the United States, age 40 to 80, need mammograms. A lot of them are unscreened. There are big racial disparities, and the lack of available facilities, mammography screening centers, and lack of radiologists are potential explanations for suboptimal use.
So the question here is does lack of capacity explain geographic variation and racial disparities? And does capacity predict breast-cancer incidence and mortality?
Well, these kinds of analyses require geocoding. They require knowledge of where the facility is, and you can get that from FDA accreditation data. Where are the radiologists? Again, that is physician data. Where are women unscreened? BRFSS, Medicare data are informative there. And where are there high rates of breast cancer? SEER.
To do these kinds of analyses, we want data ideally at the Census tract level, but if we can't get it, we'll go less granular to the Zip Code or county level.
Trying to do a project like that and figure out where to start to go to obtain the permissions is extremely complex. Approval from one agency or many. So some kind of central clearinghouse and more clearly delineated set of procedures would help.
To get these kinds of projects done, we are really dependent on personal relationships with key individuals who sit in specific agencies.
For example, Gerry, in this case, helped us get access to the FDA accreditation data for mammography facilities, but people who don't know Gerry can't do this, and that doesn't really seem fair. Although, we are very happy we know him.
Finally, I want to talk a little bit about area versus person-level data.
Access to granular-level data helps most health-services researchers, and privacy and security concerns obviously involve less risk when you are talking about area as opposed to person-level data.
I think, again, in all these discussions, it is really clear it really would be helpful if we delineated between what we are talking about, and I think Howard was alluding to this before.
So, again, area-level data at the bottom, state, county, Zip Code, Census tract. On top of that, you often have anonymized patient data, which can be linked to unit area, and, on top of that, what is the most vulnerable is individual patient data. I think one thing to think about, in terms of linking federal-agency data sets, is to release them - again, I am not pro-public-use data - but to make them available to the research community with appropriate bars and hoops you've got to jump through. That is, make area-level data available without making individual patient-level data available, and have discussion among the agencies of what the hoops are to get state, county, Zip-Code, Census-tract level data, with higher bars the more granular you go.
So, again, this would really help us with very repetitive common tasks that we end up performing again and again when we try and make maps and figure out where are the patients, where are the providers, where are the disparities and where are the mortality rates.
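[Editor's illustration, not part of the spoken remarks.] The repetitive task described above, rolling person-level records up to whatever geographic unit is available, might look roughly like this. It is a sketch under assumptions: the record layout (`county`, `tract`, `treated`) and the four records are hypothetical, and no real data set is implied.

```python
from collections import defaultdict


def aggregate_by_area(records, level):
    """Collapse person-level records to one row per area.

    `level` names the geographic key to group on, for example
    "county" or "tract". Only counts and rates leave the function;
    no individual record is returned.
    """
    cells = defaultdict(lambda: {"n": 0, "treated": 0})
    for r in records:
        cell = cells[r[level]]
        cell["n"] += 1
        cell["treated"] += 1 if r["treated"] else 0
    return {
        area: {"n": c["n"], "rate": c["treated"] / c["n"]}
        for area, c in cells.items()
    }


# Invented person-level records for illustration only.
records = [
    {"county": "A", "tract": "A1", "treated": True},
    {"county": "A", "tract": "A1", "treated": False},
    {"county": "A", "tract": "A2", "treated": True},
    {"county": "B", "tract": "B1", "treated": False},
]

by_county = aggregate_by_area(records, "county")
by_tract = aggregate_by_area(records, "tract")
```

The same records produce smaller cells at the tract level than at the county level, which is one way to see why the bar for access might reasonably rise as the geography gets more granular.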
Wish list in that regard would be access to choropleth maps by various geographic units, very useful for common-data elements and Census and survey-data results. It could be a shared resource for investigators, and ArcGIS and other software packages that have really just become available in the last three or four years have catapulted the possibilities in the ease - how easy it is to do this light years ahead. I would say even in the last three years. Others might wish to disagree.
I think I am going to skip this and talk, finally, about Medicaid, and if I had to put something at the top of my wish list it would be for the federal agencies to help us figure out how to tap in more effectively to Medicaid.
I think the states just don't have the organizational capacity to get this going, but the federal government does, but the states really care. The largest component of their budget, talking about healthcare, I guess, at this workshop, Medicaid funds healthcare for the poorest, sickest members of society. It is really an untapped resource. I think CMS has done an enormous amount of work over the past decade trying to get the data into some common file structures and make it a little bit easier to work with, but it is still an untapped resource.
I think when we talk about Medicaid data, we want to distinguish between two things. One is enrollment. Who are poor people enrolled in Medicaid versus the administrative data which is what is done to those people, are they getting EKGs or chest X-rays or what units of healthcare are they actually consuming, and it may not be so complete for the latter, but very informative for the former, and I think we really have to not let the perfect be the enemy of the good when we talk about Medicaid data.
So this is just an example of a study where we tried to link cancer-registry data from the entire State of California to Medi-Cal, which is Medicaid claims for the State of California, because we wanted to know about delivery for cancer care to poor patients, and we were wondering what would the yield be from a linkage of California cancer registry data and Medi-Cal.
So what we did is we started, and you'll note on the left, with all incident cancer cases reported to the California Cancer Registry, which is very complete, and we took 98 cases, and if you look at the cervical-cancer example, there were 1,690 women diagnosed with cervical cancer in the State of California in that year. About 80 percent of them were 18 to 64 at diagnosis. We don't care about the older ones, because we have them in Medicare, right? So that is about 1,350 cases.
What proportion of them were enrolled in Medicaid? About 21 percent. If you look down at hepatoma, it is about 35 percent. So these are cancers that are associated with infectious disea