[This Transcript is Unedited]
Department of Health and Human Services
National Committee on Vital and Health Statistics
Working Group on Data Access and Use
November 14, 2012
National Center for Health Statistics
3311 Toledo Road
Hyattsville, MD 20782
CASET Associates, Ltd.
Fairfax, Virginia 22030
Table of Contents
- Introductions – Review Agenda
- Discussion of best practices and practical suggestions for release of “open” HHS data
- CDC Data Report
- Wrap Up for Future Meetings and Next Steps
P R O C E E D I N G S
DR. CARR: Thank you for being here. I would like to convene the working
group on HHS data access and use. What we do is we go around the table and say
who you are and where you’re from and whether you’re a member of the working
group. So I’m Justine Carr, Steward Health Care System, and chair of the
working group.
DR. SUAREZ: Good afternoon everyone. My name is Walter Suarez, I’m with
Kaiser Permanente and I’m a member of the working group.
DR. FRANCIS: I’m Leslie Francis. I’m at the University of Utah in Law and
Philosophy and I’m on NCVHS’s Privacy Confidentiality and Security subcommittee
and I’m a member of the working group.
DR. GIBBONS: I am Chris Gibbons. I’m from Johns Hopkins in Baltimore and I
am a part of the working group.
MR. CROWLEY: I’m Kenyon Crowley with the University of Maryland and a member
of the working group.
DR. ROSENTHAL: I’m Josh Rosenthal with RowdMap and a member of the working
group.
MS. QUEEN: I’m Susan Queen with ASPE, and staff to the working group.
MS. KANAAN: Susan Kanaan, writer for the committee and the working group.
DR. VAUGHAN: Leah Vaughan, member of the working group.
DR. MAYS: Vickie Mays, I’m a member of the full committee and I’m a groupie
of the working group.
DR. COHEN: I’m Bruce Cohen, a member of the working group, a member of the
full committee from the Massachusetts Department of Public Health and co-chair
of the Populations Health subcommittee.
MR. SCANLON: Jim Scanlon, I am with HHS Office of Planning and Evaluation
and I’m Executive Staff Director for the full committee.
MS. GREENBERG: I’m Marjorie Greenberg, I’m from the National Center for
Health Statistics, CDC, welcome to NCHS, and I’m Executive Secretary to the
committee and I’m also a groupie to the working group.
MS. JACKSON: Debbie Jackson, National Center for Health Statistics.
MS. SEEGER: Rachel Seeger, Office for Civil Rights.
MS. JOHN PAUL: Tammara John Paul, NCHS, CDC.
MR. SOONTHORNSIMA: Ob Soonthornsima, member of the NCVHS committee.
MS. BEBEE: Suzie Bebee, ASPE.
MS. JONES: Katherine Jones, CDC, NCHS, and committee staff.
DR. CARR: Great to see everyone. We have had the opportunity to integrate
the excellent input that we had at the first couple of meetings. And I think
that we’ll have an opportunity today to really have a working group with some
deliverables or suggestions for HHS at the end of today. So what I wanted to do
was just go through where we are and who we are.
As you can see we have here listed the membership of the committee, and as
you heard Bruce Cohen, co-chair of Populations, Leslie, co-chair of Privacy,
Walter co-chair of Standards, and Paul Tang couldn’t be with us today but he’s
co-chair of Quality.
And then what was interesting to me is just the richness of our other
members with great expertise in IT. Kenyon Crowley, Bill Davenhall, Josh, and
Peter Hudson, not here. Mo couldn’t be here, but they bring tremendous
expertise in IT.
And then our other expertise. Wonderful that Chris is here today in his
expertise in community health informatics. Patrick Remington has called in but
not today I guess, and Kevin just had a baby, and Leah with her experience in
And Chris, you weren’t here for the first two meetings but it’s been a
learning process for these disparate groups to come together and get to a
common ground, but I think we’ve made great progress. Our liaisons of course
are Susan Queen, Ed Sondik, Niall Brennan, Jim, and Marjorie.
So our initial charge is that we monitor and identify issues and
opportunities to make recommendations to HHS on improving data access and
innovative use, including content, technology, media, and audiences. Second,
advise HHS on promoting and facilitating communication to the public about HHS
data. And finally, facilitating HHS access to expert opinion and public input
regarding policies, procedures, infrastructure, to improve data use. So that’s
our big charge.
In the details I’ve just highlighted seven things that we are asked to do,
beginning with reviewing the current portfolio of HHS data, monitoring trends
in new information dissemination and social media, identifying and monitoring
the types of data and information needed by all participants, improving data
access, promoting and facilitating creative communication, facilitating HHS
access to expert opinion and public input, and advising HHS in understanding
and evaluating how HHS
data is being applied and the value it is generating.
So where we are on this ambitious agenda: we started with two webinars
reviewing some of the data. We’re up to half of the data that HHS has.
And I mentioned, Katherine, if we could put links on the Working Group
SharePoint site, it would be helpful to be able to go back and look at some of
these data that we’ve seen already, and we’ll be arranging additional webinars
to go over the rest. And actually at the end of our meeting today, 4:30 to
5:00, we have a CDC presentation on some of their data.
But we talked last time about the fact that we wanted to use the opportunity
of being face to face to really work and save the presentations for outside
this meeting time.
So when we last met we talked about thinking about the two sides of the
house, the supply side and the demand side. And so when we think about the
supply side what we want to do is take some of the data that we’ve seen, and as
said, identify and study areas of opportunity to improve data access and
application. And Mo couldn’t be here today, but we talked about let’s come away
with five tangible recommendations or next steps coming out of today that Jim
can take back.
So I asked Josh to take us through, from the supply side, what does the data
look like, what’s available out there, what can we do to make it better. We’ll
look at some of the front-end platforms, and then we’d also like to go through
some of the issues that Josh raised – taxonomy, hierarchy, and so on – things
that we who use the data every day take for granted but that are not so clear
to a developer who looks at this terminology for the first time.
So Josh is going to take us through that and then actually building on what
Bill Davenhall also had recommended, that we identify candidate data sets to
sort of pilot. But I think actually that’s a bit of what we’re going to do
today. And then Susan, we also want to talk about the special considerations
related to survey data availability and usability.
Once we kind of get through this side we’ll then begin to think about the
demand side. And as we said in the concluding comments at the last meeting what
do individuals and communities need to make better decisions and improve
health.
So let me stop there and open it up for any comments or suggestions,
additions, directions. Are we good with that, Jim?
MR. SCANLON: A little more framing, Justine. And remember now, HHS has
really worked with several audiences over the years: the public health
community, the research community, the health care community, and that’s really
where our efforts have been focused.
So it’s almost a business to business kind of an arrangement. You need a
fair amount of analytic capacity to take the data, use the data, interpret the
data, and apply it at those levels.
With this initiative what we’re trying to do is – all of you have heard
this before – liberate, democratize, all of the data that we have and
probably other places, to make it available to a new set of audiences, and
using more customer friendly technology, and certainly the latest technology.
So we put together the group – all of you were hand-picked by the way
– but we put together because you know sort of both sides of the equation.
What we’ve been doing at HHS, for example, goes beyond the usual outlets and
ways of disseminating data. The policy now is to put our data holdings into
HealthData.Gov – as do EPA and the other health agencies across the federal
government – and to put it there, hopefully, in a format that developers,
applications specialists, and others could take. And really, without us having
to intervene, they could just go from there and use the data properly.
And what we’re really asking you to do is help us think of additional ways
of getting data to HealthData.Gov, but really besides HealthData.Gov, there are
a number of other publicly available platforms – we’re just beginning to
learn about them in HHS – where we can make the data available, where you don’t
necessarily have to know how to program SAS or SPSS or all the others.
So again, these are different audiences. We’re not trying to dumb down the
research side or the public health side or anything like that, those
dissemination outlets will continue in their full glory. But what we are
looking for is applications for communities, consumers, patient groups,
community coalitions, so they can take data and really use it to improve health
and health care sort of in their own settings.
And you’re sort of a unique group because we’ve kind of put together both
sides of the equation, and the first couple meetings I think we’ve had a
translation, everyone getting on the same page in terms of technology and data.
But we’re actually moving along very nicely and we’re already beginning to get
some very good ideas and we’re really asking you to give us additional thoughts
and help us think this through.
We’ve already looked through the portfolio of data from NCHS, as well as
SAMHSA, our mental health substance abuse agency, and from the Agency for
Healthcare Research and Quality, which many of you are aware of. So we’re about
halfway through.
I think at the end of the day, later in the meeting we’re going to have CDC
talk about its portfolio as well. And then we have a couple more agencies that
I think we’d like to brief you on. So then you’ll have an idea what it is we
have. CMS was one of the first.
And these are all different kinds of data. Some of them are survey data
where identity has to be protected, research data, surveillance data. But in
CMS’s case it’s claims data, administrative data, literally from all over the
country, probably the most complete set of local data that we have. And clearly
that data is being used for Hospital Compare, to compare hospitals to other
places as well.
But again I think everyone would agree that we are not experts in HHS in the
many current and evolving technologies and platforms and really ways of getting
the data out, and that’s what we’re looking to you for.
DR. CARR: Okay. We have a couple of people who just joined the table, if you
could introduce yourselves. And then Josh, you want to come up and start your
presentation?
MR. QUINN: Matt Quinn from NIST, staff to the committee.
DR. GREEN: Larry Green, member of the NCVHS committee.
DR. ROSENTHAL: Good afternoon. Hi. So three or four days ago I spoke at an
industry thing with Brian Seebach and Niall Brennan and some other folks who
were on my panel, and we came up with the ultimate device for displaying data.
And you’re going to see a Hopkins professor from the MPH program and a bunch
of other folks at BCDSCO trying to figure out exactly how to use this, but we
put our slides and our analytics on this and it went over like gangbusters. You
guys remember the old View-Masters you used when you were kids? So that’s what
they’re doing, the Hopkins guys trying to figure it out.
But that’s not what we’re going to be talking about today, although that is
funny, that tickles me ever so greatly. At the last HDI thing we had huge foam
cowboy hats and you can see all the HHS guys running around in those as well,
some industry folks.
So I’m just going to take 15 minutes and be done, in and out very quickly,
but I just want to set up where I think we’ve been and where we’re going and
what we’re doing. So this is our third meeting. The first meeting we got our
charge from everyone including Todd, and he was very fiery and inspiring. And
then out of that we all sort of left, or at least I left trying to figure out
what was going on. And we talked a lot of meta talk about what we could do.
And so, in sort of true startup fashion, I just slapped a bunch of stuff
together. So I put together seven immediate things that I
thought would be pretty good in terms of recommendations. They’re probably
wrong, they’re probably backwards, but at least these were seven very specific
things. If you walked out of here with nothing else at least we’ve done that
part of the charge.
And then there are various specifics talking about what taxonomy is, network
through file, showing how to go about doing that, learning centers and why
those are important. Business value, baking that into the challenges. Semi and
synthetic data sets, we have an IT person talking about that. Data browsers,
this means the dissemination. Partnerships and products – and I kind of mocked
Todd there as well. Then a green button with an opt-in, kind of like the Blue
Button.
So that’s what we did, and then Mo and company took it off and said let’s
actually try to make this reasonable. But this was the first cut as far as
that. And so then what happened was over the past few weeks we took a tour. So
those were just initial points. And a few of those were around taxonomy, and
engaging other people and business contacts as part of the questions, so we
weren’t just doing cool fluffy stuff. And then also disseminating the data to
different sorts of audiences.
And as you remember my background is also in kind of public health data from
doing the Dartmouth algorithms and whatever, so I have a firm foot on that side
of the house, but also on the startup side with Harvard and MIT and Hopkins,
some of the other places doing some of that sort of stuff.
And so we walked through these tours of the data that the government
agencies have, and we’d talk about data.gov, and basically there we kind of
dumped files and we dumped datasets and maybe we structure it and maybe there’s
some metadata. And then we took tours of two different things: either a big
file or database, like a claims identifiable database that has restricted
usage, or kind of a small file that is all over the place, where it’s really
tough to figure out what the pieces in it are, or these kinds of secure
enclaves.
Those are supposedly user facing because you can create your own table, but to
be honest, you really have to know what you’re doing to go in there, and most
of those were restricted access anyway.
So if the question is how do we engage other sorts of people to use this
data? I suggested that there might be a couple things we might want to think
about and that’s what we’re going to talk about for the next ten minutes or so.
So a couple things. Does that sound like a fair, accurate summarization?
So I’m going to walk you through some real life resources as best I can, but
I’m also going to frame it up. And this file will be available for you as well.
And again this is just kind of first draft, putting it out, I’ll send that on
to you guys. And this is talking about data browsers and really building
community exploration and creation of intelligence and market value around
information, not data. And so I’ll walk you through the examples.
But if we’re going to frame up the choice, what can a user – and I’m
using that very broadly – do with government data? Users can do a couple
different things. If you’re very sophisticated and you have specialized
expertise and you have access, you can go in and pull the files down from .gov
and figure out what they mean, or maybe you’re granted restricted access into
an enclave when those fire up and you can play around and create your tables,
and that’s one thing. The number of users who do that, from what I could find
poking around, is no more than thousands at the very upside.
The other thing is you can build apps, and that’s what data liberation and
datapalooza has been about up until this point. We put the data up there and
then we allow the app builders to come and do their thing. And theoretically
they create market value, just like what happened when we put weather data up
and when we put geolocation data up. That really hasn’t happened, as people
will kind of know from looking at CrunchBase and the internal databases that
HHS uses to evaluate it; in terms of creating market value, that hasn’t really
materialized.
Part of that is because the apps are mostly built by tech people who really
don’t understand health care, partially it’s because it’s DTC, which
historically hasn’t worked very well, and partially because it’s a high bar: I
can program pretty well, but I’d still have to have a decent amount of money,
go into the app store, and maybe build a team. I mean, it’s not easy to do.
So I always point to iTriage, and that has about seven million dollars of
venture funding behind it, and that’s one of the few successes around that. So
that’s a pretty high bar.
So when Todd talks about putting the data out there and letting the
communities build the interfaces, that’s still a pretty select group of people
who are going to do that: a lot of healthcare folks don’t have the technical
skills, and the tech folks don’t have the skills to understand the healthcare
data and the nuance of what they’re looking at.
There’s a whole informational educational conversation we can have, and
that’s kind of the last session we did at HDI where we did that sort of
information for folks. But that’s very difficult to do. The third option is
what you can call a data browser, and I’ll talk about that, or kind of an
interactive front end. And in this scenario the data is already defined, all
the meta data is done for you, the taxonomy is done for you. These are wildly
popular in the tech world.
In fact one of the more recent contests, and one of the bigger ones, was on
comorbidities of diabetes, actually using some of your data, and it was wild
and no one really knew about it. And the beauty of that is you play with the
information rather than the data. So you don’t need to do anything except drag
and drop things. So we’re going to talk about that.
Just to give you a sense of scale, one of the things I’ve been trying to
figure out is who uses this data and how many people actually use it. And we’ve
tried to get some internal stats from CMS and other folks and that’s sort of
been processed and what have you. So I am going to roughly, roughly 80/20 proxy
it. When I go to Alexa, and this is a good place to go if you want to figure
out how popular a site is, just FYI, I know you guys know this but it’s a good
place to go.
So this is up here, at Alexa.com, and this whole slide will be available to
browse at your leisure. So what you see is this is where you type
in a site, you basically see how much traffic it has, what its rank is, how
many people are linking to it, and you can look at different metrics. Data.gov
has about 3000 sites linking into it, and that’s a pretty good metric. That
tends to be better than hits, but you can use your own stuff.
We’re going to look at Google Public, and most estimates of Google Public
data, they only say about one to five percent of the people using Google are
using Google Public data. But that’s still a lot of people. There are five
million sites linking into it so you have millions and millions of page views
every day, every hour.
MS. QUEEN: May I ask another question? What do you mean by sites linking in?
MR. ROSENTHAL: So if I link to something, if I have my little blog or CHIDS
puts out a little post on the economic forum you’re talking about, they link to
Data.Gov. They put up on their website or their blog or a tweet or anything on
the web a link that says hey, there’s this cool thing called Data.gov, go here.
Or it’s anything specific, anything that belongs to that domain. It may be
check out the hospital compare site inside. So this is just ballparking it.
What I’m trying to say is Data.gov has a couple thousand sites linking to
it. Google has millions and millions, and only a fraction of those are using
this Google Public thing I’m going to talk about, but my point is that’s still
a lot of people, that dwarfs the amount of people. I’m just trying to get a
sense of scale of what we’re talking about.
And then there’s these big tech sites, just to give you a sense, when they
have these data explorer contests like ReadWriteWeb, they have about 40,000
sites linking in when they do that. And so you’re talking more than ten times
as many as Data.Gov, and that’s just one instance. I’m just trying to get a
sense of scale of when we talk about users we try to get some internal metrics
and that’s still really good to do and to hunt down but I’m just using some
external metrics, back of the envelope, right?
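The scale comparison in this exchange can be sketched as quick arithmetic; the link counts below are the rough figures quoted here, not authoritative measurements.

```python
# Rough inbound-link counts quoted in the discussion (not authoritative).
inbound_links = {
    "Data.gov": 3_000,
    "ReadWriteWeb contest": 40_000,
    "Google Public Data": 5_000_000,
}

def times_more(a: str, b: str) -> float:
    """How many times more inbound links site a has than site b."""
    return inbound_links[a] / inbound_links[b]

# One contest alone draws more than ten times the links of Data.gov,
# and Google Public Data dwarfs both.
print(times_more("ReadWriteWeb contest", "Data.gov"))  # more than 10x
print(times_more("Google Public Data", "Data.gov"))    # more than 1,000x
```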
This is kind of like the Google Consumer Surveys example I was talking about:
you can run a consumer survey using their algorithm for a couple hundred bucks,
instantly. It won’t give you everything you’d want to find out, but I’m just
trying to give a sense of scale.
And so Google Public Data Explorer, let me hop out of here and show you.
Actually let me show you the slides before we hop into the live environment.
Basically you just search for this thing and you go in and it has these dynamic
graphs, this is like the TED talk where you click the play button and it does
all the stuff for you, it animates it.
DR. CARR: I am going to take Susan’s lead and just ask like the dumb
questions to just bring everyone along. This is the browser. So you went to
public data, and what’s the data that’s populating this?
DR. ROSENTHAL: So there’s tons of public data and you can search by data
types. The World Bank, almost every government institution, has had their data
taken out and sucked into this thing. Private data, public data, individual
data sets, the World Health Organization; a lot of it is international data or
it’s US data that’s not health. There’s a ton of census data in here, there’s a
ton of economic data.
DR. CARR: This is a site you go to and you can pull in the data set that you
want, or do they give you here the things you can choose from?
DR. ROSENTHAL: They do this all for you. So if I were to go to this probably
the best way to show you is just give me 30 seconds to show you.
DR. COHEN: You said they don’t have any health data?
DR. ROSENTHAL: Very little. They have some, I’ll show you what they have.
But not nearly as much as what you could have. So I go here, I go to Google,
I’m going to show you live how to do this.
MR. QUINN: How does Google populate this? Do they have a team of people
there that search for this data? Or is it done by computers?
DR. ROSENTHAL: Both. So they have some automated things which Google tends
to do, and they scrape a bunch of stuff, and then they have a team of people,
and then they have users contributing this. Say if I’m at the World Health
Organization, there’s someone at the World Health Organization who puts the
thing in a CSV file and dumps the stuff. But they have to do three things.
Typically they have to give you the data and they have to define it in this
taxonomy.
Taxonomy is a fancy way of saying table of contents, and I’ll show you what
that looks like. And the beautiful part is it’s not just a data set. Once you
put your data up there, I can pull in economic data from World Health, I can
pull in US census stuff. Right now, if I want to mix and mash health data, I as
a developer have to take it, go find census data, do some Google Survey data,
and scrape it together with other stuff. Google does it all for you, as do all
these public data browsers. And so once you put your data up there, anyone can
mix and match it with anything else that’s in there, and the stuff that comes
out of that is crazy.
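The “mix and match” idea described here can be sketched as a join on shared taxonomy dimensions; the data sets, figures, and (country, year) keys below are invented for illustration, and a real browser does this behind the scenes.

```python
# Hypothetical illustration: two data sets that share taxonomy dimensions
# (here country and year) can be combined automatically. All figures invented.
health = {   # a WHO-style contribution
    ("BR", 2010): {"fertility_rate": 1.8},
    ("US", 2010): {"fertility_rate": 1.9},
}
economic = { # a World Bank-style contribution
    ("BR", 2010): {"gdp_per_capita": 11_000},
    ("US", 2010): {"gdp_per_capita": 48_000},
}

def mix(*datasets):
    """Join data sets on their shared (country, year) dimension keys."""
    merged = {}
    for ds in datasets:
        for key, measures in ds.items():
            merged.setdefault(key, {}).update(measures)
    return merged

view = mix(health, economic)
# view[("US", 2010)] now carries both measures side by side.
```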
MR. QUINN: How can you make sure that Google finds it or Google gets it?
DR. ROSENTHAL: You’d want to contribute to it. World Health Organization
contributes a lot of stuff to people because they want people to use their data
and they think it’s good to do.
DR. CARR: Just to bring it back to us, if with the HHS stuff some of it is
there but there’s more that could be there, then we could initiate and push
that to that site.
DR. ROSENTHAL: Oh, yes, they would love it.
MS. QUEEN: What was the third thing? You said give the data, define the
taxonomy, and what else?
DR. ROSENTHAL: That’s about it. You give them the data, you have to define
it in the taxonomy and you kind of have to label it. So they basically ask you
to tell people what the data means. So when you say morbidity – here, I’ll
show you. So I log on here, and here’s these different graphs that people have
put together. And I’m going to look at this one. I’m going to hit explore the
data. And on this computer it will take a long time.
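The three-part contribution described here, the data, its place in the taxonomy, and labels saying what it means, might look like this in outline. Google’s actual submission format for Public Data Explorer is the XML-based Dataset Publishing Language (DSPL), so the field names below are only illustrative.

```python
# Illustrative only: a contribution bundles the data, its taxonomy
# placement, and plain-language labels. Field names are hypothetical.
contribution = {
    "data": [  # 1. the data itself, CSV-like rows
        {"country": "US", "year": 2010, "morbidity_rate": 0.012},
    ],
    "taxonomy": {  # 2. where each column sits in the table of contents
        "country": "dimension/geography",
        "year": "dimension/time",
        "morbidity_rate": "measure/health",
    },
    "labels": {  # 3. telling people what the data means
        "morbidity_rate": "Cases per person per year, all ages",
    },
}

def complete(c: dict) -> bool:
    """A contribution is usable only if all three parts are present."""
    return all(c.get(part) for part in ("data", "taxonomy", "labels"))
```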
DR. CARR: What you are looking at, it looks like it says fertility rates, but
that happens to be one data set that you pulled up.
DR. ROSENTHAL: Yes. So here’s all the data. So here’s the data that’s going
into this particular view. So this is world development indicators and I can
search these data sets in different views and mix them and match them. This is
going to take a long time to pull up.
DR. FRANCIS: Is there any way to tag the data so you know the data sources?
DR. ROSENTHAL: Yes. It’s funny you mentioned that.
DR. FRANCIS: So you can tell interrelationships between sources?
DR. ROSENTHAL: Yes. That’s exactly what this is over here. So these
indicators, that shows the data set contributor. That’s different than the
person who put the view together. The person who interprets the data and
basically says you know what I think – I’ll flash over to this one here if
the computer fires up. So she won this little contest and she put together this
thing. And so what she looked at was diabetes and these things she thought were
the greatest comorbidities. And so she pulled essentially from Google Data and
other places and tried to figure out what she thought had the greatest
correlation, and now she has her name on it as well. And so she kind of became
an expert and people tend to look at her and et cetera et cetera.
So there are two levels of transparency. There’s one: where’s the data
coming from, so show me just the stuff from World Health Organization, or from
US Census or another source. Or show me by type what am I interested in, or
show me by the person who actually pulled the stuff into view.
This is Google doing all the stuff for you. The next layer on this is
individuals using it and creating their own stuff. And here I can search by
country. So here I’m looking at fertility rate and life expectancy, and I want
to subset that by different lending types. I might want to pull in
what am I interested in. I’m interested in CO2 emissions to see if that has
anything to do with fertility rate. And so on and so forth. I can pull any
piece of data in the system.
DR. CARR: Can you show us?
DR. ROSENTHAL: Yes. This is going to fry this thing but let’s see.
DR. GIBBONS: What level of data is this? Is this county, country, what level
is it?
DR. ROSENTHAL: Yes. This is the beautiful part about this. When you have a
taxonomy (and this is far more helpful, I think, actually showing people than
me just rattling on like last time) you’re sort of defining your data, and one
of its attributes is geographic region or unit. And so you have a separate part
of the taxonomy, the table of contents, that talks about that. And so you link
up country to region to HRR to whatever you want to look at, CMS contract,
country, et cetera, et cetera.
All of that is built in as a dimensional taxonomy. And so the answer is
different data sets have different grains, and a view only shows you down to
the minimum allowed grain, so you can’t go below what you shouldn’t see. So
before, when we talked about how do we do grains, do we want to do zip code,
this is how they do it in the non-healthcare world, and it works pretty nicely.
If you want to submit something you can say, I want to submit that, but I
don’t want to do zip plus four, I don’t even want to do zip, I don’t even want
to do county; maybe let’s do CMS contract, maybe we’ll do HRR. You can do that,
and it works like a charm. Once you figure out that master table of contents
then it’s very snazzy.
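The geographic part of the dimensional taxonomy described here can be modeled as an ordered list of grains. The ordering below is an assumption for illustration (HRR, the Dartmouth hospital referral region, is placed between county and state), not the actual scheme any platform uses.

```python
# Assumed ordering, finest to coarsest; HRR's position is approximate.
GEO_LEVELS = ["zip+4", "zip", "county", "HRR", "state", "country"]

def can_display(submitted: str, requested: str) -> bool:
    """A view may roll up to the submitted grain or anything coarser,
    but never drill down below what the contributor submitted."""
    return GEO_LEVELS.index(requested) >= GEO_LEVELS.index(submitted)

# Submitting at HRR level: country roll-ups are fine, zip drill-downs are not.
assert can_display("HRR", "country")
assert not can_display("HRR", "zip")
```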
DR. COHEN: You don’t hide the data from the public, you just submit the
level you are comfortable with sharing?
DR. ROSENTHAL: Yes. That is absolutely well said. And so what you’ll find is
that this may be a public set. If you wanted to be really snazzy you could do a
synthetic set on it. If there’s something that you think is very meaningful,
that you want people to wrestle with even if they don’t understand the
concepts, but you’re scared of sensitivity or the mosaic effect, you can use a
synthetic set, or you can just use the public sets you already have.
But the beautiful part about this is it doesn’t require me downloading an
Excel file and dabbling around in SAS or whatever I’m going to do. It doesn’t
require me going on and even trying to figure out how to generate tables on my
own. And there’s this kind of compounding learning. When I put it up there the
next thing I can pull in any of these other sources. And you can see by gender,
by urban, by country, I can play the thing. I think what I’d like to show you,
let me just take you on a little tour and then we can ask any questions because
I want to put all the stuff in your head before we take specific questions.
So here we can look at metrics. Let me look at, say, health. So I have my data
sets I’m saving here; let me say I want to look at world development
indicators, maybe I don’t want to look at that, maybe I want to look at other
data sets.
MS. QUEEN: I just wanted to add something to what Bruce was saying. It would
be my assumption that something like BLS or Census that have made their data
available, they already have determined the lowest level.
DR. FRANCIS: Actually, that goes to my question. If I submitted at this level
because I don’t want it going any lower, couldn’t somebody mix it with another
data set and get it lower?
DR. ROSENTHAL: They are going to typically enforce that maximum. So Google
is pretty good at security stuff. I know the government is good too but Google
is pretty good. And one of the ways they typically handle this is that maximum
grain. You can basically set that in as a rule, as a business engine rule, and
say anyone who uses my data set in combination with anything else, the minimum
geographic setting is either the maximum of the data I submitted or a separate
setting. So here I just searched and I found all these different things, out of
pocket expenditure. And when we look at, to your point, the data provider, who
do I want to look at? World Economic Forum, Human Development, et cetera.
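The enforcement rule just described, where a combined view can go no finer than the coarsest contributed grain, or an explicitly set floor, might be sketched like this. The level names and their ordering are assumptions for illustration.

```python
# Assumed ordering, finest to coarsest.
GRAINS = ["zip", "county", "HRR", "state", "country"]

def combined_floor(submitted_grains, explicit_floor=None):
    """Coarsest grain among the mixed data sets, or the explicit
    floor if that is coarser still."""
    idx = max(GRAINS.index(g) for g in submitted_grains)
    if explicit_floor is not None:
        idx = max(idx, GRAINS.index(explicit_floor))
    return GRAINS[idx]

# Mixing county-level data with state-level data yields a state-level view.
assert combined_floor(["county", "state"]) == "state"
# An explicit floor can coarsen the result further.
assert combined_floor(["county", "county"], explicit_floor="HRR") == "HRR"
```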
And here’s what they have on health which looks like a lot, but I’m looking
right here and this is global obviously and you can click on that. Condom
usage, contraceptive prevalence, depth of hunger, diarrhea treatment, HIV, et
cetera. The beautiful part is you can mix it with the other stuff in there:
economics, lending, social media, et cetera, any of that sort of stuff.
And what would be really interesting is if I’m a researcher and I have a
question, hey is condom usage and Twitter usage related on Friday night, I can
get an answer to that question pretty quickly without having to write a grant
and do that sort of stuff. That’s probably not the best example but it gives
you a sense of the sort of things you can do.
So that’s Google. Google sort of does that stuff for you and you have to
work with one of their reps, but they love the Government.
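The “maximum grain” rule described above, where a combined view can never show geography finer than the coarsest floor any contributing dataset declares, can be sketched in a few lines. This is an illustrative Python sketch, not any platform’s actual API; all names and levels here are assumptions.

```python
# Hypothetical sketch of the "minimum geographic setting" rule: when datasets
# are combined, the coarsest floor declared by any contributor becomes the
# finest level the combined view may display.

# Geographic levels ordered from coarsest to finest.
GRAIN_ORDER = ["nation", "state", "county", "tract", "block"]

def coarsest(grain_a: str, grain_b: str) -> str:
    """Return the coarser of two geographic grains."""
    return min(grain_a, grain_b, key=GRAIN_ORDER.index)

def combined_grain(datasets: list) -> str:
    """The finest grain a combined view may display: the coarsest
    floor declared by any contributing dataset."""
    floor = GRAIN_ORDER[-1]  # start at the finest level
    for ds in datasets:
        floor = coarsest(floor, ds["grain_floor"])
    return floor

obesity = {"name": "obesity_rates", "grain_floor": "county"}
income = {"name": "median_income", "grain_floor": "state"}

# Mixing a county-floor set with a state-floor set yields a state-level view.
print(combined_grain([obesity, income]))  # state
```

Under this rule, nobody mixing your data with another set can drill below the floor you set when you submitted it.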
DR. GIBBONS: Can I ask a question? So are we assuming that data integrity is
all very good?
DR. ROSENTHAL: They do a pretty good job, and usually they’ll rely on you if
you think you can do better. If they don’t like it, they won’t post it, and
they’re pretty quick about taking stuff down if you don’t like it.
DR. CARR: Is there a checklist that they have that assesses the data?
DR. ROSENTHAL: They used to – yes, you would probably have to talk to
them right now. There are folks that run that sort of thing. And historically
there are kind of forms or templates you submit. It’s not just a manual
checklist, it rejects it if it doesn’t do certain things. And then they run
things. Not to get too much into the weeds but maybe the best example for the
NCVHS people is to think about Google Consumer Surveys.
If you go under Google Consumer Surveys you can ask anyone anything and
basically you pay a couple hundred bucks or something stupid like that and at
the end of the day they enforce all of your survey stuff using Google’s
pre-existing algorithms of specificity. So they’re leveraging Google’s
intelligence on top of a survey platform. And that sounds like those things
don’t go together but they’re really good about doing that sort of stuff. And
they sort of do the same stuff with the data as well.
Google is really good about data integrity, they’re much better than
probably anyone you’ve ever met, so when they pull your stuff up there it’s
going to be in the best shape it can be. But at least historically as I said if
you don’t like it they do it in partnership and they take it down. But this is
a private entity doing this stuff for their own users, and they’re creating
these views and this platform. That’s very different from the next level which
is folks like these.
Here’s ReadWriteWeb, and this isn’t anything special but it’s just one of a
dozen highly read technology blogs that talk about data, and healthcare is on
there pretty prominently. And so they’re not health 2.0, they’re kind of
bigger web 2.0. And what they said is they’ll give someone a free pass to
their version of ATI and $500 to get some tacos and beer if they win this challenge.
It wasn’t an application challenge where someone had to develop iTriage or
something like that. It was just we’re going to put some data up there, we’re
going to label it and put it in a meaningful table of contents i.e. taxonomy,
and then we’re going to allow people to play with other types of data and see
what insights there are. The prize isn’t can you build a cool app that does
something that people may not use, the prize isn’t can you take data at a
granular level and come up with a better algorithm, the prize is can you come
up with some interesting insights and literally do it in a way that lets other
people explore your work and build on it, disagree with it, et cetera.
And so they put up a little form and they partnered with these other guys,
not Google but Tableau – although these were the guys at Stanford who built
this company after they taught Larry and Sergey how to do the search stuff, so
they’re pretty good. And they basically step one, two, three, four, you
submit, and then the person who won – so you can use any
three, four, you submit, and then the person who won – so you can use any
of this data that we looked at – so this got a lot of traffic, and what
won out of everything, and this was like baseball stats, anything you can think
of, crazier stuff than I would dare mention in this company.
And what won was an interpretation meaning a view or a pulling together of
data that allows you to explore it of US obesity and comorbidities. And she
won, and she was a student at DePaul I believe. And there she is and she talks
about it, and her tagline was supersized caskets which is pretty snazzy.
She’s interesting because she has no technological skills, she couldn’t
develop iTriage, nor is she really an MPH, nor could she go through and say hey,
this is unwarranted variation in this HRR and we think that’s a comorbidity. And she
put together this little thing. And so this is an explorer and she pulled the
data and she said here are the things I want to look at.
And so here is the US and color density indicates rate, and so the beautiful
part is this little thing went viral, went all the way around the web. I can
take her view and I put it on my blog, I put it on my Facebook account, I put
her interpretation of analysis in this public data everywhere. They don’t have
to go to Data.Gov, it floats all around.
DR. CARR: What does this tell us?
DR. ROSENTHAL: She tells us the things she thought were most correlated with
obesity and diabetes. And so these are the factors that she likes.
MS. QUEEN: How did she determine which factors were most related?
DR. ROSENTHAL: She went through these and she probably did a regression
analysis one by one. So she dragged, just like in Google I showed you the
little dot bubbles and did the regression analysis with the flick of a wrist.
And so the academics will obviously say I disagree with this or I do this, but
here’s the interesting thing, here’s an 18-year-old kid talking about
comorbidities in diabetes.
And actually if you look at what she did she actually had some fairly
high-falutin folks, some of whom are in government, judge her work as part of
the criteria, which is interesting. So she talks about fast food restaurants –
analysis that would typically take a five-year study to do. I mean I know we
have the food atlas and food deserts and we’ve played around with that.
So she pulled from that set and flipped it up there real quick. Income,
mileage to a store, a fast food restaurant, poverty rate, convenience stores
with gas she was using as a proxy for fast food, low income receiving SNAP,
price of low-fat milk. So she was looking at stuff that you might not look at
typically and she’s able to do it.
DR. COHEN: She used existing public data sets at the state level probably?
DR. ROSENTHAL: I think it goes down to county I believe. Yes, county. And in
fact there’s low obesity, price of sweetened drinks. So whoever asked the
question, and I’m assuming, and this is typically how they do this, so she
looked at probably 50 things and she said with low obesity price of sweetened
drinks, if it’s a lot of money for a sweetened drink, consumption of fruit and
vegetables – and she’s doing this from various sources – adults
meeting activity guidelines, median household income, full service
restaurants, et cetera et cetera.
And the point is not to look at the specific example but just to look at
someone doing this sort of stuff. And the beautiful part is she’s using
different sort of data. She doesn’t have to code, she doesn’t have to program,
she doesn’t have to build an app. And she’s just interpreting information, some
of which is public and is out there and some of which should be out there.
And then various folks from kind of the New York Times, they have their new
departments who do just this sort of stuff. So it’s not just on the fringes of
the tech community, even in the mainstream media community this is big. So if I
go into the gallery I can look at different things. If I look at health you can
browse this on your own. Geography of diabetes, contributors to obesity, health
care cost, tooth decay, and I can go on. I encourage you to play with this on your own.
So when we’re talking about making our data available and thinking that
someone has to build iTriage there’s this whole other movement going on. And so
very quickly I can filter. Here’s what these guys did. This is Annette Griner
and she does very good work actually.
So if I were going to hire a consultant or if I wanted to identify who were
good at doing something I would look. It’s just very much like doing a piece of
code in the tech world. I can see how good they are and see what people think
about them and review them. And let’s filter it by poverty rate and obesity
rate and ethnicity and see what happens.
MR. QUINN: Can you identify the underlying data sets?
DR. ROSENTHAL: Yes. So with Google you can and with these guys you have to
download the client install rather than just the web-based one, and so you can
absolutely do that. So what I tried to show you was two different approaches to
this. Not to say that you have to do it one way, I tried to show you the two
ends of the spectrum. Other people do this too. You can think of it as YouTube
or Kickstarter for data and analysis. So Google takes an approach where they
say everything has to fit in my table of contents if you give it to me.
And the benefit of that is they don’t create and define the views, they
allow you to pull anything else you want to at the drop of a hat. Tableau takes
a different approach which is ironic considering their shared Google heritage.
They basically say it’s scattered. The data that you give us is a freestanding
little unit with referential integrity only between the sets that you submit as
And so if I were to look at this geography of diabetes versus tooth decay or
versus healthcare cost I can’t link back and forth between these sets. And so
that means that the individual who sets it up has more flexibility in terms of
the metadata structure in what they want to do. The downside means someone else
can’t come along and talk about baseball stats, do they have a correlation,
which believe it or not there is some interesting stuff in there.
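The trade-off between these two models can be sketched abstractly. Neither function reflects either platform’s real design; this is only an illustration of shared-taxonomy linking versus freestanding submissions, with invented names throughout.

```python
# Model A ("Google-style"): every dataset maps its dimensions into one shared
# taxonomy, so any two datasets can link wherever they share a dimension.
SHARED_TAXONOMY = {"geography", "year", "topic"}

def joinable_shared(dims_a: set, dims_b: set) -> set:
    """Dimensions two datasets can be linked on under a shared taxonomy."""
    return dims_a & dims_b & SHARED_TAXONOMY

# Model B ("Tableau-style"): each submission is freestanding; referential
# integrity holds only among the sets submitted together.
def joinable_freestanding(submission_a: str, submission_b: str,
                          dims_a: set, dims_b: set) -> set:
    return dims_a & dims_b if submission_a == submission_b else set()

diabetes = {"geography", "year", "prevalence"}
tooth_decay = {"geography", "year", "decay_rate"}

# Shared taxonomy: the two sets link on geography and year, regardless of
# who submitted them. Freestanding: separate submissions never link, but the
# submitter keeps full control over the metadata structure.
```

The flexibility in Model B is exactly the downside named above: nobody else can come along later and correlate across your package boundary.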
MR. QUINN: I was going to ask, Google publishes its taxonomy, correct?
DR. ROSENTHAL: Yes.
MR. QUINN: I didn’t see any nations, I mean Google being worldwide, have any
nations or any public health entities in the world said we are going to shift
our vital and health statistics or other national statistics to accommodate it?
DR. ROSENTHAL: I don’t know the answer to that. Historically, their approach
has been to do it collaboratively. So what they say is if a subject matter
expert or a producer of the data knows something and feels very strongly about
doing it in one way they typically accommodate that or don’t show it.
MR. QUINN: In the world of building standards, you start with something that
everybody can agree with you hope, and adoption results in standardization.
DR. ROSENTHAL: Back to my slides, I said kind of taxonomy, and we said
what’s taxonomy, it’s really wacky, how do we do it, et cetera. And I said
well actually people have your data in a taxonomy right now. You might disagree
with it, but there’s a starting point, there’s a rough draft you can start from.
And what I’m trying to show you is there are different ways of putting this
data out there that have different types of adoption. So this user is not a
hardcore tech person. And by the way, as you do your own public health
initiatives you should really look up there, there’s some really interesting
stuff believe it or not. Nor are they necessarily a data analysis person.
So in terms of kind of doing contests and incentivizing people I used that
little ReadWriteWeb, I’m only using it as an illustrative example. Here they’re
giving away $500 and a free pass to their own conference so it basically costs
them very little, and they partnered with a technology platform who was all too
happy to partner with them. So outside of having a dedicated person in-house do
it they weren’t spending a lot on it, and they got massive traffic per
Alexa and some of the other analytics.
So what I was trying and hope to show you was you can do this sort of stuff
with very little investment. I mean the users or the partnerships do it for
you. And it reaches a very different sort of audience. And you can do that in
two ways. You can say I start supply or demand side with the data I feel safest
with, put it out there and see what happens, or you can say actually we’re
really interested in this subject and maybe we do a synthetic file or something
we feel secure with at a particular grain and see what happens, what
conclusions do they have in a particular topic of interest.
And what you can really do since there’s only five or six kind of secure
access passes granted to different enclaves you can say whoever wins the
contest using that synthetic file judged by HHS – and the people in this room
if you want to do it, so you can have your hand in seeing if it’s worth doing
or not – actually gets access to the real data. So there are all sorts of
fantastic creative things you can do. So that’s it.
DR. FRANCIS: On that slide, there was a place to click download, and what
I’m interested in is what can be downloaded, and if data are downloaded is
there a way to continue to enforce the levels thing when it is downloaded.
DR. ROSENTHAL: What you can do is if you are working in the Tableau world or
people like them you can do whatever you want to do. You can basically say we
don’t want this downloaded, or you can say we’re not comfortable with it
downloaded, do something synthetic. Or you can say the only grain we’re going
to do is whatever you want to do. And they typically download extracts.
So a good example of this is run a free Google Consumer Survey and just see
what happens, because the download you’ll get – and this is a good analog
or actually look at this – is a PDF of insights, so they do all the
correlation for you. A survey I run rather frequently is what people think
about Medicare Advantage. And I do it by open users, I do it by enrollees, and
I do it by current users of Medicare Advantage. We do that for our private
thing. And it does it all for me.
And so in this state women with this education and this income actually have
– it does it all for me. And I can download the PDF of those infographics,
or it gives me just an Excel sheet basically, and their grain recently went from
state to county.
MR. SCANLON: Rather than peer-review processes, this relies on the community
of users to judge the quality or the accuracy or usefulness, right? An open
peer-review process, I hope.
DR. ROSENTHAL: Yes. And they do it in different ways. So Google has kind of
– not surprisingly, this is why they get criticized for kind of being like
Microsoft. So they’re much less peer-review based than say Tableau. So they
actually are much less open. They actually, because they’re a closed system,
their experts internally review it. But Tableau takes a different view and they
basically open it up.
DR. CARR: This really is remarkable for how disruptive it is for our
traditional ways. And I think about the meeting that we just had over the last
day and a half, we think a lot about privacy, that was a lot of our discussion,
and it’s interesting that in some ways they’ve solved it. Like we’re not going
to allow you to get to too small of a level. So it’s not debatable, it’s just enforced.
And we’re not going to allow you to merge. And I think the thing that’s
interesting is this is hypothesis generation in a way more than a peer review
article. And I think it speaks to how we learn in this new environment. We talk
a lot about how many journal articles there are versus 10 or 20 years ago. Now
we say how much information is available minute to minute, and it’s not
possible to wait for a world of peer-reviewed articles that take 18 months,
or however long it takes to publish an article.
So this is extremely informative as we think about the data and now where it
goes and how it’s used. I mean the examples you’ve showed, the diabetes one is
just powerful. We’ve had lots of articles about growing obesity, but when you
see that and you juxtapose it with all of the community factors, again it goes
back to NCVHS, because we have our diagram of the influence of not just the
person but the community, et cetera – this is the embodiment of that kind of thinking.
DR. ROSENTHAL: They don’t have your expertise and they would obviously value
that very highly. The other thing I should say is that hypothesis generation or
education. So rather than how are you going to field the knowledge workers and
et cetera, you can get a traditional MPH and that’s interesting, or you can go
to Coursera or edX – MIT and Harvard put their courses online – Stanford has a
data mining certificate, for like $10,000 you can get a certificate from
Stanford and you’re certified as being able to mine data, and you play with some of the data.
And when I’ve been hiring in our product development I’ve always taken the
latter candidate rather than the former. So the point of the story is there’s a
whole bunch of different things in here that are worth considering.
So I hope this is helpful because the first time I was talking about stuff
it wasn’t terribly helpful because this was the mental model I had. So feel
free to dismiss it or change it or do whatever, but I just wanted to give a
sense of the type of stuff which is out there, and I think that can be a real
credit or something that the committee can contribute. Because right now health
data initiative is about putting data out there, letting people use enclaves,
and helping the app developers do it.
And then you know what happens? The guy building the where’s my parking app
also just takes the Hospital Compare data, that doesn’t really solve a business
need. By business need I mean also public health, public social need,
reasonable value, however you’re going to do it, where this is kind of a
different way of going about it.
DR. COHEN: This is great. Thank you for letting us see how your mind works.
I see this really as evolutionary rather than revolutionary. From the public
health perspective first we generated huge stacks of reports of numbers that
sat on a shelf. There was probably something that preceded it, probably just
filing statistics and books, and then we generated the reports when we learned
how to bind and when Xerox invented photocopying.
And then a lot of folks from a variety of perspectives – my perspective
is government – have produced web-based query systems at the state level,
at the local level, and at the national level. And the focus there was nobody
reads the report so let’s put them online somehow through these web-based query
systems, and that’s been perking along for about 15 or 20 years.
Some of the state WDQSs are more sophisticated than others, and I see this
as the next version in that progression of disseminating information. This is
just, to me, an easier to use, easier to access web-based query system that
gives the user a lot more flexibility because the contents and the direction
aren’t predetermined by the data holders.
So we’re saying we’ve got all this neat information, let’s get it out there,
some in more combined formats and some – why don’t we put all the vital
statistics at the county level from the US, give it to Google, and put it up
there and let people work on it?
Or if we wanted to focus on a project that’s more directed, why don’t we
take heart disease mortality and behavioral risk data on heart disease risk
factors and what we know about the social determinants related to heart disease
and work with somebody you think is good, and put a data set out there and see
what folks do with it.
DR. ROSENTHAL: That was sort of when we were doing the webinars. It was kind
of asking the questions, are they thinking about this sort of stuff in these
ways, it didn’t go very well so I tried to back off. That’s exactly what I was
trying to articulate, much better said.
DR. FRANCIS: I want to go back to the point Justine made about privacy. From
that privacy perspective the interesting questions are what level, if I’m a
data contributor, what are the levels I want to put in?
And another sort of version of that is: is there a way that I can make sure
that there isn’t a proxy for something lower than the level that I’m
comfortable with in the results when you aggregate data sets. And I simply don’t
know the answer to that, but you’re shaking your head, maybe there isn’t any
way to do it. But I’m assuming that there must be, so you don’t get – I
don’t know, I’ll call it GIS coordinates replacing zip codes or something like that.
DR. COHEN: We started with the basic building block. Let’s say county, which
most people accept, people of New England really relate to. Then you can
essentially build robust data sets and where the data are too sparse there are
a variety of ways, and you can have some blank counties or there are techniques
to impute county level values. If you get aggregate county level values from a
variety of indicators depending on the cross-county classifications, we can
throw as many different data sets on top of those for people to look at
simultaneously without increasing the chance of identifying any one individual.
So I think looking at aggregate data overlaid is a different strategy than
linking in individual level data that increases the probability.
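The small-cell approach Dr. Cohen describes, publishing county aggregates but blanking the sparse counties, is a standard disclosure-avoidance step. Here is a minimal sketch; the threshold and the counts are illustrative, not any agency’s actual rule.

```python
# Minimal small-cell suppression: publish county-level aggregate counts, but
# blank any county whose count falls below a disclosure threshold rather than
# risk identifying an individual. Threshold and data are made up.

SUPPRESSION_THRESHOLD = 11  # a common rule of thumb; agencies vary

county_counts = {
    "Suffolk": 1520,
    "Nantucket": 7,   # too sparse to publish
    "Middlesex": 2210,
    "Dukes": 9,       # too sparse to publish
}

def suppress_small_cells(counts: dict, threshold: int = SUPPRESSION_THRESHOLD) -> dict:
    """Replace counts under the threshold with None (a blank cell)."""
    return {k: (v if v >= threshold else None) for k, v in counts.items()}

published = suppress_small_cells(county_counts)
```

Because each layer is an aggregate with its small cells blanked, overlaying several such layers on the same county map does not raise the re-identification risk the way linking individual-level records would.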
DR. ROSENTHAL: That question came up in a previous meeting, and so in a
slide – not this one I gave you but a previous one – you’ll see privacy, the
comp side, the personnel side, health care. I have her quote up there, a
professor on that specific thing at one of the conferences I spoke at, and so
she’s addressing that and essentially saying the same thing in much more complex terms.
And number two, here’s a crazy idea: You say we don’t have that download
button. It’s literally a closed environment. The only reason you need to
download data is to be able to pull it down, link it with other things. But if
you can do it all in a browser you don’t need to download it. So those views I
was showing you, one of them had download and one of them didn’t. That’s what I
was trying to say earlier, if you can do it all in the web you don’t need to download it.
DR. FRANCIS: That is where there are going to be really hard questions about
what you allow there, too. For example the SAMHSA example that’s in the minutes
raises exactly that question.
DR. ROSENTHAL: That is why I started with the View-Master, that was sort of
tongue in cheek.
MR. SCANLON: This is not for everybody. Obviously the research community,
the public health community, they’re going to continue to get the data they
want for their own work and then run analyses, hopefully in better ways. But
this is another attempt to reach a whole new set of people, it’s not so much
business to business, it’s more business to person.
DR. ROSENTHAL: Some of the stuff planetRE(?) does is a version of this, and
Tableau, they do a lot of B to B. So Fair Isaac, your FICO score comes from
using this. So this is the public or kind of dumb version of the thing that
actually supports the very high end B to B stuff right now.
MR. QUINN: The thing that really struck me about this, and this has got my
brain going, is this taxonomy. That is the key to all of this and anybody who’s
ever said, hey, let’s develop a taxonomy and get everybody to use it, you can
just hear the whooshing sound of time, and it’s 20 years later and you’re no
closer. And this is the key to putting all the data together.
DR. ROSENTHAL: And notice their taxonomy isn’t like a metadata taxonomy,
it’s data, it’s the business taxonomy. In your original files you’ll see an
example of a business taxonomy of our data. What is this, how do I use it to
answer business or performance questions.
MR. QUINN: It is more normalized.
DR. ROSENTHAL: Yes. So it’s not just this is the ICD-9 code, this is
answering specific questions.
MR. QUINN: You can always build it out for that. That’s an area where I
would love to see communities in specific areas – for example public
health, or specific areas where there is the need for metadata or semantic standards.
DR. ROSENTHAL: The community kind of has three approaches to it. Typically
the academic approach is to say let’s get in a room, let’s form a standards
committee, we know what’s best, and that’s going to be our taxonomy. And that tends,
historically, not to be the most efficient way to do it. And there’s also some
questions about the integrity of that as well.
The other option is just to do a purely open source community which is what
Google and these guys have done already. The other option is to build a
community of subject matter experts – every MPH grad, et cetera, with peer
review et cetera. You have different options.
MS. QUEEN: Josh, on your previous slide I think it had the contract number.
This was where at the last meeting when Ed Sondik was here, there was a
definite disconnect for those of us in the survey world who don’t even know who
in the agency or where in the agency you have the information that’s needed.
You knew what you were talking about, I don’t think we understood. Where do you
get that kind of information?
DR. COHEN: I think the best solution is to take a data set, take NHANES or
take mortality data or take a Medicare claims file and sit down.
MS. QUEEN: But NHANES has a data dictionary. What you’re talking about is
something different when you’re talking about contract number.
DR. ROSENTHAL: This is payer compare data with Medicare and MA contract.
MS. QUEEN: The surveys all have the detailed variables.
DR. COHEN: It is not going to be rocket science to create this taxonomy, the
file formats and structures exist, they’re just not arranged in a way that
developers think about putting data into applications.
DR. ROSENTHAL: The beautiful part about this is once the developer does that
then – this is what I was trying to say at the other meeting, if I want to
do this right now I have to as a developer do it myself, and then the next
developer, and then the next developer. In this scenario, I do it once and then
I put it in the browser, and then everyone is living off the fruits of my
whatever. You can have an official standard or you can have multiple people
using different things.
MS. QUEEN: But there is one big difference with the surveys, and I know
we’ll talk about this later, the survey data has to be run to generate estimates. So you
have to run it, you have to weight it, when you download these data files
you’re not going to pull out a record from that, you’re going to run it to get
whatever it is that you’re looking at. If it’s health insurance you may have
six different health insurance questions.
DR. ROSENTHAL: You are essentially giving them an analytic extract, and in
that Google Public Data there are results of kind of international survey data.
DR. CARR: If we were going to then – because we have our charge of
coming up with specific recommendations – what do we want to take away
from what we see here today?
DR. ROSENTHAL: Can I say one more thing and then I’ll be quiet for the rest of the day?
DR. CARR: Don’t be quiet, that would be foolish.
DR. ROSENTHAL: Don’t tempt me. What I was trying to show with this was
originally in the recommendations we talked about kind of taxonomy and learning
center, all this stuff, and the question was what is the unit around the
learning. Is it around the data? What’s the definition of contract versus HRR?
No, that’s not so helpful. Is it around this, that, or the other thing?
Originally when I made the screen cap in your other slides of doing kind of HHS
and CMS data driven product development around seeing what people use, this was
a piece of a broader learning center.
So I didn’t originally envision this as its own thing, free-standing, but to
your point whoever asked about authorship and community, you have this
functioning as a layer in that. Someone working with the data and saying how do
you do it, someone working with information and saying how do you do it, and
then someone on top of it saying how do we actually build applications for
public/social good, business value, et cetera. So I just wanted to throw that
out there, I don’t want to overly narrowly focus on this. This was just kind of
a piece of the puzzle.
DR. COHEN: I understand the context for what you’re doing, but we have to
start somewhere. And to the extent that we’re starting small, the first goal
in our charge is to liberate the data. How do you start liberating the data? You
start by getting it out there and see how it resonates and then working up from
there, I think it will create its own context rather than trying to think about
the grand scheme and work down.
So in terms of a first step these tools exist and they’re well used by
everybody apparently except folks in the public health field. And representing
the feds here, we’re not familiar with these, we haven’t leveraged the value of
these in getting our data out for people to do the things they want to do with
it. So in terms of a first step I think this is a wonderful strategy.
MR. QUINN: The thing that strikes me with this is how about we choose a data
set that we’re comfortable sharing with Google, with the world, and see if it
fits the taxonomy and work through the process of putting it out there, see what happens.
MS. QUEEN: I think with the Google one, because I was playing with it, aside
from Census I think something from AHRQ is already up.
DR. ROSENTHAL: They pulled it.
DR. CARR: They pulled it meaning they took it down?
DR. ROSENTHAL: They scraped it.
DR. CARR: I need taxonomy. What does scraping mean?
DR. ROSENTHAL: They have these little automated things and they kind of
crawl through the web and they pull information down in structured ways and
then they can post the data.
DR. CARR: It is data that is already on the web and they pull it into their own site?
DR. ROSENTHAL: Yes.
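As a concrete illustration of the scraping just described, here is a toy crawler step: take a fetched public page and pull a table into structured rows. The HTML below is inline and made up; a real scraper would fetch over the network and honor robots.txt and rate limits.

```python
# Toy "scraping" step: parse an HTML table into structured rows.
# Uses only the standard library; the page content here is invented.

from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped into table rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
        elif tag == "tr":
            self._row = []
    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# In practice the html string would come from an automated fetch of a public
# page, e.g. urllib.request.urlopen(url).read().decode().
html = ("<table><tr><td>Suffolk</td><td>1520</td></tr>"
        "<tr><td>Middlesex</td><td>2210</td></tr></table>")
parser = CellCollector()
parser.feed(html)
print(parser.rows)  # [['Suffolk', '1520'], ['Middlesex', '2210']]
```

That structured output is what a crawler can then post into its own platform, which is all “pulling data down in structured ways” means here.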
MS. QUEEN: It was something from AHRQ. It was either MEPS or HCUP or
whatever’s been made public. So with HCUP you don’t have a public file.
DR. COHEN: Wearing my community hat, the problem is with surveys because of
level of granularity. I would choose a more surveillance type data set like
births or deaths or cancer where you can populate data from every county.
MS. QUEEN: It may have been the HCUP.
DR. COHEN: I think that would resonate and reach a broader audience.
DR. ROSENTHAL: I would only say, just while we’re thinking about it, is that
ReadWriteWeb example is very interesting. You have people at HHS who are pretty
good at business liaising, who know some of the people at places like
ReadWriteWeb. If you’re going to do it you have a lot of cachet, don’t kind of
post it on your own.
Depending on how big you want to go, if you want to have a million people
using this data set you do the PR and marketing around it – in the first
meeting we talked about kind of PR and marketing, I just wanted to throw that
up there as a lever. You can post it up there kind of in the night and then
it’ll be up there and no one will really know about it, or you can go big and
say the winner of this gets the HDI thing, et cetera.
DR. CARR: Wait. I didn’t really understand what you said. Let me break it
down. We can put data up on the web, we could put it on Google or we could put
it on that other site, and we’d be better than we are today because more people
would have access to it. And then you’re saying we could draw attention to it
as part of a prize or something like that?
DR. ROSENTHAL: Yes. If you approach it as a public/private thing, and there’s
people at HHS who do this. There are challenges, and it’s worth thinking about
the marketing, do you want to do it stealthy and kind of put it up quietly or
do you want to make it —
DR. CARR: Well probably all of the above. But do the one thing. I mean if we
could make it available to these sites, Google and what’s the other one called?
Tableau. We could do that, and that would be light-years ahead. That would open
that data up to orders of magnitude more people, so that would be huge. Then we
could be better than that by drawing attention to it.
DR. ROSENTHAL: Relatively easily, for 2013 there’s the application
challenges, those are ongoing, they’re planning it right now.
DR. CARR: You are talking about a Datapalooza?
DR. ROSENTHAL: Yes, as we speak. And the challenge is going to be who
develops the best application. You already have your data up there. My point is
there are very simple things you can do to broaden the users, not just the
people who want to use the data in application challenges.
DR. CARR: They are not exclusive, though, putting it up there is number one
and then drawing attention to it is number two.
MR. CROWLEY: Part of the value in this too is being able to mix and match
different data types. So rather than thinking about one data file now they want
to think about a couple of different data files that are being used in
different ways, and then see what works how and what doesn’t work, perhaps
different opportunities within this data structure.
MR. QUINN: I just think the process of selecting a data set or two or three
or four, and actually walking through the steps of doing it, posting it and
seeing how it works and the reaction, just the government process of doing
that, and then the process of seeing if people find it, and the feedback that
we get, that's valuable in itself.
MS. QUEEN: I am wondering if with the Google site for example, if for those
data that have already been made available, public use data files or public
data sets that are already up there on HealthData.Gov or Data.Gov, Google
presumably could go and download them as part of their site, is that correct?
DR. ROSENTHAL: Verify everything I am saying, because things have changed in
the time since I looked at this.
MS. QUEEN: If they are publicly available they’ve already been through all
the proper disclosures so it can be downloaded.
DR. CARR: I am going to be the person asking the really simple questions,
but I think that’s a great point. So if we have already made this data
available on the web, then Google could go and take what’s there, right?
DR. ROSENTHAL: Yes. This will get into a business discussion really quickly,
so you’re going to have to have someone representing the agency talking to
someone at Google representing it.
DR. CARR: We don’t need to get to that, but just conceptually.
DR. ROSENTHAL: Conceptually, if you're going to do it, or if you're going to
spend a dollar, you could hire a third party to do the scraping and formatting.
DR. CARR: In other words HHS will be part of the process of that data going
onto that website, whether it’s proactive or reactive?
DR. COHEN: All the mortality data at the county level by age, race, sex, and
cause are already up online via WONDER, but WONDER is pretty much a cult
product, because it's difficult to use and people don't know where it is or
how to use it.
MS. QUEEN: Isn’t it a tool? That’s not the raw data, correct?
DR. COHEN: The raw data exists back in the system, because it's got all the
death information at the county level.
MS. QUEEN: The distinction I am making is from the viewpoint of somebody who
is trying to download something. Like when I query BRFSS, I do the online tool.
I don’t download the data, I just get the results of my query.
DR. COHEN: That is right. I was responding to the notion that there are data
that are publicly available via these tools, so the underlying data exists
there and I’d say they’re all on parole, they haven’t been liberated.
DR. ROSENTHAL: In terms of feasibility, of what’s possible or not, I think
that’s a pretty good place to kind of get a subject matter expert to come in
and talk about it, or I’d talk to Google directly.
But if the question is do we want to pick a set and try to do something with
it, before you go into that I would at least do a feasibility analysis and say
is it actually easier, can they do that without us doing it already? Do we want
to do it to maintain control in certain ways? Or instead of doing that should
we take an alternative path, like some of the private scrapers I was talking
about?
DR. CARR: Let’s try to map out a roadmap here. We’ve seen about a dozen data
sets, 10 or so, that are available only in these structured sites of HHS. So
now we would want to move some of them, maybe all of them, to a Google data
platform. So one step is that, to do that, we have to have the right people
in the room from HHS; and the second is that they would define the parameters
of protection, sample size, et cetera.
If we made a recommendation back to Jim that the data leads in each of the
agencies and in HHS do exactly that, that they meet up with Google and report
back which ones are a go, and if any are held back, why, I think that would
be a very valuable next step for them to do. I like the idea of piloting the
data, but I don't see a need to pilot anything.
If that’s all out there then let’s move that forward. If we thought about a
pilot, I think then we would be in kind of a different domain of either
responding to here’s a need and this data could be put together in a particular
way. Or maybe, as you said, cardiovascular or an app or a business case, or
something like that.
So there are kind of two things. One is simply to continue, to complete the
liberation by making the data more digitally usable.
MS. GREENBERG: In this brave new world I’m this old-school executive
secretary of this committee. Not of this working group, but of the committee.
So I guess we talked a little bit before lunch about what are we talking
about, and I still, from the point of view of how this working group would
convey information, I still think if you’re talking about recommendations that
have to go through the national committee on vital and health statistics, if
you want to make some suggestions like we heard this presentation, this sounds
kind of interesting, we might want to explore this or something, that’s fine.
But I really, just being sort of steeped in all the policies, et cetera, of
several advisory committees, I really don’t think that this working group can
make recommendations on its own to the department.
DR. CARR: Sorry. I am sensitive to that. As a reactor panel we could perhaps
have the reaction that this front-end browser could add value to the datasets
that are already out there.
MS. GREENBERG: That might be worth exploring. I’m from a really data for
dummies point of view, so I’m more of a data policy person than a data user
person. But if you take data that are already on the web, people can download
them, maybe it’s not so easy, but the intent is not to make it difficult, the
intent is to make it available. I recognize that it may not be that user
friendly to the uninitiated.
So if it’s already there I’m trying to think what’s the downside of putting
it – let’s just start with the Google rather than the Tableau – but
what’s the downside? Well, one is that you asked the questions, what do you
have to do?
I mean the question of can Google just come and take it, it sounds like they
couldn't, because they would need this taxonomy. They're not in a position to
write out a taxonomy, are they? I mean they have a taxonomy, but how do they
put these data that they don't know anything about into this taxonomy?
MS. QUEEN: Marjorie, for all of the data that are downloadable from HHS and
publicly available, there are links to the information along with them, so
the information is there. To me the difference is, if Google has done this
with data we've made available, they've done it with data we've made
available.
For HHS to intentionally ask them to do it or sanction it or whatever, it's
sort of like HHS would have to be approving what Google does. There is a
distinction between having it out there and letting them use it, and
encouraging their use of it by anybody.
DR. ROSENTHAL: I will just say three things. They can absolutely do that
right now if they have the intention, and in fact they’ve already done it and
they do it with a bunch of different bodies and there’s very little you can do
about it unless you want to get very feisty with them.
MS. GREENBERG: Why aren’t they doing it, is it too much work?
DR. ROSENTHAL: They have done it, with some data sets. There’s more to be
done for sure. That’s why I say there’s almost like a business context to think
about. They can do this on their own, do you want to encourage them to do that
as you redesign Data.Gov and do the next generation of it?
Do you want to just encourage them to do that? Do you want to have some sort
of working partnership with them where there’s some sort of intelligence or
security you’re putting on top of that? Or do you want to kind of do it
yourself and take the more active role?
That’s kind of the decision you probably want to think about, or one way to
frame it up. They don’t need your help to do that, they can do that on their
own quite well.
MS. QUEEN: Isn’t that what NIH is doing with Amazon? Something with the
genome data, the big data?
DR. ROSENTHAL: Yes. That is just one way to think about it. You can pretend
you never saw this and just kind of let them go on their own, and they’ll put
more up and five years from now you’ll look and a bunch of your stuff will be
there, or you can say we want to actively encourage them to do this, we want to
promote it and we want to get users to do that, and that means some very
specific 101 marketing and PR stuff that we’re thinking about. Or you could
think about becoming even more actively involved in terms of saying we have
research and public health agendas, we want to put certain things forward in
terms of et cetera.
MS. GREENBERG: I'm asking those of you, like Susan, who are more familiar
with Data.Gov: do we have high confidence that, by putting it there, or it
being there, no matter what other data they mash it up with or whatever, it
just isn't going to be a problem? We must, because they can do it now.
MS. QUEEN: Well, we have all of these privacy and compliance requirements
that the agencies go through, a lot of them are very similar but some of them
there may be a suppression of three to a cell or five to a cell, it may vary
depending on the agencies. But they have a lot of different statistical
disclosure techniques that they apply to the data before they are released.
One of the documents that I put up on SharePoint was something that we had to
develop for OMB when HHS took over HealthData.Gov from GSA. It described all
the different techniques that the agencies use to protect the data and to
minimize disclosure of anyone's identity. So that's out there.
DR. COHEN: I agree with Josh. I mean, what’s the business case? The business
case is we want to proactively liberate the data to expand its use and to make
it easier for folks who haven't traditionally used our data to be able to get
to it, because they use these tools, they know these places, they don't know
ours.
All the data at the community level in our first presentation, in the
indicator warehouse, all those data have already been reviewed and populated,
pretty much mainly at the county level. There's a huge amount of information
there.
If we push that into Google as another platform for release, or if we want
to create patterns using Tableau, combining those data with other data that
traditionally don’t fall in the public health sector, I think that’s the goal
of data liberation, really.
So it’s not protecting our enclaves, and there’s already a lot of
information that is quasi-public, but people just don’t know how to get to it
because it’s not in a place where they go to look for information.
MS. GREENBERG: What is the downside?
DR. COHEN: My answer is there is no downside for the agencies or for the
community, and particularly I think this will encourage agencies to think more
creatively about how to disseminate information in their data.
MS. GREENBERG: Is the only reason we haven’t done it is because we don’t
know about it, or we don’t have the resources?
DR. COHEN: It is not a priority. We spend 99 percent of our time making sure
the data we collect are of high quality. We spend one half of one percent of
our time thinking about dissemination and getting it to the people that need
it. And that’s just what we initially thought – I’m speaking now as a
government person – what our charge and responsibilities are.
We fulfill our obligations by being the stewards and making sure the data is
collected and protected, and once we’re comfortable with that we need to flip
the switch and send our children away.
DR. CARR: I just want to make a comment. I think Josh said it in the
beginning, you have a finite amount of time to spend and you can spend it
trying to figure out how to manage the data, or you can outsource that to
Google and now spend your time thinking about the content and the value.
DR. GREEN: I want to join the Susan and Justine line of dumb questions. In
this arrangement, what is a data analyst?
DR. ROSENTHAL: I don’t do very well on philosophy so I have difficulty
answering what strikes me as metaphysical questions.
DR. CARR: Where are you going with that, Larry? Maybe elaborate a little bit.
DR. GREEN: It is not a philosophical question, it’s a job description.
What’s the job description of an analyst once you get here? In the old world
we’d whine and complain about public health departments having very little
analytic capacity. And we paid premium bucks to someone who can develop mastery
of a data set, understand it in depth and breadth, understand what can be
linked and what cannot be linked, et cetera. So in this world I’m just asking
what is an analyst going to be.
DR. CARR: I think if we get back to the life cycle of data or the supply
chain of data, I think we have moved the nidus of control from one venue to
another. And I’m not sure that an 18 year old can tell us about diabetes. So
that’s already happening whether we planned it, liked it, or whatever.
But I think we have to think about then what is after that hypothesis
generation, what happens to that hypothesis. It looked pretty right to me, it
looked good and there’s a lot of stuff behind it, but someone else could
juxtapose a lot of data that was not as sophisticated and that too will be out
in the public domain.
DR. ROSENTHAL: In this sort of interactive business intelligence world
outside of health care, and even in health care, not necessarily in the MHP
world, typically there are business analysts, there are business intelligence
analysts, and there are metadata analysts, sometimes called metricians.
So one way to think about this is it’s become more specialized. The person
that sets up that taxonomy has arguably a much higher degree of skill. If I
wanted to do that five years ago I had to run thousands of lines of SAS. That’s
all I did, I sat there and did that. I was an analyst, but I was running SAS
all day; I had to run a regression.
The person who sets that up is largely doing the same thing, they're just not
doing it by hand. There's a greater degree of specialization. There's not so
much a generic analyst; there are different types of analysts nowadays is the
easiest way to describe it.
And the fundamental tasks are very similar. One way to think about it is that
the speed and rate have increased. So if I'm running an analytics group,
which I have, and one of the analysts comes to me and I don't like his
factorial analysis, basically, I think his correlation is off, I don't think
he's correctly done a feature construct set and come up with reasonable
comorbidities for diabetes, I'm essentially doing the same thing, except what
took me three months then takes two days now, but the functions are the same.
DR. CARR: I think Larry’s point is that you have a group of people under
your control, and you supervise them. But unsupervised people have access to
these data as well. And that's a reality, and that is the disruptive piece.
DR. COHEN: Perhaps disruptive, but essentially we’re moving the data
upstream to put it in the hands of the decision makers rather than having to
create this artificial interface between the people responsible for decisions
and the data, who we used to call data analysts. I mean that's the goal of
data liberation.
Decision makers make decisions whether they have data or not. Our underlying
assumption is by providing more data directly to decision makers they’ll be
able to make better decisions. It might not all be decisions that we agree
with, it might not all be decisions that we would have made given the data.
But our goal is to provide decision makers, whether it’s individuals
choosing health plans or whether it’s community groups deciding what priorities
to operate on, to give them the opportunity to use the information that we’ve
generated to make decisions.
So who is the analyst? The analyst now becomes the decision maker, and the
information is not filtered through any lens other than the lens that they want
to use to make decisions.
DR. ROSENTHAL: I apologize. I misunderstood the question, I thought we were
asking what in terms of the job description is an analyst working for an
analytics shop. There you have a group working.
But in terms of the open sourcing, absolutely. The specialization is you
don’t need to require everyone to understand the taxonomy. One person can do
that and then you can open it up and everyone else can build on that rather
than hierarchical. And that tends to work pretty well, at least that’s the
charge as I understand it.
DR. CARR: I think your point is excellent that it gets into the hands of the
decision makers. So if you get information, let’s say by your community, and
you say this doesn’t resonate with my reality, you’re going to be motivated to
enhance, refine, or improve that data set so that you haven’t missed anything.
So it's kind of crowdsourcing or something, it's a lot of people weighing in
on it.
DR. FRANCIS: What I want to actually push on, which I just don’t understand,
is in a way what’s all the hullaballoo about, because if the only question on
the table is you’ve got a public use data set and it’s on the HHS website or
it’s over on the Google website, hey, I mean with the Google taxonomy, no
problem. There are no more privacy issues raised than there were when it was on
the HHS website.
Now, if I go to what Bruce said a minute ago, more data, and ask what that
means without knowing it, or I look at something like the webinar summary
from the 23rd, where there are some tools to allow the public to do analytics
behind a firewall, but you don't get the data because of confidentiality
restrictions. Now if what we were doing was essentially taking that data and
putting it on Google with the firewall allowing the analytics —
MS. QUEEN: I can never see the restricted data being made available publicly.
Even the restricted access has a lot of strange constraints. Certain SAS
codes you can't run, there are limits on which procs you can use, there are a
number of things they won't even let you do, and you can't download it.
DR. CARR: Along the continuum of data there is a tremendous amount of data
that, as Leslie said, is hugely valuable, especially when juxtaposed with
other available data that accounts for decades of learning. And there can
still be the people who have the restricted access and expertise, et cetera,
to use that data.
DR. ROSENTHAL: Not to cloud the issue, and that's one thing for James to
think about, but when we were taking those surveys of websites and what they
were doing: if you were in the private world looking at, say, credit card
data, instead of a portal that allowed you to create a table very clunkily,
you'd have a professional version of one of these browsers up and running,
and you'd be able to do that very quickly and very meaningfully, and all the
researchers could ask questions and get answers and do things intuitively.
So that's a separate subject, how you get people out there, but it is worth
thinking about. When I walk in with my glasses and see this, this is the
generation of stuff we've spent a lot more money on in the commercial world.
The cost to put one of these things up behind a firewall for researchers like
yourselves, that's just something else to think about, but it's eminently
feasible and being done really easily.
DR. CARR: We're at 4:30; CDC is going to do a presentation. Do we have 45
minutes now to kind of land the plane a little bit on what we'd like to put
forward for consideration, as we as the reactor panel are seeing this
information? I just want to make sure we have the right language, Marjorie:
that we've fulfilled the charge, that we're not making recommendations and
we're not sending a letter, we are reacting to information that has been
presented to us.
MS. GREENBERG: I am seeing a frown on Leah’s face. Leah hasn’t said
anything, do you mind if I call on her? Your reaction to this discussion.
DR. VAUGHAN: I guess I am trying to understand what's most helpful for me to
help you. One of the things that does kind of strike me, one of the things
that might be helpful, is to actually get you guys to drive data sets on a
number of the platforms outside of your area of expertise, see what you
understand, and then have to feed it back to somebody whose area of expertise
it is, so that you have a sense of a range of tools, far beyond Google, which
has its own challenges and problems around privacy, and so that you actually
get a sense of what it is that can be done.
And there have been some wonderful examples in the non-profit sector of
doing that, there have been some wonderful instances within the challenges of
taking specific data sets and putting them forward, including Million Hearts.
But to maybe not just have you drive it but have you drive some data, not
scraping it, but drive it, and to just have you have the direct personal
experience of going through it.
DR. CARR: I am hearing you say two things. One is that we shouldn’t just
focus on the two browsers that we saw today and we ought to broaden that to say
we ought to leverage public browsers.
DR. VAUGHAN: There is that piece, because at private companies there's a very
strong open access movement, both here and in the UK, which is very much
about not limiting it to only open access, but about ensuring the integrity
and, to the extent we understand it, the perpetual availability of the data
sets to the public without them becoming proprietary. And there's been some
excellent work
done in both, there was just a conference last week. But my strongest
impression is that rather than talk about it and make a recommendation we need
to actually do it.
DR. CARR: Again, we are a reactor group, and the doing – I mean we’re
doing a bit of traveling through this today, but I think that HHS is now
configured in a way that there are individuals accountable for the utility of
their data.
DR. ROSENTHAL: What are some other public data browsers that you’re thinking
of, and could you clarify the difference between driving and scraping for me
just so I understand what you’re talking about?
DR. VAUGHAN: There is a very large number of browsing products that are
available, and I think my suggestion had to do with data domain experts using
some of those visualization tools that are available right now to see what data
looks like and feels like by using them for this public-facing analysis.
Maybe try it in your own area, but also try it outside of your area, so
you're perhaps having more of the consumer experience.
DR. CARR: With regard to that part of your suggestion, you were saying for
us to have an experience of what it was like – I mean Josh was like let’s
pull down this, let's pull down that —
DR. VAUGHAN: That is not the same as you doing it.
DR. CARR: You doing it yourself?
DR. VAUGHAN: Yes.
DR. CARR: I mean actually I did it during that last webinar, and so at the
end of that experience each person will see what was hard, what was easy, what
was unexpected. And then what? Are you saying that would influence which
browser?
DR. VAUGHAN: I think – I am hearing some misunderstandings, and I think the
best way, rather than just talking about it, is to just do it. I think the
notion of doing it outside of your particular area of expertise, and then
reading that back to someone whose expertise it is, is, again, a little bit
more of the consumer experience, even though everybody here is an exquisite
expert: to see what you're learning and understanding outside of your
particular day-to-day work.
DR. CARR: But again that information would then inform the direction of the
recommendation?
DR. VAUGHAN: Not so much which browser but the ways in which they’re useful,
the ways in which they’re limited, the ways in which there are still privacy
issues, some real ones. It’s not so much that the data sets aren’t hugely
refined to ensure privacy, it’s that given that it’s aggregated to a large
area, there are still people who will find a specific address to have that
meeting even though that’s not statistically appropriate or policy appropriate,
it still will happen, and it does happen.
And to a certain extent there is nothing you can do about that, but you
should understand that that does happen, and to understand what some of those
consequences might be. Vaccine myths are certainly a huge example of that. No
amount of truth-telling seems to quite dispel them.
DR. CARR: We have a lot of conversation at the meeting about the
accountability across the food-chain or the supply chain or the lifecycle or
whatever you want to call it of data. But I think it is true that we are now in
a world where data is out there and responsible people use it and irresponsible
people use it.
And so we talked about stewardship: when you might be showing data that's
alarming or disturbing, with impacts on housing values and so on, you have a
responsibility not to suppress the data but to deliver it in a way that is
constructive. But I'm not sure that that's where this group is. The fact
that that could happen, we’ve already made the data public and I think the
question that we’re trying to address is how do we make it more usable.
And with that, we're all on the same page that leveraging a browser, maybe
not any of these but whatever browsers are out there, to enable a user to
focus on the content, not the configuration, would create greater opportunity
to learn from the data. So are you disagreeing with that?
DR. VAUGHAN: Not in general, I’m saying that I think in terms of strengths
and limitations of many of these initiatives that I think you would come away
with a better idea, better able to make those recommendations with an
experience of a number of them, and just actually hands-on doing it.
DR. CARR: I agree with that. But I’m not sure that it’s the role of this
group that meets a couple of times a year to do that homework to advise HHS. I
think that highlighting that the data can be manipulated more readily, not
manipulated in a bad way, but moved around more readily if you had a browser.
And to go to that point and then leave it to HHS, whose day job is to make
this data usable.
DR. VAUGHAN: I think a lot of those things have already happened. I wish
Patrick Remington was here today, I don’t know if you’ve had the great pleasure
of using his site, County Health Rankings, which I think does an exquisite job
of doing just that already. Certainly American Academy of Family Physicians has
been a leader in all of this.
I think there are a number of awesome really finely done examples of that
already, I think there’s great instances in some of the challenges already.
Actually I did not read your cross-agencies, but to put forward specific data
sets.
MS. QUEEN: Actually, I was thinking that one thing we probably should do,
whether it's staff, is to look at which challenges have been done or are
ongoing that are using these data.
MR. QUINN: We could have ONC or someone at HHS do that.
DR. VAUGHAN: It’s not just ONC. It’s everybody.
MS. QUEEN: The other thing is, I didn't have enough time and I was just
playing around with the Google site, for example, trying to figure out what
Census data and what BLS data are there. I was kind of in a hurry and I could
see AHRQ. So getting a better sense of what's already being used on these
other platforms, I mean I would want to know, it's just something I want to
know.
DR. VAUGHAN: In terms of the international data, the World Bank has really
done an amazing turnabout in how they open up their data and how they make it
accessible. I certainly commend their initiative to your attention.
MS. QUEEN: I think the HHS Chief Technology Officer and his office would be
involved, at least with all the challenges at the Datapalooza. I mean we have
a source, at least for the things we know about related to challenges, and
this other stuff we could be looking at ourselves, at least to some extent,
just to find out what's already being used.
DR. VAUGHAN: As to the specifics of Google, it's a multi-tiered company.
People there are directly involved in the Hurricane Sandy initiative right
now, and that data and that set are handled in other parts of the company. So
there are many people, I'm sure, who would be delighted to talk with you at
Google, but it's a fairly complex process and a large company, and hard to
generalize about any of it.
MS. GREENBERG: I sort of have mixed feelings because I don’t want to be a
stick in the mud at all, and I want to encourage us to be thinking creatively
and not spending a year studying things. At the same time I do think that even
this working group, although it’s formulated or established somewhat
differently than the National Committee, does need to have sort of a
deliberative process in which you gather information.
And if not doing the work yourself because I have to kind of agree with
Justine that I’m not sure that that’s – I wouldn’t stop anybody playing
around with it or whatever but I don’t think that’s solely the work of this
group – so that if you just say well we heard about this, we think it’d be
interesting, we really have to question how far that would go, I mean what kind
of influence that would have.
On the other hand, if you sort of write up this discussion and shop it around
the entire working group, because we don't have half of them here I think,
maybe we'll be able to not just hear about data sets but actually hear about
some of these different activities that people are doing, kind of enrich the
discussion, find out about challenges, some of these other things.
And then taking it – you can stop there – but taking it to the
national committee and having a discussion there could result in something that
people would actually listen to as opposed to – and I know what Jim was
saying, the department might want to just bring someone to this group and say,
what do you think. Fine, that’s a reaction. But if you’re the initiators I
still say it gets very close to recommendations.
And if it isn't recommendations, if it's really at a low level, oh, we just
heard about this and we think it might be interesting, I don't know if it
would have much impact. So I'm thinking both about whether we've been
deliberative enough, or understand enough, to say something meaningful, and
also what the audience would be.
As Bill said, you have to think about the audience, how is it going to be
received. I know that you want to come out with something and not just be
talking heads, but I really am wondering what really – I’m trying to think
about what would be most useful.
And as I said, if it's at a really informal, just kind of suggestion level,
which is really all that I think would be appropriate given that you are part
of a federal advisory committee, which should not be making recommendations
on its own without going to the full committee, then it seems to me that it
might not be overly useful.
But you can try it and see I guess. We’ve heard about these activities that
obviously have potential, they’re of interest, we hear there are other
activities, there are people who might be able to tell us more, some of the
pitfalls, some of the challenges, some of the issues. I think it may be that
there needs to be a little bit more deliberation frankly.
DR. CARR: I do want to get your feedback but I want to hear from the Working
Group members. Why don’t we start with Josh?
DR. ROSENTHAL: I am not so savvy in all of this; I'm just working from the
charge, and I just want to know what the charge is. I thought the charge was
to actually make these specific suggestions. So if we're just here to kind of
review, then that's very different from what I had thought.
MS. GREENBERG: No. I didn’t say that. What I was saying was there are two
ways that this group can be used. One is that the Department can bring you some
things and just ask for reactions. You can do that, you don’t need to ask the
National Committee on Vital and Health Statistics to approve your reactions,
your reactions are your reactions, and your input, et cetera.
And the Department has an interest in using the group that way. And to learn
really from you because you have expertise that the people in the department
and even in this advisory committee don’t necessarily have.
But if you’re going to make recommendations, then just like any
sub-committee, the sub-committee eon privacy and confidentiality just spent six
months putting together recommendations related to a stewardship framework for
community health data.
They had hearings, it was very deliberative, but they can’t just make those
recommendations to the department, they have to go through the full committee,
it has to be deliberated through the full committee. So I am saying if this
working group wants to make recommendations, it also has to do it through the
full committee. That’s the rules, that’s the guidelines, otherwise you’re
basically violating FACA.
DR. CARR: We have reactions, we heard a number of things, and one reaction
is that the data is hard to use. But I don’t even want to speak for that. What
I’d like to do is go around the room and ask each person to provide a reaction
just from what we've seen so far, and then also what else should we be thinking about.
And we’ve heard about a couple of things today. We’ve talked a lot about big
picture. But I think what Bruce was saying and I think actually what Leah was
saying as well was actually getting to what is the demand side, what’s a big
population issue, what do we want to put together, maybe a data set that then
users could use to develop. That’s how I kind of think about it.
But let’s start over. Josh, a reaction, and then to Marjorie’s point, if we
want to go in greater depth we can have other presentations, hearings,
webinars, whatever else. So did you want to offer something, a reaction, and a next step?
DR. ROSENTHAL: Yes. I’d like to know more. The first set of slides I made
available, this set of slides I will make available. I think it would be very
helpful if, when we're talking about things, specific instances of them were
brought to the group. So I heard several things in your comments and I have to
admit I was pretty confused. If there’s a number of different public browsers,
I want to make sure we're talking about the same thing, because we're using different terms.
DR. CARR: That will be for the discussion afterwards. What I want to do is
just get a hearing around the room. I think you guys are seeing things from
different perspectives and I get that.
DR. ROSENTHAL: I heard that security in a browser is a different risk than
actually having the AHRQ file up on the website in its own browser, its own
instance, and I don’t agree with that from a technical standpoint. And I want to
make sure that A) that was what was claimed and B) that’s what we want to
really dig into.
DR. CARR: Let’s hold that for a discussion. Kenyon, thoughts, reactions on
what we saw so far on the HHS data and thoughts about a next step?
MR. CROWLEY: Broadly, in terms of HHS data, I have been quite pleased with
the different innovative programs. I mean I think the webinar on the behavioral
health data was quite instructive. Typically that's a data source that's so
hard to get to and so hard to use, so having found solutions I think is great
– and the other sites as well.
But it generates a lot of questions too, and I think some of the
questions to think about, so as that new behavioral resource is being put
forward and other resources that are made available through the health
indicators warehouse and through HealthData.Gov, and as these are growing, I
started to have some questions, including what can I answer with this data, how
can I make use of it, how can I share the data with colleagues, where can I go
to make more sense of the data.
And I started thinking about that and interestingly enough when I looked
through the HCUP data they actually had a full set of these are the full
questions you can answer with the data, these are the things you can do, so
obviously there have been a lot of smart people thinking a lot about these
questions.
Now, in terms of the charge of looking at how we can make this more broadly
available and create this learning environment, I think the committee should
continue to think closely about how we can make sure that the learning is
captured as people are using new and existing data sources, and is made
available to others.
So maybe something as people are looking for data – if they do find the
data they need, or if they don't find the data they need, how do we know
whether they do or they don't? And if they don't, what mechanisms are in place
to point them to everyone who is very familiar with the data, maybe within
HHS, or more broadly within the community, to say this is your data research
question, this is what you were trying to do, I know how to do this. I've used
this, this is one way to do that.
But creating an infrastructure using sort of social media or sort of other
architectures that have been used for other open source communities, that
allows those questions to be moderated by a community and to have that learning
and those answers within the community being fed back in to the community so
that as others come to use the data they can more readily create the most value
from it. So those were some things that struck me.
And as a reaction to the discussion today, I think that data browsers
provide an additional channel for people to more readily use data. As it was
discussed earlier, it's not for everybody, but for many people who are visual
learners and may not be experts in SAS or other techniques, having the
ability to mix and match data and view those results somewhat instantaneously
can – back to the analyst question – can sort of accelerate the
analytical abilities of a much larger population base.
So essentially you’re taking what was a public health analyst or other type
of analyst with a very specific set of skills and training which is still very
valuable but you’re using technology and decision systems to enable those same
types of results and findings to be accomplished by a different set and a wider
set of people. So I think that’s important, something that should continue to
be looked at closely.
MR. QUINN: To bring this back – the charge and the focus of the
broader NCVHS effort is around the community as a learning system, using local
data to improve local health. And that report talks about what we're missing.
heard from a variety of communities that are doing various things to improve
health and use data in their communities, but by and large we’re not building
the infrastructure, the technical infrastructure but also the data analysis
infrastructure in communities that’s needed to improve health.
And the broader NCVHS is looking at solutions for that, strategies and
recommendations for HHS to address that. I see this as directly involved with
that, and ways of reducing the burden on local folks who don’t have millions of
dollars to spend, who don't have the time and effort and resources to reinvent
the wheel.
And in this context I think what we could do is we can both understand
what’s going on today, so are any of these folks who we’ve talked to or others
using resources like this or those provided by the government to address this?
Would they like to, is it in the cards?
But also to understand what needs, what infrastructure could be provided
centrally or on a distributed basis like this to address those needs and to
inform the broader NCVHS on that to incorporate what we learn here through
those recommendations for some of our other activities.
For example, viewed through the lens of subcommittees, are there
standards issues for combining data or making it available? Are there
privacy issues, are there quality issues? To view it through that lens as well
and to say our goal is to build the infrastructure to support communities as
the learning system, can this get us closer to that or is it an unrelated thing
and we should look at it as something else?
DR. GIBBONS: I am still getting my feet wet here trying to understand fully
what’s going on. But just on the little that I’ve heard today, sort of thinking
about it, it strikes me that there are sort of three issues, and Matt touched
on a couple of them.
One of them is the issue that the data is hard to use, and I guess we’ve
been talking about that a little bit. The other is in terms of the same sort of
thing, and I don't know the answer to this, perhaps you do. What is the sort
of general knowledge in the country about the availability of these data sets?
Because if that's widespread then it's a non-issue, but if it's very small
that’s different than the data is hard to use, and therefore the solution to
that is education or something else. I’m just thinking about what types of
problem might be preventing widespread use of this data. One could be that
people just don’t know the breadth of data that is available, whether it’s hard
to use or not.
And the last one, maybe it’s getting a little bit on the demand side, but
before even seeing this and hearing Matt my question was how are you defining
community. Are you meaning some geographic area? It has many definitions.
But really what I'm getting at is, what do we understand about the
community capacity to use this data, and do communities, however you define
them, really have whatever level of capacity and desire? Because on the one
sense we may be assuming that if we get the data to a certain level, then wow,
300 million people in this country are going to use it.
But maybe not, maybe it’s only going to be a small subset of people who
would ever have the interest, capacity, and desire to use this. I’m not saying
that’s bad or good but I’m saying that may be a reality. And if we have a
better understanding of what’s the universe of potential people really who
would use this level of data then we could target or think more appropriately
in terms of what is good and what is best, and where our efforts need to go.
DR. FRANCIS: As I am trying to get my hands around this, as I understand it
there are at least three different questions and I’m sure there are many more
but there are three that I will outline that I think HHS is facing.
One question is what data. Another question is in what format. And a third
question is to whom. And some of what we’ve been talking about right now is in
what format, and some of it is to whom, and for example when Susan answered my
earlier question about the SAMSA data it was we’re not changing the data that
we release we’re just changing the vehicle and maybe who gets it. But I’m not
sure about that.
So what I see the role of this working group as is helping HHS to understand
the questions, the options, the benefits, and risks of different options so
that HHS makes as informed a decision as it can about what's really important.
I’m as deeply committed to having usable data out there I think as anyone
out there in the room, but what I want to be really sure about is, and
obviously from my questions and how I introduced myself, I’m not a tech person,
but I want to feel comfortable that if somebody reads in the headlines, HHS
gives its data to Google, I can say to them there aren't any risks here,
that's a real misunderstanding.
But I think the role of this subcommittee or working group is to make sure
that we first, at HHS, I’ll go back to what Leah said, aren’t acting under
misconceptions because if we are something could happen that would be really
scary and that’s the worst case scenario that I don’t think anybody wants.
MS. QUEEN: Can I just say one clarifying thing, because I didn't mean to
give the impression that it’s always the same. The public-use data files are
definitely not the same files as the restricted access files.
DR. SUAREZ: I was trying to get my arms around this, as well. I came up
actually with a list of six. Some of them are overlapping, but let me go
through those very quickly. I think the very first question is availability of
data, what is the data available out there. The other one is a question about
reliability, validity, and completeness of the data. Another dimension really
is the limitations of the data: there are some intrinsic limitations in terms
of the characteristics of the data, and then some external limitations like
privacy-related policy constraints.
So limitation to data is the third. Barriers to access, I think that’s a
very significant one. That might be technical, that might be policy. And then
the last two are really sort of more once there is data, once people have data,
it’s really the tools to improve usability of the data and then the ability to
aggregate that data. And the last item is really a mechanism to improve the
analytical capabilities, sort of the data analytics and the resources that
exist to do this data analysis.
Now, organizations like ours, Kaiser and others, are dealing with this big
question of big data and the whole data analytics issue. And I think the same
type of challenges are going to apply to communities that are going to be
trying to use big data in some way.
In this case virtualized perhaps but still dealing with the same challenges
of the three V’s as they are called, the velocity, the variety, and the volume
of data. So those were my areas I guess that I think there could be some
additional work that this workgroup should do.
MS. GREENBERG: I have already spoken, but I’m really concerned Josh. I don’t
want to sound like I’m putting a wet blanket at all on what you said, because I
think we’ve all learned some things today, and I encourage that. I’m just
trying to think through how you can be most helpful not only to the Department,
but really more broadly to the public and to communities who, as we all said,
need to use data and want to use data. I think a lot of the issues have been
raised. So I think that’s all I’ll say right now.
DR. COHEN: Where to begin? I actually like Leslie’s framework for what it is
that we need to do. So the question for me is what are the next steps?
Essentially the most specific question is what are the options for releasing HHS
data at the county level to maximize its utility, to make it as visible as
possible, and to let it breathe and have as much value as we can.
And the three questions that I think Leslie raised are, what are the data
we’re talking about releasing. We began a review but essentially that’s pretty
easy to answer, we can go dataset by dataset to decide what data and what format.
The second question is how, and I heard some suggestions from Josh today
about channels for release, and I think Leah brought up some good points. We
need to review other possible channels that exist because we don’t want to
reinvent the wheel and there may be other ways to get the data out there.
So I think the next step for us is to review what other mechanisms exist
that are actively liberating the data. And the third question is to whom, and
my answer is to as many people as possible and as many venues as possible and
as many ways as possible, as long as we're comfortable about the parameters of release.
I think Chris raised a good question: what is community? I mean
traditionally we’ve focused on geographic communities but certainly we can form
the data to target other communities, whether they’re race/ethnicity
communities or gender or age specific or whatever affinity groups are, it’s
always possible to re-aggregate and reorganize the data in a comfortable way
where we can address a variety of definitions of community.
So I guess next steps are identifying, I think we’re moving along the path
of identifying what data we’re talking about, we need to really review what
channels and options exist for releasing the data, and then I think the next
steps would be I think really kicking the wheels, trying it out to see how it
works, and then making suggestions or recommendations or providing advice to
the agencies for what we think would be the best way to liberate their data.
And it might require us as the national committee to make formal
recommendations to the secretary about how to move this process along.
MS. GREENBERG: Let me just ask, when you said the goal is releasing county
level data, are you talking about data beyond that which is currently being released?
DR. COHEN: I just wanted to have a specific target. The lowest level of
aggregation that’s viable at this point in time for most of the HHS data
holdings that aren’t address-specific like where homeless shelters are or where
halfway houses are is at the county level. There might be other configurations
like hospital market areas that people use, it could be MSAs, there are a
variety of others, but basically the geographic configuration is the county
Some of the data are, as public use data, available at the county level.
Other data, vitals for instance, is not available at the county level as public
use data, it would require review to release that data. But I think that’s
possible. So, again, we would need to go data set by data set to see what data
can be released. But that’s a technicality, that’s not an impediment.
MS. GREENBERG: It was my understanding that what this discussion was
supposed to be about today was given that the Department has already made the
decision to release a lot of data through HealthData.Gov, is there a way to
make it more accessible and usable? Not even going to that next step which is
also I think within the scope of this group to say are there data that the
department has decided not to release or hasn’t yet released that maybe could
be or how could it be.
But let’s just take the stuff that’s already out there, and then there are
some of the questions that you’ve all raised. Is it there but it’s difficult to
use? Are there tools that could make it easier to use? Are there approaches
that would allow people to do more things with it?
I have a feeling that there is a very limited number of people who are going
to be using it, even though there are a lot compared to how many used to. I mean we go
out from NCHS, we go to universities and everything and say, oh you collect
that data? I mean in schools of public health. So never assume anything, I would say.
So that’s a different question but I think we need to figure out, or you
need to figure out, which questions you are asking. And so I would start with
the data that the Department has already decided to release and take it from there.
DR. COHEN: We can discuss that in more detail later. My response is the
indicators warehouse and HealthData.Gov are good channels and they reach a
target audience. But it makes more sense to me to bring the data to where we
know people are rather than to try to move people to where we put the data.
MS. GREENBERG: That is what Josh was talking about.
DR. COHEN: That is exactly it. We need to maximize the visibility of the
Department’s data release efforts, but there are existing channels that people
use to seek information and those are the ones that we should be promoting and
providing the data through.
DR. CARR: Great.
DR. VAUGHAN: I think there have been a lot of really great comments. I love
the idea of trying to understand who the community is and does it make a
difference to them or how can we help it make a difference to them and make it
And to really expand the user base of the public use data sets, let alone
give more texture and bring a finer grain of data. I think that certainly
making it easier to visualize the data is one way, and there’s lots of good
choices in that.
But I also think that we miss a lot of opportunities honestly without going
to the data analysts that exist and asking them if you could, if you weren’t
constrained, how would you want people to use your data. And I think we don’t
ask that question and there’s a lot of wisdom there and a lot of really great
ideas. So to understand that we have a lot more great ideas out there, that we
should also be trying to use.
DR. GREEN: I want to make two sets of comments, one is about process and one
is about the work. The process thing reminds me of earlier discussions in the
last 36 hours. We have such uneven experiences with the technology that’s now
being used with already liberated data, and I really heard you saying one thing
that could help us just work together in a better process if we had a shared
understanding of how this goes, and then you proposed a tactic for doing that.
And this connected back to something you said, Justine, earlier in the last
couple of days about the groups still in search of a common language so you can
just talk to each other. This keeps surfacing in the process stuff. It looks to
me like you’re making good progress, but I understand what you meant better now
than I used to, and it’s still an issue.
My comments about the group go like this. It’s going to be an echo of Matt
and Chris. Where this anchors in the work of NCVHS is how can information
technologies help communities to be learning health systems. I mean I think
that’s still the overarching question that runs through, that’s the river
running through it all. And I would remind you that that report that Susan
wrote, one of the key things is we have a missing infrastructure, and what
we’re observing is that there are new infrastructures that are emergent, that
are known to a few but not known to many, that they're sort of nascent.
We're noticing that standards are often missing; they're sort of making up the
standards as they go along. We're not sure whether they're good standards, are
they useful standards, do they enhance the product, what's the deal? We just
don't understand that much.
But you can see how it cuts across the groups of NCVHS pretty readily.
Another thing, this stewardship framework, we do not have data stewards. And
your conversation just called it out in spades. Who’s the data steward for
Hartford, Connecticut? Particularly given the liberated data and the different
people who are using it, who we don't know who they are or what they're doing
with it or what they're going to mash it up with, and that sort of stuff.
And this is where I’m probably just going to stake out a little personal
territory. We’re missing a workforce for this. The workforce we’ve got has a
job on their hands to transition to the workforce we need, and that looks like
very fertile territory to me in terms of advising the department.
I want to end with two metaphors that you might not find very useful. A lot
of the work that I saw going on here today calls out the difference between
knowing the map and knowing the terrain. You can map the terrain, and you can
know everything about the map, and you can tell someone to go up here and turn
right on road 13013, but if you know the terrain, you say when you get to this
farmhouse that has a mailbox on it that looks like it should have been torn
down 35 years ago, don’t go any further. That’s knowing the terrain. And we’re
further on in making the maps than we are at understanding the terrain.
Another one goes back to being a doctor again. My life as a doctor since the
internet was invented has changed. People walk in all the time with data. They
walk in with maps. They walk in with comparisons. They walk in with tables. You
know what they don't have a clue about? They have no idea what they mean,
particularly to them. What these data mean to a particular community
requires contextualization and local knowledge.
So if this group can help understand what the technology can and can't do,
the full committee, I think, is looking to help move that forward into
stewardship frameworks of what's needed to make it work, what can be done to
enable it so that the people aren't just reading the map but they're actually
making a difference in the terrain by doing it.
MS. QUEEN: I have a whole bunch of conflicting thoughts, but the first one
is that all of the agencies have health data leads for the health data
initiative. So there is a source, a listing of what's currently out there,
that has been made available from HHS.
So I’ve just been going to the HHS website and trying to figure it out from
there. And the health data initiative also has an indication of the granularity
of the publicly available released data in terms of geography, whether it's
county – most of them aren't county – but there is an available
listing that we could use to give us an idea of what’s already out there.
I personally am going to be compelled to go to a couple of the things that
we’ve looked at today to get more information on what is being used there and
then also just checking with our chief technology officer regarding some of the
current challenges or the ones that have already passed, but what's been done so far.
DR. CARR: Our discovery continues. I think, from everything that was said
today, this is a time of enormous change, with an asynchrony between the
available data and information and the skills of knowing how to use it. And I
think that if it feels like it's hard to come up with a simple answer, it's
because of the enormity of it.
MS. QUEEN: I think we may also want to hear from NIH about their
public/private partnership on their big data project since that is an
initiative that was announced earlier this year, the White House initiative
with big data. So they've managed to put their stuff in the cloud; we should
find out more.
DR. CARR: I guess it is harder than we thought to have five next steps,
deliverables. But I think we still have a little bit of confusion I think, with
regard to what are the boundaries of reaction versus recommendation, and it’s
just that it’s new territory. So I think all that’s been said today, and I hope
that we can actually get the transcript, or Susan Kanaan is taking some notes,
and get it out and try to frame it.
There are multiple directions here. Do we want to choose one direction, kind
of go a little deeper on that and come out with something, while planning for
the next direction? Clearly the intersection of the work of the full committee
on communities and the opportunity that is in front of us here with the data
liberation, is important to marry up and we can perhaps take this to the
Executive Subcommittee. I know that we have some data, are we going to get a
presentation at 4:30 about the CDC data? Is someone on the other line?
MR. BUELLER: Hello. Hi, this is Jim Bueller from CDC, I just joined the call
a few minutes ago.
DR. CARR: Thank you for joining us, Jim. We have mapped out about a half
hour, does that work for you? 4:30 to 5:00 to take us through it?
MR. BUELLER: Sure.
DR. CARR: Are there slides?
MR. BUELLER: No.
DR. CARR: The floor is yours.
MR. BUELLER: Thank you. I’m Jim Bueller, I direct the public health
surveillance informatics program at the CDC. Just briefly, we run several of
the large surveillance systems that are associated with the CDC, including the
notifiable disease system, which is based on reporting that occurs within
states. The states volunteer to share that with the CDC: conditions are by law
reportable in states, and then the states agree on which subsets of those
they'll all share with us. Those data are updated weekly.
We run the BioSense system, BioSense 2.0, which is a large syndromic system
that keeps track of patterns of disease seen largely in hospital emergency
departments on a daily basis, and we run the behavioral risk factor
surveillance system, which is a telephone survey that all states conduct, to
track trends in a variety of health risk behaviors or other health care use
behaviors. And again, we aggregate that at a national level, and the survey is
conducted by states, we support them in doing that.
In addition we provide a variety of informatics services that support the
infrastructure of public health surveillance. But broadly beyond
that we are a place at CDC for addressing cross-cutting issues. So while we run
three surveillance systems that’s a small fraction of the over 100 or so
different surveillance systems that are managed by CDC programs, and the vast
majority of those are run by different programs.
In essence, if you think of your patient as a population within your
jurisdiction, whether that is local or state or national level, the
surveillance simply is what we do to keep track of the health of our patient or
that population. And the ways that that gets done are as varied as the spectrum of
things that are of concern to public health, from traditional infectious
diseases to injuries, to chronic diseases to maternal and child health,
occupational health, et cetera.
Whatever the types of issues that CDC programs are addressing, there is
typically some form of surveillance to go with it, which means that they are
run by experts in particular diseases or conditions, as part of individual
programs, and geared to meet those needs.
And I thought I would just give an example of several different types of
surveillance activities to give you a sense of the flavor of that. So for
example with the notifiable disease surveillance, as I mentioned all states
require that doctors or laboratories and others, report certain diseases. Most
of those are infectious diseases, although they also use that authority for
reporting for things like cancer registries or birth defects registries or
maternal mortality. But for the national system, it’s focused on infectious
diseases. The states get together and agree which ones should be nationally
notifiable, and then they agree to share that data with CDC. They come in weekly.
But in addition to that there are a variety of other systems that complement
that. So for example a system called PulseNet, for certain infectious diseases
the states may require or they give clinical laboratories the option of
submitting isolates of certain bacterial infections to the state public health laboratory.
State public health laboratories then do a DNA fingerprint, a gel
electrophoresis, that characterizes the particular strains at a very specific
level, and then those data are reviewed at a state level and also shared with
CDC. And that provides a way of finding disease outbreaks that are associated
with products, typically food products, that are sold in multiple states that
might be resulting in disease that at a given state or locality level, would
never, at least initially, be recognized as enough of a change to be out of the
ordinary but when you pull that data together from multiple states and look at
the very highly specific DNA fingerprint, you start to see that something is
unusual. So that’s how a number of the fairly high profile multi-state
outbreaks have been detected in recent years.
Another complementary system to that is something called the emerging
infections program, which is an even more detailed project working in about 10
different states, where different jurisdictions, either at a local or state
level, are funded to do a very comprehensive effort to find all cases of
certain infections within a specific geographic boundary.
Trained abstractors go in to perform a detailed record abstraction and get
information much more detailed than would be possible through routine reports.
Information on the antibiotic sensitivity, information on a specific anatomic
site, information on other clinical aspects of the illness that would be beyond
the scope of what’s routinely collected.
And I could give a few other examples, but the point there is that for a
given condition there may actually be a mosaic approach: one approach
operating on a broad national level that provides routine data, and then a
supplemental approach that goes beyond that, perhaps on a subset of cases, based
on whether laboratory specimens are available or based on a specific project
that’s funded in a few localities to dig more deeply.
And I think you'll see, if you look across a number of diseases, that
different surveillance systems operate at different levels. Another example
might be tracking the impact of influenza. Influenza is not necessarily a
reportable condition, and yet states keep track, during flu seasons, of the
number of visits for influenza-like illness.
It's a relatively nonspecific definition that might capture people with some
other diseases, but that serves the practical need of tracking when the flu
season has started and whether it is worse or more severe than in other years.
And then deaths attributed to influenza and pneumonia are tracked at another level.
At another level, efforts are made to collect specimens, not at a
comprehensive level but at enough of a level to get a sense of whether
circulating flu strains line up with the vaccine in a given year, whether
they're sensitive or resistant to different antiviral drugs, et cetera. You see
a similar approach in the chronic disease arena, where there may be a need for
data that are updated less frequently, even on an annual basis, perhaps enough
to keep track from one year to another.
And more and more you see that many systems don't involve the primary
collection of data but draw on data that are collected by others. For example,
diabetes surveillance tracks diagnoses and different outcomes of diabetes,
drawing on a number of the NCHS surveys or information systems, vital records,
national health surveys, NHANES, et cetera, to form their mosaic.
I’ve been talking so far about information that is generated or arises
because people have sought health care in one form or another. Another approach
is to do surveys. And as I mentioned, many programs at CDC draw on some of the
NCHS surveys.
There are also some surveys that are sponsored directly by other parts of
CDC. The Behavioral Risk Factor Surveillance System (BRFSS), for example, has
been running for about 25 years. We fund the states to conduct it; each state
conducts the survey itself, using a standardized set of questions, and then
they have the option of adding any of a number of different modules that could
be employed to look at different issues. And they have the option of adding
some questions of their own individual state interest as well.
But it differs from the NCHS surveys, which draw upon a national sampling
frame. These surveys are conducted by each state, and many times they are
actually able to look at sub-state levels, and they are then used extensively
by states to manage their chronic disease prevention activities.
So there are lots of different approaches to surveillance, lots of different
systems. They operate on a variety of timeframes, at different levels of
detail, different levels of geographic coverage. You can collect a little bit
of information about a lot of people or you can collect a lot of information
about a few people, and you have a trade-off: whether you want it to be timely
or complete, whether you have a lot of money to spend or just a little money to
spend, et cetera.
But perhaps I can just stop there and see if there are questions and make
sure that I’m giving you the perspective that you’re looking for.
DR. CARR: Thank you. Susan?
MS. QUEEN: Hi. This is Susan Queen. I was just wondering, are any of the
surveillance data made available to the public?
MR. BUELLER: Yes, it varies. There are a variety of considerations. So for
example, with the BRFSS, you can go (teleconference operator interruption).
Some programs at CDC have the resources to invest in preparation of the public
access database. It varies in terms of what’s available. Part of the process of
surveillance is providing information back to people so they’re all producing
reports in a variety of formats and ways, but they vary in terms of whether
they are public access databases.
And within the public access databases we have to be mindful of what the
level of detail that’s provided to minimize the likelihood that an individual
patient could be identified.
We also have to be respectful of whatever concerns the states may have; they
may prefer that people go to them, rather than to us, for information about a
particular state. There are some instances where there may be data use
agreements within a state, between a health department and a hospital that
provides data. It is just a variety of considerations that go into it.
DR. COHEN: Does CDC, itself, provide de-identified individual level data for
research or public use?
MR. BUELLER: In some instances, yes.
DR. COHEN: Would the ILI surveillance data be available?
MR. BUELLER: That is an example where we don't even get it at the individual
level. We get from individual providers the percentage of patients that they're
seeing that have ILI. It varies from program to program what level of access or
availability they have.
I think one of the issues is really what resources it takes to prepare and
document a public access database. That is going to vary from program to
program. There is a fair amount that is available, but certainly not all of it.
I mean there are also instances when a public access database may be
insufficient for a particular researcher, and there are precedents of
researchers working directly with an individual program within various
agreements of what would or wouldn’t be done with the data. I think you can
appreciate the importance of sensitivities about the confidentiality that
surrounds many, not all, systems.
DR. COHEN: Is there a summary by surveillance set of what’s available and
what variables and at what geographic level?
MR. BUELLER: We actually have an inventory of surveillance activities at
CDC. Right now that's an internal resource; it's most immediately available to
people within state and local health departments, but we are working to make it
more broadly available.
And that does include information about what the URL on the CDC webpage is
to go and get more information about that. But it’s going to be highly variable
from one system to another.
PARTICIPANT: Just one quick question. Just curious Jim, what are your most
popular, most used data sets?
MR. BUELLER: That is a good question. We run something called WONDER, which
is Wide-ranging Online Data for Epidemiologic Research. There's a fair amount
of NCHS data there, there's any number of different systems from CDC, and you
can access BRFSS data there.
I think some of the notifiable disease data are there, and there's census data
there that you can use. I would venture a bet that the BRFSS is probably one of
the most heavily used. If you ever want to dig into it, just go to
CDC.gov/BRFSS and there's a tremendous amount of information that you can get.
I know that AIDS has maintained a public access database. I know that from
having worked in AIDS in the past. I don’t work there now, but I would presume
that that's maintained. But there was a lot of interest in that for a number of years.
Obviously the NCHS systems are very, very heavily used, and I think it's fair
to say that NCHS would say that they don't operate surveillance systems, but
their data are used for surveillance and for many other purposes as well.
They’re really geared and built up to provide public access databases, so
they’re very heavily used.
DR. FRANCIS: I have only seen proposals for surveillance systems that are
operated on a distributed query basis. I just wanted to ask whether the sort of
standard model is that you collect the data or whether there are considerations
or discussions of distributed query surveillance structures?
MR. BUELLER: So just in case others aren't familiar with that concept, the
notion of a distributed system is that the data sit behind each owner's
firewall, and when you have a question you develop a query; the data owners
have agreed to hold the data in a common format so that you can craft the
query, run it against each of the data owners, and bring back an aggregate
report.
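The pattern just described, record-level data staying behind each owner's firewall with only aggregate counts crossing it, can be sketched like this; the class and function names are hypothetical, purely for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a distributed query: each data owner keeps
# record-level data behind its own "firewall" and answers queries
# only with aggregate counts.

@dataclass
class DataOwner:
    name: str
    records: list = field(default_factory=list)  # record-level data never leaves

    def answer(self, predicate):
        # Only an aggregate count crosses the firewall, not the records.
        return sum(1 for r in self.records if predicate(r))

def distributed_query(owners, predicate):
    # The coordinator sends the same query to every owner and
    # combines the aggregate answers into one report.
    return {owner.name: owner.answer(predicate) for owner in owners}

# Example: count influenza-like-illness visits held by two owners.
state_a = DataOwner("State A", [{"syndrome": "ILI"}, {"syndrome": "GI"}])
state_b = DataOwner("State B", [{"syndrome": "ILI"}, {"syndrome": "ILI"}])

report = distributed_query([state_a, state_b], lambda r: r["syndrome"] == "ILI")
# report -> {"State A": 1, "State B": 2}
```

The inflexibility Dr. Buehler mentions shows up here too: any follow-up question requires crafting and distributing a new predicate and waiting for the owners to answer, whereas with individual-level data in hand the analyst could just re-query locally.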
We tried something like that. There is actually a system called Distribute
that was a grassroots system, developed in the 1990s. When H1N1 hit, we put a
lot of effort into scaling that up. Basically, rather than getting
individual-level data, we asked the states that had syndromic systems to
provide aggregate counts on a handful of variables.
And that worked reasonably well up to a point. I know that it parallels, in
a small way, many of the discussions I've heard at FDA around something they've
got called Sentinel, or things like the HMO Research Network. It takes a lot of
work to understand what you're getting when you can't get it at the individual
level, and it's much less flexible, particularly if something happens and you
need to query the data in a way that you hadn't done before.
It’s much easier if you have that individual level data, particularly when
you see something and then the first thing that happens is you start to get
five more questions, and to be able to answer those questions is much easier at
an individual level.
But with that said, there is a lot of precedent at CDC. Many of these
things, like the HMO Research Network, which is the forerunner of the FDA
Sentinel, were based on a project that really came out of CDC, the Vaccine
Safety Datalink, which is based on this notion of a distributed approach. The
VSD, Vaccine Safety Datalink, is a very successful project, but like any
approach, it has advantages and disadvantages.
I know it's one that ONC, the Office of the National Coordinator for Health
Information Technology, is very keen on supporting; they've got a Query Health
project, and members of our staff have been involved in helping them think
about that. So it's one approach, we do think about it, and it has strengths
and weaknesses.
DR. CARR: Thanks. That was very helpful, very informative, we appreciate it
very much. And you’re welcome to stay on the line, we’re just going to have a
couple of concluding comments from today’s meeting. Thank you for joining us.
DR. CARR: I want to try to pull together what we’ve covered today, in the
last three hours, and put it out there for your consideration and ask for any
suggestions or other issues.
I would say where we are today is that we can say it is good that HHS has
liberated the data. We all agree that that availability is a tremendous
opportunity. Second, we have seen interesting examples from challenges from the
Datapalooza, demonstrating new observations that can be drawn from the data
alone, or from merging data from multiple sets.
We also observe that the use of the data, at least based on the hits on the
websites that we've seen today, seems modest and could be higher, and that the
challenges toward higher use are three. One is knowledge of the data's
availability. The second may be the usability of the data, the ease of using
it. And the third might also take into account formulation of priority issues
that ought to be addressed that would drive someone to those data sets.
One other observation is that there is a very strong intersection between
the issues we’ve been addressing in the working group and the full committee
focus on empowering communities to access and use data, and we’re going to make
sure that everybody has a copy of this report.
But just to read briefly from the report and the executive summary of what,
based on the hearings that we had, the communities said they needed: a key need
was infrastructure to provide support, facilitate shared learning, and create
economies of scale.
And specifically, they felt important components of the infrastructure
include a privacy and security framework to guide communities in using local
data. A standardized set of community health indicators. Training and technical
assistance, to improve data access, management, and analysis methods and
competency. Better data visualization tools and skills, something we talked
about today. Support and external facilitation to strengthen local financial
human resources, including those for coalition development. Guidance on
achieving data-informed improvement through effective leadership. And
mechanisms to enable communities to share knowledge and information and stay
abreast of federal and state resources and activities.
And then in the section on envisioning the federal role, there are a number
of recommendations. I'll just read six of them. NCVHS has identified ways in
which the federal government can support the development and functioning of
community-based or community-oriented learning systems.
One is to facilitate and provide resources to strengthen communities'
capacity to collect data. Two, drawing on the Health Indicators Warehouse,
continue to identify and encourage adoption of standardized community health
indicators.
Three, provide local communities with local data on environmental resource
factors, including economic, housing, transportation, and education data that
are routinely generated by state and federal entities. Four, promote
development and use of federal and state web-based query systems to provide
small area data, easy analytics, and visualization capabilities.
Five, expand technical assistance and mentoring of communities in survey
design, data collection, data analysis, et cetera. And six, convene a summit of
local communities to share what they're doing and enumerate a set of barriers
that affect all communities working to improve local health.
So clearly there’s a tremendous intersection about what we’ve been talking
about today, how to address some of these issues, and these are the issues that
we heard in the hearing. So I'd like to suggest that when we meet again,
whether it's in person or by phone – and we will certainly be meeting at the
next NCVHS meeting in February – March 1st, okay.
And at that time, in order for us to be a reactor panel, I think two things
might be helpful: one is to get an update from HHS on what they are already
contemplating, what they have considered and accomplished or considered and
rejected, and why. And then actually to have some of those data folks from HHS
come to our next meeting so we can have a dialogue. Because I
think that’s the way that we can communicate quickly as we did when Todd
convened the committee and we just had a meeting, there were no
recommendations, we gave reactions and went from there. So I think that’s a
data reactor kind of venue.
And I think some of the things that we’ve covered today, and we’ll explore
more what Lee was suggesting we could bring to that group. And then I think we
need to probably marry up the work that could be done by this group and the
work of the full committee on empowering communities; as we said, take a copy
of this report, because I think it very much addresses the issues we have.
So with that summary I'd like to invite any additional comments.
DR. COHEN: The one additional piece would be I’d like us to explore
additional web access technologies to learn more about those that aren’t
necessarily HHS or government-oriented, that use health data or other vehicles
that haven’t used health data yet.
DR. CARR: Now, I think what I am going to do is confer with Marjorie and Jim
– oh, Vickie.
DR. MAYS: I was just going to ask, is the committee interested in other
vehicles besides the web? Because there are partnerships that people are doing
with, like, Wal-Mart and Walgreens, where they have video screens and they're
looking for health messages. There is a big movement right now in the drug
stores, and they're looking to get health information into them, so I just want
to put it on your radar.
DR. CARR: My thought would be that we stay grounded in what our charge was,
and I think that what is prominent in it, as I read it today and read it
before, is the feedback to HHS on what they can do to get this out. As I heard
it, the charge is grounded in the data, the usability of the data, the
knowledge about the data, and kind of priority issues.
So what you suggest might be something that we take up more in the community
data initiative. But I think for now we’re still trying to get our hands around
getting the feedback configured properly to HHS.
DR. VAUGHAN: Susan just whispered in my ear, and it amplifies what Vickie
was just saying, but it's an excellent point: looking at that as maybe part of
a larger set of what's called mHealth or mobile health, kind of reaching folks
with good health information, including data, through these alternative systems.
DR. CARR: Is it pushing out data?
DR. VAUGHAN: Yes. One of the most interesting, for me, is Text4baby, which
pushes out pregnancy wellness data based on gestation. That has been immensely
successful and very low cost, and is being replicated across the agency and in
other instances. So it doesn't always have to be something that's expensive or
fancy; sometimes it's just going to where the community is, finding what they
can use, and putting it in a framework in which they can use it.
DR. GREEN: Justine, a question about scope again, and staying in the charge.
Where do devices like asthma inhalers with geospatial tracking fit into the
charge, or is it really just the internet?
DR. CARR: I know Jim's priority is getting the feedback on the HHS data. I
think that the asthma inhalers and so on, I mean obviously we're going to
intersect; in fact, the call about the Datapalooza is going on right now.
But I think this committee, even though we're asynchronous and not quite
where we want to be, is still on the learning curve, and I think the amount of
learning that we've had these last couple of sessions can inform that. And
perhaps we ought to be thinking, for the Datapalooza, back to Josh's point,
about how you incentivize community data specifically, or something like that.
I think what I'm going to do is this: Susan Queen and I will arrange a
conference call and/or webinar, certainly before February, probably in early
January. The other thing is we'll get the summary of this meeting out to
everyone and invite input from the folks who were unable to be here today.
So I think with that we’ll conclude and adjourn, and I thank you all for
coming and wish you a safe travel home. Thank you.
(Whereupon, the meeting adjourned at 5:10)