Go Go Health Gadgets: achieving the quantified self

June 23, 2011

My day job involves supporting observational studies on retrospective data sources.  We try to monitor what is happening in the healthcare industry at a very macro level using various sources of health information, with varying degrees of reliability and, of course, success.  The data sources available come from third parties and have significant limitations, but they are what we have to work with today.  I gave a talk not long ago at Rutgers University that touched on privacy issues surrounding these sources of health information, and many people voiced concerns about the confidentiality of their health information, rightfully so.  I just wish people were as passionate about collecting their health information as they are about protecting its privacy.

Today few people have a personal health record in any form, let alone electronic.  To me that is really disappointing, because given the good work we are able to do to improve the human condition through the analysis of broad but limited data sets, I can only imagine the potential benefits of complete, personal health information.  Personally, I've started doing what I can to get more involved with my own health and the health of my family.  I'm now the proud owner of a FitBit and a Withings scale.  The FitBit tracks my daily activity levels and nightly sleep schedule, while the Withings scale monitors my weight, lean mass and fat mass.  My next purchase will likely be a Polar heart rate monitor for when I exercise (because FitBit is telling me that I need to do more of that...).  The great thing about each of these products is that they sync wirelessly through a computer and upload the information into their own respective analysis tools as well as into a personal health record like Google Health or Microsoft HealthVault.  It's really just the tip of the iceberg in terms of personal monitoring.  I've embedded a TEDTalk below, given by Daniel Kraft in April 2011, in which he goes into depth on just how far and how fast the landscape for health information and its use is likely to change.

Once you're done watching Daniel's talk and are inspired to get more involved with your health, hit up the links below to get your gadgets. Since, you know, we all like gadgets.

After you're done shopping I encourage you to check out Technology Review to get a sneak peek at a future device that shows a lot of promise, and at an emerging conference dedicated to those who are trying to live the measured life and achieve a quantified self.

Loading the Heritage Health Prize Data

April 5, 2011

As I have significant interests in Public Health, Informatics, Data Integration and the like, the Heritage Health Prize is a great opportunity to participate in a competition that leverages all of those skills.  The three million dollar prize isn't too shabby either.  I don't have much confidence that I could win it; let's face it, the world's best will be plugging away at this one, but that's not what's important here.  I'm using this competition as an opportunity to test out my custom-built ETL libraries and integrate some new statistics libraries I'm evaluating.  It will be fun.

In fact, it's been fun already.  In about 20 minutes I was able to develop a data import utility and complete a full load of the first release of the Heritage sample data.  I even had time to make it look nice.

A few people have expressed interest in my new ETL library, and while it's not ready for release yet, I can give you a glimpse of it here in the context of this data import utility.  The data released for this challenge will eventually include multiple files containing sample medical claims data.  Here is the code I used to load them:
// load claims data
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };

string[] claimFiles = Directory.GetFiles(textSourceFileDirectory.Text, "Claims*.csv");
foreach (string filename in claimFiles)
{
	GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
		new TextFileSource(filename, 1),
		new DelimitedTextParser(',')
	);

	SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");

	Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
	DatapipeResults claimResults = claimPipe.Process();
	Log("Claim " + claimResults.Summary);
}
So then... let's break it down.
I start off by creating all the transformations involved in processing these files.  I do it outside the loading loop for performance reasons; there's no reason to recreate them for each file since the files all have the same structure (in fact, right now there is only one claims file, but that will change in the future).
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };
You'll notice two of the transforms I have available in this example.  The first is the InsertColumn transform, which does what you would expect: it inserts a column into the record at the specified index.  Here I'm inserting a column at position 0 to create a system identifier in my database.
The second transform I'm using here is the ManipulateColumns transform.  It started out as a collection of separate transforms for data type conversions, but I quickly found that my transform library would have ended up with far too many different converters, and they still would never cover the need for a really specific custom data manipulation.  That led me to create the ManipulateColumns class.  It has a few overloads, but basically it allows you to specify a set of columns and a Func<object, object> delegate.  That way you can create whatever transformation is necessary and pass it in to be applied to those columns of the record.  I then pack the transformations into a List (any IEnumerable would work) to pass into my Datapipe, which we'll get to later.
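The SafeIntConvert and SafeBigIntConvert helpers referenced above aren't shown in this post; they just need to satisfy the Func<object, object> signature that ManipulateColumns expects.  A minimal, purely illustrative sketch (what to return for unparseable input is an assumption and would depend on what the destination expects):
// Hypothetical sketch; not the library's actual implementation.
private object SafeIntConvert(object value)
{
	int result;
	return int.TryParse(Convert.ToString(value), out result) ? (object)result : DBNull.Value;
}

private object SafeBigIntConvert(object value)
{
	long result;
	return long.TryParse(Convert.ToString(value), out result) ? (object)result : DBNull.Value;
}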
The foreach loop is just grabbing all the files matching the filter, so that's enough about that.  Let's get to what is inside the loop.
GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
	new TextFileSource(filename, 1),
	new DelimitedTextParser(',')
);
So then, a bit of background.  When I first wrote my ETL library, it was awful.  I coupled the source and the parsing of the source far too tightly.  In my current version I have raw sources and parsers.  The raw sources are the raw input to the ETL process.  They can come from tables, files, even URLs right now.  They implement a generic interface that gives the library the opportunity to use nearly anything as a data source, but that's a topic for another day.  The parsers are responsible for taking raw input and converting it into a Record.  A Record is a very simple class that I use to standardize data moving through the pipeline.  So, when you put together a raw source and a parser you get a RecordSource.
I have a few different implementations of raw sources and parsers.  In this example I'm using a TextFileSource, which takes a URI to the file and, as a second parameter, the number of rows of data to skip before beginning processing.  The parser in this case is a DelimitedTextParser, which breaks the raw input down into columns using a delimiter.  This file is comma delimited, so I pass in the proper character.
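To make the raw source / parser split a bit more concrete, here is a purely illustrative sketch of the shape described above; the interface names and signatures are placeholders, not the library's real API:
// Illustrative only: these names are assumptions about the design described in the text.
using System.Collections.Generic;

public class Record                          // simple container standardizing data in the pipeline
{
	public List<object> Columns { get; set; }
}

public interface IRawSource<TRaw>            // raw input: text file, database table, URL, ...
{
	IEnumerable<TRaw> ReadRaw();
}

public interface IParser<TRaw>               // converts one raw item into a Record
{
	Record Parse(TRaw raw);
}

// A record source is then just a raw source paired with a parser, e.g.
// TextFileSource as an IRawSource<string> plus DelimitedTextParser as an IParser<string>.
public class GenericRecordSource<TRaw>
{
	private readonly IRawSource<TRaw> rawSource;
	private readonly IParser<TRaw> parser;

	public GenericRecordSource(IRawSource<TRaw> rawSource, IParser<TRaw> parser)
	{
		this.rawSource = rawSource;
		this.parser = parser;
	}

	public IEnumerable<Record> GetRecords()
	{
		foreach (TRaw raw in rawSource.ReadRaw())
			yield return parser.Parse(raw);
	}
}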
SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");
This destination is very straightforward: it's a table in a SQL Server database.  The parameters are the connection string and the target table to load into.  And finally...
Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
DatapipeResults claimResults = claimPipe.Process();
The Datapipe is where the work happens.  Here I initialize my Datapipe with the source, destination and transforms I just detailed.  Then I call Process, capturing the results in a DatapipeResults object, which provides information on the number of rows processed, time elapsed, etc.
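Conceptually, Process streams each record from the source, runs it through the transformation list, writes it to the destination, and keeps counts and timings along the way.  The internals aren't shown in this post, so the member and property names in this rough sketch are assumptions:
// Rough sketch only; GetRecords, Apply, Write and the DatapipeResults properties are assumed names.
public DatapipeResults Process()
{
	var stopwatch = System.Diagnostics.Stopwatch.StartNew();
	int rowCount = 0;

	foreach (Record record in source.GetRecords())
	{
		Record current = record;
		foreach (ITransformation transform in transforms)	// e.g. InsertColumn, then ManipulateColumns
			current = transform.Apply(current);

		destination.Write(current);
		rowCount++;
	}

	stopwatch.Stop();
	return new DatapipeResults { RowsProcessed = rowCount, TimeElapsed = stopwatch.Elapsed };
}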
As usual, hit me up via email or in the comments if you have any questions.

Data Visualization from Hans Rosling

March 21, 2011

Today I was getting my fix from TED and ended up watching a recent talk by Hans Rosling on the magic washing machine. It was an excellent talk, as are many of Hans's, including an old favorite that I was reminded of while watching this one.

Back in 2009 Hans gave another talk called Let my dataset change your mindset, where he discussed public health and visualization techniques, two topics that I'm passionate about. Most of the visualizations in his talk are created with the Gapminder software, which was acquired by Google back in 2007. Gapminder is excellent at showing how data changes over time.

Take a look at the talk, and if you share my excitement about the visualization, be happy to know that it lives on both at Gapminder.com and as part of Google's Public Data Explorer. You can even leverage the visualization technique yourself by using Google's motion chart.  And as you get captivated by Hans, you'll likely also enjoy this earlier talk where he shows the best statistics you've ever seen.

About the author

Frank DeFalco has worked as a software developer and informatics professional since 1996.  He is currently employed by Johnson & Johnson, where he manages the Health Informatics Center of Excellence.

 

Reach me via email at frank (at) thedefalcos (dot) com.

 

Follow me on twitter @fjdefalco