Loading the Heritage Health Prize Data

April 5, 2011
As I have significant interests in Public Health, Informatics, Data Integration and the like, the Heritage Health Prize is a great opportunity to participate in a competition that leverages all of those skills.  The $3 million prize isn't too shabby either.  I don't have much confidence that I could win it; let's face it, the world's best will be plugging away at this one, but that's not what's important here.  I'm using this competition as an opportunity to test out my custom-built ETL libraries and to integrate some new statistics libraries I'm evaluating.  It will be fun.
In fact, it's been fun already.  In about 20 minutes I was able to develop a data import utility and complete a full load of the first release of the Heritage sample data.  I even had time to make it look nice.
A few people have expressed interest in my new ETL library, and while it's not ready for release yet, I can give you a glimpse of it here in the context of this data import utility.  The data released for this challenge will eventually include multiple files containing sample medical claims data.  Here is the code I used to load them:
// load claims data
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };

string[] claimFiles = Directory.GetFiles(textSourceFileDirectory.Text, "Claims*.csv");
foreach (string filename in claimFiles)
{
	GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
		new TextFileSource(filename, 1),
		new DelimitedTextParser(',')
	);

	SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");

	Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
	DatapipeResults claimResults = claimPipe.Process();
	Log("Claim " + claimResults.Summary);
}
So then... let's break it down.
I start off by creating all the transformations involved in processing this file.  I do it outside the loading loop for performance reasons; there's no reason to recreate them for each file, since the files all share the same structure (in fact, right now there is only one claims file, but that will change in the future).
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };
You'll notice two of the transforms I have available in this example.  The first is the InsertColumn transform, which does what you would expect: it inserts a column into the record at the specified index.  Here I'm inserting a column at position 0 to create a system identifier in my database.
The second transform I'm using here is the ManipulateColumns transform.  It started out as a collection of separate transforms for data type conversions, but I quickly found that approach would have meant far too many individual converters, and they would never cover the need for a really specific custom data manipulation.  That led me to create the ManipulateColumns class.  It has a few overloads, but basically it allows you to specify a set of columns and a Func<object, object> delegate.  That way you can create whatever transformation is necessary and pass it in to apply to the columns of the record.  I then pack the transformations into a List (any IEnumerable would work) to pass into my Datapipe, which we'll get to later.
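The SafeIntConvert and SafeBigIntConvert manipulations referenced in the loading code aren't shown in this post, but to make the idea concrete, a converter you might hand to AddManipulations could look something like the sketch below.  Treat these bodies as illustrative rather than the actual implementations: they simply try to parse the raw string and fall back to DBNull when the value is empty or malformed.
// Illustrative sketches only -- not the actual SafeIntConvert / SafeBigIntConvert code.
// Each one matches the Func<object, object> shape that ManipulateColumns expects.
private object SafeIntConvert(object input)
{
    int value;
    return int.TryParse(Convert.ToString(input), out value) ? (object)value : DBNull.Value;
}

private object SafeBigIntConvert(object input)
{
    long value;
    return long.TryParse(Convert.ToString(input), out value) ? (object)value : DBNull.Value;
}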
The foreach loop is just grabbing all the files matching the filter, so that's enough about that.  Let's get to what is inside the loop.
GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
	new TextFileSource(filename, 1),
	new DelimitedTextParser(',')
);
So then, a bit of background.  When I first wrote my ETL library, it was awful.  I coupled the source and the parsing of the source way too tightly.  In my current version I have raw sources and parsers.  The raw sources are the raw input to the ETL process; they can come from tables, files, even URLs right now.  They implement a generic interface that gives the library the opportunity to use practically anything as a data source, but that's a topic for another day.  The parsers are responsible for taking raw input and converting it into a Record.  Records are a simple class I use to standardize data moving through the pipeline.  When you put together a raw source and a parser, you get a RecordSource.  I have a few different implementations of raw sources and parsers.  In this example I'm using a TextFileSource, which takes a URI to the file and, as a second parameter, the number of rows of data to skip before beginning processing.  The parser in this case is a DelimitedTextParser, which breaks the raw input into columns using a delimiter.  This file is comma delimited, so I pass in the proper character.
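To picture how those pieces fit together, here is a rough sketch of the shape such a design might take.  The interface and property names below are purely illustrative; the library isn't released yet, so don't read them as its actual API.
// Illustrative shapes only -- the real interfaces in the library may differ.
// A raw source yields raw items (lines of a file, rows of a table, content from a URL);
// a parser turns each raw item into a Record; a RecordSource composes the two.
using System.Collections.Generic;

public interface IRawSource<T>
{
    IEnumerable<T> GetRawItems();   // e.g. lines of text, one per record
}

public interface IParser<T>
{
    Record Parse(T rawItem);        // turn one raw item into a standardized Record
}

public interface ITransformation
{
    Record Apply(Record record);    // e.g. insert a column, convert a data type
}

// A Record is just a simple, standardized container for one row's columns.
public class Record
{
    public List<object> Columns { get; private set; }

    public Record(IEnumerable<object> columns)
    {
        Columns = new List<object>(columns);
    }
}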
SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");
This destination is very straightforward: it's a table in a SQL Server database.  The parameters are the connection string and the target table to load into.  And finally...
Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
DatapipeResults claimResults = claimPipe.Process();
The Datapipe is where the work happens.  Here I initialize my Datapipe with the source, destination, and transforms I just detailed.  Then I call Process, capturing the results in a DatapipeResults object, which provides information on the number of rows processed, time elapsed, and so on.
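Conceptually, what happens inside Process boils down to a loop like the sketch below, which reuses the illustrative Record and ITransformation shapes from earlier.  Again, this is not the library's actual implementation, just the idea: pull each record from the source, run the transforms in order, write to the destination, and keep some simple statistics along the way.
// Conceptual sketch only -- type and member names here are invented for illustration.
using System;
using System.Collections.Generic;
using System.Diagnostics;

public class DatapipeSketch
{
    private readonly IEnumerable<Record> records;             // output of a RecordSource
    private readonly Action<Record> writeRecord;              // destination, e.g. a SQL insert
    private readonly IEnumerable<ITransformation> transforms; // applied in order to each record

    public DatapipeSketch(IEnumerable<Record> records,
                          Action<Record> writeRecord,
                          IEnumerable<ITransformation> transforms)
    {
        this.records = records;
        this.writeRecord = writeRecord;
        this.transforms = transforms;
    }

    public string Process()
    {
        Stopwatch timer = Stopwatch.StartNew();
        int rowCount = 0;

        foreach (Record record in records)
        {
            Record current = record;
            foreach (ITransformation transform in transforms)
            {
                current = transform.Apply(current);   // each transform hands back the updated record
            }
            writeRecord(current);
            rowCount++;
        }

        timer.Stop();
        return rowCount + " rows processed in " + timer.Elapsed;
    }
}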
As usual, hit me up via email or in the comments if you have any questions.

About the author

Frank DeFalco has worked as a software developer and informatics professional since 1996.  He is currently employed by Johnson & Johnson, where he manages the Health Informatics Center of Excellence.

 

Reach me via email at frank (at) thedefalcos (dot) com.

 

Follow me on twitter @fjdefalco