Go Go Health Gadgets; achieving the quantified self

June 23, 2011

My day job involves supporting observational studies on retrospective data sources.  We try to monitor what is happening in the healthcare industry at a very macro level using various sources of health information, with varying degrees of reliability and, of course, success.  The data sources available come from third parties and have significant limitations, but they are what we have to work with today.  I gave a talk not so long ago at Rutgers University that touched on privacy issues regarding these sources of health information, and many people voiced their concerns about the confidentiality of their health information, rightfully so.  I just wish people were as passionate about collecting their health information as they are about its privacy.

Today few people have a personal health record in any form, let alone electronic.  To me that is really disappointing, because given the good work we are able to do to improve the human condition through the analysis of broad, limited data sets, I can only imagine the potential benefits of complete, personal health information.  Personally, I've started doing what I can to get more involved with my own health and the health of my family.  I'm now the proud owner of a FitBit and a Withings scale.  The FitBit tracks my daily activity levels and nightly sleep schedule, while the Withings scale monitors my weight, lean mass and fat mass.  My next purchase will likely be the Polar heart rate monitor for when I exercise (because FitBit is telling me that I need to do more of that...).  The great thing about each of these products is that they sync wirelessly through a computer and upload the information into their own respective analysis tools as well as a personal health record like Google Health or Microsoft HealthVault.  It's really just the tip of the iceberg in terms of personal monitoring.  I've embedded a TED Talk below that was given by Daniel Kraft in April 2011, in which he goes in depth on just how far and how fast the landscape for health information and its use is likely to change.

Once you're done watching Daniel's talk and are inspired to start getting more involved with your health, hit up the links below to get your gadgets.  Since, you know, we all like gadgets.

After you're done shopping I encourage you to check out Technology Review to get a sneak peek at a future device that shows a lot of promise and an emerging conference dedicated to those who are trying to live the measured life and achieve a quantified self.

R you connecting to SQL Server?

May 3, 2011
As I've mentioned, some of my pet projects require access to statistics and data mining algorithms.  I have played with a few different approaches but none of them really had everything I was looking for in terms of methods.  I eventually decided to give R a try.  For those who are not familiar, R is a free software environment for statistical computing and graphics that is available for a wide variety of UNIX platforms, Windows and Mac OS.  I had used R in the past, but only on a very limited basis and never against a relational data source, so I was quite surprised by what I found. 
R has a massive following with thousands of contributors to an online library of algorithms.  Need access to a Bayesian algorithm for classification?  No problem.  Want to draw a familial diagram with genetic markers?  Piece of cake.  It's all in there, and there are quite a few nice resources online to help you learn the platform.  One of the best I have found is http://www.statmethods.net/ but there are plenty of others.  Even if what you are looking for doesn't exist in R (unlikely) you can engage with the community to get where you need to go.
Once I determined that R has all the methods I was looking for, the next step was accessing my data from within R.  Sure, you can load flat files directly into R, but my data is all stored in a relational database.  As it turns out, R can handle that too.  Let's get started.
R has a terrific package management system that allows you to select a package repository called a CRAN mirror (there are many) and install additional packages.
 
To connect to a relational database you'll want to grab the RODBC package by selecting 'Install package(s)...' and finding it in the list after selecting a nearby CRAN mirror.  Once it has been installed you can load it into your current session as follows:
library("RODBC")
Now for the fun part.  In the code below I connect to the database and establish a channel that can then be used for running queries.  Once I have a channel I get a listing of tables in the Sales schema of the AdventureWorks sample database.  Finally I run a query to return some data.
channel <- odbcDriverConnect(connection = "Driver={SQL Server Native Client 10.0};Server=localhost;Database=AdventureWorks;Trusted_Connection=yes;")
sqlTables(channel, schema="Sales")
sqlQuery(channel,"select top 10 * from Sales.SalesOrderHeader")
Of course you might not be using SQL Server 2008, or you might be using a different security model.  Regardless, just take a trip over to www.connectionstrings.com, find the ODBC connection string that fits your scenario and plug it into the odbcDriverConnect function.  Now that you can connect R to your relational environment you can take advantage of the statistical and data mining algorithms within, free of charge.

Loading the Heritage Health Prize Data

April 5, 2011
As I have significant interests in Public Health, Informatics, Data Integration and the like, the Heritage Health Prize is a great opportunity to participate in a competition that leverages all those skills.  The $3 million prize isn't too shabby either.  I don't have much confidence that I could win it; let's face it, the world's best will be plugging away at this one, but that's not all that important here.  I'm using this competition as an opportunity to test out my custom built ETL libraries and integrate some new statistics libraries I'm evaluating.  It will be fun. 
In fact, it's been fun already.  In about 20 minutes I was able to develop a data import utility and complete a full load of the first release of the Heritage sample data.  I even had time to make it look nice.

A few people have expressed interest in my new ETL library, and while it's not ready for release yet, I can give you a glimpse of it here in the context of this data import utility.  The data released for this challenge will eventually have multiple files containing sample medical claims data.  Here is the code I used to load them:
// load claims data
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };

string[] claimFiles = Directory.GetFiles(textSourceFileDirectory.Text, "Claims*.csv");
foreach (string filename in claimFiles)
{
	GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
		new TextFileSource(filename, 1),
		new DelimitedTextParser(',')
	);

	SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");

	Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
	DatapipeResults claimResults = claimPipe.Process();
	Log("Claim " + claimResults.Summary);
}
So then... let's break it down.
I start off creating all the transformations involved in processing this file.  I do it outside the loading loop for performance reasons; there's no reason to recreate them for each file since the files are all the same (in fact right now there is only one claim file, but that will change in the future). 
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };
You'll notice two of the transforms I have available in this example.  The first is the InsertColumn transform, which does what you would expect: it inserts a column into the record at the specified index.  Here I'm inserting a column at position 0 to create a system identifier in my database. 
The second transform I'm using here is called the ManipulateColumns transform.  It started out as a collection of different transforms for data type conversions, but I quickly found that my transform library would have needed far too many converters and they would never cover the need for a really specific custom data manipulation.  That led me to create the ManipulateColumns class.  It has a few overloads, but basically it allows you to specify a set of columns and a Func<object, object> delegate.  That way you can create whatever transformation is necessary and pass it in to be applied to the columns of the record.  I then pack the transformations into a List (any IEnumerable would work) to pass into my Datapipe, which we'll get to later.
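The SafeIntConvert and SafeBigIntConvert helpers referenced above aren't shown in this post.  Just to make the idea concrete, here is a hedged sketch of what a manipulation like SafeIntConvert could look like when written as a Func<object, object>; the real implementation may differ, and mapping bad values to null is simply an assumption here.
private object SafeIntConvert(object value)
{
	// Sketch only: the actual SafeIntConvert isn't shown in the post.
	// A manipulation receives a column value and returns the converted value.
	string text = value == null ? null : value.ToString().Trim();
	int result;

	// Treat empty or unparseable values as null so a single bad cell
	// doesn't abort the whole load.
	if (string.IsNullOrEmpty(text) || !int.TryParse(text, out result))
		return null;

	return result;
}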
The foreach loop is just grabbing all the files matching the filter, so that's enough about that.  Let's get to what is inside the loop.
GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
	new TextFileSource(filename, 1),
	new DelimitedTextParser(',')
);
So then, a bit of background.  When I first wrote my ETL library, it was awful.  I coupled the source and the parsing of the source way too tightly.  In my current version I have raw sources and parsers.  The raw sources are the raw input to the ETL process.  They can come from tables, files, even URLs right now.  They implement a generic interface that gives the library the opportunity to use just about anything as a data source, but that's for another day.  The parsers are responsible for taking a raw input and converting it into a Record.  Records are the class I use to standardize data moving through the pipeline, and they are very simple.  So, when you put together a raw source and a parser you get a RecordSource.  I have a few different implementations of raw sources and parsers.  In this example I'm using a TextFileSource, which takes a URI to the file and, as a second parameter, the number of rows of data to skip before beginning processing.  The parser in this case is a DelimitedTextParser, which breaks down the raw input into columns using a delimiter.  This file is comma delimited so I pass in the proper character.
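To make that composition a bit more concrete, here is a stripped-down sketch of how a raw source, a parser and a record source might fit together.  This is illustration only, not the actual library code; the interface and member names below are placeholders, and the real GenericRecordSource<string> shown above is certainly more involved.
using System.Collections.Generic;

// Illustrative sketch only -- these types are placeholders, not the real library.
public class Record
{
	public List<object> Columns = new List<object>();
}

public interface IRawSource<TRaw>
{
	// Yields raw items from a table, file, URL or anything else.
	IEnumerable<TRaw> ReadAll();
}

public interface IParser<TRaw>
{
	// Converts one raw item (e.g. a delimited line of text) into a Record.
	Record Parse(TRaw raw);
}

public class RecordSource<TRaw>
{
	private readonly IRawSource<TRaw> source;
	private readonly IParser<TRaw> parser;

	public RecordSource(IRawSource<TRaw> source, IParser<TRaw> parser)
	{
		this.source = source;
		this.parser = parser;
	}

	// A record source is just the pairing of a raw source with a parser.
	public IEnumerable<Record> Records()
	{
		foreach (TRaw raw in source.ReadAll())
			yield return parser.Parse(raw);
	}
}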
SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");
This destination is very straightforward; it's a table in a SQL Server database.  The parameters are the connection string and the target table to load into.  And finally...
Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
DatapipeResults claimResults = claimPipe.Process();
The Datapipe is where the work happens.  Here I initialize my Datapipe with the source, destination and transforms I just detailed.  Then I call Process, capturing the results in a DatapipeResults object, which provides information on the number of rows processed, time elapsed, etc.
As usual, hit me up via email or in the comments if you have any questions.

Chainsaw.Net with Style

March 29, 2011
I've been working pretty diligently during my nights at finishing version 2.0 of my custom ETL engine.  It's coming along nicely and I'm pretty happy with how this revision is turning out.  Look for more on my custom ETL engine in the near future, but today I'm revisiting my .NET implementation of Chainsaw, a log viewer for log4net.  As some of you might remember, I finished an initial version of the viewer a few weeks back.  I've been using it pretty heavily during the development of my ETL engine and I got tired of it looking crappy.  Tonight I took the night off from ETL development to "beautify" the log4net reporting viewer and I have to say, it's easier on the eyes. 
If you compare the two screenshots, you'll notice that I moved all my controls up to a single bar at the top of the page and incorporated the jQuery UI button components to replace the basic checkboxes, radio buttons and drop-downs.  I've also removed the logger tree from being a permanent fixture on the page; it now toggles with a button in the top left that gives you access to the tree to make your logger selections.
 
That screenshot is full resolution if you open it in a new tab, which makes it easier to check out the new look and feel.
I found that it can be a bit tricky to pull the state out of these modified controls, but it only required a few changes to the JavaScript.  Notably, pulling the value out of the converted checkboxes and radio buttons required a slightly more sophisticated selector, which I'll share here just in case you find yourself in a similar situation.
var sort_order = $('input:radio[name=radio-message-order]:checked').val(); 
var per_page = $('input:radio[name=option-perpage]:checked').val();
I played with swapping around some of the themes available from jQuery UI, but they don't seem to want to play all that nicely, which likely means I'm doing something wrong or the browsers are caching things.  Either way, it will have to wait for another day.
Oh yeah, I also discovered that the comment system wasn't exactly working.  I had CAPTCHA enabled but not the CAPTCHA tool.  Whoops. It's working now, so comment away.  Now, back to the real stuff.

Data Visualization from Hans Rosling

March 21, 2011

Today I was getting my fix from TED and ended up watching a recent talk by Hans Rosling on the magic washing machine.  It was an excellent talk, as are many by Hans, including an old favorite that I was reminded of while watching this one.

Back in 2009 Hans did another talk called Let my dataset change your mindset where he discussed public health and visualization techniques, two topics that I'm passionate about. Most of the visualizations in his talk are created with the Gapminder software which was acquired by Google back in 2007. Gapminder is excellent at showing how data changes over time.

Take a look at the talk, and if you share my excitement about the visualization, be happy to know that it lives on both at Gapminder.com and as part of Google's Public Data Explorer.  You can even leverage the visualization technique yourself by using Google's motion chart.  In addition, as you get captivated by Hans, you'll likely also enjoy this earlier talk where he shows the best statistics you have ever seen.

Logging and Triggers and Email Oh My!

March 17, 2011

I've been doing a lot of work lately using SSIS to load our production databases.  We're loading very large databases and our SSIS packages are long running.  Many take days, not hours.  It can be difficult to understand what parts of the packages are taking most of the time and when, or if, they have issues.  SSIS provides a logging feature that you can enable, and it goes a long way toward solving those issues.  We have enabled our packages for logging and direct the output to a centralized database where we log all of our loading package executions.  From there, I've written a web application that analyzes the logging database so we can view all the information generated by our packages.  I even have a nifty little timeline viewer that depicts our package flow; it's simple, but it does the job.

The report has a ton more detail than the timeline, but I'll save that for another day.  What I wanted to focus on here was how to generate alerts.  We wanted a way to get notifications when packages had issues.  Yes, we could do things in the packages themselves to set paths to email tasks on failures, but that just complicates the packages and we would have to configure it in each package for each task; you see where this is going.  Since we have all of our packages logging centrally, we thought it would be easier to set up a trigger on the logging table to alert us when particular events occurred.  This ended up working really well once we worked through a few minor issues.

First you have to set up Database Mail on the server with the logging database, which is done by just following through the wizard.  If anyone posts any questions I'll go back and provide some more information on how it's done.

With Database Mail set up, you have to make sure that the credentials you are using to write your log entries to the database have permission to send email using Database Mail.  The script to do that is straightforward:

USE [msdb]
GO
CREATE USER [<ACCOUNT NAME HERE>] FOR LOGIN [<ACCOUNT NAME HERE>]
GO
USE [msdb]
GO
EXEC sp_addrolemember N'DatabaseMailUserRole', N'<ACCOUNT NAME HERE>'
GO

Now that your account has the right permissions, you just need to set up a trigger to send the email.  You can customize it any way you like to provide different levels of notification based on what you capture in your log table.  In the following example we send an email when a package encounters an error.

CREATE TRIGGER [dbo].[SSISError] ON [dbo].[sysssislog]
FOR INSERT
AS

IF (SELECT COUNT(*) FROM inserted WHERE [event] = 'OnError') = 1
BEGIN

exec msdb.dbo.sp_send_dbmail 
	@profile_name = 'Default Email Profile',
	@recipients = 'somebody@who.cares.about.alerts.com',
	@body_format = 'HTML',
	@body = 'Go check your error.',
	@subject = 'SSIS Error'
END
And there you have it.  Logging, reporting and alerts from your SSIS packages.  Now go load some data!

ETL : Made from scratch.

March 14, 2011

So for those of you dedicated readers (there are currently about 40 of you, so keep it up, and tell your friends), you already know that I wrote a bit about what it takes to get Rhino-ETL built.  I've since been amazed at how complex Rhino-ETL is; I mean, it has its own scripting language, or DSL component as it's called.  It's impressive, but I don't find it to be all that simple.  I'm a big fan of simple.

I use SQL Server Integration Services (SSIS) in my professional work.  It's not really all that simple either.  It is very capable, but not simple.  DTS was simple.  For those who aren't familiar, DTS was Data Transformation Services, the ETL engine before SSIS.  Ever wonder why SSIS packages are called DTSX files?  They didn't know what they were going to call SSIS in time to change the file extension.  Much like old school ASP became ASPX, DTS became DTSX.  Now you know.  I also believe it was to lure us all in.  Let's face it, a lot of people really liked DTS.  No, it wasn't an enterprise ETL engine like SSIS or Informatica, but it was lightweight; I'll even go as far as to say that it was fun.  That's right, FUN.  It inferred types, loaded data in ways it probably shouldn't have, but it just kind of worked.  It couldn't handle complicated scenarios, you know, the 20 part of the 80/20 split?  But it handled the 80 part pretty well and I liked it.  Flash forward to today and DTS is gone.  Out, out, brief candle. 

I have a few personal projects ongoing that require the help of an ETL engine, and it must be open source or, more specifically, free.  We have SSIS, but it certainly isn't free, and Rhino-ETL looked great but it doesn't meet my criterion of being simple, and so...

ETL : Made from scratch.  That's right, I'm re-inventing the wheel.  Too bad.  Everyone else gets to do it.  I mean, did we really need Git?  We have Subversion.  Subversion tore me from my comfortable little world of Microsoft tools and quietly convinced me of its superiority to SourceSafe and at least the first few versions of TFS.  Then someone decided we needed Git.  So, too bad, I'm writing my own ETL engine and besides, it will be from scratch!  You know, because for some reason we are all pretty much convinced that things are better when they are made from scratch.  Cupcakes, pasta, tomato sauce; they are all better when made from scratch instead of from the boxes and jars we buy in the store.  Therefore, it logically follows that my ETL engine will be better.  Well, probably not, but it will be simpler.

I wrote version 1 and promptly threw it away, because version one is never any good anyway.  I'm just putting the finishing touches on version 2 and hope to finish it up in the next few weeks.  It's clearly over-engineered, just like any good V2 is supposed to be.  Once I finish version 2 I will put it through the paces of loading all the data for my personal projects and see how it does.  To be fair, version 1 was able to fulfill about 80% of my needs, but it was damn ugly.  V2 is much prettier; hopefully it works better too.  Look for more details on V2 and even some sample code in the near future.

I couldn't find a chainsaw, so...

March 2, 2011

I recently started working on a web-based reporting console for log4net since I couldn't find a .NET version of Chainsaw, the Java-based app for a similar purpose.  

For those who aren't familiar, log4net is a .NET port of the log4j logging framework.  You can find out more about the framework and download the bits off the Apache site.  Essentially it's a logging library that you can use instead of rewriting the way you log for each project you develop.  What's really great about it is that it is completely abstracted from the storage layer, meaning you can log your messages to the local file system, just about any relational database, the Windows event log, even email via SMTP, all without changing your code.
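If you haven't used it before, the basic pattern looks something like the snippet below.  The appenders that decide where messages actually go live in the application's configuration file, which this example assumes is already in place.
using System;
using log4net;
using log4net.Config;

class Program
{
	// Each class typically gets its own logger, named after the type.
	private static readonly ILog log = LogManager.GetLogger(typeof(Program));

	static void Main()
	{
		// Pick up the appender configuration (file, database, event log, SMTP, ...)
		// from the application's config file.
		XmlConfigurator.Configure();

		log.Info("Starting up");
		try
		{
			throw new InvalidOperationException("something broke");
		}
		catch (Exception ex)
		{
			// Where this ends up depends entirely on the configured appenders.
			log.Error("Processing failed", ex);
		}
	}
}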

My reporting console isn't nearly as flexible, but it works for my purposes.  I've set up an ASP.NET web application that utilizes Entity Framework to access my SQL Server-based logging data created by log4net, which I obviously have configured to use SQL Server for storage.  The data is exposed through a web service and accessed via ajax from my front end.  

I've integrated a jQuery extension called jstree to provide a hierarchical way to filter which loggers' messages to display in the report, which is actually somewhat similar to what the Java-based log viewer Chainsaw does in its reporting.  Here is a screenshot of the reporting interface:

The top left shows the treeview courtesy of jstree.  To the right of the tree are a few simple reporting options I have implemented so far, including a sort order for the log messages, a limit on the number of messages for the interface to download and filters on which columns to include in the report.  

I'm pulling the state of the options using jQuery and then putting them all together in order to make the ajax call to the web service.  Gotta give a shout out to the folks over at ElegantCode for having a nice quick tutorial on reading the state of the select objects, since I always forget how to do that.  Check out the basics of the data update routine below:

var sort_order = $('.option-sortorder').val();
var per_page = $('.option-perpage').val();

var bit_id = ($(".option-mask-id")[0].checked) ? '1' : '0';
var bit_date = ($(".option-mask-date")[0].checked) ? '1' : '0';
var bit_level = ($(".option-mask-level")[0].checked) ? '1' : '0';
var bit_logger = ($(".option-mask-logger")[0].checked) ? '1' : '0';
var bit_message = ($(".option-mask-message")[0].checked) ? '1' : '0';

var mask = bit_id + bit_date + bit_level + bit_logger + bit_message;

$.ajax({
	type: "POST",
	contentType: "application/json; charset=utf-8",
	url: "LoggerDetails.svc/GetMessages",
	data: "{\"ids\" : \"" + id_list.toString() + "\"," +
		"\"sort\" : " + sort_order + "," +
		"\"perpage\" : " + per_page + "," +
		"\"mask\" : \"" + mask + "\"" +
		"}",
	success: function (data, status) {
		$(".messages").html(data.d);
		$(".message-table").tablesorter({ widgets: ['zebra'] });
	},
	error: function (err) {
		alert(err.statusText);
	},
	dataType: "json"
});
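
For what it's worth, the service on the other end of that call (LoggerDetails.svc) is exposed to the page's ajax requests; here is a rough sketch of what the GetMessages contract could look like.  The attributes and method body below are an assumption for illustration, not the actual implementation, which queries the log table through Entity Framework as described above.
using System.ServiceModel;
using System.ServiceModel.Activation;
using System.ServiceModel.Web;

// Sketch only: the real LoggerDetails service isn't shown in this post.
[ServiceContract(Namespace = "")]
[AspNetCompatibilityRequirements(RequirementsMode = AspNetCompatibilityRequirementsMode.Allowed)]
public class LoggerDetails
{
	[OperationContract]
	[WebInvoke(Method = "POST",
		RequestFormat = WebMessageFormat.Json,
		ResponseFormat = WebMessageFormat.Json)]
	public string GetMessages(string ids, int sort, int perpage, string mask)
	{
		// Placeholder: the real implementation filters by the selected logger ids,
		// honors the sort order, page size and column mask sent by the client,
		// and returns the rendered message table markup.
		return "<table class=\"message-table\"></table>";
	}
}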

I've been working a bit on timeline visualization controls and I might integrate one of them into this interface if it looks like it would be useful, which it does at this point.  I'm also considering adding a date filter, so you can pull things back from specific time periods, or some aggregate statistics, so you can get a higher level view of the volume and category of entries in your log.

Source code available upon request, just hit up the comments below.

Git Rhino-ETL Going

March 1, 2011

For one of my projects I'm trying to identify a decent, open source, .NET-based ETL component.  Yes, those are some picky requirements, but there appears to be at least one answer called Rhino ETL.  Getting it built isn't the most straightforward task.  Here is what it takes:

First off, grab the source from Git, which might first require you to have Git utilities installed; I chose Git Extensions off of SourceForge.  After installing it and launching it I get to look at a cow.  I'm not sure why.  I'm sure there is a reason.  I don't think I want to know what it is though, so I'm going to skip googling Git Extensions and cow and just get on with it, but for your own viewing pleasure I've included a screenshot.

Once you have Git Extensions installed you can use the UI to download the current source for Rhino-ETL.  Currently the URL for read-only access to the Rhino-ETL source is here:

https://github.com/hibernating-rhinos/rhino-etl.git

In order to grab a copy of the source you click 'Clone repository' in the above UI and then fill out the clone screen something like this:

Hit clone and now you'll have the source in the destination you specified.  


Informatics

March 1, 2011

The term informatics covers a wide spectrum of areas.  Let us begin with the definition from none other than Wikipedia:

"Informatics is the science of information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, algorithms, behavior, and interactions of natural and artificial systems that store, process, access and communicate information. It also develops its own conceptual and theoretical foundations and utilizes foundations developed in other fields. Since the advent of computers, individuals and organizations increasingly process information digitally. This has led to the study of informatics that has computational, cognitive and social aspects, including study of the social impact of information technologies."

Part of what I want to communicate through this blog is my experience working in the field of informatics.  Until recently, I would not have even called myself an informatician, but the more familiar I became with the field, the more I understood that it is what I am.  To me, informatics is simply working with information.  Sure, the Wikipedia definition wins the competition for most words, but mine is simpler, and there is definitely something to be said for The Power Of Simplicity.  As simple as my definition is, the field is very complicated for a variety of reasons.  I'll be posting more about informatics in the future, including some practical examples of the trials and tribulations of an informatician.  I'm sure you can hardly wait. 

About the author

Frank DeFalco has worked as a software developer and informatics professional since 1996.  He is currently employed by Johnson & Johnson, where he manages the Health Informatics Center of Excellence.

 

Reach me via email at frank (at) thedefalcos (dot) com.

 

Follow me on Twitter @fjdefalco