R you connecting to SQL Server?

May 3, 2011
As I've mentioned, some of my pet projects require access to statistics and data mining algorithms. I have played with a few different approaches, but none of them really had everything I was looking for in terms of methods. I eventually decided to give R a try. For those who are not familiar, R is a free software environment for statistical computing and graphics that is available for a wide variety of UNIX platforms, Windows and MacOS. I had used R in the past, but only on a very limited basis and never against a relational data source, so I was quite surprised by what I found.
R has a massive following, with thousands of contributors to an online library of algorithms. Need access to a Bayesian algorithm for classification? No problem. Want to draw a familial diagram with genetic markers? Piece of cake. It's all in there, and there are quite a few nice resources online to help you learn the platform. One of the best I have found is http://www.statmethods.net/ but there are plenty of others. Even if what you are looking for doesn't exist in R (unlikely), you can engage with the community to get where you need to go.
Once I determined that R has all the methods I was looking for, the next step was accessing my data from within R. Sure, you can load flat files directly into R, but my data is all stored in a relational database. As it turns out, R can handle that too. Let's get started.
R has a terrific package management system that allows you to select a package repository called a CRAN mirror (there are many) and install additional packages.
 
To connect to a relational database you'll want to grab the RODBC package by selecting 'Install package(s)...' and finding it in the list after selecting a nearby CRAN mirror. Once it has been installed, you can load it into your current session as follows:
library("RODBC")
Now for the fun part. In the code below I connect to the database and establish a channel that can then be used for running queries. Once I have a channel, I get a listing of the tables in the Sales schema of the AdventureWorks sample database. Finally, I run a query to return some data.
channel <- odbcDriverConnect(connection = "Driver={SQL Server Native Client 10.0};Server=localhost;Database=AdventureWorks;Trusted_Connection=yes;")
sqlTables(channel, schema="Sales")
sqlQuery(channel,"select top 10 * from Sales.SalesOrderHeader")
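The results of sqlQuery come back as an ordinary R data frame, so you can hand them straight to R's statistical functions. As a quick sketch, something like this summarizes order totals using columns from the AdventureWorks SalesOrderHeader table:
orders <- sqlQuery(channel, "select SubTotal, TotalDue from Sales.SalesOrderHeader")
summary(orders$TotalDue)   # five-number summary and mean of the order totals
hist(orders$TotalDue)      # quick look at the distribution
odbcClose(channel)         # close the channel when you're finished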
Of course, you might not be using SQL Server 2008, or you might be using a different security model. Regardless, just take a trip over to www.connectionstrings.com, find the ODBC connection string that fits your scenario, and plug it into the odbcDriverConnect function. Now that you can connect R to your relational environment, you can take advantage of the statistical and data mining algorithms within, free of charge.
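For example, if you're using SQL Server authentication rather than a trusted connection, the string you plug in might look something like this (server name, user and password are placeholders):
channel <- odbcDriverConnect(connection = "Driver={SQL Server Native Client 10.0};Server=myServer;Database=AdventureWorks;Uid=myUser;Pwd=myPassword;")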

Loading the Heritage Health Prize Data

April 5, 2011
As I have significant interests in Public Health, Informatics, Data Integration and the like, the Heritage Health Prize is a great opportunity to participate in a competition that leverages all of those skills. The $3 million prize isn't too shabby either. I don't have much confidence that I could win it; let's face it, the world's best will be plugging away at this one, but that's not all that important here. I'm using this competition as an opportunity to test out my custom-built ETL libraries and integrate some new statistics libraries I'm evaluating. It will be fun.
In fact, it's been fun already. In about 20 minutes I was able to develop a data import utility and complete a full load of the first release of the Heritage sample data. I even had time to make it look nice.
A few people have expressed interest in my new ETL library, and while it's not ready for release yet, I can give you a glimpse of it here in the context of this data import utility. The data released for this challenge will eventually include multiple files containing sample medical claims data. Here is the code I used to load them:
// load claims data
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };

string[] claimFiles = Directory.GetFiles(textSourceFileDirectory.Text, "Claims*.csv");
foreach (string filename in claimFiles)
{
	GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
		new TextFileSource(filename, 1),
		new DelimitedTextParser(',')
	);

	SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");

	Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
	DatapipeResults claimResults = claimPipe.Process();
	Log("Claim " + claimResults.Summary);
}
So then... let's break it down.
I start off by creating all the transformations involved in processing this file. I do it outside the loading loop for performance reasons; there's no reason to recreate them for each file since the files all have the same structure (in fact, right now there is only one claim file, but that will change in the future).
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };
You'll notice two of the transforms I have available in this example. The first is the InsertColumn transform, which does what you would expect: it inserts a column into the record at the specified index. Here I'm inserting a column at position 0 to create a system identifier in my database.
The second transform I'm using here is the ManipulateColumns transform. It started out as a lot of different transforms providing data type conversions, but I quickly found that my transform library would have ended up with far too many converters, and they would never cover every really specific custom data manipulation. That led me to create the ManipulateColumns class. It has a few overloads, but basically it allows you to specify a set of columns and a Func<object, object> pointer. That way you can create whatever transformation is necessary and pass it in to be applied to those columns of the record. I then pack the transformations up into a List (any IEnumerable would work) to pass into my Datapipe, which we'll get to later.
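For reference, SafeIntConvert and SafeBigIntConvert in the code above aren't part of the library; they are just manipulation functions I pass in. Roughly speaking, a manipulation like that looks something like this (simplified for illustration):
// A Func<object, object>-compatible manipulation: convert a column value to an int,
// falling back to DBNull for empty or malformed input.
private object SafeIntConvert(object value)
{
	string text = value == null ? string.Empty : value.ToString().Trim();
	int result;
	return int.TryParse(text, out result) ? (object)result : DBNull.Value;
}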
The foreach loop is just grabbing all the files matching the filter, so that's enough about that.  Let's get to what is inside the loop.
GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
	new TextFileSource(filename, 1),
	new DelimitedTextParser(',')
);
So then, a bit of background. When I first wrote my ETL library, it was awful. I coupled the source and the parsing of the source way too tightly. In my current version I have raw sources and parsers. The raw sources are the raw input to the ETL process; they can come from tables, files, even URLs right now. They implement a generic interface that gives the library the ability to use practically anything as a data source, but that's for another day. The parsers are responsible for taking a raw input and converting it into a Record. Records are a class I use to standardize data moving through the pipeline, and they are very simple. So, when you put together a raw source and a parser you get a RecordSource. I have a few different implementations of raw sources and parsers. In this example I'm using a TextFileSource, which takes a URI to the file and, as a second parameter, the number of rows of data to skip before beginning processing. The parser in this case is a DelimitedTextParser, which breaks the raw input down into columns using a delimiter. This file is comma delimited, so I pass in the proper character.
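Records are simple enough that a rough sketch conveys the idea; think of them as little more than an ordered collection of column values, along these lines (a simplified illustration rather than the full class):
// Simplified illustration of a Record: an ordered list of column values
// that sources, transforms and destinations all agree to work with.
public class Record
{
	public List<object> Columns { get; private set; }

	public Record(IEnumerable<object> columns)
	{
		Columns = new List<object>(columns);
	}
}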
SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");
This destination is very straightforward: it's a table in a SQL Server database. The parameters are the connection string and the target table to load into. And finally...
Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
DatapipeResults claimResults = claimPipe.Process();
The Datapipe is where the work happens. Here I initialize my datapipe with the source, destination and transforms I just detailed. Then I call Process, capturing the results in a DatapipeResults object, which provides information on the number of rows processed, time elapsed, etc.
As usual, hit me up via email or in the comments if you have any questions.

Data Visualization from Hans Rosling

March 21, 2011

Today I was getting my fix from TED and ended up watching a recent talk by Hans Rosling on the magic washing machine. It was an excellent talk, as are many by Hans, including an old favorite that I was reminded of while watching this one.

Back in 2009 Hans gave another talk called Let my dataset change your mindset, where he discussed public health and visualization techniques, two topics that I'm passionate about. Most of the visualizations in his talk are created with the Gapminder software, which was acquired by Google back in 2007. Gapminder is excellent at showing how data changes over time.

Take a look at the talk, and if you share my excitement about the visualization, be happy to know that it lives on both at Gapminder.com and as part of Google's public data explorer. You can even leverage the visualization technique yourself by using Google's motion chart. In addition, as you get captivated by Hans, you'll likely also enjoy this earlier talk where he shows the best statistics you have ever seen.

Informatics

March 1, 2011

The term informatics covers a wide spectrum of areas. Let us begin with the definition from none other than Wikipedia:

"Informatics is the science of information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, algorithms, behavior, and interactions of natural and artificial systems that store, process, access and communicate information. It also develops its own conceptual and theoretical foundations and utilizes foundations developed in other fields. Since the advent of computers, individuals and organizations increasingly process information digitally. This has led to the study of informatics that has computational, cognitive and social aspects, including study of the social impact of information technologies."

Part of what I want to communicate through this blog is my experience working in the field of informatics. Until recently, I would not have even called myself an informatician, but the more familiar I became with the field, the more I understood that it is what I am. To me, informatics is simply working with information. Sure, the Wikipedia definition wins the competition for most words, but mine is simpler, and there is definitely something to be said about The Power of Simplicity. As simple as my definition is, the field is very complicated for a variety of reasons. I'll be posting more about informatics in the future, including some practical examples of the trials and tribulations of an informatician. I'm sure you can hardly wait.

About the author

Frank DeFalco has worked as a software developer and informatics professional since 1996. He is currently employed by Johnson & Johnson, where he manages the Health Informatics Center of Excellence.

 

Reach me via email at frank (at) thedefalcos (dot) com.

 

Follow me on twitter @fjdefalco