R you connecting to SQL Server?

May 3, 2011
As I've mentioned, some of my pet projects require access to statistics and data mining algorithms. I have played with a few different approaches, but none of them really had everything I was looking for in terms of methods. I eventually decided to give R a try. For those who are not familiar, R is a free software environment for statistical computing and graphics that is available for a wide variety of UNIX platforms, Windows and MacOS. I had used R in the past, but only on a very limited basis and never against a relational data source, so I was quite surprised with what I found.
R has a massive following, with thousands of contributors to an online library of algorithms. Need access to a Bayesian algorithm for classification? No problem. Want to draw a familial diagram with genetic markers? Piece of cake. It's all in there, and there are quite a few nice resources online to help you learn the platform. One of the best I have found is http://www.statmethods.net/ but there are plenty of others. Even if what you are looking for doesn't exist in R (unlikely), you can engage with the community to get where you need to go.
Once I determined that R has all the methods I was looking for, the next step was accessing my data from within R. Sure, you can load flat files directly into R, but my data is all stored in a relational database. As it turns out, R can handle that too. Let's get started.
R has a terrific package management system that allows you to select a package repository called a CRAN mirror (there are many) and install additional packages.
 
To connect to a relational database you'll want to grab the RODBC package by selecting 'Install package(s)...' and finding it in the list after selecting a nearby CRAN mirror. Once it has been installed you can load it into your current session as follows:
library("RODBC")
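If you'd rather script the install than click through the menus, the same thing can be done from the R console (a minimal sketch; R will prompt for a CRAN mirror if you haven't already chosen one):
install.packages("RODBC")   # download and install RODBC from the selected CRAN mirror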
Now for the fun part. In the code below I connect to the database and establish a channel that can then be used for running queries. Once I have a channel I get a listing of the tables in the Sales schema of the AdventureWorks sample database. Finally I run a query to return some data.
channel <- odbcDriverConnect(connection = "Driver={SQL Server Native Client 10.0};Server=localhost;Database=AdventureWorks;Trusted_Connection=yes;")
sqlTables(channel, schema="Sales")
sqlQuery(channel,"select top 10 * from Sales.SalesOrderHeader")
Of course you might not be using SQL Server 2008, or you might be using a different security model. Regardless, just take a trip over to www.connectionstrings.com, find the ODBC connection string that fits your scenario, and plug it into the odbcDriverConnect function. Now that you can connect R to your relational environment you can take advantage of the statistical and data mining algorithms within, free of charge.
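To give a sense of where this goes, here is a minimal sketch of pulling a query result into a data frame and handing it straight to R's built-in statistics functions (the column names assume the AdventureWorks Sales.SalesOrderHeader table):
# sqlQuery returns a data frame, so results drop directly into R's analysis functions
orders <- sqlQuery(channel, "select SubTotal, TaxAmt, TotalDue from Sales.SalesOrderHeader")
summary(orders$TotalDue)   # basic descriptive statistics
hist(orders$TotalDue)      # quick look at the distribution
odbcClose(channel)         # close the channel when finished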

Loading the Heritage Health Prize Data

April 5, 2011
As I have significant interests in Public Health, Informatics, Data Integration and the like, the Heritage Health Prize is a great opportunity to participate in a competition that leverages all those skills. The $3 million prize isn't too shabby either. I don't have much confidence that I could win it; let's face it, the world's best will be plugging away at this one, but that's not all that important here. I'm using this competition as an opportunity to test out my custom-built ETL libraries and integrate some new statistics libraries I'm evaluating. It will be fun.
In fact, it's been fun already. In about 20 minutes I was able to develop a data import utility and complete a full load of the first release of the Heritage sample data. I even had time to make it look nice.

A few people have expressed interest in my new ETL library, and while it's not ready for release yet, I can give you a glimpse of it here in the context of this data import utility. The data released for this challenge will eventually have multiple files containing sample medical claims data. Here is the code I used to load them:
// load claims data
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };

string[] claimFiles = Directory.GetFiles(textSourceFileDirectory.Text, "Claims*.csv");
foreach (string filename in claimFiles)
{
	GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
		new TextFileSource(filename, 1),
		new DelimitedTextParser(',')
	);

	SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");

	Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
	DatapipeResults claimResults = claimPipe.Process();
	Log("Claim " + claimResults.Summary);
}
So then... let's break it down.
I start off creating all the transformations involved in processing this file. I do it outside the loading loop for performance reasons; there's no reason to recreate them for each file since the files are all the same (in fact, right now there is only one claim file, but that will change in the future).
InsertColumn insertSID = new InsertColumn(0);
ManipulateColumns claimManipulations = new ManipulateColumns();
claimManipulations.AddManipulations(new int[] { 4, 8 }, SafeIntConvert);
claimManipulations.AddManipulations(new int[] { 1, 2, 3 }, SafeBigIntConvert);
List<ITransformation> claimTransforms = new List<ITransformation>() { insertSID, claimManipulations };
You'll notice two of the transforms I have available in this example. The first is the InsertColumn transform, which does what you would expect: it inserts a column in the record at the index specified. Here I'm inserting a column at position 0 to create a system identifier in my database.
The second transform I'm using here is called the ManipulateColumns transform. It started out as a lot of different transforms providing data type conversions, but I quickly found that my transform library would have ended up with way too many different converters and still never answer the need for a really specific custom data manipulation. That led me to create the ManipulateColumns class. It has a few overloads, but basically it allows you to specify a set of columns and a Func<object, object> delegate. That way you can create whatever transformation is necessary and pass it in to be applied to the columns of the record. I then pack the transformations into a List (any IEnumerable would work) to pass into my Datapipe, which we'll get to later.
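SafeIntConvert and SafeBigIntConvert themselves aren't shown in this post, but since ManipulateColumns only wants a Func<object, object>, a minimal sketch of such a converter (returning DBNull when the text doesn't parse, rather than throwing) might look like this:
// Hypothetical converter: parse the column value as an int, falling back to DBNull
// for empty or malformed text so a bad row doesn't blow up the whole load.
private object SafeIntConvert(object value)
{
	int result;
	return int.TryParse(Convert.ToString(value), out result) ? (object)result : DBNull.Value;
}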
The foreach loop is just grabbing all the files matching the filter, so that's enough about that.  Let's get to what is inside the loop.
GenericRecordSource<string> claimSource = new GenericRecordSource<string>(
	new TextFileSource(filename, 1),
	new DelimitedTextParser(',')
);
So then, a bit of background. When I first wrote my ETL library, it was awful. I coupled the source and the parsing of the source way too tightly. In my current version I have raw sources and parsers. The raw sources are the raw input to the ETL process. They can come from tables, files, even URLs right now. They implement a generic interface that gives the library the opportunity to use just about anything as a data source, but that's for another day. The parsers are responsible for taking raw input and converting it into a Record. Records are a class I use to standardize data moving through the pipeline, and they are very simple. So, when you put together a raw source and a parser you get a RecordSource. I have a few different implementations of raw sources and parsers. In this example I'm using a TextFileSource, which takes a URI to the file and, as a second parameter, the number of rows of data to skip before beginning processing. The parser in this case is a DelimitedTextParser, which breaks the raw input down into columns using a delimiter. This file is comma delimited, so I pass in the proper character.
SQLServerTable claimDestination = new SQLServerTable(connection_string, "Claim");
This destination is very straightforward: it's a table in a SQL Server database. The parameters are the connection string and the target table to load into. And finally...
Datapipe claimPipe = new Datapipe(claimSource, claimDestination, claimTransforms);
DatapipeResults claimResults = claimPipe.Process();
The Datapipe is where the work happens. Here I initialize my Datapipe with the source, destination and transforms I just detailed. Then I call Process, capturing the results in a DatapipeResults object, which provides information on the number of rows processed, time elapsed, etc.
As usual, hit me up via email or in the comments if you have any questions.

Logging and Triggers and Email Oh My!

March 17, 2011

I've been doing a lot of work lately using SSIS to load our production databases. We're loading very large databases and our SSIS packages are long running. Many take days, not hours. It can be difficult to understand which parts of the packages are taking most of the time and when, or if, they have issues. SSIS provides a logging feature that you can enable, and it goes a long way toward solving those issues. We have enabled logging in our packages and direct the output to a centralized database where we log all of our loading package executions. From there I've written a web application that analyzes the logging database so we can view all the information generated by our packages. I even have a nifty little timeline viewer I've written that depicts our package flow; it's simple, but it does the job.
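
The web application and timeline viewer aren't shown here, but to give a sense of what they work from, here is a sketch of the kind of query you can run against the central logging table to get rough per-task durations for a single execution (the execution GUID is a placeholder):

SELECT	source,
	MIN(CASE WHEN [event] = 'OnPreExecute'  THEN starttime END) AS task_start,
	MAX(CASE WHEN [event] = 'OnPostExecute' THEN endtime   END) AS task_end
FROM dbo.sysssislog
WHERE executionid = 'YOUR-EXECUTION-GUID-HERE'
GROUP BY source
ORDER BY task_start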

The report has a ton more detail than the timeline, but I'll save that for another day. What I wanted to focus on here was how to generate alerts. We wanted a way to get notifications when packages had issues. Yes, we could do things in the packages themselves to set paths that go to email tasks on failures, but that just complicates the packages and we would have to configure it in each package for each task; you can see where this is going. Since we have all of our packages logging centrally, we thought it would be easier to set up a trigger on the logging table to alert us when particular events occurred. This ended up working really well once we worked through a few minor issues.

First you have to set up Database Mail on the server with the logging database, which is done by just following through the wizard. If anyone posts any questions I'll go back and provide some more information on how it's done.
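
If you'd rather script it than click through the wizard, the core of the Database Mail setup looks roughly like this (a sketch only; the account name, addresses and SMTP server are placeholders, and the profile name matches the one used in the trigger below):

USE [msdb]
GO
EXEC msdb.dbo.sysmail_add_account_sp
	@account_name = 'Default Email Account',
	@email_address = 'alerts@yourdomain.com',
	@mailserver_name = 'smtp.yourdomain.com'
GO
EXEC msdb.dbo.sysmail_add_profile_sp
	@profile_name = 'Default Email Profile'
GO
EXEC msdb.dbo.sysmail_add_profileaccount_sp
	@profile_name = 'Default Email Profile',
	@account_name = 'Default Email Account',
	@sequence_number = 1
GO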

With Database Mail set up, you have to make sure that the credentials you are using to write your log entries to the database have permission to send email using Database Mail. The script to do that is straightforward:

USE [msdb]
GO
CREATE USER [<ACCOUNT NAME HERE>] FOR LOGIN [<ACCOUNT NAME HERE>]
GO
USE [msdb]
GO
EXEC sp_addrolemember N'DatabaseMailUserRole', N'<ACCOUNT NAME HERE>'
GO

Now that your account has the right permissions, you just need to set up a trigger to send the email. You can customize it any way you like to provide different levels of notification based on what you capture in your log table. In the following example we send an email when a package encounters an error.

CREATE TRIGGER [dbo].[SSISError] ON [dbo].[sysssislog]
FOR INSERT
AS

IF (SELECT COUNT(*) FROM inserted WHERE [event] = 'OnError') = 1
BEGIN

exec msdb.dbo.sp_send_dbmail 
	@profile_name = 'Default Email Profile',
	@recipients = 'somebody@who.cares.about.alerts.com',
	@body_format = 'HTML',
	@body = 'Go check your error.',
	@subject = 'SSIS Error'
END
And there you have it.  Logging, reporting and alerts from your SSIS packages.  Now go load some data!

ETL : Made from scratch.

March 14, 2011

So for those of you dedicated readers (there are currently about 40 of you, so keep it up, and tell your friends), you already know that I wrote a bit about what it takes to get Rhino-ETL built. I've since been amazed at how complex Rhino-ETL is; I mean, it has its own scripting language, or DSL component as it's called. It's impressive, but I don't find it to be all that simple. I'm a big fan of simple.

I use SQL Server Integration Services (SSIS) in my professional work. It's not really all that simple either. It is very capable, but not simple. DTS was simple. For those who aren't familiar, DTS was Data Transformation Services, the ETL engine before SSIS. Ever wonder why SSIS packages are called DTSX files? They didn't know what they were calling SSIS in time to change the file extension. Much like old school ASP became ASPX, DTS became DTSX. Now you know. I also believe it was to lure us all in. Let's face it, a lot of people really liked DTS. No, it wasn't an enterprise ETL engine like SSIS or Informatica, but it was lightweight; I'll even go as far as to say that it was fun. That's right, FUN. It inferred types, loaded data in ways it probably shouldn't have, but it just kind of worked. It couldn't handle complicated scenarios, you know, the 20 part of the 80/20 split? But it handled the 80 part pretty well and I liked it. Flash forward to today and DTS is gone. Out, out, brief candle.

I have a few personal projects ongoing that require the help of an ETL engine, and it must be open source or, more specifically, free. We have SSIS, but it certainly isn't free, and Rhino-ETL looked great but it's not meeting my criterion of being simple, and so...

ETL : Made from scratch.  That's right, I'm re-inventing the wheel. Too bad. Everyone else gets to do it. I mean, did we really need GIT? We have Subversion. Subversion tore me from my comfortable little world of Microsoft tools and quietly convinced me of its superiority to Source Safe and at least the first few versions of TFS. Then someone decided we needed GIT. So, too bad, I'm writing my own ETL engine and besides, it will be from scratch! You know, because for some reason we are all pretty much convinced that things are better when they are made from scratch. Cupcakes, pasta, tomato sauce, they are all better when made from scratch instead of from the boxes and jars we buy in the store. Therefore, it logically follows that my ETL engine will be better. Well, probably not, but it will be simpler.

I wrote version 1 and promptly threw it away, because version one is never any good anyway. I'm just putting the finishing touches on version 2 and hope to finish it up in the next few weeks. It's clearly over-engineered, just like any good V2 is supposed to be. Once I finish version 2 I will put it through the paces of loading all the data for my personal projects and see how it does. To be fair, version 1 was able to fulfill about 80% of my needs, but it was damn ugly. V2 is much prettier; hopefully it works better too. Look for more details on V2 and even some sample code in the near future.

Git Rhino-ETL Going

March 1, 2011

For one of my projects I'm trying to identify a decent, open source, .NET based ETL component. Yes, those are some picky requirements, but there appears to be at least one answer, called Rhino ETL. Getting it built isn't the most straightforward task. Here is what it takes:

First off, grab the source from GIT, which might first require that you have GIT utilities installed; I chose GITExtensions from SourceForge. After installing it and launching it I get to look at a cow. I'm not sure why. I'm sure there is a reason. I don't think I want to know what it is though, so I'm going to skip googling GITExtensions and cow and just get on with it, but for your own viewing pleasure I've included a screenshot.

Once you have GITExtensions installed you can use the UI to download the current source for Rhino-ETL. Currently the URL for read-only access to the Rhino-ETL source is here:

https://github.com/hibernating-rhinos/rhino-etl.git

In order to grab a copy of the source you click 'Clone repository' in the above UI and then fill out the clone screen something like this:

Hit clone and now you'll have the source in the destination you specified.  
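
If you'd rather skip the UI entirely, the same clone can be done from a git command line:

git clone https://github.com/hibernating-rhinos/rhino-etl.git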


About the author

Frank DeFalco has worked as a software developer and informatics professional since 1996. He is currently employed by Johnson & Johnson, where he manages the Health Informatics Center of Excellence.

 

Reach me via email at frank (at) thedefalcos (dot) com.

 

Follow me on Twitter @fjdefalco