ETL : Made from scratch.

March 14, 2011

So, for those of you dedicated readers (there are currently about 40 of you, so keep it up, and tell your friends), you already know that I wrote a bit about what it takes to get Rhino-ETL built.  I've since been amazed at how complex Rhino-ETL is.  I mean, it has its own scripting language, or DSL component as it's called.  It's impressive, but I don't find it to be all that simple.  I'm a big fan of simple.

I use SQL Server Integration Services (SSIS) in my professional work.  It's not really all that simple either.  It is very capable, but not simple.  DTS was simple.  For those who aren't familiar, DTS was Data Transformation Services, the ETL engine before SSIS.  Ever wonder why SSIS packages are called DTSX files?  They didn't know what they were calling SSIS in time to change the file extension.  Much like old-school ASP became ASPX, DTS became DTSX.  Now you know.  I also believe it was to lure us all in.  Let's face it, a lot of people really liked DTS.  No, it wasn't an enterprise ETL engine like SSIS or Informatica, but it was lightweight.  I'll even go as far as to say that it was fun.  That's right, FUN.  It inferred types and loaded data in ways it probably shouldn't have, but it just kind of worked.  It couldn't handle complicated scenarios, you know, the 20 part of the 80/20 split.  But it handled the 80 part pretty well, and I liked it.  Flash forward to today and DTS is gone.  Out, out, brief candle.

I have a few personal projects ongoing that require the help of an ETL engine, and it must be open source or, more specifically, free.  We have SSIS, but it certainly isn't free, and Rhino-ETL looked great, but it's not meeting my criterion of being simple, and so...

ETL : Made from scratch.  That's right, I'm re-inventing the wheel.  Too bad.  Everyone else gets to do it.  I mean, did we really need Git?  We have Subversion.  Subversion tore me from my comfortable little world of Microsoft tools and quietly convinced me of its superiority to SourceSafe and at least the first few versions of TFS.  Then someone decided we needed Git.  So, too bad, I'm writing my own ETL engine, and besides, it will be from scratch!  You know, because for some reason we are all pretty much convinced that things are better when they are made from scratch.  Cupcakes, pasta, tomato sauce: they are all better when made from scratch instead of from the boxes and jars we buy in the store.  Therefore, it logically follows that my ETL engine will be better.  Well, probably not, but it will be simpler.

I wrote version 1 and promptly threw it away, because version 1 is never any good anyway.  I'm just putting the finishing touches on version 2 and hope to finish it up in the next few weeks.  It's clearly over-engineered, just like any good V2 is supposed to be.  Once I finish version 2, I will put it through the paces of loading all the data for my personal projects and see how it does.  To be fair, version 1 was able to fulfill about 80% of my needs, but it was damn ugly.  V2 is much prettier; hopefully it works better too.  Look for more details on V2 and even some sample code in the near future.


About the author

Frank DeFalco has worked as a software developer and informatics professional since 1996.  He is currently employed by Johnson & Johnson, where he manages the Health Informatics Center of Excellence.


Reach me via email at frank (at) thedefalcos (dot) com.


Follow me on Twitter: @fjdefalco