feed

Aug 2006 SchemaForm and schema deformations by Chris Poirier • in Uncategorizedpermalink

I initially posted this article to my old blog in August of 2006.  I’m reposting here, under the Rethink heading, as that’s where these ideas have ended up. 

I apologize in advance for the long-windedness.  :-)


Okay, bear with me here: I’ve got something I think is worthwhile to say, but I’m not sure how to pack it into a little space.

Actually, knowing me, I’m probably not even going to try.  :-)

I’ve always hated database systems.  Not the idea of them, not the practicality of them — I’ve always liked those things — no, it’s the PITA involved in making them work. 

The problem is normalization.  Now, don’t get me wrong — normalization is great for programmer sanity and system maintainability.  It is, frankly, the only way to design.  But if you develop a normalized data model for your system, you are going to have to deal with two fundamental problems.  1) Your system will most likely run like molasses.  Uphill.  On a brisk November morning.  And 2) You are probably going to have to spend a lot of time writing glue code in queries and representation classes to pull all that data back together into useful structures.

This is the joy and pain of normalization.  You store any particular datum only once, so it stays minimal and consistent.  But putting it back together into the assembled, summarized views we humans find so useful is really costly.

A couple of years ago, I spent a year trying to solve the problem.  I had an impossibly large project I convinced myself I could pull off in no time and on no budget if I could just write a programming language that would do all that bookkeeping/reassembly for me. 

Well, I was wrong about the project.  (Enough said on that topic.)  For the sake of this discussion, the language is the important thing.  SchemaForm was it’s name.  I wrote it in Ruby.  I know, I know — what kind of crazy person tries to write a compiler in a scripting language.  Well, my kind of crazy person.  :-)

Anyway, SchemaForm’s aim was to make it simple to declare and manipulate not just the data itself, but also the relationships between the data.  Instead of declaring your “physical” data in one place (the tables) and your assemblies of data in another (views and procedural code), and your validations in another (field and table constraints, procedural code), SchemaForm let you declare them together, using a common, vaguely-functional language.  The SchemaForm compiler then took those declarations and used them to build SQL and object-oriented library code to create the database and retrieve and manipulate its data. 

Perhaps an example would be helpful.  Here’s a simple problem: you want to store address information for your customers.  Your staff work in English (they want to read “Quebec City”), but some of your customers speak and read primarily French (and expect their mail addressed “Ville de Québec”).  A normalized design might have a City table which references a Strings table, which has a one-to-many relationship with a Translations table.  To get the French name of a particular city, you join from City to String on CityNameID, then from String to Translations on StringID and restrict on the French language code.  Simple, right?  Here’s what it might look like in SchemaForm (with a few conveniences for this particular application):

   (class City Cities
      (oid     CityID                  )
      (field   NameStringID      String)

      (derived EnglishName (dereference $NameStringID EnglishText))
      (derived FrenchName  (dereference $NameStringID FrenchText))

      (key NameStringID)
      (key EnglishName)
   )

   (class String
      (oid     StringID)

      (derived Description (dereference (lookup Translation Language $StringID "en-ca") Data ""))
      (derived EnglishText (dereference (lookup Translation Language $StringID "en-ca") Data ""))
      (derived FrenchText  (dereference (lookup Translation Language $StringID "fr-ca") Data ""))
   )

   (class Translation
      (oid     TranslationID                )
      (owner   StringID            String   )
      (field   LanguageID          Language ) 
      (field   Data                text:4000)
      (derived Language            (dereference $LanguageID Code       ))
      (derived LanguageDescription (dereference $LanguageID Description))

      (key StringID LanguageID (name LanguageID))
      (key StringID Language   (name Language))
   )

Okay, no that isn’t tiny.  But, what it does is give SchemaForm enough information to do most of the assembly for you.  For instance, each City record has a calculated field called EnglishName, which is the English name of the city.  If you happen to be holding a City object, you access its EnglishName field, and you’ve got the value.  If you are holding a set of Cities, you can query on it.  In fact, if you look carefully, City actually gets its EnglishName from the calculated EnglishText field of String, which is defined as a dereference of a lookup on a Translations table candidate key (StringID and Language).  And if you look even closer, you’ll find that even Language is a derived field — the stored value in Translations is a LanguageID, which references a code table (not shown).  Whew!  Imagine constructing all those joins by hand!

SchemaForm does a lot more than what is shown here, actually.  For instance, it provides field- and type-level validations and filters; it provides a relational algebra for building operations and retrievers; it integrates object-oriented ideas of type extensions into the the schema definition language (for instance. your Invoices reference a Customer, which may be either a Person or a Company); and it provides a definition language for rendering query results into hierarchical, summarized structures (you might call them XML-based reports).

So, why isn’t every one using SchemaForm, I hear you asking.  Well, because the language sucks.  Okay, maybe that’s unfair.  The language doesn’t suck.  But the implementation does.  As it turns out, a lot of that great stuff SchemaForm does for you doesn’t translate well into SQL.  Performant SQL, in any event.  Slap as may indices as you want on that beautifully normalized database, and it will still perform terribly, because all that joining is expensive.  And my experience with SQL Server, at least, is that the query optimizer actually makes the problem about an order of magnitude worse.

Oh, and from a development standpoint, placing a compiler run between your thought process and the running system turns out to be less than optimal.  Surprise, surprise.

But it wasn’t all a waste, methinks, for writing SchemaForm got me thinking about something that I have come to believe is really worthwhile. 

Denormalization. 

:-D Okay, okay, maybe I should qualify that a bit. 

Denormalization involves caching known and/or stored data in more than one place; generally a place were it doesn’t naturally belong.  It is done because it can often (but by no means always) improve the performance of your database operations, by keeping data in the place where it will be used.  The cached data might be a verbatim copy of a field from another table, in order to save a join operation in the average case (or to be used in a multi-column index).  It might be a summary field, like a count of the records on the other end of a one-to-many relationship.  Or it might be the result of a scalar calculation that is too expensive to do on a regular basis (a conversion of Markdown text to HTML, for instance). 

But there is a cost to all this performance enhancement.  You end up with multiple whole and part copies of data elements, and this can quickly become a nightmare for the poor programmer who has to keep all of those copies in sync.  Imagine, for instance, if you decided to denormalize the simple three-class schema I quoted above.  Imagine if you wanted to cache every one of those derived fields directly in the underlying table, so that you wouldn’t have to take the join hit every time you used them in a query, or so that you could use those fields in a SQL index.  Imagine the bookkeeping work you’d have to do, the dozens of places you’d have to carefully insert the extra code to keep all those copies in sync.

Okay, now imagine this.  It’s Monday morning.  Your boss comes to you and tells you that e has a new requirement for your beautifully normalized, 2.5 million record inventory system.  E wants to be able to search the database for Customers over the age of 35 whose parents are both Customers, who drive a particular make of car, and who have bought over some dollar threshold of some inventory part, or of any part that contains that part, or of any part that has the same manufacturer as that part.  Those were variables, not constants you heard in there, BTW.  And E wants the system to supply those answers instantly, with data that is always completely up-to-date.  And E needs it working yesterday (isn’t that always the way?) and it can’t have any impact on any other part of the system (which, being beautifully normalized, has been working flawlessly all this time).

Ready?  Set?  Go!

What?

Okay, actually, I’m sure you can get the thing functional in short order.  All the data is in there.  You already have relationship information between your Customer records.  You have optional data on Customer birthdate.  You can get a good approximation of the make of car the Customer drives based on the invoicing information (you don’t generally buy Honda parts for your Ferrari, after all).  Your parts table shows relationships between the parts, and a transitive closure will discover all of the equivalents, and from there you can pull in all the unrelated parts with the same manufacturer.  With SchemaForm’s relational algebra and dereference function, you can probably whip up something that will work.

There’s just one small problem.  At this point, each query your boss runs on your still-beautifully normalized database is probably going to run for a couple of days.  You could build a reporting database with the data, but nightly runs aren’t going to supply that “up-to-date” thing your boss was so adamant about.

So, it’s intractable, right?

But what if you could go into the schema, add your calculated fields, add the necessary relational functions (views), and then mark them all as cacheable?  You don’t do the denormalization — you just suggest to the database system that perhaps it should do it for you!  You still use the system like it’s beautifully normalized, but under the covers, the database runtime figures out how to propagate data changes to all the dependent locations within the system, and does so.  Because, after all, it has the functional description of how those calculations are to be done.  It is, in fact, the only entity that is truly qualified to determine all those dependencies, and to propagate the changes in the right order.  You use that calculated field or relational function just like normal, except that now it runs like stink!  Now, all of sudden, you are back in the hero seat, because your system did not require any changes to data entry screens or model classes, but, nonetheless, your boss gets eir instant results to eir insane questions.

It’s frickin’ Utopia, man!  ;-)

Okay, maybe I’m getting too excited.

So why am I writing about this today, after not writing about it for most of two years?  Well, I finally got around to looking at Ruby on Rails today.  I looked at it a few years back, when it first came out, but I wasn’t really doing web development at the time, so I let it slide.  Anyway, today I was feeling frustrated and exhausted with work stuff, and, being the geek that I am, I thought playing with a new toy — especially a Ruby toy — would cheer me up.  It sort of worked.  Turns out I’ve been building Ruby on Rails for PHP for the last three months, only RoR has done it better, and I’ve been avoiding Ruby for no reason.  Yes, it’s depressing for both reasons.  ;-)   In particular, their schema migration stuff works better than my PHP version, because they actually update the schema with the changes you make in the migrations.  Nice job, guys!

Anyway, suddenly I’m thinking that maybe the schema deformation layer can be written without the compiler pass, and without an actual new type of database server (the path I have been considering).  Maybe — I’m not sure yet.  This idea isn’t even half-baked yet.  But maybe, I could build it for Rails.  Not the full version, probably — I’ve already learned my lesson about Ruby’s compilation/graph-analysis prowess (ie. it doesn’t have any) — but a reasonable hand-drawn facsimile thereof might be possible.  And it would certainly make my database programming life more enjoyable. 

So, there you have it.  Yes, I rambled, and I probably rushed over some of the important stuff, too.  But I hope you aren’t feeling I’ve wasted your time with this.  If the idea interests you, drop me a line. 

Thanks, Chris.

Related Links

in Uncategorized:
on site:

Discussion: No comments

Jump to comment form | comments rss | trackback uri

Leave a comment

Markdown: The kinds of formatting markup you'd use in an email will probably work here. For more details on what you can do, check out the Markdown docs.


Site copyright 2007-2008 Chris Poirier.       Powered by Wordpress.       Entries RSS Comments RSS Validate Log in