Thursday, May 15, 2008

About Predictability and Traceability

In this post I will discuss two somewhat overlooked benefits of SBA.
But first I want to explain what triggered me to write it.

A few weeks ago I read a post based on a book by Daniel Schacter, a psychology professor at Harvard University. According to Schacter, the human brain tends to generalize and categorize things so that we can easily remember them and relate to them, without having to think about all the details involved. This is thought to be an evolutionary advantage, since it enables us to categorize threats and identify them very quickly (albeit not always reliably…).

For example, most people would say "Italian cars are so fun to drive" (unless they had a Fiat Multipla), or "British cuisine sucks" (no offense my fellow Englishmen, there are very few things I like better than a hot, freshly fried fish n chips :) ).

Software products in general and GigaSpaces in particular are no exception to this human behavior.
Facing prospects and being involved in numerous sales opportunities, I often see people categorizing us as "a high-scalability, low-latency solution" (which I'm ok with, don't get me wrong) or "a caching solution" (which I'm less ok with). But as generalizations tend to do, these labels leave out part of the story.

In one of my recent sessions for a certain prospect, I had an interesting discussion with one of the attendees, a savvy enterprise architect. After presenting our way of thinking, he said something in the following spirit:

"Well, our tier-based JEE application works just fine now. The latency is reasonable for all important use cases, my developers make changes in a reasonable time, and I'm ok with the capacities I need to handle. In fact, I know I can handle about 50% more throughput than I do now.
So, it's not that I don't like your solution, but I don't really need it for now. If I were writing an order management system for a bank I would definitely give it a shot, but for my current needs it's kind of overkill."

At first I thought that he had a good point. After all, we define ourselves as an XTP (eXtreme Transaction Processing) application server, so if your application is not extreme (like this architect's), you don't really need GigaSpaces, right?
But after I thought about it for a few more seconds, we started a discussion in the following spirit (I'm U below, for Uri, and he is P, for prospect):

U: So you think you won't need to grow your capacity any further?
P: I didn't say that, I said I'm good for another 50% increase or so
U: And then what? Do you think you will get to a point when this will not cut it?
P: Hopefully we will, if our business is successful enough. I'll do some testing, find where the bottleneck is, which will probably be the database or the messaging as always, and buy more hardware or better storage. After all, this system is not running on a very expensive hardware setup, so it shouldn't be too hard to get more budget for it. I might also re-architect parts of my app to make it perform better.
(at this point my brain started to make funny noises trying to compile a proper response…)
U: But can you really tell how much more hardware you will need, or how much that will cost you? Or where patching the current architecture will get you?
P: hmmm… kind of.
U: What do you mean?
(pause…)
P: I'm not sure how this will affect the performance of the database and the messaging server. It will probably improve, just not sure how much
U: Are you sure? We have a customer using <database X with clustered configuration>. Going into this configuration actually slowed things down, because now the database servers need to coordinate everything with one another…
Furthermore, how will you do capacity planning? How will you plan the budget for this?
P: I have to check it first hand before I can really tell. I'll probably start with a couple of machines or change the relevant parts in the app, try it out to see if it's good enough, and if not add more machines to the mix
U: So you will go through a complete development and performance testing cycle without knowing if and at what cost it's going to solve your problem?
P: well, now that you put it like that…
U: And another point: what if you go through all of this and get great throughput numbers, but not so great latency numbers?
P: Then I need to trace and profile my app and find where I have a bottleneck
U: How will you do that?
P: Well there are a lot of tools out there for doing just that, very good ones I might add
U: Do they show you the entire latency path? I mean, can they tell you where the bottleneck is across your mix of DB, messaging and application servers?
(pause…)
P: Some of them sort of can…
U: So let me get this straight: to know your capacity, you need to build the entire production environment in advance, buy potentially very expensive tools to check in case something goes wrong, and even then you're not entirely sure if it'll do the trick for you?
P (a bit aggressively): So what do you offer to solve this problem?
(Finally, this conversation is heading somewhere…)
U: Well, with SBA, the whole point is that everything happens in the same JVM. So it's much more predictable, because there are no other moving parts involved. When the application is partitioned correctly, and one JVM gives you 2000 operations/sec for example, two would give you more or less 4000, and so on. This is what linear scalability is all about, and the SBA model enables it.
And since it's one JVM, you can just attach a simple profiler or even print your own debug messages to a log file and analyze them later. It's as simple as profiling and debugging a standalone Java app.
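To make the "print your own debug messages" point a bit more concrete, here is a minimal sketch of what I mean. It is plain Java, not a GigaSpaces API: the StageTimer class and the stage names (validate, match, persist) are made up for illustration, and the sleeps merely stand in for real work. The only assumption is that all the stages of a request run inside the same JVM, so one clock and one log cover the whole latency path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of any GigaSpaces API): accumulates the time
// spent in named stages of a request that runs entirely inside one JVM.
public class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    // Runs the stage and adds its elapsed time to that stage's running total.
    public void time(String stage, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Returns the stage with the largest accumulated time, or null if none ran.
    public String slowestStage() {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> e : nanosByStage.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }

    // Prints one line per stage, in the order the stages were first timed.
    public void log() {
        for (Map.Entry<String, Long> e : nanosByStage.entrySet()) {
            System.out.printf("%s took %.1f ms%n", e.getKey(), e.getValue() / 1_000_000.0);
        }
    }

    // Stand-in for real work; illustration only.
    private static void pretendToWork(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        timer.time("validate", () -> pretendToWork(5));
        timer.time("match", () -> pretendToWork(40));
        timer.time("persist", () -> pretendToWork(10));
        timer.log();
        System.out.println("slowest stage: " + timer.slowestStage());
    }
}
```

In a tier-based setup the same three stages would typically live in different processes, so you would have to stitch together timings from several machines before you could even ask which stage is the slow one.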

I won't wear you down with the rest of this conversation, but hopefully you get the point. I think the above summarizes very well why GigaSpaces is not just about scalability or latency. It also gives a pretty easy way to answer the following questions, which is not so trivial to do with the classic tier-based approach:

  • Predictability - If I add X more machines, how much more throughput will I gain? Is there a latency penalty for that?
  • Traceability - Where the heck is my bottleneck?
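The predictability question boils down to simple arithmetic once you accept the linear scaling assumption from the conversation above. The sketch below is my own illustration (the CapacityEstimate class and its method names are hypothetical): it assumes a correctly partitioned application in which every JVM contributes the same measured throughput. Real numbers will be "more or less" these, not exactly these.

```java
// Hypothetical capacity-planning arithmetic under an ideal linear-scaling
// assumption: every partition (JVM) contributes the same throughput.
public class CapacityEstimate {

    // Estimated total throughput for the given number of JVMs.
    public static long estimatedThroughput(long opsPerSecPerJvm, int jvms) {
        if (opsPerSecPerJvm < 0 || jvms < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        return opsPerSecPerJvm * jvms;
    }

    // Smallest number of JVMs whose combined throughput meets the target.
    public static int jvmsNeeded(long opsPerSecPerJvm, long targetOpsPerSec) {
        if (opsPerSecPerJvm <= 0 || targetOpsPerSec < 0) {
            throw new IllegalArgumentException("rate must be positive, target non-negative");
        }
        // Integer ceiling division: round up so the target is actually met.
        return (int) ((targetOpsPerSec + opsPerSecPerJvm - 1) / opsPerSecPerJvm);
    }

    public static void main(String[] args) {
        // The numbers from the conversation: one JVM measured at 2000 ops/sec.
        System.out.println(estimatedThroughput(2000, 2)); // 4000
        // A made-up target of 7000 ops/sec needs four such JVMs.
        System.out.println(jvmsNeeded(2000, 7000));       // 4
    }
}
```

The point is not the code, which is trivial, but that you can do this calculation at all: you measure one partition, and the budget for the next 50% of growth follows from it without building the whole production environment first.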

Finally, if you'd like to read a real-life story about how hard and expensive it was for people to use the "build, deploy, test, see what we got" methodology, and why scaling is something you need to think about in advance, here's a read well worth your time.

2 comments:

Guy Nirpaz said...

Uri,

Very nice one, I really liked it.

The general idea conveyed in this post is very similar to the difference between waterfall and agile software development methodologies.

With the first, you need to finish the project (more or less) to know whether the customer's requirements are met. With the latter, you don't: iterations are key to every agile methodology, since they let you get feedback and make changes accordingly.

Guy

Uri Cohen said...

Thanks, I agree.
We're not the first ones to say it, but I guess the divide and conquer principles are true of a lot of things in life :)