Friday, October 17, 2008

Tips and tricks for discovering performance issues before they become production issues.

Nobody wants to get the phone call. The one that comes at 6 am, informing you that it is going to be a very long day due to a crisis in the software you built. It could be an outright crash, data corruption, or users fleeing your application due to its lagging performance. It doesn't have to be like this; there are reliable ways to put your application under the microscope within the bounds of reasonable effort.

First and foremost, you need tools. There is a range of tools available for every budget and, as in most things, you get what you pay for. Start simple -- you need a way to generate load against your application and you need a code profiler. Freely available open source tools like JMeter are a good place to start -- you can become productive in under an hour. There are countless inexpensive code profilers available, just a web search away. What these lack in features, they make up for in simplicity and price. If you've never examined the performance of your application, then anything is a step up at this point.

Once we start generating consistent load against our applications and run a code profiler, what should we be looking for? The approach we use at J9 is one of pragmatism: focus on the areas with the largest potential return. For example, many books on Java application performance start with a discussion of String versus StringBuffer or StringBuilder. Perhaps they include information on choosing the appropriate Collection types to reduce synchronization overhead. These are fantastic suggestions, but there are plenty of other, more pragmatic improvements to be made before we get to this level of granularity. Take initial JVM heap sizes as an example. This is now one of the first questions we ask customers when evaluating their application performance -- have you explicitly set an appropriate size? We've seen countless customers who were pulling their hair out over server crashes find their concerns dissipate after making this trivial change.
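As a quick way to answer the heap question, here is a minimal sketch that reports what the running JVM was actually given. The class name `HeapSettingsCheck` and its helper methods are our own illustrative names, and the check only looks for the common `-Xms` and `-Xmx` flags, so treat a "false" as a prompt to look closer rather than as proof that nothing was configured.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.List;

// Hypothetical helper: reports the heap sizes the running JVM started with.
class HeapSettingsCheck {

    /** Returns true if any JVM argument starts with -Xms or -Xmx. */
    static boolean hasExplicitHeapFlag(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xms") || arg.startsWith("-Xmx")) {
                return true;
            }
        }
        return false;
    }

    /** Formats a byte count as whole megabytes; the JVM reports -1 when a size is undefined. */
    static String formatMb(long bytes) {
        return bytes < 0 ? "undefined" : (bytes / (1024L * 1024L)) + " MB";
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        System.out.println("Initial heap: " + formatMb(heap.getInit()));
        System.out.println("Max heap:     " + formatMb(heap.getMax()));
        System.out.println("Explicit -Xms/-Xmx: " + hasExplicitHeapFlag(jvmArgs));
    }
}
```

Running this inside the same container or startup script as the real application matters, since the values depend on how that particular JVM was launched.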

Over the years we've seen a common list of problems where a few simple changes yield significant improvements. Besides the initial JVM heap setting:

-- Object pooling for external dependencies: Are you using it? Are your pools sized correctly for expected demand?
-- XML serialization: This one normally shows up as high CPU use and causes your application to spend all its effort processing wrappers rather than the business problem at hand.
-- Poor database interaction: There are a number of basic issues here, like cumbersome SQL statements and failure to properly index tables.
-- Lack of caching: Dynamically fetching otherwise static data (like JNDI entries), or a weak caching strategy leading to a poor hit ratio. This assumes that any caching at all has been implemented.
-- Slow report or page rendering: This is common, especially with PDF generation on large data sets. Typically this is an architectural problem stemming from a monolithic approach.
-- Slow network, inadequate hardware: The basics need to be in place before we can expect performance.

Even under a consistent, moderate load, a rudimentary code profiler should offer hints of the common problems above.
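To make the caching item concrete, here is a minimal sketch of a cache for otherwise static lookups that also tracks its own hit ratio. `LookupCache` and its methods are illustrative names of our own, not part of any library, and the sketch assumes the cached data never changes; real systems usually also need expiry and size limits.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical cache for static data such as directory lookups.
class LookupCache<K, V> {
    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;          // the expensive fetch we want to avoid repeating
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    LookupCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, loading it on first use. */
    V get(K key) {
        V cached = entries.get(key);
        if (cached != null) {
            hits.incrementAndGet();
            return cached;
        }
        misses.incrementAndGet();
        // computeIfAbsent runs the loader at most once per key, even if
        // several threads miss on the same key at the same time.
        return entries.computeIfAbsent(key, loader);
    }

    /** Fraction of lookups served from the cache; 0.0 before any lookups. */
    double hitRatio() {
        long h = hits.get();
        long total = h + misses.get();
        return total == 0 ? 0.0 : (double) h / total;
    }
}
```

Logging `hitRatio()` periodically under load is one simple way to tell whether a caching strategy is pulling its weight.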

Once we've got our tools selected and set up and we start looking for trouble, what should our approach be? Optimally, we're striving to employ the scientific method, beginning with simple divide-and-conquer. First, I take a look across the entire application, noticing which transactions stand out in terms of latency. My search begins there because that tends to be the low-hanging fruit: many problems, regardless of their root cause, manifest as latency issues. It also allows me to focus on getting an early win -- if I can reduce an annoying 10-second response time to 5 seconds, it's a noticeable difference to an end user, whereas saving an extra 256 megabytes of RAM is something only a purist would notice. With the information about the slowest performing transactions, I can then take a tier or layer perspective. This step involves examining the performance of my application at each step -- where are the slowdowns in the web tier? Are there bottlenecks in the database? What about web services or message-oriented middleware? Maybe there are problems within the business logic itself. The key to this examination is keeping concerns separated: just look at the performance within one layer at a time, ignoring the performance within other layers.
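If your profiler does not break latency down by tier, a few lines of instrumentation can approximate it. The sketch below is one possible shape, with `LayerTimer` and the layer names being hypothetical choices of ours. It assumes each layer is timed at a point that does not wrap another timed layer; otherwise the outer measurement includes the inner one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical per-layer latency recorder.
class LayerTimer {
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();

    /** Runs the work and records its elapsed time under the given layer name. */
    <T> T time(String layer, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            record(layer, System.nanoTime() - start);
        }
    }

    /** Records one observation for a layer. */
    void record(String layer, long elapsedNanos) {
        totalNanos.computeIfAbsent(layer, k -> new LongAdder()).add(elapsedNanos);
        calls.computeIfAbsent(layer, k -> new LongAdder()).increment();
    }

    /** Average latency in milliseconds; 0.0 if the layer has no observations. */
    double averageMillis(String layer) {
        LongAdder n = calls.get(layer);
        if (n == null || n.sum() == 0) {
            return 0.0;
        }
        return totalNanos.get(layer).sum() / (double) n.sum() / 1_000_000.0;
    }

    /** Name of the layer with the highest average latency, or null if nothing was recorded. */
    String slowestLayer() {
        String slowest = null;
        double worst = -1.0;
        for (String layer : calls.keySet()) {
            double avg = averageMillis(layer);
            if (avg > worst) {
                worst = avg;
                slowest = layer;
            }
        }
        return slowest;
    }
}
```

A call such as `timer.time("database", () -> runQuery())`, where `runQuery` stands in for your own code, is enough to start comparing tiers against each other.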

Once you've identified the layer or tier where a problem is occurring, begin looking for data that enables you to build a testable hypothesis. Be careful not to assume the first problem you find is the root cause -- many problems are side effects of the real issue. For example, is the SQL statement running slowly because it has not been optimized, because the tables involved have not been indexed, or because of poor data validation and legitimacy? A typical workflow might be as follows:

-- Slowness is identified at the database.
-- Slowest running SQL statements are identified.
-- Statement execution is divided into connection and execution latency. Which one is worse?
-- Assuming connection latency is significant, look for obvious issues:
   -- Are we using connection pooling?
   -- Are we actually pulling our connections from the pools we have configured?
   -- Are the pools sized adequately for the expected transaction throughput?
   -- Are there network or connectivity issues that would cause connections to expire or be slow to create?
-- Create a test or collect supporting data to eliminate each of these potential concerns.
-- For each issue identified, devise a solution and test the effectiveness of that solution.
-- Repeat process until application achieves acceptable performance levels.
-- Implement monitoring, thresholds, and alerts to proactively catch future issues.
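The step that splits statement latency into connection and execution time can be sketched as follows. `PhaseTiming`, `Step`, and `measure` are hypothetical names of ours; the sketch is deliberately generic so that it works with anything closeable, and it assumes the acquire step is where the connection is obtained (from a pool or otherwise).

```java
import java.util.concurrent.Callable;

// Hypothetical holder for the two phases of one database call.
class PhaseTiming {
    final long connectNanos;
    final long executeNanos;

    PhaseTiming(long connectNanos, long executeNanos) {
        this.connectNanos = connectNanos;
        this.executeNanos = executeNanos;
    }

    /** Names the phase that took longer; ties are reported as "connection". */
    String slowerPhase() {
        return connectNanos >= executeNanos ? "connection" : "execution";
    }

    /** The work to run once a connection has been obtained. */
    interface Step<C> {
        void run(C connection) throws Exception;
    }

    /** Times acquiring a connection separately from using it, then closes it. */
    static <C extends AutoCloseable> PhaseTiming measure(Callable<C> acquire, Step<C> execute)
            throws Exception {
        long t0 = System.nanoTime();
        try (C connection = acquire.call()) {
            long t1 = System.nanoTime();
            execute.run(connection);
            long t2 = System.nanoTime();
            return new PhaseTiming(t1 - t0, t2 - t1);
        }
    }
}
```

With JDBC, the acquire step would typically be something like `dataSource::getConnection` and the execute step would run the statement under investigation. If the connection phase dominates, the pooling questions in the list above are the place to look next.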

As you work through problems, be aware of what you can and cannot control. There are physical limits to computing that are outside your control, just as defects in vendor code will seldom be repaired quickly.


We can summarize the approach recommended by J9 as follows:

-- Look for solutions with a high probability of success.
-- Be aware of the basic limits and issues with your environment.
-- Use a scientific, quantitative approach.
-- Put tools in place that make finding and testing for issues easier.
