TestnScale Blog
November 12th, 2011
Informed articles about web performance and scalability
Going by the many posts in various LinkedIn groups and blogs, there seems to be some confusion about how to measure and analyze a web application’s performance. This article tries to clarify the different aspects of web performance and how to go about measuring it, explaining key terms and concepts along the way.
Web Application Architecture
The diagram below shows a high-level view of typical architectures of web applications.
The simplest applications have the web and app tiers combined while more complex ones may have multiple application tiers (called “middleware”) as well as multiple datastores.
The Front end refers to the web tier that generates the HTML response for the browser.
The Back end refers to the server components that are responsible for the business logic.
Note that in architectures where a single web/app server tier is responsible for both the front and back ends, it is still useful to think of them as logically separate for the purposes of performance analysis.
Front End Performance
When measuring front end performance, we are primarily concerned with understanding the response time that the user (sitting in front of a browser) experiences. This is typically measured as the time taken to load a web page. Performance of the front end depends on the following:
For most applications, the response time is dominated by the 3rd bullet above i.e. time spent by the browser in retrieving all of the components on a page. As pages have become increasingly complex, their sizes have mushroomed as well – it is not uncommon to see pages of 0.5 MB or more. Depending on where the user is located, it can take a significant amount of time for the browser to fetch components across the internet.
Front end Performance Tools
Front-end performance is typically viewed as waterfall charts produced by tools such as the Firebug Net Panel. During development, Firebug is an invaluable tool to understand and fix client-side issues. However, to get a true measure of end user experience on production systems, performance needs to be measured from points on the internet where your customers typically are. Many tools are available to do this and they vary in price and functionality. Do your research to find a tool that fits your needs.
Back End Performance
The primary goal of measuring back end performance is to understand the maximum throughput that it can sustain. Traditionally, enterprises perform “load testing” of their applications to ensure they can scale. I prefer to call this “scalability testing”. Test clients drive load via bare-bones HTTP clients and measure the throughput of the application, i.e. the number of requests per second it can handle. To increase the throughput, the number of client drivers needs to be increased until the point where throughput stops increasing or, worse, starts to drop off.
For complex multi-tier architectures, it is beneficial to break up the back end analysis by testing the scalability of individual tiers. For example, database scalability can be measured by running a workload just on the database. This can greatly help identify problems and also provides developers and QA engineers with tests they can repeat during subsequent product releases.
Many applications are thrown into production before any scalability testing is done. Things may seem fine until the day the application gets hit with increased traffic (good for business!). If the application crashes and burns because it cannot handle the load, you may not get a second chance.
Back End Performance Tools
Numerous load testing tools exist with varying functionality and price. There are also a number of open source tools available. Depending on resources you have and your budget, you can also outsource your entire scalability testing.
Summary
Front end performance is primarily concerned with measuring end user response times while back end performance is concerned with measuring throughput and scalability.
Service Level Agreements (SLAs) usually specify a response time criterion that must be met. Although SLAs can have a wide range of metrics like throughput, up time, availability etc., we will focus on response times in this article.
We often hear phrases like the following :
Do you see anything wrong in these statements? Although they sound fine for general conversation, anyone interested in performance should really be asking what exactly they mean.
Let’s take the first statement above and make the assumption that it refers to a particular page in a web application. When someone says that the response time is 5 seconds, does it mean that when this user typed in the URL of this page, the browser took 5 seconds to respond? Or does it mean that in an automated test repeatedly accessing this page, the average response time was 5 seconds? Or perhaps, the median response time was 5 seconds?
You get the idea. For some reason, people tend to talk loosely about response times. Without going into details of how to measure the response time (that’s a separate topic), this article will focus on what is a meaningful response time metric.
For purposes of this discussion, let us assume we are measuring the response time of a transaction (which can be anything – web, database, cache etc.) What is the most meaningful measure for the response time of a transaction?
This is the most common measure of response time, but alas, usually is the most flawed as well. The mean or average response time simply adds up all the individual response times taken from multiple measurements and divides it by the number of samples to get an average. This may be fine if the measurements are fairly evenly distributed over a narrow range as in Figure 1.

Figure 1: Steady Response Times

Figure 2: Varying Response Times
But if the measurements vary quite a bit over a large range like in Figure 2, the average response time is not meaningful. Both figures have the same scale and show response times on the y axis for samples taken over a period of time (x axis).
If the average is not a good representation of a distribution, perhaps the median is? After all, the median marks the 50th percentile of a distribution. The median is useful when the response times do have a normal distribution but have a few outliers. In this case, the median helps to weed out the outliers. The key here is few outliers. It is important to realize that if 50% of the transactions are within the specified time, that means the remaining 50% have a higher response time. Surely, a response time specification that leaves out half the population cannot be a good measure.
In standard benchmarks, it is common to see 90th percentile response times used. The benchmark may specify that the 90th percentile response time of a transaction should be within x seconds. This means that only 10% of the transactions have a response time higher than x seconds and can therefore be a meaningful measure. For web applications, the requirements are usually even higher – after all, if 10% of your users are dissatisfied with the site performance, that could be a significant number of users. Therefore, it is common to see 95th percentile used for SLAs in web applications.
A word of caution – web page response times can vary dramatically if measured at the last mile (i.e. real users computers that are connected via cable or DSL to the internet). Figure 3 shows the distribution of response times for such a measurement.

Figure 3: Response Time Histogram
It uses the same data as in Figure 2. The mean response time for this data set is 12.9 secs and the median is even lower at 12.3 secs. Clearly neither of these measures covers any significant range of the actual response times. The 90th percentile is 17.3 and the 95th is 18.6. These are much better measures for the response time of this distribution and will work better as the SLA.
To summarize, it is important to look at the distribution of response times before attempting to define an SLA. Like many other metrics, a one size fits all approach does not work. Response time measurements on the server side tend to vary a lot less than on the client. A 90th or 95th percentile response time requirement is a good choice to ensure that the vast majority of clients are covered.
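To make the comparison concrete, here is a minimal Python sketch that computes all four measures for one set of samples. The sample values are made up for illustration (they are not the data behind the figures), and the `percentile` helper is my own and uses the simple nearest-rank definition; other tools may interpolate and report slightly different values.

```python
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile for a whole-number pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Smallest rank that covers at least pct percent of the samples
    # (integer arithmetic avoids floating-point rounding surprises).
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Made-up response times in seconds, for illustration only.
samples = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.5, 4.0, 9.0]

summary = {
    "mean": statistics.mean(samples),
    "median": statistics.median(samples),
    "p90": percentile(samples, 90),
    "p95": percentile(samples, 95),
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}s")
```

With these samples the mean is pulled up to about 2.2 seconds by two slow requests, the median sits at 1.15 seconds, and the 90th and 95th percentiles land at 4 and 9 seconds. Only the percentiles tell you what the slowest users actually saw.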
This is my follow-up to the first post in the series. Click here to see the first post.
In my previous post, I started with identifying response time and sleep problems. Let’s address the response time issue first. When we measure a server’s response time under load, we actually do not want the client side response time to be in this picture for the following reasons:

Traditional response time measurements
The diagram to the left illustrates the typical approach to measuring response times. While this applies to any facility, using SOAP web services exaggerates the problem and makes it really visible.
We capture the time before starting the request, make the web service request, and then capture the time after the response. The response time is the difference between the two times. What we do not think about all the time is that this time includes:
1) time to marshal the request into SOAP/XML, 2) time to format the http request headers and build the request, 3) time on the wire and server response time, 4) time to process the http response, and 5) time to unmarshal the XML into native objects. The time we want to measure is usually just the server response time, which may be only a small part of the measured response time.
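A minimal Python sketch of the difference, under some loud assumptions: `marshal`, `send` and `unmarshal` are stand-ins I made up (JSON instead of SOAP/XML, and a fake `send` that never touches the network), and the narrower measurement still includes whatever the HTTP library does inside `send`. It only shows where the timestamps sit, not how a real load generator is built.

```python
import json
import time

def timed_call(marshal, send, unmarshal, payload, clock=time.perf_counter):
    """Return (result, total_seconds, send_seconds) for one request."""
    t0 = clock()
    request = marshal(payload)   # client-side work: build the request body
    t1 = clock()
    raw = send(request)          # wire time plus server time (plus HTTP handling)
    t2 = clock()
    result = unmarshal(raw)      # client-side work: parse the response
    t3 = clock()
    return result, t3 - t0, t2 - t1

def fake_send(request):
    """Hypothetical stand-in for the network call: echoes the request back."""
    return request

result, total, on_send = timed_call(json.dumps, fake_send, json.loads, {"id": 1})
print(f"traditional measurement: {total:.6f}s, around send only: {on_send:.6f}s")
```

The traditional measurement corresponds to `total`; moving the timestamps inward to `send_seconds` already removes the marshalling and unmarshalling cost from the result.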
To measure the server response times with minimal effect from the client, the measurement needs to happen as close to the wire as possible. While many load generators don’t care about this problem, others approach this problem by implementing their own protocol stacks. To cover a wide variety of protocols, they will need to implement a protocol stack for each protocol they want to support. The drawback of this method is the high maintenance of each and every protocol stack as well as proprietary APIs for each protocol. As you can imagine, this is extremely laborious, and there are not many tools that do this.
The problem is exacerbated for secure communications over SSL. Most load generators make use of a client-side library similar to OpenSSL or Apache HttpClient and take measurements before/after the client library call. This adds the entire encryption/decryption overhead on the client-side to the response time.
By now we should be clear about the basic issues with response time measurements. However, there is another time component that greatly affects your results – the inter-arrival time or think time which is a sleep time component. In my next and last post of the series, I’ll talk about errors around such sleep times. Unfortunately, this is also the hardest problem to understand and solve.
Many tools like LoadRunner and JMeter that help develop load tests provide a simple record and playback mechanism. They either use a proxy server or a browser plugin. All you do is traverse the web application as a normal user would. Your interactions with the application are captured and used to create playback script/code. Voila ! You have a test case. Run the required number of emulated users, each executing this script and your workload is ready. Or … is it really ?
If all your users act like a linear computer program executing at a fixed pace, your recorded script may work. But the truth is human beings rarely follow a single path, let alone follow it in a predetermined time. Your users will make one of the many choices available to them in your site, at the pace they desire.
Two factors need to be taken into account when modeling user behavior:
The rest of this article will address the operation mix, data generation, and other issues involved in record and playback. As operation timing is a slightly independent topic by itself, it will be addressed in a different article.
Operation Mix
Tools differ in the way they create a workload from the recorded actions. The primary difference is in how they create an Operation Mix, i.e. the proportion of the various types of operations (aka requests) that the test makes.
The fact is that web application navigation is best represented by a state diagram and the best method to solve this navigation is by use of a stochastic model. This model is known as MatrixMix in Faban and is best created algorithmically – not by record and playback. An example of such a mix is given below. The first row states that if the user is currently on the home page, the probability of going to the products page is 80% and to the contacts page is 20%.
| From | To home.html | To products.html | To contact.html |
|---|---|---|---|
| home.html | 0% | 80% | 20% |
| products.html | 20% | 39% | 41% |
| contact.html | 60% | 19% | 21% |
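As a rough illustration of how such a matrix drives navigation, here is a small Python sketch that picks each next page at random according to the row for the current page. This is not Faban’s MatrixMix implementation or API; the `MIX` dictionary and the `next_page` and `walk` helpers are names I made up, and only the probabilities come from the table above.

```python
import random

# Transition probabilities copied from the table: MIX[current][next].
MIX = {
    "home.html":     {"home.html": 0.00, "products.html": 0.80, "contact.html": 0.20},
    "products.html": {"home.html": 0.20, "products.html": 0.39, "contact.html": 0.41},
    "contact.html":  {"home.html": 0.60, "products.html": 0.19, "contact.html": 0.21},
}

def next_page(current, rng):
    """Pick the next page using the probabilities in the current page's row."""
    row = MIX[current]
    return rng.choices(list(row), weights=list(row.values()), k=1)[0]

def walk(start, steps, rng):
    """Simulate one emulated user making `steps` page transitions."""
    path = [start]
    for _ in range(steps):
        path.append(next_page(path[-1], rng))
    return path

rng = random.Random(42)  # seeded so a run can be repeated
print(walk("home.html", 5, rng))
```

Each emulated user gets a different path through the site, yet over many operations the proportions converge on the mix you specified, which a single recorded script cannot do.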
Often, many web operations will require a variety of input data. The record-and-playback tools usually deal with this by having test developers edit the generated script to parameterize the input fields. The values for these fields are then read from files that the developer must somehow populate. For instance, if a user login name is required, the developer must create a file with all the login names that the workload must use (usually, by dumping the data out from the application’s database). Imagine what this process will be like if a site has millions of registered users. The workload must then choose one name for each emulated user. For other parameters, we may really want the workload to choose a different value for each operation executed (not just one per emulated user). These kinds of choices usually require some kind of coding – be it an XML (or other proprietary) script or coding in a programming language. (It’s interesting to note that although LoadRunner claims to use scripts, the code is actually C or Java and must in fact be compiled). It turns out that in many cases, this coding can be quite extensive, blowing away the so-called “no coding required” record-and-playback claims that the tool vendors make.
If a tool claims that no coding is required at all, be suspicious. It is very likely that it does not provide enough flexibility for data generation. Tools that use scripting may also not allow flexibility to manipulate data.
Also note that requiring all parameterized field values to be in files means the data cannot be programmatically generated.
The fact is that a well-designed workload requires a robust mechanism in order to both generate request data and process response data.
So far we have only talked about input data for operations that retrieve known/existing data from the application’s data store. Most Web 2.0 sites allow a considerable amount of new data to be uploaded by users – whether they are new blog or wiki entries, comments or ratings, profile information, photos etc. How does a record-and-playback methodology work for this? One cannot pull data from a database to pre-load a parameter file, so either these ‘Add’ operations will repeatedly use the same data (which can of course cause the application to fail if, for example, the same username is entered twice) or the tool must provide some way for the workload developer to specify how these parameters are to be generated. Note that different parameters may have different syntax and semantic requirements. If there is a load generator tool that can effectively generate new data without requiring programming, I’d like to know about it.
For a workload to be used for load testing or capacity planning purposes, it needs to be run at different load levels. This is achieved by using one or more scale factors by which both the initial data store and the load scale. Simply adding emulated users without due consideration to the data store will not create a proper workload. More on this topic with several examples of how real applications scale can be found in the paper, “Performance Workload Design”. Record-and-playback tools have no mechanism to handle realistic scaling – one has to achieve this programmatically.
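To give an idea of what generating data programmatically can look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `user0000001`-style naming rule, the helper names, and the tiny word list. It presumes the data store was pre-loaded using the same naming rule, so existing logins can be derived from a number instead of being dumped to a file.

```python
import itertools
import random

def login_for(user_id):
    """Derive a login name from a numeric id (hypothetical naming rule)."""
    return f"user{user_id:07d}"

def pick_existing_login(rng, loaded_users):
    """Choose one of the pre-loaded users at random, without reading a file."""
    return login_for(rng.randrange(1, loaded_users + 1))

class NewLoginSource:
    """Hands out login names beyond the pre-loaded range, never repeating one."""
    def __init__(self, loaded_users):
        self._ids = itertools.count(loaded_users + 1)

    def next_login(self):
        return login_for(next(self._ids))

def random_comment(rng, words=("fast", "slow", "page", "load", "test", "site")):
    """Build a short comment of 3 to 8 words for an 'Add comment' operation."""
    return " ".join(rng.choice(words) for _ in range(rng.randint(3, 8)))

rng = random.Random(7)
new_logins = NewLoginSource(loaded_users=1_000_000)
print(pick_existing_login(rng, 1_000_000))  # login for a 'read' operation
print(new_logins.next_login())              # login for a 'register' operation
print(random_comment(rng))                  # body for an 'add comment' operation
```

A million registered users cost nothing here, and every ‘Add’ operation gets data the application has not seen before. A real workload would also need to coordinate new ids across several driver processes, which this sketch does not attempt.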
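As a small sketch of what scaling the data along with the load can mean, the Python snippet below derives the size of the data store from the number of emulated users. The ratios (100 registered users per concurrent user, 5 catalog items per concurrent user on top of a fixed base) are invented for illustration; the right relationships depend entirely on your application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScaledWorkload:
    concurrent_users: int   # emulated users driving load
    registered_users: int   # rows to pre-load in the user table
    catalog_items: int      # rows to pre-load in the catalog

def scale_workload(concurrent_users, registered_per_active=100,
                   base_items=10_000, items_per_active=5):
    """Size the data store for a given load level (all ratios are assumptions)."""
    if concurrent_users <= 0:
        raise ValueError("concurrent_users must be positive")
    return ScaledWorkload(
        concurrent_users=concurrent_users,
        registered_users=concurrent_users * registered_per_active,
        catalog_items=base_items + concurrent_users * items_per_active,
    )

for users in (100, 200, 300):
    print(scale_workload(users))
```

The data generator and the load driver would both read from the same plan, so a run at 300 users is never executed against a database sized for 100.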
This issue is obvious – record and playbacks can only work for web workloads where a proxy can be used to capture user interactions. Of course, the mechanism can work with any type of interactive application provided a “proxy” for the protocol used by the application is in place. LoadRunner does provide proxies for various protocols but it’s easy to see that this method can become pretty unwieldy quickly and results in product bloat.
It is better to find a tool that provides a good framework and code your own load generator for the specific protocol that you want to test. The process can be eased considerably if the framework understands various commonly used protocols and provides the ability to plug in other protocols as well.
To summarize, here are key points to remember while using a recording tool to generate a load test :
We often hear the terms Load Testing or Performance Testing, but no one talks much about Scalability Testing. Before I go further, let me define these terms so you know what I am talking about :
In this article, we will consider how scalability testing should be done to ensure that the results are meaningful.
The first requirement for any performance testing is a well-designed workload. See my Workload Design paper for details on how to properly design a workload. Many developers and QA engineers typically craft a workload quickly by focusing on a couple of different operations (e.g. if testing a web application, a recording tool is used to create one or two scenarios). I have pointed out the pitfalls of this method in a previous post. So take care while creating your workload. Extra time invested in this step will more than pay off in the long run. Remember, your test results are only as good as the tests you create!
Scalability tests should be planned and executed in a systematic manner to ensure that all relevant information is collected. The parameter by which load is increased obviously depends on the type of app – for web apps, this would typically be the number of simultaneous users making requests of the site. Think about what other parameters might change for your application. If the application accesses a database, will the size of the db change in some relation to the number of users accessing it ? If it uses a caching tier, might it be reasonable to expect that the size of this cache will expand ? Consider the data accessed by your workload – how is this likely to change ? Both the data generator and load generator drivers need to be implemented in a way that supports workload and data scaling.
When running the tests, ensure you can collect sufficient performance metrics so as to be able to understand what exactly is happening on the application infrastructure. One set of metrics is from the system infrastructure – CPU, memory, swap, network and disk I/O data. Another is from the software infrastructure – web, application, caching (memcached) and database servers all provide access to performance data. Don’t forget to collect data on the load driver systems as well. I have seen many a situation in which the driver ran out of memory or swap and it took a while to figure this out because no one was looking at the driver stats! All performance metrics should be collected for the same duration as the test run.
With planning done, it is time to run the performance tests. You want to start at a comfortable scale factor – say 100 users – and increment by the same amount every time (e.g. 100 users at a time). Some tools let you run a single test while varying the load – although this may be acceptable for load testing, I would discourage such short-cuts for scalability testing. The goal is not just to get to the maximum load but to understand how the system behaves at every step. Without the detailed performance data, it is difficult to do scalability analysis. Do scaling runs to a point a little beyond when the system stops scaling (i.e. throughput stays flat or, worse, starts to fall) or you run out of system resources.
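Once the runs are done, spotting the point where the system stops scaling can be as simple as the Python sketch below. The throughput figures are made up, and the 5% threshold for what counts as “still scaling” is my assumption; pick one that suits the run-to-run noise of your own tests.

```python
def find_scaling_limit(runs, min_gain=0.05):
    """Return the last load level at which throughput still grew meaningfully.

    runs: list of (users, throughput) tuples, ordered by increasing users.
    min_gain: a step counts as scaling only if throughput rose by at least
              this fraction over the previous step (assumed threshold).
    """
    if not runs:
        raise ValueError("need at least one run")
    limit_users, prev_tput = runs[0]
    for users, tput in runs[1:]:
        if tput < prev_tput * (1 + min_gain):
            break  # throughput went flat or dropped: scaling has stopped
        limit_users, prev_tput = users, tput
    return limit_users

# Hypothetical results: (emulated users, requests per second).
runs = [(100, 950.0), (200, 1890.0), (300, 2700.0), (400, 2750.0), (500, 2400.0)]
print(find_scaling_limit(runs))  # prints 300
```

Note that the last two runs are still worth doing: the flat step at 400 users and the drop at 500 are what confirm that 300 was the limit and not a fluke.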
Now comes the fun part – analyzing all the data. I will cover this in another post.
When you measure the response time of some work being done and the tool reports a number like 0.345 sec, have you ever thought about the significance of the digits in this number?
Since we tend to take these numbers for granted, we’re saying the tool reports 0.345 seconds for response time, so it must actually measure 0.345 seconds. Actually, the result of any measurement is an approximate number. The accuracy of this number really depends on the accuracy of the tool we use for the measurement.
Similarly, processed and reported results have to be read with the same caution. We often round the results into an easy-to-read number. What many don’t think about is that the last digit of any measured and/or reported number actually tells us the range of possible results, not really that exact number. Let me take my favorite number of 0.345. What this tells me is the actual value should be greater than or equal to 0.3445, and less than 0.3455. It will never be exactly 0.3450 (and add any number of zeros you want). For measurements, this last digit also tells you the precision or degree of confidence that the result is somewhere in that given range. The more digits reported, the smaller the range and the higher the precision. A result reported as 0.3450 has ten times the precision of 0.345. The range of actual results would be from 0.34495 to 0.34505, which is 10 times smaller than the range of a result reported as 0.345.
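If you want to check the arithmetic, the small Python sketch below works out the range implied by a reported number. The `implied_range` helper is my own and assumes the number was rounded to its last printed digit; it takes the number as a string because trailing zeros matter here and would be lost in a float.

```python
from decimal import Decimal

def implied_range(reported):
    """Return (low, high) implied by a number as printed, e.g. "0.345".

    The true value is taken to lie in low <= value < high, i.e. half a unit
    of the last printed digit on either side of the reported number.
    """
    value = Decimal(reported)
    exponent = value.as_tuple().exponent         # -3 for "0.345"
    half_unit = Decimal(5).scaleb(exponent - 1)  # 0.0005 for "0.345"
    return value - half_unit, value + half_unit

for reported in ("0.345", "0.3450"):
    low, high = implied_range(reported)
    print(f"{reported}: {low} to {high} (width {high - low})")
# 0.345: 0.3445 to 0.3455 (width 0.0010)
# 0.3450: 0.34495 to 0.34505 (width 0.00010)
```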
Some light food for heavy thoughts. Next time you report a number, or your tool reports a number, make sure to choose the last digits wisely. They do set expectations!
In this article, we will take a look at how typical load generation tools measure response times and point out the issues with this method :
The typical load generation script executes a path as illustrated in this simplified pseudo-code below:
while (!done) {
    x = get time
    response = make request
    y = get time
    response time = y - x
    sleep thinktime
}
Looks reasonable! But the devil is in the details. Let’s look into some of the issues:
Now, you may start wondering how to solve such issues. If I got your heads spinning and your brows furrowed now, good. Just stay tuned. We’ll get to our solution, soon.
There are two kinds of load generation tools for the web: Simulated web browser and real web browser. The simulated browser uses a simplified http client to access the system we want to measure. The client does not have the ability to process and display the web page. But the client also need not use the resources used by a full-blown web browser, and it ain’t little. If you don’t believe me, check the process size of your browser currently viewing this blog post. If you’re using any flavor of Linux or Mac OS X, just run ‘top’ and you’ll easily find Firefox or Safari on top with a significant amount of your memory consumed. Just look at the RES column. To get a perspective on how much resources you’ll need to simulate 10,000 concurrent users, just multiply that RES column by 10,000 and you’ll soon get the perspective.
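Here is that multiplication as a back-of-the-envelope Python sketch. The 300 MiB per browser and the 64 GiB driver machines are numbers I picked for illustration; plug in the RES value you actually see (top typically shows it in KiB). Treat the result as a rough upper bound, since it ignores any memory shared between processes.

```python
import math

def driver_memory_estimate(res_mib, browsers, ram_per_driver_gib):
    """Return (total GiB needed, minimum number of driver machines)."""
    total_gib = res_mib * browsers / 1024
    drivers = math.ceil(total_gib / ram_per_driver_gib)
    return total_gib, drivers

total_gib, drivers = driver_memory_estimate(
    res_mib=300, browsers=10_000, ram_per_driver_gib=64)
print(f"{total_gib:.0f} GiB in total, at least {drivers} driver machines")
```

With these assumed numbers you would need close to 3 TiB of memory spread over dozens of machines, before counting the CPU the browsers need.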
Not only does a browser need lots more resources, the response time typically also includes the page rendering time. While you may think “but I want it to include the page rendering time,” consider the difference of this rendering time between a 500MHz Pentium and a 3GHz Nehalem processor. Your measurement now includes the client’s CPU time, and your results will vary based on what that client CPU is. If you’re running these “real” browsers on a cloud, the variation can be from run to run, giving you no real insight into the “actual” response times.
On the other hand, simulated browsers are usually light-weight. We can easily fit a thousand such simulated browsers into a single 32bit process. Some of them give you the ability to measure response times very close to the socket, giving you fairly accurate server response times. You’ll be able to pinpoint server latency issues using the results of such measurements.
Great! But I still hear the argument: “I really want to know how long it takes to render the page under stress.” Oh yes you do. But do you really need to spend the resources needed to measure it with 10,000 processes? Since this is client side processing, you actually just need one client. The pure rendering time of the browser is best measured on standardized client hardware interacting with an idle server. But if you want to know the user experience while the server is under stress, just use your browser while 10,000 simulated browsers are pounding your site.
Collecting the response times and rendering time from your single browser is simple enough. A Firefox extension called “Firebug” is one of your good friends for this task. Not only does Firebug allow you to investigate web pages and debug JavaScript, the “Net” tab allows it to capture and visualize response times, downloading sequence, and rendering time very clearly. Mouse-over one of these bars and you’ll readily find detailed response time information about the request. The response time is broken down into great detail, such as DNS lookup time, connection time, pure waiting time, and time used for receiving data. A sample result is shown in the image below.

Firebug Net Panel
In this picture you can see the large server response time of 8.56 seconds for this request. The figure will vary given the application and server system utilization. You can also see it took the browser 16.98 seconds to render the page.
In summary, putting 10,000 browser instances on a large number of driver systems to drive a web workload is quite meaningless. A single or very few browser instances is more than adequate. Leave the bulk of the work to the lightweight, simulated browsers that can do the job very efficiently. Focus your end-user response time measurement using a single browser. Firebug is your best friend giving you all the necessary information.
Performance and scalability are often an afterthought in the minds of many developers. Even applications that are designed with scalability in mind may fail to scale due to the inherent performance limitations of the underlying software infrastructure. Especially with web applications, there is this myth that all one needs to do is to add more powerful and/or more servers and performance and scalability problems will magically disappear. Unfortunately, performance problems can sometimes be hard to solve and throwing hardware at the problem can only take one so far.
In this blog, we will talk about performance testing, scalability measurement, tools for developing workloads and measuring performance, load testing and of course the features and functionality offered by TestnScale.