TestnScale Blog
November 12th, 2011
Informed articles about web performance and scalability
This is my follow-up blog post, the second in the series. Click here to see the first post.
In my previous post, I started with identifying response time and sleep problems. Let’s address the response time issue first. When we measure a server’s response time under load, we actually do not want the client-side response time to be in this picture for the following reasons:

Traditional response time measurements
The diagram to the left illustrates the typical approach to measuring response times. While this applies to any facility, using SOAP web services exaggerates the problem and makes it really visible.
We capture the time before starting the request, make the web service request, and then capture the time after the response. The response time is the difference between the two times. What we often overlook is that this time includes:
1) time to marshal the request into SOAP/XML, 2) time to format the HTTP request headers and build the request, 3) time on the wire and server response time, 4) time to process the HTTP response, and 5) time to unmarshal the XML into native objects. The time we want to measure is usually just the server response time, which may be only a small part of the measured response time.
To measure the server response times with minimal effect from the client, the measurement needs to happen as close to the wire as possible. While many load generators don’t care about this problem, others approach this problem by implementing their own protocol stacks. To cover a wide variety of protocols, they will need to implement a protocol stack for each protocol they want to support. The drawback of this method is the high maintenance of each and every protocol stack as well as proprietary APIs for each protocol. As you can imagine, this is extremely laborious, and there are not many tools that do this.
The problem is exacerbated for secure communications over SSL. Most load generators make use of a client-side library such as OpenSSL or Apache HttpClient and take measurements before/after the client library call. This adds the entire encryption/decryption overhead on the client side to the response time.
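To make the difference concrete, here is a minimal Python sketch that takes both measurements for a single request. It is only an illustration under stated assumptions: the tiny local server (`EchoHandler`), the helper names `timed_request` and `run_demo`, and the plain HTTP request standing in for SOAP marshalling are all hypothetical and not taken from any load generator. The "wire" timestamps are taken immediately around the socket send and the first received bytes, while the "client total" also covers building the request and parsing the response.

```python
import socket
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server; a real test would target the system under load."""

    def do_GET(self):
        body = b"<ok/>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass


def timed_request(host, port):
    """Return (client_total, wire_time, status_line) for one GET request."""
    t_start = time.perf_counter()
    # Client-side work: build the request (stands in for SOAP marshalling).
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port)) as sock:
        t_send = time.perf_counter()       # just before the bytes go to the socket
        sock.sendall(request)
        chunks = [sock.recv(4096)]         # blocks until the first response bytes arrive
        t_first_byte = time.perf_counter()
        while chunks[-1]:                  # drain the rest until the server closes
            chunks.append(sock.recv(4096))
    # Client-side work: parse the response (stands in for unmarshalling).
    status_line = b"".join(chunks).split(b"\r\n", 1)[0].decode("ascii")
    t_end = time.perf_counter()
    return t_end - t_start, t_first_byte - t_send, status_line


def run_demo():
    """Start the stand-in server on a free local port and time one request."""
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return timed_request("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` returns both numbers for the same request. The wire time is always the smaller of the two, and the gap grows with heavier marshalling or encryption work on the client.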
By now we should be clear about the basic issues with response time measurements. However, there is another time component that greatly affects your results – the inter-arrival time or think time which is a sleep time component. In my next and last post of the series, I’ll talk about errors around such sleep times. Unfortunately, this is also the hardest problem to understand and solve.
Many tools like LoadRunner and JMeter that help develop load tests provide a simple record and playback mechanism. They either use a proxy server or a browser plugin. All you do is traverse the web application as a normal user would. Your interactions with the application are captured and used to create playback script/code. Voilà! You have a test case. Run the required number of emulated users, each executing this script, and your workload is ready. Or … is it really?
If all your users act like a linear computer program executing at a fixed pace, your recorded script may work. But the truth is human beings rarely follow a single path, let alone follow it in a predetermined time. Your users will make one of the many choices available to them in your site, at the pace they desire.
Two factors need to be taken into account when modeling user behavior:
The rest of this article will address the operation mix, data generation, and other issues involved in record and playback. As operation timing is a slightly independent topic by itself, it will be addressed in a different article.
Operation Mix
Tools differ in the way they create a workload from the recorded actions. The primary difference is in how they create an Operation Mix, i.e., the proportion of the various types of operations (aka requests) that the test makes.
The fact is that web application navigation is best represented by a state diagram and the best method to solve this navigation is by use of a stochastic model. This model is known as MatrixMix in Faban and is best created algorithmically – not by record and playback. An example of such a mix is given below. The first row states that if the user is currently on the home page, the probability of going to the products page is 80% and to the contacts page is 20%.
| From | To home.html | To products.html | To contact.html |
|---|---|---|---|
| home.html | 0% | 80% | 20% |
| products.html | 20% | 39% | 41% |
| contact.html | 60% | 19% | 21% |
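
The table above can be walked as a simple Markov chain. The following Python sketch is only an illustration of the idea and is not Faban's MatrixMix API (Faban itself is a Java framework); the names `PAGES`, `MATRIX`, `next_page`, and `walk` are made up for this example.

```python
import random

# Transition probabilities copied from the table above (each row sums to 1.0).
PAGES = ["home.html", "products.html", "contact.html"]
MATRIX = {
    "home.html":     [0.00, 0.80, 0.20],
    "products.html": [0.20, 0.39, 0.41],
    "contact.html":  [0.60, 0.19, 0.21],
}


def next_page(current, rng):
    """Pick the next page using the current page's row as weights."""
    return rng.choices(PAGES, weights=MATRIX[current], k=1)[0]


def walk(start, steps, rng):
    """Return the sequence of pages one emulated user visits."""
    path = [start]
    for _ in range(steps):
        path.append(next_page(path[-1], rng))
    return path
```

For example, `walk("home.html", 10, random.Random())` yields a different path for each emulated user, and over many steps the observed transitions approach the percentages in the table.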
Often, many web operations will require a variety of input data. The record-and-playback tools usually deal with this by having test developers edit the generated script to parameterize the input fields. The values for these fields are then read from files that the developer must somehow populate. For instance, if a user login name is required, the developer must create a file with all the login names that the workload must use (usually, by dumping the data out from the application’s database). Imagine what this process will be like if a site has millions of registered users. The workload must then choose one name for each emulated user. For other parameters, we may really want the workload to choose a different value for each operation executed (not just one per emulated user). These kinds of choices usually require some kind of coding – be it an XML (or other proprietary) script or coding in a programming language. (It’s interesting to note that although LoadRunner claims to use scripts, the code is actually C or Java and must in fact be compiled). It turns out that in many cases, this coding can be quite extensive, blowing away the so-called “no coding required” record-and-playback claims that the tool vendors make.
If a tool claims that no coding is required at all, be suspicious. It is very likely that it does not provide enough flexibility for data generation. Tools that use scripting may also not allow flexibility to manipulate data.
Also note that requiring all parameterized field values to be in files means the data cannot be programmatically generated.
The fact is that a well-designed workload requires a robust mechanism in order to both generate request data and process response data.
So far we have only talked about input data for operations that retrieve known/existing data from the application’s data store. Most Web 2.0 sites allow a considerable amount of new data to be uploaded by users – whether they are new blog or wiki entries, comments or ratings, profile information, photos etc. How does a record-and-playback methodology work for this? One cannot pull data from a database to pre-load a parameter file, so either these ‘Add’ operations will repeatedly use the same data (which can of course cause the application to fail if, for example, the same username is entered twice) or the tool must provide some way for the workload developer to specify how these parameters are to be generated. Note that different parameters may have different syntax and semantic requirements. If there is a load generator tool that can effectively generate new data without requiring programming, I’d like to know about it.
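As a rough illustration of what generating such data programmatically can look like, here is a minimal Python sketch for a hypothetical sign-up operation. The class name `NewUserGenerator`, the field names, and the naming pattern are assumptions for this example only; a real workload would follow the syntax and semantic rules of the application under test.

```python
import itertools
import random
import string


class NewUserGenerator:
    """Generates unique, syntactically valid sign-up data without a parameter file."""

    def __init__(self, driver_id, seed=0):
        self._driver_id = driver_id
        self._counter = itertools.count(1)
        self._rng = random.Random(seed)

    def next_user(self):
        n = next(self._counter)
        # Driver id plus a running counter keeps names unique within one run,
        # even when several driver processes generate users in parallel.
        username = f"user_{self._driver_id}_{n:07d}"
        password = "".join(
            self._rng.choices(string.ascii_letters + string.digits, k=12)
        )
        return {
            "username": username,
            "email": f"{username}@example.com",
            "password": password,
        }
```

Each call to `next_user()` produces fresh data for one ‘Add’ operation, so the same username is never submitted twice within a run. Uniqueness across repeated runs would need an extra run identifier or a reset of the data store.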
For a workload to be used for load testing or capacity planning purposes, it needs to be run at different load levels. This is achieved by using one or more scale factors by which both the initial data store and the load scale. Simply adding emulated users without due consideration to the data store will not create a proper workload. More on this topic, with several examples of how real applications scale, can be found in the paper “Performance Workload Design”. Record-and-playback tools have no mechanism to handle realistic scaling – one has to achieve this programmatically.
This issue is obvious – record and playback can only work for web workloads where a proxy can be used to capture user interactions. Of course, the mechanism can work with any type of interactive application provided a “proxy” for the protocol used by the application is in place. LoadRunner does provide proxies for various protocols but it’s easy to see that this method can become pretty unwieldy quickly and results in product bloat.
It is better to find a tool that provides a good framework and code your own load generator for the specific protocol that you want to test. The process can be eased considerably if the framework understands various commonly used protocols and provides the ability to plug in other protocols as well.
To summarize, here are key points to remember while using a recording tool to generate a load test:
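A minimal Python sketch of the idea, assuming a single scale factor and a data store that was pre-loaded with login names following a fixed pattern. The ratio of 100 registered users per emulated user and the `user00000001` naming scheme are hypothetical choices for illustration; the right ratios depend on the application.

```python
import random

USERS_PER_SCALE = 100  # hypothetical ratio of registered users to emulated users


def workload_size(scale):
    """Derive both the load and the data-store size from one scale factor."""
    return {
        "emulated_users": scale,
        "registered_users": scale * USERS_PER_SCALE,
    }


def pick_login(scale, rng):
    """Choose a login name that exists in a data store loaded at this scale."""
    registered = workload_size(scale)["registered_users"]
    user_id = rng.randrange(1, registered + 1)
    return f"user{user_id:08d}"
```

Because login names are computed from the scale factor, no file of names has to be dumped from the database, and the same code works unchanged whether the site has ten thousand or ten million registered users.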
When you measure the response time of some work being done and the tool reports a number like 0.345 sec, have you ever thought about the significance of the digits in this number?
Since we tend to take these numbers for granted, we’re saying the tool reports 0.345 seconds for response time, so it must actually measure 0.345 seconds. Actually, the result of any measurement is an approximate number. The accuracy of this number really depends on the accuracy of the tool we use for the measurement.
Similarly, processed and reported results have to be read with the same caution. We often round the results into an easy-to-read number. What many don’t think about is that the last digit of any measured and/or reported number actually tells us the range of possible results, not really that exact number. Let me take my favorite number of 0.345. What this tells me is the actual value should be greater than or equal to 0.3445, and less than 0.3455. It will never be exactly 0.3450 (and add any number of zeros you want). For measurements, this last digit also tells you the precision or degree of confidence that the result is somewhere in that given range. The more digits reported, the smaller the range and the higher the precision. A result reported as 0.3450 has ten times the precision of 0.345. The range of actual results would be from 0.34495 to 0.34505, which is 10 times smaller than the range of a result reported as 0.345.
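The range implied by a reported figure can be worked out mechanically. This is a small Python sketch, with `implied_range` being a made-up helper name; it takes the number as a string so that trailing zeros are preserved.

```python
from decimal import Decimal


def implied_range(reported):
    """Return (low, high) such that low <= actual < high for a rounded figure."""
    value = Decimal(reported)
    # One unit in the last reported digit, e.g. 0.001 for "0.345".
    unit = Decimal((0, (1,), value.as_tuple().exponent))
    half = unit / 2
    return value - half, value + half
```

`implied_range("0.345")` gives 0.3445 and 0.3455, while `implied_range("0.3450")` gives 0.34495 and 0.34505, a range ten times narrower.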
Some light food for heavy thoughts. Next time you report a number, or your tool reports a number, make sure to choose the last digits wisely. They do set expectations!
In this article, we will take a look at how typical load generation tools measure response times and point out the issues with this method:
The typical load generation script executes a path as illustrated in this simplified pseudo-code below:
while (!done) {
    x = get time
    response = make request
    y = get time
    response time = y - x
    sleep thinktime
}
Looks reasonable! But the devil is in the details. Let’s look into some of the issues:
Now, you may start wondering how to solve such issues. If I got your heads spinning and your brows furrowed now, good. Just stay tuned. We’ll get to our solution, soon.
There are two kinds of load generation tools for the web: simulated web browsers and real web browsers. The simulated browser uses a simplified HTTP client to access the system we want to measure. The client does not have the ability to process and display the web page. But the client also need not use the resources used by a full-blown web browser, and it ain’t little. If you don’t believe me, check the process size of your browser currently viewing this blog post. If you’re using any flavor of Linux or Mac OS X, just run ‘top’ and you’ll easily find Firefox or Safari near the top with a significant amount of your memory consumed. Just look at the RES column. To see how many resources you’ll need to emulate 10,000 concurrent users, just multiply that RES value by 10,000.
Not only does a browser need lots more resources, the response time typically also includes the page rendering time. While you may think “but I want it to include the page rendering time,” consider the difference of this rendering time between a 500MHz Pentium and a 3GHz Nehalem processor. Your measurement now includes the client’s CPU time, and your results will vary based on what that client CPU is. If you’re running these “real” browsers on a cloud, the variation can be from run to run, giving you no real insight into the “actual” response times.
On the other hand, simulated browsers are usually lightweight. We can easily fit a thousand such simulated browsers into a single 32-bit process. Some of them give you the ability to measure response times very close to the socket, giving you fairly accurate server response times. You’ll be able to pinpoint server latency issues using the results of such measurements.
Great! But I still hear the argument: “I really want to know how long it takes to render the page under stress.” Oh yes you do. But do you really need to spend the resources needed to measure it with 10,000 processes? Since this is client-side processing, you actually just need one client. The pure rendering time of the browser is best measured on standardized client hardware interacting with an idle server. But if you want to know the user experience while the server is under stress, just use your browser while 10,000 simulated browsers are pounding your site.
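As a back-of-the-envelope sketch of that multiplication in Python: the 256 MB figure below is an assumed RES value chosen purely for illustration, not a measurement, so substitute whatever ‘top’ shows for your browser.

```python
def total_memory_gib(res_mb, users):
    """Memory needed if every emulated user runs its own browser process."""
    return res_mb * users / 1024  # MiB to GiB


# Assumed example: 256 MB resident size per browser, 10,000 concurrent users.
needed = total_memory_gib(256, 10_000)
```

With these assumed numbers, `needed` comes to 2,500 GiB of memory for the browsers alone, before counting the operating system or anything else on the driver machines.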
Collecting the response times and rendering time from your single browser is simple enough. A Firefox extension called “Firebug” is one of your good friends for this task. Not only does Firebug allow you to investigate web pages and debug JavaScript, the “Net” tab allows it to capture and visualize response times, downloading sequence, and rendering time very clearly. Mouse-over one of these bars and you’ll readily find detailed response time information about the request. The response time is broken down into great detail, such as DNS lookup time, connection time, pure waiting time, and time used for receiving data. A sample result is shown in the image below.

Firebug Net Panel
In this picture you can see the large server response time of 8.56 seconds for this request. The figure will vary given the application and server system utilization. You can also see it took the browser 16.98 seconds to render the page.
In summary, putting 10,000 browser instances on a large number of driver systems to drive a web workload is quite meaningless. A single browser instance, or a very few, is more than adequate. Leave the bulk of the work to the lightweight, simulated browsers that can do the job very efficiently. Focus your end-user response time measurement on a single browser. Firebug is your best friend, giving you all the necessary information.
Performance and scalability are often an afterthought in the minds of many developers. Even applications that are designed with scalability in mind may fail to scale due to the inherent performance limitations of the underlying software infrastructure. Especially with web applications, there is this myth that all one needs to do is to add more powerful and/or more servers and performance and scalability problems will magically disappear. Unfortunately, performance problems can sometimes be hard to solve and throwing hardware at the problem can only take one so far.
In this blog, we will talk about performance testing, scalability measurement, tools for developing workloads and measuring performance, load testing and of course the features and functionality offered by TestnScale.