TestnScale Blog

November 12th, 2011

Informed articles about web performance and scalability

The Timing Fallacy (2 of 3)

January 16th, 2011

This is my follow-up blog post, the second in the series. Click here to see the first post.

In my previous post, I started by identifying response time and sleep problems. Let’s address the response time issue first. When we measure a server’s response time under load, we actually do not want the client-side response time to be in this picture for the following reasons:

  1. The client’s performance and load can greatly affect the measured response time.
  2. The client part of the response time alone can easily be measured using Firebug.

Traditional response time measurements

The diagram to the left illustrates the typical approach to measuring response times. While this applies to any facility, using SOAP web services exaggerates the problem and makes it really visible.

We capture the time before starting the request, make the web service request, and then capture the time after the response. The response time is the difference between the two times. What we rarely think about is that this time includes:

  1. Time to marshal the request into SOAP/XML.
  2. Time to format the HTTP request headers and build the request.
  3. Time on the wire and server response time.
  4. Time to process the HTTP response.
  5. Time to unmarshal the XML into native objects.

The time we want to measure is usually just the server response time, which may be only a small part of the measured response time.
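To make the five components above concrete, here is a minimal Python sketch that times the same call two ways: around the whole client call, and only around the transport step. The `fake_server` function is a hypothetical stand-in for the wire plus the server (it just sleeps for 20 ms), and the XML payload is invented for illustration; none of this is taken from a real load generator.

```python
import time
import xml.etree.ElementTree as ET


def fake_server(payload: bytes) -> bytes:
    """Hypothetical stand-in for the wire plus the server: sleeps about 20 ms."""
    time.sleep(0.02)
    return b"<response><value>42</value></response>"


def timed_call(value: int, transport=fake_server) -> dict:
    t0 = time.perf_counter()
    # Client side: marshal the request into XML.
    root = ET.Element("request")
    ET.SubElement(root, "value").text = str(value)
    payload = ET.tostring(root)
    t1 = time.perf_counter()
    # The part we actually want to measure: wire plus server.
    raw = transport(payload)
    t2 = time.perf_counter()
    # Client side: unmarshal the XML response into a native object.
    result = int(ET.fromstring(raw).findtext("value"))
    t3 = time.perf_counter()
    return {
        "result": result,
        "naive": t3 - t0,            # what the traditional approach reports
        "near_wire": t2 - t1,        # closer to the server response time
        "client_overhead": (t1 - t0) + (t3 - t2),
    }


if __name__ == "__main__":
    print(timed_call(7))
```

With a trivial payload the client overhead is tiny, but the `naive` figure grows with every bit of marshalling work the client does, while `near_wire` does not.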

To measure the server response times with minimal effect from the client, the measurement needs to happen as close to the wire as possible. While many load generators don’t care about this problem, others approach it by implementing their own protocol stacks. To cover a wide variety of protocols, they need to implement a protocol stack for each protocol they want to support. The drawback of this method is the high maintenance cost of each and every protocol stack, as well as the proprietary APIs for each protocol. As you can imagine, this is extremely laborious, and there are not many tools that do this.

The problem is exacerbated for secure communications over SSL. Most load generators make use of a client-side library similar to OpenSSL or Apache HttpClient and take measurements before/after the client library call. This adds the entire client-side encryption/decryption overhead to the response time.

By now we should be clear about the basic issues with response time measurements. However, there is another time component that greatly affects your results – the inter-arrival time or think time which is a sleep time component. In my next and last post of the series, I’ll talk about errors around such sleep times. Unfortunately, this is also the hardest problem to understand and solve.

Record and Playback – Does it really work ?

January 6th, 2011

Many tools like LoadRunner and JMeter that help develop load tests provide a simple record and playback mechanism. They either use a proxy server or a browser plugin. All you do is traverse the web application as a normal user would. Your interactions with the application are captured and used to create playback script/code. Voila ! You have a test case. Run the required number of emulated users, each executing this script, and your workload is ready. Or … is it really ?

If all your users act like a linear computer program executing at a fixed pace, your recorded script may work. But the truth is human beings rarely follow a single path, let alone follow it in a predetermined time. Your users will make one of the many choices available to them in your site, at the pace they desire.

Two factors need to be taken into account when modeling user behavior:

  1. The decision tree of which option to choose at any particular point (the Operation Mix).
  2. The time to follow through to the next operation (called the Timing).

The rest of this article will address the operation mix, data generation, and other issues involved in record and playback. As operation timing is a slightly independent topic by itself, it will be addressed in a different article.

Operation Mix

Tools differ in the way they create a workload from the recorded actions. The primary difference is in how they create an Operation Mix, i.e., the proportion of the various types of operations (aka requests) that the test makes.

  • Fixed Sequence: This is the simplest method in which each emulated user simply submits the exact same sequence of recorded operations. It may be the simplest but obviously the most flawed as well. For a real application, seldom do users traverse the site in the exact same sequence. As such, this mix creates a very artificial workload.
  • Flat Mix: In this method, the test developer identifies the types of operations (either during the record session by pausing between operations or by editing the generated script). The workload then consists of randomly selecting a particular operation, assigning an equal probability to all of them. Some tools may go a step further and allow the probability to be changed (i.e., operation1 executes 50% of the time, operation2 executes 20% of the time, etc.). In either case, this method is extremely flawed because some of the generated sequences may make no sense at all from the application’s perspective. Websites are never navigated at random. In many sites, one needs to first log in to perform certain operations. In other cases, it is necessary to follow a sequence for certain operations (e.g., shopping cart -> checkout -> shipping options -> payment). As such, this method completely fails to create a correct web workload.
  • Flat Sequence Mix: Tools that use this method (both LoadRunner and JMeter do) allow the user to record multiple use cases (referred to as scenarios). Each use case is then treated as a fixed sequence, and the overall mix is created by specifying a different probability for each sequence. Although this method is more realistic than the previous two, it can quickly become unwieldy as the number of scenarios increases: the scenarios grow quadratically, quickly exasperating the test developer.

The fact is that web application navigation is best represented by a state diagram and the best method to solve this navigation is by use of a stochastic model. This model is known as MatrixMix in Faban and is best created algorithmically – not by record and playback. An example of such a mix is given below. The first row states that if the user is currently on the home page, the probability of going to the products page is 80% and to the contacts page is 20%.

From           To home.html   To products.html   To contact.html
home.html      0%             80%                20%
products.html  20%            39%                41%
contact.html   60%            19%                21%
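The matrix above can be driven by a few lines of code. The following Python sketch is only an illustration of the stochastic idea and is not Faban's MatrixMix API; the names `MIX`, `next_page`, and `walk` are made up for this example. Each emulated user looks up the row for its current page and picks the next page according to that row's probabilities.

```python
import random

# Transition probabilities copied from the table: row = current page.
MIX = {
    "home.html":     {"home.html": 0.00, "products.html": 0.80, "contact.html": 0.20},
    "products.html": {"home.html": 0.20, "products.html": 0.39, "contact.html": 0.41},
    "contact.html":  {"home.html": 0.60, "products.html": 0.19, "contact.html": 0.21},
}


def next_page(current: str, rng: random.Random) -> str:
    """Pick the next page using the probabilities in the current page's row."""
    targets = list(MIX[current])
    weights = [MIX[current][t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


def walk(start: str, steps: int, rng: random.Random) -> list:
    """Generate one emulated user's navigation path."""
    path = [start]
    for _ in range(steps):
        path.append(next_page(path[-1], rng))
    return path


if __name__ == "__main__":
    print(walk("home.html", 10, random.Random()))
```

Every emulated user gets a different path, yet across many users the proportion of each transition converges on the matrix, and impossible transitions (probability 0%) never occur.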

Data Generation

Often, many web operations will require a variety of input data. The record-and-playback tools usually deal with this by having test developers edit the generated script to parameterize the input fields. The values for these fields are then read from files that the developer must somehow populate. For instance, if a user login name is required, the developer must create a file with all the login names that the workload must use (usually, by dumping the data out from the application’s database). Imagine what this process will be like if a site has millions of registered users. The workload must then choose one name for each emulated user. For other parameters, we may really want the workload to choose a different value for each operation executed (not just one per emulated user). These kinds of choices usually require some kind of coding – be it an XML (or other proprietary) script or coding in a programming language. (It’s interesting to note that although LoadRunner claims to use scripts, the code is actually C or Java and must in fact be compiled). It turns out that in many cases, this coding can be quite extensive, blowing away the so-called “no coding required” record-and-playback claims that the tool vendors make.
If a tool claims that no coding is required at all, be suspicious. It is very likely that it does not provide enough flexibility for data generation. Tools that use scripting may also not allow flexibility to manipulate data.
Also note that requiring all parameterized field values to be in files means the data cannot be programmatically generated.

The fact is that a well-designed workload requires a robust mechanism in order to both generate request data and process response data.
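As a small illustration of generating request data in code rather than reading it from a file, the Python sketch below derives login names from numeric IDs. It assumes the test database was loaded with names following a fixed pattern (`user0000001`, `user0000002`, and so on); that naming convention and the function names are hypothetical.

```python
import random


def login_name(user_id: int) -> str:
    """Hypothetical convention: user IDs map to names like 'user0000042'."""
    return f"user{user_id:07d}"


def pick_login(total_users: int, rng: random.Random) -> str:
    """Choose any of the loaded users without keeping a file of names."""
    return login_name(rng.randrange(1, total_users + 1))


if __name__ == "__main__":
    rng = random.Random()
    print([pick_login(5_000_000, rng) for _ in range(3)])
```

With this approach a site with millions of registered users needs no multi-million-line parameter file, only the total user count.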

New Data Generation

So far we have only talked about input data for operations that retrieve known/existing data from the application’s data store. Most Web 2.0 sites allow a considerable amount of new data to be uploaded by users – whether they are new blog or wiki entries, comments or ratings, profile information, photos etc. How does a record-and-playback methodology work for this ? One cannot pull data from a database to pre-load a parameter file, so either these ‘Add’ operations will repeatedly use the same data (which can of course cause the application to fail if, for example, the same username is entered twice) or the tool must provide some way for the workload developer to specify how these parameters are to be generated. Note that different parameters may have different syntax and semantic requirements. If there is a load generator tool that can effectively generate new data without requiring programming, I’d like to know about it.
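To show what such programmatic generation might look like, here is a minimal Python sketch. The `make_generators` helper, the `new-<driver>-<n>` username format, and the random-letter comment text are all assumptions made for illustration; a real workload would follow the syntax and semantic rules of the application under test.

```python
import itertools
import random
import string


def make_generators(driver_id: str, seed: int):
    """Return (new_username, new_comment) functions for one load driver."""
    rng = random.Random(seed)
    counter = itertools.count(1)

    def new_username() -> str:
        # The driver ID plus a counter keeps names unique across drivers.
        return f"new-{driver_id}-{next(counter)}"

    def new_comment(min_words: int = 3, max_words: int = 12) -> str:
        # Random lowercase "words" stand in for user-written text.
        n = rng.randint(min_words, max_words)
        words = [
            "".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 9)))
            for _ in range(n)
        ]
        return " ".join(words).capitalize() + "."

    return new_username, new_comment


if __name__ == "__main__":
    new_username, new_comment = make_generators("d1", seed=42)
    print(new_username(), "|", new_comment())
```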

Workload Scaling

For a workload to be used for load testing or capacity planning purposes, it needs to be run at different load levels. This is achieved by using one or more scale factors by which both the initial data store and the load scale. Simply adding emulated users without due consideration to the data store will not create a proper workload. More on this topic, with several examples of how real applications scale, can be found in the paper “Performance Workload Design”. Record and playback tools have no mechanism to handle realistic scaling – one has to achieve this programmatically.
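A rough Python sketch of the idea follows. The ratios used here (100 registered users per emulated user, a product catalog that grows by 10 per unit of scale) are purely hypothetical placeholders; the point is only that one scale factor drives both the load and the size of the initial data store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledWorkload:
    emulated_users: int     # concurrent load to drive
    registered_users: int   # rows to pre-load in the user table
    products: int           # rows to pre-load in the catalog


def scale_workload(scale: int) -> ScaledWorkload:
    # Hypothetical ratios, chosen only for illustration.
    return ScaledWorkload(
        emulated_users=scale,
        registered_users=scale * 100,
        products=1000 + scale * 10,
    )


if __name__ == "__main__":
    for s in (100, 500, 1000):
        print(scale_workload(s))
```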

Non-web Workloads

This issue is obvious – record and playback can only work for web workloads where a proxy can be used to capture user interactions. Of course, the mechanism can work with any type of interactive application provided a “proxy” for the protocol used by the application is in place. LoadRunner does provide proxies for various protocols, but it’s easy to see that this method can become pretty unwieldy quickly and results in product bloat.

It is better to find a tool that provides a good framework and code your own load generator for the specific protocol that you want to test. The process can be eased considerably if the framework understands various commonly used protocols and provides the ability to plug in other protocols as well.

Summary

To summarize, here are key points to remember while using a recording tool to generate a load test :

  • Use a realistic mix of operations. No real user executes scenarios stepping through the same sequence of pages in exactly the same way.
  • Ensure that the back-end data sources are exercised in the same way as in production. This means, not using a limited data-set that all emulated users share.
  • Test creation/upload of new data to the application. This requires new, random data to be created during load generation.

The Last Digit

August 22nd, 2010

When you measure the response time of some work being done and the tool reports a number like 0.345 sec, have you ever thought about the significance of the digits in this number?

Since we tend to take these numbers for granted, we’re saying the tool reports 0.345 seconds for response time, so it must actually measure 0.345 seconds. Actually, the result of any measurement is an approximate number. The accuracy of this number really depends on the accuracy of the tool we use for the measurement.

Similarly, processed and reported results have to be read with the same caution. We often round the results into an easy-to-read number. What many don’t think about is that the last digit of any measured and/or reported number actually tells us the range of possible results, not really that exact number. Let me take my favorite number of 0.345. What this tells me is the actual value should be greater than or equal to 0.3445, and less than 0.3455. It will never be exactly 0.3450 (and add any number of zeros you want). For measurements, this last digit also tells you the precision or degree of confidence that the result is somewhere in that given range. The more digits reported, the smaller the range and the higher the precision. A result reported as 0.3450 has ten times the precision of 0.345. The range of actual results would be from 0.34495 to 0.34505, which is 10 times smaller than the range of a result reported as 0.345.
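The range implied by the last digit is easy to compute. This short Python sketch (the function name `implied_range` is made up for this example) takes the reported number as a string, so that trailing zeros are preserved, and returns the half-open interval of values that would round to it.

```python
from decimal import Decimal


def implied_range(reported: str):
    """Return (low, high): values v with low <= v < high round to `reported`."""
    value = Decimal(reported)
    exponent = value.as_tuple().exponent        # e.g. -3 for "0.345"
    half_step = Decimal((0, (5,), exponent - 1))  # half of the last digit's unit
    return value - half_step, value + half_step


if __name__ == "__main__":
    for text in ("0.345", "0.3450"):
        low, high = implied_range(text)
        print(f"{text}: [{low}, {high})  width={high - low}")
```

Passing the number as text matters: as a float, 0.345 and 0.3450 are the same value, and the information carried by the trailing zero is lost.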

Some light food for heavy thoughts. Next time you report a number, or your tool reports a number, make sure to choose the last digits wisely. They do set expectations!

The Timing Fallacy (1 of 3)

August 1st, 2010

In this article, we will take a look at how typical load generation tools measure response times and point out the issues with this method :

The typical load generation script executes a path as illustrated in this simplified pseudo-code below:

    while (!done) {
        x = get time
        response = make request
        y = get time
        response time = y - x
        sleep thinktime
    }

Looks reasonable! But the devil is in the details. Let’s look into some of the issues:

  • When ‘make request’ is called, is it really making a request at that time? Assuming this is an HTTP request, the client system will need to format your request, create the HTTP headers, and write the request to the wire. After the response is received, the client needs to check the response, strip the headers, and return the response data. While this all looks trivial, the response time you’re getting actually includes a good amount of client processing. A slow client will give you slow response times while a fast client interacting with the same server will give you fast response times. This issue will be exaggerated as you deal with more complicated client logic, such as sending a SOAP web service request. The client will need to deal with generating the XML before making the request and parsing the XML after getting the response. The difference between a slow and fast client becomes much more apparent in such cases.
  • When we ask the load generator to sleep for a certain amount of time, we normally assume the process/thread is awakened exactly after the specified time elapses. On all but a very few systems, the sleep contract is actually a minimum sleep. If you ask to sleep for 10 seconds, you’ll wake up at least 10 seconds after the time you go to sleep. You may actually wake up any time after 10 seconds. It could be 10.1 seconds, 11 seconds, or even 15 seconds. Well, I’m exaggerating in the last case. But that would still be legal. When you deal with sleeps at the millisecond level, these effects become rather serious. It is actually very common for a sleep of 10 milliseconds to wake up only after 15 milliseconds. The more you sleep, the less load your test places on the server. Moreover, most tools don’t even tell you how much time was really spent sleeping (i.e., what the actual think time was). So you may get better response times and less throughput, not knowing that this is caused by the load generator spending all the time sleeping.
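The sleep overshoot described in the second point is easy to observe for yourself. The Python sketch below (function names are made up for this example) requests a short sleep several times and records how long each sleep actually took. The numbers depend entirely on the operating system, its timer resolution, and the load on the machine, so no particular result is claimed here.

```python
import time


def measured_sleep(requested: float) -> float:
    """Sleep for `requested` seconds and return the time actually spent."""
    start = time.perf_counter()
    time.sleep(requested)
    return time.perf_counter() - start


def overshoot_stats(requested: float, samples: int) -> dict:
    """Repeat the sleep and summarize the actual durations."""
    actual = [measured_sleep(requested) for _ in range(samples)]
    return {
        "requested": requested,
        "mean": sum(actual) / len(actual),
        "max": max(actual),
    }


if __name__ == "__main__":
    stats = overshoot_stats(0.010, samples=50)
    print(f"requested {stats['requested'] * 1000:.1f} ms, "
          f"mean {stats['mean'] * 1000:.2f} ms, max {stats['max'] * 1000:.2f} ms")
```

A load generator that records the actual think time this way can at least report how far the real pacing drifted from the requested one.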

Now, you may start wondering how to solve such issues. If I got your heads spinning and your brows furrowed now, good. Just stay tuned. We’ll get to our solution, soon.

Using Firebug when Load Testing

July 10th, 2010

There are two kinds of load generation tools for the web: simulated web browsers and real web browsers. The simulated browser uses a simplified HTTP client to access the system we want to measure. The client does not have the ability to process and display the web page. But the client also need not use the resources used by a full-blown web browser, and it ain’t little. If you don’t believe me, check the process size of your browser currently viewing this blog post. If you’re using any flavor of Linux or Mac OS X, just run ‘top’ and you’ll easily find Firefox or Safari on top with a significant amount of your memory consumed. Just look at the RES column. To get a perspective on how many resources you’ll need to simulate 10,000 concurrent users, just multiply that RES value by 10,000.

Not only does a browser need lots more resources, the response time typically also includes the page rendering time. While you may think “but I want it to include the page rendering time,” consider the difference in this rendering time between a 500MHz Pentium and a 3GHz Nehalem processor. Your measurement now includes the client’s CPU time, and your results will vary based on what that client CPU is. If you’re running these “real” browsers on a cloud, the variation can be from run to run, giving you no real insight into the “actual” response times.

On the other hand, simulated browsers are usually light-weight. We can easily fit a thousand such simulated browsers into a single 32bit process. Some of them give you the ability to measure response times very close to the socket, giving you fairly accurate server response times. You’ll be able to pinpoint server latency issues using the results of such measurements.

Great! But I still hear the argument: “I really want to know how long it takes to render the page under stress.” Oh yes you do. But do you really need to spend the resources needed to measure it with 10,000 processes? Since this is client-side processing, you actually just need one client. The pure rendering time of the browser is best measured on standardized client hardware interacting with an idle server. But if you want to know the user experience while the server is under stress, just use your browser while 10,000 simulated browsers are pounding your site.

Collecting the response times and rendering time from your single browser is simple enough. A Firefox extension called “Firebug” is one of your good friends for this task. Not only does Firebug allow you to investigate web pages and debug JavaScript, the “Net” tab allows it to capture and visualize response times, downloading sequence, and rendering time very clearly. Mouse over one of these bars and you’ll readily find detailed response time information about the request. The response time is broken down into great detail, such as DNS lookup time, connection time, pure waiting time, and time used for receiving data. A sample result is shown in the image below.

Firebug Net Panel

In this picture you can see the large server response time of 8.56 seconds for this request. The figure will vary given the application and server system utilization. You can also see it took the browser 16.98 seconds to render the page.

In summary, putting 10,000 browser instances on a large number of driver systems to drive a web workload is quite meaningless. A single browser instance, or a very few, is more than adequate. Leave the bulk of the work to the lightweight, simulated browsers that can do the job very efficiently. Focus your end-user response time measurement on a single browser. Firebug is your best friend here, giving you all the necessary information.

Welcome

June 30th, 2010

Performance and scalability are oftentimes an afterthought in the minds of many developers. Even applications that are designed with scalability in mind may fail to scale due to the inherent performance limitations of the underlying software infrastructure. Especially with web applications, there is this myth that all one needs to do is to add more powerful and/or more servers and performance and scalability problems will magically disappear. Unfortunately, performance problems can sometimes be hard to solve, and throwing hardware at the problem can only take one so far.

In this blog, we will talk about performance testing, scalability measurement, tools for developing workloads and measuring performance, load testing, and of course the features and functionality offered by TestnScale.