TestnScale Blog

November 12th, 2011

Informed articles about web performance and scalability

Measuring web application performance

July 30th, 2011

Going by the many posts in various LinkedIn groups and blogs, there seems to be some confusion about how to measure and analyze a web application’s performance. This article tries to clarify the different aspects of web performance and how to go about measuring it, explaining key terms and concepts along the way.

Web Application Architecture

The diagram below shows a high-level view of typical architectures of web applications.

The simplest applications have the web and app tiers combined while more complex ones may have multiple application tiers (called “middleware”) as well as multiple datastores.

The Front end refers to the web tier that generates the html response for the browser.

The Back end refers to the server components that are responsible for the business logic.

Note that in architectures where a single web/app server tier is responsible for both the front and back ends, it is still useful to think of them as logically separate for the purposes of performance analysis.

Front End Performance

When measuring front end performance, we are primarily concerned with understanding the response time that the user (sitting in front of a browser) experiences. This is typically measured as the time taken to load a web page. Performance of the front end depends on the following:

  • Time taken to generate the base page
  • Browser parse time
  • Time to download all of the components on the page (CSS, JS, images, etc.)
  • Browser render time of the page

For most applications, the response time is dominated by the 3rd bullet above i.e. time spent by the browser in retrieving all of the components on a page. As pages have become increasingly complex, their sizes have mushroomed as well – it is not uncommon to see pages of 0.5 MB or more. Depending on where the user is located, it can take a significant amount of time for the browser to fetch components across the internet.

Front end Performance Tools

Front-end performance is typically viewed as waterfall charts produced by tools such as the Firebug Net Panel. During development, Firebug is an invaluable tool to understand and fix client-side issues. However, to get a true measure of end user experience on production systems, performance needs to be measured from points on the internet where your customers typically are. Many tools are available to do this and they vary in price and functionality. Do your research to find a tool that fits your needs.

Back End Performance

The primary goal of measuring back end performance is to understand the maximum throughput that it can sustain. Traditionally, enterprises perform “load testing” of their applications to ensure they can scale. I prefer to call this “scalability testing”. Test clients drive load via bare-bones HTTP clients and measure the throughput of the application, i.e. the number of requests per second it can handle. To increase the throughput, the number of client drivers is increased until the point where throughput stops increasing or, worse, starts to drop off.

For complex multi-tier architectures, it is beneficial to break up the back end analysis by testing the scalability of individual tiers. For example, database scalability can be measured by running a workload just on the database. This can greatly help identify problems and also provides developers and QA engineers with tests they can repeat during subsequent product releases.

Many applications are thrown into production before any scalability testing is done. Things may seem fine until the day the application gets hit with increased traffic (good for business!). If the application crashes and burns because it cannot handle the load, you may not get a second chance.

Back End Performance Tools

Numerous load testing tools exist with varying functionality and price. There are also a number of open source tools available. Depending on resources you have and your budget, you can also outsource your entire scalability testing.

Summary

Front end performance is primarily concerned with measuring end user response times while back end performance is concerned with measuring throughput and scalability.

 

Response Time Metric for SLAs

February 18th, 2011

Service Level Agreements (SLAs) usually specify a response time criteria that must be met. Although SLAs can have a wide range of metrics like throughput, up time, availability etc., we will focus on response times in this article.

We often hear phrases like the following :

  • “The response time was 5 seconds”
  • “This product’s performance is much worse than slowpoke’s. It takes longer to respond.”
  • “Our whizbang product can perform 100 transactions/sec with a response time of 10 seconds or less”
    Do you see anything wrong in these statements? Although they sound fine for general conversation, anyone interested in performance should really be asking what exactly they mean.

    Let’s take the first statement above and make the assumption that it refers to a particular page in a web application. When someone says that the response time is 5 seconds, does it mean that when this user typed in the URL of this page, the browser took 5 seconds to respond? Or does it mean that in an automated test repeatedly accessing this page, the average response time was 5 seconds? Or perhaps, the median response time was 5 seconds?

    You get the idea. For some reason, people tend to talk loosely about response times. Without going into the details of how to measure the response time (that’s a separate topic), this article will focus on what makes a meaningful response time metric.

    For purposes of this discussion, let us assume we are measuring the response time of a transaction (which can be anything – web, database, cache etc.) What is the most meaningful measure for the response time of a transaction?

    Mean Response Time

    This is the most common measure of response time, but alas, usually is the most flawed as well. The mean or average response time simply adds up all the individual response times taken from multiple measurements and divides it by the number of samples to get an average. This may be fine if the measurements are fairly evenly distributed over a narrow range as in Figure 1.

    Figure 1: Steady Response Times

    Figure 2: Varying Response Times

    But if the measurements vary quite a bit over a large range like in Figure 2, the average response time is not meaningful. Both figures have the same scale and show response times on the y axis for samples taken over a period of time (x axis).

    Median Response Time

    If the average is not a good representation of a distribution, perhaps the median is? After all, the median marks the 50th percentile of a distribution. The median is useful when the response times do have a normal distribution but have a few outliers. In this case, the median helps to weed out the outliers. The key here is few outliers. It is important to realize that if 50% of the transactions are within the specified time, that means the remaining 50% have a higher response time. Surely, a response time specification that leaves out half the population cannot be a good measure.

    90th or 95th percentile Response Time

    In standard benchmarks, it is common to see 90th percentile response times used. The benchmark may specify that the 90th percentile response time of a transaction should be within x seconds. This means that only 10% of the transactions have a response time higher than x seconds and can therefore be a meaningful measure. For web applications, the requirements are usually even higher – after all, if 10% of your users are dissatisfied with the site performance, that could be a significant number of users. Therefore, it is common to see 95th percentile used for SLAs in web applications.

    A word of caution – web page response times can vary dramatically if measured at the last mile (i.e. real users’ computers that are connected via cable or DSL to the internet). Figure 3 shows the distribution of response times for such a measurement.

    Figure 3: Response Time Histogram

    It uses the same data as in Figure 2. The mean response time for this data set is 12.9 secs and the median is even lower at 12.3 secs. Clearly neither of these measures covers any significant range of the actual response times. The 90th percentile is 17.3 and the 95th is 18.6. These are much better measures for the response time of this distribution and will work better as the SLA.
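To make the comparison concrete, here is a minimal Python sketch that computes all four metrics for one set of measurements. The sample values are made up for illustration (they are not the data behind the figures), and the nearest-rank percentile method used here is only one of several common definitions, so other tools may report slightly different numbers.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return mean, median, 90th and 95th percentile of response times."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
    }


# Made-up response times in seconds: mostly fast, with a slow tail.
response_times = [1.0] * 15 + [2.0, 3.0, 8.0, 9.0, 10.0]
summary = summarize(response_times)
print(summary)
```

With a tail like this, the mean and median sit far below what the slowest 10% of requests experience, which is exactly what the higher percentiles expose.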

    To summarize, it is important to look at the distribution of response times before attempting to define an SLA. Like many other metrics, a one size fits all approach does not work. Response time measurements on the server side tend to vary a lot less than on the client. A 90th or 95th percentile response time requirement is a good choice to ensure that the vast majority of clients are covered.

    The Timing Fallacy (2 of 3)

    January 16th, 2011

    This is the second post in the series, a follow-up to the first. Click here to see the first post.

    In my previous post, I started with identifying response time and sleep problems. Let’s address the response time issue first. When we measure a server’s response time under load, we actually do not want the client side response time to be in this picture for the following reasons:

    1. The client’s performance and load can greatly affect the measured response time.
    2. The client part of the response time alone can easily be measured using Firebug.

    Traditional response time measurements

    The diagram to the left illustrates the typical approach to measuring response times. While this applies to any facility, using SOAP web services exaggerates the problem and makes it really visible.

    We capture the time before starting the request, make the web service request, and then capture the time after the response. The response time is the difference between the two times. What we often overlook is that this time includes:
    1) time to marshal the request into SOAP/XML, 2) time to format the HTTP request headers and build the request, 3) time on the wire and server response time, 4) time to process the HTTP response, and 5) time to unmarshal the XML into native objects. The time we want to measure is usually just the server response time, which may be only a small part of the measured response time.
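The breakdown above can be illustrated with a small Python sketch that timestamps each phase separately. Everything here is an assumption made for illustration: `fake_server_call` is a hypothetical stand-in that simply sleeps instead of talking to a real server, and plain XML built with the standard library stands in for a SOAP stack.

```python
import time
import xml.etree.ElementTree as ET


def fake_server_call(payload: bytes) -> bytes:
    """Hypothetical stand-in for wire time plus server time."""
    time.sleep(0.02)
    return b"<reply><status>ok</status></reply>"


def timed_request(order_id: str) -> dict:
    """Time the client-side phases separately from the call itself."""
    t0 = time.perf_counter()
    root = ET.Element("order")                      # marshal request to XML
    ET.SubElement(root, "id").text = order_id
    payload = ET.tostring(root)
    t1 = time.perf_counter()
    raw = fake_server_call(payload)                 # wire + server time
    t2 = time.perf_counter()
    status = ET.fromstring(raw).findtext("status")  # unmarshal response
    t3 = time.perf_counter()
    return {
        "status": status,
        "marshal": t1 - t0,
        "call": t2 - t1,
        "unmarshal": t3 - t2,
        "total": t3 - t0,   # what the traditional measurement reports
    }


timings = timed_request("42")
print(timings)
```

The traditional approach reports only `total`. In this toy case the client-side work is tiny, but with a real SOAP library and a busy load driver the `marshal` and `unmarshal` parts can become a substantial share of it.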

    To measure the server response times with minimal effect from the client, the measurement needs to happen as close to the wire as possible. While many load generators don’t care about this problem, others approach it by implementing their own protocol stacks. To cover a wide variety of protocols, they need to implement a protocol stack for each protocol they want to support. The drawback of this method is the high maintenance cost of each protocol stack as well as the proprietary APIs for each protocol. As you can imagine, this is extremely laborious, and there are not many tools that do this.

    The problem is exacerbated for secure communications over SSL. Most load generators make use of a client-side library such as OpenSSL or Apache HttpClient and take measurements before/after the client library call. This adds the entire client-side encryption/decryption overhead to the response time.

    By now we should be clear about the basic issues with response time measurements. However, there is another time component that greatly affects your results – the inter-arrival time or think time which is a sleep time component. In my next and last post of the series, I’ll talk about errors around such sleep times. Unfortunately, this is also the hardest problem to understand and solve.

    Record and Playback – Does it really work ?

    January 6th, 2011

    Many tools like LoadRunner and JMeter that help develop load tests provide a simple record and playback mechanism. They either use a proxy server or a browser plugin. All you do is traverse the web application as a normal user would. Your interactions with the application are captured and used to create a playback script. Voila ! You have a test case. Run the required number of emulated users, each executing this script, and your workload is ready. Or … is it really ?

    If all your users act like a linear computer program executing at a fixed pace, your recorded script may work. But the truth is human beings rarely follow a single path, let alone follow it in a predetermined time. Your users will make one of the many choices available to them in your site, at the pace they desire.

    Two factors need to be taken into account when modeling user behavior:

    1. The decision tree of which option to choose at any particular point (the Operation Mix).
    2. The time to follow through to the next operation (called the Timing).

    The rest of this article will address the operation mix, data generation, and other issues involved in record and playback. As operation timing is a slightly independent topic by itself, it will be addressed in a different article.

    Operation Mix

    Tools differ in the way they create a workload from the recorded actions. The primary difference is in how they create an Operation Mix, i.e. the proportion of the various types of operations (aka requests) that the test makes.

    • Fixed Sequence: This is the simplest method in which each emulated user simply submits the exact same sequence of recorded operations. It may be the simplest but obviously the most flawed as well. For a real application, seldom do users traverse the site in the exact same sequence. As such, this mix creates a very artificial workload.
    • Flat Mix: In this method, the test developer identifies the types of operations (either during the record session by pausing between operations or editing the generated script). The workload then consists of randomly selecting a particular operation, assigning an equal probability to all of them. Some tools may go a step further and allow the probability to be changed (i.e. operation1 executes 50% of the time, operation2 executes 20% of the time etc.). In either case, this method is extremely flawed because some of the generated sequences may make no sense at all from the application’s perspective. Websites are never navigated at random. In many sites, one needs to first log in to perform certain operations. In other cases, it is necessary to follow a sequence for certain operations (e.g. shopping cart -> checkout -> shipping options -> payment). As such this method completely fails to create a correct web workload.
    • Flat Sequence Mix: Tools that use this method (both LoadRunner and JMeter do) allow the user to record multiple use cases (referred to as scenarios). Each use case is then treated as a fixed sequence and the overall mix is created by specifying a different probability for each sequence. Although this method is more realistic than the previous two, it can quickly become unwieldy as the number of scenarios increases: the scenarios grow quadratically, quickly exasperating the test developer.

    The fact is that web application navigation is best represented by a state diagram and the best method to solve this navigation is by use of a stochastic model. This model is known as MatrixMix in Faban and is best created algorithmically – not by record and playback. An example of such a mix is given below. The first row states that if the user is currently on the home page, the probability of going to the products page is 80% and to the contacts page is 20%.

    From            To home.html   To products.html   To contact.html
    home.html       0%             80%                20%
    products.html   20%            39%                41%
    contact.html    60%            19%                21%
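As a rough illustration of how such a matrix can drive navigation, the following Python sketch performs a random walk over the three pages using the probabilities from the example. It is not Faban's MatrixMix implementation; the names `MATRIX_MIX`, `next_page` and `walk` are made up for this sketch.

```python
import random

# Transition probabilities taken from the example matrix:
# MATRIX_MIX[current_page][next_page] = probability.
MATRIX_MIX = {
    "home.html": {"home.html": 0.00, "products.html": 0.80, "contact.html": 0.20},
    "products.html": {"home.html": 0.20, "products.html": 0.39, "contact.html": 0.41},
    "contact.html": {"home.html": 0.60, "products.html": 0.19, "contact.html": 0.21},
}


def next_page(current, rng):
    """Pick the next page according to the row for the current page."""
    row = MATRIX_MIX[current]
    pages = list(row)
    weights = [row[page] for page in pages]
    return rng.choices(pages, weights=weights, k=1)[0]


def walk(start, steps, rng):
    """Return the sequence of pages one emulated user visits."""
    path = [start]
    for _ in range(steps):
        path.append(next_page(path[-1], rng))
    return path


rng = random.Random(42)  # fixed seed so a run can be repeated
print(walk("home.html", 10, rng))
```

Each emulated user would run its own walk, so different users follow different but always plausible paths through the site.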

    Data Generation

    Often, many web operations will require a variety of input data. The record-and-playback tools usually deal with this by having test developers edit the generated script to parameterize the input fields. The values for these fields are then read from files that the developer must somehow populate. For instance, if a user login name is required, the developer must create a file with all the login names that the workload must use (usually, by dumping the data out from the application’s database). Imagine what this process will be like if a site has millions of registered users. The workload must then choose one name for each emulated user. For other parameters, we may really want the workload to choose a different value for each operation executed (not just one per emulated user). These kinds of choices usually require some kind of coding – be it an XML (or other proprietary) script or coding in a programming language. (It’s interesting to note that although LoadRunner claims to use scripts, the code is actually C or Java and must in fact be compiled). It turns out that in many cases, this coding can be quite extensive, blowing away the so-called “no coding required” record-and-playback claims that the tool vendors make.
    If a tool claims that no coding is required at all, be suspicious. It is very likely that it does not provide enough flexibility for data generation. Tools that use scripting may also not allow flexibility to manipulate data.
    Also note that requiring all parameterized field values to be in files means the data cannot be programmatically generated.

    The fact is that a well-designed workload requires a robust mechanism in order to both generate request data and process response data.

    New Data Generation

    So far we have only talked about input data for operations that retrieve known/existing data from the application’s data store. Most Web 2.0 sites allow a considerable amount of new data to be uploaded by users – whether new blog or wiki entries, comments or ratings, profile information, photos etc. How does a record-and-playback methodology work for this ? One cannot pull data from a database to pre-load a parameter file, so either these ‘Add’ operations will repeatedly use the same data (which can of course cause the application to fail if, for example, the same username is entered twice) or the tool must provide some way for the workload developer to specify how these parameters are to be generated. Note that different parameters may have different syntax and semantic requirements. If there is a load generator tool that can effectively generate new data without requiring programming, I’d like to know about it.
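For a sense of what generating new data programmatically can look like, here is a minimal Python sketch. The username format, the vocabulary and the function names are all hypothetical; a real workload would have to follow the syntax and semantic rules of the application under test.

```python
import itertools
import random
import string


def username_generator(prefix="user", seed=0):
    """Yield usernames that never repeat: the counter guarantees
    uniqueness, the random suffix avoids purely sequential names."""
    rng = random.Random(seed)
    for n in itertools.count(1):
        suffix = "".join(rng.choices(string.ascii_lowercase, k=4))
        yield f"{prefix}{n}_{suffix}"


def random_comment(rng, min_words=3, max_words=12):
    """Build a comment of random length from a small made-up vocabulary."""
    vocabulary = ["great", "post", "thanks", "slow", "page",
                  "checkout", "works", "fast"]
    count = rng.randint(min_words, max_words)
    return " ".join(rng.choice(vocabulary) for _ in range(count))


names = username_generator()
rng = random.Random(7)
for _ in range(3):
    print(next(names), "->", random_comment(rng))
```

Because the values are computed on the fly, no parameter file has to be populated in advance and an ‘Add’ operation never submits the same username twice within a run.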

    Workload Scaling

    For a workload to be used for load testing or capacity planning purposes, it needs to be run at different load levels. This is achieved by using one or more scale factors by which both the initial data store and the load scale. Simply adding emulated users without due consideration to the data store will not create a proper workload. More on this topic, with several examples of how real applications scale, can be found in the paper “Performance Workload Design”. Record and playback tools have no mechanism to handle realistic scaling – one has to achieve this programmatically.

    Non-web Workloads

    This issue is obvious – record and playbacks can only work for web workloads where a proxy can be used to capture user interactions. Of course, the mechanism can work with any type of interactive application provided a “proxy” for the protocol used by the application is in place. LoadRunner does provide proxies for various protocols but it’s easy to see that this method can become pretty unwieldy quickly and results in product bloat.

    It is better to find a tool that provides a good framework and code your own load generator for the specific protocol that you want to test. The process can be eased considerably if the framework understands various commonly used protocols and provides the ability to plug in other protocols as well.

    Summary

    To summarize, here are key points to remember while using a recording tool to generate a load test :

    • Use a realistic mix of operations. No real user executes scenarios stepping through the same sequence of pages in exactly the same way.
    • Ensure that the back-end data sources are exercised in the same way as in production. This means, not using a limited data-set that all emulated users share.
    • Test creation/upload of new data to the application. This requires new, random data to be created during load generation.

    Scalability Testing

    October 10th, 2010

    We often hear the terms Load Testing or Performance Testing, but no one talks much about Scalability Testing. Before I go further, let me define these terms so you know what I am talking about :

    • Load Testing refers to the kind of testing usually done by QA organizations to ensure that the application can handle a certain load level. Criteria are set to ensure that releases of a product meet certain conditions like the number of users they can support while delivering a certain response time.
    • Performance Testing on the other hand, refers to testing done to analyze and improve the performance of an application. The focus here is on optimization of resource consumption by analyzing data collected during testing. Performance Testing to a certain extent should be done by developers but more elaborate, large scale testing may be conducted by a separate performance team. In some organizations, the performance team is a part of the QA function.
    • Scalability Testing refers to performance testing that is focused on understanding how an application scales as it is deployed on larger systems and/or more systems or as more load is applied to it. The goal is to understand at what point the application stops scaling and identify the reasons for this. As such scalability testing can be viewed as a kind of performance testing.

    In this article, we will consider how scalability testing should be done to ensure that the results are meaningful.

    Workload Definition

    The first requirement for any performance testing is a well-designed workload. See my Workload Design paper for details on how to properly design a workload. Many developers and QA engineers typically craft a workload quickly by focusing on a couple of different operations (e.g. if testing a web application, a recording tool is used to create one or two scenarios). I have pointed out the pitfalls of this method in a previous post. So take care while creating your workload. Extra time invested in this step will more than pay off in the long run. Remember, your test results are only as good as the tests you create!

    Designing Scalability Tests

    Scalability tests should be planned and executed in a systematic manner to ensure that all relevant information is collected. The parameter by which load is increased obviously depends on the type of app – for web apps, this would typically be the number of simultaneous users making requests of the site. Think about what other parameters might change for your application. If the application accesses a database, will the size of the db change in some relation to the number of users accessing it ? If it uses a caching tier, might it be reasonable to expect that the size of this cache will expand ? Consider the data accessed by your workload – how is this likely to change ? Both the data generator and load generator drivers need to be implemented in a way that supports workload and data scaling.

    Collecting Performance Data

    When running the tests, ensure you can collect sufficient performance metrics so as to be able to understand what exactly is happening on the application infrastructure. One set of metrics is from the system infrastructure – CPU, memory, swap, network and disk I/O data. Another is from the software infrastructure – web, application, caching (memcached) and database servers all provide access to performance data. Don’t forget to collect data on the load driver systems as well. I have seen many a situation in which the driver ran out of memory or swap and it took a while to figure this out because no one was looking at the driver stats ! All performance metrics should be collected for the same duration as the test run.

    Running Scalability Tests

    With planning done, it is time to run the performance tests. You want to start at a comfortable scale factor – say 100 users – and increase by the same amount every time (e.g. 100 users at a time). Some tools let you run a single test while varying the load – although this may be acceptable for load testing, I would discourage such short-cuts for scalability testing. The goal is not just to get to the maximum load but to understand how the system behaves at every step. Without the detailed performance data, it is difficult to do scalability analysis. Do scaling runs to a point a little beyond where the system stops scaling (i.e. throughput stays flat or, worse, starts to fall) or you run out of system resources.
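Once the separate runs are done, the point where scaling stops can be picked out of the results with a few lines of Python. The sketch below is only one possible heuristic: the 5% minimum gain is an arbitrary assumption, the function name is made up, and the throughput figures are invented for illustration.

```python
def find_scaling_limit(runs, min_gain=0.05):
    """Given (users, throughput) pairs from separate runs, sorted by users,
    return the load of the last run whose throughput still improved by at
    least min_gain (as a fraction) over the previous run."""
    if not runs:
        raise ValueError("no runs")
    limit = runs[0][0]
    for (_, prev_tput), (users, tput) in zip(runs, runs[1:]):
        if tput < prev_tput * (1 + min_gain):
            break  # throughput went flat or started to fall
        limit = users
    return limit


# Invented results: requests/sec measured at each load level.
runs = [(100, 95.0), (200, 190.0), (300, 280.0), (400, 285.0), (500, 260.0)]
print("stops scaling after", find_scaling_limit(runs), "users")
```

Here throughput barely moves between 300 and 400 users and then falls, so 300 users is reported as the last load level that still scaled. The detailed performance data from the 300 and 400 user runs is where the analysis should start.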

    Now comes the fun part – analyzing all the data. I will cover this in another post.

    The Last Digit

    August 22nd, 2010

    When you measure the response time of some work being done and the tool reports a number like 0.345 sec, have you ever thought about the significance of the digits in this number?

    Since we tend to take these numbers for granted, we’re saying the tool reports 0.345 seconds for response time, so it must actually measure 0.345 seconds. Actually, the result of any measurement is an approximate number. The accuracy of this number really depends on the accuracy of the tool we use for the measurement.

    Similarly, processed and reported results have to be read with the same caution. We often round the results into an easy-to-read number. What many don’t think about is that the last digit of any measured and/or reported number actually tells us the range of possible results, not really that exact number. Let me take my favorite number of 0.345. What this tells me is the actual value should be greater than or equal to 0.3445, and less than 0.3455. It will never be exactly 0.3450 (and add any number of zeros you want). For measurements, this last digit also tells you the precision or degree of confidence that the result is somewhere in that given range. The more digits reported, the smaller the range and the higher the precision. A result reported as 0.3450 has ten times the precision of 0.345. The range of actual results would be from 0.34495 to 0.34505, which is 10 times smaller than the range of a result reported as 0.345.
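The range implied by a reported number can be worked out mechanically. This small Python sketch assumes the reported value was produced by ordinary rounding to its last shown digit; the function name `implied_range` is made up for illustration. It takes the number as a string so that trailing zeros are preserved.

```python
from decimal import Decimal


def implied_range(reported: str):
    """Return (low, high) with low <= actual < high for a value that was
    rounded to the last digit shown in the reported string."""
    value = Decimal(reported)
    # One unit in the last reported digit, e.g. 0.001 for "0.345".
    unit = Decimal(1).scaleb(value.as_tuple().exponent)
    half = unit / 2
    return value - half, value + half


for text in ("0.345", "0.3450"):
    low, high = implied_range(text)
    print(f"{text}: {low} <= actual < {high} (width {high - low})")
```

Passing "0.345" and "0.3450" shows directly that the extra digit shrinks the range by a factor of ten.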

    Some light food for heavy thoughts. Next time you report a number, or your tool reports a number, make sure to choose the last digits wisely. They do set expectations!

    The Timing Fallacy (1 of 3)

    August 1st, 2010

    In this article, we will take a look at how typical load generation tools measure response times and point out the issues with this method :

    The typical load generation script executes a path as illustrated in this simplified pseudo-code below:

        while (!done) {
            x = get time
            response = make request
            y = get time
            response time = y - x
            sleep thinktime
        }

    Looks reasonable! But the devil is in the details. Let’s look into some of the issues:

    • When ‘make request’ is called, is it really making a request at that time? Assuming this is an HTTP request, the client system will need to format your request, create the HTTP headers, and write the request to the wire. After the response is received, the client needs to check the response, strip the headers, and return the response data. While this all looks trivial, the response time you’re getting actually includes a good amount of client processing. A slow client will give you slow response times while a fast client interacting with the same server will give you fast response times. This issue will be exaggerated as you deal with more complicated client logic, such as sending a SOAP web service request. The client will need to deal with generating the XML before making the request and parsing the XML after getting the response. The difference between a slow and fast client becomes much more apparent in such cases.
    • When you ask the load generator to sleep for a certain amount of time, we normally assume the process/thread is awakened exactly after the specified time elapses. On all but a very few systems, the sleep contract is actually a minimum sleep. If you ask to sleep for 10 seconds, you’ll wake up at least 10 seconds after the time you go to sleep. You may actually wake up any time after 10 seconds. It could be 10.1 seconds, 11 seconds, or even 15 seconds. Well, I’m exaggerating in the last case. But that would still be legal. When you deal with sleeps at the millisecond level, these effects become rather serious. It is actually very common for a sleep of 10 milliseconds to wake up only after 15 milliseconds. The more you sleep, the less load you place on the system under test. Moreover, most tools don’t even tell you how much time was really spent sleeping (i.e. what the actual think time was). So you may get better response times and less throughput, not knowing that this is caused by the load generator spending all the time sleeping.
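The sleep effect is easy to observe on your own driver machine. The following Python sketch requests 10 millisecond sleeps and records how long each one actually took. The function name is made up, and the numbers you get depend entirely on the operating system, its timer resolution and the load on the machine, so no particular result is implied.

```python
import time


def measure_sleep(requested_seconds, repetitions=20):
    """Return the actual durations of repeated sleeps of the given length."""
    actual = []
    for _ in range(repetitions):
        start = time.perf_counter()
        time.sleep(requested_seconds)
        actual.append(time.perf_counter() - start)
    return actual


durations = measure_sleep(0.010)
average = sum(durations) / len(durations)
print(f"requested 10.00 ms, average actual {average * 1000:.2f} ms, "
      f"worst {max(durations) * 1000:.2f} ms")
```

A load generator that records the actual sleep time in this way, rather than assuming the requested one, can at least report the real think time alongside its response times.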

    Now, you may start wondering how to solve such issues. If I got your heads spinning and your brows furrowed now, good. Just stay tuned. We’ll get to our solution, soon.

    Using Firebug when Load Testing

    July 10th, 2010

    There are two kinds of load generation tools for the web: simulated web browsers and real web browsers. The simulated browser uses a simplified HTTP client to access the system we want to measure. The client does not have the ability to process and display the web page. But the client also need not use the resources used by a full-blown web browser, and it ain’t little. If you don’t believe me, check the process size of your browser currently viewing this blog post. If you’re using any flavor of Linux or Mac OS X, just run ‘top’ and you’ll easily find Firefox or Safari near the top with a significant amount of your memory consumed. Just look at the RES column. To get a perspective on how much memory you’ll need to simulate 10,000 concurrent users with real browsers, just multiply that RES value by 10,000.

    Not only does a browser need lots more resources, the response time typically also includes the page rendering time. While you may think “but I want it to include the page rendering time,” consider the difference in this rendering time between a 500MHz Pentium and a 3GHz Nehalem processor. Your measurement now includes the client’s CPU time, and your results will vary based on what that client CPU is. If you’re running these “real” browsers on a cloud, the variation can be from run to run, giving you no real insight into the “actual” response times.

    On the other hand, simulated browsers are usually light-weight. We can easily fit a thousand such simulated browsers into a single 32bit process. Some of them give you the ability to measure response times very close to the socket, giving you fairly accurate server response times. You’ll be able to pinpoint server latency issues using the results of such measurements.

    Great! But I still hear the argument: “I really want to know how long it takes to render the page under stress.” Oh yes you do. But do you really need to spend the resources needed to measure it with 10,000 processes? Since this is client side processing, you actually just need one client. The pure rendering time of the browser is best measured on standardized client hardware interacting with an idle server. But if you want to know the user experience while the server is under stress, just use your browser while 10,000 simulated browsers are pounding your site.

    Collecting the response times and rendering time from your single browser is simple enough. A Firefox extension called “Firebug” is one of your good friends for this task. Not only does Firebug allow you to investigate web pages and debug JavaScript, its “Net” tab captures and visualizes response times, downloading sequence, and rendering time very clearly. Mouse over one of the bars and you’ll readily find detailed response time information about the request. The response time is broken down into great detail, such as DNS lookup time, connection time, pure waiting time, and time used for receiving data. A sample result is shown in the image below.

    Firebug Net Panel

    In this picture you can see the large server response time of 8.56 seconds for this request. The figure will vary given the application and server system utilization. You can also see it took the browser 16.98 seconds to render the page.

    In summary, putting 10,000 browser instances on a large number of driver systems to drive a web workload is quite meaningless. A single or very few browser instances is more than adequate. Leave the bulk of the work to the lightweight, simulated browsers that can do the job very efficiently. Focus your end-user response time measurement using a single browser. Firebug is your best friend giving you all the necessary information.

    Welcome

    June 30th, 2010

    Performance and scalability are often an afterthought in the minds of many developers. Even applications that are designed with scalability in mind may fail to scale due to the inherent performance limitations of the underlying software infrastructure. Especially with web applications, there is this myth that all one needs to do is to add more powerful and/or more servers and performance and scalability problems will magically disappear. Unfortunately, performance problems can sometimes be hard to solve and throwing hardware at the problem can only take one so far.

    In this blog, we will talk about performance testing, scalability measurement, tools for developing workloads and measuring performance, load testing and of course the features and functionality offered by TestnScale.