Please note that this post is based on my experience with the Tsung version from around November 2012. As far as I can see, most commits added since then are bugfixes or minor changes. Perhaps one of them would change some of my conclusions, but nobody says that everybody always runs the most recent code from GitHub. 🙂
For those who are not yet aware of what Tsung is: it is a load-testing framework developed in Erlang by ProcessOne together with the community (including my small contribution for Websockets ;)). I used it only for ejabberd testing, so I have no opinion on its support for protocols other than Jabber over TCP/BOSH/Websockets.
Yeah, multi-protocol support. This definitely sounds nice. Unfortunately it forces some architectural decisions, which can make writing scenarios, if not hard, then sometimes troublesome.
In general, things I don’t like are:
Redundancy in scenario XML
Even if you specify the session type as Jabber, you still have to use <jabber> nodes for issuing commands such as connect or send message.
Inconsistent protocol options
For example: it took me some time to discover that BOSH doesn’t support plaintext authentication, unlike direct TCP or Websockets. One might ask why I even bothered with plaintext. The answer is: at the beginning I couldn’t find a valid sequence of nodes for SASL authentication. It’s really not that obvious:
<request><jabber type="auth_sasl" ack="local" /></request>
<request><jabber type="connect" ack="local" /></request>
<request><jabber type="auth_sasl_bind" ack="local" /></request>
<request><jabber type="auth_sasl_session" ack="local" /></request>
What is more, I don’t know why plaintext auth is even allowed, since Tsung by default sets the XML stream version to 1.0, which causes the server to deny this method.
Too large overhead when sending messages
Nothing extraordinary: I would like every client to send a message to any other online client. Of course it’s possible, but the CPU load on the controller node suggests that every client queries it for a random online JID. I would like to be able to tell clients that they can assume every client with a JID number lower than their own is online. In fact, I used scenarios divided into two phases (not to be mistaken for arrival phases), first connecting all users and then exchanging messages, and in such a case any JID is an online JID.
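The "lower JID is online" idea can be sketched in a few lines. This is not Tsung code, just a minimal illustration in Python; it assumes user IDs are assigned sequentially starting from 1 and that users connect in ID order, and the function name is made up for this example.

```python
import random

def pick_recipient(own_id, rng=random):
    """Pick a recipient ID locally, without asking a central node.

    Assumption: IDs start at 1 and users log in in ID order, so every
    ID lower than own_id already belongs to an online user.
    """
    if own_id <= 1:
        return None  # the first user has nobody to write to yet
    # randrange excludes the upper bound, so the result is in 1..own_id-1
    return rng.randrange(1, own_id)
```

The point is only that the choice needs nothing but the client's own ID, so no round trip to the controller would be required.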
Global ACK sucks
Any request can have three types of acknowledgements:
- none – will just continue to next command
- local – will continue after receiving any response from server
- global – will synchronize all clients at this point in scenario
Ejabberd and MongooseIM can take a lot of users before running out of resources. This means creating *many* client instances. Now imagine e.g. 100k instances trying to synchronize with each other at the same time. I’ll make it easier: a 100 Mbps network had a hard time processing this load.
The practical solution in some scenarios is to set the thinktime for each client to the estimated duration of the whole login phase, or to use dynamic variables to set the thinktime in a more adaptive way.
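As a rough illustration of the arithmetic behind both variants, here is a small Python sketch. It assumes a constant arrival rate and adds a safety margin in seconds; the function names and the margin are my own choices for the example, not anything Tsung provides.

```python
import math

def login_phase_thinktime(total_users, arrivals_per_second, margin_seconds=10):
    """Fixed thinktime: estimated length of the whole login phase plus a margin."""
    if arrivals_per_second <= 0:
        raise ValueError("arrivals_per_second must be positive")
    return math.ceil(total_users / arrivals_per_second) + margin_seconds

def adaptive_thinktime(user_index, total_users, arrivals_per_second, margin_seconds=10):
    """Adaptive thinktime: only the part of the login phase still remaining.

    user_index is 0-based; user i is assumed to arrive i / rate seconds
    after the start of the test.
    """
    phase = login_phase_thinktime(total_users, arrivals_per_second, margin_seconds)
    return max(0.0, phase - user_index / arrivals_per_second)
```

For 100k users and an assumed arrival rate of 500 users per second, the fixed variant gives 210 seconds for everyone, while the adaptive one lets late arrivals wait correspondingly less.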
Who cares if users really connected
I think intuition suggests that when a load-testing suite fails to connect a user, it should be screaming that something’s wrong. At least that should be the default behaviour. Of course, in the case of Tsung you have to check the session count on the server to make sure that the amazing results you’re getting are not caused by half of the users never starting a valid session.
In general: it’s not reliable when Tsung nodes become heavily loaded. I can understand it, of course. The thing is, I wasn’t particularly happy with using another Erlang node running custom code just for pinging two clients and checking latencies.
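The check I had to do by hand boils down to something like the following Python sketch. How the actual session count is obtained from the server is deliberately left out; the function name and the tolerance parameter are hypothetical.

```python
def check_sessions(expected, actual, tolerance=0.0):
    """Raise loudly if too many users never got a session on the server.

    expected:  number of users the scenario should have connected
    actual:    session count reported by the server (obtained elsewhere)
    tolerance: accepted fraction of missing users, e.g. 0.01 for 1%
    """
    missing = expected - actual
    if missing > expected * tolerance:
        raise RuntimeError(
            f"{missing} of {expected} users have no session on the server"
        )
    return missing
```

Running such a check right after the login phase at least makes it obvious when the numbers that follow are based on far fewer users than intended.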
A lot of “manufacture” when using more than one network interface
The good thing is that Tsung nodes can use more than one interface (Kernel bless the quite low limit on the port number range per interface…). Less nice is that for each client you have to specify precisely the IP of every interface you’d like to use. I think the preferable solution would be to specify a subnet and have the nodes use all interfaces that have an IP in that subnet.
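Until something like that exists, the per-client IP list can at least be generated. Below is a minimal Python sketch that filters a list of addresses by subnet and prints a client element with one ip entry per match. The host name and addresses are made up, the list of a machine's addresses has to be collected separately, and the exact element layout should be checked against the Tsung version in use.

```python
import ipaddress

def ips_in_subnet(addresses, subnet):
    """Return the addresses (as given) that belong to the subnet."""
    net = ipaddress.ip_network(subnet)
    return [a for a in addresses if ipaddress.ip_address(a) in net]

def client_element(host, addresses, subnet):
    """Build a <client> element listing every matching IP (layout is an assumption)."""
    ips = "".join(
        f'  <ip value="{a}"/>\n' for a in ips_in_subnet(addresses, subnet)
    )
    return f'<client host="{host}">\n{ips}</client>'

if __name__ == "__main__":
    # Hypothetical node with three interfaces, two of them in the test subnet
    print(client_element("node1", ["10.0.0.5", "192.168.1.7", "10.0.1.9"], "10.0.0.0/16"))
```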
Please do not think I hate this tool. It was very helpful:
Very good monitoring
Built-in monitoring (except for latency measurement) provides a wide range of parameters, of which the most helpful for me were CPU & RAM usage and request rate. Since gnuplot is used for creating the graphs, Tsung produces a lot of gnuplot-friendly data files you can use to create your own images. Actually, I usually opened the generated graphs in Eye of Gnome and had them constantly updated by watch periodically executing tsung_stats.pl.
Actually I never experienced a sudden crash e.g. after an hour of running the load test. When something went wrong, it wasn’t because of some bug in the core code.
Sometimes it requires more processing power to generate the load than handle it. Tsung was perfectly able to manage 10 nodes.
Does its job
Once I had dealt with the issues described earlier, Tsung eventually proved to be a good tool. It has many configuration options, and since it is written in Erlang, it wasn’t hard to adjust some aspects in the source code. It is fairly easy to use, with some nice sample scenarios publicly available. After the initial setup, performing the tests themselves was quite pleasant.
Would I ever use it again? Probably yes. I have the scenarios and the knowledge base that would allow me to skip the initial difficulties. On the other hand, I would certainly welcome a similar project dedicated to XMPP with a friendlier learning curve.