
Scaling Tigase XMPP server on Amazon EC2

Does anyone have experience running clustered Tigase XMPP servers on Amazon EC2? Primarily I want to know about anything non-obvious that might trip me up. (For example, running Ejabberd on EC2 can apparently cause issues due to Mnesia.)

I would also welcome any general advice on installing and running Tigase on Ubuntu.

Extra information:

The system I’m developing uses XMPP just to communicate (in near real-time) between a mobile app and the server(s).

The number of users will initially be small, but will hopefully grow, which is why the system needs to be scalable. Presumably for just a few thousand users you wouldn’t need a cc1.4xlarge EC2 instance? (Otherwise this is going to be very expensive to run!)

I plan on using a MySQL database hosted in Amazon RDS for the XMPP server database.

I also plan on creating an external XMPP component written in Python, using SleekXMPP. This external component will do all the ‘work’ of the server, as the application I’m making is quite different from instant messaging. I have not yet worked out how to connect an external XMPP component written in Python to a Tigase server. The documentation seems to suggest that components are written specifically for Tigase, rather than for a general XMPP server using XEP-0114: Jabber Component Protocol, as I expected.

With this extra information, if you can think of anything else I should know, I’d be glad to hear it.

Thank you :)

Jon Cox asked Dec 29 '11 16:12

1 Answer

I have lots of experience with this, and there are quite a few non-obvious problems. For example, the only instance type that reliably runs an application like Tigase is cc1.4xlarge. Other types have problems with CPU availability, and it is a lottery whether your service lands on a host that is not busy with other people's work.

You also need an instance with the highest possible I/O performance to make sure it can cope with the network traffic. High I/O matters especially for the database instance.

It may or may not be obvious, but hostnames are a problem on EC2: every time you start an instance, both its hostname and its IP address change. A Tigase cluster is quite sensitive to hostnames. There is a way to force or change the hostname of an instance, so that may be a way around the problem.

Of course, I am talking about a cluster for millions of online users and really high traffic: 100k XMPP packets per second or more. Generally, for a large installation it is much cheaper and more efficient to use dedicated servers.

Generally Tigase runs very well on Amazon EC2, but you really need the latest SVN code, as it includes many optimizations, especially ones added after tests in the cloud. If you provide more details about your service, I may have more suggestions.
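
One possible way to keep cluster node names stable across restarts is to map fixed names to the current private IPs, for example through /etc/hosts entries on each node. The sketch below only renders such entries. The node names and addresses are hypothetical, and whether this is enough for a given Tigase cluster setup is an assumption to verify against the Tigase documentation.

```python
def render_hosts_entries(nodes):
    """Return /etc/hosts-style lines mapping each stable node name to its current IP.

    `nodes` is a dict of {stable_name: current_ip}. Entries are sorted by
    name so the output is the same on every node.
    """
    lines = []
    for name, ip in sorted(nodes.items()):
        lines.append(f"{ip}\t{name}")
    return "\n".join(lines)


# Hypothetical node names and private addresses, for illustration only.
cluster_nodes = {
    "xmpp1.example.internal": "10.0.0.11",
    "xmpp2.example.internal": "10.0.0.12",
}
print(render_hosts_entries(cluster_nodes))
```

The current IPs would have to be refreshed after every instance start, since that is exactly when EC2 changes them.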

More comments:

When it comes to cost, a dedicated server is always the cheaper option for a constantly running service. Unless you plan to switch servers on and off on an hourly basis, I would recommend a dedicated service. Costs are lower and performance is far more predictable.

However, if you really want or need to stick with Amazon EC2, here are some concrete numbers. Below is a list of instance configurations and how many online users each cluster was able to handle reliably:

  • 5*cc1.4xlarge - 1.7 million online users
  • 1*c1.xlarge - 118k online users
  • 2*c1.xlarge - 127k online users
  • 2*m2.4xlarge (with 5GB RAM for Tigase) - 236k online users
  • 2*m2.4xlarge (with 20GB RAM for Tigase) - 315k online users
  • 5*m2.4xlarge (with 60GB RAM for Tigase) - 400k online users
  • 5*m2.4xlarge (with 60GB RAM for Tigase) - 312k online users
  • 5*m2.4xlarge (with 60GB RAM for Tigase) - 327k online users
  • 5*m2.4xlarge (with 60GB RAM for Tigase) - 280k online users

A few more comments:

  1. Why does the amount of memory matter so much? Because CPU power is very unreliable and inconsistent on all but cc1.4xlarge instances. You have 8 virtual CPUs, but if you look at the top command you often see one CPU working and the rest doing nothing. This lack of CPU power makes the internal queues in Tigase grow. When the CPU power returns, Tigase can process the waiting packets. The more memory Tigase has, the more packets it can queue and the better it copes with CPU shortages.
  2. Why is 5*m2.4xlarge listed 4 times? Because I repeated the tests many times, on different days and at different times of day. As you can see, the system could handle a different load depending on the date and time. I guess this is because the Tigase instance shared CPU power with other services. When they were busy, Tigase suffered from a lack of CPU power.
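
To put the figures from the list above on a common footing, the following sketch divides each result by the number of instances and measures the spread across the four repeated 5*m2.4xlarge runs. The numbers are copied from the list; the function and variable names are made up for illustration.

```python
from statistics import mean

# (number of instances, instance type, online users), copied from the list above
results = [
    (5, "cc1.4xlarge", 1_700_000),
    (1, "c1.xlarge", 118_000),
    (2, "c1.xlarge", 127_000),
    (2, "m2.4xlarge", 236_000),
    (2, "m2.4xlarge", 315_000),
    (5, "m2.4xlarge", 400_000),
    (5, "m2.4xlarge", 312_000),
    (5, "m2.4xlarge", 327_000),
    (5, "m2.4xlarge", 280_000),
]


def users_per_instance(instances, users):
    """Average number of online users handled by one instance of the cluster."""
    return users / instances


for instances, kind, users in results:
    per_node = users_per_instance(instances, users)
    print(f"{instances}*{kind}: {per_node:,.0f} online users per instance")

# The four identical 5*m2.4xlarge runs show how much results vary over time.
repeated_runs = [u for n, kind, u in results if (n, kind) == (5, "m2.4xlarge")]
run_mean = mean(repeated_runs)
run_spread = max(repeated_runs) - min(repeated_runs)
print(f"5*m2.4xlarge: mean {run_mean:,.0f}, spread {run_spread:,.0f}")
```

On these figures the four identical runs range from 280k to 400k online users around a mean of about 330k, which is the variation attributed above to sharing CPU with other services.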

That said, I think you should be fine with an installation of up to 10k online users. However, other factors such as roster size matter greatly, as they affect traffic and load. Also, if other elements of your system generate significant traffic, this will add load.

In any case, without running some tests it is impossible to tell how your system will really behave or whether it can handle the load.

And the last question, regarding the component:
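
As a rough illustration of why roster size matters, consider login traffic alone: when a user comes online, the server broadcasts that user's presence to the contacts on their roster. The sketch below is a deliberately simplified model that assumes one presence stanza per roster contact per login and ignores probes, replies, and offline contacts. The input values are hypothetical, not measurements.

```python
def presence_stanzas_per_second(logins_per_second, roster_size):
    """Estimate outbound presence stanzas caused by logins alone.

    Simplified model (an assumption): each login produces one presence
    stanza for every contact on the user's roster.
    """
    return logins_per_second * roster_size


# Hypothetical workloads: same login rate, different average roster sizes.
for roster_size in (5, 50, 500):
    rate = presence_stanzas_per_second(20, roster_size)
    print(f"roster size {roster_size}: about {rate} presence stanzas per second")
```

Under this model the presence load grows linearly with the average roster size, so the same number of online users can produce very different traffic.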

Of course Tigase supports XEP-0114 and XEP-0225 for connecting external components, so components written in other languages should not be a problem. On the other hand, I recommend using Tigase's API for writing components. They can be deployed either as internal Tigase components or as external components, and this is transparent to the developer: you do not have to worry about it at development time, as it is part of the API and framework. You can also use all the extras of the Tigase framework, such as scripting capabilities, monitoring, and statistics, and development is much easier because you can deploy your code as an internal component for tests. You really do not have to worry about any XMPP-specific details; you just fill in the body of the processPacket(...) method and that's it. There should be enough online documentation for all of this on the Tigase website.

Also, I would suggest reading about Python's support for multi-threading and how it behaves under very high load. It used to be not so great.
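
For an external component in another language, the core of XEP-0114 is small: the component opens a stream to the server, receives a stream ID, and replies with a handshake that is the lowercase hex SHA-1 digest of the stream ID concatenated with the shared secret. A library such as SleekXMPP normally does this for you; the sketch below only shows the computation using the standard library. The function names, stream ID, and secret are hypothetical.

```python
import hashlib


def component_handshake(stream_id, secret):
    """Return the XEP-0114 handshake value.

    This is the lowercase hexadecimal SHA-1 digest of the stream ID
    concatenated with the secret shared between component and server.
    """
    data = (stream_id + secret).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def handshake_stanza(stream_id, secret):
    """Wrap the handshake value in the element the component sends back."""
    return f"<handshake>{component_handshake(stream_id, secret)}</handshake>"


# Hypothetical stream ID (sent by the server) and shared secret.
print(handshake_stanza("3BF96D32", "s3cr3t"))
```

The shared secret and the component's address also have to be configured on the server side; how that is done in Tigase is described in its own documentation and is not shown here.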

Artur Hefczyc answered Jan 22 '23 18:01