Bugzilla – Bug 110
Game is unplayable because of lag
Last modified: 2013-04-03 13:43:38 CEST
Come back to H&H, jo&lo; forget Salem.
Though the problem is obviously well-known, I guess an issue to track any progress on it can't hurt. :) If anyone wants to help, by far the best way would be to find a usefully reproducible system configuration that exhibits the same problem that the server has, such that it could be experimented upon to find the root cause.

The proximate cause of the problem is clearly I/O-related, in that I/O operations that should be simple (such as write-faulting a page in a memory-mapped file, opening a file, unlinking a file, and such things) can take multiple (and even multiple tens of) seconds. Miscellaneous observations that I have made include the following:

* Oftentimes, when such I/O dips happen, I observe (in vmstat) not increased I/O activity, but rather *decreased* activity, making it seem like the system is just not doing anything.
* I/O dips seem to correlate to some degree with sync/fsync/fdatasync calls occurring in other processes (for instance, the mysql daemon serving the forums and Bugzilla sometimes syncs its table files after completing a transaction, and that often seems to occur at the same time as an I/O dip in other processes). The exact nature and strength of the correlation are unknown, however.
* The system often has several gigabytes of RAM unallocated to processes, so there should be more than enough block cache space to not have to block I/O operations.
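As an illustration of how one might measure these stalls from user space (this script is my sketch, not something from the report; the temporary file path is a placeholder for the real filesystems), a small Python program can time individual fsync() calls. On a healthy system each call should take milliseconds; multi-second outliers would match the dips described above.

```python
import os
import tempfile
import time

def time_fsync(path, size=4096, rounds=5):
    """Append a small buffer and fsync after each write, timing each sync.

    Returns a list of per-fsync latencies in seconds. Latencies in the
    range of seconds (rather than milliseconds) would indicate the kind
    of I/O stall being discussed.
    """
    latencies = []
    with open(path, "wb") as f:
        for _ in range(rounds):
            f.write(os.urandom(size))
            f.flush()                      # push Python's buffer to the OS
            start = time.monotonic()
            os.fsync(f.fileno())           # force the data to stable storage
            latencies.append(time.monotonic() - start)
    return latencies

if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        for i, lat in enumerate(time_fsync(path)):
            print(f"fsync {i}: {lat * 1000:.2f} ms")
    finally:
        os.unlink(path)
```

Run against a file on each of the affected filesystems in turn, this could help confirm whether the stalls hit one particular device/LV or all of them.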
I suspect the problem is somehow connected to the I/O configuration, which currently looks like this (using glamorous ASCII art):

sda1 --.--> md0 --> swap
sdb1 -/

sda2 --.--> md1 --> /boot (ext3)
sdb2 -/

sda3 --.--> LVM PV --> LVM LV --> Minimap filesystem (ext3)
sdb3 -/

sda4 --.--> md2 --> LVM PV ---> LVM LV --> Root filesystem (xfs)
sdb4 -/                    \--> LVM LV --> Game data filesystem (xfs)

However, the nature of the server, as being both remotely hosted and holding a lot of data, makes it hard to experiment with lots of different configurations, so it's hard to nail down what exactly in this matrix actually constitutes the root cause. For that reason, it would be very helpful to have a more minimal system setup that causes the problem and can be experimented on.
In this context, it should be said that my own home server uses a somewhat similar I/O setup (it has two disks md-RAID'ed together into a LVM PV that hosts a couple of LVs with xfs, and some other disks merely LVM'd together to host a larger ReiserFS filesystem) and displays problems that are not entirely dissimilar (in particular, I run a `du' process on large parts of the filesystems once a week, which causes severe I/O dips), so it is probably a good place to start.
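As an aside (my suggestion, not something mentioned in the thread): on Linux, a heavy scan like that weekly `du' can be run under the idle I/O scheduling class with util-linux's ionice, so that it yields to any competing I/O. The idle class is only honored by the CFQ scheduler that kernels of this era used by default, and the paths below are placeholders:

```shell
# Disk-usage scan with idle I/O priority and minimal CPU priority;
# /tmp stands in for the real filesystems being scanned.
ionice -c3 nice -n19 du -sh /tmp

# The same idea as a weekly crontab entry (paths hypothetical):
# 0 4 * * 0  ionice -c3 nice -n19 du -sh /srv >> /var/log/weekly-du.log 2>&1
```

Whether this actually smooths out the dips would itself be a useful data point: if an idle-priority scan still stalls other processes, the problem is likely below the I/O scheduler.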
It is also worth noting some events that have affected the I/O performance of the system. Originally, the two disks in the system were not RAID'ed together: one disk held the root filesystem and minimap data, and the other held the game data. Under that regime, I/O was quite fine. The system was then running Debian 5.0, so Linux 2.6.26 IIRC. Then a near hard-drive crash occurred, and so I RAID'ed the disks to protect the data and the system, which made I/O much worse, though not entirely as atrocious as it is now. I then upgraded the system to Debian 6.0, since 5.0 had stopped being supported, which upgraded the kernel to 2.6.32 (and the init scripts no longer support 2.6.26, so I couldn't revert to it), and that made I/O performance the worst ever. I then installed 3.2.0 from Debian backports, which made I/O performance *slightly* better, but still left things as completely unplayable as they are now. Finally, I compiled 3.5.4 from source to be able to do some testing (it enabled me, among other things, to enable latencytop and ftrace), but it doesn't seem to have affected I/O performance at all, either positively or negatively. This is the kernel that the server is still using.
It's not entirely clear where your problem is, as I know very little about your server software's internals, but I might be able to meet you in person or online and have a chat about what could possibly be changed. I am located in the central Stockholm area on working days, and in Södertälje on evenings, weekends and holidays. As a last resort, you should probably revert to the old kernel and not touch it until the problem is solved. Meanwhile, build a staging server with the new Linux and try to nail down the problem and fix it locally.
I don't know much about it, but I found some articles on Google. Maybe you haven't seen them yet :) http://www.techforce.com.br/news/linux_blog/lvm_raid_xfs_ext3_tuning_for_small_files_parallel_i_o_on_debian#.UNax_MVmKYE
(In reply to comment #4) Just to establish my credentials, and why you might want to have a chat with me: http://se.linkedin.com/in/kvakvs/ I have been working with servers, clusters, distributed systems and the like for over 8 years. I currently use Erlang as my main development language, which is designed for creating fault-tolerant distributed applications. Erlang may possibly be of use to you, if you are looking to rewrite some old and unstable code. Also, I am conveniently located in Sweden :) I'm quite busy on some days, but we'll see what can be done.
loftar, please contact Dmytro Lytovchenko ASAP; he really could help.
I'm not particularly fond of realtime discussions unless they're really necessary, and I don't really see what would be better discussed over such a medium than here in the bugtracker (which leaves a public record, to boot).
It's not about discussions, it's about help; discussions only make noise. Never mind, this specialist is gone. Happy dying to H&H.
(In reply to comment #9) It's not about healing the game. It's about keeping it alive until the Salem release. Restoring the old kernel is 2 hours' to 1 day's work, and there's a 95% chance the lag would go away. And if jo&lo don't have time, there are Dmytro and even borka, who would fix it with their eyes closed; almost any average Linux user would be able to restore the kernel easily, if you don't have time to track the bug down. In my opinion, this is about making people believe "HnH is laggy, but Salem will be different, so play Salem." In my country we say: "If you don't know why, it's because of money."
1) Try adding noatime to each of your mount options in /etc/fstab. It may help a little. 2) Get rid of RAID. There have always been kernel bugs related to it, and it's hard to get it right without being an expert. From what you say, you only have two disks (probably in RAID 1?). Consider automatic backups at certain time points instead. Nobody is going to have a problem if 12 hours of game time are lost when reverting to an old backup. It's still better than dealing with a non-functional game.
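For reference, the noatime change from point 1 would look something like this in /etc/fstab (the device names and mount points below are made up for illustration; the server's real entries will differ):

```
# /etc/fstab — illustrative entries only, with noatime added
/dev/vg0/root      /            xfs   defaults,noatime   0  1
/dev/vg0/game      /srv/game    xfs   defaults,noatime   0  2
/dev/vg1/minimap   /srv/mmap    ext3  defaults,noatime   0  2
```

The relatime option (the kernel's default mount behavior since 2.6.30) is a milder alternative if access times are needed for anything; noatime simply skips the atime updates, and thus the extra metadata writes, entirely.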
Maybe I'm wrong, but judging from my conversations with SQL people, one option is to defragment the database. The second option to consider for the server's layout is to optimize the (real-time) backup copying, including swapping the distribution, from:

sda3 --.--> LVM PV --> LVM LV --> Minimap filesystem (ext3)
sdb3 -/

sda4 --.--> md2 --> LVM PV ---> LVM LV --> Root filesystem (xfs)
sdb4 -/                    \--> LVM LV --> Game data filesystem (xfs)

to:

sda3 --.--> LVM PV --> LVM LV --> Game data filesystem (xfs)
sdb3 -/

sda4 --.--> md2 --> LVM PV ---> LVM LV --> Root filesystem (xfs)
sdb4 -/                    \--> LVM LV --> Minimap filesystem (ext3)
http://www.google.com/search?ie=UTF-8&hl=ru&q=xfs%20vs%20ext3#hl=ru&tbo=d&q=xfs+vs+ext3+performance&revid=1830624578&sa=X&ei=DgzvUL76NYSF4gSwjYHICg&ved=0CI4BENUCKAA&bav=on.2,or.r_gc.r_pw.r_qf.&bvm=bv.1357700187,d.bGE&fp=9e8221ea17d1f7a4&biw=1680&bih=843
great job