
I've been struggling with transactional consistency across the network in, of all things, Second Life. Yes, Second Life, the virtual world.

The classical web is mostly stateless, although that's changing with fancier web sites. Second Life has always been very much a persistent state system. Technically, this makes it an alternate universe to the classical web. So it hit these problems long ago.

Second Life's world is divided into regions 256 meters on a side. The viewer displays a seamless world, but internally, the seams are very real. Each region is maintained by a separate process, a "sim", loosely coupled to neighboring sims and the viewer.

Avatars and vehicles can cross region boundaries. Often badly. For over a dozen years, region crossing behavior in SL has been fragile. Objects sink through the ground or fly off into space at region crossings. Vehicles and avatars become separated. Avatars with elaborate clothing and attachments can even be damaged so badly that the user has to do extensive repair work. The Second Life community, and the developers of Second Life, were convinced this was un-fixable.

I became interested in this problem as a SL user and started to work on it. The viewer is open source, and the message formats are known. Within the Second Life world, objects are scriptable. So, even without access to the internals of the server, much can be done from the outside.

My goal was to make fast vehicles cross regions properly in Second Life. The first step was to fix some problems in the viewer. The viewer tries to hide the delay at a region crossing handoff with extrapolation from the last position and velocity. The extrapolation amplifies noise from the physics simulator, and can result in errors so bad that vehicles appear to roll over, fly into the air, or sink into the ground. I managed to limit extrapolation enough to restore movement sanity.
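The extrapolation fix can be sketched like this. This is a minimal illustration, not the viewer's actual code; the constants and function names are assumptions chosen to show the idea of clamping both the prediction window and the speed.

```python
# Sketch of clamped dead-reckoning for a position/velocity object update.
# MAX_EXTRAPOLATION_SECS and MAX_SPEED are hypothetical limits, not the
# viewer's real tuning values.

MAX_EXTRAPOLATION_SECS = 0.25  # stop predicting this long after the last update
MAX_SPEED = 50.0               # sanity limit on reported velocity, m/s

def extrapolate(last_pos, last_vel, dt):
    """Predict position dt seconds after the last update, with limits.

    Unlimited extrapolation amplifies physics-engine noise during a
    region-crossing stall; clamping keeps objects from visually rolling
    over, flying into the air, or sinking into the ground.
    """
    dt = min(dt, MAX_EXTRAPOLATION_SECS)           # cap the prediction window
    speed = sum(v * v for v in last_vel) ** 0.5
    if speed > MAX_SPEED:                          # reject implausible velocities
        scale = MAX_SPEED / speed
        last_vel = tuple(v * scale for v in last_vel)
    return tuple(p + v * dt for p, v in zip(last_pos, last_vel))
```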

Once that was under control, it was clear there were several different remaining problems. They now looked different visually, and could be attacked separately. There were several race conditions. I couldn't fix them completely from the outside, but I was able to detect them and prevent most of them from the scripting language code which controls vehicles. This got the mean time between failures up to about 30-40 minutes for a user driving a fast vehicle. When I started, that number was around 5-10.

The remaining problems were intermittent. I discovered that overloading the network connection to the viewer tended to induce this failure. So I used network test features in Linux to introduce errors, delays, and packet reordering. It turned out that adding 1 second of network delay, with no errors or reordering, would consistently break region crossings. This provided, for the first time, a repeatable test case for the bug.
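A small wrapper around Linux's netem queueing discipline is enough for this kind of fault injection. The sketch below builds the `tc qdisc` command line; the interface name is an assumption for illustration, and actually applying it requires root on Linux.

```python
# Sketch of a netem-based fault injector using Linux `tc`.
# Applying the qdisc requires root; "eth0" below is an assumed interface name.
import subprocess

def netem_cmd(iface, delay_ms=0, loss_pct=0.0, reorder_pct=0.0):
    """Build the `tc qdisc` command that adds a netem discipline."""
    cmd = ["tc", "qdisc", "add", "dev", iface, "root", "netem"]
    if delay_ms:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]
    if reorder_pct:
        cmd += ["reorder", f"{reorder_pct}%"]
    return cmd

def apply_netem(iface, **kwargs):
    subprocess.run(netem_cmd(iface, **kwargs), check=True)  # needs root

# The repeatable failure case: one second of pure delay, no loss, no reorder.
#   apply_netem("eth0", delay_ms=1000)
# Undo with: tc qdisc del dev eth0 root
```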

I couldn't fix it, but I could provide a repeatable test case to the vendor, Linden Lab. With some publicity, upper management was made aware of the bug, and effort is now being applied to solving it. It turns out that the network retransmission code has problems. (Second Life uses its own UDP-based protocol.) Fixing that may help, but it's not clear that it's the entire problem.

The underlying problem is that, during a region crossing, both region manager programs ("sims") and the viewer all have state machines which must make certain state transitions in a coordinated way. The error cases do not seem to have been thoroughly worked out. It's possible to get stuck out of sync. Second Life is supposed to be a consistent-eventually system, but that wasn't achieved.

This is roughly the same problem as the "500 test" in the parent article. If you have communication problems, both ends must automatically resolve to a valid consistent state when communication resumes. Distributed database systems have to do this. It's not easy.

Network-connected state machines are a pain to analyze. If the number of states at each end is not too large, examining all possible combinations by hand is feasible. That's what the 2 volumes of "TCP/IP Illustrated" do for TCP. If you create your own network connected state machines, you face a similar task. If you don't do it, your system will break.
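That hand-enumeration can be mechanized. The toy model below is invented for illustration (it is not the actual region-crossing protocol): side A hands an object to side B over a channel that may drop messages, and a breadth-first search over every reachable combination of (A's state, B's state, in-flight messages) finds the stuck-out-of-sync states.

```python
# Exhaustively explore two coupled state machines over a possibly lossy
# channel. The protocol is a made-up "handoff with ack" for illustration.
from collections import deque

# Rules: (state, incoming) -> (next_state, outgoing).
# An incoming of None means the step fires spontaneously.
A_RULES = {("owning", None): ("await_ack", "obj"),
           ("await_ack", "ack"): ("done", None)}
B_RULES = {("empty", "obj"): ("holding", "ack")}

def step(a, b, wire, lossy):
    """All successor global states of (A=a, B=b, in-flight=wire)."""
    succs = []
    for msg in [None] + sorted(wire):
        taken = wire - ({msg} if msg else set())
        if (a, msg) in A_RULES:
            na, out = A_RULES[(a, msg)]
            if out:
                succs.append((na, b, taken | {out}))
                if lossy:
                    succs.append((na, b, taken))  # outgoing message dropped
            else:
                succs.append((na, b, taken))
        if (b, msg) in B_RULES:
            nb, out = B_RULES[(b, msg)]
            if out:
                succs.append((a, nb, taken | {out}))
                if lossy:
                    succs.append((a, nb, taken))  # outgoing message dropped
            else:
                succs.append((a, nb, taken))
    return succs

def stuck_states(lossy):
    """Reachable states with no way forward that are not the goal state."""
    seen, stuck = set(), []
    queue = deque([("owning", "empty", frozenset())])
    while queue:
        s = queue.popleft()
        if s in seen:
            continue
        seen.add(s)
        succs = step(*s, lossy)
        if not succs and s[:2] != ("done", "holding"):
            stuck.append(s)
        queue.extend(succs)
    return stuck
```

With a reliable channel the search finds no stuck states; allow drops and it immediately finds the cases where one side thinks the handoff happened and the other doesn't. Real protocols need retransmission and timeout rules that the search must also model, which is exactly the analysis that tends to be skipped.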



Right. The article is actually quite bad and doesn't show much understanding of the problem. Fault tolerance and consistency are very well known and very well researched problems in distributed systems.

Without proper distributed algorithms a fault can leave both ends in an inconsistent state. One end might think the operation was successful, but the other end never saw it succeed and timed out. It doesn't know whether the operation succeeded or not. If there are users, most will assume failure in this case and try again, leading to two state mutations where there should be only one.
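The standard defense against the retry-after-timeout double mutation is to make the operation idempotent, for example with a client-chosen idempotency key. A minimal sketch, with invented names:

```python
# Sketch of server-side deduplication with idempotency keys: a timed-out
# retry replays the same key, so the mutation is applied at most once.
import uuid

class Server:
    def __init__(self):
        self.balance = 0
        self.seen = {}  # idempotency key -> cached result of that request

    def deposit(self, key, amount):
        if key in self.seen:            # retry of a request we already applied
            return self.seen[key]
        self.balance += amount          # the state mutation, applied once
        self.seen[key] = self.balance
        return self.seen[key]

server = Server()
key = str(uuid.uuid4())                 # client picks the key once per intent
server.deposit(key, 100)
server.deposit(key, 100)                # timed-out retry: no double deposit
```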

To solve this you can try to achieve consensus between both ends, but if there are users that means forcing them to wait an unbounded time, which is unrealistic of course, so consensus only helps somewhat. The other choice is eventual consistency, preferably strong eventual consistency, where write operations never fail and never force anyone to wait; they just get delayed and resynced later.
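Strong eventual consistency is what CRDTs (conflict-free replicated data types) provide. The grow-only counter below is a minimal example: local writes always succeed immediately, and because merge is a pairwise maximum (commutative, associative, idempotent), replicas converge no matter when or in what order they resync.

```python
# Sketch of strong eventual consistency with a grow-only counter (G-counter)
# CRDT. Each replica only increments its own slot; merge takes the max.
class GCounter:
    def __init__(self, replica_id):
        self.id = replica_id
        self.counts = {}                 # replica id -> its local increments

    def increment(self, n=1):            # local write: never fails, never waits
        self.counts[self.id] = self.counts.get(self.id, 0) + n

    def merge(self, other):              # pairwise max: order doesn't matter
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)           # concurrent writes while partitioned
a.merge(b); b.merge(a)                   # resync later, in any order
```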


For webcrap, it's probably best to avoid this problem entirely. Don't design systems that need tight consistency between client and server. You'll get it wrong.

Maybe only one side needs to be stateful. Consider a map display application, where the map is downloaded as tiles of various resolutions. The user can pan and scroll. The goal is to give the user a seamless experience. This may require loading low-rez tiles first, loading ahead in the direction of movement, and canceling requests for tiles that are not in yet and are no longer needed. It needs to be eventually consistent - whatever the user does, the map needs to settle into validity after a while.

But the server side can just blindly serve tiles as individual files. You can do all the stateful stuff on the client. No need for synchronization.
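The client-side bookkeeping for that design fits in a few lines. This is a simplified sketch, not any particular map library's API: the client tracks what the current viewport needs, cancels requests that panning made useless, and ignores late responses for cancelled tiles, while the server remains a dumb file server.

```python
# Sketch of a stateful tile-loading client against a stateless tile server.
# Tiles are (zoom, x, y) keys; fetching and rendering are elided.
class TileLoader:
    def __init__(self):
        self.pending = set()     # tiles requested but not yet received
        self.cache = {}          # tile -> image bytes

    def on_viewport_change(self, visible_tiles):
        visible = set(visible_tiles)
        self.pending &= visible            # cancel requests no longer needed
        for tile in visible - self.cache.keys() - self.pending:
            self.pending.add(tile)         # e.g. GET /tiles/{z}/{x}/{y}.png

    def on_tile_received(self, tile, data):
        if tile in self.pending:           # drop responses we already cancelled
            self.pending.discard(tile)
            self.cache[tile] = data

loader = TileLoader()
loader.on_viewport_change([(1, 0, 0), (1, 0, 1)])
loader.on_viewport_change([(1, 0, 1)])       # user panned: (1,0,0) cancelled
loader.on_tile_received((1, 0, 0), b"png")   # stale response: ignored
```

Whatever order the responses arrive in, the display settles into validity once the pending set drains, which is the eventual consistency the user experience needs.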

So don't go there unless you absolutely have to.


> For webcrap, it's probably best to avoid this problem entirely. Don't design systems that need tight consistency between client and server. You'll get it wrong.

> So don't go there unless you absolutely have to.

Couldn't agree more. It is possible to build feature-rich clients with few syncing calls as long as you are willing to replicate business logic on the client-side and always transfer state unidirectionally.



