Sync losses
Synchronization loss? What does it mean?
Synchronization losses happen in network games when one or more clients see a different game than the host.
Because network games in Clonk are synchronized by transferring player controls only, this is a fatal error from which the game cannot easily recover. Once host and client differ even slightly, the difference usually snowballs quickly into much larger offsets, and the root cause can hardly be determined just by looking at the game states. This is due to the highly dynamic game and factors such as the global random number generator, which pushes the game in completely different directions once there has been a small mismatch between host and clients.
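The snowball effect can be illustrated with a toy simulation. This is only a sketch in Python, not engine code: the class `ToyGame`, the helper `run` and the movement rule are made up for illustration. Host and client start from the same seed and receive the same controls; the client consumes one extra random number at frame 10, and from that frame on the two states no longer match.

```python
import random

class ToyGame:
    """Toy lockstep game: state changes only through controls and one shared RNG."""

    def __init__(self, seed):
        self.rng = random.Random(seed)   # seeded identically on every machine
        self.rng_calls = 0               # loosely analogous to the Rnc counter
        self.positions = [100, 200, 300]

    def synced_random(self, n):
        self.rng_calls += 1
        return self.rng.randrange(n)

    def step(self, control):
        # Each object moves by the player control plus a random jitter in -2..2.
        for i in range(len(self.positions)):
            self.positions[i] += control + self.synced_random(5) - 2


def run(frames, extra_call_at=None):
    """Simulate `frames` frames; optionally burn one extra random number."""
    game = ToyGame(seed=42)
    history = []
    for frame in range(frames):
        if frame == extra_call_at:
            game.synced_random(2)        # the single mismatch
        game.step(control=1)
        history.append((game.rng_calls, tuple(game.positions)))
    return history


host = run(50)
client = run(50, extra_call_at=10)
first_diff = next(i for i, (h, c) in enumerate(zip(host, client)) if h != c)
print("first differing frame:", first_diff)
print("host  :", host[-1])
print("client:", client[-1])
```

Comparing the last entries of `host` and `client` shows how far the states have drifted apart, although the only mismatch was a single additional call to the random number generator.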
The game checks synchronization by transferring packets of key checksum parameters (such as the random number generator state and Clonk positions) and comparing the local state against the host state on every client. Because the network host, by definition, always has the "correct" game state, only network clients can receive a sync loss message.
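The client-side comparison could look roughly like the following sketch. The names `SyncCheck` and `ClientSyncChecker`, and the choice of four fields, are assumptions for illustration only and do not reflect the engine's actual data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SyncCheck:
    """A few checksum values taken at one frame (hypothetical subset)."""
    frame: int
    random_count: int
    crew_pos_x: int
    object_count: int


class ClientSyncChecker:
    """Stores the client's own checks and compares host packets against them."""

    def __init__(self):
        self.own = {}   # frame -> SyncCheck computed locally

    def record_local(self, check):
        self.own[check.frame] = check

    def on_host_packet(self, host_check):
        """Return True if in sync, False on sync loss, None if not yet comparable."""
        local = self.own.get(host_check.frame)
        if local is None:
            return None          # the client has not simulated this frame yet
        return local == host_check


checker = ClientSyncChecker()
checker.record_local(SyncCheck(100, 98213, 116852, 1703))
in_sync = checker.on_host_packet(SyncCheck(100, 98213, 116852, 1703))
lost = checker.on_host_packet(SyncCheck(100, 99071, 76260, 1705))
```

Only the client performs the comparison here, which matches the rule that the host state is treated as correct by definition.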
The sync loss message
A typical sync loss message may look like this.
FATAL ERROR: Network: Synchronization loss!
FATAL ERROR: Network: Host Frm 100 Ctrl 50 Rnc 98213 Cpx 116852 PXS 593 MMi 630 Obc 1703 Oei 1825 Sct 2655
FATAL ERROR: Network: Client Frm 100 Ctrl 50 Rnc 99071 Cpx 76260 PXS 600 MMi 553 Obc 1705 Oei 1827 Sct 2654
The message shows the checksum variables for host and clients. Unless there is a bug in the sync check code, at least one of these values should be different between host and client (in this case, it would be Rnc, Cpx, PXS, MMi, Obc, Oei and Sct). The fields have the following meanings:
- Frm: Frame counter. This should almost always match. If it is different between host and client, all other values will probably differ as well. Possible causes are bugs in the control execution (i.e. clients executed the sync check packet too early or too late), a wrong frame counter initialization when resuming a savegame, or serious memory corruption overwriting the value.
- Ctrl: Control frame index. Control frames are control exchange cycles between clients. A difference usually has causes similar to those for Frm.
- Rnc: RandomCount. Number of times the random number generator has been called since its last reseeding. Because just about everything in the engine calls the random number generator, this value is almost always different when a sync check fails. It usually doesn't help in finding the cause.
- Cpx: CrewPositionX. Sum of all x coordinates of player crew members times 100. A large difference right as the game starts often indicates that there was a problem joining one of the players. In scenarios that have random start positions, it could also mean that the difference arose before the players joined (e.g. during landscape creation). A small difference means that one of the Clonks was at a slightly different position. This could be due to anything and doesn't help much.
- PXS: Loose pixel checksum. Because loose pixels heavily rely on random numbers in their behaviour, it's usually off whenever Rnc is off. Doesn't tell you much about the error cause.
- MMi: MassMoverIndex. Movement of large bodies of materials (e.g. water). Similar to PXS.
- Obc: ObjectCount. Number of objects in the game.
- Oei: ObjectEnumerationIndex. Number of objects that have ever been created in the game. If Oei is higher for one of the computers but Obc is the same, a short-lived object was created on only one of the computers.
- Sct: SectorShapeSum. Checksum of overlapping shapes of objects in sectors. If this is different while all other values are the same, there was a problem sorting objects into sector lists. Common causes are invalid or unusual object shapes, or object position changes without updating the shape structure.
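When reading many reports, a small helper can list the differing fields and apply the rules of thumb given above. The following Python sketch assumes the message layout shown in the example message; the function names `parse_line` and `analyze` are hypothetical and not part of any Clonk tool.

```python
import re

FIELDS = ["Frm", "Ctrl", "Rnc", "Cpx", "PXS", "MMi", "Obc", "Oei", "Sct"]

def parse_line(line):
    """Return {field: value} for one 'Host ...' or 'Client ...' message line."""
    pairs = re.findall(r"\b([A-Za-z]+) (\d+)\b", line)
    return {name: int(value) for name, value in pairs if name in FIELDS}

def analyze(host, client):
    """Return (differing fields, hints) for two parsed message lines."""
    diff = [f for f in FIELDS if host.get(f) != client.get(f)]
    notes = []
    if "Frm" in diff or "Ctrl" in diff:
        notes.append("frame or control counter differs: check control "
                     "execution, savegame resume, memory corruption")
    if diff == ["Sct"]:
        notes.append("only Sct differs: check object shapes and sector lists")
    if "Oei" in diff and "Obc" not in diff:
        notes.append("Oei differs but Obc matches: a short-lived object "
                     "existed on one computer only")
    return diff, notes

host_line = ("FATAL ERROR: Network: Host Frm 100 Ctrl 50 Rnc 98213 Cpx 116852 "
             "PXS 593 MMi 630 Obc 1703 Oei 1825 Sct 2655")
client_line = ("FATAL ERROR: Network: Client Frm 100 Ctrl 50 Rnc 99071 Cpx 76260 "
               "PXS 600 MMi 553 Obc 1705 Oei 1827 Sct 2654")

diff, notes = analyze(parse_line(host_line), parse_line(client_line))
print(diff)
print(notes)
```

For the example message, every field except Frm and Ctrl differs, so none of the specific hints applies. This is the typical situation in which the message alone does not point to the cause.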
Common sync loss causes
Here are some problems that have caused sync losses in the past. They are listed in descending order of frequency, i.e. most common causes first:
- Different engine versions. Because version numbers are usually checked, this is uncommon for regular players, but very, very common for engine developers who just click away the warning about incompatible engines. Typical reasons are that people didn't synchronize the repository, didn't update, or didn't check in their code changes. Any sync loss report where the participants used different engine versions, or where they are not sure whether they used the same version, should be ignored.
- Synchronization issues between platforms. The most common bug is that the Linux version executes the game differently than the Windows version, or the 64 bit build executes differently than the 32 bit build. Often, bugs can also be reproduced by compiling the game with different compilers or build options on the same system. Common errors are differences in floating point calculation. But any code that causes behaviour which is "undefined" by the language (uninitialized variables, memory used after being freed, wild pointers, etc.) can have different effects on different platforms/builds. To narrow this down, check whether the bug occurs even if everyone uses the same build and platform.
- Script execution dependent on unsynchronized values. If a value is not the same on all clients, such as the local system time, it should never be used for anything that affects the synchronization of the game, e.g. calling Random() and using up a random number. Common values that have caused bugs in the past and should be handled with care are:
  - Viewport values (existence, size, order, ...). Viewports are not synchronized.
  - Localized strings. Because people may use different language settings, you can almost never assume that strings have the same content on all computers. Do not run synchronized operations that depend on string contents or string length. E.g., if you display a localized message to the user, it is not OK to wait a certain number of frames depending on message length and then do some operation that modifies the game state.
  - Sound/Music. People may have their sound turned off. Music is not aligned to frames, i.e. it might start and stop earlier/later on some clients.
  - Player and Clonk portraits (obsolete in OC?)
  - Random/UnsyncedRandom. Use Random for synchronized code and UnsyncedRandom for asynchronous operations.
  - User interface operations. UI window positions, mouse position, keypresses, etc. may only be used after they have been synchronized through the control queue.
  - Local files
- Different file versions. All files that are relevant for a synchronized game should be checked before game start. However, there are often bugs that cause this check to fail. Files may be relevant without being covered by the check, or there might be trouble reading all files in the same way when they are unpacked. When narrowing down the bug, it might be worthwhile to let all participants play with the same packed files from a clean install for testing purposes.
- Configuration-dependent sync losses. Sometimes, bugs occur only for certain people even if they use the same engine build as everyone else. This is often a bug that occurs only with certain configuration settings. Examples of bugs that have been found in the past are music-related issues, player file-related bugs (e.g. synchronized code depending on player portraits) or differences between DirectX/OpenGL or bit depth.
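The cause "script execution dependent on unsynchronized values" can be illustrated with a sketch. It is written in Python rather than C4Script; `synced_random` and `unsynced_random` are stand-ins for Random and UnsyncedRandom, and the `Peer` class and the sound setting are assumptions made up for this example. The faulty effect consumes a synchronized random number only on machines with sound enabled, so the random counters of host and client drift apart.

```python
import random

class Peer:
    """One machine in a toy network game."""

    def __init__(self, seed, sound_enabled):
        self.synced_rng = random.Random(seed)  # same seed on every machine
        self.unsynced_rng = random.Random()    # local only, never compared
        self.random_count = 0                  # loosely analogous to Rnc
        self.sound_enabled = sound_enabled     # local setting, not synchronized

    def synced_random(self, n):
        self.random_count += 1
        return self.synced_rng.randrange(n)

    def unsynced_random(self, n):
        return self.unsynced_rng.randrange(n)


def play_effect_bad(peer):
    # BAD: a synchronized random number is consumed only where sound is on.
    if peer.sound_enabled:
        return peer.synced_random(100)
    return None


def play_effect_ok(peer):
    # OK: the purely local choice uses the unsynchronized generator.
    if peer.sound_enabled:
        return peer.unsynced_random(100)
    return None


def random_counts(effect):
    """Run one effect plus one piece of game logic on host and client."""
    host = Peer(seed=1, sound_enabled=True)
    client = Peer(seed=1, sound_enabled=False)
    for peer in (host, client):
        effect(peer)
        peer.synced_random(10)   # later synchronized game logic
    return host.random_count, client.random_count


print(random_counts(play_effect_bad))
print(random_counts(play_effect_ok))
```

With the faulty effect the two counters differ after a single call, while the corrected variant keeps them equal regardless of the local sound setting.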
Tracking down the cause
TODO
Records
Because records, just like network games, store player commands only, sync losses can happen in records as well. In that case, the game state during replay loses synchronization with the game state as it was when the game was recorded. Knowing this, records can be a useful tool to hunt down synchronization bugs.
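The idea of using a record to locate a desync can be sketched as follows. This Python example assumes that per-frame checksums are available for both the recorded run and the replay; the functions `simulate` and `first_desync` and the toy game rule are hypothetical and say nothing about the actual record format.

```python
import random

def simulate(seed, controls, buggy_frame=None):
    """Run a toy game from seed and controls; return one checksum per frame."""
    rng = random.Random(seed)
    x, calls = 0, 0
    checksums = []
    for frame, control in enumerate(controls):
        if frame == buggy_frame:
            rng.random()         # stands in for an unsynchronized code path
            calls += 1
        x += control + rng.randrange(3)
        calls += 1
        checksums.append((frame, calls, x))
    return checksums

def first_desync(recorded, replayed):
    """Return the first frame whose checksums differ, or None if all match."""
    for rec, rep in zip(recorded, replayed):
        if rec != rep:
            return rec[0]
    return None

controls = [1, 0, -1] * 10                    # the stored player commands
recorded = simulate(7, controls)              # checksums from the original game
clean_replay = simulate(7, controls)
faulty_replay = simulate(7, controls, buggy_frame=12)

print(first_desync(recorded, clean_replay))
print(first_desync(recorded, faulty_replay))
```

Because a replay can be repeated as often as needed on a single machine, the first differing frame narrows down where to look without requiring a second player.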