Version 4 (modified by chris, 10 years ago)

--

Box Backup and International Characters

If you live in a country or work in an environment where languages other than English are widely used, you have probably experienced the joys of international computing and character sets. This problem affects virtually all software, including Box Backup.

The root of the problem is that most software is written and designed for 8-bit character codes, which can only represent up to 256 different characters. That is nowhere near enough to represent all the characters used around the world.

There exist a very large number of different encodings which map local characters to 8-bit codes. These are almost all fundamentally incompatible with each other. If you move data from one environment to another, some or all characters in that data may be displayed as garbage. Nothing is actually lost or corrupted, but the new environment does not display the old character codes correctly.
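
As a small illustration of this (not part of Box Backup), the Python sketch below decodes one and the same byte under three 8-bit encodings. The choice of encodings and the helper name decode_under are assumptions made for the example only.

```python
def decode_under(raw: bytes, encodings):
    """Decode the same raw bytes under each named encoding (illustrative helper)."""
    return {enc: raw.decode(enc) for enc in encodings}


# One byte, three 8-bit encodings: Western European, Windows Cyrillic
# and the original IBM PC code page.
views = decode_under(b"\xe9", ["latin-1", "cp1251", "cp437"])

# Each environment shows a different character for the same byte...
different_characters = len(set(views.values())) == 3

# ...but nothing is lost: re-encoding each view gives the original byte back.
lossless = all(text.encode(enc) == b"\xe9" for enc, text in views.items())
```

The byte never changes; only the interpretation does, which is why the data looks like garbage on the new system but is not actually corrupted.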

Read on to find out how this can affect you when using Box Backup, and what to do about it.

This has very little to do with translation. Box Backup is currently written in English and all messages are in English only. We welcome any efforts to help resolve this, for example by adding a gettext infrastructure to Box Backup, externalising strings and translating them.

Character Encoding

Since international domain names do not exist yet, and Box Backup uses numbers to identify accounts, files and directories over the network, the issues are entirely confined to the client (at the moment).

Box Backup assumes that the client always runs on a single system which always uses the same character set. This allows it to treat character codes as opaque 8-bit values and not to do any translation at all (on Unix), which avoids a lot of complexity (and complexity in backup software is a bad thing).

This will work fine for you unless you do one of the following things:

  • access your Box Backup files from more than one client computer (e.g. restoring files on a different computer);
  • copy or mount files from a system which uses a different encoding;
  • upgrade or change your operating system in a way that changes the default character set;
  • change the default character set on your system.

If you do any of these, it's likely that you will see one of the following issues:

  • files which are already backed up may be backed up again (after the change);
  • files and paths listed in the client configuration file may not be found;
  • files listed in bbackupquery may be displayed wrongly.
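
The first of these issues follows directly from treating names as opaque bytes. The sketch below is a simplified model, not Box Backup's actual code: stored_name and looks_like_same_file are hypothetical helpers, and the file names are made up.

```python
def stored_name(name: str, client_encoding: str) -> bytes:
    """Bytes that a client using client_encoding would see for this name."""
    return name.encode(client_encoding)


def looks_like_same_file(a: bytes, b: bytes) -> bool:
    """Opaque comparison: no decoding and no translation, just bytes."""
    return a == b


# "résumé.txt", written with escapes so the example does not depend on
# the encoding of this source text.
name = "r\u00e9sum\u00e9.txt"

before = stored_name(name, "latin-1")  # before the system was switched
after = stored_name(name, "utf-8")     # after switching to UTF-8

# The user sees the same name, but the bytes differ, so an opaque
# comparison treats it as a new file.
same = looks_like_same_file(before, after)

# Pure ASCII names are identical in both encodings and are unaffected.
ascii_same = looks_like_same_file(
    stored_name("notes.txt", "latin-1"), stored_name("notes.txt", "utf-8")
)
```

Under this model, only names containing non-ASCII characters are affected by a change of encoding.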

If you know of any other issues, please list them here or contact us via the mailing lists.

Differences between Unix and Windows

Luckily, Unix systems are standardising on UTF-8 encoding, which can store all known characters. If you upgrade an old system which does not use UTF-8 to a recent one which does, these issues will affect you, but only once. We recommend that you set all your Unix systems to UTF-8 and leave them that way.
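
One way to check which encoding a Unix system is currently using is to ask the locale. The Python sketch below is only an illustration and has nothing to do with Box Backup itself; is_utf8 and report are hypothetical helper names.

```python
import codecs
import locale
import sys


def is_utf8(encoding_name: str) -> bool:
    """True if the name is any alias of UTF-8 (e.g. 'UTF-8', 'utf8')."""
    return codecs.lookup(encoding_name).name == "utf-8"


def report() -> dict:
    """Collect the encodings this process would use for text and file names."""
    return {
        "locale": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for what, enc in report().items():
        status = "UTF-8" if is_utf8(enc) else "NOT UTF-8"
        print(f"{what}: {enc} ({status})")
```

If either value is not UTF-8, the locale settings (for example the LANG and LC_ALL environment variables) are the usual place to look.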

Windows, however, does not have a single default 8-bit encoding. The encoding used will depend on the language that you select in Windows and a number of other system properties. In their infinite wisdom, Microsoft have chosen a different solution to this problem, which is to use UTF-16 encoding instead, whose code units are 16 bits wide instead of 8. Because of that, Box Backup would require a major redesign to support these native characters properly, and that has not happened yet.

What we actually do (on Windows) works well most of the time. We try to read native file names in UTF-16 everywhere and convert them to UTF-8. Most of Box Backup uses UTF-8 internally on Windows. We also convert messages sent to the Windows Event Log back to UTF-16, so you should see the correct characters there. However, Box Backup is a console application and the console does not properly support UTF-16. So Box Backup's console messages, and console tools such as bbackupquery, may still display characters incorrectly.
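
To make the conversion concrete, here is a minimal Python sketch of a UTF-16 to UTF-8 round trip. It only illustrates the idea; it is not how Box Backup implements it, and the function names and the sample file name are assumptions.

```python
def utf16_name_to_utf8(raw_utf16le: bytes) -> bytes:
    """Convert a native (little-endian UTF-16) name to UTF-8 bytes."""
    return raw_utf16le.decode("utf-16-le").encode("utf-8")


def utf8_to_utf16_name(raw_utf8: bytes) -> bytes:
    """Convert UTF-8 bytes back to little-endian UTF-16, e.g. for a native API."""
    return raw_utf8.decode("utf-8").encode("utf-16-le")


# A made-up file name with two Japanese characters, written with escapes.
native = "\u65e5\u672c.txt".encode("utf-16-le")  # 6 code units, 12 bytes

internal = utf16_name_to_utf8(native)            # 3 + 3 + 4 = 10 bytes
round_trip = utf8_to_utf16_name(internal)        # identical to `native`
```

The conversion is lossless in both directions for valid input, which is why the same name can be kept as UTF-8 internally and turned back into UTF-16 where a native interface needs it.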

We never convert file contents. That is your responsibility. We provide a faithful backup of the 8-bit binary data contained in each file.

Windows Console and International Characters

The first problem is that the default console fonts (Raster Fonts) do not support all character sets. Instead of a single font, Microsoft supplies different versions for different encodings. This works as long as:

  • all your characters will fit into a single encoding (not true of Chinese, Korean or Japanese); and
  • you have the correct console font installed.

If you have problems entering or displaying characters when using Box Backup, the first test is to:

  • Open up a Command Prompt (from All Programs/Accessories)
  • List the files on the local disk that are causing problems
  • Check that they display properly
  • Check that you can enter their names properly
  • Check that you can change to directories with international characters in their names

If any of these tests fail, you may need to change the console font to Lucida Console or similar:

  • Click on the System Menu (program icon) at the top-left of the Command Prompt window
  • Choose Properties
  • Go to the Font tab
  • Choose Lucida Console
  • Click OK
  • Choose Modify the shortcut that started this window and click OK

If this does not fix the above issues with using the Command Prompt itself, please contact us via the mailing lists for support.

Now that you know that the console works for you, check whether bbackupquery.exe does too. If not, try adding the -u option to its command line, which forces it to change the console mode to Unicode, allowing the entry and display of any character, not just those supported by the current encoding (although display will still not work unless the font supports it).

In theory, you may not have to change the font to Lucida Console (which is quite ugly) as the following combinations work correctly:

  • Raster fonts with correct code page
  • Raster fonts with the wrong code page and bbackupquery -u
    • In this case, characters are displayed wrongly but you can still enter them from the keyboard
  • Lucida Console with the correct code page and normal bbackupquery
  • Lucida Console with the wrong code page and bbackupquery -u

Log messages displayed on the console of bbackupd are still not converted to Unicode, so file names displayed there may be incorrect. At least it is now possible to fix this, as we have a standard logging infrastructure.

Configuration files must be encoded in UTF-8. That includes bbackupd.conf. This is not a natural encoding for Windows and very few applications can read and write it properly. It should be possible to convert from the system default code page, but there is a risk that if you change the code page, your backups will stop working. It might be possible to automatically detect if the configuration file is written in UTF-16 and load it correctly. If you have views about this, please contact us via the mailing lists.
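
As a starting point for that discussion, the Python sketch below shows one way such detection could work. Box Backup does not do this today. The function name load_config_text, the cp1252 fallback and the sample line are all assumptions made for illustration.

```python
import codecs


def load_config_text(raw: bytes, fallback_code_page: str = "cp1252") -> str:
    """Guess the encoding of configuration file bytes and return the text.

    Order of checks: UTF-16 byte order mark, UTF-8 byte order mark,
    plain UTF-8, then an assumed system code page as a last resort.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and strips it.
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the fallback code page. This is a guess and
        # gives wrong characters if the file was written under another one.
        return raw.decode(fallback_code_page)


def to_utf8(raw: bytes, fallback_code_page: str = "cp1252") -> bytes:
    """Re-encode configuration file bytes as UTF-8 without a BOM."""
    return load_config_text(raw, fallback_code_page).encode("utf-8")


# A made-up line containing one non-ASCII character.
sample = "path = Andr\u00e9\n"
```

The fallback step is the risky part described above: if the code page guessed here is not the one the file was written in, the names are read wrongly without any error.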