When Docker entered the enterprise market, many things started changing. Along with microservice-based architectures, it became a must-have technology for any "modern" project (the idea of containers itself isn't new - Docker has just made things much simpler). By analogy with the Java slogan from 1995, "Write once, run anywhere", almost 20 years later Docker enthusiasts say: "Build once, run everywhere". In fact, both ideas can coexist easily, and that's probably one of the reasons they really do in practice nowadays. However, there ain't no such thing as a free lunch...

What could possibly go wrong?

Suppose we are developing a large-scale commercial web app. Microservices encouraged us to split parts of our application across multiple Docker containers. Each of them may be built from a different base image, with a different OS or configuration. Everything should be just fine, unless we need to support some Unicode symbols (like national characters). On our local system, Unicode is not a problem. Just a quick test on the dockerized production environment and...

Kaboom

It's 2016 and our system seems to be helpless against Unicode... To make matters worse, the whole backend has been implemented in Java, but we still can't handle words like "żółć" (which is, by the way, one of the most Polish words of all). So what's with the "run anywhere" part?

Even a technology as portable as Java (or the JVM in general) isn't 100% independent of the OS. When operating on text data, the programmer should specify the encoding explicitly wherever possible, to avoid relying on the default one. For example, let's compare the following two methods from Java's String class:

  • byte[] getBytes() Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
  • byte[] getBytes(Charset charset) Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
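To make the difference concrete, here is a minimal sketch (the class name EncodingDemo and the roundTrip helper are only illustrative, not part of any library). It encodes "żółć" with an explicit charset and decodes it again. US-ASCII stands in for what the no-argument getBytes() would effectively use on a JVM whose default charset is ASCII:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {

    // "żółć" written with Unicode escapes, so the encoding of the source
    // file itself cannot distort the example.
    static final String WORD = "\u017c\u00f3\u0142\u0107";

    // Encodes the text with the given charset and decodes it with the same
    // one, as a matching write/read pair would do.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Explicit UTF-8: every character survives the round trip.
        System.out.println(roundTrip(WORD, StandardCharsets.UTF_8).equals(WORD));

        // US-ASCII cannot represent these characters, so each one is
        // replaced with '?' during encoding and the original text is lost.
        System.out.println(roundTrip(WORD, StandardCharsets.US_ASCII));

        // Platform default: the number of bytes depends on the JVM and its
        // environment, which is exactly the portability problem.
        System.out.println(WORD.getBytes().length);
    }
}
```

With an explicit charset the result is the same on every machine; only the last call can differ between a developer laptop and a container.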

Static code analyzers like FindBugs can easily find these kinds of issues. Unfortunately, even if we eliminate all of them from our own code, we can't be sure that none of our dependencies have them. To make things worse, similar problems may occur with any third-party app.

What does this mean from a practical point of view? In some situations our "completely portable" systems may rely on the platform's default encoding - whatever it is.
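One way to see what "whatever it is" means in a given container is to ask the JVM directly. The following is a rough sketch, assuming we simply want to print the relevant values at startup and warn when the default is not UTF-8; the class name DefaultCharsetCheck and the isUtf8 helper are made up for this example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {

    // True when the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    public static void main(String[] args) {
        Charset defaultCharset = Charset.defaultCharset();

        // What the JVM will use whenever no charset is given explicitly.
        System.out.println("Default charset: " + defaultCharset);
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Environment variables the OS locale is usually derived from
        // (either may be null when it is not set in the container).
        System.out.println("LANG:            " + System.getenv("LANG"));
        System.out.println("LC_ALL:          " + System.getenv("LC_ALL"));

        if (!isUtf8(defaultCharset)) {
            System.err.println("Warning: default charset is not UTF-8");
        }
    }
}
```

Running it both locally and inside the container makes any mismatch visible immediately. Note that JDK 18 and later use UTF-8 as the default charset regardless of the locale, so the locale-dependent behaviour described here mainly concerns older runtimes and non-Java tools.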

Unifying containers' locale settings

On a typical Debian-based OS we can check the current locale settings with the locale command. Its output may look like this:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

Everything should be fine as long as all the values carry the .UTF-8 suffix (LANGUAGE is the exception: it only lists preferred languages and carries no encoding). The most general solution (and probably a suitable one in most situations) is to choose C.UTF-8, which is the C language locale with UTF-8 encoding, not tied to any specific country. Unfortunately, on many Docker images the default value is POSIX - another standard C language locale, with ASCII encoding (no UTF-8 support).
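The ".UTF-8 suffix" rule is easy to automate. Below is a small sketch that takes the lines printed by the locale command and reports the variables that are set to something other than a UTF-8 locale. The class and method names are hypothetical, and the check assumes the suffix is spelled exactly ".UTF-8", as in the output above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LocaleOutputCheck {

    // Returns the names of locale variables whose value is set but does not
    // end with ".UTF-8". Empty values (such as an unset LC_ALL) are skipped.
    static List<String> nonUtf8Variables(List<String> localeOutputLines) {
        List<String> offenders = new ArrayList<>();
        for (String line : localeOutputLines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a NAME=value line
            }
            String name = line.substring(0, eq).trim();
            // Some values are printed in double quotes; strip them.
            String value = line.substring(eq + 1).trim().replace("\"", "");
            if (name.equals("LANGUAGE")) {
                continue; // list of preferred languages, carries no encoding
            }
            if (!value.isEmpty() && !value.endsWith(".UTF-8")) {
                offenders.add(name);
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        // Sample of what a POSIX-locale container might print.
        List<String> sample = Arrays.asList(
                "LANG=POSIX", "LC_CTYPE=\"POSIX\"", "LC_ALL=");
        System.out.println(nonUtf8Variables(sample)); // prints [LANG, LC_CTYPE]
    }
}
```

An empty result means every configured category uses a UTF-8 locale.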

In many real-life scenarios it seems quite reasonable to set the same locale for all cooperating Docker containers (within one system or application). Let's assume that we want to set the Polish locale (pl_PL.UTF-8) in our Docker container running a Debian-based OS. All we've got to do is add the following lines to our Dockerfile:

# refresh package lists and make sure that the locales package is available
# (both steps share one RUN so a cached layer never holds stale package lists)
RUN apt-get update && apt-get install --reinstall -y locales
# uncomment the chosen locale to enable its generation
RUN sed -i 's/# pl_PL.UTF-8 UTF-8/pl_PL.UTF-8 UTF-8/' /etc/locale.gen
# generate the chosen locale
RUN locale-gen pl_PL.UTF-8
# set system-wide locale settings
ENV LANG=pl_PL.UTF-8
ENV LANGUAGE=pl_PL
ENV LC_ALL=pl_PL.UTF-8
# reconfigure the locales package non-interactively to apply the new settings
RUN dpkg-reconfigure --frontend noninteractive locales

You can find more details about locale settings in the Ubuntu Community Help Wiki.

Setting the system locale explicitly should make things easier and also protect our applications from many nasty and mysterious bugs.