SDA 3.5 Documentation for LANGUAGE
NAME
language - Using non-English languages
DESCRIPTION
SDA allows users to set up SDA datasets and to display results in
practically any language. This document summarizes the issues
involved. Note that Unicode may be used for many languages, so
long as the UTF-8 encoding is used (NOT the UTF-16 or UTF-32
encoding).
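This document includes the following topics:
- Data definitions in various languages
- Specifying the character encoding
- Specifying the font for charts
- Language limitations in SDA search
- Specifying the "lang" attribute
- Modifying the user interface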
DATA DEFINITIONS IN VARIOUS LANGUAGES
The names, labels, and question text for variables are all
defined in a DDL file. The variable names may contain only ASCII
characters, but the labels and question text may be encoded with
any character set. After the DDL file has been used to create the
SDA dataset (by using the MAKESDA program), all displays of SDA
results will (try to) use that character set.
SPECIFYING THE CHARACTER ENCODING
If the text in your DDL file is just plain ASCII (also known as
’US-ASCII’), then you don’t have to worry about character
encoding issues. However, if the data definitions are encoded
with another character set (to include accent marks, for
example), or if the user interface has been modified using
another character set, browsers might not display the characters
properly.
If you are not using ASCII text, then the name of the character
encoding used for a dataset should be specified using the
’CHARSET=’ keyword in the general section of a DDL file. The name
of this character set will then be stored as a permanent part of
the SDA dataset (in the STUDYINF/studyinf file) when MAKESDA is
executed.
For a list of recognized character sets, see the list of IANA
Character Sets.
Some commonly encountered encodings are: ’Windows-1252’ (older
Windows files) and ’ISO-8859-1’ (Western European). However,
UTF-8 is today the preferred encoding for storing a study’s
metadata text because:
- It is part of the Unicode standard.
- It does not have any byte-order "endian" issues (unlike
other Unicode encodings such as UTF-16 and UTF-32).
- It is a superset of the original ASCII text encoding and is
therefore fully backwards compatible with it.
Therefore UTF-8 should be used for creating and storing metadata
for datasets whenever possible. (Remember, ASCII text is
UTF-8 text. So if your metadata is plain ASCII then you’re
already using UTF-8.)
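The points above can be checked with a short Python sketch. It is only an illustration and is not part of SDA; the function names and the sample labels are hypothetical.

```python
def is_ascii_compatible(text: str) -> bool:
    """Return True if text is plain ASCII, in which case its ASCII
    bytes and its UTF-8 bytes are identical."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains at least one non-ASCII character.
        return False


def utf16_byte_orders(text: str):
    """Encode text as UTF-16 in both byte orders (no BOM added)."""
    return text.encode("utf-16-le"), text.encode("utf-16-be")


plain = "Age of respondent"       # hypothetical ASCII label
accented = "Âge du répondant"     # hypothetical label with accents

print(is_ascii_compatible(plain))      # True
print(is_ascii_compatible(accented))   # False

# UTF-16 output depends on byte order; UTF-8 has no such choice.
little, big = utf16_byte_orders(accented)
print(little != big)                   # True
```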
When HTML pages are generated by various SDA programs, the
charset information stored with the dataset will be taken into
account so the pages can be displayed correctly in a browser.
Usually the charset information will be used to write a
meta tag in the head element of an
HTML page. For example, a tag of this form:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML output written by one of SDA’s servlet-based webapps is an
exception to this rule, however. Java internally encodes all
characters in Unicode. Therefore, the crucial point here is that
encoded metadata text from an SDA dataset must be "de-coded"
correctly -- using the charset specification stored in the
dataset’s STUDYINF/studyinf file -- when it is imported into the
Java environment and turned into Unicode. After that, the
"native" encoding of the text is lost. Therefore, HTML pages
produced by a servlet will typically not include a meta
tag specifying the original charset for the dataset.
Instead, servlet pages are usually written using UTF-8 encoding.
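The decode-then-re-encode step described above can be sketched as follows. This is a Python illustration of the general idea, not the code used by the SDA servlets (which run in Java); the function names and the sample label are hypothetical.

```python
def decode_label(raw: bytes, dataset_charset: str) -> str:
    """Turn stored metadata bytes into Unicode text, using the
    charset recorded for the dataset."""
    return raw.decode(dataset_charset)


def page_bytes(text: str) -> bytes:
    """Write Unicode text out the way a servlet page usually is:
    as UTF-8, whatever the dataset's original charset was."""
    return text.encode("utf-8")


# A label stored in a Windows-1252 dataset (hypothetical example).
stored = "Âge du répondant".encode("windows-1252")

# Decoding with the right charset recovers the label.
label = decode_label(stored, "windows-1252")
print(label == "Âge du répondant")

# Once decoded, the original encoding no longer matters.
print(page_bytes(label).decode("utf-8") == label)

# Decoding with the wrong charset silently garbles the text.
garbled = decode_label("é".encode("utf-8"), "iso-8859-1")
print(garbled != "é")
```

The last case is why the charset stored with the dataset has to be correct: the wrong choice usually produces garbled characters rather than an error.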
There are a couple of other technical issues concerning character
encoding that should be kept in mind.
- Apache Web Server’s Default Charset:
In 2.x versions of the Apache configuration file (httpd.conf),
there’s often an "AddDefaultCharset" directive that’s turned on
by default. The charset specified by this directive will be added
to the server’s response header that accompanies every HTML page
and will override any charset setting in the
meta tag of an HTML file, making all SDA
charset specifications inoperative. Therefore, this directive
should usually be commented out in the Apache httpd.conf file so
that SDA charset specifications will be effective. For more
information see the section on the AddDefaultCharset directive in
the online Apache Manual.
- The Byte-Order Mark (BOM):
Some Unicode-capable text editors (such as Windows’ Notepad) will
automatically output a "BOM" (Byte-Order Mark) at the beginning
of any file saved in a Unicode format. For more information see
the section on the BOM in the Unicode FAQ.
If a DDL file contains a BOM in the initial bytes of the file,
then MAKESDA (and other SDA programs that process DDL files) will
ignore the BOM during processing. (However, MAKESDA will display
a message informing the user that a BOM was encountered and
ignored.) SDA programs that read DDL will handle the three types
of BOM written by Notepad: UTF-8, UTF-16 little-endian, and
UTF-16 big-endian. It is unlikely that other BOM types will be
encountered in real-world circumstances.
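To check whether a DDL file starts with one of the three BOM types mentioned above, something like the following Python sketch could be used. It is not the logic used by MAKESDA; the function name is hypothetical, and only the BOM constants from Python's standard codecs module are relied upon.

```python
import codecs

# The three BOM types named above, with the encoding each implies.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def split_bom(data: bytes):
    """Return (encoding, remaining bytes) if data starts with a known
    BOM, or (None, data) if no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data


# A UTF-8 file saved with a BOM (hypothetical file contents).
encoding, body = split_bom(codecs.BOM_UTF8 + b"title = My Study\n")
print(encoding)
print(body)
```

Reading the first few bytes of a DDL file in binary mode and passing them to split_bom() shows whether the editor that saved the file added a BOM.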
SPECIFYING THE FONT FOR CHARTS
Normally it is the browser that is responsible for selecting the
correct font for displaying text, using whatever information is
available in the HTML code (or response header from the server)
as a guide. However, charts present a special problem because the
chart inserted into the HTML output is just an image -- a picture
-- and the browser has no control over selecting the font that is
used in the chart’s headings, labels, etc. Instead, the font is
selected when the chart image is created by the chartgen
servlet on the server.
By default, the chartgen servlet uses the generic Java
"SansSerif" font when displaying text. (This "SansSerif" font is
mapped to a particular physical font on the server on a system-
dependent basis.) In many instances this default font will work
fine. However, there may be cases where a specific font is
required to display a given language. There are two ways this
information about the required font can be relayed to the
chartgen servlet: 1) a font specification can be globally
applied in the chartgen configuration file; 2) a chart
font for a particular dataset can be specified in the HARC file
(overriding any global specification). For more information on
specifying fonts for charts, see the section on charts in the SDA
Archive Developer’s Guide.
Remember too that the specified font must actually be present on
the server machine that is running the JVM, and the server must
be configured so that the font is available to Tomcat (or your
chosen servlet container). For more information on language
issues in Java and in the servlet environment, see the Java
Internationalization FAQ.
LANGUAGE LIMITATIONS IN SDA SEARCH
The SDA search utility currently works with search terms entered
in English or a Western European language. The search utility is
configured so that accented Latin characters (German umlauts,
etc.) will be displayed correctly; however, the search terms
themselves can only be entered using non-accented characters.
Languages that aren’t compatible with the Latin character set at
all -- Asian ideographs, Georgian script, etc. -- can’t be used
in search terms (although they will still display correctly in
search results). These language limitations in SDA searching will
likely be removed in a future version of SDA. However, it is
important to be aware of these issues if you have datasets that
are not in English.
SPECIFYING THE LANG ATTRIBUTE
In addition to specifying a charset in the global section of the
DDL file, you can also specify the dataset’s "lang" attribute.
The "lang" attribute is generally of far less importance than the
"charset" specification in displaying HTML correctly and will
probably rarely be needed. However, if you do specify a "lang"
attribute in the DDL file, it will also be written to the SDA
dataset’s STUDYINF/studyinf file when MAKESDA is executed.
Here is an example of specifying a "charset" and a "lang"
attribute in the global section of a DDL file:
title = French Canadian Study
charset = utf-8
lang = fr-CA
When SDA programs write HTML, the dataset’s "lang" (if any) will
be written as an attribute of the main "html" tag. For example:
<html lang="fr-CA">
A two-letter language code like ’fr’ represents the language
itself. An optional subfield can be added to indicate the country
in which that language is spoken, in case a browser knows what to
do with that information. In the example above, "fr-CA" indicates
that the language is French, as spoken in Canada.
For more information on the declaration of the "lang" attribute
and the uses of that attribute by browsers (or other user
agents), see this W3C document on the "lang" attribute in HTML.
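The two-part structure of a "lang" value can be illustrated with a small Python sketch. The function names are hypothetical and this is not SDA code. It handles only the simple language-plus-country form described here; real language tags can carry further subfields.

```python
def split_lang(lang: str):
    """Split a value like 'fr-CA' into (language, country).
    The country is None when no subfield is given."""
    language, _, country = lang.partition("-")
    return language.lower(), (country.upper() or None)


def html_open_tag(lang=None) -> str:
    """Build the opening html tag, adding lang only if one is set."""
    return f'<html lang="{lang}">' if lang else "<html>"


print(split_lang("fr-CA"))     # ('fr', 'CA')
print(split_lang("fr"))        # ('fr', None)
print(html_open_tag("fr-CA"))  # <html lang="fr-CA">
```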
MODIFYING THE USER INTERFACE
The SDA option screens and the output from analysis programs can
be changed to any language. There is a separate interface
document that explains how to do this.
To ensure that browsers know how to display characters properly,
it is best also to specify the character encoding, as described
above, if the modified interface uses a character set other than
’US-ASCII’.
CSM, UC Berkeley
April 12, 2011