Database I18N Development Considerations

Configuration
When a product provides a script to create the database, the script should set the internationalization-related parameters as described below.

Character Set and Collation

 * Refer to Database_Character_Set_and_Sort_Order

Calendar/Timezone

 * The default calendar should be set to Gregorian when the database vendor offers calendar options and the default can differ from the Gregorian calendar. This concern applies only to Oracle; MS SQL Server supports the Hijri calendar at the expression level only, and DB2 does not support non-Gregorian calendars.


 * With regard to time zones, the situation is similar to calendars: Oracle appears to be the only database vendor that offers time zone handling capability at the SQL level. So, for Oracle, it is better to set the default time zone to UTC.

Others

 * There are some other internationalization-related database parameters. All database vendors provide locale-based defaults, so in most cases it is sufficient to set the locale of the database. However, the way to set the default database locale differs from vendor to vendor:


 * Oracle
 * In the initialization parameter file, a.k.a. init.ora, you can set NLS_LANGUAGE and NLS_TERRITORY explicitly. If they are not set, Oracle's default is AMERICAN_AMERICA regardless of the OS locale. Other parameters can also be set in the initialization parameter file and as session parameters. See Setting Up a Globalization Support Environment for details.
 * SQL Server
 * SQL Server's locale is the same as the Regional settings of the OS. The session language can also be set by "SET LANGUAGE". See Session Language and sys.syslanguages (the list of supported languages) for details.
 * DB2
 * The territory has to be set when the database is created, along with the code page. The session language is determined by the client locale; e.g. if the CLP is launched under locale=en_US, the session language will be en_US.


 * TODO: It is not mentioned how the locale can be set for a session when a JDBC connection is established.

Data Type

 * All databases provide two different types for character data: the ANSI types, char/varchar, and the Unicode types, known as nchar/nvarchar. Our recommendation is to use char/varchar for Oracle/DB2/Sybase and nchar/nvarchar for SQL Server. The reason nchar/nvarchar must be used on SQL Server is simply that SQL Server does not support any Unicode encoding for the char/varchar types; it supports UTF-16 only with nchar/nvarchar. So it will certainly be necessary to prepare separate DDL scripts for SQL Server.
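The need for separate DDL scripts can be sketched in Java; the method and vendor names here are illustrative assumptions, not part of any product API:

```java
public class CharColumnType {
    /**
     * Returns the character column type to use for a vendor, per the
     * recommendation above: nvarchar for SQL Server, varchar elsewhere.
     */
    public static String varcharType(String vendor, int length) {
        // SQL Server stores Unicode only in nchar/nvarchar (UTF-16);
        // Oracle/DB2/Sybase can use a Unicode database character set with varchar.
        if (vendor.equalsIgnoreCase("sqlserver")) {
            return "nvarchar(" + length + ")";
        }
        return "varchar(" + length + ")";
    }

    public static void main(String[] args) {
        System.out.println(varcharType("oracle", 100));    // varchar(100)
        System.out.println(varcharType("sqlserver", 100)); // nvarchar(100)
    }
}
```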

Length Validation and Column Size Consideration

 * Please be aware of the difference in length semantics between Java and the database. Java counts UTF-16 code units (String.length returns the number of chars, not bytes), while databases are typically based on byte semantics; i.e. varchar(10) means a maximum of 10 bytes. Therefore, when you design the schema and implement length validation, you need to take this into consideration. Unfortunately, there is no perfect answer to this issue, and you will need to select one of the following options depending on your product's needs.


 * Limit incoming data length to 1000 characters and define the schema as the length limit x 4 (max 4000 bytes):
 * If the incoming data is expected to be less than 1000 characters, you can define the column length as 4 times that limit, since a single character occupies at most 4 bytes in common database character sets. This way, you can guarantee that incoming data can be stored regardless of the database character set.
 * Pros:
 * No need to worry about the database character set in length validation
 * Can guarantee to store the same amount of data regardless of database character set (good for database migration)
 * Length check is easy and fast in Java
 * Cons:
 * Storage restriction is tight
 * If data longer than 1000 chars has to be handled, a workaround such as storing the data across multiple records is required


 * Perform length validation in the database character set at the client:
 * Java provides a way to get the byte array of a string in any supported encoding, so it is still possible to check the incoming data length in bytes in Java. However, performing this byte-oriented length validation all the time would be expensive, while most data is well under the limit. So, this validation can be performed either before inserting/updating data or after an exception occurs, depending on the application's situation.
 * Pros:
 * Better storage utilization
 * Accurate length validation. If most of the data is ASCII, more characters can be stored
 * Cons:
 * Need to transcode Java strings from UTF-16 to something else (a bit expensive)
 * The amount of storable data differs by database character set if there is non-ASCII data (not good for database migration)
 * Need a mapping from database character set names to Java encodings (e.g. AL32UTF8 in Oracle has to be mapped to UTF-8)


 * Use varchar for small data as in the first option and CLOB for large data:
 * The first option is good when you don't expect large data. If large incoming data is expected, using CLOB is an alternative to chunking the data into multiple records.
 * Pros:
 * No need to worry about the database character set in length validation
 * Can guarantee to store the same amount of data regardless of database character set (good for database migration)
 * Length check is easy and fast in Java
 * Can handle large data
 * Cons:
 * CLOB may be overkill if the incoming data is not that big
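The second option (byte-length validation in the database character set at the client) can be sketched in Java as follows. The class and method names, the mapping table, and the limits are illustrative assumptions; the AL32UTF8 → UTF-8 entry is the mapping mentioned above:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class ByteLengthValidator {
    // Illustrative mapping from database character set names to Java charsets;
    // e.g. Oracle's AL32UTF8 corresponds to Java's UTF-8.
    private static final Map<String, Charset> DB_CHARSETS = Map.of(
            "AL32UTF8", StandardCharsets.UTF_8,
            "WE8ISO8859P1", StandardCharsets.ISO_8859_1);

    /**
     * Returns true if value fits into a column of maxBytes bytes
     * when encoded in the given database character set.
     */
    public static boolean fits(String value, String dbCharset, int maxBytes) {
        // Transcoding from UTF-16 is the potentially expensive step, so an
        // application may choose to run this only before insert/update or
        // after a length-related exception, as described above.
        return value.getBytes(DB_CHARSETS.get(dbCharset)).length <= maxBytes;
    }

    public static void main(String[] args) {
        String e = "\u00e9"; // 'é': 1 char in Java, but 2 bytes in UTF-8
        System.out.println(fits(e.repeat(5), "AL32UTF8", 10)); // true  (10 bytes)
        System.out.println(fits(e.repeat(6), "AL32UTF8", 10)); // false (12 bytes)
    }
}
```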

Indexing with Collation

 * When a product performs linguistic sorting at the database tier, it should have linguistic indexes for better query performance. To do so, please see the following notes for each database.


 * Oracle
 * Oracle offers a way to build linguistic indexes by explicitly specifying the sorting preference and also allows multiple indexes on one column. However, please note that creating multiple indexes on a column slows down insert/update transactions, since all of the indexes have to be maintained. So, if insert/update performance is crucial for the product, you should build only one index per column. The recommendation in that case is to build a linguistic index with one of the multilingual sorts such as GENERIC_M, e.g. CREATE INDEX ename_i ON emp (NLSSORT(ename, 'NLS_SORT=GENERIC_M')). Also please note that the NLS_SORT/NLS_COMP session parameters must be set accordingly to make use of those indexes. See Linguistic Indexes for details.

 * SQL Server
 * When creating the database schema, a collation can be specified per column, e.g. ename NVARCHAR(50) COLLATE French_CI_AS. You don't have to do anything else to build a linguistic index.

 * DB2
 * When creating the database, a collation can be specified for the database. You don't have to do anything else to build a linguistic index.

Multilingual Table Schema

 * Refer to Creating Multilingual Databases

Character Set Validation

 * The character set of the database should be verified. This may be required when accurate length validation has to be performed in Java.
 * Oracle
 * You can check database configuration and session setting by viewing SYS.NLS_DATABASE_PARAMETERS or V$NLS_PARAMETERS.
 * SQL Server
 * You can check database configuration by using the sp_helpdb stored procedure.
 * DB2
 * You can check database configuration by using the GET DATABASE CONFIGURATION command.

Length Validation

 * Refer to Length Validation and Column Size Consideration above

Collation and Comparison

 * Please note that the collation setting can affect comparisons and therefore the result set. For example, given a SQL statement like select ename from emp where ename > 'A' and ename < 'z' order by ename; both the contents and the order of the result set depend on the collation rule. If the collation is set to binary sort, the result will not contain any non-ASCII characters, since all non-ASCII characters' code points are greater than the code point of 'z'. This is fine for English but incorrect for French: French users expect Á (capital letter A with acute) in the query result, since Á is expected to sort after 'A' (capital A) in French. This type of issue is often observed when a product builds indexes of stored data or allows users to specify a range as search criteria. Please take the collation setting into account in your implementation.
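The difference between binary and linguistic comparison described above can be observed directly in Java: String.compareTo compares UTF-16 code units (binary order), while java.text.Collator follows locale collation rules.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationComparison {
    public static void main(String[] args) {
        String aAcute = "\u00C1"; // Á, capital letter A with acute

        // Binary comparison: Á (U+00C1) sorts after 'z' (U+007A), so a
        // binary-collated range query ename > 'A' and ename < 'z' excludes it.
        System.out.println(aAcute.compareTo("z") > 0); // true

        // Linguistic comparison with French rules: Á sorts just after 'A'
        // and well before 'z', so the same range query would include it.
        Collator fr = Collator.getInstance(Locale.FRENCH);
        System.out.println(fr.compare(aAcute, "z") < 0); // true
        System.out.println(fr.compare(aAcute, "A") > 0); // true
    }
}
```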


Conversion functions

 * Though databases may provide conversion functions to convert strings into numbers/dates/timestamps or vice versa, using any of them is highly discouraged, since the result is affected by the default format mask and the translations provided by the database, and you can easily hit internationalization issues. Please bind the values instead. If you do need to use one of those conversion functions, be sure to set a format mask explicitly, and make it locale neutral, e.g. YYYY-MM-DD.
 * e.g. This example is based on Oracle's convert function.
 * select * from emp where hiredate = to_date('11-Mar-2006');  ---> This will not work if the session language is different from English. The default format mask can also differ.
 * select * from emp where hiredate = to_date('2006-03-11', 'YYYY-MM-DD'); ---> Though this works, it is still discouraged, since you then need to deal with each database's different format mask syntax.
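Binding, as recommended above, can be sketched with JDBC. The emp/hiredate names come from the example; the class structure and the assumption that a Connection is obtained elsewhere are illustrative:

```java
import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BindInsteadOfConvert {
    // java.sql.Date.valueOf takes the locale-neutral ISO form yyyy-mm-dd,
    // so no database-side format mask or month-name translation is involved.
    public static Date hireDate() {
        return Date.valueOf("2006-03-11");
    }

    // Sketch of the query with a bound DATE parameter; conn is assumed to
    // be obtained elsewhere, e.g. from a DataSource.
    public static ResultSet findByHireDate(Connection conn) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "select * from emp where hiredate = ?");
        ps.setDate(1, hireDate()); // bound as DATE; independent of session language
        return ps.executeQuery();
    }

    public static void main(String[] args) {
        System.out.println(hireDate()); // 2006-03-11
    }
}
```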