Multilingual Data Structure

Background
Providing multilingual support for the user interface, configuration data, and some transactional data is becoming increasingly important. Technology stacks such as Java and JSF provide facilities for a multilingual user interface, but multilingual support for configuration and transactional data requires application developers to design the application's data structures with multiple languages in mind. This note discusses several multilingual data structures and their advantages and disadvantages. Sample implementations with XML and an RDBMS are also discussed.

Consideration
When discussing a multilingual data structure, the following points should be considered.
 * Language data isolation
   * Multilingual data has to be coupled with its context through some language-neutral data such as an ID. A multilingual data structure should isolate language data (translations and localized values) from that context. Otherwise, the language data becomes difficult to maintain later. This should be considered at the very beginning of the design phase.
 * Fallback mechanism
   * In many cases, language data can be sparse, either permanently or temporarily. It is therefore necessary to decide how to fall back when language data for a particular context does not exist.
 * Maintainability
   * Multilingual data may be released as part of the application or created by application users. In either case, the maintainability of the language data has to be considered. For example, if multilingual data is released as part of the application and the language data is not well isolated from its context, the language data has to be re-released whenever the context is updated, even if the language data itself has not changed. This would be a nightmare for both the localization group and users: the localization group has to produce a language data patch every time, and users have to apply one patch for each language in their environment.
 * Storage Efficiency
   * Multilingual support requires some extra storage, since a context key is stored for each language. This overhead should be minimized, especially because multilingual data can grow over time.
 * Performance
   * Multilingual support always requires some extra work compared to single-language support. It can therefore hurt performance unless both updating and querying language data are designed carefully. In many cases, update performance and query performance trade off against each other, so it is important to evaluate case by case which matters more for the application.

Base - Language

 * This approach has two schemas, Base and Language. Base includes both the language-neutral data and the language data for the base language, which is the language used for fallback. Language includes the context keys and the language data for every language other than the base language, i.e. the translations. A View for a specific language is then derived from Base and Language.

Base - language neutral data + language data for the base language (i.e. English in the sample below)

    BlackBerry_8700c    10    BlackBerry 8700c    BlackBerry 8700c (Refurb)
    HP_iPAQ_hw6515      25    HP iPAQ hw6515      HP iPAQ hw6515 (Refurb)

Language - context keys and language data except the base language (i.e. sku is a key and displayname, description are language data)

    BlackBerry 8700c - Japanese    BlackBerry 8700c (Refurb) - Japanese
    HP iPAQ hw6515 - Japanese      HP iPAQ hw6515 (Refurb) - Japanese

View - View derived from base and translation data (i.e. the following is Japanese view)

    BlackBerry_8700c    10    BlackBerry 8700c - Japanese    BlackBerry 8700c (Refurb) - Japanese
    HP_iPAQ_hw6515      25    HP iPAQ hw6515 - Japanese      HP iPAQ hw6515 (Refurb) - Japanese

 * Language data isolation
   * Translations are thoroughly isolated from the language-neutral data. Only the base-language text lives alongside it in Base.
 * Fallback mechanism
   * Since the base language data is part of Base and can be expected to always be populated, the simplest fallback is to fall back to the base language when data for the requested language does not exist. If a more sophisticated fallback is required, the lookup should first walk the Language schema, for example checking ja-JP and then ja, and fall back to the base language data only as needed.
 * Maintainability
   * Since the Language schema is thoroughly isolated from the language-neutral data, maintainability is good with this approach. In addition, because the base language data serves as the final fallback point, the language data can be sparse. In terms of release maintainability, this approach is therefore the best of those discussed here.
 * Storage Efficiency
   * Unless the View above has to be stored for some reason, this approach requires no redundant data other than the context keys.
 * Performance
   * Since this approach tolerates sparse language data, update performance should be good. However, query performance can be poor if the fallback has to be sophisticated, that is, if several Language entries have to be checked before reaching the base language.
 * Summary
   * This approach should work well for most multilingual scenarios. Its weaknesses are:
     * It is difficult to switch the base language once the language data has become sparse.
     * Query performance can be poor if the data is large, the language data is sparse, and the fallback takes time.
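
The fallback described for this model (check ja-JP, then ja, then the base language) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function names, and the "quantity" label for the numeric column are hypothetical, the item name stands in for the sku key (the actual sku values are not shown in the sample), and the Japanese description of the second item is deliberately left out to show sparse data.

```python
# Base: language-neutral data plus base-language (English) text.
# Keyed by a stand-in for the sku; "quantity" is a hypothetical column name.
BASE = {
    "BlackBerry_8700c": {"quantity": 10,
                         "displayname": "BlackBerry 8700c",
                         "description": "BlackBerry 8700c (Refurb)"},
    "HP_iPAQ_hw6515": {"quantity": 25,
                       "displayname": "HP iPAQ hw6515",
                       "description": "HP iPAQ hw6515 (Refurb)"},
}

# Language: sparse translations keyed by (context key, language tag).
LANGUAGE = {
    ("BlackBerry_8700c", "ja"): {
        "displayname": "BlackBerry 8700c - Japanese",
        "description": "BlackBerry 8700c (Refurb) - Japanese"},
    # Description intentionally missing to demonstrate the fallback.
    ("HP_iPAQ_hw6515", "ja"): {"displayname": "HP iPAQ hw6515 - Japanese"},
}

LANGUAGE_FIELDS = ("displayname", "description")


def fallback_chain(lang):
    """Return the tags to try in order, e.g. 'ja-JP' -> ['ja-JP', 'ja']."""
    parts = lang.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def localized(sku, field, lang):
    """Look up one field, walking Language first and Base last."""
    for tag in fallback_chain(lang):
        value = LANGUAGE.get((sku, tag), {}).get(field)
        if value is not None:
            return value
    return BASE[sku][field]  # final fallback: the base language


def view(lang):
    """Derive the View for one language from Base and Language."""
    return {
        sku: {**row, **{f: localized(sku, f, lang) for f in LANGUAGE_FIELDS}}
        for sku, row in BASE.items()
    }
```

Every lookup may touch several Language entries before reaching Base, which is the query cost mentioned under Performance. Updates only ever touch the entries that actually have a translation.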

Core - Language

 * This approach has two schemas, Core and Language. Core includes only the language-neutral data. Language includes the context keys and all language data. A View for a specific language is then derived from Core and Language.

Core - language neutral data only

    BlackBerry_8700c    10
    HP_iPAQ_hw6515      25

Language - context keys and all language data (i.e. sku is a key and displayname, description are language data)

English data

    <?xml version="1.0" encoding="UTF-8" ?>
    <displayname xml:lang="en">BlackBerry 8700c</displayname>
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
    <displayname xml:lang="en">HP iPAQ hw6515</displayname>
    <description xml:lang="en">HP iPAQ hw6515 (Refurb)</description>

Japanese data

    <?xml version="1.0" encoding="UTF-8" ?>
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>

View - View derived from Core and Language data (i.e. the following is the Japanese view)

    <?xml version="1.0" encoding="UTF-8" ?>
    BlackBerry_8700c    10
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
    HP_iPAQ_hw6515      25
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>

 * Language data isolation
   * Language data is thoroughly isolated from the language-neutral data.
 * Fallback mechanism
   * Since there is no base language data, the fallback has to be sophisticated enough to deal with sparse language data. One of the following is necessary:
     * Fully populate one particular language so that it can act as the base language in the fallback.
     * Temporarily populate every language with an initial entry to avoid sparse language data.
     * Fall back to some language-neutral data if the language data does not exist (note: this is not an option for data shown in the UI).
 * Maintainability
   * Since the Language schema is thoroughly isolated from the language-neutral data, maintainability is good with this approach. However, if the implementation cannot tolerate sparse language data, a language data patch has to be released whenever new language data is added, unlike the Base - Language model above.
 * Storage Efficiency
   * Unless the View above has to be stored for some reason, this approach requires no redundant data other than the context keys. However, if the implementation needs extra data to cope with sparse language data, it is less efficient than the Base - Language model.
 * Performance
   * Performance depends on how sparse language data is handled. If it is resolved by fully populating one language to make it the base language, update performance is good but query performance is the same as in the Base - Language approach. If instead it is resolved by populating every language, update performance is poor but query performance is better than in the Base - Language model, since queries need no fallback.
 * Summary
   * This approach should work well unless sparse language data has to be considered. In other words, if the language data is fairly static and, like UI text, does not change much, this approach works better than the Base - Language model. It is still necessary to consider temporarily sparse language data when patching.
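
The second strategy above (populating every language so that queries need no fallback) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the dictionary layout, the "quantity" label, and the names `missing_entries`, `fill_from`, and `view` are all hypothetical, and the item name again stands in for the sku key. The Japanese entries for the second item are left out on purpose.

```python
# Core: language-neutral data only ("quantity" is a hypothetical column name).
CORE = {
    "BlackBerry_8700c": {"quantity": 10},
    "HP_iPAQ_hw6515": {"quantity": 25},
}

# Language: all language data, keyed by language tag and then context key.
LANGUAGE = {
    "en": {
        "BlackBerry_8700c": {"displayname": "BlackBerry 8700c",
                             "description": "BlackBerry 8700c (Refurb)"},
        "HP_iPAQ_hw6515": {"displayname": "HP iPAQ hw6515",
                           "description": "HP iPAQ hw6515 (Refurb)"},
    },
    "ja": {
        "BlackBerry_8700c": {"displayname": "BlackBerry 8700c - Japanese",
                             "description": "BlackBerry 8700c (Refurb) - Japanese"},
        # HP_iPAQ_hw6515 is not translated yet: the data is sparse.
    },
}

LANGUAGE_FIELDS = ("displayname", "description")


def missing_entries(lang):
    """List the (key, field) pairs that have no text for this language."""
    rows = LANGUAGE.get(lang, {})
    return [(sku, field)
            for sku in CORE
            for field in LANGUAGE_FIELDS
            if field not in rows.get(sku, {})]


def fill_from(lang, source):
    """Copy entries from a fully populated language into the gaps of another."""
    rows = LANGUAGE.setdefault(lang, {})
    for sku, field in missing_entries(lang):
        rows.setdefault(sku, {})[field] = LANGUAGE[source][sku][field]


def view(lang):
    """Derive the View by a plain merge; no fallback happens at query time."""
    return {sku: {**core, **LANGUAGE[lang][sku]} for sku, core in CORE.items()}
```

The trade-off described under Performance is visible here: `view` is a plain merge with no per-field checks, but every new Core entry forces a write to each language before the View can be derived.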

One for all language

 * This approach has a single schema, and each piece of language data is tagged with its language.

Data - Includes language neutral data + all language data with language information

    <?xml version="1.0" encoding="UTF-8" ?>
    BlackBerry_8700c    10
    <displayname xml:lang="en">BlackBerry 8700c</displayname>
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
    HP_iPAQ_hw6515      25
    <displayname xml:lang="en">HP iPAQ hw6515</displayname>
    <description xml:lang="en">HP iPAQ hw6515 (Refurb)</description>
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>

View - View derived from the data for a specific language (i.e. the following is the Japanese view)

    <?xml version="1.0" encoding="UTF-8" ?>
    BlackBerry_8700c    10
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
    HP_iPAQ_hw6515      25
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>

 * Language data isolation
   * Language data isolation is poor since one file includes both language-neutral data and language data.
 * Fallback mechanism
   * Same as the Core - Language model.
 * Maintainability
   * Since language data is not isolated from the language-neutral data, maintainability is poor with this approach. Unless there is some mechanism to cope with sparse language data, maintainability gets even worse.
 * Storage Efficiency
   * Unless the View above has to be stored for some reason, this approach requires no redundant data.
 * Performance
   * Same as the Core - Language model.
 * Summary
   * This model works well only if the multilingual data is static and not updated frequently (e.g. locale-specific seed data). Using this model for frequently updated data is strongly discouraged.
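
Deriving a language View from a single tagged document amounts to filtering on xml:lang. The sketch below uses Python's standard xml.etree.ElementTree. The wrapper elements (inventory, item, name, quantity) are assumptions made so that the sample is well-formed, and `language_view` is a hypothetical helper name. Only displayname, description, and the xml:lang attribute come from the sample above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# One document holding language-neutral data and all language data.
# Element names other than displayname/description are assumed.
DATA = b"""<?xml version="1.0" encoding="UTF-8" ?>
<inventory>
  <item>
    <name>BlackBerry_8700c</name>
    <quantity>10</quantity>
    <displayname xml:lang="en">BlackBerry 8700c</displayname>
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
    <displayname xml:lang="ja">BlackBerry 8700c - Japanese</displayname>
    <description xml:lang="ja">BlackBerry 8700c (Refurb) - Japanese</description>
  </item>
  <item>
    <name>HP_iPAQ_hw6515</name>
    <quantity>25</quantity>
    <displayname xml:lang="en">HP iPAQ hw6515</displayname>
    <description xml:lang="en">HP iPAQ hw6515 (Refurb)</description>
    <displayname xml:lang="ja">HP iPAQ hw6515 - Japanese</displayname>
    <description xml:lang="ja">HP iPAQ hw6515 (Refurb) - Japanese</description>
  </item>
</inventory>
"""


def language_view(root, lang):
    """Keep untagged (language-neutral) elements and those tagged with lang."""
    result = {}
    for item in root.findall("item"):
        row = {}
        for child in item:
            child_lang = child.get(XML_LANG)
            if child_lang is None or child_lang == lang:
                row[child.tag] = child.text
        result[item.findtext("name")] = row
    return result
```

Calling `language_view(ET.fromstring(DATA), "ja")` yields the Japanese view. Note that the filter does no fallback: an element missing for the requested language is simply absent from the result, which is why this model needs the same sparse-data handling as Core - Language.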

Full per language

 * This approach should be considered a temporary solution for multilingual support. The idea is simply to duplicate the full data set for each language.

Data - Includes language neutral data + language data, one full copy per language

English Data

    <?xml version="1.0" encoding="UTF-8" ?>
    <inventory xml:lang="en">
    BlackBerry_8700c    10
    BlackBerry 8700c
    <description xml:lang="en">BlackBerry 8700c (Refurb)</description>
    HP_iPAQ_hw6515      25
    HP iPAQ hw6515
    HP iPAQ hw6515 (Refurb)
    </inventory>

Japanese Data

    <?xml version="1.0" encoding="UTF-8" ?>
    <inventory xml:lang="ja">
    BlackBerry_8700c    10
    BlackBerry 8700c - Japanese
    BlackBerry 8700c (Refurb) - Japanese
    HP_iPAQ_hw6515      25
    HP iPAQ hw6515 - Japanese
    HP iPAQ hw6515 (Refurb) - Japanese
    </inventory>

 * Language data isolation
   * Language data isolation is poor since each file includes both language-neutral data and language data.
 * Fallback mechanism
   * Similar to the Core - Language model, but the fallback happens at the schema level rather than for each element.
 * Maintainability
   * Since language data is not isolated from the language-neutral data, maintainability is poor with this approach. Unless there is some mechanism to cope with sparse language data, maintainability gets even worse.
 * Storage Efficiency
   * Since it essentially duplicates all data, storage efficiency is poor.
 * Performance
   * Similar to the Core - Language model. However, since the fallback happens at the schema level, query performance should be better than in the other approaches when language data is sparse.
 * Summary
   * This model works well only if the multilingual data is static, not updated frequently, and small. Its advantage is that the implementation is the simplest of all the approaches. However, using this model for frequently updated data is strongly discouraged since maintainability is poor.
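
Schema-level fallback means choosing a whole data set once, instead of resolving each element. A minimal sketch, assuming one complete data set per language held in a dictionary; `DATASETS`, `DEFAULT_LANG`, `select_dataset`, and the "quantity" label are hypothetical names, and treating English as the default data set is an assumption for illustration.

```python
# One complete copy of the data per language ("quantity" is a hypothetical name).
DATASETS = {
    "en": {
        "BlackBerry_8700c": {"quantity": 10,
                             "displayname": "BlackBerry 8700c",
                             "description": "BlackBerry 8700c (Refurb)"},
        "HP_iPAQ_hw6515": {"quantity": 25,
                           "displayname": "HP iPAQ hw6515",
                           "description": "HP iPAQ hw6515 (Refurb)"},
    },
    "ja": {
        "BlackBerry_8700c": {"quantity": 10,
                             "displayname": "BlackBerry 8700c - Japanese",
                             "description": "BlackBerry 8700c (Refurb) - Japanese"},
        "HP_iPAQ_hw6515": {"quantity": 25,
                           "displayname": "HP iPAQ hw6515 - Japanese",
                           "description": "HP iPAQ hw6515 (Refurb) - Japanese"},
    },
}

# Assumption: one data set is designated as the last resort.
DEFAULT_LANG = "en"


def select_dataset(lang):
    """Pick a whole data set: try 'ja-JP', then 'ja', then the default."""
    tag = lang
    while tag:
        if tag in DATASETS:
            return DATASETS[tag]
        tag = tag.rpartition("-")[0]  # drop the last subtag; "" when none left
    return DATASETS[DEFAULT_LANG]
```

The fallback runs once per request rather than once per field, which is the source of the query advantage noted above. The quantity values, however, are stored once per language, which illustrates the storage cost.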

Multilingual Data Structure with XML

 * TODO, sample implementation should be discussed, refer to ITS

Multilingual Data Structure with Relational Database

 * TODO, sample implementation should be discussed
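
As one possible starting point, the Base - Language model maps naturally onto two tables joined on the context key, with COALESCE providing the fallback to the base language. The sketch below uses Python's sqlite3 module with an in-memory database. The table names (item_base, item_lang), the quantity column, and the use of the item name as the sku value are all assumptions for illustration, and the second item is deliberately left untranslated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Base: language-neutral data plus base-language (English) text.
CREATE TABLE item_base (
    sku         TEXT PRIMARY KEY,
    quantity    INTEGER NOT NULL,
    displayname TEXT NOT NULL,
    description TEXT NOT NULL
);
-- Language: context key, language tag, and translated text (may be sparse).
CREATE TABLE item_lang (
    sku         TEXT NOT NULL REFERENCES item_base(sku),
    lang        TEXT NOT NULL,
    displayname TEXT,
    description TEXT,
    PRIMARY KEY (sku, lang)
);
INSERT INTO item_base VALUES
    ('BlackBerry_8700c', 10, 'BlackBerry 8700c', 'BlackBerry 8700c (Refurb)'),
    ('HP_iPAQ_hw6515', 25, 'HP iPAQ hw6515', 'HP iPAQ hw6515 (Refurb)');
-- Only one item is translated, so the Japanese data is sparse.
INSERT INTO item_lang VALUES
    ('BlackBerry_8700c', 'ja',
     'BlackBerry 8700c - Japanese', 'BlackBerry 8700c (Refurb) - Japanese');
""")


def view(lang):
    """Derive the View for one language, falling back to Base per column."""
    return conn.execute(
        """
        SELECT b.sku,
               b.quantity,
               COALESCE(l.displayname, b.displayname),
               COALESCE(l.description, b.description)
        FROM item_base AS b
        LEFT JOIN item_lang AS l
               ON l.sku = b.sku AND l.lang = ?
        ORDER BY b.sku
        """,
        (lang,),
    ).fetchall()
```

The LEFT JOIN keeps every Base row even when no translation exists, and COALESCE falls back column by column. A multi-step fallback such as ja-JP and then ja would need an additional join per step, which is where the query cost discussed for this model comes from.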