CLDR makes special use of XML because of the way it is structured. In particular, the XML is designed so that you can read in a CLDR XML file and interpret it as an unordered list of <path,value> pairs, called a CLDRFile internally. These path/value pairs can be added to or deleted, and then the CLDRFile can be written back out to disk, resulting in a valid XML file. That is a very powerful mechanism, and also allows for the CLDR inheritance model.
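As a sketch of that model (the element names are real ldml structure, though the fragment is abridged), a locale file containing:

```xml
<ldml>
    <localeDisplayNames>
        <languages>
            <language type="en">English</language>
        </languages>
    </localeDisplayNames>
</ldml>
```

is read in as the single pair <//ldml/localeDisplayNames/languages/language[@type="en"], English>: the path records the chain of elements plus their distinguishing attributes, and the value is the element's content.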
Sounds simple, right? But it isn't quite that easy.
In summary, when you add an element, attribute, or new kind of attribute value, there are some important steps you must also take. Note that running our unit tests and ConsoleCheck will catch most of these, but you should understand what is going on. Make sure that you don't break any of the invariants below (read through once to make sure you get them)! There is more detailed information further down on the page.
If you are only adding new alt values, it is much easier. You still need to update the related information, otherwise your strings won't show up properly in the Survey Tool, or the right default values won't be set. See Root Aliases.
We augment the DTD structure in various ways.
Add the annotations.
The following are required for elements, attributes, and attribute values.
We never have “mixed” content. That is, no element values can occur in anything but leaf nodes. You can never have <x>abcd<y>def</y></x>. You must instead introduce another element, such as: <x><z>abcd</z><y>def</y></x>
There is a strong distinction between rule elements and structure elements. Example: in collations you have <p>x</p><p>y</p> representing x < y. Clearly changing the order would cause problems! There are restrictions on this, however:
In order to write out an XML file correctly, we also have to know the valid ordering of paths for elements that are not ordered. This ordering is generated automatically from the DTD, constructed by merging. If there are any cycles in the ordering, then the CLDR tools will throw an exception, and you have to fix it. That also means that we cannot have complicated DTDs; each non-leaf node MUST be of the form:
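A sketch of the required shape (the element names here are illustrative, not from the real DTD):

```xml
<!ELEMENT someElement ( alias*, childA*, childB?, special* ) >
```

That is, a simple sequence of subelements, each marked * or ?, with no nested choices or mixed content that would make the merged ordering ambiguous.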
Each subelement of an element is marked with either * or ?. Note, however, that all leaf nodes MUST allow the attributes alt=..., draft=..., and references=.... So that alt can work, the leaf nodes MUST occur in their parent with *, not ?, even if logically there can be only one. For example, even though logically there is only a single quotationStart, we see:
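The declaration in ldml.dtd looks approximately like the following (check the current DTD for the exact text); note that quotationStart is declared with * even though only one is expected:

```xml
<!ELEMENT delimiters ( alias | ( quotationStart*, quotationEnd*, alternateQuotationStart*, alternateQuotationEnd*, special* ) ) >
```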
The attribute order is much more flexible, since it doesn't affect the validity of the file. That is, in XML the following are equal:
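For example, these two elements are equivalent as XML:

```xml
<territory type="US" alt="short">U.S.</territory>
<territory alt="short" type="US">U.S.</territory>
```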
However, when this is turned into a path, the order does matter. That is, as strings the following are not equal:
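For example, as plain strings these two paths differ, even though they would denote the same XML element:

```
//ldml/localeDisplayNames/territories/territory[@type="US"][@alt="short"]
//ldml/localeDisplayNames/territories/territory[@alt="short"][@type="US"]
```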
The ordering of attributes in the string path and in the output file is controlled by the ordering in the DTD. Certain attributes always come first (like _q and type), and certain others always come last (like draft and references). Normally you add new attributes to the middle somewhere.
When computing the file ordering, we compare paths using CLDRFile.ldmlComparator. Here is the basic ordering algorithm:
Walk through the elements in the path. For each element and its attributes:
While attribute value orderings are mostly alphabetic, we do have a number of tweaks in getAttributeValueComparator so that values come in a reasonable order, such as “sun” < “mon” < “tues” < ...
There is an important distinction between kinds of attributes. Distinguishing attributes are relevant to the identity of the path and to inheritance; for example, in <language type="en" ...> the type is a distinguishing attribute. Non-distinguishing attributes instead carry information: they aren't relevant to the identity of the path, nor are they used in the ordering above. Non-distinguishing attributes in the ldml DTD cause problems: try to design all future DTD structure to avoid them, and put data in element values, not attribute values. It is ok to have data in attributes in the other DTDs. The distinction between distinguishing and non-distinguishing attributes is captured in distinguishingData in CLDRFile, so by default, always add new ldml attributes to that array.
We use some default attribute values in our DTD, such as
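A hypothetical illustration of the pattern (xxx and yyy are placeholders, not real ldml names): with a declaration like the following, a file that omits the attribute silently means yyy="standard", so the file cannot be interpreted correctly without the DTD.

```xml
<!ATTLIST xxx yyy NMTOKEN "standard" >
```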
This was a mistake, since it makes the interpretation of the file depend on the DTD; we might fix it some day, perhaps if we move to Relax NG, but for now just don't introduce any more of these. It also means that we have a table of these values in CLDRFile, defaultSuppressionMap, and a corresponding <suppress> element in supplementalMetadata.
When you make a draft attribute on a new element, don't copy the old ones like this:
<!ATTLIST xxx draft ( approved | contributed | provisional | unconfirmed | true | false ) #IMPLIED ><!-- true and false are deprecated. -->
That is, we don't want the deprecated values on new elements. Just make it:
<!ATTLIST xxx draft ( approved | contributed | provisional | unconfirmed ) #IMPLIED >
The DTD cannot do anything like the level of testing for legitimate values that we need, so supplemental data also has a set of attributeValueValidity.xml data for checking attribute values. For example, we see:
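The entry looks approximately like the following (see attributeValueValidity.xml for the exact current syntax):

```xml
<attributeValues dtds="ldml" elements="calendar" attributes="type">$_bcp47_calendar</attributeValues>
```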
This means that whenever you see any matching dtd/element/attribute combination, it can be tested for a list of values that are contained in the variable $_bcp47_calendar. Some of these variables are lists, and some are regex, and some (those with $_) are generated internally from other information. When you add a new attribute to ldml, you must add a <validity> element unless it is a closed set.
For many reasons, never reuse an element name or attribute name unless you mean precisely the same thing, and the item is used in the same way. So to="2009-05-21" is always an attribute that means an end date. Be very careful about new elements with the same name as old ones: you can't have <territory> be an ordered element in one place and a non-ordered element in another. The attribute type=... is always used as an id. For historical reasons, it is sometimes distinguishing and sometimes not (this is very painful; don't add to it!). It is also not used as the id in numberingSystems.
If your new structure should have aliases, such as when the “narrow” values should default to the “short” values, which should default to the regular values, then you need to add aliases in root.xml. Look at examples there for how to do this.
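The general pattern in root.xml looks like this (abridged; here, "short" day names fall back to the "abbreviated" ones in the same locale):

```xml
<dayWidth type="short">
    <alias source="locale" path="../dayWidth[@type='abbreviated']"/>
</dayWidth>
```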
PathHeader.txt determines the placement and ordering in SurveyTool. It consists of a sequence of regex lines of the following form:
<regex> ; <section> ; <page> ; <header> ; <code>
Here's an example:
//ldml/dates/timeZoneNames/metazone[@type="%A"]/%E/%E ; Timezones ; &metazone($1) ; $1 ; $3-$2
These are also in the header of PathHeader.txt:
There is a set of variables at the top of the file. Each variable expands to a parenthesized (capturing) group, so the %A, %E, and %E correspond to the $1, $2, and $3 used in the <section> ; <page> ; <header> ; <code> fields.
The order of the section and page is determined by the enums in the PathHeader.java file. So the <section> and <page> must correspond to those enum values.
The results from PathHeader must be unique: that is, if the source paths are different, then at least one of <section> ; <page> ; <header> ; <code> must be different.
If you need to change the order of the header or code or the appearance programmatically, then you need to create a function (call it xyz), and use it in the PathHeader.txt file (eg &xyz($1)). In PathHeader.java, search for functionMap to see examples of these.
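As a sketch of the pattern (the class, map, and function name here are hypothetical stand-ins, not the actual PathHeader code), a named function is just a map entry that transforms the matched group into the text used for ordering or display:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class FunctionMapSketch {
    // Hypothetical stand-in for PathHeader's functionMap: maps a function
    // name (used as &name($1) in PathHeader.txt) to a string transform.
    public static final Map<String, Function<String, String>> functionMap = new HashMap<>();

    static {
        // &day($1): turn a day code into a sortable index string.
        functionMap.put("day", source -> {
            String[] days = {"sun", "mon", "tue", "wed", "thu", "fri", "sat"};
            for (int i = 0; i < days.length; i++) {
                if (days[i].equals(source)) {
                    return String.valueOf(i); // sortable key: sun=0, mon=1, ...
                }
            }
            return source; // unknown values fall back to their literal text
        });
    }

    public static void main(String[] args) {
        // &day($1) in PathHeader.txt would invoke this on the matched group.
        System.out.println(functionMap.get("day").apply("mon")); // prints "1"
    }
}
```

In the real code, search for functionMap in PathHeader.java for the actual entries and the surrounding machinery (including the order and suborder fields described below).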
The order of the header and then of the code within the same header is normally determined by the ordering in the file. To override this, set the order field in your function. For example, the following gets integer values and changes them into real ints for comparison.
int m = Integer.parseInt(source);
order = m;
There is also a “suborder” used in a few cases for the code. You probably don't need to worry about this, but here is an example. Ask for help on the cldr-dev list if you need this.
suborder = new SubstringOrder(source, 1);
The return value is the appearance to the user. For example, the following changes integer months into strings for display:
static String[] months = { "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Und" };
...
return months[m - 1];
If a value has placeholders, edit Placeholders.txt:
This file provides a description of each kind of path, and a link to a section of https://cldr.unicode.org/translation. The easiest approach is to take an existing description and modify it.
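A hypothetical entry (the placeholder name and example value are illustrative; copy the exact syntax from an existing line in Placeholders.txt):

```
^//ldml/dates/timeZoneNames/regionFormat ; {0}=REGION Japan
```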
Coverage determines the minimum coverage level at which a given item will appear in the survey tool. If a given field is not in coverage, then the item will not appear in the survey tool at all. This data is required for the elements in /main/.
The file common/supplemental/coverageLevels.xml is a series of regular expressions describing the paths and the coverage levels associated with each. The file also gives you the ability to define a “coverage variable”, which can then be used as a placeholder in the regular expressions used for matching. Always try to be as exact as possible and avoid using wildcards in the regular expressions, as they can impact lookup performance.
Coverage values are currently numeric, although we may change them to be words in the near future in order to make them easier to understand. The coverage level values are:
10 = Core data, 20 = POSIX, 30 = Minimal, 40 = Basic, 60 = Moderate, 80 = Modern, 100 = Comprehensive
Example: The following two lines define the coverage for the exemplar characters items. Note that “//ldml” is automatically prepended to the path names, in order to make the paths in this file smaller.
<coverageVariable key="%exemplarTypes" value="(auxiliary|index|punctuation)"/>
<coverageLevel value="10" match="characters/exemplarCharacters[@type='%exemplarTypes']"/>
Modify the following files as described in ldml2icu_readme.txt. This will allow NewLdml2IcuConverter.java to work properly so that the data can be read into ICU and tested there.
Unfortunately, you have to change input parameters to get the different kinds of generated files. Here's an example:
-s {workspace-cldr}/common/supplemental
-d {workspace-temp}/cldr/icu/
-t supplementalData
-k
Use -k to build into a single file, which is helpful for checking the supplemental data. There are a few other useful parameters if you look at the top of NewLdml2IcuConverter.
If you add a new kind of file or directory, you may have to adjust the tool to make sure it is seen and built. For example, if you add a new kind of supplemental file, you also have to modify SupplementalMapper.fillFromCldr(...).
There are three ways for paths to show up in the Survey Tool (and in other tooling!) even if the value is null for a given locale. These are important, since they determine what users will be able to enter.
Certain paths don't have to be present in locales. They are not counted as Missing in the Dashboard and shouldn't have an effect on coverage. To handle these, modify the file missingOk.txt to provide a regex that captures those paths. Be careful, however, not to be overly inclusive: you want all and only those paths that are ok to skip. Typically these are paths for which root values are perfectly fine.
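A hypothetical entry (the path is illustrative only; model yours on the existing lines in missingOk.txt):

```
//ldml/numbers/defaultNumberingSystem.*
```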
The following is an example of the different files that may need to be modified. It has both count= and a placeholder, so it hits most of the kinds of changes.
Whenever you modify values in English or Root, be sure to run GenerateBirth as described on Updating English/Root and check in the results. That ensures that CheckNew works properly. This must be done before the Survey Tool opens for the Submission Phase.