By Craig Silverstein
Khan Academy started as a collection of videos, but now has over 100,000 pieces of written content, from exercises to articles to programming challenges. All of these are now available in multiple languages. But the Khan Academy codebase was originally written to be English-only. We had to retrofit the codebase to support internationalization (i18n) and localization (l10n) of written content after a lot of infrastructure was already in place. As most guides to i18n and l10n will tell you, life is much happier if you design for them before the fact. This was not an option for us.
The task was made more difficult by the variety of technologies we’ve used over the years. We use 5 different HTML-rendering technologies:
- jinja2 (for our Python server code)
- react (for our modern JavaScript code)
- raw JavaScript (for older JavaScript code)
- handlebars (for HTML that has to be rendered via both Python and JavaScript)
- Python (for our very old Python server code, which just wrote HTML directly from source)
and all of them needed to be converted to add i18n markup.
i18nize-templates
=================
There are plenty of tools out there to handle the actual translation of strings; we use Babel and Jed.
And there are plenty of services out there to manage the actual translation of strings; we use Crowdin.
What is lacking is a tool that will mark up all the natural language text in your code and templates; this is the process that determines what text to show to translators. For this task, we developed i18nize-templates, a tool for finding natural language text in a variety of templating languages, and automatically munging it to be i18n-aware.
**sample input (jinja2)**:

``` {.html}
Badges
```

**sample output**:

``` {.html}
{{ _("Badges") }}
```

**sample input (handlebars)**:

``` {.html}
From the author: {{{ discussionFormControls "Post feedback" }}}
```

**sample output**:

``` {.html}
{{#_}}From the author:{{/_}} {{#_}}{{{ discussionFormControls "Post feedback" }}}{{/_}}
```

i18nize-templates isn’t magic: it can’t convert `item{#if n != 1#}s{#endif}` to the proper ngettext call. But it can reduce the time needed to annotate templates by over 90%.

i18nize-templates can convert raw HTML, jinja2 templates, and handlebars templates. (Due to similarities between jinja2 and django, it may also support django templates, though this is untested.) It can also convert text files written using jinja2 or handlebars.

Using i18nize-templates
=======================

We are pleased to announce i18nize-templates as an open source Python module. You can install i18nize-templates via

``` {.sh}
$ pip install i18nize-templates
```

Rewriting templates
-------------------

``` {.sh}
$ pip install i18nize-templates
$ echo "Hello {{world}}!" | i18nize-templates
i18nizing - {{ _("Hello %(world)s!", world=world) }}
$ echo "Hello {{world}}!" | i18nize-templates --handlebars
i18nizing - {{#_}}Hello {{world}}!{{/_}}
```

Extracting natural language text
--------------------------------

You can also just use i18nize-templates as a Python library to easily extract runs of natural language text from HTML and templated-HTML (or templated-text) documents. Here’s a Python snippet we use to fake-translate our website into our testing language, called box-language (
(Each template language has its own parser, of course, but these parsers are not suitable for text rewriting of the type we are attempting here, since they parse into an AST but do not provide a way to get from the AST back to a textual representation.)

For this reason, i18nize-templates implements its own lexers: one that can handle raw HTML, one that can handle jinja2-annotated HTML, and one that can handle handlebars-annotated HTML. They are all based on the Python standard library module `markupbase`, which is what the standard library class `HTMLParser` is based on. We did not base the lexer on HTMLParser directly, since it was too difficult to subclass for the template-specific lexers. This also allowed for some simplifications: we don't parse out HTML entities, for instance.

The lexers call a user-provided callback function for every 'element' that they see. There are only a few different types of elements:

- An HTML tag
- A run of text between HTML tags
- A template variable (`{{variable}}` in jinja2)
- A template comment (`{#comment#}` in jinja2)
- A template block construct (`{%block construct%}...{%endblock%}` in jinja2)

The main role of the lexer, besides tokenizing the input into elements, is to categorize each element as either **separating natural language text** or **not separating natural language text**. This concept is closely related to the HTML distinction between block and inline elements.
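To make the separating versus non-separating distinction concrete, here is a minimal sketch of a callback-driven lexer that groups text into translatable runs. It is not the i18nize-templates implementation: for brevity it subclasses the standard library's `HTMLParser` (which the real lexers deliberately avoid), it handles raw HTML only, and the names `SEPARATING_TAGS`, `ElementLexer`, and `extract_runs`, the tag list, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: a small, hypothetical set of tags that separate natural
# language text. The real tool's list is not shown in this post.
SEPARATING_TAGS = {"p", "div", "ul", "ol", "li", "h1", "h2", "td", "br"}


class ElementLexer(HTMLParser):
    """Calls callback(kind, text, separates) for every tag and text run."""

    def __init__(self, callback):
        super().__init__()
        self.callback = callback

    def handle_starttag(self, tag, attrs):
        # Pass the tag through verbatim so inline markup survives in a run.
        self.callback("tag", self.get_starttag_text(), tag in SEPARATING_TAGS)

    def handle_endtag(self, tag):
        self.callback("tag", "</%s>" % tag, tag in SEPARATING_TAGS)

    def handle_data(self, data):
        # Text itself never separates; only tags can.
        self.callback("text", data, False)


def extract_runs(html):
    """Returns the runs of natural language text found in `html`."""
    runs = []
    current = []  # (kind, text) pairs collected since the last separator

    def flush():
        # Keep the run only if it holds real text, not just inline tags.
        if any(kind == "text" and text.strip() for kind, text in current):
            joined = "".join(text for _, text in current)
            runs.append(" ".join(joined.split()))
        current.clear()

    def on_element(kind, text, separates):
        if separates:
            flush()
        else:
            current.append((kind, text))

    lexer = ElementLexer(on_element)
    lexer.feed(html)
    lexer.close()
    flush()
    return runs


# Hypothetical sample input, deliberately missing closing tags.
SAMPLE_HTML = (
    "<p>This is what I like to do:"
    "<ul><li>Go to the movies<li>Read books<li>Sleep a <b>lot</b></ul>"
)
print(extract_runs(SAMPLE_HTML))
```

On this sample the sketch yields four runs, with `Sleep a <b>lot</b>` kept together because `<b>` is not in the separating set. It keeps the trailing colon on the first run; trimming punctuation like that would be a separate step.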
If you have (somewhat ill-formed) HTML like the following:

``` {.html}
This is what I like to do: md5-026f27c4aa80a61e63ea135333148e50
```

You want to present the translator with four different strings to translate: "This is what I like to do" (probably you don't want to include the colon); "Go to the movies"; "Read books"; "Sleep a lot". You don't want to present the translator with that entire block of HTML as just one giant string to translate.

In this example, the list tags **separate** the natural language text into semantically distinct blocks that can (and should) be translated separately. The inline tags around "lot", on the other hand, do not; we don't want to tell the translator to translate "Sleep a" and "lot" separately! When making a callback on an element, the i18nize-templates lexers say whether that element separates natural language text or not.

Note that while related to the concept of HTML inline elements, the implementation of natural language text separation is slightly different, due to the semantics of some of the HTML tags. For instance, `