Drupal 8: Multilingual and revisionable entities

The Entity System

Content Translation (CT)

Today in Drupal 7 we have the concept of Content Translation (CT) for nodes. Nodes are a special entity type that supports multiple languages and revisioning at the same time.
With CT we have a separate entity instance per language, each with its own primary key (nid). The nodes are held together via a translation set (tnid) that is set to the primary key (nid) of the source language node. Each of those nodes has its own revision history. This allows clearly separated per-language workflows. As a drawback, tools like i18n sync are needed to keep values in sync that should be the same for all languages. This approach carries the risk of nodes getting out of sync (e.g. accidentally different product prices across languages). Also, an update of a revisioned “untranslatable” value leads to a new revision record per language node.
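
To make the translation set idea concrete, here is a minimal sketch in Python (not Drupal code). The Node class, the price property and the out_of_sync helper are assumptions made up for illustration; only nid and tnid come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Node:
    nid: int       # primary key, one per language node
    tnid: int      # translation set id = nid of the source language node
    langcode: str
    price: float   # hypothetical value that should be identical in all languages

def out_of_sync(nodes, tnid):
    """Return True if the nodes of one translation set disagree on the price."""
    prices = {n.price for n in nodes if n.tnid == tnid}
    return len(prices) > 1

nodes = [
    Node(nid=10, tnid=10, langcode="en", price=19.90),  # source language node
    Node(nid=11, tnid=10, langcode="de", price=19.90),
    Node(nid=12, tnid=10, langcode="fr", price=17.90),  # edited in one language only
]
```

Because every language node carries its own copy of the value, nothing in the data model itself prevents the French node from drifting away from the others.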

Drupal.org Contrib Entity Translation (DOCET)

While Drupal 7 core already introduced tools to make fields multilingual, there was no UI to cover that approach. A contrib module was introduced to provide those features and further progress was made over time. However it never reached a state where it was discussed as a reasonable full replacement for content translation in Drupal 7.
http://drupal.org/project/entity_translation

Current Entity core iteration (Drupal 8)

As a first step, the current iteration of the entity system will introduce a clean base system. It will provide unified methods to store properties, access them and describe their metadata. The system will have mature introspection capabilities and act as a base for further improvements.
The following features are intentionally left out for now in order to avoid complexity:

Revisions
Multilingual

Once the entity base system is implemented cleanly, those features will be added step by step.
In addition, the expectation is that both properties and fields will be covered by the same mechanisms in the API.
The intention is that Drupal 8 Entity Translation will completely replace CT.

New Entity OO architecture VS DOCET

The new architecture of core entities is much more advanced than the DOCET project. However, as a starting point we compare against DOCET to understand which problems showed up along that path and what we need to solve differently in core.

Regressions

If the new Drupal 8 core entity translation system produces significant regressions for a specific use case, one can still disable the entity translation feature and introduce a CT-like concept in contrib, based on per-language entity instances with relations to the source entity node (translation set).
If we care about these regressions and make entity translation fully fledged, we won’t need something like CT in the future and can avoid a whole bunch of duplicate work.
Possible regressions, and approaches to avoid or fix them, are listed below.

Heavy Entities

In DOCET, all multilingual data of an entity is loaded for all languages at the same time. This results in heavy objects when dealing with many languages, and in turn in bad performance.
The new OO-based core entity architecture is capable of lazy loading with magic methods if we need it.
An entity instance always needs its general data values, but lazy loading might make sense for multilingual values on a per-language basis. More about the approach below.
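
A minimal sketch of per-language lazy loading, written in Python rather than PHP. The LazyEntity class and the load_translation callback are hypothetical names; an explicit translation() method stands in for the magic methods mentioned above.

```python
class LazyEntity:
    """Holds base values eagerly and loads translations on first access."""

    def __init__(self, base_values, load_translation):
        self.base = base_values
        self._load = load_translation   # callable: langcode -> dict of values
        self._translations = {}         # cache of languages loaded so far

    def translation(self, langcode):
        if langcode not in self._translations:
            self._translations[langcode] = self._load(langcode)
        return self._translations[langcode]

storage = {"en": {"title": "Hello"}, "de": {"title": "Hallo"}}
loaded = []  # records which languages were actually fetched

def load_translation(langcode):
    loaded.append(langcode)
    return storage[langcode]

entity = LazyEntity({"id": 1, "type": "article"}, load_translation)
title = entity.translation("de")["title"]
entity.translation("de")  # second access is served from the cache
```

Only the requested language is ever fetched, so the cost of an entity load no longer grows with the number of languages.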

Pessimistic Locking

In DOCET, the pessimistic locking strategy of Drupal leads to multilingual concurrency problems. If a user creates an entity and notifies N translators to translate it, things go wrong as soon as they work simultaneously. If multiple translators have the edit UI open, only the first one can save. All others will receive a message “The content has changed”.
It should be possible to save translation data for one language without outdating the entity base data and without ever affecting other translations. This also affects revisioning.
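
One way to picture this requirement is a change check that is tracked per language instead of per entity. The following Python sketch is an assumption about how such a check could look, not a description of existing Drupal behavior; TranslationStore and ConflictError are made-up names.

```python
class ConflictError(Exception):
    pass

class TranslationStore:
    def __init__(self):
        self.values = {}    # langcode -> values
        self.version = {}   # langcode -> counter, bumped on every save

    def load(self, langcode):
        """Return the values plus the version the editor started from."""
        return dict(self.values.get(langcode, {})), self.version.get(langcode, 0)

    def save(self, langcode, values, based_on):
        # Only a change to the same language counts as a conflict.
        if self.version.get(langcode, 0) != based_on:
            raise ConflictError(langcode)
        self.values[langcode] = values
        self.version[langcode] = based_on + 1

store = TranslationStore()
_, de_version = store.load("de")   # two translators open the edit UI
_, fr_version = store.load("fr")
store.save("de", {"title": "Hallo"}, de_version)
store.save("fr", {"title": "Bonjour"}, fr_version)  # not blocked by the "de" save
```

A second save for "de" that is still based on the old version would raise ConflictError, while the French translator is unaffected.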

No language source

In DOCET an entity contains untranslatable values and all multilingual ones at the same time. There’s no concept of a language source or default language (as introduced by CT).
If someone updates items of a specific language, we never know whether it’s an update of a translation (to make it fit the source) or an update of the “source” that outdates other translations.
In general, we cannot mark things as up-to-date or outdated, and no relationship between language values exists.
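
If a source language were known, the outdated state could be derived on save. This is a minimal Python sketch under that assumption; the Translations class and its outdated set are hypothetical.

```python
class Translations:
    def __init__(self, source_langcode):
        self.source = source_langcode
        self.values = {}
        self.outdated = set()   # langcodes whose translation lags behind the source

    def save(self, langcode, value):
        self.values[langcode] = value
        if langcode == self.source:
            # A source update outdates every existing translation.
            self.outdated = set(self.values) - {self.source}
        else:
            # A translation update brings that language up to date.
            self.outdated.discard(langcode)

t = Translations("en")
t.save("en", "Hello")
t.save("de", "Hallo")
t.save("fr", "Bonjour")
t.save("en", "Hello world")   # source changed: de and fr are outdated
t.save("de", "Hallo Welt")    # de catches up
```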

No multilingual revision support

In DOCET multilingual fields are not covered by revisions.
Multilingual data needs revisioning as well.

No Revisions per language

In DOCET, revisions are per entity only (not per language) and cannot even be assigned to a specific language. Even if revisions covered multilingual values, all written revisions would still be displayed in one sequence (revisions tab).
If we want to know whether a revision changes any value for a specific language, there’s no simple way to extract this data.
It’s important to have a revision graph per language and to know in which revisions the data of a specific language changes.
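
As an illustration, if every revision recorded which languages it touched, the per-language history would be a simple filter. The record layout below is an assumption for this Python sketch.

```python
# Each revision records the set of languages whose values it changed.
revisions = [
    {"vid": 1, "changed": {"en"}},
    {"vid": 2, "changed": {"de"}},
    {"vid": 3, "changed": {"en", "fr"}},
    {"vid": 4, "changed": set()},   # only language-neutral values changed
]

def history(revisions, langcode):
    """Return the revision ids that changed data for one language."""
    return [r["vid"] for r in revisions if langcode in r["changed"]]
```

With this data, history(revisions, "en") yields revisions 1 and 3, while the German history consists of revision 2 only.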

No multilingual diff

In DOCET, diffing revisions shows the diff for the current language only, without further notice. If you accidentally create an entity diff with the wrong UI language, it looks like nothing changed in the entity, although only the current language is unchanged.
When dealing with revisions, we need to compare values of translations independently of the current language.
Possibly we need to display changes across all languages as an overview.
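
Such an overview could be computed by comparing two revisions language by language. A minimal Python sketch, assuming a revision is a mapping of langcode to field values (the function name multilingual_diff is made up):

```python
def multilingual_diff(old, new):
    """Compare two revisions language by language.

    old and new map langcode -> {field: value}.
    Returns langcode -> sorted list of changed field names.
    """
    result = {}
    for langcode in sorted(set(old) | set(new)):
        before = old.get(langcode, {})
        after = new.get(langcode, {})
        changed = sorted(
            f for f in set(before) | set(after) if before.get(f) != after.get(f)
        )
        if changed:
            result[langcode] = changed
    return result

rev_a = {"en": {"title": "Hello", "body": "Text"}, "de": {"title": "Hallo"}}
rev_b = {"en": {"title": "Hello", "body": "Text"}, "de": {"title": "Hallo Welt"}}
diff = multilingual_diff(rev_a, rev_b)
```

Viewed in English only, these two revisions look identical. The overview shows that the German title changed.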

No per-language save

In DOCET, every save stores all languages at the same time. If it implemented revisioning, it would duplicate all multilingual records whenever a new revision is created.
This leads to heavy data duplication on systems with many languages. Heavy duplication will make your database explode, make the dumps annoyingly big and kill your database performance (huge, growing query cache needed).
If we have revisions per language, a dedicated per-language save should be provided.
If D8 also introduces inline editing, each field edit might trigger an entity save. If creating new revisions is enabled, this will create even more duplicates.

Multilingual workflow

In DOCET we cannot publish a node in one language only. If we allow the node status to be multilingual, this results
It’s important to understand that this is not a “translated status”. It provides a different workflow status per language, which is why the term “translation” would be wrong: status per language = multilingual status.
We need to be able to define independent workflow states per language. That’s also why we need revision separation.

Per-language access

A specific language of a node entity might be published and others not.
Thus the access to an entity might be language dependent.
A node with ID might be available on /en/node/ID while it’s “access denied” or “not found” on /fr/node/ID.
Regardless of whether some fields are translated or not, access is decided by entity + requested language, not by the possibly partially merged language.
DOCET thus has a global state table to store per-language states.
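
A minimal sketch of such a per-language state lookup in Python. The published mapping and the has_access function are assumptions for illustration and do not mirror the actual DOCET table.

```python
# (entity id, langcode) -> published status
published = {(42, "en"): True, (42, "fr"): False}

def has_access(entity_id, langcode):
    # Decided by entity + requested language only.
    # Field-level language fallback plays no role here.
    return published.get((entity_id, langcode), False)
```

In this sketch /en/node/42 would be served, while /fr/node/42 and any language without a state record would be denied.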

Current revision per language

Since languages need to be revisioned separately, we need a current revision per language in addition to a language-neutral current revision.
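
The pointer structure could look roughly like this. It is a Python sketch with assumed names (pointers, current_revision, set_current), not a proposed schema.

```python
pointers = {
    "neutral": 7,                        # current revision of language-neutral values
    "by_language": {"en": 7, "de": 5},   # current revision per language
}

def current_revision(pointers, langcode=None):
    """Return the current revision id for a language, or the neutral one."""
    if langcode is None:
        return pointers["neutral"]
    return pointers["by_language"].get(langcode)

def set_current(pointers, langcode, vid):
    # Moving one language forward leaves all other pointers untouched.
    pointers["by_language"][langcode] = vid

set_current(pointers, "de", 8)
```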

Query entities

We need to write a per-language record with undefined language (currently “und”) into the multilingual table for untranslatable entities. If this record is not present, we cannot efficiently query for “nodes that are untranslatable or available in the current language”.
@todo explain ...
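
A possible reading of this requirement, sketched with SQLite from Python. The node_language table and its columns are hypothetical and not the Drupal schema. With the “und” record in place, a single IN condition is enough:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node_language (nid INTEGER, langcode TEXT);
    -- node 1 is translated, node 2 is untranslatable, node 3 is German only
    INSERT INTO node_language VALUES (1, 'en'), (1, 'de'), (2, 'und'), (3, 'de');
""")

def nodes_for_language(langcode):
    """Nodes that are untranslatable or available in the given language."""
    rows = db.execute(
        "SELECT DISTINCT nid FROM node_language "
        "WHERE langcode IN (?, 'und') ORDER BY nid",
        (langcode,),
    )
    return [nid for (nid,) in rows]
```

Without the “und” row, node 2 would only be found by additionally checking for the absence of any language record.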

Further possibly needed features

Revision integrity

Revisions currently pretend that you have a record of the history of an entity. However, we allow revisions to be updated, which can lead to significant integrity failures. In Drupal 7, editors can suppress writing a new revision on an update, even if revisioning is generally enabled.
If you e.g. pass a node revision A to an external service for translation, you never know if the entity changed afterwards (revision A is now in state B) although it still has the same revision number. A local diff against the original revision now shows you no change.
If a new revision C is written in the meantime and you want to compare revision C to the originally submitted revision to outline the changes, you’ll end up with a diff(C,B), and the changes between A and B are inaccessible.
One option is to define that revisions must not be updated and serve as history/integrity records only.
If revisions are allowed to change, we should be able to mark a revision as locked, which forces the creation of a new revision on the next update. In the following lines we refer to this principle as “integrity locked revision”.
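
The principle can be sketched as follows in Python. RevisionLog and its methods are made-up names, and the log is reduced to a plain list.

```python
class RevisionLog:
    def __init__(self):
        self.revisions = []   # list of {"values": ..., "locked": bool}

    def save(self, values, new_revision=False):
        """Update the latest revision in place unless it is locked."""
        if new_revision or not self.revisions or self.revisions[-1]["locked"]:
            self.revisions.append({"values": values, "locked": False})
        else:
            self.revisions[-1]["values"] = values
        return len(self.revisions)   # revision number that was written

    def lock(self, number):
        self.revisions[number - 1]["locked"] = True

log = RevisionLog()
a = log.save("draft")                 # revision 1
log.lock(a)                           # e.g. sent to an external translation service
b = log.save("draft, edited")         # forced into revision 2 because 1 is locked
c = log.save("draft, edited twice")   # revision 2 is unlocked, updated in place
```

Revision 1 keeps exactly the state that was handed to the external service, so a later diff against it stays meaningful.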

Hidden / Important revisions

Writing more revisions (due to the integrity issue) might result in an unexpectedly large number of revisions.
In Drupal 7, editors can decide whether they want to write a new revision. To let them manage their revisions, we could mark explicitly requested revisions as important. Editor-created revisions also typically have a description, while automated ones don’t. Editors would not see the other (auto-created) revisions in the revision log unless they explicitly query for them. This improves the editor experience and still allows revision integrity.
This feature can be covered by contrib.

Per-property save

A strategy is needed to reduce revisioning duplicates. If revisioning is enabled and every entity save creates a new revision, heavy data duplication occurs.
Even the introduction of property save methods can’t prevent the need to update much more than the single requested property. When a new revision is written, the complexity is equal to a full entity save. However, with a property save interface we could record the source or reason of the change.

Prune intermediate inline revisions

When introducing inline editing in combination with revisioning, every edit leads to the creation of a revision.
We could automatically invalidate (delete) the previous revision as long as it’s not locked, and thus limit the amount of data persisted.
We might also suppress creation of new revisions while inline editing as long as a revision is not integrity locked.
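
A minimal Python sketch of the pruning idea. It additionally assumes that only revisions created by inline edits are pruned, so that editor-created revisions survive; the inline flag is an assumption of this sketch.

```python
def save_inline_edit(revisions, values):
    """Append a revision for an inline edit.

    An unlocked predecessor that was itself created by an inline edit
    is dropped first, so only the latest intermediate state is kept.
    """
    if revisions and revisions[-1]["inline"] and not revisions[-1]["locked"]:
        revisions.pop()
    revisions.append({"values": values, "inline": True, "locked": False})

revisions = [{"values": "v1", "inline": False, "locked": False}]
for text in ("v2", "v3", "v4"):
    save_inline_edit(revisions, text)   # v2 and v3 are pruned along the way
revisions[-1]["locked"] = True          # v4 becomes integrity locked
save_inline_edit(revisions, "v5")       # v4 must be kept
```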

Prune old revisions

There might be a request to prune old, unneeded revisions to reduce storage needs.
A contrib module might even delete editor-created revisions that are older than a specific date. However, it would need to keep integrity locked revisions because they might be subject to explicit external references. Integrity locks possibly need reference counters.
For reference: Wikipedia seems to keep its revisions forever as a record.

Multilingual fallback per property

When presenting an entity, the Drupal 7 field / CT system implements per-field language fallback. Thus it’s possible that one multilingual field is translated into language X while a second multilingual field is undefined in language X.
However, the per-field / per-property fallback strategy has no effect on general entity access. So the general decision whether an entity is accessible in a certain language needs to be made without consideration of multilingual property states.

The current revision

An entity consists of N revision records in a strict graph (save sequence) where one of those is set as current.
In Drupal 7, after saving a new “hidden” revision you need to save the entity node again with the original current revision, because the save automatically sets the current revision pointer to the newly written revision. This invalidates caches unnecessarily and leaves an unclean state in the middle of the operation.
We need to be able to write a future revision without invalidating the current revision pointer.
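
The desired behavior could be sketched like this in Python. The Entity class, the make_current flag and set_current are assumptions for illustration.

```python
class Entity:
    def __init__(self, values):
        self.revisions = [values]
        self.current = 0   # index of the current revision

    def save_revision(self, values, make_current=True):
        """Write a new revision and optionally leave the pointer alone."""
        self.revisions.append(values)
        vid = len(self.revisions) - 1
        if make_current:
            self.current = vid
        return vid

    def set_current(self, vid):
        self.current = vid

    def current_values(self):
        return self.revisions[self.current]

entity = Entity("live text")
draft = entity.save_revision("future text", make_current=False)
live_before = entity.current_values()   # still the published revision
entity.set_current(draft)               # one explicit pointer move
live_after = entity.current_values()
```

The pointer moves exactly once, when the future revision goes live, so there is no intermediate state in which the wrong revision is current.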

Untranslated items / Language fallback

The language fallback works well if an item is missing in the current language.
Since language fallbacks contain sequence rules, they cannot be put efficiently into an SQL query. A separate join for each of the fallback steps is needed, since the SQL operator “IN” doesn’t support priorities / sequences.
A recent issue with e.g. Views showed that the fallback sequence even needs to be applied per field (because the translation state might differ per field), resulting in subqueries per field that contain the fallback sequence.
It’s important to understand and define the role of language fallbacks and whether they need to be easily queryable. Possibly denormalization is needed in the storage system to make fast access possible.
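
To illustrate why the fallback has to be resolved per field, here is a minimal Python sketch done in application code rather than SQL. The resolve function and the sample data are made up.

```python
def resolve(field_values, fallback_chain):
    """Return (langcode, value) for the first language in the chain with a value."""
    for langcode in fallback_chain:
        if langcode in field_values:
            return langcode, field_values[langcode]
    return None, None

entity = {
    "title": {"en": "Hello", "fr": "Bonjour"},
    "body": {"en": "English body"},   # not translated into French
}
chain = ["fr", "en", "und"]           # ordered: first match wins
rendered = {field: resolve(values, chain) for field, values in entity.items()}
```

The title resolves to French and the body to English, so the two fields of one entity end up in different languages. This ordered, per-field lookup is what a plain IN condition cannot express.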

Next steps

After a discussion of this document I’ll outline an approach to cover all the regressions and needs.

Credits: Thanks Schnitzel for reviewing the document.