MT Evaluation Procedures


Introduction

This page describes Pangeanic's procedures for Machine Translation (MT) quality evaluation.


The metrics are standardized by defining a mandatory evaluation step in the production pipeline, applied both to engines created by Pangeanic's deep-learning team and to engines trained by customers when Pangeanic is required to evaluate them.


Evaluation Step in the Training Pipeline

The pipeline for producing a new neural engine is outlined in the chart below.


The first steps deal with data: acquiring the training corpus, selecting it, and preparing it for training.


Typically, several million aligned segments are ready to enter the training procedure. Before training starts, a subset of the bilingual corpus is reserved for the evaluations.
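
As an illustration of this reservation step, the following minimal sketch shows how a held-out evaluation set could be carved out of an aligned bilingual corpus before training. It is not Pangeanic's actual tooling; the file names, language pair, and split size are assumptions.

```python
import random

# Hypothetical file names and split size; the real corpus paths are assumptions.
SRC_FILE, TGT_FILE = "corpus.et", "corpus.it"
EVAL_SIZE = 2000          # number of segment pairs reserved for evaluation
random.seed(13)           # fixed seed so the split is reproducible

with open(SRC_FILE, encoding="utf-8") as f_src, open(TGT_FILE, encoding="utf-8") as f_tgt:
    pairs = list(zip(f_src.read().splitlines(), f_tgt.read().splitlines()))

random.shuffle(pairs)
eval_pairs, train_pairs = pairs[:EVAL_SIZE], pairs[EVAL_SIZE:]

# Write the training and evaluation subsets back out as parallel files.
for name, subset in (("train", train_pairs), ("eval", eval_pairs)):
    with open(f"{name}.et", "w", encoding="utf-8") as f_src, \
         open(f"{name}.it", "w", encoding="utf-8") as f_tgt:
        for src, tgt in subset:
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")
```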


Three evaluation steps are executed during the pipeline:

  • Internal training driver: an automated step executed after every training epoch to track training progress and guide its evolution. This step is a core procedure in machine-learning training.

  • Automatic evaluation: executed after training is over; it generates the standard evaluation metrics. It is a refinement of the internal training driver and does not trigger any validation or rework action.

  • Human evaluation: run after the engine is finished and packaged into a Docker pod for inclusion in Pangeanic's service ecosystem. If the metrics do not reach the desired thresholds, the process may return to the data selection step and be repeated; a sketch of this quality gate follows the list.
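
The quality gate described above can be pictured as a simple control loop. The sketch below is purely illustrative: the threshold value, the round limit, and all function bodies are placeholders, not Pangeanic's real pipeline code.

```python
# Purely illustrative sketch of the evaluation gate; all values are assumptions.
QUALITY_THRESHOLD = 80.0   # assumed 0-100 pass mark; real values are project-specific

def select_and_prepare(raw_data, previous_corpus=None):
    # Placeholder for corpus acquisition, selection and preparation.
    return raw_data

def train_engine(corpus):
    # Placeholder for neural engine training.
    return {"corpus_size": len(corpus)}

def human_evaluation(model):
    # Placeholder for the DQF-MQM based human score (0-100).
    return 85.0

def production_pipeline(raw_data, max_rounds=3):
    corpus = select_and_prepare(raw_data)
    for _ in range(max_rounds):
        model = train_engine(corpus)
        score = human_evaluation(model)
        if score >= QUALITY_THRESHOLD:
            return model          # in production, packaged as a Docker pod
        # Below threshold: return to data selection and repeat the process.
        corpus = select_and_prepare(raw_data, previous_corpus=corpus)
    raise RuntimeError("Engine did not reach the quality threshold")

print(production_pipeline(["aligned segment pair"] * 1000))
```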


Automated Evaluation

Pangeanic's automated evaluation is performed by calculating the TER and BLEU metrics.


The main objective is to compute numerical scores that reflect the performance level of specific MT systems. These scores are expected to correlate with human judgments about translation quality.


Automated tools are typically calibrated against human evaluation. We use reference proximity techniques that compare MT output with gold-standard human translations.


For example:

  • WER (Word Error Rate): a Levenshtein distance-based metric that counts insertions, deletions, and substitutions.

  • TER (Translation Error Rate): improves on WER by also counting shifts (reordering of word sequences).

  • BLEU: based on n-gram overlap between the MT output and the reference translation (usually with n from 1 to 4); a code sketch of these metrics follows the list.
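
The following minimal sketch shows how such reference-proximity scores can be computed with the open-source sacrebleu library. It illustrates the metrics themselves, not Pangeanic's internal evaluation code, and the example sentences are invented.

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

# Invented example data: two MT hypotheses and one reference stream aligned with them.
hypotheses = [
    "the contract shall be signed by both parties",
    "the committee approved the new regulation yesterday",
]
references = [[
    "the contract must be signed by both parties",
    "yesterday the committee approved the new regulation",
]]

bleu = BLEU()   # n-gram precision up to 4-grams with a brevity penalty
ter = TER()     # edit distance including shifts, normalized by reference length

print(bleu.corpus_score(hypotheses, references))   # e.g. "BLEU = ..."
print(ter.corpus_score(hypotheses, references))    # e.g. "TER = ..."
```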


Human Evaluation

Pangeanic human evaluation is based on DQF-MQM (Dynamic Quality Framework / Multidimensional Quality Metrics).


The DQF-MQM error typology is a standard framework with a comprehensive catalogue of quality error types; it is used to generate quality scores that are checked against a predefined quality threshold.
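
As a rough illustration of how an error typology can be turned into a score and checked against a threshold, the sketch below applies severity weights to annotated issues and normalizes by text length. The weights, issue names, and threshold are assumptions for the example, not the official DQF-MQM parameters.

```python
# Illustrative severity weights and threshold; real DQF-MQM deployments define their own.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}
QUALITY_THRESHOLD = 85.0

def mqm_style_score(issues, word_count):
    """Turn a list of (issue_type, severity) annotations into a 0-100 quality score."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in issues)
    # Normalize the penalty per 100 words, then map it onto a 0-100 score.
    return max(0.0, 100.0 - 100.0 * penalty / word_count)

# Invented annotations: (issue type from the hierarchy, severity)
issues = [
    ("accuracy/mistranslation", "major"),
    ("fluency/grammar", "minor"),
    ("terminology/inconsistent", "minor"),
]
score = mqm_style_score(issues, word_count=250)
print(f"Score: {score:.2f} -> {'PASS' if score >= QUALITY_THRESHOLD else 'REWORK'}")
```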


The MQM system was developed under the QTLaunchPad project and is designed to be flexible and easy to integrate across production cycles.


The central component is a hierarchical issue type list. MQM currently includes 114 issue types.
Fine-grained categories can be added as custom MQM extensions if needed.

Official definition: http://qt21.eu/mqm-definition



MTET Tool

Pangeanic uses MTET (Machine Translation Evaluation Tool) to execute human evaluation.

Two user profiles:

  • Project Manager:
    Creates evaluation projects, sets parameters (extension and DQF-MQM variation), assigns them to linguists, and tracks progress.

  • Linguist:
    Executes evaluations, either:

    • Absolute: generates a 0–100 score using weighted error typology and severity.

    • Relative: compares Pangeanic output vs Google Translate with gold reference.



Final scores are differential metrics per dimension (accuracy, fluency, etc.).
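
For the relative mode, the differential per-dimension scores can be pictured as in the short sketch below. The dimension names, numbers, and aggregation are invented for illustration and are not the exact MTET formula.

```python
# Invented per-dimension scores (0-100) produced by linguists for each system.
pangeanic_scores = {"accuracy": 91.0, "fluency": 88.5, "terminology": 93.0}
google_scores    = {"accuracy": 85.5, "fluency": 86.0, "terminology": 84.0}

# Differential metric per dimension: positive values favour the Pangeanic output.
differential = {
    dimension: round(pangeanic_scores[dimension] - google_scores[dimension], 2)
    for dimension in pangeanic_scores
}

for dimension, delta in differential.items():
    print(f"{dimension:<12} {delta:+.2f}")
```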


Project Creation in MTET

The PM uploads a TMX file containing the source, the reference translation, and the MT outputs to be compared; a parsing sketch follows the list below.

Each Translation Unit (TU) includes:

  • Source

  • Reference

  • Pangeanic output

  • Google Translate output
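
As an illustration of how such a multi-variant TMX could be read, the sketch below walks the translation units with Python's standard XML parser. The language code and the "variant" property convention used to distinguish the reference, Pangeanic, and Google segments are assumptions, since standard TMX only distinguishes variants by language.

```python
# Illustrative TMX reader; the <prop type="variant"> convention is a hypothetical assumption.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_units(tmx_path, source_lang="et"):
    tree = ET.parse(tmx_path)
    for tu in tree.getroot().iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.findtext("seg", default="")
            if tuv.get(XML_LANG, "").startswith(source_lang):
                unit["source"] = seg
            else:
                # Hypothetical convention: a <prop type="variant"> marks which
                # target this is (reference, pangeanic, or google).
                prop = tuv.find("prop[@type='variant']")
                unit[prop.text if prop is not None else "target"] = seg
        yield unit

for unit in read_units("evaluation_project.tmx"):   # hypothetical file name
    print(unit)
```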

For example, a project might evaluate an Estonian → Italian engine.

Evaluators are then assigned their tasks and use a full MQM interface to annotate issues with type and severity.


Current Metrics

The evaluation scores for selected Pangeanic engines vs Google Translate are listed below (scale 0–100):

Engine/Model | Pangeanic | Google Translate
English to Greek, generic domain, formal style | 92.45 | 86.60
Dutch to Bulgarian, legal domain | 89.64 | 83.83
English to French, Pharma industry | 83.63 | 67.09
German to English, European Commission (Legal) | 63.12 | 58.10
English to Spanish, European Commission (Legal) | 80.22 | 69.14
