Why Your FFmpeg Drawtext Words Don't Line Up: A PHP Developer's Guide to Font Metrics That Lie

Context

In our last post, we fought FFmpeg's filter graph parser over single-quote escaping in our video subtitle system. That battle was about parsing. This one is about rendering.

Since then, our subtitles evolved from per-sentence overlays to per-word overlays — 257 individual drawtext filters per video, each with its own animated entry, opacity dimming, and precisely computed X/Y coordinates. Per-sentence subtitles could use x=(w-tw)/2 and let FFmpeg center the text. Per-word subtitles can't. We need to measure every word's pixel width in PHP, lay them out into lines, and tell FFmpeg exactly where each one goes.

The measurement tool: PHP's imagettfbbox(). The target renderer: FFmpeg's drawtext. Both use FreeType under the hood. How far off could they be?

35%.

Problem 1: Words That Float

The first render looked like ransom-note typography. Some words sat higher than their neighbors. "a" floated above "the". "on" floated above "big".

The Cause

FFmpeg's drawtext positions text by the top of each word's individual bounding box, not by a shared baseline. At y=100:

text=the: the bounding box top (including ascenders t, h) sits at y=100
text=a: the bounding box top (no ascenders, ~20px shorter) sits at y=100

Same Y value, different baselines. Words without ascenders appear to float.

The Obvious Fix (That Didn't Work)

Measure each word's ascent with imagettfbbox, compare to a reference ascent, push short words down by the difference:

$fontAscent = $this->measureAscent('Hh', $fontPath, $fontSize);  // 48px
$wordAscent = $this->measureAscent('a', $fontPath, $fontSize);   // 28px
$yOffset = $fontAscent - $wordAscent;                            // 20px

Where measureAscent uses the upper-corner Y values from imagettfbbox:

public function measureAscent(string $text, string $fontPath, int $fontSize): int
{
    $bbox = imagettfbbox((float) $fontSize, 0.0, $fontPath, $text);
    return (int) abs(min($bbox[5], $bbox[7]));
}

We applied the 20px offset. Too much — "a" now sat below the baseline of "the".

The Calibration Factor

We rendered a fine-grained offset grid: the word "a" at offsets +12 through +18, each next to "the" at offset 0. The correct value was +16px, not +20px.

16/20 = 0.8.

GD and FFmpeg use different FreeType hinting configurations. Their tight bounding box ascents diverge by ~20%. The fix:

$yOffset = (int) round(max(0, $fontAscent - $wordAscent) * 0.8);

Is 0.8 a magic number? Yes. Does it produce pixel-perfect baselines across 257 words? Also yes.
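
As a minimal sketch, the offset calculation can be pulled into a pure function so the 0.8 factor lives in one place. The name baselineOffset and the $ascentScale parameter are hypothetical, introduced here for illustration; the 48px and 28px inputs are the GD measurements quoted above.

```php
<?php

/**
 * Hypothetical helper: how far to push a word down so its baseline matches
 * a reference word when drawtext aligns both by bounding-box top.
 */
function baselineOffset(int $referenceAscent, int $wordAscent, float $ascentScale = 0.8): int
{
    // A word at least as tall as the reference is never pushed up.
    $difference = max(0, $referenceAscent - $wordAscent);

    // GD overstates the ascent difference relative to FFmpeg, so scale it down.
    return (int) round($difference * $ascentScale);
}

// "Hh" measured 48px and "a" measured 28px in GD.
$yOffset = baselineOffset(48, 28);
echo $yOffset, PHP_EOL;
```

Keeping the scale as a parameter makes it easy to re-run the calibration for a different font without touching the layout code.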

Problem 2: Words That Drift Apart

With baselines fixed, the horizontal spacing was wrong. Longer words had proportionally larger gaps after them — like someone cranked the word spacing to 150%.

The Rabbit Hole: Advance Width

Our first theory: imagettfbbox returns the visual bounding box width (tight around visible pixels), but text renderers advance the cursor by the advance width, which also includes the glyph's side bearings (the invisible padding beside the visible shape). We were using the wrong measurement.

We built a way to extract the advance width from GD using a doubling method:

public function measureAdvanceWidth(string $text, string $fontPath, int $fontSize): int
{
    $singleBbox = imagettfbbox((float) $fontSize, 0.0, $fontPath, $text);
    $doubleBbox = imagettfbbox((float) $fontSize, 0.0, $fontPath, $text . $text);

    return (int) max($doubleBbox[2], $doubleBbox[4])
         - (int) max($singleBbox[2], $singleBbox[4]);
}

The logic: in "braverybravery", the second "b" starts at the advance width of the first word (give or take any kerning between the "y" and the "b"). Subtract the single-string right edge from the double-string right edge and you get the advance width that imagettfbbox doesn't directly expose.

The advance widths were perfectly consistent within GD. The sum of per-word advance widths matched a single-string measurement with zero error. We committed the fix, ran a render, and... the spacing looked unchanged.

The Actual Problem

We'd been comparing GD measurements against GD measurements. We never checked whether GD's numbers matched what FFmpeg actually rendered. Time for a cross-engine comparison.

We wrote an FFmpeg filter that rendered individual words at our GD-computed positions (yellow) next to the same sentence as a single drawtext string (cyan). The result was immediately damning:

Yellow (GD positions):   and    bravery      to    make     a    new     friend.
Cyan (FFmpeg native):    and bravery to make a new friend.

The GD-positioned line was about 35% wider. Not a subtle drift — a completely different scale.
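
A comparison filter along these lines can be sketched as follows. This is not our production code: buildComparisonFilter, its parameters, and the Y coordinates are hypothetical, and the sketch assumes the words and font path contain no characters that need drawtext escaping.

```php
<?php

/**
 * Hypothetical helper: one yellow drawtext per word at precomputed X
 * positions, plus the whole sentence as a single cyan drawtext below it.
 *
 * @param string[] $words     plain words, no quotes, colons or backslashes
 * @param int[]    $positions X coordinate for each word, same order as $words
 */
function buildComparisonFilter(string $fontPath, array $words, array $positions, int $fontSize = 64): string
{
    $base = "drawtext=fontfile={$fontPath}:fontsize={$fontSize}";
    $filters = [];

    // Row 1: each word placed individually at its computed X position.
    foreach (array_values($words) as $i => $word) {
        $filters[] = "{$base}:fontcolor=yellow:text='{$word}':x={$positions[$i]}:y=20";
    }

    // Row 2: FFmpeg lays out the full sentence itself, starting at the same X.
    $sentence = implode(' ', $words);
    $filters[] = "{$base}:fontcolor=cyan:text='{$sentence}':x={$positions[0]}:y=120";

    // Comma-separated filters form a simple linear chain.
    return implode(',', $filters);
}

$filter = buildComparisonFilter('font.ttf', ['and', 'bravery', 'to'], [20, 182, 512]);
echo $filter, PHP_EOL;
```

If the two rows end at the same X, the measurements agree with the renderer. If the yellow row runs long, the measurements are inflated.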

Measuring FFmpeg's Actual Widths

To quantify it, we rendered each word individually with FFmpeg to a PNG, loaded the output in GD, and scanned for the rightmost non-black pixel:

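// Render one word in white on a black 600x100 canvas at x=0, y=0.
// $word and $fontPath are interpolated unescaped, so this only works for
// plain words and paths without drawtext special characters (: ' \ %).
$process = new Process([
    'ffmpeg', '-y',
    '-f', 'lavfi', '-i', 'color=c=black:s=600x100:d=1',
    '-vf', "drawtext=fontfile=$fontPath:text=$word:fontsize=64:fontcolor=white:x=0:y=0",
    '-frames:v', '1', $outputPng
]);
$process->run();

// Scan columns from right to left. The first column containing a
// non-black pixel marks the right edge of the rendered word.
$ffmpegWidth = 0; // stays 0 if nothing was rendered
$img = imagecreatefrompng($outputPng);
for ($x = imagesx($img) - 1; $x >= 0; $x--) {
    for ($y = 0; $y < imagesy($img); $y++) {
        // Mask off the alpha bits and compare RGB only.
        if ((imagecolorat($img, $x, $y) & 0xFFFFFF) > 0) {
            $ffmpegWidth = $x + 1;
            break 2;
        }
    }
}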

The ratios were remarkably consistent:

Word	GD Width	FFmpeg Width	Ratio (FFmpeg / GD)
"and"	141px	106px	0.752
"bravery"	309px	226px	0.731
"to"	81px	61px	0.753
"make"	204px	151px	0.740
"a"	45px	33px	0.733
"new"	158px	116px	0.734
"Allison"	266px	197px	0.741
"the"	126px	95px	0.754
space	21px	16px	0.762

Every measurement: 0.73–0.76. Average: 0.74.
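
The summary figures can be re-derived from the table with a few lines of PHP. The widths below are copied from the table; the variable names are only for illustration.

```php
<?php

// Measured widths in pixels, copied from the table: [GD, FFmpeg].
$measurements = [
    'and'     => [141, 106],
    'bravery' => [309, 226],
    'to'      => [81, 61],
    'make'    => [204, 151],
    'a'       => [45, 33],
    'new'     => [158, 116],
    'Allison' => [266, 197],
    'the'     => [126, 95],
    ' '       => [21, 16],
];

// Ratio of FFmpeg width to GD width for each entry.
$ratios = array_map(
    fn(array $m): float => $m[1] / $m[0],
    $measurements
);

$average = array_sum($ratios) / count($ratios);

printf("min %.3f, max %.3f, average %.3f\n", min($ratios), max($ratios), $average);
```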

Why 0.74?

GD's imagettfbbox takes its size in points and rasterizes at 96 DPI. FFmpeg's drawtext (likely) treats fontsize as a pixel height, which is what FreeType produces at 72 DPI. The ratio 72/96 = 0.75 is close to our measured 0.74, with the small deviation probably coming from hinting and rounding differences.

The nominal font size is the same. The font file is the same. The FreeType library is the same. But the effective DPI is different, and that scales every pixel measurement proportionally.
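
The arithmetic behind that ratio is the standard points-to-pixels conversion (one point is 1/72 of an inch). A small sketch, where pointsToPixels is a hypothetical helper and the two DPI values are the assumptions stated above rather than values read from either library:

```php
<?php

/** Convert a size in points to pixels at a given DPI (1 point = 1/72 inch). */
function pointsToPixels(float $points, float $dpi): float
{
    return $points * $dpi / 72.0;
}

$gdPixels     = pointsToPixels(64, 96); // assumed GD resolution
$ffmpegPixels = pointsToPixels(64, 72); // assumed effective FFmpeg resolution

// The font size cancels out, so the ratio is the same for every size.
$ratio = $ffmpegPixels / $gdPixels;

printf("GD %.2fpx, FFmpeg %.2fpx, ratio %.2f\n", $gdPixels, $ffmpegPixels, $ratio);
```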

The Fix

Apply a configurable scaling factor to all GD width measurements:

$widthScale = (float) config('video.text_width_scale', 0.74);

$spaceWidth = (int) round($rawSpaceWidth * $widthScale);

foreach ($words as $word) {
    $widths[$word]   = (int) round($this->measureText($word, $fontPath, $fontSize) * $widthScale);
    $advances[$word] = (int) round($this->measureAdvanceWidth($word, $fontPath, $fontSize) * $widthScale);
}

After the fix, individual word positions matched FFmpeg's native single-string rendering almost exactly.
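
Once the widths are in FFmpeg's pixel space, placing a line is plain cursor arithmetic. The sketch below is a simplified illustration, not our layout code: layoutLine is a hypothetical name, it handles a single centered line, and it uses the same width type for every word. The sample inputs reuse the FFmpeg widths from the table as stand-ins for scaled advances.

```php
<?php

/**
 * Hypothetical helper: center one line of words in the frame and return
 * the X coordinate for each word.
 *
 * @param int[] $advances   scaled width per word, in FFmpeg pixels
 * @param int   $spaceWidth scaled space width, in FFmpeg pixels
 * @param int   $frameWidth video frame width in pixels
 * @return int[]            X coordinate per word, in input order
 */
function layoutLine(array $advances, int $spaceWidth, int $frameWidth): array
{
    $count = count($advances);
    if ($count === 0) {
        return [];
    }

    // Total line width: all words plus one space between each pair.
    $lineWidth = array_sum($advances) + $spaceWidth * ($count - 1);

    // Start so that the line is centered, like x=(w-tw)/2 for a single string.
    $x = intdiv($frameWidth - $lineWidth, 2);

    $positions = [];
    foreach ($advances as $advance) {
        $positions[] = $x;
        $x += $advance + $spaceWidth; // move the cursor past the word and a space
    }

    return $positions;
}

// "and", "bravery", "to" with a 16px space in a 1920px-wide frame.
$positions = layoutLine([106, 226, 61], 16, 1920);
print_r($positions);
```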

The Red Herring Postmortem

The advance width theory was internally correct — visual bounding box width does differ from advance width by 1-3px per word, and using visual widths does cause cumulative drift in GD's coordinate space. But that 11px-over-7-words problem was invisible next to the 35% DPI scaling error.

We spent hours perfecting GD-to-GD measurements before doing the one test that mattered: rendering in the target engine and comparing.

Lesson: When you're computing positions in one engine for use in another, validate against the target engine first. Internal consistency in the source engine tells you nothing about cross-engine accuracy.

Quick Reference

What	GD Says	FFmpeg Does	Fix
Word widths	~96 DPI	~72 DPI	Multiply by 0.74
Space widths	Inflated for `" "` alone	Natural kerning	Measure via `bbox("n n") - bbox("nn")`, then scale by 0.74
Ascent difference	Per GD hinting	Per FFmpeg hinting	Multiply difference by 0.8
Visual vs advance width	`max(bbox[2], bbox[4])` vs doubling	Native advance	Use advance widths for cursor positioning, visual for last word on line
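
The space-width row can be sketched as below. Both function names are hypothetical. measureSpaceWidth takes the measuring function as a parameter so the subtraction can be checked without a font file. The stub widths in the usage lines are invented purely to exercise the arithmetic, chosen so the raw difference equals the 21px space from the table.

```php
<?php

/** Right edge of GD's bounding box for a string, in unscaled GD pixels. */
function gdRightEdge(string $text, string $fontPath, float $fontSize): int
{
    $bbox = imagettfbbox($fontSize, 0.0, $fontPath, $text);
    if ($bbox === false) {
        throw new RuntimeException("Could not measure text with font: {$fontPath}");
    }
    return (int) max($bbox[2], $bbox[4]);
}

/**
 * Hypothetical helper: width of a space as it appears between two glyphs,
 * measured as width("n n") - width("nn"), then scaled to FFmpeg pixels.
 *
 * @param callable $measure function(string): int returning a GD width
 */
function measureSpaceWidth(callable $measure, float $widthScale = 0.74): int
{
    $raw = $measure('n n') - $measure('nn');
    return (int) round($raw * $widthScale);
}

// Real usage would wrap GD, for example:
//   $measure = fn(string $t): int => gdRightEdge($t, $fontPath, 64.0);

// Stub with invented widths, only to demonstrate the calculation.
$stub = fn(string $t): int => ['n n' => 95, 'nn' => 74][$t];
$spaceWidth = measureSpaceWidth($stub);
echo $spaceWidth, PHP_EOL;
```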

The Code

Three measurement functions and one scaling step. The raw GD functions return unscaled values — the caller applies WIDTH_SCALE to everything horizontal:

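// Empirical GD-to-FFmpeg horizontal scale (roughly 72/96).
private const WIDTH_SCALE = 0.74;

// Visual width: right edge of GD's bounding box, in unscaled GD pixels.
// Falls back to a rough per-character estimate if the font can't be measured.
public function measureText(string $text, string $fontPath, int $fontSize): int
{
    $bbox = @imagettfbbox((float) $fontSize, 0.0, $fontPath, $text);
    return $bbox !== false ? (int) max($bbox[2], $bbox[4]) : (int) (strlen($text) * $fontSize * 0.55);
}

// Advance width via the doubling method: right edge of the doubled string
// minus right edge of the single string, in unscaled GD pixels.
public function measureAdvanceWidth(string $text, string $fontPath, int $fontSize): int
{
    $single = @imagettfbbox((float) $fontSize, 0.0, $fontPath, $text);
    $double = @imagettfbbox((float) $fontSize, 0.0, $fontPath, $text . $text);
    if ($single !== false && $double !== false) {
        return (int) max($double[2], $double[4]) - (int) max($single[2], $single[4]);
    }
    return (int) (strlen($text) * $fontSize * 0.55);
}

// Ascent: distance from the baseline up to the top of GD's bounding box.
// The upper corners have negative Y above the baseline, hence abs(min(...)).
public function measureAscent(string $text, string $fontPath, int $fontSize): int
{
    $bbox = @imagettfbbox((float) $fontSize, 0.0, $fontPath, $text);
    return $bbox !== false ? (int) abs(min($bbox[5], $bbox[7])) : (int) ($fontSize * 0.8);
}

// Usage: scale horizontal measurements by WIDTH_SCALE, ascent differences by 0.8.
$ffmpegWidth = (int) round($this->measureText($word, $fontPath, $fontSize) * self::WIDTH_SCALE);
$yOffset = (int) round(max(0, $fontAscent - $wordAscent) * 0.8);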

TL;DR

imagettfbbox uses 96 DPI. FFmpeg drawtext uses ~72 DPI. All GD widths are 35% too large. Scale by 0.74.
drawtext positions text by per-word bounding box top, not a shared baseline. Scale GD ascent differences by 0.8.
Validate against FFmpeg early. GD-to-GD consistency means nothing if the target engine disagrees.
Advance width matters (use the doubling method), but it's a 1-3px refinement — don't mistake it for the main problem.
