gentle, a personal history

2021.11.24 / rmo


gentle automatically lines up an audio file and a transcript, by the phoneme and to the centisecond. this is an associative history of gentle in five parts:

  1. origins—writing gentle; speculations concerning my motivations; initial progress
  2. release—packaging gentle; description and comparison; leaving california
  3. return—back to california; hints of gentle in the world; a fateful prototype
  4. incorporation—a startup; an office; a friend; some challenges
  5. assimilation—gentle is everywhere; some media theory; a transitional object
room full of kaldi lattices
wall of kaldi source code and papers


june, 2015— was it days, weeks, or months that i confined myself to that room? i surrounded myself with code, with every scrap of documentation and relevant research paper that i could find.

bret later told me that he feared for my health, feared that i might never emerge.

to this day i wonder if the challenge was primarily my inexperience with c++, my ignorance of the domain, or the vagueness of my intentions.

but despite slow progress, i kept returning, or just never leaving, passing out instead on a couch in the library and dreaming my way into the 200,000+ line-of-code kaldi project, guided at crucial moments by an excellent private repo from max.

why did i do this?

by the end of june, my modest "kaldi tools" directory could extract and interpret the internal data structure of a speech recognition system:

word+phoneme lattice visualization (unmute for full effect). source material from correspondence with dave.


december 31, 2015— six months (and plenty of digressions) later, max cleaned up the shell scripts and i packaged gentle into a self-demonstrating project site, a web demo, liberally licensed source code, and a mac dmg.

gentle diagram
diagram by chris

the name "gentle" is a playful inversion of the term forced aligner. an aligner, in this context, finds timecodes for tokens (phonemes or words) in a transcript with matching audio. you provide an audio file and a transcript, separately, and the aligner tries to precisely line up the two media.

as the name implies, a forced aligner imposes a rigid correspondence between every token of the input text and every utterance of the audio.

based on a misunderstanding from a chat with steve, i circumvented this rigidity in gentle by redefining the problem away from force and towards an imperfect resilience, thinking about alignment as normal speech recognition with a "nudge" towards the provided transcript:

  1. creation of a "miniature language" that only contains bigrams (pairs of words) found in the transcript;
  2. forward-pass speech recognition with the constructed language, which will get close to the input transcript;
  3. sequence alignment between the input transcript and the speech recognition output, accounting for misses; and
  4. creation of even smaller "languages" to re-run the process on all of the missed sub-regions.
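a minimal sketch of steps 1, 3, and 4 in python, offered as an illustration rather than gentle's actual code. step 2 (the recognition pass) is not implemented: the `hypothesis` list is a hand-written stand-in for recognizer output, and the `<unk>` token, the function names, and the timings are all hypothetical. the sequence alignment here uses python's `difflib.SequenceMatcher`, which is an assumption of this sketch and may differ from whatever gentle does internally.

```python
from difflib import SequenceMatcher


def bigrams(words):
    """step 1: the adjacent word pairs that a 'miniature language' would allow."""
    return set(zip(words, words[1:]))


def align(transcript, hypothesis):
    """step 3: match transcript words against recognized (word, start, end) tuples.

    returns one (word, start, end) entry per transcript word; words the
    recognizer missed keep start and end set to None.
    """
    hyp_words = [w for w, _, _ in hypothesis]
    timings = [(w, None, None) for w in transcript]
    matcher = SequenceMatcher(a=transcript, b=hyp_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = hypothesis[block.b + k]
            timings[block.a + k] = (transcript[block.a + k], start, end)
    return timings


def missed_regions(timings, audio_duration):
    """step 4: group consecutive missed words with the time window they must fall in."""
    regions = []
    i = 0
    while i < len(timings):
        if timings[i][1] is not None:
            i += 1
            continue
        j = i
        while j < len(timings) and timings[j][1] is None:
            j += 1
        # the window runs from the end of the previous aligned word
        # to the start of the next aligned word (or the edges of the audio).
        window_start = timings[i - 1][2] if i > 0 else 0.0
        window_end = timings[j][1] if j < len(timings) else audio_duration
        regions.append(([w for w, _, _ in timings[i:j]], window_start, window_end))
        i = j
    return regions


transcript = "the quick brown fox jumps over the lazy dog".split()

# hypothetical recognizer output (step 2 stand-in): two words came back as "<unk>".
hypothesis = [
    ("the", 0.10, 0.25), ("quick", 0.25, 0.60), ("<unk>", 0.60, 1.10),
    ("fox", 1.10, 1.45), ("jumps", 1.45, 1.90), ("over", 1.90, 2.20),
    ("the", 2.20, 2.35), ("<unk>", 2.35, 2.90), ("dog", 2.90, 3.30),
]

allowed_pairs = bigrams(transcript)
timings = align(transcript, hypothesis)
regions = missed_regions(timings, audio_duration=3.30)

for words, start, end in regions:
    print(words, start, end)
```

in this toy run, "brown" and "lazy" are the missed words, and each comes back with the window between its aligned neighbors. in a real pipeline, each such window would be the audio slice handed to a second, smaller recognition pass built from just those words.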

forced aligners were mostly used to create datasets for speech synthesis or recognition, a fraught process that was rarely undertaken.

with gentle, alignment was suddenly accessible to anyone with basic scripting abilities.

the night of gentle's release, i hosted the world's most awkward new year's party, took off the next morning for mexico, moved to berlin, and followed projects to shenzhen, warsaw, cairo, spoleto, ramallah, helsinki, pittsburgh, são paulo...


october, 2017— ...and, eventually, back to california.

i was low on narrative and cash.

i was called on to apply for a mysterious role at apple, where i was told: "you seem creative, i don't know if apple is a good place for you."

at google brain the room received me kindly but evaded questions about interface, which they regard as low-status.

during this time, i started running into gentle unexpectedly.

and my old friend prabhas had been using gentle to help nepali immigrants gain confidence in english. he convinced me to hole up with him in his garage for three days to make the video editor that i claimed was an obvious consequence of gentle, that i claimed adobe would soon be releasing. (they still haven't.)

i filmed him describing why he knew it would be important to his grad school community, and cut down the hour-long ramble using our new tool (then called distill, now reduct):

prabhas on videos in user research (unmute for full effect)

prabhas told me we'd made a great product in those three days.

i told him to start selling it; by the next week, we had our first customer.


november, 2019— we raised a seed funding round, so there was no more denying that prabhas and i had a startup on our hands. (i had been telling myself that, perhaps, it was a simple business. or an art project. or a dream.)

i left berlin. as a parting gift, sam designed the interiors for our office, which we had taken over from some fellow travelers.

office interior
"the pentagon," inside reduct's sf mission headquarters.

and when danyq (my co-conspirator on many projects and adventures) decided to come out of retirement to help out, we started dreaming up a real infrastructure around gentle.

while we occasionally patch something in our internal fork of gentle, basically we don't touch it: it works.

but even though the premise (api) of gentle is fairly simple, the systems-level consequences of following what gentle wants are more involved. many software engineering patterns and frameworks break down:

...and that's not even getting started on the design and conceptual issues with a new sort of tool, but that's another story.


november, 2021— it no longer surprises me when i hear that someone has found a new way to use gentle: it has been absorbed into many projects and endeavors, far beyond even my knowledge.

ui screenshot of drift3

it seems to me that gentle is a small but essential tool underneath a remarkable transformation taking place in the world.

if words can "stick" to media so tightly, then media can become as malleable as words. (for better and for worse.)

orality and literacy may soon cease to be a meaningful distinction.

with gentle, the written word and its utterance are united.

but if it's taken me six years to wonder about the significance of my own tool, which i still see as existing primarily due to a combination of my ignorance of the field and my build system perseverance, i am hardly the person to say where this will go next.

in truth, i can't imagine gentle, as such, will be relevant for much longer: it's too bound to english, and automatic transcripts are getting so good that aligning an external transcript may become less important.

still, i have the impression that this modest project of mine, made amidst an ambitious lab, underlies a few billion dollars of economic activity, some dozens of research papers, a few academic conferences, and media art both refined and vulgar.

and, used as it is to train the next generation of speech recognition and synthesis engines, that it may warrant study as a transitional object.

many of the characters mentioned briefly contributed to my life and thoughts far more profoundly than i have expressed in this piece. thank you.

this telling benefited immensely from conversations, correspondences, and review with many people including andy, chris, cristóbal, danyq, katherine, mayo, and noor.

want to learn more? get in touch: email; twitter.