gentle automatically lines up an audio file and a transcript, by the phoneme and to the centisecond. this is an associative history of gentle in five parts:
june, 2015— was it days, weeks, or months that i confined myself to that room? i surrounded myself with code, with every scrap of documentation and relevant research paper that i could find.
bret later told me that he feared for my health, feared that i might never emerge.
to this day i wonder if the challenge was primarily my inexperience with c++, my ignorance of the domain, or the vagueness of my intentions.
but despite slow progress, i kept returning, or just never leaving, passing out instead on a couch in the library and dreaming my way into the 200,000+ line-of-code kaldi project, guided at crucial moments by an excellent private repo from max.
why did i do this?
by the end of june, my modest "kaldi tools" directory could extract and interpret the internal data structure of a speech recognition system:
december 31, 2015— six months (and plenty of digressions) later, max cleaned up the shell scripts and i packaged gentle into a self-demonstrating project site, a web demo, liberally licensed source code, and a mac dmg.
the name "gentle" is a playful inversion of the term "forced aligner." an aligner, in this context, finds timecodes for tokens (phonemes or words) in a transcript with matching audio. you provide an audio file and a transcript, separately, and the aligner tries to precisely line up the two media.
as the name implies, a forced aligner imposes a rigid correspondence between every token of the input text and every utterance of the audio.
based on a misunderstanding from a chat with steve, i circumvented both issues in gentle by redefining the problem away from force and towards an imperfect resilience, thinking about alignment as normal speech recognition with a "nudge" towards the provided transcript:
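one way to picture that "nudge", as a minimal sketch and not gentle's actual code: assume some recognizer has already returned the words it heard, with timecodes. the hypothetical `resilient_align` function below diffs that list against the transcript using python's `difflib`, keeps timings wherever the two agree, and marks the remaining transcript words as unaligned instead of forcing them onto the audio. the function name, the tuple layout, and the `case` labels are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def resilient_align(transcript_words, recognized):
    """map recognizer timings onto transcript words without forcing a match.

    transcript_words: list of str, in reading order.
    recognized: list of (word, start_sec, end_sec) tuples from a
        hypothetical recognizer (an assumed input shape for this sketch).
    """
    def normalize(word):
        # compare words case-insensitively and without edge punctuation
        return word.lower().strip(".,!?;:\"'")

    wanted = [normalize(w) for w in transcript_words]
    heard = [normalize(w) for w, _, _ in recognized]

    # start by assuming nothing was heard; matches overwrite these entries
    result = [{"word": w, "case": "not-found-in-audio"} for w in transcript_words]

    # diff the two word sequences; each matching block is a run of words
    # that appears, in order, in both the transcript and the recognition
    matcher = SequenceMatcher(a=wanted, b=heard, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = recognized[block.b + k]
            result[block.a + k] = {
                "word": transcript_words[block.a + k],
                "case": "success",
                "start": start,
                "end": end,
            }
    return result


if __name__ == "__main__":
    transcript = "The quick brown fox.".split()
    heard = [("the", 0.10, 0.25), ("quick", 0.25, 0.60), ("fox", 0.95, 1.30)]
    for entry in resilient_align(transcript, heard):
        print(entry)
```

in this toy run the recognizer never heard "brown", so that word comes back without timings while the words around it still line up. a strictly forced aligner would instead have to place "brown" somewhere in the audio.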
with gentle, alignment was suddenly accessible to anyone with basic scripting abilities.
the night of gentle's release, i hosted the world's most awkward new year's party, took off the next morning for mexico, moved to berlin, and followed projects to shenzhen, warsaw, cairo, spoleto, ramallah, helsinki, pittsburgh, são paulo...
october, 2017— ...and, eventually, back to california.
i was low on narrative and cash.
i was called on to apply for a mysterious role at apple, where i was told: "you seem creative, i don't know if apple is a good place for you."
at google brain the room received me kindly but evaded questions about interface, which they regard as low-status.
during this time, i started running into gentle unexpectedly:
and my old friend prabhas had been using gentle to help nepali immigrants gain confidence in english. he convinced me to hole up with him in his garage for three days to make the video editor that i claimed was an obvious consequence of gentle, that i claimed adobe would soon be releasing. (they still haven't.)
prabhas told me we'd made a great product in those three days.
i told him to start selling it; by the next week, we had our first customer.
november, 2019— we raised a seed funding round, so there was no more denying that prabhas and i had a startup on our hands. (i had been telling myself that, perhaps, it was a simple business. or an art project. or a dream.)
while we occasionally patch something in our internal fork of gentle, basically we don't touch it: it works.
but even though the premise (api) of gentle is fairly simple, the systems-level consequences of following what gentle wants are more involved. many software engineering patterns and frameworks break down:
...and that's not even getting started on the design and conceptual issues with a new sort of tool, but that's another story.
november, 2021— it no longer surprises me when i hear that someone has found a new way to use gentle: it has been absorbed into many projects and endeavors, far beyond even my knowledge.
it seems to me that gentle is a small but essential tool underneath a remarkable transformation taking place in the world.
if words can "stick" to media so tightly, then media can become as malleable as words. (for better and for worse.)
orality and literacy may soon cease to be a meaningful distinction.
with gentle, the written word and its utterance are united.
but if it's taken me six years to wonder about the significance of my own tool, which i still see as existing primarily due to a combination of my ignorance of the field and my build system perseverance, i am hardly the person to say where this will go next.
in truth, i can't imagine gentle, as such, will be relevant for much longer: it's too bound to english, and automatic transcripts are getting so good that aligning an external transcript may become less important.
still, i have the impression that this modest project of mine, made amidst an ambitious lab, underlies a few billion dollars of economic activity, some dozens of research papers, a few academic conferences, and media art both refined and vulgar.
and, used as it is to train the next generation of speech recognition and synthesis engines, that it may warrant study as a transitional object.
many of the characters mentioned briefly contributed to my life and thoughts far more profoundly than i have expressed in this piece. thank you.