| At the
Linguistic Data Consortium, I direct efforts to create software in support
of linguistic corpora annotation and production. Our language data are used by our partners at tech companies and research universities as
training data for machine learning systems that support technologies such as automatic
speech recognition, machine translation, language and speaker identification, character and handwriting recognition, and many other language technology applications.
I am the lead engineer on the
MADCAT project, an Arabic handwriting recognition project. I oversee all
word alignment corpora at LDC: primarily Arabic-English and
Chinese-English data in support of the BOLT and GALE programs. I also
work on the DEFT project (Deep Exploration and Filtering of Text), where we build sample data for named entity recognition and information extraction projects and evaluations. I periodically contribute to the TAC-KBP (Text Analysis Conference-Knowledge Base Population) project. Finally, I oversee delivery of all data from LDC and coordinate a team that validates our data to ensure they are error-free.
|