Stephen Grimes, Ph.D.


I am a Senior Software Developer at the Linguistic Data Consortium at the University of Pennsylvania in Philadelphia, working in natural language annotation for machine learning and language technology applications.



At the Linguistic Data Consortium, I direct efforts to create software in support of linguistic corpora annotation and production. Our language data are used by our partners at technology companies and research universities as training data for machine learning systems that support technologies such as automatic speech recognition, machine translation, language and speaker identification, character and handwriting recognition, and many other language technology applications.

I am the lead engineer on the TAC-KBP project, an information retrieval program sponsored by DARPA. I oversee all word alignment corpora at LDC: primarily Arabic-English and Chinese-English data in support of the BOLT and GALE programs. I also work on the DEFT project (Deep Exploration and Filtering of Text) where we build sample data for named entity recognition and information extraction projects and evaluations.


Experience

2014-present Linguistic Data Consortium, Senior Software Developer
2010-2014 Linguistic Data Consortium, Application Developer
2008-2010 Linguistic Data Consortium, Programmer Analyst
2005-2008 American Indian Studies Research Institute, Computational Linguistics Research Programmer
2005-2006 Nuance Communications, Pronunciation Lexicon Consultant
2004-2005 Library Electronic Text Resource Archive (LETRS), XML Developer
2001 Lernout & Hauspie and Dragon Systems (companies now under Scansoft Nuance), Language Model Engineer
1999-2006 Associate Instructor in Mathematics, Linguistics, and Statistics, Indiana University
1998 Center for Discrete Mathematics and Theoretical Computer Science at Rutgers University: Research Assistant in Computational Group Theory
1997 Research Experience for Undergraduates at Lafayette College: Research Assistant in Dynamic Systems and Symmetry

Education

Ph.D., Linguistics, Indiana University, 2010
M.A., Computational Linguistics, Indiana University, 2003
M.A., Mathematics, Indiana University, 2000
B.S., Mathematics, Computer Science minor, Bucknell University, 1999

Additional Training

The Wharton School, University of Pennsylvania, Executive Education division, 2010-2012
  • Certificate in Management/Business Essentials 2011
Linguistic Society of America, Summer Linguistic Institute University of Debrecen, Graduate Exchange Fellowship, 2004

Debrecen Summer School, summers 2002, 2004

North American Summer School in Logic, Language, and Information, Indiana University, Summer 2003

Budapest Semesters in Mathematics, Spring 1998


Papers, corpora, conference presentations, and other projects

2014 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2. With Xuansong Li, Safa Ismael, and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2014T25
GALE Arabic-English Word Alignment -- Broadcast Training Part 2. With Xuansong Li, Safa Ismael, and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2014T22
GALE Arabic-English Word Alignment -- Broadcast Training Part 1. With Xuansong Li, Safa Ismael, and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2014T19
GALE Arabic-English Word Alignment Training Part 3 -- Web. With Xuansong Li, Safa Ismael, Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2014T14
MADCAT Chinese Pilot Training Set. With Zhiyi Song, David Lee, Dave Doermann, Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2014T13
GALE Arabic-English Word Alignment Training Part 2 -- Newswire. With Xuansong Li, Safa Ismael, and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2014T10
GALE Arabic-English Parallel Aligned Treebank -- Web Training. With Xuansong Li, Safa Ismael, and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2014T08
GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web. With Xuansong Li, Safa Ismael, Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2014T05
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2. With Xuansong Li, Safa Ismael, Stephanie Strassel, Mohamed Maamouri, and Ann Bies. Linguistic Data Consortium catalog no. LDC2014T03
2013 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1. With Xuansong Li and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2013T23
MADCAT Phase 3 Training Set. With David Lee, Safa Ismael, Dave Doermann, Stephanie Strassel, Zhiyi Song. Linguistic Data Consortium catalog no. LDC2013T15
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1. With Xuansong Li, Safa Ismael, Stephanie Strassel, Mohamed Maamouri, Ann Bies. Linguistic Data Consortium catalog no. LDC2013T14
MADCAT Phase 2 Training Set. With David Lee, Safa Ismael, Dave Doermann, Stephanie Strassel, Zhiyi Song. Linguistic Data Consortium catalog no. LDC2013T09
GALE Arabic-English Parallel Aligned Treebank -- Newswire. With Xuansong Li, Safa Ismael, Dalal Zakhary, Stephanie Strassel, Mohamed Maamouri, Ann Bies. Linguistic Data Consortium catalog no. LDC2013T10
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web. With Xuansong Li and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2013T05
2012 GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web. With Xuansong Li and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2012T24
GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire. With Xuansong Li and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2012T20
Automatic Word Alignment Tools to Scale Production of Manually Aligned Parallel Texts. Stephen Grimes, Katherine Peterson, Xuansong Li. LREC 2012: 8th International Conference on Language Resources and Evaluation, Istanbul, May 21-27. (.pdf)
Parallel Aligned Treebanks at LDC: New Challenges Interfacing Existing Infrastructures. Xuansong Li, Stephanie Strassel, Stephen Grimes, Safa Ismael, Mohamed Maamouri, Ann Bies, Nianwen Xue. LREC 2012: 8th International Conference on Language Resources and Evaluation, Istanbul, May 21-27. (.pdf)
Parallel Aligned Treebanks at LDC: New Challenges Interfacing Existing Infrastructures. Xuansong Li, Stephanie Strassel, Stephen Grimes, Safa Ismael, Mohamed Maamouri, Ann Bies, Nianwen Xue. LREC 2012: 8th International Conference on Language Resources and Evaluation, Istanbul, May 21-27. (.pdf, poster)
MADCAT Phase 1 Training Set. With David Lee, Safa Ismael, Dave Doermann, Stephanie Strassel, and Zhiyi Song. Linguistic Data Consortium catalog no. LDC2012T15
GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web. With Xuansong Li and Stephanie Strassel. Linguistic Data Consortium catalog no. LDC2012T16
2011 Word Alignment for Improved Machine Translation. Xuansong Li, Xiaoyi Ma, Stephen Grimes, Stephanie Strassel, Gary Krug, and Dalal Zakhary. In Olive, J., Christianson, C., and McCary, J. (eds.) Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation.
2010 Parallel Aligned Treebank Corpora at LDC: Methodology, Annotation and Integration. Xuansong Li, Stephanie Strassel, Stephen Grimes, Safa Ismael, Xiaoyi Ma, Niyu Ge, Ann Bies, Nianwen Xue, Mohamed Maamouri. Workshop on Annotation and Exploitation of Parallel Corpora. TLT9 - The ninth international workshop on treebanks and linguistic theories. December 2, 2010, University of Tartu, Estonia. (.pdf)
Creating Arabic-English Parallel Word-Aligned Treebank Corpora at LDC. With Xuansong Li, Ann Bies, Seth Kulick, Xiaoyi Ma, and Stephanie Strassel. Proceedings of the Seventh International Language Resources and Evaluation Conference (LREC2010). (.pdf)
Enriching Word Alignment with Linguistic Tags. With Xuansong Li, Niyu Ge, Stephanie Strassel and Kazuaki Maeda. Proceedings of the Seventh International Language Resources and Evaluation Conference (LREC2010). (.pdf)
Technical Infrastructure at Linguistic Data Consortium: Software and Hardware Resources for Linguistic Data Creation. With Kazuaki Maeda, Haejoong Lee, Jonathan Wright, Robert Parker, David Lee, and Andrea Mazzucchi. Proceedings of the Seventh International Language Resources and Evaluation Conference (LREC2010). (.pdf)
2009 Reader for A Practical Hungarian Grammar (Gyakorlo Magyar Nyelvtan) by Szilvia Szita and Tamas Gorbe. Published by Akademiai Kiado, Budapest.
Quantitative studies in Hungarian phonotactics and syllable structure. Doctoral dissertation. (.pdf)
2008 Thorsten Trippel, Michael Maxwell, Greville Corbett, Cambell Prince, Christopher Manning, Stephen Grimes and Steve Moran. Lexicon schemas and related data models: when standards meet users. Proceedings of the Sixth International Language Resources and Evaluation (LREC'08). Marrakech, Morocco. (.pdf)
2007 On the creation of a pronunciation dictionary for Hungarian, Ph.D. Qualifying Exam paper (.pdf)
Word final consonant extrametricality in Hungarian, Ph.D. Qualifying Exam paper (.pdf)
2006 Hungarian pronunciation dictionary project
From a presentation given at The Hungarian Language: Past and Present at UCLA, May 2006. "On creating a pronunciation dictionary for Hungarian" (.ppt, .pdf, references in .rtf). Similar presentation also at Midwest Computational Linguistics Colloquium at Illinois, May 2006.
Translation of Tamás Forgács's "Postalveolar assimilation and its exceptions from the point of view of Hungarian language history" from the original article "Die Postalveolare Assimilation und ihre ausnahmen -- aus der sicht der ungarischen sprachgeschichte", which appeared in Acta Linguistica Hungarica in 2001.
Linguist List Review of Alchemist morphological annotation software (.html) (.pdf)
2005 McWOP11 conference presentation, Word minimality in Hungarian, November 4-6, 2005, University of Michigan
International Conference on the Structure of Hungarian (ICSH7), Moraic weight, extraprosodic word-final consonants, and morphophonological length alternations in Hungarian, May 29-31, 2005, Veszprém, Hungary
2004 Linguist List Review of Peter Mühlhausler's Language of Environment / Environment of Language (.html)
2003 The developments, uses, and functions of preverbal particles in Hungarian and other Uralic languages (.pdf)
Moraicity and morphophonological length alternations in Hungarian (.pdf)
Hungarian Epenthetic Vowels (.pdf)
2002 Morphological gemination and root augmentation in three Muskogean languages (.pdf)
NC Phonology in Modern Greek (.pdf)
Mora Augmentation in the Alabama Imperfective: an Optimality Theoretic Perspective (.pdf)
Review of Bernhardt & Stemberger's Handbook of phonological development from the perspective of constraint-based non-linear phonology (.pdf)
The use of the reflexive marker in Lusaamia (.pdf)
2001 Review of Statistics in Historical Linguistics by Sheila Embleton (.pdf)
1998 Nathan C. Carter, Richard L. Eagles, Stephen M. Grimes, Andrew C. Hahn, and Clifford A. Reiter, Chaotic Attractors with Discrete Planar Symmetry. Chaos, Solitons & Fractals, 9 12 (1998) 2031-2054; errata 10 7 (1999) 1261-1264.
Nathan C. Carter, Stephen M. Grimes, and Clifford A. Reiter, Frieze and Wallpaper Chaotic Attractors with a Polar Spin. Computers & Graphics, 22 6 (1998) 765-779.
1997 Nathan Carter, Richard Eagles, Stephen Grimes, Andrew Hahn, and Clifford A. Reiter, Chaos with Symmetry: Reflections on an Exhibition, CDROM Proceedings of the APL97 International Conference on APL, Toronto, Ontario, August 17-20, 1997.

Teaching at Indiana University

MATH M119 Brief Survey of Calculus 2005 - 2013
MATH M110 Excursions into Mathematics 2005 - 2006
PSY K300 Statistics Spring 2005
LING L103 Introduction to the Study of Language Fall 2003
MATH M025 Precalculus Fall 2001
COAS J111 College Algebra Fall 2000
MATH M118 Finite Mathematics Fall 1999

Project Supervision at University of Pennsylvania

Sandeep (Sunny) Singh: Consistency checking for parallel word aligned treebanks
John Mayer: Word alignment tool development using PyQt
Bohan Yang: Data processing for word alignment and parallel aligned treebanks
Mishal Awadah: Word alignment tool development
Pranshu Sharma: Automatic error detection of word alignment
Brian Gainor: Database development for Arabic handwriting recognition program (MADCAT)
Kate Peterson: Machine translation corpus creation, error analysis

Erdős number: 5
(also available in Hungarian)