NLP in low resource settings, Ann Irvine (JHU)
Many state-of-the-art approaches to NLP tasks assume that a large amount of data of a particular type is available to supervise the learning of statistical models. Examples include sentiment annotations for sentiment analysis, parallel sentences for machine translation, and transcriptions of spoken language for speech recognition. However, for many languages and domains, such data is not available in large quantities, and we must adapt standard approaches accordingly. In this breakout discussion, we'll share ideas about how to approach NLP in low resource settings. Possible points for discussion include: (1) unsupervised and semi-supervised learning, (2) alternative types of data, (3) crowdsourcing, and (4) active learning.
Ann Irvine is a final-year PhD student in the Center for Language and Speech Processing and the Computer Science Department at Johns Hopkins University. Her advisor is Chris Callison-Burch, and she also works closely with Alex Klementiev and David Yarowsky. She is a Graduate Fellow at the Human Language Technology Center of Excellence. Her research interests include machine translation, particularly for low resource languages and domains.
Dynamic programming is a crucial practical tool for implementing NLP systems, but unfortunately it can be very hard to get correct. It is all too common for students to start a project with vague notions of chart variables and pseudo-code for CKY, and end up with unsalvageable code. In this breakout session, we give an opinionated overview of alternative representations of dynamic programming as a logical system, as a hypergraph, and as linear optimization. We then discuss frameworks for implementing DP in practice and what possibilities may exist in the near future. We end with a collaborative discussion of techniques researchers use in practice and what pragmatic problems should be addressed going forward.
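To make the "chart variables for CKY" point concrete, here is a minimal CKY recognizer for a grammar in Chomsky Normal Form. The toy grammar, sentence, and function names are illustrative assumptions, not material from the session itself:

```python
from collections import defaultdict

# Toy CNF grammar (illustrative, not from the talk):
# binary rules map a pair of child nonterminals to parent nonterminals,
# lexical rules map a word to its preterminals.
binary = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "saw": {"V"},
}

def cky_recognize(words):
    n = len(words)
    # chart[(i, j)] is the set of nonterminals spanning words[i:j]
    chart = defaultdict(set)
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(lexical.get(w, ()))
    for span in range(2, n + 1):          # widths, smallest first
        for i in range(n - span + 1):     # left boundary
            j = i + span                  # right boundary
            for k in range(i + 1, j):     # split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((b, c), set())
    return "S" in chart[(0, n)]

print(cky_recognize("the dog saw the cat".split()))  # True
print(cky_recognize("dog the saw".split()))          # False
```

The same chart-filling loop extends naturally to the weighted case (Viterbi inside scores), and the hypergraph view discussed in the session corresponds to treating each rule application over a span as a hyperedge.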
Alexander Rush is a Ph.D. candidate in Computer Science at the Massachusetts Institute of Technology studying with Prof. Michael Collins, and is currently a visiting scholar at Columbia University in New York. He received his A.B. in Computer Science from Harvard University in 2007. Before starting graduate study, he worked as lead engineer on the Platform/API team at Facebook. His research interest is in formally sound but empirically fast inference methods for natural language processing, with a focus on models for syntactic parsing, translation, and speech. Last spring, he was the lead TA for NLP on Coursera, Columbia's first MOOC, with over 30,000 registered students. He has received best paper awards from EMNLP 2010 and NAACL 2012, the latter for work completed as an intern at Google Research.
Nathaniel Wesley Filardo is a fourth-year Ph.D. student at the Johns Hopkins Center for Language and Speech Processing. He did his undergraduate work at Carnegie Mellon University, earning a degree in Physics and another in Computer Science. He is a graduate fellow of the Human Language Technology Center of Excellence (HLTCOE).
One of the great technical challenges in big data is to construct computer systems that learn continuously over years, from a continuing stream of diverse data, improving their competence at a variety of tasks, and becoming better learners over time. This discussion will describe Carnegie Mellon University's research to build a Never-Ending Language Learner (NELL) that runs 24 hours per day, forever, learning to read the web. Each day NELL extracts (reads) more facts from the web, and integrates these into its growing knowledge base of beliefs. Each day NELL also learns to read better than yesterday, enabling it to go back to the text it read yesterday, and extract more facts, more accurately, today. NELL has been running 24 hours/day for over three years now. The result so far is a collection of 50 million interconnected beliefs (e.g., servedWith(coffee, applePie), isA(applePie, bakedGood)), that NELL is considering at different levels of confidence, along with hundreds of thousands of learned phrasings, morphological features, and web page structures that NELL has learned to use to extract beliefs from the web. Track NELL's progress at http://rtw.ml.cmu.edu.
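The belief format above (relation triples held at varying confidence, revised as reading improves) can be sketched as a toy knowledge-base store. The class, the noisy-or evidence combination, and the promotion threshold are illustrative assumptions for this sketch, not NELL's actual architecture:

```python
# Toy sketch of a NELL-style belief store: each belief is a relation
# triple held at a confidence in [0, 1] that rises as more independent
# extractions support it. Names and thresholds are assumptions.
class BeliefStore:
    def __init__(self, promote_at=0.9):
        self.promote_at = promote_at
        self.confidence = {}  # (relation, arg1, arg2) -> confidence

    def observe(self, relation, arg1, arg2, evidence_weight):
        """Fold in one extraction, treating votes as independent (noisy-or)."""
        key = (relation, arg1, arg2)
        prior = self.confidence.get(key, 0.0)
        self.confidence[key] = 1.0 - (1.0 - prior) * (1.0 - evidence_weight)

    def promoted(self):
        """Beliefs confident enough to be treated as facts."""
        return {k for k, c in self.confidence.items() if c >= self.promote_at}

kb = BeliefStore()
kb.observe("servedWith", "coffee", "applePie", 0.6)
kb.observe("servedWith", "coffee", "applePie", 0.8)  # re-reading raises confidence
kb.observe("isA", "applePie", "bakedGood", 0.5)      # single weak extraction
print(kb.promoted())  # {('servedWith', 'coffee', 'applePie')}
```

The second extraction pushes the first belief to 1 - 0.4 × 0.2 = 0.92, past the promotion threshold, while the weakly supported belief remains a candidate at lower confidence, mirroring the "different levels of confidence" the abstract describes.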
Partha Pratim Talukdar is a Postdoctoral Fellow in the Machine Learning Department at Carnegie Mellon University, working with Tom Mitchell on the Never Ending Language Learning (NELL) project. Partha received his PhD (2010) in CIS from the University of Pennsylvania, working under the supervision of Fernando Pereira, Zack Ives, and Mark Liberman. Partha is broadly interested in Machine Learning, Natural Language Processing, Data Integration, and Cognitive Science. His dissertation introduced novel graph-based weakly-supervised methods for Information Extraction and Integration. His past industrial research affiliations include HP Labs, Google Research, and Microsoft Research. He is currently co-authoring a book on graph-based semi-supervised learning.