Text Analytics at Scale

Beginner Course

A continuation of Text Analytics-Getting Results with SystemT, this course shows how early Information Extraction systems, built on a standard formalism of cascading grammars, suffer from fundamental limitations in both expressivity and runtime performance. The class explains in detail how the declarative principles of the SystemT Information Extraction system address these limitations, resulting in extractors that are scalable, accurate, and easy to maintain and extend to new domains.

Language

  • English

Topic

  • Text Analytics

Enrollment Count

  • 103

Skills You Will Learn

  • Spark, Big Data, Text Analytics

Offered By

  • BDU

Estimated Effort

  • 6 hours

Platform

  • SkillsNetwork

Last Update

  • October 28, 2025
About this Course

In this course you will learn about text analytics at scale: how do modern text analytics systems handle big data? We'll first walk through the history and show why earlier formalisms for text analytics languages suffer from fundamental limitations in both expressivity and runtime performance, leading to scalability, accuracy and usability issues. We will then dive into the declarative principles behind the SystemT Information Extraction system and its automatic Optimizer, and show how the system addresses all these limitations.

Course Syllabus

  • Module 1 - Limitations in previous approaches to rule-based IE
  • Module 2 - Declarative IE and the SystemT optimizer

Recommended skills prior to taking this course

  • None

Grading scheme

  • The minimum passing mark for the course is 60%, where the review questions are worth 40% and the final exam is worth 60% of the course mark.
  • You have one attempt at the exam, with multiple attempts allowed per question.

Requirements

None.

Course Staff

Yunyao Li

Yunyao Li joined IBM Almaden Research Center in July 2007 after obtaining her Ph.D. in Computer Science & Engineering from the University of Michigan in April 2007. Before that, she was a graduate student in the Database Research Group, Department of Electrical Engineering and Computer Science, under the guidance of Professor H. V. Jagadish. Her primary research areas are Database Systems and Natural Language Processing. She is particularly interested in designing, developing and analyzing large-scale systems that can improve the accessibility of information for a wide spectrum of users. Her current research in this direction involves a number of disciplines, most notably natural language processing, databases, human-computer interaction, information retrieval, and machine learning. Before starting her Ph.D. studies, she obtained dual degrees at the University of Michigan: an M.S.E. in Computer Science & Engineering and an M.S. in Information, from Computer Science and Engineering and the School of Information, respectively. She attended Tsinghua University in Beijing, China, and graduated with dual degrees: a B.E. in Automation and a B.S. in Economics. Follow her on Twitter @yunyao_li.

Laura Chiticariu

Laura Chiticariu is a Research Staff Member in the Scalable Natural Language Processing group at IBM Research-Almaden. Her primary research is in Database Systems and Natural Language Processing. She is one of the core members of SystemT, a declarative system for specifying NLP algorithms and executing them at scale. Her current research focuses on making information extraction systems transparent and easier to use, utilizing a range of techniques including data provenance, information integration and machine learning. Laura has a Ph.D. in Computer Science from the University of California, Santa Cruz, and a B.S. in Computer Engineering with a major in Automation and Industrial Informatics from Politehnica University of Bucharest.

Marina Danilevsky

Marina Danilevsky is a Research Staff Member of the SystemT group at IBM Almaden Research Center in San Jose, California. She is interested in data mining, text mining, natural language processing, network ontologies, information networks, and other related areas. She holds a Ph.D. in Computer Science, awarded in 2014 from the University of Illinois at Urbana-Champaign (UIUC). Her research was in the area of Data Mining, supervised by Professor Jiawei Han. She previously received an M.S. in Computer Science from UIUC in 2011, and a B.S. in Mathematics from the University of Chicago in 2007.

Huaiyu Zhu

Huaiyu Zhu is a member of the Infrastructure for Intelligent Information Systems group at IBM Research-Almaden. His main research focus is on text analytics, natural language processing, machine learning and statistical information processing.