LLM Foundations: Get started with tokenization

Beginner Guided Project

Tokenization is a preprocessing technique in natural language processing (NLP) that converts text to structured data so a computer can understand human language. It breaks down unstructured text data into smaller units called tokens. A single token can range from a single character or individual word to much larger textual units.

4.5 (43 Reviews)

Language

  • English

Topic

  • Text Analytics

Enrollment Count

  • 201

Skills You Will Learn

  • Artificial Intelligence, Generative AI, LLM, NLP, Python

Offered By

  • IBMSkillsNetwork

Estimated Effort

  • 20 min

Platform

  • SkillsNetwork

Last Update

  • June 5, 2025

About this Guided Project

Tokenization is a stage in text-mining pipelines that converts raw text data into a structured format for machine processing. Because other preprocessing techniques depend on it, it is usually one of the first steps in an NLP pipeline. In this project, you'll learn how to tokenize raw text data for use in machine learning models and NLP tasks. You'll use the Python Natural Language Toolkit (NLTK) to convert .txt files to tokens at different levels of granularity, using an open-access text file sourced largely from Project Gutenberg.
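
To make "different levels of granularity" concrete, here is a minimal sketch that splits one string into character, word, and sentence tokens. It is an illustration only: it uses Python's standard `re` module rather than NLTK, and the `tokenize` helper and the sample string are hypothetical names and data invented for this example, not part of the project materials. The regular expressions are deliberately naive and will mishandle cases such as abbreviations.

```python
import re


def tokenize(text, level="word"):
    """Split text into tokens at the requested granularity (illustrative helper)."""
    if level == "char":
        # Every non-whitespace character becomes its own token.
        return [ch for ch in text if not ch.isspace()]
    if level == "word":
        # A run of word characters (optionally with internal apostrophes),
        # or a single punctuation mark.
        return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)
    if level == "sentence":
        # Naive rule: split on whitespace that follows '.', '!' or '?'.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    raise ValueError(f"unknown level: {level!r}")


# Made-up sample text, assumed only for this sketch.
sample = "Tokenization splits text. It's a first step!"

words = tokenize(sample, "word")
sentences = tokenize(sample, "sentence")
chars = tokenize(sample, "char")

print(words)
print(sentences)
print(len(chars), "character tokens")
```

In the project itself, NLTK functions such as `word_tokenize` and `sent_tokenize` play this role and cope with many cases these simple patterns miss.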

This project is based on the IBM Developer tutorial "Tokenizing text in Python" by Jacob Murel, Ph.D.

A Look at the Project Ahead

  1. Introduction to tokenization concepts in text processing.
  2. Exploring different methods and libraries for tokenizing text in Python.
  3. Practical examples and exercises to apply tokenization techniques.

What You'll Need

A basic knowledge of Python and a browser.

Instructors

Sina Nazeri

Data Scientist at IBM

I am grateful to have had the opportunity to work as a Research Associate, Ph.D., and IBM Data Scientist. Through my work, I have gained experience in unraveling complex data structures to extract insights and provide valuable guidance.

Kang Wang

Data Scientist

I was a Data Scientist at IBM. I also hold a PhD from the University of Waterloo.
