
NLP Sandbox


Text Tokenizer


Preprocessing Steps

What is Tokenization?

Tokenization is splitting text into individual units (tokens). It is the first step in any NLP pipeline.

# Python example
import nltk
nltk.download('punkt')  # tokenizer data (first run only)

text = "I love AI"
tokens = nltk.word_tokenize(text)
# Output: ['I', 'love', 'AI']

Stopwords — common words with little meaning (the, is, a…).
Stemming — reduces words to root form (running → run).
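These preprocessing steps can be sketched in plain Python. The stopword list and suffix-stripping stemmer below are toy stand-ins for NLTK's stopwords corpus and PorterStemmer, chosen so the example runs without any downloads:

```python
import re

# Toy stopword list -- a real one (nltk's) has ~180 English entries.
STOPWORDS = {"the", "is", "a", "an", "in", "of", "to", "i"}

def crude_stem(word):
    # Strip a few common suffixes; a real stemmer applies ordered rules.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]        # "runn" -> "run"
            return stem
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())        # 1. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # 2. drop stopwords
    return [crude_stem(t) for t in tokens]               # 3. stem

print(preprocess("The cat is running in the garden"))
# → ['cat', 'run', 'garden']
```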

Sentiment Analyzer


How Sentiment Analysis Works

Lexicon-based: Count positive/negative words from a dictionary.

ML-based: Train a classifier on labeled data (positive/negative reviews).

Transformer-based: Models such as BERT and RoBERTa use context, so they handle negation and sarcasm far better than word counting.
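The lexicon-based approach fits in a few lines. The word lists here are a tiny illustrative sample, not a real lexicon (VADER's, for instance, has thousands of scored entries):

```python
# Tiny illustrative lexicons -- a real lexicon is far larger and scored.
POSITIVE = {"love", "great", "good", "excellent", "happy"}
NEGATIVE = {"hate", "bad", "terrible", "awful", "sad"}

def sentiment(text):
    words = text.lower().split()
    # Score = (# positive words) - (# negative words)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("i love this great product"))  # → positive
print(sentiment("what a terrible awful day"))  # → negative
```

Note the weakness that motivates the ML and transformer approaches: "not good" counts as positive here, because word counting ignores context.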

Word Frequency Analysis


TF-IDF Concept

TF (Term Frequency): How often a word appears in a document.
IDF (Inverse Document Frequency): How rare the word is across all documents.
TF-IDF = TF × IDF — gives importance score to each word.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
# X is a matrix of TF-IDF scores

Chat with Your Bot


How Rule-based Chatbots Work

1. User sends a message
2. Bot checks for keyword patterns
3. Matches intent → returns response
4. If no match → fallback response

def get_response(user_input):
    inp = user_input.lower()
    if "hello" in inp:
        return "Hi there!"
    elif "ai" in inp:
        return "AI is exciting!"
    else:
        return "I don't understand."

1. Natural Language Processing (NLP)

NLP is the branch of AI that enables computers to understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding.

Applications: Search engines, chatbots, translation, spam detection, voice assistants.

2. Tokenization

Breaking text into individual words or sub-words (tokens). The fundamental first step of any NLP pipeline.

"Hello World" → ["Hello", "World"]

3. Stopword Removal

Removing common words that carry little meaning: the, is, a, an, in, of, to. Reduces noise and improves model performance.

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
clean = [w for w in tokens if w not in stop]

4. Stemming vs Lemmatization

Stemming: Crude rule-based — "running" → "run".
Lemmatization: Dictionary-based, context-aware — "better" → "good".

from nltk.stem import PorterStemmer

ps = PorterStemmer()
ps.stem("running")  # → "run"
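Lemmatization, by contrast, needs a dictionary lookup. A toy stand-in for NLTK's WordNetLemmatizer (the lookup table below is illustrative, not a real dictionary):

```python
# Toy lemma dictionary -- WordNetLemmatizer consults a real one (WordNet).
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse", "running": "run"}

def lemmatize(word):
    # Dictionary lookup; unknown words pass through unchanged.
    return LEMMAS.get(word, word)

print(lemmatize("better"))  # → good  (no suffix rule can produce this)
print(lemmatize("mice"))    # → mouse
```

This is why lemmatization handles irregular forms that stemming cannot: "better" → "good" is a dictionary fact, not a suffix rule.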

5. Bag of Words (BoW)

Represent text as a vector of word counts. Ignores order and grammar.

"I love AI" → {"I":1,"love":1,"AI":1} "AI is great" → {"AI":1,"is":1,"great":1}

6. TF-IDF

Term Frequency × Inverse Document Frequency. Gives higher weight to rare, important words. Better than BoW for most tasks.

TF = (count in doc) / (total words in doc)
IDF = log(total docs / docs with word)
TF-IDF = TF × IDF
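The same formulas computed by hand (note that sklearn's TfidfVectorizer uses a smoothed IDF, so its numbers differ slightly from this textbook version):

```python
import math

docs = [["i", "love", "ai"], ["ai", "is", "great"]]

def tf(word, doc):
    # Term frequency: count in doc / total words in doc
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency: log(total docs / docs containing word)
    containing = sum(word in doc for doc in docs)
    return math.log(len(docs) / containing)

# "love" appears in 1 of 2 docs -> nonzero weight
print(tf("love", docs[0]) * idf("love", docs))  # → log(2)/3 ≈ 0.231
# "ai" appears in every doc -> IDF is log(1) = 0, so weight is 0
print(tf("ai", docs[0]) * idf("ai", docs))      # → 0.0
```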

7. Word Embeddings

Words represented as dense numeric vectors. Similar words are close in vector space. Word2Vec, GloVe, FastText are popular models.

king - man + woman ≈ queen (semantic arithmetic with vectors!)
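The analogy can be demonstrated with hand-picked toy vectors. Real models learn 100–300 dimensional vectors from data; these 3-d values are chosen so the arithmetic works out exactly:

```python
# Hand-picked 3-d toy embeddings (dimensions could be read as roughly
# "person", "female", "royal") -- real embeddings are learned, not designed.
emb = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}

def nearest(target, exclude):
    # Closest stored vector to `target` (the query words are excluded,
    # as is standard practice in analogy evaluation).
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min((w for w in emb if w not in exclude), key=lambda w: dist(emb[w]))

# king - man + woman, computed element-wise
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```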

8. Named Entity Recognition (NER)

Identifying and classifying entities in text: Person, Organization, Location, Date.

"Elon Musk founded Tesla in California" → Person: Elon Musk → Org: Tesla → Location: California

9. Sentiment Analysis

Determining the emotional tone of text: positive, negative, or neutral. Used in customer feedback, social media monitoring.

10. Transformers & BERT

Modern deep learning models that understand context using self-attention. BERT (2018) revolutionized NLP; ChatGPT uses the GPT architecture (a decoder-only transformer).

Attention(Q, K, V) = softmax(QKᵀ/√dₖ)·V
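The formula maps directly to a few lines of NumPy. A sketch of single-head scaled dot-product attention, shown with toy 2×2 matrices:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (max is subtracted for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy inputs: 2 positions, d_k = 2
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])

out = attention(Q, K, V)
print(out)  # each output row is a weighted mix of V's rows
```

Each query attends most strongly to the key it aligns with, so each output row leans toward the matching row of V while still blending in the other.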