Tokenization is splitting text into individual units (tokens). It is the first step in any NLP pipeline.
Stopwords — common words with little meaning (the, is, a…).
Stemming — reduces words to root form (running → run).
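The three preprocessing steps above can be sketched in plain Python; the stopword list and suffix-stripping stemmer below are toy stand-ins for what libraries like NLTK or spaCy provide:

```python
import re

# Toy stopword list; real lists (e.g. NLTK's) contain a few hundred words.
STOPWORDS = {"the", "is", "a", "an", "in", "of", "to"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Very crude rule-based stemming: strip a common suffix.
    Note it produces 'runn' for 'running'; a real stemmer (Porter) gives 'run'."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The cat is running in the garden")
filtered = remove_stopwords(tokens)   # ['cat', 'running', 'garden']
stems = [stem(t) for t in filtered]   # ['cat', 'runn', 'garden']
print(stems)
```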
Lexicon-based: Count positive/negative words from a dictionary.
ML-based: Train a classifier on labeled data (positive/negative reviews).
Transformer-based: models such as BERT and RoBERTa capture context and can often detect sarcasm.
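A minimal sketch of the lexicon-based approach; the word lists here are made up for illustration, whereas real systems use curated lexicons such as VADER or SentiWordNet:

```python
# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def lexicon_sentiment(text):
    """Score = (#positive words - #negative words); the sign gives the label."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this great phone"))   # positive
print(lexicon_sentiment("terrible battery awful ui")) # negative
```

Counting raw words like this misses negation ("not good") and sarcasm, which is exactly what the ML- and transformer-based approaches above address.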
TF (Term Frequency): How often a word appears in a document.
IDF (Inverse Document Frequency): How rare the word is across all documents.
TF-IDF = TF × IDF — gives an importance score to each word.
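The formula above can be computed by hand on a tiny corpus. This uses the plain log(N/df) form of IDF; libraries such as scikit-learn apply smoothing, so their numbers differ slightly:

```python
import math

# Tiny corpus to make TF and IDF concrete.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of term / total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    n_containing = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in 2 of 3 docs -> low IDF; "cat" in 1 of 3 -> higher IDF,
# so "cat" gets the higher importance score in the first document.
print(tf_idf("cat", docs[0], docs))
print(tf_idf("the", docs[0], docs))
```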
1. User sends a message
2. Bot checks for keyword patterns
3. Matches intent → returns response
4. If no match → fallback response
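The four steps above can be sketched as a keyword-pattern bot; the intents and responses are invented for illustration:

```python
# Each intent maps to (keyword list, canned response); all values are made up.
INTENTS = {
    "greeting": (["hello", "hi", "hey"], "Hello! How can I help you?"),
    "hours":    (["hours", "open", "close"], "We are open 9am-5pm, Mon-Fri."),
    "thanks":   (["thanks", "thank"], "You're welcome!"),
}
FALLBACK = "Sorry, I didn't understand that. Could you rephrase?"

def respond(message):
    """Steps 1-4: take a message, scan for keywords, match an intent or fall back."""
    words = set(message.lower().split())
    for intent, (keywords, response) in INTENTS.items():
        if words & set(keywords):   # any keyword present -> this intent matches
            return response
    return FALLBACK

print(respond("hi there"))            # greeting response
print(respond("what are your hours")) # hours response
print(respond("qwerty"))              # fallback response
```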
NLP (Natural Language Processing) is the branch of AI that enables computers to understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding.
Applications: Search engines, chatbots, translation, spam detection, voice assistants.
Breaking text into individual words or sub-words (tokens). The fundamental first step of any NLP pipeline.
Removing common words that carry little meaning: the, is, a, an, in, of, to. Reduces noise and often improves model performance.
Stemming: Crude rule-based — "running" → "run".
Lemmatization: Dictionary-based, context-aware — "better" → "good".
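The contrast can be shown with stdlib-only stand-ins: a crude suffix stemmer versus a tiny hand-built lemma dictionary (a real lemmatizer, e.g. NLTK's WordNetLemmatizer, uses a full dictionary plus part-of-speech tags):

```python
def crude_stem(word):
    """Rule-based: blindly strip a common suffix, even when it mangles the word."""
    for suffix in ("ing", "er", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Tiny hand-built lemma dictionary standing in for a real one.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse", "running": "run"}

def lemmatize(word):
    """Dictionary-based: look the word up; fall back to the word itself."""
    return LEMMAS.get(word, word)

# Stemming mangles "better" -> "bett"; lemmatization knows it means "good".
for w in ("running", "better", "mice"):
    print(f"{w}: stem={crude_stem(w)!r}, lemma={lemmatize(w)!r}")
```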
Represent text as a vector of word counts. Ignores order and grammar.
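A minimal Bag-of-Words sketch; note that reordering a document's words leaves its vector unchanged, which is exactly the "ignores order" limitation:

```python
from collections import Counter

# Each document becomes a count vector over a shared, fixed vocabulary.
docs = ["the cat sat", "the cat ate the fish"]
tokenized = [d.split() for d in docs]

# Sorted vocabulary so every vector has the same dimension order.
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(doc):
    counts = Counter(doc)
    return [counts[w] for w in vocab]

print(vocab)                 # ['ate', 'cat', 'fish', 'sat', 'the']
for doc in tokenized:
    print(bow_vector(doc))
```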
Term Frequency × Inverse Document Frequency. Gives higher weight to rare, important words. Better than BoW for most tasks.
Words represented as dense numeric vectors. Similar words are close in vector space. Word2Vec, GloVe, FastText are popular models.
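Closeness in vector space is usually measured with cosine similarity; the 3-dimensional vectors below are made-up toy values (real Word2Vec or GloVe vectors have hundreds of dimensions and are learned from large corpora):

```python
import math

# Toy "embeddings"; the numbers are invented for illustration.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Similar words are close: "king" is nearer to "queen" than to "apple".
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["apple"]))
```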
Identifying and classifying entities in text: Person, Organization, Location, Date.
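A toy rule-based sketch of NER using small made-up gazetteers and a date regex; statistical systems such as spaCy's models learn these patterns from annotated data instead of hand-written lists:

```python
import re

# Made-up gazetteers standing in for learned entity knowledge.
PERSONS = {"Alice", "Bob"}
ORGS = {"Google", "OpenAI"}
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # ISO dates only

def extract_entities(text):
    """Return (span, label) pairs for PERSON, ORG, and DATE entities."""
    entities = []
    for word in re.findall(r"[A-Za-z0-9-]+", text):
        if word in PERSONS:
            entities.append((word, "PERSON"))
        elif word in ORGS:
            entities.append((word, "ORG"))
    entities += [(m, "DATE") for m in DATE_RE.findall(text)]
    return entities

print(extract_entities("Alice joined Google on 2018-06-01"))
```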
Determining the emotional tone of text: positive, negative, or neutral. Used in customer feedback, social media monitoring.
Modern deep learning models that understand context using self-attention. BERT (2018) revolutionized NLP. ChatGPT uses GPT architecture (decoder-only transformer).
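The core self-attention operation can be sketched in a few lines; for clarity this skips the learned query/key/value projections (Q = K = V = the raw inputs), which a real transformer layer would include:

```python
import math

def softmax(xs):
    """Numerically stable softmax: turns scores into weights summing to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Scaled dot-product self-attention over a list of token vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # Similarity of this token's query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Output = attention-weighted mix of all value vectors, so each
        # token's new representation depends on its whole context.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy token embeddings
for row in attention(tokens):
    print([round(x, 3) for x in row])
```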