Tokenization is splitting text into individual units (tokens). It is the first step in any NLP pipeline.
Stopwords — common words with little meaning (the, is, a…).
Stemming — reduces words to root form (running → run).
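The three preprocessing steps above can be sketched in plain Python; the stopword list and suffix-stripping stemmer below are toy stand-ins for what libraries like NLTK or spaCy provide:

```python
import re

# Toy stopword list; real lists (e.g. NLTK's) contain a few hundred words.
STOPWORDS = {"the", "is", "a", "an", "in", "of", "to"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Very crude rule-based stemming: strip a common suffix.
    Note it produces 'runn' for 'running'; a real stemmer (Porter) gives 'run'."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The cat is running in the garden")
filtered = remove_stopwords(tokens)   # ['cat', 'running', 'garden']
stems = [stem(t) for t in filtered]   # ['cat', 'runn', 'garden']
print(stems)
```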
Lexicon-based: Count positive/negative words from a dictionary.
ML-based: Train a classifier on labeled data (positive/negative reviews).
Transformer-based: models such as BERT and RoBERTa capture context and can often detect sarcasm.
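A minimal sketch of the lexicon-based approach; the word lists here are made up for illustration, whereas real systems use curated lexicons such as VADER or SentiWordNet:

```python
# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def lexicon_sentiment(text):
    """Score = (#positive words - #negative words); the sign gives the label."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this great phone"))   # positive
print(lexicon_sentiment("terrible battery awful ui")) # negative
```

Counting raw words like this misses negation ("not good") and sarcasm, which is exactly what the ML- and transformer-based approaches above address.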
TF (Term Frequency): How often a word appears in a document.
IDF (Inverse Document Frequency): How rare the word is across all documents.
TF-IDF = TF × IDF — gives an importance score to each word.
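The formula above can be computed by hand on a tiny corpus. This uses the plain log(N/df) form of IDF; libraries such as scikit-learn apply smoothing, so their numbers differ slightly:

```python
import math

# Tiny corpus to make TF and IDF concrete.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of term / total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    n_containing = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in 2 of 3 docs -> low IDF; "cat" in 1 of 3 -> higher IDF,
# so "cat" gets the higher importance score in the first document.
print(tf_idf("cat", docs[0], docs))
print(tf_idf("the", docs[0], docs))
```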
1. User sends a message
2. Bot checks for keyword patterns
3. Matches intent → returns response
4. If no match → fallback response
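The four steps above can be sketched as a keyword-pattern bot; the intents and responses are invented for illustration:

```python
# Each intent maps to (keyword list, canned response); all values are made up.
INTENTS = {
    "greeting": (["hello", "hi", "hey"], "Hello! How can I help you?"),
    "hours":    (["hours", "open", "close"], "We are open 9am-5pm, Mon-Fri."),
    "thanks":   (["thanks", "thank"], "You're welcome!"),
}
FALLBACK = "Sorry, I didn't understand that. Could you rephrase?"

def respond(message):
    """Steps 1-4: take a message, scan for keywords, match an intent or fall back."""
    words = set(message.lower().split())
    for intent, (keywords, response) in INTENTS.items():
        if words & set(keywords):   # any keyword present -> this intent matches
            return response
    return FALLBACK

print(respond("hi there"))            # greeting response
print(respond("what are your hours")) # hours response
print(respond("qwerty"))              # fallback response
```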
NLP (Natural Language Processing) is the branch of AI that enables computers to understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding.
Applications: Search engines, chatbots, translation, spam detection, voice assistants.
Breaking text into individual words or sub-words (tokens). The fundamental first step of any NLP pipeline.
Removing common words that carry little meaning: the, is, a, an, in, of, to. Reduces noise and often improves model performance.
Stemming: Crude rule-based — "running" → "run".
Lemmatization: Dictionary-based, context-aware — "better" → "good".
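The contrast can be shown with stdlib-only stand-ins: a crude suffix stemmer versus a tiny hand-built lemma dictionary (a real lemmatizer, e.g. NLTK's WordNetLemmatizer, uses a full dictionary plus part-of-speech tags):

```python
def crude_stem(word):
    """Rule-based: blindly strip a common suffix, even when it mangles the word."""
    for suffix in ("ing", "er", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Tiny hand-built lemma dictionary standing in for a real one.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse", "running": "run"}

def lemmatize(word):
    """Dictionary-based: look the word up; fall back to the word itself."""
    return LEMMAS.get(word, word)

# Stemming mangles "better" -> "bett"; lemmatization knows it means "good".
for w in ("running", "better", "mice"):
    print(f"{w}: stem={crude_stem(w)!r}, lemma={lemmatize(w)!r}")
```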
Represent text as a vector of word counts. Ignores order and grammar.
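A minimal Bag-of-Words sketch; note that reordering a document's words leaves its vector unchanged, which is exactly the "ignores order" limitation:

```python
from collections import Counter

# Each document becomes a count vector over a shared, fixed vocabulary.
docs = ["the cat sat", "the cat ate the fish"]
tokenized = [d.split() for d in docs]

# Sorted vocabulary so every vector has the same dimension order.
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(doc):
    counts = Counter(doc)
    return [counts[w] for w in vocab]

print(vocab)                 # ['ate', 'cat', 'fish', 'sat', 'the']
for doc in tokenized:
    print(bow_vector(doc))
```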
Term Frequency × Inverse Document Frequency. Gives higher weight to rare, important words. Better than BoW for most tasks.
Words represented as dense numeric vectors. Similar words are close in vector space. Word2Vec, GloVe, FastText are popular models.
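Closeness in vector space is usually measured with cosine similarity; the 3-dimensional vectors below are made-up toy values (real Word2Vec or GloVe vectors have hundreds of dimensions and are learned from large corpora):

```python
import math

# Toy "embeddings"; the numbers are invented for illustration.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Similar words are close: "king" is nearer to "queen" than to "apple".
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["apple"]))
```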
Identifying and classifying entities in text: Person, Organization, Location, Date.
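A toy rule-based sketch of NER using small made-up gazetteers and a date regex; statistical systems such as spaCy's models learn these patterns from annotated data instead of hand-written lists:

```python
import re

# Made-up gazetteers standing in for learned entity knowledge.
PERSONS = {"Alice", "Bob"}
ORGS = {"Google", "OpenAI"}
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # ISO dates only

def extract_entities(text):
    """Return (span, label) pairs for PERSON, ORG, and DATE entities."""
    entities = []
    for word in re.findall(r"[A-Za-z0-9-]+", text):
        if word in PERSONS:
            entities.append((word, "PERSON"))
        elif word in ORGS:
            entities.append((word, "ORG"))
    entities += [(m, "DATE") for m in DATE_RE.findall(text)]
    return entities

print(extract_entities("Alice joined Google on 2018-06-01"))
```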
Determining the emotional tone of text: positive, negative, or neutral. Used in customer feedback, social media monitoring.
Modern deep learning models that understand context using self-attention. BERT (2018) revolutionized NLP. ChatGPT uses GPT architecture (decoder-only transformer).
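The core self-attention operation can be sketched in a few lines; for clarity this skips the learned query/key/value projections (Q = K = V = the raw inputs), which a real transformer layer would include:

```python
import math

def softmax(xs):
    """Numerically stable softmax: turns scores into weights summing to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Scaled dot-product self-attention over a list of token vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # Similarity of this token's query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Output = attention-weighted mix of all value vectors, so each
        # token's new representation depends on its whole context.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy token embeddings
for row in attention(tokens):
    print([round(x, 3) for x in row])
```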