Telegram channel datasciencefun - Data Science & Machine Learning: Unsorted

26 October 2025 15:14

✅ NLP (Natural Language Processing) – Interview Questions & Answers 🤖🧠

1. What is NLP (Natural Language Processing)?
NLP is an AI field that helps computers understand, interpret, and generate human language. It blends linguistics, computer science, and machine learning to process text and speech, and it underpins tools ranging from chatbots to machine translation.

2. What are some common applications of NLP?
⦁ Sentiment Analysis (e.g., customer reviews)
⦁ Chatbots & Virtual Assistants (like Siri or GPT)
⦁ Machine Translation (Google Translate)
⦁ Speech Recognition (voice-to-text)
⦁ Text Summarization (article condensing)
⦁ Named Entity Recognition (extracting names, places)
These drive real-world impact, with the NLP market reportedly growing around 35% yearly.

3. What is Tokenization in NLP?
Tokenization breaks text into smaller units like words or subwords for processing.
Example: "NLP is fun!" → ["NLP", "is", "fun", "!"]
It's crucial for models but must handle edge cases such as contractions and out-of-vocabulary (OOV) words. Subword methods like Byte Pair Encoding (BPE) deal with OOV words by splitting rare words into smaller known pieces.
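
A minimal sketch of word-level tokenization, assuming a simple regex rule (the `tokenize` helper is hypothetical, not a library function). It reproduces the example above and shows why contractions are an edge case:

```python
import re

def tokenize(text):
    # \w+ grabs runs of letters/digits; [^\w\s] grabs one punctuation mark at a time
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("NLP is fun!"))  # ['NLP', 'is', 'fun', '!']
print(tokenize("don't"))        # ['don', "'", 't']
```

The second call splits the contraction into three pieces, which is the kind of case real tokenizers handle with extra rules or subword vocabularies.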

4. What are Stopwords?
Stopwords are common words like "the," "is," or "in" that carry little meaning and get removed during preprocessing to focus on key terms. Tools like NLTK's English stopwords list help, reducing noise for better model efficiency.
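
A rough illustration of stopword filtering. The `STOPWORDS` set here is a tiny hand-picked assumption for the demo, not NLTK's actual list:

```python
# Small illustrative stopword set (assumption; real lists are much longer)
STOPWORDS = {"the", "is", "in", "a", "an", "of", "and", "to"}

def remove_stopwords(tokens):
    # Compare in lowercase so "The" and "the" are treated the same
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "is", "in", "the", "garden"]))  # ['cat', 'garden']
```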

5. What is Lemmatization? How is it different from Stemming?
Lemmatization reduces words to their dictionary base form using vocabulary and part-of-speech information (e.g., "running" → "run," "better" → "good").
Stemming strips suffixes by rule without checking the result against a dictionary, so it often produces non-words (e.g., the Porter stemmer maps "studies" → "studi"). Lemmatization is more accurate but slower, so use it when quality matters more than speed.
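
To make the contrast concrete, here is a toy sketch. Both helpers are hypothetical: `naive_stem` is a deliberately crude suffix stripper (much simpler than the Porter stemmer), and `toy_lemmatize` uses a three-entry lookup table in place of a real dictionary such as WordNet:

```python
def naive_stem(word):
    # Chop the first matching suffix; no check that the result is a real word
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny stand-in for a lemma dictionary (assumption for illustration)
LEMMAS = {"running": "run", "studies": "study", "better": "good"}

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself if unknown
    return LEMMAS.get(word, word)

for w in ("running", "studies", "better"):
    print(w, "| stem:", naive_stem(w), "| lemma:", toy_lemmatize(w))
```

The stripper returns non-words like "runn" and "stud" and leaves "better" untouched, while the lookup returns valid base forms.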

6. What is Bag of Words (BoW)?
BoW represents text as a vector of word frequencies, ignoring order and grammar.
Example: "Dog bites man" and "Man bites dog" yield identical vectors, even though they mean different things. It's simple but loses word order and context, so it suits basic classification more than sequence tasks.
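
A quick sketch of the example using word counts (lowercasing and whitespace splitting are simplifying assumptions):

```python
from collections import Counter

def bag_of_words(text):
    # Count each lowercase word; order is discarded
    return Counter(text.lower().split())

a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")
print(a == b)  # True
```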

7. What is TF-IDF?
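TF-IDF (Term Frequency-Inverse Document Frequency) scores how important a word is to a document: TF rises with how often the term appears in that document, while IDF shrinks the weight of terms that appear in many documents. Formula: TF × IDF, where IDF is commonly log(N / df) for N documents, df of which contain the term. It usually works better than raw BoW counts for search and retrieval because it highlights distinctive terms.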
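
A small worked sketch, assuming the plain variant TF = count / document length and IDF = log(N / df). Libraries such as scikit-learn use smoothed variants, so their numbers will differ. The three-document corpus is made up for illustration:

```python
import math
from collections import Counter

# Toy corpus (assumption for illustration)
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf_idf(term, doc, corpus):
    tf = Counter(doc)[term] / len(doc)            # share of the document taken by the term
    df = sum(1 for d in corpus if term in d)      # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(tf_idf("the", docs[0], docs))  # 0.0, since "the" appears in every document
print(tf_idf("mat", docs[0], docs) > tf_idf("cat", docs[0], docs))  # True
```

"mat" outranks "cat" in the first document because it appears in only one document, while "cat" appears in two.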

8. What is Named Entity Recognition (NER)?
NER detects and categorizes entities in text like persons, organizations, or locations.
Example: "Apple founded by Steve Jobs in California" → Apple (ORG), Steve Jobs (PERSON), California (LOC). Libraries like spaCy and fine-tuned BERT models are common choices, and NER is a core step in tasks like information extraction.
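
As a bare-bones illustration of the input and output shape, here is a gazetteer (lookup table) tagger. This is a swapped-in baseline, not how spaCy or BERT work, and the `GAZETTEER` entries are assumptions chosen to match the example:

```python
# Hypothetical lookup table of known entities
GAZETTEER = {"Apple": "ORG", "Steve Jobs": "PERSON", "California": "LOC"}

def tag_entities(text):
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)  # first occurrence only, for simplicity
        if start != -1:
            found.append((start, name, label))
    # Return entities in the order they appear in the text
    return [(name, label) for _, name, label in sorted(found)]

print(tag_entities("Apple founded by Steve Jobs in California"))
# [('Apple', 'ORG'), ('Steve Jobs', 'PERSON'), ('California', 'LOC')]
```

A lookup like this cannot tell Apple the company from apple the fruit, which is why trained models that use context are preferred.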

9. What are word embeddings?
Word embeddings map words to dense vectors where similar meanings are close (e.g., "king" - "man" + "woman" ≈ "queen"). Popular ones: Word2Vec (predicts context), GloVe (global co-occurrences), FastText (handles subwords for OOV). They capture semantics better than one-hot encoding.
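
The analogy can be sketched with hand-made 3-dimensional vectors. These numbers are invented for the demo (real embeddings are learned and have hundreds of dimensions), and `analogy` is a hypothetical helper:

```python
import math

# Toy vectors; dimensions loosely mean (royalty, maleness, femaleness)
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    # Compute vector(a) - vector(b) + vector(c), then find the nearest other word
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```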

10. What is the Transformer architecture in NLP?
Transformers use self-attention to process all positions of a sequence in parallel, unlike RNNs, which step through tokens one at a time. Key components: stacked encoder and/or decoder layers, multi-head self-attention, and positional encoding. They power BERT (encoder-only, bidirectional) and GPT (decoder-only, generative) models, and brought faster training and state-of-the-art results across NLP tasks.
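
A stripped-down sketch of scaled dot-product self-attention in plain Python. As a simplifying assumption, queries, keys, and values are all the raw input vectors (no learned projection matrices, a single head, no positional encoding), so this shows only the core mechanism:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    d = len(X[0])
    out = []
    for q in X:  # each position attends to every position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Output is the attention-weighted average of all value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
Y = self_attention(X)
for row in Y:
    print([round(v, 3) for v in row])
```

Each output row mixes information from all three tokens, and no step depends on the result for a previous position, which is what allows parallel processing.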
