This is a very simple and naive introduction summarizing what I have learned about natural language processing through self-study.
What is Natural Language Processing?
Natural Language Processing (NLP) is an important subfield of Artificial Intelligence that enables computers to understand and process human languages; it tries to get computers closer to a human-level understanding of language.
Some research topics in NLP
- Information Retrieval/Extraction/Filtering
- Machine Translation
- Document/Topic Classification/Summarization
- Question Answering
- Text Mining
- Sentiment Analysis
- Speech Recognition
- Machine Writing/Content Generation
Statistical Language Models
A statistical language model computes the probability of a sentence or sequence of words.
N-Gram
The n-gram model is a popular statistical language model; it estimates the probability of each word from only the previous n-1 words.
After building a model, we usually evaluate it with cross-entropy and perplexity. Lower perplexities correspond to higher likelihoods, so lower scores are better on this metric.
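As a quick illustration (this toy helper and its inputs are made up for this note, not taken from any toolkit), perplexity can be computed from per-token log probabilities like this:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of per-token natural-log probabilities)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Example: a 4-token sentence where the model assigns each token probability 0.1
print(perplexity([math.log(0.1)] * 4))  # 10.0, i.e. 1/0.1, as expected
```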
A major concern in language modeling is avoiding the situation p(w) = 0, which can arise from a single unseen n-gram. The solution is to use smoothing; some common smoothing methods include (see the sketch after this list):
- Add-One (Laplace) smoothing
- Good-Turing smoothing
- Kneser-Ney smoothing
- Witten-Bell smoothing
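Below is a minimal sketch of a bigram model with Add-One (Laplace) smoothing; the toy corpus and the helper name `p_add_one` are made up for illustration:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # toy corpus

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
V = len(unigrams)  # vocabulary size

def p_add_one(prev, word):
    """Add-One (Laplace) smoothed bigram probability P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_add_one("the", "cat"))  # seen bigram:   (1 + 1) / (2 + 4) ~ 0.33
print(p_add_one("cat", "dog"))  # unseen bigram: (0 + 1) / (1 + 4) = 0.2, no longer zero
```

The point of the smoothed estimate is the second print: an n-gram never seen in training still gets a small non-zero probability, so p(w) = 0 is avoided.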
Bag of Words
A sentence/document is represented by the counts of the distinct terms that occur within it. Additional information, such as word order, POS tags, semantics, and syntax, is discarded.
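A minimal bag-of-words sketch using Python's Counter (the example sentence is made up):

```python
from collections import Counter

doc = "the cat sat on the mat"
bow = Counter(doc.lower().split())
print(bow)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```

Only term counts survive; the original word order cannot be recovered from `bow`.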
Probabilistic Graphical Models
These are important mathematical theories/algorithms used in NLP tasks; a toy HMM example follows the list below.
- Bayesian Network
- Markov Network
- Conditional Random Fields
- Hidden Markov Models
- Expectation Maximization
- Maximum Entropy
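As a toy illustration of one item in this list, here is Viterbi decoding for a tiny two-state HMM tagger; the states and probabilities are set by hand purely for this sketch:

```python
# Hand-set toy HMM: two tags, two words, probabilities chosen only for illustration.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1}, "VERB": {"dogs": 0.1, "bark": 0.6}}

def viterbi(words):
    # V[t][s] = (probability of the best path ending in state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), None) for s in states}]
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            prob = V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s].get(words[t], 1e-6)
            V[t][s] = (prob, prev)
    # Trace back the most probable tag sequence.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']
```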
Topic Model
- Latent Dirichlet Allocation (LDA): Based on probabilistic graphical models
- LSA: Latent Semantic Analysis. Uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra (see the sketch after this list)
- NMF: Non-Negative Matrix Factorization – Based on Linear Algebra
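A minimal scikit-learn sketch of LDA and LSA (assumes scikit-learn is installed; the corpus and parameters are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

docs = ["cats and dogs are pets",
        "dogs chase cats",
        "stocks and bonds are investments"]

# Document-term matrix (bag-of-words counts)
X = CountVectorizer().fit_transform(docs)

# LDA: probabilistic topic model
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)       # per-document topic distributions

# LSA: truncated SVD of the document-term matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsa.fit_transform(X)     # per-document coordinates in "concept" space

print(doc_topics.shape, doc_concepts.shape)  # (3, 2) (3, 2)
```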
Some popular tasks in NLP
These are tasks that may not solve any particular NLP problem by themselves, but are performed as prerequisites that simplify many different NLP problems. They are much like the reading comprehension exercises we did in school.
Parts of Speech Tagging
Identify proper nouns, common nouns, verbs, adjectives, prepositions, etc.
Named Entity Recognition
Identify names of people, locations, etc.
Tokenization
Morphosyntactic Attributes
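A minimal spaCy sketch covering tokenization, POS tagging, NER, and morphosyntactic attributes (assumes spaCy v3 and the en_core_web_sm model are installed; the sentence is made up):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Berlin next year.")

for token in doc:
    # token.text: the token; token.pos_: coarse POS tag; token.morph: morphosyntactic attributes
    print(token.text, token.pos_, token.morph)

for ent in doc.ents:
    # Named entities, e.g. "Apple" -> ORG, "Berlin" -> GPE
    print(ent.text, ent.label_)
```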
Deep Learning in NLP
Word2Vec
Previously, there were other popular ways of representing words as vectors, such as TF-IDF.
But those vectors are sparse and long, which is not computationally efficient. Word2Vec instead produces a dense vector representation of each word (commonly 100-500 dimensions) and models the meaning of a word as an embedding.
But how do we get the dense vectors? Singular Value Decomposition (as in Latent Semantic Analysis) can be used, but a more successful way is a neural-network-inspired learning strategy (see the gensim sketch below):
- CBOW: Predict the center/target word based on the context words
- Skip-gram: Predict the context words based on the center/target word
Other vector-based models include fastText, Doc2Vec, GloVe, etc.
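A minimal gensim sketch of training Word2Vec (assumes gensim 4.x; the toy corpus is made up for illustration):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects the skip-gram objective; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)         # (100,) dense embedding
print(model.wv.most_similar("cat"))  # nearest neighbours in embedding space
```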
RNN
CNN
Stay tuned…
I highly recommend https://people.cs.umass.edu/~miyyer/cs585/ as a 101 course for NLP.
More advanced courses:
https://github.com/lovesoft5/ml/tree/master/NLP-%E5%93%A5%E4%BC%A6%E6%AF%94%E4%BA%9A%E5%A4%A7%E5%AD%A6