NLTK
#classification #data #language #natural #naturel #nlp #notes #processing #sentence #text #token #tokenization #words
pip3 install nltk
import nltk
nltk.download('all')
Corpora -> a body of text (plural of corpus)
Lexicon -> words and their meanings
sent_tokenize -> sentence tokenizer, splits a body of text into sentences
word_tokenize -> word tokenizer, splits a sentence into words
from nltk.tokenize import sent_tokenize, word_tokenize
sampleText = "your text file"
print(sent_tokenize(sampleText))
print(word_tokenize(sampleText))
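A quick sketch with a concrete sentence in place of the placeholder (the sample string is made up; the printed lists are what NLTK typically returns):
sample = "Hello there. How are you doing today?"
print(sent_tokenize(sample))  # ['Hello there.', 'How are you doing today?']
print(word_tokenize(sample))  # ['Hello', 'there', '.', 'How', 'are', 'you', 'doing', 'today', '?']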
Chunking groups words into meaningful phrases (chunks) based on their part-of-speech tags
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw("2006-GWBush.txt")  # State of the Union speech used as training text
txt = "Kids are playing. Kids like to play games. He got played"
custTokenizer = PunktSentenceTokenizer(train_text)  # unsupervised training of the sentence tokenizer
tokenizedText = custTokenizer.tokenize(txt)
for i in tokenizedText:
    words = nltk.word_tokenize(i)
    tag = nltk.pos_tag(words)
    # chunk = zero or more adverbs, zero or more verbs, one or more proper nouns, an optional noun
    chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tag)
    #chunked.draw()
    chunked.pretty_print()
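Inside the same loop, the matched chunks can also be pulled out of the tree instead of printing the whole parse; a minimal sketch using Tree.subtrees():
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk"):
        print(subtree)  # only the pieces the grammar labelled "Chunk"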
Stop words are words that don't contribute much to the meaning of a text, i.e. filler words like "a", "the"... These words are removed so that the text becomes easier for a machine to understand
stopwords.words("english")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
txt = "your text file / input"
words = word_tokenize(txt)
nw = [i for i in words if i not in stopwords.words("english")]
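A concrete sketch, assuming a made-up input sentence (lowercasing before the comparison is a common extra step, since the stop-word list is all lowercase, and building the set once avoids re-reading the list for every word):
txt = "This is a sample sentence, showing off the stop words filtration."
stopSet = set(stopwords.words("english"))
filtered = [w for w in word_tokenize(txt) if w.lower() not in stopSet]
print(filtered)  # roughly: ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']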
Named entity recognition: the state_union data set is used to train PunktSentenceTokenizer into a custom sentence tokenizer, and nltk.ne_chunk() then labels named entities in the POS-tagged words
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import state_union
txt = "Your text file or input"
custTokenizer = PunktSentenceTokenizer(state_union.raw("2006-GWBush.txt"))
tokenizedText = custTokenizer.tokenize(txt)
for i in tokenizedText:
    words = nltk.word_tokenize(i)
    tag = nltk.pos_tag(words)
    nameEnt = nltk.ne_chunk(tag)
    nameEnt.pretty_print()
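nltk.ne_chunk() also accepts binary=True, which collapses all entity types (PERSON, GPE, ORGANIZATION, ...) into a single generic NE label; a sketch of that variant inside the same loop:
    nameEnt = nltk.ne_chunk(tag, binary=True)  # one generic "NE" label instead of typed entities
    nameEnt.pretty_print()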
Stemming is the process of reducing a word to its word stem by stripping affixes (suffixes and prefixes); the root form of a word is known as the lemma. Usage: ps.stem("word")
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
txt = "Kids are playing. Kids like to play games. He got played"
wt = word_tokenize(txt)
print([ps.stem(i) for i in wt])
A very similar operation to stemming is lemmatizing. The major difference is that stemming can often create non-existent words, whereas lemmas are actual words.
pos = "a" ⇒ adjective
pos = "v" ⇒ verb
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
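To lemmatize a whole sentence, tokenize it first; a sketch reusing the lemmatizer above (treating every token as a verb via pos="v" is a simplification here, since a full pipeline would map each token's POS tag to the matching WordNet tag):
from nltk.tokenize import word_tokenize
txt = "Kids are playing. Kids like to play games. He got played"
print([lemmatizer.lemmatize(w, pos="v") for w in word_tokenize(txt)])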
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import state_union
# state_union data set will be used for training PunktSentenceTokenizer to create a custom tokenizer
txt = "Kids are playing. Kids like to play games. He got played"
custTokenizer = PunktSentenceTokenizer(state_union.raw("2006-GWBush.txt"))
tokenizedText = custTokenizer.tokenize(txt)
for i in tokenizedText:
    words = nltk.word_tokenize(i)
    tag = nltk.pos_tag(words)
    print(tag)
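The tag abbreviations that pos_tag prints (NN, VBD, PRP, ...) can be looked up with NLTK's built-in help (the tagsets resource is part of the 'all' download):
nltk.help.upenn_tagset("VBD")   # describes the tag: verb, past tense, with examples
nltk.help.upenn_tagset("NN.*")  # a regular expression lists all matching noun tags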
Rather than letters, words are encoded into numbers, because different words can contain the same letters
Tokenizer(num_words = 100)
In the above code, num_words refers to the maximum number of words to keep; with a huge body of text, only the top 100 most frequent words are kept
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [ "I love my dog", "I love my cat" ]
tokenizer = Tokenizer(num_words = 100) # Creating the model object
tokenizer.fit_on_texts(sentences) # Training the model
print(tokenizer.word_index) # prints tokenized dictionary
# output -> {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
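The fitted tokenizer can then turn the same sentences into sequences of those numbers (the expected output follows from the word_index above):
seq = tokenizer.texts_to_sequences(sentences)
print(seq)  # expected -> [[1, 2, 3, 4], [1, 2, 3, 5]]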
If an unknown word is passed to the tokenizer.texts_to_sequences() method, that word is ignored. To deal with this, the oov_token property is passed to the Tokenizer() object: whenever a new word is encountered, it is replaced by whatever was passed as oov_token, in this case "<OOV>", as it is unlikely to appear naturally in a body of text.
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [ "I love my dog", "I love my cat", "you love my dog", "Do you think my dog is amazing? " ]
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# passing array of sentences to convert to ints
seq = tokenizer.texts_to_sequences([ "He likes my cat", "My dog is amazing!" ])
print(seq)
# word_index -> {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
# output -> [[1, 1, 2, 7], [2, 4, 10, 11]]  ("he" and "likes" are unknown, so they map to <OOV> = 1)
Text input to a model can come in different sizes; padding is used to deal with this. The pad_sequences() function sets the length of all inputs to that of the longest sequence by adding zeros either at the front or at the end of each sequence.
pad_sequences(seq) adds zeros at the beginning (the default)
pad_sequences(seq, padding='post') adds zeros at the end
pad_sequences(seq, padding='post', maxlen=10, truncating='post') also caps every sequence at 10 tokens, truncating from the end
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = [ "I love my dog", "I love my cat", "you love my dog", "Do you think my dog is amazing? " ]
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# passing array of sentences to convert to ints
seq = tokenizer.texts_to_sequences([ "He likes my cat", "My dog is amazing!" ])
paddedSeq = pad_sequences(seq)
print(paddedSeq)
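A sketch of the variants mentioned above, applied to the same seq (maxlen=3 is an arbitrary choice just to show truncation):
paddedPost = pad_sequences(seq, padding='post')  # zeros appended at the end
paddedCapped = pad_sequences(seq, padding='post', maxlen=3, truncating='post')  # cap at 3 tokens, dropping the extras from the end
print(paddedPost)
print(paddedCapped)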
CountVectorizer (bag of words): textual data is converted into numerical vectors of token counts that machine learning algorithms can understand.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
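Once fitted, the same vectorizer can transform new text using the vocabulary it learned; words it has never seen are simply ignored (a small sketch with a made-up document):
new_doc = ["This is a brand new document."]  # "brand" and "new" are not in the fitted vocabulary
print(vectorizer.transform(new_doc).toarray())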
TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents: term frequency measures how often the word appears in the document, while inverse document frequency down-weights words that appear in many documents.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
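For intuition, the textbook form of the statistic can be computed by hand; a rough sketch (scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row, so its numbers will differ):
import math

def tf_idf(term, doc_tokens, all_docs_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)      # how often the term occurs in this document
    df = sum(1 for d in all_docs_tokens if term in d)  # number of documents containing the term
    idf = math.log(len(all_docs_tokens) / df)          # rarer terms get a larger weight
    return tf * idf

docs = [d.lower().replace(".", "").replace("?", "").split() for d in corpus]
print(tf_idf("document", docs[0], docs))  # "document" appears in most documents, so it scores low
print(tf_idf("first", docs[0], docs))     # "first" is rarer, so it scores higher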
Word embeddings are a type of word representation in natural language processing (NLP) that captures the semantic meaning of words by mapping them to dense vectors in a continuous vector space.
There are several popular algorithms for generating word embeddings, including Word2Vec, GloVe, and fastText.
Word2Vec learns distributed representations of words by training a neural network on a large corpus of text. The key idea behind Word2Vec is the distributional hypothesis, which posits that words appearing in similar contexts tend to have similar meanings.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
# Sample corpus
corpus = [
"This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?",
]
# Tokenize the corpus
tokenized_corpus = [simple_preprocess(text) for text in corpus]
# Train Word2Vec model
model = Word2Vec(tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)
# Get word vectors
word_vectors = model.wv
# Find similar words
similar_words = word_vectors.most_similar("document")
print("Similar words to 'document':", similar_words)
# Get vector for a specific word
vector_for_word = word_vectors["document"]
print("Vector for 'document':", vector_for_word)
In this example, we tokenize the corpus into lists of words using simple_preprocess from Gensim. Then we train a Word2Vec model with a vector size of 100, a window size of 5 (the maximum distance between the current and predicted word within a sentence), and a minimum word count of 1. After training, we can access word vectors through the wv attribute of the model. Finally, we can find similar words or retrieve the vector representation of a specific word.
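As a follow-up on the same trained model, the KeyedVectors object also exposes a pairwise similarity score and the learned vocabulary (on such a tiny toy corpus the numbers are not meaningful, but the calls are the same on a real corpus):
print(word_vectors.similarity("document", "first"))  # cosine similarity between two in-vocabulary words
print(list(word_vectors.key_to_index.keys()))        # the vocabulary learned by the model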