
Creating Sequence Vector From Text In Python

I am trying to prepare the input data for an LSTM-based NN. I have a large number of text documents, and what I want is to make sequence vectors for each document so I am able to

Solution 1:

Solved with Keras text preprocessing classes: http://keras.io/preprocessing/text/

Done like this:

from keras.preprocessing.text import Tokenizer

train_docs = ['this is text number one', 'another text that i have']
# lowercase the text and split it on spaces
tknzr = Tokenizer(lower=True, split=" ")
# build the word -> index vocabulary from the training documents
tknzr.fit_on_texts(train_docs)
# vocabulary:
print(tknzr.word_index)

Out[1]:
{'this': 2, 'is': 3, 'one': 4, 'another': 9, 'i': 5, 'that': 6, 'text': 1, 'number': 8, 'have': 7}

# making sequences: each word is replaced by its vocabulary index
X_train = tknzr.texts_to_sequences(train_docs)
print(X_train)

Out[2]:
[[2, 3, 1, 8, 4], [9, 1, 6, 5, 7]]
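
To make the mapping above less of a black box, here is a minimal pure-Python sketch of the same idea: rank words by frequency, number them from 1, then replace each word with its number. The helper names build_word_index and docs_to_sequences are hypothetical and are not part of Keras. Indices for words with equal counts may differ from what Tokenizer prints, depending on the Keras and Python versions.

```python
from collections import Counter

def build_word_index(docs):
    """Map each word to an integer rank, most frequent first, starting at 1."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def docs_to_sequences(docs, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in doc.lower().split() if w in word_index]
            for doc in docs]

train_docs = ['this is text number one', 'another text that i have']
word_index = build_word_index(train_docs)
sequences = docs_to_sequences(train_docs, word_index)
print(word_index)
print(sequences)
```

As in the Keras output, 'text' gets index 1 because it is the only word that occurs twice.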

Solution 2:

You could use NLTK to tokenise the training documents. NLTK provides a standard word tokeniser and also lets you define your own tokeniser (e.g. RegexpTokenizer). See the NLTK tokenize documentation for more details about the different tokeniser functions available.

Here might also be helpful for pre-processing the text.

A quick demo using NLTK's pre-trained word tokeniser below:

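# word_tokenize needs NLTK's Punkt tokeniser models, fetched with nltk.download()
from nltk import word_tokenize

train_docs = ['this is text number one', 'another text that i have']
# join all documents into one string so the vocabulary covers every document
train_docs = ' '.join(map(str, train_docs))

tokens = word_tokenize(train_docs)
# map each token to its position in the token list
# (a repeated token such as 'text' keeps its last position)
voc = {tok: idx for idx, tok in enumerate(tokens)}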
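
The demo above stops at the vocabulary, and joining the documents into one string loses the document boundaries. A possible way to continue, sketched under some assumptions: tokenise each document separately, give every new token the next free index starting at 1, and map each document to its own sequence. The tokenize function here is a hypothetical stand-in that uses str.split so the sketch runs without NLTK data; in practice you would call word_tokenize in its place.

```python
def tokenize(text):
    # Stand-in for nltk.word_tokenize (assumption: whitespace splitting is enough here)
    return text.split()

train_docs = ['this is text number one', 'another text that i have']

# Tokenise per document so that document boundaries are preserved
tokenized_docs = [tokenize(doc) for doc in train_docs]

# Build a vocabulary with contiguous indices starting at 1
voc = {}
for tokens in tokenized_docs:
    for tok in tokens:
        # setdefault keeps the index assigned on the token's first occurrence
        voc.setdefault(tok, len(voc) + 1)

# One integer sequence per document
X_train = [[voc[tok] for tok in tokens] for tokens in tokenized_docs]
print(voc)
print(X_train)
```

Starting at 1 leaves 0 free, which is handy if the sequences are later padded to a common length.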
