Part 2
In this tutorial we will continue to preprocess our data and get it ready to feed to our neural network for training.
Word Stemming
You may have heard me mention word stemming in the previous tutorial. Stemming a word means attempting to find its root. For example, the stem of the word "thats" might be "that", and the word "happening" would have the stem "happen". We will use stemming to reduce the vocabulary of our model and attempt to capture the more general meaning behind sentences.
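As a quick illustration, here is the Lancaster stemmer (the one the full code below uses) applied to a few words. The exact stems depend on the Lancaster rules, so treat the outputs as examples rather than guarantees.

from nltk.stem.lancaster import LancasterStemmer

stemmer = LancasterStemmer()

# Stem a few sample words; we lowercase them first,
# just as the preprocessing code below does.
for word in ["thats", "happening", "Happened"]:
    print(word, "->", stemmer.stem(word.lower()))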
# Stem and lowercase every collected word, dropping question marks
words = [stemmer.stem(w.lower()) for w in words if w != "?"]

# Remove duplicates and sort so each word has a stable position
words = sorted(list(set(words)))

labels = sorted(labels)
This code simply creates a sorted list of unique stemmed words to use in the next step of our data preprocessing.
Bag of Words
Now that we have loaded in our data and created a stemmed vocabulary, it's time to talk about a bag of words. As we know, neural networks and machine learning algorithms require numerical input, so our list of strings won't cut it. We need some way to represent our sentences with numbers, and this is where a bag of words comes in. What we are going to do is represent each sentence with a list whose length is the number of words in our model's vocabulary. Each position in the list will represent a word from our vocabulary. A 1 in a position means that word exists in our sentence; a 0 means it is not present. We call this a bag of words because the order in which the words appear in the sentence is lost; we only know which words from our model's vocabulary are present.
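Here is a tiny, made-up example of the idea (the vocabulary and sentence below are illustrative, not from the tutorial's dataset):

vocab = ["good", "hello", "how", "you"]   # sorted stemmed vocabulary
sentence = ["hello", "how", "you"]        # a sentence, already tokenized and stemmed

bag = [1 if w in sentence else 0 for w in vocab]
print(bag)  # [0, 1, 1, 1] -- the sentence's word order is lost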
As well as formatting our input, we need to format our output so it makes sense to the neural network. Similarly to a bag of words, we will create output lists whose length equals the number of labels/tags in our dataset. Each position in the list represents one distinct label/tag; a 1 in any of those positions shows which label/tag is represented.
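For instance, with a made-up set of tags, the output row for "greeting" would look like this:

labels = ["goodbye", "greeting", "thanks"]  # illustrative tags, sorted

row = [0] * len(labels)
row[labels.index("greeting")] = 1  # mark the position of this sentence's tag
print(row)  # [0, 1, 0]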
training = []
output = []

# Template output row: one 0 for every label/tag
out_empty = [0 for _ in range(len(labels))]

for x, doc in enumerate(docs_x):
    bag = []

    # Stem the words in this pattern so they match our vocabulary
    wrds = [stemmer.stem(w.lower()) for w in doc]

    # Build the bag of words: 1 if the vocabulary word appears in the pattern
    for w in words:
        if w in wrds:
            bag.append(1)
        else:
            bag.append(0)

    # Copy the template and mark this pattern's tag with a 1
    output_row = out_empty[:]
    output_row[labels.index(docs_y[x])] = 1

    training.append(bag)
    output.append(output_row)
Finally, we will convert our training data and output to NumPy arrays.
training = numpy.array(training)
output = numpy.array(output)
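As a quick sanity check (the exact numbers depend on your own intents.json), you can print the shapes of the two arrays:

print(training.shape)  # (number of patterns, vocabulary size)
print(output.shape)    # (number of patterns, number of tags)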
Full Code
import nltk
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

import numpy
import tflearn
import tensorflow
import random
import json

with open("intents.json") as file:
    data = json.load(file)

words = []
labels = []
docs_x = []
docs_y = []

for intent in data["intents"]:
    for pattern in intent["patterns"]:
        wrds = nltk.word_tokenize(pattern)
        words.extend(wrds)
        docs_x.append(wrds)
        docs_y.append(intent["tag"])

    if intent["tag"] not in labels:
        labels.append(intent["tag"])

words = [stemmer.stem(w.lower()) for w in words if w != "?"]
words = sorted(list(set(words)))

labels = sorted(labels)

training = []
output = []

out_empty = [0 for _ in range(len(labels))]

for x, doc in enumerate(docs_x):
    bag = []

    wrds = [stemmer.stem(w.lower()) for w in doc]

    for w in words:
        if w in wrds:
            bag.append(1)
        else:
            bag.append(0)

    output_row = out_empty[:]
    output_row[labels.index(docs_y[x])] = 1

    training.append(bag)
    output.append(output_row)

training = numpy.array(training)
output = numpy.array(output)
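One practical note: nltk.word_tokenize relies on NLTK's "punkt" tokenizer models. If running the script raises a LookupError about a missing tokenizer, downloading it once should fix it:

import nltk
nltk.download("punkt")  # one-time download of the tokenizer models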