Text Classification P1

Text Classification

Another large application of neural networks is text classification. In these next few tutorials we will use a neural network to classify movie reviews as either positive or negative.

Install Previous Version of Numpy

There is a bug when using this specific dataset that requires us to install a previous version of numpy. We can do this by running the following in our cmd:

pip install numpy==1.16.1

This is the current working solution as of May 14, 2019. If you are reading this after that date you may not need to do this.

Loading Data

The dataset we will use for these next tutorials is the IMDB movie dataset from keras. To load and split the data we will do the same as we did in the previous tutorial.

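import tensorflow as tf
from tensorflow import keras
import numpy as np

imdb = keras.datasets.imdb

# Keep only the 88000 most frequent words, matching the size of the Embedding layer used later
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=88000)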

Integer Encoded Data

Having a look at our data, we'll notice that our reviews are integer encoded. This means that each word in a review is represented by a positive integer, and each integer stands for a specific word. This is necessary because we cannot pass strings to our neural network. However, if we (as humans) want to read our reviews and see what they look like, we'll have to turn those integer encoded reviews back into strings. The following code will do this for us:

# A dictionary mapping words to an integer index
_word_index = imdb.get_word_index()

word_index = {k:(v+3) for k,v in _word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
	"""Return the decoded (human readable) review for a list of integers."""
	return " ".join([reverse_word_index.get(i, "?") for i in text])

We start by getting a dictionary that maps each word to an integer. We then shift every index up by 3 and add a few special keys such as <PAD>, <START>, <UNK> and <UNUSED> in the slots that this frees up. Finally, we reverse that dictionary so that the integers become keys that map to each word. The function defined above takes an integer encoded review as a list and returns the human readable version.
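To make the mapping concrete without downloading the dataset, here is a minimal sketch that runs the same steps on a tiny made-up vocabulary. The `toy_word_index` dictionary and the `encoded` list are assumptions for illustration only; the real dictionary comes from `imdb.get_word_index()` and is far larger.

```python
# Hypothetical stand-in for imdb.get_word_index() (illustration only)
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Shift every index up by 3 to make room for the special tokens
word_index = {k: (v + 3) for k, v in toy_word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

# Swap keys and values so integers map back to words
reverse_word_index = {value: key for key, value in word_index.items()}


def decode_review(text):
    # Look up each integer; anything missing from the dictionary becomes "?"
    return " ".join(reverse_word_index.get(i, "?") for i in text)


# A made-up integer encoded review: start token, four words, unknown, padding
encoded = [1, 4, 5, 6, 7, 2, 0, 0]
decoded = decode_review(encoded)
print(decoded)  # <START> the movie was great <UNK> <PAD> <PAD>
```

With the real data, the same call would look like `decode_review(test_data[0])`.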

Preprocessing Data

If we have a look at some of our loaded reviews, we'll notice that they are different lengths. This is an issue, because we cannot pass data of different lengths into our neural network. Therefore we must make each review the same length. To do this we will follow the procedure below:

  • if the review is longer than 250 words, trim off the extra words
  • if the review is shorter than 250 words, add as many <PAD> tokens as needed to bring it up to 250 words.
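The two rules above can be sketched in plain Python as follows. The helper name `pad_review` and the short `maxlen` used in the example are our own assumptions for illustration, not part of keras. The sketch pads at the end of a short review and trims words from the front of a long one, which, to the best of our knowledge, mirrors `padding="post"` together with the default `truncating="pre"` of `pad_sequences`.

```python
def pad_review(review, maxlen=250, pad_value=0):
    """Return a new list that is exactly maxlen integers long."""
    if len(review) > maxlen:
        # Too long: drop words from the front, keep the last maxlen
        return review[len(review) - maxlen:]
    # Too short (or already exact): append pad_value up to maxlen
    return review + [pad_value] * (maxlen - len(review))


# Made-up integer encoded reviews (illustration only)
short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded = pad_review(short_review, maxlen=5)
trimmed = pad_review(long_review, maxlen=5)
print(padded)   # [1, 14, 22, 0, 0]
print(trimmed)  # [22, 16, 43, 530, 973]
```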

Luckily, keras has a function that does this for us:

train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post", maxlen=250)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], padding="post", maxlen=250)

Defining the Model

Finally, we will define our model! This model is a little different from the one in the previous tutorial and will be discussed in depth in the next tutorial.

model = keras.Sequential()
model.add(keras.layers.Embedding(88000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation="relu"))
model.add(keras.layers.Dense(1, activation="sigmoid"))

model.summary()  # prints a summary of the model
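As a rough check on what `model.summary()` reports, the number of trainable parameters in each layer can be worked out by hand. This is a sketch of the arithmetic only, assuming the layer sizes shown above and the usual weights-plus-biases layout of a `Dense` layer; the variable names are our own.

```python
vocab_size = 88000     # first argument to Embedding
embedding_dim = 16     # second argument to Embedding
hidden_units = 16      # units in the first Dense layer

# Embedding: one vector of embedding_dim numbers per word
embedding_params = vocab_size * embedding_dim

# GlobalAveragePooling1D only averages, so it has nothing to train
pooling_params = 0

# Dense: one weight per (input, unit) pair plus one bias per unit
hidden_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
# 1408000 272 17 1408289
```

Almost all of the parameters sit in the embedding layer, which is why the size of the vocabulary matters so much for this model.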

Full Code

import tensorflow as tf
from tensorflow import keras
import numpy as np

data = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = data.load_data(num_words=88000)


word_index = data.get_word_index() 

word_index = {k:(v+3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])


def decode_review(text):
	return " ".join([reverse_word_index.get(i, "?") for i in text])


train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post", maxlen=250)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], padding="post", maxlen=250)


model = keras.Sequential()
model.add(keras.layers.Embedding(88000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation="relu"))
model.add(keras.layers.Dense(1, activation="sigmoid"))

model.summary()  # prints a summary of the model