Introduction To Tensorflow Datasets

Hello guys. It’s Friday again and I hope that you’re all ready for another weekend project. Now that Tensorflow 2.0-alpha is out, I’m eager to try out its new features, and I guess you guys are too. The final release won’t be around until this summer, so we have plenty of time to get ready.

This week’s topic is Tensorflow Datasets (tensorflow_datasets), a new Tensorflow package that I have tried out recently. It may be the perfect solution for eliminating the pain of preparing training data.

This post was originally written in Japanese (which you can find here). If you prefer Japanese (and can stand my weird Japanese), feel free to jump to that post.

This post will be a little bit long and consists of the following parts:

  • The painful data preparation
  • An introduction to Tensorflow Datasets
  • Project: IMDB Movie Review Classification

Well, let’s tackle them one by one.

The painful data preparation

Basically speaking, Machine Learning (and especially Deep Learning) is an approach in which we have a set of data and our mission is to create a learning model that fits that data to our desired target.

Did I say that we have a set of data? In order to have that data, we usually have to do a lot of preparation, which may be the most expensive and tedious task of all. Data preparation can be boiled down to 3 steps:

  1. Data collection: we can get data either from the internet or from some kind of database
  2. Data preprocessing: normally the raw data requires some modification. For example, for image data we may want to perform some augmentation, normalize pixel values to the (-1, 1) range, etc. For text data, we will need to create a vocabulary for tokenization, add zero-padding, etc.
  3. Input pipeline creation: lastly, we need to create a flow to feed our data into the model. This step usually consists of converting data to Tensors, creating batches, etc.

I bet that everyone reading this post has experienced the pain of going through all the steps above, especially before the tf.data API came out. With the help of the tf.data API, life has become a lot easier since steps 2 and 3 are no longer challenging.

What about step 1? Nothing changed. Sure, tf.keras ships with small datasets like MNIST or CIFAR-10, but those were never enough.

The Tensorflow team knew the community’s pain, and tensorflow_datasets is their answer!

Then what is tensorflow_datasets and how can it be a life saver? Let’s find out.

An introduction to tensorflow_datasets

Before trying out tensorflow_datasets, let’s talk about machine specs. Here are mine:

  • OS: Ubuntu 18.04 (can be ignored if you are using Docker)
  • GPU: NVIDIA GTX 1070
  • Tensorflow: 2.0-alpha

I highly suggest that you use Tensorflow 2.0 for this tutorial. If you haven’t installed it yet, don’t worry. I have a post for that: link. Remember to come back when you’re done 😉

Are you ready? Let’s go ahead and import necessary packages:
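
At a minimum, that means Tensorflow itself and the star of today (everything else in this post builds on these two):

    import tensorflow as tf
    import tensorflow_datasets as tfds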

We’re gonna use the IMDB movie review dataset for today’s project. The data is a collection of movie reviews (of course), each classified as either positive or negative. Here is an example of a negative review:

The Canterville Ghost (1996). The director made this too sappy a production. Maybe it’s the generation, but I really liked the Charles Laughton version. There is a time and place for “emoting” and this production does not translate very well. Patrick Stewart, reciting Shakespeare was very good, but still inappropriate. Would neither recommend nor watch again. The close-ups and padded text and sub-plots were lost on me. Adding extraneous material and scenes takes away from a truly great work. The screenplay writer should find another profession in which to misplace his talent, maybe afternoon soap operas would be a better venue. Check out the really good version and pass on this one.

Label: Negative

And here is what a positive review looks like:

This is one of the finest TV movies you could ever see. The acting, writing and production values are top-notch. The performances are passionate with Beverly D’Angelo superb as the older woman with a teenage daughter and Rob Estes simply perfect as the young stud boyfriend. However, the best part of this film was how it showed the consequences of sexual abuse instead of going for the usual happy ending. It showed that abuse can happen in good families; involve good people; and wreck lives. It is thought provoking and entertaining. Congratulations to all concerned with this exceptional movie.

Label: Positive

Wow, honestly, I haven’t written such long reviews in my entire life (I don’t think I can).

The traditional approach

Personally, I always think that the best way to know exactly how important something is is to see what life looks like without it.

Sounds confusing, huh? Okay, let’s suppose tensorflow_datasets doesn’t exist. What are we gonna do? We will have to download the data, right? After the files are downloaded, we need to take a look at the directory structure. Well, you don’t have to do that (because I did). It looks roughly like this:
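
Roughly speaking (I’m sketching this from memory, so treat the exact file names as approximate):

    aclImdb/
    ├── train/
    │   ├── pos/     (12,500 positive reviews, one text file each)
    │   ├── neg/     (12,500 negative reviews)
    │   └── unsup/   (unlabeled reviews)
    ├── test/
    │   ├── pos/
    │   └── neg/
    ├── imdb.vocab
    └── README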

After knowing how the data is structured, we need to write a script to read data from the files. And since this is text data, we have to create a vocabulary, manually tokenize everything and then create a tf.data.Dataset object. That sounds time-consuming indeed…

If we use tensorflow_datasets

Okay guys. Now we know tensorflow_datasets (or tfds for short) does exist. What does it change?

Well, tfds will do all of the above without us even knowing! It means that under the hood, tfds will download the data, create the vocabulary, tokenize words and return an instance of tf.data.Dataset.

One small caveat, though: tfds cannot possibly cover every single dataset out there. But I can say that it contains enough for our needs: MNIST, CIFAR-10, or even LSUN and CelebA (where are the GAN fans?). And those are just a few from the image category. In the text category, we have the famous WMT datasets for machine translation. Most importantly, new datasets are gradually being added, so by the time Tensorflow 2.0 officially comes out, you will find it more than enough!

Okay, back to the point. How to use tfds, then?

There are two ways: by creating a builder or by calling the load() function.

Method 1: Creating a builder

Creating a builder is the fundamental way to construct the dataset. You can explicitly tell the builder what to do, which is pretty handy.

To create a builder, we only need to pass the dataset’s name:
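
In our case, the registered name is imdb_reviews:

    builder = tfds.builder('imdb_reviews')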

The builder comes with an info attribute, which stores a lot of handy information:
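
Printing it is the quickest way to see what the dataset contains:

    print(builder.info)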

We can extract a lot of useful information from there, such as:

  • The data consists of text features (string) and labels (2 classes)
  • Number of examples: 50000
  • Training examples: 25000
  • Test examples: 25000
  • and so on

One thing worth noticing: after we create the builder, the data is not downloaded automatically. We have to explicitly call download_and_prepare():
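
Which is just:

    builder.download_and_prepare()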

It will take a while for the data to be completely downloaded…

Okay, so now we have the data on disk. With just one more method call, we will have tf.data.Dataset instances that are ready to be used:
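
That method is as_dataset(), and it returns one tf.data.Dataset per split:

    datasets = builder.as_dataset()
    print(datasets)  # a dict with one tf.data.Dataset per split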

As you can see above, the builder actually created a separate tf.data.Dataset instance for each data split. Let’s check out the training split. What we’re gonna do next is similar to what we normally do with the tf.data API: specify the batch size and create an iterator (which acts like a Python generator):
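
For demonstration I will go with batches of 5 examples and take only 2 batches (both numbers are arbitrary):

    train_dataset = datasets['train'].batch(5).take(2)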

Let me elaborate on the line above a little bit. Normally, after calling batch() to specify the batch size, we would use make_one_shot_iterator() or make_initializable_iterator() to get an iterator to loop through the dataset. But both of them are deprecated and removed in Tensorflow 2.0. Instead, we can just call take() and pass in the number of batches we want.

So, the code above means: I will take 2 batches, each containing 5 examples. Let’s have a look to be sure:
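
In eager mode the dataset itself is iterable, so a plain for loop does the trick:

    for batch in train_dataset:
        print(batch['text'].shape, batch['label'])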

Okay, we have 2 batches with 5 reviews in each. It is much more intuitive than the old API, isn’t it?

Before we move on to the other way to use tfds, I want to address one little thing that some of you might have noticed: the features and the labels are stored in a dictionary. What’s wrong with a dictionary anyway? You have to know the keys in order to get the values. Some folks, including me, prefer a tuple!

We can have each batch returned as a tuple by explicitly specifying as_supervised=True when creating tf.data.Dataset objects:
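
With the builder, that just means passing the flag to as_dataset() (same toy batch settings as before):

    datasets = builder.as_dataset(as_supervised=True)
    train_dataset = datasets['train'].batch(5).take(2)

    for texts, labels in train_dataset:
        print(texts.shape, labels)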

Each batch is now packed as a tuple. Perfect!

Next, let’s check out the other method: calling load().

Method 2: Calling load()

Instead of creating a builder, we can create a tf.data.Dataset by calling tensorflow_datasets’ load() function.

In short, load() makes everything even easier. In just one call, all the data will be downloaded and converted to tf.data.Dataset automatically:
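
For our dataset, that is essentially a one-liner:

    datasets = tfds.load('imdb_reviews')
    train_dataset = datasets['train']  # the other splits are in there too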

Wow, that was neat! But in return, we can no longer use the other cool stuff from the builder above, like info, can we? Actually, we don’t lose anything. Just pass with_info=True to get the handy info object returned:
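
Like this:

    datasets, info = tfds.load('imdb_reviews', with_info=True)
    print(info)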

Before you ask: yes, the data is packed in a dictionary by default, and of course we can tell load() to use tuples as well 😉
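
Same as_supervised trick as before, just passed to load() this time (the small peek loop is mine):

    datasets = tfds.load('imdb_reviews', as_supervised=True)

    for text, label in datasets['train'].batch(2).take(1):
        print(text.shape, label)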

I guess those were the tuples we all fancy. And that’s about all you need to know about tensorflow_datasets. If you want to know more, it’s your turn to dig deeper into the API documentation. For now, it’s time for a real challenge: the project!

Project: IMDB movie review classification

We have done quite a lot of talking about tfds; now we will actually apply what we have learned to a real project.

Create the input pipeline

Like always, the first thing to tackle is the input pipeline. Without it, the data won’t be able to flow into the model. We will start off by importing packages and defining some constants:
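
Here is what I start with; the batch size, buffer size and number of epochs are my own choices, so feel free to tweak them:

    import tensorflow as tf
    import tensorflow_datasets as tfds

    BATCH_SIZE = 64       # my own choice
    BUFFER_SIZE = 10000   # shuffle buffer size, also my own choice
    EPOCHS = 10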

Next, let’s create the dataset (an instance of tf.data.Dataset), which is no longer a challenge now that we have tfds:
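
One call to load() gives us everything we need. I ask for the info object and for (text, label) tuples right away:

    dataset, info = tfds.load('imdb_reviews/subwords32k',
                              with_info=True, as_supervised=True)
    train_dataset = dataset['train']
    test_dataset = dataset['test']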

Something seems not quite right here. What is subwords32k?

Technically, whether we use a builder or the load() function, we can specify not only the name of the dataset but also the config type. The full form is “dataset/config”. If we do not specify a config type, tfds will use the default setting.

Why do we explicitly specify imdb_reviews/subwords32k then?

Because I don’t want to do the tokenization manually, which means I want the returned data to be arrays of integers, not raw text like above. Here’s what the data looks like:
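
Peeking at a single example is enough to see the difference:

    for sequence, label in train_dataset.take(1):
        print(sequence[:10])  # the review is now a 1-D tensor of subword ids
        print(label)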

As you can see, by specifying the config type, we obtain a dataset which we can feed directly into a neural network. Of course, it would be meaningless if we didn’t have a way to convert the integers back to raw tokens, right? Remember the handy info object? It contains a tokenizer which will handle that conversion for us:
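
The tokenizer lives at info.features['text'].encoder, and a quick round trip shows what it does (the sample sentence is obviously just mine):

    encoder = info.features['text'].encoder
    sample = 'This movie was an absolute delight.'
    encoded = encoder.encode(sample)
    print(encoded)                  # a list of subword ids
    print(encoder.decode(encoded))  # back to the original sentence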

Cool, right? For that reason, let’s grab that tokenizer to use later on:
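
We will simply keep it under the name tokenizer:

    tokenizer = info.features['text'].encoder
    print(tokenizer.vocab_size)  # 32650 for the subwords32k config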

Next, we will get the datasets ready with the familiar procedure: shuffle the training dataset and generate batches for both datasets:
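
A minimal version looks like this, using the constants defined earlier (the padded_shapes argument says: pad the sequences to a common length, leave the scalar labels alone):

    train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(
        BATCH_SIZE, padded_shapes=([None], []))
    test_dataset = test_dataset.padded_batch(
        BATCH_SIZE, padded_shapes=([None], []))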

Oops! Why use padded_batch? Because after tokenization, our raw text data has become arrays of integers (i.e. sequences). In order to create batches, the tf.data API requires that all elements in one batch have the same length, and that is exactly what padded_batch() excels at.

Let’s have a look again. When we don’t set the config type:

    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string, encoder=None)
    }),

The text features are scalar tensors of type tf.string. If we specify a config type, which tells tfds to tokenize the data for us:

    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32650>)
    }),

Then the data will become 1D Tensors of type tf.int64, which is also the reason why we had to call padded_batch().

And we are done with the input pipeline.

Create the model

Next, we will go ahead and define a model. Since we are dealing with sequences and the IMDB dataset is quite small, a simple RNN can get the job done. Creating neural networks with tf.keras.Sequential is never a challenging task:
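
Below is a sketch close to the official text-classification tutorial; the embedding size and the number of LSTM units (64 each) are just my picks:

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)  # activation='sigmoid' deliberately left out: we output raw logits
    ])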

Personally, I prefer the style of inheriting from tf.keras.Model, which gives me the same flexibility as the Keras Functional API. Note that the lines below are for demonstration purposes only:
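
Here is the same model written as a subclass of tf.keras.Model (the class name and default sizes are, again, just mine):

    class Classifier(tf.keras.Model):
        def __init__(self, vocab_size, embedding_dim=64, rnn_units=64):
            super(Classifier, self).__init__()
            self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
            self.rnn = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(rnn_units))
            self.hidden = tf.keras.layers.Dense(64, activation='relu')
            self.out = tf.keras.layers.Dense(1)  # raw logits here as well

        def call(self, sequences):
            x = self.embedding(sequences)
            x = self.rnn(x)
            x = self.hidden(x)
            return self.out(x)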

Next, we need a loss function. Since movie reviews are either positive or negative, we’re gonna use the BinaryCrossentropy loss function:
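
Since the model outputs raw logits, we tell the loss so:

    loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)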

From Tensorflow 2.0 (actually since v1.3), we can specify whether we are passing raw logit values (i.e. no sigmoid/softmax applied). As you can see, when we defined the model, I commented out the activation parameter of the last layer. Applying sigmoid or softmax by hand is usually prone to numerical instability, so we should just let the library do its job!

And that’s it. We have finished creating the model.

Create the training loop

This is our last step, guys. Let’s create a training loop. First, we need an optimizer to tell the model how to update its weights. Most of the time, Adam is my first choice:
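
Nothing fancy here; the learning rate is just an arbitrary starting point:

    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)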

Next, let’s define a for loop. At every epoch, we will get the sequences and the labels from the dataset, pass them to the model and get a loss value back.

A small note, though: from Tensorflow 2.0, Eager Execution is the default mode, so we need to put all of the network’s computation inside a tf.GradientTape():
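
Here is the skeleton of the loop; the constants come from the input-pipeline section above:

    for epoch in range(EPOCHS):
        for sequences, labels in train_dataset:
            with tf.GradientTape() as tape:
                logits = model(sequences)       # forward pass, raw logits
                loss = loss_fn(labels, logits)  # loss for this batch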

Now that we have the network’s loss, we can go ahead and compute gradients using the tape object above. Then, finally, we use the optimizer to apply those gradients (i.e. update the network’s weights):
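
These two lines go right after the GradientTape block, still inside the inner loop:

            gradients = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))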

I also wanted to know how well the model can perform, so I put some code to compute the accuracy in there too 😉
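
A rough batch accuracy can be computed right there as well: a review is predicted positive whenever its logit is above zero (which is exactly sigmoid(logit) > 0.5):

            predictions = tf.cast(tf.squeeze(logits, axis=-1) > 0, labels.dtype)
            accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))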

Okay, the last thing to do is to periodically print out the loss and accuracy values. Let’s also randomly pick 5 reviews from the test dataset and see how well the model can classify them:
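
Here is a sketch of that reporting step; it sits at the end of each epoch, one level up from the batch loop, and the formatting is entirely up to you:

        # back at the epoch level: report the numbers from the last batch
        print('Epoch {}: loss {:.4f}, accuracy {:.4f}'.format(
            epoch + 1, loss.numpy(), accuracy.numpy()))

        # classify the first 5 reviews of one (shuffled) test batch
        for test_sequences, test_labels in test_dataset.shuffle(100).take(1):
            test_logits = model(test_sequences[:5])
            for seq, lab, logit in zip(test_sequences[:5], test_labels[:5], test_logits):
                text = tokenizer.decode([int(i) for i in seq.numpy() if i != 0])
                predicted = 'Positive' if logit.numpy()[0] > 0 else 'Negative'
                actual = 'Positive' if lab.numpy() == 1 else 'Negative'  # assuming label 1 means positive
                print(text[:80] + '...')
                print('  predicted: {} | actual: {}'.format(predicted, actual))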

That’s also the last piece of code we had to write. It’s time to go get yourself a cup of coffee and enjoy the training!

If there is nothing wrong with the code, you will see that the model starts to give out good results after 3 to 4 epochs of training:

How about your model? Do you enjoy watching an AI model classify a long list of long movie reviews? Wondering whether you should go to that particular movie this weekend? Just let your model decide for you!

And as always, you can find the code for this project on my GitHub: link. Feel free to drop me a line if you have any problems. I would be very happy to help.

Final words

Before ending this post, I want to say good job, guys! We have come a long way today to get a taste of Tensorflow Datasets and to use that module to tackle a deep learning project. I’m pretty sure you didn’t feel the pain of preparing training data, not even slightly.

Finally, I also want to say thank you for following such a long blog post. I hope you can benefit from what we did today and apply it to your work.

And that’s all for today, guys. I’m gonna see you, in no time!

Reference

  • Tensorflow Datasets Homepage: link
  • Tensorflow’s Text Classification: link
  • Install Tensorflow 2.0: link

Trung Tran is a Deep Learning Engineer working in the car industry. His main daily job is to build deep learning models for autonomous driving projects, which varies from 2D/3D object detection to road scene segmentation. After office hours, he works on his personal projects which focus on Natural Language Processing and Reinforcement Learning. He loves to write technical blog posts, which helps spread his knowledge/experience to those who are struggling. Less pain, more gain.
