Blog Post 6 - Fake News Classification Using TensorFlow

With the development of the Internet, news spreads faster than ever. However, fake news has spread along with it: some outlets publish fabricated stories to stir up public opinion and increase their own influence online. Hence, it is important for everyone to be able to tell real news from fake news.

In this blog post, we are going to develop a fake news classifier using TensorFlow in Python. It is highly recommended to work through this blog post in Google Colab.

Data Source

Our data for this blog post comes from the following source:

  • Ahmed H., Traore I., Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques.” In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).

We can access it from Kaggle. Professor Phil Chodrow performed a small amount of data cleaning on the data and carried out a train-test split.

Now, let's get started!

§1. Acquire Training Data

The following URL points to the training data set. We can either read it into our Jupyter Notebook directly or download it to our computer and read it from disk. In this blog, we will use pd.read_csv() to read it into Python directly.

train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"

Before we read the data, we need to import required packages.

import pandas as pd
import numpy as np
import tensorflow as tf
import string
import re
# Read the data set
df = pd.read_csv(train_url)
# Let's take a quick look
df.head()
  | Unnamed: 0 | title                                              | text                                               | fake
0 | 17366      | Merkel: Strong result for Austria's FPO 'big c...  | German Chancellor Angela Merkel said on Monday...  | 0
1 | 5634       | Trump says Pence will lead voter fraud panel       | WEST PALM BEACH, Fla.President Donald Trump sa...  | 0
2 | 17487      | JUST IN: SUSPECTED LEAKER and “Close Confidant...  | On December 5, 2017, Circa s Sara Carter warne...  | 1
3 | 12217      | Thyssenkrupp has offered help to Argentina ove...  | Germany s Thyssenkrupp, has offered assistance...  | 0
4 | 5535       | Trump say appeals court decision on travel ban...  | President Donald Trump on Thursday called the ...  | 0

From the above pandas dataframe, we can see that each row of the data corresponds to a news article. The title column contains the title of the article, and the text column contains the full article text. The last column, fake, is 0 if the article is true and 1 if the article contains fake news.

§2. Make a Dataset

Now, we are going to write a function called make_dataset. This function will do the following two things:

  1. Remove the stopwords from the article text and title. A stopword is a word that is usually considered to be uninformative, for example “the”, “but”, “and”, or “or”. We can get help from this StackOverflow thread.

  2. Construct and return a tf.data.Dataset with two inputs and one output. The inputs should be of the form (title, text), and the output should consist only of the fake column.

Write The Function

First, we import stopwords from nltk.corpus

# Import stopwords
import nltk
from nltk.corpus import stopwords
# Download stopwords
nltk.download('stopwords')
# Extract stopwords
stop = stopwords.words('english')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
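As a quick sanity check, here is what stopword removal looks like on a made-up example sentence (note that the membership test is case-sensitive, so the capitalized "The" survives):

# Illustrative example: lowercase stopwords such as "is", "at", and "the" are dropped
example = "The president is speaking at the White House"
print(' '.join([word for word in example.split() if word not in stop]))
# The president speaking White House

Now we can write make_dataset itself.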
def make_dataset(df):
  # remove stopwords from the text and title columns
  df['text_without_stop'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
  df['title_without_stop'] = df['title'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
  # construct the tf.data.Dataset: (title, text) as inputs, fake as output
  data = tf.data.Dataset.from_tensor_slices(
      (
          {
              "title" : df[["title_without_stop"]],
              "text" : df[["text_without_stop"]]
          },
          {
              "fake" : df[["fake"]]
          }
      )
  )
  # we shuffle the dataset
  data = data.shuffle(buffer_size = len(data))

  # batch our dataset to increase the training speed
  data = data.batch(100)

  return data

Next, we call our function make_dataset() on our training dataframe to produce a Dataset. I’ve included data = data.batch(100) prior to returning the dataset. Batching causes our model to train on chunks of data rather than individual rows. Larger batches can speed up training but may reduce accuracy, so it is important to find a balance. I found that batches of 100 rows worked well.

data = make_dataset(df)
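As a quick, purely illustrative check, we can peek at one batch to confirm the structure of the dataset: each element is a pair of dictionaries, with 100 titles and texts on the input side and 100 labels on the output side.

# Inspect the shapes of a single batch
for inputs, outputs in data.take(1):
  print(inputs["title"].shape, inputs["text"].shape, outputs["fake"].shape)
# expected: (100, 1) (100, 1) (100, 1)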

Validation Data

After we have constructed our primary Dataset, we set aside 20% of it to use for validation.

train_size = int(0.8*len(data))
val_size = int(0.2*len(data))

train_dataset = data.take(train_size)
# skip the training batches so that validation does not reuse them
val_dataset = data.skip(train_size).take(val_size)
# Checking the size of training and validation set
len(train_dataset), len(val_dataset)
(180, 45)

Base Rate

Recall that the base rate refers to the accuracy of a model that always makes the same guess. To determine the base rate for this data set, we check the labels in a batch of our training data.

# grab the labels from the first batch of the training data
a = list(data.as_numpy_iterator())[0]
labels = a[1].values()
# count how many articles in this batch are real (0) and fake (1)
print(sum(list(labels)[0] == 0))
print(sum(list(labels)[0] == 1))
# Compute the base rate: roughly 47% of the articles are real
47 / 100
[44]
[56]
0.47

We can see that the base rate is about 47%, which is close to 50%: a model that always makes the same guess would be right only about half of the time, so our classifier needs to do substantially better than that.
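Since the counts above come from a single shuffled batch of 100 rows, a slightly more reliable way to estimate the base rate is to look at the label frequencies over the whole training dataframe. Here is a minimal sketch using the df we already loaded:

# Label frequencies over the full training data;
# the frequency of the majority class is the accuracy of always guessing that class
df["fake"].value_counts(normalize = True)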

§3. Create Models

Now, we are going to create our TensorFlow models to address the following question:

When detecting fake news, is it most effective to focus on only the title of the article, the full text of the article, or both?

In order to answer the above question, we will create three TensorFlow models.

  1. In the first model, we use only the article title as input.

  2. In the second model, we use only the article text as input.

  3. In the third model, we use both the article title and the article text as input.

Let's train each model until it appears to be fully trained, and then compare their performance! We will need to use the Functional API, rather than the Sequential API, for this modeling task.
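The Functional API is needed because our third model takes two inputs, which the Sequential API cannot express. As a generic illustration of the pattern (a toy example with made-up layer sizes, not our actual fake news model), it looks like this:

# Minimal Functional API pattern: inputs -> layer graph -> Model
from tensorflow.keras import Input, layers, Model

toy_a = Input(shape = (4,), name = "a")
toy_b = Input(shape = (4,), name = "b")
toy_hidden = layers.concatenate([toy_a, toy_b])
toy_output = layers.Dense(1)(toy_hidden)
toy_model = Model(inputs = [toy_a, toy_b], outputs = toy_output)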

Standardization and Vectorization

First, we standardize the text by converting it to lowercase and removing punctuation. Then, we vectorize the text by representing it as a sequence of integers (an array or tensor of word indices).

# Import necessary packages
import string
import re
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras import Input, layers, Model, utils, losses

# set size_vocabulary equal to 2000
size_vocabulary = 2000

# define standardization function
def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation),'')
    return no_punctuation

# set vectorize layer
vectorize_layer = TextVectorization(
    standardize=standardization,
    max_tokens=size_vocabulary, # only consider this many words
    output_mode='int',
    output_sequence_length=500) 

# adapt the vectorization layer to the vocabulary of the titles and texts
# (a second call to adapt would reset the layer, so we adapt once on the joined text)
vectorize_layer.adapt(train_dataset.map(lambda x, y: tf.strings.join([x["title"], x["text"]], separator = " ")))

Construct Model

Now, we’ve successfully prepared our data, and it’s time to construct our models! The first step is to specify the two kinds of keras.Input for our models.

# specify input for title
title_input = Input(
    shape = (1,), 
    name = "title",
    dtype = "string"
)
# specify input for text
text_input = Input(
    shape = (1,), 
    name = "text",
    dtype = "string"
)

Hidden Layers

Now, we are going to write a pipeline each for the titles and the texts. Since titles and texts are different pieces of writing but share a similar vocabulary, we are going to use a shared embedding layer for both inputs.

# Use a shared embedding layer for both title and text
embedding = layers.Embedding(size_vocabulary, 16, name = "embedding")

# First, we write a pipeline for the title
title_features = vectorize_layer(title_input)
title_features = embedding(title_features)
title_features = layers.Dropout(0.2)(title_features)
title_features = layers.GlobalAveragePooling1D()(title_features)
title_features = layers.Dropout(0.2)(title_features)
title_features = layers.Dense(32, activation='relu')(title_features)

# Next, we write a pipeline for the text
text_features = vectorize_layer(text_input)
text_features = embedding(text_features)
# Here I increased the first dropout rate to 0.75 to reduce
# overfitting when training model2
text_features = layers.Dropout(0.75)(text_features)
text_features = layers.GlobalAveragePooling1D()(text_features)
text_features = layers.Dropout(0.5)(text_features)
text_features = layers.Dense(32, activation='relu')(text_features)

# output layers for model 1
title_output = layers.Dense(2, name = "fake")(title_features)

# output layers for model 2
text_output = layers.Dense(2, name = "fake")(text_features)

# output layers for model 3
main = layers.concatenate([title_features, text_features], axis = 1)
main = layers.Dense(32, activation='relu')(main)
model3_output = layers.Dense(2, name = "fake")(main)
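Before moving on, it can help to sanity-check the shapes flowing through the shared embedding: the vectorized title is a batch of 500 integer tokens, the embedding turns each token into a 16-dimensional vector, and global average pooling collapses the sequence dimension. A quick illustrative check, reusing the layers defined above:

# Rough shape check on a made-up title
tokens  = vectorize_layer(tf.constant([["Some made up headline"]]))  # (1, 500)
vectors = embedding(tokens)                                          # (1, 500, 16)
pooled  = layers.GlobalAveragePooling1D()(vectors)                   # (1, 16)
print(tokens.shape, vectors.shape, pooled.shape)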

Now, it’s time to create our models! To save time, I’ve enclosed the model compilation and training process in a function called training_model, which we can reuse for model1, model2, and model3.

from matplotlib import pyplot as plt
# enclosed training process in a function to 
# repeatedly use in model 1,2,3
def training_model(input, output):
  # create model
  model = Model(
      inputs = input,
      outputs = output
  )

  # Compile the model
  model.compile(optimizer = "adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
              )
    
  # train the model
  history = model.fit(train_dataset, 
                      validation_data=val_dataset,
                      epochs = 50, 
                      verbose = False)
  
  # plot the training and validation accuracy
  plt.plot(history.history["accuracy"])
  plt.plot(history.history["val_accuracy"])

  # return result
  return model, history

Model 1: Train Using Title

model1, history1 = training_model(title_input, title_output)
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:559: UserWarning: Input dict contained keys ['text'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)

[Plot: training and validation accuracy for model 1]

# Visualize the structure of our model1
tf.keras.utils.plot_model(model1)

[Figure: model1 architecture diagram]

# Output the maximum validation accuracy of model1
round(max(history1.history["val_accuracy"]),5)
0.98978

From the above result, we can see that the validation accuracy of model 1 reached around 98.978%, which is not bad! However, we can do even better with our next models!

Model 2: Train Using Text

model2, history2 = training_model(text_input, text_output)
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:559: UserWarning: Input dict contained keys ['title'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)

[Plot: training and validation accuracy for model 2]

# Visualize the structure of our model2
tf.keras.utils.plot_model(model2)

[Figure: model2 architecture diagram]

# Output the maximum validation accuracy of model2
round(max(history2.history["val_accuracy"]),5)
0.99

Our model2 reached 99% validation accuracy, which is slightly better than model1! Let's now move on to model3 and see if the accuracy can be improved further!

Model 3

model3, history3 = training_model([title_input, text_input], model3_output)

[Plot: training and validation accuracy for model 3]

# Visualize the structure of our model3
tf.keras.utils.plot_model(model3)

[Figure: model3 architecture diagram]

# Output the maximum validation accuracy of model3 (rounded to the nearest integer)
round(max(history3.history["val_accuracy"]))
1

By comparing model1, model2, and model3, we conclude that model3 is the best model: it consistently obtains at least 99% accuracy on the validation set, and its best validation accuracy rounds to 100%!
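To make the comparison concrete, we can print the best validation accuracy of each model side by side, using the three history objects from above:

# Compare the best validation accuracy of the three models
for name, history in [("title only", history1),
                      ("text only", history2),
                      ("title + text", history3)]:
  print(name, round(max(history.history["val_accuracy"]), 5))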

§4. Model Evaluation

In this part, we are going to test our model's performance on an unseen test data set. We will focus on our best model, model3, and ignore model1 and model2.

Now, we first download the test data from the following URL:

test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true"

We convert this test data using the make_dataset function we created in §2.

test_data = pd.read_csv(test_url)
test_dataset = make_dataset(test_data)

Then, we evaluate model3 on the resulting test dataset:

model3.evaluate(test_dataset)
225/225 [==============================] - 2s 9ms/step - loss: 0.0195 - accuracy: 0.9950
[0.019546041265130043, 0.9949663877487183]

According to the above result, we are able to correctly detect and classify fake news with about 99.5% accuracy on the test set.
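If we want to inspect individual predictions rather than just the aggregate accuracy, a minimal sketch looks like this: the model outputs two logits per article, and taking the argmax gives the predicted class (0 for real, 1 for fake).

# Inspect predictions on the first batch of the test set
for inputs, outputs in test_dataset.take(1):
  logits = model3.predict(inputs)                # two logits per article
  preds = np.argmax(logits, axis = 1)            # 0 = real, 1 = fake
  print(preds[:10])                              # predicted labels
  print(outputs["fake"].numpy().flatten()[:10])  # true labels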

§5. Embedding Visualization

In this last section, we are going to take a look at the embedding learned by our model!

# get the weights from the embedding layer
weights = model3.get_layer('embedding').get_weights()[0]

# get the vocabulary from our data prep for later
vocab = vectorize_layer.get_vocabulary()

# import PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
weights = pca.fit_transform(weights)

embedding_df = pd.DataFrame({
    'word' : vocab,
    'x0' : weights[:,0],
    'x1' : weights[:,1]
    })
import plotly.express as px 
fig = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 size = list(np.ones(len(embedding_df))),
                 size_max = 5,
                 hover_name = "word")
# show the figure
fig.show()

Everything looks great! We’ve successfully visualized our embedding. From the above visualization, we find that some words sit far away from the center of the plot: for example, trump, obamas, myanmar, rohingya, reportedly, and gop.
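We can also look up the 2D coordinates of specific words directly in embedding_df, for example the words we just noticed (only words that made it into the 2000-word vocabulary will appear):

# Look up the PCA coordinates of a few words of interest
words_of_interest = ["trump", "obamas", "myanmar", "rohingya", "reportedly", "gop"]
embedding_df[embedding_df["word"].isin(words_of_interest)]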

Whenever we create a machine learning model, it is our responsibility to understand its limitations and biases. Now, let's find out what kinds of words our model associates with democrat and republican.

democrat = ['democrat']
republican = ['republican']

highlight_1 = ["liberal", "biden", "left", "equality"]
highlight_2 = ["trumps", "fake", "conservative", "energy"]

def mapper(x):
    if x in democrat:
        return 1
    elif x in republican:
        return 4
    elif x in highlight_1:
        return 3
    elif x in highlight_2:
        return 2
    else:
        return 0

embedding_df["highlight"] = embedding_df["word"].apply(mapper)
embedding_df["size"]      = np.array(1.0 + 50*(embedding_df["highlight"] > 0))

fig2 = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 color = "highlight",
                 size = list(embedding_df["size"]),
                 size_max = 20,
                 hover_name = "word")

fig2.show()

By observing the above plot, we found that:

  1. The word republican is far away from the word liberal.

  2. The word trump is far away from the center of the plot.

  3. The word democrat is closer to the center of the plot.

  4. The word biden is close to the word energy.

One possible explanation is that Biden, for example, has paid a lot of attention to new energy during his presidency, so biden and energy tend to appear together in news text. Likewise, republican and democrat represent two different parties in the United States that likely hold different views on many topics, which is reflected in the words that appear near them.

Written on March 2, 2022