
Twitter Virality Prediction Algorithm


#Based on a data set of tweets, I played around with Python to predict whether or not a tweet will go viral, based on tweet length, word count, and the number of followers, friends, hashtags, and links.


#You can find the original data in the random tweets zip file,

and if you have the necessary packages installed

you can see how I reached an accuracy of 92%.

Here is how I did it.


import pandas as pd

#Having a look at the data set structure

all_tweets = pd.read_json("random_tweets.json", lines=True)

print(len(all_tweets))

print(all_tweets.columns)

print(all_tweets.loc[0]['text'])

#Printing the first tweet's user location

print(all_tweets.loc[0]["user"]["location"])

 

#So how do we define a viral tweet? A good place to start is to look at the number of retweets the tweet has. This can be found using the feature `"retweet_count"`. Let's create a column called `is_viral` that is a `1` if the tweet had more than 5000 retweets and `0` otherwise.


import numpy as np

all_tweets['is_viral'] = np.where(all_tweets['retweet_count'] > 5000, 1, 0)

print(all_tweets['is_viral'].value_counts())
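
#The 5000-retweet cutoff above is a judgment call. A data-driven alternative is to label a tweet viral when its retweet count beats the dataset's own median; this is just a sketch, and is_viral_median is a hypothetical column name I didn't use in the run above.

median_retweets = all_tweets['retweet_count'].median()
all_tweets['is_viral_median'] = np.where(all_tweets['retweet_count'] > median_retweets, 1, 0)
print(all_tweets['is_viral_median'].value_counts())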

 

# Making Features: Now that we've created a label for every tweet in our dataset, we can begin thinking about which features might determine whether a tweet is viral. We can create new columns in our dataset to represent these features. For example, let's say we think the length of a tweet might be a valuable feature.


all_tweets['tweet_length'] = all_tweets.apply(lambda tweet: len(tweet['text']), axis=1)

all_tweets['followers_count'] = all_tweets.apply(lambda tweet: tweet['user']['followers_count'], axis=1)

all_tweets['friends_count'] = all_tweets.apply(lambda tweet: tweet['user']['friends_count'], axis=1)

all_tweets['hashtag_count'] = all_tweets.apply(lambda tweet: tweet['text'].count("#"), axis=1)

all_tweets['link_count'] = all_tweets.apply(lambda tweet: tweet['text'].count("http"), axis=1)

all_tweets['word_count'] = all_tweets.apply(lambda tweet: len(tweet['text'].split(" ")), axis=1)
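
#As an aside, the same features can be built without row-wise apply() calls by using pandas' vectorized accessors; a sketch that produces the same columns as above (note that .str.count() interprets its argument as a regex, which is harmless for '#' and 'http'):

all_tweets['tweet_length'] = all_tweets['text'].str.len()
all_tweets['followers_count'] = all_tweets['user'].str.get('followers_count')
all_tweets['friends_count'] = all_tweets['user'].str.get('friends_count')
all_tweets['hashtag_count'] = all_tweets['text'].str.count('#')
all_tweets['link_count'] = all_tweets['text'].str.count('http')
all_tweets['word_count'] = all_tweets['text'].str.split(' ').str.len()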

 

#Scaling the data so that features with large values, like followers_count, don't skew the distance calculations

from sklearn.preprocessing import scale

labels = all_tweets["is_viral"]

data = all_tweets[["tweet_length","followers_count","friends_count", "hashtag_count", "link_count", "word_count"]]

scaled_data = scale(data,axis=0)

print(scaled_data[0])
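
#KNN is distance-based, so without scaling a feature measured in the thousands (followers_count) would drown out one measured in single digits (hashtag_count). A quick sanity check, just a sketch: after scale() every column should have roughly zero mean and unit variance.

print(scaled_data.mean(axis=0)) #approximately 0 for each feature
print(scaled_data.std(axis=0)) #approximately 1 for each feature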

#Creating the training and test sets

from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(scaled_data, labels, test_size = 0.2, random_state = 1)
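
#With a cutoff as high as 5000 retweets one class is typically much rarer than the other, so a stratified split keeps the viral/non-viral ratio identical in both sets. This is a sketch of an alternative, not the split I used above; swapping it in will shift the numbers below slightly.

train_data, test_data, train_labels, test_labels = train_test_split(scaled_data, labels, test_size=0.2, random_state=1, stratify=labels)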

 

#Determining the ideal number of nearest neighbors k

from sklearn.neighbors import KNeighborsClassifier

scores=[]

for k in range(1,200):

    classifier = KNeighborsClassifier(n_neighbors=k)

    classifier.fit(train_data,train_labels)

    scores.append(classifier.score(test_data,test_labels))

 

import matplotlib.pyplot as plt

plt.plot(range(1,200),scores)

plt.show()
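
#Rather than eyeballing the plot, the best k can be pulled straight out of the scores list; a short sketch using the numpy import from earlier:

best_k = np.argmax(scores) + 1 #+1 because the loop starts at k=1
print(best_k, scores[best_k - 1])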

 

#Choosing the optimal number of nearest neighbors

classifier = KNeighborsClassifier(n_neighbors=4)

 

#Fitting the classifier and calculating accuracy on the training set

classifier.fit(train_data,train_labels)

print(classifier.score(train_data,train_labels))
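
#The score above is computed on the same data the model was trained on, so it can be optimistic. A sketch of the more honest check: accuracy on the held-out test set, plus a confusion matrix showing how the errors split across the two classes.

from sklearn.metrics import confusion_matrix

print(classifier.score(test_data, test_labels))
predictions = classifier.predict(test_data)
print(confusion_matrix(test_labels, predictions))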

