Saturday, July 5, 2014

Has Brazil Stopped Playing the Beautiful Game?

Brazil, the five-time winner of the World Cup, is known not only as a soccer powerhouse but also as the country that made soccer beautiful.  Players such as Pelé, Garrincha, and Ronaldo exemplified ‘o jogo bonito’, the beautiful game of dance-like moves, relentless teamwork, and fair play.

However, Brazil’s team in the 2014 World Cup has been accused of playing rough in order to win the cup.  So, today we will interrupt our usual articles about Twitter analytics to see if data can help us understand whether Brazil has stopped playing the beautiful game.

Although rough play has many facets, the simplest way to measure it is the number of fouls a team commits per match.  Using this metric, Brazil not only ranks as one of the roughest teams in the World Cup, it takes the top spot.  Just as surprising, four of the five teams with the most fouls are Latin American, a region known more for style than brawn.



The most vivid example of Brazil’s physical game was the match against Colombia, which the New York Times chronicled so wonderfully.  The match had a total of 54 fouls, the most of any game in this World Cup.  Although both teams were to blame, Brazil’s 31 fouls were the most committed by a single team in any match of this year’s championship.  Colombia’s 23 fouls were still significant, though, cracking the top 10% of single-team foul counts.

The match against Colombia might not have been a big deal if it weren't another data point in a disturbing trend.  In the match against Chile, Brazil once again seemed to rely more on brawn than beauty.  The team committed 28 fouls, tying Switzerland for the second-most fouls committed by a team in a single match of this World Cup.  Between these two matches, Brazil holds the top two single-team foul counts of the championship.


Some allege that Brazil has gotten away with a rougher style of play because referees tend to be lenient toward the home team.  However, the data seems to tell a different story.  Although it’s hard to measure how many fouls referees miss, we can see how willing referees were to give yellow cards to a team that was committing multiple fouls.  If we rank all the teams in the World Cup by the average number of fouls committed per yellow card received, Brazil sits near the bottom.  Brazil committed 6 fouls for every yellow card, less than half of the 15 fouls per yellow card for Algeria, the team that received the most leniency from the referees.
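
For anyone who wants to reproduce this ranking, here is a minimal sketch of the calculation.  The file name and column names (a hypothetical fouls.csv with one row per team per match) are my assumptions, not the actual data source:

```python
import pandas as pd

# Hypothetical per-match data: one row per team per match.
# Columns: team, match_id, fouls, yellow_cards
matches = pd.read_csv("fouls.csv")

per_team = matches.groupby("team")[["fouls", "yellow_cards"]].sum()

# Fouls committed for every yellow card received; a higher ratio
# suggests the referees were more lenient with that team.
per_team["fouls_per_yellow"] = per_team["fouls"] / per_team["yellow_cards"]

print(per_team.sort_values("fouls_per_yellow", ascending=False))
```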


Brazil is an outstanding team, the favorites to win it all according to many experts.  Unless they clean up their game, Brazil might win the cup but not the hearts of soccer fans around the world.  For a team known for playing a beautiful game, this would be an ugly shame.

Thursday, July 25, 2013

How Well Can We Predict Guy Kawasaki’s Top 5 Tweets of the Day?

My goal when I started this blog was to create a Twitter account that would share a small number of Guy Kawasaki’s best tweets instead of the huge volume of tweets I would receive if I followed him.  After 4 months of work, @TT_GuyKawasaki, a Twitter account that retweets Guy Kawasaki’s top 5 tweets of the day, is finally live!  Because I just started running the account, I don’t know how well it will predict Kawasaki’s best tweets.  However, we can run the model on historical data and see how well it performs.  Even though I am initially using a simple linear regression, the model predicts whether a tweet is one of the top 5 of the day with 95% accuracy.  It looks like we’re off to a good start!

For the first version of the prediction model, I chose a simple linear regression.  Although I could have used several different models to predict Kawasaki’s tweets, I started with the simple linear model because of its good accuracy and ease of implementation.  In the future I will switch to other models if I find that they increase prediction accuracy enough to justify the change.
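
To make the approach concrete, here is a minimal sketch of how such a model could be set up with scikit-learn.  The file name, column names, and the rule for picking the daily top 5 are my assumptions for illustration, not the exact implementation behind @TT_GuyKawasaki:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset: one row per tweet, with an early retweet count
# and the final retweet count 12 hours after posting.
tweets = pd.read_csv("tweets.csv")  # columns: date, rt_35min, rt_final

# Simple linear regression: predict the final retweet count
# from the count observed 35 minutes after posting.
model = LinearRegression()
model.fit(tweets[["rt_35min"]], tweets["rt_final"])
tweets["rt_predicted"] = model.predict(tweets[["rt_35min"]])

# Flag a tweet as a "top tweet" if its predicted final count
# ranks among the 5 highest of that day.
tweets["predicted_top5"] = (
    tweets.groupby("date")["rt_predicted"]
    .rank(ascending=False, method="first") <= 5
)
```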

A good way to measure the prediction model’s accuracy is to determine the percentage of tweets that were predicted correctly and incorrectly.  The tweets that were predicted incorrectly are then split into false positives and false negatives.  False positives are tweets that were classified as top tweets even though they weren’t, and false negatives are tweets that should have been classified as top tweets but were not.  The chart below shows the percentage of tweets that were correctly classified in green, false positives in yellow, and false negatives in red.  The bars show the model’s accuracy when it uses the number of retweets 5, 15, 25, and 35 minutes after the tweet is posted to predict whether the tweet is a top tweet.
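
Continuing the sketch above, the accuracy breakdown can be computed by comparing the predicted labels against the actual top-5 labels.  Again, the column names are assumptions used for illustration:

```python
# Actual top tweets: the 5 tweets with the highest final retweet count each day.
tweets["actual_top5"] = (
    tweets.groupby("date")["rt_final"]
    .rank(ascending=False, method="first") <= 5
)

correct = (tweets["predicted_top5"] == tweets["actual_top5"]).mean()
false_pos = (tweets["predicted_top5"] & ~tweets["actual_top5"]).mean()
false_neg = (~tweets["predicted_top5"] & tweets["actual_top5"]).mean()

print(f"correct: {correct:.0%}, "
      f"false positives: {false_pos:.0%}, "
      f"false negatives: {false_neg:.0%}")
```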


Overall, the model does a very good job of predicting the top tweets correctly.  Using the number of retweets 5 minutes after the tweet was posted, the model predicted 91% of the tweets correctly.  This means that the model will incorrectly classify a tweet only about once every two days.  If the model uses the number of retweets 35 minutes after the tweet was posted, the accuracy increases to 95%, or only one mistake every four days.

The results of the model look so good that it is worth raising the bar.  Instead of judging the model’s accuracy by its ability to classify all tweets, let’s take a look at how good it is at classifying only the top tweets.  The chart below shows the number of tweets that were predicted correctly, and those that weren’t, divided into false positives and false negatives, when we use the number of retweets 5, 15, 25, and 35 minutes after the tweet is posted for prediction.


Once we zoom in on only the top tweets, the chart doesn’t make the model look as good as the previous one.  If we use the number of retweets 5 minutes after the tweet was posted for prediction, the model correctly identifies 75% of the top tweets.  It also falsely flags other tweets as top tweets, amounting to an additional 25% of the number of top tweets.  If we use the number of retweets 35 minutes after the tweet was posted, the percentage of correctly identified top tweets increases to 84%, and false positives decrease to only 9% of all top tweets.  This more nuanced view of the model’s performance will be a better baseline for comparing other models in the future.

This new view of the data also exposes an interesting trend in how false positives and false negatives change the longer we wait to sample the number of retweets.  Although false positives keep decreasing the longer we wait, false negatives seem to plateau after 25 minutes.  This appears to corroborate the conclusion we reached in a previous blog post that the optimal amount of time to wait before predicting retweets is 25 minutes.

These results do come with caveats.  First, we are using historical data in this analysis.  Although history can repeat itself, I am not sure whether historical data will be representative of future behavior.  Second, we used the same data to create the model and to evaluate its predictions.  Using the same dataset for both purposes gives us a bit of an unfair advantage.  A better test would be to build the model on one dataset and apply it to another.  Because I don’t have much data yet, that kind of comparison isn’t possible at this point.

Even with these caveats, I think this analysis is a very good start toward understanding how well the @TT_GuyKawasaki account will predict Kawasaki’s top tweets.  Stay tuned; in a few weeks we will perform a similar analysis with the data collected from actually running the account.

Wednesday, July 17, 2013

At what time should Guy Kawasaki tweet?

The band I played in back in college had a love-hate relationship with song requests.  We loved requests for the few songs we knew, and hated all the rest.  I have a similar love-hate relationship with requests in this data-driven blog world that I inhabit.  I only love requests if I have the data.  So, imagine my dorky giddiness when my friend Mary Liu asked me whether the time of day had any effect on the popularity of a tweet, a question for which I actually had the data!  Mary’s hunch proved to be correct.  Tweets posted by Guy Kawasaki between 9 – 10 a.m. and 7 – 8 p.m. PDT had a higher probability of being retweeted.  On the other hand, the day of the week when a tweet was posted didn’t influence a tweet’s popularity at all.

To determine whether the time of day when Kawasaki posts a tweet affects its popularity, we need to compare the average number of retweets for tweets posted during different hours of the day.  The chart below shows the average number of retweets1 for tweets posted at different hours of the day, using a two-week sample of Kawasaki's tweets.  The horizontal line marks the average number of retweets for the whole sample, 5.8 retweets.  Although some hours seem to have more or fewer retweets than the average, not all of them are significantly different.  The only hours that proved to be significantly different from the average were 9 a.m., 10 a.m., and 7 p.m.2.
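
For the curious, here is a rough sketch of this comparison.  It assumes a hypothetical dataset with one row per tweet, the posting hour, and the retweet count 35 minutes after posting; the one-sample t-test against the overall mean is a stand-in for whatever significance test was actually used:

```python
import pandas as pd
from scipy import stats

tweets = pd.read_csv("tweets.csv")  # columns: hour (0-23, PDT), rt_35min

overall_mean = tweets["rt_35min"].mean()

# For each hour of the day, compare that hour's retweet counts
# against the overall average.
for hour, group in tweets.groupby("hour"):
    t_stat, p_value = stats.ttest_1samp(group["rt_35min"], overall_mean)
    print(f"{hour:02d}:00  mean={group['rt_35min'].mean():.1f}  p={p_value:.3f}")
```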


Tweets posted between 9 – 10 a.m. had the most retweets, with 72% more retweets than the average.  This result is not totally surprising.  I assume that many of Kawasaki’s followers live in the US.  At 9 a.m. PDT, people on the West Coast are checking their feeds first thing in the morning, and people on the East Coast are checking their feeds as they leave for lunch at noon.

On the other hand, tweets posted between 10 – 11 a.m. had 36% fewer retweets than the average.  Why popularity would drop so quickly after the 9 a.m. spike is puzzling.  Maybe the hour between 9 – 10 a.m. is truly a critical window for people to popularize tweets, after which they quickly stop viewing their feeds and retweeting content.

The third set of tweets that showed a significant difference in retweets were those posted between 7 – 8 p.m.  These tweets had 46% more retweets than the average.  The reason behind the increased popularity during this window could be similar to the explanation for the 9 a.m. spike: people on the West Coast might be checking their Twitter feeds at the end of the workday, while people in the rest of the country might be checking their feeds as the day comes to an end.

We can use the same methodology described above to determine whether the popularity of tweets changes across days of the week.  The chart below shows the average number of retweets1 for tweets posted on different days of the week.  Even though it looks like Sunday and Monday are the best days for tweets and Friday is the worst, none of these differences are statistically significant.  So, apparently the day on which you post a tweet doesn’t have any effect on its popularity.


This analysis rests on a strong assumption: that Kawasaki doesn’t deliberately tweet better content at certain times of day.  If this assumption were not true, the effects we found might be due to Kawasaki’s posting choices rather than the time of day or day of the week.  Kawasaki has been pretty open about his tweeting strategy and hasn’t mentioned varying his content by time of day, so I think it’s fair to assume that he doesn’t tweet different quality content at different times.

Even though we did find evidence for a time-of-day effect on a tweet's popularity, it is worth noting that more than 85% of the day had no effect.  So even though a few hours of the day might boost a tweet's popularity, there are very few bad times to post a tweet.



1 The chart shows the average number of retweets 35 minutes after a tweet was posted.  As discussed in a previous post, about two thirds of Kawasaki’s tweets go missing, so using the final number of retweets would reduce our sample by about 66%.  In addition, we’ve seen that the number of retweets a few minutes after a tweet is posted can predict the final number of retweets very accurately.

2 Tweets that were posted between 9 – 10 a.m., 10 – 11 a.m., and 7 – 8 p.m. were significantly different from the mean at p < .01, p < .05, and p < .1 respectively. 

Sunday, June 30, 2013

There’s More Than One Way to Predict Guy Kawasaki’s Retweets

Like many things in life, such as tying a tie, cooking an egg, and skinning a cat, there are different ways to predict the final number of retweets for one of Guy Kawasaki’s tweets.  However, unlike those three tasks, predictions come with objective ways to determine which method is best.  In this post I’ll compare three different statistical tools (simple regression, the lasso or L1-regularized regression, and random forests) to determine which is best at predicting retweets.  I found that the random forest predicted the data best, with an average error of only 19%.  The lasso and the simple regression trailed behind, with average errors of 46% and 36% respectively.

The chart below shows how well each of the tools performed by graphing the actual number of retweets of each tweet on the x-axis and the predicted number of retweets on the y-axis.  Each tweet is color-coded green, yellow, or red depending on the accuracy of its prediction.



The simple regression is our baseline tool and uses the number of retweets 25 minutes after Kawasaki posts a tweet to predict the tweet’s final number of retweets.  I picked the 25-minute mark because in a previous post we found that 25 minutes was the optimal time to sample.  The regression does a pretty good job of predicting the data: on average, each prediction has an error of only 36%1.  The graph above reflects this, showing mostly green and yellow dots and only a few red ones.
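
As a reference, a minimal version of this baseline in scikit-learn (with assumed file and column names) could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

tweets = pd.read_csv("tweets.csv")  # columns include rt_25min and rt_final

X = tweets[["rt_25min"]]
y = tweets["rt_final"]

baseline = LinearRegression().fit(X, y)
predictions = baseline.predict(X)

# Average error as a fraction of the actual final retweet count
# (assumes every tweet has at least one retweet after 12 hours).
error = np.mean(np.abs(predictions - y) / y)
print(f"average error: {error:.0%}")
```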

The second model we tried was the lasso.  The lasso shares some of a regression’s characteristics, such as assuming a linear relationship between variables.  However, the lasso usually does a better job of separating the signal from the noise in the data by selecting only the variables that are truly significant.  I threw a lot more data at the lasso than at the simple regression.  The following table shows all the data I used with the lasso (a brief code sketch follows the table):

Data | Sample frequency | Description
Number of retweets | 5, 15, 25 min. after posting | Retweets after a certain time
Difference in retweets | 5, 15, 25 min. after posting | Increase in retweets during a certain time
Audience size | 5, 15, 25 min. after posting | Number of users that follow those who retweeted the tweet
Difference in audience size | 5, 15, 25 min. after posting | Increase in the number of users that follow those who retweeted the tweet
Number of followers | At time of posting | Number of people following @GuyKawasaki when the tweet was posted
Source | At time of posting | Tool used to post the tweet
Date, week day, hour, minute, second | At time of posting | Time information on when the tweet was posted
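
A sketch of the lasso fit on these features might look like the following.  LassoCV picks the regularization strength by cross-validation, and the feature names are again assumptions standing in for the variables in the table above:

```python
import pandas as pd
from sklearn.linear_model import LassoCV

tweets = pd.read_csv("tweets_features.csv")

features = [
    "rt_5min", "rt_15min", "rt_25min",              # retweets after 5/15/25 minutes
    "rt_diff_5_15", "rt_diff_15_25",                # increase in retweets between samples
    "audience_5min", "audience_15min", "audience_25min",
    "followers_at_posting", "source", "hour", "weekday",
]
# One-hot encode the categorical features before fitting.
X = pd.get_dummies(tweets[features], columns=["source", "weekday"])
y = tweets["rt_final"]

lasso = LassoCV(cv=5).fit(X, y)

# Features with non-zero coefficients are the ones the lasso kept.
print(list(X.columns[lasso.coef_ != 0]))
```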

The lasso picked four of these variables as significant: the number of retweets 5, 15, and 25 minutes after the tweet was posted, and the difference between the number of retweets 25 and 15 minutes after the tweet was posted.  This result is surprising not because of what the lasso selected, but because of what it didn't select: there doesn't seem to be a strong relationship between the number of retweets and either the time of day a tweet is posted or the tool used to post it.  Judging by the chart above, the lasso predicted the final number of retweets about as well as the simple regression.  In fact, it did a little worse, with an average error of 46%.

The third tool I used is called a random forest.  Random forests differ from the simple regression and the lasso because instead of relying on linear relationships to predict the final number of retweets, they use a combination of decision trees.  A decision tree uses a series of conditional statements (for example, is the number of retweets 25 minutes after the tweet was posted greater or less than 10?) to reach a decision.  Random forests build many decision trees on random subsets of the data and combine the values produced by all the trees into a single prediction.  This tool worked much better than the other two, predicting the data with an average error of only 19%.
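
A corresponding random-forest sketch, reusing the assumed feature matrix X and target y from the lasso example above, could be as simple as:

```python
from sklearn.ensemble import RandomForestRegressor

# X and y as prepared in the lasso sketch above.
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Relative importance of each feature in the forest's predictions.
importances = sorted(zip(X.columns, forest.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, importance in importances:
    print(f"{name}: {importance:.3f}")
```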

Although the random forest doesn't tell us exactly how it made every decision, it does tell us how influential each variable was, as shown in the graph below.  The graph shows that the four most important variables are the same ones the lasso selected.  The random forest also finds the size of the audience the tweet was exposed to, and the number of followers at the time the tweet was posted, to be slightly influential.


All three tools proved helpful in predicting the number of retweets; none of them was completely off the mark.  However, which one you use might depend not only on accuracy but also on how easy it is to run in your application.  The regression and the lasso give you a simple equation to evaluate, while the random forest requires keeping the full fitted model around and running it in code.


1 The error is calculated as the average, over all data points, of the absolute difference between the predicted and actual number of retweets divided by the actual number of retweets (the mean absolute percentage error).

Tuesday, June 11, 2013

How Long Should We Wait to Predict Guy Kawasaki’s Tweets?

In a previous post, we saw that the number of retweets a tweet receives shortly after Guy Kawasaki posts it is a good predictor of the tweet's final number of retweets.  We also learned that the longer we waited to make a prediction, the more accurate we were at predicting the final number of retweets.  But how long should we wait to make a prediction?  In this post I will cover how we can make this decision by balancing two priorities: how long Twitter followers are willing to wait for a prediction, and how much accuracy we gain by waiting an additional minute.  We will see that waiting 25 minutes after a tweet is posted best balances Twitter users’ need for recency and prediction accuracy.

Twitter users are not willing to wait too long for a tweet.  Although good data on how frequently people check Twitter every day is not available at the time of writing, we can use Kawasaki’s reposting strategy as a reference.  Kawasaki reposts some of his tweets every 8 hours, so it is reasonable to assume that he believes people check their Twitter feeds about once every 8 hours.  Therefore, we can’t wait more than 8 hours to predict the final number of retweets, a pretty low bar as we will see later on.

In contrast to Twitter users’ interest in recency, my prediction model produces more accurate results the longer we wait to make a prediction.  However, accuracy doesn’t increase proportionally to how long we wait: the longer we wait, the less accuracy we gain for each additional minute.  To better understand these diminishing returns, I measured the accuracy of waiting 5, 15, 25, and 35 minutes using three different measures.

The first measure is the out-of-sample proportion of variance explained (PVE), which tells us how well we can predict the final number of retweets on data we haven’t seen yet.  Instead of computing a single number, we compute the PVE for different subsets of our data and take the average.  The second and third measures are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).  Much like the PVE, both criteria measure how well the model explains the data we want to predict, but they do so a little differently.  We will use the three measures to triangulate the diminishing returns of waiting to make a prediction.
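
As a rough illustration, the out-of-sample PVE can be estimated with cross-validated R², and the AIC and BIC come directly from an ordinary least squares fit.  The sketch below uses assumed column names rather than the exact code behind the charts:

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

tweets = pd.read_csv("tweets.csv")  # columns: rt_5min, rt_15min, rt_25min, rt_35min, rt_final
y = tweets["rt_final"]

for minutes in (5, 15, 25, 35):
    X = tweets[[f"rt_{minutes}min"]]

    # Out-of-sample proportion of variance explained (R^2) across folds.
    pve = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")

    # AIC and BIC from an ordinary least squares fit on the full sample.
    ols = sm.OLS(y, sm.add_constant(X)).fit()

    print(f"{minutes:>2} min: mean PVE={pve.mean():.2f}, "
          f"AIC={ols.aic:.0f}, BIC={ols.bic:.0f}")
```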

The graph below shows the results of the three measures for models that wait 5, 15, 25, and 35 minutes.  The graph shows the distribution of the out-of-sample PVE calculations (the orange boxes and whiskers) and the average PVE (the thick black lines in the boxes).  The table at the bottom shows the AIC and BIC values for each model.  We also include a null model, which simply predicts the mean final number of retweets.

The graph shows that as we wait longer to make a prediction, the average PVE for each model increases and the distribution of PVE values gets tighter.  However, the rate of improvement in average PVE decreases with time.  The average increase in PVE for waiting 15 minutes instead of 5 minutes is 73%, while the average increase for waiting 35 minutes instead of 25 minutes is just 6%.  So, it looks like we reach diminishing returns in accuracy at around 25 or 35 minutes.

The AIC and BIC values tell a similar story.  For both AIC and BIC, a lower value indicates a better model, so the table shows that the longer we wait, the more accurate we will be.  However, with AIC and BIC we do see an inflection point: the AIC and BIC for waiting 25 minutes are marginally lower than for waiting 35 minutes.  This inflection point matches the diminishing returns seen with the PVE.

So, the data has spoken and we have been lucky to find a Goldilocks answer.  Waiting 15 minutes is too little, waiting 35 minutes is too long, but 25 minutes is just right.  In future posts we will talk about how we can use more data and better tools to get even more accurate retweet predictions.

Monday, June 3, 2013

What are Guy Kawasaki’s Best Twitter Tools?

Guy Kawasaki has been very open about his tweeting strategy, so it’s no surprise that he openly shared which tools he uses to manage his social media.  His article outlined why he chose Facebook, Tweetdeck, Hootsuite, and GRATE, Kawasaki’s custom tool that posts content from Alltop.com, to post his tweets.  However, what Kawasaki didn’t cover was how effective each of these tools is at posting tweets that people like.  Analyzing two weeks’ worth of tweets, the data showed that Facebook is the most effective tool for Kawasaki, that Kawasaki is using more tools than he described in his article, and that the vast majority of Kawasaki's tweets come from Alltop.com.


I measured the success of a tweet by the number of retweets 35 minutes after the tweet was posted.  I chose 35 minutes because, as discussed in a previous post, this time period is a good predictor of the final number of retweets.  Also, using 35 minutes helps us get around Kawasaki’s missing-tweets problem, which would otherwise reduce our sample size by 64% in this case.  The graph below shows the distribution of retweets 35 minutes after posting for each Twitter tool.  The top of each box represents the 75th percentile of retweets, while the bottom represents the 25th percentile.  The horizontal lines in the boxes mark the average number of retweets.
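
For reference, the per-tool comparison boils down to a group-by and a box plot.  Here is a minimal sketch, assuming a hypothetical source column that holds the posting tool:

```python
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("tweets.csv")  # columns: source (posting tool), rt_35min

# Average retweets 35 minutes after posting, per tool.
print(tweets.groupby("source")["rt_35min"].agg(["count", "mean"]))

# Distribution of retweets per tool.
tweets.boxplot(column="rt_35min", by="source")
plt.show()
```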



The biggest surprise in the data is that tweets posted from Facebook outperform those posted by Kawasaki’s custom Twitter posting tool GRATE, which posts content from Alltop.com.  On average, tweets posted from Facebook receive 50% more retweets than those posted by GRATE.  The only tool with better average results than Facebook is Bufferapp.  However, because we found only 3 tweets posted with Bufferapp, this tool’s performance should be taken with a grain of salt.

A second surprise is that Kawasaki seems to be using more tools than the ones he shared in his article.  In the article, Kawasaki described how he used GRATE, Facebook, Tweetdeck, and Hootsuite.  However, the data shows that he is also using Bufferapp, a social media scheduler similar to Hootsuite.  Kawasaki also posted tweets through the Twitter sharing functionality of Paper.li, a custom newspaper site that created a page for Kawasaki.  The data also included one tweet posted through Twitter’s tweet button and one tweet posted using Mailchimp’s Twitter sharing functionality, demonstrating that if he finds something interesting he is willing to share it on the spot.

Third, I thought it was interesting to see that 85% of the tweets on Kawasaki’s account come from Alltop.com.  This percentage shows that Kawasaki’s account is mostly another channel for Alltop content.  Kawasaki’s personal comments and responses, in contrast, account for only 4% of the account’s tweets.  Promotional tweets posted through Hootsuite account for only 5%, so at least only a small fraction of the account’s tweets are advertising.

Finally, an interesting insight is the large variance in retweets for tweets posted through Hootsuite, the tool that, according to his article, Kawasaki uses for promotional tweets.  The difference between the 75th and 25th percentiles for Hootsuite is 3 times greater than the same spread for Facebook, the tool with the next largest spread.  This gap might be a sign that promotional tweets are binary: they either become incredibly popular or are mostly ignored by Kawasaki's followers.

The data above is just a snapshot of how these tools look today.  In an ever-changing technology landscape, the winners’ podium can change very easily.  Stay tuned as we check back in a few months to see whether Facebook keeps its crown as the best Twitter tool, or whether an existing or entirely new tool has taken the lead.

Sunday, May 19, 2013

Predicting Guy Kawasaki’s Retweets is (not that) Hard


Prediction doesn’t have to be hard.  We do it every day.  We become amateur meteorologists when we look out the window to help us predict the day’s weather.  This seemingly simple act is firmly based in reason.  A very good predictor of the weather in a few hours is the weather right now.

Similarly, a very good predictor of the number of retweets that a tweet will receive in a few minutes is the number of retweets it has right now.  In this post, we’ll see how the number of retweets a tweet receives soon after Guy Kawasaki posts it is a very good predictor of the final number of retweets the tweet will receive.

First, though, we need to agree on how long to wait before measuring the final number of retweets a tweet receives.  In the two-week sample I used for this analysis, I found that 95% of tweets stop receiving retweets within 12 hours of being posted.  So, the number of retweets a tweet has 12 hours after it is posted will be our benchmark for the final number of retweets.

The second question we need to answer is how good the number of retweets a tweet receives a few minutes after Kawasaki posts it is at predicting the tweet's final number of retweets.  A good way to visualize this relationship is to plot the number of retweets a tweet receives soon after posting against its final number of retweets.

The left graph below shows, for a two-week sample, the number of retweets each tweet received 5 minutes after Kawasaki posted it compared to the number of retweets it received 12 hours after posting.  The right graph below shows, for the same sample, the number of retweets 35 minutes after posting versus 12 hours after posting.  Both graphs include a blue best-fit line drawn through the middle of the points.
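
The scatter plots themselves are straightforward to reproduce.  Here is a hedged sketch for one panel, assuming the same hypothetical column names used in the posts above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("tweets.csv")  # columns: rt_35min, rt_final

plt.scatter(tweets["rt_35min"], tweets["rt_final"], alpha=0.5)

# Best-fit line through the points.
slope, intercept = np.polyfit(tweets["rt_35min"], tweets["rt_final"], 1)
xs = np.linspace(0, tweets["rt_35min"].max(), 100)
plt.plot(xs, slope * xs + intercept, color="blue")

plt.xlabel("Retweets 35 minutes after posting")
plt.ylabel("Retweets 12 hours after posting")
plt.show()
```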

Both graphs show that the number of retweets a tweet receives just a few minutes after posting is a good predictor of the tweet's final number of retweets.  We know that one is a good predictor of the other because the early retweet counts grow as the final number of retweets grows.  The main difference between the graphs is that the points on the right graph are closer to the blue best-fit line.

The closer the points are to the best-fit line, the better one variable is at predicting the other.  So if waiting longer moves the points closer to the line and improves our prediction of the final number of retweets, why don’t we just wait longer than 35 minutes?

A trade-off exists between the time we are willing to wait for a prediction of the total number of retweets and the prediction's accuracy.  For example, it would not be very helpful to wait 11 hours to predict the total number of retweets.  In the next post, we will discuss how long we’re willing to wait for a prediction, and how much prediction quality we really need, in order to show that predicting Kawasaki's retweets isn't really that hard.