__author__ = "Chris Tran"
__email__ = "tranduckhanh96@gmail.com"
__website__ = "chriskhanhtran.github.io"
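To detect emerging trends in food consumption from social media data, I employ two simple methods: counting how often each ingredient is mentioned each month, and building monthly co-occurrence matrices of ingredients, from which Lift and PPMI are computed.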
First, let's load libraries and data that we are going to use in this project.
import os
import re
import sys
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from wordcloud import WordCloud, STOPWORDS
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
%%time
# Load data
months = list(range(1, 13))
years = list(range(2011, 2016))
data = []
data_index = []
for year in tqdm(years):
    for month in months:
        pathname = f'facebook_posts/fb{year}/fpost-{year}-{month}.csv'
        data_idx = f'{year}-{month}'
        with open(pathname, encoding='utf-8') as f:
            text = f.readlines()
        data.append(text)
        data_index.append(data_idx)
def text_preprocessing(s):
    """
    Preprocess text:
    - Lowercase the string
    - Replace punctuation with spaces
    - Replace newlines with spaces
    - Collapse repeated whitespace and strip trailing whitespace
    """
    # Lowercase the string
    s = s.lower()
    # Replace punctuation with spaces
    s = re.sub(r'[^\w\s]', ' ', s)
    # Replace newlines with spaces
    s = re.sub(r'\n', ' ', s)
    # Collapse repeated whitespace and strip trailing whitespace
    s = re.sub(r'\s+', ' ', s).rstrip()
    return s
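As a quick sanity check, the sketch below repeats the same steps on a made-up post (the sample string is an assumption, not taken from the dataset) to show what the cleaned text looks like.

```python
import re


def text_preprocessing(s):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    s = s.lower()
    s = re.sub(r'[^\w\s]', ' ', s)
    s = re.sub(r'\n', ' ', s)
    s = re.sub(r'\s+', ' ', s).rstrip()
    return s


# Hypothetical post, made up for illustration
sample = "Grandma's Pumpkin-Pie recipe!!\n"
cleaned = text_preprocessing(sample)
print(cleaned)  # grandma s pumpkin pie recipe
```

Note the stray `s` left behind by the apostrophe. Fragments like this are presumably why single letters such as `s` and `t` appear in the stop-word list used for the ingredients.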
ingredients.txt is a list of more than 1,000 dishes. We will perform some text preprocessing to extract unique ingredients from this list and remove several stop words that are not ingredients. After these processing steps, we have a list of 940 unique ingredients.
# Load ingredient list
with open('facebook_posts/ingredients.txt', encoding='utf-8') as f:
    ingredients = f.readlines()
# Preprocess ingredients
ingredients = list(map(text_preprocessing, ingredients))
# Make a list of unique ingredients
ingredients = list(set([word for item in ingredients for word in item.split()]))
# Sort ingredients alphabetically
ingredients = sorted(ingredients)
# Remove stop words
stopwords = ['and', 'de', 'of', 's', 'new', 't', 'food']
for stopword in stopwords:
    ingredients.remove(stopword)
# Create word-index dictionary
word2Idx = {w: i for (i, w) in enumerate(ingredients, 0)}
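The vocabulary-building steps can be illustrated on a toy list. This is a minimal sketch: the dish names and the one-word stop list are hypothetical, and a set difference is used in place of `list.remove` so that a stop word missing from the vocabulary does not raise an error.

```python
# Hypothetical, already-preprocessed dish names (not from ingredients.txt)
dishes = ["pumpkin pie", "apple pie", "mashed potatoes and gravy"]
stopwords = {"and"}

# Unique words across all dishes, minus stop words, sorted alphabetically
vocab = sorted({word for dish in dishes for word in dish.split()} - stopwords)

# Map each word to its row/column index
word2idx = {w: i for i, w in enumerate(vocab)}

print(vocab)            # ['apple', 'gravy', 'mashed', 'pie', 'potatoes', 'pumpkin']
print(word2idx["pie"])  # 3
```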
As mentioned in the introduction section, below is the implementation of the two methods I use to detect food trends:
def count_ingredients(data, ingredients):
    """
    @params data (list): list of lists of monthly Facebook posts
    @params ingredients (list): list of 940 ingredients
    @return ingredient_counts_all (np.array): np.array with shape (940, 60)
    """
    ingredient_counts_all = []
    for monthly_data in tqdm(data):
        # Preprocess text and make a list of all words
        all_words = text_preprocessing(' '.join(monthly_data)).split()
        # Word counts
        word_counts = dict(nltk.FreqDist(all_words))
        # Ingredient counts
        ingredient_counts = pd.Series(index=ingredients, data=np.zeros(len(ingredients)))
        for word in word_counts.keys():
            if word in ingredients:
                ingredient_counts[word] = word_counts[word]
        ingredient_counts_all.append(ingredient_counts.to_numpy())
    # Stack monthly vectors and transpose so rows are ingredients, columns are months
    ingredient_counts_all = np.stack(ingredient_counts_all).T
    return ingredient_counts_all
ingredient_counts = count_ingredients(data, ingredients)
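To make the shape of the result concrete, here is a minimal sketch of the same counting idea on a toy corpus. The posts and the five-word ingredient list are made up, and `collections.Counter` stands in for `nltk.FreqDist` to keep the example dependency-light.

```python
from collections import Counter

import numpy as np

# Hypothetical toy data: two "months" of already-cleaned posts
toy_data = [
    ["pumpkin pie tonight", "apple pie and pumpkin soup"],
    ["cauliflower rice bowl", "rice and more rice"],
]
toy_ingredients = ["apple", "cauliflower", "pie", "pumpkin", "rice"]

columns = []
for monthly_posts in toy_data:
    counts = Counter(" ".join(monthly_posts).split())
    # Counter returns 0 for words that never appear
    columns.append([counts[w] for w in toy_ingredients])

# Rows are ingredients, columns are months
toy_counts = np.array(columns).T
print(toy_counts.shape)  # (5, 2)
print(toy_counts)
```

Each row is then a monthly time series for one ingredient, which is what the plotting helper below reads.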
def plot_ingredient_counts(ingredient):
    """
    Plot the trend of an ingredient from 01-2011 to 12-2015
    @params ingredient (str): name of an ingredient
    """
    series = pd.Series(ingredient_counts[word2Idx[ingredient], :], index=data_index)
    plt.figure(figsize=(10, 5))
    plt.plot(series)
    plt.xticks([0, 12, 24, 36, 48, 59])
    plt.title(f'Trend of "{ingredient}" from 01-2011 to 12-2015')
    plt.show()
This method accurately detects the trend of pumpkin, which is mentioned most around Thanksgiving. However, it is very hard to use this method to detect new food trends such as cauliflower rice: in this analysis, cauliflower can be mentioned alongside any other ingredient, and its upward trend mostly tracks the increasing number of Facebook posts.
plot_ingredient_counts('pumpkin')
plot_ingredient_counts('cauliflower')
In the second method, I create a co-occurrence matrix of the 940 ingredients for every month from 01-2011 to 12-2015. Then I stack these 60 matrices into a 3-dimensional array of shape (60, 940, 940). For each month, I look at the 50 most mentioned ingredient combinations to see whether any interesting combination stands out. Then I examine the trend of these combinations over time to see whether there are abrupt changes.
def cooccurrence_symmetric_window(sentlist, word2Idx, weights):
    """
    Construct co-occurrence matrix from the text data.
    @param sentlist (list[str]): list of preprocessed strings.
    @param word2Idx (dict): {word: index} dictionary of relevant vocabulary.
    @param weights (numpy.array): array of weights to specify window size
    and weight for each word in the window.
    @return res (numpy.array): co-occurrence matrix of words in word2Idx.
    Output shape: (len(word2Idx), len(word2Idx))
    """
    # Window size is the number of weights
    m = len(weights)
    # Specify vocabulary size
    V = len(word2Idx)
    # Construct co-occurrence matrix
    cooc = np.zeros((V, V), np.float64)
    for sent in sentlist:
        # Preprocess sentence
        sent = text_preprocessing(sent)
        # Tokenize sentence
        words = word_tokenize(sent)
        n = len(words)
        for i in range(n):
            # Look only at the next m words to the right of position i
            end = min(n - 1, i + m)
            for j in range(i + 1, end + 1):
                if words[i] in word2Idx and words[j] in word2Idx:
                    cooc[word2Idx[words[i]], word2Idx[words[j]]] += weights[j - i - 1]
    # Necessary step since symmetry is exploited to save computation during construction
    res = cooc + cooc.T
    # Fill diagonal of co-occurrence matrix with 0
    np.fill_diagonal(res, 0.0)
    return res
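The windowing logic is easier to follow on a tiny example. The sketch below is a simplified stand-in, not the function above: it uses `str.split` instead of `word_tokenize`, gives every position in the window a weight of 1, and runs on two made-up sentences with a hypothetical three-word vocabulary.

```python
import numpy as np


def toy_cooccurrence(sentences, word2idx, window):
    """Count pairs of vocabulary words at most `window` tokens apart."""
    V = len(word2idx)
    cooc = np.zeros((V, V))
    for sent in sentences:
        words = sent.split()
        for i, left in enumerate(words):
            # Only look to the right; the transpose below adds the mirror image
            for right in words[i + 1:i + 1 + window]:
                if left in word2idx and right in word2idx:
                    cooc[word2idx[left], word2idx[right]] += 1
    # Each pair was counted once, in reading order; symmetrize and drop self-pairs
    res = cooc + cooc.T
    np.fill_diagonal(res, 0.0)
    return res


word2idx = {"cauliflower": 0, "pizza": 1, "rice": 2}
sentences = ["cauliflower rice and cauliflower pizza", "rice rice baby"]
toy = toy_cooccurrence(sentences, word2idx, window=2)
print(toy)
```

Here "cauliflower" and "rice" co-occur twice within the window, "cauliflower" and "pizza" once, and the repeated "rice rice" is discarded when the diagonal is zeroed.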
Now we will create a co-occurrence matrix of ingredients for each month and stack them together to make a 3D array with shape (60, 940, 940).
# Specify weights
weights = np.array([1, 1, 1, 1, 1])
# Construct co-occurrence matrices
cooc_data = []
for monthly_data in tqdm(data):
    cooc_matrix = cooccurrence_symmetric_window(monthly_data, word2Idx, weights)
    cooc_data.append(cooc_matrix)
# Stack outputs together
cooc_data = np.stack(cooc_data)
Let's save the output so that we will not have to re-compute the co-occurrence matrices in the future.
%%time
# Save co-occurrence matrix data
filename = 'cooc_data.npy'
np.save(filename, cooc_data)
# Load co-occurrence matrix data
cooc_data = np.load(filename)
With co-occurrence matrices, we can compute Lift and PPMI. These two metrics tell us how much more often two words occur together than we would expect by chance.
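For reference, the two metrics can be written as follows. The notation is introduced here for illustration only: $c_{ij}$ is the co-occurrence count of ingredients $i$ and $j$ in one month, $c_{i\cdot} = \sum_k c_{ik}$ is the marginal count of ingredient $i$, and $N = \sum_{i,j} c_{ij}$ is the total count, so that $P(i, j) = c_{ij} / N$ and $P(i) = c_{i\cdot} / N$.

```latex
\mathrm{Lift}_{ij} = \frac{P(i, j)}{P(i)\,P(j)} = \frac{N \, c_{ij}}{c_{i\cdot} \, c_{j\cdot}},
\qquad
\mathrm{PPMI}_{ij} = \max\bigl(0,\ \log_2 \mathrm{Lift}_{ij}\bigr)
```

A lift of 1 means the pair occurs exactly as often as independence would predict. PPMI is the base-2 logarithm of lift with negative values clipped to 0.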
def lift_ppmi(tcm, eps=1e-10):
    """
    Compute Lift and PPMI from co-occurrence matrix.
    @param tcm (np.array): 2D array of co-occurrence matrix.
    @param eps (float): a very small value to prevent dividing by zero
    @return lift (np.array): 2D array of Lift
    @return ppmi (np.array): 2D array of PPMI
    """
    # np.float was removed in NumPy 1.24; use np.float64 instead
    tcm = np.array(tcm, np.float64)
    marginal = np.sum(tcm, axis=0).reshape(-1, 1) + eps
    lift = np.sum(tcm) * tcm / (marginal @ marginal.T)
    # Keep log2(lift) where lift > 1, otherwise log2(1) = 0
    ppmi = np.log2(lift * (lift > 1) + (lift <= 1))
    return lift, ppmi
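A small worked example may help in reading these numbers. The 4x4 matrix below is hypothetical, and the PPMI line uses `np.maximum` rather than the masking expression above. The two forms give the same values.

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts for four ingredients
tcm = np.array([[0., 6., 2., 0.],
                [6., 0., 0., 0.],
                [2., 0., 0., 6.],
                [0., 0., 6., 0.]])

eps = 1e-10
marginal = tcm.sum(axis=0).reshape(-1, 1) + eps   # [8, 6, 8, 6]
lift = tcm.sum() * tcm / (marginal @ marginal.T)  # total count N = 28
ppmi = np.log2(np.maximum(lift, 1.0))             # clip lift below 1 to PPMI 0

print(round(lift[0, 1], 3))  # 3.5   -> 28 * 6 / (8 * 6)
print(round(lift[0, 2], 3))  # 0.875 -> 28 * 2 / (8 * 8)
print(ppmi[0, 2])            # 0.0, because lift <= 1
```

Pair (0, 1) occurs 3.5 times more often than chance would predict, while pair (0, 2) occurs slightly less often than chance and therefore gets a PPMI of 0.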
lift_data = []
ppmi_data = []
for month in tqdm(range(60)):
    cooc_matrix = cooc_data[month, :, :]
    lift, ppmi = lift_ppmi(cooc_matrix)
    lift_data.append(lift)
    ppmi_data.append(ppmi)
lift_data = np.round(np.stack(lift_data))
ppmi_data = np.round(np.stack(ppmi_data), 4)
Let's define functions to extract indices of the largest values in our data.
def largest_indices(ary, n):
    """
    This function returns the indices of the n largest values in a 2-dimensional array
    @param ary (np.array): numpy array to search
    @param n (int): number of top indices
    @return new_idx (list): a list of index tuples
    """
    flat = ary.flatten()
    # Take 2n candidates because each pair appears twice in a symmetric matrix
    indices = np.argpartition(flat, -2*n)[-2*n:]
    indices = indices[np.argsort(-flat[indices])]
    indices = np.unravel_index(indices, ary.shape)
    idx = [(indices[0][i], indices[1][i]) for i in range(2*n)]
    # Remove duplicated indices such as (i, j) and (j, i)
    new_idx = idx.copy()
    for i in range(len(idx)):
        for j in range(i+1, len(idx)):
            if sorted(idx[i]) == sorted(idx[j]):
                new_idx.remove(idx[j])
    return new_idx
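Because the matrices are symmetric with a zero diagonal, an alternative sketch (not the function above) is to rank only the entries above the diagonal, which avoids the duplicate-removal pass. The helper name `top_pairs_upper` and the toy matrix are hypothetical.

```python
import numpy as np


def top_pairs_upper(ary, n):
    """Return the n largest (row, col) pairs with row < col, largest first."""
    # Indices of the strict upper triangle, so each unordered pair appears once
    rows, cols = np.triu_indices(ary.shape[0], k=1)
    values = ary[rows, cols]
    order = np.argsort(-values, kind="stable")[:n]
    return [(int(rows[i]), int(cols[i])) for i in order]


toy = np.array([[0., 5., 1.],
                [5., 0., 3.],
                [1., 3., 0.]])
pairs = top_pairs_upper(toy, 2)
print(pairs)  # [(0, 1), (1, 2)]
```

This version sorts all upper-triangle entries, which is simple but does more work than `np.argpartition` when only a few top pairs are needed.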
def check_trend(time, n, matrix):
    """
    This function prints the values of the top n ingredient combinations
    @params time (str): 'yyyy-mm'
    @params n (int): Return n ingredient combinations
    @params matrix (np.array): 3D array (months, ingredients, ingredients) to check trend
    @return print a data frame of top n ingredient combinations
    """
    time_idx = data_index.index(time)
    matrix = matrix[time_idx, :, :]
    idx = largest_indices(matrix, n)
    foods_count = {}
    for i in idx:
        foods_count[f'{ingredients[i[0]]} - {ingredients[i[1]]}'] = matrix[i]
    foods_count = pd.DataFrame(pd.Series(foods_count), columns=['count']).reset_index()
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        print(foods_count)
Below is an example of the top 10 ingredient combinations mentioned in Facebook posts in November 2014, taken from the co-occurrence matrix. Ingredient combinations from the co-occurrence matrix are everyday foods and ingredients such as pepper salt, ice cream or chocolate cake. We can identify some foods that are special to November, such as pumpkin pie. However, it is hard to detect new trends from these results.
check_trend('2014-11', 10, cooc_data)
On the other hand, results from the lift data surface interesting combinations that are rarely seen, such as guinea fowl, chervil blini and bresaola fleur. These pairs show up approximately 20,000 times more often than expected by chance. The PPMI data tells essentially the same story. With the lift data, we can simply look at the top results in each month to identify new trends, and then verify these trends by plotting time series.
check_trend('2014-11', 10, lift_data)
check_trend('2014-11', 10, ppmi_data)
Guinea Fowl - Looks like a good alternative to turkey for Thanksgiving
Chervil Blini - Pile a little salmon roe and fresh chervil on top. Why not?
Bresaola Fleur - Do you want to try it?
Now we will plot time-series from 2011-1 to 2015-12 of some foods we discover to verify our findings.
def plot_trend(food, data=cooc_data):
    """
    @params food (str): name of a potential two-word food
    @params data (np.array): an array of 3D data to plot from
    @return plot the trend of `food` from 01-2011 to 12-2015
    """
    word1 = food.split()[0].lower()
    word2 = food.split()[1].lower()
    series = pd.Series(data[:, word2Idx[word1], word2Idx[word2]], index=data_index)
    plt.figure(figsize=(15, 5))
    plt.plot(series)
    plt.xticks([0, 12, 24, 36, 48, 59])
    plt.title(f'Trend of "{food}" from 01-2011 to 12-2015')
    plt.show()
Vegetable Noodle and Cauliflower Rice
There were spikes of Vegetable Noodle in January 2012 and from October 2014 to December 2014. This makes sense because shortly afterwards, in January 2015, Vogue (American edition) featured this trend in its Lifestyle section.
plot_trend('vegetable noodle', lift_data)
The trend of Cauliflower Rice is also visible in the plot below. The number of mentions of Cauliflower Rice started to increase in January 2014 and peaked in 2015.
plot_trend('cauliflower rice', lift_data)
Thanksgiving foods
The trends of foods eaten at Thanksgiving are clearly visible in the co-occurrence matrix data. In the graphs below, mentions of pumpkin pie, mashed potatoes, cranberry sauce and butternut squash peak around November and remain low in other seasons.
plot_trend('pumpkin pie', cooc_data)
plot_trend('mashed potatoes', cooc_data)
plot_trend('cranberry sauce', cooc_data)
plot_trend('squash butternut', cooc_data)
One interesting thing to notice in the graphs above is that different foods have different patterns, but each pattern repeats at the same time every year.
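One way to make this yearly repetition explicit, sketched here on made-up numbers rather than the real counts, is to reshape a 60-month series into a (5, 12) grid and average over the years. This assumes the series is ordered from January 2011 to December 2015, as in the loading loop.

```python
import numpy as np

# Hypothetical monthly counts for 5 years: a flat baseline with a November peak
series = np.full(60, 5.0)
series[10::12] = [40, 55, 60, 72, 90]  # index 10 within each year is November

by_year = series.reshape(5, 12)          # rows: years, columns: Jan..Dec
seasonal_profile = by_year.mean(axis=0)  # average value per calendar month
peak_month = int(np.argmax(seasonal_profile)) + 1

print(peak_month)  # 11
```

A food whose seasonal profile has a sharp single peak is seasonal rather than newly emerging, which is a useful distinction when reading the plots.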
Interesting Findings
Let's look at time-series plots of some foods we discovered in Section 2.
plot_trend('Guinea Fowl', lift_data)
Guinea Fowl spiked in June 2011 and November 2013. What was interesting then?
plot_trend('Chervil Blini', lift_data)
plot_trend('Bresaola Fleur', lift_data)
Interestingly, we picked up the very few times Chervil Blini and Bresaola Fleur showed up in our data. This indicates that after we discover interesting foods in our analysis, we have to validate them to make sure they are really emerging trends that are worth investing in. Plotting time-series data is a good way to validate our findings. Additionally, we can conduct surveys to understand how consumers perceive potentially new foods.
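One possible safeguard against such one-off pairs, offered here as an assumption rather than as part of the analysis above, is to require a minimum raw co-occurrence count before trusting a high lift value. The helper `filter_lift_by_support`, the threshold `min_count` and the numbers below are all hypothetical.

```python
import numpy as np


def filter_lift_by_support(lift, cooc, min_count):
    """Zero out lift values whose raw co-occurrence count is below min_count."""
    return np.where(cooc >= min_count, lift, 0.0)


# Hypothetical values for three ingredients in one month
cooc = np.array([[0., 1., 40.],
                 [1., 0., 3.],
                 [40., 3., 0.]])
lift = np.array([[0., 20000., 6.],
                 [20000., 0., 2.],
                 [6., 2., 0.]])

filtered = filter_lift_by_support(lift, cooc, min_count=5)
print(filtered)
```

In this toy case the pair with a lift of 20,000 is dropped because it was seen only once, while the pair seen 40 times keeps its more modest lift of 6.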
The word-count method is simple and straightforward, but its results are very noisy, making it hard to detect interesting combinations of ingredients being mentioned on social media. The co-occurrence matrix helps us observe the trends of specific ingredient combinations, but identifying a new trend takes a lot of time mining the results, because the co-occurrence matrix mostly surfaces popular foods.
The most effective way is to calculate Lift or PPMI from our co-occurrence matrices. These metrics tell us how much more often than by chance we observe a pair of ingredients showing up together in Facebook posts. With this method we can detect interesting and rare foods. However, we need to validate our findings by plotting time-series data or conducting surveys before putting these foods into production.