Recommendations Based on Pitchfork Written Reviews
Introduction
Setup
First, let’s make sure we’re in the correct working directory and have all the necessary packages imported. We’ll also read in the dataset. It was cleaned beforehand on my local machine, a simple process to get it into a CSV rather than the SQLite database I had downloaded. Finally, we’ll set a flag to indicate which vectorizer we are using, so we can test both and compare them (A/B testing).
%cd "drive/My Drive/Colab Notebooks"
!pwd
!pip3 install sklearn pandas rake_nltk
/Users/tejassiripurapu/Notebooks
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: sklearn in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (0.0)
Requirement already satisfied: pandas in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (1.0.5)
Requirement already satisfied: rake_nltk in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (1.0.4)
Requirement already satisfied: scikit-learn in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from sklearn) (0.23.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from pandas) (2020.1)
Requirement already satisfied: numpy>=1.13.3 in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from pandas) (1.18.5)
Requirement already satisfied: nltk in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from rake_nltk) (3.5)
Requirement already satisfied: scipy>=0.19.1 in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from scikit-learn->sklearn) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from scikit-learn->sklearn) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from scikit-learn->sklearn) (0.15.1)
Requirement already satisfied: six>=1.5 in /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)
Requirement already satisfied: click in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from nltk->rake_nltk) (7.1.2)
Requirement already satisfied: tqdm in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from nltk->rake_nltk) (4.47.0)
Requirement already satisfied: regex in /Users/tejassiripurapu/Library/Python/3.7/lib/python/site-packages (from nltk->rake_nltk) (2020.6.8)
import pandas as pd
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv("data/pitchfork_reviews_cleaned.csv")
print(df.head())
tfidf = True # if True: use TfidfVectorizer, else: CountVectorizer
             artist                                       album       genre  \
0       David Byrne  “…The Best Live Show of All Time” — NME EP        Rock   
1         DJ Healer           Lost Lovesongs / Lostsongs Vol. 2  Electronic   
2       Jorge Velez                                 Roman Birds  Electronic   
3           Chandra                          Transportation EPs        Rock   
4  The Chainsmokers                                    Sick Boy  Electronic   
   score             author                                             review  
0    5.5          Andy Beta  Viva Brother, Terris, Mansun, the Twang, Joe L...  
1    6.2        Chal Ravens  The Prince of Denmark—that is, the proper prin...  
2    7.9   Philip Sherburne  Jorge Velez has long been prolific, but that’s...  
3    7.8          Andy Beta  When the Avalanches returned in 2016 after an ...  
4    3.1  Larry Fitzmaurice  We’re going to be stuck with the Chainsmokers ...  
A Little Preprocessing
Now that our data is loaded into a dataframe, let’s keep only the artist, album, and review columns. Setting the album title as the index is deferred until the recommendation step.
NOTE: The full set may be too big to handle at the moment (the session crashes when trying to vectorize), so we randomly sample 8000 reviews to begin with.
working_df = df[['artist', 'album', 'review']].copy()
# working_df.set_index('album', inplace=True)
print(working_df.head())
print(len(working_df))
working_df.dropna(inplace=True, how='any')
print(len(working_df)) # How many na rows were there? 6 (20873 - 20867); which ones?
sample_df = working_df.sample(8000)
             artist                                       album  \
0       David Byrne  “…The Best Live Show of All Time” — NME EP   
1         DJ Healer           Lost Lovesongs / Lostsongs Vol. 2   
2       Jorge Velez                                 Roman Birds   
3           Chandra                          Transportation EPs   
4  The Chainsmokers                                    Sick Boy   
                                              review  
0  Viva Brother, Terris, Mansun, the Twang, Joe L...  
1  The Prince of Denmark—that is, the proper prin...  
2  Jorge Velez has long been prolific, but that’s...  
3  When the Avalanches returned in 2016 after an ...  
4  We’re going to be stuck with the Chainsmokers ...  
20873
20867
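The comment in the cell above asks which rows were dropped. Here is a minimal sketch of one way to check, using a tiny made-up frame rather than the real dataset (the rows in `toy` and the `random_state` value are illustrative assumptions). It also shows that passing `random_state` to `sample` makes the random draw repeatable between runs, which may help when comparing the two vectorizers on the same reviews.

```python
import pandas as pd

# Toy stand-in for working_df; these rows are invented for illustration only.
toy = pd.DataFrame({
    "artist": ["A", "B", None, "D"],
    "album": ["a1", None, "c1", "d1"],
    "review": ["text one", "text two", "text three", "text four"],
})

# Rows with a missing value in any column: the rows dropna(how='any') removes.
missing = toy[toy.isna().any(axis=1)]
print(missing)

cleaned = toy.dropna(how="any")
dropped = len(toy) - len(cleaned)
print(dropped)

# A fixed random_state (42 is an arbitrary choice) makes the sample reproducible.
sample_a = cleaned.sample(2, random_state=42)
sample_b = cleaned.sample(2, random_state=42)
```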
Keyword Extraction
Use the Rake class (from the rake-nltk Python package) to extract important keywords from the reviews. This improves the vectorization by stripping punctuation and stopwords, and it allows a larger batch size by shrinking both the corpus and the individual documents (reviews).
Furthermore, since a new Rake object is instantiated for every row in the dataframe, I’ve added the artist name as a stopword when extracting keywords from any review of that artist’s work. The artist name is likely used frequently in a review of their music, so without this step our similarity scores (and therefore recommendations) would be skewed towards albums by the same artist. We want more variety in the recommendations: a user has likely already heard of other albums by the same artist, or can simply Google them.
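from nltk.corpus import stopwords  # same NLTK stopword data that Rake uses by default

english_stopwords = stopwords.words('english')
sample_df['keywords'] = ""
for idx, row in sample_df.iterrows():
  # Rake expects a collection of lowercase words. A bare string would be read
  # character by character and would also replace the default English list.
  artist_words = str(row['artist']).lower().split()
  r = Rake(stopwords=english_stopwords + artist_words)
  r.extract_keywords_from_text(row['review'])
  keywords = list(r.get_word_degrees().keys())
  # Write through .at: assigning to the row from iterrows() may only change a copy.
  sample_df.at[idx, 'keywords'] = " ".join(keywords)
corpus = sample_df['keywords']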
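One detail of the stopword step is worth spelling out: a stopword list is a collection of words, so a multi-word artist name has to be split before it is added. The sketch below is a standalone illustration, assuming the extractor compares lowercase tokens; the helper name `artist_stopwords` and the two-word base list are hypothetical and not part of rake-nltk.

```python
def artist_stopwords(artist, base_stopwords):
    """Return the base stopwords plus the lowercase words of the artist name."""
    # str() guards against names that were parsed as numbers.
    name_words = str(artist).lower().split()
    return list(base_stopwords) + name_words

# Iterating over a bare string yields single characters, not words.
as_characters = set("David Byrne")
as_words = artist_stopwords("David Byrne", ["the", "and"])

print(sorted(as_characters))
print(as_words)
```

Splitting on whitespace is the simplest choice; names containing punctuation (for example "AC/DC") may need a tokenizer that matches the one the keyword extractor uses.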
Vectorization
The sampled dataframe now holds the album titles and, for each review, the extracted keywords. We’re going to vectorize that keyword corpus using TF-IDF. This gives us one vector of word weights per review. Think of this as mapping every review onto an n-dimensional space, where n is the number of distinct words in the corpus. Then we can measure similarity between the vectors using cosine similarity. Admittedly, TF-IDF is a little different from raw frequency, since it scales each word’s count down according to how many documents that word appears in, but this is a good way to conceptualize it.
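To make the weighting idea concrete, here is a small hand-rolled sketch on three invented "reviews". It uses the smoothed idf formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what scikit-learn documents for its default settings. The whitespace tokenization is a simplification, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with smoothed idf and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        vectors.append([x / norm for x in row])
    return vocab, vectors

docs = ["dreamy guitar pop", "harsh noise guitar", "dreamy synth pop"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(x, 3) for x in vectors[1]])
```

In the second document, "harsh" appears in only one review while "guitar" appears in two, so "harsh" receives the larger weight even though both words occur once.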
if tfidf:
  tv = TfidfVectorizer()
  tv_counts = tv.fit_transform(corpus)
  # tv_counts = tv.transform(df['review'])
  tv_counts = tv_counts.toarray()
  print(len(tv_counts[0])) # How many words are being counted?
89524
if not tfidf:
    cv = CountVectorizer()
    cv_counts = cv.fit_transform(corpus)
    cv_counts = cv_counts.toarray()
    print(len(cv_counts[0]))
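# Use whichever matrix the flag selected above; tv_counts only exists when tfidf is True.
counts = tv_counts if tfidf else cv_counts
albums = sample_df['album']
labels = ["word" + str(i) for i in range(counts.shape[1])]
counts_df = pd.DataFrame(counts, columns=labels, index=albums)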
print(counts_df.head())
                                 word0  word1  word2  word3  word4  word5  \
album                                                                       
Pimps Don't Pay Taxes              0.0    0.0    0.0    0.0    0.0    0.0   
Exmilitary                         0.0    0.0    0.0    0.0    0.0    0.0   
Like Someone In Love EP            0.0    0.0    0.0    0.0    0.0    0.0   
Expressions (2012 A.U.)            0.0    0.0    0.0    0.0    0.0    0.0   
Songs from the Hermetic Theatre    0.0    0.0    0.0    0.0    0.0    0.0   
                                 word6  word7  word8  word9  ...  word89514  \
album                                                        ...              
Pimps Don't Pay Taxes              0.0    0.0    0.0    0.0  ...        0.0   
Exmilitary                         0.0    0.0    0.0    0.0  ...        0.0   
Like Someone In Love EP            0.0    0.0    0.0    0.0  ...        0.0   
Expressions (2012 A.U.)            0.0    0.0    0.0    0.0  ...        0.0   
Songs from the Hermetic Theatre    0.0    0.0    0.0    0.0  ...        0.0   
                                 word89515  word89516  word89517  word89518  \
album                                                                         
Pimps Don't Pay Taxes                  0.0        0.0        0.0        0.0   
Exmilitary                             0.0        0.0        0.0        0.0   
Like Someone In Love EP                0.0        0.0        0.0        0.0   
Expressions (2012 A.U.)                0.0        0.0        0.0        0.0   
Songs from the Hermetic Theatre        0.0        0.0        0.0        0.0   
                                 word89519  word89520  word89521  word89522  \
album                                                                         
Pimps Don't Pay Taxes                  0.0        0.0        0.0        0.0   
Exmilitary                             0.0        0.0        0.0        0.0   
Like Someone In Love EP                0.0        0.0        0.0        0.0   
Expressions (2012 A.U.)                0.0        0.0        0.0        0.0   
Songs from the Hermetic Theatre        0.0        0.0        0.0        0.0   
                                 word89523  
album                                       
Pimps Don't Pay Taxes                  0.0  
Exmilitary                             0.0  
Like Someone In Love EP                0.0  
Expressions (2012 A.U.)                0.0  
Songs from the Hermetic Theatre        0.0  
[5 rows x 89524 columns]
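The shape reported above goes a long way toward explaining the crashes mentioned earlier. A quick back-of-the-envelope estimate, assuming 8-byte floats (float64 is an assumption about the array dtype), looks like this; `dense_bytes` is a hypothetical helper.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed for a dense array, assuming 8-byte (float64) values."""
    return n_rows * n_cols * bytes_per_value

term_matrix = dense_bytes(8000, 89524)       # sampled reviews x vocabulary size
similarity_matrix = dense_bytes(8000, 8000)  # pairwise album similarities

print(f"term matrix: {term_matrix / 1e9:.2f} GB")
print(f"similarity matrix: {similarity_matrix / 1e9:.2f} GB")
```

Almost all of those entries are zero. Both vectorizers return a SciPy sparse matrix from `fit_transform`, and `cosine_similarity` accepts sparse input, so skipping `toarray()` and the wide dataframe is one option if memory stays tight.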
(Cosine) Similarity
Now we have a dataframe containing the word weights, and we can compare albums using cosine similarity. Doing this in a pairwise manner (across every row in the dataframe) lets us measure the similarity between any two albums based on their reviews.
similarity = cosine_similarity(counts_df)
print(similarity[:5])
God bless, everything ran and nothing has crashed yet (hopefully). Now let’s make sense of the similarity matrix. Each album gets a row in the matrix (i.e. album[0] has a corresponding vector at similarity[0]).
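For intuition about what each entry in the matrix means, here is a minimal pure-Python sketch of pairwise cosine similarity on three toy vectors. It illustrates the measure and is not how scikit-learn implements it; returning 0.0 for an all-zero vector is a choice made here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no direction to compare
    return dot / (norm_u * norm_v)

def pairwise(vectors):
    """Square matrix where entry [i][j] compares vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

toy = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 3.0]]
matrix = pairwise(toy)
for row in matrix:
    print([round(x, 3) for x in row])
```

The first two vectors point in the same direction, so they score about 1 despite their different lengths, while the third shares no nonzero dimension with them and scores 0. This is why review length alone does not drive the similarity.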
Recommendation and User Testing
Every value in an album’s vector represents how similar that album is to another album. So for album[i], there is a corresponding vector such that the value at index j shows how similar album[i] is to album[j]. If i == j we should see a value of 1, meaning that the two albums are exactly alike (as they should be, it’s the same album).
So to find recommendations for a given album, sort the corresponding similarity vector (while maintaining the index). Using the index of the highest value which isn’t 1 will reference the album most similar to the given album based on the written Pitchfork review.
The process to recommend is as follows: given an input album title, get the review corresponding to that album, vectorize it using the fitted vectorizer, measure its similarity against the sample, and return the top 5 (or 10?) most similar albums.
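working_df.set_index('album', inplace=True)
search_album = "My Beautiful Dark Twisted Fantasy"
review = working_df.loc[search_album]['review']
vectorizer = tv if tfidf else cv  # reuse whichever vectorizer was fitted above
vec = vectorizer.transform([review]).toarray()
print(len(vec[0]))
search_sim = cosine_similarity(vec, counts_df)
print(search_sim)
search_df = pd.DataFrame(search_sim[0])
# Take one extra candidate so the searched album can be skipped if it was sampled.
top = search_df.nlargest(6, 0)
recommended = [albums.iloc[i] for i in top.index if albums.iloc[i] != search_album]
for title in recommended[:5]:
  print(title)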
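The ranking step described above (sort one similarity vector while keeping track of positions, then skip the album itself) can be sketched without any libraries. The function name `recommend`, the toy titles, and the scores are made up for illustration.

```python
def recommend(similarity_row, titles, query_index, k=5):
    """Return the k titles most similar to the query, excluding the query itself."""
    # Pair each score with its position so the titles can be recovered after sorting.
    ranked = sorted(enumerate(similarity_row), key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in ranked if i != query_index][:k]

titles = ["Album A", "Album B", "Album C", "Album D"]
row_for_a = [1.0, 0.2, 0.7, 0.4]  # similarity of Album A to every album

print(recommend(row_for_a, titles, query_index=0, k=2))
```

Excluding the query by position rather than by "score equal to 1" avoids two edge cases: floating-point scores that land just below 1, and a different album that happens to score exactly 1.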