A while ago, I found myself trying to decide which of several restaurants to eat at. They were all highly rated on Yelp, but I figured there might be more insights I could pull from their reviews. So I decided to Splunk them!
TL;DR If you want to get straight to the code, go to https://github.com/dmuth/splunk-yelp-reviews to get started.
Downloading the reviews
Yelp has an API, but I am sorry to say it is awful. It will only let you download 3 reviews per venue. That’s it! What a crime.
So… I had to crawl Yelp venue pages to get the reviews. I am not proud of this, but I was left with no other option.
Python has been my go-to language lately, so I decided to solve the problem of review acquisition with Python. I used the Requests module to fetch the HTML code, and the Beautiful Soup module to extract reviews and page links from the HTML.
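Fetching a page with Requests is a one-liner; here’s a minimal sketch (assuming url holds a Yelp venue page URL, and leaving out error handling and rate limiting):

import requests

# Fetch the raw HTML for a single venue page.
response = requests.get(url)
html = response.text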
One thing that is nice about Yelp is that their HTML is well-formed, CSS classes are used in an intelligent manner, and with a good parser, it is easy to extract the data you want. Here’s the code I used to extract reviews from the HTML:
import datetime
import re

from bs4 import BeautifulSoup

retval = []
soup = BeautifulSoup(html, 'html.parser')

for review in soup.find_all("div", {"class": "review-content"}):

    row = {}
    row["venue"] = venue

    #
    # Parse our date: strip everything except digits and slashes,
    # then reformat it as a timestamp Splunk can parse.
    #
    date = review.find_all("span", {"class": "rating-qualifier"})[0].text
    date = re.sub("[^/0-9]", "", date)
    date_time_obj = datetime.datetime.strptime(date, '%m/%d/%Y')
    date = date_time_obj.strftime("%Y-%m-%dT%H:%M:%S.000")
    row["date"] = date

    #
    # Grab our review text
    #
    row["review"] = review.find_all("p")[0].text

    #
    # Parse our stars
    #
    stars = review.find_all("div", {"class": "i-stars"})[0]["title"]
    if stars == "5.0 star rating":
        row["stars"] = 5
    elif stars == "4.0 star rating":
        row["stars"] = 4
    elif stars == "3.0 star rating":
        row["stars"] = 3
    elif stars == "2.0 star rating":
        row["stars"] = 2
    elif stars == "1.0 star rating":
        row["stars"] = 1
    else:
        raise Exception("Could not parse 'stars' value: {}".format(stars))

    retval.append(row)
And determining whether there is another page of reviews is as simple as grabbing the "next" link:
next_page = soup.find_all("a", {"class": "next"})
if next_page:
    url = next_page[0]["href"]
Then I just wrote the reviews to disk as JSON, and I had my data!
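That step is nothing fancy; here’s a minimal sketch (the filename and the one-JSON-object-per-line format are assumptions on my part; see the repo for exactly what the Splunk app ingests):

import json

# Write one JSON object per line, so each review becomes its own event.
with open("reviews.json", "w") as f:
    for row in retval:
        f.write(json.dumps(row) + "\n")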
Splunk Was The Easy Part
A few months ago, I introduced an app I wrote called Splunk Lab. Splunk Lab consists of a script and a Docker container which can be used to stand up Splunk in a matter of seconds for ad hoc data analysis. For the data analysis part of this project, Splunk Lab is exactly what I used to quickly build per-venue dashboards.
Here’s how to install and run the Splunk Yelp Reviews app that I built:
SPLUNK_START_ARGS=--accept-license \
bash <(curl -s https://raw.githubusercontent.com/dmuth/splunk-yelp-reviews/master/go.sh) \
./urls.txt
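The urls.txt file is just a list of Yelp venue URLs to pull reviews for, one per line; something like this (these example URLs are made up):

https://www.yelp.com/biz/some-restaurant-philadelphia
https://www.yelp.com/biz/another-restaurant-philadelphia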
Before Splunk starts, the script will explain what it is about to run and give you a chance to back out and change settings.
Once that’s done, go to https://localhost:8000/, log in with the credentials you set in the script, and you’ll see a dashboard from which you can select a venue and see a report for it!
Here are some screenshots:
Takeaways From Splunking Yelp
I found some interesting things by looking at the data in Splunk:
- The number of reviews for venues near each other can vary by as much as two orders of magnitude (4-5 reviews vs. 400-500 reviews).
- Bars and chain stores generally have the fewest reviews, while standalone restaurants tend to have the most.
- A gap in reviews at a heavily reviewed place usually means it was closed; if the reviews afterward are higher, chances are there was a change in ownership.
- Reading recent negative reviews (2 stars or less) is VERY useful, as it can highlight specific issues with an establishment.
- Tag clouds seemed like a neat idea, but aren’t all that useful.
And I’m just scratching the surface here, as restaurants and bars aren’t the only things reviewed on Yelp. There are so many different businesses, doctors, etc. that are reviewed that I’m sure there are some serious insights hidden in the data.
So give the app a try, and let me know what you think!
— Doug