Blog

Record the Evidence: Building a Goodreads Best Crime and Mystery Books Dataset

06 March 2023

Background and Motivation

In today’s job market, I fully understand the importance of having a solid portfolio to stand out from the crowd and get noticed by potential employers. However, building one can be challenging, especially when you do not know where to start. Luckily, our incredible friend Google Search can help. By typing “data analyst project” into the search bar, I got responses like those in the image below.

I chose an article from the topmost result and followed its suggestion to start with the first project on the list: web scraping.

The article suggested the Beautiful Soup or Scrapy library to crawl the web. I chose Beautiful Soup because the article mentioned it first; I did not explore the other crawler library.

Web scraping involves extracting data from websites and saving it in a structured format. Since I love crime and mystery books, I selected the Best Crime and Mystery Books list on Goodreads as my source. I saved the data in CSV format so it can be reused in a future project. In short, the three working blocks in this project are:

  1. saving website pages into documents,
  2. scraping data from the documents, and
  3. saving the data into a CSV file.

Project Blocks

STEP 1: Save website pages into documents

The pages I worked with belong to Goodreads’ Best Crime and Mystery Books list, which spans 69 pages. With help from the requests library, I opened and retrieved every page’s content. Then, I saved each page’s content into a separate HTML document.

I built two files for this step: (1) html_docs.py, which hosts the HTMLDocs class that fetches a website page, retrieves its content, and saves the content in a document, and (2) main.py, which instantiates the HTMLDocs class and calls its save_doc() method for every page we want to read.

html_docs.py

import requests


class HTMLDocs:
    def __init__(self, main_url, my_title, queryable_object):
        self.main_url = main_url
        self.title = my_title
        self.queryable_object = queryable_object
        self.text = ""

    def get_doc(self, page_number):
        # Page 1 uses the base URL; later pages append the query string, e.g. "?page=2"
        url = self.main_url
        if page_number > 1:
            url += f"{self.queryable_object}{page_number}"

        # Retrieve the page and keep its HTML text
        page = requests.get(url)
        self.text = page.text

    def save_doc(self, page_number):
        self.get_doc(page_number)

        # Write the retrieved HTML to a local file, one file per page
        with open(f"{self.title}_{page_number}.html", "w") as doc:
            doc.write(self.text)

main.py

from html_docs import HTMLDocs


TITLE = "goodreads_crime_mystery_books"     # title for documents
URL = "https://www.goodreads.com/list/show/11.Best_Crime_Mystery_Books"
QUERYABLE_OBJECT = "?page="                 # query string
MAX_PAGE = 69                       # maximum number of pages

html_doc = HTMLDocs(URL, TITLE, QUERYABLE_OBJECT)

for i in range(1, MAX_PAGE+1):        # iterate from first page to the last page
    html_doc.save_doc(i)
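One small tweak worth considering, which is not part of the original script, is pausing briefly between requests so the scrape is gentler on Goodreads’ servers. A minimal sketch reusing the constants above; the one-second delay is my own assumption:

import time

from html_docs import HTMLDocs


html_doc = HTMLDocs(URL, TITLE, QUERYABLE_OBJECT)

for i in range(1, MAX_PAGE+1):
    html_doc.save_doc(i)
    time.sleep(1)           # assumption: pause one second between requests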

STEP 2: Scrape data from documents

The data we can scrape from the crime and mystery books list are the title, author, average rating, number of ratings, score, and number of voters. For example, for the top-ranked book, the title is The Girl with the Dragon Tattoo (Millennium, #1), the author is Stieg Larsson, the average rating is 4.16, the number of ratings is 3,012,134, the score is 306,573, and the number of voters is 3,099.

The five highest-scoring crime and mystery books.

The Beautiful Soup (bs4) library takes action in this second step. I used the find_all() method to search the parse tree, filtering on (a) the class attribute to locate the title, author, rating, and number of ratings; (b) a text string pattern to find the score; and (c) the id attribute with an id-name pattern to find the number of voters.

soup.find_all(tag, attribute)                   # (a) filter by the class attribute
soup.find_all(string=re.compile(pattern))       # (b) filter by a text string pattern
soup.find_all(attribute=re.compile(pattern))    # (c) filter by an attribute value pattern
# After executing the find_all() method, the first five items in the scores list from the first page are below.
print(scores[:5])

Next, for each element found, I extracted the data using the text attribute. Note that before locating the exact position of the correct value, I had to split the items whose text contains more than one word, e.g., the average rating, the number of ratings, the score, and the number of voters.

For example, each item in scores has two words: “score:” and “{number}”. After splitting an item in the scores list, we take the last element to get the correct number.

To convert the number from a string to an integer, I also had to remove the thousands separators (“,”).

"score": int(scores[j].text.strip().split()[-1].replace(',', ''))

I concluded this step by storing the extracted data as a list of dictionaries, one dictionary per book.

import re

from bs4 import BeautifulSoup


books_dict = []
for i in range(1, MAX_PAGE+1):
    # Open the html document
    with open(f"{TITLE}_{i}.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    # Locate the data we want to extract from the document
    rank = soup.find_all("td", {"class": "number"})
    title_tags = soup.find_all("a", {"class": "bookTitle"})
    author_names = soup.find_all("a", {"class": "authorName"})
    ratings = soup.find_all("span", {"class": "minirating"})   # holds both avg rating and number of ratings
    scores = soup.find_all(string=re.compile("score:"))
    n_voters = soup.find_all(id=re.compile("loading_link_"))

    # Extract the data
    for j in range(len(title_tags)):
        book_item = {
            "rank": int(rank[j].text.strip()),
            "title": title_tags[j].text.strip(),
            "author": author_names[j].text.strip(),
            "avg_rating": float(ratings[j].text.strip().split()[-6]),
            "n_rating": int(ratings[j].text.strip().split()[-2].replace(',', '')),
            "score": int(scores[j].text.strip().split()[-1].replace(',', '')),
            "n_voter": int(n_voters[j].text.strip().split()[0].replace(',', ''))
        }
        books_dict.append(book_item)
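After the loop finishes, a quick look at what was collected is an easy sanity check; a minimal sketch:

# Quick sanity check on the collected records
print(len(books_dict))      # total number of books scraped
print(books_dict[0])        # the first record in the list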

Notes

The data is missing 100 entries from page 27. After several attempts, the program could only retrieve a “Goodreads request took too long” page, which contained none of the information I was trying to locate.

# Inspecting the retrieved page confirmed it held only the error message
print(soup.prettify())
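If I revisit this project, one thing I might try for the stubborn page is retrying the request a few times with a pause in between. A minimal sketch, reusing the constants from main.py (URL, TITLE, QUERYABLE_OBJECT) and assuming Goodreads eventually serves the page, which is not guaranteed:

import time

import requests


# Assumption: a simple retry loop for the page that kept timing out
MISSING_PAGE = 27
url = f"{URL}{QUERYABLE_OBJECT}{MISSING_PAGE}"

for attempt in range(5):
    page = requests.get(url, timeout=30)
    # Assumption: the error page can be detected by the message noted above
    if page.ok and "request took too long" not in page.text:
        with open(f"{TITLE}_{MISSING_PAGE}.html", "w") as doc:
            doc.write(page.text)
        break
    time.sleep(10)          # wait before trying again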

STEP 3: Save the data into a CSV file

Using the pandas library, I turned the list of dictionaries into a data frame. Last, I wrote that data frame to a CSV file.

import pandas as pd


books_df = pd.DataFrame(books_dict)
books_df.to_csv(f"{TITLE}.csv", index=False)
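To confirm the export worked, reading the file back is a quick check; a minimal sketch:

# Read the CSV back to confirm the export worked
check_df = pd.read_csv(f"{TITLE}.csv")
print(check_df.shape)       # expected: one row per book, seven columns
print(check_df.head())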

Complete Code and CSV File

Complete code

I collected 6,767 rows of data from 69 pages of Goodreads’ Best Crime and Mystery Books list. CSV file – last updated March 3, 2022

I am glad I found an article that inspired me to start this project and the resources that helped me along the way. Not only did I learn a valuable skill, but I also had a tangible project to showcase in my portfolio.


P.S. Here are my recommended crime and mystery reads.*

*As an Amazon Associate, I earn from qualifying purchases.

- Ilma Andayana