Version: en

Preparing Dataset for PyTorch

Cleaning Up Datasets

We are using 2 datasets to build the PyTorch model.

1. Metacritic Game Info Dataset

This dataset contains all video games published from 1998-2018. Here's a snippet of what it looks like:

Unnamed0TitleYearPublisherGenrePlatformMetascoreAvg_UserscoreNo_Players
00The Legend of Zelda: Ocarina of Time1998NintendoAction Adventure;FantasyNintendo64999.11 Player
11Tony Hawk's Pro Skater 22000NeversoftEntertainmentSports;Alternative;SkateboardingPlayStation 987.41-2
22Grand Theft Auto IV2008RockstarNorthAction Adventure;Modern;Modern;Open-WorldPlayStation3987.51 Player

To keep our recommendation engine up-to-date, we are only interested in game titles after the year 2010. So let's clean up this dataset.

Import pandas

First we import pandas, a Python package used for data manipulation and analysis.

import pandas as pd

Read and Filter Data

Then, we use pandas to read our dataset file that we have imported.

games_df = pd.read_csv("game_info.csv")

This dataset contains some Year values that are null. Select only those that are not null and after the year 2010.

games_df = games_df[games_df['Year'].notnull()]
games_df=games_df.loc[games_df.Year>=2010.0].reset_index(drop=True)

Finally, let's remove the 'Unnamed' column since it is redundant and rename our cleaned up dataframe to df.

df = games_df.drop(columns=['Unnamed'], axis=1)

Below is a snippet of what df looks like:

IndexTitleYearPublisherGenrePlatformMetascoreAvg_UserscoreNo_Players
0Super Mario Galaxy 22010NintendoEADTokyoAction;Platformer;Platformer;3DWii979.1No Online Multiplayer
1Grand Theft Auto V2014RockstarNorthAction Adventure;Modern;Open-WorldXboxOne977.8Up to 30
2Grand Theft Auto V2013RockstarNorthModern;Action Adventure;Modern;Open-WorldPlayStation3978.3Up to 16

2. Metacritic Game User Ratings

This dataset contains user ratings and comments for specific games. We are interested in only looking at the game titles in our df, username and the userscore columns. Before clean up, the dataset looks like this snippet:

Unnamed0TitlePlatformUserscoreCommentUsername
00The Legend of Zelda: Ocarina of TimeNintendo6410Everything in OoT is so near at perfection, it...SirCaestus
11The Legend of Zelda: Ocarina of TimeNintendo6410I won't bore you with what everyone is already...Kaistlin
22The Legend of Zelda: Ocarina of TimeNintendo6410Anyone who gives the masterpiece below a 7 or ...Jacody

Read and Filter Data

Same the previous dataset, we use pandas to read the data file.

user_df=pd.read_csv("metacritic_game_user_comments.csv")

Then we remove the columns we don't need.

user_df = user_df.drop(columns=['Unnamed', 'Platform','Comment'], axis=1)

And we select only the data with Title that exists in df Title column, because we only want user ratings for games after 2010. The final cleaned dataframe is called users.

users = user_df.loc[user_df['Title'].isin(df['Title'])].reset_index(drop=True)

A snippet of the cleaned up dataset looks like:

IndexTitleUserscoreUsername
0Super Mario Galaxy 210S.Kumar
1Super Mario Galaxy 28ThePlasmaQuasar
2Super Mario Galaxy 210juanandesign

Export Dataset for Model

In order to work with the dataset easily, it is better to use continuous ids to identify game titles and users.

Encode Columns

We can create a function to encode a pandas columns with ids.

def encode_column(column):
""" Encodes a pandas column with continous IDs"""
keys = column.unique()
key_to_id = {key:idx for idx,key in enumerate(keys)}
return key_to_id, np.array([key_to_id[x] for x in column]), len(keys)

Encode Dataframe

Then, encode the users dataframe for the Title and Username columns with the following function.

def encode_df(users):
"""Encodes rating data with continuous user and game title ids"""
game_ids, users['TitleId'], num_games = encode_column(users['Title'])
user_ids,users['UserId'], num_users = encode_column(users['Username'])
return users, num_users, num_games, user_ids, game_ids

Export to csv

Execute the function and export the new dataset to csv.

users, num_users, num_games, user_ids, game_ids = encode_df(users)
users.to_csv('game_ratings.csv')

This csv file will be used for training and re-training the model with PyTorch. Below is a snippet of it:

IndexTitleUserscoreUsernameUserIdTitleId
0Super Mario Galaxy 210S.Kumar00
1Super Mario Galaxy 28ThePlasmaQuasar10
2Super Mario Galaxy 210juanandesign20