Preparing Dataset for PyTorch
Cleaning Up Datasets
We are using 2 datasets to build the PyTorch model.
1. Metacritic Game Info Dataset
This dataset contains all video games published from 1998 to 2018. Here's a snippet of what it looks like:
Index | Unnamed: 0 | Title | Year | Publisher | Genre | Platform | Metascore | Avg_Userscore | No_Players |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | The Legend of Zelda: Ocarina of Time | 1998 | Nintendo | Action Adventure;Fantasy | Nintendo64 | 99 | 9.1 | 1 Player |
1 | 1 | Tony Hawk's Pro Skater 2 | 2000 | NeversoftEntertainment | Sports;Alternative;Skateboarding | PlayStation | 98 | 7.4 | 1-2 |
2 | 2 | Grand Theft Auto IV | 2008 | RockstarNorth | Action Adventure;Modern;Modern;Open-World | PlayStation3 | 98 | 7.5 | 1 Player |
To keep our recommendation engine up to date, we are only interested in game titles from 2010 onward, so let's clean up this dataset.
Import pandas
First we import pandas, a Python package used for data manipulation and analysis.
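The import itself is a single line. The alias pd is only the common convention, not something the dataset requires:

```python
# Import pandas under its conventional alias.
import pandas as pd

# Printing the version is a quick way to confirm the package is installed.
print(pd.__version__)
```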
Read and Filter Data
Then we use pandas to read the dataset file.
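As a minimal sketch of this step, assuming the dataset is a CSV file: the file name in the comment is hypothetical, and the in-memory sample (built from the three rows shown above) stands in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-in for the real file. With the actual dataset you would pass its path,
# e.g. pd.read_csv("metacritic_game_info.csv") (hypothetical file name).
sample_csv = io.StringIO(
    ",Title,Year,Publisher,Genre,Platform,Metascore,Avg_Userscore,No_Players\n"
    "0,The Legend of Zelda: Ocarina of Time,1998,Nintendo,"
    "Action Adventure;Fantasy,Nintendo64,99,9.1,1 Player\n"
    "1,Tony Hawk's Pro Skater 2,2000,NeversoftEntertainment,"
    "Sports;Alternative;Skateboarding,PlayStation,98,7.4,1-2\n"
    "2,Grand Theft Auto IV,2008,RockstarNorth,"
    "Action Adventure;Modern;Modern;Open-World,PlayStation3,98,7.5,1 Player\n"
)

# A header cell with no name is labelled "Unnamed: 0" by pandas.
games = pd.read_csv(sample_csv)
print(games.columns.tolist())
```

The unnamed first column is typically a saved copy of an earlier index, which is why it duplicates the index pandas adds on load.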
Some Year values in this dataset are null. Select only the rows whose Year is not null and is 2010 or later.
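One way to express this filter is a boolean mask, sketched below. The cutoff is assumed to include 2010 because the cleaned data shown later contains a 2010 title; the row with a missing year is a made-up placeholder for illustration.

```python
import pandas as pd

# Small illustrative frame; the last row is a made-up placeholder
# showing how a missing Year behaves.
games = pd.DataFrame({
    "Title": [
        "The Legend of Zelda: Ocarina of Time",
        "Super Mario Galaxy 2",
        "Grand Theft Auto V",
        "Placeholder title with no year",
    ],
    "Year": [1998, 2010, 2014, None],
})

# Keep rows whose Year is present and is 2010 or later (assumed cutoff).
# NaN >= 2010 is already False, but notna() makes the intent explicit.
mask = games["Year"].notna() & (games["Year"] >= 2010)
recent = games[mask]
print(recent)
```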
Finally, let's remove the 'Unnamed' column, since it is redundant, and name our cleaned-up dataframe df.
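A possible sketch of this last step follows. It assumes the redundant column carries pandas' default label "Unnamed: 0"; the values in that column are made up. Resetting the index and casting Year back to integers are assumptions as well, made so the result resembles the snippet of df.

```python
import pandas as pd

# Filtered rows keep their original index labels and the leftover column.
# The "Unnamed: 0" values and index labels here are made up.
recent = pd.DataFrame(
    {
        "Unnamed: 0": [5, 9],
        "Title": ["Super Mario Galaxy 2", "Grand Theft Auto V"],
        "Year": [2010.0, 2014.0],  # floats because the column once held NaN
    },
    index=[5, 9],
)

# Drop the redundant column and renumber the rows from 0.
df = recent.drop(columns=["Unnamed: 0"]).reset_index(drop=True)

# With the nulls gone, Year can safely be stored as integers again.
df["Year"] = df["Year"].astype(int)
print(df)
```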
Below is a snippet of what df looks like:
Index | Title | Year | Publisher | Genre | Platform | Metascore | Avg_Userscore | No_Players |
---|---|---|---|---|---|---|---|---|
0 | Super Mario Galaxy 2 | 2010 | NintendoEADTokyo | Action;Platformer;Platformer;3D | Wii | 97 | 9.1 | No Online Multiplayer |
1 | Grand Theft Auto V | 2014 | RockstarNorth | Action Adventure;Modern;Open-World | XboxOne | 97 | 7.8 | Up to 30 |
2 | Grand Theft Auto V | 2013 | RockstarNorth | Modern;Action Adventure;Modern;Open-World | PlayStation3 | 97 | 8.3 | Up to 16 |
2. Metacritic Game User Ratings
This dataset contains user ratings and comments for specific games. We are only interested in the game titles in our df, along with the Username and Userscore columns. Before cleanup, the dataset looks like this snippet:
Index | Unnamed: 0 | Title | Platform | Userscore | Comment | Username |
---|---|---|---|---|---|---|
0 | 0 | The Legend of Zelda: Ocarina of Time | Nintendo64 | 10 | Everything in OoT is so near at perfection, it... | SirCaestus |
1 | 1 | The Legend of Zelda: Ocarina of Time | Nintendo64 | 10 | I won't bore you with what everyone is already... | Kaistlin |
2 | 2 | The Legend of Zelda: Ocarina of Time | Nintendo64 | 10 | Anyone who gives the masterpiece below a 7 or ... | Jacody |
Read and Filter Data
As with the previous dataset, we use pandas to read the data file.
Then we remove the columns we don't need.
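Judging by the cleaned snippet, the columns to drop are the leftover index column, Platform, and Comment. A minimal sketch, with the frame built inline instead of read from the real file, and "Unnamed: 0" assumed to be the label of the leftover column:

```python
import pandas as pd

# Inline stand-in for the user ratings file. With the real data this frame
# would come from pd.read_csv(...) on the ratings file.
ratings = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Title": ["The Legend of Zelda: Ocarina of Time"] * 3,
    "Platform": ["Nintendo64"] * 3,
    "Userscore": [10, 10, 10],
    "Comment": [
        "Everything in OoT is so near at perfection, it...",
        "I won't bore you with what everyone is already...",
        "Anyone who gives the masterpiece below a 7 or ...",
    ],
    "Username": ["SirCaestus", "Kaistlin", "Jacody"],
})

# Keep only what the recommender needs: title, score, and user.
ratings = ratings.drop(columns=["Unnamed: 0", "Platform", "Comment"])
print(ratings.columns.tolist())
```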
And we select only the rows whose Title exists in the Title column of df, because we only want user ratings for games from 2010 onward. The final cleaned dataframe is called users.
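This membership test can be sketched with Series.isin. The two frames below are small stand-ins for df and the trimmed ratings data, and resetting the index is an assumption made so the rows are numbered from 0 as in the snippet.

```python
import pandas as pd

# Stand-in for the cleaned games dataframe.
df = pd.DataFrame({"Title": ["Super Mario Galaxy 2", "Grand Theft Auto V"]})

# Stand-in for the ratings after the unused columns were dropped.
ratings = pd.DataFrame({
    "Title": [
        "The Legend of Zelda: Ocarina of Time",
        "Super Mario Galaxy 2",
        "Super Mario Galaxy 2",
    ],
    "Userscore": [10, 10, 8],
    "Username": ["SirCaestus", "S.Kumar", "ThePlasmaQuasar"],
})

# Keep only ratings whose Title also appears in df, then renumber the rows.
users = ratings[ratings["Title"].isin(df["Title"])].reset_index(drop=True)
print(users)
```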
A snippet of the cleaned up dataset looks like:
Index | Title | Userscore | Username |
---|---|---|---|
0 | Super Mario Galaxy 2 | 10 | S.Kumar |
1 | Super Mario Galaxy 2 | 8 | ThePlasmaQuasar |
2 | Super Mario Galaxy 2 | 10 | juanandesign |
Export Dataset for Model
To make the dataset easier to work with, it is better to identify game titles and users with contiguous integer ids.
Encode Columns
We can create a function to encode a pandas column with ids.
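One possible version of such a function is sketched below. It relies on pandas.factorize, which may differ from the original implementation, and the name encode_column is hypothetical.

```python
import pandas as pd

def encode_column(column):
    """Return (ids, uniques) for a pandas column.

    ids holds one integer per row, assigned in order of first appearance
    starting at 0; uniques lists the distinct values so that
    uniques[i] is the value encoded as id i.
    """
    ids, uniques = pd.factorize(column)
    return ids, uniques

titles = pd.Series(
    ["Super Mario Galaxy 2", "Super Mario Galaxy 2", "Grand Theft Auto V"]
)
title_ids, title_values = encode_column(titles)
print(title_ids.tolist(), list(title_values))
```

Keeping the list of unique values makes it possible to map an id back to its title or username later.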
Encode Dataframe
Then we encode the Title and Username columns of the users dataframe with a second function.
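A sketch of the dataframe-level step, again using pandas.factorize as a stand-in for the original encoding. The function name encode_df is hypothetical; the output column names UserId and TitleId follow the exported snippet.

```python
import pandas as pd

def encode_df(frame):
    """Return a copy of frame with integer UserId and TitleId columns added."""
    encoded = frame.copy()
    # factorize returns (ids, uniques); only the ids are stored here.
    encoded["UserId"] = pd.factorize(encoded["Username"])[0]
    encoded["TitleId"] = pd.factorize(encoded["Title"])[0]
    return encoded

users = pd.DataFrame({
    "Title": ["Super Mario Galaxy 2"] * 3,
    "Userscore": [10, 8, 10],
    "Username": ["S.Kumar", "ThePlasmaQuasar", "juanandesign"],
})
encoded_users = encode_df(users)
print(encoded_users)
```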
Export to csv
Execute the function and export the new dataset to csv.
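The export can be sketched with DataFrame.to_csv. The output file name is hypothetical, the frame is a stand-in for the encoded data, and a temporary directory is used so the snippet leaves nothing behind. Passing index=False is a choice made here; whether the original export kept the index is not known.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the encoded users dataframe.
encoded_users = pd.DataFrame({
    "Title": ["Super Mario Galaxy 2"] * 3,
    "Userscore": [10, 8, 10],
    "Username": ["S.Kumar", "ThePlasmaQuasar", "juanandesign"],
    "UserId": [0, 1, 2],
    "TitleId": [0, 0, 0],
})

# Hypothetical output file name, written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "users_encoded.csv")

# index=False avoids writing the row index as an extra unnamed column,
# the kind of column that had to be dropped during cleanup.
encoded_users.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
round_trip = pd.read_csv(out_path)
print(round_trip.columns.tolist())
```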
This csv file will be used for training and re-training the model with PyTorch. Below is a snippet of it:
Index | Title | Userscore | Username | UserId | TitleId |
---|---|---|---|---|---|
0 | Super Mario Galaxy 2 | 10 | S.Kumar | 0 | 0 |
1 | Super Mario Galaxy 2 | 8 | ThePlasmaQuasar | 1 | 0 |
2 | Super Mario Galaxy 2 | 10 | juanandesign | 2 | 0 |