Datasets

From Pigbert Wiki


MovieLens Dataset (Small)

#Users = 943; Numbered consecutively from 1
#Items = 1682; Numbered consecutively from 1
#Ratings = 100,000
Each user has at least 20 ratings
Ratings are integers on a scale from 1 to 5
Date format: Unix timestamp (seconds since 1970-01-01 UTC).

Original Files

movies.txt
MovieID|Title[%s] (year[%4d])|release date[d-M-Y]|video release date[null]|IMDB URL[http://%s]|Genre1[0|1]|...|Genre20[0|1]
ordered by MovieID ASC.
user.txt (u.user)
UserID|age[%d]|Gender[M|F]|Occupation[%s]|Zip-code[%5s]
ordered by UserID ASC.
ratings.txt
UserID \t MovieID \t Rating \t Timestamp
not ordered at all.
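
The line formats above can be read with plain string splitting. The following is a minimal sketch, not the wiki's own loader: the function names and the sample lines are illustrative assumptions, and the timestamp is assumed to be Unix seconds in UTC.

```python
from datetime import datetime, timezone

def parse_small_rating(line):
    """Parse one tab-separated ratings.txt line: UserID, MovieID, Rating, Timestamp."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    # Assumption: the timestamp is Unix seconds since 1970-01-01 UTC.
    when = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return int(user_id), int(movie_id), int(rating), when

def parse_small_user(line):
    """Parse one pipe-separated user.txt line into a dict."""
    user_id, age, gender, occupation, zip_code = line.rstrip("\n").split("|")
    return {
        "user_id": int(user_id),
        "age": int(age),
        "gender": gender,          # "M" or "F"
        "occupation": occupation,
        "zip_code": zip_code,      # kept as text: leading zeros matter
    }

# Illustrative sample lines in the documented formats.
print(parse_small_rating("196\t242\t3\t881250949"))
print(parse_small_user("1|24|M|technician|85711"))
```

Because ratings.txt is not ordered, any per-user or per-item processing would need to group or sort the parsed tuples first.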

MovieLens Dataset (Million)

#Users = 6,040; ID ranges between 1 and 6040
#Items = 3,883; ID ranges between 1 and 3952
#Ratings = 1,000,209
Each user has at least 20 ratings
3,706 of the 3,883 items have at least 1 rating
Ratings are integers on a scale from 1 to 5
Date format: Unix timestamp (seconds since 1970-01-01 UTC).

Original Files

movies.txt
MovieID::Title consistent with IMDB[%s] (year[%4d])::Genre1[%s]|Genre2[%s]|...
ordered by MovieID ASC.
user.txt
UserID::Gender[M|F]::Age[1|18|..|50|56]::Occupation[%d:0..20]::Zip-code[%5s]
ordered by UserID ASC.
ratings.txt
UserID::MovieID::Rating::Timestamp
ordered by UserID ASC.
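
The "::"-separated formats above can be split in the same way, with the genre list split a second time on "|". This is a sketch under assumptions: the function names and sample lines are hypothetical, and the year is taken from a trailing "(YYYY)" in the title when one is present.

```python
import re

def parse_million_movie(line):
    """Parse 'MovieID::Title (year)::Genre1|Genre2|...' into its parts."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    # Pull a trailing "(YYYY)" off the title; keep the title unchanged if absent.
    match = re.fullmatch(r"(.*) \((\d{4})\)", title)
    if match:
        title, year = match.group(1), int(match.group(2))
    else:
        year = None
    return int(movie_id), title, year, genres.split("|")

def parse_million_rating(line):
    """Parse 'UserID::MovieID::Rating::Timestamp' into four integers."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)

# Illustrative sample lines in the documented formats.
print(parse_million_movie("1::Toy Story (1995)::Animation|Children's|Comedy"))
print(parse_million_rating("1::1193::5::978300760"))
```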

Netflix Dataset

Number of Users: 480,189, with IDs ranging from 1 to 2,649,429
Number of Items: 17,770
Number of Ratings: 100,480,507
Dates have the format YYYY-MM-DD
The ratings are integers from 1 to 5.
The data were collected between Oct 1998 and Dec 2005 and reflect the distribution of all ratings received during this period.
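
The counts above determine the mean number of ratings per item and per user. The short check below is only an arithmetic illustration; the variable names are chosen for this example.

```python
# Totals as listed for the Netflix dataset.
num_users = 480_189
num_items = 17_770
num_ratings = 100_480_507

# Mean ratings per item and per user follow directly from the totals.
avg_per_item = num_ratings / num_items
avg_per_user = num_ratings / num_users

print("average ratings per item: %.6f" % avg_per_item)
print("average ratings per user: %.6f" % avg_per_user)
```

Since the user IDs run up to 2,649,429 while only 480,189 users exist, the IDs are sparse and should not be used directly as array indices without remapping.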

Item Rating Files

Directory training_set_item_sorted_marked
Inner File name format: %07d.txt
Inner File format:
First Line: item details
Following Lines: userID[%d],Rating[%d],YYYY-MM-DD
Statistics:
The number of ratings per item ranges from 3 to 232,944.
The average number of ratings per item is 5654.502364.
Modifications: the first line holds the item details; entries are sorted by user ID; probe entries are marked.

User Rating Files

Directory training_set_user_sorted
Inner File Name Format: %07d.txt
Inner File Format:
First Line: user details
Following Lines: ItemID[%d],Rating[%d],YYYY-MM-DD
Statistics:
The number of ratings per user ranges from 1 to 17,653.
The average number of ratings per user is 209.251997.
Modification: entries are sorted by item ID.
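
A per-user file as described above can be located and read as follows. This is a sketch under assumptions: the number in the file name is taken to be the user ID, the helper names are hypothetical, and the first line is returned as raw text because its layout is not specified here.

```python
import os
from datetime import date

def user_file_path(user_id, root="training_set_user_sorted"):
    """Build the path of a user's file, assuming the %07d number is the user ID."""
    return os.path.join(root, "%07d.txt" % user_id)

def parse_user_file(lines):
    """Split a per-user file into its header line and a list of rating tuples."""
    it = iter(lines)
    header = next(it).rstrip("\n")  # first line: user details, kept as raw text
    ratings = []
    for line in it:
        line = line.strip()
        if not line:
            continue
        item_id, rating, day = line.split(",")
        ratings.append((int(item_id), int(rating), date.fromisoformat(day)))
    return header, ratings

# Illustrative content; a real call would pass an open file object instead.
sample = ["user details\n", "30,3,2004-09-15\n", "175,5,2005-01-02\n"]
print(user_file_path(6))
print(parse_user_file(sample))
```

The same loop would apply to the per-item files with userID in the first column, except that the probe markers there are not described in enough detail to parse and are left out of this sketch.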

Test Files

probe_complete.txt
sorted by itemID, then userID.
Format: item description [itemID:] followed by ratings [userID,rating,YYYY-MM-DD]
Used by class NetFlickerProbePartitioner
#Ratings ≈ 1,415,334
probe_complete_patched.txt
same as probe_complete.txt, but the item descriptions are complete: [itemID,[YYYY||null],name:]
qualifying.txt
sorted by itemID but not userID.
Format: item ID [itemID:] followed by user IDs [userID,YYYY-MM-DD]
#Ratings = 2,817,131
#Items = 17,470
#Users = 478,615
Test ratings per item: min=1, max=23,826, avg=161.2554.
Test ratings per user: min=1, max=9, avg=5.886.
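
The grouped layout of qualifying.txt (an "itemID:" line followed by "userID,YYYY-MM-DD" lines) can be flattened with a small generator. The sketch below uses a hypothetical function name and made-up sample lines; probe_complete.txt would need one extra column for the rating.

```python
from collections import Counter

def iter_qualifying(lines):
    """Yield (item_id, user_id, date_string) from 'itemID:' grouped lines."""
    item_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A new item block starts; remember its ID for the lines that follow.
            item_id = int(line[:-1])
        else:
            if item_id is None:
                raise ValueError("rating line found before any item header")
            user_id, day = line.split(",")
            yield item_id, int(user_id), day

# Made-up sample in the documented layout.
sample = ["1:", "1046323,2005-12-19", "1080030,2005-12-23", "10:", "12,2005-01-01"]
pairs = list(iter_qualifying(sample))
per_item = Counter(item for item, _, _ in pairs)
print(pairs)
print(per_item)
```

Counting the yielded tuples per item or per user in this way is one route to min, max and average figures like those listed above.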

Result Files

result_tmp/computational_cache/user_files.txt
Used by class CachedUserData.
Format: [UserID, votingHabit(%.4f)] followed by Similar Users in the form of [SimilarUsersID, similarity(%.4f)], ordered by decreasing similarity.
result_tmp/computational_result/
Used by class Evaluator

Other Files

gaint_file.txt
movie_titles.txt
Movie_ID,YYYY[%4d],Movie Title[%s]
ordered by Movie_ID ASC.
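
When reading movie_titles.txt, a title may itself contain commas, so only the first two commas should be treated as separators. A minimal sketch, assuming a hypothetical function name and that a missing year appears as non-numeric text such as NULL:

```python
def parse_title_line(line):
    """Parse 'Movie_ID,YYYY,Movie Title' while keeping commas inside the title."""
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    # Assumption: a missing year is written as non-numeric text (e.g. NULL).
    return int(movie_id), (int(year) if year.isdigit() else None), title

# Made-up sample lines in the documented format.
print(parse_title_line("1,2003,Dinosaur Planet"))
print(parse_title_line("2,NULL,A Title, With Commas"))
```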