Datasets
From Pigbert Wiki
MovieLens Dataset (Small)
- #Users = 943; Numbered consecutively from 1
- #Items = 1682; Numbered consecutively from 1
- #Ratings = 100,000
- Each user has at least 20 ratings
- Ratings are integers on a scale of 1 to 5
- Date format: Unix timestamp (seconds since the epoch).
Original Files
- movies.txt
  - Format: MovieID|Title[%s] (year[%4d])|release date[d-M-Y]|video release date[null]|IMDB URL[http://%s]|Genre1[0|1]|...|Genre20[0|1]
  - Ordered by MovieID ASC.
- user.txt (u.user)
  - Format: UserID|age[%d]|Gender[M|F]|Occupation[%s]|Zip-code[%5s]
  - Ordered by UserID ASC.
- ratings.txt (u.data)
  - Format: UserID \t MovieID \t Rating \t Timestamp
  - Not ordered.
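The three layouts above can be read with plain string splitting. The following is a minimal sketch, assuming one record per line and no escaping of the separators; the function names are illustrative and do not come from the project code.

```python
def parse_rating(line):
    """Parse one ratings.txt line: UserID \\t MovieID \\t Rating \\t Timestamp."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


def parse_user(line):
    """Parse one user.txt line: UserID|age|Gender|Occupation|Zip-code."""
    user_id, age, gender, occupation, zip_code = line.rstrip("\n").split("|")
    # The zip code stays a string: it is formatted as %5s, not as a number.
    return int(user_id), int(age), gender, occupation, zip_code


def parse_movie(line):
    """Parse one movies.txt line; every field after the URL is a 0/1 genre flag."""
    fields = line.rstrip("\n").split("|")
    movie_id, title, release_date, video_release_date, imdb_url = fields[:5]
    genre_flags = [int(flag) for flag in fields[5:]]
    return (int(movie_id), title, release_date,
            video_release_date, imdb_url, genre_flags)


def load_ratings(lines):
    """Parse all rating lines and sort them, because the file is unordered."""
    return sorted(parse_rating(line) for line in lines if line.strip())
```

Sorting in `load_ratings` orders the tuples by user ID, then movie ID, which gives the ratings file the same ordering guarantee the other two files already have.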
MovieLens Dataset (Million)
- #Users = 6,040; ID ranges between 1 and 6040
- #Items = 3,883; ID ranges between 1 and 3952 (not every ID in the range is used)
- #Ratings = 1,000,209
- Each user has at least 20 ratings
- 3,706 of the 3,883 items have at least 1 rating
- Ratings are integers on a scale of 1 to 5
- Date format: Unix timestamp (seconds since the epoch).
Original Files
- movies.txt
  - Format: MovieID::Title consistent with IMDB[%s] (year[%4d])::Genre1[%s]|Genre2[%s]|...
  - Ordered by MovieID ASC.
- user.txt
  - Format: UserID::Gender[M|F]::Age[1|18|..|50|56]::Occupation[%d:0..20]::Zip-code[%5s]
  - Ordered by UserID ASC.
- ratings.txt
  - Format: UserID::MovieID::Rating::Timestamp
  - Ordered by UserID ASC.
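The `::`-separated layouts can be parsed the same way, and because ratings.txt is ordered by UserID, the ratings of one user can be collected in a single pass. This is a sketch under the assumption that no title contains `::`; all function names are hypothetical.

```python
from itertools import groupby


def parse_movie_1m(line):
    """Parse MovieID::Title (year)::Genre1|Genre2|..."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    return int(movie_id), title, genres.split("|")


def parse_user_1m(line):
    """Parse UserID::Gender::Age::Occupation::Zip-code."""
    user_id, gender, age, occupation, zip_code = line.rstrip("\n").split("::")
    # Age and occupation are numeric codes; the zip code stays a string.
    return int(user_id), gender, int(age), int(occupation), zip_code


def parse_rating_1m(line):
    """Parse UserID::MovieID::Rating::Timestamp."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


def ratings_by_user(lines):
    """Yield (user_id, ratings) pairs; relies on the file being ordered by UserID."""
    parsed = (parse_rating_1m(line) for line in lines if line.strip())
    for user_id, group in groupby(parsed, key=lambda r: r[0]):
        yield user_id, list(group)
```

`groupby` only merges adjacent records, so `ratings_by_user` would split a user into several groups if the input were not ordered by UserID.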
Netflix Dataset
- Number of Users: 480,189, with IDs ranging from 1 to 2,649,429 (with gaps)
- Number of Items: 17,770
- Number of Ratings: 100,480,507
- Dates have the format YYYY-MM-DD
- The ratings are integers from 1 to 5.
- The data were collected between Oct 1998 and Dec 2005 and reflect the distribution of all ratings received during this period.
Item Rating Files
- Directory: training_set_item_sorted_marked
- Inner file name format: %07d.txt
- Inner file format:
  - First line: item details
  - Following lines: userID[%d],Rating[%d],YYYY-MM-DD
- Statistics:
  - The number of ratings per item ranges from 3 to 232,944.
  - The average number of ratings per item is 5654.502364.
- Modifications: the first line carries the item details; the ratings are sorted by user ID and probe entries are marked.
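Reading one of these per-item files could look like the sketch below. It assumes the header line is kept as raw text, since its exact layout is not given here, and that each rating line starts with the three comma-separated fields listed above. How probe entries are marked is not specified, so any extra trailing fields are returned unparsed rather than interpreted. The function names are hypothetical.

```python
from datetime import date
from pathlib import Path


def item_file_path(directory, item_id):
    """Build the per-item file path using the %07d.txt naming scheme."""
    return Path(directory) / ("%07d.txt" % item_id)


def parse_item_file(lines):
    """Return (header, ratings) for one item file given as an iterable of lines."""
    lines = iter(lines)
    header = next(lines).rstrip("\n")  # first line: item details, kept as-is
    ratings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: any probe marker, if present, follows the three known fields.
        user_id, rating, day, *extra = line.split(",")
        ratings.append((int(user_id), int(rating), date.fromisoformat(day), extra))
    return header, ratings
```

A file would be read with something like `parse_item_file(item_file_path("training_set_item_sorted_marked", 1).open())`. The user-sorted files have the same shape, with an item ID in the first field of each line.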
User Rating Files
- Directory: training_set_user_sorted
- Inner file name format: %07d.txt
- Inner file format:
  - First line: user details
  - Following lines: ItemID[%d],Rating[%d],YYYY-MM-DD
- Statistics:
  - The number of ratings per user ranges from 1 to 17,653.
  - The average number of ratings per user is 209.251997.
- Modification: the ratings are sorted by item ID.
Test Files
- probe_complete.txt
  - Sorted by itemID, then userID.
  - Format: item description [itemID:] followed by ratings [userID,rating,YYYY-MM-DD]
  - Used by class NetFlickerProbePartitioner
  - #Ratings ≈ 1,415,334
- probe_complete_patched.txt
  - Same as probe_complete.txt, but the item descriptions are complete: [itemID,[YYYY||null],name:]
- qualifying.txt
  - Sorted by itemID but not by userID.
  - Format: item ID [itemID:] followed by user IDs [userID,YYYY-MM-DD]
  - #Ratings = 2,817,131
  - #Items = 17,470
  - #Users = 478,615
  - Test ratings per item: min = 1, max = 23,826, avg = 161.2554
  - Test ratings per user: min = 1, max = 9, avg = 5.886
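All three test files share a block layout: a header line ending in `:` followed by the entries for that item. The sketch below walks such a file and recomputes per-item statistics of the kind listed above. It assumes that only header lines end with a colon and that the item ID is the first comma-separated field of the header; the function names are illustrative.

```python
from collections import Counter


def parse_blocks(lines):
    """Yield (item_id, fields) for every entry line under an item header."""
    item_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Header: "itemID:" or, in the patched file, "itemID,year,name:".
            item_id = int(line[:-1].split(",", 1)[0])
        else:
            if item_id is None:
                raise ValueError("entry found before any item header")
            # Probe entries have 3 fields, qualifying entries have 2.
            yield item_id, line.split(",")


def per_item_stats(lines):
    """Return (min, max, average) number of entries per item."""
    counts = Counter(item_id for item_id, _ in parse_blocks(lines))
    values = list(counts.values())
    return min(values), max(values), sum(values) / len(values)
```

Counting on the user ID field instead (`fields[0]`) would give the corresponding per-user figures.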
Result Files
- result_tmp/computational_cache/user_files.txt
  - Used by class CachedUserData.
  - Format: [UserID, votingHabit(%.4f)] followed by the similar users as [SimilarUsersID, similarity(%.4f)], ordered by decreasing similarity.
- result_tmp/computational_result/
  - Used by class Evaluator.
Other Files
- gaint_file.txt
- movie_titles.txt
  - Format: Movie_ID,YYYY[%4d],Movie Title[%s]
  - Ordered by Movie_ID ASC.
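Because a movie title may itself contain commas, a movie_titles.txt line is best split on the first two commas only. This is a minimal sketch; treating a non-numeric year as missing is an assumption (suggested by the `[YYYY||null]` note for the patched probe file), and the function name is hypothetical.

```python
def parse_movie_title(line):
    """Parse Movie_ID,YYYY,Movie Title; everything after the second comma is the title."""
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    # Assumption: a year that is not all digits (e.g. a null marker) means "unknown".
    return int(movie_id), int(year) if year.isdigit() else None, title
```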
