Two datasets in 7 text files, downloadable:
a) Training dataset : some fields are in the file rec_log_train.txt
b) Testing dataset: some fields are in the file rec_log_test.txt
Format of the above 2 files:
(UserId)\t(ItemId)\t(Result)\t(Unix-timestamp)
Result: the value is 1 or -1, where 1 means the user UserId accepts the recommendation of item ItemId and follows it (i.e., adds it to his/her social network), and -1 means the user rejects the recommended item.
We provide the true values of the ‘Result’ field in rec_log_train.txt, whereas in rec_log_test.txt, the true values of the ‘Result’ field are withheld (for simplicity, in the file they are always -1).
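A minimal sketch of parsing one line of rec_log_train.txt / rec_log_test.txt in Python, assuming lines are well-formed and tab-delimited as documented (the function name and sample values are illustrative, not from the dataset):

```python
def parse_rec_log_line(line):
    """Split one tab-delimited rec_log line into typed fields.

    Returns (user_id, item_id, result, unix_timestamp);
    result is 1 (accepted) or -1 (rejected / withheld in the test file).
    """
    user_id, item_id, result, timestamp = line.rstrip("\n").split("\t")
    return int(user_id), int(item_id), int(result), int(timestamp)

# Illustrative line in the documented format:
record = parse_rec_log_line("2088948\t1760350\t-1\t1318348785")
```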
c) More fields of the training and the testing datasets about the user and the item are in the following 5 files:
i. User profile data: user_profile.txt
Each line contains the following information about a user: the year of birth, the gender, the number of tweets, and the tag IDs. Note that information about the users to be recommended is also in this file.
Format:
(UserId)\t(Year-of-birth)\t(Gender)\t(Number-of-tweet)\t(Tag-Ids)
Year of birth is selected by user when he/she registered.
Gender has an integer value of 0, 1, or 2, which represents “unknown”, “male”, or “female”, respectively.
Number-of-tweet is an integer giving the number of tweets the user has posted.
Tags are selected by users to represent their interests. For example, a user who likes mountain climbing and swimming may select "mountain climbing" or "swimming" as a tag. Some users select no tags at all. The original natural-language tags are not used here; each unique tag is encoded as a unique integer.
Tag-Ids are in the form “tag-id1;tag-id2;...;tag-idN”. If a user doesn’t have tags, Tag-Ids will be "0".
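A sketch of parsing one user_profile.txt line in Python, including the documented special case where Tag-Ids is "0" for users with no tags (the function name is illustrative; this assumes all five fields are present and numeric):

```python
def parse_user_profile_line(line):
    """Parse a user_profile.txt line into
    (user_id, birth_year, gender, n_tweets, tag_ids).

    Per the format spec, a Tag-Ids field of "0" means the user
    selected no tags, so an empty list is returned in that case.
    """
    uid, birth, gender, n_tweets, tag_ids = line.rstrip("\n").split("\t")
    tags = [] if tag_ids == "0" else [int(t) for t in tag_ids.split(";")]
    return int(uid), int(birth), int(gender), int(n_tweets), tags
```

Real data may contain malformed rows, so production code would want error handling around the `int` conversions.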
ii. Item data: item.txt
Each line contains the following information of an item: its category and keywords.
Format:
(ItemId)\t(Item-Category)\t(Item-Keyword)
Item-Category is a string “a.b.c.d”, where the categories in the hierarchy are delimited by the character “.”, ordered top-down (i.e., category ‘a’ is the parent of category ‘b’, category ‘b’ is the parent of category ‘c’, and so on).
Item-Keyword contains the keywords extracted from the corresponding Weibo profile of the person, organization, or group. The format is a string “id1;id2;…;idN”, where each unique keyword is encoded as a unique integer so that no real term is revealed.
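Since the category string encodes a top-down hierarchy, a useful feature-engineering step is expanding it into all of its ancestor prefixes. A sketch (the function name is illustrative):

```python
def category_ancestors(item_category):
    """Expand an "a.b.c.d" category string into all its prefixes,
    most general first, e.g. "1.2.3" -> ["1", "1.2", "1.2.3"].

    Each prefix is itself a valid category at some level of the
    hierarchy, which allows matching items at coarser granularity.
    """
    parts = item_category.split(".")
    return [".".join(parts[:i + 1]) for i in range(len(parts))]
```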
iii. User action data: user_action.txt
The file user_action.txt contains statistics on the ‘at’ (@), retweet, and comment actions between users over a certain number of recent days.
Format:
(UserId)\t(Action-Destination-UserId)\t(Number-of-at-action)\t(Number-of-retweet)\t(Number-of-comment)
If user A wants to notify another user about his/her tweet/retweet/comment, he/she uses an ‘at’ (@) action, such as ‘@tiger’ (here the user to be notified is ‘tiger’).
For example, if user A has “at”-ed user B 3 times, retweeted B 5 times, and commented on B 6 times, then user_action.txt contains the line “A B 3 5 6” (tab-delimited, in the field order above).
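A sketch of parsing one user_action.txt line in Python, keeping the documented field order (at count, then retweet count, then comment count); the function name is illustrative:

```python
def parse_user_action_line(line):
    """Parse a user_action.txt line into
    (user_id, dest_user_id, n_at, n_retweet, n_comment).

    Field order follows the documented format string.
    """
    src, dst, n_at, n_retweet, n_comment = line.rstrip("\n").split("\t")
    return int(src), int(dst), int(n_at), int(n_retweet), int(n_comment)
```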
iv. User sns data: user_sns.txt
The file user_sns.txt contains each user’s follow history (i.e., the history of following another user). Note that the following relationship can be reciprocal.
Format:
(Follower-userid)\t(Followee-userid)
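Because the follow relationship can be reciprocal, one natural derived feature is the set of mutual-follow pairs. A sketch over a list of (follower, followee) edges (the function name is illustrative):

```python
def reciprocal_pairs(edges):
    """Given (follower, followee) edges from user_sns.txt, return the
    set of mutual-follow pairs, each reported once as a sorted tuple.
    """
    edge_set = set(edges)
    return {tuple(sorted(e)) for e in edge_set if (e[1], e[0]) in edge_set}
```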
v. User keyword data: user_key_word.txt
The file user_key_word.txt contains the keywords extracted from the tweet/retweet/comment by each user.
Format:
(UserId)\t(Keywords)
Keywords is in the form “kw1:weight1;kw2:weight2;…;kwN:weightN”.
Keywords are extracted from each user’s tweets/retweets/comments and can be used as features to better represent the user in your prediction model. The greater the weight, the more interested the user is in the keyword.
Every keyword is encoded as a unique integer, and the keywords of the users are from the same vocabulary as the Item-Keyword.
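A sketch of parsing the weighted Keywords field into a dictionary in Python (the function name is illustrative; weights are assumed to be decimal numbers):

```python
def parse_user_keywords(keywords_field):
    """Turn a "kw1:w1;kw2:w2;..." string into {keyword_id: weight}.

    Keyword IDs share the Item-Keyword vocabulary, so the same
    integer IDs can be matched directly against item keywords.
    """
    weights = {}
    for pair in keywords_field.split(";"):
        kw, weight = pair.split(":")
        weights[int(kw)] = float(weight)
    return weights
```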
EVALUATION
Teams are to submit a result file for the testing dataset in text format, in which each line contains 3 tab-delimited fields, (UserId)\t(ItemId)\t(Result): for each user UserId and item ItemId in the testing dataset, the user’s predicted action upon recommendation of the item (Result = 1 or -1).
Teams’ scores and ranks on the leaderboard are based on a metric computed from the predicted results in the submitted file against the held-out ground truth of a validation dataset: a fixed set of instances randomly sampled from the testing dataset at the beginning of the competition. On the last day of the competition (June 1, 2012), the leaderboard scores and ranks are instead computed against the rest of the testing dataset. The top-3 ranked teams on the leaderboard when the competition ends are the winners.
The evaluation metric is average precision. For a detailed definition of the metric, please refer to the tab ‘Evaluation’.
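The official definition is on the ‘Evaluation’ tab; purely as an illustrative sketch, one common form of per-user average precision truncated at k looks like this (function name, the cutoff parameter, and the handling of edge cases are assumptions, not the official metric):

```python
def average_precision(ranked_items, relevant, k=3):
    """One common average-precision-at-k definition (illustrative only;
    the competition's official definition may differ).

    ranked_items: item IDs in predicted order of confidence.
    relevant: set of item IDs the user actually accepted.
    """
    hits, score = 0, 0.0
    for i, item in enumerate(ranked_items[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)   # precision at this rank
    return score / min(len(relevant), k) if relevant else 0.0
```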
PRIZES
The prizes for the 1st, 2nd, and 3rd place winners of Task 1 are US $5,000, $2,000, and $1,000, respectively.
The "Date Started" below refers to the release of the competition descriptions on February 20, 2012. The data will not be released until March 1, and entries will not be enabled until March 15.