Old Coder's Notes (老码农分享) http://blog.sciencenet.cn/u/seawan // typing, reading, and just passing through;


The 2012 KDD Cup competition tasks have been released

Read 4666 times | 2012-2-23 11:51 | Personal category: Weibo | System category: Research notes | 2012, KDDCUP

Address:
Tencent has probably put up quite a bit of money, so the advertisement comes first: 200M users, 40M messages per day.

Goal:
[To keep the many QQ users from suffering information overload,
there must be a computational method for discovering where a user's interests lie.]

Task 1:
Predict whether a given user will follow a given item.

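Dataset:
6k items (Weibo users), each belonging to one or more categories;
their tweet action is posting a weibo; posted tweets are visible to followers;
a retweet is a repost;
the dataset is large: "It is of a larger scale compared to other publicly available datasets ever released."
It consists mainly of records of recommendations, retweets, and so on.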

Two datasets in 7 text files, downloadable:

a)      Training dataset : some fields are in the file rec_log_train.txt

b)      Testing dataset: some fields are in the file rec_log_test.txt

Format of the above 2 files:

(UserId)\t(ItemId)\t(Result)\t(Unix-timestamp)

Result: values are 1 or -1, where 1 represents the user UserId accepts the recommendation of item ItemId and follows it (i.e., adds it to his/her social network), and -1 represents the user rejects the recommended item.

We provide the true values of the ‘Result’ field in rec_log_train.txt, whereas in  rec_log_test.txt, the true values of the ‘Result’ field are withheld (for simplicity, in the file they are always -1).
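The record format above can be read with a few lines of Python. This is only a minimal sketch, assuming the four fields are separated by tab characters; the `RecLog` type, the `parse_rec_log_line` helper, and the sample values are made up for illustration and are not part of the competition material.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class RecLog(NamedTuple):
    user_id: str
    item_id: str
    result: int          # 1 = followed, -1 = rejected (always -1 in the test file)
    timestamp: datetime  # Unix timestamp converted to UTC


def parse_rec_log_line(line: str) -> RecLog:
    """Parse one line of rec_log_train.txt or rec_log_test.txt (assumed tab-separated)."""
    user_id, item_id, result, ts = line.rstrip("\n").split("\t")
    return RecLog(user_id, item_id, int(result),
                  datetime.fromtimestamp(int(ts), tz=timezone.utc))


# Made-up sample record, not taken from the real data.
rec = parse_rec_log_line("1001\t2002\t1\t1318348785\n")
```

Because the test file always stores -1 in the Result field, only the training file's `result` values can serve as labels.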

c)      More fields of the training and the testing datasets about the user and the item are in the following 5 files:

          i.              User profile data: user_profile.txt

Each line contains the following information of a user: the year of birth, the gender, the number of tweets and the tag-Ids. It is important to note that information about the users to be recommended is also in this file.

Format:

(UserId)\t(Year-of-birth)\t(Gender)\t(Number-of-tweet)\t(Tag-Ids)

Year of birth is selected by the user at registration.

Gender has an integer value of 0, 1, or 2, which represents “unknown”, “male”, or “female”, respectively.

Number-of-tweet is an integer that represents the number of tweets the user has posted.

Tags are selected by users to represent their interests. If a user likes mountain climbing and swimming, he/she may select "mountain climbing" or "swimming" as his/her tag. Some users select nothing. The original natural-language tags are not used here; each unique tag is encoded as a unique integer.

Tag-Ids are in the form “tag-id1;tag-id2;...;tag-idN”. If a user doesn’t have tags, Tag-Ids will be "0".
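A user profile line could be parsed roughly as follows. This is a sketch under the assumption that fields are tab-separated; the names `UserProfile`, `GENDER`, and `parse_user_profile_line` and the sample lines are hypothetical. Year of birth is kept as a string here because it is user-entered and may not always be a clean number.

```python
from typing import List, NamedTuple

# Gender codes as described in the text above.
GENDER = {0: "unknown", 1: "male", 2: "female"}


class UserProfile(NamedTuple):
    user_id: str
    year_of_birth: str   # kept as text: user-entered, possibly unreliable (assumption)
    gender: str
    num_tweets: int
    tag_ids: List[int]


def parse_user_profile_line(line: str) -> UserProfile:
    """Parse one line of user_profile.txt (assumed tab-separated)."""
    user_id, yob, gender, n_tweets, tags = line.rstrip("\n").split("\t")
    # A Tag-Ids value of "0" means the user selected no tags.
    tag_ids = [] if tags == "0" else [int(t) for t in tags.split(";")]
    return UserProfile(user_id, yob, GENDER[int(gender)], int(n_tweets), tag_ids)


# Made-up sample lines.
with_tags = parse_user_profile_line("1001\t1985\t2\t37\t12;45;7")
no_tags = parse_user_profile_line("1002\t1990\t0\t0\t0")
```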

        ii.              Item data: item.txt

Each line contains the following information of an item: its category and keywords.

Format:

(ItemId)\t(Item-Category)\t(Item-Keyword)

Item-Category is a string “a.b.c.d”, where the categories in the hierarchy are delimited by the character “.”, ordered in top-down fashion (i.e., category ‘a’ is a parent category of ‘b’, category ‘b’ is a parent category of ‘c’, and so on).

Item-Keyword contains the keywords extracted from the corresponding Weibo profile of the person, organization, or group. The format is a string “id1;id2;…;idN”, where each unique keyword is encoded as a unique integer such that no real term is revealed.
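One way to use the category hierarchy is to expand “a.b.c.d” into all of its ancestor prefixes, so that two items can be compared at any level of the tree. The sketch below assumes tab-separated fields; `parse_item_line` and the sample values are illustrative only.

```python
def parse_item_line(line):
    """Parse one line of item.txt (assumed tab-separated).

    Returns (item_id, category_prefixes, keyword_ids), where
    category_prefixes lists every level from the root down,
    e.g. "a.b.c" -> ["a", "a.b", "a.b.c"].
    """
    item_id, category, keywords = line.rstrip("\n").split("\t")
    levels = category.split(".")
    prefixes = [".".join(levels[:i]) for i in range(1, len(levels) + 1)]
    keyword_ids = [int(k) for k in keywords.split(";") if k]
    return item_id, prefixes, keyword_ids


# Made-up sample line.
item_id, prefixes, keyword_ids = parse_item_line("2002\t1.4.2.9\t101;205;33")
```

Two items that share a long prefix sit close together in the hierarchy, which may be a useful similarity feature.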

      iii.              User action data: user_action.txt

The file user_action.txt contains the statistics about the ‘at’ (@) actions between the users in a certain number of recent days.

Format:

(UserId)\t(Action-Destination-UserId)\t(Number-of-at-action)\t(Number-of-retweet)\t(Number-of-comment)

If user A wants to notify another user about his/her tweet/retweet/comment, he/she would use an ‘at’ (@) action to notify the other user, such as ‘@tiger’ (here the user to be notified is ‘tiger’).

For example, if user A has retweeted user B 5 times, has “at”-ed B 3 times, and has commented on user B's posts 6 times, then there is one line “A   B     3     5     6” in user_action.txt.
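The column order in the example is easy to get wrong (at, then retweet, then comment), so here is a small sketch that parses the example line from the text. `UserAction` and `parse_user_action_line` are hypothetical names; splitting on any whitespace is a choice made here so that the example as printed above also parses.

```python
from typing import NamedTuple


class UserAction(NamedTuple):
    user_id: str
    dest_user_id: str
    n_at: int
    n_retweet: int
    n_comment: int


def parse_user_action_line(line: str) -> UserAction:
    """Parse one line of user_action.txt; fields split on any whitespace."""
    src, dst, n_at, n_retweet, n_comment = line.split()
    return UserAction(src, dst, int(n_at), int(n_retweet), int(n_comment))


# The example line from the task description.
action = parse_user_action_line("A   B     3     5     6")
```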

       iv.              User sns data: user_sns.txt

The file user_sns.txt contains each user’s follow history (i.e., the history of following another user). Note that the following relationship can be reciprocal.

Format:

(Follower-userid)\t(Followee-userid)
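The follow records form a directed graph, and a reciprocal relationship is simply a pair of edges in both directions. A minimal sketch, assuming tab-separated lines; the function names and sample edges are made up:

```python
from collections import defaultdict


def build_follow_graph(lines):
    """Map each follower id to the set of user ids it follows."""
    followees = defaultdict(set)
    for line in lines:
        follower, followee = line.rstrip("\n").split("\t")
        followees[follower].add(followee)
    return followees


def is_reciprocal(followees, a, b):
    """True if a follows b and b follows a."""
    # .get avoids creating empty entries in the defaultdict on lookup.
    return b in followees.get(a, set()) and a in followees.get(b, set())


# Made-up sample edges: A and B follow each other, A also follows C.
graph = build_follow_graph(["A\tB", "B\tA", "A\tC"])
```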

         v.              User key word data: user_key_word.txt

The file user_key_word.txt contains the keywords extracted from the tweet/retweet/comment by each user.

Format:

(UserId)\t(Keywords)

Keywords is in the form “kw1:weight1;kw2:weight2;…kw3:weight3”.

Keywords are extracted from the tweet/retweet/comment of a user, and can be used as features to better represent the user in your prediction model. The greater the weight, the more interested the user is in the keyword.

Every keyword is encoded as a unique integer, and the keywords of the users are from the same vocabulary as the Item-Keyword.
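Since user keywords and item keywords share one vocabulary, one simple feature is the total weight of a user's keywords that also appear on an item. This is just one possible feature, not something the task description prescribes; the function names and sample values are hypothetical, and tab-separated fields are assumed.

```python
def parse_user_keywords(line):
    """Parse one line of user_key_word.txt into (user_id, {keyword_id: weight})."""
    user_id, kws = line.rstrip("\n").split("\t")
    weights = {}
    for pair in kws.split(";"):
        if pair:  # tolerate a trailing ';'
            kw, w = pair.split(":")
            weights[int(kw)] = float(w)
    return user_id, weights


def keyword_overlap_score(user_weights, item_keyword_ids):
    """Sum the user's weights over the keywords that the item also has."""
    return sum(user_weights.get(k, 0.0) for k in set(item_keyword_ids))


# Made-up sample data.
uid, user_weights = parse_user_keywords("1001\t101:0.5;33:0.25;7:1.0")
score = keyword_overlap_score(user_weights, [101, 205, 33])
```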

EVALUATION

Teams are to submit a result file for the testing dataset in text format. Each line contains 3 tab-delimited fields, (UserId)\t(ItemId)\t(Result), giving, for each user UserId and item ItemId in the testing dataset, the user’s predicted action (Result = 1 or -1) upon recommendation of the item.
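Writing a file in the three-field format described above might look like this. It is a sketch only: `write_submission` is a hypothetical helper, the predictions are made up, and the official submission page should be checked for any further requirements.

```python
import io


def write_submission(predictions, out):
    """Write (user_id, item_id, result) triples as tab-delimited lines."""
    for user_id, item_id, result in predictions:
        if result not in (1, -1):
            raise ValueError(f"Result must be 1 or -1, got {result!r}")
        out.write(f"{user_id}\t{item_id}\t{result}\n")


# Made-up predictions written to an in-memory buffer; pass an open file instead
# to produce a real result file.
buf = io.StringIO()
write_submission([("1001", "2002", 1), ("1001", "2003", -1)], buf)
```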

Teams’ scores and ranks on the leaderboard are based on a metric calculated from the predicted results in the submitted result file and the held-out ground truth of a validation dataset, a fixed set of instances randomly sampled from the testing dataset at the beginning. This holds until the last day of the competition (June 1, 2012); from then on, the scores and associated ranks on the leaderboard are based on the predicted results and the ground truth of the rest of the testing dataset. This entails that the top-3 ranked teams on the leaderboard at the time when the competition ends are the winners.

The evaluation metric is average precision. For a detailed definition of the metric, please refer to the tab ‘Evaluation’.
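The official definition lives on the ‘Evaluation’ tab and is not reproduced in this post, so the following is only the common textbook form of average precision for one user's ranked list, not necessarily the exact contest formula (which may, for instance, truncate the list at some cutoff). The function name, the normalisation by the number of relevant items, and the sample data are assumptions.

```python
def average_precision(ranked_items, relevant):
    """Textbook average precision for one ranked list.

    ranked_items: item ids ordered from most to least confident.
    relevant:     set of item ids the user actually followed.
    """
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            total += hits / k  # precision at rank k, counted only at hits
    # Assumption: normalise by the number of relevant items.
    return total / len(relevant) if relevant else 0.0


# Made-up example: hits at ranks 1 and 3 give (1/1 + 2/3) / 2.
ap = average_precision(["a", "b", "c"], {"a", "c"})
```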

PRIZES

The prizes for the 1st, 2nd and 3rd winners for task 1 are US Dollars $5000, $2000, and $1000, respectively.

 

The "Date Started" below refers to the release of the competition descriptions on February 20, 2012.  The data will not be released until March 1, and entries will not be enabled until March 15.




https://blog.sciencenet.cn/blog-461456-540543.html
