1. Title of Database: Buzz prediction on Twitter 2. Sources: -- Creators : François Kawala (1,2) and Ahlame Douzal (1) and Eric Gaussier (1) and Eustache Diemert (2) -- Institutions : (1) Université Joseph Fourier (Grenoble I) Laboratoire d'informatique de Grenoble (LIG) (2) BestofMedia Group -- Donor: BestofMedia (ediemert@bestofmedia.com) -- Date: May, 2013 3. Past Usage: -- References : Predicting Buzz Magnitude in Social Media (in submission (ECML-PKDD 13)) -- Predicted attribute : Mean Number of active discussion (NAD). This attribute is a positive integer that describe the popularity of the instance's topic. It is stored is the rightmost column. -- Study results : The results achieved are acceptable, nevertheless the unbalanced nature of this dataset leaves some room for improvement. The data may be scaled and normalized. First order discrete difference over features may also be considered as additional features. Please see the reference to get further details regarding the results. 4. Relevant Information Paragraph: -- Observations : Each instance covers height weeks of observation for a specific topic (eg. overclocking...). The predicted attribute's mean value is computed on the the couple weeks following this initial observation. Observations are Independent and identically distributed. 5. Number of Instances -- Total number of instances : 38393 6. Number of Attributes -- Total number of attributes : 77. -- Time representation : Each instance is described by 77 features, those describe the evolution of 77 `primary features' through time. Hence each feature name is postfixed with the relative time of observation. For instance, the value of the feature `Nb_Active_Discussion' at time t is given in 'Nb_Active_Discussion_t'. 7. Attributes -- Number of Created Discussions (NCD) (columns [0,6]) -- Type : Numeric, integers only -- Description : This feature measures the number of discussions created at time step t and involving the instance's topic. -- Columns : From column 0 (NCD at relative time 0) to column 6 (NCD at relative time 6) -- Abbreviations : NCD_0, NCD_1, NCD_2, NCD_3, NCD_4, NCD_5, NCD_6 -- Statistics : +---------+-----+-------+---------+---------+ | feature | min | max | mean | std | +---------+-----+-------+---------+---------+ | NCD_0 | 0 | 24210 | 140.340 | 431.772 | +---------+-----+-------+---------+---------+ | NCD_1 | 0 | 29574 | 136.770 | 432.305 | +---------+-----+-------+---------+---------+ | NCD_2 | 0 | 37505 | 159.679 | 502.057 | +---------+-----+-------+---------+---------+ | NCD_3 | 0 | 72366 | 181.592 | 574.883 | +---------+-----+-------+---------+---------+ | NCD_4 | 0 | 79079 | 201.097 | 630.448 | +---------+-----+-------+---------+---------+ | NCD_5 | 0 | 79079 | 220.175 | 669.205 | +---------+-----+-------+---------+---------+ | NCD_6 | 0 | 79079 | 219.388 | 672.182 | +---------+-----+-------+---------+---------+ -- Author Increase (AI) (columns [7,13]) -- Type : Numeric, integers only -- Description : This featurethe number of new authors interacting on the instance's topic at time t (i.e. its popularity) -- Columns : From column 7 (AI at relative time 0) to column 13 (AI at relative time 6) -- Abbreviations : AI_0, AI_1, AI_2, AI_3, AI_4, AI_5, AI_6 -- Statistics : +---------+-----+-------+---------+---------+ | feature | min | max | mean | std | +---------+-----+-------+---------+---------+ | AI_0 | 0 | 18654 | 71.038 | 196.877 | +---------+-----+-------+---------+---------+ | AI_1 | 0 | 22035 | 69.830 | 202.200 | +---------+-----+-------+---------+---------+ | AI_2 | 0 | 29402 | 82.198 | 239.523 | +---------+-----+-------+---------+---------+ | AI_3 | 0 | 57617 | 93.775 | 278.892 | +---------+-----+-------+---------+---------+ | AI_4 | 0 | 63147 | 103.896 | 309.197 | +---------+-----+-------+---------+---------+ | AI_5 | 0 | 63147 | 113.611 | 326.077 | +---------+-----+-------+---------+---------+ | AI_6 | 0 | 63147 | 112.941 | 326.015 | +---------+-----+-------+---------+---------+ -- Attention Level (measured with number of authors) (AS(NA)) (columns [14,20]) -- Type : Numeric, real in [0,1] -- Description : This feature is a measure of the attention payed to a the instance's topic on a social media. -- Columns : From column 14 (AS(NA) at relative time 0) to column 20 (AS(NA) at relative time 6) -- Abbreviations : AS(NA)_0, AS(NA)_1, AS(NA)_2, AS(NA)_3, AS(NA)_4, AS(NA)_5, AS(NA)_6 -- Statistics : +----------+-----+-------+-------+-------+ | feature | min | max | mean | std | +----------+-----+-------+-------+-------+ | AS(NA)_0 | 0 | 0.025 | 0.000 | 0.001 | +----------+-----+-------+-------+-------+ | AS(NA)_1 | 0 | 0.029 | 0.000 | 0.001 | +----------+-----+-------+-------+-------+ | AS(NA)_2 | 0 | 0.040 | 0.000 | 0.001 | +----------+-----+-------+-------+-------+ | AS(NA)_3 | 0 | 0.081 | 0.000 | 0.001 | +----------+-----+-------+-------+-------+ | AS(NA)_4 | 0 | 0.095 | 0.000 | 0.001 | +----------+-----+-------+-------+-------+ | AS(NA)_5 | 0 | 0.095 | 0.000 | 0.001 | +----------+-----+-------+-------+-------+ | AS(NA)_6 | 0 | 0.095 | 0.000 | 0.001 | +----------+-----+-------+-------+-------+ -- Burstiness Level (BL) (columns [21,27]) -- Type : Numeric, defined on [0,1] -- Description : The burstiness level for a topic z at a time t is defined as the ratio of ncd and nad -- Columns : From column 21 (BL at relative time 0) to column 27 (BL at relative time 6) -- Abbreviations : BL_0, BL_1, BL_2, BL_3, BL_4, BL_5, BL_6 -- Statistics : +---------+-----+-----+-------+-------+ | feature | min | max | mean | std | +---------+-----+-----+-------+-------+ | BL_0 | 0 | 1 | 0.925 | 0.256 | +---------+-----+-----+-------+-------+ | BL_1 | 0 | 1 | 0.918 | 0.268 | +---------+-----+-----+-------+-------+ | BL_2 | 0 | 1 | 0.907 | 0.285 | +---------+-----+-----+-------+-------+ | BL_3 | 0 | 1 | 0.920 | 0.265 | +---------+-----+-----+-------+-------+ | BL_4 | 0 | 1 | 0.931 | 0.247 | +---------+-----+-----+-------+-------+ | BL_5 | 0 | 1 | 0.956 | 0.196 | +---------+-----+-----+-------+-------+ | BL_6 | 0 | 1 | 0.955 | 0.197 | +---------+-----+-----+-------+-------+ -- Number of Atomic Containers (NAC) (columns [28,34]) -- Type : Numeric, integer -- Description : This feature measures the total number of atomic containers generated through the whole social media on the instance's topic until time t. -- Columns : From column 28 (NAC at relative time 0) to column 34 (NAC at relative time 6) -- Abbreviations : NAC_0, NAC_1, NAC_2, NAC_3, NAC_4, NAC_5, NAC_6 -- Statistics : +---------+-----+-------+---------+---------+ | feature | min | max | mean | std | +---------+-----+-------+---------+---------+ | NAC_0 | 0 | 26644 | 150.662 | 453.689 | +---------+-----+-------+---------+---------+ | NAC_1 | 0 | 29574 | 146.642 | 452.039 | +---------+-----+-------+---------+---------+ | NAC_2 | 0 | 37505 | 170.721 | 523.801 | +---------+-----+-------+---------+---------+ | NAC_3 | 0 | 72397 | 193.786 | 598.287 | +---------+-----+-------+---------+---------+ | NAC_4 | 0 | 79136 | 214.196 | 654.468 | +---------+-----+-------+---------+---------+ | NAC_5 | 0 | 79136 | 234.218 | 694.165 | +---------+-----+-------+---------+---------+ | NAC_6 | 0 | 79136 | 233.236 | 696.419 | +---------+-----+-------+---------+---------+ -- Attention Level (measured with number of contributions) (AS(NAC)) (columns [35,41]) -- Type : Numeric, real in [0,1] -- Description : This feature is a measure of the attention payed to a the instance's topic on a social media. -- Columns : From column 35 (AS(NA) at relative time 0) to column 42 (AS(NAC) at relative time 6) -- Abbreviations : AS(NAC)_0, AS(NAC)_1, AS(NAC)_2, AS(NAC)_3, AS(NAC)_4, AS(NAC)_5, AS(NAC)_6 -- Statistics : +-----------+-----+-------+-------+-------+ | feature | min | max | mean | std | +-----------+-----+-------+-------+-------+ | AS(NAC)_0 | 0 | 0.022 | 0.000 | 0.000 | +-----------+-----+-------+-------+-------+ | AS(NAC)_1 | 0 | 0.022 | 0.000 | 0.000 | +-----------+-----+-------+-------+-------+ | AS(NAC)_2 | 0 | 0.022 | 0.000 | 0.000 | +-----------+-----+-------+-------+-------+ | AS(NAC)_3 | 0 | 0.040 | 0.000 | 0.000 | +-----------+-----+-------+-------+-------+ | AS(NAC)_4 | 0 | 0.046 | 0.000 | 0.000 | +-----------+-----+-------+-------+-------+ | AS(NAC)_5 | 0 | 0.046 | 0.000 | 0.000 | +-----------+-----+-------+-------+-------+ | AS(NAC)_6 | 0 | 0.046 | 0.000 | 0.000 | +-----------+-----+-------+-------+-------+ -- Contribution Sparseness (CS) (columns [42,48]) -- Type : Numeric, real in [0,1] -- Description : This feature is a measure of spreading of contributions over discussion for the instance's topic at time t. -- Columns : From column 42 (CS at relative time 0) to column 48 (CS at relative time 6) -- Abbreviations : CS_0, CS_1, CS_2, CS_3, CS_4, CS_5, CS_6 -- Statistics : +---------+-----+-----+-------+-------+ | feature | min | max | mean | std | +---------+-----+-----+-------+-------+ | CS_0 | 0 | 1 | 0.931 | 0.254 | +---------+-----+-----+-------+-------+ | CS_1 | 0 | 1 | 0.923 | 0.266 | +---------+-----+-----+-------+-------+ | CS_2 | 0 | 1 | 0.912 | 0.284 | +---------+-----+-----+-------+-------+ | CS_3 | 0 | 1 | 0.925 | 0.264 | +---------+-----+-----+-------+-------+ | CS_4 | 0 | 1 | 0.935 | 0.246 | +---------+-----+-----+-------+-------+ | CS_5 | 0 | 1 | 0.961 | 0.194 | +---------+-----+-----+-------+-------+ | CS_6 | 0 | 1 | 0.961 | 0.194 | +---------+-----+-----+-------+-------+ -- Author Interaction (AT) (columns [49,55]) -- Type : Numeric, integer. -- Description : This feature measures the average number of authors interacting on the instance's topic within a discussion. -- Columns : From column 49 (AT at relative time 0) to column 55 (AT at relative time 6) -- Abbreviations : AT_0, AT_1, AT_2, AT_3, AT_4, AT_5, AT_6 -- Statistics : +---------+-----+-----+-------+-------+ | feature | min | max | mean | std | +---------+-----+-----+-------+-------+ | AT_0 | 0 | 282 | 1.045 | 1.359 | +---------+-----+-----+-------+-------+ | AT_1 | 0 | 283 | 1.035 | 1.470 | +---------+-----+-----+-------+-------+ | AT_2 | 0 | 283 | 1.017 | 1.208 | +---------+-----+-----+-------+-------+ | AT_3 | 0 | 263 | 1.033 | 1.250 | +---------+-----+-----+-------+-------+ | AT_4 | 0 | 282 | 1.046 | 1.286 | +---------+-----+-----+-------+-------+ | AT_5 | 0 | 249 | 1.078 | 1.300 | +---------+-----+-----+-------+-------+ | AT_6 | 0 | 283 | 1.082 | 1.423 | +---------+-----+-----+-------+-------+ -- Number of Authors (NA) (columns [56,62]) -- Type : Numeric, integer. -- Description : This feature measures the number of authors interacting on the instance's topic at time t. -- Columns : From column 49 (NA at relative time 0) to column 55 (NA at relative time 6) -- Abbreviations : NA_0, NA_1, NA_2, NA_3, NA_4, NA_5, NA_6 -- Statistics : +---------+-----+-------+---------+---------+ | feature | min | max | mean | std | +---------+-----+-------+---------+---------+ | NA_0 | 0 | 21723 | 123.064 | 350.215 | +---------+-----+-------+---------+---------+ | NA_1 | 0 | 26134 | 119.580 | 347.236 | +---------+-----+-------+---------+---------+ | NA_2 | 0 | 34085 | 138.959 | 403.603 | +---------+-----+-------+---------+---------+ | NA_3 | 0 | 64984 | 157.424 | 460.884 | +---------+-----+-------+---------+---------+ | NA_4 | 0 | 72159 | 173.702 | 503.660 | +---------+-----+-------+---------+---------+ | NA_5 | 0 | 72159 | 189.521 | 532.315 | +---------+-----+-------+---------+---------+ | NA_6 | 0 | 72159 | 188.410 | 531.152 | +---------+-----+-------+---------+---------+ -- Average Discussions Length (ADL) (columns [63,69]) -- Type : Numeric, real. -- Description : This feature directly measures the average length of a discussion belonging to the instance's topic. -- Columns : From column 63 (ADL at relative time 0) to column 69 (ADL at relative time 6) -- Abbreviations : ADL_0, ADL_1, ADL_2, ADL_3, ADL_4, ADL_5, ADL_6 -- Statistics : +---------+-----+-----+-------+-------+ | feature | min | max | mean | std | +---------+-----+-----+-------+-------+ | ADL_0 | 0 | 294 | 1.095 | 1.488 | +---------+-----+-----+-------+-------+ | ADL_1 | 0 | 295 | 1.084 | 1.587 | +---------+-----+-----+-------+-------+ | ADL_2 | 0 | 295 | 1.067 | 1.316 | +---------+-----+-----+-------+-------+ | ADL_3 | 0 | 273 | 1.085 | 1.371 | +---------+-----+-----+-------+-------+ | ADL_4 | 0 | 294 | 1.100 | 1.406 | +---------+-----+-----+-------+-------+ | ADL_5 | 0 | 262 | 1.137 | 1.432 | +---------+-----+-----+-------+-------+ | ADL_6 | 0 | 295 | 1.140 | 1.552 | +---------+-----+-----+-------+-------+ -- Average Discussions Length (NAD) (columns [70,76]) -- Type : Numeric, integer. -- Description : This features measures the number of discussions involving the instance's topic until time t. -- Columns : From column 70 (NAD at relative time 0) to column 76 (NAD at relative time 6) -- Abbreviations : NAD_0, NAD_1, NAD_2, NAD_3, NAD_4, NAD_5, NAD_6 -- Statistics : +---------+-----+-------+---------+---------+ | feature | min | max | mean | std | +---------+-----+-------+---------+---------+ | NAD_0 | 0 | 24301 | 140.790 | 432.625 | +---------+-----+-------+---------+---------+ | NAD_1 | 0 | 29574 | 137.181 | 433.026 | +---------+-----+-------+---------+---------+ | NAD_2 | 0 | 37505 | 160.106 | 502.774 | +---------+-----+-------+---------+---------+ | NAD_3 | 0 | 72366 | 182.057 | 575.658 | +---------+-----+-------+---------+---------+ | NAD_4 | 0 | 79083 | 201.596 | 631.258 | +---------+-----+-------+---------+---------+ | NAD_5 | 0 | 79083 | 220.706 | 670.050 | +---------+-----+-------+---------+---------+ | NAD_6 | 0 | 79083 | 219.937 | 673.032 | +---------+-----+-------+---------+---------+ -- Annotation (column 77) -- Type : Numeric, real -- Description : See 3. and 4. -- Columns : 96 8. Missing Attribute Values: -- There is not any missing values. 9. Class Distribution: -- This is a regression task.