MusiCLEF 2011: tagging task -- description of the dataset
------------------------------------------------------------
------------------------------------------------------------

The dataset for the music tagging task is composed of 1355 different songs played by 217 different artists. Each song and each artist is identified by a unique identifier, "id_song" and "id_artist" respectively, which is used to name all the features related to that song or artist. The mapping details are in "ground_truth/songs.csv" and "ground_truth/artist.csv". The semantic vocabulary is composed of 94 distinct tags ("ground_truth/tags.csv").

All the data are archived in bzip2 format and organized in directories. The entire dataset (with the same file system structure) is in "all-data-tagging.tar.bz2". The data structure is explained in the following.

--- audio_features/ ---
It contains the MFCC and fbmel coefficients computed for each song.

--- ground_truth/ ---
It contains all the files describing the dataset. Participants must train their auto-taggers on the songs listed in "train.csv" and apply them to the songs listed in "test.csv". Each test song should be described by a vector of values, one per tag (in the same order as "tags.csv"), giving the relevance of each tag for that song.

--- lastfm/ ---
This directory contains all the data downloaded through the last.fm API. In particular, "lastfm_id.tar.bz2" contains the unique last.fm song IDs, whereas "lastfm_tags.tar.bz2" contains the social tags collected.

--- web_mined_data/ ---
This directory contains the Web-mined artist-related data in six different languages: English, German, Spanish, French, Italian, and Swedish. In particular, the Web page directories (google_crawl_*) for each artist include a list of URLs (urls.dat) and a slightly more detailed XML file (info.xml) containing the time stamp and query of the Google request and the approximate page count returned by Google.
The actual HTML files are stored in the subdirectory "pages".

The weight directories (weight-*) for each artist include the following files:

terms.txt: a list of all terms (corresponding to the order of the weights)
Global_DF.txt: document frequencies over all terms in terms.txt (calculated on virtual artist documents; one document comprises all pages of an artist)
Global_IDF.txt: IDF values over all terms in terms.txt (logarithmic formulation)
TF*.txt: for each artist, the raw term frequencies over all terms in terms.txt
TFIDF*.txt: for each artist, the TF-IDF values over all terms in terms.txt (log formulation for both TFs and IDFs)
aggregated.csv: all TF-IDF vectors for all artists in one file (each line corresponds to the TF-IDF vector of one artist)

For issues and requests, please contact us at: "musiclef@dei.unipd.it"
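
The exact logarithmic formulation behind the TFIDF*.txt and Global_IDF.txt files is not spelled out above. As a rough illustration only, the Python sketch below shows one common variant, assuming a TF weight of log(1 + tf) and an IDF weight of log(N / df), where N is the number of virtual artist documents (217 if every artist has one). The function name "log_tfidf" and both formulas are assumptions made for this example, so the values it produces are not guaranteed to match the distributed files.

```python
import math


def log_tfidf(term_freqs, doc_freqs, n_docs):
    """Return one TF-IDF weight per term for a single artist.

    term_freqs: raw term frequencies of the artist (same order as terms.txt)
    doc_freqs:  document frequencies of the terms (same order as terms.txt)
    n_docs:     number of virtual artist documents

    Assumed formulation (not confirmed by the dataset description):
        weight = log(1 + tf) * log(n_docs / df)
    """
    weights = []
    for tf, df in zip(term_freqs, doc_freqs):
        if tf == 0 or df == 0:
            # A term that never occurs contributes no weight.
            weights.append(0.0)
        else:
            weights.append(math.log(1 + tf) * math.log(n_docs / df))
    return weights


# Toy values for three terms; they are made up and not taken from the dataset.
example_weights = log_tfidf([0, 2, 5], [10, 1, 217], 217)
print(example_weights)
```

Under this assumed formulation, a term that occurs in the pages of every artist gets an IDF of zero and therefore a weight of zero, however frequent it is for a single artist.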