首页 > 语义网, 幻灯片 > 为国际会议建元数据(1)

为国际会议建元数据(1)

为国际会议建元数据:1,2,3 (更新中)

总结我为ISWC 2010做元数据负责人(metadata chair)的经验,兼RuleML 2011元数据展望。一直想做这事,一直懒。主要是为我自己查找方便,可能比较天马行空。因为很多原始资料是英文的,所以抱歉就不翻译了。

(1)背景和简述

ISWC 2010 Logo

ISWC=International Semantic Web Conference 国际语义网会议,是语义网界最主要的学术会议。另外有个IEEE International Symposium on Wearable Computers,也简称ISWC,有时会被搞混。

从2003年起,ISWC都指定专人负责收集和发布会议相关的元数据(metadata),比如论文,组委会,议程等。这个人被称为Metadata Chair。ISWC2010,我做这事。从2009年秋,做了将近一年,最后几个月花了比较多的时间。没有经验,错误很多,所以准备总结一下。做一个事件或者几百人规模的组织的元数据,可能有很多共性,所以这里的经验教训,也许对别人有用。

关于目的、手段、结果,这里是我在ISWC 2010 Lightening Talk上做的总结(就一页)。后面附文字版(英文的),算是个提纲。

——————————–

(Brief History) Nothing speaks louder for a technology than that it is used by its own advocates. That’s why each year at ISWC since 2003, there is a member in the organizing committee who is responsible for collecting metadata about the conference, represented in Semantic Web formats. The dataset is about papers, authors, organizers, organizations, sessions and other sub-events of the conference. The data is always published on the Semantic Web Dog Food website, data.semanticweb.org.

(Goal) This year, there are a couple of changes in the metadata work. The goal is that, instead of simply generating the dataset and let it sleep, we should get the data used, both on-site for the conference attendees, and for the whole community now and in the future. For this purpose, we significantly extended the scope of the work, and accordingly, the workflow of the metadata project.

(Metadata Committee) First, instead having only one or two metadata chairs, we recruited help from the community in the form a metadata committee. The committee is divided into 4 different types of tasks: data generation, linking data to other datasets, data consumption (that is, building applications based on the data), and integrating the dataset into the infra-structure of the conference (such as the website).

(Generation) The data generation part is essentially a data integration task of moderate size. Data are from diverse sources: proceeding data is from dump from the easychair submission site; people’s profile data are partly from eashchair, partly from papers themselves, and partly from manual input; some are from spreadsheets. We tried to reuse data as much as we can, such as previous year’s ISWC data and people’s foaf files. We also tried natural language processing and data mining approaches to extract semantic data from unstructured data, such as keywords, paper’s structure and citation data are mined from pdfs. However, there are still data has to be manually collected, verified, aligned, and cleaned, such as people’s names and affiliations in different spelling variations, and the geo-locations of organization. In total, we have generated about 100 thousand triples, in which about 15 thousand triples are about the basic conference information.  For comparison, previous ISWC metadata contains about 7-9 thousand triples each year.

(Linking) To link the datasets to other datasets in the linked data cloud, we resort to both automatic and manual mappings. Johanna Flores, RPI, generated some mappings of organizations to Dbpedia using fuzzy name matching; Jie Tang of Tsinghua University and his group helped on mapping people to ArnetMiner, a researcher social profile repository. Oktie Hassanzadeh helped to map authors to their DBLP entries. Jie Bao created some mappings to geonames. Thus, the ISWC data is now part of the linked data cloud.

(Apps) We have explored using the data in several interesting ways, both for practical use and for fun. You may have seen from Ian’s opening talk that we have generated various visualizations from the data. We have also developed several browsers, including a mobile browser, a faceted browser, and a filtered browser, many of you may already tried. Once we released the data, the data is quickly picked up by other people, and new apps and demonstrations are coming almost everyday: faceted browsers, visualizations, triple stores, interactive querying, just mention a few. By Nov 8th, the day before the conference, we have 15 working data browsers and tools for the ISWC data, only 4 of them are from the Metadata committee. The next day, we have 19, and yesterday, we have 26. Even this morning, we have learned new tools built for ISWC 2010 data, so the number is even larger now.

(Summary) To summarize, the metadata work is both challenging and rewarding. We applied a wide range of tools in getting all the pieces of work together. Many of such tools of technologies were not even there last year. We are also surprised that how quickly the semantic community can adopt our dataset to develop visualization and tools, some literally in hours, if not in days. Some do not require programming at all. This clearly shows that semantic web tools are approaching the level of maturity for “citizen users” (using the term from mc).

(Future) The ISWC data is rich in its content and its potentials. For example, we also collected real-time data  and mashed up it with the traditional static data. There are many ways we may leverage data like this for better serving the conference attendees, but, due to resource limitations, we can’t do for this year. But we believe, future ISWC metadata work will be even more useful for the community. Again, nothing speaks louder for a technology than that it is used by its own researchers. I believe this year at ISWC, it has been shown that Semantics is beautiful by eating our own dog food.

Advertisements
分类:语义网, 幻灯片
  1. Shangguan
    2011/04/11 @ 19:22

    前排沙发

  2. Channing
    2011/04/12 @ 15:57

    学习了~!

  1. 2012/04/16 @ 01:31

发表评论

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / 更改 )

Twitter picture

You are commenting using your Twitter account. Log Out / 更改 )

Facebook photo

You are commenting using your Facebook account. Log Out / 更改 )

Google+ photo

You are commenting using your Google+ account. Log Out / 更改 )

Connecting to %s

%d 博主赞过: