Archives

Archive for the ‘Resources’ Category

The 24 Weapons of ISWC Data [2010]

2012/01/05 · 1 comment

Originally written on 2010-11-08

http://tw.rpi.edu/weblog/2010/11/08/15-ways-to-explore-iswc-2010-data/

15 (and counting) Ways to Explore ISWC 2010 Data

This year at ISWC, as part of the metadata work, we had a Data Consuming task force to develop tools that browse and visualize the data in many different ways, e.g., a faceted browser, a filter browser, and a mobile browser.

As soon as we published the basic dataset, we started getting feedback from people about off-the-shelf tools that can work with the data. The list is growing quickly. I collected screenshots of some working instances (including tools the metadata committee built) in a slide deck. I have no doubt that the number “15” will change when the main conference begins… in 2.5 hours! So expect updates very soon.

What strikes me is the number and diversity of data browsers currently available; many of them are clearly reaching a level of maturity that lets non-expert users explore the data. That was not the case even one year ago. So much has changed for the Semantic Web in 2010!

P.S. 2012-01-05: I later added more browsing tools, bringing the count to 24.

A Short Semantic MediaWiki Tutorial

2012/01/03 · 2 comments

Originally written on 2009-07-21, for the RPI Web Science Summer Research Week (http://tw.rpi.edu/wiki/SummerProgram2009). Length: 30 minutes; level: beginner.

Jesse Wang and colleagues gave an excellent, complete SMW tutorial at ASWC; if you are interested, see his Slideshare page: http://www.slideshare.net/jiaxinwang

Lecture photo: http://tw.rpi.edu/wiki/Image:IMG_0881.jpg

P.S. Semantic MediaWiki 1.7.0 was released today.

Building Applications with Semantic Wikis

2011/12/27 · 1 comment

Abstract: Semantic web applications suit situations where the data changes constantly and dynamically. Another characteristic is that they can break down the boundaries between applications and between services. The examples here of building applications with a semantic wiki have little value in themselves; it is the way of thinking that I believe may prove useful in the future.

Note: for semantic wiki basics, see “A Short Semantic MediaWiki Tutorial” (2009-07-21).

Today I was chatting with someone about some characteristics of semantic web applications, and I gave a few examples of building applications with a semantic wiki. Most of what is quoted below comes from an earlier paper of mine:

Jie Bao, Li Ding, Rui Huang, Paul Smart, Dave Braines, Gareth Jones. A Semantic Wiki based Light-Weight Web Application Model, In Proceedings of the 4th Asian Semantic Web Conference, pp. 168-183, 2009

The paper presents several example applications, such as a map application and an ontology editor.

First, to be clear: this paper does not claim that semantic wikis are already good development tools, or that all semantic web applications should follow this model. Concrete development tools such as Semantic MediaWiki (SMW) are still at a very early stage; things like IDEs and libraries (e.g., reusable templates) do not exist yet, and it may take another decade for them to mature. Compare this to JavaScript in 1995, which only became an indispensable tool after it later grew into AJAX. What I want to describe here is a paradigm: a way in which, I believe, good semantic applications might be developed.

As I have said before, semantic web applications suit scenarios where the data changes constantly and dynamically. You can hardly define a fixed data schema once and for all. Instead, your application should be able to evolve with the times. If users’ needs change, your application should follow up very quickly, ideally without you having to change the application or the users having to do anything: the change is already reflected in the data the users themselves produce, and your application captures it.

This ability to evolve requires a new path for application development. Development based on a semantic wiki, for example, moves data modeling, business logic, and UI construction largely into territory that “users” can control (see the figure below). Concretely, a pile of templates editable right in the browser turns the application into something that can be updated anytime, anywhere. The traditional boundary between server and client blurs, as does the boundary between data on the one hand and metadata and business logic (traditionally hard-coded) on the other. The benefit is comparable to writing a blog in the browser versus uploading HTML pages over FTP: it is not that something previously impossible becomes possible, but that the capacity to evolve increases while the cost of evolving drops.

Another characteristic of semantic applications, I think, is that they can break down the boundaries between applications and between services. In our experimental development at RPI, we built on the wiki a blog, a to-do list, a calendar, a mailing list, a bibliography management system, a personal homepage system, and many other information management tools, while the underlying data is nothing but wiki pages. This kind of modeling is no longer organized around “files”; instead, it uniformly treats all data organization as relations, and an application is just a projection of a large relational network. Calendars, mail, and the rest are merely user interfaces that some templates construct over a single, unified structured knowledge base (see the figure below). You can hardly say which triple belongs to which application, and a change in one application (say, the calendar) can automatically trigger a change in another (say, the personal homepage). What is semantics? Relations are semantics, and deriving new relations from existing ones is stronger semantics. Freeing the structure of data completely from the application interface, and moving intelligence from code into the data itself, is a very powerful shift. Of course, these toy applications we built on the wiki have little value in themselves; it is the way of thinking that I believe may prove useful in the future.

More than two years have passed, and I have new thoughts. Semantic web application development should give rise to new programming models and new programming languages, just as the Web itself gave rise to many new languages. What we see now are only prototypes, hard to use, but with a sound core that should keep being developed.

The complete slides for the 2009 paper are here:

Categories: Semantic Web, Slides

The Unbearable Lightness of Wiking

2011/12/22 · 2 comments

These are slides from my talk at the Spring SMW Conference 2010 (MIT) in May 2010. They summarize the results of an experiment on the usability of semantic wikis in the KAHT project (funded by DARPA). We found that the semantic modeling ability of ordinary users can hardly produce meaningful semantic data. This is not only a problem of the system itself (Semantic MediaWiki); more profoundly, it is a problem of human cognition. Many issues that knowledge representation researchers take for granted draw completely different and wildly varied reactions from “ordinary people”.

One addition: looking at these slides now, the conclusion in them (i.e., that SMW needs to be extended) is not necessarily correct. What deserves more thought are the metadata life cycle and user psychology.

Towards Webtop [2008]

2011/11/24 · 2 comments

http://tw.rpi.edu/wiki/Blog:Baojie/Item-50
http://tw.rpi.edu/weblog/2008/07/25/towards-webtop/

2008-07-25

Some of our Tetherless World researchers, including me, have just written a short paper to sell the idea of constructing a “webtop” using semantic technologies. In short, a webtop is a desktop on the web that does similar jobs, such as managing files, word processing, managing contacts, scheduling tasks, and emailing. Please see some examples of webtops with pretty GUIs.

Almost a decade ago, the concept of the “network computer” was hot for a while. At that time, a network computer meant a low-end computer with limited storage and computational capacity that relied on the network for its real power. The webtop idea reminds me of the network computer: while different in many respects, they share the idea of empowering users with a networked infrastructure. Ten years ago this vision was tested with physical computers and largely failed; today, with the advance of technology, it is revived by letting users create virtual computers that exist only on the websphere. I have many reasons to believe that this time it will not only survive but prevail.

[P.S. 2011-11-24: It’s dubbed “Cloud” this time, that is, the overblown “cloud” hype. The cloud itself is not the key. The key is knowledge management: extracting knowledge from user behavior and user-generated data (note: not mining but extraction, which is relatively easy).]

One reason is from my personal experience. About two years ago, I stopped installing much of the software that had been with me for many years: Encarta was replaced by Wikipedia.com, Outlook by Gmail, MS Street by Google Maps, MS Word by writing in a wiki, Powerpoint by online LaTeX writing with the Beamer package, among a long list of other things. The browser is the application where I spend more than 80% of my time on my computers. There is indeed a strong need for me to organize all these online applications and data; simple bookmarking is barely a solution. I need something that can organize them, give me quick access to them, and, last but not least, be pretty and neat. A webtop does exactly those things.

How do semantic technologies help in providing a webtop? Actually, long before the term “ontology” became popular, users were already creating ontologies on a daily basis: classifying email, creating folder trees, grouping contacts, or naming a photo “Wedding picture at Troy”. All of these efforts create relations between things or annotate an entity with a “meaning”. With semantic technologies, those relations and annotations can be made explicit, so that data can be managed and queried more easily. For example, I may ask “find all 2005 photos of my friends”, or “show all meetings (even those not called meetings, such as ‘briefings’) in the past month”. A webtop based on semantic technologies will make such an ability universal to any application on top of it.
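To make the idea concrete, here is a tiny sketch in plain Python of how explicitly represented relations support a query like “find all 2005 photos of my friends”. The data, property names, and the set-of-triples store are all made up for illustration; a real webtop would use RDF and a proper query language.

```python
# Hypothetical personal data as explicit (subject, property, object) triples.
triples = {
    ("alice", "friendOf", "me"),
    ("bob", "friendOf", "me"),
    ("photo1", "depicts", "alice"),
    ("photo1", "takenInYear", "2005"),
    ("photo2", "depicts", "bob"),
    ("photo2", "takenInYear", "2007"),
}

def objects(s, p):
    """All o such that (s, p, o) is asserted."""
    return {o for (s2, p2, o) in triples if s2 == s and p2 == p}

# "Find all 2005 photos of my friends."
friends = {s for (s, p, o) in triples if p == "friendOf" and o == "me"}
photos_2005 = sorted(
    s for (s, p, o) in triples
    if p == "depicts" and o in friends and "2005" in objects(s, "takenInYear")
)
print(photos_2005)  # -> ['photo1']
```

Because the relations are explicit rather than buried in folder names, the same query works no matter which application created the photos or the contact list.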

[P.S. 2011-11-24: Yes, this is semantic search over a personal “knowledge” base. Something like it may appear on the market in the near future.]

There have been controversies about the semantic web ever since the term was coined. I think this is partly because the semantic web community as a whole has failed to provide enough end-user-friendly tools that do something helpful in daily life. I wish to see more tools that help daily web activities: semantic email, semantic blogs, semantic calendars, semantic abstracts of news (a little more than RSS), tagging files (pictures, mp3s, …) with a taxonomy, etc. Even more important, to survive, such an application should never ask users to learn RDF or anything that takes more than 3 minutes to understand. Bring such applications together, and you have a webtop. I believe something like this is one of the killer apps the community has long been waiting for.

[P.S. 2011-11-24: Looking back at this blog post from three years ago, I regret wasting three years without implementing these ideas. It is not that I did not want to; my “execution ability” fell short: the power and ability to control my own time, the groundwork to stabilize my base, turning ideas into realistic and feasible technical setups, pitch-deck persuasion skills, personal networks… these are the things I will focus on learning in the coming year.]

{{BlogInfo
|page=Blog:Baojie
|title=Towards Webtop
|visitor=User:Baojie
|date=2008/07/25 00:00 EDT
|source=http://tw.rpi.edu/weblog/2008/07/25/towards-webtop/
|tag=Jie’s_SW_Blog, Webtop
}}

References:

Jie Bao, Li Ding, Deborah L. McGuinness, James A. Hendler. Towards Social Webtops Using Semantic Wiki, In International Semantic Web Conference (ISWC), Poster Track, 2008 (Download) (Slides) .


Semantic Web and Recommendation (3): Recommender System Basics

2011/11/21 · 1 comment

I collected some introductory slides to read. Whether or not it is “semantic” actually matters little.

Recommender Systems http://www.slideshare.net/T212/recommender-systems-1311490 [very basic]

Recommender Engines http://www.slideshare.net/antiraum/recommender-engines [likewise; a survey of general methods]

Tutorial: Recommender Systems http://www.recommenderbook.net/media/Tutorial_IJCAI_2011.pdf [tutorial at IJCAI 2011, by Dietmar Jannach & Gerhard Friedrich]

Wang Shoukun (王守崑) – Douban’s practice and reflections on recommendation http://www.slideshare.net/clickstone/ss-2756065 [quite good; some lessons from experience]

How to build a recommender system http://www.slideshare.net/blueace/how-to-build-a-recommender-system-presentation [Wakoopa; interesting on the choice of data]

Music Recommendation Tutorial http://www.slideshare.net/ocelma/music-recommendation-tutorial [although about music, the techniques are general]

Music Recommendation and Discovery in the Long Tail http://www.slideshare.net/ocelma/celma-ph-d-defense-1067735 [Oscar Celma’s PhD defense, 2009]

Social Recommender Systems Tutorial – WWW 2011 http://www.slideshare.net/idoguy/social-recommender-systems-tutorial-www-2011-7446137

Google Tech Talk on Social Recommendation http://www.slideshare.net/dancarroll56/google-tech-talk-on-social-recommendation

More

Categories: Semantic Web, Slides

Resources: DL-Learner

2011/05/21 · 1 comment

Some resources on learning a TBox from an ABox.

Purpose: semantic compression. All machine learning is, in essence, a compression algorithm.

DL-Learner: http://dl-learner.org/Projects/DLLearner/OnePageIntroduction

Jens Lehmann: http://jens-lehmann.org [publications]

I skimmed it and did not find it very convincing. The so-called refinement operator does not seem to be anything special. Since this is essentially a search problem, it is odd that the papers hardly discuss search strategies. The lack of integration with statistical methods and information theory is, I think, a major weakness.

Categories: Resources

Expressive Query Answering for Semantic Wikis

2011/05/12 · 1 comment

May 11th, 2011, at the Cambridge Semantic Web Meetup

#1 Welcome

#2 Semantic Wiki as a Data Store

Semantic wikis have become increasingly popular in the past few years. Their popularity may be attributed to many features of “wikiness”, such as being collaborative, simple, easy to learn, tolerant of informality, and capable of evolving. A semantic wiki allows you to start from unstructured, raw data and gradually add structure or even semantics to the data, by yourself or with others. For non-expert users, this approach often works better than many other knowledge management approaches.

The part I love most about semantic wikis is that I can use them as a Web-based lightweight database. A wiki acts as an abstraction over the real data, regardless of whether it lives in a relational database, in a triple store, or online somewhere else. It also offers an easily accessible interface through which I can do almost all data management tasks from a browser: modeling, querying, and some inferencing. On top of the wiki abstraction of data, we may build other interesting applications: maps, blogs, to-do lists, a bibliography repository, and many other things.

#3 Semantic Media Wiki (SMW)

Semantic MediaWiki is arguably the most popular semantic wiki system currently available. There are a couple of reasons for the success of semantic wikis in general, and of SMW in particular.

One prominent property shared by almost all semantic wikis is their simplicity and low cost. Traditionally, to build a semantic application, one needs tools for building ontologies, for annotating data with the ontologies, for querying the data, for reasoning with the data and the ontologies, and languages to build the user interface. This means learning a whole set of languages and tools: OWL, Protégé, SPARQL, Jena, Pellet, Java, and so on.

For many developers and users, the adoption cost of semantic web technologies is too high and the reward relatively low. Suppose a gym manager wants to build a website with a little bit of semantics: does it make sense for him to learn the above set of languages, or to hire a semantic web programmer?

Semantic wikis fill the gap with a low-cost solution for lightweight semantic applications. SMW, for example, provides an integrated environment for ontology building, data annotation, reasoning and querying, and UI building. As it is built on top of MediaWiki, there are many extensions, from visualization to I/O, that we can use to build applications.

SMW provides a simple modeling language and a query language, which are considerably simpler than RDF and SPARQL, respectively. It is in fact quite a powerful tool, can be seen as a lightweight triple store, and we can build applications on top of it.

#4 However, we often need more expressivity

However, despite its power, we often feel that the expressivity of SMW is too limited. For example, there are no inverse properties in SMW: I cannot say that “has author” is the inverse of “author of”. Developers often need complicated templates and other tricks to work around this limitation.

Another frequently needed feature is the transitive property. For example, I may want to say that Nashua is a part of New Hampshire and New Hampshire is a part of the United States; therefore, Nashua is a part of the United States.
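As an illustration, a reasoner supporting transitive properties must compute exactly this closure. Below is a minimal sketch in plain Python; the naive fixpoint loop is illustrative only, not how SMW or the extension is actually implemented.

```python
# Hypothetical partOf facts from the example above.
part_of = {
    ("Nashua", "New Hampshire"),
    ("New Hampshire", "United States"),
}

def transitive_closure(pairs):
    """Naive fixpoint: keep joining pairs until nothing new is derived."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

closed = transitive_closure(part_of)
print(("Nashua", "United States") in closed)  # -> True
```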

Similarly, we often need additional expressivity in the query language of SMW. One example is negation, such as finding cities that are not capitals. Another is counting, for example, finding professors who advise more than 5 students.

#5 Desired Expressivity 

To pick the right set of expressivity for semantic wiki modeling, we need to balance expressiveness against simplicity. For example, why not pick OWL 2 QL, since SMW data is stored in a relational database anyway? Or why not OWL 2 RL, which can be implemented with rule-based reasoning?

To find the right mix of supported features, I believe that what matters most is not whether the set is maximally expressive, or whether it is tractable in worst-case time complexity. The right criteria might be:

1) whether users need it;
2) whether the adoption cost is low.

Keeping this in mind, I selected OWL Prime as the subset of OWL supported in the extended SMW modeling language.

For the query language, I extended SMW-QL with negation as failure and cardinality queries.

#6 Formalization

The next question is what semantics to use. OWL adopts the open world assumption (OWA): if something cannot be proven true, it is not necessarily false. Databases and many rule systems, on the other hand, adopt the closed world assumption (CWA).

A semantic wiki is in fact closer to a database than to a knowledge base with the OWA. When we query a wiki, we are, most of the time, interested only in the knowledge mentioned in the wiki. If something is not stated in the wiki, we assume it is false. If we list two authors for a paper, then by default the paper has just those two authors and no others. As another example, if Berlin is not said to be a person, then Berlin is not a person.

The right semantics for SMW is therefore not that of OWL but a closed world semantics. For this research, I used datalog, which has a declarative, closed world semantics, well-understood complexity, and mature tool support.

For the sake of time, I will not cover the full details of modeling SMW in datalog, only the new features. You may find more details in the backup slides.

#7 SMW-ML+

This slide shows the translation of the extended SMW-ML into datalog. The meanings are similar to the corresponding constructs in RDF or OWL, so I do not have to explain them in detail.

One thing worth noting is that the SameAs relation here is weaker than owl:sameAs: in counting, even if SameAs(x,y) is true, x and y are still counted as two individuals.

#8 Translation Rules for SMW-QL

This slide shows the translation of an SMW “ask” query into logic program rules. The query asks for cities that are the capital of something. The query is turned into the rule on the right. The head of the rule is a special predicate “result”, which collects all matched results during query answering. Each selection condition is translated into a body atom of the rule.

This is a very simple example. For other constructs, such as conjunction, disjunction, subqueries, and property chains, see the backup slides.
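As a stand-in for the datalog evaluation, the single translated rule can be checked by hand over hypothetical facts (the data below is made up; in the real system the query is evaluated by the backend reasoner):

```python
# Roughly: result(X) :- City(X), capitalOf(X, Y).
city = {"Boston", "Nashua", "Albany"}
capital_of = {("Boston", "Massachusetts"), ("Albany", "New York")}

# X matches when it is a city and has at least one capitalOf successor.
result = {x for x in city if any(s == x for (s, o) in capital_of)}
print(sorted(result))  # -> ['Albany', 'Boston']
```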

#9 SMW-QL+: Negations

This slide shows the translation of the extended query language with negation into datalog.

For the second case, why not “C(X), not P(X,Y)” ?

If we have C(a) and P(a,b), then the above query will return {a,b}, because C(a) and “not P(a,a)” are both true. Thus, “C(X), not P(X,Y)” is not a correct translation.
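The pitfall can be checked concretely. In the sketch below, plain Python stands in for datalog; the facts are extended with C(b) so the correct answer is nonempty, and the auxiliary-predicate fix is my reconstruction rather than a rule quoted from the slides:

```python
# Facts: C(a), C(b), P(a,b). Intended query: members of C with no P-successor.
C = {"a", "b"}
P = {("a", "b")}
domain = {"a", "b"}

# Wrong translation: "C(X), not P(X,Y)" with Y free. "not P(X,Y)" succeeds
# whenever SOME value of Y makes P(X,Y) false, so "a" sneaks in via Y="a".
wrong = {x for x in C for y in domain if (x, y) not in P}
print(sorted(wrong))  # -> ['a', 'b']

# Safe translation: project P onto its subject first.
#   aux(X) :- P(X, Y).      result(X) :- C(X), not aux(X).
aux = {x for (x, y) in P}
right = {x for x in C if x not in aux}
print(sorted(right))  # -> ['b']
```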

#10 SMW-QL+: (Non)qualified Cardinality

Qualified cardinality queries and nonqualified cardinality queries are translated into similar rules using the count function.

“Thing(x)” is added for the safeness of the rule, that is, so the rule always returns a result. We have a set of rules to ensure that everything is an instance of “Thing”.
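Below is a plain-Python stand-in for such a counting rule, over hypothetical advising facts (a real implementation would use the reasoner’s count aggregate, as described above):

```python
from collections import Counter

# Hypothetical facts: advises(professor, student).
advises = {("prof_a", f"student_{i}") for i in range(6)} | {("prof_b", "student_x")}
thing = {"prof_a", "prof_b"} | {o for (_, o) in advises}  # everything is a Thing

# Roughly: result(X) :- Thing(X), count{Y : advises(X, Y)} > 5.
# The Thing(X) guard keeps the rule safe: X ranges over all individuals,
# including those whose count is zero.
counts = Counter(s for (s, _) in advises)
result = {x for x in thing if counts[x] > 5}
print(result)  # -> {'prof_a'}
```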

#11 Implementation

A quick note on the implementation. The backend reasoner I used is DLV, which won the first ASP competition. In theory, other logic program solvers could be used as well. I have tried clasp, the winner of the second ASP competition; the performance of DLV and clasp is similar. I have not yet tried other solvers, such as smodels or cmodels, but it should not be too difficult to use them.

The implementation has a file-based mode and a database-based mode. In the database-based mode, real-time changes to instance data are captured, but it is in general a little slower than the file-based mode.

As a side benefit of this implementation, you can now decouple the wiki’s content storage from its semantic data storage. As long as you provide an ODBC interface, your semantic data can be stored anywhere, not necessarily locally. This also enables remote querying of another wiki, or federated queries over multiple wikis.

#12 Example 

This page shows a screenshot. On the left are the modeling and query scripts of two pages, using an inverse property and a transitive property. The query result is shown on the right.

#13 Scalability: Data Complexity

The next two slides show the scalability results. For data complexity, we measure query time as a function of dataset size, for a fixed query. It is almost linear. This is largely because building a result set, or in DLV’s terminology an answer set, takes time linear in the number of facts when the number of non-fact rules is small. In this experiment, we have about 100k triples of facts but fewer than 100 rules.

#14 Scalability: Query Complexity

In the second graph, we can see that the query complexity is almost constant. Query complexity measures, for a fixed dataset, how query time grows as a function of query size. I tried several query patterns, and all of them show constant-time behavior. This is not true for SMW itself, as it translates queries into SQL.

An explanation for the constant time complexity is that the extended queries are translated into non-ground rules, which are small compared with the set of ground facts. For this reason, DLV is sensitive to the factbase size in a linear way (probably because of grounding), but insensitive to the rule set size as long as the factbase is much larger.

As most semantic wikis today have fewer than 10k pages and 100k triples, the implementation is probably fast enough for typical wiki users.

#15 The SemanticQueryRDFS++ extension

We have released our work as an extension of Semantic MediaWiki, called SemanticQueryRDFS++. You may try it out.

We picked this name because the OWL Prime subset of OWL has been called by others “RDFS 3.0” or “RDFS++”, and we believe “RDFS++” gives the best intuition of what our extension supports.

#16 Some other work on SMW by us

[a list]

#17 Summary

To summarize, we have shown that formalizing SMW in datalog allows us to extend SMW with an expressive subset of OWL; to implement an SMW query engine that is scalable for typical uses; and, not covered in this talk because it would interest only logicians, to analyze the reasoning complexity of SMW and our extensions.

There are a couple of things we want to do in the future. We want to support incremental reasoning, so that we do not have to recompute the answer set from scratch every time. We may support customized reasoning rules: if some users need more advanced reasoning, they should be able to get it. Finally, for exchanging data with other semantic web applications, it would be nice to have a translation between SPARQL and the query language of SMW.

[end]

Categories: Semantic Web, Slides

Resources: Some Hype about Web 3.0

2011/05/03 · 2 comments

These are listed here without implying my endorsement of the views below. I deliberately filtered out those that claim Web 3.0 is simply the Semantic Web.

2007

The Evolution of Web 3.0

2008

Web 3.0

Why Web 3.0?

Web 3.0 explained with a stamp

web 3.0 this time its personal

2009

Web 3.0: How’s That Panning Out Then

2010

Web 3.0, a doc by Kate Ray

Categories: Web, Slides

Why Context Matters

2011/05/03 · 2 comments

Two recent news items both illustrate the importance of context (background, environment, scope).

TriQuint is a stock I watch closely. One news item says: TriQuint’s quarterly revenue falls 11% to $224.3m

This looks like bad news. But if you are familiar with the semiconductor industry, you will find that almost every company’s first-quarter revenue is lower than the previous year’s fourth quarter, and TriQuint is no exception. Put this decline in the context of historical data and it is perfectly normal, nothing more than the seasonal drop after the Christmas sales peak.

The other is about how Bin Laden was found: because he used no phone or Internet.

Bin Laden’s villa lacked Internet, phone service:Absence of communications was one key tip-off to U.S. intelligence officials

A few days ago I saw a picture of Afghans watching an effigy of an American pastor being burned (because he had burned a Quran). What slightly surprised me was that almost all of the Afghans were taking pictures with phones or cameras. If telecommunications are that widespread in Afghanistan, Pakistan should be even more advanced. In this context, the absence of electronic communication is itself information.

Context can be many things: objects of comparison, history, time and space, state, assumptions, and so on. Below is my poster about context at WebSci’10:

Categories: Semantic Web, Slides

Chinese Translation of the Description Logic Handbook (Chapters 1–2)

2011/04/26 · Leave a comment

Note: this version of the translation was sent to me by a friend a while ago. I am sorry I have forgotten its source. Although the translation is imprecise in places, it can serve as a very good starting point for a Chinese translation.

Chapter 1

Chapter 2

[forum=29&topic=146]

Categories: Logic, Semantic Web, Resources

Bookmark: Startup Best Practices

2011/04/24 · 1 comment

A free book: http://startupbestpractice.com/

Startup Best Practices contains in-depth conversations with Silicon Valley serial entrepreneurs who share a wealth of business experience and lessons learned that help newbie entrepreneurs focus on the important issues.

Their practical guidance in business fields such as finance, marketing and sales, and management and organization is directed at the key challenges that startups typically face. Cees J. Quirijns gets these startup wizards to share their entertaining, informative, and invaluable insights and devises the common thread.

Introduction on YouTube

Full book (license: Creative Commons Attribution-NonCommercial)

Categories: Resources, Engineering & Entrepreneurship

The Semantics of RDF

2011/04/21 · 4 comments

2008-09-04

These lecture notes are based on the official semantics of RDF (Hayes, 2004). [As an aside, this Hayes is a playful old character.]

More materials: http://tw.rpi.edu/wiki/index.php/RDF_and_OWL_Semantics

[I have long wanted to organize the syntax and semantics of RDF, OWL, and RIF into one coherent account. There seems to be no systematic Chinese material online yet. In English, the book Foundations of Semantic Web Technologies is the best, though I find it somewhat difficult because the authors are logicians. A more accessible set of lecture notes would be ideal.]

Building Metadata for an International Conference (1)

2011/04/11 · 3 comments

Building Metadata for an International Conference: 1, 2, 3 (being updated)

A summary of my experience as metadata chair for ISWC 2010, together with an outlook on metadata for RuleML 2011. I have long meant to do this and kept putting it off. It is mainly for my own reference, so it may ramble. Since much of the original material is in English, I will not translate it, sorry.

(1) Background and Overview

ISWC 2010 Logo

ISWC = International Semantic Web Conference, the semantic web community’s main academic conference. There is also the IEEE International Symposium on Wearable Computers, also abbreviated ISWC, which sometimes causes confusion.

Since 2003, ISWC has designated someone to collect and publish metadata about the conference: papers, the organizing committee, the program, and so on. This person is called the Metadata Chair. For ISWC 2010, that was me. Starting in the fall of 2009, the work took nearly a year, with the last few months being the most time-consuming. I had no experience and made many mistakes, so I intend to sum it up. Building metadata for an event, or for an organization of a few hundred people, probably has much in common across cases, so the lessons here may be useful to others.

On goals, means, and results, here is the summary (just one page) I gave in an ISWC 2010 lightning talk. The text version (in English) follows, serving as an outline.

——————————–

(Brief History) Nothing speaks louder for a technology than that it is used by its own advocates. That’s why, each year at ISWC since 2003, there has been a member of the organizing committee responsible for collecting metadata about the conference, represented in Semantic Web formats. The dataset covers papers, authors, organizers, organizations, sessions, and other sub-events of the conference. The data is published on the Semantic Web Dog Food website, data.semanticweb.org.

(Goal) This year, there are a couple of changes in the metadata work. The goal is that, instead of simply generating the dataset and letting it sleep, we should get the data used, both on-site by the conference attendees and by the whole community, now and in the future. For this purpose, we significantly extended the scope of the work and, accordingly, the workflow of the metadata project.

(Metadata Committee) First, instead of having only one or two metadata chairs, we recruited help from the community in the form of a metadata committee. The committee is divided among four types of tasks: data generation, linking the data to other datasets, data consumption (that is, building applications on top of the data), and integrating the dataset into the infrastructure of the conference (such as the website).

(Generation) The data generation part is essentially a data integration task of moderate size. Data comes from diverse sources: the proceedings data comes from a dump of the EasyChair submission site; people’s profile data comes partly from EasyChair, partly from the papers themselves, and partly from manual input; some comes from spreadsheets. We tried to reuse data as much as we could, such as previous years’ ISWC data and people’s FOAF files. We also tried natural language processing and data mining approaches to extract semantic data from unstructured data; for example, keywords, paper structure, and citation data were mined from PDFs. Still, some data had to be manually collected, verified, aligned, and cleaned, such as people’s names and affiliations in different spelling variations, and the geo-locations of organizations. In total, we generated about 100 thousand triples, of which about 15 thousand are about the basic conference information. For comparison, previous ISWC metadata contained about 7–9 thousand triples each year.

(Linking) To link the dataset to other datasets in the linked data cloud, we resorted to both automatic and manual mappings. Johanna Flores (RPI) generated some mappings of organizations to DBpedia using fuzzy name matching; Jie Tang of Tsinghua University and his group helped map people to ArnetMiner, a repository of researchers’ social profiles; Oktie Hassanzadeh helped map authors to their DBLP entries; Jie Bao created some mappings to GeoNames. Thus, the ISWC data is now part of the linked data cloud.

(Apps) We have explored using the data in several interesting ways, both for practical use and for fun. You may have seen from Ian’s opening talk that we generated various visualizations from the data. We also developed several browsers, including a mobile browser, a faceted browser, and a filtered browser, which many of you may already have tried. Once we released the data, it was quickly picked up by other people, and new apps and demonstrations arrived almost every day: faceted browsers, visualizations, triple stores, interactive querying, to mention just a few. By Nov 8th, the day before the conference, we had 15 working data browsers and tools for the ISWC data, only 4 of them from the metadata committee. The next day we had 19, and yesterday 26. Even this morning we learned of new tools built for the ISWC 2010 data, so the number is larger still.

(Summary) To summarize, the metadata work is both challenging and rewarding. We applied a wide range of tools to bring all the pieces together; many of these tools and technologies did not even exist last year. We were also surprised by how quickly the semantic web community could adopt our dataset to develop visualizations and tools, some literally in hours, if not days. Some required no programming at all. This clearly shows that semantic web tools are approaching the level of maturity needed for “citizen users” (borrowing the term from mc).

(Future) The ISWC data is rich in content and potential. For example, we also collected real-time data and mashed it up with the traditional static data. There are many ways we could leverage data like this to better serve conference attendees, but, due to resource limitations, we could not do so this year. We believe future ISWC metadata work will be even more useful to the community. Again, nothing speaks louder for a technology than that it is used by its own researchers. I believe that at ISWC this year, we have shown that semantics is beautiful by eating our own dog food.

Categories: Semantic Web, Slides

Bookmark: Two Sets of Video Tutorials on Description Logic

2011/04/11 · 1 comment

The Logical Foundations of the Semantic Web
Dates: August 24–28, 2009
Venue: Room 310, Jizhong Building (graduate teaching building), Jiulonghu Campus, Southeast University, Nanjing
Lecturer: Prof. Zhisheng Huang

A set of videos on Tudou: http://www.tudou.com/playlist/id/7262185

Also appearing in the videos are Prof. Qu Yuzhong (瞿裕忠), Prof. Jeff Pan (潘智霖), and others.

I previously watched a set of description logic videos by Uli Sattler, from 2005, which are also very good. The table of contents is posted here:

Categories: Logic, Semantic Web, Resources

RDF and Context (域)

2011/04/06 · 4 comments

Yesterday I decided to translate “context” into Chinese as 域. Today let me continue with representing context in RDF, mainly to help me sort out my own thinking.

Last year I did a little work in this direction and attended the RDF Next Steps Workshop. Here are the slides:

Many people are now studying temporal and spatial extensions of RDF, and there are many papers. I will list some temporal RDF papers I consider important in the comments later. I am not very familiar with spatial RDF. For example, one proposal looks like this:

:stmt2 rdf:subject <s> ;
    rdf:predicate <p> ;
    rdf:object <o> ;
    tmp:interval [
        tmp:initial "t1"^^xsd:dateTimeStamp ;
        tmp:final "t2"^^xsd:dateTimeStamp ] .

That is, using reification.

The abstract syntax proposed by Gutierrez et al. (2007) is

s p o [t1 t2]

What if we also want to represent a spatial context? For example, the statement “Bush is president” is true in the temporal context 2002–2009 and in the spatial context of the United States. Then we have to add another dimension:

s p o [t1 t2] <s>

And what if there are yet more contexts, such as mood, discourse context, or background data (for example, “tall” means different things in the NBA and in Wu Dalang’s pancake shop)? Then the annotations piled onto a triple grow and grow, to the point of being unbearable.

One approach is to wrap all these contexts up and put them in a “context document”. A triple, or a set of triples, is true in some contexts and false in others. At query time, we first specify the contexts we are interested in, and only the relevant triples are returned.

In abstract syntax:

s p o @c
c  [t1 t2] <s>

This looks like a small difference, but it has several benefits:

  • Contexts can be reused. If many triples share the same time and space, there is no need to repeat them.
  • Contexts can be reasoned about. A triple can belong to multiple contexts, or belong to a context as the result of inference (a context itself can be a variable; in the concrete syntax, a blank node).
  • Contexts can be related to each other. For example, if we adopt a closed world assumption over the spatial context for assertions about who is president, then once we know that “Bush is president” is true exactly in the temporal context 2002–2009, “Bush is president” is false in 2010. Conversely, if “it rained today” is true in Boston, it may also be true in NYC (open world assumption). Also, if “it rained today” is true in Manhattan, it is true in NYC.

Why not just use named graphs? A named graph only expresses a syntactic containment relation. You cannot state, or infer, the relation that (s p o) belongs to named graph g. Named graphs can serve as syntactic statements, but a new semantics is needed to represent contexts.
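To make the “s p o @c” idea concrete, here is a small Python sketch of context-scoped statements and context-aware querying. The names, the dictionary representation, and the query function are all illustrative, not a proposed concrete syntax.

```python
# Statements tagged with a context id; contexts are described separately.
statements = [
    ("Bush", "holdsOffice", "President", "c1"),
    ("Obama", "holdsOffice", "President", "c2"),
]
contexts = {
    "c1": {"interval": (2002, 2009), "space": "US"},
    "c2": {"interval": (2009, 2017), "space": "US"},
}

def query(p, o, year, space):
    """Subjects s with (s, p, o) true in a context covering (year, space)."""
    return [
        s for (s, p2, o2, c) in statements
        if p2 == p and o2 == o
        and contexts[c]["space"] == space
        and contexts[c]["interval"][0] <= year <= contexts[c]["interval"][1]
    ]

print(query("holdsOffice", "President", 2005, "US"))  # -> ['Bush']
```

Because the interval and space live on the context rather than on each triple, any number of statements can share one context, and context-level rules (such as containment between regions) can be layered on top.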

To be continued; next time, the uncertainty and modal properties of contexts.

Semantic Web: Toward the Next Generation of Killer Apps

2011/04/04 · 5 comments

[Note (2011-12-10): my understanding of this topic has changed a great deal over the past year. I now find these slides quite immature. I have some new and, on the whole, more positive views on the industrialization of the semantic web, and will post an update when the time is ripe.]

Originally posted on Slideshare, 2010-10-22; a talk at UMass Lowell. Level: introductory.

Semantic Web: In Quest for the Next Generation Killer Apps

Outline

  • Why Semantic Web?
  • Key SW Standards
  • Opening Data for SW
  • Building SW Applications