Convenciones de transcripción

@Title: Challenges for Information Extraction
@File: mavir01
@Participants: GRI, Ralph, (man, C, 3, professor, lecturer, New York) 
@Date: 16/11/2006
@Place: Madrid
@Situation: conference (I Jornadas MAVIR), conference room at university, not hidden, observing researcher  
@Topic: Current situation of  information extraction research
@Source: MAVIR 
@Class: formal in natural context, conference, monologue
@Length: 1 h 07' 39"
@Words: 9113
@Acoustic_quality: A
@Transcriber: M. Garrote
@Revisor: L. Campillos, M. Garrote
@Comments: 

GRI thank you Antonio / and thank you to the / entire MAVIR Network / for the invitation  00:06

GRI &ah / can everybody hear me in the back ? 00:09

GRI it's ok ? 00:10

GRI &ah / one of the dangers of having an introduction like this / and being told / how many years you've worked already in computation linguistics // is to be asked of it and / why didn't you get further by now ? 00:22

GRI after forty years we've learned / things go rather slowly / that we make progress / but progress is never as fast as / we predicted that it will be 00:33

GRI looking ahead / we think in five years / everything will be wonderful 00:37

GRI and now I say / well / maybe my children / or my grandchildren / will solve all these problems 00:43

GRI one of the nice things about / natural language processing is a problem // is that is so rich / so deep // that / you can / peel off one layer of problems // get some level of solutions // but there will still be many layers / left for the next rank of students to solve 01:02

GRI ok // so let me talk over about / the challenges for information extraction 01:09

GRI &ah / I present both / some very / general overview / and then some technical details about various projects we've worked on 01:20

GRI so let me begin by just explaining what I mean by information extraction 01:25

GRI so identifying instances / of important entities / relations and events / from unstructured texts 01:33

GRI ok // so for example / &ah / this is one / we've worked on enough for a long time // identifying people who are hired and fired / by various companies 01:48

GRI so we have the sentence // George Garret / forty years old / president of London-based European Information Services / was appointed chief executive officer of Nielsen Marketing 01:59

GRI and somebody comes and says // ok // we want to read all the newspapers // and keep track of who's been hired or fired 02:06

GRI so we'd like to produce = 02:08

GRI see if my mouse is awake 02:10

GRI no 02:10

GRI not awake up this morning 02:12

GRI &ah / we'd like to produce a database / which lists position // and the company / and the location / and the person / and whether they got / left the job / or they got into the job 02:25

GRI cause that's a typical example of what we / mean by doing information extraction 02:30

GRI so to make a contrast / xxx / also hear about information retrieval // from the web / contrast between information extraction / and / information retrieval 02:42

GRI so / in information extraction / the operations are based on and then return normalized values like dates // or entities 02:52

GRI so by an entity I mean not just the reference / but / if we have several references to the same person / to recognize that / as / the single particular person // and most important relations between entities / or between a company / and a person // between a person and a location 03:10

GRI so information extraction has to be adapted to specific tasks 03:15

GRI somebody will come and say / we're interested in hirings and firings 03:18

GRI or we are interested in / talks of universities 03:22

GRI or we are interested in / a task on cities or something 03:26

GRI and the system has to be customized for that particular task 03:30

GRI in contrast / information retrieval / basically based on terms // just on [/] on differentiated tokens 03:40

GRI and we would turn documents or passages right than specific facts 03:44

GRI and the benefit is that it's a general technology // something that doesn't have to be customized for a particular task 03:51

GRI but the power I'd like to suggest of information extraction is that there's some questions / which are really hard to answer using just / Google 04:01

GRI so if you wanted to get an answer to where has Condoleezza Rice been in the last month // you might have to Google for her name // see which articles refer to her / locations // and then slowly xxx the database might take you ten or twenty minutes / to answer this question 04:20

GRI whereas with information extraction if we're interested in tracking people and their locations // this is a plausible question we might pose directly 04:28

GRI hhh {%act: cough} or what terrorist attacks occurred in Europe in 2004 04:34

GRI if you xxx in terrorist attacks in Europe in 2004 // you might not get very much 04:39

GRI 'cause articles might talk about England / and Spain / and various countries 04:44

GRI if you don't have relationships capture this part of retrieval // you're not going to do very well from term retrieval alone 04:51

GRI one way in which / information extraction might be used / this is a search tool / &ah as a complement for / sort of standard Google-based tools for doing web search 05:07

GRI the idea here is that people would be interested in particular domain and they'd come and say / we're interested in / &ah searching articles 05:14

GRI in this example I'll show in a moment about / disease outbreaks 05:19

GRI so we'll build the system for extracting relationships about disease outbreaks 05:25

GRI what disease / occurred when / and where / and so forth 05:28

GRI and then we pick the system // and we run it each day against the days news / using a web crawler up and down all the new sites 05:38

GRI retrieve the latest news // build the database // and then we provide access to the article through this database 05:47

GRI so / basic flow is the following // the web crawler // some filter to just get articles which are relevant to the task 05:56

GRI then build an extraction engine which builds a database 05:59

GRI on the other side somebody xxx with the browser // and would look something like this 06:05

GRI so what we have over here on the top // is / a database meant to look very much like Excell // has the same sort of capabilities as an Excell database 06:17

GRI which is the document date / the disease name / the time / the location / the country of the location / how many cases were reported / whether / they were sick or dead // and specific task description 06:32

GRI so then you can search this just like you would in a Excel database / to restrictions // and when you are interested in a particular article / you click on a row / and up comes / at the bottom of the screen / the corresponding passage 06:46

GRI so in this case it will say somebody click on dengue fever // and it says down below that the article it came from / why / state officials reported one additional recent case of dengue fever / and six cases that occur hhh {%act: onomatopoeia} 07:01

GRI ok / the extraction technology's not completely reliable 07:06

GRI so we get maybe / two thirds of correct information / three quarters of correct information / in the database 07:12

GRI but then / the database gets / immediately link to articles // so / even though you can't rely on the database per se it's been accurate / certain information // if you view it as a search tool is really / can be much more powerful search tool for particular topics // and a key word scheme / like / Google or Yahoo 07:35

GRI 'cause you wouldn't expect if you typed in / the search terms to something like Google that / two thirds or three quarters of the articles were irrelevant 07:43

GRI so / this sort of tools / for doing directed search have been applied in a number of areas // been applied in genomics 07:53

GRI that's one area where there's certainly a lot of money / in the United States / and perhaps here also 07:58

GRI and / &ah / researchers go with the money is xxx 08:04

GRI so there's been quite a bit of work / on extracting information in particular about gene protein interactions 08:10

GRI there's an enormous literature xxx by genomics researchers 08:14

GRI you simply / can't keep up with all the staff that's been published / to get to search for something // it's a major effort 08:20

GRI so being &ab [/] being able to pull out this specific relationships // we get a much more powerful search vehicle 08:28

GRI in / medical applications / there's a lot of demand for generating summaries for medical reports 08:35

GRI what fraction of / patients have this outcome / who xxx and so forth 08:40

GRI and some of this is now being done by information extraction 08:44

GRI also / in the idea of going with the money is / &ah / financial / information extraction is also being a major area 08:53

GRI so / in keeping with this idea that / I've been slaving at this for a long time // the other people have been slaving at this area for even longer 09:04

GRI so / although the idea of doing search from information extraction / seems like a / sort of timely and novel one // it's one which has been around now for // let's see // fifty eight so it's / almost half a century 09:19

GRI so / back in nineteen fifty eight / there was a presentation / by Zellig Harris in Washington / where they had a conference 09:28

GRI how should we be doing information retrieval / in a time when / there was very little online and / being online mean / taking an article / and typing it all into punch cards / and reading those punch cards in a reader 09:40

GRI and having a couple of articles // punch cards // people who remember // &ah -> / having a small collection of articles online / maybe a few hundred 09:52

GRI but at that time Zellig Harris succeeded in thinking / how could we automate this process / and have more powerful search vehicles // than just having keywords search 10:02

GRI so he talked / in this fifty eight paper / about the idea of taking a set of articles / discovering the main relationships which appeared in the articles 10:12

GRI he was interested in scientific / literature 10:15

GRI automatically indexing the articles / xxx work was then probably UNIVAC I or / old 7090 mainframe // indexing the articles // and then doing retrieval based upon relationships 10:29

GRI so it's taking us maybe half a century for the technology to catch up to this ideas // to have / the corpus [/] corpus frame methods / which can now / analyze large portions of texts reliably 10:45

GRI and / basically only over the last decade / have methods for discovering relationships from texts 10:53

GRI ok 10:56

GRI so -> / so much for history 10:58

GRI now / one of the challenges of information extraction at the moment / to understand there isn't progress / we need to appreciate very briefly / what the basic approach is / to information extraction 11:10

GRI and then the problems which arise / because of the complexities of language 11:14

GRI so expend a couple of minutes talking about / these problems // and then go through a number of &ah / areas of current research / and how they're trying to address these problems 11:25

GRI in a particular xxx / hhh {%act: cough} / survey or advertisement for what's going on at NYU // and so we'll have a &l [///] NYU symbol / xxx all torches will appear here and there // to show what we've been doing 11:37

GRI ok 11:40

GRI so the basic approach doing information extraction is very simple 11:46

GRI suppose to come back to the early application we were interested in figuring out // people who are hired or fired by companies 11:54

GRI ok? 11:57

GRI so / I give you that challenge 12:01

GRI maybe you / take this challenge up in Spanish 12:04

GRI think about five ways of saying somebody was hired or fired from a job 12:09

GRI ok? 12:11

GRI you can probably think of a couple of ways xxx here 12:14

GRI ok? 12:14

GRI this was a classroom where nobody's allowed to / give up assignments 12:18

GRI I have everybody sit down and write down / three or four patterns they could remember 12:22

GRI ok? 12:25

GRI and then we have some programmers here // we tell them to write some Perl program / or Python program // something which this nice pattern matching // and just run against / some newspapers // and see how all the pattern matches 12:39

GRI and / if some of the pattern match / we say / ok / we take the person / we put in this column // and we take the job / we put in this column // and we take the company / we put in this column // and we're all done 12:53

GRI and we would go have coffee / and you don't have to hear about all my problems 12:56

GRI well / this works // but it doesn't work very well 13:02

GRI people would discover after / hhh {%act: cough} / would discover after they try it // that they make it / five percent / ten percent recall they're really lucky 13:15

GRI so / why the simple pattern matching like this not work so well ? 13:19

GRI well / as I think we all know language is really very complicated // and all of the different problems of language / come forth in trying to do information extraction 13:29

GRI so / for example / the lots of different words / people who write for the Wall Street Journal / have to write articles like this everyday 13:42

GRI somebody was named to the job // somebody was appointed to the job // somebody was selected for the job 13:47

GRI so they are good at / finding new ways of saying the same thing 13:51

GRI they're paid to find new ways of saying the same thing 13:55

GRI so that makes it nice for readers / but it makes it more work for us as computational linguists 14:01

GRI then there're different constructs for providing the same information 14:08

GRI IBM named Fred as president // IBM announced the appointment of Fred as president // Fred / who was named president by IBM 14:17

GRI and so on and so forth 14:19

GRI ok then / people could be referred to in different ways 14:24

GRI so we can have George H. W. Bush / former president Bush / is something scored forty one / because he was a forty first president // so you / can differentiate / forty one from forty three 14:37

GRI ok? 14:38

GRI who's / in the office now ? 14:40

GRI ok so / all of these problems have to be addressed / beyond pattern matching in order to get reasonable information extraction 14:47

GRI then there are some ambiguities // so I present some in English / but there are probably comparable situations in Spanish 14:56

GRI so Fred's appointment as professor 14:59

GRI versus Fred's three o'clock appointment with the dean 15:04

GRI it's just a meeting // not [/] no job get started 15:08

GRI ok? 15:09

GRI so you can just look forward appointment / with somebody's name / and find xxx and say / &ah / somebody got a job 15:15

GRI a problem we had when we started doing the disease outbreak / extraction system // we had in mind that there would be outbreaks of typhoid / outbreaks of dengue // and then we run it against the day's newspaper // and the most common pattern we got was outbreaks of violence 15:34

GRI so we got all the tasks instead of all the diseases 15:37

GRI so lexical ambiguity becomes a problem 15:42

GRI then the structures aren't simple 15:45

GRI you'd like to have / person was appointed to job 15:49

GRI ok? 15:50

GRI so / we search through a few articles // and we found the following 15:54

GRI I don't know / if people can even figure out what the subject and object of the sentence are in this 16:00

GRI for the Federal Election Commission / Bush picked Justice Department employee / and former Fulton County Georgia Republican chairman / Hans von Spakovsky / for one of the three openings 16:15

GRI ok? 16:17

GRI so the / problem for the / person and the system / is picked / what ? 16:23

GRI or picked whom ? 16:24

GRI and you have to get through just this department employee hhh {%act: onomatopoeia} to get / von Spakovsky // as the object to picked 16:33

GRI ok? 16:34

GRI so simple pattern matching is not going to work 16:37

GRI we wouldn't need to do structure analysis in order to figure out what's going on 16:41

GRI even if we get through all of this // we have problems where we may have to go across sentences 16:50

GRI George Garrick has served as president of Sony for thirteen years 16:54

GRI the company announced his retirement effective next may 16:57

GRI ok? 16:58

GRI so you have to figure out what's his company / we can fill the database with / the company and him 17:04

GRI hhh {%act: cough} ok? 17:08

GRI hhh {%act: cough} excuse me one second 17:10

GRI {%com: drinks} so all of this means that we have a lot of work to do // in analyzing language in order to / be able to do / effective information extraction 17:29

GRI hhh {%act: cough} below I'll discussed a lot problems / we can group them roughly into two basic / types of problems 17:37

GRI collecting the patterns for a given relationship // and identifying the instances of these patterns in the text 17:45

GRI and / I'll begin by looking at the second problem // identifying the instances 17:51

GRI hhh {%act: cough} so / as I try to explain / with this example of von Spakovsky // it really doesn't work xxx do / extraction by just writing patterns which look for / sequences of tokens 18:11

GRI person pick name / is not going to work with // person pick just hhh {%act: onomatopoeia} / Hans von Spakovsky 18:23

GRI like you have just some way of figuring out / that / von Spakovsky is the object of picked 18:29

GRI so the patterns have to be stated at the structural level 18:34

GRI which means / as we understand that / before you can really do information extraction / you have to do a lot of linguistic analysis 18:41

GRI you have to identify names // and classify the names as people // and organizations and locations 18:48

GRI you have to figure out the syntactic structure // so we know what's the subject and what's the object hhh {%act: onomatopoeia} picked 18:56

GRI and we have to figure out coreference / so it is / the company and him we know what has been referred to 19:03

GRI and if the analysis is wrong at anyone of these stages / the pattern's not going to match 19:08

GRI so / what have people done over the last / decade or two decades in trying to address this problem of structure analysis ? 19:17

GRI well / people have broken it down into / different kinds of subtasks 19:21

GRI so named entities finding names // finding syntactic structure // finding coreference 19:27

GRI and people specialize each one of these problems 19:32

GRI building separate typically now / corpus frame models for doing each one of these tasks 19:37

GRI hhh {%act: cough} so people have / built large corpora annotated with names // large treebanks annotated with / syntactic relations // and even coreference / corpora 19:51

GRI and / after they've done this / started machine learning for different methods // they come and they give papers and say / look / we can get this wonderful / level of performance 20:03

GRI we can get ninety percent performance / for recognizing names 20:07

GRI and ninety percent accuracy for doing parsing 20:10

GRI and well / we can do coreference so well but / that's something for a children to work on 20:15

GRI so / we'd look 20:17

GRI ninety percent accuracy 20:18

GRI let's go out and let's sell our product 20:20

GRI well / you look back at the problem 20:23

GRI you see actually we've just / decomposed the problem into / name analysis // reference resolution relation tagging 20:31

GRI each one of these is ninety percent accurate 20:34

GRI let's say 20:35

GRI ok? 20:35

GRI maybe we are not xxx of everything 20:37

GRI but let's say everything is ninety percent accurate 20:39

GRI well // the end result is gonna be ninety percent / times ninety percent / times ninety percent // depending on how many of these components you / put together 20:52

GRI and so / with three components maybe we have seventy percent // maybe we can't sell our system anymore 20:59

GRI so / what do we do ? 21:03

GRI well / we can just / go home crying 21:06

GRI or we can look at the problem and say / well / we decomposed it // and in decomposing the problem we've looked at / trying to optimize each problem separately / each task separately 21:20

GRI so if looked separately / &ah finding the best names // finding the best relations // finding the best events and so forth 21:28

GRI and what we should do now / that we've decomposed these problems is take advantage of the interactions between the stages 21:35

GRI so / instead of making that xxx // try to / pick advantage / of all of these stages 21:42

GRI so what does that mean ? 21:45

GRI it means for example / preferring names / which allow for more coreference 21:49

GRI so / if you couldn't tell / if it was / the name was ABC // or ABCD 21:55

GRI but you find / some other articles xxx ABC // then most likely is gonna be the name mentioned more than once 22:03

GRI so / we did statistics like this // in which we looked at names which were only mentioned once // and names which were mentioned more than once 22:14

GRI and this graph shows &ah / if the name only appears once in an article 22:21

GRI and the chance that we got the name right is pretty low 22:24

GRI it's between forty five and sixty five percent 22:27

GRI but if the name is the same / as another mentioned which appeared somewhere else in the article / or in a different article // then we're much more confident that we got the name correct in this instance 22:38

GRI so if we can't tell whether it's von Spakovsky / or the name was just Spakovsky / and so forth 22:43

GRI if you see / somewhere else von Spakovsky // then you have much more confidence that you got the name correct 22:51

GRI hhh {%act: cough} &eh similar observation can be made about the connection between names and relations 22:58

GRI so / &ah the main fact to be seen from this graph / is the difference between / for each pair / of bars between the light purple and the dark purple 23:11

GRI so the light purple indicates / a name which appears in a relationship // like von Spakovsky / Fred's father / or something 23:20

GRI and the / purple / a name which shouldn't appear any relationships 23:26

GRI and basically what this graph tells us / let's see / we just look at the last bar // that / if it didn't participate in a relationship we only have a fifty percent chance / that we've got the name correct 23:39

GRI but if it participate in a relationship / we've got a ninety percent chance 23:43

GRI so / it's basically saying / if we've got the name right // then there's a much better chance / that we'll be able to identify relationship 23:51

GRI and so xxx backwards and say // if we've got a relationship / involving this name / then we probably identify this correctly as a person 24:00

GRI if we have some text where we have hhh {%act: onomatopoeia} / Fred's father // and we couldn't figure out whether hhh {%act: onomatopoeia} was a person or an organization or a location // you can probably tell from the fact that was Fred's father // and things on to screen here / that this was a person rather an organization 24:18

GRI so we try to use this sort of / relationship / between the different stages // and use in particular the constraints / that semantic relations / impose on the arguments 24:31

GRI so if we have hhh {%act: onomatopoeia} somebody's father // that means it has to be a person 24:35

GRI and we use these relationships to pick / preferred analysis 24:41

GRI so / how do we put that together ? 24:43

GRI or the basic idea is instead of with the name analyzer / analyze just / one possibility 24:49

GRI if the / analyzer is not sure what / the correct analysis is 24:55

GRI xxx generate multiple possibilities 24:58

GRI so-called / N-best choices 25:00

GRI so we can't tell whether the name is von Spakovsky or Spakovsky / we'll take both and will say // I'm not sure / what the analysis is 25:09

GRI let me pass two hypothesis on to the next stage 25:12

GRI then the next stage we'll re-rank this / based upon / the overall sentence structure / upon the relationships / upon coreference 25:20

GRI and we found / in this analysis by one of the students working in Chinese // that we could get a substantial reduction in the error rate 25:27

GRI just between / name analysis / coreference / and relations 25:34

GRI so the overall picture now look something like this // which start with the raw document // we've got a bunch of hypothesis for names / a bunch for coreference / a bunch for relations 25:48

GRI and once we've / put them all together // we go to this re-ranking model / which looks for global optimum 25:56

GRI and / by finding a global optimum across names and relations / and coreference // we're not able to get ninety percent / but maybe we can get eighty five percent accuracy instead of just seventy percent accuracy 26:10

GRI ok? 26:13

GRI so there's been several researches of work in this / &ah xxx view / &ah / professor Roth at / &ah Illinois also been working on this / in terms of a more standard optimization strategy 26:26

GRI so / I won't go into the / detail here // but I think this is a trend we're gonna see much more / instead of / optimizing the components separately // to have a situation now where we / to much credit extent use the components together // in order to break through this performance variants 26:45

GRI ok 26:48

GRI so / see that it's the main hope / in terms of improving overall performance // of course people will continue working on / coreference and names / and relations separately 26:58

GRI but I think it's gonna be the synthesis / of analysis which is going to / push us to our higher performance 27:05

GRI ok / let's take all one now and look at the other type of problem // the problem of collecting the patterns for a given relationship 27:12

GRI so / as I was saying a few minutes ago // there're lots of ways of expressing an event 27:21

GRI so this is / xxx / assassination of president Lincoln back in eighteen sixty five 27:28

GRI Booth assassinated Lincoln ? 27:31

GRI Lincoln was assassinated by Booth 27:33

GRI the assassination of Lincoln by Booth 27:36

GRI Booth went through with the assassination of Lincoln 27:39

GRI Booth murdered Lincoln 27:40

GRI Booth fatally shot Lincoln 27:42

GRI we can probably go through ten or twenty / or thirty or fifty / or maybe a hundred different ways / in which somebody can be / killed 27:50

GRI and again the situation is / if I / ask you to name / five ways of doing this // and say you can't go to coffee until you've found five ways // probably everybody will come up for five ways 28:06

GRI maybe / we have / almost hundred people and may have xxx data collection here // collect all the ways in which can we represent it 28:15

GRI but I think there still be a lot of ways missing 28:18

GRI and if I told people // you have to find a hundred ways of saying somebody was killed / or we can't get coffee // I think we'll have some very angry people in the audience 28:28

GRI so / what do we do about this ? 28:32

GRI this tail / the standard tail of the distribution which we see in some many linguistics' problems 28:39

GRI always comes to xxx / or we try to get good coverage 28:42

GRI so one thing we could do is / spend the rest of the afternoon reading newspapers 28:51

GRI ok? 28:53

GRI so / &ah we go and / collect El País // so whatever everybody reads here 29:00

GRI and hand out a copy 29:02

GRI everybody gets one day's copy of the newspaper 29:05

GRI we have some people here // &ah / Antonio // people who are expert in doing corpus annotation // so they will organize all the corpus annotated xxx / everybody will get a / marker // and have to mark / all the sentences saying somebody got hired or fired 29:21

GRI ok? 29:24

GRI and / we can / this maybe if someone paying for xxx but / you can everybody do this for a few days // you have to stay here for the rest of the week // and just mark / instances 29:35

GRI ok? 29:36

GRI well / again / some people might not be very happy with / doing this // 'cause I don't know if some people might prefer reading the newspaper 29:42

GRI so I'm not sure 29:43

GRI can we somehow / automate this process / in so get rid of all of this manual annotation ? 29:49

GRI and that's really what we're looking at / most of the rest of this talk 29:54

GRI how do we collect patterns more automatically ? 29:57

GRI in order to address this problem / I've divided into / syntactic and semantic paraphrases 30:08

GRI so syntactic paraphrases / involve the same words / or morphologically related words // and they are paraphrased which are broadly applicable 30:19

GRI so they can apply both to / instances of being killed // and instances of hiring and firing // any type of event 30:28

GRI so for example / Booth was &assassina [///] sorry 30:33

GRI Booth assassinated Lincoln 30:35

GRI Lincoln was assassinated by Booth 30:37

GRI the assassination of Lincoln by Booth 30:40

GRI Booth went through with the assassination of Lincoln 30:44

GRI ok? 30:45

GRI with a rather broad notion of &syntac [/] syntactic paraphrase // we can say that these are all syntactic 30:51

GRI we can put a different lexical item here // different word // and we still have a set of / paraphrases 30:57

GRI in contrast / all the paraphrase relations involve different word choices 31:04

GRI assassinated / murdered / fatally shot / in this would be semantic paraphrases 31:11

GRI so / how do we go about attacking these paraphrase relations ? 31:16

GRI the syntactic paraphrases can be addressed / by having deeper syntactic representations // in which we reduce / the paraphrases to a common relationship 31:26

GRI so / we can start with very / simple syntactic relations // for example just finding chunks // finding noun phrases and verb phrases 31:35

GRI then we might go to surface syntax 31:39

GRI we will look at / surface subject / and surface object 31:43

GRI then the next stage / is we might go to / deep structure 31:48

GRI so in deep structure we do logical subject and object 31:51

GRI and / Booth assassinated Lincoln / and Lincoln was assassinated by Booth // could get the same representation 31:59

GRI beyond that we then go to semantic // role structure / even though I'll call this as syntactic paraphrase // also can be described as predicate argument structure 32:11

GRI in a predicate argument structure // we go beyond what's / conventionally called deep structure // and say that the assassination of Lincoln by Booth / even though is a normalization // does have the same basic argument relationship 32:27

GRI the assassination of Lincoln by Booth // is the same argument relationship as the other two instances 32:33

GRI so the deeper we go / the deeper the / level of relationship we capture // the more syntactic paraphrases / upon handle was 32:45

GRI so how do we build analyzers to take care of these / deeper relationships ? 32:51

GRI well nowadays most syntactic analyzers / are created through training from treebanks 32:57

GRI so / with syntactic paraphrases like passives // and actives we'd see lots of examples / even with a limited corpus 33:07

GRI and so / with a treebank we can capture these relationships rather quickly 33:15

GRI the next stage in treebanking / which is now being / actively pursued // is the creation of predicate argument banks 33:24

GRI and there's a lot of work // in the United States and xxx somewhere also in / Europe in building predicate argument banks 33:31

GRI so / in particular there's been work / of English and something called the PropBank // which captures verb relationships // verb argument relationships 33:41

GRI that's been done / at the university of Pennsylvania 33:45

GRI and / we have been working on the NomBank / for noun arguments // so capturing for example relationships between assassination and the assassinated 33:55

GRI all the relationships between / Fred walked for an hour / and Fred took a walk for an hour 34:02

GRI again / nominalization and / verbal form 34:06

GRI so / this predicate-argument banks / will assign common argument labels to a wide range of constructs 34:17

GRI so they'll handle both a verbal // the Bulgarians attacked the Turks 34:24

GRI the Bulgarians' attack on the Turks 34:27

GRI the Bulgarian launched an attack on the Turks 34:29

GRI so all three of these sentences at a predicate-argument level / would become the same structure 34:35

GRI we'll have the same / would be called arg-zero or arg-one in a / predicate-argument bank 34:41

GRI same relationship being / xxx Bulgarians and Turks will appear / for all three sentences 34:47

GRI so / by training an analyzer based on this predicate-argument structure / we can eliminate a good deal of the syntactic paraphrase 34:59

GRI ///but analyze at this predicate-argument level / an then we will look for patterns / in terms of this predicate / argument relationships 35:09

GRI so that's the good news 35:12

GRI the bad news / is we've put another stage into the pipeline // and just go back a second 35:19

GRI remember our said pipeline ? 35:22

GRI well / we've put one more stage on it 35:25

GRI we have before syntactic / we have before surface parsing // and now we've put on one more stage / to do predicate-argument analysis 35:33

GRI so / we get some benefit / but we get some error 35:38

GRI hhh {%act: cough} the deeper the analysis / generally the less accurate the analysis becomes 35:49

GRI we've put in one more stage 35:51

GRI we've &p [/] produced one more stage of errors along with one more stage of analysis 35:57

GRI so / what do we do ? 35:59

GRI this is xxx problem for information extraction deeper 36:03

GRI do we take / a very shallow analysis like chunks / where we might get / ninety five percent accuracy 36:09

GRI or we take / a deep / analysis like predicate-argument structure / which captures much more data // but we might get with the only eighty percent accuracy 36:18

GRI the answer / is both 36:24

GRI so by &mo [/] xxx more complicated system / we can allow patterns at multiple levels 36:32

GRI so we'll write each pattern in terms of the chunk sequence // and the parse tree sequence // and the predicate-argument sequence 36:39

GRI and then we'll use a machine learning method / to weight these different analyses together 36:45

GRI so the hope is / that when the deep analysis fails / when we mess up on the predicate-argument analysis // we'll still be able to make the correct decision from a shallow analysis // finding from the chunks 36:59

GRI so we hope to get coverage / from predicate-argument structure when we get it right // and get accuracy from chunks / when we have / a pattern we've already seen at the chunk level 37:10

GRI so we did some experiments / using this basic approach / in order to / try to discover relation and events 37:19

GRI so / we used what's called a Kernel-based method / in which we measure the similarity between / an example we try to annotate // and one of the examples in the training corpus 37:32

GRI and these kernels work both at the word level // and we have one of the chunk level // and one at the predicate-argument level 37:41

GRI and then we combine / all of these measures / so that anyone of them got a good match // we were able to identify things 37:48

GRI so / the structure looks something like the following 37:52

GRI we have all of these levels of / analysis / logical relations / parsing / name tagging 37:58

GRI we put them into a classifier which does this / generalized matching // and that puts out a result based on what it finds as the best match 38:06

GRI and we got a significant performance / not overwhelming / but a couple of percent gained / in coverage / by being able to match both at the low level / chunk level / syntax level / and predicate-argument level 38:21

GRI and so we think by this approach where we use several levels of analysis // we will get both the benefits / of deep analysis / without a cost / in terms of greater errors 38:32

GRI ok 38:35

GRI so we believe this can / address this / syntactic paraphrase // so we xxx for the semantic paraphrase problem 38:43

GRI so some of the semantic paraphrase can be addressed / by existing lexical resources // such as WordNet 38:59

GRI so in particular people at / Sheffield / measure the degree to which / information extraction patterns could be generalized just using WordNet 39:09

GRI and they measure it on this task / which I've been talking about on and off / for the last few minutes 39:17

GRI this so-called executive succession task // of people being hired and fired 39:21

GRI so you start with a very small seed 39:25

GRI the things people can think of in the first two minutes 39:29

GRI company appointed / elected / promoted / and named a person 39:34

GRI a person resigned / or departed / or quit 39:39

GRI and then / shall explain a moment / we basically use WordNet to generalize from this / co-examples / to see what other / examples of hiring and firing we can find 39:53

GRI and then we won't have some measurement of how effective we are improving coverage 39:57

GRI so we'll use a rather simple metric / so-called text filtering metric // where we see whether / basically what fraction of the sentences which are relevant / are we able to extract 40:11

GRI what fraction of events can we find 40:13

GRI so in Sheffield's experiments / starting with / &ah the seed documents // those just = sorry / the seed patterns 40:26

GRI we would / get hundred percent precision on finding documents 40:31

GRI so every document which matched the pattern was / &ah relevant to the task // but only twenty six percent recall 40:39

GRI by applying WordNet / we then were able to get up to ninety six percent recall / we found almost every document // which was relevant // to hiring or firing 40:50

GRI but unfortunately now only about two thirds / of the documents we found / were relevant 40:55

GRI 'cause we get lots of / &ah / other senses of the words once we start to xxx things 41:03

GRI and if we do the same statistics at the sentence level // we get similar performance so recall goes up from ten to sixty four percent // but the precision finding sentences which are relevant / goes down to forty seven percent 41:16

GRI so / this is fairly effective but / it's not going to do / everything we want by itself 41:24

GRI furthermore the problem is / WordNet is gonna be good for some tasks // and not so good for others 41:31

GRI semantic paraphrase is much more domain-specific than syntactic paraphrase 41:36

GRI so it's hard to prepare some comprehensive resource 41:39

GRI if we suddenly start a talk about genomics // and want to get paraphrase is gonna be much harder to do that using WordNet 41:46

GRI so we turn back to the corpus and say // how can we do things with the corpus ? 41:53

GRI so instead of having people mark the corpus up // can we do things automatically ? 42:00

GRI how can we find / giving a few examples of &peo [/] people being hired // how can we find / automatically from a corpus without marking anything up / other ways of stating the same fact ? 42:14

GRI and we look for a briefly at two approaches here 42:17

GRI predicates with the same arguments // and predicates / in the same documents 42:23

GRI so let's first look at predicates with the same arguments 42:26

GRI ok / the basic intuition / is we find pairs of passages which probably convey the same information 42:36

GRI so we get two newspapers which talk about / hirings and firings on the same day 42:42

GRI and then we align the structures / that points at none correspondence 42:48

GRI so for example Fred XXX Harriet in one newspaper // and Fred YYY Harriet in other newspaper 42:56

GRI and we say / hhh {%act: assent} 42:58

GRI here two patterns which occur between the same names / maybe they are paraphrases / maybe they are relationships 43:07

GRI they / are related terms / because they connect the same people / in news from the same day 43:13

GRI so how accurate this is going to be / depends on part on / &ah / how we pull the texts 43:22

GRI so we have almost parallel text 43:24

GRI I'll explain in a moment 43:26

GRI we were pretty sure they talk about the same things // then we might be able to learn a paraphrase from a single example 43:33

GRI if it's from comparable texts 43:35

GRI so we have some evidence / that this is about the same staff / but we're not sure // then we might use a few examples 43:42

GRI or we can use just / any text / without any constraint / and then we'd need lots of examples 43:48

GRI so in terms of / parallel texts / experiments we'd done in Columbia a couple of years ago / taking two translations of the same novel 43:58

GRI so you would expect if we / take the same novel in French or in Spanish // and translate into English / there'll be a very close correspondence between the two English texts 44:09

GRI so they did this // and they aligned the sentences // and they aligned the individual constituents within the sentences 44:20

GRI and they were able to obtain a number of interesting paraphrases // mostly synonyms // paraphrase of the lexical level / from the translation 44:29

GRI but the problem is / the amount of data you have like this is rather limited 44:35

GRI if you basically have literary data you made of novels which got translated several times 44:40

GRI but whether genomics article is gonna be translated several times from Spanish to English / seems much less likely 44:47

GRI so we have to look at other ways in which we can get / &ah related / texts 44:55

GRI so experiments which we did a couple of years ago at NYU / were based on news stories from multiple sources from the same day 45:02

GRI so we take two newspapers // same day // and we'd looked for / pairs of articles which overlap in terms of several words // particularly several names 45:12

GRI if we find two articles which have a bunch of names in common // then there's a good chance / since they were from the same day / that they talked about the same subject 45:24

GRI ok? 45:24

GRI now we go down into those articles // and we take two -> sentences / which have the same names in them 45:31

GRI if they have the same names in common // then we have a pretty good chance that they're conveying the same information // along maybe with some / related facts 45:42

GRI so we looked then / we keep drilling down into more more detail / we looked for syntactic structures / in the sentences which shared the same names 45:52

GRI and we found sharing two names // we get a paraphrase / precision of sixty two percent 45:58

GRI so this was an experiment about murder in Japanese but / the details perhaps are not so relevant 46:05

GRI we were able to / pull out from news articles without any manual intervention / things which / two thirds of the time were correct / synonyms / correct paraphrases 46:14

GRI so / from single examples we were able to do fairly well 46:24

GRI if we want to increase the accuracy / we can look for multiple examples of the same relationship 46:30

GRI and the basic idea here = ups! / excuse me 46:33

GRI is that xxx an expression appears with several pairs of names 46:37

GRI so we have some / phrase R which appears / between A and B / C and D / E and F 46:43

GRI and then some other expression appears between / also several pairs // A and B / E and F 46:50

GRI then there's a good chance that R and S are paraphrases 46:53

GRI the more examples you find / the better the probability is 46:56

GRI so for example / if we have Eastern Group's agreement to buy Hanson 47:03

GRI Eastern Group / to acquire Hanson 47:06

GRI CBS will acquire Westinghouse 47:09

GRI CBS's purchase of Westinghouse 47:11

GRI CBS agreed to buy Westinghouse 47:12

GRI we pull out the main words // and we look for pairs of main words 47:19

GRI so buy appeared with both CBS / Westinghouse / and Eastern Group / Hanson 47:25

GRI and acquire appeared the same way 47:27

GRI so we say / two examples of each there's a pretty good chance / that we're getting paraphrase 47:33

GRI so there've been a number of / experiments along this line / trying to get the paraphrases 47:40

GRI &ah perhaps the best known fellow is Sergey Brin / who went on to get / billions of dollars doing Google // and isn't worried about / paraphrases anymore // but is hiring / probably dozens of people to worry about paraphrases 47:54

GRI and Lin and Pantel // and some work which / &ah Satoshi Sekine / did at NYU // trying to acquire [/] in each case trying to acquire in this way paraphrases between relations 48:08

GRI and the accuracy we can get if we have enough examples / is quite high 48:14

GRI so we were able to get for example / eighty six percent / accuracy in finding paraphrases / for person-company pairs 48:21

GRI ok 48:26

GRI so this is one approach / to finding / paraphrases with no human intervention 48:31

GRI just / put in millions of texts // and crank away 48:34

GRI the other approach we've been investigating / a frequently / &doc = sorry 48:39

GRI words which occur frequently / co-occur frequently in the same documents 48:44

GRI so here the basic idea / is we start from a topic // get a set of documents on the topic // and get &paraphra [/] get patterns about this topic 48:55

GRI so to explain this / goes back to some work which Ellen Riloff did now ten years ago // on / identifying / paraphrases from relevance judgements for topic // for documents 49:11

GRI so she divided / a large corpus into relevant and irrelevant documents 49:17

GRI at this time / ten years ago / it was about / &ah Latin-American terrorism 49:21

GRI and / classified &m [/] words as people / organizations and so forth 49:28

GRI identified the predicate-argument structures // in the document 49:33

GRI and then count how often / particular structures appear in relevant and irrelevant documents 49:38

GRI and using this / metric shown here / ranked the various constructions 49:45

GRI so the basic intuition is the following 49:48

GRI if we take a bunch of documents // which are about terrorism // and we take another bunch of documents / which are just about everything else / about / cocaine / and politics / and economy // and we ask / which words appear / much more often in the terrorism documents / than appeared in other documents 50:09

GRI and rank words by their relative frequency // then when we'll come to the top of the list // the top of this all divided by illegal / ranking // would be words which are specifically about terrorism 50:21

GRI and most likely we would have / collections of related words 50:26

GRI so I'll show an example in one second / how this works / for the hiring and firing case 50:33

GRI &ah this is an small extension of Riloff's work which was done by / Roman Yangarber at NYU 50:40

GRI basically converting this into bootstrapping method 50:44

GRI to start with a small seed like we did / &ah at Sheffield 50:49

GRI we filled some documents 50:52

GRI picked additional structures with high Riloff metric // and then repeat 50:58

GRI so how do this [/] how does this work ? 51:02

GRI we start with somebody retires // and get all the documents which talk about somebody retiring 51:10

GRI Fred retired / Maki retired / Harry retired 51:14

GRI just collect these retiring documents // and ask / what other / predicates occurred a lot in these documents 51:22

GRI but if you think about articles about retiring // it's very often gonna mention somebody else got hired for the job 51:30

GRI so we look at all the articles // and we look at / what patterns occur / repeatedly / in these retiring articles 51:39

GRI and we see / Harry was named president 51:41

GRI Yuki was named president // and so forth 51:43

GRI so we're just gonna rank the articles 51:45

GRI sorry 51:47

GRI rank the / constructs / by how often they appear in the articles // and pull out the most / commonly occurring constructs 51:55

GRI so we'll pull out / person was named president 52:00

GRI we'd stick it onto the set // and we repeat this process 52:06

GRI each time through the iteration // we'll now retrieve documents which have one of these two patterns // and we'll pick up the third pattern 52:17

GRI we did a very similar experiment what I described for Sheffield // we run another similar seed 52:24

GRI and / we got patterns like this 52:30

GRI again no human intervention 52:32

GRI this is just based upon co-occurrence statistics 52:35

GRI we can find / person succeeded person // person become president // person named president // person joined company // person &le [/] left post 52:47

GRI in terms of performance / statistical performance in terms of finding relevant documents // we got from / eleven percent recall with ninety three percent precision // using just the seed set // to having something like eighty one percent precision / eighty eight percent recall // just doing this / automatic / looping 53:13

GRI so this is comparable to the WordNet-based expansion // and provides a different set of patterns 53:19

GRI so in principle we could combine this / to get better performance 53:23

GRI shown graphically we can see / each step here going from left to right for instance / one iteration // going through / picking up one more / pattern from the set / using the Riloff metric 53:39

GRI and we can see the recall the blue line / rises rather rapidly and then // ups! // and then sort of flattens out 53:50

GRI and the position / slowly declining // going from ninety percent / to something like eighty percent / over the course of these iterations 54:00

GRI so there are a lot of numbers on this graph / but the main thing I wanted to / point out / is we compare the performance / doing this / automatic procedure just described // against the manual procedure // in which / people basically sat there for thirty days / and read newspapers / and try to find all the patterns they could 54:24

GRI and the automatic procedure is quite comparable in performance // to thirty days of manual work 54:31

GRI so they discovered patterns / got about / sixty percent performance 54:38

GRI the manually / collected patterns / I think gonna take some of the audience up next I am not sure 54:45

GRI &ah / the manually collected patterns got fifty six / to sixty four percent performance 54:51

GRI so we can see this automatic procedure can do about as well at finding new patterns / as / people sitting there for / bunch of weeks / collecting data 55:06

GRI so / one word now about / self combining the methods I've just described // the topic based method can find / set of paraphrases quite well / like name appoint select 55:22

GRI but / because they're just looking for articles / rather than patterns which appear in the same articles / they will also find / topically related phrases which are not paraphrases 55:32

GRI so they'll find / group together / appoint and resign / or shoot and die 55:37

GRI so the trick now is to combine the two approaches / we used before 55:44

GRI we talked about / finding patterns because they are in the same articles 55:49

GRI we talked about finding patterns / because they appear with the same arguments 55:53

GRI now we can couple this // this sort of topical discovery / and paraphrase discovery 55:59

GRI first find / all of the topical patterns 56:01

GRI so we'd find retire / and hire / and name and so forth 56:07

GRI and then just put / this set of patterns / into the paraphrase discovery 56:12

GRI and we get a / considerably better result than we could with either method alone 56:16

GRI so we did experiments like this 56:20

GRI we weren't able to find paraphrase / for all of the patterns which were topically relevant 56:26

GRI but we did much better than we could / just finding / using the paraphrase discovery by itself 56:32

GRI so in terms of / &ah numerical performance / we got up to precision by using these two methods together // ninety four percent precision / in terms of / automatically finding patterns which meant the same thing 56:46

GRI it's a rather remarkable level 56:49

GRI &ah / considering / level automation / the coverage is not so good // it's only about forty seven percent of the patterns / which were topically relevant were gathered together 57:01

GRI &ah / it turned out / unfortunately perhaps / we did more experiments and we found out that xxx this works very well / for executive succession // because people normally get hired only one job at a time // we tried it for the / for an arrest domain // where we had / crime stories about people being / in a burglary and theft and being arrested 57:24

GRI and it turns out that / paraphrase doesn't work so well // because people get arrested / typically or / quite often / for several crimes at once 57:35

GRI so Fred was arrested for burglary 57:37

GRI and Fred was arrested for / xxx 57:40

GRI so we combine / all these crimes into one / paraphrase unit 57:44

GRI so there's certainly still a lot of work to do 57:47

GRI whenever a student say / oh! this problem got solved ! 57:49

GRI xxx 57:50

GRI this language problem aren't gonna be solved for another / twenty years / thirty years / up till I retire 57:56

GRI ok 57:59

GRI &ah / Antonio asked me to mention briefly two other topics / before I close here 58:04

GRI so I'll say a few words about this 58:05

GRI one is about cross-language information extraction 58:08

GRI we've done a / number of work 58:11

GRI &ah / at least in the United States / everybody wants to xxx in English // ok? 58:17

GRI nobody wants to read / Spanish / or Chinese / or Arabic / or xxx 58:23

GRI everybody wants English // ok? 58:25

GRI so / we typically get this problem and want to do a database / in language L1 // for language / from texts in L2 58:34

GRI so we get / Spanish / and Chinese / and Arabic texts 58:38

GRI but people say / I want my database in English 58:41

GRI so / how do we do this ? 58:45

GRI well / there're basically two ways of doing this 58:48

GRI we can take the / upper path / or the lower path 58:51

GRI shall explain this two / for a moment 58:54

GRI the upper path / ok / we start with a bunch of articles in / &ah -> Chinese / let's say // ok? 59:05

GRI let's imagine / green is Chinese and red is English 59:08

GRI ok? 59:09

GRI so we start with a bunch of articles in Chinese // and we run / extraction / in Chinese 59:16

GRI we run the Chinese information extraction component 59:19

GRI and we'd get a database in Chinese 59:21

GRI ok? 59:23

GRI then we take / the Chinese database // and we take each terminate // and we run it through some / machine translation system // and get a database in English 59:34

GRI ok? 59:37

GRI other approach / is we take the Chinese texts // and we translate them with a machine translation system / into English 59:48

GRI and then we run an English extraction system 59:51

GRI ok? 59:54

GRI so two ways of addressing the same basic problem 59:59

GRI ok 59:59

GRI which one is gonna work better ? 1:00:01

GRI we take a vote / to vote as a class // like half people raise their hands 1:00:06

GRI but / this is maybe formal lecture / we are not supposed to have people / picking votes here 1:00:11

GRI so study might guess / ok so we did experiments / we had / Japanese graduate students so we did experiments in Japanese // same management succession task 1:00:25

GRI ok? 1:00:25

GRI people being hired or fired 1:00:27

GRI well / we did extraction on / machine translation output 1:00:32

GRI so we took the lower path 1:00:36

GRI we went down with MT / and then across / xxx the red lines 1:00:40

GRI we got forty one percent accuracy 1:00:43

GRI hhh {%act: } 1:00:45

GRI in &cont [/] I ´d come back to the / other items but let's look at the last line 1:00:50

GRI in contrast / if we did / extraction in Japanese // and then just translating the database entrance // we did much better with a sixty percent // which is not much worse than we did / using straight English // there we might get / sixty five or seventy percent performance 1:01:09

GRI so why is it much worse / even you might think 1:01:13

GRI they sell systems to do Japanese translation // they're designed to translate texts // not database 1:01:19

GRI why they should do much worse / if we translate the texts / and then do extraction ? 1:01:24

GRI well / it turns out / the argument structure / is not very well translated by these translation systems 1:01:32

GRI I think for pairs which &ah / have more similar / sentence structure / like Spanish and English // and French and English // you won't have some much / this problem 1:01:43

GRI but / if you / try to do Japanese / or Chinese / or Arabic / or something with a substantially different sentence structure // you'll get the right subject // and you'll get the correct object // and you'll get the correct verb 1:01:54

GRI but they won't be in the correct order // xxx of English 1:01:58

GRI and people sort of compensate / when they read machine translation 1:02:01

GRI and they say / &ah / Fred Smith appointed / IBM president 1:02:08

GRI they all know that wasn't actually that Fred Smith who appointed at / it was &ah / the IBM 1:02:15

GRI but IBM appointed Fred Smith as president 1:02:17

GRI so it's a sort of fix things up 1:02:19

GRI but the extraction system would do that // in so it gets / considering worse performance 1:02:25

GRI so in general / if we can afford the labor to do source language extraction followed by translation // we'll do better 1:02:37

GRI we were able to do some improvement / one of the hard things / in the translation is finding the names 1:02:43

GRI so by doing some projection of structure / from the source language / to the target language // so that we can / identify the names properly in the English texts / by / first finding them in the / Japanese // we were able to do somewhat better 1:03:00

GRI we got a fifty two percent 1:03:02

GRI but still not close to what we can do by doing / genuine foreign language extraction 1:03:08

GRI finally / two slices about / what's going on in terms of multilingual extraction programs in the United States 1:03:17

GRI one is basically an evaluation program 1:03:20

GRI this is a / sneaky trick where they don't / pay people to do research // they just advertise this evaluation 1:03:27

GRI and hope people would do research on this evaluation task // and will all come 1:03:33

GRI so / the United States / we run this so-called ACE evaluation // for doing information extraction 1:03:39

GRI and / until now / until this / past year it was in Arabic / in Chinese / in English 1:03:47

GRI ok? 1:03:48

GRI you can sort of imagine why the United States is interested in Arabic and Chinese and / English 1:03:53

GRI &ah / and we may / wonder / but / &ah Spanish has been added as a fourth language for the evaluation this year 1:04:02

GRI so if anybody has a good information extraction system for Spanish // they are welcomed to come to Washington 1:04:09

GRI &ah this evaluation is being run the first day of February of two thousand seven // and participate 1:04:17

GRI completely open / for the / Chinese task / we have a number of Chinese universities coming // participate in the evaluation 1:04:25

GRI &ah the Spanish task for this year would basically be / finding entities 1:04:29

GRI which means if you have a good / chunker / and you have good coreference in Spanish // you can probably do good job at this task 1:04:37

GRI other multilingual IE program / &ah / is a research program / a large scale research program founded by DARPA // and begun just one year ago // called GALE // also trilingual // Arabic / and Chinese / and English 1:04:54

GRI and involves / all the NLP processing you &cou [///] one 1:04:59

GRI so it involves / automatic speech recognition / machine translation / information retrieval / information extraction / a little bit of summarization / almost everything 1:05:09

GRI so the / retrieval extraction task involves answering questions about a multilingual corpus 1:05:18

GRI so we give them / for example / &ah / Arabic Al Jazeera / &ah / TV broadcasts 1:05:26

GRI and we're supposed to do / automatic / ASR / followed by machine translation / followed by extraction 1:05:34

GRI and xxx to answer questions like / describe reaction of country to event 1:05:39

GRI or list attacks in location during time and related deaths 1:05:43

GRI so a sort of mix of summarization is supposed to do extraction to pull out the relevant sentences // and then build the summary around them 1:05:52

GRI and this is / as I was saying the design / United States' every sponsors only in English 1:05:59

GRI people don't want to talk anything but English in the United States 1:06:02

GRI so we are supposed to / pull out / this / Al Jazeera staff 1:06:07

GRI transcribe it // translate it // present English sentences which are relevant to answering / these particular questions 1:06:15

GRI and DARPA says by the time / five years have finished we are supposed to do better than people do at these tasks 1:06:21

GRI so / I don't know if we believe that / but / they try to scare us every year with / very high performance targets 1:06:29

GRI so we are looking / in order to reach this level of performance / to combine the different IE paths 1:06:36

GRI we think that will have to go both ways / and combine the results at the end // in order to get the best extraction performance 1:06:43

GRI ok 1:06:47

GRI so to conclude / the current IE technology which I've / been describing / provides a powerful tool / for targeted searches in large text collections 1:06:57

GRI so I think it provides a / real capability beyond what we ever want to get / from information retrieval 1:07:03

GRI but it has limitations of performance // and recent basic research on NLP methods / some of which I've described here / offer substantial opportunities for improving extraction performance / and portability 1:07:17

GRI and the main points I've talked about a global optimization for improving / performance levels / treebanks / for things like predicate-argument level // &pre [/] predicate-argument / discovery // and corpus-based discovery methods / for greater coverage 1:07:33

GRI thank you 1:07:34