Corpus MAVIR: video

Convenciones de transcripción

@Title: Challenges for Information Extraction
@File: mavir01
@Participants: GRI, Ralph, (man, C, 3, professor, lecturer, New York) 
@Date: 16/11/2006
@Place: Madrid
@Situation: conference (I Jornadas MAVIR), conference room at university, not hidden, observing researcher  
@Topic: Current situation of  information extraction research
@Source: MAVIR 
@Class: formal in natural context, conference, monologue
@Length: 1 h 07' 39"
@Words: 9113
@Acoustic_quality: A
@Transcriber: M. Garrote
@Revisor: L. Campillos, M. Garrote
@Comments: 

GRI thank you Antonio / and thank you to the / entire MAVIR Network / for the invitation  00:06

GRI  &ah / can everybody hear me in the back ?  00:09

GRI  it's ok ?  00:10

GRI  &ah / one of the dangers of having an introduction like this / and being told / how many years you've worked already in computation linguistics // is to be asked of it and / why didn't you get further by now ?  00:22

GRI  after forty years we've learned / things go rather slowly / that we make progress / but progress is never as fast as / we predicted that it will be  00:33

GRI  looking ahead / we think in five years / everything will be wonderful  00:37

GRI  and now I say / well / maybe my children / or my grandchildren / will solve all these problems  00:43

GRI  one of the nice things about / natural language processing is a problem // is that is so rich / so deep // that / you can / peel off one layer of problems // get some level of solutions // but there will still be many layers / left for the next rank of students to solve  01:02

GRI  ok // so let me talk over about / the challenges for information extraction  01:09

GRI  &ah / I present both / some very / general overview / and then some technical details about various projects we've worked on  01:20

GRI  so let me begin by just explaining what I mean by information extraction  01:25

GRI  so identifying instances / of important entities / relations and events / from unstructured texts  01:33

GRI  ok // so for example / &ah / this is one / we've worked on enough for a long time // identifying people who are hired and fired / by various companies  01:48

GRI  so we have the sentence // George Garret / forty years old / president of London-based European Information Services / was appointed chief executive officer of Nielsen Marketing  01:59

GRI  and somebody comes and says // ok // we want to read all the newspapers // and keep track of who's been hired or fired  02:06

GRI  so we'd like to produce =  02:08

GRI  see if my mouse is awake  02:10

GRI  no  02:10

GRI  not awake up this morning  02:12

GRI  &ah / we'd like to produce a database / which lists position // and the company / and the location / and the person / and whether they got / left the job / or they got into the job  02:25

GRI  cause that's a typical example of what we / mean by doing information extraction  02:30

GRI  so to make a contrast / xxx / also hear about information retrieval // from the web / contrast between information extraction / and / information retrieval  02:42

GRI so / in information extraction / the operations are based on and then return normalized values like dates // or entities  02:52

GRI  so by an entity I mean not just the reference / but / if we have several references to the same person / to recognize that / as / the single particular person // and most important relations between entities / or between a company / and a person // between a person and a location  03:10

GRI  so information extraction has to be adapted to specific tasks  03:15

GRI  somebody will come and say / we're interested in hirings and firings  03:18

GRI  or we are interested in / talks of universities  03:22

GRI  or we are interested in / a task on cities or something  03:26

GRI  and the system has to be customized for that particular task  03:30

GRI  in contrast / information retrieval / basically based on terms // just on [/] on differentiated tokens  03:40

GRI  and we would turn documents or passages right than specific facts  03:44

GRI  and the benefit is that it's a general technology // something that doesn't have to be customized for a particular task  03:51

GRI  but the power I'd like to suggest of information extraction is that there's some questions / which are really hard to answer using just / Google  04:01

GRI  so if you wanted to get an answer to where has Condoleezza Rice been in the last month // you might have to Google for her name // see which articles refer to her / locations // and then slowly xxx the database might take you ten or twenty minutes / to answer this question  04:20

GRI  whereas with information extraction if we're interested in tracking people and their locations // this is a  plausible question we might pose directly  04:28

GRI  hhh {%act: cough} or what terrorist attacks occurred in Europe in 2004  04:34

GRI  if you xxx in terrorist attacks in Europe in 2004 // you might not get very much  04:39

GRI  'cause articles might talk about England / and Spain / and various countries  04:44

GRI  if you don't have relationships capture this part of retrieval // you're not going to do very well from term retrieval alone  04:51

GRI  one way in which / information extraction might be used / this is a search tool / &ah as a complement for / sort of standard Google-based tools for doing web search  05:07

GRI  the idea here is that people would be interested in particular domain and they'd come and say / we're interested in / &ah searching articles  05:14

GRI  in this example I'll show in a moment about / disease outbreaks  05:19

GRI  so we'll build the system for extracting relationships about disease outbreaks  05:25

GRI  what disease / occurred when / and where / and so forth  05:28

GRI  and then we pick the system // and we run it each day against the days news / using a web crawler up and down all the new sites  05:38

GRI  retrieve the latest news // build the database // and then we provide access to the article through this database  05:47

GRI  so / basic flow is the following // the web crawler // some filter to just get articles which are relevant to the task  05:56

GRI  then build an extraction engine which builds a database  05:59

GRI  on the other side somebody xxx with the browser // and would look something like this  06:05

GRI  so what we have over here on the top // is / a database meant to look very much like Excell // has the same sort of capabilities as an Excell database  06:17

GRI  which is the document date / the disease name / the time / the location / the country of the location / how many cases were reported / whether / they were sick or dead // and specific task description  06:32

GRI  so then you can search this just like you would in a Excel database / to restrictions // and when you are interested in a particular article / you click on a row / and up comes / at the bottom of the screen / the corresponding passage  06:46

GRI  so in this case it will say somebody click on dengue fever // and it says down below that the article it came from / why / state officials reported one additional recent case of dengue fever / and six cases that occur hhh {%act: onomatopoeia}  07:01

GRI  ok / the extraction technology's not completely reliable  07:06

GRI  so we get maybe / two thirds of correct information / three quarters of correct information / in the database  07:12

GRI  but then / the database gets / immediately link to articles // so / even though you can't rely on the database per se it's been accurate / certain information // if you view it as a search tool is really / can be much more powerful search tool for particular topics // and a key word scheme / like / Google or Yahoo  07:35

GRI  'cause you wouldn't expect if you typed in / the search terms to something like Google that / two thirds or three quarters of the articles were irrelevant  07:43

GRI  so / this sort of tools / for doing directed search have been applied in a number of areas // been applied in genomics  07:53

GRI  that's one area where there's certainly a lot of money / in the United States / and perhaps here also  07:58

GRI  and / &ah / researchers go with the money is xxx  08:04

GRI  so there's been quite a bit of work / on extracting information in particular about gene protein interactions  08:10

GRI  there's an enormous literature xxx by genomics researchers  08:14

GRI  you simply / can't keep up with all the staff that's been published / to get to search for something // it's a major effort  08:20

GRI  so being &ab [/] being able to pull out this specific relationships // we get a much more powerful search vehicle  08:28

GRI  in / medical applications / there's a lot of demand for generating summaries for medical reports  08:35

GRI  what fraction of / patients have this outcome / who xxx and so forth  08:40

GRI  and some of this is now being done by information extraction  08:44

GRI  also / in the idea of going with the money is / &ah / financial / information extraction is also being a major area  08:53

GRI  so / in keeping with this idea that / I've been slaving at this for a long time // the other people have been slaving at this area for even longer  09:04

GRI  so / although the idea of doing search from information extraction / seems like a / sort of timely and novel one // it's one which has been around now for // let's see // fifty eight so it's / almost half a century  09:19

GRI  so / back in nineteen fifty eight / there was a presentation / by Zellig Harris in Washington / where they had a conference  09:28

GRI  how should we be doing information retrieval / in a time when / there was very little online and / being online mean / taking an article / and typing it all into punch cards / and reading those punch cards in a reader  09:40

GRI  and having a couple of articles // punch cards // people who remember // &ah -> / having a small collection of articles online / maybe a few hundred  09:52

GRI  but at that time Zellig Harris succeeded in thinking / how could we automate this process / and have more powerful search vehicles // than just having keywords search  10:02

GRI  so he talked / in this fifty eight paper / about the idea of taking a set of articles / discovering the main relationships which appeared in the articles  10:12

GRI  he was interested in scientific / literature  10:15

GRI  automatically indexing the articles / xxx work was then probably UNIVAC I or / old 7090 mainframe // indexing the articles // and then doing retrieval based upon relationships  10:29

GRI  so it's taking us maybe half a century for the technology to catch up to this ideas // to have / the corpus [/] corpus frame methods / which can now / analyze large portions of texts reliably  10:45

GRI  and / basically only over the last decade / have methods for discovering relationships from texts  10:53

GRI  ok  10:56

GRI  so -> / so much for history  10:58

GRI  now / one of the challenges of information extraction at the moment / to understand there isn't progress / we need to appreciate very briefly / what the basic approach is / to information extraction  11:10

GRI  and then the problems which arise / because of the complexities of language  11:14

GRI  so expend a couple of minutes talking about / these problems // and then go through a number of &ah / areas of current research / and how they're trying to address these problems  11:25

GRI  in a particular xxx / hhh {%act: cough} / survey or advertisement for what's going on at NYU // and so we'll have a &l [///] NYU symbol / xxx all torches will appear here and there // to show what we've been doing  11:37

GRI  ok  11:40

GRI  so the basic approach doing information extraction is very simple  11:46

GRI  suppose to come back to the early application we were interested in figuring out // people who are hired or fired by companies  11:54

GRI  ok?  11:57

GRI  so / I give you that challenge  12:01

GRI  maybe you / take this challenge up in Spanish  12:04

GRI  think about five ways of saying somebody was hired or fired from a job  12:09

GRI  ok?  12:11

GRI  you can probably think of a couple of ways xxx here  12:14

GRI  ok?  12:14

GRI  this was a classroom where nobody's allowed to / give up assignments  12:18

GRI  I have everybody sit down and write down / three or four patterns they could remember  12:22

GRI  ok?  12:25

GRI  and then we have some programmers here // we tell them to write some Perl program / or Python program // something which this nice pattern matching // and just run against / some newspapers // and see how all the pattern matches  12:39

GRI  and / if some of the pattern match / we say / ok / we take the person / we put in this column // and we take the job / we put in this column // and we take the company / we put in this column // and we're all done  12:53

GRI  and we would go have coffee / and you don't have to hear about all my problems  12:56

GRI  well / this works // but it doesn't work very well  13:02

GRI  people would discover after / hhh {%act: cough} / would discover after they try it // that they make it / five percent / ten percent recall they're really lucky  13:15

GRI  so / why the simple pattern matching like this not work so well ?  13:19

GRI  well / as I think we all know language is really very complicated // and all of the different problems of language / come forth in trying to do information extraction  13:29

GRI  so / for example / the lots of different words / people who write for the Wall Street Journal / have to write articles like this everyday  13:42

GRI  somebody was named to the job // somebody was appointed to the job // somebody was selected for the job  13:47

GRI  so they are good at / finding new ways of saying the same thing  13:51

GRI  they're paid to find new ways of saying the same thing  13:55

GRI  so that makes it nice for readers / but it makes it more work for us as computational linguists  14:01

GRI  then there're different constructs for providing the same information  14:08

GRI  IBM named Fred as president // IBM announced the appointment of Fred as president // Fred / who was named president by IBM  14:17

GRI  and so on and so forth  14:19

GRI  ok then / people could be referred to in different ways  14:24

GRI  so we can have George H. W. Bush / former president Bush / is something scored forty one / because he was a forty first president // so you / can differentiate / forty one from forty three  14:37

GRI  ok?  14:38

GRI  who's / in the office now ?  14:40

GRI  ok so / all of these problems have to be addressed / beyond pattern matching in order to get reasonable information extraction  14:47

GRI  then there are some ambiguities // so I present some in English / but there are probably comparable situations in Spanish  14:56

GRI  so Fred's appointment as professor  14:59

GRI  versus Fred's three o'clock appointment with the dean  15:04

GRI  it's just a meeting // not [/] no job get started  15:08

GRI  ok?  15:09

GRI  so you can just look forward appointment / with somebody's name / and find xxx and say / &ah / somebody got a job  15:15

GRI  a problem we had when we started doing the disease outbreak / extraction system // we had in mind that there would be outbreaks of typhoid / outbreaks of dengue // and then we run it against the day's newspaper // and the most common pattern we got was outbreaks of violence  15:34

GRI  so we got all the tasks instead of all the diseases  15:37

GRI  so lexical ambiguity becomes a problem  15:42

GRI  then the structures aren't simple  15:45

GRI  you'd like to have / person was appointed to job  15:49

GRI  ok?  15:50

GRI  so / we search through a few articles // and we found the following  15:54

GRI  I don't know / if people can even figure out what the subject and object of the sentence are in this  16:00

GRI  for the Federal Election Commission / Bush picked Justice Department employee / and former Fulton County Georgia Republican chairman / Hans von Spakovsky / for one of the three openings  16:15

GRI  ok?  16:17

GRI  so the / problem for the / person and the system / is picked / what ?  16:23

GRI  or picked whom ?  16:24

GRI  and you have to get through just this department employee hhh {%act: onomatopoeia} to get / von Spakovsky // as the object to picked  16:33

GRI  ok?  16:34

GRI  so simple pattern matching is not going to work  16:37

GRI  we wouldn't need to do structure analysis in order to figure out what's going on  16:41

GRI  even if we get through all of this // we have problems where we may have to go across sentences  16:50

GRI  George Garrick has served as president of Sony for thirteen years  16:54

GRI  the company announced his retirement effective next may  16:57

GRI  ok?  16:58

GRI  so you have to figure out what's his company / we can fill the database with / the company and him  17:04

GRI  hhh {%act: cough} ok?  17:08

GRI  hhh {%act: cough} excuse me one second  17:10

GRI  {%com: drinks} so all of this means that we have a lot of work to do // in analyzing language in order to / be able to do / effective information extraction  17:29

GRI  hhh {%act: cough} below I'll discussed a lot problems / we can group them roughly into two basic / types of problems  17:37

GRI  collecting the patterns for a given relationship // and identifying the instances of these patterns in the text  17:45

GRI  and / I'll begin by looking at the second problem // identifying the instances  17:51

GRI  hhh {%act: cough} so / as I try to explain / with this example of von Spakovsky // it really doesn't work xxx do / extraction by just writing patterns which look for / sequences of tokens  18:11

GRI  person pick name / is not going to work with // person pick just hhh {%act: onomatopoeia} / Hans von Spakovsky  18:23

GRI  like you have just some way of figuring out / that / von Spakovsky is the object of picked  18:29

GRI  so the patterns have to be stated at the structural level  18:34

GRI  which means / as we understand that / before you can really do information extraction / you have to do a lot of linguistic analysis  18:41

GRI  you have to identify names // and classify the names as people // and organizations and locations  18:48

GRI  you have to figure out the syntactic structure // so we know what's the subject and what's the object hhh {%act: onomatopoeia} picked  18:56

GRI  and we have to figure out coreference / so it is / the company and him we know what has been referred to  19:03

GRI  and if the analysis is wrong at anyone of these stages / the pattern's not going to match  19:08

GRI  so / what have people done over the last / decade or two decades in trying to address this problem of structure analysis ?  19:17

GRI  well / people have broken it down into / different kinds of subtasks  19:21

GRI  so named entities finding names // finding syntactic structure // finding coreference  19:27

GRI  and people specialize each one of these problems  19:32

GRI  building separate typically now / corpus frame models for doing each one of these tasks  19:37

GRI  hhh {%act: cough} so people have / built large corpora annotated with names // large treebanks annotated with / syntactic relations // and even coreference / corpora  19:51

GRI  and / after they've done this / started machine learning for different methods // they come and they give papers and say / look / we can get this wonderful / level of performance  20:03

GRI  we can get ninety percent performance / for recognizing names  20:07

GRI  and ninety percent accuracy for doing parsing  20:10

GRI  and well / we can do coreference so well but / that's something for a children to work on  20:15

GRI  so / we'd look  20:17

GRI  ninety percent accuracy  20:18

GRI  let's go out and let's sell our product  20:20

GRI  well / you look back at the problem  20:23

GRI  you see actually we've just / decomposed the problem into / name analysis // reference resolution relation tagging  20:31

GRI  each one of these is ninety percent accurate  20:34

GRI  let's say  20:35

GRI  ok?  20:35

GRI  maybe we are not xxx of everything  20:37

GRI  but let's say everything is ninety percent accurate  20:39

GRI  well // the end result is gonna be ninety percent / times ninety percent / times ninety percent // depending on how many of these components you / put together  20:52

GRI  and so / with three components maybe we have seventy percent // maybe we can't sell our system anymore  20:59

GRI  so / what do we do ?  21:03

GRI  well / we can just / go home crying  21:06

GRI  or we can look at the problem and say / well / we decomposed it // and in decomposing the problem we've looked at / trying to optimize each problem separately / each task separately  21:20

GRI  so if looked separately / &ah finding the best names // finding the best relations // finding the best events and so forth  21:28

GRI  and what we should do now / that we've decomposed these problems is take advantage of the interactions between the stages  21:35

GRI  so / instead of making that xxx // try to / pick advantage / of all of these stages  21:42

GRI  so what does that mean ?  21:45

GRI  it means for example / preferring names / which allow for more coreference  21:49

GRI  so / if you couldn't tell / if it was / the name was ABC // or ABCD  21:55

GRI  but you find / some other articles xxx ABC // then most likely is gonna be the name mentioned more than once  22:03

GRI  so / we did statistics like this // in which we looked at names which were only mentioned once // and names which were mentioned more than once  22:14

GRI and this graph shows &ah / if the name only appears once in an article  22:21

GRI and the chance that we got the name right is pretty low  22:24

GRI it's between forty five and sixty five percent  22:27

GRI but if the name is the same / as another mentioned which appeared somewhere else in the article / or in a different article // then we're much more confident that we got the name correct in this instance  22:38

GRI so if we can't tell whether it's von Spakovsky / or the name was just Spakovsky / and so forth  22:43

GRI if you see / somewhere else von Spakovsky // then you have much more confidence that you got the name correct  22:51

GRI  hhh {%act: cough} &eh similar observation can be made about the connection between names and relations  22:58

GRI so / &ah the main fact to be seen from this graph / is the difference between / for each pair / of bars between the light purple and the dark purple  23:11

GRI so the light purple indicates / a name which appears in a relationship // like von Spakovsky / Fred's father / or something  23:20

GRI  and the / purple / a name which shouldn't appear any relationships  23:26

GRI and basically what this graph tells us / let's see / we just look at the last bar // that / if it didn't participate in a relationship we only have a fifty percent chance / that we've got the name correct  23:39

GRI but if it participate in a relationship / we've got a ninety percent chance  23:43

GRI so / it's basically saying / if we've got the name right // then there's a much better chance / that we'll be able to identify relationship  23:51

GRI and so xxx backwards and say // if we've got a relationship / involving this name / then we probably identify this correctly as a person  24:00

GRI if we have some text where we have hhh {%act: onomatopoeia} / Fred's father // and we couldn't figure out whether hhh {%act: onomatopoeia} was a person or an organization or a location // you can probably tell from the fact that was Fred's father // and things on to screen here / that this was a person rather an organization  24:18

GRI so we try to use this sort of / relationship / between the different stages // and use in particular the constraints / that semantic relations / impose on the arguments  24:31

GRI so if we have hhh {%act: onomatopoeia} somebody's father // that means it has to be a person  24:35

GRI and we use these relationships to pick / preferred analysis  24:41

GRI so / how do we put that together ?  24:43

GRI or the basic idea is instead of with the name analyzer / analyze just / one possibility  24:49

GRI if the / analyzer is not sure what / the correct analysis is  24:55

GRI xxx generate multiple possibilities  24:58

GRI so-called / N-best choices  25:00

GRI so we can't tell whether the name is von Spakovsky or Spakovsky / we'll take both and will say // I'm not sure / what the analysis is  25:09

GRI let me pass two hypothesis on to the next stage  25:12

GRI then the next stage we'll re-rank this / based upon / the overall sentence structure / upon the relationships / upon coreference  25:20

GRI and we found / in this analysis by one of the students working in Chinese // that we could get a substantial reduction in the error rate  25:27

GRI just between / name analysis / coreference / and relations  25:34

GRI so the overall picture now look something like this // which start with the raw document // we've got a bunch of hypothesis for names / a bunch for coreference / a bunch for relations  25:48

GRI and once we've / put them all together // we go to this re-ranking model / which looks for global optimum  25:56

GRI and / by finding a global optimum across names and relations / and coreference // we're not able to get ninety percent / but maybe we can get eighty five percent accuracy instead of just seventy percent accuracy  26:10

GRI ok?  26:13

GRI so there's been several researches of work in this / &ah xxx view / &ah / professor Roth at / &ah Illinois also been working on this /  in terms of a more standard optimization strategy  26:26

GRI so / I won't go into the / detail here // but I think this is a trend we're gonna see much more / instead of / optimizing the components separately // to have a situation now where we / to much credit extent use the components together // in order to break through this performance variants  26:45

GRI ok  26:48

GRI so / see that it's the main hope / in terms of improving overall performance // of course people will continue working on / coreference and names / and relations separately  26:58

GRI but I think it's gonna be the synthesis / of analysis which is going to / push us to our higher performance  27:05

GRI ok / let's take all one now and look at the other type of problem // the problem of collecting the patterns for a given relationship  27:12

GRI so / as I was saying a few minutes ago // there're lots of ways of expressing an event  27:21

GRI so this is / xxx / assassination of president Lincoln back in eighteen sixty five  27:28

GRI Booth assassinated Lincoln ?  27:31

GRI  Lincoln was assassinated by Booth  27:33

GRI the assassination of Lincoln by Booth  27:36

GRI Booth went through with the assassination of Lincoln  27:39

GRI Booth murdered Lincoln  27:40

GRI Booth fatally shot Lincoln  27:42

GRI we can probably go through ten or twenty / or thirty or fifty / or maybe a hundred different ways / in which somebody can be / killed  27:50

GRI and again the situation is / if I / ask you to name / five ways of doing this // and say you can't go to coffee until you've found five ways // probably everybody will come up for five ways  28:06

GRI maybe / we have / almost hundred people and may have xxx data collection here // collect all the ways in which can we represent it  28:15

GRI but I think there still be a lot of ways missing  28:18

GRI and if I told people // you have to find a hundred ways of saying somebody was killed / or we can't get coffee // I think we'll have some very angry people in the audience  28:28

GRI so / what do we do about this ?  28:32

GRI this tail / the standard tail of the distribution which we see in some many linguistics' problems  28:39

GRI always comes to xxx / or we try to get good coverage  28:42

GRI so one thing we could do is / spend the rest of the afternoon reading newspapers  28:51

GRI ok?  28:53

GRI so / &ah we go and / collect El País // so whatever everybody reads here  29:00

GRI and hand out a copy  29:02

GRI everybody gets one day's copy of the newspaper  29:05

GRI we have some people here // &ah / Antonio // people who are expert in doing corpus annotation // so they will organize all the corpus annotated xxx / everybody will get a / marker // and have to mark / all the sentences saying somebody got hired or fired  29:21

GRI ok?  29:24

GRI and / we can / this maybe if someone paying for xxx but / you can everybody do this for a few days // you have to stay here for the rest of the week // and just mark / instances  29:35

GRI ok?  29:36

GRI well / again / some people might not be very happy with / doing this // 'cause I don't know if some people might prefer reading the newspaper  29:42

GRI  so I'm not sure  29:43

GRI can we somehow / automate this process / in so get rid of all of this manual annotation ?  29:49

GRI and that's really what we're looking at / most of the rest of this talk  29:54

GRI how do we collect patterns more automatically ?  29:57

GRI in order to address this problem / I've divided into / syntactic and semantic paraphrases  30:08

GRI so syntactic paraphrases / involve the same words / or morphologically related words // and they are paraphrased which are broadly applicable  30:19

GRI so they can apply both to / instances of being killed // and instances of hiring and firing // any type of event  30:28

GRI so for example / Booth was &assassina [///] sorry  30:33

GRI Booth assassinated Lincoln  30:35

GRI Lincoln was assassinated by Booth  30:37

GRI the assassination of Lincoln by Booth  30:40

GRI Booth went through with the assassination of Lincoln  30:44

GRI ok?  30:45

GRI with a rather broad notion of &syntac [/] syntactic paraphrase // we can say that these are all syntactic  30:51

GRI we can put a different lexical item here // different word // and we still have a set of / paraphrases  30:57

GRI in contrast / all the paraphrase relations involve different word choices  31:04

GRI assassinated / murdered / fatally shot / in this would be semantic paraphrases  31:11

GRI so / how do we go about attacking these paraphrase relations ?  31:16

GRI the syntactic paraphrases can be addressed / by having deeper syntactic representations // in which we reduce / the paraphrases to a common relationship  31:26

GRI so / we can start with very / simple syntactic relations // for example just finding chunks // finding noun phrases and verb phrases  31:35

GRI then we might go to surface syntax  31:39

GRI we will look at / surface subject / and surface object  31:43

GRI then the next stage / is we might go to / deep structure  31:48

GRI so in deep structure we do logical subject and object  31:51

GRI and / Booth assassinated Lincoln / and Lincoln was assassinated by Booth // could get the same representation  31:59

GRI beyond that we then go to semantic // role structure / even though I'll call this as syntactic paraphrase // also can be described as predicate argument structure  32:11

GRI in a predicate argument structure // we go beyond what's / conventionally called deep structure // and say that the assassination of Lincoln by Booth / even though is a normalization // does have the same basic argument relationship  32:27

GRI the assassination of Lincoln by Booth // is the same argument relationship as the other two instances  32:33

GRI so the deeper we go / the deeper the / level of relationship we capture // the more syntactic paraphrases / upon handle was  32:45

GRI so how do we build analyzers to take care of these / deeper relationships ?  32:51

GRI well nowadays most syntactic analyzers / are created through training from treebanks  32:57

GRI so / with syntactic paraphrases like passives // and actives we'd see lots of examples / even with a limited corpus  33:07

GRI and so / with a treebank we can capture these relationships rather quickly  33:15

GRI the next stage in treebanking / which is now being / actively pursued // is the creation of predicate argument banks  33:24

GRI and there's a lot of work // in the United States and xxx somewhere also in / Europe in building predicate argument banks  33:31

GRI so / in particular there's been work / of English and something called the PropBank // which captures verb relationships // verb argument relationships  33:41

GRI that's been done / at the university of Pennsylvania  33:45

GRI and / we have been working on the NomBank / for noun arguments // so capturing for example relationships between assassination and the assassinated  33:55

GRI all the relationships between / Fred walked for an hour / and Fred took a walk for an hour  34:02

GRI again / nominalization and / verbal form  34:06

GRI so / this predicate-argument banks / will assign common argument labels to a wide range of constructs  34:17

GRI so they'll handle both a verbal // the Bulgarians attacked the Turks  34:24

GRI the Bulgarians' attack on the Turks  34:27

GRI the Bulgarian launched an attack on the Turks  34:29

GRI so all three of these sentences at a predicate-argument level / would become the same structure  34:35

GRI we'll have the same / would be called arg-zero or arg-one in a / predicate-argument bank  34:41

GRI same relationship being / xxx Bulgarians and Turks will appear / for all three sentences  34:47

GRI so / by training an analyzer based on this predicate-argument structure / we can eliminate a good deal of the syntactic paraphrase  34:59

GRI ///but analyze at this predicate-argument level / an then we will look for patterns / in terms of this predicate / argument relationships  35:09

GRI so that's the good news  35:12

GRI the bad news / is we've put another stage into the pipeline // and just go back a second  35:19

GRI remember our said pipeline ?  35:22

GRI well / we've put one more stage on it  35:25

GRI we have before syntactic / we have before surface parsing // and now we've put on one more stage / to do predicate-argument analysis  35:33

GRI so / we get some benefit / but we get some error  35:38

GRI  hhh {%act: cough} the deeper the analysis / generally the less accurate the analysis becomes  35:49

GRI we've put in one more stage  35:51

GRI we've &p [/] produced one more stage of errors along with one more stage of analysis  35:57

GRI so / what do we do ?  35:59

GRI this is xxx problem for information extraction deeper  36:03

GRI do we take / a very shallow analysis like chunks / where we might get / ninety five percent accuracy  36:09

GRI or we take / a deep / analysis like predicate-argument structure / which captures much more data // but we might get with the only eighty percent accuracy  36:18

GRI the answer / is both  36:24

GRI so by &mo [/] xxx more complicated system / we can allow patterns at multiple levels  36:32

GRI so we'll write each pattern in terms of the chunk sequence // and the parse tree sequence // and the predicate-argument sequence  36:39

GRI and then we'll use a machine learning method / to weight these different analyses together  36:45

GRI so the hope is / that when the deep analysis fails / when we mess up on the predicate-argument analysis // we'll still be able to make the correct decision from a shallow analysis // finding from the chunks  36:59

GRI so we hope to get coverage / from predicate-argument structure when we get it right // and get accuracy from chunks / when we have / a pattern we've already seen at the chunk level  37:10

GRI so we did some experiments / using this basic approach / in order to / try to discover relation and events  37:19

GRI so / we used what's called a Kernel-based method / in which we measure the similarity between / an example we try to annotate // and one of the examples in the training corpus  37:32

GRI and these kernels work both at the word level // and we have one of the chunk level // and one at the predicate-argument level  37:41

GRI and then we combine / all of these measures / so that anyone of them got a good match // we were able to identify things  37:48

GRI so / the structure looks something like the following  37:52

GRI we have all of these levels of / analysis / logical relations / parsing / name tagging  37:58

GRI we put them into a classifier which does this /  generalized matching // and that puts out a result based on what it finds as the best match  38:06

GRI and we got a significant performance / not overwhelming / but a couple of percent gained / in coverage / by being able to match both at the low level / chunk level / syntax level / and predicate-argument level  38:21

GRI and so we think by this approach where we use several levels of analysis // we will get both the benefits / of deep analysis / without a cost / in terms of greater errors  38:32

GRI ok  38:35

GRI so we believe this can / address this / syntactic paraphrase // so we xxx for the semantic paraphrase problem  38:43

GRI so some of the semantic paraphrase can be addressed / by existing lexical resources // such as WordNet  38:59

GRI so in particular people at / Sheffield / measure the degree to which / information extraction patterns could be generalized just using WordNet  39:09

GRI and they measure it on this task / which I've been talking about on and off / for the last few minutes  39:17

GRI this so-called executive succession task // of people being hired and fired  39:21

GRI  so you start with a very small seed  39:25

GRI the things people can think of in the first two minutes  39:29

GRI company appointed / elected / promoted / and named a person  39:34

GRI a person resigned / or departed / or quit  39:39

GRI and then / shall explain a moment / we basically use WordNet to generalize from this / co-examples / to see what other / examples of hiring and firing we can find  39:53

GRI and then we won't have some measurement of how effective we are improving coverage  39:57

GRI so we'll use a rather simple metric / so-called text filtering metric // where we see whether / basically what fraction of the sentences which are relevant / are we able to extract  40:11

GRI what fraction of events can we find  40:13

GRI so in Sheffield's experiments / starting with / &ah the seed documents // those just = sorry / the seed patterns  40:26

GRI we would / get hundred percent precision on finding documents  40:31

GRI so every document which matched the pattern was / &ah relevant to the task // but only twenty six percent recall  40:39

GRI by applying WordNet / we then were able to get up to ninety six percent recall / we found almost every document // which was relevant // to hiring or firing  40:50

GRI but unfortunately now only about two thirds / of the documents we found / were relevant  40:55

GRI 'cause we get lots of / &ah / other senses of the words once we start to xxx things  41:03

GRI and if we do the same statistics at the sentence level // we get similar performance so recall goes up from ten to sixty four percent // but the precision finding sentences which are relevant / goes down to forty seven percent  41:16

GRI so / this is fairly effective but / it's not going to do / everything we want by itself  41:24

GRI furthermore the problem is / WordNet is gonna be good for some tasks // and not so good for others  41:31

GRI semantic paraphrase is much more domain-specific than syntactic paraphrase  41:36

GRI so it's hard to prepare some comprehensive resource  41:39

GRI if we suddenly start a talk about genomics // and want to get paraphrase is gonna be much harder to do that using WordNet  41:46

GRI so we turn back to the corpus and say // how can we do things with the corpus ?  41:53

GRI so instead of having people mark the corpus up // can we do things automatically ?  42:00

GRI how can we find / giving a few examples of &peo [/] people being hired // how can we find / automatically from a corpus without marking anything up / other ways of stating the same fact ?  42:14

GRI and we look for a briefly at two approaches here  42:17

GRI predicates with the same arguments // and predicates / in the same documents  42:23

GRI so let's first look at predicates with the same arguments  42:26

GRI ok / the basic intuition / is we find pairs of passages which probably convey the same information  42:36

GRI so we get two newspapers which talk about / hirings and firings on the same day  42:42

GRI and then we align the structures / that points at none correspondence  42:48

GRI so for example Fred XXX Harriet in one newspaper // and Fred YYY Harriet in other newspaper  42:56

GRI and we say / hhh {%act: assent}  42:58

GRI here two patterns which occur between the same names / maybe they are paraphrases / maybe they are relationships  43:07

GRI they / are related terms / because they connect the same people / in news from the same day  43:13

GRI so how accurate this is going to be / depends on part on / &ah / how we pull the texts  43:22

GRI so we have almost parallel text  43:24

GRI I'll explain in a moment  43:26

GRI we were pretty sure they talk about the same things // then we might be able to learn a paraphrase from a single example  43:33

GRI if it's from comparable texts  43:35

GRI so we have some evidence / that this is about the same staff / but we're not sure // then we might use a few examples  43:42

GRI or we can use just / any text / without any constraint / and then we'd need lots of examples  43:48

GRI so in terms of / parallel texts / experiments we'd done in Columbia a couple of years ago / taking two translations of the same novel  43:58

GRI so you would expect if we / take the same novel in French or in Spanish // and translate into English / there'll be a very close correspondence between the two English texts  44:09

GRI so they did this // and they aligned the sentences // and they aligned the individual constituents within the sentences  44:20

GRI and they were able to obtain a number of interesting paraphrases // mostly synonyms // paraphrase of the lexical level / from the translation  44:29

GRI but the problem is / the amount of data you have like this is rather limited  44:35

GRI if you basically have literary data you made of novels which got translated several times  44:40

GRI but whether genomics article is gonna be translated several times from Spanish to English / seems much less likely  44:47

GRI  so we have to look at other ways in which we can get / &ah related / texts  44:55

GRI so experiments which we did a couple of years ago at NYU / were based on news stories from multiple sources from the same day  45:02

GRI so we take two newspapers // same day // and we'd looked for / pairs of articles which overlap in terms of several words // particularly several names  45:12

GRI if we find two articles which have a bunch of names in common // then there's a good chance / since they were from the same day / that they talked about the same subject  45:24

GRI ok?  45:24

GRI now we go down into those articles // and we take two -> sentences / which have the same names in them  45:31

GRI if they have the same names in common // then we have a pretty good chance that they're conveying the same information // along maybe with some / related facts  45:42

GRI  so we looked then / we keep drilling down into more more detail / we looked for syntactic structures / in the sentences which shared the same names  45:52

GRI and we found sharing two names // we get a paraphrase / precision of sixty two percent  45:58

GRI so this was an experiment about murder in Japanese but / the details perhaps are not so relevant  46:05

GRI we were able to / pull out from news articles without any manual intervention / things which / two thirds of  the time were correct / synonyms / correct paraphrases  46:14

GRI so / from single examples we were able to do fairly well  46:24

GRI if we want to increase the accuracy / we can look for multiple examples of the same relationship  46:30

GRI and the basic idea here = ups! / excuse me  46:33

GRI is that xxx an expression appears with several pairs of names  46:37

GRI so we have some / phrase R which appears / between A and B / C and D / E and F  46:43

GRI and then some other expression appears between / also several pairs // A and B /  E and F  46:50

GRI then there's a good chance that  R and S are paraphrases  46:53

GRI the more examples you find / the better the probability is  46:56

GRI so for example / if we have Eastern Group's agreement to buy Hanson  47:03

GRI Eastern Group / to acquire Hanson  47:06

GRI CBS will acquire Westinghouse  47:09

GRI CBS's purchase of Westinghouse  47:11

GRI CBS agreed to buy Westinghouse  47:12

GRI we pull out the main words // and we look for pairs of main words  47:19

GRI so buy appeared with both CBS / Westinghouse / and Eastern Group / Hanson  47:25

GRI and acquire appeared the same way  47:27

GRI so we say / two examples of each there's a pretty good chance / that we're getting paraphrase  47:33

GRI so there've been a number of / experiments along this line / trying to get the paraphrases  47:40

GRI &ah perhaps the best known fellow is Sergey Brin / who went on to get / billions of dollars doing Google // and isn't worried about / paraphrases anymore // but is hiring / probably dozens of people to worry about paraphrases  47:54

GRI and Lin and Pantel // and some work which / &ah Satoshi Sekine / did at NYU // trying to acquire [/] in each case trying to acquire in this way paraphrases between relations  48:08

GRI  and the accuracy we can get if we have enough examples / is quite high  48:14

GRI so we were able to get for example / eighty six percent / accuracy in finding paraphrases / for person-company pairs  48:21

GRI ok  48:26

GRI so this is one approach / to finding / paraphrases with no human intervention  48:31

GRI just / put in millions of texts // and crank away  48:34

GRI the other approach we've been investigating /  a frequently  / &doc = sorry  48:39

GRI words which occur frequently / co-occur frequently in the same documents  48:44

GRI so here the basic idea / is we start from a topic // get a set of documents on the topic // and get &paraphra [/] get patterns about this topic  48:55

GRI so to explain this / goes back to some work which Ellen Riloff did now ten years ago // on / identifying / paraphrases from relevance judgements for topic // for documents  49:11

GRI so she divided / a large corpus into relevant and irrelevant documents  49:17

GRI at this time / ten years ago / it was about  / &ah Latin-American terrorism  49:21

GRI and / classified &m [/] words as people / organizations and so forth  49:28

GRI identified the predicate-argument structures // in the document  49:33

GRI and then count how often / particular structures appear in relevant and irrelevant documents  49:38

GRI and using this / metric shown here / ranked the various constructions  49:45

GRI so the basic intuition is the following  49:48

GRI if we take a bunch of documents // which are about terrorism // and we take another bunch of documents / which are just about everything else / about / cocaine / and politics / and economy // and we ask / which words appear / much more often in the terrorism documents / than appeared in other documents  50:09

GRI and rank words by their relative frequency // then when we'll come to the top of the list // the top of this all divided by illegal / ranking // would be words which are specifically about terrorism  50:21

GRI and most likely we would have / collections of related words  50:26

GRI so I'll show an example in one second / how this works / for the hiring and firing case  50:33

GRI &ah this is an small extension of Riloff's work which was done by / Roman Yangarber at NYU  50:40

GRI basically converting this into bootstrapping method  50:44

GRI to start with a small seed like we did / &ah at Sheffield  50:49

GRI we filled some documents  50:52

GRI picked additional structures with high Riloff metric // and then repeat  50:58

GRI so how do this [/] how does this work ?  51:02

GRI we start with somebody retires // and get all the documents which talk about somebody retiring  51:10

GRI Fred retired / Maki retired / Harry retired  51:14

GRI just collect these retiring documents // and ask / what other / predicates occurred a lot in these documents  51:22

GRI but if you think about articles about retiring // it's very often gonna mention somebody else got hired for the job  51:30

GRI so we look at all the articles // and we look at / what patterns occur / repeatedly / in these retiring articles  51:39

GRI and we see / Harry was named president  51:41

GRI Yuki was named president // and so forth  51:43

GRI so we're just gonna rank the articles  51:45

GRI sorry  51:47

GRI rank the / constructs / by how often they appear in the articles // and pull out the most / commonly occurring constructs  51:55

GRI so we'll pull out / person was named president  52:00

GRI we'd stick it onto the set // and we repeat this process  52:06

GRI each time through the iteration // we'll now retrieve documents which have one of these two patterns // and we'll pick up the third pattern  52:17

GRI we did a very similar experiment what I described for Sheffield // we run another similar seed  52:24

GRI  and / we got patterns like this  52:30

GRI again no human intervention  52:32

GRI this is just based upon co-occurrence statistics  52:35

GRI we can find / person succeeded person // person become president // person named president // person joined company // person &le [/] left post  52:47

GRI in terms of performance / statistical performance in terms of finding relevant documents // we got from / eleven percent recall with ninety three percent precision // using just the seed set // to having something like eighty one percent precision / eighty eight percent recall // just doing this / automatic / looping  53:13

GRI so this is comparable to the WordNet-based expansion // and provides a different set of patterns  53:19

GRI so in principle  we could combine this / to get better performance  53:23

GRI shown graphically we can see / each step here going from left to right for instance / one iteration // going through / picking up one more / pattern from the set / using the Riloff metric  53:39

GRI and we can see the recall the blue line / rises rather rapidly and then // ups! // and then sort of flattens out  53:50

GRI and the position / slowly declining // going from ninety percent / to something like eighty percent / over the course of these iterations  54:00

GRI so there are a lot of numbers on this graph / but the main thing I wanted to / point out / is we compare the performance / doing this / automatic procedure just described // against the manual procedure // in which / people basically sat there for thirty days / and read newspapers / and try to find all the patterns they could  54:24

GRI  and the automatic procedure is quite comparable in performance // to thirty days of manual work  54:31

GRI so they discovered patterns / got about / sixty percent performance  54:38

GRI the manually / collected patterns / I think gonna take some of the audience up next I am not sure  54:45

GRI &ah / the manually collected patterns got fifty six / to sixty four percent performance  54:51

GRI so we can see this automatic procedure can do about as well at finding new patterns / as / people sitting there for / bunch of weeks / collecting data  55:06

GRI so / one word now about / self combining  the methods I've just described // the topic based method can find / set of paraphrases quite well / like name appoint select  55:22

GRI but / because they're just looking for articles / rather than patterns which appear in the same articles / they will also find  / topically related phrases which are not paraphrases  55:32

GRI so they'll find  / group together / appoint and resign / or shoot and die  55:37

GRI so the trick now is to combine the two approaches / we used before  55:44

GRI we talked about / finding patterns because they are in the same articles  55:49

GRI we talked about finding patterns / because they appear with the same arguments  55:53

GRI now we can couple this // this sort of topical discovery / and paraphrase discovery  55:59

GRI first find / all of the topical patterns  56:01

GRI so we'd find retire / and hire / and name and so forth  56:07

GRI and then just put / this set of patterns / into the paraphrase discovery  56:12

GRI and we get a / considerably better result than we could with either method alone  56:16

GRI so we did experiments like this  56:20

GRI we weren't able to find paraphrase / for all of the patterns which were topically relevant  56:26

GRI but we did much better than we could / just finding / using the paraphrase discovery by itself  56:32

GRI so in terms of / &ah numerical performance / we got up to precision by using these two methods together // ninety four percent precision / in terms of / automatically finding patterns which meant the same thing  56:46

GRI it's a rather remarkable level  56:49

GRI &ah / considering / level automation / the coverage is not so good // it's only about forty seven percent of the patterns / which were topically relevant were gathered together  57:01

GRI &ah / it turned out / unfortunately perhaps / we did more experiments and we found out that xxx this works very well / for executive succession // because people normally get hired only one job at a time // we tried it for the / for an arrest domain // where we had / crime stories about people being / in a burglary and  theft and being arrested  57:24

GRI and it turns out that / paraphrase doesn't work so well // because people get arrested / typically or / quite often / for several crimes at once  57:35

GRI so Fred was arrested for burglary  57:37

GRI and Fred was arrested for / xxx  57:40

GRI so we combine / all these crimes into one / paraphrase unit  57:44

GRI so there's certainly still a lot of work to do  57:47

GRI whenever a student say / oh! this problem got solved !  57:49

GRI xxx  57:50

GRI this language problem aren't gonna be solved  for another / twenty years / thirty years / up till I retire  57:56

GRI ok  57:59

GRI &ah / Antonio asked me to mention briefly two other topics / before I close here  58:04

GRI so I'll say a few words about this  58:05

GRI one is about cross-language information extraction  58:08

GRI we've done a / number of work  58:11

GRI &ah / at least in the United States / everybody wants to xxx in English // ok?  58:17

GRI nobody wants to read / Spanish / or Chinese / or Arabic / or xxx  58:23

GRI everybody wants English // ok?  58:25

GRI so / we typically get this problem and want to do a database / in language L1 // for language / from texts in L2  58:34

GRI so we get / Spanish / and Chinese / and Arabic texts  58:38

GRI but people say / I want my database in English  58:41

GRI so / how do we do this ?  58:45

GRI well / there're basically two ways of doing this  58:48

GRI we can take the / upper path / or the lower path  58:51

GRI shall explain this two / for a moment  58:54

GRI the upper path / ok / we start with a bunch of articles in / &ah -> Chinese / let's say // ok?  59:05

GRI let's imagine / green is Chinese and red is English  59:08

GRI ok?  59:09

GRI so we start with a bunch of articles in Chinese // and we run / extraction / in Chinese  59:16

GRI we run the Chinese information extraction component  59:19

GRI and we'd get a database in Chinese  59:21

GRI ok?  59:23

GRI then we take / the Chinese database // and we take each terminate // and we run it through some / machine translation system // and get a database in English  59:34

GRI ok?  59:37

GRI other approach / is we take the Chinese texts // and we translate them with a machine translation system / into English  59:48

GRI and then we run an English extraction system  59:51

GRI ok?  59:54

GRI so two ways of addressing the same basic problem  59:59

GRI ok  59:59

GRI which one is gonna work better ?  1:00:01

GRI we take a vote / to vote  as a class // like half people raise their hands  1:00:06

GRI but / this is maybe formal lecture / we are not supposed to have people / picking votes here  1:00:11

GRI so study might guess / ok so we did experiments / we had / Japanese graduate students so we did experiments in Japanese // same management succession task  1:00:25

GRI ok?  1:00:25

GRI people being hired or fired  1:00:27

GRI well / we did extraction on / machine translation output  1:00:32

GRI so we took the lower path  1:00:36

GRI we went down with MT / and then across / xxx the red lines  1:00:40

GRI we got forty one percent accuracy  1:00:43

GRI hhh {%act: }  1:00:45

GRI in &cont [/] I ´d come back to the / other items but let's look at the last line  1:00:50

GRI in contrast / if we did / extraction in Japanese // and then just translating the database entrance // we did much better with a sixty percent // which is not much worse than we did / using straight English // there we might get / sixty five or seventy percent performance  1:01:09

GRI so why is it much worse / even you might think  1:01:13

GRI they sell systems to do Japanese translation // they're designed to translate texts // not database  1:01:19

GRI why they should do much worse / if we translate the texts / and then do extraction ?  1:01:24

GRI well / it turns out / the argument structure / is not very well translated by these translation systems  1:01:32

GRI I think for pairs which &ah / have more similar / sentence structure / like Spanish and English // and French and English // you won't have some much / this problem  1:01:43

GRI but / if you / try to do Japanese / or Chinese / or Arabic / or something with a substantially different sentence structure // you'll get the right subject // and you'll get the correct object // and you'll get the correct verb  1:01:54

GRI but they won't be in the correct order // xxx of English  1:01:58

GRI and people sort of compensate / when they read machine translation  1:02:01

GRI and they say / &ah / Fred Smith appointed / IBM president  1:02:08

GRI they all know that wasn't actually that Fred Smith who appointed at / it was &ah / the IBM  1:02:15

GRI but IBM appointed Fred Smith as president  1:02:17

GRI so it's a sort of fix things up  1:02:19

GRI but the extraction system would do that // in so it gets / considering worse performance  1:02:25

GRI so in general / if we can afford the labor to do source language extraction followed by translation // we'll do better  1:02:37

GRI we were able to do some improvement / one of the hard things / in the translation is finding the names  1:02:43

GRI so by doing some projection of structure / from the source language / to the target language // so that we can / identify the names properly in the English texts / by / first finding them in the / Japanese // we were able to do somewhat better  1:03:00

GRI we got a fifty two percent  1:03:02

GRI but still not close to what we can do by doing / genuine foreign language extraction  1:03:08

GRI finally / two slices about / what's going on in terms of multilingual extraction programs in the United States  1:03:17

GRI one is basically an evaluation program  1:03:20

GRI this is a / sneaky trick where they don't / pay people to do research // they just advertise this evaluation  1:03:27

GRI and hope people would do research on this evaluation task // and will all come  1:03:33

GRI so / the United States / we run this so-called ACE evaluation // for doing information extraction  1:03:39

GRI and / until now / until this / past year it was in Arabic / in Chinese / in English  1:03:47

GRI ok?  1:03:48

GRI you can sort of imagine why the United States is interested in Arabic and Chinese and / English  1:03:53

GRI &ah / and we may / wonder / but / &ah Spanish has been added as a fourth language for the evaluation this year  1:04:02

GRI so if anybody has a good information extraction system for Spanish // they are welcomed to come to Washington  1:04:09

GRI &ah this evaluation is being run the first day of February of two thousand seven // and participate  1:04:17

GRI completely open / for the / Chinese task / we have a number of Chinese universities coming // participate in the evaluation  1:04:25

GRI &ah the Spanish task for this year would basically be / finding entities  1:04:29

GRI which means if you have a good / chunker / and you have good coreference in Spanish // you can probably do good job at this task  1:04:37

GRI other multilingual IE program / &ah / is a research program / a large scale research program founded by DARPA // and begun just one year ago // called GALE // also trilingual // Arabic / and Chinese / and English  1:04:54

GRI  and involves / all the NLP processing you &cou [///] one  1:04:59

GRI  so it involves / automatic speech recognition / machine translation / information retrieval / information extraction / a little bit of summarization / almost everything  1:05:09

GRI so the / retrieval extraction task involves answering questions about a multilingual corpus  1:05:18

GRI so we give them / for example / &ah / Arabic Al Jazeera / &ah / TV broadcasts  1:05:26

GRI and we're supposed to do / automatic / ASR / followed by machine translation / followed by extraction  1:05:34

GRI  and xxx to answer questions like / describe reaction of country to event  1:05:39

GRI  or list attacks in location during time and related deaths  1:05:43

GRI so a sort of mix of summarization is supposed to do extraction to pull out the relevant sentences // and then build the summary around them  1:05:52

GRI and this is / as I was saying the design / United States' every sponsors only in English  1:05:59

GRI people don't want to talk anything but English in the United States  1:06:02

GRI so we are supposed to / pull out / this / Al Jazeera staff  1:06:07

GRI transcribe it // translate it // present English sentences which are relevant to answering / these particular questions  1:06:15

GRI and DARPA says by the time / five years have finished we are supposed to do better than people do at these tasks  1:06:21

GRI so / I don't know if  we believe that / but / they try to scare us every year with / very high performance targets  1:06:29

GRI so we are looking / in order to reach this level of performance / to combine the different IE paths  1:06:36

GRI we think that will have to go both ways / and combine the results at the end // in order to get the best extraction performance  1:06:43

GRI ok  1:06:47

GRI so to conclude / the current IE technology which I've / been describing / provides a powerful tool / for targeted searches in large text collections  1:06:57

GRI so I think it provides a / real capability beyond what we ever want to get / from information retrieval  1:07:03

GRI but it has limitations of performance // and recent basic research on NLP methods / some of which I've described here / offer substantial opportunities for improving extraction performance / and portability  1:07:17

GRI and the main points I've talked about a global optimization for improving / performance levels / treebanks / for things like predicate-argument level // &pre [/] predicate-argument  / discovery // and corpus-based discovery methods / for greater coverage  1:07:33

GRI thank you  1:07:34