Method for producing and using a recursive index of search engines

Abstract

The present invention pertains to extraction of text from an index of a search engine starting at an arbitrary position in the text, and to analysis of texts for co-occurrence of words, and to the use of said extraction and analysis for inferring implicit (causal, associative, etc.) relationships among objects in sequences thereof. The present invention is particularly useful in extending the utility of search engines to index and retrieve information represented by a sequence of objects other than textual information objects. The present invention extends the basic hit with at least two wordIDs, one for the “previous word”, whose position in the document is position−1, and one for the “next word”, whose position in the document is position+1.

Claims

1 . A method of indexing, searching and extracting digital data from an index; wherein said digital data are a plurality of machine-readable data sets; wherein said data sets are numbered sequences of objects; wherein said objects are represented by unique machine-readable object values; wherein indexing a data set records a hit into an index for each occurrence of a unique object in the unique data set being indexed; wherein said hits contain at least the ordinal number of the occurrence of said unique object in the numbered sequence of objects in said unique data set; wherein said ordinal number is referred to as <position> hereafter, wherein said hits contain at least the value of the previous object, wherein the ordinal number of the occurrence of the previous object in the numbered sequence of objects in said unique data set is one unit less than said <position> and/or said hits contain at least the value of the next object, wherein the ordinal number of the occurrence of the next object in the numbered sequence of objects in said unique data set is one unit more than said <position>; wherein searching for occurrences of a unique object in indexed data sets comprises gathering the value of the search object, retrieving hits with the occurrences of said object in data sets; wherein said hits include at least the value <position> of the ordinal number of the occurrence of the object in the numbered sequence of objects in the corresponding data set, and the value Previous Object and/or Next Object; wherein said hits are the result of the search, which is distinguished by storing in the hit data and making retrievable or by making computable the values <previous position>=<position−1> and/or <next position>=<position+1> for the ordinal number of occurrence of the Previous Object and/or Next Object, correspondingly, in the numbered sequence of objects in said data set; wherein said index is also searched for the hit in said data set for the object with the value identical with the value of said Next Object and the value <position> of the ordinal number of the occurrence of the object in the numbered sequence of objects in said data set is equal to <position+1> and/or said index is also searched for the hit in said data set for the object with the value identical with the value of said Previous Object and the value <position> of the ordinal number of the occurrence of the object in the numbered sequence of objects in said data set is equal to <position−1>; wherein, at least, the value <position> of the ordinal number of the occurrence of the object in the numbered sequence of objects in said data set, and the values Previous Object and Next Object are retrieved from the found hit, and the retrieved value <position> and the values Previous Object and Next Object are used as a step in Restoration of Past and/or as a step in Restoration of Future; 2 . 
The method of claim 1 , further comprising the following steps of Restoration of Future: Variable N is assigned value 1, variable Start is assigned the value <position> from the hit of the search object and the following loop is performed: The value Next Object and the value <next position> of the ordinal number of the occurrence of the object in the numbered sequence of objects in said data set are retrieved and used as the N-th result of Restoration of Future of the numbered sequence of objects in said data set; The index is searched for the hit in said data set for the object with the value identical with the value Next Object for the N-th result of Restoration of Future and the value <position> of the ordinal number of the occurrence of the object in the numbered sequence of objects in said data set is equal to <Start+N>; The value Next Object is retrieved from said hit in said data set, and the value <next position>=<Start+N+1> of the ordinal number of the occurrence of said Next Object in the numbered sequence of objects in said data set is computed or retrieved; The variable N is assigned the value (N+1) and the loop is performed until a certain condition for the loop termination is satisfied; The values Next Object and the value <position> of the ordinal number of the occurrence of the Next Object in the numbered sequence of objects in said data set are retrieved from each N-th result of Restoration of Future, the retrieved Next Objects are arranged in the order of succession of their numbers <position>, and the resultant numbered sequence of objects is used as the result Future of the operation Restoration of Future. 3 . The method of claim 1 , further comprising the following steps of Restoration of Past: Variable N is assigned value 1, variable Start is assigned the value <position> from the hit of the search object and the following loop is performed: The value Previous Object and the value <previous position> of the ordinal number of the occurrence of the object in the numbered sequence of objects in said data set are retrieved and used as the N-th result of Restoration of Past of the numbered sequence of objects in said data set; The index is searched for the hit in said data set for the object with the value identical with the value Previous Object for the N-th result of Restoration of Past and the value <position> of the ordinal number of the occurrence of the object in the numbered sequence of objects in said data set is equal to <Start−N>; The value Previous Object is retrieved from said hit in said data set, and the value <previous position>=<Start−N−1> of the ordinal number of the occurrence of said Previous Object in the numbered sequence of objects in said data set is computed or retrieved; The variable N is assigned the value (N+1) and the loop is performed until a certain condition for the loop termination is satisfied; The values Previous Object and the value <position> of the ordinal number of the occurrence of the Previous Object in the numbered sequence of objects in said data set are retrieved from each N-th result of Restoration of Past, the retrieved Previous Objects are arranged in the order of succession of their numbers <position>, and the resultant numbered sequence of objects is used as the result Past of the operation Restoration of Past. 4 . 
The method of claim 2 , further comprising the following steps: The operation Search is performed in one or multiple machine-readable data sets, which are numbered sequences of objects, the result of which is a bounded set of hits that represent the occurrences of the search object in said data sets; A certain condition for the loop termination in the operation Restoration of Future is set up and the operation Restoration of Future is performed for each hit in said bounded set of hits, applying said condition for the loop termination, and a bounded set of Future Objects is obtained; A known method is used for object co-occurrence analysis in the bounded set of Future Objects. 5 . The method of claim 3 , further comprising the following steps: The operation Search is performed in one or multiple machine-readable data sets, which are numbered sequences of objects, the result of which is a bounded set of hits that represent the occurrences of the search object in said data sets; A certain condition for the loop termination in the operation Restoration of Past is set up and the operation Restoration of Past is performed for each hit in said bounded set of hits, applying said condition for the loop termination, and a bounded set of Past Objects is obtained; A known method is used for object co-occurrence analysis in the bounded set of Past Objects. 6 . The method of claim 4 , wherein a known method is used for object co-occurrence analysis in the bounded set of Future Objects. 7 . The method of claim 5 , wherein a known method is used for object co-occurrence analysis in the bounded set of Past Objects. 8 . The method of claim 4 , wherein a known method is used for object co-occurrence analysis in the bounded sets of Past Objects and Future Objects simultaneously. 9 . The method of claim 6 , wherein additionally a Sample Data Set (SDS) is given or an address of SDS is given, and at least one SDS Key Object is given, and SDS or Key Object is associated with a Jump Address and SDS is then indexed, and, when searching, a co-occurrence analysis is conducted for each search object with objects in at least one indexed data set, and, when using the search results, search objects are compared with SDS Key Objects, and, upon equality of a search object with an SDS Key Object, a co-occurrence analysis is conducted for the SDS Key Object with other objects in SDS, wherein a Congruence Gauge is established for the results of the co-occurrence analysis of the search object hits and SDS Key Object hits and the results of the co-occurrence analysis of search objects are compared with the results of the co-occurrence analysis of SDS Key Objects and the congruence of the results of the co-occurrence analysis of search objects is estimated, and, if said estimate is within the established Congruence Gauge, the search object hits and/or the Key Object hits and/or SDS and/or the Jump Address are used as the result of the search. 10 . The method of claim 1 , wherein at least a TimeStamp or a TimeOffset are stored in the hit, and, when searching and restoring, the TimeStamp and/or TimeOffset are retrieved and the chronology of the search objects is taken into account. 11 . The method of claim 1 , wherein at least a Location or a LocationOffset are stored in the hit, and, when searching and restoring, the Location and/or LocationOffset are retrieved and the spatial location of the search objects is taken into account. 12 . 
The method of claim 1 , wherein at least a MetaID pointer is stored in the hit, and, when searching and restoring, the MetaID pointer is retrieved and used to access the metadata of search objects. 13 . The method of claim 1 , wherein the value of objects is a TimeStamp or a TimeOffset, and the numbered sequence of objects is the numbered sequence of TimeStamp or TimeOffset values. 14 . The method of claim 1 , wherein the value of objects is a Location or a LocationOffset, and the numbered sequence of objects is the numbered sequence of Location or LocationOffset values. 15 . The method of claim 1 , wherein the machine-readable data sets are files or streams of data containing sequences of objects for recognition, and the following steps are performed before data sets are indexed: The objects for recognition are recognized by a known recognition method. Each recognized object is assigned a number. Using the numbers assigned to the recognized objects, the set of the recognized objects is arranged as a numbered sequence of objects. The numbered sequence of objects is used to index, search for and retrieve digital information. 16 . The method of claim 15 , wherein said files or streams of data are video files or audio files or video data streams, or audio data streams, the objects for recognition are words of speech and/or sounds and/or collections of sounds, and/or faces, and/or objects, and/or collections of objects, and/or sets of points, and/or symbols, and/or letters, and/or digits, and/or words of text, and the known recognition method is speech recognition, and/or sound recognition, and/or face recognition, and/or object recognition, and/or optical character recognition, and/or text recognition or another known recognition method. 17 . The method of claim 16 , wherein the machine-readable data sets are in one of the following audio formats: CD, MP3, WMA, AAC, AIFF, M4A or in other known audio formats, or in one of the following video formats: MPEG4, AVI, MOV, RAM, SWF, WMV, DVD, Blu-Ray or in other known video formats. 18 . The method of claim 17 , wherein indexation, search and information retrieval from the index is performed simultaneously with recording or playback of the machine-readable data sets. 19 . The method of claim 17 , wherein the index is stored in an ID3 tag file, or a tag file of another known format. 20 . The method of claim 17 , wherein the index is stored on an information medium, or in a portable memory device, or in the memory of a device for recording and/or playback of machine-readable data sets, or in the memory of a communication device. 21 . The method of claim 20 , wherein the index is stored with the machine-readable data sets. 22 . The method of claim 20 , wherein the information medium is a CD disk, or a DVD disk, or a Blu-Ray disk, or a disk of another known audio and/or video format, or another known removable information medium. 23 . The method of claim 15 , wherein the objects for recognition are geodesic and/or relative location coordinates, and the value of objects is the value Location and/or LocationOffset. 24 . 
The method of claim 15 , wherein a known method is used to compute the geodesic coordinates of Location and/or LocationOffset, and the method further comprises the following steps: At least the value Location and/or LocationOffset is additionally included in the hit, and in the search and restoration operations the values Location and/or LocationOffset are retrieved; The values of said coordinates are computed via a known recognition method; The objects for recognition that are related to said coordinates are determined; The computed values of said coordinates are assigned to the field Location in the hits for the determined objects for recognition. 25 . The method of claim 15 , wherein a known method is used to compute or obtain the value TimeStamp and/or TimeOffset, and the method further comprises the following steps: At least the value TimeStamp and/or TimeOffset is additionally included in the hit, and in the search and restoration operations the values TimeStamp and/or TimeOffset are retrieved; The values of TimeStamp and/or TimeOffset are computed or obtained via a known method; The objects for recognition that are related to TimeStamp and/or TimeOffset are determined; The values of TimeStamp and/or TimeOffset are assigned to the field TimeStamp and/or TimeOffset in the hits for the determined objects for recognition. 26 . The method of claim 1 , wherein the index is stored on a CD disk, or a DVD disk, or a Blu-Ray disk or another removable information medium, or is stored in the memory of a device. 27 . An apparatus capable of indexing, searching for and retrieving digital information from an index using the method of claim 1 . 28 . The method of claim 1 , wherein the data set is a numbered sequence of textual objects in an electronic book. 29 . The method of claim 1 , wherein the data set is a file or stream of data in an audio book, and the numbered sequence of recognized objects are objects for recognition of speech, sounds, or collections of sounds. 30 . The method of claim 1 , wherein the data set is a numbered sequence of geodesic coordinates of an itinerary or a map, wherein the sequences of geodesic coordinates are indexed, and the index is stored on an information medium, or on a portable memory device, or in the memory of an apparatus for visualization, co-occurrence analysis, and other use of said geodesic coordinates. 31 . The method of claim 1 , wherein the data set is a numbered sequence of addresses of an itinerary or a map, wherein the sequences of addresses are indexed, and the index is stored on an information medium, or on a portable memory device, or in the memory of an apparatus for visualization, co-occurrence analysis, and other use of said addresses. 32 . The method of claim 1 , wherein the data set is a numbered sequence of pages of a web site, and the index is stored in such a way that search and information retrieval from the index is possible with and without accessing the web site. 33 . The method of claim 32 , wherein the index is stored as a semantic web structure. 34 . The method of claim 1 , wherein the data sets are data sets of a local area network or of the Internet, and the index is stored in the memory of the computers of a search engine in said network, or on a user's device, or on a device where the indexed data set is stored. 35 . The method of claim 9 , wherein the SDS is a web site or a web page, and the Jump Address is an Internet address.
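
By way of illustration only, the indexing and restoration operations of claims 1-3 can be sketched in Python as follows. The record layout and the names ExtendedHit, build_index, restore_future and restore_past are assumptions introduced for this sketch, not terms of the claims, and an in-memory dictionary stands in for the inverted index of a real search engine.

from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

@dataclass(frozen=True)
class ExtendedHit:
    data_set_id: int           # identifier of the indexed data set
    position: int              # ordinal number of the object in the numbered sequence
    value: str                 # value of the object at <position>
    prev_value: Optional[str]  # value of the object at <position-1>, if any
    next_value: Optional[str]  # value of the object at <position+1>, if any

Index = Dict[str, List[ExtendedHit]]   # object value -> hits for that value

def build_index(data_set_id: int, objects: Sequence[str]) -> Index:
    """Record one extended hit per occurrence of each object (the indexing step of claim 1)."""
    index: Index = {}
    for pos, value in enumerate(objects):
        hit = ExtendedHit(
            data_set_id=data_set_id,
            position=pos,
            value=value,
            prev_value=objects[pos - 1] if pos > 0 else None,
            next_value=objects[pos + 1] if pos + 1 < len(objects) else None,
        )
        index.setdefault(value, []).append(hit)
    return index

def _hit_at(index: Index, data_set_id: int, value: str, position: int) -> Optional[ExtendedHit]:
    """Search the index for the hit of `value` at a given <position> (the search step of claim 1)."""
    return next((h for h in index.get(value, [])
                 if h.data_set_id == data_set_id and h.position == position), None)

def restore_future(index: Index, start: ExtendedHit, max_n: int) -> List[str]:
    """Restoration of Future (claim 2): follow next_value and <position+1> until a stop condition."""
    result, hit, n = [], start, 0
    while hit is not None and hit.next_value is not None and n < max_n:
        result.append(hit.next_value)
        hit = _hit_at(index, start.data_set_id, hit.next_value, hit.position + 1)
        n += 1
    return result

def restore_past(index: Index, start: ExtendedHit, max_n: int) -> List[str]:
    """Restoration of Past (claim 3): follow prev_value and <position-1> until a stop condition."""
    result, hit, n = [], start, 0
    while hit is not None and hit.prev_value is not None and n < max_n:
        result.append(hit.prev_value)
        hit = _hit_at(index, start.data_set_id, hit.prev_value, hit.position - 1)
        n += 1
    return list(reversed(result))      # arrange in order of succession of <position>

if __name__ == "__main__":
    idx = build_index(1, "the tiger is the largest cat species".split())
    start = idx["largest"][0]
    print(restore_past(idx, start, 3), restore_future(idx, start, 3))
    # ['tiger', 'is', 'the'] ['cat', 'species']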
[0001] The present invention pertains generally to the field of computer science and digital data processing. [0002] More specifically, the present invention pertains to a certain method of creation and use of a recursive index for search engines. [0003] The present invention also pertains to extraction of text from an index of a search engine starting at an arbitrary position in the text, and to analysis of texts for co-occurrence of words, and to the use of said extraction and analysis for inferring implicit (causal, associative, etc) relationships among objects in sequences thereof. [0004] The present invention is particularly useful in extending the utility of search engines to index and retrieve information represented by a sequence of objects different from text information objects. [0005] Particular embodiments of the present invention are described as examples of a search engine that creates and uses information encoded in hit lists. A BRIEF DESCRIPTION OF DRAWINGS [0006] The following examples are meant to illustrate, but in no way to limit, the present invention. [0007] FIG. 1 Hit list of a search engine as described in “The Anatomy of a Large-Scale Hypertextual Web Search Engine” (1), Sergey Brin and Lawrence Page. [0008] FIG. 2 . Basic hit. [0009] FIG. 3 . Extension of the basic hit with the fields “previous word” and “next word”. The positions of the previous word and the next word are specified. [0010] FIG. 4 . Extended hit, where the positions of the previous word and the next word are not specified. They are computed using the position in the basic hit. [0011] FIG. 5 . A text sample for indexing. [0012] FIG. 6 . A hit list example for a recursive index created from the text in FIG. 5 . [0013] FIG. 7 . Extended hit based on the structure of the basic hit. [0014] FIG. 8 . Forward and inverted indices (barrels) from (1). [0015] FIG. 9 . An extended hit can describe an object of any nature as disclosed in the present invention. [0016] FIG. 10 . Additional MetaID field; the field can can be used as a pointer to metadata, e.g., ID3v2 tags. [0017] FIG. 11 . An example of the co-occurrence table. [0018] FIG. 12 . An example of extraction from a recursive index of sentences having the word “tiger”. [0019] FIG. 13 . An extended basic hit table, with additional record type “phrase”. [0020] FIG. 14 . The initial form of a snippet, the word “ ” (Russian for “convenient”) is selected. [0021] FIG. 15 . The form of the snippet after the word “ ” (Russian for “convenient”) has been “pulled out”. This is the effect of “zooming into” or extraction from an index of text centered on the previously selected word XXX. [0022] FIG. 16 . An extended hit with a timestamp and a location field. [0023] FIG. 17 . An extended hit with a timestamp. [0024] FIG. 18 . A temporal recursive index. [0025] FIG. 19 . A temporal recursive index with location. [0026] FIG. 20 . An extended hit of a spatial recursive index. [0027] FIG. 21 . An extended hit of a spatial recursive index with a timestamp. BACKGROUND OF THE INVENTION [0028] Those skilled in the art know of search engines and indices utilized therein, which are used to find information in unstructured textual data situated in a file system, in a local area network or the Internet, as well as textual data containing markup and embedded objects such as graphics, music, video, etc. [0029] The prototype of a search engine's index is the index in a paper book, which has been known from ancient times. 
Such an index is usually situated at the beginning or the end of a paper book and contains a list of the keywords mentioned in the book. For each keyword, the numbers of pages where the word is mentioned are given. The Internet search engines using prior art have indices organized analogously. An index contains a list of known words of a language (lexicon), where each is assigned a unique identifier (wordID), a list of indexed documents, where each is assigned a unique identifier (docID), and each occurrence of the word identified by wordID in the document identified by docID is represented by a hit. A hit is a record (c.f. FIG. 1 ) that contains the position of the sought word (wordID) in a document (docID). [0030] U.S. Pat. No. 5,265,244 G06F 17/30 (20060101); G06F 012/00 teaches of methods of creating an index, but does not disclose the content of the statistical data contained therein. U.S. Pat. No. 6,490,579 (G06F 17/30 (20060101)I G06F 017/30) discloses of a search engine and a method utilizing contextual information in a search query. U.S. Pat. No. 7,925,641 (G06F 17/30 (20060101); G06F 7/00 (20060101)) discloses of indexing a web page and storing in a search engine's index of the URI attribute and the content of the page at the time of indexing. [0031] Sergey Brin and Lawrence Page, in “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, hereafter referenced as (1), teach of encoding statistical information in the hit lists of the index of a search engine. For greater relevance of the results of an Internet search, the search engine stores in its indices the attributes related to HTML markup and hyperlinks. FIG. 1 illustrates that hits of the search engine contain the position of a word in the numbered list of words in the text of a document. [0032] The encoding of hits is explained in “4.2.5. Hit Lists” of (1) as follows: [0033] There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font. [0034] Persons skilled in the art will understand from (1) that a hit of the search engine encodes information on only one particular word (wordID) situated at a particular position in a particular document (docID). Such a hit will be referred to as “basic” hereafter. 
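
To make the quoted encoding concrete, the following Python sketch packs and unpacks a 16-bit plain hit with the layout described in (1): one capitalization bit, three bits of relative font size, and twelve bits of word position. The function names and the clamping of large positions to 4095 are simplifying assumptions of this sketch.

def pack_plain_hit(cap: bool, font_size: int, position: int) -> int:
    """Pack a plain hit into 16 bits: [cap:1][font size:3][position:12], per the layout in (1)."""
    if not 0 <= font_size <= 6:
        # 7 (binary 111) is reserved in (1) as the flag signalling a fancy hit
        raise ValueError("plain hits use relative font sizes 0..6")
    # (1) labels all positions above 4095 as 4096; 12 bits cannot hold 4096,
    # so this sketch simply clamps such positions to 4095.
    position = min(position, 0xFFF)
    return (int(cap) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit: int):
    """Inverse of pack_plain_hit: returns (cap, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

assert unpack_plain_hit(pack_plain_hit(True, 3, 42)) == (True, 3, 42)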
[0035] More generally, as was explained above, the occurrence in a text of a word is represented in an index by a basic hit, which depends on a set of variables (cap, type, size, docID, wordID, position), thus a basic hit may be regarded as a function of said variables: [0036] Basic Hit=function(cap, type, size, docID, wordID, position) [0037] Closest to the present invention is the method of indexing disclosed in Russian Federation Patent RU 2273879, which relates to a method of synthesis of a self-learning system for extraction of knowledge from textual document for search engines, which indexes textual information, wherein the index contains data on the previous word and the next word with regard to the subject of the search. This allows for indexing, searching and extracting digital information from an index. SUMMARY OF THE INVENTION [0038] The present invention advances the state of the art by introducing the “extended hit”, wherein, in addition to the information of the basic hit, at least the following information is contained (c.f. FIG. 3 ): wordID of the “previous word”, the position of which in the document is the position of the current word minus 1 (one); wordID of the “next word”, the position of which in the document is the position of the current word plus 1 (one). [0039] While FIG. 3 demonstrates a particular embodiment of the extended hit where the positions of the previous word and the next word are specified, these positions may also be computed, thus another embodiment of the extended hit (c.f. FIG. 4 ) may not contain the positions of the previous word and the next word, which helps reduce the storage required for the hit. [0040] The docID field for all the three words is identical, because all the words are sequential words of a document identified by docID. [0041] Symbolically, the occurrence of a word in a document can be described as a function: [0042] Extended Hit=function(cap, type, size, docID, position, previous wordID, next wordID) or [0043] Extended Hit=function(Basic hit, previous wordID, next wordID) [0044] The conversion from the basic hit used in prior art to the extended hit, as disclosed in the present invention, will facilitate restoring the original text from an index, and increase the efficiency of the co-occurrence analysis in comparison with the efficiency of the analysis that uses only the basic hits. [0045] FIG. 6 demonstrates the content of the extended hits in a recursive index created for the text shown in FIG. 5 . DETAILED DESCRIPTION OF THE INVENTION [0046] Recursive Extraction of Text [0047] In a particular embodiment, the extended hit is based on the basic hit, and the index is created for textual information (c.f. FIG. 7 ). [0048] Said embodiment is used for recursive extraction of the original text, as disclosed herein, which was not possible in prior art. [0049] For brevity, let <wordID−1>denote “previous wordID”, and <wordID+1>denote “next wordID”, while keeping in mind that the position of the previous word in the text is <position−1> and the position of the next word in the text is <position+1>. Further, let [0050] Start Hit=Extended Hit(wordID, docID, position, <wordID−1>, <wordID+1>) [0051] which specifies the starting hit for text extraction. [0052] To extract (restore) the original text from the index, perform the following steps: [0053] Step 1. Get <wordID+1> from Start Hit. [0054] Step 2. In the lexicon list (c.f. FIG. 
8 ), locate the entry for <wordID+1> and, in the inverted index, locate Next Hit: [0055] Next Hit=Extended Hit(<wordID+1>, docID, position+1, wordID, <wordID+2>) [0056] Step 3. Get <wordID+1> from Next Hit. [0057] Step 4. Repeat steps 2-3, obtaining <wordID+2>, <wordID+3>, . . . , <wordID+N> until some particular N or some particular “stop” wordID. [0058] The result of the steps above is the list of words <wordID>, <wordID+1>, <wordID+2>, . . . , <wordID+N>. Those skilled in the art will appreciate that the algorithm described above can also be reversed to obtain <wordID−N>, . . . , <wordID−1>. [0059] Thus, said embodiment can restore the text of an indexed document entirely, or within certain limits. The limits can be specified either as the number of words to extract, or as a stop word, or any other stop condition. An example of the stop word is the period (full stop), which separates sentences, in which case the extraction will continue till the end or the beginning of a sentence. Another example of a stop word is any word whose first letter is capitalized (which is signified by the cap field in the basic hit), or a word with another attribute. In like manner, multiple sentences can be restored from an index. The stop condition can be any condition that can be formulated for a search engine. [0060] The presence of the “capital letter” (cap) and other text formatting attributes in the hits makes it possible to restore text to a certain degree of faithfulness to the original. Extending the number of attributes to the number of attributes used in text editors will make it possible to restore text very closely or identically to the original. [0061] Another aspect of the present invention is related to its ability to index not only text, but also arrays of information of arbitrary nature, which are sequentially ordered by their position. For non-textual objects, the sequence of restored objects may, for example, be <previous object N> . . . <previous object 2> <previous object 1> <object> <next object 1> <next object 2> . . . <next object N>, or, with a stop object: <stop object> . . . <previous object 2> <previous object 1> <object> <next object 1> <next object 2> . . . <stop object> [0062] The presence in the extended hits of the previous object and the next object makes it possible for the index to refer to itself to extract the i-th previous object or the i-th next object, whose positions in the ordered sequence are position−i and position+i, respectively. Because of said self-reference, the index disclosed in the present invention is called the “recursive index”. [0063] The ability to restore the text of the original document, fully or partially, by extracting it from the recursive index makes possible a number of advantageous embodiments. A number of such embodiments are described herein. [0064] Indexing Non-Textual Objects [0065] Because contemporary information frequently includes, in addition to texts, among others, geodesic or geographic information, media information, and an ever growing multitude of other data types, a particular embodiment of the present invention uses identifiers for objects of arbitrary nature instead of wordID. Libraries can contain not only textual documents, but also video materials, pictures, audio files, geolocation data, and any other files. The extended hit in this embodiment will include the library type libraryID, the source identifier sourceID (instead of docID), and the object identifier objectID (instead of wordID), and so on. FIG. 
9 demonstrates an encoding of the extended hit for objects of arbitrary nature (including text), where the basic hit encodes the current object, the nextID field identifies the object following the current object in the ordered sequence of objects, and the prevID field identifies the object preceding the current object in the ordered sequence of objects. [0066] Metadata in the Recursive Index [0067] Many current media file formats support rich metadata. Examples are ID3 tags in MP3 files, metadata in MPEG4, Karaoke prompt data, TimedText for subtitles, etc. [0068] A particular embodiment of the present invention has a metaID field in the extended hit structure (c.f. FIG. 10 ). The metadata referenced by the metaID field extend the basic hit, which contains the position of the object. Because metadata for MP3, MPEG4, Karaoke, TimedText, etc., may be different, the metaID field contains a pointer to the metadata storage. [0069] The extended hit of a search engine can be used advantageously in many other embodiments, for example, to store a ranked list of search queries as metadata in the temporal recursive index for rating of search queries; to store in the temporal recursive index a recipe, where the metadata will be the descriptions of ingredients, their quantities, temperature, the description of cooking utensils, and other particulars of preparation; to store in the index the sequence of steps in a chemical synthesis; to store the sequence of cartographic points of a road or an itinerary; etc. [0070] Co-Occurrence Analysis [0071] An analysis of co-occurrence of words in a language can establish which collocations are stable, and which are not. Some of the stable collocations can be notions, which cannot be described by a single word. An example of such a notion is “Neuro-Linguistic Programming”. Obviously, the occurrence of words in the sequence “neuro linguistic programming” is far more frequent than the occurrence in “programming linguistic neuro”. In the example, the collocation “neuro linguistic programming” is the notion “NLP”. Determining the difference in likelihoods of a given group of words to form a particular stable collocation (“anisotropy of co-occurrence”) makes it possible to hypothesize about implicit relationships among words and notions, in particular, causal and associative relationships. For example, a co-occurrence table (c.f. FIG. 11 ) has (artificial) likelihoods of co-occurrence of the words present in “neuro linguistic programming”. [0072] Suppose (c.f. FIG. 11 ) the collocations “neuro linguistic” and “linguistic programming” have the identical likelihood 0.7. Yet the two-dimensional table cannot be used to determine the likelihood of the co-occurrence of “neuro linguistic programming”, which would require a three-dimensional table. [0073] More generally, an analysis of two-word co-occurrence among N words requires a two-dimensional N×N table, where each cell is the likelihood of co-occurrence of two words. An analysis of three-word co-occurrence requires an N×N×N table, and so on, with the difficulty of the problem increasing exponentially. [0074] Because the basic hits used in prior art contain the position of only one word, a search for collocations or a frequency analysis of many-word co-occurrence requires an N^G table, where G is the number of words in a collocation. This explains why the search engines using prior art cannot find phrases with more than three words efficiently. 
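
The recursive extraction of paragraphs [0052]-[0059], on which the radius-limited extraction discussed below relies, can be sketched in Python roughly as follows. The index layout (a mapping keyed by wordID, docID and position) and the function names are assumptions made only for this sketch.

from typing import Callable, Dict, List, Optional, Tuple

# (wordID, docID, position) -> (prev_wordID, next_wordID); None marks a document boundary
RecursiveIndex = Dict[Tuple[int, int, int], Tuple[Optional[int], Optional[int]]]

def extract_forward(index: RecursiveIndex, word_id: int, doc_id: int, position: int,
                    stop: Callable[[int], bool], max_words: int = 1000) -> List[int]:
    """Steps 1-4 of [0052]-[0057]: repeatedly take <wordID+1> from the current hit and look
    up its hit at <position+1>, until a stop wordID or a word limit is reached."""
    words: List[int] = []
    for _ in range(max_words):
        _prev_id, next_id = index.get((word_id, doc_id, position), (None, None))
        if next_id is None or stop(next_id):
            break
        words.append(next_id)
        word_id, position = next_id, position + 1
    return words

def extract_backward(index: RecursiveIndex, word_id: int, doc_id: int, position: int,
                     stop: Callable[[int], bool], max_words: int = 1000) -> List[int]:
    """The reversed walk of [0058], yielding <wordID-N>, . . . , <wordID-1>."""
    words: List[int] = []
    for _ in range(max_words):
        prev_id, _next_id = index.get((word_id, doc_id, position), (None, None))
        if prev_id is None or stop(prev_id):
            break
        words.append(prev_id)
        word_id, position = prev_id, position - 1
    return list(reversed(words))

A sentence-bounded extraction as in [0059] passes a stop predicate that is true for the wordID of the full stop (or for any word whose cap attribute is set), so the walk halts at sentence boundaries.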
[0075] Those skilled in the art will thus appreciate that the recursive index can be used advantageously to reduce significantly the complexity of a G-word co-occurrence analysis for a language of N words, by replacing a multidimensional analysis of an N^G region with an analysis of a reduced region of radius R, as disclosed herein in more detail. [0076] Use of the Recursive Index for Co-Occurrence Analysis [0077] To analyze co-occurrence, for example, of the word “linguistic”, a particular embodiment of the present invention extracts (using the method disclosed herein) from a recursive index all the sequences of words containing “linguistic”, with the condition that the sequences have no more than R words before or after “linguistic”. Thus, if the word “linguistic” is mentioned in a number of sentences in one or multiple documents, the result of the extraction can be visualized as a “sphere of words”, of the radius R words, centered on the word “linguistic”. [0078] In another embodiment, the extraction is limited by the beginning and the end of a sentence. An example of a sphere limited by the beginning and the end of a sentence, centered on “Tiger”, is given in FIG. 12 . When the beginning and the end of a sentence are used as stop conditions for extraction, the radius of the sphere is not R words, but R sentences. [0079] Said embodiments then analyze the co-occurrence of the word “linguistic” with the other words in the volume of the extracted textual sequences. All the collocations “neuro linguistic programming” will be found if R=1 word or R=1 sentence. Those skilled in the art will appreciate that an analysis of this collocation in said radius is feasible even for a desktop computer. It will then be understood that the recursive index and the embodiments using the extended hits solve the problem of search and co-occurrence analysis substantially more efficiently than the search engines using prior art. [0080] FIG. 12 demonstrates sequences containing the word “Tiger” extracted from a recursive index with R=1 sentence. All the sentences in this particular example are part of one document, but are related to different hits for the word “Tiger” in the document. A recursive index can just as easily extract sentences from many available documents. The sequences repeatedly contain the words “large”, “big” and “cat”, from which a possible relationship of the word “Tiger” with “big”, “cat”, and “large” can be inferred. The words “big” and “cat” are present together, twice, which may be indicative of a stable collocation “big cat”. [0081] The stability of an arbitrary collocation word1 word2 can be examined by extracting from a recursive index all the word sequences centered on word1. To analyze the stability of the collocation in the inverse direction (word2 word1), all the word sequences centered on word2 can be extracted. Then, if the frequency of word1 word2 is substantially greater than that of word2 word1, this means that word1 word2 is stable, while word2 word1 is ad hoc. [0082] For example, extracting the sequences centered on “big” and on “cat”, and analyzing the co-occurrence of “big” and “cat” can determine whether the collocation “big cat” is stable or ad hoc. The data in FIG. 12 indicate that “big cat” is probably stable, while “cat big” may be non-existent. [0083] Co-occurrence of words (or other objects) can be more general than merely collocational. 
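
The co-occurrence and direction analysis of paragraphs [0077]-[0082], together with the associative and causal heuristics developed in the following paragraphs, can be sketched in Python as follows. The counting scheme, the thresholds and the names are assumptions for illustration, not a prescribed implementation.

from collections import Counter
from typing import Iterable, List

def neighbour_counts(spheres: Iterable[List[str]], centre: str, radius: int = 1) -> Counter:
    """Count the words occurring within `radius` positions of `centre` in each
    sequence extracted from the recursive index ([0077]-[0079])."""
    counts: Counter = Counter()
    for words in spheres:
        for i, w in enumerate(words):
            if w != centre:
                continue
            for j in range(max(0, i - radius), min(len(words), i + radius + 1)):
                if j != i:
                    counts[words[j]] += 1
    return counts

def pair_counts(spheres: Iterable[List[str]]) -> Counter:
    """Count ordered adjacent pairs, the raw material for the direction analysis of [0081]."""
    pairs: Counter = Counter()
    for words in spheres:
        pairs.update(zip(words, words[1:]))
    return pairs

def is_stable(pairs: Counter, w1: str, w2: str, ratio: float = 2.0) -> bool:
    """'w1 w2' is treated as stable when it is markedly more frequent than 'w2 w1' ([0081]-[0082])."""
    return pairs[(w1, w2)] >= ratio * max(pairs[(w2, w1)], 1)

def is_associative_link(pairs: Counter, prev_w: str, cur_w: str, next_w: str,
                        min_count: int = 3) -> bool:
    """Heuristic of [0085]: prev and next each co-occur with current but not with each other,
    so current is a plausible associative link between them."""
    direct = pairs[(prev_w, next_w)] + pairs[(next_w, prev_w)]
    return (pairs[(prev_w, cur_w)] >= min_count
            and pairs[(cur_w, next_w)] >= min_count
            and direct < min_count)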
[0084] The words “fire” and “smoke” may not have a stable collocation, yet they have a causal relationship, due to which their co-occurrence in a text is more likely than the co-occurrence of words without a causal relationship. This likelihood will be magnified on a large corpus of texts. It can also be expected that the co-occurrence of words may have a preferred direction, e.g., the likelihood of “fire smoke” may be higher than “smoke fire”, which may be indicative of a causal relationship between words or notions. [0085] A co-occurrence analysis of collocations having three or more words may reveal stable collocations previous-word(i) next-word(j), which may further reveal a relationship between previous-word(i) and next-word(j). If previous-word(i) occurs with current-word, and current-word occurs with next-word(j), but previous-word(i) does not occur with next-word(j) directly, the relationship between previous-word(i) and next-word(j) is likely associative, because the relationship is bound with current-word, but is not directly linked to it, so current-word is likely an associative link between previous-word(i) and next-word(j). A difference in the frequencies of the co-occurrences of previous-word(i) next-word(j) and of next-word(j) previous-word(i) can be used to deduce a causal relationship in an associative sequence of words or other objects. [0086] Indexing of Discovered Notions [0087] Notions described by multiple words are difficult to search for in prior art, because the basic hit in prior art describes the position of a word, not of a notion. Because the individual words describing a notion have individual meanings different from the meaning of their stable collocation (the notion), basic hits may frequently result in irrelevant search results when the notion is searched for. Notions are frequently denoted by acronyms, thus becoming a new “word” in a language. [0088] Collocations that are a trademark, a slogan, or the name of a song or a film are also stable and are used as identifiers for products, films, songs and other subjects of intellectual property. Automatic discovery of such stable collocations is also difficult in prior art. [0089] As disclosed herein, the recursive index can be used efficiently to analyze words for co-occurrence and discover stable collocations representing a notion, a slogan or a name. As explained above, one particular embodiment of prior art uses imp=7 as a flag denoting a fancy hit. [0090] An advantageous embodiment of the present invention, as shown in FIG. 13 , extends the functionality of the basic hit by encoding stable collocations (denoted as “phrase” in the figure) with a flag signifying a hit for a stable collocation (the figure uses imp=0 as the flag; type=2 encodes the number of words in the collocation, and 8 bits of position are relevant for the first word in the collocation). [0091] Those skilled in the art will understand that FIG. 2 and FIG. 13 are only illustrative and their limitations shall not be interpreted as the limitations of the present invention. Particular embodiments of the present invention may use more fields, more storage allocated for each field, fields of varying lengths, etc., as is well known to the persons skilled in the art. [0092] Indexing of Discovered Implicit Relationships [0093] A particular embodiment of the present invention uses the metadata field metaID in an extended hit, which contains a pointer to a file of objects associatively or causally related to the hit in the context of a particular document. 
Each particular hit (out of many for a particular wordID found in multiple documents) has a unique metaID corresponding to the context of the word identified by wordID in the particular hit. [0094] Preferred Embodiments of the Recursive Index [0095] Textual Recursive Index [0096] In this embodiment, a search engine's user interface is enhanced via the use of a recursive index. The user can use a pointing device to “pull out” sequences of words from a snippet in a search result, wherein the sequences are extracted from a recursive index via the method disclosed herein. The text thus “pulled out” helps the user understand with a minimal effort how relevant the search result is for the user's search query. “Pulling out” and “zooming into” text in snippets is illustrated in FIG. 14 and FIG. 15 . To “zoom into” is here synonymous with to “pull out” and is achieved using the method of text extraction from a recursive index before and after the selected word, with the subsequent display of the text “zoomed into” in the user interface. [0097] Spatial and Temporal Information in a Recursive Index [0098] The basic hit is not sensitive to the position in space and time, because it only contains the position of a word in a text (c.f. FIG. 2 ). For video and audio documents (of a conversation, of a speech, of events, etc), spatial and temporal location may be essential. [0099] A particular embodiment of the present invention has the fields <TimeStamp> and/or <Location> in the extended hit, which are used to store the time and/or place where an event occurred. Those skilled in the art are well aware of multiple digital formats that can be used to store time and location data, and said formats can be advantageously used in said fields. FIG. 16 illustrates. [0100] An index comprised of such extended hits can be used, for example, to store objects recognized from a video file or stream captured by a video camera mounted, for example, on a moving vehicle. [0101] In another embodiment of the present invention, the extended hit has the field <TimeOffset>, which is used to store the duration of a period of time after some known event. [0102] Another embodiment of the present invention that uses <TimeStamp> or <TimeOffset> in the extended hit can, for example, convey the tempo of speech. In such an embodiment the values of wordID, nextID and prevID are spoken words, codes of images recognized from a video, or digital objects of other nature that are related chronologically. [0103] Indexing the Worldview with a Recursive Index [0104] It is known that the worldview of a human being is created from the information transferred into the brain via sensory channels (vision, audition, gustation, olfaction, tactition, thermoreception and others). This information can be digitized and captured in a chronological order. [0105] A particular embodiment of the present invention uses a recursive index to store the digital objects representing the information captured for one or multiple human sensory channels using the method of the present invention as disclosed herein. [0106] Another embodiment of the present invention uses a spatial and/or temporal recursive index to store the digital objects representing the information captured for one or multiple human sensory channels using the method of the present invention as disclosed herein. 
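
A sketch of the extended hit with the <TimeStamp>, <TimeOffset> and <Location> fields of paragraphs [0099]-[0102] (cf. FIG. 16, FIG. 17) follows. The field names, the epoch-seconds representation of time and the latitude/longitude pair are assumptions of this sketch, since any known time or location format can be stored.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class SpatioTemporalHit:
    object_id: int                      # recognized word, sound, face or image code
    doc_id: int                         # source file or stream
    position: int                       # ordinal number in the recognized sequence
    prev_id: Optional[int]              # object at position - 1
    next_id: Optional[int]              # object at position + 1
    timestamp: Optional[float] = None   # <TimeStamp>, e.g. seconds since an epoch
    time_offset: Optional[float] = None # <TimeOffset>, seconds after a known event
    location: Optional[Tuple[float, float]] = None  # <Location>, e.g. (latitude, longitude)

def speech_tempo(hits: List[SpatioTemporalHit]) -> Optional[float]:
    """Average number of objects per second over a chronologically ordered hit list,
    illustrating the 'tempo of speech' remark in [0102]."""
    stamps = [h.timestamp for h in hits if h.timestamp is not None]
    if len(stamps) < 2 or stamps[-1] == stamps[0]:
        return None
    return (len(stamps) - 1) / (stamps[-1] - stamps[0])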
[0107] These and other embodiments of the present invention can be advantageously used to search for sensory information, extract temporal sequences of sensory data objects, and analyze co-occurrence of said objects. The timestamps can be advantageously used to synchronize the reproduction of digitized sensory data from multiple recursive indices, and the location data can be used to correlate the sensory information with the geographic or astronomic location of the place of recording. [0108] Temporal Recursive Index [0109] A particular embodiment of the present invention has the field TimeStamp instead of wordID in the extended hit (c.f. FIG. 18 ). The metaID field stores a pointer to metadata related to the time described by the TimeStamp field, and the fields Prev TimeStamp and Next TimeStamp store pointers to the previous timestamp and the next timestamp, respectively. [0110] Another embodiment of the present invention adds the location field to the temporal recursive index (c.f. FIG. 19 ). [0111] Spatial Recursive Index [0112] A characteristic feature of the present invention is sequencing of indexed objects, which can be advantageously used, for example, to index recursively a sequence of geographical coordinates and/or street addresses that represent, for example, an itinerary or cartographic data for a road network. [0113] A particular embodiment of the present invention stores street addresses or geographical coordinates in a recursive index instead of the wordID field (c.f. FIG. 20 , FIG. 21 ). [0114] Hardware Implementation of the Recursive Index [0115] A particular embodiment of the present invention is a hardware module (a microcontroller, a system-on-chip, and the like, as will be understood by those skilled in the art) that can be embedded into any device or apparatus processing information. The hardware module implements the methods of the present invention disclosed herein and can be used to index images recognized from video streams, speech, and any other digital data, and to search in said data efficiently. [0116] A Distributed Recursive Index [0117] In a particular embodiment of the present invention, a recursive index is stored on each computer of a network (or the Internet) that has been indexed, with the plurality of these recursive indices forming a single distributed recursive index. [0118] The users of the network access the local recursive indices without using a search engine, using the hardware module disclosed herein. A particular embodiment of the present invention uses a distributed recursive index for the global computer network (the Internet). [0119] Enhancing Relevance of a Search Engine [0120] A particular embodiment of the present invention is a search engine that accepts, in addition to a search query, a sample text. The sample text is used to build a temporary recursive index, which is then used to analyze co-occurrence of the words in the search query with the words in the sample text, using the methods of the present invention disclosed herein. [0121] Then the search is performed in the main recursive index of the search engine, and each search result is analyzed for co-occurrence with the words in the search query, using the methods of the present invention disclosed herein. [0122] The results of the analysis of the search results in the main recursive index are compared with the results of the analysis of the sample text. 
The search results that are comparable in co-occurrence with the sample text are returned to the user as relevant, and the other results are discarded. [0123] Examples of Other Particular Embodiments of the Recursive Index [0124] A particular embodiment of the present invention indexes video information: films, reportages, video surveillance, photography, and so on. Video materials can be indexed with both the textual recursive index and the temporal recursive index. In a particular embodiment, a textual recursive index of subtitles, recognized speech or text contains video images of recognized faces or objects in metadata. Another advantageous embodiment stores video frames, recognized images and text in a temporal recursive index. Those skilled in the art will appreciate that further embodiments of the present invention can be advantageously employed to search in video materials for recognized video and photo images, for recognized text and text supplied with the video and photo materials, to search as presently disclosed in real time, and embed into or attach such advantageous embodiments to image capturing apparatuses, such as, for example, photo and video cameras, portable telephones and other portable apparatuses capable of image capture, as well as computers and other apparatuses capable of image playback. [0125] The Recursive Index in Media Files [0126] The textual prompt in a “Karaoke” system or subtitles in a video film are made using the Timed Text or similar means, which are specified in industrial or proprietary standards and are well known to the persons skilled in the art. [0127] A particular embodiment of the present invention uses face, object, speech and text recognition methods well known to the persons skilled in the art to convert video and audio data from a recorded or streamed video (including films) into unique digital codes of faces, objects and text, which are then stored in a recursive index with temporal information. The embodiment then uses the method disclosed herein to search chronologically for recognized faces, objects, speech and text. Said embodiment makes it possible to search for recognized objects and to extract events preceding the occurrence of a recognized object, which was difficult or impossible in prior art. Those skilled in the art will appreciate that this and similar embodiments of the present invention can greatly reduce the difficulty of searching for objects in video surveillance records. [0128] Another embodiment of the present invention is embedded into or attached to a telephone and uses the method disclosed herein and the methods of speech and face recognition for automatic stenography of audio or video conversations and for searching for recognized text and faces chronologically. [0129] The MP3 and MPEG4 media files use the ID3 tagging system and other similar systems, such as the ID3v2 system. The latter system can store up to 256 megabytes of arbitrary information in a media file. A particular embodiment of the present invention uses the ID3v2 metadata to store a recursive index of subtitles in a film in all languages, and a recursive index of songs, books, recognized images, speech and text. [0130] A particular embodiment of the present invention uses a recursive index to store the text of an electronic book. The extraction method disclosed herein is used to display the book sequentially. 
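
The comparison step of [0119]-[0122] can be sketched in Python as follows. The co-occurrence profiles are assumed to have been produced by the analysis described above (for example, neighbour counts around the query words, measured in the temporary recursive index of the sample text and in each result); the cosine similarity measure and the threshold are assumptions of this sketch.

from collections import Counter
from math import sqrt
from typing import Dict, List

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two co-occurrence profiles (word -> count)."""
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_by_sample(sample_profile: Counter,
                     result_profiles: Dict[str, Counter],
                     threshold: float = 0.3) -> List[str]:
    """Keep the results whose co-occurrence profile around the query words is comparable
    to the profile measured in the sample text; discard the rest ([0119]-[0122])."""
    return [doc for doc, profile in result_profiles.items()
            if cosine(profile, sample_profile) >= threshold]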
[0131] Another advantageous embodiment of the present invention stores a recursive index in a tagging system, such as ID3 or any other, in a media file such as, including but not limited to, CD, MP3, WMA, AAC, AIFF, M4A, DVD, Blu-Ray, HD, MP4, AVI, MOV, RAM, SWF, WMV and extracts, as disclosed herein, fully or partially, text and/or images chronologically, searches for text and/or images as disclosed herein, analyses text and/or images for co-occurrence as disclosed herein, analyses text and/or images for causal or associate relationships as disclosed herein. [0132] Texts in Electronic Books [0133] A plurality of electronic book formats are well known to persons skilled in the art. A particular embodiment of the present invention stores the text and formatting of the text of an electronic book in a recursive index, such that the reader can read the book sequentially, which is done via the text extraction as disclosed herein, and search and analyse the text using the methods disclosed herein. [0134] A Recursive Index of Cartographic Information [0135] A particular embodiment of the present invention stores geodesic or geographic coordinates and indexes itineraries or roadmaps, taking permitted directions of travel into account. [0136] A Recursive Index of News and RSS Feeds [0137] A particular embodiment of the present invention uses a temporal recursive index or a textual recursive index to store timestamps for news, RSS feeds and other sources known to persons skilled in the art. [0138] Business Models [0139] Paid creation, and/or emplacement, and/or use of a recursive index in electronic books. [0140] Paid creation, and/or emplacement, and/or use of a recursive index in audio books. [0141] Paid creation, and/or emplacement, and/or use of a recursive index in electronic or digital media, said media containing cartographic information, including geodesic information. [0142] Paid creation, and/or emplacement, and/or use of a recursive index in navigation devices or apparatuses. [0143] Paid creation, and/or emplacement, and/or use of a recursive index in communication devices or apparatuses. [0144] Paid creation, and/or emplacement, and/or use of a recursive index in computers, including portable computers. [0145] Paid creation, and/or emplacement, and/or use of a recursive index in telephones, smartphones, personal digital assistants, tablets, media players, and other portable digital devices. [0146] Paid creation, and/or emplacement, and/or retrieval of a recursive index from video media, wherein said media are CD, DVD, Blu-Ray and any other formats currently known and will be known in future, including, but not limited to, audio formats MP3, WMA, AAC, AIFF, M4A and video formats MP4, AVI, MOV, RAM, SWF, WMV. [0147] Paid creation, and/or emplacement, and/or retrieval of a recursive index or a hyperlink to a recursive index from a web site. [0148] Paid creation, and/or emplacement, and/or retrieval of a recursive index or a hyperlink to a recursive index from a semantic web. [0149] Paid creation, and/or emplacement, and/or retrieval of a recursive index from a computer in a local area network. [0150] Paid creation, and/or emplacement, and/or retrieval of a recursive index from a computer in a wide area network, including the Internet. [0151] Paid creation, and/or emplacement, and/or retrieval of a recursive index from a search engine, wherein said search engine communicates with other computers in an arbitrary network, wherein said network uses arbitrary means of communications. 
[0152] Paid creation, and/or emplacement, and/or display of advertisements, wherein said advertisements are related to data retrieved from a recursive index, wherein the relationship of said data with said advertisements is contextual or of a different nature. [0153] Paid design, and/or development, and/or distribution and/or use of a computer program that can create extended hits of a recursive index, wherein said hits contain information on the next object and the previous objects. [0154] Paid design, and/or development, and/or distribution and/or use of a computer program that can retrieve information from extended hits of a recursive index, wherein said information contains information on the next object and the previous objects. [0155] Paid design, and/or development, and/or distribution and/or use of a computer program that can create extended hits of a recursive index, wherein said hits contain information on the next object and the previous objects, wherein said objects are recognized by a computer program, wherein said computer program recognizes speech, text, faces and other objects such as graphical images, and others (chemical, geodesic, etc). [0156] Paid design, and/or development, and/or distribution and/or use of an apparatus that can create extended hits of a recursive index, wherein said hits contain information on the next object and the previous objects. [0157] Paid design, and/or development, and/or distribution and/or use of an apparatus that can retrieve information from extended hits of a recursive index, wherein said information contains information on the next object and the previous objects. [0158] Paid design, and/or development, and/or distribution and/or use of an apparatus that can use information from extended hits of a recursive index, wherein said information contains information on the next object and the previous objects, where said use includes co-occurrence analysis.
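
As a closing illustration of the spatial recursive index of paragraphs [0112]-[0113] and [0134]-[0135], an itinerary can be indexed as a numbered sequence of coordinates whose hits carry the previous and next points. The names, the latitude/longitude representation and the one_way flag used to model permitted direction of travel are assumptions of this sketch.

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Coordinate = Tuple[float, float]   # (latitude, longitude) or any other geodesic representation

@dataclass(frozen=True)
class SpatialHit:
    route_id: int
    position: int
    point: Coordinate
    prev_point: Optional[Coordinate]   # None at the start of the route or on one-way routes
    next_point: Optional[Coordinate]   # None at the end of the route

def index_itinerary(route_id: int, points: List[Coordinate],
                    one_way: bool = False) -> Dict[Coordinate, List[SpatialHit]]:
    """Index an ordered itinerary; with one_way=True the backward links are omitted,
    so restoration cannot run against the permitted direction of travel."""
    index: Dict[Coordinate, List[SpatialHit]] = {}
    for i, p in enumerate(points):
        hit = SpatialHit(
            route_id=route_id,
            position=i,
            point=p,
            prev_point=None if (one_way or i == 0) else points[i - 1],
            next_point=points[i + 1] if i + 1 < len(points) else None,
        )
        index.setdefault(p, []).append(hit)
    return index

def remaining_route(index: Dict[Coordinate, List[SpatialHit]],
                    route_id: int, current: Coordinate) -> List[Coordinate]:
    """Restore the 'future' of the itinerary from the current point onwards."""
    hits = [h for h in index.get(current, []) if h.route_id == route_id]
    out: List[Coordinate] = []
    h = hits[0] if hits else None
    while h is not None and h.next_point is not None:
        out.append(h.next_point)
        nxt = [x for x in index.get(h.next_point, [])
               if x.route_id == route_id and x.position == h.position + 1]
        h = nxt[0] if nxt else None
    return out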

