README file for the General Index

- There are no rights reserved on this public domain data.
- This is an alpha release dated October 4, 2021.
- The General Index was created by Public.Resource.Org, Inc., a 501(c)(3) nonprofit.
- The URL for the General Index is https://archive.org/details/GeneralIndex
- The data files total 4.7 tbytes and expand to 37.9 tbytes when unzipped.
- The corpus of 107,233,728 articles has been split into 16 slices, numbered from 0 to f.
- The files in this distribution were created using the Postgres pg_dump command.
- The collection is not complete and text extraction was not always successful.
- The metadata and sample files are here (https://archive.org/download/GeneralIndex/data).
- The ngrams and keywords files are each on their own item.
- Ngrams are on identifiers with the naming scheme GeneralIndex.ngrams.n where n=0..f
- Keywords are on identifiers with the scheme GeneralIndex.keywords.n where n=0..f
- So, ngram slice 0 is at https://archive.org/download/GeneralIndex.ngrams.0

You can see all the items here:
https://archive.org/search.php?query=%22general%20index%22%20AND%20collection%3Amulticasting

1. The ngrams table

- The _ngrams table is the core of the General Index.
- spaCy is used to extract ngrams, from unigrams to 5-grams, into the doc_ngrams_n tables.
- There are 355,279,820,087 rows in total.
- Each row records how many times an n-gram occurs in an article.
- The files unzip to 2.1-2.3 tbytes each, for a total of 36 tbytes.
- There are 3 sample files generated using head and fgrep.

2. The keywords table

- The _keywords table holds the meaningful terms extracted from each document.
- YAKE is used to extract document keywords.
- There are 19,740,906,314 rows.
- The files unzip to 95-102 gbytes each, for a total of 1.6 tbytes.
- Sample files are available.

3. The metadata table

- The _info table attempts to map an md5 unique identifier to metadata.
- In some cases, we are unable to extract appropriate metadata.
- In some cases, the data may be wrong.
- The files unzip to 70 gbytes total.
- A sample file is available.
- *NEW* An updated combined metadata file that unzips to 70 gbytes is available.
- The slice metadata files have also been updated with enhanced metadata.

An easy way to begin is to work with a single slice, for example by loading
the keywords and metadata for that slice.

While we provide Postgres load files, feel free to parse these into other formats.

We hope to add other information, such as tf-idf, in the future.

==========
The Tables
==========

doc_ngrams_n – 16 slices: 0-f
   dkey [text]: document key (md5 hash of document)
   ngram [text]: proper case version of ngrams (unigrams, bigrams, trigrams, 4grams, 5grams)
   ngram_lc [text]: lower case version of ngrams – best for search
   ngram_tokens [int]: number of tokens (words) in the ngram (e.g., unigrams: 1, bigrams: 2)
   term_freq [numeric]: number of occurrences of the ngram in the document
   doc_count [int]: always 1 (used for other analytic purposes)
   insert_date [date]: date record inserted into table, initial load has a null insert_date

doc_keywords_n – 16 slices: 0-f
   dkey [text]: document key (md5 hash of document)
   keywords [text]: proper case version of keywords captured by YAKE process, from 1 to 5grams
   keywords_lc [text]: lower case version of keywords
   keywords_tokens [int]: number of tokens (words) in the keywords phrase (e.g., unigrams: 1, bigrams: 2)
   keyword_score [numeric]: YAKE score of how meaningful the keyword is in the document; the smaller the value, the more meaningful
   doc_count [int]: always 1 (used for other analytic purposes)
   insert_date [date]: date record inserted into table, initial load has a null insert_date

doc_info_n – 16 slices: 0-f
   dkey [text]: document key (md5 hash of document)
   meta_doi [text]: DOI for doc from doc_meta source
   doc_doi [text]: DOI for doc from original text
   doi [text]: DOI for doc from doc_meta if available, else from original text
   doc_pub_date [date]: publish date for document from original text
   meta_pub_date [date]: publish date for document from doc_meta
   pub_date [date]: publish date for document from doc_meta if available, else from original text
   doc_authors [text]: list of authors from original text
   meta_authors [text]: list of authors from doc_meta
   authors [text]: list of authors from doc_meta if available, else from original text
   doc_title [text]: document title from original text
   meta_title [text]: document title from doc_meta
   title [text]: document title from doc_meta if available, else from original text

/sign/ Carl Malamud (carl@media.org)
:seal:

Last revised: Mon Oct 22 12:17:08 PDT 2021
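
The slice naming scheme described in this README (GeneralIndex.ngrams.n and GeneralIndex.keywords.n, n=0..f) can be sketched in Python as follows. The helper name item_url and the SLICES constant are hypothetical and used only for illustration. The README gives a download URL only for ngram slice 0, so the keywords URLs here are an assumption that the same pattern applies.

```python
# A minimal sketch: only the identifier scheme comes from the README.
# item_url and SLICES are hypothetical names used for illustration.
SLICES = "0123456789abcdef"  # the 16 slice labels, 0 through f


def item_url(kind: str, slice_label: str) -> str:
    """Return the assumed download URL for one ngrams or keywords slice."""
    if kind not in ("ngrams", "keywords"):
        raise ValueError(f"unknown kind: {kind!r}")
    if len(slice_label) != 1 or slice_label not in SLICES:
        raise ValueError(f"unknown slice: {slice_label!r}")
    return f"https://archive.org/download/GeneralIndex.{kind}.{slice_label}"


# One URL per slice for each of the two item families.
ngram_urls = [item_url("ngrams", s) for s in SLICES]
keyword_urls = [item_url("keywords", s) for s in SLICES]
```

Building the list up front makes it easy to fetch and load a single slice first, as the README suggests, before committing to all 16.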
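
The doc_info_n columns doi, pub_date, authors, and title are each described as taking the doc_meta value if available, else the value from the original text. A minimal Python sketch of that rule is below. It assumes "available" means the doc_meta value is not null; the function name prefer_meta and the sample row are made up for illustration and are not part of the distribution.

```python
from typing import Optional


def prefer_meta(meta_value: Optional[str], doc_value: Optional[str]) -> Optional[str]:
    """Return the doc_meta value if present, else the value from the original text.

    Assumption: "available" means "not None" (a SQL NULL); empty strings are
    treated as present.
    """
    return meta_value if meta_value is not None else doc_value


# Made-up sample row: no DOI from doc_meta, so the text-derived DOI is used.
row = {"meta_doi": None, "doc_doi": "10.1000/example"}
doi = prefer_meta(row["meta_doi"], row["doc_doi"])
```

Keeping the doc_ and meta_ columns alongside the merged column lets a reader check which source a given value came from, which is useful since the README notes that metadata may be missing or wrong.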
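
Because a smaller keyword_score means a more meaningful keyword, ranking the keywords of a document means sorting in ascending order. The sketch below illustrates this on a few made-up rows that carry a subset of the doc_keywords_n columns. The KeywordRow type, the top_keywords function, and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class KeywordRow(NamedTuple):
    """A subset of the doc_keywords_n columns, for illustration only."""
    dkey: str
    keywords_lc: str
    keyword_score: float


def top_keywords(rows: List[KeywordRow], dkey: str, limit: int = 3) -> List[KeywordRow]:
    """Return up to `limit` keywords for one document, lowest score first."""
    matching = [r for r in rows if r.dkey == dkey]
    # Ascending sort: the smallest YAKE score is the most meaningful keyword.
    return sorted(matching, key=lambda r: r.keyword_score)[:limit]


# Made-up sample data; real dkeys are md5 hashes.
rows = [
    KeywordRow("d1", "results", 0.31),
    KeywordRow("d1", "protein folding", 0.02),
    KeywordRow("d2", "graph theory", 0.05),
    KeywordRow("d1", "molecular dynamics", 0.07),
]
best = top_keywords(rows, "d1", limit=2)
```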