The package operates on a relational database of bibliometric data from Microsoft Academic Graph and dissertation data from ProQuest. To illustrate the usage, the package ships with a small replicate of the database with the necessary tables. Let’s load the database:

db_file <- db_example("AcademicGraph.sqlite")
conn <- connect_to_db(db_file)
#> The database connection is: 
#> src:  sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#> tbls: author_coauthor, author_output, AuthorAffiliation, current_links,
#>   current_links_advisors, FieldsOfStudy, FirstNamesGender, pq_advisors,
#>   pq_authors, pq_fields_mag, pq_unis

The tbls output above shows the all tables in the database.

Introducing the database

Overview

The database contains some of the tables originally in MAG and custom tables created by us. The tables taken from MAG have the same name as in MAG; the schema to their database is here.

Naming conventions

The names of the tables in the database and the columns in the tables follow two different conventions. Tables from MAG are named with CamelCase, as in the original database; custom tables created by us are named with snake_case (with a few exceptions not relevant to the front-end). Columns in MAG tables are named with CamelCase; columns in custom tables can be named either with CamelCase or snake_case.

Tables from ProQuest data are indicated with a “pq_” prefix; they and their columns are named with snake_case.

Database structure

The following describes each table; columns irrelevant to the package are omitted.

Tables with links between records in MAG and ProQuest. They are created outside of the package.

current_links has links between the authors dissertations and authors in MAG. current_links_advisors has links between the advisors of dissertations and authors in MAG.

  • AuthorId: The author identifier in MAG.
  • link_score: The probability \(\in [0, 1]\) that the linked entities are truly the same. Calculated from the linking algorithm not part of the package.
  • iteration_id: The iteration identifer of the linking algorithm; not relevant for this package.

The identifier in ProQuest differs:

  • for current_links, it is goid: The dissertation identifier in ProQuest.
  • for current_links_advisors, it is relationship_id: the combination of dissertation identifier and position of the reported advisor.

Table FieldsOfStudy

The fields of study as classified by MAG. The relevant columns are

  • FieldOfStudyId: field identifier.
  • NormalizedName: the name of the field.

Table FirstNamesGender

Maps first names to gender; data from genderize.io. Columns:

  • FirstName: string of first name
  • ProbabilityFemale: probability that a person with FirstName is female.

Table pq_authors

Table with authors of PhD dissertations from ProQuest. The relevant columns are:

  • goid
  • fullname, firstname
  • degree_year: year of PhD degree.
  • degree_level: level of the degree; usually PhD or similar.
  • university_id: identifier of the university.
  • thesistitle: title of the thesis, normalized to match the paper titles in MAG.

Table pq_fields_mag

The fields of the dissertation and their corresponding field (level 0) in MAG.

  • goid
  • position: position, starting at 0, as reported in the xml files of the dissertation. The ordering has, to our knowledge, no meaning; but some dissertations have only one field.
  • fieldname: the field name as reported in ProQuest
  • mag_field0: the corresponding field0 in MAG. We make this correspondence based on similarity of field names and on some tables of fields from the OECD.

Table pq_unis

Universities in ProQuest

  • university_id: university identifier
  • normalizedname: name of the university
  • location: location of the university. Not standardized, but location of U.S. universities have all a “United States” string in them.

Table pq_advisors

The dissertation advisors.

  • goid
  • position: advisor’s position, starting at 0, as reported in the xml files of the dissertation. The ordering has, to our knowledge, no meaning; but some dissertations have only one field. A position \(> 0\) indicates the dissertation reports more than one advisor.
  • lastname, firstname: reported names of the advisor
  • relationship_id: unique identifier of the student-advisor relationship. This is just the combined goid and the position.

Note that advisors do not have their own unique identifier.

Getting information on the database

We can query information from the database schema with some custom functions.

We can use the following query to see the specific contents of each table:

dplyr::glimpse(sqlite_master_to_df(conn = conn))
#> Rows: 30
#> Columns: 4
#> $ type     <chr> "table", "index", "index", "table", "index", "index", "table"…
#> $ name     <chr> "current_links", "idx_t_AuthorIdgoid", "idx_t_goid", "current…
#> $ tbl_name <chr> "current_links", "current_links", "current_links", "current_l…
#> $ sql      <chr> "CREATE TABLE `current_links` (\n  `AuthorId` INTEGER,\n  `go…

Using the package

Loading data from proquest: get_proquest

The function loads the dissertations written in the United States (available in ProQuest) between start_year and end_year. The underlying table used for the output depends on from and is thus either pq_authors when querying “graduates” or pq_advisors when querying “advisors”. The final output of the function also depends on from, and in general the returned columns always refer to the respective units.

Let’s see an example:

graduates <- get_proquest(conn, from = "graduates", lazy = FALSE, limit = 3)
head(graduates)
#> # A tibble: 3 × 5
#>        goid degree_year university_id fieldname0_mag gender
#>       <int>       <int>         <int> <chr>          <chr> 
#> 1  89188005        2000           107 engineering    Male  
#> 2  89237817        1985            49 psychology     Male  
#> 3 193768856        1993           505 psychology     Male

The output consists of the following:

  • The field of study, using define_field: The MAG field 0, mapped from the reported first field of the dissertation.
  • The gender, using define_gender: The imputed gender given the first name of the person.

Alternatively, we can query advisors:

advisors <- get_proquest(conn, from = "advisors") 
head(advisors)
#> # Source:   SQL [6 x 4]
#> # Database: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#>        goid position relationship_id gender
#>       <int>    <int> <chr>           <chr> 
#> 1  89188005        0 89188005_0      NA    
#> 2 193768856        0 193768856_0     Male  
#> 3 193769023        0 193769023_0     Male  
#> 4 193953537        0 193953537_0     Male  
#> 5 193973039        0 193973039_0     Male  
#> 6 194032064        0 194032064_0     Male

Important here is that the gender refers to the advisor. If we wanted to study student-advisor matches, we could join the output from “graduates” to the output from “advisors”, using the column goid for the join. Note that a student can have multiple advisors, as indicated by the position column.

This returns the links between records in MAG and in ProQuest. With from = "graduates", we get the links from current_links. With from = "advisors", we get the links from current_links_advisors.

links <- get_links(conn, from = "graduates", lazy = TRUE)
head(links)
#> # Source:   SQL [6 x 3]
#> # Database: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#>     AuthorId      goid link_score
#>        <int>     <int>      <dbl>
#> 1  218478716 194032064      0.996
#> 2  685122076 194043915      0.952
#> 3 2022033572 194031402      1.00 
#> 4 2100285539 193768856      0.979
#> 5 2130621523  89237817      0.979
#> 6 2132399126 193769023      1.00
links <- get_links(conn, from = "advisors", lazy = TRUE)
#> Note: At the moment, using a link score below 0.95 for advisors can result in suspiciously many false positives. Carefully inspect the linked records you use.
head(links)
#> # Source:   SQL [6 x 3]
#> # Database: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#>     AuthorId relationship_id link_score
#>        <int> <chr>                <dbl>
#> 1 1584137904 1002730962_0         0.995
#> 2 1584137904 1015034204_1         0.995
#> 3 1584137904 2190212570_0         0.999
#> 4 1584137904 304122610_0          0.999
#> 5 1584137904 304225816_2          0.999
#> 6 1584137904 304320549_0          0.995

As stated in the message, we suspect the links between advisors need a higher min_score constraint to omit many false positives.

Adding information to records: augment_tbl

The function has two purposes:

  1. Join output or affiliation information to author units

  2. Join information on affiliations of co-authors of author units

Join output or affiliation information to author units

The tables used to augment tbl contain multiple rows per AuthorId, as the following example illustrates:

graduates_affiliation <- get_links(
    conn = conn,
    from = "graduates") %>%
  augment_tbl(conn = conn,
              with = "affiliation") 

head(graduates_affiliation)
#> # Source:   SQL [6 x 5]
#> # Database: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#>     AuthorId      goid link_score AffiliationId  Year
#>        <int>     <int>      <dbl>         <int> <int>
#> 1  218478716 194032064      0.996      22759111  1996
#> 2  218478716 194032064      0.996      22759111  1997
#> 3  685122076 194043915      0.952      32971472  2015
#> 4  685122076 194043915      0.952      32971472  2016
#> 5 2022033572 194031402      1.00        1925986  2020
#> 6 2022033572 194031402      1.00       16109614  2005

For each AuthorId, we have one row for each year in which there is at least one paper with a reported affiliation of the author. The same when using with = "output". The function checks if there is already a Year column in the tbl, and if so, it automatically joins on AuthorId and Year and therefore avoids duplicates.

Join information on affiliations of co-authors of author units.

TBD.

Processing after querying

When should you load the data into memory with collect()? I suggest to do as many operations as possible on the database with dbplyr. That said, when data from ProQuest and MAG are combined, the resulting tables are usually not very large and should fit into memory.

DBI::dbDisconnect(conn)