vignette.Rmd
The package operates on a relational database of bibliometric data from Microsoft Academic Graph and dissertation data from ProQuest. To illustrate the usage, the package ships with a small replicate of the database with the necessary tables. Let’s load the database:
db_file <- db_example("AcademicGraph.sqlite")
conn <- connect_to_db(db_file)
#> The database connection is:
#> src: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#> tbls: author_coauthor, author_output, AuthorAffiliation, current_links,
#> current_links_advisors, FieldsOfStudy, FirstNamesGender, pq_advisors,
#> pq_authors, pq_fields_mag, pq_unis
The tbls
output above shows the all tables in the
database.
The database contains some of the tables originally in MAG and custom tables created by us. The tables taken from MAG have the same name as in MAG; the schema to their database is here.
Naming conventions
The names of the tables in the database and the columns in the tables follow two different conventions. Tables from MAG are named with CamelCase, as in the original database; custom tables created by us are named with snake_case (with a few exceptions not relevant to the front-end). Columns in MAG tables are named with CamelCase; columns in custom tables can be named either with CamelCase or snake_case.
Tables from ProQuest data are indicated with a “pq_” prefix; they and their columns are named with snake_case.
The following describes each table; columns irrelevant to the package are omitted.
current_links
and
current_links_advisors
Tables with links between records in MAG and ProQuest. They are created outside of the package.
current_links
has links between the authors
dissertations and authors in MAG. current_links_advisors
has links between the advisors of dissertations and authors in MAG.
The identifier in ProQuest differs:
current_links
, it is goid: The dissertation
identifier in ProQuest.current_links_advisors
, it is relationship_id: the
combination of dissertation identifier and position of the reported
advisor.FieldsOfStudy
The fields of study as classified by MAG. The relevant columns are
FirstNamesGender
Maps first names to gender; data from genderize.io. Columns:
FirstName
is female.pq_authors
Table with authors of PhD dissertations from ProQuest. The relevant columns are:
pq_fields_mag
The fields of the dissertation and their corresponding field (level 0) in MAG.
pq_unis
Universities in ProQuest
pq_advisors
The dissertation advisors.
goid
and the
position
.Note that advisors do not have their own unique identifier.
We can query information from the database schema with some custom functions.
We can use the following query to see the specific contents of each table:
dplyr::glimpse(sqlite_master_to_df(conn = conn))
#> Rows: 30
#> Columns: 4
#> $ type <chr> "table", "index", "index", "table", "index", "index", "table"…
#> $ name <chr> "current_links", "idx_t_AuthorIdgoid", "idx_t_goid", "current…
#> $ tbl_name <chr> "current_links", "current_links", "current_links", "current_l…
#> $ sql <chr> "CREATE TABLE `current_links` (\n `AuthorId` INTEGER,\n `go…
get_proquest
The function loads the dissertations written in the United States
(available in ProQuest) between start_year
and
end_year
. The underlying table used for the output depends
on from
and is thus either pq_authors
when
querying “graduates” or pq_advisors
when querying
“advisors”. The final output of the function also depends on
from
, and in general the returned columns always refer to
the respective units.
Let’s see an example:
graduates <- get_proquest(conn, from = "graduates", lazy = FALSE, limit = 3)
head(graduates)
#> # A tibble: 3 × 5
#> goid degree_year university_id fieldname0_mag gender
#> <int> <int> <int> <chr> <chr>
#> 1 89188005 2000 107 engineering Male
#> 2 89237817 1985 49 psychology Male
#> 3 193768856 1993 505 psychology Male
The output consists of the following:
define_field
: The MAG field
0, mapped from the reported first field of the dissertation.define_gender
: The imputed gender
given the first name of the person.Alternatively, we can query advisors:
advisors <- get_proquest(conn, from = "advisors")
head(advisors)
#> # Source: SQL [6 x 4]
#> # Database: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#> goid position relationship_id gender
#> <int> <int> <chr> <chr>
#> 1 89188005 0 89188005_0 NA
#> 2 193768856 0 193768856_0 Male
#> 3 193769023 0 193769023_0 Male
#> 4 193953537 0 193953537_0 Male
#> 5 193973039 0 193973039_0 Male
#> 6 194032064 0 194032064_0 Male
Important here is that the gender refers to the advisor. If we wanted
to study student-advisor matches, we could join the output from
“graduates” to the output from “advisors”, using the column
goid
for the join. Note that a student can have multiple
advisors, as indicated by the position
column.
get_links
This returns the links between records in MAG and in ProQuest. With
from = "graduates"
, we get the links from
current_links
. With from = "advisors"
, we get
the links from current_links_advisors
.
links <- get_links(conn, from = "graduates", lazy = TRUE)
head(links)
#> # Source: SQL [6 x 3]
#> # Database: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#> AuthorId goid link_score
#> <int> <int> <dbl>
#> 1 218478716 194032064 0.996
#> 2 685122076 194043915 0.952
#> 3 2022033572 194031402 1.00
#> 4 2100285539 193768856 0.979
#> 5 2130621523 89237817 0.979
#> 6 2132399126 193769023 1.00
links <- get_links(conn, from = "advisors", lazy = TRUE)
#> Note: At the moment, using a link score below 0.95 for advisors can result in suspiciously many false positives. Carefully inspect the linked records you use.
head(links)
#> # Source: SQL [6 x 3]
#> # Database: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#> AuthorId relationship_id link_score
#> <int> <chr> <dbl>
#> 1 1584137904 1002730962_0 0.995
#> 2 1584137904 1015034204_1 0.995
#> 3 1584137904 2190212570_0 0.999
#> 4 1584137904 304122610_0 0.999
#> 5 1584137904 304225816_2 0.999
#> 6 1584137904 304320549_0 0.995
As stated in the message, we suspect the links between advisors need
a higher min_score
constraint to omit many false
positives.
augment_tbl
The function has two purposes:
Join output or affiliation information to author units
Join information on affiliations of co-authors of author units
The tables used to augment tbl
contain multiple rows per
AuthorId, as the following example illustrates:
graduates_affiliation <- get_links(
conn = conn,
from = "graduates") %>%
augment_tbl(conn = conn,
with = "affiliation")
head(graduates_affiliation)
#> # Source: SQL [6 x 5]
#> # Database: sqlite 3.38.5 [/tmp/RtmptPxKJr/temp_libpath72cc939d74f39/magutils/extdata/AcademicGraph.sqlite]
#> AuthorId goid link_score AffiliationId Year
#> <int> <int> <dbl> <int> <int>
#> 1 218478716 194032064 0.996 22759111 1996
#> 2 218478716 194032064 0.996 22759111 1997
#> 3 685122076 194043915 0.952 32971472 2015
#> 4 685122076 194043915 0.952 32971472 2016
#> 5 2022033572 194031402 1.00 1925986 2020
#> 6 2022033572 194031402 1.00 16109614 2005
For each AuthorId
, we have one row for each year in
which there is at least one paper with a reported affiliation of the
author. The same when using with = "output"
. The function
checks if there is already a Year
column in the
tbl
, and if so, it automatically joins on
AuthorId
and Year
and therefore avoids
duplicates.
When should you load the data into memory with
collect()
? I suggest to do as many operations as possible
on the database with dbplyr
. That said, when data from
ProQuest and MAG are combined, the resulting tables are usually not very
large and should fit into memory.
DBI::dbDisconnect(conn)