Glossary

betweenness – a measure of a node’s centrality. For every pair of connected nodes in a network, there is at least one path between them that uses the fewest possible edges. A node’s betweenness is determined by the number of these “shortest paths” that pass through it.
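
The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes an undirected, unweighted network stored as a dictionary of neighbors, and when a pair of nodes has several shortest paths it splits the credit among them (a common convention, though not the only one). The names `betweenness` and `path_graph` are hypothetical.

```python
from collections import deque


def betweenness(adjacency):
    """Score each node by the shortest paths between other nodes that pass through it."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from source, counting shortest paths to each node.
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, passing credit to predecessors.
        credit = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                credit[v] += path_count[v] / path_count[w] * (1 + credit[w])
            if w != source:
                scores[w] += credit[w]
    # In an undirected network every pair is visited from both ends, so halve.
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical four-node chain: A - B - C - D
path_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = betweenness(path_graph)
print(scores)
```

In this chain, B and C each sit on two shortest paths between other nodes, while the end nodes A and D sit on none.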

bimodal – having two types of nodes. Used to describe a type of network.

boolean – a data type for which the possible values are True and False.

centrality – a measure of a node’s importance in a network. There are many different ways of measuring centrality, including but not limited to betweenness.

data cleaning – the process of editing data, often following some kind of automated processing, to regularize, normalize, and/or transform the data.

degree – the number of edges connected to a given node.
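
As a rough illustration, degree can be counted directly from a list of edges. The sketch below assumes a small, made-up undirected network stored as a Python list of node pairs; the names `edges` and `degree` are illustrative only.

```python
from collections import Counter

# Hypothetical undirected network: each pair is one edge between two nodes.
edges = [("Ana", "Ben"), ("Ana", "Cal"), ("Ben", "Cal"), ("Cal", "Dee")]

# Each edge adds one to the degree of both nodes it connects.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(dict(degree))
```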

directed – of an edge in a network, describing a one-way relationship between two nodes. For example, in a network of correspondence, letters travel from one person to another, making the network directed. Directed edges are typically visualized as arrows. Contrast with undirected.

directed network – a network in which all edges are directed. See also mixed network and undirected network.

document segmentation – the process of parsing a textual document into semantically coherent chunks. Related to layout analysis.

edge – a connection that relates two nodes in a network. Edges are the lines in a typical network diagram.

edge weight – the “strength” of the connection between two nodes, often visualized with varying line thickness.

ForceAtlas2 – a force-directed layout algorithm included in Gephi.

Gephi – a software tool for network visualization and analysis.

graph – a structure in which pairs of objects are related to one another; a synonym for network. “Graph” tends to be used in more mathematical contexts, as in the area of mathematics called “graph theory” which studies such structures.

in-degree – the number of edges entering a given node in a directed network.

integer – a whole number.

layout – in the context of a network, the arrangement of nodes in space. Layout algorithms are often used to determine node arrangement via mathematical formulas. In the context of layout analysis, the arrangement of text, images, and other elements on a page.

layout analysis – the process of identifying regions of a (usually scanned) document, usually to determine the order in which different pieces of text should be read. Related to document segmentation.

list – a data type in computer science, referring to a collection of other data. Typically notated with square brackets and items separated by commas. For example, a list of integers might look like [1, 2, 3, 4, 5] and a list of strings like [“Apple”, “Banana”, “Cherry”].
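
For readers who want to try this, the two example lists above might be written in Python as follows. This is only a sketch in one language; other languages use similar but not identical notation, and the variable names `numbers` and `fruits` are made up for illustration.

```python
# A list of integers and a list of strings.
numbers = [1, 2, 3, 4, 5]
fruits = ["Apple", "Banana", "Cherry"]

# Items are looked up by position, counting from zero.
first_fruit = fruits[0]

# Lists can grow: append adds a new item to the end.
fruits.append("Date")

print(len(numbers), first_fruit, fruits)
```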

machine-readable text – text in a format that can be processed by a computer. This distinguishes “text a computer can read” from “text a person can read.” For example, if you take a picture of an open book with your phone and add that picture to a Word document, you won’t be able to edit it, search it, copy/paste text, or do any of the things you can do with text you’ve typed in that document. For the technology that converts images of text into machine-readable text, see OCR.

mixed network – a network in which some edges are directed and some are undirected. See also directed network and undirected network.

network – a data structure in which pairs of objects (nodes) are related to one another in some way (by edges).

node – one “thing” in a network that is related to other “things” via edges. Nodes are the circles in a typical network diagram.

object – used in computer science and related fields to refer to a generic “thing” that is being modeled. Also, a data structure in some programming languages consisting of a set of key-value pairs. For example, an object called “apple” might have a key called “variety” with the value “Fuji” and a key called “price per pound” with the value “1.31”.
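
In Python, the closest everyday equivalent of this kind of key-value object is a dictionary. The sketch below writes the “apple” example that way; storing the price as a number rather than as text is an assumption made for illustration.

```python
# The "apple" object as a Python dictionary of key-value pairs.
apple = {
    "variety": "Fuji",
    "price per pound": 1.31,  # assumed to be a number here, not text
}

# Values are looked up by their key.
variety = apple["variety"]
print(variety, apple["price per pound"])
```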

OCR – Optical Character Recognition (OCR) is the conversion of images of text (handwritten or printed) into machine-readable text. For example, you will not be able to copy/paste from, or use Ctrl+F to search through, a scanned document that has not been run through OCR software.

OpenRefine – a data cleaning tool.

out-degree – the number of edges leaving a given node in a directed network.
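
To make in-degree and out-degree concrete, the sketch below counts both for a small, invented correspondence network in which each letter is a directed edge from sender to recipient. The data and the names `letters`, `in_degree`, and `out_degree` are hypothetical.

```python
from collections import Counter

# Hypothetical letters as (sender, recipient) pairs; each is one directed edge.
letters = [
    ("Ana", "Ben"),
    ("Ana", "Cal"),
    ("Ben", "Ana"),
    ("Cal", "Ana"),
    ("Dee", "Ana"),
]

# Out-degree counts edges leaving a node; in-degree counts edges entering it.
out_degree = Counter(sender for sender, _ in letters)
in_degree = Counter(recipient for _, recipient in letters)

print(dict(out_degree))
print(dict(in_degree))
```

Here Dee sends one letter but receives none, so her out-degree is 1 and her in-degree is 0.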

string – a data type in computer science, referring to a sequence of characters (i.e., text). Typically notated with surrounding quotation marks, for example, “hello world!”

Tesseract – a popular free and open-source OCR tool.

undirected – of an edge in a network, describing a two-way relationship between two nodes. For example, in a social network, the relationship “coworker” has no directionality—people are mutually coworkers of one another. Contrast with directed.

undirected network – a network in which all edges are undirected. See also mixed network and directed network.

unimodal – having only one type of node. Used to describe a type of network.

web scraping – the process of using computer programs to automatically retrieve information from pages on the web.

weight – see edge weight.