A Textual Transmission Simulator

Timothy J. Finney

Revision History
	2015-03-15
This version (150315) incorporates the following changes and bug fixes: Change rec parameter to be an absolute number rather than a proportion. Fix bug where recovered initial text incorporates edits made during its life time. Increase number of simulated data sets using trend parameter values 0, 1, 2, 3, 4, and 5.
	2014-06-01
Version 140601 is available here.

Table of Contents

Abstract
Introduction
Program design
Configuration parameters
Using the program
Actual versus simulated data sets
Bibliography

A computer program is used to simulate transmission of an ancient book through manual copying. [1] Processes to mimic import, correction, loss, and recovery of instances of the text are incorporated, with import and correction processes allowing textual mixture to occur. Program behaviour is governed by configuration parameters which control the processes; these parameters also describe population centres which produce demand for the book and supply virtual copies as well. A trend parameter sets the degree to which each centre adheres to its own textual preferences. Simulation results produced using one set of hopefully realistic parameters are compared with actual data from the Gospel of Mark. A relatively large value of the trend parameter, corresponding to a strong parochialism within copying centres, is required to produce results which resemble the actual data. This finding is not inconsistent with a theory of local texts where each major centre of text production had a strong tendency to propagate its own flavour of the text.

Introduction

The following terms are used in relation to the virtual texts produced by the simulation:[2]

case: An instance of the text represented as a set of characters; also known as a witness.
character: A place where the text varies; also known as a variation site.
state: One of the possible states of a character; also known as a reading. In a real textual tradition, states are alternative words or phrases; in this simulation they are numerals.

The program uses stochastic processes to produce virtual copies of an initial text through a series of annual cycles. A string of numerals represents each copy of the text. Every time a copy is made, there is a chance that elements of the text will change state. As a result, copies diverge from the initial text, which is a string comprised entirely of ones:

Example 1. Initial text and a subsequent copy

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 2 2 2 3 1 3 1 2 2 2 2 1 5 2 1 2 2 2 3

The annual increment in the population of copied texts is governed by the logistic growth equation:

Equation 1.

increment = growth * demand * (1 - demand/K)

Here, growth is the growth rate, demand is the number of copies required in a place, and K is the maximum demand. This equation initially produces exponential growth which declines to stasis (zero growth) as demand approaches its maximum.

The program imitates stochastic processes by generating pseudo-random numbers (random variates) that conform to particular probability distributions. Although the numbers seem to be random, they are actually generated by a deterministic process which depends on an initial seed value. Consequently, the same sequence of numbers will be generated every time a particular seed is used. The effect of this is to produce precisely the same simulation result given the same configuration parameters and the same seed value.

The program generates two kinds of random variate: uniform and Zipfian. The first kind conforms to the uniform distribution and is therefore equally likely to be any number within a specified range (typically zero to one). A number of operations in the program use uniform variates to randomly select items from a set. Given a probability of selection p, a uniform variate between zero and one is generated for each item. If the variate is less than or equal to p then the corresponding item is selected.

The second kind of random variate conforms to the Zipf distribution. A Zipfian variate has an integral value somewhere in the range of one to a maximum number m (i.e. 1, 2, 3, ..., m). The probability that one of these integers will be selected is inversely related to its rank (i.e. position in the list):

Equation 2.

p = k / rⁿ

Here, k is a normalization constant chosen so that the sum of probabilities from one to m equals one, r is the item's rank, and n is a positive exponent. The larger the value of the exponent, the more pronounced is the tendency to select integers near the head (i.e. beginning) of the list. To illustrate, the following table shows the results of generating one hundred Zipfian variates with seven possible states using various exponent values:

Table 1. Zipfian variates (100 variates, 7 states, various exponents)

Rank	1	2	3	4	5	6	7
Exponent	Frequencies
0	10	13	18	14	15	17	13
1	34	22	14	12	6	8	4
2	76	11	2	5	5	1	0
3	83	7	4	3	2	1	0
4	87	11	2	0	0	0	0
5	97	2	1	0	0	0	0

If the exponent is zero then there is no tendency to select items near the head of the list; that is, the variates conform to the uniform distribution, each integer being equally likely to be chosen. The generated variates of the zero exponent row illustrate an important point: even though there is a particular (in this case, equal) probability of selection for each item in the list, the actual frequency varies from trial to trial because of the stochastic nature of the generating process. As number of trials increases, the average number of times each list item is chosen will tend toward the expected value, which is 100 / 7 (about 14) for an exponent of zero.

Distributions produced by exponent values near one match various real world phenomena such as word frequencies in a natural language corpus or book borrowing frequencies in a library: the most popular item is chosen about twice as often as the second, about three times as often as the third, and so on. Larger exponent values increase the tendency to select items near the head of the list. Regardless of the exponent, generated variate frequencies will vary about expected values due to the stochastic nature of the generation process.

To choose an item from a list so that items near the head of the list are preferred, a Zipfian variate is generated whose maximum possible value equals the length of the list. The variate's value (which is always an integer) is then used to select the corresponding item in the list.

Program design

The program is comprised of a number of components: a Book represents a case; a Conf carries configuration data; a Domain represents a collection of places (e.g. the Roman Empire); and a Place represents a region within the domain (e.g. a Roman province such as Syria).[3] Each place has two collections: current, corresponding to accessible cases; and lost, corresponding to cases which have become inaccessible.

A single domain is created which contains a number of places. The domain then goes through the following phases:

prime: Grow demand for copies from the start to publication dates using the logistic growth equation to increment demand each year.
publish: Introduce the initial text at the place and date specified in the configuration data. The initial text is assigned the word initial as an identifier. Its characters all have an initial state of 1.
propagate: Perform a series of annual propagation cycles starting at the the publication date and ending at the end date, again using logistic growth to increment demand each year. The annual propagation cycle is described in more detail below.
recover: Recover a number of copies from each place.
write: Write recovered copies to an output file. The output is formatted as a data matrix which gives the identifier and set of character states for each case.

For each year between publication and end dates, the program executes a propagation cycle for every place. The propagation cycle consists of a sequence of steps comprised of zero or more of the following operations:

import: Select and import a case from another place. One effect of the import operation is to allow the text to migrate from the place of publication to other places included in the domain.
copy: Choose a pattern case (i.e. exemplar) from this place's current collection, copy it, edit selected characters of the copy, then add the copy to the same collection. Copies are made until the supply of current cases matches the demand.
correct: Choose a subject and reference case from the current collection then make selected characters of the subject conform to corresponding characters of the reference.
lose: Transfer a copy from this place's current collection to its lost collection.

At the end of each place's annual propagation cycle, demand for copies is incremented according to the logistic growth equation.

The uniform distribution is used to select cases to copy, correct, and lose. Each copy operation randomly selects a pattern case from the current collection then randomly selects a number of the copy's characters to be edited. Each correct operation randomly selects a subject and reference case from the current collection. The program generates a proportion between 0 and 1 then randomly selects a corresponding number of characters for correction against the reference. To illustrate, if there are 50 characters and the program generates a proportion of 0.2 then ten (i.e. 0.2 * 50) of the subject case's characters will be randomly selected to be assigned the same state as the corresponding character of the reference case. Each lose operation transfers a randomly selected case from the current to lost collection.

An identifier comprised of three parts separated by periods is assigned to each copy. The three parts are generated as follows:

a contraction of the place where the copy is made
the production year
the number of the copy produced in that year.

For example, the second copy produced in Italy in 259 would be assigned identifier Iy.259.2.

The process of choosing a state for each character selected to be edited depends on a trend setting:

zero: The same set of lists of preferred states per character is used for each place. Each state of a character is equally likely to be selected.
greater than zero: Each place has its own array of lists of preferred states of characters. When a character is selected to be edited, states near the head of its own list of preferred states are more likely to be chosen to a degree determined by the magnitude of the trend parameter. The trend parameter is used as the exponent of the Zipfian variate generator used to select an item from the character's list of preferred states.

During initialization, the program assigns each character its own number of states using a Zipfian variate generator with a specified maximum possible number of states and an exponent of 1.8.[4] A list of preferred states is then generated for each character. If the trend parameter is zero then lists for corresponding characters are the same across places, being the numerals from one to the assigned number of states (e.g. 1 2 for two states; 1 2 3 4 for four). A different method is used if the trend parameter exceeds zero: each character of each place is assigned a list which is a permutation of the numerals from one to the assigned number of states (e.g. 2 1 if two states; 3 2 4 1 if four). That is, each place has its own combination of lists of preferred states unless the trend parameter is set to zero.

Import operations also rely on Zipfian variates. Exporting places are ranked by distance from the importing place using great circle distances calculated from latitude and longitude coordinates specified in the configuration data. A Zipfian probability profile (generated using an exponent of one) is associated with the list of nearest places then used to select a place from which to import a copy.

Once the program reaches the end of the propagation phase (i.e. the end date), it steps through the recover and write phases. Recovery consists of random selection from the current and lost collections of each place according to two parameters:

ratio of current to lost cases, and
number of copies to recovery.

The first allows various ratios of current to lost cases to be recovered, with a ratio of one giving approximately equal numbers of both types.

The initial text is always included among recovered cases.

The write phase outputs a data matrix whose rows give the identifier and character states of each recovered case. The data matrix is written to a file formatted as a comma-separated vector (CSV). This file can then be opened for inspection, processing, and analysis by R or some other application such as a spreadsheet program.

Configuration parameters

Operation of the program is governed by configuration data contained in an XML file (e.g. conf.xml) which specifies a number of parameters:

characters: Number of characters used to represent a case. Too small a number will produce quantization of distances between cases; the larger the number, the slower the simulation will run.
growth: Initial rate of annual growth for demand of the book. In this simulation, the same rate applies across places in the domain.
origin: Place where the initial text is introduced.
ratio: Ratio of current to lost cases recovered (i.e. current / lost). Values larger than one favour current cases while values between zero and one favour lost copies.
seed: Random number generator seed. Any number will do. Other things being equal, using the same seed will produce the same simulation result.
states: Maximum number of character states.[5]
trend: Degree to which each place favours its own combination of lists of preferred character states. The larger the value, the more parochial each place with respect to its flavour of the text.

Probabilities are specified as follows:

corr: Probability (per year) that a case will be corrected.
edit: Probability (per character copied) that a character will be edited.
import: Probability (per case required) that a case will be imported from another place.
loss: Probability (per year) that a case will be lost.

Any number of places may be specified to constitute the domain. Each place has the following components:

name: Name of the place.
capacity: Maximum demand for copies in this place.
lat: Latitude of some point in the place.
long: Longitude of some point in the place.
rec: Number of cases to recover from this place.

Finally, three dates are specified:

start: First year of simulation.
publish: Year when initial copy is introduced.
end: Last year of simulation.

Using the program

R must be installed on the user's computer before the simulation can run. Instructions on how to download and install R are provided at the project's website. Much useful information is available at this site, including platform-specific instructions on a number of topics such as how to set the working directory.

Subsequent steps will be facilitated by creating a suitable directory (i.e. folder) structure to hold the simulation program and related files. A parent directory (say, Simulation) should be created with subdirectories named scripts and data. The simulation program's source file (sim.R) and a configuration file (say, conf.xml) would then be downloaded to the scripts and data directories, respectively. If desired, R scripts to perform analysis on the simulation results can also be downloaded to the scripts directory from here. A subdirectory corresponding to each analysis script should be created at the same level as the scripts and data subdirectories.[6] Finally, the scripts directory should be selected as the working directory for R.

A text editor can be used to edit the configuration file as desired. Once the R environment is launched and the working directory changed to scripts, the simulation program is run by typing source("sim.R") at the R interface's command prompt.

Input and output parameters are specified by editing the following line found towards the end of the simulation program's source file:

c = Conf$new(conf="conf.xml", dir="../data", rem=2, write=TRUE)

The values of conf, dir, rem, and write can be edited to change the configuration file name, data directory path, level of remarks (0 = minimal; 1 = main phases; 2 = years; 3 = all events), and whether or not to write the output data matrix to the data directory (TRUE = yes; FALSE = no).

Only a small part of the data generated by the simulation is output to the data matrix. However, all generated data is accessible via the Domain object, d. Once the simulation has run, features of interest can be inspected by addressing specific components and subcomponents. To illustrate, the components of d are listed with this command, typed at the R command line:

ls(d)

The following would print the entire contents of the first place in domain d, a potentially large amount of data:

d$place[[1]]

This would print the first recovered case:

d$recovered[[1]]

The ancestry and history of this case would be printed as follows:

d$recovered[[1]]$ancestry

d$recovered[[1]]$events

The ancestry field lists the identifiers of all ancestors back to the initial text. The list also includes the identifier of each reference case used to perform a correction operation somewhere in a case's past. These reference case identifiers are placed in parentheses (e.g. (Iy.259.2)) to distinguish them from genetic ancestor identifiers. The events field lists all operations (i.e. import, copy, correct, lose) that have occurred in a case's history.

Actual versus simulated data sets

The program is intended to simulate the process by which a model text was transmitted. In order to assess how well the simulation performs, analysis results derived from model and simulator-generated data sets will now be compared. If the model and simulated traditions have similar characteristics then it is not unreasonable to believe that the simulation program is successfully imitating the model's transmission history.

The model data set, Mark-UBS4, derives from a data matrix constructed by Richard Mallett using the critical apparatus of Mark's Gospel found in the fourth edition of the United Bible Societies Greek New Testament [UBS4]. The Mark-UBS4 data matrix was produced by passing Mr Mallett's data matrix through the clean.R script to remove witnesses with large numbers of undefined readings.

The simulator-generated data sets are based on six configuration files: sim-0.xml, sim-1.xml, sim-2.xml, sim-3.xml, sim-4.xml, sim-5.xml. These are identical in all respects except for their trend parameters, which have values of 0, 1, 2, 3, 4, and 5, respectively. The data set generated with a setting of 0 has no location-specific tendencies with respect to combinations of preferred character states while the others have increasingly pronounced tendencies toward parochialism.

Other parameter settings, which are common across all of these configuration files, require some explanation. Start and end dates bracket the period of interest, while the seed allows the user to obtain repeatable results. The number of characters and maximum number of states are based on the model data set.

The number of places in the domain is based on analysis results for the model data set which indicate that the tradition has four major streams. The assumption expressed in the configuration files that these locations are the West, Asia Minor, Syria, and Egypt (centred at Rome, Ephesus, Antioch, and Alexandria, respectively) is speculative. Nevertheless, the number and identity of these places seem reasonable choices given what is known of the spread of Christianity in the first few centuries after Christ's birth, and it is hard to think of any other centres that would have generated greater demand for copies of Christian books during this era. Whereas a configuration with more places would be more realistic, the one given seems reasonable as a first approximation.

Each location has been assigned the same maximum demand for copies, namely three hundred. Without doubt, the respective demands would have differed though who knows to what extent? The assigned figure is a very rough estimate based on these assumptions:

maximum demand for copies occurred about the time of the first Council of Nicaea (325)
maximum demand across Asia Minor, Syria, and Egypt was about nine hundred (based on the figure of about one thousand Eastern bishops supposed to have been invited to the Council)
there is a one-to-one correspondence between copies and bishops.

These assumptions are probably too conservative so the figure of three hundred copies per location probably underestimates the actual maximum demand.

The number of cases recovered per place is also based on analysis results for the model data set which show three textual varieties with about the same number of members and one variety with about as many as the others combined.

A process of trial and error was used to find a combination of values for the remaining parameters which gives a reasonably good match between analysis results derived from the model and simulator-generated data sets. Users are encouraged to experiment with their own combinations of parameter values in order to find a better match. The effect of an incremental change in value varies from one parameter to the next.

The following table presents the model and generated data sets along with analysis results derived from them. Distance matrices are derived from the corresponding data matrices using the script dist.R. Classical multidimensional scaling (CMDS), divisive clustering (DC), and neighbour-joining (NJ) results were obtained with scripts cmds.R, dc.R and nj.R.[7] Mean chars gives the average number of characters (or variation sites) where states (or readings) are defined. Mean dist. and std. dev. (standard deviation) figures relate to text-to-text distances recorded in the distance matrix.[8] The prop. var. (proportion of variance) figure indicates how much of the original distance data is captured by the CMDS result, and the div. coeff. (divisive coefficient) figure indicates degree of clustering in the DC result.

Table 2. Model and generated data sets

Name	Trend	Data matrix	Distance matrix	CMDS	DC	NJ	Mean chars	Mean dist.	Std. dev.	Prop. var.	Div. coeff.
Mark-UBS4	-	→	→	→	→	→	123	0.471	0.170	0.51	0.74
sim-0	0	→	→	→	→	→	120	0.593	0.047	0.15	0.34
sim-1	1	→	→	→	→	→	120	0.575	0.052	0.19	0.35
sim-2	2	→	→	→	→	→	120	0.531	0.090	0.37	0.50
sim-3	3	→	→	→	→	→	120	0.485	0.138	0.55	0.61
sim-4	4	→	→	→	→	→	120	0.459	0.171	0.65	0.70
sim-5	5	→	→	→	→	→	120	0.440	0.192	0.72	0.76

Given the other configuration parameters used here, the trend parameter must be set to a value of at least two for analysis results from model and simulated data sets to be broadly consistent.[9] Such a high value of the trend parameter corresponds to a strong tendency for each place to prefer its own combination of character states when editing copies. This indicates that ancient centres of Christianity strongly preferred their own combinations of readings, if indeed the actual transmission process is mirrored by the simulator running with these configuration parameters. It would seem that a theory of local texts deserves renewed attention when seeking to understand the early history of New Testament textual transmission.

Bibliography

[CC1980] Chatfield, Christopher and Alexander J. Collins. Introduction to Multivariate Analysis. London: Chapman and Hall, 1980.

[R] R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing, 2014. http://www.R-project.org/

[SM1987] Saitou, Naruya and Masatoshi Nei. “The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees.” Molecular Biology and Evolution 4, no. 4 (1987): 406-425.

[SWH2004] Spencer, Matthew, Klaus Wachtel, and Christopher J. Howe. “Representing Multiple Pathways of Textual Flow in the Greek Manuscripts of the Letter of James Using Reduced Median Networks.” Computers and the Humanities 38, no. 1 (2004): 1-14.

[UBS4] Aland, Barbara, Kurt Aland, Johannes Karavidopoulos, Carlo M. Martini, and Bruce M. Metzger, eds. The Greek New Testament. 4th rev. ed. Stuttgart: United Bible Societies, 1983.

^[1]The program, sim.R, is a script written in the [R] statistical computing language. See the R project web site for more information on the language.

^[2][SWH2004] introduces character and state nomenclature. In multivariate analysis, a general term such as case or object is often used for an item comprised of multiple characters.

^[3]For R programmers, these components are reference classes.

^[4]This value of the exponent produces a distribution of numbers of states not unlike that found in the critical apparatus of the [UBS4] Greek New Testament.

^[5]A suitable value would be the maximum number of alternative readings found at a variation site of the data being modelled.

^[6]For example, if scripts dist.R, cmds.R, dc.R, and nj.R were downloaded to the scripts directory then dist, cmds, dc, and nj directories should be created as siblings of the scripts and data directories.

^[7]See [CC1980] chapters 10 and 11 for an introduction to multidimensional scaling and cluster analysis; [SM1987] introduces neighbour-joining.

^[8]The simple matching distance, which is the ratio of disagreements to places compared, measures the distance between two texts. Being a pure number, it does not have a unit.

^[9]Decreasing the trend parameter much below 2 or the chance of editing per character copied much below 0.5 makes it difficult to produce a reasonable match with the actual data.