Big Data and Archaeology

François Djindjian

Archaeologist

The advent of Big Data has affected all domains of research. François Djindjian traces the use of digital technology in archaeology.

Margalit Berriet, Valcamonica archaeological site, Italy, 2014

In this short paper dedicated to Big Data in archaeology, I will examine the concept of Big Data in light of the technological evolution of computing and the recent history of computational archaeology.

Since the development of the Internet, the volume of stored data has been growing rapidly: the digital data created worldwide increased from 1.2 zettabytes per year in 2010 to 2.8 zettabytes in 2012, and are expected to reach 40 zettabytes in 2020. Most of these data are produced by technical and scientific facilities. The Square Kilometre Array radio telescope, for example, produces 50 terabytes of analyzed data per day, extracted from raw data produced at a rate of 7000 terabytes per second.
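As a rough check on these orders of magnitude, the following Python sketch derives the compound annual growth rates implied by the figures quoted above, and the ratio between the raw and the analyzed data of the radio telescope. The figures come from the text; the helper name annual_growth is introduced here only for illustration.

```python
# Yearly volumes of data created worldwide, in zettabytes (figures from the text).
volume_zb = {2010: 1.2, 2012: 2.8, 2020: 40.0}


def annual_growth(v_start: float, v_end: float, years: int) -> float:
    """Compound annual growth rate between two yearly volumes."""
    return (v_end / v_start) ** (1 / years) - 1


growth_2010_2012 = annual_growth(volume_zb[2010], volume_zb[2012], 2)
growth_2012_2020 = annual_growth(volume_zb[2012], volume_zb[2020], 8)

# Square Kilometre Array figures quoted in the text.
raw_tb_per_second = 7000
analyzed_tb_per_day = 50
raw_tb_per_day = raw_tb_per_second * 24 * 60 * 60
reduction_factor = raw_tb_per_day / analyzed_tb_per_day

print(f"2010-2012: {growth_2010_2012:.0%} per year")
print(f"2012-2020: {growth_2012_2020:.0%} per year")
print(f"Raw to analyzed data: about {reduction_factor:.1e} to 1")
```

Under these figures, growth is on the order of 50% per year at the start of the decade and somewhat lower afterwards, and only about one part in ten million of the raw telescope data survives as analyzed data.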

The volume of data produced by archaeologists is obviously not of the same order of magnitude. After more than a decade of annual campaigns, an archaeological excavation will produce several hundred megabytes of data, well within the storage capacity of a desktop computer. These volumes are mainly due to the digitization of drawings and photographic documents. However, applications that produce large volumes of data, such as 3D, Lidar, or laboratory analyses, are increasingly being used in archaeology. Independently of volume, there is also the question of archiving in an institutional context: archaeological data too often remain under the direct responsibility of the individual archaeologist, in his or her personal computer environment, where their preservation is neither secure nor sustainable.

History of the Concept of Big Data and Archaeology

At its origins, archaeology was a science of the object, and archaeologists often referred to themselves as antiquarians or collectors. Beginning in the 1960s, archaeology gradually became a science of information about past societies: intrinsic information that describes the artefacts of material culture, and extrinsic information that records the archaeological context of these artefacts and their relationships. This information was disseminated through written media (books, corpora, and papers in scientific journals) that could be consulted in institutional and private libraries. Archaeologists archived their working documents: excavation notebooks, stratigraphic and planographic drawings, object drawings, plans, photographs, inventories, measurements, notes, draft papers, and offprints, as well as correspondence between archaeologists. At best, these documents were archived institutionally or privately.

The development of computer science gradually transformed paper media into electronic media: bibliographic systems, data retrieval systems (“data banks”), inventory files, and measurement files. Typewriters, which appeared in the second half of the 19th century, disappeared in the late 1980s, replaced by microcomputers and word processing software. Postal mail has given way to e-mail, but in most cases the messages are no longer archived. Historiography thus loses the private exchanges between researchers, which are often more instructive than official exchanges. Offprints of published articles, or their photocopies, have been replaced by PDF files, which are exchanged, made freely accessible on open-access sites, or sold on private publishers’ websites.

Drawing, which was a manual activity (the laboratories of the CNRS, the French National Centre for Scientific Research, employed draftsmen among their ITA technical staff), became computer-aided drawing (DAO in French) with the famous Adobe suite, comprising vector drawing (Illustrator), page composition (PageMaker/InDesign), and image creation and retouching (Photoshop), or with its competitors. Then, from the 1990s, digitization accelerated and generated many new kinds of data for archaeology:

- Physical and chemical measurements
- Geophysical surveys (terrestrial and maritime)
- Lidar data
- Mapping
- Geographic Information System (GIS)

- Digital photography
- Digitization of silver photographs
- Digital film
- Digital stratigraphic and planographic drawings
- 3D with virtual reality and digital photogrammetry

From then on, the issue of Big Data was a matter for everyone in archaeology.

Big Data – A Long History Linked to the Progress of Computing

The concept of “Big Data” is relative. It concerns the problems of archiving and processing volumes of data that are large relative to the hardware available to store them and to the software tools available to search them (documentary systems, indexing, search engines), consult them, extract subsets, visualize them (graphic systems, GIS, 3D), and process them (multidimensional data analysis, modeling, etc.).

The modern scientific world likes to periodically revive ideas that stalled against the technological limits of their day, giving new names to the same concepts. Artificial Intelligence, the great myth of modern times, is a good example. Born from the cybernetics of the 1940s, it took shape in the 1950s (Rosenblatt’s perceptron) with the first computers, and it has reappeared periodically under different names: AI, machine learning, expert systems, neural networks, rule engines, and, most recently, deep learning. Its most successful applications are found in robotics, machine translation, pattern recognition, diagnostic assistance, decision support, Big Data processing (where it replaces the data mining of the 1990s), and, best known of all, games in which the machine beats the human (chess, Go).

Big Data also has a long history. It is related to changes in the size of computer memory and in the volume of mass storage (discs and magnetic tapes). In the 1960s and 1970s, ferrite core memories were limited to tens or hundreds of kilobytes. RAM now runs to several dozen gigabytes, roughly a million times more. Mass storage has followed the same technological evolution: from the 2 megabytes of an early IBM disk drive in 1962, through the 300 megabytes of the 1980s and the 25 gigabytes of 1998, to the several terabytes of the present day, again a million times more!

Magnetic tapes, organized into storage bays that hold about ten or twenty tapes, can reach a total capacity of several dozen terabytes. Tape libraries are therefore the easiest way to back up and archive large amounts of data, such as those of web server farms or the institutional storage of research organizations. Alas, the lifespan of magnetic tape is only about twenty years!

In the 1970s, databanks and large tables were the “Big Data” of the period. In France, this was the great institutional period of the documentary systems implemented by the Ministry of Culture (museum inventories, General Inventory of Monuments and Artistic Richness of France, Archaeological Map), using Bull’s Mistral software. But the data from this period were text only, the images being stored on microfiche and viewed on a reader installed next to the terminal. It was only in the 1980s that technological advances in memory, storage units (magnetic discs, videodiscs, and digital optical discs), and networks brought the first data/image/voice server prototypes, which became operational in the mid-1990s with the development of the internet. It should be noted, however, that the Videotex system, a precursor of the internet, was operational in France from 1980 until 2012.

Large tables, which are the basic data of archaeologists for most of the problems they deal with (Djindjian 1991, 2011), were the subject of graphic manipulations in the 1960s before being treated by multidimensional data analysis in the mid-1970s, despite the limitations of computers in computing power and central memory. From the 1990s onwards, these limitations disappeared, and these treatments began to be carried out on microcomputers. The 1980s and 1990s, the years in which microcomputers, networks, and office software developed, saw individual archaeologists appropriate these tools for themselves, while institutions fell back on institutional research projects.

The 1990s saw the arrival of a new vocabulary, if not a new approach. Data mining applied multidimensional statistical techniques to large bodies of data, such as those collected on the Internet or through questionnaires, with the aim of identifying types of consumer behaviour (segmentation, scoring). Learning techniques were also emerging, but archaeologists had little stake in the primarily marketing-driven interests of data mining.

The 2000s saw the emergence of the vocabulary of Big Data, linked to the massive (“Orwellian”) production of data that the technological progress of computing and telecommunications now made it possible to store, communicate through networks, visualize, and process. Research institutions began to worry about the dispersion of data recorded by individual researchers (though institutionally funded), which are lost when the microcomputer breaks down or when the researcher retires, especially in the humanities and social sciences, where the individual researcher takes precedence over laboratory teamwork.

In France, the CNRS launched the TGIR Huma-Num project (www.huma-num.fr) for the archiving of digital data from the humanities. It is a computer platform for data acquisition, storage, dissemination, processing, and archiving. Several archaeology laboratories have joined forces in the MASA consortium (Memory of Archaeologists and Archaeological Sites) to use the services of the TGIR Huma-Num. The consortium aims to offer unified access to the varied data and documentation produced by archaeologists, and it develops methods and tools for the archaeological community that respect international standards (https://masa.hypotheses.org/). In Europe, the Ariadne project has launched cooperation between archaeologists on unifying projects, particularly thesauri, and on service platforms, including for archiving (https://ariadne-infrastructure.eu/).

What Are Archaeological Big Data?

Archaeological Big Data consist of an open-ended set of files of varying sizes, formats, and structures:

- Databases: the results of recording archaeological excavation data (extrinsic information) and describing artefacts (intrinsic information). These data are stored in a variety of software, from word processors and spreadsheets to database management systems.
- Ancient texts in their original writing and translation (philology).
- Databanks created with documentary system software.
- Digitized documents: digital photos, digitized slides, digitized stratigraphic and planographic drawings, digitized video films, 3D.
- Vector graphic documents such as those created by desktop publishing (DTP) software or geographic information system (GIS) software.
- Quantitative data tables.
- Measurement files such as those produced by physical-chemical devices: geophysical prospection, Lidar, varied spectrometry, dating, etc.

The Functions of a Big Data Service

The functions of a Big Data service are not limited to archiving. They cover the entire workflow: acquisition (the Submission Information Package), storage, description (i.e., indexing and the definition and management of the metadata that describe the data), dissemination (which allows consultation over the internet), archiving (in a standardized format), selection (which allows data to be extracted and formatted for processing), and processing. The processing function is rich and varied, and includes all the tools and software used for more than fifty years: lexicographic analysis, statistics, multidimensional data analysis, geographic information systems, image processing, modeling, 3D and, more recently, the return of Artificial Intelligence in the form of machine-learning techniques (deep learning).
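Two of these functions, description and selection, can be illustrated with a toy keyword index. This is a minimal sketch and not the design of any actual platform: the class name Catalogue, its methods, and the dataset identifiers are all invented for the example.

```python
from collections import defaultdict


class Catalogue:
    """Toy in-memory index: keyword -> identifiers of the datasets it describes."""

    def __init__(self):
        self._index = defaultdict(set)

    def describe(self, dataset_id, keywords):
        """Description step: attach keywords (metadata) to a dataset."""
        for keyword in keywords:
            self._index[keyword.lower()].add(dataset_id)

    def select(self, *keywords):
        """Selection step: datasets described by all of the given keywords."""
        if not keywords:
            return []
        hits = [self._index.get(k.lower(), set()) for k in keywords]
        return sorted(set.intersection(*hits))


catalogue = Catalogue()
catalogue.describe("survey-2014-lidar", ["Lidar", "survey", "3D"])
catalogue.describe("trench-B-photos", ["photography", "stratigraphy"])
catalogue.describe("trench-B-model", ["3D", "stratigraphy"])

print(catalogue.select("3d"))                  # both 3D datasets
print(catalogue.select("3D", "stratigraphy"))  # only the trench model
```

A real service would persist the index and draw its keywords from a controlled vocabulary, but the division of labour is the same: what is not described at deposit cannot be selected later.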

Best Practices

Beyond the pleasure of getting drunk on buzzwords, the archaeologist must invest himself or herself in projects that effectively combine technical innovation and pragmatism. Good practices are the best guarantee of a successful project.

Metadata, the data that describe the data, bring together two sets: the individual metadata related to the data produced by the archaeologist, and the common, institutional, global, and specialized metadata, which are increasingly standardized. In archaeology, this institutional metadata derives from the documentary projects launched in the 1970s by institutions such as the Ministry of Culture, the CNRS, and the Institute for Scientific and Technical Information (INIST), which invested in building the first major thesauri, today’s metadata databases. In archaeology, the reference thesaurus is Pactols, originally developed for the Frantiq documentary system; it has 30,000 references and complies with ISO 25964, the international standard for thesauri. The different thesauri of the Ministry of Culture (General Inventory, Museum Database) have been grouped on the Ginco platform. The standards that have homogenized industrial production for more than fifty years are also gradually reaching archaeology, either indirectly, through generic software, or directly, though still rarely, through standards dedicated to archaeology.
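The distinction between the two sets of metadata can be sketched as a simple record type. Everything in this example is hypothetical: the concept identifiers and labels are invented and are not real Pactols entries, and the class DatasetMetadata is only an illustration of the principle of checking a record against a shared vocabulary.

```python
from dataclasses import dataclass, field

# Hypothetical excerpt of a shared thesaurus: identifier -> preferred label.
# These identifiers and labels are invented for the example.
THESAURUS = {
    "c001": "rock art",
    "c002": "flint tool",
    "c003": "Iron Age",
}


@dataclass
class DatasetMetadata:
    # Individual metadata, supplied by the archaeologist.
    title: str
    author: str
    year: int
    # Institutional metadata: identifiers drawn from the shared thesaurus.
    concept_ids: list = field(default_factory=list)

    def unknown_concepts(self):
        """Identifiers that do not exist in the shared vocabulary."""
        return [c for c in self.concept_ids if c not in THESAURUS]

    def labels(self):
        """Preferred labels of the recognised concepts."""
        return [THESAURUS[c] for c in self.concept_ids if c in THESAURUS]


record = DatasetMetadata(
    title="Engraved panels, sector 3",
    author="A. Archaeologist",
    year=2014,
    concept_ids=["c001", "c003", "c999"],
)
print(record.labels())
print(record.unknown_concepts())
```

Storing identifiers rather than free-text keywords is what lets labels be translated or revised in the thesaurus without touching the deposited records.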

Archiving has its own standard, the Open Archival Information System (OAIS), ISO 14721:2012. In this standard, an “information package” contains the information to be archived, retained, or communicated to users. The information package always contains the object to be archived and the metadata needed to preserve it.

Three types of information package are defined:

- The Submission Information Package (SIP): it is produced by the depositor, according to the model imposed by the archive manager;
- The Archival Information Package (AIP): the content data objects and their metadata. It is produced by and for the archive manager;
- The Dissemination Information Package (DIP): it is delivered to the user making the request, depending on that user’s rights and on the dissemination rights.
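The relationship between the three packages can be sketched as follows. This is a deliberately simplified illustration, not the data model of ISO 14721: the field names, the use of a SHA-256 checksum as preservation metadata, and the boolean standing in for dissemination rights are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class SIP:
    """Submission package: what the depositor hands over."""
    content: bytes
    metadata: dict


@dataclass
class AIP:
    """Archival package: content plus what is needed to preserve it."""
    content: bytes
    metadata: dict
    sha256: str    # checksum recorded at ingest (assumed preservation metadata)
    public: bool   # simplistic stand-in for dissemination rights


@dataclass
class DIP:
    """Dissemination package: what a given user receives."""
    content: bytes
    metadata: dict


def ingest(sip: SIP, public: bool = False) -> AIP:
    """Turn a submission into an archival package."""
    digest = hashlib.sha256(sip.content).hexdigest()
    return AIP(sip.content, dict(sip.metadata), digest, public)


def disseminate(aip: AIP, user_authorized: bool) -> Optional[DIP]:
    """Deliver a package only if the content is intact and access is allowed."""
    if hashlib.sha256(aip.content).hexdigest() != aip.sha256:
        raise ValueError("content no longer matches its recorded checksum")
    if aip.public or user_authorized:
        return DIP(aip.content, dict(aip.metadata))
    return None


sip = SIP(b"layer;type\nUS 1;scraper\n", {"title": "Trench B inventory"})
aip = ingest(sip)
print(disseminate(aip, user_authorized=False))        # None: restricted
print(disseminate(aip, user_authorized=True).metadata)
```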

The CIDOC CRM (ISO 21127:2014) is a standard for cultural heritage information and as such is also relevant to Big Data and archiving.

The Basic Data

The information package to be archived must contain the information at the most basic level recorded. The archiving system must provide the selection, filtering, and aggregation functions needed to build data at any higher level of aggregation. If only aggregated data are deposited, the information at the most basic level is permanently lost.
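A short sketch makes the asymmetry concrete: counts per layer or per site can always be rebuilt from records kept at the level of the individual find, whereas the reverse is impossible. The records and the helper aggregate below are invented for illustration.

```python
from collections import Counter

# Invented find-level records: the most basic level of recording.
finds = [
    {"site": "Site A", "layer": "US 1", "type": "scraper"},
    {"site": "Site A", "layer": "US 1", "type": "burin"},
    {"site": "Site A", "layer": "US 2", "type": "scraper"},
    {"site": "Site B", "layer": "US 1", "type": "scraper"},
]


def aggregate(records, *keys):
    """Count records grouped by the given fields."""
    return Counter(tuple(r[k] for k in keys) for r in records)


by_layer = aggregate(finds, "site", "layer")   # finer aggregation
by_site = aggregate(finds, "site")             # coarser aggregation
by_type = aggregate(finds, "type")             # a different question entirely

print(by_layer)
print(by_site)
print(by_type)
```

Had only by_site been archived, neither the per-layer counts nor the per-type counts could be recovered from it.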

The Raw Data

The information package to be archived must contain the raw data, at the best available resolution, without any reformatting or processing that would reduce their volume or alter them.

Processing

It is illusory to think that an accumulation of data, however large, can yield knowledge or decisions through the action of a few algorithms, however powerful they may be. Data exploration (the name now given to the various methods referred to in the 1970s as multidimensional data analysis) can only be effective within a formal construction that makes it possible both to reveal a structure in the data and to validate it. It is undoubtedly this overconfidence (or laziness) that lies behind the disappointment with these techniques since the 2000s, amplified by the fashion for post-modernism.

Integrating data exploration techniques into a global cognitive process requires a multi-level approach, such as the one we proposed under the title “systemic triplet” (Djindjian, 2002): A systemic triplet S (O, I, E) is defined by the Objects O, the Intrinsic Information I, and the Extrinsic Information E.

Step 1: Definition of the systemic triplet S (O, I, E). The system S is defined by a set of constant values of E, such as the objects of the same stratigraphic unit (closed set), the same burial, the paintings of the same cave, the tools of the same dwelling structure, the contemporary urban structures of the same territory, etc., all of which can be defined by a set of constant values of extrinsic information of type T (time), H (dwelling structure), R (territory), L (location), M (origin), EV (environment), EC (economy), etc.
Step 2: Perception and description of intrinsic information I
Step 3: Recording of extrinsic information E
Step 4: Formalization
- Structuring the system formalized by a table of Objects x Description of Objects (O x I), which provides partition structures (classification) or serial structures (seriation), giving a new order on O, that is O+, and correlations on I, that is I+. The system then moves from the cognitive state S (O, I) to the cognitive state S+ (O+, I+). Such structuring is called intrinsic structuring.
- Structuring the system formalized by the occurrence table (I x E), which provides structures of correspondence between the two sets of information: structuring in chronological facies for E x T, spatial structuring for E x H or E x L, determinism for E x EV, etc. The system then moves from the cognitive state S (O, I, E) to the cognitive state S+ (O+, I+, E+). Such structuring is called extrinsic structuring.
Step 5: Application of multidimensional data analysis techniques to the tables (O x I) or (I x E)
Step 6: Feedback by iteration on I and E (a learning mechanism)
Step 7: Progressive enrichment by integration of new I and E
Step 8: Validation (on another O system, by another E correlation, etc.)
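The first steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the author's implementation: the objects and attributes are invented, and reciprocal averaging, a simple iterative relative of correspondence analysis, is swapped in here as one possible technique for the seriation of Step 5.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ArchObject:
    """One object of O with its intrinsic (I) and extrinsic (E) information."""
    name: str
    intrinsic: set    # attributes observed on the object (must not be empty)
    extrinsic: dict   # context keyed by type, e.g. {"T": "phase 1"}


# Invented data, deliberately listed out of chronological order.
objects = [
    ArchObject("o2", {"blade", "scraper"}, {"T": "phase 2"}),
    ArchObject("o3", {"scraper", "burin"}, {"T": "phase 3"}),
    ArchObject("o1", {"point", "blade"}, {"T": "phase 1"}),
]


def object_table(objs):
    """Presence/absence table O x I."""
    attrs = sorted({a for o in objs for a in o.intrinsic})
    rows = {o.name: [int(a in o.intrinsic) for a in attrs] for o in objs}
    return attrs, rows


def occurrence_table(objs, e_type):
    """Occurrence table I x E for one type of extrinsic information."""
    counts = Counter()
    for o in objs:
        for a in o.intrinsic:
            counts[(a, o.extrinsic[e_type])] += 1
    return counts


def seriate(objs, iterations=50):
    """Order the objects by reciprocal averaging on the O x I table."""
    attrs = sorted({a for o in objs for a in o.intrinsic})
    col = {a: float(i) for i, a in enumerate(attrs)}  # arbitrary start
    for _ in range(iterations):
        # Object score = mean score of its attributes.
        row = {o.name: sum(col[a] for a in o.intrinsic) / len(o.intrinsic)
               for o in objs}
        # Attribute score = mean score of the objects that carry it.
        for a in attrs:
            holders = [row[o.name] for o in objs if a in o.intrinsic]
            col[a] = sum(holders) / len(holders)
        # Rescale to [0, 1] so the scores do not collapse to a constant.
        lo, hi = min(col.values()), max(col.values())
        span = (hi - lo) or 1.0
        col = {a: (v - lo) / span for a, v in col.items()}
    row = {o.name: sum(col[a] for a in o.intrinsic) / len(o.intrinsic)
           for o in objs}
    return sorted(row, key=row.get)


attributes, table = object_table(objects)
print(attributes, table)
print(occurrence_table(objects, "T"))
print(seriate(objects))   # a serial order of the objects, up to reversal
```

The direction of the resulting order is arbitrary: only its confrontation with the extrinsic information (here the phases) says which end is the older, which is precisely the role of extrinsic structuring.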

These processes, in order to be truly cognitive, must explicitly integrate learning mechanisms. Data analysis achieves this through the option of supplementary elements and through iteration on the intrinsic information, which enables an interaction between archaeologist and object that is a real mechanism of learning.

More generally, the “systemic triplet” method follows the logic of C.S. Peirce, which has found important applications, particularly in the fields of automation (process control, robotics) and cognitive psychology, and also in archaeology:

- A: Acquisition of intrinsic information (cognitive interaction archaeologist-object) and extrinsic information (recording during survey and excavation operations),
- S: Structuring by learning, obtained by correlation mechanisms between intrinsic information (intrinsic structuring) or by mechanisms of correlation between intrinsic information and extrinsic information (extrinsic structuring),
- R: Reconstitution (Cognitive Modeling).

The challenge of Artificial Intelligence, through the various algorithms it has developed since the 1950s, can be summed up in the following paradox: using the ever-increasing computing power of computers with simple iterative algorithms in order to implement a sophisticated formal construction. The analogy with chess illustrates this paradox: one can either calculate all possible combinations or devise a game strategy that reduces the number of combinations to calculate. The first option, whose success is due only to improvements in computing capacity, is only a step in preparing the second. Hence the success of the term deep learning and, probably, eventually, of the results of the concept itself, which must become more than a buzzword.
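The two options can be compared on a game far smaller than chess. The sketch below is an illustration chosen for this purpose and not drawn from the text: players alternately remove one, two, or three stones from a pile, and whoever takes the last stone wins. The first function explores every line of play; the second uses alpha-beta pruning, one classic way of reducing the number of combinations examined.

```python
def exhaustive(stones, counter):
    """Explore every line of play; returns +1 if the player to move wins."""
    counter[0] += 1
    if stones == 0:
        return -1  # the opponent has just taken the last stone
    return max(-exhaustive(stones - k, counter)
               for k in (1, 2, 3) if k <= stones)


def pruned(stones, alpha, beta, counter):
    """Same search, abandoning lines that cannot change the result."""
    counter[0] += 1
    if stones == 0:
        return -1
    best = -2  # lower than any reachable score
    for k in (1, 2, 3):
        if k > stones:
            break
        score = -pruned(stones - k, -beta, -alpha, counter)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # the opponent would never allow this position
    return best


full_count, pruned_count = [0], [0]
full_value = exhaustive(10, full_count)
pruned_value = pruned(10, -2, 2, pruned_count)

print("exhaustive:", full_value, "positions visited:", full_count[0])
print("pruned:    ", pruned_value, "positions visited:", pruned_count[0])
```

Both searches reach the same verdict, but the pruned one visits fewer positions. For this particular game a known rule goes further still and removes the search altogether: the player to move wins unless the number of stones is a multiple of four.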

Conclusions

Beyond the term “Big Data,” what is really at stake is the relationship between the research scientist and the fantastic evolution of computer technology since the second half of the 20th century. The more means this technology offers (computing capacity, storage volume, and communication channels), the more needs appear (often prompted more by industrial marketing than by any expectation expressed by researchers). Archaeology has followed this trend, certainly with incomparably lower needs, but the development of certain methods (such as 3D) and the particular sociology of archaeologists mean that institutions must mobilize to offer environments, standards, and services for archaeological Big Data.

REFERENCES:

Djindjian, F. 1991. Méthodes pour l’Archéologie. Paris: Armand Colin.

Djindjian, F. 2002. “Pour une théorie générale de la connaissance en archéologie,” in XIV Congrès International UISPP, Liège Septembre 2001. Colloque 1.3. Archeologia e Calcolatori 13: 101–117.

Djindjian, F. 2011. Manuel d’Archéologie. Paris: Armand Colin.

François Djindjian is Associate Professor at the University of Paris 1 Panthéon Sorbonne, President of Commission “Methods and Theory of Archaeology” of the International Union of Prehistoric and Protohistoric Sciences (UISPP) and Vice-President of CIPSH (International Council of Philosophy and Humanistic Studies of UNESCO), prehistorian specialized in the teaching of archaeological methods.


Big Data and Singularities

JUNE 2020
