Institut für Dokumentologie und Editorik

Genre Analysis and Corpus Design: Nineteenth-Century Spanish-American Novels (1830–1910)

 

5 Conclusion

A central question raised in the introduction was how literary genres can be conceived theoretically as categories that capture the common aspects humans see in literary works of the same genre. Moreover, a theoretical conception of literary genres should be able to grasp the common features of works belonging to a genre on a textual level. In the context of computational literary studies, this applies not only to the analysis of textual properties of genres by literary scholars but also to their analysis by computers. To that end, concepts of genre stemming from literary theory were discussed in the theoretical part of this dissertation and related to the aims and procedures of digital genre stylistics.

Regarding the ontological status of genres, it was argued that they are best understood as communicative norms or conventions that influence the stylistic form of the literary texts participating in them. Surface cues of the texts that are related to genre labels can be interpreted as traces, or “normative facts” in the terms of Hempfer (1973), left by the genre conventions. In this way, even a digital stylistic text analysis can capture elements of communicative phenomena that are, at least to a certain degree, determined by factors lying outside the syntactic and semantic level of literary texts.

As a second aspect of the theoretical part of this dissertation, the usefulness of semiotic models of genres for the modeling of generic terms was highlighted. These models make it possible to organize the conventional signals conveyed by genre labels on different linguistic and contextual levels so that stylistic analyses can focus on selected semiotic levels. Semiotic levels include the thematic one, which is primarily addressed by, for instance, sentimental or social novels, and the level concerned with the relationship of the texts to reality, for which historical, science fiction, or fantastic novels are examples. These are only two of the possible semiotic levels to which generic names can refer. Most genre labels do not refer to one level exclusively, but it is helpful to decide on a specific dimension of genre discourse to examine so that genres or subgenres are compared on a similar level. Deciding on selected levels of analysis is especially relevant for digital genre stylistics because corpus studies are usually performed in a contrastive setting, either by comparing one subgenre to a larger corpus covering a superordinate major genre or by opposing individual genres or subgenres directly. Analyzing the discursive levels that genre labels refer to can also provide useful hints about the textual traits that could be relevant for them and can lead to hypotheses about the stylistic characteristics of the texts participating in the genre in question. For thematic subgenres, for instance, topic features can be a good choice. However, it must be assumed that most genres are defined on several textual levels at once. In any case, in digital stylistic analysis all textual traits must be captured on the textual surface, and they must therefore be formalized and reduced to text style, as is done when topics are used as indicators of thematic developments. As a result, other aspects of surface style may enter the feature space. In the case of topics, for example, not only thematic elements become visible but also aspects of the setting, plot, or specific literary motifs. This shows that a connection between genre conventions, textual traits of genres, and surface style can be assumed and deduced, or induced, but that this connection is complex and still needs to be investigated further in theoretical and empirical terms.

A third finding of the theoretical part was that, for digital genre stylistics, it is advantageous to have separate concepts of literary text types, conventional genres, and textual genres. This enables not only theoretically driven genre analyses that start from certain concepts of textual genres but also exploratory ones concerned with groupings of texts that emerge from stylistic features. This is important because the relationship between literary definitions of genres and the surface style captured in computational analyses cannot always be clarified directly. Theoretically, such a relationship is most often hypothesis-driven, and practically, many tools that formalize literary concepts are still missing. In literary genre theory, there are proposals for how to distinguish text types from genres, but they usually assume a relationship of dependence between the two. Here it was proposed to separate the levels of literary text types and literary conventional genres completely in order to be able to define and find intersections of both as a result.

In the part on the different literary theoretical concepts of genres as categories (classes, prototypes, and families), it was laid out how these concepts can be applied fruitfully by using statistical categorization methods. Classification, clustering, and network analysis constitute different options for analyzing literary text types and relating them to conventional genres. It was argued that there is no exclusive relationship between each of these text categorization methods and the different literary theoretical concepts of genre categories. Instead, each method covers different aspects of several types of categories. Statistical classification, for instance, can also be used to determine, to a certain degree, the prototypicality of literary texts as instances of genres. Networks, on the other hand, can be used not only for family resemblance analyses but also allow clusters to be formed, which then represent delimited groups.

In the chapter on style, a definition of literary genre style was formulated that applies the concept of style proposed by Herrmann, Schöch, and van Dalen-Oskam (2015) to the field of literary genre analysis. This definition includes the distinction between communicatively determined, conventional genres and formally determined literary text types. Furthermore, a distinction is made between higher-level stylistic traits and low-level surface cues, and it is specified which types of linguistic features are considered stylistic cues. In this way, the relationships between literary genres and their stylistic characteristics on the one hand and linguistic features of the surface of texts on the other hand are differentiated and made clear.

In the last part of the Concepts chapter, selected thematic subgenres and literary currents of the nineteenth-century Spanish-American novel were presented with a view to their genre-stylistic characteristics. In this way, the textual analysis of these subgenres conducted in the Analysis chapter of the dissertation was prepared and related to established literary-historical knowledge.

The various considerations of the theory section contribute to relating established theoretical concepts of literary and linguistic studies to computer-based analyses of literary genres. On the one hand, this should promote a theoretical foundation for the textual analyses of digital genre stylistics. On the other hand, it should also be possible to relate the results obtained with digital genre stylistic analyses to previous results obtained with classical methods. These results should not be detached from previous research in literary history but should enter into and remain in dialogue with it. Recent developments in the digital humanities show that the need for a stronger theoretical foundation is recognized. For example, a working group on digital humanities theory has been established in the DHd association (AG Digital Humanities Theorie 2023). In relation to the results of this dissertation, the next step will be to see to what extent the concepts of digital genre stylistics suggested here can be useful not only for the analyses in this specific thesis but also in other application studies. Furthermore, the conceptual considerations of genre categories and computational categorization can still be extended and need to be supported by further empirical studies.

In the empirical part of this thesis, the methodology for creating a digital bibliography serving as a sampling frame and a text corpus aimed at the analysis of nineteenth-century Spanish-American novels with regard to subgenres was discussed and presented in detail. The approaches taken were outlined in practical terms and by example, addressing the specific questions, challenges, and solutions found for the bibliography and corpus at hand. Still, the procedures followed were also the result of general considerations and discussions in the CLiGS project. They can be taken as a proposal for how to prepare a digital corpus of literary texts for genre analysis. The findings of these reflections in the project led to the creation of the “textbox” as an exemplary set of text collections for digital literary analysis (Schöch et al. 2019). For this dissertation in particular, the considerations and implementation of data modeling in the digital bibliography on novels, their subgenres, and the editions in which they were published are also of a general character and go beyond what was worked out on text corpora in the project group. Independent decisions were also made for the digital text corpus regarding its composition and the modeling of the metadata and text structure. These resulted mainly from the subject matter of the nineteenth-century Spanish-American novel.

One question was, for example, how to define the limits of the corpus in generic terms: how can it be determined in several hundred cases whether a literary work can be considered a novel without reading every single work? Here, the decision was to start from a very general formal definition of the novel and to check the fulfillment of its requirements partly through quantitative analysis and partly by evaluating available metadata and manually checking the information in paratexts. The selection of subgenres, on the other hand, was not predetermined, so that the bibliography Bib-ACMé and the corpus Conha19 constitute general resources covering the Spanish-American nineteenth-century novel as it is represented by works from three countries: Argentina, Cuba, and Mexico. It is hoped that other researchers will use this corpus in the future for digital text analysis, possibly with a different focus than the one pursued here. For example, the corpus could also be used for the study of authorship, for an analysis of chronological developments within the nineteenth century, or for the analysis of specific motifs or themes in the texts. Individual texts from Conha19 encoded in TEI could also serve as a starting point for the development of digital critical editions of selected works. Finally, a desirable future development is to add novels from other Spanish-American countries to the corpus, or texts from other major genres (for example, drama), in order to make cross-genre analyses possible.

As the analysis of subgenres was the goal of this thesis, a particular focus in the corpus-building process was on the information about the subgenres of the novels, which was collected by evaluating different literary-historical sources as well as explicit and implicit historical genre labels and signals. All of this material represents the conventional level of the novels’ subgenres. It was organized by encoding it in TEI according to an empirically induced model of discursive levels of subgenres, including the following levels:

  • thematic subgenre labels
  • labels referring to literary currents
  • labels related to the cultural-geographical and linguistic identity
  • labels pointing to the relationship between the novel and reality
  • labels concerned with the mode that a novel is narrated in
  • labels reflecting the author's or narrator's intention or attitude
  • labels that refer to the medium that a novel uses and the mode that it is represented in linguistically or narratively
An interesting question is whether other extensive empirical studies of genre signals and genre assignments will arrive at similar categorizations, i.e., whether a general trend of the types of genre signals and, in particular, their frequencies can be identified, or whether the results obtained here are specific to Spanish-American novels. Only further corpus-based studies of historical and literary critical genre assignments will be able to show this.
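The discursive levels listed above can be pictured as a small data model. The following Python sketch is not the TEI encoding used for Bib-ACMé and Conha19; it is only an in-memory illustration, and all names (`Level`, `GenreLabel`, `level_frequencies`, the `source` field and its values) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    # The seven discursive levels listed above; the member names are illustrative.
    THEME = "theme"
    CURRENT = "literary current"
    IDENTITY = "cultural-geographical and linguistic identity"
    REALITY = "relationship to reality"
    MODE = "narrative mode"
    ATTITUDE = "intention or attitude"
    MEDIUM = "medium and mode of representation"


@dataclass(frozen=True)
class GenreLabel:
    text: str     # the label as found, e.g. in a subtitle
    level: Level  # discursive level the label refers to
    source: str   # hypothetical provenance field: "historical" or "critical"


def level_frequencies(labels):
    """Count how many labels fall on each discursive level."""
    return Counter(label.level for label in labels)


# Toy labels for one hypothetical novel (not data from the corpus)
labels = [
    GenreLabel("novela histórica", Level.THEME, "historical"),
    GenreLabel("novela romántica", Level.CURRENT, "critical"),
    GenreLabel("novela mexicana", Level.IDENTITY, "historical"),
]
freq = level_frequencies(labels)
```

Counting labels per level in this way is the kind of tally that would allow different corpus-based studies of genre assignments to be compared.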

The information on subgenres was evaluated in the metadata analysis part to see which subgenres were frequent enough to be analyzed on a textual level with quantitative methods. It was found that several of the discursive levels initially proposed by Raible (1980) and Schaeffer (1983) are present in literary critical and also in explicit historical subgenre labels but that only a few of them are quantitatively relevant. The frequent ones were found in particular on the thematic level (“novela histórica”, “novela de costumbres”, “novela sentimental”, etc.) and on the contextual levels of literary current (“novela romántica”, “novela realista”, “novela naturalista”, and so on) and of cultural, geographical, or linguistic identity (as expressed through the labels “novela argentina”, “novela mexicana”, “novela cubana”, or “novela original”). For the text analysis part, it was decided to focus on the thematic subgenres and the literary currents. Labels related to cultural, geographical, or linguistic identity have been evaluated in a research article outside the scope of this dissertation (Henny-Krahmer 2022). Thus, not all categories of genre labels were evaluated in this thesis. The great variety of subgenre labels found for the novels in Bib-ACMé and Conha19 shows how multifaceted the novel is as a genre, and further analyses can proceed from the material encoded here. Accordingly, quantitative digital text analysis is only one of several possible approaches to analyzing the subgenres of the nineteenth-century Spanish-American novel. The large pool of less frequent subgenres and generic terms can only be analyzed meaningfully with qualitative methods.

Besides assessing different levels and perspectives on subgenres of the novels, the metadata analysis also served to check how well the corpus represents, in quantitative terms, the material collected in the more extensive bibliography. The relationship between the two resources was found to be approximately proportional in most but not all aspects. For example, the corpus contains a higher share of works written by well-known, high-prestige authors than the bibliography. This is not surprising because these are the works that have primarily been digitized. Nevertheless, this aspect of the corpus can be improved. It is desirable to enlarge the corpus in the future and include even more works by lesser-known authors than it already contains. That would make the corpus more representative of the whole production of Argentine, Cuban, and Mexican nineteenth-century novels. As the corpus is published with this dissertation, it can be hoped that it provides a starting point for building a more extensive collection of nineteenth-century Spanish-American novels.
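A comparison of this kind can be sketched as a difference of category shares between the two resources. The Python example below is a minimal illustration under stated assumptions: the function names are hypothetical, and the toy figures are invented and are not counts from Bib-ACMé or Conha19.

```python
from collections import Counter


def proportions(values):
    """Relative frequency of each category in a list of metadata values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def proportion_gaps(bibliography, corpus):
    """Corpus share minus bibliography share, per category.

    Positive values mean the category is over-represented in the corpus.
    """
    bib, cor = proportions(bibliography), proportions(corpus)
    categories = sorted(set(bib) | set(cor))
    return {c: cor.get(c, 0.0) - bib.get(c, 0.0) for c in categories}


# Invented toy data: author prestige per work (not figures from the thesis)
bibliography = ["high"] * 20 + ["low"] * 80
corpus = ["high"] * 10 + ["low"] * 15
gaps = proportion_gaps(bibliography, corpus)
```

Gaps close to zero for a metadata field would correspond to the approximately proportional relationship described above, while a clearly positive gap for high-prestige authors would reflect their over-representation in the corpus.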

In the text analysis part, statistical classification was used as the first method for categorizing the novels by subgenre. To this end, one primary subgenre label was determined for each novel, and the classification was done for thematic subgenres and literary currents. Both subgenres and currents can be classified well on the basis of most frequent words (MFW) and topic features. Experiments were run with three types of classifiers (k-nearest neighbors, Support Vector Machine, and Random Forest) and several different feature sets. In the case of MFW, different token units, numbers of words, and normalization techniques were used. For topics, different numbers of topics and optimization parameters for the modeling were chosen. The results showed that both subgenres and currents are textually coherent to degrees of 70 to 90 %, depending on the type of features and the subgenre. Textual coherence here refers to the degree to which the communicatively determined subgenre assignments of the novels coincide with their class assignments as determined by text classification, measured as classification accuracy.
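The notion of textual coherence as classification accuracy can be illustrated with a small, dependency-free Python sketch. It is not the setup used in the thesis: the whitespace tokenization, relative-frequency normalization, cosine distance, leave-one-out evaluation, and the toy texts are all assumptions chosen for brevity, and only a k-nearest-neighbors classifier is shown.

```python
import math
from collections import Counter


def mfw_vectors(texts, n_words):
    """Relative frequencies of the n most frequent words of the whole collection."""
    tokenized = [text.lower().split() for text in texts]
    totals = Counter(token for tokens in tokenized for token in tokens)
    vocabulary = [word for word, _ in totals.most_common(n_words)]
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] / len(tokens) for word in vocabulary])
    return vocabulary, vectors


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def knn_predict(train_vectors, train_labels, vector, k):
    """Majority label among the k nearest training vectors."""
    nearest = sorted(range(len(train_vectors)),
                     key=lambda i: cosine_distance(train_vectors[i], vector))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(vectors, labels, k=1):
    """Share of texts whose held-out prediction matches the conventional label."""
    hits = 0
    for i, (vector, label) in enumerate(zip(vectors, labels)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_vectors, rest_labels, vector, k) == label
    return hits / len(vectors)


# Invented toy "novels" with hypothetical primary subgenre labels
texts = [
    "el rey y la guerra y el rey",
    "la guerra del rey y la batalla",
    "el amor y el llanto y el amor",
    "el llanto del amor y el corazón",
]
labels = ["histórica", "histórica", "sentimental", "sentimental"]
vocabulary, vectors = mfw_vectors(texts, n_words=10)
accuracy = leave_one_out_accuracy(vectors, labels, k=1)
```

In this reading, an accuracy of 0.8 would mean that the conventional label and the predicted class coincide for 80 % of the novels.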

In general, the differences between the two feature types were minor. In the case of thematic subgenres, there was almost no difference in accuracy between most frequent words and topics, and for literary currents, most frequent words worked slightly better than topics.

The classification results for the literary currents were better than for the thematic subgenres. This was not expected at the outset because the currents are broader phenomena which, at least in the case of the realist and romantic novels, include several thematic subgenres. While the realist and naturalistic novels were mainly published in the last two decades of the nineteenth century and the first decade of the twentieth century, the romantic novel was present from the early decades of the nineteenth century up to the last ones, with a dominant role up to the 1870s. Diachronically, too, the romantic novels in particular thus constitute a general phenomenon. The classification results suggest that the literary currents nonetheless represent a level of the novels that can be captured better stylistically than the thematic subgenres.

That literary currents can be recognized better stylistically than thematic subgenres was also visible in the visualizations of the numbers of correct and incorrect classifications for individual novels of the corpus. For the thematic subgenres, there were cases of regular false negatives and false positives, that is, novels that carry the respective genre label but were classified as another subgenre in more than 70 % of the cases, or novels that do not carry the label but were recognized as members of the subgenre in almost every case. Such novels were not present in the results for the literary currents. The same could be observed for the “middle range” of novels that are misclassified in around 30 to 70 % of the cases, which occurred for the thematic subgenres but not for the literary currents. This suggests that the levels of genre convention and text type are more congruent for the literary currents than for the thematic subgenres. When evaluating these results, it has to be kept in mind that the labels for literary currents were mainly collected from literary-historical sources. In contrast, the thematic labels are also present as explicit historical labels on the texts, so discrepancies between convention and text type are more probable for them than for critically established labels. Nevertheless, it may also be the case that there is less consensus on the major theme of novels than on their dominant style in terms of literary currents.
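The per-novel evaluation described here can be sketched as a misclassification rate over repeated classification runs, sorted into bands. The following Python example is an illustration only: the function names, band names, and toy predictions are hypothetical, the cut-offs follow the 30 % and 70 % ranges mentioned in the text, and only novels that carry the label (the false-negative side) are covered.

```python
def misclassification_rate(true_label, predictions):
    """Share of repeated classification runs in which a novel got a wrong label."""
    wrong = sum(1 for predicted in predictions if predicted != true_label)
    return wrong / len(predictions)


def band(rate, low=0.3, high=0.7):
    """Assign a novel to one of three bands by its misclassification rate."""
    if rate > high:
        return "regularly misclassified"
    if rate >= low:
        return "middle range"
    return "mostly correct"


# Invented toy results: conventional label and predictions from ten runs
runs = {
    "novel_1": ("histórica", ["histórica"] * 9 + ["sentimental"]),
    "novel_2": ("sentimental", ["histórica"] * 8 + ["sentimental"] * 2),
    "novel_3": ("histórica", ["histórica"] * 5 + ["sentimental"] * 5),
}
bands = {
    novel: band(misclassification_rate(label, predictions))
    for novel, (label, predictions) in runs.items()
}
```

Under this scheme, a set of results with no novels in the upper two bands would correspond to the pattern reported for the literary currents.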

The text classification in this dissertation has not yet taken into account the multiple subgenre assignments recorded in Bib-ACMé and Conha19. For the purposes of this thesis, the most plausible primary subgenre label was selected as the classification target in order to first establish a classification baseline in a classical setting. This has not been done before for Spanish-American novels. The obvious next step is to apply multi-label classification procedures to include secondary and further subgenre assignments. Especially in the case of thematic subgenres, there are very often multiple classifications on a communicative level. These occur in the historical subtitles of the novels (e.g., “novela histórica de costumbres”) and especially when literary scholars describe the novels in terms of thematic subgenres. It can therefore be assumed that such multiple attributions are also textually and stylistically relevant. Another way to further map the complexity of the novels in terms of their subgenres would be to include chapter structures or to divide the novels into text sections and classify these individually. The corpus Conha19 is well prepared for both approaches.

As an alternative approach to text categorization, a family resemblance network analysis was conducted, focusing on the internal structure of individual subgenres and in particular on the historical novel. It showed that specific subtypes of historical novels are obscured in the classification approach because they do not represent the quantitatively dominant type of romantic historical novels. However, these subtypes become visible as a family when the corpus of historical novels is partitioned into network communities based on individual similarity relationships between the texts. When two subgenres were combined in the network, it became clear that several factors other than genre influence the style of the novels, for instance, the period of publication or the narrative perspective. These results show that statistical classification is a powerful method that yields very good results but also occludes internal subdivisions and stylistic differences among the novels that are part of the subgenres. For exploratory and refining analyses, a family resemblance network analysis is therefore a good alternative.
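The idea of grouping texts through individual similarity relationships can be sketched with a nearest-neighbor graph. The Python example below is a simplified illustration, not the network analysis carried out in the thesis: the feature vectors are invented, cosine similarity and the number of neighbors are assumptions, and connected components serve only as a crude stand-in for a proper community detection algorithm.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def neighbor_graph(vectors, n_neighbors):
    """Undirected graph linking each text to its n most similar texts."""
    graph = {name: set() for name in vectors}
    for name, vector in vectors.items():
        others = sorted((other for other in vectors if other != name),
                        key=lambda other: cosine_similarity(vector, vectors[other]),
                        reverse=True)
        for other in others[:n_neighbors]:
            graph[name].add(other)
            graph[other].add(name)
    return graph


def components(graph):
    """Connected components, used here as a crude stand-in for communities."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Invented feature vectors for four hypothetical historical novels
vectors = {
    "novel_a": [0.9, 0.1, 0.0],
    "novel_b": [0.8, 0.2, 0.0],
    "novel_c": [0.0, 0.1, 0.9],
    "novel_d": [0.1, 0.0, 0.8],
}
graph = neighbor_graph(vectors, n_neighbors=1)
groups = components(graph)
```

Because each text is linked only to the texts it most resembles, small groups can emerge that a classifier trained on the dominant type of a subgenre would not single out.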

Returning to the observation made at the outset that categorization is a basic human need, the analyses here have shown that human assignments of texts to genres, understood as communicative phenomena, function differently from computer-assisted stylistic genre classification. The classifiers are powerful tools that can recognize in most cases which subgenres the novels belong to. However, they do not (yet) understand how the text style is related to the perceived conventional genre of the texts. This led to errors and unexpected results in some individual cases. The classifiers are stricter in interpreting the textual surface structures. Because they differ in this way from a human author or reader, analyzing the mistakes that the statistical models make offers the opportunity to understand better how genres function as communicative phenomena. This understanding is supported by alternative categorization methods, such as the family resemblance approach presented in this thesis. Research in digital genre stylistics and computational literary studies needs to continue to engage with literary theory, literary history, and computational approaches to develop suitable methods of its own for its objects of study.

In the future, the text analysis can be developed further by using other features, especially those that can be related more directly to literary concepts, such as character constellations or literary space. In many cases, however, methods for capturing such features computationally on the textual surface have yet to be developed. Another enhancement would be not to analyze the novels as a whole, as was done here, but to evaluate the results for text segments, chapters, or different parts of the narration, such as direct speech versus narrator text. The textual characteristics of genres may also be defined in connection with these substructures of the texts and not only for the entire texts. Furthermore, classification and network analysis are just two of the available computational text categorization methods that could be employed for genre analysis. The concept of genres as prototypical structures in particular could be approached with clustering methods, for example, which has not been done in the scope of this dissertation. Moreover, there should of course be more empirical studies based on other literatures and corpora in order to identify broader trends in the history of genres and to reach more general conclusions than was possible in this work on the nineteenth-century Spanish-American novel. Thus, future research directions in digital genre stylistics could be outlined here but have yet to be pursued. Hopefully, the spirit of Open Science will be followed in future genre-stylistic research, as has been done in this work, by making the texts, data, and scripts openly available, as far as legally possible.

References

AG Digital Humanities Theorie. 2023. Digital Humanities Theorie. Theorie und Theoriebildung in den digitalen Geisteswissenschaften. https://web.archive.org/web/20230208165032/https://dhtheorien.hypotheses.org/.

Hempfer, Klaus W. 1973. Gattungstheorie. Information und Synthese. München: Fink.

Henny-Krahmer, Ulrike. 2022. “Novelas originales y americanas. A Digital Analysis of References to Identity in Subtitles of Spanish American 19th Century Novels.” apropos [Perspektiven auf die Romania] 9: 14–36. https://doi.org/10.15460/apropos.9.1893.

Herrmann, J. Berenike, Christof Schöch, and Karina van Dalen-Oskam. 2015. “Revisiting Style, a Key Concept in Literary Studies.” Journal of Literary Theory 9 (1): 25–52. https://doi.org/10.1515/jlt-2015-0003.

Raible, Wolfgang. 1980. “Was sind Gattungen? Eine Antwort aus semiotischer und textlinguistischer Sicht.” Poetica 12: 320–349.

Schaeffer, Jean-Marie. 1983. Qu’est-ce qu’un genre littéraire? Paris: Seuil.

Schöch, Christof, José Calvo Tello, Ulrike Henny-Krahmer, and Stefanie Popp. 2019. “The CLiGS Textbox: Building and Using Collections of Literary Texts in Romance Languages Encoded in TEI XML.” Journal of the Text Encoding Initiative. Rolling Issue. https://doi.org/10.4000/jtei.2085.