IDENTIFYING FEATURES OF PAKISTANI LEARNERS WRITING THROUGH MDA AND COH-METRIX

http://dx.doi.org/10.31703/gssr.2021(VI-III).13      Published : Sep 2021
Authored by : Rabia Tabassum , Mahwish Farooq , Muhammad Asim Mahmood

13 Pages : 119‒127

    Abstract

    Learner language has long interested researchers because it displays common features of language in use. Multi-dimensional analysis (MDA), developed by Biber, is one approach that studies language in use empirically and establishes grounds for varieties that are still striving for a place on the linguistic cline (Crossley et al., 2014). The present research explores common patterns of learner language through Coh-Metrix (an online text analysis tool used to assess cohesion, coherence, readability, etc.), studying these features and their functions while partially following the MDA methodology. Following Biber's methodology, factor analysis was conducted and four dimensions were identified, which provided clues to the functional associations of these dimensions. The results show that Pakistani learners' argumentative writing possesses narrative features and is dominated by overlap at the level of vocabulary, syntactic constructions, passage development, and even argumentation. These findings help to establish that Pakistani English has its own identity. The results are useful for linguists as well as teachers, as knowledge of common linguistic and syntactic structures can be assessed easily while keeping in mind the grade level of the students.

    Key Words

    Coh-Metrix, Factor Analysis, Multidimensional Analysis, ICLE, Corpus Linguistics

    Introduction

    Research on Pakistani English has addressed the variety at two levels. At the first level, studies focused on individual linguistic features and examined them descriptively. These studies helped to establish Pakistani English as an independent variety. For example, Baumgardner (1996) studied the distinctive style of Pakistani English by reporting its lexical and grammatical attributes, Talaat (1988) compared Pakistani English with British English and explored its lexical variety, and Mehmood (2009) studied its syntactic and phonological features. More recently, this trend has shifted towards a more empirical and objective approach, known as the corpus-based approach.

    Corpus-based studies, widely regarded as empirical studies, have been adopted as a method for examining the linguistic features of Pakistani English. These studies help linguists to identify the features of different varieties, compare them, and bring out the points of unity and diversity between native and non-native varieties. The ICLE corpus, as a centre for the study of learners' writing, has gained much importance in this regard: 'we can use the ICLE corpora to know about the linguistic features of learners writing, bringing out quantitative statistics about word frequency in the use of words, word categories, syntactic structures and discourse features' (Granger, 1998).

    Literature Review

    A number of studies have explored the linguistic characteristics of texts or registers. These studies focused mainly on linguistic specifications, to which a factor solution (a statistical approach that bundles meaningfully similar features) was then added. The factorial study leads to the investigation of the communicative function of each text. The study of linguistic features across large amounts of data, registers, discourse, etc., with the help of data tagging tools, therefore remains central in this regard (in the present study, the tool is Coh-Metrix).

    A number of studies have been conducted with Coh-Metrix (CM), and these studies validate the utility of CM in various respects. The notable studies range from cohesion and LSA indices (McNamara et al., 2010) to lexical diversity indices and L2 indices (McCarthy & Jarvis, 2010; Crossley, Salsbury, & McNamara, 2009). CM has also helped researchers to establish evidence regarding text levels, their patterns, comprehension grade, and texts suitable for L2 learners. For instance, McCarthy and Lewiet (2006) held that CM is an effective tool for establishing authorship even when authors disguise themselves through shifts in their writing style. In a study of psychology articles, McCarthy (2007) used CM-based LSA indices to show the structural cohesion in the themes of these articles. Similarly, Duran (2006) used CM to study temporal cohesion in history, narrative, and science texts in order to examine the textual domains of these genres. Studies measuring coherence in learners' writing were also conducted (McNamara, Ozuru, Graesser, & Louwerse, 2006), which graded texts from highly coherent to low in coherence, and Dufty (2006) used CM to assess human ratings across grade levels and how they differ for certain social and psychological reasons. Similarly, Duran & McNamara (2006) and McCarthy (2006) assessed the structural organization of various published high school textbooks and recommended grade levels according to comprehension level. Lightman (2007) studied variation in formal/informal and written/spoken texts and found differences and similarities among these genres. Dempsey, McCarthy, & McNamara (2007) and Louwerse (2004) studied gender differences manifested across various texts. Crossley and Louwerse (2007) and Crossley, McCarthy, & McNamara (2007) conducted research to distinguish authentic from modified texts, both of which are necessary for second language learners. All these studies support the utility of CM in investigating the characteristics of texts and show why L2 learners need to select their reading material in advance.

    Coh-Metrix studies are concerned not only with language in use but also with the features and functions of that language. These features and functions were analyzed through CM by a group of researchers who collected more than 1500 student essays and analyzed them using Biber's (1988) MD methodology. Their study mainly aimed to examine the functional parameters revealed through co-occurring linguistic features in a learner corpus. The essays were grouped on the criterion of shared features with shared functions. On this criterion, four dimensions were identified, which related to essay prompt, quality, and grade level. The results showed the functional parameters that affect learners' writing. This research also adds validity to the MDA methodology.

    The present study combines the MD and Coh-Metrix approaches and discusses the dimensions extracted from them. Earlier studies of learner corpora (Hussain, 2015; Abdulaziz, 2017) were limited to exploring dimensions through MDA, whereas the present study employs both MD and Coh-Metrix dimensions to examine not only the co-occurring linguistic features of MD but also Coh-Metrix indices such as cohesion, readability, text easability, and text quality. The identification of these features is necessary because, as Crossley et al. (2014) observe, the Coh-Metrix indices known as functional parameters have implications for writing theory, writing assessment, and writing pedagogy.

    Research Methodology

    The present research uses indices collected one by one for 308 essays from the online Coh-Metrix text analysis tool; the results were saved in an Excel file. The data were then filtered following Biber's (1988) methodology: the obtained indices were first normalized and then standardized, and factor analysis was conducted, omitting indices whose loadings were low or close to 0. Principal component analysis (PCA) with Promax rotation was applied. PCA is used when the underlying structure is undefined; it reduces a large number of indices to small, meaningful sets of variables, i.e., factors or dimensions. These dimensions were then interpreted in terms of writing parameters through a qualitative analysis of each dimension. For the meaningful inclusion of indices, a cutoff of ±.35 for factor loadings was adopted. The sets of features grouped through factor loadings are thus interpreted qualitatively, giving a functional interpretation to the text. An index is included in a factor only if it shows its highest loading on that factor, and it is excluded from the factors on which its loadings are lower. For example, if a feature (here, an index) loads higher on Factor 1 than on Factor 2, it is included only in Factor 1. Further following Biber's methodology, factor scores were calculated by subtracting the mean of the standardized scores of the negative features from that of the positive features. Up to this point, Coh-Metrix generates the results; the next step is to analyze the data statistically following Biber's methodology. To select the prominent factors, a scree plot was used, which is shown below:

    Figure

    The scree plot shows that four factors are prominent. The features retained in this way estimate the possible groups of linguistic features that co-occur and largely share a communicative function. Although the scree plot suggests four factors, Biber proposes two further cutoffs: first, only features with a loading of at least ±.35 are included; second, only factors consisting of more than four features are retained, and factors with fewer are considered non-significant. On this basis, only three factors are worth interpreting. These three factors are interpreted in the Results section.
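    The selection rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually run for the study: the function names, the toy loadings, and the use of the population standard deviation are assumptions, and the rotated loading matrix is taken as given instead of being computed with PCA and Promax rotation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores (mean 0, SD 1) for one index across all essays."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant index carries no information
    return [(v - mu) / sigma for v in values]


def salient_features(loadings, cutoff=0.35, min_features=5):
    """Assign each index to the one factor on which it loads highest.

    loadings: {index_name: [loading on factor 1, loading on factor 2, ...]}
    Indices whose highest absolute loading is below `cutoff` are dropped,
    and factors left with fewer than `min_features` indices are discarded.
    """
    factors = {}
    for name, row in loadings.items():
        best = max(range(len(row)), key=lambda i: abs(row[i]))
        if abs(row[best]) < cutoff:
            continue
        pole = "pos" if row[best] > 0 else "neg"
        factors.setdefault(best + 1, {"pos": [], "neg": []})[pole].append(name)
    return {f: p for f, p in factors.items()
            if len(p["pos"]) + len(p["neg"]) >= min_features}


def factor_score(z_scores, poles):
    """Mean z-score of the positive indices minus that of the negative ones."""
    pos = [z_scores[n] for n in poles["pos"]]
    neg = [z_scores[n] for n in poles["neg"]]
    return (mean(pos) if pos else 0.0) - (mean(neg) if neg else 0.0)


# Toy loadings on two factors (invented values, not the study's results).
toy_loadings = {
    "PCREFz": [0.95, 0.10],
    "CRFCWO1": [0.85, 0.51],   # salient on both, kept only on factor 1
    "LDVOCD": [-0.65, 0.05],
    "PCNARz": [0.20, 0.90],
    "WRDNOUN": [0.02, -0.57],
    "WRDHYPnv": [0.12, -0.30],  # below the cutoff everywhere, dropped
}
kept = salient_features(toy_loadings, min_features=2)
print(kept)

# Standardized index values for one hypothetical essay.
essay_z = {"PCREFz": 1.0, "CRFCWO1": 0.5, "LDVOCD": -1.0}
print(factor_score(essay_z, kept[1]))
```

    With the default `min_features=5`, which mirrors the "more than four features" rule, both toy factors would be discarded; the threshold is lowered here only to keep the example small.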

    Results

    Factor 1. Based on Coh-Metrix Indices

    Positive Component    Values    Negative Component    Values
    PCREFz                .953      SYNMEDlem             -.734
    PCREFp                .943      SYNMEDwrd             -.730
    CRFCWO1               .850      LDMTLD                -.698
    CRFCWOa               .820      LDTTRc                -.687
    CRFNO1                .743      LDVOCD                -.650
    CRFAO1                .740      LDTTRa                -.637
    LSAGN                 .698      SYNMEDpos             -.517
    CRFSO1                .692
    CRFAOa                .674
    CRFNOa                .669
    LSASS1                .651
    CRFSOa                .641
    CRFCWOad              .637
    LSASSp                .550


    Factor 1 is the most powerful dimension, containing 15 positive and eight negative features. Among the positive features, PCREFz and PCREFp have the highest loadings, .922 and .912 respectively. Both indices come from the bank of text easability indices and show a tendency towards higher referential cohesion in the text. A text of higher cohesion has a dominant pattern of overlap among words, ideas, and sentences throughout the whole text. As McNamara et al. (2014) put it: 'A text with higher referential cohesion contains words and ideas that overlap across sentences and the entire text forming explicit threads that connect the text for the reader' (p. 85). It is commonly observed that low cohesion makes texts harder to comprehend because there are few connectives or textual connections that tie ideas together for the reader. Content word overlap between adjacent sentences, content word overlap across all sentences, argument overlap, and stem overlap, represented by CRFCWO1, CRFCWOa, CRFAO1, and CRFSO1, are the other features with high positive loadings on Factor 1. LSA indices are also prominent among the positive features of Factor 1; LSAGN, LSASS1, and LSASSp are important in this regard. These indices also show overlap in sentences, passages, and content words, and the information they capture concerns both given and new information. The learners use the same syntactic and linguistic features, which shows overlap at the level of LSA too.

    On the negative pole, features of lexical diversity are prominent. Lexical diversity refers to the number of unique words in a text in relation to its total number of words. The type-token ratio (TTR) is computed both for content words and for all words. When the number of word types equals the total number of words, every word in the text is different and lexical diversity is at its maximum; such texts are either very short or low in cohesion. In contrast, when lexical diversity is low and cohesion is high, words are repeated across the text. Words therefore need to be used and repeated several times for cohesion to be maintained. TTR is influenced by text length: as the number of words increases, words have more opportunity to be reused, which makes them less unique. The MTLD and vocd measures use estimation algorithms to overcome this problem. SYNMEDpos compares adjacent sentences in terms of parts of speech, so two sentences count as similar when they share a syntactic pattern even if their words differ. SYNMEDwrd and SYNMEDlem, in contrast, take the words and lemmas and their positions into account, so sentences with the same syntax but different words, and hence different meanings, are treated as dissimilar. These indices thus capture different aspects of sentences. Taken together, these features show that Pakistani learners' writing is cohesive, as overlap at the level of content words, sentences, and arguments is a dominant feature, yet new and unique words are scarce. The texts are highly informational, and similar sentences and arguments recur throughout the passages. Thus the right label for this dimension is 'overlapping informative features vs. simple structure.' Examples from the data are quoted below:

     

    <ICLE-PA-GF-004.1>

    Often marriages are settled between two people who have different financial backgrounds. After their marriages, one who is poor or less rich is subjected to the agony of taunting. Especially if a woman is poor, then she had to suffer for the whole of her married life and had to obey her husband and serve him like a servant, or otherwise, she had to be ready for the consequences such as divorce. It was also come to line light from the research that men and their family hope that his wife will bring a large amount of dowry, and if she does not bring it, then she is divorced. Usually, young girls who belong to a rich are royal family are unable to live a married life in serenity due to their lack of flexibility; they do not compromise on any matter, thus resulting in a breakdown of a loving bond of marriages.

    The example shows the overlapping of content words which is a common feature of learner writing.
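    How such content word overlap between adjacent sentences might be quantified is sketched below. This is a simplified stand-in for the referential cohesion indices, not Coh-Metrix's own algorithm: the small stopword list, the regular-expression tokenizer, and the function names are assumptions made for illustration, and Coh-Metrix identifies content words through part-of-speech tagging.

```python
import re

# Tiny illustrative list of function words (an assumption for this sketch).
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "of", "to", "in", "and",
    "or", "who", "she", "her", "his", "their", "then", "if", "had", "for",
}


def content_words(sentence):
    """Lower-cased word types of a sentence with function words removed."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def adjacent_overlap(sentences):
    """Mean share of content-word types shared by adjacent sentence pairs."""
    scores = []
    for first, second in zip(sentences, sentences[1:]):
        a, b = content_words(first), content_words(second)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Two sentences from the excerpt above share the content word "marriages".
pair = [
    "Often marriages are settled between two people who have different "
    "financial backgrounds.",
    "After their marriages, one who is poor or less rich is subjected to "
    "the agony of taunting.",
]
print(adjacent_overlap(pair))
```

    Higher values indicate that neighbouring sentences reuse more of the same content words, which is the pattern the positive pole of Factor 1 reflects.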

    The other example is:

     

    <ICLE-PA-GF-0080.1>

    Terrorists and Mujahideen are two different categories of people. Terrorists are those who create terror and dread through their destructive activities. While Mujahideen are called Islamic soldiers. They fight for Allah against injustice or for Islam when anti-Muslim powers dominate Muslims and forbid them to lead their lives according to Islam. They don't fight for personal or worldly benefits, while terrorists fight for worldly benefits or purposes. Islam is the religion of peace. It stresses brotherhood, sacrifice, and welfare. It forbids to frighten someone. The killing of someone is inhuman.
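    The type-token ratio discussed for the negative pole of Factor 1, and its sensitivity to text length, can be illustrated with the first two sentences of the excerpt above. The sketch assumes a simple regular-expression tokenizer and a hypothetical function name; it is not the way Coh-Metrix computes LDTTRa or LDTTRc.

```python
import re


def type_token_ratio(text):
    """Number of distinct word forms divided by the total number of words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


first = "Terrorists and Mujahideen are two different categories of people."
second = ("Terrorists are those who create terror and dread through their "
          "destructive activities.")

# One sentence: every word is new, so the ratio is at its maximum of 1.0.
print(type_token_ratio(first))

# Two sentences: "terrorists", "are", and "and" are repeated, so the ratio drops.
print(type_token_ratio(first + " " + second))
```

    The drop in the second value shows why a longer and more cohesive text, which necessarily repeats words, scores lower on TTR.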


     

    Factor 2. Based on Coh-Metrix Indices

    Positive Component    Values    Negative Component    Values
    PCNARz                .901      DESWLsy               -.828
    PCNARp                .900      DESWLlt               -.810
    RDF                   .802      DESWLltd              -.759
    WRDPRO                .718      DESWLsyd              -.751
    WRDFRQc               .712      WRDAOAc               -.634
    RDL2                  .635      SIENNA                -.571
    WRDFAMc               .635      WRDNOUN               -.565
    WRDFRQmc              .556      WRDADJ                -.493
    GREG                  .538
    WRDPRP3s              .419


    In Factor 2, the prominent features with positive loadings come from the bank of psychological rating indices. These indices give information on age of acquisition, familiarity, imageability, concreteness, etc. CM draws on two databases for word information. The first is the MRC Psycholinguistic Database, which provides ratings for words on several psycholinguistic dimensions. For example, age-of-acquisition measures estimate the period at which a word first enters a child's vocabulary, while another scale rates adults' content-word vocabulary on a scale of 1 to 7 points. Higher scores represent easier processing. Ratings on the 1-7 scale were subsequently multiplied by 100 and rounded to the nearest integer so that all ratings could be presented as integers on a scale from 100 to 700. Other measures, such as familiarity, concreteness, and imageability, were obtained by merging the Paivio, Yuille, and Madigan (1968) norms. The second source is WordNet (Fellbaum, 1998; Miller, 1990), from which CM estimates polysemy and hypernymy.

    The dominant feature, PCNARz, relates to narrativity: the description comes close to narration. The text reads like telling a story, sharing events and information about places, characters, and things. It resembles everyday oral conversation, and the vocabulary is highly familiar, reflecting world knowledge.

    On the negative pole, descriptive indices are prominent. These indices provide a detailed description of the text, its nature, and its complexity. For instance, longer paragraphs and sentences may indicate more words and more complex syntax, which makes such sentences difficult to process. Similarly, a large standard deviation of sentence length indicates that the text varies greatly in sentence length, with some sentences very long and others very short, which an author may do deliberately to present scene descriptions and characters' utterances, respectively (McNamara et al., 2014).

    Thus the right label for this dimension is 'narrative vs descriptive concerns.'
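    The sentence-length description above can be made concrete with a small sketch that computes the mean and standard deviation of sentence length in words. The splitting rule (on '.', '!' and '?'), the function names, and the use of the population standard deviation are assumptions; the descriptive indices reported by Coh-Metrix may be computed differently.

```python
import re
from statistics import mean, pstdev


def sentence_lengths(text):
    """Word counts of the sentences in a text (naive punctuation split)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def length_profile(text):
    """Return (mean, standard deviation) of sentence length in words."""
    lengths = sentence_lengths(text)
    if not lengths:
        return 0.0, 0.0
    return mean(lengths), pstdev(lengths)


# Hypothetical text with one short and one longer sentence.
sample = "Europe is large. It is bordered by oceans and seas on three sides."
print(sentence_lengths(sample))
print(length_profile(sample))
```

    A naive split of this kind miscounts abbreviations such as 'U.N', so a real analysis would need a proper sentence tokenizer. A large standard deviation relative to the mean signals the mix of very long and very short sentences described above.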

     

    <ICLE-PA-VL-0001.1>

    The turning point of American policy was 9/11. When 2000 Americans were killed in this attack, Americans claimed that Afghan was responsible for that and started a war against Afghanistan. Millions were killed in this war. The prisoners of war were killed in this and treated inhumanly and sent to Abu Garib Jail. Americans put them in cagescage-like animals and torture them by letting dogs lie on them.

    They were not given food and other basic needs. This true violation of the U.N Charter. Which was recognized at the Geneva conference. The media explored the cruelties of America in this regard. The American diplomacy to fight against terrorist was exposed that how America falsely got the support of the world. But in reality, the USA deceived the whole world for killing the innocent people.

    The other example shows the description of a place.

     

    <ICLE-PA-AO-0011.1>

    Europe is one of the world's seventh continents, Europe is generally divided from Asia to its east by the water, divided by the Ural Mountains, the Ural River, the Caspian Sea, the Caucasus region, and the Black Sea to the southeast. Europe is bordered by the Arctic Ocean and other bodies of water to the north, the Atlantic Ocean to the west, the Mediterranean sea to the south, and the black sea to the southeast.


     

    Factor 3. Based on Coh-Metrix Indices

    Positive Component    Values    Negative Component    Values
    PCSYNz                .914      DESS                  -.934
    PCSYNp                .905      RDFKGL                -.821
    SYNSTRUTt             .833      DESSLd                -.767
    SYNSTRUTa             .804      SYNLE                 -.620
    SMCAUSv               .665
    DEC                   .647
    SMINTEp               .568
    CRFCWO1               .513
    SMCAUSvp              .490


    Factor 3 is based on eight features, all showing positive results with no negative loadings. The features with positive loadings belong to the text easability measures. Indices of syntactic simplicity, such as PCSYNz and PCSYNp, point to a simple syntactic structure with fewer words and simple, familiar syntactic patterns. Such structures are easy to process. The other features, PCCNCp and PCCNCz, belong to word concreteness and give information about content words that are more common, concrete, and non-abstract. Such vocabulary generates mental images and is less ambiguous and more meaningful, so the information is easy to process and comprehend. Therefore, the right label for this dimension is 'concrete factual information.'

     

    <ICLE-PA-AO-0024.1>

    The modern age is not hunting animals for personal hunt and pleasure. In ancient times Kings and their companions used to hunt animals not for their larder but for their personal pleasure. It was very cruel and inhuman treatment toward the beauty of nature. They left many animals to rote often hunt.

     

    <ICLE-PA-AO-0015.1>

    If we take the example of domestic donkey, it is treated by human too much worse, a ton of weight is put on the back of this innocent creature, and the master take much more work from him., which is much more from his capacity and tendency.

    In the circus, many animals are badly treated by humans, they are change their habitat, location and snatch their native and natural surroundings, and here they badly treated by a human.

     

    <ICLE-PA-VL-0004.1>

    At 3.79 million sq. miles and with over 309 million people, the America is the third-largest country by total area and population. America is the world largest economy with a GDP of $ 14.3 trillion- with the literacy rate of 99% America has one of the finest systems of education. In the field of sports, America has achieved many landmarks, and they have the highest number of medals any country won in the Olympics.

    Although the global slow down in economic growth, America is still the highest funding nation in the world. America has one of the largest Army in the world, still operating in different parts of the world, and have taken part in world war 1 and 2. Still, due to many set backs, America is one of the strongest countries in the world. 

    Conclusion and Findings

    Pakistani learners' argumentative writing is largely focused on sharing information. Even when presenting arguments, the writers do not take a clear stance; instead, they use an indirect style. Students need special training for argumentative essays, as argumentative topics are more cognitive, complex, and interactive. Students perform better in narrative and descriptive essays than in argumentative ones. Their text reads like telling a story, sharing events and information about places, characters, and things. It resembles everyday oral conversation, and the vocabulary is highly familiar, reflecting world knowledge. Learners' writing is cohesive, as overlap at the level of content words, sentences, and arguments is a dominant feature, yet new and unique words are scarce. The learners use the same syntactic and linguistic features, which produces massive overlap in the text. They write highly cohesive text but lack variety of expression. Instead of presenting well-arranged and well-designed arguments, they prefer to share information with interactive features.

    Note: This paper is part of the researcher's Ph.D. dissertation.

References

  • Baumgardner, R. J. (1987). Utilizing Pakistani newspaper English to teach grammar. World Englishes, 6(3), 241-252.
  • Baumgardner, R. J. (Ed.). (1996). South Asian English: Structure, use, and users. Urbana: University of Illinois Press.
  • Biber, D., & Finegan, E. (1994). Multi-dimensional analysis of authors' style: Some case studies from the eighteenth century. In D. Ross & D. Brink (Eds.), Research in humanities computing, III (pp. 3-17).
  • Biber, D. (2004b). Modal use across registers and time. In A. Curzan & K. Emmons (Eds.), Studies in the history of the English language II: Unfolding conversations (pp. 189-216). Berlin: Mouton de Gruyter.
  • Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
  • Biber, D. (1995). On the role of computational, statistical, and interpretive techniques in multi-dimensional analysis of register variation. Text, 15(3), 314-370.
  • Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
  • Biber, D. (2004a). Conversation text types: A multi-dimensional analysis. In G. Purnelle, C. Fairon, & A. Dister (Eds.), Le poids des mots: Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data (pp. 15-34). Louvain: Presses universitaires de Louvain.
  • Biber, D., Connor, U., & Upton, T. A. (2007). Discourse on the move: Using corpus analysis to describe discourse structure. Amsterdam: John Benjamins.
  • Biber, D., Conrad, S., & Reppen, R. (2002). Speaking and writing in the university: A multi-dimensional comparison. TESOL Quarterly, 36(1), 9-48.
  • Crossley, S., Salsbury, T., Titak, A., & McNamara, D. (2014). Frequency effects and second language lexical acquisition: Word types, word tokens, and word production. International Journal of Corpus Linguistics, 19(3), 301-332.
  • Crowhurst. (1990). How many millions? The statistics of English today. English Today, 1(1), 7-9.
  • Friginal, E. (2012). The discourse of outsourced call centres: A corpus-based, multi-dimensional analysis.
  • Geisler, C. (2002). Investigating register variation in nineteenth-century English: A multi-dimensional comparison. In R. Reppen, S. M. Fitzmaurice, & D. Biber (Eds.), Using corpora to explore linguistic variation (pp. 249-271). Amsterdam: John Benjamins.
  • Graesser, A. C., & D'Mello, S. K. (2012). Moment-to-moment emotions during reading. Reading Teacher, 66, 238-242.

Cite this article

    APA : Tabassum, R., Farooq, M., & Mahmood, M. A. (2021). Identifying Features of Pakistani Learners Writing Through MDA and Coh-Metrix. Global Social Sciences Review, VI(III), 119‒127. https://doi.org/10.31703/gssr.2021(VI-III).13