Abstract
Learner language has long interested researchers because it displays common features of language in use. Multi-dimensional analysis (MDA), developed by Biber, is one approach that studies language in use empirically and establishes grounds for varieties that are still striving for a place on the linguistic cline (Crossley et al., 2014). The present research explores common patterns of learner language using Coh-Metrix (an online text-analysis tool used to assess cohesion, coherence, readability level, etc.), studying these features and their respective functions while partially following MDA methodology. Following Biber's methodology, factor analysis was conducted and four dimensions were identified, which provided clues to the functional associations of these dimensions. The results show that Pakistani learners' argumentative writing possesses narrative features and shows dominant overlap at the level of vocabulary, syntactic constructions, passage development, and even argumentation. These findings support the claim that Pakistani English has its own identity. The results are useful for linguists as well as teachers, since knowledge of common linguistic and syntactic structures makes it easier to assess students' writing with their grade level in mind.
Key Words
Coh-Metrix, Factor Analysis, Multidimensional Analysis, ICLE, Corpus Linguistics
Introduction
Research on Pakistani English has addressed the variety at two levels. Earlier studies focused on individual linguistic features and described them qualitatively. These studies helped to establish Pakistani English as an independent variety. For example, Baumgardner (1996) studied the distinctive style of Pakistani English by reporting its lexical and grammatical attributes, Talaat (1988) compared Pakistani English with British English and explored its lexical variety, and Mehmood (2009) studied its syntactic and phonological features. More recently, this trend has shifted toward a more empirical and objective approach, the corpus-based approach.
Corpus-based studies, widely regarded as empirical studies, have been adopted as a method for examining the linguistic features of Pakistani English. These studies help linguists to identify the features of different varieties, compare them, and bring out points of unity and diversity between native and non-native varieties. The ICLE corpus, a central resource for the study of learner writing, has gained much importance in this regard: 'we can use the ICLE corpora to know about the linguistic features of learners writing, bringing out quantitative statistics about word frequency in the use of words, word categories, syntactic structures and discourse features' (Granger, 1998).
Literature Review
A number of studies have explored the linguistic characteristics of texts or registers. These studies focused mainly on linguistic features, to which a factor solution (a statistical approach that groups meaningfully similar features into bundles) was then applied. The factorial study leads to the investigation of the communicative function of each text. The study of linguistic features across large amounts of data, registers, discourse types, etc., with the help of data tagging tools therefore remains central in this regard (in the present study, the tool is Coh-Metrix, abbreviated CM).
A number of studies have been conducted with CM, and they validate its utility in various respects. The notable studies range from cohesion and LSA indices (McNamara et al., 2010) to lexical diversity indices and L2 indices (McCarthy & Jarvis, 2010; Crossley, Salsbury, & McNamara, 2009). CM has also helped researchers to establish evidence regarding text levels, their patterns, comprehension grade, and texts suitable for L2 learners. For instance, McCarthy and Lewiet (2006) argued that CM is an effective tool for establishing authorship even when authors conceal shifts in their writing style. In a study of psychology articles, McCarthy (2007) used CM-based LSA indices to show the structural cohesion in the themes of these articles. Similarly, Duran (2006) used CM to study temporal cohesion in history, narrative, and science texts in order to examine the textual domains of these genres. Studies measuring coherence levels in learners' writing (McNamara, Ozuru, Graesser, & Louwerse, 2006) ranked texts from highly coherent to low in coherence, and Dufty (2006) used CM to assess human ratings across grade levels and how they differ for social and psychological reasons. Similarly, Duran & McNamara (2006) and McCarthy (2006) assessed the structural organization of various published high school textbooks and recommended grade levels according to comprehension level. Lightman (2007) studied variation in formal/informal and written/spoken texts and found differences and similarities among these genres. Dempsey, McCarthy, & McNamara (2007) and Louwerse (2004) studied gender differences manifested across various texts.
Crossley and Louwerse (2007) and Crossley, McCarthy, & McNamara (2007) conducted research to identify authentic and modified texts, which are necessary for learners of a second language. All these studies justify the utility of CM in investigating the characteristics of texts and in helping L2 learners to select their reading material in advance.
Coh-Metrix studies are concerned not only with language in use but also with the features and functions of that language. These features and functions were analyzed through CM by a group of researchers who collected more than 1500 student essays and analyzed them using Biber's (1988) MD methodology. Their study aimed mainly to examine the functional parameters revealed by co-occurring linguistic features in a learner corpus. The essays were grouped according to shared features with shared functions. On this criterion, four dimensions were identified and examined in relation to essay prompt, quality, and grade level. The results showed the functional parameters that affect learners' writing. This research also adds validity to the MDA methodology.
The present study combines the MD and Coh-Metrix approaches and discusses the dimensions extracted through them. Earlier studies of learner corpora (Hussain, 2015; Abdulaziz, 2017) were limited to exploring dimensions through MDA, whereas the present study employs both MD and Coh-Metrix dimensions to examine not only the co-occurring linguistic features of MD but also Coh-Metrix indices such as cohesion, readability, text easability, and text quality. Identifying these features is necessary because, as Crossley (2014) observes, the Coh-Metrix indices known as functional parameters have implications for writing theory, writing assessment, and writing pedagogy.
Research Methodology
The present research uses indices that were collected one essay at a time for 308 essays from the online Coh-Metrix tool, with the results saved in an Excel file. To filter the data further, Biber's procedure was adopted. Following Biber's (1988) methodology, the obtained indices were first normalized and then standardized, and factor analysis was conducted, dropping the indices whose weights were low or close to 0. Principal component analysis (PCA) with Promax rotation was applied. PCA is used when the underlying structure is undefined; it reduces a large number of indices to a small number of meaningful sets of variables, i.e., factors or dimensions. These dimensions were then interpreted in terms of writing parameters through a qualitative analysis of each dimension. For the meaningful inclusion of values, a cutoff point of ±.35 was applied to the loadings. The sets of features combined through factor scoring are thus interpreted qualitatively, giving a functional interpretation to the text. An index is included in a factor only if it shows its highest loading on that factor, and it is excluded from the other factors, where its loadings are lower. For example, if a feature (here, an index) loads more highly on factor 1 than on factor 2, it is included only in factor 1. Following Biber's methodology further, factor scores were calculated by subtracting the mean of the standardized scores of the negative features from the scores of the positive features. Up to this point, Coh-Metrix generates the results; the next step is to analyze the data statistically following Biber's methodology. To select the prominent factors, a scree plot was used, which is shown below:
Figure: Scree plot of the extracted factors
The scree plot shows that four factors are prominent. The features retained in this way indicate the possible groups of linguistic features that co-occur and largely share a communicative function. Although four factors are important according to the scree plot, Biber suggests further cutoff criteria: first, only features with a loading of at least ±.35 are included, and second, only factors consisting of more than four features are retained, while factors with fewer features are considered non-significant. With these criteria in view, only three factors are worth interpreting. The interpretation of these factors is given in the Results section.
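The selection rules described in this section (the ±.35 cutoff, assigning each index only to the factor on which it loads most strongly, and keeping only factors with more than four salient indices) can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure actually run for the study: the index names `A` to `F`, their loadings, and the function names are hypothetical, and the score uses sums of z-scores as in Biber (1988), whereas the text above describes the subtraction in terms of means.

```python
import statistics

CUTOFF = 0.35      # salience threshold on absolute loadings (from the text)
MIN_FEATURES = 5   # a factor needs more than four salient indices

def zscores(values):
    """Standardize one index across texts to mean 0 and sample SD 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def assign_salient(loadings):
    """Assign each index to the single factor where it loads most strongly.

    loadings maps an index name to its loadings on factor 0, 1, ...
    Indices whose largest absolute loading is below CUTOFF are dropped.
    """
    factors = {}
    for name, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        if abs(row[best]) >= CUTOFF:
            factors.setdefault(best, []).append((name, row[best]))
    return factors

def retained(factors):
    """Keep only factors with at least MIN_FEATURES salient indices."""
    return {j: f for j, f in factors.items() if len(f) >= MIN_FEATURES}

def dimension_score(z_by_index, features):
    """Score one text: z-scores of positive features minus negative ones."""
    pos = sum(z_by_index[n] for n, load in features if load > 0)
    neg = sum(z_by_index[n] for n, load in features if load < 0)
    return pos - neg

# Hypothetical loadings for six made-up indices on two factors.
loadings = {
    "A": [0.90, 0.10], "B": [0.80, 0.40], "C": [-0.70, 0.20],
    "D": [0.60, -0.10], "E": [0.50, 0.30], "F": [0.20, 0.30],
}
kept = retained(assign_salient(loadings))
```

In this made-up case, index `B` exceeds the cutoff on both factors but is counted only on factor 0, index `F` is dropped because no loading reaches .35, and factor 1 is discarded because it attracts no salient indices.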
Results
Factor 1. Based on Coh-Metrix Indices
| Positive Component | Values | Negative Component | Values |
| --- | --- | --- | --- |
| PCREFz | .953 | SYNMEDlem | -.734 |
| PCREFp | .943 | SYNMEDWRD | -.730 |
| CRFCWO1 | .850 | LDMTLD | -.698 |
| CRFCWOa | .820 | LDTTRc | -.687 |
| CRFNO1 | .743 | LDVOCD | -.650 |
| CRFAO1 | .740 | LDTTRA | -.637 |
| LSAGN | .698 | SYNMEDpos | -.517 |
| CRFSO1 | .692 | | |
| CRFAOa | .674 | | |
| CRFNOa | .669 | | |
| LSASS1 | .651 | | |
| CRFSOA | .641 | | |
| CRFCWOad | .637 | | |
| LSASSP | .550 | | |
Factor 1 is the most powerful dimension, containing 15 positive and eight negative features. Among the positive features, PCREFz and PCREFp have the highest loadings, .922 and .912 respectively. Both indices come from the bank of text easability indices, and they show a tendency toward higher referential cohesion in the text. A text with higher cohesion shows a dominant trend of overlap among words, ideas, and sentences throughout the text. As McNamara et al. (2014) put it: 'A text with higher referential cohesion contains words and ideas that overlap across sentences and the entire text forming explicit threads that connect the text for the reader' (p. 85). It is commonly observed that low cohesion makes a text harder to comprehend, because there are few connectives or textual connections to link ideas together for the reader. Content word overlap in adjacent sentences, content word overlap in all sentences, argument overlap, and stem overlap, represented by CRFCWO1, CRFCWOa, CRFAO1, and CRFSO1, are the other features with high positive loadings on factor 1. LSA indices are also prominent among the features of factor 1 with positive loadings, notably LSAGN, LSASS1, and LSASSp. These indices also show overlap in sentences, passages, and content words. The information shared through these indices is at the level of both given and new information. The learners use the same syntactic and linguistic features, which show overlap at the level of LSA as well.
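To make the idea of content word overlap between adjacent sentences concrete, the following Python sketch computes a simplified overlap score. It is an illustration only and does not reproduce the Coh-Metrix formula for CRFCWO1: the small function-word list stands in for real part-of-speech tagging, the share of common word types over the union of both sentences is an assumed definition, and the sample sentences are invented.

```python
import re

# Tiny illustrative function-word list; a real tool would use POS tagging.
FUNCTION_WORDS = {
    "a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
    "to", "and", "or", "for", "who", "they", "it", "their",
}

def content_words(sentence):
    """Lower-cased word types in a sentence, minus function words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in FUNCTION_WORDS}

def adjacent_overlap(sentences):
    """Mean share of content-word types common to adjacent sentence pairs."""
    scores = []
    for first, second in zip(sentences, sentences[1:]):
        a, b = content_words(first), content_words(second)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Invented sentences: the first pair shares two of four content-word types,
# the second pair shares none.
sample = ["Dogs chase cats.", "Cats chase mice.", "Birds sing."]
overlap = adjacent_overlap(sample)
```

A text in which most adjacent pairs look like the first pair would score high on such a measure, which is the pattern the positive pole of factor 1 points to.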
On the negative pole, features of lexical diversity are prominent. Lexical diversity refers to the number of unique words (types) in a text in relation to its total number of words (tokens). The type-token ratio (TTR) is computed for content words and for all words. When the number of word types equals the total number of words, every word in the text is different and lexical diversity is at its maximum. Such texts are either very short or low in cohesion. In contrast, if lexical diversity is low and cohesion is high, words are repeated across the text, since words need to be used multiple times for cohesion to be maintained. TTR is influenced by text length: as the number of words increases, there is more room for words to be reused, which makes each word less unique. The MTLD and vocd measures use estimation algorithms to address this. SYNMEDpos measures how similar adjacent sentences are in terms of their part-of-speech sequences, without regard to the particular words used, whereas SYNMEDwrd and SYNMEDlem take into account the words and lemmas at each position, so that sentences with the same syntax but different words, and hence different meanings, are treated as different. All these indices therefore capture different aspects of sentences. Taken together, these features show that Pakistani learner writing is cohesive, with overlap at the level of content words, sentences, and arguments as a dominant feature, yet there is a scarcity of new and unique words. The texts are highly informational, and similar sentences and arguments recur throughout whole passages. Thus the right label for this dimension is 'overlapping informative features vs. simple structure.' Examples from the data are quoted below:
<ICLE-PA-GF-004.1>
Often marriages are settled between two people
who have different financial backgrounds. After their marriages, one who is
poor or less rich is subjected to the agony of taunting. Especially if a woman
is poor, then she had to suffer for the whole of her married life and had to
obey her husband and serve him like a servant, or otherwise, she had to be
ready for the consequences such as divorce. It was also come to line light from
the research that men and their family hope that his wife will bring a large
amount of dowry, and if she does not bring it, then she is divorced. Usually,
young girls who belong to a rich are royal family are unable to live a married
life in serenity due to their lack of flexibility; they do not compromise on
any matter, thus resulting in a breakdown of a loving bond of marriages.
The example shows the
overlapping of content words which is a common feature of learner writing.
The other example is:
<ICLE-PA-GF-0080.1>
Terrorists and Mujahideen are two different
categories of people. Terrorists are those who create terror and dread through
their destructive activities. While Mujahideen are called Islamic soldiers.
They fight for Allah against injustice or for Islam when anti-Muslim powers
dominate Muslims and forbid them to lead their lives according to Islam. They
don't fight for personal or worldly benefits, while terrorists fight for
worldly benefits or purposes. Islam is the religion of peace. It stresses
brotherhood, sacrifice, and welfare. It forbids to frighten someone. The
killing of someone is inhuman.
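The sensitivity of the type-token ratio to text length, noted in the discussion of the negative pole of factor 1, can be shown with a short Python sketch. It is a simplified illustration, not the Coh-Metrix computation of LDTTRc or LDTTRa: tokenization is plain whitespace splitting, the function name is hypothetical, and the repeated text is an artificial construction built from one sentence of the essay quoted above.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Six different words: every token is a new type.
short = "islam is the religion of peace".split()

# The same words repeated three times: more tokens, but no new types.
longer = short * 3

ttr_short = type_token_ratio(short)    # 6 types / 6 tokens
ttr_longer = type_token_ratio(longer)  # 6 types / 18 tokens
```

The ratio falls from 1.0 to one third even though the vocabulary is unchanged, which is why length-corrected measures such as MTLD and vocd are reported alongside TTR.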
Factor 2. Based on Coh-Metrix Indices
| Positive Component | Values | Negative Component | Values |
| --- | --- | --- | --- |
| PCNARz | .901 | DESWLsy | -.828 |
| PCNARp | .900 | DESWLlt | -.810 |
| RDF | .802 | DESWLltd | -.759 |
| WRDPRO | .718 | DESWLsyd | -.751 |
| WRDFRQc | .712 | WRDAOAc | -.634 |
| RDL2 | .635 | SIENNA | -.571 |
| WRDFAMc | .635 | WRDNOUN | -.565 |
| WRDFRQmc | .556 | WRDADJ | -.493 |
| GREG | .538 | | |
| WRDPRP3s | .419 | | |
In Factor 2, the prominent features with positive loadings are related to the psychological ratings bank of indices. These indices give information about age of acquisition, familiarity, imageability, concreteness, etc. To obtain this information, CM uses two databases for word information. The first is the MRC Psycholinguistic Database, which provides ratings for a large number of words on several psycholinguistic dimensions. For example, age of acquisition measures estimate the period in which a word first enters a child's vocabulary, while other scales rate the content words of an adult vocabulary on a scale of 1 to 7 points, where higher scores represent easier processing. Ratings on the 1-7 scale were subsequently multiplied by 100 and rounded to the nearest integer so that all ratings could be presented as integers on a scale from 100 to 700. Other measures, such as familiarity, concreteness, and imageability, were obtained by merging the Paivio, Yuille, and Madigan (1968) norms. The second source is WordNet (Fellbaum, 1998; Miller, 1990), from which CM estimates polysemy and hypernymy.
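The rescaling of ratings from the 1-7 scale to the 100-700 scale is simple arithmetic, sketched below in Python. The function name and the sample rating of 4.86 are hypothetical and serve only to illustrate the multiplication and rounding step described above.

```python
def scale_rating(mean_rating):
    """Convert a mean rating on the 1-7 scale to an integer from 100 to 700."""
    if not 1 <= mean_rating <= 7:
        raise ValueError("rating must lie between 1 and 7")
    # Multiply by 100, then round to the nearest integer.
    return int(round(mean_rating * 100))

# Hypothetical mean familiarity rating for one word.
scaled = scale_rating(4.86)
```

The endpoints of the original scale map onto 100 and 700, so every rating can be stored as an integer without losing the two decimal places of the mean.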
The dominant feature, PCNARz, relates to narrativity: the description in these texts comes close to narration. Such text is like telling a story, sharing events and information about places, characters, and things. It resembles everyday oral conversation, and its vocabulary is highly familiar, drawing on world knowledge.
On the negative pole, features from the descriptive indices are prominent. These indices provide a detailed description of the text, its nature, and its complexity level. For instance, paragraphs and sentences of greater length may contain more words and more complex syntax, which makes them more difficult to process. Similarly, a large standard deviation in sentence length indicates that the text varies widely in sentence length, with some sentences very long and some very short, which the author may be doing deliberately to present scene descriptions and characters' utterances respectively (McNamara et al., 2014).
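The mean and standard deviation of sentence length mentioned above can be approximated with the Python sketch below. It is a rough illustration rather than the Coh-Metrix implementation: splitting sentences on full stops, question marks, and exclamation marks is a simplifying assumption, the use of the sample standard deviation is also an assumption, and the sample text is invented.

```python
import re
import statistics

def sentence_lengths(text):
    """Number of words in each sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def length_profile(text):
    """Mean and sample standard deviation of sentence length in words."""
    lengths = sentence_lengths(text)
    mean = statistics.mean(lengths)
    sd = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    return mean, sd

# Invented text with sentences of three, five, and one word.
sample = "One two three. One two three four five. One."
mean_length, sd_length = length_profile(sample)
```

A text whose sentences are all the same length would give a standard deviation of zero, while a mix of very long and very short sentences pushes the value up.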
Thus the right label
for this dimension is 'narrative vs descriptive concerns.'
<ICLE-PA-VL-0001.1>
The turning point of American policy was 9/11.
When 2000 Americans were killed in this attack, Americans claimed that Afghan
was responsible for that and started a war against Afghanistan. Millions were
killed in this war. The prisoners of war were killed in this and treated
inhumanly and sent to Abu Garib Jail. Americans put them in cagescage-like
animals and torture them by letting dogs lie on them.
They were not given food and other basic
needs. This true violation of the U.N Charter.
Which was recognized at the Geneva conference. The media explored the cruelties
of America in this regard. The American diplomacy to fight against terrorist
was exposed that how America falsely got the support of the world. But in
reality, the USA deceived the whole world for killing the innocent people.
The other example shows the description of a place.
<ICLE-PA-AO-0011.1>
Europe is one of the
world's seventh continents, Europe is generally divided from Asia to its east
by the water, divided by the Ural Mountains, the Ural River, the Caspian Sea,
the Caucasus region, and the Black Sea to the southeast. Europe is bordered by the
Arctic Ocean and other bodies of water to the north, the Atlantic Ocean to the
west, the Mediterranean sea to the south, and the black sea to the southeast.
Factor 3. Based on Coh-Metrix Indices
| Positive Component | Values | Negative Component | Values |
| --- | --- | --- | --- |
| PCSYNz | .914 | DESS | -.934 |
| PCSYNp | .905 | RDFKGL | -.821 |
| SYNSTRUTt | .833 | dressed | -.767 |
| SYNSTRUTa | .804 | SYNLE | -.620 |
| SMCAUSv | .665 | | |
| DEC | .647 | | |
| SMINTEp | .568 | | |
| CRFCWO1 | .513 | | |
| SMCAUSvp | .490 | | |
Factor three is based on eight features, all with positive loadings and no negative loadings. The features with positive loadings belong to the text easability measures. Indices of syntactic simplicity such as PCSYNz and PCSYNp point to a syntactic structure that is simple, with fewer words and simple, familiar syntactic patterns. Such structures are easy to process. The other features, PCCNCp and PCCNCz, belong to word concreteness and give information about content words that are more common, concrete, and non-abstract. Such vocabulary generates mental images and is less ambiguous and more meaningful. Such information is easy to process and comprehend. Therefore, the right label for this dimension is 'concrete factual information.'
<ICLE-PA-AO-0024.1>
The modern age is not hunting animals for
personal hunt and pleasure. In ancient times Kings and their companions used to
hunt animals not for their larder but for their personal pleasure. It was very
cruel and inhuman treatment toward the beauty of nature. They left many animals
to rote often hunt.
<ICLE-PA-AO-0015.1>
If we take the example of domestic donkey, it is
treated by human too much worse, a ton of weight is put on the back of this
innocent creature, and the master take much more work from him., which is much
more from his capacity and tendency.
In the circus, many animals are badly treated by
humans, they are change their habitat, location and snatch their native and
natural surroundings, and here they badly treated by a human.
<ICLE-PA-VL-0004.1>
At 3.79 million sq. miles and with over 309
million people, the America is the third-largest country by total area and
population. America is the world largest economy with a GDP of $ 14.3 trillion-
with the literacy rate of 99% America has one of the finest systems of
education. In the field of sports, America has achieved many landmarks, and
they have the highest number of medals any country won in the Olympics.
Although the global slow down in economic growth,
America is still the highest funding nation in the world. America has one of
the largest Army in the world, still operating in different parts of the world,
and have taken part in world war 1 and 2. Still, due to many set backs, America
is one of the strongest countries in the world.
Conclusion and Findings
Pakistani learners' argumentative writing is largely focused on sharing information. Even when putting forward arguments, the writers do not take a clear stance; instead, they use an indirect style. Students need special training for argumentative essays, as argumentative topics are more cognitively demanding, complex, and interactive. Students perform better in narrative and descriptive essays than on argumentative topics. Their text is like telling a story, sharing events and information about places, characters, and things. It resembles everyday oral conversation, and the vocabulary is highly familiar, drawing on world knowledge. Learners' writing is cohesive, with overlap at the level of content words, sentences, and arguments as a dominant feature, yet there is a scarcity of new and unique words. The learners use the same syntactic and linguistic features, which produces massive overlap in the text. They write highly cohesive text but lack variety of expression. Instead of presenting well-arranged, well-designed arguments, they prefer to share information with interactive features.
Note: This paper is part of researcher’s Ph.D. dissertation.
References
- Baumgardner, R. J. (1987). Utilizing Pakistani newspaper English to teach grammar. World Englishes, 6(3): pp. 241-252.
- Baumgardner, R. J. (Ed.). (1996). South Asian English: Structure, use, and users. Urbana: University of Illinois Press.
- Biber, D. & Finegan, E. (1994). Multi-dimensional analysis of authors' style: Some case studies from the eighteenth century. In D. Ross & D. Brink (Eds.), Research in humanities computing, III: pp. 3-17.
- Biber, D. (2004b). Modal use across registers and time. In Anne Curzan and Kimberly Emmons (eds.), Studies in the history of the English language II: Unfolding conversations. Berlin: Mouton de Gruyter. pp. 189-216.
- Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
- Biber, D. (1995). On the role of computational, statistical, and interpretive techniques in multi-dimensional analysis of register variation. Text 15/3: pp. 314-370.
- Biber, D. (1995). Dimensions of Register Variation: A cross-linguistic comparison. Cambridge University Press.
- Biber, D. (2004a). Conversation text types: A multi-dimensional analysis. In Gérald Purnelle, Cédrick Fairon, and Anne Dister (eds.), Le poids des mots: Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data. Louvain: Presses universitaires de Louvain. pp. 15-34.
- Biber, D., Connor, U. & Upton, T. A. (2007). Discourse on the Move: Using Corpus Analysis to Describe Discourse Structure. Amsterdam: John Benjamins.
- Biber, D., Conrad, S. & Reppen, R. (2002). Speaking and writing in the university: A multi-dimensional comparison. TESOL Quarterly, 36(1), pp. 9-48.
- Crossley, S., Salsbury, T., Titak, A. & McNamara, D. (2014). Frequency effects and second language lexical acquisition: Word types, word tokens, and word production. International Journal of Corpus Linguistics, 9(3), 301-332.
- Crowhurst. (1990). How many millions? The statistics of English today. English Today 1(1), 7-9.
- Friginal, E. (2012). The Discourse of Outsourced Call Centres: A Corpus-Based, Multi-Dimensional Analysis.
- Geisler, C. (2002). Investigating register variation in nineteenth-century English: A multi-dimensional comparison. In R. Reppen, S. M. Fitzmaurice & D. Biber (Eds.), Using corpora to explore linguistic variation. Amsterdam: John Benjamins. pp. 249-271.
- Graesser, A. C., & D'Mello, S. K. (2012). Moment-to-moment emotions during reading. Reading Teacher, 66, 238-242.
Cite this article
-
APA : Tabassum, R., Farooq, M., & Mahmood, M. A. (2021). Identifying Features of Pakistani Learners Writing Through MDA and Coh-Metrix. Global Social Sciences Review, VI(III), 119‒127. https://doi.org/10.31703/gssr.2021(VI-III).13