Software utilities

Since I became addicted to R, I started writing some little routines to automate some repetitive steps in my job, like selecting sets of items for experiments, computing statistical properties of these items, or pre-processing data before modelling them. (OK, that's the official version. The truth is that I was desperately trying to find a way to avoid doing the most boring part of my job myself, and making lots of mistakes while doing it. R was just perfect for that and, at least as importantly, it's free.) This is not intended to be stand-alone work; everything you'll find here was written to meet my very specific needs at the time, so don't expect my routines to fulfil your needs perfectly without some additional work of your own. A copy-and-paste approach won't work, but if you have some experience using R, you might find useful starting points on this page.

1. PairwiseMatching.Rfunction. This R function is designed to extract from a database a set of elements that are matched pairwise to an existing dataset on a number of variables, given a criterion that defines what counts as a pairwise match for each of those variables, and, optionally, a set of weights to be assigned to each variable in case more than one candidate is good enough and the function has to decide which is best. The syntax is as follows, with reference to the terms just introduced: PairwiseMatching(database, dataset, variables, criteria, weights). Database and dataset are data frames of any dimension; of course, the former has to have at least as many rows as the latter, but hopefully many more. Variables is a character vector with the names of the columns that correspond to the variables to be matched; these names must be the same in database and dataset. Criteria is a numeric vector expressing, for each variable in variables, how many percentiles away from its target any candidate can be in order to count as matched; this vector must be the same length as variables. And finally, weights is a numeric vector expressing how important each variable is compared to the others; there is no restriction on the numbers used, as the only thing that matters is the ratio between them, e.g., c(1,1,2,4,1) would be the same as c(2,2,4,8,2).
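To give a flavour of what this kind of matching involves, here is a minimal sketch in plain R of percentile-based pairwise matching with a weighted tie-break. It only illustrates the logic described above; the greedy search and the ecdf-based percentiles are my assumptions, not the actual PairwiseMatching implementation.

## A sketch only: greedy, percentile-based pairwise matching with a weighted
## tie-break. Not the actual PairwiseMatching code.
pairwise_match_sketch <- function(database, dataset, variables, criteria, weights) {
  used    <- rep(FALSE, nrow(database))
  matches <- rep(NA_integer_, nrow(dataset))
  ## express every matching variable as a percentile rank within the database
  pct        <- sapply(variables, function(v) ecdf(database[[v]])(database[[v]]) * 100)
  target_pct <- sapply(variables, function(v) ecdf(database[[v]])(dataset[[v]]) * 100)
  target_pct <- matrix(target_pct, nrow = nrow(dataset))
  for (i in seq_len(nrow(dataset))) {
    ## distance from the i-th target, in percentile units, for each variable
    dist <- abs(sweep(pct, 2, target_pct[i, ]))
    ## candidates not yet used and within the criterion on every variable
    ok <- !used & apply(sweep(dist, 2, criteria, "<="), 1, all)
    if (!any(ok)) next
    ## among acceptable candidates, pick the one with the smallest weighted distance
    score      <- as.vector(dist %*% weights)
    best       <- which(ok)[which.min(score[ok])]
    matches[i] <- best
    used[best] <- TRUE
  }
  database[matches, , drop = FALSE]
}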

2. Diagnostics.Rfunction. This function is designed to pre-process data from behavioural or eye-tracking experiments, particularly language experiments. It takes a raw dataset of response times and accuracies in long format (each row is a given subject on a given item) and computes participant and item means and SDs for speed and accuracy. More specifically, it takes as arguments five vectors (RTs, accuracies, participant ID, target ID, and condition ID; typically, word/nonword in a lexical decision experiment) and one string (the name that should be given to its output plot). This output plot sums up all the information the experimenter needs to decide quickly and effectively which items/subjects/individual data points should be excluded from further analysis. Here's an example.

The upper left panel is a bit messy at first glance, but becomes easy to read once a few things are sorted out. First, each bubble is a participant, and its position in the graph is determined by its mean RT (X axis) and accuracy (Y axis); so, up and left is quick and accurate, down and right is slow and error-prone. The bubble diameter is proportional to the RT SD; so, a small bubble is good, a big bubble is bad. This was a lexical decision experiment, so we had both target words (YES responses) and target nonwords (NO responses); red is word, blue is nonword (this is applicable to any other relevant, discrete variable in any other experiment, really). The dotted lines are the typical cut-off criteria you can find in the literature; in this case, they represent two SDs from the mean, for both words (red) and nonwords (blue), and for both RT (X axis) and accuracy (Y axis). Finally, because each participant was shown both words and nonwords, you get two bubbles for each of them; these two bubbles are connected with a light grey line, so that you can quite easily find where the other bubble of any given subject is.

The plot is dense in the upper left corner; but we don't care, because people there are quick and accurate, and so we have no reason to exclude them. What we're normally interested in is the rest of the plot, where bad participants lie. The good thing about this plot is that you get all the relevant information at a glance, so the risk of blindly excluding relevant data is minimized. Take sbj 12, for example: her red bubble lies on the horizontal red line, i.e., her mean accuracy on word trials is exactly two SDs below the mean. She would be excluded if we applied the classical cut-off here. But from the graph we learn that she was quite quick on target words (well below the RT cutoff, i.e., far to the left of the vertical red line), she had a low SD (the bubble is tiny), and she was exceptionally good on nonword trials. So probably she shouldn't be excluded. I don't remember exactly, but I guess the only participant I took out here is sbj 17. She made lots of errors on word trials (her red bubble is low…), and she was not particularly quick either (…and also quite far from the left edge of the plot). She performed quite poorly on nonwords too (her blue bubble is well outside the upper left cloud), and was also very variable in her RTs (both her red and blue bubbles are rather big compared to the others). It's not that intuitive at the beginning, I know, but after two or three experiments you'll save quite some time in making these decisions (and, right from the start, you'll save time on computing subject and item means and SDs). The other two plots are much easier.
In the upper right corner you find a similar plot, this time with each point being a target word. There you see that WEDLOCK, HEDGEROW, SYNDROME, MINSTREL and SHRAPNEL clearly fall outside the upper left (= good) cloud, and should thus be excluded. Finally, the last panel is a histogram of individual item-by-subject RTs. There you see that several data points lie in the tiny right tail above ~1500 ms; you should probably take them out.
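Just to make the logic of the participant panel concrete, here is a rough base-R sketch of how such a bubble plot could be drawn from a long-format dataset. The data frame 'd' and its column names (rt, acc, subject, condition) are assumptions for illustration; this is not the actual Diagnostics code.

## A sketch only, not the actual Diagnostics function. Assumes a long-format
## data frame 'd' with columns rt, acc (1 = correct, 0 = error), subject, condition.
sbj <- do.call(rbind, lapply(split(d, list(d$subject, d$condition), drop = TRUE),
  function(x) data.frame(subject   = x$subject[1],
                         condition = x$condition[1],
                         mean_rt   = mean(x$rt),
                         sd_rt     = sd(x$rt),
                         accuracy  = mean(x$acc))))
cols <- ifelse(sbj$condition == "word", "red", "blue")
## one bubble per participant and condition; diameter proportional to the RT SD
plot(sbj$mean_rt, sbj$accuracy, col = cols, cex = 1 + sbj$sd_rt / 100,
     xlab = "Mean RT (ms)", ylab = "Proportion correct")
## dotted cut-off lines, two SDs from the mean, separately for each condition
for (cond in unique(sbj$condition)) {
  s    <- sbj[sbj$condition == cond, ]
  colr <- if (cond == "word") "red" else "blue"
  abline(v = mean(s$mean_rt)  + 2 * sd(s$mean_rt),  lty = 3, col = colr)
  abline(h = mean(s$accuracy) - 2 * sd(s$accuracy), lty = 3, col = colr)
}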

It might seem complicated at the beginning, and probably it is. But your effort will be well rewarded after just a handful of experiments; at that point, it will be very easy to grasp all the relevant information at a glance, thus saving time and leading to better decisions. And you can always forget about the graph and use this routine simply as a calculator for participant and item means and SDs; a huge amount of time saved anyway.

3. OutputFilesMerger.Rscript. This is a script that merges several output files (typically, one for each subject) into one. It is designed specifically for the needs of the MoMo lab (which uses in-house software based on Octave and Psychtoolbox to run experiments), and so will most likely not be of any use to anyone else in the world. But you never know, maybe some other R geek out there might want to take a look inside the script and reuse some parts of it. If you don't feel that illogical pleasure in playing around with nonsense words and seeing what happens on your screen, I seem to remember that Matt Davis has something similar for merging DMDX output files, which is surely more standard.
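For what it's worth, the general idea is easy to reproduce for other formats. Here is a generic sketch (not the MoMo-specific script) that reads all the single-subject text files in a folder and stacks them into one data frame; the folder name, file pattern and column layout are of course assumptions.

## A generic sketch, not the MoMo-specific script: stack all single-subject
## output files found in a folder into one data frame.
files  <- list.files("output", pattern = "\\.txt$", full.names = TRUE)
merged <- do.call(rbind, lapply(files, function(f) {
  dat      <- read.table(f, header = TRUE, stringsAsFactors = FALSE)
  dat$file <- basename(f)   ## keep track of which file each row came from
  dat
}))
write.csv(merged, "merged_output.csv", row.names = FALSE)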

4. MorphoOrthographicFamilySize.Rfunction. This function computes family size figures for a set of morphemes, based on morpho-orthography (which means that 'corner' will contribute to the family size count for the suffix 'er', although 'er' isn't a suffix in 'corner'). I needed to compute these values for a recent revision of one of my papers, with the idea that at the semantics-blind, pre-lexical morphological level, this kind of family size, rather than the classical version of it (which takes semantics into consideration), should be what really matters. The function takes as input three vectors (the morphemes we're curious about, all the words in a given language, and their frequency values) and one string ('type' or 'token', according to whether we want to weight the contribution of each word by its relative frequency in the lexicon). At the moment it doesn't generate external files, only an R data frame; however, this could easily be modified by any R user with a little experience.
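As an illustration of the idea (not the actual function), here is a minimal sketch of how morpho-orthographic family size can be computed from a suffix list, a lexicon and its frequency counts; the function name and arguments below are made up for the example.

## A sketch only; names and arguments are illustrative. Morpho-orthographic
## family size: every word ending in the suffix string counts towards that
## suffix's family, whether or not it is morphologically complex
## (so 'corner' counts for 'er').
morpho_family_size_sketch <- function(suffixes, lexicon, frequency, count = "type") {
  sapply(suffixes, function(s) {
    members <- grepl(paste0(s, "$"), lexicon)    ## orthographic match at word end
    if (count == "type") sum(members) else sum(frequency[members])
  })
}
## e.g., morpho_family_size_sketch(c("er", "ness"), words, freqs, count = "token")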

5. NgramTrough.Rfunction. This function quantifies the size of the so-called bigram trough (for more information, check out here) that typically characterises morphemic boundaries, at least in English. The trough size is operationalized as follows:

BTD = (log(BFa) – log(BFb)) + (log(BFc) – log(BFb))

where BFa is the frequency of the bigram immediately preceding the boundary, BFb is the frequency of the bigram straddling the boundary, and BFc is the frequency of the bigram immediately after the boundary. The index is thus close to zero when the bigram frequency pattern over the boundary is flat, and grows with the depth of the bigram trough. The function takes as input: (i) a vector with the words we're curious about; (ii) a vector with all the words in a given language; (iii) a vector with their frequency values; (iv) a number (either 2 or 3) specifying whether we want bigram or trigram stats; (v) a string (either 'type' or 'token') that tells the function whether to weight the contribution of each word according to its frequency; (vi) a vector of letter positions where the morphemic boundary of the target words lies (the position of the first letter of the suffix); (vii) a logical value (either TRUE or FALSE) telling the function whether to apply a log transformation to frequency values; and (viii) a logical value telling the function whether to consider only concave troughs (the most typical pattern) or also convex troughs (i.e., where the frequency of the bigram straddling the boundary is higher than that of the surrounding bigrams; this is unusual, but it's still a signature that the visual identification system might be sensitive to). The routine produces (i) an R data frame called 'NgramFreqTrough' with the output stats and (ii) a plot with the bigram frequency pattern represented as boxplots centred on the morphemic boundary. Here are two examples, where you can see a bigram trough in the lower panel, and no bigram trough in the upper panel (a rough sketch of the trough computation is given at the bottom of this page).

A note for R beginners. R scripts, which are recognizable as such by their extension ('Rscript.txt'), can be run directly in R by simply calling 'source("file_name")'. By contrast, before being able to use R functions (whose file names end in 'Rfunction.txt'), you should (i) put the text files in the directory you're currently working in, or in the directory you intend to work in; (ii) open R in that same directory; (iii) run 'source("file_name")'. Then you just need to call the function with its appropriate arguments, and have fun! The name of the function corresponds to the name of the text file, once '.Rfunction.txt' has been stripped.
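As promised above, here is a rough, purely illustrative sketch of the trough index for a single word with a known boundary position, using simple type counts of bigrams over a lexicon. It is not the actual NgramTrough function (which also handles trigrams, token weighting, concave/convex troughs and the boxplot output), and the names below are made up.

## A sketch only, not the actual NgramTrough function. 'boundary' is the
## position of the first letter of the suffix; bigram counts are type counts
## over 'lexicon'.
bigram_trough_sketch <- function(word, boundary, lexicon) {
  bigram_freq <- function(bg) sum(sapply(lexicon, function(w) {
    n <- nchar(w)
    if (n < 2) return(0)
    sum(substring(w, 1:(n - 1), 2:n) == bg)   ## occurrences of bg within w
  }))
  l   <- strsplit(word, "")[[1]]
  BFa <- bigram_freq(paste0(l[boundary - 2], l[boundary - 1]))  ## before the boundary
  BFb <- bigram_freq(paste0(l[boundary - 1], l[boundary]))      ## straddling it
  BFc <- bigram_freq(paste0(l[boundary],     l[boundary + 1]))  ## after it
  (log(BFa) - log(BFb)) + (log(BFc) - log(BFb))
}
## e.g., bigram_trough_sketch("darkness", boundary = 5, lexicon = words)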