
That’s What She sed: !awk Lessons From Fun[ctional] Programming

Somewhere at the intersection of unexpected genius, linguistic mastery, and femininity there’s a trope of compelling film/fiction that goes something like this: a character (ideally a woman or weakling) speaks a language that no one expects and suddenly reveals a competency or comprehension that strengthens his or her position, provides for some comedy, or drops a beat of provocative timing. This kind of surprising exolingual + monolingual situation is common and interesting. I’m thinking Daenerys Targaryen speaking Valyrian, or when Nancy Travis speaks Russian to her cat-callers in So I Married an Axe Murderer, or that scene in The Goonies when Corey Feldman, a child, gives the maid surprising instructions in Spanish, or those times on the subway when I can tell who the françaises next to me are gossiping about and giggle to myself at the semantic secrets I’m privy to by virtue of closet bilingualism. It’s a common and compelling scene, and not one wholly relegated to spoken tongues; it has its echoes in computational languages too.

Unexpected fluency in a programming language is fascinating. There’s still an interesting amount of surprise that accompanies any woman speaking intelligently at a tech conference, or a child-ish programming prodigy who sells his company at 18 and enjoys wild and precocious success. With that in mind, I decided to explore some languages recently that I had little experience with, if only to investigate their utility, and build up some surprising and cinematic techcred of my own.

Ontology Web Language? (http://www.w3.org/2001/sw/wiki/OWL)

Informing this, a recent and short tumble into the land of Game of Thrones led me through the wikipedian labyrinth to LCS, this linguistic non-prof that constructs languages (conlangs), composed of member constructors (conlangers) and the responsible creators of languages like Klingon in Star Trek and Dothraki in Game of Thrones. My tangent into a trope sparked some curiosity about how we define computer languages and how we use them thereafter, and the authority of the inventors of these languages.

Like other languages, computational tongues are often indexed by stereotypes, but unlike spoken conlangs, which have evolved to express a multiplicity of (in)translatable nuances, CSlangs are often more objectively restricted by a purpose: not developed to express all of the things but rather to accomplish a task. Valyrian is “the only language for poetry,” while Dothraki is harsh and guttural like its speaker population; SQL is a “special-purpose” query language, Objective-C is a “general purpose” object-oriented language, Visual Basic is the “most-WTF-y” language; but even in these stereotypical distinctions, coders contend about what these adjectives might mean, and who is best suited to speak these languages in such-or-such situation. And personality presumptions align to these -types as well: women, being generally lovely and fluffy, are unlikely to speak brutal and ugly bash shell scripts… they should be front-end programmers because pretty, and easy. 😦

And despite this, I’ve been investing a bit of time in the prelims of every data project, somewhere between scoring a raw pile of data and shaping it up for a visualization, always accomplished via some language/library. New projects and experiments always make me wonder if there’s a better library, plugin, resource or language to articulate my objectives and otherwise get me the results I’m after, which in this case involve a bit of OCR and semantic analysis, batch processing and cleaning and file pruning where language all-around is pretty important. For this round of tech adventures, I settled on sed, but I’m sure the operations I’ll be performing in this post could fairly be accomplished in other languages. What he sed. Further notes on my actual adventure can be found here, but as a quick suite of examples, say you have a batch of files whose extensions you need to change:


You can do this with text utilities:


This converts all .docx files to .txt in a given directory (ignore the bogus .pdf dud).
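The original command did not survive on this page, so what follows is only a minimal sketch of how such a batch rename might look with sed inside a shell loop. The function name `rename_docx_to_txt` and its directory argument are hypothetical, introduced here for illustration. Note that this only changes the file extension; it does not convert the contents of a Word document into plain text.

```shell
# Hypothetical helper (not from the original post): rename every .docx
# file in a directory to .txt. Only the name changes, not the contents.
rename_docx_to_txt() {
  dir="${1:-.}"                         # default to the current directory
  for f in "$dir"/*.docx; do
    [ -e "$f" ] || continue             # nothing matched the glob: skip
    # sed swaps the trailing ".docx" for ".txt"
    new=$(printf '%s\n' "$f" | sed 's/\.docx$/.txt/')
    mv -- "$f" "$new"
  done
}
```

Calling `rename_docx_to_txt ./articles` (the directory name is made up) would leave any .pdf or other files in that directory untouched, since the glob only picks up names ending in .docx.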


Then, say you need to restructure file names in a directory so that you can sort them, as I wanted to by date, but your current file format is something like this:

23NY080214.txt Or ##-NY-DDMMYY.txt

You can reorder characters in a set of files by running a sed script like this:


This tells sed to match the file name character by character, with each “.” standing for one character, to capture runs of those characters as parenthetical groups, and then to re-order those groups according to the back-references at the end of the expression (\4\3\2\1), where \4=YY, \3=MM, \2=DD, \1=23NY. It makes that reorder actionable for each (*) .txt file in the directory.
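Since the script itself is missing here, this is a hedged reconstruction of the idea, not the original code. It assumes names of exactly ten characters before the extension, like 23NY080214.txt (four characters of ID, then DD, MM, YY). The function names `reorder_name` and `reorder_txt_files` are hypothetical.

```shell
# Hypothetical helper: print the date-first version of a name such as
# 23NY080214.txt. Each "." matches one character; \( \) captures a group.
reorder_name() {
  printf '%s\n' "$1" |
    sed 's/^\(....\)\(..\)\(..\)\(..\)\.txt$/\4\3\2\1.txt/'
}

# Apply the reorder to every .txt file in a directory (default: current).
reorder_txt_files() {
  dir="${1:-.}"
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue             # nothing matched the glob: skip
    base=${f##*/}                       # strip the directory part
    new=$(reorder_name "$base")
    # Names that do not fit the pattern come back unchanged; leave them.
    [ "$new" = "$base" ] || mv -- "$f" "$dir/$new"
  done
}
```

With these assumptions, 23NY080214.txt becomes 14020823NY.txt, which sorts by year, then month, then day. One caveat: the renamed files still fit the ten-character pattern, so running the loop a second time would scramble them again. Try it on a copy of the directory first.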


None of these applications is really what sed was “made for,” but I found them pretty satisfactory implementations of the language for my immediate need. Taken together, all this got me thinking about linguistic development and about the “meta”-languages of programmatic thinking, the classes and cases of computational articulation that lead us toward fluency in one or more languages, preference, and eventual specialty in the operations most suited to that lexicon.

While living on a continent with ~3,000+ spoken languages, pidgins, and regional dialects, I also started thinking about how the diversity of computer languages compares to other paroles of parlance, and how our systems for organizing and inventing new tongues might best map to each other for optimal productivity. There are rough guides for this kind of crosswalk: expected hierarchies, rankings, paradigm comparisons, and schemes of which languages are appropriate for the most hardcore hackers (see also, the “Real Programmer” fallacy).

But to redirect the conversation to a more critical and less-subjective breakdown, it seems appropriate to consider the semantics of not just the language itself but also its classification schemas in trying to assess their flexibility and purpose. One of the beautiful things about objectively breaking down languages by purpose is that they can be ranked according to their flexibility and utility, their merits, rather than subjective judgements about their syntax. As with most anything in code, bash, or whatever scripting, part of the learning process is absorbing typical commands and the rest is playing with how to appropriately pair them for more complex operations (roughly: what commands are possible and how to link them). Snooping through Stack Overflow can usually get you pretty far on the first one; the second comes later, when repeated compartmentalized operations become exhausting and your frustration has driven you to the point of investment in some serious study or thought on how to most efficiently arrive at your goal.


For this project, I selected sed because I’d read about its utility for my purposes. I’ve got several years’ worth of newspaper and journal data to convert from various file formats to one, and then rename in a batch before diving into the actual contents and cleaning and reformatting. Sed seemed appropriate for this. I could probably do it in Python or bash or JS or somesuch, and maybe there’s someone who’s already built an online GUI that automates all this… but I was looking for something that worked and something new to learn, a new dialect to surprise myself with. While I felt stupidly proud when surprising others with this workflow and earning the ‘hacker’ merit badge du jour at work, I didn’t choose it to be cool, I chose it because it fit my needs. I chose it because sed is simpler than awk and Perl, syntactically and performatively, but it provides a variety of text processing and regex support operations, and suits most things I would need in combination with other commands. I’m still at the ‘hello world’ stage with some of the magic of stream editors, but sed had some pun promise for the title of this post so I thought I’d go with that and see how far I could get with the operations that I wanted to perform.

And this is where I started thinking, perhaps there are other language paradigms to adapt for this purpose. Taking tips from symbolic and declarative languages might be useful, if only conceptually. I’d like to type in my desired output and allow the language to fumble through the mechanics of its implementation. When in SQL and I’m select from where’ing, I’d like to sed-ify that operation for data cleaning. Select *.csv from _ directory where _[date].csv. In researching and polling friends about additional “sql-ish” (pronounced “squish” please) languages, I came across a few interesting features that I have yet to test in practice but seem like pretty cool operations to incorporate in a meta-sed lang.
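To make that wish a little more concrete, here is a rough sketch of what a “select from where” for files might look like as a shell function. Everything in it is an assumption for illustration: the name `select_files`, the argument order, and the idea that the “where” clause is an extended regular expression tested against each file name with `grep -E`.

```shell
# Hypothetical "squish" helper: select_files FROM_DIR EXTENSION WHERE_REGEX
# Prints the names of files in FROM_DIR with the given extension whose
# name matches WHERE_REGEX (an extended regular expression).
select_files() {
  from_dir="$1"
  ext="$2"
  where="$3"
  for f in "$from_dir"/*."$ext"; do
    [ -e "$f" ] || continue             # nothing matched the glob: skip
    name=${f##*/}                       # strip the directory part
    if printf '%s\n' "$name" | grep -Eq "$where"; then
      printf '%s\n' "$name"
    fi
  done
}
```

Something like `select_files ./data csv '_[0-9]{6}\.csv$'` would then read roughly as “select *.csv from ./data where the name ends in an underscore and a six-digit date.” The directory and the six-digit date format are made up for the example. It is still imperative underneath, of course; the function just hides the loop behind a query-shaped call.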

In the past, and via wikipedia, I’ve heard “declarative” applied to XSLT. Your blocks of code are statements, declared like: “when you get to {this} w/ property {that}, do {these things}.” You can declare them in any order and they will run in the appropriate sequence. However, is XSLT “declarative” according to all definitions? Diving further down the language research rabbit-hole had me questioning more of what “declarative” means in this context. Despite the overwhelming arguments you can get yourself into when defending the merits of one computer language over another, the terminology used to refer to different programmatic concepts and classification schemas can be vague, misleading and largely unhelpful if you approach them as a foreigner, with other linguistic fluencies influencing your translations. The term “declarative language” for example can reference “non-procedural”, but that is also valid for the other language styles. The author in this article linked above uses “where you declare…” to define his term of “declarative language.” With XSLT, you write blocks of procedural code, called in reaction to something in the source doc, otherwise unlinked to the calling procedure (“where you declare…”).

If you think of lots of front-end and web prog languages, they pretty much fall into this category: small blocks of code linked to a user interaction or operation (onClick listen –> then run {this}). The author features a bunch of interesting language paradigms like concatenated languages, but there are other, now (perhaps) obsolete meta-languages that also address these concepts with more flourish and in many cases the same hiccupy classification semantics that can obscure their utility. Like what about languages made to describe algorithms, APL-ish tongues with general and placeholder operators, “compression functions” to apply operators pairwise to members of a vector, right to left programming execution sequencing? Or what about REXX, a shell scripting language using juxtaposition and ‘||’ interchangeably for concatenation, using blanks as operators? Even the semantics of concatenation have been through debates about the appropriateness of the term to “co-chain” vs. just catenate (“chain”).

Both conlangs seem to require quite a bit of syntactical adjustment but have features I’ve never seen echoed in other languages. And still, the point is, no one remembers these syntactical idiosyncrasies; languages are remembered for what operations they perform and how well. Our memories are operation-orientated, perhaps not solely focused on syntax. Are these lexicons appropriate for high poetry, are they guttural and direct; what do they evoke, how do they surprise?

Plus, I’m wondering if I even understand how to appropriately use and manipulate a language when I’m not sure how to best describe it. Taking a page from my spoken fluencies, those languages that I know best and feel most comfortable using in practice are always those whose grammar and constructs I can explain and justify with greatest ease. There’s little mastery in the unwritten blundering I do in Swahili or Creole, though I’ve spent serious time in places where they were spoken; English and French, the product of formal study and informal fumbles, I totally own like whoa.


In programming there are declarative and imperative paradigms; likewise there is an imperative mood (expressing commands) in most spoken/written languages. One might read Dothraki or Klingon, a brutal class of LCS languages, as particularly “imperative” in their ‘commanding’ manner and unapologetic guttural articulation. But what might be the meaning of declarative? Do many people know? The internet suggests not. As per uszhe, everyone has his own definition, disambiguations + citation needed, wikipedia, hint hint.

So what’s the best language to communicate what we want, when the writing about languages is indecisive and muddled? Probably, and unsurprisingly, the language you speak best. True masters can adapt languages to their purpose, but most still recognize that CS languages are freighted with an intention, and this limits their applicability to all situations. The ambiguity of classifications like “declarative” in reference to a few languages, or other terms applied to and restricting language adoption, crumbles when you consider languages for their ideal operations, and not their syntax or semantics. What is the purpose of the language, how to absorb typical commands and how to appropriately pair them for more complex operations? Operation-oriented language selection (ruby is good for… and …) rather than grammaticentric (ruby syntax is “bloated and confusing”) might be the best approach for study; one that respects the romantic tropes of surprise, and pushes you to build a vocabulary based on the declared objectives of your goal, rather than the pretense of some predefined language hierarchy.

That, appropriately, perhaps unsurprisingly, is what she sed.

Note: I like to alliterate my titles so if you thought this would be a post about functional programming and are now disappointed, you should check out my friend Jonathon’s post on functional programming coming out in Smashing Mag at some point in the soon, or this explanation series which is fairly brill IMHO.

If you wanted more stream editing and shell scripting, some resources you might enjoy are this one, and for awk reading (the best!), this one.
