We Can Work it Out...

SQL for Linguists

Why should I learn SQL, I'm a linguist! Why not just manage data with a spreadsheet?

For one thing, SQL is kind of fun. It's neat to be able to ask a database complex questions. It will make you feel like a superhero.

More prosaically: consider a problem that all linguists have to deal with sooner or later: language names and language codes. You can put those into spreadsheets, but how do you combine the data from different lists, with possibly conflicting information? Those are the kinds of questions that can really only be answered with a relational database (or more spreadsheet-fu than I have or desire to acquire).

I find that it's easier for me to care about databases if I'm actually interested in the data. Hence this tutorial: we're going to use SQL to take a look at the WALS.info database.

Wals.info is an amazing web-based resource for cataloging the linguistic features of the languages of the world. Anyone interested in language or linguistics should check it out.

In a flurry of forward-thinking, the fine folks at WALS have made the SQL database behind their website available for (non-commercial) use: the page Exporting WALS Data explains how to download the file devdata.sqlite.

We're going to use some basic SQL to see what we can learn from this database.

I'm not going to cover the details of installing sqlite, since it's explained on lots of websites, one of which covers your platform.

Tell Me What You See: Accessing the database

At a terminal, change to the directory where you have the sqlite database you downloaded. It should be called devdata.sqlite.

There are a couple ways to interact with the database: use the sqlite3 terminal application, and write scripts in an external file and run them as a script. We'll do a bit of both. Under Linux and OSX (I'm uninformed on Windows, education welcome), you can pipe a series of SQL commands directly to the command-line tool:

$ cat somesqlcommands.sql | sqlite3 devdata.sqlite

For example, here are the contents of list-language.sql:

.output list-language.txt
.mode tabs
select * from language order by name;

So we can run that file like so:

$ cat list-language.sql | sqlite3 devdata.sqlite

The output of this table can be seen at list-language.txt.

A SQL database is basically a set of tables, and you can think of a SQL table as analogous to a spreadsheet: across the top are the names of the columns. Those correspond to attributes of the objects, which are represented as rows. For instance, the language table contains rows of languages. If it were a spreadsheet, it would look like this, beginning with:

spreadsheet screenshot

And then a few thousand more languages, ending with:

spreadsheet screenshot

Notice that the sorting here appears a bit odd. That's because it's not sorted. If we want a particular sort order, we have to ask for it.

So, what does this look like in SQL? Firstly, we'll use the the command-line client (sqlite3) that comes with recent versions of sqlite.

$ sqlite3 
SQLite version 3.4.0
Enter ".help" for instructions

The client has some useful built-in commands which we can use to investigate the current database. As the prompt mentions, you can type .help for a full list. (Also note that these "dot" commands are sqlite-specific, not part of SQL itself.)

Okay, let's start up the client with the name of the database, like this:

$ sqlite3 devdata.sqlite 
SQLite version 3.4.0
Enter ".help" for instructions
sqlite> .databases
seq  name             file                                                      
---  ---------------  ----------------------------------------------------------
0    main             /Users/pat/sqlforlinguists/devdata.sqlite

That shows us that sqlite knows where the database is (your path will differ depending on where you put devdata.sqlite).

There are quite a few tables to consider. We can list of them all with .tables:

sqlite> .tables
altname                 family                  permission            
altname_language        feature                 reference             
area                    feature_author          routledgename         
author                  genus                   routledgename_language
author_chapter          group_permission        ruhlenname            
chapter                 igt                     ruhlenname_language   
chapter_reference       iso_language            tg_group              
country                 isolanguage             tg_user               
country_language        isolanguage_language    user_group            
datapoint               language                value                 
datapoint_reference     language_country        visit                 
example                 language_iso_language   visit_identity        
example_feature         othername             
example_reference       othername_language

Some of these names might look suspiciously familiar if you've dealt with language names or language families much. The central table here (I think) is language.

Something... How a language is represented

If we continue to analogize a SQL database with a spreadsheet, you might think of each of these tables as a single "sheet" with a single set of data on it. The data is arranged with column names, which represent the attributes of the objects we're cataloging. To understand how column names of a particular table are, we look at the table's schema:

sqlite> .schema language
CREATE TABLE language (
    id VARCHAR(3) NOT NULL, 
    name VARCHAR(255), 
    ascii_name TEXT, 
    genus_id VARCHAR(36), 
    longitude NUMERIC(10, 2), 
    latitude NUMERIC(10, 2), 
    modified TIMESTAMP NOT NULL, 
    PRIMARY KEY (id), 
     UNIQUE (name), 
     FOREIGN KEY(genus_id) REFERENCES genus (id)
);

We see the line CREATE TABLE language up top. As you might expect, this is the command that was used to create the table in the first place.

The information in this line is similar to what you have to ask yourself before you set up a new table: "what are the attributes of the objects I want to collect data about?"

Once again, we can compare this schema to the columns in the our hypothetical spreadsheet:

spreadsheet screenshot

As you can see, each row in the CREATE statement corresponds to a column in the spreadsheet. Each of the attribute lines has the following format:

column-name DATATYPE OPTIONAL-OTHERSTUFF

The column-name is what the database creator decided to call the column fields.

We'll ignore the optional bits for now. The datatype is more important: it says what kind of value is going to be put into that field. A simplistic (but practical, I think) way to think of this is to compare it to the "formatting" you sometimes need to set in a spreadsheet to get a column to look right. TEXT and NUMERIC are easy enough, and even TIMESTAMP sort of makes sense. If you keep in mind that VARCHAR(n) just means "a text that contains 'n' letters," you're not far off.

So we can see right off the top that the id field is 3 letters long. As you can see from looking at the example data, that's the language code. At first blush, I thought this was the ISO 639-3 code, however, I learned later that WALS has its own distinct language codes (with a mapping to ISO-639-3). These sorts of differences are inevitable and ubiquitous when you're dealing with a large standard. We'll see later that SQL can help us deal efficiently with such complexities.

Do You Want to Know a Secret... Picking out data with SELECT

The last three lines in the schema are a bit more magical, but the one that goes PRIMARY KEY (id) is explicable: in this database, language objects are identified uniquely by a 3-letter string. That's what PRIMARY KEY means. So, if we happen to know the language code of a language, we can look it up in this table:

sqlite> select * from language where id = 'eng';
eng|English|english|germanic|0|52|2009-08-07 22:17:36.834584

I just looked up the row that has eng as a unique identifier, and sqlite gave us the corresponding row. I wonder if there's a language with the code abc?

sqlite> select * from language where id = 'abc';
sqlite>

We just get the prompt back—no language has the code abc. Is there a language with the code cat?

sqlite> select * from language where id = 'cat';
cat|Catio|catio|choco|-76.3333333333333|7.16666666666667|2009-08-07 22:17:23.846659
sqlite> -- Haha!  cat! The language code is cat. Oh I slay me. 

sqlite> select * from language where id = 'meh';
meh|Mehri|mehri|semitic|51.5|17|2009-08-07 22:17:32.847436
sqlite> -- Hurhur, "meh". How droll.

sqlite> select * from language where id = 'win';
win|Wintu|wintu|wintuan|-122.5|41|2009-08-07 22:17:35.187726
sqlite> -- Win! Clearly Wintu is full of win.

sqlite> select * from language where id = 'cya';
cya|Chatino (Yaitepec)|chatinoyaitepec|zapotecan|-97.25|16.25|2009-08-07 22:17:31.307560
sqlite> -- Okay, enough fo this foolishness, cya!

Yes, I am easily amused.

Back to the spreadsheet, let's put this SELECT request in familiar terms:

filtering for eng

You've probably used something like this filter dialog to search for particular rows in a spreadsheet. It's quite similar to the sort of SELECT statements we were running above. It simply says, "give me all the rows where the id field contains the value eng."

After we apply this filter in the spreadsheet, unsurprisingly, we can only see the row for English:

filtering for eng

We could have done the same thing to select Wintun:

filtering for win

You get the idea.

Here, There, and Everywhere: Fancier Uses of WHERE

The id fields are unique, so you'll only get result sets with one language if you search for a particular id like eng or win. But let's try querying some of the other fields.

An easy query that should return some plural results is to look up all the languages that belong to a particular language family. Let's start with a personal favorite, Kashaya, a Pomoan language of Northern California.

Let's try looking up Kashaya by name instead of language code:

sqlite> select * from language where name = 'Kashaya';
ksh|Kashaya|kashaya|pomoan|-123.333333333333|38.6666666666667|2009-08-07 22:17:26.844192

Voilà! Here's a trick to output that row in a bit more readable fashion:

sqlite> .mode line
sqlite> select * from language where name = 'Kashaya';
        id = ksh
      name = Kashaya
ascii_name = kashaya
  genus_id = pomoan
 longitude = -123.333333333333
  latitude = 38.6666666666667
  modified = 2009-08-07 22:17:26.844192

So, we can see that the genus_id is spelled pomoan. Let's find all the other Pomoan languages listed in WALS:

sqlite> select * from language where genus_id = 'pomoan';
ksh|Kashaya|kashaya|pomoan|-123.333333333333|38.6666666666667|2009-08-07 22:17:26.844192
pmc|Pomo (Central)|pomocentral|pomoan|-123.333333333333|39|2009-08-07 22:17:27.180125
pmn|Pomo (Northern)|pomonorthern|pomoan|-123.5|39.3333333333333|2009-08-07 22:17:27.503352
pme|Pomo (Eastern)|pomoeastern|pomoan|-122.666666666667|39|2009-08-07 22:17:32.730148
pso|Pomo (Southeastern)|pomosoutheastern|pomoan|-122.5|39|2009-08-07 22:17:35.563406

Golly, that's hard to read. Let's try just looking at the fields we care about. (Now you'll understand what the '*' was about: it means "all the fields".)

sqlite> select id, name, genus_id from language where genus_id = 'pomoan';
ksh|Kashaya|pomoan
pmc|Pomo (Central)|pomoan
pmn|Pomo (Northern)|pomoan
pme|Pomo (Eastern)|pomoan
pso|Pomo (Southeastern)|pomoan

As the WHERE clause indicates, we're just looking at the id, the name, and the genus_id here. To make it a bit clearer, we can use the .mode line trick again:

sqlite> select id, name, genus_id from language where genus_id = 'pomoan';
      id = ksh
    name = Kashaya
genus_id = pomoan

      id = pmc
    name = Pomo (Central)
genus_id = pomoan

      id = pmn
    name = Pomo (Northern)
genus_id = pomoan

      id = pme
    name = Pomo (Eastern)
genus_id = pomoan

      id = pso
    name = Pomo (Southeastern)
genus_id = pomoan

So now we know which Pomoan languages are included in WALS. It's a good sample, as it happens, as there are only 7 Pomoan languages and WALS has 5. Southern and Northeastern Pomo are the missing ones. So we can see here that WALS is not attempting to subsume all the data in ISO-639-3. WALS is a featural atlas, and so a broad variety of languages was chosen with regard to