This is a guide for accessing the Twitter archive of the School of Journalism and Mass Communication, University of Wisconsin-Madison. No prior knowledge is assumed, except the ability to read English. Of course, you need access to hadoop (contact Prof. Dhavan Shah). Linux or Mac is required for this guide.
Hadoop is the system which hosts the Twitter archive. The process of retrieving data from this archive, based on certain parameters such as time range, keywords, user names etc., is referred to as “making a pull” or “submitting a job”. The workflow is the following: write a job file (a short SQL script) describing what you want, submit it to hadoop, merge the raw output into a single csv file, and copy that csv to your own computer for analysis.
In this guide you'll learn how to do all that!
If you want to make this guide better, have questions or are facing any issues, you are more than welcome to email me.
Below are some commands that you should be comfortable with. This “language” is called “command line”. You can try these out in the terminal on your laptop. This is important because we will not be able to use a graphical user interface when we log into the cs server. Working with files, submitting jobs - everything will be done through the terminal. Here is a simple tutorial that you'd probably find very useful.
cd: Moving around
ls: Listing files and directories
rm: Remove files (and directories)
mkdir: Make directory
cp: Copy files/directories
mv: Move/rename files/directories

All of the above commands have a lot of accessible documentation, so I am not writing the details here. For example, if you do ls -l, you can view everything in a list format; ls -lh gives file sizes in “human readable” list format; ls -lt arranges files according to time last modified; and you can combine these like ls -lht.
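To get a feel for these, you could try a small throwaway session like the one below on your own laptop (the directory and file names are made up for illustration):

mkdir practice
cd practice
cp ~/Downloads/some_file.txt .
mv some_file.txt renamed_file.txt
ls -lh
cd ..
rm -r practice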
Important time saving tricks
SSH is a way to access a computer remotely. That is how we access the cs server.
In your terminal, type ssh username@tdc1.cs.wisc.edu and press ENTER. Here, username should be replaced by your own username (mine is aabhishek).

If you didn't get any error, it means you're in! On the left of your cursor you would see something like aabhishek@tdc1$, which verifies that you're on the tdc1.cs server. Use the commands from the previous section (cd, ls, etc.) to look around. This is your space on the cs server. When any hadoop pull is done, the resulting files will be in here. You would then copy those files to your computer.
To actually submit a job, view the datasets that are available, etc., we need to SSH again - this time into the hadoop server.
Type ssh hadoop1 from inside the cs server. If it worked, you should see username@hadoop1$ right next to the cursor.

You're in! This space that you're in is a bit different. Commands that we learnt change a bit in the following way:

ls: hadoop fs -ls
mkdir: hadoop fs -mkdir

Essentially, commands work the same as before, but with a hadoop fs - prefixed to them. If you do not prefix hadoop fs -, the commands output and work as if you were on the cs server. That is convenient, because then you can just stay on the hadoop server and, depending on what you want to do, use the prefix or not.
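For example, while logged into the hadoop server, these two commands list two different places:

ls : lists your home directory on the cs server
hadoop fs -ls : lists your space on hadoop (this is where the raw output of your pulls will show up)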
All of the steps in this section are to be done locally on your computer, and not the cs or hadoop server. If you're on a server, type exit and press ENTER to get yourself out.
Save the script file with the name hsubmit, but do not worry about the extension - there should be no extension.

Change your_username_goes_here to your own username (like aabhishek). There is only one place - line 13 - which you will be editing in this step.

In your terminal, go to the directory where hsubmit is. Enter chmod +x hsubmit. This gives the computer the permission to run hsubmit as a script.

What you did just now is set up a script hsubmit which can copy the contents of a job file from your computer to the hadoop server, and run it. Without this script there would be a lot of steps which you would have to do manually, so you're welcome (Devin Conathan wrote this originally). All the steps in this section need to be done just once, sort of like installing a program - that is why it is titled “Setup”. Now you are set to submit jobs.
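For the curious, here is a rough sketch of the kind of thing such a script does. This is NOT the actual hsubmit (Devin Conathan's script may differ in every detail); the paths and commands below are only assumptions used to illustrate the idea - copy the job file over, then run it on hadoop inside a logging screen session (which is where the screenlog.0 file mentioned in the troubleshooting section comes from).

#!/bin/bash
# Rough sketch only - not the real hsubmit. Usage: ./hsubmit test.sql
USERNAME=your_username_goes_here           # the one line you edit during setup
JOBFILE=$1
# copy the job file to your home directory on the cs server
scp "$JOBFILE" "$USERNAME@tdc1.cs.wisc.edu:~/"
# on the cs server, run the job on hadoop with hive inside a screen session;
# screen -L writes its output to screenlog.0 (useful for troubleshooting later)
ssh -t "$USERNAME@tdc1.cs.wisc.edu" "screen -L ssh -t hadoop1 hive -f '$JOBFILE'"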
Change your_username_goes_here at the top of the example job file test.sql to your own username (like aabhishek). If you skim the file you will notice that it looks in the table (database) gh_rc2 for the keyword gun during June 6, 2014. More on this later.

Go to the directory where hsubmit is. Run ./hsubmit test.sql in your terminal.

The output from test.sql is now on the hadoop server. It is not in csv format - it is scattered into tiny files. In order to get the job output as a single csv file:
On the hadoop server, do hadoop fs -ls. You should see gun when you do so, and the corresponding listed date of file creation should make sense. gun is the raw output, which consists of many tiny files - it is not a regular csv file, which is what we want.

Use the getmerge command to convert the raw gun to gun.csv, where the latter will be the final output. To do this, do hadoop fs -getmerge gun gun.csv.

If you do ls now, you will see gun.csv in the cs directory. gun.csv contains the data from your job, ready for analysis and publication!

Note 1: gun is the name of the raw output from the pull. The name of the raw output is mentioned in the top line of test.sql. You could change this in the SQL file, and that would change the output name.
Note 2: The general structure of the getmerge command is hadoop fs -getmerge rawoutputXYZ final.csv, where the name final is of your choosing, and rawoutputXYZ is the name of the raw output.
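Putting the above together, a typical post-job session on the hadoop server looks like this (using the gun example from test.sql):

hadoop fs -ls
hadoop fs -getmerge gun gun.csv
ls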
For doing any analysis, you need to copy the csv file from the above step to your own computer:

On your own computer (not on a server), run scp username@tdc1.cs.wisc.edu:~/gun.csv . (note the period at the end), where username is your username.

The file should now be on your computer - check with ls. Open it with some editor. It will be scrambled and columns won't be labelled (see Appendix for column names).

That finishes the first part of the guide, and it doesn't get much harder than this. In the following sections, you'll learn how to write your own jobs (in brief) and how to manage jobs that are running.
It would be useful to look at these examples while going over this section.
The first line in any SQL file is something like:
insert overwrite directory 'hdfs://hadoop1.cs.wisc.edu:8020/user/aabhishek/this_file'
This line determines what the output file will be named. In this example, it will be named this_file, and once the pull is finished you would be able to see it by doing hadoop fs -ls.

When you write a new script, you should change the file name to something descriptive and avoid overwriting a file that exists already. If you wanted to delete the output from our example, you can do hadoop fs -rm -r -skipTrash this_file.
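For example, if your pull were about gun control in June 2014, the first line might instead read something like the line below (guncontrol_june2014 is just a made-up descriptive name; keep your own username in the path):

insert overwrite directory 'hdfs://hadoop1.cs.wisc.edu:8020/user/aabhishek/guncontrol_june2014'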
The second thing in a SQL file is this:
id_str,
created_at,
user.id_str,
regexp_replace(user.name, '[\\s]+', " "),
user.screen_name,
regexp_replace(user.description, '[\\s]+', " "),
user.followers_count,
user.friends_count,
user.verified,
geo.type,
geo.coordinates,
regexp_replace(text, '[\\s]+', " "),
retweeted_status.id_str,
retweeted_status.created_at,
retweeted_status.user.id_str,
regexp_replace(retweeted_status.user.name, '[\\s]+', " "),
retweeted_status.user.screen_name,
regexp_replace(retweeted_status.user.description, '[\\s]+', " "),
retweeted_status.user.followers_count,
retweeted_status.user.friends_count,
retweeted_status.user.verified,
retweeted_status.geo.type,
retweeted_status.geo.coordinates,
regexp_replace(retweeted_status.text, '[\\s]+', " ")
This could be on a single line or spread over multiple lines (like this or this) - it doesn't matter.

This part tells hadoop which fields we are interested in. The fields specified above and in the example files are in fact all the available fields in our dataset, so you won't have to modify anything.
Data on hadoop exists in different “tables” (files). It is important to know which time period is contained in a table, so that we can decide whether to look in it or not. The contents of the tables can change (Alex Hanna maintains this, so it's up to them), but it happens rarely. To get information regarding this, you can do hdfs dfs -ls /user/ahanna/gh* from the hadoop server.
Today (July 8, 2018), doing so returns the following:
Found 2 items
-rw-r--r-- 2 ahanna hadoop 1699758080 2014-02-16 18:44 /user/ahanna/gh/gh.20120208.json.gz
-rw-r--r-- 2 ahanna hadoop 149444872392 2014-03-21 20:45 /user/ahanna/gh/gh.20140320.json
Found 5 items
drwxr-xr-x - ahanna hadoop 0 2016-07-18 23:49 /user/ahanna/gh_raw/year=2013
drwxr-xr-x - ahanna hadoop 0 2016-07-23 07:22 /user/ahanna/gh_raw/year=2014
drwxr-xr-x - ahanna hadoop 0 2016-12-02 01:22 /user/ahanna/gh_raw/year=2016
drwxr-xr-x - ahanna hadoop 0 2017-12-02 01:30 /user/ahanna/gh_raw/year=2017
drwxr-xr-x - ahanna hadoop 0 2018-03-02 01:35 /user/ahanna/gh_raw/year=2018
Found 2 items
drwxr-xr-x - ahanna hadoop 0 2014-03-16 13:06 /user/ahanna/gh_rc/year=2012
drwxr-xr-x - ahanna hadoop 0 2014-04-17 13:40 /user/ahanna/gh_rc/year=2013
Found 6 items
drwxr-xr-x - ahanna hadoop 0 2014-04-28 16:24 /user/ahanna/gh_rc2/year=2013
drwxr-xr-x - ahanna hadoop 0 2016-07-23 07:27 /user/ahanna/gh_rc2/year=2014
drwxr-xr-x - ahanna hadoop 0 2016-02-09 15:32 /user/ahanna/gh_rc2/year=2015
drwxr-xr-x - ahanna hadoop 0 2016-12-02 01:26 /user/ahanna/gh_rc2/year=2016
drwxr-xr-x - ahanna hadoop 0 2017-07-24 15:05 /user/ahanna/gh_rc2/year=2017
drwxr-xr-x - ahanna hadoop 0 2018-01-17 11:29 /user/ahanna/gh_rc2/year=2018
Found 3 items
drwxr-xr-x - ahanna hadoop 0 2017-03-17 15:54 /user/ahanna/gh_rc3/year=2016
drwxr-xr-x - ahanna hadoop 0 2017-12-02 01:39 /user/ahanna/gh_rc3/year=2017
drwxr-xr-x - ahanna hadoop 0 2018-03-02 01:39 /user/ahanna/gh_rc3/year=2018
Found 2 items
drwxr-xr-x - ahanna hadoop 0 2017-03-17 15:50 /user/ahanna/gh_rc3_raw/year=2016
drwxr-xr-x - ahanna hadoop 0 2017-03-07 14:23 /user/ahanna/gh_rc3_raw/year=2017
The tables that we are interested in are everything except the ones that have raw in their name because, well, they are raw and you can’t eat raw fruits. From above we know that if we want data from 2012, we need to look only in gh_rc; while for 2013, we need to look in gh_rc as well as gh_rc2. This information is useful for the line in the SQL file where you specify the table:
FROM gh_rc
If you want to select two tables (for 2013, for instance), you would have to write everything twice (like here). There is no way to select two tables at once (that I know of).
Important: Make sure you check for and remove duplicates while gathering data from multiple tables.
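A quick, rough way to do this after you have merged each pull into its own csv is to concatenate them and drop byte-identical lines - a sketch with hypothetical file names, assuming duplicated tweets come out as identical lines (if the two tables store slightly different values for the same tweet, you would instead need to deduplicate on the tweet id):

cat pull_2013_rc.csv pull_2013_rc2.csv | sort -u > pull_2013_dedup.csv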
This is the part where you specify what you are searching for. Here are some examples which I think are quite self-explanatory. Remember that using brackets is a good idea whenever you are unsure. For example:

(x AND y) OR z

would compute (x AND y) first, and then do the rest. In comparison, x AND y OR z is “riskier”, unless you remember the default ordering of operators.
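To tie the pieces together, here is a minimal sketch of a complete job file. The output name guns_sketch and the keyword conditions are made up for illustration, and the field list is shortened to three fields - in a real job you would paste in the full field list from the example files:

-- minimal sketch of a job file; output name and keywords are hypothetical,
-- and no date restriction is included (see test.sql for how that is done)
insert overwrite directory 'hdfs://hadoop1.cs.wisc.edu:8020/user/aabhishek/guns_sketch'
SELECT
id_str,
created_at,
regexp_replace(text, '[\\s]+', " ")
FROM gh_rc2
WHERE (lower(text) LIKE '%gun%' AND lower(text) LIKE '%control%')
OR lower(text) LIKE '%firearm%';

As written, the WHERE clause keeps a tweet if its text mentions both gun and control, or if it mentions firearm at all - the same (x AND y) OR z pattern discussed above.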
You can submit multiple jobs that would run consecutively in a single file. For example, see this. There are two jobs in that file called job1 and job2. The only thing to keep in mind while doing this is that a semicolon ; should separate the jobs, as you would notice at the end of line 5 in that file.
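A skeleton of such a file might look like the one below (output names and conditions are again hypothetical, and the field lists are shortened). The same pattern - the whole query written twice, separated by a semicolon - is also what pulling 2013 data from both gh_rc and gh_rc2 looks like:

insert overwrite directory 'hdfs://hadoop1.cs.wisc.edu:8020/user/aabhishek/job1'
SELECT id_str, created_at, regexp_replace(text, '[\\s]+', " ")
FROM gh_rc
WHERE lower(text) LIKE '%gun%';

insert overwrite directory 'hdfs://hadoop1.cs.wisc.edu:8020/user/aabhishek/job2'
SELECT id_str, created_at, regexp_replace(text, '[\\s]+', " ")
FROM gh_rc2
WHERE lower(text) LIKE '%gun%';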
Once you submit a job and start seeing something like:
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/aabhishek/hive_job_log_c1bce8d6-5daa-4927-9ac4-a716a2b3f06b_507858846.txt
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201802021028_0174, Tracking URL = http://hadoop1.cs.wisc.edu:50030/jobdetails.jsp?jobid=job_201802021028_0174
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201802021028_0174
Hadoop job information for Stage-1: number of mappers: 26; number of reducers: 0
2018-02-16 13:35:44,446 Stage-1 map = 0%, reduce = 0%
2018-02-16 13:35:54,502 Stage-1 map = 2%, reduce = 0%
2018-02-16 13:35:55,511 Stage-1 map = 11%, reduce = 0%
2018-02-16 13:35:56,524 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 52.29 sec
You can either close your terminal window (that will not kill the job) or press Ctrl + a followed by d. Essentially, once the job is submitted, the way in which you close things doesn't matter. The job is not running on your laptop, so you can close the laptop too - unless you want to watch the percent completion and you know that the job won't take that long.
You can ssh into hadoop and check if your job is still running by doing mapred job -list. This would show you something like:
$ mapred job -list
3 jobs currently running
JobId State StartTime UserName Priority SchedulingInfo
job_201804201535_0452 1 1529536111893 aabhishek NORMAL NA
job_201804201535_0455 1 1529616541563 aabhishek NORMAL NA
job_201804201535_0498 4 1530079836201 ahanna NORMAL NA
This means that 3 jobs are running right now, from 2 users aabhishek and ahanna, along with their JobIDs. It is good practice to keep track of your JobID when you are running more than one job, in case you decide to terminate a running job.
If doing mapred job -list does not list your job, it could mean that your job completed or crashed. In the former case, you would be able to see the output you expect by doing hadoop fs -ls, and in the latter you won't see anything on doing the same.
You need the JobID of the job you want to terminate. This can be obtained from mapred job -list. For example:
$ mapred job -list
3 jobs currently running
JobId State StartTime UserName Priority SchedulingInfo
job_201804201535_0452 1 1529536111893 aabhishek NORMAL NA
job_201804201535_0455 1 1529616541563 aabhishek NORMAL NA
job_201804201535_0498 4 1530079836201 ahanna NORMAL NA
To terminate the job at the top, you can do mapred job -kill job_201804201535_0452.
You need not go over this section if you’re new to hadoop.
Section 6.2 Selecting Fields from the Tables outlined how to select fields from a table. Different tables have different fields, however. The newer tables, like gh_rc3 from 2016, contain additional fields such as quoted_status. To check all the fields that are available in a particular table, you can follow the steps below (a sample transcript follows the list).
SSH into the hadoop server. You should see user@hadoop1 next to your cursor if you do that successfully.

Type hive and hit Enter. You should see something like hive> next to your cursor.

Type show tables; and hit Enter to see all the available tables on hadoop (note the semicolon at the end of the command). This would show you tables like gh_rc, gh_rc2, and others.

To see the fields of a particular table - say gh_rc3 - type describe gh_rc3; and hit Enter. The output might seem hard to read, but you'd see fields such as quoted_status there. If you did describe gh_rc;, you won't see quoted_status, because that particular functionality did not exist back in 2012/2013.

To get out of hive, type exit; and hit Enter. This should get you back to the user@hadoop1 level.
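In short, the whole thing looks like this at the terminal (only the commands you type are shown; the output is omitted):

hive
show tables;
describe gh_rc3;
exit;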
Below are some general things to keep in mind. If you are still stuck on something, please feel free to contact me.
If there is an error in the syntax of your SQL file, you will find out right when you submit the job. That is, you would never see the percentage completion (see Appendix). Instead, all output from your terminal would disappear and you would see the message screen is terminating. This means there is a mistake somewhere in the SQL file. This is probably the easiest bug to fix, because the job never ran to begin with.
The trickiest cases are those where the job was running, but later it does not show up in mapred job -list and neither is there any output on doing hadoop fs -ls. In such cases, the file screenlog.0 on the cs server contains the output from when the job was submitted, and can contain useful information on what the error was before the job crashed. Try to read and understand these errors, which would be at the end of the file right before the job crashed. You can check that screenlog.0 exists by doing ls, and open the file with software such as vim (do vim screenlog.0). There are plenty of resources on how to use vim.
For the above reason it is a very good idea to delete the screenlog.0 file before submitting a job. That way, you know that the contents of screenlog.0 reflect what happened to your most recent job.
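For example, while logged into the cs server (ssh username@tdc1.cs.wisc.edu), this removes the old log so that the next submission starts fresh:

rm screenlog.0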
If you cannot do ssh hadoop1 and it's not an issue of not having permissions, then contact the cs helpdesk. If there are some errors which look strange, let me know - these could be server-end issues too.
Looking at these examples might be useful if you are confused about something.
hdfs dfs -df -h: to check for available storage space on hadoop.

Sample output after submitting a job (this is what the “percentage completion” looks like):

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/aabhishek/hive_job_log_c1bce8d6-5daa-4927-9ac4-a716a2b3f06b_507858846.txt
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201802021028_0174, Tracking URL = http://hadoop1.cs.wisc.edu:50030/jobdetails.jsp?jobid=job_201802021028_0174
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201802021028_0174
Hadoop job information for Stage-1: number of mappers: 26; number of reducers: 0
2018-02-16 13:35:44,446 Stage-1 map = 0%, reduce = 0%
2018-02-16 13:35:54,502 Stage-1 map = 2%, reduce = 0%
2018-02-16 13:35:55,511 Stage-1 map = 11%, reduce = 0%
2018-02-16 13:35:56,524 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 52.29 sec
Column names in the output csv, by position:

0 tweet id
1 created at
2 user id
3 user name
4 user handle
5 user description
6 followers
7 friends
8 verified
9 geo type
10 coordinates
11 tweet
12 retweet id
13 retweet created at
14 retweet user id
15 retweet user name
16 retweet user handle
17 retweet user description
18 retweet user followers
19 retweet user friends
20 retweet user verified
21 retweet geo type
22 retweet coordinates
23 retweet text