Web 2.0 is about harnessing the potential of the Internet in a more collaborative and peer-to-peer manner with emphasis on social interaction.

February 18, 2013

Geographical network of refugee movements

What do Sigmund Freud, the Dalai Lama, Karl Marx, Aristotle Onassis, Bob Marley, Albert Einstein, Marlene Dietrich, Madeleine Albright, Victor Hugo, Frédéric Chopin and Andy Garcia have in common? Besides the fact that I am personally influenced by many of them, the one common factor is that they were all refugees during some phase of their lives. The term refugee is used for those who seek relief and refuge from economic, military, political, or social distress, including war, famine, or civil strife. The earliest known instance of refugees was triggered by the invasion of the Middle East by Sargon the Great of Mesopotamia (2270-2215 BC). Across the years, human conflicts, battles, wars and epidemics have been the root causes of human displacement. According to the UN High Commissioner for Human Rights, a refugee is "any person who: owing to a well-founded fear of being persecuted for reasons of race, religion, nationality, membership of a particular social group, or political opinion, is outside the country of his nationality, and is unable to or, owing to such fear, is unwilling to avail himself of the protection of that country". Refugees are spread around the world, with more than half in Asia and about 20 percent in Africa. United Nations High Commissioner for Refugees (UNHCR) numbers indicate that by the end of 2011, the number of forcibly displaced people worldwide exceeded 42.5 million. Of these, 15.2 million were refugees (10.4 million under UNHCR's mandate and 4.8 million Palestinian refugees), 895,000 were asylum-seekers and 26.4 million were internally displaced persons.

According to UN Secretary-General Ban Ki-moon, "Refugees have been deprived of their homes, but they must not be deprived of their futures."

We are interested in understanding refugee movement across countries over the past decade. Since we can model origin and destination countries as nodes and the movement of people as edges, we use social network analysis and visualisations for our purpose. Data about refugee movement across countries during 2000-2011 was obtained from UNHCR.

The network graphs and charts below trace annual as well as consolidated refugee movements from 2000-2011. Our consolidated dataset had 211 nodes and 6068 edges, with the nodes representing countries (origins and destinations of refugees) and the edges representing cross-country refugee movements. Table 1 gives the number of nodes and edges for each year. Figure 1 is a chart of global refugees from 2000-2011.

 

Table 1: Nodes and Edges

 

Year | Nodes | Edges
2000 | 192 | 3205
2001 | 188 | 3382
2002 | 193 | 3657
2003 | 193 | 3820
2004 | 192 | 3991
2005 | 192 | 4043
2006 | 197 | 4136
2007 | 199 | 4380
2008 | 199 | 4505
2009 | 202 | 4791
2010 | 204 | 4888
2011 | 205 | 4799

 

 

Figure 1: Global refugees from 2000-2011  

 

Global Refugee Numbers.jpg

Figure 2 is a source and destination country chart of the top ten global refugee movements from 2000-2011. The largest refugee movement was from Afghanistan to Pakistan (18,790,596), which accounted for 15% of the total, followed by Afghanistan to Islamic Republic of Iran (12,423,106 - 10%), Iraq to Syrian Arab Republic (6,146,922 - 5%), Burundi to United Republic of Tanzania (4,030,134 - 3%) and Vietnam to China (3,591,405 - 3%), followed by others.

 

Figure 2: Country-wise global refugee movements from 2000 - 2011 

 
 

Global Refugee Stats.jpg

We list some key network graphs below. The first set of graphs is based on aggregated data for the years 2000-2011 and the second set is based on annual data for the years 2000 and 2011. We have not included results for the other years in order to keep this blog short. Figures 3 & 4 are network graphs of refugee inflows and outflows from 2000-2011. The network graphs are based on weighted indegree and weighted outdegree measures. The top ten refugee origins and destinations are also listed.
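As a rough illustration of the weighted indegree and outdegree measures mentioned above, the sketch below treats each movement as a directed, weighted edge from origin to destination. It uses only the five largest 2000-2011 flows quoted earlier in this post, so the rankings it prints are not the full results; the function names are our own and are not part of any tool used for the original graphs.

```python
from collections import defaultdict

# (origin, destination, refugees): the five largest 2000-2011 flows quoted in this post.
flows = [
    ("Afghanistan", "Pakistan", 18_790_596),
    ("Afghanistan", "Islamic Rep. of Iran", 12_423_106),
    ("Iraq", "Syrian Arab Rep.", 6_146_922),
    ("Burundi", "United Rep. of Tanzania", 4_030_134),
    ("Vietnam", "China", 3_591_405),
]


def weighted_degrees(edges):
    """Return (weighted_indegree, weighted_outdegree) dictionaries."""
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for origin, destination, weight in edges:
        outdeg[origin] += weight        # people leaving the origin country
        indeg[destination] += weight    # people arriving in the destination country
    return dict(indeg), dict(outdeg)


def top_n(scores, n=10):
    """Sort countries by score, largest first, and keep the first n."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


indegree, outdegree = weighted_degrees(flows)
print("Top destinations:", top_n(indegree, 3))
print("Top origins:", top_n(outdegree, 3))
```

With the full UNHCR edge list in place of `flows`, the same two dictionaries would give the rankings used for the top ten lists.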

Figure 3: Network graphs of refugee Inflows from 2000 - 2011

 Indegree Consolidated.jpg

  Top 10: Pakistan, Islamic Rep. of Iran, Germany, Syrian Arab Rep., United Rep. of Tanzania, United States, Kenya, China, United Kingdom, Jordan

 

Figure 4: Network graphs of refugee Outflows from 2000 - 2011

 

Outdegree Consolidated.jpg

Top 10: Afghanistan, Iraq, Somalia, Sudan, Dem. Rep. of the Congo, Burundi, Vietnam, Occupied Palestinian Territory, Angola, Myanmar

 

Figure 5 has network graphs of refugee flows during 2011. The network graphs are based on weighted indegree as well as outdegree measures. The top ten refugee origins/destinations are also listed. 

 

Figure 5: Network graphs of refugee flows during 2011

 

Graph of Refugee Inflows - 2011

Indegree 2011.jpg

Indegree 2011 - close.jpg

Top 10: Pakistan, Islamic Rep. of Iran, Syrian Arab Rep., Germany, Kenya, Jordan, Chad, China, Ethiopia, United States

Graph of Refugee Outflows - 2011

Outdegree 2011.jpg

Outdegree 2011 close.jpg

Top 10: Afghanistan, Iraq, Somalia, Sudan, Dem. Rep. of the Congo, Myanmar, Colombia, Vietnam, Eritrea, China

 

Figure 6 has network graphs of refugee flows during 2000. The network graphs are based on weighted indegree as well as outdegree measures. The top ten refugee origins/destinations are also listed.   

 

Figure 6: Network graphs of refugee flows during 2000 

 

Graph of Refugee Inflows - 2000

Indegree 2000.jpg

Indegree 2000 close.jpg

Top 10: Pakistan, Islamic Rep. of Iran, Germany, United Rep. of Tanzania, United States, Serbia (and Kosovo: S/RES/1244 (1999)), Guinea, Sudan, Dem. Rep. of the Congo, China

Graph of Refugee Outflows - 2000

Outdegree 2000.jpg

Outdegree 2000 close.jpg

Top 10: Afghanistan, Iraq, Somalia, Sudan, Dem. Rep. of the Congo, Myanmar, Colombia, Vietnam, Eritrea, China

 

 

 
Conclusions
1. During 2000-2011, the largest refugee movement was from Afghanistan to Pakistan (18,790,596), which accounted for 15% of the total, followed by Afghanistan to Islamic Republic of Iran (12,423,106 - 10%), Iraq to Syrian Arab Republic (6,146,922 - 5%), Burundi to United Republic of Tanzania (4,030,134 - 3%) and Vietnam to China (3,591,405 - 3%), followed by others.
 
2. In 2011, the largest refugee movements were from Afghanistan to Pakistan (1,701,945), which was 16% of total global refugees, followed by Afghanistan to Islamic Republic of Iran (840,451 - 8%), Iraq to Syrian Arab Republic (750,000 - 7%),  Somalia to Kenya (517,666 - 5%), Iraq to Jordan (450,000 - 4%).
 
3. A very large majority of the world's refugees flee to their neighboring countries (e.g. Afghanistan to Pakistan, Afghanistan to Islamic Republic of Iran, Iraq to Syrian Arab Republic, Burundi to United Republic of Tanzania, Vietnam to China, Iraq to Jordan, Somalia to Kenya, Sudan to Chad, Eritrea to Sudan etc.)
 
4. A large number of refugees originate from the poorest countries in Asia and Africa, and a large majority of them move to other poor neighboring countries, which have very limited resources to host them. The one exception is Germany, which was consistently a top 5 refugee destination country in every single year from 2000-2011.
 
5. Among economically advanced countries, the largest recipients of refugees were Germany, United States, United Kingdom, France, Canada, Sweden, Netherlands, Switzerland and Australia.
 
6. Analysis of data across the years 2000 - 2011 reveals what has changed in the world of refugees and what has not. Across all the years, Afghanistan is the largest source of origin of refugees and except for 2002 and 2004, Pakistan is the largest destination of refugees.
 

January 3, 2013

Network Structure of Music Genres and their Stylistic Origins

According to musicologist Jean-Jacques Nattiez (Musicologie générale et sémiologie, 1987), the border between music and noise is always culturally defined. Composer John Cage thought that any sound can be music and said, "There is no noise, only sound." Music has been an important part of many cultures, and there are several hundred popular music genres and subgenres. Needless to say, the origin and style of many musical genres and subgenres can be traced to other genres and subgenres. Let us explore the various musical genres and their stylistic origins further. The graphs below trace the various genres and their stylistic origins according to data from DBpedia/Wikipedia. The following are some caveats with regard to the data and methodology:

  • The data in Wikipedia is not complete, and its completeness is limited by the topics and updates contributed by users. For example, several popular Indian genres such as Ghazal, Qawwali, Bhajan, Lavni, Kirtan, Thumri, Tarana, Kajri, Dhrupad etc., lack rich RDF representations. A similar scenario exists for several African, Asian and Latin American genres. This is a case for creating rich Wikipedia entries for these music genres.
  • The data is spread across musical genres over several generations, and we do not have the specific timeline of the stylistic origin of the genres. Therefore a genre that is influenced by a stylistic origin during a certain time period could in turn be the stylistic origin of another genre during another time period. For example, the stylistic origin of Electro can be traced to Funk, and Funk in turn owes its origin to Jazz.
  • There could be stylistic origins and genres which are closely related, but not captured in Wikipedia.
  • We explored only the musical genres listed in the English language section of Wikipedia, which in turn could have been limited by the extent of English language penetration and usage in various countries.
  • Another limiting factor for global representation in Wikipedia could be the penetration of Internet and computers.

A musical genre could be associated with more than one stylistic origin. For example, the stylistic origins of Hip Hop can be traced to Rhythm and blues, Reggae, Funk, Disco, Dancehall, Scat singing etc. Our graph takes the musical genres and their stylistic origins as nodes, and the edges represent their relationships. Our data has 1058 nodes and 2767 edges. Let us go over the graphs and results for the various centrality measures.
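To make the graph construction concrete, here is a minimal sketch that builds directed edges from a stylistic origin to the genre it influenced, using only the examples named in this post (Hip Hop's origins, and the Jazz to Funk to Electro chain). The edge direction and the underscore naming are our assumptions for illustration, and the real dataset is of course far larger.

```python
from collections import defaultdict

# Directed edges (stylistic origin -> genre), taken from the examples in the text.
edges = [
    ("Rhythm_and_blues", "Hip_hop"),
    ("Reggae", "Hip_hop"),
    ("Funk", "Hip_hop"),
    ("Disco", "Hip_hop"),
    ("Dancehall", "Hip_hop"),
    ("Scat_singing", "Hip_hop"),
    ("Funk", "Electro"),
    ("Jazz", "Funk"),
]

nodes = set()
in_degree = defaultdict(int)   # number of stylistic origins a genre draws on
out_degree = defaultdict(int)  # number of genres an origin has fed into
for origin, genre in edges:
    nodes.update((origin, genre))
    out_degree[origin] += 1
    in_degree[genre] += 1

print(len(nodes), "nodes,", len(edges), "edges")
print("Hip_hop in-degree:", in_degree["Hip_hop"])
print("Funk out-degree:", out_degree["Funk"])
```

Note that Funk appears both as a genre (with Jazz as its origin) and as an origin, which is exactly the timeline caveat listed above.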

Overview Graph of Music Genres and their Stylistic Origins

 

Origin - Music - 1.jpg

 

Authority 

 

 

Origin - Music - Authority.jpg

Betweenness

Origin - Music - Betweeness.jpg

Closeness 

 

Origin - Music - Closeness.jpg 

Degree 


Origin - Music -  Degree.jpg


Hub

 

Origin - Music - Hub.jpg

  

Indegree

 

Origin - Music - Indegree.jpg 

Outdegree
 

Origin - Music - Outdegree.jpg 

Observations & Conclusions

Wikipedia entries with respect to Musical genres and their stylistic origins follow power-law degree distribution and exhibit characteristics of Scale Free Networks. In other words, the nodes are not evenly connected and have several highly connected nodes which govern the property of the network. Past research by Albert-László Barabási and colleagues found that World Wide Web links have scale free network characteristics. Similar results have been proposed for biological networks, and social networks. It will be interesting to see if similar characteristics can be found for other Wikipedia entries. 
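One informal way to look for a power-law degree distribution is to count how many nodes have each degree and check whether the counts fall roughly on a straight line on log-log axes. The sketch below does this with a plain least-squares fit on synthetic degrees. It is only a quick diagnostic under our own assumptions, not the procedure used for the chart in this post, and a straight line on log-log axes is suggestive rather than a rigorous test.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Map each degree k to the number of nodes that have that degree."""
    return Counter(degrees)


def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(degree).

    Degree 0 is ignored. At least two distinct positive degrees are needed.
    """
    points = [(math.log(k), math.log(c)) for k, c in distribution.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic degrees constructed so that count(k) = 144 / k**2 for k = 1, 2, 3, 4.
synthetic = [1] * 144 + [2] * 36 + [3] * 16 + [4] * 9
dist = degree_distribution(synthetic)
print("estimated slope:", round(loglog_slope(dist), 3))
```

For the genre network, `synthetic` would be replaced by the list of node degrees; a clearly negative slope with a long tail of high-degree nodes would be consistent with the scale-free reading above.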

Degree Distribution

 

Degree Distribution.jpg

We observed 30 prominent clusters and 17 smaller clusters (2 or 3 entities).

 

 

Music Origins 7.jpg 

The table below lists the most important nodes of the network for the selected centrality measures. The in-degree centrality of a node is the number of connections that the node receives from other nodes. For example, in our results, the top six in-degree centrality values belong to Girl_group, Dark_psytrance, Breakcore, Boy_band, New_Age and Chillwave. In other words, these genres draw their origins from a large number of other musical genres. The out-degree centrality of a node is the number of connections that the node sends to other nodes, and the top five in our study are Hip_hop, Punk_rock, Jazz, Pop and Funk, which means that these musical genres have given rise to the largest number of other genres. The closeness centrality measures indicate the influence of several regional as well as country-specific genres and their origins, such as Kansas City Jazz, Music of India, France, Folk music of England etc.

 

In-Degree | Out-Degree | Degree | Closeness | Betweenness | Eigenvector
Girl_group | Hip_hop | Hip_hop | African_American | Synthpop | Dark_psytrance
Dark_psytrance | Punk_rock | Punk_rock | Cakewalk | Punk_rock | Chillwave
Breakcore | Jazz | Jazz | American_march | Country | Breakcore
Boy_band | Pop | Funk | Stride | Progressive_rock | Wonky
New_Age | Funk | Pop | Dixieland | Post-punk | Psybient
Chillwave | Rhythm_and_blues | Rhythm_and_blues | Bounty | Jazz_fusion | Wonky_music
Big_beat | Folk | Synthpop | Ragtime | New_Wave | Big_beat
New_rave | Synthpop | House | Cool_jazz | Rock_and_roll | New_rave
Industrial_rock | Hardcore_punk | Post-punk | Romanticism | Rock | Industrial_rock
New_Wave | House | Rock_and_roll | Music_of_India | Post-bop | Post-punk_revival
Noise | Post-punk | Folk | Medieval | Hip_hop | Electroclash
Trip_hop | Psychedelic_rock | Psychedelic_rock | Kansas_City_jazz | Techno | Alternative_dance
Hip_hop | Techno | Techno | Modal_jazz | Industrial | Trip_hop
Wonky | Blues | Industrial | Hard_bop | Psychedelic_rock | Liquid_funk
Liquid_funk | Electronic | New_Wave | Impressionist | House | Girl_group
Wonky_music | Rock_and_roll | Hardcore_punk | Bebop | Bluegrass | Post-rock
Ambient | Industrial | Rock | Folk_of_Scotland | Glam_rock | Dubtech
Mod_revival | Rock | Blues | Afro-American | Disco | Hard_NRG
Chinese_ambient | Heavy_metal | Electronic | Music_of_Scotland | Rhythm_and_blues | Shoegazing

  
 

December 19, 2012

Networks of key decision makers

A board of directors consists of a group of elected members who oversee the activities of a company. They are representatives of the shareholders and play an important role in decisions on major company issues including dividends, mergers and acquisitions, senior appointments, compensation, etc. Boards of directors can be modelled as a two-mode network, and their members' seats on multiple company boards can be modelled as an interconnected network with the board memberships acting as the ties. One way of representing the strength of a tie is the number of board memberships.

In this research, we model the board membership of the 100 most valuable companies in India. We collected the list of the Top 100 Companies in India by Market Capitalization listed on BSE. We then looked up their board memberships and analysed the network. The network below shows the board membership of three organisations in our sample. The red nodes are the companies and the black nodes are members of the board.

 

  

Small Group network 1.png 

We grouped the sample of 100 companies and saw the emergence of 46 clusters. The figures below give details of the clusters as well as their interconnections through board memberships. The links show board membership ties between the members.

  

Bod Groups 1.jpg

  

Bod Groups 2.jpg

The network graphs below are varying visualisations of companies and their board membership.

 

Bod Network 1.jpg

 

Bod Network 2.png  

Bod Network 3.jpg

 
The table below lists the most important nodes of the network for the selected centrality measures (in-degree and out-degree centralities). The in-degree centrality of a node is the number of connections that the node receives from other nodes. For example, in our sample Dr. Omkar Goswami has the highest in-degree centrality, as he is on the board of 6 companies in our sample set of the top 100 companies in India by market capitalization listed on BSE. The out-degree centrality of a node is the number of connections that the node sends to other nodes. Larsen and Toubro has the highest out-degree centrality, as the company has 28 board members.
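The two-mode structure and the degree measures described above can be sketched in a few lines. The company and director names below are placeholders, not data from our sample, and the projection step (tying two companies together once for every director they share) is one common way to express tie strength, shown here as an assumption rather than as the exact procedure behind the figures.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical company -> director memberships (placeholder names).
memberships = [
    ("Company A", "Director 1"),
    ("Company A", "Director 2"),
    ("Company B", "Director 2"),
    ("Company B", "Director 3"),
    ("Company C", "Director 2"),
]

boards_per_director = defaultdict(set)    # size = in-degree of a director node
directors_per_company = defaultdict(set)  # size = out-degree of a company node
for company, director in memberships:
    boards_per_director[director].add(company)
    directors_per_company[company].add(director)

# One-mode projection: two companies gain one unit of tie strength per shared director.
shared_directors = defaultdict(int)
for director, companies in boards_per_director.items():
    for a, b in combinations(sorted(companies), 2):
        shared_directors[(a, b)] += 1

busiest = max(boards_per_director, key=lambda d: len(boards_per_director[d]))
print(busiest, "sits on", len(boards_per_director[busiest]), "boards")
print(dict(shared_directors))
```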

  

Table 1.jpg 

This was a simple illustration of the network of key decision makers of top 100 companies in India by market capitalization listed on BSE. The network between the board members of larger sample sets could get denser.

 

 

 

 

 

 


 

November 11, 2012

Tale of Two Storms: Hurricane Sandy & Cyclone Nilam

Social media sources have played a key role in disaster reporting, relief and rescue efforts. Social media destinations such as Twitter, Flickr and Facebook were leveraged extensively to spread information during the Victorian bushfires in Australia in 2009. In this blog, we explore the role played by government, civic authorities, law and order, the general public, celebrities, activists, journalists and media in the event of two natural disasters: Hurricane Sandy and Cyclone Nilam.

The terms hurricane and typhoon are regional names for a strong tropical cyclone. Once a tropical cyclone reaches winds of at least 17 m/s, it is called a tropical storm and is assigned a name. If winds reach 33 m/s, it is called a hurricane in the North Atlantic Ocean, Eastern Pacific and South Pacific Ocean, a typhoon in the Western Pacific and a cyclone in the Southern Pacific and Indian Ocean. In the recent past, we experienced two significant simultaneous cyclones: Hurricane Sandy on the East Coast of the US and Cyclone Nilam on the East Coast of India. Hurricane Sandy is the largest Atlantic hurricane on record, as well as the second costliest Atlantic hurricane, surpassed only by Hurricane Katrina in 2005. Hurricane Sandy struck in late October 2012. In the United States, Sandy caused severe damage in New Jersey and New York. It has claimed more than 50 lives, left millions without power and caused over US$ 50 billion in damage in the United States. Loss of life and damage to property were also spread across Jamaica, Haiti, the Dominican Republic, Puerto Rico, Cuba and The Bahamas. Cyclonic Storm Nilam, which struck India in late October 2012, caused damage across Indian states such as Tamil Nadu, Andhra Pradesh, Karnataka and Odisha.

According to Semiocast, the United States has 141.8 million Twitter users, and India has over 15 million Twitter users. This depicts a healthy Twitter user base in the United States and a fast growing Twitter user base in India. We collected tweets containing the hashtag #Sandy from Oct 30th until Nov 6th. We collected 500,036 tweets from 306,348 users. We also collected about 1500 tweets with the hashtag #Nilam on 31st Oct 2012. We classified the #Sandy tweets into four sets based on the time interval in which the tweets occurred. Time Interval 1 has 180,489 nodes and 232,578 edges, Time Interval 2 has 60,351 nodes and 65,842 edges, Time Interval 3 has 31,033 nodes and 31,039 edges, and Time Interval 4 has 14,550 nodes and 13,350 edges. The network graphs, tables and charts below unravel the role of social media in disaster reporting, relief and rescue efforts by government, civic authorities, law and order, the general public, celebrities, activists, journalists and media.
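The splitting of tweets into time intervals, followed by a node and edge count per interval, can be sketched as follows. The post does not state where the interval boundaries were drawn, so the four equal-width intervals here are purely an assumption, as are the user names and the record layout.

```python
from datetime import datetime


def interval_index(timestamp, start, end, n_intervals=4):
    """Return the 0-based index of the equal-width interval containing timestamp."""
    if not (start <= timestamp <= end):
        raise ValueError("timestamp outside collection window")
    width = (end - start) / n_intervals
    index = int((timestamp - start) / width)
    return min(index, n_intervals - 1)  # the end instant belongs to the last interval


def network_size(edges):
    """Count distinct nodes and distinct directed edges."""
    edge_set = set(edges)
    nodes = {user for edge in edge_set for user in edge}
    return len(nodes), len(edge_set)


start = datetime(2012, 10, 30)
end = datetime(2012, 11, 7)  # assumption: the window runs through the end of Nov 6

# Hypothetical (timestamp, tweeting user, user being retweeted or mentioned) records.
records = [
    (datetime(2012, 10, 30, 9), "user_a", "user_b"),
    (datetime(2012, 10, 30, 18), "user_c", "user_b"),
    (datetime(2012, 11, 2, 12), "user_a", "user_d"),
    (datetime(2012, 11, 6, 23), "user_e", "user_b"),
]

buckets = {i: [] for i in range(4)}
for ts, source, target in records:
    buckets[interval_index(ts, start, end)].append((source, target))

for i, edges in buckets.items():
    print("Time Interval", i + 1, network_size(edges))
```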

 

Hurricane Sandy: Time Interval 1 

 

Sandy Time 1 - 1.jpg

  

Sandy Time 1 - 2.jpg

 

Hurricane Sandy: Time Interval 2
 

Sandy Time 2 - 1.jpg

 

   

Sandy Time 2 - 2.jpg

 

Hurricane Sandy: Time Interval 3
 

Sandy Time 3 - 2.jpg 

 Hurricane Sandy: Time Interval 4

 Sandy Time 4 - 2.jpg

 

 Hurricane Sandy: Distribution of most important Tweeters based on Network Centrality measures 

 

Graph.jpg

 

Meanwhile in some other part of the graph, we observed the following clusters. 

 

Ot.jpg  

Cyclone Nilam

   Nilam.jpg  

Conclusions

1. Sandy was 'Instagrammed': Number of #Sandy tweets originating from Instagram ranked fourth, preceded only by Tweets originating from iPhone, Web and Android and in the case of Nilam, Twitpic scored high. This is an interesting social behavior where in disaster scenarios, visual means of message propagation assume prominence. This is in contrast to our observations with regard to message propagation during the Olympic Games, where Tweets originating from Instagram scored very low.

2. Government and officials leveraging social media: In the case of Hurricane Sandy, @NYGovCuomo, the official Twitter account of the Governor of New York State, Mr. Andrew Cuomo, consistently emerged as one of the most important entities. We saw no such trend in the case of Cyclone Nilam, where we did not see even a single tweet from a government official or utility.

3. News Media leveraging Social Media: Journalists and News Media extensively leveraged Social Media for news propagation and for news amplification. News Media sources and Journalists were the heaviest users of Social media in the events of both Sandy and Nilam, supporting the findings from previous research. Individual Journalists leveraged their Twitter follower base to spread messages faster and they invariably figured higher up in the network centrality measures as against the twitter accounts of Media houses which employ them.

4. Cyclone Nilam saw a low overall coverage in Social Media in India and surrounding affected regions.

5. Activist groups such as Anonymous etc. leveraged social media for message propagation in the case of Hurricane Sandy.

6. Celebrities in the United States were active in social media during Hurricane Sandy as against celebrities in India who chose to ignore Cyclone Nilam.

 

October 23, 2012

Visualising the Third Presidential Debate

The third Presidential Debate concluded just a few hours ago. The two network graphs visualize the entire debate content. The table below lists the most influential keywords used by the two presidential contestants.

 

President Obama

Obama - Third.png

Mitt Romney

 

Romney - Third.png

 

  

 Most Influential keywords

 

President Obama | Mitt Romney
make (1527) | make (894)
american (395) | world (860)
world (370) | year (782)
governor (367) | president (726)
kind (352) | people (590)
military (335) | america (558)
region (330) | nation (441)
iran (320) | military (419)
job (303) | number (412)
america (293) | state (256)
thing (256) | thing (211)
making (238) | work (209)
country (221) | time (203)
leadership (191) | syria (193)
office (191) | back (182)

 

 

 

October 19, 2012

Visualising the Second Presidential Debate

Presidential Debates in the United States take place between the candidates of the two largest parties. Critical topics are discussed in these debates, and they tend to sway the general public. According to Nielsen Ratings, the second Presidential debate on 16th Oct 2012 had an estimated viewership of 65 million. This is a slight dip from the first debate, which attracted over 67 million viewers. The table below lists the most influential keywords used by the two presidential contestants. The two network graphs visualize the entire debate content.

 

Most Influential keywords

President Obama | Mitt Romney
make (2004) | people (1290)
governor (1060) | president (1233)
job (611) | year (826)
tax (510) | make (677)
romney (458) | job (581)
people (372) | percent (356)
year (326) | bring (336)
country (284) | america (329)
world (230) | tax (245)
create (211) | country (235)
folk (200) | question (177)
education (183) | work (166)
thing (181) | policy (138)
family (178) | back (136)
energy (166) | happen (135)

 

 

President Obama

 

Obama.png

 

Mitt Romney

  

Romney.png 

 

 

 

September 21, 2012

Visualising State of the Union Messages

State of the Union Messages to the Congress are mandated by Article II, Section 3 of the United States Constitution, which states, "He shall from time to time give to the Congress information of the state of the union, and recommend to their consideration such measures as he shall judge necessary and expedient". Since 1790, State of the Union messages have been delivered regularly at approximately 1 year intervals (Source: The American Presidency Project http://www.presidency.ucsb.edu/sou.php#axzz274c0J8EF ). The State of the Union Message is delivered near the beginning of each session of Congress. It is a report on the condition of the country and a platform for presidents to outline their legislative agenda and priorities. Over the years, the messages have emerged as a communication between the president and the people of the United States. The messages are available at http://www.presidency.ucsb.edu/sou.php#axzz274c0J8EF. We are interested in examining what President Barack Obama and past President George Bush have stressed in their speeches. We used Textexture (http://textexture.com/) to visualize the State of the Union Messages by President George Bush (2005-08) and President Barack Obama (2009-2012). The network visualisations and results from Textexture are given below. Each network has an average of 100 words (nodes) and about 1000 edges (co-occurrences). In the network graphs below, the words are the nodes and their co-occurrences are the connections between them.
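To illustrate what a word co-occurrence network looks like, the sketch below links words that appear close to one another in a text. It is not Textexture's actual algorithm, which we have not reproduced; the stopword list, the two-word look-ahead and the function name are all our own illustrative choices. The sample sentence is the constitutional passage quoted above.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real analysis would use a fuller list.
STOPWORDS = {"he", "shall", "from", "to", "the", "of", "and", "their", "such", "as"}


def cooccurrence_edges(text, span=2):
    """Count undirected links between each word and the next `span` words after it."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    edges = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + span]:
            if other != word:  # skip self-links such as "time ... time"
                edges[tuple(sorted((word, other)))] += 1
    return edges


sample = "He shall from time to time give to the Congress information of the state of the union"
edges = cooccurrence_edges(sample)
nodes = {word for pair in edges for word in pair}

print(len(nodes), "nodes,", len(edges), "edges")
print(edges.most_common(3))
```

The edge counts can then serve as weights when computing which words are most central in the network.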

 

President Barack Obama - 2009

11.jpg

Most influential keywords: American, Year, Economy, America

Most influential contexts:
#0:   american    people    bank    country
#1:   year    economy    crisis    energy
#2:   america    time    make    large
#3:   job    million    create    industry

President Barack Obama - 2010

22.jpg

Most influential keywords: Year, American, Job, Work

Most influential contexts:
#0:   year    deficit    decade    office
#1:   american    work    people    time
#2:   job    america    economy    nation
#3:   business    tax    small    cut
    
 

 

 

President Barack Obama - 2011

33.jpg

Most influential keywords in this text: Job, Year, People, American

Most influential contexts:
#0:   job    people    american    tax
#1:   year    work    country    school
#2:   america    nation    place    energy
#3:   make    business    support    republican

President Barack Obama - 2012

44.jpg

Most influential keywords in this text: American, Job, Business, Tax

Most influential contexts:
#0:   american    year    work    million
#1:   job    business    create    back
#2:   tax    company    pay    debt
#3:   america    country    state    build
     

 

 

President George Bush - 2005

55.jpg

Most influential keywords in this text: Make, People, Freedom, America

Most influential contexts:
#0:   make    american    citizen    good
#1:   people    freedom    america    iraq
#2:   security    year    great    million
#3:   government    child    retirement    account

President George Bush - 2006

66.jpg

Most influential keywords in this text: America, World, American, Freedom

Most influential contexts:
#0:   america    lead    life    require
#1:   world    american    people    nation
#2:   freedom    hope    great    iraqi
#3:   year    science    effort    cut
     

 

 

President George Bush - 2007

77.jpg

Most influential keywords in this text: America, Iraq, American, People

Most influential contexts:
#0:   america    citizen    nation    united
#1:   iraq    government    force    oil
#2:   american    people    health    free
#3:   tonight    congress    country    law

President George Bush - 2008

88.jpg

Most influential keywords in this text: American, Congress, People, Year

Most influential contexts:
#0:   american    people    nation    future
#1:   congress    good    million    agreement
#2:   year    america    enemy    state
#3:   iraq    terrorist    iraqi    force
    

  

   

September 10, 2012

How the Paralympic Games were Tweeted

The Paralympic Games have come a long way from the earliest event in 1960, which was held in Rome exclusively for war veterans and attracted 400 athletes from 23 countries. The London 2012 Paralympic Games saw the participation of 4,200 athletes from 147 countries, taking part in 21 sports. The International Paralympic Committee has defined six disability categories across physical and intellectual disabilities. These are: Amputee, Cerebral Palsy, Intellectual Disability, Wheelchair, Visually Impaired and Les Autres (athletes who do not fall under the other five categories). In this blog we explore how the Paralympic Games were tweeted. We collected Twitter data with the hashtag #paralympics via 140kit.

A sample of 82,560 tweets across 54,522 users was collected between Aug 30th and Sept 6th 2012. The table below lists the overall statistics of the top measures with regard to language, source application from which the tweet originated, location and hashtag. The network graph was created from the entire network of users who retweeted or mentioned one another. In this dataset, there were 48,956 nodes and 38,959 edges. There were 42,881 retweets, 229 mentions, 546 average followers, 354 average friends and 113 average favourites. The figure below shows the graph of this network of users, created using Gephi. Each dot in the graph is a network node and corresponds to the Twitter account of a user. The links are the edges and correspond to users who retweeted or mentioned one another.
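A retweet and mention network of this kind can be derived from raw tweet text roughly as follows. This is a simplified sketch under our own assumptions: it recognises only old-style retweets that begin with "RT @user", treats every other "@user" as a mention, and uses made-up account names. It is not the extraction logic of 140kit or Gephi.

```python
import re

MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT @(\w+)")


def tweet_edges(author, text):
    """Yield (source, target, kind) edges for a single tweet."""
    retweet = RETWEET.match(text)
    if retweet:
        # A retweet links the retweeting user to the original author only.
        yield (author, retweet.group(1), "retweet")
        return
    for name in MENTION.findall(text):
        yield (author, name, "mention")


# Hypothetical tweets: (author, text).
tweets = [
    ("user_a", "RT @user_b: What a race! #paralympics"),
    ("user_c", "Well done @user_d and @user_e #teamGB"),
    ("user_f", "Watching the #paralympics tonight"),
]

edges = [edge for author, text in tweets for edge in tweet_edges(author, text)]
nodes = {user for source, target, _ in edges for user in (source, target)}

print(len(nodes), "nodes,", len(edges), "edges")
```

In this sketch a user who neither retweets nor mentions anyone (such as `user_f`) contributes no node, so the node count can be smaller than the number of users in the sample.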

 

Overall Statistics

  

Language | Number of Users | Source Application | Number of Users | Location | Number | Hashtag | Number of Users
English | 80446 | Twitter for BlackBerry | 22820 | London | 13042 | #teamGB | 1075
Dutch | 948 | Web | 21283 | Amsterdam | 6481 | #paralympics | 1057
Spanish | 246 | Twitter for iPhone | 12917 | Casablanca | 2038 | #London2012 | 530
German | 213 | Twitter for Android | 10947 | Hawaii | 1410 | #inspirational | 379
French | 205 | Mobile Web | 2502 | Greenland | 799 | #isitok | 285
Swedish | 166 | UberSocial for BlackBerry | 2277 | Edinburgh | 684 | #amazing | 276
Norwegian | 93 | TweetDeck | 1790 | Dublin | 533 | #respect | 254
Japanese | 56 | Twitter for iPad | 1182 | Pacific Time | 495 | #c4paralympics | 241
Italian | 53 | Echofon | 562 | Eastern Time | 456 | #inspiring | 213
Portuguese | 51 | HootSuite | 512 | Central Time | 408 | #PPProud | 187
Turkish | 16 | txt | 505 | Athens | 383 | #Paralympics2012 | 178
Russian | 16 | Facebook | 478 | Quito | 321 | #Superhumans | 157
Indonesian | 13 | TweetCaster for Android | 428 | Pretoria | 316 | #excited | 145
da | 11 | Tweetbot for iOS | 341 | Alaska | 283 | #lastleg | 145
Korean | 8 | Twitter for BlackBerry® | 300 | Singapore | 263 | #PROUD | 125

 

 

 Network Graph 1 

  

Paralympics1.png 

  

 Network Graph 2  

 

Paralympics2.png 

English was the language of choice for people tweeting about the Paralympic Games, making up about 97% of the sample. A very similar trend was observed in an earlier study of Olympic tweets. English was followed by other languages such as Dutch, Spanish, German, French and Swedish. Tweets from Twitter for BlackBerry constituted about 27% of the total, followed by Web-based tweets (27%), Twitter for iPhone (15%) and Twitter for Android (13%), followed by the rest. This indicates that more people were tweeting about the Paralympic Games through mobile devices. Most of the tweets originated from Europe, followed by North America. London leads the list, followed by Amsterdam, Casablanca and Hawaii. Analysing the hashtags reveals that #teamGB was the most used, followed by #paralympics, #London2012 and #inspirational. Team Great Britain seems to be the most popular team.

August 31, 2012

Network Analysis of Drugs and their Active Ingredients

The global appetite for medicines and medications continues to grow at an alarming pace. According to sources, global pharmaceutical sales are growing at a rate of 6% YoY and amount to about $800 billion worth of drugs. Populous markets such as India and China contribute a significant part of this growth, as these markets are growing at about 15% YoY. The United States is the single largest pharmaceutical market, with sales of about $320 billion annually. Prescription drug abuse is the fastest growing drug problem in the United States, and in 2007 approximately 27,000 unintentional drug overdose deaths occurred in the United States, one death every 19 minutes (Centers for Disease Control and Prevention). Americans consume prescription drugs at more than three times the per capita rate of countries such as Germany. Similar scenarios exist for livestock drugs as well.

Given the above context, let us examine what exactly constitutes the drugs we consume. The motivation for this research is to understand, from a network perspective, the drugs we consume and their active ingredients. We use social network analysis methods to visualize and analyse drugs listed in the US FDA Drug Label Data. The large volume of information available in this database gives us a glimpse of the fundamentals of drugs and their active constituents. The co-occurrence of active ingredients across thousands of drugs shows which active ingredients are used frequently and which of them are most often used together. We analyse the network structure of drugs and their active ingredients and examine their centrality measures and other key indicators.

The data source for our analysis is the US FDA Drug Label Data provided by the Department of Health & Human Services. The agency collects this data on a daily basis, and we downloaded it on August 30, 2012. Our dataset had 70,904 National Drug Codes. It contains data elements such as proprietary name, active ingredients, marketing application number or regulatory citation, National Drug Code and company name. Of particular interest to us are the drug's proprietary name and its active ingredients, as these two form the nodes of our two-mode network.
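To make the two-mode structure concrete, here is a minimal sketch of how drug and ingredient records could be turned into an edge list before loading them into a tool such as Gephi. The column names, the semicolon-separated ingredient field and the product names are assumptions made for illustration; they are not taken from the FDA file or from our cleaning procedure.

```python
import csv
import io

# Hypothetical sample rows: the column names and product names are invented.
SAMPLE = """proprietary_name\tingredients
Sunscreen A\tOCTINOXATE; TITANIUM DIOXIDE
Sunscreen A\tOCTINOXATE; TITANIUM DIOXIDE
Lip Balm C\tOctinoxate
Cold Remedy D\tACETAMINOPHEN; DEXTROMETHORPHAN HYDROBROMIDE
"""


def two_mode_edges(handle):
    """Return sorted, de-duplicated (drug, ingredient) pairs."""
    edges = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        drug = row["proprietary_name"].strip().title()
        # Assumption: several ingredients share one field, separated by ";".
        for part in row["ingredients"].split(";"):
            ingredient = part.strip().title()
            if drug and ingredient:
                edges.add((drug, ingredient))
    return sorted(edges)


edges = two_mode_edges(io.StringIO(SAMPLE))
drugs = {drug for drug, _ in edges}
ingredients = {ingredient for _, ingredient in edges}
print(len(drugs | ingredients), "nodes,", len(edges), "edges")
```

Normalising case and removing duplicate rows is one plausible form of the clean-up step; repeated labels for the same product would otherwise inflate the edge count.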

The final cleaned-up data had 16,444 nodes and 32,627 edges. The network analysis was carried out, and the network graphs were created, using Gephi. As seen in Figure 1, there were several clusters around drugs and their active ingredients. Figures 2 to 8 are close-up screenshots of various clusters.

 

Figure 1: Network graph of Drugs and their Active Ingredients

 

Health 1.jpg 

  

Figure 2: Close-up of a cluster in the Network graph

 

2.jpg

 

Figure 3: Cluster around Octinoxate and Titanium Dioxide

  

3.jpg

 

Figure 4: Crowded part of the network graph 

  

4.jpg

 

Figure 5: Cluster around Potassium Cation

  

5.jpg

 

 

Figure 6: Cluster around Sodium Fluoride

  

6.jpg

 

 Figure 7: Cluster around Alcohol

 

7.jpg

 

 Figure 8: Cluster around Triclosan

 

8.jpg

The resulting network statistics are given below. Table 1 lists the top 10 drugs and active ingredients by network centrality measures: Degree, Closeness, Page Rank, Betweenness and Eigenvector. Within each measure, entries are listed in decreasing order of score. The shaded blocks are drugs with proprietary names that had high scores.

  

Table 1: Top 10 Drugs and their Active Ingredients listed in decreasing order of network centrality measures

 

| Degree Centrality | Closeness Centrality | Page Rank | Betweenness Centrality | Eigenvector |
|---|---|---|---|---|
| Octinoxate | Tylosintartrate | Alcohol | Alcohol | Octinoxate |
| Titaniumdioxide | Nuflorgold | Octinoxate | Salicylicacid | Titaniumdioxide |
| Alcohol | Alrex | Triclosan | Octinoxate | Octisalate |
| Acetaminophen | Lotemax | Titaniumdioxide | Zinc | Oxybenzone |
| Octisalate | Premarin | Acetaminophen | Zincoxide | Avobenzone |
| Oxybenzone | Premarinvaginal | Salicylicacid | Potassiumcation | Octocrylene |
| Avobenzone | Sul-Q-Nox | Zincoxide | Acetaminophen | Acetaminophen |
| Triclosan | Tylan50 | Menthol | Titaniumdioxide | Zincoxide |
| Zincoxide | Tylan200 | Octisalate | Triclosan | Dextromethorphan Hydrobromide |
| Dextromethorphan Hydrobromide | Tylan40 | Oxybenzone | Additox | Dextromethorphan |

 

 

In our case the Eigenvector measure indicates the strength of interconnections between the strongly connected active ingredients. The highly connected nature of active ingredients such as Octinoxate, Titaniumdioxide, Octisalate, Oxybenzone and Avobenzone reflects their widespread use in key drugs as well as their widespread customer adoption. Degree centrality in our case identifies the drugs and active ingredients that are most connected to other active ingredients. Betweenness centrality for us is a measure of the drugs and active ingredients that are best positioned to broker connections between other drugs and active ingredients, or to serve as a connecting node between them.
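As an illustration of what two of these measures capture, the sketch below computes degree and betweenness on a tiny, made-up drug and ingredient network. Our scores came from Gephi; this standalone version uses a plain breadth-first implementation of Brandes' betweenness algorithm, and the product names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical (drug, ingredient) pairs; not taken from the FDA data.
EDGES = [
    ("Sunscreen A", "Octinoxate"),
    ("Sunscreen A", "Titanium dioxide"),
    ("Sunscreen B", "Octinoxate"),
    ("Sunscreen B", "Oxybenzone"),
    ("Lip Balm C", "Octinoxate"),
    ("Cold Remedy D", "Acetaminophen"),
    ("Cold Remedy D", "Dextromethorphan hydrobromide"),
]


def build_graph(edges):
    """Undirected adjacency sets keyed by node name."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def degree(graph):
    """Number of direct neighbours of each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness(graph):
    """Unnormalised betweenness for an unweighted, undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search that counts shortest paths from source.
        order, dist = [], {source: 0}
        preds = {node: [] for node in graph}
        n_paths = dict.fromkeys(graph, 0)
        n_paths[source] = 1
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted from both of its endpoints.
    return {node: value / 2 for node, value in score.items()}


graph = build_graph(EDGES)
degrees = degree(graph)
brokers = betweenness(graph)
for node in sorted(graph, key=brokers.get, reverse=True)[:3]:
    print(node, degrees[node], brokers[node])
```

In this toy network Octinoxate has both the highest degree and the highest betweenness, because every shortest path between the two sunscreens and the lip balm passes through it.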

Let us next examine what these drugs are used for. We consulted various sources, such as Drugs.com and Wikipedia, to understand what these drugs and ingredients are used for. The blocks coloured in light blue refer to veterinary medicines or active ingredients used in veterinary medicines. The blocks coloured in light green refer to proprietary names of drugs.

 

 Table 2: Drugs, Active Ingredients and their end consumer usage 

 

| Drugs, Active Ingredients | Usage |
|---|---|
| Octinoxate | Ingredient in some sunscreens and lip balms. Used in sunscreens and other cosmetics, and to reduce the appearance of scars. |
| Titaniumdioxide | Used for sunscreen, food colouring etc. |
| Alcohol | Used as an antiseptic and disinfectant, and in soaps, hand sanitizers, as a base for medicines etc. |
| Acetaminophen | Widely used over-the-counter pain reliever and fever reducer. |
| Octisalate | Used as an ingredient in sunscreens and cosmetics. |
| Oxybenzone | Organic compound used in sunscreens. |
| Avobenzone | Oil-soluble ingredient used in sunscreen products. |
| Triclosan | Antibacterial and antifungal agent used in many consumer products. |
| Zinc Oxide | Used to treat skin conditions such as diaper rash, in products such as baby powder, calamine cream, anti-dandruff shampoos, antiseptic ointments, creams and lotions. |
| Dextromethorphan Hydrobromide | Active ingredient in many over-the-counter cold and cough medicines. |
| Tylosintartrate | Used in veterinary medicines to treat bacterial infections in a wide range of species. |
| Nuflorgold | Veterinary medicine to treat bacterial infections in a wide range of species. |
| Alrex | Eye medicine to treat conjunctivitis. |
| Lotemax | Prescription-only ophthalmic suspension that reduces both internal and external inflammation of the eye. |
| Sul-Q-Nox | Veterinary medicine to treat beef cattle, chickens, dairy cattle, turkeys etc. |
| Tylan50 | Veterinary medicine to treat beef cattle, chickens, dairy cattle, turkeys etc. |
| Tylan200 | Veterinary medicine to treat beef cattle, chickens, dairy cattle, turkeys etc. |
| Tylan40 | Veterinary medicine to treat beef cattle, chickens, dairy cattle, turkeys etc. |
| Salicylic Acid | Used in anti-acne treatments as an anti-inflammatory ingredient. |
| Zinc | Included in most single-tablet over-the-counter daily vitamin and mineral supplements. |
| Potassium Cation | Used in a variety of medicines. |
| Additox | For temporary relief of debility, exhaustion, exhaustion after slight exertion, dysentery and substance addiction. |
| Menthol | Used against minor sore throat and minor mouth or throat irritation, and in lip balms, cough medicines, topical analgesics etc. |

 

Most of the network measures indicate the dominance of active ingredients such as Octinoxate, Titaniumdioxide, Octisalate, Oxybenzone and Avobenzone, which are used in sunscreens, lip balms, food colouring, cosmetics etc., along with the pain reliever Acetaminophen. These are followed by active ingredients that make up the generic base for drugs, such as Alcohol. Another key leader is Triclosan, an antibacterial and antifungal agent used in many consumer products. There is a strong presence of drugs as well as active ingredients used in veterinary medicines, as seen from the high ranking of several drugs with proprietary names. Our data source is the US FDA Drug Label Data provided by the Department of Health & Human Services, but since these drugs are sold globally, either in the same form or as generics, it is reasonable to assume that similar trends exist in other markets as well.

 

 

July 31, 2012

How the Olympic Games are being Tweeted

The Olympic Games is the largest sporting spectacle on the planet. The games have come a long way from the ancient Olympic Games, which were held in Olympia, Greece from the 8th century BC to the 4th century AD, to the ongoing London 2012 games. The London games have attracted participation from 205 countries and over 10,500 athletes. The games started on 27 July and will continue until 12 August 2012. According to The Economist, the British government's budget for the games has risen to £9.3 billion ($14.5 billion) from an initial estimate of £2.4 billion. The eleven global partners of the games are Acer, Atos, Coca-Cola, Dow, General Electric, McDonald's, Omega, Panasonic, Procter & Gamble, Samsung and Visa. In addition there are the Olympic Partners, Olympic Supporters, and Olympic Providers and Suppliers (see http://www.guardian.co.uk/sport/datablog/2012/jul/19/london-2012-olympic-sponsors-list). Given the global nature of the games and the large sums spent by the sponsors, it is interesting to understand and analyse the ground-level interest in and support for the games. Since social media is a proxy for end-user behaviour, we set out to understand the world's social media behaviour during the largest sporting event. We collected Twitter data using the hashtags and search terms #olympics, london olympics, londonolympics and olympics2012. The data was obtained from the Twitter data aggregator 140kit (www.140kit.com).

A sample of 43,048 tweets across 27,683 users was collected between July 29th and July 31st. The table below lists the top fifteen entries for language, source application from which the tweet originated, location and hashtag. The network graph covers the entire network of users who retweeted or mentioned one another. In the dataset, there were 16,444 nodes and 15,233 edges. There were 16,190 retweets, 274 mentions, 916 average followers, 368 average friends and 135 average favourites. The figure below shows the network graph of users, which was created using Gephi.
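A minimal sketch of how such a retweet and mention network could be assembled from raw tweet text is shown below. The tweets, user names and parsing rules are invented for illustration, and the export we obtained from 140kit may encode retweets and mentions differently.

```python
import re
from collections import Counter

# Hypothetical (author, text) pairs; not drawn from the collected sample.
TWEETS = [
    ("alice", "RT @bob: What a race! #olympics"),
    ("carol", "RT @bob: What a race! #olympics"),
    ("dave", "Good luck @alice #teamGB #olympics"),
    ("erin", "Watching the gymnastics #olympics"),
]

RETWEET = re.compile(r"^RT @(\w+)")
MENTION = re.compile(r"@(\w+)")


def interaction_edges(tweets):
    """Count directed (author, target, kind) interactions."""
    edges = Counter()
    for author, text in tweets:
        retweet = RETWEET.match(text)
        if retweet:
            # Assumption: a tweet starting with "RT @user" is a retweet of user.
            edges[(author, retweet.group(1).lower(), "retweet")] += 1
        else:
            for name in MENTION.findall(text):
                edges[(author, name.lower(), "mention")] += 1
    return edges


edges = interaction_edges(TWEETS)
nodes = {user for author, target, _ in edges for user in (author, target)}
kinds = Counter(kind for _, _, kind in edges.elements())
print(len(nodes), "nodes,", len(edges), "edges,", dict(kinds))
```

Users who neither retweet nor mention anyone, and are not retweeted or mentioned themselves, never enter the graph, which would explain a node count smaller than the number of users in the sample.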

 

Overall Statistics

 

| Language | Number of Users | Source Application | Number of Users | Location | Number | Hashtag | Number of Users |
|---|---|---|---|---|---|---|---|
| English | 41622 | web | 9220 | Empty Field | 8399 | #olympics | 43130 |
| Spanish | 347 | Twitter for iPhone | 7867 | london | 1759 | #london2012 | 4732 |
| French | 202 | Mobile Web | 5130 | UK | 425 | #teamGB | 1112 |
| Dutch | 168 | Twitter for Android | 4909 | England | 268 | #openingceremony | 968 |
| Japanese | 134 | Twitter for BlackBerry | 4206 | Australia | 247 | #London | 675 |
| German | 117 | TweetDeck | 1686 | Manchester | 226 | #teamUSA | 641 |
| Italian | 109 | Instagram | 1520 | London, UK | 211 | #USA | 629 |
| Portuguese | 78 | Echofon | 1388 | Sydney, Australia | 201 | #gymnastics | 609 |
| Russian | 74 | Twitter for iPad | 994 | Sydney | 189 | #cycling | 511 |
| Turkish | 36 | HootSuite | 719 | india | 180 | #basketball | 485 |
| Indonesian | 35 | UberSocial for BlackBerry | 528 | Melbourne | 178 | #swimming | 446 |
| da | 19 | txt | 526 | Melbourne, Australia | 166 | #equestrian | 279 |
| Korean | 17 | Tweetbot for iOS | 425 | Ames, UK | 157 | #rowing | 264 |
| Swedish | 16 | TweetCaster for Android | 369 | Spain | 118 | #archery | 262 |
| msa | 16 | Tweet Button | 304 | Los Angeles | 115 | #judo | 246 |

 

 

Network Graph 

 

 

 

Network.png

 

 

English was the language of choice for people tweeting about the games, making up about 97% of the sample. Spanish and French took silver and bronze respectively. Web-based tweets constituted about 23% of the total, followed by iPhone (20%), Mobile Web (13%), Android (12%), BlackBerry (11%) and the rest. This indicates that more people were tweeting about the Olympic Games on the go than from home or work. Following a global trend, most users have not revealed their location; where the user location is mentioned, London leads the list. Analysing the hashtags reveals that gymnastics is the most popular sport, followed by cycling, basketball, swimming, equestrian, rowing, archery and judo. Finally, Team Great Britain seems to be the most popular team. Are the Olympics really a global phenomenon when we examine them from the perspective of social media mentions? The initial results show concentration across the US, Europe and Australia. We will come back to the same analysis once more after the conclusion of the games.
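The source-application shares quoted above can be reproduced from the counts in the Overall Statistics table. The sketch below assumes that each count is divided by the total of the fifteen listed applications rather than by the full sample of 43,048 tweets; that denominator is inferred from the figures and is not stated in the analysis.

```python
# Source-application counts copied from the Overall Statistics table.
SOURCE_COUNTS = {
    "web": 9220,
    "Twitter for iPhone": 7867,
    "Mobile Web": 5130,
    "Twitter for Android": 4909,
    "Twitter for BlackBerry": 4206,
    "TweetDeck": 1686,
    "Instagram": 1520,
    "Echofon": 1388,
    "Twitter for iPad": 994,
    "HootSuite": 719,
    "UberSocial for BlackBerry": 528,
    "txt": 526,
    "Tweetbot for iOS": 425,
    "TweetCaster for Android": 369,
    "Tweet Button": 304,
}


def percentage_shares(counts):
    """Return each count as a whole-number percentage of the listed total."""
    total = sum(counts.values())  # assumed denominator: listed rows only
    return {name: round(100 * n / total) for name, n in counts.items()}


shares = percentage_shares(SOURCE_COUNTS)
for name in list(SOURCE_COUNTS)[:5]:
    print(f"{name}: {shares[name]}%")
```

Dividing by the full sample instead would give somewhat lower shares, so the choice of denominator matters when comparing these figures with other studies.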

 

Jai Ganesh

Sheera L. Gendzel, MBA Candidate, ESADE Business School