Web 2.0 is about harnessing the potential of the Internet in a more collaborative and peer-to-peer manner with emphasis on social interaction.

July 31, 2013

Big Data Analytics approach for forecasting Technology Innovations

We have grown up admiring Nikola Tesla and Thomas Edison for the number of inventions they contributed to. Edison held 1,093 U.S. patents across a large swathe of areas including the phonograph, the motion picture camera, the long-lasting practical electric light bulb, the stock ticker, the mechanical vote recorder, a battery for electric cars, electrical power, recorded music and motion pictures. Tesla was an equally prolific genius, who invented the modern alternating current electricity supply system. Edison's patent count was overtaken by Shunpei Yamazaki in 2003, and in 2008 Kia Silverbrook overtook Yamazaki. During 1963-2012, about 10,927,409 patents were filed in the US and about 5,763,095 were granted.

Enterprises are often faced with questions such as: Which players could emerge as key competitors to the business in future? Who are the key players to partner with in a new technology area? What should we be aware of while entering a new technology space? What are the 'hottest' technology trends? Who are the most influential people in a certain technology space? Patent filings are an important source of knowledge: they reveal early signs of emerging technology trends and are a key source for predicting business and technology trends. Patent filings are available in the public domain and are among the most valuable open databases. Patenting is a key means of protecting an enterprise's business and technology inventions, and analyzing patent data reveals the long-term technology and business strategies of companies. It is a key indicator of upcoming products, services and solutions, and can predict the actions of companies long before their offerings hit the market.

Big data analytics approaches are appropriate in scenarios such as patent analysis, which requires dealing with large volumes of patent data. While the patent data we dealt with does not meet the big data criteria with regard to size, it certainly meets the criteria with respect to heterogeneity and complexity, and the methodology lets us draw inferences that apply to other scenarios. There is a proliferation of patent filings and the challenge is to sift through all the data in a fast, cheap and effective manner. Network analysis techniques provide a useful lens for exploring the phenomenon and make it easier to deal with large datasets covering entire patent populations. Using patent network analysis, it is possible to create a visual map of connections between patents, companies, inventors, assignees, citations, patent classes, patent classifications, patent families etc. We can leverage these connections to reveal leading companies, inventors, technology groups and patent families, as well as the most influential patents. By clustering groups of patents and citations, we can identify similar patents around specific technology and business areas.

The Hollywood movie The Social Network has the tagline 'You don't get to 500 million friends without making a few enemies'. Over the course of this blog, we will come back to that tagline and see how it relates to patents in the area of social networks. We collected and analysed a population of 6,956 patent applications (both pending and granted) matching the phrase 'Social Network'. For our analysis, the network nodes are the patents, inventors, citations, owners, patent families or patent classifications, and the connections between them are the links. Table 1 ranks the leading companies based on the 'social network' patents they own.

Table 1: Leading companies by 'Social Network' patents

1. Microsoft | 6. Samsung Electronics | 11. Alcatel Lucent | 16. ETRI
2. Facebook | 7. Sony Corporation | 12. Avaya | 17. Salesforce
3. Yahoo | 8. Nokia | 13. Tencent Holdings | 18. Red Hat
4. IBM | 9. AT&T | 14. Zynga | 19. eBay
5. Google | 10. LG Corp. | 15. Intel | 20. Ericsson


Patent Network Analysis of Citations

Patent citation analysis helps identify the most influential patents. Citation numbers are counts of references in patents: when one patent cites another, it indicates a connection between them, and more citations indicate greater relative importance of the cited patent. Our social network patent data of 6,956 patent applications had 24,676 nodes and 43,224 edges in the form of a directed network. We first clustered the dataset and 1,431 clusters emerged. The patent network analysis reveals the most influential patents and the owners of the most influential patents (Table 2). For the rest of the write-up we use a few network centrality measures to draw inferences. They are 1. Eigenvector centrality: a measure of the strength of connections between strongly connected patents, 2. Degree centrality: a measure of the most connected patents in terms of citations, 3. Betweenness centrality: a measure of how central a patent is as a bridge between other patents in the network, and 4. Authority: a measure of patents with dense incoming links that act as sources of knowledge. Friendster, which has several influential patents, has since morphed from a social networking destination into a social gaming destination, operating mainly in the Asia Pacific region. Yahoo!, Microsoft, Intel and others in the list below have several well-cited patents in this space, but do not have a social networking service to match.
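To make the centrality measures above concrete, here is a minimal sketch on a toy citation network. The patent labels and edges are hypothetical, and the sketch assumes that the Authority measure refers to HITS authority scores, which the post does not state; the original analysis was most likely produced with a graph tool rather than hand-written code.

```python
from collections import defaultdict

# Hypothetical citation edges: (citing_patent, cited_patent).
citations = [
    ("P2", "P1"), ("P3", "P1"), ("P4", "P1"),
    ("P3", "P2"), ("P4", "P2"), ("P5", "P4"),
]

def in_degree(edges):
    """Count how many times each patent is cited."""
    counts = defaultdict(int)
    for citing, cited in edges:
        counts[citing] += 0  # make sure uncited patents also appear
        counts[cited] += 1
    return dict(counts)

def hits_authority(edges, iterations=50):
    """Authority scores of the HITS algorithm, computed by power iteration."""
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority of a patent: sum of hub scores of the patents citing it.
        auth = {node: 0.0 for node in nodes}
        for citing, cited in edges:
            auth[cited] += hub[citing]
        # Hub score of a patent: sum of authority scores of the patents it cites.
        hub = {node: 0.0 for node in nodes}
        for citing, cited in edges:
            hub[citing] += auth[cited]
        # Normalise so that each score vector sums to 1.
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(hub.values()) or 1.0
        auth = {node: value / auth_total for node, value in auth.items()}
        hub = {node: value / hub_total for node, value in hub.items()}
    return auth

degrees = in_degree(citations)
authority = hits_authority(citations)
most_cited = max(degrees, key=degrees.get)
top_authority = max(authority, key=authority.get)
print(most_cited, top_authority)
```

Ranking patent owners, as in Table 2, would then amount to aggregating these per-patent scores by owner.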

The network diagrams visualize the patent citation network, the leading patents in this field and their citation links. Patent citations follow a power-law degree distribution and exhibit the characteristics of scale-free networks. In other words, the patents are not evenly connected: a few highly connected nodes govern the properties of the network. In our earlier blog, we found that Wikipedia entries on musical genres and their stylistic origins also follow a power-law degree distribution and exhibit the characteristics of scale-free networks.
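One rough way to check a power-law claim like the one above is to estimate the exponent of the degree distribution. The sketch below uses the standard continuous maximum-likelihood estimate, alpha = 1 + n / sum(ln(d / d_min)), on a hypothetical edge list. It is only an approximation for integer degrees, and it is not necessarily the method used for the analysis in this post.

```python
import math
from collections import Counter

def degree_counts(edges):
    """Total degree (incoming plus outgoing links) of every node."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree

def power_law_exponent(degrees, d_min=1):
    """Continuous maximum-likelihood estimate of the power-law exponent."""
    tail = [d for d in degrees if d >= d_min]
    log_sum = sum(math.log(d / d_min) for d in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree above d_min")
    return 1.0 + len(tail) / log_sum

# Hypothetical network: one heavily cited hub and one moderately cited node.
edges = [(f"N{i}", "HUB") for i in range(8)]
edges += [("N0", "B"), ("N1", "B"), ("N2", "B")]

degree = degree_counts(edges)
alpha = power_law_exponent(degree.values(), d_min=1)
print(round(alpha, 2))
```

A small number of nodes with very high degree alongside many nodes of degree one or two is the pattern the text describes; a proper test would also compare the fit against alternative distributions.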

Table 2: Top ten 'Social Network' Patent Owners by Citations


Eigenvector | Degree | Betweenness | Authority
Microsoft | Boadin Technology | Spoke Software | Microsoft
Six Degrees | Yahoo | Microsoft | Six Degrees
Friendster | Yahoo | Qurio Holdings | Friendster
Amazon.com | Yahoo | Yahoo | Amazon.com
Intel | Yahoo | Friendster | Intel
Boardwalk | Yahoo | Facebook | Boardwalk
LinkedIn | Yahoo | Mark Zuckerberg | LinkedIn
Parity Communications | Yahoo | Google | Parity Communications
Tele-Publishing | Qurio Holdings | IBM | Tele-Publishing
Peter Pezaris, Michael Gersh | Ewinwin | Qurio Holdings | Peter Pezaris, Michael Gersh



Figures 1 - 6: Social Network Patent Citations Graphs

Patent - Citations 1.jpg
Patent - Citations 2.jpg
Patent - Citations 3.jpg
Patent - Citations 4.jpg
Patent - Citations 5.jpg
Patent - Citations 6.jpg

Enterprise spread across Patent Families and IPC Classifications

A patent family is a group of related patents filed in multiple geographies in relation to a single invention. The patent family size is measured by the number of geographies where the patent has been filed. A large patent family size is an indication of importance of the patent as well as a corresponding large market opportunity. Table 3 ranks the various enterprises according to the size of their patent families in the Social Networks space. Table 4 lists the leading patent families in this space, the company which has filed patents with respect to the patent family as well as the broad space to which the patent family belongs.
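As a minimal sketch of the family-size measure described above, the snippet below counts the distinct geographies per patent family. The family identifiers, patent offices and company names are hypothetical placeholders, not data from this analysis.

```python
from collections import defaultdict

# Hypothetical filing records: (family_id, patent_office, owner).
filings = [
    ("F1", "US", "Company A"), ("F1", "EP", "Company A"), ("F1", "JP", "Company A"),
    ("F2", "US", "Company B"), ("F2", "US", "Company B"),
    ("F3", "US", "Company A"), ("F3", "CN", "Company A"),
]

def family_sizes(records):
    """Family size = number of distinct geographies in which the family is filed."""
    offices = defaultdict(set)
    for family_id, office, _owner in records:
        offices[family_id].add(office)
    return {family_id: len(seen) for family_id, seen in offices.items()}

sizes = family_sizes(filings)
# Rank families from largest to smallest.
ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Repeat filings in the same geography (family F2 here) do not increase the family size under this definition.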

The large number of patent families held by Microsoft, Facebook, IBM, Yahoo!, Google, Samsung, Sony and AT&T indicates the broad technology and business spread of their patents. This is an indication of the concentrated investments around specific intellectual property by these companies and is a harbinger of the potential for commercialising these patent families. The concentrated presence across IPC classifications of companies such as Microsoft, Yahoo!, Sony, IBM, Google, Facebook, Samsung, LG and Nokia indicates their targeted patent application areas. The prominent IPC classes under which most of the social network patents are filed are 1. G06F and G06Q, which broadly cover computing, calculating and counting, including simulators and data processing carried out by computing systems, 2. H04L, which broadly covers digital transmission of information, and 3. A63F, which relates to games and game playing applications.

Table 3: Company rank based on size of patent families in the Social Networks space

1. Microsoft | 6. Samsung Electronics | 11. LG Corp. | 16. Avaya
2. Facebook | 7. Sony Corporation | 12. Salesforce.com | 17. eBay
3. IBM | 8. AT&T | 13. Tencent Holdings | 18. Ericsson
4. Yahoo | 9. ETRI | 14. Alcatel-Lucent | 19. Intel
5. Google | 10. Nokia | 15. Red Hat | 20. KT Corporation


Table 4: Leading patent families, company and broad business/technology space

Patent Family | Company | Patent Family Space
40589145 | Facebook | Social advertisement model for social networks. Personalized social content. Sponsored stories and news stories within a social network.
44278318 | Microsoft | Content aggregation
39052266 | Facebook | Dynamic news feeds about a user of a social network
44816714 | Facebook | Personalize a web page with content from a social network
38256994 | Microsoft | Indicate and search recent content publication activity by a user
40718209 | Facebook | Community translation on a social network
39742759 | Addnclick, Robinson, Jack | Establishing live social networks by linking users to each other who are simultaneously engaged in the same and/or similar content
42099881 | Microsoft | Transient networks
36319745 | Yahoo | Search system with integration of user metadata from a trust network
37188569 | Microsoft | Collaboration spaces for online social networks
40351163 | Facebook | Platform for providing a social context to software applications
40626121 | Facebook | Social advertisement based on peer interests
39052156 | Facebook | Display of media content to a user based on user interaction
31976242 | Microsoft | Virtual calling card system


Structure of Patent Inventor Networks

Patent inventor network analysis helps identify the most influential inventors. The inventors are the nodes and the connections are joint patent inventor relations. Our social network patent data of 6,956 patent applications had 8,992 inventors and 6,744 connections. Table 5 lists the most influential patent inventors. The network diagrams help to visualize the patent inventor network. We also note that patent inventor networks follow a power-law degree distribution and exhibit the characteristics of scale-free networks. In other words, the patent inventors are not evenly connected: a few highly connected nodes govern the properties of the network.
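The inventor network described above can be sketched as follows: every pair of inventors named on the same patent gets a connection. The patent numbers and inventor names are hypothetical, and weighting an edge by the number of shared patents is an assumption of this sketch rather than something the post specifies.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping from patent to the inventors named on it.
patents = {
    "PAT-1": ["Inventor A", "Inventor B", "Inventor C"],
    "PAT-2": ["Inventor A", "Inventor B"],
    "PAT-3": ["Inventor D"],
}

def coinventor_edges(patent_inventors):
    """Undirected co-inventor edges, weighted by the number of joint patents."""
    weights = Counter()
    for inventors in patent_inventors.values():
        # Sort so that each pair has one canonical (a, b) ordering.
        for pair in combinations(sorted(set(inventors)), 2):
            weights[pair] += 1
    return weights

edges = coinventor_edges(patents)
nodes = {name for names in patents.values() for name in names}

# Degree: number of distinct co-inventors per inventor.
degree = Counter()
for first, second in edges:
    degree[first] += 1
    degree[second] += 1

print(len(nodes), len(edges), degree.most_common(1))
```

Sole inventors (Inventor D here) are nodes without connections, which is one reason a network can have more inventors than connections.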

Relating back to the tagline of the movie The Social Network, we can see that Mark Zuckerberg, the founder of Facebook, is the most central inventor of patents in this space by betweenness centrality (Table 5). He not only facilitated ease of connections and networks across more than a billion people globally, but also invented and owns the patents behind the technology and processes by which the billions of connections between people are made possible.

Table 5: Top ten 'Social Network' Patent Inventors

Eigenvector | Degree | Betweenness | Authority
Angelo Adam | Angelo Adam | Zuckerberg Mark | Angelo Adam
Zuckerberg Mark | Zuckerberg Mark | Angelo Adam | Zuckerberg Mark
Bosworth Andrew | Davis Marce | Bosworth Andrew | Davis Marc
Sanghvi Ruchi | Bosworth Andrew | Cheng Lili | Bosworth Andrew
Wong Yishan | Cheng Lili | Wong Yishan | Cheng Lili
Cox Chris | Caldwell Nicholas | Davis Marce | Caldwell Nicholas
Kendall Timothy | Wong Yishan | Horvitz Eric | Wong Yishan
Rosenstein Justin | Oconnor Sean | Rubinstein Yigaldan | Oconnor Sean
Corson Dan | Hamlin Drew | Huang Xuedong | Hamlin Drew
Hamlin Drew | Bernard Rob | Deng Peter | Bernard Rob



Figure 7: Leading Social Network Patent Inventor connections

Social Network Patent Inventors.jpg

Mapping Enterprise Patents to product features

As described earlier, a patent family is a group of related patents filed in multiple geographies for a single invention, and a large family size indicates both the importance of the patent and a larger market opportunity. Table 6 lists Facebook's patent family spaces, patent file dates and feature introduction dates. In most cases, Facebook launched the feature or service within months of the patent filing, and in a few cases before it.

Table 6: Facebook's patent family spaces, patent file dates and feature introduction date

Patent Family | Patent Family Space | Patent file date | Facebook feature introduction date
39052156 | Display of dynamically selected media content to a user | August 2006 | Facebook News Feeds (Sept 2006)
39052266 | Dynamic news feeds about a user of a social network | August 2006 | Facebook News Feeds (Sept 2006)
40351163 | Platform for providing a social context to software applications | August 2007 | Facebook Platform (May 2007)
40589145 | Social advertisement model for social networks. Sponsored stories and news stories within a social network. Personalized social content | August 2008, Dec 2011, April 2012 | Advertising Model (2008); Advertising based on user's connections
40626121 | Communicating information in a social networking website about activities from another domain | August 2008 | Advertising Model (2008)
40718209 | Community translation on a social network | Dec 2008 | Facebook Community Translation (Dec 2007)
44816714 | Personalize a web page with content from a social network | April 2010 | Facebook Community Page start offering content from Wikipedia (April 2010)
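The gap between filing and launch in Table 6 can be worked out with a few lines of arithmetic. The sketch below uses only the rows where both dates are given to the month; the two advertising rows are left out because their launch date is given only as a year. A negative lag means the feature was introduced before the patent was filed.

```python
# (patent family, (filing year, filing month), (launch year, launch month)),
# copied from the rows of Table 6 that give both dates to the month.
rows = [
    ("39052156", (2006, 8), (2006, 9)),
    ("39052266", (2006, 8), (2006, 9)),
    ("40351163", (2007, 8), (2007, 5)),
    ("40718209", (2008, 12), (2007, 12)),
    ("44816714", (2010, 4), (2010, 4)),
]

def months_between(filed, launched):
    """Whole months from the filing date to the feature launch date."""
    return (launched[0] - filed[0]) * 12 + (launched[1] - filed[1])

lags = {family: months_between(filed, launched) for family, filed, launched in rows}
for family, lag in lags.items():
    print(family, lag)
```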



Forecasting enterprise Innovations

We can extend the above analysis to identify significant patent families and classes from companies and predict the introduction of new innovations in the form of product and service features and functionalities. Staying with Facebook and examining its newer patent publications, families, inventors and classes, we forecast that Facebook will in the future introduce the following innovations as part of its social features and functionalities: 1. Video tagging by analyzing the content uploaded by users, 2. Location-based identification and friend connection between users of Facebook, leveraging the GPS coordinates of the user, 3. Mobile-based social check-in and social search of peers, 4. Enhanced settings for controlling objectionable behavior, 5. Social commerce and social deals.

May 29, 2013

Networks of Global Migration

Migration is defined by the International Organization for Migration as 'the movement of a person or a group of persons, either across an international border, or within a State. It is a population movement, encompassing any kind of movement of people, whatever its length, composition and causes; it includes migration of refugees, displaced persons, economic migrants, and persons moving for other purposes, including family reunification'. The United Nations defines a migrant as 'an individual who has resided in a foreign country for more than one year irrespective of the causes, voluntary or involuntary, and the means, regular or irregular, used to migrate'. There are different types of migration such as seasonal (driven by labor conditions), tourism, rural to urban (driven by economic, educational and social conditions) and international migration. Other causes of migration could be regional conflicts, wars and natural disasters. Disparities in income for similar types of jobs and shortages of suitably skilled and employable labor are key reasons for international migration.

According to the International Organization for Migration (IOM), there are about 214 million international migrants across the world (about 3% of the global population), which is a significant increase over the year 2000 figure of 150 million. According to the IOM, countries with a high percentage of migrants include Qatar (87%), United Arab Emirates (70%), Jordan (46%), Singapore (41%) and Saudi Arabia (28%), and countries with a low percentage of migrants include South Africa (3.7%), Slovakia (2.4%), Turkey (1.9%), Japan (1.7%), Nigeria (0.7%), Romania (0.6%), India (0.4%) and Indonesia (0.1%). Global remittances by migrants were $529 billion in 2012. Remittances sent by migrants to developing countries were estimated at $401 billion in 2012. According to the World Bank, the top recipients of officially recorded remittances in 2012 were India ($69 billion), China ($60 billion), the Philippines ($24 billion), and Mexico ($23 billion). Other large recipients are Nigeria, Egypt, Pakistan, Bangladesh, Vietnam and Lebanon.

The Global Bilateral Migration Database maintains details of bilateral migrants for the period 1960-2000. The data is based on 1000 plus national censuses and population details from 226 countries. The data break-up is available based on gender as well as the source and destination of migration. Below are charts of gender-wise and country-wise global migration patterns during 1960-2000.

Gender-wise Human Migration from 1960-2000

Aggregate Migration - 1960-2000.jpg

Top ten male migrant sources and destinations (1960-2000)

Male Migrants - 1960-2000.jpg

Top ten female migrant sources and destinations (1960-2000)

Female Migrants - 1960-2000.jpg

Network Graph of Male migration (1960-2000)

Male Migration weighted degree.jpg

Network Graph of Female migration (1960-2000)

Female Migration Weighted Degree.jpg

Analysis and Conclusions

Gender-wise migration patterns

  1. During 1960-2000, the largest number of male migrants were from Pakistan to India (12,414,897), which accounted for 3.94% of the total, followed by India to Pakistan (11,114,945 - 3.53%), Mexico to United States (9,721,375 - 3.09%), Russian Federation to Ukraine (8,578,948 - 2.72%), Ukraine to Russian Federation (7,064,887 - 2.24%) followed by others.
  2. During 1960-2000, the largest number of female migrants were from Russian Federation to Ukraine (12,465,470 - 3.98%), Ukraine to Russian Federation (11,889,217 - 3.8%), Pakistan to India (10,650,582 - 3.4%), India to Pakistan (9,580,036 - 3.06%), Mexico to United States (8,264,482 - 2.64%), followed by others.
  3. For male migrants the United States continued to be the number one destination over the years followed by Russian Federation, India, Germany, Canada, France, Ukraine and Australia. During 1990 and 2000, Saudi Arabia emerged as a top ten migrant destination for male migrants.
  4. In the case of outflow of male migrants the leading migrant sources were India, Pakistan, Russian Federation, Poland, China, Italy, Ukraine, United Kingdom, Germany, Spain and Bangladesh. During 1990 and 2000, Mexico and Egypt emerged as a top ten source of male migrants.
  5. For female migrants, the United States has been the number one destination over the years followed by Russian Federation, India, Germany, Canada, France, Ukraine, Kazakhstan and Australia. During 1990 and 2000, Saudi Arabia emerged as a top ten migrant destination for female migrants.
  6. In the case of outflow of female migrants the leading migrant sources were India, Pakistan, Russian Federation, Poland, China, Italy, Ukraine, United Kingdom, Germany, Spain and Bangladesh. During 1990 and 2000, Mexico emerged as a top ten source of female migrants.

Country-wise migration patterns

  1. The majority of the world's migration happens across the borders of neighboring countries such as India to Pakistan and vice versa, Russian Federation to Ukraine and vice versa, Mexico to United States, Bangladesh to India, Poland to Germany etc.
  2. Large numbers of migrants originate from developing countries in Asia and Africa and a large majority of them move to other poor developing countries in search of marginally improved economic, political, social, educational and labor conditions.
  3. The largest recipients of migrants were countries such as the United States, India, Germany, Canada, France, Ukraine, Australia, Russian Federation, Saudi Arabia and the United Kingdom.
  4. Large numbers of migrants are driven by labor conditions (Mexico to United States, Bangladesh to India and Ukraine to Russian Federation), economic (Poland to Germany), political (India to Pakistan and vice versa) and social conditions. Disparities in income for similar type of jobs and shortage of suitably skilled and employable labor force are key reasons for the above.

Global Migration clusters

  1. We observed 14 migration clusters based on the aggregate migration patterns over 1960-2000. These include large clusters around India, Pakistan, Russian Federation, Ukraine, United States, United Kingdom, Germany, China, France as well as smaller clusters around Eastern, Southern and Western Africa and South America. The below network graphs capture the gender-wise migration inflow and outflow clusters.

Female Migration Inflow clusters

Female Migration Inflow clusters.jpg

Male Migration Inflow clusters

Male Migration Inflow Clusters.jpg

Female Migration Outflow clusters

Female Migration Outflow clusters.jpg

Male Migration Outflow clusters

Male Migration Outflow Clusters.jpg

Migration between developed countries

  1. We calculated the weighted indegree and outdegree measures to further understand the leading destinations of migrant inflows and the leading sources of migrant outflows from 1960-2000. We then drilled down into the top ten sources of migrant outflows and took the developed countries among them for further analysis. In this list, among female migrants, the United Kingdom and Germany figured in the top ten from 1960-2000 and Italy figured in the top ten from 1960-1990. In the same list, among male migrants, the United Kingdom, Germany, Spain and Italy figured in the top ten from 1960-1970, the United Kingdom and Italy figured in the top ten from 1980-1990 and only the United Kingdom figured in the top ten in 2000. The charts below capture the gender-wise percentages for developed country outflows. The percentages indicate global shifts in migration, particularly in the context of developed countries. Over the years, there has been increasing migration from the developed to the developing world, in the form of United Kingdom to Zambia and Zimbabwe, Germany to Brazil and Argentina, and Italy to Brazil, Argentina and Venezuela.
  2. We tried to correlate migration from developed countries such as the United Kingdom, Germany, Italy and Spain with their GDP and unemployment for the period 1960-2000. The unemployment data from the UK, Spain and Germany moved in tandem with their GDP numbers. In the case of the United Kingdom, GDP growth was low from the 1960s until the mid-1970s and there were slight dips in GDP during the early 80s and 90s. However, we saw no correlation between GDP growth or dips and migrant outflows from the UK, as the migrant outflow from the UK continued to increase at a constant rate over the years. The German economy saw a slight dip in its GDP during 1980-86 and then again in the mid-to-late 90s, but the trend in migration was more or less similar to that of the United Kingdom. The Italian economy saw similar GDP dips; however, the migrant outflow from Italy declined steadily over the years.
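The weighted indegree and outdegree measures used above can be sketched as follows. The country labels and migrant counts are hypothetical placeholders, not values from the Global Bilateral Migration Database. The last helper mirrors the charts below by expressing each destination as a percentage of one origin's total outflow.

```python
from collections import defaultdict

# Hypothetical bilateral flows: (origin, destination, migrants).
flows = [
    ("A", "B", 500), ("A", "C", 200),
    ("B", "A", 300), ("C", "B", 100),
]

def weighted_degrees(edges):
    """Weighted indegree (inflow) and outdegree (outflow) per country."""
    indegree = defaultdict(int)
    outdegree = defaultdict(int)
    for origin, destination, migrants in edges:
        outdegree[origin] += migrants
        indegree[destination] += migrants
    return dict(indegree), dict(outdegree)

def top_n(scores, n=10):
    """Countries ranked by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def destination_shares(edges, origin):
    """Each destination as a percentage of one origin's total outflow."""
    totals = defaultdict(int)
    for source, destination, migrants in edges:
        if source == origin:
            totals[destination] += migrants
    grand_total = sum(totals.values())
    return {dest: 100 * count / grand_total for dest, count in totals.items()}

indegree, outdegree = weighted_degrees(flows)
print(top_n(indegree, 2), top_n(outdegree, 2))
print(destination_shares(flows, "A"))
```

Running the same functions separately on the male and female flow tables would give the gender-wise rankings discussed in the text.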

United Kingdom: Female migrant outflow destinations as a % of gender-wise total

UK Female.jpg

United Kingdom: Male migrant outflow destinations as a % of gender-wise total

UK Male.jpg

Germany: Female migrant outflow destinations as a % of gender-wise total

Germany Female.jpg

Germany: Male migrant outflow destinations as a % of gender-wise total

Germany Male.jpg

Italy: Female migrant outflow destinations as a % of gender-wise total

Italy Female.jpg

Italy: Male migrant outflow destinations as a % of gender-wise total

Italy Male.jpg

February 18, 2013

Geographical network of refugee movements

What is common between Sigmund Freud, the Dalai Lama, Karl Marx, Aristotle Onassis, Bob Marley, Albert Einstein, Marlene Dietrich, Madeleine Albright, Victor Hugo, Frédéric Chopin and Andy Garcia? Besides the fact that I am personally influenced by many of them, the one common factor is that they were all refugees during some phase of their lifetime. The term refugee is used for those who seek relief and refuge from economic, military, political, or social distress including war, famine, or civil strife. The earliest known instance of refugees was triggered by the invasion of the Middle East by Sargon the Great of Mesopotamia (2270-2215 BC). Across the years, human conflicts, battles, wars and epidemics have been the root causes of human displacement. According to the UN High Commissioner for Human Rights, a refugee is "any person who: owing to a well-founded fear of being persecuted for reasons of race, religion, nationality, membership of a particular social group, or political opinion, is outside the country of his nationality, and is unable to or, owing to such fear, is unwilling to avail himself of the protection of that country". Refugees are spread around the world, with more than half in Asia and about 20 percent in Africa. United Nations High Commissioner for Refugees (UNHCR) numbers indicate that by the end of 2011, the number of forcibly displaced people worldwide exceeded 42.5 million. Of these, 15.2 million were refugees (10.4 million under UNHCR's mandate and 4.8 million Palestinian refugees), 895,000 were asylum-seekers and 26.4 million were internally displaced persons.

According to UN Secretary-General Ban Ki-moon, "Refugees have been deprived of their homes, but they must not be deprived of their futures."

We are interested in understanding refugee movement across countries over the past decade. Since we can model origin and destination countries as nodes and the movement of people as edges, we use social network analysis and visualisations for our purpose. Data about refugee movement across countries during 2000-2011 was obtained from UNHCR.

The network graphs and charts below trace annual as well as consolidated refugee movements from 2000-2011. Our consolidated dataset had 211 nodes and 6,068 edges, with the nodes representing countries (sources and destinations of refugees) and the edges representing cross-country refugee movements. Table 1 gives the number of nodes and edges for each year. Figure 1 is a chart of global refugees from 2000-2011.
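As a minimal sketch of how node and edge counts of this kind can be derived, the snippet below counts the distinct countries and distinct origin-destination pairs per year, then sums the flows across years to get each pair's share of the consolidated total. The records are hypothetical, and the column layout is an assumption rather than the actual UNHCR format.

```python
from collections import defaultdict

# Hypothetical records: (year, origin, destination, refugees).
records = [
    (2000, "A", "B", 100), (2000, "C", "B", 50),
    (2001, "A", "B", 120), (2001, "A", "D", 30),
]

def yearly_network_size(rows):
    """Number of nodes (countries) and edges (origin-destination pairs) per year."""
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for year, origin, destination, _count in rows:
        nodes[year].update((origin, destination))
        edges[year].add((origin, destination))
    return {year: (len(nodes[year]), len(edges[year])) for year in sorted(nodes)}

def consolidated_shares(rows):
    """Each origin-destination pair as a percentage of the total over all years."""
    totals = defaultdict(int)
    for _year, origin, destination, count in rows:
        totals[(origin, destination)] += count
    grand_total = sum(totals.values())
    return {pair: 100 * count / grand_total for pair, count in totals.items()}

sizes = yearly_network_size(records)
shares = consolidated_shares(records)
print(sizes)
print(max(shares, key=shares.get))
```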

 

Table 1: Nodes and Edges

Year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011
Nodes | 192 | 188 | 193 | 193 | 192 | 192 | 197 | 199 | 199 | 202 | 204 | 205
Edges | 3205 | 3382 | 3657 | 3820 | 3991 | 4043 | 4136 | 4380 | 4505 | 4791 | 4888 | 4799

Figure 1: Global refugees from 2000-2011

Global Refugee Numbers.jpg

Figure 2 is a source and destination country chart of the top ten global refugee movements from 2000-2011. The largest refugee movements were from Afghanistan to Pakistan (18,790,596), which accounted for 15% of the total, followed by Afghanistan to Islamic Republic of Iran (12,423,106 - 10%), Iraq to Syrian Arab Republic (6,146,922 - 5%), Burundi to United Republic of Tanzania (4,030,134 - 3%), Vietnam to China (3,591,405 - 3%), followed by others.

 

Figure 2: Country-wise global refugee movements from 2000 - 2011 

 
 

Global Refugee Stats.jpg

We list below some key network graphs. The first set of graphs are based on aggregated data over the years 2000 - 2011 and the second set of graphs are based on annual data from the years 2000 and 2011. We have not included results for the other years for the sake of keeping this a short blog. Figures 3 & 4 are network graphs of refugee Inflows and Outflows from 2000-2011. The network graphs are based on weighted indegree as well as outdegree measures. The top ten refugee origins/destinations are also listed.   

Figure 3: Network graphs of refugee Inflows from 2000 - 2011

 Indegree Consolidated.jpg

  Top 10: Pakistan, Islamic Rep. of Iran, Germany, Syrian Arab Rep., United Rep. of Tanzania, United States, Kenya, China, United Kingdom, Jordan

 

Figure 4: Network graphs of refugee Outflows from 2000 - 2011

Outdegree Consolidated.jpg

Top 10: Afghanistan, Iraq, Somalia, Sudan, Dem. Rep. of the Congo, Burundi, Vietnam, Occupied Palestinian Territory, Angola, Myanmar

 

Figure 5 has network graphs of refugee flows during 2011. The network graphs are based on weighted indegree as well as outdegree measures. The top ten refugee origins/destinations are also listed. 

 

Figure 5: Network graphs of refugee flows during 2011

Graph of Refugee Inflows - 2011

Indegree 2011.jpg
Indegree 2011 - close.jpg

Top 10: Pakistan, Islamic Rep. of Iran, Syrian Arab Rep., Germany, Kenya, Jordan, Chad, China, Ethiopia, United States

Graph of Refugee Outflows - 2011

Outdegree 2011.jpg
Outdegree 2011 close.jpg

Top 10: Afghanistan, Iraq, Somalia, Sudan, Dem. Rep. of the Congo, Myanmar, Colombia, Vietnam, Eritrea, China

 

Figure 6 has network graphs of refugee flows during 2000. The network graphs are based on weighted indegree as well as outdegree measures. The top ten refugee origins/destinations are also listed.   

 

Figure 6: Network graphs of refugee flows during 2000

Graph of Refugee Inflows - 2000

Indegree 2000.jpg
Indegree 2000 close.jpg

Top 10: Pakistan, Islamic Rep. of Iran, Germany, United Rep. of Tanzania, United States, Serbia (and Kosovo: S/RES/1244 (1999)), Guinea, Sudan, Dem. Rep. of the Congo, China

Graph of Refugee Outflows - 2000

Outdegree 2000.jpg
Outdegree 2000 close.jpg

Top 10: Afghanistan, Iraq, Somalia, Sudan, Dem. Rep. of the Congo, Myanmar, Colombia, Vietnam, Eritrea, China

 

 

 
Conclusions

1. During 2000-2011, the largest refugee movements were from Afghanistan to Pakistan (18,790,596), which accounted for 15% of the total, followed by Afghanistan to Islamic Republic of Iran (12,423,106 - 10%), Iraq to Syrian Arab Republic (6,146,922 - 5%), Burundi to United Republic of Tanzania (4,030,134 - 3%), Vietnam to China (3,591,405 - 3%), followed by others.
2. In 2011, the largest refugee movements were from Afghanistan to Pakistan (1,701,945), which was 16% of total global refugees, followed by Afghanistan to Islamic Republic of Iran (840,451 - 8%), Iraq to Syrian Arab Republic (750,000 - 7%), Somalia to Kenya (517,666 - 5%), Iraq to Jordan (450,000 - 4%).
3. A very large majority of the world's refugees flee to neighboring countries (e.g. Afghanistan to Pakistan, Afghanistan to Islamic Republic of Iran, Iraq to Syrian Arab Republic, Burundi to United Republic of Tanzania, Vietnam to China, Iraq to Jordan, Somalia to Kenya, Sudan to Chad, Eritrea to Sudan etc.).
4. A large number of refugees originate from the poorest countries in Asia and Africa, and a large majority of them move to other poor neighboring countries, which have very limited resources to host them. The one exception is Germany, which has consistently been a top 5 refugee destination country for every single year from 2000-2011.
5. Among economically advanced countries, the largest recipients of refugees were Germany, United States, United Kingdom, France, Canada, Sweden, Netherlands, Switzerland and Australia.
6. Analysis of data across the years 2000-2011 reveals what has changed in the world of refugees and what has not. Across all the years, Afghanistan is the largest source of refugees and, except for 2002 and 2004, Pakistan is the largest destination of refugees.
 

January 3, 2013

Network Structure of Music Genres and their Stylistic Origins

According to musicologist Jean-Jacques Nattiez (Musicologie générale et sémiologie, 1987), the border between music and noise is always culturally defined. Composer John Cage thought that any sound can be music and said, "There is no noise, only sound." Music has been an important part of many cultures and there are several hundred popular music genres and subgenres. Needless to say, the origin and style of many musical genres and subgenres can be traced to other genres and subgenres. Let us explore further the various musical genres and their stylistic origins. The graphs below trace the various genres and their stylistic origins according to data from DBpedia/Wikipedia. The following are some caveats with regard to the data and methodology:

  • The data in Wikipedia is not complete and its completeness is limited by the topics and updates contributed by users. For example, several popular Indian genres such as Ghazal, Qawwali, Bhajan, Lavni, Kirtan, Thumri, Tarana, Kajri, Dhrupad etc. lack rich RDF representations. A similar situation exists for several African, Asian and Latin American genres. This is a case for creating rich Wikipedia entries for these music genres.
  • The data is spread across musical genres over several generations and we do not have the specific timeline of the stylistic origin of the genres. Therefore a genre which is influenced by a stylistic origin during a certain time period could in turn be the stylistic origin of another genre during another time period. For example, the stylistic origin of Electro can be traced to Funk and Funk in turn owes its origin to Jazz.
  • There could be stylistic origins and genres which are closely related, but not captured in Wikipedia.
  • We explored only the musical genres listed in the English language section of Wikipedia, which in turn could have been limited by the extent of English language penetration and usage in various countries.
  • Another limiting factor for global representation in Wikipedia could be the penetration of Internet and computers.

A musical genre could be associated with more than one stylistic origin. For example, the stylistic origins of Hip Hop can be traced to Rhythm and blues, Reggae, Funk, Disco, Dancehall, Scat singing etc. Our graph takes the musical genres and stylistic origins as nodes and the edges represent their relationships. Our data has 1058 nodes and 2767 edges. Let us go over the graphs and results for various centrality measures.
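As a minimal sketch of the graph construction described above, assuming the DBpedia data has already been exported as (stylistic origin, genre) pairs, the directed network can be held in plain Python dictionaries. The sample pairs come from the Hip Hop, Electro and Funk examples in the text. The edge direction (origin to genre) and the function name `build_genre_graph` are our assumptions for illustration, not part of the tooling used for the graphs below.

```python
from collections import defaultdict

# Illustrative (stylistic_origin, genre) pairs taken from the examples in the text.
# The real dataset has 1058 nodes and 2767 edges.
ORIGIN_PAIRS = [
    ("Rhythm_and_blues", "Hip_hop"),
    ("Reggae", "Hip_hop"),
    ("Funk", "Hip_hop"),
    ("Disco", "Hip_hop"),
    ("Dancehall", "Hip_hop"),
    ("Scat_singing", "Hip_hop"),
    ("Funk", "Electro"),
    ("Jazz", "Funk"),
]


def build_genre_graph(pairs):
    """Return (nodes, successors) for a directed graph with edges origin -> genre."""
    successors = defaultdict(set)
    nodes = set()
    for origin, genre in pairs:
        successors[origin].add(genre)
        nodes.update((origin, genre))
    return nodes, dict(successors)


nodes, successors = build_genre_graph(ORIGIN_PAIRS)
edge_count = sum(len(targets) for targets in successors.values())
print(len(nodes), "nodes,", edge_count, "edges")
```

Storing successors as sets means that a pair listed twice in the source data is counted as a single edge.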

Overview Graph of Music Genres and their Stylistic Origins

 

Origin - Music - 1.jpg

 

Authority 

 

 

Origin - Music - Authority.jpg

Betweenness

Origin - Music - Betweeness.jpg

Closeness 

 

Origin - Music - Closeness.jpg 

Degree 


Origin - Music -  Degree.jpg


Hub

 

Origin - Music - Hub.jpg

  

Indegree

 

Origin - Music - Indegree.jpg 

Outdegree
 

Origin - Music - Outdegree.jpg 

Observations & Conclusions

Wikipedia entries for musical genres and their stylistic origins follow a power-law degree distribution and exhibit characteristics of scale-free networks. In other words, the nodes are not evenly connected: several highly connected nodes govern the properties of the network. Past research by Albert-László Barabási and colleagues found that World Wide Web links have scale-free network characteristics. Similar results have been proposed for biological networks and social networks. It will be interesting to see if similar characteristics can be found for other Wikipedia entries.
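One rough way to check for a power-law degree distribution is to fit a straight line to the degree histogram on log-log axes and read the exponent off the slope. The sketch below does this with the standard library only. The degree sequence is synthetic (it is not our genre data), the function names are hypothetical, and a least-squares fit of this kind is only a quick diagnostic rather than a rigorous statistical test.

```python
import math
from collections import Counter


def degree_histogram(degrees):
    """Map each degree value k to the number of nodes with that degree."""
    return Counter(degrees)


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(degree), ignoring degree 0."""
    points = [
        (math.log(k), math.log(c))
        for k, c in histogram.items()
        if k > 0 and c > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic degree sequence, NOT the article's data: count(k) = 1000 / k**2.
synthetic_degrees = []
for k in (1, 2, 5, 10):
    synthetic_degrees.extend([k] * (1000 // (k * k)))

slope = loglog_slope(degree_histogram(synthetic_degrees))
print(f"estimated exponent: {-slope:.2f}")
```

For a network that follows a power law, the points fall close to a line and the negated slope approximates the exponent; strong curvature on the log-log plot would argue against a scale-free reading.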

Degree Distribution

 

Degree Distribution.jpg

We observed 30 prominent clusters and 17 smaller clusters (2 or 3 entities).

 

 

Music Origins 7.jpg 

The table below lists the most important nodes of the network for the selected centrality measures (in-degree, out-degree, degree, closeness, betweenness and eigenvector centralities). The in-degree centrality of a node counts the connections that the node receives from other nodes. For example, in our result, the top six in-degree centrality scores belong to Girl_group, Dark_psytrance, Breakcore, Boy_band, New_Age and Chillwave. In other words, these genres draw their origins from a large number of other musical genres. The out-degree centrality of a node counts the connections that the node sends to other nodes; the top five in our study are Hip_hop, Punk_rock, Jazz, Pop and Funk, which means that these musical genres have given rise to the largest number of other genres. Closeness centrality measures indicate the influence of several regional as well as country-specific genres and their origins, such as Kansas City Jazz, Music of India, France, Folk music of England etc.

 

| In-Degree | Out-Degree | Degree | Closeness | Betweenness | Eigenvector |
|---|---|---|---|---|---|
| Girl_group | Hip_hop | Hip_hop | African_American | Synthpop | Dark_psytrance |
| Dark_psytrance | Punk_rock | Punk_rock | Cakewalk | Punk_rock | Chillwave |
| Breakcore | Jazz | Jazz | American_march | Country | Breakcore |
| Boy_band | Pop | Funk | Stride | Progressive_rock | Wonky |
| New_Age | Funk | Pop | Dixieland | Post-punk | Psybient |
| Chillwave | Rhythm_and_blues | Rhythm_and_blues | Bounty | Jazz_fusion | Wonky_music |
| Big_beat | Folk | Synthpop | Ragtime | New_Wave | Big_beat |
| New_rave | Synthpop | House | Cool_jazz | Rock_and_roll | New_rave |
| Industrial_rock | Hardcore_punk | Post-punk | Romanticism | Rock | Industrial_rock |
| New_Wave | House | Rock_and_roll | Music_of_India | Post-bop | Post-punk_revival |
| Noise | Post-punk | Folk | Medieval | Hip_hop | Electroclash |
| Trip_hop | Psychedelic_rock | Psychedelic_rock | Kansas_City_jazz | Techno | Alternative_dance |
| Hip_hop | Techno | Techno | Modal_jazz | Industrial | Trip_hop |
| Wonky | Blues | Industrial | Hard_bop | Psychedelic_rock | Liquid_funk |
| Liquid_funk | Electronic | New_Wave | Impressionist | House | Girl_group |
| Wonky_music | Rock_and_roll | Hardcore_punk | Bebop | Bluegrass | Post-rock |
| Ambient | Industrial | Rock | Folk_of_Scotland | Glam_rock | Dubtech |
| Mod_revival | Rock | Blues | Afro-American | Disco | Hard_NRG |
| Chinese_ambient | Heavy_metal | Electronic | Music_of_Scotland | Rhythm_and_blues | Shoegazing |
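The in-degree and out-degree rankings discussed above can be reproduced in spirit with a few lines of Python. This is a sketch on toy edges built from the examples in the text, assuming edges point from a stylistic origin to the genre it influenced; `degree_rankings` is a hypothetical helper, and the rankings in the table came from the full dataset, not from this code.

```python
from collections import Counter


def degree_rankings(edges, top=3):
    """Return the top nodes by in-degree and by out-degree for directed edges."""
    in_degree = Counter()
    out_degree = Counter()
    for origin, genre in set(edges):  # ignore duplicate edges
        out_degree[origin] += 1  # the origin "sends" a connection
        in_degree[genre] += 1  # the genre "receives" a connection
    return in_degree.most_common(top), out_degree.most_common(top)


# Toy edges (stylistic origin -> genre) built from examples in the text.
edges = [
    ("Rhythm_and_blues", "Hip_hop"),
    ("Reggae", "Hip_hop"),
    ("Funk", "Hip_hop"),
    ("Funk", "Electro"),
    ("Jazz", "Funk"),
]

top_in, top_out = degree_rankings(edges)
print("in-degree:", top_in)
print("out-degree:", top_out)
```

On this toy sample Hip_hop leads on in-degree simply because only its origins are listed; in the full network it leads on out-degree, as the table shows.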

  
 

December 19, 2012

Networks of key decision makers

A board of directors consists of a group of elected members who oversee the activities of a company. They are representatives of the shareholders and play an important role in decisions on major company issues including dividends, mergers and acquisitions, senior appointments, compensation, etc. Boards of directors can be modelled as a two-mode network, and directors' membership of multiple company boards can be modelled as an interconnected network with the board memberships acting as the ties. One means of representing the strength of a tie is the number of board memberships.
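To make the two-mode idea concrete, here is a minimal sketch that stores companies and their directors as a bipartite mapping and projects it onto a company-to-company network. The company and director names are placeholders, not data from our sample, and weighting a tie by the number of shared directors is one possible reading of tie strength that we assume here for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical board memberships; company and director names are placeholders.
BOARDS = {
    "Company_A": {"Director_1", "Director_2", "Director_3"},
    "Company_B": {"Director_2", "Director_3", "Director_4"},
    "Company_C": {"Director_4", "Director_5"},
}


def project_onto_companies(boards):
    """Link two companies when they share directors; weight = number shared."""
    ties = {}
    for first, second in combinations(sorted(boards), 2):
        shared = boards[first] & boards[second]
        if shared:
            ties[(first, second)] = len(shared)
    return ties


def boards_per_director(boards):
    """Count how many boards each director sits on."""
    counts = defaultdict(int)
    for members in boards.values():
        for director in members:
            counts[director] += 1
    return dict(counts)


company_ties = project_onto_companies(BOARDS)
seat_counts = boards_per_director(BOARDS)
print(company_ties)
print(seat_counts)
```

The same projection can be run in the other direction (directors linked through shared boards) by swapping the roles of the two node types.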

In this research, we model the board membership of the top 100 most valuable companies in India. We collected the list of the top 100 companies in India by market capitalization listed on BSE. We then looked up their board memberships and analysed the network. The network below shows the board membership of three organisations in our sample. The red nodes are the companies and the black nodes are members of the board.

 

  

Small Group network 1.png 

We grouped the sample of 100 companies and saw the emergence of 46 clusters. The figures below give details of the clusters as well as their interconnections through board memberships. The links show board membership ties between the members.

  

Bod Groups 1.jpg

  

Bod Groups 2.jpg

The network graphs below are varying visualisations of companies and their board membership.

 

Bod Network 1.jpg

 

Bod Network 2.png  

Bod Network 3.jpg

 
The table below lists the most important nodes of the network for the selected centrality measures (in-degree and out-degree centralities). The in-degree centrality of a node counts the connections that the node receives from other nodes. For example, in our sample Dr. Omkar Goswami has the highest in-degree centrality as he is on the board of 6 companies in our sample set of the top 100 companies in India by market capitalization listed on BSE. The out-degree centrality of a node counts the connections that the node sends to other nodes. Larsen and Toubro has the highest out-degree centrality as the company has 28 board members.

  

Table 1.jpg 

This was a simple illustration of the network of key decision makers of top 100 companies in India by market capitalization listed on BSE. The network between the board members of larger sample sets could get denser.

 

 

 

 

 

 


 

November 11, 2012

Tale of Two Storms: Hurricane Sandy & Cyclone Nilam

Social media sources have played a key role in disaster reporting, relief and rescue efforts. Social media destinations such as Twitter, Flickr and Facebook were leveraged extensively to spread information during the Victorian bushfires in Australia in 2009. In this blog, we explore the role played by government, civic authorities, law and order, general public, celebrities, activists, journalists and media in the wake of two natural disasters: Hurricane Sandy and Cyclone Nilam.

The terms hurricane and typhoon are regional names for a strong tropical cyclone. Once a tropical cyclone reaches winds of at least 17 m/s, it is called a tropical storm and is assigned a name. If winds reach 33 m/s, it is called a hurricane in the North Atlantic Ocean, Eastern Pacific and South Pacific Ocean, a typhoon in the Western Pacific and a cyclone in the Southern Pacific and Indian Ocean. In the recent past, we experienced two significant simultaneous cyclones: Hurricane Sandy on the East Coast of the US and Cyclone Nilam on the East Coast of India. Hurricane Sandy is the largest Atlantic hurricane on record, as well as the second costliest Atlantic hurricane, surpassed only by Hurricane Katrina in 2005. Hurricane Sandy struck in late October 2012. In the United States, Sandy caused severe damage in New Jersey and New York. It has claimed more than 50 lives, left millions without power and caused over US$ 50 billion in damage in the United States. Damage to life and property was also spread across Jamaica, Haiti, the Dominican Republic, Puerto Rico, Cuba and The Bahamas. Cyclonic Storm Nilam, which struck India in late October 2012, caused damage across Indian states such as Tamil Nadu, Andhra Pradesh, Karnataka and Odisha.

According to Semiocast, the United States has 141.8 million Twitter users, and India has over 15 million Twitter users. This depicts a healthy Twitter user base in the United States and a fast-growing Twitter user base in India. We collected tweets containing the hashtag #Sandy from Oct 30th until Nov 6th: 500,036 tweets from 306,348 users. We also collected about 1500 tweets with the hashtag #Nilam on 31st Oct 2012. We classified the #Sandy tweets into four sets based on the time interval in which the tweets occurred. Time Interval 1 has 180,489 nodes and 232,578 edges, Time Interval 2 has 60,351 nodes and 65,842 edges, Time Interval 3 has 31,033 nodes and 31,039 edges, and Time Interval 4 has 14,550 nodes and 13,350 edges. The network graphs, tables and charts below unravel the role of social media in disaster reporting, relief and rescue efforts by government, civic authorities, law and order, general public, celebrities, activists, journalists and media.
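The split into time intervals can be sketched as follows. We assume here, purely for illustration, that the collection window is cut into four equal-width intervals and that each retweet or mention has already been reduced to a (timestamp, source, target) triple; the actual interval boundaries used for the graphs below are not reproduced here, and the user names are placeholders.

```python
from datetime import datetime


def split_into_intervals(edges, start, end, buckets=4):
    """Assign (timestamp, source, target) edges to equal-width time intervals."""
    width = (end - start) / buckets
    intervals = [[] for _ in range(buckets)]
    for timestamp, source, target in edges:
        if not start <= timestamp < end:
            continue  # outside the collection window
        index = min(int((timestamp - start) / width), buckets - 1)
        intervals[index].append((source, target))
    return intervals


def network_size(interval_edges):
    """Return (node count, distinct edge count) for one interval."""
    distinct = set(interval_edges)
    nodes = {user for edge in distinct for user in edge}
    return len(nodes), len(distinct)


# Placeholder interactions; user names are hypothetical.
SAMPLE_EDGES = [
    (datetime(2012, 10, 30, 9, 0), "user_a", "user_b"),
    (datetime(2012, 10, 30, 18, 0), "user_c", "user_b"),
    (datetime(2012, 11, 1, 12, 0), "user_a", "user_d"),
    (datetime(2012, 11, 5, 8, 0), "user_e", "user_a"),
]

intervals = split_into_intervals(
    SAMPLE_EDGES, datetime(2012, 10, 30), datetime(2012, 11, 7)
)
sizes = [network_size(interval) for interval in intervals]
for number, (node_count, edge_count) in enumerate(sizes, start=1):
    print(f"Time Interval {number}: {node_count} nodes, {edge_count} edges")
```

Counting nodes and edges per interval in this way gives the kind of per-interval network sizes quoted above.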

 

Hurricane Sandy: Time Interval 1 

 

Sandy Time 1 - 1.jpg

  

Sandy Time 1 - 2.jpg

 

Hurricane Sandy: Time Interval 2
 

Sandy Time 2 - 1.jpg

 

   

Sandy Time 2 - 2.jpg

 

Hurricane Sandy: Time Interval 3
 

Sandy Time 3 - 2.jpg 

 Hurricane Sandy: Time Interval 4

 Sandy Time 4 - 2.jpg

 

 Hurricane Sandy: Distribution of most important Tweeters based on Network Centrality measures 

 

Graph.jpg

 

Meanwhile in some other part of the graph, we observed the following clusters. 

 

Ot.jpg  

Cyclone Nilam

   Nilam.jpg  

Conclusions

1. Sandy was 'Instagrammed': The number of #Sandy tweets originating from Instagram ranked fourth, preceded only by tweets originating from iPhone, Web and Android; in the case of Nilam, Twitpic scored high. This is an interesting social behavior: in disaster scenarios, visual means of message propagation assume prominence. This is in contrast to our observations of message propagation during the Olympic Games, where tweets originating from Instagram scored very low.

2. Government and officials leveraging social media: In the case of Hurricane Sandy, @NYGovCuomo, the official Twitter account of the Governor of New York State, Mr. Andrew Cuomo, consistently emerged as one of the most important entities. We saw no such trend in the case of Cyclone Nilam, where we did not see even a single tweet from a government official or utility.

3. News Media leveraging Social Media: Journalists and News Media extensively leveraged Social Media for news propagation and for news amplification. News Media sources and Journalists were the heaviest users of Social media in the events of both Sandy and Nilam, supporting the findings from previous research. Individual Journalists leveraged their Twitter follower base to spread messages faster and they invariably figured higher up in the network centrality measures as against the twitter accounts of Media houses which employ them.

4. Cyclone Nilam saw a low overall coverage in Social Media in India and surrounding affected regions.

5. Activist groups such as Anonymous leveraged social media for message propagation in the case of Hurricane Sandy.

6. Celebrities in the United States were active in social media during Hurricane Sandy as against celebrities in India who chose to ignore Cyclone Nilam.

 

October 23, 2012

Visualising the Third Presidential Debate

The third Presidential Debate concluded just a few hours back. The two network graphs visualize the entire debate content. The table below lists the most influential keywords used by the two presidential contestants.

 

President Obama

 Obama - Third.png

Mitt Romney

 

Romney - Third.png

 

  

 Most Influential keywords

 

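| President Obama | Score | Mitt Romney | Score |
|---|---|---|---|
| make | 1527 | make | 894 |
| american | 395 | world | 860 |
| world | 370 | year | 782 |
| governor | 367 | president | 726 |
| kind | 352 | people | 590 |
| military | 335 | america | 558 |
| region | 330 | nation | 441 |
| iran | 320 | military | 419 |
| job | 303 | number | 412 |
| america | 293 | state | 256 |
| thing | 256 | thing | 211 |
| making | 238 | work | 209 |
| country | 221 | time | 203 |
| leadership | 191 | syria | 193 |
| office | 191 | back | 182 |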

 

 

 

October 19, 2012

Visualising the Second Presidential Debate

Presidential debates in the United States take place between the candidates of the two largest parties. Critical topics are discussed in these debates, which tend to sway the general public. According to Nielsen Ratings, the second Presidential debate on the 16th Oct 2012 had an estimated viewership of 65 million. This is a slight dip from the first debate, which attracted over 67 million viewers. The table below lists the most influential keywords used by the two presidential contestants. The two network graphs visualize the entire debate content.

 

Most Influential keywords

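| President Obama | Score | Mitt Romney | Score |
|---|---|---|---|
| make | 2004 | people | 1290 |
| governor | 1060 | president | 1233 |
| job | 611 | year | 826 |
| tax | 510 | make | 677 |
| romney | 458 | job | 581 |
| people | 372 | percent | 356 |
| year | 326 | bring | 336 |
| country | 284 | america | 329 |
| world | 230 | tax | 245 |
| create | 211 | country | 235 |
| folk | 200 | question | 177 |
| education | 183 | work | 166 |
| thing | 181 | policy | 138 |
| family | 178 | back | 136 |
| energy | 166 | happen | 135 |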

 

 

President Obama

 

Obama.png

 

Mitt Romney

  

Romney.png 

 

 

 

September 21, 2012

Visualising State of the Union Messages

State of the Union Messages to the Congress are mandated by Article II, Section 3 of the United States Constitution, which states, "He shall from time to time give to the Congress information of the state of the union, and recommend to their consideration such measures as he shall judge necessary and expedient". Since 1790, State of the Union messages have been delivered regularly at approximately one-year intervals (Source: The American Presidency Project http://www.presidency.ucsb.edu/sou.php#axzz274c0J8EF ). The State of the Union Message is delivered near the beginning of each session of Congress; it is a report on the condition of the country and a platform for presidents to outline their legislative agenda and priorities. Over the years, the messages have emerged as a communication between the president and the people of the United States. The messages are available at http://www.presidency.ucsb.edu/sou.php#axzz274c0J8EF. We are interested in examining what President Barack Obama and former President George Bush have been stressing in their speeches. We used Textexture (http://textexture.com/) to visualize the State of the Union Messages by President George Bush (2005-08) and President Barack Obama (2009-2012). The network visualisations and results from Textexture are given below. Each network has an average of 100 words (nodes) and about 1000 edges (co-occurrences). In the network graphs below, the words are the nodes and their co-occurrences are the connections between them.
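The idea of words as nodes and co-occurrences as edges can be illustrated with a small sketch. This is a generic construction written for this post and not a reproduction of the tool's internal pipeline: the stopword list, the window size and the use of weighted degree as a stand-in for "influence" are all our assumptions, and the sample sentence is made up.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "our", "we", "for"}


def cooccurrence_edges(text, window=2):
    """Count pairs of distinct words that appear within `window` consecutive tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    edges = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                edges[tuple(sorted((word, other)))] += 1
    return edges


def weighted_degree(edges):
    """Sum co-occurrence counts per word as a simple prominence score."""
    scores = Counter()
    for (first, second), weight in edges.items():
        scores[first] += weight
        scores[second] += weight
    return scores


# Made-up sample text, not taken from any speech.
SAMPLE_TEXT = "The economy needs jobs. Jobs create growth and growth needs jobs."

edges = cooccurrence_edges(SAMPLE_TEXT)
scores = weighted_degree(edges)
print(scores.most_common(3))
```

A larger window links words that are further apart, which produces denser networks; a fuller treatment would also reduce words to their base form so that "job" and "jobs" share a node.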

 

President Barack Obama - 2009

11.jpg

Most influential keywords: American, Year, Economy, America

Most influential contexts:
#0:   american    people    bank    country
#1:   year    economy    crisis    energy
#2:   america    time    make    large
#3:   job    million    create    industry

President Barack Obama - 2010

22.jpg

Most influential keywords: Year, American, Job, Work

Most influential contexts:
#0:   year    deficit    decade    office
#1:   american    work    people    time
#2:   job    america    economy    nation
#3:   business    tax    small    cut
    
 

 

 

President Barack Obama - 2011

33.jpg

Most influential keywords in this text: Job, Year, People, American

Most influential contexts:
#0:   job    people    american    tax
#1:   year    work    country    school
#2:   america    nation    place    energy
#3:   make    business    support    republican

President Barack Obama - 2012

44.jpg

Most influential keywords in this text: American, Job, Business, Tax

Most influential contexts:
#0:   american    year    work    million
#1:   job    business    create    back
#2:   tax    company    pay    debt
#3:   america    country    state    build
     

 

 

President George Bush - 2005

55.jpg

Most influential keywords in this text: Make, People, Freedom, America

Most influential contexts:
#0:   make    american    citizen    good
#1:   people    freedom    america    iraq
#2:   security    year    great    million
#3:   government    child    retirement    account

President George Bush - 2006

66.jpg

Most influential keywords in this text: America, World, American, Freedom

Most influential contexts:
#0:   america    lead    life    require
#1:   world    american    people    nation
#2:   freedom    hope    great    iraqi
#3:   year    science    effort    cut
     

 

 

President George Bush - 2007

77.jpg

Most influential keywords in this text: America, Iraq, American, People

Most influential contexts:
#0:   america    citizen    nation    united
#1:   iraq    government    force    oil
#2:   american    people    health    free
#3:   tonight    congress    country    law

President George Bush - 2008

88.jpg

Most influential keywords in this text: American, Congress, People, Year

Most influential contexts:
#0:   american    people    nation    future
#1:   congress    good    million    agreement
#2:   year    america    enemy    state
#3:   iraq    terrorist    iraqi    force
    

  

   

September 10, 2012

How the Paralympic Games were Tweeted

The Paralympic Games have come a long way from the first official Games in 1960, which were held in Rome, were no longer open solely to war veterans and attracted 400 athletes from 23 countries. The London 2012 Paralympic Games saw the participation of 4,200 athletes from 147 countries, taking part in 21 sports. The International Paralympic Committee has defined six disability categories across physical and intellectual disabilities. These are: Amputee, Cerebral Palsy, Intellectual Disability, Wheelchair, Visually Impaired and Les Autres (athletes who do not fall under the other five categories). In this blog we explore how the Paralympic Games were tweeted. We collected Twitter data with the hashtag #paralympics via 140kit.

A sample of 82,560 tweets across 54,522 users was collected between Aug 30th and Sept 6th 2012. The table below lists the overall statistics of the top fifteen entries with regard to language, source application from which the tweet originated, location and hashtag. A network graph was created of the entire network of users who retweeted or mentioned one another. In this dataset, there were 48,956 nodes and 38,959 edges. There were 42,881 retweets, 229 mentions, 546 average followers, 354 average friends and 113 average favourites. The figure below gives the graph of the network of users, created using Gephi. Each dot in the graph is a network node and corresponds to the Twitter account of a user. The links are the edges and correspond to users who retweeted or mentioned one another.

 

Overall Statistics

  

| Language | Number of Users | Source Application | Number of Users | Location | Number | Hashtag | Number of Users |
|---|---|---|---|---|---|---|---|
| English | 80446 | Twitter for BlackBerry | 22820 | London | 13042 | #teamGB | 1075 |
| Dutch | 948 | Web | 21283 | Amsterdam | 6481 | #paralympics | 1057 |
| Spanish | 246 | Twitter for iPhone | 12917 | Casablanca | 2038 | #London2012 | 530 |
| German | 213 | Twitter for Android | 10947 | Hawaii | 1410 | #inspirational | 379 |
| French | 205 | Mobile Web | 2502 | Greenland | 799 | #isitok | 285 |
| Swedish | 166 | UberSocial for BlackBerry | 2277 | Edinburgh | 684 | #amazing | 276 |
| Norwegian | 93 | TweetDeck | 1790 | Dublin | 533 | #respect | 254 |
| Japanese | 56 | Twitter for iPad | 1182 | Pacific Time | 495 | #c4paralympics | 241 |
| Italian | 53 | Echofon | 562 | Eastern Time | 456 | #inspiring | 213 |
| Portuguese | 51 | HootSuite | 512 | Central Time | 408 | #PPProud | 187 |
| Turkish | 16 | txt | 505 | Athens | 383 | #Paralympics2012 | 178 |
| Russian | 16 | Facebook | 478 | Quito | 321 | #Superhumans | 157 |
| Indonesian | 13 | TweetCaster for Android | 428 | Pretoria | 316 | #excited | 145 |
| da | 11 | Tweetbot for iOS | 341 | Alaska | 283 | #lastleg | 145 |
| Korean | 8 | Twitter for BlackBerry® | 300 | Singapore | 263 | #PROUD | 125 |

 

 

 Network Graph 1 

  

Paralympics1.png 

  

 Network Graph 2  

 

Paralympics2.png 
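The retweet and mention network described above can be derived from raw tweet text with a couple of regular expressions. The sketch below assumes the text-based "RT @user" retweet convention and treats every other @user as a mention; the account names are placeholders, `tweet_edges` is a hypothetical helper, and the fields actually supplied by 140kit are not modelled here.

```python
import re

MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT @(\w+)")


def tweet_edges(author, text):
    """Return (source, target, kind) edges for one tweet.

    A tweet starting with "RT @user" yields one retweet edge; any other
    @user in the text yields a mention edge.
    """
    edges = []
    retweeted = None
    match = RETWEET.match(text)
    if match:
        retweeted = match.group(1).lower()
        edges.append((author, retweeted, "retweet"))
    for name in MENTION.findall(text):
        name = name.lower()
        if name != retweeted and name != author:
            edges.append((author, name, "mention"))
    return edges


# Placeholder tweets; account names are hypothetical.
SAMPLE_TWEETS = [
    ("fan_one", "RT @news_desk: Amazing race tonight #paralympics"),
    ("fan_two", "Well done @athlete_x and @athlete_y! #teamGB"),
    ("news_desk", "Amazing race tonight #paralympics"),
]

all_edges = [
    edge for author, text in SAMPLE_TWEETS for edge in tweet_edges(author, text)
]
nodes = {user for source, target, _ in all_edges for user in (source, target)}
print(len(nodes), "nodes,", len(all_edges), "edges")
```

An edge list in this form can be written out as a CSV file of source and target columns for loading into a graph visualisation tool.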

English was the language of choice for people tweeting about the Paralympic Games, making up about 97% of the sample. A very similar trend was observed in an earlier study using Olympic tweets. English was followed by other languages such as Dutch, Spanish, German, French and Swedish. Tweets sent from Twitter for BlackBerry constituted about 28% of the total, followed by Web-based tweets (26%), Twitter for iPhone (16%), Twitter for Android (13%) and the rest. This indicates that more people were tweeting about the Paralympic Games through mobile devices. Most of the tweets originated from Europe, followed by North America. London leads the list, followed by Amsterdam, Casablanca and Hawaii. Analysing the hashtags reveals that #teamGB is the most used, followed by #paralympics, #London2012 and #inspirational. Team Great Britain seems to be the most popular team.
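The shares quoted above can be checked directly against the table. The short sketch below assumes that the counts in the table are tweet counts and that the denominator is the full sample of 82,560 tweets; the helper name `share` is ours.

```python
TOTAL_TWEETS = 82560  # sample size reported in the post

# Counts copied from the "Overall Statistics" table.
COUNTS = {
    "English": 80446,
    "Twitter for BlackBerry": 22820,
    "Web": 21283,
    "Twitter for iPhone": 12917,
    "Twitter for Android": 10947,
}


def share(count, total=TOTAL_TWEETS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {label: share(count) for label, count in COUNTS.items()}
for label, value in shares.items():
    print(f"{label}: {value}%")
```

The four mobile and web clients listed here together account for roughly four fifths of the sample.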