Small World !!!: April 2015

Saturday, April 11, 2015

Cupid in your network

Single and secretly wondering which of your friends might be able to introduce you to your future soulmate?

Sociologists have long studied the dating pool problem, noting that an acquaintance is more likely to introduce you to your mate [1] than a close friend is. Your close friends tend to know all of the same people as you, while acquaintances have greater “bridging social capital,” connecting you to diverse clusters of new people. On the other hand, your close friends might be more motivated to set you up; they care more about your happiness, and they know you better. So who, really, is more likely to set you up? And do these “matchmakers” have special characteristics that make them ideally suited to the task? (Note that when we say “matchmaker” we mean “a friend who introduced you” rather than a professional.)

To answer these questions, a survey was done of approximately 1500 English speakers around the world who had listed a relationship on their profile at least one year ago but no more than two years, asking them how they met their partner and who introduced them (if anyone) and then analyzed network properties of couples and their matchmakers using de-identified, aggregated data.

What does a typical matchmaker looks like?

Unlike the stereotype of an older woman from Fiddler on the Roof, our matchmakers looked pretty similar to the couples they were matching. They were young (most often in their 20s, as were most of the survey respondents), half of them were men, and roughly half of them were single (much like most of the future couple's other friends). So you might not be able to identify your cupid at a glance.

However, our matchmakers had a secret weapon: more unconnected friends. To start, matchmakers have far more friends than the people they're setting up. Before the relationship began, the people who later got matched had an average of 459 friends each, while matchmakers had 73% more friends (794 on average). This isn't that surprising, because network science tells us that your friends usually have more friends than you do [2] . But even compared to the couples' other friends, matchmakers have 8% more friends. So they're unusually well-connected.

Matchmakers have 73% more friends than the typical person they set up.

But more than simple friend count, what matters is whether matchmakers' networks have a different structure. And indeed, they do. Though they have more friends, their networks are less dense: their friends are less likely to know each other. We measure this with the clustering coefficient [3], which is 6% lower for matchmakers than other friends of the couple. Matchmakers also have 4% more isolated nodes as a proportion of their network, or people who have no other mutual friends. Matchmakers typically have about 16 unconnected ties in their networks, and each one is an opportunity: people who both know the matchmaker, but don't yet know each other.

A typical matchmaker's graph has a lot of nodes but relatively few triangles (people with another friend in common). Each pair of unconnected nodes is an opportunity for an introduction.

Are matchmakers close friends or acquaintances?

Matchmakers were pretty close friends to the people they later introduced. (Makes you wonder why they didn't introduce their friends sooner!) More than half of the survey respondents said they felt “very” or “extremely” close to the matchmaker before the match was made. Nearly half spoke to the matchmaker every day, and two thirds spoke to their matchmaker more than once a week. And half had known the matchmaker for three years or more, as did their future partner. So by every measure, matchmakers were close friends with the future couple.

Matchmakers were more likely to be close friends, rather than acquaintances.

However, survey responses are clouded by retrospective bias. Wouldn't you grow closer over time to the person who introduced you to your partner? And that rosy haze can make it hard to remember how close you really felt two years ago. So, for that an examination was done on the network data from the time before the relationship to see if people were actually close to their matchmaker and then counted the number of mutual friends each person in the couple had with the matchmaker before the relationship started, a common proxy for relationship closeness. Matchmakers typically had 57 friends in common with either person in the future couple, amounting to about 6% of their pool of friends. Does that make them close friends? One good comparison point is the couple today: they typically have 52 friends in common, or about 7% of their pool of friends. If romantic partners are your gold standard for close friends, the matchmakers look like strong ties!

Notably, the fraction of mutual friends between the matchmaker and a member of the couple is significantly higher when they are of the same gender, around 7.4% when the relationship started compared to only 5.1% when they are of the opposite gender.

Couples who met through a friend vs. those who met another way: How do their networks differ?

Couples who were introduced by a friend are more likely to have at least one friend in common a year prior to the start of their relationship (84% of them) compared to those who met in other ways (74% only), which is to be expected – this quantity increases over time, and a few months into the relationship, more than 99% of couples have a friend in common, regardless of how they met.
More interesting though, is that the fraction of mutual friends starts higher for couples who were *not* introduced by a mutual friend (around 4.4% overlap 18 months prior to the start of the relationship) compared to those who were (about 3.4% at that time). It is notable that that quantity increases faster for the latter, to the point that 18 months into the relationship the overlap of social networks of both members of the couple averages at around 7.7% whether or not they were introduced by a friend.

We can understand this by the fact that matchmakers act as bridges between communities or social groups: when you meet someone at school or at work, both you and the other person evolve in the same social circles, hence you have a higher overlap. But imagine one of your friends introduced you to someone at a different school; you might have fewer friends in common with that person at first, but over time you both start making new friends in each other's schools.

Couples who were introduced by a friend had a smaller fraction of mutual friends prior to the relationship.

If no one introduced them, how did people meet their partner?

Couples who found each other without a friend introducing them most often met at school. But nearly 1 out of 5 couples met through an online dating service or app. Keep in mind these are primarily couples in their twenties and thirties who started dating in 2013 or early 2014.

Fun fact: As many couples met through Facebook as at a party or in a bar.

This is consistent with a Pew study [4] finding that many social network site users have followed or friended someone because one of their friends suggested they'd want to date that person. And many of our respondents cited Facebook as playing a role in the early stages of their relationships: (Quotes have been edited for length and names have been changed.)

Dan first became interested in me because of a picture he saw of me on facebook. I met the person who introduced me to Dan at a party that was organized on Facebook. Now that I think of it, it's possible that I wouldn't even know Dan if Facebook didn't exist.

We were acquaintances, but after she posted on my wall that we had a lot in common, we began dating.

Without that friend suggestion in his newsfeed, I am not sure he and I would be together today!

People most often met their partner at school. As many couples met through Facebook as met at a party or bar.

References
1. Interpersonal Ties (http://en.wikipedia.org/wiki/Interpersonal_ties)
2. Friendship Paradox (http://en.wikipedia.org/wiki/Friendship_paradox)
3. Clustering Coefficient (http://en.wikipedia.org/wiki/Clustering_coefficient#Local_clustering_coefficient)
4. Pew Study (http://www.pewinternet.org/files/old-media//Files/Questionnaire/2013/Survey%20Questions_Online%20Dating.pdf)

Wednesday, April 8, 2015

How do I become a Twitter Stud ?

Today,Twitter is one of the most trending OSN. People are crazy for popularity on twitter. Do you post your tweet and keep endlessly waiting whether some of your friend/family will retweet it and it will give you an adrenaline rush ? Believe me,Deep down your heart even you do that ! So,The questions is How do I become popular on twitter ? This is a two step process - The first step being following massive number of people.Many of them will follow you back because of the "Follow back syndrome" .Once you have enough number of followers,it's time for some math. ( You might wish to unfollow few of the people who trusted you and followed you back! Thanks to twitter,it doesn't show unfollow notifications)

The second step being increasing the visibility of your tweets. To increase the visibility of your tweets,you would like to mention few users in your tweet which you think would retweet it.Now,the question is How do we select the users ??

The user must be popular. duh!
The users are active on twitter while you want to post the tweet
The users have a good history of retweeting
The users are susceptible of being interested in the tweet

Here are few statistics on the tweet proportion v/s number of retweets for different number of mentions.

Now,the following picture shows us this mention-dependency of top 10 trending hashtags .It seems that highly popular hashtags are heavily dependent on mention links.

```

Several users can be mentioned in a tweet, however this number is limited as the tweet should be less than 140 characters. This problems thus leads to finding the right features evaluating these three components and the trade-off between them.What if we assign a utility score to each user and map them to a knapsack problem ?

The knapsack problem with the following constraints.

Total Budget : 140 - number of characters in the message(m).

Value of the user : S(u,m) ( This my friend is the utility score which we calculate for user u based on his popularity and the expected probability of retweeting m)

Weight of user : Number of characters in his name + 2 ( Well,one for @ and one for space between multiple users)

This problem will return the set of users which we will mention in our tweet.

Okay,Knapsack is defined.Brace yourselves! Here comes the most important part,the utility score for a user.

This score is the product of three functions that respectively evaluates the Popularity of the user to

mention, its Activity and its Interest to the message.

The score is given by: S(u, m) = f (u) * pow(g(u),α) *pow(f (u, m), β )

where

f(u) = the number of followers of u.

g(u) = the number of retweets made recently by u

f(u, m) = the similarity of the message m to the last messages of u.(Standard NLP techniques used )

Obviously,now is the time for Twitter's beautifully made Twitter API! Sadly,there are few constraints within a 15 minutes window.

get 5000 followers (resp. friends) of a user: 15 requests.
get the basic information of a user (particularly its number of followers): 180 requests.
get the 200 last tweets of a user: 180 requests.

So,the algorithm goes as follows :

Crawl the followers wishing to tweet the message m.
Select the friends of the user which are not his followers and among these users select randomly 180 users
For each of these 180 users u, crawl its number of followers and set the value of f(u), crawl its 200 last tweets and compute g(u) and f(u, m) out of it
Now that the S(u,m) is calculated for all users,Solve the knapsack problem defined above and return the set of users.

Now,Tweet the message m followed by a newline character and the screen names of the users selected, each one preceded by the character @ and separated by spaces. (Now realized why did we do a +2 in the knapsack problem ?? )

After one day the tweets of the users mentioned are crawled to check which users retweeted. The coefficients α and β are then updated taking all available data to improve the utility score.

Increasing the visibility of your tweets will have a higher chance of retweets. More the retweets,more viral your tweet,more viral you and voila,you're a twitter stud!

Please follow me on twitter for further updates. ( See what I did there ?)

References :

An empirical approach towards an efficient “whom to mention?” Twitter app
Adrien Guille, Hakim Hacid, Cecile Favre, and Djamel A Zighed. Information diffusion in online social networks: A survey. ACM SIGMOD Record, 42(1):17–28, 2013
Kyumin Lee, Jalal Mahmud, Jilin Chen, Michelle Zhou, and Jeffrey Nichols. Who will retweet this? In Proceedings of the 19th international conference on Intelligent User Interfaces, pages 247–256. ACM, 2014.

Monday, April 6, 2015

Diseases and Protein Protein interactions

To understand the molecular basis of genetic diseases, it is important to discover their causal genes. Typically, a disease is associated with a linkage interval on the chromosome if single nucleotide polymorphism (SNPs) in the interval are correlated with an increased susceptibility to the disease. These linkage intervals define a set of candidate disease-causing genes. Genes related to the same disease are also known to have protein products that physically interact. A class of computational approaches have recently been proposed that exploit these two sources of information—physical interaction networks and linkage intervals—to predict associations between genes and diseases. Methods typically begin with an artificial disease subinterval and test how well they can identify a known causal gene from among a fixed number of nearby genes in the query subinterval. In another stringent method instead of ranking only genes in the subinterval, all genes in all intervals related to a query disease are ranked. This more stringent approach is advantageous because it allows us to find disease-causing genes that lie in existing disease intervals but that were previously not associated with the disease. Consequently, we can gauge a gene's relatedness to any query disease.

Network-based algorithms to predict gene–disease associations:

A widely used network-based approach (‘Neighborhood’) predicts for a protein p the annotations that are associated with more than θ percent of p's network neighbors. The method associates a gene with a disease if it lies within a linkage interval associated with the disease and interacts with ≥1 gene annotated with the disease.

Random walks have been used to transfer annotations within networks. We define a random walk (‘RW’) starting from genes known to be associated with a query disease d. At each time step, the walk has a probability r of returning to the initial nodes. Once the process converged (L₂-distance between probability vectors in consecutive time steps <10⁻⁶), a prediction was made for all genes in relevant intervals with visitation probability greater than θ.

Graph partitioning is a promising technique for predicting gene–disease associations because it can uncover functional modules in PPI networks, and phenotypically similar diseases are often caused by proteins that have similar biological processes. Three graph partitioning algorithms were recently shown to find the most biologically relevant modules: GS, MCL and VI-Cut. GS1 losslessly compresses the input network, producing a smaller summary network and a list of corrections to over-generalizations in the summary. The nodes in this summary correspond to modules in the input network. The summary graph can be further compressed by discarding the list of corrections and applying GS again, resulting in larger modules (‘GS2’). This process can be repeated i times, yielding a ‘GSi’ method. The ‘GS-All’ method makes the union of the predictions made by GS1, GS2 and GS3. VI-Cut is a semi-supervised clustering method that uses annotations in the training set when creating modules.

Quality of Network based predictions:

In a typical implementation the following results were obtained:

The random walk methods and Prop show a clear dominance over the clustering and neighborhood methods. The clustering methods, VI-Cut, and GS1 and its variants GS2, GS3 and GS-All, which have not previously been appraised for the task of predicting gene–disease associations, performed slightly worse than the random walk methods, but better than the neighborhood approaches. They achieve between 18.4% and 68.6% precision and 1.1% and 17.9% recall.

Conclusion:

The classes of network-based methods considered here each approached the task of predicting gene–disease associations using very different philosophies. Although random walk approaches are superior to clustering and neighborhood approaches. In general certain diseases for which high-throughput PPI networks were an especially useful source from which to make high-quality predictions can be found. Diseases that have little correlation with the interaction network call for higher quality networks or an integrative approach that considers sequence, functional annotations, expression data or other additional information.

References:

Aerts S, et al. Gene prioritization through genomic data fusion. Nat. Biotechnol. 2006

Brohee S,

van Helden J.

Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 2006

Fraser HB,

Plotkin JB

. Using protein complexes to predict phenotypic effects of gene mutation. Genome Biol. 2007

Ideker T,

Sharan R
. Protein networks in disease. Genome Res. 2008