Exploratory analysis of text re-writing processes - SS2020

Vishal Khanna, Seenam Sabahat, Mahdi Shakerirad, Sagar Nagaraj Simha, Michael Völske, Magdalena Anna Wolska


The larger goal of the project is to analyze how people present themselves on the web. We conduct our exploratory analysis on researchers who have a personal web profile. Broadly, we intend to categorize and find patterns in how these web profiles are structured and designed, in both form and content.

We found significant correlations of researchers’ professional attributes (such as number of affiliations), as reflected on their web pages, with their age and gender. Furthermore, certain personality traits correlate strongly with the researchers’ department.

Research Questions

  1. What kinds of features, prevalent in researchers' profiles, distinguish them from non-researchers? Are there any features which distinguish some researchers from others?
  2. Can researchers' personal attributes (such as age, gender, nationality, rank) or professional attributes (such as area of research, years of experience) be inferred from certain aspects of their profiles?
  3. What kinds of behavioural patterns emerge when researchers' profiles are analyzed over a period of time?
  4. Do researchers’ personality traits correlate with certain aspects of their web profiles?


We combine multiple data sources for our experiments.

  1. Over 9000 web profiles from universities across the USA - Using web scraping tools, we collect the professional and personal web profile links of researchers, along with other information such as research interests and education history, as available on the university websites.
  2. DBLP - The archive has about 42,000 researchers with their web links. However, not all of these links point to personal web profiles.
  3. ArnetMiner Dataset - Contains annotated personal web pages of around 1760 researchers. This is the primary source for most of our analysis as of now.


Our approach is twofold: (a) hypothesis generation and (b) hypothesis confirmation. We consider the following questions:

  1. Do researchers tend to put younger pictures of themselves online?
  2. Do publications indicate a researcher’s traits, such as openness to collaboration, diversity in areas of work, or amount of mentorship?
  3. Is there a correlation between researchers’ age or gender and the color scheme they choose for their online portfolio?
  4. Can we infer some aspects of researchers’ personality and characterize them along the Big Five personality traits based on their web page content?
  5. To what extent do researchers modify their web pages over different periods of time?
  6. What impact do age and gender have on researchers’ professional attributes such as affiliations and research interests?
  7. What patterns emerge from a spatial analysis of researchers’ locations?

Results (so far)

  1. Over the subpopulation of ~800 researchers’ profiles analyzed, people tend to use a profile picture in which they appear approximately 12 years younger than their current age (which is estimated from their education history).
  2. A researcher’s area of expertise has an effect on their Big Five personality traits (specifically openness, conscientiousness, and extraversion).
  3. The older a researcher’s profile, the higher the degree of modification it has undergone.

    Box plot of the year in which a researcher’s web page was created vs. its text similarity to its oldest available version. Older web pages correspond to lower similarity values (implying more modification).
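
The similarity measure behind this plot is not specified here. As a minimal sketch, assuming that a simple character-level ratio from Python's standard `difflib` module stands in for it, the comparison between a page's oldest available text and its current text could look as follows. The function name `text_similarity` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(oldest_text: str, current_text: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means the texts are identical.

    Assumption: SequenceMatcher.ratio() is a stand-in metric here,
    not necessarily the one used in the project.
    """
    # Normalize whitespace so layout-only changes do not count as edits.
    old = " ".join(oldest_text.split())
    new = " ".join(current_text.split())
    return SequenceMatcher(None, old, new).ratio()


# Hypothetical page snapshots, for illustration only.
oldest = "I am a PhD student working on information retrieval."
current = "I am a professor working on information retrieval and NLP."

unchanged = text_similarity(oldest, oldest)
modified = text_similarity(oldest, current)
print(f"unchanged: {unchanged:.2f}, modified: {modified:.2f}")
```

Under this stand-in metric, a lower value means the page has drifted further from its oldest version, which matches the reading of the box plot above.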

  4. Most researchers have 1 or 2 affiliations; a few have more than 8. Across all age groups, more men than women have a high number of affiliations.

    Scatter plot of number of affiliations with respect to age
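
One way to check the comparison between men and women per age group is to count, for each group, the researchers whose number of affiliations reaches some cutoff. The sketch below is illustrative only: the threshold of 3, the decade-wide age groups, the record layout, and the sample records are all assumptions, not project data.

```python
from collections import Counter

# Assumption: what counts as a "high" number of affiliations.
HIGH_AFFILIATION_THRESHOLD = 3


def age_group(age: int) -> str:
    """Bucket an age into a decade label such as '40-49'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def count_high_affiliation(records, threshold=HIGH_AFFILIATION_THRESHOLD):
    """Count researchers at or above the threshold per (age group, gender)."""
    counts = Counter()
    for age, gender, n_affiliations in records:
        if n_affiliations >= threshold:
            counts[(age_group(age), gender)] += 1
    return counts


# Hypothetical records: (age, gender, number of affiliations).
records = [(34, "f", 1), (37, "m", 4), (45, "m", 3), (48, "f", 2), (52, "f", 5)]
counts = count_high_affiliation(records)
for key in sorted(counts):
    print(key, counts[key])
```

Raw counts depend on how many men and women are in each age group, so comparing shares within each group may be the fairer reading when group sizes differ.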

  5. The number of years needed to finish a PhD has gone down over time.

    Age vs. number of years between BS and PhD degrees. A notably large number of researchers aged over 50 took more than 10 years to obtain their PhD.
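
Both the age estimate used for the profile-picture result and the BS-to-PhD duration in this plot can be derived from degree years in the education history. The following is a minimal sketch of such a derivation. The field names, the helper functions, and in particular the assumed age of 22 at bachelor's graduation are illustrative assumptions, not values taken from the project.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: typical age when completing a bachelor's degree.
ASSUMED_AGE_AT_BS = 22


@dataclass
class Education:
    bs_year: Optional[int]   # year the BS degree was awarded
    phd_year: Optional[int]  # year the PhD degree was awarded


def estimate_age(edu: Education, current_year: int) -> Optional[int]:
    """Estimate current age from the BS year; None if the year is missing."""
    if edu.bs_year is None:
        return None
    return current_year - edu.bs_year + ASSUMED_AGE_AT_BS


def years_bs_to_phd(edu: Education) -> Optional[int]:
    """Years between BS and PhD; None if a year is missing or inconsistent."""
    if edu.bs_year is None or edu.phd_year is None:
        return None
    gap = edu.phd_year - edu.bs_year
    return gap if gap >= 0 else None


# Hypothetical record, for illustration only.
edu = Education(bs_year=1990, phd_year=1997)
print(estimate_age(edu, current_year=2020))  # 52
print(years_bs_to_phd(edu))                  # 7
```

Returning `None` for missing or inconsistent years keeps incomplete profiles out of the plot instead of silently distorting it.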