



















An analysis of the use of the #Singapore hashtag on Twitter. The study examines the prevalence of the hashtag in tweets related to local events, news, users' current location, landmarks, and social media sites like Instagram. The document also discusses the importance of hashtags in Twitter conversations and their role in indicating the main theme of a tweet.
Running Head: HASHTAG ANALYSIS OF #SINGAPORE
Nanyang Technological University
Author Note
The author wishes to specially thank his PhD supervisors, Professor Schubert Foo and Assistant Professor Natalie Pang, for their helpful insights during the execution of the research study, which was conducted in fulfilment of the Independent Study in Information course.
Correspondence concerning this paper can be sent to Aravind Sesagiri Raamkumar, Wee Kim Wee School of Communication and Information, Nanyang Technological University, 04-39, 31 Nanyang Link, Singapore-637718.
Twitter, as a micro-blogging platform, rose to instant fame mainly due to its minimalist features that allow seamless communication between users. As conversations grew denser and faster, a placeholder feature called hashtags became important because it captured the themes behind the tweets. Prior studies have investigated the conversation dynamics, interplay with other media platforms and communication patterns between users for specific event-based hashtags such as those of the #Occupy movement. Commonplace hashtags that are used on a daily basis have been largely ignored due to their seemingly innocuous presence in tweets and their lack of connection with real-world events. However, it can be postulated that the utility of these hashtags is the main reason behind their continued usage. This study aims to understand the rationale behind the usage of a particular type of commonplace hashtag: location hashtags such as country and city names. Tweets with the hashtag #singapore were extracted for a week’s duration. Manual and automatic tweet classification was performed, along with social network analysis, to identify the underlying themes. Seven themes were identified. Findings indicate that the hashtag is prominent in tweets about local events, local news, users’ current location and landmark-related information sharing. Users who share content from social media sites such as Instagram use the hashtag more prominently than users who post textual content. News agencies, commercial bodies and celebrities use the hashtag more than common individuals. Overall, the results show the non-conversational nature of the hashtag. The findings are to be validated with other country names and cross-validated with hashtag data from other social media platforms.
Keywords: Hashtags; Hashtag Analysis; Hashtag Studies; Tweet Classification; Twitter Studies
interesting to study the rationale behind the usage of country/city hashtags and their contribution to conversations at a holistic level. It is observed that some of the popular and frequently used hashtags refer to the names of places and people^2. A place name can refer to a location, town, city or country. Country names are used most often in this category. These country hashtags are added to show that the tweet’s content is related to the country. This study takes an exploratory approach to understanding the dynamics around #singapore by using Text Classification and Social Network Analysis (SNA) based techniques.
Twitter was launched in 2006 as a microblogging platform that let users share and consume information about day-to-day happenings and opinions on topics. It was a unique product at the time of its introduction due to the character limit on user posts (users may post tweet messages within a 140-character limit). Twitter has become immensely popular (Rank 10 in Alexa web rankings^3, Rank 3 among social networking sites^4), which has led to regional spinoffs such as Sina Weibo^5. Prompted by the dynamics of the interactions on Twitter, academic research on Twitter started in 2008-2009 (Krishnamurthy et al., 2008). Twitter research has been surveyed and summarized in (Cheong & Lee, 2010; Cheong & Ray, 2011; Williams, Terras, & Warwick, 2013). Research has since branched in different directions with varied focus, such as organising information (Sriram, Fuhry, Demir, Ferhatosmanoglu, & Demirbas, 2010), understanding trends and convergence events from a communications perspective (Lin et al., 2012), usage of Twitter data in practical applications (e.g., governments, activism) (Bruns & Burgess, 2011), cross-application of
(^2) http://www.trendinalia.com/twitter-trending-topics/singapore/singapore-131126.html
(^3) http://www.alexa.com/siteinfo/twitter.com
(^4) http://www.socialnetworkingwatch.com/international-social-netw.html
(^5) Sina Weibo http://www.weibo.com/
Twitter data (in cross-platform recommendations (Abel, Herder, Houben, & Henze, 2011)) and a traditional computer-science-oriented focus on information retrieval (Magnani, Montesi, Nunziante, & Rossi, 2011) and semantics (Abel, Celik, Houben, & Siehndel, 2011).
The two high-level entities in Twitter are User and Message (Cheong & Lee, 2010). Recent research has introduced two additional entities, Technology and Concept (the central topic being addressed in the tweet) (Williams et al., 2013). This classification scheme has been used in studying Twitter data. On the topic of tweet^6 classification, past research has identified many categories, which differ based on the method of classification, amount of data, period of data and frame of reference. The categories identified by earlier research studies are presented in Table 1 (a&b). Categories such as News, Information sharing, Events, Opinions and Promotions appear to be common across the schemes. The variation in schemes is mainly due to the vocabulary used for naming the categories and the purpose of classification. There has been a lack of consolidation across studies, except for the work of Dann (2010), where four earlier classification schemes were combined to form a new scheme with six generic categories. It is to be noted that none of these classification attempts used the hashtag as the frame of reference.
Table 1 (a) Classification Schemes from Previous Twitter Studies (2007-2010)
Java et al (2007): Daily chatter; Conversations; URL sharing; News reporting
Jansen et al (2009): Info seeking; Info providing; Comment/Sentiment
Honeycutt & Herring (2009): About addressee; Advertise; Exhort; Info for others; Info for self; Meta-commentary; Media use; Express opinion; Other's experience; Self experience; Solicit info; Other miscellaneous
Pear Analytics (2009): Mainstream News; Spam; Self-promotion of businesses; Babble; Conversations; Pass-along messages (retweets)
Horn (2010): C1: News, Events, Company; C2: Factual, Opinionated
(^6) A tweet is the short post or message posted by a user on Twitter.
The two use-cases are content organisation and community participation. Measures such as relevance, preference, prestige and influence were used as the main features for the machine learning model. A similar, albeit more technically focussed, approach by Tsur & Rappoport (2012) used a larger number of features to predict the spread of ideas (hashtags) in the Twitter environment. Bruns & Burgess (2011) used social network analysis based techniques to study the growth and decline of conversations happening around hashtags at different points of time, and raised the need for a detailed catalogue to better understand the patterns of interaction. Lin et al. (2012) conducted a broader study by analysing 256 hashtags related to the US presidential elections to understand the growth, survival and context of their usage. They put forth a two-way classification of hashtags with the categories ‘Winners’ and ‘Also-rans’ and introduced a theoretical framework to understand the adoption behaviour of user-generated content. As seen from earlier studies, the focus has been largely on event-based hashtags. The dynamics around commonplace hashtags are yet to be explored.
It is apparent from the earlier studies that hashtags play a focal role in directing conversations on Twitter. Exploratory hashtag studies (Pöschko, 2011) have taken a generalized approach by not looking at a particular type of hashtag. In-depth studies on hashtags so far have been based on political events, which are periodic in nature (Bruns & Burgess, 2011; Lin et al., 2012). Therefore, there is a need to explore the dynamics around commonplace hashtags that are used on a regular basis. Hashtags that refer to a place (location) are quite common among trending topics on Twitter, yet not much is known about the rationale behind their usage. In this study, hashtags with country names are studied, particularly the case of #singapore. It is to be noted that in the case of Singapore, the country and city name are the same.
The overarching research question for the current study is: Why do users make use of the hashtag #singapore? The specific research questions are stated below:
RQ1a: What are the categories that represent the tweets, and does the classification scheme built using #singapore as a frame of reference differ from the existing classification schemes, and why?
Justification for RQ1a: The existing tweet classification approaches base their learning mechanism on the whole tweet content. It would be interesting to see whether a classification performed with the hashtag as the frame of reference differs from the existing schemes.
RQ1b: What are the relationships between #singapore and other co-occurring hashtags?
RQ1c: Does the provenance data of the tweets provide any new insights?
RQ1d: What is the communication pattern between users using #singapore?
The Twitter data extraction service TweetArchivist^10 was used to extract data for the hashtag ‘#singapore’ for the period between August 26th and September 1st, 2013. Table 2 provides the tweet count for #singapore for the dates in the selected period, along with the other dates from the complete data extraction period provided by the extraction service. It was observed that the tweet contribution was highest on Fridays and weekends. The sample set originally comprised 20757 tweets for the 7-day period, of which 17798 were considered for the study, as the remaining tweets were not in English.
(^10) TweetArchivist http://www.tweetarchivist.com/
displayed high accuracy, precision and recall based on a comparison of different machine learning algorithms conducted before the actual classification phase. Seven features were used for the classification process: Username, Tweet Text, Tweet Source, User Location, Tweet Hashtags, Tweet Urls and Tweet Url Mentions. The manually coded tweets from the first coder were used as the training set. The statistical programming language R, along with the machine learning library RTextTools^11, was used for the automatic text classification. Content analysis was employed on tweets from the manually coded set. Social Network Analysis (SNA) techniques were used to analyse the hashtags and user-mentions^12 data extracted from the tweets. A directed graph for the user-mentions data and an undirected graph for the hashtags were built using the visualization tool Gephi^13. A modularity-based community detection algorithm (Newman, 2006) was run to identify the underlying communities in the data. The in-degree property was used to filter sparse data in the graphs. The node sizes for the graphs were based on Betweenness Centrality (Wasserman, 1994).
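The graph construction described above can be illustrated with a minimal sketch. It is not the study's pipeline (which used Gephi); it only shows, under assumptions, how an undirected hashtag co-occurrence edge list and a directed user-mention edge list might be derived from raw tweets before being loaded into a graph tool. The function names, the regular expressions and the sample tweets are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Simplified patterns (an assumption): real tweet entities are richer than this.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def hashtag_cooccurrence(tweets):
    """Undirected edge weights {(tag_a, tag_b): count}, with tag_a < tag_b."""
    edges = Counter()
    for text in tweets:
        # Deduplicate and lowercase so '#Singapore' and '#singapore' are one node.
        tags = sorted({t.lower() for t in HASHTAG_RE.findall(text)})
        for a, b in combinations(tags, 2):
            edges[(a, b)] += 1
    return edges


def mention_edges(tweets):
    """Directed edge weights {(author, mentioned_user): count}."""
    edges = Counter()
    for author, text in tweets:
        for target in MENTION_RE.findall(text):
            edges[(author.lower(), target.lower())] += 1
    return edges


def in_degree(edges):
    """Number of distinct users pointing at each node of a directed edge map."""
    degree = Counter()
    for _, target in edges:
        degree[target] += 1
    return degree


# Hypothetical sample data: (author, tweet text).
sample = [
    ("alice", "Lunch at Marina Bay #singapore #food @bob"),
    ("carol", "Sunset by the river #Singapore #instagram @bob"),
    ("bob", "Traffic again #singapore #food"),
]
tag_edges = hashtag_cooccurrence(text for _, text in sample)
user_edges = mention_edges(sample)
indeg = in_degree(user_edges)
```

Edge lists of this shape can be exported to a graph tool, where community detection and centrality measures are then computed; the in-degree counts correspond to the kind of property the study used for filtering sparse nodes.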
Seven composite categories were identified after the manual classification process. The first coder identified 23 subcategories, which were used for coding by the other two coders. The aggregation of the 23 subcategories into the seven categories was performed at a later stage to reduce sparsity in the assignment of categories to the tweets and to facilitate stronger agreement between the coders. The seven categories are Local Events and News (LEN), Current Location and Landmarks (CLL), Asia Related and Unrelated topics (ARU), Commercial Deal (CD), Tourism and Travel Related (TTR), National Identity and Group Reference (NGR) and Personal Events and Rants (PER). Figure 1 provides the count for the
(^11) RTextTools http://www.rtexttools.com/
(^12) In tweets, users can tag other users so as to meaningfully direct the tweet’s content. This feature is called ‘User-mentions’.
(^13) Gephi https://gephi.org/
category assignments by the three coders. The largest differences were seen in the categories NGR and TTR, with the third coder tending to assign more tweets to these categories. Table 3 provides data about the inter-coder agreement between the three coders. The kappa value for the three coders was 0.47, which translates to a moderate level of agreement (Gwet, 2012) between the coders. The agreement was not high due to the presence of multiple themes in a single tweet, which meant the coders had to assign the tweet to the category that they felt to be most appropriate. A smaller number of categories at the start of the coding would also have facilitated better agreement.
Figure 1. Manual Classification – Inter-Coder Agreement
Table 3 Inter-Coder Agreement Stats
Coders   Kappa   Z
A,B,C    0.47    37.
A,B      0.46    20.
B,C      0.44    21.
A,C      0.51    23.
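The text does not state which kappa statistic was computed. As an illustration only, the sketch below assumes Fleiss' kappa, a common choice when more than two coders label every item; the pairwise values in Table 3 may equally have been obtained with Cohen's kappa. The function name and the toy ratings are hypothetical and are not the study's data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labelled by the same number of coders.

    ratings: list of per-item label lists, e.g. [["LEN", "LEN", "CLL"], ...]
    categories: iterable of all category labels in the coding scheme
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        totals.update(counts)
        # Share of coder pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_observed = agreement_sum / n_items
    # Chance agreement from the overall category proportions.
    p_expected = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example with three coders and two categories (hypothetical labels).
toy_ratings = [
    ["LEN", "LEN", "LEN"],
    ["CLL", "CLL", "CLL"],
    ["LEN", "CLL", "LEN"],
    ["CLL", "CLL", "LEN"],
]
kappa = fleiss_kappa(toy_ratings, ["LEN", "CLL"])
```

In the toy example, two items have full agreement and two have a 2-to-1 split, giving an observed agreement of 2/3 against a chance agreement of 1/2, hence a kappa of 1/3.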
The precision, recall and F-measure values for the machine learning algorithms Maximum Entropy (MAXENT) and Support Vector Machines (SVM) used in training are provided in Table A1 in the Appendix. MAXENT was chosen as the final algorithm due to better
[Figure 1 chart: category assignment counts (axis 0 to 250) for Coder A, Coder B and Coder C]
Figure A3 (Appendix) is a line graph depicting the hourly tweet count. The activity is high at the start of business hours (8-9 AM) and reaches its peak after office hours (5- PM). These findings are largely similar to general Twitter traffic, where the bulk of the postings happen in the evening. Figure A4 (Appendix) is a graph correlating the tweet counts with the Twitter trends ranking. For the period of August 20 to September 5, 2013, the graph is used to examine whether the tweet count of either singapore or #singapore has an impact on the ranking of #singapore. The figure shows that #singapore finds its place in the top 20 Twitter trends on a consistent basis with a few days going above the top 30. There is no perceivable relationship between the tweet count and the trend rank, i.e. the rank does not improve with an increased number of tweets. To further understand the causal factors behind the ranking of #singapore, the top 20 Twitter trend keywords along with the corresponding tweet counts need to be extracted and correlated.
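The comparison between tweet counts and trend ranks is described only graphically. One way such a relationship could be quantified is a rank correlation; the sketch below assumes Spearman's coefficient, which is not stated in the text, and the daily values are hypothetical placeholders rather than the study's data.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical daily values, for illustration only.
daily_tweet_counts = [2100, 2500, 2300, 3100, 3400, 2900, 2700]
daily_trend_ranks = [12, 18, 9, 15, 11, 20, 14]
rho = spearman(daily_tweet_counts, daily_trend_ranks)
```

A coefficient near zero would be consistent with the observation that the rank does not improve as the tweet count rises, while a strongly negative value would indicate that higher counts go with better (numerically smaller) ranks.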
RQ1a: What are the categories that represent the tweets, and does the tweet classification scheme built using #singapore as a frame of reference differ from the existing classification schemes, and why?
Local Events and News (LEN)
This theme, with a tweet count of 5140 (28.88%), corresponds to tweets about local events and news that are mainly posted by news agencies and commercial bodies. It is quite evident that news agencies use the hashtag more than any other type of user (user accounts personaSingapore - 509 tweets, sgbroadcast - 309 tweets). There are two reasons for this behaviour: first, an easily relatable hashtag gains the attention of the public; second, the hashtag indicates that the tweet content is to be interpreted within the context of Singapore. The local news category subsumes news about
sports, weather, entertainment and business. The count of these tweets is higher than that of any other category due to the regular tweeting done by commercial bodies and the heavy retweeting by normal users (1755 retweets out of 5140). The news subcategory is prevalent in most of the previous studies (Horn, 2010; Java et al., 2007; Rosa et al., 2011; Sriram et al., 2010), while the events subcategory is seen in (Horn, 2010).
Current Location and Landmark (CLL)
This theme, with a tweet count of 4973 (27.94%), corresponds to tweets about the current location of the user and references to landmarks in the locality. The tweets are mainly visual content shared from social sharing sites such as Instagram (2488 out of 4973 tweets). In fact, the finding that the majority of the tweets in the sample set were posted from Instagram (refer to Table A3 in the Appendix) indicates the popularity of the hashtag among users who frequently take photographs in and around Singapore. This category corresponds to the ‘Me Now’ category from (Naaman et al., 2010). The absence of this category from most of the previous classifications is due to its unique association with a location-based hashtag such as #singapore.
Asia Related and Unrelated topics (ARU)
This theme, with a tweet count of 2822 (15.86%), corresponds to tweets about topics related to other Asian countries and, in some cases, topics that have no relation to #singapore. The spam subcategory plays a major part, with a single spammer contributing 13% of the total tweets. It is seen that certain Asia-Pacific user accounts (Alert_from_Asia, Cherascity) tend to add the hashtag to many tweets regardless of the topic. This theme also covers the tweets that are posted by local news agencies when the topic is related to Asia. The hashtag co-occurrence graph has a specific community (highlighted in brown in Fig A2.1) in which #singapore appears with other Asian country names. These tweets are mostly spam, with no value being added by #singapore.
category does not have peers in the classifications of previous studies due to its specific nature.
National Identity and Group References (NGR)
This theme, with a tweet count of 1214 (6.82%), corresponds to tweets that are posted by users as references to fellow Singaporeans and to convey information or opinion related to the image of Singapore. These are the only tweets that directly address the main topic ‘Singapore’ from either a geographic or geopolitical viewpoint. This category is dominated by normal user accounts, unlike other categories, which are mainly represented by group accounts. Greeting messages (ex: good morning wishes) and socio-cultural messages (ex: “Majority of Singaporeans want slower pace of life...”) are the types of tweets represented by this category. Similar categories from previous studies are Conversational (Dann, 2010), Meta-commentary (Honeycutt & Herring, 2009) and Statements (Naaman et al., 2010).
Personal Events and Rants (PER)
This theme, with a tweet count of 205 (1.15%), corresponds to tweets that are entirely user-specific, referring to a personal event or a personal rant (opinion) about an entity. Frustrations about traffic or personal communication within smaller groups of people are the candidates for this theme. This category has the lowest allocation among all the categories, which indicates how rarely users employ the hashtag for this purpose. However, tweets of this kind are the ones that are extensively studied due to their subjective, user-specific content. A very high percentage of these tweets (48.78%) are sent from the app ‘Twitter for iPhone’, which underlines the usage by individual users.
The seven categories identified during the classification exercise have both similarities and differences with the categories from previous studies. The common categories include News, Current Location, Commercial Deals, Spam and Group References
which shows the pervasive nature of these categories. The novel categories Asia Related, Personal Events, Tourism and National Identity have been newly identified, mainly due to the specific nature of a location hashtag such as #singapore and the use of a hashtag-centric frame of reference. These findings need to be verified by performing a similar exercise with other country and city names. It can be claimed that an all-encompassing classification method needs either to have sub-categories that capture the theme of tweets or to be set at an abstract level.
RQ1b: What are the relationships between #singapore and other co-occurring hashtags?
As a precursor to understanding the relationship between #singapore and other co-occurring hashtags, the importance of #singapore as an independent entity needs to be studied. It is to be noted that there is no restriction on the number of hashtags that can be added to a tweet, barring the 140-character limit. #singapore is the first hashtag in 8087 tweets, which constitutes about 45% of the total tweets in the sample set. The count of tweets containing #singapore by hashtag position is provided in Figure A5 (Appendix), where the chart data follows a power law distribution. These statistics indicate that #singapore plays a primary role in the majority of the tweets, at least based on the positioning of hashtags. SNA was used to understand the relationship between #singapore and other hashtags. The undirected graph (Figure A1 in the Appendix) constructed using hashtags extracted from the tweets shows the presence of six interrelated communities. There are two communities that are related to food (purple and dark green), corresponding to the CLL theme. Three communities are mainly related to locations and Instagram-related tweets (red, blue and light green), corresponding to the themes PER, LEN and CLL. One community is mainly about country names, corresponding to the ARU theme.
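The hashtag-position count reported above (8087 of 17798 tweets, roughly 45%, with #singapore as the first hashtag) can be reproduced in principle with a small routine. The following is a minimal sketch under assumptions: the regular expression, the function names and the sample tweets are hypothetical, and hashtags are compared case-insensitively.

```python
import re
from collections import Counter

# Simplified hashtag pattern (an assumption).
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_position(text, target="singapore"):
    """1-based position of the target among the tweet's hashtags, or None if absent."""
    tags = [t.lower() for t in HASHTAG_RE.findall(text)]
    return tags.index(target) + 1 if target in tags else None


def position_distribution(tweets, target="singapore"):
    """Count tweets by the position at which the target hashtag first appears."""
    positions = Counter()
    for text in tweets:
        pos = hashtag_position(text, target)
        if pos is not None:
            positions[pos] += 1
    return positions


# Hypothetical sample tweets.
sample_tweets = [
    "#Singapore skyline tonight #instagram",
    "Haze update #news #singapore",
    "#singapore #food #hawker",
    "No hashtag here",
]
dist = position_distribution(sample_tweets)
first_share = dist[1] / sum(dist.values())
```

Plotting the resulting counts by position gives the kind of distribution shown in Figure A5.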
The findings from the graph largely corroborate the themes identified during the text classification. It can be claimed that #singapore plays a complementary and meaningful role in the context of co-occurring
retweeting as a major user activity in the sample set, as the aforementioned accounts are heavily retweeted by normal users (refer to Table A5 in the Appendix). This finding can be generalized to a bigger population, as earlier findings in the current study have already established the commercial interest behind the usage of #singapore, unlike event-based hashtags such as #sghaze and #occupy, which are mainly used for conversational purposes. The tweets in the sample set indicate very minimal personal communication using #singapore. An average path length of 3.5 indicates longer distances between nodes, which translates to communication being restricted to small groups.
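For readers unfamiliar with the average path length metric, the following sketch shows one common way of computing it: a breadth-first search from every node of a directed graph, averaging over the ordered node pairs that are actually connected. Graph tools differ in how they treat unreachable pairs, so this is an assumption rather than a description of the study's computation; the account names are hypothetical.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest-path length over ordered node pairs joined by a directed path."""
    total = 0
    pairs = 0
    for source in adjacency:
        distances = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in distances:
                    distances[neighbour] = distances[node] + 1
                    queue.append(neighbour)
        total += sum(distances.values())
        pairs += len(distances) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Hypothetical mention graph: each account maps to the accounts it mentions.
mentions = {
    "news_account": ["reader_a", "reader_b"],
    "reader_a": ["reader_c"],
    "reader_b": [],
    "reader_c": [],
}
apl = average_path_length(mentions)
```

In this toy graph there are four connected ordered pairs with distances 1, 1, 2 and 1, so the average path length is 1.25.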
The results are based on an in-depth analysis of tweets collected for a week’s duration, which is considered appropriate for the current study’s scope. Future studies are planned with tweets collected over a longer duration so that generalization is not an issue. On the technical front, the features used in the automatic tweet classification exercise are deemed basic, as they are direct fields from the Twitter extract. More complex features are to be devised in future studies to improve the accuracy of the classification algorithms.
The objective of the current study was to identify the rationale behind the usage of the hashtag #singapore in tweets. Seven themes/categories were identified as part of a tweet classification exercise. The hashtag is prominent in tweets about local events, local news, users’ current location and landmark-related information sharing. Users who share content from social media sites such as Instagram use the hashtag more prominently than users who post textual content. News agencies, commercial bodies and celebrities use the hashtag more than common individuals. Similarities and differences with existing tweet classifications were identified, along with the justifications for the novel categories. The case for using the hashtag as the frame of reference for classification purposes has also been raised. Owing to the relatively small size of the country, the hashtag continues to be one of the top trends on a regular basis due to commercial elements on Twitter. SNA-based techniques were applied to further supplement the findings from the classification exercise. The current study’s results are to be validated through a similar exercise with different country and city names, as the dynamics related to Singapore might not be applicable to other cities and countries. Cross-media validation is to be performed by extracting similar data from platforms such as Google Plus and Facebook, as the hashtag has become a common feature across most social media platforms. It would be interesting to study whether users make use of commonplace hashtags with similar intents across platforms.
Abel, F., Celik, I., Houben, G., & Siehndel, P. (2011). Leveraging the Semantics of Tweets for Adaptive Faceted Search on Twitter. Proceedings of the 10th International Semantic Web Conference (pp. 1–17).
Abel, F., Herder, E., Houben, G., & Henze, N. (2011). Cross-system User Modeling and Personalization on the Social Web. User Modeling and User-Adapted Interaction, 22(3), 1–42.
Benevenuto, F., Magno, G., Rodrigues, T., & Almeida, V. (2010). Detecting Spammers on Twitter. Collaboration, electronic messaging, anti-abuse and spam conference (CEAS).
Bruns, A., & Burgess, J. (2011). The Use of Twitter Hashtags in the Formation of Ad Hoc Publics (pp. 25–27). Retrieved from http://eprints.qut.edu.au/46515/1/The_Use_of_Twitter_Hashtags_in_the_Formation_of_Ad_Hoc_Publics_(final).pdf
Cheong, M., & Lee, V. (2010). Dissecting Twitter: A Review on Current Microblogging Research and Lessons from Related Fields. From Sociology to Computing in Social Networks (pp. 343–362). Springer Vienna.
Cheong, M., & Ray, S. (2011). A Literature Review of Recent Microblogging Developments (pp. 1–43). Retrieved from http://www.csse.monash.edu.au/publications/2011/tr-2011-263-full.pdf
Dann, S. (2010). Twitter content classification. First Monday, 15(12). Retrieved from http://firstmonday.org/ojs/index.php/fm/article/view/2745/2681