Challenge 4 Fletcher Project

Published: by Creative Commons Licence

A Glance At New York Times Science and Technology News

Story

Technology moves in generations, and each generation changes rapidly. A hot new technology attracts people and capital, while other areas, such as physics or biology, lose attention. When Dolly the sheep was born, genetic engineering was a hot topic and many young people were willing to study it. However, there were not enough positions to absorb so many new graduates, and quite a lot of them had to switch fields to find jobs after graduation. Today computer science and data science are popular, and it is common to hear that the market needs millions of data science professionals. But data scientists are unlikely to be an exception: some day they may face a situation similar to the one biologists face now.

With this motivation, I want to read history and follow the trend of a frontier technology or science subject over the past decades as reported in the New York Times. With more data, such as labor statistics for different positions, we may be able to see how much unemployment a new technology brings to the market when it emerges. Further, with more sources or analysis, we may be able to predict the year in which AI takes over certain positions.

Goal

In this project, I start the journey by extracting keywords from the leading paragraphs of New York Times technology and science articles, year by year, from 1945 to 2017. Because unemployment is related to companies, I also explored the named entities associated with Google, as a sample, using Word2Vec.

Data

Using the New York Times Developers API, I collected more than 8.35 million daily documents. Each document includes the publication time, section, keywords, author, leading paragraph, and so on. I then used a MongoDB mask to select 3.76 million leading paragraphs related to science and technology.


Figure 1. The statistics of Sci & Tech News in NYT
p4_stats

At first, I created a mask on the science & technology section. When I checked the masked data, however, I found that there was no science & technology section before Oct. 10, 1980. As a result, I added science and technology keywords to the mask, so that any document related to science and technology appears in my database.
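The section-or-keyword mask described above can be sketched as a MongoDB-style filter. This is only an illustration under stated assumptions: the field names `section_name` and `keywords` (a list of records with a `value` entry) and the regular expression are guesses at the stored schema, not the project's actual query. The pure-Python `matches` function mirrors the filter so the logic can be checked without a database.

```python
import re

# Assumed field names; adjust them to the schema actually stored in MongoDB.
SECTION_FIELD = "section_name"
KEYWORDS_FIELD = "keywords"

SCI_TECH = re.compile(r"\b(science|technology)\b", re.IGNORECASE)


def build_mask():
    """MongoDB-style filter: the section OR any keyword mentions science/technology."""
    pattern = {"$regex": r"\b(science|technology)\b", "$options": "i"}
    return {"$or": [{SECTION_FIELD: pattern},
                    {KEYWORDS_FIELD + ".value": pattern}]}


def matches(doc):
    """Pure-Python mirror of build_mask() for checking the logic offline."""
    section = doc.get(SECTION_FIELD) or ""
    keywords = [(kw.get("value") or "") for kw in (doc.get(KEYWORDS_FIELD) or [])]
    return bool(SCI_TECH.search(section)) or any(SCI_TECH.search(k) for k in keywords)


# Hypothetical documents: one from before the section existed, one after, one unrelated.
old_doc = {"section_name": None,
           "keywords": [{"name": "subject", "value": "Science and Technology"}]}
new_doc = {"section_name": "Technology", "keywords": []}
other_doc = {"section_name": "Sports",
             "keywords": [{"name": "subject", "value": "Baseball"}]}
selected = [d for d in (old_doc, new_doc, other_doc) if matches(d)]
```

With pymongo, a filter like this would be passed to `collection.find(build_mask())`. Matching on keywords as well as the section is what lets articles from before the section existed pass the mask.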

Approach 1, word clouds of science & technology over the decades.

Take 2015 as an example. Starting from the science & technology leading paragraphs, I first tokenized them, splitting the paragraphs into words, and then used my own stopword list to delete the words that could not represent frontier technology. After that, I tried stemming, which reduces each word to its stem, for example mapping connects, connected and connecting to connect. But the word cloud built from stemmed words gave a strange result: over-stemming produced incomplete words, and the word count dropped to about 9,600, compared with 33,000 after stopword removal.

Sci-Tech Leading paragraphs: 2,994
Total words: 157,000
After stopwords: 33,000
After stemming: 9,600
Word2Vec: 14,800
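The preprocessing steps above (tokenizing, removing stopwords, stemming) can be sketched on a single sentence. This is a minimal illustration, not the project's code: the stopword list is a tiny stand-in for the custom one, and `toy_stem` is a deliberately naive suffix stripper (not the Porter stemmer or any library stemmer) chosen to make over-stemming visible.

```python
import re

# Tiny illustrative stopword list; the project used a larger custom one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on", "new"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(word):
    """Naive suffix stripping, for illustration only."""
    for suffix in ("ational", "ation", "ing", "ers", "er", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


paragraph = "Computers and computing: the computer is powering new computational science."
tokens = tokenize(paragraph)
kept = remove_stopwords(tokens)
stems = [toy_stem(t) for t in kept]

print(len(tokens), len(set(kept)), len(set(stems)))
print(sorted(set(stems)))
```

Four different words collapse to the fragment `comput`, which is not a word a reader would recognize in a word cloud. This is the same kind of effect that made the stemmed word cloud hard to read.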

Word cloud over time, with the LDA parameters min_df=1 and max_df=0.1


Figure 2. Word Cloud from 1945 - 2017.
p4_wordcloud
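The effect of min_df=1 and max_df=0.1 can be sketched in plain Python. This assumes the two parameters are document-frequency thresholds in the scikit-learn vectorizer sense, where an integer is an absolute document count and a float is a proportion of the documents. The function name and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=1, max_df=0.1):
    """Keep terms whose document frequency lies between min_df and max_df.

    An int threshold is an absolute number of documents; a float threshold
    is a proportion of the documents.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return {term for term, count in df.items() if low <= count <= high}


# Ten toy "leading paragraphs", already tokenized.
docs = [["said", "computer"], ["said", "computer"], ["said", "graphene"]]
docs += [["said"] for _ in range(7)]

kept_terms = filter_by_document_frequency(docs, min_df=1, max_df=0.1)
```

With ten documents, max_df=0.1 keeps only terms that appear in at most one document, so rare and specific terms such as `graphene` survive while words spread across many paragraphs are dropped.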

List of events from word cloud

Year | Key words | What happened
1945 | atomic, war | Still in World War II
1947 | war, photographic |
1949 | wiener | Mr. Wiener: an American mathematician and philosopher who published the book Cybernetics: Or Control and Communication in the Animal and the Machine in 1948.
1950 | norbert |
1953 | battery, storage |
1955 | aeronautical | Finding a way to the moon
1956 | ethnological |
1958 | alamos, computer | Development of computer technology
1960 | cell, rabbits |
1961 | radio, Schlesinger | Mr. Schlesinger (father and son): American historian, social critic, and public intellectual.
1963 | stars |
1964 | Artucio | Artucio: Uruguayan architect and architectural historian.
1966 | mathematics |
1970 | teleprompter |
1971 | monoxide |
1974 | infrared, television |
1976 | bacteria |
1980 | hopkins |
1982 | egg, diary |
1983 | saturn, computer |
1984 | particle |
1985 | comet, disk, binoculars |
1986 | genetic, biotechnology |
1988 | Kodak |
1991 | computer, printer |
1993 | windows |
1994 | chip, computer, optic, communications |
1995 | www, http |
1999 | mars, simon, HP |
2000 | fiber, optic, camera, dell, alamos |
2001 | stem, embryos, cell, amazon, dell |
2003 | gene |
2008 | carbon |
2009 | app, car, telescope, plant, ray |
2010 | ipod, chip |
2011 | smart, silicon, dioxide |
2012 | web, sandy |
2014 | Albert Einstein, psychology |
2016 | cloud, car, trump |
2017 | car, vehicle, Elon Musk, nasa |

word2vec on 2015

PCA n_components=2

w2v
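The two-component PCA projection used for the plot can be sketched without any library. This is a minimal illustration: the three-dimensional `toy_vectors` are made up (real Word2Vec vectors have many more dimensions), and the principal components are found with power iteration and deflation, a simple stand-in for the decomposition a PCA library would run.

```python
import math


def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenvector(m, iterations=500):
    """Power iteration: repeatedly apply m and renormalize."""
    v = [float(i + 1) for i in range(len(m))]  # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(2):
        v = top_eigenvector(cov)
        eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
        components.append(v)
        # Deflation: remove the direction just found before the next pass.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return [[sum(a * b for a, b in zip(row, comp)) for comp in components]
            for row in centered]


# Made-up vectors standing in for Word2Vec embeddings.
toy_vectors = {
    "google": [1.0, 1.0, 0.1],
    "yahoo": [0.9, 1.1, 0.0],
    "nasa": [-1.0, -1.0, 0.2],
    "mars": [-1.1, -0.9, 0.1],
}
coords = dict(zip(toy_vectors, pca_2d(list(toy_vectors.values()))))
```

Each word ends up with an (x, y) pair that can be placed on a scatter plot. The sign of each axis is arbitrary, so only the relative positions of the words are meaningful.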

Approach 2, clustering comparison.

Clustering Models:

K-Means versus Ward

n_clusters = 6
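The difference between the two models can be sketched in plain Python. K-means repeatedly reassigns points to the nearest centroid, while Ward clustering starts from single points and keeps merging the pair of clusters whose merge adds the least within-cluster variance. This is a toy illustration, not the project's code: the points are made up, two clusters are used instead of six, and the farthest-point initialization is a deterministic choice made here for reproducibility.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(c) / len(points) for c in zip(*points)]


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def ward(points, k):
    """Naive agglomerative clustering with Ward's merge cost."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a = [points[m] for m in clusters[i]]
                b = [points[m] for m in clusters[j]]
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(a) * len(b) / (len(a) + len(b)) * sq_dist(mean(a), mean(b))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
kmeans_labels = kmeans(points, k=2)
ward_labels = ward(points, k=2)
```

On well-separated toy data both methods recover the same two groups. On real word vectors they can disagree, which is why the two figures are worth comparing side by side.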


Figure 3. K-Means Clustering
p4_kmeans


Figure 4. Ward Clustering
p4_ward_section

Possible groups

1: UPenn, NYU, Cornell, Yale, date
2: Navy, NJIT, Evangelist Catholic, locations
3: IPO, Exxon Mobil, Elon Musk, Mark Zuckerberg, gmail
4: biologist, excellence, Zajfman(Physicist), some first names
5: China, Paris, silicon, NASA, Amazon, Uber, Yahoo
6: economics, foundation, father, daughter-in-law

Approach 3, named-entity recognition for Google.

Google relationship extraction for 2017
Documents: 1,215
Words: 27,000

Figure 5. Google relationship
p4_google

You can see my technical implementations here