Challenge 4 Fletcher Project

Published: August 20, 2017 by Tian Yu

Categories:
project 5

Project 4 Fletcher Project

A Glance At New York Times Science and Technology News

Story

The generation of technologies is rapidly changing. The hotness in the new technology attracts people and capital, meanwhile, some area is losing attentions, like physics or biology. When Dolly was born, genetic engineering was a hot topic and young generations were willing to study it, however, there were not enough positions to absorb so many new graduates, quite a lot of them need to tune majors to find jobs after graduation. At this time, computer science or data science become popular, it is common to hear that the market needs millions of professionals in data science. But no exception, data scientist some day will have to face the similar situation like biologists now.

With this motivation, I want to read history, see the trend of a frontier technology or science subject in the past decades from New York Times. With more data, like statistics of labor in different position, maybe we can see when a new tech generates, how many unemployment it will bring to the market. Futher with more source or analysis, maybe we can predict the year that AI occupies some positions.

Goal

In this project, I start the journey by finding the keywords abstracting from New York Times leading paragraphs in technology and science yearly from 1945 to 2017. Because the unemployment is related to companies, I did a sample of name entitites of Google using Word2Vec.

Data

Using New York Times Developers API, I got more than 8.35 million daily documents. Each document includes information of publication time, section, keywords, author, leading paragraph and etc., then using MongoDB mask to choose 3.76 million leading paragraph related to science and technology.

Figure 1. The statistics of Sci & Tech News in NYT
p4_stats

At first, I created a mask for section in science & technology, but after I checked the data after mask, I found that before Oct.10, 1980, there was no section of science & technology. As a reslut, I added keywords of science and technology to the mask, so if the document has anything related to science and technology, it will appear in my database.

Approach 1, the wordcloud of science & technology for decades.

Take year 2015 for example, with the science & technology related leading paragraphs, first applied word token, separating paragraphs to words, then with my own stopwords bag, deleting the words that could not represent the frontier technology. After the process, I tried stemming, which is to abstract the stem of the word, like take went, gone and goes all as go. But the attempt to make a word cloud with stemming gave strage result, appearing uncomplete words due to over-stem, besides, counts of word was reduced to 9k compared to 33k after stopwords.

Sci-Tech Leading paragraphs: 2,994
Total words: 157,000
After stopwords: 33,000
After stemming: 9,600
Word2Vect: 14,800

Word Cloud over Time With parameter of LDA, min_df=1, max_df=0.1

Figure 2. Word Cloud from 1945 - 2017.
p4_wordcloud

List of events from word cloud

Year	Key Words	What happend
1945	atomic, war	Still in World War II
1947	war, photographic
1949	wiener	Mr. Wiener: an American mathematician and philosopher, published a book Cybernetics: Or Control and Communication in the Animal and the Machine in 1948.
1950	norbert
1953	battery, storage
1955	aeronautical	Finding way to the moon
1956	enthnological
1958	alamos, computer	Develop the computer technology
1960	cell, rabbits
1961	radio, Schlesinger	Mr. Schlesinger: son and father, American historian, social critic, and public intellectual.
1963	stars
1964	Artucio	Artucio: Uruguayan architect and architectural historian.
1966	mathematics
1970	teleprompter
1971	monoxide
1974	infrared, television
1976	bacteria
1980	hopkins
1982	egg, diary
1983	saturn, computer
1984	particle
1985	comet, disk, binoculars
1986	genetic,biotchnology
1988	Kodak
1991	computer, printer
1993	windows
1994	chip, computer, optic, communications
1995	www, http
1999	mars, simon, HP
2000	fiber, optic, camera, dell, alamos
2001	stem, embryos, cell, amazon, dell
2003	gene
2008	carbon
2009	app, car, telescope, plant, ray
2010	ipod, chip
2011	smart, silicon, dioxide
2012	web, sandy
2014	Albert Einstein, psychology
2016	cloud, car, trump
2017	car, vehicle, Elon Musk, nasa

word2vec on 2015

PCA n_components=2

w2v

Approach 2, clustering comparison.

Clustering Models:

KMeans versus WARD

n_clusters = 6

Figure 3. K-Means Clustering
p4_kmeans

Figure 4. Ward Clustering
p4_ward_section

Possible groups

1: UPenn, NYU, Cornell, Yale, date
2: Navy, NJIT, Evangelist Catholic, locations
3: IPO, Exxon Mobil, Elon Musk, Mark Zuckerberg, gmail
4: biologist, excellence, Zajfman(Physicist), some first names
5: China, Paris, silicon, NASA, Amazon, Uber, Yahoo
6: economics, foundation, father, daughter-in-law

Approach 3, named-entity recognition of Google.

Google relationship extraction on 2017
document: 1215
words: 27,000

Figure 5. Google relationship
p4_google

Reference

https://en.wikipedia.org/wiki/List_of_years_in_science
https://www.codeschool.com/blog/2016/03/25/machine-learning-working-with-stop-words-stemming-and-spam/
https://github.com/smilli/py-corenlp
http://lab.hakim.se/reveal-js/#/