SQL, Pandas, 関数型言語におけるそれぞれの集計・分析方法の比較と類似
乱立するデータ集計技術
ビッグデータだの、人工知能だのバズワードが様々に叫ばれていますが、今でも主流はエクセルで分析しているということを聞いたりします。
エクセルを超えた量のデータを扱う時には綺麗に成型されている時に限り、SQLなどが使えますが、データが汚かったり、膨大な量になってくると関数型言語からヒントを得たMapReduce系の集計フレームワークが使われます。
また、さすがにVBScriptでデータ集計したくないという人もそれなりにいて、Excelのデータをpandasに投入して必要なデータを集計している人もいます。
昔からデータを貯める、分析するということを生業にしていた人は、ビッグデータの考えをそのままSQLを適応して、SQLで集計したりしています。
色々な集計方法が出てきた今、改めて、主に私の周りで使われている集計技術(アドテクの一部の領域)について、それぞれが確かに互換性能があることを示したいと思います。
今回評価する方法
以下の三つを評価します
最も昔からあるデータ構造の管理の形だと思います。
私も学生時代はPHP + MySQL + Perl(or Ruby)でのWebシステムを作ったことがあります。
メリットとしてはSQLは使える人が多く、システム屋さんも知識的に保有していることが多いので、人の流動性が確保しやすいという印象があります。
デメリットとしては、データが正規化されている必要があるので、実験的なデータを取り込むのが弱く、アドホックでスピード感がある実験を繰り返せる環境にはあまり向いていないです(実体験ベース)
Pandas
直近でも経験したのですが、Excelの膨大なデータを渡されて分析してと投げられることがあります。
私は、ExcelをCSVにダンプしてPythonやRubyに投入して分析することしていたのですが、当時同じメンバーがExcel VB Macroで分析するということをやっており、[Pythonでの機械学習のアウトプット]-> [Excel]の繋ぎに不便を感じていました
今年ぐらいから、周りのデータサイエンス業界と、Kaggleの強いツール(Kaggleで多く使われているツールはよく流行るので参考になります)という角度で、Excelでデータをもらっても、基本的に、PythonのPandasで処理することになりました。
メリットとしては、Pythonはそれなりに使える人が多いし、Excelをこねくり回すよりだいぶマシで、SQLとやりたいことは同じことができます
デメリットとしては、Pandasの作り自体が独自のエコシステムを形成しており、SQLや関数型の考えとも微妙に異なっており、独自のポジジョンにいることで学習コストがかさみがちです
MapReduceのアーキテクチャが提案されてから、どんなに巨大なデータでも、この仕組みの上に載せることで、集計が可能になりました。
関数型言語のmap関数とreduce関数にヒントを得て作られており、畳み込み(fold)に該当する操作を行うReduceのリソースをうまく分散処理させられれば、どんなデータでも処理可能です。
メリットとしては、プログラミングをして集計をするので、かなり柔軟であり、複雑怪奇な非構造化データでも処理可能です。
デメリットとしては、プログラムをかけないレベルの人は集計するのも困難になることが挙げられます。
互換性を示す
圏論などを用いて、数式上で、データの操作が等価であることを示してもいいのですが、今回は、dataquest.ioさんが一般的にSQLを用いて基礎のデータサイエンティストが集計できるべき角度のデータとそのクエリをその教育プランで行なっており、それを他のデータ集計方法でも行えることを示すことで、基礎的な側面においては、等価であることを示したいと考えています
(よろしければ、dataquest.ioさんのコンテンツも見てみてください。基礎的な側面においては、充実している印象があります。)
使用するデータ
jobs.db
sqlite形式で保存したデータベースです。そんなに大きいデータでないので、これを今回、集計してきます
大学の専攻で、就職率、就職した先、人数、女性の比率などが記されています。
sqlite> PRAGMA table_info(recent_grads);
0|index|INTEGER|0||0
1|Rank|INTEGER|0||0
2|Major_code|INTEGER|0||0
3|Major|TEXT|0||0
4|Major_category|TEXT|0||0
5|Total|INTEGER|0||0
6|Sample_size|INTEGER|0||0
7|Men|INTEGER|0||0
8|Women|INTEGER|0||0
9|ShareWomen|REAL|0||0
10|Employed|INTEGER|0||0
11|Full_time|INTEGER|0||0
12|Part_time|INTEGER|0||0
13|Full_time_year_round|INTEGER|0||0
14|Unemployed|INTEGER|0||0
15|Unemployment_rate|REAL|0||0
16|Median|INTEGER|0||0
17|P25th|INTEGER|0||0
18|P75th|INTEGER|0||0
19|College_jobs|INTEGER|0||0
20|Non_college_jobs|INTEGER|0||0
21|Low_wage_jobs|INTEGER|0||0
factbook.db
sqlite形式で保存したデータベースです
主に、各国の人口や出生率など、国力を表すKPIが多く記されています
テーブルの構造はこのようになっています
sqlite> PRAGMA table_info(facts);
0|id|INTEGER|1||1
1|code|varchar(255)|1||0
2|name|varchar(255)|1||0
3|area|integer|0||0
4|area_land|integer|0||0
5|area_water|integer|0||0
6|population|integer|0||0
7|population_growth|float|0||0
8|birth_rate|float|0||0
9|death_rate|float|0||0
10|migration_rate|float|0||0
11|created_at|datetime|0||0
12|updated_at|datetime|0||0
データが少ないので、githubで管理しています。
jupyter notebook上での実行です
一つのSQLのクエリとして知っておくべき最小での粒度での、例をSQL, Pandas, 関数型の順で示します
本当は関数型はKotlinで書きたかったのですが、Jupyter上でPythonとKotlinを両方一つのノートブックで使う方法がわからなかったので、Rubyで書いています(Rubyは別に関数型言語ではないですが、SyntaxはGoogle Cloud DataFlowやSparkなどに似せられるので、そのように書きました。)
%load_ext sql
%sql sqlite:////var/jobs.db
%sql SELECT * FROM recent_grads LIMIT 3;
index |
Rank |
Major_code |
Major |
Major_category |
Total |
Sample_size |
Men |
Women |
ShareWomen |
Employed |
Full_time |
Part_time |
Full_time_year_round |
Unemployed |
Unemployment_rate |
Median |
P25th |
P75th |
College_jobs |
Non_college_jobs |
Low_wage_jobs |
0 |
1 |
2419 |
PETROLEUM ENGINEERING |
Engineering |
2339 |
36 |
2057 |
282 |
0.120564344 |
1976 |
1849 |
270 |
1207 |
37 |
0.018380527 |
110000 |
95000 |
125000 |
1534 |
364 |
193 |
1 |
2 |
2416 |
MINING AND MINERAL ENGINEERING |
Engineering |
756 |
7 |
679 |
77 |
0.10185185199999999 |
640 |
556 |
170 |
388 |
85 |
0.117241379 |
75000 |
55000 |
90000 |
350 |
257 |
50 |
2 |
3 |
2415 |
METALLURGICAL ENGINEERING |
Engineering |
856 |
3 |
725 |
131 |
0.153037383 |
648 |
558 |
133 |
340 |
16 |
0.024096386 |
73000 |
50000 |
105000 |
456 |
176 |
0 |
専攻(Major)列に限定して、10行取り出す
%sql SELECT Rank, Major FROM recent_grads LIMIT 10;
Rank |
Major |
1 |
PETROLEUM ENGINEERING |
2 |
MINING AND MINERAL ENGINEERING |
3 |
METALLURGICAL ENGINEERING |
4 |
NAVAL ARCHITECTURE AND MARINE ENGINEERING |
5 |
CHEMICAL ENGINEERING |
6 |
NUCLEAR ENGINEERING |
7 |
ACTUARIAL SCIENCE |
8 |
ASTRONOMY AND ASTROPHYSICS |
9 |
MECHANICAL ENGINEERING |
10 |
ELECTRICAL ENGINEERING |
import sqlite3
import pandas as pd
conn = sqlite3.connect("/var/jobs.db")
df = pd.read_sql_query("SELECT * FROM recent_grads ;", conn)
rank_major = df[["Rank", "Major"]]
rank_major.head(10)
|
Rank |
Major |
0 |
1 |
PETROLEUM ENGINEERING |
1 |
2 |
MINING AND MINERAL ENGINEERING |
2 |
3 |
METALLURGICAL ENGINEERING |
3 |
4 |
NAVAL ARCHITECTURE AND MARINE ENGINEERING |
4 |
5 |
CHEMICAL ENGINEERING |
5 |
6 |
NUCLEAR ENGINEERING |
6 |
7 |
ACTUARIAL SCIENCE |
7 |
8 |
ASTRONOMY AND ASTROPHYSICS |
8 |
9 |
MECHANICAL ENGINEERING |
9 |
10 |
ELECTRICAL ENGINEERING |
%%ruby
## Jupyter notebookでrubyで記述するとこのように表現できる(毎回読み込む必要がある)
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs|
cols.zip(xs).to_h
}&.map { |xs|
[xs["Rank"], xs["Major"]]
}&.slice(0..9).map { |xs|
p xs
}
[1, "PETROLEUM ENGINEERING"]
[2, "MINING AND MINERAL ENGINEERING"]
[3, "METALLURGICAL ENGINEERING"]
[4, "NAVAL ARCHITECTURE AND MARINE ENGINEERING"]
[5, "CHEMICAL ENGINEERING"]
[6, "NUCLEAR ENGINEERING"]
[7, "ACTUARIAL SCIENCE"]
[8, "ASTRONOMY AND ASTROPHYSICS"]
[9, "MECHANICAL ENGINEERING"]
[10, "ELECTRICAL ENGINEERING"]
Rank,Major_code,Major,Major_category,Total列に限定して、20件の行を取り出す
%sql SELECT Rank,Major_code,Major,Major_category,Total FROM recent_grads limit 20;
Rank |
Major_code |
Major |
Major_category |
Total |
1 |
2419 |
PETROLEUM ENGINEERING |
Engineering |
2339 |
2 |
2416 |
MINING AND MINERAL ENGINEERING |
Engineering |
756 |
3 |
2415 |
METALLURGICAL ENGINEERING |
Engineering |
856 |
4 |
2417 |
NAVAL ARCHITECTURE AND MARINE ENGINEERING |
Engineering |
1258 |
5 |
2405 |
CHEMICAL ENGINEERING |
Engineering |
32260 |
6 |
2418 |
NUCLEAR ENGINEERING |
Engineering |
2573 |
7 |
6202 |
ACTUARIAL SCIENCE |
Business |
3777 |
8 |
5001 |
ASTRONOMY AND ASTROPHYSICS |
Physical Sciences |
1792 |
9 |
2414 |
MECHANICAL ENGINEERING |
Engineering |
91227 |
10 |
2408 |
ELECTRICAL ENGINEERING |
Engineering |
81527 |
11 |
2407 |
COMPUTER ENGINEERING |
Engineering |
41542 |
12 |
2401 |
AEROSPACE ENGINEERING |
Engineering |
15058 |
13 |
2404 |
BIOMEDICAL ENGINEERING |
Engineering |
14955 |
14 |
5008 |
MATERIALS SCIENCE |
Engineering |
4279 |
15 |
2409 |
ENGINEERING MECHANICS PHYSICS AND SCIENCE |
Engineering |
4321 |
16 |
2402 |
BIOLOGICAL ENGINEERING |
Engineering |
8925 |
17 |
2412 |
INDUSTRIAL AND MANUFACTURING ENGINEERING |
Engineering |
18968 |
18 |
2400 |
GENERAL ENGINEERING |
Engineering |
61152 |
19 |
2403 |
ARCHITECTURAL ENGINEERING |
Engineering |
2825 |
20 |
3201 |
COURT REPORTING |
Law & Public Policy |
1148 |
step2 = df[["Rank","Major_code","Major","Major_category","Total"]]
step2.head(20)
|
Rank |
Major_code |
Major |
Major_category |
Total |
0 |
1 |
2419 |
PETROLEUM ENGINEERING |
Engineering |
2339 |
1 |
2 |
2416 |
MINING AND MINERAL ENGINEERING |
Engineering |
756 |
2 |
3 |
2415 |
METALLURGICAL ENGINEERING |
Engineering |
856 |
3 |
4 |
2417 |
NAVAL ARCHITECTURE AND MARINE ENGINEERING |
Engineering |
1258 |
4 |
5 |
2405 |
CHEMICAL ENGINEERING |
Engineering |
32260 |
5 |
6 |
2418 |
NUCLEAR ENGINEERING |
Engineering |
2573 |
6 |
7 |
6202 |
ACTUARIAL SCIENCE |
Business |
3777 |
7 |
8 |
5001 |
ASTRONOMY AND ASTROPHYSICS |
Physical Sciences |
1792 |
8 |
9 |
2414 |
MECHANICAL ENGINEERING |
Engineering |
91227 |
9 |
10 |
2408 |
ELECTRICAL ENGINEERING |
Engineering |
81527 |
10 |
11 |
2407 |
COMPUTER ENGINEERING |
Engineering |
41542 |
11 |
12 |
2401 |
AEROSPACE ENGINEERING |
Engineering |
15058 |
12 |
13 |
2404 |
BIOMEDICAL ENGINEERING |
Engineering |
14955 |
13 |
14 |
5008 |
MATERIALS SCIENCE |
Engineering |
4279 |
14 |
15 |
2409 |
ENGINEERING MECHANICS PHYSICS AND SCIENCE |
Engineering |
4321 |
15 |
16 |
2402 |
BIOLOGICAL ENGINEERING |
Engineering |
8925 |
16 |
17 |
2412 |
INDUSTRIAL AND MANUFACTURING ENGINEERING |
Engineering |
18968 |
17 |
18 |
2400 |
GENERAL ENGINEERING |
Engineering |
61152 |
18 |
19 |
2403 |
ARCHITECTURAL ENGINEERING |
Engineering |
2825 |
19 |
20 |
3201 |
COURT REPORTING |
Law & Public Policy |
1148 |
女性率が0.5(50%)を超える専攻と女性率の20件の行を表示する
%sql SELECT Major,ShareWomen FROM recent_grads WHERE ShareWomen>0.5 limit 20;
Major |
ShareWomen |
ACTUARIAL SCIENCE |
0.535714286 |
COMPUTER SCIENCE |
0.578766338 |
ENVIRONMENTAL ENGINEERING |
0.558548009 |
NURSING |
0.896018988 |
INDUSTRIAL PRODUCTION TECHNOLOGIES |
0.75047259 |
COMPUTER AND INFORMATION SYSTEMS |
0.7077185020000001 |
INFORMATION SCIENCES |
0.526475764 |
APPLIED MATHEMATICS |
0.75392736 |
PHARMACOLOGY |
0.524152583 |
OCEANOGRAPHY |
0.688999173 |
MATHEMATICS AND COMPUTER SCIENCE |
0.927807246 |
COGNITIVE SCIENCE AND BIOPSYCHOLOGY |
0.854523227 |
SCHOOL STUDENT COUNSELING |
0.56486557 |
INTERNATIONAL RELATIONS |
0.632986838 |
AGRICULTURE PRODUCTION AND MANAGEMENT |
0.59420765 |
GENERAL AGRICULTURE |
0.515543329 |
GENETICS |
0.643331121 |
MISCELLANEOUS SOCIAL SCIENCES |
0.5434054220000001 |
UNITED STATES HISTORY |
0.6307163179999999 |
AGRICULTURAL ECONOMICS |
0.589711902 |
major_sharewomen = df[lambda df:df["ShareWomen"]>0.5][["Major","ShareWomen"]]
major_sharewomen.head(20)
|
Major |
ShareWomen |
6 |
ACTUARIAL SCIENCE |
0.535714 |
20 |
COMPUTER SCIENCE |
0.578766 |
30 |
ENVIRONMENTAL ENGINEERING |
0.558548 |
34 |
NURSING |
0.896019 |
38 |
INDUSTRIAL PRODUCTION TECHNOLOGIES |
0.750473 |
42 |
COMPUTER AND INFORMATION SYSTEMS |
0.707719 |
45 |
INFORMATION SCIENCES |
0.526476 |
47 |
APPLIED MATHEMATICS |
0.753927 |
48 |
PHARMACOLOGY |
0.524153 |
49 |
OCEANOGRAPHY |
0.688999 |
52 |
MATHEMATICS AND COMPUTER SCIENCE |
0.927807 |
54 |
COGNITIVE SCIENCE AND BIOPSYCHOLOGY |
0.854523 |
55 |
SCHOOL STUDENT COUNSELING |
0.564866 |
56 |
INTERNATIONAL RELATIONS |
0.632987 |
63 |
AGRICULTURE PRODUCTION AND MANAGEMENT |
0.594208 |
64 |
GENERAL AGRICULTURE |
0.515543 |
67 |
GENETICS |
0.643331 |
68 |
MISCELLANEOUS SOCIAL SCIENCES |
0.543405 |
69 |
UNITED STATES HISTORY |
0.630716 |
71 |
AGRICULTURAL ECONOMICS |
0.589712 |
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs|
cols.zip(xs).to_h
}.select { |xs|
xs["ShareWomen"] > 0.5
}.map { |xs|
[xs["Rank"], xs["Major"]]
}.slice(0..19).map { |xs|
p xs
}
[7, "ACTUARIAL SCIENCE"]
[21, "COMPUTER SCIENCE"]
[31, "ENVIRONMENTAL ENGINEERING"]
[35, "NURSING"]
[39, "INDUSTRIAL PRODUCTION TECHNOLOGIES"]
[43, "COMPUTER AND INFORMATION SYSTEMS"]
[46, "INFORMATION SCIENCES"]
[48, "APPLIED MATHEMATICS"]
[49, "PHARMACOLOGY"]
[50, "OCEANOGRAPHY"]
[53, "MATHEMATICS AND COMPUTER SCIENCE"]
[55, "COGNITIVE SCIENCE AND BIOPSYCHOLOGY"]
[56, "SCHOOL STUDENT COUNSELING"]
[57, "INTERNATIONAL RELATIONS"]
[64, "AGRICULTURE PRODUCTION AND MANAGEMENT"]
[65, "GENERAL AGRICULTURE"]
[68, "GENETICS"]
[69, "MISCELLANEOUS SOCIAL SCIENCES"]
[70, "UNITED STATES HISTORY"]
[72, "AGRICULTURAL ECONOMICS"]
就職者数が10000人を超える専攻と就職者数を10件の行を表示する
%sql SELECT Major,Employed FROM recent_grads WHERE Employed > 10000 limit 10;
Major |
Employed |
CHEMICAL ENGINEERING |
25694 |
MECHANICAL ENGINEERING |
76442 |
ELECTRICAL ENGINEERING |
61928 |
COMPUTER ENGINEERING |
32506 |
AEROSPACE ENGINEERING |
11391 |
BIOMEDICAL ENGINEERING |
10047 |
INDUSTRIAL AND MANUFACTURING ENGINEERING |
15604 |
GENERAL ENGINEERING |
44931 |
COMPUTER SCIENCE |
102087 |
MANAGEMENT INFORMATION SYSTEMS AND STATISTICS |
16413 |
major_employed = df[lambda df:df["Employed"]>10000][["Major","Employed"]]
major_employed.head(10)
|
Major |
Employed |
4 |
CHEMICAL ENGINEERING |
25694 |
8 |
MECHANICAL ENGINEERING |
76442 |
9 |
ELECTRICAL ENGINEERING |
61928 |
10 |
COMPUTER ENGINEERING |
32506 |
11 |
AEROSPACE ENGINEERING |
11391 |
12 |
BIOMEDICAL ENGINEERING |
10047 |
16 |
INDUSTRIAL AND MANUFACTURING ENGINEERING |
15604 |
17 |
GENERAL ENGINEERING |
44931 |
20 |
COMPUTER SCIENCE |
102087 |
24 |
MANAGEMENT INFORMATION SYSTEMS AND STATISTICS |
16413 |
女性率が50%を超えて、かつ従業員が10000人を超える専攻を10行取り出す
%sql SELECT Major,ShareWomen,Employed FROM recent_grads WHERE ShareWomen>0.5 AND Employed>10000 LIMIT 10;
Major |
ShareWomen |
Employed |
COMPUTER SCIENCE |
0.578766338 |
102087 |
NURSING |
0.896018988 |
180903 |
COMPUTER AND INFORMATION SYSTEMS |
0.7077185020000001 |
28459 |
INTERNATIONAL RELATIONS |
0.632986838 |
21190 |
AGRICULTURE PRODUCTION AND MANAGEMENT |
0.59420765 |
12323 |
CHEMISTRY |
0.5051405379999999 |
48535 |
BUSINESS MANAGEMENT AND ADMINISTRATION |
0.580948004 |
276234 |
BIOCHEMICAL SCIENCES |
0.515406449 |
25678 |
HUMAN RESOURCES AND PERSONNEL MANAGEMENT |
0.672161443 |
20760 |
MISCELLANEOUS HEALTH MEDICAL PROFESSIONS |
0.702020202 |
10076 |
triple = df[lambda df: (df["Employed"]>10000) & (df["ShareWomen"] > 0.5) ][["Major","ShareWomen","Employed"]]
triple.head(10)
|
Major |
ShareWomen |
Employed |
20 |
COMPUTER SCIENCE |
0.578766 |
102087 |
34 |
NURSING |
0.896019 |
180903 |
42 |
COMPUTER AND INFORMATION SYSTEMS |
0.707719 |
28459 |
56 |
INTERNATIONAL RELATIONS |
0.632987 |
21190 |
63 |
AGRICULTURE PRODUCTION AND MANAGEMENT |
0.594208 |
12323 |
74 |
CHEMISTRY |
0.505141 |
48535 |
76 |
BUSINESS MANAGEMENT AND ADMINISTRATION |
0.580948 |
276234 |
82 |
BIOCHEMICAL SCIENCES |
0.515406 |
25678 |
86 |
HUMAN RESOURCES AND PERSONNEL MANAGEMENT |
0.672161 |
20760 |
88 |
MISCELLANEOUS HEALTH MEDICAL PROFESSIONS |
0.702020 |
10076 |
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs|
cols.zip(xs).to_h
}.select { |xs|
xs["ShareWomen"] > 0.5 && xs["Employed"] > 10000
}.map { |xs|
["Major","ShareWomen","Employed"].map { |x| xs[x]}
}&.slice(0..9).map { |xs|
p xs
}
["COMPUTER SCIENCE", 0.578766338, 102087]
["NURSING", 0.896018988, 180903]
["COMPUTER AND INFORMATION SYSTEMS", 0.7077185020000001, 28459]
["INTERNATIONAL RELATIONS", 0.632986838, 21190]
["AGRICULTURE PRODUCTION AND MANAGEMENT", 0.59420765, 12323]
["CHEMISTRY", 0.5051405379999999, 48535]
["BUSINESS MANAGEMENT AND ADMINISTRATION", 0.580948004, 276234]
["BIOCHEMICAL SCIENCES", 0.515406449, 25678]
["HUMAN RESOURCES AND PERSONNEL MANAGEMENT", 0.672161443, 20760]
["MISCELLANEOUS HEALTH MEDICAL PROFESSIONS", 0.702020202, 10076]
女性率が50%を超え、非雇用率が5.1%未満の"専攻"と"専攻のカテゴリ"について、10行取り出す
%sql SELECT Major, Major_category, ShareWomen, Unemployment_rate FROM recent_grads where (Major_category = 'Engineering') AND (ShareWomen > 0.5 or Unemployment_rate < 0.051) LIMIT 10;
Major |
Major_category |
ShareWomen |
Unemployment_rate |
PETROLEUM ENGINEERING |
Engineering |
0.120564344 |
0.018380527 |
METALLURGICAL ENGINEERING |
Engineering |
0.153037383 |
0.024096386 |
NAVAL ARCHITECTURE AND MARINE ENGINEERING |
Engineering |
0.107313196 |
0.050125313 |
MATERIALS SCIENCE |
Engineering |
0.310820285 |
0.023042836 |
ENGINEERING MECHANICS PHYSICS AND SCIENCE |
Engineering |
0.183985189 |
0.006334343 |
INDUSTRIAL AND MANUFACTURING ENGINEERING |
Engineering |
0.34347321799999997 |
0.042875544 |
MATERIALS ENGINEERING AND MATERIALS SCIENCE |
Engineering |
0.292607004 |
0.027788805 |
ENVIRONMENTAL ENGINEERING |
Engineering |
0.558548009 |
0.093588575 |
INDUSTRIAL PRODUCTION TECHNOLOGIES |
Engineering |
0.75047259 |
0.028308097 |
ENGINEERING AND INDUSTRIAL MANAGEMENT |
Engineering |
0.174122505 |
0.03365166 |
triple = df[lambda df:(df["Major_category"] == 'Engineering') & ((df["ShareWomen"] > 0.5) | (df["Unemployment_rate"] < 0.051)) ][["Major", "Major_category", "ShareWomen", "Unemployment_rate"]]
triple.head(10)
|
Major |
Major_category |
ShareWomen |
Unemployment_rate |
0 |
PETROLEUM ENGINEERING |
Engineering |
0.120564 |
0.018381 |
2 |
METALLURGICAL ENGINEERING |
Engineering |
0.153037 |
0.024096 |
3 |
NAVAL ARCHITECTURE AND MARINE ENGINEERING |
Engineering |
0.107313 |
0.050125 |
13 |
MATERIALS SCIENCE |
Engineering |
0.310820 |
0.023043 |
14 |
ENGINEERING MECHANICS PHYSICS AND SCIENCE |
Engineering |
0.183985 |
0.006334 |
16 |
INDUSTRIAL AND MANUFACTURING ENGINEERING |
Engineering |
0.343473 |
0.042876 |
23 |
MATERIALS ENGINEERING AND MATERIALS SCIENCE |
Engineering |
0.292607 |
0.027789 |
30 |
ENVIRONMENTAL ENGINEERING |
Engineering |
0.558548 |
0.093589 |
38 |
INDUSTRIAL PRODUCTION TECHNOLOGIES |
Engineering |
0.750473 |
0.028308 |
50 |
ENGINEERING AND INDUSTRIAL MANAGEMENT |
Engineering |
0.174123 |
0.033652 |
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs|
cols.zip(xs).to_h
}.select { |xs|
xs["Major_category"] == 'Engineering' && (xs["ShareWomen"] > 0.5 || xs["Unemployment_rate"] < 0.051)
}.map { |xs|
["Major", "Major_category", "ShareWomen", "Unemployment_rate"].map { |x| xs[x]}
}&.slice(0..9).map { |xs|
p xs
}
["PETROLEUM ENGINEERING", "Engineering", 0.120564344, 0.018380527]
["METALLURGICAL ENGINEERING", "Engineering", 0.153037383, 0.024096386]
["NAVAL ARCHITECTURE AND MARINE ENGINEERING", "Engineering", 0.107313196, 0.050125313]
["MATERIALS SCIENCE", "Engineering", 0.310820285, 0.023042836]
["ENGINEERING MECHANICS PHYSICS AND SCIENCE", "Engineering", 0.183985189, 0.006334343]
["INDUSTRIAL AND MANUFACTURING ENGINEERING", "Engineering", 0.34347321799999997, 0.042875544]
["MATERIALS ENGINEERING AND MATERIALS SCIENCE", "Engineering", 0.292607004, 0.027788805]
["ENVIRONMENTAL ENGINEERING", "Engineering", 0.558548009, 0.093588575]
["INDUSTRIAL PRODUCTION TECHNOLOGIES", "Engineering", 0.75047259, 0.028308097]
["ENGINEERING AND INDUSTRIAL MANAGEMENT", "Engineering", 0.174122505, 0.03365166]
専攻のカテゴリが”ビジネス”か”芸術”か”ヘルス”で、就職者が20000人を超えているか非雇用率が5.1%以下で、専攻、専攻カテゴリ、就職者数、非就職者を10行知りたい
%sql SELECT Major, Major_category, Employed, Unemployment_rate \
FROM recent_grads \
WHERE (Major_category = 'Business' OR Major_category = 'Arts' OR Major_category = 'Health') \
AND (Employed > 20000 OR Unemployment_rate < 0.051) \
LIMIT 10;
Major |
Major_category |
Employed |
Unemployment_rate |
OPERATIONS LOGISTICS AND E-COMMERCE |
Business |
10027 |
0.047858702999999995 |
NURSING |
Health |
180903 |
0.04486272400000001 |
FINANCE |
Business |
145696 |
0.060686356 |
ACCOUNTING |
Business |
165527 |
0.069749014 |
MEDICAL TECHNOLOGIES TECHNICIANS |
Health |
13150 |
0.03698279 |
MEDICAL ASSISTING SERVICES |
Health |
9168 |
0.042506527 |
GENERAL BUSINESS |
Business |
190183 |
0.072861468 |
BUSINESS MANAGEMENT AND ADMINISTRATION |
Business |
276234 |
0.07221834099999999 |
MARKETING AND MARKETING RESEARCH |
Business |
178862 |
0.061215064000000007 |
HUMAN RESOURCES AND PERSONNEL MANAGEMENT |
Business |
20760 |
0.059569649 |
def filt(df):
res = ((df['Major_category'] == 'Business' ) | (df['Major_category'] == 'Arts') | (df['Major_category'] == 'Health')) & \
( (df["Employed"] > 20000 ) | (df["Unemployment_rate"] < 0.051) )
return res
quad = df[ filt(df) ][["Major", "Major_category", "Employed", "Unemployment_rate"]]
quad.head(10)
|
Major |
Major_category |
Employed |
Unemployment_rate |
27 |
OPERATIONS LOGISTICS AND E-COMMERCE |
Business |
10027 |
0.047859 |
34 |
NURSING |
Health |
180903 |
0.044863 |
35 |
FINANCE |
Business |
145696 |
0.060686 |
40 |
ACCOUNTING |
Business |
165527 |
0.069749 |
44 |
MEDICAL TECHNOLOGIES TECHNICIANS |
Health |
13150 |
0.036983 |
51 |
MEDICAL ASSISTING SERVICES |
Health |
9168 |
0.042507 |
57 |
GENERAL BUSINESS |
Business |
190183 |
0.072861 |
76 |
BUSINESS MANAGEMENT AND ADMINISTRATION |
Business |
276234 |
0.072218 |
77 |
MARKETING AND MARKETING RESEARCH |
Business |
178862 |
0.061215 |
86 |
HUMAN RESOURCES AND PERSONNEL MANAGEMENT |
Business |
20760 |
0.059570 |
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs|
cols.zip(xs).to_h
}.select { |xs|
(xs["Major_category"] == 'Engineering' || xs["Major_category"] == 'Arts' || xs["Major_category"] == 'Health' ) && (xs["Employed"] > 20000 || xs["Unemployment_rate"] < 0.051)
}.map { |xs|
["Major", "Major_category", "Employed", "Unemployment_rate"].map { |x| xs[x]}
}.slice(0..9).map { |xs|
p xs
}
["PETROLEUM ENGINEERING", "Engineering", 1976, 0.018380527]
["METALLURGICAL ENGINEERING", "Engineering", 648, 0.024096386]
["NAVAL ARCHITECTURE AND MARINE ENGINEERING", "Engineering", 758, 0.050125313]
["CHEMICAL ENGINEERING", "Engineering", 25694, 0.061097712]
["MECHANICAL ENGINEERING", "Engineering", 76442, 0.057342277999999997]
["ELECTRICAL ENGINEERING", "Engineering", 61928, 0.059173845]
["COMPUTER ENGINEERING", "Engineering", 32506, 0.065409275]
["MATERIALS SCIENCE", "Engineering", 3307, 0.023042836]
["ENGINEERING MECHANICS PHYSICS AND SCIENCE", "Engineering", 3608, 0.006334343]
["INDUSTRIAL AND MANUFACTURING ENGINEERING", "Engineering", 15604, 0.042875544]
専攻をアルファベットを降順にソートして10行取り出す
%sql select Major \
from recent_grads \
order by Major desc \
limit 10;
Major |
ZOOLOGY |
VISUAL AND PERFORMING ARTS |
UNITED STATES HISTORY |
TREATMENT THERAPY PROFESSIONS |
TRANSPORTATION SCIENCES AND TECHNOLOGIES |
THEOLOGY AND RELIGIOUS VOCATIONS |
TEACHER EDUCATION: MULTIPLE LEVELS |
STUDIO ARTS |
STATISTICS AND DECISION SCIENCE |
SPECIAL NEEDS EDUCATION |
asd = df[["Major"]].sort_values(by=["Major"], ascending=False)
asd.head(10)
|
Major |
168 |
ZOOLOGY |
153 |
VISUAL AND PERFORMING ARTS |
69 |
UNITED STATES HISTORY |
126 |
TREATMENT THERAPY PROFESSIONS |
106 |
TRANSPORTATION SCIENCES AND TECHNOLOGIES |
158 |
THEOLOGY AND RELIGIOUS VOCATIONS |
154 |
TEACHER EDUCATION: MULTIPLE LEVELS |
159 |
STUDIO ARTS |
46 |
STATISTICS AND DECISION SCIENCE |
100 |
SPECIAL NEEDS EDUCATION |
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs|
cols.zip(xs).to_h
}.sort_by { |xs|
xs["Major"]
}.reverse
.slice(0..9).map { |xs| p xs["Major"]}
"ZOOLOGY"
"VISUAL AND PERFORMING ARTS"
"UNITED STATES HISTORY"
"TREATMENT THERAPY PROFESSIONS"
"TRANSPORTATION SCIENCES AND TECHNOLOGIES"
"THEOLOGY AND RELIGIOUS VOCATIONS"
"TEACHER EDUCATION: MULTIPLE LEVELS"
"STUDIO ARTS"
"STATISTICS AND DECISION SCIENCE"
"SPECIAL NEEDS EDUCATION"
専攻をアルファベットで昇順、給与で降順で、20行を表示する
%sql SELECT Major_category, Median, Major \
FROM recent_grads \
ORDER BY Major ASC, Median DESC \
LIMIT 20;
Major_category |
Median |
Major |
Business |
45000 |
ACCOUNTING |
Business |
62000 |
ACTUARIAL SCIENCE |
Communications & Journalism |
35000 |
ADVERTISING AND PUBLIC RELATIONS |
Engineering |
60000 |
AEROSPACE ENGINEERING |
Agriculture & Natural Resources |
40000 |
AGRICULTURAL ECONOMICS |
Agriculture & Natural Resources |
40000 |
AGRICULTURE PRODUCTION AND MANAGEMENT |
Agriculture & Natural Resources |
30000 |
ANIMAL SCIENCES |
Humanities & Liberal Arts |
28000 |
ANTHROPOLOGY AND ARCHEOLOGY |
Computers & Mathematics |
45000 |
APPLIED MATHEMATICS |
Engineering |
54000 |
ARCHITECTURAL ENGINEERING |
Engineering |
40000 |
ARCHITECTURE |
Humanities & Liberal Arts |
35000 |
AREA ETHNIC AND CIVILIZATION STUDIES |
Education |
32100 |
ART AND MUSIC EDUCATION |
Humanities & Liberal Arts |
31000 |
ART HISTORY AND CRITICISM |
Physical Sciences |
62000 |
ASTRONOMY AND ASTROPHYSICS |
Physical Sciences |
35000 |
ATMOSPHERIC SCIENCES AND METEOROLOGY |
Biology & Life Science |
37400 |
BIOCHEMICAL SCIENCES |
Engineering |
57100 |
BIOLOGICAL ENGINEERING |
Biology & Life Science |
33400 |
BIOLOGY |
Engineering |
60000 |
BIOMEDICAL ENGINEERING |
tri = df[["Major_category", "Median", "Major"]].sort_values(by=["Major", "Median"], ascending=[True, False])
tri.head(20)
|
Major_category |
Median |
Major |
40 |
Business |
45000 |
ACCOUNTING |
6 |
Business |
62000 |
ACTUARIAL SCIENCE |
98 |
Communications & Journalism |
35000 |
ADVERTISING AND PUBLIC RELATIONS |
11 |
Engineering |
60000 |
AEROSPACE ENGINEERING |
71 |
Agriculture & Natural Resources |
40000 |
AGRICULTURAL ECONOMICS |
63 |
Agriculture & Natural Resources |
40000 |
AGRICULTURE PRODUCTION AND MANAGEMENT |
152 |
Agriculture & Natural Resources |
30000 |
ANIMAL SCIENCES |
162 |
Humanities & Liberal Arts |
28000 |
ANTHROPOLOGY AND ARCHEOLOGY |
47 |
Computers & Mathematics |
45000 |
APPLIED MATHEMATICS |
18 |
Engineering |
54000 |
ARCHITECTURAL ENGINEERING |
58 |
Engineering |
40000 |
ARCHITECTURE |
99 |
Humanities & Liberal Arts |
35000 |
AREA ETHNIC AND CIVILIZATION STUDIES |
136 |
Education |
32100 |
ART AND MUSIC EDUCATION |
148 |
Humanities & Liberal Arts |
31000 |
ART HISTORY AND CRITICISM |
7 |
Physical Sciences |
62000 |
ASTRONOMY AND ASTROPHYSICS |
110 |
Physical Sciences |
35000 |
ATMOSPHERIC SCIENCES AND METEOROLOGY |
82 |
Biology & Life Science |
37400 |
BIOCHEMICAL SCIENCES |
15 |
Engineering |
57100 |
BIOLOGICAL ENGINEERING |
123 |
Biology & Life Science |
33400 |
BIOLOGY |
12 |
Engineering |
60000 |
BIOMEDICAL ENGINEERING |
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs|
cols.zip(xs).to_h
}.sort_by { |as, bs|
[as["Major"], !as["Median"] ]
}.slice(0..19).map { |xs| p [xs["Major"], xs["Median"]]}
["ACCOUNTING", 45000]
["ACTUARIAL SCIENCE", 62000]
["ADVERTISING AND PUBLIC RELATIONS", 35000]
["AEROSPACE ENGINEERING", 60000]
["AGRICULTURAL ECONOMICS", 40000]
["AGRICULTURE PRODUCTION AND MANAGEMENT", 40000]
["ANIMAL SCIENCES", 30000]
["ANTHROPOLOGY AND ARCHEOLOGY", 28000]
["APPLIED MATHEMATICS", 45000]
["ARCHITECTURAL ENGINEERING", 54000]
["ARCHITECTURE", 40000]
["AREA ETHNIC AND CIVILIZATION STUDIES", 35000]
["ART AND MUSIC EDUCATION", 32100]
["ART HISTORY AND CRITICISM", 31000]
["ASTRONOMY AND ASTROPHYSICS", 62000]
["ATMOSPHERIC SCIENCES AND METEOROLOGY", 35000]
["BIOCHEMICAL SCIENCES", 37400]
["BIOLOGICAL ENGINEERING", 57100]
["BIOLOGY", 33400]
["BIOMEDICAL ENGINEERING", 60000]
FactBook Dataset
factbookと呼ばれるデータセットで、SQLとpandas, Rubyでの記述を示します
%sql sqlite:////var/factbook.db
conn = sqlite3.connect("/var/factbook.db")
df = pd.read_sql_query("select * from facts;", conn)
データに入っているbirth_rateの件数をカウント
%sql SELECT COUNT(birth_rate) FROM facts;
df["birth_rate"].count()
228
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/factbook.db"
cols = db.execute("PRAGMA table_info(facts)").map { |xs| xs[1]}
r = db.execute( "SELECT * FROM facts ;" ).map { |xs|
cols.zip(xs).to_h
}.select { |xs|
xs["birth_rate"] != nil
}.size
p r
228
birth_rateの合計値を計算
%sql SELECT SUM(birth_rate) \
FROM facts;
SUM(birth_rate) |
4406.909999999998 |
df["birth_rate"].sum()
4406.9099999999999
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/factbook.db"
cols = db.execute("PRAGMA table_info(facts)").map { |xs| xs[1]}
r = db.execute( "SELECT * FROM facts ;" ).map { |xs| cols.zip(xs).to_h
}.select { |xs| xs["birth_rate"] != nil
}.map { |x| x["birth_rate"]
}.reduce { |y,x| y+x }
p r
4406.909999999998
birth_rateの平均値の計算
%sql SELECT AVG(birth_rate) \
FROM facts;
AVG(birth_rate) |
19.32855263157894 |
df["birth_rate"].mean()
19.328552631578948
%%ruby
require "sqlite3"
db = SQLite3::Database.new "/var/factbook.db"
cols = db.execute("PRAGMA table_info(facts)").map { |xs| xs[1]}
all = db.execute( "SELECT * FROM facts ;" ).map { |xs| cols.zip(xs).to_h
}.select { |xs| xs["birth_rate"] != nil }.map { |x| x["birth_rate"] }
p all.reduce{|y,x| y+x }/all.size
19.32855263157894
出生率をUniq(Distinct)して、人口が2000万を超えるデータの平均値を計算する
%sql SELECT AVG(DISTINCT birth_rate) \
FROM facts \
WHERE population > 20000000;
AVG(DISTINCT birth_rate) |
20.43473684210527 |
df[ df.population > 20000000 ][ ["birth_rate"] ].drop_duplicates().mean()
birth_rate 20.434737
dtype: float64
%%ruby
require 'set'
require "sqlite3"
db = SQLite3::Database.new "/var/factbook.db"
cols = db.execute("PRAGMA table_info(facts)").map { |xs| xs[1]}
all = db.execute( "SELECT * FROM facts ;" ).map { |xs| cols.zip(xs).to_h
}.select { |xs| xs["population"] != nil and xs["population"] > 20000000
}.map { |xs| xs["birth_rate"] }.to_set.to_a
p all.reduce { |y,x| y+x}/all.size
20.43473684210527
より細かい条件指定
ここから、データセットをjobs.dbに戻します
groupbyなど一歩踏み込んだSQLのオペレーションであっても、Pandasや関数型言語でも同様に扱えることを示します
%sql sqlite:////var/jobs.db
conn = sqlite3.connect("/var/jobs.db")
df = pd.read_sql_query("select * from recent_grads;", conn)
専攻カテゴリごとにおける、女性率の平均値の計算
%sql SELECT Major_category, AVG(ShareWomen) \
FROM recent_grads \
GROUP BY Major_category;
Major_category |
AVG(ShareWomen) |
Agriculture & Natural Resources |
0.6179384232 |
Arts |
0.56185119575 |
Biology & Life Science |
0.584518475857143 |
Business |
0.4050631853076923 |
Communications & Journalism |
0.64383484025 |
Computers & Mathematics |
0.5127519954545455 |
Education |
0.6749855163125 |
Engineering |
0.2571578951034483 |
Health |
0.6168565694166667 |
Humanities & Liberal Arts |
0.6761934042 |
Industrial Arts & Consumer Services |
0.4493512688571429 |
Interdisciplinary |
0.495397153 |
Law & Public Policy |
0.3359896912 |
Physical Sciences |
0.5087494197 |
Psychology & Social Work |
0.7777631628888888 |
Social Science |
0.5390672957777778 |
df[ ["Major_category", "ShareWomen"] ].groupby( ["Major_category"]).mean()
|
ShareWomen |
Major_category |
|
Agriculture & Natural Resources |
0.617938 |
Arts |
0.561851 |
Biology & Life Science |
0.584518 |
Business |
0.405063 |
Communications & Journalism |
0.643835 |
Computers & Mathematics |
0.512752 |
Education |
0.674986 |
Engineering |
0.257158 |
Health |
0.616857 |
Humanities & Liberal Arts |
0.676193 |
Industrial Arts & Consumer Services |
0.449351 |
Interdisciplinary |
0.495397 |
Law & Public Policy |
0.335990 |
Physical Sciences |
0.508749 |
Psychology & Social Work |
0.777763 |
Social Science |
0.539067 |
%%ruby
require 'set'
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs| cols.zip(xs).to_h
}.map { |xs| ["Major_category", "ShareWomen"].map {|x| xs[x]}
}.group_by { |xs| xs[0]}.to_a
.map { |xs|
key,vals = xs
vals = vals.map { |x| x[1]}
[key, vals.reduce{ |y,x| y+x}/vals.size]
}.map { |xs|
p xs
}
["Engineering", 0.2571578951034483]
["Business", 0.4050631853076923]
["Physical Sciences", 0.5087494197]
["Law & Public Policy", 0.3359896912]
["Computers & Mathematics", 0.5127519954545455]
["Agriculture & Natural Resources", 0.6179384232]
["Industrial Arts & Consumer Services", 0.4493512688571429]
["Arts", 0.56185119575]
["Health", 0.6168565694166667]
["Social Science", 0.5390672957777778]
["Biology & Life Science", 0.584518475857143]
["Education", 0.6749855163125]
["Humanities & Liberal Arts", 0.6761934042]
["Psychology & Social Work", 0.7777631628888888]
["Communications & Journalism", 0.64383484025]
["Interdisciplinary", 0.495397153]
専攻カテゴリごとの平均就職者数、全人数の平均値を計算して、"就職者数/平均人数"を算出する
%sql SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed \
FROM recent_grads \
GROUP BY Major_category;
Major_category |
share_employed |
Agriculture & Natural Resources |
0.8369862842425075 |
Arts |
0.8067482429367457 |
Biology & Life Science |
0.6671565365683841 |
Business |
0.8359659576036412 |
Communications & Journalism |
0.8422291333949735 |
Computers & Mathematics |
0.7956108197773972 |
Education |
0.858190149321534 |
Engineering |
0.7819666916550562 |
Health |
0.8033741337996244 |
Humanities & Liberal Arts |
0.7626382682895378 |
Industrial Arts & Consumer Services |
0.8226700668430581 |
Interdisciplinary |
0.7987150292778139 |
Law & Public Policy |
0.8083994483744353 |
Physical Sciences |
0.7506564085422069 |
Psychology & Social Work |
0.790724459311403 |
Social Science |
0.7575825619001975 |
df_mean = df[ ["Major_category", "Employed", "Total"] ].groupby( ["Major_category"]).mean()
df_mean[ "Employed/Total" ] = df_mean.apply(lambda x:x[0]/x[1],axis=1)
df_mean
|
Employed |
Total |
Employed/Total |
Major_category |
|
|
|
Agriculture & Natural Resources |
6694.300000 |
7998.100000 |
0.836986 |
Arts |
36014.250000 |
44641.250000 |
0.806748 |
Biology & Life Science |
21628.357143 |
32418.714286 |
0.667157 |
Business |
83749.384615 |
100182.769231 |
0.835966 |
Communications & Journalism |
82665.000000 |
98150.250000 |
0.842229 |
Computers & Mathematics |
21626.727273 |
27182.545455 |
0.795611 |
Education |
29989.937500 |
34945.562500 |
0.858190 |
Engineering |
14495.586207 |
18537.344828 |
0.781967 |
Health |
31012.250000 |
38602.500000 |
0.803374 |
Humanities & Liberal Arts |
36274.533333 |
47564.533333 |
0.762638 |
Industrial Arts & Consumer Services |
27006.142857 |
32827.428571 |
0.822670 |
Interdisciplinary |
9821.000000 |
12296.000000 |
0.798715 |
Law & Public Policy |
28958.000000 |
35821.400000 |
0.808399 |
Physical Sciences |
13923.100000 |
18547.900000 |
0.750656 |
Psychology & Social Work |
42260.444444 |
53445.222222 |
0.790724 |
Social Science |
44610.333333 |
58885.111111 |
0.757583 |
%%ruby
require 'set'
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs| cols.zip(xs).to_h
}.map { |xs| ["Major_category", "Employed", "Total"].map {|x| xs[x]}
}.group_by { |xs| xs[0]}.to_a
.map { |xs|
key,vals = xs
emps = vals.map { |x| x[1]}
emps_mean = emps.reduce{ |y,x| y+x}/emps.size
totals = vals.map { |x| x[2]}
totals_mean = totals.reduce{ |y,x| y+x}/totals.size
[key, emps_mean.to_f/totals_mean]
}.map { |xs|
p xs
}
["Engineering", 0.7819496142849436]
["Business", 0.8359685372621828]
["Physical Sciences", 0.7506874427131073]
["Law & Public Policy", 0.8084084754752798]
["Computers & Mathematics", 0.7956000294312413]
["Agriculture & Natural Resources", 0.8369592398099525]
["Industrial Arts & Consumer Services", 0.8226764553568708]
["Arts", 0.8067471606818843]
["Health", 0.8033780633127817]
["Social Science", 0.7575783306444765]
["Biology & Life Science", 0.6671602196310692]
["Education", 0.8581771354986407]
["Humanities & Liberal Arts", 0.7626356067614162]
["Psychology & Social Work", 0.790719431190944]
["Communications & Journalism", 0.8422312786551197]
["Interdisciplinary", 0.7987150292778139]
先ほど計算した、"就職者数/平均人数"が0.8を超えるデータを表示する
%sql SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed \
FROM recent_grads \
GROUP BY Major_category \
HAVING share_employed > .8;
Major_category |
share_employed |
Agriculture & Natural Resources |
0.8369862842425075 |
Arts |
0.8067482429367457 |
Business |
0.8359659576036412 |
Communications & Journalism |
0.8422291333949735 |
Education |
0.858190149321534 |
Health |
0.8033741337996244 |
Industrial Arts & Consumer Services |
0.8226700668430581 |
Law & Public Policy |
0.8083994483744353 |
df_having = df[ ["Major_category", "Employed", "Total"] ].groupby( ["Major_category"]).mean()
df_having[ "Employed/Total" ] = df_mean.apply(lambda x:x[0]/x[1],axis=1)
df_having[ df_having["Employed/Total"] > 0.8 ]
|
Employed |
Total |
Employed/Total |
Major_category |
|
|
|
Agriculture & Natural Resources |
6694.300000 |
7998.100000 |
0.836986 |
Arts |
36014.250000 |
44641.250000 |
0.806748 |
Business |
83749.384615 |
100182.769231 |
0.835966 |
Communications & Journalism |
82665.000000 |
98150.250000 |
0.842229 |
Education |
29989.937500 |
34945.562500 |
0.858190 |
Health |
31012.250000 |
38602.500000 |
0.803374 |
Industrial Arts & Consumer Services |
27006.142857 |
32827.428571 |
0.822670 |
Law & Public Policy |
28958.000000 |
35821.400000 |
0.808399 |
%%ruby
require 'set'
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs| cols.zip(xs).to_h
}.map { |xs| ["Major_category", "Employed", "Total"].map {|x| xs[x]}
}.group_by { |xs| xs[0]}.to_a
.map { |xs|
key,vals = xs
emps = vals.map { |x| x[1]}
emps_mean = emps.reduce{ |y,x| y+x}/emps.size
totals = vals.map { |x| x[2]}
totals_mean = totals.reduce{ |y,x| y+x}/totals.size
[key, emps_mean.to_f/totals_mean]
}.select{ |xs| xs[1] > 0.8
}.map { |xs|
p xs
}
["Business", 0.8359685372621828]
["Law & Public Policy", 0.8084084754752798]
["Agriculture & Natural Resources", 0.8369592398099525]
["Industrial Arts & Consumer Services", 0.8226764553568708]
["Arts", 0.8067471606818843]
["Health", 0.8033780633127817]
["Education", 0.8581771354986407]
["Communications & Journalism", 0.8422312786551197]
女性率を小数点以下、第二桁まで計算して10行を表示
%sql SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women \
FROM recent_grads \
LIMIT 10;
Major_category |
rounded_share_women |
Engineering |
0.12 |
Engineering |
0.1 |
Engineering |
0.15 |
Engineering |
0.11 |
Engineering |
0.34 |
Engineering |
0.14 |
Business |
0.54 |
Physical Sciences |
0.44 |
Engineering |
0.14 |
Engineering |
0.44 |
df_round = df[ ["Major_category", "ShareWomen"] ]
df_round["round"]= df_round.apply(lambda x:"%0.2f"%x[1], axis=1)
df_round.head(10)
|
Major_category |
ShareWomen |
round |
0 |
Engineering |
0.120564 |
0.12 |
1 |
Engineering |
0.101852 |
0.10 |
2 |
Engineering |
0.153037 |
0.15 |
3 |
Engineering |
0.107313 |
0.11 |
4 |
Engineering |
0.341631 |
0.34 |
5 |
Engineering |
0.144967 |
0.14 |
6 |
Business |
0.535714 |
0.54 |
7 |
Physical Sciences |
0.441356 |
0.44 |
8 |
Engineering |
0.139793 |
0.14 |
9 |
Engineering |
0.437847 |
0.44 |
%%ruby
require 'set'
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs| cols.zip(xs).to_h
}.map { |xs| ["Major_category", "ShareWomen"].map {|x| xs[x]}
}.map { |xs|
mc = xs[0]
sw = sprintf("%0.2f", xs[1])
[mc, sw]
}.map{ |x| p x}
["Engineering", "0.12"]
["Engineering", "0.10"]
["Engineering", "0.15"]
["Engineering", "0.11"]
["Engineering", "0.34"]
["Engineering", "0.14"]
["Business", "0.54"]
["Physical Sciences", "0.44"]
["Engineering", "0.14"]
["Engineering", "0.44"]
["Engineering", "0.20"]
["Engineering", "0.20"]
["Engineering", "0.12"]
["Engineering", "0.31"]
["Engineering", "0.18"]
["Engineering", "0.32"]
["Engineering", "0.34"]
["Engineering", "0.25"]
["Engineering", "0.35"]
["Law & Public Policy", "0.24"]
["Computers & Mathematics", "0.58"]
["Agriculture & Natural Resources", "0.22"]
["Engineering", "0.33"]
["Engineering", "0.29"]
["Business", "0.28"]
["Engineering", "0.23"]
["Industrial Arts & Consumer Services", "0.34"]
["Business", "0.32"]
["Engineering", "0.19"]
["Law & Public Policy", "0.25"]
["Engineering", "0.56"]
["Engineering", "0.09"]
["Arts", "0.41"]
["Engineering", "0.32"]
["Health", "0.90"]
["Business", "0.36"]
["Social Science", "0.34"]
["Business", "0.25"]
["Engineering", "0.75"]
["Physical Sciences", "0.43"]
["Business", "0.25"]
["Computers & Mathematics", "0.24"]
["Computers & Mathematics", "0.71"]
["Physical Sciences", "0.45"]
["Health", "0.43"]
["Computers & Mathematics", "0.53"]
["Computers & Mathematics", "0.28"]
["Computers & Mathematics", "0.75"]
["Biology & Life Science", "0.52"]
["Physical Sciences", "0.69"]
["Engineering", "0.17"]
["Health", "0.18"]
["Computers & Mathematics", "0.93"]
["Computers & Mathematics", "0.27"]
["Biology & Life Science", "0.85"]
["Education", "0.56"]
["Social Science", "0.63"]
["Business", "0.42"]
["Engineering", "0.32"]
["Business", "0.28"]
["Health", "0.45"]
["Biology & Life Science", "0.08"]
["Business", "0.20"]
["Agriculture & Natural Resources", "0.59"]
["Agriculture & Natural Resources", "0.52"]
["Engineering", "0.00"]
["Engineering", "0.38"]
["Biology & Life Science", "0.64"]
["Social Science", "0.54"]
["Humanities & Liberal Arts", "0.63"]
["Psychology & Social Work", "0.44"]
["Agriculture & Natural Resources", "0.59"]
["Physical Sciences", "0.43"]
["Industrial Arts & Consumer Services", "0.43"]
["Physical Sciences", "0.51"]
["Industrial Arts & Consumer Services", "0.23"]
["Business", "0.58"]
["Business", "0.38"]
["Social Science", "0.49"]
["Social Science", "0.47"]
["Biology & Life Science", "0.62"]
["Computers & Mathematics", "0.18"]
["Biology & Life Science", "0.52"]
["Biology & Life Science", "0.53"]
["Computers & Mathematics", "0.31"]
["Physical Sciences", "0.47"]
["Business", "0.67"]
["Law & Public Policy", "0.59"]
["Health", "0.70"]
["Law & Public Policy", "0.48"]
["Physical Sciences", "0.88"]
["Psychology & Social Work", "0.75"]
["Biology & Life Science", "0.58"]
["Communications & Journalism", "0.31"]
["Law & Public Policy", "0.13"]
["Arts", "0.37"]
["Communications & Journalism", "0.72"]
["Physical Sciences", "0.67"]
["Communications & Journalism", "0.67"]
["Humanities & Liberal Arts", "0.76"]
["Education", "0.37"]
["Biology & Life Science", "0.91"]
["Social Science", "0.62"]
["Health", "0.64"]
["Health", "0.77"]
["Computers & Mathematics", "0.86"]
["Industrial Arts & Consumer Services", "0.32"]
["Agriculture & Natural Resources", "0.56"]
["Biology & Life Science", "0.48"]
["Interdisciplinary", "0.50"]
["Physical Sciences", "0.12"]
["Agriculture & Natural Resources", "0.69"]
["Agriculture & Natural Resources", "0.76"]
["Education", "0.81"]
["Humanities & Liberal Arts", "0.65"]
["Humanities & Liberal Arts", "0.73"]
["Humanities & Liberal Arts", "0.51"]
["Education", "0.73"]
["Health", "0.65"]
["Education", "0.79"]
["Education", "0.45"]
["Health", "0.56"]
["Biology & Life Science", "0.57"]
["Biology & Life Science", "0.60"]
["Social Science", "0.53"]
["Communications & Journalism", "0.88"]
["Health", "0.64"]
["Business", "0.73"]
["Education", "0.58"]
["Humanities & Liberal Arts", "0.76"]
["Education", "0.72"]
["Social Science", "0.72"]
["Biology & Life Science", "0.65"]
["Education", "0.60"]
["Health", "0.77"]
["Humanities & Liberal Arts", "0.42"]
["Education", "0.69"]
["Humanities & Liberal Arts", "0.34"]
["Education", "0.92"]
["Industrial Arts & Consumer Services", "0.68"]
["Humanities & Liberal Arts", "0.70"]
["Arts", "0.69"]
["Social Science", "0.50"]
["Agriculture & Natural Resources", "0.61"]
["Education", "0.42"]
["Psychology & Social Work", "0.78"]
["Arts", "0.44"]
["Education", "0.51"]
["Humanities & Liberal Arts", "0.85"]
["Arts", "0.67"]
["Industrial Arts & Consumer Services", "0.75"]
["Psychology & Social Work", "0.81"]
["Agriculture & Natural Resources", "0.91"]
["Arts", "0.70"]
["Education", "0.80"]
["Psychology & Social Work", "0.91"]
["Psychology & Social Work", "0.90"]
["Humanities & Liberal Arts", "0.75"]
["Humanities & Liberal Arts", "0.73"]
["Arts", "0.58"]
["Industrial Arts & Consumer Services", "0.38"]
["Agriculture & Natural Resources", "0.72"]
["Humanities & Liberal Arts", "0.97"]
["Health", "0.71"]
["Education", "0.97"]
["Humanities & Liberal Arts", "0.69"]
["Arts", "0.63"]
["Humanities & Liberal Arts", "0.67"]
["Biology & Life Science", "0.64"]
["Psychology & Social Work", "0.82"]
["Psychology & Social Work", "0.80"]
["Psychology & Social Work", "0.80"]
["Education", "0.88"]
大学で仕事を得た人の専攻カテゴリごとの平均と、専攻カテゴリごとの平均人数を割ることで、”大学での仕事の人/全人数の平均”を計算し、30%以下のデータを表示する
%sql SELECT Major_category, ROUND(AVG(College_jobs) / AVG(Total), 3) AS share_degree_jobs \
FROM recent_grads \
GROUP BY Major_category HAVING share_degree_jobs < .3;
Major_category |
share_degree_jobs |
Agriculture & Natural Resources |
0.248 |
Arts |
0.265 |
Business |
0.114 |
Communications & Journalism |
0.22 |
Humanities & Liberal Arts |
0.27 |
Industrial Arts & Consumer Services |
0.249 |
Law & Public Policy |
0.163 |
Social Science |
0.215 |
df_having = df[ ["Major_category", "College_jobs", "Total"] ].groupby( ["Major_category"]).mean()
df_having[ "College_jobs/Total" ] = df_having.apply(lambda x: float("%.3f"%(x[0]/x[1])),axis=1)
df_having = df_having[ df_having["College_jobs/Total"] < 0.3 ]
df_having
|
College_jobs |
Total |
College_jobs/Total |
Major_category |
|
|
|
Agriculture & Natural Resources |
1986.000000 |
7998.100000 |
0.248 |
Arts |
11848.125000 |
44641.250000 |
0.265 |
Business |
11426.000000 |
100182.769231 |
0.114 |
Communications & Journalism |
21639.000000 |
98150.250000 |
0.220 |
Humanities & Liberal Arts |
12843.333333 |
47564.533333 |
0.270 |
Industrial Arts & Consumer Services |
8171.428571 |
32827.428571 |
0.249 |
Law & Public Policy |
5844.200000 |
35821.400000 |
0.163 |
Social Science |
12662.222222 |
58885.111111 |
0.215 |
%%ruby
require 'set'
require "sqlite3"
db = SQLite3::Database.new "/var/jobs.db"
cols = db.execute("PRAGMA table_info(recent_grads)").map { |xs| xs[1]}
db.execute( "SELECT * FROM recent_grads ;" ).map { |xs| cols.zip(xs).to_h
}.map { |xs| ["Major_category", "College_jobs", "Total"].map {|x| xs[x]}
}.group_by { |xs| xs[0]}.to_a
.map { |xs|
key,vals = xs
cljs = vals.map { |x| x[1]}
cljs_mean = cljs.reduce{ |y,x| y+x}/cljs.size
totals = vals.map { |x| x[2]}
totals_mean = totals.reduce{ |y,x| y+x}/totals.size
[key, cljs_mean.to_f/totals_mean]
}.select{ |xs| xs[1] < 0.3
}.map { |xs|
key = xs[0]
val = sprintf("%0.3f", xs[1])
p [key, val]
}
["Business", "0.114"]
["Law & Public Policy", "0.163"]
["Agriculture & Natural Resources", "0.248"]
["Industrial Arts & Consumer Services", "0.249"]
["Arts", "0.265"]
["Social Science", "0.215"]
["Humanities & Liberal Arts", "0.270"]
["Communications & Journalism", "0.220"]
まとめ
このように、基本的なSQLでできることは、Pandasと関数型で全てできるということができそうだとわかりました
機械学習の文脈は、この中では含んでいませんが、必要なデータを様々な方法で集めて、機械学習のかけるなどはよくするので、前処理の一環でもあります
いろんな案件に応じて、適切な集計方法を選択するのですが、SQLに入っていっていて、SQLクエリだけで済むのであれば、そのようにすればいいですし、一台のローカルマシンで済むぐらいのExcelファイルならば、Pandasなどが良いでしょう。複数台数に跨って収められているビッグデータに関しては、MapReduceなどを選択すれば良いでしょう。
アドテクはこの程度で済むという経験則があるのですが、これ以上何か集計ツールが増えるには、できるだけ機能やシンタックスを対応させて覚えさせて、情報量があまり増えすぎないようにコントロールしたいです、