- PVSM.RU - https://www.pvsm.ru -

База свободных репозиториев Github доступна через интерфейс BigQuery

2,8 млн репозиториев, 3 ТБ исходного кода и метаданных

База свободных репозиториев Github доступна через интерфейс BigQuery - 1

Google в сотрудничестве с Github выложила [1] для общественного пользования полную актуальную базу всех open-source репозиториев через интерфейс BigQuery [2]. (Проверка свободной лицензии [3] осуществляется через API.)

Наборы данных Google BigQuery Public Datasets [4] содержат информацию о более чем 2,8 млн свободных репозиториев, о более чем 2 млрд файлов (исходный код последних версий 163 млн файлов), 145 млн коммитов и т.д. Общий размер базы — около 3 терабайт.

Раньше архивы Github выкладывались на Github Archive [5]. Теперь всё это богатство доступно для полнотекстового поиска и анализа через простые SQL-запросы. Github обещает [6] обновлять наборы данных еженедельно.

BigQuery — это облачный сервис для интерактивного анализа больших наборов данных. Модель использования BigQuery довольно проста: передача данных реализована с помощью обычных API-вызовов. Это избавляет от необходимости предварительного кэширования и позволяет выполнять потоковую обработку данных в режиме реального времени.

Другими словами, теперь с помощью простого SQL-запроса можно получить любую статистику по репозиториям Github. Как репозитории самые популярные за всё время? За последний год? За последнюю неделю? Например, вот результат [7] выдачи по запросу 1000 самых популярных репозиториев за 2015-2016 гг (перечислены первые десять).

repo_id repo_name num_stars
1 28457823 FreeCodeCamp/FreeCodeCamp [8] 143920
2 21737465 sindresorhus/awesome [9] 36429
3 29028775 facebook/react-native [10] 36285
4 10270250 facebook/react [11] 34656
5 44838949 apple/swift [12] 33838
6 2126244 twbs/bootstrap [13] 28916
7 45717250 tensorflow/tensorflow [14] 28345
8 35955666 jlevy/the-art-of-command-line [15] 27766
9 14440270 getify/You-Dont-Know-JS [16] 27473
10 6498492 airbnb/javascript [17] 27247

(Выдача по запросу 1000 самых популярных репозиториев — результат обработки по трём параметрам всего 15,4 ГБ данных из бесплатного месячного лимита 1 ТБ [18], который Google даёт каждому пользователю BigQuery)

Примеры запросов

Примеры запросов BigQuery [19]

Все репозитории с комментарием в коде «Такое никогда не должно произойти»

SELECT count(*)
FROM (SELECT id, repo_name, path
        FROM [bigquery-public-data:github_repos.sample_files]
      ) AS F
JOIN (SELECT id
        FROM [bigquery-public-data:github_repos.sample_contents]
       WHERE NOT binary
         AND content CONTAINS 'This should never happen') AS C
ON F.id = C.id;

Самые часто используемые пакеты Go

SELECT
  REGEXP_EXTRACT(line, r'"([^"]+)"') AS url,
  COUNT(*) AS count
FROM FLATTEN(
  (SELECT
  SPLIT(SPLIT(REGEXP_EXTRACT(content, r'.*imports*[(]([^)]*)[)]'), 'n'), ';') AS line,
  FROM
    (SELECT id, content FROM [bigquery-public-data:github_repos.sample_contents]
     WHERE REGEXP_MATCH(content, r'.*imports*[(][^)]*[)]')) AS C
    JOIN EACH
    (SELECT id FROM [bigquery-public-data:github_repos.sample_files]
     WHERE path LIKE '%.go' GROUP BY id) AS F
    ON C.id = F.id), line)
GROUP BY url
HAVING url IS NOT NULL
ORDER BY count DESC

Самые часто используемые пакеты Java

SELECT package, COUNT(*) c
FROM (
  SELECT REGEXP_EXTRACT(line, r' ([a-z0-9._]*).') package, id
  FROM (
     SELECT SPLIT(content, 'n') line, id
     FROM [bigquery-public-data:github_repos.sample_contents]
     WHERE content CONTAINS 'import'
     AND sample_path LIKE '%.java'
     HAVING LEFT(line, 6)='import'
  )
  GROUP BY package, id
)
GROUP BY 1
ORDER BY c DESC
LIMIT 500;

Эмодзи в коммитах

SELECT
REGEXP_REPLACE(REGEXP_REPLACE(word, r'^[^:]*:', ''), r':[^:]*$', '') as emoticon,
C.committer.time_sec as committed_at,
count(*) as count
FROM ( SELECT SPLIT(message, ' ') as word, committer.time_sec
      FROM [bigquery-public-data:github_repos.sample_commits]
    ) as C
where
REGEXP_MATCH(word, r':[w]+:') and
REGEXP_REPLACE(REGEXP_REPLACE(word, r'^[^:]*:', ''), r':[^:]*$', '') in ("+1", "-1", "100", "1234", "8ball", "a", "ab", "abc", "abcd", "accept", "aerial_tramway", "airplane", "alarm_clock", "alien", "ambulance", "anchor", "angel", "anger", "angry", "anguished", "ant", "apple", "aquarius", "aries", "arrow_backward", "arrow_double_down", "arrow_double_up", "arrow_down", "arrow_down_small", "arrow_forward", "arrow_heading_down", "arrow_heading_up", "arrow_left", "arrow_lower_left", "arrow_lower_right", "arrow_right", "arrow_right_hook", "arrow_up", "arrow_up_down", "arrow_up_small", "arrow_upper_left", "arrow_upper_right", "arrows_clockwise", "arrows_counterclockwise", "art", "articulated_lorry", "astonished", "atm", "b", "baby", "baby_bottle", "baby_chick", "baby_symbol", "back", "baggage_claim", "balloon", "ballot_box_with_check", "bamboo", "banana", "bangbang", "bank", "bar_chart", "barber", "baseball", "basketball", "bath", "bathtub", "battery", "bear", "bee", "beer", "beers", "beetle", "beginner", "bell", "bento", "bicyclist", "bike", "bikini", "bird", "birthday", "black_circle", "black_joker", "black_medium_small_square", "black_medium_square", "black_nib", "black_small_square", "black_square", "black_square_button", "blossom", "blowfish", "blue_book", "blue_car", "blue_heart", "blush", "boar", "boat", "bomb", "book", "bookmark", "bookmark_tabs", "books", "boom", "boot", "bouquet", "bow", "bowling", "bowtie", "boy", "bread", "bride_with_veil", "bridge_at_night", "briefcase", "broken_heart", "bug", "bulb", "bullettrain_front", "bullettrain_side", "bus", "busstop", "bust_in_silhouette", "busts_in_silhouette", "cactus", "cake", "calendar", "calling", "camel", "camera", "cancer", "candy", "capital_abcd", "capricorn", "car", "card_index", "carousel_horse", "cat", "cat2", "cd", "chart", "chart_with_downwards_trend", "chart_with_upwards_trend", "checkered_flag", "cherries", "cherry_blossom", "chestnut", "chicken", "children_crossing", "chocolate_bar", "christmas_tree", "church", "cinema", "circus_tent", "city_sunrise", "city_sunset", "cl", "clap", "clapper", "clipboard", "clock1", "clock10", "clock1030", "clock11", "clock1130", "clock12", "clock1230", "clock130", "clock2", "clock230", "clock3", "clock330", "clock4", "clock430", "clock5", "clock530", "clock6", "clock630", "clock7", "clock730", "clock8", "clock830", "clock9", "clock930", "closed_book", "closed_lock_with_key", "closed_umbrella", "cloud", "clubs", "cn", "cocktail", "coffee", "cold_sweat", "collision", "computer", "confetti_ball", "confounded", "confused", "congratulations", "construction", "construction_worker", "convenience_store", "cookie", "cool", "cop", "copyright", "corn", "couple", "couple_with_heart", "couplekiss", "cow", "cow2", "credit_card", "crocodile", "crossed_flags", "crown", "cry", "crying_cat_face", "crystal_ball", "cupid", "curly_loop", "currency_exchange", "curry", "custard", "customs", "cyclone", "dancer", "dancers", "dango", "dart", "dash", "date", "de", "deciduous_tree", "department_store", "diamond_shape_with_a_dot_inside", "diamonds", "disappointed", "disappointed_relieved", "dizzy", "dizzy_face", "do_not_litter", "dog", "dog2", "dollar", "dolls", "dolphin", "donut", "door", "doughnut", "dragon", "dragon_face", "dress", "dromedary_camel", "droplet", "dvd", "e\-mail", "ear", "ear_of_rice", "earth_africa", "earth_americas", "earth_asia", "egg", "eggplant", "eight", "eight_pointed_black_star", "eight_spoked_asterisk", "electric_plug", "elephant", "email", "end", "envelope", "es", "euro", "european_castle", "european_post_office", "evergreen_tree", "exclamation", "expressionless", "eyeglasses", "eyes", "facepunch", "factory", "fallen_leaf", "family", "fast_forward", "fax", "fearful", "feelsgood", "feet", "ferris_wheel", "file_folder", "finnadie", "fire", "fire_engine", "fireworks", "first_quarter_moon", "first_quarter_moon_with_face", "fish", "fish_cake", "fishing_pole_and_fish", "fist", "five", "flags", "flashlight", "floppy_disk", "flower_playing_cards", "flushed", "foggy", "football", "fork_and_knife", "fountain", "four", "four_leaf_clover", "fr", "free", "fried_shrimp", "fries", "frog", "frowning", "fu", "fuelpump", "full_moon", "full_moon_with_face", "game_die", "gb", "gem", "gemini", "ghost", "gift", "gift_heart", "girl", "globe_with_meridians", "goat", "goberserk", "godmode", "golf", "grapes", "green_apple", "green_book", "green_heart", "grey_exclamation", "grey_question", "grimacing", "grin", "grinning", "guardsman", "guitar", "gun", "haircut", "hamburger", "hammer", "hamster", "hand", "handbag", "hankey", "hash", "hatched_chick", "hatching_chick", "headphones", "hear_no_evil", "heart", "heart_decoration", "heart_eyes", "heart_eyes_cat", "heartbeat", "heartpulse", "hearts", "heavy_check_mark", "heavy_division_sign", "heavy_dollar_sign", "heavy_exclamation_mark", "heavy_minus_sign", "heavy_multiplication_x", "heavy_plus_sign", "helicopter", "herb", "hibiscus", "high_brightness", "high_heel", "hocho", "honey_pot", "honeybee", "horse", "horse_racing", "hospital", "hotel", "hotsprings", "hourglass", "hourglass_flowing_sand", "house", "house_with_garden", "hurtrealbad", "hushed", "ice_cream", "icecream", "id", "ideograph_advantage", "imp", "inbox_tray", "incoming_envelope", "information_desk_person", "information_source", "innocent", "interrobang", "iphone", "it", "izakaya_lantern", "jack_o_lantern", "japan", "japanese_castle", "japanese_goblin", "japanese_ogre", "jeans", "joy", "joy_cat", "jp", "key", "keycap_ten", "kimono", "kiss", "kissing", "kissing_cat", "kissing_closed_eyes", "kissing_face", "kissing_heart", "kissing_smiling_eyes", "koala", "koko", "kr", "large_blue_circle", "large_blue_diamond", "large_orange_diamond", "last_quarter_moon", "last_quarter_moon_with_face", "laughing", "leaves", "ledger", "left_luggage", "left_right_arrow", "leftwards_arrow_with_hook", "lemon", "leo", "leopard", "libra", "light_rail", "link", "lips", "lipstick", "lock", "lock_with_ink_pen", "lollipop", "loop", "loudspeaker", "love_hotel", "love_letter", "low_brightness", "m", "mag", "mag_right", "mahjong", "mailbox", "mailbox_closed", "mailbox_with_mail", "mailbox_with_no_mail", "man", "man_with_gua_pi_mao", "man_with_turban", "mans_shoe", "maple_leaf", "mask", "massage", "meat_on_bone", "mega", "melon", "memo", "mens", "metal", "metro", "microphone", "microscope", "milky_way", "minibus", "minidisc", "mobile_phone_off", "money_with_wings", "moneybag", "monkey", "monkey_face", "monorail", "moon", "mortar_board", "mount_fuji", "mountain_bicyclist", "mountain_cableway", "mountain_railway", "mouse", "mouse2", "movie_camera", "moyai", "muscle", "mushroom", "musical_keyboard", "musical_note", "musical_score", "mute", "nail_care", "name_badge", "neckbeard", "necktie", "negative_squared_cross_mark", "neutral_face", "new", "new_moon", "new_moon_with_face", "newspaper", "ng", "nine", "no_bell", "no_bicycles", "no_entry", "no_entry_sign", "no_good", "no_mobile_phones", "no_mouth", "no_pedestrians", "no_smoking", "non\-potable_water", "nose", "notebook", "notebook_with_decorative_cover", "notes", "nut_and_bolt", "o", "o2", "ocean", "octocat", "octopus", "oden", "office", "ok", "ok_hand", "ok_woman", "older_man", "older_woman", "on", "oncoming_automobile", "oncoming_bus", "oncoming_police_car", "oncoming_taxi", "one", "open_file_folder", "open_hands", "open_mouth", "ophiuchus", "orange_book", "outbox_tray", "ox", "package", "page_facing_up", "page_with_curl", "pager", "palm_tree", "panda_face", "paperclip", "parking", "part_alternation_mark", "partly_sunny", "passport_control", "paw_prints", "peach", "pear", "pencil", "pencil2", "penguin", "pensive", "performing_arts", "persevere", "person_frowning", "person_with_blond_hair", "person_with_pouting_face", "phone", "pig", "pig2", "pig_nose", "pill", "pineapple", "pisces", "pizza", "plus1", "point_down", "point_left", "point_right", "point_up", "point_up_2", "police_car", "poodle", "poop", "post_office", "postal_horn", "postbox", "potable_water", "pouch", "poultry_leg", "pound", "pouting_cat", "pray", "princess", "punch", "purple_heart", "purse", "pushpin", "put_litter_in_its_place", "question", "rabbit", "rabbit2", "racehorse", "radio", "radio_button", "rage", "rage1", "rage2", "rage3", "rage4", "railway_car", "rainbow", "raised_hand", "raised_hands", "raising_hand", "ram", "ramen", "rat", "recycle", "red_car", "red_circle", "registered", "relaxed", "relieved", "repeat", "repeat_one", "restroom", "revolving_hearts", "rewind", "ribbon", "rice", "rice_ball", "rice_cracker", "rice_scene", "ring", "rocket", "roller_coaster", "rooster", "rose", "rotating_light", "round_pushpin", "rowboat", "ru", "rugby_football", "runner", "running", "running_shirt_with_sash", "sa", "sagittarius", "sailboat", "sake", "sandal", "santa", "satellite", "satisfied", "saxophone", "school", "school_satchel", "scissors", "scorpius", "scream", "scream_cat", "scroll", "seat", "secret", "see_no_evil", "seedling", "seven", "shaved_ice", "sheep", "shell", "ship", "shipit", "shirt", "shit", "shoe", "shower", "signal_strength", "six", "six_pointed_star", "ski", "skull", "sleeping", "sleepy", "slot_machine", "small_blue_diamond", "small_orange_diamond", "small_red_triangle", "small_red_triangle_down", "smile", "smile_cat", "smiley", "smiley_cat", "smiling_imp", "smirk", "smirk_cat", "smoking", "snail", "snake", "snowboarder", "snowflake", "snowman", "sob", "soccer", "soon", "sos", "sound", "space_invader", "spades", "spaghetti", "sparkle", "sparkler", "sparkles", "sparkling_heart", "speak_no_evil", "speaker", "speech_balloon", "speedboat", "squirrel", "star", "star2", "stars", "station", "statue_of_liberty", "steam_locomotive", "stew", "straight_ruler", "strawberry", "stuck_out_tongue", "stuck_out_tongue_closed_eyes", "stuck_out_tongue_winking_eye", "sun_with_face", "sunflower", "sunglasses", "sunny", "sunrise", "sunrise_over_mountains", "surfer",
 "sushi", "suspect", "suspension_railway", "sweat", "sweat_drops", "sweat_smile", "sweet_potato", "swimmer", "symbols", "syringe", "tada", "tanabata_tree", "tangerine", "taurus", "taxi", "tea", "telephone", "telephone_receiver", "telescope", "tennis", "tent", "thought_balloon", "three", "thumbsdown", "thumbsup", "ticket", "tiger", "tiger2", "tired_face", "tm", "toilet", "tokyo_tower", "tomato", "tongue", "top", "tophat", "tractor", "traffic_light", "train", "train2", "tram", "triangular_flag_on_post", "triangular_ruler", "trident", "triumph", "trolleybus", "trollface", "trophy", "tropical_drink", "tropical_fish", "truck", "trumpet", "tshirt", "tulip", "turtle", "tv", "twisted_rightwards_arrows", "two", "two_hearts", "two_men_holding_hands", "two_women_holding_hands", "u5272", "u5408", "u55b6", "u6307", "u6708", "u6709", "u6e80", "u7121", "u7533", "u7981", "u7a7a", "uk", "umbrella", "unamused", "underage", "unlock", "up", "us", "v", "vertical_traffic_light", "vhs", "vibration_mode", "video_camera", "video_game", "violin", "virgo", "volcano", "vs", "walking", "waning_crescent_moon", "waning_gibbous_moon", "warning", "watch", "water_buffalo", "watermelon", "wave", "wavy_dash", "waxing_crescent_moon", "waxing_gibbous_moon", "wc", "weary", "wedding", "whale", "whale2", "wheelchair", "white_check_mark", "white_circle", "white_flower", "white_large_square", "white_medium_small_square", "white_medium_square", "white_small_square", "white_square_button", "wind_chime", "wine_glass", "wink", "wolf", "woman", "womans_clothes", "womans_hat", "womens", "worried", "wrench", "x", "yellow_heart", "yen", "yum", "zap", "zero", "zzz")
group by 1, 2
order by count DESC
LIMIT 100;

Из каких университетов (.edu) пришло больше всего коммитов

SELECT REGEXP_EXTRACT(email, r'@(.*)') domain, COUNT(*) c
FROM (
  SELECT REGEXP_EXTRACT(payload, r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+)') email
  FROM [githubarchive:month.201512]
  WHERE REGEXP_EXTRACT(payload, r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+)') IS NOT null
  GROUP BY 1
  HAVING email CONTAINS '.edu'
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;

Тем кто хочет поэкспериментировать с запросами к базе GitHub Data, рекомендуют [20] не обращаться к основной таблице [bigquery-public-data:github_repos.contents] размером 1,5 ТБ, которая содержит все небинарные файлы размером меньше 1 МБ. Она мгновенно сожрёт ваш бесплатный месячный лимит. Вместо этого следует обращаться к таблице с выжимкой 10% файлов самых популярных репозиториев [bigquery-public-data:github_repos.sample_contents [21]] размером всего 23 ГБ или к отдельным таблицам для языков программирования .go, .rb. .js, .php, .py и .java [fh-bigquery:github_extracts.contents_* [22]].

Официальное сообщение Google [23]
Официальное сообщение Github [24]
Документация по GitHub Data [25]

В сети уже начали появляться интересные сервисы [26] с визуализацией данных Github Data.

База свободных репозиториев Github доступна через интерфейс BigQuery - 2

База свободных репозиториев Github доступна через интерфейс BigQuery - 3

База свободных репозиториев Github доступна через интерфейс BigQuery - 4База свободных репозиториев Github доступна через интерфейс BigQuery - 5

Автор: alizar

Источник [27]


Сайт-источник PVSM.RU: https://www.pvsm.ru

Путь до страницы источника: https://www.pvsm.ru/open-source/147161

Ссылки в тексте:

[1] выложила: http://google-opensource.blogspot.com.by/2016/06/github-on-bigquery-analyze-all-code.html

[2] BigQuery: https://cloud.google.com/bigquery/

[3] свободной лицензии: https://developer.github.com/v3/licenses/

[4] Google BigQuery Public Datasets: https://cloud.google.com/bigquery/public-data/

[5] Github Archive: https://www.githubarchive.org/

[6] обещает: http://venturebeat.com/2016/06/29/github-releases-data-on-2-8-million-open-source-repositories-through-google-bigquery/

[7] вот результат: https://docs.google.com/spreadsheets/d/16yDS2wDdDOTxjVsjGvWmpHVsOIU65wLEjXFHDtDeKU4/edit?usp=sharing

[8] FreeCodeCamp/FreeCodeCamp: https://github.com/FreeCodeCamp/FreeCodeCamp

[9] sindresorhus/awesome: https://github.com/sindresorhus/awesome

[10] facebook/react-native: https://github.com/facebook/react-native

[11] facebook/react: https://github.com/facebook/react

[12] apple/swift: https://github.com/apple/swift

[13] twbs/bootstrap: https://github.com/twbs/bootstrap

[14] tensorflow/tensorflow: https://github.com/tensorflow/tensorflow

[15] jlevy/the-art-of-command-line: https://github.com/jlevy/the-art-of-command-line

[16] getify/You-Dont-Know-JS: https://github.com/getify/You-Dont-Know-JS

[17] airbnb/javascript: https://github.com/airbnb/javascript

[18] бесплатного месячного лимита 1 ТБ: https://cloud.google.com/bigquery/pricing

[19] Примеры запросов BigQuery: https://gist.github.com/arfon/49ca314a5b0a00b1ebf91167db3ff02c#most-commonly-used-go-packages

[20] рекомендуют: https://medium.com/@hoffa/github-on-bigquery-analyze-all-the-code-b3576fd2b150#.andswjzc2

[21] bigquery-public-data:github_repos.sample_contents: https://bigquery.cloud.google.com/table/bigquery-public-data:github_repos.sample_contents

[22] fh-bigquery:github_extracts.contents_*: https://bigquery.cloud.google.com/dataset/fh-bigquery:github_extracts

[23] Официальное сообщение Google: https://cloudplatform.googleblog.com/2016/06/GitHub-on-BigQuery-analyze-all-the-open-source-code.html

[24] Официальное сообщение Github: https://github.com/blog/2201-making-open-source-data-more-available

[25] Документация по GitHub Data: https://cloud.google.com/bigquery/public-data/github

[26] интересные сервисы: https://datastudio.google.com/open/0ByGAKP3QmCjLdXBlWVdrZU5yZW8

[27] Источник: https://habrahabr.ru/post/304476/?utm_source=habrahabr&utm_medium=rss&utm_campaign=best