Data Infrastructures and Introduction to Data Science
Master's in Economic Analysis and Policy, Data Science for Economics and Business track
Description
- SQL: queries and operations
- NoSQL: queries and operations
- Distributed computing
- Introduction to machine learning
Learning outcomes
On completing this course unit (UE), students will be able to:
- Format and manipulate structured and unstructured databases
- Understand and use basic regression and classification methods
- Understand the paradigms for distributed computing on clusters
- Work more autonomously in Python
- Gain experience in teamwork and project management
Course organization and follow-up
Lectures (in English and French) and hands-on sessions for each macro module. Lectures are complemented by online resources and take-home assignments, which are integral parts of the course and are meant to keep students engaged outside the classroom.
Disciplines
- Economics
Syllabus
- Big Data is often described by the “six Vs”: volume, variety, velocity, value, veracity and variability. Today these data come from multiple structured and unstructured sources. One of the main tasks of the data scientist is to feed knowledge and information into databases (SQL/NoSQL) and to use the available machines and resources (Hadoop, Spark, Cloud) to generate, manipulate and process the data. This course is structured in three macro modules: (i) it starts with relational databases, in particular data wrangling and data extraction using SQL (Structured Query Language) commands; (ii) students are then introduced to basic unstructured formats such as JSON, XML and dictionaries, and to one of the most widely used NoSQL databases, MongoDB; (iii) finally, the course presents some principles of distributed computing systems and virtualization. The map/reduce paradigm for distributed systems is introduced with applications on the Hadoop and Spark distributed computing frameworks. The course also trains students in the use of cloud computing, mainly at the infrastructure as a service (IaaS) layer. More details:
I. SQL (10h)
- Relational data
- Basic SQL queries
- Filtering rows and selecting columns
- Sorting and grouping
- Merging tables
- Aggregate results
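As an illustration of the operations listed in this module (filtering, merging, grouping, aggregating and sorting), here is a minimal sketch using Python's built-in sqlite3 module. The tables firms and sales, their columns and their values are hypothetical and chosen only for this example; they are not course material.

```python
import sqlite3

# Hypothetical toy data: table names, columns and values are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE firms (firm_id INTEGER PRIMARY KEY, name TEXT, sector TEXT)")
cur.execute("CREATE TABLE sales (firm_id INTEGER, year INTEGER, revenue REAL)")
cur.executemany(
    "INSERT INTO firms VALUES (?, ?, ?)",
    [(1, "Alpha", "retail"), (2, "Beta", "retail"), (3, "Gamma", "energy")],
)
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 2022, 100.0), (1, 2023, 120.0), (2, 2023, 80.0),
     (3, 2022, 300.0), (3, 2023, 310.0)],
)

query = """
SELECT f.sector,                        -- select columns
       COUNT(*)       AS n_rows,        -- aggregate: number of rows per group
       SUM(s.revenue) AS total_revenue  -- aggregate: sum per group
FROM sales AS s
JOIN firms AS f ON f.firm_id = s.firm_id  -- merge the two tables
WHERE s.year = 2023                       -- filter rows
GROUP BY f.sector                         -- group
ORDER BY total_revenue DESC               -- sort
"""
rows = cur.execute(query).fetchall()
for sector, n_rows, total_revenue in rows:
    print(sector, n_rows, total_revenue)
conn.close()
```

An in-memory database is used so that the sketch runs without installing or configuring a database server.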
II. NoSQL (20h)
- Relational vs. non-relational databases
- File formats (JSON, XML, dictionaries)
- Hands-on 1: Reddit datasets
- MongoDB Shell and Compass
- PyMongo
- Setting up a MongoDB server for remote connections
- MongoDB Clusters
- Hands-on 2: Reddit datasets with clusters
- Alternatives to MongoDB: Redis/DynamoDB/Cassandra
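To make the link between JSON files and Python dictionaries concrete, the following minimal sketch parses a nested document with the standard json module. The field names (author, subreddit, score, replies, flair) are hypothetical and are not taken from the Reddit datasets used in the hands-on sessions.

```python
import json

# Hypothetical document: field names and values are illustrative only.
raw = """
{"author": "user_a", "subreddit": "economics", "score": 12,
 "replies": [{"author": "user_b", "score": 3},
             {"author": "user_c", "score": 5}]}
"""

post = json.loads(raw)  # a JSON object becomes a Python dict

# Nested lists and objects are reached with ordinary indexing.
reply_scores = [reply["score"] for reply in post["replies"]]
total_score = post["score"] + sum(reply_scores)

# Documents without a fixed schema may lack a field; .get() avoids a KeyError.
flair = post.get("flair")

# Serialize the dict back to a JSON string.
round_trip = json.dumps(post, sort_keys=True)
print(total_score, flair)
```

The nested replies list is the kind of structure that a relational design would typically split into a separate table, whereas a document store can keep it inside a single record.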
III. Distributed computing (15h)
- What is cloud computing?
- Distributed processing of large data sets: Hadoop, Spark
- Hands-on 1: Virtualization
- Hands-on 2: Cloud
- Hands-on 3: Hadoop
- Hands-on 4: Spark
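The map/reduce paradigm mentioned in the syllabus can be sketched on a single machine in plain Python. This is only an illustration of the three logical phases (map, shuffle, reduce) on a word count. It does not use the Hadoop or Spark APIs, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: each string stands for one document or input split.
documents = ["big data big clusters", "data on clusters", "big ideas"]

def map_phase(doc):
    """Emit one (word, 1) pair per word in the document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all the values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

In a distributed framework the map calls run in parallel on different nodes and the shuffle moves data across the network; here everything runs sequentially in one process.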
Bibliography
Beaulieu, A. (2009). Learning SQL: Master SQL Fundamentals. O'Reilly Media.
Perkins, L., Redmond, E., & Wilson, J. (2018). Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. Pragmatic Bookshelf.
Plouin, G. (2019). Cloud et transformation digitale : SI hybride, protection des données, anatomie des grandes plateformes. Dunod.
Marinescu, D. (2017). Cloud Computing: Theory and Practice (2nd ed.). Elsevier.
Karau, H., & Warren, R. (2017). High Performance Spark. O'Reilly Media.
Ryza, S., Laserson, U., Owen, S., & Wills, J. (2017). Advanced Analytics with Spark (2nd ed.). O'Reilly Media.