🤔What is Data Mesh: the essence of the concept
Data Mesh is a decentralized, flexible approach to how distributed teams work with data and share information. It emerged as a response to the concepts that dominate data-driven organizations, the Data Warehouse and the Data Lake, both of which are built around centralization: all data flows into a central repository, from which different teams take what they need. That repository has to be supported by a team of data engineers with a specialized skill set, and as the number of sources and the variety of data grow, it becomes harder and harder to ensure data quality for the business, while the transformation pipelines become ever more complex.
Data Mesh proposes to solve these and other problems based on four main principles:
1. Domain-oriented ownership - domain teams own the data, not a centralized data team. A domain is a part of an organization that performs a specific business function: for example, product domains (mammography, fluorography, chest CT) or a domain for working with scribes.
2. Data as a product - data is treated not as a static dataset but as a living product with its own users, quality metrics, and development backlog, overseen by a dedicated product owner.
3. Self-serve data platform - the main function of the data platform in Data Mesh is to remove unnecessary cognitive load. It lets developers in domain teams (data product developers and data product consumers) who are not data specialists conveniently create data products and build, deploy, test, update, access, and use them for their own purposes.
4. Federated computational governance - instead of centralized data management, a federated body is created from representatives of the domain teams, the data platform, and outside experts (for example, lawyers and doctors); it sets global policies for working with data and discusses the evolution of the data platform.
💥YTsaurus: a system for storing and processing Yandex's Big Data has become open-source
YTsaurus is an open-source distributed platform for storing and processing big data. The system is based on MapReduce, a distributed file system, and a NoSQL key-value database.
YTsaurus is built on top of Cypress, a fault-tolerant tree-based storage that provides features such as:
a tree namespace whose nodes are directories, tables (structured or semi-structured data), and files (unstructured data);
support for both columnar and row-oriented storage of tabular data;
expressive data schematization with support for hierarchical types and awareness of data sortedness;
background replication and repair of erasure-coded data without any manual intervention;
transactions that can span many objects and last indefinitely.
In general, YTsaurus is a fairly powerful computing platform that involves running arbitrary user code. Currently, YTsaurus dynamic tables store petabytes of data, and a large number of interactive services are built on top of them.
The GitHub repository contains the YTsaurus server code, deployment infrastructure based on Kubernetes, the system's web interface, and client SDKs for common programming languages: C++, Java, Go, and Python. All of it is released under the Apache 2.0 license, so anyone can deploy it on their own servers and modify it to suit their needs.
⛑⛑⛑Medical data: what to consider when working with healthcare information
The main problem with health data is its sensitivity. It contains confidential information protected by the Health Insurance Portability and Accountability Act (HIPAA) and may not be used without express consent. In the medical field, such sensitive details are referred to as protected health information (PHI). Here are a few factors to consider when working with medical datasets:
Protected health information (PHI) is contained in various medical documents: emails, clinical notes, test results, or CT scans. While diagnoses or prescriptions are not considered sensitive in and of themselves, they fall under HIPAA when combined with so-called identifiers: names, dates, contacts, social security or account numbers, photographs of individuals, or other elements that can be used to locate, identify, or contact a particular patient.
Anonymization of medical data and removal of personal information. Personal identifiers, and even parts of them (such as initials), must be removed before medical data can be used for research or business purposes. There are two ways to do this: anonymization and de-identification. Anonymization is the permanent elimination of all sensitive data. De-identification (removal of personal information) only encrypts personal identifiers and stores them in separate datasets, so the identifiers can later be re-associated with the health information.
Medical data labeling. Any unstructured data (texts, images, or audio files) used to train machine learning models requires labeling, or annotation - the process of adding descriptive elements (labels or tags) to blocks of data so that the machine can understand what is in the image or text. When working with healthcare data, the labeling should be done by healthcare professionals, whose hourly rate is much higher than that of annotators without domain knowledge. This creates another barrier to producing high-quality medical datasets.
In summary, preparing medical data for machine learning typically requires more money and time than the average for other industries due to strict regulation and the involvement of highly paid subject matter experts. Consequently, we are seeing a situation where public medical datasets are relatively rare and are attracting serious attention from researchers, data scientists, and companies working on AI solutions in the field of medicine.
📝How to improve medical datasets: 4 small tips
Getting the right data in the right amount - before ordering datasets of medical images, project managers need to coordinate with the machine learning, data science, and clinical research teams. This helps avoid acquiring "bad" data or forcing annotation teams to filter through thousands of irrelevant or low-quality images and videos during labeling, which is costly and time consuming.
Empowering annotator teams with AI-based tools - annotating medical images for machine learning models requires precision, efficiency, high levels of quality, and security. With AI-based image annotation tools, medical annotators and specialists can save hours of work and generate more accurately labeled medical images.
Ensuring ease of data transfer - clinical data should be delivered and communicated in a format that is easy to parse, annotate, port, and after annotation, quickly and efficiently transfer to an ML model.
Overcoming the complexities of storage and transmission - medical image data often amounts to hundreds or thousands of terabytes that cannot simply be mailed. Project managers need to ensure end-to-end security and efficiency when purchasing or retrieving, cleaning, storing, and transferring medical data.
💥Top sources of various datasets for data visualization
FiveThirtyEight is a journalism site that makes its datasets from its stories available to the public. These provide researched data suitable for visualization and include sets such as airline safety, election predictions, and U.S. weather history. The sets are easily searchable, and the site continually updates.
Earth Data offers science-related datasets for researchers in open access formats. Information comes from NASA data repositories, and users can explore everything from climate data to specific regions like oceans, to environmental challenges like wildfires. The site also includes tutorials and webinars, as well as articles. The rich data offers environmental visualizations and contains data from scientific partners as well.
The GDELT Project collects events at a global scale. It offers one of the biggest data repositories for human civilization. Researchers can explore people, locations, themes, organizations, and other types of subjects. Data is free, and users can also download RAW data sets for unique use cases. The site also offers a variety of tools as well for users with less experience doing their own visualizations.
Singapore Public Data - another civic source of data, the Singapore government makes these datasets available for research and exploration. Users can search by subject through the navigation bar or enter search terms themselves. Datasets cover subjects like the environment, education, infrastructure, and transport.
📚Top Data Science Books 2022
Ethics and Data Science - in this book, the author introduces us to the principles of working with data and what to do to implement them today.
Data Science for Economics and Finance - this book deals with data science, including machine learning, social network analysis, web analytics, time series analysis, and more in relation to economics and finance.
Leveraging Data Science for Global Health - This book explores the use of information technology and machine learning to fight disease and promote health.
Understanding Statistics and Experimental Design - the book provides the foundations needed to properly understand and interpret statistics. This book covers the key concepts and discusses why experiments are often not reproducible.
Building Data Science Teams - the book covers the skills, perspectives, tools, and processes needed to grow teams.
Mathematics for Machine Learning - this book covers the basics of mathematics (linear algebra, geometry, vectors, etc.), as well as the main problems of machine learning.
👑Working with Graphs in Python with NetworkX
NetworkX is a library designed to create, study, and manipulate graphs and other network structures. The library is free and distributed under the BSD license. NetworkX is used for teaching graph theory, as well as for scientific research and applied problems involving graphs. NetworkX has a number of strong benefits, including:
High performance - NetworkX can comfortably handle large networks with up to 10 million vertices and 100 million edges between them. This is especially useful when analyzing big data - for example, dumps from social networks connecting millions of users.
Ease of use - since the NetworkX library is written in Python, working with it is easy for both professional programmers and amateurs. Its graph visualization modules make the result easy to inspect and adjust in real time. Creating a full-fledged graph takes only four lines of code (one of them is just an import):
import networkx as nx
G = nx.Graph()  # create an empty undirected graph
G.add_edge(1, 2)  # add an edge; nodes are created automatically
G.add_edge(2, 3, weight=0.9)  # add a weighted edge
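To look at the result, a minimal hedged sketch of the usual way to draw such a graph (assuming Matplotlib is installed alongside NetworkX):
import matplotlib.pyplot as plt
nx.draw(G, with_labels=True)  # render the graph with node labels
plt.show()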
Efficiency - because the library is implemented on top of Python's low-level data structures, it makes efficient use of hardware and software resources. This improves the ability to scale to large graphs and reduces dependence on the particulars of the hardware platform and operating system.
Ongoing Support - Detailed documentation has been developed for NetworkX, describing the functionality and limitations of the library. The repositories are constantly updated. They contain ready-made standard solutions for programmers, which greatly facilitate the work.
Open source code - the user gets great opportunities for customizing and expanding the functionality of the library, adapting it to specific tasks. If desired, the user himself can develop additional software for working with this library.
😎What is Pandas and why is it so good?
Pandas is a Python library for processing and analyzing structured data; its name comes from "panel data". Panel data is information obtained as a result of research and structured in the form of tables. The Pandas library was created to work with such data arrays in Python.
This library is built around the DataFrame, a table-like data structure. Any tabular representation of data, such as a spreadsheet or a database table, can be loaded into a DataFrame. A DataFrame object is made up of Series objects - one-dimensional arrays that share a name and a data type; a Series can be thought of as a table column (a minimal example follows the list below). Pandas offers advantages such as:
Easy handling of messy data in an organized form - dataframes have indexes for quick access to any element;
Flexible reshaping: adding, deleting, and appending new or existing data;
Intelligent indexing for manipulating and managing columns and rows;
Fast merging and joining of data sets by their indexes - for example, combining two or more Series objects into one DataFrame;
Support for hierarchical indexing - the ability to group columns under a common category (MultiIndex);
Open access - Pandas is an open-source library, meaning its source code is in the public domain;
Detailed documentation - Pandas has its own official website, which contains detailed documentation with explanations. More details can be found at the following link: https://pandas.pydata.org/docs/
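As promised above, a minimal, hedged sketch of the Series/DataFrame relationship (the column names and values are purely illustrative):
import pandas as pd

name = pd.Series(["Alice", "Bob", "Carol"])          # a Series is a single typed column
weight = pd.Series([55, 70, 62])
df = pd.DataFrame({"name": name, "weight": weight})  # a DataFrame combines Series into a table

print(df.loc[1, "name"])      # index-based access to any element
print(df[df["weight"] > 60])  # boolean filtering on a column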
💡Top 6 data sources for deep diving into Machine Learning
Chronic disease data is a source where you can find data on various chronic diseases in the United States.
IMF Data - The International Monetary Fund, which also publishes data on international finance, debt indicators, foreign exchange reserves, investments, and so on.
Financial Times Market Data - contains information about the financial markets around the world, which includes indicators such as commodities, currencies, stock price indices
ImageNet - image data for new algorithms, organized according to the WordNet hierarchy, in which each node is represented by hundreds and thousands of images
Stanford Dogs Dataset - contains a huge number of images of various breeds of dogs
HotpotQA Dataset - question-and-answer data that allows you to build question-answering systems whose answers are as explainable as possible.
🤔Why is the NumPy library so popular in Python?
NumPy is a Python language library that adds support for large multi-dimensional arrays and matrices, as well as high-level (and very fast) math functions to operate on these arrays. This library has several important features that have made it a popular tool.
Firstly, its source code is available on GitHub, which is why NumPy is called an open-source module for Python: https://github.com/numpy/numpy/tree/main/numpy
Second, NumPy is written in C. This is a compiled language: constructs written according to its standards and rules are converted into machine code, a set of instructions for a particular type of processor. The conversion is done by a compiler, which is why the calculations run fast.
Let’s compare the performance between NumPy arrays and standard Python lists by the code below:
import numpy
import time

# two plain Python ranges and two NumPy arrays of the same size
list1 = range(1000000)
list2 = range(1000000)
array1 = numpy.arange(1000000)
array2 = numpy.arange(1000000)

# element-wise multiplication with Python lists
initialTime = time.time()
resultantList = [(a * b) for a, b in zip(list1, list2)]
print("Time taken by Python lists :", (time.time() - initialTime), "secs")

# the same operation vectorized with NumPy
initialTime = time.time()
resultantArray = array1 * array2
print("Time taken by NumPy :", (time.time() - initialTime), "secs")
As a result of this test we can see that NumPy arrays (about 0.002 sec) are much faster than standard Python lists (about 0.11 sec).
Performance differs across platforms due to software and hardware differences. The default bit generator has been chosen to perform well on 64-bit platforms. Performance on 32-bit operating systems is very different. You can see the details here: https://numpy.org/doc/stable/reference/random/performance.html#performance-on-different-operating-systems
💥Not only Copilot: try Codex by OpenAI
OpenAI released Codex models based on GPT-3 that can interpret and generate code. Their training data contains both natural language and billions of lines of public code from GitHub. These models are most fluent in Python and also speak over a dozen languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, SQL, and even Shell. Codex is useful for the following tasks (a short API sketch follows this list):
• Generating code from comments
• Completing a statement or function given its specification
• Finding a useful library or API call
• Adding comments
• Refactoring code for efficiency
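A minimal, hedged sketch of calling a Codex model through the openai Python package of that era (the Completions endpoint and the code-davinci-002 model name reflect the then-current, now deprecated API; the key and prompt are placeholders):
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

response = openai.Completion.create(
    model="code-davinci-002",  # a Codex model
    prompt="# Python 3\n# Return the n-th Fibonacci number\ndef fib(n):",
    max_tokens=64,
    temperature=0,
)
print(response["choices"][0]["text"])  # the generated continuation of the function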
https://beta.openai.com/docs/guides/code
🤔SQLite as an embedded database in Python
SQLite is an embedded relational file-based database management system (RDBMS) that can be used in Python applications without installing additional software. It is enough to import the sqlite3 built-in Python library to use SQLite.
First, create a database connection by importing sqlite3 and calling the .connect() method with the name of the database to be created, e.g. new_database.db.
import sqlite3
conn = sqlite3.connect('new_database.db')
Before creating a table, you need to create a cursor - an object used to execute SQL queries over the connection. To do this, call the .cursor() method on the connection you created:
c = conn.cursor()
You can then use the .execute() method to create a new table in the database. Inside the quotes you write the usual SQL syntax used to create a table in most DBMSs. For example, creating a table with the CREATE TABLE statement:
c.execute("""CREATE TABLE new_table (
name TEXT,
weight INTEGER)""")
After filling the table with data, you can execute standard SQL queries on it to select and change values.
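For example, a small hedged sketch of filling the table and reading it back (the rows are purely illustrative):
# insert rows using parameterized queries, then persist the changes
c.execute("INSERT INTO new_table VALUES (?, ?)", ("Alice", 55))
c.executemany("INSERT INTO new_table VALUES (?, ?)", [("Bob", 70), ("Carol", 62)])
conn.commit()

# select the rows that match a condition
c.execute("SELECT name, weight FROM new_table WHERE weight > ?", (60,))
print(c.fetchall())
conn.close()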
https://towardsdatascience.com/yes-python-has-a-built-in-database-heres-how-to-use-it-b3c033f172d3
🤔PyPy vs CPython: Under the Hood of Python
Every Python developer knows about CPython, the most common implementation of the virtual machine that interprets written code. As an alternative to CPython there is PyPy, built using the RPython language. Compared to CPython, PyPy is faster and implements Python 2.7.18, 3.9.15, and 3.8.15. PyPy supports most of the commonly used Python library modules. The x86 version of PyPy runs on many platforms, such as Linux (32/64-bit), macOS (64-bit), Windows (32-bit), OpenBSD, and FreeBSD. Non-x86 builds are available on Linux, and ARM64 on macOS.
However, PyPy does not speed up code in short-running processes that take less than a couple of seconds: the JIT compiler does not have enough time to "warm up". PyPy also gives no speed gain if all the execution time is spent inside runtime libraries, i.e. in C functions, rather than in the actual execution of Python code. PyPy therefore works best on long-running programs where most of the time is spent executing Python code.
In terms of memory consumption, PyPy can also outperform CPython: Python programs with high RAM consumption (hundreds of MB or more) can end up taking up less space in PyPy than in CPython.
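As an illustration of the kind of workload described above, a hedged sketch of a CPU-bound, pure-Python loop that PyPy's JIT typically accelerates well (run the same file under python and under pypy to compare; the numbers depend on the machine):
import time

def count_primes(limit):
    # naive trial-division prime counting, pure Python, no C extensions
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

start = time.time()
print(count_primes(200000), "primes found in", round(time.time() - start, 2), "secs")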
https://www.pypy.org/features.html
🌲TOP-5 DS-events in December 2022:
1. Nov 28-Dec 9 • NeurIPS 2022 https://nips.cc/
2. Dec 5-6 • 7th Global Conference on Data Science and Machine Learning, Dubai, UAE https://datascience.pulsusconference.com/
3. Dec 7 • Data Science Salon NYC | AI and ML in Finance & Technology • New York, NY, USA https://www.datascience.salon/newyork/
4. Dec 12-16 • The 20th Australasian Data Mining Conference 2022 (AUSDM’22) • Sydney, Australia + Virtual https://ausdm22.ausdm.org/
5. Dec 17-18 • 3rd International Conference on Data Science and Cloud Computing (DSCC 2022), Dubai, UAE https://cse2022.org/dscc/index
Yandex named the laureates of its annual scientific award
Scientists who are engaged in research in the field of computer science will receive one million rubles for the development of their projects. In 2022, six young scientists became laureates:
•Maxim Velikanov — is engaged in the theory of deep learning, studies infinitely wide neural networks and statistical physics;
•Petr Mokrov — studies Wasserstein gradient flows, nonlinear filtering and Bayesian logistic regression;
•Maxim Kodryan — is engaged in deep learning, as well as optimization and generalization of neural network models;
•Ruslan Rakhimov — works with neural visualization, CV and deep learning;
•Sergey Samsonov — studies Monte Carlo algorithms with Markov chains, stochastic approximation and other topics;
•Taras Hahulin — works in the field of computer vision.
It's also great that scientific supervisors are recognized separately. This year, two people received grants — Dmitry Vetrov, Head of the HSE Center for Deep Learning and Bayesian Methods, and Alexey Naumov, Associate Professor at the HSE Faculty of Computer Science and Head of the International Laboratory of Stochastic Algorithms and Analysis of Multidimensional Data.
More information about the awards and laureates of 2022 can be found on the website
🤔Data Mesh Architecture: Essence and Troubleshooting
Data Mesh is a decentralized, flexible approach to how distributed teams work with data and share information. The main idea is interdisciplinary teams that publish and consume data products, thereby significantly increasing the efficiency of data use.
The essence of the concept of this technology is as follows:
Data domains. In Data Mesh, a data domain is a way to define where enterprise data begins and ends. The boundaries depend on the company and its needs. Sometimes it makes sense to model domains by considering business processes or source systems.
Self-serve platform. The self-service platform in Data Mesh is built by generalist experts who create and maintain versatile products. This approach relies on decentralization and on agreement with business users who understand the subject area and know what value particular data carries. In this case, specialized teams develop stand-alone data products that do not depend on a central platform.
Federated governance - when moving to a self-service distributed data platform, you need to focus on Governance. If you do not pay attention to it, it is possible to find yourself in a situation where disparate technologies are used in all domains, and data is duplicated. Therefore, both at the platform level and at the data level, you need to implement automated policies.
Data products are an important component of the Data Mesh, related to the application of product thinking to data. For a Data Product to work, it must deliver long-term value to users and be usable, valuable, and tangible. It can be implemented as an API, report, table, or dataset in a data lake.
With all this, the Data Mesh architecture has several problems:
Budget constraints - the financial viability of a new platform project is threatened by several factors, in particular the inability to pay for infrastructure, develop expensive applications, create data products, or maintain such systems. For example, if the platform team manages to build a tool that closes a technical gap but the volume of data and the complexity of data products keep growing, the price of the solution may become too high.
Lack of technical skills - delegating full data ownership to domains means they have to take the project seriously. They may hire new employees or get trained themselves, but the requirements may soon become overwhelming for them, and when performance drops drastically, problems start appearing here and there.
Data product monitoring - the team needs the right tools to build data products and monitor what is going on in the company. Some domains may lack a deep understanding of technical metrics and their impact on workloads, so the platform development team needs resources to identify and address issues such as overload or inefficiency.
📖Top Useful Data Visualization Books
Effective Data Storytelling: How to Drive Change - The book was written by American business intelligence consultant Brent Dykes, and is also suitable for readers without a technical background. It's not so much about visualizing data as it is about how to tell a story through data. In the book, Dykes describes his own data storytelling framework - how to use three main elements (the data itself, narrative and visual effects) to isolate patterns, develop concept solutions and justify them to the audience.
Information Dashboard Design is a practical guide that outlines the best practices and most common mistakes in creating dashboards. A separate part of the book is devoted to an introduction to design theory and data visualization.
The Accidental Analyst is an intuitive step-by-step guide for solving complex data visualization problems. The book describes the seven principles of analysis, which determine the procedure for working with data.
Beautiful Visualization: Looking at Data Through the Eyes of Experts - this book talks about the process of data visualization using examples of real projects. It features commentary from 24 industry experts—from designers and scientists to artists and statisticians—who talk about their data visualization methods, approaches, and philosophies.
The Big Book of Dashboards - This book is a guide to creating dashboards. In addition, the book has a whole section devoted to psychological factors. For example, how to respond if a customer asks you to improve your dashboard by adding a couple of useless charts.
🤼♂️Hive vs Impala: very worthy opponents
Hive and Impala are technologies that are used to analyze big data. In this post, we will look at the advantages and disadvantages of both technologies and compare them with each other.
Hive is a data analysis tool that is based on the HiveQL query language. Hive allows users to access data in Hadoop Distributed File System (HDFS) using SQL-like queries. However, due to the fact that Hive uses the MapReduce architecture, it may not be as fast as many other data analysis tools.
Impala is an interactive data analysis tool designed for use in a Hadoop environment. Impala works with SQL queries and can process data in real time. This means that users can quickly receive query results without delay.
What are the advantages and disadvantages of Hive and Impala?
Advantages of Hive:
• Hive is quite easily scalable and can handle huge amounts of data;
• Support for custom functions: Hive lets users create their own functions and aggregates in Java, extending Hive's functionality and enabling customized data processing solutions.
Disadvantages of Hive:
• Restrictions on working with streaming data: Hive is not suitable for working with streaming data because it uses MapReduce, which is a batch data processing framework. Hive processes data only after it has been written to files on HDFS, which limits Hive's ability to work with streaming data.
Advantages of Impala:
• Fast query processing: Impala provides high-performance query processing because it uses an MPP architecture and keeps distributed data in memory. This allows analysts and developers to get query results quickly, without delay.
Disadvantages of Impala:
• Limited scalability: Impala does not handle as large volumes of data as Hive and may experience scalability limitations when dealing with big data. Impala may require more resources to run than Hive.
• High resource requirements: Impala consumes more resources than Hive due to distributed memory usage. This may result in the need for more powerful servers to ensure performance.
The final choice between Hive and Impala depends on the specific situation and user requirements. If you work with large amounts of data and need a simple and accessible SQL-like environment, then Hive might be the best choice. On the other hand, if you need fast data processing and support for complex queries, then Impala may be preferable.
📈📉📊Python data visualization libraries you may not have heard of but might find very useful
Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise general-purpose graphics and delivers high-performance interactivity on large or streaming datasets (a minimal sketch follows this list).
Geoplotlib is a Python language library that allows the user to design maps and plot geographic data. This library is used to draw various types of maps such as heat maps, point density maps, and various cartographic charts.
Folium is a data visualization library in Python that helps the developer visualize geospatial data.
VisPy is a high performance interactive 2D/3D data visualization library. This library uses the processing power of modern graphics processing units (GPUs) through the OpenGL library to display very large datasets.
Pygal is a Python language library that is used for data visualization. This library also develops interactive charts that can be embedded in a web browser.
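A minimal, hedged Bokeh sketch (the data is purely illustrative; show() opens the interactive plot in the default browser):
from bokeh.plotting import figure, show

p = figure(title="Example line chart", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4], [4, 7, 2, 5], line_width=2)  # a simple line glyph
show(p)  # render the interactive plot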
🌎TOP-10 DS-events all over the world in March:
Mar 6-7 • REINFORCE AI CONFERENCE: International AI and ML Hybrid Conference • Budapest, Hungary https://reinforceconf.com/2023
Mar 10-12 • International Conference on Machine Vision and Applications (ICMVA) • Singapore http://icmva.org/
Mar 13-16 • International Conference on Human-Robot Interaction (ACM/IEEE) • Stockholm, Sweden https://humanrobotinteraction.org/2023/
Mar 14 • Quant Strats • New York, USA https://www.alphaevents.com/events-quantstratsus
Mar 20-23 • Gartner Data & Analytics Summit • Orlando, USA https://www.gartner.com/en/conferences/na/data-analytics-us
Mar 20-23 • NVIDIA GTC • Online https://www.nvidia.com/gtc/
Mar 24-26 • 5th ICNLP Conference • Guangzhou, China http://www.icnlp.net/
Mar 27-28 • Data & Analytics in Healthcare • Melbourne, Australia https://datainhealthcare.coriniumintelligence.com/
Mar 27-31 • Annual Conference on Intelligent User Interfaces (IUI) • Sydney, Australia https://iui.acm.org/2023/
Mar 30 • MLCONF • New York, USA https://mlconf.com/event/mlconf-new-york-city/
⛑Top 7 Medical DS Startups in 2022
SWORD Health is a physical therapy and rehabilitation service built around a range of wearable devices that read physiological indicators signaling pain, which makes it possible to analyze large amounts of data, offer more effective treatment, and adjust movements to eliminate pain.
Cala Health offers what is currently the only prescription non-invasive treatment for essential tremor; it is based on tremor-fluctuation data measured by wearable devices, which can also deliver personalized peripheral nerve stimulation based on those measurements.
AppliedVR is a platform for treating chronic pain that builds a library of pain-related data and delivers immersive therapy through VR.
Digital Diagnostics is the first FDA (Food and Drug Administration)-approved standalone AI based on retinal imagery data to diagnose eye diseases caused by diabetes without the participation of a doctor
Iterative Health is a service for automating endoscopy workflows and analyzing their results. The technology interprets endoscopic image data, helping clinicians better evaluate patients with potential gastrointestinal problems.
Viz.ai is a service for intelligent coordination and medical care in radiology. This platform is designed to analyze data from CT scans of the brain in order to find blockages in large vessels in it. The system transmits all the results obtained to a specialist in the field of neurovascular diseases in order to ensure therapeutic intervention at an early stage. The system receives such results in just a few minutes, thus providing a quick response.
Unlearn is a startup that offers a platform to accelerate clinical trials using artificial intelligence, digital twins and various statistical methods. This service is capable of processing historical datasets of clinical trials from patients to create “disease-specific” machine learning models, which in turn could be used to create digital twins with the corresponding virtual medical records.
🔥Processing data with Elastic Stack
Elastic Stack is a vast ecosystem of components used to search and process big data. It is a JSON-based distributed system that combines the features of a NoSQL database. The Elastic Stack is made up of the following components:
• Elasticsearch is a large, fast, and highly scalable non-relational data store that has become a great tool for log search and analytics due to its power, simplicity, schemaless JSON documents, multilingual support, and geolocation. The system can quickly process large volumes of logs, index system logs as they arrive, and query them in real time. Performing operations in Elasticsearch, such as reading or writing data, usually takes less than a second, which makes it suitable for use cases where you need to react almost in real time, such as monitoring applications and detecting any anomalies.
• Logstash is a utility that helps centralize event-related data, such as information from log files, various metrics, or any other data in any format. It can process the data before producing the sample you need. It is a key component of the Elastic Stack, used to collect and process your data. Logstash is a server-side component. Its main purpose is to collect data from a wide range of input sources in a scalable way, process the information, and send it to the destination. By default, the converted information goes to Elasticsearch, but many other outputs can be chosen. Logstash's architecture is plugin-based and easily extensible, with three types of plugins supported: input, filter, and output.
• Kibana is an Elastic Stack visualization tool that helps visualize data in Elasticsearch. Kibana offers a variety of visualization options such as histogram, map, line graphs, time series, and more. Kibana allows you to create visualizations with just a couple of mouse clicks and explore your data in an interactive way. In addition, it is possible to create beautiful dashboards consisting of various visualizations, share them, and also receive high-quality reports.
• Beats is an open source data delivery platform that complements Logstash. Unlike Logstash, which runs on the server side, Beats is on the client side. At the heart of this platform is the libbeat library, which provides an API for passing data from a source, configuring input, and implementing data collection. Beats is installed on devices that are not part of the server components such as Elasticsearch, Logstash or Kibana. They are hosted on non-clustered nodes, which are also sometimes referred to as edge nodes.
You can download the elements of the Elastic Stack from the following link: https://www.elastic.co/elastic-stack/
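To make the Elasticsearch component described above concrete, here is a minimal, hedged sketch using the official elasticsearch Python client (an 8.x-style API and a local node at localhost:9200 are assumed; the index name and documents are illustrative):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# index a log document; Elasticsearch infers the schema from the JSON
es.index(index="app-logs", document={"level": "error", "message": "disk quota exceeded"})

# full-text search over the indexed logs
result = es.search(index="app-logs", query={"match": {"message": "error"}})
print(result["hits"]["total"])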
💥Top 5 Reasons to Use Apache Spark for Big Data Processing
Apache Spark is a popular open source Big Data framework for processing large amounts of data in a distributed environment. It is part of the Apache Hadoop project ecosystem. This framework is good because it has the following elements in its arsenal:
Wide API - Spark provides the developer with a fairly extensive API that supports several programming languages: Python, R, Scala, and Java. Spark also offers a dataframe abstraction, which provides object-oriented methods for transforming, combining, and filtering data, along with many other useful features (a minimal PySpark sketch follows this list).
Pretty broad functionality - Spark has a wide range of functionality due to components such as:
1. Spark SQL - a module that serves for analytical data processing using SQL queries
2. Spark Streaming - a module that provides an add-on for processing streaming data online
3. MLlib - a module that provides a set of machine learning libraries for a distributed environment
Lazy evaluation - reduces the total amount of computation and improves program performance by lowering memory requirements. It lets you define a complex structure of transformations represented as objects and inspect the structure of the result without performing intermediate steps. Spark also automatically checks the execution plan of a query or program for errors, which makes it possible to catch and debug bugs quickly.
Open Source - Part of the Apache Software Foundation's line of projects, Spark continues to be actively developed through the developer community. In addition, despite the fact that Spark is a free tool, it has very detailed documentation: https://spark.apache.org/documentation.html
Distributed data processing - Apache Spark provides distributed data processing built around the concept of the RDD (resilient distributed dataset), a distributed data structure that resides in RAM. Each such dataset contains a fragment of the data spread over the nodes of the cluster, which makes it fault-tolerant: if a partition is lost due to a node failure, it can be restored from its original sources. Spark itself spreads the code across the cluster nodes, breaks it into subtasks, creates an execution plan, and monitors its successful execution.
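As promised above, a minimal, hedged PySpark sketch (assuming a local Spark installation and the pyspark package; the data is illustrative). Transformations such as filter and select are lazy; the show() action triggers the actual computation:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

adults = df.filter(df.age > 30).select("name")  # lazy: nothing runs yet
adults.show()                                   # action: the plan is executed here

spark.stop()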
😎Top 6 libraries for time series analysis
A time series is an ordered sequence of points or features measured at defined time intervals that characterizes some process. Here are some popular libraries for time series processing:
• Statsmodels is an open-source library based on NumPy and SciPy. Statsmodels lets you build and analyze statistical models, including time series models; it also includes statistical tests, the ability to work with big data, and more (a short forecasting sketch follows this list).
• Sktime is an open source machine learning library in Python. It is designed specifically for time series analysis. Sktime includes special machine learning algorithms, is well suited for forecasting, and time series classification tasks.
• tslearn - a universal library designed for time series analysis using the Python language. It is based on the scikit-learn, numpy and scipy libraries. This library offers tools for preprocessing and feature extraction, as well as special models for clustering, classification, and regression.
• Tsfresh - this library is great for getting data into a classic tabular form in order to formulate and solve classification, forecasting, and other problems. With Tsfresh you can quickly extract a large number of time series features and then keep only the necessary ones.
• Merlion is an open-source library designed for working with time series, mainly for forecasting and detecting collective anomalies. It has a generic interface for most models and datasets and lets you quickly develop a model for common time series problems and test it on various datasets.
• PyOD (Python Outlier Detection) is a Python library that can detect point anomalies, or outliers, in data. More than 30 algorithms are implemented in PyOD, ranging from classics such as Isolation Forest to methods recently presented in scientific articles, such as COPOD. PyOD also lets you combine outlier detection models into ensembles to improve the quality of the results. The library is simple and straightforward, and the examples in the documentation show in detail how it can be used.
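As mentioned in the Statsmodels item above, a minimal, hedged forecasting sketch (the series is synthetic and the ARIMA order is arbitrary, chosen only for illustration):
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# a synthetic monthly series: upward trend plus noise
index = pd.date_range("2020-01-01", periods=48, freq="M")
series = pd.Series(np.linspace(10, 30, 48) + np.random.normal(0, 1, 48), index=index)

model = ARIMA(series, order=(1, 1, 1)).fit()  # fit a simple ARIMA(1,1,1) model
print(model.forecast(steps=6))                # forecast the next 6 points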
🌎TOP-25 DS-events all over the world:
• Feb 9-11 • WAICF - World Artificial Intelligence Cannes Festival • Cannes, France https://worldaicannes.com/
• Feb 15-16 • Deep Learning Summit • San Francisco, USA https://ai-west-dl.re-work.co/
• Mar 30 • MLconf • New York City, USA https://mlconf.com/event/mlconf-new-york-city/
• Apr 26-27 • Computer Vision Summit • San Jose, USA https://computervisionsummit.com/location/cvsanjose
• Apr 27-29 • SIAM International Conference on Data Mining (SDM23) • Minneapolis, USA https://www.siam.org/conferences/cm/conference/sdm23
• May 01-05 • ICLR - International Conference on Learning Representations • online https://iclr.cc/
• May 17-19 • World Data Summit • Amsterdam, The Netherlands https://worlddatasummit.com/
• May 25-26 • The Data Science Conference • Chicago, USA https://www.thedatascienceconference.com/
• Jun 14-15 • The AI Summit London • London, UK https://london.theaisummit.com/
• Jun 18-22 • Machine Learning Week • Las Vegas, USA https://www.predictiveanalyticsworld.com/machinelearningweek/
• Jun 19-22 The Event For Machine Learning Technologies & Innovations • Munich, Germany https://mlconference.ai/munich/
• Jul 13-14 • DELTA - International Conference on Deep Learning Theory and Applications • Rome, Italy https://delta.scitevents.org/
• Jul 23-29 • ICML - International Conference on Machine Learning • Honolulu, Hawai’i https://icml.cc/
• Aug 06-10 • KDD - Knowledge Discovery and Data Mining • Long Beach, USA https://kdd.org/kdd2023/
• Sep 18-22 • RecSys – ACM Conference on Recommender Systems • Singapore, Singapore https://recsys.acm.org/recsys23/
• Oct 11-12 • Enterprise AI Summit • Berlin, Germany https://berlin-enterprise-ai.re-work.co/
• Oct 16-20 • AI Everything 2023 Summit • Dubai, UAE https://ai-everything.com/home
• Oct 18-19 • AI in Healthcare Summit • Boston, USA https://boston-ai-healthcare.re-work.co/
• Oct 23-25 • Marketing Analytics & Data Science (MADS) Conference • Denver, USA https://informaconnect.com/marketing-analytics-data-science/
• Oct 24-25 • Data2030 Summit 2023 • Stockholm, Sweden https://data2030summit.com/
• Nov 01-02 • Deep Learning Summit • Montreal, Canada https://montreal-dl.re-work.co/
• Dec 06-07 • The AI Summit New York • New York, USA https://newyork.theaisummit.com/
• Nov • Data Science Conference • Belgrade, Serbia •https://datasciconference.com/
• Dec • NeurIPS • https://nips.cc/
• Dec • Data Science Summit • Warsaw, Poland • https://dssconf.pl/
🤔TOP 4 risks of embeddings in ML
Embedding models are widely used in machine learning to translate raw input data into a low-dimensional vector that captures its semantic meaning and can be fed to various downstream models. Pretrained embeddings are used as feature extractors for various input types (text, image, audio, video, multimodal) or for categorical features with high cardinality.
The main risks of embeddings are:
• High obfuscation - changing the output of the upstream embedding model affects the performance of the downstream model, so downstream models that rely on the embedding output have to be retrained.
• Hidden feedback loops. Pre-trained embedding features are often used as a black box. But knowing what the raw input embedding was trained on is very important for the quality of the model and its interpretability.
• High costs of real-time serving (storage and maintenance) - this directly affects the return on investment in ML. It is important to account for quality during embedding-service outages, the cost of training, and the cost of serving each request.
• High cost of debugging - debugging and monitoring embedded features and root-causing their failures is very expensive, so embedded features that are not very important for the model should be abandoned.
https://medium.com/better-ml/embeddings-the-high-interest-credit-card-of-feature-engineering-414c00cb82e1
🚀Improve the quality and runtime of your Python code with Refurb
Every data scientist knows that Python is an interpreted language. Interpreted code is always slower than code compiled to machine instructions, because each interpreted instruction takes much longer to execute. The way you write your Python code therefore greatly affects its execution speed: good code structure and idiomatic use of the language make Python code faster. The Refurb library helps improve the quality of Python code and can upgrade or modernize it with a single command. The library is inspired by clippy, Rust's built-in linter.
Just install it via pip package manager
pip install refurb
And use hints to improve the quality and speed of your Python code.
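A hedged sketch of typical usage (the file name is illustrative; Refurb prints suggestions for more idiomatic code):
refurb my_script.py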
https://github.com/dosisod/refurb
💥3 Useful Python Libraries for Data Science and More
Let’s introduce 3 libraries that can be useful in Data Science tasks:
• Fabric is a high-level Python library (2.7, 3.4+) for executing shell commands remotely over SSH to get useful Python objects. It is based on the Invoke API (execution of subprocess commands and command line functions) and Paramiko (an implementation of the SSH protocol). https://github.com/fabric/fabric
• TextDistance is a library for comparing the distance between two or more sequences using over 30 algorithms. It is useful in NLP tasks for measuring the distance and similarity between sequences (a tiny sketch follows this list). https://github.com/life4/textdistance
• Watchdog - Python API (3.6+) and shell utilities for monitoring file system events. Useful for monitoring directories specified as command line arguments, allows you to log generated events. https://github.com/gorakhargosh/watchdog
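A tiny, hedged TextDistance sketch (the strings are purely illustrative):
import textdistance

# edit distance between two strings
print(textdistance.levenshtein("data science", "data sciense"))
# normalized similarity in the range [0, 1]
print(textdistance.levenshtein.normalized_similarity("data science", "data sciense"))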
On December 3, Sberbank is holding a One Day Offer for Data scientists, Data analysts and Data engineers. Pass all the selection stages in one day and get an offer from the largest bank in the country!
👨🎓We are looking for specialists in the field of AI, ML, RecSys, CV, NLP.
Our team creates information products for decision-making based on data, analytics, machine learning and artificial intelligence.
👉 You will have to:
- Solve classification/regression/uplift-modeling tasks;
- Support rolling models out to production;
- Analyze and monitor the quality of models;
- Calculate CLTV and unit economics;
- Interact with the validation and finance departments on issues of assessing the quality of models and financial results.
Data on more than 1 billion transactions daily, 75PB of information, 100TB of memory and over 7,200 CPU cores in sandboxes will be available for work.
Become a part of the bank's AI community!
✍️ Send a request for participation
👍🏻Need beautiful graphs in Jupyter Notebook? It’s easy with PivotTable.js!
PivotTable.js is an open-source JavaScript implementation of pivot tables with drag-and-drop functionality. The project is distributed under the MIT license, and its Python bindings are installed through the pip package manager:
pip install pivottablejs
The library allows you to conveniently and quickly visualize the statistics of a dataset by simply dragging and dropping the necessary fields.
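A minimal, hedged sketch of using it in a notebook (the DataFrame is purely illustrative):
import pandas as pd
from pivottablejs import pivot_ui

df = pd.DataFrame({"city": ["NY", "LA", "NY"], "sales": [10, 7, 3]})
pivot_ui(df)  # renders an interactive drag-and-drop pivot table in the notebook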
https://pypi.org/project/pivottablejs/