😳Dataset published for training an improved alternative to LLaMa
A group of researchers from various organizations and universities (Together, ontocord.ai, ds3lab.inf.ethz.ch, crfm.stanford.edu, hazyresearch.stanford.edu, mila.quebec) is working on an open-source alternative to the LLaMa model and has already published a dataset comparable to the one used to train it. The non-open but well-regarded LLaMa has served as the basis for projects such as Alpaca, Vicuna and Koala.
Now RedPajama has been published in the public domain: a text dataset containing more than 1.2 trillion tokens. The next step, according to the developers, will be training the model itself, which will require serious computing power.
🧐What is Data observability: Basic Principles
Data observability is a new layer in the modern data processing stack that gives data teams transparency into the health and quality of their data. The goal of data observability is to reduce the chance of errors in business decisions caused by incorrect information in the data.
Observability is ensured by the following principles:
Freshness indicates how up to date the data is and how recently data structures were refreshed.
Distribution tells you if the data falls within the expected range.
Volume involves understanding the completeness of data structures and the state of data sources.
Schema lets you understand who changes data structures and when.
Lineage maps upstream data sources to downstream data sinks, helping you determine where errors or failures occurred.
More about data observability in the source: https://habr.com/ru/companies/otus/articles/559320/
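To make these principles concrete, here is a minimal sketch of freshness, volume and distribution checks on a pandas DataFrame; the table, the column names and the thresholds are hypothetical and would depend on your own pipelines.

import pandas as pd

def basic_observability_checks(df: pd.DataFrame, ts_col: str, value_col: str) -> dict:
    # Freshness: hours since the most recent record arrived (timestamps assumed to be UTC)
    latest = pd.to_datetime(df[ts_col], utc=True).max()
    freshness_hours = (pd.Timestamp.now(tz="UTC") - latest).total_seconds() / 3600
    return {
        "freshness_ok": freshness_hours < 24,                             # updated within a day
        "volume_ok": len(df) > 1000,                                      # table is not unexpectedly small
        "distribution_ok": df[value_col].between(0, 100).mean() > 0.99,   # values in the expected range
    }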
😳A selection of Python libraries for random generation of test data
Many people love Python for its convenience in data processing. But sometimes you need to write and test an algorithm for a certain topic, and there is little or no public data on that topic. For such cases there are libraries that let you generate fake data of the required types.
Faker is a library for generating various types of random information, with an intuitive API (a short sketch follows this list). There are also implementations for other languages such as PHP, Perl, and Ruby.
Mimesis is a Python library that helps generate data for various purposes. The library is written using the tools included in the Python standard library, so it does not have any third-party dependencies.
Radar is a library for generating random dates and times.
Fake2db is a library that generates data directly in a database (engines for several DBMSs are available).
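For a quick taste, a minimal Faker sketch; the fields requested below are just examples of what the library can produce.

from faker import Faker

fake = Faker()            # default locale is en_US; e.g. Faker("de_DE") also works
print(fake.name())        # a random person name
print(fake.address())     # a random postal address
print(fake.email())       # a random e-mail address

# a small batch of fake user records for testing
users = [{"name": fake.name(), "email": fake.email()} for _ in range(5)]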
😎Searching for data and learning SQL at the same time is easy!!!
Census GPT is a tool that allows users to search for data about cities, neighborhoods, and other geographic areas.
Using artificial intelligence technology, Census GPT has organized and analyzed huge amounts of data to create a super-database. It currently contains information about the United States: users can request data on population, crime rates, education, income, age, and more. In addition, Census GPT can display US maps in a clear and concise way.
On the Census GPT site, users can also improve existing maps. The data results can be retrieved along with the SQL query. Accordingly, you can learn SQL and automatically test yourself on real examples.
🤓What is synthetic data and why is it used?
Synthetic data is artificial data that mimics observations of the real world and is used to prepare machine learning models when obtaining real data is not possible due to complexity or cost. Synthesized data can be used for almost any project that requires computer simulation to predict or analyze real events. There are many reasons why a business might consider using synthetic data. Here are some of them:
1. Efficiency of financial and time costs. If a suitable dataset is not available, generating synthetic data can be much cheaper than collecting real world event data. The same applies to the time factor: synthesis can take a matter of days, while collecting and processing real data sometimes takes weeks, months or even years.
2. Studying rare or dangerous events. In some cases data is rare, or collecting it is dangerous. An example of rare data is a set of unusual fraud cases; an example of dangerous real-world data is traffic accidents, which self-driving cars must learn to respond to. In such cases, real accidents can be replaced by synthetic ones.
3. Eliminate privacy issues. When it is necessary to process or transfer sensitive data to third parties, confidentiality issues should be taken into account. Unlike anonymization, synthetic data generation removes any trace of real data identity, creating new valid datasets without sacrificing privacy.
4. Ease of labeling and control. From a technical point of view, fully synthetic data simplifies labeling. For example, if an image of a park is generated, it is easy to automatically label trees, people, and animals; you don't have to hire people to mark up these objects manually. In addition, fully synthesized data is easy to control and modify (a small tabular generation sketch follows this list).
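For tabular data, here is a minimal sketch of generating a labeled synthetic dataset with scikit-learn; the sample count, feature count and the 2% "rare" class below are illustrative assumptions rather than a recipe.

from sklearn.datasets import make_classification

# 10,000 synthetic samples with 20 features and a rare (~2%) positive class, e.g. fraud
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=10,
    weights=[0.98, 0.02],
    random_state=42,
)
print(X.shape, y.mean())   # y.mean() is roughly the share of the rare class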
🤔Data Mesh Architecture: Essence and Troubleshooting
Data Mesh is a decentralized, flexible approach to how distributed teams work with and share data. The main idea is cross-functional domain teams that publish and consume data products, which significantly increases the efficiency of using data.
The concept rests on the following elements:
Data domains. In Data Mesh, a data domain is a way to define where enterprise data begins and ends. The boundaries depend on the company and its needs. Sometimes it makes sense to model domains by considering business processes or source systems.
Self-service platform. In Data Mesh the platform is built by generalist engineers who create and maintain reusable tooling. The approach relies on decentralization and on alignment with business users who understand the subject area and know what value particular data has. Specialized teams then develop standalone data products that do not depend on a central platform.
Federated governance - when moving to a self-service distributed data platform, you need to focus on Governance. If you do not pay attention to it, it is possible to find yourself in a situation where disparate technologies are used in all domains, and data is duplicated. Therefore, both at the platform level and at the data level, you need to implement automated policies.
Data products are an important component of the Data Mesh, related to the application of product thinking to data. For a Data Product to work, it must deliver long-term value to users and be usable, valuable, and tangible. It can be implemented as an API, report, table, or dataset in a data lake.
With all this, the Data Mesh architecture has several problems:
Budget Constraints - The financial viability of the new platform project is threatened by several factors. In particular, this is the inability to pay for infrastructure, the development of expensive applications, the creation of Data products or the maintenance of such systems. For example, if the platform development team manages to create a tool that closes a technical gap, but the volume of data and the complexity of Data products continue to grow, the price of the solution may be too high.
Lack of technical skills - Delegating full data ownership to domains means they have to take the project seriously. They may hire new employees or train existing staff, but the requirements can quickly become overwhelming; when performance drops sharply, problems start to appear everywhere.
Data Products Monitoring - The team needs the right tools to build Data products and monitor what's going on in the company. Perhaps some domains lack a deep understanding of technical metrics and their impact on workloads. The platform development team needs resources to identify and address issues such as overburden or inefficiencies.
📖Top Useful Data Visualization Books
Effective Data Storytelling: How to Drive Change - The book was written by American business intelligence consultant Brent Dykes, and is also suitable for readers without a technical background. It's not so much about visualizing data as it is about how to tell a story through data. In the book, Dykes describes his own data storytelling framework - how to use three main elements (the data itself, narrative and visual effects) to isolate patterns, develop concept solutions and justify them to the audience.
Information Dashboard Design is a practical guide that outlines the best practices and most common mistakes in creating dashboards. A separate part of the book is devoted to an introduction to design theory and data visualization.
The Accidental Analyst is an intuitive step-by-step guide for solving complex data visualization problems. The book describes the seven principles of analysis, which determine the procedure for working with data.
Beautiful Visualization: Looking at Data Through the Eyes of Experts - this book talks about the process of data visualization using examples of real projects. It features commentary from 24 industry experts - from designers and scientists to artists and statisticians - who talk about their data visualization methods, approaches, and philosophies.
The Big Book of Dashboards - This book is a guide to creating dashboards. In addition, the book has a whole section devoted to psychological factors. For example, how to respond if a customer asks you to improve your dashboard by adding a couple of useless charts.
🤼♂️Hive vs Impala: very worthy opponents
Hive and Impala are technologies that are used to analyze big data. In this post, we will look at the advantages and disadvantages of both technologies and compare them with each other.
Hive is a data analysis tool that is based on the HiveQL query language. Hive allows users to access data in Hadoop Distributed File System (HDFS) using SQL-like queries. However, due to the fact that Hive uses the MapReduce architecture, it may not be as fast as many other data analysis tools.
Impala is an interactive data analysis tool designed for use in a Hadoop environment. Impala works with SQL queries and can process data in real time. This means that users can quickly receive query results without delay.
What are the advantages and disadvantages of Hive and Impala?
Advantages of Hive:
• Scalability: Hive is quite easily scalable and can handle huge amounts of data;
• Support for custom functions: Hive allows users to create their own functions and aggregates in the Java programming language, allowing the user to extend the functionality of Hive and create their own customized data processing solutions.
Disadvantages of Hive:
• Restrictions on working with streaming data: Hive is not suitable for working with streaming data because it uses MapReduce, which is a batch data processing framework. Hive processes data only after it has been written to files on HDFS, which limits Hive's ability to work with streaming data.
Advantages of Impala:
• Fast query processing: Impala provides high performance query processing due to the fact that it uses the MPP architecture and distributed data in memory. This allows analysts and developers to quickly get query results without delay.
Disadvantages of Impala:
• Limited scalability: Impala does not handle as large volumes of data as Hive and may experience scalability limitations when dealing with big data. Impala may require more resources to run than Hive.
• High resource requirements: Impala consumes more resources than Hive due to distributed memory usage. This may result in the need for more powerful servers to ensure performance.
The final choice between Hive and Impala depends on the specific situation and user requirements. If you work with large amounts of data and need a simple and accessible SQL-like environment, then Hive might be the best choice. On the other hand, if you need fast data processing and support for complex queries, then Impala may be preferable.
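From Python both engines are queried through standard DB-API interfaces, so switching between them is mostly a matter of the connection. Here is a hedged sketch, assuming a HiveServer2 on localhost:10000, an Impala daemon on port 21050, and the pyhive and impyla packages installed; the table and column names are made up.

from pyhive import hive            # pip install "pyhive[hive]"
from impala.dbapi import connect   # pip install impyla

# Hive: batch-oriented engine, better for very large offline jobs
hive_conn = hive.Connection(host="localhost", port=10000, database="default")
cur = hive_conn.cursor()
cur.execute("SELECT category, COUNT(*) FROM events GROUP BY category")
print(cur.fetchall())

# Impala: MPP engine, lower latency for interactive queries
impala_conn = connect(host="localhost", port=21050)
cur = impala_conn.cursor()
cur.execute("SELECT category, COUNT(*) FROM events GROUP BY category")
print(cur.fetchall())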
📈📉📊Python data visualization libraries you may not have heard of but might find very useful
Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise general-purpose graphics and delivers high-performance interactivity when working with large or streaming datasets.
Geoplotlib is a Python language library that allows the user to design maps and plot geographic data. This library is used to draw various types of maps such as heat maps, point density maps, and various cartographic charts.
Folium is a Python data visualization library that helps developers visualize geospatial data (a short sketch follows this list).
VisPy is a high performance interactive 2D/3D data visualization library. This library uses the processing power of modern graphics processing units (GPUs) through the OpenGL library to display very large datasets.
Pygal is a Python library used for data visualization. It also produces interactive charts that can be embedded in a web page.
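As an illustration of how little code these libraries require, here is a minimal Folium sketch; the coordinates and the output file name are arbitrary.

import folium

# an interactive map centred on an arbitrary point, saved as a standalone HTML page
m = folium.Map(location=[48.8566, 2.3522], zoom_start=12)
folium.Marker([48.8566, 2.3522], popup="Sample point").add_to(m)
m.save("map.html")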
🌎TOP-10 DS-events all over the world in March:
Mar 6-7 • REINFORCE AI CONFERENCE: International AI and ML Hybrid Conference • Budapest, Hungary https://reinforceconf.com/2023
Mar 10-12 • International Conference on Machine Vision and Applications (ICMVA) • Singapore http://icmva.org/
Mar 13-16 • International Conference on Human-Robot Interaction (ACM/IEEE) • Stockholm, Sweden https://humanrobotinteraction.org/2023/
Mar 14 • Quant Strats • New York, USA https://www.alphaevents.com/events-quantstratsus
Mar 20-23 • Gartner Data & Analytics Summit • Orlando, USA https://www.gartner.com/en/conferences/na/data-analytics-us
Mar 20-23 • NVIDIA GTC • Online https://www.nvidia.com/gtc/
Mar 24-26 • 5th ICNLP Conference • Guangzhou, China http://www.icnlp.net/
Mar 27-28 • Data & Analytics in Healthcare • Melbourne, Australia https://datainhealthcare.coriniumintelligence.com/
Mar 27-31 • Annual Conference on Intelligent User Interfaces (IUI) • Sydney, Australia https://iui.acm.org/2023/
Mar 30 • MLCONF • New York, USA https://mlconf.com/event/mlconf-new-york-city/
⛑Top 7 Medical DS Startups in 2022
SWORD Health is a physical therapy and rehabilitation service built around wearable devices that read physiological signals associated with pain, which makes it possible to analyze large amounts of data, offer more effective treatment, and adjust movements to eliminate pain.
Cala Health offers what is currently the only prescription non-invasive treatment for essential tremor; it is based on tremor measurements from wearable devices, which can also deliver personalized peripheral nerve stimulation based on this data.
AppliedVR is a platform for treating chronic pain that builds a library of pain-related data and delivers immersive therapy through VR.
Digital Diagnostics is the first FDA (Food and Drug Administration)-approved standalone AI based on retinal imagery data to diagnose eye diseases caused by diabetes without the participation of a doctor
Iterative Health is a service for automating endoscopy procedures and the analysis of their results. The technology is based on interpreting endoscopic image data, helping clinicians better evaluate patients with potential gastrointestinal problems.
Viz.ai is a service for intelligent coordination and medical care in radiology. This platform is designed to analyze data from CT scans of the brain in order to find blockages in large vessels in it. The system transmits all the results obtained to a specialist in the field of neurovascular diseases in order to ensure therapeutic intervention at an early stage. The system receives such results in just a few minutes, thus providing a quick response.
Unlearn is a startup that offers a platform to accelerate clinical trials using artificial intelligence, digital twins and various statistical methods. This service is capable of processing historical datasets of clinical trials from patients to create “disease-specific” machine learning models, which in turn could be used to create digital twins with the corresponding virtual medical records.
🔥Processing data with Elastic Stack
Elastic Stack is a vast ecosystem of components used to search and process big data. It is a JSON-based distributed system that combines full-text search with the features of a NoSQL database. The Elastic Stack is made up of the following components:
• Elasticsearch is a large, fast, and highly scalable non-relational data store that has become a great tool for log search and analytics due to its power, simplicity, schemaless JSON documents, multilingual support, and geolocation. The system can quickly process large volumes of logs, index system logs as they arrive, and query them in real time. Performing operations in Elasticsearch, such as reading or writing data, usually takes less than a second, which makes it suitable for use cases where you need to react almost in real time, such as monitoring applications and detecting any anomalies.
• Logstash is a utility for centralizing event-related data, such as information from log files, various metrics, or any other data in any format. It can process the data before producing the sample you need. It is a key component of the Elastic Stack used to collect and process your data. Logstash is a server-side component whose main purpose is to collect data from a wide range of input sources in a scalable way, process it, and send it to the destination. By default, the converted information goes to Elasticsearch, but many other outputs are available. Logstash's architecture is plugin-based and easily extensible, with three types of plugins: input, filter, and output.
• Kibana is an Elastic Stack visualization tool that helps visualize data in Elasticsearch. Kibana offers a variety of visualization options such as histogram, map, line graphs, time series, and more. Kibana allows you to create visualizations with just a couple of mouse clicks and explore your data in an interactive way. In addition, it is possible to create beautiful dashboards consisting of various visualizations, share them, and also receive high-quality reports.
• Beats is an open source data delivery platform that complements Logstash. Unlike Logstash, which runs on the server side, Beats is on the client side. At the heart of this platform is the libbeat library, which provides an API for passing data from a source, configuring input, and implementing data collection. Beats is installed on devices that are not part of the server components such as Elasticsearch, Logstash or Kibana. They are hosted on non-clustered nodes, which are also sometimes referred to as edge nodes.
You can download the elements of the Elastic Stack from the following link: https://www.elastic.co/elastic-stack/
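To get a feel for the API, here is a minimal sketch of indexing and searching a document with the official elasticsearch Python client, assuming client version 8.x, a node on localhost:9200, and a made-up app-logs index.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# index a single log document
es.index(index="app-logs", document={"level": "ERROR", "message": "disk full", "service": "api"})

# full-text search over the message field
resp = es.search(index="app-logs", query={"match": {"message": "disk"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])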
💥Top 5 Reasons to Use Apache Spark for Big Data Processing
Apache Spark is a popular open source Big Data framework for processing large amounts of data in a distributed environment. It is part of the Apache Hadoop project ecosystem. This framework is good because it has the following elements in its arsenal:
Wide API - Spark provides the developer with a fairly extensive API that lets you work with different programming languages, for example Python, R, Scala and Java. Spark also offers a dataframe abstraction with object-oriented methods for transforming, combining and filtering data, and many other useful features.
Pretty broad functionality - Spark has a wide range of functionality due to components such as:
1. Spark SQL - a module that serves for analytical data processing using SQL queries
2. Spark Streaming - a module that provides an add-on for processing streaming data online
3. MLlib - a module that provides a set of machine learning libraries for a distributed environment
Lazy evaluation - reduces the total amount of computation and improves program performance by lowering memory requirements. This kind of evaluation is very useful because it lets you define complex chains of transformations represented as objects and inspect the structure of the result without performing the intermediate steps. Spark also automatically checks the query execution plan or program for errors, which lets you quickly catch and debug bugs.
Open Source - Part of the Apache Software Foundation's line of projects, Spark continues to be actively developed through the developer community. In addition, despite the fact that Spark is a free tool, it has very detailed documentation: https://spark.apache.org/documentation.html
Distributed data processing - Apache Spark is built around the RDD (resilient distributed dataset), a distributed data structure that resides in RAM. Each such dataset contains a fragment of data spread across the nodes of the cluster, which makes it fault-tolerant: if a partition is lost due to a node failure, it can be restored from its original sources. Spark itself distributes the code across all nodes of the cluster, breaks the job into subtasks, builds an execution plan, and monitors successful execution; a minimal PySpark sketch is shown below.
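A minimal PySpark sketch of the dataframe API and lazy evaluation; the input file and column names are invented.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# transformations are lazy: nothing runs until an action such as show() is called
df = spark.read.csv("events.csv", header=True, inferSchema=True)
result = (df.filter(F.col("amount") > 100)
            .groupBy("category")
            .agg(F.avg("amount").alias("avg_amount")))

result.show()   # the action that triggers the actual computation
spark.stop()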
😎Top 6 libraries for time series analysis
A time series is an ordered sequence of observations measured at regular time intervals that characterizes some process. Here are some popular libraries for time series processing:
• Statsmodels is an open source library based on NumPy and SciPy. Statsmodels lets you build and analyze statistical models, including time series models (a short example follows this list). It also includes statistical tests, the ability to work with big data, and more.
• Sktime is an open source machine learning library in Python designed specifically for time series analysis. Sktime includes dedicated machine learning algorithms and is well suited for forecasting and time series classification tasks.
• tslearn - a general-purpose library for time series analysis in Python, built on scikit-learn, numpy and scipy. It offers tools for preprocessing and feature extraction, as well as dedicated models for clustering, classification, and regression.
• Tsfresh - this library is great for bringing time series into a classic tabular form in order to formulate and solve classification, forecasting and other problems. With Tsfresh you can quickly extract a large number of time series features and then select only the necessary ones.
• Merlion is an open source library for working with time series, mainly for forecasting and detecting collective anomalies. It offers a generic interface for most models and datasets and lets you quickly develop a model for common time series problems and test it on various datasets.
• PyOD (Python Outlier Detection) is a Python library that can detect point anomalies, or outliers, in data. More than 30 algorithms are implemented in PyOD, ranging from classical algorithms such as Isolation Forest to methods recently presented in scientific articles, such as COPOD. PyOD also lets you combine outlier detection models into ensembles to improve the quality of the results. The library is simple and straightforward, and the examples in the documentation show in detail how it can be used.
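As a small illustration with Statsmodels, here is a sketch that fits an ARIMA model to a synthetic series; the series itself and the (1, 1, 1) order are arbitrary choices.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# a synthetic monthly series: linear trend plus noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(np.linspace(10, 30, 48) + np.random.normal(0, 1, 48), index=idx)

model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=6))   # forecast six months ahead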
🌎TOP-25 DS-events all over the world:
• Feb 9-11 • WAICF - World Artificial Intelligence Cannes Festival • Cannes, France https://worldaicannes.com/
• Feb 15-16 • Deep Learning Summit • San Francisco, USA https://ai-west-dl.re-work.co/
• Mar 30 • MLconf • New York City, USA https://mlconf.com/event/mlconf-new-york-city/
• Apr 26-27 • Computer Vision Summit • San Jose, USA https://computervisionsummit.com/location/cvsanjose
• Apr 27-29 • SIAM International Conference on Data Mining (SDM23) • Minneapolis, USA https://www.siam.org/conferences/cm/conference/sdm23
• May 01-05 • ICLR - International Conference on Learning Representations • online https://iclr.cc/
• May 17-19 • World Data Summit • Amsterdam, The Netherlands https://worlddatasummit.com/
• May 25-26 • The Data Science Conference • Chicago, USA https://www.thedatascienceconference.com/
• Jun 14-15 • The AI Summit London • London, UK https://london.theaisummit.com/
• Jun 18-22 • Machine Learning Week • Las Vegas, USA https://www.predictiveanalyticsworld.com/machinelearningweek/
• Jun 19-22 • The Event For Machine Learning Technologies & Innovations • Munich, Germany https://mlconference.ai/munich/
• Jul 13-14 • DELTA - International Conference on Deep Learning Theory and Applications • Rome, Italy https://delta.scitevents.org/
• Jul 23-29 • ICML - International Conference on Machine Learning • Honolulu, Hawai’i https://icml.cc/
• Aug 06-10 • KDD - Knowledge Discovery and Data Mining • Long Beach, USA https://kdd.org/kdd2023/
• Sep 18-22 • RecSys – ACM Conference on Recommender Systems • Singapore, Singapore https://recsys.acm.org/recsys23/
• Oct 11-12 • Enterprise AI Summit • Berlin, Germany https://berlin-enterprise-ai.re-work.co/
• Oct 16-20 • AI Everything 2023 Summit • Dubai, UAE https://ai-everything.com/home
• Oct 18-19 • AI in Healthcare Summit • Boston, USA https://boston-ai-healthcare.re-work.co/
• Oct 23-25 • Marketing Analytics & Data Science (MADS) Conference • Denver, USA https://informaconnect.com/marketing-analytics-data-science/
• Oct 24-25 • Data2030 Summit 2023 • Stockholm, Sweden https://data2030summit.com/
• Nov 01-02 • Deep Learning Summit • Montreal, Canada https://montreal-dl.re-work.co/
• Dec 06-07 • The AI Summit New York • New York, USA https://newyork.theaisummit.com/
• Nov • Data Science Conference • Belgrade, Serbia • https://datasciconference.com/
• Dec • NeurIPS • https://nips.cc/
• Dec • Data Science Summit • Warsaw, Poland • https://dssconf.pl/
😩Uncertainty in data: common bugs
There is a lot of talk about “data preparation” and “data cleaning” these days, but what separates high quality data from low quality data?
Most machine learning systems today use supervised learning. This means that the training data consists of (input, output) pairs, and we want the system to be able to take the input and match it with the output. For example, the input might be an audio clip and the output might be a transcription of a speech. To create such datasets, it is necessary to label them correctly. If there is uncertainty in the labeling of the data, then more data may be needed to achieve high accuracy of the machine learning model.
Data collection and annotation may not be correct for the following reasons:
1. Simple annotation errors. The simplest type of error is misannotation. An annotator, tired after a long labeling session, accidentally assigns a sample to the wrong class. Although this is a simple bug, it is quite common and can have a huge negative impact on the performance of the AI system.
2. Inconsistencies in annotation guidelines. There are often subtleties of various kinds in annotating data items. For example, you might imagine reading social media posts and annotating whether they are product reviews. The task seems simple, but if you start to annotate, you can realize that “product” is a rather vague concept. Should digital media, such as podcasts or movies, be considered products? One specialist may say yes, another no, so the accuracy of the AI system can be greatly reduced.
3. Unbalanced data or missing classes. The way data is collected greatly affects the composition of datasets, which in turn can affect the accuracy of models on specific data classes or subsets. In most real world datasets, the number of examples in each category that we want to classify (class balance) can vary greatly. This can lead to reduced accuracy, as well as exacerbating balance problems and skew. For example, Google's AI facial recognition system was notorious for not being able to recognize faces of people of color, which was largely the result of using a dataset with insufficiently varied examples (among many other problems).
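A quick way to spot the imbalance described in point 3 is simply to look at label frequencies before training; here is a minimal sketch with a made-up list of labels.

from collections import Counter

labels = ["review", "not_review", "not_review", "review", "not_review", "not_review"]

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} samples ({n / total:.1%})")
# large gaps between these shares are a warning sign that the model
# may underperform on the rare classes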
😎🥹Libraries for convenient data processing
PyGWalker - Simplifies the data analysis and visualization workflow in Jupyter Notebook by turning a pandas dataframe into a Tableau-style user interface for visual exploration.
SciencePlots - A library for creating various matplotlib plots for presentations, research papers, etc.
CleverCSV is a library that fixes various parsing errors when reading CSV files with Pandas
fastparquet - Speeds up pandas I/O by about 5 times. fastparquet is a high performance Python implementation of the Parquet format designed to work seamlessly with Pandas dataframes. It provides fast read and write performance, efficient compression, and support for a wide range of data types.
Feather is a fast binary columnar format (and library) for reading and writing dataframes. It is great for moving data between languages such as Python and R and can read large amounts of data quickly.
Dask - this library lets you organize parallel computing effectively. Big data collections are stored as parallel arrays/dataframes and can be worked with through NumPy/Pandas-style interfaces (see the sketch after this list).
Ibis - provides access between the local environment in Python and remote data stores (for example, Hadoop)
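A minimal Dask sketch of the pandas-like interface mentioned above; the file pattern and column names are invented.

import dask.dataframe as dd

# lazily read a set of CSV files as one logical dataframe
df = dd.read_csv("events-2023-*.csv")

# operations build a task graph; .compute() executes it in parallel
mean_amount = df.groupby("user_id")["amount"].mean()
print(mean_amount.compute().head())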
📝A selection of sources with medical datasets
The international healthcare system generates a wealth of medical data every day that (at least in theory) can be used for machine learning.
Here are some sources with medical datasets:
1. The Cancer Imaging Archive (TCIA), funded by the US National Cancer Institute (NCI), is a publicly accessible repository of radiological and histopathological images.
2. National Covid-19 Chest Imaging Database (NCCID), part of the NHS AI Lab, contains radiographs, MRIs, and Chest CT scans of hospital patients across the UK. It is one of the largest archives of its kind, with 27 hospitals and foundations contributing.
3. Medicare Provider Catalog collects official data from the Centers for Medicare and Medicaid Services (CMS). It covers many different topics, from the quality of care in hospitals, rehabilitation centers, hospices and other healthcare facilities, to the cost of a visit and information about doctors and clinicians. The data can be viewed in a browser, downloaded as CSV files, or accessed from your own applications via the API.
4. Older Adults Health Data Collection on Data.gov consists of 96 datasets managed by the US federal government. Its main purpose is to collect information about the health of people over 60 in the context of the Covid-19 pandemic and beyond. Organizations involved in maintaining the collection include the US Department of Health and Human Services, the Department of Veterans Affairs, the Centers for Disease Control and Prevention (CDC), and others. Datasets can be downloaded in various formats: HTML, CSV, XSL, JSON, XML and RDF.
5. The Cancer Genome Atlas (TCGA) is a major genomics database covering 33 disease types, including 10 rare ones.
6. Surveillance, Epidemiology, and End Results (SEER) is the most reliable source of cancer statistics in the United States, designed to help reduce the cancer burden in the population. Its database is maintained by the Surveillance Research Program (SRP), which is part of the Division of Cancer Control and Population Sciences (DCCPS) of the National Cancer Institute.
🌎TOP-10 DS-events all over the world in April:
Apr 1 - IT Days - Warsaw, Poland - https://warszawskiedniinformatyki.pl/en/
Apr 3-5 - Data Governance, Quality, and Compliance - Online - https://tdwi.org/events/seminars/april/data-governance-quality-compliance/home.aspx
Apr 4-5 - HEALTHCARE NLP SUMMIT - Online - https://www.nlpsummit.org/
Apr 12-13 - Insurance AI & Innovative Tech USA - Chicago, USA - https://events.reutersevents.com/insurance/insuranceai-usa
Apr 17-18 - ICDSADA 2023: 17th International Conference on Data Science and Data Analytics - Boston, USA - https://waset.org/data-science-and-data-analytics-conference-in-april-2023-in-boston
Apr 25 - Data Science Day 2023 - Vienna, Austria - https://wan-ifra.org/events/data-science-day-2023/
Apr 25-26 - Chief Data & Analytics Officers, Spring - San Francisco, USA - https://cdao-spring.coriniumintelligence.com/
Apr 25-27 - International Conference on Data Science, E-learning and Information Systems 2023 - Dubai, UAE - https://iares.net/Conference/DATA2022
Apr 26-27 - Computer Vision Summit - San Jose, USA - https://computervisionsummit.com/location/cvsanjose
Apr 26-28 - PYDATA SEATTLE 2023 - Seattle, USA - https://pydata.org/seattle2023/
🤔What is Data Mesh: the essence of the concept
Data Mesh is a decentralized, flexible approach to how distributed teams work with and share data. Data Mesh was born as a response to the dominant concepts of working with data in data-driven organizations, Data Warehouse and Data Lake. They are united by the idea of centralization: all data flows into a central repository, from where different teams can take it for their own purposes. However, all this needs to be supported by a team of data engineers with a special set of skills, and as the number of sources and the variety of data grow, it becomes harder and harder to ensure their quality for the business, while the transformation pipelines become more and more complex.
Data Mesh proposes to solve these and other problems based on four main principles:
1. Domain-oriented ownership - domain teams own data, not a centralized Data team. A domain is a part of an organization that performs a specific business function, for example, it can be product domains (mammography, fluorography, CT scan of the chest) or a domain for working with scribes.
2. Data as a product - data is perceived not as a static dataset, but as a dynamic product with its users, quality metrics, development backlog, which is monitored by a dedicated product-owner.
3. Self-serve data platform. The main function of the data platform in Data Mesh is to eliminate unnecessary cognitive load. This allows developers in domain teams (data product developers and data product consumers) who are not data scientists to conveniently create Data products, build, deploy, test, update, access and use them for their own purposes.
4. Federated computational governance - instead of centralized data management, a special federated body is created, consisting of representatives of domain teams, data platforms and experts (for example, lawyers and doctors), which sets global policies in the field of working with data and discusses the development of the data platform.
💥YTsaurus: a system for storing and processing Yandex's Big Data has become open-source
YTsaurus is an open source distributed storage and processing platform for big data. This system is based on MapReduce, distributed file system and NoSQL key-value database.
YTsaurus is built on top of Cypress, a fault-tolerant tree-based storage that provides features such as:
a tree namespace whose nodes are directories, tables (structured or semi-structured data), and files (unstructured data);
support for columnar and row mechanisms for storing tabular data;
expressive data schematization with support for hierarchical types and features of data sorting;
background replication and repair of erasure data that do not require any manual actions;
transactions that can affect many objects and last indefinitely;
In general, YTsaurus is a fairly powerful computing platform that involves running arbitrary user code. Currently, YTsaurus dynamic tables store petabytes of data, and a large number of interactive services are built on top of them.
The GitHub repository contains the YTsaurus server code, deployment infrastructure using k8s, as well as the system web interface and client SDKs for common programming languages - C++, Java, Go and Python. All this is under the Apache 2.0 license, which allows everyone to deploy it on their own servers and modify it to suit their needs.
⛑⛑⛑Medical data: what to observe when working with the health service
The main problem with health data is its vulnerability. It contains confidential information protected by the Health Insurance Portability and Accountability Act (HIPAA) and may not be used without express consent. In the medical field, sensitive details are referred to as protected health information (PHI). Here are a few factors to consider when working with medical datasets:
Protected Health Information (PHI) is contained in various medical documents: emails, clinical notes, test results, or CT scans. While diagnoses or medical prescriptions are not considered sensitive information in and of themselves, they fall under HIPAA when matched against so-called identifiers: names, dates, contacts, social security or account numbers, photographs of individuals, or other elements that can be used to locate, identify, or contact a particular patient.
Anonymization of medical data and removal of personal information from them. Personal identifiers and even parts of them (such as initials) must be disposed of before medical data can be used for research or business purposes. There are two ways to do this - anonymization and deletion of personal information. Anonymization is the permanent elimination of all sensitive data. Removing personal information (de-identification) only encrypts personal information and hides it in separate datasets. Later, identifiers can be re-associated with health information.
Medical data markup. Any unstructured data (texts, images or audio files) for training machine learning models requires markup or annotation. This is the process of adding descriptive elements (labels or tags) to data blocks so that the machine can understand what is in the image or text. When working with healthcare data, healthcare professionals should perform the markup. The hourly cost of their services is much higher than that of annotators who do not have domain knowledge. This creates another barrier to the generation of high-quality medical datasets.
In summary, preparing medical data for machine learning typically requires more money and time than the average for other industries due to strict regulation and the involvement of highly paid subject matter experts. Consequently, we are seeing a situation where public medical datasets are relatively rare and are attracting serious attention from researchers, data scientists, and companies working on AI solutions in the field of medicine.
📝How to improve medical datasets: 4 small tips
Getting the right data in the right amount - before ordering datasets of medical images, project managers need to coordinate with teams of machine learning, data science and clinical researchers. This will help overcome the difficulty of getting “bad” data or having annotation teams filter through thousands of irrelevant or low quality images and videos when annotating training data, which is costly and time consuming.
Empowering annotator teams with AI-based tools - annotating medical images for machine learning models requires precision, efficiency, high levels of quality, and security. With AI-based image annotation tools, medical annotators and specialists can save hours of work and generate more accurately labeled medical images.
Ensuring ease of data transfer - clinical data should be delivered and communicated in a format that is easy to parse, annotate, port, and after annotation, quickly and efficiently transfer to an ML model.
Overcome the complexities of storage and transmission - medical image data often consists of hundreds or thousands of terabytes that cannot simply be mailed. Project managers need to ensure the end-to-end security and efficiency of purchasing or retrieving, cleaning, storing and transferring medical data
💥Top sources of various datasets for data visualization
FiveThirtyEight is a journalism site that makes its datasets from its stories available to the public. These provide researched data suitable for visualization and include sets such as airline safety, election predictions, and U.S. weather history. The sets are easily searchable, and the site continually updates.
Earth Data offers science-related datasets for researchers in open access formats. Information comes from NASA data repositories, and users can explore everything from climate data to specific regions like oceans, to environmental challenges like wildfires. The site also includes tutorials and webinars, as well as articles. The rich data offers environmental visualizations and contains data from scientific partners as well.
The GDELT Project collects events at a global scale. It offers one of the biggest data repositories for human civilization. Researchers can explore people, locations, themes, organizations, and other types of subjects. Data is free, and users can also download RAW data sets for unique use cases. The site also offers a variety of tools as well for users with less experience doing their own visualizations.
Singapore Public Data - another civic source of data, the Singapore government makes these datasets available for research and exploration. Users can search by subject through the navigation bar or enter search terms themselves. Datasets cover subjects like the environment, education, infrastructure, and transport.
📚Top Data Science Books 2022
Ethics and Data Science - in this book, the author introduces us to the principles of working with data and what to do to implement them today.
Data Science for Economics and Finance - this book deals with data science, including machine learning, social network analysis, web analytics, time series analysis, and more in relation to economics and finance.
Leveraging Data Science for Global Health - This book explores the use of information technology and machine learning to fight disease and promote health.
Understanding Statistics and Experimental Design - the book provides the foundations needed to properly understand and interpret statistics. This book covers the key concepts and discusses why experiments are often not reproducible.
Building data science teams - the book covers the skills, perspectives, tools, and processes needed to grow data science teams.
Mathematics for Machine Learning - this book covers the basics of mathematics (linear algebra, geometry, vectors, etc.), as well as the main problems of machine learning.
👑Working with Graphs in Python with NetworkX
NetworkX is a library designed for creating, studying, and manipulating graphs and other network structures. The library is free and distributed under the BSD license. NetworkX is used for teaching graph theory, as well as for scientific research and applied problems where graph theory is applied. NetworkX has a number of strong benefits, including:
High performance - NetworkX is able to freely operate large network systems containing up to 10 million vertices and 100 million edges between them. This is especially useful when analyzing Big Data - for example, downloads from social networks that unite millions of users.
Easy to use - since the NetworkX library is written in Python, working with it is not difficult for both professional programmers and amateurs. And graph visualization modules provide visibility of the result, which can be corrected in real time. In order to create a full-fledged graph, you need only 4 lines of code (one of them is just an import):
import networkx as nx          # import the library
G = nx.Graph()                 # create an empty undirected graph
G.add_edge(1, 2)               # add an edge; nodes are created automatically
G.add_edge(2, 3, weight=0.9)   # edges can carry attributes such as weight
Efficiency - because the library is built on Python's own low-level data structures (dictionaries), it makes efficient use of the computer's hardware and software resources. This improves the ability to scale graphs and reduces dependence on the particular hardware platform and operating system.
Ongoing Support - Detailed documentation has been developed for NetworkX, describing the functionality and limitations of the library. The repositories are constantly updated. They contain ready-made standard solutions for programmers, which greatly facilitate the work.
Open source code - the user gets great opportunities for customizing and expanding the functionality of the library, adapting it to specific tasks. If desired, the user himself can develop additional software for working with this library.
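Beyond building a graph, a few more lines are enough for basic analysis; here is a minimal sketch using NetworkX's built-in Zachary karate club example graph.

import networkx as nx

G = nx.karate_club_graph()                         # a small built-in example graph
print(G.number_of_nodes(), G.number_of_edges())
print(nx.shortest_path(G, source=0, target=33))    # shortest path between two members
centrality = nx.degree_centrality(G)
top3 = sorted(centrality, key=centrality.get, reverse=True)[:3]
print(top3)                                        # the three most connected nodes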
😎What is Pandas and why is it so good?
Pandas is a Python library for processing and analyzing structured data; its name comes from "panel data". Panel data is information obtained from observations and research and structured in the form of tables. The Pandas library was created to work with such data arrays in Python.
This library is based on DataFrame - a table-type data structure. Any tabular representation of data, such as spreadsheets or databases, can be used as a DataFrame. The DataFrame object is made up of Series objects, which are one-dimensional arrays that are combined under the same name and data type. Series can be thought of as a table column. Pandas has in its arsenal such advantages as:
Easily manage messy data in an organized form - dataframes have indexes for easy access to any element
Flexible change of forms: adding, deleting, attaching new or old data.
Intelligent indexing, which involves the manipulation and management of columns and rows.
Quickly merge and join datasets using their indexes, for example combining two or more Series objects into one DataFrame.
Support for hierarchical indexing - the ability to combine columns under a common category (MultiIndex).
Open Access - Pandas is an open source library, meaning its source code is freely available.
Detailed documentation - Pandas has its own official website, which contains detailed documentation with explanations. More details can be found at the following link: https://pandas.pydata.org/docs/
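A minimal sketch of the DataFrame and Series basics described above; the data is made up.

import pandas as pd

# a DataFrame is a table made of Series (columns) sharing one index
sales = pd.DataFrame({"city": ["Paris", "Berlin", "Paris"], "amount": [120, 90, 200]})
regions = pd.DataFrame({"city": ["Paris", "Berlin"], "region": ["FR", "DE"]})

merged = sales.merge(regions, on="city")               # join two tables on a key
by_region = merged.groupby("region")["amount"].sum()   # aggregate by a column
print(by_region)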
💡Top 6 data sources for deep diving into Machine Learning
Chronic disease data is a source where you can find data on various chronic diseases in the United States.
IMF Data - The International Monetary Fund, which also publishes data on international finance, debt indicators, foreign exchange reserves, investments, and so on.
Financial Times Market Data - contains information about the financial markets around the world, which includes indicators such as commodities, currencies, stock price indices
ImageNet is an image dataset for developing new algorithms, organized according to the WordNet hierarchy, in which each node of the hierarchy is represented by hundreds or thousands of images.
Stanford Dogs Dataset - contains a huge number of images of various breeds of dogs
HotpotQA Dataset - question-answer data for building systems that answer questions in the most understandable way.
🤔Why is the NumPy library so popular in Python?
NumPy is a Python language library that adds support for large multi-dimensional arrays and matrices, as well as high-level (and very fast) math functions to operate on these arrays. This library has several important features that have made it a popular tool.
Firstly, you can find its source code on GitHub, which is why NumPy is called an open-source module for Python. https://github.com/numpy/numpy/tree/main/numpy
Second, NumPy's core is written in C. C is a compiled language: code written according to its rules is translated by a compiler into machine code, a set of instructions for a particular type of processor, which is why all the calculations run quickly.
Let’s compare the performance between NumPy arrays and standard Python lists by the code below:
import numpy
import time

# two plain Python sequences and two NumPy arrays of one million elements
list1 = range(1000000)
list2 = range(1000000)
array1 = numpy.arange(1000000)
array2 = numpy.arange(1000000)

# element-wise multiplication with a Python list comprehension
initialTime = time.time()
resultantList = [(a * b) for a, b in zip(list1, list2)]
print("Time taken by Python lists :", (time.time() - initialTime), "secs")

# the same operation as a vectorized NumPy expression
initialTime = time.time()
resultantArray = array1 * array2
print("Time taken by NumPy :", (time.time() - initialTime), "secs")
As a result of this test we can see that the NumPy version (about 0.002 sec) is much faster than the standard Python list version (about 0.11 sec).
Performance differs across platforms due to software and hardware differences. The default bit generator has been chosen to perform well on 64-bit platforms. Performance on 32-bit operating systems is very different. You can see the details here: https://numpy.org/doc/stable/reference/random/performance.html#performance-on-different-operating-systems
💥Not only Copilot: try Codex by OpenAI
OpenAI released Codex, a family of models based on GPT-3 that can interpret and generate code. Their training data contains both natural language and billions of lines of public code from GitHub. These neural network models are most capable in Python and speak over a dozen languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, SQL, and even Shell. Codex is useful for tasks such as:
• Turning comments into code
• Completing your next line or function in context
• Finding a useful library or API call
• Adding comments to code
• Rewriting code for efficiency
https://beta.openai.com/docs/guides/code
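A minimal sketch of calling Codex through the legacy (pre-1.0) openai Python client; the model name, prompt and parameters are illustrative, and the Codex models have since been deprecated in favour of newer ones.

import openai   # legacy pre-1.0 client

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    model="code-davinci-002",   # a Codex model; availability may vary
    prompt="# Python: return the n-th Fibonacci number\ndef fib(n):",
    max_tokens=64,
    temperature=0,
)
print(response["choices"][0]["text"])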