
Telegram channel bdscience - Big Data Science


The Big Data Science channel gathers interesting facts about Data Science. For cooperation: a.chernobrovov@gmail.com 💼 — https://t.me/bds_job — a channel about Data Science jobs and careers 💻 — https://t.me/bdscience_ru — Big Data Science [RU]


Big Data Science

😎💥YouTube-ASL repository has been made available to the public
The repository documents YouTube-ASL, an extensive open-source dataset of videos in American Sign Language with English captions.
The dataset includes 11,093 ASL videos with a total length of 984 hours, accompanied by 610,193 English captions.
The repository contains a link to a text file listing the video IDs.
The repository is available at: https://github.com/google-research/google-research/tree/master/youtube_asl


Big Data Science

🔎📝Datasets for Natural Language Processing
Sentiment analysis - a collection of datasets, each providing the information needed to analyze the sentiment of a text. For example, the IMDb data is a binary sentiment-analysis set: 50,000 reviews from the Internet Movie Database (IMDb), each labeled positive or negative.
WikiQA - a collection of question and sentence pairs, collected and annotated for research on open-domain question answering. WikiQA was built with a more natural process than earlier sets and includes questions that have no correct answer sentences, letting researchers work on answer triggering, a critical component of any QA system.
Amazon Reviews dataset - several million Amazon customer reviews with their star ratings, often used to train fastText for consumer sentiment analysis. Despite the huge, business-scale amount of data, the model trains in minutes, which is what sets Amazon Reviews apart from its peers.
Yelp dataset - a collection of business, review, and user data that can be used for pet projects and academic work, for teaching students to work with databases, for learning NLP, or as a sample of production data. The dataset ships as JSON files and is a "classic" in natural language processing.
Text classification - the task of assigning an appropriate category to a sentence or document; the categories depend on the chosen dataset and vary by topic. For example, TREC is a question-classification dataset of fact-based open-ended questions divided into broad semantic categories. It comes in six-class (TREC-6) and fifty-class (TREC-50) versions, both with 5,452 training and 500 test examples.
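As a quick illustration of the sentiment-analysis task these datasets support, here is a minimal sketch with scikit-learn on a tiny hand-made sample (the toy reviews below are our own illustration, not part of any of the datasets above; the real IMDb set has 50,000 labeled reviews):

```python
# Toy binary sentiment classifier in the spirit of the IMDb set:
# TF-IDF features plus logistic regression on a handful of made-up reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "great movie, loved it", "terrible plot, awful acting",
    "wonderful film", "boring and bad",
    "loved the acting", "awful movie",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(["loved this wonderful film"]))  # predicts the positive class
```

On a real dataset the same pipeline works unchanged; only the amount of data and the quality of the labels differ.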


Big Data Science

⚔️⚡️🤖MDS vs PCA: which is better for dimensionality reduction?
Multidimensional Scaling (MDS) and Principal Component Analysis (PCA) are two popular data analysis techniques widely used in statistics, machine learning, and data visualization. Both methods aim to compress the information contained in multidimensional data and present it in a form convenient for analysis. Despite similar goals, MDS and PCA differ significantly in approach and applicability.
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of data. It looks for linear combinations of the original variables, called principal components, that explain the largest amount of variance in the data.
Benefits of PCA:
1. Eliminates multicollinearity: PCA can remove multicollinearity from a set of original variables by combining related variables into principal components, eliminating redundant information.
2. Computational speed: PCA is usually quite efficient in terms of computational cost.
3. Resilience to noise: the dominant principal components explain most of the variance in the data, while the less significant components tend to capture noise. This lets PCA separate signal from noise, which is especially useful for data with a low signal-to-noise ratio.
Disadvantages of PCA:
1. Linearity assumption: PCA assumes linear relationships between variables; when relationships are non-linear, it may miss important structure in the data.
2. Loss of interpretability: after projecting the data onto the principal components, interpretation can be difficult, since the new variables are linear combinations of the original ones.
3. Sensitivity to outliers: outliers can strongly influence the computed principal components.

Multidimensional scaling (MDS) is a data visualization technique that seeks to preserve the relative distances between objects in the original data when projecting them into a low-dimensional space.
Advantages of MDS:
1. Accounts for non-linear relationships: MDS does not assume a linear relationship between variables and can reveal non-linear relationships in the data.
2. Preserves relative distances: MDS strives to preserve the relative distances between objects from the source data in the embedding, which can reveal structure that would otherwise be lost during dimensionality reduction.
3. Interpretability: low-dimensional MDS projections are relatively easy to interpret precisely because they preserve the relative distances of the original data.
Disadvantages of MDS:
1. Computational complexity: MDS can be computationally expensive on large datasets, especially when accurate pairwise distances between all objects must be preserved.
2. Dependence on the metric: MDS requires choosing a distance metric between objects; the wrong choice can produce skewed results.

Thus, PCA and MDS are both effective data analysis tools. PCA is widely used to reduce dimensionality and reveal structure in linearly dependent variables, while MDS preserves relative distances and can detect non-linear relationships between objects. The choice between them depends on the data and the goals of the analysis: where dimensionality reduction and principal-component extraction are required, PCA may be the preferred choice, while MDS is recommended for visualization and for maintaining relative distances in the data.
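The difference is easy to see in code. A minimal sketch with scikit-learn (assumed available), reducing the same toy data to two dimensions with both methods:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # toy data: 100 objects, 5 features

# PCA: linear projection onto the directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# MDS: seeks a 2-D layout preserving pairwise distances between objects
# (it builds an n-by-n distance matrix, hence is costlier for large n)
X_mds = MDS(n_components=2, random_state=0).fit_transform(X)

print(X_pca.shape, X_mds.shape)    # both embeddings are (100, 2)
```

Note how PCA is a direct linear transform, while MDS runs an iterative optimization over the distance matrix, which is exactly the computational-complexity trade-off described above.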


Big Data Science

🤔🤖Is Trino that good? Advantages and disadvantages
Trino is a distributed SQL query engine for large-scale data analysis. It is designed to run analytical queries on large amounts of data that can be distributed across multiple nodes or clusters.
Trino Benefits:
1. Scalability: Trino can efficiently process huge amounts of data across multiple nodes, scaling horizontally by adding nodes to the cluster and balancing the query load.
2. High performance on Big Data: Trino is optimized for analytical queries over big data, using parallel query processing to speed up complex queries and improve overall performance.
3. Flexibility and compatibility: Trino supports standard SQL and can work with various data sources such as Hadoop HDFS, Amazon S3, Apache Kafka, MySQL, PostgreSQL and many more. It also integrates with various data analysis tools and platforms.
Trino Disadvantages:
1. Difficult to set up: setting up and managing Trino can be challenging, especially for non-specialists; proper tuning and performance optimization require experienced engineers.
2. Limited support for administrative functions: Trino focuses on executing analytical queries and processing data, so support for administrative functions such as monitoring, backup and recovery is limited; additional tools or configuration may be needed for these tasks.
3. No built-in resource management: Trino has no built-in resource manager or scheduler, so third-party tools or manual configuration are needed to allocate resources between queries and control cluster performance.
4. Dependency on third-party tools and platforms: Trino integrates with various data analysis tools and platforms, but its functionality may depend on these components, which can complicate managing and updating the whole ecosystem, especially with new versions or additional integrations.
5. Not suitable for transactional workloads: Trino is not designed for transactional operations such as inserting, updating and deleting data. If you need transaction processing, consider systems that specialize in that area.

All in all, Trino is a powerful tool for large-scale data analysis, especially for executing analytical queries on large volumes of data. It delivers high performance and flexibility and lets you work with various data sources and integrate with other data analysis tools and platforms. When adopting Trino, however, weigh its disadvantages against the specific requirements of your project and infrastructure.
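To make the flexibility point concrete, here is a hedged sketch of a federated Trino query that joins tables from two different catalogs. The catalog, schema and table names below are hypothetical, and actually running the query would require a live Trino cluster and the `trino` Python client:

```python
# A federated query across two Trino catalogs (all names are illustrative).
query = """
SELECT o.order_id, c.name
FROM hive.sales.orders AS o       -- table exposed via the Hive connector
JOIN mysql.crm.customers AS c     -- table exposed via the MySQL connector
  ON o.customer_id = c.id
"""

# Connection sketch (commented out; needs a running cluster):
# import trino
# conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
# cur = conn.cursor()
# cur.execute(query)
# rows = cur.fetchall()
```

The `catalog.schema.table` naming is what lets a single SQL statement span heterogeneous sources.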


Big Data Science

⚔️🤖🧠Spark DataFrame vs Pandas DataFrame: Advantages and Disadvantages
Spark DataFrame and Pandas DataFrame are data structures designed to make it easy to work with tabular data in Python, but they differ in functionality and in how they process data.
Spark DataFrame is the core component of Apache Spark, a distributed computing platform for processing large amounts of data. It is a distributed collection of data organized in named columns.
Pandas DataFrame is a data structure provided by the Pandas library, which offers powerful tools for analyzing and manipulating tabular data. A Pandas DataFrame is a two-dimensional labeled array of rows and columns, similar to a database table or spreadsheet.
Benefits of Spark DataFrame:
1. Distributed data processing: Spark DataFrame is designed for large volumes of data and can work with data that does not fit in the memory of a single node. It distributes data and computation across the cluster, which enables high performance.
2. Programming language support: Spark DataFrame supports multiple programming languages, including Python, Scala, Java and R, so developers can use their preferred language when working with data.
3. Support for different data sources: Spark DataFrame can work with data sources such as Hadoop Distributed File System (HDFS), Apache Hive, Apache HBase, Apache Cassandra and many more, providing convenient APIs for different data formats.
Disadvantages of Spark DataFrame:
1. Cluster setup and management: Spark requires setting up and managing a cluster for distributed data processing, which can be difficult for first-time users or projects with limited resources.
2. Slow startup: starting a Spark cluster can take time, especially when networking and other settings must be configured. For small datasets this overhead can exceed the time Spark spends actually processing the data.

Benefits of Pandas DataFrame:
1. Ease of use: Pandas DataFrame provides a simple and intuitive API for working with data, with many functions for filtering, sorting, grouping and aggregation, making it convenient for data analysis.
2. Large user community: Pandas is a very popular tool in the data analytics and machine learning community, so there are many resources, extensive documentation and communities where you can get help and support.
3. High performance on small datasets: Pandas is optimized for relatively small datasets that fit in the memory of a single node; in such cases Pandas can be faster than Spark.
Disadvantages of Pandas DataFrame:
1. Memory limits: a Pandas DataFrame keeps all data in memory, so work with large datasets is limited by the memory available on your machine; exceeding it causes performance problems or crashes.
2. Limited scalability: Pandas is designed to run on a single node and cannot scale out for distributed processing. With data that does not fit in a single node's memory, Pandas becomes inefficient.
Thus, the choice between Spark DataFrame and Pandas DataFrame depends on your needs. For large volumes of data that require distributed processing, Spark DataFrame may be preferable; for small datasets and a simple, fast way to analyze data, Pandas DataFrame may be the better choice.
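A minimal sketch of the same aggregation in both APIs: the pandas half runs as-is, while the PySpark half is shown in comments because it needs a Spark installation (the table below is our own toy example):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [10, 20, 5]})
totals = df.groupby("city", as_index=False)["sales"].sum()
print(totals)   # LA -> 5, NY -> 30

# Roughly equivalent with a Spark DataFrame (requires pyspark):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# sdf = spark.createDataFrame(df)
# sdf.groupBy("city").sum("sales").show()
```

The APIs look similar, but pandas executes eagerly in local memory while Spark builds a distributed execution plan, which is exactly the trade-off described above.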


Big Data Science

💥TOP-7 DS-events all over the world in June:
Jun 2-4 - Machine Learning Prague - Prague, Czech Republic - https://mlprague.com/
Jun 7-8 - Data Science Salon NYC | AI and ML in Finance & Technology - New York, NY, USA - https://www.datascience.salon/nyc/
Jun 8 - DATA CENTER 2023 - Ljubljana, Slovenia - https://datacenter.palsit.com/en/
Jun 14-15 - AI Summit London 2023 - London, Great Britain - https://london.theaisummit.com/
Jun 18-22 - Machine Learning Week - Las Vegas, USA - https://www.predictiveanalyticsworld.com/machinelearningweek/
Jun 19-22 - The Event for Machine Learning Technologies & Innovations - Munich, Germany - https://mlconference.ai/munich/
Jun 30 - Jul 2 - 4th Int. Conf. on Artificial Intelligence in Education Technology - Berlin, Germany - https://www.aiet.org/index.html


Big Data Science

🧐📝🤖Clickhouse data processing: advantages and disadvantages
ClickHouse is an open-source columnar database for analytics and processing of large amounts of data. It was developed by Yandex and is designed for processing and analyzing Big Data in real time.
ClickHouse Benefits:
1. High performance: ClickHouse is designed to handle very large amounts of data at high speed and can efficiently serve queries that require complex aggregations and analytics over billions of rows.
2. Scalability: ClickHouse provides horizontal scaling, which means that it can be easily scaled by adding new cluster nodes to increase performance and handle large amounts of data.
3. Low latency: ClickHouse provides low latency query execution due to its columnar architecture that allows you to quickly filter and aggregate the data needed to answer queries.
4. Efficient use of resources: ClickHouse is optimized to work on high-load systems and offers various mechanisms for data compression and memory management, which allows efficient use of server resources.
5. SQL query support: ClickHouse supports the standard SQL query language, which makes it easy to integrate with existing tools and applications.

However, despite all the advantages, ClickHouse has a number of disadvantages:
1. Focused on analytics: ClickHouse is a strong choice for analytical tasks, but may be less suitable for operational or transactional workloads that require frequent data changes or many small writes.
2. Complexity of configuration and management: Setting up and managing ClickHouse can be a complex process, especially for beginners. Some aspects, such as data distribution, require careful planning and experience to achieve optimal performance.
3. Lack of full support for transactions: ClickHouse does not provide full support for transactions, which may be a disadvantage for some applications that require data consistency and atomic operations.
4. Difficulty in making data schema changes: Making data schema changes in ClickHouse can be complex and require reloading data or rebuilding tables, which can be time and resource consuming.
Thus, ClickHouse is a powerful and efficient system for analytics on large volumes of data, but requires careful planning and configuration for optimal performance.
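As a concrete illustration of the columnar design, here is a hedged sketch of a typical ClickHouse table definition held in a SQL string (table and column names are illustrative only). MergeTree is ClickHouse's main table engine, and ORDER BY defines the sort key used for fast range filtering:

```python
# An illustrative ClickHouse DDL statement; executing it would require
# a running ClickHouse server (e.g. via clickhouse-client or a driver).
ddl = """
CREATE TABLE events (
    event_date Date,
    user_id    UInt64,
    action     String
) ENGINE = MergeTree()
ORDER BY (event_date, user_id)
"""
```

Because data is stored column by column and sorted by the ORDER BY key, an aggregation that touches only `event_date` and `user_id` never reads the `action` column, which is the source of the low-latency behavior described above.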


Big Data Science

😱YouTube videos have been turned into data storage
The AKA ISG algorithm was created by enthusiasts to turn YouTube videos into free and virtually unlimited data storage. The idea is that files can be embedded in videos and uploaded to YouTube as part of the video. Each file consists of bytes, which can be represented as numbers; each pixel in the video, in turn, can be interpreted as white (1) or black (0).
The result is a video in which every frame carries information.
According to the developers, YouTube imposes no limit on the number of videos that can be uploaded, so this effectively amounts to infinite cloud storage.
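The core idea is simple to sketch in Python (our own illustration, not the actual AKA ISG code): each byte is expanded into 8 bits, and each bit maps to a black (0) or white (255) pixel value that could be drawn into video frames:

```python
data = b"hi"

# Encode: each byte -> 8 bits, most significant bit first
bits = [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]
pixels = [255 if b else 0 for b in bits]   # 255 = white, 0 = black

# Decode: pack each group of 8 bits back into a byte
decoded = bytes(
    sum(bit << (7 - i) for i, bit in enumerate(bits[j:j + 8]))
    for j in range(0, len(bits), 8)
)
print(decoded)   # the round trip recovers the original bytes
```

A real implementation also has to survive YouTube's lossy video compression, e.g. by using large pixel blocks per bit or error-correcting codes.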


Big Data Science

⚔️🤖Pandas vs Datatable: features of comparison when working with big data
Pandas and Datatable are two popular libraries for working with data in the Python programming language, but each has characteristics that make it better suited to particular tasks.
Pandas is one of the most common and popular data-manipulation libraries in Python. It provides a wide and flexible toolkit for working with various data structures, including tables, time series and multidimensional arrays, along with many data-manipulation functions such as filtering, sorting, grouping and aggregation.
Pandas Benefits:
1. Powerful tools for working with various data structures, including tables, time series and multidimensional arrays.
2. Broad community support, which results in frequent bug fixes and library updates.
3. A rich set of data-manipulation functions, such as filtering, sorting, grouping and aggregation.
4. Fairly extensive documentation.
Disadvantages of Pandas:
1. Poor performance when working with very large amounts of data.
2. Awkward handling of tables with a very large number of columns.

Datatable is a library designed to improve the performance and efficiency of working with data in Python. It handles data faster than Pandas, especially on large volumes, and its syntax is very similar to Pandas, making it easy to switch from one library to the other.
Advantages of Datatable:
1. High performance when working with large amounts of data.
2. Syntax very similar to Pandas, which makes it easy to switch between the two libraries.
Disadvantages of Datatable:
1. More limited functionality than Pandas.
2. Limited cross-platform support: some Datatable functions may behave differently on different platforms, which can cause problems during development and testing.
3. Small community: Datatable is not as widely used as Pandas, so there is a relatively small community of users who can help with questions and issues around the library.

Thus, the choice between Pandas and Datatable depends on the specific task and performance requirements. If you work with large amounts of data and need maximum performance, Datatable may be the better choice; if you need rich functionality for exploring and manipulating varied data, Pandas is the better choice.
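A minimal sketch of equivalent filtering and aggregation in the two libraries (the data is a toy example of our own); the pandas half runs as-is, and the datatable half is shown in comments since the package may not be installed:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
result = df.loc[df["x"] > 2, "y"].sum()   # rows where x > 2: 30 + 40 = 70

# Roughly equivalent datatable syntax, using its DT[i, j] selector:
# import datatable as dt
# DT = dt.Frame(x=[1, 2, 3, 4], y=[10, 20, 30, 40])
# result = DT[dt.f.x > 2, dt.sum(dt.f.y)][0, 0]
```

The similarity of the two snippets is what makes migrating between the libraries relatively painless.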


Big Data Science

🤓🧐Data consistency and its types
The concept of data consistency is complex and ambiguous, and its definition may vary with context. In an article translated by the VK Cloud team, the author discusses the concept of "consistency" in the context of distributed databases and offers his own definition of the term, identifying three types of data consistency:
1. Consistency in Brewer's (CAP) theorem - according to this theorem, a distributed system can guarantee only two of the following three properties:
Consistency: the system returns an up-to-date version of the data on every read
Availability: every request to a properly functioning node results in a correct response
Partition tolerance: the system continues to function even when network failures occur between some nodes
2. Consistency in ACID transactions - here consistency means that a transaction cannot lead to an invalid state, because the following properties hold:
Atomicity: any operation is performed completely or not at all
Consistency: after a transaction completes, the database is in a correct state
Isolation: while one transaction is executing, concurrent transactions must not affect it
Durability: even in the event of a failure (of any kind), a completed transaction is preserved
3. Data consistency models - this definition of the term also applies to databases and is related to the concept of consistency models. A database consistency model has two main elements:
Linearizability - governs how operations on a single piece of data replicated across multiple nodes are processed by the database
Serializability - governs how concurrently executing transactions that touch several pieces of data are processed by the database
More details can be found in the source: https://habr.com/ru/companies/vk/articles/723734/


Big Data Science

😎TOP-10 DS-events all over the world in May:
May 1-5 - ICLR - International Conference on Learning Representations - Kigali, Rwanda - https://iclr.cc/
May 8-9 - Gartner Data & Analytics Summit - Mumbai, India - https://www.gartner.com/en/conferences/apac/data-analytics-india
May 9-11 - Open Data Science Conference EAST - Boston, USA - https://odsc.com/boston/
May 10-11 - Big Data & AI World - Frankfurt, Germany - https://www.bigdataworldfrankfurt.de/
May 11-12 - Data Innovation Summit 2023 - Stockholm, Sweden - https://datainnovationsummit.com/
May 17-19 - World Data Summit - Amsterdam, The Netherlands - https://worlddatasummit.com/
May 19-22 - The 15th International Conference on Digital Image Processing - Nanjing, China - http://www.icdip.org/
May 23-25 - Software Quality Days 2023 - Munich, Germany - https://www.software-quality-days.com/en/
May 25-26 - The Data Science Conference - Chicago, USA - https://www.thedatascienceconference.com/
May 26-29 - 2023 The 6th International Conference on Artificial Intelligence and Big Data - Chengdu, China - http://icaibd.org/index.html


Big Data Science

😩Uncertainty in data: common bugs
There is a lot of talk about “data preparation” and “data cleaning” these days, but what separates high quality data from low quality data?
Most machine learning systems today use supervised learning. This means that the training data consists of (input, output) pairs, and we want the system to be able to take the input and match it with the output. For example, the input might be an audio clip and the output might be a transcription of a speech. To create such datasets, it is necessary to label them correctly. If there is uncertainty in the labeling of the data, then more data may be needed to achieve high accuracy of the machine learning model.
Data collection and annotation may not be correct for the following reasons:
1. Simple annotation errors. The simplest type of error is mislabeling: an annotator, fatigued by a large volume of labeling, accidentally assigns a sample to the wrong class. Although simple, this error is quite common and can have a huge negative impact on the performance of an AI system.
2. Inconsistencies in annotation guidelines. Annotating data items often involves subtleties. Imagine reading social media posts and annotating whether they are product reviews. The task seems simple, but once you start annotating you realize that "product" is a rather vague concept: should digital media, such as podcasts or movies, be considered products? One annotator may say yes, another no, and the accuracy of the AI system can suffer greatly as a result.
3. Unbalanced data or missing classes. The way data is collected greatly affects the composition of datasets, which in turn can affect the accuracy of models on specific classes or subsets of the data. In most real-world datasets, the number of examples in each category we want to classify (the class balance) can vary greatly. This can reduce accuracy and exacerbate imbalance and bias problems. For example, Google's facial recognition system was notorious for failing to recognize faces of people of color, which was largely the result of using a dataset with insufficiently varied examples (among many other problems).
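Class imbalance is easy to detect and partially compensate for in code. A minimal sketch with scikit-learn's class_weight option on toy data (the data and the 9:1 ratio are our own illustration):

```python
from collections import Counter
from sklearn.linear_model import LogisticRegression

X = [[i] for i in range(10)]
y = [0] * 9 + [1]                 # 9:1 imbalance between the two classes

print(Counter(y))                 # Counter({0: 9, 1: 1})

# class_weight="balanced" re-weights samples inversely to class frequency,
# so the rare class is not simply ignored by the model.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Re-weighting helps, but it does not replace collecting more varied examples for the under-represented classes.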


Big Data Science

😎🥹Libraries for comfortable data processing
PyGWalker  - Simplifies the data analysis and visualization workflow in Jupyter Notebook by turning a pandas dataframe into a Tableau-style user interface for visual exploration.
SciencePlots - A library for creating various matplotlib plots for presentations, research papers, etc.
CleverCSV is a library that fixes various parsing errors when reading CSV files with Pandas
fastparquet - Speeds up pandas I/O by about 5 times. fastparquet is a high performance Python implementation of the Parquet format designed to work seamlessly with Pandas dataframes. It provides fast read and write performance, efficient compression, and support for a wide range of data types.
Feather - a fast on-disk format for reading and writing data frames, well suited for exchanging data between languages (for example, Python and R) and able to read large amounts of data quickly.
Dask - enables efficient parallel computing: big data collections are stored as parallel arrays/dataframes and can be manipulated through NumPy/Pandas-style interfaces.
Ibis  - provides access between the local environment in Python and remote data stores (for example, Hadoop)


Big Data Science

📝A selection of sources with medical datasets
The international healthcare system generates a wealth of medical data every day that (at least in theory) can be used for machine learning.
Here are some sources with medical datasets:
1. The Cancer Imaging Archive (TCIA) funded by the US National Cancer Institute (NCI) is a publicly accessible repository of radiological and histopathological images
2. National Covid-19 Chest Imaging Database (NCCID), part of the NHS AI Lab, contains radiographs, MRIs, and Chest CT scans of hospital patients across the UK. It is one of the largest archives of its kind, with 27 hospitals and foundations contributing.
3. Medicare Provider Catalog collects official data from the Centers for Medicare and Medicaid Services (CMS). It covers many topics, from the quality of care in hospitals, rehabilitation centers, hospices and other healthcare facilities, to the cost of a visit and information about doctors and clinicians. The data can be viewed in a browser, downloaded as CSV datasets, or accessed from your own applications via the API.
4. Older Adults Health Data Collection on Data.gov consists of 96 datasets managed by the US federal government. Its main purpose is to collect information about the health of people over 60 in the context of the Covid-19 pandemic and beyond. Organizations involved in maintaining the collection include the US Department of Health and Human Services, the Department of Veterans Affairs, the Centers for Disease Control and Prevention (CDC), and others. Datasets can be downloaded in various formats: HTML, CSV, XSL, JSON, XML and RDF.
5. The Cancer Genome Atlas (TCGA) is a major genomics database covering 33 disease types, including 10 rare ones.
6. Surveillance, Epidemiology, and End Results (SEER) is the most authoritative source of cancer statistics in the United States, created to help reduce the cancer burden in the population. Its database is maintained by the Surveillance Research Program (SRP), part of the Division of Cancer Control and Population Sciences (DCCPS) of the National Cancer Institute.


Big Data Science

🌎TOP-10 DS-events all over the world in April:
Apr 1 - IT Days - Warsaw, Poland - https://warszawskiedniinformatyki.pl/en/
Apr 3-5 - Data Governance, Quality, and Compliance - Online - https://tdwi.org/events/seminars/april/data-governance-quality-compliance/home.aspx
Apr 4-5 - HEALTHCARE NLP SUMMIT - Online - https://www.nlpsummit.org/
Apr 12-13 - Insurance AI & Innovative Tech USA - Chicago, USA - https://events.reutersevents.com/insurance/insuranceai-usa
Apr 17-18 - ICDSADA 2023: 17. International Conference on Data Science and Data Analytics - Boston, USA - https://waset.org/data-science-and-data-analytics-conference-in-april-2023-in-boston
Apr 25 - Data Science Day 2023 - Vienna, Austria - https://wan-ifra.org/events/data-science-day-2023/
Apr 25-26 - Chief Data & Analytics Officers, Spring - San Francisco, USA - https://cdao-spring.coriniumintelligence.com/
Apr 25-27 - International Conference on Data Science, E-learning and Information Systems 2023 - Dubai, UAE - https://iares.net/Conference/DATA2022
Apr 26-27 - Computer Vision Summit - San Jose, USA - https://computervisionsummit.com/location/cvsanjose
Apr 26-28 - PYDATA SEATTLE 2023 - Seattle, USA - https://pydata.org/seattle2023/


Big Data Science

💥📖TOP DS-events all over the world in July:
Jul 7-9 - 2023 IEEE the 6th International Conference on Big Data and Artificial Intelligence (BDAI) - Zhejiang, China - http://www.bdai.net/
Jul 11-13 - International Conference on Data Science, Technology and Applications (DATA) - Rome, Italy - https://data.scitevents.org/
Jul 12-16 - ICDM 2023: 23rd Industrial Conf. on Data Mining - New York, NY, USA - https://www.data-mining-forum.de/icdm2023.php
Jul 14-16 - 6th International Conference on Sustainable Sciences and Technology - Istanbul, Turkey - https://icsusat.net/home
Jul 15-19 - MLDM 2023: 19th Int. Conf. on Machine Learning and Data Mining - New York, USA - http://www.mldm.de/mldm2023.php
Jul 21-23 - 2023 7th International Conference on Artificial Intelligence and Virtual Reality (AIVR2023) - Kumamoto, Japan - http://aivr.org/
Jul 23-29 - ICML - International Conference on Machine Learning - Honolulu, Hawaii, USA - https://icml.cc/
Jul 27-29 - 7th International Conference on Deep Learning Technologies - Dalian, China - http://www.icdlt.org/
Jul 31-Aug 1 - Gartner Data & Analytics Summit - Sydney, Australia - https://www.gartner.com/en/conferences/apac/data-analytics-australia


Big Data Science

📊📈📉🤖Pandas AI - AI library for Big Data analysis
Pandas AI is a Python library with a built-in generative AI (large language) model.
How it works: in your code editor, you ask a question about your data in natural language and, without writing code, get a ready-made answer based on that data.
You can install Pandas AI with the following command:
pip install pandasai
After installation, import the pandasai library and an LLM (Large Language Model) class, then wire them together (an API key for the chosen LLM is required):
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

llm = OpenAI()            # requires an OpenAI API key
pandas_ai = PandasAI(llm)
# then: pandas_ai.run(df, prompt="your question") for a dataframe df
However, Pandas AI does not position itself as a replacement for Pandas. As the developers note, this is more of an improvement for the standard Pandas.
The developers also warn that the entire data frame is transmitted along with the question each time, so the solution is far from ideal for processing large data sets.


Big Data Science

😎🧠⚡️One of the best startups for generating synthetic data for various industries
Hazy is a UK-based synthetic data generation startup that aims to train raw banking data for fintech industries.
Tonic.ai - offers an automated and anonymous way to synthesize data for use in testing and developing various software. This platform also implements database de-identification, which means filtering personal data (PII) from real data, as well as protecting customer privacy.
Mostly.AI is a Vienna-based synthetic data platform serving industries such as insurance, banking, and telecommunications. It combines state-of-the-art AI with strong privacy guarantees, extracting patterns and structure from source data to generate new datasets that preserve its statistical properties.
YData is a Portuguese startup that helps data scientists overcome poor data quality or limited access to large user datasets with scalable AI solutions. When running tests such as inference attacks, YData engineers assess the risks of identity leakage or re-identification. They also use the TSTR (Train Synthetic, Test Real) method, which evaluates whether AI-generated data can be used to train predictive models.
Anyverse - creates a synthetic 3D environment. Anyverse renders and sets up various scenarios for sample data using a ray tracing engine. The technology calculates the interaction of light rays with scene objects at the physical level.
Rendered.AI is a startup that generates physics-based synthetic datasets for satellites, autonomous vehicles, robotics and healthcare.
Oneview is an Israeli data science platform that uses satellite imagery and remote sensing technologies for defense intelligence.
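The TSTR (Train Synthetic, Test Real) method mentioned for YData can be sketched without any ML library: train a trivial model on synthetic samples and score it on real ones. Everything below (the 1-D Gaussian data and the nearest-mean "model") is an illustrative assumption, not YData's actual implementation.

```python
import random

random.seed(0)

# "Real" data: two 1-D classes centred at 0.0 and 3.0.
real = [(random.gauss(0.0, 0.5), 0) for _ in range(50)] + \
       [(random.gauss(3.0, 0.5), 1) for _ in range(50)]

# "Synthetic" data: drawn from generators that mimic the real classes.
synthetic = [(random.gauss(0.1, 0.6), 0) for _ in range(50)] + \
            [(random.gauss(2.9, 0.6), 1) for _ in range(50)]

def train_nearest_mean(data):
    """Return per-class means -- the simplest possible classifier."""
    means = {}
    for label in {y for _, y in data}:
        xs = [x for x, y in data if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def accuracy(means, data):
    hits = sum(1 for x, y in data
               if min(means, key=lambda c: abs(x - means[c])) == y)
    return hits / len(data)

# TSTR: train on synthetic data, test on real data.
model = train_nearest_mean(synthetic)
tstr_score = accuracy(model, real)
print(f"TSTR accuracy: {tstr_score:.2f}")
```

A high TSTR score suggests the synthetic generator captured enough of the real data's structure to be useful for model training.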

📖💡🤔 Top 6 Datasets for Computer Vision
CIFAR-10 - contains 60,000 32x32 color images in 10 classes (animals and vehicles), with 6,000 images per class: 50,000 training images and 10,000 test images. The classes are mutually exclusive, with no overlap between them.
Kinetics-700 - a large collection of videos: 650,000 clips covering 700 classes of human actions. The videos include human-object interactions, such as playing musical instruments, and human-human interactions, such as hugging. Each action class contains at least 700 clips; each clip lasts about 10 seconds and is annotated with a single action class.
IMDB-Wiki - one of the largest publicly available datasets of human faces with gender, age, and name labels. In total it contains 523,051 images: 460,723 face images of 20,284 celebrities from IMDb and 62,328 images from Wikipedia.
Cityscapes is a database of diverse stereo video sequences recorded on the streets of fifty cities, filmed over a long period under varied lighting and weather conditions. Cityscapes provides pixel-accurate instance-level semantic segmentation for 30 classes grouped into 8 categories: 5,000 frames with fine pixel-level annotations and 20,000 frames with coarse annotations.
Fire and Smoke Dataset - over seven thousand unique HD images of early-stage fires and smoke, taken with mobile phones in real situations across a wide range of lighting and weather conditions. The dataset can be used for fire and smoke recognition and detection, as well as anomaly detection.
FloodNet Dataset - high-resolution images captured by drones, with detailed semantic annotations of damage caused by hurricanes.

📉📊📈Top 6 tools to analyze data of any nature
DataRobot is a tool for scaling machine learning capabilities. It contains a massive library of open-source and proprietary models, solves complex Data Science problems, and delivers fully explainable AI through human-friendly visual representations. The downside is the price, but a free trial is available.
Alteryx combines analytics, machine learning, data science and process automation into a single end-to-end platform. The platform accepts data from hundreds of other data stores (including Oracle, Amazon, and Salesforce), allowing you to spend more time analyzing and less searching. Alteryx allows you to quickly prototype machine learning models and pipelines using automated model training blocks. It helps you easily visualize data throughout the entire problem solving and modeling journey.
H2O is an open-source, distributed in-memory machine learning tool with linear scalability. It supports almost all popular statistical and machine learning algorithms, including generalized linear models, deep learning, and gradient boosting machines. H2O ingests data directly from Spark, Azure, HDFS, and various other sources into its distributed in-memory key-value store.
SPSS Statistics - designed to solve business and research problems through detailed analysis, hypothesis testing and predictive analytics. SPSS can read and write data from spreadsheets, databases, ASCII text files, and other statistical packages. It can read and write data to external relational database tables via SQL and ODBC.
RapidMiner - supports all stages of the machine learning method, including data preparation, result visualization, model validation, and optimization. In addition to its own collection of datasets, RapidMiner provides several options for creating a database in the cloud to store huge amounts of data. It is possible to store and load data from various platforms such as NoSQL, Hadoop, RDBMS, etc.
Weka is a set of visualization tools and algorithms for data analysis and predictive modeling. All of them are available free of charge under the GNU General Public License. Users can experiment with their datasets by applying different algorithms to see which model gives the best result. They can then use visualization tools to explore the data.

🔥💥Some genuinely useful web data analytics platforms today
Segment is a web platform and API for web analytics that collects user data and sends it to hundreds of tools or data stores. With Segment, you can export data to any internal system or application, replay historical data, and view events in real time, for example when someone makes a purchase on a website or in an app.
Metabase is an open source business intelligence tool. Users ask questions about the data, and Metabase displays the answers in meaningful formats like a bar chart or table. Data questions are saved and grouped into informative dashboards that the entire team uses.
Matomo is a web analytics platform that includes a built-in tag manager that allows you to monitor and control the performance of various marketing campaigns. Features include custom data storage, SAML and LDAP authentication, activity logs, media analytics, and custom reports.
SimilarWeb is a cloud-based website traffic analysis platform. Features include data export, performance metrics, custom dashboards, and conversion analysis. Marketing teams benefit from the ability to view demographic data, analyze customer behavior, and discover new opportunities.
Amplitude is a popular product analytics suite that tracks website visitors through collaborative analysis. The software uses custom behavior reports and notifications to offer a better understanding of how visitors interact, as well as provide actionable insights to speed up product development. Amplitude also allows you to define product strategies, improve customer engagement and optimize conversions.

💥⚡️Data markup services today
1. Hasty.ai - this platform has many "smart" tools, such as GrabCut, Contour, and Dextr, that recognize the edges or contours of objects and can be manually adjusted with a threshold value for the best image segmentation. It also supports markup prediction once enough data has been annotated. Another feature of the platform is the ability to train your own object detector, semantic segmentation, or instance segmentation model. The only drawback is that processing takes time (up to 10-20 seconds) that could otherwise be spent on the markup itself.
2. Superannotate is a Silicon Valley startup that provides vector annotations (rectangles, polygons, lines, ellipses, template keypoints, and cuboids) and pixel-by-pixel brush annotation. The best part of this tool is the superpixel function: it recognizes object edges with extremely high accuracy, which greatly speeds up semantic and instance segmentation compared to other tools. The only problem is that when the boundary between subject and background is fuzzy, you spend more time adjusting the segments than doing the actual markup.
3. Datasaur is a data markup program that focuses on text markup. If you need a data markup tool for natural language processing, then this is a pretty interesting option.
4. Clarifai - provides many useful capabilities for AI training. It can mark up data in images, videos, and text.
5. V7 Darwin - this tool is actively used for annotating images. It allows you to recognize any area or object. It can even be used in videos to automatically annotate keyframes.

😎🔎Several useful geodata Python libraries
gmaps is a library for working with Google maps. Designed for visualization and interaction with Google geodata.
Leafmap is an open-source Python package for creating interactive maps and doing geospatial analysis. It lets you analyze and visualize geodata in a few lines of code in Jupyter environments (Google Colab, Jupyter Notebook, and JupyterLab).
ipyleaflet is an interactive widget library for visualizing maps.
Folium is a Python library for easy geodata visualization. It provides a Python interface to leaflet.js, one of the best JavaScript libraries for creating interactive maps. This library also allows you to work with GeoJSON and TopoJSON files, create background cartograms with different palettes, customize tooltips and interactive inset maps.
geopandas is a library for working with spatial data in Python. The main object of the GeoPandas module is the GeoDataFrame, which behaves like a Pandas DataFrame but adds a geometry column holding each feature's shape.
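All of these libraries ultimately speak the same interchange format; a minimal GeoJSON feature of the kind Folium overlays and GeoPandas reads can be built with the standard library alone (the coordinates and name below are illustrative):

```python
import json

# A minimal GeoJSON FeatureCollection with one point feature --
# the kind of structure Folium overlays on a map and GeoPandas reads.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [37.6176, 55.7558]},
    "properties": {"name": "Moscow"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

geojson = json.dumps(collection)
parsed = json.loads(geojson)
print(parsed["features"][0]["properties"]["name"])  # prints "Moscow"
```

Anything that serializes to this structure can be handed to Folium's GeoJson layer or loaded into a GeoDataFrame, which is why the format shows up across all the libraries above.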

😳😱Sber has published a dataset for emotion recognition in Russian
Dusha is a huge open dataset for recognizing emotions in spoken Russian.
The dataset consists of over 300,000 audio recordings with transcripts and emotion labels, about 350 hours of audio in total. The team chose four basic emotions that typically appear in dialogue with a voice assistant: joy, sadness, anger, and neutral. Since each recording was labeled by several annotators, who also periodically completed control tasks, the markup amounts to about 1.5 million labeled records.
Read more about the Dusha dataset in the publication: https://habr.com/ru/companies/sberdevices/articles/715468/

DS books for beginners
1. Data Science. John Kelleher, Brendan Tierney - the book covers the main aspects of the field, from organizing data collection and analysis to the ethical issues raised by privacy concerns. The authors walk the reader through neural networks and machine learning and illustrate business problems and their solutions with case studies, also touching on the technical prerequisites involved.
2. Practical Statistics for Data Scientists. Peter Bruce, Andrew Bruce - a hands-on textbook for data scientists with programming skills and some familiarity with mathematical statistics. It presents the key statistical concepts of data science in an accessible way and explains which aspects of data analysis matter most in practice.
3. Learning Spark. Holden Karau, Matei Zaharia, Patrick Wendell, Andy Konwinski - the authors are developers of the Spark system. They show how to run analysis tasks with a few lines of code and explain Spark's model through worked examples.
4. Data Science from Scratch. Joel Grus - Joel Grus covers the Python language, elements of linear algebra, mathematical statistics, and methods for collecting, normalizing, and processing data. He also lays a foundation for machine learning, describing mathematical models and ways to build them, including k-nearest neighbors.
5. Fundamentals of Data Science and Big Data. Davy Cielen, Arno Meysman, Mohamed Ali - readers are introduced to the theoretical framework, the machine learning workflow, working with large datasets, NoSQL, in-depth text analysis, and distributed computation. Examples are given as Data Science scripts in Python.

😳Dataset published for training an improved alternative to LLaMa
A group of researchers from various organizations and universities (Together, ontocord.ai, ds3lab.inf.ethz.ch, crfm.stanford.edu, hazyresearch.stanford.edu, mila.quebec) is working on an open-source alternative to the LLaMa model and has already published a dataset analogous to the one used to train it. The non-free but well-balanced LLaMa has served as the basis for projects such as Alpaca, Vicuna, and Koala.
RedPajama, a text dataset of more than 1.2 trillion tokens, is now publicly available. The next step, according to the developers, will be training the model itself, which will require serious computing power.

🧐What is Data observability: Basic Principles
Data observability is a new layer in the modern data stack that gives data teams visibility into data health and quality. Its goal is to reduce the chance of errors in business decisions caused by incorrect data.
Observability is ensured by the following principles:
Freshness shows how up-to-date the data is.
Distribution tells you whether values fall within the expected range.
Volume covers the completeness of data structures and the state of data sources.
Schema tracks who changes data structures and when.
Lineage maps upstream data sources to downstream consumers, helping you locate where an error or failure occurred.
More about data observability in the source: https://habr.com/ru/companies/otus/articles/559320/
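A minimal sketch of the first three principles as checks over one batch of records (the field names, thresholds, and assumed record schema are illustrative, not part of any specific observability product):

```python
from datetime import datetime, timedelta, timezone

def check_batch(rows, expected_min_rows, value_range, max_age):
    """Return pass/fail flags for freshness, distribution, and volume.

    rows: list of dicts with "value" and "updated_at" keys (assumed schema).
    """
    now = datetime.now(timezone.utc)
    lo, hi = value_range
    return {
        # Freshness: is the newest record recent enough?
        "freshness": any(now - r["updated_at"] <= max_age for r in rows),
        # Distribution: do all values fall within the expected range?
        "distribution": all(lo <= r["value"] <= hi for r in rows),
        # Volume: did the batch arrive complete?
        "volume": len(rows) >= expected_min_rows,
    }

now = datetime.now(timezone.utc)
batch = [
    {"value": 10.5, "updated_at": now - timedelta(minutes=5)},
    {"value": 12.1, "updated_at": now - timedelta(minutes=7)},
]
report = check_batch(batch, expected_min_rows=2,
                     value_range=(0, 100), max_age=timedelta(hours=1))
print(report)
```

Real observability platforms run checks like these continuously against every table and alert on failures; schema and lineage tracking require metadata beyond the batch itself.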

😳A selection of Python libraries for random generation of test data
Many people love Python for its convenience in data processing. But sometimes you need to write and test an algorithm on a topic for which little or no data is publicly available. For such cases there are libraries that generate fake data of the required types.
Faker is a library for generating many types of random data, with an intuitive API. Implementations also exist for other languages, such as PHP, Perl, and Ruby.
Mimesis is a Python library that helps generate data for various purposes. The library is written using the tools included in the Python standard library, so it does not have any third-party dependencies.
Radar is a library for generating random dates and times.
Fake2db - a library that allows you to generate data directly in the database (there are also engines for different DBMS).
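The idea behind these libraries can be reduced to a standard-library sketch: deterministic fake records assembled from word lists (the names and domains below are invented; a real Faker setup offers far richer providers):

```python
import random

FIRST = ["Anna", "Boris", "Clara", "Dmitry"]
LAST = ["Ivanov", "Petrova", "Smirnov", "Orlova"]
DOMAINS = ["example.com", "test.org"]

def fake_person(rng):
    """Generate one fake user record, Faker-style."""
    first, last = rng.choice(FIRST), rng.choice(LAST)
    return {
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}@{rng.choice(DOMAINS)}",
        "age": rng.randint(18, 80),
    }

# Seeding the generator makes the fake dataset reproducible between runs.
rng = random.Random(42)
people = [fake_person(rng) for _ in range(3)]
print(people[0]["email"])
```

Seeding matters in practice: it lets a test suite regenerate exactly the same fake dataset on every run.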

😎Searching for data and learning SQL at the same time is easy!!!
Census GPT is a tool that allows users to search for data about cities, neighborhoods, and other geographic areas.
Using artificial intelligence, Census GPT has organized and analyzed huge amounts of data into a single large database. It currently covers the United States: users can request data on population, crime rates, education, income, age, and more, and Census GPT can render US maps clearly and concisely.
On the Census GPT site, users can also improve existing maps. Query results are returned together with the SQL that produced them, so you can learn SQL and check yourself on real examples.
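The kind of question-to-SQL result Census GPT returns can be reproduced locally with the standard library's sqlite3 (the table schema and population figures below are invented for illustration):

```python
import sqlite3

# An in-memory toy version of a "cities" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, state TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("Springfield", "IL", 114000),
     ("Madison", "WI", 269000),
     ("Austin", "TX", 961000)],
)

# The sort of SQL a "most populous city" question would map to.
row = conn.execute(
    "SELECT name FROM cities ORDER BY population DESC LIMIT 1"
).fetchone()
print(row[0])  # prints "Austin"
```

Running the generated SQL yourself against a small local table is exactly the self-checking loop the post describes.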

🤓What is synthetic data and why is it used?
Synthetic data is artificial data that mimics real-world observations and is used to train machine learning models when obtaining real data is impossible or too costly. Synthesized data can be used for almost any project that requires computer simulation to predict or analyze real events. There are many reasons why a business might consider using synthetic data. Here are some of them:
1. Cost and time efficiency. If a suitable dataset is not available, generating synthetic data can be much cheaper than collecting real-world event data. The same applies to time: synthesis can take days, while collecting and processing real data sometimes takes weeks, months, or even years.
2. Studying rare cases. Sometimes data is rare, or collecting it is dangerous. An example of rare data is a set of unusual fraud cases; an example of dangerous real-world data is traffic accidents, which self-driving cars must learn to respond to. In the latter case, real accidents can be replaced by synthetic ones.
3. Eliminating privacy issues. When sensitive data must be processed or shared with third parties, confidentiality becomes a concern. Unlike anonymization, synthetic data generation removes any trace of real identities, creating new valid datasets without sacrificing privacy.
4. Ease of labeling and control. From a technical point of view, fully synthetic data simplifies labeling. For example, if an image of a park is generated, trees, people, and animals can be labeled automatically; there is no need to hire people to annotate these objects manually. In addition, fully synthesized data is easy to control and modify.
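Point 4 can be illustrated in miniature: when samples are generated, their labels are known at creation time, so no manual annotation step is needed (the classes and distributions below are illustrative):

```python
import random

def make_labeled_samples(n, rng):
    """Generate synthetic 2-D points whose labels are known at creation
    time -- the generator itself is the annotator."""
    samples = []
    for _ in range(n):
        label = rng.choice(["tree", "person"])
        # Each class is drawn around its own centre.
        cx, cy = (0.0, 0.0) if label == "tree" else (5.0, 5.0)
        samples.append({"x": rng.gauss(cx, 1.0),
                        "y": rng.gauss(cy, 1.0),
                        "label": label})
    return samples

rng = random.Random(7)
data = make_labeled_samples(100, rng)
print(len(data), {s["label"] for s in data})
```

The same principle is what makes synthetic image pipelines attractive: the renderer that places a tree in the scene can emit the tree's bounding box for free.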
